11 "cell_type": "heading",
19 "cell_type": "markdown",
22 "One of the most powerful features of a database is\n",
23 "the ability to [filter](../../gloss.html#filter) data,\n",
25 "to select only those records that match certain criteria.\n",
27 "suppose we want to see when a particular site was visited.\n",
28 "We can select these records from the `Visited` table\n",
29 "by using a `where` clause in our query:"
36 "%load_ext sqlitemagic"
47 "%%sqlite survey.db\n",
48 "select * from Visited where site='DR-1';"
56 "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
57 "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
58 "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
62 "output_type": "display_data",
64 "<IPython.core.display.HTML at 0x1023c9050>"
71 "cell_type": "markdown",
74 "The database manager executes this query in two stages.\n",
76 "it checks at each row in the `Visited` table\n",
77 "to see which ones satisfy the `where`.\n",
78 "It then uses the column names following the `select` keyword\n",
79 "to determine what columns to display."
83 "cell_type": "markdown",
86 "This processing order means that\n",
87 "we can filter records using `where`\n",
88 "based on values in columns that aren't then displayed:"
95 "%%sqlite survey.db\n",
96 "select ident from Visited where site='DR-1';"
104 "<tr><td>619</td></tr>\n",
105 "<tr><td>622</td></tr>\n",
106 "<tr><td>844</td></tr>\n",
110 "output_type": "display_data",
112 "<IPython.core.display.HTML at 0x1023c3890>"
119 "cell_type": "markdown",
122 "<img src=\"files/img/sql-filter.svg\" alt=\"SQL Filtering in Action\" />"
126 "cell_type": "markdown",
129 "We can use many other Boolean operators to filter our data.\n",
131 "we can ask for all information from the DR-1 site collected since 1930:"
138 "%%sqlite survey.db\n",
139 "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
141 "language": "python",
147 "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
151 "output_type": "display_data",
153 "<IPython.core.display.HTML at 0x1023c4790>"
160 "cell_type": "markdown",
163 "(The parentheses around the individual tests aren't strictly required,\n",
164 "but they help make the query easier to read.)\n",
166 "> Most database managers have a special data type for dates.\n",
167 "> In fact, many have two:\n",
168 "> one for dates,\n",
169 "> such as \"May 31, 1971\",\n",
170 "> and one for durations,\n",
171 "> such as \"31 days\".\n",
172 "> SQLite doesn't:\n",
174 "> it stores dates as either text\n",
175 "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
177 "> (the number of days since November 24, 4714 BCE),\n",
179 "> (the number of seconds since midnight, January 1, 1970).\n",
180 "> If this sounds complicated,\n",
182 "> but not nearly as complicated as figuring out\n",
183 "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
187 "cell_type": "markdown",
190 "If we want to find out what measurements were taken by either Lake or Roerich,\n",
191 "we can combine the tests on their names using `or`:"
198 "%%sqlite survey.db\n",
199 "select * from Survey where person='lake' or person='roe';"
201 "language": "python",
207 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
208 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
209 "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
210 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
211 "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
212 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
213 "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
214 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
215 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
216 "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
220 "output_type": "display_data",
222 "<IPython.core.display.HTML at 0x1023c4850>"
229 "cell_type": "markdown",
233 "we can use `in` to see if a value is in a specific set:"
240 "%%sqlite survey.db\n",
241 "select * from Survey where person in ('lake', 'roe');"
243 "language": "python",
249 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
250 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
251 "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
252 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
253 "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
254 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
255 "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
256 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
257 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
258 "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
262 "output_type": "display_data",
264 "<IPython.core.display.HTML at 0x1023c4790>"
271 "cell_type": "markdown",
274 "We can combine `and` with `or`,\n",
275 "but we need to be careful about which operator is executed first.\n",
276 "If we *don't* use parentheses,\n",
284 "%%sqlite survey.db\n",
285 "select * from Survey where quant='sal' and person='lake' or person='roe';"
287 "language": "python",
293 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
294 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
295 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
296 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
297 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
298 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
299 "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
303 "output_type": "display_data",
305 "<IPython.core.display.HTML at 0x1023c4850>"
312 "cell_type": "markdown",
315 "which is salinity measurements by Lake,\n",
316 "and *any* measurement by Roerich.\n",
317 "We probably want this instead:"
324 "%%sqlite survey.db\n",
325 "select * from Survey where quant='sal' and (person='lake' or person='roe');"
327 "language": "python",
333 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
334 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
335 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
336 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
337 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
338 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
342 "output_type": "display_data",
344 "<IPython.core.display.HTML at 0x1023c3850>"
351 "cell_type": "markdown",
355 "we can use `distinct` with `where`\n",
356 "to give a second level of filtering:"
363 "%%sqlite survey.db\n",
364 "select distinct person, quant from Survey where person='lake' or person='roe';"
366 "language": "python",
372 "<tr><td>lake</td><td>sal</td></tr>\n",
373 "<tr><td>lake</td><td>rad</td></tr>\n",
374 "<tr><td>lake</td><td>temp</td></tr>\n",
375 "<tr><td>roe</td><td>sal</td></tr>\n",
376 "<tr><td>roe</td><td>rad</td></tr>\n",
380 "output_type": "display_data",
382 "<IPython.core.display.HTML at 0x1023c3890>"
389 "cell_type": "markdown",
393 "`distinct` is applied to the values displayed in the chosen columns,\n",
394 "not to the entire rows as they are being processed.\n",
396 "> What we have just done is how most people \"grow\" their SQL queries.\n",
397 "> We started with something simple that did part of what we wanted,\n",
398 "> then added more clauses one by one,\n",
399 "> testing their effects as we went.\n",
400 "> This is a good strategy—in fact,\n",
401 "> for complex queries it's often the *only* strategy—but\n",
402 "> it depends on quick turnaround,\n",
403 "> and on us recognizing the right answer when we get it.\n",
405 "> The best way to achieve quick turnaround is often\n",
406 "> to put a subset of data in a temporary database\n",
407 "> and run our queries against that,\n",
408 "> or to fill a small database with synthesized records.\n",
410 "> instead of trying our queries against an actual database of 20 million Australians,\n",
411 "> we could run it against a sample of ten thousand,\n",
412 "> or write a small program to generate ten thousand random (but plausible) records\n",
417 "cell_type": "markdown",
422 "1. Suppose we want to select all sites that lie within 30° of the equator.\n",
423 " Our first query is:\n",
426 " select * from Site where (lat > -30) or (lat < 30);\n",
429 " Explain why this is wrong,\n",
430 " and rewrite the query so that it is correct.\n",
432 "2. Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
433 " Write a query that selects all records from `Survey`\n",
434 " with salinity values outside this range.\n",
436 "3. The SQL test `*column-name* like *pattern*`\n",
437 " is true if the value in the named column\n",
438 " matches the pattern given;\n",
439 " the character '%' can be used any number of times in the pattern\n",
440 " to mean \"match zero or more characters\".\n",
443 " <tr> <th>Expression</th> <th>Value</th> </tr>\n",
444 " <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
445 " <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
446 " <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
447 " <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
448 " <tr> <td><code>'alpha' like 'a%p%'</code> <td>True</td> </tr>\n",
450 " The expression `*column-name* not like *pattern*`\n",
451 " inverts the test.\n",
453 " write a query that finds all the records in `Visited`\n",
454 " that *aren't* from sites labelled 'DR-something'."