11 "cell_type": "heading",
19 "cell_type": "markdown",
28 "* Write queries that select records that satisfy user-specified conditions.\n",
29 "* Explain the order in which the clauses in a query are executed."
33 "cell_type": "markdown",
36 "One of the most powerful features of a database is\n",
37 "the ability to [filter](https://github.com/swcarpentry/bc/blob/master/gloss.md#filter) data,\n",
39 "to select only those records that match certain criteria.\n",
41 "suppose we want to see when a particular site was visited.\n",
42 "We can select these records from the `Visited` table\n",
43 "by using a `where` clause in our query:"
50 "%load_ext sqlitemagic"
61 "%%sqlite survey.db\n",
62 "select * from Visited where site='DR-1';"
70 "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
71 "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
72 "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
76 "output_type": "display_data",
78 "<IPython.core.display.HTML at 0x1023c9050>"
85 "cell_type": "markdown",
88 "The database manager executes this query in two stages.\n",
90 "it checks at each row in the `Visited` table\n",
91 "to see which ones satisfy the `where`.\n",
92 "It then uses the column names following the `select` keyword\n",
93 "to determine what columns to display."
97 "cell_type": "markdown",
100 "This processing order means that\n",
101 "we can filter records using `where`\n",
102 "based on values in columns that aren't then displayed:"
109 "%%sqlite survey.db\n",
110 "select ident from Visited where site='DR-1';"
112 "language": "python",
118 "<tr><td>619</td></tr>\n",
119 "<tr><td>622</td></tr>\n",
120 "<tr><td>844</td></tr>\n",
124 "output_type": "display_data",
126 "<IPython.core.display.HTML at 0x1023c3890>"
133 "cell_type": "markdown",
136 "<img src=\"files/img/sql-filter.svg\" alt=\"SQL Filtering in Action\" />"
140 "cell_type": "markdown",
143 "We can use many other Boolean operators to filter our data.\n",
145 "we can ask for all information from the DR-1 site collected since 1930:"
152 "%%sqlite survey.db\n",
153 "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
155 "language": "python",
161 "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
165 "output_type": "display_data",
167 "<IPython.core.display.HTML at 0x1023c4790>"
174 "cell_type": "markdown",
177 "(The parentheses around the individual tests aren't strictly required,\n",
178 "but they help make the query easier to read.)\n",
180 "> Most database managers have a special data type for dates.\n",
181 "> In fact, many have two:\n",
182 "> one for dates,\n",
183 "> such as \"May 31, 1971\",\n",
184 "> and one for durations,\n",
185 "> such as \"31 days\".\n",
186 "> SQLite doesn't:\n",
188 "> it stores dates as either text\n",
189 "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
191 "> (the number of days since November 24, 4714 BCE),\n",
193 "> (the number of seconds since midnight, January 1, 1970).\n",
194 "> If this sounds complicated,\n",
196 "> but not nearly as complicated as figuring out\n",
197 "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
201 "cell_type": "markdown",
204 "If we want to find out what measurements were taken by either Lake or Roerich,\n",
205 "we can combine the tests on their names using `or`:"
212 "%%sqlite survey.db\n",
213 "select * from Survey where person='lake' or person='roe';"
215 "language": "python",
221 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
222 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
223 "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
224 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
225 "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
226 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
227 "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
228 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
229 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
230 "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
234 "output_type": "display_data",
236 "<IPython.core.display.HTML at 0x1023c4850>"
243 "cell_type": "markdown",
247 "we can use `in` to see if a value is in a specific set:"
254 "%%sqlite survey.db\n",
255 "select * from Survey where person in ('lake', 'roe');"
257 "language": "python",
263 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
264 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
265 "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
266 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
267 "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
268 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
269 "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
270 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
271 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
272 "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
276 "output_type": "display_data",
278 "<IPython.core.display.HTML at 0x1023c4790>"
285 "cell_type": "markdown",
288 "We can combine `and` with `or`,\n",
289 "but we need to be careful about which operator is executed first.\n",
290 "If we *don't* use parentheses,\n",
298 "%%sqlite survey.db\n",
299 "select * from Survey where quant='sal' and person='lake' or person='roe';"
301 "language": "python",
307 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
308 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
309 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
310 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
311 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
312 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
313 "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
317 "output_type": "display_data",
319 "<IPython.core.display.HTML at 0x1023c4850>"
326 "cell_type": "markdown",
329 "which is salinity measurements by Lake,\n",
330 "and *any* measurement by Roerich.\n",
331 "We probably want this instead:"
338 "%%sqlite survey.db\n",
339 "select * from Survey where quant='sal' and (person='lake' or person='roe');"
341 "language": "python",
347 "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
348 "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
349 "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
350 "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
351 "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
352 "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
356 "output_type": "display_data",
358 "<IPython.core.display.HTML at 0x1023c3850>"
365 "cell_type": "markdown",
369 "we can use `distinct` with `where`\n",
370 "to give a second level of filtering:"
377 "%%sqlite survey.db\n",
378 "select distinct person, quant from Survey where person='lake' or person='roe';"
380 "language": "python",
386 "<tr><td>lake</td><td>sal</td></tr>\n",
387 "<tr><td>lake</td><td>rad</td></tr>\n",
388 "<tr><td>lake</td><td>temp</td></tr>\n",
389 "<tr><td>roe</td><td>sal</td></tr>\n",
390 "<tr><td>roe</td><td>rad</td></tr>\n",
394 "output_type": "display_data",
396 "<IPython.core.display.HTML at 0x1023c3890>"
403 "cell_type": "markdown",
407 "`distinct` is applied to the values displayed in the chosen columns,\n",
408 "not to the entire rows as they are being processed.\n",
410 "> What we have just done is how most people \"grow\" their SQL queries.\n",
411 "> We started with something simple that did part of what we wanted,\n",
412 "> then added more clauses one by one,\n",
413 "> testing their effects as we went.\n",
414 "> This is a good strategy—in fact,\n",
415 "> for complex queries it's often the *only* strategy—but\n",
416 "> it depends on quick turnaround,\n",
417 "> and on us recognizing the right answer when we get it.\n",
419 "> The best way to achieve quick turnaround is often\n",
420 "> to put a subset of data in a temporary database\n",
421 "> and run our queries against that,\n",
422 "> or to fill a small database with synthesized records.\n",
424 "> instead of trying our queries against an actual database of 20 million Australians,\n",
425 "> we could run it against a sample of ten thousand,\n",
426 "> or write a small program to generate ten thousand random (but plausible) records\n",
431 "cell_type": "markdown",
436 "1. Suppose we want to select all sites that lie more than 30° from the poles.\n",
437 " Our first query is:\n",
440 " select * from Site where (lat > -60) or (lat < 60);\n",
443 " Explain why this is wrong,\n",
444 " and rewrite the query so that it is correct.\n",
446 "2. Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
447 " Write a query that selects all records from `Survey`\n",
448 " with salinity values outside this range.\n",
450 "3. The SQL test `*column-name* like *pattern*`\n",
451 " is true if the value in the named column\n",
452 " matches the pattern given;\n",
453 " the character '%' can be used any number of times in the pattern\n",
454 " to mean \"match zero or more characters\".\n",
457 " <tr> <th>Expression</th> <th>Value</th> </tr>\n",
458 " <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
459 " <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
460 " <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
461 " <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
462 " <tr> <td><code>'alpha' like 'a%p%'</code></td> <td>True</td> </tr>\n",
464 " The expression `*column-name* not like *pattern*`\n",
465 " inverts the test.\n",
467 " write a query that finds all the records in `Visited`\n",
468 " that *aren't* from sites labelled 'DR-something'."
472 "cell_type": "markdown",
481 "* Use `where` to filter records according to Boolean conditions.\n",
482 "* Filtering is done on whole records,\n",
483 " so conditions can use fields that are not actually displayed."