*.ipynb: Use absolute URLs to link to the glossary
[swc-sql.git] / 03-filter.ipynb
1 {
2  "metadata": {
3   "name": ""
4  },
5  "nbformat": 3,
6  "nbformat_minor": 0,
7  "worksheets": [
8   {
9    "cells": [
10     {
11      "cell_type": "heading",
12      "level": 2,
13      "metadata": {},
14      "source": [
15       "Filtering"
16      ]
17     },
18     {
19      "cell_type": "markdown",
20      "metadata": {
21       "cell_tags": [
22        "objectives"
23       ]
24      },
25      "source": [
26       "#### Objectives\n",
27       "\n",
28       "*   Write queries that select records that satisfy user-specified conditions.\n",
29       "*   Explain the order in which the clauses in a query are executed."
30      ]
31     },
32     {
33      "cell_type": "markdown",
34      "metadata": {},
35      "source": [
36       "One of the most powerful features of a database is\n",
37       "the ability to [filter](https://github.com/swcarpentry/bc/blob/master/gloss.md#filter) data,\n",
38       "i.e.,\n",
39       "to select only those records that match certain criteria.\n",
40       "For example,\n",
41       "suppose we want to see when a particular site was visited.\n",
42       "We can select these records from the `Visited` table\n",
43       "by using a `where` clause in our query:"
44      ]
45     },
46     {
47      "cell_type": "code",
48      "collapsed": false,
49      "input": [
50       "%load_ext sqlitemagic"
51      ],
52      "language": "python",
53      "metadata": {},
54      "outputs": [],
55      "prompt_number": 1
56     },
57     {
58      "cell_type": "code",
59      "collapsed": false,
60      "input": [
61       "%%sqlite survey.db\n",
62       "select * from Visited where site='DR-1';"
63      ],
64      "language": "python",
65      "metadata": {},
66      "outputs": [
67       {
68        "html": [
69         "<table>\n",
70         "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
71         "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
72         "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
73         "</table>"
74        ],
75        "metadata": {},
76        "output_type": "display_data",
77        "text": [
78         "<IPython.core.display.HTML at 0x1023c9050>"
79        ]
80       }
81      ],
82      "prompt_number": 2
83     },
84     {
85      "cell_type": "markdown",
86      "metadata": {},
87      "source": [
88       "The database manager executes this query in two stages.\n",
89       "First,\n",
90       "it checks at each row in the `Visited` table\n",
91       "to see which ones satisfy the `where`.\n",
92       "It then uses the column names following the `select` keyword\n",
93       "to determine what columns to display."
94      ]
95     },
96     {
97      "cell_type": "markdown",
98      "metadata": {},
99      "source": [
100       "This processing order means that\n",
101       "we can filter records using `where`\n",
102       "based on values in columns that aren't then displayed:"
103      ]
104     },
105     {
106      "cell_type": "code",
107      "collapsed": false,
108      "input": [
109       "%%sqlite survey.db\n",
110       "select ident from Visited where site='DR-1';"
111      ],
112      "language": "python",
113      "metadata": {},
114      "outputs": [
115       {
116        "html": [
117         "<table>\n",
118         "<tr><td>619</td></tr>\n",
119         "<tr><td>622</td></tr>\n",
120         "<tr><td>844</td></tr>\n",
121         "</table>"
122        ],
123        "metadata": {},
124        "output_type": "display_data",
125        "text": [
126         "<IPython.core.display.HTML at 0x1023c3890>"
127        ]
128       }
129      ],
130      "prompt_number": 3
131     },
132     {
133      "cell_type": "markdown",
134      "metadata": {},
135      "source": [
136       "<img src=\"files/img/sql-filter.svg\" alt=\"SQL Filtering in Action\" />"
137      ]
138     },
139     {
140      "cell_type": "markdown",
141      "metadata": {},
142      "source": [
143       "We can use many other Boolean operators to filter our data.\n",
144       "For example,\n",
145       "we can ask for all information from the DR-1 site collected since 1930:"
146      ]
147     },
148     {
149      "cell_type": "code",
150      "collapsed": false,
151      "input": [
152       "%%sqlite survey.db\n",
153       "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
154      ],
155      "language": "python",
156      "metadata": {},
157      "outputs": [
158       {
159        "html": [
160         "<table>\n",
161         "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
162         "</table>"
163        ],
164        "metadata": {},
165        "output_type": "display_data",
166        "text": [
167         "<IPython.core.display.HTML at 0x1023c4790>"
168        ]
169       }
170      ],
171      "prompt_number": 4
172     },
173     {
174      "cell_type": "markdown",
175      "metadata": {},
176      "source": [
177       "(The parentheses around the individual tests aren't strictly required,\n",
178       "but they help make the query easier to read.)\n",
179       "\n",
180       "> Most database managers have a special data type for dates.\n",
181       "> In fact, many have two:\n",
182       "> one for dates,\n",
183       "> such as \"May 31, 1971\",\n",
184       "> and one for durations,\n",
185       "> such as \"31 days\".\n",
186       "> SQLite doesn't:\n",
187       "> instead,\n",
188       "> it stores dates as either text\n",
189       "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
190       "> real numbers\n",
191       "> (the number of days since November 24, 4714 BCE),\n",
192       "> or integers\n",
193       "> (the number of seconds since midnight, January 1, 1970).\n",
194       "> If this sounds complicated,\n",
195       "> it is,\n",
196       "> but not nearly as complicated as figuring out\n",
197       "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
198      ]
199     },
200     {
201      "cell_type": "markdown",
202      "metadata": {},
203      "source": [
204       "If we want to find out what measurements were taken by either Lake or Roerich,\n",
205       "we can combine the tests on their names using `or`:"
206      ]
207     },
208     {
209      "cell_type": "code",
210      "collapsed": false,
211      "input": [
212       "%%sqlite survey.db\n",
213       "select * from Survey where person='lake' or person='roe';"
214      ],
215      "language": "python",
216      "metadata": {},
217      "outputs": [
218       {
219        "html": [
220         "<table>\n",
221         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
222         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
223         "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
224         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
225         "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
226         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
227         "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
228         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
229         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
230         "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
231         "</table>"
232        ],
233        "metadata": {},
234        "output_type": "display_data",
235        "text": [
236         "<IPython.core.display.HTML at 0x1023c4850>"
237        ]
238       }
239      ],
240      "prompt_number": 5
241     },
242     {
243      "cell_type": "markdown",
244      "metadata": {},
245      "source": [
246       "Alternatively,\n",
247       "we can use `in` to see if a value is in a specific set:"
248      ]
249     },
250     {
251      "cell_type": "code",
252      "collapsed": false,
253      "input": [
254       "%%sqlite survey.db\n",
255       "select * from Survey where person in ('lake', 'roe');"
256      ],
257      "language": "python",
258      "metadata": {},
259      "outputs": [
260       {
261        "html": [
262         "<table>\n",
263         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
264         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
265         "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
266         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
267         "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
268         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
269         "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
270         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
271         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
272         "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
273         "</table>"
274        ],
275        "metadata": {},
276        "output_type": "display_data",
277        "text": [
278         "<IPython.core.display.HTML at 0x1023c4790>"
279        ]
280       }
281      ],
282      "prompt_number": 6
283     },
284     {
285      "cell_type": "markdown",
286      "metadata": {},
287      "source": [
288       "We can combine `and` with `or`,\n",
289       "but we need to be careful about which operator is executed first.\n",
290       "If we *don't* use parentheses,\n",
291       "we get this:"
292      ]
293     },
294     {
295      "cell_type": "code",
296      "collapsed": false,
297      "input": [
298       "%%sqlite survey.db\n",
299       "select * from Survey where quant='sal' and person='lake' or person='roe';"
300      ],
301      "language": "python",
302      "metadata": {},
303      "outputs": [
304       {
305        "html": [
306         "<table>\n",
307         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
308         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
309         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
310         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
311         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
312         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
313         "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
314         "</table>"
315        ],
316        "metadata": {},
317        "output_type": "display_data",
318        "text": [
319         "<IPython.core.display.HTML at 0x1023c4850>"
320        ]
321       }
322      ],
323      "prompt_number": 7
324     },
325     {
326      "cell_type": "markdown",
327      "metadata": {},
328      "source": [
329       "which is salinity measurements by Lake,\n",
330       "and *any* measurement by Roerich.\n",
331       "We probably want this instead:"
332      ]
333     },
334     {
335      "cell_type": "code",
336      "collapsed": false,
337      "input": [
338       "%%sqlite survey.db\n",
339       "select * from Survey where quant='sal' and (person='lake' or person='roe');"
340      ],
341      "language": "python",
342      "metadata": {},
343      "outputs": [
344       {
345        "html": [
346         "<table>\n",
347         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
348         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
349         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
350         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
351         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
352         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
353         "</table>"
354        ],
355        "metadata": {},
356        "output_type": "display_data",
357        "text": [
358         "<IPython.core.display.HTML at 0x1023c3850>"
359        ]
360       }
361      ],
362      "prompt_number": 8
363     },
364     {
365      "cell_type": "markdown",
366      "metadata": {},
367      "source": [
368       "Finally,\n",
369       "we can use `distinct` with `where`\n",
370       "to give a second level of filtering:"
371      ]
372     },
373     {
374      "cell_type": "code",
375      "collapsed": false,
376      "input": [
377       "%%sqlite survey.db\n",
378       "select distinct person, quant from Survey where person='lake' or person='roe';"
379      ],
380      "language": "python",
381      "metadata": {},
382      "outputs": [
383       {
384        "html": [
385         "<table>\n",
386         "<tr><td>lake</td><td>sal</td></tr>\n",
387         "<tr><td>lake</td><td>rad</td></tr>\n",
388         "<tr><td>lake</td><td>temp</td></tr>\n",
389         "<tr><td>roe</td><td>sal</td></tr>\n",
390         "<tr><td>roe</td><td>rad</td></tr>\n",
391         "</table>"
392        ],
393        "metadata": {},
394        "output_type": "display_data",
395        "text": [
396         "<IPython.core.display.HTML at 0x1023c3890>"
397        ]
398       }
399      ],
400      "prompt_number": 9
401     },
402     {
403      "cell_type": "markdown",
404      "metadata": {},
405      "source": [
406       "But remember:\n",
407       "`distinct` is applied to the values displayed in the chosen columns,\n",
408       "not to the entire rows as they are being processed.\n",
409       "\n",
410       "> What we have just done is how most people \"grow\" their SQL queries.\n",
411       "> We started with something simple that did part of what we wanted,\n",
412       "> then added more clauses one by one,\n",
413       "> testing their effects as we went.\n",
414       "> This is a good strategy&mdash;in fact,\n",
415       "> for complex queries it's often the *only* strategy&mdash;but\n",
416       "> it depends on quick turnaround,\n",
417       "> and on us recognizing the right answer when we get it.\n",
418       ">     \n",
419       "> The best way to achieve quick turnaround is often\n",
420       "> to put a subset of data in a temporary database\n",
421       "> and run our queries against that,\n",
422       "> or to fill a small database with synthesized records.\n",
423       "> For example,\n",
424       "> instead of trying our queries against an actual database of 20 million Australians,\n",
425       "> we could run it against a sample of ten thousand,\n",
426       "> or write a small program to generate ten thousand random (but plausible) records\n",
427       "> and use that."
428      ]
429     },
430     {
431      "cell_type": "markdown",
432      "metadata": {},
433      "source": [
434       "#### Challenges\n",
435       "\n",
436       "1.  Suppose we want to select all sites that lie more than 30&deg; from the poles.\n",
437       "    Our first query is:\n",
438       "\n",
439       "    ~~~\n",
440       "    select * from Site where (lat > -60) or (lat < 60);\n",
441       "    ~~~\n",
442       "\n",
443       "    Explain why this is wrong,\n",
444       "    and rewrite the query so that it is correct.\n",
445       "\n",
446       "2.  Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
447       "    Write a query that selects all records from `Survey`\n",
448       "    with salinity values outside this range.\n",
449       "\n",
450       "3.  The SQL test `*column-name* like *pattern*`\n",
451       "    is true if the value in the named column\n",
452       "    matches the pattern given;\n",
453       "    the character '%' can be used any number of times in the pattern\n",
454       "    to mean \"match zero or more characters\".\n",
455       "\n",
456       "    <table>\n",
457       "      <tr> <th>Expression</th> <th>Value</th> </tr>\n",
458       "      <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
459       "      <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
460       "      <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
461       "      <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
462       "      <tr> <td><code>'alpha' like 'a%p%'</code></td> <td>True</td> </tr>\n",
463       "    </table>\n",
464       "    The expression `*column-name* not like *pattern*`\n",
465       "    inverts the test.\n",
466       "    Using `like`,\n",
467       "    write a query that finds all the records in `Visited`\n",
468       "    that *aren't* from sites labelled 'DR-something'."
469      ]
470     },
471     {
472      "cell_type": "markdown",
473      "metadata": {
474       "cell_tags": [
475        "keypoints"
476       ]
477      },
478      "source": [
479       "#### Key Points\n",
480       "\n",
481       "*   Use `where` to filter records according to Boolean conditions.\n",
482       "*   Filtering is done on whole records,\n",
483       "    so conditions can use fields that are not actually displayed."
484      ]
485     }
486    ],
487    "metadata": {}
488   }
489  ]
490 }