018295463157f05802ae352d467daea5304b3ec6
[swc-sql.git] / sql / novice / 03-filter.ipynb
1 {
2  "metadata": {
3   "name": ""
4  },
5  "nbformat": 3,
6  "nbformat_minor": 0,
7  "worksheets": [
8   {
9    "cells": [
10     {
11      "cell_type": "heading",
12      "level": 1,
13      "metadata": {},
14      "source": [
15       "Filtering"
16      ]
17     },
18     {
19      "cell_type": "markdown",
20      "metadata": {},
21      "source": [
22       "One of the most powerful features of a database is\n",
23       "the ability to [filter](../../gloss.html#filter) data,\n",
24       "i.e.,\n",
25       "to select only those records that match certain criteria.\n",
26       "For example,\n",
27       "suppose we want to see when a particular site was visited.\n",
28       "We can select these records from the `Visited` table\n",
29       "by using a `where` clause in our query:"
30      ]
31     },
32     {
33      "cell_type": "code",
34      "collapsed": false,
35      "input": [
36       "%load_ext sqlitemagic"
37      ],
38      "language": "python",
39      "metadata": {},
40      "outputs": [],
41      "prompt_number": 1
42     },
43     {
44      "cell_type": "code",
45      "collapsed": false,
46      "input": [
47       "%%sqlite survey.db\n",
48       "select * from Visited where site='DR-1';"
49      ],
50      "language": "python",
51      "metadata": {},
52      "outputs": [
53       {
54        "html": [
55         "<table>\n",
56         "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
57         "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
58         "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
59         "</table>"
60        ],
61        "metadata": {},
62        "output_type": "display_data",
63        "text": [
64         "<IPython.core.display.HTML at 0x1023c9050>"
65        ]
66       }
67      ],
68      "prompt_number": 2
69     },
70     {
71      "cell_type": "markdown",
72      "metadata": {},
73      "source": [
74       "The database manager executes this query in two stages.\n",
75       "First,\n",
76       "it checks at each row in the `Visited` table\n",
77       "to see which ones satisfy the `where`.\n",
78       "It then uses the column names following the `select` keyword\n",
79       "to determine what columns to display."
80      ]
81     },
82     {
83      "cell_type": "markdown",
84      "metadata": {},
85      "source": [
86       "This processing order means that\n",
87       "we can filter records using `where`\n",
88       "based on values in columns that aren't then displayed:"
89      ]
90     },
91     {
92      "cell_type": "code",
93      "collapsed": false,
94      "input": [
95       "%%sqlite survey.db\n",
96       "select ident from Visited where site='DR-1';"
97      ],
98      "language": "python",
99      "metadata": {},
100      "outputs": [
101       {
102        "html": [
103         "<table>\n",
104         "<tr><td>619</td></tr>\n",
105         "<tr><td>622</td></tr>\n",
106         "<tr><td>844</td></tr>\n",
107         "</table>"
108        ],
109        "metadata": {},
110        "output_type": "display_data",
111        "text": [
112         "<IPython.core.display.HTML at 0x1023c3890>"
113        ]
114       }
115      ],
116      "prompt_number": 3
117     },
118     {
119      "cell_type": "markdown",
120      "metadata": {},
121      "source": [
122       "<img src=\"files/img/sql-filter.svg\" alt=\"SQL Filtering in Action\" />"
123      ]
124     },
125     {
126      "cell_type": "markdown",
127      "metadata": {},
128      "source": [
129       "We can use many other Boolean operators to filter our data.\n",
130       "For example,\n",
131       "we can ask for all information from the DR-1 site collected since 1930:"
132      ]
133     },
134     {
135      "cell_type": "code",
136      "collapsed": false,
137      "input": [
138       "%%sqlite survey.db\n",
139       "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
140      ],
141      "language": "python",
142      "metadata": {},
143      "outputs": [
144       {
145        "html": [
146         "<table>\n",
147         "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
148         "</table>"
149        ],
150        "metadata": {},
151        "output_type": "display_data",
152        "text": [
153         "<IPython.core.display.HTML at 0x1023c4790>"
154        ]
155       }
156      ],
157      "prompt_number": 4
158     },
159     {
160      "cell_type": "markdown",
161      "metadata": {},
162      "source": [
163       "(The parentheses around the individual tests aren't strictly required,\n",
164       "but they help make the query easier to read.)\n",
165       "\n",
166       "> Most database managers have a special data type for dates.\n",
167       "> In fact, many have two:\n",
168       "> one for dates,\n",
169       "> such as \"May 31, 1971\",\n",
170       "> and one for durations,\n",
171       "> such as \"31 days\".\n",
172       "> SQLite doesn't:\n",
173       "> instead,\n",
174       "> it stores dates as either text\n",
175       "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
176       "> real numbers\n",
177       "> (the number of days since November 24, 4714 BCE),\n",
178       "> or integers\n",
179       "> (the number of seconds since midnight, January 1, 1970).\n",
180       "> If this sounds complicated,\n",
181       "> it is,\n",
182       "> but not nearly as complicated as figuring out\n",
183       "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
184      ]
185     },
186     {
187      "cell_type": "markdown",
188      "metadata": {},
189      "source": [
190       "If we want to find out what measurements were taken by either Lake or Roerich,\n",
191       "we can combine the tests on their names using `or`:"
192      ]
193     },
194     {
195      "cell_type": "code",
196      "collapsed": false,
197      "input": [
198       "%%sqlite survey.db\n",
199       "select * from Survey where person='lake' or person='roe';"
200      ],
201      "language": "python",
202      "metadata": {},
203      "outputs": [
204       {
205        "html": [
206         "<table>\n",
207         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
208         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
209         "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
210         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
211         "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
212         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
213         "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
214         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
215         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
216         "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
217         "</table>"
218        ],
219        "metadata": {},
220        "output_type": "display_data",
221        "text": [
222         "<IPython.core.display.HTML at 0x1023c4850>"
223        ]
224       }
225      ],
226      "prompt_number": 5
227     },
228     {
229      "cell_type": "markdown",
230      "metadata": {},
231      "source": [
232       "Alternatively,\n",
233       "we can use `in` to see if a value is in a specific set:"
234      ]
235     },
236     {
237      "cell_type": "code",
238      "collapsed": false,
239      "input": [
240       "%%sqlite survey.db\n",
241       "select * from Survey where person in ('lake', 'roe');"
242      ],
243      "language": "python",
244      "metadata": {},
245      "outputs": [
246       {
247        "html": [
248         "<table>\n",
249         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
250         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
251         "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
252         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
253         "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
254         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
255         "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
256         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
257         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
258         "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
259         "</table>"
260        ],
261        "metadata": {},
262        "output_type": "display_data",
263        "text": [
264         "<IPython.core.display.HTML at 0x1023c4790>"
265        ]
266       }
267      ],
268      "prompt_number": 6
269     },
270     {
271      "cell_type": "markdown",
272      "metadata": {},
273      "source": [
274       "We can combine `and` with `or`,\n",
275       "but we need to be careful about which operator is executed first.\n",
276       "If we *don't* use parentheses,\n",
277       "we get this:"
278      ]
279     },
280     {
281      "cell_type": "code",
282      "collapsed": false,
283      "input": [
284       "%%sqlite survey.db\n",
285       "select * from Survey where quant='sal' and person='lake' or person='roe';"
286      ],
287      "language": "python",
288      "metadata": {},
289      "outputs": [
290       {
291        "html": [
292         "<table>\n",
293         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
294         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
295         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
296         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
297         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
298         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
299         "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
300         "</table>"
301        ],
302        "metadata": {},
303        "output_type": "display_data",
304        "text": [
305         "<IPython.core.display.HTML at 0x1023c4850>"
306        ]
307       }
308      ],
309      "prompt_number": 7
310     },
311     {
312      "cell_type": "markdown",
313      "metadata": {},
314      "source": [
315       "which is salinity measurements by Lake,\n",
316       "and *any* measurement by Roerich.\n",
317       "We probably want this instead:"
318      ]
319     },
320     {
321      "cell_type": "code",
322      "collapsed": false,
323      "input": [
324       "%%sqlite survey.db\n",
325       "select * from Survey where quant='sal' and (person='lake' or person='roe');"
326      ],
327      "language": "python",
328      "metadata": {},
329      "outputs": [
330       {
331        "html": [
332         "<table>\n",
333         "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
334         "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
335         "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
336         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
337         "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
338         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
339         "</table>"
340        ],
341        "metadata": {},
342        "output_type": "display_data",
343        "text": [
344         "<IPython.core.display.HTML at 0x1023c3850>"
345        ]
346       }
347      ],
348      "prompt_number": 8
349     },
350     {
351      "cell_type": "markdown",
352      "metadata": {},
353      "source": [
354       "Finally,\n",
355       "we can use `distinct` with `where`\n",
356       "to give a second level of filtering:"
357      ]
358     },
359     {
360      "cell_type": "code",
361      "collapsed": false,
362      "input": [
363       "%%sqlite survey.db\n",
364       "select distinct person, quant from Survey where person='lake' or person='roe';"
365      ],
366      "language": "python",
367      "metadata": {},
368      "outputs": [
369       {
370        "html": [
371         "<table>\n",
372         "<tr><td>lake</td><td>sal</td></tr>\n",
373         "<tr><td>lake</td><td>rad</td></tr>\n",
374         "<tr><td>lake</td><td>temp</td></tr>\n",
375         "<tr><td>roe</td><td>sal</td></tr>\n",
376         "<tr><td>roe</td><td>rad</td></tr>\n",
377         "</table>"
378        ],
379        "metadata": {},
380        "output_type": "display_data",
381        "text": [
382         "<IPython.core.display.HTML at 0x1023c3890>"
383        ]
384       }
385      ],
386      "prompt_number": 9
387     },
388     {
389      "cell_type": "markdown",
390      "metadata": {},
391      "source": [
392       "But remember:\n",
393       "`distinct` is applied to the values displayed in the chosen columns,\n",
394       "not to the entire rows as they are being processed.\n",
395       "\n",
396       "> What we have just done is how most people \"grow\" their SQL queries.\n",
397       "> We started with something simple that did part of what we wanted,\n",
398       "> then added more clauses one by one,\n",
399       "> testing their effects as we went.\n",
400       "> This is a good strategy&mdash;in fact,\n",
401       "> for complex queries it's often the *only* strategy&mdash;but\n",
402       "> it depends on quick turnaround,\n",
403       "> and on us recognizing the right answer when we get it.\n",
404       ">     \n",
405       "> The best way to achieve quick turnaround is often\n",
406       "> to put a subset of data in a temporary database\n",
407       "> and run our queries against that,\n",
408       "> or to fill a small database with synthesized records.\n",
409       "> For example,\n",
410       "> instead of trying our queries against an actual database of 20 million Australians,\n",
411       "> we could run it against a sample of ten thousand,\n",
412       "> or write a small program to generate ten thousand random (but plausible) records\n",
413       "> and use that."
414      ]
415     },
416     {
417      "cell_type": "markdown",
418      "metadata": {},
419      "source": [
420       "### Challenges\n",
421       "\n",
422       "1.  Suppose we want to select all sites that lie within 30&deg; of the equator.\n",
423       "    Our first query is:\n",
424       "\n",
425       "    ```\n",
426       "    select * from Site where (lat > -30) or (lat < 30);\n",
427       "    ```\n",
428       "\n",
429       "    Explain why this is wrong,\n",
430       "    and rewrite the query so that it is correct.\n",
431       "\n",
432       "2.  Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
433       "    Write a query that selects all records from `Survey`\n",
434       "    with salinity values outside this range.\n",
435       "\n",
436       "3.  The SQL test `*column-name* like *pattern*`\n",
437       "    is true if the value in the named column\n",
438       "    matches the pattern given;\n",
439       "    the character '%' can be used any number of times in the pattern\n",
440       "    to mean \"match zero or more characters\".\n",
441       "\n",
442       "    <table>\n",
443       "      <tr> <th>Expression</th> <th>Value</th> </tr>\n",
444       "      <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
445       "      <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
446       "      <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
447       "      <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
448       "      <tr> <td><code>'alpha' like 'a%p%'</code> <td>True</td> </tr>\n",
449       "    </table>\n",
450       "    The expression `*column-name* not like *pattern*`\n",
451       "    inverts the test.\n",
452       "    Using `like`,\n",
453       "    write a query that finds all the records in `Visited`\n",
454       "    that *aren't* from sites labelled 'DR-something'."
455      ]
456     }
457    ],
458    "metadata": {}
459   }
460  ]
461 }