*.ipynb: Use absolute URLs to link to the glossary
[swc-sql.git] / 05-null.ipynb
1 {
2  "metadata": {
3   "name": ""
4  },
5  "nbformat": 3,
6  "nbformat_minor": 0,
7  "worksheets": [
8   {
9    "cells": [
10     {
11      "cell_type": "heading",
12      "level": 2,
13      "metadata": {},
14      "source": [
15       "Missing Data"
16      ]
17     },
18     {
19      "cell_type": "markdown",
20      "metadata": {
21       "cell_tags": [
22        "objectives"
23       ]
24      },
25      "source": [
26       "#### Objectives\n",
27       "\n",
28       "*   Explain how databases represent missing information.\n",
29       "*   Explain the three-valued logic databases use when manipulating missing information.\n",
30       "*   Write queries that handle missing information correctly."
31      ]
32     },
33     {
34      "cell_type": "markdown",
35      "metadata": {},
36      "source": [
37       "Real-world data is never complete—there are always holes.\n",
38       "Databases represent these holes using special value called `null`.\n",
39       "`null` is not zero, `False`, or the empty string;\n",
40       "it is a one-of-a-kind value that means \"nothing here\".\n",
41       "Dealing with `null` requires a few special tricks\n",
42       "and some careful thinking.\n",
43       "\n",
44       "To start,\n",
45       "let's have a look at the `Visited` table.\n",
46       "There are eight records,\n",
47       "but #752 doesn't have a date—or rather,\n",
48       "its date is null:"
49      ]
50     },
51     {
52      "cell_type": "code",
53      "collapsed": false,
54      "input": [
55       "%load_ext sqlitemagic"
56      ],
57      "language": "python",
58      "metadata": {},
59      "outputs": [],
60      "prompt_number": 1
61     },
62     {
63      "cell_type": "code",
64      "collapsed": false,
65      "input": [
66       "%%sqlite survey.db\n",
67       "select * from Visited;"
68      ],
69      "language": "python",
70      "metadata": {},
71      "outputs": [
72       {
73        "html": [
74         "<table>\n",
75         "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
76         "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
77         "<tr><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
78         "<tr><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
79         "<tr><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
80         "<tr><td>752</td><td>DR-3</td><td>None</td></tr>\n",
81         "<tr><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
82         "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
83         "</table>"
84        ],
85        "metadata": {},
86        "output_type": "display_data",
87        "text": [
88         "<IPython.core.display.HTML at 0x1023c4550>"
89        ]
90       }
91      ],
92      "prompt_number": 2
93     },
94     {
95      "cell_type": "markdown",
96      "metadata": {},
97      "source": [
98       "Null doesn't behave like other values.\n",
99       "If we select the records that come before 1930:"
100      ]
101     },
102     {
103      "cell_type": "code",
104      "collapsed": false,
105      "input": [
106       "%%sqlite survey.db\n",
107       "select * from Visited where dated<'1930-00-00';"
108      ],
109      "language": "python",
110      "metadata": {},
111      "outputs": [
112       {
113        "html": [
114         "<table>\n",
115         "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
116         "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
117         "</table>"
118        ],
119        "metadata": {},
120        "output_type": "display_data",
121        "text": [
122         "<IPython.core.display.HTML at 0x1023c46d0>"
123        ]
124       }
125      ],
126      "prompt_number": 3
127     },
128     {
129      "cell_type": "markdown",
130      "metadata": {},
131      "source": [
132       "we get two results,\n",
133       "and if we select the ones that come during or after 1930:"
134      ]
135     },
136     {
137      "cell_type": "code",
138      "collapsed": false,
139      "input": [
140       "%%sqlite survey.db\n",
141       "select * from Visited where dated>='1930-00-00';"
142      ],
143      "language": "python",
144      "metadata": {},
145      "outputs": [
146       {
147        "html": [
148         "<table>\n",
149         "<tr><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
150         "<tr><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
151         "<tr><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
152         "<tr><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
153         "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
154         "</table>"
155        ],
156        "metadata": {},
157        "output_type": "display_data",
158        "text": [
159         "<IPython.core.display.HTML at 0x1023c4550>"
160        ]
161       }
162      ],
163      "prompt_number": 4
164     },
165     {
166      "cell_type": "markdown",
167      "metadata": {},
168      "source": [
169       "we get five,\n",
170       "but record #752 isn't in either set of results.\n",
171       "The reason is that\n",
172       "`null<'1930-00-00'`\n",
173       "is neither true nor false:\n",
174       "null means, \"We don't know,\"\n",
175       "and if we don't know the value on the left side of a comparison,\n",
176       "we don't know whether the comparison is true or false.\n",
177       "Since databases represent \"don't know\" as null,\n",
178       "the value of `null<'1930-00-00'`\n",
179       "is actually `null`.\n",
180       "`null>='1930-00-00'` is also null\n",
181       "because we can't answer to that question either.\n",
182       "And since the only records kept by a `where`\n",
183       "are those for which the test is true,\n",
184       "record #752 isn't included in either set of results.\n",
185       "\n",
186       "Comparisons aren't the only operations that behave this way with nulls.\n",
187       "`1+null` is `null`,\n",
188       "`5*null` is `null`,\n",
189       "`log(null)` is `null`,\n",
190       "and so on.\n",
191       "In particular,\n",
192       "comparing things to null with = and != produces null:"
193      ]
194     },
195     {
196      "cell_type": "code",
197      "collapsed": false,
198      "input": [
199       "%%sqlite survey.db\n",
200       "select * from Visited where dated=NULL;"
201      ],
202      "language": "python",
203      "metadata": {},
204      "outputs": [
205       {
206        "html": [
207         "<table>\n",
208         "\n",
209         "</table>"
210        ],
211        "metadata": {},
212        "output_type": "display_data",
213        "text": [
214         "<IPython.core.display.HTML at 0x1023c46d0>"
215        ]
216       }
217      ],
218      "prompt_number": 5
219     },
220     {
221      "cell_type": "code",
222      "collapsed": false,
223      "input": [
224       "%%sqlite survey.db\n",
225       "select * from Visited where dated!=NULL;"
226      ],
227      "language": "python",
228      "metadata": {},
229      "outputs": [
230       {
231        "html": [
232         "<table>\n",
233         "\n",
234         "</table>"
235        ],
236        "metadata": {},
237        "output_type": "display_data",
238        "text": [
239         "<IPython.core.display.HTML at 0x1023c4550>"
240        ]
241       }
242      ],
243      "prompt_number": 6
244     },
245     {
246      "cell_type": "markdown",
247      "metadata": {},
248      "source": [
249       "To check whether a value is `null` or not,\n",
250       "we must use a special test `is null`:"
251      ]
252     },
253     {
254      "cell_type": "code",
255      "collapsed": false,
256      "input": [
257       "%%sqlite survey.db\n",
258       "select * from Visited where dated is NULL;"
259      ],
260      "language": "python",
261      "metadata": {},
262      "outputs": [
263       {
264        "html": [
265         "<table>\n",
266         "<tr><td>752</td><td>DR-3</td><td>None</td></tr>\n",
267         "</table>"
268        ],
269        "metadata": {},
270        "output_type": "display_data",
271        "text": [
272         "<IPython.core.display.HTML at 0x1023c46d0>"
273        ]
274       }
275      ],
276      "prompt_number": 7
277     },
278     {
279      "cell_type": "markdown",
280      "metadata": {},
281      "source": [
282       "or its inverse `is not null`:"
283      ]
284     },
285     {
286      "cell_type": "code",
287      "collapsed": false,
288      "input": [
289       "%%sqlite survey.db\n",
290       "select * from Visited where dated is not NULL;"
291      ],
292      "language": "python",
293      "metadata": {},
294      "outputs": [
295       {
296        "html": [
297         "<table>\n",
298         "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
299         "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
300         "<tr><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
301         "<tr><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
302         "<tr><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
303         "<tr><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
304         "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
305         "</table>"
306        ],
307        "metadata": {},
308        "output_type": "display_data",
309        "text": [
310         "<IPython.core.display.HTML at 0x1023c4550>"
311        ]
312       }
313      ],
314      "prompt_number": 8
315     },
316     {
317      "cell_type": "markdown",
318      "metadata": {},
319      "source": [
320       "Null values cause headaches wherever they appear.\n",
321       "For example,\n",
322       "suppose we want to find all the salinity measurements\n",
323       "that weren't taken by Dyer.\n",
324       "It's natural to write the query like this:"
325      ]
326     },
327     {
328      "cell_type": "code",
329      "collapsed": false,
330      "input": [
331       "%%sqlite survey.db\n",
332       "select * from Survey where quant='sal' and person!='lake';"
333      ],
334      "language": "python",
335      "metadata": {},
336      "outputs": [
337       {
338        "html": [
339         "<table>\n",
340         "<tr><td>619</td><td>dyer</td><td>sal</td><td>0.13</td></tr>\n",
341         "<tr><td>622</td><td>dyer</td><td>sal</td><td>0.09</td></tr>\n",
342         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
343         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
344         "</table>"
345        ],
346        "metadata": {},
347        "output_type": "display_data",
348        "text": [
349         "<IPython.core.display.HTML at 0x1023c46d0>"
350        ]
351       }
352      ],
353      "prompt_number": 9
354     },
355     {
356      "cell_type": "markdown",
357      "metadata": {},
358      "source": [
359       "but this query filters omits the records\n",
360       "where we don't know who took the measurement.\n",
361       "Once again,\n",
362       "the reason is that when `person` is `null`,\n",
363       "the `!=` comparison produces `null`,\n",
364       "so the record isn't kept in our results.\n",
365       "If we want to keep these records\n",
366       "we need to add an explicit check:"
367      ]
368     },
369     {
370      "cell_type": "code",
371      "collapsed": false,
372      "input": [
373       "%%sqlite survey.db\n",
374       "select * from Survey where quant='sal' and (person!='lake' or person is null);"
375      ],
376      "language": "python",
377      "metadata": {},
378      "outputs": [
379       {
380        "html": [
381         "<table>\n",
382         "<tr><td>619</td><td>dyer</td><td>sal</td><td>0.13</td></tr>\n",
383         "<tr><td>622</td><td>dyer</td><td>sal</td><td>0.09</td></tr>\n",
384         "<tr><td>735</td><td>None</td><td>sal</td><td>0.06</td></tr>\n",
385         "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
386         "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
387         "</table>"
388        ],
389        "metadata": {},
390        "output_type": "display_data",
391        "text": [
392         "<IPython.core.display.HTML at 0x1023c4550>"
393        ]
394       }
395      ],
396      "prompt_number": 10
397     },
398     {
399      "cell_type": "markdown",
400      "metadata": {},
401      "source": [
402       "We still have to decide whether this is the right thing to do or not.\n",
403       "If we want to be absolutely sure that\n",
404       "we aren't including any measurements by Lake in our results,\n",
405       "we need to exclude all the records for which we don't know who did the work."
406      ]
407     },
408     {
409      "cell_type": "markdown",
410      "metadata": {},
411      "source": [
412       "#### Challenges\n",
413       "\n",
414       "1.  Write a query that sorts the records in `Visited` by date,\n",
415       "    omitting entries for which the date is not known\n",
416       "    (i.e., is null).\n",
417       "\n",
418       "1.  What do you expect the query:\n",
419       "\n",
420       "    ~~~\n",
421       "    select * from Visited where dated in ('1927-02-08', null);\n",
422       "    ~~~\n",
423       "\n",
424       "    to produce?\n",
425       "    What does it actually produce?\n",
426       "\n",
427       "1.  Some database designers prefer to use\n",
428       "    a [sentinel value](https://github.com/swcarpentry/bc/blob/master/gloss.md#sentinel-value)\n",
429       "    to mark missing data rather than `null`.\n",
430       "    For example,\n",
431       "    they will use the date \"0000-00-00\" to mark a missing date,\n",
432       "    or -1.0 to mark a missing salinity or radiation reading\n",
433       "    (since actual readings cannot be negative).\n",
434       "    What does this simplify?\n",
435       "    What burdens or risks does it introduce?"
436      ]
437     },
438     {
439      "cell_type": "markdown",
440      "metadata": {
441       "cell_tags": [
442        "keypoints"
443       ]
444      },
445      "source": [
446       "#### Key Points\n",
447       "\n",
448       "*   Databases use `null` to represent missing information.\n",
449       "*   Any arithmetic or Boolean operation involving `null` produces `null` as a result.\n",
450       "*   The only operators that can safely be used with `null` are `is null` and `is not null`."
451      ]
452     }
453    ],
454    "metadata": {}
455   }
456  ]
457 }