*.ipynb: Use absolute URLs to link to the glossary
[swc-sql.git] / 02-sort-dup.ipynb
1 {
2  "metadata": {
3   "name": ""
4  },
5  "nbformat": 3,
6  "nbformat_minor": 0,
7  "worksheets": [
8   {
9    "cells": [
10     {
11      "cell_type": "heading",
12      "level": 2,
13      "metadata": {},
14      "source": [
15       "Sorting and Removing Duplicates"
16      ]
17     },
18     {
19      "cell_type": "markdown",
20      "metadata": {
21       "cell_tags": [
22        "objectives"
23       ]
24      },
25      "source": [
26       "#### Objectives\n",
27       "\n",
28       "*   Write queries that display results in a particular order.\n",
29       "*   Write queries that eliminate duplicate values from data."
30      ]
31     },
32     {
33      "cell_type": "markdown",
34      "metadata": {},
35      "source": [
36       "Data is often redundant,\n",
37       "so queries often return redundant information.\n",
38       "For example,\n",
39       "if we select the quantities that have been measured\n",
40       "from the `Survey` table,\n",
41       "we get this:"
42      ]
43     },
44     {
45      "cell_type": "code",
46      "collapsed": false,
47      "input": [
48       "%load_ext sqlitemagic"
49      ],
50      "language": "python",
51      "metadata": {},
52      "outputs": [],
53      "prompt_number": 1
54     },
55     {
56      "cell_type": "code",
57      "collapsed": false,
58      "input": [
59       "%%sqlite survey.db\n",
60       "select quant from Survey;"
61      ],
62      "language": "python",
63      "metadata": {},
64      "outputs": [
65       {
66        "html": [
67         "<table>\n",
68         "<tr><td>rad</td></tr>\n",
69         "<tr><td>sal</td></tr>\n",
70         "<tr><td>rad</td></tr>\n",
71         "<tr><td>sal</td></tr>\n",
72         "<tr><td>rad</td></tr>\n",
73         "<tr><td>sal</td></tr>\n",
74         "<tr><td>temp</td></tr>\n",
75         "<tr><td>rad</td></tr>\n",
76         "<tr><td>sal</td></tr>\n",
77         "<tr><td>temp</td></tr>\n",
78         "<tr><td>rad</td></tr>\n",
79         "<tr><td>temp</td></tr>\n",
80         "<tr><td>sal</td></tr>\n",
81         "<tr><td>rad</td></tr>\n",
82         "<tr><td>sal</td></tr>\n",
83         "<tr><td>temp</td></tr>\n",
84         "<tr><td>sal</td></tr>\n",
85         "<tr><td>rad</td></tr>\n",
86         "<tr><td>sal</td></tr>\n",
87         "<tr><td>sal</td></tr>\n",
88         "<tr><td>rad</td></tr>\n",
89         "</table>"
90        ],
91        "metadata": {},
92        "output_type": "display_data",
93        "text": [
94         "<IPython.core.display.HTML at 0x102358c90>"
95        ]
96       }
97      ],
98      "prompt_number": 2
99     },
100     {
101      "cell_type": "markdown",
102      "metadata": {},
103      "source": [
104       "We can eliminate the redundant output\n",
105       "to make the result more readable\n",
106       "by adding the `distinct` keyword\n",
107       "to our query:"
108      ]
109     },
110     {
111      "cell_type": "code",
112      "collapsed": false,
113      "input": [
114       "%%sqlite survey.db\n",
115       "select distinct quant from Survey;"
116      ],
117      "language": "python",
118      "metadata": {},
119      "outputs": [
120       {
121        "html": [
122         "<table>\n",
123         "<tr><td>rad</td></tr>\n",
124         "<tr><td>sal</td></tr>\n",
125         "<tr><td>temp</td></tr>\n",
126         "</table>"
127        ],
128        "metadata": {},
129        "output_type": "display_data",
130        "text": [
131         "<IPython.core.display.HTML at 0x102358d90>"
132        ]
133       }
134      ],
135      "prompt_number": 3
136     },
137     {
138      "cell_type": "markdown",
139      "metadata": {},
140      "source": [
141       "If we select more than one column&mdash;for example,\n",
142       "both the visit ID and the quantity measured&mdash;then\n",
143       "the distinct pairs of values are returned:"
144      ]
145     },
146     {
147      "cell_type": "code",
148      "collapsed": false,
149      "input": [
150       "%%sqlite survey.db\n",
151       "select distinct taken, quant from Survey;"
152      ],
153      "language": "python",
154      "metadata": {},
155      "outputs": [
156       {
157        "html": [
158         "<table>\n",
159         "<tr><td>619</td><td>rad</td></tr>\n",
160         "<tr><td>619</td><td>sal</td></tr>\n",
161         "<tr><td>622</td><td>rad</td></tr>\n",
162         "<tr><td>622</td><td>sal</td></tr>\n",
163         "<tr><td>734</td><td>rad</td></tr>\n",
164         "<tr><td>734</td><td>sal</td></tr>\n",
165         "<tr><td>734</td><td>temp</td></tr>\n",
166         "<tr><td>735</td><td>rad</td></tr>\n",
167         "<tr><td>735</td><td>sal</td></tr>\n",
168         "<tr><td>735</td><td>temp</td></tr>\n",
169         "<tr><td>751</td><td>rad</td></tr>\n",
170         "<tr><td>751</td><td>temp</td></tr>\n",
171         "<tr><td>751</td><td>sal</td></tr>\n",
172         "<tr><td>752</td><td>rad</td></tr>\n",
173         "<tr><td>752</td><td>sal</td></tr>\n",
174         "<tr><td>752</td><td>temp</td></tr>\n",
175         "<tr><td>837</td><td>rad</td></tr>\n",
176         "<tr><td>837</td><td>sal</td></tr>\n",
177         "<tr><td>844</td><td>rad</td></tr>\n",
178         "</table>"
179        ],
180        "metadata": {},
181        "output_type": "display_data",
182        "text": [
183         "<IPython.core.display.HTML at 0x102353c90>"
184        ]
185       }
186      ],
187      "prompt_number": 4
188     },
189     {
190      "cell_type": "markdown",
191      "metadata": {},
192      "source": [
193       "Notice in both cases that duplicates are removed\n",
194       "even if they didn't appear to be adjacent in the database.\n",
195       "Again,\n",
196       "it's important to remember that rows aren't actually ordered:\n",
197       "they're just displayed in some order."
198      ]
199     },
200     {
201      "cell_type": "markdown",
202      "metadata": {},
203      "source": [
204       "#### Challenges\n",
205       "\n",
206       "1.  Write a query that selects distinct dates from the `Visited` table."
207      ]
208     },
209     {
210      "cell_type": "markdown",
211      "metadata": {},
212      "source": [
213       "As we mentioned earlier,\n",
214       "database records are not stored in any particular order.\n",
215       "This means that query results aren't necessarily sorted,\n",
216       "and even if they are,\n",
217       "we often want to sort them in a different way,\n",
218       "e.g., by the name of the project instead of by the name of the scientist.\n",
219       "We can do this in SQL by adding an `order by` clause to our query:"
220      ]
221     },
222     {
223      "cell_type": "code",
224      "collapsed": false,
225      "input": [
226       "%%sqlite survey.db\n",
227       "select * from Person order by ident;"
228      ],
229      "language": "python",
230      "metadata": {},
231      "outputs": [
232       {
233        "html": [
234         "<table>\n",
235         "<tr><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
236         "<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
237         "<tr><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
238         "<tr><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
239         "<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
240         "</table>"
241        ],
242        "metadata": {},
243        "output_type": "display_data",
244        "text": [
245         "<IPython.core.display.HTML at 0x102353b10>"
246        ]
247       }
248      ],
249      "prompt_number": 5
250     },
251     {
252      "cell_type": "markdown",
253      "metadata": {},
254      "source": [
255       "By default,\n",
256       "results are sorted in ascending order\n",
257       "(i.e.,\n",
258       "from least to greatest).\n",
259       "We can sort in the opposite order using `desc` (for \"descending\"):"
260      ]
261     },
262     {
263      "cell_type": "code",
264      "collapsed": false,
265      "input": [
266       "%%sqlite survey.db\n",
267       "select * from Person order by ident desc;"
268      ],
269      "language": "python",
270      "metadata": {},
271      "outputs": [
272       {
273        "html": [
274         "<table>\n",
275         "<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
276         "<tr><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
277         "<tr><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
278         "<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
279         "<tr><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
280         "</table>"
281        ],
282        "metadata": {},
283        "output_type": "display_data",
284        "text": [
285         "<IPython.core.display.HTML at 0x102353c50>"
286        ]
287       }
288      ],
289      "prompt_number": 6
290     },
291     {
292      "cell_type": "markdown",
293      "metadata": {},
294      "source": [
295       "(And if we want to make it clear that we're sorting in ascending order,\n",
296       "we can use `asc` instead of `desc`.)\n",
297       "\n",
298       "We can also sort on several fields at once.\n",
299       "For example,\n",
300       "this query sorts results first in ascending order by `taken`,\n",
301       "and then in descending order by `person`\n",
302       "within each group of equal `taken` values:"
303      ]
304     },
305     {
306      "cell_type": "code",
307      "collapsed": false,
308      "input": [
309       "%%sqlite survey.db\n",
310       "select taken, person from Survey order by taken asc, person desc;"
311      ],
312      "language": "python",
313      "metadata": {},
314      "outputs": [
315       {
316        "html": [
317         "<table>\n",
318         "<tr><td>619</td><td>dyer</td></tr>\n",
319         "<tr><td>619</td><td>dyer</td></tr>\n",
320         "<tr><td>622</td><td>dyer</td></tr>\n",
321         "<tr><td>622</td><td>dyer</td></tr>\n",
322         "<tr><td>734</td><td>pb</td></tr>\n",
323         "<tr><td>734</td><td>pb</td></tr>\n",
324         "<tr><td>734</td><td>lake</td></tr>\n",
325         "<tr><td>735</td><td>pb</td></tr>\n",
326         "<tr><td>735</td><td>None</td></tr>\n",
327         "<tr><td>735</td><td>None</td></tr>\n",
328         "<tr><td>751</td><td>pb</td></tr>\n",
329         "<tr><td>751</td><td>pb</td></tr>\n",
330         "<tr><td>751</td><td>lake</td></tr>\n",
331         "<tr><td>752</td><td>roe</td></tr>\n",
332         "<tr><td>752</td><td>lake</td></tr>\n",
333         "<tr><td>752</td><td>lake</td></tr>\n",
334         "<tr><td>752</td><td>lake</td></tr>\n",
335         "<tr><td>837</td><td>roe</td></tr>\n",
336         "<tr><td>837</td><td>lake</td></tr>\n",
337         "<tr><td>837</td><td>lake</td></tr>\n",
338         "<tr><td>844</td><td>roe</td></tr>\n",
339         "</table>"
340        ],
341        "metadata": {},
342        "output_type": "display_data",
343        "text": [
344         "<IPython.core.display.HTML at 0x1023557d0>"
345        ]
346       }
347      ],
348      "prompt_number": 7
349     },
350     {
351      "cell_type": "markdown",
352      "metadata": {},
353      "source": [
354       "This is easier to understand if we also remove duplicates:"
355      ]
356     },
357     {
358      "cell_type": "code",
359      "collapsed": false,
360      "input": [
361       "%%sqlite survey.db\n",
362       "select distinct taken, person from Survey order by taken asc, person desc;"
363      ],
364      "language": "python",
365      "metadata": {},
366      "outputs": [
367       {
368        "html": [
369         "<table>\n",
370         "<tr><td>619</td><td>dyer</td></tr>\n",
371         "<tr><td>622</td><td>dyer</td></tr>\n",
372         "<tr><td>734</td><td>pb</td></tr>\n",
373         "<tr><td>734</td><td>lake</td></tr>\n",
374         "<tr><td>735</td><td>pb</td></tr>\n",
375         "<tr><td>735</td><td>None</td></tr>\n",
376         "<tr><td>751</td><td>pb</td></tr>\n",
377         "<tr><td>751</td><td>lake</td></tr>\n",
378         "<tr><td>752</td><td>roe</td></tr>\n",
379         "<tr><td>752</td><td>lake</td></tr>\n",
380         "<tr><td>837</td><td>roe</td></tr>\n",
381         "<tr><td>837</td><td>lake</td></tr>\n",
382         "<tr><td>844</td><td>roe</td></tr>\n",
383         "</table>"
384        ],
385        "metadata": {},
386        "output_type": "display_data",
387        "text": [
388         "<IPython.core.display.HTML at 0x102353b10>"
389        ]
390       }
391      ],
392      "prompt_number": 8
393     },
394     {
395      "cell_type": "markdown",
396      "metadata": {},
397      "source": [
398       "#### Challenges\n",
399       "\n",
400       "1.  Write a query that returns the distinct dates in the `Visited` table.\n",
401       "\n",
402       "2.  Write a query that displays the full names of the scientists in the `Person` table, ordered by family name."
403      ]
404     },
405     {
406      "cell_type": "markdown",
407      "metadata": {
408       "cell_tags": [
409        "keypoints"
410       ]
411      },
412      "source": [
413       "#### Key Points\n",
414       "\n",
415       "*   The records in a database table are not intrinsically ordered:\n",
416       "    if we want to display them in some order,\n",
417       "    we must specify that explicitly.\n",
418       "*   The values in a database are not guaranteed to be unique:\n",
419       "    if we want to eliminate duplicates,\n",
420       "    we must specify that explicitly as well."
421      ]
422     }
423    ],
424    "metadata": {}
425   }
426  ]
427 }
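
The queries in this notebook run through the `%%sqlite` cell magic, but the same `distinct` and `order by` clauses can also be tried from plain Python. The following is a minimal sketch using the standard library's `sqlite3` module. It assumes an in-memory database and a small hand-typed sample of `(taken, person, quant)` rows rather than the real `survey.db`; the table layout and the variable names are illustrative only.

```python
import sqlite3

# Hand-typed sample rows (taken, person, quant). This is an assumption made
# for illustration, not the actual contents of survey.db.
rows = [
    (619, "dyer", "rad"),
    (619, "dyer", "sal"),
    (734, "pb", "rad"),
    (734, "lake", "sal"),
    (734, "pb", "temp"),
    (735, "pb", "rad"),
    (735, None, "sal"),
    (735, None, "temp"),
]

connection = sqlite3.connect(":memory:")
connection.execute("create table Survey(taken integer, person text, quant text);")
connection.executemany("insert into Survey values (?, ?, ?);", rows)

# distinct removes duplicates, but without order by the order of the result
# is not guaranteed, so we sort in Python before using it.
quantities = sorted(
    row[0] for row in connection.execute("select distinct quant from Survey;")
)

# distinct plus order by: unique (taken, person) pairs in a defined order,
# ascending by taken and descending by person within each taken value.
pairs = connection.execute(
    "select distinct taken, person from Survey order by taken asc, person desc;"
).fetchall()

print(quantities)
print(pairs)
connection.close()
```

In this sample, the missing `person` values come back as `None` and sort after the named people when the order is descending, which matches the `None` rows shown for visit 735 in the notebook's output.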