*.ipynb: Use absolute URLs to link to the glossary
[swc-sql.git] / 01-select.ipynb
1 {
2  "metadata": {
3   "name": ""
4  },
5  "nbformat": 3,
6  "nbformat_minor": 0,
7  "worksheets": [
8   {
9    "cells": [
10     {
11      "cell_type": "heading",
12      "level": 2,
13      "metadata": {},
14      "source": [
15       "Selecting Data"
16      ]
17     },
18     {
19      "cell_type": "markdown",
20      "metadata": {},
21      "source": [
22       "In the late 1920s and early 1930s,\n",
23       "William Dyer,\n",
24       "Frank Pabodie,\n",
25       "and Valentina Roerich led expeditions to the\n",
26       "[Pole of Inaccessibility](http://en.wikipedia.org/wiki/Pole_of_inaccessibility)\n",
27       "in the South Pacific,\n",
28       "and then onward to Antarctica.\n",
29       "Two years ago,\n",
30       "their expeditions were found in a storage locker at Miskatonic University.\n",
31       "We have scanned and OCR'd the data they contain,\n",
32       "and we now want to store that information\n",
33       "in a way that will make search and analysis easy.\n",
34       "\n",
35       "We basically have three options:\n",
36       "text files,\n",
37       "a spreadsheet,\n",
38       "or a database.\n",
39       "Text files are easiest to create,\n",
40       "and work well with version control,\n",
41       "but then we would then have to build search and analysis tools ourselves.\n",
42       "Spreadsheets are good for doing simple analysis,\n",
43       "they don't handle large or complex data sets very well.\n",
44       "We would therefore like to put this data in a database,\n",
45       "and these lessons will show how to do that."
46      ]
47     },
48     {
49      "cell_type": "markdown",
50      "metadata": {
51       "cell_tags": [
52        "objectives"
53       ]
54      },
55      "source": [
56       "#### Objectives\n",
57       "\n",
58       "*   Explain the difference between a table, a record, and a field.\n",
59       "*   Explain the difference between a database and a database manager.\n",
60       "*   Write a query to select all values for specific fields from a single table."
61      ]
62     },
63     {
64      "cell_type": "heading",
65      "level": 3,
66      "metadata": {},
67      "source": [
68       "A Few Definitions"
69      ]
70     },
71     {
72      "cell_type": "markdown",
73      "metadata": {},
74      "source": [
75       "A [relational database](https://github.com/swcarpentry/bc/blob/master/gloss.md#relational-database)\n",
76       "is a way to store and manipulate information\n",
77       "that is arranged as [tables](https://github.com/swcarpentry/bc/blob/master/gloss.md#table-database).\n",
78       "Each table has columns (also known as [fields](https://github.com/swcarpentry/bc/blob/master/gloss.md#field-database)) which describe the data,\n",
79       "and rows (also known as [records](https://github.com/swcarpentry/bc/blob/master/gloss.md#record-database)) which contain the data.\n",
80       "  \n",
81       "When we are using a spreadsheet,\n",
82       "we put formulas into cells to calculate new values based on old ones.\n",
83       "When we are using a database,\n",
84       "we send commands\n",
85       "(usually called [queries](https://github.com/swcarpentry/bc/blob/master/gloss.md#query))\n",
86       "to a [database manager](https://github.com/swcarpentry/bc/blob/master/gloss.md#database-manager):\n",
87       "a program that manipulates the database for us.\n",
88       "The database manager does whatever lookups and calculations the query specifies,\n",
89       "returning the results in a tabular form\n",
90       "that we can then use as a starting point for further queries.\n",
91       "  \n",
92       "> Every database manager—Oracle,\n",
93       "> IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores\n",
94       "> data in a different way,\n",
95       "> so a database created with one cannot be used directly by another.\n",
96       "> However,\n",
97       "> every database manager can import and export data in a variety of formats,\n",
98       "> so it *is* possible to move information from one to another.\n",
99       "\n",
100       "Queries are written in a language called [SQL](https://github.com/swcarpentry/bc/blob/master/gloss.md#sql),\n",
101       "which stands for \"Structured Query Language\".\n",
102       "SQL provides hundreds of different ways to analyze and recombine data;\n",
103       "we will only look at a handful,\n",
104       "but that handful accounts for most of what scientists do.\n",
105       "\n",
106       "The tables below show the database we will use in our examples:"
107      ]
108     },
109     {
110      "cell_type": "markdown",
111      "metadata": {},
112      "source": [
113       "<table>\n",
114       "<tr>\n",
115       "<td valign=\"top\">\n",
116       "**Person**: people who took readings.\n",
117       "\n",
118       "<table>\n",
119       "  <tr> <th>ident</th> <th>personal</th> <th>family</th> </tr>\n",
120       "  <tr> <td>dyer</td> <td>William</td> <td>Dyer</td> </tr>\n",
121       "  <tr> <td>pb</td> <td>Frank</td> <td>Pabodie</td> </tr>\n",
122       "  <tr> <td>lake</td> <td>Anderson</td> <td>Lake</td> </tr>\n",
123       "  <tr> <td>roe</td> <td>Valentina</td> <td>Roerich</td> </tr>\n",
124       "  <tr> <td>danforth</td> <td>Frank</td> <td>Danforth</td> </tr>\n",
125       "</table>\n",
126       "\n",
127       "**Site**: locations where readings were taken.\n",
128       "\n",
129       "<table>\n",
130       "  <tr> <th>name</th> <th>lat</th> <th>long</th> </tr>\n",
131       "  <tr> <td>DR-1</td> <td>-49.85</td> <td>-128.57</td> </tr>\n",
132       "  <tr> <td>DR-3</td> <td>-47.15</td> <td>-126.72</td> </tr>\n",
133       "  <tr> <td>MSK-4</td> <td>-48.87</td> <td>-123.4</td> </tr>\n",
134       "</table>\n",
135       "\n",
136       "**Visited**: when readings were taken at specific sites.\n",
137       "\n",
138       "<table>\n",
139       "  <tr> <th>ident</th> <th>site</th> <th>dated</th> </tr>\n",
140       "  <tr> <td>619</td> <td>DR-1</td> <td>1927-02-08</td> </tr>\n",
141       "  <tr> <td>622</td> <td>DR-1</td> <td>1927-02-10</td> </tr>\n",
142       "  <tr> <td>734</td> <td>DR-3</td> <td>1939-01-07</td> </tr>\n",
143       "  <tr> <td>735</td> <td>DR-3</td> <td>1930-01-12</td> </tr>\n",
144       "  <tr> <td>751</td> <td>DR-3</td> <td>1930-02-26</td> </tr>\n",
145       "  <tr> <td>752</td> <td>DR-3</td> <td bgcolor=\"red\">&nbsp;</td> </tr>\n",
146       "  <tr> <td>837</td> <td>MSK-4</td> <td>1932-01-14</td> </tr>\n",
147       "  <tr> <td>844</td> <td>DR-1</td> <td>1932-03-22</td> </tr>\n",
148       "</table>\n",
149       "</td>\n",
150       "<td valign=\"top\">\n",
151       "**Survey**: the actual readings.\n",
152       "\n",
153       "<table>\n",
154       "  <tr> <th>taken</th> <th>person</th> <th>quant</th> <th>reading</th> </tr>\n",
155       "  <tr> <td>619</td> <td>dyer</td> <td>rad</td> <td>9.82</td> </tr>\n",
156       "  <tr> <td>619</td> <td>dyer</td> <td>sal</td> <td>0.13</td> </tr>\n",
157       "  <tr> <td>622</td> <td>dyer</td> <td>rad</td> <td>7.8</td> </tr>\n",
158       "  <tr> <td>622</td> <td>dyer</td> <td>sal</td> <td>0.09</td> </tr>\n",
159       "  <tr> <td>734</td> <td>pb</td> <td>rad</td> <td>8.41</td> </tr>\n",
160       "  <tr> <td>734</td> <td>lake</td> <td>sal</td> <td>0.05</td> </tr>\n",
161       "  <tr> <td>734</td> <td>pb</td> <td>temp</td> <td>-21.5</td> </tr>\n",
162       "  <tr> <td>735</td> <td>pb</td> <td>rad</td> <td>7.22</td> </tr>\n",
163       "  <tr> <td>735</td> <td bgcolor=\"red\">&nbsp;</td> <td>sal</td> <td>0.06</td> </tr>\n",
164       "  <tr> <td>735</td> <td bgcolor=\"red\">&nbsp;</td> <td>temp</td> <td>-26.0</td> </tr>\n",
165       "  <tr> <td>751</td> <td>pb</td> <td>rad</td> <td>4.35</td> </tr>\n",
166       "  <tr> <td>751</td> <td>pb</td> <td>temp</td> <td>-18.5</td> </tr>\n",
167       "  <tr> <td>751</td> <td>lake</td> <td>sal</td> <td>0.1</td> </tr>\n",
168       "  <tr> <td>752</td> <td>lake</td> <td>rad</td> <td>2.19</td> </tr>\n",
169       "  <tr> <td>752</td> <td>lake</td> <td>sal</td> <td>0.09</td> </tr>\n",
170       "  <tr> <td>752</td> <td>lake</td> <td>temp</td> <td>-16.0</td> </tr>\n",
171       "  <tr> <td>752</td> <td>roe</td> <td>sal</td> <td>41.6</td> </tr>\n",
172       "  <tr> <td>837</td> <td>lake</td> <td>rad</td> <td>1.46</td> </tr>\n",
173       "  <tr> <td>837</td> <td>lake</td> <td>sal</td> <td>0.21</td> </tr>\n",
174       "  <tr> <td>837</td> <td>roe</td> <td>sal</td> <td>22.5</td> </tr>\n",
175       "  <tr> <td>844</td> <td>roe</td> <td>rad</td> <td>11.25</td> </tr>\n",
176       "</table>\n",
177       "</td>\n",
178       "</tr>\n",
179       "</table>"
180      ]
181     },
182     {
183      "cell_type": "markdown",
184      "metadata": {},
185      "source": [
186       "Notice that three entries&mdash;one in the `Visited` table,\n",
187       "and two in the `Survey` table&mdash;are shown in red\n",
188       "because they don't contain any actual data:\n",
189       "we'll return to these missing values [later](#s:null).\n",
190       "For now,\n",
191       "let's write an SQL query that displays scientists' names.\n",
192       "We do this using the SQL command `select`,\n",
193       "giving it the names of the columns we want and the table we want them from.\n",
194       "Our query and its output look like this:"
195      ]
196     },
197     {
198      "cell_type": "code",
199      "collapsed": false,
200      "input": [
201       "%load_ext sqlitemagic"
202      ],
203      "language": "python",
204      "metadata": {},
205      "outputs": [],
206      "prompt_number": 1
207     },
208     {
209      "cell_type": "code",
210      "collapsed": false,
211      "input": [
212       "%%sqlite survey.db\n",
213       "select family, personal from Person;"
214      ],
215      "language": "python",
216      "metadata": {},
217      "outputs": [
218       {
219        "html": [
220         "<table>\n",
221         "<tr><td>Dyer</td><td>William</td></tr>\n",
222         "<tr><td>Pabodie</td><td>Frank</td></tr>\n",
223         "<tr><td>Lake</td><td>Anderson</td></tr>\n",
224         "<tr><td>Roerich</td><td>Valentina</td></tr>\n",
225         "<tr><td>Danforth</td><td>Frank</td></tr>\n",
226         "</table>"
227        ],
228        "metadata": {},
229        "output_type": "display_data",
230        "text": [
231         "<IPython.core.display.HTML at 0x102353a10>"
232        ]
233       }
234      ],
235      "prompt_number": 2
236     },
237     {
238      "cell_type": "markdown",
239      "metadata": {},
240      "source": [
241       "The semi-colon at the end of the query\n",
242       "tells the database manager that the query is complete and ready to run.\n",
243       "We have written our commands and column names in lower case,\n",
244       "and the table name in Title Case,\n",
245       "but we don't have to:\n",
246       "as the example below shows,\n",
247       "SQL is [case insensitive](https://github.com/swcarpentry/bc/blob/master/gloss.md#case-insensitive)."
248      ]
249     },
250     {
251      "cell_type": "code",
252      "collapsed": false,
253      "input": [
254       "%%sqlite survey.db\n",
255       "SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;"
256      ],
257      "language": "python",
258      "metadata": {},
259      "outputs": [
260       {
261        "html": [
262         "<table>\n",
263         "<tr><td>Dyer</td><td>William</td></tr>\n",
264         "<tr><td>Pabodie</td><td>Frank</td></tr>\n",
265         "<tr><td>Lake</td><td>Anderson</td></tr>\n",
266         "<tr><td>Roerich</td><td>Valentina</td></tr>\n",
267         "<tr><td>Danforth</td><td>Frank</td></tr>\n",
268         "</table>"
269        ],
270        "metadata": {},
271        "output_type": "display_data",
272        "text": [
273         "<IPython.core.display.HTML at 0x102353c90>"
274        ]
275       }
276      ],
277      "prompt_number": 3
278     },
279     {
280      "cell_type": "markdown",
281      "metadata": {},
282      "source": [
283       "Whatever casing convention you choose,\n",
284       "please be consistent:\n",
285       "complex queries are hard enough to read without the extra cognitive load of random capitalization."
286      ]
287     },
288     {
289      "cell_type": "markdown",
290      "metadata": {},
291      "source": [
292       "Going back to our query,\n",
293       "it's important to understand that\n",
294       "the rows and columns in a database table aren't actually stored in any particular order.\n",
295       "They will always be *displayed* in some order,\n",
296       "but we can control that in various ways.\n",
297       "For example,\n",
298       "we could swap the columns in the output by writing our query as:"
299      ]
300     },
301     {
302      "cell_type": "code",
303      "collapsed": false,
304      "input": [
305       "%%sqlite survey.db\n",
306       "select personal, family from Person;"
307      ],
308      "language": "python",
309      "metadata": {},
310      "outputs": [
311       {
312        "html": [
313         "<table>\n",
314         "<tr><td>William</td><td>Dyer</td></tr>\n",
315         "<tr><td>Frank</td><td>Pabodie</td></tr>\n",
316         "<tr><td>Anderson</td><td>Lake</td></tr>\n",
317         "<tr><td>Valentina</td><td>Roerich</td></tr>\n",
318         "<tr><td>Frank</td><td>Danforth</td></tr>\n",
319         "</table>"
320        ],
321        "metadata": {},
322        "output_type": "display_data",
323        "text": [
324         "<IPython.core.display.HTML at 0x102353a10>"
325        ]
326       }
327      ],
328      "prompt_number": 4
329     },
330     {
331      "cell_type": "markdown",
332      "metadata": {},
333      "source": [
334       "or even repeat columns:"
335      ]
336     },
337     {
338      "cell_type": "code",
339      "collapsed": false,
340      "input": [
341       "%%sqlite survey.db\n",
342       "select ident, ident, ident from Person;"
343      ],
344      "language": "python",
345      "metadata": {},
346      "outputs": [
347       {
348        "html": [
349         "<table>\n",
350         "<tr><td>dyer</td><td>dyer</td><td>dyer</td></tr>\n",
351         "<tr><td>pb</td><td>pb</td><td>pb</td></tr>\n",
352         "<tr><td>lake</td><td>lake</td><td>lake</td></tr>\n",
353         "<tr><td>roe</td><td>roe</td><td>roe</td></tr>\n",
354         "<tr><td>danforth</td><td>danforth</td><td>danforth</td></tr>\n",
355         "</table>"
356        ],
357        "metadata": {},
358        "output_type": "display_data",
359        "text": [
360         "<IPython.core.display.HTML at 0x102353b10>"
361        ]
362       }
363      ],
364      "prompt_number": 5
365     },
366     {
367      "cell_type": "markdown",
368      "metadata": {},
369      "source": [
370       "As a shortcut,\n",
371       "we can select all of the columns in a table using `*`:"
372      ]
373     },
374     {
375      "cell_type": "code",
376      "collapsed": false,
377      "input": [
378       "%%sqlite survey.db\n",
379       "select * from Person;"
380      ],
381      "language": "python",
382      "metadata": {},
383      "outputs": [
384       {
385        "html": [
386         "<table>\n",
387         "<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
388         "<tr><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
389         "<tr><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
390         "<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
391         "<tr><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
392         "</table>"
393        ],
394        "metadata": {},
395        "output_type": "display_data",
396        "text": [
397         "<IPython.core.display.HTML at 0x102353bd0>"
398        ]
399       }
400      ],
401      "prompt_number": 6
402     },
403     {
404      "cell_type": "markdown",
405      "metadata": {},
406      "source": [
407       "#### Challenges\n",
408       "\n",
409       "1.  Write a query that selects only site names from the `Site` table.\n",
410       "\n",
411       "2.  Many people format queries as:\n",
412       "\n",
413       "    ~~~\n",
414       "    SELECT personal, family FROM person;\n",
415       "    ~~~\n",
416       "\n",
417       "    or as:\n",
418       "\n",
419       "    ~~~\n",
420       "    select Personal, Family from PERSON;\n",
421       "    ~~~\n",
422       "\n",
423       "    What style do you find easiest to read, and why?"
424      ]
425     },
426     {
427      "cell_type": "markdown",
428      "metadata": {
429       "cell_tags": [
430        "keypoints"
431       ]
432      },
433      "source": [
434       "#### Key Points\n",
435       "\n",
436       "*   A relational database stores information in tables,\n",
437       "    each of which has a fixed set of columns and a variable number of records.\n",
438       "*   A database manager is a program that manipulates information stored in a database.\n",
439       "*   We write queries in a specialized language called SQL to extract information from databases.\n",
440       "*   SQL is case-insensitive."
441      ]
442     }
443    ],
444    "metadata": {}
445   }
446  ]
447 }