More database lessons for novices. To do:
[swc-sql.git] / sql / novice / 07-join.ipynb
1 {
2  "metadata": {
3   "name": ""
4  },
5  "nbformat": 3,
6  "nbformat_minor": 0,
7  "worksheets": [
8   {
9    "cells": [
10     {
11      "cell_type": "heading",
12      "level": 1,
13      "metadata": {},
14      "source": [
15       "Combining Data"
16      ]
17     },
18     {
19      "cell_type": "markdown",
20      "metadata": {},
21      "source": [
22       "In order to submit her data to a web site\n",
23       "that aggregates historical meteorological data,\n",
24       "Gina needs to format it as\n",
25       "latitude, longitude, date, quantity, and reading.\n",
26       "However,\n",
27       "her latitudes and longitudes are in the `Site` table,\n",
28       "while the dates of measurements are in the `Visited` table\n",
29       "and the readings themselves are in the `Survey` table.\n",
30       "She needs to combine these tables somehow.\n",
31       "\n",
32       "The SQL command to do this is `join`.\n",
33       "To see how it works,\n",
34       "let's start by joining the `Site` and `Visited` tables:"
35      ]
36     },
37     {
38      "cell_type": "code",
39      "collapsed": false,
40      "input": [
41       "%load_ext sqlitemagic"
42      ],
43      "language": "python",
44      "metadata": {},
45      "outputs": [],
46      "prompt_number": 1
47     },
48     {
49      "cell_type": "code",
50      "collapsed": false,
51      "input": [
52       "%%sqlite survey.db\n",
53       "select * from Site join Visited;"
54      ],
55      "language": "python",
56      "metadata": {},
57      "outputs": [
58       {
59        "html": [
60         "<table>\n",
61         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
62         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
63         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
64         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
65         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
66         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
67         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
68         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
69         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
70         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
71         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
72         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
73         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
74         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
75         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
76         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
77         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
78         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
79         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
80         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
81         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
82         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
83         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
84         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
85         "</table>"
86        ],
87        "metadata": {},
88        "output_type": "display_data",
89        "text": [
90         "<IPython.core.display.HTML at 0x1023c4550>"
91        ]
92       }
93      ],
94      "prompt_number": 2
95     },
96     {
97      "cell_type": "markdown",
98      "metadata": {},
99      "source": [
100       "`join` creates\n",
101       "the [cross product](../gloss.html#cross-product)\n",
102       "of two tables,\n",
103       "i.e.,\n",
104       "it joins each record of one with each record of the other\n",
105       "to give all possible combinations.\n",
106       "Since there are three records in `Site`\n",
107       "and eight in `Visited`,\n",
108       "the join's output has 24 records.\n",
109       "And since each table has three fields,\n",
110       "the output has six fields.\n",
111       "  \n",
112       "What the join *hasn't* done is\n",
113       "figure out if the records being joined have anything to do with each other.\n",
114       "It has no way of knowing whether they do or not until we tell it how.\n",
115       "To do that,\n",
116       "we add a clause specifying that\n",
117       "we're only interested in combinations that have the same site name:"
118      ]
119     },
120     {
121      "cell_type": "code",
122      "collapsed": false,
123      "input": [
124       "%%sqlite survey.db\n",
125       "select * from Site join Visited on Site.name=Visited.site;"
126      ],
127      "language": "python",
128      "metadata": {},
129      "outputs": [
130       {
131        "html": [
132         "<table>\n",
133         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
134         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
135         "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
136         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
137         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
138         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
139         "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
140         "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
141         "</table>"
142        ],
143        "metadata": {},
144        "output_type": "display_data",
145        "text": [
146         "<IPython.core.display.HTML at 0x1023c4090>"
147        ]
148       }
149      ],
150      "prompt_number": 3
151     },
152     {
153      "cell_type": "markdown",
154      "metadata": {},
155      "source": [
156       "`on` does the same job as `where`:\n",
157       "it only keeps records that pass some test.\n",
158       "(The difference between the two is that `on` filters records\n",
159       "as they're being created,\n",
160       "while `where` waits until the join is done\n",
161       "and then does the filtering.)\n",
162       "Once we add this to our query,\n",
163       "the database manager throws away records\n",
164       "that combined information about two different sites,\n",
165       "leaving us with just the ones we want.\n",
166       "  \n",
167       "Notice that we used `table.field` to specify field names\n",
168       "in the output of the join.\n",
169       "We do this because tables can have fields with the same name,\n",
170       "and we need to be specific which ones we're talking about.\n",
171       "For example,\n",
172       "if we joined the `person` and `visited` tables,\n",
173       "the result would inherit a field called `ident`\n",
174       "from each of the original tables.\n",
175       "\n",
176       "We can now use the same dotted notation\n",
177       "to select the three columns we actually want\n",
178       "out of our join:"
179      ]
180     },
181     {
182      "cell_type": "code",
183      "collapsed": false,
184      "input": [
185       "%%sqlite survey.db\n",
186       "select Site.lat, Site.long, Visited.dated\n",
187       "from   Site join Visited\n",
188       "on     Site.name=Visited.site;"
189      ],
190      "language": "python",
191      "metadata": {},
192      "outputs": [
193       {
194        "html": [
195         "<table>\n",
196         "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-08</td></tr>\n",
197         "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-10</td></tr>\n",
198         "<tr><td>-49.85</td><td>-128.57</td><td>1932-03-22</td></tr>\n",
199         "<tr><td>-47.15</td><td>-126.72</td><td>None</td></tr>\n",
200         "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td></tr>\n",
201         "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td></tr>\n",
202         "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td></tr>\n",
203         "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td></tr>\n",
204         "</table>"
205        ],
206        "metadata": {},
207        "output_type": "display_data",
208        "text": [
209         "<IPython.core.display.HTML at 0x1023c3890>"
210        ]
211       }
212      ],
213      "prompt_number": 4
214     },
215     {
216      "cell_type": "markdown",
217      "metadata": {},
218      "source": [
219       "If joining two tables is good,\n",
220       "joining many tables must be better.\n",
221       "In fact,\n",
222       "we can join any number of tables\n",
223       "simply by adding more `join` clauses to our query,\n",
224       "and more `on` tests to filter out combinations of records\n",
225       "that don't make sense:"
226      ]
227     },
228     {
229      "cell_type": "code",
230      "collapsed": false,
231      "input": [
232       "%%sqlite survey.db\n",
233       "select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading\n",
234       "from   Site join Visited join Survey\n",
235       "on     Site.name=Visited.site\n",
236       "and    Visited.ident=Survey.taken\n",
237       "and    Visited.dated is not null;"
238      ],
239      "language": "python",
240      "metadata": {},
241      "outputs": [
242       {
243        "html": [
244         "<table>\n",
245         "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-08</td><td>rad</td><td>9.82</td></tr>\n",
246         "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-08</td><td>sal</td><td>0.13</td></tr>\n",
247         "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-10</td><td>rad</td><td>7.8</td></tr>\n",
248         "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-10</td><td>sal</td><td>0.09</td></tr>\n",
249         "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td><td>rad</td><td>8.41</td></tr>\n",
250         "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td><td>sal</td><td>0.05</td></tr>\n",
251         "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td><td>temp</td><td>-21.5</td></tr>\n",
252         "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td><td>rad</td><td>7.22</td></tr>\n",
253         "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td><td>sal</td><td>0.06</td></tr>\n",
254         "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td><td>temp</td><td>-26.0</td></tr>\n",
255         "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td><td>rad</td><td>4.35</td></tr>\n",
256         "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td><td>sal</td><td>0.1</td></tr>\n",
257         "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td><td>temp</td><td>-18.5</td></tr>\n",
258         "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td><td>rad</td><td>1.46</td></tr>\n",
259         "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td><td>sal</td><td>0.21</td></tr>\n",
260         "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td><td>sal</td><td>22.5</td></tr>\n",
261         "<tr><td>-49.85</td><td>-128.57</td><td>1932-03-22</td><td>rad</td><td>11.25</td></tr>\n",
262         "</table>"
263        ],
264        "metadata": {},
265        "output_type": "display_data",
266        "text": [
267         "<IPython.core.display.HTML at 0x1023c4550>"
268        ]
269       }
270      ],
271      "prompt_number": 5
272     },
273     {
274      "cell_type": "markdown",
275      "metadata": {},
276      "source": [
277       "We can tell which records from `Site`, `Visited`, and `Survey`\n",
278       "correspond with each other\n",
279       "because those tables contain\n",
280       "[primary keys](../gloss.html#primary-key)\n",
281       "and [foreign keys](../gloss.html#foreign-key).\n",
282       "A primary key is a value,\n",
283       "or combination of values,\n",
284       "that uniquely identifies each record in a table.\n",
285       "A foreign key is a value (or combination of values) from one table\n",
286       "that identifies a unique record in another table.\n",
287       "Another way of saying this is that\n",
288       "a foreign key is the primary key of one table\n",
289       "that appears in some other table.\n",
290       "In our database,\n",
291       "`Person.ident` is the primary key in the `Person` table,\n",
292       "while `Survey.person` is a foreign key\n",
293       "relating the `Survey` table's entries\n",
294       "to entries in `Person`.\n",
295       "\n",
296       "Most database designers believe that\n",
297       "every table should have a well-defined primary key.\n",
298       "They also believe that this key should be separate from the data itself,\n",
299       "so that if we ever need to change the data,\n",
300       "we only need to make one change in one place.\n",
301       "One easy way to do this is\n",
302       "to create an arbitrary, unique ID for each record\n",
303       "as we add it to the database.\n",
304       "This is actually very common:\n",
305       "those IDs have names like \"student numbers\" and \"patient numbers\",\n",
306       "and they almost always turn out to have originally been\n",
307       "a unique record identifier in some database system or other.\n",
308       "As the query below demonstrates,\n",
309       "SQLite automatically numbers records as they're added to tables,\n",
310       "and we can use those record numbers in queries:"
311      ]
312     },
313     {
314      "cell_type": "code",
315      "collapsed": false,
316      "input": [
317       "%%sqlite survey.db\n",
318       "select rowid, * from Person;"
319      ],
320      "language": "python",
321      "metadata": {},
322      "outputs": [
323       {
324        "html": [
325         "<table>\n",
326         "<tr><td>1</td><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
327         "<tr><td>2</td><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
328         "<tr><td>3</td><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
329         "<tr><td>4</td><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
330         "<tr><td>5</td><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
331         "</table>"
332        ],
333        "metadata": {},
334        "output_type": "display_data",
335        "text": [
336         "<IPython.core.display.HTML at 0x1023c4550>"
337        ]
338       }
339      ],
340      "prompt_number": 6
341     },
342     {
343      "cell_type": "markdown",
344      "metadata": {},
345      "source": [
346       "### Challenges\n",
347       "\n",
348       "1.  Write a query that lists all radiation readings from the DR-1 site.\n",
349       "\n",
350       "2.  Write a query that lists all sites visited by people named \"Frank\".\n",
351       "\n",
352       "3.  Describe in your own words what the following query produces:\n",
353       "\n",
354       "    ```\n",
355       "    select Site.name from Site join Visited\n",
356       "    on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n",
357       "    ```\n",
358       "\n",
359       "4.  Why does the `Person` table have an `ident` field?\n",
360       "    Why do we not just use scientists' names in the `Survey` table?\n",
361       "\n",
362       "5.  Why does the table `Site` exist?\n",
363       "    Why don't we just record latitudes and longitudes\n",
364       "    directly in the `Visited` and `Survey` tables?"
365      ]
366     },
367     {
368      "cell_type": "heading",
369      "level": 2,
370      "metadata": {},
371      "source": [
372       "Next Steps"
373      ]
374     },
375     {
376      "cell_type": "markdown",
377      "metadata": {},
378      "source": [
379       "FIXME"
380      ]
381     }
382    ],
383    "metadata": {}
384   }
385  ]
386 }