"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4ed0>"
+ "<IPython.core.display.HTML at 0x1023c4f90>"
]
}
],
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4ed0>"
+ "<IPython.core.display.HTML at 0x1023c4110>"
]
}
],
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4090>"
+ "<IPython.core.display.HTML at 0x1023c46d0>"
]
}
],
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4090>"
+ "<IPython.core.display.HTML at 0x1023c4b50>"
]
}
],
- "prompt_number": 7
+ "prompt_number": 5
},
{
"cell_type": "code",
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4fd0>"
+ "<IPython.core.display.HTML at 0x1023c4f90>"
]
}
],
- "prompt_number": 8
+ "prompt_number": 6
},
{
"cell_type": "code",
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4d90>"
+ "<IPython.core.display.HTML at 0x1023c46d0>"
]
}
],
- "prompt_number": 9
+ "prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4fd0>"
+ "<IPython.core.display.HTML at 0x1023c4f90>"
]
}
],
- "prompt_number": 10
+ "prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4d90>"
+ "<IPython.core.display.HTML at 0x1023c46d0>"
]
}
],
- "prompt_number": 11
+ "prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4fd0>"
+ "<IPython.core.display.HTML at 0x1023c41d0>"
]
}
],
- "prompt_number": 12
+ "prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4fd0>"
+ "<IPython.core.display.HTML at 0x1023c41d0>"
]
}
],
- "prompt_number": 13
+ "prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"output_type": "display_data",
"text": [
- "<IPython.core.display.HTML at 0x1023c4090>"
+ "<IPython.core.display.HTML at 0x1023c46d0>"
+ ]
+ }
+ ],
+ "prompt_number": 12
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Aggregating all records at once doesn't always make sense.\n",
+ "For example,\n",
+ "suppose Gina suspects that there is a systematic bias in her data,\n",
+ "and that some scientists' radiation readings are higher than others.\n",
+ "We know that this doesn't work:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select person, count(reading), round(avg(reading), 2)\n",
+ "from Survey\n",
+ "where quant='rad';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>roe</td><td>8</td><td>6.56</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4110>"
+ ]
+ }
+ ],
+ "prompt_number": 13
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "because the database manager selects a single arbitrary scientist's name\n",
+ "rather than aggregating separately for each scientist.\n",
+ "Since there are only five scientists,\n",
+ "she could write five queries of the form:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select person, count(reading), round(avg(reading), 2)\n",
+ "from Survey\n",
+ "where quant='rad'\n",
+ "and person='dyer';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>dyer</td><td>2</td><td>8.81</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c46d0>"
]
}
],
"prompt_number": 14
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "but this would be tedious,\n",
+ "and if she ever had a data set with fifty or five hundred scientists,\n",
+ "the chances of her getting all of those queries right is small.\n",
+ "\n",
+ "What we need to do is\n",
+ "tell the database manager to aggregate the hours for each scientist separately\n",
+ "using a `group by` clause:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select person, count(reading), round(avg(reading), 2)\n",
+ "from Survey\n",
+ "where quant='rad'\n",
+ "group by person;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>dyer</td><td>2</td><td>8.81</td></tr>\n",
+ "<tr><td>lake</td><td>2</td><td>1.82</td></tr>\n",
+ "<tr><td>pb</td><td>3</td><td>6.66</td></tr>\n",
+ "<tr><td>roe</td><td>1</td><td>11.25</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4b50>"
+ ]
+ }
+ ],
+ "prompt_number": 15
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`group by` does exactly what its name implies:\n",
+ "groups all the records with the same value for the specified field together\n",
+ "so that aggregation can process each batch separately.\n",
+ "Since all the records in each batch have the same value for `person`,\n",
+ "it no longer matters that the database manager\n",
+ "is picking an arbitrary one to display\n",
+ "alongside the aggregated `reading` values."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "FIXME: diagram"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Just as we can sort by multiple criteria at once,\n",
+ "we can also group by multiple criteria.\n",
+ "To get the average reading by scientist and quantity measured,\n",
+ "for example,\n",
+ "we just add another field to the `group by` clause:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select person, quant, count(reading), round(avg(reading), 2)\n",
+ "from Survey\n",
+ "group by person, quant;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>None</td><td>sal</td><td>1</td><td>0.06</td></tr>\n",
+ "<tr><td>None</td><td>temp</td><td>1</td><td>-26.0</td></tr>\n",
+ "<tr><td>dyer</td><td>rad</td><td>2</td><td>8.81</td></tr>\n",
+ "<tr><td>dyer</td><td>sal</td><td>2</td><td>0.11</td></tr>\n",
+ "<tr><td>lake</td><td>rad</td><td>2</td><td>1.82</td></tr>\n",
+ "<tr><td>lake</td><td>sal</td><td>4</td><td>0.11</td></tr>\n",
+ "<tr><td>lake</td><td>temp</td><td>1</td><td>-16.0</td></tr>\n",
+ "<tr><td>pb</td><td>rad</td><td>3</td><td>6.66</td></tr>\n",
+ "<tr><td>pb</td><td>temp</td><td>2</td><td>-20.0</td></tr>\n",
+ "<tr><td>roe</td><td>rad</td><td>1</td><td>11.25</td></tr>\n",
+ "<tr><td>roe</td><td>sal</td><td>2</td><td>32.05</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4110>"
+ ]
+ }
+ ],
+ "prompt_number": 16
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that we have added `person` to the list of fields displayed,\n",
+ "since the results wouldn't make much sense otherwise.\n",
+ "\n",
+ "Let's go one step further and remove all the entries\n",
+ "where we don't know who took the measurement:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select person, quant, count(reading), round(avg(reading), 2)\n",
+ "from Survey\n",
+ "where person is not null\n",
+ "group by person, quant\n",
+ "order by person, quant;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>dyer</td><td>rad</td><td>2</td><td>8.81</td></tr>\n",
+ "<tr><td>dyer</td><td>sal</td><td>2</td><td>0.11</td></tr>\n",
+ "<tr><td>lake</td><td>rad</td><td>2</td><td>1.82</td></tr>\n",
+ "<tr><td>lake</td><td>sal</td><td>4</td><td>0.11</td></tr>\n",
+ "<tr><td>lake</td><td>temp</td><td>1</td><td>-16.0</td></tr>\n",
+ "<tr><td>pb</td><td>rad</td><td>3</td><td>6.66</td></tr>\n",
+ "<tr><td>pb</td><td>temp</td><td>2</td><td>-20.0</td></tr>\n",
+ "<tr><td>roe</td><td>rad</td><td>1</td><td>11.25</td></tr>\n",
+ "<tr><td>roe</td><td>sal</td><td>2</td><td>32.05</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c46d0>"
+ ]
+ }
+ ],
+ "prompt_number": 17
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking more closely,\n",
+ "this query:\n",
+ "\n",
+ "1. selected records from the `Survey` table\n",
+ " where the `person` field was not null;\n",
+ "\n",
+ "2. grouped those records into subsets\n",
+ " so that the `person` and `quant` values in each subset\n",
+ " were the same;\n",
+ "\n",
+ "3. ordered those subsets first by `person`,\n",
+ " and then within each sub-group by `quant`;\n",
+ " and\n",
+ "\n",
+ "4. counted the number of records in each subset,\n",
+ " calculated the average `reading` in each,\n",
+ " and chose a `person` and `quant` value from each\n",
+ " (it doesn't matter which ones,\n",
+ " since they're all equal)."
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
"\n",
" Can you find a way to order the list by surname?"
]
- },
- {
- "cell_type": "heading",
- "level": 2,
- "metadata": {},
- "source": [
- "Next Steps"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "FIXME"
- ]
}
],
"metadata": {}
],
"prompt_number": 6
},
+ {
+ "cell_type": "heading",
+ "level": 2,
+ "metadata": {},
+ "source": [
+ "Data Hygiene"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Challenges\n",
- "\n",
- "1. Write a query that lists all radiation readings from the DR-1 site.\n",
+ "Now that we have seen how joins work,\n",
+ "we can see why the relational model is so useful\n",
+ "and how best to use it.\n",
+ "The first rule is that every value should be [atomic](../gloss.html#atomic-value),\n",
+ "i.e.,\n",
+ "not contain parts that we might want to work with separately.\n",
+ "We store personal and family names in separate columns instead of putting the entire name in one column\n",
+ "so that we don't have to use substring operations to get the name's components.\n",
+ "More importantly,\n",
+ "we store the two parts of the name separately because splitting on spaces is unreliable:\n",
+ "just think of a name like \"Eloise St. Cyr\" or \"Jan Mikkel Steubart\".\n",
"\n",
- "2. Write a query that lists all sites visited by people named \"Frank\".\n",
+ "The second rule is that every record should have a unique primary key.\n",
+ "This can be a serial number that has no intrinsic meaning,\n",
+ "one of the values in the record (like the `ident` field in the `Person` table),\n",
+ "or even a combination of values:\n",
+ "the triple `(taken, person, quant)` from the `Survey` table uniquely identifies every measurement.\n",
"\n",
- "3. Describe in your own words what the following query produces:\n",
+ "The third rule is that there should be no redundant information.\n",
+ "For example,\n",
+ "we could get rid of the `Site` table and rewrite the `Visited` table like this:\n",
"\n",
- " ```\n",
- " select Site.name from Site join Visited\n",
- " on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n",
- " ```\n",
+ "<table>\n",
+ " <tr> <td>619</td> <td>-49.85</td> <td>-128.57</td> <td>1927-02-08</td> </tr>\n",
+ " <tr> <td>622</td> <td>-49.85</td> <td>-128.57</td> <td>1927-02-10</td> </tr>\n",
+ " <tr> <td>734</td> <td>-47.15</td> <td>-126.72</td> <td>1939-01-07</td> </tr>\n",
+ " <tr> <td>735</td> <td>-47.15</td> <td>-126.72</td> <td>1930-01-12</td> </tr>\n",
+ " <tr> <td>751</td> <td>-47.15</td> <td>-126.72</td> <td>1930-02-26</td> </tr>\n",
+ " <tr> <td>752</td> <td>-47.15</td> <td>-126.72</td> <td>null</td> </tr>\n",
+ " <tr> <td>837</td> <td>-48.87</td> <td>-123.40</td> <td>1932-01-14</td> </tr>\n",
+ " <tr> <td>844</td> <td>-49.85</td> <td>-128.57</td> <td>1932-03-22</td> </tr>\n",
+ "</table>\n",
"\n",
- "4. Why does the `Person` table have an `ident` field?\n",
- " Why do we not just use scientists' names in the `Survey` table?\n",
+ "In fact,\n",
+ "we could use a single table that recorded all the information about each reading in each row,\n",
+ "just as a spreadsheet would.\n",
+ "The problem is that it's very hard to keep data organized this way consistent:\n",
+ "if we realize that the date of a particular visit to a particular site is wrong,\n",
+ "we have to change multiple records in the database.\n",
+ "What's worse,\n",
+ "we may have to guess which records to change,\n",
+ "since some other sites may actually have been visited on that date.\n",
"\n",
- "5. Why does the table `Site` exist?\n",
- " Why don't we just record latitudes and longitudes\n",
- " directly in the `Visited` and `Survey` tables?"
- ]
- },
- {
- "cell_type": "heading",
- "level": 2,
- "metadata": {},
- "source": [
- "Next Steps"
+ "The fourth rule is that the units for every value should be stored explicitly.\n",
+ "Our database doesn't do this,\n",
+ "and that's a problem:\n",
+ "Roerich's salinity measurements are several orders of magnitude larger than anyone else's,\n",
+ "but we don't know if that means she was using parts per million instead of parts per thousand,\n",
+ "or whether there actually was a saline anomaly at that site in 1932."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "FIXME"
+ "### Challenges\n",
+ "\n",
+ "1. Write a query that lists all radiation readings from the DR-1 site.\n",
+ "\n",
+ "2. Write a query that lists all sites visited by people named \"Frank\".\n",
+ "\n",
+ "3. Describe in your own words what the following query produces:\n",
+ "\n",
+ " ```\n",
+ " select Site.name from Site join Visited\n",
+ " on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n",
+ " ```"
]
}
],