From ccc47733a30a7f1bef1641d03c2e1dd0cc719a22 Mon Sep 17 00:00:00 2001 From: Greg Wilson Date: Sun, 10 Nov 2013 08:18:08 -0500 Subject: [PATCH] 1. Filled in 'group by'. 2. Put data hygiene in discussion of joins. 3. Got rid of 'next steps' sections. --- sql/novice/01-select.ipynb | 23 -- sql/novice/02-sort-dup.ipynb | 15 - sql/novice/03-filter.ipynb | 15 - sql/novice/04-calc.ipynb | 15 - sql/novice/05-null.ipynb | 15 - sql/novice/06-agg.ipynb | 312 +++++++++++++++++--- sql/novice/07-join.ipynb | 87 ++++-- sql/novice/09-hygiene.ipynb | 29 -- sql/novice/{10-prog.ipynb => 09-prog.ipynb} | 0 9 files changed, 343 insertions(+), 168 deletions(-) delete mode 100644 sql/novice/09-hygiene.ipynb rename sql/novice/{10-prog.ipynb => 09-prog.ipynb} (100%) diff --git a/sql/novice/01-select.ipynb b/sql/novice/01-select.ipynb index 0215577..3de1227 100644 --- a/sql/novice/01-select.ipynb +++ b/sql/novice/01-select.ipynb @@ -406,29 +406,6 @@ "\n", " What style do you find easiest to read, and why?" ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "Next Steps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FIXME" - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [], - "language": "python", - "metadata": {}, - "outputs": [] } ], "metadata": {} diff --git a/sql/novice/02-sort-dup.ipynb b/sql/novice/02-sort-dup.ipynb index e583799..44d1335 100644 --- a/sql/novice/02-sort-dup.ipynb +++ b/sql/novice/02-sort-dup.ipynb @@ -385,21 +385,6 @@ "\n", "FIXME" ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "Next Steps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FIXME" - ] } ], "metadata": {} diff --git a/sql/novice/03-filter.ipynb b/sql/novice/03-filter.ipynb index 08e0baa..c620cf3 100644 --- a/sql/novice/03-filter.ipynb +++ b/sql/novice/03-filter.ipynb @@ -453,21 +453,6 @@ " write a query that finds all the records in `Visited`\n", " that *aren't* from sites labelled 'DR-something'." ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "Next Steps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FIXME" - ] } ], "metadata": {} diff --git a/sql/novice/04-calc.ipynb b/sql/novice/04-calc.ipynb index 2ecf97d..79c64cb 100644 --- a/sql/novice/04-calc.ipynb +++ b/sql/novice/04-calc.ipynb @@ -290,21 +290,6 @@ "(For this data,\n", "the list should contain only \"DR\" and \"MSK\")." ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "Next Steps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FIXME" - ] } ], "metadata": {} diff --git a/sql/novice/05-null.ipynb b/sql/novice/05-null.ipynb index 921d10b..331424b 100644 --- a/sql/novice/05-null.ipynb +++ b/sql/novice/05-null.ipynb @@ -417,21 +417,6 @@ " What does this simplify?\n", " What burdens or risks does it introduce?" ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "Next Steps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FIXME" - ] } ], "metadata": {} diff --git a/sql/novice/06-agg.ipynb b/sql/novice/06-agg.ipynb index 871bbcc..551f9f1 100644 --- a/sql/novice/06-agg.ipynb +++ b/sql/novice/06-agg.ipynb @@ -60,7 +60,7 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], @@ -96,7 +96,7 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], @@ -121,7 +121,7 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], @@ -157,11 +157,11 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], - "prompt_number": 7 + "prompt_number": 5 }, { "cell_type": "code", @@ -182,11 +182,11 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], - "prompt_number": 8 + "prompt_number": 6 }, { "cell_type": "code", @@ -207,11 +207,11 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], - "prompt_number": 9 + "prompt_number": 7 }, { "cell_type": "markdown", @@ -249,11 +249,11 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], - "prompt_number": 10 + "prompt_number": 8 }, { "cell_type": "markdown", @@ -282,11 +282,11 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], - "prompt_number": 11 + "prompt_number": 9 }, { "cell_type": "markdown", @@ -324,11 +324,11 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], - "prompt_number": 12 + "prompt_number": 10 }, { "cell_type": "markdown", @@ -369,11 +369,11 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" ] } ], - "prompt_number": 13 + "prompt_number": 11 }, { "cell_type": "markdown", @@ -401,12 +401,273 @@ "metadata": {}, "output_type": "display_data", "text": [ - "" + "" + ] + } + ], + "prompt_number": 12 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Aggregating all records at once doesn't always make sense.\n", + "For example,\n", + "suppose Gina suspects that there is a systematic bias in her data,\n", + "and that some scientists' radiation readings are higher than others.\n", + "We know that this doesn't work:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select person, count(reading), round(avg(reading), 2)\n", + "from Survey\n", + "where quant='rad';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
roe86.56
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 13 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "because the database manager selects a single arbitrary scientist's name\n", + "rather than aggregating separately for each scientist.\n", + "Since there are only five scientists,\n", + "she could write five queries of the form:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select person, count(reading), round(avg(reading), 2)\n", + "from Survey\n", + "where quant='rad'\n", + "and person='dyer';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
dyer28.81
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" ] } ], "prompt_number": 14 }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "but this would be tedious,\n", + "and if she ever had a data set with fifty or five hundred scientists,\n", + "the chances of her getting all of those queries right is small.\n", + "\n", + "What we need to do is\n", + "tell the database manager to aggregate the hours for each scientist separately\n", + "using a `group by` clause:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select person, count(reading), round(avg(reading), 2)\n", + "from Survey\n", + "where quant='rad'\n", + "group by person;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
dyer28.81
lake21.82
pb36.66
roe111.25
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 15 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`group by` does exactly what its name implies:\n", + "groups all the records with the same value for the specified field together\n", + "so that aggregation can process each batch separately.\n", + "Since all the records in each batch have the same value for `person`,\n", + "it no longer matters that the database manager\n", + "is picking an arbitrary one to display\n", + "alongside the aggregated `reading` values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME: diagram" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just as we can sort by multiple criteria at once,\n", + "we can also group by multiple criteria.\n", + "To get the average reading by scientist and quantity measured,\n", + "for example,\n", + "we just add another field to the `group by` clause:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select person, quant, count(reading), round(avg(reading), 2)\n", + "from Survey\n", + "group by person, quant;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
Nonesal10.06
Nonetemp1-26.0
dyerrad28.81
dyersal20.11
lakerad21.82
lakesal40.11
laketemp1-16.0
pbrad36.66
pbtemp2-20.0
roerad111.25
roesal232.05
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 16 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that we have added `person` to the list of fields displayed,\n", + "since the results wouldn't make much sense otherwise.\n", + "\n", + "Let's go one step further and remove all the entries\n", + "where we don't know who took the measurement:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select person, quant, count(reading), round(avg(reading), 2)\n", + "from Survey\n", + "where person is not null\n", + "group by person, quant\n", + "order by person, quant;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
dyerrad28.81
dyersal20.11
lakerad21.82
lakesal40.11
laketemp1-16.0
pbrad36.66
pbtemp2-20.0
roerad111.25
roesal232.05
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 17 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Looking more closely,\n", + "this query:\n", + "\n", + "1. selected records from the `Survey` table\n", + " where the `person` field was not null;\n", + "\n", + "2. grouped those records into subsets\n", + " so that the `person` and `quant` values in each subset\n", + " were the same;\n", + "\n", + "3. ordered those subsets first by `person`,\n", + " and then within each sub-group by `quant`;\n", + " and\n", + "\n", + "4. counted the number of records in each subset,\n", + " calculated the average `reading` in each,\n", + " and chose a `person` and `quant` value from each\n", + " (it doesn't matter which ones,\n", + " since they're all equal)." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -445,21 +706,6 @@ "\n", " Can you find a way to order the list by surname?" ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "Next Steps" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FIXME" - ] } ], "metadata": {} diff --git a/sql/novice/07-join.ipynb b/sql/novice/07-join.ipynb index 7510ba3..4c6cac4 100644 --- a/sql/novice/07-join.ipynb +++ b/sql/novice/07-join.ipynb @@ -339,44 +339,85 @@ ], "prompt_number": 6 }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Data Hygiene" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Challenges\n", - "\n", - "1. Write a query that lists all radiation readings from the DR-1 site.\n", + "Now that we have seen how joins work,\n", + "we can see why the relational model is so useful\n", + "and how best to use it.\n", + "The first rule is that every value should be [atomic](../gloss.html#atomic-value),\n", + "i.e.,\n", + "not contain parts that we might want to work with separately.\n", + "We store personal and family names in separate columns instead of putting the entire name in one column\n", + "so that we don't have to use substring operations to get the name's components.\n", + "More importantly,\n", + "we store the two parts of the name separately because splitting on spaces is unreliable:\n", + "just think of a name like \"Eloise St. Cyr\" or \"Jan Mikkel Steubart\".\n", "\n", - "2. Write a query that lists all sites visited by people named \"Frank\".\n", + "The second rule is that every record should have a unique primary key.\n", + "This can be a serial number that has no intrinsic meaning,\n", + "one of the values in the record (like the `ident` field in the `Person` table),\n", + "or even a combination of values:\n", + "the triple `(taken, person, quant)` from the `Survey` table uniquely identifies every measurement.\n", "\n", - "3. Describe in your own words what the following query produces:\n", + "The third rule is that there should be no redundant information.\n", + "For example,\n", + "we could get rid of the `Site` table and rewrite the `Visited` table like this:\n", "\n", - " ```\n", - " select Site.name from Site join Visited\n", - " on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n", - " ```\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
619 -49.85 -128.57 1927-02-08
622 -49.85 -128.57 1927-02-10
734 -47.15 -126.72 1939-01-07
735 -47.15 -126.72 1930-01-12
751 -47.15 -126.72 1930-02-26
752 -47.15 -126.72 null
837 -48.87 -123.40 1932-01-14
844 -49.85 -128.57 1932-03-22
\n", "\n", - "4. Why does the `Person` table have an `ident` field?\n", - " Why do we not just use scientists' names in the `Survey` table?\n", + "In fact,\n", + "we could use a single table that recorded all the information about each reading in each row,\n", + "just as a spreadsheet would.\n", + "The problem is that it's very hard to keep data organized this way consistent:\n", + "if we realize that the date of a particular visit to a particular site is wrong,\n", + "we have to change multiple records in the database.\n", + "What's worse,\n", + "we may have to guess which records to change,\n", + "since some other sites may actually have been visited on that date.\n", "\n", - "5. Why does the table `Site` exist?\n", - " Why don't we just record latitudes and longitudes\n", - " directly in the `Visited` and `Survey` tables?" - ] - }, - { - "cell_type": "heading", - "level": 2, - "metadata": {}, - "source": [ - "Next Steps" + "The fourth rule is that the units for every value should be stored explicitly.\n", + "Our database doesn't do this,\n", + "and that's a problem:\n", + "Roerich's salinity measurements are several orders of magnitude larger than anyone else's,\n", + "but we don't know if that means she was using parts per million instead of parts per thousand,\n", + "or whether there actually was a saline anomaly at that site in 1932." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "FIXME" + "### Challenges\n", + "\n", + "1. Write a query that lists all radiation readings from the DR-1 site.\n", + "\n", + "2. Write a query that lists all sites visited by people named \"Frank\".\n", + "\n", + "3. Describe in your own words what the following query produces:\n", + "\n", + " ```\n", + " select Site.name from Site join Visited\n", + " on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n", + " ```" ] } ], diff --git a/sql/novice/09-hygiene.ipynb b/sql/novice/09-hygiene.ipynb deleted file mode 100644 index 9b607e3..0000000 --- a/sql/novice/09-hygiene.ipynb +++ /dev/null @@ -1,29 +0,0 @@ -{ - "metadata": { - "name": "" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ - { - "cells": [ - { - "cell_type": "heading", - "level": 1, - "metadata": {}, - "source": [ - "Data Hygiene" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "FIXME" - ] - } - ], - "metadata": {} - } - ] -} \ No newline at end of file diff --git a/sql/novice/10-prog.ipynb b/sql/novice/09-prog.ipynb similarity index 100% rename from sql/novice/10-prog.ipynb rename to sql/novice/09-prog.ipynb -- 2.26.2