1. Filled in 'group by'.
authorGreg Wilson <gvwilson@third-bit.com>
Sun, 10 Nov 2013 13:18:08 +0000 (08:18 -0500)
committerW. Trevor King <wking@tremily.us>
Tue, 11 Mar 2014 04:30:47 +0000 (21:30 -0700)
2. Put data hygiene in discussion of joins.
3. Got rid of 'next steps' sections.

sql/novice/01-select.ipynb
sql/novice/02-sort-dup.ipynb
sql/novice/03-filter.ipynb
sql/novice/04-calc.ipynb
sql/novice/05-null.ipynb
sql/novice/06-agg.ipynb
sql/novice/07-join.ipynb
sql/novice/09-hygiene.ipynb [deleted file]
sql/novice/09-prog.ipynb [moved from sql/novice/10-prog.ipynb with 100% similarity]

index 0215577a98689d77d095bf35830800d7e8e29cb9..3de1227e3ff78367a0c7545dc2eee707f45cc949 100644 (file)
       "\n",
       "    What style do you find easiest to read, and why?"
      ]
-    },
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Next Steps"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "FIXME"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
     }
    ],
    "metadata": {}
index e5837999e47de0ede930ce46d24efa2a3d424272..44d1335ab9b5f51e58cde18430199d5e1ebc65f9 100644 (file)
       "\n",
       "FIXME"
      ]
-    },
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Next Steps"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "FIXME"
-     ]
     }
    ],
    "metadata": {}
index 08e0baa11ff453bf56b3a674301a9f7bfd2892df..c620cf3607c87b87d920dc45c0bafcbc0b8b276c 100644 (file)
       "    write a query that finds all the records in `Visited`\n",
       "    that *aren't* from sites labelled 'DR-something'."
      ]
-    },
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Next Steps"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "FIXME"
-     ]
     }
    ],
    "metadata": {}
index 2ecf97d84809f2a2924afb29a20eba364d5f4a96..79c64cb58328275ddbe26b6ad39b519ec4d00341 100644 (file)
       "(For this data,\n",
       "the list should contain only \"DR\" and \"MSK\")."
      ]
-    },
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Next Steps"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "FIXME"
-     ]
     }
    ],
    "metadata": {}
index 921d10b6b32d89ba759d5838a055ea62bac76128..331424b57f55b2a9ccbc5301c66ab0dc3a0f4b93 100644 (file)
       "    What does this simplify?\n",
       "    What burdens or risks does it introduce?"
      ]
-    },
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Next Steps"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "FIXME"
-     ]
     }
    ],
    "metadata": {}
index 871bbcc5c452f38ee7b276e2229918958bf165e8..551f9f1f78891986b90d741e1a421187a94e9551 100644 (file)
@@ -60,7 +60,7 @@
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4ed0>"
+        "<IPython.core.display.HTML at 0x1023c4f90>"
        ]
       }
      ],
@@ -96,7 +96,7 @@
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4ed0>"
+        "<IPython.core.display.HTML at 0x1023c4110>"
        ]
       }
      ],
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4090>"
+        "<IPython.core.display.HTML at 0x1023c46d0>"
        ]
       }
      ],
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4090>"
+        "<IPython.core.display.HTML at 0x1023c4b50>"
        ]
       }
      ],
-     "prompt_number": 7
+     "prompt_number": 5
     },
     {
      "cell_type": "code",
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4fd0>"
+        "<IPython.core.display.HTML at 0x1023c4f90>"
        ]
       }
      ],
-     "prompt_number": 8
+     "prompt_number": 6
     },
     {
      "cell_type": "code",
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4d90>"
+        "<IPython.core.display.HTML at 0x1023c46d0>"
        ]
       }
      ],
-     "prompt_number": 9
+     "prompt_number": 7
     },
     {
      "cell_type": "markdown",
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4fd0>"
+        "<IPython.core.display.HTML at 0x1023c4f90>"
        ]
       }
      ],
-     "prompt_number": 10
+     "prompt_number": 8
     },
     {
      "cell_type": "markdown",
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4d90>"
+        "<IPython.core.display.HTML at 0x1023c46d0>"
        ]
       }
      ],
-     "prompt_number": 11
+     "prompt_number": 9
     },
     {
      "cell_type": "markdown",
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4fd0>"
+        "<IPython.core.display.HTML at 0x1023c41d0>"
        ]
       }
      ],
-     "prompt_number": 12
+     "prompt_number": 10
     },
     {
      "cell_type": "markdown",
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4fd0>"
+        "<IPython.core.display.HTML at 0x1023c41d0>"
        ]
       }
      ],
-     "prompt_number": 13
+     "prompt_number": 11
     },
     {
      "cell_type": "markdown",
        "metadata": {},
        "output_type": "display_data",
        "text": [
-        "<IPython.core.display.HTML at 0x1023c4090>"
+        "<IPython.core.display.HTML at 0x1023c46d0>"
+       ]
+      }
+     ],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Aggregating all records at once doesn't always make sense.\n",
+      "For example,\n",
+      "suppose Gina suspects that there is a systematic bias in her data,\n",
+      "and that some scientists' radiation readings are higher than others.\n",
+      "We know that this doesn't work:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select person, count(reading), round(avg(reading), 2)\n",
+      "from  Survey\n",
+      "where quant='rad';"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>roe</td><td>8</td><td>6.56</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c4110>"
+       ]
+      }
+     ],
+     "prompt_number": 13
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "because the database manager selects a single arbitrary scientist's name\n",
+      "rather than aggregating separately for each scientist.\n",
+      "Since there are only five scientists,\n",
+      "she could write five queries of the form:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select person, count(reading), round(avg(reading), 2)\n",
+      "from  Survey\n",
+      "where quant='rad'\n",
+      "and   person='dyer';"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>dyer</td><td>2</td><td>8.81</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c46d0>"
        ]
       }
      ],
      "prompt_number": 14
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "but this would be tedious,\n",
+      "and if she ever had a data set with fifty or five hundred scientists,\n",
+      "the chances of her getting all of those queries right is small.\n",
+      "\n",
+      "What we need to do is\n",
+      "tell the database manager to aggregate the hours for each scientist separately\n",
+      "using a `group by` clause:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select   person, count(reading), round(avg(reading), 2)\n",
+      "from     Survey\n",
+      "where    quant='rad'\n",
+      "group by person;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>dyer</td><td>2</td><td>8.81</td></tr>\n",
+        "<tr><td>lake</td><td>2</td><td>1.82</td></tr>\n",
+        "<tr><td>pb</td><td>3</td><td>6.66</td></tr>\n",
+        "<tr><td>roe</td><td>1</td><td>11.25</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c4b50>"
+       ]
+      }
+     ],
+     "prompt_number": 15
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "`group by` does exactly what its name implies:\n",
+      "groups all the records with the same value for the specified field together\n",
+      "so that aggregation can process each batch separately.\n",
+      "Since all the records in each batch have the same value for `person`,\n",
+      "it no longer matters that the database manager\n",
+      "is picking an arbitrary one to display\n",
+      "alongside the aggregated `reading` values."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "FIXME: diagram"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Just as we can sort by multiple criteria at once,\n",
+      "we can also group by multiple criteria.\n",
+      "To get the average reading by scientist and quantity measured,\n",
+      "for example,\n",
+      "we just add another field to the `group by` clause:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select   person, quant, count(reading), round(avg(reading), 2)\n",
+      "from     Survey\n",
+      "group by person, quant;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>None</td><td>sal</td><td>1</td><td>0.06</td></tr>\n",
+        "<tr><td>None</td><td>temp</td><td>1</td><td>-26.0</td></tr>\n",
+        "<tr><td>dyer</td><td>rad</td><td>2</td><td>8.81</td></tr>\n",
+        "<tr><td>dyer</td><td>sal</td><td>2</td><td>0.11</td></tr>\n",
+        "<tr><td>lake</td><td>rad</td><td>2</td><td>1.82</td></tr>\n",
+        "<tr><td>lake</td><td>sal</td><td>4</td><td>0.11</td></tr>\n",
+        "<tr><td>lake</td><td>temp</td><td>1</td><td>-16.0</td></tr>\n",
+        "<tr><td>pb</td><td>rad</td><td>3</td><td>6.66</td></tr>\n",
+        "<tr><td>pb</td><td>temp</td><td>2</td><td>-20.0</td></tr>\n",
+        "<tr><td>roe</td><td>rad</td><td>1</td><td>11.25</td></tr>\n",
+        "<tr><td>roe</td><td>sal</td><td>2</td><td>32.05</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c4110>"
+       ]
+      }
+     ],
+     "prompt_number": 16
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Note that we have added `person` to the list of fields displayed,\n",
+      "since the results wouldn't make much sense otherwise.\n",
+      "\n",
+      "Let's go one step further and remove all the entries\n",
+      "where we don't know who took the measurement:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select   person, quant, count(reading), round(avg(reading), 2)\n",
+      "from     Survey\n",
+      "where    person is not null\n",
+      "group by person, quant\n",
+      "order by person, quant;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>dyer</td><td>rad</td><td>2</td><td>8.81</td></tr>\n",
+        "<tr><td>dyer</td><td>sal</td><td>2</td><td>0.11</td></tr>\n",
+        "<tr><td>lake</td><td>rad</td><td>2</td><td>1.82</td></tr>\n",
+        "<tr><td>lake</td><td>sal</td><td>4</td><td>0.11</td></tr>\n",
+        "<tr><td>lake</td><td>temp</td><td>1</td><td>-16.0</td></tr>\n",
+        "<tr><td>pb</td><td>rad</td><td>3</td><td>6.66</td></tr>\n",
+        "<tr><td>pb</td><td>temp</td><td>2</td><td>-20.0</td></tr>\n",
+        "<tr><td>roe</td><td>rad</td><td>1</td><td>11.25</td></tr>\n",
+        "<tr><td>roe</td><td>sal</td><td>2</td><td>32.05</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c46d0>"
+       ]
+      }
+     ],
+     "prompt_number": 17
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Looking more closely,\n",
+      "this query:\n",
+      "\n",
+      "1.  selected records from the `Survey` table\n",
+      "    where the `person` field was not null;\n",
+      "\n",
+      "2.  grouped those records into subsets\n",
+      "    so that the `person` and `quant` values in each subset\n",
+      "    were the same;\n",
+      "\n",
+      "3.  ordered those subsets first by `person`,\n",
+      "    and then within each sub-group by `quant`;\n",
+      "    and\n",
+      "\n",
+      "4.  counted the number of records in each subset,\n",
+      "    calculated the average `reading` in each,\n",
+      "    and chose a `person` and `quant` value from each\n",
+      "    (it doesn't matter which ones,\n",
+      "    since they're all equal)."
+     ]
+    },
     {
      "cell_type": "markdown",
      "metadata": {},
       "\n",
       "    Can you find a way to order the list by surname?"
      ]
-    },
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Next Steps"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "FIXME"
-     ]
     }
    ],
    "metadata": {}
index 7510ba3d50222659ad04573b482ecedfe3b7b353..4c6cac4374a82ff0efb6f3593c93b939e60106f7 100644 (file)
      ],
      "prompt_number": 6
     },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Data Hygiene"
+     ]
+    },
     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-      "### Challenges\n",
-      "\n",
-      "1.  Write a query that lists all radiation readings from the DR-1 site.\n",
+      "Now that we have seen how joins work,\n",
+      "we can see why the relational model is so useful\n",
+      "and how best to use it.\n",
+      "The first rule is that every value should be [atomic](../gloss.html#atomic-value),\n",
+      "i.e.,\n",
+      "not contain parts that we might want to work with separately.\n",
+      "We store personal and family names in separate columns instead of putting the entire name in one column\n",
+      "so that we don't have to use substring operations to get the name's components.\n",
+      "More importantly,\n",
+      "we store the two parts of the name separately because splitting on spaces is unreliable:\n",
+      "just think of a name like \"Eloise St. Cyr\" or \"Jan Mikkel Steubart\".\n",
       "\n",
-      "2.  Write a query that lists all sites visited by people named \"Frank\".\n",
+      "The second rule is that every record should have a unique primary key.\n",
+      "This can be a serial number that has no intrinsic meaning,\n",
+      "one of the values in the record (like the `ident` field in the `Person` table),\n",
+      "or even a combination of values:\n",
+      "the triple `(taken, person, quant)` from the `Survey` table uniquely identifies every measurement.\n",
       "\n",
-      "3.  Describe in your own words what the following query produces:\n",
+      "The third rule is that there should be no redundant information.\n",
+      "For example,\n",
+      "we could get rid of the `Site` table and rewrite the `Visited` table like this:\n",
       "\n",
-      "    ```\n",
-      "    select Site.name from Site join Visited\n",
-      "    on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n",
-      "    ```\n",
+      "<table>\n",
+      "  <tr> <td>619</td> <td>-49.85</td> <td>-128.57</td> <td>1927-02-08</td> </tr>\n",
+      "  <tr> <td>622</td> <td>-49.85</td> <td>-128.57</td> <td>1927-02-10</td> </tr>\n",
+      "  <tr> <td>734</td> <td>-47.15</td> <td>-126.72</td> <td>1939-01-07</td> </tr>\n",
+      "  <tr> <td>735</td> <td>-47.15</td> <td>-126.72</td> <td>1930-01-12</td> </tr>\n",
+      "  <tr> <td>751</td> <td>-47.15</td> <td>-126.72</td> <td>1930-02-26</td> </tr>\n",
+      "  <tr> <td>752</td> <td>-47.15</td> <td>-126.72</td> <td>null</td> </tr>\n",
+      "  <tr> <td>837</td> <td>-48.87</td> <td>-123.40</td> <td>1932-01-14</td> </tr>\n",
+      "  <tr> <td>844</td> <td>-49.85</td> <td>-128.57</td> <td>1932-03-22</td> </tr>\n",
+      "</table>\n",
       "\n",
-      "4.  Why does the `Person` table have an `ident` field?\n",
-      "    Why do we not just use scientists' names in the `Survey` table?\n",
+      "In fact,\n",
+      "we could use a single table that recorded all the information about each reading in each row,\n",
+      "just as a spreadsheet would.\n",
+      "The problem is that it's very hard to keep data organized this way consistent:\n",
+      "if we realize that the date of a particular visit to a particular site is wrong,\n",
+      "we have to change multiple records in the database.\n",
+      "What's worse,\n",
+      "we may have to guess which records to change,\n",
+      "since some other sites may actually have been visited on that date.\n",
       "\n",
-      "5.  Why does the table `Site` exist?\n",
-      "    Why don't we just record latitudes and longitudes\n",
-      "    directly in the `Visited` and `Survey` tables?"
-     ]
-    },
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Next Steps"
+      "The fourth rule is that the units for every value should be stored explicitly.\n",
+      "Our database doesn't do this,\n",
+      "and that's a problem:\n",
+      "Roerich's salinity measurements are several orders of magnitude larger than anyone else's,\n",
+      "but we don't know if that means she was using parts per million instead of parts per thousand,\n",
+      "or whether there actually was a saline anomaly at that site in 1932."
      ]
     },
     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-      "FIXME"
+      "### Challenges\n",
+      "\n",
+      "1.  Write a query that lists all radiation readings from the DR-1 site.\n",
+      "\n",
+      "2.  Write a query that lists all sites visited by people named \"Frank\".\n",
+      "\n",
+      "3.  Describe in your own words what the following query produces:\n",
+      "\n",
+      "    ```\n",
+      "    select Site.name from Site join Visited\n",
+      "    on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n",
+      "    ```"
      ]
     }
    ],
diff --git a/sql/novice/09-hygiene.ipynb b/sql/novice/09-hygiene.ipynb
deleted file mode 100644 (file)
index 9b607e3..0000000
+++ /dev/null
@@ -1,29 +0,0 @@
-{
- "metadata": {
-  "name": ""
- },
- "nbformat": 3,
- "nbformat_minor": 0,
- "worksheets": [
-  {
-   "cells": [
-    {
-     "cell_type": "heading",
-     "level": 1,
-     "metadata": {},
-     "source": [
-      "Data Hygiene"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "FIXME"
-     ]
-    }
-   ],
-   "metadata": {}
-  }
- ]
-}
\ No newline at end of file