First three lessons on SQL for beginners
authorGreg Wilson <gvwilson@third-bit.com>
Sat, 9 Nov 2013 16:43:13 +0000 (11:43 -0500)
committerW. Trevor King <wking@tremily.us>
Tue, 11 Mar 2014 04:30:29 +0000 (21:30 -0700)
sql/novice/01-select.ipynb [new file with mode: 0644]
sql/novice/02-sort-dup.ipynb [new file with mode: 0644]
sql/novice/03-filter.ipynb [new file with mode: 0644]
sql/novice/db.txt [new file with mode: 0644]
sql/novice/sqlitemagic.py [new file with mode: 0644]
sql/novice/survey.db [new file with mode: 0644]
sql/novice/survey.sql [new file with mode: 0644]
sql/novice/tables.html [new file with mode: 0644]

diff --git a/sql/novice/01-select.ipynb b/sql/novice/01-select.ipynb
new file mode 100644 (file)
index 0000000..0215577
--- /dev/null
@@ -0,0 +1,437 @@
+{
+ "metadata": {
+  "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "heading",
+     "level": 1,
+     "metadata": {},
+     "source": [
+      "Selecting Data"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "In the late 1920s and early 1930s,\n",
+      "William Dyer,\n",
+      "Frank Pabodie,\n",
+      "and Valentina Roerich led expeditions to the\n",
+      "[Pole of Inaccessibility](http://en.wikipedia.org/wiki/Pole_of_inaccessibility)\n",
+      "in the South Pacific,\n",
+      "and then onward to Antarctica.\n",
+      "Two years ago,\n",
+      "their expeditions were found in a storage locker at Miskatonic University.\n",
+      "We have scanned and OCR'd the data they contain,\n",
+      "and we now want to store that information\n",
+      "in a way that will make search and analysis easy.\n",
+      "\n",
+      "We basically have three options:\n",
+      "text files,\n",
+      "a spreadsheet,\n",
+      "or a database.\n",
+      "Text files are easiest to create,\n",
+      "and work well with version control,\n",
+      "but then we would then have to build search and analysis tools ourselves.\n",
+      "Spreadsheets are good for doing simple analysis,\n",
+      "they don't handle large or complex data sets very well.\n",
+      "We would therefore like to put this data in a database,\n",
+      "and these lessons will show how to do that."
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "A Few Definitions"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "A [relational database](../gloss.html#relational-database)\n",
+      "is a way to store and manipulate information\n",
+      "that is arranged as [tables](../gloss.html#table).\n",
+      "Each table has columns (also known as [fields](../gloss.html#field-database)) which describe the data,\n",
+      "and rows (also known as [records](../gloss.html#record-database)) which contain the data.\n",
+      "  \n",
+      "When we are using a spreadsheet,\n",
+      "we put formulas into cells to calculate new values based on old ones.\n",
+      "When we are using a database,\n",
+      "we send commands\n",
+      "(usually called [queries](../gloss.html#query))\n",
+      "to a [database manager](../gloss.html#database-manager):\n",
+      "a program that manipulates the database for us.\n",
+      "The database manager does whatever lookups and calculations the query specifies,\n",
+      "returning the results in a tabular form\n",
+      "that we can then use as a starting point for further queries.\n",
+      "  \n",
+      "> Every database manager&mdash;Oracle,\n",
+      "> IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite&mdash;stores\n",
+      "> data in a different way,\n",
+      "> so a database created with one cannot be used directly by another.\n",
+      "> However,\n",
+      "> every database manager can import and export data in a variety of formats,\n",
+      "> so it *is* possible to move information from one to another.\n",
+      "\n",
+      "Queries are written in a language called [SQL](../gloss.html#sql),\n",
+      "which stands for \"Structured Query Language\".\n",
+      "SQL provides hundreds of different ways to analyze and recombine data;\n",
+      "we will only look at a handful,\n",
+      "but that handful accounts for most of what scientists do.\n",
+      "\n",
+      "The tables below show the database we will use in our examples:"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<table>\n",
+      "<tr>\n",
+      "<td valign=\"top\">\n",
+      "**Person**: people who took readings.\n",
+      "\n",
+      "<table>\n",
+      "  <tr> <th>ident</th> <th>personal</th> <th>family</th> </tr>\n",
+      "  <tr> <td>dyer</td> <td>William</td> <td>Dyer</td> </tr>\n",
+      "  <tr> <td>pb</td> <td>Frank</td> <td>Pabodie</td> </tr>\n",
+      "  <tr> <td>lake</td> <td>Anderson</td> <td>Lake</td> </tr>\n",
+      "  <tr> <td>roe</td> <td>Valentina</td> <td>Roerich</td> </tr>\n",
+      "  <tr> <td>danforth</td> <td>Frank</td> <td>Danforth</td> </tr>\n",
+      "</table>\n",
+      "\n",
+      "**Site**: locations where readings were taken.\n",
+      "\n",
+      "<table>\n",
+      "  <tr> <th>name</th> <th>lat</th> <th>long</th> </tr>\n",
+      "  <tr> <td>DR-1</td> <td>-49.85</td> <td>-128.57</td> </tr>\n",
+      "  <tr> <td>DR-3</td> <td>-47.15</td> <td>-126.72</td> </tr>\n",
+      "  <tr> <td>MSK-4</td> <td>-48.87</td> <td>-123.4</td> </tr>\n",
+      "</table>\n",
+      "\n",
+      "**Visited**: when readings were taken at specific sites.\n",
+      "\n",
+      "<table>\n",
+      "  <tr> <th>ident</th> <th>site</th> <th>dated</th> </tr>\n",
+      "  <tr> <td>619</td> <td>DR-1</td> <td>1927-02-08</td> </tr>\n",
+      "  <tr> <td>622</td> <td>DR-1</td> <td>1927-02-10</td> </tr>\n",
+      "  <tr> <td>734</td> <td>DR-3</td> <td>1939-01-07</td> </tr>\n",
+      "  <tr> <td>735</td> <td>DR-3</td> <td>1930-01-12</td> </tr>\n",
+      "  <tr> <td>751</td> <td>DR-3</td> <td>1930-02-26</td> </tr>\n",
+      "  <tr> <td>752</td> <td>DR-3</td> <td>NULL</td> </tr>\n",
+      "  <tr> <td>837</td> <td>MSK-4</td> <td>1932-01-14</td> </tr>\n",
+      "  <tr> <td>844</td> <td>DR-1</td> <td>1932-03-22</td> </tr>\n",
+      "</table>\n",
+      "</td>\n",
+      "<td valign=\"top\">\n",
+      "**Survey**: the actual readings.\n",
+      "\n",
+      "<table>\n",
+      "  <tr> <th>taken</th> <th>person</th> <th>quant</th> <th>reading</th> </tr>\n",
+      "  <tr> <td>619</td> <td>dyer</td> <td>rad</td> <td>9.82</td> </tr>\n",
+      "  <tr> <td>619</td> <td>dyer</td> <td>sal</td> <td>0.13</td> </tr>\n",
+      "  <tr> <td>622</td> <td>dyer</td> <td>rad</td> <td>7.8</td> </tr>\n",
+      "  <tr> <td>622</td> <td>dyer</td> <td>sal</td> <td>0.09</td> </tr>\n",
+      "  <tr> <td>734</td> <td>pb</td> <td>rad</td> <td>8.41</td> </tr>\n",
+      "  <tr> <td>734</td> <td>lake</td> <td>sal</td> <td>0.05</td> </tr>\n",
+      "  <tr> <td>734</td> <td>pb</td> <td>temp</td> <td>-21.5</td> </tr>\n",
+      "  <tr> <td>735</td> <td>pb</td> <td>rad</td> <td>7.22</td> </tr>\n",
+      "  <tr> <td>735</td> <td>NULL</td> <td>sal</td> <td>0.06</td> </tr>\n",
+      "  <tr> <td>735</td> <td>NULL</td> <td>temp</td> <td>-26.0</td> </tr>\n",
+      "  <tr> <td>751</td> <td>pb</td> <td>rad</td> <td>4.35</td> </tr>\n",
+      "  <tr> <td>751</td> <td>pb</td> <td>temp</td> <td>-18.5</td> </tr>\n",
+      "  <tr> <td>751</td> <td>lake</td> <td>sal</td> <td>0.1</td> </tr>\n",
+      "  <tr> <td>752</td> <td>lake</td> <td>rad</td> <td>2.19</td> </tr>\n",
+      "  <tr> <td>752</td> <td>lake</td> <td>sal</td> <td>0.09</td> </tr>\n",
+      "  <tr> <td>752</td> <td>lake</td> <td>temp</td> <td>-16.0</td> </tr>\n",
+      "  <tr> <td>752</td> <td>roe</td> <td>sal</td> <td>41.6</td> </tr>\n",
+      "  <tr> <td>837</td> <td>lake</td> <td>rad</td> <td>1.46</td> </tr>\n",
+      "  <tr> <td>837</td> <td>lake</td> <td>sal</td> <td>0.21</td> </tr>\n",
+      "  <tr> <td>837</td> <td>roe</td> <td>sal</td> <td>22.5</td> </tr>\n",
+      "  <tr> <td>844</td> <td>roe</td> <td>rad</td> <td>11.25</td> </tr>\n",
+      "</table>\n",
+      "</td>\n",
+      "</tr>\n",
+      "</table>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Notice that three entries&mdash;one in the `Visited` table,\n",
+      "and two in the `Survey` table&mdash;are shown as `NULL`.\n",
+      "We'll return to these values [later](#s:null).\n",
+      "For now,\n",
+      "let's write an SQL query that displays scientists' names.\n",
+      "We do this using the SQL command `select`,\n",
+      "giving it the names of the columns we want and the table we want them from.\n",
+      "Our query and its output look like this:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%load_ext sqlitemagic"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select family, personal from Person;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>Dyer</td><td>William</td></tr>\n",
+        "<tr><td>Pabodie</td><td>Frank</td></tr>\n",
+        "<tr><td>Lake</td><td>Anderson</td></tr>\n",
+        "<tr><td>Roerich</td><td>Valentina</td></tr>\n",
+        "<tr><td>Danforth</td><td>Frank</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c6050>"
+       ]
+      }
+     ],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The semi-colon at the end of the query\n",
+      "tells the database manager that the query is complete and ready to run.\n",
+      "We have written our commands and column names in lower case,\n",
+      "and the table name in Title Case,\n",
+      "but we don't have to:\n",
+      "as the example below shows,\n",
+      "SQL is [case insensitive](../gloss.html#case-insensitive)."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>Dyer</td><td>William</td></tr>\n",
+        "<tr><td>Pabodie</td><td>Frank</td></tr>\n",
+        "<tr><td>Lake</td><td>Anderson</td></tr>\n",
+        "<tr><td>Roerich</td><td>Valentina</td></tr>\n",
+        "<tr><td>Danforth</td><td>Frank</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c41d0>"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Whatever casing convention you choose,\n",
+      "please be consistent:\n",
+      "complex queries are hard enough to read without the extra cognitive load of random capitalization."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Going back to our query,\n",
+      "it's important to understand that\n",
+      "the rows and columns in a database table aren't actually stored in any particular order.\n",
+      "They will always be *displayed* in some order,\n",
+      "but we can control that in various ways.\n",
+      "For example,\n",
+      "we could swap the columns in the output by writing our query as:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select personal, family from Person;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>William</td><td>Dyer</td></tr>\n",
+        "<tr><td>Frank</td><td>Pabodie</td></tr>\n",
+        "<tr><td>Anderson</td><td>Lake</td></tr>\n",
+        "<tr><td>Valentina</td><td>Roerich</td></tr>\n",
+        "<tr><td>Frank</td><td>Danforth</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c4fd0>"
+       ]
+      }
+     ],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "or even repeat columns:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select ident, ident, ident from Person;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>dyer</td><td>dyer</td><td>dyer</td></tr>\n",
+        "<tr><td>pb</td><td>pb</td><td>pb</td></tr>\n",
+        "<tr><td>lake</td><td>lake</td><td>lake</td></tr>\n",
+        "<tr><td>roe</td><td>roe</td><td>roe</td></tr>\n",
+        "<tr><td>danforth</td><td>danforth</td><td>danforth</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c4090>"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "As a shortcut,\n",
+      "we can select all of the columns in a table using `*`:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Person;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
+        "<tr><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
+        "<tr><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
+        "<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
+        "<tr><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c3b10>"
+       ]
+      }
+     ],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### Challenges\n",
+      "\n",
+      "1.  Write a query that selects only site names from the `Site` table.\n",
+      "\n",
+      "2.  Many people format queries as:\n",
+      "\n",
+      "    ```\n",
+      "    SELECT personal, family FROM person;\n",
+      "    ```\n",
+      "\n",
+      "    or as:\n",
+      "\n",
+      "    ```\n",
+      "    select Personal, Family from PERSON;\n",
+      "    ```\n",
+      "\n",
+      "    What style do you find easiest to read, and why?"
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Next Steps"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "FIXME"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
diff --git a/sql/novice/02-sort-dup.ipynb b/sql/novice/02-sort-dup.ipynb
new file mode 100644 (file)
index 0000000..e583799
--- /dev/null
@@ -0,0 +1,408 @@
+{
+ "metadata": {
+  "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "heading",
+     "level": 1,
+     "metadata": {},
+     "source": [
+      "Sorting and Removing Duplicates"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Data is often redundant,\n",
+      "so queries often return redundant information.\n",
+      "For example,\n",
+      "if we select the quantitites that have been measured\n",
+      "from the `survey` table,\n",
+      "we get this:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%load_ext sqlitemagic"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select quant from Survey;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>temp</td></tr>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>temp</td></tr>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>temp</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>temp</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>rad</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023cbe10>"
+       ]
+      }
+     ],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We can eliminate the redundant output\n",
+      "to make the result more readable\n",
+      "by adding the `distinct` keyword\n",
+      "to our query:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select distinct quant from Survey;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>rad</td></tr>\n",
+        "<tr><td>sal</td></tr>\n",
+        "<tr><td>temp</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023cbf10>"
+       ]
+      }
+     ],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "If we select more than one column&mdash;for example,\n",
+      "both the survey site ID and the quantity measured&mdash;then\n",
+      "the distinct pairs of values are returned:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select distinct taken, quant from Survey;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>619</td><td>rad</td></tr>\n",
+        "<tr><td>619</td><td>sal</td></tr>\n",
+        "<tr><td>622</td><td>rad</td></tr>\n",
+        "<tr><td>622</td><td>sal</td></tr>\n",
+        "<tr><td>734</td><td>rad</td></tr>\n",
+        "<tr><td>734</td><td>sal</td></tr>\n",
+        "<tr><td>734</td><td>temp</td></tr>\n",
+        "<tr><td>735</td><td>rad</td></tr>\n",
+        "<tr><td>735</td><td>sal</td></tr>\n",
+        "<tr><td>735</td><td>temp</td></tr>\n",
+        "<tr><td>751</td><td>rad</td></tr>\n",
+        "<tr><td>751</td><td>temp</td></tr>\n",
+        "<tr><td>751</td><td>sal</td></tr>\n",
+        "<tr><td>752</td><td>rad</td></tr>\n",
+        "<tr><td>752</td><td>sal</td></tr>\n",
+        "<tr><td>752</td><td>temp</td></tr>\n",
+        "<tr><td>837</td><td>rad</td></tr>\n",
+        "<tr><td>837</td><td>sal</td></tr>\n",
+        "<tr><td>844</td><td>rad</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023cbf10>"
+       ]
+      }
+     ],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Notice in both cases that duplicates are removed\n",
+      "even if they didn't appear to be adjacent in the database.\n",
+      "Again,\n",
+      "it's important to remember that rows aren't actually ordered:\n",
+      "they're just displayed that way."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### Challenges\n",
+      "\n",
+      "1.  Write a query that selects distinct dates from the `Site` table."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "As we mentioned earlier,\n",
+      "database records are not stored in any particular order.\n",
+      "This means that query results aren't necessarily sorted,\n",
+      "and even if they are,\n",
+      "we often want to sort them in a different way,\n",
+      "e.g., by the name of the project instead of by the name of the scientist.\n",
+      "We can do this in SQL by adding an `order by` clause to our query:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Person order by ident;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
+        "<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
+        "<tr><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
+        "<tr><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
+        "<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023cbe90>"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "By default,\n",
+      "results are sorted in ascending order\n",
+      "(i.e.,\n",
+      "from least to greatest).\n",
+      "We can sort in the opposite order using `desc` (for \"descending\"):"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from person order by ident desc;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
+        "<tr><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
+        "<tr><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
+        "<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
+        "<tr><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023cbe10>"
+       ]
+      }
+     ],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "(And if we want to make it clear that we're sorting in ascending order,\n",
+      "we can use `asc` instead of `desc`.)\n",
+      "  \n",
+      "We can also sort on several fields at once.\n",
+      "For example,\n",
+      "this query sorts results first in ascending order by `taken`,\n",
+      "and then in descending order by `person`\n",
+      "within each group of equal `taken` values:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select taken, person from Survey order by taken asc, person desc;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>619</td><td>dyer</td></tr>\n",
+        "<tr><td>619</td><td>dyer</td></tr>\n",
+        "<tr><td>622</td><td>dyer</td></tr>\n",
+        "<tr><td>622</td><td>dyer</td></tr>\n",
+        "<tr><td>734</td><td>pb</td></tr>\n",
+        "<tr><td>734</td><td>pb</td></tr>\n",
+        "<tr><td>734</td><td>lake</td></tr>\n",
+        "<tr><td>735</td><td>pb</td></tr>\n",
+        "<tr><td>735</td><td>None</td></tr>\n",
+        "<tr><td>735</td><td>None</td></tr>\n",
+        "<tr><td>751</td><td>pb</td></tr>\n",
+        "<tr><td>751</td><td>pb</td></tr>\n",
+        "<tr><td>751</td><td>lake</td></tr>\n",
+        "<tr><td>752</td><td>roe</td></tr>\n",
+        "<tr><td>752</td><td>lake</td></tr>\n",
+        "<tr><td>752</td><td>lake</td></tr>\n",
+        "<tr><td>752</td><td>lake</td></tr>\n",
+        "<tr><td>837</td><td>roe</td></tr>\n",
+        "<tr><td>837</td><td>lake</td></tr>\n",
+        "<tr><td>837</td><td>lake</td></tr>\n",
+        "<tr><td>844</td><td>roe</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c4090>"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "This is easier to understand if we also remove duplicates:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select distinct taken, person from Survey order by taken asc, person desc;"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>619</td><td>dyer</td></tr>\n",
+        "<tr><td>622</td><td>dyer</td></tr>\n",
+        "<tr><td>734</td><td>pb</td></tr>\n",
+        "<tr><td>734</td><td>lake</td></tr>\n",
+        "<tr><td>735</td><td>pb</td></tr>\n",
+        "<tr><td>735</td><td>None</td></tr>\n",
+        "<tr><td>751</td><td>pb</td></tr>\n",
+        "<tr><td>751</td><td>lake</td></tr>\n",
+        "<tr><td>752</td><td>roe</td></tr>\n",
+        "<tr><td>752</td><td>lake</td></tr>\n",
+        "<tr><td>837</td><td>roe</td></tr>\n",
+        "<tr><td>837</td><td>lake</td></tr>\n",
+        "<tr><td>844</td><td>roe</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c96d0>"
+       ]
+      }
+     ],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### Challenges\n",
+      "\n",
+      "FIXME"
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Next Steps"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "FIXME"
+     ]
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
diff --git a/sql/novice/03-filter.ipynb b/sql/novice/03-filter.ipynb
new file mode 100644 (file)
index 0000000..08e0baa
--- /dev/null
@@ -0,0 +1,476 @@
+{
+ "metadata": {
+  "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "heading",
+     "level": 1,
+     "metadata": {},
+     "source": [
+      "Filtering"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "One of the most powerful features of a database is\n",
+      "the ability to [filter](../gloss.html#filter) data,\n",
+      "i.e.,\n",
+      "to select only those records that match certain criteria.\n",
+      "For example,\n",
+      "suppose we want to see when a particular site was visited.\n",
+      "We can select these records from the `Visited` table\n",
+      "by using a `where` clause in our query:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%load_ext sqlitemagic"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Visited where site='DR-1';"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+        "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+        "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c4550>"
+       ]
+      }
+     ],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "The database manager executes this query in two stages.\n",
+      "First,\n",
+      "it checks at each row in the `Visited` table\n",
+      "to see which ones satisfy the `where`.\n",
+      "It then uses the column names following the `select` keyword\n",
+      "to determine what columns to display."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "FIXME: diagram"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "This processing order means that\n",
+      "we can filter records using `where`\n",
+      "based on values in columns that aren't then displayed:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select ident from Visited where site='DR-1';"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>619</td></tr>\n",
+        "<tr><td>622</td></tr>\n",
+        "<tr><td>844</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c46d0>"
+       ]
+      }
+     ],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We can use many other Boolean operators to filter our data.\n",
+      "For example,\n",
+      "we can ask for all information from the DR-1 site collected since 1930:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c3890>"
+       ]
+      }
+     ],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "(The parentheses around the individual tests aren't strictly required,\n",
+      "but they help make the query easier to read.)\n",
+      "\n",
+      "> Most database managers have a special data type for dates.\n",
+      "> In fact, many have two:\n",
+      "> one for dates,\n",
+      "> such as \"May 31, 1971\",\n",
+      "> and one for durations,\n",
+      "> such as \"31 days\".\n",
+      "> SQLite doesn't:\n",
+      "> instead,\n",
+      "> it stores dates as either text\n",
+      "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
+      "> real numbers\n",
+      "> (the number of days since November 24, 4714 BCE),\n",
+      "> or integers\n",
+      "> (the number of seconds since midnight, January 1, 1970).\n",
+      "> If this sounds complicated,\n",
+      "> it is,\n",
+      "> but not nearly as complicated as figuring out\n",
+      "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "If we want to find out what measurements were taken by either Lake or Roerich,\n",
+      "we can combine the tests on their names using `or`:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Survey where person='lake' or person='roe';"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
+        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
+        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
+        "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
+        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
+        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
+        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c3890>"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Alternatively,\n",
+      "we can use `in` to see if a value is in a specific set:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Survey where person in ('lake', 'roe');"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
+        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
+        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
+        "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
+        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
+        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
+        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c3890>"
+       ]
+      }
+     ],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We can combine `and` with `or`,\n",
+      "but we need to be careful about which operator is executed first.\n",
+      "If we *don't* use parentheses,\n",
+      "we get this:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Survey where quant='sal' and person='lake' or person='roe';"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
+        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
+        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
+        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
+        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
+        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c3890>"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "which is salinity measurements by Lake,\n",
+      "and *any* measurement by Roerich.\n",
+      "We probably want this instead:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select * from Survey where quant='sal' and (person='lake' or person='roe');"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
+        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
+        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
+        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
+        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
+        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c3890>"
+       ]
+      }
+     ],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Finally,\n",
+      "we can use `distinct` with `where`\n",
+      "to give a second level of filtering:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%sqlite survey.db\n",
+      "select distinct person, quant from Survey where person='lake' or person='roe';"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<table>\n",
+        "<tr><td>lake</td><td>sal</td></tr>\n",
+        "<tr><td>lake</td><td>rad</td></tr>\n",
+        "<tr><td>lake</td><td>temp</td></tr>\n",
+        "<tr><td>roe</td><td>sal</td></tr>\n",
+        "<tr><td>roe</td><td>rad</td></tr>\n",
+        "</table>"
+       ],
+       "metadata": {},
+       "output_type": "display_data",
+       "text": [
+        "<IPython.core.display.HTML at 0x1023c3890>"
+       ]
+      }
+     ],
+     "prompt_number": 9
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "But remember:\n",
+      "`distinct` is applied to the values displayed in the chosen columns,\n",
+      "not to the entire rows as they are being processed.\n",
+      "\n",
+      "> What we have just done is how most people \"grow\" their SQL queries.\n",
+      "> We started with something simple that did part of what we wanted,\n",
+      "> then added more clauses one by one,\n",
+      "> testing their effects as we went.\n",
+      "> This is a good strategy&mdash;in fact,\n",
+      "> for complex queries it's often the *only* strategy&mdash;but\n",
+      "> it depends on quick turnaround,\n",
+      "> and on us recognizing the right answer when we get it.\n",
+      ">     \n",
+      "> The best way to achieve quick turnaround is often\n",
+      "> to put a subset of data in a temporary database\n",
+      "> and run our queries against that,\n",
+      "> or to fill a small database with synthesized records.\n",
+      "> For example,\n",
+      "> instead of trying our queries against an actual database of 20 million Australians,\n",
+      "> we could run it against a sample of ten thousand,\n",
+      "> or write a small program to generate ten thousand random (but plausible) records\n",
+      "> and use that."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### Challenges\n",
+      "\n",
+      "1.  Suppose we want to select all sites that lie within 30&deg; of the equator.\n",
+      "    Our first query is:\n",
+      "\n",
+      "    ```\n",
+      "    select * from Site where (lat > -30) or (lat < 30);\n",
+      "    ```\n",
+      "\n",
+      "    Explain why this is wrong,\n",
+      "    and rewrite the query so that it is correct.\n",
+      "\n",
+      "2.  Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
+      "    Write a query that selects all records from `Survey`\n",
+      "    with salinity values outside this range.\n",
+      "\n",
+      "3.  The SQL test `*column-name* like *pattern*`\n",
+      "    is true if the value in the named column\n",
+      "    matches the pattern given;\n",
+      "    the character '%' can be used any number of times in the pattern\n",
+      "    to mean \"match zero or more characters\".\n",
+      "\n",
+      "    <table>\n",
+      "      <tr> <th>Expression</th> <th>Value</th> </tr>\n",
+      "      <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
+      "      <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
+      "      <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
+      "      <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
+      "      <tr> <td><code>'alpha' like 'a%p%'</code> <td>True</td> </tr>\n",
+      "    </table>\n",
+      "    The expression `*column-name* not like *pattern*`\n",
+      "    inverts the test.\n",
+      "    Using `like`,\n",
+      "    write a query that finds all the records in `Visited`\n",
+      "    that *aren't* from sites labelled 'DR-something'."
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Next Steps"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "FIXME"
+     ]
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
diff --git a/sql/novice/db.txt b/sql/novice/db.txt
new file mode 100644 (file)
index 0000000..59aa25a
--- /dev/null
@@ -0,0 +1,2927 @@
+% Databases and SQL
+% Greg Wilson
+% June 2013
+
+In the late 1920s and early 1930s,
+William Dyer,
+Frank Pabodie,
+and Valentina Roerich led expeditions to the
+[Pole of Inaccessibility](http://en.wikipedia.org/wiki/Pole_of_inaccessibility)
+in the South Pacific,
+and then onward to Antarctica.
+Two years ago,
+Gina Geographer discovered their expedition journals
+in a storage locker at Miskatonic University.
+She has scanned and OCR'd the data they contain,
+and wants to store that information
+in a way that will make search and analysis easy.
+
+Gina basically has three options:
+text files,
+a spreadsheet,
+or a database.
+Text files are easiest to create,
+and work well with version control,
+but she would then have to build all her search and analysis herself.
+Spreadsheets are good for doing simple analysis,
+but as she found in her last project,
+they don't handle large or complex data sets very well.
+She would therefore like to put her data in a database,
+and this chapter will show her how.
+
+As many scientists have found out the hard way,
+if collecting data is the first 90% of the work,
+managing it is the other 90%.
+In this chapter,
+we'll see how to use a database to store and analyze field observations.
+The techniques we will explore apply directly to other kinds of databases as well,
+and as we'll see,
+knowing how to get information *out* of a database is essential to
+figuring out how to put data *in*.
+
+## For Instructors {.guide}
+
+Relational databases are not as widely used in science as in business,
+but they are still a common way to store large data sets with complex structure.
+Even when the data itself isn't in a database,
+the metadata could be:
+for example,
+meteorological data might be stored in files on disk,
+but data about when and where observations were made,
+data ranges,
+and so on could be in a database
+to make it easier for scientists to find what they want to.
+  
+The first few sections
+(up to [Ordering Results](#s:sort))
+usually go very quickly.
+The pace usually slows down a bit when null values and aggregation are discussed,
+mostly because learners have a lot of details to keep straight by this point.
+Things *really* slow down during the discussion of joins,
+but this is the key idea in the whole lesson:
+important ideas like primary keys and referential integrity only make sense
+once learners have seen how they're used in joins.
+It's worth going over things a couple of times if necessary
+(with lots of examples).
+  
+The final three sections are independent of each other,
+and can be dropped if time is short.
+Of the three,
+people seem to care most about how to add data
+(which only takes a few minutes to demonstrate),
+and how to use databases from inside "real" programs.
+The material on transactions is more abstract than the rest,
+and should be omitted if [web programming](web.html)
+isn't being taught.
+Overall,
+this material takes three hours to present
+assuming that a short exercise is done with each topic.
+  
+### Prerequisites {.prereq}
+
+Everything up to the [final section](#s:programming)
+only requires some understanding of Boolean operators,
+data types,
+and pipelines,
+and what's needed can actually be introduced on the fly.
+That [final section](#s:programming),
+which shows how to use databases from within programs,
+depends on most of the [basic Python material](python.html).
+
+### Teaching Notes {.notes}
+
+* It isn't necessary to cover [sets and dictionaries](setdict.html)
+  before this material,
+  but if that has been discussed,
+  it's helpful to point out that a relational table is a generalized dictionary.
+* Simple calculations are actually easier to do in a spreadsheet,
+  the advantages of using a database become clear
+  as soon as filtering and joins are needed.
+  Instructors may therefore want to show a spreadsheet
+  with the information from the four database tables
+  consolidated into a single sheet,
+  and demonstrate what's needed in both systems to answer questions like,
+  "What was the average radiation reading in 1931?"
+* Some learners may have heard that NoSQL databases
+  (i.e., ones that don't use the relational model)
+  are the next big thing,
+  and ask why we're not teaching those.
+  The answers are:
+  * Relational databases are far more widely used than NoSQL databases.
+  * We have far more experience with relational databases
+    than with any other kind,
+    so we have a better idea of what to teach
+    and how to teach it.
+  * NoSQL databases are as different from each other
+    as they are from relational databases.
+    Until a leader emerges,
+    it isn't clear *which* NoSQL database we should teach.
+* This discussion is a useful companion to that of vectorization
+  in the lesson on [numerical computing](numpy.html):
+  in both cases,
+  the key point is to describe *what* to do,
+  and let the computer figure out *how* to do it.
+
+## Selecting {#s:select}
+
+### Learning Objectives {.objectives}
+
+* Explain the difference between a table, a database, and a database manager.
+* Explain the difference between a field and a record.
+* Select specific fields from specific tables, and display them in a specific order.
+
+Duration: 15 minutes (not including time required to download database file and connect to it)
+
+### Lesson
+
+A [relational database](../gloss.html#relational-database)
+is a way to store and manipulate information
+that is arranged as [tables](../gloss.html#table).
+Each table has columns (also known as [fields](../gloss.html#field-database)) which describe the data,
+and rows (also known as [records](../gloss.html#record-database)) which contain the data.
+  
+<a id="a:dbms"></a>
+When we are using a spreadsheet,
+we put formulas into cells to calculate new values based on old ones.
+When we are using a database,
+we send commands
+(usually called [queries](../gloss.html#query))
+to a [database manager](../gloss.html#database-manager):
+a program that manipulates the database for us.
+The database manager does whatever lookups and calculations the query specifies,
+returning the results in a tabular form
+that we can then use as a starting point for further queries.
+  
+> ### Under the Hood {.box}
+> 
+> Every database manager&mdash;Oracle,
+> IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite&mdash;stores
+> data in a different way,
+> so a database created with one cannot be used directly by another.
+> However,
+> every database manager can import and export data in a variety of formats,
+> so it *is* possible to move information from one to another.
+
+Queries are written in a language called [SQL](../gloss.html#sql),
+which stands for "Structured Query Language".
+SQL provides hundreds of different ways to analyze and recombine data;
+we will only look at a handful,
+but that handful accounts for most of what scientists do.
+  
+[Figure 1](#f:survey_db) shows
+a simple database that stores some of the data
+Gina extracted from the logs of those long-ago expeditions.
+It contains four tables:
+
+<div class="db">  
+
+Table                   Purpose
+--------------------    --------------------
+`Person`                People who took readings.
+`Site`                  Locations of observation sites.
+`Visited`               When readings were taken at specific sites.
+`Survey`                The actual measurement values.
+
+</div>
+
+**Person**
+
+<div class="db">
+
+ident                   personal                family
+--------------------    --------------------    --------------------
+dyer                    William                 Dyer
+pb                      Frank                   Pabodie
+lake                    Anderson                Lake
+roe                     Valentina               Roerich
+danforth                Frank                   Danforth
+
+</div>
+
+**Survey**
+
+<div class="db">
+
+taken                   person                  quant                   reading
+--------------------    --------------------    --------------------    --------------------
+619                     dyer                    rad                     9.82
+619                     dyer                    sal                     0.13
+622                     dyer                    rad                     7.8
+622                     dyer                    sal                     0.09
+734                     pb                      rad                     8.41
+734                     lake                    sal                     0.05
+734                     pb                      temp                    -21.5
+735                     pb                      rad                     7.22
+735                     NULL                    sal                     0.06
+735                     NULL                    temp                    -26.0
+751                     pb                      rad                     4.35
+751                     pb                      temp                    -18.5
+751                     lake                    sal                     0.1
+752                     lake                    rad                     2.19
+752                     lake                    sal                     0.09
+752                     lake                    temp                    -16.0
+752                     roe                     sal                     41.6
+837                     lake                    rad                     1.46
+837                     lake                    sal                     0.21
+837                     roe                     sal                     22.5
+844                     roe                     rad                     11.25
+
+</div>
+
+**Site**
+
+<div class="db">
+
+name                    lat                     long
+--------------------    --------------------    --------------------
+DR-1                    -49.85                  -128.57
+DR-3                    -47.15                  -126.72
+MSK-4                   -48.87                  -123.4
+
+</div>
+
+**Visited**
+
+<div class="db">
+
+ident                   site                    dated
+--------------------    --------------------    --------------------
+619                     DR-1                    1927-02-08
+622                     DR-1                    1927-02-10
+734                     DR-3                    1939-01-07
+735                     DR-3                    1930-01-12
+751                     DR-3                    1930-02-26
+752                     DR-3                    NULL
+837                     MSK-4                   1932-01-14
+844                     DR-1                    1932-03-22
+
+</div>
+
+<figcaption>Figure 1: Survey Database</figcaption>
+
+Notice that three entries&mdash;one in the `Visited` table,
+and two in the `Survey` table&mdash;are shown as `NULL`.
+We'll return to these values [later](#s:null).
+For now,
+let's write an SQL query that displays scientists' names.
+We do this using the SQL command `select`,
+giving it the names of the columns we want and the table we want them from.
+Our query and its output look like this:
+  
+    sqlite> select family, personal from Person;
+
+<div class="db">
+
+--------------------    --------------------
+Dyer                    William
+Pabodie                 Frank
+Lake                    Anderson
+Roerich                 Valentina
+Danforth                Frank
+--------------------    --------------------
+
+</div>
+
+The semi-colon at the end of the query
+tells the database manager that the query is complete and ready to run.
+If we enter the query without the semi-colon,
+or press 'enter' part-way through the query,
+the SQLite interpreter will give us a different prompt
+to show us that it's waiting for more input:
+  
+    sqlite> select family, personal
+       ...> from Person
+       ...> ;
+
+<div class="db">
+
+--------------------    --------------------
+Dyer                    William
+Pabodie                 Frank
+Lake                    Anderson
+Roerich                 Valentina
+Danforth                Frank
+--------------------    --------------------
+
+</div>
+
+From now on,
+we won't bother to display the prompt(s) with our commands.
+  
+> ### Case and Consistency {.box}
+> 
+> We have written our command and the column names in lower case,
+> and the table name in title case,
+> but we could use any mix:
+> SQL is [case insensitive](../gloss.html#case-insensitive),
+> so we could write them all in upper case,
+> or even like this:
+> 
+> ``` {.sql}
+> SeLeCt famILY, PERSonal frOM PERson;
+> ```
+> 
+> But please don't:
+> large SQL queries are hard enough to read
+> without the extra cognitive load of random capitalization.
+
+> ### Displaying Results {.box}
+> 
+> Exactly *how* the database displays the query's results
+> depends on what kind of interface we are using.
+> If we are running SQLite directly from the shell,
+> its default output looks like this:
+>     
+>     Dyer|William
+>     Pabodie|Frank
+>     Lake|Anderson
+>     Roerich|Valentina
+>     Danforth|Frank
+> 
+> If we are using a graphical interface,
+> such as the [SQLite Manager](https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/) plugin for Firefox
+> or the [database extension](https://github.com/catherinedevlin/ipython-sql) for the IPython Notebook,
+> our output will be displayed graphically
+> ([Figure 2](#f:firefox_output)
+> and [Figure 3](#f:notebook_output)).
+> We'll use a simple table-based display in these notes.
+> 
+>     <figure id="f:firefox_output">
+>       <img src="db/firefox_output.png" alt="Firefox SQLite Manager Output" />
+>       <figcaption>Figure 2: Firefox SQLite Manager Output</figcaption>
+>     </figure>
+> 
+>     <figure id="f:notebook_output">
+>       <img src="db/notebook_output.png" alt="IPython Notebook Database Extension Output" />
+>       <figcaption>Figure 3: IPython Notebook Database Extension Output</figcaption>
+>     </figure>
+
+Going back to our query,
+it's important to understand that
+the rows and columns in a database table aren't actually stored in any particular order.
+They will always be *displayed* in some order,
+but we can control that in various ways.
+For example,
+we could swap the columns in the output by writing our query as:
+  
+``` {.sql}
+select personal, family from Person;
+```
+
+<div class="db">
+
+--------------------    --------------------
+William                 Dyer
+Frank                   Pabodie
+Anderson                Lake
+Valentina               Roerich
+Frank                   Danforth
+--------------------    --------------------
+
+</div>
+  
+or even repeat columns:
+  
+``` {.sql}
+select ident, ident, ident from Person;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+dyer                    dyer                    dyer
+pb                      pb                      pb
+lake                    lake                    lake
+roe                     roe                     roe
+danforth                danforth                danforth
+--------------------    --------------------    --------------------
+
+</div>
+  
+We will see ways to rearrange the rows [later](#s:sort).
+  
+As a shortcut, we can select all of the columns in a table
+using the wildcard `*`:
+  
+``` {.sql}
+select * from Person;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+dyer                    William                 Dyer
+pb                      Frank                   Pabodie
+lake                    Anderson                Lake
+roe                     Valentina               Roerich
+danforth                Frank                   Danforth
+--------------------    --------------------    --------------------
+
+</div>
+  
+### Summary {.keypoints}
+
+* A relational database stores information in tables with fields and records.
+* A database manager is a program that manipulates a database.
+* The commands or queries given to a database manager are usually written in a specialized language called SQL.
+* SQL is case insensitive.
+* The rows and columns of a database table aren't stored in any particular order.
+* Use `select *fields* from *table*` to get all the values for specific fields from a single table.
+* Use `select * from *table*` to select everything from a table.
+
+### Challenges {.challenges}
+
+* Write a query that selects only site names from the `Site` table.
+
+* Many people format queries as:
+
+    ```
+    SELECT personal, family FROM person;
+    ```
+
+    or as:
+
+    ```
+    select Personal, Family from PERSON;
+    ```
+
+    What style do you find easiest to read, and why?
+
+## Removing Duplicates {#s:distinct}
+
+### Learning Objectives {.objectives}
+
+* Write queries that only display distinct results once.
+
+Duration: 5 minutes.
+
+### Lesson
+
+Data is often redundant,
+so queries often return redundant information.
+For example,
+if we select the quantitites that have been measured
+from the `survey` table,
+we get this:
+  
+``` {.sql}
+select quant from Survey;
+```
+
+<div class="db">
+
+--------------------
+rad
+sal
+rad
+sal
+rad
+sal
+temp
+rad
+sal
+temp
+rad
+temp
+sal
+rad
+sal
+temp
+sal
+rad
+sal
+sal
+rad
+--------------------
+
+</div>
+  
+We can eliminate the redundant output
+to make the result more readable
+by adding the `distinct` keyword
+to our query:
+  
+``` {.sql}
+select distinct quant from Survey;
+```
+
+<div class="db">
+
+--------------------
+rad
+sal
+temp
+--------------------
+
+</div>
+  
+If we select more than one column&mdash;for example,
+both the survey site ID and the quantity measured&mdash;then
+the distinct pairs of values are returned:
+  
+``` {.sql}
+select distinct taken, quant from Survey;
+```
+
+<div class="db">
+
+--------------------    --------------------
+619                     rad
+619                     sal
+622                     rad
+622                     sal
+734                     rad
+734                     sal
+734                     temp
+735                     rad
+735                     sal
+735                     temp
+751                     rad
+751                     temp
+751                     sal
+752                     rad
+752                     sal
+752                     temp
+837                     rad
+837                     sal
+844                     rad
+--------------------    --------------------
+
+</div>
+  
+Notice in both cases that duplicates are removed
+even if they didn't appear to be adjacent in the database.
+Again,
+it's important to remember that rows aren't actually ordered:
+they're just displayed that way.
+  
+### Summary {.keypoints}
+
+* Use `distinct` to eliminate duplicates from a query's output.
+
+### Challenges {.challenges}
+
+* Write a query that selects distinct dates from the `Site` table.
+
+* If you are using SQLite from the command line,
+  you can run a single query by passing it to the interpreter
+  right after the path to the database file:
+
+    ```
+    $ sqlite3 survey.db 'select * from Person;'
+    ```
+
+    <div class="db">
+
+    --------------------    --------------------    --------------------
+    dyer                    William                 Dyer
+    pb                      Frank                   Pabodie
+    lake                    Anderson                Lake
+    roe                     Valentina               Roerich
+    danforth                Frank                   Danforth
+    --------------------    --------------------    --------------------
+
+    </div>
+
+    Fill in the missing commands in the pipeline below
+    so that the output contains no redundant values.
+
+    ```
+    $ sqlite3 survey.db 'select person, quant from Survey;' | ____ | ____
+    ```
+
+    Do you think this is less efficient, just as efficient, or more efficient
+    that using `distinct` for large data?
+
+## Filtering {#s:filter}
+
+### Learning Objectives {.objectives}
+
+* Write queries that select records based on the values of their fields.
+* Write queries that select records using combinations of several tests on their fields' values.
+* Build up complex filtering criteria incrementally.
+* Explain the logical order in which filtering by field value and displaying fields takes place.
+
+Duration: 5-10 minutes.
+
+### Lesson
+
+One of the most powerful features of a database is
+the ability to [filter](../gloss.html#filter) data,
+i.e.,
+to select only those records that match certain criteria.
+For example,
+suppose we want to see when a particular site was visited.
+We can select these records from the `Visited` table
+by using a `where` clause in our query:
+  
+``` {.sql}
+select * from Visited where site='DR-1';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+619                     DR-1                    1927-02-08
+622                     DR-1                    1927-02-10
+844                     DR-1                    1932-03-22
+--------------------    --------------------    --------------------
+
+</div>
+  
+The database manager executes this query in two stages
+([Figure 4](#f:pipeline_where)).
+First,
+it checks at each row in the `Visited` table
+to see which ones satisfy the `where`.
+It then uses the column names following the `select` keyword
+to determine what columns to display.
+  
+  <figure id="f:pipeline_where">
+    <img src="db/pipeline_where.png" alt="Two-Stage Query Processing Pipeline" />
+    <figcaption>Figure 4: Two-Stage Query Processing Pipeline</figcaption>
+  </figure>
+
+This processing order means that
+we can filter records using `where`
+based on values in columns that aren't then displayed:
+  
+``` {.sql}
+select ident from Visited where site='DR-1';
+```
+
+<div class="db">
+
+--------------------
+619
+622
+844
+--------------------
+
+</div>
+  
+We can use many other Boolean operators to filter our data.
+For example,
+we can ask for all information from the DR-1 site collected since 1930:
+  
+``` {.sql}
+select * from Visited where (site='DR-1') and (dated>='1930-00-00');
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+844                     DR-1                    1932-03-22
+--------------------    --------------------    --------------------
+
+</div>
+  
+(The parentheses around the individual tests aren't strictly required,
+but they help make the query easier to read.)
+  
+> ### Working With Dates {#a:dates .box}
+> 
+> Most database managers have a special data type for dates.
+> In fact, many have two:
+> one for dates,
+> such as "May 31, 1971",
+> and one for durations,
+> such as "31 days".
+> SQLite doesn't:
+> instead,
+> it stores dates as either text
+> (in the ISO-8601 standard format "YYYY-MM-DD HH:MM:SS.SSSS"),
+> real numbers
+> (the number of days since November 24, 4714 BCE),
+> or integers
+> (the number of seconds since midnight, January 1, 1970).
+> If this sounds complicated,
+> it is,
+> but not nearly as complicated as figuring out
+> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar).
+
+If we want to find out what measurements were taken by either Lake or Roerich,
+we can combine the tests on their names using `or`:
+  
+``` {.sql}
+select * from Survey where person='lake' or person='roe';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+734                     lake                    sal                     0.05
+751                     lake                    sal                     0.1
+752                     lake                    rad                     2.19
+752                     lake                    sal                     0.09
+752                     lake                    temp                    -16.0
+752                     roe                     sal                     41.6
+837                     lake                    rad                     1.46
+837                     lake                    sal                     0.21
+837                     roe                     sal                     22.5
+844                     roe                     rad                     11.25
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+Alternatively,
+we can use `in` to see if a value is in a specific set:
+  
+``` {.sql}
+select * from Survey where person in ('lake', 'roe');
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+734                     lake                    sal                     0.05
+751                     lake                    sal                     0.1
+752                     lake                    rad                     2.19
+752                     lake                    sal                     0.09
+752                     lake                    temp                    -16.0
+752                     roe                     sal                     41.6
+837                     lake                    rad                     1.46
+837                     lake                    sal                     0.21
+837                     roe                     sal                     22.5
+844                     roe                     rad                     11.25
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+We can combine `and` with `or`,
+but we need to be careful about which operator is executed first.
+If we *don't* use parentheses,
+we get this:
+  
+``` {.sql}
+select * from Survey where quant='sal' and person='lake' or person='roe';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+734                     lake                    sal                     0.05
+751                     lake                    sal                     0.1
+752                     lake                    sal                     0.09
+752                     roe                     sal                     41.6
+837                     lake                    sal                     0.21
+837                     roe                     sal                     22.5
+844                     roe                     rad                     11.25
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+which is salinity measurements by Lake,
+and *any* measurement by Roerich.
+We probably want this instead:
+  
+``` {.sql}
+select * from Survey where quant='sal' and (person='lake' or person='roe');
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+734                     lake                    sal                     0.05
+751                     lake                    sal                     0.1
+752                     lake                    sal                     0.09
+752                     roe                     sal                     41.6
+837                     lake                    sal                     0.21
+837                     roe                     sal                     22.5
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+Finally,
+we can use `distinct` with `where`
+to give a second level of filtering:
+  
+``` {.sql}
+select distinct person, quant from Survey where person='lake' or person='roe';
+```
+
+<div class="db">
+
+--------------------    --------------------
+lake                    sal
+lake                    rad
+lake                    temp
+roe                     sal
+roe                     rad
+--------------------    --------------------
+
+</div>
+  
+But remember:
+`distinct` is applied to the values displayed in the chosen columns,
+not to the entire rows as they are being processed.
+  
+> ### Growing Queries {.box}
+> 
+> What we have just done is how most people "grow" their SQL queries.
+> We started with something simple that did part of what we wanted,
+> then added more clauses one by one,
+> testing their effects as we went.
+> This is a good strategy&mdash;in fact,
+> for complex queries it's often the *only* strategy&mdash;but
+> it depends on quick turnaround,
+> and on us recognizing the right answer when we get it.
+>     
+> The best way to achieve quick turnaround is often
+> to put a subset of data in a temporary database
+> and run our queries against that,
+> or to fill a small database with synthesized records.
+> For example,
+> instead of trying our queries against an actual database of 20 million Australians,
+> we could run it against a sample of ten thousand,
+> or write a small program to generate ten thousand random (but plausible) records
+> and use that.
+
+### Summary {.keypoints}
+
+* Use `where *test*` in a query to filter records based on Boolean tests.
+* Use `and` and `or` to combine tests.
+* Use `in` to check if a value is in a set.
+* Build up queries a bit at a time, and test them against small data sets.
+
+### Challenges {.challenges}
+
+* Gina wants to select all sites that lie within 30&deg; of the equator.
+  Her query is:
+
+    ``` {.sql}
+    select * from Site where (lat > -30) or (lat < 30);
+    ``` {.sql}
+
+    Explain why this is wrong,
+    and rewrite the query so that it is correct.
+
+* Normalized salinity readings are supposed to be between 0.0 and 1.0.
+  Write a query that selects all records from `Survey`
+  with salinity values outside this range.
+
+* The SQL test `*column-name* like *pattern*`
+  is true if the value in the named column
+  matches the pattern given;
+  the character '%' can be used any number of times in the pattern
+  to mean "match zero or more characters".
+
+    Expression              Value
+    --------------------    --------------------
+    `'a' like 'a'`          `True`
+    `'a' like '%a'`         `True`
+    `'b' like '%a'`         `False`
+    `'alpha' like 'a%'`     `True`
+    `'alpha' like 'a%p%'`   `True`
+    `'beta' like 'a%p%'`    `False`
+
+    The expression `*column-name* not like *pattern*`
+    inverts the test.
+    Using `like`,
+    write a query that finds all the records in `Visited`
+    that *aren't* from sites labelled 'DR-something'.
+
+## Calculating New Values {#s:calc}
+
+### Learning Objectives {.objectives}
+
+* Write queries that do arithmetic using the values in individual records.
+
+Duration: 5 minutes.
+
+### Lesson
+
+After carefully reading the expedition logs,
+Gina realizes that the radiation measurements they report
+may need to be corrected upward by 5%.
+Rather than modifying the stored data,
+she can do this calculation on the fly
+as part of her query:
+  
+``` {.sql}
+select 1.05 * reading from Survey where quant='rad';
+```
+
+<div class="db">
+
+--------------------
+10.311
+8.19
+8.8305
+7.581
+4.5675
+2.2995
+1.533
+11.8125
+--------------------
+
+</div>
+  
+When we run the query,
+the expression `1.05 * reading` is evaluated for each row.
+Expressions can use any of the fields,
+all of usual arithmetic operators,
+and a variety of common functions.
+(Exactly which ones depends on which database manager is being used.)
+For example,
+we can convert temperature readings from Fahrenheit to Celsius
+and round to two decimal places as follows:
+  
+``` {.sql}
+select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';
+```
+
+<div class="db">
+
+--------------------    --------------------
+734                     -29.72
+735                     -32.22
+751                     -28.06
+752                     -26.67
+--------------------    --------------------
+
+</div>
+  
+We can also combine values from different fields,
+for example by using the string concatenation operator `||`:
+  
+``` {.sql}
+select personal || ' ' || family from Person;
+```
+
+<div class="db">
+
+--------------------
+William Dyer
+Frank Pabodie
+Anderson Lake
+Valentina Roerich
+Frank Danforth
+--------------------
+
+</div>
+  
+> ### A Note on Names {.box}
+> 
+> It may seem strange to use `personal` and `family` as field names
+> instead of `first` and `last`,
+> but it's a necessary first step toward handling cultural differences.
+> For example,
+> consider the following rules:
+>     
+> Full Name                     Alphabetized Under      Reason
+> --------------------          --------------------    --------------------
+> Liu Xiaobo                    Liu                     Chinese family names come first
+> Leonardo da Vinci             Leonardo                "da Vinci" just means "from Vinci"
+> Catherine de Medici           Medici                  family name
+> Jean de La Fontaine           La Fontaine             family name is "La Fontaine"
+> Juan Ponce de Leon            Ponce de Leon           full family name is "Ponce de Leon"
+> Gabriel Garcia Marquez                Garcia Marquez          double-barrelled Spanish surnames
+> Wernher von Braun             von *or* Braun          depending on whether he was in Germany or the US
+> Elizabeth Alexandra May Windsor       Elizabeth               monarchs alphabetize by the name under which they reigned
+> Thomas a Beckett              Thomas                  and saints according to the names by which they were canonized
+>     
+> Clearly,
+> even a two-part division into "personal" and "family"
+> isn't enough...
+
+### Summary {.keypoints}
+
+* Use expressions as fields to calculate per-record values.</li>
+
+### Challenges {.challenges}
+
+* After further reading,
+  Gina realizes that Valentina Roerich
+  was reporting salinity as percentages.
+  Write a query that returns all of her salinity measurements
+  from the `Survey` table
+  with the values divided by 100.
+
+* The `union` operator combines the results of two queries:
+
+    ``` {.sql}
+    select * from Person where ident='dyer' union select * from Person where ident='roe';
+    ```
+
+    <div class="db">
+
+    --------------------    --------------------    --------------------
+    dyer                    William                 Dyer
+    roe                     Valentina                       Roerich
+    --------------------    --------------------    --------------------
+
+    </div>
+        
+    Use `union` to create a consolidated list of salinity measurements
+    in which Roerich's, and only Roerich's,
+    have been corrected as described in the previous challenge.
+    The output should be something like:
+
+    <div class="db">
+
+    --------------------    --------------------
+    619                     0.13
+    622                     0.09
+    734                     0.05
+    751                     0.1
+    752                     0.09
+    752                     0.416
+    837                     0.21
+    837                     0.225
+    --------------------    --------------------
+
+    </div>
+
+* The site identifiers in the `Visited` table have two parts
+  separated by a '-':
+
+    ``` {.sql}
+    select distinct site from Visited;
+    ```
+
+    <div class="db">
+
+    --------------------
+    DR-1
+    DR-3
+    MSK-4
+    --------------------
+
+    </div>
+  
+    Some major site identifiers are two letters long and some are three.
+    The "in string" function `instr(X, Y)`
+    returns the 1-based index of the first occurrence of string Y in string X,
+    or 0 if Y does not exist in X.
+    The substring function `substr(X, I)`
+    returns the substring of X starting at index I.
+    Use these two functions to produce a list of unique major site identifiers.
+    (For this data,
+    the list should contain only "DR" and "MSK").
+
+* Pabodie's journal notes that all his temperature measurements
+  are in &deg;F,
+  but Lake's journal does not report whether he used &deg;F or &deg;C.
+  How should Gina treat his measurements,
+  and why?
+
+## Ordering Results {#s:sort}
+
+### Learning Objectives {.objectives}
+
+* Write queries that order results according to fields' values.
+* Write queries that order results according to calculated values.
+* Explain why it is possible to sort records using the values of fields that are not displayed.
+
+Duration: 5 minutes.
+
+### Lesson
+
+As we mentioned earlier,
+database records are not stored in any particular order.
+This means that query results aren't necessarily sorted,
+and even if they are,
+we often want to sort them in a different way,
+e.g., by the name of the project instead of by the name of the scientist.
+We can do this in SQL by adding an `order by` clause to our query:
+  
+``` {.sql}
+select reading from Survey where quant='rad' order by reading;
+```
+
+<div class="db">
+
+--------------------
+1.46
+2.19
+4.35
+7.22
+7.8
+8.41
+9.82
+11.25
+--------------------
+
+</div>
+  
+By default,
+results are sorted in ascending order
+(i.e.,
+from least to greatest).
+We can sort in the opposite order using `desc` (for "descending"):
+  
+``` {.sql}
+select reading from Survey where quant='rad' order by reading desc;
+```
+
+<div class="db">
+
+--------------------
+11.25
+9.82
+8.41
+7.8
+7.22
+4.35
+2.19
+1.46
+--------------------
+
+</div>
+  
+(And if we want to make it clear that we're sorting in ascending order,
+we can use `asc` instead of `desc`.)
+  
+We can also sort on several fields at once.
+For example,
+this query sorts results first in ascending order by `taken`,
+and then in descending order by `person`
+within each group of equal `taken` values:
+    
+``` {.sql}
+select taken, person from Survey order by taken asc, person desc;
+```
+
+<div class="db">
+
+--------------------    --------------------
+619                     dyer
+619                     dyer
+622                     dyer
+622                     dyer
+734                     pb
+734                     pb
+734                     lake
+735                     pb
+735                     
+735                     
+751                     pb
+751                     pb
+751                     lake
+752                     roe
+752                     lake
+752                     lake
+752                     lake
+837                     roe
+837                     lake
+837                     lake
+844                     roe
+--------------------    --------------------
+
+</div>
+  
+This is easier to understand if we also remove duplicates:
+  
+``` {.sql}
+select distinct taken, person from Survey order by taken asc, person desc;
+```
+
+<div class="db">
+
+--------------------    --------------------
+619                     dyer
+622                     dyer
+734                     pb
+734                     lake
+735                     pb
+735                     
+751                     pb
+751                     lake
+752                     roe
+752                     lake
+837                     roe
+837                     lake
+844                     roe
+--------------------    --------------------
+
+</div>
+
+Since sorting happens before columns are filtered,
+we can sort by a field that isn't actually displayed:
+  
+``` {.sql}
+select reading from Survey where quant='rad' order by taken;
+```
+
+<div class="db">
+
+--------------------
+9.82
+7.8
+8.41
+7.22
+4.35
+2.19
+1.46
+11.25
+--------------------
+
+</div>
+  
+We can also sort results by the value of an expression.
+In SQLite,
+for example,
+the `random` function returns a pseudo-random integer
+each time it is called
+(i.e.,
+once per record):
+  
+``` {.sql}
+select random(), ident from Person;
+```
+
+<div class="db">
+
+--------------------    --------------------
+-6309766557809954936    dyer
+-2098461436941487136    pb
+-2248225962969032314    lake
+6062184424509295966     roe
+-1268956870222271271    danforth
+--------------------    --------------------
+
+</div>
+  
+So to randomize the order of our query results,
+e.g., when doing clinical trials,
+we can sort them by the value of this function:
+  
+``` {.sql}
+select ident from Person order by random();
+```
+
+<div class="db">
+
+--------------------
+danforth
+pb
+dyer
+lake
+roe
+--------------------
+
+</div>
+
+``` {.sql}
+select ident from Person order by random();
+```
+
+<div class="db">
+
+--------------------
+roe
+dyer
+pb
+lake
+danforth
+--------------------
+
+</div>
+  
+Our query pipeline now has four stages
+([Figure 5](#f:pipeline_sort_distinct)):
+  
+* Select the rows that pass the `where` criteria.
+* Sort them if required.
+* Filter the columns according to the `select` criteria.
+* Remove duplicates if required.
+
+  <figure id="f:pipeline_sort_distinct">
+    <img src="db/pipeline_sort_distinct.png" alt="Four-Stage Query Processing Pipeline" />
+    <figcaption>Figure 5: Four-Stage Query Processing Pipeline</figcaption>
+  </figure>
+
+### Summary {.keypoints}
+
+* Use `order by` (with `asc` or `desc`) to order a query's results.
+* Use `random` to generate pseudo-random numbers.
+
+### Challenges {.challenges}
+
+* Create a list of sites identifiers
+  and their distance from the equator in kilometers,
+  sorted from furthest to closest.
+  (A degree of latitude corresponds to 111.12 km.)
+
+* Gina needs a list of radiation measurements from all sites
+  sorted by when they were taken.
+  The query:
+
+    ``` {.sql}
+    select * from Survey where quant='rad' order by taken;
+    ```
+
+    produces the correct answer for the data used in our examples.
+    Explain when and why it might produce the wrong answer.
+
+## Missing Data {#s:null}
+
+### Learning Objectives {.objectives}
+
+* Explain what databases use the special value `NULL` to represent.
+* Explain why databases should *not* uses their own special values (like 9999 or "N/A") to represent missing or unknown data.
+* Explain what atomic and aggregate calculations involving `NULL` produce, and why.
+* Write queries that include or exclude records containing `NULL`.
+
+Duration: 10-20 minutes
+(depending on whether or not the instructor includes an anecdote about
+what happens when you *don't* take missing data into account).
+
+### Lesson
+
+Real-world data is never complete&mdash;there are always holes.
+Databases represent these holes using special value called `null`.
+`null` is not zero, `False`, or the empty string;
+it is a one-of-a-kind value that means "nothing here".
+Dealing with `null` requires a few special tricks
+and some careful thinking.
+  
+To start,
+let's have a look at the `Visited` table.
+There are eight records,
+but #752 doesn't have a date&mdash;or rather,
+its date is null:
+  
+``` {.sql}
+select * from Visited;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+619                     DR-1                    1927-02-08
+622                     DR-1                    1927-02-10
+734                     DR-3                    1939-01-07
+735                     DR-3                    1930-01-12
+751                     DR-3                    1930-02-26
+752                     DR-3                    
+837                     MS-4                    1932-01-14
+844                     DR-1                    1932-03-22
+--------------------    --------------------    --------------------
+
+</div>
+  
+> ### Displaying Nulls {.box}
+> 
+> Different databases display nulls differently.
+> Unfortunately,
+> SQLite's default is to print nothing at all,
+> which makes nulls easy to overlook
+> (particularly if they're in the middle of a long row).
+
+Null doesn't behave like other values.
+If we select the records that come before 1930:
+  
+``` {.sql}
+select * from Visited where dated<'1930-00-00';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+619                     DR-1                    1927-02-08
+622                     DR-1                    1927-02-10
+--------------------    --------------------    --------------------
+
+</div>
+  
+we get two results,
+and if we select the ones that come during or after 1930:
+  
+``` {.sql}
+select * from Visited where dated>='1930-00-00';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+734                     DR-3                    1939-01-07
+735                     DR-3                    1930-01-12
+751                     DR-3                    1930-02-26
+837                     MS-4                    1932-01-14
+844                     DR-1                    1932-03-22
+--------------------    --------------------    --------------------
+
+</div>
+  
+we get five,
+but record #752 isn't in either set of results.
+The reason is that
+`null<'1930-00-00'`
+is neither true nor false:
+null means, "We don't know,"
+and if we don't know the value on the left side of a comparison,
+we don't know whether the comparison is true or false.
+Since databases represent "don't know" as null,
+the value of `null<'1930-00-00'`
+is actually `null`.
+`null>='1930-00-00'` is also null
+because we can't answer to that question either.
+And since the only records kept by a `where`
+are those for which the test is true,
+record #752 isn't included in either set of results.
+  
+Comparisons aren't the only operations that behave this way with nulls.
+`1+null` is `null`,
+`5*null` is `null`,
+`log(null)` is `null`,
+and so on.
+In particular,
+comparing things to null with = and != produces null:
+  
+``` {.sql}
+select * from Visited where dated=NULL;
+```
+
+``` {.sql}
+select * from Visited where dated!=NULL;
+```
+  
+To check whether a value is `null` or not,
+we must use a special test `is null`:
+  
+``` {.sql}
+select * from Visited where dated is NULL;
+```
+
+<div class="db">
+
+--------------------    --------------------
+752                     DR-3            
+--------------------    --------------------
+
+</div>
+  
+or its inverse `is not null`:
+  
+``` {.sql}
+select * from Visited where dated is not NULL;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+619                     DR-1                    1927-02-08
+622                     DR-1                    1927-02-10
+734                     DR-3                    1939-01-07
+735                     DR-3                    1930-01-12
+751                     DR-3                    1930-02-26
+837                     MS-4                    1932-01-14
+844                     DR-1                    1932-03-22
+--------------------    --------------------    --------------------
+
+</div>
+  
+Null values cause headaches wherever they appear.
+For example,
+suppose we want to find the all of salinity measurements
+that weren't taken by Dyer.
+It's natural to write the query like this:
+  
+``` {.sql}
+select * from Survey where quant='sal' and person!='lake';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+619                     dyer                    sal                     0.13
+622                     dyer                    sal                     0.09
+752                     roe                     sal                     41.6
+837                     roe                     sal                     22.5
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+but this query filters omits the records
+where we don't know who took the measurement.
+Once again,
+the reason is that when `person` is `null`,
+the `!=` comparison produces `null`,
+so the record isn't kept in our results.
+If we want to keep these records
+we need to add an explicit check:
+  
+``` {.sql}
+select * from Survey where quant='sal' and (person!='lake' or person is null);
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+619                     dyer                    sal                     0.13
+622                     dyer                    sal                     0.09
+735                                             sal                     0.06
+752                     roe                     sal                     41.6
+837                     roe                     sal                     22.5
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+
+  
+We still have to decide whether this is the right thing to do or not.
+If we want to be absolutely sure that
+we aren't including any measurements by Lake in our results,
+we need to exclude all the records for which we don't know who did the work.
+  
+> ### What Happens When You Forget {.box}
+> 
+> Several years ago,
+> I was helping a group who were looking at
+> the spread of drug-resistant tuberculosis (DRTB)
+> in industrialized countries.
+> In particular,
+> they wanted to know if it was spreading faster among less affluent people.
+>     
+> We tackled the problem by combining two data sets.
+> The first gave us skin and blood test results for DRTB
+> along with patients' postal codes
+> (the only identifying information we were allowed---we didn't even have gender).
+> The second was Canadian census data that gave us
+> median income per postal code.
+> Since a PC is about 300-800 people,
+> we felt justified in joining the first with the second
+> to estimate incomes for people with positive and negative test results.
+>     
+> To our surprise,
+> we didn't find a correlation between income and infection.
+> We were just about to publish when someone spotted the mistake I'd made.
+>     
+> Question: Who *doesn't* have a postal code?
+>     
+> Answer: Homeless people.
+>     
+> When I did the join,
+> I was throwing away homeless people,
+> which introduced a statistically significant error in my results.
+> But I couldn't just set the income of anyone without a postal code to zero,
+> because our sample included another set of people without postal codes:
+> 16-21 year olds whose addresses were suppressed
+> because they had tested positive for sexually-transmitted diseases.
+>     
+> At this point the problem is no longer a database issue,
+> but rather a question of statistics.
+> The takeaway is,
+> checking your queries when you're programming is as important as
+> checking your samples when you're doing chemistry.
+
+### Summary {.keypoints}
+
+* Use `null` in place of missing information.
+* Almost every operation involving `null` produces `null` as a result.
+* Test for nulls using `is null` and `is not null`.
+
+### Challenges {.challenges}
+
+* Write a query that sorts the records in `Visited` by date,
+  omitting entries for which the date is not known
+  (i.e., is null).
+
+* What do you expect the query:
+
+    ``` {.sql}
+    select * from Visited where dated in ('1927-02-08', null);
+    ```
+
+    to produce?
+    What does it actually produce?
+
+* Some database designers prefer to use
+  a [sentinel value](../gloss.html#sentinel-value)
+  to mark missing data rather than `null`.
+  For example,
+  they will use the date "0000-00-00" to mark a missing date,
+  or -1.0 to mark a missing salinity or radiation reading
+  (since actual readings cannot be negative).
+  What does this simplify?
+  What burdens or risks does it introduce?
+
+## Aggregation {#s:aggregate}
+
+### Learning Objectives {.objectives}
+
+* Write queries that combine values from many records to create a single aggregate value.
+* Write queries that put records into groups based on their values.
+* Write queries that combine values group by group.
+* Explain what is displayed for *unaggregated* fields when some fields are aggregated.
+
+Duration: 10 minutes.
+
+### Lesson
+
+Gina now wants to calculate ranges and averages for her data.
+She knows how to select all of the dates from the `Visited` table:
+  
+``` {.sql}
+select dated from Visited;
+```
+
+<div class="db">
+
+--------------------
+1927-02-08
+1927-02-10
+1939-01-07
+1930-01-12
+1930-02-26
+     
+1932-01-14
+1932-03-22
+--------------------
+
+</div>
+  
+but to combine them,
+she must use an [aggregation function](../gloss.html#aggregation-function)
+such as `min` or `max`.
+Each of these functions takes a set of records as input,
+and produces a single record as output:
+  
+``` {.sql}
+select min(dated) from Visited;
+```
+
+<div class="db">
+
+--------------------
+1927-02-08
+--------------------
+
+</div>
+
+``` {.sql}
+select max(dated) from Visited;
+```
+
+<div class="db">
+
+--------------------
+1939-01-07
+--------------------
+
+</div>
+  
+`min` and `max` are just two of
+the aggregation functions built into SQL.
+Three others are `avg`,
+`count`,
+and `sum`:
+  
+``` {.sql}
+select avg(reading) from Survey where quant='sal';
+```
+
+<div class="db">
+
+--------------------
+7.20333333333
+--------------------
+
+</div>
+
+``` {.sql}
+select count(reading) from Survey where quant='sal';
+```
+
+<div class="db">
+
+--------------------
+9
+--------------------
+
+</div>
+
+``` {.sql}
+select sum(reading) from Survey where quant='sal';
+```
+
+<div class="db">
+
+--------------------
+64.83
+--------------------
+
+</div>
+  
+We used `count(reading)` here,
+but we could just as easily have counted `quant`
+or any other field in the table,
+or even used `count(*)`,
+since the function doesn't care about the values themselves,
+just how many values there are.
+  
+SQL lets us do several aggregations at once.
+We can,
+for example,
+find the range of sensible salinity measurements:
+  
+``` {.sql}
+select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;
+```
+
+<div class="db">
+
+--------------------    --------------------
+0.05                    0.21
+--------------------    --------------------
+
+</div>
+  
+We can also combine aggregated results with raw results,
+although the output might surprise you:
+  
+``` {.sql}
+select person, count(*) from Survey where quant='sal' and reading<=1.0;
+```
+
+<div class="db">
+
+--------------------    --------------------
+lake                    7
+--------------------    --------------------
+
+</div>
+  
+Why does Lake's name appear rather than Roerich's or Dyer's?
+The answer is that when it has to aggregate a field,
+but isn't told how to,
+the database manager chooses an actual value from the input set.
+It might use the first one processed,
+the last one,
+or something else entirely.
+  
+Another important fact is that when there are no values to aggregate,
+aggregation's result is "don't know"
+rather than zero or some other arbitrary value:
+  
+``` {.sql}
+select person, max(reading), sum(reading) from Survey where quant='missing';
+```
+  
+One final important feature of aggregation functions is that
+they are inconsistent with the rest of SQL in a very useful way.
+If we add two values,
+and one of them is null,
+the result is null.
+By extension,
+if we use `sum` to add all the values in a set,
+and any of those values are null,
+the result should also be null.
+It's much more useful,
+though,
+for aggregation functions to ignore null values
+and only combine those that are non-null.
+This behavior lets us write our queries as:
+  
+``` {.sql}
+select min(dated) from Visited;
+```
+
+<div class="db">
+
+--------------------
+1927-02-08
+--------------------
+
+</div>
+  
+instead of always having to filter explicitly:
+  
+``` {.sql}
+select min(dated) from Visited where dated is not null;
+```
+
+<div class="db">
+
+--------------------
+1927-02-08
+--------------------
+
+</div>
+  
+### Summary {.keypoints}
+
+* Use aggregation functions like `sum` and `max` to combine query results.
+* Use `count` function to count the number of results.
+* If some fields are aggregated and others are not, the database manager chooses an arbitrary result for the unaggregated field.
+* Most aggregation functions skip nulls when combining values.
+
+### Challenges {.challenges}
+
+* How many temperature readings did Frank Pabodie record,
+  and what was their average value?
+
+* The average of a set of values is the sum of the values
+  divided by the number of values.
+  Does this mean that the `avg` function returns 2.0 or 3.0
+  when given the values 1.0, `null`, and 5.0?
+
+* Gina wants to calculate the difference between
+  each individual radiation reading
+  and the average of all the radiation readings.
+  She writes the query:
+
+    ``` {.sql}
+    select reading-avg(reading) from Survey where quant='rad';
+    ```
+
+    What does this actually produce, and why?
+
+* The function `group_concat(field, separator)`
+  concatenates all the values in a field
+  using the specified separator character
+  (or ',' if the separator isn't specified).
+  Use this to produce a one-line list of scientists' names,
+  such as:
+
+    ``` {.sql}
+    William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth
+    ```
+
+    Can you find a way to order the list by surname?
+
+## Grouping {#s:grouping}
+
+### Learning Objectives {.objectives}
+
+* Group results to be aggregated separately.
+* Explain when grouping occurs in the processing pipeline.
+
+Duration: 5 minutes.
+
+### Lesson
+
+Aggregating all records at once doesn't always make sense.
+For example,
+suppose Gina suspects that there is a systematic bias in her data,
+and that some scientists' radiation readings are higher than others.
+We know that this doesn't work:
+  
+``` {.sql}
+select person, count(reading), round(avg(reading), 2)
+from  Survey
+where quant='rad';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+roe                     8                       6.56
+--------------------    --------------------    --------------------
+
+</div>
+  
+because the database manager selects a single arbitrary scientist's name
+rather than aggregating separately for each scientist.
+Since there are only five scientists,
+she could write five queries of the form:
+  
+``` {.sql}
+select person, count(reading), round(avg(reading), 2)
+from  Survey
+where quant='rad'
+and   person='dyer';
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+dyer                    2                       8.81
+--------------------    --------------------    --------------------
+
+</div>
+  
+but this would be tedious,
+and if she ever had a data set with fifty or five hundred scientists,
+the chances of her getting all of those queries right is small.
+  
+What we need to do is
+tell the database manager to aggregate the hours for each scientist separately
+using a `group by` clause:
+  
+``` {.sql}
+select   person, count(reading), round(avg(reading), 2)
+from     Survey
+where    quant='rad'
+group by person;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+dyer                    2                       8.81
+lake                    2                       1.82
+pb                      3                       6.66
+roe                     1                       11.25
+--------------------    --------------------    --------------------
+
+</div>
+  
+`group by` does exactly what its name implies:
+groups all the records with the same value for the specified field together
+so that aggregation can process each batch separately.
+Since all the records in each batch have the same value for `person`,
+it no longer matters that the database manager
+is picking an arbitrary one to display
+alongside the aggregated `reading` values
+([Figure 6](#f:grouped_aggregation)).
+  
+  <figure id="f:grouped_aggregation">
+    <img src="db/grouped_aggregation.png" alt="Grouped Aggregation" />
+    <figcaption>Figure 6: Grouped Aggregation</figcaption>
+  </figure>
+
+Just as we can sort by multiple criteria at once,
+we can also group by multiple criteria.
+To get the average reading by scientist and quantity measured,
+for example,
+we just add another field to the `group by` clause:
+  
+``` {.sql}
+select   person, quant, count(reading), round(avg(reading), 2)
+from     Survey
+group by person, quant;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+                        sal                     1                       0.06
+                        temp                    1                       -26.0
+dyer                    rad                     2                       8.81
+dyer                    sal                     2                       0.11
+lake                    rad                     2                       1.82
+lake                    sal                     4                       0.11
+lake                    temp                    1                       -16.0
+pb                      rad                     3                       6.66
+pb                      temp                    2                       -20.0
+roe                     rad                     1                       11.25
+roe                     sal                     2                       32.05
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+Note that we have added `person` to the list of fields displayed,
+since the results wouldn't make much sense otherwise.
+  
+Let's go one step further and remove all the entries
+where we don't know who took the measurement:
+  
+``` {.sql}
+select   person, quant, count(reading), round(avg(reading), 2)
+from     Survey
+where    person is not null
+group by person, quant
+order by person, quant;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+dyer                    rad                     2                       8.81
+dyer                    sal                     2                       0.11
+lake                    rad                     2                       1.82
+lake                    sal                     4                       0.11
+lake                    temp                    1                       -16.0
+pb                      rad                     3                       6.66
+pb                      temp                    2                       -20.0
+roe                     rad                     1                       11.25
+roe                     sal                     2                       32.05
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+Looking more closely,
+this query:
+  
+* selected records from the `Survey` table
+  where the `person` field was not null;
+
+* grouped those records into subsets
+  so that the `person` and `quant` values in each subset
+  were the same;
+
+* ordered those subsets first by `person`,
+  and then within each sub-group by `quant`;
+  and
+
+* counted the number of records in each subset,
+  calculated the average `reading` in each,
+  and chose a `person` and `quant` value from each
+  (it doesn't matter which ones,
+  since they're all equal).
+
+Our query processing pipeline now looks like
+[Figure 7](#f:pipeline_grouping).
+  
+  <figure id="f:pipeline_grouping">
+    <img src="db/pipeline_grouping.png" alt="Query Processing Pipeline With Grouping" />
+    <figcaption>Figure 7: Query Processing Pipeline With Grouping</figcaption>
+  </figure>
+
+### Summary {.keypoints}
+
+* Use `group by` to group values for separate aggregation.
+
+### Challenges {.challenges}
+
+* Write a single query that finds the earliest and latest date
+  that each site was visited.
+
+* Show the records produced by each stage of
+  [Figure 7](#f:pipeline_grouping)
+  for the following query:
+
+    ``` {.sql}
+    select   min(reading), max(reading) from Survey
+    where    taken in (734, 735)
+    and      quant='temp'
+    group by taken, quant;
+    ```
+
+* How can the query in the previous challenge be simplified
+  without changing its result?
+
+## Combining Data {#s:join}
+
+### Learning Objectives {.objectives}
+
+* Explain what primary keys and foreign keys are.
+* Write queries that combine information from two or more tables by matching keys.
+* Write queries using aliases for table names.
+* Explain why the `tablename.fieldname` notation is needed when tables are joined.
+* Explain the logical sequence of operations that occurs when two or more tables are joined.
+
+Duration: 20 minutes (and expect to have to walk through an example step-by-step).
+
+### Lesson
+
+In order to submit her data to a web site
+that aggregates historical meteorological data,
+Gina needs to format it as
+latitude, longitude, date, quantity, and reading.
+However,
+her latitudes and longitudes are in the `Site` table,
+while the dates of measurements are in the `Visited` table
+and the readings themselves are in the `Survey` table.
+She needs to combine these tables somehow.
+
+The SQL command to do this is `join`.
+To see how it works,
+let's start by joining the `Site` and `Visited` tables:
+  
+``` {.sql}
+select * from Site join Visited;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------    --------------------    --------------------
+DR-1                    -49.85                  -128.57                 619                     DR-1                    1927-02-08
+DR-1                    -49.85                  -128.57                 622                     DR-1                    1927-02-10
+DR-1                    -49.85                  -128.57                 734                     DR-3                    1939-01-07
+DR-1                    -49.85                  -128.57                 735                     DR-3                    1930-01-12
+DR-1                    -49.85                  -128.57                 751                     DR-3                    1930-02-26
+DR-1                    -49.85                  -128.57                 752                     DR-3                    
+DR-1                    -49.85                  -128.57                 837                     MS-4                    1932-01-14
+DR-1                    -49.85                  -128.57                 844                     DR-1                    1932-03-22
+DR-3                    -47.15                  -126.72                 619                     DR-1                    1927-02-08
+DR-3                    -47.15                  -126.72                 622                     DR-1                    1927-02-10
+DR-3                    -47.15                  -126.72                 734                     DR-3                    1939-01-07
+DR-3                    -47.15                  -126.72                 735                     DR-3                    1930-01-12
+DR-3                    -47.15                  -126.72                 751                     DR-3                    1930-02-26
+DR-3                    -47.15                  -126.72                 752                     DR-3                    
+DR-3                    -47.15                  -126.72                 837                     MS-4                    1932-01-14
+DR-3                    -47.15                  -126.72                 844                     DR-1                    1932-03-22
+MS-4                    -48.87                  -123.4                  619                     DR-1                    1927-02-08
+MS-4                    -48.87                  -123.4                  622                     DR-1                    1927-02-10
+MS-4                    -48.87                  -123.4                  734                     DR-3                    1939-01-07
+MS-4                    -48.87                  -123.4                  735                     DR-3                    1930-01-12
+MS-4                    -48.87                  -123.4                  751                     DR-3                    1930-02-26
+MS-4                    -48.87                  -123.4                  752                     DR-3                    
+MS-4                    -48.87                  -123.4                  837                     MS-4                    1932-01-14
+MS-4                    -48.87                  -123.4                  844                     DR-1                    1932-03-22
+--------------------    --------------------    --------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+`join` creates
+the [cross product](../gloss.html#cross-product)
+of two tables,
+i.e.,
+it joins each record of one with each record of the other
+to give all possible combinations.
+Since there are three records in `Site`
+and eight in `Visited`,
+the join's output has 24 records.
+And since each table has three fields,
+the output has six fields.
+  
+What the join *hasn't* done is
+figure out if the records being joined have anything to do with each other.
+It has no way of knowing whether they do or not until we tell it how.
+To do that,
+we add a clause specifying that
+we're only interested in combinations that have the same site name:
+  
+``` {.sql}
+select * from Site join Visited on Site.name=Visited.site;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------    --------------------    --------------------
+DR-1                    -49.85                  -128.57                 619                     DR-1                    1927-02-08
+DR-1                    -49.85                  -128.57                 622                     DR-1                    1927-02-10
+DR-1                    -49.85                  -128.57                 844                     DR-1                    1932-03-22
+DR-3                    -47.15                  -126.72                 734                     DR-3                    1939-01-07
+DR-3                    -47.15                  -126.72                 735                     DR-3                    1930-01-12
+DR-3                    -47.15                  -126.72                 751                     DR-3                    1930-02-26
+DR-3                    -47.15                  -126.72                 752                     DR-3                    
+MS-4                    -48.87                  -123.4                  837                     MS-4                    1932-01-14
+--------------------    --------------------    --------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+`on` does the same job as `where`:
+it only keeps records that pass some test.
+(The difference between the two is that `on` filters records
+as they're being created,
+while `where` waits until the join is done
+and then does the filtering.)
+Once we add this to our query,
+the database manager throws away records
+that combined information about two different sites,
+leaving us with just the ones we want.
+  
+Notice that we used `table.field` to specify field names
+in the output of the join.
+We do this because tables can have fields with the same name,
+and we need to be specific which ones we're talking about.
+For example,
+if we joined the `person` and `visited` tables,
+the result would inherit a field called `ident`
+from each of the original tables.
+
+We can now use the same dotted notation
+to select the three columns we actually want
+out of our join:
+  
+``` {.sql}
+select Site.lat, Site.long, Visited.dated
+from   Site join Visited
+on     Site.name=Visited.site;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------
+-49.85                  -128.57                 1927-02-08
+-49.85                  -128.57                 1927-02-10
+-49.85                  -128.57                 1932-03-22
+-47.15                  -126.72                 
+-47.15                  -126.72                 1930-01-12
+-47.15                  -126.72                 1930-02-26
+-47.15                  -126.72                 1939-01-07
+-48.87                  -123.4                  1932-01-14
+--------------------    --------------------    --------------------
+
+</div>
+  
+If joining two tables is good,
+joining many tables must be better.
+In fact,
+we can join any number of tables
+simply by adding more `join` clauses to our query,
+and more `on` tests to filter out combinations of records
+that don't make sense:
+
+``` {.sql}
+select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
+from   Site join Visited join Survey
+on     Site.name=Visited.site
+and    Visited.ident=Survey.taken
+and    Visited.dated is not null;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------    --------------------
+-49.85                  -128.57                 1927-02-08              rad                     9.82
+-49.85                  -128.57                 1927-02-08              sal                     0.13
+-49.85                  -128.57                 1927-02-10              rad                     7.8
+-49.85                  -128.57                 1927-02-10              sal                     0.09
+-47.15                  -126.72                 1939-01-07              rad                     8.41
+-47.15                  -126.72                 1939-01-07              sal                     0.05
+-47.15                  -126.72                 1939-01-07              temp                    -21.5
+-47.15                  -126.72                 1930-01-12              rad                     7.22
+-47.15                  -126.72                 1930-01-12              sal                     0.06
+-47.15                  -126.72                 1930-01-12              temp                    -26.0
+-47.15                  -126.72                 1930-02-26              rad                     4.35
+-47.15                  -126.72                 1930-02-26              sal                     0.1
+-47.15                  -126.72                 1930-02-26              temp                    -18.5
+-48.87                  -123.4                  1932-01-14              rad                     1.46
+-48.87                  -123.4                  1932-01-14              sal                     0.21
+-48.87                  -123.4                  1932-01-14              sal                     22.5
+-49.85                  -128.57                 1932-03-22              rad                     11.25
+--------------------    --------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+<a id="a:keys"></a>
+We can tell which records from `Site`, `Visited`, and `Survey`
+correspond with each other
+because those tables contain
+[primary keys](../gloss.html#primary-key)
+and [foreign keys](../gloss.html#foreign-key).
+A primary key is a value,
+or combination of values,
+that uniquely identifies each record in a table.
+A foreign key is a value (or combination of values) from one table
+that identifies a unique record in another table.
+Another way of saying this is that
+a foreign key is the primary key of one table
+that appears in some other table.
+In our database,
+`Person.ident` is the primary key in the `Person` table,
+while `Survey.person` is a foreign key
+relating the `Survey` table's entries
+to entries in `Person`.
+
+Most database designers believe that
+every table should have a well-defined primary key.
+They also believe that this key should be separate from the data itself,
+so that if we ever need to change the data,
+we only need to make one change in one place.
+One easy way to do this is
+to create an arbitrary, unique ID for each record
+as we add it to the database.
+This is actually very common:
+those IDs have names like "student numbers" and "patient numbers",
+and they almost always turn out to have originally been
+a unique record identifier in some database system or other.
+As the query below demonstrates,
+SQLite automatically numbers records as they're added to tables,
+and we can use those record numbers in queries:
+  
+``` {.sql}
+select rowid, * from Person;
+```
+
+<div class="db">
+
+--------------------    --------------------    --------------------    --------------------
+1                       dyer                    William                 Dyer
+2                       pb                      Frank                   Pabodie
+3                       lake                    Anderson                Lake
+4                       roe                     Valentina               Roerich
+5                       danforth                Frank                   Danforth
+--------------------    --------------------    --------------------    --------------------
+
+</div>
+  
+### Summary {.keypoints}
+
+* Use `join` to create all possible combinations of records from two or more tables.
+* Use `join *tables* on *test*` to keep only those combinations that pass some test.
+* Use `*table*.*field*` to specify a particular field of a particular table.
+* Every record in a table should be uniquely identified by the value of its primary key.
+
+### Challenges {.challenges}
+
+* Write a query that lists all radiation readings from the DR-1 site.
+
+* Write a query that lists all sites visited by people named "Frank".
+
+* Describe in your own words what the following query produces:
+
+    ``` {.sql}
+    select Site.name from Site join Visited
+    on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';
+    ```
+
+* Why does the `Person` table have an `ident` field?
+  Why do we not just use scientists' names in the `Survey` table?
+
+* Why does the table `Site` exist?
+  Why didn't Gina just record latitudes and longitudes
+  directly in the `Visited` and `Survey` tables?
+
+## Creating and Modifying Tables {#s:create}
+
+### Learning Objectives {.box}
+
+* Write queries that create database tables with fields of common types.
+* Write queries that specify the primary and foreign key relationships of tables.
+* Write queries that specify whether field values must be unique and/or are allowed to be `null`.
+* Write queries that erase database tables.
+* Write queries that add records to database tables.
+* Write queries that delete specific records from tables.
+* Explain what referential integrity is, and how a database can become inconsistent as data is changed.
+
+Duration: 10 minutes.
+
+### Lesson
+
+So far we have only looked at how to get information out of a database,
+both because that is more frequent than adding information,
+and because most other operations only make sense
+once queries are understood.
+If we want to create and modify data,
+we need to know two other pairs of commands.
+  
+The first pair are `create table` and `drop table`.
+While they are written as two words,
+they are actually single commands.
+The first one creates a new table;
+its arguments are the names and types of the table's columns.
+For example,
+the following statements create the four tables in our survey database:
+  
+    create table Person(ident text, personal text, family text);
+    create table Site(name text, lat real, long real);
+    create table Visited(ident integer, site text, dated text);
+    create table Survey(taken integer, person text, quant real, reading real);
+
+We can get rid of one of our tables using:
+  
+    drop table Survey;
+
+Be very careful when doing this:
+most databases have some support for undoing changes,
+but it's better not to have to rely on it.
+  
+Different database systems support different data types for table columns,
+but most provide the following:
+
+<div class="db">
+
+--------------------    --------------------
+`integer`               A signed integer.
+`real`                  A floating point value.
+`text`                  A string.
+`blob`                  Any "binary large object" such as an image or audio file.
+--------------------    --------------------
+
+</div>
+
+Most databases also support Booleans and date/time values;
+SQLite uses the integers 0 and 1 for the former,
+and represents the latter as discussed [earlier](#a:dates).
+An increasing number of databases also support geographic data types,
+such as latitude and longitude.
+Keeping track of what particular systems do or do not offer,
+and what names they give different data types,
+is an unending portability headache.
+  
+When we create a table,
+we can specify several kinds of constraints on its columns.
+For example,
+a better definition for the `Survey` table would be:
+  
+    create table Survey(
+        taken   integer not null, -- where reading taken
+        person  text,             -- may not know who took it
+        quant   real not null,    -- the quantity measured
+        reading real not null,    -- the actual reading
+        primary key(taken, quant),
+        foreign key(taken) references Visited(ident),
+        foreign key(person) references Person(ident)
+    );
+
+Once again,
+exactly what constraints are avialable
+and what they're called
+depends on which database manager we are using.
+  
+Once tables have been created,
+we can add and remove records using our other pair of commands,
+`insert` and `delete`.
+The simplest form of `insert` statement lists values in order:
+  
+    insert into Site values('DR-1', -49.85, -128.57);
+    insert into Site values('DR-3', -47.15, -126.72);
+    insert into Site values('MSK-4', -48.87, -123.40);
+
+We can also insert values into one table directly from another:
+  
+    create table JustLatLong(lat text, long TEXT);
+    insert into JustLatLong select lat, long from site;
+
+Deleting records can be a bit trickier,
+because we have to ensure that the database remains internally consistent.
+If all we care about is a single table,
+we can use the `DELETE` command with a `WHERE` clause
+that matches the records we want to discard.
+For example,
+once we realize that Frank Danforth didn't take any measurements,
+we can remove him from the `Person` table like this:
+  
+    delete from Person where ident = "danforth";
+
+But what if we removed Anderson Lake instead?
+Our `Survey` table would still contain seven records
+of measurements he'd taken:
+  
+``` {.sql}
+select count(*) from Survey where person='lake';
+```
+
+<div class="db">
+
+--------------------
+7
+--------------------
+
+</div>
+  
+That's never supposed to happen:
+`Survey.person` is a foreign key into the `Person` table,
+and all our queries assume there will be a row in the latter
+matching every value in the former.
+  
+This problem is called [referential integrity](../gloss.html#referential-integrity):
+we need to ensure that all references between tables can always be resolved correctly.
+One way to do this is to delete all the records
+that use `'lake'` as a foreign key
+before deleting the record that uses it as a primary key.
+If our database manager supports it,
+we can automate this
+using [cascading delete](../gloss.html#cascading-delete).
+However,
+this technique is outside the scope of this chapter.
+  
+> ### Other Ways to Do It {#a:hybrid .box}
+> 
+> Many applications use a hybrid storage model
+> instead of putting everything into a database:
+> the actual data (such as astronomical images) is stored in files,
+> while the database stores the files' names,
+> their modification dates,
+> the region of the sky they cover,
+> their spectral characteristics,
+> and so on.
+> This is also how most music player software is built:
+> the database inside the application keeps track of the MP3 files,
+> but the files themselves live on disk.
+
+### Summary {.keypoints}
+
+* Use `create table *name*(...)` to create a table.
+* Use `drop table *name*` to erase a table.
+* Specify field names and types when creating tables.
+* Specify `primary key`, `foreign key`, `not null`, and other constraints when creating tables.
+* Use `insert into *table* values(...)` to add records to a table.
+* Use `delete from *table* where *test*` to erase records from a table.
+* Maintain referential integrity when creating or deleting information.
+
+### Challenges {.challenges}
+
+* Write an SQL statement to replace all uses of `null`
+  in `Survey.person`
+  with the string `'unknown'`.
+
+* One of Gina's colleagues has sent her a [CSV](../gloss.html#csv) file
+  containing temperature readings by Robert Olmstead,
+  which is formatted like this:
+
+    ```
+    Taken,Temp
+    619,-21.5
+    622,-15.5
+    ```
+
+    Write a small Python program that reads this file in
+    and prints out the SQL `insert` statements needed
+    to add these records to the survey database.
+    Note: you will need to add an entry for Olmstead
+    to the `Person` table.
+    If you are testing your program repeatedly,
+    you may want to investigate SQL's `insert or replace` command.
+
+* SQLite has several administrative commands that aren't part of the SQL standard.
+  One of them is `.dump`,
+  which prints the SQL commands needed to re-create the database.
+  Another is `.load`,
+  which reads a file created by `.dump` and restores the database.
+  A colleague of yours thinks that storing dump files (which are text) in version control
+  is a good way to track and manage changes to the database.
+  What are the pros and cons of this approach?
+
+<section id="s:transactions">
+
+## Transactions
+
+### Learning Objectives {.objectives}
+
+* Explain what a race condition is.
+* Explain why database operations sometimes have to be placed ina transaction to ensure correct behavior.
+* Explain what it means to commit a transaction.</li>
+
+Duration: 10 minutes.
+
+### Lesson
+
+Suppose we have another table in our database that shows
+which pieces of equipment have been borrowed by which scientists:
+  
+``` {.sql}
+select * from Equipment;
+```
+
+<div class="db">
+
+--------------------    --------------------
+dyer                    CX-211 oscilloscope
+pb                      Greenworth balance
+lake                    Cavorite damping plates
+--------------------    --------------------
+
+</div>
+    
+(We should actually give each piece of equipment a unique ID,
+and use that ID here instead of the full name,
+just as we created a separate table for scientists earlier in this chapter,
+but we will bend the rules for now.)
+If William Dyer gives the oscilloscope to Valentina Roerich,
+we need to execute two statements to update this table:
+  
+    delete from Equipment where person="dyer" and thing="CX-211 oscilloscope";
+    insert into Equipment values("roe", "CX-211 oscilloscope");
+
+This is all fine&mdash;unless our program happens to crash
+between the first statement and the second.
+If that happens,
+the `Equipment` table won't have a record for the oscilloscope at all.
+Such a crash may seem unlikely,
+but remember:
+if a computer can do two billion operations per second,
+that means there are two billion opportunities every second for something to go wrong.
+And if our operations take a long time to complete&mdash;as they will
+when we are working with large datasets,
+or when the database is being heavily used&mdash;the odds of failure increase.
+
+What we really want is a way to ensure that every operation is [ACID](../gloss.html#acid):
+[atomic](../gloss.html#atomic-operation) (i.e. indivisible),
+consistent, isolated, and durable.
+The precise meanings of these terms doesn't matter;
+what does is the notion that
+every logical operation on the database should either run to completion
+as if nothing else was going on at the same time,
+or fail without having any effect at all.
+
+The tool we use to ensure that this happens is called
+a [transaction](../gloss.html#transaction).
+Here's how we should actually write the statements
+to move the oscilloscope from one person to another:
+  
+    begin transaction;
+    delete from Equipment where person="dyer" and thing="CX-211 oscilloscope";
+    insert into Equipment values("roe", "CX-211 oscilloscope");
+    end transaction;
+
+The database manager treats everything in the transaction as one large statement.
+If anything goes wrong inside,
+then none of the changes made in the transaction will actually be written to the database&mdash;it
+will be as if the transaction had never happened.
+Changes are only stored permanently
+when we [commit](../gloss.html#commit) them at the end of the transaction.
+  
+> ### Transactions and Commits {.box}
+> 
+> We first used the term "transaction" in
+> [our discussion of version control](svn.html#b:basics:transaction).
+> That's not a coincidence:
+> behind the scenes,
+> tools like Subversion are using many of the same algorithms as database managers
+> to ensure that either everything happens consistently
+> or nothing happens at all.
+> We [use the term "commit"](svn.html#a:commit) for the same reason:
+> just as our changes to local files aren't written back to the version control repository
+> until we commit them,
+> our (apparent) changes to a database aren't written to disk
+> until we say so.
+
+Transactions serve another purpose as well.
+Suppose there is another table in the database called `Exposure`
+that records the number of days each scientist was exposed to
+higher-than-normal levels of radiation:
+  
+``` {.sql}
+select * from Exposure;
+```
+
+<div class="db">
+
+--------------------    --------------------
+pb                      4
+dyer                    1
+lake                    5
+--------------------    --------------------
+
+</div>
+  
+After going through the journal entries for 1932,
+Gina wants to add two days to Lake's count:
+
+``` {.sql}  
+update Exposure set days = days + 2 where person='lake';
+```
+
+However,
+her labmate has been doing through the journal entries for 1933
+to help Gina meet a paper deadline.
+At the same moment as Gina runs her command,
+her labmate runs this
+to add one more day to Lake's exposure:
+
+``` {.sql}
+update Exposure set days = days + 1 where person='lake';
+```
+
+After both operations have completed,
+the database should show that Lake was exposed for eight days
+(the original five, plus two from Gina, plus one from her labmate).
+However,
+there is a small chance that it won't.
+To see why,
+let's break the two queries into their respective read and write steps
+and place them side by side:
+
+--------------------            --------------------  
+`X = read Exposure('lake', __)` `Y = read Exposure('lake', __)`
+`write Exposure('lake', X+2)`   `write Exposure('lake', Y+1)`
+--------------------            --------------------  
+  
+The database can only actually do one thing at once,
+so it must put these four operations into some sequential order.
+That order has to respect the original order within each column,
+but the database can interleave the two columns any way it wants.
+If it orders them like this:
+  
+--------------------            --------------------  
+`X = read Exposure('lake', __)` `X` is 5
+`write Exposure('lake', X+2)`   database contains 7
+`Y = read Exposure('lake', __)` `Y` is 7
+`write Exposure('lake', Y+1)`   database contains 8
+--------------------            --------------------  
+  
+then all is well.
+But what if it interleaves the operations like this:
+  
+--------------------            --------------------  
+`X = read Exposure('lake', __)` `X` is 5
+`Y = read Exposure('lake', __)` `Y` is 5
+`write Exposure('lake', X+2)`   database contains 7
+`write Exposure('lake', Y+1)`   database contains 6
+--------------------            --------------------  
+  
+This ordering puts the initial value, 5, into both `X` and `Y`.
+It then writes 7 back to the database (the third statement),
+and then overwrites that with 6,
+since `Y` holds 5.
+  
+This is called a [race condition](../gloss.html#race-condition),
+since the final result depends on a race between the two operations.
+Race conditions are part of what makes programming large systems with many components a nightmare:
+they are difficult to spot in advance
+(since they are caused by the interactions between components,
+rather than by anything in any one of those components),
+and can be almost impossible to debug
+(since they usually occur intermittently and infrequently).
+
+Transactions come to our rescue once again.
+If Gina and her labmate put their statements in transactions,
+the database will act as if it executed all of one and then all of the other.
+Whether or not it *actually* does this is up to whoever wrote the database manager:
+modern databases use very sophisticated algorithms to determine
+which operations actually have to be run sequentially,
+and which can safely be run in parallel to improve performance.
+The key thing is that
+every transaction will appear to have had the entire database to itself.
+  
+### Summary {.keypoints}
+
+* Place operations in a transaction to ensure that they appear to be atomic, consistent, isolated, and durable.
+
+### Challenges {.challenges}
+
+* A friend of yours manages a database of aerial photographs.
+  New records are added all the time,
+  but existing records are never modified or updated.
+  Your friend claims that because of this,
+  he doesn't need to put his queries in transactions.
+  Is he right or wrong, and why?
+
+## Programming With Databases {#s:programming}
+
+### Learning Objectives {.objectives}
+
+* Write a Python program that queries a database and processes the results.
+* Explain what an SQL injection attack is.
+* Write a program that safely interpolates values into queries.
+
+Duration: 20 minutes.
+
+### Lesson
+
+To end this chapter,
+let's have a look at how to access a database from
+a general-purpose programming language like Python.
+Other languages use almost exactly the same model:
+library and function names may differ,
+but the concepts are the same.
+
+Here's a short Python program that selects latitudes and longitudes
+from an SQLite database stored in a file called `survey.db`:
+
+``` {.python}  
+import sqlite3
+connection = sqlite3.connect("survey.db")
+cursor = connection.cursor()
+cursor.execute("select site.lat, site.long from site;")
+results = cursor.fetchall()
+for r in results:
+    print r
+cursor.close()
+connection.close()
+```
+
+The program starts by importing the `sqlite3` library.
+If we were connecting to MySQL, DB2, or some other database,
+we would import a different library,
+but all of them provide the same functions,
+so that the rest of our program does not have to change
+(at least, not much)
+if we switch from one database to another.
+
+Line 2 establishes a connection to the database.
+Since we're using SQLite,
+all we need to specify is the name of the database file.
+Other systems may require us to provide a username and password as well.
+Line 3 then uses this connection to create
+a [cursor](../gloss.html#cursor);
+just like the cursor in an editor,
+its role is to keep track of where we are in the database.
+
+On line 4, we use that cursor to ask the database to execute a query for us.
+The query is written in SQL,
+and passed to `cursor.execute` as a string.
+It's our job to make sure that SQL is properly formatted;
+if it isn't,
+or if something goes wrong when it is being executed,
+the database will report an error.
+
+The database returns the results of the query to us
+in response to the `cursor.fetchall` call on line 5.
+This result is a list with one entry for each record in the result set;
+if we loop over that list (line 6) and print those list entries (line 7),
+we can see that each one is a tuple
+with one element for each field we asked for.
+
+Finally, lines 8 and 9 close our cursor and our connection,
+since the database can only keep a limited number of these open at one time.
+Since establishing a connection takes time,
+though,
+we shouldn't open a connection,
+do one operation,
+then close the connection,
+only to reopen it a few microseconds later to do another operation.
+Instead,
+it's normal to create one connection that stays open for the lifetime of the program.
+  
+> ### What Are The u's For? {.box}
+> 
+> You may have noticed that
+> each of the strings in our output has a lower-case 'u' in front of it.
+> That is Python's way of telling us that the string is stored in
+> [Unicode](../gloss.html#unicode).
+
+Queries in real applications will often depend on values provided by users.
+For example,
+a program might take a user ID as a command-line parameter
+and display the user's full name:
+
+``` {.python}  
+import sys
+import sqlite3
+
+query = "select personal, family from Person where ident='%s';"
+user_id = sys.argv[1]
+
+connection = sqlite3.connect("survey.db")
+cursor = connection.cursor()
+
+cursor.execute(query % user_id)
+results = cursor.fetchall()
+print results[0][0], results[0][1]
+
+cursor.close()
+connection.close()
+```
+
+The variable `query` holds the statement we want to execute
+with a `%s` format string where we want to insert
+the ID of the person we're looking up.
+It seems simple enough,
+but what happens if someone gives the program this input?
+  
+    dyer"; drop table Survey; select "
+
+It looks like there's garbage after the name of the project,
+but it is very carefully chosen garbage.
+If we insert this string into our query,
+the result is:
+
+``` {.sql}  
+select personal, family from Person where ident='dyer'; drop table Survey; select '';
+```
+
+Whoops:
+if we execute this,
+it will erase one of the tables in our database.
+  
+This technique is called [SQL injection](../gloss.html#sql-injection),
+and it has been used to attack thousands of programs over the years.
+In particular,
+many web sites that take data from users insert values directly into queries
+without checking them carefully first.
+  
+Since a villain might try to smuggle commands into our queries in many different ways,
+the safest way to deal with this threat is
+to replace characters like quotes with their escaped equivalents,
+so that we can safely put whatever the user gives us inside a string.
+We can do this by using a [prepared statement](../gloss.html#prepared-statement)
+instead of formatting our statements as strings.
+Here's what our example program looks like if we do this:
+
+``` {.python}
+import sys
+import sqlite3
+
+query = "select personal, family from Person where ident=?;"
+user_id = sys.argv[1]
+
+connection = sqlite3.connect("survey.db")
+cursor = connection.cursor()
+
+cursor.execute(query, [user_id])
+results = cursor.fetchall()
+print results[0][0], results[0][1]
+
+cursor.close()
+connection.close()
+```
+
+The key changes are in the query string and the `execute` call.
+Instead of formatting the query ourselves,
+we put question marks in the query template where we want to insert values.
+When we call `execute`,
+we provide a list
+that contains as many values as there are question marks in the query.
+The library matches values to question marks in order,
+and translates any special characters in the values
+into their escaped equivalents
+so that they are safe to use.
+
+### Summary {.keypoints}
+
+* Most applications that use databases embed SQL in a general-purpose programming language.
+* Database libraries use connections and cursors to manage interactions.
+* Programs can fetch all results at once, or a few results at a time.
+* If queries are constructed dynamically using input from users, malicious users may be able to inject their own commands into the queries.
+* Dynamically-constructed queries can use SQL's native formatting to safeguard against such attacks.
+
+### Challenges {.challenges}
+
+* Write a Python program that creates a new database
+  in a file called `original.db`
+  containing a single table called `Pressure`,
+  with a single field called `reading`,
+  and inserts 100,000 random numbers between 10.0 and 25.0.
+  How long does it take this program to run?
+  How long does it take to run a program
+  that simply writes those random numbers to a file?
+
+* Write a Python program that creates a new database
+  called `backup.db`
+  with the same structure as `original.db`
+  and copies all the values greater than 20.0
+  from `original.db` to `backup.db`.
+  Which is faster:
+  filtering values in the query,
+  or reading everything into memory and filtering in Python?
+
+## Summing Up {#s:summary}
+
+There are many things databases can't do,
+or can't do well
+(which is why we have general-purpose programming languages like Python).
+However,
+they are still the best tool available
+for managing many kinds of complex, structured data.
+Thousands of programmer-years have gone into their design and implementation
+so that they can handle very large datasets&mdash;terabytes or more&mdash;quickly and reliably.
+Queries allow for great flexibility in how you are able to analyze your data,
+which makes databases a good choice when you are exploring data.
diff --git a/sql/novice/sqlitemagic.py b/sql/novice/sqlitemagic.py
new file mode 100644 (file)
index 0000000..4a1e83d
--- /dev/null
@@ -0,0 +1,49 @@
+"""sqlitemagic provices a simple magic for interacting with SQLite
+databases stored on disk.
+
+Usage:
+
+%%sqlite filename.db
+select personal, family from person;
+
+produces:
+
+Alan|Turing
+Grace|Hopper
+"""
+
+# This file is copyright 2013 by Greg Wilson: see
+# https://github.com/gvwilson/sqlitemagic/blob/master/LICENSE
+# for the license.
+# Inspired by https://github.com/tkf/ipython-sqlitemagic.
+
+import sqlite3
+from IPython.core.magic import Magics, magics_class, cell_magic
+from IPython.display import display, HTML
+
+@magics_class
+class SqliteMagic(Magics):
+    '''Provide the 'sqlite' calling point.'''
+
+    @cell_magic
+    def sqlite(self, filename, query):
+        connection = sqlite3.connect(filename)
+        cursor = connection.cursor()
+        try:
+            cursor.execute(query)
+            results = cursor.fetchall()
+            display(HTML(self.tablify(results)))
+        except Exception, e:
+            import sys
+            print >> sys.stderr, "exception", e
+        cursor.close()
+        connection.close()
+
+    def tablify(self, rows):
+        return '<table>\n' + '\n'.join(self.rowify(r) for r in rows) + '\n</table>'
+
+    def rowify(self, row):
+        return '<tr>' + ''.join('<td>' + str(r) + '</td>' for r in row) + '</tr>'
+
+def load_ipython_extension(ipython):
+    ipython.register_magics(SqliteMagic)
diff --git a/sql/novice/survey.db b/sql/novice/survey.db
new file mode 100644 (file)
index 0000000..a8abbf4
Binary files /dev/null and b/sql/novice/survey.db differ
diff --git a/sql/novice/survey.sql b/sql/novice/survey.sql
new file mode 100644 (file)
index 0000000..a1493a2
--- /dev/null
@@ -0,0 +1,370 @@
+-- The `Person` table is used to explain the most basic queries.
+-- Note that `danforth` has no measurements.
+create table Person(
+       ident    text,
+       personal text,
+       family   text
+);
+
+insert into Person values('dyer',     'William',   'Dyer');
+insert into Person values('pb',       'Frank',     'Pabodie');
+insert into Person values('lake',     'Anderson',  'Lake');
+insert into Person values('roe',      'Valentina', 'Roerich');
+insert into Person values('danforth', 'Frank',     'Danforth');
+
+-- The `Site` table is equally simple.  Use it to explain the
+-- difference between databases and spreadsheets: in a spreadsheet,
+-- the lat/long of measurements would probably be duplicated.
+create table Site(
+       name text,
+       lat  real,
+       long real
+);
+
+insert into Site values('DR-1', -49.85, -128.57);
+insert into Site values('DR-3', -47.15, -126.72);
+insert into Site values('MSK-4', -48.87, -123.40);
+
+-- `Visited` is an enhanced `join` table: it connects to the lat/long
+-- of specific measurements, and also provides their dates.
+-- Note that #752 is missing a date; we use this to talk about NULL.
+create table Visited(
+       ident integer,
+       site  text,
+       dated text
+);
+
+insert into Visited values(619, 'DR-1',  '1927-02-08');
+insert into Visited values(622, 'DR-1',  '1927-02-10');
+insert into Visited values(734, 'DR-3',  '1939-01-07');
+insert into Visited values(735, 'DR-3',  '1930-01-12');
+insert into Visited values(751, 'DR-3',  '1930-02-26');
+insert into Visited values(752, 'DR-3',  null);
+insert into Visited values(837, 'MSK-4', '1932-01-14');
+insert into Visited values(844, 'DR-1',  '1932-03-22');
+
+-- The `Survey` table is the actual readings.  Join it with `Site` to
+-- get lat/long, and with `Visited` to get dates (except for #752).
+-- Note that Roerich's salinity measurements are an order of magnitude
+-- too large (use this to talk about data cleanup).  Note also that
+-- there are two cases where we don't know who took the measurement,
+-- and that in most cases we don't have an entry (null or not) for the
+-- temperature.
+create table Survey(
+       taken   integer,
+       person  text,
+       quant   real,
+       reading real
+);
+
+insert into Survey values(619, 'dyer', 'rad',    9.82);
+insert into Survey values(619, 'dyer', 'sal',    0.13);
+insert into Survey values(622, 'dyer', 'rad',    7.80);
+insert into Survey values(622, 'dyer', 'sal',    0.09);
+insert into Survey values(734, 'pb',   'rad',    8.41);
+insert into Survey values(734, 'lake', 'sal',    0.05);
+insert into Survey values(734, 'pb',   'temp', -21.50);
+insert into Survey values(735, 'pb',   'rad',    7.22);
+insert into Survey values(735, null,   'sal',    0.06);
+insert into Survey values(735, null,   'temp', -26.00);
+insert into Survey values(751, 'pb',   'rad',    4.35);
+insert into Survey values(751, 'pb',   'temp', -18.50);
+insert into Survey values(751, 'lake', 'sal',    0.10);
+insert into Survey values(752, 'lake', 'rad',    2.19);
+insert into Survey values(752, 'lake', 'sal',    0.09);
+insert into Survey values(752, 'lake', 'temp', -16.00);
+insert into Survey values(752, 'roe',  'sal',   41.60);
+insert into Survey values(837, 'lake', 'rad',    1.46);
+insert into Survey values(837, 'lake', 'sal',    0.21);
+insert into Survey values(837, 'roe',  'sal',   22.50);
+insert into Survey values(844, 'roe',  'rad',   11.25);
+
+select '========================================';
+select 'Selecting';
+
+select '----------------------------------------';
+select 'get scientist names';
+select family, personal from Person;
+
+select '----------------------------------------';
+select 'commands are case insensitive';
+SeLeCt famILY, PERSonal frOM PERson;
+
+select '----------------------------------------';
+select 'we control column order';
+select personal, family from Person;
+
+select '----------------------------------------';
+select 'repeat columns';
+select ident, ident, ident from Person;
+
+select '----------------------------------------';
+select 'use * for wildcard';
+select * from Person;
+
+select '========================================';
+select 'Removing Duplicates';
+
+select '----------------------------------------';
+select 'show quantities in survey table';
+select quant from Survey;
+
+select '----------------------------------------';
+select 'unique quantity names';
+select distinct quant from Survey;
+
+select '----------------------------------------';
+select 'tuple uniqueness';
+select distinct taken, quant from Survey;
+
+select '========================================';
+select 'Filtering';
+
+select '----------------------------------------';
+select 'when a particular site was visited';
+select * from Visited where site='DR-1';
+
+select '----------------------------------------';
+select 'filtering columns after "where"';
+select ident from Visited where site='DR-1';
+
+select '----------------------------------------';
+select 'when a particular site was visited after 1930';
+select * from Visited where (site='DR-1') and (dated>='1930-00-00');
+
+select '----------------------------------------';
+select 'using "or" instead of "and"';
+select * from Survey where person='lake' or person='roe';
+
+select '----------------------------------------';
+select 'using "in" instead of "or"';
+select * from Survey where person in ('lake', 'roe');
+
+select '----------------------------------------';
+select 'with parentheses';
+select * from Survey where quant='sal' and person='lake' or person='roe';
+
+select '----------------------------------------';
+select 'using distinct with "in"';
+select distinct person, quant from Survey where person='lake' or person='roe';
+
+select '========================================';
+select 'Calculating New Values';
+
+select '----------------------------------------';
+select 'correct radiation readings';
+select 1.05 * reading from Survey where quant='rad';
+
+select '----------------------------------------';
+select 'convert temperatures to Celsius';
+select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';
+
+select '----------------------------------------';
+select 'formatting names';
+select personal || ' ' || family from Person;
+
+select '========================================';
+select 'Ordering Results';
+
+select '----------------------------------------';
+select 'ascending is the default';
+select reading from Survey where quant='rad' order by reading;
+
+select '----------------------------------------';
+select 'order descending';
+select reading from Survey where quant='rad' order by reading desc;
+
+select '----------------------------------------';
+select 'ordering and sub-ordering';
+select taken, person from Survey order by taken asc, person desc;
+
+select '----------------------------------------';
+select 'removing duplicates';
+select distinct taken, person from Survey order by taken asc, person desc;
+
+select '----------------------------------------';
+select 'sorted by undisplayed column';
+select reading from Survey where quant='rad' order by taken;
+
+select '----------------------------------------';
+select 'randomization';
+select random(), ident from Person;
+
+select '----------------------------------------';
+select 'random sorting';
+select ident from Person order by random();
+
+select '========================================';
+select 'Missing Data';
+
+select '----------------------------------------';
+select 'all Visited data';
+select * from Visited;
+
+select '----------------------------------------';
+select 'visits before 1930';
+select * from Visited where dated<'1930-00-00';
+
+select '----------------------------------------';
+select 'visits after 1930';
+select * from Visited where dated>='1930-00-00';
+
+select '----------------------------------------';
+select 'visits with unknown dates (wrong)';
+select * from Visited where dated=NULL;
+
+select '----------------------------------------';
+select '!= does not work either';
+select * from Visited where dated!=NULL;
+
+select '----------------------------------------';
+select 'visits with unknown dates (right)';
+select * from Visited where dated is NULL;
+
+select '----------------------------------------';
+select 'visits with known dates';
+select * from Visited where dated is not NULL;
+
+select '----------------------------------------';
+select 'salinity not measured by Lake';
+select * from Survey where quant='sal' and person!='lake';
+
+select '----------------------------------------';
+select 'salinity not known to be measured by Lake';
+select * from Survey where quant='sal' and (person!='lake' or person is null);
+
+select '========================================';
+select 'Aggregation';
+
+select '----------------------------------------';
+select 'all dates';
+select dated from Visited;
+
+select '----------------------------------------';
+select 'date range';
+select min(dated) from Visited;
+select max(dated) from Visited;
+select min(dated), max(dated) from Visited;
+
+select '----------------------------------------';
+select 'averaging';
+select avg(reading) from Survey where quant='sal';
+
+select '----------------------------------------';
+select 'counting';
+select count(reading) from Survey where quant='sal';
+
+select '----------------------------------------';
+select 'sum';
+select sum(reading) from Survey where quant='sal';
+
+select '----------------------------------------';
+select 'averaging sensible values';
+select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;
+
+select '----------------------------------------';
+select 'unaggregated with aggregated takes arbitrary';
+select person, count(*) from Survey where quant='sal' and reading<=1.0;
+
+select '----------------------------------------';
+select 'what happens when there is no input?'
+select person, max(reading), sum(reading) from Survey where quant='missing';
+
+select '----------------------------------------';
+select 'what happens when an input is null?'
+select min(dated) from Visited;
+
+select '----------------------------------------';
+select 'what happens when an input is null?'
+select min(dated) from Visited where dated is not null;
+
+select '----------------------------------------';
+select 'grouping Visited by site only keeps arbitrary';
+select * from Visited group by site;
+
+select '----------------------------------------';
+select 'get date ranges for sites';
+select   site, min(dated), max(dated) from Visited
+group by site;
+
+select '========================================';
+select 'Grouping';
+
+select '----------------------------------------';
+select 'radiation reading for a particular person';
+select person, count(reading), round(avg(reading), 2)
+from  Survey
+where quant='rad'
+and   person='dyer';
+
+select '----------------------------------------';
+select 'radiation reading for an arbitrary person';
+select person, count(reading), round(avg(reading), 2)
+from  Survey
+where quant='rad';
+
+select '----------------------------------------';
+select 'radiation readings by person';
+select   person, count(reading), round(avg(reading), 2)
+from     Survey
+where    quant='rad'
+group by person;
+
+select '----------------------------------------';
+select 'average by scientist and quantity';
+select   person, quant, count(reading), round(avg(reading), 2)
+from     Survey
+group by person, quant;
+
+select '----------------------------------------';
+select 'grouping and ordering';
+select   person, quant, count(reading), round(avg(reading), 2)
+from     Survey
+group by person, quant
+order by person, quant;
+
+select '----------------------------------------';
+select 'grouping and ordering without null';
+select   person, quant, count(reading), round(avg(reading), 2)
+from     Survey
+where    person is not null
+group by person, quant
+order by person, quant;
+
+select '========================================';
+select 'Combining Data';
+
+select '----------------------------------------';
+select 'combine "Site" with "Visited"';
+select * from Site join Visited;
+
+select '----------------------------------------';
+select 'filter where sites match';
+select * from Site join Visited on Site.name=Visited.site;
+
+select '----------------------------------------';
+select 'get latitude, longitude, and date';
+select Site.lat, Site.long, Visited.dated
+from   Site join Visited
+on     Site.name=Visited.site;
+
+select '----------------------------------------';
+select 'get latitude, longitude, date, quantity, and reading';
+select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
+from   Site join Visited join Survey
+on     Site.name=Visited.site
+and    Visited.ident=Survey.taken
+and    Visited.dated is not null;
+
+select '----------------------------------------';
+select 'get all radiation readings since 1930';
+select 'but notice that #752 is missing (NULL)...';
+select Survey.reading
+from   Survey join Visited
+where  Survey.taken=Visited.ident
+  and  Survey.quant='rad'
+  and  Visited.dated>='1930-00-00';
+
+select '----------------------------------------';
+select 'get row IDs from Person table';
+select rowid, * from Person;
diff --git a/sql/novice/tables.html b/sql/novice/tables.html
new file mode 100644 (file)
index 0000000..e3b5355
--- /dev/null
@@ -0,0 +1,68 @@
+<table>
+  <tr> <th>Table Name</th> <th>Purpose</th> </tr>
+  <tr> <td>Person</td> <td>People who took readings.</td> </tr>
+  <tr> <td>Site</td> <td>Locations of observation sites.</td> </tr>
+  <tr> <td>Visited</td> <td>When readings were taken at specific sites.</td> </tr>
+  <tr> <td>Survey</td> <td>The actual measurement values.</td> </tr>
+</table>
+
+**Person**
+
+<table>
+  <tr> <th>ident</th> <th>personal</th> <th>family</th> </tr>
+  <tr> <td>dyer</td> <td>William</td> <td>Dyer</td> </tr>
+  <tr> <td>pb</td> <td>Frank</td> <td>Pabodie</td> </tr>
+  <tr> <td>lake</td> <td>Anderson</td> <td>Lake</td> </tr>
+  <tr> <td>roe</td> <td>Valentina</td> <td>Roerich</td> </tr>
+  <tr> <td>danforth</td> <td>Frank</td> <td>Danforth</td> </tr>
+</table>
+
+**Survey**
+
+<table>
+  <tr> <th>taken</th> <th>person</th> <th>quant</th> <th>reading</th> </tr>
+  <tr> <td>619</td> <td>dyer</td> <td>rad</td> <td>9.82</td> </tr>
+  <tr> <td>619</td> <td>dyer</td> <td>sal</td> <td>0.13</td> </tr>
+  <tr> <td>622</td> <td>dyer</td> <td>rad</td> <td>7.8</td> </tr>
+  <tr> <td>622</td> <td>dyer</td> <td>sal</td> <td>0.09</td> </tr>
+  <tr> <td>734</td> <td>pb</td> <td>rad</td> <td>8.41</td> </tr>
+  <tr> <td>734</td> <td>lake</td> <td>sal</td> <td>0.05</td> </tr>
+  <tr> <td>734</td> <td>pb</td> <td>temp</td> <td>-21.5</td> </tr>
+  <tr> <td>735</td> <td>pb</td> <td>rad</td> <td>7.22</td> </tr>
+  <tr> <td>735</td> <td>NULL</td> <td>sal</td> <td>0.06</td> </tr>
+  <tr> <td>735</td> <td>NULL</td> <td>temp</td> <td>-26.0</td> </tr>
+  <tr> <td>751</td> <td>pb</td> <td>rad</td> <td>4.35</td> </tr>
+  <tr> <td>751</td> <td>pb</td> <td>temp</td> <td>-18.5</td> </tr>
+  <tr> <td>751</td> <td>lake</td> <td>sal</td> <td>0.1</td> </tr>
+  <tr> <td>752</td> <td>lake</td> <td>rad</td> <td>2.19</td> </tr>
+  <tr> <td>752</td> <td>lake</td> <td>sal</td> <td>0.09</td> </tr>
+  <tr> <td>752</td> <td>lake</td> <td>temp</td> <td>-16.0</td> </tr>
+  <tr> <td>752</td> <td>roe</td> <td>sal</td> <td>41.6</td> </tr>
+  <tr> <td>837</td> <td>lake</td> <td>rad</td> <td>1.46</td> </tr>
+  <tr> <td>837</td> <td>lake</td> <td>sal</td> <td>0.21</td> </tr>
+  <tr> <td>837</td> <td>roe</td> <td>sal</td> <td>22.5</td> </tr>
+  <tr> <td>844</td> <td>roe</td> <td>rad</td> <td>11.25</td> </tr>
+</table>
+
+**Site**
+
+<table>
+  <tr> <th>name</th> <th>lat</th> <th>long</th> </tr>
+  <tr> <td>DR-1</td> <td>-49.85</td> <td>-128.57</td> </tr>
+  <tr> <td>DR-3</td> <td>-47.15</td> <td>-126.72</td> </tr>
+  <tr> <td>MSK-4</td> <td>-48.87</td> <td>-123.4</td> </tr>
+</table>
+
+**Visited**
+
+<table>
+  <tr> <th>ident</th> <th>site</th> <th>dated</th> </tr>
+  <tr> <td>619</td> <td>DR-1</td> <td>1927-02-08</td> </tr>
+  <tr> <td>622</td> <td>DR-1</td> <td>1927-02-10</td> </tr>
+  <tr> <td>734</td> <td>DR-3</td> <td>1939-01-07</td> </tr>
+  <tr> <td>735</td> <td>DR-3</td> <td>1930-01-12</td> </tr>
+  <tr> <td>751</td> <td>DR-3</td> <td>1930-02-26</td> </tr>
+  <tr> <td>752</td> <td>DR-3</td> <td>NULL</td> </tr>
+  <tr> <td>837</td> <td>MSK-4</td> <td>1932-01-14</td> </tr>
+  <tr> <td>844</td> <td>DR-1</td> <td>1932-03-22</td> </tr>
+</table>