--- /dev/null
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Calculating New Values"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "After carefully re-reading the expedition logs,\n",
+ "we realize that the radiation measurements they report\n",
+ "may need to be corrected upward by 5%.\n",
+ "Rather than modifying the stored data,\n",
+ "we can do this calculation on the fly\n",
+ "as part of our query:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%load_ext sqlitemagic"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 1
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select 1.05 * reading from Survey where quant='rad';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>10.311</td></tr>\n",
+ "<tr><td>8.19</td></tr>\n",
+ "<tr><td>8.8305</td></tr>\n",
+ "<tr><td>7.581</td></tr>\n",
+ "<tr><td>4.5675</td></tr>\n",
+ "<tr><td>2.2995</td></tr>\n",
+ "<tr><td>1.533</td></tr>\n",
+ "<tr><td>11.8125</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4e50>"
+ ]
+ }
+ ],
+ "prompt_number": 2
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When we run the query,\n",
+ "the expression `1.05 * reading` is evaluated for each row.\n",
+ "Expressions can use any of the fields,\n",
+ "all of usual arithmetic operators,\n",
+ "and a variety of common functions.\n",
+ "(Exactly which ones depends on which database manager is being used.)\n",
+ "For example,\n",
+ "we can convert temperature readings from Fahrenheit to Celsius\n",
+ "and round to two decimal places:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>734</td><td>-29.72</td></tr>\n",
+ "<tr><td>735</td><td>-32.22</td></tr>\n",
+ "<tr><td>751</td><td>-28.06</td></tr>\n",
+ "<tr><td>752</td><td>-26.67</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023cbd90>"
+ ]
+ }
+ ],
+ "prompt_number": 4
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also combine values from different fields,\n",
+ "for example by using the string concatenation operator `||`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select personal || ' ' || family from Person;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>William Dyer</td></tr>\n",
+ "<tr><td>Frank Pabodie</td></tr>\n",
+ "<tr><td>Anderson Lake</td></tr>\n",
+ "<tr><td>Valentina Roerich</td></tr>\n",
+ "<tr><td>Frank Danforth</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c46d0>"
+ ]
+ }
+ ],
+ "prompt_number": 5
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "> It may seem strange to use `personal` and `family` as field names\n",
+ "> instead of `first` and `last`,\n",
+ "> but it's a necessary first step toward handling cultural differences.\n",
+ "> For example,\n",
+ "> consider the following rules:\n",
+ "\n",
+ "<table>\n",
+ " <tr> <th>Full Name</th> <th>Alphabetized Under</th> <th>Reason</th> </tr>\n",
+ " <tr> <td>Liu Xiaobo</td> <td>Liu</td> <td>Chinese family names come first</td> </tr>\n",
+ " <tr> <td> Leonardo da Vinci</td> <td>Leonardo</td> <td>\"da Vinci\" just means \"from Vinci\"</td> </tr>\n",
+ " <tr> <td> Catherine de Medici</td> <td>Medici</td> <td>family name</td> </tr>\n",
+ " <tr> <td> Jean de La Fontaine</td> <td>La Fontaine</td> <td>family name is \"La Fontaine\"</td> </tr>\n",
+ " <tr> <td> Juan Ponce de Leon</td> <td>Ponce de Leon</td> <td>full family name is \"Ponce de Leon\"</td> </tr>\n",
+ " <tr> <td> Gabriel Garcia Marquez</td> <td>Garcia Marquez</td> <td>double-barrelled Spanish surnames</td> </tr>\n",
+ " <tr> <td> Wernher von Braun</td> <td>von <em>or</em> Braun</td> <td>depending on whether he was in Germany or the US</td> </tr>\n",
+ " <tr> <td> Elizabeth Alexandra May Windsor</td> <td>Elizabeth</td> <td>monarchs alphabetize by the name under which they reigned</td> </tr>\n",
+ " <tr> <td> Thomas a Beckett</td> <td>Thomas</td> <td>and saints according to the names by which they were canonized</td> </tr>\n",
+ "</table>\n",
+ "\n",
+ "> Clearly,\n",
+ "> even a two-part division into \"personal\" and \"family\"\n",
+ "> isn't enough..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Challenges\n",
+ "\n",
+ "1. After further reading,\n",
+ " we realize that Valentina Roerich\n",
+ " was reporting salinity as percentages.\n",
+ " Write a query that returns all of her salinity measurements\n",
+ " from the `Survey` table\n",
+ " with the values divided by 100.\n",
+ "\n",
+ "2. The `union` operator combines the results of two queries:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Person where ident='dyer' union select * from Person where ident='roe';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
+ "<tr><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023cbd50>"
+ ]
+ }
+ ],
+ "prompt_number": 6
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Use `union` to create a consolidated list of salinity measurements\n",
+ "in which Roerich's, and only Roerich's,\n",
+ "have been corrected as described in the previous challenge.\n",
+ "The output should be something like:\n",
+ "\n",
+ "<table>\n",
+ " <tr> <td>619</td> <td>0.13</td> </tr>\n",
+ " <tr> <td>622</td> <td>0.09</td> </tr>\n",
+ " <tr> <td>734</td> <td>0.05</td> </tr>\n",
+ " <tr> <td>751</td> <td>0.1</td> </tr>\n",
+ " <tr> <td>752</td> <td>0.09</td> </tr>\n",
+ " <tr> <td>752</td> <td>0.416</td> </tr>\n",
+ " <tr> <td>837</td> <td>0.21</td> </tr>\n",
+ " <tr> <td>837</td> <td>0.225</td> </tr>\n",
+ "</table>\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "3. The site identifiers in the `Visited` table have two parts\n",
+ " separated by a '-':"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select distinct site from Visited;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>DR-1</td></tr>\n",
+ "<tr><td>DR-3</td></tr>\n",
+ "<tr><td>MSK-4</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4e50>"
+ ]
+ }
+ ],
+ "prompt_number": 7
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Some major site identifiers are two letters long and some are three.\n",
+ "The \"in string\" function `instr(X, Y)`\n",
+ "returns the 1-based index of the first occurrence of string Y in string X,\n",
+ "or 0 if Y does not exist in X.\n",
+ "The substring function `substr(X, I)`\n",
+ "returns the substring of X starting at index I.\n",
+ "Use these two functions to produce a list of unique major site identifiers.\n",
+ "(For this data,\n",
+ "the list should contain only \"DR\" and \"MSK\")."
+ ]
+ },
+ {
+ "cell_type": "heading",
+ "level": 2,
+ "metadata": {},
+ "source": [
+ "Next Steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "FIXME"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file
--- /dev/null
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Missing Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Real-world data is never complete—there are always holes.\n",
+ "Databases represent these holes using special value called `null`.\n",
+ "`null` is not zero, `False`, or the empty string;\n",
+ "it is a one-of-a-kind value that means \"nothing here\".\n",
+ "Dealing with `null` requires a few special tricks\n",
+ "and some careful thinking.\n",
+ "\n",
+ "To start,\n",
+ "let's have a look at the `Visited` table.\n",
+ "There are eight records,\n",
+ "but #752 doesn't have a date—or rather,\n",
+ "its date is null:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%load_ext sqlitemagic"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 1
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Visited;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+ "<tr><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>752</td><td>DR-3</td><td>None</td></tr>\n",
+ "<tr><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
+ "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 2
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Null doesn't behave like other values.\n",
+ "If we select the records that come before 1930:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Visited where dated<'1930-00-00';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c46d0>"
+ ]
+ }
+ ],
+ "prompt_number": 3
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "we get two results,\n",
+ "and if we select the ones that come during or after 1930:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Visited where dated>='1930-00-00';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
+ "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 4
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "we get five,\n",
+ "but record #752 isn't in either set of results.\n",
+ "The reason is that\n",
+ "`null<'1930-00-00'`\n",
+ "is neither true nor false:\n",
+ "null means, \"We don't know,\"\n",
+ "and if we don't know the value on the left side of a comparison,\n",
+ "we don't know whether the comparison is true or false.\n",
+ "Since databases represent \"don't know\" as null,\n",
+ "the value of `null<'1930-00-00'`\n",
+ "is actually `null`.\n",
+ "`null>='1930-00-00'` is also null\n",
+ "because we can't answer to that question either.\n",
+ "And since the only records kept by a `where`\n",
+ "are those for which the test is true,\n",
+ "record #752 isn't included in either set of results.\n",
+ "\n",
+ "Comparisons aren't the only operations that behave this way with nulls.\n",
+ "`1+null` is `null`,\n",
+ "`5*null` is `null`,\n",
+ "`log(null)` is `null`,\n",
+ "and so on.\n",
+ "In particular,\n",
+ "comparing things to null with = and != produces null:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Visited where dated=NULL;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c46d0>"
+ ]
+ }
+ ],
+ "prompt_number": 5
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Visited where dated!=NULL;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 6
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To check whether a value is `null` or not,\n",
+ "we must use a special test `is null`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Visited where dated is NULL;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>752</td><td>DR-3</td><td>None</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c46d0>"
+ ]
+ }
+ ],
+ "prompt_number": 7
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "or its inverse `is not null`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Visited where dated is not NULL;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+ "<tr><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
+ "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 8
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Null values cause headaches wherever they appear.\n",
+ "For example,\n",
+ "suppose we want to find the all of salinity measurements\n",
+ "that weren't taken by Dyer.\n",
+ "It's natural to write the query like this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Survey where quant='sal' and person!='lake';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>619</td><td>dyer</td><td>sal</td><td>0.13</td></tr>\n",
+ "<tr><td>622</td><td>dyer</td><td>sal</td><td>0.09</td></tr>\n",
+ "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
+ "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c46d0>"
+ ]
+ }
+ ],
+ "prompt_number": 9
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "but this query filters omits the records\n",
+ "where we don't know who took the measurement.\n",
+ "Once again,\n",
+ "the reason is that when `person` is `null`,\n",
+ "the `!=` comparison produces `null`,\n",
+ "so the record isn't kept in our results.\n",
+ "If we want to keep these records\n",
+ "we need to add an explicit check:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Survey where quant='sal' and (person!='lake' or person is null);"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>619</td><td>dyer</td><td>sal</td><td>0.13</td></tr>\n",
+ "<tr><td>622</td><td>dyer</td><td>sal</td><td>0.09</td></tr>\n",
+ "<tr><td>735</td><td>None</td><td>sal</td><td>0.06</td></tr>\n",
+ "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
+ "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 10
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We still have to decide whether this is the right thing to do or not.\n",
+ "If we want to be absolutely sure that\n",
+ "we aren't including any measurements by Lake in our results,\n",
+ "we need to exclude all the records for which we don't know who did the work."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "1. Write a query that sorts the records in `Visited` by date,\n",
+ " omitting entries for which the date is not known\n",
+ " (i.e., is null).\n",
+ "\n",
+ "1. What do you expect the query:\n",
+ "\n",
+ " ```\n",
+ " select * from Visited where dated in ('1927-02-08', null);\n",
+ " ```\n",
+ "\n",
+ " to produce?\n",
+ " What does it actually produce?\n",
+ "\n",
+ "1. Some database designers prefer to use\n",
+ " a [sentinel value](../gloss.html#sentinel-value)\n",
+ " to mark missing data rather than `null`.\n",
+ " For example,\n",
+ " they will use the date \"0000-00-00\" to mark a missing date,\n",
+ " or -1.0 to mark a missing salinity or radiation reading\n",
+ " (since actual readings cannot be negative).\n",
+ " What does this simplify?\n",
+ " What burdens or risks does it introduce?"
+ ]
+ },
+ {
+ "cell_type": "heading",
+ "level": 2,
+ "metadata": {},
+ "source": [
+ "Next Steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "FIXME"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file
--- /dev/null
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Aggregation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We now want to calculate ranges and averages for our data.\n",
+ "We know how to select all of the dates from the `Visited` table:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%load_ext sqlitemagic"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 1
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select dated from Visited;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>1927-02-08</td></tr>\n",
+ "<tr><td>1927-02-10</td></tr>\n",
+ "<tr><td>1939-01-07</td></tr>\n",
+ "<tr><td>1930-01-12</td></tr>\n",
+ "<tr><td>1930-02-26</td></tr>\n",
+ "<tr><td>None</td></tr>\n",
+ "<tr><td>1932-01-14</td></tr>\n",
+ "<tr><td>1932-03-22</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4ed0>"
+ ]
+ }
+ ],
+ "prompt_number": 2
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "but to combine them,\n",
+ "wee must use an [aggregation function](../gloss.html#aggregation-function)\n",
+ "such as `min` or `max`.\n",
+ "Each of these functions takes a set of records as input,\n",
+ "and produces a single record as output:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select min(dated) from Visited;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>1927-02-08</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4ed0>"
+ ]
+ }
+ ],
+ "prompt_number": 3
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select max(dated) from Visited;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>1939-01-07</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4090>"
+ ]
+ }
+ ],
+ "prompt_number": 4
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`min` and `max` are just two of\n",
+ "the aggregation functions built into SQL.\n",
+ "Three others are `avg`,\n",
+ "`count`,\n",
+ "and `sum`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select avg(reading) from Survey where quant='sal';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>7.20333333333</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4090>"
+ ]
+ }
+ ],
+ "prompt_number": 7
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select count(reading) from Survey where quant='sal';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>9</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4fd0>"
+ ]
+ }
+ ],
+ "prompt_number": 8
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select sum(reading) from Survey where quant='sal';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>64.83</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4d90>"
+ ]
+ }
+ ],
+ "prompt_number": 9
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We used `count(reading)` here,\n",
+ "but we could just as easily have counted `quant`\n",
+ "or any other field in the table,\n",
+ "or even used `count(*)`,\n",
+ "since the function doesn't care about the values themselves,\n",
+ "just how many values there are.\n",
+ "\n",
+ "SQL lets us do several aggregations at once.\n",
+ "We can,\n",
+ "for example,\n",
+ "find the range of sensible salinity measurements:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>0.05</td><td>0.21</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4fd0>"
+ ]
+ }
+ ],
+ "prompt_number": 10
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can also combine aggregated results with raw results,\n",
+ "although the output might surprise you:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select person, count(*) from Survey where quant='sal' and reading<=1.0;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>lake</td><td>7</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4d90>"
+ ]
+ }
+ ],
+ "prompt_number": 11
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Why does Lake's name appear rather than Roerich's or Dyer's?\n",
+ "The answer is that when it has to aggregate a field,\n",
+ "but isn't told how to,\n",
+ "the database manager chooses an actual value from the input set.\n",
+ "It might use the first one processed,\n",
+ "the last one,\n",
+ "or something else entirely.\n",
+ "\n",
+ "Another important fact is that when there are no values to aggregate,\n",
+ "aggregation's result is \"don't know\"\n",
+ "rather than zero or some other arbitrary value:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select person, max(reading), sum(reading) from Survey where quant='missing';"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>None</td><td>None</td><td>None</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4fd0>"
+ ]
+ }
+ ],
+ "prompt_number": 12
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One final important feature of aggregation functions is that\n",
+ "they are inconsistent with the rest of SQL in a very useful way.\n",
+ "If we add two values,\n",
+ "and one of them is null,\n",
+ "the result is null.\n",
+ "By extension,\n",
+ "if we use `sum` to add all the values in a set,\n",
+ "and any of those values are null,\n",
+ "the result should also be null.\n",
+ "It's much more useful,\n",
+ "though,\n",
+ "for aggregation functions to ignore null values\n",
+ "and only combine those that are non-null.\n",
+ "This behavior lets us write our queries as:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select min(dated) from Visited;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>1927-02-08</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4fd0>"
+ ]
+ }
+ ],
+ "prompt_number": 13
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "instead of always having to filter explicitly:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select min(dated) from Visited where dated is not null;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>1927-02-08</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4090>"
+ ]
+ }
+ ],
+ "prompt_number": 14
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Challenges\n",
+ "\n",
+ "1. How many temperature readings did Frank Pabodie record,\n",
+ " and what was their average value?\n",
+ "\n",
+ "2. The average of a set of values is the sum of the values\n",
+ " divided by the number of values.\n",
+ " Does this mean that the `avg` function returns 2.0 or 3.0\n",
+ " when given the values 1.0, `null`, and 5.0?\n",
+ "\n",
+ "3. We want to calculate the difference between\n",
+ " each individual radiation reading\n",
+ " and the average of all the radiation readings.\n",
+ " We write the query:\n",
+ "\n",
+ " ```\n",
+ " select reading-avg(reading) from Survey where quant='rad';\n",
+ " ```\n",
+ "\n",
+ " What does this actually produce, and why?\n",
+ "\n",
+ "4. The function `group_concat(field, separator)`\n",
+ " concatenates all the values in a field\n",
+ " using the specified separator character\n",
+ " (or ',' if the separator isn't specified).\n",
+ " Use this to produce a one-line list of scientists' names,\n",
+ " such as:\n",
+ "\n",
+ " ```\n",
+ " William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth\n",
+ " ```\n",
+ "\n",
+ " Can you find a way to order the list by surname?"
+ ]
+ },
+ {
+ "cell_type": "heading",
+ "level": 2,
+ "metadata": {},
+ "source": [
+ "Next Steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "FIXME"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file
--- /dev/null
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Combining Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In order to submit her data to a web site\n",
+ "that aggregates historical meteorological data,\n",
+ "Gina needs to format it as\n",
+ "latitude, longitude, date, quantity, and reading.\n",
+ "However,\n",
+ "her latitudes and longitudes are in the `Site` table,\n",
+ "while the dates of measurements are in the `Visited` table\n",
+ "and the readings themselves are in the `Survey` table.\n",
+ "She needs to combine these tables somehow.\n",
+ "\n",
+ "The SQL command to do this is `join`.\n",
+ "To see how it works,\n",
+ "let's start by joining the `Site` and `Visited` tables:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%load_ext sqlitemagic"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 1
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Site join Visited;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 2
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`join` creates\n",
+ "the [cross product](../gloss.html#cross-product)\n",
+ "of two tables,\n",
+ "i.e.,\n",
+ "it joins each record of one with each record of the other\n",
+ "to give all possible combinations.\n",
+ "Since there are three records in `Site`\n",
+ "and eight in `Visited`,\n",
+ "the join's output has 24 records.\n",
+ "And since each table has three fields,\n",
+ "the output has six fields.\n",
+ " \n",
+ "What the join *hasn't* done is\n",
+ "figure out if the records being joined have anything to do with each other.\n",
+ "It has no way of knowing whether they do or not until we tell it how.\n",
+ "To do that,\n",
+ "we add a clause specifying that\n",
+ "we're only interested in combinations that have the same site name:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select * from Site join Visited on Site.name=Visited.site;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
+ "<tr><td>DR-1</td><td>-49.85</td><td>-128.57</td><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>734</td><td>DR-3</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>735</td><td>DR-3</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>751</td><td>DR-3</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>DR-3</td><td>-47.15</td><td>-126.72</td><td>752</td><td>DR-3</td><td>None</td></tr>\n",
+ "<tr><td>MSK-4</td><td>-48.87</td><td>-123.4</td><td>837</td><td>MSK-4</td><td>1932-01-14</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4090>"
+ ]
+ }
+ ],
+ "prompt_number": 3
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "`on` does the same job as `where`:\n",
+ "it only keeps records that pass some test.\n",
+ "(The difference between the two is that `on` filters records\n",
+ "as they're being created,\n",
+ "while `where` waits until the join is done\n",
+ "and then does the filtering.)\n",
+ "Once we add this to our query,\n",
+ "the database manager throws away records\n",
+ "that combined information about two different sites,\n",
+ "leaving us with just the ones we want.\n",
+ " \n",
+ "Notice that we used `table.field` to specify field names\n",
+ "in the output of the join.\n",
+ "We do this because tables can have fields with the same name,\n",
+ "and we need to be specific which ones we're talking about.\n",
+ "For example,\n",
+ "if we joined the `person` and `visited` tables,\n",
+ "the result would inherit a field called `ident`\n",
+ "from each of the original tables.\n",
+ "\n",
+ "We can now use the same dotted notation\n",
+ "to select the three columns we actually want\n",
+ "out of our join:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select Site.lat, Site.long, Visited.dated\n",
+ "from Site join Visited\n",
+ "on Site.name=Visited.site;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-08</td></tr>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-10</td></tr>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1932-03-22</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>None</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td></tr>\n",
+ "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c3890>"
+ ]
+ }
+ ],
+ "prompt_number": 4
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If joining two tables is good,\n",
+ "joining many tables must be better.\n",
+ "In fact,\n",
+ "we can join any number of tables\n",
+ "simply by adding more `join` clauses to our query,\n",
+ "and more `on` tests to filter out combinations of records\n",
+ "that don't make sense:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading\n",
+ "from Site join Visited join Survey\n",
+ "on Site.name=Visited.site\n",
+ "and Visited.ident=Survey.taken\n",
+ "and Visited.dated is not null;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-08</td><td>rad</td><td>9.82</td></tr>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-08</td><td>sal</td><td>0.13</td></tr>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-10</td><td>rad</td><td>7.8</td></tr>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1927-02-10</td><td>sal</td><td>0.09</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td><td>rad</td><td>8.41</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td><td>sal</td><td>0.05</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1939-01-07</td><td>temp</td><td>-21.5</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td><td>rad</td><td>7.22</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td><td>sal</td><td>0.06</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-01-12</td><td>temp</td><td>-26.0</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td><td>rad</td><td>4.35</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td><td>sal</td><td>0.1</td></tr>\n",
+ "<tr><td>-47.15</td><td>-126.72</td><td>1930-02-26</td><td>temp</td><td>-18.5</td></tr>\n",
+ "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td><td>rad</td><td>1.46</td></tr>\n",
+ "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td><td>sal</td><td>0.21</td></tr>\n",
+ "<tr><td>-48.87</td><td>-123.4</td><td>1932-01-14</td><td>sal</td><td>22.5</td></tr>\n",
+ "<tr><td>-49.85</td><td>-128.57</td><td>1932-03-22</td><td>rad</td><td>11.25</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 5
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can tell which records from `Site`, `Visited`, and `Survey`\n",
+ "correspond with each other\n",
+ "because those tables contain\n",
+ "[primary keys](../gloss.html#primary-key)\n",
+ "and [foreign keys](../gloss.html#foreign-key).\n",
+ "A primary key is a value,\n",
+ "or combination of values,\n",
+ "that uniquely identifies each record in a table.\n",
+ "A foreign key is a value (or combination of values) from one table\n",
+ "that identifies a unique record in another table.\n",
+ "Another way of saying this is that\n",
+ "a foreign key is the primary key of one table\n",
+ "that appears in some other table.\n",
+ "In our database,\n",
+ "`Person.ident` is the primary key in the `Person` table,\n",
+ "while `Survey.person` is a foreign key\n",
+ "relating the `Survey` table's entries\n",
+ "to entries in `Person`.\n",
+ "\n",
+ "Most database designers believe that\n",
+ "every table should have a well-defined primary key.\n",
+ "They also believe that this key should be separate from the data itself,\n",
+ "so that if we ever need to change the data,\n",
+ "we only need to make one change in one place.\n",
+ "One easy way to do this is\n",
+ "to create an arbitrary, unique ID for each record\n",
+ "as we add it to the database.\n",
+ "This is actually very common:\n",
+ "those IDs have names like \"student numbers\" and \"patient numbers\",\n",
+ "and they almost always turn out to have originally been\n",
+ "a unique record identifier in some database system or other.\n",
+ "As the query below demonstrates,\n",
+ "SQLite automatically numbers records as they're added to tables,\n",
+ "and we can use those record numbers in queries:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%sqlite survey.db\n",
+ "select rowid, * from Person;"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "html": [
+ "<table>\n",
+ "<tr><td>1</td><td>dyer</td><td>William</td><td>Dyer</td></tr>\n",
+ "<tr><td>2</td><td>pb</td><td>Frank</td><td>Pabodie</td></tr>\n",
+ "<tr><td>3</td><td>lake</td><td>Anderson</td><td>Lake</td></tr>\n",
+ "<tr><td>4</td><td>roe</td><td>Valentina</td><td>Roerich</td></tr>\n",
+ "<tr><td>5</td><td>danforth</td><td>Frank</td><td>Danforth</td></tr>\n",
+ "</table>"
+ ],
+ "metadata": {},
+ "output_type": "display_data",
+ "text": [
+ "<IPython.core.display.HTML at 0x1023c4550>"
+ ]
+ }
+ ],
+ "prompt_number": 6
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Challenges\n",
+ "\n",
+ "1. Write a query that lists all radiation readings from the DR-1 site.\n",
+ "\n",
+ "2. Write a query that lists all sites visited by people named \"Frank\".\n",
+ "\n",
+ "3. Describe in your own words what the following query produces:\n",
+ "\n",
+ " ```\n",
+ " select Site.name from Site join Visited\n",
+ " on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n",
+ " ```\n",
+ "\n",
+ "4. Why does the `Person` table have an `ident` field?\n",
+ " Why do we not just use scientists' names in the `Survey` table?\n",
+ "\n",
+ "5. Why does the table `Site` exist?\n",
+ " Why don't we just record latitudes and longitudes\n",
+ " directly in the `Visited` and `Survey` tables?"
+ ]
+ },
+ {
+ "cell_type": "heading",
+ "level": 2,
+ "metadata": {},
+ "source": [
+ "Next Steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "FIXME"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file
--- /dev/null
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Creating and Modifying Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So far we have only looked at how to get information out of a database,\n",
+ "both because that is more frequent than adding information,\n",
+ "and because most other operations only make sense\n",
+ "once queries are understood.\n",
+ "If we want to create and modify data,\n",
+ "we need to know two other pairs of commands.\n",
+ " \n",
+ "The first pair are `create table` and `drop table`.\n",
+ "While they are written as two words,\n",
+ "they are actually single commands.\n",
+ "The first one creates a new table;\n",
+ "its arguments are the names and types of the table's columns.\n",
+ "For example,\n",
+ "the following statements create the four tables in our survey database:\n",
+ "\n",
+ " create table Person(ident text, personal text, family text);\n",
+ " create table Site(name text, lat real, long real);\n",
+ " create table Visited(ident integer, site text, dated text);\n",
+ " create table Survey(taken integer, person text, quant real, reading real);\n",
+ "\n",
+ "We can get rid of one of our tables using:\n",
+ " \n",
+ " drop table Survey;\n",
+ "\n",
+ "Be very careful when doing this:\n",
+ "most databases have some support for undoing changes,\n",
+ "but it's better not to have to rely on it.\n",
+ " \n",
+ "Different database systems support different data types for table columns,\n",
+ "but most provide the following:\n",
+ "\n",
+ "<table>\n",
+ " <tr> <td>integer</td> <td>a signed integer</td> </tr>\n",
+ " <tr> <td>real</td> <td>a floating point number</td> </tr>\n",
+ " <tr> <td>text</td> <td>a character string</td> </tr>\n",
+ " <tr> <td>blob</td> <td>a \"binary large object\", such as an image</td> </tr>\n",
+ "</table>\n",
+ "\n",
+ "Most databases also support Booleans and date/time values;\n",
+ "SQLite uses the integers 0 and 1 for the former,\n",
+ "and represents the latter as discussed [earlier](#a:dates).\n",
+ "An increasing number of databases also support geographic data types,\n",
+ "such as latitude and longitude.\n",
+ "Keeping track of what particular systems do or do not offer,\n",
+ "and what names they give different data types,\n",
+ "is an unending portability headache.\n",
+ " \n",
+ "When we create a table,\n",
+ "we can specify several kinds of constraints on its columns.\n",
+ "For example,\n",
+ "a better definition for the `Survey` table would be:\n",
+ " \n",
+ " create table Survey(\n",
+ " taken integer not null, -- where reading taken\n",
+ " person text, -- may not know who took it\n",
+ " quant real not null, -- the quantity measured\n",
+ " reading real not null, -- the actual reading\n",
+ " primary key(taken, quant),\n",
+ " foreign key(taken) references Visited(ident),\n",
+ " foreign key(person) references Person(ident)\n",
+ " );\n",
+ "\n",
+ "Once again,\n",
+ "exactly what constraints are avialable\n",
+ "and what they're called\n",
+ "depends on which database manager we are using.\n",
+ " \n",
+ "Once tables have been created,\n",
+ "we can add and remove records using our other pair of commands,\n",
+ "`insert` and `delete`.\n",
+ "The simplest form of `insert` statement lists values in order:\n",
+ "\n",
+ " insert into Site values('DR-1', -49.85, -128.57);\n",
+ " insert into Site values('DR-3', -47.15, -126.72);\n",
+ " insert into Site values('MSK-4', -48.87, -123.40);\n",
+ "\n",
+ "We can also insert values into one table directly from another:\n",
+ " \n",
+ " create table JustLatLong(lat text, long TEXT);\n",
+ " insert into JustLatLong select lat, long from site;\n",
+ "\n",
+ "Deleting records can be a bit trickier,\n",
+ "because we have to ensure that the database remains internally consistent.\n",
+ "If all we care about is a single table,\n",
+ "we can use the `delete` command with a `where` clause\n",
+ "that matches the records we want to discard.\n",
+ "For example,\n",
+ "once we realize that Frank Danforth didn't take any measurements,\n",
+ "we can remove him from the `Person` table like this:\n",
+ " \n",
+ " delete from Person where ident = \"danforth\";\n",
+ "\n",
+ "But what if we removed Anderson Lake instead?\n",
+ "Our `Survey` table would still contain seven records\n",
+ "of measurements he'd taken,\n",
+ "but that's never supposed to happen:\n",
+ "`Survey.person` is a foreign key into the `Person` table,\n",
+ "and all our queries assume there will be a row in the latter\n",
+ "matching every value in the former.\n",
+ " \n",
+ "This problem is called [referential integrity](../gloss.html#referential-integrity):\n",
+ "we need to ensure that all references between tables can always be resolved correctly.\n",
+ "One way to do this is to delete all the records\n",
+ "that use `'lake'` as a foreign key\n",
+ "before deleting the record that uses it as a primary key.\n",
+ "If our database manager supports it,\n",
+ "we can automate this\n",
+ "using [cascading delete](../gloss.html#cascading-delete).\n",
+ "However,\n",
+ "this technique is outside the scope of this chapter.\n",
+ "\n",
+ "> Many applications use a hybrid storage model\n",
+ "> instead of putting everything into a database:\n",
+ "> the actual data (such as astronomical images) is stored in files,\n",
+ "> while the database stores the files' names,\n",
+ "> their modification dates,\n",
+ "> the region of the sky they cover,\n",
+ "> their spectral characteristics,\n",
+ "> and so on.\n",
+ "> This is also how most music player software is built:\n",
+ "> the database inside the application keeps track of the MP3 files,\n",
+ "> but the files themselves live on disk."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Challenges\n",
+ "\n",
+ "1. Write an SQL statement to replace all uses of `null`\n",
+ " in `Survey.person`\n",
+ " with the string `'unknown'`.\n",
+ "\n",
+ "2. One of our colleagues has sent us a [CSV](../gloss.html#csv) file\n",
+ " containing temperature readings by Robert Olmstead,\n",
+ " which is formatted like this:\n",
+ "\n",
+ " ```\n",
+ " Taken,Temp\n",
+ " 619,-21.5\n",
+ " 622,-15.5\n",
+ " ```\n",
+ "\n",
+ " Write a small Python program that reads this file in\n",
+ " and prints out the SQL `insert` statements needed\n",
+ " to add these records to the survey database.\n",
+ " Note: you will need to add an entry for Olmstead\n",
+ " to the `Person` table.\n",
+ " If you are testing your program repeatedly,\n",
+ " you may want to investigate SQL's `insert or replace` command.\n",
+ "\n",
+ "3. SQLite has several administrative commands that aren't part of the SQL standard.\n",
+ " One of them is `.dump`,\n",
+ " which prints the SQL commands needed to re-create the database.\n",
+ " Another is `.load`,\n",
+ " which reads a file created by `.dump` and restores the database.\n",
+ " A colleague of yours thinks that storing dump files (which are text) in version control\n",
+ " is a good way to track and manage changes to the database.\n",
+ " What are the pros and cons of this approach?"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file
--- /dev/null
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Data Hygiene"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "FIXME"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file
--- /dev/null
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Programming with Databases"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To close,\n",
+ "let's have a look at how to access a database from\n",
+ "a general-purpose programming language like Python.\n",
+ "Other languages use almost exactly the same model:\n",
+ "library and function names may differ,\n",
+ "but the concepts are the same.\n",
+ "\n",
+ "Here's a short Python program that selects latitudes and longitudes\n",
+ "from an SQLite database stored in a file called `survey.db`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "import sqlite3\n",
+ "connection = sqlite3.connect(\"survey.db\")\n",
+ "cursor = connection.cursor()\n",
+ "cursor.execute(\"select site.lat, site.long from site;\")\n",
+ "results = cursor.fetchall()\n",
+ "for r in results:\n",
+ " print r\n",
+ "cursor.close()\n",
+ "connection.close()"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "(-49.85, -128.57)\n",
+ "(-47.15, -126.72)\n",
+ "(-48.87, -123.4)\n"
+ ]
+ }
+ ],
+ "prompt_number": 1
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The program starts by importing the `sqlite3` library.\n",
+ "If we were connecting to MySQL, DB2, or some other database,\n",
+ "we would import a different library,\n",
+ "but all of them provide the same functions,\n",
+ "so that the rest of our program does not have to change\n",
+ "(at least, not much)\n",
+ "if we switch from one database to another.\n",
+ "\n",
+ "Line 2 establishes a connection to the database.\n",
+ "Since we're using SQLite,\n",
+ "all we need to specify is the name of the database file.\n",
+ "Other systems may require us to provide a username and password as well.\n",
+ "Line 3 then uses this connection to create\n",
+ "a [cursor](../gloss.html#cursor);\n",
+ "just like the cursor in an editor,\n",
+ "its role is to keep track of where we are in the database.\n",
+ "\n",
+ "On line 4, we use that cursor to ask the database to execute a query for us.\n",
+ "The query is written in SQL,\n",
+ "and passed to `cursor.execute` as a string.\n",
+ "It's our job to make sure that SQL is properly formatted;\n",
+ "if it isn't,\n",
+ "or if something goes wrong when it is being executed,\n",
+ "the database will report an error.\n",
+ "\n",
+ "The database returns the results of the query to us\n",
+ "in response to the `cursor.fetchall` call on line 5.\n",
+ "This result is a list with one entry for each record in the result set;\n",
+ "if we loop over that list (line 6) and print those list entries (line 7),\n",
+ "we can see that each one is a tuple\n",
+ "with one element for each field we asked for.\n",
+ "\n",
+ "Finally, lines 8 and 9 close our cursor and our connection,\n",
+ "since the database can only keep a limited number of these open at one time.\n",
+ "Since establishing a connection takes time,\n",
+ "though,\n",
+ "we shouldn't open a connection,\n",
+ "do one operation,\n",
+ "then close the connection,\n",
+ "only to reopen it a few microseconds later to do another operation.\n",
+ "Instead,\n",
+ "it's normal to create one connection that stays open for the lifetime of the program."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Queries in real applications will often depend on values provided by users.\n",
+ "For example,\n",
+ "this function takes a user's ID as a parameter and returns their name:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "def get_name(database_file, person_ident):\n",
+ " query = \"select personal || ' ' || family from Person where ident='\" + person_ident + \"';\"\n",
+ "\n",
+ " connection = sqlite3.connect(database_file)\n",
+ " cursor = connection.cursor()\n",
+ " cursor.execute(query)\n",
+ " results = cursor.fetchall()\n",
+ " cursor.close()\n",
+ " connection.close()\n",
+ "\n",
+ " return results[0][0]\n",
+ "\n",
+ "print \"full name for dyer:\", get_name('survey.db', 'dyer')"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "full name for dyer: William Dyer\n"
+ ]
+ }
+ ],
+ "prompt_number": 2
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We use string concatenation on the first line of this function\n",
+ "to construct a query containing the user ID we have been given.\n",
+ "This seems simple enough,\n",
+ "but what happens if someone gives us this string as input?\n",
+ "\n",
+ " dyer'; drop table Survey; select '\n",
+ "\n",
+ "It looks like there's garbage after the name of the project,\n",
+ "but it is very carefully chosen garbage.\n",
+ "If we insert this string into our query,\n",
+ "the result is:\n",
+ "\n",
+ " select personal || ' ' || family from Person where ident='dyer'; drop tale Survey; select '';\n",
+ "\n",
+ "If we execute this,\n",
+ "it will erase one of the tables in our database.\n",
+ " \n",
+ "This technique is called [SQL injection](../gloss.html#sql-injection),\n",
+ "and it has been used to attack thousands of programs over the years.\n",
+ "In particular,\n",
+ "many web sites that take data from users insert values directly into queries\n",
+ "without checking them carefully first.\n",
+ " \n",
+ "Since a villain might try to smuggle commands into our queries in many different ways,\n",
+ "the safest way to deal with this threat is\n",
+ "to replace characters like quotes with their escaped equivalents,\n",
+ "so that we can safely put whatever the user gives us inside a string.\n",
+ "We can do this by using a [prepared statement](../gloss.html#prepared-statement)\n",
+ "instead of formatting our statements as strings.\n",
+ "Here's what our example program looks like if we do this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "def get_name(database_file, person_ident):\n",
+ " query = \"select personal || ' ' || family from Person where ident=?;\"\n",
+ "\n",
+ " connection = sqlite3.connect(database_file)\n",
+ " cursor = connection.cursor()\n",
+ " cursor.execute(query, [person_ident])\n",
+ " results = cursor.fetchall()\n",
+ " cursor.close()\n",
+ " connection.close()\n",
+ "\n",
+ " return results[0][0]\n",
+ "\n",
+ "print \"full name for dyer:\", get_name('survey.db', 'dyer')"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "full name for dyer: William Dyer\n"
+ ]
+ }
+ ],
+ "prompt_number": 3
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The key changes are in the query string and the `execute` call.\n",
+ "Instead of formatting the query ourselves,\n",
+ "we put question marks in the query template where we want to insert values.\n",
+ "When we call `execute`,\n",
+ "we provide a list\n",
+ "that contains as many values as there are question marks in the query.\n",
+ "The library matches values to question marks in order,\n",
+ "and translates any special characters in the values\n",
+ "into their escaped equivalents\n",
+ "so that they are safe to use."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Challenges\n",
+ "\n",
+ "1. Write a Python program that creates a new database\n",
+ " in a file called `original.db`\n",
+ " containing a single table called `Pressure`,\n",
+ " with a single field called `reading`,\n",
+ " and inserts 100,000 random numbers between 10.0 and 25.0.\n",
+ " How long does it take this program to run?\n",
+ " How long does it take to run a program\n",
+ " that simply writes those random numbers to a file?\n",
+ "\n",
+ "2. Write a Python program that creates a new database\n",
+ " called `backup.db`\n",
+ " with the same structure as `original.db`\n",
+ " and copies all the values greater than 20.0\n",
+ " from `original.db` to `backup.db`.\n",
+ " Which is faster:\n",
+ " filtering values in the query,\n",
+ " or reading everything into memory and filtering in Python?"
+ ]
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file