From: Greg Wilson Date: Sun, 10 Nov 2013 10:30:52 +0000 (-0500) Subject: More database lessons for novices. To do: X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=f1b09bc54652abb32d34e39e6ccccbf08fba0a18;p=swc-boot-camp.git More database lessons for novices. To do: * `group by` in aggregation * data hygiene * explain the `%%sqlite` magic --- diff --git a/sql/novice/04-calc.ipynb b/sql/novice/04-calc.ipynb new file mode 100644 index 0000000..2ecf97d --- /dev/null +++ b/sql/novice/04-calc.ipynb @@ -0,0 +1,313 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Calculating New Values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After carefully re-reading the expedition logs,\n", + "we realize that the radiation measurements they report\n", + "may need to be corrected upward by 5%.\n", + "Rather than modifying the stored data,\n", + "we can do this calculation on the fly\n", + "as part of our query:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext sqlitemagic" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select 1.05 * reading from Survey where quant='rad';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
10.311
8.19
8.8305
7.581
4.5675
2.2995
1.533
11.8125
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we run the query,\n", + "the expression `1.05 * reading` is evaluated for each row.\n", + "Expressions can use any of the fields,\n", + "all of usual arithmetic operators,\n", + "and a variety of common functions.\n", + "(Exactly which ones depends on which database manager is being used.)\n", + "For example,\n", + "we can convert temperature readings from Fahrenheit to Celsius\n", + "and round to two decimal places:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
734-29.72
735-32.22
751-28.06
752-26.67
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also combine values from different fields,\n", + "for example by using the string concatenation operator `||`:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select personal || ' ' || family from Person;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
William Dyer
Frank Pabodie
Anderson Lake
Valentina Roerich
Frank Danforth
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> It may seem strange to use `personal` and `family` as field names\n", + "> instead of `first` and `last`,\n", + "> but it's a necessary first step toward handling cultural differences.\n", + "> For example,\n", + "> consider the following rules:\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Full Name Alphabetized Under Reason
Liu Xiaobo Liu Chinese family names come first
Leonardo da Vinci Leonardo \"da Vinci\" just means \"from Vinci\"
Catherine de Medici Medici family name
Jean de La Fontaine La Fontaine family name is \"La Fontaine\"
Juan Ponce de Leon Ponce de Leon full family name is \"Ponce de Leon\"
Gabriel Garcia Marquez Garcia Marquez double-barrelled Spanish surnames
Wernher von Braun von or Braun depending on whether he was in Germany or the US
Elizabeth Alexandra May Windsor Elizabeth monarchs alphabetize by the name under which they reigned
Thomas a Beckett Thomas and saints according to the names by which they were canonized
\n", + "\n", + "> Clearly,\n", + "> even a two-part division into \"personal\" and \"family\"\n", + "> isn't enough..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. After further reading,\n", + " we realize that Valentina Roerich\n", + " was reporting salinity as percentages.\n", + " Write a query that returns all of her salinity measurements\n", + " from the `Survey` table\n", + " with the values divided by 100.\n", + "\n", + "2. The `union` operator combines the results of two queries:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Person where ident='dyer' union select * from Person where ident='roe';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "
dyerWilliamDyer
roeValentinaRoerich
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use `union` to create a consolidated list of salinity measurements\n", + "in which Roerich's, and only Roerich's,\n", + "have been corrected as described in the previous challenge.\n", + "The output should be something like:\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
619 0.13
622 0.09
734 0.05
751 0.1
752 0.09
752 0.416
837 0.21
837 0.225
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "3. The site identifiers in the `Visited` table have two parts\n", + " separated by a '-':" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select distinct site from Visited;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "
DR-1
DR-3
MSK-4
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some major site identifiers are two letters long and some are three.\n", + "The \"in string\" function `instr(X, Y)`\n", + "returns the 1-based index of the first occurrence of string Y in string X,\n", + "or 0 if Y does not exist in X.\n", + "The substring function `substr(X, I)`\n", + "returns the substring of X starting at index I.\n", + "Use these two functions to produce a list of unique major site identifiers.\n", + "(For this data,\n", + "the list should contain only \"DR\" and \"MSK\")." + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/05-null.ipynb b/sql/novice/05-null.ipynb new file mode 100644 index 0000000..921d10b --- /dev/null +++ b/sql/novice/05-null.ipynb @@ -0,0 +1,440 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Missing Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Real-world data is never complete—there are always holes.\n", + "Databases represent these holes using special value called `null`.\n", + "`null` is not zero, `False`, or the empty string;\n", + "it is a one-of-a-kind value that means \"nothing here\".\n", + "Dealing with `null` requires a few special tricks\n", + "and some careful thinking.\n", + "\n", + "To start,\n", + "let's have a look at the `Visited` table.\n", + "There are eight records,\n", + "but #752 doesn't have a date—or rather,\n", + "its date is null:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext sqlitemagic" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
619DR-11927-02-08
622DR-11927-02-10
734DR-31939-01-07
735DR-31930-01-12
751DR-31930-02-26
752DR-3None
837MSK-41932-01-14
844DR-11932-03-22
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Null doesn't behave like other values.\n", + "If we select the records that come before 1930:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where dated<'1930-00-00';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "
619DR-11927-02-08
622DR-11927-02-10
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "we get two results,\n", + "and if we select the ones that come during or after 1930:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where dated>='1930-00-00';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
734DR-31939-01-07
735DR-31930-01-12
751DR-31930-02-26
837MSK-41932-01-14
844DR-11932-03-22
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "we get five,\n", + "but record #752 isn't in either set of results.\n", + "The reason is that\n", + "`null<'1930-00-00'`\n", + "is neither true nor false:\n", + "null means, \"We don't know,\"\n", + "and if we don't know the value on the left side of a comparison,\n", + "we don't know whether the comparison is true or false.\n", + "Since databases represent \"don't know\" as null,\n", + "the value of `null<'1930-00-00'`\n", + "is actually `null`.\n", + "`null>='1930-00-00'` is also null\n", + "because we can't answer to that question either.\n", + "And since the only records kept by a `where`\n", + "are those for which the test is true,\n", + "record #752 isn't included in either set of results.\n", + "\n", + "Comparisons aren't the only operations that behave this way with nulls.\n", + "`1+null` is `null`,\n", + "`5*null` is `null`,\n", + "`log(null)` is `null`,\n", + "and so on.\n", + "In particular,\n", + "comparing things to null with = and != produces null:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where dated=NULL;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 5 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where dated!=NULL;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To check whether a value is `null` or not,\n", + "we must use a special test `is null`:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where dated is NULL;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
752DR-3None
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or its inverse `is not null`:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where dated is not NULL;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
619DR-11927-02-08
622DR-11927-02-10
734DR-31939-01-07
735DR-31930-01-12
751DR-31930-02-26
837MSK-41932-01-14
844DR-11932-03-22
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 8 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Null values cause headaches wherever they appear.\n", + "For example,\n", + "suppose we want to find the all of salinity measurements\n", + "that weren't taken by Dyer.\n", + "It's natural to write the query like this:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Survey where quant='sal' and person!='lake';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
619dyersal0.13
622dyersal0.09
752roesal41.6
837roesal22.5
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 9 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "but this query filters omits the records\n", + "where we don't know who took the measurement.\n", + "Once again,\n", + "the reason is that when `person` is `null`,\n", + "the `!=` comparison produces `null`,\n", + "so the record isn't kept in our results.\n", + "If we want to keep these records\n", + "we need to add an explicit check:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Survey where quant='sal' and (person!='lake' or person is null);" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
619dyersal0.13
622dyersal0.09
735Nonesal0.06
752roesal41.6
837roesal22.5
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 10 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We still have to decide whether this is the right thing to do or not.\n", + "If we want to be absolutely sure that\n", + "we aren't including any measurements by Lake in our results,\n", + "we need to exclude all the records for which we don't know who did the work." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "1. Write a query that sorts the records in `Visited` by date,\n", + " omitting entries for which the date is not known\n", + " (i.e., is null).\n", + "\n", + "1. What do you expect the query:\n", + "\n", + " ```\n", + " select * from Visited where dated in ('1927-02-08', null);\n", + " ```\n", + "\n", + " to produce?\n", + " What does it actually produce?\n", + "\n", + "1. Some database designers prefer to use\n", + " a [sentinel value](../gloss.html#sentinel-value)\n", + " to mark missing data rather than `null`.\n", + " For example,\n", + " they will use the date \"0000-00-00\" to mark a missing date,\n", + " or -1.0 to mark a missing salinity or radiation reading\n", + " (since actual readings cannot be negative).\n", + " What does this simplify?\n", + " What burdens or risks does it introduce?" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/06-agg.ipynb b/sql/novice/06-agg.ipynb new file mode 100644 index 0000000..871bbcc --- /dev/null +++ b/sql/novice/06-agg.ipynb @@ -0,0 +1,468 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Aggregation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now want to calculate ranges and averages for our data.\n", + "We know how to select all of the dates from the `Visited` table:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext sqlitemagic" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select dated from Visited;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
1927-02-08
1927-02-10
1939-01-07
1930-01-12
1930-02-26
None
1932-01-14
1932-03-22
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "but to combine them,\n", + "wee must use an [aggregation function](../gloss.html#aggregation-function)\n", + "such as `min` or `max`.\n", + "Each of these functions takes a set of records as input,\n", + "and produces a single record as output:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select min(dated) from Visited;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
1927-02-08
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select max(dated) from Visited;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
1939-01-07
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`min` and `max` are just two of\n", + "the aggregation functions built into SQL.\n", + "Three others are `avg`,\n", + "`count`,\n", + "and `sum`:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select avg(reading) from Survey where quant='sal';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
7.20333333333
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select count(reading) from Survey where quant='sal';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
9
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 8 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select sum(reading) from Survey where quant='sal';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
64.83
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 9 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We used `count(reading)` here,\n", + "but we could just as easily have counted `quant`\n", + "or any other field in the table,\n", + "or even used `count(*)`,\n", + "since the function doesn't care about the values themselves,\n", + "just how many values there are.\n", + "\n", + "SQL lets us do several aggregations at once.\n", + "We can,\n", + "for example,\n", + "find the range of sensible salinity measurements:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
0.050.21
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 10 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also combine aggregated results with raw results,\n", + "although the output might surprise you:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select person, count(*) from Survey where quant='sal' and reading<=1.0;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
lake7
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 11 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Why does Lake's name appear rather than Roerich's or Dyer's?\n", + "The answer is that when it has to aggregate a field,\n", + "but isn't told how to,\n", + "the database manager chooses an actual value from the input set.\n", + "It might use the first one processed,\n", + "the last one,\n", + "or something else entirely.\n", + "\n", + "Another important fact is that when there are no values to aggregate,\n", + "aggregation's result is \"don't know\"\n", + "rather than zero or some other arbitrary value:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select person, max(reading), sum(reading) from Survey where quant='missing';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
NoneNoneNone
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 12 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One final important feature of aggregation functions is that\n", + "they are inconsistent with the rest of SQL in a very useful way.\n", + "If we add two values,\n", + "and one of them is null,\n", + "the result is null.\n", + "By extension,\n", + "if we use `sum` to add all the values in a set,\n", + "and any of those values are null,\n", + "the result should also be null.\n", + "It's much more useful,\n", + "though,\n", + "for aggregation functions to ignore null values\n", + "and only combine those that are non-null.\n", + "This behavior lets us write our queries as:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select min(dated) from Visited;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
1927-02-08
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 13 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "instead of always having to filter explicitly:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select min(dated) from Visited where dated is not null;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
1927-02-08
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 14 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. How many temperature readings did Frank Pabodie record,\n", + " and what was their average value?\n", + "\n", + "2. The average of a set of values is the sum of the values\n", + " divided by the number of values.\n", + " Does this mean that the `avg` function returns 2.0 or 3.0\n", + " when given the values 1.0, `null`, and 5.0?\n", + "\n", + "3. We want to calculate the difference between\n", + " each individual radiation reading\n", + " and the average of all the radiation readings.\n", + " We write the query:\n", + "\n", + " ```\n", + " select reading-avg(reading) from Survey where quant='rad';\n", + " ```\n", + "\n", + " What does this actually produce, and why?\n", + "\n", + "4. The function `group_concat(field, separator)`\n", + " concatenates all the values in a field\n", + " using the specified separator character\n", + " (or ',' if the separator isn't specified).\n", + " Use this to produce a one-line list of scientists' names,\n", + " such as:\n", + "\n", + " ```\n", + " William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth\n", + " ```\n", + "\n", + " Can you find a way to order the list by surname?" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/07-join.ipynb b/sql/novice/07-join.ipynb new file mode 100644 index 0000000..7510ba3 --- /dev/null +++ b/sql/novice/07-join.ipynb @@ -0,0 +1,386 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Combining Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In order to submit her data to a web site\n", + "that aggregates historical meteorological data,\n", + "Gina needs to format it as\n", + "latitude, longitude, date, quantity, and reading.\n", + "However,\n", + "her latitudes and longitudes are in the `Site` table,\n", + "while the dates of measurements are in the `Visited` table\n", + "and the readings themselves are in the `Survey` table.\n", + "She needs to combine these tables somehow.\n", + "\n", + "The SQL command to do this is `join`.\n", + "To see how it works,\n", + "let's start by joining the `Site` and `Visited` tables:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext sqlitemagic" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Site join Visited;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
DR-1-49.85-128.57619DR-11927-02-08
DR-1-49.85-128.57622DR-11927-02-10
DR-1-49.85-128.57734DR-31939-01-07
DR-1-49.85-128.57735DR-31930-01-12
DR-1-49.85-128.57751DR-31930-02-26
DR-1-49.85-128.57752DR-3None
DR-1-49.85-128.57837MSK-41932-01-14
DR-1-49.85-128.57844DR-11932-03-22
DR-3-47.15-126.72619DR-11927-02-08
DR-3-47.15-126.72622DR-11927-02-10
DR-3-47.15-126.72734DR-31939-01-07
DR-3-47.15-126.72735DR-31930-01-12
DR-3-47.15-126.72751DR-31930-02-26
DR-3-47.15-126.72752DR-3None
DR-3-47.15-126.72837MSK-41932-01-14
DR-3-47.15-126.72844DR-11932-03-22
MSK-4-48.87-123.4619DR-11927-02-08
MSK-4-48.87-123.4622DR-11927-02-10
MSK-4-48.87-123.4734DR-31939-01-07
MSK-4-48.87-123.4735DR-31930-01-12
MSK-4-48.87-123.4751DR-31930-02-26
MSK-4-48.87-123.4752DR-3None
MSK-4-48.87-123.4837MSK-41932-01-14
MSK-4-48.87-123.4844DR-11932-03-22
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`join` creates\n", + "the [cross product](../gloss.html#cross-product)\n", + "of two tables,\n", + "i.e.,\n", + "it joins each record of one with each record of the other\n", + "to give all possible combinations.\n", + "Since there are three records in `Site`\n", + "and eight in `Visited`,\n", + "the join's output has 24 records.\n", + "And since each table has three fields,\n", + "the output has six fields.\n", + " \n", + "What the join *hasn't* done is\n", + "figure out if the records being joined have anything to do with each other.\n", + "It has no way of knowing whether they do or not until we tell it how.\n", + "To do that,\n", + "we add a clause specifying that\n", + "we're only interested in combinations that have the same site name:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Site join Visited on Site.name=Visited.site;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
DR-1-49.85-128.57619DR-11927-02-08
DR-1-49.85-128.57622DR-11927-02-10
DR-1-49.85-128.57844DR-11932-03-22
DR-3-47.15-126.72734DR-31939-01-07
DR-3-47.15-126.72735DR-31930-01-12
DR-3-47.15-126.72751DR-31930-02-26
DR-3-47.15-126.72752DR-3None
MSK-4-48.87-123.4837MSK-41932-01-14
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`on` does the same job as `where`:\n", + "it only keeps records that pass some test.\n", + "(The difference between the two is that `on` filters records\n", + "as they're being created,\n", + "while `where` waits until the join is done\n", + "and then does the filtering.)\n", + "Once we add this to our query,\n", + "the database manager throws away records\n", + "that combined information about two different sites,\n", + "leaving us with just the ones we want.\n", + " \n", + "Notice that we used `table.field` to specify field names\n", + "in the output of the join.\n", + "We do this because tables can have fields with the same name,\n", + "and we need to be specific which ones we're talking about.\n", + "For example,\n", + "if we joined the `person` and `visited` tables,\n", + "the result would inherit a field called `ident`\n", + "from each of the original tables.\n", + "\n", + "We can now use the same dotted notation\n", + "to select the three columns we actually want\n", + "out of our join:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select Site.lat, Site.long, Visited.dated\n", + "from Site join Visited\n", + "on Site.name=Visited.site;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
-49.85-128.571927-02-08
-49.85-128.571927-02-10
-49.85-128.571932-03-22
-47.15-126.72None
-47.15-126.721930-01-12
-47.15-126.721930-02-26
-47.15-126.721939-01-07
-48.87-123.41932-01-14
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If joining two tables is good,\n", + "joining many tables must be better.\n", + "In fact,\n", + "we can join any number of tables\n", + "simply by adding more `join` clauses to our query,\n", + "and more `on` tests to filter out combinations of records\n", + "that don't make sense:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading\n", + "from Site join Visited join Survey\n", + "on Site.name=Visited.site\n", + "and Visited.ident=Survey.taken\n", + "and Visited.dated is not null;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
-49.85-128.571927-02-08rad9.82
-49.85-128.571927-02-08sal0.13
-49.85-128.571927-02-10rad7.8
-49.85-128.571927-02-10sal0.09
-47.15-126.721939-01-07rad8.41
-47.15-126.721939-01-07sal0.05
-47.15-126.721939-01-07temp-21.5
-47.15-126.721930-01-12rad7.22
-47.15-126.721930-01-12sal0.06
-47.15-126.721930-01-12temp-26.0
-47.15-126.721930-02-26rad4.35
-47.15-126.721930-02-26sal0.1
-47.15-126.721930-02-26temp-18.5
-48.87-123.41932-01-14rad1.46
-48.87-123.41932-01-14sal0.21
-48.87-123.41932-01-14sal22.5
-49.85-128.571932-03-22rad11.25
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can tell which records from `Site`, `Visited`, and `Survey`\n", + "correspond with each other\n", + "because those tables contain\n", + "[primary keys](../gloss.html#primary-key)\n", + "and [foreign keys](../gloss.html#foreign-key).\n", + "A primary key is a value,\n", + "or combination of values,\n", + "that uniquely identifies each record in a table.\n", + "A foreign key is a value (or combination of values) from one table\n", + "that identifies a unique record in another table.\n", + "Another way of saying this is that\n", + "a foreign key is the primary key of one table\n", + "that appears in some other table.\n", + "In our database,\n", + "`Person.ident` is the primary key in the `Person` table,\n", + "while `Survey.person` is a foreign key\n", + "relating the `Survey` table's entries\n", + "to entries in `Person`.\n", + "\n", + "Most database designers believe that\n", + "every table should have a well-defined primary key.\n", + "They also believe that this key should be separate from the data itself,\n", + "so that if we ever need to change the data,\n", + "we only need to make one change in one place.\n", + "One easy way to do this is\n", + "to create an arbitrary, unique ID for each record\n", + "as we add it to the database.\n", + "This is actually very common:\n", + "those IDs have names like \"student numbers\" and \"patient numbers\",\n", + "and they almost always turn out to have originally been\n", + "a unique record identifier in some database system or other.\n", + "As the query below demonstrates,\n", + "SQLite automatically numbers records as they're added to tables,\n", + "and we can use those record numbers in queries:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select rowid, * from Person;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
1dyerWilliamDyer
2pbFrankPabodie
3lakeAndersonLake
4roeValentinaRoerich
5danforthFrankDanforth
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. Write a query that lists all radiation readings from the DR-1 site.\n", + "\n", + "2. Write a query that lists all sites visited by people named \"Frank\".\n", + "\n", + "3. Describe in your own words what the following query produces:\n", + "\n", + " ```\n", + " select Site.name from Site join Visited\n", + " on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';\n", + " ```\n", + "\n", + "4. Why does the `Person` table have an `ident` field?\n", + " Why do we not just use scientists' names in the `Survey` table?\n", + "\n", + "5. Why does the table `Site` exist?\n", + " Why don't we just record latitudes and longitudes\n", + " directly in the `Visited` and `Survey` tables?" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/08-create.ipynb b/sql/novice/08-create.ipynb new file mode 100644 index 0000000..d463ae2 --- /dev/null +++ b/sql/novice/08-create.ipynb @@ -0,0 +1,188 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Creating and Modifying Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "So far we have only looked at how to get information out of a database,\n", + "both because that is more frequent than adding information,\n", + "and because most other operations only make sense\n", + "once queries are understood.\n", + "If we want to create and modify data,\n", + "we need to know two other pairs of commands.\n", + " \n", + "The first pair are `create table` and `drop table`.\n", + "While they are written as two words,\n", + "they are actually single commands.\n", + "The first one creates a new table;\n", + "its arguments are the names and types of the table's columns.\n", + "For example,\n", + "the following statements create the four tables in our survey database:\n", + "\n", + " create table Person(ident text, personal text, family text);\n", + " create table Site(name text, lat real, long real);\n", + " create table Visited(ident integer, site text, dated text);\n", + " create table Survey(taken integer, person text, quant real, reading real);\n", + "\n", + "We can get rid of one of our tables using:\n", + " \n", + " drop table Survey;\n", + "\n", + "Be very careful when doing this:\n", + "most databases have some support for undoing changes,\n", + "but it's better not to have to rely on it.\n", + " \n", + "Different database systems support different data types for table columns,\n", + "but most provide the following:\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
integer a signed integer
real a floating point number
text a character string
blob a \"binary large object\", such as an image
\n", + "\n", + "Most databases also support Booleans and date/time values;\n", + "SQLite uses the integers 0 and 1 for the former,\n", + "and represents the latter as discussed [earlier](#a:dates).\n", + "An increasing number of databases also support geographic data types,\n", + "such as latitude and longitude.\n", + "Keeping track of what particular systems do or do not offer,\n", + "and what names they give different data types,\n", + "is an unending portability headache.\n", + " \n", + "When we create a table,\n", + "we can specify several kinds of constraints on its columns.\n", + "For example,\n", + "a better definition for the `Survey` table would be:\n", + " \n", + " create table Survey(\n", + " taken integer not null, -- where reading taken\n", + " person text, -- may not know who took it\n", + " quant real not null, -- the quantity measured\n", + " reading real not null, -- the actual reading\n", + " primary key(taken, quant),\n", + " foreign key(taken) references Visited(ident),\n", + " foreign key(person) references Person(ident)\n", + " );\n", + "\n", + "Once again,\n", + "exactly what constraints are avialable\n", + "and what they're called\n", + "depends on which database manager we are using.\n", + " \n", + "Once tables have been created,\n", + "we can add and remove records using our other pair of commands,\n", + "`insert` and `delete`.\n", + "The simplest form of `insert` statement lists values in order:\n", + "\n", + " insert into Site values('DR-1', -49.85, -128.57);\n", + " insert into Site values('DR-3', -47.15, -126.72);\n", + " insert into Site values('MSK-4', -48.87, -123.40);\n", + "\n", + "We can also insert values into one table directly from another:\n", + " \n", + " create table JustLatLong(lat text, long TEXT);\n", + " insert into JustLatLong select lat, long from site;\n", + "\n", + "Deleting records can be a bit trickier,\n", + "because we have to ensure that the database remains internally consistent.\n", + "If all we care about is a single table,\n", + "we can use the `delete` command with a `where` clause\n", + "that matches the records we want to discard.\n", + "For example,\n", + "once we realize that Frank Danforth didn't take any measurements,\n", + "we can remove him from the `Person` table like this:\n", + " \n", + " delete from Person where ident = \"danforth\";\n", + "\n", + "But what if we removed Anderson Lake instead?\n", + "Our `Survey` table would still contain seven records\n", + "of measurements he'd taken,\n", + "but that's never supposed to happen:\n", + "`Survey.person` is a foreign key into the `Person` table,\n", + "and all our queries assume there will be a row in the latter\n", + "matching every value in the former.\n", + " \n", + "This problem is called [referential integrity](../gloss.html#referential-integrity):\n", + "we need to ensure that all references between tables can always be resolved correctly.\n", + "One way to do this is to delete all the records\n", + "that use `'lake'` as a foreign key\n", + "before deleting the record that uses it as a primary key.\n", + "If our database manager supports it,\n", + "we can automate this\n", + "using [cascading delete](../gloss.html#cascading-delete).\n", + "However,\n", + "this technique is outside the scope of this chapter.\n", + "\n", + "> Many applications use a hybrid storage model\n", + "> instead of putting everything into a database:\n", + "> the actual data (such as astronomical images) is stored in files,\n", + "> while the database stores the files' names,\n", + "> their modification dates,\n", + "> the region of the sky they cover,\n", + "> their spectral characteristics,\n", + "> and so on.\n", + "> This is also how most music player software is built:\n", + "> the database inside the application keeps track of the MP3 files,\n", + "> but the files themselves live on disk." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. Write an SQL statement to replace all uses of `null`\n", + " in `Survey.person`\n", + " with the string `'unknown'`.\n", + "\n", + "2. One of our colleagues has sent us a [CSV](../gloss.html#csv) file\n", + " containing temperature readings by Robert Olmstead,\n", + " which is formatted like this:\n", + "\n", + " ```\n", + " Taken,Temp\n", + " 619,-21.5\n", + " 622,-15.5\n", + " ```\n", + "\n", + " Write a small Python program that reads this file in\n", + " and prints out the SQL `insert` statements needed\n", + " to add these records to the survey database.\n", + " Note: you will need to add an entry for Olmstead\n", + " to the `Person` table.\n", + " If you are testing your program repeatedly,\n", + " you may want to investigate SQL's `insert or replace` command.\n", + "\n", + "3. SQLite has several administrative commands that aren't part of the SQL standard.\n", + " One of them is `.dump`,\n", + " which prints the SQL commands needed to re-create the database.\n", + " Another is `.load`,\n", + " which reads a file created by `.dump` and restores the database.\n", + " A colleague of yours thinks that storing dump files (which are text) in version control\n", + " is a good way to track and manage changes to the database.\n", + " What are the pros and cons of this approach?" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/09-hygiene.ipynb b/sql/novice/09-hygiene.ipynb new file mode 100644 index 0000000..9b607e3 --- /dev/null +++ b/sql/novice/09-hygiene.ipynb @@ -0,0 +1,29 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Data Hygiene" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/10-prog.ipynb b/sql/novice/10-prog.ipynb new file mode 100644 index 0000000..9681a14 --- /dev/null +++ b/sql/novice/10-prog.ipynb @@ -0,0 +1,262 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Programming with Databases" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To close,\n", + "let's have a look at how to access a database from\n", + "a general-purpose programming language like Python.\n", + "Other languages use almost exactly the same model:\n", + "library and function names may differ,\n", + "but the concepts are the same.\n", + "\n", + "Here's a short Python program that selects latitudes and longitudes\n", + "from an SQLite database stored in a file called `survey.db`:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "import sqlite3\n", + "connection = sqlite3.connect(\"survey.db\")\n", + "cursor = connection.cursor()\n", + "cursor.execute(\"select site.lat, site.long from site;\")\n", + "results = cursor.fetchall()\n", + "for r in results:\n", + " print r\n", + "cursor.close()\n", + "connection.close()" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "(-49.85, -128.57)\n", + "(-47.15, -126.72)\n", + "(-48.87, -123.4)\n" + ] + } + ], + "prompt_number": 1 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The program starts by importing the `sqlite3` library.\n", + "If we were connecting to MySQL, DB2, or some other database,\n", + "we would import a different library,\n", + "but all of them provide the same functions,\n", + "so that the rest of our program does not have to change\n", + "(at least, not much)\n", + "if we switch from one database to another.\n", + "\n", + "Line 2 establishes a connection to the database.\n", + "Since we're using SQLite,\n", + "all we need to specify is the name of the database file.\n", + "Other systems may require us to provide a username and password as well.\n", + "Line 3 then uses this connection to create\n", + "a [cursor](../gloss.html#cursor);\n", + "just like the cursor in an editor,\n", + "its role is to keep track of where we are in the database.\n", + "\n", + "On line 4, we use that cursor to ask the database to execute a query for us.\n", + "The query is written in SQL,\n", + "and passed to `cursor.execute` as a string.\n", + "It's our job to make sure that SQL is properly formatted;\n", + "if it isn't,\n", + "or if something goes wrong when it is being executed,\n", + "the database will report an error.\n", + "\n", + "The database returns the results of the query to us\n", + "in response to the `cursor.fetchall` call on line 5.\n", + "This result is a list with one entry for each record in the result set;\n", + "if we loop over that list (line 6) and print those list entries (line 7),\n", + "we can see that each one is a tuple\n", + "with one element for each field we asked for.\n", + "\n", + "Finally, lines 8 and 9 close our cursor and our connection,\n", + "since the database can only keep a limited number of these open at one time.\n", + "Since establishing a connection takes time,\n", + "though,\n", + "we shouldn't open a connection,\n", + "do one operation,\n", + "then close the connection,\n", + "only to reopen it a few microseconds later to do another operation.\n", + "Instead,\n", + "it's normal to create one connection that stays open for the lifetime of the program." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Queries in real applications will often depend on values provided by users.\n", + "For example,\n", + "this function takes a user's ID as a parameter and returns their name:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def get_name(database_file, person_ident):\n", + " query = \"select personal || ' ' || family from Person where ident='\" + person_ident + \"';\"\n", + "\n", + " connection = sqlite3.connect(database_file)\n", + " cursor = connection.cursor()\n", + " cursor.execute(query)\n", + " results = cursor.fetchall()\n", + " cursor.close()\n", + " connection.close()\n", + "\n", + " return results[0][0]\n", + "\n", + "print \"full name for dyer:\", get_name('survey.db', 'dyer')" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "full name for dyer: William Dyer\n" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We use string concatenation on the first line of this function\n", + "to construct a query containing the user ID we have been given.\n", + "This seems simple enough,\n", + "but what happens if someone gives us this string as input?\n", + "\n", + " dyer'; drop table Survey; select '\n", + "\n", + "It looks like there's garbage after the name of the project,\n", + "but it is very carefully chosen garbage.\n", + "If we insert this string into our query,\n", + "the result is:\n", + "\n", + " select personal || ' ' || family from Person where ident='dyer'; drop tale Survey; select '';\n", + "\n", + "If we execute this,\n", + "it will erase one of the tables in our database.\n", + " \n", + "This technique is called [SQL injection](../gloss.html#sql-injection),\n", + "and it has been used to attack thousands of programs over the years.\n", + "In particular,\n", + "many web sites that take data from users insert values directly into queries\n", + "without checking them carefully first.\n", + " \n", + "Since a villain might try to smuggle commands into our queries in many different ways,\n", + "the safest way to deal with this threat is\n", + "to replace characters like quotes with their escaped equivalents,\n", + "so that we can safely put whatever the user gives us inside a string.\n", + "We can do this by using a [prepared statement](../gloss.html#prepared-statement)\n", + "instead of formatting our statements as strings.\n", + "Here's what our example program looks like if we do this:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def get_name(database_file, person_ident):\n", + " query = \"select personal || ' ' || family from Person where ident=?;\"\n", + "\n", + " connection = sqlite3.connect(database_file)\n", + " cursor = connection.cursor()\n", + " cursor.execute(query, [person_ident])\n", + " results = cursor.fetchall()\n", + " cursor.close()\n", + " connection.close()\n", + "\n", + " return results[0][0]\n", + "\n", + "print \"full name for dyer:\", get_name('survey.db', 'dyer')" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "full name for dyer: William Dyer\n" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The key changes are in the query string and the `execute` call.\n", + "Instead of formatting the query ourselves,\n", + "we put question marks in the query template where we want to insert values.\n", + "When we call `execute`,\n", + "we provide a list\n", + "that contains as many values as there are question marks in the query.\n", + "The library matches values to question marks in order,\n", + "and translates any special characters in the values\n", + "into their escaped equivalents\n", + "so that they are safe to use." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. Write a Python program that creates a new database\n", + " in a file called `original.db`\n", + " containing a single table called `Pressure`,\n", + " with a single field called `reading`,\n", + " and inserts 100,000 random numbers between 10.0 and 25.0.\n", + " How long does it take this program to run?\n", + " How long does it take to run a program\n", + " that simply writes those random numbers to a file?\n", + "\n", + "2. Write a Python program that creates a new database\n", + " called `backup.db`\n", + " with the same structure as `original.db`\n", + " and copies all the values greater than 20.0\n", + " from `original.db` to `backup.db`.\n", + " Which is faster:\n", + " filtering values in the query,\n", + " or reading everything into memory and filtering in Python?" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file