From 812cdc5f0221367636de4585456ae523ddc4d21d Mon Sep 17 00:00:00 2001
From: Greg Wilson
Date: Sat, 9 Nov 2013 11:43:13 -0500
Subject: [PATCH 1/1] First three lessons on SQL for beginners
---
sql/novice/01-select.ipynb | 437 +++++
sql/novice/02-sort-dup.ipynb | 408 +++++
sql/novice/03-filter.ipynb | 476 ++++++
sql/novice/db.txt | 2927 ++++++++++++++++++++++++++++++++++
sql/novice/sqlitemagic.py | 49 +
sql/novice/survey.db | Bin 0 -> 20480 bytes
sql/novice/survey.sql | 370 +++++
sql/novice/tables.html | 68 +
8 files changed, 4735 insertions(+)
create mode 100644 sql/novice/01-select.ipynb
create mode 100644 sql/novice/02-sort-dup.ipynb
create mode 100644 sql/novice/03-filter.ipynb
create mode 100644 sql/novice/db.txt
create mode 100644 sql/novice/sqlitemagic.py
create mode 100644 sql/novice/survey.db
create mode 100644 sql/novice/survey.sql
create mode 100644 sql/novice/tables.html
diff --git a/sql/novice/01-select.ipynb b/sql/novice/01-select.ipynb
new file mode 100644
index 0000000..0215577
--- /dev/null
+++ b/sql/novice/01-select.ipynb
@@ -0,0 +1,437 @@
+{
+ "metadata": {
+ "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "heading",
+ "level": 1,
+ "metadata": {},
+ "source": [
+ "Selecting Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the late 1920s and early 1930s,\n",
+ "William Dyer,\n",
+ "Frank Pabodie,\n",
+ "and Valentina Roerich led expeditions to the\n",
+ "[Pole of Inaccessibility](http://en.wikipedia.org/wiki/Pole_of_inaccessibility)\n",
+ "in the South Pacific,\n",
+ "and then onward to Antarctica.\n",
+ "Two years ago,\n",
+ "their expedition journals were found in a storage locker at Miskatonic University.\n",
+ "We have scanned and OCR'd the data they contain,\n",
+ "and we now want to store that information\n",
+ "in a way that will make search and analysis easy.\n",
+ "\n",
+ "We basically have three options:\n",
+ "text files,\n",
+ "a spreadsheet,\n",
+ "or a database.\n",
+ "Text files are easiest to create,\n",
+ "and work well with version control,\n",
+ "but we would then have to build search and analysis tools ourselves.\n",
+ "Spreadsheets are good for doing simple analysis,\n",
+ "but they don't handle large or complex data sets very well.\n",
+ "We would therefore like to put this data in a database,\n",
+ "and these lessons will show how to do that."
+ ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "A Few Definitions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A [relational database](../gloss.html#relational-database)\n", + "is a way to store and manipulate information\n", + "that is arranged as [tables](../gloss.html#table).\n", + "Each table has columns (also known as [fields](../gloss.html#field-database)) which describe the data,\n", + "and rows (also known as [records](../gloss.html#record-database)) which contain the data.\n", + " \n", + "When we are using a spreadsheet,\n", + "we put formulas into cells to calculate new values based on old ones.\n", + "When we are using a database,\n", + "we send commands\n", + "(usually called [queries](../gloss.html#query))\n", + "to a [database manager](../gloss.html#database-manager):\n", + "a program that manipulates the database for us.\n", + "The database manager does whatever lookups and calculations the query specifies,\n", + "returning the results in a tabular form\n", + "that we can then use as a starting point for further queries.\n", + " \n", + "> Every database manager—Oracle,\n", + "> IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores\n", + "> data in a different way,\n", + "> so a database created with one cannot be used directly by another.\n", + "> However,\n", + "> every database manager can import and export data in a variety of formats,\n", + "> so it *is* possible to move information from one to another.\n", + "\n", + "Queries are written in a language called [SQL](../gloss.html#sql),\n", + "which stands for \"Structured Query Language\".\n", + "SQL provides hundreds of different ways to analyze and recombine data;\n", + "we will only look at a handful,\n", + "but that handful accounts for most of what scientists do.\n", + "\n", + "The tables below show the database we will use in our examples:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "**Person**: people who took readings.\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ident personal family
dyer William Dyer
pb Frank Pabodie
lake Anderson Lake
roe Valentina Roerich
danforth Frank Danforth
\n", + "\n", + "**Site**: locations where readings were taken.\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
name lat long
DR-1 -49.85 -128.57
DR-3 -47.15 -126.72
MSK-4 -48.87 -123.4
\n", + "\n", + "**Visited**: when readings were taken at specific sites.\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ident site dated
619 DR-1 1927-02-08
622 DR-1 1927-02-10
734 DR-3 1939-01-07
735 DR-3 1930-01-12
751 DR-3 1930-02-26
752 DR-3 NULL
837 MSK-4 1932-01-14
844 DR-1 1932-03-22
\n", + "
\n", + "**Survey**: the actual readings.\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
taken person quant reading
619 dyer rad 9.82
619 dyer sal 0.13
622 dyer rad 7.8
622 dyer sal 0.09
734 pb rad 8.41
734 lake sal 0.05
734 pb temp -21.5
735 pb rad 7.22
735 NULL sal 0.06
735 NULL temp -26.0
751 pb rad 4.35
751 pb temp -18.5
751 lake sal 0.1
752 lake rad 2.19
752 lake sal 0.09
752 lake temp -16.0
752 roe sal 41.6
837 lake rad 1.46
837 lake sal 0.21
837 roe sal 22.5
844 roe rad 11.25
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that three entries—one in the `Visited` table,\n", + "and two in the `Survey` table—are shown as `NULL`.\n", + "We'll return to these values [later](#s:null).\n", + "For now,\n", + "let's write an SQL query that displays scientists' names.\n", + "We do this using the SQL command `select`,\n", + "giving it the names of the columns we want and the table we want them from.\n", + "Our query and its output look like this:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext sqlitemagic" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select family, personal from Person;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
DyerWilliam
PabodieFrank
LakeAnderson
RoerichValentina
DanforthFrank
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The semi-colon at the end of the query\n", + "tells the database manager that the query is complete and ready to run.\n", + "We have written our commands and column names in lower case,\n", + "and the table name in Title Case,\n", + "but we don't have to:\n", + "as the example below shows,\n", + "SQL is [case insensitive](../gloss.html#case-insensitive)." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
DyerWilliam
PabodieFrank
LakeAnderson
RoerichValentina
DanforthFrank
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Whatever casing convention you choose,\n", + "please be consistent:\n", + "complex queries are hard enough to read without the extra cognitive load of random capitalization." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Going back to our query,\n", + "it's important to understand that\n", + "the rows and columns in a database table aren't actually stored in any particular order.\n", + "They will always be *displayed* in some order,\n", + "but we can control that in various ways.\n", + "For example,\n", + "we could swap the columns in the output by writing our query as:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select personal, family from Person;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
WilliamDyer
FrankPabodie
AndersonLake
ValentinaRoerich
FrankDanforth
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or even repeat columns:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select ident, ident, ident from Person;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
dyerdyerdyer
pbpbpb
lakelakelake
roeroeroe
danforthdanforthdanforth
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As a shortcut,\n", + "we can select all of the columns in a table using `*`:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Person;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
dyerWilliamDyer
pbFrankPabodie
lakeAndersonLake
roeValentinaRoerich
danforthFrankDanforth
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 8 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. Write a query that selects only site names from the `Site` table.\n", + "\n", + "2. Many people format queries as:\n", + "\n", + " ```\n", + " SELECT personal, family FROM person;\n", + " ```\n", + "\n", + " or as:\n", + "\n", + " ```\n", + " select Personal, Family from PERSON;\n", + " ```\n", + "\n", + " What style do you find easiest to read, and why?" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [], + "language": "python", + "metadata": {}, + "outputs": [] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/02-sort-dup.ipynb b/sql/novice/02-sort-dup.ipynb new file mode 100644 index 0000000..e583799 --- /dev/null +++ b/sql/novice/02-sort-dup.ipynb @@ -0,0 +1,408 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Sorting and Removing Duplicates" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Data is often redundant,\n", + "so queries often return redundant information.\n", + "For example,\n", + "if we select the quantitites that have been measured\n", + "from the `survey` table,\n", + "we get this:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext sqlitemagic" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select quant from Survey;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
rad
sal
rad
sal
rad
sal
temp
rad
sal
temp
rad
temp
sal
rad
sal
temp
sal
rad
sal
sal
rad
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can eliminate the redundant output\n", + "to make the result more readable\n", + "by adding the `distinct` keyword\n", + "to our query:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select distinct quant from Survey;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "
rad
sal
temp
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we select more than one column—for example,\n", + "both the survey site ID and the quantity measured—then\n", + "the distinct pairs of values are returned:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select distinct taken, quant from Survey;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
619rad
619sal
622rad
622sal
734rad
734sal
734temp
735rad
735sal
735temp
751rad
751temp
751sal
752rad
752sal
752temp
837rad
837sal
844rad
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice in both cases that duplicates are removed\n", + "even if they didn't appear to be adjacent in the database.\n", + "Again,\n", + "it's important to remember that rows aren't actually ordered:\n", + "they're just displayed that way." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. Write a query that selects distinct dates from the `Site` table." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As we mentioned earlier,\n", + "database records are not stored in any particular order.\n", + "This means that query results aren't necessarily sorted,\n", + "and even if they are,\n", + "we often want to sort them in a different way,\n", + "e.g., by the name of the project instead of by the name of the scientist.\n", + "We can do this in SQL by adding an `order by` clause to our query:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Person order by ident;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
danforthFrankDanforth
dyerWilliamDyer
lakeAndersonLake
pbFrankPabodie
roeValentinaRoerich
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default,\n", + "results are sorted in ascending order\n", + "(i.e.,\n", + "from least to greatest).\n", + "We can sort in the opposite order using `desc` (for \"descending\"):" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from person order by ident desc;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
roeValentinaRoerich
pbFrankPabodie
lakeAndersonLake
dyerWilliamDyer
danforthFrankDanforth
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(And if we want to make it clear that we're sorting in ascending order,\n", + "we can use `asc` instead of `desc`.)\n", + " \n", + "We can also sort on several fields at once.\n", + "For example,\n", + "this query sorts results first in ascending order by `taken`,\n", + "and then in descending order by `person`\n", + "within each group of equal `taken` values:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select taken, person from Survey order by taken asc, person desc;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
619dyer
619dyer
622dyer
622dyer
734pb
734pb
734lake
735pb
735None
735None
751pb
751pb
751lake
752roe
752lake
752lake
752lake
837roe
837lake
837lake
844roe
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is easier to understand if we also remove duplicates:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select distinct taken, person from Survey order by taken asc, person desc;" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
619dyer
622dyer
734pb
734lake
735pb
735None
751pb
751lake
752roe
752lake
837roe
837lake
844roe
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 8 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "FIXME" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/03-filter.ipynb b/sql/novice/03-filter.ipynb new file mode 100644 index 0000000..08e0baa --- /dev/null +++ b/sql/novice/03-filter.ipynb @@ -0,0 +1,476 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Filtering" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One of the most powerful features of a database is\n", + "the ability to [filter](../gloss.html#filter) data,\n", + "i.e.,\n", + "to select only those records that match certain criteria.\n", + "For example,\n", + "suppose we want to see when a particular site was visited.\n", + "We can select these records from the `Visited` table\n", + "by using a `where` clause in our query:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext sqlitemagic" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where site='DR-1';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "
619DR-11927-02-08
622DR-11927-02-10
844DR-11932-03-22
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The database manager executes this query in two stages.\n", + "First,\n", + "it checks at each row in the `Visited` table\n", + "to see which ones satisfy the `where`.\n", + "It then uses the column names following the `select` keyword\n", + "to determine what columns to display." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME: diagram" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This processing order means that\n", + "we can filter records using `where`\n", + "based on values in columns that aren't then displayed:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select ident from Visited where site='DR-1';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "
619
622
844
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can use many other Boolean operators to filter our data.\n", + "For example,\n", + "we can ask for all information from the DR-1 site collected since 1930:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Visited where (site='DR-1') and (dated>='1930-00-00');" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "
844DR-11932-03-22
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(The parentheses around the individual tests aren't strictly required,\n", + "but they help make the query easier to read.)\n", + "\n", + "> Most database managers have a special data type for dates.\n", + "> In fact, many have two:\n", + "> one for dates,\n", + "> such as \"May 31, 1971\",\n", + "> and one for durations,\n", + "> such as \"31 days\".\n", + "> SQLite doesn't:\n", + "> instead,\n", + "> it stores dates as either text\n", + "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n", + "> real numbers\n", + "> (the number of days since November 24, 4714 BCE),\n", + "> or integers\n", + "> (the number of seconds since midnight, January 1, 1970).\n", + "> If this sounds complicated,\n", + "> it is,\n", + "> but not nearly as complicated as figuring out\n", + "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we want to find out what measurements were taken by either Lake or Roerich,\n", + "we can combine the tests on their names using `or`:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Survey where person='lake' or person='roe';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
734lakesal0.05
751lakesal0.1
752lakerad2.19
752lakesal0.09
752laketemp-16.0
752roesal41.6
837lakerad1.46
837lakesal0.21
837roesal22.5
844roerad11.25
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Alternatively,\n", + "we can use `in` to see if a value is in a specific set:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Survey where person in ('lake', 'roe');" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
734lakesal0.05
751lakesal0.1
752lakerad2.19
752lakesal0.09
752laketemp-16.0
752roesal41.6
837lakerad1.46
837lakesal0.21
837roesal22.5
844roerad11.25
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can combine `and` with `or`,\n", + "but we need to be careful about which operator is executed first.\n", + "If we *don't* use parentheses,\n", + "we get this:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Survey where quant='sal' and person='lake' or person='roe';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
734lakesal0.05
751lakesal0.1
752lakesal0.09
752roesal41.6
837lakesal0.21
837roesal22.5
844roerad11.25
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "which is salinity measurements by Lake,\n", + "and *any* measurement by Roerich.\n", + "We probably want this instead:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select * from Survey where quant='sal' and (person='lake' or person='roe');" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
734lakesal0.05
751lakesal0.1
752lakesal0.09
752roesal41.6
837lakesal0.21
837roesal22.5
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 8 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally,\n", + "we can use `distinct` with `where`\n", + "to give a second level of filtering:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%sqlite survey.db\n", + "select distinct person, quant from Survey where person='lake' or person='roe';" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "html": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
lakesal
lakerad
laketemp
roesal
roerad
" + ], + "metadata": {}, + "output_type": "display_data", + "text": [ + "" + ] + } + ], + "prompt_number": 9 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But remember:\n", + "`distinct` is applied to the values displayed in the chosen columns,\n", + "not to the entire rows as they are being processed.\n", + "\n", + "> What we have just done is how most people \"grow\" their SQL queries.\n", + "> We started with something simple that did part of what we wanted,\n", + "> then added more clauses one by one,\n", + "> testing their effects as we went.\n", + "> This is a good strategy—in fact,\n", + "> for complex queries it's often the *only* strategy—but\n", + "> it depends on quick turnaround,\n", + "> and on us recognizing the right answer when we get it.\n", + "> \n", + "> The best way to achieve quick turnaround is often\n", + "> to put a subset of data in a temporary database\n", + "> and run our queries against that,\n", + "> or to fill a small database with synthesized records.\n", + "> For example,\n", + "> instead of trying our queries against an actual database of 20 million Australians,\n", + "> we could run it against a sample of ten thousand,\n", + "> or write a small program to generate ten thousand random (but plausible) records\n", + "> and use that." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Challenges\n", + "\n", + "1. Suppose we want to select all sites that lie within 30° of the equator.\n", + " Our first query is:\n", + "\n", + " ```\n", + " select * from Site where (lat > -30) or (lat < 30);\n", + " ```\n", + "\n", + " Explain why this is wrong,\n", + " and rewrite the query so that it is correct.\n", + "\n", + "2. Normalized salinity readings are supposed to be between 0.0 and 1.0.\n", + " Write a query that selects all records from `Survey`\n", + " with salinity values outside this range.\n", + "\n", + "3. The SQL test `*column-name* like *pattern*`\n", + " is true if the value in the named column\n", + " matches the pattern given;\n", + " the character '%' can be used any number of times in the pattern\n", + " to mean \"match zero or more characters\".\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Expression Value
'a' like 'a' True
'a' like '%a' True
'b' like '%a' False
'alpha' like 'a%' True
'alpha' like 'a%p%' True
\n", + " The expression `*column-name* not like *pattern*`\n", + " inverts the test.\n", + " Using `like`,\n", + " write a query that finds all the records in `Visited`\n", + " that *aren't* from sites labelled 'DR-something'." + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "FIXME" + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/sql/novice/db.txt b/sql/novice/db.txt new file mode 100644 index 0000000..59aa25a --- /dev/null +++ b/sql/novice/db.txt @@ -0,0 +1,2927 @@ +% Databases and SQL +% Greg Wilson +% June 2013 + +In the late 1920s and early 1930s, +William Dyer, +Frank Pabodie, +and Valentina Roerich led expeditions to the +[Pole of Inaccessibility](http://en.wikipedia.org/wiki/Pole_of_inaccessibility) +in the South Pacific, +and then onward to Antarctica. +Two years ago, +Gina Geographer discovered their expedition journals +in a storage locker at Miskatonic University. +She has scanned and OCR'd the data they contain, +and wants to store that information +in a way that will make search and analysis easy. + +Gina basically has three options: +text files, +a spreadsheet, +or a database. +Text files are easiest to create, +and work well with version control, +but she would then have to build all her search and analysis herself. +Spreadsheets are good for doing simple analysis, +but as she found in her last project, +they don't handle large or complex data sets very well. +She would therefore like to put her data in a database, +and this chapter will show her how. + +As many scientists have found out the hard way, +if collecting data is the first 90% of the work, +managing it is the other 90%. +In this chapter, +we'll see how to use a database to store and analyze field observations. +The techniques we will explore apply directly to other kinds of databases as well, +and as we'll see, +knowing how to get information *out* of a database is essential to +figuring out how to put data *in*. + +## For Instructors {.guide} + +Relational databases are not as widely used in science as in business, +but they are still a common way to store large data sets with complex structure. +Even when the data itself isn't in a database, +the metadata could be: +for example, +meteorological data might be stored in files on disk, +but data about when and where observations were made, +data ranges, +and so on could be in a database +to make it easier for scientists to find what they want to. + +The first few sections +(up to [Ordering Results](#s:sort)) +usually go very quickly. +The pace usually slows down a bit when null values and aggregation are discussed, +mostly because learners have a lot of details to keep straight by this point. +Things *really* slow down during the discussion of joins, +but this is the key idea in the whole lesson: +important ideas like primary keys and referential integrity only make sense +once learners have seen how they're used in joins. +It's worth going over things a couple of times if necessary +(with lots of examples). + +The final three sections are independent of each other, +and can be dropped if time is short. +Of the three, +people seem to care most about how to add data +(which only takes a few minutes to demonstrate), +and how to use databases from inside "real" programs. +The material on transactions is more abstract than the rest, +and should be omitted if [web programming](web.html) +isn't being taught. 
+Overall, +this material takes three hours to present +assuming that a short exercise is done with each topic. + +### Prerequisites {.prereq} + +Everything up to the [final section](#s:programming) +only requires some understanding of Boolean operators, +data types, +and pipelines, +and what's needed can actually be introduced on the fly. +That [final section](#s:programming), +which shows how to use databases from within programs, +depends on most of the [basic Python material](python.html). + +### Teaching Notes {.notes} + +* It isn't necessary to cover [sets and dictionaries](setdict.html) + before this material, + but if that has been discussed, + it's helpful to point out that a relational table is a generalized dictionary. +* Simple calculations are actually easier to do in a spreadsheet, + the advantages of using a database become clear + as soon as filtering and joins are needed. + Instructors may therefore want to show a spreadsheet + with the information from the four database tables + consolidated into a single sheet, + and demonstrate what's needed in both systems to answer questions like, + "What was the average radiation reading in 1931?" +* Some learners may have heard that NoSQL databases + (i.e., ones that don't use the relational model) + are the next big thing, + and ask why we're not teaching those. + The answers are: + * Relational databases are far more widely used than NoSQL databases. + * We have far more experience with relational databases + than with any other kind, + so we have a better idea of what to teach + and how to teach it. + * NoSQL databases are as different from each other + as they are from relational databases. + Until a leader emerges, + it isn't clear *which* NoSQL database we should teach. +* This discussion is a useful companion to that of vectorization + in the lesson on [numerical computing](numpy.html): + in both cases, + the key point is to describe *what* to do, + and let the computer figure out *how* to do it. + +## Selecting {#s:select} + +### Learning Objectives {.objectives} + +* Explain the difference between a table, a database, and a database manager. +* Explain the difference between a field and a record. +* Select specific fields from specific tables, and display them in a specific order. + +Duration: 15 minutes (not including time required to download database file and connect to it) + +### Lesson + +A [relational database](../gloss.html#relational-database) +is a way to store and manipulate information +that is arranged as [tables](../gloss.html#table). +Each table has columns (also known as [fields](../gloss.html#field-database)) which describe the data, +and rows (also known as [records](../gloss.html#record-database)) which contain the data. + + +When we are using a spreadsheet, +we put formulas into cells to calculate new values based on old ones. +When we are using a database, +we send commands +(usually called [queries](../gloss.html#query)) +to a [database manager](../gloss.html#database-manager): +a program that manipulates the database for us. +The database manager does whatever lookups and calculations the query specifies, +returning the results in a tabular form +that we can then use as a starting point for further queries. + +> ### Under the Hood {.box} +> +> Every database manager—Oracle, +> IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores +> data in a different way, +> so a database created with one cannot be used directly by another. 
+> However, +> every database manager can import and export data in a variety of formats, +> so it *is* possible to move information from one to another. + +Queries are written in a language called [SQL](../gloss.html#sql), +which stands for "Structured Query Language". +SQL provides hundreds of different ways to analyze and recombine data; +we will only look at a handful, +but that handful accounts for most of what scientists do. + +[Figure 1](#f:survey_db) shows +a simple database that stores some of the data +Gina extracted from the logs of those long-ago expeditions. +It contains four tables: + +
+ +Table Purpose +-------------------- -------------------- +`Person` People who took readings. +`Site` Locations of observation sites. +`Visited` When readings were taken at specific sites. +`Survey` The actual measurement values. + +
+ +**Person** + +
+ +ident personal family +-------------------- -------------------- -------------------- +dyer William Dyer +pb Frank Pabodie +lake Anderson Lake +roe Valentina Roerich +danforth Frank Danforth + +
+ +**Survey** + +
+ +taken person quant reading +-------------------- -------------------- -------------------- -------------------- +619 dyer rad 9.82 +619 dyer sal 0.13 +622 dyer rad 7.8 +622 dyer sal 0.09 +734 pb rad 8.41 +734 lake sal 0.05 +734 pb temp -21.5 +735 pb rad 7.22 +735 NULL sal 0.06 +735 NULL temp -26.0 +751 pb rad 4.35 +751 pb temp -18.5 +751 lake sal 0.1 +752 lake rad 2.19 +752 lake sal 0.09 +752 lake temp -16.0 +752 roe sal 41.6 +837 lake rad 1.46 +837 lake sal 0.21 +837 roe sal 22.5 +844 roe rad 11.25 + +
+ +**Site** + +
+ +name lat long +-------------------- -------------------- -------------------- +DR-1 -49.85 -128.57 +DR-3 -47.15 -126.72 +MSK-4 -48.87 -123.4 + +
+ +**Visited** + +
+ +ident site dated +-------------------- -------------------- -------------------- +619 DR-1 1927-02-08 +622 DR-1 1927-02-10 +734 DR-3 1939-01-07 +735 DR-3 1930-01-12 +751 DR-3 1930-02-26 +752 DR-3 NULL +837 MSK-4 1932-01-14 +844 DR-1 1932-03-22 + +
+ +
Figure 1: Survey Database
+ +Notice that three entries—one in the `Visited` table, +and two in the `Survey` table—are shown as `NULL`. +We'll return to these values [later](#s:null). +For now, +let's write an SQL query that displays scientists' names. +We do this using the SQL command `select`, +giving it the names of the columns we want and the table we want them from. +Our query and its output look like this: + + sqlite> select family, personal from Person; + +
+ +-------------------- -------------------- +Dyer William +Pabodie Frank +Lake Anderson +Roerich Valentina +Danforth Frank +-------------------- -------------------- + +
+ +The semi-colon at the end of the query +tells the database manager that the query is complete and ready to run. +If we enter the query without the semi-colon, +or press 'enter' part-way through the query, +the SQLite interpreter will give us a different prompt +to show us that it's waiting for more input: + + sqlite> select family, personal + ...> from Person + ...> ; + +
+ +-------------------- -------------------- +Dyer William +Pabodie Frank +Lake Anderson +Roerich Valentina +Danforth Frank +-------------------- -------------------- + +
+ +From now on, +we won't bother to display the prompt(s) with our commands. + +> ### Case and Consistency {.box} +> +> We have written our command and the column names in lower case, +> and the table name in title case, +> but we could use any mix: +> SQL is [case insensitive](../gloss.html#case-insensitive), +> so we could write them all in upper case, +> or even like this: +> +> ``` {.sql} +> SeLeCt famILY, PERSonal frOM PERson; +> ``` +> +> But please don't: +> large SQL queries are hard enough to read +> without the extra cognitive load of random capitalization. + +> ### Displaying Results {.box} +> +> Exactly *how* the database displays the query's results +> depends on what kind of interface we are using. +> If we are running SQLite directly from the shell, +> its default output looks like this: +> +> Dyer|William +> Pabodie|Frank +> Lake|Anderson +> Roerich|Valentina +> Danforth|Frank +> +> If we are using a graphical interface, +> such as the [SQLite Manager](https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/) plugin for Firefox +> or the [database extension](https://github.com/catherinedevlin/ipython-sql) for the IPython Notebook, +> our output will be displayed graphically +> ([Figure 2](#f:firefox_output) +> and [Figure 3](#f:notebook_output)). +> We'll use a simple table-based display in these notes. +> +>
+> Firefox SQLite Manager Output +>
Figure 2: Firefox SQLite Manager Output
+>
+> +>
+> IPython Notebook Database Extension Output +>
Figure 3: IPython Notebook Database Extension Output
+>
+ +Going back to our query, +it's important to understand that +the rows and columns in a database table aren't actually stored in any particular order. +They will always be *displayed* in some order, +but we can control that in various ways. +For example, +we could swap the columns in the output by writing our query as: + +``` {.sql} +select personal, family from Person; +``` + +
+ +-------------------- -------------------- +William Dyer +Frank Pabodie +Anderson Lake +Valentina Roerich +Frank Danforth +-------------------- -------------------- + +
+ +or even repeat columns: + +``` {.sql} +select ident, ident, ident from Person; +``` + +
+ +-------------------- -------------------- -------------------- +dyer dyer dyer +pb pb pb +lake lake lake +roe roe roe +danforth danforth danforth +-------------------- -------------------- -------------------- + +
+ +We will see ways to rearrange the rows [later](#s:sort). + +As a shortcut, we can select all of the columns in a table +using the wildcard `*`: + +``` {.sql} +select * from Person; +``` + +
+ +-------------------- -------------------- -------------------- +dyer William Dyer +pb Frank Pabodie +lake Anderson Lake +roe Valentina Roerich +danforth Frank Danforth +-------------------- -------------------- -------------------- + +
+ +### Summary {.keypoints} + +* A relational database stores information in tables with fields and records. +* A database manager is a program that manipulates a database. +* The commands or queries given to a database manager are usually written in a specialized language called SQL. +* SQL is case insensitive. +* The rows and columns of a database table aren't stored in any particular order. +* Use `select *fields* from *table*` to get all the values for specific fields from a single table. +* Use `select * from *table*` to select everything from a table. + +### Challenges {.challenges} + +* Write a query that selects only site names from the `Site` table. + +* Many people format queries as: + + ``` + SELECT personal, family FROM person; + ``` + + or as: + + ``` + select Personal, Family from PERSON; + ``` + + What style do you find easiest to read, and why? + +## Removing Duplicates {#s:distinct} + +### Learning Objectives {.objectives} + +* Write queries that only display distinct results once. + +Duration: 5 minutes. + +### Lesson + +Data is often redundant, +so queries often return redundant information. +For example, +if we select the quantitites that have been measured +from the `survey` table, +we get this: + +``` {.sql} +select quant from Survey; +``` + +
+ +-------------------- +rad +sal +rad +sal +rad +sal +temp +rad +sal +temp +rad +temp +sal +rad +sal +temp +sal +rad +sal +sal +rad +-------------------- + +
+ +We can eliminate the redundant output +to make the result more readable +by adding the `distinct` keyword +to our query: + +``` {.sql} +select distinct quant from Survey; +``` + +
+ +-------------------- +rad +sal +temp +-------------------- + +
+ +If we select more than one column—for example, +both the survey site ID and the quantity measured—then +the distinct pairs of values are returned: + +``` {.sql} +select distinct taken, quant from Survey; +``` + +
+ +-------------------- -------------------- +619 rad +619 sal +622 rad +622 sal +734 rad +734 sal +734 temp +735 rad +735 sal +735 temp +751 rad +751 temp +751 sal +752 rad +752 sal +752 temp +837 rad +837 sal +844 rad +-------------------- -------------------- + +
+ +Notice in both cases that duplicates are removed +even if they didn't appear to be adjacent in the database. +Again, +it's important to remember that rows aren't actually ordered: +they're just displayed that way. + +### Summary {.keypoints} + +* Use `distinct` to eliminate duplicates from a query's output. + +### Challenges {.challenges} + +* Write a query that selects distinct dates from the `Site` table. + +* If you are using SQLite from the command line, + you can run a single query by passing it to the interpreter + right after the path to the database file: + + ``` + $ sqlite3 survey.db 'select * from Person;' + ``` + +
+ + -------------------- -------------------- -------------------- + dyer William Dyer + pb Frank Pabodie + lake Anderson Lake + roe Valentina Roerich + danforth Frank Danforth + -------------------- -------------------- -------------------- + +
+ + Fill in the missing commands in the pipeline below + so that the output contains no redundant values. + + ``` + $ sqlite3 survey.db 'select person, quant from Survey;' | ____ | ____ + ``` + + Do you think this is less efficient, just as efficient, or more efficient + that using `distinct` for large data? + +## Filtering {#s:filter} + +### Learning Objectives {.objectives} + +* Write queries that select records based on the values of their fields. +* Write queries that select records using combinations of several tests on their fields' values. +* Build up complex filtering criteria incrementally. +* Explain the logical order in which filtering by field value and displaying fields takes place. + +Duration: 5-10 minutes. + +### Lesson + +One of the most powerful features of a database is +the ability to [filter](../gloss.html#filter) data, +i.e., +to select only those records that match certain criteria. +For example, +suppose we want to see when a particular site was visited. +We can select these records from the `Visited` table +by using a `where` clause in our query: + +``` {.sql} +select * from Visited where site='DR-1'; +``` + +
+ +-------------------- -------------------- -------------------- +619 DR-1 1927-02-08 +622 DR-1 1927-02-10 +844 DR-1 1932-03-22 +-------------------- -------------------- -------------------- + +
+ +The database manager executes this query in two stages +([Figure 4](#f:pipeline_where)). +First, +it checks at each row in the `Visited` table +to see which ones satisfy the `where`. +It then uses the column names following the `select` keyword +to determine what columns to display. + +
+ Two-Stage Query Processing Pipeline +
Figure 4: Two-Stage Query Processing Pipeline
+
+ +This processing order means that +we can filter records using `where` +based on values in columns that aren't then displayed: + +``` {.sql} +select ident from Visited where site='DR-1'; +``` + +
+ +-------------------- +619 +622 +844 +-------------------- + +
+ +We can use many other Boolean operators to filter our data. +For example, +we can ask for all information from the DR-1 site collected since 1930: + +``` {.sql} +select * from Visited where (site='DR-1') and (dated>='1930-00-00'); +``` + +
+ +-------------------- -------------------- -------------------- +844 DR-1 1932-03-22 +-------------------- -------------------- -------------------- + +
+ +(The parentheses around the individual tests aren't strictly required, +but they help make the query easier to read.) + +> ### Working With Dates {#a:dates .box} +> +> Most database managers have a special data type for dates. +> In fact, many have two: +> one for dates, +> such as "May 31, 1971", +> and one for durations, +> such as "31 days". +> SQLite doesn't: +> instead, +> it stores dates as either text +> (in the ISO-8601 standard format "YYYY-MM-DD HH:MM:SS.SSSS"), +> real numbers +> (the number of days since November 24, 4714 BCE), +> or integers +> (the number of seconds since midnight, January 1, 1970). +> If this sounds complicated, +> it is, +> but not nearly as complicated as figuring out +> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar). + +If we want to find out what measurements were taken by either Lake or Roerich, +we can combine the tests on their names using `or`: + +``` {.sql} +select * from Survey where person='lake' or person='roe'; +``` + +
+ +-------------------- -------------------- -------------------- -------------------- +734 lake sal 0.05 +751 lake sal 0.1 +752 lake rad 2.19 +752 lake sal 0.09 +752 lake temp -16.0 +752 roe sal 41.6 +837 lake rad 1.46 +837 lake sal 0.21 +837 roe sal 22.5 +844 roe rad 11.25 +-------------------- -------------------- -------------------- -------------------- + +
+ +Alternatively, +we can use `in` to see if a value is in a specific set: + +``` {.sql} +select * from Survey where person in ('lake', 'roe'); +``` + +
+ +-------------------- -------------------- -------------------- -------------------- +734 lake sal 0.05 +751 lake sal 0.1 +752 lake rad 2.19 +752 lake sal 0.09 +752 lake temp -16.0 +752 roe sal 41.6 +837 lake rad 1.46 +837 lake sal 0.21 +837 roe sal 22.5 +844 roe rad 11.25 +-------------------- -------------------- -------------------- -------------------- + +
+ +We can combine `and` with `or`, +but we need to be careful about which operator is executed first. +If we *don't* use parentheses, +we get this: + +``` {.sql} +select * from Survey where quant='sal' and person='lake' or person='roe'; +``` + +
+ +-------------------- -------------------- -------------------- -------------------- +734 lake sal 0.05 +751 lake sal 0.1 +752 lake sal 0.09 +752 roe sal 41.6 +837 lake sal 0.21 +837 roe sal 22.5 +844 roe rad 11.25 +-------------------- -------------------- -------------------- -------------------- + +
+ +which is salinity measurements by Lake, +and *any* measurement by Roerich. +We probably want this instead: + +``` {.sql} +select * from Survey where quant='sal' and (person='lake' or person='roe'); +``` + +
+ +-------------------- -------------------- -------------------- -------------------- +734 lake sal 0.05 +751 lake sal 0.1 +752 lake sal 0.09 +752 roe sal 41.6 +837 lake sal 0.21 +837 roe sal 22.5 +-------------------- -------------------- -------------------- -------------------- + +
+ +Finally, +we can use `distinct` with `where` +to give a second level of filtering: + +``` {.sql} +select distinct person, quant from Survey where person='lake' or person='roe'; +``` + +
+ +-------------------- -------------------- +lake sal +lake rad +lake temp +roe sal +roe rad +-------------------- -------------------- + +
+ +But remember: +`distinct` is applied to the values displayed in the chosen columns, +not to the entire rows as they are being processed. + +> ### Growing Queries {.box} +> +> What we have just done is how most people "grow" their SQL queries. +> We started with something simple that did part of what we wanted, +> then added more clauses one by one, +> testing their effects as we went. +> This is a good strategy—in fact, +> for complex queries it's often the *only* strategy—but +> it depends on quick turnaround, +> and on us recognizing the right answer when we get it. +> +> The best way to achieve quick turnaround is often +> to put a subset of data in a temporary database +> and run our queries against that, +> or to fill a small database with synthesized records. +> For example, +> instead of trying our queries against an actual database of 20 million Australians, +> we could run it against a sample of ten thousand, +> or write a small program to generate ten thousand random (but plausible) records +> and use that. + +### Summary {.keypoints} + +* Use `where *test*` in a query to filter records based on Boolean tests. +* Use `and` and `or` to combine tests. +* Use `in` to check if a value is in a set. +* Build up queries a bit at a time, and test them against small data sets. + +### Challenges {.challenges} + +* Gina wants to select all sites that lie within 30° of the equator. + Her query is: + + ``` {.sql} + select * from Site where (lat > -30) or (lat < 30); + ``` {.sql} + + Explain why this is wrong, + and rewrite the query so that it is correct. + +* Normalized salinity readings are supposed to be between 0.0 and 1.0. + Write a query that selects all records from `Survey` + with salinity values outside this range. + +* The SQL test `*column-name* like *pattern*` + is true if the value in the named column + matches the pattern given; + the character '%' can be used any number of times in the pattern + to mean "match zero or more characters". + + Expression Value + -------------------- -------------------- + `'a' like 'a'` `True` + `'a' like '%a'` `True` + `'b' like '%a'` `False` + `'alpha' like 'a%'` `True` + `'alpha' like 'a%p%'` `True` + `'beta' like 'a%p%'` `False` + + The expression `*column-name* not like *pattern*` + inverts the test. + Using `like`, + write a query that finds all the records in `Visited` + that *aren't* from sites labelled 'DR-something'. + +## Calculating New Values {#s:calc} + +### Learning Objectives {.objectives} + +* Write queries that do arithmetic using the values in individual records. + +Duration: 5 minutes. + +### Lesson + +After carefully reading the expedition logs, +Gina realizes that the radiation measurements they report +may need to be corrected upward by 5%. +Rather than modifying the stored data, +she can do this calculation on the fly +as part of her query: + +``` {.sql} +select 1.05 * reading from Survey where quant='rad'; +``` + +
+ +-------------------- +10.311 +8.19 +8.8305 +7.581 +4.5675 +2.2995 +1.533 +11.8125 +-------------------- + +
+ +When we run the query, +the expression `1.05 * reading` is evaluated for each row. +Expressions can use any of the fields, +all of usual arithmetic operators, +and a variety of common functions. +(Exactly which ones depends on which database manager is being used.) +For example, +we can convert temperature readings from Fahrenheit to Celsius +and round to two decimal places as follows: + +``` {.sql} +select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp'; +``` + +
+ +-------------------- -------------------- +734 -29.72 +735 -32.22 +751 -28.06 +752 -26.67 +-------------------- -------------------- + +
+ +We can also combine values from different fields, +for example by using the string concatenation operator `||`: + +``` {.sql} +select personal || ' ' || family from Person; +``` + +
+ +-------------------- +William Dyer +Frank Pabodie +Anderson Lake +Valentina Roerich +Frank Danforth +-------------------- + +
+ +> ### A Note on Names {.box} +> +> It may seem strange to use `personal` and `family` as field names +> instead of `first` and `last`, +> but it's a necessary first step toward handling cultural differences. +> For example, +> consider the following rules: +> +> Full Name Alphabetized Under Reason +> -------------------- -------------------- -------------------- +> Liu Xiaobo Liu Chinese family names come first +> Leonardo da Vinci Leonardo "da Vinci" just means "from Vinci" +> Catherine de Medici Medici family name +> Jean de La Fontaine La Fontaine family name is "La Fontaine" +> Juan Ponce de Leon Ponce de Leon full family name is "Ponce de Leon" +> Gabriel Garcia Marquez Garcia Marquez double-barrelled Spanish surnames +> Wernher von Braun von *or* Braun depending on whether he was in Germany or the US +> Elizabeth Alexandra May Windsor Elizabeth monarchs alphabetize by the name under which they reigned +> Thomas a Beckett Thomas and saints according to the names by which they were canonized +> +> Clearly, +> even a two-part division into "personal" and "family" +> isn't enough... + +### Summary {.keypoints} + +* Use expressions as fields to calculate per-record values. + +### Challenges {.challenges} + +* After further reading, + Gina realizes that Valentina Roerich + was reporting salinity as percentages. + Write a query that returns all of her salinity measurements + from the `Survey` table + with the values divided by 100. + +* The `union` operator combines the results of two queries: + + ``` {.sql} + select * from Person where ident='dyer' union select * from Person where ident='roe'; + ``` + +
+ + -------------------- -------------------- -------------------- + dyer William Dyer + roe Valentina Roerich + -------------------- -------------------- -------------------- + +
+ + Use `union` to create a consolidated list of salinity measurements + in which Roerich's, and only Roerich's, + have been corrected as described in the previous challenge. + The output should be something like: + +
+ + -------------------- -------------------- + 619 0.13 + 622 0.09 + 734 0.05 + 751 0.1 + 752 0.09 + 752 0.416 + 837 0.21 + 837 0.225 + -------------------- -------------------- + +
+ +* The site identifiers in the `Visited` table have two parts + separated by a '-': + + ``` {.sql} + select distinct site from Visited; + ``` + +
+ + -------------------- + DR-1 + DR-3 + MSK-4 + -------------------- + +
+ + Some major site identifiers are two letters long and some are three. + The "in string" function `instr(X, Y)` + returns the 1-based index of the first occurrence of string Y in string X, + or 0 if Y does not exist in X. + The substring function `substr(X, I)` + returns the substring of X starting at index I. + Use these two functions to produce a list of unique major site identifiers. + (For this data, + the list should contain only "DR" and "MSK"). + +* Pabodie's journal notes that all his temperature measurements + are in °F, + but Lake's journal does not report whether he used °F or °C. + How should Gina treat his measurements, + and why? + +## Ordering Results {#s:sort} + +### Learning Objectives {.objectives} + +* Write queries that order results according to fields' values. +* Write queries that order results according to calculated values. +* Explain why it is possible to sort records using the values of fields that are not displayed. + +Duration: 5 minutes. + +### Lesson + +As we mentioned earlier, +database records are not stored in any particular order. +This means that query results aren't necessarily sorted, +and even if they are, +we often want to sort them in a different way, +e.g., by the name of the project instead of by the name of the scientist. +We can do this in SQL by adding an `order by` clause to our query: + +``` {.sql} +select reading from Survey where quant='rad' order by reading; +``` + +
+ +-------------------- +1.46 +2.19 +4.35 +7.22 +7.8 +8.41 +9.82 +11.25 +-------------------- + +
+ +By default, +results are sorted in ascending order +(i.e., +from least to greatest). +We can sort in the opposite order using `desc` (for "descending"): + +``` {.sql} +select reading from Survey where quant='rad' order by reading desc; +``` + +
+ +-------------------- +11.25 +9.82 +8.41 +7.8 +7.22 +4.35 +2.19 +1.46 +-------------------- + +
+ +(And if we want to make it clear that we're sorting in ascending order, +we can use `asc` instead of `desc`.) + +We can also sort on several fields at once. +For example, +this query sorts results first in ascending order by `taken`, +and then in descending order by `person` +within each group of equal `taken` values: + +``` {.sql} +select taken, person from Survey order by taken asc, person desc; +``` + +
+ +-------------------- -------------------- +619 dyer +619 dyer +622 dyer +622 dyer +734 pb +734 pb +734 lake +735 pb +735 +735 +751 pb +751 pb +751 lake +752 roe +752 lake +752 lake +752 lake +837 roe +837 lake +837 lake +844 roe +-------------------- -------------------- + +
+ +This is easier to understand if we also remove duplicates: + +``` {.sql} +select distinct taken, person from Survey order by taken asc, person desc; +``` + +
+ +-------------------- -------------------- +619 dyer +622 dyer +734 pb +734 lake +735 pb +735 +751 pb +751 lake +752 roe +752 lake +837 roe +837 lake +844 roe +-------------------- -------------------- + +
+ +Since sorting happens before columns are filtered, +we can sort by a field that isn't actually displayed: + +``` {.sql} +select reading from Survey where quant='rad' order by taken; +``` + +
+ +-------------------- +9.82 +7.8 +8.41 +7.22 +4.35 +2.19 +1.46 +11.25 +-------------------- + +
+ +We can also sort results by the value of an expression. +In SQLite, +for example, +the `random` function returns a pseudo-random integer +each time it is called +(i.e., +once per record): + +``` {.sql} +select random(), ident from Person; +``` + +
+ +-------------------- -------------------- +-6309766557809954936 dyer +-2098461436941487136 pb +-2248225962969032314 lake +6062184424509295966 roe +-1268956870222271271 danforth +-------------------- -------------------- + +
+ +So to randomize the order of our query results, +e.g., when doing clinical trials, +we can sort them by the value of this function: + +``` {.sql} +select ident from Person order by random(); +``` + +
+ +-------------------- +danforth +pb +dyer +lake +roe +-------------------- + +
+ +``` {.sql} +select ident from Person order by random(); +``` + +
+ +-------------------- +roe +dyer +pb +lake +danforth +-------------------- + +
+ +Our query pipeline now has four stages +([Figure 5](#f:pipeline_sort_distinct)): + +* Select the rows that pass the `where` criteria. +* Sort them if required. +* Filter the columns according to the `select` criteria. +* Remove duplicates if required. + +
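
As a small illustration
(a sketch that only uses queries we have already seen),
a single statement can exercise all four stages at once:

``` {.sql}
select distinct person from Survey where quant='rad' order by person;
```

Here the `where` clause keeps only the radiation readings,
`order by` sorts them,
the `select` list keeps just the `person` column,
and `distinct` collapses the repeats,
leaving one sorted name per scientist who took a radiation reading.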
+ Four-Stage Query Processing Pipeline +
Figure 5: Four-Stage Query Processing Pipeline
+

### Summary {.keypoints}

* Use `order by` (with `asc` or `desc`) to order a query's results.
* Use `random` to generate pseudo-random numbers.

### Challenges {.challenges}

* Create a list of site identifiers
  and their distance from the equator in kilometers,
  sorted from furthest to closest.
  (A degree of latitude corresponds to 111.12 km.)

* Gina needs a list of radiation measurements from all sites
  sorted by when they were taken.
  The query:

  ``` {.sql}
  select * from Survey where quant='rad' order by taken;
  ```

  produces the correct answer for the data used in our examples.
  Explain when and why it might produce the wrong answer.

## Missing Data {#s:null}

### Learning Objectives {.objectives}

* Explain what databases use the special value `NULL` to represent.
* Explain why databases should *not* use their own special values (like 9999 or "N/A") to represent missing or unknown data.
* Explain what atomic and aggregate calculations involving `NULL` produce, and why.
* Write queries that include or exclude records containing `NULL`.

Duration: 10-20 minutes
(depending on whether or not the instructor includes an anecdote about
what happens when you *don't* take missing data into account).

### Lesson

Real-world data is never complete—there are always holes.
Databases represent these holes using a special value called `null`.
`null` is not zero, `False`, or the empty string;
it is a one-of-a-kind value that means "nothing here".
Dealing with `null` requires a few special tricks
and some careful thinking.

To start,
let's have a look at the `Visited` table.
There are eight records,
but #752 doesn't have a date—or rather,
its date is null:

``` {.sql}
select * from Visited;
```
+ +-------------------- -------------------- -------------------- +619 DR-1 1927-02-08 +622 DR-1 1927-02-10 +734 DR-3 1939-01-07 +735 DR-3 1930-01-12 +751 DR-3 1930-02-26 +752 DR-3 +837 MS-4 1932-01-14 +844 DR-1 1932-03-22 +-------------------- -------------------- -------------------- + +
+ +> ### Displaying Nulls {.box} +> +> Different databases display nulls differently. +> Unfortunately, +> SQLite's default is to print nothing at all, +> which makes nulls easy to overlook +> (particularly if they're in the middle of a long row). + +Null doesn't behave like other values. +If we select the records that come before 1930: + +``` {.sql} +select * from Visited where dated<'1930-00-00'; +``` + +
+ +-------------------- -------------------- -------------------- +619 DR-1 1927-02-08 +622 DR-1 1927-02-10 +-------------------- -------------------- -------------------- + +
+ +we get two results, +and if we select the ones that come during or after 1930: + +``` {.sql} +select * from Visited where dated>='1930-00-00'; +``` + +
+ +-------------------- -------------------- -------------------- +734 DR-3 1939-01-07 +735 DR-3 1930-01-12 +751 DR-3 1930-02-26 +837 MS-4 1932-01-14 +844 DR-1 1932-03-22 +-------------------- -------------------- -------------------- + +

we get five,
but record #752 isn't in either set of results.
The reason is that
`null<'1930-00-00'`
is neither true nor false:
null means "we don't know",
and if we don't know the value on the left side of a comparison,
we don't know whether the comparison is true or false.
Since databases represent "don't know" as null,
the value of `null<'1930-00-00'`
is actually `null`.
`null>='1930-00-00'` is also null
because we can't answer that question either.
And since the only records kept by a `where`
are those for which the test is true,
record #752 isn't included in either set of results.

Comparisons aren't the only operations that behave this way with nulls.
`1+null` is `null`,
`5*null` is `null`,
`log(null)` is `null`,
and so on.
In particular,
comparing things to null with `=` and `!=` produces null:

``` {.sql}
select * from Visited where dated=NULL;
```

``` {.sql}
select * from Visited where dated!=NULL;
```

To check whether a value is `null` or not,
we must use a special test `is null`:

``` {.sql}
select * from Visited where dated is NULL;
```
+ +-------------------- -------------------- +752 DR-3 +-------------------- -------------------- + +
+ +or its inverse `is not null`: + +``` {.sql} +select * from Visited where dated is not NULL; +``` + +
+ +-------------------- -------------------- -------------------- +619 DR-1 1927-02-08 +622 DR-1 1927-02-10 +734 DR-3 1939-01-07 +735 DR-3 1930-01-12 +751 DR-3 1930-02-26 +837 MS-4 1932-01-14 +844 DR-1 1932-03-22 +-------------------- -------------------- -------------------- + +

Null values cause headaches wherever they appear.
For example,
suppose we want to find all of the salinity measurements
that weren't taken by Lake.
It's natural to write the query like this:

``` {.sql}
select * from Survey where quant='sal' and person!='lake';
```
+ +-------------------- -------------------- -------------------- -------------------- +619 dyer sal 0.13 +622 dyer sal 0.09 +752 roe sal 41.6 +837 roe sal 22.5 +-------------------- -------------------- -------------------- -------------------- + +

but this query omits the records
where we don't know who took the measurement.
Once again,
the reason is that when `person` is `null`,
the `!=` comparison produces `null`,
so the record isn't kept in our results.
If we want to keep these records,
we need to add an explicit check:

``` {.sql}
select * from Survey where quant='sal' and (person!='lake' or person is null);
```
+ +-------------------- -------------------- -------------------- -------------------- +619 dyer sal 0.13 +622 dyer sal 0.09 +735 sal 0.06 +752 roe sal 41.6 +837 roe sal 22.5 +-------------------- -------------------- -------------------- -------------------- + +
+ + +We still have to decide whether this is the right thing to do or not. +If we want to be absolutely sure that +we aren't including any measurements by Lake in our results, +we need to exclude all the records for which we don't know who did the work. + +> ### What Happens When You Forget {.box} +> +> Several years ago, +> I was helping a group who were looking at +> the spread of drug-resistant tuberculosis (DRTB) +> in industrialized countries. +> In particular, +> they wanted to know if it was spreading faster among less affluent people. +> +> We tackled the problem by combining two data sets. +> The first gave us skin and blood test results for DRTB +> along with patients' postal codes +> (the only identifying information we were allowed---we didn't even have gender). +> The second was Canadian census data that gave us +> median income per postal code. +> Since a PC is about 300-800 people, +> we felt justified in joining the first with the second +> to estimate incomes for people with positive and negative test results. +> +> To our surprise, +> we didn't find a correlation between income and infection. +> We were just about to publish when someone spotted the mistake I'd made. +> +> Question: Who *doesn't* have a postal code? +> +> Answer: Homeless people. +> +> When I did the join, +> I was throwing away homeless people, +> which introduced a statistically significant error in my results. +> But I couldn't just set the income of anyone without a postal code to zero, +> because our sample included another set of people without postal codes: +> 16-21 year olds whose addresses were suppressed +> because they had tested positive for sexually-transmitted diseases. +> +> At this point the problem is no longer a database issue, +> but rather a question of statistics. +> The takeaway is, +> checking your queries when you're programming is as important as +> checking your samples when you're doing chemistry. + +### Summary {.keypoints} + +* Use `null` in place of missing information. +* Almost every operation involving `null` produces `null` as a result. +* Test for nulls using `is null` and `is not null`. + +### Challenges {.challenges} + +* Write a query that sorts the records in `Visited` by date, + omitting entries for which the date is not known + (i.e., is null). + +* What do you expect the query: + + ``` {.sql} + select * from Visited where dated in ('1927-02-08', null); + ``` + + to produce? + What does it actually produce? + +* Some database designers prefer to use + a [sentinel value](../gloss.html#sentinel-value) + to mark missing data rather than `null`. + For example, + they will use the date "0000-00-00" to mark a missing date, + or -1.0 to mark a missing salinity or radiation reading + (since actual readings cannot be negative). + What does this simplify? + What burdens or risks does it introduce? + +## Aggregation {#s:aggregate} + +### Learning Objectives {.objectives} + +* Write queries that combine values from many records to create a single aggregate value. +* Write queries that put records into groups based on their values. +* Write queries that combine values group by group. +* Explain what is displayed for *unaggregated* fields when some fields are aggregated. + +Duration: 10 minutes. + +### Lesson + +Gina now wants to calculate ranges and averages for her data. +She knows how to select all of the dates from the `Visited` table: + +``` {.sql} +select dated from Visited; +``` + +
+ +-------------------- +1927-02-08 +1927-02-10 +1939-01-07 +1930-01-12 +1930-02-26 + +1932-01-14 +1932-03-22 +-------------------- + +
+ +but to combine them, +she must use an [aggregation function](../gloss.html#aggregation-function) +such as `min` or `max`. +Each of these functions takes a set of records as input, +and produces a single record as output: + +``` {.sql} +select min(dated) from Visited; +``` + +
+ +-------------------- +1927-02-08 +-------------------- + +
+ +``` {.sql} +select max(dated) from Visited; +``` + +
+ +-------------------- +1939-01-07 +-------------------- + +
+ +`min` and `max` are just two of +the aggregation functions built into SQL. +Three others are `avg`, +`count`, +and `sum`: + +``` {.sql} +select avg(reading) from Survey where quant='sal'; +``` + +
+ +-------------------- +7.20333333333 +-------------------- + +
+ +``` {.sql} +select count(reading) from Survey where quant='sal'; +``` + +
+ +-------------------- +9 +-------------------- + +
+ +``` {.sql} +select sum(reading) from Survey where quant='sal'; +``` + +
+ +-------------------- +64.83 +-------------------- + +
+ +We used `count(reading)` here, +but we could just as easily have counted `quant` +or any other field in the table, +or even used `count(*)`, +since the function doesn't care about the values themselves, +just how many values there are. + +SQL lets us do several aggregations at once. +We can, +for example, +find the range of sensible salinity measurements: + +``` {.sql} +select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0; +``` + +
+ +-------------------- -------------------- +0.05 0.21 +-------------------- -------------------- + +
+ +We can also combine aggregated results with raw results, +although the output might surprise you: + +``` {.sql} +select person, count(*) from Survey where quant='sal' and reading<=1.0; +``` + +
+ +-------------------- -------------------- +lake 7 +-------------------- -------------------- + +
+ +Why does Lake's name appear rather than Roerich's or Dyer's? +The answer is that when it has to aggregate a field, +but isn't told how to, +the database manager chooses an actual value from the input set. +It might use the first one processed, +the last one, +or something else entirely. + +Another important fact is that when there are no values to aggregate, +aggregation's result is "don't know" +rather than zero or some other arbitrary value: + +``` {.sql} +select person, max(reading), sum(reading) from Survey where quant='missing'; +``` + +One final important feature of aggregation functions is that +they are inconsistent with the rest of SQL in a very useful way. +If we add two values, +and one of them is null, +the result is null. +By extension, +if we use `sum` to add all the values in a set, +and any of those values are null, +the result should also be null. +It's much more useful, +though, +for aggregation functions to ignore null values +and only combine those that are non-null. +This behavior lets us write our queries as: + +``` {.sql} +select min(dated) from Visited; +``` + +
+ +-------------------- +1927-02-08 +-------------------- + +
+ +instead of always having to filter explicitly: + +``` {.sql} +select min(dated) from Visited where dated is not null; +``` + +
+ +-------------------- +1927-02-08 +-------------------- + +
+ +### Summary {.keypoints} + +* Use aggregation functions like `sum` and `max` to combine query results. +* Use `count` function to count the number of results. +* If some fields are aggregated and others are not, the database manager chooses an arbitrary result for the unaggregated field. +* Most aggregation functions skip nulls when combining values. + +### Challenges {.challenges} + +* How many temperature readings did Frank Pabodie record, + and what was their average value? + +* The average of a set of values is the sum of the values + divided by the number of values. + Does this mean that the `avg` function returns 2.0 or 3.0 + when given the values 1.0, `null`, and 5.0? + +* Gina wants to calculate the difference between + each individual radiation reading + and the average of all the radiation readings. + She writes the query: + + ``` {.sql} + select reading-avg(reading) from Survey where quant='rad'; + ``` + + What does this actually produce, and why? + +* The function `group_concat(field, separator)` + concatenates all the values in a field + using the specified separator character + (or ',' if the separator isn't specified). + Use this to produce a one-line list of scientists' names, + such as: + + ``` {.sql} + William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth + ``` + + Can you find a way to order the list by surname? + +## Grouping {#s:grouping} + +### Learning Objectives {.objectives} + +* Group results to be aggregated separately. +* Explain when grouping occurs in the processing pipeline. + +Duration: 5 minutes. + +### Lesson + +Aggregating all records at once doesn't always make sense. +For example, +suppose Gina suspects that there is a systematic bias in her data, +and that some scientists' radiation readings are higher than others. +We know that this doesn't work: + +``` {.sql} +select person, count(reading), round(avg(reading), 2) +from Survey +where quant='rad'; +``` + +
+ +-------------------- -------------------- -------------------- +roe 8 6.56 +-------------------- -------------------- -------------------- + +
+ +because the database manager selects a single arbitrary scientist's name +rather than aggregating separately for each scientist. +Since there are only five scientists, +she could write five queries of the form: + +``` {.sql} +select person, count(reading), round(avg(reading), 2) +from Survey +where quant='rad' +and person='dyer'; +``` + +
+ +-------------------- -------------------- -------------------- +dyer 2 8.81 +-------------------- -------------------- -------------------- + +

but this would be tedious,
and if she ever had a data set with fifty or five hundred scientists,
the chances of her getting all of those queries right are small.

What we need to do is
tell the database manager to aggregate the readings for each scientist separately
using a `group by` clause:

``` {.sql}
select person, count(reading), round(avg(reading), 2)
from Survey
where quant='rad'
group by person;
```
+ +-------------------- -------------------- -------------------- +dyer 2 8.81 +lake 2 1.82 +pb 3 6.66 +roe 1 11.25 +-------------------- -------------------- -------------------- + +
+ +`group by` does exactly what its name implies: +groups all the records with the same value for the specified field together +so that aggregation can process each batch separately. +Since all the records in each batch have the same value for `person`, +it no longer matters that the database manager +is picking an arbitrary one to display +alongside the aggregated `reading` values +([Figure 6](#f:grouped_aggregation)). + +
+ Grouped Aggregation +
Figure 6: Grouped Aggregation
+
+ +Just as we can sort by multiple criteria at once, +we can also group by multiple criteria. +To get the average reading by scientist and quantity measured, +for example, +we just add another field to the `group by` clause: + +``` {.sql} +select person, quant, count(reading), round(avg(reading), 2) +from Survey +group by person, quant; +``` + +
+ +-------------------- -------------------- -------------------- -------------------- + sal 1 0.06 + temp 1 -26.0 +dyer rad 2 8.81 +dyer sal 2 0.11 +lake rad 2 1.82 +lake sal 4 0.11 +lake temp 1 -16.0 +pb rad 3 6.66 +pb temp 2 -20.0 +roe rad 1 11.25 +roe sal 2 32.05 +-------------------- -------------------- -------------------- -------------------- + +

Note that we have added `quant` to the list of fields displayed,
since the results wouldn't make much sense otherwise.

Let's go one step further and remove all the entries
where we don't know who took the measurement:

``` {.sql}
select person, quant, count(reading), round(avg(reading), 2)
from Survey
where person is not null
group by person, quant
order by person, quant;
```
+ +-------------------- -------------------- -------------------- -------------------- +dyer rad 2 8.81 +dyer sal 2 0.11 +lake rad 2 1.82 +lake sal 4 0.11 +lake temp 1 -16.0 +pb rad 3 6.66 +pb temp 2 -20.0 +roe rad 1 11.25 +roe sal 2 32.05 +-------------------- -------------------- -------------------- -------------------- + +
+ +Looking more closely, +this query: + +* selected records from the `Survey` table + where the `person` field was not null; + +* grouped those records into subsets + so that the `person` and `quant` values in each subset + were the same; + +* ordered those subsets first by `person`, + and then within each sub-group by `quant`; + and + +* counted the number of records in each subset, + calculated the average `reading` in each, + and chose a `person` and `quant` value from each + (it doesn't matter which ones, + since they're all equal). + +Our query processing pipeline now looks like +[Figure 7](#f:pipeline_grouping). + +
+ Query Processing Pipeline With Grouping +
Figure 7: Query Processing Pipeline With Grouping
+
+ +### Summary {.keypoints} + +* Use `group by` to group values for separate aggregation. + +### Challenges {.challenges} + +* Write a single query that finds the earliest and latest date + that each site was visited. + +* Show the records produced by each stage of + [Figure 7](#f:pipeline_grouping) + for the following query: + + ``` {.sql} + select min(reading), max(reading) from Survey + where taken in (734, 735) + and quant='temp' + group by taken, quant; + ``` + +* How can the query in the previous challenge be simplified + without changing its result? + +## Combining Data {#s:join} + +### Learning Objectives {.objectives} + +* Explain what primary keys and foreign keys are. +* Write queries that combine information from two or more tables by matching keys. +* Write queries using aliases for table names. +* Explain why the `tablename.fieldname` notation is needed when tables are joined. +* Explain the logical sequence of operations that occurs when two or more tables are joined. + +Duration: 20 minutes (and expect to have to walk through an example step-by-step). + +### Lesson + +In order to submit her data to a web site +that aggregates historical meteorological data, +Gina needs to format it as +latitude, longitude, date, quantity, and reading. +However, +her latitudes and longitudes are in the `Site` table, +while the dates of measurements are in the `Visited` table +and the readings themselves are in the `Survey` table. +She needs to combine these tables somehow. + +The SQL command to do this is `join`. +To see how it works, +let's start by joining the `Site` and `Visited` tables: + +``` {.sql} +select * from Site join Visited; +``` + +
+ +-------------------- -------------------- -------------------- -------------------- -------------------- -------------------- +DR-1 -49.85 -128.57 619 DR-1 1927-02-08 +DR-1 -49.85 -128.57 622 DR-1 1927-02-10 +DR-1 -49.85 -128.57 734 DR-3 1939-01-07 +DR-1 -49.85 -128.57 735 DR-3 1930-01-12 +DR-1 -49.85 -128.57 751 DR-3 1930-02-26 +DR-1 -49.85 -128.57 752 DR-3 +DR-1 -49.85 -128.57 837 MS-4 1932-01-14 +DR-1 -49.85 -128.57 844 DR-1 1932-03-22 +DR-3 -47.15 -126.72 619 DR-1 1927-02-08 +DR-3 -47.15 -126.72 622 DR-1 1927-02-10 +DR-3 -47.15 -126.72 734 DR-3 1939-01-07 +DR-3 -47.15 -126.72 735 DR-3 1930-01-12 +DR-3 -47.15 -126.72 751 DR-3 1930-02-26 +DR-3 -47.15 -126.72 752 DR-3 +DR-3 -47.15 -126.72 837 MS-4 1932-01-14 +DR-3 -47.15 -126.72 844 DR-1 1932-03-22 +MS-4 -48.87 -123.4 619 DR-1 1927-02-08 +MS-4 -48.87 -123.4 622 DR-1 1927-02-10 +MS-4 -48.87 -123.4 734 DR-3 1939-01-07 +MS-4 -48.87 -123.4 735 DR-3 1930-01-12 +MS-4 -48.87 -123.4 751 DR-3 1930-02-26 +MS-4 -48.87 -123.4 752 DR-3 +MS-4 -48.87 -123.4 837 MS-4 1932-01-14 +MS-4 -48.87 -123.4 844 DR-1 1932-03-22 +-------------------- -------------------- -------------------- -------------------- -------------------- -------------------- + +
+ +`join` creates +the [cross product](../gloss.html#cross-product) +of two tables, +i.e., +it joins each record of one with each record of the other +to give all possible combinations. +Since there are three records in `Site` +and eight in `Visited`, +the join's output has 24 records. +And since each table has three fields, +the output has six fields. + +What the join *hasn't* done is +figure out if the records being joined have anything to do with each other. +It has no way of knowing whether they do or not until we tell it how. +To do that, +we add a clause specifying that +we're only interested in combinations that have the same site name: + +``` {.sql} +select * from Site join Visited on Site.name=Visited.site; +``` + +
+ +-------------------- -------------------- -------------------- -------------------- -------------------- -------------------- +DR-1 -49.85 -128.57 619 DR-1 1927-02-08 +DR-1 -49.85 -128.57 622 DR-1 1927-02-10 +DR-1 -49.85 -128.57 844 DR-1 1932-03-22 +DR-3 -47.15 -126.72 734 DR-3 1939-01-07 +DR-3 -47.15 -126.72 735 DR-3 1930-01-12 +DR-3 -47.15 -126.72 751 DR-3 1930-02-26 +DR-3 -47.15 -126.72 752 DR-3 +MS-4 -48.87 -123.4 837 MS-4 1932-01-14 +-------------------- -------------------- -------------------- -------------------- -------------------- -------------------- + +

`on` does the same job as `where`:
it only keeps records that pass some test.
(The difference between the two is that `on` filters records
as they're being created,
while `where` waits until the join is done
and then does the filtering.)
Once we add this to our query,
the database manager throws away records
that combined information about two different sites,
leaving us with just the ones we want.

Notice that we used `table.field` to specify field names
in the output of the join.
We do this because tables can have fields with the same name,
and we need to be specific about which ones we're talking about.
For example,
if we joined the `Person` and `Visited` tables,
the result would inherit a field called `ident`
from each of the original tables.

We can now use the same dotted notation
to select the three columns we actually want
out of our join:

``` {.sql}
select Site.lat, Site.long, Visited.dated
from Site join Visited
on Site.name=Visited.site;
```
+ +-------------------- -------------------- -------------------- +-49.85 -128.57 1927-02-08 +-49.85 -128.57 1927-02-10 +-49.85 -128.57 1932-03-22 +-47.15 -126.72 +-47.15 -126.72 1930-01-12 +-47.15 -126.72 1930-02-26 +-47.15 -126.72 1939-01-07 +-48.87 -123.4 1932-01-14 +-------------------- -------------------- -------------------- + +
+ +If joining two tables is good, +joining many tables must be better. +In fact, +we can join any number of tables +simply by adding more `join` clauses to our query, +and more `on` tests to filter out combinations of records +that don't make sense: + +``` {.sql} +select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading +from Site join Visited join Survey +on Site.name=Visited.site +and Visited.ident=Survey.taken +and Visited.dated is not null; +``` + +
+ +-------------------- -------------------- -------------------- -------------------- -------------------- +-49.85 -128.57 1927-02-08 rad 9.82 +-49.85 -128.57 1927-02-08 sal 0.13 +-49.85 -128.57 1927-02-10 rad 7.8 +-49.85 -128.57 1927-02-10 sal 0.09 +-47.15 -126.72 1939-01-07 rad 8.41 +-47.15 -126.72 1939-01-07 sal 0.05 +-47.15 -126.72 1939-01-07 temp -21.5 +-47.15 -126.72 1930-01-12 rad 7.22 +-47.15 -126.72 1930-01-12 sal 0.06 +-47.15 -126.72 1930-01-12 temp -26.0 +-47.15 -126.72 1930-02-26 rad 4.35 +-47.15 -126.72 1930-02-26 sal 0.1 +-47.15 -126.72 1930-02-26 temp -18.5 +-48.87 -123.4 1932-01-14 rad 1.46 +-48.87 -123.4 1932-01-14 sal 0.21 +-48.87 -123.4 1932-01-14 sal 22.5 +-49.85 -128.57 1932-03-22 rad 11.25 +-------------------- -------------------- -------------------- -------------------- -------------------- + +
+ + +We can tell which records from `Site`, `Visited`, and `Survey` +correspond with each other +because those tables contain +[primary keys](../gloss.html#primary-key) +and [foreign keys](../gloss.html#foreign-key). +A primary key is a value, +or combination of values, +that uniquely identifies each record in a table. +A foreign key is a value (or combination of values) from one table +that identifies a unique record in another table. +Another way of saying this is that +a foreign key is the primary key of one table +that appears in some other table. +In our database, +`Person.ident` is the primary key in the `Person` table, +while `Survey.person` is a foreign key +relating the `Survey` table's entries +to entries in `Person`. + +Most database designers believe that +every table should have a well-defined primary key. +They also believe that this key should be separate from the data itself, +so that if we ever need to change the data, +we only need to make one change in one place. +One easy way to do this is +to create an arbitrary, unique ID for each record +as we add it to the database. +This is actually very common: +those IDs have names like "student numbers" and "patient numbers", +and they almost always turn out to have originally been +a unique record identifier in some database system or other. +As the query below demonstrates, +SQLite automatically numbers records as they're added to tables, +and we can use those record numbers in queries: + +``` {.sql} +select rowid, * from Person; +``` + +
+ +-------------------- -------------------- -------------------- -------------------- +1 dyer William Dyer +2 pb Frank Pabodie +3 lake Anderson Lake +4 roe Valentina Roerich +5 danforth Frank Danforth +-------------------- -------------------- -------------------- -------------------- + +
+ +### Summary {.keypoints} + +* Use `join` to create all possible combinations of records from two or more tables. +* Use `join *tables* on *test*` to keep only those combinations that pass some test. +* Use `*table*.*field*` to specify a particular field of a particular table. +* Every record in a table should be uniquely identified by the value of its primary key. + +### Challenges {.challenges} + +* Write a query that lists all radiation readings from the DR-1 site. + +* Write a query that lists all sites visited by people named "Frank". + +* Describe in your own words what the following query produces: + + ``` {.sql} + select Site.name from Site join Visited + on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00'; + ``` + +* Why does the `Person` table have an `ident` field? + Why do we not just use scientists' names in the `Survey` table? + +* Why does the table `Site` exist? + Why didn't Gina just record latitudes and longitudes + directly in the `Visited` and `Survey` tables? + +## Creating and Modifying Tables {#s:create} + +### Learning Objectives {.box} + +* Write queries that create database tables with fields of common types. +* Write queries that specify the primary and foreign key relationships of tables. +* Write queries that specify whether field values must be unique and/or are allowed to be `null`. +* Write queries that erase database tables. +* Write queries that add records to database tables. +* Write queries that delete specific records from tables. +* Explain what referential integrity is, and how a database can become inconsistent as data is changed. + +Duration: 10 minutes. + +### Lesson + +So far we have only looked at how to get information out of a database, +both because that is more frequent than adding information, +and because most other operations only make sense +once queries are understood. +If we want to create and modify data, +we need to know two other pairs of commands. + +The first pair are `create table` and `drop table`. +While they are written as two words, +they are actually single commands. +The first one creates a new table; +its arguments are the names and types of the table's columns. +For example, +the following statements create the four tables in our survey database: + + create table Person(ident text, personal text, family text); + create table Site(name text, lat real, long real); + create table Visited(ident integer, site text, dated text); + create table Survey(taken integer, person text, quant real, reading real); + +We can get rid of one of our tables using: + + drop table Survey; + +Be very careful when doing this: +most databases have some support for undoing changes, +but it's better not to have to rely on it. + +Different database systems support different data types for table columns, +but most provide the following: + +
+ +-------------------- -------------------- +`integer` A signed integer. +`real` A floating point value. +`text` A string. +`blob` Any "binary large object" such as an image or audio file. +-------------------- -------------------- + +

Most databases also support Booleans and date/time values;
SQLite uses the integers 0 and 1 for the former,
and represents the latter as discussed [earlier](#a:dates).
An increasing number of databases also support geographic data types,
such as latitude and longitude.
Keeping track of what particular systems do or do not offer,
and what names they give different data types,
is an unending portability headache.

When we create a table,
we can specify several kinds of constraints on its columns.
For example,
a better definition for the `Survey` table would be:

    create table Survey(
        taken   integer not null, -- where reading taken
        person  text,             -- may not know who took it
        quant   real not null,    -- the quantity measured
        reading real not null,    -- the actual reading
        primary key(taken, quant),
        foreign key(taken) references Visited(ident),
        foreign key(person) references Person(ident)
    );

Once again,
exactly what constraints are available
and what they're called
depends on which database manager we are using.

Once tables have been created,
we can add and remove records using our other pair of commands,
`insert` and `delete`.
The simplest form of `insert` statement lists values in order:

    insert into Site values('DR-1', -49.85, -128.57);
    insert into Site values('DR-3', -47.15, -126.72);
    insert into Site values('MSK-4', -48.87, -123.40);

We can also insert values into one table directly from another:

    create table JustLatLong(lat text, long text);
    insert into JustLatLong select lat, long from Site;

Deleting records can be a bit trickier,
because we have to ensure that the database remains internally consistent.
If all we care about is a single table,
we can use the `delete` command with a `where` clause
that matches the records we want to discard.
For example,
once we realize that Frank Danforth didn't take any measurements,
we can remove him from the `Person` table like this:

    delete from Person where ident='danforth';

But what if we removed Anderson Lake instead?
Our `Survey` table would still contain seven records
of measurements he'd taken:

``` {.sql}
select count(*) from Survey where person='lake';
```
+ +-------------------- +7 +-------------------- + +
+ +That's never supposed to happen: +`Survey.person` is a foreign key into the `Person` table, +and all our queries assume there will be a row in the latter +matching every value in the former. + +This problem is called [referential integrity](../gloss.html#referential-integrity): +we need to ensure that all references between tables can always be resolved correctly. +One way to do this is to delete all the records +that use `'lake'` as a foreign key +before deleting the record that uses it as a primary key. +If our database manager supports it, +we can automate this +using [cascading delete](../gloss.html#cascading-delete). +However, +this technique is outside the scope of this chapter. + +> ### Other Ways to Do It {#a:hybrid .box} +> +> Many applications use a hybrid storage model +> instead of putting everything into a database: +> the actual data (such as astronomical images) is stored in files, +> while the database stores the files' names, +> their modification dates, +> the region of the sky they cover, +> their spectral characteristics, +> and so on. +> This is also how most music player software is built: +> the database inside the application keeps track of the MP3 files, +> but the files themselves live on disk. + +### Summary {.keypoints} + +* Use `create table *name*(...)` to create a table. +* Use `drop table *name*` to erase a table. +* Specify field names and types when creating tables. +* Specify `primary key`, `foreign key`, `not null`, and other constraints when creating tables. +* Use `insert into *table* values(...)` to add records to a table. +* Use `delete from *table* where *test*` to erase records from a table. +* Maintain referential integrity when creating or deleting information. + +### Challenges {.challenges} + +* Write an SQL statement to replace all uses of `null` + in `Survey.person` + with the string `'unknown'`. + +* One of Gina's colleagues has sent her a [CSV](../gloss.html#csv) file + containing temperature readings by Robert Olmstead, + which is formatted like this: + + ``` + Taken,Temp + 619,-21.5 + 622,-15.5 + ``` + + Write a small Python program that reads this file in + and prints out the SQL `insert` statements needed + to add these records to the survey database. + Note: you will need to add an entry for Olmstead + to the `Person` table. + If you are testing your program repeatedly, + you may want to investigate SQL's `insert or replace` command. + +* SQLite has several administrative commands that aren't part of the SQL standard. + One of them is `.dump`, + which prints the SQL commands needed to re-create the database. + Another is `.load`, + which reads a file created by `.dump` and restores the database. + A colleague of yours thinks that storing dump files (which are text) in version control + is a good way to track and manage changes to the database. + What are the pros and cons of this approach? + +

## Transactions

### Learning Objectives {.objectives}

* Explain what a race condition is.
* Explain why database operations sometimes have to be placed in a transaction to ensure correct behavior.
* Explain what it means to commit a transaction.

Duration: 10 minutes.

### Lesson

Suppose we have another table in our database that shows
which pieces of equipment have been borrowed by which scientists:

``` {.sql}
select * from Equipment;
```
+ +-------------------- -------------------- +dyer CX-211 oscilloscope +pb Greenworth balance +lake Cavorite damping plates +-------------------- -------------------- + +

(We should actually give each piece of equipment a unique ID,
and use that ID here instead of the full name,
just as we created a separate table for scientists earlier in this chapter,
but we will bend the rules for now.)
If William Dyer gives the oscilloscope to Valentina Roerich,
we need to execute two statements to update this table:

    delete from Equipment where person='dyer' and thing='CX-211 oscilloscope';
    insert into Equipment values('roe', 'CX-211 oscilloscope');

This is all fine—unless our program happens to crash
between the first statement and the second.
If that happens,
the `Equipment` table won't have a record for the oscilloscope at all.
Such a crash may seem unlikely,
but remember:
if a computer can do two billion operations per second,
that means there are two billion opportunities every second for something to go wrong.
And if our operations take a long time to complete—as they will
when we are working with large datasets,
or when the database is being heavily used—the odds of failure increase.

What we really want is a way to ensure that every operation is [ACID](../gloss.html#acid):
[atomic](../gloss.html#atomic-operation) (i.e., indivisible),
consistent, isolated, and durable.
The precise meanings of these terms don't matter;
what does matter is the notion that
every logical operation on the database should either run to completion
as if nothing else was going on at the same time,
or fail without having any effect at all.

The tool we use to ensure that this happens is called
a [transaction](../gloss.html#transaction).
Here's how we should actually write the statements
to move the oscilloscope from one person to another:

    begin transaction;
    delete from Equipment where person='dyer' and thing='CX-211 oscilloscope';
    insert into Equipment values('roe', 'CX-211 oscilloscope');
    end transaction;

The database manager treats everything in the transaction as one large statement.
If anything goes wrong inside,
then none of the changes made in the transaction will actually be written to the database—it
will be as if the transaction had never happened.
Changes are only stored permanently
when we [commit](../gloss.html#commit) them at the end of the transaction.

> ### Transactions and Commits {.box}
>
> We first used the term "transaction" in
> [our discussion of version control](svn.html#b:basics:transaction).
> That's not a coincidence:
> behind the scenes,
> tools like Subversion are using many of the same algorithms as database managers
> to ensure that either everything happens consistently
> or nothing happens at all.
> We [use the term "commit"](svn.html#a:commit) for the same reason:
> just as our changes to local files aren't written back to the version control repository
> until we commit them,
> our (apparent) changes to a database aren't written to disk
> until we say so.

Transactions serve another purpose as well.
Suppose there is another table in the database called `Exposure`
that records the number of days each scientist was exposed to
higher-than-normal levels of radiation:

``` {.sql}
select * from Exposure;
```
+ +-------------------- -------------------- +pb 4 +dyer 1 +lake 5 +-------------------- -------------------- + +

After going through the journal entries for 1932,
Gina wants to add two days to Lake's count:

``` {.sql}
update Exposure set days = days + 2 where person='lake';
```

However,
her labmate has been going through the journal entries for 1933
to help Gina meet a paper deadline.
At the same moment as Gina runs her command,
her labmate runs this
to add one more day to Lake's exposure:

``` {.sql}
update Exposure set days = days + 1 where person='lake';
```

After both operations have completed,
the database should show that Lake was exposed for eight days
(the original five, plus two from Gina, plus one from her labmate).
However,
there is a small chance that it won't.
To see why,
let's break the two queries into their respective read and write steps
and place them side by side:

--------------------------------  --------------------------------
`X = read Exposure('lake', __)`   `Y = read Exposure('lake', __)`
`write Exposure('lake', X+2)`     `write Exposure('lake', Y+1)`
--------------------------------  --------------------------------

The database can only actually do one thing at once,
so it must put these four operations into some sequential order.
That order has to respect the original order within each column,
but the database can interleave the two columns any way it wants.
If it orders them like this:

--------------------------------  --------------------------------
`X = read Exposure('lake', __)`   `X` is 5
`write Exposure('lake', X+2)`     database contains 7
`Y = read Exposure('lake', __)`   `Y` is 7
`write Exposure('lake', Y+1)`     database contains 8
--------------------------------  --------------------------------

then all is well.
But what if it interleaves the operations like this:

--------------------------------  --------------------------------
`X = read Exposure('lake', __)`   `X` is 5
`Y = read Exposure('lake', __)`   `Y` is 5
`write Exposure('lake', X+2)`     database contains 7
`write Exposure('lake', Y+1)`     database contains 6
--------------------------------  --------------------------------

This ordering puts the initial value, 5, into both `X` and `Y`.
It then writes 7 back to the database (the third statement),
and then overwrites that with 6,
since `Y` holds 5.

This is called a [race condition](../gloss.html#race-condition),
since the final result depends on a race between the two operations.
Race conditions are part of what makes programming large systems with many components a nightmare:
they are difficult to spot in advance
(since they are caused by the interactions between components,
rather than by anything in any one of those components),
and can be almost impossible to debug
(since they usually occur intermittently and infrequently).

Transactions come to our rescue once again.
If Gina and her labmate put their statements in transactions,
the database will act as if it executed all of one and then all of the other.
Whether or not it *actually* does this is up to whoever wrote the database manager:
modern databases use very sophisticated algorithms to determine
which operations actually have to be run sequentially,
and which can safely be run in parallel to improve performance.
The key thing is that
every transaction will appear to have had the entire database to itself.

### Summary {.keypoints}

* Place operations in a transaction to ensure that they appear to be atomic, consistent, isolated, and durable.

### Challenges {.challenges}

* A friend of yours manages a database of aerial photographs.
  New records are added all the time,
  but existing records are never modified or updated.
+ Your friend claims that because of this, + he doesn't need to put his queries in transactions. + Is he right or wrong, and why? + +## Programming With Databases {#s:programming} + +### Learning Objectives {.objectives} + +* Write a Python program that queries a database and processes the results. +* Explain what an SQL injection attack is. +* Write a program that safely interpolates values into queries. + +Duration: 20 minutes. + +### Lesson + +To end this chapter, +let's have a look at how to access a database from +a general-purpose programming language like Python. +Other languages use almost exactly the same model: +library and function names may differ, +but the concepts are the same. + +Here's a short Python program that selects latitudes and longitudes +from an SQLite database stored in a file called `survey.db`: + +``` {.python} +import sqlite3 +connection = sqlite3.connect("survey.db") +cursor = connection.cursor() +cursor.execute("select site.lat, site.long from site;") +results = cursor.fetchall() +for r in results: + print r +cursor.close() +connection.close() +``` + +The program starts by importing the `sqlite3` library. +If we were connecting to MySQL, DB2, or some other database, +we would import a different library, +but all of them provide the same functions, +so that the rest of our program does not have to change +(at least, not much) +if we switch from one database to another. + +Line 2 establishes a connection to the database. +Since we're using SQLite, +all we need to specify is the name of the database file. +Other systems may require us to provide a username and password as well. +Line 3 then uses this connection to create +a [cursor](../gloss.html#cursor); +just like the cursor in an editor, +its role is to keep track of where we are in the database. + +On line 4, we use that cursor to ask the database to execute a query for us. +The query is written in SQL, +and passed to `cursor.execute` as a string. +It's our job to make sure that SQL is properly formatted; +if it isn't, +or if something goes wrong when it is being executed, +the database will report an error. + +The database returns the results of the query to us +in response to the `cursor.fetchall` call on line 5. +This result is a list with one entry for each record in the result set; +if we loop over that list (line 6) and print those list entries (line 7), +we can see that each one is a tuple +with one element for each field we asked for. + +Finally, lines 8 and 9 close our cursor and our connection, +since the database can only keep a limited number of these open at one time. +Since establishing a connection takes time, +though, +we shouldn't open a connection, +do one operation, +then close the connection, +only to reopen it a few microseconds later to do another operation. +Instead, +it's normal to create one connection that stays open for the lifetime of the program. + +> ### What Are The u's For? {.box} +> +> You may have noticed that +> each of the strings in our output has a lower-case 'u' in front of it. +> That is Python's way of telling us that the string is stored in +> [Unicode](../gloss.html#unicode). + +Queries in real applications will often depend on values provided by users. 

For example,
a program might take a user ID as a command-line parameter
and display the user's full name:

``` {.python}
import sys
import sqlite3

query = "select personal, family from Person where ident='%s';"
user_id = sys.argv[1]

connection = sqlite3.connect("survey.db")
cursor = connection.cursor()

cursor.execute(query % user_id)
results = cursor.fetchall()
print results[0][0], results[0][1]

cursor.close()
connection.close()
```

The variable `query` holds the statement we want to execute
with a `%s` format string where we want to insert
the ID of the person we're looking up.
It seems simple enough,
but what happens if someone gives the program this input?

    dyer'; drop table Survey; select '

It looks like there's garbage after the user ID,
but it is very carefully chosen garbage.
If we insert this string into our query,
the result is:

``` {.sql}
select personal, family from Person where ident='dyer'; drop table Survey; select '';
```

Whoops:
if we execute this,
it will erase one of the tables in our database.

This technique is called [SQL injection](../gloss.html#sql-injection),
and it has been used to attack thousands of programs over the years.
In particular,
many web sites that take data from users insert values directly into queries
without checking them carefully first.

Since a villain might try to smuggle commands into our queries in many different ways,
the safest way to deal with this threat is
to replace characters like quotes with their escaped equivalents,
so that we can safely put whatever the user gives us inside a string.
We can do this by using a [prepared statement](../gloss.html#prepared-statement)
instead of formatting our statements as strings.
Here's what our example program looks like if we do this:

``` {.python}
import sys
import sqlite3

query = "select personal, family from Person where ident=?;"
user_id = sys.argv[1]

connection = sqlite3.connect("survey.db")
cursor = connection.cursor()

cursor.execute(query, [user_id])
results = cursor.fetchall()
print results[0][0], results[0][1]

cursor.close()
connection.close()
```

The key changes are in the query string and the `execute` call.
Instead of formatting the query ourselves,
we put question marks in the query template where we want to insert values.
When we call `execute`,
we provide a list
that contains as many values as there are question marks in the query.
The library matches values to question marks in order,
and translates any special characters in the values
into their escaped equivalents
so that they are safe to use.

### Summary {.keypoints}

* Most applications that use databases embed SQL in a general-purpose programming language.
* Database libraries use connections and cursors to manage interactions.
* Programs can fetch all results at once, or a few results at a time.
* If queries are constructed dynamically using input from users, malicious users may be able to inject their own commands into the queries.
* Use prepared statements (with `?` placeholders) rather than string formatting to keep user-supplied values from being interpreted as SQL.

### Challenges {.challenges}

* Write a Python program that creates a new database
  in a file called `original.db`
  containing a single table called `Pressure`,
  with a single field called `reading`,
  and inserts 100,000 random numbers between 10.0 and 25.0.
  How long does it take this program to run?
+ How long does it take to run a program + that simply writes those random numbers to a file? + +* Write a Python program that creates a new database + called `backup.db` + with the same structure as `original.db` + and copies all the values greater than 20.0 + from `original.db` to `backup.db`. + Which is faster: + filtering values in the query, + or reading everything into memory and filtering in Python? + +## Summing Up {#s:summary} + +There are many things databases can't do, +or can't do well +(which is why we have general-purpose programming languages like Python). +However, +they are still the best tool available +for managing many kinds of complex, structured data. +Thousands of programmer-years have gone into their design and implementation +so that they can handle very large datasets—terabytes or more—quickly and reliably. +Queries allow for great flexibility in how you are able to analyze your data, +which makes databases a good choice when you are exploring data. diff --git a/sql/novice/sqlitemagic.py b/sql/novice/sqlitemagic.py new file mode 100644 index 0000000..4a1e83d --- /dev/null +++ b/sql/novice/sqlitemagic.py @@ -0,0 +1,49 @@ +"""sqlitemagic provices a simple magic for interacting with SQLite +databases stored on disk. + +Usage: + +%%sqlite filename.db +select personal, family from person; + +produces: + +Alan|Turing +Grace|Hopper +""" + +# This file is copyright 2013 by Greg Wilson: see +# https://github.com/gvwilson/sqlitemagic/blob/master/LICENSE +# for the license. +# Inspired by https://github.com/tkf/ipython-sqlitemagic. + +import sqlite3 +from IPython.core.magic import Magics, magics_class, cell_magic +from IPython.display import display, HTML + +@magics_class +class SqliteMagic(Magics): + '''Provide the 'sqlite' calling point.''' + + @cell_magic + def sqlite(self, filename, query): + connection = sqlite3.connect(filename) + cursor = connection.cursor() + try: + cursor.execute(query) + results = cursor.fetchall() + display(HTML(self.tablify(results))) + except Exception, e: + import sys + print >> sys.stderr, "exception", e + cursor.close() + connection.close() + + def tablify(self, rows): + return '\n' + '\n'.join(self.rowify(r) for r in rows) + '\n
' + + def rowify(self, row): + return '' + ''.join('' + str(r) + '' for r in row) + '' + +def load_ipython_extension(ipython): + ipython.register_magics(SqliteMagic) diff --git a/sql/novice/survey.db b/sql/novice/survey.db new file mode 100644 index 0000000000000000000000000000000000000000..a8abbf4ca52019803a2c8d063521d811810f59e8 GIT binary patch literal 20480 zcmeI2PiP!<6vyXpCRv;LCE5AyHmqrV~H?eEdkR;2}gl$WbG!B;zS7sV4*p-=_E&k&^E;)ZZ zH3$xx`H@7vBD3Z_SfC*RB!C2v01`j~NB{{S0VIF~kih?zKua@?39j{c5)T&Oo z=2WZH(3IjHwa)u@Ay=Hrfm|OrOUfqe-7TFxsBEw`c$A%T)u|q`G_J1SSK+xT4F8$$ zTB;s#$`z-4*iyI}feNXshgpP5*^;?B->a6NW&J-jpO(lS@-tZ>Uy_eWOwN)zsgS*7 z(EP_-Gq0P=<`?FY*)m@?p95DkB!C2v01`j~NB{{S0VIF~kN^^Rd+OR>+TR zP=5Ie-JvVP)KE4$O%}IX6?I!rfSl(t3o<;%ZAVU5Bq(E@x!xWbfF1a(v-Ha>$*vWv zUp%+oX|m^@6sM(zNV1kDIhexf6E&@6HGL;)vUCvy+*uI&7Z!QCpb77YeYf9xVzozS e{=50WzgAF$KA-t_VaR{9^G!kS(V72Vq5lifRe#3- literal 0 HcmV?d00001 diff --git a/sql/novice/survey.sql b/sql/novice/survey.sql new file mode 100644 index 0000000..a1493a2 --- /dev/null +++ b/sql/novice/survey.sql @@ -0,0 +1,370 @@ +-- The `Person` table is used to explain the most basic queries. +-- Note that `danforth` has no measurements. +create table Person( + ident text, + personal text, + family text +); + +insert into Person values('dyer', 'William', 'Dyer'); +insert into Person values('pb', 'Frank', 'Pabodie'); +insert into Person values('lake', 'Anderson', 'Lake'); +insert into Person values('roe', 'Valentina', 'Roerich'); +insert into Person values('danforth', 'Frank', 'Danforth'); + +-- The `Site` table is equally simple. Use it to explain the +-- difference between databases and spreadsheets: in a spreadsheet, +-- the lat/long of measurements would probably be duplicated. +create table Site( + name text, + lat real, + long real +); + +insert into Site values('DR-1', -49.85, -128.57); +insert into Site values('DR-3', -47.15, -126.72); +insert into Site values('MSK-4', -48.87, -123.40); + +-- `Visited` is an enhanced `join` table: it connects to the lat/long +-- of specific measurements, and also provides their dates. +-- Note that #752 is missing a date; we use this to talk about NULL. +create table Visited( + ident integer, + site text, + dated text +); + +insert into Visited values(619, 'DR-1', '1927-02-08'); +insert into Visited values(622, 'DR-1', '1927-02-10'); +insert into Visited values(734, 'DR-3', '1939-01-07'); +insert into Visited values(735, 'DR-3', '1930-01-12'); +insert into Visited values(751, 'DR-3', '1930-02-26'); +insert into Visited values(752, 'DR-3', null); +insert into Visited values(837, 'MSK-4', '1932-01-14'); +insert into Visited values(844, 'DR-1', '1932-03-22'); + +-- The `Survey` table is the actual readings. Join it with `Site` to +-- get lat/long, and with `Visited` to get dates (except for #752). +-- Note that Roerich's salinity measurements are an order of magnitude +-- too large (use this to talk about data cleanup). Note also that +-- there are two cases where we don't know who took the measurement, +-- and that in most cases we don't have an entry (null or not) for the +-- temperature. 
+create table Survey(
+    taken integer,
+    person text,
+    quant text,
+    reading real
+);
+
+insert into Survey values(619, 'dyer', 'rad', 9.82);
+insert into Survey values(619, 'dyer', 'sal', 0.13);
+insert into Survey values(622, 'dyer', 'rad', 7.80);
+insert into Survey values(622, 'dyer', 'sal', 0.09);
+insert into Survey values(734, 'pb', 'rad', 8.41);
+insert into Survey values(734, 'lake', 'sal', 0.05);
+insert into Survey values(734, 'pb', 'temp', -21.50);
+insert into Survey values(735, 'pb', 'rad', 7.22);
+insert into Survey values(735, null, 'sal', 0.06);
+insert into Survey values(735, null, 'temp', -26.00);
+insert into Survey values(751, 'pb', 'rad', 4.35);
+insert into Survey values(751, 'pb', 'temp', -18.50);
+insert into Survey values(751, 'lake', 'sal', 0.10);
+insert into Survey values(752, 'lake', 'rad', 2.19);
+insert into Survey values(752, 'lake', 'sal', 0.09);
+insert into Survey values(752, 'lake', 'temp', -16.00);
+insert into Survey values(752, 'roe', 'sal', 41.60);
+insert into Survey values(837, 'lake', 'rad', 1.46);
+insert into Survey values(837, 'lake', 'sal', 0.21);
+insert into Survey values(837, 'roe', 'sal', 22.50);
+insert into Survey values(844, 'roe', 'rad', 11.25);
+
+select '========================================';
+select 'Selecting';
+
+select '----------------------------------------';
+select 'get scientist names';
+select family, personal from Person;
+
+select '----------------------------------------';
+select 'commands are case insensitive';
+SeLeCt famILY, PERSonal frOM PERson;
+
+select '----------------------------------------';
+select 'we control column order';
+select personal, family from Person;
+
+select '----------------------------------------';
+select 'repeat columns';
+select ident, ident, ident from Person;
+
+select '----------------------------------------';
+select 'use * for wildcard';
+select * from Person;
+
+select '========================================';
+select 'Removing Duplicates';
+
+select '----------------------------------------';
+select 'show quantities in survey table';
+select quant from Survey;
+
+select '----------------------------------------';
+select 'unique quantity names';
+select distinct quant from Survey;
+
+select '----------------------------------------';
+select 'tuple uniqueness';
+select distinct taken, quant from Survey;
+
+select '========================================';
+select 'Filtering';
+
+select '----------------------------------------';
+select 'when a particular site was visited';
+select * from Visited where site='DR-1';
+
+select '----------------------------------------';
+select 'selected columns need not appear in "where"';
+select ident from Visited where site='DR-1';
+
+select '----------------------------------------';
+select 'when a particular site was visited after 1930';
+select * from Visited where (site='DR-1') and (dated>='1930-00-00');
+
+select '----------------------------------------';
+select 'using "or" instead of "and"';
+select * from Survey where person='lake' or person='roe';
+
+select '----------------------------------------';
+select 'using "in" instead of "or"';
+select * from Survey where person in ('lake', 'roe');
+
+select '----------------------------------------';
+select 'with parentheses';
+select * from Survey where quant='sal' and (person='lake' or person='roe');
+
+select '----------------------------------------';
+select 'using distinct with "or"';
+select distinct person, quant from Survey where person='lake' or person='roe';
+
+select 
'========================================'; +select 'Calculating New Values'; + +select '----------------------------------------'; +select 'correct radiation readings'; +select 1.05 * reading from Survey where quant='rad'; + +select '----------------------------------------'; +select 'convert temperatures to Celsius'; +select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp'; + +select '----------------------------------------'; +select 'formatting names'; +select personal || ' ' || family from Person; + +select '========================================'; +select 'Ordering Results'; + +select '----------------------------------------'; +select 'ascending is the default'; +select reading from Survey where quant='rad' order by reading; + +select '----------------------------------------'; +select 'order descending'; +select reading from Survey where quant='rad' order by reading desc; + +select '----------------------------------------'; +select 'ordering and sub-ordering'; +select taken, person from Survey order by taken asc, person desc; + +select '----------------------------------------'; +select 'removing duplicates'; +select distinct taken, person from Survey order by taken asc, person desc; + +select '----------------------------------------'; +select 'sorted by undisplayed column'; +select reading from Survey where quant='rad' order by taken; + +select '----------------------------------------'; +select 'randomization'; +select random(), ident from Person; + +select '----------------------------------------'; +select 'random sorting'; +select ident from Person order by random(); + +select '========================================'; +select 'Missing Data'; + +select '----------------------------------------'; +select 'all Visited data'; +select * from Visited; + +select '----------------------------------------'; +select 'visits before 1930'; +select * from Visited where dated<'1930-00-00'; + +select '----------------------------------------'; +select 'visits after 1930'; +select * from Visited where dated>='1930-00-00'; + +select '----------------------------------------'; +select 'visits with unknown dates (wrong)'; +select * from Visited where dated=NULL; + +select '----------------------------------------'; +select '!= does not work either'; +select * from Visited where dated!=NULL; + +select '----------------------------------------'; +select 'visits with unknown dates (right)'; +select * from Visited where dated is NULL; + +select '----------------------------------------'; +select 'visits with known dates'; +select * from Visited where dated is not NULL; + +select '----------------------------------------'; +select 'salinity not measured by Lake'; +select * from Survey where quant='sal' and person!='lake'; + +select '----------------------------------------'; +select 'salinity not known to be measured by Lake'; +select * from Survey where quant='sal' and (person!='lake' or person is null); + +select '========================================'; +select 'Aggregation'; + +select '----------------------------------------'; +select 'all dates'; +select dated from Visited; + +select '----------------------------------------'; +select 'date range'; +select min(dated) from Visited; +select max(dated) from Visited; +select min(dated), max(dated) from Visited; + +select '----------------------------------------'; +select 'averaging'; +select avg(reading) from Survey where quant='sal'; + +select '----------------------------------------'; +select 'counting'; +select 
count(reading) from Survey where quant='sal';
+
+select '----------------------------------------';
+select 'sum';
+select sum(reading) from Survey where quant='sal';
+
+select '----------------------------------------';
+select 'min and max of sensible values';
+select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;
+
+select '----------------------------------------';
+select 'mixing aggregated and unaggregated columns takes an arbitrary row';
+select person, count(*) from Survey where quant='sal' and reading<=1.0;
+
+select '----------------------------------------';
+select 'what happens when there is no input?';
+select person, max(reading), sum(reading) from Survey where quant='missing';
+
+select '----------------------------------------';
+select 'what happens when an input is null?';
+select min(dated) from Visited;
+
+select '----------------------------------------';
+select 'and when the null inputs are filtered out?';
+select min(dated) from Visited where dated is not null;
+
+select '----------------------------------------';
+select 'grouping Visited by site keeps an arbitrary row per group';
+select * from Visited group by site;
+
+select '----------------------------------------';
+select 'get date ranges for sites';
+select site, min(dated), max(dated) from Visited
+group by site;
+
+select '========================================';
+select 'Grouping';
+
+select '----------------------------------------';
+select 'radiation readings for a particular person';
+select person, count(reading), round(avg(reading), 2)
+from Survey
+where quant='rad'
+and person='dyer';
+
+select '----------------------------------------';
+select 'radiation readings for an arbitrary person';
+select person, count(reading), round(avg(reading), 2)
+from Survey
+where quant='rad';
+
+select '----------------------------------------';
+select 'radiation readings by person';
+select person, count(reading), round(avg(reading), 2)
+from Survey
+where quant='rad'
+group by person;
+
+select '----------------------------------------';
+select 'average by scientist and quantity';
+select person, quant, count(reading), round(avg(reading), 2)
+from Survey
+group by person, quant;
+
+select '----------------------------------------';
+select 'grouping and ordering';
+select person, quant, count(reading), round(avg(reading), 2)
+from Survey
+group by person, quant
+order by person, quant;
+
+select '----------------------------------------';
+select 'grouping and ordering without null';
+select person, quant, count(reading), round(avg(reading), 2)
+from Survey
+where person is not null
+group by person, quant
+order by person, quant;
+
+select '========================================';
+select 'Combining Data';
+
+select '----------------------------------------';
+select 'combine "Site" with "Visited"';
+select * from Site join Visited;
+
+select '----------------------------------------';
+select 'filter where sites match';
+select * from Site join Visited on Site.name=Visited.site;
+
+select '----------------------------------------';
+select 'get latitude, longitude, and date';
+select Site.lat, Site.long, Visited.dated
+from Site join Visited
+on Site.name=Visited.site;
+
+select '----------------------------------------';
+select 'get latitude, longitude, date, quantity, and reading';
+select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
+from Site join Visited join Survey
+on Site.name=Visited.site
+and Visited.ident=Survey.taken
+and Visited.dated is not null;
+
+select 
'----------------------------------------';
+select 'get all radiation readings since 1930';
+select 'but notice that #752 is missing (NULL)...';
+select Survey.reading
+from Survey join Visited
+where Survey.taken=Visited.ident
+ and Survey.quant='rad'
+ and Visited.dated>='1930-00-00';
+
+select '----------------------------------------';
+select 'get row IDs from Person table';
+select rowid, * from Person;
diff --git a/sql/novice/tables.html b/sql/novice/tables.html
new file mode 100644
index 0000000..e3b5355
--- /dev/null
+++ b/sql/novice/tables.html
@@ -0,0 +1,68 @@
+<table>
+  <tr> <th>Table Name</th> <th>Purpose</th> </tr>
+  <tr> <td>Person</td> <td>People who took readings.</td> </tr>
+  <tr> <td>Site</td> <td>Locations of observation sites.</td> </tr>
+  <tr> <td>Visited</td> <td>When readings were taken at specific sites.</td> </tr>
+  <tr> <td>Survey</td> <td>The actual measurement values.</td> </tr>
+</table>
+
+**Person**
+
+<table>
+  <tr> <th>ident</th> <th>personal</th> <th>family</th> </tr>
+  <tr> <td>dyer</td> <td>William</td> <td>Dyer</td> </tr>
+  <tr> <td>pb</td> <td>Frank</td> <td>Pabodie</td> </tr>
+  <tr> <td>lake</td> <td>Anderson</td> <td>Lake</td> </tr>
+  <tr> <td>roe</td> <td>Valentina</td> <td>Roerich</td> </tr>
+  <tr> <td>danforth</td> <td>Frank</td> <td>Danforth</td> </tr>
+</table>
+
+**Survey**
+
+<table>
+  <tr> <th>taken</th> <th>person</th> <th>quant</th> <th>reading</th> </tr>
+  <tr> <td>619</td> <td>dyer</td> <td>rad</td> <td>9.82</td> </tr>
+  <tr> <td>619</td> <td>dyer</td> <td>sal</td> <td>0.13</td> </tr>
+  <tr> <td>622</td> <td>dyer</td> <td>rad</td> <td>7.8</td> </tr>
+  <tr> <td>622</td> <td>dyer</td> <td>sal</td> <td>0.09</td> </tr>
+  <tr> <td>734</td> <td>pb</td> <td>rad</td> <td>8.41</td> </tr>
+  <tr> <td>734</td> <td>lake</td> <td>sal</td> <td>0.05</td> </tr>
+  <tr> <td>734</td> <td>pb</td> <td>temp</td> <td>-21.5</td> </tr>
+  <tr> <td>735</td> <td>pb</td> <td>rad</td> <td>7.22</td> </tr>
+  <tr> <td>735</td> <td>NULL</td> <td>sal</td> <td>0.06</td> </tr>
+  <tr> <td>735</td> <td>NULL</td> <td>temp</td> <td>-26.0</td> </tr>
+  <tr> <td>751</td> <td>pb</td> <td>rad</td> <td>4.35</td> </tr>
+  <tr> <td>751</td> <td>pb</td> <td>temp</td> <td>-18.5</td> </tr>
+  <tr> <td>751</td> <td>lake</td> <td>sal</td> <td>0.1</td> </tr>
+  <tr> <td>752</td> <td>lake</td> <td>rad</td> <td>2.19</td> </tr>
+  <tr> <td>752</td> <td>lake</td> <td>sal</td> <td>0.09</td> </tr>
+  <tr> <td>752</td> <td>lake</td> <td>temp</td> <td>-16.0</td> </tr>
+  <tr> <td>752</td> <td>roe</td> <td>sal</td> <td>41.6</td> </tr>
+  <tr> <td>837</td> <td>lake</td> <td>rad</td> <td>1.46</td> </tr>
+  <tr> <td>837</td> <td>lake</td> <td>sal</td> <td>0.21</td> </tr>
+  <tr> <td>837</td> <td>roe</td> <td>sal</td> <td>22.5</td> </tr>
+  <tr> <td>844</td> <td>roe</td> <td>rad</td> <td>11.25</td> </tr>
+</table>
+
+**Site**
+
+<table>
+  <tr> <th>name</th> <th>lat</th> <th>long</th> </tr>
+  <tr> <td>DR-1</td> <td>-49.85</td> <td>-128.57</td> </tr>
+  <tr> <td>DR-3</td> <td>-47.15</td> <td>-126.72</td> </tr>
+  <tr> <td>MSK-4</td> <td>-48.87</td> <td>-123.4</td> </tr>
+</table>
+
+**Visited**
+
+<table>
+  <tr> <th>ident</th> <th>site</th> <th>dated</th> </tr>
+  <tr> <td>619</td> <td>DR-1</td> <td>1927-02-08</td> </tr>
+  <tr> <td>622</td> <td>DR-1</td> <td>1927-02-10</td> </tr>
+  <tr> <td>734</td> <td>DR-3</td> <td>1939-01-07</td> </tr>
+  <tr> <td>735</td> <td>DR-3</td> <td>1930-01-12</td> </tr>
+  <tr> <td>751</td> <td>DR-3</td> <td>1930-02-26</td> </tr>
+  <tr> <td>752</td> <td>DR-3</td> <td>NULL</td> </tr>
+  <tr> <td>837</td> <td>MSK-4</td> <td>1932-01-14</td> </tr>
+  <tr> <td>844</td> <td>DR-1</td> <td>1932-03-22</td> </tr>
+</table>
-- 
2.26.2