Reorganizing material to put novice + intermediate at top level
[swc-boot-camp.git] / sql / novice / 03-filter.ipynb
diff --git a/sql/novice/03-filter.ipynb b/sql/novice/03-filter.ipynb
deleted file mode 100644 (file)
index f06d40e..0000000
+++ /dev/null
@@ -1,490 +0,0 @@
-{
- "metadata": {
-  "name": ""
- },
- "nbformat": 3,
- "nbformat_minor": 0,
- "worksheets": [
-  {
-   "cells": [
-    {
-     "cell_type": "heading",
-     "level": 2,
-     "metadata": {},
-     "source": [
-      "Filtering"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {
-      "cell_tags": [
-       "objectives"
-      ]
-     },
-     "source": [
-      "#### Objectives\n",
-      "\n",
-      "*   Write queries that select records that satisfy user-specified conditions.\n",
-      "*   Explain the order in which the clauses in a query are executed."
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "One of the most powerful features of a database is\n",
-      "the ability to [filter](../../gloss.html#filter) data,\n",
-      "i.e.,\n",
-      "to select only those records that match certain criteria.\n",
-      "For example,\n",
-      "suppose we want to see when a particular site was visited.\n",
-      "We can select these records from the `Visited` table\n",
-      "by using a `where` clause in our query:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%load_ext sqlitemagic"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 1
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select * from Visited where site='DR-1';"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
-        "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
-        "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c9050>"
-       ]
-      }
-     ],
-     "prompt_number": 2
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "The database manager executes this query in two stages.\n",
-      "First,\n",
-      "it checks at each row in the `Visited` table\n",
-      "to see which ones satisfy the `where`.\n",
-      "It then uses the column names following the `select` keyword\n",
-      "to determine what columns to display."
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "This processing order means that\n",
-      "we can filter records using `where`\n",
-      "based on values in columns that aren't then displayed:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select ident from Visited where site='DR-1';"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>619</td></tr>\n",
-        "<tr><td>622</td></tr>\n",
-        "<tr><td>844</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c3890>"
-       ]
-      }
-     ],
-     "prompt_number": 3
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "<img src=\"files/img/sql-filter.svg\" alt=\"SQL Filtering in Action\" />"
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "We can use many other Boolean operators to filter our data.\n",
-      "For example,\n",
-      "we can ask for all information from the DR-1 site collected since 1930:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c4790>"
-       ]
-      }
-     ],
-     "prompt_number": 4
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "(The parentheses around the individual tests aren't strictly required,\n",
-      "but they help make the query easier to read.)\n",
-      "\n",
-      "> Most database managers have a special data type for dates.\n",
-      "> In fact, many have two:\n",
-      "> one for dates,\n",
-      "> such as \"May 31, 1971\",\n",
-      "> and one for durations,\n",
-      "> such as \"31 days\".\n",
-      "> SQLite doesn't:\n",
-      "> instead,\n",
-      "> it stores dates as either text\n",
-      "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
-      "> real numbers\n",
-      "> (the number of days since November 24, 4714 BCE),\n",
-      "> or integers\n",
-      "> (the number of seconds since midnight, January 1, 1970).\n",
-      "> If this sounds complicated,\n",
-      "> it is,\n",
-      "> but not nearly as complicated as figuring out\n",
-      "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "If we want to find out what measurements were taken by either Lake or Roerich,\n",
-      "we can combine the tests on their names using `or`:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select * from Survey where person='lake' or person='roe';"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
-        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
-        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
-        "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
-        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
-        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
-        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c4850>"
-       ]
-      }
-     ],
-     "prompt_number": 5
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "Alternatively,\n",
-      "we can use `in` to see if a value is in a specific set:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select * from Survey where person in ('lake', 'roe');"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
-        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
-        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
-        "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
-        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
-        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
-        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c4790>"
-       ]
-      }
-     ],
-     "prompt_number": 6
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "We can combine `and` with `or`,\n",
-      "but we need to be careful about which operator is executed first.\n",
-      "If we *don't* use parentheses,\n",
-      "we get this:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select * from Survey where quant='sal' and person='lake' or person='roe';"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
-        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
-        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
-        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
-        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
-        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c4850>"
-       ]
-      }
-     ],
-     "prompt_number": 7
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "which is salinity measurements by Lake,\n",
-      "and *any* measurement by Roerich.\n",
-      "We probably want this instead:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select * from Survey where quant='sal' and (person='lake' or person='roe');"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
-        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
-        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
-        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
-        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
-        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c3850>"
-       ]
-      }
-     ],
-     "prompt_number": 8
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "Finally,\n",
-      "we can use `distinct` with `where`\n",
-      "to give a second level of filtering:"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%sqlite survey.db\n",
-      "select distinct person, quant from Survey where person='lake' or person='roe';"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "html": [
-        "<table>\n",
-        "<tr><td>lake</td><td>sal</td></tr>\n",
-        "<tr><td>lake</td><td>rad</td></tr>\n",
-        "<tr><td>lake</td><td>temp</td></tr>\n",
-        "<tr><td>roe</td><td>sal</td></tr>\n",
-        "<tr><td>roe</td><td>rad</td></tr>\n",
-        "</table>"
-       ],
-       "metadata": {},
-       "output_type": "display_data",
-       "text": [
-        "<IPython.core.display.HTML at 0x1023c3890>"
-       ]
-      }
-     ],
-     "prompt_number": 9
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "But remember:\n",
-      "`distinct` is applied to the values displayed in the chosen columns,\n",
-      "not to the entire rows as they are being processed.\n",
-      "\n",
-      "> What we have just done is how most people \"grow\" their SQL queries.\n",
-      "> We started with something simple that did part of what we wanted,\n",
-      "> then added more clauses one by one,\n",
-      "> testing their effects as we went.\n",
-      "> This is a good strategy&mdash;in fact,\n",
-      "> for complex queries it's often the *only* strategy&mdash;but\n",
-      "> it depends on quick turnaround,\n",
-      "> and on us recognizing the right answer when we get it.\n",
-      ">     \n",
-      "> The best way to achieve quick turnaround is often\n",
-      "> to put a subset of data in a temporary database\n",
-      "> and run our queries against that,\n",
-      "> or to fill a small database with synthesized records.\n",
-      "> For example,\n",
-      "> instead of trying our queries against an actual database of 20 million Australians,\n",
-      "> we could run it against a sample of ten thousand,\n",
-      "> or write a small program to generate ten thousand random (but plausible) records\n",
-      "> and use that."
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "#### Challenges\n",
-      "\n",
-      "1.  Suppose we want to select all sites that lie more than 30&deg; from the poles.\n",
-      "    Our first query is:\n",
-      "\n",
-      "    ~~~\n",
-      "    select * from Site where (lat > -60) or (lat < 60);\n",
-      "    ~~~\n",
-      "\n",
-      "    Explain why this is wrong,\n",
-      "    and rewrite the query so that it is correct.\n",
-      "\n",
-      "2.  Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
-      "    Write a query that selects all records from `Survey`\n",
-      "    with salinity values outside this range.\n",
-      "\n",
-      "3.  The SQL test `*column-name* like *pattern*`\n",
-      "    is true if the value in the named column\n",
-      "    matches the pattern given;\n",
-      "    the character '%' can be used any number of times in the pattern\n",
-      "    to mean \"match zero or more characters\".\n",
-      "\n",
-      "    <table>\n",
-      "      <tr> <th>Expression</th> <th>Value</th> </tr>\n",
-      "      <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
-      "      <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
-      "      <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
-      "      <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
-      "      <tr> <td><code>'alpha' like 'a%p%'</code></td> <td>True</td> </tr>\n",
-      "    </table>\n",
-      "    The expression `*column-name* not like *pattern*`\n",
-      "    inverts the test.\n",
-      "    Using `like`,\n",
-      "    write a query that finds all the records in `Visited`\n",
-      "    that *aren't* from sites labelled 'DR-something'."
-     ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {
-      "cell_tags": [
-       "keypoints"
-      ]
-     },
-     "source": [
-      "#### Key Points\n",
-      "\n",
-      "*   Use `where` to filter records according to Boolean conditions.\n",
-      "*   Filtering is done on whole records,\n",
-      "    so conditions can use fields that are not actually displayed."
-     ]
-    }
-   ],
-   "metadata": {}
-  }
- ]
-}
\ No newline at end of file