{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Filtering"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "cell_tags": [
       "objectives"
      ]
     },
     "source": [
      "#### Objectives\n",
      "\n",
      "*   Write queries that select records that satisfy user-specified conditions.\n",
      "*   Explain the order in which the clauses in a query are executed."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One of the most powerful features of a database is\n",
      "the ability to [filter](https://github.com/swcarpentry/bc/blob/master/gloss.md#filter) data,\n",
      "i.e.,\n",
      "to select only those records that match certain criteria.\n",
      "For example,\n",
      "suppose we want to see when a particular site was visited.\n",
      "We can select these records from the `Visited` table\n",
      "by using a `where` clause in our query:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%load_ext sqlitemagic"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select * from Visited where site='DR-1';"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
        "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
        "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c9050>"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The database manager executes this query in two stages.\n",
      "First,\n",
      "it checks at each row in the `Visited` table\n",
      "to see which ones satisfy the `where`.\n",
      "It then uses the column names following the `select` keyword\n",
      "to determine what columns to display."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This processing order means that\n",
      "we can filter records using `where`\n",
      "based on values in columns that aren't then displayed:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select ident from Visited where site='DR-1';"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>619</td></tr>\n",
        "<tr><td>622</td></tr>\n",
        "<tr><td>844</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c3890>"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<img src=\"files/img/sql-filter.svg\" alt=\"SQL Filtering in Action\" />"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can use many other Boolean operators to filter our data.\n",
      "For example,\n",
      "we can ask for all information from the DR-1 site collected since 1930:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c4790>"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "(The parentheses around the individual tests aren't strictly required,\n",
      "but they help make the query easier to read.)\n",
      "\n",
      "> Most database managers have a special data type for dates.\n",
      "> In fact, many have two:\n",
      "> one for dates,\n",
      "> such as \"May 31, 1971\",\n",
      "> and one for durations,\n",
      "> such as \"31 days\".\n",
      "> SQLite doesn't:\n",
      "> instead,\n",
      "> it stores dates as either text\n",
      "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
      "> real numbers\n",
      "> (the number of days since November 24, 4714 BCE),\n",
      "> or integers\n",
      "> (the number of seconds since midnight, January 1, 1970).\n",
      "> If this sounds complicated,\n",
      "> it is,\n",
      "> but not nearly as complicated as figuring out\n",
      "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we want to find out what measurements were taken by either Lake or Roerich,\n",
      "we can combine the tests on their names using `or`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select * from Survey where person='lake' or person='roe';"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
        "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c4850>"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Alternatively,\n",
      "we can use `in` to see if a value is in a specific set:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select * from Survey where person in ('lake', 'roe');"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
        "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c4790>"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can combine `and` with `or`,\n",
      "but we need to be careful about which operator is executed first.\n",
      "If we *don't* use parentheses,\n",
      "we get this:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select * from Survey where quant='sal' and person='lake' or person='roe';"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
        "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c4850>"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "which is salinity measurements by Lake,\n",
      "and *any* measurement by Roerich.\n",
      "We probably want this instead:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select * from Survey where quant='sal' and (person='lake' or person='roe');"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
        "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
        "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
        "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
        "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
        "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c3850>"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally,\n",
      "we can use `distinct` with `where`\n",
      "to give a second level of filtering:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%sqlite survey.db\n",
      "select distinct person, quant from Survey where person='lake' or person='roe';"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<table>\n",
        "<tr><td>lake</td><td>sal</td></tr>\n",
        "<tr><td>lake</td><td>rad</td></tr>\n",
        "<tr><td>lake</td><td>temp</td></tr>\n",
        "<tr><td>roe</td><td>sal</td></tr>\n",
        "<tr><td>roe</td><td>rad</td></tr>\n",
        "</table>"
       ],
       "metadata": {},
       "output_type": "display_data",
       "text": [
        "<IPython.core.display.HTML at 0x1023c3890>"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "But remember:\n",
      "`distinct` is applied to the values displayed in the chosen columns,\n",
      "not to the entire rows as they are being processed.\n",
      "\n",
      "> What we have just done is how most people \"grow\" their SQL queries.\n",
      "> We started with something simple that did part of what we wanted,\n",
      "> then added more clauses one by one,\n",
      "> testing their effects as we went.\n",
      "> This is a good strategy&mdash;in fact,\n",
      "> for complex queries it's often the *only* strategy&mdash;but\n",
      "> it depends on quick turnaround,\n",
      "> and on us recognizing the right answer when we get it.\n",
      ">     \n",
      "> The best way to achieve quick turnaround is often\n",
      "> to put a subset of data in a temporary database\n",
      "> and run our queries against that,\n",
      "> or to fill a small database with synthesized records.\n",
      "> For example,\n",
      "> instead of trying our queries against an actual database of 20 million Australians,\n",
      "> we could run it against a sample of ten thousand,\n",
      "> or write a small program to generate ten thousand random (but plausible) records\n",
      "> and use that."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Challenges\n",
      "\n",
      "1.  Suppose we want to select all sites that lie more than 30&deg; from the poles.\n",
      "    Our first query is:\n",
      "\n",
      "    ~~~\n",
      "    select * from Site where (lat > -60) or (lat < 60);\n",
      "    ~~~\n",
      "\n",
      "    Explain why this is wrong,\n",
      "    and rewrite the query so that it is correct.\n",
      "\n",
      "2.  Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
      "    Write a query that selects all records from `Survey`\n",
      "    with salinity values outside this range.\n",
      "\n",
      "3.  The SQL test `*column-name* like *pattern*`\n",
      "    is true if the value in the named column\n",
      "    matches the pattern given;\n",
      "    the character '%' can be used any number of times in the pattern\n",
      "    to mean \"match zero or more characters\".\n",
      "\n",
      "    <table>\n",
      "      <tr> <th>Expression</th> <th>Value</th> </tr>\n",
      "      <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
      "      <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
      "      <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
      "      <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
      "      <tr> <td><code>'alpha' like 'a%p%'</code></td> <td>True</td> </tr>\n",
      "    </table>\n",
      "    The expression `*column-name* not like *pattern*`\n",
      "    inverts the test.\n",
      "    Using `like`,\n",
      "    write a query that finds all the records in `Visited`\n",
      "    that *aren't* from sites labelled 'DR-something'."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "cell_tags": [
       "keypoints"
      ]
     },
     "source": [
      "#### Key Points\n",
      "\n",
      "*   Use `where` to filter records according to Boolean conditions.\n",
      "*   Filtering is done on whole records,\n",
      "    so conditions can use fields that are not actually displayed."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}