+++ /dev/null
-{
- "metadata": {
- "name": ""
- },
- "nbformat": 3,
- "nbformat_minor": 0,
- "worksheets": [
- {
- "cells": [
- {
- "cell_type": "heading",
- "level": 2,
- "metadata": {},
- "source": [
- "Filtering"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "cell_tags": [
- "objectives"
- ]
- },
- "source": [
- "#### Objectives\n",
- "\n",
- "* Write queries that select records that satisfy user-specified conditions.\n",
- "* Explain the order in which the clauses in a query are executed."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "One of the most powerful features of a database is\n",
- "the ability to [filter](../../gloss.html#filter) data,\n",
- "i.e.,\n",
- "to select only those records that match certain criteria.\n",
- "For example,\n",
- "suppose we want to see when a particular site was visited.\n",
- "We can select these records from the `Visited` table\n",
- "by using a `where` clause in our query:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%load_ext sqlitemagic"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 1
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select * from Visited where site='DR-1';"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>619</td><td>DR-1</td><td>1927-02-08</td></tr>\n",
- "<tr><td>622</td><td>DR-1</td><td>1927-02-10</td></tr>\n",
- "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c9050>"
- ]
- }
- ],
- "prompt_number": 2
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The database manager executes this query in two stages.\n",
- "First,\n",
- "it checks at each row in the `Visited` table\n",
- "to see which ones satisfy the `where`.\n",
- "It then uses the column names following the `select` keyword\n",
- "to determine what columns to display."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "This processing order means that\n",
- "we can filter records using `where`\n",
- "based on values in columns that aren't then displayed:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select ident from Visited where site='DR-1';"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>619</td></tr>\n",
- "<tr><td>622</td></tr>\n",
- "<tr><td>844</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c3890>"
- ]
- }
- ],
- "prompt_number": 3
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "<img src=\"files/img/sql-filter.svg\" alt=\"SQL Filtering in Action\" />"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can use many other Boolean operators to filter our data.\n",
- "For example,\n",
- "we can ask for all information from the DR-1 site collected since 1930:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>844</td><td>DR-1</td><td>1932-03-22</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c4790>"
- ]
- }
- ],
- "prompt_number": 4
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "(The parentheses around the individual tests aren't strictly required,\n",
- "but they help make the query easier to read.)\n",
- "\n",
- "> Most database managers have a special data type for dates.\n",
- "> In fact, many have two:\n",
- "> one for dates,\n",
- "> such as \"May 31, 1971\",\n",
- "> and one for durations,\n",
- "> such as \"31 days\".\n",
- "> SQLite doesn't:\n",
- "> instead,\n",
- "> it stores dates as either text\n",
- "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
- "> real numbers\n",
- "> (the number of days since November 24, 4714 BCE),\n",
- "> or integers\n",
- "> (the number of seconds since midnight, January 1, 1970).\n",
- "> If this sounds complicated,\n",
- "> it is,\n",
- "> but not nearly as complicated as figuring out\n",
- "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "If we want to find out what measurements were taken by either Lake or Roerich,\n",
- "we can combine the tests on their names using `or`:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select * from Survey where person='lake' or person='roe';"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
- "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
- "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
- "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
- "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
- "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
- "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c4850>"
- ]
- }
- ],
- "prompt_number": 5
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Alternatively,\n",
- "we can use `in` to see if a value is in a specific set:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select * from Survey where person in ('lake', 'roe');"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
- "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>rad</td><td>2.19</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>temp</td><td>-16.0</td></tr>\n",
- "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
- "<tr><td>837</td><td>lake</td><td>rad</td><td>1.46</td></tr>\n",
- "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
- "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
- "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c4790>"
- ]
- }
- ],
- "prompt_number": 6
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can combine `and` with `or`,\n",
- "but we need to be careful about which operator is executed first.\n",
- "If we *don't* use parentheses,\n",
- "we get this:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select * from Survey where quant='sal' and person='lake' or person='roe';"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
- "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
- "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
- "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
- "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
- "<tr><td>844</td><td>roe</td><td>rad</td><td>11.25</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c4850>"
- ]
- }
- ],
- "prompt_number": 7
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "which is salinity measurements by Lake,\n",
- "and *any* measurement by Roerich.\n",
- "We probably want this instead:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select * from Survey where quant='sal' and (person='lake' or person='roe');"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>734</td><td>lake</td><td>sal</td><td>0.05</td></tr>\n",
- "<tr><td>751</td><td>lake</td><td>sal</td><td>0.1</td></tr>\n",
- "<tr><td>752</td><td>lake</td><td>sal</td><td>0.09</td></tr>\n",
- "<tr><td>752</td><td>roe</td><td>sal</td><td>41.6</td></tr>\n",
- "<tr><td>837</td><td>lake</td><td>sal</td><td>0.21</td></tr>\n",
- "<tr><td>837</td><td>roe</td><td>sal</td><td>22.5</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c3850>"
- ]
- }
- ],
- "prompt_number": 8
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Finally,\n",
- "we can use `distinct` with `where`\n",
- "to give a second level of filtering:"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%sqlite survey.db\n",
- "select distinct person, quant from Survey where person='lake' or person='roe';"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "html": [
- "<table>\n",
- "<tr><td>lake</td><td>sal</td></tr>\n",
- "<tr><td>lake</td><td>rad</td></tr>\n",
- "<tr><td>lake</td><td>temp</td></tr>\n",
- "<tr><td>roe</td><td>sal</td></tr>\n",
- "<tr><td>roe</td><td>rad</td></tr>\n",
- "</table>"
- ],
- "metadata": {},
- "output_type": "display_data",
- "text": [
- "<IPython.core.display.HTML at 0x1023c3890>"
- ]
- }
- ],
- "prompt_number": 9
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "But remember:\n",
- "`distinct` is applied to the values displayed in the chosen columns,\n",
- "not to the entire rows as they are being processed.\n",
- "\n",
- "> What we have just done is how most people \"grow\" their SQL queries.\n",
- "> We started with something simple that did part of what we wanted,\n",
- "> then added more clauses one by one,\n",
- "> testing their effects as we went.\n",
- "> This is a good strategy—in fact,\n",
- "> for complex queries it's often the *only* strategy—but\n",
- "> it depends on quick turnaround,\n",
- "> and on us recognizing the right answer when we get it.\n",
- "> \n",
- "> The best way to achieve quick turnaround is often\n",
- "> to put a subset of data in a temporary database\n",
- "> and run our queries against that,\n",
- "> or to fill a small database with synthesized records.\n",
- "> For example,\n",
- "> instead of trying our queries against an actual database of 20 million Australians,\n",
- "> we could run it against a sample of ten thousand,\n",
- "> or write a small program to generate ten thousand random (but plausible) records\n",
- "> and use that."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Challenges\n",
- "\n",
- "1. Suppose we want to select all sites that lie more than 30° from the poles.\n",
- " Our first query is:\n",
- "\n",
- " ~~~\n",
- " select * from Site where (lat > -60) or (lat < 60);\n",
- " ~~~\n",
- "\n",
- " Explain why this is wrong,\n",
- " and rewrite the query so that it is correct.\n",
- "\n",
- "2. Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
- " Write a query that selects all records from `Survey`\n",
- " with salinity values outside this range.\n",
- "\n",
- "3. The SQL test `*column-name* like *pattern*`\n",
- " is true if the value in the named column\n",
- " matches the pattern given;\n",
- " the character '%' can be used any number of times in the pattern\n",
- " to mean \"match zero or more characters\".\n",
- "\n",
- " <table>\n",
- " <tr> <th>Expression</th> <th>Value</th> </tr>\n",
- " <tr> <td><code>'a' like 'a'</code></td> <td>True</td> </tr>\n",
- " <tr> <td><code>'a' like '%a'</code></td> <td>True</td> </tr>\n",
- " <tr> <td><code>'b' like '%a'</code></td> <td>False</td> </tr>\n",
- " <tr> <td><code>'alpha' like 'a%'</code></td> <td>True</td> </tr>\n",
- " <tr> <td><code>'alpha' like 'a%p%'</code></td> <td>True</td> </tr>\n",
- " </table>\n",
- " The expression `*column-name* not like *pattern*`\n",
- " inverts the test.\n",
- " Using `like`,\n",
- " write a query that finds all the records in `Visited`\n",
- " that *aren't* from sites labelled 'DR-something'."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "cell_tags": [
- "keypoints"
- ]
- },
- "source": [
- "#### Key Points\n",
- "\n",
- "* Use `where` to filter records according to Boolean conditions.\n",
- "* Filtering is done on whole records,\n",
- " so conditions can use fields that are not actually displayed."
- ]
- }
- ],
- "metadata": {}
- }
- ]
-}
\ No newline at end of file