{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Filtering" ] }, { "cell_type": "markdown", "metadata": { "cell_tags": [ "objectives" ] }, "source": [ "#### Objectives\n", "\n", "* Write queries that select records that satisfy user-specified conditions.\n", "* Explain the order in which the clauses in a query are executed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most powerful features of a database is\n", "the ability to [filter](https://github.com/swcarpentry/bc/blob/master/gloss.md#filter) data,\n", "i.e.,\n", "to select only those records that match certain criteria.\n", "For example,\n", "suppose we want to see when a particular site was visited.\n", "We can select these records from the `Visited` table\n", "by using a `where` clause in our query:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load_ext sqlitemagic" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select * from Visited where site='DR-1';" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "
619DR-11927-02-08
622DR-11927-02-10
844DR-11932-03-22
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The database manager executes this query in two stages.\n", "First,\n", "it checks at each row in the `Visited` table\n", "to see which ones satisfy the `where`.\n", "It then uses the column names following the `select` keyword\n", "to determine what columns to display." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This processing order means that\n", "we can filter records using `where`\n", "based on values in columns that aren't then displayed:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select ident from Visited where site='DR-1';" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "
619
622
844
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"SQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use many other Boolean operators to filter our data.\n", "For example,\n", "we can ask for all information from the DR-1 site collected since 1930:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select * from Visited where (site='DR-1') and (dated>='1930-00-00');" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "
844DR-11932-03-22
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "(The parentheses around the individual tests aren't strictly required,\n", "but they help make the query easier to read.)\n", "\n", "> Most database managers have a special data type for dates.\n", "> In fact, many have two:\n", "> one for dates,\n", "> such as \"May 31, 1971\",\n", "> and one for durations,\n", "> such as \"31 days\".\n", "> SQLite doesn't:\n", "> instead,\n", "> it stores dates as either text\n", "> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n", "> real numbers\n", "> (the number of days since November 24, 4714 BCE),\n", "> or integers\n", "> (the number of seconds since midnight, January 1, 1970).\n", "> If this sounds complicated,\n", "> it is,\n", "> but not nearly as complicated as figuring out\n", "> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to find out what measurements were taken by either Lake or Roerich,\n", "we can combine the tests on their names using `or`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select * from Survey where person='lake' or person='roe';" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
734lakesal0.05
751lakesal0.1
752lakerad2.19
752lakesal0.09
752laketemp-16.0
752roesal41.6
837lakerad1.46
837lakesal0.21
837roesal22.5
844roerad11.25
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively,\n", "we can use `in` to see if a value is in a specific set:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select * from Survey where person in ('lake', 'roe');" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
734lakesal0.05
751lakesal0.1
752lakerad2.19
752lakesal0.09
752laketemp-16.0
752roesal41.6
837lakerad1.46
837lakesal0.21
837roesal22.5
844roerad11.25
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can combine `and` with `or`,\n", "but we need to be careful about which operator is executed first.\n", "If we *don't* use parentheses,\n", "we get this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select * from Survey where quant='sal' and person='lake' or person='roe';" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
734lakesal0.05
751lakesal0.1
752lakesal0.09
752roesal41.6
837lakesal0.21
837roesal22.5
844roerad11.25
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "which is salinity measurements by Lake,\n", "and *any* measurement by Roerich.\n", "We probably want this instead:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select * from Survey where quant='sal' and (person='lake' or person='roe');" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
734lakesal0.05
751lakesal0.1
752lakesal0.09
752roesal41.6
837lakesal0.21
837roesal22.5
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally,\n", "we can use `distinct` with `where`\n", "to give a second level of filtering:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%sqlite survey.db\n", "select distinct person, quant from Survey where person='lake' or person='roe';" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "
lakesal
lakerad
laketemp
roesal
roerad
" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But remember:\n", "`distinct` is applied to the values displayed in the chosen columns,\n", "not to the entire rows as they are being processed.\n", "\n", "> What we have just done is how most people \"grow\" their SQL queries.\n", "> We started with something simple that did part of what we wanted,\n", "> then added more clauses one by one,\n", "> testing their effects as we went.\n", "> This is a good strategy—in fact,\n", "> for complex queries it's often the *only* strategy—but\n", "> it depends on quick turnaround,\n", "> and on us recognizing the right answer when we get it.\n", "> \n", "> The best way to achieve quick turnaround is often\n", "> to put a subset of data in a temporary database\n", "> and run our queries against that,\n", "> or to fill a small database with synthesized records.\n", "> For example,\n", "> instead of trying our queries against an actual database of 20 million Australians,\n", "> we could run it against a sample of ten thousand,\n", "> or write a small program to generate ten thousand random (but plausible) records\n", "> and use that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Challenges\n", "\n", "1. Suppose we want to select all sites that lie more than 30° from the poles.\n", " Our first query is:\n", "\n", " ~~~\n", " select * from Site where (lat > -60) or (lat < 60);\n", " ~~~\n", "\n", " Explain why this is wrong,\n", " and rewrite the query so that it is correct.\n", "\n", "2. Normalized salinity readings are supposed to be between 0.0 and 1.0.\n", " Write a query that selects all records from `Survey`\n", " with salinity values outside this range.\n", "\n", "3. The SQL test `*column-name* like *pattern*`\n", " is true if the value in the named column\n", " matches the pattern given;\n", " the character '%' can be used any number of times in the pattern\n", " to mean \"match zero or more characters\".\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Expression Value
'a' like 'a' True
'a' like '%a' True
'b' like '%a' False
'alpha' like 'a%' True
'alpha' like 'a%p%' True
\n", " The expression `*column-name* not like *pattern*`\n", " inverts the test.\n", " Using `like`,\n", " write a query that finds all the records in `Visited`\n", " that *aren't* from sites labelled 'DR-something'." ] }, { "cell_type": "markdown", "metadata": { "cell_tags": [ "keypoints" ] }, "source": [ "#### Key Points\n", "\n", "* Use `where` to filter records according to Boolean conditions.\n", "* Filtering is done on whole records,\n", " so conditions can use fields that are not actually displayed." ] } ], "metadata": {} } ] }