{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Filtering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One of the most powerful features of a database is\n",
"the ability to [filter](../../gloss.html#filter) data,\n",
"i.e.,\n",
"to select only those records that match certain criteria.\n",
"For example,\n",
"suppose we want to see when a particular site was visited.\n",
"We can select these records from the `Visited` table\n",
"by using a `where` clause in our query:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%load_ext sqlitemagic"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select * from Visited where site='DR-1';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"
\n",
"619 | DR-1 | 1927-02-08 |
\n",
"622 | DR-1 | 1927-02-10 |
\n",
"844 | DR-1 | 1932-03-22 |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The database manager executes this query in two stages.\n",
"First,\n",
"it checks at each row in the `Visited` table\n",
"to see which ones satisfy the `where`.\n",
"It then uses the column names following the `select` keyword\n",
"to determine what columns to display."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"FIXME: diagram"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This processing order means that\n",
"we can filter records using `where`\n",
"based on values in columns that aren't then displayed:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select ident from Visited where site='DR-1';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"619 |
\n",
"622 |
\n",
"844 |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use many other Boolean operators to filter our data.\n",
"For example,\n",
"we can ask for all information from the DR-1 site collected since 1930:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select * from Visited where (site='DR-1') and (dated>='1930-00-00');"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"844 | DR-1 | 1932-03-22 |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The parentheses around the individual tests aren't strictly required,\n",
"but they help make the query easier to read.)\n",
"\n",
"> Most database managers have a special data type for dates.\n",
"> In fact, many have two:\n",
"> one for dates,\n",
"> such as \"May 31, 1971\",\n",
"> and one for durations,\n",
"> such as \"31 days\".\n",
"> SQLite doesn't:\n",
"> instead,\n",
"> it stores dates as either text\n",
"> (in the ISO-8601 standard format \"YYYY-MM-DD HH:MM:SS.SSSS\"),\n",
"> real numbers\n",
"> (the number of days since November 24, 4714 BCE),\n",
"> or integers\n",
"> (the number of seconds since midnight, January 1, 1970).\n",
"> If this sounds complicated,\n",
"> it is,\n",
"> but not nearly as complicated as figuring out\n",
"> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to find out what measurements were taken by either Lake or Roerich,\n",
"we can combine the tests on their names using `or`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select * from Survey where person='lake' or person='roe';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"734 | lake | sal | 0.05 |
\n",
"751 | lake | sal | 0.1 |
\n",
"752 | lake | rad | 2.19 |
\n",
"752 | lake | sal | 0.09 |
\n",
"752 | lake | temp | -16.0 |
\n",
"752 | roe | sal | 41.6 |
\n",
"837 | lake | rad | 1.46 |
\n",
"837 | lake | sal | 0.21 |
\n",
"837 | roe | sal | 22.5 |
\n",
"844 | roe | rad | 11.25 |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively,\n",
"we can use `in` to see if a value is in a specific set:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select * from Survey where person in ('lake', 'roe');"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"734 | lake | sal | 0.05 |
\n",
"751 | lake | sal | 0.1 |
\n",
"752 | lake | rad | 2.19 |
\n",
"752 | lake | sal | 0.09 |
\n",
"752 | lake | temp | -16.0 |
\n",
"752 | roe | sal | 41.6 |
\n",
"837 | lake | rad | 1.46 |
\n",
"837 | lake | sal | 0.21 |
\n",
"837 | roe | sal | 22.5 |
\n",
"844 | roe | rad | 11.25 |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can combine `and` with `or`,\n",
"but we need to be careful about which operator is executed first.\n",
"If we *don't* use parentheses,\n",
"we get this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select * from Survey where quant='sal' and person='lake' or person='roe';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"734 | lake | sal | 0.05 |
\n",
"751 | lake | sal | 0.1 |
\n",
"752 | lake | sal | 0.09 |
\n",
"752 | roe | sal | 41.6 |
\n",
"837 | lake | sal | 0.21 |
\n",
"837 | roe | sal | 22.5 |
\n",
"844 | roe | rad | 11.25 |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"which is salinity measurements by Lake,\n",
"and *any* measurement by Roerich.\n",
"We probably want this instead:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select * from Survey where quant='sal' and (person='lake' or person='roe');"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"734 | lake | sal | 0.05 |
\n",
"751 | lake | sal | 0.1 |
\n",
"752 | lake | sal | 0.09 |
\n",
"752 | roe | sal | 41.6 |
\n",
"837 | lake | sal | 0.21 |
\n",
"837 | roe | sal | 22.5 |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally,\n",
"we can use `distinct` with `where`\n",
"to give a second level of filtering:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%sqlite survey.db\n",
"select distinct person, quant from Survey where person='lake' or person='roe';"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"lake | sal |
\n",
"lake | rad |
\n",
"lake | temp |
\n",
"roe | sal |
\n",
"roe | rad |
\n",
"
"
],
"metadata": {},
"output_type": "display_data",
"text": [
""
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But remember:\n",
"`distinct` is applied to the values displayed in the chosen columns,\n",
"not to the entire rows as they are being processed.\n",
"\n",
"> What we have just done is how most people \"grow\" their SQL queries.\n",
"> We started with something simple that did part of what we wanted,\n",
"> then added more clauses one by one,\n",
"> testing their effects as we went.\n",
"> This is a good strategy—in fact,\n",
"> for complex queries it's often the *only* strategy—but\n",
"> it depends on quick turnaround,\n",
"> and on us recognizing the right answer when we get it.\n",
"> \n",
"> The best way to achieve quick turnaround is often\n",
"> to put a subset of data in a temporary database\n",
"> and run our queries against that,\n",
"> or to fill a small database with synthesized records.\n",
"> For example,\n",
"> instead of trying our queries against an actual database of 20 million Australians,\n",
"> we could run it against a sample of ten thousand,\n",
"> or write a small program to generate ten thousand random (but plausible) records\n",
"> and use that."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Challenges\n",
"\n",
"1. Suppose we want to select all sites that lie within 30° of the equator.\n",
" Our first query is:\n",
"\n",
" ```\n",
" select * from Site where (lat > -30) or (lat < 30);\n",
" ```\n",
"\n",
" Explain why this is wrong,\n",
" and rewrite the query so that it is correct.\n",
"\n",
"2. Normalized salinity readings are supposed to be between 0.0 and 1.0.\n",
" Write a query that selects all records from `Survey`\n",
" with salinity values outside this range.\n",
"\n",
"3. The SQL test `*column-name* like *pattern*`\n",
" is true if the value in the named column\n",
" matches the pattern given;\n",
" the character '%' can be used any number of times in the pattern\n",
" to mean \"match zero or more characters\".\n",
"\n",
" \n",
" Expression | Value |
\n",
" 'a' like 'a' | True |
\n",
" 'a' like '%a' | True |
\n",
" 'b' like '%a' | False |
\n",
" 'alpha' like 'a%' | True |
\n",
" 'alpha' like 'a%p%' | True |
\n",
"
\n",
" The expression `*column-name* not like *pattern*`\n",
" inverts the test.\n",
" Using `like`,\n",
" write a query that finds all the records in `Visited`\n",
" that *aren't* from sites labelled 'DR-something'."
]
}
],
"metadata": {}
}
]
}