In the late 1920s and early 1930s,
William Dyer, Frank Pabodie, and Valentina Roerich led expeditions to the
[Pole of Inaccessibility](http://en.wikipedia.org/wiki/Pole_of_inaccessibility)
and then onward to Antarctica.
Gina Geographer discovered their expedition journals
in a storage locker at Miskatonic University.
She has scanned and OCR'd the data they contain,
and wants to store that information
in a way that will make search and analysis easy.

Gina basically has three options:
text files, spreadsheets, or a database.
Text files are easiest to create,
and work well with version control,
but she would then have to build all of her search and analysis tools herself.
Spreadsheets are good for doing simple analysis,
but as she found in her last project,
they don't handle large or complex data sets very well.
She would therefore like to put her data in a database,
and this chapter will show her how.

As many scientists have found out the hard way,
if collecting data is the first 90% of the work,
managing it is the other 90%.
In this chapter,
we'll see how to use a database to store and analyze field observations.
The techniques we will explore apply directly to other kinds of databases as well,
and as we will see,
knowing how to get information *out* of a database is essential to
figuring out how to put data *in*.

## For Instructors {.guide}

Relational databases are not as widely used in science as in business,
but they are still a common way to store large data sets with complex structure.
Even when the data itself isn't in a database,
the metadata could be:
for example,
meteorological data might be stored in files on disk,
but data about when and where observations were made,
and so on,
could be in a database
to make it easier for scientists to find what they want.

The first few sections
(up to [Ordering Results](#s:sort))
usually go very quickly.
The pace usually slows down a bit when null values and aggregation are discussed,
mostly because learners have a lot of details to keep straight by this point.
Things *really* slow down during the discussion of joins,
but this is the key idea in the whole lesson:
important ideas like primary keys and referential integrity only make sense
once learners have seen how they're used in joins.
It's worth going over things a couple of times if necessary
(with lots of examples).

The final three sections are independent of each other,
and can be dropped if time is short.
Of the three,
people seem to care most about how to add data
(which only takes a few minutes to demonstrate),
and how to use databases from inside "real" programs.
The material on transactions is more abstract than the rest,
and should be omitted if [web programming](web.html)
is not going to be covered.
In total,
this material takes about three hours to present,
assuming that a short exercise is done with each topic.

### Prerequisites {.prereq}

Everything up to the [final section](#s:programming)
only requires some understanding of Boolean operators,
and what's needed can actually be introduced on the fly.
That [final section](#s:programming),
which shows how to use databases from within programs,
depends on most of the [basic Python material](python.html).

### Teaching Notes {.notes}

* It isn't necessary to cover [sets and dictionaries](setdict.html)
  before this material,
  but if that has been discussed,
  it's helpful to point out that a relational table is a generalized dictionary.
* Simple calculations are actually easier to do in a spreadsheet;
  the advantages of using a database become clear
  as soon as filtering and joins are needed.
  Instructors may therefore want to show a spreadsheet
  with the information from the four database tables
  consolidated into a single sheet,
  and demonstrate what's needed in both systems to answer questions like,
  "What was the average radiation reading in 1931?"
* Some learners may have heard that NoSQL databases
  (i.e., ones that don't use the relational model)
  are the next big thing,
  and ask why we're not teaching those.
  There are three answers:
    * Relational databases are far more widely used than NoSQL databases.
    * We have far more experience with relational databases
      than with any other kind,
      so we have a better idea of what to teach
      and how to teach it.
    * NoSQL databases are as different from each other
      as they are from relational databases.
      Until a leader emerges,
      it isn't clear *which* NoSQL database we should teach.
* This discussion is a useful companion to that of vectorization
  in the lesson on [numerical computing](numpy.html):
  in both cases,
  the key point is to describe *what* to do,
  and let the computer figure out *how* to do it.

## Selecting {#s:select}

### Learning Objectives {.objectives}

* Explain the difference between a table, a database, and a database manager.
* Explain the difference between a field and a record.
* Select specific fields from specific tables, and display them in a specific order.

Duration: 15 minutes (not including the time required to download the database file and connect to it).

A [relational database](../gloss.html#relational-database)
is a way to store and manipulate information
that is arranged as [tables](../gloss.html#table).
Each table has columns (also known as [fields](../gloss.html#field-database)) which describe the data,
and rows (also known as [records](../gloss.html#record-database)) which contain the data.

When we are using a spreadsheet,
we put formulas into cells to calculate new values based on old ones.
When we are using a database,
we send commands
(usually called [queries](../gloss.html#query))
to a [database manager](../gloss.html#database-manager):
a program that manipulates the database for us.
The database manager does whatever lookups and calculations the query specifies,
returning the results in a tabular form
that we can then use as a starting point for further queries.

> ### Under the Hood {.box}
>
> Every database manager—Oracle,
> IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores
> data in a different way,
> so a database created with one cannot be used directly by another.
> However,
> every database manager can import and export data in a variety of formats,
> so it *is* possible to move information from one to another.

Queries are written in a language called [SQL](../gloss.html#sql),
which stands for "Structured Query Language".
SQL provides hundreds of different ways to analyze and recombine data;
we will only look at a handful,
but that handful accounts for most of what scientists do.

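Queries don't have to be typed at an interactive prompt: any language with an SQL library can send them to a database manager and collect the results. As a minimal sketch (assuming Python's `sqlite3` module, which ships with Python), the program below builds a throwaway in-memory database holding two of the people from this chapter and sends it a query:

```python
import sqlite3

# Build a throwaway in-memory database so the example is self-contained.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# A two-row stand-in for the Person table described in this chapter.
cursor.execute("create table Person(ident text, personal text, family text)")
cursor.execute("insert into Person values ('roe', 'Valentina', 'Roerich')")
cursor.execute("insert into Person values ('danforth', 'Frank', 'Danforth')")

# The query is just a string: the database manager parses it,
# does the lookups, and hands rows back as tuples.
cursor.execute("select family, personal from Person")
results = cursor.fetchall()

connection.close()
```

Here `results` is a list of `(family, personal)` tuples, which a program can loop over or use as the starting point for further queries.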
[Figure 1](#f:survey_db) shows
a simple database that stores some of the data
Gina extracted from the logs of those long-ago expeditions.
It contains four tables:

<figure id="f:survey_db">

-------------------- --------------------------------------------
`Person`             People who took readings.
`Site`               Locations of observation sites.
`Visited`            When readings were taken at specific sites.
`Survey`             The actual measurement values.
-------------------- --------------------------------------------

**Person**

ident                personal             family
-------------------- -------------------- --------------------
dyer                 William              Dyer
pb                   Frank                Pabodie
lake                 Anderson             Lake
roe                  Valentina            Roerich
danforth             Frank                Danforth
-------------------- -------------------- --------------------

**Site**

name                 lat                  long
-------------------- -------------------- --------------------
DR-1                 -49.85               -128.57
DR-3                 -47.15               -126.72
MSK-4                -48.87               -123.40
-------------------- -------------------- --------------------

**Visited**

ident                site                 dated
-------------------- -------------------- --------------------
619                  DR-1                 1927-02-08
622                  DR-1                 1927-02-10
734                  DR-3                 1930-01-07
735                  DR-3                 1930-01-12
751                  DR-3                 1930-02-26
752                  DR-3                 NULL
837                  MSK-4                1932-01-14
844                  DR-1                 1932-03-22
-------------------- -------------------- --------------------

**Survey**

taken                person               quant                reading
-------------------- -------------------- -------------------- --------------------
619                  dyer                 rad                  9.82
619                  dyer                 sal                  0.13
622                  dyer                 rad                  7.80
622                  dyer                 sal                  0.09
734                  pb                   rad                  8.41
734                  lake                 sal                  0.05
734                  pb                   temp                 -21.5
735                  pb                   rad                  7.22
735                  NULL                 sal                  0.06
735                  NULL                 temp                 -26.0
751                  pb                   rad                  4.35
751                  pb                   temp                 -18.5
751                  lake                 sal                  0.1
752                  lake                 rad                  2.19
752                  lake                 sal                  0.09
752                  lake                 temp                 -16.0
752                  roe                  sal                  41.6
837                  lake                 rad                  1.46
837                  lake                 sal                  0.21
837                  roe                  sal                  22.5
844                  roe                  rad                  11.25
-------------------- -------------------- -------------------- --------------------

<figcaption>Figure 1: Survey Database</figcaption>
</figure>

Notice that three entries—one in the `Visited` table,
and two in the `Survey` table—are shown as `NULL`.
We'll return to these values [later](#s:null).

For now,
let's write an SQL query that displays scientists' names.
We do this using the SQL command `select`,
giving it the names of the columns we want and the table we want them from.
Our query and its output look like this:

    sqlite> select family, personal from Person;

    family               personal
    -------------------- --------------------
    Dyer                 William
    Pabodie              Frank
    Lake                 Anderson
    Roerich              Valentina
    Danforth             Frank
    -------------------- --------------------

The semi-colon at the end of the query
tells the database manager that the query is complete and ready to run.
If we enter the query without the semi-colon,
or press 'enter' part-way through the query,
the SQLite interpreter will give us a different prompt
to show us that it's waiting for more input:

    sqlite> select family, personal
       ...> from Person;

    family               personal
    -------------------- --------------------
    Dyer                 William
    Pabodie              Frank
    Lake                 Anderson
    Roerich              Valentina
    Danforth             Frank
    -------------------- --------------------

From now on,
we won't bother to display the prompt(s) with our commands.

> ### Case and Consistency {.box}
>
> We have written our command and the column names in lower case,
> and the table name in title case,
> but we could use any mix:
> SQL is [case insensitive](../gloss.html#case-insensitive),
> so we could write them all in upper case,
> or even like this:
>
>     SeLeCt famILY, PERSonal frOM PERson;
>
> Please don't:
> large SQL queries are hard enough to read
> without the extra cognitive load of random capitalization.

> ### Displaying Results {.box}
>
> Exactly *how* the database displays the query's results
> depends on what kind of interface we are using.
> If we are running SQLite directly from the shell,
> its default output looks like this:
>
>     Dyer|William
>     Pabodie|Frank
>     Lake|Anderson
>     Roerich|Valentina
>     Danforth|Frank
>
> If we are using a graphical interface,
> such as the [SQLite Manager](https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/) plugin for Firefox
> or the [database extension](https://github.com/catherinedevlin/ipython-sql) for the IPython Notebook,
> our output will be displayed graphically
> ([Figure 2](#f:firefox_output)
> and [Figure 3](#f:notebook_output)).
> We'll use a simple table-based display in these notes.
>
> <figure id="f:firefox_output">
> <img src="db/firefox_output.png" alt="Firefox SQLite Manager Output" />
> <figcaption>Figure 2: Firefox SQLite Manager Output</figcaption>
> </figure>
>
> <figure id="f:notebook_output">
> <img src="db/notebook_output.png" alt="IPython Notebook Database Extension Output" />
> <figcaption>Figure 3: IPython Notebook Database Extension Output</figcaption>
> </figure>

Going back to our query,
it's important to understand that
the rows and columns in a database table aren't actually stored in any particular order.
They will always be *displayed* in some order,
but we can control that in various ways.
For example,
we could swap the columns in the output by writing our query as:

    select personal, family from Person;

    personal             family
    -------------------- --------------------
    William              Dyer
    Frank                Pabodie
    Anderson             Lake
    Valentina            Roerich
    Frank                Danforth
    -------------------- --------------------

or even repeat columns:

    select ident, ident, ident from Person;

    ident                ident                ident
    -------------------- -------------------- --------------------
    dyer                 dyer                 dyer
    pb                   pb                   pb
    lake                 lake                 lake
    roe                  roe                  roe
    danforth             danforth             danforth
    -------------------- -------------------- --------------------

We will see ways to rearrange the rows [later](#s:sort).

As a shortcut, we can select all of the columns in a table
using the wildcard `*`:

    select * from Person;

    ident                personal             family
    -------------------- -------------------- --------------------
    dyer                 William              Dyer
    pb                   Frank                Pabodie
    lake                 Anderson             Lake
    roe                  Valentina            Roerich
    danforth             Frank                Danforth
    -------------------- -------------------- --------------------

### Summary {.keypoints}

* A relational database stores information in tables with fields and records.
* A database manager is a program that manipulates a database.
* The commands or queries given to a database manager are usually written in a specialized language called SQL.
* SQL is case insensitive.
* The rows and columns of a database table aren't stored in any particular order.
* Use `select *fields* from *table*` to get all the values for specific fields from a single table.
* Use `select * from *table*` to select everything from a table.

### Challenges {.challenges}

* Write a query that selects only site names from the `Site` table.

* Many people format queries as:

        SELECT personal, family FROM person;

  while others format them as:

        select Personal, Family from PERSON;

  What style do you find easiest to read, and why?

## Removing Duplicates {#s:distinct}

### Learning Objectives {.objectives}

* Write queries that only display distinct results once.

Data is often redundant,
so queries often return redundant information.
For example,
if we select the quantities that have been measured
from the `Survey` table,
the output contains many duplicates:

    select quant from Survey;

    quant
    --------------------
    rad
    sal
    rad
    sal
    rad
    sal
    temp
    rad
    sal
    temp
    rad
    temp
    sal
    rad
    sal
    temp
    sal
    rad
    sal
    sal
    rad
    --------------------

We can eliminate the redundant output
to make the result more readable
by adding the `distinct` keyword
to our query:

    select distinct quant from Survey;

    quant
    --------------------
    rad
    sal
    temp
    --------------------

If we select more than one column—for example,
both the survey site ID and the quantity measured—then
the distinct pairs of values are returned:

    select distinct taken, quant from Survey;

    taken                quant
    -------------------- --------------------
    619                  rad
    619                  sal
    622                  rad
    622                  sal
    734                  rad
    734                  sal
    734                  temp
    735                  rad
    735                  sal
    735                  temp
    751                  rad
    751                  temp
    751                  sal
    752                  rad
    752                  sal
    752                  temp
    837                  rad
    837                  sal
    844                  rad
    -------------------- --------------------

Notice in both cases that duplicates are removed
even if they didn't appear to be adjacent in the database.
Again,
it's important to remember that rows aren't actually ordered:
they're just displayed that way.

### Summary {.keypoints}

* Use `distinct` to eliminate duplicates from a query's output.

### Challenges {.challenges}

* Write a query that selects distinct dates from the `Visited` table.

* If you are using SQLite from the command line,
  you can run a single query by passing it to the interpreter
  right after the path to the database file:

        $ sqlite3 survey.db 'select * from Person;'

        ident                personal             family
        -------------------- -------------------- --------------------
        dyer                 William              Dyer
        pb                   Frank                Pabodie
        lake                 Anderson             Lake
        roe                  Valentina            Roerich
        danforth             Frank                Danforth
        -------------------- -------------------- --------------------

  Fill in the missing commands in the pipeline below
  so that the output contains no redundant values.

        $ sqlite3 survey.db 'select person, quant from Survey;' | ____ | ____

  Do you think this is less efficient, just as efficient, or more efficient
  than using `distinct` for large data sets?

## Filtering {#s:filter}

### Learning Objectives {.objectives}

* Write queries that select records based on the values of their fields.
* Write queries that select records using combinations of several tests on their fields' values.
* Build up complex filtering criteria incrementally.
* Explain the logical order in which filtering by field value and displaying fields takes place.

Duration: 5-10 minutes.

One of the most powerful features of a database is
the ability to [filter](../gloss.html#filter) data,
i.e.,
to select only those records that match certain criteria.
For example,
suppose we want to see when a particular site was visited.
We can select these records from the `Visited` table
by using a `where` clause in our query:

    select * from Visited where site='DR-1';

    ident                site                 dated
    -------------------- -------------------- --------------------
    619                  DR-1                 1927-02-08
    622                  DR-1                 1927-02-10
    844                  DR-1                 1932-03-22
    -------------------- -------------------- --------------------

The database manager executes this query in two stages
([Figure 4](#f:pipeline_where)).
First,
it checks each row in the `Visited` table
to see which ones satisfy the `where` condition.
It then uses the column names following the `select` keyword
to determine which columns to display.

<figure id="f:pipeline_where">
<img src="db/pipeline_where.png" alt="Two-Stage Query Processing Pipeline" />
<figcaption>Figure 4: Two-Stage Query Processing Pipeline</figcaption>
</figure>

This processing order means that
we can filter records using `where`
based on values in columns that aren't then displayed:

    select ident from Visited where site='DR-1';

    ident
    --------------------
    619
    622
    844
    --------------------

We can use many other Boolean operators to filter our data.
For example,
we can ask for all information from the DR-1 site collected since 1930:

    select * from Visited where (site='DR-1') and (dated>='1930-00-00');

    ident                site                 dated
    -------------------- -------------------- --------------------
    844                  DR-1                 1932-03-22
    -------------------- -------------------- --------------------

(The parentheses around the individual tests aren't strictly required,
but they help make the query easier to read.)

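The same kind of filtering works from inside a program, where it's safer to let the database manager substitute values into the query itself. The sketch below (using Python's `sqlite3` module; the three visit records are invented for the example) uses `?` placeholders instead of pasting the values into the query string by hand:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table Visited(ident integer, site text, dated text)")

# Three invented visit records, just for this sketch.
cursor.executemany("insert into Visited values (?, ?, ?)",
                   [(619, 'DR-1', '1927-02-08'),
                    (734, 'DR-3', '1930-01-07'),
                    (844, 'DR-1', '1932-03-22')])

# '?' placeholders let the database manager fill in the values,
# rather than us building the query string ourselves.
cursor.execute("select ident from Visited where (site = ?) and (dated >= ?)",
               ('DR-1', '1930-00-00'))
matches = cursor.fetchall()

connection.close()
```

Only the visit that is both at DR-1 and dated 1930 or later survives both tests.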
> ### Working With Dates {#a:dates .box}
>
> Most database managers have a special data type for dates.
> In fact, many have two:
> one for moments in time,
> such as "May 31, 1971",
> and one for durations,
> such as "31 days".
> SQLite doesn't have a date type as such;
> instead,
> it stores dates as either text
> (in the ISO-8601 standard format "YYYY-MM-DD HH:MM:SS.SSSS"),
> real numbers
> (the number of days since November 24, 4714 BCE),
> or integers
> (the number of seconds since midnight, January 1, 1970).
> If this sounds complicated,
> it is,
> but not nearly as complicated as figuring out
> [historical dates in Sweden](http://en.wikipedia.org/wiki/Swedish_calendar).

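We can see SQLite's date handling in action from a short program. The sketch below (using Python's `sqlite3` module) asks SQLite's built-in `julianday` and `date` functions to do arithmetic on ISO-8601 text dates:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Subtracting Julian day numbers gives the number of days between two dates.
cursor.execute("select julianday('1930-01-12') - julianday('1930-01-07')")
days_between = cursor.fetchone()[0]

# The date() function applies modifiers such as '+5 days' to a text date.
cursor.execute("select date('1930-01-07', '+5 days')")
later = cursor.fetchone()[0]

connection.close()
```

Here `days_between` is 5.0 and `later` is the text date '1930-01-12': the dates are stored as plain text, but the functions interpret them on the fly.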
If we want to find out what measurements were taken by either Lake or Roerich,
we can combine the tests on their names using `or`:

    select * from Survey where person='lake' or person='roe';

    taken                person               quant                reading
    -------------------- -------------------- -------------------- --------------------
    734                  lake                 sal                  0.05
    751                  lake                 sal                  0.1
    752                  lake                 rad                  2.19
    752                  lake                 sal                  0.09
    752                  lake                 temp                 -16.0
    752                  roe                  sal                  41.6
    837                  lake                 rad                  1.46
    837                  lake                 sal                  0.21
    837                  roe                  sal                  22.5
    844                  roe                  rad                  11.25
    -------------------- -------------------- -------------------- --------------------

Alternatively,
we can use `in` to see if a value is in a specific set:

    select * from Survey where person in ('lake', 'roe');

    taken                person               quant                reading
    -------------------- -------------------- -------------------- --------------------
    734                  lake                 sal                  0.05
    751                  lake                 sal                  0.1
    752                  lake                 rad                  2.19
    752                  lake                 sal                  0.09
    752                  lake                 temp                 -16.0
    752                  roe                  sal                  41.6
    837                  lake                 rad                  1.46
    837                  lake                 sal                  0.21
    837                  roe                  sal                  22.5
    844                  roe                  rad                  11.25
    -------------------- -------------------- -------------------- --------------------

We can combine `and` with `or`,
but we need to be careful about which operator is executed first.
If we *don't* use parentheses,
we get this:

    select * from Survey where quant='sal' and person='lake' or person='roe';

    taken                person               quant                reading
    -------------------- -------------------- -------------------- --------------------
    734                  lake                 sal                  0.05
    751                  lake                 sal                  0.1
    752                  lake                 sal                  0.09
    752                  roe                  sal                  41.6
    837                  lake                 sal                  0.21
    837                  roe                  sal                  22.5
    844                  roe                  rad                  11.25
    -------------------- -------------------- -------------------- --------------------

which is salinity measurements by Lake,
and *any* measurement by Roerich.
We probably want this instead:

    select * from Survey where quant='sal' and (person='lake' or person='roe');

    taken                person               quant                reading
    -------------------- -------------------- -------------------- --------------------
    734                  lake                 sal                  0.05
    751                  lake                 sal                  0.1
    752                  lake                 sal                  0.09
    752                  roe                  sal                  41.6
    837                  lake                 sal                  0.21
    837                  roe                  sal                  22.5
    -------------------- -------------------- -------------------- --------------------

Finally,
we can use `distinct` with `where`
to give a second level of filtering:

    select distinct person, quant from Survey where person='lake' or person='roe';

    person               quant
    -------------------- --------------------
    lake                 sal
    lake                 rad
    lake                 temp
    roe                  sal
    roe                  rad
    -------------------- --------------------

But remember:
`distinct` is applied to the values displayed in the chosen columns,
not to the entire rows as they are being processed.

> ### Growing Queries {.box}
>
> What we have just done is how most people "grow" their SQL queries.
> We started with something simple that did part of what we wanted,
> then added more clauses one by one,
> testing their effects as we went.
> This is a good strategy—in fact,
> for complex queries it's often the *only* strategy—but
> it depends on quick turnaround,
> and on us recognizing the right answer when we get it.
>
> The best way to achieve quick turnaround is often
> to put a subset of data in a temporary database
> and run our queries against that,
> or to fill a small database with synthesized records.
> For example,
> instead of trying our queries against an actual database of 20 million Australians,
> we could run them against a sample of ten thousand,
> or write a small program to generate ten thousand random (but plausible) records
> and use that.

### Summary {.keypoints}

* Use `where *test*` in a query to filter records based on Boolean tests.
* Use `and` and `or` to combine tests.
* Use `in` to check if a value is in a set.
* Build up queries a bit at a time, and test them against small data sets.

### Challenges {.challenges}

* Gina wants to select all sites that lie within 30° of the equator.
  She writes the query:

        select * from Site where (lat > -30) or (lat < 30);

  Explain why this is wrong,
  and rewrite the query so that it is correct.

* Normalized salinity readings are supposed to be between 0.0 and 1.0.
  Write a query that selects all records from `Survey`
  with salinity values outside this range.

* The SQL test `*column-name* like *pattern*`
  is true if the value in the named column
  matches the pattern given;
  the character '%' can be used any number of times in the pattern
  to mean "match zero or more characters".

  Expression              Value
  ----------------------- --------------------
  `'a' like 'a'`          `True`
  `'a' like '%a'`         `True`
  `'b' like '%a'`         `False`
  `'alpha' like 'a%'`     `True`
  `'alpha' like 'a%p%'`   `True`
  `'beta' like 'a%p%'`    `False`
  ----------------------- --------------------

  The expression `*column-name* not like *pattern*`
  inverts the test.
  Using `not like`,
  write a query that finds all the records in `Visited`
  that *aren't* from sites labelled 'DR-something'.

## Calculating New Values {#s:calc}

### Learning Objectives {.objectives}

* Write queries that do arithmetic using the values in individual records.

After carefully reading the expedition logs,
Gina realizes that the radiation measurements they report
may need to be corrected upward by 5%.
Rather than modifying the stored data,
she can do this calculation on the fly
as part of her query:

    select 1.05 * reading from Survey where quant='rad';

    1.05 * reading
    --------------------
    10.311
    8.19
    8.8305
    7.581
    4.5675
    2.2995
    1.533
    11.8125
    --------------------

When we run the query,
the expression `1.05 * reading` is evaluated for each row.
Expressions can use any of the fields,
all of the usual arithmetic operators,
and a variety of common functions.
(Exactly which ones depends on which database manager is being used.)
For example,
we can convert temperature readings from Fahrenheit to Celsius
and round to two decimal places as follows:

    select taken, round(5*(reading-32)/9, 2) from Survey where quant='temp';

    taken                round(5*(reading-32)/9, 2)
    -------------------- --------------------------
    734                  -29.72
    735                  -32.22
    751                  -28.06
    752                  -26.67
    -------------------- --------------------------

We can also combine values from different fields,
for example by using the string concatenation operator `||`:

    select personal || ' ' || family from Person;

    personal || ' ' || family
    -------------------------
    William Dyer
    Frank Pabodie
    Anderson Lake
    Valentina Roerich
    Frank Danforth
    -------------------------

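The same per-row evaluation happens when queries are sent from a program. The sketch below (using Python's `sqlite3` module; the two radiation readings are invented for the example) applies the 5% correction inside the query rather than in Python, leaving the stored data untouched:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table Survey(taken integer, person text, quant text, reading real)")

# Two invented radiation readings, just for this sketch.
cursor.executemany("insert into Survey values (?, ?, ?, ?)",
                   [(619, 'dyer', 'rad', 9.82),
                    (622, 'dyer', 'rad', 7.80)])

# The expression is evaluated once for each row the query keeps;
# the stored readings are not modified.
cursor.execute("select round(1.05 * reading, 2) from Survey "
               "where quant = 'rad' order by taken")
corrected = [row[0] for row in cursor.fetchall()]

connection.close()
```

`corrected` holds the two scaled readings (10.31 and 8.19), while the table still contains the original values.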
> ### A Note on Names {.box}
>
> It may seem strange to use `personal` and `family` as field names
> instead of `first` and `last`,
> but it's a necessary first step toward handling cultural differences.
> For example,
> consider the following rules:
>
> Full Name                        Alphabetized Under   Reason
> -------------------------------- -------------------- ----------------------------------------------------------------
> Liu Xiaobo                       Liu                  Chinese family names come first
> Leonardo da Vinci                Leonardo             "da Vinci" just means "from Vinci"
> Catherine de Medici              Medici               family name
> Jean de La Fontaine              La Fontaine          family name is "La Fontaine"
> Juan Ponce de Leon               Ponce de Leon        full family name is "Ponce de Leon"
> Gabriel Garcia Marquez           Garcia Marquez       double-barrelled Spanish surnames
> Wernher von Braun                von *or* Braun       depending on whether he was in Germany or the US
> Elizabeth Alexandra May Windsor  Elizabeth            monarchs alphabetize by the name under which they reigned
> Thomas a Beckett                 Thomas               and saints according to the names by which they were canonized
> -------------------------------- -------------------- ----------------------------------------------------------------
>
> As you can see,
> even a two-part division into "personal" and "family"
> isn't always enough.

### Summary {.keypoints}

* Use expressions as fields to calculate per-record values.

### Challenges {.challenges}

* After further reading,
  Gina realizes that Valentina Roerich
  was reporting salinity as percentages.
  Write a query that returns all of her salinity measurements
  from the `Survey` table
  with the values divided by 100.

* The `union` operator combines the results of two queries:

        select * from Person where ident='dyer' union select * from Person where ident='roe';

        ident                personal             family
        -------------------- -------------------- --------------------
        dyer                 William              Dyer
        roe                  Valentina            Roerich
        -------------------- -------------------- --------------------

  Use `union` to create a consolidated list of salinity measurements
  in which Roerich's, and only Roerich's,
  have been corrected as described in the previous challenge.
  The output should be something like:

        taken                reading
        -------------------- --------------------
        619                  0.13
        622                  0.09
        734                  0.05
        751                  0.1
        752                  0.09
        752                  0.416
        837                  0.21
        837                  0.225
        -------------------- --------------------

* The site identifiers in the `Visited` table have two parts
  separated by a '-':

        select distinct site from Visited;

        site
        --------------------
        DR-1
        DR-3
        MSK-4
        --------------------

  Some major site identifiers are two letters long and some are three.
  The "in string" function `instr(X, Y)`
  returns the 1-based index of the first occurrence of string Y in string X,
  or 0 if Y does not exist in X.
  The substring function `substr(X, I)`
  returns the substring of X starting at index I.
  Use these two functions to produce a list of unique major site identifiers.
  (For this data,
  the list should contain only "DR" and "MSK".)

* Pabodie's journal notes that all his temperature measurements
  were in °F,
  but Lake's journal does not report whether he used °F or °C.
  How should Gina treat his measurements,
  and why?

## Ordering Results {#s:sort}

### Learning Objectives {.objectives}

* Write queries that order results according to fields' values.
* Write queries that order results according to calculated values.
* Explain why it is possible to sort records using the values of fields that are not displayed.

Duration: 5 minutes.

As we mentioned earlier,
database records are not stored in any particular order.
This means that query results aren't necessarily sorted,
and even if they are,
we often want to sort them in a different way,
e.g., by the name of the project instead of by the name of the scientist.
We can do this in SQL by adding an `order by` clause to our query:

    select reading from Survey where quant='rad' order by reading;

    reading
    --------------------
    1.46
    2.19
    4.35
    7.22
    7.8
    8.41
    9.82
    11.25
    --------------------

By default,
results are sorted in ascending order
(i.e.,
from least to greatest).
We can sort in the opposite order using `desc` (for "descending"):

    select reading from Survey where quant='rad' order by reading desc;

    reading
    --------------------
    11.25
    9.82
    8.41
    7.8
    7.22
    4.35
    2.19
    1.46
    --------------------

(And if we want to make it clear that we're sorting in ascending order,
we can use `asc` instead of `desc`.)

We can also sort on several fields at once.
For example,
this query sorts results first in ascending order by `taken`,
and then in descending order by `person`
within each group of equal `taken` values:

    select taken, person from Survey order by taken asc, person desc;

    taken                person
    -------------------- --------------------
    619                  dyer
    619                  dyer
    622                  dyer
    622                  dyer
    734                  pb
    734                  pb
    734                  lake
    735                  pb
    735                  
    735                  
    751                  pb
    751                  pb
    751                  lake
    752                  roe
    752                  lake
    752                  lake
    752                  lake
    837                  roe
    837                  lake
    837                  lake
    844                  roe
    -------------------- --------------------

This is easier to understand if we also remove duplicates:

    select distinct taken, person from Survey order by taken asc, person desc;

    taken                person
    -------------------- --------------------
    619                  dyer
    622                  dyer
    734                  pb
    734                  lake
    735                  pb
    735                  
    751                  pb
    751                  lake
    752                  roe
    752                  lake
    837                  roe
    837                  lake
    844                  roe
    -------------------- --------------------

Since sorting happens before columns are filtered,
we can sort by a field that isn't actually displayed:

    select reading from Survey where quant='rad' order by taken;

    reading
    --------------------
    9.82
    7.8
    8.41
    7.22
    4.35
    2.19
    1.46
    11.25
    --------------------

We can also sort results by the value of an expression.
In SQLite,
for example,
the `random` function returns a pseudo-random integer
each time it is called:

    select random(), ident from Person;

    random()             ident
    -------------------- --------------------
    -6309766557809954936 dyer
    -2098461436941487136 pb
    -2248225962969032314 lake
    6062184424509295966  roe
    -1268956870222271271 danforth
    -------------------- --------------------

So to randomize the order of our query results,
e.g., when doing clinical trials,
we can sort them by the value of this function:

    select ident from Person order by random();

    ident
    --------------------
    pb
    danforth
    dyer
    lake
    roe
    --------------------

    select ident from Person order by random();

    ident
    --------------------
    danforth
    dyer
    roe
    pb
    lake
    --------------------

Our query pipeline now has four stages
([Figure 5](#f:pipeline_sort_distinct)):

* Select the rows that pass the `where` criteria.
* Sort them if required.
* Filter the columns according to the `select` criteria.
* Remove duplicates if required.

<figure id="f:pipeline_sort_distinct">
<img src="db/pipeline_sort_distinct.png" alt="Four-Stage Query Processing Pipeline" />
<figcaption>Figure 5: Four-Stage Query Processing Pipeline</figcaption>
</figure>

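To make the order of operations concrete, here is the same four-stage pipeline written out as plain Python over a handful of invented records. This is an illustration of the logical order only, not of how a real database manager is implemented:

```python
# Emulate: select distinct person from Survey where quant='rad' order by person;
rows = [
    {'taken': 619, 'person': 'dyer', 'quant': 'rad'},
    {'taken': 622, 'person': 'dyer', 'quant': 'rad'},
    {'taken': 735, 'person': 'pb',   'quant': 'sal'},
    {'taken': 751, 'person': 'pb',   'quant': 'rad'},
]

kept = [r for r in rows if r['quant'] == 'rad']   # 1. apply the where test
kept.sort(key=lambda r: r['person'])              # 2. sort if required
chosen = [r['person'] for r in kept]              # 3. keep the selected columns
result = []                                       # 4. remove duplicates
for value in chosen:
    if value not in result:
        result.append(value)
```

Running the stages in this order is what makes it possible to filter and sort on columns that are not displayed: they are still present in steps 1 and 2, and only dropped in step 3.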
### Summary {.keypoints}

* Use `order by` (with `asc` or `desc`) to order a query's results.
* Use `random` to generate pseudo-random numbers.

### Challenges {.challenges}

* Create a list of site identifiers
  and their distances from the equator in kilometers,
  sorted from furthest to closest.
  (A degree of latitude corresponds to 111.12 km.)

* Gina needs a list of radiation measurements from all sites,
  sorted by when they were taken.
  The query:

        select * from Survey where quant='rad' order by taken;

  produces the correct answer for the data used in our examples.
  Explain when and why it might produce the wrong answer.

## Missing Data {#s:null}

### Learning Objectives {.objectives}

* Explain what databases use the special value `NULL` to represent.
* Explain why databases should *not* use their own special values (like 9999 or "N/A") to represent missing or unknown data.
* Explain what atomic and aggregate calculations involving `NULL` produce, and why.
* Write queries that include or exclude records containing `NULL`.

Duration: 10-20 minutes
(depending on whether or not the instructor includes an anecdote about
what happens when you *don't* take missing data into account).

Real-world data is never complete—there are always holes.
Databases represent these holes using a special value called `null`.
`null` is not zero, `False`, or the empty string;
it is a one-of-a-kind value that means "nothing here".
Dealing with `null` requires a few special tricks
and some careful thinking.

To start,
let's have a look at the `Visited` table.
There are eight records,
but #752 doesn't have a date—or rather,
its date is null:

    select * from Visited;

    ident                site                 dated
    -------------------- -------------------- --------------------
    619                  DR-1                 1927-02-08
    622                  DR-1                 1927-02-10
    734                  DR-3                 1930-01-07
    735                  DR-3                 1930-01-12
    751                  DR-3                 1930-02-26
    752                  DR-3                 
    837                  MSK-4                1932-01-14
    844                  DR-1                 1932-03-22
    -------------------- -------------------- --------------------

> ### Displaying Nulls {.box}
>
> Different databases display nulls differently.
> For example,
> SQLite's default is to print nothing at all,
> which makes nulls easy to overlook
> (particularly if they're in the middle of a long row).

Null doesn't behave like other values.
If we select the records that come before 1930:

    select * from Visited where dated<'1930-00-00';

    ident                site                 dated
    -------------------- -------------------- --------------------
    619                  DR-1                 1927-02-08
    622                  DR-1                 1927-02-10
    -------------------- -------------------- --------------------

we get two results,
and if we select the ones that come during or after 1930:

    select * from Visited where dated>='1930-00-00';

    ident                site                 dated
    -------------------- -------------------- --------------------
    734                  DR-3                 1930-01-07
    735                  DR-3                 1930-01-12
    751                  DR-3                 1930-02-26
    837                  MSK-4                1932-01-14
    844                  DR-1                 1932-03-22
    -------------------- -------------------- --------------------

we get five,
but record #752 isn't in either set of results.
The reason is that the comparison `null<'1930-00-00'`
is neither true nor false:
null means, "We don't know,"
and if we don't know the value on the left side of a comparison,
we don't know whether the comparison is true or false.
Since databases represent "don't know" as null,
the value of `null<'1930-00-00'`
is actually `null`.
`null>='1930-00-00'` is also null,
because we can't answer that question either.
And since the only records kept by a `where`
are those for which the test is true,
record #752 isn't included in either set of results.

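This behaviour is easy to verify from a program. The sketch below (using Python's `sqlite3` module, with two invented visit records, one of whose dates is Python's `None`, which is stored as SQL `null`) shows that the null date fails both comparisons but is matched by an explicit `is null` test:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table Visited(ident integer, site text, dated text)")

# Two invented records; Python's None is stored as SQL null.
cursor.executemany("insert into Visited values (?, ?, ?)",
                   [(751, 'DR-3', '1930-02-26'),
                    (752, 'DR-3', None)])

# The record with the null date fails both comparisons...
before = cursor.execute(
    "select ident from Visited where dated < '1930-00-00'").fetchall()
after = cursor.execute(
    "select ident from Visited where dated >= '1930-00-00'").fetchall()

# ...but is caught by an explicit 'is null' test.
missing = cursor.execute(
    "select ident from Visited where dated is null").fetchall()

connection.close()
```

`before` is empty, `after` contains only record 751, and only the `is null` query finds record 752.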
Comparisons aren't the only operations that behave this way with nulls.
`1+null` is `null`,
`5*null` is `null`,
`log(null)` is `null`,
and so on.
In particular,
comparing things to null with = and != produces null:

    select * from Visited where dated=NULL;

produces no output, and neither does:

    select * from Visited where dated!=NULL;

To check whether a value is `null` or not,
we must use a special test `is null`:

    select * from Visited where dated is NULL;

    ident                site                 dated
    -------------------- -------------------- --------------------
    752                  DR-3                 
    -------------------- -------------------- --------------------

or its inverse `is not null`:

    select * from Visited where dated is not NULL;

    ident                site                 dated
    -------------------- -------------------- --------------------
    619                  DR-1                 1927-02-08
    622                  DR-1                 1927-02-10
    734                  DR-3                 1930-01-07
    735                  DR-3                 1930-01-12
    751                  DR-3                 1930-02-26
    837                  MSK-4                1932-01-14
    844                  DR-1                 1932-03-22
    -------------------- -------------------- --------------------

1476 Null values cause headaches wherever they appear.
For example,
suppose we want to find all of the salinity measurements
that weren't taken by Lake.
1480 It's natural to write the query like this:
1483 select * from Survey where quant='sal' and person!='lake';
1488 -------------------- -------------------- -------------------- --------------------
1493 -------------------- -------------------- -------------------- --------------------
but this query omits the records
where we don't know who took the measurement.
The reason is that when `person` is `null`,
the `!=` comparison produces `null`,
so the record isn't kept in our results.
1503 If we want to keep these records
1504 we need to add an explicit check:
1507 select * from Survey where quant='sal' and (person!='lake' or person is null);
1512 -------------------- -------------------- -------------------- --------------------
1518 -------------------- -------------------- -------------------- --------------------
1523 We still have to decide whether this is the right thing to do or not.
1524 If we want to be absolutely sure that
1525 we aren't including any measurements by Lake in our results,
1526 we need to exclude all the records for which we don't know who did the work.
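As a sanity check,
here is the same pair of queries run from Python's `sqlite3` module
against a three-row sample of `Survey`
(one row each for Dyer, Lake, and an unknown scientist);
this is a sketch for experimentation,
not part of the chapter's database:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Survey(taken integer, person text, quant text, reading real)")
db.executemany("insert into Survey values(?, ?, ?, ?)",
               [(619, "dyer", "sal", 0.13),
                (734, "lake", "sal", 0.05),
                (735, None,   "sal", 0.06)])

# The naive test silently drops the record whose person is null...
strict = db.execute(
    "select taken from Survey where quant='sal' and person!='lake'").fetchall()

# ...while the explicit null check keeps it.
lenient = db.execute(
    "select taken from Survey where quant='sal' "
    "and (person!='lake' or person is null)").fetchall()

print(strict)   # [(619,)]
print(lenient)  # [(619,), (735,)]
```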
1528 > ### What Happens When You Forget {.box}
1530 > Several years ago,
1531 > I was helping a group who were looking at
1532 > the spread of drug-resistant tuberculosis (DRTB)
1533 > in industrialized countries.
> In particular,
> they wanted to know if it was spreading faster among less affluent people.
1537 > We tackled the problem by combining two data sets.
1538 > The first gave us skin and blood test results for DRTB
1539 > along with patients' postal codes
1540 > (the only identifying information we were allowed---we didn't even have gender).
1541 > The second was Canadian census data that gave us
1542 > median income per postal code.
> Since a postal code covers about 300-800 people,
1544 > we felt justified in joining the first with the second
1545 > to estimate incomes for people with positive and negative test results.
> We didn't find a correlation between income and infection.
1549 > We were just about to publish when someone spotted the mistake I'd made.
1551 > Question: Who *doesn't* have a postal code?
1553 > Answer: Homeless people.
1555 > When I did the join,
1556 > I was throwing away homeless people,
1557 > which introduced a statistically significant error in my results.
1558 > But I couldn't just set the income of anyone without a postal code to zero,
1559 > because our sample included another set of people without postal codes:
1560 > 16-21 year olds whose addresses were suppressed
1561 > because they had tested positive for sexually-transmitted diseases.
> At this point the problem is no longer a database issue,
> but rather a question of statistics.
> The lesson, though, is that
> checking your queries when you're programming is as important as
> checking your samples when you're doing chemistry.
1569 ### Summary {.keypoints}
1571 * Use `null` in place of missing information.
1572 * Almost every operation involving `null` produces `null` as a result.
1573 * Test for nulls using `is null` and `is not null`.
1575 ### Challenges {.challenges}
1577 * Write a query that sorts the records in `Visited` by date,
omitting entries for which the date is not known (i.e., is null).
* What do you expect this query to produce:
1584 select * from Visited where dated in ('1927-02-08', null);
1588 What does it actually produce?
1590 * Some database designers prefer to use
1591 a [sentinel value](../gloss.html#sentinel-value)
1592 to mark missing data rather than `null`.
For example, they will use the date "0000-00-00" to mark a missing date,
1595 or -1.0 to mark a missing salinity or radiation reading
1596 (since actual readings cannot be negative).
1597 What does this simplify?
1598 What burdens or risks does it introduce?
1600 ## Aggregation {#s:aggregate}
1602 ### Learning Objectives {.objectives}
1604 * Write queries that combine values from many records to create a single aggregate value.
1605 * Write queries that put records into groups based on their values.
1606 * Write queries that combine values group by group.
1607 * Explain what is displayed for *unaggregated* fields when some fields are aggregated.
1609 Duration: 10 minutes.
1613 Gina now wants to calculate ranges and averages for her data.
1614 She knows how to select all of the dates from the `Visited` table:
1617 select dated from Visited;
1622 --------------------
1631 --------------------
1635 but to combine them,
1636 she must use an [aggregation function](../gloss.html#aggregation-function)
1637 such as `min` or `max`.
1638 Each of these functions takes a set of records as input,
1639 and produces a single record as output:
1642 select min(dated) from Visited;
1647 --------------------
1649 --------------------
1654 select max(dated) from Visited;
1659 --------------------
1661 --------------------
1665 `min` and `max` are just two of
1666 the aggregation functions built into SQL.
Three others are `avg`, `count`, and `sum`:
1672 select avg(reading) from Survey where quant='sal';
1677 --------------------
1679 --------------------
1684 select count(reading) from Survey where quant='sal';
1689 --------------------
1691 --------------------
1696 select sum(reading) from Survey where quant='sal';
1701 --------------------
1703 --------------------
1707 We used `count(reading)` here,
1708 but we could just as easily have counted `quant`
1709 or any other field in the table,
1710 or even used `count(*)`,
1711 since the function doesn't care about the values themselves,
1712 just how many values there are.
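One caveat:
`count(field)` counts *values*,
and as we will see below,
null is not a value,
so `count(*)` and `count(field)` only agree when the field contains no nulls.
A short Python `sqlite3` sketch with made-up rows shows the difference:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Survey(taken integer, person text, quant text, reading real)")
db.executemany("insert into Survey values(?, ?, ?, ?)",
               [(619, "dyer", "sal", 0.13),
                (622, "dyer", "sal", 0.09),
                (735, None,   "sal", 0.06)])

rows = db.execute("select count(*) from Survey").fetchone()[0]
people = db.execute("select count(person) from Survey").fetchone()[0]

print(rows)    # 3: count(*) counts records
print(people)  # 2: count(person) skips the null
```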
1714 SQL lets us do several aggregations at once.
We can, for example,
find the range of sensible salinity measurements:
1720 select min(reading), max(reading) from Survey where quant='sal' and reading<=1.0;
1725 -------------------- --------------------
1727 -------------------- --------------------
1731 We can also combine aggregated results with raw results,
1732 although the output might surprise you:
1735 select person, count(*) from Survey where quant='sal' and reading<=1.0;
1740 -------------------- --------------------
1742 -------------------- --------------------
1746 Why does Lake's name appear rather than Roerich's or Dyer's?
1747 The answer is that when it has to aggregate a field,
1748 but isn't told how to,
1749 the database manager chooses an actual value from the input set.
1750 It might use the first one processed,
the last one,
or something else entirely.
1754 Another important fact is that when there are no values to aggregate,
1755 aggregation's result is "don't know"
1756 rather than zero or some other arbitrary value:
1759 select person, max(reading), sum(reading) from Survey where quant='missing';
1762 One final important feature of aggregation functions is that
1763 they are inconsistent with the rest of SQL in a very useful way.
If we add two values,
and one of them is null,
the result is null.
By extension,
if we use `sum` to add all the values in a set,
and any of those values are null,
the result should also be null.
It's much more useful, though,
for aggregation functions to ignore null values
and only combine those that are non-null.
1775 This behavior lets us write our queries as:
1778 select min(dated) from Visited;
1783 --------------------
1785 --------------------
1789 instead of always having to filter explicitly:
1792 select min(dated) from Visited where dated is not null;
1797 --------------------
1799 --------------------
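To see both rules at once
(nulls skipped when other values exist,
null returned when there is nothing to aggregate),
we can run a quick Python `sqlite3` sketch;
the site `XX-9` is made up so that its query matches no rows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Visited(ident integer, site text, dated text)")
db.executemany("insert into Visited values(?, ?, ?)",
               [(619, "DR-1", "1927-02-08"),
                (844, "DR-1", "1932-03-22"),
                (752, "DR-3", None)])

# min ignores the null date, so no "is not null" filter is needed.
earliest = db.execute("select min(dated) from Visited").fetchone()[0]

# With no rows to aggregate at all, the result is null (None in Python).
empty = db.execute(
    "select max(dated) from Visited where site='XX-9'").fetchone()[0]

print(earliest)  # 1927-02-08
print(empty)     # None
```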
1803 ### Summary {.keypoints}
1805 * Use aggregation functions like `sum` and `max` to combine query results.
* Use the `count` function to count the number of results.
1807 * If some fields are aggregated and others are not, the database manager chooses an arbitrary result for the unaggregated field.
1808 * Most aggregation functions skip nulls when combining values.
1810 ### Challenges {.challenges}
1812 * How many temperature readings did Frank Pabodie record,
1813 and what was their average value?
1815 * The average of a set of values is the sum of the values
1816 divided by the number of values.
1817 Does this mean that the `avg` function returns 2.0 or 3.0
1818 when given the values 1.0, `null`, and 5.0?
1820 * Gina wants to calculate the difference between
1821 each individual radiation reading
1822 and the average of all the radiation readings.
1823 She writes the query:
1826 select reading-avg(reading) from Survey where quant='rad';
1829 What does this actually produce, and why?
1831 * The function `group_concat(field, separator)`
1832 concatenates all the values in a field
1833 using the specified separator character
1834 (or ',' if the separator isn't specified).
Use this to produce a one-line list of scientists' names, such as:
1839 William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth
1842 Can you find a way to order the list by surname?
1844 ## Grouping {#s:grouping}
1846 ### Learning Objectives {.objectives}
1848 * Group results to be aggregated separately.
1849 * Explain when grouping occurs in the processing pipeline.
1851 Duration: 5 minutes.
1855 Aggregating all records at once doesn't always make sense.
For example,
suppose Gina suspects that there is a systematic bias in her data,
1858 and that some scientists' radiation readings are higher than others.
1859 We know that this doesn't work:
select person, count(reading), round(avg(reading), 2)
from Survey where quant='rad';
1869 -------------------- -------------------- --------------------
1871 -------------------- -------------------- --------------------
1875 because the database manager selects a single arbitrary scientist's name
1876 rather than aggregating separately for each scientist.
1877 Since there are only five scientists,
1878 she could write five queries of the form:
1881 select person, count(reading), round(avg(reading), 2)
1889 -------------------- -------------------- --------------------
1891 -------------------- -------------------- --------------------
1895 but this would be tedious,
1896 and if she ever had a data set with fifty or five hundred scientists,
the chances of her getting all of those queries right are small.
1899 What we need to do is
tell the database manager to aggregate the readings for each scientist separately
1901 using a `group by` clause:
select person, count(reading), round(avg(reading), 2)
from Survey where quant='rad'
group by person;
1912 -------------------- -------------------- --------------------
1917 -------------------- -------------------- --------------------
1921 `group by` does exactly what its name implies:
1922 groups all the records with the same value for the specified field together
1923 so that aggregation can process each batch separately.
1924 Since all the records in each batch have the same value for `person`,
1925 it no longer matters that the database manager
1926 is picking an arbitrary one to display
1927 alongside the aggregated `reading` values
1928 ([Figure 6](#f:grouped_aggregation)).
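Here is the grouped query in miniature,
using Python's `sqlite3` module
and three radiation readings borrowed from the survey data
(the person attributions are illustrative);
the result has one aggregated row per scientist:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Survey(taken integer, person text, quant text, reading real)")
db.executemany("insert into Survey values(?, ?, ?, ?)",
               [(619, "dyer", "rad", 9.82),
                (622, "dyer", "rad", 7.8),
                (734, "pb",   "rad", 8.41)])

# One aggregated row per distinct value of person.
results = db.execute(
    "select person, count(reading), round(avg(reading), 2) "
    "from Survey group by person").fetchall()

for person, n, mean in sorted(results):
    print(person, n, mean)
```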
1930 <figure id="f:grouped_aggregation">
1931 <img src="db/grouped_aggregation.png" alt="Grouped Aggregation" />
1932 <figcaption>Figure 6: Grouped Aggregation</figcaption>
1935 Just as we can sort by multiple criteria at once,
1936 we can also group by multiple criteria.
1937 To get the average reading by scientist and quantity measured,
1939 we just add another field to the `group by` clause:
1942 select person, quant, count(reading), round(avg(reading), 2)
from Survey
group by person, quant;
1949 -------------------- -------------------- -------------------- --------------------
1961 -------------------- -------------------- -------------------- --------------------
Note that we have added `quant` to the list of fields displayed,
since the results wouldn't make much sense otherwise.
1968 Let's go one step further and remove all the entries
1969 where we don't know who took the measurement:
1972 select person, quant, count(reading), round(avg(reading), 2)
from Survey
where person is not null
1975 group by person, quant
1976 order by person, quant;
1981 -------------------- -------------------- -------------------- --------------------
1991 -------------------- -------------------- -------------------- --------------------
Looking more closely,
we can see that this query:
1998 * selected records from the `Survey` table
1999 where the `person` field was not null;
2001 * grouped those records into subsets
so that the `person` and `quant` values in each subset
were the same;
2005 * ordered those subsets first by `person`,
2006 and then within each sub-group by `quant`;
2009 * counted the number of records in each subset,
2010 calculated the average `reading` in each,
2011 and chose a `person` and `quant` value from each
2012 (it doesn't matter which ones,
2013 since they're all equal).
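The four stages above can be replayed with Python's `sqlite3` module
on a handful of rows
(a sketch, with readings borrowed from the survey data
and illustrative person attributions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Survey(taken integer, person text, quant text, reading real)")
db.executemany("insert into Survey values(?, ?, ?, ?)",
               [(619, "dyer", "rad", 9.82),
                (619, "dyer", "sal", 0.13),
                (735, None,   "sal", 0.06),
                (752, "lake", "rad", 2.19)])

# Select (where), group (group by), order (order by), then aggregate.
rows = db.execute(
    "select person, quant, count(reading) from Survey "
    "where person is not null "
    "group by person, quant "
    "order by person, quant").fetchall()

print(rows)  # [('dyer', 'rad', 1), ('dyer', 'sal', 1), ('lake', 'rad', 1)]
```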
2015 Our query processing pipeline now looks like
2016 [Figure 7](#f:pipeline_grouping).
2018 <figure id="f:pipeline_grouping">
2019 <img src="db/pipeline_grouping.png" alt="Query Processing Pipeline With Grouping" />
2020 <figcaption>Figure 7: Query Processing Pipeline With Grouping</figcaption>
2023 ### Summary {.keypoints}
2025 * Use `group by` to group values for separate aggregation.
2027 ### Challenges {.challenges}
2029 * Write a single query that finds the earliest and latest date
2030 that each site was visited.
2032 * Show the records produced by each stage of
2033 [Figure 7](#f:pipeline_grouping)
2034 for the following query:
2037 select min(reading), max(reading) from Survey
2038 where taken in (734, 735)
2040 group by taken, quant;
2043 * How can the query in the previous challenge be simplified
2044 without changing its result?
2046 ## Combining Data {#s:join}
2048 ### Learning Objectives {.objectives}
2050 * Explain what primary keys and foreign keys are.
2051 * Write queries that combine information from two or more tables by matching keys.
2052 * Write queries using aliases for table names.
2053 * Explain why the `tablename.fieldname` notation is needed when tables are joined.
2054 * Explain the logical sequence of operations that occurs when two or more tables are joined.
2056 Duration: 20 minutes (and expect to have to walk through an example step-by-step).
2060 In order to submit her data to a web site
2061 that aggregates historical meteorological data,
2062 Gina needs to format it as
2063 latitude, longitude, date, quantity, and reading.
However,
her latitudes and longitudes are in the `Site` table,
2066 while the dates of measurements are in the `Visited` table
2067 and the readings themselves are in the `Survey` table.
2068 She needs to combine these tables somehow.
2070 The SQL command to do this is `join`.
2071 To see how it works,
2072 let's start by joining the `Site` and `Visited` tables:
2075 select * from Site join Visited;
2080 -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
2081 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
2082 DR-1 -49.85 -128.57 622 DR-1 1927-02-10
2083 DR-1 -49.85 -128.57 734 DR-3 1939-01-07
2084 DR-1 -49.85 -128.57 735 DR-3 1930-01-12
2085 DR-1 -49.85 -128.57 751 DR-3 1930-02-26
2086 DR-1 -49.85 -128.57 752 DR-3
2087 DR-1 -49.85 -128.57 837 MS-4 1932-01-14
2088 DR-1 -49.85 -128.57 844 DR-1 1932-03-22
2089 DR-3 -47.15 -126.72 619 DR-1 1927-02-08
2090 DR-3 -47.15 -126.72 622 DR-1 1927-02-10
2091 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
2092 DR-3 -47.15 -126.72 735 DR-3 1930-01-12
2093 DR-3 -47.15 -126.72 751 DR-3 1930-02-26
2094 DR-3 -47.15 -126.72 752 DR-3
2095 DR-3 -47.15 -126.72 837 MS-4 1932-01-14
2096 DR-3 -47.15 -126.72 844 DR-1 1932-03-22
2097 MS-4 -48.87 -123.4 619 DR-1 1927-02-08
2098 MS-4 -48.87 -123.4 622 DR-1 1927-02-10
2099 MS-4 -48.87 -123.4 734 DR-3 1939-01-07
2100 MS-4 -48.87 -123.4 735 DR-3 1930-01-12
2101 MS-4 -48.87 -123.4 751 DR-3 1930-02-26
2102 MS-4 -48.87 -123.4 752 DR-3
2103 MS-4 -48.87 -123.4 837 MS-4 1932-01-14
2104 MS-4 -48.87 -123.4 844 DR-1 1932-03-22
2105 -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
`join` creates the [cross product](../gloss.html#cross-product) of the two tables,
i.e.,
it joins each record of one with each record of the other
to give all possible combinations.
2115 Since there are three records in `Site`
2116 and eight in `Visited`,
2117 the join's output has 24 records.
2118 And since each table has three fields,
2119 the output has six fields.
2121 What the join *hasn't* done is
2122 figure out if the records being joined have anything to do with each other.
2123 It has no way of knowing whether they do or not until we tell it how.
To do this,
we add a clause specifying that
2126 we're only interested in combinations that have the same site name:
2129 select * from Site join Visited on Site.name=Visited.site;
2134 -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
2135 DR-1 -49.85 -128.57 619 DR-1 1927-02-08
2136 DR-1 -49.85 -128.57 622 DR-1 1927-02-10
2137 DR-1 -49.85 -128.57 844 DR-1 1932-03-22
2138 DR-3 -47.15 -126.72 734 DR-3 1939-01-07
2139 DR-3 -47.15 -126.72 735 DR-3 1930-01-12
2140 DR-3 -47.15 -126.72 751 DR-3 1930-02-26
2141 DR-3 -47.15 -126.72 752 DR-3
2142 MS-4 -48.87 -123.4 837 MS-4 1932-01-14
2143 -------------------- -------------------- -------------------- -------------------- -------------------- --------------------
2147 `on` does the same job as `where`:
2148 it only keeps records that pass some test.
2149 (The difference between the two is that `on` filters records
2150 as they're being created,
2151 while `where` waits until the join is done
2152 and then does the filtering.)
2153 Once we add this to our query,
2154 the database manager throws away records
2155 that combined information about two different sites,
2156 leaving us with just the ones we want.
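The difference between an unconstrained join and one with an `on` clause
is easy to measure.
The sketch below (Python's `sqlite3` module,
with three rows per table borrowed from the output above)
counts the combinations each produces:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Site(name text, lat real, long real)")
db.execute("create table Visited(ident integer, site text, dated text)")
db.executemany("insert into Site values(?, ?, ?)",
               [("DR-1", -49.85, -128.57),
                ("DR-3", -47.15, -126.72),
                ("MS-4", -48.87, -123.4)])
db.executemany("insert into Visited values(?, ?, ?)",
               [(619, "DR-1", "1927-02-08"),
                (734, "DR-3", "1939-01-07"),
                (837, "MS-4", "1932-01-14")])

cross = db.execute("select * from Site join Visited").fetchall()
matched = db.execute(
    "select * from Site join Visited on Site.name=Visited.site").fetchall()

print(len(cross))    # 9: every Site row paired with every Visited row
print(len(matched))  # 3: only the pairs whose site names agree
```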
2158 Notice that we used `table.field` to specify field names
2159 in the output of the join.
We do this because tables can have fields with the same name,
and we need to be specific about which ones we're talking about.
For example,
if we joined the `Person` and `Visited` tables,
the result would inherit a field called `ident`
from each of the original tables.
2167 We can now use the same dotted notation
2168 to select the three columns we actually want
2172 select Site.lat, Site.long, Visited.dated
2173 from Site join Visited
2174 on Site.name=Visited.site;
2179 -------------------- -------------------- --------------------
2180 -49.85 -128.57 1927-02-08
2181 -49.85 -128.57 1927-02-10
2182 -49.85 -128.57 1932-03-22
2184 -47.15 -126.72 1930-01-12
2185 -47.15 -126.72 1930-02-26
2186 -47.15 -126.72 1939-01-07
2187 -48.87 -123.4 1932-01-14
2188 -------------------- -------------------- --------------------
2192 If joining two tables is good,
2193 joining many tables must be better.
In fact,
we can join any number of tables
2196 simply by adding more `join` clauses to our query,
2197 and more `on` tests to filter out combinations of records
2198 that don't make sense:
2201 select Site.lat, Site.long, Visited.dated, Survey.quant, Survey.reading
2202 from Site join Visited join Survey
2203 on Site.name=Visited.site
2204 and Visited.ident=Survey.taken
2205 and Visited.dated is not null;
2210 -------------------- -------------------- -------------------- -------------------- --------------------
2211 -49.85 -128.57 1927-02-08 rad 9.82
2212 -49.85 -128.57 1927-02-08 sal 0.13
2213 -49.85 -128.57 1927-02-10 rad 7.8
2214 -49.85 -128.57 1927-02-10 sal 0.09
2215 -47.15 -126.72 1939-01-07 rad 8.41
2216 -47.15 -126.72 1939-01-07 sal 0.05
2217 -47.15 -126.72 1939-01-07 temp -21.5
2218 -47.15 -126.72 1930-01-12 rad 7.22
2219 -47.15 -126.72 1930-01-12 sal 0.06
2220 -47.15 -126.72 1930-01-12 temp -26.0
2221 -47.15 -126.72 1930-02-26 rad 4.35
2222 -47.15 -126.72 1930-02-26 sal 0.1
2223 -47.15 -126.72 1930-02-26 temp -18.5
2224 -48.87 -123.4 1932-01-14 rad 1.46
2225 -48.87 -123.4 1932-01-14 sal 0.21
2226 -48.87 -123.4 1932-01-14 sal 22.5
2227 -49.85 -128.57 1932-03-22 rad 11.25
2228 -------------------- -------------------- -------------------- -------------------- --------------------
2233 We can tell which records from `Site`, `Visited`, and `Survey`
2234 correspond with each other
2235 because those tables contain
2236 [primary keys](../gloss.html#primary-key)
2237 and [foreign keys](../gloss.html#foreign-key).
2238 A primary key is a value,
2239 or combination of values,
2240 that uniquely identifies each record in a table.
2241 A foreign key is a value (or combination of values) from one table
2242 that identifies a unique record in another table.
2243 Another way of saying this is that
2244 a foreign key is the primary key of one table
2245 that appears in some other table.
For example,
`Person.ident` is the primary key in the `Person` table,
2248 while `Survey.person` is a foreign key
2249 relating the `Survey` table's entries
2250 to entries in `Person`.
2252 Most database designers believe that
2253 every table should have a well-defined primary key.
2254 They also believe that this key should be separate from the data itself,
2255 so that if we ever need to change the data,
2256 we only need to make one change in one place.
2257 One easy way to do this is
2258 to create an arbitrary, unique ID for each record
2259 as we add it to the database.
2260 This is actually very common:
2261 those IDs have names like "student numbers" and "patient numbers",
2262 and they almost always turn out to have originally been
2263 a unique record identifier in some database system or other.
2264 As the query below demonstrates,
2265 SQLite automatically numbers records as they're added to tables,
2266 and we can use those record numbers in queries:
2269 select rowid, * from Person;
2274 -------------------- -------------------- -------------------- --------------------
2277 3 lake Anderson Lake
2278 4 roe Valentina Roerich
2279 5 danforth Frank Danforth
2280 -------------------- -------------------- -------------------- --------------------
2284 ### Summary {.keypoints}
2286 * Use `join` to create all possible combinations of records from two or more tables.
2287 * Use `join *tables* on *test*` to keep only those combinations that pass some test.
2288 * Use `*table*.*field*` to specify a particular field of a particular table.
2289 * Every record in a table should be uniquely identified by the value of its primary key.
2291 ### Challenges {.challenges}
2293 * Write a query that lists all radiation readings from the DR-1 site.
2295 * Write a query that lists all sites visited by people named "Frank".
2297 * Describe in your own words what the following query produces:
2300 select Site.name from Site join Visited
2301 on Site.lat<-49.0 and Site.name=Visited.site and Visited.dated>='1932-00-00';
2304 * Why does the `Person` table have an `ident` field?
2305 Why do we not just use scientists' names in the `Survey` table?
2307 * Why does the table `Site` exist?
2308 Why didn't Gina just record latitudes and longitudes
2309 directly in the `Visited` and `Survey` tables?
2311 ## Creating and Modifying Tables {#s:create}
### Learning Objectives {.objectives}
2315 * Write queries that create database tables with fields of common types.
2316 * Write queries that specify the primary and foreign key relationships of tables.
2317 * Write queries that specify whether field values must be unique and/or are allowed to be `null`.
2318 * Write queries that erase database tables.
2319 * Write queries that add records to database tables.
2320 * Write queries that delete specific records from tables.
2321 * Explain what referential integrity is, and how a database can become inconsistent as data is changed.
2323 Duration: 10 minutes.
2327 So far we have only looked at how to get information out of a database,
2328 both because that is more frequent than adding information,
2329 and because most other operations only make sense
2330 once queries are understood.
2331 If we want to create and modify data,
2332 we need to know two other pairs of commands.
2334 The first pair are `create table` and `drop table`.
2335 While they are written as two words,
2336 they are actually single commands.
2337 The first one creates a new table;
2338 its arguments are the names and types of the table's columns.
For example,
the following statements create the four tables in our survey database:
2342 create table Person(ident text, personal text, family text);
2343 create table Site(name text, lat real, long real);
2344 create table Visited(ident integer, site text, dated text);
create table Survey(taken integer, person text, quant text, reading real);
2347 We can get rid of one of our tables using:
2351 Be very careful when doing this:
2352 most databases have some support for undoing changes,
2353 but it's better not to have to rely on it.
2355 Different database systems support different data types for table columns,
2356 but most provide the following:
-------------------- --------------------
`integer` A signed integer.
`real` A floating point value.
`text` A character string.
`blob` Any "binary large object" such as an image or audio file.
-------------------- --------------------
2369 Most databases also support Booleans and date/time values;
2370 SQLite uses the integers 0 and 1 for the former,
2371 and represents the latter as discussed [earlier](#a:dates).
2372 An increasing number of databases also support geographic data types,
2373 such as latitude and longitude.
2374 Keeping track of what particular systems do or do not offer,
2375 and what names they give different data types,
2376 is an unending portability headache.
2378 When we create a table,
2379 we can specify several kinds of constraints on its columns.
2381 a better definition for the `Survey` table would be:
2383 create table Survey(
2384 taken integer not null, -- where reading taken
2385 person text, -- may not know who took it
quant text not null, -- the quantity measured
2387 reading real not null, -- the actual reading
2388 primary key(taken, quant),
2389 foreign key(taken) references Visited(ident),
foreign key(person) references Person(ident)
);

Once again,
exactly what constraints are available
and what they're called
depend on which database manager we are using.
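We can watch a constraint do its job from Python.
The sketch below declares only the `not null` constraints
(foreign-key enforcement is off by default in SQLite,
so it is omitted here):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""create table Survey(
    taken   integer not null,
    person  text,
    quant   text not null,
    reading real not null)""")

# A complete record is accepted; person may be null, so None is fine there.
db.execute("insert into Survey values(?, ?, ?, ?)", (619, None, "rad", 9.82))

# A null in a "not null" column is rejected with an IntegrityError.
try:
    db.execute("insert into Survey values(?, ?, ?, ?)", (622, "dyer", None, 7.8))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = db.execute("select count(*) from Survey").fetchone()[0]
print(rejected, stored)  # True 1
```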
2398 Once tables have been created,
2399 we can add and remove records using our other pair of commands,
2400 `insert` and `delete`.
2401 The simplest form of `insert` statement lists values in order:
2403 insert into Site values('DR-1', -49.85, -128.57);
2404 insert into Site values('DR-3', -47.15, -126.72);
insert into Site values('MS-4', -48.87, -123.40);
2407 We can also insert values into one table directly from another:
create table JustLatLong(lat text, long text);
2410 insert into JustLatLong select lat, long from site;
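From Python,
the same insert-from-select can be sketched like this
(using a two-row copy of `Site`):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Site(name text, lat real, long real)")
db.executemany("insert into Site values(?, ?, ?)",
               [("DR-1", -49.85, -128.57),
                ("DR-3", -47.15, -126.72)])

# Create the target table, then fill it from a query in one statement.
db.execute("create table JustLatLong(lat text, long text)")
db.execute("insert into JustLatLong select lat, long from Site")

copied = db.execute("select count(*) from JustLatLong").fetchone()[0]
print(copied)  # 2
```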
2412 Deleting records can be a bit trickier,
2413 because we have to ensure that the database remains internally consistent.
If all we care about is a single table,
we can use the `delete` command with a `where` clause
that matches the records we want to discard.
2418 once we realize that Frank Danforth didn't take any measurements,
2419 we can remove him from the `Person` table like this:
delete from Person where ident = 'danforth';
2423 But what if we removed Anderson Lake instead?
2424 Our `Survey` table would still contain seven records
2425 of measurements he'd taken:
2428 select count(*) from Survey where person='lake';
2433 --------------------
2435 --------------------
2439 That's never supposed to happen:
2440 `Survey.person` is a foreign key into the `Person` table,
2441 and all our queries assume there will be a row in the latter
2442 matching every value in the former.
This requirement is called [referential integrity](../gloss.html#referential-integrity):
we need to ensure that all references between tables can always be resolved correctly.
2446 One way to do this is to delete all the records
2447 that use `'lake'` as a foreign key
2448 before deleting the record that uses it as a primary key.
If our database manager supports it,
we can automate this
using [cascading delete](../gloss.html#cascading-delete),
but that technique is outside the scope of this chapter.
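The sketch below shows the integrity problem directly:
since SQLite does not enforce foreign keys unless asked to,
deleting Lake's `Person` record leaves `Survey` rows pointing at nothing.
(The query that counts the orphans is an illustration,
not a standard tool.)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table Person(ident text, personal text, family text)")
db.execute("create table Survey(taken integer, person text, quant text, reading real)")
db.execute("insert into Person values('lake', 'Anderson', 'Lake')")
db.execute("insert into Survey values(734, 'lake', 'sal', 0.05)")

# Nothing stops this delete, even though Survey still refers to 'lake'.
db.execute("delete from Person where ident='lake'")

orphans = db.execute(
    "select count(*) from Survey "
    "where person not in (select ident from Person)").fetchone()[0]
print(orphans)  # 1
```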
2455 > ### Other Ways to Do It {#a:hybrid .box}
2457 > Many applications use a hybrid storage model
2458 > instead of putting everything into a database:
2459 > the actual data (such as astronomical images) is stored in files,
2460 > while the database stores the files' names,
2461 > their modification dates,
2462 > the region of the sky they cover,
> their spectral characteristics,
> and so on.
2465 > This is also how most music player software is built:
2466 > the database inside the application keeps track of the MP3 files,
2467 > but the files themselves live on disk.
2469 ### Summary {.keypoints}
2471 * Use `create table *name*(...)` to create a table.
2472 * Use `drop table *name*` to erase a table.
2473 * Specify field names and types when creating tables.
2474 * Specify `primary key`, `foreign key`, `not null`, and other constraints when creating tables.
2475 * Use `insert into *table* values(...)` to add records to a table.
2476 * Use `delete from *table* where *test*` to erase records from a table.
2477 * Maintain referential integrity when creating or deleting information.
2479 ### Challenges {.challenges}
2481 * Write an SQL statement to replace all uses of `null`
2483 with the string `'unknown'`.
2485 * One of Gina's colleagues has sent her a [CSV](../gloss.html#csv) file
2486 containing temperature readings by Robert Olmstead,
2487 which is formatted like this:
2495 Write a small Python program that reads this file in
2496 and prints out the SQL `insert` statements needed
2497 to add these records to the survey database.
2498 Note: you will need to add an entry for Olmstead
2499 to the `Person` table.
2500 If you are testing your program repeatedly,
2501 you may want to investigate SQL's `insert or replace` command.
2503 * SQLite has several administrative commands that aren't part of the SQL standard.
2504 One of them is `.dump`,
2505 which prints the SQL commands needed to re-create the database.
Another is `.read`,
which reads a file created by `.dump` and restores the database.
2508 A colleague of yours thinks that storing dump files (which are text) in version control
2509 is a good way to track and manage changes to the database.
2510 What are the pros and cons of this approach?
## Transactions {#s:transactions}
2516 ### Learning Objectives {.objectives}
2518 * Explain what a race condition is.
* Explain why database operations sometimes have to be placed in a transaction to ensure correct behavior.
* Explain what it means to commit a transaction.
2522 Duration: 10 minutes.
2526 Suppose we have another table in our database that shows
2527 which pieces of equipment have been borrowed by which scientists:
2530 select * from Equipment;
2535 -------------------- --------------------
2536 dyer CX-211 oscilloscope
2537 pb Greenworth balance
2538 lake Cavorite damping plates
2539 -------------------- --------------------
2543 (We should actually give each piece of equipment a unique ID,
2544 and use that ID here instead of the full name,
2545 just as we created a separate table for scientists earlier in this chapter,
2546 but we will bend the rules for now.)
2547 If William Dyer gives the oscilloscope to Valentina Roerich,
2548 we need to execute two statements to update this table:
delete from Equipment where person='dyer' and thing='CX-211 oscilloscope';
insert into Equipment values('roe', 'CX-211 oscilloscope');
2553 This is all fine—unless our program happens to crash
between the first statement and the second.
If it does,
the `Equipment` table won't have a record for the oscilloscope at all.
Such a crash may seem unlikely,
but remember:
if a computer can do two billion operations per second,
that means there are two billion opportunities every second for something to go wrong.
2561 And if our operations take a long time to complete—as they will
2562 when we are working with large datasets,
2563 or when the database is being heavily used—the odds of failure increase.
2565 What we really want is a way to ensure that every operation is [ACID](../gloss.html#acid):
2566 [atomic](../gloss.html#atomic-operation) (i.e. indivisible),
2567 consistent, isolated, and durable.
The precise meanings of these terms don't matter here;
what does matter is the idea that
2570 every logical operation on the database should either run to completion
2571 as if nothing else was going on at the same time,
2572 or fail without having any effect at all.
2574 The tool we use to ensure that this happens is called
2575 a [transaction](../gloss.html#transaction).
2576 Here's how we should actually write the statements
2577 to move the oscilloscope from one person to another:
begin transaction;
delete from Equipment where person="dyer" and thing="CX-211 oscilloscope";
insert into Equipment values("roe", "CX-211 oscilloscope");
commit;
2584 The database manager treats everything in the transaction as one large statement.
2585 If anything goes wrong inside,
2586 then none of the changes made in the transaction will actually be written to the database—it
2587 will be as if the transaction had never happened.
2588 Changes are only stored permanently
2589 when we [commit](../gloss.html#commit) them at the end of the transaction.
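Here is how the same safeguard looks from Python's `sqlite3` library, sketched against a throwaway in-memory copy of the `Equipment` table: `commit` ends the transaction, and `rollback` discards it if anything fails partway through.

```python
import sqlite3

# A throwaway in-memory stand-in for the Equipment table.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table Equipment(person text, thing text)")
cursor.execute("insert into Equipment values('dyer', 'CX-211 oscilloscope')")
connection.commit()

try:
    # Both statements succeed or fail together.
    cursor.execute("delete from Equipment where person='dyer' and thing='CX-211 oscilloscope'")
    cursor.execute("insert into Equipment values('roe', 'CX-211 oscilloscope')")
    connection.commit()
except sqlite3.Error:
    # If anything went wrong, discard the partial changes.
    connection.rollback()

cursor.execute("select person from Equipment where thing='CX-211 oscilloscope'")
rows = cursor.fetchall()
print(rows)
```

Either both statements take effect and the oscilloscope belongs to Roerich, or neither does and it still belongs to Dyer; the table never loses the record entirely.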
2591 > ### Transactions and Commits {.box}
2593 > We first used the term "transaction" in
2594 > [our discussion of version control](svn.html#b:basics:transaction).
2595 > That's not a coincidence:
2596 > behind the scenes,
2597 > tools like Subversion are using many of the same algorithms as database managers
2598 > to ensure that either everything happens consistently
2599 > or nothing happens at all.
2600 > We [use the term "commit"](svn.html#a:commit) for the same reason:
2601 > just as our changes to local files aren't written back to the version control repository
2602 > until we commit them,
> our (apparent) changes to a database aren't written to disk
> until we commit the transaction that contains them.
2606 Transactions serve another purpose as well.
2607 Suppose there is another table in the database called `Exposure`
2608 that records the number of days each scientist was exposed to
2609 higher-than-normal levels of radiation:
2612 select * from Exposure;
-------------------- --------------------
person               days
-------------------- --------------------
lake                 5
-------------------- --------------------
2625 After going through the journal entries for 1932,
2626 Gina wants to add two days to Lake's count:
2629 update Exposure set days = days + 2 where person='lake';
Meanwhile,
her labmate has been going through the journal entries for 1933
2634 to help Gina meet a paper deadline.
2635 At the same moment as Gina runs her command,
2636 her labmate runs this
2637 to add one more day to Lake's exposure:
2640 update Exposure set days = days + 1 where person='lake';
2643 After both operations have completed,
2644 the database should show that Lake was exposed for eight days
2645 (the original five, plus two from Gina, plus one from her labmate).
However,
there is a small chance that it won't.
To see why,
let's break the two queries into their respective read and write steps
2650 and place them side by side:
--------------------------------- ---------------------------------
Gina                              Labmate
--------------------------------- ---------------------------------
`X = read Exposure('lake', __)`   `Y = read Exposure('lake', __)`
`write Exposure('lake', X+2)`     `write Exposure('lake', Y+1)`
--------------------------------- ---------------------------------
2657 The database can only actually do one thing at once,
2658 so it must put these four operations into some sequential order.
2659 That order has to respect the original order within each column,
2660 but the database can interleave the two columns any way it wants.
2661 If it orders them like this:
--------------------------------- ---------------------
`X = read Exposure('lake', __)`   `X` is 5
`write Exposure('lake', X+2)`     database contains 7
`Y = read Exposure('lake', __)`   `Y` is 7
`write Exposure('lake', Y+1)`     database contains 8
--------------------------------- ---------------------
then all is well:
Lake's exposure ends up as eight days.
2671 But what if it interleaves the operations like this:
--------------------------------- ---------------------
`X = read Exposure('lake', __)`   `X` is 5
`Y = read Exposure('lake', __)`   `Y` is 5
`write Exposure('lake', X+2)`     database contains 7
`write Exposure('lake', Y+1)`     database contains 6
--------------------------------- ---------------------
2680 This ordering puts the initial value, 5, into both `X` and `Y`.
2681 It then writes 7 back to the database (the third statement),
and then overwrites that with 6,
so one of the two updates is silently lost.
2685 This is called a [race condition](../gloss.html#race-condition),
2686 since the final result depends on a race between the two operations.
2687 Race conditions are part of what makes programming large systems with many components a nightmare:
2688 they are difficult to spot in advance
2689 (since they are caused by the interactions between components,
2690 rather than by anything in any one of those components),
2691 and can be almost impossible to debug
2692 (since they usually occur intermittently and infrequently).
2694 Transactions come to our rescue once again.
2695 If Gina and her labmate put their statements in transactions,
2696 the database will act as if it executed all of one and then all of the other.
2697 Whether or not it *actually* does this is up to whoever wrote the database manager:
2698 modern databases use very sophisticated algorithms to determine
2699 which operations actually have to be run sequentially,
2700 and which can safely be run in parallel to improve performance.
2701 The key thing is that
2702 every transaction will appear to have had the entire database to itself.
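As an illustration of the intended outcome (a sketch only: a single-threaded script cannot actually race with itself), here are the two updates run as separate transactions against a throwaway in-memory `Exposure` table. However a real database interleaves committed transactions, Lake's total must come out as eight:

```python
import sqlite3

# Throwaway in-memory stand-in for the Exposure table.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table Exposure(person text, days integer)")
cursor.execute("insert into Exposure values('lake', 5)")
connection.commit()

# Gina's transaction: add two days.
cursor.execute("update Exposure set days = days + 2 where person='lake'")
connection.commit()

# Her labmate's transaction: add one more day.
cursor.execute("update Exposure set days = days + 1 where person='lake'")
connection.commit()

cursor.execute("select days from Exposure where person='lake'")
days = cursor.fetchone()[0]
print(days)
```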
2704 ### Summary {.keypoints}
2706 * Place operations in a transaction to ensure that they appear to be atomic, consistent, isolated, and durable.
2708 ### Challenges {.challenges}
2710 * A friend of yours manages a database of aerial photographs.
2711 New records are added all the time,
2712 but existing records are never modified or updated.
2713 Your friend claims that because of this,
2714 he doesn't need to put his queries in transactions.
2715 Is he right or wrong, and why?
2717 ## Programming With Databases {#s:programming}
2719 ### Learning Objectives {.objectives}
2721 * Write a Python program that queries a database and processes the results.
2722 * Explain what an SQL injection attack is.
2723 * Write a program that safely interpolates values into queries.
2725 Duration: 20 minutes.
2729 To end this chapter,
2730 let's have a look at how to access a database from
2731 a general-purpose programming language like Python.
2732 Other languages use almost exactly the same model:
2733 library and function names may differ,
2734 but the concepts are the same.
2736 Here's a short Python program that selects latitudes and longitudes
2737 from an SQLite database stored in a file called `survey.db`:
import sqlite3
connection = sqlite3.connect("survey.db")
cursor = connection.cursor()
cursor.execute("select site.lat, site.long from site;")
results = cursor.fetchall()
for r in results:
    print r
cursor.close()
connection.close()
2751 The program starts by importing the `sqlite3` library.
2752 If we were connecting to MySQL, DB2, or some other database,
2753 we would import a different library,
2754 but all of them provide the same functions,
2755 so that the rest of our program does not have to change
2756 (at least, not much)
2757 if we switch from one database to another.
2759 Line 2 establishes a connection to the database.
2760 Since we're using SQLite,
2761 all we need to specify is the name of the database file.
2762 Other systems may require us to provide a username and password as well.
2763 Line 3 then uses this connection to create
2764 a [cursor](../gloss.html#cursor);
2765 just like the cursor in an editor,
2766 its role is to keep track of where we are in the database.
2768 On line 4, we use that cursor to ask the database to execute a query for us.
2769 The query is written in SQL,
2770 and passed to `cursor.execute` as a string.
2771 It's our job to make sure that SQL is properly formatted;
if it isn't,
or if something goes wrong when it is being executed,
2774 the database will report an error.
2776 The database returns the results of the query to us
2777 in response to the `cursor.fetchall` call on line 5.
2778 This result is a list with one entry for each record in the result set;
2779 if we loop over that list (line 6) and print those list entries (line 7),
2780 we can see that each one is a tuple
2781 with one element for each field we asked for.
2783 Finally, lines 8 and 9 close our cursor and our connection,
2784 since the database can only keep a limited number of these open at one time.
2785 Since establishing a connection takes time,
we shouldn't open a connection,
do one operation,
2789 then close the connection,
2790 only to reopen it a few microseconds later to do another operation.
Instead,
it's normal to create one connection that stays open for the lifetime of the program.
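Besides `fetchall`, cursors can also return a few results at a time through `fetchone` and `fetchmany`, so a very large result set never has to fit in memory all at once. A sketch, using a made-up in-memory table so it runs anywhere:

```python
import sqlite3

# A made-up in-memory table standing in for survey.db's site data.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table Site(lat real, long real)")
cursor.execute("insert into Site values(-49.85, -128.57)")
cursor.execute("insert into Site values(-47.15, -126.72)")
connection.commit()

# fetchone returns one record at a time, and None when the records run out.
cursor.execute("select lat, long from Site")
count = 0
row = cursor.fetchone()
while row is not None:
    count += 1
    row = cursor.fetchone()
print(count)

cursor.close()
connection.close()
```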
2794 > ### What Are The u's For? {.box}
2796 > You may have noticed that
2797 > each of the strings in our output has a lower-case 'u' in front of it.
2798 > That is Python's way of telling us that the string is stored in
2799 > [Unicode](../gloss.html#unicode).
2801 Queries in real applications will often depend on values provided by users.
For example,
a program might take a user ID as a command-line parameter
2804 and display the user's full name:
import sys
import sqlite3

query = "select personal, family from Person where ident='%s';"
user_id = sys.argv[1]

connection = sqlite3.connect("survey.db")
cursor = connection.cursor()
cursor.execute(query % user_id)
results = cursor.fetchall()
print results[0][0], results[0][1]
cursor.close()
connection.close()
2824 The variable `query` holds the statement we want to execute
2825 with a `%s` format string where we want to insert
2826 the ID of the person we're looking up.
2827 It seems simple enough,
2828 but what happens if someone gives the program this input?
2830 dyer"; drop table Survey; select "
It looks like there's garbage after the user ID,
2833 but it is very carefully chosen garbage.
If we insert this string into our query,
the query becomes:
2838 select personal, family from Person where ident='dyer'; drop table Survey; select '';
If the second of these statements is executed,
it will erase one of the tables in our database.
2845 This technique is called [SQL injection](../gloss.html#sql-injection),
2846 and it has been used to attack thousands of programs over the years.
In particular,
many web sites that take data from users insert values directly into queries
2849 without checking them carefully first.
2851 Since a villain might try to smuggle commands into our queries in many different ways,
2852 the safest way to deal with this threat is
2853 to replace characters like quotes with their escaped equivalents,
2854 so that we can safely put whatever the user gives us inside a string.
2855 We can do this by using a [prepared statement](../gloss.html#prepared-statement)
2856 instead of formatting our statements as strings.
2857 Here's what our example program looks like if we do this:
import sys
import sqlite3

query = "select personal, family from Person where ident=?;"
user_id = sys.argv[1]

connection = sqlite3.connect("survey.db")
cursor = connection.cursor()
cursor.execute(query, [user_id])
results = cursor.fetchall()
print results[0][0], results[0][1]
cursor.close()
connection.close()
2877 The key changes are in the query string and the `execute` call.
2878 Instead of formatting the query ourselves,
2879 we put question marks in the query template where we want to insert values.
2880 When we call `execute`,
we provide a list
that contains as many values as there are question marks in the query.
2883 The library matches values to question marks in order,
2884 and translates any special characters in the values
2885 into their escaped equivalents
2886 so that they are safe to use.
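To see the safeguard working end to end, here is a sketch against a throwaway in-memory database with made-up `Person` and `Survey` tables: the malicious input is passed as one literal value, matches nothing, and leaves `Survey` intact.

```python
import sqlite3

# Throwaway in-memory database with made-up tables.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table Person(ident text, personal text, family text)")
cursor.execute("insert into Person values('dyer', 'William', 'Dyer')")
cursor.execute("create table Survey(taken integer)")
connection.commit()

malicious = 'dyer"; drop table Survey; select "'

# The placeholder treats the whole input as a single literal value,
# so no record matches and no extra statements are executed.
cursor.execute("select personal, family from Person where ident=?", [malicious])
rows = cursor.fetchall()
print(rows)

# The Survey table is still there.
cursor.execute("select count(*) from Survey")
remaining = cursor.fetchone()[0]
print(remaining)
```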
2888 ### Summary {.keypoints}
2890 * Most applications that use databases embed SQL in a general-purpose programming language.
2891 * Database libraries use connections and cursors to manage interactions.
2892 * Programs can fetch all results at once, or a few results at a time.
2893 * If queries are constructed dynamically using input from users, malicious users may be able to inject their own commands into the queries.
* Dynamically-constructed queries can use prepared statements with placeholders to safeguard against such attacks.
2896 ### Challenges {.challenges}
2898 * Write a Python program that creates a new database
2899 in a file called `original.db`
2900 containing a single table called `Pressure`,
2901 with a single field called `reading`,
2902 and inserts 100,000 random numbers between 10.0 and 25.0.
2903 How long does it take this program to run?
2904 How long does it take to run a program
2905 that simply writes those random numbers to a file?
2907 * Write a Python program that creates a new database
in a file called `backup.db`
with the same structure as `original.db`
2910 and copies all the values greater than 20.0
2911 from `original.db` to `backup.db`.
Which is faster:
filtering values in the query,
2914 or reading everything into memory and filtering in Python?
2916 ## Summing Up {#s:summary}
There are many things databases can't do,
or can't do well
2920 (which is why we have general-purpose programming languages like Python).
but they are still the best tool available
2923 for managing many kinds of complex, structured data.
2924 Thousands of programmer-years have gone into their design and implementation
2925 so that they can handle very large datasets—terabytes or more—quickly and reliably.
Queries give you great flexibility in how you analyze your data,
2927 which makes databases a good choice when you are exploring data.