</ol>
-<div class="box">
- <h3>Nothing's Perfekt</h3>
-
- <p>
- Version control systems do have one important shortcoming.
- While it is easy for them to find, display, and merge differences in text files,
- images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they
- use specialized binary data formats.
- Most version control systems don't know how to deal with these formats,
- so all they can say is, "These files differ."
- Reconciling those differences will probably require use of an auxiliary tool,
- such as an audio editor
- or Microsoft Word's "Compare and Merge" utility.
- </p>
-</div>
-
<p>
The rest of this chapter will explore how to use
a popular open source version control system called Subversion.
+ It does not have all the features of some newer systems,
+ such as <a href="git.html">Git</a>,
+ but it is still widely used,
+ and is simpler to pick up than those more advanced alternatives.
+ No matter which system you use,
+ the most important thing to learn is not the details of its more obscure commands,
+ but the workflow that it encourages.
</p>
<div class="guide">
<h2>For Instructors</h2>
- <p class="fixme">explain</p>
+ <p>
+ Version control is the most important practical skill we introduce.
+ As the last paragraph of the introduction above says,
+ the workflow matters more than the ins and outs of any particular tool.
+ By the end of 90 minutes,
+ the instructor should be able to get learners to chant,
+ "Update, edit, merge, commit," in unison,
+ and have them understand what those terms mean
+ and why that's a good way to structure their working day.
+ </p>
+
+ <p>
+ Provided there aren't network problems,
+ this entire lesson can be covered in <span class="duration">90 minutes</span>.
+ The example at the end
+ showing how to use Subversion keywords to track provenance
+ is the "ah ha!" moment for many learners.
+ If time is short,
+ skip the material on recovering old versions of files
+ in order to get to this section instead.
+ (The fact that provenance is harder in Git,
+ both mechanically and conceptually,
+ is one reason to keep teaching Subversion.)
+ </p>
<div class="prereq">
<h3>Prerequisites</h3>
- <p class="fixme">prereq</p>
+ <p>
+ Basic shell concepts and skills
+ (<code>ls</code>, <code>cd</code>, <code>mkdir</code>,
+ editing files);
+ basic shell scripting
+ (for the discussion of <a href="#s:provenance">provenance</a>).
+ </p>
</div>
<div class="notes">
<h3>Teaching Notes</h3>
<ul>
+ <li>
+ Make sure the network is working <em>before</em> starting this lesson.
+ </li>
+ <li>
+ Give learners a ten-minute overview of what version control does for them
+ before diving into the watch-and-do practicals.
+ Most of them will have tried to co-author papers by emailing files back and forth,
+ or will have biked into the office
+ only to realize that the USB key with last night's work
+ is still on the kitchen table.
+ Instructors can also make jokes about directories with names like
+ "final version",
+ "final version revised",
+ "final version with reviewer three's corrections",
+ "really final version",
+ and,
+ "come on this really has to be the last version"
+ to motivate version control as a better way to collaborate
+ and as a better way to back work up.
+ </li>
+ <li>
+ Version control is typically taught after the shell,
+ so collect learners' names during that session
+ and create a repository for them to share
+ with their names as both their IDs and their passwords.
+ The easiest way to create the repository is to use
+ a server managed by an ISP such as Dreamhost,
+ or on SourceForge, Google Code, or some other "forge" site,
+ all of which provide web interfaces for repository creation and management.
+ If your learners are advanced enough to be using SSH,
+ you can instead create it on any server they can access,
+ and connect with the <code>svn+ssh</code> protocol instead of HTTPS.
+ </li>
+ <li>
+ Be very clear what files learners are to edit
+ and what user IDs they are to use
+ when giving instructions.
+ It is common for them to edit the instructor's biography,
+ or to use the instructor's user ID and password when committing.
+ Be equally clear <em>when</em> they are to edit things:
+ it's also common for someone to edit the file the instructor is editing
+ and commit changes while the instructor is explaining what's going on,
+ so that a conflict occurs when the instructor comes to commit the file.
+ </li>
+ <li>
+ Learners could do most exercises with repositories on their own machines,
+ but it's hard for them to see how version control helps collaboration
+ unless they're sharing a repository with other learners.
+ In particular,
+ showing learners who changed what using <code>svn blame</code>
+ is only compelling if a file has been edited by at least two people.
+ </li>
+ <li>
+ If some learners are using Windows,
+ there will inevitably be issues merging files with different line endings.
+ <code>svn diff -x -w</code> is supposed to suppress differences in whitespace,
+ but we have found that it doesn't always work as advertised.
+ </li>
</ul>
</div>
<figcaption>Figure 8: Updated Repository</figcaption>
</figure>
+ <div class="box">
+ <h3>When <em>Not</em> to Use Version Control</h3>
+
+ <p>
+ Despite the rapidly decreasing cost of storage,
+ it is still possible to run out of disk space.
+ In some labs,
+ people can easily go through 2 TB/month if they're not careful.
+ Since version control tools usually store revisions in terms of lines,
+ with binary data files,
+ they end up essentially storing every revision separately.
+ This isn't that bad
+ (it's what we'd be doing anyway),
+ but it means version control isn't doing what it likes to do,
+ and the repository can get very large very quickly.
+ Another concern is that if very old data will no longer be used,
+ it can be nice to archive or delete old data files.
+ This is not possible if our data is version controlled:
+ information can only be added to a repository,
+ so it can only ever increase in size.
+ </p>
+
+ </div>
+
<p id="a:define-head">
Back in his cubicle,
Wolfman uses <code>svn update</code> to update his working copy.
</div>
+ <div class="box">
+ <h3>Nothing's Perfekt</h3>
+
+ <p>
+ Version control systems do have one important shortcoming.
+ While it is easy for them to find, display, and merge differences in text files,
+ images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they
+ use specialized binary data formats.
+ Most version control systems don't know how to deal with these formats,
+ so all they can say is, "These files differ."
+ Reconciling those differences will probably require use of an auxiliary tool,
+ such as an audio editor
+ or Microsoft Word's "Compare and Merge" utility.
+ </p>
+ </div>
+
<div class="box">
<h3>Diffing Other Files</h3>
and if necessary undo later on.
</p>
+ <div class="box">
+ <h3>Who Did What?</h3>
+
+ <p>
+ One other very useful command is <code>svn blame</code>,
+ which shows when each line in the file was last changed
+ and by whom:
+ </p>
+
+<pre>
+$ <span class="in">svn blame moons.txt</span>
+<span class="out"> 14 dracula Name Orbital Radius Orbital Period Mass Radius
+ 14 dracula (10**3 km) (days) (10**20 kg) (km)
+ 14 dracula Amalthea 181.4 0.498179 0.075 131 x 73 x 67
+ 9 mummy Io 421.6 1.769138 893.2 1821.6
+ 9 mummy Europa 670.9 3.551181 480.0 1560.8
+ 9 mummy Ganymede 1070.4 7.154553 1481.9 2631.2
+ 14 dracula Callisto 1882.7 16.689018 1075.9 2410.3
+ 14 dracula Himalia 11460 250.5662 0.095 85.0
+ 14 dracula Elara 11740 259.6528 0.008 40.0</span>
+</pre>
+
+ <p>
+ If you are ever wondering who to talk to about a change,
+ or why it was made,
+ <code>svn blame</code> is a good place to start.
+ </p>
+ </div>
+
<div class="keypoints">
<h3>Summary</h3>
<ul>
<p>
The command to create a repository is <code>svnadmin create</code>,
followed by the path to the repository.
- If we want to create a repository called <code>lair_repo</code>
+ If we want to create a repository called <code>missions_repo</code>
directly under our home directory,
we just <code>cd</code> to get home
- and run <code>svnadmin create lair_repo</code>.
- This command creates a directory called <code>lair_repo</code> to hold our repository,
+ and run <code>svnadmin create missions_repo</code>.
+ This command creates a directory called <code>missions_repo</code> to hold our repository,
and fills it with various files that Subversion uses
to keep track of the project's history:
</p>
<pre>
$ <span class="in">cd</span>
-$ <span class="in">svnadmin create lair_repo</span>
-$ <span class="in">ls -F lair_repo</span>
+$ <span class="in">svnadmin create missions_repo</span>
+$ <span class="in">ls -F missions_repo</span>
<span class="out">README.txt conf/ db/ format hooks/ locks/</span>
</pre>
we should use <code>svn checkout</code>
to get a working copy of this repository.
If our home directory is <code>/users/mummy</code>,
- then the full path to the repository we just created is <code>/users/mummy/lair_repo</code>,
- so we run <code>svn checkout file:///users/mummy/lair lair_working</code>.
+ then the full path to the repository we just created is <code>/users/mummy/missions_repo</code>,
+ so we run <code>svn checkout file:///users/mummy/missions_repo missions_working</code>.
</p>
<p>
Working backward,
the second argument,
- <code>lair_working</code>,
+ <code>missions_working</code>,
specifies where the working copy is to be put.
The first argument is the URL of our repository,
and it has two parts.
- <code>/users/mummy/lair_repo</code> is the path to repository directory.
+ <code>/users/mummy/missions_repo</code> is the path to the repository directory.
<code>file://</code> specifies the <a href="glossary.html#protocol">protocol</a>
that Subversion will use to communicate with the repository—in this case,
it says that the repository is part of the local machine's filesystem.
which specifies the name of the directory we want the working copy to be put in.
Without it,
Subversion will try to use the name of the repository,
- <code>lair_repo</code>,
+ <code>missions_repo</code>,
as the name of the working copy.
Since we're in the directory that contains the repository,
this means that Subversion will try to overwrite the repository with a working copy.
most people create a sub-directory in their account called something like <code>repos</code>,
and then create their repositories in that.
For example,
- we could create our repository in <code>/users/mummy/repos/lair</code>,
- then check out a working copy as <code>/users/mummy/lair</code>.
+ we could create our repository in <code>/users/mummy/repos/missions</code>,
+ then check out a working copy as <code>/users/mummy/missions</code>.
This practice makes both names easier to read.
</p>
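+ <p>
+ As a rough sketch of that layout,
+ the helper below creates a repository under <code>repos</code>
+ and checks out a working copy beside it.
+ The function name <code>make_repo</code> and its two arguments are invented here for illustration,
+ and the sketch assumes that <code>svnadmin</code> and <code>svn</code> are installed
+ and that the base directory is given as an absolute path:
+ </p>

```shell
# make_repo BASE NAME: hypothetical helper, not part of Subversion.
# Creates the repository BASE/repos/NAME and checks out a working copy as BASE/NAME.
# BASE must be an absolute path so that file://BASE forms a valid URL.
make_repo() {
    base=$1
    name=$2
    mkdir -p "$base/repos" || return 1
    svnadmin create "$base/repos/$name" || return 1
    svn checkout "file://$base/repos/$name" "$base/$name"
}

# Example: make_repo /users/mummy missions
```

+ <p class="continue">
+ Trying this in a scratch directory first is a cheap way to check
+ that the repository and the working copy end up where we expect.
+ </p>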
- <p class="fixme">HERE</p>
-
<p>
- The obvious next steps are
- to put our repository on a server,
- rather than on our personal machine,
- and to give other people access to the repository we have just created
- so that they can work with us.
- We should <em>always</em> keep repositories on a different machine than
- the one we're using for day-to-day work
- so that if the latter is lost or damaged,
- we still have our master copy.
+ The obvious next step is to put our repository on a server,
+ rather than on our personal machine.
+ In fact,
+ we should <em>always</em> do this
+ so that we don't lose the history of our project
+ if our laptop is damaged or stolen.
+ A departmental server is also much more likely to be backed up regularly
+ than our personal machine…
</p>
<p>
- The second step—sharing the repository with others—requires
- skills that we are deliberately not going to cover.
- As we discuss in the lessons on <a href="web.html">web programming</a>,
- as soon as you make something available over the internet,
- you open up a channel for attack.
+ Creating a repository on a server is simple:
+ just log in and go through the steps described above.
+ Accessing that repository from another machine
+ is also straightforward.
+ If the machine's address is <code>serv.euphoric.edu</code>,
+ and our user ID is <code>dracula</code>,
+ the URL of the repository will be something like:
</p>
+<pre>
+svn+ssh://dracula@serv.euphoric.edu/home/dracula/repos/missions
+</pre>
+
<p>
- If you want to do this, you can:
+ Reading from left to right:
+ </p>
+
+ <ul>
+ <li>
+ <code>svn+ssh</code> is the protocol that Subversion uses to connect to the server
+ (in this case,
+ a combination of Subversion's own protocol
+ and <a href="shell.html#s:ssh">SSH</a>);
+ </li>
+ <li>
+ <code>dracula@serv.euphoric.edu</code> identifies the server and who we are
+ (just like an email address);
+ and
+ </li>
+ <li>
+ <code>/home/dracula/repos/missions</code> is the absolute path of the repository
+ on the server.
+ </li>
+ </ul>
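+ <p>
+ The three parts can be assembled in the shell as sketched below.
+ The variable names are ours,
+ the working copy name <code>missions</code> is an assumption,
+ and the last line only prints the checkout command instead of running it,
+ since running it requires an account on the server:
+ </p>

```shell
# Variable names are illustrative; the values come from the example URL above.
user=dracula
host=serv.euphoric.edu
repo_path=/home/dracula/repos/missions

# protocol, then who and where, then the absolute path on the server
repo_url="svn+ssh://${user}@${host}${repo_path}"

# Print the command rather than running it (running it needs access to the server).
echo "svn checkout ${repo_url} missions"
```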
+
+ <p id="a:only_user">
+ That's fine if you are the only person using the repository,
+ but if you want to share it with others,
+ you need to worry about security.
+ As we discuss in the lesson on <a href="web.html">web programming</a>,
+ as soon as you provide a service on the internet,
+ there's the possibility that someone may try to attack your system through it.
+ Rather than trying to learn enough system administration skills
+ to set things up safely,
+ it is usually easier to:
</p>
<ul>
<li>
- ask your system administrator to set it up for you;
+ ask your department's system administrator to set it up for you;
</li>
<li>
- use an open source hosting service like <a href="http://www.sf.net">SourceForge</a>,
+ use a hosting service like <a href="http://www.sf.net">SourceForge</a>,
<a href="http://code.google.com">Google Code</a>,
<a href="https://github.com/">GitHub</a>,
or <a href="https://bitbucket.org/">BitBucket</a>; or
</li>
<li>
- spend a few dollars a month on a commercial hosting service like <a href="http://dreamhost.com">DreamHost</a>
+ spend a few dollars a month on a commercial hosting service
that provides web-based GUIs for creating and managing repositories.
</li>
<div class="keypoints">
<h3>Summary</h3>
<ul>
- <li>Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.</li>
<li><code>svnadmin create <em>name</em></code> creates a new repository.</li>
+ <li>Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.</li>
</ul>
</div>
<div class="challenges">
<h3>Challenges</h3>
- <p class="fixme">write some</p>
+
+ <ol>
+
+ <li>
+ Create a Subversion repository called <code>trials_repo</code>
+ in your home directory.
+ Check out a working copy in a directory called <code>trials_working</code>
+ (also in your home directory).
+ Add a couple of text files,
+ commit the changes,
+ and then use <code>svn info trials_working</code>
+ to see what Subversion tells you about your working copy.
+ </li>
+
+ <li>
+ We said <a href="#a:only_user">above</a> that
+ you might be the only person using a particular repository.
+ When and why is version control worth using
+ if no-one else is working on a project with you?
+ </li>
+
+ <li>
+ There are many ways to organize repositories.
+ Some of the most common are to create one repository for:
+ <ul>
+ <li>each person</li>
+ <li>each paper</li>
+ <li>all the work done on one grant</li>
+ <li>all the work done on one project</li>
+ <li>the entire lab (which is shared by everyone in the lab)</li>
+ <li>the entire department (typically with a top-level directory for each person or project in the department)</li>
+ </ul>
+ What activities does each one make easy or hard?
+ Which of these would you prefer, and why?
+ </li>
+
+ </ol>
</div>
</section>
</p>
<p>
- One of the central ideas of this course is that
- wen can automatically track the provenance of scientific data.
+ One of the big benefits of using version control is that
+ it lets us track the provenance of scientific data automatically.
To start,
suppose we have a text file <code>combustion.dat</code> in a Subversion repository.
Run the following two commands:
$ svn commit -m "Turning on the 'Revision' keyword" combustion.dat
</pre>
- <p>
- Now open the file in an editor
+ <p class="continue">
+ This does nothing by itself,
+ but now open the file in an editor
and add the following line somewhere near the top:
</p>
<pre>
-# $Revision:$
+$Revision:$
</pre>
<p>
- The '#' sign isn't important:
- it's just what <code>.dat</code> files use to show comments.
- The <code>$Revision:$</code> string,
- on the other hand,
- means something special to Subversion.
+ The <code>$Revision:$</code> string means something special to Subversion.
Save the file, and commit the change:
</p>
</p>
<pre>
-# $Revision: 143$
+$Revision: 143$
</pre>
<p class="continue">
- i.e., Subversion has inserted the version number
+ i.e., it has inserted the version number
after the colon and before the closing <code>$</code>.
+ If we edit the file again—e.g., add a couple of lines with random numbers—and
+ commit once more,
+ the line is updated again to:
</p>
+<pre>
+$Revision: 144$
+</pre>
+
<p>
Here's what just happened.
- First, Subversion allows you to set
+ First, Subversion allows us to add
<a href="glossary.html#property-subversion">properties</a>
- for files and and directories.
- These properties aren't in the files or directories themselves,
- but live in Subversion's database.
+ to files and directories.
+ These properties aren't stored in the files or directories themselves,
+ but in Subversion's database.
One of those properties,
<code>svn:keywords</code>,
tells Subversion to look in files that are being changed
with the current version number,
the name of the person making the change,
or whatever else the property's name tells it to do.
- You only have to add the string to the file once;
+ We only have to add the string to the file once;
after that,
Subversion updates it for you every time the file changes.
</p>
for example,
it carries its version number with it,
so you can tell which version you have even if it's outside version control.
- We'll see some more useful things we can do with this information in
- <a href="python.html">the next chapter</a>.
+ We'll see some more useful things we can do with this information <a href="python.html">later</a>.
</p>
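+ <p>
+ For instance,
+ a few lines of shell can read the number back out of a file that has left the repository.
+ This is only a sketch:
+ <code>revision_of</code> is a name we made up,
+ and it assumes the file contains an expanded <code>Revision</code> keyword line
+ like the one shown earlier:
+ </p>

```shell
# revision_of FILE: hypothetical helper that prints the revision number
# recorded in the first expanded $Revision: N$ keyword found in FILE.
# Prints nothing if the keyword is missing or has not been expanded yet.
revision_of() {
    sed -n 's/.*\$Revision: *\([0-9][0-9]*\) *\$.*/\1/p' "$1" | head -n 1
}

# Example: revision_of combustion.dat
```

+ <p class="continue">
+ The pattern allows optional spaces around the number,
+ so it does not depend on exactly how the keyword was expanded.
+ </p>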
- <div class="box">
- <h3>When <em>Not</em> to Use Version Control</h3>
-
- <p>
- Despite the rapidly decreasing cost of storage,
- it is still possible to run out of disk space.
- In some labs,
- people can easy go through 2 TB/month if they're not careful.
- Since version control tools usually store revisions in terms of lines,
- with binary data files,
- they end up essentially storing every revision separately.
- This isn't that bad
- (it's what we'd be doing anyway),
- but it means version control isn't doing what it likes to do,
- and the repository can get very large very quickly.
- Another concern is that if very old data will no longer be used,
- it can be nice to archive or delete old data files.
- This is not possible if our data is version controlled:
- information can only be added to a repository,
- so it can only ever increase in size.
- </p>
-
- </div>
-
<p>
We can use this trick with shell scripts too,
or with almost any other kind of program.
- Going back to Nelle Nemo's data processing from
- the lesson on the <a href="shell.html">shell</a>,
- for example,
- suppose she writes a shell script that uses <code>gooclean</code>
+ Let's go back to Nelle Nemo's data processing from
+ the lesson on the <a href="shell.html">shell</a>.
+ Suppose she writes a shell script called <code>gooclean</code>
to tidy up data files.
Her first version looks like this:
</p>
<pre>
-for filename in $*
-do
- gooclean -b 0 100 < $filename > cleaned-$filename
-done
+# gooclean: clean up a single data file
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 > cleaned-$1
</pre>
<p class="continue">
- i.e., it runs <code>gooclean</code> with bounding values of 0 and 100
- for each specified file,
- putting the result in a temporary file with a well-defined name.
- Assuming that '#' is the comment character for those kinds of data files,
+ i.e.,
+ it runs <code>goonorm</code> and then <code>goofilter</code> with some fixed parameters
+ and creates an output file called <code>cleaned-something.dat</code>
+ (if the input file's name was <code>something.dat</code>).
+ Assuming that '#' is the comment character for her output files,
she could instead write:
</p>
<pre>
-for filename in $*
-do
- <span class="highlight">echo "gooclean $Revision: 901$ -b 0 100" > $filename</span>
- gooclean -b 0 100 < $filename <span class="highlight">>></span> cleaned-$filename
-done
+# gooclean: clean up a single data file
+<span class="highlight">echo '# gooclean $Revision:$' > cleaned-$1</span>
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 <span class="highlight">>></span> cleaned-$1
+</pre>
+
+ <p class="continue">
+ then set the <code>svn:keywords</code> property
+ and commit the file to insert the revision number,
+ making it:
+ </p>
+
+<pre>
+# gooclean: clean up a single data file
+<span class="highlight">echo '# gooclean $Revision: 487$' > cleaned-$1</span>
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 <span class="highlight">>></span> cleaned-$1
</pre>
<p>
- The first change puts a line in the output file
- that describes how that file was created.
- The second change is to use <code>>></code> instead of <code>></code>
- to redirect <code>gooclean</code>'s output to the file.
- <code>>></code> means "append to":
- instead of overwriting whatever is in the file,
- it adds more content to it.
- This ensures that the first line of the file is the provenance record,
- with the actual output of <code>gooclean</code> after it.
+ Now,
+ each time this script is run it will:
+ </p>
+
+ <ul>
+ <li>
+ put the line
+<pre>
+# gooclean $Revision: 487$
+</pre>
+ in the output file,
+ then
+ </li>
+ <li>
+ append whatever the pipeline containing <code>goonorm</code> and <code>goofilter</code>
+ would have put in the file originally.
+ (The double redirection <code>>></code> means "append to" rather than "overwrite".)
+ </li>
+ </ul>
+
+ <p class="continue">
+ In other words,
+ the output of this shell script will always record
+ exactly what version of the script produced it.
+ This isn't enough to reproduce the output—we would need to record
+ the version numbers of the input files and the <code>goonorm</code> and <code>goofilter</code> programs,
+ and the values of the parameters those programs used
+ in order to do that—but it's an important and useful first step.
</p>
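+ <p>
+ As a sketch of one further step,
+ the script could record its parameters as well as its own revision.
+ In the version below,
+ <code>clean_one</code> and <code>normalize</code> are hypothetical names:
+ <code>normalize</code> merely stands in for the <code>goonorm</code> and <code>goofilter</code> pipeline
+ so that the example runs anywhere,
+ and the revision number is written as it would appear
+ after Subversion has filled in the keyword.
+ Single quotes stop the shell from treating <code>$Revision</code> as a variable:
+ </p>

```shell
# normalize: hypothetical stand-in for the goonorm | goofilter pipeline.
normalize() {
    tr 'a-z' 'A-Z'
}

# clean_one FILE: write provenance comments, then append the cleaned data.
clean_one() {
    output="cleaned-$(basename "$1")"
    # Single quotes keep the shell from expanding $Revision as a variable.
    echo '# gooclean $Revision: 487$' > "$output"
    # Recording the parameters is one more step toward reproducibility.
    echo '# parameters: -b 0 100 -x --enlarge 2.0' >> "$output"
    normalize < "$1" >> "$output"
}

# Example: clean_one something.dat
```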
<div class="keypoints">
<h3>Summary</h3>
<ul>
- <li><code>$Keyword:$</code> in a file can be filled in with a property value each time the file is committed.</li>
+ <li><code>$Keyword: …$</code> in a file can be filled in with a property value each time the file is committed.</li>
<li>Put version numbers in programs' output to establish provenance for data.</li>
<li><code>svn propset svn:keywords <em>property</em> <em>files</em></code> tells Subversion to start filling in property values.</li>
</ul>
<div class="challenges">
<h3>Challenges</h3>
- <p class="fixme">write some</p>
+
+ <ol>
+
+ <li>
+ Add <code>$Id:$</code> to a file,
+ use <code>svn propset</code> to set the corresponding property,
+ and then commit a change to the file.
+ What value does Subversion fill in for this keyword?
+ When would you use this rather than <code>Revision</code> or <code>Author</code>?
+ </li>
+
+ <li>
+ What does the <code>svn:ignore</code> property do when applied to a directory?
+ When would you use it?
+ </li>
+
+ </ol>
+
</div>
</section>
<h2>Summing Up</h2>
<p>
- Correlation does not imply causality,
- but there is a very strong correlation between
- using version control
- and doing good computational science.
- There's an equally strong correlation
- between <em>not</em> using it and either wasting effort or getting things wrong.
- Today (the middle of 2013),
- I will not review a paper if the software used in it
- is not under version control.
- The work it reports might be interesting,
- but without the kind of record-keeping that version control provides,
- there's no way to know exactly what its authors did.
- Just as importantly,
- if someone doesn't know enough about computing to use version control,
- the odds are good that they don't know enough
- to do the programming right either.
+ In 2006,
+ <a href="bib.html#mccullough-reproducibility">McCullough, McGeary, and Harrison</a>
+ analyzed several years of
+ the data and code archive of <cite>Journal of Money, Credit, and Banking</cite>,
+ a prestigious journal with a mandatory archiving policy.
+ Of 266 articles published during that time,
+ 193 were empirical and should have had data and code deposited in the archive.
+ Of those,
+ only 69 actually had anything in the archive.
+ Excluding eleven articles that only had data,
+ and seven that required software or other resources they did not have,
+ McCullough et al. were only able to replicate 14 of the remaining 186 articles.
+ This doesn't mean that the other 92% were wrong,
+ but it does mean there is no practical way to tell.
+ </p>
+
+ <p>
+ By itself,
+ version control doesn't make computational research reproducible.
+ It <em>does</em> help,
+ though,
+ and also eliminates the frustration and wasted time caused by
+ trying to figure out which emailed copy of a file,
+ or which of a dozen directories or USB drives,
+ is the most recent.
+ And while correlation doesn't imply causality,
+ there is certainly a strong correlation between
+ knowing enough about good computational practices to use version control
+ and knowing how to do other things right as well.
</p>
</section>