</ol>
-<div class="box">
- <h3>Nothing's Perfekt</h3>
-
- <p>
- Version control systems do have one important shortcoming.
- While it is easy for them to find, display, and merge differences in text files,
- images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they
- use specialized binary data formats.
- Most version control systems don't know how to deal with these formats,
- so all they can say is, "These files differ."
- Reconciling those differences will probably require use of an auxiliary tool,
- such as an audio editor
- or Microsoft Word's "Compare and Merge" utility.
- </p>
-</div>
-
<p>
The rest of this chapter will explore how to use
a popular open source version control system called Subversion.
+ It does not have all the features of some newer systems,
+ such as <a href="git.html">Git</a>,
+ but it is still widely used,
+ and is simpler to pick up than those more advanced alternatives.
+ No matter which system you use,
+  the most important thing to learn is not the details of its more obscure commands,
+  but the workflow that it encourages.
</p>
<div class="guide">
<h2>For Instructors</h2>
- <p class="fixme">explain</p>
+ <p>
+ Version control is the most important practical skill we introduce.
+ As the last paragraph of the introduction above says,
+ the workflow matters more than the ins and outs of any particular tool.
+ By the end of 90 minutes,
+ the instructor should be able to get learners to chant,
+ "Update, edit, merge, commit," in unison,
+ and have them understand what those terms mean
+ and why that's a good way to structure their working day.
+ </p>
+
+ <p>
+ Provided there aren't network problems,
+ this entire lesson can be covered in <span class="duration">90 minutes</span>.
+ The example at the end
+ showing how to use Subversion keywords to track provenance
+ is the "ah ha!" moment for many learners.
+ If time is short,
+ skip the material on recovering old versions of files
+ in order to get to this section instead.
+ (The fact that provenance is harder in Git,
+ both mechanically and conceptually,
+ is one reason to keep teaching Subversion.)
+ </p>
<div class="prereq">
<h3>Prerequisites</h3>
- <p class="fixme">prereq</p>
+ <p>
+ Basic shell concepts and skills
+ (<code>ls</code>, <code>cd</code>, <code>mkdir</code>,
+ editing files);
+ basic shell scripting
+ (for the discussion of <a href="#s:provenance">provenance</a>).
+ </p>
</div>
<div class="notes">
<h3>Teaching Notes</h3>
<ul>
+ <li>
+ Make sure the network is working <em>before</em> starting this lesson.
+ </li>
+ <li>
+ Give learners a ten-minute overview of what version control does for them
+ before diving into the watch-and-do practicals.
+ Most of them will have tried to co-author papers by emailing files back and forth,
+ or will have biked into the office
+ only to realize that the USB key with last night's work
+ is still on the kitchen table.
+ Instructors can also make jokes about directories with names like
+ "final version",
+ "final version revised",
+ "final version with reviewer three's corrections",
+ "really final version",
+ and,
+ "come on this really has to be the last version"
+ to motivate version control as a better way to collaborate
+ and as a better way to back work up.
+ </li>
+ <li>
+ Version control is typically taught after the shell,
+ so collect learners' names during that session
+ and create a repository for them to share
+ with their names as both their IDs and their passwords.
+ The easiest way to create the repository is to use
+      a server managed by an ISP such as DreamHost,
+ or on SourceForge, Google Code, or some other "forge" site,
+ all of which provide web interfaces for repository creation and management.
+ If your learners are advanced enough to be using SSH,
+ you can instead create it on any server they can access,
+ and connect with the <code>svn+ssh</code> protocol instead of HTTPS.
+ </li>
+ <li>
+ Be very clear what files learners are to edit
+ and what user IDs they are to use
+ when giving instructions.
+ It is common for them to edit the instructor's biography,
+ or to use the instructor's user ID and password when committing.
+ Be equally clear <em>when</em> they are to edit things:
+ it's also common for someone to edit the file the instructor is editing
+ and commit changes while the instructor is explaining what's going on,
+ so that a conflict occurs when the instructor comes to commit the file.
+ </li>
+ <li>
+ Learners could do most exercises with repositories on their own machines,
+ but it's hard for them to see how version control helps collaboration
+ unless they're sharing a repository with other learners.
+ In particular,
+ showing learners who changed what using <code>svn blame</code>
+ is only compelling if a file has been edited by at least two people.
+ </li>
+ <li>
+ If some learners are using Windows,
+ there will inevitably be issues merging files with different line endings.
+ <code>svn diff -x -w</code> is supposed to suppress differences in whitespace,
+ but we have found that it doesn't always work as advertised.
+ </li>
</ul>
</div>
<pre>
$ <span class="in">pwd</span>
-<span class="out">/home/vlad/explore</span>
+<span class="out">/home/dracula/explore</span>
$ <span class="in">ls -a</span>
<span class="out">. .. .svn earth jupiter mars</span>
$ <span class="in">ls -F .svn</span>
<figcaption>Figure 8: Updated Repository</figcaption>
</figure>
+ <div class="box">
+ <h3>When <em>Not</em> to Use Version Control</h3>
+
+ <p>
+ Despite the rapidly decreasing cost of storage,
+ it is still possible to run out of disk space.
+ In some labs,
+      people can easily go through 2 TB/month if they're not careful.
+      Version control tools usually store revisions as line-by-line differences,
+      which does not work for binary data files,
+      so they end up essentially storing every revision of such files separately.
+      This isn't that bad
+      (it's what we'd be doing anyway),
+      but it means version control can't store those changes compactly,
+ and the repository can get very large very quickly.
+ Another concern is that if very old data will no longer be used,
+ it can be nice to archive or delete old data files.
+ This is not possible if our data is version controlled:
+ information can only be added to a repository,
+ so it can only ever increase in size.
+ </p>
+
+ </div>
+
<p id="a:define-head">
Back in his cubicle,
Wolfman uses <code>svn update</code> to update his working copy.
</div>
+ <div class="box">
+ <h3>Nothing's Perfekt</h3>
+
+ <p>
+ Version control systems do have one important shortcoming.
+ While it is easy for them to find, display, and merge differences in text files,
+ images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they
+ use specialized binary data formats.
+ Most version control systems don't know how to deal with these formats,
+ so all they can say is, "These files differ."
+ Reconciling those differences will probably require use of an auxiliary tool,
+ such as an audio editor
+ or Microsoft Word's "Compare and Merge" utility.
+ </p>
+ </div>
+
<div class="box">
<h3>Diffing Other Files</h3>
and if necessary undo later on.
</p>
+ <div class="box">
+ <h3>Who Did What?</h3>
+
+ <p>
+ One other very useful command is <code>svn blame</code>,
+ which shows when each line in the file was last changed
+ and by whom:
+ </p>
+
+<pre>
+$ <span class="in">svn blame moons.txt</span>
+<span class="out"> 14 dracula Name Orbital Radius Orbital Period Mass Radius
+ 14 dracula (10**3 km) (days) (10**20 kg) (km)
+ 14 dracula Amalthea 181.4 0.498179 0.075 131 x 73 x 67
+ 9 mummy Io 421.6 1.769138 893.2 1821.6
+ 9 mummy Europa 670.9 3.551181 480.0 1560.8
+ 9 mummy Ganymede 1070.4 7.154553 1481.9 2631.2
+ 14 dracula Callisto 1882.7 16.689018 1075.9 2410.3
+ 14 dracula Himalia 11460 250.5662 0.095 85.0
+ 14 dracula Elara 11740 259.6528 0.008 40.0</span>
+</pre>
+
+ <p>
+ If you are ever wondering who to talk to about a change,
+ or why it was made,
+ <code>svn blame</code> is a good place to start.
+ </p>
+ </div>
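+  <p>
+    Because the author is the second column of every line,
+    the output of <code>svn blame</code> is easy to summarize with standard shell tools.
+    The following is a minimal sketch, not part of Subversion itself:
+    it assumes the column layout shown in the listing above,
+    and uses a trimmed copy of that listing
+    (revision, author, and first data column only)
+    so that it can be run without a repository.
+  </p>

```shell
# Trimmed copy of the `svn blame moons.txt` listing shown above.
# In real use, pipe the command's output into awk instead of using this variable.
blame_output='    14    dracula Name
    14    dracula (10**3 km)
    14    dracula Amalthea
     9      mummy Io
     9      mummy Europa
     9      mummy Ganymede
    14    dracula Callisto
    14    dracula Himalia
    14    dracula Elara'

# Field 1 is the revision, field 2 is the author: count lines per author.
summary=$(printf '%s\n' "$blame_output" |
    awk '{ count[$2]++ } END { for (a in count) print a, count[a] }' |
    sort)

printf '%s\n' "$summary"
```

+  <p class="continue">
+    For the sample above this reports six lines last changed by <code>dracula</code>
+    and three by <code>mummy</code>.
+  </p>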
+
<div class="keypoints">
<h3>Summary</h3>
<ul>
if he can ever convince the Mummy that numbers should have commas.
</p>
+ <div class="box">
+ <h3>Another Way to Do It</h3>
+
+ <p>
+ Another way to recover a particular version of a particular file
+ is to use the <code>svn copy</code> command.
+ If the URL of our repository is
+ <code>https://universal.software-carpentry.org/explore</code>,
+ then the command:
+ </p>
+
+<pre>
+$ <span class="in">svn copy https://universal.software-carpentry.org/explore/mission.txt@120 ./mission.txt</span>
+</pre>
+
+ <p class="continue">
+ copies the file <code>mission.txt</code> as it was in revision 120
+ into our working directory
+ (overwriting whatever <code>mission.txt</code> file we currently have,
+ if any).
+ What's more,
+ using <code>svn copy</code> brings along the file's history as well,
+ so that future <code>svn log</code> operations will show
+ how <code>mission.txt</code> was resurrected.
+ </p>
+ </div>
+
<p>
Merging can be used to recover older revisions of files,
not just the most recent,
<li>Old versions of files can be recovered by merging their old state with their current state.</li>
<li>Recovering an old version of a file does not erase the intervening changes.</li>
<li>Use branches to support parallel independent development.</li>
- <li><code>svn merge</code> merges two revisions of a file.</li>
<li><code>svn revert</code> undoes local changes to files.</li>
+ <li><code>svn merge</code> merges two revisions of a file.</li>
</ul>
</div>
<div class="challenges">
<h3>Challenges</h3>
- <p class="fixme">write some</p>
+ <ol>
+ <li>
+ Explain what the command:
+<pre>
+svn diff -r 240:261 fish.dat
+</pre>
+ does, and when you might want to run it.
+ </li>
+
+ <li>
+ Suppose that a file called <code>mission.txt</code>
+ existed in revision 90 of a repository,
+ but had been deleted in revision 91.
+ What two commands could we use to recover it?
+ </li>
+
+ </ol>
</div>
</section>
- <section id="s:setup">
+<section id="s:setup">
+ <h2>Setting up a Repository</h2>
- <h2>Setting up a Repository</h2>
+ <div class="understand">
+ <h3>Learning Objectives</h3>
+ <ul>
+ <li>How to create a repository.</li>
+ </ul>
+ </div>
- <div class="understand" id="u:setup">
- <h3>Understand:</h3>
- <ul>
- <li>How to create a repository.</li>
- </ul>
- </div>
+ <p>
+ It is finally time to see how to create a repository.
+ As a quick recap,
+ we will keep the master copy of our work in a repository
+ on a server that we can access from other machines on the internet.
+ That master copy consists of files and directories that no-one ever edits directly.
+ Instead, a copy of Subversion running on that machine
+ manages updates for us and watches for conflicts.
+ Our working copy is a mirror image of the master sitting on our computer.
+ When our Subversion client needs to communicate with the master,
+ it exchanges data with the copy of Subversion running on the server.
+ </p>
+
+ <figure id="f:repo_four_things">
+ <img src="svn/repo_four_things.png" alt="What's Needed for a Repository" />
+ <figcaption>Figure 15: What's Needed for a Repository</figcaption>
+ </figure>
- <p>
- It is finally time to see how to create a repository.
- As a quick recap,
- we will keep the master copy of our work in a repository
- on a server that we can access from other machines on the internet.
- That master copy consists of files and directories that no-one ever edits directly.
- Instead, a copy of Subversion running on that machine
- manages updates for us and watches for conflicts.
- Our working copy is a mirror image of the master sitting on our computer.
- When our Subversion client needs to communicate with the master,
- it exchanges data with the copy of Subversion running on the server.
- </p>
+ <p>
+  To make this work, we need four things
+ (<a href="#f:repo_four_things">Figure 15</a>):
+ </p>
- <figure id="f:repo_four_things">
- <img src="svn/repo_four_things.png" alt="What's Needed for a Repository" />
- </figure>
+ <ol>
- <p>
- To make this to work, we need four things
- (<a href="#f:repo_four_things">Figure XXX</a>):
- </p>
+ <li>
+ The repository itself.
+ It's not enough to create an empty directory and start filling it with files:
+ Subversion needs to create a lot of other structure
+ in order to keep track of old revisions, who made what changes, and so on.
+ </li>
- <ol>
-
- <li>
- The repository itself.
- It's not enough to create an empty directory and start filling it with files:
- Subversion needs to create a lot of other structure
- in order to keep track of old revisions, who made what changes, and so on.
- </li>
-
- <li>
- The full URL of the repository.
- This includes the URL of the server
- and the path to the repository on that machine.
- (The second part is needed because a single server can,
- and usually will,
- host many repositories.)
- </li>
-
- <li>
- Permission to read or write the master copy.
- Many open source projects give the whole world permission to read from their repository,
- but very few allow strangers to write to it:
- there are just too many possibilities for abuse.
- Somehow, we have to set up a password or something like it
- so that users can prove who they are.
- </li>
-
- <li>
- A working copy of the repository on our computer.
- Once the first three things are in place,
- this just means running the <code>checkout</code> command.
- </li>
-
- </ol>
+ <li>
+ The full URL of the repository.
+ This includes the URL of the server
+ and the path to the repository on that machine.
+ (The second part is needed because a single server can,
+ and usually will,
+ host many repositories.)
+ </li>
- <p>
- To keep things simple,
- we will start by creating a repository on the machine that we're working on.
- This won't let us share our work with other people,
- but it <em>will</em> allow us to save the history of our work as we go along.
- </p>
+ <li>
+ Permission to read or write the master copy.
+ Many open source projects give the whole world permission to read from their repository,
+ but very few allow strangers to write to it:
+ there are just too many possibilities for abuse.
+ Somehow, we have to set up a password or something like it
+ so that users can prove who they are.
+ </li>
- <p>
- The command to create a repository is <code>svnadmin create</code>,
- followed by the path to the repository.
- If we want to create a repository called <code>lair_repo</code>
- directly under our home directory,
- we just <code>cd</code> to get home
- and run <code>svnadmin create lair_repo</code>.
- This command creates a directory called <code>lair_repo</code> to hold our repository,
- and fills it with various files that Subversion uses
- to keep track of the project's history:
- </p>
+ <li>
+ A working copy of the repository on our computer.
+ Once the first three things are in place,
+ this just means running the <code>checkout</code> command.
+ </li>
+
+ </ol>
+
+ <p>
+ To keep things simple,
+ we will start by creating a repository on the machine that we're working on.
+ This won't let us share our work with other people,
+ but it <em>will</em> allow us to save the history of our work as we go along.
+ </p>
+
+ <p>
+ The command to create a repository is <code>svnadmin create</code>,
+ followed by the path to the repository.
+ If we want to create a repository called <code>missions_repo</code>
+ directly under our home directory,
+ we just <code>cd</code> to get home
+ and run <code>svnadmin create missions_repo</code>.
+ This command creates a directory called <code>missions_repo</code> to hold our repository,
+ and fills it with various files that Subversion uses
+ to keep track of the project's history:
+ </p>
<pre>
$ <span class="in">cd</span>
-$ <span class="in">svnadmin create lair_repo</span>
-$ <span class="in">ls -F lair_repo</span>
+$ <span class="in">svnadmin create missions_repo</span>
+$ <span class="in">ls -F missions_repo</span>
<span class="out">README.txt conf/ db/ format hooks/ locks/</span>
</pre>
- <p class="continue">
- We should <em>never</em> edit anything in this repository directly.
- Doing so probably won't shred our sanity and leave us gibbering in mindless horror,
- but it will almost certainly make the repository unusable.
- </p>
+ <p class="continue">
+ We should <em>never</em> edit any of this directly,
+ since it will almost certainly make the repository unusable.
+ Instead,
+ we should use <code>svn checkout</code>
+ to get a working copy of this repository.
+ If our home directory is <code>/users/mummy</code>,
+ then the full path to the repository we just created is <code>/users/mummy/missions_repo</code>,
+  so we run <code>svn checkout file:///users/mummy/missions_repo missions_working</code>.
+ </p>
- <p>
- To get a working copy of this repository,
- we use Subversion's <code>checkout</code> command.
- If our home directory is <code>/users/mummy</code>,
- then the full path to the repository we just created is <code>/users/mummy/lair_repo</code>,
- so we run <code>svn checkout file:///users/mummy/lair lair_working</code>.
- </p>
+ <p>
+ Working backward,
+ the second argument,
+ <code>missions_working</code>,
+ specifies where the working copy is to be put.
+ The first argument is the URL of our repository,
+ and it has two parts.
+  <code>/users/mummy/missions_repo</code> is the path to the repository directory.
+ <code>file://</code> specifies the <a href="glossary.html#protocol">protocol</a>
+ that Subversion will use to communicate with the repository—in this case,
+ it says that the repository is part of the local machine's filesystem.
+ (Notice that the protocol ends in two slashes,
+ while the absolute path to the repository starts with a slash,
+ making three in total.
+ A very common mistake is to type only two, since that's what web URLs normally have.)
+ </p>
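+  <p>
+    One way to avoid miscounting slashes is to let the shell build the URL
+    from the absolute path.
+    The sketch below assumes the repository path used in this section;
+    <code>make_file_url</code> is a hypothetical helper name,
+    not a Subversion command.
+  </p>

```shell
# Hypothetical helper: turn an absolute path into a file:// URL.
# "file://" contributes two slashes; the leading "/" of the path is the third.
make_file_url() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;
        *)  echo "error: '$1' is not an absolute path" >&2
            return 1 ;;
    esac
}

repo_url=$(make_file_url /users/mummy/missions_repo)
echo "$repo_url"
```

+  <p class="continue">
+    The resulting value can then be passed as the first argument of <code>svn checkout</code>.
+  </p>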
- <p>
- Working backward,
- the second argument,
- <code>lair_working</code>,
- specifies where the working copy is to be put.
- The first argument is the URL of our repository,
- and it has two parts.
- <code>/users/mummy/lair_repo</code> is the path to repository directory.
- <code>file://</code> specifies the <a href="glossary.html#protocol">protocol</a>
- that Subversion will use to communicate with the repository—in this case,
- it says that the repository is part of the local machine's filesystem.
- Notice that the protocol ends in two slashes,
- while the absolute path to the repository starts with a slash,
- making three in total.
- A very common mistake is to type only two, since that's what web URLs normally have.
- </p>
+ <p>
+ When we're doing a checkout,
+ it is <em>very</em> important that we provide the second argument,
+ which specifies the name of the directory we want the working copy to be put in.
+ Without it,
+ Subversion will try to use the name of the repository,
+ <code>missions_repo</code>,
+ as the name of the working copy.
+ Since we're in the directory that contains the repository,
+ this means that Subversion will try to overwrite the repository with a working copy.
+ Again,
+ there isn't much risk of our sanity being torn to shreds,
+ but this could ruin our repository.
+ </p>
- <p>
- When we're doing a checkout,
- it is <em>very</em> important that we provide the second argument,
- which specifies the name of the directory we want the working copy to be put in.
- Without it,
- Subversion will try to use the name of the repository,
- <code>lair_repo</code>,
- as the name of the working copy.
- Since we're in the directory that contains the repository,
- this means that Subversion will try to overwrite the repository with a working copy.
- Again,
- there isn't much risk of our sanity being torn to shreds,
- but this could ruin our repository.
- </p>
+ <p>
+ To avoid this problem,
+ most people create a sub-directory in their account called something like <code>repos</code>,
+ and then create their repositories in that.
+ For example,
+ we could create our repository in <code>/users/mummy/repos/missions</code>,
+ then check out a working copy as <code>/users/mummy/missions</code>.
+ This practice makes both names easier to read.
+ </p>
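+  <p>
+    A small guard in a script can also catch the mistake before it happens.
+    This is only a sketch under the assumption that
+    we never want to check out into a path that already exists;
+    <code>safe_checkout_target</code> is a hypothetical name.
+  </p>

```shell
# Hypothetical guard: succeed only if the checkout target does not exist yet,
# so a working copy can never land on top of a repository directory.
safe_checkout_target() {
    if [ -e "$1" ]; then
        echo "refusing to check out: '$1' already exists" >&2
        return 1
    fi
    return 0
}

# Intended use (not run here, since it needs Subversion and a repository):
#   safe_checkout_target "$HOME/missions" &&
#       svn checkout "file://$HOME/repos/missions" "$HOME/missions"
```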
- <p>
- To avoid this problem,
- most people create a sub-directory in their account called something like <code>repos</code>,
- and then create their repositories in that.
- For example,
- we could create our repository in <code>/users/mummy/repos/lair</code>,
- then check out a working copy as <code>/users/mummy/lair</code>.
- This practice makes both names easier to read.
- </p>
+ <p>
+ The obvious next step is to put our repository on a server,
+ rather than on our personal machine.
+ In fact,
+ we should <em>always</em> do this
+ so that we don't lose the history of our project
+ if our laptop is damaged or stolen.
+ A departmental server is also much more likely to be backed up regularly
+ than our personal machine…
+ </p>
- <p>
- The obvious next steps are
- to put our repository on a server,
- rather than on our personal machine,
- and to give other people access to the repository we have just created
- so that they can work with us.
- We'll discuss the first in <a href="web.html#s:svn">a later chapter</a>,
- but unfortunately,
- the second really does require things that we are not going to cover in this course.
- If you want to do this, you can:
- </p>
+ <p>
+ Creating a repository on a server is simple:
+ just log in and go through the steps described above.
+ Accessing that repository from another machine
+ is also straightforward.
+ If the machine's address is <code>serv.euphoric.edu</code>,
+ and our user ID is <code>dracula</code>,
+ the URL of the repository will be something like:
+ </p>
+
+<pre>
+svn+ssh://dracula@serv.euphoric.edu/home/dracula/repos/missions
+</pre>
+
+ <p>
+ Reading from left to right:
+ </p>
- <ul>
+ <ul>
+ <li>
+ <code>svn+ssh</code> is the protocol that Subversion uses to connect to the server
+ (in this case,
+ a combination of Subversion's own protocol
+ and <a href="shell.html#s:ssh">SSH</a>);
+ </li>
+ <li>
+ <code>dracula@serv.euphoric.edu</code> identifies the server and who we are
+ (just like an email address);
+ and
+ </li>
+ <li>
+      <code>/home/dracula/repos/missions</code> is the absolute path of the repository
+ on the server.
+ </li>
+ </ul>
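+  <p>
+    The same left-to-right reading can be done mechanically
+    with the shell's parameter expansion.
+    This sketch only pulls apart the example URL above;
+    the variable names are illustrative,
+    and it assumes the URL always has the form
+    <code>protocol://user@host/path</code>.
+  </p>

```shell
url='svn+ssh://dracula@serv.euphoric.edu/home/dracula/repos/missions'

protocol=${url%%://*}    # everything before "://"
rest=${url#*://}         # everything after "://"
login=${rest%%/*}        # user@host, up to the first "/"
repo_path=/${rest#*/}    # absolute path on the server
user=${login%%@*}        # part before "@"
host=${login#*@}         # part after "@"

echo "protocol: $protocol"
echo "user:     $user"
echo "host:     $host"
echo "path:     $repo_path"
```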
- <li>
- ask your system administrator to set it up for you;
- </li>
+ <p id="a:only_user">
+ That's fine if you are the only person using the repository,
+ but if you want to share it with others,
+ you need to worry about security.
+ As we discuss in the lesson on <a href="web.html">web programming</a>,
+ as soon as you provide a service on the internet,
+ there's the possibility that someone may try to attack your system through it.
+ Rather than trying to learn enough system administration skills
+ to set things up safely,
+ it is usually easier to:
+ </p>
- <li>
- use an open source hosting service like <a href="http://www.sf.net">SourceForge</a>,
- <a href="http://code.google.com">Google Code</a>,
- <a href="https://github.com/">GitHub</a>,
- or <a href="https://bitbucket.org/">BitBucket</a>; or
- </li>
+ <ul>
- <li>
- spend a few dollars a month on a commercial hosting service like <a href="http://dreamhost.com">DreamHost</a>
- that provides web-based GUIs for creating and managing repositories.
- </li>
+ <li>
+ ask your department's system administrator to set it up for you;
+ </li>
- </ul>
+ <li>
+ use a hosting service like <a href="http://www.sf.net">SourceForge</a>,
+ <a href="http://code.google.com">Google Code</a>,
+ <a href="https://github.com/">GitHub</a>,
+ or <a href="https://bitbucket.org/">BitBucket</a>; or
+ </li>
- <p>
- If you choose the second or third option,
- please check with whoever handles intellectual property at your institution
- to make sure that putting your work on a commercially-operated machine
- that is probably in some other legal jurisdiction
- isn't going to cause trouble.
- Many people assume that it's "just OK",
- while others act as if not having asked will be an acceptable defence later on.
- Unfortunately,
- neither is true…
- </p>
+ <li>
+ spend a few dollars a month on a commercial hosting service
+ that provides web-based GUIs for creating and managing repositories.
+ </li>
- <div class="keypoints" id="k:setup">
- <h3>Summary</h3>
- <ul>
- <li>Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.</li>
- <li><code>svnadmin create <em>name</em></code> creates a new repository.</li>
- </ul>
- </div>
+ </ul>
- </section>
+ <p>
+ If you choose the second or third option,
+ please check with whoever handles intellectual property at your institution
+ to make sure that putting your work on a commercially-operated machine
+ that is probably in some other legal jurisdiction
+ isn't going to cause trouble.
+ Many people assume that it's "just OK",
+ while others act as if not having asked will be an acceptable defence later on.
+ Unfortunately,
+ neither is true…
+ </p>
+
+ <div class="keypoints">
+ <h3>Summary</h3>
+ <ul>
+ <li><code>svnadmin create <em>name</em></code> creates a new repository.</li>
+ <li>Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.</li>
+ </ul>
+ </div>
+
+ <div class="challenges">
+ <h3>Challenges</h3>
- <section id="s:provenance">
+ <ol>
- <h2>Provenance</h2>
+ <li>
+ Create a Subversion repository called <code>trials_repo</code>
+ in your home directory.
+ Check out a working copy in a directory called <code>trials_working</code>
+ (also in your home directory).
+ Add a couple of text files,
+ commit the changes,
+ and then use <code>svn info trials_working</code>
+ to see what Subversion tells you about your working copy.
+ </li>
- <div class="understand" id="u:provenance">
- <h3>Understand:</h3>
+ <li>
+ We said <a href="#a:only_user">above</a> that
+ you might be the only person using a particular repository.
+ When and why is version control worth using
+ if no-one else is working on a project with you?
+ </li>
+
+ <li>
+ There are many ways to organize repositories.
+ Some of the most common are to create one repository for:
<ul>
- <li>What data provenance is.</li>
- <li>How to embed version numbers and other information in files managed by version control.</li>
- <li>How to record version information about a program in its output.</li>
+ <li>each person</li>
+ <li>each paper</li>
+ <li>all the work done on one grant</li>
+ <li>all the work done on one project</li>
+ <li>the entire lab (which is shared by everyone in the lab)</li>
+ <li>the entire department (typically with a top-level directory for each person or project in the department)</li>
</ul>
- </div>
+ What activities does each one make easy or hard?
+ Which of these would you prefer, and why?
+ </li>
- <p>
- In art,
- the <a href="glossary.html#provenance">provenance</a> of a work
- is the history of who owned it, when, and where.
- In science,
- it's the record of how a particular result came to be:
- what raw data was processed by what version of what program to create which intermediate files,
- what was used to turn those files into which figures of which papers,
- and so on.
- </p>
+ </ol>
+ </div>
- <p>
- One of the central ideas of this course is that
- wen can automatically track the provenance of scientific data.
- To start,
- suppose we have a text file <code>combustion.dat</code> in a Subversion repository.
- Run the following two commands:
- </p>
+</section>
+
+<section id="s:provenance">
+ <h2>Provenance</h2>
+
+ <div class="understand">
+    <h3>Learning Objectives</h3>
+ <ul>
+ <li>What data provenance is.</li>
+ <li>How to embed version numbers and other information in files managed by version control.</li>
+ <li>How to record version information about a program in its output.</li>
+ </ul>
+ </div>
+
+ <p>
+ In art,
+ the <a href="glossary.html#provenance">provenance</a> of a work
+ is the history of who owned it, when, and where.
+ In science,
+ it's the record of how a particular result came to be:
+ what raw data was processed by what version of what program to create which intermediate files,
+ what was used to turn those files into which figures of which papers,
+ and so on.
+ </p>
+
+ <p>
+ One of the big benefits of using version control is that
+ it lets us track the provenance of scientific data automatically.
+ To start,
+ suppose we have a text file <code>combustion.dat</code> in a Subversion repository.
+ Run the following two commands:
+ </p>
<pre>
$ svn propset svn:keywords Revision combustion.dat
$ svn commit -m "Turning on the 'Revision' keyword" combustion.dat
</pre>
- <p>
- Now open the file in an editor
- and add the following line somewhere near the top:
- </p>
+ <p class="continue">
+ This does nothing by itself,
+ but now open the file in an editor
+ and add the following line somewhere near the top:
+ </p>
<pre>
-# $Revision:$
+$Revision:$
</pre>
- <p>
- The '#' sign isn't important:
- it's just what <code>.dat</code> files use to show comments.
- The <code>$Revision:$</code> string,
- on the other hand,
- means something special to Subversion.
- Save the file, and commit the change:
- </p>
+ <p>
+ The <code>$Revision:$</code> string means something special to Subversion.
+ Save the file, and commit the change:
+ </p>
<pre>
$ svn commit -m "Inserting the 'Revision' keyword" combustion.dat
</pre>
- <p>
- When we open the file again,
- we'll see that Subversion has changed that line to something like:
- </p>
+ <p>
+ When we open the file again,
+ we'll see that Subversion has changed that line to something like:
+ </p>
<pre>
-# $Revision: 143$
+$Revision: 143$
</pre>
- <p class="continue">
- i.e., Subversion has inserted the version number
- after the colon and before the closing <code>$</code>.
- </p>
+ <p class="continue">
+ i.e., it has inserted the version number
+ after the colon and before the closing <code>$</code>.
+ If we edit the file again—e.g., add a couple of lines with random numbers—and
+ commit once more,
+ the line is updated again to:
+ </p>
- <p>
- Here's what just happened.
- First, Subversion allows you to set
- <a href="glossary.html#property-subversion">properties</a>
- for files and and directories.
- These properties aren't in the files or directories themselves,
- but live in Subversion's database.
- One of those properties,
- <code>svn:keywords</code>,
- tells Subversion to look in files that are being changed
- for strings of the form <code>$propertyname: …$</code>,
- where <code>propertyname</code> is a string like <code>Revision</code> or <code>Author</code>.
- (About half a dozen such strings are supported.)
- </p>
+<pre>
+$Revision: 144$
+</pre>
- <p>
- If it sees such a string,
- Subversion rewrites it as the commit is taking place to replace <code>…</code>
- with the current version number,
- the name of the person making the change,
- or whatever else the property's name tells it to do.
- You only have to add the string to the file once;
- after that,
- Subversion updates it for you every time the file changes.
- </p>
+ <p>
+ Here's what just happened.
+  First, Subversion allows us to add
+  <a href="glossary.html#property-subversion">properties</a>
+  to files and directories.
+ These properties aren't stored in the files or directories themselves,
+ but in Subversion's database.
+ One of those properties,
+ <code>svn:keywords</code>,
+ tells Subversion to look in files that are being changed
+ for strings of the form <code>$propertyname: …$</code>,
+ where <code>propertyname</code> is a string like <code>Revision</code> or <code>Author</code>.
+ (About half a dozen such strings are supported.)
+ </p>
- <p>
- Putting the version number in the file this way can be pretty handy.
- If you copy the file to another machine,
- for example,
- it carries its version number with it,
- so you can tell which version you have even if it's outside version control.
- We'll see some more useful things we can do with this information in
- <a href="python.html">the next chapter</a>.
- </p>
+ <p>
+ If it sees such a string,
+ Subversion rewrites it as the commit is taking place to replace <code>…</code>
+ with the current version number,
+ the name of the person making the change,
+ or whatever else the property's name tells it to do.
+ We only have to add the string to the file once;
+ after that,
+ Subversion updates it for us every time the file changes.
+ </p>
- <div class="box">
-
- <h3>When <em>Not</em> to Use Version Control</h3>
-
- <p>
- Despite the rapidly decreasing cost of storage,
- it is still possible to run out of disk space.
- In some labs,
- people can easy go through 2 TB/month if they're not careful.
- Since version control tools usually store revisions in terms of lines,
- with binary data files,
- they end up essentially storing every revision separately.
- This isn't that bad
- (it's what we'd be doing anyway),
- but it means version control isn't doing what it likes to do,
- and the repository can get very large very quickly.
- Another concern is that if very old data will no longer be used,
- it can be nice to archive or delete old data files.
- This is not possible if our data is version controlled:
- information can only be added to a repository,
- so it can only ever increase in size.
- </p>
-
- </div>
+ <p>
+ Putting the version number in the file this way can be pretty handy.
+ If we copy the file to another machine,
+ for example,
+ it carries its version number with it,
+ so we can tell which version we have even if it's outside version control.
+ We'll see some more useful things we can do with this information <a href="python.html">later</a>.
+ </p>
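+ <p>
+ As a minimal sketch of that check,
+ assume a copy called <code>data.txt</code> (a hypothetical name and contents)
+ that already contains an expanded <code>Revision</code> keyword.
+ One way to pull the number back out with standard shell tools might be:
+ </p>

```shell
# Hypothetical file standing in for a copy taken outside version control.
printf '%s\n' '$Revision: 144$' '0.37 0.92' '0.15 0.48' > data.txt

# Extract just the number from the first expanded Revision keyword.
# The pattern allows optional spaces around the number.
revision=$(sed -n 's/.*\$Revision: *\([0-9][0-9]*\) *\$.*/\1/p' data.txt | head -n 1)

echo "data.txt is from revision $revision"
```

+ <p class="continue">
+ The pattern only matches a keyword that has already been filled in,
+ so a file whose keyword is still empty leaves <code>revision</code> blank.
+ </p>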
- <p>
- We can use this trick with shell scripts too,
- or with almost any other kind of program.
- Going back to Nelle Nemo's data processing from the previous chapter,
- for example,
- suppose she writes a shell script that uses <code>gooclean</code>
- to tidy up data files.
- Her first version looks like this:
- </p>
+ <p>
+ We can use this trick with shell scripts too,
+ or with almost any other kind of program.
+ Let's go back to Nelle Nemo's data processing from
+ the lesson on the <a href="shell.html">shell</a>.
+ Suppose she writes a shell script called <code>gooclean</code>
+ to tidy up data files.
+ Her first version looks like this:
+ </p>
<pre>
-for filename in $*
-do
- gooclean -b 0 100 < $filename > cleaned-$filename
-done
+# gooclean: clean up a single data file
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 > cleaned-$1
</pre>
- <p class="continue">
- i.e., it runs <code>gooclean</code> with bounding values of 0 and 100
- for each specified file,
- putting the result in a temporary file with a well-defined name.
- Assuming that '#' is the comment character for those kinds of data files,
- she could instead write:
- </p>
+ <p class="continue">
+ i.e.,
+ it runs <code>goonorm</code> and then <code>goofilter</code> with some fixed parameters
+ and creates an output file called <code>cleaned-something.dat</code>
+ (if the input file's name was <code>something.dat</code>).
+ Assuming that '#' is the comment character for her output files,
+ she could instead write:
+ </p>
<pre>
-for filename in $*
-do
- <span class="highlight">echo "gooclean $Revision: 901$ -b 0 100" > $filename</span>
- gooclean -b 0 100 < $filename <span class="highlight">>></span> cleaned-$filename
-done
+# gooclean: clean up a single data file
+# single quotes keep the shell from treating the keyword as a variable
+<span class="highlight">echo '# gooclean $Revision:$' > cleaned-$1</span>
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 <span class="highlight">>></span> cleaned-$1
</pre>
- <p>
- The first change puts a line in the output file
- that describes how that file was created.
- The second change is to use <code>>></code> instead of <code>></code>
- to redirect <code>gooclean</code>'s output to the file.
- <code>>></code> means "append to":
- instead of overwriting whatever is in the file,
- it adds more content to it.
- This ensures that the first line of the file is the provenance record,
- with the actual output of <code>gooclean</code> after it.
- </p>
+ <p class="continue">
+ then set the <code>svn:keywords</code> property
+ and commit the file to insert the revision number,
+ making it:
+ </p>
- <div class="keypoints" id="k:provenance">
- <h3>Summary</h3>
- <ul>
- <li><code>$Keyword:$</code> in a file can be filled in with a property value each time the file is committed.</li>
- <li idea="paranoia">Put version numbers in programs' output to establish provenance for data.</li>
- <li><code>svn propset svn:keywords <em>property</em> <em>files</em></code> tells Subversion to start filling in property values.</li>
- </ul>
- </div>
+<pre>
+# gooclean: clean up a single data file
+<span class="highlight">echo '# gooclean $Revision: 487$' > cleaned-$1</span>
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 <span class="highlight">>></span> cleaned-$1
+</pre>
- </section>
+ <p>
+ Now,
+ each time this script is run it will:
+ </p>
+
+ <ul>
+ <li>
+ put the line
+<pre>
+# gooclean $Revision: 487$
+</pre>
+ in the output file,
+ then
+ </li>
+ <li>
+ append whatever the pipeline containing <code>goonorm</code> and <code>goofilter</code>
+ would have put in the file originally.
+ (The double redirection <code>>></code> means "append to" rather than "overwrite".)
+ </li>
+ </ul>
+
+ <p class="continue">
+ In other words,
+ the output of this shell script will always record
+ exactly what version of the script produced it.
+ This isn't enough to reproduce the output—we would need to record
+ the version numbers of the input files and the <code>goonorm</code> and <code>goofilter</code> programs,
+ and the values of the parameters those programs used
+ in order to do that—but it's an important and useful first step.
+ </p>
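+ <p>
+ The header-then-append pattern can be tried without the real tools.
+ The sketch below is only an illustration:
+ <code>fake_pipeline</code> and its data values are made up
+ to stand in for <code>goonorm</code> and <code>goofilter</code>,
+ and the revision number is typed in by hand rather than filled in by Subversion.
+ </p>

```shell
# Hypothetical stand-in for the goonorm | goofilter pipeline (made-up values).
fake_pipeline() {
    printf '%s\n' '1.0 2.0' '3.0 4.0'
}

out=cleaned-something.dat

# '>' creates the file (or overwrites an old copy) with the provenance line.
# Single quotes keep the shell from expanding $Revision as a variable.
echo '# gooclean $Revision: 487$' > "$out"

# '>>' appends the data after the header instead of replacing it.
fake_pipeline >> "$out"

# Show the provenance line, then count the lines in the file.
head -n 1 "$out"
wc -l < "$out"
```

+ <p class="continue">
+ Because the header is written with a single redirection,
+ running the script again starts the output file afresh
+ instead of piling a second header on top of old data.
+ </p>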
+
+ <div class="keypoints">
+ <h3>Summary</h3>
+ <ul>
+ <li><code>$Keyword: …$</code> in a file can be filled in with a property value each time the file is committed.</li>
+ <li>Put version numbers in programs' output to establish provenance for data.</li>
+ <li><code>svn propset svn:keywords <em>property</em> <em>files</em></code> tells Subversion to start filling in property values.</li>
+ </ul>
+ </div>
+
+ <div class="challenges">
+ <h3>Challenges</h3>
+
+ <ol>
+
+ <li>
+ Add <code>$Id:$</code> to a file,
+ use <code>svn propset</code> to set the corresponding property,
+ and then commit a change to the file.
+ What value does Subversion fill in for this keyword?
+ When would you use this rather than <code>Revision</code> or <code>Author</code>?
+ </li>
+
+ <li>
+ What does the <code>svn:ignore</code> property do when applied to a directory?
+ When would you use it?
+ </li>
+
+ </ol>
+
+ </div>
+
+</section>
<section id="s:summary">
<h2>Summing Up</h2>
<p>
- Correlation does not imply causality,
- but there is a very strong correlation between
- using version control
- and doing good computational science.
- There's an equally strong correlation
- between <em>not</em> using it and either wasting effort or getting things wrong.
- Today (the middle of 2013),
- I will not review a paper if the software used in it
- is not under version control.
- The work it reports might be interesting,
- but without the kind of record-keeping that version control provides,
- there's no way to know exactly what its authors did.
- Just as importantly,
- if someone doesn't know enough about computing to use version control,
- the odds are good that they don't know enough
- to do the programming right either.
+ In 2006,
+ <a href="bib.html#mccullough-reproducibility">McCullough, McGeary, and Harrison</a>
+ analyzed several years of
+ the data and code archive of <cite>Journal of Money, Credit, and Banking</cite>,
+ a prestigious journal with a mandatory archiving policy.
+ Of 266 articles published during that time,
+ 193 were empirical and should have had data and code deposited in the archive.
+ Of those,
+ only 69 actually had anything in the archive.
+ Excluding eleven articles that only had data,
+ and seven that required software or other resources they did not have,
+ McCullough et al. were only able to replicate 14 of the remaining 186 articles.
+ This doesn't mean that the other 92% were wrong,
+ but it does mean there is no practical way to tell.
+ </p>
+
+ <p>
+ By itself,
+ version control doesn't make computational research reproducible.
+ It <em>does</em> help,
+ though,
+ and also eliminates the frustration and wasted time caused by
+ trying to figure out which emailed copy of a file,
+ or which of a dozen directories or USB drives,
+ is the most recent.
+ And while correlation doesn't imply causality,
+ there is certainly a strong correlation between
+ knowing enough about good computational practices to use version control
+ and knowing how to do other things right as well.
</p>
</section>