X-Git-Url: http://git.tremily.us/?p=swc-version-control-svn.git;a=blobdiff_plain;f=svn.html;h=1dbfb02f6485f55dc815d0e04e987224038a25ab;hp=fe01b5f96e780b7678e95cd9197ad42f64339a13;hb=962089603c44a337e198fbbbd78df7f995958cfc;hpb=af2312ba8b4d48309bfeac51fb4d10f8652eaf06 diff --git a/svn.html b/svn.html index fe01b5f..1dbfb02 100644 --- a/svn.html +++ b/svn.html @@ -58,40 +58,118 @@ -
-

Nothing's Perfekt

- -

- Version control systems do have one important shortcoming. - While it is easy for them to find, display, and merge differences in text files, - images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they - use specialized binary data formats. - Most version control systems don't know how to deal with these formats, - so all they can say is, "These files differ." - Reconciling those differences will probably require use of an auxiliary tool, - such as an audio editor - or Microsoft Word's "Compare and Merge" utility. -

-
-

The rest of this chapter will explore how to use a popular open source version control system called Subversion. + It does not have all the features of some newer systems, + such as Git, + but it is still widely used, + and is simpler to pick up than those more advanced alternatives. + No matter which system you use, + the most important thing to learn is not the details of its more obscure commands, + but the workflow that it encourages.

For Instructors

-

explain

+

+ Version control is the most important practical skill we introduce. + As the last paragraph of the introduction above says, + the workflow matters more than the ins and outs of any particular tool. + By the end of 90 minutes, + the instructor should be able to get learners to chant, + "Update, edit, merge, commit," in unison, + and have them understand what those terms mean + and why that's a good way to structure their working day. +

+ +

+ Provided there aren't network problems, + this entire lesson can be covered in 90 minutes. + The example at the end + showing how to use Subversion keywords to track provenance + is the "ah ha!" moment for many learners. + If time is short, + skip the material on recovering old versions of files + in order to get to this section instead. + (The fact that provenance is harder in Git, + both mechanically and conceptually, + is one reason to keep teaching Subversion.) +

Prerequisites

-

prereq

+

+ Basic shell concepts and skills + (ls, cd, mkdir, + editing files); + basic shell scripting + (for the discussion of provenance). +

Teaching Notes

@@ -101,16 +179,17 @@

Basic Use

-

Learning Objectives:

+

Learning Objectives

@@ -220,7 +299,7 @@ let's assume that the Mummy (Dracula and Wolfman's boss) has already put some notes in a version control repository - whose URL is https://universal.software-carpentry.org/monsters. + whose URL is https://universal.software-carpentry.org/explore. Every repository has an address like this that uniquely identifies the location of the master copy.

@@ -251,24 +330,24 @@

-$ svn checkout https://universal.software-carpentry.org/monsters
+$ svn checkout https://universal.software-carpentry.org/explore
 

- This creates a new directory called monsters + This creates a new directory called explore and fills it with a copy of the repository's contents (Figure 6).

-A    monsters/jupiter
-A    monsters/mars
-A    monsters/mars/mons-olympus.txt
-A    monsters/mars/cydonia.txt
-A    monsters/earth
-A    monsters/earth/himalayas.txt
-A    monsters/earth/antarctica.txt
-A    monsters/earth/carlsbad.txt
+A    explore/jupiter
+A    explore/mars
+A    explore/mars/mons-olympus.txt
+A    explore/mars/cydonia.txt
+A    explore/earth
+A    explore/earth/himalayas.txt
+A    explore/earth/antarctica.txt
+A    explore/earth/carlsbad.txt
 Checked out revision 6.
 
@@ -283,7 +362,7 @@ Checked out revision 6.

-$ cd monsters
+$ cd explore
 $ ls
 earth   jupiter mars
 $ ls *
@@ -310,7 +389,7 @@ cydonia.txt  mons-olympus.txt
 
 
 $ pwd
-/home/vlad/monsters
+/home/dracula/explore
 $ ls -a
 .    ..    .svn    earth    jupiter    mars
 $ ls -F .svn
@@ -367,7 +446,7 @@ Send the probe to Mons Olympus?
     the date the change was made,
     and whatever comment the user provided when the change was submitted.
     As we can see,
-    the monsters project is currently at revision 6,
+    the explore project is currently at revision 6,
     and all changes so far have been made by the Mummy.
   

@@ -477,6 +556,30 @@ Committed revision 7.
Figure 8: Updated Repository
+
+

When Not to Use Version Control

+ +

+ Despite the rapidly decreasing cost of storage, + it is still possible to run out of disk space. + In some labs, + people can easily go through 2 TB/month if they're not careful. + Since version control tools usually store revisions in terms of lines, + with binary data files, + they end up essentially storing every revision separately. + This isn't that bad + (it's what we'd be doing anyway), + but it means version control isn't doing what it likes to do, + and the repository can get very large very quickly. + Another concern is that if very old data will no longer be used, + it can be nice to archive or delete old data files. + This is not possible if our data is version controlled: + information can only be added to a repository, + so it can only ever increase in size. +

+ +
+

Back in his cubicle, Wolfman uses svn update to update his working copy. @@ -682,6 +785,72 @@ $ svn diff -r HEAD

+
+

Nothing's Perfekt

+ +

+ Version control systems do have one important shortcoming. + While it is easy for them to find, display, and merge differences in text files, + images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they + use specialized binary data formats. + Most version control systems don't know how to deal with these formats, + so all they can say is, "These files differ." + Reconciling those differences will probably require use of an auxiliary tool, + such as an audio editor + or Microsoft Word's "Compare and Merge" utility. +

+
+ +
+

Diffing Other Files

+ +

+ svn diff mimics the behavior of + the Unix diff command, + which can be used to compare any two files. + Given these two files: +

+ + + + + + + + + + +
left.txt    right.txt
+
hydrogen
+lithium
+sodium
+magnesium
+rubidium
+
+
hydrogen
+lithium
+beryllium
+sodium
+potassium
+strontium
+
+ +

+ diff's output is: +

+
+$ diff left.txt right.txt
+2a3
+> beryllium
+4,5c5,6
+< magnesium
+< rubidium
+---
+> potassium
+> strontium
+
+
+
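The comparison above can be reproduced with a short shell sketch. The scratch directory and the variable names `workdir` and `changes` are assumptions made for illustration; the file contents are the ones shown in the table.

```shell
# Recreate the two example files in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
printf '%s\n' hydrogen lithium sodium magnesium rubidium > "$workdir/left.txt"
printf '%s\n' hydrogen lithium beryllium sodium potassium strontium > "$workdir/right.txt"

# diff exits with status 1 when the files differ, which is not an error here.
changes=$(diff "$workdir/left.txt" "$workdir/right.txt") || true
printf '%s\n' "$changes"

# Reading the output:
#   2a3      after line 2 of left.txt, add line 3 of right.txt
#   4,5c5,6  lines 4-5 of left.txt change into lines 5-6 of right.txt
#   <        a line that appears only in left.txt
#   >        a line that appears only in right.txt
```

Capturing the output in a variable is only a convenience for the sketch; in day-to-day use the plain `diff left.txt right.txt` command is all that is needed.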

This is a very common workflow, and is the basic heartbeat of most developers' days. @@ -726,6 +895,35 @@ $ svn diff -r HEAD and if necessary undo later on.

+
+

Who Did What?

+ +

+ One other very useful command is svn blame, + which shows when each line in the file was last changed + and by whom: +

+ +
+$ svn blame moons.txt
+    14    dracula Name            Orbital Radius  Orbital Period  Mass            Radius
+    14    dracula                 (10**3 km)      (days)          (10**20 kg)     (km)
+    14    dracula Amalthea        181.4           0.498179        0.075           131 x 73 x 67
+     9    mummy   Io              421.6           1.769138        893.2           1821.6
+     9    mummy   Europa          670.9           3.551181        480.0           1560.8
+     9    mummy   Ganymede        1070.4          7.154553        1481.9          2631.2
+    14    dracula Callisto        1882.7          16.689018       1075.9          2410.3
+    14    dracula Himalia         11460           250.5662        0.095           85.0
+    14    dracula Elara           11740           259.6528        0.008           40.0
+
+ +

+ If you are ever wondering who to talk to about a change, + or why it was made, + svn blame is a good place to start. +
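Because each line of `svn blame` output begins with a revision number and a user name, ordinary text tools can summarize it. The sketch below counts how many lines each author last touched. To stay runnable without a repository, it reads an abbreviated copy of the output shown above from a scratch file; the file name `blame.txt` and the variable names are assumptions, and in practice the input would come from `svn blame moons.txt` directly.

```shell
# An abbreviated copy of the blame output above, saved to a scratch file
# (hypothetical stand-in for piping from `svn blame moons.txt`).
workdir=$(mktemp -d)
cat > "$workdir/blame.txt" <<'EOF'
    14    dracula Name            Orbital Radius
    14    dracula                 (10**3 km)
    14    dracula Amalthea        181.4
     9    mummy   Io              421.6
     9    mummy   Europa          670.9
     9    mummy   Ganymede        1070.4
    14    dracula Callisto        1882.7
    14    dracula Himalia         11460
    14    dracula Elara           11740
EOF

# Field 2 is the author; count lines per author, most lines first.
summary=$(awk '{ count[$2]++ } END { for (author in count) print count[author], author }' \
    "$workdir/blame.txt" | sort -rn)
printf '%s\n' "$summary"
```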

+
+

Summary

-
+

+ If you choose the second or third option, + please check with whoever handles intellectual property at your institution + to make sure that putting your work on a commercially-operated machine + that is probably in some other legal jurisdiction + isn't going to cause trouble. + Many people assume that it's "just OK", + while others act as if not having asked will be an acceptable defence later on. + Unfortunately, + neither is true… +

-

Provenance

+
+

Summary

+
    +
  • svnadmin create name creates a new repository.
  • +
  • Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.
  • +
+
+ +
+

Challenges

+ +
    + +
  1. + Create a Subversion repository called trials_repo + in your home directory. + Check out a working copy in a directory called trials_working + (also in your home directory). + Add a couple of text files, + commit the changes, + and then use svn info trials_working + to see what Subversion tells you about your working copy. +
  2. -
    -

    Understand:

    +
  3. + We said above that + you might be the only person using a particular repository. + When and why is version control worth using + if no-one else is working on a project with you? +
  4. + +
  5. + There are many ways to organize repositories. + Some of the most common are to create one repository for:
      -
    • What data provenance is.
    • -
    • How to embed version numbers and other information in files managed by version control.
    • -
    • How to record version information about a program in its output.
    • +
    • each person
    • +
    • each paper
    • +
    • all the work done on one grant
    • +
    • all the work done on one project
    • +
    • the entire lab (which is shared by everyone in the lab)
    • +
    • the entire department (typically with a top-level directory for each person or project in the department)
    -
  6. + What activities does each one make easy or hard? + Which of these would you prefer, and why? + -

    - In art, - the provenance of a work - is the history of who owned it, when, and where. - In science, - it's the record of how a particular result came to be: - what raw data was processed by what version of what program to create which intermediate files, - what was used to turn those files into which figures of which papers, - and so on. -

    +
+
-

- One of the central ideas of this course is that - wen can automatically track the provenance of scientific data. - To start, - suppose we have a text file combustion.dat in a Subversion repository. - Run the following two commands: -

+
+ +
+

Provenance

+ +
+

Understand:

+
    +
  • What data provenance is.
  • +
  • How to embed version numbers and other information in files managed by version control.
  • +
  • How to record version information about a program in its output.
  • +
+
+ +

+ In art, + the provenance of a work + is the history of who owned it, when, and where. + In science, + it's the record of how a particular result came to be: + what raw data was processed by what version of what program to create which intermediate files, + what was used to turn those files into which figures of which papers, + and so on. +

+ +

+ One of the big benefits of using version control is that + it lets us track the provenance of scientific data automatically. + To start, + suppose we have a text file combustion.dat in a Subversion repository. + Run the following two commands: +

 $ svn propset svn:keywords Revision combustion.dat
 $ svn commit -m "Turning on the 'Revision' keyword" combustion.dat
 
-

- Now open the file in an editor - and add the following line somewhere near the top: -

+

+ This does nothing by itself, + but now open the file in an editor + and add the following line somewhere near the top: +

-# $Revision:$
+$Revision:$
 
-

- The '#' sign isn't important: - it's just what .dat files use to show comments. - The $Revision:$ string, - on the other hand, - means something special to Subversion. - Save the file, and commit the change: -

+

+ The $Revision:$ string means something special to Subversion. + Save the file, and commit the change: +

 $ svn commit -m "Inserting the 'Revision' keyword" combustion.dat
 
-

- When we open the file again, - we'll see that Subversion has changed that line to something like: -

+

+ When we open the file again, + we'll see that Subversion has changed that line to something like: +

-# $Revision: 143$
+$Revision: 143$
 
-

- i.e., Subversion has inserted the version number - after the colon and before the closing $. -

+

+ i.e., it has inserted the version number + after the colon and before the closing $. + If we edit the file again—e.g., add a couple of lines with random numbers—and + commit once more, + the line is updated again to: +

-

- Here's what just happened. - First, Subversion allows you to set - properties - for files and and directories. - These properties aren't in the files or directories themselves, - but live in Subversion's database. - One of those properties, - svn:keywords, - tells Subversion to look in files that are being changed - for strings of the form $propertyname: …$, - where propertyname is a string like Revision or Author. - (About half a dozen such strings are supported.) -

+
+$Revision: 144$
+
-

- If it sees such a string, - Subversion rewrites it as the commit is taking place to replace - with the current version number, - the name of the person making the change, - or whatever else the property's name tells it to do. - You only have to add the string to the file once; - after that, - Subversion updates it for you every time the file changes. -

+

+ Here's what just happened. + First, Subversion allows us to add + properties + to files and directories. + These properties aren't stored in the files or directories themselves, + but in Subversion's database. + One of those properties, + svn:keywords, + tells Subversion to look in files that are being changed + for strings of the form $propertyname: …$, + where propertyname is a string like Revision or Author. + (About half a dozen such strings are supported.) +

-

- Putting the version number in the file this way can be pretty handy. - If you copy the file to another machine, - for example, - it carries its version number with it, - so you can tell which version you have even if it's outside version control. - We'll see some more useful things we can do with this information in - the next chapter. -

+

+ If it sees such a string, + Subversion rewrites it as the commit is taking place, + filling in the current version number, + the name of the person making the change, + or whatever else the property's name tells it to do. + We only have to add the string to the file once; + after that, + Subversion updates it for us every time the file changes. +

-
- -

When Not to Use Version Control

- -

- Despite the rapidly decreasing cost of storage, - it is still possible to run out of disk space. - In some labs, - people can easy go through 2 TB/month if they're not careful. - Since version control tools usually store revisions in terms of lines, - with binary data files, - they end up essentially storing every revision separately. - This isn't that bad - (it's what we'd be doing anyway), - but it means version control isn't doing what it likes to do, - and the repository can get very large very quickly. - Another concern is that if very old data will no longer be used, - it can be nice to archive or delete old data files. - This is not possible if our data is version controlled: - information can only be added to a repository, - so it can only ever increase in size. -

- -
+

+ Putting the version number in the file this way can be pretty handy. + If you copy the file to another machine, + for example, + it carries its version number with it, + so you can tell which version you have even if it's outside version control. + We'll see some more useful things we can do with this information later. +
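As a rough illustration of reading the version back from a copy that is no longer under version control, the following sketch pulls the number out of the keyword line with `sed`. The scratch file stands in for a copied `combustion.dat`; its contents, the revision number 143, and the variable names are assumptions for the example.

```shell
# A stand-in for a copy of combustion.dat that has left the working copy.
workdir=$(mktemp -d)
printf '%s\n' '$Revision: 143$' '1.0 2.0 3.0' > "$workdir/combustion.dat"

# Print only the digits between "$Revision:" and the closing "$".
revision=$(sed -n 's/.*\$Revision: *\([0-9][0-9]*\) *\$.*/\1/p' "$workdir/combustion.dat")
echo "combustion.dat is at revision $revision"
```

The pattern tolerates optional spaces around the number, so it should also cope if the expanded keyword is written with padding.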

-

- We can use this trick with shell scripts too, - or with almost any other kind of program. - Going back to Nelle Nemo's data processing from the previous chapter, - for example, - suppose she writes a shell script that uses gooclean - to tidy up data files. - Her first version looks like this: -

+

+ We can use this trick with shell scripts too, + or with almost any other kind of program. + Let's go back to Nelle Nemo's data processing from + the lesson on the shell. + Suppose she writes a shell script called gooclean + to tidy up data files. + Her first version looks like this: +

-for filename in $*
-do
-    gooclean -b 0 100 < $filename > cleaned-$filename
-done
+# gooclean: clean up a single data file
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 > cleaned-$1
 
-

- i.e., it runs gooclean with bounding values of 0 and 100 - for each specified file, - putting the result in a temporary file with a well-defined name. - Assuming that '#' is the comment character for those kinds of data files, - she could instead write: -

+

+ i.e., + it runs goonorm and then goofilter with some fixed parameters + and creates an output file called cleaned-something.dat + (if the input file's name was something.dat). + Assuming that '#' is the comment character for her output files, + she could instead write: +

-for filename in $*
-do
-    echo "gooclean $Revision: 901$ -b 0 100" > $filename
-    gooclean -b 0 100 < $filename >> cleaned-$filename
-done
+# gooclean: clean up a single data file
+echo '# gooclean $Revision:$' > cleaned-$1
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 >> cleaned-$1
 
-

- The first change puts a line in the output file - that describes how that file was created. - The second change is to use >> instead of > - to redirect gooclean's output to the file. - >> means "append to": - instead of overwriting whatever is in the file, - it adds more content to it. - This ensures that the first line of the file is the provenance record, - with the actual output of gooclean after it. -

+

+ then set the svn:keywords property + and commit the file to insert the revision number, + making it: +

-
-

Summary

-
    -
  • $Keyword:$ in a file can be filled in with a property value each time the file is committed.
  • -
  • Put version numbers in programs' output to establish provenance for data.
  • -
  • svn propset svn:keywords property files tells Subversion to start filling in property values.
  • -
-
+
+# gooclean: clean up a single data file
+echo '# gooclean $Revision: 487$' > cleaned-$1
+goonorm -b 0 100 < $1 | goofilter -x --enlarge 2.0 >> cleaned-$1
+
-
+

+ Now, + each time this script is run it will: +

-
+ -

Summing Up

+

+ In other words, + the output of this shell script will always record + exactly what version of the script produced it. + This isn't enough to reproduce the output—we would need to record + the version numbers of the input files and the goonorm and goofilter programs, + and the values of the parameters those programs used + in order to do that—but it's an important and useful first step. +
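The header-then-append pattern can be tried out without the real tools. In the sketch below, `tr` and `sort` are stand-ins for `goonorm` and `goofilter` (an assumption, since those programs are not available here), and the revision number 487 is simply copied from the example rather than filled in by Subversion. The keyword string is written in single quotes because the shell would otherwise treat `$Revision` inside double quotes as a variable and expand it to nothing.

```shell
# Scratch directory and a tiny input file (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' c b a > "$workdir/something.dat"

# A stand-in version of the gooclean script.
cat > "$workdir/gooclean" <<'EOF'
# gooclean: clean up a single data file (sketch with stand-in commands)
# First write the provenance line, overwriting any old output file...
echo '# gooclean $Revision: 487$' > "cleaned-$1"
# ...then append the processed data after it.
tr 'a-z' 'A-Z' < "$1" | sort >> "cleaned-$1"
EOF

# Run the script on the input file and show the result.
(cd "$workdir" && sh gooclean something.dat)
cat "$workdir/cleaned-something.dat"
```

The first line of the output file is the provenance record, and everything after it is the data produced by the pipeline.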

-

- Correlation does not imply causality, - but there is a very strong correlation between - using version control - and doing good computational science. - There's an equally strong correlation - between not using it and wasting effort, - so today (the middle of 2012), - I will not review a paper if the software used in it - is not under version control. - Its authors' work might be interesting, - but without the kind of record-keeping that version control provides, - there's no way to know exactly what they did and when. - Just as importantly, - if someone doesn't know enough about computing to use version control, - the odds are good that they don't know enough - to do the programming right either. -

+
+

Summary

+
    +
  • $Keyword: …$ in a file can be filled in with a property value each time the file is committed.
  • +
  • Put version numbers in programs' output to establish provenance for data.
  • +
  • svn propset svn:keywords property files tells Subversion to start filling in property values.
  • +
+
+ +
+

Challenges

+ +
    + +
  1. + Add $Id:$ to a file, + use svn propset to set the corresponding property, + and then commit a change to the file. + What value does Subversion fill in for this keyword? + When would you use this rather than Revision or Author? +
  2. + +
  3. + What does the svn:ignore property do when applied to a directory? + When would you use it? +
  4. + +
+ +
+ +
+ +
+

Summing Up

-
+

+ In 2006, + McCullough, McGeary, and Harrison + analyzed several years of + the data and code archive of the Journal of Money, Credit, and Banking, + a prestigious journal with a mandatory archiving policy. + Of 266 articles published during that time, + 193 were empirical and should have had data and code deposited in the archive. + Of those, + only 69 actually had anything in the archive. + Excluding eleven articles that only had data, + and seven that required software or other resources they did not have, + McCullough et al. were only able to replicate 14 of the remaining 186 articles. + This doesn't mean that the other 92% were wrong, + but it does mean there is no practical way to tell. +

+ +

+ By itself, + version control doesn't make computational research reproducible. + It does help, + though, + and also eliminates the frustration and wasted time caused by + trying to figure out which emailed copy of a file, + or which of a dozen directories or USB drives, + is the most recent. + And while correlation doesn't imply causality, + there is certainly a strong correlation between + knowing enough about good computational practices to use version control + and knowing how to do other things right as well. +

+ + {% endblock content %}