From 962089603c44a337e198fbbbd78df7f995958cfc Mon Sep 17 00:00:00 2001
From: Greg Wilson
Date: Mon, 22 Apr 2013 12:30:00 -0400
Subject: [PATCH] Finishing the revisions to the Subversion chapter

---
 svn.html | 530 ++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 391 insertions(+), 139 deletions(-)

diff --git a/svn.html b/svn.html
index 3a02b0e..1dbfb02 100644
--- a/svn.html
+++ b/svn.html
@@ -58,40 +58,118 @@
-
-

Nothing's Perfekt

- -

- Version control systems do have one important shortcoming. - While it is easy for them to find, display, and merge differences in text files, - images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they - use specialized binary data formats. - Most version control systems don't know how to deal with these formats, - so all they can say is, "These files differ." - Reconciling those differences will probably require use of an auxiliary tool, - such as an audio editor - or Microsoft Word's "Compare and Merge" utility. -

-
-

The rest of this chapter will explore how to use a popular open source version control system called Subversion. + It does not have all the features of some newer systems, + such as Git, + but it is still widely used, + and is simpler to pick up than those more advanced alternatives. + No matter which system you use, + the most important thing to learn is not the details of their more obscure commands, + but the workflow that they encourage.

For Instructors

-

explain

+

+ Version control is the most important practical skill we introduce. + As the last paragraph of the introduction above says, + the workflow matters more than the ins and outs of any particular tool. + By the end of 90 minutes, + the instructor should be able to get learners to chant, + "Update, edit, merge, commit," in unison, + and have them understand what those terms mean + and why that's a good way to structure their working day. +

+ +

+ Provided there aren't network problems, + this entire lesson can be covered in 90 minutes. + The example at the end + showing how to use Subversion keywords to track provenance + is the "ah ha!" moment for many learners. + If time is short, + skip the material on recovering old versions of files + in order to get to this section instead. + (The fact that provenance is harder in Git, + both mechanically and conceptually, + is one reason to keep teaching Subversion.) +

Prerequisites

-

prereq

+

+ Basic shell concepts and skills + (ls, cd, mkdir, + editing files); + basic shell scripting + (for the discussion of provenance). +

Teaching Notes

    +
  • + Make sure the network is working before starting this lesson. +
  • +
  • + Give learners a ten-minute overview of what version control does for them + before diving into the watch-and-do practicals. + Most of them will have tried to co-author papers by emailing files back and forth, + or will have biked into the office + only to realize that the USB key with last night's work + is still on the kitchen table. + Instructors can also make jokes about directories with names like + "final version", + "final version revised", + "final version with reviewer three's corrections", + "really final version", + and, + "come on this really has to be the last version" + to motivate version control as a better way to collaborate + and as a better way to back work up. +
  • +
  • + Version control is typically taught after the shell, + so collect learners' names during that session + and create a repository for them to share + with their names as both their IDs and their passwords. + The easiest way to create the repository is to use + a server managed by an ISP such as Dreamhost, + or on SourceForge, Google Code, or some other "forge" site, + all of which provide web interfaces for repository creation and management. + If your learners are advanced enough to be using SSH, + you can instead create it on any server they can access, + and connect with the svn+ssh protocol instead of HTTPS. +
  • +
  • + Be very clear what files learners are to edit + and what user IDs they are to use + when giving instructions. + It is common for them to edit the instructor's biography, + or to use the instructor's user ID and password when committing. + Be equally clear when they are to edit things: + it's also common for someone to edit the file the instructor is editing + and commit changes while the instructor is explaining what's going on, + so that a conflict occurs when the instructor comes to commit the file. +
  • +
  • + Learners could do most exercises with repositories on their own machines, + but it's hard for them to see how version control helps collaboration + unless they're sharing a repository with other learners. + In particular, + showing learners who changed what using svn blame + is only compelling if a file has been edited by at least two people. +
  • +
  • + If some learners are using Windows, + there will inevitably be issues merging files with different line endings. + svn diff -x -w is supposed to suppress differences in whitespace, + but we have found that it doesn't always work as advertised. +
@@ -478,6 +556,30 @@ Committed revision 7.
Figure 8: Updated Repository
+
+

When Not to Use Version Control

+ +

+ Despite the rapidly decreasing cost of storage, + it is still possible to run out of disk space. + In some labs, + people can easily go through 2 TB/month if they're not careful. + Since version control tools usually store revisions in terms of lines, + with binary data files, + they end up essentially storing every revision separately. + This isn't that bad + (it's what we'd be doing anyway), + but it means version control isn't doing what it likes to do, + and the repository can get very large very quickly. + Another concern is that if very old data will no longer be used, + it can be nice to archive or delete old data files. + This is not possible if our data is version controlled: + information can only be added to a repository, + so it can only ever increase in size. +

+ +
+

Back in his cubicle, Wolfman uses svn update to update his working copy. @@ -683,6 +785,22 @@ $ svn diff -r HEAD

+
+

Nothing's Perfekt

+ +

+ Version control systems do have one important shortcoming. + While it is easy for them to find, display, and merge differences in text files, + images, MP3s, PDFs, or Microsoft Word or Excel files aren't stored as text—they + use specialized binary data formats. + Most version control systems don't know how to deal with these formats, + so all they can say is, "These files differ." + Reconciling those differences will probably require use of an auxiliary tool, + such as an audio editor + or Microsoft Word's "Compare and Merge" utility. +

+
+

Diffing Other Files

@@ -777,6 +895,35 @@ $ diff left.txt right.txt and if necessary undo later on.

+
+

Who Did What?

+ +

+ One other very useful command is svn blame, + which shows when each line in the file was last changed + and by whom: +

+ +
+$ svn blame moons.txt
+    14    dracula Name            Orbital Radius  Orbital Period  Mass            Radius
+    14    dracula                 (10**3 km)      (days)          (10**20 kg)     (km)
+    14    dracula Amalthea        181.4           0.498179        0.075           131 x 73 x 67
+     9    mummy   Io              421.6           1.769138        893.2           1821.6
+     9    mummy   Europa          670.9           3.551181        480.0           1560.8
+     9    mummy   Ganymede        1070.4          7.154553        1481.9          2631.2
+    14    dracula Callisto        1882.7          16.689018       1075.9          2410.3
+    14    dracula Himalia         11460           250.5662        0.095           85.0
+    14    dracula Elara           11740           259.6528        0.008           40.0
+
+ +

+ If you are ever wondering who to talk to about a change, + or why it was made, + svn blame is a good place to start. +
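As a rough sketch of how that output can be summarized, the snippet below counts how many lines each person last changed. It assumes the column layout shown above (revision, then author, then the line itself). The temporary file, the trimmed sample, and the variable names are only for illustration; in practice the sample would be replaced by the real output of svn blame moons.txt.

```shell
# Hypothetical sample: the first columns of the blame output shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
    14    dracula Name            Orbital Radius
    14    dracula                 (10**3 km)
    14    dracula Amalthea        181.4
     9    mummy   Io              421.6
     9    mummy   Europa          670.9
     9    mummy   Ganymede        1070.4
    14    dracula Callisto        1882.7
    14    dracula Himalia         11460
    14    dracula Elara           11740
EOF

# Field 2 is the author; tally one count per line, then sort by name.
summary=$(awk '{ lines[$2]++ } END { for (who in lines) print who, lines[who] }' "$sample" | sort)
echo "$summary"

rm -f "$sample"
```

With a real working copy, the same awk program could probably be fed directly from a pipe, as in svn blame moons.txt | awk '...' | sort.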

+
+

Summary

    @@ -1591,19 +1738,19 @@ svn diff -r 240:261 fish.dat

    The command to create a repository is svnadmin create, followed by the path to the repository. - If we want to create a repository called lair_repo + If we want to create a repository called missions_repo directly under our home directory, we just cd to get home - and run svnadmin create lair_repo. - This command creates a directory called lair_repo to hold our repository, + and run svnadmin create missions_repo. + This command creates a directory called missions_repo to hold our repository, and fills it with various files that Subversion uses to keep track of the project's history:

     $ cd
    -$ svnadmin create lair_repo
    -$ ls -F lair_repo
    +$ svnadmin create missions_repo
    +$ ls -F missions_repo
     README.txt    conf/    db/    format    hooks/    locks/
     
    @@ -1614,18 +1761,18 @@ $ ls -F lair_repo we should use svn checkout to get a working copy of this repository. If our home directory is /users/mummy, - then the full path to the repository we just created is /users/mummy/lair_repo, - so we run svn checkout file:///users/mummy/lair lair_working. + then the full path to the repository we just created is /users/mummy/missions_repo, + so we run svn checkout file:///users/mummy/missions_repo missions_working.

    Working backward, the second argument, - lair_working, + missions_working, specifies where the working copy is to be put. The first argument is the URL of our repository, and it has two parts. - /users/mummy/lair_repo is the path to repository directory. + /users/mummy/missions_repo is the path to the repository directory. file:// specifies the protocol that Subversion will use to communicate with the repository—in this case, it says that the repository is part of the local machine's filesystem. @@ -1641,7 +1788,7 @@ $ ls -F lair_repo which specifies the name of the directory we want the working copy to be put in. Without it, Subversion will try to use the name of the repository, - lair_repo, + missions_repo, as the name of the working copy. Since we're in the directory that contains the repository, this means that Subversion will try to overwrite the repository with a working copy. @@ -1655,52 +1802,85 @@ $ ls -F lair_repo most people create a sub-directory in their account called something like repos, and then create their repositories in that. For example, - we could create our repository in /users/mummy/repos/lair, - then check out a working copy as /users/mummy/lair. + we could create our repository in /users/mummy/repos/missions, + then check out a working copy as /users/mummy/missions. This practice makes both names easier to read.

    -

    HERE

    -

    - The obvious next steps are - to put our repository on a server, - rather than on our personal machine, - and to give other people access to the repository we have just created - so that they can work with us. - We should always keep repositories on a different machine than - the one we're using for day-to-day work - so that if the latter is lost or damaged, - we still have our master copy. + The obvious next step is to put our repository on a server, + rather than on our personal machine. + In fact, + we should always do this + so that we don't lose the history of our project + if our laptop is damaged or stolen. + A departmental server is also much more likely to be backed up regularly + than our personal machine…

    - The second step—sharing the repository with others—requires - skills that we are deliberately not going to cover. - As we discuss in the lessons on web programming, - as soon as you make something available over the internet, - you open up a channel for attack. + Creating a repository on a server is simple: + just log in and go through the steps described above. + Accessing that repository from another machine + is also straightforward. + If the machine's address is serv.euphoric.edu, + and our user ID is dracula, + the URL of the repository will be something like:

    +
    +svn+ssh://dracula@serv.euphoric.edu/home/dracula/repos/missions
    +
    +

    - If you want to do this, you can: + Reading from left to right: +

    + +
      +
    • + svn+ssh is the protocol that Subversion uses to connect to the server + (in this case, + a combination of Subversion's own protocol + and SSH); +
    • +
    • + dracula@serv.euphoric.edu identifies the server and who we are + (just like an email address); + and +
    • +
    + /home/dracula/repos/missions is the absolute path of the repository + on the server. +
    • +
    + +
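The three parts listed above can be picked apart with the shell's own parameter expansion. This is only an illustrative sketch: the variable names are made up for this example, and it assumes the URL always has the shape protocol://user@host/path.

```shell
# The URL is the one from the text; the variable names are hypothetical.
url='svn+ssh://dracula@serv.euphoric.edu/home/dracula/repos/missions'

protocol="${url%%://*}"     # everything before "://"
rest="${url#*://}"          # everything after "://"
login="${rest%%/*}"         # user@host, up to the first "/"
repo_path="/${rest#*/}"     # absolute path of the repository on the server

echo "protocol: $protocol"
echo "login:    $login"
echo "path:     $repo_path"
```

Subversion does this parsing itself, so the sketch is only a way to see which characters belong to which part.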

    + That's fine if you are the only person using the repository, + but if you want to share it with others, + you need to worry about security. + As we discuss in the lesson on web programming, + as soon as you provide a service on the internet, + there's the possibility that someone may try to attack your system through it. + Rather than trying to learn enough system administration skills + to set things up safely, + it is usually easier to:

    • - ask your system administrator to set it up for you; + ask your department's system administrator to set it up for you;
    • - use an open source hosting service like SourceForge, + use a hosting service like SourceForge, Google Code, GitHub, or BitBucket; or
    • - spend a few dollars a month on a commercial hosting service like DreamHost + spend a few dollars a month on a commercial hosting service that provides web-based GUIs for creating and managing repositories.
    • @@ -1721,14 +1901,50 @@ $ ls -F lair_repo

      Summary

        -
      • Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.
      • svnadmin create name creates a new repository.
      • +
      • Repositories can be hosted locally, on local (departmental) servers, on hosting services, or on their owners' own domains.

      Challenges

      -

      write some

      + +
        + +
      1. + Create a Subversion repository called trials_repo + in your home directory. + Check out a working copy in a directory called trials_working + (also in your home directory). + Add a couple of text files, + commit the changes, + and then use svn info trials_working + to see what Subversion tells you about your working copy. +
      2. + +
      3. + We said above that + you might be the only person using a particular repository. + When and why is version control worth using + if no-one else is working on a project with you? +
      4. + +
      5. + There are many ways to organize repositories. + Some of the most common are to create one repository for: +
          +
        • each person
        • +
        • each paper
        • +
        • all the work done on one grant
        • +
        • all the work done on one project
        • +
        • the entire lab (which is shared by everyone in the lab)
        • +
        • the entire department (typically with a top-level directory for each person or project in the department)
        • +
        + What activities does each one make easy or hard? + Which of these would you prefer, and why? +
      6. + +
      @@ -1757,8 +1973,8 @@ $ ls -F lair_repo

      - One of the central ideas of this course is that - wen can automatically track the provenance of scientific data. + One of the big benefits of using version control is that + it lets us track the provenance of scientific data automatically. To start, suppose we have a text file combustion.dat in a Subversion repository. Run the following two commands: @@ -1769,21 +1985,18 @@ $ svn propset svn:keywords Revision combustion.dat $ svn commit -m "Turning on the 'Revision' keyword" combustion.dat -

      - Now open the file in an editor +

      + This does nothing by itself, + but now open the file in an editor and add the following line somewhere near the top:

      -# $Revision:$
      +$Revision:$
       

      - The '#' sign isn't important: - it's just what .dat files use to show comments. - The $Revision:$ string, - on the other hand, - means something special to Subversion. + The $Revision:$ string means something special to Subversion. Save the file, and commit the change:

      @@ -1797,21 +2010,28 @@ $ svn commit -m "Inserting the 'Revision' keyword" combustion.dat

      -# $Revision: 143$
      +$Revision: 143$
       

      - i.e., Subversion has inserted the version number + i.e., it has inserted the version number after the colon and before the closing $. + If we edit the file again—e.g., add a couple of lines with random numbers—and + commit once more, + the line is updated again to:

      +
      +$Revision: 144$
      +
      +

      Here's what just happened. - First, Subversion allows you to set + First, Subversion allows us to add properties - for files and and directories. - These properties aren't in the files or directories themselves, - but live in Subversion's database. + to files and directories. + These properties aren't stored in the files or directories themselves, + but in Subversion's database. One of those properties, svn:keywords, tells Subversion to look in files that are being changed @@ -1826,7 +2046,7 @@ $ svn commit -m "Inserting the 'Revision' keyword" combustion.dat with the current version number, the name of the person making the change, or whatever else the property's name tells it to do. - You only have to add the string to the file once; + We only have to add the string to the file once; after that, Subversion updates it for you every time the file changes.

      @@ -1837,84 +2057,86 @@ $ svn commit -m "Inserting the 'Revision' keyword" combustion.dat for example, it carries its version number with it, so you can tell which version you have even if it's outside version control. - We'll see some more useful things we can do with this information in - the next chapter. + We'll see some more useful things we can do with this information later.

      -
      -

      When Not to Use Version Control

      - -

      - Despite the rapidly decreasing cost of storage, - it is still possible to run out of disk space. - In some labs, - people can easy go through 2 TB/month if they're not careful. - Since version control tools usually store revisions in terms of lines, - with binary data files, - they end up essentially storing every revision separately. - This isn't that bad - (it's what we'd be doing anyway), - but it means version control isn't doing what it likes to do, - and the repository can get very large very quickly. - Another concern is that if very old data will no longer be used, - it can be nice to archive or delete old data files. - This is not possible if our data is version controlled: - information can only be added to a repository, - so it can only ever increase in size. -

      - -
      -

      We can use this trick with shell scripts too, or with almost any other kind of program. - Going back to Nelle Nemo's data processing from - the lesson on the shell, - for example, - suppose she writes a shell script that uses gooclean + Let's go back to Nelle Nemo's data processing from + the lesson on the shell. + Suppose she writes a shell script called gooclean to tidy up data files. Her first version looks like this:

      -for filename in $*
      -do
      -    gooclean -b 0 100 < $filename > cleaned-$filename
      -done
      +# gooclean: clean up a single data file
      +goonorm -b 0 100 < "$1" | goofilter -x --enlarge 2.0 > "cleaned-$1"
       

      - i.e., it runs gooclean with bounding values of 0 and 100 - for each specified file, - putting the result in a temporary file with a well-defined name. - Assuming that '#' is the comment character for those kinds of data files, + i.e., + it runs goonorm and then goofilter with some fixed parameters + and creates an output file called cleaned-something.dat + (if the input file's name was something.dat). + Assuming that '#' is the comment character for her output files, she could instead write:

      -for filename in $*
      -do
      -    echo "gooclean $Revision: 901$ -b 0 100" > $filename
      -    gooclean -b 0 100 < $filename >> cleaned-$filename
      -done
      +# gooclean: clean up a single data file
      +echo '# gooclean $Revision:$' > "cleaned-$1"
      +goonorm -b 0 100 < "$1" | goofilter -x --enlarge 2.0 >> "cleaned-$1"
      +
      + +

      + then set the svn:keywords property + and commit the file to insert the revision number, + making it: +

      + +
      +# gooclean: clean up a single data file
      +echo '# gooclean $Revision: 487$' > "cleaned-$1"
      +goonorm -b 0 100 < "$1" | goofilter -x --enlarge 2.0 >> "cleaned-$1"
       

      - The first change puts a line in the output file - that describes how that file was created. - The second change is to use >> instead of > - to redirect gooclean's output to the file. - >> means "append to": - instead of overwriting whatever is in the file, - it adds more content to it. - This ensures that the first line of the file is the provenance record, - with the actual output of gooclean after it. + Now, + each time this script is run it will: +

      + +
        +
      • + put the line +
        +# gooclean $Revision: 487$
        +
        + in the output file, + then +
      • +
      + append whatever the pipeline containing goonorm and goofilter + would have put in the file originally. + (The double redirection >> means "append to" rather than "overwrite".) +
      • +
      + +

      + In other words, + the output of this shell script will always record + exactly what version of the script produced it. + This isn't enough to reproduce the output—we would need to record + the version numbers of the input files and the goonorm and goofilter programs, + and the values of the parameters those programs used + in order to do that—but it's an important and useful first step.
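Once output files carry this header, the revision can be read back out later. The following is a minimal sketch, assuming the header is the first line of the file and has exactly the form shown above; the temporary file, its data line, and the variable names are placeholders made up for this example. Note the single quotes around the header text, which keep the shell from treating $Revision as one of its own variables.

```shell
# Hypothetical stand-in for a file written by gooclean.
cleaned=$(mktemp)
printf '%s\n' '# gooclean $Revision: 487$' '0.25 0.50 0.75' > "$cleaned"

# Pull the digits that sit between "$Revision:" and the closing "$" on line 1.
revision=$(sed -n '1s/.*\$Revision: *\([0-9][0-9]*\) *\$.*/\1/p' "$cleaned")
echo "produced by gooclean revision $revision"

rm -f "$cleaned"
```

If the header is missing or the keyword has not been expanded yet, revision is simply empty, which a longer script would want to check for.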

      Summary

        -
      • $Keyword:$ in a file can be filled in with a property value each time the file is committed.
      • +
      • $Keyword: …$ in a file can be filled in with a property value each time the file is committed.
      • Put version numbers in programs' output to establish provenance for data.
      • svn propset svn:keywords property files tells Subversion to start filling in property values.
      @@ -1922,7 +2144,24 @@ done

      Challenges

      -

      write some

      + +
        + +
      1. + Add $Id:$ to a file, + use svn propset to set the corresponding property, + and then commit a change to the file. + What value does Subversion fill in for this keyword? + When would you use this rather than Revision or Author? +
      2. + +
      3. + What does the svn:ignore property do when applied to a directory? + When would you use it? +
      4. + +
      +
      @@ -1931,22 +2170,35 @@ done

      Summing Up

      - Correlation does not imply causality, - but there is a very strong correlation between - using version control - and doing good computational science. - There's an equally strong correlation - between not using it and either wasting effort or getting things wrong. - Today (the middle of 2013), - I will not review a paper if the software used in it - is not under version control. - The work it reports might be interesting, - but without the kind of record-keeping that version control provides, - there's no way to know exactly what its authors did. - Just as importantly, - if someone doesn't know enough about computing to use version control, - the odds are good that they don't know enough - to do the programming right either. + In 2006, + McCullough, McGeary, and Harrison + analyzed several years of + the data and code archive of Journal of Money, Credit, and Banking, + a prestigious journal with a mandatory archiving policy. + Of 266 articles published during that time, + 193 were empirical and should have had data and code deposited in the archive. + Of those, + only 69 actually had anything in the archive. + Excluding eleven articles that only had data, + and seven that required software or other resources they did not have, + McCullough et al. were only able to replicate 14 of the remaining 186 articles. + This doesn't mean that the other 92% were wrong, + but it does mean there is no practical way to tell. +

      + +

      + By itself, + version control doesn't make computational research reproducible. + It does help, + though, + and also eliminates the frustration and wasted time caused by + trying to figure out which emailed copy of a file, + or which of a dozen directories or USB drives, + is the most recent. + And while correlation doesn't imply causality, + there is certainly a strong correlation between + knowing enough about good computational practices to use version control + and knowing how to do other things right as well.

      -- 2.26.2