05-script.md

   1 <div class="objectives" markdown="1">
   2
   3 #### Objectives
   4 *   Write a shell script that runs a command or series of commands for a fixed set of files.
   5 *   Run a shell script from the command line.
   6 *   Write a shell script that operates on a set of files defined by the user on the command line.
   7 *   Create pipelines that include user-written shell scripts.
   8
   9 </div>
  10
  11 We are finally ready to see what makes the shell such a powerful programming environment.
  12 We are going to take the commands we repeat frequently and save them in files
  13 so that we can re-run all those operations again later by typing a single command.
  14 For historical reasons,
  15 a bunch of commands saved in a file is usually called a [shell script](../../gloss.html#shell-script),
  16 but make no mistake:
  17 these are actually small programs.
  18
  19 Let's start by putting the following line in the file `middle.sh`:
  20
  21 <div class="file" markdown="1">
  22 ~~~
  23 head -20 cholesterol.pdb | tail -5
  24 ~~~
  25 </div>
  26
  27 This is a variation on the pipe we constructed earlier:
  28 it selects lines 16-20 of the file `cholesterol.pdb`.
  29 Remember, we are *not* running it as a command just yet:
  30 we are putting the commands in a file.
  31
  32 Once we have saved the file,
  33 we can ask the shell to execute the commands it contains.
  34 Our shell is called `bash`, so we run the following command:
  35
  36 <div class="in" markdown="1">
  37 ~~~
  38 $ bash middle.sh
  39 ~~~
  40 </div>
  41 <div class="out" markdown="1">
  42 ~~~
  43 ATOM     14  C           1      -1.463  -0.666   1.001  1.00  0.00
  44 ATOM     15  C           1       0.762  -0.929   0.295  1.00  0.00
  45 ATOM     16  C           1       0.771  -0.937   1.840  1.00  0.00
  46 ATOM     17  C           1      -0.664  -0.610   2.293  1.00  0.00
  47 ATOM     18  C           1      -4.705   2.108  -0.396  1.00  0.00
  48 ~~~
  49 </div>
  50
  51 Sure enough,
  52 our script's output is exactly what we would get if we ran that pipeline directly.
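
If you don't have `cholesterol.pdb` handy, the same idea can be tried on a stand-in file. The sketch below is only an illustration: the temporary directory and the file `sample.txt` are assumptions, not part of the lesson's data.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for cholesterol.pdb: 30 lines reading "line 1" ... "line 30".
seq 1 30 | sed 's/^/line /' > sample.txt

# Save the pipeline in a file, then ask bash to run the file.
echo 'head -20 sample.txt | tail -5' > middle.sh
selected=$(bash middle.sh)
echo "$selected"
```

This should print `line 16` through `line 20`: `head -20` keeps the first twenty lines, and `tail -5` keeps the last five of those.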
  53
  54 > #### Text vs. Whatever
  55 >
  56 > We usually call programs like Microsoft Word or LibreOffice Writer "text
  57 > editors", but we need to be a bit more careful when it comes to
  58 > programming. By default, Microsoft Word uses `.docx` files to store not
  59 > only text, but also formatting information about fonts, headings, and so
  60 > on. This extra information isn't stored as characters, and doesn't mean
  61 > anything to tools like `head`: they expect input files to contain
  62 > nothing but the letters, digits, and punctuation on a standard computer
  63 > keyboard. When editing programs, therefore, you must either use a plain
  64 > text editor, or be careful to save files as plain text.
  65
  66 What if we want to select lines from an arbitrary file?
  67 We could edit `middle.sh` each time to change the filename,
  68 but that would probably take longer than just retyping the command.
  69 Instead,
  70 let's edit `middle.sh` and replace `cholesterol.pdb` with a special variable called `$1`:
  71
  72 <div class="in" markdown="1">
  73 ~~~
  74 $ cat middle.sh
  75 ~~~
  76 </div>
  77 <div class="out" markdown="1">
  78 ~~~
  79 head -20 $1 | tail -5
  80 ~~~
  81 </div>
  82
  83 Inside a shell script,
  84 `$1` means "the first filename (or other parameter) on the command line".
  85 We can now run our script like this:
  86
  87 <div class="in" markdown="1">
  88 ~~~
  89 $ bash middle.sh cholesterol.pdb
  90 ~~~
  91 </div>
  92 <div class="out" markdown="1">
  93 ~~~
  94 ATOM     14  C           1      -1.463  -0.666   1.001  1.00  0.00
  95 ATOM     15  C           1       0.762  -0.929   0.295  1.00  0.00
  96 ATOM     16  C           1       0.771  -0.937   1.840  1.00  0.00
  97 ATOM     17  C           1      -0.664  -0.610   2.293  1.00  0.00
  98 ATOM     18  C           1      -4.705   2.108  -0.396  1.00  0.00
  99 ~~~
 100 </div>
 101
 102 or on a different file like this:
 103
 104 <div class="in" markdown="1">
 105 ~~~
 106 $ bash middle.sh vitamin-a.pdb
 107 ~~~
 108 </div>
 109 <div class="out" markdown="1">
 110 ~~~
 111 ATOM     14  C           1       1.788  -0.987  -0.861
 112 ATOM     15  C           1       2.994  -0.265  -0.829
 113 ATOM     16  C           1       4.237  -0.901  -1.024
 114 ATOM     17  C           1       5.406  -0.117  -1.087
 115 ATOM     18  C           1      -0.696  -2.628  -0.641
 116 ~~~
 117 </div>
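
One caveat worth knowing: if a filename contains a space, an unquoted `$1` is split into two separate words before `head` sees it. A possible variation, sketched here under the assumption of a made-up file called `my data.txt` and a made-up script name `middle-quoted.sh`, is to wrap the variable in double quotes:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A hypothetical data file whose name contains a space.
seq 1 30 > "my data.txt"

# Same pipeline as middle.sh, but with the parameter quoted
# so the shell passes the filename to head as a single word.
echo 'head -20 "$1" | tail -5' > middle-quoted.sh

result=$(bash middle-quoted.sh "my data.txt")
echo "$result"
```

With the quotes in place, the script should print the numbers 16 to 20, just as it would for a filename without spaces.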
 118
 119 We still need to edit `middle.sh` each time we want to adjust the range of lines,
 120 though.
 121 Let's fix that by using the special variables `$2` and `$3`:
 122
 123 <div class="in" markdown="1">
 124 ~~~
 125 $ cat middle.sh
 126 ~~~
 127 </div>
 128 <div class="out" markdown="1">
 129 ~~~
 130 head $2 $1 | tail $3
 131 ~~~
 132 </div>
 133 <div class="in" markdown="1">
 134 ~~~
 135 $ bash middle.sh vitamin-a.pdb -20 -5
 136 ~~~
 137 </div>
 138 <div class="out" markdown="1">
 139 ~~~
 140 ATOM     14  C           1       1.788  -0.987  -0.861
 141 ATOM     15  C           1       2.994  -0.265  -0.829
 142 ATOM     16  C           1       4.237  -0.901  -1.024
 143 ATOM     17  C           1       5.406  -0.117  -1.087
 144 ATOM     18  C           1      -0.696  -2.628  -0.641
 145 ~~~
 146 </div>
 147
 148 This works,
 149 but it may take the next person who reads `middle.sh` a moment to figure out what it does.
 150 We can improve our script by adding some [comments](../../gloss.html#comment) at the top:
 151
 152 <div class="in" markdown="1">
 153 ~~~
 154 $ cat middle.sh
 155 ~~~
 156 </div>
 157 <div class="out" markdown="1">
 158 ~~~
 159 # Select lines from the middle of a file.
 160 # Usage: middle.sh filename -end_line -num_lines
 161 head $2 $1 | tail $3
 162 ~~~
 163 </div>
 164
 165 A comment starts with a `#` character and runs to the end of the line.
 166 The computer ignores comments,
 167 but they're invaluable for helping people understand and use scripts.
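
Having to type the dashes in `-20 -5` is easy to forget. One alternative, shown here only as a sketch, is to put the `-n` option inside the script so the user passes plain numbers. The name `middle-n.sh` and the file `numbers.txt` are invented for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in data: the numbers 1 to 30, one per line.
seq 1 30 > numbers.txt

# The quoted 'EOF' stops the shell from expanding $1, $2, $3 while writing the file.
cat > middle-n.sh <<'EOF'
# Select lines from the middle of a file.
# Usage: middle-n.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

middle=$(bash middle-n.sh numbers.txt 20 5)
echo "$middle"
```

The usage comment now matches what the user actually types: a filename, the last line wanted, and how many lines to keep.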
 168
 169 What if we want to process many files in a single pipeline?
 170 For example, if we want to sort our `.pdb` files by length, we would type:
 171
 172 <div class="in" markdown="1">
 173 ~~~
 174 $ wc -l *.pdb | sort -n
 175 ~~~
 176 </div>
 177
 178 because `wc -l` lists the number of lines in the files
 179 and `sort -n` sorts things numerically.
 180 We could put this in a file,
 181 but then it would only ever sort a list of `.pdb` files in the current directory.
 182 If we want to be able to get a sorted list of other kinds of files,
 183 we need a way to get all those names into the script.
 184 We can't use `$1`, `$2`, and so on
 185 because we don't know how many files there are.
 186 Instead, we use the special variable `$*`,
 187 which means,
 188 "All of the command-line parameters to the shell script."
 189 Here's an example:
 190
 191 <div class="in" markdown="1">
 192 ~~~
 193 $ cat sorted.sh
 194 ~~~
 195 </div>
 196 <div class="out" markdown="1">
 197 ~~~
 198 wc -l $* | sort -n
 199 ~~~
 200 </div>
 201 <div class="in" markdown="1">
 202 ~~~
 203 $ bash sorted.sh *.dat backup/*.dat
 204 ~~~
 205 </div>
 206 <div class="out" markdown="1">
 207 ~~~
 208       29 chloratin.dat
 209       89 backup/chloratin.dat
 210       91 sphagnoi.dat
 211      156 sphag2.dat
 212      172 backup/sphag-merged.dat
 213      182 girmanis.dat
 214 ~~~
 215 </div>
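
As with `$1`, an unquoted `$*` breaks filenames that contain spaces into pieces. Bash offers a closely related form, `"$@"`, which expands to all of the parameters while keeping each one intact. The sketch below assumes two made-up files and a made-up script name, `sorted-quoted.sh`:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two hypothetical data files, one with a space in its name.
printf 'a\nb\nc\n' > "site one.dat"
printf 'a\n' > two.dat

# "$@" hands each parameter to wc as a separate, unbroken word.
echo 'wc -l "$@" | sort -n' > sorted-quoted.sh

out=$(bash sorted-quoted.sh *.dat)
echo "$out"
```

The output should list `two.dat` with 1 line, then `site one.dat` with 3 lines, followed by the `total` line that `wc` adds whenever it is given more than one file.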
 216
 217 > #### Why Isn't It Doing Anything?
 218 >
 219 > What happens if a script is supposed to process a bunch of files, but we
 220 > don't give it any filenames? For example, what if we type:
 221 >
 222 >     $ bash sorted.sh
 223 >
 224 > but don't say `*.dat` (or anything else)? In this case, `$*` expands to
 225 > nothing at all, so the pipeline inside the script is effectively:
 226 >
 227 >     wc -l | sort -n
 228 >
 229 > Since it doesn't have any filenames, `wc` assumes it is supposed to
 230 > process standard input, so it just sits there and waits for us to give
 231 > it some data interactively. From the outside, though, all we see is it
 232 > sitting there: the script doesn't appear to do anything.
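
One way to avoid this silent wait is to have the script check how many parameters it was given. The sketch below uses two things this lesson has not covered, the special variable `$#` (the number of command-line parameters) and an `if` statement, so treat it as a preview rather than the lesson's own approach. The script name `sorted-checked.sh` is made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > sorted-checked.sh <<'EOF'
# List files sorted by number of lines.
# Usage: sorted-checked.sh file [file ...]
if [ "$#" -eq 0 ]
then
    # No filenames were given: explain and stop instead of reading standard input.
    echo "Usage: sorted-checked.sh file [file ...]" >&2
    exit 1
fi
wc -l "$@" | sort -n
EOF

# Run the script with no filenames and record its exit status.
status=0
bash sorted-checked.sh || status=$?
echo "exit status: $status"
```

Run this way, the script prints its usage message and finishes immediately with exit status 1.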
 233
 234 We have two more things to do before we're finished with our simple shell scripts.
  235 First, if you look at a script like:
 236
 237 <div class="file" markdown="1">
 238 ~~~
 239 wc -l $* | sort -n
 240 ~~~
 241 </div>
 242
 243 you can probably puzzle out what it does.
 244 On the other hand,
 245 if you look at this script:
 246
 247 <div class="file" markdown="1">
 248 ~~~
 249 # List files sorted by number of lines.
 250 wc -l $* | sort -n
 251 ~~~
 252 </div>
 253
 254 you don't have to puzzle it out&mdash;the comment at the top tells you what it does.
  255 A line or two of documentation like this makes it much easier for other people
 256 (including your future self)
 257 to re-use your work.
 258 The only caveat is that each time you modify the script,
 259 you should check that the comment is still accurate:
 260 an explanation that sends the reader in the wrong direction is worse than none at all.
 261
 262 Second,
 263 suppose we have just run a series of commands that did something useful&mdash;for example,
 264 that created a graph we'd like to use in a paper.
 265 We'd like to be able to re-create the graph later if we need to,
 266 so we want to save the commands in a file.
 267 Instead of typing them in again
 268 (and potentially getting them wrong)
 269 we can do this:
 270
 271 <div class="in" markdown="1">
 272 ~~~
 273 $ history | tail -4 > redo-figure-3.sh
 274 ~~~
 275 </div>
 276
 277 The file `redo-figure-3.sh` now contains:
 278
 279 <div class="file" markdown="1">
 280 ~~~
 281 297 goostats -r NENE01729B.txt stats-NENE01729B.txt
 282 298 goodiff stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
 283 299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
 284 300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
 285 ~~~
 286 </div>
 287
 288 After a moment's work in an editor to remove the serial numbers on the commands,
 289 we have a completely accurate record of how we created that figure.
 290
 291 > #### Unnumbering
 292 >
 293 > Nelle could also use `colrm` (short for "column removal") to remove the
 294 > serial numbers on her previous commands.
 295 > Its parameters are the range of characters to strip from its input:
 296 >
 297 > ~~~
 298 > $ history | tail -5
 299 >   173  cd /tmp
 300 >   174  ls
 301 >   175  mkdir bakup
 302 >   176  mv bakup backup
 303 >   177  history | tail -5
 304 > $ history | tail -5 | colrm 1 7
  306 > ls
 307 > mkdir bakup
 308 > mv bakup backup
 309 > history | tail -5
 310 > history | tail -5 | colrm 1 7
 311 > ~~~
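
`colrm 1 7` relies on the serial numbers always occupying exactly seven characters. If that assumption does not hold on your system, a pattern-based approach with `sed` may be more forgiving. The sketch below works on made-up sample text rather than a live `history` call so that it behaves the same everywhere:

```shell
# Stand-in for the output of `history | tail -3` (these entries are invented).
sample='  175  mkdir bakup
  176  mv bakup backup
  177  history | tail -3'

# Strip any leading spaces, the serial number, and the spaces that follow it.
cleaned=$(printf '%s\n' "$sample" | sed -E 's/^ *[0-9]+ +//')
echo "$cleaned"
```

This should print the three commands with the numbering removed, whatever the width of the number column.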
 312
 313 In practice, most people develop shell scripts by running commands at the shell prompt a few times
 314 to make sure they're doing the right thing,
 315 then saving them in a file for re-use.
 316 This style of work allows people to recycle
 317 what they discover about their data and their workflow with one call to `history`
 318 and a bit of editing to clean up the output
 319 and save it as a shell script.
 320
 321 #### Nelle's Pipeline: Creating a Script
 322
 323 An off-hand comment from her supervisor has made Nelle realize that
 324 she should have provided a couple of extra parameters to `goostats` when she processed her files.
 325 This might have been a disaster if she had done all the analysis by hand,
  326 but thanks to `for` loops,
 327 it will only take a couple of hours to re-do.
 328
 329 But experience has taught her that if something needs to be done twice,
 330 it will probably need to be done a third or fourth time as well.
 331 She runs the editor and writes the following:
 332
 333 <div class="file" markdown="1">
 334 ~~~
 335 # Calculate reduced stats for data files at J = 100 c/bp.
 336 for datafile in $*
 337 do
 338     echo $datafile
 339     goostats -J 100 -r $datafile stats-$datafile
 340 done
 341 ~~~
 342 </div>
 343
 344 (The parameters `-J 100` and `-r` are the ones her supervisor said she should have used.)
 345 She saves this in a file called `do-stats.sh`
 346 so that she can now re-do the first stage of her analysis by typing:
 347
 348 <div class="in" markdown="1">
 349 ~~~
 350 $ bash do-stats.sh *[AB].txt
 351 ~~~
 352 </div>
 353
 354 She can also do this:
 355
 356 <div class="in" markdown="1">
 357 ~~~
 358 $ bash do-stats.sh *[AB].txt | wc -l
 359 ~~~
 360 </div>
 361
 362 so that the output is just the number of files processed
 363 rather than the names of the files that were processed.
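
To see why this gives a count, here is a sketch that can be run without the real `goostats`. Everything except the script itself is an assumption: the stub `goostats` only creates an empty output file, and the data filenames other than `NENE01729B.txt` are invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for goostats: it only creates the file named
# by its last argument, so the sketch runs without the real program.
mkdir bin
cat > bin/goostats <<'EOF'
#!/bin/sh
for last in "$@"; do :; done
: > "$last"
EOF
chmod +x bin/goostats
PATH="$workdir/bin:$PATH"

# Empty placeholder data files.
touch NENE01729A.txt NENE01729B.txt NENE01751Z.txt

# The script from the lesson, saved as do-stats.sh.
cat > do-stats.sh <<'EOF'
# Calculate reduced stats for data files at J = 100 c/bp.
for datafile in $*
do
    echo $datafile
    goostats -J 100 -r $datafile stats-$datafile
done
EOF

# echo prints one line per file, so wc -l reports how many files were processed.
count=$(bash do-stats.sh *[AB].txt | wc -l)
echo "$count"
```

Only the two files ending in `A.txt` and `B.txt` match the pattern, so the count should be 2.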
 364
 365 One thing to note about Nelle's script is that
 366 it lets the person running it decide what files to process.
 367 She could have written it as:
 368
 369 <div class="file" markdown="1">
 370 ~~~
  371 # Calculate reduced stats for Site A and Site B data files at J = 100 c/bp.
 372 for datafile in *[AB].txt
 373 do
 374     echo $datafile
 375     goostats -J 100 -r $datafile stats-$datafile
 376 done
 377 ~~~
 378 </div>
 379
 380 The advantage is that this always selects the right files:
 381 she doesn't have to remember to exclude the 'Z' files.
 382 The disadvantage is that it *always* selects just those files&mdash;she can't run it on all files
 383 (including the 'Z' files),
 384 or on the 'G' or 'H' files her colleagues in Antarctica are producing,
 385 without editing the script.
 386 If she wanted to be more adventurous,
 387 she could modify her script to check for command-line parameters,
 388 and use `*[AB].txt` if none were provided.
 389 Of course, this introduces another tradeoff between flexibility and complexity.
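
The "more adventurous" version might look something like the sketch below. It is one possible approach, not Nelle's actual script: it relies on `$#`, `if`, and `set --` (which replaces the script's parameters), none of which this lesson covers. The call to `goostats` is left as a comment so the sketch runs anywhere, and the filenames other than `NENE01729B.txt` are invented.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty placeholder files standing in for Nelle's data.
touch NENE01729A.txt NENE01729B.txt NENE01751Z.txt

cat > do-stats-default.sh <<'EOF'
# Calculate reduced stats for data files at J = 100 c/bp.
# With no parameters, fall back to the Site A and Site B files.
if [ "$#" -eq 0 ]
then
    set -- *[AB].txt
fi
for datafile in "$@"
do
    echo "$datafile"
    # The real script would run goostats here:
    # goostats -J 100 -r "$datafile" "stats-$datafile"
done
EOF

default_run=$(bash do-stats-default.sh)
explicit_run=$(bash do-stats-default.sh NENE01751Z.txt)
echo "$default_run"
echo "$explicit_run"
```

Run with no parameters, the script should list only the `A` and `B` files. Run with an explicit filename, it should process exactly what it was given. The extra four lines are the price of that flexibility.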
 390
 391 <div class="keypoints" markdown="1">
 392
 393 #### Key Points
 394 *   Save commands in files (usually called shell scripts) for re-use.
 395 *   `bash filename` runs the commands saved in a file.
 396 *   `$*` refers to all of a shell script's command-line parameters.
 397 *   `$1`, `$2`, etc., refer to specified command-line parameters.
 398 *   Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.
 399
 400 </div>
 401
 402 <div class="challenges" markdown="1">
 403
 404 #### Challenges
 405
 406 1.  Leah has several hundred data files, each of which is formatted like this:
 407
 408     ~~~
 409     2013-11-05,deer,5
 410     2013-11-05,rabbit,22
 411     2013-11-05,raccoon,7
 412     2013-11-06,rabbit,19
 413     2013-11-06,deer,2
 414     2013-11-06,fox,1
 415     2013-11-07,rabbit,18
 416     2013-11-07,bear,1
 417     ~~~
 418
 419     Write a shell script called `species.sh` that takes any number of
 420     filenames as command-line parameters, and uses `cut`, `sort`, and
 421     `uniq` to print a list of the unique species appearing in each of
 422     those files separately.
 423
 424 2.  Write a shell script called `longest.sh` that takes the name of a
 425     directory and a filename extension as its parameters, and prints out
 426     the name of the most recently modified file in that directory with
 427     that extension. For example:
 428
 429     ~~~
  430     $ bash longest.sh /tmp/data pdb
 431     ~~~
 432
 433     would print the name of the `.pdb` file in `/tmp/data` that has been
 434     changed most recently.
 435
 436 3.  If you run the command:
 437
 438     ~~~
 439     history | tail -5 > recent.sh
 440     ~~~
 441
 442     the last command in the file is the `history` command itself, i.e.,
 443     the shell has added `history` to the command log before actually
 444     running it. In fact, the shell *always* adds commands to the log
 445     before running them. Why do you think it does this?
 446
 447 4.  Joel's `data` directory contains three files: `fructose.dat`,
 448     `glucose.dat`, and `sucrose.dat`. Explain what a script called
  449     `example.sh` would do when run as `bash example.sh *.dat`
 450     if it contained the following lines:
 451
 452 <table>
 453   <tr>
 454     <td valign="top">1.</td>
 455     <td valign="top">
 456 <pre>
 457 echo *.*
 458 </pre>
 459     </td>
 460   </tr>
 461   <tr>
 462     <td valign="top">2.</td>
 463     <td valign="top">
 464 <pre>
 465 for filename in $1 $2 $3
 466 do
 467     cat $filename
 468 done
 469 </pre>
 470     </td>
 471   </tr>
 472   <tr>
 473     <td valign="top">3.</td>
 474     <td valign="top">
 475 <pre>
 476 echo $*.dat
 477 </pre>
 478     </td>
 479   </tr>
 480 </table>
 481
 482 </div>