1 <div class="objectives" markdown="1">
4 * Write a shell script that runs a command or series of commands for a fixed set of files.
5 * Run a shell script from the command line.
6 * Write a shell script that operates on a set of files defined by the user on the command line.
7 * Create pipelines that include user-written shell scripts.
11 We are finally ready to see what makes the shell such a powerful programming environment.
12 We are going to take the commands we repeat frequently and save them in files
13 so that we can re-run all those operations later by typing a single command.
14 For historical reasons,
15 a bunch of commands saved in a file is usually called a [shell script](../../gloss.html#shell-script),
17 but these are actually small programs.
19 Let's start by putting the following line in the file `middle.sh`:
21 <div class="file" markdown="1">
23 head -20 cholesterol.pdb | tail -5
27 This is a variation on the pipe we constructed earlier:
28 it selects lines 16-20 of the file `cholesterol.pdb`.
29 Remember, we are *not* running it as a command just yet:
30 we are putting the commands in a file.
32 Once we have saved the file,
33 we can ask the shell to execute the commands it contains.
34 Our shell is called `bash`, so we run the command `bash middle.sh`:
36 <div class="in" markdown="1">
41 <div class="out" markdown="1">
43 ATOM 14 C 1 -1.463 -0.666 1.001 1.00 0.00
44 ATOM 15 C 1 0.762 -0.929 0.295 1.00 0.00
45 ATOM 16 C 1 0.771 -0.937 1.840 1.00 0.00
46 ATOM 17 C 1 -0.664 -0.610 2.293 1.00 0.00
47 ATOM 18 C 1 -4.705 2.108 -0.396 1.00 0.00
52 As expected, our script's output is exactly what we would get if we ran that pipeline directly.
54 > #### Text vs. Whatever
56 > We usually call programs like Microsoft Word or LibreOffice Writer "text
57 > editors", but we need to be a bit more careful when it comes to
58 > programming. By default, Microsoft Word uses `.docx` files to store not
59 > only text, but also formatting information about fonts, headings, and so
60 > on. This extra information isn't stored as characters, and doesn't mean
61 > anything to tools like `head`: they expect input files to contain
62 > nothing but the letters, digits, and punctuation on a standard computer
63 > keyboard. When editing programs, therefore, you must either use a plain
64 > text editor, or be careful to save files as plain text.
66 What if we want to select lines from an arbitrary file?
67 We could edit `middle.sh` each time to change the filename,
68 but that would probably take longer than just retyping the command.
70 Instead, let's edit `middle.sh` and replace `cholesterol.pdb` with a special variable called `$1`:
72 <div class="in" markdown="1">
77 <div class="out" markdown="1">
83 Inside a shell script,
84 `$1` means "the first filename (or other parameter) on the command line".
85 We can now run our script like this:
87 <div class="in" markdown="1">
89 $ bash middle.sh cholesterol.pdb
92 <div class="out" markdown="1">
94 ATOM 14 C 1 -1.463 -0.666 1.001 1.00 0.00
95 ATOM 15 C 1 0.762 -0.929 0.295 1.00 0.00
96 ATOM 16 C 1 0.771 -0.937 1.840 1.00 0.00
97 ATOM 17 C 1 -0.664 -0.610 2.293 1.00 0.00
98 ATOM 18 C 1 -4.705 2.108 -0.396 1.00 0.00
102 or on a different file like this:
104 <div class="in" markdown="1">
106 $ bash middle.sh vitamin-a.pdb
109 <div class="out" markdown="1">
111 ATOM 14 C 1 1.788 -0.987 -0.861
112 ATOM 15 C 1 2.994 -0.265 -0.829
113 ATOM 16 C 1 4.237 -0.901 -1.024
114 ATOM 17 C 1 5.406 -0.117 -1.087
115 ATOM 18 C 1 -0.696 -2.628 -0.641
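The `$1` substitution described above can be sketched as follows. This is a minimal example, not the lesson's exact file: `sample.txt` is a made-up 30-line stand-in for the `.pdb` files, and the double quotes around `$1` are an addition that keeps a filename containing spaces in one piece.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in data: a file whose lines are the numbers 1 to 30.
seq 1 30 > sample.txt

# middle.sh takes the filename from the first command-line parameter.
cat > middle.sh << 'EOF'
head -20 "$1" | tail -5
EOF

# Select lines 16-20 of whichever file we name on the command line.
bash middle.sh sample.txt
```

Because the stand-in file contains its own line numbers, the output makes it easy to check that the script really selects lines 16 through 20.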
119 We still need to edit `middle.sh` each time we want to adjust the range of lines, though.
121 Let's fix that by using the special variables `$2` and `$3`:
123 <div class="in" markdown="1">
128 <div class="out" markdown="1">
133 <div class="in" markdown="1">
135 $ bash middle.sh vitamin-a.pdb -20 -5
138 <div class="out" markdown="1">
140 ATOM 14 C 1 1.788 -0.987 -0.861
141 ATOM 15 C 1 2.994 -0.265 -0.829
142 ATOM 16 C 1 4.237 -0.901 -1.024
143 ATOM 17 C 1 5.406 -0.117 -1.087
144 ATOM 18 C 1 -0.696 -2.628 -0.641
149 This works, but it may take the next person who reads `middle.sh` a moment to figure out what it does.
150 We can improve our script by adding some [comments](../../gloss.html#comment) at the top:
152 <div class="in" markdown="1">
157 <div class="out" markdown="1">
159 # Select lines from the middle of a file.
160 # Usage: middle.sh filename -end_line -num_lines
165 A comment starts with a `#` character and runs to the end of the line.
166 The computer ignores comments,
167 but they're invaluable for helping people understand and use scripts.
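Putting the pieces together, a commented three-parameter `middle.sh` might look like the sketch below. It assumes the parameters are passed exactly as in the usage line (filename first, then the `head` and `tail` options with their leading dashes); `sample.txt` is a made-up stand-in file, and the quoting is an addition.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-in data: a file whose lines are the numbers 1 to 30.
seq 1 30 > "$workdir/sample.txt"

cat > "$workdir/middle.sh" << 'EOF'
# Select lines from the middle of a file.
# Usage: middle.sh filename -end_line -num_lines
head "$2" "$1" | tail "$3"
EOF

# Lines 16-20: stop at line 20, keep the last 5.
bash "$workdir/middle.sh" "$workdir/sample.txt" -20 -5

# Lines 8-10: stop at line 10, keep the last 3.
bash "$workdir/middle.sh" "$workdir/sample.txt" -10 -3
```

The comments travel with the script, so anyone who opens the file sees the expected order of parameters without having to work it out from the pipeline.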
169 What if we want to process many files in a single pipeline?
170 For example, if we want to sort our `.pdb` files by length, we would type:
172 <div class="in" markdown="1">
174 $ wc -l *.pdb | sort -n
178 because `wc -l` lists the number of lines in the files
179 and `sort -n` sorts things numerically.
180 We could put this in a file,
181 but then it would only ever sort a list of `.pdb` files in the current directory.
182 If we want to be able to get a sorted list of other kinds of files,
183 we need a way to get all those names into the script.
184 We can't use `$1`, `$2`, and so on
185 because we don't know how many files there are.
186 Instead, we use the special variable `$*`,
188 which means "all of the command-line parameters to the shell script."
191 <div class="in" markdown="1">
196 <div class="out" markdown="1">
201 <div class="in" markdown="1">
203 $ bash sorted.sh *.dat backup/*.dat
206 <div class="out" markdown="1">
209 89 backup/chloratin.dat
212 172 backup/sphag-merged.dat
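Here is a small, self-contained sketch of a script like `sorted.sh`. The three `.dat` files are invented for the example. One deliberate difference from the text: the sketch uses `"$@"` rather than `$*`. Both stand for all of the command-line parameters, but the quoted form hands each one to `wc` as a single word, which matters for a name such as `two words.dat`.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical data files of different lengths, one with a space in its name.
seq 1 3 > short.dat
seq 1 12 > long.dat
seq 1 7 > "two words.dat"

cat > sorted.sh << 'EOF'
# List files sorted by number of lines.
wc -l "$@" | sort -n
EOF

# The shell expands *.dat before the script starts,
# so the script receives three separate filenames.
bash sorted.sh *.dat
```

The files come out in order of increasing length, followed by the `total` line that `wc` adds whenever it is given more than one file.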
217 > #### Why Isn't It Doing Anything?
219 > What happens if a script is supposed to process a bunch of files, but we
220 > don't give it any filenames? For example, what if we type `bash sorted.sh`
224 > but don't say `*.dat` (or anything else)? In this case, `$*` expands to
225 > nothing at all, so the pipeline inside the script is effectively `wc -l | sort -n`.
229 > Since it doesn't have any filenames, `wc` assumes it is supposed to
230 > process standard input, so it just sits there and waits for us to give
231 > it some data interactively. From the outside, though, all we see is it
232 > sitting there: the script doesn't appear to do anything.
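
One way to avoid the silent wait is to have the script check how many parameters it was given before doing any work. The sketch below is one possible approach, not part of the original script: it uses the special variable `$#` (the number of command-line parameters) and an `if` statement, and the wording of the usage message is made up.

```shell
workdir=$(mktemp -d)

cat > "$workdir/sorted.sh" << 'EOF'
# List files sorted by number of lines.
# Usage: sorted.sh file [file ...]
if [ "$#" -eq 0 ]
then
    # No filenames given: explain and stop instead of waiting on standard input.
    echo "Usage: sorted.sh file [file ...]" >&2
    exit 1
fi
wc -l "$@" | sort -n
EOF

# Run with no filenames: the script prints the usage message and returns at once.
bash "$workdir/sorted.sh" || echo "exit status: $?"
```

The message goes to standard error so that it never ends up in a file or pipe that was meant to receive the sorted list.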
234 We have two more things to do before we're finished with our simple shell scripts.
235 If you look at a script like:
237 <div class="file" markdown="1">
243 you can probably puzzle out what it does.
245 On the other hand, if you look at this script:
247 <div class="file" markdown="1">
249 # List files sorted by number of lines.
254 you don't have to puzzle it out—the comment at the top tells you what it does.
255 A line or two of documentation like this makes it much easier for other people
256 (including your future self) to re-use your work.
258 The only caveat is that each time you modify the script,
259 you should check that the comment is still accurate:
260 an explanation that sends the reader in the wrong direction is worse than none at all.
263 Second, suppose we have just run a series of commands that did something useful,
264 such as creating a graph we'd like to use in a paper.
265 We'd like to be able to re-create the graph later if we need to,
266 so we want to save the commands in a file.
267 Instead of typing them in again
268 (and potentially getting them wrong), we can do this:
271 <div class="in" markdown="1">
273 $ history | tail -4 > redo-figure-3.sh
277 The file `redo-figure-3.sh` now contains:
279 <div class="file" markdown="1">
281 297 goostats -r NENE01729B.txt stats-NENE01729B.txt
282 298 goodiff stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
283 299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
284 300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
288 After a moment's work in an editor to remove the serial numbers on the commands,
289 we have a completely accurate record of how we created that figure.
293 > Nelle could also use `colrm` (short for "column removal") to remove the
294 > serial numbers on her previous commands.
295 > Its parameters are the range of characters to strip from its input:
298 > $ history | tail -5
302 > 176 mv bakup backup
303 > 177 history | tail -5
304 > $ history | tail -5 | colrm 1 7
310 > history | tail -5 | colrm 1 7
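
`colrm 1 7` works when the serial numbers occupy exactly the first seven columns. As an alternative sketch, `sed` can strip the number whatever its width. Since `history` only reports anything useful in an interactive shell, the example below runs on two sample lines in the same layout; the exact spacing of those lines is an assumption about how `history` prints them.

```shell
# Sample lines in the layout that history prints: spaces, a number, spaces, the command.
numbered='  297  goostats -r NENE01729B.txt stats-NENE01729B.txt
  298  goodiff stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt'

# Delete any leading spaces, the serial number, and the spaces that follow it.
cleaned=$(printf '%s\n' "$numbered" | sed -E 's/^ *[0-9]+ +//')

printf '%s\n' "$cleaned"
```

At an interactive prompt the same filter could sit in the middle of the pipeline, for example `history | tail -4 | sed -E 's/^ *[0-9]+ +//' > redo-figure-3.sh`.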
313 In practice, most people develop shell scripts by running commands at the shell prompt a few times
314 to make sure they're doing the right thing,
315 then saving them in a file for re-use.
316 This style of work allows people to recycle
317 what they discover about their data and their workflow with one call to `history`
318 and a bit of editing to clean up the output
319 and save it as a shell script.
321 #### Nelle's Pipeline: Creating a Script
323 An off-hand comment from her supervisor has made Nelle realize that
324 she should have provided a couple of extra parameters to `goostats` when she processed her files.
325 This might have been a disaster if she had done all the analysis by hand,
326 but thanks to for loops,
327 it will only take a couple of hours to re-do.
329 But experience has taught her that if something needs to be done twice,
330 it will probably need to be done a third or fourth time as well.
331 She runs the editor and writes the following:
333 <div class="file" markdown="1">
335 # Calculate reduced stats for data files at J = 100 c/bp.
336 for datafile in $*
337 do
338     echo $datafile
339     goostats -J 100 -r $datafile stats-$datafile
340 done
344 (The parameters `-J 100` and `-r` are the ones her supervisor said she should have used.)
345 She saves this in a file called `do-stats.sh`
346 so that she can now re-do the first stage of her analysis by typing:
348 <div class="in" markdown="1">
350 $ bash do-stats.sh *[AB].txt
354 She can also do this:
356 <div class="in" markdown="1">
358 $ bash do-stats.sh *[AB].txt | wc -l
362 so that the output is just the number of files processed
363 rather than the names of the files that were processed.
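To try this pattern without Nelle's data, here is a runnable sketch of a script like `do-stats.sh`. Several things in it are assumptions made for the demonstration: the data files are empty placeholders, `goostats` is replaced by a do-nothing stand-in placed on the `PATH`, and the loop uses `"$@"` with quoted variables rather than a bare `$*`.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Empty placeholder data files: two to process and one 'Z' file to leave alone.
touch NENE01729A.txt NENE01729B.txt NENE01751Z.txt

# Hypothetical stand-in for goostats, which is not available here.
# It does nothing, so only the behavior of the loop is shown.
printf '#!/bin/sh\nexit 0\n' > goostats
chmod +x goostats
PATH="$workdir:$PATH"

cat > do-stats.sh << 'EOF'
# Calculate reduced stats for data files at J = 100 c/bp.
for datafile in "$@"
do
    echo "$datafile"
    goostats -J 100 -r "$datafile" "stats-$datafile"
done
EOF

# Print the name of each file as it is processed...
bash do-stats.sh *[AB].txt

# ...or count the names instead of listing them.
bash do-stats.sh *[AB].txt | wc -l
```

Because the script writes one filename per line to standard output, it can be dropped into a pipeline just like a built-in command.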
365 One thing to note about Nelle's script is that
366 it lets the person running it decide what files to process.
367 She could have written it as:
369 <div class="file" markdown="1">
371 # Calculate reduced stats for Site A and Site B data files at J = 100 c/bp.
372 for datafile in *[AB].txt
373 do
374     echo $datafile
375     goostats -J 100 -r $datafile stats-$datafile
376 done
380 The advantage is that this always selects the right files:
381 she doesn't have to remember to exclude the 'Z' files.
382 The disadvantage is that it *always* selects just those files—she can't run it on all files
383 (including the 'Z' files),
384 or on the 'G' or 'H' files her colleagues in Antarctica are producing,
385 without editing the script.
386 If she wanted to be more adventurous,
387 she could modify her script to check for command-line parameters,
388 and use `*[AB].txt` if none were provided.
389 Of course, this introduces another tradeoff between flexibility and complexity.
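The more adventurous version could be sketched like this. It is one possible approach rather than Nelle's actual script: `$#` holds the number of command-line parameters, and `set --` replaces the parameter list with the files matching `*[AB].txt` when none were given. The data files are empty placeholders, and the `goostats` call is printed instead of run because the program is not available here.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Empty placeholder data files.
touch NENE01729A.txt NENE01729B.txt NENE01751Z.txt

cat > do-stats.sh << 'EOF'
# Calculate reduced stats for data files at J = 100 c/bp.
# Usage: do-stats.sh [file ...]
# With no parameters, fall back to the Site A and Site B files.
if [ "$#" -eq 0 ]
then
    set -- *[AB].txt
fi
for datafile in "$@"
do
    # Stand-in: show the command that would be run.
    echo "goostats -J 100 -r $datafile stats-$datafile"
done
EOF

# No parameters: the default pattern selects the A and B files.
bash do-stats.sh

# Explicit parameters still override the default.
bash do-stats.sh NENE01751Z.txt
```

The extra four lines buy the convenience of a sensible default, at the cost of a script whose behavior now depends on whether parameters were supplied.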
391 <div class="keypoints" markdown="1">
394 * Save commands in files (usually called shell scripts) for re-use.
395 * `bash filename` runs the commands saved in a file.
396 * `$*` refers to all of a shell script's command-line parameters.
397 * `$1`, `$2`, etc., refer to specified command-line parameters.
398 * Letting users decide what files to process is more flexible and more consistent with built-in Unix commands.
402 <div class="challenges" markdown="1">
406 1. Leah has several hundred data files, each of which is formatted like this:
419 Write a shell script called `species.sh` that takes any number of
420 filenames as command-line parameters, and uses `cut`, `sort`, and
421 `uniq` to print a list of the unique species appearing in each of
422 those files separately.
424 2. Write a shell script called `longest.sh` that takes the name of a
425 directory and a filename extension as its parameters, and prints out
426 the name of the most recently modified file in that directory with
427 that extension. For example:
430 $ bash longest.sh /tmp/data pdb
433 would print the name of the `.pdb` file in `/tmp/data` that has been
434 changed most recently.
436 3. If you run the command:
439 history | tail -5 > recent.sh
442 the last command in the file is the `history` command itself, i.e.,
443 the shell has added `history` to the command log before actually
444 running it. In fact, the shell *always* adds commands to the log
445 before running them. Why do you think it does this?
447 4. Joel's `data` directory contains three files: `fructose.dat`,
448 `glucose.dat`, and `sucrose.dat`. Explain what a script called
449 `example.sh` would do when run as `bash example.sh *.dat`
450 if it contained the following lines:
454 <td valign="top">1.</td>
462 <td valign="top">2.</td>
465 for filename in $1 $2 $3
473 <td valign="top">3.</td>