shell/exercises/shell.markdown

   1 # Shell
   2
   3 Let's try out your new shell skills on some real data.
   4
   5 The file `1000gp.vcf` is a small sample (1%) of a very large text file
   6 containing human genetics data. Specifically, it describes genetic variation in
   7 three African individuals sequenced as part of the [1000 Genomes
   8 Project](http://www.1000genomes.org).
   9
  10 ## Exercise Part 1 (setup)
  11
  12 * If you had forgotten where you downloaded the file, how would you locate the
  13 path of all files with that name on the computer (using the shell)?
  14
  15   **Hint:**
  16   > `$ man find`
  17
  18   **Answer:**
  19   > `$ find / -name "1000gp.vcf"`
  20
  21 * It's usually a good idea to use an empty directory as a workspace so that
  22   other files don't get in the way (or accidentally get overwritten or deleted).
  23   Create a new subdirectory directory named "sandbox", move our data file there,
  24   and make the directory your current working directory (sandbox should be the
  25   last part of the path given when you type `pwd`.
  26
  27
  28   **Answer:**
  29   > ```bash
  30   > $ mkdir sandbox
  31   > $ mv /home/orion/Downloads/1000gp.vcf sandbox
  32   > $ cd sandbox
  33   > ```
  34
  35
  36 ## Exercise Part 2 (analysis)
  37
  38 * The data file you downloaded is a line-based text file. The "vcf" extension
  39   lets us know that it's in a specific text format, namely "Variant Call
  40   Format". The file starts with a bunch of comment lines (they start with "#" or
  41   "##"), and then a large number of data lines. The human genome can be thought
  42   of as an encyclopedia, where each chromosome is a volume. Each volume is just
  43   a long string of characters, but rather than the english alphabet, the genome
  44   uses just the characters "A", "C", "G", and "T". This VCF file lists the
  45   differences between the three African individuals and a standard "individual"
  46   called the reference (actually based upon a few different people). Each line
  47   in the file corresponds to a difference. The line tells us the position of the
  48   difference (chromosome and position), the genetic sequence in the reference,
  49   and the corresponding sequence in each of the three Africans. Research is
  50   ongoing to understand the full effects of these genetic differences; some
  51   cause diseases such as Tay-Sachs and Hemophilia, while others determine your
  52   blood type and eye color.
  53
  54   Before we start processing the file, let's get a high-level view of the file
  55   that we're about to work with.
  56
  57   What's the file size (in kilo-bytes), and how many lines are in the file?
  58
  59
  60   **Hint:**
  61   > There's an option to `ls` that will print the file sizes in a more
  62   > human-friendly format.
  63
  64
  65   **A hint about the number of lines:**
  66   > `$ man wc`
  67
  68
  69   **Answer:**
  70   > We should get a file size around 3.6 MB with:
  71   > `$ ls -lh 1000gp.vcf`
  72   > Alternatively, the command `du` can be used to achieve a similar result:
  73   > `$ du -h 1000gp.vcf`
  74   >
  75   > We find there are 45034 lines with:
  76   > `$ wc -l 1000gp.vcf`
  77
  78
  79 * Because this file is so large, you're going to almost always want to pipe
  80 ("|") the result of any command to `less` (a simple text viewer, type 'q' to
  81 exit) or `head` (to print the first 10 lines) so that you don't accidentally
  82 print 45,000 lines to the screen.
  83
  84   Let's start by printing the first 5 lines to see what it looks like.
  85
  86   **Answer:**
  87   > `$ head -5 1000gp.vcf`
  88
  89 * That isn't very interesting; it's just a bunch of the comments at the
  90 beginning of the file (they all start with "#")! Print the first 20 lines to see
  91 more of the file.
  92
  93   **Answer:**
  94   > `$ head -20 1000gp.vcf`
  95
  96
  97 * Okay, so now we can see the basic structure of the file. A few comment lines
  98   that start with "#" or "##" and then a bunch of lines of data that contain all
  99   the data and are pretty hard to understand. Each line of data contains the
 100   same number of fields, and all fields are separated with TABs. These fields
 101   are:
 102
 103   1. the chromosome (which volume the difference is in)
 104   2. the position (which character in the volume the difference starts at)
 105   3. the ID of the difference
 106   4. the sequence in the reference human(s)
 107
 108   The rest of the columns tell us, in a rather complex way, a bunch of
 109   additional information about that position, including: the predicted sequence
 110   for each of the three Africans and how confident the scientists are that these
 111   sequences are correct.
 112
 113   To start analyzing the actual data, we have to remove the header. How can we
 114   print the first 10 non-header lines (those that _don't_ start with a "#")?
 115
 116   **Hint:**
 117     $ man grep
 118
 119   **Hint:**
 120   > You can use a pipe ("|") to connect the output of `grep` to the input of
 121   > `head`.
 122
 123   **Hint:**
 124   In `grep` regular expressions, the carat '^' character matches the start of a
 125   line and the dollar sign '$' matches the end of a line. Thus, the following
 126   will print all non-blank lines from `file`:
 127     $ grep -v "^$" file
 128
 129   **Our answer:**
 130   >     $ grep -v "^#" 1000gp.vcf | head
 131   >
 132   > Why are neither of these correct?
 133   >     $ grep -v "#" 1000gp.vcf | head
 134   >     $ grep -v "^##" 1000gp.vcf | head
 135
 136 * How many lines of data are in the file (rather than counting the number of
 137   header lines and subtracting, try just counting the number of data lines)?
 138
 139   **Hint:**
 140   > Instead of piping to `head`, try piping to `wc`.
 141
 142   **Our Answer:**
 143   >     $ grep -v "^#" 1000gp.vcf | wc -l
 144   >
 145   > should print `45024`
 146
 147 * Where these differences are located can be important. If all the differences
 148   between two encyclopedias were in just the first volume, that would be
 149   interesting. The first field of each data line is the name of the chromosome
 150   that the difference occurs on (which volume we're on). Print the first 10
 151   chromosomes, one per line.
 152
 153   **Hint:**
 154   > You can extract a column from a tab-delimited text file using the `cut`
 155   > command.
 156
 157   **Hint:**
 158   > Use `grep` to print only non-comment lines, and `cut` to extract the
 159   > chromosome column.
 160
 161   **Our Answer:**
 162   >     $ grep -v "^#" 1000gp.vcf | cut -f 1 | head
 163
 164 * As you should have observed, the first 10 lines are on numbered chromosomes.
 165   Every normal cell in your body has 23 pairs of chromosomes, 22 pairs of
 166   "autosomal" chromosomes (these are numbered 1-22) and a pair of sex
 167   chromosomes (two Xs if you're female, an X and a Y if you're male). If you've
 168   heard of the genetics company [23andMe](https://www.23andme.com), the 23
 169   refers to these 23 pairs of chromosomes.
 170
 171   Let's look at which chromosomes these variations are on. Print a list of the
 172   chromosomes that are in the file (each chromosome name should only be printed
 173   once, so you should only print 23 lines).
 174
 175   **Hint:**
 176   > You need to remove all the duplicate lines from your previous answer.
 177
 178   **Hint:**
 179   > `sort` has an option that should make this easier.
 180
 181   **Our Answer:**
 182   >     $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort -u
 183
 184
 185 * Rather than using `sort` to print unique results, a common pipeline is to
 186   first sort and then pipe to another UNIX command, `uniq`. The `uniq` command
 187   takes _sorted_ input and prints only unique lines, but it provides more
 188   flexibility than just using `sort` by itself. Keep in mind, if the input isn't
 189   sorted, `uniq` won't work properly.
 190
 191   Using `sort` and `uniq`, print the number of times each chromosome occurs in
 192   the file.
 193
 194   **Hint:**
 195   >     $ man uniq
 196
 197   **Hint:**
 198   > Instead of using `sort` to remove duplicates, just use it to sort and pipe
 199   > the result to `uniq`.
 200
 201   **Our Answer:**
 202   >     $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c
 203
 204
 205 * Add to your previous solution to list the chromosomes from most frequently
 206   observed to least frequently observed.
 207
 208   **Hint:**
 209   >     $ man sort
 210
 211   **Hint:**
 212   > Make sure you're sorting in descending order. By default, `sort` sorts in
 213   > ascending order.
 214
 215   **Our Answer:**
 216   >     $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -n -r
 217   >
 218   > should output the following:
 219   >
 220   >     3721 2
 221   >     3387 1
 222   >     3224 4
 223   >     3219 3
 224   >     2894 5
 225   >     2860 6
 226   >     2527 8
 227   >     2525 7
 228   >     2203 10
 229   >     2166 11
 230   >     2032 12
 231   >     1865 9
 232   >     1656 13
 233   >     1409 14
 234   >     1362 16
 235   >     1304 X
 236   >     1275 18
 237   >     1265 15
 238   >     1097 17
 239   >      993 20
 240   >      814 19
 241   >      661 21
 242   >      565 22
 243
 244 * The autosomal chromosomes (1-22) are named according to their size. The
 245   largest of them is chromosome 1, while the smallest is chromosome 22. Does it
 246   look like differences occur relatively randomly across the genome, or are some
 247   chromosomes more different than you'd expect at random (very roughly taking
 248   their sizes into account)?
 249
 250   It's worth noting that the chromosomes were numbered by the sizes of the
 251   actual molecules, not how much of them had been sequenced.
 252
 253   Wikipedia has a nice table of chromosome sizes and how much of each has been
 254   sequenced (and you can sort it):
 255   http://en.wikipedia.org/wiki/Human_chromosome#Human_chromosomes
 256
 257   Notice anything?
 258
 259   **Our Hypothesis:**
 260   > Since variation can only be found in the known sequence, the order you
 261   > printed corresponds closely to ordering by the number of bases sequenced
 262   > (rather than the total number of bases).
 263   >
 264   > Given this, it seems like differences occur relatively randomly across the
 265   > genome. We see more differences on longer chromosomes, fewer on shorter,
 266   > without any striking outliers.
 267
 268 * This is great, but biologists might also like to see the chromosomes ordered
 269   by their number (not dictionary order), since different chromosomes have
 270   different attributes and this ordering allows them to find a specific
 271   chromosome more easily.
 272
 273   **Hint:**
 274   > A lot of the power of `sort` comes from the fact that you can specify which
 275   > fields to sort on, and the order in which to sort them. In this case you
 276   > only need to sort on one field.
 277
 278   **Answer:**
 279   >     $ grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -k 2n
 280
 281
 282 ## Exercise Part 3 (scripts and svn)
 283 * Wonderful! Now we have a (long) command for printing chromosome statistics
 284   from our `1000gp.vcf` file. Using `nano`, create a new file, "chrom_stats.sh",
 285   with just your answer to the previous question in it.
 286
 287   **Answer:**
 288   >  Type the following to open a new file:
 289   >     $ nano chrom_stats.sh
 290   >  Type in the command. Type ^o to save and ^x (where ^ means the control key).
 291
 292 * Just to be illustrate the flexibility of the shell, try creating the same file
 293   directly from the shell (without a text editor). Once you do, you can use
 294   `cat` to make sure the contents of the file are exactly what you expect.
 295
 296   **Hint:**
 297   > You can use `echo` to print something and `>` to redirect to a file.
 298
 299   **Hint:**
 300   > Since our long command has double-quotes in it, you either need to use
 301   > single-quotes or escape these with back-slashes.
 302
 303   **Answer:**
 304     $ echo 'grep -v "^#" 1000gp.vcf | cut -f 1 | sort | uniq -c | sort -k 2n' > chrom_stats.sh
 305     $ cat chrom_stats.sh
 306
 307 * Now, execute your new script to print the chromosome statistics.
 308
 309   **Hint:**
 310   > You may have to change the permissions to allow you to execute it.
 311
 312   **Hint:**
 313   > It's good form to only make permissions as permissive as necessary. So,
 314   > rather than allow everyone to execute the file, it is better to just allow
 315   > you to execute it.
 316
 317   **Answer:**
 318     $ chmod u+x chrom_stats.sh
 319     $ ./chrom_stats.sh
 320
 321   > Note that it is `u+x` instead of just `+x` or `a+x`. This only adds the
 322   > ability for the owner to execute it, whereas the other two options would
 323   > allow anyone to execute it.
 324
 325 * We'd like to be able to use this script in the future with arbitrary VCF
 326   files, instead of just our `1000gp.vcf` file. Edit the script so that it takes
 327   VCF-formatted text input on stdin and prints out chromosome statistics on
 328   stdout. This is simpler than you might think.
 329
 330   **Hint:**
 331   > If `grep` isn't given an input file, it will read from stdin.
 332
 333   **Answer:**
 334   > Change
 335   > `grep -v "^#" 1000gp.vcf | ...`
 336   > to
 337   > `grep -v "^#" | ...`
 338   >
 339   > Since this is in a file instead of the shell prompt, we aren't showing the
 340   > "$" at the beginning of the line.
 341
 342 * Now that we have a script that reads from stdin and prints to stdout, how do
 343   we run it on the `1000gp.vcf` file to get the same output as before?
 344
 345   **Hint:**
 346   > The `cat` command is used to print files to stdout.
 347
 348   **Hint:**
 349   > You can pipe the output of `cat` directly into our script.
 350
 351   **Hint:**
 352   > Just like before, in order to tell the shell that the `chrom_stats.sh` file
 353   > we want to execute is the one in our current directory, we need to use
 354   > `./chrom_stats.sh`.
 355
 356   **Answer:**
 357   `$ cat 1000gp.vcf | ./chrom_stats.sh`
 358
 359 * Finally, add a copy of this file to your folder in the class SVN repository.
 360   1. `cp chrom_stats.sh /path/to/repo/participants/user/`
 361   2. Add the file to subversion version control
 362   3. Commit your changes
 363
 364 **Fin.**
 365 Comments, questions, and suggestions are encouraged and appreciated.
 366 Thanks to Tommy Guy, Jon Pipitone, Greg Wilson, and Elango Cheran for their help
 367 with these exercises.
 368