6 Week 5- Finishing Variant Calling, then a pivot to R!

Picking up from last week, we will start where we left off with a variant calling pipeline. Then we’ll pivot to interacting with the cluster in a new way, by starting RStudio sessions in OnDemand.

6.1 Main Objectives

Organize simple commands into scripts to execute a variant calling pipeline
Write code in Rstudio through FarmOnDemand
Understand the different parts of the Rstudio window
Create and modify objects in R and general R operation

6.2 Warmup Part 1: Comparing scripts with a friend

Buddy up! Start an OnDemand session and navigate to the script that you wrote for the week 4 assignment.

If either you or your partner’s script did not run, take some time to look at the code together! Try to fix any errors and get it to run. If you’re having trouble diagnosing your issues, remember that you can look at your log files, which have the extensions .err and /out.

If both of your code produced the expected files, compare scripts! Find any places where your code differs, and discuss why you chose to code it the way you did! After seeing your partner’s script, are there any changes you’d like to make to yours?

6.3 Warmup Part 2: Ensuring we have well organized data!

Last week’s assignment was to rerun the fastqc and cutadapt commands that we learned last week to create a well organized file system. If your script didn’t work exactly as you planned, let’s make sure that we all are moving forward into the next section with the same files.

From within your week4 folder, you should make sure you have a folder called raw_data which contains the following files:

$ ls raw_data
Ppar_tinygenome.fna.gz    SRR6805881.tiny.fastq.gz  SRR6805883.tiny.fastq.gz  SRR6805885.tiny.fastq.gz
SRR6805880.tiny.fastq.gz  SRR6805882.tiny.fastq.gz  SRR6805884.tiny.fastq.gz

From within your week4 folder, you should make sure you have a folder called trimmed_fastqs which contains the following files:

$ ls trimmed_fastqs
SRR6805880.tiny_trimmed.fastq.gz  SRR6805882.tiny_trimmed.fastq.gz  SRR6805884.tiny_trimmed.fastq.gz
SRR6805881.tiny_trimmed.fastq.gz  SRR6805883.tiny_trimmed.fastq.gz  SRR6805885.tiny_trimmed.fastq.gz

You may have some additional files inside your folder, like fastqc files, your week 4 scripts or slurm outputs, those can all stay there!

If you are struggling to get the files into the right folder, you can copy this folder into your personal week4 folder from the folder week4_files as follows:

$ cd /group/rbaygroup/eve198-genomics/yourdirectory/week4
$ cp -r /group/rbaygroup/eve198-genomics/week4_files/raw_data ./
$ cp -r /group/rbaygroup/eve198-genomics/week4_files/trimmed_fastqs ./

Rerun the ls commands to ensure that you have everything you need in the raw_data and trimmed_fastqs. Now we can move forward with the next steps in the variant calling pipeline!

6.4 Step 3: Building an index of our genome

Recall that we’ve already performed QC and adapter cutting on our sequencing reads. Today, we’ll pick up where we left off with our variant calling pipeline: Finding Variants Pipeline

Recall that we need to reload packages every time we open a new OnDemand session. This week we’ll need:

samtools: allows us to filter and view our mapped data
bowtie2: to map our reads to the reference genome
angsd: used to find variants in our data

Load all these packages using module load.

Our next task is to index our genome. We’ll do that with the bowtie2-build command. This will generate a lot of files that describe different aspects of our genome

We give bowtie2-build two things, the name of our genome, and a general name to label the output files. I always keep the name of the output files the same as the original genome file (without the .fna.gz extension) to avoid confusion.

bowtie2-build raw_data/Ppar_tinygenome.fna.gz raw_data/Ppar_tinygenome

This should produce several output files with extensions including: .bt2 and rev.1.bt2 etc (six files in total)

6.5 Step 4: Map reads to the genome

Now that we’ve indexed our reference, we can map our sequencing reads to that reference genome. Let’s map those reads using a for loop.

First, let’s set up a directory for the output files, which will have the extension .sam (more on that later!)

$ mkdir sams

for filename in trimmed_fastqs/*.tiny_trimmed.fastq.gz
do

  base=$(basename $filename .tiny_trimmed.fastq.gz)
  echo ${base}

  bowtie2 -x raw_data/Ppar_tinygenome -U trimmed_fastqs/${base}.tiny_trimmed.fastq.gz -S sams/${base}.sam

done

You should see a bunch of text telling you all about how well our reads mapped to the genome. For this example we’re getting a low percentage (~20%) because of how the genome and reads were subset for this exercise.

You’ll also notice that we have made a bunch of .sam files. THis stands for Sequence Alignment Map file. Let’s use less to look at one of these files using less

6.6 Step 5: sam to bam file conversion

The next step is to convert our sam file to a bam (Binary Alignment Map file). This gets our file ready to be read by angsd the program we’re going to use to call SNPs.

mkdir bams
for filename in sams/*.sam
do

  base=$(basename $filename .sam)
  echo ${base}
  
  samtools view -bhS sams/${base}.sam | samtools sort -o bams/${base}.bam

done

6.7 Step 6: Genotype likelihoods

There are many ways and many programs that call genotypes. The program that we will use calculates genotype likelihoods, which account for uncertainty due to sequencing errors and/or mapping errors and is one of several programs in the package ANGSD. The purpose of this class is not to discuss which program is the “best”, but to teach you to use some commonly used programs.

angsd needs a text file with the .bam file names listed. We can make that by running the command below

ls bams/*.bam > bam.filelist

Look at the list:

cat bam.filelist

Run the following code to calculate genotype likelihoods

angsd -bam bam.filelist -GL 1 -out genotype_likelihoods -doMaf 2 -SNP_pval 1e-2 -doMajorMinor 1

The output of angsd gives us a nice overview of what happened:

    -> Output filenames:
        ->"genotype_likelihoods.arg"
        ->"genotype_likelihoods.mafs.gz"
    -> Tue Feb  3 22:27:17 2026
    -> Arguments and parameters for all analysis are located in .arg file
    -> Total number of sites analyzed: 71450
    -> Number of sites retained after filtering: 9 
    [ALL done] cpu-time used =  1.39 sec
    [ALL done] walltime used =  1.00 sec

We can see that it generated two files, one with a .arg extension (which has a record of the script we ran to generate the output), and a .maf file (that will give you the minor allele frequencies and is the main output file). We can also see that we started with 71,450 sites but only retained 9 after filtering.

To see what the other parameters do you can run the following:

angsd -h

We can change the parameters of the angsd genotype likelihoods command by altering the values next to -SNP_pval. If we remove the -SNP_pval command entirely we get around 70,000 sites retained! Wow! That seems like a lot given our ~20% mapping rate. If you instead increase the p-value threshold to 1e-3 we find 3 SNPs.

Class Exercise 1: Changing angsd parameters

To see what the parameters you can use with angsd, type angsd -h Try changing the value of -SNP_pval and rerunning angsd. How many sites are retained if we remove the -SNP_pval flag entirely? How many sites are retained if you instead increase the p-value threshold to 1e-3?

6.8 Orientation to R

There are many R packages that are built to work with SNP data! So often, once we use slurm scripts to pare down our data from sequencing reads to SNPs, we read the SNP data into R to analyze it from there! Before we get to working with SNP data, we’ll start with a more general intro to R.

This lesson is modified from materials compiled by Serena Caplins from the STEMinist_R lessons produced by several UC Davis graduate student and which can be found here.

First let’s navigate to our R studio on Farm OnDemand! Launch your session with the information shown below: Give yourself 2 for memory and make sure your conda environment is set to r-4.4.2

Fill out this information

After we start it up we will want to create a new file called “Week5-IntroR.R” to type our scripts for today.

To do this, we will click File, New File then R script. Name your file with your name-Week5-introR

# before a sentence comments out what you want before you write code. Without a # R will think what you are writing needs to be run as a command! I have used comments in this example to write what the script is for. The top left is the script, the bottom left is the terminal window where code and outputs will be typed, the top right is your environment R window- example R can be used for basic arithmetic. Type this into your script and hit enter:

#adding prompt
5+10+23

## [1] 38

It can also store values in variables:

You can assign an object using an assignment operator <- or =.

#storing variables
number<-10

numbers<-c(10, 11, 12, 14, 16)

You can see your assigned object by typing the name you gave it.

#assigned objects
number

## [1] 10

numbers

## [1] 10 11 12 14 16

Objects can be numbers or characters:

#objects as characters
cat<-"meow"
dog<-"woof"

We can use colons to get sequences of numbers:

#sequence of numbers
n<-1:100

Data Structures include:

Vector
Lists
Matrices
Factor
Data frame

Vectors can also include characters (in quotes): c()=concatenate, aka link things together!

#link things together with this code
animals<-c("woof", "meow", "hiss", "baa")

6.9 Manipulating a vector object

We can get summaries of vectors with summary()

#summarize what you have done so far
summary(n)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   25.75   50.50   50.50   75.25  100.00

We can see how long a vector is with length()

#how long is vector n
length(n)

## [1] 100

You can use square brackets [] to get parts of vectors.

#what is the 50th entry in our vector n?
n[50]

## [1] 50

Class Exercise 2

What is the 2nd entry in our vector animals?

create a new vector with the following code for different field sites to answer questions 2 & 3:

site_code <- c("BB","ShC","BH","Pes","StC","MR","PV","KH","CM","RP","Car","SR","RM","Dav","JB","Mal","PB","PA","Leu","VD","A")

Without counting, which site is the 10th entry?

How many sites are there total?

6.10 Operations act on each element of a vector

# +2
numbers+2

## [1] 12 13 14 16 18

# *2
numbers*2

## [1] 20 22 24 28 32

# mean
mean(numbers)

## [1] 12.6

# ^2
numbers^2

## [1] 100 121 144 196 256

# sum
sum(numbers)

## [1] 63

6.11 Operations can also work with two vectors

#define a new object y
y<-numbers*2

# n + y
numbers + y

## [1] 30 33 36 42 48

# n * y
numbers * y

## [1] 200 242 288 392 512

6.12 A few tips below for working with objects

We can keep track of what objects R is using, with the functions ls() and objects()

ls()

## [1] "animals"   "cat"       "dog"       "n"         "number"    "numbers"  
## [7] "site_code" "y"

objects() #returns the same results as ls() in this case. because we only have objects in our environment.

## [1] "animals"   "cat"       "dog"       "n"         "number"    "numbers"  
## [7] "site_code" "y"

This is where those objects show up with you type ls(): Returns a list of what is present in our R environment

# how to get help for a function; you can also write help()
?ls

# you can get rid of objects you don't want

rm(numbers)

# and make sure it got rid of them
ls()

## [1] "animals"   "cat"       "dog"       "n"         "number"    "site_code"
## [7] "y"

After removal

Call the help files for the functions ls() and rm() + What are the arguments for the ls() function? + What does the ‘sorted’ argument do?

From the help file: sorted is a logical indicating if the resulting character should be sorted alphabetically. Note that this is part of ls() may take most of the time.

6.13 Characterizing a dataframe

We’ll now move from working with objects and vectors to working with dataframes:

Here are a few useful functions! I will go over each as we introduce them throughout the lesson today:
- install.packages()
- library()
- data()
- str()
- dim()
- colnames() and rownames()
- class()
- as.factor()
- as.numeric()
- unique()
- t()
- max(), min(), mean() and summary()

We’re going to use data on sleep patterns in mammals. This requires installing a package (ggplot2) and loading the data

Install the package ggplot2. This only has to be done once and after installation we should then comment out the command to install the package with a #.

#install.packages("ggplot2")

#load the package

library (ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.3

Load the data (it’s called msleep). This dataset includes information bout mammal sleep times and weights that was taken from a study by V. M. Savage and G. B. West. “A quantitative, theoretical framework for understanding mammalian sleep. Proceedings of the National Academy of Sciences, 104 (3):1051-1056, 2007.”

The data includes - name (common name) - genus - vore (carnivore, omnivore, etc) - order - conservation (status) - sleep_total (total amount of sleep in hours) - sleep_rem (rem sleep in hours) - sleep_cycle (length of sleep cycle, in hours) - awake (amount of time spent awake, in hours) - brainwt (brain weight in kilograms) - bodywt (body weight in kilograms)

data("msleep")

There are many functions in R that allow us to get an idea of what the data looks like. For example, what are it’s dimensions (how many rows and columns)?

# head() -look at the beginning of the data file
# tail() -look at the end of the data file

head(msleep)

## # A tibble: 6 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc                  12.1      NA        NA      11.9
## 2 Owl mo… Aotus omni  Prim… <NA>                17         1.8      NA       7  
## 3 Mounta… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
## 4 Greate… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
## 5 Cow     Bos   herbi Arti… domesticated         4         0.7       0.667  20  
## 6 Three-… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

tail(msleep)

## # A tibble: 6 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Tenrec  Tenr… omni  Afro… <NA>                15.6       2.3      NA       8.4
## 2 Tree s… Tupa… omni  Scan… <NA>                 8.9       2.6       0.233  15.1
## 3 Bottle… Turs… carni Ceta… <NA>                 5.2      NA        NA      18.8
## 4 Genet   Gene… carni Carn… <NA>                 6.3       1.3      NA      17.7
## 5 Arctic… Vulp… carni Carn… <NA>                12.5      NA        NA      11.5
## 6 Red fox Vulp… carni Carn… <NA>                 9.8       2.4       0.35   14.2
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

# str()
str(msleep)

## tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
##  $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
##  $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
##  $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...

dim(), ncol(), nrow()- dimensions, number of columns, number of rows colnames(), rownames() - column names, row names

Rstudio also allows us to just look into the data file with View(). Try to look at the msleep data using View(msleep)

6.14 How to access parts of the data

We can also look at a single column at a time. There are three ways to access this: $, [,#] or [,“a”].

Think about “Remote Control car” to remember that [5,] means fifth row and [,5] means fifth column! Rows are listed first and columns are listed second.

Each way has it’s own advantages! The first subsets the third column of data, so you need to know where your data of interest is. The second subsets the vore column only. The third prints all of the data from the vore column in your console window.

msleep[,3]

## # A tibble: 83 × 1
##    vore 
##    <chr>
##  1 carni
##  2 omni 
##  3 herbi
##  4 omni 
##  5 herbi
##  6 herbi
##  7 carni
##  8 <NA> 
##  9 carni
## 10 herbi
## # ℹ 73 more rows

msleep[, "vore"]

## # A tibble: 83 × 1
##    vore 
##    <chr>
##  1 carni
##  2 omni 
##  3 herbi
##  4 omni 
##  5 herbi
##  6 herbi
##  7 carni
##  8 <NA> 
##  9 carni
## 10 herbi
## # ℹ 73 more rows

msleep$vore

##  [1] "carni"   "omni"    "herbi"   "omni"    "herbi"   "herbi"   "carni"  
##  [8] NA        "carni"   "herbi"   "herbi"   "herbi"   "omni"    "herbi"  
## [15] "omni"    "omni"    "omni"    "carni"   "herbi"   "omni"    "herbi"  
## [22] "insecti" "herbi"   "herbi"   "omni"    "omni"    "herbi"   "carni"  
## [29] "omni"    "herbi"   "carni"   "carni"   "herbi"   "omni"    "herbi"  
## [36] "herbi"   "carni"   "omni"    "herbi"   "herbi"   "herbi"   "herbi"  
## [43] "insecti" "herbi"   "carni"   "herbi"   "carni"   "herbi"   "herbi"  
## [50] "omni"    "carni"   "carni"   "carni"   "omni"    NA        "omni"   
## [57] NA        NA        "carni"   "carni"   "herbi"   "insecti" NA       
## [64] "herbi"   "omni"    "omni"    "insecti" "herbi"   NA        "herbi"  
## [71] "herbi"   "herbi"   NA        "omni"    "insecti" "herbi"   "herbi"  
## [78] "omni"    "omni"    "carni"   "carni"   "carni"   "carni"

If you wanted to save these as objects, you need to add an arrow an a new name for that object. They should all be the same!

column3<-msleep[,3]
voreonly<-msleep[, "vore"]
vores<-msleep$vore

head(column3) #do this or View() for all of your new objects!

## # A tibble: 6 × 1
##   vore 
##   <chr>
## 1 carni
## 2 omni 
## 3 herbi
## 4 omni 
## 5 herbi
## 6 herbi

It’s important to know the class of data if you want to manipulate it. For example, you can’t add characters. msleep contains several different types of data. We see with str() that there are columns of data that are characters and numeric.

Data Types/Classes:

Character (names)
Numeric (numbers)
Logical (T/F)
Integer (2L for example)
Complex (imaginary #s)
Raw (not really used)

class(msleep$vore) #character!

## [1] "character"

class(msleep$sleep_total) #numeric!

## [1] "numeric"

We can also look at a single row at a time. There are two ways to access this:

by indicating the row number in square brackets next to the name of the dataframe name[#,]
by calling the actual name of the row (if your rows have names) name["a",].

msleep[43,]

## # A tibble: 1 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Little… Myot… inse… Chir… <NA>                19.9         2         0.2   4.1
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

msleep[msleep$name == "Mountain beaver",]

## # A tibble: 1 × 11
##   name    genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
##   <chr>   <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
## 1 Mounta… Aplo… herbi Rode… nt                  14.4       2.4          NA   9.6
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

Mountain Beaver by Coke Smith https://www.nps.gov/articles/000/mapping-mountain-beavers-in-point-reyes-a-collaboration-between-the-national-park-service-and-uc-berkeley.htm

We can select more than one row or column at a time:

 # see two columns

msleep[,c(1, 6)]

## # A tibble: 83 × 2
##    name                       sleep_total
##    <chr>                            <dbl>
##  1 Cheetah                           12.1
##  2 Owl monkey                        17  
##  3 Mountain beaver                   14.4
##  4 Greater short-tailed shrew        14.9
##  5 Cow                                4  
##  6 Three-toed sloth                  14.4
##  7 Northern fur seal                  8.7
##  8 Vesper mouse                       7  
##  9 Dog                               10.1
## 10 Roe deer                           3  
## # ℹ 73 more rows

 # and make a new data frame from these subsets

subsleep<-msleep[,c(1, 6)]

But what if we actually care about how many unique things are in a column?

 # unique()
unique(msleep[, "order"])

## # A tibble: 19 × 1
##    order          
##    <chr>          
##  1 Carnivora      
##  2 Primates       
##  3 Rodentia       
##  4 Soricomorpha   
##  5 Artiodactyla   
##  6 Pilosa         
##  7 Cingulata      
##  8 Hyracoidea     
##  9 Didelphimorphia
## 10 Proboscidea    
## 11 Chiroptera     
## 12 Perissodactyla 
## 13 Erinaceomorpha 
## 14 Cetacea        
## 15 Lagomorpha     
## 16 Diprotodontia  
## 17 Monotremata    
## 18 Afrosoricida   
## 19 Scandentia

 # table()
table(msleep$order)

## 
##    Afrosoricida    Artiodactyla       Carnivora         Cetacea      Chiroptera 
##               1               6              12               3               2 
##       Cingulata Didelphimorphia   Diprotodontia  Erinaceomorpha      Hyracoidea 
##               2               2               2               2               3 
##      Lagomorpha     Monotremata  Perissodactyla          Pilosa        Primates 
##               1               1               3               1              12 
##     Proboscidea        Rodentia      Scandentia    Soricomorpha 
##               2              22               1               5

 # levels(), if class is factor (and if not we can make it a factor) showing the way that the data is displayed
levels(as.factor(msleep$order))

##  [1] "Afrosoricida"    "Artiodactyla"    "Carnivora"       "Cetacea"        
##  [5] "Chiroptera"      "Cingulata"       "Didelphimorphia" "Diprotodontia"  
##  [9] "Erinaceomorpha"  "Hyracoidea"      "Lagomorpha"      "Monotremata"    
## [13] "Perissodactyla"  "Pilosa"          "Primates"        "Proboscidea"    
## [17] "Rodentia"        "Scandentia"      "Soricomorpha"

6.15 Group Work Activity: practice exploring a dataframe

Iris Diagram for Reference

We’ll use the built-in ‘iris’ dataset. The command: `

data(iris) # this loads the 'iris' dataset.

You can view more information about this dataset with help(iris) or ?iris. This dataset was published by Ronald Fisher in his 1936 paper: “The use of multiple measurements in taxonomic problems”. It has three plant species (setosa, virginica, versicolor) and four morphological traits measured for each sample in centimeters: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width. It is important to acknowledge that the field of genetics has been built on eugenics, and Fisher was a prominent geneticist and eugenicist. More information about this can be accessed here: https://www.ucl.ac.uk/biosciences/gee/ucl-centre-computational-biology/ronald-aylmer-fisher-1890-1962

Submit a screenshot or copied text that shows the code you ran to answer the following questions:

How many rows are in the dataset?

How many species of flowers are in the dataset?

How many columns does this data frame have? What are their names?

What class did R assign to each column?

Create a vector that is the area of each flower’s petals. (You can assume that each petal is rectangular so that petal width * petal length = petal area)

What is the maximum sepal length of the irises?

6.16 Key Points

Useful functions such as install.packages(), library() can help us upload packages and data from R, while other functions such as str(), dim(), and unique() can help us investigate dataframes
To look at our data we can use several commands, including view() or data$columnofinterest if you only want to look at one variable.
Manipulating data in R can be extremely helpful in the analysis stage, and we can get data on mean(), min(), max(), and summary() data on different variables of interest

Class Exercise Solution

Class Exercise 1: Solution

no pvalue: 68574 sites retained p = 1e-3: 0 sites retained

Class Exercise 2: Solution

animals[2] “meow”

site_code[10] “RP”

summary(site_code) 21 sample sites