8 Week 7- Population Structure & Sea Cucumbers

We will be using data from a paper by Xuereb et al. (2018) paper on P. californicus titled “Asymmetric oceanographic processes mediate connectivity and population genetic structure, as revealed by RADseq, in a highly dispersive marine invertebrate (Parastichopus californicus)”.

http://galleries.neaq.org/2016/06/sea-cucumbers-east-meets-west.html

SNP data from this study will be used today in order to learn how to identify population structure in sea cucumber populations and make structure plots using tess3r and pophelper. This data is SNP data from 15 individuals per site across 7 collection sites:

Map of sampling locations in coastal British Columbia (1–20) and southeastern Alaska (21–24).

8.1 Main Objectives:

Practice moving from terminal to R when working with big data
Interpret structure plots
Create structure plots of genomic data in R

8.2 Warmup: Joining command line and R

So, far, we’ve mostly been working with datasets that are available to download in R; but a common practice in genomics is to work with large datasets (like sequencing reads) on a cluster and once the data is processed into a smaller format, working with downstream files in R.

Here is a brief overview of our goals for the warmup:

Create a new folder in your personal directory on farm called week7
Download genomic data into this folder using wget https://raw.githubusercontent.com/mlarmstrong/IntroGenomics_Data/main/week7.zip
Unzip the data folder and vcf
Create an RStudio OnDemand Session
Make a script called week7 and save it in your week7 directory
Install and load the following R packages:

ggplot2
vcfR
devtools
remotes
gridExtra
dplyr

Use devtools and remotes to install the following packages from github:

pophelper
tess3r

A more thorough walk through of the warm up can be followed below.

8.2.1 Steps 1 through 3: Getting the data onto the cluster

We first need to download the data in terminal. Start an OnDemand session. Navigate to your directory and do the following command to get the vcf file we need onto your farm desktop. We will download a zipped version of this data to make the download easier, but note that you can almost always download published genomic data from online repositories like dryad or NCBI. This data set is available on dryad here: https://datadryad.org/dataset/doi:10.5061/dryad.db6177b

cd /group/rbaygrp/eve198-genomics/yourdirectory
wget https://raw.githubusercontent.com/mlarmstrong/IntroGenomics_Data/main/week7.zip

Unzip the week7 directory with unzip week7.zip and navigate inside with cd week7. There should be three files:

The README.txt gives you information on the dataset you just downloaded.
The .csv file in this folder is our metadata, with coordinates for each of our sample sites along with sample names. This will be useful for understanding population structure between populations and across space!
The vcf file contains SNPs, similar to one we produced with our variant calling pipeline a few weeks ago! If you file has the extensions .zip, that means that even though your folder is unzipped, this file is still compressed. You can run unzip filtered_2719neutral_snps.recode.vcf.zip to unzip it!

8.2.2 Steps 4 through 7: Getting set up in R

Now we will move to R! Start by creating an OnDemand R Studio session.

Go to file>new to create a new R script. Name it week7.R and save it within the week7 folder you created in your personal directory above.

We will need quite a few packages today since genomic datasets require special tools!

We will be using tess3r to identify population structure and then pophelper to visualize that structure in a more organized way. We also need to be able to read in a vcf file, so vcfR is a useful package for

Most packages can be installed using install.packages() and then loaded using library(). Let’s do this for ggplot2, vcfR, devtools and remotes. devtools and remotes are R packages that allow us to install even more packages than can be installed using base R. More on that below!

install.packages(c("vcfR", "ggplot2", "devtools", "remotes", "gridExtra", "dplyr")) #only run this line once, it might take a few minutes

library(vcfR)
library(ggplot2)
library(devtools)
library(gridExtra)
library(dplyr)

You can need to use devtools to install tess3r, but then it can be loaded using library(), just like other packages.

devtools::install_github("bcm-uga/TESS3_encho_sen") #only run once, you can press 'enter' to skip updates
library(tess3r)

Similarly to devtools, remotes is a package you can load in R and use to install other packages that are available on github, like you pophelper

remotes::install_github('royfrancis/pophelper') #only run once, this might take a few minutes!
library(pophelper)

8.3 Setting your Working Directory in R

We downloaded data and created a script in the folder called week7, and we want to make sure that R recognizes this as our working directory. You can think of this process as similar to when you cd to change directories on the command line. To set your working directory you can do the following: setwd("/group/rbaygrp/eve198-genomics/yourpersonalfolder/week7")

A few pro tips:

You can use tab complete to fill in your path to your folder!
You can check your current directory with the command getwd()

8.4 Reading in vcfs

Now let’s read in our data! This will take a second and you will get information on the file in your console as it is read in. The number of variants that are processed should be 2,719. Then you can extract the genotypes with the second line of code.

vcf_cukes<-read.vcfR("filtered_2719neutral_snps.recode.vcf")
genotypes_raw<-extract.gt(vcf_cukes)

If you do head(genotypes_raw) you can see that the format is 0/0 or 1/1 for homozygotes and 0/1 for heterozygotes. We actually want to convert this to instead be 0 for homozyous reference, 1 for heterozygotes and 2 for homozyogous alternate. To do this, you can run the following code:

## Replace genotype codes with numeric values
genotypes_cleaned <- as.matrix(apply(genotypes_raw, c(1, 2), function(x) {
  if (x %in% c("0/0", "0|0")) return(0)    # Homozygous reference → 0
  if (x %in% c("0/1", "1/0", "0|1", "1|0")) return(1)  # Heterozygous → 1
  if (x %in% c("1/1", "1|1")) return(2)    # Homozygous alternate → 2
  return(NA)             # Handle missing or other cases
}))

Check to see that it worked by doing View(genotypes_cleaned) or head(genotypes_cleaned). The genotypes should be coded as 0, 1 or 2. Finally let’s check the dimensions of our genotype data. The number of individuals should be on the left and the genotypes should be on the right.

dim(genotypes_cleaned) #indiv should be on left, genotypes on right

It looks like it is actually flipped, so let’s fix that! The command t() will help us transpose our data

genotypes<- t(genotypes_cleaned) 
dim(genotypes) #now it is right!

8.5 Reading in csvs

Now let’s read in the coordinate data for our sample sites! We will also make an object “coordinates” to hold our lat/long data for the tess3 object later. Our sites.data should a length of 717 to match our genotype data.

sites.data<-read.csv('strata.cucumbers.717ind.csv', header=TRUE, sep=",")
coordinates<-as.matrix(cbind(sites.data$LAT, sites.data$LONG))
dim(sites.data)

8.6 Mini Lecture: Interpreting Structure Plots

Let’s spend a bit more time getting oriented to the sea cucumber paper and understanding more generally what information is represented with structure plots!

You can follow along with the slides here

8.7 population structure with tess3r

You can learn more about this package on their tutorial page (https://bcm-uga.github.io/TESS3_encho_sen/articles/main-vignette.html), but in short tess3 is a function that computes population structure by estimating ancestry proportions and ancestral allele frequencies. In the code below: - X argument refers to the genotype matrix - coord argument the coord argument corresponds to the geographic coordinates - K is the number of clusters or ancestral population - reps is the number of replications for cross validation that we are calling the ancestry correctly for each sample

Normally we would want multiple reps to ensure things are being calculated correctly, but for class we will only do 2.

tess3.cukes <- tess3(X=genotypes, coord =as.matrix(cbind(sites.data$LAT, sites.data$LONG)), K=1:8,
                     ploidy=2, lambda=0, keep="best",rep=2)

This will take a second since it is real genomic data! We started with 8 clusters.

Next to make a structure plot, we need to first call the data and assign how many K “clusters” we want. For example, if we want to assign 8 clusters we would run the following:

# retrieve tess3 Q matrix for K = 8 clusters 
cukes.matrix <- qmatrix(tess3.cukes, K = 8)

#now the barplot
barplot(cukes.matrix, border = NA, space = 0, 
        xlab = "Individuals", ylab = "Ancestry proportions", 
        main = "Ancestry matrix") -> bp
axis(1, at = 1:nrow(cukes.matrix), labels = sites.data$Sample,las = 3, cex.axis = .4)

Yay our first structure plot! But it isn’t ordered by site (yet!), so it’s hard to see if there are differences across sites.

8.8 How do we know which K to pick?

In order to get an accurate estimate of the “best” K for our data we first need to run a larger range of K, here 1:8, and then look at our data and see how the cross-validation score compares to the number of predicted ancestral populations. We can do that with the code below:

The plot function generates a plot for root mean-squared errors computed on a subset of loci used for cross-validation:

plot(tess3.cukes, pch = 19, col = "orange",
     xlab = "Number of ancestral populations",
     ylab = "Cross-validation score")

Class Exercise 1

Where the cross-validation score drops is which K is likely the best fit. Which K do you think is the best fit for our data? Let’s plot a couple different structure plots (K=2, K=4 and K=6) for data visualization and determine the best K. You can modify the code above by changing the K value and rerunning the plots.

Now that we have determined the best K for our dataset, let’s visualize things!

8.9 Visualizing population structure with pophelper!

Pophelper is a really useful tool for visualizing population structure graphs. You can view the tutorial here for more information: https://www.royfrancis.com/pophelper/articles/index.html#plotting-1. We can do a lot of customization in pophelper, and this figure shows what our final result will look like!

Pophelper Overview

First we need to convert our tess3 object to a qlist for pophelper

qlist.cukes <- readQTess3(t3list=tess3.cukes)

Then we need to specify the order that we want our sites to be in and the different labels for our structure plot. The first is our population order. We want it to match the map figure and the table in the paper, so let’s order from South to North. To get the same of the sites we can do the following:

unique(sites.data$SITE)

Our sites are “OGD” “SGI” “LAS” “JER” “TOF” “CRA” “RBY” “SHE” “MAL” “QUA” “HOP” “TBL” “CAL” “TOL” “PRI” “LEG” “JUA” “SEL” “REN” “MAZ” “AK1” “AK2” “AK3” “AK4”.

Table 1 from Xuereb et al., 2018

Table 1 from the paper shows the sites south to north. Luckily the dataset is already sorted appropriately, so now we can just put this into a new object!

ordered.sites=c("OGD", "SGI", "LAS" ,"JER" ,"TOF" ,"CRA" ,"RBY", "SHE", "MAL", "QUA", "HOP", "TBL", "CAL", "TOL", "PRI", "LEG", "JUA", "SEL", "REN", "MAZ", "AK1", "AK2" ,"AK3", "AK4")

We also want to label our two structure plots with what K equals, so add that with the object “labels”

labels <- c("K=2", "K=3")

Now that we have our labels we can pull the data that we want from our different files

slist.cukes <- alignK(qlist.cukes[2:3],type="auto")
labset<-data.frame(sites.data)

Now we can plot our data!

#plot samples separated by site
p<-plotQ(slist.cukes,imgoutput="join",returnplot=TRUE,exportplot=FALSE,basesize=11,
            splabsize=7,height=7,
            grplab=data.frame(SITE=labset$SITE),subsetgrp=ordered.sites,
            grplabsize=3,linesize=1,pointsize=3,splab=labels,grplabangle=0,
            grplabheight = 5)

grid.arrange(p$plot[[1]])

8.10 Manipulating our metadata with dplyr

The site names may mean something if you are familiar with this area or have read the paper, but how can we make our plot easier to understand? One way could be to add what countries these samples were taken in so you can get a sense of sampling area. To work with this dataset, we’ll use a really useful package for data manipulation that we haven’t used yet, called dplyr().

The newest thing that you will see with dplyr is %>%. This is called a pipe, and connects lines of code together.

To add a new column we will need to define what the new column will be named and then provide information to fill that column

library(dplyr)
sites.data<-sites.data %>% 
  mutate(Country=case_when(
    SITE %in% c("OGD", "SGI", "LAS" ,"JER" ,"TOF" ,"CRA" ,"RBY", "SHE", "MAL", "QUA", "HOP", "TBL", "CAL", "TOL", "PRI", "LEG", "JUA", "SEL", "REN", "MAZ") ~"British Columbia, Canada",
    SITE %in% c("AK1", "AK2" ,"AK3", "AK4") ~ "Alaska, USA"))

View sites.data before moving on to ensure you got the column!

Class Exercise 2

Now we can rerun some of the code we have above to remake our structure plot with “Site” and “Country” labels.

8.11 Group Work Activity- Adding even more context to our plot

The structure plot is now more organized with samples from south to north, but what if we want clear labels for how those sites split? Remember that Table 1 grouped sites by region, with Ogden Point to Table Island as the “South Region” and Calvert Island to Alaska site 4 as the “North Region”.

Table 1 from Xuereb et al., 2018

Add a column for “Region” in sites.data dataset.
Make a new structure plot with a label for SITE and Region.
Modify the color scheme however you’d like (but make it distinct from the preset colors, please!) View the pophelper page for more information on how to do so. Think of how you can use color to convey a message with your plot. You can check out a few color schemes frompophelper() here

Submit your code and a photo of your structure plot to canvas.

8.12 Key Points

Moving from terminal to R is important when working with genomic data
Structure plots can be made using tess3r and are useful for visualizing population structure within your dataset
Pophelper is a useful tool for customizing these structure plots

Class Exercise Solutions

Class Exercises: Solutions

Exercise 1

It looks like there might just be two clusters, since the cross-validation score drops between 1 and 2 ancestral populations. Let’s make a matrix using qmatrix for K=2 and visualize a structure plot with that. The qmatrix is a matrix of our ancestry coefficients, aka the matrix of relatedness between our samples! Our qmatrix will be called cukes.matrix

# retrieve tess3 Q matrix for K = 2 clusters 
cukes.matrix <- qmatrix(tess3.cukes, K = 2)

#now the barplot
barplot(cukes.matrix, border = NA, space = 0, 
        xlab = "Individuals", ylab = "Ancestry proportions", 
        main = "Ancestry matrix") -> bp
axis(1, at = 1:nrow(cukes.matrix), labels = sites.data$Sample,las = 3, cex.axis = .4)

Exercise 2

structure plot with site and country labels

labset<-data.frame(sites.data)
verifyGrplab(grplab=labset[,c("SITE", "Country")]) # to make sure it is there
p2<-plotQ(slist.cukes,imgoutput="join",returnplot=TRUE,exportplot=FALSE,basesize=11,
            splabsize=7,height=7,
            grplab=labset[,c("SITE", "Country")],subsetgrp=ordered.sites,
            grplabsize=3,linesize=1,pointsize=3,splab=labels,grplabangle=0,
            grplabheight = 5)

grid.arrange(p2$plot[[1]])