10 Week 9: Fst outlier analysis and chunky quails
The lecture for this week focused on Fst and outlier analysis can be found here: Week 9 Slides
Last week we used PCA to take a broad look at whether populations were genetically distinct. This week we will look at SNPs that differentiate between populations that may represent signals of selection. If you want to learn more about how different methods for detecting signatures of selection I recommend visiting this page on the marineomics website: https://marineomics.github.io/POP_10_Signatures_of_Selection.html
We will use data from the following paper focused on genomic variation in common quails:
We will get more into what the data focuses on later in the lesson, but a few findings from the study is that the quails with and without the inversion differ in their throat coloration, weight, fat reserves, wing shape and migratory behavior.
Sanchez-Donoso et al. (2022) Phenotypes of the common quail with and without the inversion from the graphical abstract
We will download the data and take a first look in bash and then we will do analysis and generate the plots in RStudio.
10.1 Main Objectives:
- Learn how to identify particular SNPs that are driving patterns of divergence between populations
- Connect what we learned in previous lessons using PCAs and Structure Plots and how these methods differ from today’s outlier analyses
- Practice keeping scripts organized to ensure reproducibility
10.2 Download the data
We first need to download the data. Start a terminal session and navigate to your personal directory. Then use wget to download the data folder. We’ll then use the tar command to un-compress it.
cd /group/rbaygrp/eve198-genomics/yourdirectory
mkdir Week9
cd Week9
wget https://raw.githubusercontent.com/mlarmstrong/IntroGenomics_Data/main/week9.tar.gz
tar -xzvf week9.tar.gz10.3 Downloading R Packages
Once your data is unzipped, terminate your OnDemand terminal session to free up space on the cluster.Start an RStudio OnDemand session. The first thing we need to do is download a few different packages.
Today we will be using the SeqArray and SNPRelate packages to read our data. Since you used these last week, you shouldn’t have to reinstall them.
For the outlier analysis, we will use a package called OutFLANK which also needs the package qvalue to run (we call this a ‘dependency’). We will use BiocManager to help us with our downloads again, so make sure to load that package as well.
library(SeqArray)
library(SNPRelate)
library(BiocManager)
BiocManager::install("qvalue") # When prompted, type "a" - this will take some time
install_github("whitlock/OutFLANK")10.4 R hygiene
Now that you are getting more comfortable coding in R, it’s a good time to talk
about how to make your scripts clean, readable, and reusable. The first and
likely most important piece of advice is to ALWAYS do lots of commenting with
# so when you come back to your code you know what you did and why.
It’s good practice to put three main pieces of information at the top of your code. Your name (for when you share code), the last modified date, and the purpose of the code. Looks like this:
## Code for finding outliers using OUTFlank
## Katie Erickson
## Modified February 27, 2026After that, I like to have all the libraries that will be needed for the code loaded:
library(SeqArray)
library(SNPRelate)
library(OutFLANK)
library(gdsfmt) Finally, I like to make objects for all the files that I will use. Again, this is so that if I want to come back to the code later and run the same analysis on different files, it is obvious what I need to change rather than being buried in the code:
setwd("/group/rbaygrp/eve198-genomics/kerickson")
vcf.path="ChunkyQuails.vcf.gz"
meta.path="ChunkyQuails_meta.csv"Make sure your working directory is set up to the folder your data is stored in for week 9. Now that our script is organized, we are ready to use the data!
10.5 Finding outliers using OutFLANK
We will read in the vcf file just like we did last week when we did PCA. Give it a try on your own!
Class Exercise 1
Using info from Week 8, read in your data!
Q1. Convert your vcf file to a gds file called “ChunkyQuails.gds”. Then read make a gds object called ‘genofile’. How many samples do you have?
Q2. Read in your metadata file to an object named ‘meta’ (hint: read.csv()). Take a look at the different columns. “Loc1” is the breeding location. How many unique breeding locations are represented?
10.6 Pulling SNP data from the gds file
Later, we will want information on the location of each SNP in the genome. We can pull information from the gds file like this:
chromosomes <- read.gdsn(index.gdsn(genofile, "chromosome"))
positions <- read.gdsn(index.gdsn(genofile, "position"))
variant.id <- read.gdsn(index.gdsn(genofile, "variant.id"))Then put everything together in a single dataframe like this:
snp.pos <- data.frame(LocusName=variant.id,
chr=as.factor(chromosomes),
pos=positions)We also want the matrix of SNP genotypes.
We can pull this from the gds object using the snpgdsGetGeno() command:
G <- snpgdsGetGeno(genofile)Take a look at this matrix. Here I’m just looking at the head of the first 10 columns because the matrix is large!
head(G[,1:10])We have four different values: 0,1,2, and NA. These represent the reference homozygote, heterozygote, alternate homozygote, and missing data respectively. The program that we want to use, OUTFLank requires a ‘9’ instead of ‘NA’ for missing data. We can do the replacement like this:
G[is.na(G)] <- 9Take a look at the head of the dataframe again. Did it work?
10.7 Calculating Fst and identifying outliers
Now we can calculate Fst for each SNP. We’re going to calculate differentiation among locations (the Loc1 column in your metadata).
WAIT! First we need to determine whether our metadata is in the same order as our genotype matrix.
Class Exercise 2
Is your metadata in the same order as your genotype matrix?
Q1. Get the sample names from your gds object (hint: how did you extract chromosome and position from gds? There is also information in ‘sample.id’)
Q2. Determine whether the sample IDs you extracted from the gds are in the same order as in the metadata file (hint: you did this last week too!)
Phew - NOW we can calculate Fst! This will take a few minutes.
fst <- MakeDiploidFSTMat(G,locusNames=variant.id,popNames=meta$Loc1)
head(fst)hist(fst$FST,breaks=50)Once we’ve calculated Fst between the two populations for each SNP individually, we want to determine whether some SNPs are outliers - that is, more differentiated than we would expect. OutFLANK does this by fitting a Chi-Squared distribution to the data and looking to see if the tails of the Chi-Squared distribution have more SNPs than expected:
OF <- OutFLANK(fst,LeftTrimFraction=0.01,RightTrimFraction=0.01,
Hmin=0.05,NumberOfSamples=9,qthreshold=0.01)OutFLANKResultsPlotter(OF,withOutliers=T,
NoCorr=T,Hmin=0.1,binwidth=0.005,
Zoom=F,RightZoomFraction=0.05,titletext=NULL)It’s a little hard to tell from these plots, but there may be some SNPs with high Fst even where the distribution predicts there should be none. To find these SNPs, we ask which SNPs are statistical outliers?
P1 <- pOutlierFinderChiSqNoCorr(fst,Fstbar=OF$FSTNoCorrbar,
dfInferred=OF$dfInferred,qthreshold=0.05,Hmin=0.1)
outliers <- P1$OutlierFlag==TRUE
table(outliers)Now we can make a manhattan plot! There are lots of packages to make very beautiful manhattan plots and even highlight particular genes of interest. Here we will just use base R. First, take a look at the dataframe that has information on Fst and outliers:
head(P1)We are missing chromosome information, which we want to make a beautiful
manhattan plot that is colored by chromosome. Remember how we made the snp.pos
dataframe earlier? Basically we want to merge P1 and snp.pos.
Class Exercise 3
Merge two dataframes based on one column
- Q1. Merge the P1 and snp.pos dataframes based on the LocusName column. There are many ways to do this. Two of my favorites are ‘merge()’ or ‘left_join()’ which is part of dplyr package. Use the help file for one of them and try to merge the two dataframes into a single called “merged.P1”
Yay! Now that we have a merged dataframe we can make a manhattan plot:
plot(merged.P1$LocusName,P1$FST,xlab="Position",ylab="FST",col=merged.P1$chr,pch=19,cex=0.5)What do you notice about this plot? Can you think of a possible explanation for this result?
10.8 Group Work Activity: PCA & Outflank
This week’s activity will combine skills you learned last week and this week using the quail dataset we familiarized ourselved with during class:
Rerun OUTFlank, but this time use the “Karyotype_cluster” cluster instead of geographic populations. This column tells you what inversion genotype a particular quail has. Compare this plot to the plot we made in class - why are there similarities and differences?
Create a PCA of the quail data. You can choose to color by geography or by karyotype (or get fancy with different shapes and colors!) Interpret the figure you make in one to two sentences
Copy the lines in your script used to answer these questions into your canvas submission. Be sure to include the answers for question 1, interpretations for the figures and the two figures you made as well.
10.9 Key Points
- Standard organization of R scripts increases reproducibility
- Fst outliers can be used to find genomic signals of local adaptation
Class Exercise Solutions
Class Exercises: Solutions
Exercise 1
- Convert your vcf file to a gds file called “ChunkyQuails.gds”. Then read make a gds object called ‘genofile’.
seqVCF2GDS(vcf.path,"ChunkyQuails.gds")
genofile <- seqOpen("ChunkyQuails.gds")
- Read in your metadata file to an object named ‘meta’. How many unique breeding locations are represented?
meta <- read.csv(meta.path)
table(meta$Loc1)
length(unique(meta$Loc1))Exercise 2
- Get the sample names from your gds object
samples <- read.gdsn(index.gdsn(genofile, "sample.id"))
- Determine whether the sample IDs you extracted from the gds are in the same order as in the metadata file
identical(samples,meta$SampleID)Exercise 3
- Merge the P1 and snp.pos dataframes based on the LocusName column.
merged.P1 <- merge(P1,snp.pos,by="LocusName")
library(dplyr)
merged.P1 <- left_join(P1,snp.pos,by="LocusName")