This workshop will cover how to run a PhyloG2P framework using PhyloAcc (Hu et al. 2019). There are three main things needed to run the PhyloAcc suite of programs, which we will cover during the next few hours:
The PhyloG2P paradigm involves connecting phenotypes and genotypes while taking account for phylogenetic relationships between species, aka in a comparative biology framework. Convergent phenotypes, in particular, are an integral aspect of using the PhyloG2P framework to untangle the genetic basis of phenotypes, provided the same genes and gene pathways are involved in the phenotype.
Here are some examples of how it has been used, using a PhyloAcc-based approach.
The following folder:
/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/
will contain many useful files necessary to run many of the pipelines that I will showcase. I would recommend copying the files in this folder into your working directory before the workshop.
A singularity of cactus is available in the shared folder and you can run cactus by copying it to your directory
You will need either Hal Tools (link) or Cactus (link) work with Hal files and do the alignments. Installing Cactus as a Singularity comes with Hal Tools preinstalled.
To install Hal Tools directly follow the instructions on the website. Any deviation from their methods, and you’re installation may fail. It takes ~20 mins to finish installing Hal Tools.
To install cactus, first you’ll need to have Singularity installed see here. Then you can build the singularity version by:
singularity pull --disable-cache docker://quay.io/comparative-genomics-toolkit/cactus:v2.8.1-gpu
Bedtools is an important set of tools that can perform a lot of manipulations of bed, vcf, and bam files. You can install that here.
Phast is a group of programs to calculate conservation and neutral rates from alignments. You can install that here.
UCSC offers many tools to manipulate various files used in genome analysis. We will especially use ones that helps convert from one file format to another. Two programs we will use are gff3ToGenePred and genePredToBed.
One way to get 4 fold degenerate sites is through Degenotate that comes from Gregg in the Informatics group here at Harvard. You can install that here. It is recommended you install degenotate in its own conda environment as it might cause compatibility issues with other programs we will be using if you install in the same environment.
To get the original PhyloAcc ST/GT, you can refer to the installation instructions there to install Snakemake and PhyloAcc.
PhyloAcc-C, the
continuous version of PhyloAcc-C, works off an R script so it needs to
installed through R. It requires R-tools (Windows) or
r-base-dev to install properly along with some other
dependencies like RcppArmadillo and expm to
run properly. I’ve had issues sometimes installing some of these
dependencies on the Cannon cluster R, but you can install R dependencies
through conda as well which can solve problems of installing them using
install.packages().
You will also need ggplot2, parallel,
coda, argparse, sm,
stringi and ape if you don’t already have them
in R.
Just a guide for some file formats if you are not familiar with them:
There are more than one way to make whole genome alignments. You can use the Lastz based pairwise alignment (see example here). One could also use BLAST to get loci of interest and make alignments directly. However, there are some benefits of having a Cactus alignments, as it can be used for more than running PhyloAcc or building alignments.
The first step to doing a Cactus alignment is getting the fasta files of the genomes. These fasta files must be soft-masked (most NCBI genomes are already repeat-masked). I often have issues with super high quality genomes either because not all repeats have been fully masked or there are many Ns in the genome.
The easiest way to download many genomes at once from NCBI is to use ncbi-datasets. Just provide it with a text file that contains accession numbers (see help here).
# eg. code to get genomes from ncbi using accessions saved in my_accessions.txt
datasets download genome accession --include genome --inputfile my_accessions.txt
# ncbi-dataset can be used to download, say, all genomes with a specific taxonomic group as well
datasets download genome taxon Anseriformes --include genome
For cactus, it is often advisable, along with masking the genomes, to not include small scaffolds. I’ve often removed any scaffolds smaller than 1000 bps. Here is an example script to clean all genomes downloaded from ncbi_datasets step above.
# the file my_file.txt contains two columns, accession number, and the output name (because I want to specify exactly what the output genome will be called, usually species name+accession number)
# ncbi-dataset by default downloads files to a folder called ncbi_dataset, unless you change it explicitly.
while IFS=$'\t' read -r genome outname; do
# find genome
genomefile=$(find "ncbi_dataset/data/$genome/" -type f -name "*.fna" -print -quit)
# index genome
samtools faidx $genomefile
# get scaffolds larger than 1000 bp
awk -F'\t' '$2 > 1000 {print $1}' "${genomefile}.fai" > ncbi_dataset/data/$genome/keep.txt
# remove scaffolds larger than 1000 bp
samtools faidx $genomefile -r ncbi_dataset/data/$genome/keep.txt > $outname.fna
done < "my_file.txt"
The Harvard GenomeAnnotation Github has specific details on how to run cactus. Currently, it is written for cactus v2.2, but soon will be updated for the most recent version of cactus. There have been changes in cactus since that version in syntax so the current workflow does not work on other versions. I have a bootlegged version of the script that I modified that can run using more recent versions of cactus, but you may want to wait for an official release.
After having done many Cactus alignments, I have learned a few things that do work and others that do not. Here are some of my insights that might help you (many of the time recommendations based on bird genomes which are around 1 Gb and do not usually have a lot of repetitive elements):
The Hal (short for Hierarchical Alignment) format is a graph-based representation of a genome alignment (Hickey et al. 2013). After the cactus pipeline runs, you should get a hal file as an output. Let’s learn how to manipulate hal files using some of the tools I’ve commonly used from haltools. I will be using the cactus singularity based implementation of the hal tools.
# We will be using this hal file for the course of this workshop
/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/duck_align.hal
# And you can find a version of cactus already compiled here as well
/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/cactus_v2.8.1-gpu.sif
A word on singularities:
singularity shell <container>. This will open a
command-line within the singularity that you can run stuff on.singularity exec <container> <program> <program parameters>.singularity exec --bind <folders> <container>
to give it access to folders (it is recursive). It works in the same way
as setting $PATH on the command line (so multiple folders can be bound
with :)--nv (for Nvidia) to
execute gpu based commands.Let’s Hal
singularity shell cactus_v2.8.1-gpu.sifhal and hit tabhalStatsNext, we’ll look at some of the common hal tools that you may end up using. We will use a dummy data for practice.
What’s your favorite gene
# Get the coordinates for your favorite gene from the chicken gff file available here:
/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/galgal7b.gff
# I will use the gene TYR for Tyrosinase
grep "gene=TYR;" galgal7b.gff
# We will want to convert this to a bed format so let's use awk to get only the relevant information
# In brief, we want to get the exons, so the 3rd column should match exon, then we need to get the gene name
# so it appears in the output, we need to subtract 1 from the start index because bed is 0 indexed and
# vcf is 1 indexed, and finally lets add a count number to the end of the exons
grep "gene=TYR;" galgal7b.gff |
awk 'BEGIN{ FS="\t"; OFS=FS }
$3 == "exon" {
sub(/.*gene=/, "", $9);
sub(/;.*/, "", $9);
$4 = $4 - 1;
count++;
print $1, $4, $5, $9 "_exon_" count;
}' > tyr_galgal.bed
halStats can be used to get the basic statistics of the
hal file.
For example, if we wanted to see what genomes were present in an alignment, we could (note we are using exec now not shell):
singularity exec cactus_v2.8.1-gpu.sif halStats --genomes duck_align.hal
We can get the scaffold names for a particular genome using:
singularity exec cactus_v2.8.1-gpu.sif halStats --sequences Gallus_gallus duck_align.hal
We can also view the tree structure used in for the alignment using:
singularity exec cactus_v2.8.1-gpu.sif halStats --tree duck_align.hal
A few of the commands halAddToBranch,
halAppendSubtree, halAppendCactusSubtree,
halRemoveGenome, halRemoveSubtree and
halReplaceGenome can be used to add, remove, and modify the
genomes in the halfile. These actions will also do realignments across
all higher nodes for whatever is added or removed. See the respective
help for each of those commands.
One command that could be very useful is
halRenameSequence. Suppose the genome you used in the
alignment was from Genbank and your gff file had coordinates in the
RefSeq format. While there is a one to one mapping of each of these
scaffold names, they differ between genbank and refseq. You could use a
tab file with genbank scaffold names and refseq scaffold names to update
the scaffold names in the alignment with ease.
Another important command is halLiftover. This is one
reason, in my opinion, why a cactus alignment is so much better than say
a pairwise lastz alignment. You can easily convert coordinates from one
species to any other species across all aligned taxa.
Let’s look at the coordinates of the gene you chose in a different species, you can choose any one among the ones that showed up before (including what the predicted ancestral sequences were).
singularity exec cactus_v2.8.1-gpu.sif halLiftover duck_align.hal Gallus_gallus tyr_galgal.bed Anseranas_semipalmata tyr_anssem.bed
You will often find your lifted-over coordinates are broken up. This is because of indels and other structural artifacts introduce breaks. But we can use bedtools to clean up such breaks.
# In many cases the first step is to always sort the bed file. I will use -d of 20 to merge the file, ie any gaps less than 20 bp will be merged into one. I will also make sure that merge keeps the fourth column using -c and -o
bedtools sort -i tyr_anssem.bed | \
bedtools merge -d 20 -c 4 -o distinct -i "stdin" > \
tyr_anssem_merged.bed
Also, if your bed file contains extra columns, just use
--bedType 3 to get halLiftover to pass all the other
columns as it is.
The native hal2maf or the cactus
cactus-hal2maf can be used to convert coordinates to maf
format (multiple alignment file) and then subsequently to fasta. The
cactus-hal2maf is under active development so might work
better. They have slightly different syntax. One of the main advantages
of the latter is being able to specify how to handle duplicates (aka
regions of the genome aligned to multiple regions) but the former is
more easy to use and is quite straightforward.
A word about maf files, maf files are linearly mapped and scaffold information is staggered in the sequence. You can pull sequences based on linear coordinates but not based on scaffold coordinates. So like if you had two targets one in scaffold 1 that was 100 bp long and one in scaffold 2 that was 200 bp long, it would be mapped as 0-100 and 100-300 and would need to know the linear coordinates to pull out specific regions. It is advisable to split maf files by scaffolds (aka use bed files with coordinates of only one scaffold) in future operations if you intend to pull out specific regions.
# Using just hal2maf is easy, we'll also ignore paralogs using --noDupes, and not print ancestral sequences
singularity exec cactus_v2.8.1-gpu.sif hal2maf \
--noAncestors \
--noDupes \
--refGenome Gallus_gallus \
--refTargets tyr_galgal.bed \
duck_align.hal tyr_duck.maf
# Using cactus-hal2maf you need to give it more details.
# you can submit it as a separate job using --batchSystem
# --dupeMode specifies whether to get orthologs or make a consensus sequence
# ./jobstore/ is just a place to store temporary files, make sure it does not already exist
# you can parallelize jobs, set memory, and other things within cactus-hal2maf
singularity exec cactus_v2.8.1-gpu.sif cactus-hal2maf \
--batchSystem single_machine \
--refGenome Gallus_gallus \
--bedRanges tyr_galgal.bed \
--noAncestors \
--dupeMode single \
--chunkSize 1000 \
./jobstore/ duck_align.hal tyr_duck_cac.maf
There are some maf to fasta converters out there, and one could do it through scripting in Python or other programming languages (see BioAlign’s maf module). Here, we will use a small Python script that I wrote to convert the maf to a fasta format. This script works as long as your maf file does not contain paralogs/duplicates, and ideally consists of a single scaffold (remember mafs are not indexed by scaffold).
To get this script run
git clone https://github.com/subirshakya/phast_scripts.git
Then, you can use the maf2fasta.py script:
python phast_scripts/maf2fasta.py --maf tyr_duck.maf \
--bed tyr_galgal.bed \
--out_folder ./ \
--ref_species Gallus_gallus
To run PhyloAcc models, we need to the estimate neutral rate of
substitution. For this we will use 4-fold degenerate sites (3rd bp of
codons where all four bp changes do not change the amino acid) as a
proxy of a site evolving neutrally. To extract these sites we need an
annotated genome, since we need to know where genes are to get the
4-fold degenerate sites. We will use the gff file use used previously to
generate a neutral rate for a subset of the data. Normally, one would
want to run this per chromosome and then combine the rates across
multiple chromosomes using PhyloBoot (part of the Phast package). Here,
we will only work on a small subsample of genes, so that we can get done
quickly. There are two ways we can get the coordinates of 4-fold
degenerate sites, one using hal4dextract, which I have
found can take a long time to run sometimes. The other is using
degenotate.
The input hal4dextract is a 12-column bed file of exonic
sites. This we can get from the gff file using two programs gff3ToGenePred
and genePredToBed.
# first convert to genepred
gff3ToGenePred -useName galgal7b.gff galgal7b_cds.gp
# then convert to bed
genePredToBed galgal7b_cds.gp galgal7b_cds.bed
# for this exercise let's down-sample this bed file to include only a few genes, all genes in the first 10 Mb of chr 1
# you can choose your own set of genes, would be interesting to compare results between different parts of the genome
# just replace NC_052532.1 with any scaffold between NC_052532.1 and NC_052572.1.
awk 'BEGIN{ FS="\t"; OFS=FS }
$1 == "NC_052532.1" && $2 > 0 && $2 < 10000000' galgal7b_cds.bed > galgal7b_cds_subset.bed
# now let's use hal4dextract
singularity exec cactus_v2.8.1-gpu.sif hal4dExtract \
--conserved \
duck_align.hal Gallus_gallus galgal7b_cds_subset.bed galgal7b_subset_4d.bed
# next we can use hal2maf to get the alignments of all 4d sites
singularity exec cactus_v2.8.1-gpu.sif hal2maf \
--noAncestors \
--noDupes \
--refGenome Gallus_gallus \
--refTargets galgal7b_subset_4d.bed \
duck_align.hal galgal7b_subset_4d.maf
For degenotate see example 6 in the documentation
and then use the same hal2maf to get the alignments of the 4 fold
degenerate sites.
Next, we’ll use phyloFit to get the neutral rate and a
neutral rate tree.
# let's get a background tree first, we'll just borrow the tree already in the hal file for this tutorial.
# I've also stripped the ancestral node names and the branch lengths that already existed to give a fresh start
singularity exec cactus_v2.8.1-gpu.sif halStats --tree duck_align.hal | \
sed 's/Anc[0-9]*//g' - | sed 's/:0\.[0-9]*//g' - > duck_tree.tre
phyloFit --tree duck_tree.tre \
--init-random \
--subst-mod SSREV \
--sym-freqs \
--log galgal7b_subset_4d.log \
--msa-format MAF \
--out-root galgal7b_subset_4d_neutral \
galgal7b_subset_4d.maf
Here is an example of a neutral rate matrix (bases: A, C, G, T).
\[Q_{duck} = \begin{bmatrix} -1.042445 & 0.159358 & 0.770558 & 0.112529 \\ 0.122472 & -0.967380 & 0.252710 & 0.592198 \\ 0.592198 & 0.252710 & -0.967380 & 0.122472 \\ 0.112529 & 0.770558 & 0.159358 & -1.042445 \\ \end{bmatrix}\]
And the tree associated with it.
To generate a list of conserved elements, we can take the steps above further. This can be a time intensive process so we will not actually run any code for this section. I’ll show you how it can be done.
First, a practical consideration. If you are using looking for
acceleration using PhyloAcc and you know what the target group is, it is
advisable you do not include those species in the alignments used to
generate conserved rates, as you don’t want to miss elements where
accelerations have happened. This is easy enough to do, just use the
--targetGenomes flag in hal2maf. Second, the
ideal way to do this is to run phyloFit for each scaffold/chromosome
separately and then use phyloBoot to merge the rate files
for each scaffold.
OK, so to get a list of conserved elements, we need to estimate \(\rho\) in phastCons. First we need to get
entire maf alignments, ideally one for each scaffold; similar to how we
did it above, but without the --refTargets flag.
Next, we run PhastCons, which is part of the Phast suite of programs. See the help there on some tricks and tips to run PhastCons
# First we calculate the conserved rate;
# Here you need to set expected length of elements and target-coverage or let the program estimate it
# See the help file for definitions of these parameters
# We are estimating rho so we set that and include a starting value (not necessary)
# Finally, we may want to correct for GC bias, since the scaffold might have a different GC bias than the genome as a whole, I often use the GC value of the genome/group of chromosomes.
# You also need a mod file with neutral rates (output of neutral rate model)
phastCons \
--expected-length 45 \
--target-coverage 0.3
--rho 0.4 \
--gc 0.41 \
--estimate-rho \
--out-root birds_cons \
--no-post-probs \
--msa-format MAF \
some_scaffold.maf galgal7b_subset_4d_neutral.mod
# Then you use the conserved and neutral rates to get a list of conserved elements.
# This will also output a wig file with the conservation scores per element
phastCons \
--expected-length 45 \
--target-coverage 0.3 \
--most-conserved conserved_elements.bed \
--msa-format MAF
some_scaffold.maf birds_cons.mod,galgal7b_subset_4d_neutral.mod > \
conservation_scaffold.wig
Checkpoint: you should have fasta alignments for exons of a random gene and a mod file with neutral rates
These two models are used to associate acceleration of substitution rates to a binary trait. The GT (Gene Tree) model is an extension of the ST (Species Tree) that allows for different relationships between taxa for different loci (gene tree discordance) (Yan et al. (2023)). Thanks to Gregg and Tim in Bioinformatics, there is a simple snakemake based framework to run PhyloAcc (see here).
To run PhyloAcc-GT/ST first, we need to do some housekeeping,
-i option
in PhyloAcc.tree_doctor, which should already be available as a part of
the phast programs:
tree_doctor --name-ancestors galgal7b_subset_4d_neutral.mod > galgal7b_subset_4d_neutral_lab.mod-r gt or
-r adaptive. The difference here is that gt will
make gene trees for all alignments, which can be time and computation
intensive, while adaptive uses concordance factors to bin
loci to use with subset of tree topologies.--options to get
a full list of options and default parameters.--dollo.# Here is an example to run phyloacc for the 5 exons to TYR we had which were in the folder alignments
# I chose some random target species using -t where I am assuming there will be acceleration
# I parallelized the job by running 1 alignment per node to have 5 jobs total each with 100g of memory
phyloacc.py \
-d alignments/ \
-m galgal7b_subset_4d_neutral_lab.mod \
-o ./phyloacc_output \
-t "Oxyura_jamaicensis;Somateria_mollissima;Clangula_hyemalis;Melanitta_fusca;Aythya_fuligula;Tachyeres_brachypterus" \
-p 1 \
-j 5 \
-batch 1 \
-part "shared" \
-mem 100 \
-time 4
# Once it finishes running you can use phyloacc_post to gather the output
phyloacc_post.py -i phyloacc_output/
Let’s look at what kind of output PhyloAcc produces:
less phyloacc_output/results/elem_lik.txt should show a
summary for each elementless phyloacc_output/results/M2_elem_Z.txt should show
the state of acceleration/conservation/neutral for each node and
tipless phyloacc_output/results/rate_postZ_M2.txt should
show the posterior probability for each state
acceleration/conservation/neutralness for each node and tipThe folder phyloacc_output/plot/ will also contains some
plots about alignments and results
The continuous model of PhyloAcc works within an R framework. For this, we still need the same files as before, but also need a continuous trait in a text file, unlike the target tips as before.
I have an R executable called run_phyloaccc.r where I
have taken Dr. Patrick Gemmell’s scripts to seamlessly run PhyloAcc-C.
The Rscript takes the following parameters:
mod_file: the mod file with neutral rate and treeoutput: the name for the output file with summary of
PhyloAcc-C runsoutput_folder: the output folder to save the file
infasta: a folder containing fasta filesfasta_list: a list with one element per line for the
fasta files to run (can use it to run a job on only a subset of fasta;
ideal for parallelization)max-missing: this can be used to omit alignments that
are missing this fraction of speciestraits: a file with two columns, species name and trait
valuebatch_size: how many elements to process per
computation nodecores: number of cores provided to parallelizeruns: number of independent runs per element (to test
for convergence); default is 3iter: number of mcmc iterationsburnin: burn-in valuemax-missing: this can be used to check forLet’s create a random trait for the species we are working with. We can do this by pulling the species names from one of the fasta files, and then assigning a random number to the entry.
awk '/^>/{header=substr($0, 2); print header "\t" (rand()*2-1)}' alignments/TYR_exon_1.fa > duck_trait.txt
OK, that’s all we need to run PhyloAcc-C. One note, as before the prior probabilities for conservation has been adjusted to match that for birds. You may have to tweak your parameters if you are running on other species. This would have to be done by manually changing the priors in the R script.
# This will run phyloacc-c model for each element on each thread (parallelized)
Rscript run_phyloaccc.r \
--mod_file galgal7b_subset_4d_neutral_lab.mod \
--output tyr_phyloaccc \
--fasta alignments/ \
--traits duck_trait.txt \
--batch_size 5 \
--cores 5
This is generally enough to find elements that are evolving together with traits. In the output file, you can read in the table output and calculate BF:
library(ggplot2)
output_all <- read.table("phyloaccc_out.txt", header = T, sep = "\t")
# subset for elements with Gelman & Rubin convergence statistics < 1.01
output_all <- output_all[output_all$rhat <= 1.01,]
# BF
true_null = 1/(2*pi)
output_all$bf = true_null/output_all$dens0
output_all$color = "darkblue"
output_all$color[which(output_all$bf > 5)] = "orange"
ggplot(output_all)+
geom_point(aes(x=rat.50, y=log10(bf)), col = output_all$color)+
theme_classic()+
xlab("Rate")+
ylab("Bayes Factor")+
theme(plot.title = element_text(hjust = 0.5))+
geom_hline(yintercept=log10(5), linetype="dashed", color = "darkorange", linewidth=0.75)+
annotate('text', x=0, y=log10(5)+0.1, label = "BF > 5", color ="darkorange")
For elements of note, we may want to visualize trees and get some
parameter estimates. For this we can run PhyloAcc-C individually for
each candidate loci. The Rscript run_phyloaccc_indv.r can
be used to run PhyloAcc-C for each element and then print the outputs
and trees to a pdf file. I often run this for longer, using 3X
iterations that i did previously to get better mixing of parameters.
Let’s do this for one of these exons.
# The individual phyloacc script has fewer options than the batch one
# This code will run for exon5 of TYR
Rscript run_phyloaccc_indv.r \
--mod_file galgal7b_subset_4d_neutral_lab.mod \
--fasta alignments/TYR_exon_5.fa \
--traits duck_trait.txt \
--iter 30000 \
--burnin 3000
Here is the output tree:
There are also other trace plots and statistics such as accelerated rate and conserved rate that are printed in the file.
THE END