1 Introduction

This workshop will cover how to run a PhyloG2P framework using PhyloAcc (Hu et al. 2019). There are three main things needed to run the PhyloAcc suite of programs, which we will cover during the next few hours:

  • Alignments (ideally of conserved elements): We will cover the basics of genome alignments, aligning genomes using ProgressiveCactus (Armstrong et al. 2020), assess and manipulate Hal files, extract sites (maf and fasta) from hal files to build alignments.
  • Neutral rate and trees: We will cover how to calculate the neutral rate from the Hal alignment using 4-fold redundant sites. We will also touch on how to generate a list of conserved elements.
  • PhyloAcc: We will cover how to set up PhyloAcc-GT and PhyloAcc-ST (binary trait models), and we will learn how to run PhyloAcc-C (continuous rate model). We will cover how to interpret results from all these models. I will assume that you are using conda or mamba environment and are using them to install many of the packages, wherever possible. I will assume basic knowledge of R programming language and some basic bash shell scripting. I will be working on a slurm environment, and many of the snakemake pipelines have been built around slurm environments.

1.1 PhyloG2P

The PhyloG2P paradigm involves connecting phenotypes and genotypes while taking account for phylogenetic relationships between species, aka in a comparative biology framework. Convergent phenotypes, in particular, are an integral aspect of using the PhyloG2P framework to untangle the genetic basis of phenotypes, provided the same genes and gene pathways are involved in the phenotype.

Here are some examples of how it has been used, using a PhyloAcc-based approach.

  • Sackton et al. (2019) uses PhyloAcc binary version to find accelerated elements associated with flightless in paleognathous birds.
  • Gemmell et al. (2024) uses PhyloAcc-C, the continuous version, to find accelerated elements associated with longevity in mammals.
  • I have used the binary version to find accelerated elements associated with shifts in tarsus length in birds as well as using 3D geometric morphometrics to quantify shape variation in limb bones in ducks and associate accelerated elements with shape variation in bones.

2 Prerequisite Software

2.1 Files

The following folder:

/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/

will contain many useful files necessary to run many of the pipelines that I will showcase. I would recommend copying the files in this folder into your working directory before the workshop.

2.2 Haltools and/or Cactus

A singularity of cactus is available in the shared folder and you can run cactus by copying it to your directory

You will need either Hal Tools (link) or Cactus (link) work with Hal files and do the alignments. Installing Cactus as a Singularity comes with Hal Tools preinstalled.

To install Hal Tools directly follow the instructions on the website. Any deviation from their methods, and you’re installation may fail. It takes ~20 mins to finish installing Hal Tools.

To install cactus, first you’ll need to have Singularity installed see here. Then you can build the singularity version by:

singularity pull --disable-cache docker://quay.io/comparative-genomics-toolkit/cactus:v2.8.1-gpu

2.3 Samtools

Samtools is essential to work with fasta files. You can install that here.

2.4 Bedtools

Bedtools is an important set of tools that can perform a lot of manipulations of bed, vcf, and bam files. You can install that here.

2.5 Phast

Phast is a group of programs to calculate conservation and neutral rates from alignments. You can install that here.

2.6 UCSC-tools

UCSC offers many tools to manipulate various files used in genome analysis. We will especially use ones that helps convert from one file format to another. Two programs we will use are gff3ToGenePred and genePredToBed.

2.7 Python dependencies

We will use many common python dependencies like Biopython and pandas.

2.8 Degenotate (optional)

One way to get 4 fold degenerate sites is through Degenotate that comes from Gregg in the Informatics group here at Harvard. You can install that here. It is recommended you install degenotate in its own conda environment as it might cause compatibility issues with other programs we will be using if you install in the same environment.

2.9 PhyloAcc

To get the original PhyloAcc ST/GT, you can refer to the installation instructions there to install Snakemake and PhyloAcc.

PhyloAcc-C, the continuous version of PhyloAcc-C, works off an R script so it needs to installed through R. It requires R-tools (Windows) or r-base-dev to install properly along with some other dependencies like RcppArmadillo and expm to run properly. I’ve had issues sometimes installing some of these dependencies on the Cannon cluster R, but you can install R dependencies through conda as well which can solve problems of installing them using install.packages().

You will also need ggplot2, parallel, coda, argparse, sm, stringi and ape if you don’t already have them in R.

3 File Formats

Just a guide for some file formats if you are not familiar with them:

4 Cactus Alignment

There are more than one way to make whole genome alignments. You can use the Lastz based pairwise alignment (see example here). One could also use BLAST to get loci of interest and make alignments directly. However, there are some benefits of having a Cactus alignments, as it can be used for more than running PhyloAcc or building alignments.

4.1 Getting genomes and Housekeeping

The first step to doing a Cactus alignment is getting the fasta files of the genomes. These fasta files must be soft-masked (most NCBI genomes are already repeat-masked). I often have issues with super high quality genomes either because not all repeats have been fully masked or there are many Ns in the genome.

The easiest way to download many genomes at once from NCBI is to use ncbi-datasets. Just provide it with a text file that contains accession numbers (see help here).

# eg. code to get genomes from ncbi using accessions saved in my_accessions.txt
datasets download genome accession --include genome --inputfile my_accessions.txt

# ncbi-dataset can be used to download, say, all genomes with a specific taxonomic group as well
datasets download genome taxon Anseriformes --include genome

For cactus, it is often advisable, along with masking the genomes, to not include small scaffolds. I’ve often removed any scaffolds smaller than 1000 bps. Here is an example script to clean all genomes downloaded from ncbi_datasets step above.

# the file my_file.txt contains two columns, accession number, and the output name (because I want to specify exactly what the output genome will be called, usually species name+accession number)
# ncbi-dataset by default downloads files to a folder called ncbi_dataset, unless you change it explicitly. 

while IFS=$'\t' read -r genome outname; do
        # find genome
        genomefile=$(find "ncbi_dataset/data/$genome/" -type f -name "*.fna" -print -quit)
        # index genome
        samtools faidx $genomefile
        # get scaffolds larger than 1000 bp
        awk -F'\t' '$2 > 1000 {print $1}' "${genomefile}.fai" > ncbi_dataset/data/$genome/keep.txt
        # remove scaffolds larger than 1000 bp
        samtools faidx $genomefile -r ncbi_dataset/data/$genome/keep.txt > $outname.fna
done < "my_file.txt"

4.2 Running cactus

The Harvard GenomeAnnotation Github has specific details on how to run cactus. Currently, it is written for cactus v2.2, but soon will be updated for the most recent version of cactus. There have been changes in cactus since that version in syntax so the current workflow does not work on other versions. I have a bootlegged version of the script that I modified that can run using more recent versions of cactus, but you may want to wait for an official release.

4.3 Lessons from the virtual field

After having done many Cactus alignments, I have learned a few things that do work and others that do not. Here are some of my insights that might help you (many of the time recommendations based on bird genomes which are around 1 Gb and do not usually have a lot of repetitive elements):

  • Completely masking your genome is integral. The first step of the cactus pipeline is a masking step. This should finish in less than an hour. If it takes more than 4-5 hours, then your genome still has too many repeats and other elements.
  • If you are using RepeatMasker to mask a genome that already had been masked by other programs (e.g. you want to mask satellites and then mask repeats), remember it will remove any previously soft-masked regions that were already there and only mask the newly found repeats.
  • Do not use ultrametric trees as your guide tree. If you have an ultrametric tree just remove all the branch lengths and that will set all the branch lengths to 1. Cactus assumes branch lengths are substitution rates so ultrametric trees will throw off the cactus aligner.
  • If a blast or align step stalls on a node (takes more than 1 day), it will probably never finish. This is usually a sign that that node cannot be aligned using cactus. It may be a sign to cut your looses and omit one or the other genome. In most cases it is the poor quality or the super high quality genome that is causing the issue.

4.4 The HAL file

The Hal (short for Hierarchical Alignment) format is a graph-based representation of a genome alignment (Hickey et al. 2013). After the cactus pipeline runs, you should get a hal file as an output. Let’s learn how to manipulate hal files using some of the tools I’ve commonly used from haltools. I will be using the cactus singularity based implementation of the hal tools.

# We will be using this hal file for the course of this workshop

/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/duck_align.hal

# And you can find a version of cactus already compiled here as well

/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/cactus_v2.8.1-gpu.sif

A word on singularities:

  • A singularity is just a self contained container (like a mamba environment in a sense) that allows you to run programs and do stuff without the programs within it from having access to your entire work-space (as a security measure).
  • You can open a singularity and run programs within it using singularity shell <container>. This will open a command-line within the singularity that you can run stuff on.
  • If you already know what to run, then you can directly execute the program using singularity exec <container> <program> <program parameters>.
  • You may not have access to all your files inside the singularity so you may have to set singularity exec --bind <folders> <container> to give it access to folders (it is recursive). It works in the same way as setting $PATH on the command line (so multiple folders can be bound with :)
  • GPU based singularities might need --nv (for Nvidia) to execute gpu based commands.

Let’s Hal

  1. Open the singularity for cactus (preferably you have copied it to your own work-space).
    singularity shell cactus_v2.8.1-gpu.sif
  2. Let’s see what all hal tools commands we have available. As in the basic shell, you can auto-complete by hitting tab, so let’s type in hal and hit tab
  3. Let’s look at the help for one of these tools.
    halStats

Next, we’ll look at some of the common hal tools that you may end up using. We will use a dummy data for practice.

What’s your favorite gene

# Get the coordinates for your favorite gene from the chicken gff file available here:

/n/holylfs05/LABS/informatics/Everyone/subir_interns/genome_workshop/galgal7b.gff

# I will use the gene TYR for Tyrosinase
grep "gene=TYR;" galgal7b.gff

# We will want to convert this to a bed format so let's use awk to get only the relevant information
# In brief, we want to get the exons, so the 3rd column should match exon, then we need to get the gene name 
# so it appears in the output, we need to subtract 1 from the start index because bed is 0 indexed and 
# vcf is 1 indexed, and finally lets add a count number to the end of the exons
grep "gene=TYR;" galgal7b.gff | 
awk 'BEGIN{ FS="\t"; OFS=FS }
$3 == "exon" { 
    sub(/.*gene=/, "", $9); 
    sub(/;.*/, "", $9); 
    $4 = $4 - 1; 
    count++;
    print $1, $4, $5, $9 "_exon_" count; 
}' > tyr_galgal.bed

4.4.1 halStats

halStats can be used to get the basic statistics of the hal file.

For example, if we wanted to see what genomes were present in an alignment, we could (note we are using exec now not shell):

singularity exec cactus_v2.8.1-gpu.sif halStats --genomes duck_align.hal

We can get the scaffold names for a particular genome using:

singularity exec cactus_v2.8.1-gpu.sif halStats --sequences Gallus_gallus duck_align.hal

We can also view the tree structure used in for the alignment using:

singularity exec cactus_v2.8.1-gpu.sif halStats --tree duck_align.hal

4.4.2 Modifying HAL file

A few of the commands halAddToBranch, halAppendSubtree, halAppendCactusSubtree, halRemoveGenome, halRemoveSubtree and halReplaceGenome can be used to add, remove, and modify the genomes in the halfile. These actions will also do realignments across all higher nodes for whatever is added or removed. See the respective help for each of those commands.

One command that could be very useful is halRenameSequence. Suppose the genome you used in the alignment was from Genbank and your gff file had coordinates in the RefSeq format. While there is a one to one mapping of each of these scaffold names, they differ between genbank and refseq. You could use a tab file with genbank scaffold names and refseq scaffold names to update the scaffold names in the alignment with ease.

4.4.3 halLiftover

Another important command is halLiftover. This is one reason, in my opinion, why a cactus alignment is so much better than say a pairwise lastz alignment. You can easily convert coordinates from one species to any other species across all aligned taxa.

Let’s look at the coordinates of the gene you chose in a different species, you can choose any one among the ones that showed up before (including what the predicted ancestral sequences were).

singularity exec cactus_v2.8.1-gpu.sif halLiftover duck_align.hal Gallus_gallus tyr_galgal.bed Anseranas_semipalmata tyr_anssem.bed

You will often find your lifted-over coordinates are broken up. This is because of indels and other structural artifacts introduce breaks. But we can use bedtools to clean up such breaks.

# In many cases the first step is to always sort the bed file. I will use -d of 20 to merge the file, ie any gaps less than 20 bp will be merged into one. I will also make sure that merge keeps the fourth column using -c and -o 
bedtools sort -i tyr_anssem.bed | \
bedtools merge -d 20 -c 4 -o distinct -i "stdin" > \
tyr_anssem_merged.bed

Also, if your bed file contains extra columns, just use --bedType 3 to get halLiftover to pass all the other columns as it is.

4.4.4 hal2maf/cactus-hal2maf

The native hal2maf or the cactus cactus-hal2maf can be used to convert coordinates to maf format (multiple alignment file) and then subsequently to fasta. The cactus-hal2maf is under active development so might work better. They have slightly different syntax. One of the main advantages of the latter is being able to specify how to handle duplicates (aka regions of the genome aligned to multiple regions) but the former is more easy to use and is quite straightforward.

A word about maf files, maf files are linearly mapped and scaffold information is staggered in the sequence. You can pull sequences based on linear coordinates but not based on scaffold coordinates. So like if you had two targets one in scaffold 1 that was 100 bp long and one in scaffold 2 that was 200 bp long, it would be mapped as 0-100 and 100-300 and would need to know the linear coordinates to pull out specific regions. It is advisable to split maf files by scaffolds (aka use bed files with coordinates of only one scaffold) in future operations if you intend to pull out specific regions.

# Using just hal2maf is easy, we'll also ignore paralogs using --noDupes, and not print ancestral sequences
singularity exec cactus_v2.8.1-gpu.sif hal2maf \
--noAncestors \
--noDupes \
--refGenome Gallus_gallus \
--refTargets tyr_galgal.bed \
duck_align.hal tyr_duck.maf

# Using cactus-hal2maf you need to give it more details.
# you can submit it as a separate job using --batchSystem
# --dupeMode specifies whether to get orthologs or make a consensus sequence
# ./jobstore/ is just a place to store temporary files, make sure it does not already exist
# you can parallelize jobs, set memory, and other things within cactus-hal2maf
singularity exec cactus_v2.8.1-gpu.sif cactus-hal2maf \
--batchSystem single_machine \
--refGenome Gallus_gallus \
--bedRanges tyr_galgal.bed  \
--noAncestors \
--dupeMode single \
--chunkSize 1000 \
./jobstore/ duck_align.hal tyr_duck_cac.maf

4.5 Maf to Fasta

There are some maf to fasta converters out there, and one could do it through scripting in Python or other programming languages (see BioAlign’s maf module). Here, we will use a small Python script that I wrote to convert the maf to a fasta format. This script works as long as your maf file does not contain paralogs/duplicates, and ideally consists of a single scaffold (remember mafs are not indexed by scaffold).

To get this script run
git clone https://github.com/subirshakya/phast_scripts.git

Then, you can use the maf2fasta.py script:

python phast_scripts/maf2fasta.py --maf tyr_duck.maf \
                                  --bed tyr_galgal.bed \
                                  --out_folder ./ \
                                  --ref_species Gallus_gallus

5 Conserved/Neutral Rate

5.1 Neutral Rate

To run PhyloAcc models, we need to the estimate neutral rate of substitution. For this we will use 4-fold degenerate sites (3rd bp of codons where all four bp changes do not change the amino acid) as a proxy of a site evolving neutrally. To extract these sites we need an annotated genome, since we need to know where genes are to get the 4-fold degenerate sites. We will use the gff file use used previously to generate a neutral rate for a subset of the data. Normally, one would want to run this per chromosome and then combine the rates across multiple chromosomes using PhyloBoot (part of the Phast package). Here, we will only work on a small subsample of genes, so that we can get done quickly. There are two ways we can get the coordinates of 4-fold degenerate sites, one using hal4dextract, which I have found can take a long time to run sometimes. The other is using degenotate.

The input hal4dextract is a 12-column bed file of exonic sites. This we can get from the gff file using two programs gff3ToGenePred and genePredToBed.

# first convert to genepred
gff3ToGenePred -useName galgal7b.gff galgal7b_cds.gp

# then convert to bed
genePredToBed galgal7b_cds.gp galgal7b_cds.bed

# for this exercise let's down-sample this bed file to include only a few genes, all genes in the first 10 Mb of chr 1
# you can choose your own set of genes, would be interesting to compare results between different parts of the genome
# just replace NC_052532.1 with any scaffold between NC_052532.1 and NC_052572.1. 
awk 'BEGIN{ FS="\t"; OFS=FS }
$1 == "NC_052532.1" && $2 > 0 && $2 < 10000000' galgal7b_cds.bed > galgal7b_cds_subset.bed

# now let's use hal4dextract
singularity exec cactus_v2.8.1-gpu.sif hal4dExtract \
--conserved \
duck_align.hal Gallus_gallus galgal7b_cds_subset.bed galgal7b_subset_4d.bed

# next we can use hal2maf to get the alignments of all 4d sites
singularity exec cactus_v2.8.1-gpu.sif hal2maf \
--noAncestors \
--noDupes \
--refGenome Gallus_gallus \
--refTargets galgal7b_subset_4d.bed \
duck_align.hal galgal7b_subset_4d.maf

For degenotate see example 6 in the documentation and then use the same hal2maf to get the alignments of the 4 fold degenerate sites.

Next, we’ll use phyloFit to get the neutral rate and a neutral rate tree.

# let's get a background tree first, we'll just borrow the tree already in the hal file for this tutorial.
# I've also stripped the ancestral node names and the branch lengths that already existed to give a fresh start
singularity exec cactus_v2.8.1-gpu.sif halStats --tree duck_align.hal | \
sed 's/Anc[0-9]*//g' - | sed 's/:0\.[0-9]*//g' - > duck_tree.tre

phyloFit --tree duck_tree.tre \
--init-random \
--subst-mod SSREV \
--sym-freqs \
--log galgal7b_subset_4d.log \
--msa-format MAF \
--out-root galgal7b_subset_4d_neutral \
galgal7b_subset_4d.maf

Here is an example of a neutral rate matrix (bases: A, C, G, T).

\[Q_{duck} = \begin{bmatrix} -1.042445 & 0.159358 & 0.770558 & 0.112529 \\ 0.122472 & -0.967380 & 0.252710 & 0.592198 \\ 0.592198 & 0.252710 & -0.967380 & 0.122472 \\ 0.112529 & 0.770558 & 0.159358 & -1.042445 \\ \end{bmatrix}\]

And the tree associated with it.

5.2 Conserved rates/elements

To generate a list of conserved elements, we can take the steps above further. This can be a time intensive process so we will not actually run any code for this section. I’ll show you how it can be done.

First, a practical consideration. If you are using looking for acceleration using PhyloAcc and you know what the target group is, it is advisable you do not include those species in the alignments used to generate conserved rates, as you don’t want to miss elements where accelerations have happened. This is easy enough to do, just use the --targetGenomes flag in hal2maf. Second, the ideal way to do this is to run phyloFit for each scaffold/chromosome separately and then use phyloBoot to merge the rate files for each scaffold.

OK, so to get a list of conserved elements, we need to estimate \(\rho\) in phastCons. First we need to get entire maf alignments, ideally one for each scaffold; similar to how we did it above, but without the --refTargets flag.

Next, we run PhastCons, which is part of the Phast suite of programs. See the help there on some tricks and tips to run PhastCons

# First we calculate the conserved rate;
# Here you need to set expected length of elements and target-coverage or let the program estimate it
# See the help file for definitions of these parameters
# We are estimating rho so we set that and include a starting value (not necessary)
# Finally, we may want to correct for GC bias, since the scaffold might have a different GC bias than the genome as a whole, I often use the GC value of the genome/group of chromosomes.
# You also need a mod file with neutral rates (output of neutral rate model)
phastCons \
--expected-length 45 \
--target-coverage 0.3 
--rho 0.4 \
--gc 0.41 \
--estimate-rho \
--out-root birds_cons \
--no-post-probs \
--msa-format MAF \
some_scaffold.maf galgal7b_subset_4d_neutral.mod

# Then you use the conserved and neutral rates to get a list of conserved elements. 
# This will also output a wig file with the conservation scores per element
phastCons \
--expected-length 45 \
--target-coverage 0.3 \
--most-conserved conserved_elements.bed \
--msa-format MAF 
some_scaffold.maf birds_cons.mod,galgal7b_subset_4d_neutral.mod > \
conservation_scaffold.wig

6 PhyloAcc

Checkpoint: you should have fasta alignments for exons of a random gene and a mod file with neutral rates

6.1 PhyloAcc-GT/ST

These two models are used to associate acceleration of substitution rates to a binary trait. The GT (Gene Tree) model is an extension of the ST (Species Tree) that allows for different relationships between taxa for different loci (gene tree discordance) (Yan et al. (2023)). Thanks to Gregg and Tim in Bioinformatics, there is a simple snakemake based framework to run PhyloAcc (see here).

To run PhyloAcc-GT/ST first, we need to do some housekeeping,

  1. It is possible to input a folder with fasta files as input directly to PhyloAcc. Alternatively, you can provide a concatenated fasta sequence. The python code concatinaTOR.py in the phast_scripts directory, which you used to convert a maf to a fasta, can concatenate files for you. One advantage of doing concatenation is that you can use the program to omit alignments with large amount of missing taxa, along with selectively omitting specific loci using the -i option in PhyloAcc.
  2. The tree needed in PhyloAcc-GT/ST that is in the mod file needs to have the ancestral nodes labelled. You can do that by running tree_doctor, which should already be available as a part of the phast programs: tree_doctor --name-ancestors galgal7b_subset_4d_neutral.mod > galgal7b_subset_4d_neutral_lab.mod
  3. The default for PhyloAcc is the Species Tree mode, but can be switched to the gene tree mode by setting -r gt or -r adaptive. The difference here is that gt will make gene trees for all alignments, which can be time and computation intensive, while adaptive uses concordance factors to bin loci to use with subset of tree topologies.
  4. The default model built into PhyloAcc is for conserved elements in birds and mammals. While this may still be appropriate for other vertebrates, for other groups, you may need to adjust the default assumptions for conservation. You can use --options to get a full list of options and default parameters.
  5. By default, PhyloAcc now does not assume dollo parsimony, i.e. when a character state changes to accelerated it can change back to neutral or conserved. This does, however, increase computational times. You can turn on dollo parsimony by setting --dollo.
# Here is an example to run phyloacc for the 5 exons to TYR we had which were in the folder alignments
# I chose some random target species using -t where I am assuming there will be acceleration
# I parallelized the job by running 1 alignment per node to have 5 jobs total each with 100g of memory
phyloacc.py \
-d alignments/ \
-m galgal7b_subset_4d_neutral_lab.mod \
-o ./phyloacc_output \
-t "Oxyura_jamaicensis;Somateria_mollissima;Clangula_hyemalis;Melanitta_fusca;Aythya_fuligula;Tachyeres_brachypterus" \
-p 1 \
-j 5 \
-batch 1 \
-part "shared" \
-mem 100 \
-time 4

# Once it finishes running you can use phyloacc_post to gather the output
phyloacc_post.py -i phyloacc_output/

Let’s look at what kind of output PhyloAcc produces:

  • less phyloacc_output/results/elem_lik.txt should show a summary for each element
  • less phyloacc_output/results/M2_elem_Z.txt should show the state of acceleration/conservation/neutral for each node and tip
  • less phyloacc_output/results/rate_postZ_M2.txt should show the posterior probability for each state acceleration/conservation/neutralness for each node and tip

The folder phyloacc_output/plot/ will also contains some plots about alignments and results

6.2 PhyloAcc-C

The continuous model of PhyloAcc works within an R framework. For this, we still need the same files as before, but also need a continuous trait in a text file, unlike the target tips as before.

I have an R executable called run_phyloaccc.r where I have taken Dr. Patrick Gemmell’s scripts to seamlessly run PhyloAcc-C. The Rscript takes the following parameters:

  • mod_file: the mod file with neutral rate and tree
  • output: the name for the output file with summary of PhyloAcc-C runs
  • output_folder: the output folder to save the file in
  • fasta: a folder containing fasta files
  • fasta_list: a list with one element per line for the fasta files to run (can use it to run a job on only a subset of fasta; ideal for parallelization)
  • max-missing: this can be used to omit alignments that are missing this fraction of species
  • traits: a file with two columns, species name and trait value
  • batch_size: how many elements to process per computation node
  • cores: number of cores provided to parallelize
  • runs: number of independent runs per element (to test for convergence); default is 3
  • iter: number of mcmc iterations
  • burnin: burn-in value
  • max-missing: this can be used to check for

Let’s create a random trait for the species we are working with. We can do this by pulling the species names from one of the fasta files, and then assigning a random number to the entry.

awk '/^>/{header=substr($0, 2); print header "\t" (rand()*2-1)}' alignments/TYR_exon_1.fa > duck_trait.txt

OK, that’s all we need to run PhyloAcc-C. One note, as before the prior probabilities for conservation has been adjusted to match that for birds. You may have to tweak your parameters if you are running on other species. This would have to be done by manually changing the priors in the R script.

# This will run phyloacc-c model for each element on each thread (parallelized)
Rscript run_phyloaccc.r \
--mod_file galgal7b_subset_4d_neutral_lab.mod \
--output tyr_phyloaccc \
--fasta alignments/ \
--traits duck_trait.txt \
--batch_size 5 \
--cores 5

This is generally enough to find elements that are evolving together with traits. In the output file, you can read in the table output and calculate BF:

library(ggplot2)
output_all <- read.table("phyloaccc_out.txt", header = T, sep = "\t")

# subset for elements with Gelman & Rubin convergence statistics < 1.01
output_all <- output_all[output_all$rhat <= 1.01,]

# BF
true_null = 1/(2*pi)
output_all$bf = true_null/output_all$dens0
output_all$color = "darkblue"
output_all$color[which(output_all$bf > 5)] = "orange"

ggplot(output_all)+
  geom_point(aes(x=rat.50, y=log10(bf)),  col = output_all$color)+
  theme_classic()+
  xlab("Rate")+
  ylab("Bayes Factor")+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_hline(yintercept=log10(5), linetype="dashed", color = "darkorange", linewidth=0.75)+
  annotate('text', x=0, y=log10(5)+0.1, label = "BF > 5", color ="darkorange")

For elements of note, we may want to visualize trees and get some parameter estimates. For this we can run PhyloAcc-C individually for each candidate loci. The Rscript run_phyloaccc_indv.r can be used to run PhyloAcc-C for each element and then print the outputs and trees to a pdf file. I often run this for longer, using 3X iterations that i did previously to get better mixing of parameters. Let’s do this for one of these exons.

# The individual phyloacc script has fewer options than the batch one
# This code will run for exon5 of TYR
Rscript run_phyloaccc_indv.r \
--mod_file galgal7b_subset_4d_neutral_lab.mod \
--fasta alignments/TYR_exon_5.fa \
--traits duck_trait.txt \
--iter 30000 \
--burnin 3000

Here is the output tree:

There are also other trace plots and statistics such as accelerated rate and conserved rate that are printed in the file.

THE END

References

Armstrong, Joel, Glenn Hickey, Mark Diekhans, Ian T Fiddes, Adam M Novak, Alden Deran, Qi Fang, et al. 2020. “Progressive Cactus Is a Multiple-Genome Aligner for the Thousand-Genome Era.” Nature 587 (7833): 246–51.
Gemmell, Patrick, Timothy B Sackton, Scott V Edwards, and Jun S Liu. 2024. “A Phylogenetic Method Linking Nucleotide Substitution Rates to Rates of Continuous Trait Evolution.” PLOS Computational Biology 20 (4): e1011995.
Hickey, Glenn, Benedict Paten, Dent Earl, Daniel Zerbino, and David Haussler. 2013. “HAL: A Hierarchical Format for Storing and Analyzing Multiple Genome Alignments.” Bioinformatics 29 (10): 1341–42.
Hu, Zhirui, Timothy B Sackton, Scott V Edwards, and Jun S Liu. 2019. “Bayesian Detection of Convergent Rate Changes of Conserved Noncoding Elements on Phylogenetic Trees.” Molecular Biology and Evolution 36 (5): 1086–1100.
Sackton, Timothy B, Phil Grayson, Alison Cloutier, Zhirui Hu, Jun S Liu, Nicole E Wheeler, Paul P Gardner, et al. 2019. “Convergent Regulatory Evolution and Loss of Flight in Paleognathous Birds.” Science 364 (6435): 74–78.
Yan, Han, Zhirui Hu, Gregg WC Thomas, Scott V Edwards, Timothy B Sackton, and Jun S Liu. 2023. “PhyloAcc-GT: A Bayesian Method for Inferring Patterns of Substitution Rate Shifts on Targeted Lineages Accounting for Gene Tree Discordance.” Molecular Biology and Evolution 40 (9): msad195.