Following includes codes used to run bayou to find shifts in tarsus length variation in birds. It uses bash and R codes interchangeably along with chunks that should be submitted as jobs using a resource manager in a supercomputing cluster (here we use slurm and snakemake). Please make sure you are running the codes in the correct interface.
Bayou is used to fit Bayesian reversible-jump multi-optima OU models to phylogenetic comparative data (Uyeda and Harmon 2014). See https://github.com/uyedaj/bayou for more information about the bayou code including tutorials. We wanted to get regions of the avian phylognetic tree where there were shifts in tarsus lenght. For this we followed the framework presented in Rombaut et al. (2022). For the phylogenetic tree and tarsus length data, we downloaded the data from Tobias et al. (2022) at this link. Tarsus length measurements are in the file AVONET1_BirdLife.csv in the folder ELEData/TraitData/. The phylogenetic tree generated from Jetz et al. (2012) can be found in the folder ELEData/PhylogeneticData/. The filename is HackettStage1_0001_1000_MCCTreeTargetHeights.nex.
First we need to get the tarsus length measurements formatted along
with the body weight for each species to correct tarsus length for body
mass. Similarly, we will need to prune the tree and data to include the
same taxa.
The column Group is not present in the original Avonet table. Since running bayou on the entire tree is impractical computationally, we adopted the approach presented in Rombaut et al. (2022) by splitting the dataset into 18 groups. The groups are as follows:
| Groups | Number_of_taxa |
|---|---|
| 1_Palaeognathae | 26 |
| 2_Galloanserae | 287 |
| 3_Mirandornithes | 19 |
| 4_Columbimorphae | 274 |
| 5_Strisores | 301 |
| 6_Gruiformes | 45 |
| 7_Aequornithes | 447 |
| 8_Accipitriformes | 256 |
| 9_Coraciiformes | 335 |
| 10_Falconiformes | 45 |
| 11_Psittaciformes | 222 |
| 12_Suboscines | 781 |
| 13_Meliphagoidea | 171 |
| 14_Corvoidea | 411 |
| 15_Sylvoidea | 532 |
| 16_Muscicapoidea | 451 |
| 17_Nectaroidea | 54 |
| 18_Passeroidea | 748 |
library(ggplot2)
library(ape)
library(bayou)
all_data <- read.csv("AVONET1_BirdLife.csv", header = T)
all_tree <- read.nexus("HackettStage1_0001_1000_MCCTreeTargetHeights.nex")
#change space between species names to underscores
all_data$species <- gsub(" ", "_", all_data$Species1)
#remove species which are not present in the tree
all_data_in_tree <- subset(all_data, species %in% all_tree$tip.label)
#keep only required columns
all_data_in_tree <- all_data_in_tree[,c("Group", "Family1", "species", "Mass", "Tarsus.Length"),]
#log transform the mass and tarsus lengths
all_data_in_tree$ln_tarsus <- log(all_data_in_tree$Tarsus.Length)
all_data_in_tree$ln_mass <- log(all_data_in_tree$Mass)
#get groups
all_groups <- unique(all_data$Group)
The R script accepts the values 1 through 18 for the respective group and can therefore subset the data for that group and format it for bayou. This allows us to run the program in parallel. We also ran two instance of each run. The number of generations depended on whether or not the chains converged.
use_group <- 7 #this would run for group number 7, 7_Aequornithes
#gather group subset
group <- grep(paste("^",as.numeric(use_group),"_",sep=""), all_groups, value = T)
group_init <- substr(group, 1, 8)
#iteration number and generations
iter <- 1
runs <- 20000000 #20 million
output_folder <- "./"
#prepare data for bayou input
all_data.1 <- subset(all_data_in_tree, Group %in% group)
all_tree.1 <- keep.tip(all_tree, all_data.1$species)
all_data.1$species <- factor(all_data.1$species, levels = all_tree.1$tip.label)
all_data.1 <- all_data.1[order(all_data.1$species),]
#note: order of tips and data should be same
dat <- all_data.1$ln_tarsus
names(dat) <- all_data.1$species
pred <- data.frame(lnMass = all_data.1$ln_mass)
rownames(pred) <- all_data.1$species
You need to set up priors to run bayou. We used similar parameters as mentioned in Rombaut et al. (2022). For tuning parameters we ran short runs (50,000 generations) of different values of tuning parameters until tuning acceptance ratio (which you will see in the output to screen as .alpha for alpha and so on). The bayou manual says good acceptance is 0.2-0.4 so if acceptance ratio is higher than 0.4 increase value of parameter and if acceptance ratio is lower than 0.2 decrease value of parameter.
#priors for bayou
priors <- make.prior(all_tree.1, plot.prior = T,
dists=list(dalpha="dhalfcauchy", dsig2="dhalfcauchy", dbeta_lnMass="dnorm",
dsb="dsb", dk="cdpois", dtheta="dnorm", dloc="dunif"),
param=list(dalpha=list(scale=0.1), dsig2=list(scale=0.1),
dbeta_lnMass=list(mean=2, sd=0.5),
dk=list(lambda=if(nrow(all_data.1)*0.05 < 1){1}
else{floor(nrow(all_data.1)*0.05)},
kmax=floor(nrow(all_data.1)*0.5)),
dsb=list(bmax=1,prob=1), dtheta=list(mean=0.3, sd=0.5)))
#list of tuning parameters for the 18 runs
DN1 <- list(list(alpha=4, sig2=3.6, beta_lnMass=0.1, k=c(1,1), theta=1, slide=1),
list(alpha=7, sig2=1.3, beta_lnMass=0.2, k=c(1,1), theta=2.5, slide=1),
list(alpha=6, sig2=2, beta_lnMass=0.2, k=c(1,1), theta=1, slide=1),
list(alpha=9, sig2=1.2, beta_lnMass=0.2, k=c(1,1), theta=2, slide=1),
list(alpha=7, sig2=1.3, beta_lnMass=0.4, k=c(1,1), theta=2.5, slide=1),
list(alpha=6, sig2=2.3, beta_lnMass=0.2, k=c(1,1), theta=1, slide=1),
list(alpha=8, sig2=1, beta_lnMass=0.3, k=c(1,1), theta=3, slide=1),
list(alpha=8, sig2=1.2, beta_lnMass=0.2, k=c(1,1), theta=2, slide=1),
list(alpha=9, sig2=1.2, beta_lnMass=0.2, k=c(1,1), theta=2.5, slide=1),
list(alpha=6, sig2=2.6, beta_lnMass=0.3, k=c(1,1), theta=2, slide=1),
list(alpha=7, sig2=1.8, beta_lnMass=0.2, k=c(1,1), theta=1.5, slide=1),
list(alpha=5, sig2=1, beta_lnMass=0.5, k=c(1,1), theta=3.5, slide=1),
list(alpha=7, sig2=1.8, beta_lnMass=0.1, k=c(1,1), theta=0.5, slide=1),
list(alpha=7, sig2=1.2, beta_lnMass=0.3, k=c(1,1), theta=2.5, slide=1),
list(alpha=7, sig2=1, beta_lnMass=0.3, k=c(1,1), theta=3, slide=1),
list(alpha=6, sig2=1, beta_lnMass=0.3, k=c(1,1), theta=3, slide=1),
list(alpha=7, sig2=3, beta_lnMass=0.2, k=c(1,1), theta=0.5, slide=1),
list(alpha=6, sig2=1, beta_lnMass=0.3, k=c(1,1), theta=3, slide=1))
model.N1 <- makeBayouModel(dat ~ lnMass, rjpars = c("theta", "lnMass"),
tree=all_tree.1, dat=dat, pred=pred, prior=priors, D=DN1[[use_group]])
mcmc.N1 <- bayou.makeMCMC(all_tree.1, dat, pred=pred, model=model.N1$model, prior=priors, startpar=model.N1$startpar,
new.dir=output_folder, outname=paste(group_init, iter, sep="_"), plot.freq=NULL, SE = 0.1)
#This will run 20 million generations for group 7 iteration 1
mcmc.N1$run(runs)
It is also cogent to run a few plots and graphs at this stage as well as saving an image of the data in case you need to restart run.
chain.N1 <- set.burnin(mcmc.N1$load(), 0.5)
save.image(paste(output_folder, group_init, ".RData", sep=""))
#plot(chain.N1, auto.layout=FALSE)
par(mfrow=c(1,1))
plotSimmap.mcmc(chain.N1, burnin = 0.3, pp.cutoff = 0.3)
shiftsumsN1 <- shiftSummaries(chain.N1, mcmc.N1, pp.cutoff=0.3)
save.image(paste(output_folder, group_init, ".RData", sep=""))
plotShiftSummaries(shiftsumsN1, lwd=2, single.plot=TRUE, label.pts=FALSE)
dev.off()
Once you have two runs for each group, you can check for convergence using the Gelman R statistic for convergence. We used a value less than 1.1 for considering the parameters of two runs converged.
all_data <- read.csv("AVONET1_BirdLife.csv", header = T)
output_folder <- "./"
pdf(paste(output_folder, "convergence", ".pdf", sep=""))
for (x in seq(1,2)){
output_folder <- "./"
all_groups <- unique(all_data$Group)
group <- grep(paste("^",x,"_",sep=""), all_groups, value = T)
group_init <- substr(group, 1, 8)
load(paste(output_folder, group_init,"_2.RData", sep=""))
output_folder <- "./"
chain2 <- chain.N1
load(paste(output_folder, group_init,"_1.RData", sep=""))
print(group)
summary(chain.N1)
summary(chain2)
start_gap <- length(chain2$gen)/2
post_gap <- length(chain2$gen)*3/4*10
gelman <- gelman.R("lnL", chain2, chain.N1, freq=10000, start = start_gap)
abline(h = mean(gelman$R), col = "red", lwd = 4, lty = 4)
text(x=post_gap, y=mean(gelman$R), labels = round(mean(gelman$R),5), pos=3, col="red")
gelman <- gelman.R("prior", chain2, chain.N1, freq=10000, start = start_gap)
abline(h = mean(gelman$R), col = "red", lwd = 4, lty = 4)
text(x=post_gap, y=mean(gelman$R), labels = round(mean(gelman$R),5), pos=3, col="red")
gelman <- gelman.R("alpha", chain2, chain.N1, freq=10000, start = start_gap)
abline(h = mean(gelman$R), col = "red", lwd = 4, lty = 4)
text(x=post_gap, y=mean(gelman$R), labels = round(mean(gelman$R),5), pos=3, col="red")
gelman <- gelman.R("sig2", chain2, chain.N1, freq=10000, start = start_gap)
abline(h = mean(gelman$R), col = "red", lwd = 4, lty = 4)
text(x=post_gap, y=mean(gelman$R), labels = round(mean(gelman$R),5), pos=3, col="red")
gelman <- gelman.R("k", chain2, chain.N1, freq=10000, start = start_gap)
abline(h = mean(gelman$R), col = "red", lwd = 4, lty = 4)
text(x=post_gap, y=mean(gelman$R), labels = round(mean(gelman$R),5), pos=3, col="red")
gelman <- gelman.R("ntheta", chain2, chain.N1, freq=10000, start = start_gap)
abline(h = mean(gelman$R), col = "red", lwd = 4, lty = 4)
text(x=post_gap, y=mean(gelman$R), labels = round(mean(gelman$R),5), pos=3, col="red")
gc()
}
dev.off()
If runs have not converged sufficiently, you can run the particular run for longer by rerunning bayou., bayou will pick up where it stopped. We ran to a maximum of 50 million generations for some groups.
We recovered shifts in 10 families of birds. The groups are
distributed as follows.