The following documents the code used to generate alignments of coding sequences and to run selection analyses.
This markdown file documents the selection tests used to identify genes under selection in short-tarsi taxa. We used two approaches: first, we used TOGA to generate coding sequence alignments for the 78 species in our study and then ran aBSREL to look for evidence of positive selection; second, we used population resequencing data to perform an MK test.
We used the HAL file generated by Cactus: see the Cactus doc. For the reference species we used the galgal7b RefSeq GFF file: here
We used TOGA (see Kirilenko et al. (2023)) to generate orthologous sequences for the 78 species. To prepare the data for TOGA, we followed the steps outlined here.
#bash
#Generate file containing chromosome lengths for galgal7 assembly
python GenomeAnnotation-TOGA/utilities/WriteChromLengthBedFromFasta.py galgal7 Gallus_gallus_GCF016699485.fna
#Gather only CDS
gffread galgal7b.gff -g Gallus_gallus_GCF016699485.fna -F -C -o galgal7b_cds.gff
#Get isoform information file
awk -F"\t" '$3=="mRNA"{print $9}' galgal7b_cds.gff | \
awk -F";" '{print $3"\t"$1}' | \
sed 's/gene_name=//g' | \
sed 's/ID=//g' > galgal7b_isoform.tsv
#convert gff to GenePred
gff3ToGenePred galgal7b.gff galgal7b.GenePred
#convert GenePred to bed
genePredToBed galgal7b.GenePred galgal7b.bed
#Keep only CDS in bed
python GenomeAnnotation-TOGA/utilities/FilterAnnotationBedForCDS.py galgal7b_isoform.tsv galgal7b.bed
#Final file is CDS_galgal7b.bed
#bash
#Create array of all species
files=( $(ls /n/holylfs05/LABS/informatics/Users/sshakya/genomes_cactus/snakemake_out/*.fna) )
#Get scaffold length files for each species
#Convert fasta to 2bit
for i in ${files[@]};
do
fbname=$(basename "$i" .fna)
python GenomeAnnotation-TOGA/utilities/WriteChromLengthBedFromFasta.py chrom_length_bed/${fbname} $i
faToTwoBit $i 2bit/${fbname}.2bit
done
#Manually removed the two Gallus gallus genomes
#Use halLiftover to convert hal to psl
#Convert psl to positive strand
#use axtChain to create chain files from PSL
files=( $(ls chrom_length_bed/*.bed) )
for i in ${files[@]};
do
fbname=$(basename $i _chromlengths.bed)
species=$(echo $fbname | awk -F_ '{print $1"_"$2}')
halLiftover --outPSL ../all_birds_new.hal $species $i Gallus_gallus7 psl/${species}_galgal7.psl
pslPosTarget psl/${species}_galgal7.psl pos_psl/${species}_galgal7.psl
axtChain -psl -verbose=0 -linearGap=loose pos_psl/${species}_galgal7.psl 2bit/${fbname}.2bit \
Gallus_gallus_GCF016699485.2bit chain/${species}_galgal7.chain
done
TOGA was run with 500 parallel processes per species.
#bash
files=( $(ls chrom_length_bed/*.bed) )
for i in ${files[@]};
do
fbname=$(basename $i _chromlengths.bed)
species=$(echo $fbname | awk -F_ '{print $1"_"$2}')
~/TOGA/toga.py chain/${species}_galgal7.chain \
galgal7b_new.bed \
Gallus_gallus_GCF016699485.2bit \
2bit/$fbname.2bit \
--kt \
--pn toga_run/$fbname \
-i galgal7b_isoform.tsv \
--nc ~/TOGA/nextflow_config_files/ \
--cb 10,100 \
--cjn 500
done
First, we gathered all output files. TOGA failed for two species, Eudyptes filholi and Eudyptes moseleyi, which were subsequently omitted from the selection analysis.
#bash
#The main toga output is in the orthology_classification.tsv file
files=( $(ls chrom_length_bed/*.bed) )
for i in ${files[@]};
do
fbname=$(basename $i _chromlengths.bed)
cp toga_run/${fbname}/orthology_classification.tsv orthology_files/${fbname}.tsv
done
Next we compiled a dataframe identifying only single copy orthologs.

* We included all genes labeled one2one with galgal7b
* We also included all genes labeled one2zero
* For genes labeled one2many we dropped the species with the multiple hits and only kept single copy genes
* We kept only one transcript (largest) per gene
* We dropped any gene with less than 15 species
#R
chicken <- read.table("galgal7b_isoform.tsv",
sep = "\t", header = F, col.names = c("gene", "transcript"))
chicken_bed <- read.table("galgal7b.bed", sep = "\t", header = F)
chicken_bed$length <- unlist(lapply(lapply(strsplit(chicken_bed$V11,split = ","), as.integer), sum))
chicken <- merge(chicken, chicken_bed[,c("V4", "length")], by.x="transcript", by.y="V4", all.x = T, all.y = F)
species <- read.table("species_list.txt", sep = "\t", header = F)
for (specie in species$V1){
print(specie)
try({
transcripts <- read.table(paste("orthology_files/", specie, ".tsv", sep = ""), sep = "\t", header = T)
#keep only one to one
transcripts <- subset(transcripts, orthology_class %in% c("one2one","one2zero", "one2many"))
transcripts <- transcripts[!duplicated(transcripts[,1:2]),]
transcripts$q_gene[transcripts$orthology_class %in% "one2many"] <- "None"
transcripts$q_transcript[transcripts$orthology_class %in% "one2many"] <- "None"
colnames(transcripts)[3:4] <- paste(colnames(transcripts)[3:4], specie, sep = ".")
chicken <- merge(chicken, transcripts[,2:4], by.x = "transcript", by.y="t_transcript", all.x = F, all.y = F)
})
}
chicken <- chicken[order(chicken$gene, -chicken$length), ]
chicken <- chicken[!duplicated(chicken$gene),]
chicken$none_count <- rowSums(chicken == "None")
#We filtered genes with less than 15 species represented
chicken2 <- subset(chicken, none_count < 30)
chicken2 <- chicken2[!duplicated(chicken2$gene),]
write.table(chicken2[,c(1:153)], "one2one_orthos.txt", sep = "\t", row.names = F, col.names = T, quote = F)
We ended up with 10435 genes.
#!/usr/bin/env python
from Bio import SeqIO
import pandas as pd
import os
class orthologues:
def __init__(self):
self.seqs = []
def main():
one2one_df = pd.read_table("one2one_orthos.txt")
species = [one2one_df.columns[x].split(".")[1] for x in range(3,len(one2one_df.columns),2)]
genes = {}
#Create list for each gene as we parse across species
for row in range(len(one2one_df)):
genes[one2one_df.loc[row][1]+"_"+one2one_df.loc[row][0]] = orthologues()
for specie in species:
print(specie)
for record in SeqIO.parse("toga_run/"+specie+"/nucleotide.fasta", "fasta"):
if ("ref" not in record.id):
try:
geneidx = one2one_df[one2one_df["q_transcript."+specie]==record.id].index.values.astype(int)[0]
geneid = one2one_df.loc[geneidx][1]+"_"+one2one_df.loc[geneidx][0]
record.id = specie
genes[geneid].seqs.append(record)
except:
pass
# Finally work on Gallus gallus
print("Gallus_gallus")
for record in SeqIO.parse("toga_run/"+species[0]+"/nucleotide.fasta", "fasta"):
if ("ref" in record.id):
try:
geneidx = one2one_df[one2one_df["transcript"]==record.id[4:]].index.values.astype(int)[0]
geneid = one2one_df.loc[geneidx][1]+"_"+one2one_df.loc[geneidx][0]
record.id = "Gallus_gallus"
genes[geneid].seqs.append(record)
except:
pass
for gene in genes.keys():
SeqIO.write(genes[gene].seqs, "prot_align/"+gene+".fasta", "fasta")
if __name__=="__main__":
main()
We used MACSE v2 (Ranwez et al. 2018) to align genes.
files=( $(ls *.fasta) )
#Submits array jobs each with batches of 100 genes
for ((i=${SLURM_ARRAY_TASK_ID}*100; i<=(${SLURM_ARRAY_TASK_ID}*100)+99; i++)); do
echo ${files[${i}]}
macse -prog alignSequences -seq ${files[${i}]} -out_NT ../prot_align_macse/${files[${i}]} -out_AA ../prot_align_aa/${files[${i}]}
sed -i 's/\(>[A-Za-z]*_[A-Za-z]*\).*/\1/g' ../prot_align_macse/${files[${i}]}
done
We used the filtering code (see here) to remove misaligned and problematic codons from the alignments. We ended up with 9854 genes.
#bash
python aln_filter.py -i prot_align_macse/ -n 10 -o prot_align_macse_filt --overwrite
We used HMMCleaner (Franco et al. 2019) to mask codons that are potentially the result of assembly errors.
#bash
files=( $(ls ../prot_align_macse_filt-spec10-seq20-site50/cds/*.fasta) )
for ((i=${SLURM_ARRAY_TASK_ID}*100; i<=(${SLURM_ARRAY_TASK_ID}*100)+99; i++)); do
echo ${files[${i}]}
fbname=$(basename "$1" .fasta)
HmmCleaner.pl -costs -0.25 -0.20 0.15 0.45 ${files[${i}]}
mv ../prot_align_macse_filt-spec10-seq20-site50/cds/${fbname}_hmm.fasta ./
done
We ran aBSREL (Smith et al. 2015) for each gene alignment using a synonymous rate variation model. A custom R script was used to prune the input tree to match the taxa present in each alignment.
files=( $(ls ../prot_align_hmm/*.fasta) )
for ((i=${SLURM_ARRAY_TASK_ID}*10; i<=(${SLURM_ARRAY_TASK_ID}*10)+9; i++)); do
echo ${files[${i}]}
fbname=$(basename ${files[$i]} .fasta)
Rscript tree_format.Rscript macro_tree_node.txt prot_align_hmm/${fbname}.fasta trees/$fbname
hyphy absrel --alignment prot_align_hmm/${fbname}.fasta --tree trees/${fbname}.txt \
--output output_hmm/${fbname}_SRV.JSON --srv Yes
done
#R script to subset taxa in tree to match fasta file
args <- commandArgs(trailingOnly = TRUE)
library(ape)
tree <- read.tree(args[1])
fasta <- read.FASTA(args[2])
tree2 <- keep.tip(tree, intersect(names(fasta), tree$tip.label))
write.tree(tree2, paste(args[3], ".txt", sep=""))
Parse out aBSREL results and find branches selected among the four focal groups.
#!/usr/bin/env python
import json
import argparse
import pandas as pd
from functools import reduce
def count_targets(species):
bulbul = set(["Pycnonotus_jocosus", "Brachypodius_atriceps", "Anc33"])
penguin = set(["Aptenodytes_patagonicus", "Aptenodytes_forsteri", "Pygoscelis_adeliae",
"Pygoscelis_papua", "Pygoscelis_antarcticus", "Megadyptes_antipodes",
"Eudyptula_novaehollandiae", "Eudyptula_albosignata", "Eudyptula_minor",
"Spheniscus_demersus", "Spheniscus_humboldti", "Spheniscus_magellanicus",
"Eudyptes_pachyrhynchus", "Eudyptes_sclateri", "Eudyptes_chrysolophus",
"Eudyptes_schlegeli", "Eudyptes_chrysocome", "Eudyptes_filholi",
"Eudyptes_moseleyi", "Anc56", "Anc58", "Anc60", "Anc63", "Anc65",
"Anc68", "Anc69", "Anc48", "Anc59", "Anc62", "Anc64", "Anc67",
"Anc57", "Anc61", "Anc66", "Anc55", "Anc54", "Anc53", "Anc52"])
kingfisher = set(["Chloroceryle_aenea", "Actenoides_hombroni", "Halcyon_senegalensis",
"Anc46", "Anc45"])
swallow = set(["Hirundo_rustica", "Progne_subis", "Riparia_riparia",
"Anc34", "Anc35"])
input_list = set(species)
return(len(input_list & bulbul), len(input_list & penguin), len(input_list & kingfisher), len(input_list & swallow))
def absrel_out(args):
genes = pd.read_table(args.gene_list, header=None, names=["gene_name"])
genes[["AIC_SRV", "nsel_SRV", "bulbul", "penguin", "kingfisher", "swallow", "targets", "taxa"]] = ''
    for idx in range(len(genes)):
        jfile = json.load(open(args.input_folder + genes.loc[idx, "gene_name"] + "_SRV" + ".JSON", 'r'))
        genes.loc[idx, "AIC_SRV"] = round(jfile["fits"]["Full adaptive model"]["AIC-c"], 1)
        genes.loc[idx, "nsel_SRV"] = jfile["test results"]["positive test results"]
        if jfile["test results"]["positive test results"] > 0:
            selected = set()
            #branches with a corrected p-value < 0.05 are counted as positively selected
            for k in jfile["branch attributes"]["0"].keys():
                if jfile["branch attributes"]["0"][k]["Corrected P-value"] < 0.05:
                    selected.add(k)
            genes.loc[idx, "taxa"] = reduce(lambda x, y: str(x) + "," + str(y), selected)
            genes.loc[idx, "bulbul"], genes.loc[idx, "penguin"], genes.loc[idx, "kingfisher"], genes.loc[idx, "swallow"] = count_targets(selected)
            genes.loc[idx, "targets"] = genes.loc[idx, "bulbul"] + genes.loc[idx, "penguin"] + genes.loc[idx, "kingfisher"] + genes.loc[idx, "swallow"]
return genes
def main():
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input_folder", type=str, help="Input folder with JSON outputs")
parser.add_argument("-g", "--gene_list", type=str, help="Names of JSON files")
parser.add_argument("-o", "--output", type=str, help="Output file name")
args = parser.parse_args()
output = absrel_out(args)
output.to_csv(args.output+".txt", sep="\t", index=False)
if __name__=="__main__":
main()
To analyze the summarized data from above:
#R
library(stringr)
#Selection results
sel_out <- read.table("output_sele_hmm.txt", header = T, sep = "\t")
sel_out$gene <- str_split_fixed(sel_out$gene_name, "_", 3)[,1]
#Save list of all genes used in selection analysis
#write.table(sel_out$gene, "sel_all.txt", quote = F, sep = "", row.names = F, col.names = F)
write.table(sel_out$gene[sel_out$bulbul > 0], "sel_bul.txt", quote = F, sep = "", row.names = F, col.names = F)
write.table(sel_out$gene[sel_out$kingfisher > 0], "sel_king.txt", quote = F, sep = "", row.names = F, col.names = F)
write.table(sel_out$gene[sel_out$swallow > 0], "sel_swa.txt", quote = F, sep = "", row.names = F, col.names = F)
write.table(sel_out$gene[sel_out$penguin > 0], "sel_pen.txt", quote = F, sep = "", row.names = F, col.names = F)
sel_out <- subset(sel_out, nsel_SRV > 0)
sel_out$non_target <- sel_out$nsel_SRV - sel_out$targets
sel_out$frac_target <- sel_out$targets/sel_out$nsel_SRV
all_taxa <- unlist(strsplit(sel_out$taxa,","))
plot(sort(table(all_taxa)), main="Number of genes showing positive selection", xlab="Species", ylab="Number of genes")
Since Malurus cyaneus had far more positively selected genes than any other species, we removed it from further selection analyses.
sel_out$nsel_SRV_corr <- sel_out$nsel_SRV
#Malurus cyaneus is an outlier
for (x in seq(1, nrow(sel_out))){
if ("Malurus_cyaneus" %in% unlist(strsplit(sel_out[x,"taxa"], ","))){
sel_out[x, "nsel_SRV_corr"] <- sel_out[x, "nsel_SRV_corr"] - 1
}
}
sel_out$frac_target_corr <- sel_out$targets/sel_out$nsel_SRV
sel_out_targets <- subset(sel_out, frac_target_corr > 0.5)
getphyper <- function(overlap, list1, list2, pop){
test_val <- phyper(overlap-1,list1,pop-list1,list2)
return(test_val)
}
PopSize=nrow(sel_out_targets)
listB=length(which(sel_out_targets$bulbul>0))
listP=length(which(sel_out_targets$penguin>0))
listS=length(which(sel_out_targets$swallow>0))
listK=length(which(sel_out_targets$kingfisher>0))
#calculate whether any of these pairwise comparisons are significant using a hypergeometric test
pvals = c()
#BP
pvals = c(pvals, getphyper(length(which(sel_out_targets$bulbul > 0 & sel_out_targets$penguin > 0)), listB, listP, PopSize))
#BS
pvals = c(pvals, getphyper(length(which(sel_out_targets$bulbul > 0 & sel_out_targets$swallow > 0)), listB, listS, PopSize))
#BK
pvals = c(pvals, getphyper(length(which(sel_out_targets$bulbul > 0 & sel_out_targets$kingfisher > 0)), listB, listK, PopSize))
#SK
pvals = c(pvals, getphyper(length(which(sel_out_targets$swallow > 0 & sel_out_targets$kingfisher > 0)), listS, listK, PopSize))
#PK
pvals = c(pvals, getphyper(length(which(sel_out_targets$penguin > 0 & sel_out_targets$kingfisher > 0)), listP, listK, PopSize))
#PS
pvals = c(pvals, getphyper(length(which(sel_out_targets$penguin > 0 & sel_out_targets$swallow > 0)), listP, listS, PopSize))
p.adjust(pvals, method = "bonferroni")
gene_taxa <- read.table("gene_taxa.txt", sep = "\t", header = T)
gene_taxa_cat1 <- read.table("gene_taxa_cat1.txt", sep = "\t", header = T)
overlap <- intersect(gene_taxa$gene, sel_out_targets$gene)
overlap_cat1 <- intersect(gene_taxa_cat1$gene, sel_out_targets$gene)
#Save list of genes selected only in target species
write.table(sel_out_targets$gene, "sel_all_target.txt", quote = F, sep = "", eol = ",", row.names = F, col.names = F)
#outgroups
sel_out_og <- read.table("output_sele_hmm_outgroup.txt", header = T, sep = "\t")
sel_out_og$gene <- str_split_fixed(sel_out_og$gene_name, "_", 3)[,1]
sel_out_og <- subset(sel_out_og, nsel_SRV > 0)
sel_out_og$non_target <- sel_out_og$nsel_SRV - sel_out_og$targets
sel_out_og$frac_target <- sel_out_og$targets/sel_out_og$nsel_SRV
sel_out_og$nsel_SRV_corr <- sel_out_og$nsel_SRV
for (x in seq(1, nrow(sel_out_og))){
if ("Malurus_cyaneus" %in% unlist(strsplit(sel_out_og[x,"taxa"], ","))){
sel_out_og[x, "nsel_SRV_corr"] <- sel_out_og[x, "nsel_SRV_corr"] - 1
}
}
sel_out_og$frac_target_corr <- sel_out_og$targets/sel_out_og$nsel_SRV
sel_out_targets_og <- subset(sel_out_og, frac_target_corr > 0.5)
conv_gene_4 <- fisher.test(matrix(c(length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$swallow > 0 &
sel_out_targets$penguin > 0)),
length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$apenguin > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$swallow > 0 &
sel_out_targets$penguin > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$apenguin > 0))),nrow = 2),
alternative = "greater")
conv_gene_3_bsk <- fisher.test(matrix(c(length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$swallow > 0)),
length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$aswallow > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$swallow > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$aswallow > 0))),nrow = 2),
alternative = "greater")
conv_gene_3_bsp <- fisher.test(matrix(c(length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$swallow > 0 &
sel_out_targets$penguin > 0)),
length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$apenguin > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$swallow > 0 &
sel_out_targets$penguin > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$apenguin > 0))),nrow = 2),
alternative = "greater")
conv_gene_3_bpk <- fisher.test(matrix(c(length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$penguin > 0)),
length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$apenguin > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$penguin > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$apenguin > 0))),nrow = 2),
alternative = "greater")
conv_gene_3_psk <- fisher.test(matrix(c(length(which(sel_out_targets$penguin > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$swallow > 0)),
length(which(sel_out_targets_og$apenguin > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$aswallow > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$penguin > 0 &
sel_out_targets$kingfisher > 0 &
sel_out_targets$swallow > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$apenguin > 0 &
sel_out_targets_og$akingfisher > 0 &
sel_out_targets_og$aswallow > 0))),nrow = 2),
alternative = "greater")
conv_gene_2_bk <- fisher.test(matrix(c(length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0)),
length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$kingfisher > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$akingfisher > 0))),nrow = 2),
alternative = "greater")
conv_gene_2_bs <- fisher.test(matrix(c(length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$swallow > 0)),
length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$aswallow > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$swallow > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$aswallow > 0))),nrow = 2),
alternative = "greater")
conv_gene_2_bp <- fisher.test(matrix(c(length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$penguin > 0)),
length(which(sel_out_targets_og$abulbul > 0 &
sel_out_targets_og$apenguin > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$bulbul > 0 &
sel_out_targets$penguin > 0)),
                                       nrow(sel_out_targets_og)-length(which(sel_out_targets_og$abulbul > 0 &
                                                                               sel_out_targets_og$apenguin > 0))),nrow = 2),
alternative = "greater")
conv_gene_2_ks <- fisher.test(matrix(c(length(which(sel_out_targets$swallow > 0 &
sel_out_targets$kingfisher > 0)),
length(which(sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$akingfisher > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$swallow > 0 &
sel_out_targets$kingfisher > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$akingfisher > 0))),nrow = 2),
alternative = "greater")
conv_gene_2_kp <- fisher.test(matrix(c(length(which(sel_out_targets$penguin > 0 &
sel_out_targets$kingfisher > 0)),
length(which(sel_out_targets_og$apenguin > 0 &
sel_out_targets_og$akingfisher > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$penguin > 0 &
sel_out_targets$kingfisher > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$apenguin > 0 &
sel_out_targets_og$akingfisher > 0))),nrow = 2),
alternative = "greater")
conv_gene_2_sp <- fisher.test(matrix(c(length(which(sel_out_targets$swallow > 0 &
sel_out_targets$penguin > 0)),
length(which(sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$apenguin > 0)),
nrow(sel_out_targets)-length(which(sel_out_targets$swallow > 0 &
sel_out_targets$penguin > 0)),
nrow(sel_out_targets_og)-length(which(sel_out_targets_og$aswallow > 0 &
sel_out_targets_og$apenguin > 0))),nrow = 2),
alternative = "greater")
For GO enrichment analysis we used PANTHER.
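A minimal sketch of one such enrichment call is shown below (assuming the rbioapi package, the per-lineage gene lists written above, and the aBSREL-tested genes as the background; the object and output file names are illustrative). The same rba_panther_enrich() call is used for the MK results later in this document.

#R
library(rbioapi)
#genes with positively selected bulbul branches (written above as sel_bul.txt)
bulbul_sel <- readLines("sel_bul.txt")
#background: genes tested with aBSREL (e.g. the sel_out$gene vector from the block above)
background_genes <- unique(sel_out$gene)
#GO biological process enrichment against the chicken (taxon 9031) annotation
bulbul_go <- rba_panther_enrich(genes = unique(bulbul_sel),
                                organism = 9031,
                                annot_dataset = "GO:0008150",
                                ref_genes = background_genes,
                                ref_organism = 9031)
write.table(bulbul_go$result, "bulbul_absrel_enrichment.txt",
            quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)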
For the MK test we used population resequencing data for a bulbul species and a swallow species.
Population resequencing data for bulbuls come from Shakya et al. (2021) for Brachypodius melanocephalos. We included only one individual from Maratua Island (a small island off the coast of Borneo).
| BioSample | Run | BioProject | lat | long |
|---|---|---|---|---|
| LSUMZ_198531_Mentawai | SRR14761645 | PRJNA736057 | -1.6040 | 99.1590 |
| KU_98891_Borneo | SRR14761646 | PRJNA736057 | 9.8372 | 118.6412 |
| LSUMZ_198032_Borneo | SRR14761648 | PRJNA736057 | 2.1080 | 117.4020 |
| LSUMZ_191390_Borneo | SRR14761649 | PRJNA736057 | 1.1150 | 110.2170 |
| LSUMZ_216191_Borneo | SRR14761651 | PRJNA736057 | 5.8455 | 117.1825 |
| LSUMZ_176475_Borneo | SRR14761652 | PRJNA736057 | 2.9416 | 113.0333 |
| LSUMZ_198033_Bawean | SRR14761656 | PRJNA736057 | -5.8100 | 112.6140 |
| LSUMZ_198030_Maratua | SRR14761657 | PRJNA736057 | 2.2650 | 118.5630 |
| LSUMZ_179296_Borneo | SRR14761663 | PRJNA736057 | 5.3261 | 115.6736 |
| LSUMZ_215800_Borneo | SRR14761664 | PRJNA736057 | 5.3997 | 116.1022 |
| LSUMZ_198530_Sumatra | SRR14761647 | PRJNA736057 | 0.1000 | 99.9370 |
| LSUMZ_186231_Borneo | SRR14761650 | PRJNA736057 | 1.8011 | 109.7122 |
Population resequencing data for swallows come from Safran et al. (2016) for Hirundo rustica.
| BioSample | LibraryName | Run | BioProject |
|---|---|---|---|
| SAMN18196679 | 1040_WGS | SRR14127698 | PRJNA323498 |
| SAMN18196678 | 1036_WGS | SRR14127699 | PRJNA323498 |
| SAMN18196677 | 1021_WGS | SRR14127700 | PRJNA323498 |
| SAMN18196712 | 1763_WGS | SRR14127701 | PRJNA323498 |
| SAMN18196676 | 1019_WGS | SRR14127702 | PRJNA323498 |
| SAMN18196711 | 1760_WGS | SRR14127703 | PRJNA323498 |
| SAMN18196709 | 1747_WGS | SRR14127705 | PRJNA323498 |
| SAMN18196708 | 1744_WGS | SRR14127706 | PRJNA323498 |
| SAMN18196706 | 1723_WGS | SRR14127708 | PRJNA323498 |
| SAMN18196704 | 1838_WGS | SRR14127710 | PRJNA323498 |
| SAMN18196675 | 1015_WGS | SRR14127713 | PRJNA323498 |
| SAMN18196701 | 1798_WGS | SRR14127714 | PRJNA323498 |
| SAMN18196700 | 1791_WGS | SRR14127715 | PRJNA323498 |
| SAMN18196699 | 1777_WGS | SRR14127716 | PRJNA323498 |
| SAMN18196698 | 1311_WGS | SRR14127717 | PRJNA323498 |
| SAMN18196696 | 1303_WGS | SRR14127719 | PRJNA323498 |
| SAMN18196695 | 1289_WGS | SRR14127720 | PRJNA323498 |
| SAMN18196692 | 1223_WGS | SRR14127723 | PRJNA323498 |
| SAMN18196690 | 1177_WGS | SRR14127726 | PRJNA323498 |
| SAMN18196689 | 1171_WGS | SRR14127727 | PRJNA323498 |
| SAMN18196687 | 1158_WGS | SRR14127729 | PRJNA323498 |
| SAMN18196686 | 1107_WGS | SRR14127730 | PRJNA323498 |
| SAMN18196684 | 1091_WGS | SRR14127732 | PRJNA323498 |
| SAMN18196683 | 1075_WGS | SRR14127733 | PRJNA323498 |
| SAMN18196682 | 1058_WGS | SRR14127734 | PRJNA323498 |
| SAMN18196673 | 479_WGS | SRR14127735 | PRJNA323498 |
| SAMN18196672 | 464_WGS | SRR14127736 | PRJNA323498 |
| SAMN18196722 | 2057_WGS | SRR14127737 | PRJNA323498 |
| SAMN18196721 | 2049_WGS | SRR14127738 | PRJNA323498 |
| SAMN18196718 | 2028_WGS | SRR14127741 | PRJNA323498 |
| SAMN18196753 | 1473_WGS | SRR14127742 | PRJNA323498 |
| SAMN18196717 | 2025_WGS | SRR14127743 | PRJNA323498 |
| SAMN18196752 | 1240_WGS | SRR14127744 | PRJNA323498 |
| SAMN18196748 | 100_WGS | SRR14127748 | PRJNA323498 |
| SAMN18196746 | 43_WGS | SRR14127750 | PRJNA323498 |
| SAMN18196745 | 27_WGS | SRR14127751 | PRJNA323498 |
| SAMN18196744 | 15_WGS | SRR14127752 | PRJNA323498 |
| SAMN18196742 | 278_WGS | SRR14127755 | PRJNA323498 |
| SAMN18196740 | 271_WGS | SRR14127757 | PRJNA323498 |
| SAMN18196739 | 265_WGS | SRR14127758 | PRJNA323498 |
| SAMN18196738 | 251_WGS | SRR14127759 | PRJNA323498 |
| SAMN18196734 | 208_WGS | SRR14127763 | PRJNA323498 |
| SAMN18196733 | 199_WGS | SRR14127764 | PRJNA323498 |
| SAMN18196715 | 2005_WGS | SRR14127765 | PRJNA323498 |
| SAMN18196731 | 187_WGS | SRR14127767 | PRJNA323498 |
| SAMN18196728 | 170_WGS | SRR14127770 | PRJNA323498 |
| SAMN18196727 | 151_WGS | SRR14127771 | PRJNA323498 |
| SAMN18196726 | 2071_WGS | SRR14127772 | PRJNA323498 |
| SAMN18196725 | 2067_WGS | SRR14127773 | PRJNA323498 |
| SAMN18196724 | 2065_WGS | SRR14127774 | PRJNA323498 |
| SAMN18196723 | 2061_WGS | SRR14127775 | PRJNA323498 |
| SAMN18196713 | 1661_WGS | SRR14127777 | PRJNA323498 |
| SAMN18196803 | x242035_WGS | SRR14129407 | PRJNA323498 |
| SAMN18196802 | x242025_WGS | SRR14129408 | PRJNA323498 |
| SAMN18196801 | x242024_WGS | SRR14129409 | PRJNA323498 |
| SAMN18196799 | x242007_WGS | SRR14129411 | PRJNA323498 |
| SAMN18196814 | 327_WGS | SRR14129415 | PRJNA323498 |
| SAMN18196813 | 326_WGS | SRR14129416 | PRJNA323498 |
| SAMN18196812 | 324_WGS | SRR14129417 | PRJNA323498 |
| SAMN18196810 | 318_WGS | SRR14129419 | PRJNA323498 |
| SAMN18196809 | 316_WGS | SRR14129420 | PRJNA323498 |
| SAMN18196806 | 303_WGS | SRR14129423 | PRJNA323498 |
| SAMN18196796 | 1512_WGS | SRR14129425 | PRJNA323498 |
| SAMN18196795 | 1471_WGS | SRR14129426 | PRJNA323498 |
| SAMN18196763 | 1921_WGS | SRR14131088 | PRJNA323498 |
| SAMN18196760 | 1646_WGS | SRR14131091 | PRJNA323498 |
| SAMN18196759 | 1550_WGS | SRR14131092 | PRJNA323498 |
| SAMN18196794 | 1470_WGS | SRR14131093 | PRJNA323498 |
| SAMN18196793 | 1457_WGS | SRR14131095 | PRJNA323498 |
| SAMN18196792 | 1433_WGS | SRR14131096 | PRJNA323498 |
| SAMN18196791 | 1425_WGS | SRR14131097 | PRJNA323498 |
| SAMN18196790 | 1424_WGS | SRR14131098 | PRJNA323498 |
| SAMN18196789 | 1411_WGS | SRR14131099 | PRJNA323498 |
| SAMN18196788 | 1410_WGS | SRR14131100 | PRJNA323498 |
| SAMN18196787 | 1392_WGS | SRR14131101 | PRJNA323498 |
| SAMN18196786 | 1379_WGS | SRR14131102 | PRJNA323498 |
| SAMN18196785 | 1362_WGS | SRR14131103 | PRJNA323498 |
| SAMN18196784 | 1361_WGS | SRR14131104 | PRJNA323498 |
| SAMN18196757 | 1487_WGS | SRR14131105 | PRJNA323498 |
| SAMN18196783 | 1350_WGS | SRR14131106 | PRJNA323498 |
| SAMN18196782 | 1343_WGS | SRR14131107 | PRJNA323498 |
| SAMN18196781 | 1341_WGS | SRR14131108 | PRJNA323498 |
| SAMN18196780 | 1332_WGS | SRR14131109 | PRJNA323498 |
| SAMN18196779 | 1331_WGS | SRR14131110 | PRJNA323498 |
| SAMN18196778 | 1330_WGS | SRR14131111 | PRJNA323498 |
| SAMN18196777 | 1324_WGS | SRR14131112 | PRJNA323498 |
| SAMN18196776 | 1319_WGS | SRR14131113 | PRJNA323498 |
| SAMN18196775 | 450_WGS | SRR14131114 | PRJNA323498 |
| SAMN18196774 | 432_WGS | SRR14131115 | PRJNA323498 |
| SAMN18196756 | 1485_WGS | SRR14131116 | PRJNA323498 |
| SAMN18196773 | 382_WGS | SRR14131117 | PRJNA323498 |
| SAMN18196772 | 379_WGS | SRR14131118 | PRJNA323498 |
| SAMN18196771 | 364_WGS | SRR14131119 | PRJNA323498 |
| SAMN18196770 | 361_WGS | SRR14131120 | PRJNA323498 |
| SAMN18196769 | 358_WGS | SRR14131121 | PRJNA323498 |
| SAMN18196768 | 347_WGS | SRR14131122 | PRJNA323498 |
| SAMN18196766 | 1992_WGS | SRR14131124 | PRJNA323498 |
| SAMN18196765 | 1971_WGS | SRR14131125 | PRJNA323498 |
| SAMN18196764 | 1966_WGS | SRR14131126 | PRJNA323498 |
| SAMN18196755 | 1483_WGS | SRR14131127 | PRJNA323498 |
| SAMN18196754 | 1475_WGS | SRR14131128 | PRJNA323498 |
| SAMN18196804 | x242036_WGS | SRR14129406 | PRJNA323498 |
| SAMN18196800 | x242008_WGS | SRR14129410 | PRJNA323498 |
| SAMN18196798 | x242006_WGS | SRR14129412 | PRJNA323498 |
| SAMN18196797 | x242002_WGS | SRR14129414 | PRJNA323498 |
| SAMN18196811 | 319_WGS | SRR14129418 | PRJNA323498 |
| SAMN18196808 | 309_WGS | SRR14129421 | PRJNA323498 |
| SAMN18196807 | 307_WGS | SRR14129422 | PRJNA323498 |
| SAMN18196805 | 302_WGS | SRR14129424 | PRJNA323498 |
| SAMN18196762 | 1905_WGS | SRR14131089 | PRJNA323498 |
| SAMN18196761 | 1648_WGS | SRR14131090 | PRJNA323498 |
| SAMN18196758 | 1534_WGS | SRR14131094 | PRJNA323498 |
| SAMN18196767 | 1995_WGS | SRR14131123 | PRJNA323498 |
| SAMN18196681 | 1051_WGS | SRR14127696 | PRJNA323498 |
| SAMN18196680 | 1041_WGS | SRR14127697 | PRJNA323498 |
| SAMN18196710 | 1758_WGS | SRR14127704 | PRJNA323498 |
| SAMN18196707 | 1736_WGS | SRR14127707 | PRJNA323498 |
| SAMN18196705 | 1848_WGS | SRR14127709 | PRJNA323498 |
| SAMN18196703 | 1822_WGS | SRR14127711 | PRJNA323498 |
| SAMN18196702 | 1808_WGS | SRR14127712 | PRJNA323498 |
| SAMN18196697 | 1307_WGS | SRR14127718 | PRJNA323498 |
| SAMN18196694 | 1279_WGS | SRR14127721 | PRJNA323498 |
| SAMN18196693 | 1224_WGS | SRR14127722 | PRJNA323498 |
| SAMN18196674 | 1003_WGS | SRR14127724 | PRJNA323498 |
| SAMN18196691 | 1198_WGS | SRR14127725 | PRJNA323498 |
| SAMN18196688 | 1162_WGS | SRR14127728 | PRJNA323498 |
| SAMN18196685 | 1095_WGS | SRR14127731 | PRJNA323498 |
| SAMN18196720 | 2036_WGS | SRR14127739 | PRJNA323498 |
| SAMN18196719 | 2035_WGS | SRR14127740 | PRJNA323498 |
| SAMN18196751 | 131_WGS | SRR14127745 | PRJNA323498 |
| SAMN18196750 | 130_WGS | SRR14127746 | PRJNA323498 |
| SAMN18196749 | 113_WGS | SRR14127747 | PRJNA323498 |
| SAMN18196747 | 48_WGS | SRR14127749 | PRJNA323498 |
| SAMN18196743 | 3_WGS | SRR14127753 | PRJNA323498 |
| SAMN18196716 | 2013_WGS | SRR14127754 | PRJNA323498 |
| SAMN18196741 | 275_WGS | SRR14127756 | PRJNA323498 |
| SAMN18196737 | 239_WGS | SRR14127760 | PRJNA323498 |
| SAMN18196736 | 223_WGS | SRR14127761 | PRJNA323498 |
| SAMN18196735 | 211_WGS | SRR14127762 | PRJNA323498 |
| SAMN18196732 | 192_WGS | SRR14127766 | PRJNA323498 |
| SAMN18196730 | 184_WGS | SRR14127768 | PRJNA323498 |
| SAMN18196729 | 176_WGS | SRR14127769 | PRJNA323498 |
| SAMN18196714 | 1712_WGS | SRR14127776 | PRJNA323498 |
To generate VCF files for each species, we remapped the raw fastq files to the respective reference genomes and called SNPs using snpArcher. See the snpArcher GitHub repository or the paper (Mirchandani et al. 2023) for more information on snpArcher.
QC files for bulbul
QC files for swallow
We used degenotate to run an imputed MK test. First, we filtered each of the two VCF files to remove low-quality sites.
#!/bin/bash
vcftools --gzvcf B_melanocephalos_final.vcf.gz --remove-indels --max-missing 0.25 --minDP 2 --recode --recode-INFO-all --min-alleles 2 --max-alleles 2 --out B_melanocephalos_bi_filt_mac2_75p_noindel
bgzip B_melanocephalos_bi_filt_mac2_75p_noindel.recode.vcf
tabix B_melanocephalos_bi_filt_mac2_75p_noindel.recode.vcf.gz
#removed individuals with high missing data based on the QC file.
vcftools --gzvcf H_rustica_final.vcf.gz --remove-indels --max-missing 0.25 --minDP 2 --recode --recode-INFO-all --min-alleles 2 --max-alleles 2 --out H_rustica_bi_filt_mac2_75p_noindel --remove-indv SAMN18196751 --remove-indv SAMN18196742
bgzip H_rustica_bi_filt_mac2_75p_noindel.recode.vcf
tabix H_rustica_bi_filt_mac2_75p_noindel.recode.vcf.gz
To perform an imputed MK test, we need a polarized VCF. We used an outgroup species to polarize the VCFs. For the bulbul, we used Sylvia atricapilla (GCA_009819655.1) and for the swallow, we used Acrocephalus scirpaceus (GCA_910950805.1) to polarize the respective VCF files.
# Get maf file for the outgroup with ingroup species as reference
~/hal/bin/hal2maf --noAncestors --noDupes --refGenome Brachypodius_melanocephalos --targetGenomes Brachypodius_melanocephalos,Sylvia_atricapilla all_birds_new.hal B_mel-S_atr.maf
~/hal/bin/hal2maf --noAncestors --noDupes --refGenome Brachypodius_melanocephalos --targetGenomes Brachypodius_melanocephalos,Phylloscopus_trochilus all_birds_new.hal B_mel-P_tro.maf
To convert the MAF to a VCF we used the custom Python script below:
import argparse
from Bio import AlignIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio.SeqFeature import SeqFeature, FeatureLocation
def main(params):
# Open MAF file for reading
maf_file = open(params.maf, "r")
# Parse MAF alignment
alignment = AlignIO.parse(maf_file, "maf")
# Open VCF file for writing
vcf_file = open(params.vcf, "w")
from_sp = params.ref_sp
to_sp = params.map_sp
# Write VCF header
vcf_file.write("##fileformat=VCFv4.2\n")
vcf_file.write(f"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t{to_sp}\n")
# Iterate through each alignment block in MAF file
for block in alignment:
for record in block:
# Assuming the first sequence is the reference
record_split = record.id.split('.')
record_sp = record_split[0]
if record_sp == from_sp:
ref_seq = record.seq
ref_start = record.annotations['start']
ref_chrom = '.'.join(record_split[-(len(record_split)-1):])
continue
# Compare non-reference sequences with the reference
alt_seq = record.seq
for i, (ref_base, alt_base) in enumerate(zip(ref_seq, alt_seq)):
pos = ref_start + i + 1 # 1-based position
if ref_base.upper() != alt_base.upper():
# Write VCF record for variant
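                    # the outgroup difference is recorded as a fixed homozygous (1/1) genotype for the mapped species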
vcf_file.write(f"{ref_chrom}\t{pos}\t.\t{ref_base.upper()}\t{alt_base.upper()}\t.\tPASS\t.\tGT\t1/1\n")
# Close files
maf_file.close()
vcf_file.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--maf", type=str, help="Maf File")
parser.add_argument("--ref_sp", type=str, help="Reference species")
parser.add_argument("--map_sp", type=str, help="Species to map to")
parser.add_argument("--vcf", type=str, help="Output Vcf filename")
options, args = parser.parse_known_args()
main(options)
sp1="B_mel"
sp2="S_atr"
## filter VCFs for degenotate
bcftools view ${sp1}-${sp2}.vcf -v snps -U -m2 -M2 -e 'F_MISSING > 0.75 | ref="N" | ALT="." | ALT="n" | ALT="-"| TYPE~"indel"' -O z -o ${sp1}-${sp2}_filt.vcf.gz
tabix ${sp1}-${sp2}_filt.vcf.gz
## make ancestral fasta from outgroup vcf
bcftools consensus ${sp1}-${sp2}_filt.vcf.gz -f Brachypodius_melanocephalos_GCA013398615.fna -o ${sp1}-${sp2}_filt.fa
samtools faidx ${sp1}-${sp2}_filt.fa
## make VCF with ONE outgroup
bcftools merge --missing-to-ref ../B_melanocephalos_bi_filt_mac2_75p_noindel.recode.vcf.gz ${sp1}-${sp2}_filt.vcf.gz -O v -o ${sp1}-${sp2}_comb.vcf
sed -i -e "s/sample/$sp2/" ${sp1}-${sp2}_comb.vcf
## infer ancestral allele
vcfdo polarize -i ${sp1}-${sp2}_comb.vcf -f ${sp1}-${sp2}_filt.fa > ${sp1}-${sp2}_pol.vcf
bgzip ${sp1}-${sp2}_pol.vcf
tabix ${sp1}-${sp2}_pol.vcf.gz
We ran the MK test with two outgroup species for each run. For bulbul, we included Phylloscopus trochilus (GCA_016584745.1) as the second outgroup and for swallow, we included Eremophila alpestris (GCA_009792885.1) as the second outgroup.
sp1="B_mel"
sp2="P_tro"
## filter VCFs for degenotate
bcftools view ${sp1}-${sp2}.vcf -v snps -U -m2 -M2 -e 'F_MISSING > 0.75 | ref="N" | ALT="." | ALT="n" | ALT="-" | TYPE~"indel"' -O z -o ${sp1}-${sp2}_filt.vcf.gz
tabix ${sp1}-${sp2}_filt.vcf.gz
## merge to polarized vcf file
bcftools merge --missing-to-ref B_mel-S_atr_pol.vcf.gz ${sp1}-${sp2}_filt.vcf.gz -O z -o B_mel_polarized.vcf.gz
tabix B_mel_polarized.vcf.gz
vcftools --gzvcf B_mel_polarized.vcf.gz --remove rem_island.txt --recode --recode-INFO-all --out B_mel_polarized_noisle
vcftools --gzvcf H_rus_polarized.vcf.gz --keep rustica_samples.txt --recode --recode-INFO-all --out H_rus_polarized_noisle
To run degenotate we also need a gtf/gff file with annotations of coding sequences in the target species. For this we used the TOGA predictions from above.
bedToGenePred Brachypodius_melanocephalos_GCA013398615/query_annotation.bed B_mel.genepred
genePredToGtf file B_mel.genepred B_mel.gtf
degenotate.py -a B_mel.gtf -g Brachypodius_melanocephalos_GCA013398615.fna -u S_atr,sample -v B_mel_polarized_noisle.vcf.gz
#R
#Identify genes under positive selection from the degenotate MK output (significant test, positive imputed DoS) and test for GO enrichment with PANTHER via rbioapi
bulbul_selected <- subset(trans_largest, transcript %in% mk_table_largest$transcript[which(mk_table_largest$sig %in% "red" & mk_table_largest$imp.dos > 0)])
library(rbioapi)
bulbul_enrich <- rba_panther_enrich(genes = unique(bulbul_selected$gene),
organism = 9031,
annot_dataset = "GO:0008150",
ref_genes = unique(mk_table_largest$gene),
ref_organism = 9031)
write.table(unique(bulbul_selected$gene), "bulbul_selected_mk.txt", quote = F, sep = "", eol = ",", row.names = F, col.names = F)
write.table(unique(trans$gene), "bulbul_all_genes_mk.txt", quote = F, sep = "", row.names = F, col.names = F)
write.table(bulbul_enrich$result, "bulbul_all_enrichment_mk.txt", quote = F, sep = "\t", row.names = F, col.names = F)
swallow_selected <- subset(trans_largest, transcript %in% mk_table_largest$transcript[which(mk_table_largest$sig %in% "red" & mk_table_largest$imp.dos > 0)])
swallow_enrich <- rba_panther_enrich(genes = unique(swallow_selected$gene),
organism = 9031,
annot_dataset = "GO:0008150",
ref_genes = unique(mk_table_largest$gene),
ref_organism = 9031)
write.table(unique(swallow_selected$gene), "swallow_selected_mk.txt", quote = F, sep = "", eol = ",", row.names = F, col.names = F)
write.table(unique(trans$gene), "swallow_all_genes_mk.txt", quote = F, sep = "", row.names = F, col.names = F)
write.table(swallow_enrich$result, "swallow_all_enrichment_mk.txt", quote = F, sep = "\t", row.names = F, col.names = F)
bs_overlap <- unique(intersect(bulbul_selected$gene, swallow_selected$gene))
phyper(length(bs_overlap)-1,1069,15259-2584,2584)
absrel_bulbul_overlap <- unique(intersect(bulbul_selected$gene, sel_out_targets$gene[which(sel_out_targets$bulbul>0)]))
absrel_swallow_overlap <- unique(intersect(swallow_selected$gene, sel_out_targets$gene[which(sel_out_targets$swallow>0)]))
We used a modified version of the MK test comparing fixed and polymorphic sites in each CNEE to synonymous sites in the gene. The same vcf files from above were used in the analysis.
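For each element this amounts to a 2x2 Fisher's exact test of element sites (fixed vs. polymorphic) against synonymous sites of the associated gene (dS vs. pS), together with a direction-of-selection (DoS) style statistic for the element. A minimal sketch of the per-element test is shown below (the helper name is hypothetical; the full implementation appears in the R code further down):

#R
#2x2 table: element sites vs. gene synonymous sites, split into fixed and polymorphic counts
cnee_mk_test <- function(fixed, poly, dS, pS){
  p <- fisher.test(matrix(c(poly, pS, fixed, dS), nrow = 2))$p.value
  #DoS analogue: excess of fixed differences in the element relative to synonymous sites
  dos <- fixed/(fixed + dS) - poly/(poly + pS)
  c(p.value = p, cnee_dos = dos)
}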
We used a custom Python script to count the fixed and polymorphic sites in all elements, using the VCF files for bulbuls and swallows, respectively.
import pandas as pd
import pysam
import argparse,sys,os
import glob
def get_fix_poly(vcf, scaffold, start, stop):
samples = list((vcf.header.samples))
fix = 0
poly = 0
for rec in vcf.fetch(scaffold, start, stop):
out = False
alt = 0
tot = 0
#Uses same filters as those implemented in degenotate for outgroups
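        #the last two samples in the merged VCF are the two outgroups; a site is treated as a candidate fixed difference only when both are homozygous for the alternate allele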
if (rec.samples[samples[-2]]['GT'] == (1,1) and rec.samples[samples[-1]]['GT'] == (1,1)):
out = True
for sample in samples[:-2]:
for haplotype in rec.samples[sample]['GT']:
if haplotype == 1:
alt += 1
tot += 1
elif haplotype == 0:
tot += 1
        #Using an allele frequency cutoff of 1/number of samples.
        #skip sites with no called ingroup genotypes to avoid division by zero
        if tot == 0:
            continue
        af = alt/tot
if ((af < (1/len(samples))) and out):
fix += 1
elif ((af < (1/len(samples))) and out is False):
pass
else:
poly += 1
return fix, poly
def parse_vcf(vcfs, bed):
vcf_reader = pysam.VariantFile(vcfs)
vcf_bed = pd.read_table(bed, header=None)
# Also get fixed and polymorphic sites upstream and downstream of each element
vcf_pd = pd.DataFrame(columns=['element', 'len', 'fixed', 'poly', 'u_fixed', 'u_poly', 'd_fixed', 'd_poly'])
for idx in range(len(vcf_bed)):
buff = 100
length = vcf_bed.loc[idx][2]-vcf_bed.loc[idx][1]
fix_main, fix_poly = get_fix_poly(vcf_reader, vcf_bed.loc[idx][0], vcf_bed.loc[idx][1], vcf_bed.loc[idx][2])
try:
u_main, u_poly = get_fix_poly(vcf_reader, vcf_bed.loc[idx][0], vcf_bed.loc[idx][1]-buff-length, vcf_bed.loc[idx][1]-buff)
except:
u_main, u_poly = "NA", "NA"
try:
d_main, d_poly = get_fix_poly(vcf_reader, vcf_bed.loc[idx][0], vcf_bed.loc[idx][2]+buff, vcf_bed.loc[idx][2]+buff+length)
except:
d_main, d_poly = "NA", "NA"
vcf_pd.loc[idx] = [vcf_bed.loc[idx][3], length, fix_main, fix_poly, u_main, u_poly, d_main, d_poly]
return(vcf_pd)
def main(options):
out_pd = parse_vcf(options.vcf, options.bed)
out_pd.to_csv(options.outname+".txt", sep='\t', index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--vcf", type=str, help="VCF File")
parser.add_argument("--bed", type=str, help="BED File")
parser.add_argument("--outname", type=str, help="output file name", default = "output")
options,args = parser.parse_known_args()
main(options)
library(stringr)
library(plyr)
library(data.table)
library(dplyr)
#get fixed and polymorphic sites from code above
bulbul <- read.table("all_accl_bulbul_noisle_stats.txt", sep="\t", header = T)
#load phyloacc results
phyloacc <- fread("elem_lik_filt.txt", sep = "\t", header = T)
bulbul <- merge(bulbul, phyloacc, by.x = "element", by.y="original.id", all.x=F, all.y=F)
#get closest genes for each element (using bedtools)
bulbul_closest <- read.table("B_mel_closest3.bed", sep="\t", header = F)
#get id of genes (from TOGA)
ids <- read.table("Brachypodius_melanocephalos_GCA013398615.tsv", sep = "\t", header = T)
bulbul_closest$id <- str_extract(bulbul_closest$V13, "gene_id [A-Za-z0-9-_.]*")
bulbul_closest$id <- gsub("gene_id ", "", bulbul_closest$id)
bulbul_closest$id2 <- gsub(".[0-9]*$", "", bulbul_closest$id)
bulbul_closest$gene <- mapvalues(bulbul_closest$id2, from=ids$t_transcript, to=ids$t_gene, warn_missing = F)
bulbul <- merge(bulbul, bulbul_closest[,c("V4", "id", "gene", "V14")], by.x = "element", by.y="V4")
#get gene mk table (from degenotate)
mk_table <- read.table("bulbul_noisle_mk.tsv", sep = "\t", header = T)
bulbul <- merge(bulbul, mk_table_largest, by = "gene", all.x = T, all.y = F)
#function to do a fisher's exact test
do_fisher <- function(row){
p <- tryCatch({
fish <- fisher.test(matrix(data=c(row[1], row[2], row[3], row[4]), nrow = 2))
fish$p.value
}, error = function(e) {
NA
})
return(p)
}
bulbul$fisher <- apply(bulbul[, c("poly", "pS", "fixed", "dS")], 1, do_fisher)
bulbul$cnee_dos <- bulbul$fixed / (bulbul$fixed + bulbul$dS) - bulbul$poly / (bulbul$poly + bulbul$pS)
library(parallel)
cl <- makeCluster(4)
bulbul$fisher_u <- parApply(cl, bulbul[, c("poly", "u_poly", "fixed", "u_fixed")], 1, do_fisher)
bulbul$fisher_d <- parApply(cl, bulbul[, c("poly", "d_poly", "fixed", "d_fixed")], 1, do_fisher)
stopCluster(cl)
bulbul$sig <- ifelse(bulbul$fisher < 0.05 & (bulbul$fisher_d < 0.05 | bulbul$fisher_u < 0.05), "red", "black")
bulbul$sig[is.na(bulbul$sig)] <- "white"
fwrite(bulbul, "bulbul_noisle_psuedomk_3_flank.txt", sep = "\t", row.names = F, col.names = T, quote = F)
The same steps were repeated for the swallow.
We repeated the above approach for each gene by counting the fixed and polymorphic sites for all conserved elements within 10,000 bp of each gene (including those in introns).
#Get list of elements within 10000 bp of each gene using bedtools
bulbul_gene_10000 <- read.table("Bmel_CDS_elements_intersect.txt", sep = "\t", header = F)
bulbul_gene_10000$id <- str_extract(bulbul_gene_10000$V9, "gene_id [A-Za-z0-9-_.]*")
bulbul_gene_10000$id <- gsub("gene_id ", "", bulbul_gene_10000$id)
bulbul_gene_10000$id2 <- gsub(".[0-9]*$", "", bulbul_gene_10000$id)
bulbul_gene_10000$gene <- mapvalues(bulbul_gene_10000$id2, from=ids$t_transcript, to=ids$t_gene, warn_missing = F)
bulbul_mk_table <- read.table("bulbul_noisle_mk.tsv", sep = "\t", header = T)
bulbul_mk_table <- merge(bulbul_mk_table, bulbul_gene_10000[,c("gene", "V13")], by="gene", all.x = F, all.y = F)
bulbul <- read.table("all_accl_bulbul_noisle_stats.txt", sep="\t", header = T)
bulbul_mk_table <- merge(bulbul_mk_table, bulbul, by.x = "V13", by.y="element", all.x = F, all.y = F)
bulbul_mk_table <- unique(bulbul_mk_table)
#remove elements accelerated in sylvia
rate2 <- fread("M2_elem_Z.txt", sep = "\t", header = T, select = c(1,13))
sylv_accl <- phyloacc$original.id[which(phyloacc$best.fit.model %in% c("M1","M2") & phyloacc$phyloacc.id %in% rate2$`Locus ID`[which(rate2$Sylvia_atricapilla == 2)])]
bulbul_mk_table <- subset(bulbul_mk_table, !(V13 %in% sylv_accl))
#get sum of all elements within 10000 bp of each gene
bulbul_gene <- bulbul_mk_table %>% group_by(gene) %>% summarize(fixed = sum(fixed),
poly = sum(poly),
len = sum(len),
dS = first(dS),
pS = first(pS),
cds_length = first(cds_length),
f0 = first(f0),
f2 = first(f2),
f3 = first(f3),
f4 = first(f4))
bulbul_gene$cnee_dos <- bulbul_gene$fixed / (bulbul_gene$fixed + bulbul_gene$dS) - bulbul_gene$poly / (bulbul_gene$poly + bulbul_gene$pS)
bulbul_gene$alpha <- 1-((bulbul_gene$dS/bulbul_gene$fixed)/(bulbul_gene$pS/bulbul_gene$poly))
bulbul_gene$fisher <- apply(bulbul_gene[, c("pS", "poly", "dS", "fixed")], 1, do_fisher)
bulbul_gene$sig <- ifelse(bulbul_gene$fisher < 0.05, "red", "black")
write.table(bulbul_gene, "bulbul_gene_mk.txt", sep = "\t", row.names = F, col.names = T, quote = F)
The above methods were repeated for swallows, with elements accelerated in Acrocephalus removed.