Note

Following includes codes used to generate an annotated list of conserved elements for birds. It uses bash and R codes interchangeably along with other instructions to run. Please make sure you are running the codes in the correct interface. Load required libraries:

library(stringr)
library(ggplot2)
library(ggvenn)

1 Introduction

Three datasets were used to generate an initial list of conserved elements. Two more datasets were generated using phastcons to supplement the list of conserved elements. All data was mapped to the genome bGalGal1.mat.broiler.GRCg7b (henceforth only referred to as galgal7b). You can get the genome here:

#bash
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/Gallus_gallus/latest_assembly_versions/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz

The three datasets can be downloaded as such:

2 From external sources

2.1 Ratite elements

This list of elements comes from Sackton et al. (2019) https://datadryad.org/stash/dataset/doi:10.5061/dryad.v72d325

The working file from this dataset is cnees.galGal4NCBI.bed

First we need to convert the coordinates from galgal4 to galgal7b. One way to do this is using NCBI Genome Remapping Service: https://www.ncbi.nlm.nih.gov/genome/tools/remap

The converted file will have loci broken up. We will merge intervening loci using bedtools to merge loci that are next to one another and then size filter the dataset to include only loci that are greater than 20 bp and less than 5000 bp.

#bash
bedtools sort -i remapped_cnees.galGal4NCBI.bed | \
bedtools merge -i "stdin" > \
cnees.galGal4NCBI_merged.bed

#R
infile <- read.table("cnees.galGal4NCBI_merged.bed", sep = "\t", header = F)
infile$size <- infile$V3 - infile$V2
infile <- infile[which(infile$size <= 5000 & infile$size >= 20),]
write.table(infile[,c(1:3)], "ratites_merged_sized.bed", 
            sep = "\t", row.names = F, col.names = F, quote = F)

2.2 Vocal elements

This list of elements come from Cahill et al. (2021).

There are three files: Hummingbird.bed, Oscine.bed, and Parrot.bed. Convert the files from galgal5 to galgal7b using NCBI Genome Remapping Service, combine files, and then merge and filter loci.

#bash
cat remapped_Hummingbird.bed remapped_Oscine.bed remapped_Parrot.bed | \
bedtools sort -i "stdin" | bedtools merge -i "stdin" > \
vocal_merged.bed

#R
infile <- read.table("vocal_merged.bed", sep = "\t", header = F)
infile$size <- infile$V3 - infile$V2
infile <- infile[which(infile$size <= 5000 & infile$size >= 20),]
write.table(infile[,c(1:3)], "vocal_merged_sized.bed", 
            sep = "\t", row.names = F, col.names = F, quote = F)

2.3 UCSC elements

This list comes from UCSC vertebrate 100 way alignment. You can download file from https://genome.ucsc.edu/cgi-bin/hgTables?command=start using the table browser. The filename is cons_100_verts.bed

First, convert from hg38 to galgal7b. We will do this using UCSCs liftOver tool (get here: https://genome-store.ucsc.edu/) and chain file chain file

Next, use NCBI Genome Remapping Service to convert from galgal4 to galgal7b and then merge and filter. One extra step is required to change scaffold names from UCSC style to NCBI style. Once you get the galgal4 based bed file, we will convert the scaffold names using R. First we need to get a map file, which can be downloaded here: map file. Uncomment the column header line (last # line before data)

#bash
liftOver cons_100_verts.bed hg38ToGalGal4.over.chain.gz Hum2chick_CNEE.bed unMapped.bed
bedtools sort -i Hum2chick_CNEE.bed | bedtools merge -i "stdin" > Hum2chick_CNEE_merged.bed

#R
#remap the UCSC style scaffold names to NCBI scaffold names
map_file <- read.table("galgal4_map.txt", header = T, sep="\t")
infile <- read.table("Hum2chick_CNEE_merged.bed", sep = "\t", header = F)
infile <- mapvalues(infile$V1, 
                    from = map_file$UCSC.style.name, 
                    to = map_file$RefSeq.Accn, 
                    warn_missing = F)
write.table(infile, "Hum2chick_CNEE_merged_NCBI.bed", 
            sep = "\t", row.names = F, col.names = F, quote = F)

#bash
#use NCBI genome remapping service for Hum2chick_CNEE_merged_NCBI.bed
bedtools sort -i remapped_Hum2chick_CNEE_merged_NCBI.bed | merge -i "stdin" > UCSC_merged.bed

#R
infile <- read.table("UCSC_merged.bed", header = T, sep="\t")
infile$size <- infile$V3 - infile$V2
infile <- infile[which(infile$size <= 5000 & infile$size >= 20),]
write.table(infile[,c(1:3)], "UCSC_merged_sized.bed", 
            sep = "\t", row.names = F, col.names = F, quote = F)

Optional: check for overlap elements in the three datasets

We can briefly visualize the overlap of elements between the three datasets to get an idea of the distribution of conserved elements in each of these datasets.

#bash
#combine the three sets into one
cat ratites_merged_sized.bed vocal_merged_sized.bed UCSC_merged_sized.bed | \
bedtools sort -i "stdin" | bedtools merge -i "stdin" > \
three_prelim_merged.bed
#annotate file to get counts of each
bedtools annotate -counts -i three_prelim_merged.bed -files ratites_merged_sized.bed vocal_merged_sized.bed UCSC_merged_sized.bed > three_prelim_merged_annot.bed

#R
infile <- read.table("three_prelim_merged_annot.bed", sep = "\t", header = F)
#include a unique id
infile$ID <- paste("element", seq(1, nrow(infile)), sep = "")
#create venn diagram
venn_list <- list(ratite = infile$ID[infile$V4 > 0],
                  vocal = infile$ID[infile$V5 > 0],
                  UCSC = infile$ID[infile$V6 > 0])
ggvenn(
  venn_list, 
  fill_color = c("#AAFFAAFF", "#AFAFE9FF", "#FFAACCFF"),
  stroke_size = 0.5, set_name_size = 4
)

3 Phastcons

We used Phastcons to generate a list of conserved elements mapping to Chicken Gallus gallus and Zebra Finch Taeniopygia guttata.

3.1 Cactus alignment

To run phastcons first we need a whole genome cactus alignment. See the cactus.Rmd file for steps to generate the cactus alignment. The final alignment file is named all_birds_new.hal.

3.2 Neutral rates

3.2.1 Generating 4d sites

This documents the steps to generate a list of 4d sites to calculate neutral parameters.

First, we need to get a gff file for the galgal7b genome. You can get that here. The file was renamed to galgal7b.gff

Next to generate a bed file corresponding to coding regions we used UCSC tools to convert first to GenePred format and then to BED format. You can get gff3ToGenePred here and genePredToBed here.

Then we used hal4dExtract to get a bed of 4d sites. You can find all haltools here.

Note: We split the bed file into chunks of 10000 elements and then ran hal4dExtract on each chunk to parallelize the process

#bash
gff3ToGenePred -useName galgal7b.gff galgal7b_cds.gp
genePredToBed galgal7b_cds.gp galgal7b_cds.bed

hal4dExtract --conserved all_birds_new.hal Gallus_gallus7 galgal7b_cds.bed galgal7b_4d.bed

We ran Phastcons separately on macrochromosomes (Gallus gallus chromosomes 1 to 12), microchromosomes (Gallus gallus chromosomes 13 to 38) and Z chromosome. W chromosomes scaffolds were omitted because our alignment includes a mixture of males and females and not all female genomes have well assembled W chromosomes. The numbers after the code line denotes number of elements in that bed file.

#bash
grep -f macro_chrom.txt galgal7b_4d.bed > galgal7b_4d_macro.bed #44426
grep -f micro_chrom.txt galgal7b_4d.bed > galgal7b_4d_micro.bed #19897
grep -f Z_chrom.txt galgal7b_4d.bed > galgal7b_4d_Z.bed #3104

3.2.2 Running phyloFit

To run phyloFit first we need to convert the hal file to a maf file. We use hal2mafMP to get maf files containing only the 4d bed sites with galgal7b as the reference genome.

#bash
hal2mafMP.py  --numProc 6 --refGenome Gallus_gallus7 --noDupes --noAncestors --refTargets galgal7b_4d_macro.bed all_birds_new.hal all_birds_4d_macro.maf
hal2mafMP.py  --numProc 6 --refGenome Gallus_gallus7 --noDupes --noAncestors --refTargets galgal7b_4d_micro.bed all_birds_new.hal all_birds_4d_micro.maf
hal2mafMP.py  --numProc 6 --refGenome Gallus_gallus7 --noDupes --noAncestors --refTargets galgal7b_4d_Z.bed all_birds_new.hal all_birds_4d_Z.maf

Next, we run phylofit to generate mod files for each of the three files

#bash
tree="((((((((((((((((((Certhia_americana,(Ficedula_albicollis,Sturnus_vulgaris)),(Taeniopygia_guttata,((((Diglossa_brunneiventris,(Emberiza_elegans,(Setophaga_coronata,Molothrus_ater))),Serinus_canaria),Motacilla_alba),Passer_domesticus))),((((((Sylvia_atricapilla,(Pycnonotus_jocosus,Brachypodius_melanocephalos)),Phylloscopus_trochilus),((Progne_subis,Riparia_riparia),Hirundo_rustica)),Acrocephalus_scirpaceus),Eremophila_alpestris),Parus_major)),(Corvus_cornix,Lanius_collurio)),(Lichenostomus_cassidix,Malurus_cyaneus)),(Myiozetetes_cayanensis,Chiroxiphia_lanceolata)),Acanthisitta_chloris),Melopsittacus_undulatus),Falco_peregrinus),(Athene_cunicularia,((Bucorvus_abyssinicus,((Colaptes_auratus,Pogoniulus_pusillus),((Chloroceryle_aenea,(Halcyon_senegalensis,Actenoides_hombroni)),Merops_nubicus))),Trogon_surrucura))),Aquila_chrysaetos),(((Sterna_hirundo,Pluvialis_apricaria),Phoenicopterus_ruber),(Theristicus_caerulescens,(Macronectes_giganteus,(((((Eudyptes_moseleyi,Eudyptes_sclateri),((Eudyptes_chrysolophus,Eudyptes_schlegeli),((Eudyptes_chrysocome,Eudyptes_filholi),Eudyptes_pachyrhynchus))),Megadyptes_antipodes),(((Spheniscus_demersus,Spheniscus_magellanicus),Spheniscus_humboldti),((Eudyptula_minor,Eudyptula_albosignata),Eudyptula_novaehollandiae))),(((Pygoscelis_antarcticus,Pygoscelis_papua),Pygoscelis_adeliae),(Aptenodytes_patagonicus,Aptenodytes_forsteri))))))),Porphyrio_hochstetteri),((Cuculus_canorus,Tauraco_erythrolophus),Columba_livia)),((Calypte_anna,Apus_apus),Caprimulgus_europaeus)),((Gallus_gallus4,Gallus_gallus7),Anas_platyrhynchos)),(((Dromaius_novaehollandiae,Eudromia_elegans),Rhea_americana),Struthio_camelus)),Gavialis_gangeticus)"

phyloFit --tree $tree --init-random --subst-mod SSREV --sym-freqs \
--log all_birds_4d_macro.log --msa-format MAF \
--out-root all_birds_4d_macro_non-conserved all_birds_4d_macro.maf

phyloFit --tree $tree --init-random --subst-mod SSREV --sym-freqs \
--log all_birds_4d_micro.log --msa-format MAF \
--out-root all_birds_4d_micro_non-conserved all_birds_4d_micro.maf

phyloFit --tree $tree --init-random --subst-mod SSREV --sym-freqs \
--log all_birds_4d_Z.log --msa-format MAF \
--out-root all_birds_4d_Z_non-conserved all_birds_4d_Z.maf

The values obtained for the neutral rates are as follows: \[Q_{Z} = \begin{bmatrix} -0.790118 & 0.206624 & 0.493049 & 0.090444 \\ 0.355909 & -1.361521 & 0.156337 & 0.849275 \\ 0.849275 & 0.156337 & -1.361521 & 0.355909 \\ 0.090444 & 0.493049 & 0.206624 & -0.790118 \end{bmatrix}\] \[Q_{macro} = \begin{bmatrix} -0.877359 & 0.159559 & 0.605190 & 0.112609 \\ 0.187081 & -1.143795 & 0.247140 & 0.709574 \\ 0.709574 & 0.247140 & -1.143795 & 0.187081 \\ 0.112609 & 0.605190 & 0.159559 & -0.877359 \end{bmatrix}\] \[Q_{micro} = \begin{bmatrix} -1.097038 & 0.206082 & 0.758698 & 0.132258 \\ 0.138586 & -0.934744 & 0.285947 & 0.510210 \\ 0.510210 & 0.285947 & -0.934744 & 0.138586 \\ 0.132258 & 0.758698 & 0.206082 & -1.097038 \end{bmatrix}\]

3.3 Conserved Rates

Next we need to get a mod file with rates for conserved sites. Instead of calculating a whole matrix, we calculated a value of \(\rho\) which describes a constant that multiples the non-conserved rates to conserved rates.To calculate this, first we need to generate mafs of the each scaffold. Since, we are interested in increased rates among our focal short-tarsi birds, we do not want to include these species when calculating rates (to not bias subsequent analysis). We can get the mafs of each scaffold by:

#bash
hal2mafMP.py  --numProc 6 --refGenome Gallus_gallus7 --noDupes --noAncestors --splitBySequence --targetGenomes Gavialis_gangeticus,Struthio_camelus,Rhea_americana,Eudromia_elegans,Dromaius_novaehollandiae,Anas_platyrhynchos,Gallus_gallus7,Phoenicopterus_ruber,Columba_livia,Tauraco_erythrolophus,Cuculus_canorus,Caprimulgus_europaeus,Apus_apus,Calypte_anna,Porphyrio_hochstetteri,Pluvialis_apricaria,Sterna_hirundo,Macronectes_giganteus,Theristicus_caerulescens,Aquila_chrysaetos,Athene_cunicularia,Trogon_surrucura,Bucorvus_abyssinicus,Merops_nubicus,Pogoniulus_pusillus,Colaptes_auratus,Falco_peregrinus,Melopsittacus_undulatus,Acanthisitta_chloris,Myiozetetes_cayanensis,Chiroxiphia_lanceolata,Malurus_cyaneus,Lichenostomus_cassidix,Lanius_collurio,Corvus_cornix,Eremophila_alpestris,Parus_major,Acrocephalus_scirpaceus,Phylloscopus_trochilus,Sylvia_atricapilla,Certhia_americana,Sturnus_vulgaris,Ficedula_albicollis,Taeniopygia_guttata,Passer_domesticus,Motacilla_alba,Serinus_canaria,Emberiza_elegans,Molothrus_ater,Setophaga_coronata,Diglossa_brunneiventris all_birds_new.hal phastcons/non_targets_all_birds

We also need to prune the tree file to keep only the requisite tips. Here is an example for doing this for tree in the Z scaffold mod file.

#R
tree <- read.newick("tree_Z.txt")
tree_sub <- keep.tip(tree, c("Gavialis_gangeticus","Struthio_camelus","Rhea_americana","Eudromia_elegans","Dromaius_novaehollandiae","Anas_platyrhynchos","Gallus_gallus7","Phoenicopterus_ruber","Columba_livia","Tauraco_erythrolophus","Cuculus_canorus","Caprimulgus_europaeus","Apus_apus","Calypte_anna","Porphyrio_hochstetteri","Pluvialis_apricaria","Sterna_hirundo","Macronectes_giganteus","Theristicus_caerulescens","Aquila_chrysaetos","Athene_cunicularia","Trogon_surrucura","Bucorvus_abyssinicus","Merops_nubicus","Pogoniulus_pusillus","Colaptes_auratus","Falco_peregrinus","Melopsittacus_undulatus","Acanthisitta_chloris","Myiozetetes_cayanensis","Chiroxiphia_lanceolata","Malurus_cyaneus","Lichenostomus_cassidix","Lanius_collurio","Corvus_cornix","Eremophila_alpestris","Parus_major","Acrocephalus_scirpaceus","Phylloscopus_trochilus","Sylvia_atricapilla","Certhia_americana","Sturnus_vulgaris","Ficedula_albicollis","Taeniopygia_guttata","Passer_domesticus","Motacilla_alba","Serinus_canaria","Emberiza_elegans","Molothrus_ater","Setophaga_coronata","Diglossa_brunneiventris"))
write.tree(tree_sub, "tree_Z_sub.txt")

Repeat the same for macro and micro chromosome tree files from the respective mod files and replace the trees in the respective mod files with the pruned trees to create a non_targets mod file.

Next, we will run phastCons with the estimate-rho flag for each scaffold. We will be running phastCons for each macrochromosome and each microchromosome scaffold independently. We also corrected for GC contents by using a value of GC content that corresponded to the average GC content for those sets of chromosomes (Z = 0.41, macro = 0.42, and micro = 0.52) They were run as a slurm array.

#bash
#For Z chromosome
phastCons --expected-length 45 --target-coverage 0.3 --rho 0.4 --gc 0.41 --estimate-rho \\
non_targets_4d_NC_052572.1 --no-post-probs --msa-format MAF \\
non_targets_all_birds_NC_052572.1.maf non_targets_4d_Z_non-conserved.mod

#For macro chromosomes (submitted as a slurm array)
#SBATCH --array=0-13
scaffolds=("NC_052532.1" "NC_052533.1" "NC_052534.1" "NC_052535.1" "NC_052536.1" "NC_052537.1" "NC_052538.1" "NC_052539.1" "NC_052540.1" "NC_052541.1" "NC_052542.1" "NC_052543.1" "NW_024095932.1" "NW_024095933.1" "NW_024095934.1" "NW_024095935.1" "NW_024095936.1" "NW_024095937.1")

phastCons --expected-length 45 --target-coverage 0.3 --rho 0.4 --gc 0.42 --estimate-rho \\
phastcons/new_mod/non_targets_4d_macro_${scaffolds[$SLURM_ARRAY_TASK_ID]} --no-post-probs \\
--msa-format MAF phastcons/non_targets_all_birds_${scaffolds[$SLURM_ARRAY_TASK_ID]}.maf \\
phastcons/non_targets_4d_macro_non-conserved.mod

#For micro chromosomes (submitted as a loop)
scaffolds=("NC_052544.1" "NC_052545.1" "NC_052546.1" "NC_052547.1" "NC_052548.1" "NC_052549.1" "NC_052550.1" "NC_052551.1" "NC_052552.1" "NC_052553.1" "NC_052554.1" "NC_052555.1" "NC_052556.1" "NC_052557.1" "NC_052558.1" "NC_052559.1" "NC_052560.1" "NC_052561.1" "NC_052562.1" "NC_052563.1" "NC_052564.1" "NC_052565.1" "NC_052566.1" "NC_052567.1" "NC_052568.1" "NC_052569.1" "NC_052570.1" "NW_024095938.1" "NW_024095940.1" "NW_024095941.1" "NW_024095942.1" "NW_024095939.1" "NW_024095943.1" "NW_024095944.1" "NW_024095945.1" "NW_024095946.1" "NW_024095947.1" "NW_024095949.1" "NW_024095950.1" "NW_024095951.1" "NW_024095952.1" "NW_024095953.1" "NW_024095948.1" "NW_024095954.1" "NW_024095955.1" "NW_024095957.1" "NW_024095958.1" "NW_024095960.1" "NW_024095956.1" "NW_024095959.1" "NW_024095963.1" "NW_024095964.1" "NW_024095965.1" "NW_024095961.1" "NW_024095962.1" "NW_024095966.1" "NW_024095967.1" "NW_024095968.1" "NW_024095969.1" "NW_024095970.1" "NW_024095971.1" "NW_024095972.1" "NW_024095973.1" "NW_024095974.1" "NW_024095998.1" "NW_024096000.1" "NW_024096002.1" "NW_024096003.1" "NW_024096004.1" "NW_024096006.1" "NW_024096008.1" "NW_024096009.1" "NW_024096010.1" "NW_024096012.1" "NW_024096013.1" "NW_024096014.1" "NW_024096015.1" "NW_024096016.1" "NW_024096017.1" "NW_024096019.1" "NW_024096020.1" "NW_024096021.1" "NW_024096022.1" "NW_024096023.1" "NW_024096024.1" "NW_024096025.1" "NW_024096026.1" "NW_024096027.1" "NW_024096028.1" "NW_024096029.1" "NW_024096030.1" "NW_024096031.1" "NW_024096032.1" "NW_024096033.1" "NW_024096034.1" "NW_024096035.1" "NW_024096036.1" "NW_024096037.1" "NW_024096038.1" "NW_024096039.1" "NW_024096040.1" "NW_024096041.1" "NW_024096042.1" "NW_024096043.1" "NW_024096044.1" "NW_024096045.1" "NW_024096046.1" "NW_024096047.1" "NW_024096048.1" "NW_024096049.1" "NW_024096050.1" "NW_024096051.1" "NW_024096052.1" "NW_024096053.1" "NW_024096054.1" "NW_024096055.1" "NW_024096056.1" "NW_024096057.1" "NW_024096058.1" "NW_024096059.1" "NW_024096060.1" "NW_024096061.1" "NW_024096062.1" "NW_024096063.1" "NW_024096064.1" "NW_024096065.1" "NW_024096066.1" "NW_024096067.1" "NW_024096068.1" "NW_024096069.1" "NW_024096070.1" "NW_024096071.1" "NW_024096072.1" "NW_024096073.1" "NW_024096074.1" "NW_024096075.1" "NW_024096076.1" "NW_024096077.1" "NW_024096078.1" "NW_024096079.1" "NW_024096080.1" "NW_024096081.1" "NW_024096082.1" "NW_024096083.1" "NW_024096084.1" "NW_024096085.1" "NW_024096086.1" "NW_024096087.1" "NW_024096088.1" "NW_024096089.1" "NW_024096090.1" "NW_024096091.1" "NW_024096092.1" "NW_024096093.1" "NW_024096094.1" "NW_024096095.1" "NW_024096096.1" "NW_024096097.1" "NW_024096098.1" "NW_024096099.1" "NW_024096100.1" "NW_024096101.1" "NW_024096102.1" "NW_024096103.1" "NW_024095994.1" "NW_024095995.1" "NW_024095996.1" "NW_024095997.1" "NW_024095999.1" "NW_024096001.1" "NW_024096005.1" "NW_024096007.1" "NW_024096011.1" "NW_024096018.1")
for i in ${scaffolds[@]};
do
       phastCons --expected-length 45 --target-coverage 0.3 --rho 0.4 \\
       --gc 0.52 --estimate-rho phastcons/new_mod/non_targets_4d_micro_${i} \\
       --no-post-probs --msa-format MAF phastcons/non_targets_all_birds_${i}.maf \\
       phastcons/non_targets_4d_micro_non-conserved.mod
done

Use phyloBoot to get an average conserved rate for each set of Z, macro, and micro chromosomes.

#bash
phyloBoot --read-mods '*Z_cons.mod' --output-average non_targets_4d_Z_conserved.mod
phyloBoot --read-mods '*macro_cons.mod' --output-average non_targets_4d_macro_conserved.mod
phyloBoot --read-mods '*micro_cons.mod' --output-average non_targets_4d_micro_conserved.mod

3.4 Running phastCons

Run phastCons to get bed of conserved elements.

#bash
#phastcons for Z scaffold
phastCons --expected-length 45 --target-coverage 0.3 --most-conserved \\
cons_beds/conserved_Z_NC_052572.1.bed --msa-format MAF non_targets_all_birds_NC_052572.1.maf \\
non_targets_4d_Z_conserved.mod,non_targets_4d_Z_non-conserved.mod > \\
cons_wig/conserved_Z_NC_052572.1.wig

#phastcons for macro scaffolds
scaffolds=("NC_052532.1" "NC_052533.1" "NC_052534.1" "NC_052535.1" "NC_052536.1" "NC_052537.1" "NC_052538.1" "NC_052539.1" "NC_052540.1" "NC_052541.1" "NC_052542.1" "NC_052543.1" "NW_024095932.1" "NW_024095933.1" "NW_024095934.1" "NW_024095935.1" "NW_024095936.1" "NW_024095937.1")

for i in ${scaffolds[@]};
do
       phastCons --expected-length 45 --target-coverage 0.3 --most-conserved \\
       cons_beds/conserved_macro_${i}.bed --msa-format MAF non_targets_all_birds_${i}.maf \\
       non_targets_4d_macro_conserved.mod,non_targets_4d_macro_non-conserved.mod > \\
       cons_wig/conserved_macro_${i}.wig
done

#phastcons for micro scaffolds
scaffolds=("NC_052544.1" "NC_052545.1" "NC_052546.1" "NC_052547.1" "NC_052548.1" "NC_052549.1" "NC_052550.1" "NC_052551.1" "NC_052552.1" "NC_052553.1" "NC_052554.1" "NC_052555.1" "NC_052556.1" "NC_052557.1" "NC_052558.1" "NC_052559.1" "NC_052560.1" "NC_052561.1" "NC_052562.1" "NC_052563.1" "NC_052564.1" "NC_052565.1" "NC_052566.1" "NC_052567.1" "NC_052568.1" "NC_052569.1" "NC_052570.1" "NW_024095938.1" "NW_024095940.1" "NW_024095941.1" "NW_024095942.1" "NW_024095939.1" "NW_024095943.1" "NW_024095944.1" "NW_024095945.1" "NW_024095946.1" "NW_024095947.1" "NW_024095949.1" "NW_024095950.1" "NW_024095951.1" "NW_024095952.1" "NW_024095953.1" "NW_024095948.1" "NW_024095954.1" "NW_024095955.1" "NW_024095957.1" "NW_024095958.1" "NW_024095960.1" "NW_024095956.1" "NW_024095959.1" "NW_024095963.1" "NW_024095964.1" "NW_024095965.1" "NW_024095961.1" "NW_024095962.1" "NW_024095966.1" "NW_024095967.1" "NW_024095968.1" "NW_024095969.1" "NW_024095970.1" "NW_024095971.1" "NW_024095972.1" "NW_024095973.1" "NW_024095974.1" "NW_024095998.1" "NW_024096000.1" "NW_024096002.1" "NW_024096003.1" "NW_024096004.1" "NW_024096006.1" "NW_024096008.1" "NW_024096009.1" "NW_024096010.1" "NW_024096012.1" "NW_024096013.1" "NW_024096014.1" "NW_024096015.1" "NW_024096016.1" "NW_024096017.1" "NW_024096019.1" "NW_024096020.1" "NW_024096021.1" "NW_024096022.1" "NW_024096023.1" "NW_024096024.1" "NW_024096025.1" "NW_024096026.1" "NW_024096027.1" "NW_024096028.1" "NW_024096029.1" "NW_024096030.1" "NW_024096031.1" "NW_024096032.1" "NW_024096033.1" "NW_024096034.1" "NW_024096035.1" "NW_024096036.1" "NW_024096037.1" "NW_024096038.1" "NW_024096039.1" "NW_024096040.1" "NW_024096041.1" "NW_024096042.1" "NW_024096043.1" "NW_024096044.1" "NW_024096045.1" "NW_024096046.1" "NW_024096047.1" "NW_024096048.1" "NW_024096049.1" "NW_024096050.1" "NW_024096051.1" "NW_024096052.1" "NW_024096053.1" "NW_024096054.1" "NW_024096055.1" "NW_024096056.1" "NW_024096057.1" "NW_024096058.1" "NW_024096059.1" "NW_024096060.1" "NW_024096061.1" "NW_024096062.1" "NW_024096063.1" "NW_024096064.1" "NW_024096065.1" "NW_024096066.1" "NW_024096067.1" "NW_024096068.1" "NW_024096069.1" "NW_024096070.1" "NW_024096071.1" "NW_024096072.1" "NW_024096073.1" "NW_024096074.1" "NW_024096075.1" "NW_024096076.1" "NW_024096077.1" "NW_024096078.1" "NW_024096079.1" "NW_024096080.1" "NW_024096081.1" "NW_024096082.1" "NW_024096083.1" "NW_024096084.1" "NW_024096085.1" "NW_024096086.1" "NW_024096087.1" "NW_024096088.1" "NW_024096089.1" "NW_024096090.1" "NW_024096091.1" "NW_024096092.1" "NW_024096093.1" "NW_024096094.1" "NW_024096095.1" "NW_024096096.1" "NW_024096097.1" "NW_024096098.1" "NW_024096099.1" "NW_024096100.1" "NW_024096101.1" "NW_024096102.1" "NW_024096103.1" "NW_024095994.1" "NW_024095995.1" "NW_024095996.1" "NW_024095997.1" "NW_024095999.1" "NW_024096001.1" "NW_024096005.1" "NW_024096007.1" "NW_024096011.1" "NW_024096018.1")

for i in ${scaffolds[@]};
do
       phastCons --expected-length 45 --target-coverage 0.3 --most-conserved \\
       cons_beds/conserved_micro_${i}.bed --msa-format MAF non_targets_all_birds_${i}.maf \\
       non_targets_4d_micro_conserved.mod,non_targets_4d_micro_non-conserved.mod > \\
       cons_wig/conserved_micro_${i}.wig
done

3.5 phastCons for Taeniopygia guttata

For the Zebra Finch we repeated the steps as above with a few exceptions. We generated maf files with Taeniopygia guttata as reference. We used the same neutral and conserved mod files as for chicken. One last step is to convert the zebra finch conserved elements back to galgal7b coordinates. First we concatenated all the bed files into one file and sorted and merged overlapping elements. Next, we used halLiftOver using the 79 genome alignment hal file to liftover from Tgut to galgal7b coordinates. Finally, we sorted and merged (with d = 5) the elements

#bash
cat *.bed | bedtools sort -i "stdin" | bedtools merge -i "stdin" > conserved_tgut_merged.bed
#Use R to remove elements smaller than 20 bp and greater than 5000 bp to get conserved_tgut_merged_filt.bed

halLiftover all_birds_new.hal Taeniopygia_guttata conserved_tgut_merged_filt.bed Gallus_gallus7 conserved_tgut_galgal7.bed

bedtools sort -i conserved_tgut_galgal7.bed | bedtools merge -i "stdin" -d 5 > conserved_tgut_galgal7_d5.bed
#Use R to remove elements smaller than 20 bp and greater than 5000 bp to get conserved_tgut_galgal7_filt.bed

4 Combining the five datasets

Now that we have five files that contain different sets of conserved elements, the next step is to get one final list of conserved elements. First we combined the two phastcons elements into one and filtered elements less than 20 bp and greater than 5000 bp.

#bash
cat conserved_all_phastcon_filt.bed conserved_tgut_galgal7_filt.bed | bedtools sort -i "stdin" | bedtools merge -i "stdin" > conserved_phast2.bed
#Use R to remove elements smaller than 20 bp and greater than 5000 bp to get conserved_phast2_filt.bed

Next get unique elements from the ratite, UCSC, and vocal elements (i.e. remove any elements that overlaps partly or fully with the elements generated from the two phastcon runs). Then combine these unique elements with the phastcons derived elements.

#bash
bedtools subtract -A -a ratites_merged_sized.bed -b conserved_phast2_filt.bed > ratite_unique.bed
bedtools subtract -A -a UCSC_merged_sized.bed -b conserved_phast2_filt.bed > UCSC_unique.bed
bedtools subtract -A -a vocal_merged_sized.bed -b conserved_phast2_filt.bed > vocal_unique.bed

cat conserved_phast2_filt.bed ratite_unique.bed UCSC_unique.bed vocal_unique.bed | \\
bedtools sort -i "stdin" | bedtools merge -i "stdin" > final_working_conserved.bed

#R
infile <- read.table("final_working_conserved.bed", sep = "\t", header = F)
infile$size <- infile$V3 - infile$V2
infile <- infile[which(infile$size <= 5000 & infile$size >= 20),]

#Add an extra column for CE ID
infile$ID <- paste("ce", seq(1, nrow(infile)), sep = "")
write.table(infile[,c(1:3,5)], "final_working_conserved_filt.bed", 
            sep = "\t", row.names = F, col.names = F, quote = F)

This final working file has 1,117,392 elements.

5 Annotating the working file

We annotated the working file for type (whether it is an exonic element, intronic element, or intragenic element), source (which of the five datasets the element came from), atacseq (whether the element overlaps a known chicken atacseq peak), phylop score (rate and probability of conservation of element), fraction conserved (what fraction of the element is conserved among conserved sites in the 363 bird alignment). To get this annotation we need some other files.

5.1 Exon, Intron, or Intergenic

We need the gff file of the chicken to start with (get here) Then we can process the gff file to get a bed file of exons and introns (+UTRs). Intergenic regions are anything that is not within an exon or an intron.

#bash
grep -P "\tgene\t" galgal7b.gff | \
awk 'BEGIN{ FS="\t"; OFS=FS }{ print $1,$4-1,$5 }' > galgal7b_gene.bed

grep -P "\texon\t" galgal7b.gff | awk 'BEGIN{ FS="\t"; OFS=FS }{ print $1,$4-1,$5 }' | \
bedtools sort -i "stdin" | bedtools merge -i "stdin" > galgal7b_exon.bed

grep -P "\tCDS\t" galgal7b.gff | awk 'BEGIN{ FS="\t"; OFS=FS }{ print $1,$4-1,$5 }' | \
bedtools sort -i "stdin" | bedtools merge -i "stdin" > galgal7b_cds.bed

bedtools subtract -a galgal7b_exon.bed -b galgal7b_cds.bed > galgal7b_utr.bed

bedtools subtract -a galgal7b_gene.bed -b galgal7b_exon.bed > galgal7b_intron.bed

5.2 ATACseq

For code to get a list of ATACseq peaks from chicken, please see ATAC.Rmd

5.3 363 conserved elements

The conserved bases, based on galgal4, can be downloaded here. First we need to convert this from galgal4 to galgal7b. This can be done using the hal alignment.

#R
#First we need to convert from UCSC scaffold names to NCBI names
#remap the UCSC style scaffold names to NCBI scaffold names
map_file <- read.table("galgal4_map.txt", header = T, sep="\t")
infile <- read.table("ancRep_separate_models_rev.bw.conserved.bed", sep = "\t", header = F)
infile <- mapvalues(infile$V1, 
                    from = map_file$UCSC.style.name, 
                    to = map_file$RefSeq.Accn, 
                    warn_missing = F)
write.table(infile, "ancRep_separate_models_rev.bw.conserved_ncbi.bed", 
            sep = "\t", row.names = F, col.names = F, quote = F)

#bash
halLiftover all_birds_new.hal Gallus_gallus4 ancRep_separate_models_rev.bw.conserved_ncbi.bed \
Gallus_gallus7 ancRep_separate_models_rev.bw.conserved_galgal7.bed

bedtools sort -i ancRep_separate_models_rev.bw.conserved_galgal7.bed | bedtools merge -i "stdin" > \
ancRep_separate_models_rev.bw.conserved_galgal7_merge.bed

#then calculate overlap
bedtools intersect -wao -a final_working_conserved_filt.bed \
-b ancRep_separate_models_rev.bw.conserved_galgal7_merge.bed > \
final_working_conserved_filt_cov.bed

5.4 Phylop scores

For PhyloP first we need maf alignments of each conserved element from each scaffold. Three files Z_chrom.txt, macro_chrom.txt and micro_chrom.txt contains a list of all scaffolds corresponding to the specific group. This will create three folders for macro, micro, and Z and put all the maf files for each in each folder. Maf files need to be sorted, which can be done using this script. Then it will run phylop for each scaffold using the required non-conserved mod file (make sure files are in the directory)

#bash
files=("Z_chrom.txt","macro_chrom.txt","micro_chrom.txt")
labels=("Z","macro","micro")

mkdir -p beds/
mkdir -p phylop_vals/

for i in 0 1 2;
do
  while IFS= read -r line; do
    mkdir -p maf_${labels[$i]}_sort
    grep -P $line final_working_conserved_filt.bed > beds/${line}.bed
    hal2mafMP.py --numProc 6 --refGenome Gallus_gallus7 --noDupes --noAncestors \
    --refTargets beds/${line}.bed all_birds_new.hal maf_${labels[$i]}_sort/all_birds_${line}.maf
    #sort maffile
    maf-sort.sh maf_${labels[$i]}_sort/all_birds_${line}.maf > \
    maf_${labels[$i]}_sort/all_birds_${line}.sort.maf
    mv maf_${labels[$i]}_sort/all_birds_${line}.sort.maf  maf_${labels[$i]}_sort/all_birds_${line}.maf
    #run phylop
    phyloP --method LRT --mode CONACC -i MAF --features beds/${line}.bed \
    all_birds_4d_${labels[$i]}_non-conserved.mod maf_${labels[$i]}_sort/all_birds_${line}.maf > \
    phylop_vals/all_birds_phylop_${labels[$i]}_${line}.out
  done < ${files[$i]}
done

5.5 Getting the annotation

Now that we have all the files, plus one more that includes the element IDs for each of the ratite elements, we can finally annotate the bed file. First, we will use bedtools annotate to find overlaps of different bed files.

#bash
bedtools annotate -counts -i final_working_conserved_filt.bed -files \
galgal7b_exon.bed \
galgal7b_intron.bed \
galgal7b_utr.bed \
conserved_all_phastcon_filt.bed \
conserved_tgut_galgal7_filt.bed \
UCSC_merged_sized.bed \
ratites_merged_sized.bed \
vocal_merged_sized.bed \
atac_all.bed > \
final_working_conserved_filt_annot.bed

Then we will finish up with some cleanup in R and adding a few extra fields.

#R
final_data <- read.table("final_working_conserved_filt_annot.bed", sep = "\t", header = F)
final_data$exon <- final_data$V5 > 0
final_data$intron <- final_data$V6 > 0
final_data$utr <- final_data$V7 > 0
final_data$phastcon_gal <- final_data$V8 > 0
final_data$phastcon_tgut <- final_data$V9 > 0
final_data$UCSC <- final_data$V10 > 0
final_data$ratite <- final_data$V11 > 0
final_data$vocal <- final_data$V12 > 0
final_data$atac <- final_data$V13 > 0
#include ratite mce id
ratite_id <- read.table("ratite_name_match.txt", sep = "\t", header = F)
ratite_id <- ratite_id[,c(1:4,8)]
ratite_id <- aggregate(V8 ~ V1 + V2 + V3 + V4, data=ratite_id, paste0, sep = ",", collapse = "")
final_data <- merge(final_data, ratite_id, by=c("V1", "V2", "V3", "V4"), all=T)
final_data$V8.y <- gsub(",$", "", final_data$V8.y)
#add phylop scores
phylop <- read.table("all_birds_phylop.out", sep = "\t", header = F)
final_data <- merge(final_data, phylop, by=c("V1", "V2", "V3", "V4"), all=T)
final_data$length <- final_data$V3 - final_data$V2
final_data <- final_data[,c("V1", "V2", "V3", "V4", "length", 
                            "intron", "exon", "utr",
                            "phastcon_gal", "phastcon_tgut",
                            "UCSC", "vocal", "ratite", "V8.y",
                            "V5.y", "V6.y", "V7.y",
                            "atac")]
colnames(final_data) <- c("scaffold", "start", "end", "id", "length",
                          "intron", "exon", "utr",
                          "phastcon_gal", "phastcon_tgut",
                          "UCSC", "vocal", "ratite", "ratite_ID",
                          "phylop_scale", "phylop_lnlratio", "phylop_pval",
                          "atac")
#add logp
final_data$phylop_logp <- log10(abs(final_data$phylop_pval)+0.000001)
final_data[which(final_data$phylop_pval >= 0), "phylop_logp"] <- final_data[which(final_data$phylop_pval >= 0), "phylop_logp"] * -1

#add frac conserved based on 363 dataset
library(dplyr)
overlap <- read.table("final_working_conserved_filt_cov.bed", sep = "\t", header = F)
overlap_sum <- aggregate(V8 ~ V4, data=overlap, FUN = sum)
final_data <- merge(final_data, overlap_sum, by.x="id", by.y="V4")
final_data$frac_conserved <- final_data$V8/final_data$length

#remove frac_conserved for galgal4 loci that don't map perfectly to galgal7b
gal4_7_filter <- read.table("final_working_conserved_filt_galgal4-galgal7_merge.bed", sep="\t", header = F)
gal4_7_filter$nonunique <- grepl(",", gal4_7_filter$V4)
gal4_7_filter <- subset(gal4_7_filter, nonunique == FALSE)
gal4_7_filter$length_new <- gal4_7_filter$V3 - gal4_7_filter$V2
gal4_7_filter <- merge(gal4_7_filter, final_data[,c("id", "length")], by.x = "V4", by.y = "id", all.x = T, all.y = F)
gal4_7_filter$len_var <- ((gal4_7_filter$length - gal4_7_filter$length_new)/gal4_7_filter$length)*100
gal4_7_filter$len_prob <- gal4_7_filter$len_var >= 50 | gal4_7_filter$len_var <= -10
gal4_7_filter <- subset(gal4_7_filter, len_prob == FALSE)
gal4_7_unique <- setdiff(final_data$id, gal4_7_filter$V4)
final_data$frac_conserved[which(final_data$id %in% gal4_7_unique)] <- NA

#reorder columns
final_data <- final_data[,c(2:4,1,5:17,19,18,21)]

#sort by id
final_data <- final_data[gtools::mixedorder(final_data$id),]
#write dataframe
write.table(final_data, "conserved_elements_all.txt", sep = "\t", row.names = F, col.names = T, quote = F)

6 Final Statistics

The final dataset consists of 1,117,392 elements. The table contains the following elements:

Column	Description
scaffold	Scaffold Name in NCBI for galgal7b GCF_016699485.2
start	Start of element
end	End of element
id	Element ID
length	Length of element
intron	Elements overlapping an intron (702,466 elements)
exon	Elements overlapping an exon (322,888 elements); (Elements overlapping an intragenic region are ones absent in both exon and intron (310,242 elements))
utr	Elements in an exon but not part of CDS
phastcon_gal	Elements identified by Phastcons run with Gallus gallus reference (779,664 elements)
phastcon_tgut	Elements identified by Phastcons run with Taeniopygia guttata reference (783,583 elements)
UCSC	Elements identified by UCSC 100-way alignment (310,010 elements)
vocal	Elements identified by vocal learning dataset (270,104 elements)
ratite	Elements identified by ratite conserved element dataset (248,330 elements)
ratite_ID	ID of elements from ratite conserved elements dataset
phylop_scale	PhyloP scale
phylop_lnlratio	PhyloP log likelihood ratio
phylop_pval	PhyloP p value (negative values are accelerated; positive values are conserved)
phylop_logp	Log converstion of PhyloP p values
atac	Overlap with chicken ATACseq peaks
frac_conserved	Fraction of elements conserved (based on 363 bird alignment)

Cleanup: We set up some filters to remove sites that have phylop_scores less than 20.

A distribution of lengths, phylop_scale, phylop logp, fraction conserved are as follows:

#R

final_data <- read.table("conserved_elements_all.txt", sep = "\t", header = T)
final_data_filtered <- subset(final_data, phylop_scale <= 20)
final_data_filtered$length2 <- final_data_filtered$length
final_data_filtered[which(final_data_filtered$length >= 1000), "length2"] <- 1000
final_data_filtered$phylop_scale2 <- final_data_filtered$phylop_scale
final_data_filtered[which(final_data_filtered$phylop_scale >= 2), "phylop_scale2"] <- 2

final_data_filtered <- subset(final_data_filtered, phylop_scale < 0.6 | frac_conserved > 0.3)

#fwrite(final_data_filtered, "final_data_filtered.txt", sep = "\t", row.names = F, col.names = T, quote = F)
#fwrite(final_data_filtered[,c(1:4)], "final_data_filtered.bed", sep = "\t", row.names = F, col.names = F, quote = F)

Produce plots

library(gridExtra)
final_data_filtered <- read.table("final_data_filtered.txt", sep = "\t", header = T)

g1 <- ggplot(final_data_filtered, aes(x = final_data_filtered$length2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = mean(final_data_filtered$length), col = "red", linetype = "dashed") +
  labs(title="Distribution of length of elements",x="Length of Elements (bp)", y = "Count")+
  scale_x_continuous(limits = c(0,1100), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))

#weighted mean for phylop
final_data_filtered$phylop_scale_bp <- final_data_filtered$phylop_scale * final_data_filtered$length
#sum(final_data_filtered$phylop_scale_bp)/sum(final_data_filtered$length) #0.3756161

g2 <- ggplot(final_data_filtered, aes(x = phylop_scale2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = sum(final_data_filtered$phylop_scale_bp)/sum(final_data_filtered$length),
             col = "red", linetype = "dashed") +
  labs(title="Distribution of phylop scale of elements",x="Phylop Scale", y = "Count") +
  scale_x_continuous(limits = c(0,2.1), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))

g3 <- ggplot(final_data_filtered, aes(x = frac_conserved)) +
  geom_histogram(bins = 51) +
  theme_classic() +
  scale_x_continuous(limits = c(0,1.0), expand = c(0, 0)) +
  labs(title="Distribution of conserved fraction of elements",x="Conserved Fraction", y = "Count")

#for mean frac conserved
final_data_filtered$frac_conserved_bp <- final_data_filtered$frac_conserved * final_data_filtered$length
#sum(final_data_filtered$frac_conserved_bp, na.rm = T)/sum(final_data_filtered$length[which(!is.na(final_data_filtered$frac_conserved_bp))]) #0.5722735

gs <- arrangeGrob(g1, g2, g3, ncol = 2)
grid.arrange(gs)

Length by dataset

final_data_UCSC <- subset(final_data_filtered, UCSC == TRUE)
final_data_ratite <- subset(final_data_filtered, ratite == TRUE)
final_data_vocal <- subset(final_data_filtered, vocal == TRUE)
final_data_pcgal <- subset(final_data_filtered, phastcon_gal == TRUE)
final_data_pctgut <- subset(final_data_filtered, phastcon_tgut == TRUE)

g1 <- ggplot(final_data_UCSC, aes(x = final_data_UCSC$length2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = mean(final_data_UCSC$length), col = "red", linetype = "dashed") +
  labs(title="Distribution of length of elements-UCSC",x="Length of Elements (bp)", y = "Count")+
  scale_x_continuous(limits = c(0,1100), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g2 <- ggplot(final_data_ratite, aes(x = final_data_ratite$length2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = mean(final_data_ratite$length), col = "red", linetype = "dashed") +
  labs(title="Distribution of length of elements-ratite",x="Length of Elements (bp)", y = "Count")+
  scale_x_continuous(limits = c(0,1100), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g3 <- ggplot(final_data_vocal, aes(x = final_data_vocal$length2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = mean(final_data_vocal$length), col = "red", linetype = "dashed") +
  labs(title="Distribution of length of elements-vocal",x="Length of Elements (bp)", y = "Count")+
  scale_x_continuous(limits = c(0,1100), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g4 <- ggplot(final_data_pcgal, aes(x = final_data_pcgal$length2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = mean(final_data_pcgal$length), col = "red", linetype = "dashed") +
  labs(title="Distribution of length of elements-phastcons_gal",x="Length of Elements (bp)", y = "Count")+
  scale_x_continuous(limits = c(0,1100), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g5 <- ggplot(final_data_pctgut, aes(x = final_data_pctgut$length2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = mean(final_data_pctgut$length), col = "red", linetype = "dashed") +
  labs(title="Distribution of length of elements-phastcons_tgut",x="Length of Elements (bp)", y = "Count")+
  scale_x_continuous(limits = c(0,1100), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))

gs <- arrangeGrob(g1, g2, g3, g4, g5, ncol = 2)
grid.arrange(gs)

Conservation by dataset

g1 <- ggplot(final_data_UCSC, aes(x = phylop_scale2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = sum(final_data_UCSC$phylop_scale_bp)/sum(final_data_UCSC$length), 
             col = "red", linetype = "dashed") +
  labs(title="Distribution of phylop scale of elements-UCSC",x="Phylop Scale", y = "Count") +
  scale_x_continuous(limits = c(0,2.1), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g2 <- ggplot(final_data_ratite, aes(x = phylop_scale2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = sum(final_data_ratite$phylop_scale_bp)/sum(final_data_ratite$length),
             col = "red", linetype = "dashed") +
  labs(title="Distribution of phylop scale of elements-ratite",x="Phylop Scale", y = "Count") +
  scale_x_continuous(limits = c(0,2.1), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g3 <- ggplot(final_data_vocal, aes(x = phylop_scale2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = sum(final_data_vocal$phylop_scale_bp)/sum(final_data_vocal$length), 
             col = "red", linetype = "dashed") +
  labs(title="Distribution of phylop scale of elements-vocal",x="Phylop Scale", y = "Count") +
  scale_x_continuous(limits = c(0,2.1), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g4 <- ggplot(final_data_pcgal, aes(x = phylop_scale2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = sum(final_data_pcgal$phylop_scale_bp)/sum(final_data_pcgal$length), 
             col = "red", linetype = "dashed") +
  labs(title="Distribution of phylop scale of elements-phastcons_gal",x="Phylop Scale", y = "Count") +
  scale_x_continuous(limits = c(0,2.1), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g5 <- ggplot(final_data_pctgut, aes(x = phylop_scale2)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  geom_vline(xintercept = sum(final_data_pctgut$phylop_scale_bp)/sum(final_data_pctgut$length), 
             col = "red", linetype = "dashed") +
  labs(title="Distribution of phylop scale of elements-phastcons_tgut",x="Phylop Scale", y = "Count") +
  scale_x_continuous(limits = c(0,2.1), expand = c(0, 0)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))

gs <- arrangeGrob(g1, g2, g3, g4, g5, ncol = 2)
grid.arrange(gs)

fracs_conserved by dataset

g1 <- ggplot(final_data_UCSC, aes(x = frac_conserved)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  labs(title="Distribution of conserved fraction of elements-UCSC",x="Conserved Fraction", y = "Count")
g2 <- ggplot(final_data_ratite, aes(x = frac_conserved)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  labs(title="Distribution of conserved fraction of elements-ratite",x="Conserved Fraction", y = "Count")
g3 <- ggplot(final_data_vocal, aes(x = frac_conserved)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  labs(title="Distribution of conserved fraction of elements-vocal",x="Conserved Fraction", y = "Count")
g4 <- ggplot(final_data_pcgal, aes(x = frac_conserved)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  labs(title="Distribution of conserved fraction of elements-phastcons_gal",x="Conserved Fraction", y = "Count")
g5 <- ggplot(final_data_pctgut, aes(x = frac_conserved)) +
  geom_histogram(bins = 50) +
  theme_classic() +
  labs(title="Distribution of conserved fraction of elements-phastcons_tgut",x="Conserved Fraction", y = "Count")

gs <- arrangeGrob(g1, g2, g3, g4, g5, ncol = 2)
grid.arrange(gs)

Venn of all 5 elements

library(venn)
venn_list <- list(ratite = final_data_filtered$id[final_data_filtered$ratite],
                  vocal = final_data_filtered$id[final_data_filtered$vocal],
                  UCSC = final_data_filtered$id[final_data_filtered$UCSC], 
                  phastcons_gal = final_data_filtered$id[final_data_filtered$phastcon_gal], 
                  phastcons_tgut = final_data_filtered$id[final_data_filtered$phastcon_tgut])
venn(venn_list, ilabels = TRUE, zcolor = "style")

Overlap of ATAC elements and genomic features.

#number in exons
nrow(final_data_filtered[which(final_data_filtered$exon & !final_data_filtered$intron),])

## [1] 132316

#number in exons in ATAC
table(final_data_filtered$atac[which(final_data_filtered$exon & !final_data_filtered$intron)])

## 
## FALSE  TRUE 
## 84876 47440

#number in introns
nrow(final_data_filtered[which(!final_data_filtered$exon & final_data_filtered$intron),])

## [1] 395272

#number in introns in ATAC
table(final_data_filtered$atac[which(!final_data_filtered$exon & final_data_filtered$intron)])

## 
##  FALSE   TRUE 
## 242010 153262

#number in exon/intron boundary
nrow(final_data_filtered[which(final_data_filtered$exon & final_data_filtered$intron),])

## [1] 155074

#number in exon/intron boundary in ATAC
table(final_data_filtered$atac[which(final_data_filtered$exon & final_data_filtered$intron)])

## 
##  FALSE   TRUE 
## 108440  46634

#number in intragenic
nrow(final_data_filtered[which(!final_data_filtered$exon & !final_data_filtered$intron),])

## [1] 249805

#number in intragenic in ATAC
table(final_data_filtered$atac[which(!final_data_filtered$exon & !final_data_filtered$intron)])

## 
##  FALSE   TRUE 
## 153828  95977

7 Comparing conservation to non birds

We used the mammal phylop scores from Armstrong et al. (2020) to assess the conservation patterns of our elements to mammals. We followed the steps outlined here to convert the bigwig file to a bed file of conserved sites.

#bash
total=$(wc -l {input.phylop_scores} | cut -d ' ' -f1)
sort -k6,6g --parallel={resources.cpus} -T {params.tmp_dir} {input.phylop_scores} | \
    awk -v OFS="\\t" '{{ print $0, NR }}' | \
    sort -rn -k7 --parallel={resources.cpus} -T {params.tmp_dir} | \
    awk -v OFS="\\t" -v total=${{total}} 'BEGIN{{prev_pn=999}} {{p_adj=(total/$7)*$6; if(p_adj>prev_pn){{p_adj=prev_pn}};          prev_pn=p_adj; if(p_adj>1){{p_adj=1}}; sig=0; if(p_adj<0.05){{sig=1}}; print $0, p_adj, sig}}' | \
    sort -k 1,1 -k2,2n --parallel={resources.cpus} -T {params.tmp_dir} |\
    cut -f 1,2,3,5,8,9 2> {log} > {output.fdr_adj_scores}

Next we used the human to chicken liftover file from UCSC to convert from human (hg38) coordinates to galGal7b coordinates. Then we calculated overlap to the bird conserved elements.

#bash
liftOver hg38ToGCF_016699485.2.over.chain.gz 241-mammalian-2020v2b.phylop-scores.fdr-adj-0.05.conserved.bed 241-mammalian_hg38-galgal7b.bed unlifted_over_hg38.bed
bedtools sort -i 241-mammalian_hg38-galgal7b.bed | bedtools merge -i "stdin" > 241-mammalian_hg38-galgal7b_filtered.bed
bedtools intersect -wao -a final_data_filtered.bed -b 241-mammalian_hg38-galgal7b_filtered.bed > final_data_filetered_hg38.bed

Plot some statistics

#R
library(dplyr)
library(ggplot2)
final_data <- read.table("final_data_filtered.txt", sep = "\t", header=T)
overlap <- read.table("final_data_filtered_hg38.bed", sep = "\t", header = F)
overlap_sum <- aggregate(V8 ~ V4, data=overlap, FUN = sum)
final_data <- merge(final_data, overlap_sum, by.x="id", by.y="V4")
final_data$frac_conserved_mammal <- final_data$V8/final_data$length
final_data$frac_conserved_mammal[which(final_data$frac_conserved_mammal == 0)] <- NA
final_data$mammal_class <- NA
final_data$mammal_class[which(final_data$frac_conserved_mammal < 0.3)] <- "bird_conserved"
final_data$mammal_class[which(final_data$frac_conserved_mammal >= 0.3)] <- "all_conserved"
write.table(final_data, "final_data_filtered_mammal.txt", sep="\t", row.names = F, col.names = T, quote = F)

library(data.table)
final_data = fread("final_data_filtered_mammal.txt", header = T)
ggplot(final_data, aes(x = frac_conserved_mammal)) +
  geom_histogram(bins = 51) +
  theme_classic() +
  scale_x_continuous(limits = c(0,1.0), expand = c(0, 0)) +
  labs(title="Distribution of conserved fraction of elements",x="Conserved Fraction in mammals", y = "Count")

References

Armstrong, Joel, Glenn Hickey, Mark Diekhans, Ian T Fiddes, Adam M Novak, Alden Deran, Qi Fang, et al. 2020. “Progressive Cactus Is a Multiple-Genome Aligner for the Thousand-Genome Era.” Nature 587 (7833): 246–51.

Cahill, James A, Joel Armstrong, Alden Deran, Carolyn J Khoury, Benedict Paten, David Haussler, and Erich D Jarvis. 2021. “Positive Selection in Noncoding Genomic Regions of Vocal Learning Birds Is Associated with Genes Implicated in Vocal Learning and Speech Functions in Humans.” Genome Research 31 (11): 2035–49.

Sackton, Timothy B, Phil Grayson, Alison Cloutier, Zhirui Hu, Jun S Liu, Nicole E Wheeler, Paul P Gardner, et al. 2019. “Convergent Regulatory Evolution and Loss of Flight in Paleognathous Birds.” Science 364 (6435): 74–78.

Generating Conserved Elements

2024-04-29