Following includes codes used to generate an annotated list of conserved elements for birds. It uses bash and R codes interchangeably along with other instructions to run. Please make sure you are running the codes in the correct interface.
This markdown file documents the generation of a dataset of ATACseq
peaks for chicken, mapped to the bGalGal1.mat.broiler.GRCg7b
(henceforth only referred to as galgal7b). You can get the genome here.
We originally started with 116 fasta files comprising 43 unique tissues and treatments at different stages of chick development. We mapped them to the galgal7b genome. We omitted 15 fasta files (see bottom of table) due to quality of reads and problems with alaingments leaving us with 101 fasta files of 36 tissue types.
| Tissue_type | Replicate | SRA_accession | Total_reads_mapped | Data_source |
|---|---|---|---|---|
| cone cell e5 - | 1 | SRR11916725 | 400534 | Lonfat et al. (2021) |
| cone cell e5 - | 2 | SRR11916726 | 586532 | Lonfat et al. (2021) |
| retinal cell e5-6 | 1 | SRR10533482 | 619494 | Patoori et al. (2020) |
| retinal cell e5-6 | 2 | SRR10533483 | 1155904 | Patoori et al. (2020) |
| wing bud HH24 | 1 | SRR13428042 | 392412 | Jhanwar et al. (2021) |
| wing bud HH24 | 2 | SRR13428041 | 501428 | Jhanwar et al. (2021) |
| wing bud HH22 | 1 | SRR13428040 | 323406 | Jhanwar et al. (2021) |
| wing bud HH22 | 2 | SRR13428039 | 422077 | Jhanwar et al. (2021) |
| wing bud HH20 | 1 | SRR13428038 | 195410 | Jhanwar et al. (2021) |
| wing bud HH20 | 2 | SRR13428037 | 197882 | Jhanwar et al. (2021) |
| limb bud HH24 | 1 | SRR11878467 | 392158 | Malkmus et al. (2021) |
| limb bud HH24 | 2 | SRR11878466 | 501842 | Malkmus et al. (2021) |
| presomitic mesoderm HH14 | 1 | SRR10917286 | 2429861 | Mok et al. (2021) |
| presomitic mesoderm HH14 | 2 | SRR10917285 | 2801157 | Mok et al. (2021) |
| presomitic mesoderm HH14 | 3 | SRR10917274 | 2288277 | Mok et al. (2021) |
| epithelial somites HH14 | 1 | SRR10917266 | 1992608 | Mok et al. (2021) |
| epithelial somites HH14 | 2 | SRR10917265 | 2105774 | Mok et al. (2021) |
| epithelial somites HH14 | 3 | SRR10917264 | 1836757 | Mok et al. (2021) |
| hindlimb HH18 | 1 | SRR10059742 | 1023402 | Young et al. (2019) |
| hindlimb HH18 | 2 | SRR10059741 | 795418 | Young et al. (2019) |
| hindlimb HH18 | 3 | SRR10059740 | 603441 | Young et al. (2019) |
| forelimb HH18 | 1 | SRR10059737 | 915961 | Young et al. (2019) |
| forelimb HH18 | 2 | SRR10059738 | 781829 | Young et al. (2019) |
| forelimb HH18 | 3 | SRR10059739 | 855197 | Young et al. (2019) |
| flank HH18 | 1 | SRR10059734 | 687477 | Young et al. (2019) |
| flank HH18 | 2 | SRR10059735 | 661340 | Young et al. (2019) |
| flank HH18 | 3 | SRR10059736 | 881883 | Young et al. (2019) |
| neural crest HH9 | 1 | SRR10279641 | 465478 | Rothstein and Simoes-Costa (2020) |
| neural crest HH9 | 2 | SRR10279640 | 1052795 | Rothstein and Simoes-Costa (2020) |
| neural crest HH6 | 1 | SRR10279639 | 1188400 | Rothstein and Simoes-Costa (2020) |
| neural crest HH6 | 2 | SRR10279638 | 1828905 | Rothstein and Simoes-Costa (2020) |
| negative control NC2- E2- HH25 | 3 | SRR8492067 | 1308863 | Ling and Sauka-Spengler (2019) |
| negative control NC2- E2- HH25 | 2 | SRR8492066 | 1470813 | Ling and Sauka-Spengler (2019) |
| negative control NC2- E2- HH25 | 1 | SRR8492065 | 984023 | Ling and Sauka-Spengler (2019) |
| neural crest NC2- HH25 | 3 | SRR8492064 | 739201 | Ling and Sauka-Spengler (2019) |
| neural crest NC2- HH25 | 2 | SRR8492063 | 840937 | Ling and Sauka-Spengler (2019) |
| neural crest NC2- HH25 | 1 | SRR8492062 | 699955 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ HH25 | 3 | SRR8492061 | 908522 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ HH25 | 2 | SRR8492060 | 991016 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ HH25 | 1 | SRR8492059 | 717557 | Ling and Sauka-Spengler (2019) |
| negative control NC2- E2- HH18 | 3 | SRR8492058 | 943111 | Ling and Sauka-Spengler (2019) |
| negative control NC2- E2- HH18 | 2 | SRR8492057 | 1419320 | Ling and Sauka-Spengler (2019) |
| negative control NC2- E2- HH18 | 1 | SRR8492056 | 993025 | Ling and Sauka-Spengler (2019) |
| neural crest NC2- E2+ HH18 | 3 | SRR8492055 | 1080428 | Ling and Sauka-Spengler (2019) |
| neural crest NC2- E2+ HH18 | 2 | SRR8492054 | 1325955 | Ling and Sauka-Spengler (2019) |
| neural crest NC2- E2+ HH18 | 1 | SRR8492053 | 1506587 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ E2+ HH18 | 3 | SRR8492052 | 1657227 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ E2+ HH18 | 2 | SRR8492051 | 1863603 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ E2+ HH18 | 1 | SRR8492050 | 1656390 | Ling and Sauka-Spengler (2019) |
| neural crest E2+ HH12 | 3 | SRR8492049 | 591477 | Ling and Sauka-Spengler (2019) |
| neural crest E2+ HH12 | 2 | SRR8492048 | 746994 | Ling and Sauka-Spengler (2019) |
| neural crest E2+ HH12 | 1 | SRR8492047 | 685209 | Ling and Sauka-Spengler (2019) |
| negative control NC2- HH12 | 3 | SRR8492046 | 566067 | Ling and Sauka-Spengler (2019) |
| negative control NC2- HH12 | 2 | SRR8492045 | 640875 | Ling and Sauka-Spengler (2019) |
| negative control NC2- HH12 | 1 | SRR8492044 | 692517 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ HH12 | 3 | SRR8492043 | 1337110 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ HH12 | 2 | SRR8492042 | 654113 | Ling and Sauka-Spengler (2019) |
| neural crest NC2+ HH12 | 1 | SRR8492041 | 1474878 | Ling and Sauka-Spengler (2019) |
| non-neural crest 8-10ss | 1 | SRR8062752 | 327185 | Williams et al. (2019) |
| non-neural crest 8-10ss | 2 | SRR8062751 | 713343 | Williams et al. (2019) |
| non-neural crest 8-10ss | 3 | SRR8062750 | 544369 | Williams et al. (2019) |
| non-neural crest 8-10ss | 4 | SRR8062749 | 526987 | Williams et al. (2019) |
| non-neural crest 5-6ss | 1 | SRR8062748 | 406114 | Williams et al. (2019) |
| non-neural crest 5-6ss | 2 | SRR8062747 | 536168 | Williams et al. (2019) |
| non-neural crest 5-6ss | 3 | SRR8062746 | 490289 | Williams et al. (2019) |
| non-neural crest 5-6ss | 4 | SRR8062745 | 769000 | Williams et al. (2019) |
| neural crest 8-10ss FOXD3 | 1 | SRR8062744 | 643461 | Williams et al. (2019) |
| neural crest 8-10ss FOXD3 | 2 | SRR8062743 | 510546 | Williams et al. (2019) |
| neural crest 8-10ss FOXD3 | 3 | SRR8062742 | 505559 | Williams et al. (2019) |
| epiblast HH4 | 1 | SRR8062741 | 347257 | Williams et al. (2019) |
| epiblast HH4 | 2 | SRR8062740 | 355175 | Williams et al. (2019) |
| metatarsal skin e12 retinoic acid | 1 | SRR6782021 | 889266 | Lai et al. (2018) |
| metatarsal skin e12 retinoic acid | 2 | SRR6782020 | 911850 | Lai et al. (2018) |
| metatarsal skin e12 RCAS-GFP | 1 | SRR6782019 | 1672701 | Lai et al. (2018) |
| metatarsal skin e12 RCAS-GFP | 2 | SRR6782018 | 1622741 | Lai et al. (2018) |
| metatarsal skin e12 RCAS-CTNNB1 | 1 | SRR6782017 | 514530 | Lai et al. (2018) |
| metatarsal skin e12 RCAS-CTNNB1 | 2 | SRR6782016 | 332392 | Lai et al. (2018) |
| keel e10 | 1 | SRR6713569 | 799482 | Sackton et al. (2019) |
| keel e10 | 2 | SRR6713570 | 815108 | Sackton et al. (2019) |
| keel e10 | 3 | SRR6713592 | 705974 | Sackton et al. (2019) |
| hindlimb e4.5 | 1 | SRR6713571 | 1057708 | Sackton et al. (2019) |
| hindlimb e4.5 | 2 | SRR6713572 | 816446 | Sackton et al. (2019) |
| hindlimb e4.5 | 3 | SRR6713577 | 992959 | Sackton et al. (2019) |
| full sternum e10 | 1 | SRR6713573 | 1243533 | Sackton et al. (2019) |
| full sternum e10 | 2 | SRR6713574 | 847942 | Sackton et al. (2019) |
| full sternum e10 | 3 | SRR6713580 | 1091860 | Sackton et al. (2019) |
| inferior sternum e9 | 1 | SRR6713575 | 1128212 | Sackton et al. (2019) |
| inferior sternum e9 | 2 | SRR6713576 | 1171526 | Sackton et al. (2019) |
| inferior sternum e9 | 3 | SRR6713578 | 1127918 | Sackton et al. (2019) |
| forelimb e4.5 | 1 | SRR6713579 | 883046 | Sackton et al. (2019) |
| forelimb e4.5 | 2 | SRR6713585 | 351224 | Sackton et al. (2019) |
| forelimb e4.5 | 3 | SRR6713586 | 717558 | Sackton et al. (2019) |
| flight muscle e10 | 1 | SRR6713581 | 697283 | Sackton et al. (2019) |
| flight muscle e10 | 2 | SRR6713583 | 641133 | Sackton et al. (2019) |
| flight muscle e10 | 3 | SRR6713584 | 513558 | Sackton et al. (2019) |
| flight muscle e9 | 1 | SRR6713582 | 586575 | Sackton et al. (2019) |
| flight muscle e9 | 2 | SRR6713587 | 560014 | Sackton et al. (2019) |
| flight muscle e9 | 3 | SRR6713588 | 403399 | Sackton et al. (2019) |
| superior sternum e9 | 1 | SRR6713589 | 815243 | Sackton et al. (2019) |
| superior sternum e9 | 2 | SRR6713590 | 909182 | Sackton et al. (2019) |
| superior sternum e9 | 3 | SRR6713591 | 1292178 | Sackton et al. (2019) |
| 1 | SRR6418911 | 0 | Laugsch et al. (2018) | |
| 1 | SRR11916722 | 0 | Lonfat et al. (2021) | |
| 2 | SRR11916723 | 0 | Lonfat et al. (2021) | |
| 3 | SRR11916724 | 0 | Lonfat et al. (2021) | |
| 1 | SRR12965854 | 118497 | Crispatzu et al. (2021) | |
| 1 | SRR10917280 | 0 | Mok et al. (2021) | |
| 2 | SRR10917281 | 0 | Mok et al. (2021) | |
| 3 | SRR10917282 | 0 | Mok et al. (2021) | |
| 1 | SRR10917276 | 0 | Mok et al. (2021) | |
| 2 | SRR10917275 | 0 | Mok et al. (2021) | |
| 3 | SRR10917273 | 0 | Mok et al. (2021) | |
| 1 | SRR8062739 | 2405 | Williams et al. (2019) | |
| 1 | SRR8062737 | 23955 | Williams et al. (2019) | |
| 1 | SRR8062736 | 0 | Williams et al. (2019) | |
| 2 | SRR8062735 | 0 | Williams et al. (2019) |
We used sra toolkit to download the files and convert them to fastq files. You can download the toolkit from https://hpc.nih.gov/apps/sratoolkit.html.
#bash
#get SRA files
prefetch \
SRR11916725 SRR11916726 SRR8062740 SRR8062741 SRR10917264 SRR10917265 \
SRR10917266 SRR10059734 SRR10059735 SRR10059736 SRR6713581 SRR6713583 \
SRR6713584 SRR6713582 SRR6713587 SRR6713588 SRR6713579 SRR6713585 \
SRR6713586 SRR10059737 SRR10059738 SRR10059739 SRR6713573 SRR6713574 \
SRR6713580 SRR6713571 SRR6713572 SRR6713577 SRR10059740 SRR10059741 \
SRR10059742 SRR6713575 SRR6713576 SRR6713578 SRR6713569 SRR6713570 \
SRR6713592 SRR11878466 SRR11878467 SRR6782016 SRR6782017 SRR6782018 \
SRR6782019 SRR6782020 SRR6782021 SRR8492056 SRR8492057 SRR8492058 \
SRR8492065 SRR8492066 SRR8492067 SRR8492044 SRR8492045 SRR8492046 \
SRR8062742 SRR8062743 SRR8062744 SRR8492047 SRR8492048 SRR8492049 \
SRR10279638 SRR10279639 SRR10279640 SRR10279641 SRR8492053 SRR8492054 \
SRR8492055 SRR8492062 SRR8492063 SRR8492064 SRR8492050 SRR8492051 \
SRR8492052 SRR8492041 SRR8492042 SRR8492043 SRR8492059 SRR8492060 \
SRR8492061 SRR8062745 SRR8062746 SRR8062747 SRR8062748 SRR8062749 \
SRR8062750 SRR8062751 SRR8062752 SRR10917274 SRR10917285 SRR10917286 \
SRR10533482 SRR10533483 SRR6713589 SRR6713590 SRR6713591 SRR13428037 \
SRR13428038 SRR13428039 SRR13428040 SRR13428041 SRR13428042 -O ./
#convert SRA to fastq
mkdir -p fastq
fasterq-dump \
SRR11916725 SRR11916726 SRR8062740 SRR8062741 SRR10917264 SRR10917265 \
SRR10917266 SRR10059734 SRR10059735 SRR10059736 SRR6713581 SRR6713583 \
SRR6713584 SRR6713582 SRR6713587 SRR6713588 SRR6713579 SRR6713585 \
SRR6713586 SRR10059737 SRR10059738 SRR10059739 SRR6713573 SRR6713574 \
SRR6713580 SRR6713571 SRR6713572 SRR6713577 SRR10059740 SRR10059741 \
SRR10059742 SRR6713575 SRR6713576 SRR6713578 SRR6713569 SRR6713570 \
SRR6713592 SRR11878466 SRR11878467 SRR6782016 SRR6782017 SRR6782018 \
SRR6782019 SRR6782020 SRR6782021 SRR8492056 SRR8492057 SRR8492058 \
SRR8492065 SRR8492066 SRR8492067 SRR8492044 SRR8492045 SRR8492046 \
SRR8062742 SRR8062743 SRR8062744 SRR8492047 SRR8492048 SRR8492049 \
SRR10279638 SRR10279639 SRR10279640 SRR10279641 SRR8492053 SRR8492054 \
SRR8492055 SRR8492062 SRR8492063 SRR8492064 SRR8492050 SRR8492051 \
SRR8492052 SRR8492041 SRR8492042 SRR8492043 SRR8492059 SRR8492060 \
SRR8492061 SRR8062745 SRR8062746 SRR8062747 SRR8062748 SRR8062749 \
SRR8062750 SRR8062751 SRR8062752 SRR10917274 SRR10917285 SRR10917286 \
SRR10533482 SRR10533483 SRR6713589 SRR6713590 SRR6713591 SRR13428037 \
SRR13428038 SRR13428039 SRR13428040 SRR13428041 SRR13428042 -O ./fastq -e 10
We use NGmerge to remove adapters. You can get NGmerge at https://github.com/jsh58/NGmerge.
#bash
#We created a file acc_list.txt that lists all the accessions, one per line
mkdir -p fastq_ngmerge
while IFS= read -r line; do
NGmerge -a -1 ./fastq/${line}_1.fastq -2 ./fastq/${line}_2.fastq -o ./fastq_ngmerge/${line} -v -u 73 -n 6
done < "acc_list.txt"
We used bwa v0.7.17 and samtools v1.9 to map all the files to the galgal7b genome.
#bash
bwa index galgal7b.fna
samtools faidx galgal7b.fna
samtools dict galgal7b.fna -o galgal7b.dict
mkdir -p bams
while IFS= read -r line; do
bwa mem -t 10 galgal7b.fna ./fastq_ngmerge/${line}_1.fastq ./fastq_ngmerge/${line}_2.fastq | \
samtools view -b -u | \
samtools sort --threads 8 -n -o ./bams/${line}.bam
done < "acc_list.txt"
We called peaks two ways, one for each tissue type and one for all the files altogether. We used a -e filter to remove mitochondrial sequences.
#bash
mkdir -p beds
Genrich -t ./bams/SRR11916725.bam,./bams/SRR11916726.bam \
-o ./beds/cone_cell_e5_-.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10533482.bam,./bams/SRR10533483.bam \
-o ./beds/retinal_cell_e5-6.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR13428042.bam,./bams/SRR13428041.bam \
-o ./beds/wing_bud_HH24.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR13428040.bam,./bams/SRR13428039.bam \
-o ./beds/wing_bud_HH22.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR13428038.bam,./bams/SRR13428037.bam \
-o ./beds/wing_bud_HH20.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR11878467.bam,./bams/SRR11878466.bam \
-o ./beds/limb_bud_HH24.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10917286.bam,./bams/SRR10917285.bam,./bams/SRR10917274.bam \
-o ./beds/presomitic_mesoderm_HH14.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10917266.bam,./bams/SRR10917265.bam,./bams/SRR10917264.bam \
-o ./beds/epithelial_somites_HH14.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10059742.bam,./bams/SRR10059741.bam,./bams/SRR10059740.bam \
-o ./beds/hindlimb_HH18.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10059737.bam,./bams/SRR10059738.bam,./bams/SRR10059739.bam \
-o ./beds/forelimb_HH18.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10059734.bam,./bams/SRR10059735.bam,./bams/SRR10059736.bam \
-o ./beds/flank_HH18.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10279641.bam,./bams/SRR10279640.bam \
-o ./beds/neural_crest_HH9.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR10279639.bam,./bams/SRR10279638.bam \
-o ./beds/neural_crest_HH6.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492067.bam,./bams/SRR8492066.bam,./bams/SRR8492065.bam \
-o ./beds/negative_control_NC2-_E2-_HH25.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492064.bam,./bams/SRR8492063.bam,./bams/SRR8492062.bam \
-o ./beds/neural_crest_NC2-_HH25.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492061.bam,./bams/SRR8492060.bam,./bams/SRR8492059.bam \
-o ./beds/neural_crest_NC2+_HH25.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492058.bam,./bams/SRR8492057.bam,./bams/SRR8492056.bam \
-o ./beds/negative_control_NC2-_E2-_HH18.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492055.bam,./bams/SRR8492054.bam,./bams/SRR8492053.bam \
-o ./beds/neural_crest_NC2-_E2+_HH18.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492052.bam,./bams/SRR8492051.bam,./bams/SRR8492050.bam \
-o ./beds/neural_crest_NC2+_E2+_HH18.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492049.bam,./bams/SRR8492048.bam,./bams/SRR8492047.bam \
-o ./beds/neural_crest_E2+_HH12.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492046.bam,./bams/SRR8492045.bam,./bams/SRR8492044.bam \
-o ./beds/negative_control_NC2-_HH12.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8492043.bam,./bams/SRR8492042.bam,./bams/SRR8492041.bam \
-o ./beds/neural_crest_NC2+_HH12_.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8062752.bam,./bams/SRR8062751.bam,./bams/SRR8062750.bam,./bams/SRR8062749.bam \
-o ./beds/non-neural_crest_8-10ss.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR8062741.bam,./bams/SRR8062740.bam \
-o ./beds/epiblast_HH4.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6782021.bam,./bams/SRR6782020.bam \
-o ./beds/metatarsal_skin_e12_retinoic_acid.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6782019.bam,./bams/SRR6782018.bam \
-o ./beds/metatarsal_skin_e12_RCAS-GFP.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6782017.bam,./bams/SRR6782016.bam \
-o ./beds/metatarsal_skin_e12_RCAS-CTNNB1.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713569.bam,./bams/SRR6713570.bam,./bams/SRR6713592.bam \
-o ./beds/keel_e10.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713571.bam,./bams/SRR6713572.bam,./bams/SRR6713577.bam \
-o ./beds/hindlimb_e4.5.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713573.bam,./bams/SRR6713574.bam,./bams/SRR6713580.bam \
-o ./beds/full_sternum_e10.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713575.bam,./bams/SRR6713576.bam,./bams/SRR6713578.bam \
-o ./beds/inferior_sternum_e9.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713579.bam,./bams/SRR6713585.bam,./bams/SRR6713586.bam \
-o ./beds/forelimb_e4.5.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713581.bam,./bams/SRR6713583.bam,./bams/SRR6713584.bam \
-o ./beds/flight_muscle_e10.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713582.bam,./bams/SRR6713587.bam,./bams/SRR6713588.bam \
-o ./beds/flight_muscle_e9.bed -j -e NC_053523.1 -r
Genrich -t ./bams/SRR6713589.bam,./bams/SRR6713590.bam,./bams/SRR6713591.bam \
-o ./beds/superior_sternum_e9.bed -j -e NC_053523.1 -r
#Call all bams at once
Genrich -t \
./bams/SRR10059734.bam,./bams/SRR10059735.bam,./bams/SRR10059736.bam,\
./bams/SRR10059737.bam,./bams/SRR10059738.bam,./bams/SRR10059739.bam,\
./bams/SRR10059740.bam,./bams/SRR10059741.bam,./bams/SRR10059742.bam,\
./bams/SRR10279638.bam,./bams/SRR10279639.bam,./bams/SRR10279640.bam,\
./bams/SRR10279641.bam,./bams/SRR10533482.bam,./bams/SRR10533483.bam,\
./bams/SRR10917264.bam,./bams/SRR10917265.bam,./bams/SRR10917266.bam,\
./bams/SRR10917274.bam,./bams/SRR10917285.bam,./bams/SRR10917286.bam,\
./bams/SRR11878466.bam,./bams/SRR11878467.bam,./bams/SRR11916725.bam,\
./bams/SRR11916726.bam,./bams/SRR13428037.bam,./bams/SRR13428038.bam,\
./bams/SRR13428039.bam,./bams/SRR13428040.bam,./bams/SRR13428041.bam,\
./bams/SRR13428042.bam,./bams/SRR6713569.bam,./bams/SRR6713570.bam,\
./bams/SRR6713571.bam,./bams/SRR6713572.bam,./bams/SRR6713573.bam,\
./bams/SRR6713574.bam,./bams/SRR6713575.bam,./bams/SRR6713576.bam,\
./bams/SRR6713577.bam,./bams/SRR6713578.bam,./bams/SRR6713579.bam,\
./bams/SRR6713580.bam,./bams/SRR6713581.bam,./bams/SRR6713582.bam,\
./bams/SRR6713583.bam,./bams/SRR6713584.bam,./bams/SRR6713585.bam,\
./bams/SRR6713586.bam,./bams/SRR6713587.bam,./bams/SRR6713588.bam,\
./bams/SRR6713589.bam,./bams/SRR6713590.bam,./bams/SRR6713591.bam,\
./bams/SRR6713592.bam,./bams/SRR6782016.bam,./bams/SRR6782017.bam,\
./bams/SRR6782018.bam,./bams/SRR6782019.bam,./bams/SRR6782020.bam,\
./bams/SRR6782021.bam,./bams/SRR8062740.bam,./bams/SRR8062741.bam,\
./bams/SRR8062742.bam,./bams/SRR8062743.bam,./bams/SRR8062744.bam,\
./bams/SRR8062745.bam,./bams/SRR8062746.bam,./bams/SRR8062747.bam,\
./bams/SRR8062748.bam,./bams/SRR8062749.bam,./bams/SRR8062750.bam,\
./bams/SRR8062751.bam,./bams/SRR8062752.bam,./bams/SRR8492041.bam,\
./bams/SRR8492042.bam,./bams/SRR8492043.bam,./bams/SRR8492044.bam,\
./bams/SRR8492045.bam,./bams/SRR8492046.bam,./bams/SRR8492047.bam,\
./bams/SRR8492048.bam,./bams/SRR8492049.bam,./bams/SRR8492050.bam,\
./bams/SRR8492051.bam,./bams/SRR8492052.bam,./bams/SRR8492053.bam,\
./bams/SRR8492054.bam,./bams/SRR8492055.bam,./bams/SRR8492056.bam,\
./bams/SRR8492057.bam,./bams/SRR8492058.bam,./bams/SRR8492059.bam,\
./bams/SRR8492060.bam,./bams/SRR8492061.bam,./bams/SRR8492062.bam,\
./bams/SRR8492063.bam,./bams/SRR8492064.bam,./bams/SRR8492065.bam,\
./bams/SRR8492066.bam,./bams/SRR8492067.bam \
-o ./beds/all_qc.bed -j -e NC_053523.1 -r
To generate a combined dataset, we used the Genrich call file using all files at once and added all the peaks that were unique to each of the specific tissue type.
#bash
bedtools intersect -v -b beds/all_qc.bed -a beds/cone_cell_e5_-.bed > \
beds_intersect/cone_cell_e5_-_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/retinal_cell_e5-6.bed > \
beds_intersect/retinal_cell_e5-6_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/wing_bud_HH24.bed > \
beds_intersect/wing_bud_HH24_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/wing_bud_HH22.bed > \
beds_intersect/wing_bud_HH22_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/wing_bud_HH20.bed > \
beds_intersect/wing_bud_HH20_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/limb_bud_HH24.bed > \
beds_intersect/limb_bud_HH24_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/presomitic_mesoderm_HH14.bed > \
beds_intersect/presomitic_mesoderm_HH14_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/epithelial_somites_HH14.bed > \
beds_intersect/epithelial_somites_HH14_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/hindlimb_HH18.bed > \
beds_intersect/hindlimb_HH18_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/forelimb_HH18.bed > \
beds_intersect/forelimb_HH18_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/flank_HH18.bed > \
beds_intersect/flank_HH18_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_HH9.bed > \
beds_intersect/neural_crest_HH9_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_HH6.bed > \
beds_intersect/neural_crest_HH6_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/negative_control_NC2-_E2-_HH25.bed > \
beds_intersect/negative_control_NC2-_E2-_HH25_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_NC2-_HH25.bed > \
beds_intersect/neural_crest_NC2-_HH25_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_NC2+_HH25.bed > \
beds_intersect/neural_crest_NC2+_HH25_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/negative_control_NC2-_E2-_HH18.bed > \
beds_intersect/negative_control_NC2-_E2-_HH18_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_NC2-_E2+_HH18.bed > \
beds_intersect/neural_crest_NC2-_E2+_HH18_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_NC2+_E2+_HH18.bed > \
beds_intersect/neural_crest_NC2+_E2+_HH18_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_E2+_HH12.bed > \
beds_intersect/neural_crest_E2+_HH12_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/negative_control_NC2-_HH12.bed > \
beds_intersect/negative_control_NC2-_HH12_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/neural_crest_NC2+_HH12_.bed > \
beds_intersect/neural_crest_NC2+_HH12__intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/non-neural_crest_8-10ss.bed > \
beds_intersect/non-neural_crest_8-10ss_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/non-neural_crest_5-6ss.bed > \
beds_intersect/non-neural_crest_5-6ss_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/epiblast_HH4.bed > \
beds_intersect/epiblast_HH4_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/metatarsal_skin_e12_retinoic_acid.bed > \
beds_intersect/metatarsal_skin_e12_retinoic_acid_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/metatarsal_skin_e12_RCAS-GFP.bed > \
beds_intersect/metatarsal_skin_e12_RCAS-GFP_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/metatarsal_skin_e12_RCAS-bTNNB1.bed > beds_intersect/metatarsal_skin_e12_RCAS-bTNNB1_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/keel_e10.bed > \
beds_intersect/keel_e10_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/hindlimb_e4.5.bed > \
beds_intersect/hindlimb_e4.5_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/full_sternum_e10.bed > \
beds_intersect/full_sternum_e10_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/inferior_sternum_e9.bed > \
beds_intersect/inferior_sternum_e9_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/forelimb_e4.5.bed > \
beds_intersect/forelimb_e4.5_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/flight_muscle_e10.bed > \
beds_intersect/flight_muscle_e10_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/flight_muscle_e9.bed > \
beds_intersect/flight_muscle_e9_intersect.bed
bedtools intersect -v -b beds/all_qc.bed -a beds/superior_sternum_e9.bed > \
beds_intersect/superior_sternum_e9_intersect.bed
#cat all the uniques into one and then add to main file
cat beds_intersect/*.bed | bedtools sort -i "stdin" | \
bedtools merge -i "stdin" > beds/all_uniques.bed
cat beds/all_qc.bed beds/all_uniques.bed | bedtools sort -i "stdin" > atac_all.bed
To generate annotation we used bedtools annotate.
#bash
bedtools annotate -counts -i atac_all.bed -files \
beds/cone_cell_e5_-.bed beds/epiblast_HH4.bed beds/epithelial_somites_HH14.bed \
beds/flank_HH18.bed beds/flight_muscle_e10.bed beds/flight_muscle_e9.bed \
beds/forelimb_e4.5.bed beds/forelimb_HH18.bed beds/full_sternum_e10.bed \
beds/hindlimb_e4.5.bed beds/hindlimb_HH18.bed beds/inferior_sternum_e9.bed \
beds/keel_e10.bed beds/limb_bud_HH24.bed beds/metatarsal_skin_e12_RCAS-CTNNB1.bed \
beds/metatarsal_skin_e12_RCAS-GFP.bed beds/metatarsal_skin_e12_retinoic_acid.bed \
beds/negative_control_NC2-_E2-_HH18.bed beds/negative_control_NC2-_E2-_HH25.bed \
beds/negative_control_NC2-_HH12.bed beds/neural_crest_E2+_HH12.bed \
beds/neural_crest_HH6.bed beds/neural_crest_HH9.bed beds/neural_crest_NC2-_E2+_HH18.bed \
beds/neural_crest_NC2-_HH25.bed beds/neural_crest_NC2+_E2+_HH18.bed \
beds/neural_crest_NC2+_HH12_.bed beds/neural_crest_NC2+_HH25.bed \
beds/non-neural_crest_5-6ss.bed beds/non-neural_crest_8-10ss.bed \
beds/presomitic_mesoderm_HH14.bed beds/retinal_cell_e5-6.bed \
beds/superior_sternum_e9.bed beds/wing_bud_HH20.bed beds/wing_bud_HH22.bed \
beds/wing_bud_HH24.bed > atac_all_annot.txt
#Clean up in R to include header line
To generate some statistics for annotated ATACseq file.
library(data.table)
library(ggplot2)
library(gridExtra)
atac_data <- fread("atac_all_annot.txt")
atac_data$length <- atac_data$Stop - atac_data$Start
atac_data$length2 <- atac_data$length
atac_data$length2[which(atac_data$length >= 2000)] <- 2000
#Drop elements not called in individual dataset
atac_data$tissues <- rowSums(atac_data[,5:40])
atac_data <- subset(atac_data, tissues > 0)
g1 <- ggplot(atac_data, aes(x = atac_data$length2)) +
geom_histogram(bins = 50) +
theme_classic() +
geom_vline(xintercept = mean(atac_data$length), col = "red", linetype = "dashed") +
labs(title="Distribution of length of ATACseq elements",x="Length of Elements (bp)", y = "Count")+
scale_x_continuous(limits = c(0,2000), expand = c(0, 0)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
g2 <- ggplot(atac_data, aes(x = atac_data$tissues)) +
geom_histogram(bins = 36) +
theme_classic() +
labs(x="Shared ATACseq elements", y = "Count")+
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE), expand = expansion(mult = c(0, .1)))
gs <- arrangeGrob(g1, g2, ncol = 2)
grid.arrange(gs)
We used the annotations to create a presence (1)/absence (0) matrix and use IQTree2 to analyze the matrix in a tree-building framework. The matrix was subsequently formatted into a phylip like format for analysis with IQTree.
library(pheatmap)
atac_data_cor <- atac_data[,5:40]
#corr_map <- cor(atac_data_cor)
corr_map <- read.table("ATAC_corr_map.txt", header = T, row.names = 1, sep = "\t")
pheatmap(log(corr_map,base = 10))
To run the tree:
#bash
iqtree2-mpi -s all_tissues.phylip.varsites.phy --seqtype MORPH -m MK+ASC -T 12
Heatmap of elements
We hope to use these open chromatin regions for genomes on birds other than chicken. It is not well known whether a region that is open in a chicken is open in other avian species. An approach called TACIT was developed by Kaplow et al. (2022) that uses machine learning to predict open chromatin regions in different species using a training dataset. We adopted this approach to predict open chromatin regions in different birds species using data from chicken and emu.
The code is a python code modified from Kaplow et al. (2022) that takes a config file as input and train a neural network on ATACseq peaks get code here (machinations.py).
Here we use ATACseq calls for Hindlimbs at HH18 stage of chicken
(Gallus gallus) and emu (Dromaius novaehollandiae) and
use calls for tinamou (Eudromia elegans) to validate the
dataset. First, we need to create a dataset of positive (ATAC peak open)
and negative (ATAC peak closed). For this we compared elements between
chicken and emu and characterized any element that overlapped with one
another and present in both datasets as positive (open). For any
elements that were open in one species and absent in the other species
we binned the open elements in the positive category and the homologous
sequence of the absent element in the negative (closed) category. Then
we divided this into a training, validation, and evaluation datasets
(binned based on scaffold location). These files were then used to train
the model.
These steps include jumping from R to bash and back. Please make sure
you run the code in the appropriate framework.
First we need to create a bed file for the two species that include sequences of all the same size. The python code can accommodate different sizes but the missing data gets accounted as information for the neural network and the resulting assignments are segregated by size of element (without Ns) rather than the patterns within it.
#R
#getting pos and neg datasets
###inputs; i use a and b to designate two datasets. Replace a and b with respective bed files
#to run for any group without having to change any other info
bed_a <- "emu_hindlimb_HH18.bed"
bed_a_id <- "emu"
bed_b <- "galgal_hindlimb_HH18.bed"
bed_b_id <- "gal"
###outputs
working_folder <- "./"
###We need to make sure all elements are the same length. Here we chose 500 bp
ele_size <- 500
#A function to parse bed files from Genrich to output unique elements
#that are of the size we need
process_bed <- function(bed_file, id, size){
bed <- read.table(bed_file, sep="\t", header = F,
col.names = c("Scaffold", "Start", "Stop", "Peak_ID", "Score",
"Strand", "SignalValue", "p", "q", "Summit"))
bed$Peak_ID <- paste(id, bed$Peak_ID, sep="_")
bed$Length <- bed$Stop - bed$Start
bed$Summit_Base <- bed$Start + bed$Summit
bed$Summit_Start <- bed$Summit_Base - size/2
bed$Summit_Stop <- bed$Summit_Base + size/2 - 1
bed <- subset(bed, Summit_Start >= 0)
return(bed)
}
#Get bed files for the two species
bed_a.data <- process_bed(bed_a, bed_a_id, ele_size)
bed_b.data <- process_bed(bed_b, bed_b_id, ele_size)
#Write the output as a bed file
write.table(bed_a.data[,c("Scaffold", "Summit_Base", "Summit_Base", "Peak_ID")],
paste(working_folder, "bed_a_summit.bed", sep=""),
quote = F, row.names = F, col.names = F, sep="\t")
write.table(bed_b.data[,c("Scaffold", "Summit_Base", "Summit_Base", "Peak_ID")],
paste(working_folder, "bed_b_summit.bed", sep=""),
quote = F, row.names = F, col.names = F, sep="\t")
Next we will clean up the bed files a bit
#bash
bedtools merge -o count,distinct -d 500 -i bed_a_summit.bed > bed_a_unique_summit.bed
bedtools merge -o count,distinct -d 500 -i bed_b_summit.bed > bed_b_unique_summit.bed
This step removes closeby peaks that overlapped when we extended a peak size to 500 bp.
#R
bed_a.uniq <- read.table(paste(working_folder, "bed_a_unique_summit.bed", sep=""), sep="\t", header = F,
col.names = c("Scaffold", "Summit_Start", "Summit_Stop", "Count", "Peak_ID"))
bed_b.uniq <- read.table(paste(working_folder, "bed_b_unique_summit.bed", sep=""), sep="\t", header = F,
col.names = c("Scaffold", "Summit_Start", "Summit_Stop", "Count", "Peak_ID"))
#if peaks overlapped, only choose the first peak among the mix
bed_a.uniq$Peak_ID <- gsub(",.*", "", bed_a.uniq$Peak_ID)
bed_b.uniq$Peak_ID <- gsub(",.*", "", bed_b.uniq$Peak_ID)
write.table(bed_a.data[bed_a.data$Peak_ID %in% bed_a.uniq$Peak_ID,
c("Scaffold", "Summit_Start", "Summit_Stop", "Peak_ID")],
paste(working_folder, "bed_a_peaks_500.bed", sep=""),
quote = F, row.names = F, col.names = F, sep = "\t")
write.table(bed_b.data[bed_b.data$Peak_ID %in% bed_b.uniq$Peak_ID,
c("Scaffold", "Summit_Start", "Summit_Stop", "Peak_ID")],
paste(working_folder, "bed_b_peaks_500.bed", sep=""),
quote = F, row.names = F, col.names = F, sep = "\t")
To compare the two datasets, we need it to be in the same genetic coordinate system. For this we will use the hal alignment to liftover the emu sequences into galgal7b and vice versa.
#bash
#using hal to convert each to the other species coordinates and then use sort and merge to get merged peaks
halLiftover all_birds_new.hal Gallus_gallus7 bed_b_peaks_500.bed Dromaius_novaehollandiae bed_b2a_peaks_500.bed
halLiftover all_birds_new.hal Dromaius_novaehollandiae bed_a_peaks_500.bed Gallus_gallus7 bed_a2b_peaks_500.bed
#clean up the liftover files, we'll consider anything 100 bp nearby as being a convergent peak
bedtools sort -i bed_b2a_peaks_500.bed | bedtools merge -o distinct -d 100 -i stdin > bed_b2a_peaks_merged.bed
bedtools sort -i bed_a2b_peaks_500.bed | bedtools merge -o distinct -d 100 -i stdin > bed_a2b_peaks_merged.bed
#get intersects
bedtools intersect -a bed_a_peaks_500.bed -b bed_b2a_peaks_merged.bed -wa -wb > bed_a_intersect.bed
bedtools intersect -a bed_b_peaks_500.bed -b bed_a2b_peaks_merged.bed -wa -wb > bed_b_intersect.bed
More cleanup in R to make all the sequences in the lifted over files again 500 bp
#R
bed_a2b_merged <- read.table(paste(working_folder, "bed_a2b_peaks_merged.bed", sep=""), sep="\t", header = F,
col.names = c("Scaffold", "Summit_Start", "Summit_Stop", "Peak_ID"))
bed_b2a_merged <- read.table(paste(working_folder, "bed_b2a_peaks_merged.bed", sep=""), sep="\t", header = F,
col.names = c("Scaffold", "Summit_Start", "Summit_Stop", "Peak_ID"))
#remove peaks that span big gaps, or merge two original peaks
bed_a2b_merged <- bed_a2b_merged[!(duplicated(bed_a2b_merged$Peak_ID)|duplicated(bed_a2b_merged$Peak_ID, fromLast=TRUE)),]
bed_b2a_merged <- bed_b2a_merged[!(duplicated(bed_b2a_merged$Peak_ID)|duplicated(bed_b2a_merged$Peak_ID, fromLast=TRUE)),]
bed_a2b_merged <- bed_a2b_merged[!grepl(",", bed_a2b_merged$Peak_ID),]
bed_b2a_merged <- bed_b2a_merged[!grepl(",", bed_b2a_merged$Peak_ID),]
#get length of converted elements and keep elements that are between 400 and 600 bps
bed_a2b_merged$Length <- bed_a2b_merged$Summit_Stop - bed_a2b_merged$Summit_Start
bed_b2a_merged$Length <- bed_b2a_merged$Summit_Stop - bed_b2a_merged$Summit_Start
bed_a2b_merged <- bed_a2b_merged[which(bed_a2b_merged$Length > 400 &
bed_a2b_merged$Length < 600),]
bed_b2a_merged <- bed_b2a_merged[which(bed_b2a_merged$Length > 400 &
bed_b2a_merged$Length < 600),]
#readjust lengths of elements to equal 500 bps, remove peaks that are less than 500 bp
bed_a2b_merged$Chop_Left <- ceiling((bed_a2b_merged$Length - 500)/2)+bed_a2b_merged$Summit_Start
bed_a2b_merged$Chop_Right <- bed_a2b_merged$Summit_Stop - floor((bed_a2b_merged$Length - 500)/2) - 1
bed_b2a_merged$Chop_Left <- ceiling((bed_b2a_merged$Length - 500)/2)+bed_b2a_merged$Summit_Start
bed_b2a_merged$Chop_Right <- bed_b2a_merged$Summit_Stop - floor((bed_b2a_merged$Length - 500)/2) - 1
bed_a2b_merged <- subset(bed_a2b_merged, Chop_Left >= 0)
bed_b2a_merged <- subset(bed_b2a_merged, Chop_Left >= 0)
#intersects
bed_a_intersects <- read.table(paste(working_folder, "bed_a_intersect.bed", sep=""), sep="\t", header = F)
bed_b_intersects <- read.table(paste(working_folder, "bed_b_intersect.bed", sep=""), sep="\t", header = F)
#get files in order and get samtools formatted output to get fasta files
#some fasta files will be truncated because we do not account for the size of scaffolds
positives_a <- bed_a.data[bed_a.data$Peak_ID %in% bed_a.uniq$Peak_ID,
c("Scaffold", "Summit_Start", "Summit_Stop", "Peak_ID")]
positives_b <- bed_b.data[bed_b.data$Peak_ID %in% bed_b.uniq$Peak_ID,
c("Scaffold", "Summit_Start", "Summit_Stop", "Peak_ID")]
positives_a$Samtool <- paste(positives_a$Scaffold, ":", positives_a$Summit_Start, "-", positives_a$Summit_Stop, sep="")
positives_b$Samtool <- paste(positives_b$Scaffold, ":", positives_b$Summit_Start, "-", positives_b$Summit_Stop, sep="")
#for negatives we want all the things that did not intersect
negatives_a_id <- setdiff(bed_a.data$Peak_ID, bed_a_intersects$V4)
negatives_b_id <- setdiff(bed_b.data$Peak_ID, bed_b_intersects$V4)
negatives_a <- bed_a2b_merged[bed_a2b_merged$Peak_ID %in% negatives_a_id,
c("Scaffold", "Chop_Left", "Chop_Right", "Peak_ID")]
negatives_b <- bed_b2a_merged[bed_b2a_merged$Peak_ID %in% negatives_b_id,
c("Scaffold", "Chop_Left", "Chop_Right", "Peak_ID")]
negatives_a$Samtool <- paste(negatives_a$Scaffold, ":", negatives_a$Chop_Left, "-", negatives_a$Chop_Right, sep="")
negatives_b$Samtool <- paste(negatives_b$Scaffold, ":", negatives_b$Chop_Left, "-", negatives_b$Chop_Right, sep="")
#write these positives and negatives bed file
write.table(positives_a, paste(working_folder, "positives_a.bed", sep=""),
quote = F, row.names = F, col.names = F, sep = "\t")
write.table(positives_b, paste(working_folder, "positives_b.bed", sep=""),
quote = F, row.names = F, col.names = F, sep = "\t")
write.table(negatives_a, paste(working_folder, "negatives_a.bed", sep=""),
quote = F, row.names = F, col.names = F, sep = "\t")
write.table(negatives_b, paste(working_folder, "negatives_b.bed", sep=""),
quote = F, row.names = F, col.names = F, sep = "\t")
Since ATAC peaks within in coding loci might have different patterns than those in intragenic regions, we will filter out all the peaks that overlap a coding loci. We will use a gff file containing only coding cds for each species.
#bash
bedtools intersect -a positives_a.bed -b dronov_cds.bed -v > positives_a_nocds.bed
bedtools intersect -a positives_b.bed -b galgal_cds.bed -v > positives_b_nocds.bed
bedtools intersect -a negatives_a.bed -b dronov_cds.bed -v > negatives_a_nocds.bed
bedtools intersect -a negatives_b.bed -b galgal_cds.bed -v > negatives_b_nocds.bed
Filter out these regions that overlap a coding region
#R
positives_a_cds <- read.table(paste(working_folder, "positives_a_nocds.bed", sep=""), header = F,
col.names = c("Scaffold", "Start", "Stop", "Peak_ID", "Samtools"))
positives_b_cds <- read.table(paste(working_folder, "positives_b_nocds.bed", sep=""), header = F,
col.names = c("Scaffold", "Start", "Stop", "Peak_ID", "Samtools"))
negatives_a_cds <- read.table(paste(working_folder, "negatives_a_nocds.bed", sep=""), header = F,
col.names = c("Scaffold", "Start", "Stop", "Peak_ID", "Samtools"))
negatives_b_cds <- read.table(paste(working_folder, "negatives_b_nocds.bed", sep=""), header = F,
col.names = c("Scaffold", "Start", "Stop", "Peak_ID", "Samtools"))
#At the same time I will also divvy up the data into training, validation, and evaluation datsets. For this I will use the samtools fai index file to get the size of each scaffold for that species, sort it by size and distribute it into training, validation, and evalutaion datasets.
chrom_sizes_a <- read.table("Dromaius_novaehollandiae_droNov1.fna.fai", sep="\t", header = F, stringsAsFactors = F)
chrom_sizes_a <- chrom_sizes_a[,c(1,2)]
chrom_sizes_a <- chrom_sizes_a[order(chrom_sizes_a$V2, decreasing = T),]
chrom_sizes_a$order <- seq(1,nrow(chrom_sizes_a))
#First two will go for training
positives_a_train <- subset(positives_a_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(1,60,4)]])
positives_a_train <- rbind(positives_a_train, subset(positives_a_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(2,60,4)]]))
#Third will go for validation
positives_a_valid <- subset(positives_a_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(3,60,4)]])
#Fourth will go for evaluation
positives_a_eval <- subset(positives_a_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(4,60,4)]])
#Same for the negative datasets
negatives_b_train <- subset(negatives_b_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(1,60,4)]])
negatives_b_train <- rbind(negatives_b_train, subset(negatives_b_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(2,60,4)]]))
negatives_b_valid <- subset(negatives_b_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(3,60,4)]])
negatives_b_eval <- subset(negatives_b_cds, Scaffold %in% chrom_sizes_a$V1[chrom_sizes_a$order[seq(4,60,4)]])
#Write the output in samtools format
write.table(positives_a_train$Samtools, paste(working_folder, "positives_a_train.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(positives_a_valid$Samtools, paste(working_folder, "positives_a_valid.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(positives_a_eval$Samtools, paste(working_folder, "positives_a_eval.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(negatives_b_train$Samtools, paste(working_folder, "negatives_b_train.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(negatives_b_valid$Samtools, paste(working_folder, "negatives_b_valid.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(negatives_b_eval$Samtools, paste(working_folder, "negatives_b_eval.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
#Repeat for the second species
chrom_sizes_b <- read.table("galgal7b.fna.fai", sep="\t", header = F, stringsAsFactors = F)
chrom_sizes_b <- chrom_sizes_b[,c(1,2)]
chrom_sizes_b <- chrom_sizes_b[order(chrom_sizes_b$V2, decreasing = T),]
chrom_sizes_b$order <- seq(1,nrow(chrom_sizes_b))
positives_b_train <- subset(positives_b_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(1,60,4)]])
positives_b_train <- rbind(positives_b_train, subset(positives_b_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(2,60,4)]]))
positives_b_valid <- subset(positives_b_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(3,60,4)]])
positives_b_eval <- subset(positives_b_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(4,60,4)]])
negatives_a_train <- subset(negatives_a_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(1,60,4)]])
negatives_a_train <- rbind(negatives_a_train, subset(negatives_a_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(2,60,4)]]))
negatives_a_valid <- subset(negatives_a_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(3,60,4)]])
negatives_a_eval <- subset(negatives_a_cds, Scaffold %in% chrom_sizes_b$V1[chrom_sizes_b$order[seq(4,60,4)]])
write.table(positives_b_train$Samtools, paste(working_folder, "positives_b_train.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(positives_b_valid$Samtools, paste(working_folder, "positives_b_valid.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(positives_b_eval$Samtools, paste(working_folder, "positives_b_eval.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(negatives_a_train$Samtools, paste(working_folder, "negatives_a_train.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(negatives_a_valid$Samtools, paste(working_folder, "negatives_a_valid.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
write.table(negatives_a_eval$Samtools, paste(working_folder, "negatives_a_eval.txt", sep=""), row.names = F, col.names = F, sep="\t", quote = F)
Get the fasta files for each grouping and then combine the different groups for the two species into one. We can finally use this for in the machinations.py file.
#bash
samtools faidx Dromaius_novaehollandiae_droNov1.fna -r positives_a_train.txt > positives_a_train.fa
samtools faidx Dromaius_novaehollandiae_droNov1.fna -r positives_a_valid.txt > positives_a_valid.fa
samtools faidx Dromaius_novaehollandiae_droNov1.fna -r positives_a_eval.txt > positives_a_eval.fa
samtools faidx Dromaius_novaehollandiae_droNov1.fna -r negatives_b_train.txt > negatives_b_train.fa
samtools faidx Dromaius_novaehollandiae_droNov1.fna -r negatives_b_valid.txt > negatives_b_valid.fa
samtools faidx Dromaius_novaehollandiae_droNov1.fna -r negatives_b_eval.txt > negatives_b_eval.fa
samtools faidx galgal7b.fna -r positives_b_train.txt > positives_b_train.fa
samtools faidx galgal7b.fna -r positives_b_valid.txt > positives_b_valid.fa
samtools faidx galgal7b.fna -r positives_b_eval.txt > positives_b_eval.fa
samtools faidx galgal7b.fna -r negatives_a_train.txt > negatives_a_train.fa
samtools faidx galgal7b.fna -r negatives_a_valid.txt > negatives_a_valid.fa
samtools faidx galgal7b.fna -r negatives_a_eval.txt > negatives_a_eval.fa
##combine a and b into one
cat positives_a_train.fa positives_b_train.fa > positives_train.fa
cat positives_a_valid.fa positives_b_valid.fa > positives_valid.fa
cat positives_a_eval.fa positives_b_eval.fa > positives_eval.fa
cat negatives_a_train.fa negatives_b_train.fa > negatives_train.fa
cat negatives_a_valid.fa negatives_b_valid.fa > negatives_valid.fa
cat negatives_a_eval.fa negatives_b_eval.fa > negatives_eval.fa
Use the config file to run the python script. Once the model is made and evaluated you can use the same file to evaluate any other fasta file.
To check for ATAC-seq peaks, we used the center of the accelerated elements sequences to get a 500 bp region around the element and then used that as an evaluation dataset. We did this for all 79 species.
#for atacseq get 500 bp regions from center of CE
final_data_filtered <- fread("final_data_filtered.txt", sep = "\t", header = T)
phyloacc_subset <- fread("accelerated_elements_all.txt", sep = "\t")
atac_ce <- final_data_filtered[which(final_data_filtered$id %in% phyloacc_subset$original.id), 1:4]
atac_ce$center <- floor((atac_ce$start+atac_ce$end)/2)
atac_ce$center2 <- atac_ce$center + 1
fwrite(atac_ce[,c("scaffold", "center", "center2", "id")], "ce_accl_center_15k.bed", quote = F, sep = "\t", row.names = F, col.names = F)
Liftover the center region to all other species, add 250 bp to the front and back and then get sequences for each species.
#bash
species=("Gavialis_gangeticus" "Struthio_camelus" "Rhea_americana" "Eudromia_elegans" "Dromaius_novaehollandiae" "Anas_platyrhynchos" "Gallus_gallus7" "Phoenicopterus_ruber" "Columba_livia" "Tauraco_erythrolophus" "Cuculus_canorus" "Caprimulgus_europaeus" "Apus_apus" "Calypte_anna" "Porphyrio_hochstetteri" "Pluvialis_apricaria" "Sterna_hirundo" "Aptenodytes_patagonicus" "Aptenodytes_forsteri" "Pygoscelis_adeliae" "Pygoscelis_papua" "Pygoscelis_antarcticus" "Megadyptes_antipodes" "Eudyptula_novaehollandiae" "Eudyptula_albosignata" "Eudyptula_minor" "Spheniscus_demersus" "Spheniscus_humboldti" "Spheniscus_magellanicus" "Eudyptes_pachyrhynchus" "Eudyptes_sclateri" "Eudyptes_chrysolophus" "Eudyptes_schlegeli" "Eudyptes_chrysocome" "Eudyptes_filholi" "Eudyptes_moseleyi" "Macronectes_giganteus" "Theristicus_caerulescens" "Aquila_chrysaetos" "Athene_cunicularia" "Trogon_surrucura" "Bucorvus_abyssinicus" "Halcyon_senegalensis" "Chloroceryle_aenea" "Actenoides_hombroni" "Merops_nubicus" "Pogoniulus_pusillus" "Colaptes_auratus" "Falco_peregrinus" "Melopsittacus_undulatus" "Acanthisitta_chloris" "Myiozetetes_cayanensis" "Chiroxiphia_lanceolata" "Malurus_cyaneus" "Lichenostomus_cassidix" "Lanius_collurio" "Corvus_cornix" "Eremophila_alpestris" "Parus_major" "Acrocephalus_scirpaceus" "Progne_subis" "Riparia_riparia" "Hirundo_rustica" "Brachypodius_melanocephalos" "Pycnonotus_jocosus" "Phylloscopus_trochilus" "Sylvia_atricapilla" "Certhia_americana" "Sturnus_vulgaris" "Ficedula_albicollis" "Taeniopygia_guttata" "Passer_domesticus" "Motacilla_alba" "Serinus_canaria" "Emberiza_elegans" "Molothrus_ater" "Setophaga_coronata" "Diglossa_brunneiventris")
for i in ${species[@]};
do
halLiftover all_birds_new.hal Gallus_gallus7 ce_accl_center_15k.bed $i regions_new/ce_accl_center_${i}.bed
done
# clean the bed files
for file in regions_new/*.bed
do
file="$(basename file .bed)"
bedtools merge -d 20 -c 4 -o "distinct" $file |
awk 'BEGIN{ FS="\t"; OFS=FS }{ if(($3 -$2) <= 500)) print $1,$2,$3,$4 }'
done
#Add 250bp to either side of the center
while IFS= read -r line
do
awk '{ if ($2-250>0) print $1":"$2-250"-"$3+248 }' ./regions_new/ce_accl_center_${line}.bed > ./regions/ce_accl_center_${line}.txt
done < "all_species.txt"
#Get fasta sequences for each species, line1 is the species name and line2 is the genome location
while read -r line1 line2
do
samtools faidx $line2 -r regions/ce_accl_center_${line1}.txt > fastas/ce_accl_center_${line1}.fa
done < "new_bird_aln.txt"
#Concatenate the sequences
cat fastas/*.fa > all_fastas_15k.fa
#Use the fasta as an evaluation dataset in the python script above
#R
#parse atac probs for elements
atac_probs <- fread("hindlimb_HH18_epoch20_lay450_batch64_lr1e7_eval_all_atac_prob.txt", sep = "\t")
species=c("Gavialis_gangeticus", "Struthio_camelus", "Rhea_americana", "Eudromia_elegans",
"Dromaius_novaehollandiae", "Anas_platyrhynchos", "Gallus_gallus7", "Phoenicopterus_ruber",
"Columba_livia", "Tauraco_erythrolophus", "Cuculus_canorus", "Caprimulgus_europaeus",
"Apus_apus", "Calypte_anna", "Porphyrio_hochstetteri", "Pluvialis_apricaria",
"Sterna_hirundo", "Aptenodytes_patagonicus", "Aptenodytes_forsteri", "Pygoscelis_adeliae",
"Pygoscelis_papua", "Pygoscelis_antarcticus", "Megadyptes_antipodes", "Eudyptula_novaehollandiae",
"Eudyptula_albosignata", "Eudyptula_minor", "Spheniscus_demersus", "Spheniscus_humboldti",
"Spheniscus_magellanicus", "Eudyptes_pachyrhynchus", "Eudyptes_sclateri", "Eudyptes_chrysolophus",
"Eudyptes_schlegeli", "Eudyptes_chrysocome", "Eudyptes_filholi", "Eudyptes_moseleyi",
"Macronectes_giganteus", "Theristicus_caerulescens", "Aquila_chrysaetos", "Athene_cunicularia",
"Trogon_surrucura", "Bucorvus_abyssinicus", "Halcyon_senegalensis", "Chloroceryle_aenea",
"Actenoides_hombroni", "Merops_nubicus", "Pogoniulus_pusillus", "Colaptes_auratus",
"Falco_peregrinus", "Melopsittacus_undulatus", "Acanthisitta_chloris", "Myiozetetes_cayanensis",
"Chiroxiphia_lanceolata", "Malurus_cyaneus", "Lichenostomus_cassidix", "Lanius_collurio",
"Corvus_cornix", "Eremophila_alpestris", "Parus_major", "Acrocephalus_scirpaceus",
"Progne_subis", "Riparia_riparia", "Hirundo_rustica", "Brachypodius_melanocephalos",
"Pycnonotus_jocosus", "Phylloscopus_trochilus", "Sylvia_atricapilla", "Certhia_americana",
"Sturnus_vulgaris", "Ficedula_albicollis", "Taeniopygia_guttata", "Passer_domesticus",
"Motacilla_alba", "Serinus_canaria", "Emberiza_elegans", "Molothrus_ater",
"Setophaga_coronata", "Diglossa_brunneiventris")
galgal_elems <- read.table("ce_accl_center_15k.bed", sep = "\t")
for (aspecies in species){
elem_file <- read.table(paste(aspecies, ".bed", sep = ""), sep = "\t")
elem_file$samtools <- paste(elem_file$V1, ":", elem_file$V2 - 250, "-", elem_file$V3 + 248, sep="")
atac_probs_new <- merge(atac_probs, elem_file, by.x = "V1", by.y = "samtools", all = F)
atac_probs_new <- atac_probs_new[!duplicated(atac_probs_new$V4)]
colnames(atac_probs_new)[2] <- aspecies
galgal_elems <- merge(galgal_elems, atac_probs_new[,c(2,6)], by.x = "V4", by.y = "V4", all.x = T, all.y = F)
}
#code to get four categories
#chicken ATAC yes; chicken yes; accl yes + chicken ATAC no; chicken no; accl no
#chicken ATAC yes; chicken yes; accl no + chicken ATAC no; chicken no; accl yes
#chicken ATAC yes; chicken no; accl no + chicken ATAC no; chicken yes; accl yes
#chicken ATAC yes; chicken no; accl yes + chicken ATAC no; chicken yes; accl no
#atac main data
atac_hl <- read.table("ce_accl_center_15k_hl_atac_annot.bed", header = F)
atac_hl$atac <- atac_hl$V5 > 0
atac_df <- atac_hl[which(atac_hl$V4 %in% galgal_elems$V4), c("V4", "atac")]
colnames(atac_df) <- c("id", "atac")
atac_df <- merge(atac_df, phyloacc_subset[,c(1:3)], by.x = "id", by.y = "original.id")
rate_data <- fread("M2_elem_Z.txt", header = T)
results <- fread("rate_postZ_M2.txt", header = T, sep = "\t")
results <- results[,c(1:6,seq(10,322,4))]
colnames(results) <- gsub("_3", "", colnames(results))
#using 0.5 as the barrier
for (elem in 1:nrow(atac_df)){
element <- atac_df$id[elem]
if (galgal_elems[elem, "Gallus_gallus7"] > 0.5){
atac_df[elem, "atac_pred"] <- TRUE
}else{
atac_df[elem, "atac_pred"] <- FALSE
}
accl_taxa <- intersect(colnames(rate_data)[rate_data[which(rate_data$`Locus ID` %in% atac_df[elem, "phyloacc.id"]),] == 2],
colnames(results)[results[which(results$`Locus ID` %in% atac_df[elem, "phyloacc.id"]),] >= 0.9])
accl_taxa <- accl_taxa[accl_taxa != "Gallus_gallus4"]
accl_atac <- table(galgal_elems[elem, accl_taxa] > 0.5)
if (length(accl_atac)==2){
atac_df[elem, "accl_atac_pred"] <- unname(accl_atac[2]/(accl_atac[1] + accl_atac[2]))
}
else if (length(accl_atac)==0){
atac_df[elem, "accl_atac_pred"] <- NA
}
else if (names(accl_atac)=="TRUE"){
atac_df[elem, "accl_atac_pred"] <- 1
}
else {
atac_df[elem, "accl_atac_pred"] <- 0
}
all_atac <- table(galgal_elems[elem,] > 0.5)
atac_df[elem, "all_atac_pred"] <- unname(all_atac[2]/(all_atac[1] + all_atac[2]))
}
atac_df$accl_atac_pred_bool <- atac_df$accl_atac_pred > 0.5
table(atac_df[,c("atac_pred", "accl_atac_pred_bool", "atac")])
#where atac shifts
shifted <- atac_df[which(!atac_df$atac & !atac_df$atac_pred & atac_df$accl_atac_pred_bool),]
shifted <- rbind(shifted, atac_df[which(atac_df$atac & atac_df$atac_pred & !atac_df$accl_atac_pred_bool),])
shifted_elems <- subset(final_data[,c(1:4)], id %in% shifted$id)
write.table(shifted_elems, "atac_shift_elems.txt", sep = "\t", col.names = F, row.names = F, quote = F)
#get closeby genes
library(stringr)
shifted_genes <- fread("atac_shift_genes.txt", sep = "\t")
shifted_genes <- subset(shifted_genes, V14 < 10000)
shifted_genes$id <- str_extract(shifted_genes$V13, "Name=[A-Za-z0-9-_.]*")
shifted_genes$id <- gsub("Name=", "", shifted_genes$id)