在临床数据分析时,使用的人类参考基因组为UCSC上面的hg19,hg39版本,且常常将除1-22,X,Y,M,以外的染色体去除,避免数据的干扰。
This directory contains the Feb. 2009 assembly of the human genome (hg19,GRCh37 Genome Reference Consortium Human Reference 37 (GCA_000001405.1)),as well as repeat annotations and GenBank sequences.The Feb. 2009 human reference sequence (GRCh37) was produced by theGenome Reference Consortium: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
chromFa.tar.gz - The assembly sequence in one file per chromosome. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.md5sum.txt - checksums of files in this directoryhg19.chrom.sizes - Two-column tab-separated text file containing assembly sequence names and sizes.
# 打开码文件cat md5sum.txt## 将chromFa.tar.gz的码写入一个新文件 echo "ec3c974949f87e6c88795c17985141d3 chromFa.tar.gz" > check_md5sum.txt## 检测 md5sum -c check_md5sum.txt ##显示 chromFa.tar.gz: OK
cat hg19.chrom.sizes |head -n 25## 显示chr1 249250621chr2 243199373chr3 198022430chr4 191154276chr5 180915260chr6 171115067chr7 159138663chrX 155270560chr8 146364022chr9 141213431chr10 135534747chr11 135006516chr12 133851895chr13 115169878chr14 107349540chr15 102531392chr16 90354753chr17 81195210chr18 78077248chr20 63025520chrY 59373566chr19 59128983chr22 51304566chr21 48129895chr6_ssto_hap7 4928567
tar -zxf chromFa.tar.gz
mv chrUn_gl0002* chr*_random.fa chr*ctg* chr*apd* chr*cox* chr*hap* temp#mkdir chrfa#mv chromFa.tar.gz chrfa/
当面目录只剩下
chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fado echo $i; #cat $i >> hg19.fa mv $i chrfadone
bwa index -a bwtsw -p hg19.fa hg19.fa
索引构建完后显示
[BWTIncConstructFromPacked] 670 iterations done. 6154206942 characters processed.[BWTIncConstructFromPacked] 680 iterations done. 6177547390 characters processed.[bwt_gen] Finished constructing BWT in 687 iterations.[bwa_index] 2423.30 seconds elapse.[bwa_index] Update BWT... 19.05 sec[bwa_index] Pack forward-only FASTA... 16.77 sec[bwa_index] Construct SA from BWT and Occ... 1192.17 sec[main] Version: 0.7.17-r1188[main] CMD: bwa index -a bwtsw -p hg19.fa hg19.fa[main] Real time: 3814.250 sec; CPU: 3676.297 sec
文件生成查看
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg19$ ls -lhttotal 8.0G-rwxrwxrwx 1 toucan toucan 3.0G Sep 27 09:53 hg19.fa-rwxrwxrwx 1 toucan toucan 1.5G Sep 27 11:17 hg19.fa.sa-rwxrwxrwx 1 toucan toucan 6.6K Sep 27 10:57 hg19.fa.amb-rwxrwxrwx 1 toucan toucan 944 Sep 27 10:57 hg19.fa.ann-rwxrwxrwx 1 toucan toucan 739M Sep 27 10:57 hg19.fa.pac-rwxrwxrwx 1 toucan toucan 2.9G Sep 27 10:57 hg19.fa.bwt-rwxrwxrwx 1 toucan toucan 3.5K Sep 27 13:16 work_index.log-rwxrwxrwx 1 toucan toucan 685 Sep 27 10:06 work_index.shdrwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:58 chrfadrwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:17 tempdrwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:00 download
hg38.fa.gz - "Soft-masked" assembly sequence in one file. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case. hg38.fa.gz - "Soft-masked" assembly sequence in one file. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.md5sum.txt - checksums of files in this directoryhg38.chrom.sizes - Two-column tab-separated text file containing assembly sequence names and sizes.
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "1c9dcaddfa41027f17cd8f7a82c7293b hg38.fa.gz" >> check_md5sum.txttoucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "a5aa5da14ccf3d259c4308f7b2c18cb0 hg38.chromFa.tar.gz" >> check_md5sum.txttoucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat check_md5sum.txt1c9dcaddfa41027f17cd8f7a82c7293b hg38.fa.gza5aa5da14ccf3d259c4308f7b2c18cb0 hg38.chromFa.tar.gztoucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ md5sum -c check_md5sum.txthg38.fa.gz: OKhg38.chromFa.tar.gz: OK
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat hg38.chrom.sizes |head -n 25chr1 248956422chr2 242193529chr3 198295559chr4 190214555chr5 181538259chr6 170805979chr7 159345973chrX 156040895chr8 145138636chr9 138394717chr11 135086622chr10 133797422chr12 133275309chr13 114364328chr14 107043718chr15 101991189chr16 90338345chr17 83257441chr18 80373285chr20 64444167chr19 58617616chrY 57227415chr22 50818468chr21 46709983
tar -zxf hg38.chromFa.tar.gzcd chroms/
mkdir chrfa cat work.shfor i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fado echo $i; cat $i >> ../hg38.fa mv $i ../chrfadonetoucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ rm -rf chroms/toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ mv hg38.chromFa.tar.gz chrfa/
查看结果
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ cat hg38.fa |grep ">"|uniq -c 1 >chr1 1 >chr2 1 >chr3 1 >chr4 1 >chr5 1 >chr6 1 >chr7 1 >chr8 1 >chr9 1 >chr10 1 >chr11 1 >chr12 1 >chr13 1 >chr14 1 >chr15 1 >chr16 1 >chr17 1 >chr18 1 >chr19 1 >chr20 1 >chr21 1 >chr22 1 >chrX 1 >chrY 1 >chrM
bwa index -a bwtsw -p hg38.fa hg38.fa
输出显示:
[BWTIncConstructFromPacked] 630 iterations done. 6018708242 characters processed.[BWTIncConstructFromPacked] 640 iterations done. 6055482898 characters processed.[BWTIncConstructFromPacked] 650 iterations done. 6088164258 characters processed.[BWTIncConstructFromPacked] 660 iterations done. 6117207522 characters processed.[BWTIncConstructFromPacked] 670 iterations done. 6143017202 characters processed.[BWTIncConstructFromPacked] 680 iterations done. 6165952882 characters processed.[bwt_gen] Finished constructing BWT in 686 iterations.[bwa_index] 2564.67 seconds elapse.[bwa_index] Update BWT... 19.27 sec[bwa_index] Pack forward-only FASTA... 16.36 sec[bwa_index] Construct SA from BWT and Occ... 1230.34 sec[main] Version: 0.7.17-r1188[main] CMD: bwa index -a bwtsw -p hg38.fa hg38.fa[main] Real time: 3969.074 sec; CPU: 3855.406 sec
输出文件
-rwxrwxrwx 1 toucan toucan 3150052305 Oct 5 15:44 hg38.fa*-rwxrwxrwx 1 toucan toucan 16843 Oct 5 16:34 hg38.fa.amb*-rwxrwxrwx 1 toucan toucan 954 Oct 5 16:34 hg38.fa.ann*-rwxrwxrwx 1 toucan toucan 3088286508 Oct 5 16:33 hg38.fa.bwt*-rwxrwxrwx 1 toucan toucan 772071602 Oct 5 16:34 hg38.fa.pac*-rwxrwxrwx 1 toucan toucan 1544143256 Oct 5 16:54 hg38.fa.sa*
其他索引构建
samtools faidx hg19.fa## 显示生成hg19.fa.fai
samtools faidx hg38.fa
gatk CreateSequenceDictionary -R hg19.fa -O hg19.dict
联系客服