$ wget http://www.hudsonalpha.org/gsl/static/software/bam2fastq-1.1.0.tgz $ tar zxf bam2fastq-1.1.0.tgz $ cd bam2fastq-1.1.0 $ make $ ./bam2fastq <in.bam>
输入参数 -6 Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairs in variant calling. -B Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments. -b FILE List of input BAM files, one file per line [null] -C INT Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0] -d INT At a position, read maximally INT reads per input BAM. [250] -E Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt specificity a little bit. -f FILE The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by razip. [null] -l FILE BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null] -M INT cap mapping quality at INT [60] -q INT Minimum mapping quality for an alignment to be used [0] -Q INT Minimum base quality for a base to be considered [13] -r STR Only generate pileup in region STR [all sites] 输出参数
-D Output per-sample read depth (require -g/-u)
-g Compute genotype likelihoods and output them in the binary call format (BCF).
-u Similar to -g except that the output is uncompressed BCF, which is preferred for piping.
Options for Genotype Likelihood Computation (for -g or -u):
-e INT Phred-scaled gap extension sequencing error probability. Reducing INT leads to longer indels. [20]
-h INT Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100]
-I Do not perform INDEL calling
-L INT Skip INDEL calling if the average per-sample depth is above INT. [250]
-o INT Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40]
-P STR Comma dilimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all]
生成的结果文件为vcf格式,有10列,分别是:1 参考序列名;2 varianti所在的left-most位置;3 variant的ID(默认未设置,用’.’表示);4 参考序列的allele;5 variant的allele(有多个alleles,则用’,’分隔);6 variant/reference QUALity;7 FILTers applied;8 variant的信息,使用分号隔开;9 FORMAT of the genotype fields, separated by colon (optional); 10 SAMPLE genotypes and per-sample information (optional)。
例如:
1 2 3 4 5 6 7 8
scaffold_1 2847 . A AACGGTGAAG 194 . INDEL;DP=11;VDB=0.0401;AF1=1;AC1=2;DP4=0,0,8,3;MQ=35;FQ=-67.5 GT:PL:GQ 1/1:235,33,0:63 scaffold_1 3908 . G A 111 . DP=13;VDB=0.0085;AF1=1;AC1=2;DP4=0,0,5,7;MQ=42;FQ=-63 GT:PL:GQ 1/1:144,36,0:69 scaffold_1 4500 . A G 31.5 . DP=8;VDB=0.0034;AF1=1;AC1=2;DP4=0,0,1,3;MQ=42;FQ=-39 GT:PL:GQ 1/1:64,12,0:21 scaffold_1 4581 . TGGNGG TGG 145 . INDEL;DP=8;VDB=0.0308;AF1=1;AC1=2;DP4=0,0,0,8;MQ=42;FQ=-58.5 GT:PL:GQ 1/1:186,24,0:45 scaffold_1 4644 . G A 195 . DP=21;VDB=0.0198;AF1=1;AC1=2;DP4=0,0,10,10;MQ=42;FQ=-87 GT:PL:GQ 1/1:228,60,0:99 scaffold_1 4827 . NACAAAGA NA 4.42 . INDEL;DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=40;FQ=-37.5 GT:PL:GQ 0/1:40,3,0:3 scaffold_1 4854 . A G 48 . DP=6;VDB=0.0085;AF1=1;AC1=2;DP4=0,0,2,1;MQ=41;FQ=-36 GT:PL:GQ 1/1:80,9,0:16 scaffold_1 5120 . A G 85 . DP=8;VDB=0.0355;AF1=1;AC1=2;DP4=0,0,5,3;MQ=42;FQ=-51 GT:PL:GQ 1/1:118,24,0:45
第8列中显示了对variants的信息描述,比较重要,其中的 Tag 的描述如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Tag Format Description AF1 double Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele DP int Raw read depth (without quality filtering) DP4 int[4] # high-quality reference forward bases, ref reverse, alternate for and alt rev bases FQ int Consensus quality. Positive: sample genotypes different; negative: otherwise MQ int Root-Mean-Square mapping quality of covering reads PC2 int[2] Phred probability of AF in group1 samples being larger (,smaller) than in group2 PCHI2 double Posterior weighted chi^2 P-value between group1 and group2 samples PV4 double[4] P-value for strand bias, baseQ bias, mapQ bias and tail distance bias QCHI2 int Phred-scaled PCHI2 RP int # permutations yielding a smaller PCHI2 CLR int Phred log ratio of genotype likelihoods with and without the trio/pair constraint UGT string Most probable genotype configuration without the trio constraint CGT string Most probable configuration with the trio constraint
Input/Output Options: -A Retain all possible alternate alleles at variant sites. By default, the view command discards unlikely alleles. -b Output in the BCF format. The default is VCF. -D FILE Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null] -F Indicate PL is generated by r921 or before (ordering is different). -G Suppress all individual genotype information. -l FILE List of sites at which information are outputted [all sites] -N Skip sites where the REF field is not A/C/G/T -Q Output the QCALL likelihood format -s FILE List of samples to use. The first column in the input gives the sample names and the second gives the ploidy, which can only be 1 or 2. When the 2nd column is absent, the sample ploidy is assumed to be 2. In the output, the ordering of samples will be identical to the one in FILE. [null] -S The input is VCF instead of BCF. -u Uncompressed BCF output (force -b). Consensus/Variant Calling Options:
-c Call variants using Bayesian inference. This option automatically invokes option -e.
-d FLOAT When -v is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0]
当有多个sample用于variants calling时,比如多个转录组数据或多个重测序
数据需要比对到参考基因组上,设置该值,表明至少有该<float 01>比例的
samples在该位点都有覆盖才计算入variant.所以对于只有一个sample的情况
下,该值设置在01之间没有意义,大于1则得不到任何结果。
-e Perform max-likelihood inference only, including estimating the site allele frequency, testing Hardy-Weinberg equlibrium and testing associations with LRT.
-g Call per-sample genotypes at variant sites (force -c)
-i FLOAT Ratio of INDEL-to-SNP mutation rate [0.15]
-p FLOAT A site is considered to be a variant if P(ref|D)
-t FLOAT Scaled muttion rate for variant calling [0.001]
-T STR Enable pair/trio calling. For trio calling, option -s is usually needed to be applied to configure the trio members and their ordering. In the file supplied to the option -s, the first sample must be the child, the second the father and the third the mother. The valid values of STR are ‘pair’, ‘trioauto’, ‘trioxd’ and ‘trioxs’, where ‘pair’ calls differences between two input samples, and ‘trioxd’ (‘trioxs’) specifies that the input is from the X chromosome non-PAR regions and the child is a female (male). [null]
-v Output variant sites only (force -c)
Contrast Calling and Association Test Options:
-1 INT Number of group-1 samples. This option is used for dividing the samples into two groups for contrast SNP calling or association test. When this option is in use, the following VCF INFO will be outputted: PC2, PCHI2 and QCHI2. [0]
-U INT Number of permutations for association test (effective only with -1) [0]