90% of bioinformatics is just converting from one file format to another.
Here are some notes taken fromCoursera - Command Line Tools for Genomic Data Science course. It covers common file formats: SAM/BAM, BED, GFF/GTF, VCF/BCF, FastA/Q; as well as showcase the usage of common software (samtools, bedtools, bcftools, etc.) for read alignments, genome arithmetic and genetic variants calling via some course projects/ problems.
Common workflow:
- FastA files are often used to store reference genomes, FastQ for sequencing reads.
- Aligning the reads to a reference genome involves SAM format to store alignment info.
- With reads aligned, genetic variants can be called, generating VCF files.
- Genetic intervals/segments/annotations can be presented in BED and GFF/GTF files.
SAM BAM Format
Headers, fields, FLAG scores, CIGAR strings
BED Format
GFF GTF GFF3 Format
General feature format.
VCF BCF Format
FastA FastQ Format
And quality scores.
Format Summary
Format | Usage | 0/1 based | Column Names (in bioawk) | Metadata | Software |
---|---|---|---|---|---|
BED | interval arithmetic | 0 | 1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts | . | bedtools |
SAM/BAM | alignment | SAM:1 BAM:0 | 1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual | HD, SQ, RG, PG | samtools, BWA-mem, Novoalign, bowtie2, Tophat, STAR, GSNAP, IGV |
VCF/BCF | variant call | VCF:1 BCF:0 | 1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info | fileformat, INFO, FILTER, ALT, config | bcftools, vcftools |
GFF3/GFF/GTF | genetic feature | 1 | 1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute | version | . |
FastA/Q | seq reads | . | 1:name 2:seq 3:qual 4:comment | . | fastp |
0 and 1 index systems
0-based
- python style, start from 0 with an OPEN end (up to but NOT include the end point).
- E.g. the first 5 characters in a string have indices 0, 1, 2, 3, 4; the segment is represented as [start:end) - [0:5)
- number of elements = end - start
- BAM, BCF, BED, and PSL formats are 0-based. 1-based
- closer to the natural way of counting things, start from 1 with a CLOSED end (include the end point).
- E.g. the first 5 characters in a string have indices 1, 2, 3, 4, 5; the segment is represented as [start:end] - [1:5]
- number of elements = end - start + 1
- SAM, VCF, GTF, GFF and Wiggle formats are 1-based.
Common traps
BEWARE
- inconsistent chromosome labels (chr1 v Chr1 v CHR1)
- different sorting criteria (alphabetic v numerical)
- mixed UNIX/Windows newlines
- file violate spec with vigor
- program expects exact extension
- file is gzippâed not bgzippâed
- annotations us diff. genome builds
- tool only works for one format
- tools is hard-coded for specific build
- tool requires act of gods to compile
Source Applied Computational Genomics.
So be careful and cautious, and make sure to understand the columns, parameters, and values being presented/calculated.
When in doubt, donât assume, check the document - it saves time to get it right the first time :D
Problems/ Projects
I personally found the descriptions of the problem hard to understand (so if you feel that way too youâre not alone). I ended up rephrasing a lot of it.
1: unix basic (grep, cut, awk, comm, sort, uniq)
Files needed for the project are here for download and extraction.
Q1, in apple.genome, each chromosome starts on a line with â>â, how many chromosomes are there?
Q2-3, in apple.genes, the first column stores gene, the second column stores transcript, whatâs the number of unique genes and unique transcripts in the file?
For genes, cut out first column and sort -u
to find uniq values, then count.
For transcripts, same operation for the second column.
Q4-5, a gene can have multiple transcripts (two lines can have the same value in first column and differ in the second column), how many genes have only one transcripts, how many genes have more than one transcripts?
Cut out the first column and count the how many time each genes occurs, which gives the number of their transcripts. Then
Q6-7, in apple.gene, column 4 stores strand (+ or -) information, how many genes are on + strand, how many on - strand?
Q8-13, column 3 stores chromosome information, count the number of genes on chr1, chr2, chr3; count the number of transcripts on chr1/2/3.
Question 6 through 13 are about filtering out lines that match certain value in a column. The solutions below covers the situation of counting genes. To count transcripts, change genes (column 1) to transcripts (col2).
- Solution 1: cut out both gene column and filter column (strand or chromosome), uniq the combination, then count how many +/- strand or chromosomes there are. A problem with this method is if one gene maps to 2 strands/chromosomes (which shouldnât occur here), itâll count the gene twice.
- Solution 2: similar to solution 1, after getting the uniq combination, we can also get the counts by
grep -c
- Solution 3: awk can be used to filter value in a column.
awk '$4=="var"'
will return lines where column 4 matches âvarâ. We can filter to get matching values, then count the number of unique genes.
To me, solution 3 is the most intuitive and easiest to read one, awk is a command worth learning for handling column based files.
Q14-17, column 1 in apple.conditionA/B/C stores gene, find out the number of genes exists (1) in both condition A and B, (2) only in A, (3) only in B, (4) in A,B,C.
First we need to cut out the gene columns in those 3 files for further comparison.
comm file1 file2
compares two sorted files, returns 3 columns - lines only in file1; only in file2; in both file1 and file2.
We can pass in the number of columns to be omitted in the output, e.g. comm -12 file1 file2
only returns the third column with common lines.
The last question asks to compare 3 files, we can output common genes of A and B, then compare it with C. Or a whacky way to achieve this is to concatenate genes from all 3 files, then count the genes that occur 3 times.
2: process SAM/BAM/BED/GTF files
Files needed for the project are here
Q1-5 Inspect the âathal_wu_0_A.bamâ file with samtools and answer the following questions:
- How many alignments are there in the bam file? (Q1)
- For SAM format, column 7 is the âRNEXTâ field, which contains the reference name of the mate/next read, a value of â=â in this field means the mate is mapped to the same chromosome (most common for pair end sequencing); whereas a â*â means the mate is unmapped. In the given bam file, find out the number of alignments that (a) with a mate mapped to the same chromosome (Q4); and (b) with an unmapped mate (Q2)?
- How many alignments have a deletion (have âDâ in its CIGAR string / column 6)? (Q3)
- How many alignments have a intron (have âNâ in its CIGAR string)? (Q5)
Q6-10 from the original âathal_wu_0_A.bamâ file, extract alignments with in the range of Chr3:11777000-1179400 to a new bam file, then answer the same set of question as Q1-5 for the new bam file.
Note: the bam file here are already sorted and indexed (the index file âathal_wu_0_A.bam.baiâ is included). However, in the case where a bam file is not sorted, you need to sort and index it before you can extract the range
Q11-13: check the header of the âathal_wu_0_A.bamâ file to get more information
- How many reference genomes were used to align the sequences? In header, @SQ SN lines has ref sequence info. (Q11)
- Among all ref genomes above, whatâs the length of the first reference genome? (Q12)
- What alignment program/tools was used? (In header, @PG shows the program used.) (Q13)
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:Chr1 LN:29923332 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr2 LN:19386101 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr3 LN:23042017 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr4 LN:18307997 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:Chr5 LN:26567293 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:chloroplast LN:154478 AS:wu_0.v7.fas SP:wu_0.v7.fas
@SQ SN:mitochondria LN:366924 AS:wu_0.v7.fas SP:wu_0.v7.fas
@RG ID:H100223_GAII05_0002 PL:SLX LB:wu_PII PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_PER01 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_PER02 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@RG ID:Wii_SR03 PL:SLX LB:wu_phaseI PI:400 DS:wu_0_Genome SM:wu_0
@PG ID:stampy VN:1.0.3_(r627) CL:-g /lustre/scratch103/sanger/xcg/tmp/tmp.zYfXz26246 -h /lustre/scratch103/sanger/xcg/tmp/tmp.zYfXz26246 --readgroup=ID:Wii_PER01,LB:wu_phaseI,SM:wu_0,PI:400,PL:SLX,DS:wu_0_Genome --bwaoptions=-q10 /lustre/scratch103/sanger/xcg/wu_0.v7.fas -o /lustre/scratch103/sanger/xcg/wu_0/A/aln_A1.sp0.sam -M /lustre/scratch103/sanger/xcg/wu_0/read_1_1.sp0.fq.gz /lustre/scratch103/sanger/xcg/wu_0/read_1_2.sp0.fq.gz
@PG ID:samtools PN:samtools PP:stampy VN:1.16.1 CL:samtools view -H athal_wu_0_A.bam
@CO TM:Fri, 17 Sep 2010 12:20:13 BST WD:/lustre/scratch103/sanger/xcg/wu_0/self HN:bc-16-1-07.internal.sanger.ac.uk UN:xcg
Q14-15, check the first alignment in the âathal_wu_0_A.bamâ file and find out (a) the name/identifier (column 1) and (b) the start position of its mate (column 7, 8).
Q16-20 use bedtools to investigate overlapping intervals between the given âathal_wu_0_A_annot.gtfâ file and the range.bam file in Q6-10, and answer the following questions
- How many overlapping intervals (output with
-wo
option) are there between the 2 files? (Q16) - The length of the overlaps can be found in column 22 with the
-wo
option. How many overlaps has a length of 10 bases or longer? (Q17) - How many alignments in the bam file overlap with an interval in the gtf file? (Q18)
- Of all the intervals in gtf file that overlap with an alignment in the bam file, how many of them are exons? (Q19)
- How many unique transcripts (in column 9, first part of the string) are there in the gtf file (Q20)
3: align FastQ reads with bowtie2 and variant calling with bfctools
Important notes: you need bowtie v.2.2.5 to get the correct answers, a different version will generate a index and different alignment results. The older version can be found here. I saved a copy of the pre-compiled bowtie2 linux version in case the link doesnât work in the future.
In an linux environment (or WSL2), extract the file and add it to PATH variable
export PATH=/extracted_directory/bowtie2-2.2.5:$PATH
, you can verify by running which bowtie2
.
Python3 is needed for the bowtie2 program to run.
Files needed for the project can be found here.
Q1-3, inspect the âwu_0_v7.fasâ file for the following information
- How many sequences (each starts with â>â) are there in the file? (Q1)
- Whatâs the name of 3rd sequence? (Q2)
- Whatâs the name of the last sequence? (Q3)
Q4-5, use botie2 to index the âwu_0_v7.fasâ genome.
- How many files are created? (Q4)
- Whatâs the 3-character extension of the index file? (Q5)
Q6-14, align the reads in âwu_0_A_wgs.fastqâ using bowtie2 with full-match (default) and local-match setting.
- How many reads are there in the given wu_0_A_wgs.fastq file? (Q6)
- With full-match setting alignment
- How many reads mapped to at least one place in the genome? (Q9)
- How many reads mapped to more than one place in the genome? (Q11)
- Whatâs the number of total matches reported? (Q7) Although some reads map to >1 loci, bowtie2 by default only report 1 alignment, so this is the same as Q9.
- With the local-match setting alignment, answer the same set of questions, whatâs the number of total matches (Q8), which is the same as the number of reads with >= 1 match (Q10), whatâs the number of reads with > 1 match (Q12).
- How many alignments contains insertions and/or deletions (has either âIâ or âDâ in CIGAR string) in the full match setting (Q13)? What about the local-match setting (Q14)? CIGAR string is stored in column 6 in SAM format.
Alignment summary (default full-match):
147354 reads; of these: 147354 (100.00%) were unpaired; of these: 9635 (6.54%) aligned 0 times 93780 (63.64%) aligned exactly 1 time 43939 (29.82%) aligned >1 times 93.46% overall alignment rate
Local alignment summary:
147354 reads; of these: 147354 (100.00%) were unpaired; of these: 6310 (4.28%) aligned 0 times 84939 (57.64%) aligned exactly 1 time 56105 (38.07%) aligned >1 times 95.72% overall alignment rate
Q15-19, with reads aligned to the genome, we can generate a pileup file to prepare for variant calling later.
Here bcftools mpileup
was used to generate pileup format, where the reads are lined up to the genome, and we take a slice of one nucleotide position (one column), and store that information in one line for variant caller to inspect this position and judge if thereâs a mutation.
One site/locus can have multiple lines representing different situations such as SNP and INDEL.
The questions in the course practical originally used samtools (older version of samtools can output in VCF format with -g or -u). However this feature is deprecated and removed, and we need bcftools mpileup
instead.
Prepare a pileup file (using the alignment with the full-match default setting) in VCF format with bcftools. Note that the alignment sam file needs to be sorted before generating the VCF file. Find the following information of the output file:
- How many lines are located in Chr3 (chromosome name in column1)? (Q15)
- How many lines have âAâ in the reference genome (REF base in column 4)? (Q16)
- How many lines have depth cover of exactly 20 (DP=20 in column 8)? (Q17)
- How many lines are reporting âINDELâ value in column 8? (Q18)
- How many lines correspond to position Chr1:175672 (column 1,2)? (Q19)
Note: Here Q15, 16 and 19 have different answers than the courses answer sheet for project 2, due to software version difference - they used an older version of samtools to generate the VCF file. The point is to learn about the file format and the workflow of variant calling. If you need to pass the exam, the answer you need are Q15 360295; Q16 1150985; Q19 2.
Q20-24 with the pileup file ready, we can call variants with bfctools
Bcftools has a call command to call variants from a mpileup file, here we use -m
for multiallelic caller and -v
for variant only. The variant calling result is output in a VCF file by specifying -Ov
.
Inspect the VCF file with all the variants generated by bcftools call
for the following information.
- How many variants are located on Chr3 (chromosome name in column1)? (Q20)
- How many lines represents an AâT SNP (has âAâ in col4 and âTâ in col5)? (Q21)
- How many lines represents INDELS (contain âINDELâ value in column 8)? (Q22)
- How many lines have depth cover of exactly 20 (DP=20 in column 8)? (Q23)
- Whatâs the type of variant (SNP or INDEL) at position Chr3:11937923 (column 1,2)? (Q19)
Again, different answers for Q20-22 in the course exam, caused by different software versions. If you need to pass the exam, here are the answers you need: Q20 398; Q21 392; Q22 320.