SAM document

SAM (Sequence Alignment/Map) is a text-based, tab-delimited format for storing sequence alignment data. BAM is the compressed binary form of SAM, which saves disk space. SAM uses 1-based coordinates, while BAM uses 0-based coordinates.
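
The off-by-one difference between the two coordinate systems can be sketched as follows. This is a minimal illustration; the function names are made up for this note and are not part of any library.

```python
def sam_pos_to_zero_based(pos):
    """Convert a 1-based SAM POS value to the 0-based coordinate used in BAM."""
    # An unmapped read has POS 0 in SAM, which becomes -1 here.
    return pos - 1


def zero_based_to_sam_pos(pos0):
    """Convert a 0-based coordinate back to a 1-based SAM POS value."""
    return pos0 + 1


print(sam_pos_to_zero_based(11699950))  # 11699949
print(zero_based_to_sam_pos(11699949))  # 11699950
```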

Header section:

  • Each header line starts with ’@’ followed by a 2-letter code.
  • @HD: Header line with general info such as SAM format version and sorting order.
  • @SQ: Reference sequence dictionary, with info such as sequence name & length.
  • @RG: Read group info such as sequencing platform.
  • @PG: Program used to generate the alignment.
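
As a rough sketch of how these header lines can be read, the snippet below splits a line into its 2-letter record type and its TAG:VALUE pairs. The helper name and the example header values are invented for illustration, and free-text @CO comment lines are not handled.

```python
def parse_header_line(line):
    """Split one SAM header line into its record type and TAG:VALUE pairs."""
    fields = line.rstrip("\n").split("\t")
    if not fields[0].startswith("@"):
        raise ValueError("header lines must start with '@'")
    record_type = fields[0][1:]  # drop the leading '@'
    tags = {}
    for field in fields[1:]:
        # Split on the first ':' only, since values may contain ':' themselves.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return record_type, tags


# Illustrative header lines, not taken from a real file.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:sample1\tPL:ILLUMINA",
    "@PG\tID:aligner1\tPN:aligner",
]
for header_line in example_header:
    print(parse_header_line(header_line))
```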

Alignment section: 11 mandatory fields/columns, followed by optional tags.

Col  Field  Description                 Example
1    QNAME  Query template name         GAII05_0002:1:113:7822:3886#0
2    FLAG   Bitwise FLAG score          4
3    RNAME  Ref seq name                chr1
4    POS    1-based start coordinate    11699950
5    MAPQ   Mapping quality             60
6    CIGAR  CIGAR string                39M1D12M
7    RNEXT  Ref name of mate/next read  chr1
8    PNEXT  Pos of mate/next read       11700332
9    TLEN   (Paired) insert size        433
10   SEQ    Read seq                    AATGTAAAAC
11   QUAL   Phred+33 quality of seq     HHHAAB^^%CCC
12   OPT    Optional tags               XA:i:2, MD:Z:0T34G15
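
To make the column layout concrete, here is a minimal sketch that splits one alignment line into the 11 mandatory fields plus optional tags, and decodes a Phred+33 quality string. The function names and the example record are made up for illustration. Only integer (type "i") tags are converted; all other tag types are kept as strings.

```python
FIELD_NAMES = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(FIELD_NAMES):
        raise ValueError("expected at least 11 tab-separated columns")
    record = {}
    for name, value in zip(FIELD_NAMES, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    tags = {}
    for column in columns[len(FIELD_NAMES):]:
        tag, value_type, value = column.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["TAGS"] = tags
    return record


def phred33_to_scores(qual):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(char) - 33 for char in qual]


# Made-up record for illustration only; not taken from a real file.
example_line = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tAATGTAAAAC\tIIIIIIIIII\tNM:i:0"
record = parse_alignment_line(example_line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TAGS"])
print(phred33_to_scores(record["QUAL"]))
```

In Phred+33 encoding, each quality character's ASCII code minus 33 is the quality score, so "I" (ASCII 73) decodes to 40.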

FLAG score (column 2)

  • The FLAG score can be confusing: it is a single integer that packs dense information about the read.
  • It answers 11 yes/no questions with 0/1. Stringing those 0s and 1s together gives a binary number, which is the FLAG score (usually shown in decimal). The 11 questions are listed in the table below. (Current versions of the SAM specification define a twelfth bit, 2048, for supplementary alignments, which is not covered here.)
  • A nice explanation of the FLAG score by Dr. Quinlan can be found here.
  • For example, take a read from single-end sequencing that is unmapped, passes the quality check, and is not a PCR duplicate. It answers no (0) to every question in the table except question 3, “Read itself is unmapped”. That puts a 1 (yes) in the third position from the right of the binary number and 0 in all other positions: ‘00000000100’, which translates to a FLAG score of 4 in decimal.

base2        base10  Question                                   Paired-end seq only
00000000001  1       Read comes from paired sequencing          N
00000000010  2       Read mapped in a proper pair               Y
00000000100  4       Read itself is unmapped                    N
00000001000  8       Read’s mate is unmapped                    Y
00000010000  16      Strand (0 forward, 1 reverse)              N
00000100000  32      Strand of mate                             Y
00001000000  64      This is the first read in the pair         Y
00010000000  128     This is the second read in the pair        Y
00100000000  256     The alignment is not primary               N
01000000000  512     Read fails platform/vendor quality checks  N
10000000000  1024    Read is a PCR/optical duplicate            N

Table adapted from Dr. Quinlan’s slide.
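
The bit arithmetic behind the FLAG score can be sketched in a few lines. This is an illustrative decoder covering only the 11 bits listed in the table above; the name `decode_flag` and the description strings are made up for this note.

```python
# (bit value, meaning when the bit is set), following the table above.
FLAG_BITS = [
    (1, "read comes from paired sequencing"),
    (2, "read mapped in a proper pair"),
    (4, "read itself is unmapped"),
    (8, "mate is unmapped"),
    (16, "read is on the reverse strand"),
    (32, "mate is on the reverse strand"),
    (64, "first read in the pair"),
    (128, "second read in the pair"),
    (256, "alignment is not primary"),
    (512, "read fails platform/vendor quality checks"),
    (1024, "read is a PCR/optical duplicate"),
]


def decode_flag(flag):
    """Return the descriptions of every bit that is set in a FLAG score."""
    return [description for bit, description in FLAG_BITS if flag & bit]


print(format(4, "011b"))  # 00000000100
print(decode_flag(4))     # ['read itself is unmapped']
print(decode_flag(99))    # 99 = 64 + 32 + 2 + 1, so four bits are set
```

A bitwise AND (`flag & bit`) is non-zero exactly when that question was answered yes, so no string conversion to binary is needed to test a single bit.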

CIGAR string

The CIGAR string stores the operations needed to turn the reference sequence into the experimentally sequenced read.

  • M: Alignment match; can be a seq (character) match or a mismatch (SNP).
  • I: Insertion.
  • D: Deletion.
  • N: Skipped region of the reference (split or spliced read).
  • S: Soft clipping (clipped sequences present in SEQ).
  • H: Hard clipping (clipped sequences not present in SEQ).
  • P: Padding.
  • =: Seq match.
  • X: Seq mismatch (SNP).
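
A CIGAR string is a run of length/operation pairs, so it can be tokenized with a regular expression. The sketch below also adds up how many read bases and reference bases an alignment covers; which operations consume the read or the reference follows the SAM specification rather than anything stated in this note, and the function names are hypothetical.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
# Per the SAM specification: these operations advance along the read...
CONSUMES_READ = set("MIS=X")
# ...and these advance along the reference.
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    tokens = CIGAR_TOKEN.findall(cigar)
    # Reject strings containing anything the pattern did not match.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError("malformed CIGAR string: " + cigar)
    return [(int(length), op) for length, op in tokens]


def cigar_lengths(cigar):
    """Return (read bases, reference bases) covered by a CIGAR string."""
    read_length = reference_length = 0
    for length, op in parse_cigar(cigar):
        if op in CONSUMES_READ:
            read_length += length
        if op in CONSUMES_REFERENCE:
            reference_length += length
    return read_length, reference_length


print(parse_cigar("39M1D12M"))    # [(39, 'M'), (1, 'D'), (12, 'M')]
print(cigar_lengths("39M1D12M"))  # (51, 52)
```

For 39M1D12M, the read contributes 39 + 12 = 51 bases, while the alignment spans 39 + 1 + 12 = 52 reference positions because the deleted base exists only in the reference.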

Example:

Reference: ACCTGTC--TACCTTACG
Read:      ACCT-TCCATACTTTATC
Action:    MMMMDMMIIMMMMMMMSS
CIGAR:     4M1D2M2I7M2S
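
The last step of this example, collapsing the per-position actions into a CIGAR string, is a run-length encoding. A minimal sketch using `itertools.groupby` (the function name is made up for this note):

```python
from itertools import groupby


def actions_to_cigar(actions):
    """Run-length encode a per-position action string into a CIGAR string."""
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(actions))


print(actions_to_cigar("MMMMDMMIIMMMMMMMSS"))  # 4M1D2M2I7M2S
```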