CIGAR string explanation

Jacques Dainat-4
Hello,

Here is an example of the cigar string output from exonerate (exactly the same command as launched by MAKER):

cigar: P46461.1 3 740 . genome 460484 439594 - 2580  M 84 I 1 D 56 M 154 I 3 M 54 D 1554 M 145 D 3346 M 137 D 120 M 160 D 197 M 182 D 145 M 165 D 415 M 170 D 5037 M 321 D 124 M 158 D 116 M 183 D 1819 M 157 D 5776 M 115
vulgar: P46461.1 3 740 . genome 460484 439594 - 2580 M 28 84 G 1 0 S 0 2 5 0 2 I 0 50 3 0 2 S 1 1 M 51 153 G 3 0 M 18 54 S 0 2 5 0 2 I 0 1548 3 0 2 S 1 1 M 48 144 S 0 1 5 0 2 I 0 3341 3 0 2 S 1 2 M 45 135 S 0 2 5 0 2 I 0 114 3 0 2 S 1 1 M 53 159 S 0 1 5 0 2 I 0 192 3 0 2 S 1 2 M 60 180 5 0 2$
-- completed exonerate analysis


And here is the result we get in the protein2genome.gff output from MAKER:

@000426F|arrow|arrow    protein2genome  protein_match   439595  460484  2580    -       .       ID=@000426F|arrow|arrow:hit:153696:3.10.0.4;Name=P46461.1;target_length=745;aligned_coverage=98.93;aligned_identity=72.6
@000426F|arrow|arrow    protein2genome  match_part      460399  460484  2580    -       .       ID=@000426F|arrow|arrow:hsp:233933:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 4 32;Gap=F2 I1 M28
@000426F|arrow|arrow    protein2genome  match_part      460135  460344  2580    -       .       ID=@000426F|arrow|arrow:hsp:233934:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 33 105;Gap=F2 M18 I3 M52 R2
@000426F|arrow|arrow    protein2genome  match_part      458437  458582  2580    -       .       ID=@000426F|arrow|arrow:hsp:233935:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 106 154;Gap=F1 M49 R2
@000426F|arrow|arrow    protein2genome  match_part      454953  455091  2580    -       .       ID=@000426F|arrow|arrow:hsp:233936:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 155 200;Gap=F2 M46 R1
@000426F|arrow|arrow    protein2genome  match_part      454674  454834  2580    -       .       ID=@000426F|arrow|arrow:hsp:233937:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 201 254;Gap=F1 M54 R2
@000426F|arrow|arrow    protein2genome  match_part      454296  454477  2580    -       .       ID=@000426F|arrow|arrow:hsp:233938:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 255 315;Gap=M61 R1
@000426F|arrow|arrow    protein2genome  match_part      453985  454150  2580    -       .       ID=@000426F|arrow|arrow:hsp:233939:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 316 370;Gap=F1 M55
@000426F|arrow|arrow    protein2genome  match_part      453401  453570  2580    -       .       ID=@000426F|arrow|arrow:hsp:233940:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 371 427;Gap=M57 R1
@000426F|arrow|arrow    protein2genome  match_part      448042  448363  2580    -       .       ID=@000426F|arrow|arrow:hsp:233941:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 428 534;Gap=F1 M107
@000426F|arrow|arrow    protein2genome  match_part      447761  447918  2580    -       .       ID=@000426F|arrow|arrow:hsp:233942:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 535 587;Gap=M53 R1
@000426F|arrow|arrow    protein2genome  match_part      447460  447644  2580    -       .       ID=@000426F|arrow|arrow:hsp:233943:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 588 648;Gap=F2 M61
@000426F|arrow|arrow    protein2genome  match_part      445484  445642  2580    -       .       ID=@000426F|arrow|arrow:hsp:233944:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 649 701;Gap=F2 M53 R2
@000426F|arrow|arrow    protein2genome  match_part      439595  439709  2580    -       .       ID=@000426F|arrow|arrow:hsp:233945:3.10.0.4;Parent=@000426F|arrow|arrow:hit:153696:3.10.0.4;Target=P46461.1 702 740;Gap=M39 R2

MAKER apparently processes the CIGAR string and saves it in the Gap attribute. The value looks like a CIGAR string, but it is different. The letters we find are M, D, I, R and F. I guess M=match, D=deletion and I=insertion, but I don’t understand the meaning of R and F.
Could you explain their meanings?

Best regards,

/Jacques
-------------------------------------------------
Jacques Dainat, Ph.D.
NBIS (National Bioinformatics Infrastructure Sweden)
Genome Annotation Service
http://nbis.se/about/staff/jacques-dainat

Contact — 
Address: Uppsala University, Biomedicinska Centrum
Department of Medical Biochemistry Microbiology, Genomics
Husargatan 3, box 582
S-75123 Uppsala Sweden
Phone: +46 18 471 46 25


_______________________________________________
maker-devel mailing list
[hidden email]
http://box290.bluehost.com/mailman/listinfo/maker-devel_yandell-lab.org

Re: CIGAR string explanation

Carson Holt-2
Once upon a time, the link in the official GFF3 specification to the CIGAR string documentation actually worked, and it brought you to a nice page that explained everything. It described how F and R are used in protein-space alignments: F is a forward frameshift and R is a reverse frameshift in the alignment. M1 in protein space is an amino acid match (it matches 3 bp in nucleotide space); this was clear in the now-broken link. Likewise, I1 is an amino acid insertion (3 bp in nucleotide space) and D1 is an amino acid deletion (3 bp in nucleotide space). F and R therefore allow single-bp movement to the left or right within amino acid space.

Sometimes this happens in Exonerate where it appears as a slightly shifted codon (the codons look stacked), but it also happens when an amino acid is split across a splice site (the first part of a codon is on one exon and the second part on the next exon).

The raw exonerate cigar you show doesn’t have this because it’s only half the cigar and it’s in nucleotide space. The value shown in Gap= has to be in the same space as the Target= feature, which in this case is a protein, so we build the protein cigar string from the vulgar string according to the now-broken documentation on Gap attributes. In your first match_part (Gap=F2 I1 M28) you have 28 amino acid matches, 1 insertion, and then an amino acid split across the intron (1 bp of the codon on one side and 2 bp on the other side); the order is flipped because the alignment is on the opposite strand.

—Carson
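
As a rough illustration of the explanation above, here is a minimal Python sketch that checks a protein-space Gap string against the coordinates of its match_part row. It assumes the operation semantics of the GFF3 Gap attribute (M consumes both sequences, I consumes only the protein target, D consumes only the genomic reference, F and R shift the reference forward or backward by single bases). The function name gap_spans is made up for this example and is not part of MAKER.

```python
import re

def gap_spans(gap, protein_target=True):
    """Return (reference_length, target_length) consumed by a GFF3 Gap string.

    Assumption: with a protein Target, one M or D unit covers 3 reference
    bases, while F and R always count single reference bases.
    """
    step = 3 if protein_target else 1
    ref_len = 0      # bases on the genome (columns 4-5 of the GFF row)
    target_len = 0   # residues on the Target= sequence
    for op, count in re.findall(r"([MIDFR])(\d+)", gap):
        n = int(count)
        if op == "M":        # match: consumes reference and target
            ref_len += step * n
            target_len += n
        elif op == "I":      # gap in the reference: consumes target only
            target_len += n
        elif op == "D":      # gap in the target: consumes reference only
            ref_len += step * n
        elif op == "F":      # forward frameshift: extra reference bases
            ref_len += n
        elif op == "R":      # reverse frameshift: step back on the reference
            ref_len -= n
    return ref_len, target_len

# (start, end, target_start, target_end, Gap) copied from the first three
# match_part rows posted in this thread.
rows = [
    (460399, 460484, 4, 32, "F2 I1 M28"),
    (460135, 460344, 33, 105, "F2 M18 I3 M52 R2"),
    (458437, 458582, 106, 154, "F1 M49 R2"),
]
for start, end, t_start, t_end, gap in rows:
    ref_len, target_len = gap_spans(gap)
    print(gap, ref_len == end - start + 1, target_len == t_end - t_start + 1)
```

For these three rows both lengths agree with the GFF coordinates, which is consistent with F and R accounting for the partial codons at the exon boundaries.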


On Oct 23, 2018, at 7:56 AM, Jacques Dainat <[hidden email]> wrote:

[...]

Re: CIGAR string explanation

Jacques Dainat-4
Thanks for your response.
It’s surprising that the link on the Sequence Ontology web site doesn’t work anymore. I will notify them.

I was surprised that I could not find any resource on the internet describing these values. Helped by your answer, I refined my keywords and googled again, and I finally found old resources describing them too.

I am putting a copy of the WormBase description here in case those resources also disappear. At that time, it seems it was not yet officially accepted by the SO.

/Jacques

On 23 Oct 2018, at 17:55, Carson Holt <[hidden email]> wrote:

[...]

Is gene retrieval from gff possible?

Elyssa Garza
In reply to this post by Jacques Dainat-4


Hello

I recently annotated my plant genome and am looking at retrieving a particular set of genes from the MAKER results. I have a list of TAIR IDs that I am particularly interested in and was thinking about using the GFF file to help pull out the associated transcripts. I was wondering if you could advise me on the best or easiest way of obtaining the associated TAIR accession or gene model from the GFF file.

I did try looking at the genes (41,779 genes) using CLCbio but the accessions were not easily identified or found. I also looked at the protein matches (819,805 protein matches) and was able to easily find gene model matches corresponding to my target accessions. Is it wise to do this? Can you explain why I can't find these same protein matches in the gene file? I have some ideas on why this is happening but I am looking for support for them.

Elyssa

