Annotating alternate alleles of a SNP

Annotating alternate alleles of a SNP

Mara Kim-2
Hello gmod-ers,

I was wondering what others have done to store supporting data for SNPs.  SNPs present an interesting challenge, as there are data pertaining to the site (Tajima's D, Fst, etc.) as well as data pertaining to different alleles at the site (e.g., allele frequencies for different populations, selection coefficients, etc.).

The best practices on the wiki suggest using a single feature to denote the SNP itself, with alternate alleles as different ranks in featureloc using residue_info to describe the alternate sequences.  My question is, how have people been adding things like allele frequencies?

The lack of analysisfeatureloc or featurelocprop tables makes it difficult to be clear which variant is being referenced by a given annotation.  Does anyone have suggestions?
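
For readers unfamiliar with the layout being discussed, here is a minimal sketch of the single-feature, one-featureloc-per-allele arrangement. It uses Python's sqlite3 module as a stand-in for PostgreSQL, and the tables are cut down to a few columns. The feature names, the coordinates, and the assumption that rank 0 holds the reference allele are illustrative only and are not taken from the wiki.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Heavily simplified stand-ins for Chado's feature and featureloc tables.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature (feature_id),
    srcfeature_id INTEGER REFERENCES feature (feature_id),
    fmin          INTEGER,
    fmax          INTEGER,
    residue_info  TEXT,
    locgroup      INTEGER NOT NULL DEFAULT 0,
    rank          INTEGER NOT NULL DEFAULT 0,
    UNIQUE (feature_id, locgroup, rank)
);
""")

# Hypothetical chromosome (id 1) and SNP (id 2).
conn.execute("INSERT INTO feature VALUES (1, 'chr1')")
conn.execute("INSERT INTO feature VALUES (2, 'snp_0001')")

# One featureloc row per allele; the rank tells the alleles apart.
# Assumption: rank 0 is the reference allele, higher ranks are alternates.
alleles = ["A", "G", "T"]
conn.executemany(
    "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax,"
    " residue_info, rank) VALUES (2, 1, 99, 100, ?, ?)",
    [(allele, rank) for rank, allele in enumerate(alleles)],
)

rows = conn.execute(
    "SELECT rank, residue_info FROM featureloc"
    " WHERE feature_id = 2 ORDER BY rank"
).fetchall()
for rank, allele in rows:
    print(rank, allele)
```

Properties such as Tajima's D can hang off the SNP feature itself, but a per-allele value has no table keyed on featureloc_id to live in, which is the gap described above.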

--
Mara Kim

Ph.D. Candidate
Computational Biology
Vanderbilt University
Nashville, TN

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Re: Annotating alternate alleles of a SNP

Karl O. Pinc
Hi Mara,

On Fri, 6 Mar 2015 14:17:00 -0600
Mara Kim <[hidden email]> wrote:

> I was wondering what others have done to store supporting data for
> SNPs. SNPs present an interesting challenge, as there are data
> pertaining to the site (Tajima's D, Fst, etc.) as well as data
> pertaining to different alleles at the site (eg. allele frequencies
> for different populations, selection coefficients, etc.).
>
> The best practices on the wiki suggest using a single feature to
> denote the SNP itself, with alternate alleles as different ranks in
> featureloc using residue_info to describe the alternate sequences.
> My question is, how have people been adding things like allele
> frequencies?
>
> The lack of an analysisfeatureloc or featurelocprop tables make it
> difficult to be clear which variant is being referenced by a given
> annotation.  Does anyone have suggestions?

We have taken a different approach.  See:
http://papio.biology.duke.edu/babase_chado_html/chado-vcf-load.html
As you might guess from the program name,
we import our data into chado from VCF files.

We are doing long-term longitudinal studies of a
population and repeated high-throughput sequencing of
individuals in the population.  Rather than store SNV data
in featureloc, we create a feature for every SNV
site we ever analyze.  We also create a feature
for each site we ever analyze on any one of our
individuals, effectively treating our individuals
as different organisms.  (We use the same
feature.organism_id for all of them, though, and
instead use dbxref to designate which features
belong to which individuals.)

We then store all of our SNV data in analysisfeature
and analysisfeatureprop.  One thing this lets us do
is re-analyze the same SNV site of a given individual
and store the results per-analysis.
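
To make the per-analysis storage concrete, here is a minimal sketch in Python, with sqlite3 standing in for PostgreSQL.  It is not chado-vcf-load's code.  The tables are simplified versions of Chado's feature, analysis, analysisfeature and analysisfeatureprop (real Chado types properties through type_id and cvterm; a plain type_name column is used here instead).  The feature name, the analysis names and the helper functions are hypothetical; GT and DP are the usual VCF genotype and read-depth keys.

```python
import sqlite3

# Simplified stand-ins for Chado tables; real Chado has more columns.
SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL UNIQUE
);
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE analysisfeature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id  INTEGER NOT NULL REFERENCES feature (feature_id),
    analysis_id INTEGER NOT NULL REFERENCES analysis (analysis_id),
    UNIQUE (feature_id, analysis_id)
);
CREATE TABLE analysisfeatureprop (
    analysisfeatureprop_id INTEGER PRIMARY KEY,
    analysisfeature_id INTEGER NOT NULL
        REFERENCES analysisfeature (analysisfeature_id),
    type_name TEXT NOT NULL,  -- simplification of type_id -> cvterm
    value     TEXT,
    rank      INTEGER NOT NULL DEFAULT 0,
    UNIQUE (analysisfeature_id, type_name, rank)
);
"""


def get_or_create(conn, table, key_column, key):
    """Return the primary key for `key`, inserting the row if absent."""
    # table and key_column are fixed strings from this sketch, not user input.
    conn.execute(
        f"INSERT OR IGNORE INTO {table} ({key_column}) VALUES (?)", (key,)
    )
    row = conn.execute(
        f"SELECT {table}_id FROM {table} WHERE {key_column} = ?", (key,)
    ).fetchone()
    return row[0]


def record_call(conn, feature_name, analysis_name, props):
    """Attach one analysis's results to a per-individual SNV feature."""
    feature_id = get_or_create(conn, "feature", "uniquename", feature_name)
    analysis_id = get_or_create(conn, "analysis", "name", analysis_name)
    cur = conn.execute(
        "INSERT INTO analysisfeature (feature_id, analysis_id) VALUES (?, ?)",
        (feature_id, analysis_id),
    )
    conn.executemany(
        "INSERT INTO analysisfeatureprop"
        " (analysisfeature_id, type_name, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, name, str(value)) for name, value in props.items()],
    )


def results_by_analysis(conn, feature_name):
    """Return {analysis name: {property name: value}} for one feature."""
    rows = conn.execute(
        """
        SELECT a.name, p.type_name, p.value
        FROM feature f
        JOIN analysisfeature af ON af.feature_id = f.feature_id
        JOIN analysis a ON a.analysis_id = af.analysis_id
        JOIN analysisfeatureprop p
          ON p.analysisfeature_id = af.analysisfeature_id
        WHERE f.uniquename = ?
        ORDER BY a.name, p.type_name
        """,
        (feature_name,),
    )
    out = {}
    for analysis_name, type_name, value in rows:
        out.setdefault(analysis_name, {})[type_name] = value
    return out


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical per-individual site feature, analyzed twice.
site = "ind042:chr1:12345"
record_call(conn, site, "run-2015-01", {"GT": "0/1", "DP": 14})
record_call(conn, site, "run-2015-03", {"GT": "1/1", "DP": 31})
results = results_by_analysis(conn, site)
```

Because each result row is keyed on the (feature, analysis) pair, a second analysis of the same site adds rows alongside the first instead of overwriting them.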

We do things like use
feature_relationship to relate all of the features
that are at a single SNV site.  See the program
docs for all the details.  We've also extended
chado to give the ANALYSIS table its own
type and prop tables, since it seemed like
it couldn't hurt to be able to attach more
information to the analysis.  That said, we're not actually
using this to record much regarding the high-throughput
sequencing analysis decisions.

All of our source code is, by the by, available
and open source.  See: http://papio.biology.duke.edu

Note that in order to make this work for us we made
some (very conventional) changes to Chado.  Notably,
we added an Analysis.Type_Id column, to allow us to classify
our analyses.  We also had to add a few indexes to
be able to load VCF files within a reasonable time
frame.

In the FWIW category
there were also a number of issues with
chado defaults.  Chado's practice of deferring
cascading deletes (a feature I find questionable)
needed to be turned off lest bad things happen in
transaction-land.  The chado-vcf-load program does
this and provides the --transaction-isolation-level
argument for finer control.  Note that you will only
have problems with this if you use a reasonably recent
version of Postgres, one that supports a serializable
transaction isolation level.

chado-vcf-load also provides a --do-pg-analyze argument,
without which it will take forever to load
the first one or two VCF files, because the db won't have
statistics for the query planner, which will then
generate bad query plans.

You also need to have postgres configured to use "enough"
shared memory and in general be prepared to consume
resources.  So far, we've gotten away with a (virtual)
server with only 2G of RAM and a single (not so fast)
core.  On this it takes about 6 days to load a 750MB
VCF file.  It winds up using around 50G of disk in
Postgres.  We plan to deal with scaling issues as they
arise.  Note that, when loading a single VCF file,
you probably won't get much benefit from having more
than 2 cores.
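
For a rough sense of scale, the figures above work out as in the following back-of-the-envelope calculation.  It assumes decimal units (1MB = 10^6 bytes, 1G = 10^9 bytes) and assumes that the 50G of disk is attributable to that one 750MB file; both are assumptions, and the inputs are themselves approximate.

```python
# Approximate figures quoted above.
vcf_bytes = 750 * 10**6          # 750MB VCF file (decimal units assumed)
load_seconds = 6 * 24 * 60 * 60  # about 6 days
disk_bytes = 50 * 10**9          # around 50G in Postgres (decimal assumed)

# Average rate at which VCF input is consumed during the load.
throughput_bytes_per_second = vcf_bytes / load_seconds

# How much larger the stored data is than the input file.
expansion_factor = disk_bytes / vcf_bytes

print(f"{throughput_bytes_per_second:.0f} bytes/s of VCF input")
print(f"{expansion_factor:.1f}x on-disk expansion")
```

That is on the order of 1.4 kilobytes of VCF input per second, and roughly a 67-fold expansion on disk.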

See http://papio.biology.duke.edu/babase_chado_html/
and particularly,
http://papio.biology.duke.edu/babase_chado_html/Babase-Chado-Extensions.html
for changes we made to Chado.  Note that we've made
other changes for other reasons.

You might contact Siddhartha Basu <[hidden email]> for
an outside opinion on this approach since I have mentioned
it to him.  See this thread:
http://sourceforge.net/p/gmod/mailman/message/33241162/

I'd love to hear any comments anyone might have.

BTW, after developing this scheme I wanted to place a
note about it on the Chado wiki but a cursory search for the
place on the wiki describing SNV storage strategies turned
up nothing.  If you would forward a link to me I'll try to
see about adding a note on our approach to the wiki.

If you wind up using our scheme and ever present attribution
please credit Jenny Tung and Susan Alberts, both of Duke,
for development and the National Institute on Aging
under grant R03-AG045459-01 for funding.  (And I suppose, me.)

Note that our programs are still under development and
could be improved.  A particular limitation is that
chado-vcf-load presently requires that all SNVs are located
on chromosomes.  It's pretty clear how to enhance this
but it's not a feature we need at the moment.

Regards,


Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein
