Querying for Objects without references

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Querying for Objects without references

Alex Kalderimis-2
Several people have asked us how to perform queries with constraints on
whether an object has or doesn't have a particular reference, eg. genes without
an organism, or all genes without any exons. We have always replied that this is
not a feature we directly support, and that we would look into such constraint
types. The other day, though, JS Wong of FlyBase came up with an elegant solution
based on category theory, that can be generalised as subqueries when using
the webservice clients. This is at best a nice work-around, and at worst
a horrible hack with some serious performance issues, but it does provide
a working solution when it is needed, and has pointed the way to more performant
implementations. Below there is a bried discussion of what this method is and
how it works, as well as sample code in perl and python.

The idea
========

The basic method is to define a complementary set to the desired results, and constrain
on that, such that "genes without exons" = "all genes" - "genes with exons". To do this, all
we need is to constrain on something that is a good proxy for object existence, and for that
the "id" field, which every object in an intermine database has is well suited. This field
never contains useful information in its own right, but for these purposes, it is well suited.
So to define our complement, we can say "genes with exons" = "all genes where exons.id IS NOT NULL".

Logging in
==========

This method relies on the creation of a temporary list that defines the complement. This means that
in order to use this method you will need to authenticate with your token (to get yours, go to
the "My Mine" section of your mine's web-site) or with your username and password (but, not really, since
we don't actually recommend you sending your email and passwords around over the internet).

Performance
===========

Since the complement is a temporary list, it needs to be written to the database before your
query can run, and then removed when your script is finished. For small complements this is fine,
but in the above example ("all genes with exons"), the temporary list is over 250,000 genes, and takes
a minute or two to create on FlyMine's database, and about half that to remove. This means that queries
of this type with large complements will be very slow, and may degrade mine performance for everyone. You
can mitigate this by using very specific complements, including all the constraints for the rest of your
query. Examples of this are included below.

Examples
========

Without, further ado, examples:

(these examples were tested with version 0.9904 of the Perl client and 0.99.04 of the Python client. These
are available from http://search.cpan.org/perldoc?Webservice::InterMine and http://pypi.python.org/pypi/intermine
respectively).

Perl
----

Works, but VERY inefficient: getting all genes without exons.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   #!/usr/bin/perl
   use Webservice::InterMine 0.9904;
   use feature 'say';

   my $service = get_service("flymine.org/query", $YOUR_TOKEN);
   my $has_exons = $service->select("Gene.*")->where("exons.id" =>  "IS NOT NULL");
   my $no_exons  = $service->select("Gene.*", "organism.name")->where("Gene" =>  {not_in =>  $has_exons})

   my $iterator = $no_exons->iterator(size =>  10)
   while (my $row =<$iterator>) {
     say $row->{symbol}, $row->{"organism.name"};
   }

Better: getting all genes from one of your lists without exons:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note the use of the "to_query" method on a list.

   #!/usr/bin/perl
   use Webservice::InterMine 0.9904;
   use feature 'say';

   my $service = get_service("flymine.org/query", $YOUR_TOKEN);
   my $list       = $service->list("my-genes-of-interest");
   my $with_exons = $list->to_query->where("exons.id" =>  "IS NOT NULL");
   my $no_exons   = $list->to_query->add_views("organism.name")->where(Gene =>  {not_in =>  $with_exons);

   my $iterator = $no_exons->iterator(size =>  10)
   while (my $row =<$iterator>) {
     say $row->{symbol}, $row->{"organism.name"};
   }

Also Better: getting all genes with a certain protein domain without exons:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note the use of the "clone" method to create queries from a prototypical instance.

   #!/usr/bin/perl
   use Webservice::InterMine 0.9904;
   use feature 'say';

   my $service = get_service("flymine.org/query", $YOUR_TOKEN);

   my $homeoboxes = $service->select("Gene.*")->where("proteins.proteinDomains.name" =>  "Homeobox");
   my $with_exons = $homeoboxes->clone->where("exons.id" =>  "IS NOT NULL")
   my $no_exons   = $homeoboxes->clone->add_views("organism.name")->where("Gene =>  {not_in =>  $with_exons})

   my $iterator = $no_exons->iterator(size =>  10)
   while (my $row =<$iterator>) {
     say $row->{symbol}, $row->{"organism.name"};
   }

Python
------

Works, but VERY inefficient: getting all genes without exons.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


   #!/usr/bin/env python
   from webservice.intermine import Service

   s = Service("squirrel.flymine.org/flymine", token = YOUR_TOKEN)
   Gene = s.model.Gene

   genes_with_exons = Gene.where("exons.id", "IS NOT NULL")
   genes_wo_exons   = Gene.select("*", "organism.name").where(Gene ^ genes_with_exons)

   for gene in genes_wo_exons.results(size = 10):
     print gene.symbol, gene.organism.name


Better: getting all genes from one of your lists without exons:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note the use of the "to_query" method on a list.

   #!/usr/bin/env python
   from webservice.intermine import Service

   s = Service("squirrel.flymine.org/flymine", token = YOUR_TOKEN)
   Gene = s.model

   x = s.get_list("my-genes-of-interest")
   with_exons = x.to_query().where(Gene.exons.id != None)
   without_exons = x.to_query().add_views("organism.name").where(Gene ^ with_exons)

   for gene in without_exons.results(size = 10):
     print gene.symbol, gene.organism.name


Also Better: getting all genes with a certain protein domain without exons:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note the use of the "clone" method to create queries from a prototypical instance.

   #!/usr/bin/env python
   from webservice.intermine import Service

   s = Service("squirrel.flymine.org/flymine", token = YOUR_TOKEN)
   Gene = s.model

   homeoboxes    = Gene.where(Gene.proteins.proteinDomains.name == "Homeobox")
   with_exons    = homeoboxes.clone().where(Gene.exons.id != None)
   without_exons = homeoboxes.clone().select("*", "organism.name").where(Gene ^ with_exons)

   for gene in without_exons.results(size = 10):
     print gene.symbol, gene.organism.name


Please feel free to get back to me with questions regarding these examples and the ideas discussed here,

Alex



_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Reply | Threaded
Open this post in threaded view
|

Re: Querying for Objects without references

Alex Kalderimis-2
On 09/12/11 19:15, Alex Kalderimis wrote:

> Several people have asked us how to perform queries with constraints on
> whether an object has or doesn't have a particular reference, eg.
> genes without
> an organism, or all genes without any exons. We have always replied
> that this is
> not a feature we directly support, and that we would look into such
> constraint
> types. The other day, though, JS Wong of FlyBase came up with an
> elegant solution
> based on category theory, that can be generalised as subqueries when
> using
> the webservice clients. This is at best a nice work-around, and at worst
> a horrible hack with some serious performance issues, but it does provide
> a working solution when it is needed, and has pointed the way to more
> performant
> implementations. Below there is a bried discussion of what this method
> is and
> how it works, as well as sample code in perl and python.
>
> The idea
> ========
>
> The basic method is to define a complementary set to the desired
> results, and constrain
> on that, such that "genes without exons" = "all genes" - "genes with
> exons". To do this, all
> we need is to constrain on something that is a good proxy for object
> existence, and for that
> the "id" field, which every object in an intermine database has is
> well suited. This field
> never contains useful information in its own right, but for these
> purposes, it is well suited.
> So to define our complement, we can say "genes with exons" = "all
> genes where exons.id IS NOT NULL".
>
> Logging in
> ==========
>
> This method relies on the creation of a temporary list that defines
> the complement. This means that
> in order to use this method you will need to authenticate with your
> token (to get yours, go to
> the "My Mine" section of your mine's web-site) or with your username
> and password (but, not really, since
> we don't actually recommend you sending your email and passwords
> around over the internet).
>
> Performance
> ===========
>
> Since the complement is a temporary list, it needs to be written to
> the database before your
> query can run, and then removed when your script is finished. For
> small complements this is fine,
> but in the above example ("all genes with exons"), the temporary list
> is over 250,000 genes, and takes
> a minute or two to create on FlyMine's database, and about half that
> to remove. This means that queries
> of this type with large complements will be very slow, and may degrade
> mine performance for everyone. You
> can mitigate this by using very specific complements, including all
> the constraints for the rest of your
> query. Examples of this are included below.
>
> Examples
> ========
>
> Without, further ado, examples:
>
> (these examples were tested with version 0.9904 of the Perl client and
> 0.99.04 of the Python client. These
> are available from
> http://search.cpan.org/perldoc?Webservice::InterMine and
> http://pypi.python.org/pypi/intermine
> respectively).
>
> Perl
> ----
>
> Works, but VERY inefficient: getting all genes without exons.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>   #!/usr/bin/perl
>   use Webservice::InterMine 0.9904;
>   use feature 'say';
>
>   my $service = get_service("flymine.org/query", $YOUR_TOKEN);
>   my $has_exons = $service->select("Gene.*")->where("exons.id" =>  "IS
> NOT NULL");
>   my $no_exons  = $service->select("Gene.*",
> "organism.name")->where("Gene" =>  {not_in =>  $has_exons})
>
>   my $iterator = $no_exons->iterator(size =>  10)
>   while (my $row =<$iterator>) {
>     say $row->{symbol}, $row->{"organism.name"};
>   }
>
> Better: getting all genes from one of your lists without exons:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Note the use of the "to_query" method on a list.
>
>   #!/usr/bin/perl
>   use Webservice::InterMine 0.9904;
>   use feature 'say';
>
>   my $service = get_service("flymine.org/query", $YOUR_TOKEN);
>   my $list       = $service->list("my-genes-of-interest");
>   my $with_exons = $list->to_query->where("exons.id" =>  "IS NOT NULL");
>   my $no_exons   =
> $list->to_query->add_views("organism.name")->where(Gene =>  {not_in
> =>  $with_exons);
>
>   my $iterator = $no_exons->iterator(size =>  10)
>   while (my $row =<$iterator>) {
>     say $row->{symbol}, $row->{"organism.name"};
>   }
>
> Also Better: getting all genes with a certain protein domain without
> exons:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
> Note the use of the "clone" method to create queries from a
> prototypical instance.
>
>   #!/usr/bin/perl
>   use Webservice::InterMine 0.9904;
>   use feature 'say';
>
>   my $service = get_service("flymine.org/query", $YOUR_TOKEN);
>
>   my $homeoboxes =
> $service->select("Gene.*")->where("proteins.proteinDomains.name" =>  
> "Homeobox");
>   my $with_exons = $homeoboxes->clone->where("exons.id" =>  "IS NOT
> NULL")
>   my $no_exons   =
> $homeoboxes->clone->add_views("organism.name")->where("Gene =>  
> {not_in =>  $with_exons})
>
>   my $iterator = $no_exons->iterator(size =>  10)
>   while (my $row =<$iterator>) {
>     say $row->{symbol}, $row->{"organism.name"};
>   }
>
> Python
> ------
>
> Works, but VERY inefficient: getting all genes without exons.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>   #!/usr/bin/env python
>   from webservice.intermine import Service
>
>   s = Service("squirrel.flymine.org/flymine", token = YOUR_TOKEN)
>   Gene = s.model.Gene
>
>   genes_with_exons = Gene.where("exons.id", "IS NOT NULL")
>   genes_wo_exons   = Gene.select("*", "organism.name").where(Gene ^
> genes_with_exons)
>
>   for gene in genes_wo_exons.results(size = 10):
>     print gene.symbol, gene.organism.name
>
>
> Better: getting all genes from one of your lists without exons:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Note the use of the "to_query" method on a list.
>
>   #!/usr/bin/env python
>   from webservice.intermine import Service
>
>   s = Service("squirrel.flymine.org/flymine", token = YOUR_TOKEN)
>   Gene = s.model
>
>   x = s.get_list("my-genes-of-interest")
>   with_exons = x.to_query().where(Gene.exons.id != None)
>   without_exons = x.to_query().add_views("organism.name").where(Gene ^
> with_exons)
>
>   for gene in without_exons.results(size = 10):
>     print gene.symbol, gene.organism.name
>
>
> Also Better: getting all genes with a certain protein domain without
> exons:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
> Note the use of the "clone" method to create queries from a
> prototypical instance.
>
>   #!/usr/bin/env python
>   from webservice.intermine import Service
>
>   s = Service("squirrel.flymine.org/flymine", token = YOUR_TOKEN)
>   Gene = s.model
>
>   homeoboxes    = Gene.where(Gene.proteins.proteinDomains.name ==
> "Homeobox")
>   with_exons    = homeoboxes.clone().where(Gene.exons.id != None)
>   without_exons = homeoboxes.clone().select("*",
> "organism.name").where(Gene ^ with_exons)
>
>   for gene in without_exons.results(size = 10):
>     print gene.symbol, gene.organism.name
>
>
> Please feel free to get back to me with questions regarding these
> examples and the ideas discussed here,
>
> Alex
>
>
>
> _______________________________________________
> dev mailing list
> [hidden email]
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Minor correction: I meant to credit JD Wong with pointing this out, not
JS Wong. My apologies

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev