[biomart-users] Curl timeout when fetching anything related to paralogs

Kyle Duyck
This query worked before the migration but has since stopped working. I have tested multiple species, Ensembl versions, and mirrors, all to no avail. It seems that all paralog queries are broken.

library(biomaRt)

# Unfiltered paralog query for every human gene; this is the call that times out.
mymart <- useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
mydatabase <- getBM(attributes = c("ensembl_gene_id",
                                   "hsapiens_paralog_ensembl_gene",
                                   "hsapiens_paralog_subtype",
                                   "hsapiens_paralog_orthology_type"),
                    mart       = mymart,
                    uniqueRows = FALSE)

--
You received this message because you are subscribed to the Google Groups "biomart-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web, visit https://groups.google.com/d/msgid/biomart-users/143c7828-f512-4c01-9202-22dd83685f26%40googlegroups.com.
[biomart-users] Re: Curl timeout when fetching anything related to paralogs

Kyle Duyck
I found that this problem is partly due to the size of the query, but also to a potential oversight in how biomaRt batches queries.

The solution below filters to genes with a known paralog, and at first glance this appears to fix the timeout by reducing the query size (which I had speculated was related to a cartesian product). In fact, the filter reduces the query size by less than 5%, so the filtering itself is not what fixes the timeout. The two-step solution does not change the cartesian product either: if the timeout limit in getBM's submitQueryXML method is raised above 300 seconds, giving the single large query enough time to finish, both approaches return identical results. So why does the solution below work?

Matthew's two-step approach, which first identifies the gene IDs that have paralogs and then queries the paralogs by gene ID, triggers a condition in getBM's generateFilterXML method that splits the job into batches of 5000 values using the relatively recently added splitValues method. Because each of these small batches finishes well within the timeout, the batches are queried individually and the results are stitched together without issue. This leaves me wondering why batching is not the default for attributes such as paralogs, which are bound to time out for most users.

Solution from Biostars: https://www.biostars.org/p/331872/

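library(biomaRt)

# `human` and `attr` were not defined in the original snippet; the two
# definitions below are assumptions about how they were created.
human <- useEnsembl(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
attr  <- listAttributes(human)

# Gene ID plus every attribute whose name mentions "paralog".
para.attr <- c("ensembl_gene_id", unique(attr$name[grepl("paralog", attr$name)]))

# Step 1: IDs of all genes that have at least one paralog.
hgid <- getBM(attributes = "ensembl_gene_id",
              filters    = "with_hsapiens_paralog",
              values     = TRUE,
              mart       = human)$ensembl_gene_id

# Step 2: query the paralogs by gene ID, which lets getBM batch the request.
para <- getBM(attributes = para.attr,
              filters    = "ensembl_gene_id",
              values     = hgid,
              mart       = human)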
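
The batching described above can also be made explicit rather than left to getBM. What follows is only a minimal sketch in base R, not biomaRt's own implementation: `split_into_batches` and `fetch_in_batches` are hypothetical helper names, and the batch size of 5000 is taken from the description above rather than checked against the package source.

```r
# Hypothetical helper: split a vector into consecutive chunks of at most
# `size` elements, preserving the original order.
split_into_batches <- function(ids, size = 5000) {
  stopifnot(size >= 1)
  if (length(ids) == 0) return(list())
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Hypothetical helper: call `fetch` (any function that takes a vector of IDs
# and returns a data.frame) once per batch, then row-bind the results.
fetch_in_batches <- function(ids, fetch, size = 5000) {
  parts <- lapply(split_into_batches(ids, size), fetch)
  do.call(rbind, parts)
}

# Possible use with the objects from the solution above (needs network access):
# para <- fetch_in_batches(hgid, function(batch)
#   getBM(attributes = para.attr, filters = "ensembl_gene_id",
#         values = batch, mart = human))
```

Passing the query as a function keeps the batching logic independent of biomaRt, so it can be tried offline with a stub before it is pointed at a live mart.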


In reply to Kyle Duyck's original message of Thursday, May 7, 2020 at 12:00:01 PM UTC-5.