[BioMart Users] Joining datasets: 0.8 rc6

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[BioMart Users] Joining datasets: 0.8 rc6

Tony Miqueli

I am trying to setup a biomart 0.8 instance that has datasources in two separate Oracle databases.   Each database instance has just one table with two columns, a unique identifier and varchar column.  The unique identifier columns are the link between the two datasets.  I am trying to setup a simple join between these two tables.  According the rc6 documentation,  it appears that creating a pointer attribute would create this link and effectively ‘inner join’ these to datasets at runtime.  I’m able to create the pointer attribute in the target access point, setup the datasource link, start the webserver and attempt to run the query.  From the web interface I see the 3 attributes (the shared unique ID, varchar column from datasource ‘A’, varchar column from datasource ‘B’), select them and run the query…which errors out.  I ran this query through the Java API in Eclipse and while debugging noticed that an exception is thrown when trying to query the second datasource:  ORA-01795: maximum number of expressions in a list is 1000.

 

Stepping through the code in the debugger, it appears that it’s running a query to get all of the results from datasource ‘A’, and then turning the ID column into a list and using it in a subsequent query to get the necessary resultset from datasource ‘B’…but there are well over 1000 records I am trying to join.  Is there a way around this so that the query mechanism doesn’t rely on using an IN(<list>) to join these two datasets?

 

Thanks so much in advance.  I really like what I see in the new release and am looking forward to getting this up and running!

 

Thanks again!

 

Tony


_______________________________________________
Users mailing list
[hidden email]
https://lists.biomart.org/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: [BioMart Users] Joining datasets: 0.8 rc6

Arek Kasprzyk
Hi Tony,
As far as I remember in the old (0.7) implementation we circumvented this problem by expanding "in list" for oracle into its equivalent: a collection of 'OR'  statements ei instead of "where ID in ("id1", "id2", "id3")" the software would expand it into "where ID="id1" OR ID="id2" OR ID="id3". this seemed to work fine.

If this is indeed the problem, this would be quite an easy fix. I am cc'ing Junjun Zhang who tested rc6 release in case he has a better insight into this problem

a




On Sun, Nov 27, 2011 at 10:49 PM, Tony R Miqueli <[hidden email]> wrote:

I am trying to setup a biomart 0.8 instance that has datasources in two separate Oracle databases.   Each database instance has just one table with two columns, a unique identifier and varchar column.  The unique identifier columns are the link between the two datasets.  I am trying to setup a simple join between these two tables.  According the rc6 documentation,  it appears that creating a pointer attribute would create this link and effectively ‘inner join’ these to datasets at runtime.  I’m able to create the pointer attribute in the target access point, setup the datasource link, start the webserver and attempt to run the query.  From the web interface I see the 3 attributes (the shared unique ID, varchar column from datasource ‘A’, varchar column from datasource ‘B’), select them and run the query…which errors out.  I ran this query through the Java API in Eclipse and while debugging noticed that an exception is thrown when trying to query the second datasource:  ORA-01795: maximum number of expressions in a list is 1000.

 

Stepping through the code in the debugger, it appears that it’s running a query to get all of the results from datasource ‘A’, and then turning the ID column into a list and using it in a subsequent query to get the necessary resultset from datasource ‘B’…but there are well over 1000 records I am trying to join.  Is there a way around this so that the query mechanism doesn’t rely on using an IN(<list>) to join these two datasets?

 

Thanks so much in advance.  I really like what I see in the new release and am looking forward to getting this up and running!

 

Thanks again!

 

Tony


_______________________________________________
Users mailing list
[hidden email]
https://lists.biomart.org/mailman/listinfo/users



_______________________________________________
Users mailing list
[hidden email]
https://lists.biomart.org/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: [BioMart Users] Joining datasets: 0.8 rc6

Junjun Zhang
In reply to this post by Tony Miqueli
Hi Tony,

In the current implementation, the join is done as you described, no
special handling is given to Oracle.

One way may get around that is to change the batch size so that it's less
than 1000. Find the following line in QueryRunner.java:

public static final int BATCH_SIZE = 5000;

Change it to say 800. What it does is that the second query will receive
up to 800 rows of result from the first query, and then add them in the IN
list of the second query.

Hope this helps,

Junjun



From:  Tony R Miqueli <[hidden email]>
Date:  Sun, 27 Nov 2011 22:49:48 -0500
To:  "[hidden email]" <[hidden email]>
Subject:  [BioMart Users] Joining datasets:  0.8 rc6


>I am trying to setup a biomart 0.8 instance that has datasources in two
>separate Oracle databases.   Each database instance has just one table
>with two columns, a unique identifier and varchar column.  The unique
>identifier columns are
> the link between the two datasets.  I am trying to setup a simple join
>between these two tables.  According the rc6 documentation,  it appears
>that creating a pointer attribute would create this link and effectively
>Œinner join¹ these to datasets at runtime.
> I¹m able to create the pointer attribute in the target access point,
>setup the datasource link, start the webserver and attempt to run the
>query.  From the web interface I see the 3 attributes (the shared unique
>ID, varchar column from datasource ŒA¹, varchar
> column from datasource ŒB¹), select them and run the queryŠwhich errors
>out.  I ran this query through the Java API in Eclipse and while
>debugging noticed that an exception is thrown when trying to query the
>second datasource:
>ORA-01795: maximum number of expressions in a list is 1000.
>
>Stepping through the code in the debugger, it appears that it¹s running a
>query to get all of the results from datasource
> ŒA¹, and then turning the ID column into a list and using it in a
>subsequent query to get the necessary resultset from datasource ŒB¹Šbut
>there are well over 1000 records I am trying to join.  Is there a way
>around this so that the query mechanism doesn¹t
> rely on using an IN(<list>) to join these two datasets?
>
>Thanks so much in advance.  I really like what I see in the new release
>and am looking forward to getting this up
> and running!
>
>Thanks again!
>
>Tony

_______________________________________________
Users mailing list
[hidden email]
https://lists.biomart.org/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: [BioMart Users] Joining datasets: 0.8 rc6

Tony Miqueli
That did it.  Thank you all so, so much!

Tony

-----Original Message-----
From: Junjun Zhang [mailto:[hidden email]]
Sent: Monday, November 28, 2011 11:14 AM
To: Tony R Miqueli; [hidden email]
Subject: Re: [BioMart Users] Joining datasets: 0.8 rc6

Hi Tony,

In the current implementation, the join is done as you described, no
special handling is given to Oracle.

One way may get around that is to change the batch size so that it's less
than 1000. Find the following line in QueryRunner.java:

public static final int BATCH_SIZE = 5000;

Change it to say 800. What it does is that the second query will receive
up to 800 rows of result from the first query, and then add them in the IN
list of the second query.

Hope this helps,

Junjun



From:  Tony R Miqueli <[hidden email]>
Date:  Sun, 27 Nov 2011 22:49:48 -0500
To:  "[hidden email]" <[hidden email]>
Subject:  [BioMart Users] Joining datasets:  0.8 rc6


>I am trying to setup a biomart 0.8 instance that has datasources in two
>separate Oracle databases.   Each database instance has just one table
>with two columns, a unique identifier and varchar column.  The unique
>identifier columns are
> the link between the two datasets.  I am trying to setup a simple join
>between these two tables.  According the rc6 documentation,  it appears
>that creating a pointer attribute would create this link and effectively
>Œinner join¹ these to datasets at runtime.
> I¹m able to create the pointer attribute in the target access point,
>setup the datasource link, start the webserver and attempt to run the
>query.  From the web interface I see the 3 attributes (the shared unique
>ID, varchar column from datasource ŒA¹, varchar
> column from datasource ŒB¹), select them and run the queryŠwhich errors
>out.  I ran this query through the Java API in Eclipse and while
>debugging noticed that an exception is thrown when trying to query the
>second datasource:
>ORA-01795: maximum number of expressions in a list is 1000.
>
>Stepping through the code in the debugger, it appears that it¹s running a
>query to get all of the results from datasource
> ŒA¹, and then turning the ID column into a list and using it in a
>subsequent query to get the necessary resultset from datasource ŒB¹Šbut
>there are well over 1000 records I am trying to join.  Is there a way
>around this so that the query mechanism doesn¹t
> rely on using an IN(<list>) to join these two datasets?
>
>Thanks so much in advance.  I really like what I see in the new release
>and am looking forward to getting this up
> and running!
>
>Thanks again!
>
>Tony

_______________________________________________
Users mailing list
[hidden email]
https://lists.biomart.org/mailman/listinfo/users