Why don't subclasses use postgres table inheritance?

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Why don't subclasses use postgres table inheritance?

Sam Hokin-3
Hi, devs. I was just shocked to discover that the subclass of an entity winds up populating both the subclass table and the parent
table with the same data (the extra subclass fields missing from the parent table, of course). For example:

   <class name="GOAnnotation" extends="OntologyAnnotation" is-interface="true">
     <attribute name="annotationExtension" type="java.lang.String"/>
     <collection name="evidence" referenced-type="GOEvidence"/>
   </class>

results in:

beanmine=> select count(*) from goannotation;
  count
-------
  44844

beanmine=> select count(*) from ontologyannotation;
  count
-------
  44844

A terrible price to pay for subclassing! Of course, the king of parents, BioEntity, gives:

soymine=> select count(*) from bioentity;
   count
---------
  1458068

Yikes!!!!!

Postgres provides table inheritance, so you'd be able to simply define, for example:

CREATE TABLE goannotation (
   annotationextension  text
) INHERITS (ontologyannotation);

In which case you'd put a record in goannotation and it'd show up automagically in ontologyannotation (without annotationextension).

Is there a reason why the InterMine devs chose not to use table inheritance for subclasses and, instead, populate both tables with
the same data? I use table inheritance all the time, it's one of the great features of Postgres going back to the beginning!



_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Reply | Threaded
Open this post in threaded view
|

Re: Why don't subclasses use postgres table inheritance?

Sam Hokin-3
Actually, I'll note that those count queries would have returned the same numbers if table inheritance were used. But, table
inheritance isn't used, so those are actual table-specific record counts. And then, there's the true daddy of them all:

multimine=# select count(*) from intermineobject;
   count
---------
  7107971

Which made me think about heirarchy: intermineobject -> bioentity -> a zillion other things. So perhaps that's why table inheritance
isn't used? Gets too hairy? But Postgres is happy to let you inherit a table which, in turn, inherits another table. The common
fields just keep cascading down. Still puzzling to me.

On 03/29/2016 05:03 PM, Sam Hokin wrote:

> Hi, devs. I was just shocked to discover that the subclass of an entity winds up populating both the subclass table and the parent
> table with the same data (the extra subclass fields missing from the parent table, of course). For example:
>
>    <class name="GOAnnotation" extends="OntologyAnnotation" is-interface="true">
>      <attribute name="annotationExtension" type="java.lang.String"/>
>      <collection name="evidence" referenced-type="GOEvidence"/>
>    </class>
>
> results in:
>
> beanmine=> select count(*) from goannotation;
>   count
> -------
>   44844
>
> beanmine=> select count(*) from ontologyannotation;
>   count
> -------
>   44844
>
> A terrible price to pay for subclassing! Of course, the king of parents, BioEntity, gives:
>
> soymine=> select count(*) from bioentity;
>    count
> ---------
>   1458068
>
> Yikes!!!!!
>
> Postgres provides table inheritance, so you'd be able to simply define, for example:
>
> CREATE TABLE goannotation (
>    annotationextension  text
> ) INHERITS (ontologyannotation);
>
> In which case you'd put a record in goannotation and it'd show up automagically in ontologyannotation (without annotationextension).
>
> Is there a reason why the InterMine devs chose not to use table inheritance for subclasses and, instead, populate both tables with
> the same data? I use table inheritance all the time, it's one of the great features of Postgres going back to the beginning!

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Reply | Threaded
Open this post in threaded view
|

Re: Why don't subclasses use postgres table inheritance?

Justin Clark-Casey
That's a good question.  Unfortunately, I can't answer it directly (if anybody else can shed more light please jump in).  As you say, PostgreSQL has had
inheritance since at least 6.0 afaict.  This was released in 1997 and so precedes InterMine by a good few years.

Perhaps originally InterMine did not want its ORM system tied too tightly into PostgreSQL (e.g. systems like Hibernate that can use many different SQL backends
don't use PostgreSQL inheritance afaik).  Or maybe there was some old (perhaps now dead) InterMine feature that prevented using PostgreSQL inheritance.  This is
all speculation.

There are some caveats involved with inheritance around things like constraints [1].  On a quick read, it doesn't seem any of these would be an issue for
InterMine but the proof would be in the pudding.

On a quick Google, there's no obvious good performance information on how using inheritance impacts row insert times/storage requirements/query performance.
However, one of our current concerns is the time it takes to load InterMine with data.  Intuitively, one might expect it to also reduce database size but this
would be another case of suck it and see.  DB size does not seem to be a critical issue (please feel free to correct me!).

Thanks for bringing this up, Sam.  It's now on our radar as a possible experiment to see if using PostgreSQL inheritance could improve data load times without
affecting query performance.

[1] http://www.postgresql.org/docs/current/static/ddl-inherit.html

--
Justin Clark-Casey, Synbiomine/InterMine Developer
http://synbiomine.org
http://twitter.com/justincc

On 30/03/16 02:33, Sam Hokin wrote:

> Actually, I'll note that those count queries would have returned the same numbers if table inheritance were used. But, table inheritance isn't used, so those
> are actual table-specific record counts. And then, there's the true daddy of them all:
>
> multimine=# select count(*) from intermineobject;
>    count
> ---------
>   7107971
>
> Which made me think about heirarchy: intermineobject -> bioentity -> a zillion other things. So perhaps that's why table inheritance isn't used? Gets too hairy?
> But Postgres is happy to let you inherit a table which, in turn, inherits another table. The common fields just keep cascading down. Still puzzling to me.
>
> On 03/29/2016 05:03 PM, Sam Hokin wrote:
>> Hi, devs. I was just shocked to discover that the subclass of an entity winds up populating both the subclass table and the parent
>> table with the same data (the extra subclass fields missing from the parent table, of course). For example:
>>
>>    <class name="GOAnnotation" extends="OntologyAnnotation" is-interface="true">
>>      <attribute name="annotationExtension" type="java.lang.String"/>
>>      <collection name="evidence" referenced-type="GOEvidence"/>
>>    </class>
>>
>> results in:
>>
>> beanmine=> select count(*) from goannotation;
>>   count
>> -------
>>   44844
>>
>> beanmine=> select count(*) from ontologyannotation;
>>   count
>> -------
>>   44844
>>
>> A terrible price to pay for subclassing! Of course, the king of parents, BioEntity, gives:
>>
>> soymine=> select count(*) from bioentity;
>>    count
>> ---------
>>   1458068
>>
>> Yikes!!!!!
>>
>> Postgres provides table inheritance, so you'd be able to simply define, for example:
>>
>> CREATE TABLE goannotation (
>>    annotationextension  text
>> ) INHERITS (ontologyannotation);
>>
>> In which case you'd put a record in goannotation and it'd show up automagically in ontologyannotation (without annotationextension).
>>
>> Is there a reason why the InterMine devs chose not to use table inheritance for subclasses and, instead, populate both tables with
>> the same data? I use table inheritance all the time, it's one of the great features of Postgres going back to the beginning!
>
> _______________________________________________
> dev mailing list
> [hidden email]
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Reply | Threaded
Open this post in threaded view
|

Re: Why don't subclasses use postgres table inheritance?

Julie Sullivan
Hi Sam,

Yes, as Justin says, this design decision was made prior to us. We are
assuming the InterMine team knew about this feature and decided not to
use it!

Looking at the docs, I see one problem:

        http://www.postgresql.org/docs/current/static/ddl-inherit.html

"A serious limitation of the inheritance feature is that indexes
(including unique constraints) and foreign key constraints only apply to
single tables, not to their inheritance children."

Being a data warehouse, indexes are our everything. In some cases, when
querying we don't even go to the tables, we use the index only.

        https://wiki.postgresql.org/wiki/Index-only_scans

Maybe that's why? Just a guess. However, there is a ray of hope:

"These deficiencies will probably be fixed in some future release, but
in the meantime considerable care is needed in deciding whether
inheritance is useful for your application."

We're taking a really good look at postgres this summer, so I've put
that on our list of things to investigate.

And thanks for this!!

Cheers
Julie

On 30/03/16 12:33, Justin Clark-Casey wrote:

> That's a good question.  Unfortunately, I can't answer it directly (if
> anybody else can shed more light please jump in).  As you say,
> PostgreSQL has had inheritance since at least 6.0 afaict.  This was
> released in 1997 and so precedes InterMine by a good few years.
>
> Perhaps originally InterMine did not want its ORM system tied too
> tightly into PostgreSQL (e.g. systems like Hibernate that can use many
> different SQL backends don't use PostgreSQL inheritance afaik).  Or
> maybe there was some old (perhaps now dead) InterMine feature that
> prevented using PostgreSQL inheritance.  This is all speculation.
>
> There are some caveats involved with inheritance around things like
> constraints [1].  On a quick read, it doesn't seem any of these would be
> an issue for InterMine but the proof would be in the pudding.
>
> On a quick Google, there's no obvious good performance information on
> how using inheritance impacts row insert times/storage
> requirements/query performance. However, one of our current concerns is
> the time it takes to load InterMine with data.  Intuitively, one might
> expect it to also reduce database size but this would be another case of
> suck it and see.  DB size does not seem to be a critical issue (please
> feel free to correct me!).
>
> Thanks for bringing this up, Sam.  It's now on our radar as a possible
> experiment to see if using PostgreSQL inheritance could improve data
> load times without affecting query performance.
>
> [1] http://www.postgresql.org/docs/current/static/ddl-inherit.html
>
> --
> Justin Clark-Casey, Synbiomine/InterMine Developer
> http://synbiomine.org
> http://twitter.com/justincc
>
> On 30/03/16 02:33, Sam Hokin wrote:
>> Actually, I'll note that those count queries would have returned the
>> same numbers if table inheritance were used. But, table inheritance
>> isn't used, so those
>> are actual table-specific record counts. And then, there's the true
>> daddy of them all:
>>
>> multimine=# select count(*) from intermineobject;
>>    count
>> ---------
>>   7107971
>>
>> Which made me think about heirarchy: intermineobject -> bioentity -> a
>> zillion other things. So perhaps that's why table inheritance isn't
>> used? Gets too hairy?
>> But Postgres is happy to let you inherit a table which, in turn,
>> inherits another table. The common fields just keep cascading down.
>> Still puzzling to me.
>>
>> On 03/29/2016 05:03 PM, Sam Hokin wrote:
>>> Hi, devs. I was just shocked to discover that the subclass of an
>>> entity winds up populating both the subclass table and the parent
>>> table with the same data (the extra subclass fields missing from the
>>> parent table, of course). For example:
>>>
>>>    <class name="GOAnnotation" extends="OntologyAnnotation"
>>> is-interface="true">
>>>      <attribute name="annotationExtension" type="java.lang.String"/>
>>>      <collection name="evidence" referenced-type="GOEvidence"/>
>>>    </class>
>>>
>>> results in:
>>>
>>> beanmine=> select count(*) from goannotation;
>>>   count
>>> -------
>>>   44844
>>>
>>> beanmine=> select count(*) from ontologyannotation;
>>>   count
>>> -------
>>>   44844
>>>
>>> A terrible price to pay for subclassing! Of course, the king of
>>> parents, BioEntity, gives:
>>>
>>> soymine=> select count(*) from bioentity;
>>>    count
>>> ---------
>>>   1458068
>>>
>>> Yikes!!!!!
>>>
>>> Postgres provides table inheritance, so you'd be able to simply
>>> define, for example:
>>>
>>> CREATE TABLE goannotation (
>>>    annotationextension  text
>>> ) INHERITS (ontologyannotation);
>>>
>>> In which case you'd put a record in goannotation and it'd show up
>>> automagically in ontologyannotation (without annotationextension).
>>>
>>> Is there a reason why the InterMine devs chose not to use table
>>> inheritance for subclasses and, instead, populate both tables with
>>> the same data? I use table inheritance all the time, it's one of the
>>> great features of Postgres going back to the beginning!
>>
>> _______________________________________________
>> dev mailing list
>> [hidden email]
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>
> _______________________________________________
> dev mailing list
> [hidden email]
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Reply | Threaded
Open this post in threaded view
|

Re: Why don't subclasses use postgres table inheritance?

Sam Hokin-3
Hrm, I'm not sure that explains it, since you're free to define indexes and foreign key constraints on the child tables just as we
are now doing on the standalone tables (which makes sense, since the child tables often have additional fields). But perhaps that
creates a complexity that I'm not seeing. Maybe the constraints and keys don't apply to the data pulled up from children? That would
mess things up.

In any case, it's great that you're "looking under the hood" (or bonnet?) this summer. I'd vote for implementing table inheritance
for subclasses being a top priority, since that will decrease the size of the database enormously. And, I think, improve
performance, since the performance of inherited tables is something they continue to work on, I think, and smaller is usually
better. Thanks!

On 04/08/2016 03:26 AM, Julie Sullivan wrote:

> Hi Sam,
>
> Yes, as Justin says, this design decision was made prior to us. We are assuming the InterMine team knew about this feature and
> decided not to use it!
>
> Looking at the docs, I see one problem:
>
> http://www.postgresql.org/docs/current/static/ddl-inherit.html
>
> "A serious limitation of the inheritance feature is that indexes (including unique constraints) and foreign key constraints only
>  apply to single tables, not to their inheritance children."
>
> Being a data warehouse, indexes are our everything. In some cases, when querying we don't even go to the tables, we use the index
> only.
>
> https://wiki.postgresql.org/wiki/Index-only_scans
>
> Maybe that's why? Just a guess. However, there is a ray of hope....

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Reply | Threaded
Open this post in threaded view
|

Re: Why don't subclasses use postgres table inheritance?

joe carlson

On Apr 8, 2016, at 7:57 AM, Sam Hokin <[hidden email]> wrote:

Hrm, I'm not sure that explains it, since you're free to define indexes and foreign key constraints on the child tables just as we are now doing on the standalone tables (which makes sense, since the child tables often have additional fields). But perhaps that creates a complexity that I'm not seeing. Maybe the constraints and keys don't apply to the data pulled up from children? That would mess things up.

In any case, it's great that you're "looking under the hood" (or bonnet?) this summer. I'd vote for implementing table inheritance for subclasses being a top priority, since that will decrease the size of the database enormously. And, I think, improve performance, since the performance of inherited tables is something they continue to work on, I think, and smaller is usually better. Thanks!



I was biding my time on this until I head Julie’s comment. Like everyone (?) in this thread, I started using the code long after it was architected. I had a couple observations:

1) there are hints in the codebase that this was at one point running on an oracle server. If this is correct, then there may have been a conscious decision early on to make it vendor neutral.

2) But the main thing is performance. Table inheritance has its advantages, but performance is not one of them. The basic issue is that you effectively have a table join when you select from the child table when using information in the parent table. Here is an illustration. Implement a parent-child table and a single table of the same thing:

CREATE TABLE parent (
  id integer not null primary key unique,
  data1 varchar(32)
);

CREATE TABLE child (
  data2 varchar(32)
) INHERITS (parent);

CREATE TABLE single (
  id integer not null primary key unique,
  data1 varchar(32),
  data2 varchar(32)
);

These tables have a structure similar to intermine’s (id is non-null primary key integer)
Throw in some data:

demo=# insert into single select generate_series(1,10000000),md5(random()::text),md5(random()::text);
INSERT 0 10000000
demo=# insert into child select generate_series(1,10000000),md5(random()::text),md5(random()::text);
INSERT 0 10000000
demo=# analyze single;
ANALYZE
demo=# analyze child;
ANALYZE


If you’re just doing a “select *”, then both implementations are the same:

demo=# explain select * from single;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Seq Scan on single  (cost=0.00..223457.17 rows=10000017 width=70)
(1 row)

demo=# explain select * from child;
                            QUERY PLAN                            
------------------------------------------------------------------
 Seq Scan on child  (cost=0.00..223457.17 rows=10000017 width=70)
(1 row)

But bad things happen if you want to sort on id:

demo=# explain select * from single order by id;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Index Scan using single_pkey on single  (cost=0.43..383144.69 rows=10000017 width=70)
(1 row)

demo=# explain select * from child order by id;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Sort  (cost=1386134.10..1411134.14 rows=10000017 width=70)
   Sort Key: id
   ->  Seq Scan on child  (cost=0.00..223457.17 rows=10000017 width=70)
(3 rows)

This is bad. There is a scan on the child table. That’s going to take time. 

demo=# \timing
demo=# select * from single order by id limit 10 offset 100000;
   id   |              data1               |              data2               
--------+----------------------------------+----------------------------------
 100001 | bd1be1e5b6ec5588d3ee151da7845c3b | d2455b73aee7eec7002ea4cdd88c628e
 100002 | 0f7f0cc11ddbfba76726889eab0d0c3d | 71be2f0c1e174949a4f7518a9ac94745
 100003 | d9d49e20a1a0d1a3e56de9fb7caeb88b | 122bcdc320e92a8fb11f85034afebcdb
 100004 | 2d7f84a0050a984eec7a9ffb08473e37 | 8ad77cab8a90fee26110a71e4b26d377
 100005 | 0f5347b7443de75e91e65788e3580a0f | 12bd265c48bb90d166453b71dd94d9ac
 100006 | 0c95a1a3934a1549662dbf9bf9cb8523 | 06021763edb88defaf30847f921f88ef
 100007 | 0e6f2aa8a8d5a71b248c19906cd63042 | d190abd8085fb8d88132dce3f5454401
 100008 | 0ce76733fca8f8fbe5576a3de92cb221 | 62f0e9b0ca70b5758ebd2ed8fa127a3e
 100009 | 8e310f9573ff2c2902ada186f9c1b8f0 | a9afd8cafd5a8d5820e95239a323c4a3
 100010 | 3d31a7d75cedf514e478b8cc73df940e | dd156d0750a817f979ed201e82c38ba5
(10 rows)

Time: 13.000 ms
demo=# select * from child  order by id limit 10 offset 100000;
   id   |              data1               |              data2               
--------+----------------------------------+----------------------------------
 100001 | 5e4da88d7f8369255964d21d7565b561 | eb05b644e9956d2262a0732a3b9b6441
 100002 | 4a125147a3a344b8278b5c041a004fd0 | a944b9f53ae3416823c9bdc2eceb29f3
 100003 | 22ae00720bf2aee0ddb8392726a5b464 | 736d23f8d8252be82329aa75817bc8e8
 100004 | 2d931c518d18a5140c16bf20cf331cbc | 5d0589c82cc4d194e0e346afab0a2d3d
 100005 | f90690aae2d50757a257a9a28ddb53ab | 6a75c4989a2a34df96944bdeb1d78014
 100006 | e7e3d2ca171160fcf098b3ada6f90bb5 | 62576fc9bd8b63530bf46d933d1f683a
 100007 | d684706e08029b7a21477ee1329f7dfa | 8106a09d4aaea117824754c7dfbe2154
 100008 | 192fa62a8f68ce0baa7239f220c098e8 | b48f7199250a8b1e9347cf18f1751d05
 100009 | 756ca7a425cece977d473ab280bd360b | bfb8141fa685ab9a6fb60d6c23fc9a8d
 100010 | 4c82925ca9b7e5e1dea40e1b11c0ac27 | fa463ccc861eda7e3df37110786eda78
(10 rows)

Time: 1135.699 ms


And similarly:

demo=# explain select * from child where id=124356;
                        QUERY PLAN                         
-----------------------------------------------------------
 Seq Scan on child  (cost=0.00..248457.21 rows=1 width=70)
   Filter: (id = 124356)
(2 rows)

demo=# explain select * from single where id=124356;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Scan using single_pkey on single  (cost=0.43..8.45 rows=1 width=70)
   Index Cond: (id = 124356)
(2 rows)

Another seq scan on the inherited table with an index scan on the single table.

demo=# \timing
Timing is on.
demo=# select * from single where id=124356;
   id   |              data1               |              data2               
--------+----------------------------------+----------------------------------
 124356 | 3b30b5e85d1485f21a70819f8bcd97ff | 1866f76642259dcb4a9d903c032af130
(1 row)

Time: 0.969 ms
demo=# select * from child where id=124356;
   id   |              data1               |              data2               
--------+----------------------------------+----------------------------------
 124356 | 3ee5d23891aa596cb796183b1e28d4e0 | 53da5823f4244bd591255678ef41e267
(1 row)

Time: 1114.170 ms


Since intermine uses a batching strategy for queries in which the id of the parent table is the criterion for including in a batch, this will have a big impact on the select performance. Let me know if there is some indexing trick that I’m not aware of.

Otherwise, please don’t do it! Disk is cheap. Time is precious. (Sam, you mention that your Bioentity table is 1M records. Mine is 300M records. And I’m expecting a lot more data in the next couple months that will make it - to quote a candidate - YUGE!)

The one thing that table inheritance would give you is the ability to manually do SQL without corrupting tables. I have a solution for that. I’ve shared it with Julie and am hoping to clean it up a bit and make a pull request.

Joe

On 04/08/2016 03:26 AM, Julie Sullivan wrote:
Hi Sam,

Yes, as Justin says, this design decision was made prior to us. We are assuming the InterMine team knew about this feature and
decided not to use it!

Looking at the docs, I see one problem:

http://www.postgresql.org/docs/current/static/ddl-inherit.html

"A serious limitation of the inheritance feature is that indexes (including unique constraints) and foreign key constraints only
apply to single tables, not to their inheritance children."

Being a data warehouse, indexes are our everything. In some cases, when querying we don't even go to the tables, we use the index
only.

https://wiki.postgresql.org/wiki/Index-only_scans

Maybe that's why? Just a guess. However, there is a ray of hope....

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev


_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
Reply | Threaded
Open this post in threaded view
|

Re: Why don't subclasses use postgres table inheritance?

Sam Hokin-3
Awesome stuff, Joe! I didn't know inheritance could result in such a clear performance penalty. I officially withdraw my request for
table inheritance! :)

On 04/08/2016 11:15 AM, Joe Carlson wrote:
>
> I was biding my time on this until I head Julie’s comment. Like everyone (?) in this thread, I started using the code long after it
> was architected. I had a couple [EXCELLENT] observations....

_______________________________________________
dev mailing list
[hidden email]
http://mail.intermine.org/cgi-bin/mailman/listinfo/dev