Integration without tracking. An obscure question


Integration without tracking. An obscure question

joe carlson

Hi,

I have an obscure question about a minor point in the code.

I'm all about trying to speed up loading. My mine is getting even larger and I'm working on trying to trim time whenever I can. In this latest build I'm trying to avoid using the tracking table whenever I can; this saves some space in addition to time.

What I found is that if I am not tracking an object class and try to do an integration of two data sources that store objects of the same class, I run into a build exception. It is triggered from lines 571 to 574 in org.intermine.dataloader.IntegrationWriterDataTrackingImpl:

            if (!(field instanceof CollectionDescriptor)) {
                if (lastSource == null) {
                    throw new NullPointerException("Error: lastSource is null for"
                            + " object " + o.getId() + " and fieldName " + fieldName);
                }

                trackingMap.put(fieldName, lastSource);
            }

Since the object has no information in the tracking table, lastSource is indeed null. In the specific case I see, I've entered the attributes for the proteins in a previous loading step. In a later step I need to make a reference to the protein. I create the new object and specify the primary key. When it's time to merge, the integration loader properly determines the correct object id and merges the newly created object with the existing one. But not having a source for the previous object triggers this exception.

I've looked at the history on GitHub, and Richard Smith left a comment saying this check was added to keep the tracking table from recording rows with null columns, so that we don't keep a history of data loaders that do not affect attributes. But that doesn't seem to be what's happening here.

I've commented out this check in my own build and successfully ran a big integration step, and not dealing with the tracking table certainly helped. But I'm worried I may be missing something. Does anyone have any experience with things of this nature?

I've not tried to see what happens if my new object has conflicting information in fields other than the primary key. Or if the field in the new object is a collection. I'm just looking at the case of a simple attribute field.
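To make the change concrete, here is a minimal stand-alone sketch of what I mean by skipping the check: when lastSource is null (no row in the tracking table for an untracked class), just skip recording a source for that field instead of throwing. The names here are illustrative stand-ins, not the real InterMine API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the source-recording step in
// IntegrationWriterDataTrackingImpl. Instead of throwing a
// NullPointerException when lastSource is null, the untracked
// field is simply skipped and the merge continues.
public class SkipUntrackedSketch {
    static Map<String, String> trackingMap = new HashMap<>();

    static void recordSource(String fieldName, String lastSource,
                             boolean isCollection) {
        if (!isCollection) {
            if (lastSource == null) {
                // Untracked class: nothing to record, carry on merging.
                return;
            }
            trackingMap.put(fieldName, lastSource);
        }
    }

    public static void main(String[] args) {
        recordSource("primaryIdentifier", "uniprot", false); // tracked field
        recordSource("length", null, false);                 // untracked: skipped
        System.out.println(trackingMap); // {primaryIdentifier=uniprot}
    }
}
```

This only covers the simple-attribute case I described above; whether it is safe for conflicting non-key fields or collections is exactly what I haven't tested.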

Thanks,

Joe


_______________________________________________
dev mailing list
[hidden email]
https://lists.intermine.org/mailman/listinfo/dev

Re: Integration without tracking. An obscure question

Daniela Butano-2

Joe,

I don't have much experience with things of this nature, but have you thought about adding a check for whether the data tracker is enabled, instead of commenting out the code?
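Something along these lines, as a rough sketch. The trackingEnabled flag is an assumed stand-in for however the build would expose that setting; it is not real InterMine code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical guard: only enforce the lastSource check (and write to
// the tracking map) when data tracking is actually enabled. With
// tracking disabled, untracked fields are skipped without throwing.
public class TrackerGuardSketch {
    // Assumed configuration flag; in a real build this would come from
    // the project's data-tracking settings, not a hard-coded boolean.
    static boolean trackingEnabled = false;
    static Map<String, String> trackingMap = new HashMap<>();

    static void recordSource(String fieldName, String lastSource) {
        if (trackingEnabled) {
            if (lastSource == null) {
                throw new NullPointerException(
                        "Error: lastSource is null for fieldName " + fieldName);
            }
            trackingMap.put(fieldName, lastSource);
        }
        // Tracking disabled: skip both the check and the write.
    }

    public static void main(String[] args) {
        recordSource("length", null);   // no exception while tracking is off
        trackingEnabled = true;
        recordSource("name", "uniprot");
        System.out.println(trackingMap); // {name=uniprot}
    }
}
```

That way the existing behaviour is untouched for builds that do use the tracking table.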

Daniela

On 28/03/2019 15:42, Joe Carlson wrote:
