History Import/Export


History Import/Export

Assaf Gordon-2
Kudos for this nice feature. It would be very useful, both for backups and for exchanging histories between Galaxies.

But as usual ;) I have a couple of comments:

1. Exporting big files doesn't work.
You first create the tar file, then add the individual datasets, and only then stream the tar file to the user.
That's very bad. Try exporting a history with 18M-paired-end-76nt-reads FASTQ files (and remember that they are duplicated, because you guys still insist on grooming).
The proxy (Apache, at least) will surely time out.

BTW, since you're using Python's native tarfile module, the Python process will stay at 100% CPU long after the client connection has timed out. 20 users exporting large histories at the same time would constitute a DDoS attack ;)
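For illustration, here's a rough sketch (not your code; the function and class names are made up) of how tarfile's streaming mode ("w|gz") can hand chunks to the response as they're produced, instead of materializing the whole archive before sending anything:

```python
import io
import tarfile

class ChunkWriter:
    """Minimal file-like object that forwards each written chunk to a callback,
    e.g. response.write in a web framework."""
    def __init__(self, send):
        self.send = send

    def write(self, data):
        self.send(data)
        return len(data)

def stream_history_tar(datasets, send_chunk):
    """Write a gzipped tar of (archive_name, payload_bytes) pairs, emitting
    chunks through send_chunk as they are produced instead of buffering the
    whole archive in memory or on disk first."""
    writer = ChunkWriter(send_chunk)
    # "w|gz" selects tarfile's non-seekable streaming mode: it only ever
    # calls write() on the underlying file object.
    with tarfile.open(fileobj=writer, mode="w|gz") as tar:
        for name, payload in datasets:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
```

This doesn't solve the CPU cost of building the archive, but it does mean the client starts receiving bytes immediately, so the proxy sees progress instead of a stalled request.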

2. Importing large files (after you fix exporting large files...) will cause the same problems as uploading huge FASTQ files (and you're seeing more and more complaints about those).


Combining 1+2, I would suggest the following:
1. Make the export a new "job". This will allow it to run for a long time, in a separate process, without blocking Python and without tying the user to the browser (also remember: downloading 3GB with your browser is no fun).
Since your code already stores "temporary" tarballs in a dedicated "export" directory, just keep them there for a fixed amount of time (let's say 4 days),
and provide the user with a link to download the history at his/her own convenience.

2. Allow importing a history from a URL.
This will also have the benefit of not transferring huge tarballs through the user's personal computer when they want to transfer histories between Galaxies.
They will go to one Galaxy, run the "export" job, get a download link, wait until the export job is done, then go to the other Galaxy, and simply import the history with that link.


And a bonus:
You've checked the tarball for malicious files, but not the JSON attributes.
If I import a history with a dataset that has:
  [ { "file_name": "../../../etc/passwd" } ]

I can get any file on the system into Galaxy (this happens in <galaxy>/lib/galaxy/web/controllers/history.py:580).
If I import "/dev/urandom" it'll be even more fun.

I would recommend against checking for ".." or "/" (as in line 457), because a determined person could possibly circumvent that with tricks like ".\u002E/etc/passwd", etc.
Instead, construct the full path (as set in the dataset attributes), then check with os.path.abspath() that it resides in the temporary import directory.
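Something along these lines (a rough sketch; `safe_join` is a made-up name, not existing Galaxy code):

```python
import os

def safe_join(import_dir, file_name):
    """Resolve file_name relative to import_dir and return the absolute path,
    raising ValueError if it escapes the directory (e.g. '../../etc/passwd'
    or an absolute path like '/dev/urandom')."""
    base = os.path.abspath(import_dir)
    candidate = os.path.abspath(os.path.join(base, file_name))
    # The os.sep-terminated prefix check prevents '/tmp/import-evil'
    # from passing as being inside '/tmp/import'.
    if candidate != base and not candidate.startswith(base + os.sep):
        raise ValueError("file_name escapes import directory: %r" % file_name)
    return candidate
```

Note that os.path.join discards the base when the second argument is absolute, so the abspath-then-prefix check catches both the "../" and the "/dev/urandom" cases.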


Comments are welcome,
        -gordon
_______________________________________________
galaxy-dev mailing list
[hidden email]
http://lists.bx.psu.edu/listinfo/galaxy-dev
Re: History Import/Export

Dennis Gascoigne-2
As a note on this, it would be important to not include "hidden" datasets in the output, or at least make it an option to exclude them. With regard to the large FASTQ (and SAM) files, there were murmurs recently about an option to turn off metadata for files over a particular size or something similar. Any progress on that?

Cheers
Dennis

Re: History Import/Export

Jeremy Goecks
In reply to this post by Assaf Gordon-2
Thanks for the useful comments, Assaf. All good ideas, and many that we have been talking about ourselves.

The import/export feature is far from done and, as such, should not have been web-accessible by default nor publicized. As of commit 3988:1d05df35ec27, it's been disabled by default (you can enable it manually if you want). We'll enable it once we have solutions for the security issues and the importing/exporting of large files.

Thanks,
J.



Re: History Import/Export

Jeremy Goecks
In reply to this post by Dennis Gascoigne-2

> As a note on this, it would be important to not include "hidden" datasets in the output, or at least make it an option to exclude them.

Good idea -- will include as an option.

Thanks,
J.


Re: History Import/Export

Daniel Blankenberg
In reply to this post by Dennis Gascoigne-2
Hi Dennis,

> With regard to the large FASTQ (and SAM) files, there were murmurs recently about an option to turn off metadata for files over a particular size or something similar. Any progress on that?

There's a datatypes_conf.xml datatype attribute, 'max_optional_metadata_filesize', that can be set to prevent Galaxy from setting certain optional metadata (such as sequence counts) for datatypes that have been coded for it (only a few datatypes make use of this option currently); the attribute is inherited.

So to set a maximum of 1048576 bytes for all datatypes that support it, change the 'data' datatype in your datatypes_conf.xml to be similar to:

<datatype extension="data" type="galaxy.datatypes.data:Data" mimetype="application/octet-stream" max_optional_metadata_filesize="1048576" />


Thanks,

Dan


Re: History Import/Export

Dennis Gascoigne-2
That's great. Which datatypes is this enabled for?

Cheers
Dennis
