Clean up datasets script

Clean up datasets script

SHAUN WEBB

Hi,
I recently ran the clean up datasets script to free up some storage  
space and this resulted in some of the active library datasets being  
purged from disk.

This library was loaded from an external file path. When I wanted to  
add more files to galaxy from the same path it was easier to load the  
whole directory again and delete the duplicated files.

I'm assuming that the cleanup script looks at these deleted datasets  
and purges the file they are associated with even though another  
current dataset also links to this file.

Is there a way to check whether a file is referenced by another dataset  
before purging, or to prohibit the script from deleting files outside  
the default Galaxy file directory?

If I leave out the purge_libraries script, will this stop library  
datasets from being removed, or only deleted libraries as a whole?

Thanks for your help

Shaun

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


_______________________________________________
galaxy-dev mailing list
[hidden email]
http://lists.bx.psu.edu/listinfo/galaxy-dev

Re: Clean up datasets script

Greg Von Kuster
Hello Shaun,

On Jun 4, 2010, at 5:06 AM, SHAUN WEBB wrote:


> Hi,
> I recently ran the clean up datasets script to free up some storage space and this resulted in some of the active library datasets being purged from disk.


This should not be possible, so perhaps you have found a corner case scenario that needs to be handled.



> This library was loaded from an external file path.


Did you use the "Upload a directory of files" option on the upload form, or did you have the "allow_library_path_paste" config setting turned on?  In either case, did you check the "Copy data into Galaxy?" checkbox on the upload form, which eliminates copying the files into Galaxy's default files directory?


> When I wanted to add more files to galaxy from the same path it was easier to load the whole directory again and delete the duplicated files.
>
> I'm assuming that the cleanup script looks at these deleted datasets and purges the file they are associated with even though another current dataset also links to this file.


This is not the case.  Any dataset file that has at least one active link (an undeleted link from either a history item or a library) will not be removed from disk by the cleanup_datasets.py script.  It does not matter how many deleted links to the file exist; as long as one active link exists, the file will not be removed from disk (unless you've found a bug).
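To make the rule concrete, here is a minimal sketch of the check described above. The `DatasetLink` class and `file_is_purgeable` function are illustrative names, not Galaxy's actual code: a file is a candidate for purging only when every link to it has been deleted.

```python
# Hypothetical sketch of the purge rule: a dataset file may only be removed
# from disk when *no* active (undeleted) link to it remains, regardless of
# how many deleted links exist. Names here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class DatasetLink:
    source: str    # "history" or "library"
    deleted: bool  # True if the linking item has been deleted

def file_is_purgeable(links):
    """A file may be purged only when every link to it is deleted."""
    return all(link.deleted for link in links)

links = [
    DatasetLink("library", deleted=True),   # the duplicate the user deleted
    DatasetLink("library", deleted=False),  # the active copy
]
# One active link remains, so the file must stay on disk.
assert not file_is_purgeable(links)
```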



> Is there a way to check whether a file is referenced by another dataset before purging, or to prohibit the script from deleting files outside the default Galaxy file directory?

The cleanup_datasets.py script already performs these types of checks.

I've also added a new feature to the library dataset information page (the page displayed when you click on the library dataset link within the library) which lists all history items and other library datasets that point to the disk file for the current library dataset.  This is useful for manually seeing the various linked items, but is not related to the cleanup_datasets.py script.  This feature has not yet made it out to the distribution, but will be there soon.

There is also a new Galaxy report (not yet in the distribution, but coming soon) which shows disk usage for the file system on which the Galaxy data files exist.  It lets you view large datasets (over 4 GB) that are not purged and see all history and library items that point to the disk file (similar to the library dataset information page above, but it also displays datasets linked to only by histories).



> If I leave out the purge_libraries script, will this stop library datasets from being removed, or only deleted libraries as a whole?

The purge_libraries option to the cleanup_datasets.py script handles removing data files from disk that have been deleted for the appropriate number of days, as well as purging the library record from the database (as long as the library's contents have all been purged).
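The two-stage behaviour described above can be sketched as follows. This is an illustration under assumed names (simple dicts standing in for database records), not Galaxy's actual implementation: files deleted longer ago than the cutoff are purged, and the library record is only purged once every one of its datasets has been.

```python
# Illustrative sketch (names and data shapes are assumptions) of what the
# purge_libraries step does: dataset files deleted longer ago than the cutoff
# are purged from disk, and the library record itself is purged from the
# database only once all of its contents have been purged.
import datetime

def purge_library(library, cutoff_days, today):
    cutoff = today - datetime.timedelta(days=cutoff_days)
    for dataset in library["datasets"]:
        if dataset["deleted_on"] and dataset["deleted_on"] <= cutoff:
            dataset["purged"] = True  # the data file would be removed here
    # The library record is purged last, and only if everything inside it is.
    if all(d["purged"] for d in library["datasets"]):
        library["purged"] = True

lib = {
    "purged": False,
    "datasets": [
        {"deleted_on": datetime.date(2010, 5, 1), "purged": False},
        {"deleted_on": None, "purged": False},  # still active
    ],
}
purge_library(lib, cutoff_days=10, today=datetime.date(2010, 6, 4))
# The old deleted dataset is purged; the active one and the library are not.
```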




Greg Von Kuster
Galaxy Development Team





Re: Clean up datasets script

SHAUN WEBB

Hi Greg,

I used the allow_library_path_paste = True option to upload a  
directory of files (several times). I also checked the box so that  
files would not be copied into Galaxy, although I may have missed the  
box on occasion; in those cases I deleted the library items straight  
away and restarted the upload with the box checked.

I had assumed that the cleanup datasets script would perform these  
checks, so it's reassuring to know this may be an isolated case. I have  
checked the logs and it was definitely this script that purged the  
files.

Do you have any ideas what may have caused this? As use of our  
production server grows we will need to delete unused data and I want  
to be sure this doesn't happen again.

Thanks for your help

Shaun



Re: Clean up datasets script

Greg Von Kuster
Hello Shaun,

On Jun 7, 2010, at 5:15 AM, SHAUN WEBB wrote:

>
> Hi Greg,
>
> I used the allow_library_path_paste = True option to upload a directory of files (several times). I also checked the box so that files would not be copied into Galaxy, although I may have missed the box on occasion; in those cases I deleted the library items straight away and restarted the upload with the box checked.

I believe what must have happened is that using allow_library_path_paste, you uploaded the files to a Library *with the box checked*, and then thought that you had not checked the box, so deleted them, and uploaded them the same way again with the box checked again.  This is the only way that we can see how the files would ultimately have been removed from disk.  This scenario brought to light a weakness in our code that has been fixed in change set 3900:384137f8b5c6.  From that change set on, whenever the box is checked, Galaxy will never be able to purge the files since they are not in Galaxy's default file location.  This change set will be available in the distribution very soon.
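The safeguard in that change set can be sketched as a simple path check. This is an assumed illustration (the function name and paths are hypothetical, not the actual change set's code): before removing anything, confirm the file really lives under Galaxy's default files directory, so externally linked files are never touched.

```python
# Hypothetical sketch of the safeguard described above: a file is only
# eligible for purging if it resides under Galaxy's default files directory.
# Files linked in from external paths are therefore never removed from disk.
import os

def safe_to_purge(file_path, galaxy_files_dir):
    real = os.path.realpath(file_path)
    root = os.path.realpath(galaxy_files_dir)
    # True only when the resolved path sits inside the Galaxy files directory.
    return real == root or real.startswith(root + os.sep)

# A file in Galaxy's own store may be purged; an external link may not.
assert safe_to_purge("/galaxy/database/files/000/dataset_1.dat",
                     "/galaxy/database/files")
assert not safe_to_purge("/data/external/reads.fastq",
                         "/galaxy/database/files")
```

Resolving with `os.path.realpath` before comparing guards against symlinks or `..` segments that would otherwise make an external file appear to be inside the store.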

Thanks for reporting this issue Shaun!



Greg Von Kuster
Galaxy Development Team
[hidden email]



