Re: Predict similar tools for Galaxy tools using text mining and machine learning

Re: Predict similar tools for Galaxy tools using text mining and machine learning

Yvan Le Bras-2
Hi Anup, Hello everyone,

Thanks for sharing this amazing work! 

Your project reminds me of earlier ideas we explored in a workflow-oriented way, notably with Mouhamadou Ba (copied). Maybe we could take them further, if you are interested.

Furthermore, I wonder whether this work could help advance the Galaxy tools translation effort started at https://github.com/manabuishii/galaxy-translation, though it is probably not relevant.

Wishing you all a very good weekend.

Kind regards,

Yvan



Sent from my Samsung Galaxy smartphone.

-------- Original message --------
From: anup kumar <[hidden email]>
Date: 03/02/2018 02:57 (GMT+01:00)
Subject: [galaxy-dev] Predict similar tools for Galaxy tools using text mining and machine learning

Hello everyone,

Greetings!

I am a contributor to the Galaxy project and am currently working on finding similarities between Galaxy tools (as part of my master's thesis). The work aims to find similar tools for any Galaxy tool based on the tools' descriptions and their input/output file types.

For example, tools similar to "bowtie2" could be "bwa", "bwameth" or "bwa_mem", among others.

To see the results of this project online, please visit the link (it works fine in Firefox and Chrome). The page loads a large JSON file (~100 MB) asynchronously, so expect to wait a few seconds before the tools appear in the select list. Once the tools are loaded, choose a tool to see the tools most similar to it. The similar tools are listed in descending order of their probability scores (the top 20 are shown). They are a mixture of tools selected based on the chosen tool's description and on its file types, so some tools are similar because of their description or the kind of function they perform, and others because of their file types. There are also a few plots at the end of the page.
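
As a rough illustration of how a mixed, ranked list like this could be produced, here is a minimal Python sketch. It is not the project's actual code: the function name `top_similar`, the mixing weight `w_desc`, and the example scores are all made up for illustration, and the real project learns how to combine the two sources rather than using a fixed weight.

```python
def top_similar(desc_scores, filetype_scores, w_desc=0.5, top_k=20):
    """Rank candidate tools by a weighted mix of two similarity sources.

    desc_scores / filetype_scores: {tool_name: score} for one selected tool.
    w_desc is a hypothetical fixed mixing weight (an assumption).
    """
    w_type = 1.0 - w_desc
    tools = set(desc_scores) | set(filetype_scores)
    # A tool missing from one source contributes 0 from that source.
    mixed = {
        t: w_desc * desc_scores.get(t, 0.0) + w_type * filetype_scores.get(t, 0.0)
        for t in tools
    }
    # Renormalise so the mixed scores sum to 1, like a probability distribution.
    total = sum(mixed.values())
    if total > 0:
        mixed = {t: s / total for t, s in mixed.items()}
    # Descending score; ties are broken alphabetically for a stable order.
    return sorted(mixed.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Invented scores for a selected tool such as "bowtie2".
desc = {"bwa": 0.7, "bwa_mem": 0.2, "bwameth": 0.1}
types = {"bwa": 0.2, "bwa_mem": 0.5, "samtools_sort": 0.3}
ranking = top_similar(desc, types, w_desc=0.5, top_k=3)
for tool, score in ranking:
    print(f"{tool}: {score:.2f}")
```

In this sketch a tool can reach the list through either source alone, which matches the observation that some results are similar by description and others by file type.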

Here is the code repository to read more about this project.

I used the following approach to compute the similar tools:
  1. Text mining to collect and preprocess the tools' keywords (which represent a tool) - BM25
  2. Matrix factorization to extract important concepts (and not just words) - Latent Semantic Analysis
  3. Optimization to combine probability distributions - Gradient Descent and Backtracking Line Search
In addition, I plan to include:
  • A deep learning approach to compute similarity between paragraphs, inspired by this work
  • Similarity in workflows
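
To make step 1 of the approach above more concrete, here is a minimal, standard-library-only Python sketch of BM25 term weighting over tokenized tool descriptions. It is an illustration under assumptions, not the project's implementation: the function name `bm25_weights`, the parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the three toy descriptions are all chosen for the example.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.5, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    docs: list of token lists (one per tool description).
    k1, b: common BM25 defaults, assumed here rather than taken from the project.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation: longer-than-average documents are damped.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weights.append({
            # Non-negative IDF variant times the saturating term-frequency factor.
            term: math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            * f * (k1 + 1) / (f + norm)
            for term, f in tf.items()
        })
    return weights


# Toy descriptions, invented for this example.
tools = {
    "bowtie2": "map reads genome",
    "bwameth": "map reads methylation",
    "samtools_sort": "sort bam file",
}
docs = [text.split() for text in tools.values()]
weights = bm25_weights(docs)
for name, w in zip(tools, weights):
    print(name, {term: round(value, 3) for term, value in w.items()})
```

Terms shared by several tools (such as "map") receive a lower weight than terms that occur in only one description (such as "genome"), so the rarer keywords dominate a tool's representation.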

If you have any comments or feedback, please write; it would be immensely helpful.

Thanks a lot!



Regards,
Anup Kumar

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/
Re: Predict similar tools for Galaxy tools using text mining and machine learning

anup kumar
Hi Yvan,

Thanks a lot for checking out the work. I think the Galaxy tools translation is a separate effort altogether.
I have added another approach for finding similar tools. It uses a deep learning method (doc2vec) to compute text similarity [here].
More details are in the original repository.

Thanks!


Regards,
Anup Kumar

On Sat, Feb 3, 2018 at 9:49 AM, Yvan Le Bras <[hidden email]> wrote: