Is it useful to use RAMDirectory for indexing lucene keywords?

HongKee Moon
Hi all,

I am quite curious about using RAMDirectory for indexing Lucene keywords, because “postprocess” normally takes quite a long time.
Do you think RAMDirectory would be a better/faster option for the “postprocess” task?

Presumably, with RAMDirectory it would also be faster to write/gunzip the index files that are restored from the database when the webapp starts.
If you are currently using RAMDirectory instead of FSDirectory to improve the performance of InterMine tasks, could you share your experience?

Cheers,
HongKee

--
HongKee Moon
Software Engineer
Scientific Computing Facility

Max Planck Institute of Molecular Cell Biology and Genetics
Pfotenhauerstr. 108
01307 Dresden
Germany

fon: +49 351 210 2740
fax: +49 351 210 1689


_______________________________________________
dev mailing list
[hidden email]
https://lists.intermine.org/mailman/listinfo/dev

Re: Is it useful to use RAMDirectory for indexing lucene keywords?

Justin Clark-Casey-2
Hi Hongkee,

I believe (though I have not rigorously tested this) that InterMine's Lucene indexing is CPU-bound rather than IO-bound.  Therefore, I don't expect that using a RAMDirectory would help much, though I'd be very interested in seeing numbers if you do try it.

One could maybe more productively tackle the CPU bottleneck by spreading the indexing work over multiple cores.  At the moment, as you can see from KeywordSearch.createIndex(), indexing is done on a single thread via InterMineObjectFetcher.  One could have 8 fetchers instead, for instance, though a more significant code change is probably required to split all the indexable InterMine objects into 8 workloads.
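
The workload-splitting idea above could be sketched roughly as follows. This is only an illustration using the JDK's thread pool: `WorkloadSplitter`, `split` and `indexInParallel` are hypothetical names, not InterMine code, and the per-workload function is a stand-in for whatever a fetcher would actually do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

// Hypothetical helper, not part of InterMine: splits object IDs into
// contiguous workloads and processes each workload on its own thread.
final class WorkloadSplitter {

    // Split ids into at most `parts` contiguous slices of near-equal size.
    static <T> List<List<T>> split(List<T> ids, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<List<T>> out = new ArrayList<>();
        int n = ids.size();
        int base = n / parts;   // minimum slice length
        int extra = n % parts;  // the first `extra` slices get one more item
        int start = 0;
        for (int i = 0; i < parts && start < n; i++) {
            int len = base + (i < extra ? 1 : 0);
            out.add(new ArrayList<>(ids.subList(start, start + len)));
            start += len;
        }
        return out;
    }

    // Run `indexWorkload` over every slice in parallel and sum the
    // per-slice counts it returns.
    static <T> int indexInParallel(List<T> ids, int threads,
                                   ToIntFunction<List<T>> indexWorkload)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<T> workload : split(ids, threads)) {
                Callable<Integer> task = () -> indexWorkload.applyAsInt(workload);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            ids.add(i);
        }
        // Stand-in for real indexing work: just count the objects in each workload.
        int indexed = indexInParallel(ids, 8, List::size);
        System.out.println("indexed " + indexed + " objects");
    }
}
```

Whether this helps in practice depends on where the time actually goes; if fetching objects from the database dominates, extra indexing threads alone may not change much.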

But in any case, I should tell you that we're currently looking at updating the search approach, quite possibly by moving to Elasticsearch or Solr (currently leaning towards Elasticsearch).  So indexing may be carried out differently and I wouldn't want you to waste time on an approach (embedded Lucene) that may go away.  That said, we still need to consider how to keep providing a good out-of-the-box search experience.

You can see some work by Colin Diesh that gets InterMine working with Solr instead of embedded Lucene here [1].

[1] https://github.com/intermine/intermine/issues/517

--
Justin Clark-Casey, Synbiomine/InterMine Developer
http://synbiomine.org
http://twitter.com/justincc

Re: Is it useful to use RAMDirectory for indexing lucene keywords?

Colin
Thanks for the comments, Justin. I also think Solr/Elasticsearch is still interesting, and my branch has a little demo of using Solr.

With the existing Lucene code, I am not sure that it makes sense to use RAMDirectory during loading/postprocessing, but I think trying to figure out the "batch size" for committing the index to disk might be important: http://stackoverflow.com/questions/11469131/batch-commit-for-lucene-index

Not sure if that is already optimized or not!
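
To make the "batch size" idea concrete, here is a minimal sketch of committing once per N added documents instead of once per document. The `Sink` interface is a hypothetical stand-in for an index writer, not a Lucene type, and the batch size of 1000 is an arbitrary value for illustration.

```java
// Hypothetical sketch: commit once every `batchSize` documents, plus once
// at the end for any leftover partial batch.
final class BatchCommitter {

    // Stand-in for an index writer; not a Lucene type.
    interface Sink {
        void add(String doc);
        void commit();
    }

    // In-memory sink that only counts calls, for demonstration.
    static final class CountingSink implements Sink {
        int docs = 0;
        int commits = 0;

        @Override
        public void add(String doc) {
            docs++;
        }

        @Override
        public void commit() {
            commits++;
        }
    }

    private final Sink sink;
    private final int batchSize;
    private int pending = 0;

    BatchCommitter(Sink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    // Add one document and commit when the batch is full.
    void add(String doc) {
        sink.add(doc);
        pending++;
        if (pending >= batchSize) {
            sink.commit();
            pending = 0;
        }
    }

    // Commit whatever is left over from the last partial batch.
    void finish() {
        if (pending > 0) {
            sink.commit();
            pending = 0;
        }
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        BatchCommitter committer = new BatchCommitter(sink, 1000);
        for (int i = 0; i < 2500; i++) {
            committer.add("doc-" + i);
        }
        committer.finish();
        System.out.println(sink.docs + " docs, " + sink.commits + " commits");
    }
}
```

The right batch size would have to be found by measurement; larger batches mean fewer commits but more work lost if the run fails part-way.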

-Colin

Re: Is it useful to use RAMDirectory for indexing lucene keywords?

Justin Clark-Casey-2
At KeywordSearch.java:1078 there is the line

         writer.setRAMBufferSizeMB(64); //flush to disk when docs take up X MB

There's no clue in git blame or elsewhere as to why 64MB was chosen.  A quick poke around the web suggests that setting it is a matter of trial and error [1].  Personally, I doubt increasing it will make much difference, but this would be a fairly easy thing to try.

[1] http://stackoverflow.com/questions/6403606/lucene-java-opening-too-many-files-am-i-using-indexwriter-properly
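
Since the setting seems to come down to trial and error, a small timing harness along these lines might make the comparison less ad hoc. This is a sketch only: `BufferSizeSweep` and `sweep` are hypothetical names, the candidate sizes are arbitrary, and the placeholder workload would need to be replaced with a real index build that passes the given size to the writer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleConsumer;

// Hypothetical timing harness: `indexRun` stands in for a full indexing run
// configured with the given RAM buffer size in MB.
final class BufferSizeSweep {

    // Time one run per candidate size; returns elapsed milliseconds keyed by
    // size, in the order the sizes were given.
    static Map<Double, Long> sweep(double[] sizesMb, DoubleConsumer indexRun) {
        Map<Double, Long> millis = new LinkedHashMap<>();
        for (double size : sizesMb) {
            long start = System.nanoTime();
            indexRun.accept(size);
            millis.put(size, (System.nanoTime() - start) / 1_000_000);
        }
        return millis;
    }

    public static void main(String[] args) {
        double[] candidates = {16, 64, 256, 1024};
        // Placeholder workload: replace with a real index build.
        Map<Double, Long> result = sweep(candidates, sizeMb -> {
            long acc = 0;
            for (int i = 0; i < 1_000_000; i++) {
                acc += i % 7;
            }
        });
        result.forEach((size, ms) -> System.out.println(size + " MB -> " + ms + " ms"));
    }
}
```

A single run per size is noisy, so repeating each run a few times and comparing medians would give more trustworthy numbers.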

Re: Is it useful to use RAMDirectory for indexing lucene keywords?

Colin
I also haven't tested it, but it could be that increasing it helps :)

From the docs: http://lucene.apache.org/core/3_2_0/api/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMBufferSizeMB%28double%29

" setRAMBufferSizeMB

public IndexWriterConfig setRAMBufferSizeMB(double ramBufferSizeMB)

    Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Generally for faster indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can. "


I also know that with Elasticsearch or similar you can manually control a "bulk API", and this is said to be important for increasing performance: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
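
For reference, the bulk API takes a newline-delimited JSON body in which each document is preceded by an action line. A minimal sketch of building such a body is below. `BulkBodyBuilder`, the index name `genes` and the `symbol` field are made up for illustration, and the code assumes that IDs and index names need no JSON escaping and that each document is already serialized as single-line JSON.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a newline-delimited JSON body for the
// Elasticsearch _bulk endpoint from pre-serialized documents.
final class BulkBodyBuilder {

    // One "index" action line plus one source line per document.
    // Assumes index and IDs need no JSON escaping.
    static String build(String index, Map<String, String> jsonById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : jsonById.entrySet()) {
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(e.getKey()).append("\"}}\n");
            // The bulk format requires every line, including the last,
            // to end with a newline.
            body.append(e.getValue()).append('\n');
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"symbol\":\"zen\"}");
        docs.put("2", "{\"symbol\":\"eve\"}");
        System.out.print(build("genes", docs));
    }
}
```

The resulting string would be sent in a single POST to the `_bulk` endpoint, so many documents share one request instead of each paying its own round trip.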

-Colin


Re: Is it useful to use RAMDirectory for indexing lucene keywords?

Justin Clark-Casey-2
Eh, with SSDs and my very primitive benchmark of eyeballing CPU usage whilst indexing, I'd still argue the toss :)  But as always with performance stuff, you have to try it and see :)

Yeah, we'll definitely want to look at similar facilities in Elasticsearch/Solr.

-- Justin

Re: Is it useful to use RAMDirectory for indexing lucene keywords?

HongKee Moon
Hi Justin & Colin,

Thank you so much for your kind and helpful comments.
I am looking forward to seeing the new search framework running in InterMine soon.
If I manage to apply the tips mentioned here and get any results, I will keep you updated.

Have a lovely day!

Cheers,
HongKee
