Full-text search in CHADO/PostgreSQL/GBROWSE

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Full-text search in CHADO/PostgreSQL/GBROWSE

lpritc@scri.ac.uk
Hi,

After very helpful emails from James Ward and Scott, I now have a working
(as far as I've pushed it, so far) CHADO/PostgreSQL/GBROWSE Landmark and
Region search field.  I'm kind of new to PostgreSQL full-text searching, and
I'm not a Perlista, so the attached SQL script to enable FTS on CHADO, and
the .diffs for modifying Chado.pm and Segment.pm (relative to release 0.26)
in the Chado adaptor may not be the best code in the world, and could
probably do with some inspection.

That said, multi-term space-separated queries for 'ion channel', 'resistance
multidrug' and so on work, as do suffix-only wildcard searches (which
appears to be a limitation of the FTS lexeme approach) like 'effect*' and
'crink*'.

I hope the attached files are useful to others who need this sort of
functionality.  The SQL file gets the usual

psql -U $CHADO_DB_USER $CHADO_DB_NAME < chado_fts.sql

arrangement, and the .diffs are .diffs like any other.

Cheers,

L.


--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email]       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify [hidden email] quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
------------------------------------------------------------------------------


_______________________________________________
Gmod-gbrowse mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse

chado_fts.sql (7K) Download Attachment
Chado.pm.diff (4K) Download Attachment
Segment.pm.diff (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Adaptor performance/materialized views/indexing - Re: Full-text search in CHADO/PostgreSQL/GBROWSE

lpritc@scri.ac.uk
Hi,

I've had the chance to look into what's going on with the FTS/multipart
query issue in a bit more detail over the last couple of days, and there are
some interesting performance metrics that might be useful to anyone
following up on this CHADO/PostgreSQL/GBROWSE setup and interested in
improving performance.

Most of the Landmark or Region queries of interest to us pass through the
feature table and/or all_feature_names view.  Our example database (complete
oomycete annotation) has the following number of rows in each:

oomycete_reference=> SELECT COUNT(*) FROM feature;
 count  
--------
 164036
(1 row)

oomycete_reference=> SELECT COUNT(*) FROM all_feature_names;
 count  
--------
 610586
(1 row)

Plain old string-matching on this dataset has reasonable, but not
spectacular performance:

oomycete_reference=> SELECT COUNT(DISTINCT(feature_id)) FROM feature WHERE
name LIKE '%hypothetical%protein%' OR name LIKE '%protein%hypothetical%';
 count
-------
   129
(1 row)

Time: 114.721 ms

oomycete_reference=> SELECT COUNT(DISTINCT(feature_id)) FROM
all_feature_names WHERE name LIKE '%hypothetical%protein%' OR name LIKE
'%protein%hypothetical%';
 count
-------
 10957
(1 row)

Time: 1138.191 ms

Using full-text searching after creating new columns to hold the tsvector in
each table and indexing that column with gin, gives quite a speed up:

oomycete_reference=> SELECT COUNT(DISTINCT(feature_id)) FROM feature WHERE
searchable_name @@ to_tsquery('hypothetical & protein');
 count
-------
   129
(1 row)

Time: 3.313 ms

But we can't create an index on a view, so the all_feature_name query takes
quite a while, in fact 10X longer than the plain-text query (which has the
additional advantages of allowing word suffix searches, and searches on
substrings within each word):

oomycete_reference=> SELECT COUNT(DISTINCT(feature_id)) FROM
all_feature_names WHERE searchable_name @@ to_tsquery('hypothetical &
protein');
 count
-------
 10958
(1 row)

Time: 9537.446 ms

The obvious solution is to use a materialised view that can be indexed,
which I guess can be handled using the gmod_materialized_view_tool.pl or
gmod_materialize_gbrowse_views.pl tools, but I found it quicker just to
create a new materialised view by hand than to understand what was required
- and what would happen to our database - in the documentation for the
scripts and by forging through Perl code.  Creating the materialised view
mv_all_feature_names speeds things up, by itself:

oomycete_reference=> SELECT COUNT(DISTINCT(feature_id)) FROM
mv_all_feature_names WHERE searchable_name @@ to_tsquery('hypothetical &
protein');
 count
-------
 10958
(1 row)

Time: 5251.238 ms

But doesn't start to improve on string matching until it's indexed:

oomycete_reference=> CREATE INDEX mv_all_feature_names_sname_idx ON
mv_all_feature_names USING gin(searchable_name);
CREATE INDEX
Time: 6767.002 ms
oomycete_reference=> SELECT COUNT(DISTINCT(feature_id)) FROM
mv_all_feature_names WHERE searchable_name @@ to_tsquery('hypothetical &
protein');
 count
-------
 10958
(1 row)

Time: 50.492 ms

This has two issues: one is arranging synchronisation of the materialised
view with the view, and the other is that the Chado adaptor doesn't point to
the materialised view without user modification, in any case.  Since we want
to inspect a live annotation database, and it takes around 7s to index the
column we're querying on, having trigger functions in each of the underlying
tables actively to update the materialised view on changes isn't going to be
acceptable when changes may be more frequent than each 7s.  A scheduled
materialisation seems more sensible, there - though any suggestions are
welcome.  Manual modification of the Chado adaptor in the local installation
seems to be the only solution for the second problem, at the moment.

I hope this info is useful,

L.


On 26/05/2010 Wednesday, May 26, 15:18, "Leighton Pritchard"
<[hidden email]> wrote:

> Hi,
>
> After very helpful emails from James Ward and Scott, I now have a working (as
> far as I've pushed it, so far) CHADO/PostgreSQL/GBROWSE Landmark and Region
> search field.  I'm kind of new to PostgreSQL full-text searching, and I'm not
> a Perlista, so the attached SQL script to enable FTS on CHADO, and the .diffs
> for modifying Chado.pm and Segment.pm (relative to release 0.26) in the Chado
> adaptor may not be the best code in the world, and could probably do with some
> inspection.
>
> That said, multi-term space-separated queries for 'ion channel', 'resistance
> multidrug' and so on work, as do suffix-only wildcard searches (which appears
> to be a limitation of the FTS lexeme approach) like 'effect*' and 'crink*'.
>
> I hope the attached files are useful to others who need this sort of
> functionality.  The SQL file gets the usual
>
> psql -U $CHADO_DB_USER $CHADO_DB_NAME < chado_fts.sql
>
> arrangement, and the .diffs are .diffs like any other.
>
> Cheers,
>
> L.
>

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:[hidden email]       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405


______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.  
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.


DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries.  This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify [hidden email] quoting the name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________

------------------------------------------------------------------------------

_______________________________________________
Gmod-gbrowse mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse