Handy post-processor for cleaning up your publications (and grabbing DOI, PMID)

Sam Hokin-3
Hiya, devs. Happy holidays and stuff! I get a lot of very sketchy publication data from my various legume sources, and I finally
decided to just polish up the publications and authors with a post-processor. I'd already written a PubMed tool, and then I
discovered CrossRef, which lets you query on whatever fields you have and returns a pretty nice ranked list of records, the first of
which is usually an exact match, including a DOI (but not a PMID, since it's a general library tool). So I thought I'd share what
I've written, in this repo:

https://github.com/LegumeFederation/intermine-bio-postprocess.git

  main/src/org/intermine/bio/postprocess/PopulatePublications.java

is the post-processor. All you need is a PubMed ID or a first author/title combination. It uses the following JARs, which I symlink
over to the bio/postprocess tree (along with all my custom post-processor sources):

  main/lib/ncgr-crossref.jar
  main/lib/ncgr-pubmed.jar
  main/lib/json-simple.jar
  main/lib/xstream-1.4.8.jar

Those last two are generally available of course. (I highly recommend the json-simple package; it's what it says, and everything
else I've looked at is way, way over-designed.)

If you'd like to look at the package sources, those are under my own NCGR repo:

https://github.com/sammyjava/ncgr.git

(I've got an org.ncgr.intermine package in there that I should probably move over to the LegFed repo.)
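
For anyone who wants a feel for the CrossRef side without digging through the sources, here is a minimal sketch of building a first author/title query against the public CrossRef REST API. It is not the code in ncgr-crossref.jar; the class and method names (CrossRefQuerySketch, buildQueryUrl) and the example title are made up for illustration, and it only builds the URL rather than fetching anything.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ncgr-crossref.jar.
public class CrossRefQuerySketch {

    // Public CrossRef REST endpoint for works (articles, books, etc.).
    static final String WORKS_ENDPOINT = "https://api.crossref.org/works";

    // Build a query URL from a first author and a title.
    // rows limits how many ranked candidates come back.
    static String buildQueryUrl(String firstAuthor, String title, int rows) {
        return WORKS_ENDPOINT
            + "?query.author=" + encode(firstAuthor)
            + "&query.bibliographic=" + encode(title)
            + "&rows=" + rows;
    }

    // Form-style URL encoding: spaces become '+'. Requires Java 10+.
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder author and title, purely illustrative.
        System.out.println(buildQueryUrl("Cook", "an example legume paper", 1));
    }
}
```

The response is JSON, with the ranked candidates under message.items and the DOI in each item's DOI field, so taking the first item and checking its title against yours is usually enough.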

The post-processor does some finagling with middle initials and stuff to try to get duplicate authors into a single record. It often
doesn't succeed, but, for example, "Douglas Cook" and "Douglas R Cook" and "Douglas R. Cook" all get merged into a single "Douglas
Cook" author record (or "Douglas R Cook" if that came first). It can't, of course, merge "Richard W. Dawkins" and "R.W. Dawkins"
since we don't know that the latter isn't his sister, Roberta W. Dawkins. (Or their uncle Ronald W. for that matter.)
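
The merging rule described above can be sketched roughly like this. It is a guess at the idea, not the logic in PopulatePublications.java: the class AuthorKeySketch and its methods are hypothetical, and the rule assumed here is "drop middle initials when the first name is spelled out, and leave names whose first name is only initials alone."

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the code in PopulatePublications.java.
public class AuthorKeySketch {

    // Two author strings are treated as the same person if their keys match.
    static String mergeKey(String fullName) {
        String[] parts = fullName.trim().split("\\s+");
        if (parts.length < 2) {
            return fullName.trim().toLowerCase(Locale.ROOT);
        }
        // First name given only as initials ("R.W."): too ambiguous to merge,
        // so keep the whole normalized string as the key.
        if (isInitials(parts[0])) {
            return String.join(" ", parts).toLowerCase(Locale.ROOT);
        }
        // Otherwise keep first and last name, dropping any middle initials.
        List<String> kept = new ArrayList<>();
        kept.add(parts[0]);
        for (int i = 1; i < parts.length - 1; i++) {
            if (!isInitials(parts[i])) {
                kept.add(parts[i]);
            }
        }
        kept.add(parts[parts.length - 1]);
        return String.join(" ", kept).toLowerCase(Locale.ROOT);
    }

    // True for tokens like "R", "R." or "R.W."
    static boolean isInitials(String token) {
        String letters = token.replace(".", "");
        return letters.length() <= 1
            || (token.contains(".") && letters.equals(letters.toUpperCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(mergeKey("Douglas R. Cook"));
        System.out.println(mergeKey("R.W. Dawkins"));
    }
}
```

With this rule, "Douglas Cook", "Douglas R Cook" and "Douglas R. Cook" share a key, while "Richard W. Dawkins" and "R.W. Dawkins" stay separate, which matches the behavior described above.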

Anyway, enjoy using these tools to tighten up your pubs with both PMID and DOI attributes and consistently named authors.



_______________________________________________
dev mailing list
[hidden email]
https://lists.intermine.org/mailman/listinfo/dev