Handy post-processor for cleaning up your publications (and grabbing DOI, PMID)
Hiya, devs. Happy holidays and stuff! I get a lot of very sketchy publication data from my various legume sources, and I finally
decided to just polish up the publications and authors with a post-processor. I'd already written a PubMed tool, and then I
discovered CrossRef, which allows you to query on whatever fields and get back a pretty nice ordered list of records, the first of
which is usually an exact match, including a DOI (but not a PMID since it's a general library tool). So I thought I'd share what
I've written, in this repo:
is the post-processor. All you need is a PubMed ID or a first author/title combination. It uses the following JARs, which I symlink
over to the bio/postprocess tree (along with all my custom post-processor sources):
(I've got an org.ncgr.intermine package in there that I should probably move over to the LegFed repo.)
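For anyone who hasn't poked at CrossRef yet, here's roughly what a first author/title lookup looks like. This is only a sketch, not the code from the post-processor: the class and method names (CrossRefQuery, buildQuery) are made up for illustration, and I'm assuming the public REST endpoint at api.crossref.org/works with its query.author, query.bibliographic and rows parameters. It just builds the request URL; fetching it and pulling the DOI out of the first returned record is left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the actual post-processor.
public class CrossRefQuery {

    // Assumed base URL of the public CrossRef REST API.
    static final String WORKS_URL = "https://api.crossref.org/works";

    /**
     * Build a query URI from a first author's name and a title.
     * rows=1 asks for only the top-ranked record, which is usually the match.
     */
    static URI buildQuery(String firstAuthor, String title) {
        String url = WORKS_URL
            + "?query.author=" + URLEncoder.encode(firstAuthor, StandardCharsets.UTF_8)
            + "&query.bibliographic=" + URLEncoder.encode(title, StandardCharsets.UTF_8)
            + "&rows=1";
        return URI.create(url);
    }

    public static void main(String[] args) {
        // Placeholder author and title, not a real publication.
        System.out.println(buildQuery("Cook", "legume genome"));
    }
}
```

The top hit is usually but not always the right paper, so it's worth comparing the returned title against your own before trusting the DOI.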
The post-processor does some finagling with middle initials and stuff to try to get duplicate authors into a single record. It often
doesn't succeed, but, for example, "Douglas Cook" and "Douglas R Cook" and "Douglas R. Cook" all get merged into a single "Douglas
Cook" author record (or "Douglas R Cook" if that came first). It can't, of course, merge "Richard W. Dawkins" and "R.W. Dawkins"
since we don't know that the latter isn't his sister, Roberta W. Dawkins. (Or their uncle Ronald W. for that matter.)
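To make the merging idea concrete, here's a minimal sketch of one way to do it. It is not the actual logic in the post-processor, and the names (AuthorMerge, key, merge) are hypothetical. The assumption is that names arrive as "First [Middle...] Last": the key is the first and last token, middle names and initials are ignored, and whichever spelling shows up first becomes the stored record.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration, not the post-processor's real code.
public class AuthorMerge {

    /** Key on the first and last whitespace-separated tokens; middle tokens are dropped. */
    static String key(String fullName) {
        String[] parts = fullName.trim().split("\\s+");
        String first = parts[0];
        String last = parts[parts.length - 1];
        return (first + "|" + last).toLowerCase(Locale.ROOT);
    }

    /** Map each key to the first spelling seen for it, preserving input order. */
    static Map<String, String> merge(Iterable<String> names) {
        Map<String, String> canonical = new LinkedHashMap<>();
        for (String name : names) {
            canonical.putIfAbsent(key(name), name);
        }
        return canonical;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
            "Douglas Cook", "Douglas R Cook", "Douglas R. Cook",
            "Richard W. Dawkins", "R.W. Dawkins");
        // Prints one line per surviving author record.
        merge(names).values().forEach(System.out::println);
    }
}
```

With that input the three Cook spellings collapse into "Douglas Cook", while "Richard W. Dawkins" and "R.W. Dawkins" stay separate, because "R.W." is never expanded into a first name.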
Anyway, enjoy making your pubs a bit tighter with both PMID and DOI attributes and consistently-named authors with these tools.