Hacking on the PubMed API

The PubMed API is pretty convoluted. Every time I try to use it, I have to relearn it from scratch.

Generally, I want to get JSON data about an article using its PubMed ID, and I want to do searches programmatically… These are pretty basic and pretty common goals…

The PubMed API is an old-school RESTish API that has hundreds of different purposes and options. Technically, the PubMed API is part of the Entrez Programming Utilities (E-utilities), the programmatic interface to NCBI's Entrez databases, and instructions for using it begin and end with the Entrez Programming Utilities Help document. Here are the things you probably really wanted to know…

How to search for articles using the PubMed API

To search PubMed you need to use the eSearch API.

Here is the example they give…

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d 

The first thing we want to do is get this thing to return JSON instead of XML. We do that by adding a GET variable, retmode=json. The new URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d&retmode=json

Ahh… that's better… Now let's get more ids in each batch of results…

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d&retmode=json&retmax=1000

Breaking this down…

http://eutils.ncbi.nlm.nih.gov/entrez/

is kind of the entry point for the whole system…

/eutils/esearch.fcgi

is the actual function that you will be using…

This tells the API that you want to search pubmed.

db=pubmed

Next you want to set the “return mode” so that JSON is returned.

retmode=json

And then you want to add retmax to get up to 1000 results at a time… The documentation says that you can get 100,000, but I get a 404 if I go over 1000.

retmax=1000

Finally, the term argument carries the search itself.

term=YOUR SEARCH TERMS HERE

db and term are separated using the classic GET variable layout (the list starts with a ? and each variable after the first is joined on with a &). If that sounds strange to you, I suggest you learn a little more about how GET variables work in practice.
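Gluing these pieces together by hand works, but it is easy to fumble a ? or a &. Here is a minimal sketch that lets Python's standard urllib.parse do the gluing. The function name build_esearch_url is something I made up for illustration, and the base URL is simply the one used throughout this post.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL exactly as written in this post.
ESEARCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=1000):
    """Assemble an eSearch URL from db, retmode, retmax and term.

    build_esearch_url is a hypothetical helper, not part of any library.
    """
    params = {
        "db": "pubmed",       # search PubMed
        "retmode": "json",    # ask for JSON instead of XML
        "retmax": retmax,     # ids per batch
        "term": term,         # the raw, un-encoded search text
    }
    # urlencode adds the & separators and percent-encodes each value
    # (spaces become +, square brackets become %5B and %5D).
    return ESEARCH_BASE + "?" + urlencode(params)


url = build_esearch_url("science[journal] AND breast cancer AND 2008[pdat]")
print(url)

# Reading the query string back shows the term survived the round trip.
print(parse_qs(urlsplit(url).query)["term"][0])
```

The printed URL should have the same shape as the example URLs above, with the brackets encoded in upper-case hex rather than lower-case, which means the same thing.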

Now about the “YOUR SEARCH TERMS HERE”: that is a URL-encoded string holding the search query for PubMed. URL encoding is (in something of a trivialized explanation) how you make sure that there are no spaces or other strangeness in a URL. Here is a handy way to get data into and out of URL encoding if you do not know what that is…
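If you would rather not depend on a web page for this, Python's standard library can do the same job. This is only a small sketch: quote and unquote are real functions in urllib.parse, and the sample string is just a fragment of a PubMed query I picked to show the effect.

```python
from urllib.parse import quote, unquote

raw = '"breast cancer"[All Fields]'

# quote() percent-encodes anything that is not URL-safe;
# safe="" tells it to encode every reserved character.
encoded = quote(raw, safe="")
print(encoded)  # %22breast%20cancer%22%5BAll%20Fields%5D

# unquote() goes the other way, back to the readable text.
decoded = unquote(encoded)
print(decoded == raw)  # True
```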

Thankfully the search terms are well defined, but not anywhere near the documentation for the API. The simplest way to understand the very advanced search functionality on PubMed is to use the PubMed advanced query builder, or you can do a simple search and then pay close attention to the box labeled “search details” on the right sidebar. For instance, I did a simple search for “Breast Cancer” and then enabled filters for Article Type of Review Articles and Journal Categories of “Core Clinical Journals”, which results in a search text that looks like this:

("breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]) AND (Review[ptyp] AND jsubsetaim[text])

Let's break that apart into a readable syntax display…

("breast neoplasms"[MeSH Terms] 
  OR ("breast"[All Fields] 
        AND "neoplasms"[All Fields]) 
  OR "breast neoplasms"[All Fields] 
  OR ("breast"[All Fields] 
        AND "cancer"[All Fields]) 
  OR "breast cancer"[All Fields]) 
AND (Review[ptyp] 
  AND jsubsetaim[text])

How did I get this from such a simple search? PubMed is using MeSH terms to map my search to what I “really wanted”. MeSH stands for “Medical Subject Headings”, an ontology built specifically to make this task easier.

After that, it just tacked on the filter constraints that I manually set.

Now all I have to do is use my handy URL encoder to get the following URL-encoded version of my search parameters.

(%22breast%20neoplasms%22%5BMeSH%20Terms%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22neoplasms%22%5BAll%20Fields%5D)%20OR%20%22breast%20neoplasms%22%5BAll%20Fields%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22cancer%22%5BAll%20Fields%5D)%20OR%20%22breast%20cancer%22%5BAll%20Fields%5D)%20AND%20(Review%5Bptyp%5D%20AND%20jsubsetaim%5Btext%5D)

Let's put retmode=json ahead of term= so that we can easily paste this onto the back of the URL. We get the following result.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&term=(%22breast%20neoplasms%22%5BMeSH%20Terms%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22neoplasms%22%5BAll%20Fields%5D)%20OR%20%22breast%20neoplasms%22%5BAll%20Fields%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22cancer%22%5BAll%20Fields%5D)%20OR%20%22breast%20cancer%22%5BAll%20Fields%5D)%20AND%20(Review%5Bptyp%5D%20AND%20jsubsetaim%5Btext%5D)

I wish that my CSS could handle these really long links better… but oh well. I know it looks silly, let's move on.

To save you (well, mostly me at some future date) the trouble of cutting and pasting, here is the trunk of the URL that is just missing the URL-encoded search term.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&term=
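For future me, the encode-and-paste step can also be scripted. Below is a rough sketch, assuming Python's urllib.parse.quote as the encoder; the names TRUNK, term and search_url are mine. Passing safe="()" leaves the parentheses alone, which is how the encoded string earlier in this post looks.

```python
from urllib.parse import quote, unquote

# Trunk of the URL, copied from this post (it ends with "term=").
TRUNK = (
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    "?db=pubmed&retmode=json&term="
)

# The text from the "search details" box, split up only for readability.
term = (
    '("breast neoplasms"[MeSH Terms] '
    'OR ("breast"[All Fields] AND "neoplasms"[All Fields]) '
    'OR "breast neoplasms"[All Fields] '
    'OR ("breast"[All Fields] AND "cancer"[All Fields]) '
    'OR "breast cancer"[All Fields]) '
    'AND (Review[ptyp] AND jsubsetaim[text])'
)

# Percent-encode quotes, spaces and brackets; keep ( and ) as they are.
encoded_term = quote(term, safe="()")

# Paste the encoded term onto the back of the trunk.
search_url = TRUNK + encoded_term
print(search_url)
```

Because the trunk has no retmax, this URL returns the default batch size; add &retmax=1000 ahead of &term= if you want the bigger batches.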

At the time of writing, the PubMed GUI returns 2622 results for this search, and so does the API call… which is consistent and a good check to indicate that I am on the right track. Very satisfying.

The JSON that I get back has a section that looks like this:

    "esearchresult": {
        "count": "2622",
        "retmax": "20",
        "retstart": "0",
        "idlist": [
            "25081398",
            "25056393",
            "25055284",
            "25055283",
            "24956046",
            "24926080",
            "24912480",
            "24890451",
            "24889167",
            "24880509",
            "24878027",
            "24849143",
            "24838656",
            "24830599",
            "24792660",
            "24792659",
            "24792658",
            "24792657",
            "24792656",
            "24792655"
        ],
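Pulling the useful bits out of that structure takes only a few lines. Here is a minimal sketch using Python's json module. The response_text string is a hand-trimmed copy of the snippet above (only three ids kept), not a full API response.

```python
import json

# Hand-trimmed copy of the response shown above; a real response
# carries more keys and, by default, twenty ids.
response_text = """
{
    "esearchresult": {
        "count": "2622",
        "retmax": "20",
        "retstart": "0",
        "idlist": ["25081398", "25056393", "25055284"]
    }
}
"""

result = json.loads(response_text)["esearchresult"]

# The numbers come back as strings, so convert before doing arithmetic.
total = int(result["count"])
batch_size = int(result["retmax"])
pmids = result["idlist"]

print(total, batch_size, pmids)
```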

With this result it is easy to see why you want to set retmax… getting 20 at a time is pretty slow… But how do you page through the results to get the next 1000 results? Add the retstart variable:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&retstart=1000&term=(%22breast%20neoplasms%22%5BMeSH%20Terms%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22neoplasms%22%5BAll%20Fields%5D)%20OR%20%22breast%20neoplasms%22%5BAll%20Fields%5D%20OR%20(%22breast%22%5BAll%20Fields%5D%20AND%20%22cancer%22%5BAll%20Fields%5D)%20OR%20%22breast%20cancer%22%5BAll%20Fields%5D)%20AND%20(Review%5Bptyp%5D%20AND%20jsubsetaim%5Btext%5D)
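To walk through every page you just step retstart up by retmax until you pass the count. A small sketch of that arithmetic follows. paged_esearch_urls is a made-up helper name, and it only builds the URLs; actually fetching them is left to whatever HTTP client you like.

```python
from urllib.parse import urlencode

# Base URL exactly as written in this post.
ESEARCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def paged_esearch_urls(term, count, retmax=1000):
    """Return one eSearch URL per page needed to cover `count` results.

    Hypothetical helper: `count` is the "count" value from a first call.
    """
    urls = []
    # retstart takes the values 0, retmax, 2*retmax, ... below count.
    for retstart in range(0, count, retmax):
        params = {
            "db": "pubmed",
            "retmode": "json",
            "retmax": retmax,
            "retstart": retstart,
            "term": term,
        }
        urls.append(ESEARCH_BASE + "?" + urlencode(params))
    return urls


# 2622 results at 1000 per page means three calls: 0, 1000 and 2000.
pages = paged_esearch_urls("breast cancer", count=2622)
for page_url in pages:
    print(page_url)
```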

If you need more help, here is the link to the full documentation for eSearch API again…

 

How to download data about specific articles using the PubMed API

There are two stages to downloading the specific articles. First, to get article meta-data you want to use the eSummary API… using the ids from the idlist json element above… you can call it like this:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&rettype=abstract&id=25081398

This will return a lovely json summary of this abstract. Technically, you can get more than one id at a time, by separating them with commas like so…

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&rettype=abstract&id=25081398,24792655
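Since the ids arrive as a list from eSearch, building this URL is mostly a join. Here is a quick sketch. build_esummary_url is my own helper name, and the query string simply mirrors the URL above, parameters and all.

```python
# Base URL exactly as written in this post.
ESUMMARY_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(pmids):
    """Build an eSummary URL for one or more PubMed ids.

    Hypothetical helper; accepts ids as strings or ints.
    """
    # Comma-separate the ids, as in the example URL above.
    id_list = ",".join(str(pmid) for pmid in pmids)
    return (
        f"{ESUMMARY_BASE}?db=pubmed&retmode=json"
        f"&rettype=abstract&id={id_list}"
    )


summary_url = build_esummary_url(["25081398", "24792655"])
print(summary_url)
```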

This summary is great, but it will not get the abstracts, if and when they are available (it will tell you if there is an abstract available, however…). In order to get the abstracts you need to use the eFetch API:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=text&rettype=abstract&id=25081398

Unlike the other APIs, there is no JSON retmode here; the default is XML, but you can get plaintext using retmode=text. So if you want structured data, you must use XML. Why? Because. That's why. This API will take a comma-separated id list too, but I cannot see how to separate the plaintext results easily, so if you are using the plaintext (which is fine for my current purposes), it is better to call it with a single id at a time.
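If you do go the XML route, the multi-id problem mostly goes away, because each article comes back in its own element. What follows is a sketch, not a tested client. The element names (PubmedArticle, MedlineCitation, PMID, AbstractText) reflect my understanding of the eFetch XML layout and should be checked against a real response. The SAMPLE_XML string is hand-written with placeholder ids and text, and abstracts_by_pmid is a made-up helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an eFetch retmode=xml response.
# Ids and abstract text are placeholders, not real PubMed data.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article>
        <Abstract>
          <AbstractText>First placeholder abstract.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222222</PMID>
      <Article/>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def abstracts_by_pmid(xml_text):
    """Map each PMID in the XML to its abstract text, or None if absent."""
    root = ET.fromstring(xml_text)
    abstracts = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # An abstract may be split over several AbstractText elements.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iterfind(
                "MedlineCitation/Article/Abstract/AbstractText"
            )
        ]
        abstracts[pmid] = " ".join(parts) if parts else None
    return abstracts


found = abstracts_by_pmid(SAMPLE_XML)
print(found)
```

With something like this, one comma-separated eFetch call with retmode=xml can stand in for many single-id plaintext calls.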