Hacking on the Wikipedia APIs for Health Tech

Recently I wrote about my work hacking on the PubMed API, which I hope is helpful to people. Now I will cover some of the revelations I have had while working with DocGraph on the Wikipedia APIs.

This article will presume some knowledge of the basic structure of open medical data sets, but we have recently released a pretty good tool for browsing the relationships between the various data sets: DocGraph Linea (that project was specifically backed by Merck, both financially and with coding resources, and they deserve a ton of credit for it working as smoothly as it does).

OK, here are some basics to remember when hacking on the Wikipedia APIs if you are doing so from a clinical angle. Some of this will apply to Wikipedia hacking in general, but much of it is specifically geared towards understanding the considerable clinical content that Wikipedia and its sister projects possess.

First, there is a whole group of editors that might be interested in collaborating with you at Wikiproject Medicine. (There is also a Wikiproject Anatomy, which ends up being strongly linked to clinical topics for obvious reasons.) In general you should think of a Wikiproject as a group of editors with a shared interest in a topic who collectively adopt a group of related articles. Lots of behind-the-scenes things on Wikipedia take place on talk pages, and the connection between Wikiprojects and specific Wikipedia articles is one of them. You can see the connection between Wikiproject Medicine and the Diabetes article, for instance, on the Diabetes Talk page.
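If you want to see that connection programmatically rather than by eyeballing the talk page, the categories on a talk page expose the WikiProject banners and their assessments. Here is a minimal sketch against the standard MediaWiki API; the talk page title is an assumption, so swap in whichever article you actually care about:

```python
import requests

# List the categories on a talk page. WikiProject banners add assessment
# categories along the lines of "B-Class medicine articles" and
# "High-importance medicine articles".
resp = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "prop": "categories",
    "titles": "Talk:Diabetes",  # assumed title; change to the article you want
    "redirects": 1,             # follow redirects in case the article has moved
    "cllimit": "max",
    "format": "json",
})
pages = resp.json()["query"]["pages"]
for page in pages.values():
    for cat in page.get("categories", []):
        print(cat["title"])
```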

Wikiproject Medicine maintains an internal work list that is the best place to understand the fundamental quality levels of all of the articles that they oversee. You can see the summary of this report embedded in the project page and also here. There is a quasi-API for this data: using the quality search page, you can get, for instance, articles that are listed as "C quality" but are also "High Priority".

Once a clinical article on Wikipedia has reached a state where the Wikipedian community (Wikipedian is the nickname for Wikipedia contributors and editors) regards it as either a "good" article or a "featured" article, it can generally be considered highly reliable. To prove this, several prominent healthcare Wikipedians converted the "dengue fever" Wikipedia article into a proper medical review article, and then got that article published in a peer-reviewed journal.

All of which is to say: the relative importance and quality of Wikipedia articles is something that is mostly known and can be accessed programmatically if needed. For now "programmatically" means parsing the HTML results of the quality search engine above; I have a request in for a "get JSON" flag, which I am sure will be added "real soon now".
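In the meantime, here is roughly what that scraping looks like. Fair warning: the URL and the query parameter names below are assumptions based on what the quality search page generated for me, so copy the real ones out of your browser's address bar before trusting this:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical example: scrape the quality listing tool for C-Class,
# High-importance Wikiproject Medicine articles. The URL and parameter
# names are assumptions -- run a search on the quality search page and
# lift the real ones from the resulting URL.
LIST_URL = "https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi"
resp = requests.get(LIST_URL, params={
    "run": "yes",
    "projecta": "Medicine",
    "quality": "C-Class",
    "importance": "High-Class",
})
soup = BeautifulSoup(resp.text, "html.parser")
# The results come back as an HTML table; pull the article link out of each row.
for row in soup.find_all("tr"):
    link = row.find("a")
    if link is not None:
        print(link.get_text())
```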

The next thing I wish I had understood about Wikipedia articles is the degree to which they have been pre-datamined. Most of the data linking for Wikipedia articles started life as “infoboxes” which are typically found at the top right of clinically relevant articles. They look like this:

[Images: the ethanol infobox and the diabetes infobox]

The diabetes infobox contains links to ICD-9 and ICD-10 as well as MeSH. Others will have links to SNOMED or CPT as appropriate. The ethanol article has tons of stuff in it, but for now we can focus just on the ATC code entry. Not only does it have the codes, they correctly link to the relevant page on the WHO website.

An infobox is a template on Wikipedia, which means it is a special kind of markup that can be found inside the wikitext for a given article. Later we will show how to download the wikitext. But for now, I want to assure you that the right way to access this data is through Wikidata; parsing wikitext is not something you need to do in order to get at it. (This sentence would have saved me about a month of development time, if I had been able to read it.)

For instance, here is how we get the ATC codes for ethanol via the Wikidata API.
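This is a minimal sketch rather than production code, and it assumes P267 is still the Wikidata property for ATC codes (it was when I wrote this), so check the property page on wikidata.org if you get nothing back:

```python
import requests

# Resolve the English Wikipedia "Ethanol" article to its Wikidata item
# and pull the claims for property P267 (ATC code).
resp = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "wbgetentities",
    "sites": "enwiki",
    "titles": "Ethanol",
    "props": "claims",
    "format": "json",
})
entities = resp.json()["entities"]
for entity in entities.values():
    for claim in entity.get("claims", {}).get("P267", []):
        value = claim["mainsnak"].get("datavalue", {}).get("value")
        if value:
            print(value)
```

Running this prints whatever ATC codes are attached to the ethanol item, one per claim.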

Most of this data mining is found in the Wikidata project. Let's have a brief 10,000-foot tour of the resources that it offers. First, there are several clinically related data points that it tracks. This includes ATC codes, which are the WHO-maintained codes for medications. (It should be noted that recent versions of RxNorm can link ATC codes to NDC codes, which are maintained by the US FDA and are newly exposed by the openFDA API project.)
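I have not built anything on it yet, but the RxNorm end of that linkage is queryable over the NLM's RxNav REST API. A hedged sketch, assuming RxNav's findRxcuiById endpoint accepts ATC as an idtype and its getNDCs endpoint works as documented (verify both before relying on this):

```python
import requests

RXNAV = "https://rxnav.nlm.nih.gov/REST"

# Map an ATC code to RxNorm concept ids (RxCUIs)...
atc = "D08AX08"  # ethanol's antiseptic ATC code, used here only as example input
rxcuis = requests.get(f"{RXNAV}/rxcui.json",
                      params={"idtype": "ATC", "id": atc}).json()
for rxcui in rxcuis.get("idGroup", {}).get("rxnormId", []):
    # ...then ask RxNav for the NDC codes associated with each RxCUI.
    ndcs = requests.get(f"{RXNAV}/rxcui/{rxcui}/ndcs.json").json()
    ndc_list = (ndcs.get("ndcGroup", {}).get("ndcList") or {}).get("ndc", [])
    print(rxcui, ndc_list)
```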

I pulled all of the tweets I made from Wikimania about this into a Storify.

Other things you want to do in no particular order:

Once you have wikitext, it's pretty easy to mine it for PMIDs so that you can use the PubMed API. I used regular expressions to do this, which does occasionally miss some PMIDs. I think there is an API way to do this perfectly but I cannot remember what it is… A rough version of the regex approach is sketched below.
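This pulls the wikitext down with the standard MediaWiki parse API and then matches both the `pmid=` parameter used inside citation templates and bare "PMID 12345" mentions. The patterns are a simplification of what I actually run, so expect the occasional miss:

```python
import re
import requests

# Download the raw wikitext for an article via the MediaWiki parse API.
resp = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "parse",
    "page": "Dengue fever",
    "prop": "wikitext",
    "format": "json",
})
wikitext = resp.json()["parse"]["wikitext"]["*"]

# Match "pmid = 12345" inside {{cite journal}} templates and bare "PMID 12345".
pmids = set(re.findall(r"pmid\s*=\s*(\d+)", wikitext, re.IGNORECASE))
pmids |= set(re.findall(r"PMID[:\s]+(\d+)", wikitext))
print(sorted(pmids, key=int))
```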

That's a pretty good start. Let me know if you have any questions. I will likely expand on this article when I am not sleepy…

-FT