Fred Trotter

Healthcare Data Journalist

Hacking, Open Data, wiki

Hacking on the Wikipedia APIs for Health Tech

Recently I wrote about my work hacking on the PubMed API, which I hope is helpful to people. Now I will cover some of the revelations I have had working with DocGraph on the Wikipedia APIs.

This article will presume some knowledge of the basic structure of open medical data sets, but we have recently released a pretty good tool for browsing the relationships between the various data sets: DocGraph Linea (that project was specifically backed by Merck, both financially and with coding resources, and they deserve a ton of credit for it working as smoothly as it does).

OK, here are some basics to remember when hacking on the Wikipedia APIs if you are doing so from a clinical angle. Some of this will apply to Wikipedia hacking in general, but much of it is specifically geared towards understanding the considerable clinical content that Wikipedia and its sister projects possess.

First, there is a whole group of editors that might be interested in collaborating with you at WikiProject Medicine. (There is also a WikiProject Anatomy, which ends up being strongly linked to clinical topics for obvious reasons.) In general you should think of a WikiProject as a group of editors with a shared interest in a topic who collectively adopt a set of articles, in this case medical ones. Lots of behind-the-scenes things on Wikipedia take place on talk pages, and the connection between WikiProjects and specific articles is one of them. You can see the connection between WikiProject Medicine and the Diabetes article, for instance, on the Diabetes talk page.
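If you already have the wikitext of a talk page, checking for the project banner can be sketched roughly as below. This is only a minimal sketch under assumptions: it assumes the banner is written as {{WikiProject Medicine|class=...|importance=...}}, while real talk pages may use an alias, different parameter names, or wrap the banner in a shell template. The function name find_medicine_banner is made up for illustration.

```python
import re

# Assumption: the banner appears as {{WikiProject Medicine|class=B|importance=High}}.
# Aliases, nested templates, and banner shells are NOT handled by this sketch.
BANNER_RE = re.compile(
    r"\{\{\s*WikiProject[ _]Medicine\s*(\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def find_medicine_banner(talk_wikitext):
    """Return the banner's named parameters as a dict, or None if no banner is found."""
    match = BANNER_RE.search(talk_wikitext)
    if match is None:
        return None
    params = {}
    raw = match.group(1) or ""
    for part in raw.split("|"):
        # Keep only key=value pairs; positional parameters are ignored.
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().lower()] = value.strip()
    return params
```

Calling find_medicine_banner on a talk page's wikitext gives you whatever quality and priority parameters the banner carries, or None when the article has not been adopted by the project (or the banner is written in a form this sketch does not recognize).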

WikiProject Medicine maintains an internal work list that is the best place to understand the fundamental quality levels of all of the articles that they oversee. You can see the summary of this report embedded in the project page and also here. There is a quasi-API for this data using the quality search page: for example, you can get articles that are listed as “C quality” but are also “High Priority”.

Once a clinical article on Wikipedia has reached a state where the Wikipedian community (Wikipedian is the nickname for Wikipedia contributors and editors) regards it as either a “good” article or a “featured” article, it can generally be considered to be highly reliable. To demonstrate this, several prominent healthcare Wikipedians converted the “dengue fever” Wikipedia article into a proper medical review article, and then got that article published in a peer-reviewed journal.

All of which is to say: the relative importance and quality of Wikipedia articles is something that is mostly known and can be accessed programmatically if needed. For now “programmatically” means parsing the HTML results of the quality search engine above. I have a request in for a “get json” flag, which I am sure will be added “real soon now”.

The next thing I wish I had understood about Wikipedia articles is the degree to which they have been pre-datamined. Most of the data linking for Wikipedia articles started life as “infoboxes” which are typically found at the top right of clinically relevant articles. They look like this:

[Images: infoboxes from the Ethanol and Diabetes articles]

The Diabetes infobox contains links to ICD-9 and ICD-10 as well as MeSH. Others will have links to SNOMED or CPT as appropriate. The ethanol article has tons of stuff in it, but for now we can focus just on the ATC code entry. Not only does it have the codes, but they correctly link to the relevant page on the WHO website.

An infobox is a template on Wikipedia, which means it is a special kind of markup that can be found inside the wikitext for a given article. Later we will show how we can download the wikitext. But for now, I want to assure you that the right way to access this data is through Wikidata; parsing wikitext is not something you need to do in order to get at this data. (This sentence would have saved me about a month of development time, if I had been able to read it.)

For instance, here is how we get the ATC codes for ethanol via the Wikidata API:
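The snippet that belonged here is not available, so what follows is only a minimal sketch of one way to do it, not necessarily the approach originally used. It assumes the Wikidata wbgetclaims API action, and it assumes that Q153 is the Wikidata item for ethanol and P267 is the "ATC code" property; please verify both identifiers on wikidata.org before relying on them. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_claims_url(entity_id, property_id):
    """Build a wbgetclaims URL for one property of one Wikidata item."""
    query = urllib.parse.urlencode({
        "action": "wbgetclaims",
        "entity": entity_id,
        "property": property_id,
        "format": "json",
    })
    return WIKIDATA_API + "?" + query


def extract_string_values(response, property_id):
    """Pull plain string values (such as ATC codes) out of a wbgetclaims response."""
    values = []
    for claim in response.get("claims", {}).get(property_id, []):
        snak = claim.get("mainsnak", {})
        # Skip "novalue" / "somevalue" snaks, which carry no datavalue.
        if snak.get("snaktype") == "value":
            values.append(snak["datavalue"]["value"])
    return values


def fetch_atc_codes(entity_id="Q153", property_id="P267"):
    """Fetch ATC codes over the network. Q153 and P267 are assumed IDs (see above)."""
    request = urllib.request.Request(
        build_claims_url(entity_id, property_id),
        # Placeholder User-Agent; replace with something that identifies you.
        headers={"User-Agent": "example-script/0.1 (contact: you@example.org)"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_string_values(json.load(reply), property_id)
```

Calling fetch_atc_codes() should return a list of code strings for the item. The parsing is kept separate from the network call so that it can be tried against a saved response without hitting the API.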

Most of this data mining is found in the Wikidata project. Let's have a brief 10,000 ft tour of the resources that it offers. First, there are several clinically related data points that it tracks. This includes ATC codes, which are the WHO-maintained codes for medications. (It should be noted that recent versions of RxNorm can link ATC codes to NDC codes, which are maintained by the US FDA and are being newly exposed by the openFDA API project.)

I pulled all of the tweets I made from Wikimania about this into a Storify.

Other things you want to do in no particular order:

Once you have wikitext, it's pretty easy to mine it for PMIDs so that you can use the PubMed API. I used regular expressions to do this, which does occasionally miss some PMIDs. I think there is an API way to do this perfectly but I cannot remember what it is…
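A rough sketch of the regular expression approach is below. It is not the exact expression I used. It assumes PMIDs show up in wikitext either as a citation parameter (|pmid=12345678), as a bare "PMID 12345678", or as {{PMID|12345678}}, and it will miss anything written differently. It also assumes you fetch the wikitext through the standard MediaWiki action=raw URL on English Wikipedia; the function names are made up for illustration.

```python
import re
import urllib.parse


def build_raw_wikitext_url(title):
    """URL that returns an article's raw wikitext via index.php?action=raw."""
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    return "https://en.wikipedia.org/w/index.php?" + query


# Matches "pmid=123", "pmid = 123", "PMID 123", "PMID: 123" and "{{PMID|123}}".
# Assumption: PMIDs are at most 8 digits long.
PMID_RE = re.compile(r"\bpmid\s*[=:|]?\s*(\d{1,8})\b", re.IGNORECASE)


def extract_pmids(wikitext):
    """Return the PMIDs found in wikitext, de-duplicated, in order of first appearance."""
    seen = []
    for pmid in PMID_RE.findall(wikitext):
        if pmid not in seen:
            seen.append(pmid)
    return seen
```

Download the text at build_raw_wikitext_url("Dengue fever") with whatever HTTP client you like, pass it to extract_pmids, and then hand the resulting IDs to the PubMed API.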

That's a pretty good start. Let me know if you have any questions. I will likely expand on this article when I am not sleepy…

-FT