Hacking data: showing patterns in kids health

Here is my submission for the Local Children’s Data Health 2.0 developer challenge. The challenge was to make data available through kidsdata.org come alive.

Generally, the red circles correspond to the percentage of child allergy suffers who had -seen- a doctor, but had no specific plan to address their condition. The red tags, are healthcare providers from the NPI database that are listed as experts in kids allergies… the top of the field for asthma treatment. We are using these “super experts” as a proxy for the availability of specialist care for allergies generally. Notice the under-served areas… The specialist are clustering in the high-population areas. Hopefully this map will inspire an expert to move to Eureka, or Santa Maria..

Here was my process for this for my hack:

  • I would only use Open Source software or Open APIs. The idea here is to show just how powerful FOSS tools can be in health data analysis.
  • I have just created the best API to the National Provider Identifier database at docnpi.com, so I have this rich datasource that previously has not been available as an API.
  • I wanted to target something from kidsdata.org that was directly related to the availability of healthcare, something that you can measure geographically using the docnpi.com API.
  • I chose Asthma, because this is something that clearly responds to treatment.
  • I wanted to document my process to show how easy this kind of analysis is with the right tools.

Ok here’s what I did…

  1. First, I browsed kidsdata.org for asthma information. That leads you straight to this analysis of asthma hospitalizations for young children over the last few years.
  2. Then I started digging for source data. It looks like the California Health Interview Survey was a substantial source of the data.
  3. They offer Public Use Files of the original survey data. I signed in, and the terms of use for the data were reasonable, and not contrary to my purposes or Open Source. So I signed up and went to download the data.
  4. Sadly, the data was only available in three proprietary data formats, Strata, SPSS and SAS. This was obviously designed for academics that think using proprietary software is ethical and normal. Thankfully there are other options. The R project is where I usually turn first for stats help, but I actually found that there was an Open Source SPSS alternative called PSPP. Using PSPP I was able to open the SPSS data file. Victory for Open Source! It would be nice if organizations like CHIS would release in simple XML or CSV, which is much friendlier to hackers and people who believe in software freedom.
  5. My feeling of elation was short lived. The data had no geo-coded information. Which makes sense, that would make re-identification much easier. There had to be another way to get geo-coded data.
  6. And there was. AskCHIS is a powerful data reporting tool that allowed for xls data download. Again, I am amazed that CHIS would chose to run with a proprietary format without an open alternative. They used alot of advanced xls layout options that meant that an export to CSV would never work. An API would be even better, but at least CSV would allow me to actually parse a file instead of cutting and pasting which is what I ended up doing.
  7. But I had access to lots of data. I could see several different measures of asthma that I could have used in my mashup. This included lots of stuff like missed school days, emergency room visits, diagnosis of asthma, symptoms in the last twelve months… etc etc. If CHIS had given this data up using an API, I would have been able to merge the various asthma measures into an overall asthma status score… but it would have take a week of cutting and pasting to do that manually.
  8. So I had to choose one data point and run with it. I chose “Health professional ever provided asthma management plan“. This was asked to parents whose kids already had a doctor who was “treating” the asthma. I thought this was an interesting question because it seemed to correlate strongly with doctor-availability, something that I had good geo-coded data on.
  9. Now what provider data should I compare it to? Using docnpi.com I can easily grab a list of all/most of the doctors in California who specialize in treating allergies in children I decided to use that as a proxy for “available allergy specialists”. Of course, I had a serious advantage here, because I had already done the work of changing the NPI database into something I could access using an API (that is the idea behind docnpi.com). This easily saved me 30 hours of work on this project alone.
  10. So now I have the data I want… but what now? I had addresses for the doctors and clinics from the NPI database, but the asthma data was coded by county. No problem, I just needed to geocode the counties into longitude and latitude. If I had a rich data source from CHIS, it would have been worth writing a script to do this, but since I was using cut-and-paste data, with about 75 rows, it was much simpler to just manually geocode everything. Which is what I did. More cut-and-paste.
  11. But now I have geo-coded data for both data sources.
  12. I needed a method to graphically display geo-coded scoring. This is pretty easy to do using proprietary GIS tools, even costless tools like Google Earth. But I wanted to keep things simple and Open Source at the same time. Enter the EInsert extension to Google Maps API v2. This allowed me to overlay png circle graphics on a Google Map, and size them in accordance with their percentage (bigger is worse, it means more of the kids did not have asthma plans).
  13. Then something tickled my brain. Using circles to represent scaled data is problematic. There is solid research indicating that humans have trouble estimating the area of circles in relation to each other… So I used the ratio suggested by James Flannery to counter this effect. Now the circles are sized in a way that indicates their relative meanings in a somewhat more appropriate way.
  14. Now I had a Google Map that displayed data regarding the frequency of plans as meaningfully sized circles over the California state. This data shows some predictable effects. First, the worst areas are either very urban or very rural. Exactly the places that have trouble attracting medical talent. That means that on this map, Ureka and Los Angeles urban counties have similarly sized circles.
  15. Now all I needed to do was overlay the doctor data on this map. This turned out to be pretty simple. I already have a link to provide a Google Map display of any small search on docnpi.com. For instance, here is the link for the map for the search on allergists in California. All I needed to do was copy the html and javascript for the doctor map and integrate the map with the Asthma data map I had already made.
  16. So far, that maps looks pretty good. However, there is no easy way to tell which county, specifically, a given circle represents. I decided that the simplest way to address this was to dynamically rewrite the png using the gd library of php. I would pass the php script a label, and it would generate a circle with a label on it. This would allow me to label all of the circles on the map. As usual, stackoverflow provided a quick and dirty solution. (update 4-20) I realized that the label should show both the name of the county, and the percentage without a plan… now it does.

Take a look at the final result.

Notice that the shapes scale automatically as you zoom in. Try zooming in to Los Angeles or San Francisco to compare the compacted counties more closely. Also note that you can actually get the name of particular doctor that specializes in the treatment of asthma directly from the map. If you click the link you can get all of the contact information from docnpi.com

Which brings us to the point of this exercise.  A better view of the data can prompt change.

If you are a parent of a child with Asthma in one of the “big circles” you need to know that the long term treatment of Asthma requires a plan. If you do not have a plan, the reason might be that there are not enough doctors around you to provide the help you need. This map can put you in touch with the nearest expert.

If you are a doctor, who specializes in childhood allergy treatment, this is an opportunity map for you. Eureka is much smaller than LA or San Francisco, but you would have a near monopoly on a population that needs help with asthma. These people do not have the same access to specialized care and that might be a business opportunity for you. Moreover, a doctor who chose to focus on the urban areas in the larger cities might also be able to gain patients and profit. The data here shows that while there are lots of experts -around- the densely urban areas they are not meeting the demand for care. If a doctor could find a way to make money on a Medicare/Medicaid population in these urban areas, this might also be an opportunity.

Seeing the health data in a new way can provoke change. I hope you think my application is cool and sexy, but frankly I do not give a damn about that. I want to make a difference, not toys.

People remember Florence Nightengale as the mother of modern nursing. But she once made a diagram that changed the way people thought about war. It was that diagram that gave her much of the political clout she needed to create the field of professional nursing that we know today.

I have made the NPI data more liquid with docnpi.com. Organizations like CHIS need to a much better job of making their data accessible. If I had been able to access the data from AskCHIS in a normalized and open format using an API, I would have been able to make mapping system that would allow the overlay of -any- type of doctor with -any- health data measure that they survey.

So that leaves me with a call to action for three groups: Patients -> find better care near you. Doctors -> go where the patients need you. Researchers -> expose your data in open formats using APIs and open file formats.

Of course, I publish my source code under an Open Source license. Enjoy.

-FT

2 thoughts on “Hacking data: showing patterns in kids health

Comments are closed.