First, let me say that I applaud and salute anyone who releases open data about anything as relevant as healthcare data. It is a tough and thankless slog to properly build, format and document open data files. Really, if you work on this please know that I appreciate you. I value your time and your purpose in life.
But please get your shit together.
Please do not make your own data format standards. Please use a standard that does not require me to buy any proprietary expensive software to read. The best open standards have RFCs. Choose one of those.
And most of all. If a comma-delimited file will work for your data, just use a CSV. If you were thinking, “but what if I have commas in my data?”… well you are just wrong. CSV is an actual standard. It has ways to escape commas and most importantly, you do not need to think about that. All you need to do is use the CSV export functionality of whatever you are working with. It will automatically do the right thing for you.
You are not doing yourself any favors creating a fixed length file structure. Soon, you will find that you did not really account for how long last names are. Or you will find an address that is longer than 40 characters. Or the people at the FDA will add another digit to sort out NDC codes… or whatever. CSV files mean that you do not have to think about how many characters your data fields use. More importantly, it means that I do not need to think about it either.
You might be thinking “We should use JSON for this!” or “XML is an open standard”. Yes, thank you for choosing other good open formats… but for very large data sets, you probably just want to use a CSV file. The people at CMS thought JSON would be a good standard to use for the Qualified Health Plan data… and they did in fact design the standard so you could keep the JSON filed to a reasonable size. But the health insurance companies have no incentive to make their JSON files a reasonable size and so they have multiple gigabyte JSON files. That is hugely painful to download and it is a pain to parse.
Just use CSV.
I was recently working with the MAX Provider Characteristics files from Medicaid. Here are the issues I had.
- They have one zip file from 2009 which empties into a directory with the same name as the zip file. That means that the zip file will not open, because it is trying to write to a directory with the same name as the original file. I have to admit, I am amazed that this mistake is even possible.
- in 2009, the zip files made subdirectories. In 2010 and 2011 they dumped to the current directory tar-bomb style. (either way is fine, pick one)
- sometimes the file names of the ‘txt’ files are ALL CAPS and sometimes not, even in the same years data.
- Sometimes the state codes are upper case like ‘WI’ and ‘WV’, sometimes they are camel case ‘Wy’ and ‘Wa’, sometimes they are lowercase ‘ak’ and ‘al’. Of course, we also have ‘aZ’.
- Usually the structure is StateCode.year.maxpc.txt .. like GA.2010.maxpc.txt. Except for that one time when they wrote it FL.Y2010.MAXPC.TXT
- the actual data in the files is fixed length format. Each year, you have to confirm that all of the field lengths are the same in order to ensure that your parser will continue to work.
- They included instructions for importing the data files in SAS, the single most expense data processing tool available. Which is, of course, what they were using to export the data.
- They did not include instructions for any of the most popular programming languages. SAS does not even make the top 20 list.
- There are multiple zip files, each with multiple files inside. We can afford a download that is over 100 MB in size. Just make, one. single. csv file. please.
- Sometimes the files end in .txt Other times they just end in a ‘.’ (period).
- The files are not just text files, they have some cruft at the beginning that ensures that they are interpreted as binary files.
Now how does that make me feel as someone trying to make use of these files? Pretty much like you might expect.
I love open data in healthcare. But please, please, start using easy to use and simple data standards. Get your shit together. I spend too much time hacking on ETL, I need to focus on things that change the world. And guess what… you need me to focus on those things too.
So if you are reading this, and you might very well be because I specifically referred you to this rant. Please do the right thing.
Soon, this advise will likely be formally compatible with the Open Data policies of the Federal Government.
- Use an open standard for your data
- Use CSV if you can
- Are you ABSOLUTELY SURE that you cannot use CSV?
- Use JSON if you cannot use CSV
- Use XML if you cannot use CSV or JSON
- Are you looking to compress and manage massive amounts of data moving it around at a furious rate, in a almost-binary compressed format? Perhaps try Protocol Buffers.
- Find that Protocol Buffers page confusing? Its because you should be using CSV. So just use CSV.
- Make your data and file naming consistent, so that a machine can process it.
This way, we can have all of the wonderful tools for csv data processing available to us. Joy!
Updated Mar 22 2017 (added protocol buffers and links)