It is the policy of many governments to support transparency by releasing Open Data. But few understand how important it is that this Open Data be released in machine-readable, openly available formats. I have already written a lengthy blog post about how, most of the time, CSV is the right data standard to use for releasing large open data sets. But really, JSON, XML, HTML, RTF, TXT, TSV and PDF files, which are all open standard file formats, each have their place as appropriate data standards for governments to use as they release Open Data.
But it can be difficult to explain to someone inside a government or non-profit who is already releasing Open Data that CSV is a good standard but XLSX (Microsoft Excel) is not. For many people, a CSV really is an Excel file, so there is no difference in their direct experience. But for those of us who want to parse, ETL or integrate that data automatically, there is a world of difference in the level of effort required between a clean CSV and a messy XLSX file (not to mention the cybersecurity implications).
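To make that difference in effort concrete, here is a minimal sketch of what "parse automatically" looks like for a clean CSV, using nothing but Python's standard library. The sample rows and column names are made up for illustration; they do not come from any real release.

```python
import csv
import io

# Made-up sample standing in for a published open data file (illustrative only).
SAMPLE_CSV = "agency,year,budget\nA,2015,100\nB,2015,50\n"


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)

# Every value arrives as a string, so numeric columns are converted explicitly.
total = sum(int(row["budget"]) for row in rows)
print(total)
```

An XLSX file, by contrast, is a zip archive of XML parts, so reading it generally means pulling in a third-party library and then coping with merged cells, multiple sheets and formatting before you ever get to the rows.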
A few months ago (sorry, I get distracted), Project Open Data, a policy website maintained and governed jointly by the Office of Management and Budget and the Office of Science and Technology Policy of the US Federal Government, updated its website to include the W3C and IETF as sources of Open Data Format Standards by accepting a pull request that I made. As I had expected, leaving the IETF and W3C off the list of sources of Open Standards was an omission and not a conspiracy (sometimes I panic).
This is a very important resource for those of us who advocate for Open Data. It means that we can use a single URL, specifically this one:
https://project-open-data.cio.gov/open-standards/
to indicate that it is the policy of the United States Federal Government not only to release Open Data, but to do so using specific standards that are also open. Now that the W3C and IETF have been added, the following data standards are, by proxy, included in the new policy regarding open data standards:
- IETF -> JSON https://www.ietf.org/rfc/rfc4627.txt
- IETF -> CSV https://www.ietf.org/rfc/rfc4180.txt
- W3C -> XML https://www.w3.org/XML/
- W3C -> HTML https://www.w3.org/html/
Obviously, these four standards make up almost all of the machine-readable Open Data that is already easy to work with and, with a few exceptions, represent the data formats that 95% (my guesstimate) of all government data should be released in. In short, while there are certainly other good standards, and even cases where we must tolerate proprietary standards for data, most of the data that we need to release should be released in one of these four data formats.
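Part of what makes these four formats easy to work with is that a stock Python install can read all of them without any extra packages. The sketch below is only an illustration under assumptions of my own: the two records, the field names and the XML/HTML layouts are invented, and real releases will be messier.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The same two made-up records, serialized four ways (illustrative only).
AS_JSON = '[{"city": "A", "count": 1}, {"city": "B", "count": 2}]'
AS_CSV = "city,count\nA,1\nB,2\n"
AS_XML = "<rows><row city='A' count='1'/><row city='B' count='2'/></rows>"
AS_HTML = (
    "<table><tr><th>city</th><th>count</th></tr>"
    "<tr><td>A</td><td>1</td></tr><tr><td>B</td><td>2</td></tr></table>"
)


def from_json(text):
    return [(r["city"], int(r["count"])) for r in json.loads(text)]


def from_csv(text):
    reader = csv.DictReader(io.StringIO(text))
    return [(r["city"], int(r["count"])) for r in reader]


def from_xml(text):
    root = ET.fromstring(text)
    return [(e.get("city"), int(e.get("count"))) for e in root.iter("row")]


class TableCells(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data)


def from_html(text):
    parser = TableCells()
    parser.feed(text)
    parser.close()
    # Skip the header row, then convert the count column to an integer.
    return [(city, int(count)) for city, count in parser.rows[1:]]


results = [from_json(AS_JSON), from_csv(AS_CSV), from_xml(AS_XML), from_html(AS_HTML)]
print(all(r == results[0] for r in results))
```

The point is not that these parsers are production-ready, only that each format has an openly documented structure that general-purpose tooling already understands.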
For those of us who advocate for reasonableness in Open Data releases, this is a pretty big deal. We can now simply include a few links to publicly available policy documents rather than arguing independently for the underlying principles.
And because the entire Project Open Data website is so clear, concise and well-written, and because it comes with the implicit endorsement of the US Federal Government (OMB and OSTP), this is a wonderful new resource for advocating with National, State, City, Local and International governments for the release of Open Data using reasonable data formats. Hell, we might even be able to get some of the NGOs to consider releasing data correctly because of this. My hope is that this will make complaining about proprietary format data releases easier, and therefore more frequent, and help us to educate data releasers on how to make their data more useful. That, in turn, will make it easier for data scientists, data journalists, academics and other data wonks to create impact using the data.
My applause to the maintainers and contributors to Project Open Data.
-FT