It's not the case that government IT is invariably bad: the Federal Reserve Bank of St. Louis has an amazing interface (FRED) and API for working with its data. Unfortunately, not all government statistics are available there, especially some of the more interesting BLS series.
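For a sense of what that looks like in practice, a single GET against FRED's observations endpoint returns clean JSON. The series and parameters below are just illustrative, and you'd need your own API key:

```python
import requests

# Illustrative FRED API call: fetch observations for one series as JSON.
# UNRATE (the unemployment rate) is only an example; swap in your own key.
resp = requests.get(
    "https://api.stlouisfed.org/fred/series/observations",
    params={
        "series_id": "UNRATE",
        "api_key": "YOUR_API_KEY",
        "file_type": "json",
    },
)
for obs in resp.json()["observations"][:3]:
    print(obs["date"], obs["value"])
```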
The essential problem with BLS is that all of its work products (reports, tables, etc.) are designed to be printed out, not accessed electronically. Many BLS tables are embedded in PDFs, which makes the data they contain essentially impossible to extract. Non-PDF, text-based tables are better, but they are still difficult to parse programmatically: structure is conveyed by tabs and whitespace, column headings are split over multiple lines with no separators, heading lengths vary, and so on.
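As a made-up example of what that means for a parser, a typical text-table row mixes a multi-word label, dot leaders, and whitespace-separated numbers, so a naive split() mangles it. One workaround is to peel the numeric fields off first and treat what's left as the label:

```python
import re

# Made-up BLS-style row: the label contains spaces, columns are separated
# only by runs of whitespace, and dot leaders pad the label out.
line = "    Durable goods ............    8,975.3   8,989.1   9,002.4"

# Grab the numeric fields, then take everything before the dot leaders
# as the row label.
values = re.findall(r"-?[\d,]+\.\d+", line)
label = re.split(r"\.{2,}", line)[0].strip()

print(label)   # Durable goods
print(values)  # ['8,975.3', '8,989.1', '9,002.4']
```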
Why does it matter? For one, when users can access data electronically, via an API, they can combine it with other sources, look for patterns, test hypotheses, find bugs and measurement errors, create visualizations, and do all sorts of other things that make the data more useful.
BLS does offer a GUI tool for downloading data, but it's kludgy: it requires a Java applet, series have to be hand-selected, and the result is an Excel(!) spreadsheet with extraneous headers and formatting. Furthermore, it's not clear which series, and which transformations of them, are needed to reproduce the more refined, aggregated tables.
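For what it's worth, cleaning up one of those GUI downloads usually amounts to guessing how many banner rows to skip before the data starts. Something like the sketch below, where the filename and the number of junk rows are assumptions rather than fixed BLS behavior:

```python
import pandas as pd

# Hypothetical cleanup of a BLS GUI export: skip the report-title and
# metadata rows at the top and write plain CSV. The filename and the
# skiprows count vary by download; both are assumptions here.
df = pd.read_excel("SeriesReport.xlsx", skiprows=11)
df.columns = [str(c).strip() for c in df.columns]  # headers carry stray spaces
df.to_csv("series.csv", index=False)
```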
To actually create this figure, I needed to get the data into R by way of a CSV file. The code required to turn the table into a useful CSV file, while not rocket science, isn't trivial: there are lots of one-off, hacky things needed to work around the limitations of the table. Capturing the nested structure of the industries (e.g., "Durable Goods" is a subset of "Manufacturing", and "Durable Goods" itself has four sub-classifications) required recursion (see the "bread_crumb" function). FWIW, here's the code:
Most of the code deals with the problems shown in this sketch:
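To give a flavor of the breadcrumb piece specifically, here is an illustrative sketch rather than the original script. It assumes the hierarchy is conveyed purely by how far each row is indented; only the bread_crumb name comes from the post:

```python
# Each row is (indent, industry); deeper indentation means
# "sub-classification of the nearest less-indented row above it".
rows = [
    (0, "Manufacturing"),
    (2, "Durable goods"),
    (4, "Wood products"),
    (2, "Nondurable goods"),
]

def bread_crumb(i, rows):
    """Recursively prepend the parent's breadcrumb to this row's name."""
    indent, name = rows[i]
    for parent in range(i - 1, -1, -1):
        if rows[parent][0] < indent:
            return bread_crumb(parent, rows) + [name]
    return [name]

for i in range(len(rows)):
    print(" > ".join(bread_crumb(i, rows)))
# Manufacturing
# Manufacturing > Durable goods
# Manufacturing > Durable goods > Wood products
# Manufacturing > Nondurable goods
```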
My suggestion: BLS should borrow someone from FRED to help them build a proper API.
Maybe they will migrate their data to http://www.data.gov/ soon.
I did exactly this type of thing in a (very) old PHP system I wrote for dealing with census and home-building data. I remember doing a manual POST request against a URL I found by viewing the source of a form in an HTML page, stripping out the text data wrapped in HTML header and footer sections in the response, then parsing the text data into structures I could actually use.
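The same idea, sketched in Python with a placeholder URL and form fields (not a real endpoint):

```python
import re
import requests

# POST the fields the HTML form would submit, then cut the text table
# out of the surrounding page markup. URL and field names are placeholders.
resp = requests.post(
    "https://example.gov/cgi-bin/table",
    data={"geo": "12345", "table": "tbl01"},
)
match = re.search(r"<pre>(.*?)</pre>", resp.text, re.DOTALL)
table_text = match.group(1) if match else ""
rows = [line.split() for line in table_text.splitlines() if line.strip()]
```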
Here's a thought: if there are enough of us writing our own parsers for this stuff, would it be worth it to create our own API that others could use? Obviously it would be better if the government did it for us, but they seem uninterested in being helpful in this regard.