Non-profit uses big data to track big government

The Sunlight Foundation, a non-profit aimed at showing how corporate interests influence government, released a pretty sweet tool for citizens and big data nerds on Monday. The tool, called Capitol Words, monitors which legislators said certain phrases, and how often, in an effort to track how those phrases enter and influence the political debate. Capitol Words is one of those random tools that give us a glimpse of how cheap computing and better data analytics can change business as usual in politics.

Already, the combo of cheap computing and big data is changing how retail firms set prices, offering insights into healthcare, and helping investors maximize rental income, so the notion that it could lead to more government transparency isn’t all that crazy. In the case of Capitol Words and the Sunlight Foundation, the goal is to parse the text generated daily by the Congressional Record and analyze what legislators say on the floor of the House and Senate, tracking how an idea can filter through a political party, a region or a debate.

Tom Lee, the Director of Sunlight Labs at the Sunlight Foundation, said in an interview that the amount of data isn’t huge (about 50 or 60 gigabytes a day), but the text does need to be parsed before it can be made into something useful. So the Sunlight Foundation has developed algorithms and techniques for working with the data, many of which it releases on GitHub. It does the calculating and analysis on Amazon’s Elastic MapReduce service and then uses Solr, an open-source search platform, to process people’s queries against the records. The database supporting the tool has upwards of 20 million records.
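To make the parse-then-count idea concrete, here is a minimal sketch in Python of a map/reduce-style phrase count per legislator. It is not the Sunlight Foundation’s code: the sample records, the field names and the helper functions (`tokenize`, `ngrams`, `map_speech`, `reduce_counts`) are all hypothetical, and the real Congressional Record needs far more careful parsing than a single regular expression.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical sample records; the real Congressional Record format differs.
SPEECHES = [
    {"speaker": "Rep. A", "date": "2011-07-01",
     "text": "We must cut the death tax. The death tax hurts families."},
    {"speaker": "Sen. B", "date": "2011-07-01",
     "text": "The estate tax is fair."},
    {"speaker": "Rep. A", "date": "2011-07-02",
     "text": "Repeal the death tax now."},
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def map_speech(speech, n=2):
    """Map step: emit ((speaker, phrase), 1) for each phrase in one speech."""
    for phrase in ngrams(tokenize(speech["text"]), n):
        yield (speech["speaker"], phrase), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts for each (speaker, phrase) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

counts = reduce_counts(chain.from_iterable(map_speech(s) for s in SPEECHES))
print(counts[("Rep. A", "death tax")])   # 3
print(counts[("Sen. B", "estate tax")])  # 1
```

On a service like Elastic MapReduce the map and reduce steps would run across many machines; here they run in a single process so that the logic is easy to follow.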

“The speech used by legislators is used to advance causes and manipulate the public,” Lee said. “And how their speech is similar or different can show how particular terms originate from some political messaging memo.”

Lee said the original version of the project in 2008 ran the search and the data parsing in parallel, but that approach was too compute-intensive and didn’t allow for the richness of results the project can offer today by splitting the two steps up. However, he didn’t rule out coming back to running the job in parallel eventually as the data stores become larger and queries become more complex.
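One way to picture the split Lee describes is an offline step that builds an index once, followed by a query step that only looks things up. The Python sketch below assumes that reading. The pre-parsed triples and the `build_index` and `timeline` helpers are hypothetical stand-ins, and a plain dictionary plays the role that Solr plays in the real system.

```python
from collections import defaultdict

# Hypothetical output of the parsing step: (date, speaker, phrase) triples.
PARSED = [
    ("2011-07-01", "Rep. A", "death tax"),
    ("2011-07-01", "Rep. A", "death tax"),
    ("2011-07-01", "Sen. B", "estate tax"),
    ("2011-07-02", "Rep. A", "death tax"),
]

def build_index(records):
    """Offline step: build a phrase -> date -> count table once."""
    index = defaultdict(lambda: defaultdict(int))
    for date, _speaker, phrase in records:
        index[phrase][date] += 1
    return index

def timeline(index, phrase):
    """Query step: read precomputed counts; no text is parsed here."""
    return sorted(index.get(phrase, {}).items())

INDEX = build_index(PARSED)
print(timeline(INDEX, "death tax"))  # [('2011-07-01', 2), ('2011-07-02', 1)]
```

Because the expensive parsing happens ahead of time, each query is a cheap lookup, which is one plausible reason the split leaves room for richer results.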

The Sunlight Foundation makes its findings and data available via a JSON API so others can build on them. It’s also hoping to expand beyond floor speeches to politicians’ appearances on talk shows and at other venues. It hopes to create other services by tying these political sound bites to its repository of funding data, which tracks which lobbying groups and individuals politicians accept money from. It has over a terabyte and a half of data on hand to work from.

And for those eagerly watching how our government attempts to become more transparent and share data, the Foundation is also working with the Government Printing Office, which publishes the Congressional Record, to get the document into a more web-friendly, structured format. That would make the Capitol Words project more useful and help others build their own data analytics based on what’s said in Congress. Right now, much of the esoteric (and somewhat stilted) debate reaches the general public mostly when The Daily Show mocks it. While Jon Stewart may be funny, he doesn’t offer the ability to track ideas over time or in any broad fashion.

“For us, this is trying to expand the way Sunlight tracks influence,” Lee said. “We track the way the money flows around Washington and it’s not enough. The ways the system is affected are too subtle and deliberate, so we’re making an investment in tracking not just the flow of money but also the flow of ideas.” And when you’re trying to track something as nebulous as ideas, analyzing a lot of data using cheap compute is perhaps the only way for a non-profit to do it.