The New York Times open-sources its Hive crowdsourcing platform

A couple of months ago, the New York Times rolled out an interesting project called Madison, in which the newspaper asked readers to help identify old print ads by going to a website and answering questions, and in some cases even transcribing the actual text in the ads. Now, the company is open-sourcing the platform it built for that project, known as Hive, so that others can use it for their own experiments in crowdsourcing.

Matt Boggie, executive director of the Times’ research lab, said in an interview that not long after the paper started its Madison project, it became obvious to the R&D team that plenty of other publishers would be interested in a tool that would allow them to build and manage similar experiments. Hive was therefore developed from the beginning with an eye towards releasing it to the world.

“There’s a community of journalists out there doing work in this area of data journalism — we’ve been inspired by the work that Knight Labs are doing, and that Brian Boyer and the team at NPR are doing. We saw that this project could be immediately useful for a lot of people, and we wanted to contribute back to the community.”

On Tuesday morning, the Times flipped the switch and made its software repository on GitHub, a platform that allows for collaboration on code development, public instead of private, so that anyone can take the source code for Hive and adapt it to their own needs. All they will have to do, Boggie said, is write their own front-end and user interface for the project, and tell Hive what content they want users to interact with.

Equipped for any task

The platform is flexible enough that it can manage almost any type of collaborative project, Boggie said, whether it’s getting users to transcribe audio or video, edit text, manipulate images, or all of the above. The software manages the different assets and automates the workflow, determining the order in which tasks are done and then monitoring whether they are completed. It is written in Google’s Go language and uses JSON, Elasticsearch and a RESTful API.
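
To make the idea of an automated workflow a little more concrete, here is a minimal sketch in Go of how a tool might decide which task comes next for a given asset and keep track of what has been completed. This is not Hive's actual code: the Asset and Workflow types, the NextTask and Complete methods, and the task names are all hypothetical, chosen only to illustrate the ordering-and-monitoring behavior described above.

```go
package main

import "fmt"

// Asset is a hypothetical unit of content, such as one scanned ad.
// Done maps a task name to whether that task has been completed.
type Asset struct {
	ID   string
	Done map[string]bool
}

// Workflow is a hypothetical ordered list of task names.
type Workflow struct {
	Order []string
}

// NextTask returns the first task in the workflow's order that the
// asset has not completed yet. It returns false when nothing is left.
func (w Workflow) NextTask(a Asset) (string, bool) {
	for _, name := range w.Order {
		if !a.Done[name] {
			return name, true
		}
	}
	return "", false
}

// Complete records that the given task was finished for the asset.
func (a *Asset) Complete(task string) {
	if a.Done == nil {
		a.Done = make(map[string]bool)
	}
	a.Done[task] = true
}

func main() {
	// Task names here are illustrative assumptions, not Hive's own.
	w := Workflow{Order: []string{"identify", "transcribe"}}
	ad := Asset{ID: "ad-1"}

	// Walk the asset through the workflow one task at a time.
	for {
		task, ok := w.NextTask(ad)
		if !ok {
			break
		}
		fmt.Println("next task:", task)
		ad.Complete(task)
	}
}
```

In this sketch the order lives in the workflow rather than in the asset, so the same list of tasks could be reused for every ad in a project.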


Theoretically at least, this kind of platform would allow any media outlet or website to set up a crowdsourcing project similar to the ones that The Guardian and ProPublica have had success with in the past: the British paper got more than 20,000 people to help it catalog and analyze expense reports filed by government representatives in order to detect fraud, and ProPublica got users to help it track election spending on campaign ads. The NYT research lab said it designed Hive to be as flexible as possible:

“Hive has an intentionally flexible definition of a user: you may require a signup and login process, or simply allow anonymous contributions to lower the barrier of entry. Hive keeps track of each user’s number of contributions by task, both as a total, and further broken down by how many were skipped, completed or verified.”
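
The per-user bookkeeping described in that quote could be modeled roughly as follows. This is a sketch under assumptions, not the Times' implementation: the User, Counts and Outcome types and the Record method are hypothetical names, and it assumes that the total for a task includes skipped contributions as well as completed and verified ones.

```go
package main

import "fmt"

// Outcome is a hypothetical label for what happened to a contribution.
type Outcome int

const (
	Skipped Outcome = iota
	Completed
	Verified
)

// Counts holds one user's tallies for a single task.
type Counts struct {
	Total     int
	Skipped   int
	Completed int
	Verified  int
}

// User is a hypothetical contributor. An empty ID stands in for an
// anonymous user who never signed up.
type User struct {
	ID     string
	ByTask map[string]*Counts
}

// Record adds one contribution to the user's tallies for the task.
// Assumption: every outcome, including a skip, counts toward Total.
func (u *User) Record(task string, o Outcome) {
	if u.ByTask == nil {
		u.ByTask = make(map[string]*Counts)
	}
	c, ok := u.ByTask[task]
	if !ok {
		c = &Counts{}
		u.ByTask[task] = c
	}
	c.Total++
	switch o {
	case Skipped:
		c.Skipped++
	case Completed:
		c.Completed++
	case Verified:
		c.Verified++
	}
}

func main() {
	anon := &User{} // anonymous contributor, no login required
	anon.Record("transcribe", Completed)
	anon.Record("transcribe", Skipped)
	anon.Record("transcribe", Verified)

	c := anon.ByTask["transcribe"]
	fmt.Println(c.Total, c.Skipped, c.Completed, c.Verified)
}
```

Keeping the tallies keyed by task name means a project with several kinds of task can report each one separately without changing the user record.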

Boggie said the Times is continuing to put resources into Madison — which has so far only released ads going back to the 1960s — and that the paper was pleasantly surprised by the amount of reader interaction it has gotten from the project so far. “We’ve had upwards of 14,000 people completing assignments and over 100,000 assignments completed,” he said. “We didn’t expect that many people would spend the time to transcribe ads, but a surprising number have.”

A platform like Hive may seem like a fairly small thing amid the turmoil and upheaval that is disrupting the media business, but if it helps more media outlets engage with their readers in some way then I think it’s definitely worthwhile — and it’s also nice to see the New York Times, which can be fairly insular, giving back to the journalism community at large.