It’s one of the great questions of our times: if you have loads of interesting data that can be used for the benefit of all, how does that square up against people’s desire, or indeed right, to maintain control over their own information? I suspect there are many such debates on the horizon, but for now you’d be hard-pressed to find a more perfect case study than that of England’s National Health Service (NHS) and the data it holds on its patients.
This is a highly politicized mess, as the British government is gradually privatizing the NHS, but the key facts stand above that particular fray.
Let’s see what happens in the cloud…
Over the last 15 years or so, the NHS has become increasingly computerized, and it is now at the point where data about hospital and general practitioner visits will soon all be accessible from the same place, namely an archive for so-called Hospital Episode Statistics (HES) that used to be only for inpatient data.
With this centralization comes the enhanced ability to find patterns, track trends and generally improve the NHS’s ability to do what it does. The English NHS’s central data organization, the Health and Social Care Information Centre (HSCIC), sold the HES dataset to consulting firm PA Consulting, to see what magic could be squeezed out of people’s medical records.
A couple years back, PA produced a report in which it talked about how it experimented with handling all this data. Fast forward to yesterday, and someone pointed out certain details of this report to doctor and member of parliament Sarah Wollaston on Twitter. She wasn’t happy, and it became a big story.
The section of the report that made Wollaston and plenty of privacy campaigners so upset was effectively a pitch for cloud technology. In it, PA wrote that it had tried uploading these reams of data to a traditional Microsoft SQL database on a local server. But the upload took ages and interrogating the data was slow. Faced with having to plow tons of capital investment into dedicated hardware and analytics software, PA took a different route:
“The alternative was to upload it to the cloud using tools such as Google Storage and use BigQuery to extract data from it. As PA has an existing relationship with Google, we pursued this route (with appropriate approval). This showed that it is possible to get even sensitive data in the cloud and apply proper safeguards.
“We found that queries that took all night on our servers were returned in under 30 seconds using BigQuery. This was the performance on the raw uploads with no optimisation. This stunning improvement in speed applied even to more sophisticated analysis. Within two weeks of starting to use the Google tools we were able to produce interactive maps directly from HES queries in seconds. In the old days it would have taken more than a month to produce just one clever map.”
Yay for BigQuery, right?
You put the data where?
“There is no way for the public to tell that this data has left the HSCIC,” campaigner Phil Booth of medConfidential spluttered to the Guardian. “The government and NHS England must now come completely clean. Anything less than full disclosure would be a complete betrayal of trust.”
That would perhaps come across as a bit of an overreaction if it weren’t for the fact that this was not an isolated incident. Indeed, the whole “care.data” program for sharing data between doctors’ surgeries and the HSCIC has recently been paused, following anger over the way in which it was rolled out. NHS England said it had sent out flyers to the whole population, explaining to people how they could opt out if they didn’t want to have their most intimate medical data channelled into a unified national database.
Most people said they hadn’t received this notice. Then, in late February, the HSCIC admitted that its predecessor organization had messed up by selling hospital data to the insurance industry. This is not legal. Oh, and on Monday there was also a big scare about a data mapping firm called Earthware allegedly putting HSCIC data online in an identifiable way — though Earthware claimed its health map was based on “mock data,” it was nonetheless swiftly taken offline and the HSCIC has launched an investigation.
In that context, it’s not hard to see why so many people are upset about the PA episode, particularly as we’re talking about Google, a company that doesn’t exactly inspire confidence when it comes to data-sharing, even if you discount what we learned from Edward Snowden about spy agencies tapping the web giant’s systems. Hence the reaction on Twitter from doctor and author Ben Goldacre.
Is it legal to throw this stuff into Google’s cloud? That’s a particularly important question if the data found its way to U.S. servers. We don’t know for sure whether it did (I did ask PA, to no avail), but thanks to the Safe Harbor agreement it probably would be legal. Snowden may have shown us that Safe Harbor is practically meaningless, but the agreement stands nonetheless and Google has legitimately self-certified its adherence to European data protection standards.
In a statement, PA insisted: “The dataset does not contain information that can be linked to specific individuals and is held securely in the cloud in accordance with conditions specified and approved by HSCIC. Access to the dataset is tightly controlled and restricted to the small PA project team.”
HSCIC said: “PA Consulting used a product called Google BigQuery to manipulate the datasets provided and the NHS IC [Information Centre] was aware of this. The NHS IC had written confirmation from PA Consulting prior to the agreement being signed that no Google staff would be able to access the data; access continued to be restricted to the individuals named in the data sharing agreement.”
Severe trust failure
Goldacre summed things up quite nicely last week when he wrote about the care.data suspension and the sale of patient records to the insurance industry. He said that, though he previously backed the scheme, the implementation was chaotic, hampered by “vague promises and an imaginary regulatory framework”:
“To summarise, a government body handed over parts of my medical records to people I’ve never met, outside the NHS and medical research community, but it is refusing to tell me what it handed over, or who it gave it to, and the minister is now incorrectly claiming that it never happened anyway.
“There are people in my profession who think they can ignore this problem. Some are murmuring that this mess is like MMR, a public misunderstanding to be corrected with better PR. They are wrong: it’s like nuclear power. Medical data, rarefied and condensed, presents huge power to do good, but it also presents huge risks. When leaked, it cannot be unleaked; when lost, public trust will take decades to regain.”
If you’re going to harness the power of the crowd’s collective data, you had better do a good job of explaining to the crowd why this is a good thing. You need to sell it to them, backing up your pitch with explanations of how giving up their data will benefit them. And, especially if you have a notoriously shoddy data protection record, you have to give well-founded assurances about data security.
Unfortunately, those security assurances are hard to give with a straight face, because the main protective measure can be fairly easily unraveled. HSCIC insists that the data it sells is pseudonymized – that is, identifying personal information has been scrubbed out – but those records have to retain a good deal of personal information if they are to be useful. So we’re looking at a situation where, with a sufficient amount of data at hand, there’s a strong risk that buyers could correlate their way back to identifying who’s who.
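To make that risk concrete, here is a minimal Python sketch of how such a correlation could work in principle. Everything in it is an assumption for illustration: the records are invented, the field names (`postcode_district`, `birth_year`, `sex`) are hypothetical stand-ins for the kind of detail a useful health record might retain, and nothing here reflects the actual structure of the HES dataset or anything PA did with it.

```python
from collections import Counter

# Entirely synthetic, hypothetical records. Names have been replaced by an
# opaque ID, but some personal detail remains so the data stays useful.
pseudonymized = [
    {"pid": "a91f", "postcode_district": "SW1", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"pid": "b22c", "postcode_district": "SW1", "birth_year": 1975, "sex": "F", "diagnosis": "diabetes"},
    {"pid": "c37d", "postcode_district": "SW1", "birth_year": 1982, "sex": "M", "diagnosis": "fracture"},
    {"pid": "d48e", "postcode_district": "LS6", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
    {"pid": "e59a", "postcode_district": "LS6", "birth_year": 1990, "sex": "F", "diagnosis": "asthma"},
    {"pid": "f60b", "postcode_district": "M14", "birth_year": 1948, "sex": "M", "diagnosis": "heart disease"},
]

# Hypothetical outside knowledge a data buyer might already hold,
# for example from a customer list. Also invented.
outside_knowledge = [
    {"name": "Person A", "postcode_district": "SW1", "birth_year": 1982, "sex": "M"},
    {"name": "Person B", "postcode_district": "SW1", "birth_year": 1975, "sex": "F"},
]

# Fields that are not names, but can narrow down who a record belongs to.
QUASI_IDENTIFIERS = ("postcode_district", "birth_year", "sex")


def quasi_key(record):
    """Return the combination of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def unique_records(records):
    """Return records whose quasi-identifier combination appears exactly once."""
    counts = Counter(quasi_key(r) for r in records)
    return [r for r in records if counts[quasi_key(r)] == 1]


def link(records, known_people):
    """Match named people to diagnoses, keeping only unambiguous matches."""
    singles = {quasi_key(r): r for r in unique_records(records)}
    return {
        person["name"]: singles[quasi_key(person)]["diagnosis"]
        for person in known_people
        if quasi_key(person) in singles
    }


print([r["pid"] for r in unique_records(pseudonymized)])
print(link(pseudonymized, outside_knowledge))
```

In this toy example, two of the six records are unique on those three fields alone, and one of the two named people can be tied to a diagnosis. The other is protected only because somebody else happens to share the same district, birth year and sex. Real datasets carry far more fields per record, which tends to make unique combinations more common, not less.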
Add that potential risk to a bungled, “we know best” style of implementation, and everything falls apart. Many people will — with a fair amount of reason — opt out of care.data, and England’s population as a whole will lose out as a result. Those clever maps won’t be so accurate, and important clinical patterns will be harder to spot.
When it comes to getting the best out of big data — particularly but not only in the public sector — the aims of the exercise should be clear and well-communicated, and security must be more than a vague promise. Ultimately, if you don’t earn trust, you’re not going to get where you want to go.