As everyone’s lives move increasingly online, it’s fair to say that the public have never been more concerned about protecting their privacy, and about what organisations and governments are doing with all that personal information.
If you’re publishing any kind of open data, then protecting individual identities is vitally important. Not only do you have legal obligations under Australia’s privacy legislation, but you also need to satisfy the court of public opinion. If people aren’t confident that you are taking every precaution with their data, they’ll rapidly disengage from the process of collecting it, resulting in datasets that don’t accurately reflect the population and are therefore much less useful.
While removing Personally Identifiable Information (PII) such as names, email addresses, or phone numbers provides an absolute minimum level of protection, this alone is not enough.
As multiple studies have shown, it can often take remarkably few datapoints to reidentify an individual in an “anonymised” dataset. If you know a few specific facts about someone in the data, you can often find them with surprising ease.
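To make that concrete, here is a minimal Python sketch of how a data custodian might check how many records become unique once a few seemingly harmless attributes are combined. The records, the field names and the `unique_fraction` helper are all invented for illustration and do not come from any of the studies mentioned.

```python
from collections import Counter

# Hypothetical "anonymised" records: names removed, but postcode,
# birth year and sex retained. All values are invented for illustration.
records = [
    {"postcode": "3000", "birth_year": 1980, "sex": "F"},
    {"postcode": "3000", "birth_year": 1980, "sex": "F"},
    {"postcode": "3000", "birth_year": 1975, "sex": "M"},
    {"postcode": "3051", "birth_year": 1980, "sex": "F"},
    {"postcode": "3051", "birth_year": 1962, "sex": "M"},
]


def unique_fraction(rows, fields):
    """Return the share of rows whose combination of `fields` appears only once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)


# One attribute on its own singles out nobody in this toy dataset...
one_field = unique_fraction(records, ["sex"])
# ...but three attributes together single out most of the rows.
three_fields = unique_fraction(records, ["postcode", "birth_year", "sex"])

print(f"Unique on sex alone: {one_field:.0%}")
print(f"Unique on postcode + birth year + sex: {three_fields:.0%}")
```

Anyone who already knows those three facts about a person in a unique row can pick out that person's record, even though no name appears anywhere in the data.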
In one famous example in the 1990s, the US state of Massachusetts released anonymised medical records of all state government employees to the research community. A local graduate student combined the released data with information from the electoral roll and some news reports and successfully extracted the State Governor’s entire medical history, which, in a somewhat theatrical flourish, she promptly FedExed to his office to prove the point.
Closer to home, last year University of Melbourne researchers successfully demonstrated how 2 billion “anonymised” travel records from Melbourne’s public transport system that were released for a data hack event were anything but. The researchers could easily find their own records in the data, and used this to extract the entire independent travel history of anyone who had travelled with them. They also used information from a Victorian MP’s Twitter account to (with his permission) identify his detailed travel history over the three-year period covered by the dataset.
So, what does this mean for open data? One approach is to release only aggregated, high-level data. While this can largely sidestep the privacy problem, very high-level data just isn’t as useful. It only answers the specific questions sanctioned by the data custodians, and doesn’t allow the community to use that data in new and interesting ways.
Software company WingArc Australia’s data dissemination platform, SuperWEB2, takes a different approach. It allows users to run any query they like against a dataset, but presents only the aggregated results. While the query runs against the underlying unit records, those records themselves are never exposed to the end user. It’s a “best of both worlds” self-service approach that allows users to ask any question of the data, while still applying tight controls to what gets released. WingArc’s confidentiality algorithm, perturbation, automatically applies subtle but consistent adjustments to the values returned that mitigate the risk of identifying an individual but do not introduce bias or affect the general trends and patterns reflected in the results.
You can learn more about WingArc’s approach and its open data confidentiality solution on WingArc’s website.