Data sharing across agencies is a growing trend in Australian government, but even anonymised data can often be re-identified. A new Australian initiative aims to solve the problem, as Graeme Philipson discovers.
In August 2016 a large data set of 2.9 million Australians – more than 10 percent of the population – was published on the Department of Health’s open data website. The data, containing over a billion records of patients’ medical history going back 30 years, had been ‘de-identified’ – all data that could be used to identify individuals had been stripped out.
“To ensure that personal details cannot be derived from this data, a suite of confidentiality measures including encryption, perturbation and exclusion of rare events has been applied,” said the Department at the time.
It was not enough. The data set was withdrawn within a month after three University of Melbourne researchers used publicly available data to re-identify individuals in it. The incident attracted wide attention when the researchers showed that they were able to match data and identify seven prominent individuals – including MPs and AFL footballers.
The researchers, Dr Chris Culnane, Dr Benjamin Rubinstein and Dr Vanessa Teague, said that the re-identification was a simple process for anyone with undergraduate IT skills.
Stepping back from data sharing
The Health Department re-identification issue has set back the cause of open data in Australia significantly. To solve the problems highlighted by the Melbourne researchers, a cross-industry taskforce was formed to develop standards and procedures that would prevent the re-identification of anonymised data.
The taskforce was led by Dr Ian Oppermann, CEO and chief data scientist at the NSW Treasury’s influential Data Analytics Centre. In September 2017 it published a technical whitepaper, Data Sharing Frameworks, which outlined the issues and the challenges facing data sharing in Australia, and in particular the problems with re-identification.
The taskforce worked under the auspices of the Australian Computer Society to develop a framework for data sharing across Australian government. It included representatives from Standards Australia, CSIRO and its Data61 division, Department of Prime Minister and Cabinet, the Australian Information and Privacy Commissioner and the Digital Transformation Office.
Addressing the problems
The whitepaper outlines a number of challenges, one of the most important of which is ensuring that data cannot be used to re-identify individuals based on personally identifiable information.
“Even when you remove information like people’s names and addresses, there are other pieces of information there that make it possible for individuals to be identified,” Dr Oppermann explained to Government News.
“A lot of it depends on context. It is impossible to tell what data might be used for re-identification. Health records, for example, often contain information specific to an individual which, combined with other data that can be introduced later, will make it possible for an individual to be identified.
“One of the problems is that there are no standards for anonymised data,” says Dr Oppermann. “That is what we are attempting to do with the Data Sharing Task Force – developing an unambiguous test for the presence of personally identifiable information within a number of data sets.”
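The kind of linkage described here, where stripped records are combined with outside information, can be illustrated with a minimal Python sketch. It is not the Melbourne researchers' method or anything from the whitepaper: every record, field name and function name below is invented for illustration, and it assumes an attacker knows a person's birth year, postcode and sex.

```python
from collections import defaultdict

# Invented "de-identified" release: names removed, other attributes kept.
released = [
    {"birth_year": 1975, "postcode": "3052", "sex": "F", "procedure": "knee surgery"},
    {"birth_year": 1975, "postcode": "3052", "sex": "M", "procedure": "appendectomy"},
    {"birth_year": 1982, "postcode": "2000", "sex": "F", "procedure": "childbirth"},
    {"birth_year": 1982, "postcode": "2000", "sex": "F", "procedure": "tonsillectomy"},
]

# Invented public information, e.g. gathered from news reports.
public = [
    {"name": "Person A", "birth_year": 1975, "postcode": "3052", "sex": "F"},
    {"name": "Person B", "birth_year": 1982, "postcode": "2000", "sex": "F"},
]

# Hypothetical choice of attributes that appear in both sources.
QUASI_IDENTIFIERS = ("birth_year", "postcode", "sex")


def key(record):
    """Return the combination of attributes shared by both data sets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released_rows, public_rows):
    """Map each named person to a released row when the match is unique."""
    index = defaultdict(list)
    for row in released_rows:
        index[key(row)].append(row)

    matches = {}
    for person in public_rows:
        candidates = index.get(key(person), [])
        if len(candidates) == 1:  # exactly one candidate: re-identified
            matches[person["name"]] = candidates[0]
    return matches


matches = link(released, public)
for name, row in matches.items():
    print(name, "->", row["procedure"])
```

In this toy data only Person A is singled out, because two released records share Person B's attributes. Rare combinations of ordinary attributes are what make a record vulnerable, even with names and addresses removed.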
Future smart services for homes, factories, cities and governments rely on sharing large volumes of often personal data between individuals and organisations, or between individuals and governments, he says.
“Data sharing comes with a wide range of challenges: data format and meaning, legal obligations, privacy, data security, and concerns about the unintended consequences of data sharing,” says Dr Oppermann. “This creates the need to develop sharing frameworks which address technical challenges, embed regulatory frameworks, and anticipate and address concerns as to fairness and equity of outcomes in order to maintain the trust of consumers and citizens.”
Dr Oppermann says that human judgement is not sufficient to determine the possibility of re-identification in any one case, and that standards based on a number of statistical factors need to be developed.
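The article does not say which statistical factors the taskforce has in mind. As one illustration of what a quantitative test can look like, the sketch below uses k-anonymity, a widely known measure that asks whether every combination of identifying attributes is shared by at least k records. Treating it as representative of the taskforce's approach is an assumption, and the sample rows, field names and function names are invented.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the rarest attribute combination (0 for an empty data set)."""
    counts = Counter(
        tuple(row[field] for field in quasi_identifiers) for row in rows
    )
    return min(counts.values()) if counts else 0


def is_k_anonymous(rows, quasi_identifiers, k):
    """True when every attribute combination is shared by at least k rows."""
    return smallest_group(rows, quasi_identifiers) >= k


# Invented sample data for illustration only.
rows = [
    {"age_band": "40-49", "postcode": "3052", "condition": "asthma"},
    {"age_band": "40-49", "postcode": "3052", "condition": "diabetes"},
    {"age_band": "30-39", "postcode": "2000", "condition": "asthma"},
]

fields = ("age_band", "postcode")
print(smallest_group(rows, fields))     # size of the rarest combination
print(is_k_anonymous(rows, fields, 2))  # does the data meet k = 2?
```

Here the single record for the 30-39 age band in postcode 2000 means the data fails the test for k = 2. A check of this kind gives a number that can be compared against a threshold, rather than relying on a reviewer's judgement of whether the data looks safe.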
Now the taskforce has developed the draft of a second whitepaper, which Government News has seen, that makes recommendations on standards and procedures to ensure that shared data cannot be re-identified. It’s expected that the whitepaper will be published soon.
The ‘Five Safes’
Central to the approach being taken is the ‘Five Safes’ framework:
- Safe People: relies on individuals to act appropriately, based on their technical skills and training.
- Safe Projects: legislation and guidelines.
- Safe Setting: physical controls over the usage of the data.
- Safe Data: the potential for identification in the data, and the sensitivity of the data itself.
- Safe Outputs: control over how the data is used.
The document contains a number of draft standards which Dr Oppermann believes address the legal and technical issues surrounding reidentification. It uses a number of calculations based on statistical analysis methodologies to overcome many of the concerns expressed in the earlier whitepaper that human judgement calls were not sufficient to guard against reidentification.
The draft guidelines go well beyond existing international efforts designed to prevent re-identification. They have the potential to significantly alter the landscape for data sharing in government, because they are the first real attempt to overcome the practical issues surrounding data normalisation and re-identification.
The inclusion of Standards Australia in the task force indicates that the procedures it implements might well be adopted internationally – Australia has a long history of being a pioneer in the development of international standards in areas that involve complex procedural matters.
“The development of standards around just what ‘anonymised’ means will help to address the challenges of dealing with privacy,” says Dr Oppermann.
“In all parts of the world, there is currently only very high-level guidance, and certainly nothing quantitative, as to what to ‘anonymise’ and how to do it. That means many organisations must determine what it means for them based on different data sets.
“We believe we have an answer to the problem.”