Privacy-preserving Data Sharing and the European Data Strategy

It was a great pleasure to give a lecture on the fundamental limits of privacy-enhancing data sharing technologies at the Brussels Privacy Hub Global Summer Academy 2024.

The main question that I tried to answer during my lecture is whether privacy-enhancing technologies can achieve their envisioned role in a future data-driven economy as outlined by the European Data Strategy and similar legislation and regulatory frameworks.

The key takeaways of the lecture were:

Any data sharing use case brings with it an inherent risk that is defined by its intended purpose.

  • Every data use case comes with inherent privacy risks: risks that are intrinsically linked to the intended data use, which in turn is defined by the purpose of the data sharing.

  • If we analyse the envisioned purposes of data sharing outlined by the European Data Strategy and other relevant policy documents, we find a vision of general-purpose, high-flexibility, high-utility data sharing for research and innovation.

  • By definition, the inherent risks of these envisioned data use cases are large. If we provide the desired utility to the analyst, we must reveal a large amount of information, and that same information can be misused to infer private details about individuals.

  • This is the fundamental trade-off of privacy-preserving data publishing.

Due to these fundamental trade-offs, purely technological solutions might never achieve the high-utility data sharing for secondary purposes under strong privacy protections that many strategic policy documents envision.

  • No privacy-enhancing technology that gives the desired utility can eliminate the inherent risks of a data use case.

  • Thus, privacy-enhancing technologies that produce (micro)data useful for secondary data use, such as research and innovation, do not provide strong privacy protections; conversely, privacy-enhancing technologies that provide strong privacy protections do not fulfil the utility requirements.

To demonstrate this, take the example of synthetic data:

  • Assume that synthetic data fulfils the promise of being a “drop-in replacement for the original data” or of “preserving information richness”. This implies that the shared synthetic dataset must preserve information for many possible “advanced analytics tasks”, including things like medical anomaly detection. If an analyst can successfully use this information for valid statistical analysis, a privacy adversary can equally use it to infer accurate information about individual outliers in the data. Thus, the synthetic data does not provide strong privacy protection.

  • Assume instead that the synthetic data fulfils the promise of strong privacy, i.e., the formal guarantees of differential privacy. By definition, this dataset may not preserve any (accurate) information about individual records, such as outliers. Any analysis that requires this statistical information will therefore be inaccurate, and the synthetic data does not fulfil the promise of a “drop-in replacement” for the original data. The small sketch after this list illustrates the same tension numerically.
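To make this tension concrete, here is a minimal, self-contained sketch. It is not taken from the lecture; the toy dataset, the domain bounds, and the epsilon values are hypothetical choices made purely for illustration. It releases an outlier-sensitive statistic (the maximum) via the Laplace mechanism: under strong privacy (small epsilon) the release is too noisy to support anomaly detection, while an accurate release (large epsilon) necessarily exposes the outlier's value to an adversary.

```python
# Illustrative sketch of the utility/privacy tension in a differentially
# private release of an outlier-sensitive statistic. All numbers below are
# hypothetical choices made for this example.
import numpy as np

rng = np.random.default_rng(0)

# Toy "medical" dataset: 999 typical values plus one extreme outlier,
# clipped to an assumed domain of [0, 300].
typical = rng.normal(loc=100.0, scale=10.0, size=999)
outlier = np.array([300.0])
data = np.clip(np.concatenate([typical, outlier]), 0.0, 300.0)

def dp_release(value: float, sensitivity: float, epsilon: float) -> float:
    """Release a statistic with the Laplace mechanism (epsilon-DP)."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Statistic an analyst might want: the maximum observed value, i.e. exactly
# the kind of per-record information that anomaly detection relies on.
true_max = float(data.max())

# Changing a single record can move the maximum across the whole domain
# [0, 300], so the global sensitivity of the maximum is the domain width.
sensitivity = 300.0

for epsilon in (0.1, 1.0, 10.0):
    noisy_max = dp_release(true_max, sensitivity, epsilon)
    print(f"epsilon={epsilon:5.1f}  true max={true_max:6.1f}  "
          f"DP release={noisy_max:8.1f}")

# At epsilon=0.1 (strong privacy) the release is dominated by noise and is
# useless for detecting the anomaly; at epsilon=10 (weak privacy) the release
# is accurate, but it also hands the outlier's value to a privacy adversary.
```

The same argument applies to any outlier-sensitive statistic that a synthetic dataset would have to preserve in order to serve as a drop-in replacement for the original data.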

If you are interested in hearing the full reasoning behind these messages, feel free to get in touch. I’m always happy to connect and give this talk to a wider audience.



