How to Anonymize or Pseudonymize Open Data?
For a few months now, all you've been hearing about is the GDPR, the European Union's famous General Data Protection Regulation. Everyone is asking how your company is going to comply... without really understanding what it's all about. Vendors and consultants are outdoing each other in inventiveness, inviting you to events on the subject... which only skim the surface.
Among the range of IT security solutions available, you've already implemented encryption and carefully controlled access to your information system. Now you're thinking about anonymization, which, along with pseudonymization, comes up again and again in your discussions: how do you go about it? What organization should you put in place?
A belated awareness
It's astonishing to have waited for the advent of a strong regulatory constraint to put the spotlight back on a discipline that has been around for so long.
It is therefore legitimate to ask "Why?"
"Why have we waited so long?"
"Why hasn't this been done already?"
... so obvious is it to every customer that you shouldn't "play" with their personal information.
There are many explanations, but in the end, they only interest those who live in the past.
So, let's look at the present picture: companies share production data (that which they need for their day-to-day business) to meet a variety of needs:
- Copy the entire production data set to enable developers and administrators to test upgrades, patches and releases,
- Gain agility and competitiveness by developing new functionalities and analytical models in an environment as close as possible to production,
- Analyze trends (consumption, behavior, medical research...) by sharing data with consultants and researchers so that they can apply statistical or Machine Learning models.
As a result, billions of items of customer data, however sensitive, leave production environments unprotected.
GDPR, an accelerator for the accountability of all players
Recent analyst studies on data privacy tend to show that companies have no way of knowing whether data taken out of a production environment has been compromised.
I think the "Why?" becomes obvious: regulatory constraints aside, the person whose personal data is being used, without knowing whether it will be shared or compromised, is you, it's me, it's our children...
The protection of privacy is a fundamental right guaranteed by the Universal Declaration of Human Rights.
That's why we must all, as company directors and IT managers, implement the mechanisms that ensure our data is used only for justified and limited purposes.
Identifying the right means of protection
The GDPR is therefore not the answer to the "Why?", but it may be the beginning of the answer to the "How?".
In the first place, the regulatory framework and, above all, the pecuniary sanctions and other fines attached to it, are a lever for financing the implementation of the anonymization project.
Drawing up the register of processing operations, required by the GDPR, is a good way of pinpointing exactly where personal data resides in the information system... which in turn makes it quick to determine what needs to be anonymized.
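Locating personal data can be partially automated. The sketch below is a hypothetical, deliberately simplified illustration (the function name and patterns are mine, not from any particular product): it guesses what kind of personal data a column holds by matching a sample of its values against basic regular expressions.

```python
import re

# Illustrative patterns only; a real discovery tool would use far richer rules.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?[\d .-]{8,20}$"),
}

def classify_column(sample_values):
    """Guess the kind of personal data a column holds from a sample of its values."""
    for kind, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in sample_values):
            return kind
    return None  # nothing recognized: not flagged as personal data

classify_column(["alice@example.com", "bob@example.org"])  # "email"
classify_column(["+33 6 12 34 56 78", "01 23 45 67 89"])   # "phone"
classify_column(["Paris", "Lyon"])                          # None
```

Running such a scan over column samples gives a first map of where personal data sits, to be confirmed by the business owners of each application.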
Secondly, the regulation urges us to think first and foremost about the need to process personal data, and advocates the principle of data minimization: "what is necessary with regard to the purposes for which it is processed".
For example:
Is it really necessary to have all production data in development, qualification or training environments? Ultimately, isn't it too costly and too risky?
Data sampling is a second response: reducing the risk surface by selecting (intelligently) a representative set of data, which can then be anonymized according to business needs.
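To make "intelligent selection" concrete, here is a minimal sketch, assuming Python and an in-memory list of records: instead of taking the first N rows, a stratified sample preserves the distribution of a chosen attribute, so the reduced data set remains representative.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw a proportional sample that preserves the distribution of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for group in groups.values():
        # Take the same fraction of every group, keeping at least one row.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Illustrative data: 750 EU customers, 250 US customers.
customers = [{"id": i, "region": "EU" if i % 4 else "US"} for i in range(1000)]
subset = stratified_sample(customers, key="region", fraction=0.1)
# The 10% subset keeps both regions in their original 3:1 proportion.
```

In practice the stratification key would be a business attribute (segment, product line, country) chosen with the teams that consume the data.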
The regulation also proposes simplified mechanisms, such as pseudonymization, which consists of replacing personal data with a pseudonym, masking the link with the original individual (provided the mapping between pseudonym and individual is neither trivial nor retained alongside the data).
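A common way to build such pseudonyms is keyed hashing: the sketch below, a minimal example assuming the key is stored separately from the shared data set, uses HMAC so that the pseudonym is stable (joins between tables still work) but cannot be reversed or dictionary-attacked by anyone without the key.

```python
import hmac
import hashlib

# Assumption: the key lives in a secret store, never alongside the shared data.
SECRET_KEY = b"rotate-me-and-store-me-separately"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable, non-reversible pseudonym.

    HMAC (rather than a plain hash) means someone holding only the
    pseudonymized data cannot test guesses against known identifiers.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# The same input always yields the same pseudonym...
assert pseudonymize("alice@example.com") == pseudonymize("alice@example.com")
# ...and distinct individuals get distinct pseudonyms.
assert pseudonymize("alice@example.com") != pseudonymize("bob@example.com")
```

Note that under the GDPR, pseudonymized data is still personal data as long as the key exists; it reduces risk, it does not eliminate it.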
How to implement data anonymization?
That said, none of these guidelines will direct the company as to how it should organize itself. This may well be the Gordian knot of anonymization:
- "Should I anonymize application by application?"
- "What should be done with applications that share the same individual's personal data?
- "What type of organization will meet business requirements?
- "Will I lose agility in the evolution of the information system?
Clearly, organization is the keystone of an anonymization project, and conditions its success.
You need to implement an "industrialized anonymization service" capable of meeting the needs of all IT teams, who will be the most affected:
- be able to address all technologies (while respecting their licensing and support rules, of course);
- offer high-performance, intelligent sampling: don't settle for the first 1,000 rows... look for a representative data set in the source and in related repositories (to guarantee referential integrity between applications);
- guarantee high-performance service levels: offer "on-demand" or automated anonymization;
- provide a complete library of anonymization formats (random replacement, data deletion, rewriting, etc.).
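As a toy illustration of the last point, here is a hypothetical sketch of what such a format library might contain; the function names and formats are invented for the example, not taken from any real product.

```python
import itertools
import random
import string

_rng = random.Random(42)        # seeded for reproducible masking runs
_counter = itertools.count(1)   # surrogate-identifier generator

def random_replace(value: str) -> str:
    """Random replacement: swap each character for one of the same class."""
    def swap(ch):
        if ch.isdigit():
            return _rng.choice(string.digits)
        if ch.isalpha():
            return _rng.choice(string.ascii_lowercase)
        return ch  # keep separators so the format stays recognizable
    return "".join(swap(ch) for ch in value)

def delete(value: str) -> str:
    """Data deletion: drop the value entirely."""
    return ""

def rewrite(value: str) -> str:
    """Rewriting: replace the value with a generated surrogate identifier."""
    return f"CUSTOMER-{next(_counter)}"
```

A production-grade library would also cover date shifting, shuffling within a column, and format-preserving encryption, and would apply the same transformation consistently across all applications that share the field.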
This anonymization service will then bring about a positive change in IT teams' working methods, with minimal impact on their day-to-day work.
Choosing the right tools
As you can see, this subject is not driven by technology. But what about tooling?
The literature will help you understand the main anonymization models, such as "k-anonymity", "l-diversity", "t-closeness" or "differential privacy", each offering a different trade-off between data utility and level of protection.
These are just some of the tools available to specialists for implementing the right anonymization for the right set of data.
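To give a flavor of the first of these models, here is a minimal sketch, under the usual textbook definition: a dataset is k-anonymous if every combination of quasi-identifier values (attributes that could re-identify someone when combined, like zip code and age band) is shared by at least k records.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k of a dataset: the size of the smallest group of records
    sharing the same combination of quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Illustrative records with already-generalized quasi-identifiers.
patients = [
    {"zip": "750xx", "age_band": "30-40", "diagnosis": "flu"},
    {"zip": "750xx", "age_band": "30-40", "diagnosis": "asthma"},
    {"zip": "920xx", "age_band": "50-60", "diagnosis": "flu"},
    {"zip": "920xx", "age_band": "50-60", "diagnosis": "diabetes"},
]
assert k_anonymity(patients, ["zip", "age_band"]) == 2  # 2-anonymous
```

A specialist would then check l-diversity as well (here each group also contains two distinct diagnoses), since k-anonymity alone does not protect against an attacker when everyone in a group shares the same sensitive value.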
Instead, I'd like to focus on an industrial anonymization solution that guarantees:
- multi-source and multi-target connectivity, so as to be the company's central, unifying tool, guaranteeing anonymization that respects inter-application referential integrity;
- a wizard, enabling the construction of anonymization workflows adapted to the dataset (discovery of sensitive data in the source, proposal of suitable algorithms, preview of results, etc.);
- the ability to automate anonymization chains to guarantee optimized service levels (night-time processing, on-demand dataset refreshing, etc.);
- ease of use, so that the team in charge of the anonymization service can quickly and easily upgrade its skills and capabilities.
Of course, the solution must itself comply with GDPR best practices: encryption, access control for privileged accounts, supervision... because the anonymization infrastructure will sit at the crossroads of personal data flows.
Oracle's "Data Masking Factory" initiative meets these requirements, and has established itself in the information systems landscape as the agnostic, high-performance solution for anonymization services.
Don't mess around with personal data
2018 is the year of the paradigm shift: the year we realize that it is our own personal data that companies have been handling too lightly.
Everyone must, at their own level, understand and accept that playing games with data is over.
The GDPR is a reminder of good practices, among which anonymization plays a central role.
More than a technical project, what is needed is an effective organization and toolset to provide the business with a high-performance anonymization service.