Reidentification Risk of Masked Datasets: Part 2

This article continues Reidentification Risk of Masked Datasets: Part 1, where we discussed how organizations progressed from simple to sophisticated methods of data security and how, even then, they faced the challenge of reidentification. That article concluded with what companies really need to focus on while anonymizing their data.

Now we turn to reidentification itself and to how you can achieve the ultimate goal: reducing or eliminating the risk of reidentification.

Before we dive further into this, it helps to understand a few basic concepts. 

What is reidentification and reidentification risk?

Reidentification is the recovery, directly or indirectly, of the original data behind an anonymized record. How likely that is depends on the dataset and the method of anonymization, and that likelihood is appropriately named reidentification risk.

The NY cab example that we saw in Part 1 of this article is a classic case of reidentification, where the combination of indirectly related factors led to the reidentification of seemingly anonymized personal data.

Understanding the terms data classification and direct identifiers

A data classification, or identifier, is any data element that can be used to identify an individual, either by itself or in combination with other elements. Examples include name, gender, DOB, employee ID, SSN, age, phone number and ZIP code. Certain data classifications, such as employee ID, SSN and phone number, are unique or direct identifiers (i.e., they can be used to uniquely identify an individual). Name and age, on the other hand, are not unique identifiers, since the same values repeat across a large dataset.

Understanding indirect identifiers and reidentification risk through a simple example

Let's say you take a dataset of 100 employees and you're tasked with finding a specific employee in her 40s. Assume that all direct identifiers have been anonymized. Now you look at indirect identifiers, such as race/ethnicity, city or, say, her bus route, and sure enough, you've identified her. Which attributes act as indirect identifiers depends on the dataset, so they differ from one dataset to the next. Even though the unique identifiers have been anonymized, you can't say for sure that an individual can never be identified, because every dataset carries indirect identifiers. That is where the risk of reidentification comes from.
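To make the narrowing-down process concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic 100-record dataset, the `Employee` fields, and the helper names `build_masked_dataset` and `narrow` are hypothetical and do not come from any real dataset or library. It only shows how each added indirect identifier shrinks the pool of candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    record_id: int   # masked surrogate key, not a real employee ID
    age: int
    ethnicity: str
    city: str
    bus_route: str


def build_masked_dataset() -> list[Employee]:
    """Generate 100 synthetic records; direct identifiers are assumed already removed."""
    ethnicities = ["A", "B", "C", "D"]
    cities = ["North", "South", "East", "West", "Central"]
    routes = ["R1", "R2", "R3", "R4", "R5", "R6", "R7"]
    return [
        Employee(
            record_id=i,
            age=22 + (i * 7) % 40,          # ages spread over 22..61
            ethnicity=ethnicities[i % 4],
            city=cities[i % 5],
            bus_route=routes[i % 7],
        )
        for i in range(100)
    ]


def narrow(records, steps):
    """Apply each (label, predicate) in order and record how many candidates remain."""
    remaining = list(records)
    trace = [("start", len(remaining))]
    for label, predicate in steps:
        remaining = [r for r in remaining if predicate(r)]
        trace.append((label, len(remaining)))
    return remaining, trace


# What the searcher is assumed to know about the target (all indirect identifiers).
steps = [
    ("in her 40s", lambda r: 40 <= r.age <= 49),
    ("ethnicity C", lambda r: r.ethnicity == "C"),
    ("city Central", lambda r: r.city == "Central"),
    ("bus route R1", lambda r: r.bus_route == "R1"),
]

candidates, trace = narrow(build_masked_dataset(), steps)
for label, count in trace:
    print(f"{label:>14}: {count} candidate(s)")
```

With this synthetic data the pool shrinks from 100 records to 24, then 7, then 3, and finally to a single record, even though no direct identifier was ever used. Real datasets will narrow at different rates, but the mechanism is the same.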

What are quasi-identifiers?

A quasi-identifier is a combination of data classifications that, taken together, can uniquely identify a person or an entity. As previously mentioned, studies have found that the five-digit ZIP code, birth date and gender form a quasi-identifier that can uniquely identify 87% of the American population.
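One way to see how a quasi-identifier behaves is to measure what fraction of records is unique on a given combination of columns. The sketch below is illustrative only: the `uniqueness_rate` helper and the six sample rows are hypothetical, and the result says nothing about the 87% figure, which comes from population-scale data.

```python
from collections import Counter


def uniqueness_rate(rows, columns):
    """Return the fraction of rows whose values in `columns` occur exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)


# Made-up sample records with direct identifiers already removed.
rows = [
    {"zip": "10001", "birth_date": "1980-03-14", "gender": "F"},
    {"zip": "10001", "birth_date": "1980-03-14", "gender": "M"},
    {"zip": "10001", "birth_date": "1975-11-02", "gender": "F"},
    {"zip": "10002", "birth_date": "1980-03-14", "gender": "F"},
    {"zip": "10002", "birth_date": "1975-11-02", "gender": "M"},
    {"zip": "10002", "birth_date": "1975-11-02", "gender": "M"},
]

for columns in (["zip"], ["birth_date"], ["gender"], ["zip", "birth_date", "gender"]):
    print(columns, round(uniqueness_rate(rows, columns), 2))
```

In this sample no single column singles anyone out (each value is shared by three records), yet the three columns together make four of the six records unique. That gap between per-column and combined uniqueness is what makes a quasi-identifier risky.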

Now that we are on the same page with essential terms, let's get to the question: How do you go about choosing the right solution that minimizes or eliminates the risk of reidentification while still preserving the functionality of the data?

The answer lies in taking a risk-versus-value approach.

To find out more, visit Forbes Tech Council – Reidentification Risk of Masked Datasets: Part 2 to read the entire article.