What is Data Masking? Types & Techniques

Jordan Lampe · November 3, 2022

Data Security

What is data masking?

Data masking is the process of hiding elements of an original value, while still keeping enough context for the string to make sense to the user. For example, masking a credit card’s Primary Account Number (PAN) of 3566-0020-2036-0505 to reveal the last four-digits might look like XXXX-XXXX-XXXX-0505.

Why mask data?

With 1,802 data breaches in 2022 and the increasing number of data privacy regulations, like CCPA and the GDPR, businesses need to use confidential information intelligently. Data masking offers a smart way to minimize or eliminate compliance requirements while maintaining day-to-day operations.

Here is a comprehensive overview of data masking techniques, benefits, and examples.

What does Data Masking Do?

Masked values retain usability and functionality, while enhancing privacy and confidentiality. Because of this, masked data can be used for things—like sales demos, user training, account validation, and software testing—without increasing a company’s risk footprint, as plaintext data does. It also protects private information when sharing your business data with a third party.

Data Masking Use Cases and Examples

When secured with strong privacy policies, identity access management tools, and role-based access controls, data masking provides a powerful and dynamic tool for desensitizing and using data. Let’s look at a few ways organizations mask data today.

Using Masked Data as a Visual Prompt

Masked data provides powerful visual cues to those familiar with the underlying data. For example, when a customer checks out at an online store, her digital wallet prompts her to select a stored card and uses the last four-digits of their respective PANs to help identify the correct payment method.

Using Masked Data for Testing

Humans and applications need data to test various system functions or standard operating procedures. Using sensitive plaintext data, or original values, is dangerous and expands both compliance scope and costs considerably.

When done right, masked data provides a cost-effective way to test whether a system or design will perform as expected in real-life scenarios without revealing sensitive data.

Using Masked Data to Help Migrate Data

Data masking can apply new formats to the underlying data. When combined with an abstraction layer, like tokenization, masked data may help format, structure, or clean data to satisfy new business or schema requirements encountered during a migration.

Types of Data Masking

While other methods exist, masked data is primarily created either as a copy of the original data or during processing. Let’s take a closer look at the two.

Static Data Masking (SDM)

SDM duplicates your data with your data masking rules and algorithms applied to the new data set. This is helpful for test data environments, especially with third parties, where you may want to avoid the risk of something bad happening to your source data.

Dynamic Data Masking (DDM)

DDM does not require a second data source to store the masked data dynamically. Instead, it masks and presents data according to the access policies and permissions of the actor requesting the data. For example, a customer support manager may have access to full PAN while his support representatives only have access to the last four digits.

In doing so, DDM provides access to masked data in real-time to an authorized user and limits the exposure risk static data may create.

What are Some Data Masking Techniques?

Because there are so many ways to obfuscate data, it’s helpful to consider how each of the following techniques brings you closer to your usability and risk goals. In fact, most companies mix and match these approaches based on the effort it takes to build, maintain, and use them.

For example, while using format-preserving encryption to mask a social security number adds a significant level of security, an organization may not pursue it due to layers of complexity and cost to the application required to decrypt it. (We’ll get into encryption a bit more, later).

Obfuscating all but the last four-digits of a card’s Primary Account Number (PAN) provides some level of protection and usability to both users and organizations alike.

Substituting Data

Substitution replaces part of an original value with other characters or numbers—sometimes similar in nature. For example, if your email was Claude.Shannon@fakeemail.com, a substituted version might look like: Terry.Shannon@fakeemail.com.

Many methods for populating the data are used to substitute all or part of the original value. For example, you could draw from a look-up table of fake names or dates, use a generator or function to generate values, or calculate an average using aggregate values.

Redaction or Masking out Data

Redaction, also known as masking, is worth calling out as a notable sub-genre of substitution because of its popularity. This approach obscures a portion of the original value with fixed characters, like “X” or “*” and is by far the simplest form of data abstraction (it requires the least amount of power from humans and computers alike to understand and process).

Deleting or Truncating Data

Deletion or truncation desensitizes a piece of information by removing its essential elements.

Shuffling Data

Shuffling creates a mask by rearranging the characters or numbers of the original value; however, its inclusion in this is purely informational. While it’s low effort to build, shuffling is easily reverse-engineered and most experts strongly recommend against using it for sensitive data.

Is Encrypting Data the Same as Masking Data?

Encryption uses math and a combination of keys to transform an original value into an unrecognizable string of characters. The same set of keys used to encrypt the original value is then used to decrypt it.

There are many that maintain encryption is a form of masking. While encryption does obfuscate data and render it unreadable, the resulting string provides no context to a user or business trying to complete a task, so it fails to meet our definition of data masking.

While format-preserving encryption does offer some value here, the compute costs, latency, and operational overhead of managing the necessary keys and services have traditionally made the other data masking techniques much more attractive.

Learn more about managing your keys with Basis Theory’s Open Source KMS.

What’s the Difference Between Data Masking and Data Tokenization?

Data masking modifies the existing original value while tokenizing data creates a net new one, called a token.

Masking takes a string and obfuscates an original raw value while tokenizing create a new value in the form of a token. — Obfuscating all but the last four-digits of a card’s Primary Account Number (PAN) provides some level of protection and usability to users and organizations alike.

Using masked data in applications and databases is a great way to reduce your compliance footprint in your environment, but it doesn’t eliminate the compliance obligations or security risks that come with storing the original value. Depending on the type of data being held and its usage, this could cost hundreds of thousands of dollars to build, maintain, and audit.

Tokenization platforms, like Basis Theory, store the original sensitive data in a compliant, hosted environment outside of your system. Instead of holding onto the full plaintext value, you’d store references, called Token Identifiers or Token IDs, to it in your application or database. Systems use these Token IDs to authorize and call back different properties of the original value (e.g., masks), allowing them to do many of the same things its plaintext counterparts can without increasing scope or risk. Let’s look at a tokenization example while incorporating what we know about DDM.

Say your customer support application needs access to the last four digits of a customer’s PAN to help a representative process a return on behalf of a customer. The application would use the Token ID to request a redacted version of the data. Basis Theory uses DDM, so once the application and user (i.e., the support rep) have been authorized, a mask of the customer’s card, “XXXX-XXXX-XXXX-0505”, would be generated and passed back to the application.

In this situation, the original value stays encrypted in a PCI-compliant environment and a masked value is returned to the representative so the customer can confirm which card to use to issue a refund. This workflow keeps your larger system and the web app out of PCI scope while reducing or even eliminating many of the obligations and costs that come with hosting full plaintext card numbers.

What are the Benefits of Data Masking?

Here are reasons organizations might use data masking:

Reduce compliance scope: Using masked data inside your applications and databases instead of plaintext greatly reduces your compliance scope and the costs, complexity, and speed of your compliance efforts.
Least privilege: When combined with strong access rules and policies, masking ensures authorized actors have just the right amount of visibility needed to complete a job or function.
Compliance: If done well, masked data not only satisfies business needs but also sovereign or industry requirements, like GDPR, CCPA, or PCI.
Programmability: Masked data can be generated to assume or change its value based on various static or dynamic factors.
Made for humans: Human memory is a fickle thing. It's easier for us to recognize and remember data sets when they are provided in a familiar pattern.
Reduce the impact of a breach: Masking data greatly decreases its usability, potentially preventing, or at least, cyber criminals' ability to cause harm.

What are Some Drawbacks or Limitations of Data Masking?

It’s Not a Silver Bullet

While using masked data in your applications inherently lowers your risk and scope compared to plaintext, you’ll still have obligations related to holding the original value. As discussed, storing the last four digits in your support services web application may not bring it into PCI scope, but organizations still need to prove that the database storing the full PAN is PCI compliant. Similarly, Personally Identifiable Information (PII) may have different masking requirements depending on where and for how long that information is stored (e.g., USA vs. European Union).

Some Things Sold Separately

You may have noted several times the use of the words “permissions”, “access controls,” “policies,” etc., in this blog. Having solid data governance policies, identity access management systems, and role-based access controls (RBAC) to govern the “who”, “what”, “when”, “where”, and “how” around data permissions is the best way to ensure an organization meets its usability, risk, and compliance goals for its data. Data masking will only be as strong as these systems are complete and accurate.

Preserving Formats

If you’re using an algorithm to mask data, you’ll need to accommodate the multitude of variations for that data. For example, while social security numbers in the USA are predictably formatted, emails can vary in length, numbers, and characters. Masking these incorrectly could lead to unfortunate downstream consequences.

Preserving Social Norms

Robots have a tough time determining things that humans may instinctively know as correct. One often cited example is gender. If you’re using an algorithm to substitute PII, like names and gender, in a data set, you’ll want a lookup table(s) with a population of male or female names proportional to your data set’s gender breakdown (or use popular neutral names, like Pat or Terry).

Re-identification

Given enough context, some data can be pieced back together. To put that into perspective, one study found 87% of Massachusetts residents could be identified with just 3 data points: gender, zip, and birth date. While the study used plaintext data (i.e., not masked) to identify names and addresses, the same principle applies to data masking. Suppose multiple variations of masked data in your production system were to leak. In that case, hackers could use a combination of manual efforts, AI/ML programs, or other data sets to reverse engineer an original value.

Avoiding Duplicates

While rare in smaller data sets, large datasets with thousands or millions of stored values could accidentally have two of the same masked values. While this doesn’t matter for some things, like birthdates, it may for unique values, like social security numbers.

Data Masking with Basis Theory

Unlike other tokenization providers, Basis Theory allows developers to segment their data’s access controls as containers and define their tokens’ masks using a familiar Liquid syntax, called Expressions. If you’re interested in learning more, reach out for a quick demo!