Date: 2025-01-01

Figure 1: Value in shifting left

Many organizations have security engineers or even a Chief Information Security Officer (CISO), yet many lack technical expertise when it comes to privacy. This has led to the emergence of the privacy engineer: a specialist software engineering role that ensures privacy considerations are embedded into product development, rather than left as an afterthought. The need for this role has intensified partly because organizations today face increasing legislative requirements with which their practices, processes and products must comply. Greater awareness of the ethical dimension of technology also makes the role particularly valuable. In the past, data was often collected with little consideration for users’ personal privacy. Today, however, privacy can be a differentiator: there are plenty of examples of privacy being placed at the center of product development.

In the same way that we, as developers, think about technical debt, we also need to start paying attention to our “privacy debt”. With data breaches increasing, companies have a decision to make about when they tackle this debt. Those that successfully shift left on security and privacy will significantly reduce the probability of ending up on the growing list of companies that have failed to protect their data.

Safeguarding data

Personally Identifiable Information (PII) is data that identifies an individual, either in isolation or in combination with other data. Without the right security in place, hackers can access this data and build profiles of individuals, then use those profiles to impersonate them or sell the information to other criminals.

Consideration is required when storing any form of personal data. Stop and reflect: is collecting PII necessary to deliver the customer experience? Can you retain trust? For example, a retail recommendation system can offer a tailored experience with a broad age bracket without capturing a customer's date of birth. For organizations that do use PII, being able to find and identify sensitive data is critical to protecting their customers and their reputation. Technologies like data catalogs and appropriate governance frameworks are particularly useful here for ensuring data can be effectively organized and secured.
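As a minimal sketch of the age-bracket idea, assuming a sign-up form that still asks for a date of birth: the value is reduced to a coarse bracket before anything is stored. The bracket boundaries, field names and function names here are hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical bracket boundaries; pick ones that suit your product.
AGE_BRACKETS = [
    (0, 17, "under 18"),
    (18, 34, "18-34"),
    (35, 54, "35-54"),
    (55, 200, "55+"),
]


def age_bracket(date_of_birth: date, today: date) -> str:
    """Reduce a date of birth to a coarse age bracket."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    age = today.year - date_of_birth.year - (0 if had_birthday else 1)
    for low, high, label in AGE_BRACKETS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside the configured brackets")


def to_stored_profile(signup_form: dict, today: date) -> dict:
    """Keep only what the recommender needs; the date of birth is not stored."""
    return {
        "customer_id": signup_form["customer_id"],
        "age_bracket": age_bracket(signup_form["date_of_birth"], today),
    }
```

The stored profile contains no date of birth at all, so there is nothing more precise than the bracket to leak later.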

It is important to note that simply obfuscating or masking PII fields in a dataset does not necessarily de-identify an individual’s data. It may be possible to re-identify the data using other contextual information. For example, hackers use uniqueness as a path to exploit vulnerabilities, so knowing all the ways your data can be unique to an individual is important. The driver of a distinctly coloured car might be hard to single out in a large city, but that same driver in a small country town would be identified easily. The same applies to machine learning models trained on data that contains outliers: an outlier can reveal sensitive information about the underlying data, and the model can inadvertently leak it through a prediction API.
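One rough way to look for this kind of uniqueness is to count how many records share each combination of indirectly identifying fields (often called quasi-identifiers). The sketch below assumes records are plain dictionaries whose names have already been masked; the field names and sample rows are made up for illustration.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group; 1 means at least one record is unique."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def unique_records(records, quasi_identifiers):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records if sizes[tuple(r[q] for q in quasi_identifiers)] == 1]


# Made-up sample: names are already removed, yet one row still stands out.
records = [
    {"town": "Big City", "car_colour": "pink"},
    {"town": "Big City", "car_colour": "pink"},
    {"town": "Big City", "car_colour": "white"},
    {"town": "Big City", "car_colour": "white"},
    {"town": "Small Town", "car_colour": "pink"},
    {"town": "Small Town", "car_colour": "white"},
    {"town": "Small Town", "car_colour": "white"},
]

at_risk = unique_records(records, ["town", "car_colour"])
```

In this sample the only unique combination is the pink car in the small town. A check like this is a starting point rather than a guarantee, because an attacker may hold contextual data that you have not listed as a quasi-identifier.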

One of the complexities organizations face is the need for access to PII to allow development teams to experiment and test as they’re working. Test environments are often not subject to the same security and privacy controls as production, because they don’t contain production data. But a data science workflow needs data that represents production, so that you can train models and do analysis to understand what a model will do in production.

There are many ways you can do this without giving access to production data directly:

Generate fake data that matches the schema of your production data. For some use cases, such as data validation checks, this can be enough to help ensure pipelines work adequately and without error. But if your goal is to train and release a model into production, you should avoid fake data. Not training your model on data that is as close to the real thing as possible raises ethical and accuracy concerns.

Evaluate whether synthetic data, or data that is representative of the real scenario but generated by a model, could work for your use case. You may still need to apply additional privacy-preserving techniques on top of this data.

Generate a subset of secure, anonymous data. Anonymization is a challenging and at times impossible task when it comes to PII. Simply removing obvious fields like names, addresses and other identifiers does not mean you cannot re-identify an individual in that dataset. A good understanding of privacy engineering practices such as masking, differential privacy and encrypted computation is important to do this effectively.

Build an isolated secure environment specifically for model building and training with access to a copy of production data. This approach is more costly and introduces risk, since you’re copying data to another location. You’ll also need a separate environment with all of the same security and privacy controls as production.
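The first option, fake data that matches the production schema, can be sketched with nothing more than the standard library. The schema below is hypothetical: each column name maps to a small function that produces a plausible value. It is only suitable for checks such as pipeline validation, not for model training.

```python
import random
import string

AGE_BRACKETS = ["18-34", "35-54", "55+"]


def fake_email(rng):
    """Random local part on the reserved example.com domain."""
    user = "".join(rng.choices(string.ascii_lowercase, k=8))
    return f"{user}@example.com"


# Hypothetical schema: column name -> function that makes one fake value.
SCHEMA = {
    "customer_id": lambda rng: rng.randint(1, 10**6),
    "email": fake_email,
    "age_bracket": lambda rng: rng.choice(AGE_BRACKETS),
    "basket_total": lambda rng: round(rng.uniform(0, 500), 2),
}


def fake_rows(schema, n, seed=None):
    """Generate n rows with the same columns as the schema.

    A fixed seed makes the output reproducible, which helps in tests.
    """
    rng = random.Random(seed)
    return [{col: make(rng) for col, make in schema.items()} for _ in range(n)]


rows = fake_rows(SCHEMA, n=5, seed=42)
```

Because the values are drawn independently, the rows share the column names and types of production but none of its statistical structure, which is why this approach stops being useful once a model needs to learn from the data.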

The biggest shift you have to make to improve security and privacy is to think small. Data minimization is your friend – helping you build what you want, using the smallest subset of data you really need.

Adopt sensible security practices

There are a number of practices you can put in place to help you make decisions when it comes to developing secure data products.

Appointing security champions with experience in security activities and processes will help guide your development teams on the right decisions to make.

Put a data classification process in place to allow you to tag sensitive data and apply governance policies across the organisation based on the sensitivity of your data.
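As an illustration of how classification tags can drive a policy, the sketch below tags each column with a sensitivity level and filters a row according to the reader's clearance. The three levels, the column names and the policy itself are assumptions made for this example; a real process would normally live in a data catalog or governance tool rather than in application code.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    PII = 2


# Hypothetical classification of columns in a customer table.
COLUMN_TAGS = {
    "product_category": Sensitivity.PUBLIC,
    "customer_id": Sensitivity.INTERNAL,
    "age_bracket": Sensitivity.INTERNAL,
    "email": Sensitivity.PII,
    "date_of_birth": Sensitivity.PII,
}


def redact(row, clearance, tags=COLUMN_TAGS):
    """Drop every column whose tag is above the reader's clearance.

    Untagged columns are treated as PII, so a new field stays hidden
    until someone classifies it.
    """
    return {
        col: value
        for col, value in row.items()
        if tags.get(col, Sensitivity.PII) <= clearance
    }


row = {"customer_id": 1, "email": "a@example.com", "product_category": "shoes", "notes": "call back"}
internal_view = redact(row, Sensitivity.INTERNAL)
```

Here the internal view keeps `customer_id` and `product_category` but drops `email` and the untagged `notes` field. Defaulting unknown columns to the most sensitive level is a deliberate choice: forgetting to tag a field then hides it rather than exposing it.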

Here is a simple mental model for your data as it enters your systems.