AWS Machine Learning Blog

Redacting PII data at The Very Group with Amazon Comprehend

This is a guest post by Andy Whittle, Principal Platform Engineer – Application & Reliability Frameworks at The Very Group.

At The Very Group, which operates digital retailer Very, security is a top priority in handling data for millions of customers. Part of how The Very Group secures and tracks business operations is through activity logging between business systems (for example, across the stages of a customer order). It is a critical operating requirement and enables The Very Group to trace incidents and proactively identify problems and trends. However, this can mean processing customer data in the form of personally identifiable information (PII) in relation to activities such as purchases, returns, use of flexible payment options, and account management.

In this post, The Very Group shows how they use Amazon Comprehend to add a further layer of automated defense, on top of existing policies to design threat modelling into all systems, that prevents PII from being sent in log data to Elasticsearch for indexing. Amazon Comprehend is a fully managed and continuously trained natural language processing (NLP) service that can extract insight about the content of a document or text.

Overview of solution

The overriding goal for The Very Group’s engineering team was to prevent any PII data from reaching documents within Elasticsearch. To accomplish this and automate removal of PII from millions of identified records per day, The Very Group’s engineering team created an Application Observability module in Terraform. This module implements an observability solution, including application logs, application performance monitoring (APM), and metrics. Within the module, the team used Amazon Comprehend to highlight PII within log data with the option of removing it before sending to Elasticsearch.

Amazon Comprehend was identified as part of an internal platform engineering initiative to investigate how AWS AI services can be used to improve efficiency and reduce risk in repetitive business activities. In keeping with The Very Group’s culture of learning and experimentation, the team reviewed Amazon Comprehend for applicability by building a Java application to learn how it worked with test PII data. The team used code examples in the documentation to accelerate the proof of concept and proved its potential within a day.

The engineering team developed a schematic demonstrating how a PII redaction service could integrate with The Very Group’s logging. It involved developing a microservice to call Amazon Comprehend to detect PII data. The solution worked by passing The Very Group’s log data through a Logstash instance running on AWS Fargate, which cleanses the data using another Fargate-hosted pii-logstash-redaction service based on a Spring Boot Java application that makes calls to Amazon Comprehend to remove PII. The following diagram illustrates this architecture.
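
As a rough illustration of the redaction step in such a microservice, the following sketch masks detected spans in a log line. It assumes the detector reports a type plus begin and end character offsets for each entity (as the Amazon Comprehend DetectPiiEntities operation does), which is not necessarily how The Very Group's service works. `SpanRedactor`, `PiiSpan`, and the `[TYPE]` placeholder format are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SpanRedactor {

    // Hypothetical stand-in for one detected entity: a type plus character
    // offsets (begin inclusive, end exclusive).
    static final class PiiSpan {
        final String type;
        final int begin;
        final int end;

        PiiSpan(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Replaces each span with a "[TYPE]" placeholder, walking the text left to
    // right. Overlapping spans are clipped to the cursor so that no detected
    // character is left in place.
    static String redact(String text, List<PiiSpan> spans) {
        List<PiiSpan> ordered = new ArrayList<>(spans);
        ordered.sort(Comparator.comparingInt((PiiSpan s) -> s.begin));

        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (PiiSpan span : ordered) {
            int begin = Math.max(span.begin, cursor);
            int end = Math.min(span.end, text.length());
            if (begin >= end) {
                continue; // empty, out of range, or already covered
            }
            out.append(text, cursor, begin).append('[').append(span.type).append(']');
            cursor = end;
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up log line and offsets, for illustration only.
        String log = "order 123 placed by jane@example.com from 10.0.0.7";
        List<PiiSpan> spans = Arrays.asList(
                new PiiSpan("EMAIL", 20, 36),
                new PiiSpan("IP_ADDRESS", 42, 50));
        System.out.println(redact(log, spans));
    }
}
```

Walking left to right and clipping to a cursor keeps the offsets valid without mutating the input, and errs on the side of over-redacting when spans overlap.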

Very Group Comprehend PII Redaction Architecture Diagram

The Very Group’s solution takes logs from Amazon CloudWatch and Amazon Elastic Container Service (Amazon ECS) and passes cleansed versions to Elasticsearch to be indexed. Amazon Kinesis is used in the solution to capture and store logs for short periods, with Logstash pulling logs down every few seconds.

Logs are sourced across the many business processes, including ordering, returns, and Financial Services. They include logs from over 200 Amazon ECS apps across test and prod environments in Fargate that push logs into Logstash. Another source is AWS Lambda logs that are pulled into Kinesis and then pulled into Logstash. Finally, a separate standalone instance of Filebeat pulls logs for analysis and puts them into CloudWatch and then into Logstash. The result is that many sources of logs are pulled or pushed into Logstash and processed by the Application Observability module and Amazon Comprehend before being stored in Elasticsearch.

A separate Terraform module provides all the infrastructure required to stand up a Logstash service capable of exporting logs from CloudWatch log groups into Elasticsearch via an AWS PrivateLink VPC endpoint. The Logstash service can also be integrated with Amazon ECS via a FireLens log configuration, with Amazon ECS establishing connectivity over an Amazon Route 53 record. Scalability is built in: Kinesis scales on demand (the team started with fixed shards but is now switching to on-demand usage), and Logstash scales out with additional Amazon Elastic Compute Cloud (Amazon EC2) instances behind a Network Load Balancer (NLB), chosen because of the protocols used by Filebeat. This also enables Logstash to pull logs from Kinesis more effectively.

Finally, the Logstash service consists of a task definition containing a Logstash container and PII redaction container, ensuring the removal of PII prior to exporting to Elasticsearch.

Results

The engineering team was able to build and test the solution within a week, without needing to understand machine learning (ML) or the workings of AI, using Amazon Comprehend video guidance, API reference documentation, and example code. Having seen business value demonstrated so quickly, the business product owners have begun to develop new use cases to take advantage of the service. Some decisions had to be made to enable the solution. Although the platform engineering team knew they could redact the data, they wanted to intercept the logs from the current solution (based on a Fluent Bit sidecar to redirect logs to an endpoint). They decided to adopt Logstash to enable interception of log fields through pipelines to integrate with their PII service (comprising the Terraform module and Java service).

The adoption of Logstash was initially done seamlessly. The Very Group engineering squads are now using the service directly through an API endpoint to put logs straight into Elasticsearch. This has allowed them to switch their endpoint from the sidecar to the new endpoint and deploy it through the Terraform module. The only issue the team had was from initial tests that revealed a speed issue when testing with peak trading loads. This was overcome through adjustments to the Java code.

The following code shows how The Very Group uses Amazon Comprehend to remove PII from log messages. It detects any PII and creates a list of entity types to record. To accelerate development, the code was taken from the AWS documentation and adapted for use in the Java application service deployed on Fargate.

        
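// Imports needed by this method (AWS SDK for Java 2.x).
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.comprehend.model.ContainsPiiEntitiesRequest;
import software.amazon.awssdk.services.comprehend.model.ContainsPiiEntitiesResponse;
import software.amazon.awssdk.services.comprehend.model.EntityLabel;
import software.amazon.awssdk.services.comprehend.model.LanguageCode;

// comprehendClient, minScore, and redactionConfig are assumed to be fields of the enclosing service class.
private List<EntityLabel> getEntityLabels(String logData) {
    // Ask Amazon Comprehend which PII entity types the log text contains.
    ContainsPiiEntitiesRequest request = ContainsPiiEntitiesRequest
            .builder()
            .languageCode(LanguageCode.EN)
            .text(logData)
            .build();

    ContainsPiiEntitiesResponse response = comprehendClient.containsPiiEntities(request);

    // Keep labels above the score threshold that are not excluded by configuration.
    List<EntityLabel> labels = new ArrayList<>();
    if (response != null && response.hasLabels() && !response.labels().isEmpty()) {
        for (EntityLabel el : response.labels()) {
            if (el.score() > minScore && !redactionConfig.getComprehendExcludedTypes().contains(el.nameAsString())) {
                labels.add(el);
            }
        }
    }
    return labels;
}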

The following screenshot shows the output sent to Elasticsearch as part of the PII redaction process. The service generates 1 million records per day, one each time a redaction is made.

PII redacted output record sent to Elasticsearch

The log message is redacted, and the field redacted_entities contains a list of the entity types found in the message. In this case, the example found a URL, but it could have identified any type of PII data, largely based on the built-in types of PII. An additional bespoke PII type for customer account number was added through Amazon Comprehend, but has not been needed so far. Instructions for using engineering squad-level overrides are documented in GitHub.
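
The shape of that output record can be sketched as follows. This is an illustration only. It assumes the whole message is replaced when any retained label is present, because a contains-style check reports entity types but not their positions. The `[REDACTED]` placeholder, the `message` field name, and the `RedactedRecordBuilder` and `Label` names are hypothetical, and labels are modeled as simple name and score pairs. Only the `redacted_entities` field, the score threshold, and the excluded-types override come from this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RedactedRecordBuilder {

    // Hypothetical stand-in for a detected label: entity type name and confidence score.
    static final class Label {
        final String name;
        final double score;

        Label(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    // Builds the document to index. Labels at or below minScore, or listed in
    // excludedTypes (the squad-level override), are ignored. If any label
    // remains, the message is replaced and the retained types are listed.
    static Map<String, Object> build(String message, List<Label> labels,
                                     double minScore, Set<String> excludedTypes) {
        List<String> retained = new ArrayList<>();
        for (Label label : labels) {
            boolean keep = label.score > minScore && !excludedTypes.contains(label.name);
            if (keep && !retained.contains(label.name)) {
                retained.add(label.name);
            }
        }

        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("message", retained.isEmpty() ? message : "[REDACTED]");
        if (!retained.isEmpty()) {
            doc.put("redacted_entities", retained);
        }
        return doc;
    }

    public static void main(String[] args) {
        // Made-up labels and scores, for illustration only.
        List<Label> labels = Arrays.asList(
                new Label("URL", 0.99),
                new Label("NAME", 0.40),
                new Label("ADDRESS", 0.95));
        Set<String> excluded = Collections.singleton("ADDRESS");
        System.out.println(build("GET https://example.com/account", labels, 0.8, excluded));
    }
}
```

In this sketch a squad that does not want a given type redacted adds it to the excluded set, and messages with no retained labels pass through unchanged.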

Conclusion

This project allowed The Very Group to implement a quick and simple solution to redact sensitive PII in logs. The engineering team added further flexibility by allowing overrides for entity types, so that Amazon Comprehend redacts PII according to business needs. In the future, the engineering team is looking into training individual Amazon Comprehend entities to redact strings such as its customer IDs.

The result of the solution is that The Very Group has the freedom to put logs through without needing to worry. It enforces the policy of not having PII stored in logs, thereby reducing risk and improving compliance. Furthermore, metadata about what is being redacted is reported back to the business through an Elasticsearch dashboard, enabling alerts and further action.

Make time to assess AWS AI/ML services that your organization hasn’t used yet and foster a culture of experimentation. Starting simple can quickly lead to business benefit, just as The Very Group proved.


About the Author

Andy Whittle is Principal Platform Engineer – Application & Reliability Frameworks at The Very Group, which operates UK-based digital retailer Very. Andy helps deliver performance monitoring across the organization’s tribes, and has a particular interest in application monitoring, observability, and performance. Since joining Very in 1998, Andy has undertaken a wide variety of roles covering content management and catalog production, stock management, production support, DevOps, and Fusion Middleware. For the past 4 years, he has been part of the platform engineering team.