Scrubbing sensitive data at 180MiB/sec/core

"RING RING RING! RING RING RIIING!"

You jolt awake and silence the pager.

Bleary-eyed you fire up the laptop and realize the full breadth of the problem — the whole payment system is down. Every minute of debugging is costing the company thousands. Drowsiness making way for adrenaline, you try to figure out what’s going on, but it’s a big system and you just can’t seem to understand what’s happening.

Several hours later – and hundreds of thousands of dollars in lost revenue – you finally nail down the root cause: One of the backend services received an unexpected request payload that caused it to exhaust a huge amount of resources, causing a death spiral and the system slowly grinding to a halt.

Observability tools like Distributed Tracing are designed to help with this, but are usually limited to request metadata (HTTP method, duration, Kubernetes pod information, etc) and don't capture request/response payloads as they often contain sensitive data.

You really don't want to be surprised by a credit card number flying by as you're trying to debug an issue. At the same time, having access to request/response payloads would have made the incident a five minute affair.

Designing a better approach

Normally we “solve” this predicament in one of three ways:

Don’t collect the data in the first place
Collect it and limit who has access to it
Force developers to implement a scrubbed version for each data type and write a conversion function.

None of these options really solves the problem — we want all our engineers to be able to debug our systems without risking them being exposed to that information, nor putting the burden on them to write a scrubbing function for every data type.

For Encore's distributed tracing we came up with what we think is a pretty neat approach that leverages static analysis to achieve what we think is a best-of-both-worlds solution. It's naturally open source. Let's take a look at how it works.

Encore's distributed tracing captures request/response payloads sent as JSON payloads by default. While it would be relatively straightforward to scrub valid JSON according to a specification, a common use case of inspecting request payloads is when something goes wrong. Like sending invalid JSON. It would be bad if the scrubbing failed to scrub data because a human called the API with a trailing comma inside a JSON object, for example.

To summarize the requirements:

We need to understand what is considered sensitive
We need to be able to scrub a stream of data as it’s being processed
We want to be able to scrub corrupted/truncated data streams

Understanding what is sensitive

Encore defines APIs in a type-safe way, similarly to gRPC, with struct types representing the request and response payloads. This has the nice property that the structure of the payloads can be statically understood by analyzing the types.

To mark a field as sensitive is done with a Go struct tag. (Other languages might use annotations or macros to achieve the same effect of attaching meta-data to a field.)

For example, consider an API endpoint for placing an order to purchase something. We can easily tag various parts of the request as sensitive:

// PlaceOrderRequest is the API request type for the PlaceOrder endpoint.
type PlaceOrderRequest struct {  
  Card    *CreditCard
  Billing *BillingAddress
  Emails []string `encore:"sensitive"`
}

type CreditCard struct {
  Expiry string
  Number string `encore:"sensitive"`
}

type BillingAddress struct {
  FullName string `encore:"sensitive"`
  Address  string `encore:"sensitive"`
  City     string
  Country  string
}

//encore:api auth method=POST path=/orders
func PlaceOrder(ctx context.Context, req *PlaceOrderRequest) error { /* ... */ }

From this API description, Encore's static analyzer captures all this information and outputs it as a protobuf. What's nice about determining what is sensitive from the code itself is it makes it a local decision. If you had to manually configure data scrubbing in some other system it's easy to forget, whereas it's easy to do at the moment you're writing the relevant code.

Implementing a streaming scrubber

So how to actually do the scrubbing? And how to handle invalid JSON? It turns out the same solution is ideal for both requirements: a streaming parser.

A detour into tokenization

The typical user-facing API for encoding and decoding JSON is: "marshal this data structure as JSON" and "unmarshal this JSON into this data structure", respectively. Simple to understand, yes, but also slow and memory-hungry: the whole data structure needs to be kept in memory.

Under the hood, however, JSON parsers process the incoming bytes in a streaming way to construct a sequence of "tokens" (a process called "tokenization"), before the decoding phase takes over and needs to do a whole bunch of buffering to construct the final object. The transformation looks like this (line by line):

JSON Representation                  Token stream              
                                                                      
{                                    OPEN_BRACE                               
    "key": true,                     LITERAL("key") COLON LITERAL(true) COMMA
    "array": [         ──────▶       LITERAL("array") COLON OPEN_BRACKET      
        1, null                      LITERAL(1) COMMA LITERAL(null)           
    ]                                CLOSE_BRACKET                            
}                                    CLOSE_BRACE

Crucially, the tokenizer doesn't know or even care about JSON syntax rules: it just processes bytes and outputs a stream of tokens according to what it sees. For example, the (very invalid) JSON payload {"foo","bar":3] becomes OPEN_BRACE LITERAL("foo") COMMA LITERAL("bar") COLON LITERAL(3) CLOSE_BRACKET when interpreted as a sequence of tokens. It doesn't matter that "foo" doesn't have a value associated with it, or that the closing square bracket should have been a curly brace.

The implementation of this (the code calls it "scanner") can be found here.

Scrubbing a token stream

To scrub the data we can construct the data scrubbing to operate on this sequence of tokens instead of requiring the fully unmarshalled object, by designing the data structures carefully.

The algorithm works like this.

First, we use the output from the static analysis to build a list of all the fields that need scrubbing. Each list entry is a simplified version of JSONPath, and expresses the sequence of fields and array elements to get from the root of the JSON value (starting with the opening { in our case) to the field that needs scrubbing.

In the PlaceOrderRequest above, the list of fields to scrub looks like this (each line being its own path expression):

[{Kind: ObjectField, Field: "Emails"}]
[{Kind: ObjectField, Field: "Card"},    {Kind: ObjectField, Field: "Number"}]
[{Kind: ObjectField, Field: "Billing"}, {Kind: ObjectField, Field: "FullName"}]
[{Kind: ObjectField, Field: "Billing"}, {Kind: ObjectField, Field: "Address"}]

We can then take these path expressions and merge them into a tree. As you can see we have two lines that both start with {Kind: ObjectField, FieldName: "Billing"}. These get merged, so we end up with a tree structure that looks like this:

                                       ┌──────────────────┐
                                   ┌──▶│Field:  "FullName"│
               ┌────────────────┐  │   │Action: SCRUB     │
          ┌───▶│Field: "Billing"│──┤   └──────────────────┘
          │    │                │  │   ┌──────────────────┐
          │    └────────────────┘  └──▶│Field:  "Address" │
          │                            │Action: SCRUB     │
          │                            └──────────────────┘
          │                                                
┌────┐    │    ┌───────────────┐       ┌──────────────────┐
│Root│────┼───▶│Field: "Card"  │──────▶│Field:  "Number"  │
└────┘    │    │               │       │Action: SCRUB     │
          │    └───────────────┘       └──────────────────┘
          │                                                
          │                                                
          │                                                
          │    ┌────────────────┐                          
          └───▶│Field:  "Emails"│                          
               │Action: SCRUB   │                          
               └────────────────┘

See groupNodes implementation.

Then, when processing a token stream we keep track of our current position in this scrub tree. Starting with the "Root" node as the active scrub node, we begin iterating over the token stream.

Whenever we encounter an object key, we look at the children of the active scrub node to see if the object key exists as a child node. So when we encounter the token representing the object field LITERAL("Card") we notice there exists a child node in our scrub tree matching that. We recurse, updating the internal state to set the Card node as the selected node.

When we continue reading the token stream we encounter the token OBJECT_KEY("Number") which we again identify as a child of our active scrub node in this case it's marked with Action: SCRUB so instead of recursing we "skip" the whole value (including any nested objects or arrays).

Since all tokens include position information (offsets into the byte stream where the token starts and ends), skipping the value tells of exactly where the value started and ended. We record these positions as a byte segment that must be redacted.

When we then encounter the next CLOSE_BRACE token we know the Card object has been completed, so we return back to the caller (bubble up one recursion level), updating the active scrub node to again be the Root node. We then proceed processing key-value pairs.

Whenever we encounter an object field that's not listed as a child to the active scrub node we can skip over the whole value with no additional processing.

When we've processed the whole byte slice we're armed with a list of non-overlapping (start, end) byte offsets that represents the byte segments to redact. From that it's straightforward to perform the redaction by replacing the bytes. By pre-computing the new length we can do so in a single allocation.

The end result is a JSON byte slice that looks something like this:

{
  "Emails": "[REDACTED]",
  "Billing": {
    "FullName": "[REDACTED]",
    "Address":  "[REDACTED]",
    "City":     "Stockholm",
    "Country":  "Sweden"
  },
  "Card": {
    "Expiry": "04/25",
    "Number": "[REDACTED]"
  }
}

Handling invalid JSON

By virtue of implementing this based on the token stream it's quite straightforward to handle invalid JSON. The core idea is to try to guess the user's intent in the presence of invalid JSON.

For example, the tokenizer tries to guess whether a token represents an object key or a value, based on surrounding context. Consider {"foo", "bar": true}. Is "foo" a key or a value? If "foo" is a key, is "bar" a value? Seemingly not since it's proceeded by the colon.

After plenty of experimentation we came up with a heuristic that is easy to reason about and matches the user's intent quite well. We treat the first literal we encounter as an object key. Then we alternate, treating the next as the value. When we see a colon we update the previous literal to be a key and the next to be a value.

In the example above, "foo" and "bar" would be considered keys and true the value for the "bar" key. We also perform other error recovery strategies, like considering strings to end at newlines, treating closing ] and } as equivalent so if they're mismatches it doesn't matter, ignoring commas entirely, and more.

You can see the full suite of tests here and here, including the fuzz testing to make sure the code handles arbitrary JSON input without crashing.

This approach is also much faster than actually parsing JSON. Our benchmarks indicate it's almost 50% faster than standard JSON parsing, and a single CPU core can scrub ~180MiB data/sec. That's without any serious optimization effort put in, unlike Go's standard library JSON parser.

Conclusion

Now we have a way to scrub a stream of data, we can use this to scrub the request and response bodies of Encore API's before sending them to Distributed Tracing or logging systems from the application. This works by computing the JSONPath expression based on the static analysis description (see implementation and tests).

This means that we can safely send our data to these systems without worrying about exposing sensitive data.

Future improvements we plan to make include scrubbing misspelt fields using the Levenshtein distance and detecting incorrect casing (like using camelCase when the API expects snake_case keys). We're also experimenting with suggesting redacting fields based on the field name (like Email for PII, or Password for hopefully-obvious reasons). Additional suggestions are always welcome.

You can checkout the full implementation of Encore's scrubber here, and the package documentation here.

Scrubbing sensitive data at 180MiB/sec/core

Redacting sensitive data from traces with static analysis and JSON tokenization.

Scrubbing sensitive data at 180MiB/sec/core

Redacting sensitive data from traces with static analysis and JSON tokenization.

Designing a better approach

Understanding what is sensitive

Implementing a streaming scrubber

A detour into tokenization

Scrubbing a token stream

Handling invalid JSON

Conclusion

More Articles