Automating database obfuscation for non-production environments – Part 1

At Skills Matter, we’ve recently changed a lot of our practices to make our processes faster and better. One of these changes was automating the obfuscation of a copy of our production database, which is then distributed to our developers and used in staging environments (we recently started using review apps on Heroku).

Obfuscating a database means making all sensitive data anonymous.

We used to do this manually, and it used to take a lot of time (hours). A developer had to:

  • Download a dump
  • Restore it into their local database
  • Run an obfuscation `rake` task from our Rails app that would work at database level
  • Export a dump and restore it in our staging environment

After all that, whenever someone wanted a fresh dump to restore their local database, they would go and download one from staging. Tedious. And if something is tedious, people tend not to do it; in our case, staging data ended up lagging far behind our actual production data. Not only that, but we wanted to introduce [review apps] (which by the way are amazing) into our workflow, so having fresh obfuscated copies of our database was a necessity.

There were three points we wanted to address when looking into automating this:

  • speed: our original `rake` task was way too slow
  • portability: ideally we wanted to deploy this process somewhere, as easily as possible, and a full Rails app didn’t feel like the perfect candidate
  • security: we didn’t want this to happen on a developer’s machine; production data shouldn’t leave production

When it comes to speed, we found the biggest potential time saving in obfuscating the dump file directly, instead of going through Postgres and ActiveRecord. This is doable here because when obfuscating a database you don’t want to touch relationships, only the data itself, and in our case looking at one row’s data at a time was good enough.
As for portability, we went with [Go] because, well, it’s [great at that].

And so misty was born. misty is a rather simple Go package that lets you specify targets for updating row values and rules for if/when to delete a row altogether, and then surfs through a plain-text Postgres dump file applying them. misty has no opinions about those targets; they are intended to be as flexible as possible:

type TargetColumn struct {
    // The name of the column to target
    Name string
    // A function that receives the current content of that column
    // and returns the new value for that column.
    Value func([]byte) []byte
}

This way the user (us) has complete freedom in how to provide new values for the columns they want to change/obfuscate. Here are a couple of examples of what you can achieve:

// Static values
targetColumn := &misty.TargetColumn{
    Name: "username",
    // turn all usernames into "rentziass"
    Value: func(_ []byte) []byte {
        return []byte("rentziass")
    },
}

// Change if a condition is met
targetColumn = &misty.TargetColumn{
    Name: "email",
    Value: func(oldVal []byte) []byte {
        // if `email` is a Skills Matter one keep it as it is
        if strings.HasSuffix(string(oldVal), "@skillsmatter.com") {
            return oldVal
        }
        // otherwise obfuscate it
        return []byte("obfuscated@mail.com")
    },
}

// Incremental values
userCounter := 0
targetColumn = &misty.TargetColumn{
    Name: "username",
    Value: func(_ []byte) []byte {
        userCounter++
        username := fmt.Sprintf("user_%v", userCounter)
        return []byte(username)
    },
}

And the same goes for row deletion rules, except that your functions should return a bool rather than a new value. Here’s an example of everything put together:

package main

import (
    "bytes"
    "log"
    "os"

    "github.com/icrowley/fake"
    "github.com/rentziass/misty"
)

func main() {
    f, err := os.Open("dump.sql")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    target := &misty.Target{
        TableName: "public.users",
        Columns: []*misty.TargetColumn{
            {
                Name: "username",
                // we use fake to generate random usernames here
                Value: obfuscateHandle,
            },
        },
        DeleteRowRules: []*misty.DeleteRule{
            {
                ColumnName: "email",
                // delete the row if the value of column 'email' is 'some@mail.com'
                ShouldDelete: func(b []byte) bool {
                    return bytes.Equal(b, []byte("some@mail.com"))
                },
            },
        },
    }

    err = misty.Obfuscate(f, os.Stdout, []*misty.Target{target})
    if err != nil {
        log.Println(err)
    }
}

func obfuscateHandle(_ []byte) []byte {
    return []byte(fake.UserName())
}

This goes through a dump.sql file, changing all values in the username column of the public.users table to a random string (we use icrowley/fake for this, a Go port of something we’re very used to in the Ruby world) and deleting all the users whose email is some@mail.com. It then writes the obfuscated result to STDOUT.

We’ve been using this in production for a few weeks now with great results: obfuscation went down from a couple of hours to ~20 minutes, and it can run pretty much anywhere (Go and Docker are a match made in heaven).
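Since the whole thing compiles to a single static binary, containerising it is straightforward. A minimal sketch of what such an image could look like (the image names, paths, and multi-stage layout here are assumptions, not our actual setup):

```dockerfile
# Build stage: compile the obfuscator as a static binary.
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /obfuscate .

# Final stage: ship only the binary, nothing else.
FROM gcr.io/distroless/static
COPY --from=build /obfuscate /obfuscate
ENTRYPOINT ["/obfuscate"]
```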

I have plans to continue working on misty in my free time: I’d like to make the whole row available inside Value functions, and to make the whole process work on different tables concurrently. If you want to contribute, you’re more than welcome!

The next part in this series will be about how we got to a fully automated process that runs on AWS every morning at 6am. Nothing beats fresh databases and a good cup of coffee in the morning ☕️.