Automate database obfuscation for non-production environments – Part 1

At Skills Matter, we’ve recently changed a lot of practices to make our processes faster and better. One of these was automating the obfuscation of a copy of our production database, which is then distributed to our developers and used in staging environments (we recently started using review apps on Heroku).

Obfuscating a database means making all sensitive data anonymous.

We used to do this manually, and it used to take a lot of time (hours). A developer had to:

  • Download a dump
  • Restore it into their local database
  • Run an obfuscation `rake` task from our Rails app that would work at database level
  • Export a dump and restore it in our staging environment

After all that, whenever someone wanted a fresh dump to restore their local database, they would go and download one from staging. Tedious. And if something is tedious, people tend not to do it, which in our case meant staging data ending up way behind our actual production data. Not only that, but we wanted to introduce [review apps] (which, by the way, are amazing) into our workflow, so having fresh obfuscated copies of our database was a necessity.

There were three points we wanted to address when looking into automating this:

  • speed: our original `rake` task was way too slow
  • portability: ideally we wanted to deploy this process somewhere, as easily as possible, and a full Rails app didn’t feel like the perfect candidate
  • security: we didn’t want this to happen on a developer’s machine; production data shouldn’t leave production

When it comes to speed, we found the biggest potential time saving to be in obfuscating the dump file directly, instead of going through Postgres and ActiveRecord. This is doable here because when obfuscating a database you don’t want to touch relationships, only the data itself, and in our case being aware of just a single row’s data was good enough.
As for portability, we went with [Go] because, well, it’s [great at that].
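
To make that idea concrete, here’s a minimal sketch (not misty’s actual implementation): in a plain-text pg_dump file every table’s rows live inside a `COPY ... FROM stdin;` block, one tab-separated row per line, terminated by `\.`, so they can be rewritten as the file streams through without ever touching a database. The table name and column position below are just placeholders.

// A minimal sketch of the idea, not misty's code: stream a plain-text
// pg_dump and rewrite one column of every row in a table's COPY block.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    inCopy := false
    for scanner.Scan() {
        line := scanner.Text()
        switch {
        case strings.HasPrefix(line, "COPY public.users "):
            inCopy = true // row data for this table starts on the next line
        case inCopy && line == `\.`:
            inCopy = false // `\.` marks the end of the COPY block
        case inCopy:
            // Each row is one line of tab-separated column values.
            cols := strings.Split(line, "\t")
            if len(cols) > 1 {
                cols[1] = "obfuscated" // hypothetical: assume column 1 is the username
                line = strings.Join(cols, "\t")
            }
        }
        fmt.Println(line)
    }
}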

And so misty was born. misty is a rather simple Go package that allows you to specify targets for updating row values and rules for when to delete a row altogether, and then surfs through a plain-text Postgres dump file and applies them. misty has no opinions about those targets; they were intended to be as flexible as possible:

type TargetColumn struct {
    // The name of the column to target
    Name string
    // A function that receives the current content of that column
    // and returns the new value for that column.
    Value func([]byte) []byte
}

This way the user (us) has complete freedom in how to provide new values for the columns they want to change/obfuscate. Here are a couple of examples of what you can achieve:

// Static values
targetColumn := &misty.TargetColumn{
    Name: "username",
    // turn all usernames into "rentziass"
    Value: func(_ []byte) []byte {
        return []byte("rentziass")
    },
}

// Change if a condition is met
targetColumn = &misty.TargetColumn{
    Name: "email",
    Value: func(oldVal []byte) []byte {
        // if `email` is a Skills Matter one keep it as it is
        if strings.HasSuffix(string(oldVal), "@skillsmatter.com") {
            return oldVal
        }
        // otherwise obfuscate it
        return []byte("obfuscated@mail.com")
    },
}

// Incremental values
userCounter := 0
targetColumn = &misty.TargetColumn{
    Name: "username",
    Value: func(_ []byte) []byte {
        userCounter++
        username := fmt.Sprintf("user_%v", userCounter)
        return []byte(username)
    },
}

And the same goes for row deletion rules, except that your functions should return a bool rather than a new value. Here’s an example of everything put together:

package main

import (
    "bytes"
    "log"
    "os"

    "github.com/icrowley/fake"
    "github.com/rentziass/misty"
)

func main() {
    f, err := os.Open("dump.sql")
    if err != nil {
        panic(err)
    }

    target := &misty.Target{
        TableName: "public.users",
        Columns: []*misty.TargetColumn{
            {
                Name: "username",
                // we use fake to generate random usernames here
                Value: obfuscateHandle,
            },
        },
        DeleteRowRules: []*misty.DeleteRule{
            {
                ColumnName: "email",
                // delete the row if the value of column 'email' is 'some@mail.com'
                ShouldDelete: func(b []byte) bool {
                    return bytes.Equal(b, []byte("some@mail.com"))
                },
            },
        },
    }

    err = misty.Obfuscate(f, os.Stdout, []*misty.Target{target})
    if err != nil {
        log.Println(err)
    }

}

func obfuscateHandle(_ []byte) []byte {
    return []byte(fake.UserName())
}

This goes through a dump.sql file, changing all values in the username column of the public.users table to a random string (we use icrowley/fake for this; it’s a Go port of something we’re very used to in the Ruby world) and deleting all the users whose email is some@mail.com. It then outputs the obfuscated result to STDOUT.

We’ve been using this in production for a few weeks now with some great results: obfuscation went down from a couple of hours to ~20 minutes, and can be run pretty much everywhere (Go and Docker are a match made in heaven).

I have plans to continue working on misty in my free time: I’d like to make the whole row available inside Value functions, and make the whole process work on different tables concurrently. If you want to contribute, you’re more than welcome!

The next part in this series will be about how we got to a fully automated process that runs on AWS every morning at 6am. Nothing beats fresh databases and a good cup of coffee in the morning ☕️.

The perks of storing your dotfiles in a repository

Lately I have been through a computer mayhem: my honorable MacBook Pro left me (may he find peace in the afterlife) and at work I’ve been changing machines like clothes for at least a month. As a developer, you know that nowadays you basically don’t lose many things in that kind of situation, at least application-wise (the beauty of the cloud, they say). And all in all, if you were using the lost/dead/burnt (what are you doing to those poor machines??) computer mostly for developing stuff, cloud or not, there are good chances all of your projects are stored in repositories hosted on GitHub, Bitbucket or wherever it pleases you. If you aren’t using any form of version control for your projects you should really stop reading this and start questioning the meaning of life. No, seriously, go check out Git and enjoy developing once again.

Anyway, what should frighten you in case your machine is lost is the most delicate stuff you developed during all of your hard work: your dotfiles. I’m speaking of those tiny little files that tell Git who you are, define that precious alias linked to the most complex command you now don’t even remember, and hold all of your configuration. Not to mention that if you are a Vim user, losing your .vimrc may drive you crazy, depending on how much time you invested in it. (Nah, it will just drive you crazy.)

tl;dr You don’t want to lose your dotfiles.

Needless to say, there is a very simple solution that can prevent you from going through this painful experience: storing your dotfiles in a Git repository.

If you’re already on board with this you may ask “Who wouldn’t do that?”, and you’d be surprised by the answer. I personally started taking care of this a couple of years ago, and can’t remember how it was before. Not to mention that if you store those files in a repo it becomes super easy to port them to different environments, whether it’s a server you manage or a colleague with whom you want to share some or all of your configuration. They can fork your repo and add their personal flavour, just as with any other project under version control.

Let me show you how I achieved this, though I’m sure there are many other ways.

Start by creating a folder named dotfiles in your home directory (~/), then start moving all the files you want to include in the repository into it, removing the . from their names. Do keep in mind that if you want to make this repo public you should avoid including files that hold personal or business-sensitive information.
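
Moving things over looks something like this (the file names here are purely illustrative):

$ mkdir ~/dotfiles
$ mv ~/.gitconfig ~/dotfiles/gitconfig
$ mv ~/.vimrc ~/dotfiles/vimrc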

Really, be extra careful about what you include in your repository if your plans are to make it public!

To turn this folder into a repo, in case you’re not aware, run

$ git init
$ git add --all
$ git commit -m "First commit"

Now you may argue that there are no more usable dotfiles in your home directory, and you’d be right! The next step involves a tool called RCM, made by thoughtbot, which happens to be a management suite for dotfiles. To install it on macOS run

$ brew tap thoughtbot/formulae
$ brew install rcm

If you’re on a Debian-based system run

$ wget -qO - https://apt.thoughtbot.com/thoughtbot.gpg.key | sudo apt-key add -
$ echo "deb http://apt.thoughtbot.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/thoughtbot.list
$ sudo apt-get update
$ sudo apt-get install rcm

We’re about to use RCM to create dotfile symlinks from your home directory to your ~/dotfiles directory. Before doing this, add a file called rcrc to your ~/dotfiles directory. The content of rcrc should be as follows:

EXCLUDES="README.md LICENSE"
DOTFILES_DIRS="$HOME/dotfiles"

This will tell RCM what files NOT to symlink and where your dotfiles are now stored. Once that’s done you can run:

$ env RCRC=$HOME/dotfiles/rcrc rcup

After this you should be back to having all of your dotfiles in your home directory, but those are just symlinks to files that are now stored in a Git repository. To have it backed up remotely, if you’re not familiar with Git, you can create a free public repo on GitHub.
After that’s done, your loyal dotfiles will be able to follow you wherever you go, and if you keep your repo public you may even inspire others! (How do you think I came up with this idea a couple years ago? 🍻)
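
If you’ve never pushed a repo to GitHub before, hooking up the remote takes just a couple of commands (the URL here is a placeholder for your own repository):

$ git remote add origin git@github.com:<your-username>/dotfiles.git
$ git push -u origin master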

Whenever you edit your dotfiles, remember to track those changes in the repo

$ git add --all
$ git commit -m "Add super useful alias"
$ git push

And that’s it! If you want to take a look at a final result you can check out my repository. Don’t judge me too hard: there’s even an alias that starts an audio clip about a famous wrestler… See if you can find it! 😎 (We just find it fun in the office from time to time.)

Hope you enjoyed this; it’s incredible in how many simple ways a repo can save your life!

Local microservices development with Edward

Lately I’ve started playing around with microservices concepts and Kubernetes, which are really intriguing to me. I love the ideas behind those things, so I began getting my hands dirty with a personal project.

Wait a minute, the dude is making microservices alone? Duh. Yeah, I know, but I’m a very curious person passionate about learning.

Anyway, I built my first two services using Go and threw gRPC in the mix, ‘coz why only learn one thing at a time? Those services were:

  • A database interface
  • A front end for users

Nothing fancy at all. 👍🏼
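
For context, each of them was just a tiny Go binary. Here’s a hypothetical sketch of what the web service’s entry point could look like (none of this comes from the real project; it’s only meant to show the kind of program Edward will be building and launching below):

// Hypothetical sketch of the "web" service: a tiny HTTP front end.
package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // In the real project this handler would talk to the database
        // interface over gRPC; here it just says hello.
        fmt.Fprintln(w, "Hello from the web service!")
    })

    log.Println("web service listening on :3000")
    log.Fatal(http.ListenAndServe(":3000", nil))
}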

Then I wrote the two Dockerfiles, built the respective Docker images, launched Minikube, wrote a Kubernetes config file and applied it: awesome! Nope, apparently something went wrong. No problem, right? It’s part of the beauty of experimenting with things. Investigating the cause of the issue I found a typo in one of the services, which was no big deal. I fixed the typo, rebuilt the Docker image, applied the changes to Minikube and everything was looking good, except for one thing: while I was enjoying the way this all works, I soon came to the point where I found it not very practical for development (not to mention that I was losing the advantage of Go’s blazing-fast compile times).

In need of a better solution (again, from a development point of view), I stumbled across Edward, which claims to be “A tool for managing local microservice instances”. How could I not give it a try?

Edward requires Go 1.6; once you have that, you can install the tool with

$ go get github.com/yext/edward

Cool, now all it needs to do its job is an edward.json configuration file, which is composed of three main sections: imports, groups and services. services are the building blocks through which, guess what, you define your services, providing a path, build and launch commands, environment variables for that particular piece of software and many other things (seriously, take a look at the docs, you can even warm up your services).

So I wrote the definition for both my dummy services, and it looked like this:

# edward.json
{
  "imports": [],
  "groups": [],
  "services": [
    {
      "name": "web",
      "path": "./web",
      "commands": {
        "build": "go build",
        "launch": "./web"
      }
    },
    {
      "name": "db_interface",
      "path": "./db_interface",
      "commands": {
        "build": "go build",
        "launch": "./db_interface"
      }
    }
  ]
}

Now we’re ready to start: let’s begin by spinning up the web service with

$ edward start web

web > Build: [OK] (1.641s)
web > Start: [OK] (182.167ms)

Then on with the database interface

$ edward start db_interface

db_interface > Build:  [OK] (2.86s)
db_interface > Start:  [OK] (167.675ms)

Things are looking good so far, but how about checking the status of our services?

$ edward status

+--------------+---------+-------+-------+----------+---------+---------------------+
|     NAME     | STATUS  |  PID  | PORTS |  STDOUT  | STDERR  |     START TIME      |
+--------------+---------+-------+-------+----------+---------+---------------------+
| db_interface | RUNNING | 78429 | 50051 | 0 lines  | 2 lines | 2017-07-28 12:32:46 |
| web          | RUNNING | 78321 | 3000  | 11 lines | 1 lines | 2017-07-28 12:30:00 |
+--------------+---------+-------+-------+----------+---------+---------------------+

But what if we wanted to manage more than one service at once? That’s what groups are for, remember them? Let’s create one for this awesome app, then, by adding a group to our configuration:

# edward.json
{
  "imports": [],
  "groups": [
    {
      "name": "front",
      "children": ["web", "db_interface"]
    }
  ],
  "services": [
    {
      "name": "web",
      "path": "./web",
      "commands": {
        "build": "go build",
        "launch": "./web"
      }
    },
    {
      "name": "db_interface",
      "path": "./db_interface",
      "commands": {
        "build": "go build",
        "launch": "./db_interface"
      }
    }
  ]
}

Now the two services are manageable together through the front group, and we can, for instance, restart them both at once by running

$ edward restart front

front > web > Stop:  [OK] (223.307ms)
front > web > Build:  [OK] (1.463s)
front > web > Start:  [OK] (170.654ms)
front > db_interface > Stop:  [OK] (206.495ms)
front > db_interface > Build:  [OK] (2.825s)
front > db_interface > Start:  [OK] (170.689ms)

This is way faster than the build-the-Docker-image-then-apply-to-Minikube process, but there is still another Edward feature that came in very handy for me, and that is the ability to watch service folders and automatically rebuild them if a change occurs. To use it, simply add the watch key to the configuration, specifying the folders you want Edward to keep an eye on and those that need to be excluded. Let’s add this to the web service:

# edward.json
{
  "imports": [],
  "groups": [
    {
      "name": "front",
      "children": ["web", "db_interface"]
    }
  ],
  "services": [
    {
      "name": "web",
      "path": "./web",
      "watch": {
        "include": ["."]
      },
      "commands": {
        "build": "go build",
        "launch": "./web"
      }
    },
    {
      "name": "db_interface",
      "path": "./db_interface",
      "commands": {
        "build": "go build",
        "launch": "./db_interface"
      }
    }
  ]
}

And that’s it: now this super simple microservices infrastructure is also simple to manage, and the web service will rebuild and relaunch whenever a change occurs within its folder.
Overall, Edward was such a great discovery. If you want to see more of the things it can do for you (I advise you do), check out the talk by Tom Elliott at GopherCon 2017 and the Edward documentation.

How to explain Open Source to your grandma

If you’re involved in one of these new digital jobs (e.g. software development, as I am), you may have found yourself in the situation in which you have to explain some aspects of it to your grandma. Since I represent the only ‘everyday chat’ for my grandma, I have. Many times.

They say

If you understand it, you can explain it simply.

Cool story, bro. But try to replace your usual target with your grandma. That takes the challenge to a whole new level.

Turning something so ethereal, something that cannot be touched, into something understandable by an 85-year-old, who has only ever known a job as something rather tangible, is not an easy task. But to me, it’s definitely something worth training on.

One day I was really excited about something related to Open Source (ain’t we all? 🍻), so when the everyday grandma chat time came, I knew exactly what I wanted to talk about.

So it began. ⚡️

Sure enough, I was not allowed to use the standard definition of “open source”. Wiki states:

Open-source software is computer software with its source code made available with a license in which the copyright holder provides the rights to study, change, and distribute the software to anyone and for any purpose.

Yo, grandma, are you still following? I guess not.

And let’s be honest, that definition doesn’t do justice to what open source really is, to me at least. It’s sharing, it’s improving something all together, just for the sake of the thing itself; it’s rather romantic if you think about it. Alright, back to grandma.

So, how do you make open source physical, tangible, so that she can understand it? I usually try to disguise my job as a more conventional one, whether it be plumber or carpenter, the latter being the one I chose in my attempt to explain the magical world of open source.

Granny, let’s say I’m a carpenter. While I’m working on the furniture for a dining room someone asked me to produce, I stumble upon the need to invent something that allows people in the room to sit down. So I come up with a chair. It’s a rather simple chair, with no armrests and nothing fancy, but it suits my need: allowing people to sit down. Yay, problem solved 👍🏼. Now, I can keep my brilliant invention secret and be jealous of my precious (💍, get it?), OR I can share it with the world of carpenters.

Why would you share it?

Because it could help someone solve the same problem I encountered. Because after, say, a few days, or months, or even years, my chair could come back to me with a good ol’ pair of armrests, made by another carpenter out there in the world. Someone who borrowed my chair, worked on it and shared their improvements back in turn. That can happen over and over: the more problems my chair solves, the more people will work to improve it, maybe giving it a more comfortable back or more functional legs. That way we can develop, together, the best possible chair for that particular situation. And because it feels so good to take part in something like that.

It may be hard to believe, but that day my grandma understood the concept of Open Source. Now, that may not be the most precise and technical explanation ever provided about it, but it made her smile, and took her a step closer to understanding why I do love what I do and “why do you stay up at nights doing what you do in your office?”. That makes me smile. 😊

Besides, explaining everyday software development stuff to my grandma has really strengthened my ability to teach things and speak about what I do. Go give it a try: if you can get your grandma on board, you’ll see that speaking about your job with your boss or your sales accountant will become a breeze (and they’ll appreciate it, I can guarantee 🍻).