Best Practices: Generate Synthetic Data

  03 Jun 2018


When building data pipelines it is essential that we have some data available for development. In some situations you might be able to use a subset of the production data, but in other scenarios, e.g. when no production data is yet available or the data is too sensitive, you are better off genereting synthetic data.

The task to generate synthetic data shouldn’t be painful - we don’t want to do this by hand. There are various open source tools available that can do the job. One of the more advanced ones is FakeIt.js.

In this very short tutorial we will create a simple data model based on Users putting in Orders for Products. The aim of this tutorial is to show you how easy it is to create synthetic datasets with relationships with FakeIt.js.

The great advantage of FakeIt.js is that the data models can be defined as YAML files, so even users who are not that familiar with scripting and programming language can theoretically create synthetic data (as long as they are willing to learn and write a little bit of JavaScript).

To generate the actual data, FakeIt makes use popular modules FakeJS and ChangeJS.

FakeIt.js Pros:

  • You can define everything in a YAML … which makes it easier for users that are not hard core developers
  • You can defined relationships
  • Apart from standard table structures it also supports JSON
  • Various output formats: JSON, XML, CSV etc

Installation

This is pretty straight forward:

npm install fakeit --global

Data Types used in the Model

types data type description
number, long, integer 0 Converts result to number using parseInt
double, float 0 Converts result to number using parseFloat
string ’’ Converts result to a string using result.toString()
boolean, bool false Converts result to a boolean if it’s not already, if result is a string and is ‘false’, ‘0’, ‘undefined’, ‘null’ it will return false
array [] returns the result from the build loop
object, structure {} returns the result from the build loop
null, undefined, * (anything else) null returns the result from the build loop

Example with Relationships

We will create three models, one of which (Orders) will have the relationships defined. I will not go into much detail, since I think the YAML manifests are rather easy to understand.

The Products model:

name: Products
type: object
key: product_id
properties:
  product_id:
    type: string
    description: Unique identifier representing a specific product
    data:
      build: faker.random.uuid()
  product_name:
    type: string
    description: Display name of product.
    data:
      build: faker.commerce.productName()
  price:
    type: double
    description: The product price
    data:
      build: chance.floating({ min: 0, max: 150, fixed: 2 })

Testing the output:

$ fakeit console --count 3 --format csv products.yaml
┌──────────────────────────────────────┬───────────────────────┬────────┐
│ product_id                           │ product_name          │ price  │
├──────────────────────────────────────┼───────────────────────┼────────┤
│ 0d80b0d1-4a25-47aa-ae9b-409cba875b51 │ Refined Cotton Salad  │ 124.07 │
├──────────────────────────────────────┼───────────────────────┼────────┤
│ 578980da-c1d1-440b-9b6a-7ed934dbae2a │ Ergonomic Metal Bacon │ 91.93  │
├──────────────────────────────────────┼───────────────────────┼────────┤
│ 9d9b4a55-f300-4f1f-bec2-4af8a055c6b3 │ Ergonomic Metal Mouse │ 20.48  │
└──────────────────────────────────────┴───────────────────────┴────────┘

Let’s create our data structure for the Users:

name: Users
type: object
key: user_id
properties:
  user_id:
    type: integer
    description: An auto-incrementing number
    data:
      build: document_index
  first_name:
    type: string
    description: The users first name
    data:
      build: faker.name.firstName()
  last_name:
    type: string
    description: The users last name
    data:
      build: faker.name.lastName()
  username:
    type: string
    description: The username
    data:
      build: faker.internet.userName()
  password:
    type: string
    description: The users password
    data:
      build: faker.internet.password()
  email_address:
    type: string
    description: The users email address
    data:
      build: faker.internet.email()
  created_on:
    type: String
    description: A date of when the user was created
    data:
      build: faker.date.past().toISOString().slice(0,10)

Testing the output:

$ fakeit console --count 3 --format csv users.yaml
┌─────────┬────────────┬───────────┬──────────────────┬─────────────────┬────────────────────────────┬────────────┐
│ user_id │ first_name │ last_name │ username         │ password        │ email_address              │ created_on │
├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤
│ 0       │ Virgie     │ Lockman   │ Raul79           │ zpD9ZO8cLIUrSdr │ Matteo_Hettinger@yahoo.com │ 2017-09-14 │
├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤
│ 1       │ Alyson     │ Dooley    │ Royal_Ledner     │ Aak5Yfay7ENv5hc │ Jerrell.Hahn37@hotmail.com │ 2017-10-17 │
├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤
│ 2       │ Roberta    │ Kautzer   │ Leslie_Nikolaus1 │ eXgSu3N0h7Y46aQ │ Joanne_Torp@yahoo.com      │ 2017-09-03 │
└─────────┴────────────┴───────────┴──────────────────┴─────────────────┴────────────────────────────┴────────────┘

Next we create the Orders data model, which has a relationship with Users and Products, which is expressed via dependencies. To e.g. access a random user id from the Users model from within the Orders model, we can use this approach:

faker.random.arrayElement(documents.Users).user_id;

Here is the model:

name: Orders
type: object
key: order_id
data:
  dependencies:         <== HERE WE DEFINE RELATIONSHIPS
    - products.yaml
    - users.yaml
properties:
  order_id:
    type: integer
    description: The order_id
    data:
      build: document_index + 1
  user_id:
    type: integer
    description: The user_id that placed the order
    data:
      build: faker.random.arrayElement(documents.Users).user_id;
  order_date:
    type: string
    description: An date of when the order was placed
    data:
      build: faker.date.past().toISOString().slice(0,10)
  order_status:
    type: string
    description: The status of the order
    data:
      build: faker.random.arrayElement([ 'Pending', 'Processing', 'Cancelled', 'Shipped' ])
  billing_name:
    type: string
    description: The name of the person the order is to be billed to
    data:
      build: `${faker.name.firstName()} ${faker.name.lastName()}`

Testing the output:

$ fakeit console --count 6 --format csv orders.yaml
┌──────────┬─────────┬────────────┬──────────────┬───────────────────┐
│ order_id │ user_id │ order_date │ order_status │ billing_name      │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 1        │ 1       │ 2018-04-23 │ Processing   │ Arturo Wilkinson  │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 2        │ 2       │ 2017-07-12 │ Pending      │ Della Schamberger │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 3        │ 4       │ 2018-04-02 │ Processing   │ Emma Nolan        │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 4        │ 5       │ 2017-08-01 │ Cancelled    │ Zion Balistreri   │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 5        │ 2       │ 2017-12-14 │ Cancelled    │ Tevin Wolf        │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 6        │ 3       │ 2018-02-16 │ Shipped      │ Allene Koss       │
└──────────┴─────────┴────────────┴──────────────┴───────────────────┘

So in this short while we created synthetic data for three tables and even defined releationships between them! Naturally FakeIt.js offers a command to write the results out to files:

fakeit directory --count 25 --format csv --verbose output/ users.yaml

Taking it to the next level

While FakeIt.js can load data directly into Couchbase, no other database or document store is currently supported. This doesn’t stop anyone though extending that functionality:

We can just another property to the data points called something like ddl_type and define column type there. Then a simple script could pick this up (among other properties) and create the DDL statements for us.

Sources

comments powered by Disqus