When building data pipelines, it is essential to have some data available for development. In some situations you might be able to use a subset of the production data, but in other scenarios, e.g. when no production data is available yet or the data is too sensitive, you are better off generating synthetic data.
Generating synthetic data shouldn’t be painful, and we don’t want to do it by hand. There are various open source tools available that can do the job. One of the more advanced ones is FakeIt.js.
In this very short tutorial we will create a simple data model based on Users placing Orders for Products. The aim is to show how easy it is to create synthetic datasets with relationships using FakeIt.js.
The great advantage of FakeIt.js is that data models are defined as YAML files, so even users who are not very familiar with scripting and programming languages can, in theory, create synthetic data (as long as they are willing to learn and write a little bit of JavaScript).
To generate the actual data, FakeIt makes use of the popular modules Faker.js and Chance.js.
FakeIt.js Pros:
- You can define everything in YAML, which makes it easier for users who are not hard-core developers
- You can define relationships
- Apart from standard table structures it also supports JSON
- Various output formats: JSON, XML, CSV etc.
Installation
This is pretty straightforward:
npm install fakeit --global
Data Types used in the Model
types | data type | description |
---|---|---|
number, long, integer | 0 | Converts result to number using parseInt |
double, float | 0 | Converts result to number using parseFloat |
string | '' | Converts result to a string using result.toString() |
boolean, bool | false | Converts result to a boolean if it’s not already; if result is a string and is 'false', '0', 'undefined' or 'null', it will return false |
array | [] | returns the result from the build loop |
object, structure | {} | returns the result from the build loop |
null, undefined, * (anything else) | null | returns the result from the build loop |
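To make the conversion rules in the table a bit more concrete, here is a minimal JavaScript sketch of how a build result might be coerced according to its declared type. The function name `coerceResult` is hypothetical, and this is not FakeIt's actual implementation; it only mirrors the behaviour described in the table.

```javascript
// Hypothetical helper that mirrors the conversion rules from the table.
// This is an illustration only, not FakeIt's internal code.
function coerceResult(type, result) {
  switch (type) {
    case 'number':
    case 'long':
    case 'integer':
      return parseInt(result, 10);
    case 'double':
    case 'float':
      return parseFloat(result);
    case 'string':
      return result.toString();
    case 'boolean':
    case 'bool':
      // Strings that spell out a falsy value are treated as false.
      if (typeof result === 'string') {
        return !['false', '0', 'undefined', 'null'].includes(result);
      }
      return Boolean(result);
    default:
      // array, object, structure, null, undefined and anything else:
      // the result of the build loop is returned unchanged.
      return result;
  }
}

console.log(coerceResult('integer', '42.9'));  // 42
console.log(coerceResult('double', '19.99'));  // 19.99
console.log(typeof coerceResult('string', 7)); // string
console.log(coerceResult('bool', 'false'));    // false
```

Note that an `integer` property silently drops the fractional part, so a price-like value should be declared as `double` or `float`.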
Example with Relationships
We will create three models, one of which (Orders) will have the relationships defined. I will not go into much detail, since I think the YAML manifests are rather easy to understand.
The Products model:
name: Products
type: object
key: product_id
properties:
  product_id:
    type: string
    description: Unique identifier representing a specific product
    data:
      build: faker.random.uuid()
  product_name:
    type: string
    description: Display name of product.
    data:
      build: faker.commerce.productName()
  price:
    type: double
    description: The product price
    data:
      build: chance.floating({ min: 0, max: 150, fixed: 2 })
Testing the output:
$ fakeit console --count 3 --format csv products.yaml
┌──────────────────────────────────────┬───────────────────────┬────────┐
│ product_id                           │ product_name          │ price  │
├──────────────────────────────────────┼───────────────────────┼────────┤
│ 0d80b0d1-4a25-47aa-ae9b-409cba875b51 │ Refined Cotton Salad  │ 124.07 │
├──────────────────────────────────────┼───────────────────────┼────────┤
│ 578980da-c1d1-440b-9b6a-7ed934dbae2a │ Ergonomic Metal Bacon │ 91.93  │
├──────────────────────────────────────┼───────────────────────┼────────┤
│ 9d9b4a55-f300-4f1f-bec2-4af8a055c6b3 │ Ergonomic Metal Mouse │ 20.48  │
└──────────────────────────────────────┴───────────────────────┴────────┘
Let’s create our data structure for the Users:
name: Users
type: object
key: user_id
properties:
  user_id:
    type: integer
    description: An auto-incrementing number
    data:
      build: document_index
  first_name:
    type: string
    description: The user's first name
    data:
      build: faker.name.firstName()
  last_name:
    type: string
    description: The user's last name
    data:
      build: faker.name.lastName()
  username:
    type: string
    description: The username
    data:
      build: faker.internet.userName()
  password:
    type: string
    description: The user's password
    data:
      build: faker.internet.password()
  email_address:
    type: string
    description: The user's email address
    data:
      build: faker.internet.email()
  created_on:
    type: string
    description: The date on which the user was created
    data:
      build: faker.date.past().toISOString().slice(0,10)
Testing the output:
$ fakeit console --count 3 --format csv users.yaml
┌─────────┬────────────┬───────────┬──────────────────┬─────────────────┬────────────────────────────┬────────────┐
│ user_id │ first_name │ last_name │ username         │ password        │ email_address              │ created_on │
├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤
│ 0       │ Virgie     │ Lockman   │ Raul79           │ zpD9ZO8cLIUrSdr │ Matteo_Hettinger@yahoo.com │ 2017-09-14 │
├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤
│ 1       │ Alyson     │ Dooley    │ Royal_Ledner     │ Aak5Yfay7ENv5hc │ Jerrell.Hahn37@hotmail.com │ 2017-10-17 │
├─────────┼────────────┼───────────┼──────────────────┼─────────────────┼────────────────────────────┼────────────┤
│ 2       │ Roberta    │ Kautzer   │ Leslie_Nikolaus1 │ eXgSu3N0h7Y46aQ │ Joanne_Torp@yahoo.com      │ 2017-09-03 │
└─────────┴────────────┴───────────┴──────────────────┴─────────────────┴────────────────────────────┴────────────┘
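The `created_on` build expression in the Users model relies on plain JavaScript date handling rather than anything FakeIt-specific. As a small illustration, here is the same chain applied to a fixed date standing in for the random `Date` that `faker.date.past()` returns (the variable names are made up for this example):

```javascript
// A fixed date standing in for the random Date returned by faker.date.past().
// Months are zero-based in JavaScript, so 8 means September.
const sampleDate = new Date(Date.UTC(2017, 8, 14, 13, 5, 0));

// toISOString() always yields a UTC timestamp: YYYY-MM-DDTHH:mm:ss.sssZ
const isoTimestamp = sampleDate.toISOString();

// The first 10 characters of that timestamp are the calendar date.
const isoDate = isoTimestamp.slice(0, 10);

console.log(isoTimestamp); // 2017-09-14T13:05:00.000Z
console.log(isoDate);      // 2017-09-14
```

Because `toISOString()` works in UTC, the resulting date can differ by a day from the local calendar date for timestamps close to midnight. For synthetic data this rarely matters.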
Next we create the Orders data model, which has relationships with Users and Products. These are expressed via dependencies. For example, to access a random user ID from the Users model from within the Orders model, we can use this approach:
faker.random.arrayElement(documents.Users).user_id;
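Conceptually, this expression picks one of the previously generated Users documents at random and reads its `user_id`. A rough plain-JavaScript equivalent, assuming `documents.Users` is an array of generated user objects, could look like this (the sample data and the `pickRandom` helper are made up for illustration):

```javascript
// Made-up stand-in for the documents FakeIt generated from users.yaml.
const documents = {
  Users: [
    { user_id: 0, first_name: 'Virgie' },
    { user_id: 1, first_name: 'Alyson' },
    { user_id: 2, first_name: 'Roberta' },
  ],
};

// Hypothetical helper doing roughly what faker.random.arrayElement does:
// return one element of the array, chosen at random.
function pickRandom(items) {
  return items[Math.floor(Math.random() * items.length)];
}

const userId = pickRandom(documents.Users).user_id;
console.log(userId); // one of the existing user ids
```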
Here is the model:
name: Orders
type: object
key: order_id
data:
  dependencies: # <== HERE WE DEFINE RELATIONSHIPS
    - products.yaml
    - users.yaml
properties:
  order_id:
    type: integer
    description: The order_id
    data:
      build: document_index + 1
  user_id:
    type: integer
    description: The user_id that placed the order
    data:
      build: faker.random.arrayElement(documents.Users).user_id;
  order_date:
    type: string
    description: The date on which the order was placed
    data:
      build: faker.date.past().toISOString().slice(0,10)
  order_status:
    type: string
    description: The status of the order
    data:
      build: faker.random.arrayElement([ 'Pending', 'Processing', 'Cancelled', 'Shipped' ])
  billing_name:
    type: string
    description: The name of the person the order is to be billed to
    data:
      build: "`${faker.name.firstName()} ${faker.name.lastName()}`"
Testing the output:
$ fakeit console --count 6 --format csv orders.yaml
┌──────────┬─────────┬────────────┬──────────────┬───────────────────┐
│ order_id │ user_id │ order_date │ order_status │ billing_name      │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 1        │ 1       │ 2018-04-23 │ Processing   │ Arturo Wilkinson  │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 2        │ 2       │ 2017-07-12 │ Pending      │ Della Schamberger │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 3        │ 4       │ 2018-04-02 │ Processing   │ Emma Nolan        │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 4        │ 5       │ 2017-08-01 │ Cancelled    │ Zion Balistreri   │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 5        │ 2       │ 2017-12-14 │ Cancelled    │ Tevin Wolf        │
├──────────┼─────────┼────────────┼──────────────┼───────────────────┤
│ 6        │ 3       │ 2018-02-16 │ Shipped      │ Allene Koss       │
└──────────┴─────────┴────────────┴──────────────┴───────────────────┘
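Because every `user_id` in Orders is drawn from the generated Users documents, each order should point at an existing user. If you want to convince yourself of that after exporting the data, a referential-integrity check along the following lines would do. The sample rows are made up, and `findOrphans` is a hypothetical helper, not part of FakeIt:

```javascript
// Returns the orders whose user_id does not match any known user.
function findOrphans(orders, users) {
  const knownIds = new Set(users.map((user) => user.user_id));
  return orders.filter((order) => !knownIds.has(order.user_id));
}

// Made-up sample data in the shape produced by the models above.
const users = [{ user_id: 1 }, { user_id: 2 }, { user_id: 3 }];
const orders = [
  { order_id: 1, user_id: 1 },
  { order_id: 2, user_id: 2 },
  { order_id: 3, user_id: 7 }, // deliberately broken reference
];

console.log(findOrphans(orders, users)); // [ { order_id: 3, user_id: 7 } ]
```

An empty result means that every order references a user that actually exists.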
In this short time we have created synthetic data for three tables and even defined relationships between them! Naturally, FakeIt.js also offers a command to write the results out to files:
fakeit directory --count 25 --format csv --verbose output/ users.yaml
Taking it to the next level
FakeIt.js can load data directly into Couchbase, but no other database or document store is currently supported. That doesn’t stop anyone from extending this functionality, though:
We can simply add another property to each data point, called something like ddl_type, and define the column type there. A simple script could then pick this up (along with other properties) and create the DDL statements for us.
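As a rough sketch of that idea, the script below takes a model that has already been parsed into a JavaScript object and emits a `CREATE TABLE` statement. The `ddl_type` property, the `buildCreateTable` function and the chosen column types are all assumptions made for illustration; FakeIt itself neither defines nor interprets `ddl_type`, and loading the YAML file is left out here.

```javascript
// The Products model as it might look after parsing products.yaml,
// with a hypothetical ddl_type added to each property.
const productsModel = {
  name: 'Products',
  key: 'product_id',
  properties: {
    product_id: { type: 'string', ddl_type: 'CHAR(36)' },
    product_name: { type: 'string', ddl_type: 'VARCHAR(255)' },
    price: { type: 'double', ddl_type: 'NUMERIC(10,2)' },
  },
};

// Builds a CREATE TABLE statement from the model's properties.
// The property named by model.key becomes the primary key.
function buildCreateTable(model) {
  const columns = Object.entries(model.properties).map(([name, prop]) => {
    if (!prop.ddl_type) {
      throw new Error(`Property "${name}" has no ddl_type`);
    }
    const primaryKey = name === model.key ? ' PRIMARY KEY' : '';
    return `  ${name} ${prop.ddl_type}${primaryKey}`;
  });
  return `CREATE TABLE ${model.name.toLowerCase()} (\n${columns.join(',\n')}\n);`;
}

console.log(buildCreateTable(productsModel));
```

The column types are spelled out by hand in the model rather than derived from `type`, since a FakeIt `string` could reasonably map to several different SQL types depending on the target database.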