Generating Millions of Fake Records Using NodeJS

Bradley Kingsley
Published 6 months ago.
5 minute read

I've recently found myself in a position where I need to generate a lot of fake data because, as it turns out, there isn't an API out there that grants you access to unlimited troves of data. It makes perfect sense: running such a service would probably cost more than it's worth once you factor in maintenance and preventing downtime.

A great service I constantly rely on for generating relatively small data sets is Mockaroo. I've never tried to generate millions of records using it, but I assume you can on their paid plan.

Then again, it turns out you don't need to rely on a fancy internet service to help you generate such data. At least not after you've discovered the power of Faker, Casual, Chance, Randexp and JSON Schema Faker.

These are incredibly powerful solutions on their own, but what if we could turn it up a notch? What if someone was crazy enough to create something that combines all these smaller libraries into a single, ultra-powerful Thanos-like library for generating data?

One of the best aspects of this library is its ability to generate dummy data with relationships between individual objects.

Enter mocker-data-generator

Someone went ahead and created [mocker-data-generator](https://github.com/danibram/mocker-data-generator), the most amazing fake data generator I've come across so far, and I couldn't be gladder.

We start off by installing the library:

npm install mocker-data-generator --save

Then import it:

const mocker = require('mocker-data-generator').default //old way

//or

import mocker from 'mocker-data-generator' //ES6 way

Using Mocker Data Generator

In the true spirit of democracy, mocker-data-generator allows you to use as many or as few libraries as you want for the generation process. Things can get pretty complex once other libraries are involved, so we'll start off simple.

So, say you had an amazing idea for a website, but needed to generate a few first names, last names, and emails. We're not ready for the responsibility of generating (strong) passwords yet, so let's skip that for now.

We could implement it like so:

First, we define our schema

const mocker = require('mocker-data-generator').default;

let userSchema = {
    firstName: {
        faker: 'name.firstName'
    },
    lastName: {
        faker: 'name.lastName'
    },
    email: {
        faker: 'internet.email'
    }
};

Then pass the schema to mocker-data-generator and let it do its magic

...

mocker().schema('users', userSchema, 2)
          .build()
          .then(data=> {
              /**
                  Should be an object with an array of users:
                  {
                   "users": [
                      {
                         "firstName": "Marietta",
                         "lastName": "Padberg",
                         "email": "[email protected]"
                      },
                      {
                         "firstName": "Stan",
                         "lastName": "Pacocha",
                         "email": "[email protected]"
                      }
                   ]
                }
              */
          })

How about if we wanted to generate usernames? Usernames aren't as obvious to create as a person's name would be, so we need a different approach. mocker-data-generator allows us to generate fake data in more complex ways. We could do something like this, for example:

let userSchema = {
    ...
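    username: {
        // Must be a regular function (not an arrow function) so that
        // `this.object` refers to the record being generated
        function: function() {
            return this.object.firstName + this.object.lastName;
        }
    }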
}

Running this schema should now produce:

{
     "users": [
         {
             "firstName": "Marietta",
             "lastName": "Padberg",
             "email": "[email protected]",
             "username": "MariettaPadberg"
         },
         {   
             "firstName": "Stan",
             "lastName": "Pacocha",
             "email": "[email protected]",
             "username": "StanPacocha"
         }
     ]
 }
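
Plain concatenation keeps the original casing and any punctuation in the names. As a minimal sketch, the helper below lowercases both names, strips anything that isn't a letter or digit, and optionally appends a suffix. `buildUsername` is a hypothetical name of my own and not part of mocker-data-generator; the idea is that the schema's username function could call it with `this.object.firstName` and `this.object.lastName`.

```javascript
// Hypothetical helper, not part of mocker-data-generator.
// Builds a lowercase "first.last" username and optionally appends a suffix.
function buildUsername(firstName, lastName, suffix) {
    // Lowercase the name and drop every character that is not a-z or 0-9
    const clean = (value) => String(value).toLowerCase().replace(/[^a-z0-9]/g, '');
    const base = `${clean(firstName)}.${clean(lastName)}`;
    return suffix === undefined ? base : `${base}${suffix}`;
}
```

Passing a counter or random number as the suffix is one way to reduce collisions when generating large numbers of users.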

But in the grand scheme of things, what we've done so far isn't too complex. The true power of this library comes in when you need to create relationships between different fake objects that need to be generated.

So, let's say that every user in our database owned a dog. We would create a new schema for the dog:

...
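const dogNames = require('dog-names');

let dogSchema = {
    id: {
        faker: 'random.uuid'
    },
    name: {
        // Wrapped in a function so a new name is picked for every record,
        // instead of one name being chosen when the schema is defined.
        // allRandom() is the random-name call in the dog-names README.
        function: function() {
            return dogNames.allRandom();
        }
    }
};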

And modify our user schema to show this relationship:

let userSchema = {
    ...
    dog: {
        hasOne: 'dogs'//the name of the schema
    }
}

And, finally, we modify the schema generation code:

...

mocker().schema('dogs', dogSchema, 2) // built first so users can reference it
          .schema('users', userSchema, 2)
          .build()
          .then(data=> {
              ...
          })

so that the generated users will look like this:

{
     "users": [
                 {
                     "firstName": "Marietta",    
                     "lastName": "Padberg",
                     "email": "[email protected]",
                     "username": "MariettaPadberg",
                     "dog": {
                         "id": "e4053f56-cf7f-41c5-9c95-726dc9070673",
                         "name": "Toby"
                     }    
                 },
                 {    
                     "firstName": "Stan",
                     "lastName": "Pacocha",
                     "email": "[email protected]",
                     "username": "StanPacocha",
                     "dog": {
                         "id": "fcd0c009-6906-4cb9-9bcc-fd84f0a888fe",
                         "name": "Roxy"
                     }
                 }
             ]
 }
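
To make the relationship less magical, here is a rough, library-free sketch of the idea as I understand it: each record gets one entry picked at random from a collection that has already been generated. `attachOne` and its parameters are hypothetical names for illustration only, and this is not mocker-data-generator's actual implementation.

```javascript
// Conceptual illustration only; not mocker-data-generator's actual code.
// Returns copies of `records`, each with one randomly chosen item from
// `related` stored under `fieldName`.
function attachOne(records, fieldName, related, random = Math.random) {
    if (related.length === 0) {
        throw new Error('no related records to pick from');
    }
    return records.map((record) => ({
        ...record,
        // random() is in [0, 1), so the index is always within bounds
        [fieldName]: related[Math.floor(random() * related.length)],
    }));
}
```

Because the pick is random, two users can end up with the same dog. If every user needs a distinct dog, the pairing has to be done by index or by removing each dog once it is assigned.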

And with that, the modeling is over. All that's left is scaling our method to generate as many records as we like:

...
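const generatePeople = async (number) => {
    // Return the built data so that callers can use it
    return mocker()
        .schema('dogs', dogSchema, number) // built first so users can reference it
        .schema('users', userSchema, number)
        .build();
};

// 10e+6 is 10,000,000 records
generatePeople(10e+6).then(data => {
    // do stuff
});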
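
Asking for every record in a single call means the whole data set has to sit in memory at once. One option, sketched here as my own assumption rather than anything the library provides, is to split the total into smaller batches and call generatePeople once per batch. The hypothetical `batchSizes` helper below only works out how big each batch should be.

```javascript
// Hypothetical helper: split a total record count into batch sizes.
// Yields `batchSize` repeatedly, then whatever remainder is left.
function* batchSizes(total, batchSize) {
    if (!Number.isInteger(total) || total < 0) {
        throw new RangeError('total must be a non-negative integer');
    }
    if (!Number.isInteger(batchSize) || batchSize <= 0) {
        throw new RangeError('batchSize must be a positive integer');
    }
    for (let remaining = total; remaining > 0; remaining -= batchSize) {
        yield Math.min(batchSize, remaining);
    }
}
```

Inside an async function this could be used as `for (const size of batchSizes(1e6, 50000)) { const data = await generatePeople(size); /* handle this batch */ }`, so that only one batch is held in memory at a time.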

Bonus: Writing a Million Fake JSON Records to Disk Storage (without running out of memory)

If you're like me, the first thing that you'll think of when the 'writing files in node' topic is brought up is the fs module. It contains all the functionality necessary for manipulating files and other methods for dealing with the filesystem.

More specifically, fs.writeFile is used for writing data to files asynchronously. Since Node is non-blocking, we provide a callback that runs once the write has finished (or failed).

Its signature looks like: fs.writeFile(file, data, options, callback). We can also call the method without the options object like this:

const fs = require('fs');
const path = require('path');

const dataPath = path.join(__dirname, 'data'); // this directory must already exist
const filesPath = path.join(dataPath, `park_visitors_at_${Date.now()}.json`);
...
generatePeople(10e+6).then(data => {
    fs.writeFile(filesPath, JSON.stringify(data), err => {
        if (err) {
            throw err;
        }
    });
});
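
For comparison, here is a small sketch of the same write with the options argument spelled out, using the promise-based fs.promises.writeFile so that it can be awaited. `saveJson` is a hypothetical helper name; it also creates the target directory first, since writeFile fails if the directory doesn't exist.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: serialize `data` and write it to `filePath`,
// creating the parent directory if it is missing.
async function saveJson(filePath, data) {
    await fs.promises.mkdir(path.dirname(filePath), { recursive: true });
    await fs.promises.writeFile(filePath, JSON.stringify(data), {
        encoding: 'utf8', // how the string is encoded on disk
        flag: 'w',        // create the file, or truncate it if it exists
    });
}
```

The values shown for encoding and flag are also the defaults, which is why leaving the options object out behaves the same way.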

That was my first iteration of the function I needed to write. As you might guess (or already know, if you've tried it), it can take several minutes to finish, assuming it doesn't run out of memory first.

To give Node a larger heap when running the above code, use the following command:

node --max-old-space-size=4096 index.js # raises the heap limit to about 4 GB

That depends on your system, of course, but we could optimize this code a little more.

Instead of writing all the data to the file at once, let's use streams.

...

const writeFileStream = fs.createWriteStream(filesPath);

generatePeople(10e+6).then(data => {
    writeFileStream.write(JSON.stringify(data));
    writeFileStream.end();
});

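Keep in mind that JSON.stringify still builds the entire string in memory before the stream receives it, so this version can still fail on very large data sets. If it does, you might have to use third-party libraries like big-json or the JSONStream.stringifyObject API from JSONStream instead of JSON.stringify.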
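
If you'd rather stay with Node's built-in modules, another option is to serialize one record at a time so the full JSON string never exists in memory. The sketch below is my own assumption of how that could look, not something taken from the libraries above: `jsonArrayChunks` and `writeJsonArray` are hypothetical names, and `stream/promises` requires Node 15 or newer.

```javascript
const fs = require('fs');
const { Readable } = require('stream');
const { pipeline } = require('stream/promises'); // Node 15+

// Yield the pieces of a JSON array one record at a time, so only a
// single serialized record is held in memory at any moment.
function* jsonArrayChunks(records) {
    yield '[';
    let first = true;
    for (const record of records) {
        // Every record after the first is preceded by a comma
        yield (first ? '' : ',') + JSON.stringify(record);
        first = false;
    }
    yield ']';
}

// Stream the chunks to disk; pipeline() waits for the write stream
// to drain when it falls behind and rejects if either stream errors.
async function writeJsonArray(filePath, records) {
    await pipeline(
        Readable.from(jsonArrayChunks(records)),
        fs.createWriteStream(filePath)
    );
}
```

Calling `await writeJsonArray(filesPath, data.users)` would write the users as a single JSON array. The records still have to be generated somewhere, so this only removes the cost of the one giant string, not the cost of holding the generated objects.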

Copyright © 2019 The Kenyan Dev