LearnJsData

About this guide



This guide teaches the basics of manipulating data using JavaScript in the browser, or in node.js. Specifically, these tasks are geared around preparing data for further analysis and visualization.

This guide will demonstrate some basic techniques and how to implement them using the core JavaScript API, the d3.js library, and lodash.

It assumes you already have some basic knowledge of JavaScript.

Code

Each document in this repo is executed when loaded into a browser. Check it out by opening the Developer Tools Console. You should see the output of the following code block:

console.log("This is the index!");

Check out the full source on github.

Why?

Is data munging in JavaScript something you would actually want to do? Maybe.

There are other languages out there - R and Python, for example - that do a great job with data wrangling.

These tools are great and you should use them. Often times, however, you are already familiar with a particular language (like JavaScript) and would like to get started with data, but want to take it one step at a time.

Additionally, sometimes you are already in a particular environment (like JavaScript) and don't have the luxury of switching to one of these other options.

In these cases, JavaScript could be considered a viable option for your data analysis. And if you find yourself in one of these situations, or just want to try out JavaScript for data analysis for fun, then this guide is for you!

Check out some of the tasks, and see if JavaScript data analysis is something you want to try yourself.






Getting Started



About Tasks

This guide is broken up into a number of tasks, which we can think of as little modules or recipes.

Each task tries to encapsulate a concrete lesson around common data manipulation and analysis processes. Tasks attempt to be self-contained and stay focused on the, well, task at hand.

This guide was built for client side data processing (in the browser), but can easily be used in a server side (Node) application with a bit of tweaking. Check out the analyzing data with Node section for the details.

Why D3?

D3.js is largely known for its data visualization capabilities - and for good reason. It is quickly becoming the de facto standard for interactive visualization on the web.

Its core feature of binding data to visual representations happens to require a lot of manipulation of said data. Thus, while this toolkit is focused around visualization, it is well suited for data munging as well!

And, a typical output for data manipulation is at least some sort of visualization of that data, in which case you are all ready to go.

Why lodash?

Lodash is fast, popular, and fills in some holes in D3's processing features. Plus, its functional style and chaining capabilities make it work well alongside D3.

Code Snippets

There are a bunch of useful snippets in this guide. Here is an example:

var theMax = d3.max([1,2,20,3]);
console.log(theMax);
=> 20
This code is using d3.js

We use a little arrow, =>, to indicate output from a code snippet. You can view this same output by opening the console of your favorite web browser.

Snippets in this guide that are not pure JavaScript will be marked with the libraries used to make them work.

Preparing Your Site for Data Processing

To get started using these tools for your data processing, you are going to want to include them in your html file along with a JavaScript file to perform the analysis.

I typically download these scripts and include local copies in my page. To do this, you would want to have your HTML look something like this:

<!doctype html>
<html>
<head>
</head>
<body>

<script src='js/d3.js'></script>
<script src='js/lodash.js'></script>
<script src='js/analysis.js'></script>
</body>
</html>

analysis.js would be where your analysis code goes. I put the scripts at the end of the body - just so that any other content on the page won't be delayed in loading. Typically, I name the HTML file index.html - so that it's loaded automatically as the root page.

Running a Local Server

D3's functions for reading data require you be running the page from a server. You can do this on your own machine by running a local server out of the root directory of your site.

There are many options for easy-to-spin-up web servers: Python ships with one (python -m SimpleHTTPServer in Python 2, python -m http.server in Python 3), and Node has the http-server package.

Lately, I have been using that last option - http-server. If you have Node and npm installed, you can grab the required package by installing it from the command line:

npm install -g http-server

(The -g flag stands for global - which allows you to access http-server from any directory on your machine.)

Then cd to your analysis directory and start it up!

cd /path/to/dir
http-server

In your web browser, open up http://0.0.0.0:8080 and you should be ready to go!

Next Task

Reading in Data








Reading in Data



The first step in any data processing is getting the data! Here is how to parse and prepare common input formats using D3.js.

Parsing CSV Files

D3 supports a bunch of file types when loading data, and one of the most common is probably plain old CSV (comma separated values).

Let's say you had a csv file with some city data in it:

cities.csv:

city,state,population,land area
seattle,WA,652405,83.9
new york,NY,8405837,302.6
boston,MA,645966,48.3
kansas city,MO,467007,315.0

Use d3.csv to convert it into an array of objects

d3.csv("/data/cities.csv", function(data) {
  console.log(data[0]);
});
=> {city: "seattle", state: "WA", population: "652405", land area: "83.9"}
This code is using d3.js

You can see that the headers of the original CSV have been used as the property names for the data objects. Using d3.csv in this manner requires that your CSV file has a header row.

If you look closely, you can also see that the values associated with these properties are all strings. This is probably not what you want in the case of numbers. When loading CSVs and other flat files, you have to do the type conversion.

We will see more of this in other tasks, but a simple way to do this is to use the + operator (unary plus). forEach can be used to iterate over the data array.

d3.csv("/data/cities.csv", function(data) {
  data.forEach(function(d) {
    d.population = +d.population;
    d["land area"] = +d["land area"];
  });
  console.log(data[0]);
});
=> {city: "seattle", state: "WA", population: 652405, land area: 83.9}
This code is using d3.js

Dot notation is a useful way to access the properties of these data objects. However, if your headers have spaces in them, then you will need to use bracket notation as shown.
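
For example, assuming data holds the converted objects from the forEach example above, the two notations look like this:

var firstCity = data[0];
console.log(firstCity.population);
=> 652405
console.log(firstCity["land area"]);
=> 83.9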

This conversion can also be done during the loading of the data, by d3.csv directly. This is done by providing an accessor function to d3.csv, whose return value will be the individual data objects in our data array.

d3.csv("/data/cities.csv", function(d) {
  return {
    city : d.city,
    state : d.state,
    population : +d.population,
    land_area : +d["land area"]
  };
}, function(data) {
  console.log(data[0]);
});
=> {city: "seattle", state: "WA", population: 652405, land_area: 83.9}
This code is using d3.js

In this form, you have complete control over the data objects and can rename properties (like land_area) and convert values (like population) willy-nilly. On the other hand, you have to be quite explicit about which properties to return. This may or may not be what you are into.

I typically allow D3 to load all the data, and then make modifications in a post-processing step, but it might be more effective for you to be more explicit with the modifications.

Reading TSV Files

CSV is probably the most common flat file format, but in no way the only one.

I often like to use TSV (tab separated values) files - to get around the issues of numbers and strings often having commas in them.

D3 can parse TSV's with d3.tsv:

animals.tsv:

name    type    avg_weight
tiger    mammal    260
hippo    mammal    3400
komodo dragon    reptile    150

Loading animals.tsv with d3.tsv:

d3.tsv("/data/animals.tsv", function(data) {
  console.log(data[0]);
});
=> {name: "tiger", type: "mammal", avg_weight: "260"}
This code is using d3.js

Reading Other Flat Files

In fact, d3.tsv and d3.csv are only the tip of the iceberg. If you have a non-standard delimited file, you can create your own parser easily using d3.dsv.

Using d3.dsv takes one more step. You first create a new parser by passing in the type of delimiter and mimeType to use.

For example, if we had a file that looked like this:

animals_piped.txt:

name|type|avg_weight
tiger|mammal|260
hippo|mammal|3400
komodo dragon|reptile|150

We could create a pipe separated values (PSV) parser using d3.dsv:

var psv = d3.dsv("|", "text/plain");

And then use this to parse the strangely formatted file.

psv("/data/animals_piped.txt", function(data) {
  console.log(data[1]);
});
=> {name: "hippo", type: "mammal", avg_weight: "3400"}
This code is using d3.js

Reading JSON Files

For nested data, or for passing around data where you don't want to mess with data typing, it's hard to beat JSON.

JSON has become the language of the internet for good reason. It's easy to understand, write, and parse. And with d3.json - you too can harness its power.

employees.json:

[
 {"name":"Andy Hunt",
  "title":"Big Boss",
  "age": 68,
  "bonus": true
 },
 {"name":"Charles Mack",
  "title":"Jr Dev",
  "age":24,
  "bonus": false
 }
]

Loading employees.json with d3.json:

d3.json("/data/employees.json", function(data) {
  console.log(data[0]);
});
=> {name: "Andy Hunt", title: "Big Boss", age: 68, bonus: true}
This code is using d3.js

We can see that, unlike our flat file parsing, numeric types stay numeric. Indeed, a JSON value can be a string, a number, a boolean value, an array, or another object. This allows nested data to be dealt with easily.

Loading Multiple Files

D3's basic loading mechanism is fine for one file, but starts to get messy as we nest multiple callbacks.

For loading multiple files, we can use Queue.js (also written by Mike Bostock) to wait for multiple data sources to be loaded.

queue()
  .defer(d3.csv, "/data/cities.csv")
  .defer(d3.tsv, "/data/animals.tsv")
  .await(analyze);

function analyze(error, cities, animals) {
  if(error) { console.log(error); }

  console.log(cities[0]);
  console.log(animals[0]);
}
=> {city: "seattle", state: "WA", population: "652405", land area: "83.9"}
{name: "tiger", type: "mammal", avg_weight: "260"}
This code is using queue.js and d3.js

Note that we defer the loading of two types of files - using two different loading functions - so this is an easy way to mix and match file types.

The callback function passed into await gets each dataset as a parameter, with the first parameter being populated if an error has occurred in loading the data.

It can be useful to output the error, if it is defined, so you catch data loading problems quickly.

To add another data file, simply add another defer and extend the input parameters for your callback!
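
For example, to also pull in the employees.json file from the JSON section above, the queue might be extended like this:

queue()
  .defer(d3.csv, "/data/cities.csv")
  .defer(d3.tsv, "/data/animals.tsv")
  .defer(d3.json, "/data/employees.json")
  .await(analyze);

function analyze(error, cities, animals, employees) {
  if(error) { console.log(error); }

  console.log(employees[0]);
}
=> {name: "Andy Hunt", title: "Big Boss", age: 68, bonus: true}
This code is using queue.js and d3.js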

Next Task

Combining Data








Combining Data



Note: This task was very generously contributed by Timo Grossenbacher - Geographer and Data Specialist extraordinaire. Thanks very much Timo!

Often, you have to combine two or more different data sets because they contain complementary information. Or, for example, the data come in chunks from the server and need to be reassembled on the client side.

Combining or merging data may involve one of the following tasks: combining data sets by one or more common attributes, adding together rows from different data sets, or combining attributes from different data sets.

Combine data sets by one or more common attributes

Let's say we have a list of the following articles:

var articles = [{
    "id": 1,
    "name": "vacuum cleaner",
    "weight": 9.9,
    "price": 89.9,
    "brand_id": 2
}, {
    "id": 2,
    "name": "washing machine",
    "weight": 540,
    "price": 230,
    "brand_id": 1
}, {
    "id": 3,
    "name": "hair dryer",
    "weight": 1.2,
    "price": 24.99,
    "brand_id": 2
}, {
    "id": 4,
    "name": "super fast laptop",
    "weight": 400,
    "price": 899.9,
    "brand_id": 3
}];

And of the following brands:

var brands = [{
    "id": 1,
    "name": "SuperKitchen"
}, {
    "id": 2,
    "name": "HomeSweetHome"
}];

As you can see, in each article, brand_id points to a particular brand, whose details are saved in another data set - which can be considered a lookup table in this case. This is often how separate data schemes are stored in a server-side database. Also note that the last article in the list has a brand_id for which no brand is stored in brands.

What we want to do now is to combine both datasets, so we can reference the brand's name directly from an article. There are several ways to achieve this.

Using native Array functions

We can implement a simple join (left outer join in database terms) using native, i.e., already existing Array functions as follows. The method presented here modifies the articles array in place by adding a new key-value-pair for brand.

articles.forEach(function(article) {
    var result = brands.filter(function(brand) {
        return brand.id === article.brand_id;
    });
    delete article.brand_id;
    article.brand = (result[0] !== undefined) ? result[0].name : null;
});
console.log(articles);
=> [{
    "id": 1,
    "name": "vacuum cleaner",
    "weight": 9.9,
    "price": 89.9,
    "brand": "HomeSweetHome"
}, {
    "id": 2,
    "name": "washing machine",
    "weight": 540,
    "price": 230,
    "brand": "SuperKitchen"
}, {
    "id": 3,
    "name": "hair dryer",
    "weight": 1.2,
    "price": 24.99,
    "brand": "HomeSweetHome"
}, {
    "id": 4,
    "name": "super fast laptop",
    "weight": 400,
    "price": 899.9,
    "brand": null
}];

First, we loop over each article and take its brand_id to look up the corresponding brand using the native filter function. Note that this function returns an array, and we expect it to have only one element. In case there is no corresponding brand, result[0] will be undefined, and in order to prevent an error (something like cannot read property 'name' of undefined), we use the ternary operator.

Also, as we no longer need brand_id after the lookup has been done, we can safely delete it.

If we want to join by more than one attribute, we can modify the filter function to achieve this. Hypothetically, this might look something like:

innerArray.filter(function(innerArrayItem) {
    return innerArrayItem.idA === outerArrayItem.idA &&
        innerArrayItem.idB === outerArrayItem.idB;
});

Using a generic and more efficient approach

A more generic, and also more performant, version of a join is proposed below (abbreviated from this StackOverflow answer). Its output is equivalent to that of the method above.

function join(lookupTable, mainTable, lookupKey, mainKey, select) {
    var l = lookupTable.length,
        m = mainTable.length,
        lookupIndex = [],
        output = [];
    for (var i = 0; i < l; i++) { // loop through l items
        var row = lookupTable[i];
        lookupIndex[row[lookupKey]] = row; // create an index for lookup table
    }
    for (var j = 0; j < m; j++) { // loop through m items
        var y = mainTable[j];
        var x = lookupIndex[y[mainKey]]; // get corresponding row from lookupTable
        output.push(select(y, x)); // select only the columns you need
    }
    return output;
}

Because the function defined above creates an index for the lookupTable (in our case brands) in its first loop, it runs considerably faster than the previously shown method. Also, via a callback, it allows us to directly define which keys (or "attributes") we want to retain in the resulting, joined array (output). It is used like so:

var result = join(brands, articles, "id", "brand_id", function(article, brand) {
    return {
        id: article.id,
        name: article.name,
        weight: article.weight,
        price: article.price,
        brand: (brand !== undefined) ? brand.name : null
    };
});
console.log(result);
=> [{
    "id": 1,
    "name": "vacuum cleaner",
    "weight": 9.9,
    "price": 89.9,
    "brand": "HomeSweetHome"
}, {
    "id": 2,
    "name": "washing machine",
    "weight": 540,
    "price": 230,
    "brand": "SuperKitchen"
}, {
    "id": 3,
    "name": "hair dryer",
    "weight": 1.2,
    "price": 24.99,
    "brand": "HomeSweetHome"
}, {
    "id": 4,
    "name": "super fast laptop",
    "weight": 400,
    "price": 899.9,
    "brand": null
}];

Note that we don't modify articles in place but create a new array.

Add together rows from different data sets

Let's say we want to load a huge data set from the server, but because of network performance reasons, we load it in three chunks and reassemble it on the client side. Using Queue.js, as illustrated in reading in data, we get the data and immediately combine it. For this, we can use D3's merge to combine the single arrays one after another. In database terms, this operation is called "union".

queue()
    .defer(d3.csv, "/data/big_data_1.csv")
    .defer(d3.csv, "/data/big_data_2.csv")
    .defer(d3.csv, "/data/big_data_3.csv")
    .await(combine);

function combine(error, big_data_1, big_data_2, big_data_3) {
    if (error) {
        console.log(error);
    }
    console.log(d3.merge([big_data_1, big_data_2, big_data_3]));
}
=> [{"a": "1", "b": "2"},{"a": "3", "b": "4"},{"a": "5", "b": "6"}]
This code is using d3.js

Note that the argument passed to d3.merge must be an array itself, which is why we use the square brackets.
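
As a minimal standalone sketch of what d3.merge does with that array of arrays:

console.log(d3.merge([[1, 2], [3, 4, 5]]));
=> [1, 2, 3, 4, 5]
This code is using d3.js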

Combine attributes from different data sets

In the last case, we have two or more data sets that contain attributes describing the same observations, or conceptual entities, and they need to be combined. This implies that all data sets have the same length. For example, dataset_1 below contains two observations of attribute type and attribute model, while dataset_2 contains the same two entities, but observed through attributes price and weight.

var dataset_1 = [{
    'type': 'boat',
    'model': 'Ocean Queen 2000'
}, {
    'type': 'car',
    'model': 'Ferrari'
}];
var dataset_2 = [{
    'price': 23202020,
    'weight': 5656.9
}, {
    'price': 59988,
    'weight': 1.9
}];

So in both data sets we essentially have separate information about the same conceptual entities, thus it makes sense to "merge" them, for which we can use lodash's merge function:

var result = _.merge(dataset_1, dataset_2);
console.log(result);
=> [{
    'type': 'boat',
    'model': 'Ocean Queen 2000',
    'price': 23202020,
    'weight': 5656.9
}, {
    'type': 'car',
    'model': 'Ferrari',
    'price': 59988,
    'weight': 1.9
}];
This code is using lodash

Next Task

Summarizing Data







Summarizing Data



With the data loaded, we want to take a quick look at what we have. D3 has a number of tools to use for quick data exploration.

To start, let's pretend we have loaded up a csv file - and have a dataset that looks something like:

var data = [
  {"city":"seattle", "state":"WA", "population":652405, "land_area":83.9},
  {"city":"new york", "state":"NY", "population":8405837, "land_area":302.6},
  {"city":"boston", "state":"MA", "population":645966, "land_area":48.3},
  {"city":"kansas city", "state":"MO", "population":467007, "land_area":315}
];

Min & Max

As it turns out, D3 comes to the rescue again, with d3.min and d3.max. Use the callback function to indicate which property (or computed value based on the properties) to access.

var minLand = d3.min(data, function(d) { return d.land_area; });
console.log(minLand);
=> 48.3
This code is using d3.js
var maxLand = d3.max(data, function(d) { return d.land_area; });
console.log(maxLand);
=> 315
This code is using d3.js

If you want both of them at the same time, you can use d3.extent

var landExtent = d3.extent(data, function(d) { return d.land_area; });
console.log(landExtent);
=> [48.3, 315]
This code is using d3.js

This returns an array with the first element the minimum value and the second element the maximum.

Summary Statistics

D3 provides a few basic tools to analyze your data, all using the same format as the min and max functions. Simply provide the property you would like to analyze, and you are good to go.

d3.mean

var landAvg = d3.mean(data, function(d) { return d.land_area; });
console.log(landAvg);
=> 187.45
This code is using d3.js

d3.median

var landMed = d3.median(data, function(d) { return d.land_area; });
console.log(landMed);
=> 193.25
This code is using d3.js

d3.deviation - for standard deviation

var landSD = d3.deviation(data, function(d) { return d.land_area; });
console.log(landSD);
=> 140.96553952414519
This code is using d3.js
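
d3.sum follows the same pattern, if you want the total rather than an average:

var landTotal = d3.sum(data, function(d) { return d.land_area; });
console.log(landTotal);
=> 749.8
This code is using d3.js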

Next Task

Iterating and Reducing








Iterating Over and Reducing Data



Most of the functions we used to summarize our data had to iterate over the entire dataset to generate their results - but the details were hidden behind the function. Now let's look at how we might perform this iteration ourselves for other metrics and manipulations!

Again, we start with a basic data set already loaded:

var data = [
  {"city":"seattle", "state":"WA", "population":652405, "land_area":83.9},
  {"city":"new york", "state":"NY", "population":8405837, "land_area":302.6},
  {"city":"boston", "state":"MA", "population":645966, "land_area":48.3},
  {"city":"kansas city", "state":"MO", "population":467007, "land_area":315}
];

Iterating

First, some basic iteration. We already saw this in the data loading task, but a common way to process each data object is by using forEach.

var count = 0;

data.forEach(function(d) {
  count += 1;
});

console.log(count);
=> 4

Of course, data also has the property length which would be the actual way to get the number of data elements in data - but this is just an example.

console.log(data.length);
=> 4

Immutability

Let me sidetrack this task just a bit to talk about immutability.

forEach provides for a basic way to loop through our data set. We can use this to modify the data in place, generate counts, or perform other manipulations that deal with each piece of data individually.

This works, but can get clunky and confusing fast. Keeping straight what form the data is in at any given time can be difficult, as can tracking the side effects of modifying your data that you might not be aware of.

To combat this confusion, it can be useful to think of the data as immutable. Immutable data cannot be modified once created. Immutability seems a bit counterintuitive for a task where we want to coerce our data into the form we want - but it comes together with the concept of transformations.

The idea is simple: each immutable dataset can be transformed into another immutable dataset through the use of a transformation function that works on each component of the data.

This process helps simplify the data flow, but if you have to make a copy of your data object each time, it can make code a bit brittle as you have to keep track of every attribute of your dataset.

Cloning

To help with this issue of brittle transformations, lodash provides the clone function.

This function takes an object and returns a copy of that object. That copy is now a separate data object that you can edit without affecting the original object.

var dataObject = {"name":"Carl", "age":"48", "salary":"12300"};
var copyOfData = _.clone(dataObject);
copyOfData.age = +copyOfData.age;
copyOfData.salary = +copyOfData.salary;
console.log(dataObject);
=> {name: "Carl", age: "48", salary: "12300"}
This code is using lodash
console.log(copyOfData);
=> {name: "Carl", age: 48, salary: 12300}

By default, the clone function will not copy over nested objects. Instead, these nested objects are simply passed by reference - meaning the original and the copy will still share them.

var dataObject = {"name":"Saul", "stats":{"age":"55"}};
var shallowCopy = _.clone(dataObject);
shallowCopy.stats.age = +shallowCopy.stats.age;
console.log(dataObject);
=> {"name":"Saul","stats":{"age":55}}
This code is using lodash
console.log(shallowCopy);
=> {"name":"Saul","stats":{"age":55}}

Note that because stats is a nested object the modification happened in both spots!

To prevent this "feature", we can pass true as the second parameter to clone to indicate that the copy should be deep and copy nested objects as well.

var dataObject = {"name":"Saul", "stats":{"age":"55"}};
var deepCopy = _.clone(dataObject, true);
deepCopy.stats.age = +deepCopy.stats.age;
console.log(dataObject);
=> {"name":"Saul","stats":{"age":"55"}}
This code is using lodash
console.log(deepCopy);
=> {"name":"Saul","stats":{"age":55}}

lodash also has a cloneDeep that can be used to make the deep-ness more explicit.
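
The same deep copy with cloneDeep looks like this:

var dataObject = {"name":"Saul", "stats":{"age":"55"}};
var deepCopy = _.cloneDeep(dataObject);
deepCopy.stats.age = +deepCopy.stats.age;
console.log(dataObject);
=> {"name":"Saul","stats":{"age":"55"}}
This code is using lodash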

Mapping

JavaScript's map can be a very useful tool to implement this concept of a transformation on immutable data.

map takes an array and produces another array which is the result of the callback function being executed on each element in the array.

var smallData = data.map(function(d,i) {

  return {
    name: d.city.toUpperCase(),
    index: i + 1,
    rounded_area: Math.round(d.land_area)
  };
});
console.log(data[0]);
console.log(smallData[0]);
=> {city: "seattle", state: "WA", population: 652405, land_area: 83.9}
  {name: "SEATTLE", index: 1, rounded_area: 84}

The callback function gets called for each element in the array, and also has access to the index of that element in the array. The result is an array of returned values from the callback.

With plain JavaScript, the immutability of an array is just in the mind of the developer. While map does not modify the array, it is easy for your callback method to do so. That is why we return a new object in the callback. lodash's clone would be another approach to getting a copy of each data element as a starting point for the transformation.

Filtering

Select a subset of the data using the built in filter method. This creates a new array of data (again see transformation talk above) with only the values that the callback function returns true for.

var large_land = data.filter(function(d) { return d.land_area > 200; });
console.log(JSON.stringify(large_land));
=> [{"city":"new york","state":"NY","population":8405837,"land_area":302.6},
  {"city":"kansas city","state":"MO","population":467007,"land_area":315}]

Sorting

Similar to filtering, sorting data based on attributes is something you'll want to do frequently.

The built in sort for arrays can do this. A caveat to this function is that, unlike filter, map, and other functions, this modifies the array you are sorting in place, instead of returning a new array with the objects sorted.

To sort an array, you need a comparator function. This is a function that takes two pieces of data and indicates which one should come first in the sorted list. The comparator-function way to do this is to return a negative value if the first value should come before the second, and a positive value if the second value should come first. If they are equal, and you don't care, then return a 0.

Let's see it in action. Here is a way to sort by population in a descending order (larger populations come first).

data.sort(function(a,b) {
  return b.population - a.population;
});
console.log(JSON.stringify(data));
=> [{"city":"new york","state":"NY","population":8405837,"land_area":302.6},
   {"city":"seattle","state":"WA","population":652405,"land_area":83.9},
   {"city":"boston","state":"MA","population":645966,"land_area":48.3},
   {"city":"kansas city","state":"MO","population":467007,"land_area":315}]

This b - a thing is a pretty common way to generate this kind of sort. But you could also do it more explicitly. Thinking through it, if b's population is larger than a's, then the value returned by b.population - a.population will be positive - so b will be sorted toward the top of the array. If the reverse is true, then the result will be negative, and a will be sorted first.

Note again, that the sort happened on the original data, which I'm not a big fan of.
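
If you want to leave the original array order alone, one option is to sort a shallow copy - slice with no arguments copies the array (though the objects inside are still shared):

var sortedCopy = data.slice().sort(function(a, b) {
  return b.population - a.population;
});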

D3 also has a few helper functions to implement ascending and descending comparator functions - but (as far as I can tell) they only accept arrays of raw numbers instead of objects. So to use d3.ascending or d3.descending you would have to do something like this:

var populations = data.map(function(d) { return d.population; });
console.log(populations);
=> [652405, 8405837, 645966, 467007]
populations.sort(d3.descending);
console.log(populations);
=> [8405837, 652405, 645966, 467007]

I'm usually looking to keep my data objects together, so I shy away from using these methods, but they might be great for what you are trying to do.

A big gotcha with sorting that you should watch out for is that if you do not pass a comparator function, the default behavior converts the values to strings and sorts them alphabetically. So, the array:

var nums = [3,1,10,20];

Would be sorted to:

console.log(nums.sort());
=> [1, 10, 20, 3]

This is never what you want for data sorting. For this reason, you should never use sort without a comparator function.
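
Passing a numeric comparator gives the expected order:

console.log(nums.sort(function(a, b) { return a - b; }));
=> [1, 3, 10, 20]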

Reducing

The summary functions we saw earlier all take an array and reduce it down to a single number. But what if that number isn't the one you want? Well, you can take this reduction into your own hands with reduce!

The syntax for reduce is always hard for me to remember, so let's go over it with the classic example: summing up a value.

var landSum = data.reduce(function(sum, d) {
  return sum + d.land_area;
}, 0);
console.log(landSum);
=> 749.8

The first parameter to reduce is the callback function that will return the running "total" of the reduction. This function is passed the previous value returned from the last time the callback was called. Here, that parameter, sum, provides the running total as we move through the array. The second parameter to the callback, d, is the current element of the array we are working on.

reduce can take an initial value, which is the second parameter to the reduce call. For this example, we start the sum at 0. If there is no starting value provided, then for the first execution of the callback (when there is no previous value) the first parameter to the callback will be the value of the first element of the array, and the reduction starts with the second element.

It always makes more sense to me to provide a starting value - unless you know what you are doing. You can also get the current index into the array (and the whole array itself) if that is useful to you.

var weirdString = data.reduce(function(str, d, i) {
  var ending = (i % 2 === 0) ? " is cool." : " sucks." ;
  return str + " " + d.city + ending;
}, "");
console.log(weirdString);
=> seattle is cool. new york sucks. boston is cool. kansas city sucks.

And summing over a variable is just used as an example here. You can always use d3.sum for this instead.

Chaining Functions

One of the great things about these more functional functions is that it is possible to chain them together into one big data wrangling pipeline!

var bigCities = data.filter(function(d) { return d.population > 500000; })
  .sort(function(a,b) { return a.population - b.population; })
  .map(function(d) { return d.city; });
console.log(bigCities);
=> ["boston", "seattle", "new york"]

Since we are using sort after filter, sort is working on the returned array from filter. The sort function at least is nice enough to also return the array, so chaining is still possible.

Next Task

Grouping Data








Grouping Data



Grouping data is an important capability to have when doing data analysis. Often times, you will want to break apart the data by a categorical variable and look at statistics or details for each group.

D3 includes the powerful d3.nest functionality to produce these groupings with a minimal amount of code.

Nest Basics

Fundamentally, d3.nest is about taking a flat data structure and turning it into a nested one. The user gets to decide how the nesting should occur, and how deep to nest. This is a bit different than many group_by concepts, where only a single level of nesting is allowed.

Let's say we have the following CSV file of "expenses":

name,amount,date
jim,34.0,11/12/2015
carl,120.11,11/12/2015
jim,45.0,12/01/2015
stacy,12.00,01/04/2016
stacy,34.10,01/04/2016
stacy,44.80,01/05/2016

And that has been converted to a nice array of objects via our data reading powers into something like this:

var expenses = [{"name":"jim","amount":34,"date":"11/12/2015"},
  {"name":"carl","amount":120.11,"date":"11/12/2015"},
  {"name":"jim","amount":45,"date":"12/01/2015"},
  {"name":"stacy","amount":12.00,"date":"01/04/2016"},
  {"name":"stacy","amount":34.10,"date":"01/04/2016"},
  {"name":"stacy","amount":44.80,"date":"01/05/2016"}
];

And now we want to slice up this data in different ways.

First, let's use nest to group by name:

var expensesByName = d3.nest()
  .key(function(d) { return d.name; })
  .entries(expenses);
This code is using d3.js

Which results in a nested data structure:

expensesByName = [
  {"key":"jim","values":[
    {"name":"jim","amount":34,"date":"11/12/2015"},
    {"name":"jim","amount":45,"date":"12/01/2015"}
  ]},
  {"key":"carl","values":[
    {"name":"carl","amount":120.11,"date":"11/12/2015"}
  ]},
  {"key":"stacy","values":[
    {"name":"stacy","amount":12.00,"date":"01/04/2016"},
    {"name":"stacy","amount":34.10,"date":"01/04/2016"},
    {"name":"stacy","amount":44.80,"date":"01/05/2016"}
  ]}
];

expensesByName is an array of objects. Each object has a key property - which is what we used as the grouping value using the key function. Here, we used the values associated with the name property as the key.

The values property of these entries is an array containing all the original data objects that had that key.

Summarizing Groups

The nested structure can be great for visualizing your data, but might be a little underwhelming for analytical applications. Never fear! nest.rollup is here!

With rollup, you provide a function that takes the array of values for each group and it produces a value based on that array. This provides for some very flexible group by functionality.

Here is a simple one to get back the counts for each name:

var expensesCount = d3.nest()
  .key(function(d) { return d.name; })
  .rollup(function(v) { return v.length; })
  .entries(expenses);
console.log(JSON.stringify(expensesCount));
=> [{"key":"jim","values":2},{"key":"carl","values":1},{"key":"stacy","values":3}]
This code is using d3.js

The individual records are gone (for better or worse) and in their place are the values returned by our rollup function. The naming stays the same (key and values) but the content is yours to specify. Note that the value passed into the rollup callback is the array of values for that key.

Here is another example where we get the average amount per person:

var expensesAvgAmount = d3.nest()
  .key(function(d) { return d.name; })
  .rollup(function(v) { return d3.mean(v, function(d) { return d.amount; }); })
  .entries(expenses);
console.log(JSON.stringify(expensesAvgAmount));
=> [{"key":"jim","values":39.5},{"key":"carl","values":120.11},{"key":"stacy","values":30.3}]
This code is using d3.js

Pretty cool right? Any roll-up function you can think of, you can make happen. And you don't need to stop at just one. rollup can return an object, so you can easily produce multiple metrics on your groups.

var expenseMetrics = d3.nest()
  .key(function(d) { return d.name; })
  .rollup(function(v) { return {
    count: v.length,
    total: d3.sum(v, function(d) { return d.amount; }),
    avg: d3.mean(v, function(d) { return d.amount; })
  }; })
  .entries(expenses);
console.log(JSON.stringify(expenseMetrics));
=> [{"key":"jim","values":{"count":2,"total":79,"avg":39.5}},
 {"key":"carl","values":{"count":1,"total":120.11,"avg":120.11}},
 {"key":"stacy","values":{"count":3,"total":90.9,"avg":30.3}}]
This code is using d3.js

Map Output

The array output can be useful for using map or forEach as discussed in the iterating and reducing task. But you can also have d3.nest return an object (or d3.map) of the results, for direct access. Note the use of nest.map below.

var expensesTotal = d3.nest()
  .key(function(d) { return d.name; })
  .rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
  .map(expenses);
console.log(JSON.stringify(expensesTotal));
=> {"jim":79,"carl":120.11,"stacy":90.9}
This code is using d3.js

Multi-Level Nesting

And you thought that single-level nesting was cool. Wait till you try multiple levels!

By adding more keys, you can sub-divide your data even further. Here are expense sums by name and then by date:

var expensesTotalByDay = d3.nest()
  .key(function(d) { return d.name; })
  .key(function(d) { return d.date; })
  .rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
  .map(expenses);
console.log(JSON.stringify(expensesTotalByDay));
=> {"jim":{"11/12/2015":34,"12/01/2015":45},
 "carl":{"11/12/2015":120.11},
 "stacy":{"01/04/2016":46.1,"01/05/2016":44.8}}
This code is using d3.js

Now the rollup callback is called for each of our smaller subgroups.

The order of the nest.key calls determines the order of the grouping. If we reverse our keys, we get the totals by date and then by name:

var expensesTotalByDay = d3.nest()
  .key(function(d) { return d.date; })
  .key(function(d) { return d.name; })
  .rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
  .map(expenses);
console.log(JSON.stringify(expensesTotalByDay));
=> {"11/12/2015":{"jim":34,"carl":120.11},
 "12/01/2015":{"jim":45},
 "01/04/2016":{"stacy":46.1},
 "01/05/2016":{"stacy":44.8}}
This code is using d3.js

Here the values are the same, but the mapping might be more convenient, depending on the questions you are trying to answer.

Derived Key Values

Remember, we are specifying our key value using a function. This gives us the power to group on derived or otherwise on-the-fly keys.

For example, if we wanted to find out totals for all expenses for each year, we would just do some basic string manipulation on the date string:

var expensesByYear = d3.nest()
  .key(function(d) { return d.date.split("/")[2]; })
  .rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
  .map(expenses);
console.log(JSON.stringify(expensesByYear));
=> {"2015":199.11,"2016":90.9}
This code is using d3.js
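
The derived key doesn't even have to come from the data's strings. As a hypothetical sketch, you could group by whether an expense was over $40 and count each side:

var expensesBySize = d3.nest()
  .key(function(d) { return d.amount > 40; })
  .rollup(function(v) { return v.length; })
  .map(expenses);
console.log(JSON.stringify(expensesBySize));
=> {"false":3,"true":3}
This code is using d3.js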

All this flexibility provides for a powerful toolkit for exploring your data.

Next Task

Working with Strings








Working with Strings



String cleaning is something you end up doing quite a lot. Hopefully this task will help make the process less painful. There is a near infinite number of transformations you might want to do with strings, so we won't get to everything, but this will serve as a starting point for common manipulations that will come up again and again.

We will start with generic JavaScript string functions and add in a bit of lodash magic to make things easier.

String Basics

Similar to arrays, the characters in a string are accessible via indexing:

var aChar = "Hello There!"[6];
console.log(aChar);
=> T

Also, just like arrays, you have access to the powerful slice method, which is used to extract sub-sections based on indexes.

var aSlice = "Hello There!".slice(6,11);
console.log(aSlice);
=> There

The sliced string goes up to - but does not include - the character at the end index.

And, of course, string concatenation is done in JavaScript using the + operator. Use parentheses if you want to do actual arithmetic inside your concatenation.

var orderNum = 8;
console.log("You are number " + (orderNum + 1) + " in line.");
=> You are number 9 in line.

Check the documentation for all the other basic tools.
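
A few of the other built-in methods you will reach for constantly - a quick sketch:

var city = "Kansas City";
console.log(city.toUpperCase());
=> KANSAS CITY
console.log(city.toLowerCase());
=> kansas city
console.log(city.split(" "));
=> ["Kansas", "City"]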

Stripping Whitespace

Often, you are going to have some surrounding whitespace that you don't want corrupting the rest of your data. Reading CSV files gives a good example of this, as stray spaces often sneak in around the commas that separate columns.

A data file like this:

cities_spaced.csv:

city  ,state ,population,land area
  seattle  ,WA , 652405 ,83.9
new york,NY,8405837,  302.6

When read in, this can produce quite the messy dataset:

d3.csv("data/cities_spaced.csv", function(data) {
  console.log(JSON.stringify(data));
});
=> [{"city  ":"  seattle  ","state ":"WA ","population":" 652405 ","land area":"83.9"},
{"city  ":"new york","state ":"NY","population":"8405837","land area":"  302.6"}]
This code is using d3.js

Note the spaces in the property names as well as the values. In cases like this, it might be best to map the data back to a clean version. Lodash's trim can help. It removes that unsightly whitespace from the front and back of your strings.

Here is a version of the data loading function that removes whitespace. It uses d3.keys to grab the property names of each data object, then builds a clean object with trimmed keys and values.

d3.csv("data/cities_spaced.csv", function(data) {
  var clean = data.map(function(d) {
    var cleanD = {};
    d3.keys(d).forEach(function(k) {
      cleanD[_.trim(k)] = _.trim(d[k]);
    });
    return cleanD;
  });
  console.log(JSON.stringify(clean));
});
=> [{"city":"seattle","state":"WA","population":"652405","land area":"83.9"},
{"city":"new york","state":"NY","population":"8405837","land area":"302.6"}]
This code is using d3.js and lodash

The strings are now clear of those pesky spaces.

Find and Replace

Extracting data from strings can sometimes mean extracting pieces of strings. Finding out if a string contains a keyword or sub-string of interest is a first step in quantifying the content of a body of text.

indexOf can be used to perform this searching. You pass it a sub-string, and it'll tell you the location in the string where that sub-string starts. It returns -1 if the sub-string can't be found, so you can build a little string finder by comparing the return value to -1.

console.log("A man, a plan, a canal".indexOf("man") !== -1);
=> true
console.log("A man, a plan, a canal".indexOf("panama") !== -1);
=> false

Replace is the butter to find's bread. We will see more replacing when we get to regular expressions, but replacing sections of a string can be done with the replace method.

console.log("A man, a plan, a canal".replace("canal", ""));
=> "A man, a plan, a"

Templating

When you need to create a more complicated string, such as an html snippet, it may become too tedious to just combine strings by concatenating them with your variables. Consider the following example:

<div class="person">
  <span class="name">Birdman</span>
  <span class="occupation">Imaginary Super Hero</span>
</div>

If we wanted to build it using string concatenation, it might look like this:

var person = { name : "Birdman", occupation: "Imaginary Super Hero" };
var html_snippet = "<div class=\"person\">" +
  "<span class=\"name\">" + person.name + "</span>" +
  "<span class=\"occupation\">" + person.occupation + "</span>" +
"</div>";
console.log(html_snippet);
=> '<div class="person"><span class="name">Birdman</span><span class="occupation">Imaginary Super Hero</span></div>'

That's a lot of string escaping! You can imagine this gets pretty hard to manage after a while.

In order to simplify this process, you can use lodash templates to define a "template" that you can reuse with different data. Using our example above, we might define it like so:

var templateString = "<div class='person'>" +
  "  <span class='name'><%= name %></span>" +
  "  <span class='occupation'><%= occupation %></span>" +
  "</div>";
var templateFunction = _.template(templateString);

Now you can use this template function with lots of data to generate the same snippet of html:

console.log(templateFunction(person));
=> <div class='person'>  <span class='name'>Birdman</span>  <span class='occupation'>Imaginary Super Hero</span></div>
This code is using lodash
var anotherPerson = { name : "James. James Bond", occupation: "Spy" };
console.log(templateFunction(anotherPerson));
=> <div class='person'>  <span class='name'>James. James Bond</span>  <span class='occupation'>Spy</span></div>

Next Task

Regular Expressions








Regular Expressions



Regular expressions are used to match certain patterns of strings within other strings.

They can be a useful tool for extracting patterns rather than exact strings - for example: telephone numbers (sequences of numbers of a specific length), street numbers, or email addresses.

Finding Strings

var str = "how much wood would a woodchuck chuck if a woodchuck could chuck wood";
var regex = /wood/;

If we want to know whether the string "wood" appears in our larger string str, we could do the following:

if (regex.test(str)) {
  console.log("we found 'wood' in the string!");
}
=> "we found 'wood' in the string!"

To see the actual matches we found in the string, we can use the match method to find all matches available:

var matches = str.match(regex);
console.log(matches);
=> ["wood"]

Note that this only returned one match, even though the word "wood" appears several times in our original string. In order to find all individual instances of wood, we need to add the global flag, which we can do by adding a g to the end of our expression:

regex = /wood/g;
console.log(str.match(regex));
=> ["wood", "wood", "wood", "wood"]

Now, note that two of those matches were actually part of the word "woodchuck", which did not itself show up in our results. If we wanted to extend our regular expression to match both, we could do so in a few ways:

regex = /wood.*?\b/g;
console.log(str.match(regex));
=> ["wood", "woodchuck", "woodchuck", "wood"]

In this regular expression we are matching everything that starts with the string "wood" followed by 0 or more characters (.*?) until a word break (\b) occurs. Alternatively, we could also just search for both words:

regex = /woodchuck|wood/g;
console.log(str.match(regex));
=> ["wood", "woodchuck", "woodchuck", "wood"]

Note the order in which we did the last search. We used the word "woodchuck" before the word "wood". If we were to run our expression like so: /wood|woodchuck/g, we would end up with ["wood", "wood", "wood", "wood"] again, because alternatives are tried from left to right and "wood" matches first at each position, so "woodchuck" never gets a chance.

Replacing with regular expressions

If we wanted to replace the word "wood" in our original string, with the word "nun", we could do it like so:

regex = /wood/g;
var newstr = str.replace(regex, "nun");
console.log(newstr);
=> "how much nun would a nunchuck chuck if a nunchuck could chuck nun"

Probably not what you'd be going for, but you get our drift.

Finding Numbers

Extracting numbers from strings is a common task when looking for things like dollar amounts or any other numerical measurements that might be scattered about in the text. For example, if we wanted to extract the total amount of money spent on groceries from this message:

var message = "I bought a loaf of bread for $3.99, some milk for $2.49 and " +
  "a box of chocolate cookies for $6.95";

we could define a regular expression that looks for dollar amounts, with a pattern like so:

regex = /\$([0-9\.]+)\b/g;

This pattern looks for a literal dollar sign (\$), followed by one or more digits or periods ([0-9\.]+), up to a word boundary (\b). The parentheses capture the numeric part of the match.

If we wanted to find all the matches, we could use our string match function like so:

matches = message.match(regex);
console.log(matches);
=> ["$3.99", "$2.49", "$6.95"]

This is great! We have all our dollar amounts. While this gets us 90% there, we can't really add them with those $ signs. To remove them, we can use our trusty reduce function like so:

matches.reduce(function(sum, value) {
  return sum + Number(value.slice(1));
}, 0);
=> 13.43

Useful special characters

We've used a few special characters so far, like \b to indicate a word break. A few others that come up all the time: \d matches any digit, \w matches any word character (letters, digits, and underscore), \s matches any whitespace character, ^ matches the start of the string, and $ matches the end.

You can see a full list of all special characters here: MDN - Regular Expressions
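
As a quick example, \d pulls runs of digits out of a sentence:

regex = /\d+/g;
console.log("I bought 3 loaves and 12 rolls".match(regex));
=> ["3", "12"]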

Next Task

Working With Time








Working with Time



Time is one of those tricky programming things that seems like it should be easy, but usually turns out not to be. We will use D3's built-in time formatting and interval functions. We will also take a look at the powerful Moment.js library, for when you just need more time power.

String to Date

The first task when dealing with dates is usually getting a Date object out of a string. Most of the time, your data will have dates or times in a (mostly) arbitrary format, and you need to force that mess into an actual date.

D3 has d3.time.format which provides a way to do this parsing. It was a little confusing for me the first time I tried it. You use this function to create a string parser, and then use the parser to actually convert the string.

In our nesting example, we saw data that had dates as strings:

var expense = {"name":"jim","amount":34,"date":"11/12/2015"};

To convert this date string to a Date object, we would need a parser that looks like:

var parser = d3.time.format("%m/%d/%Y");
This code is using d3.js

The input string to d3.time.format indicates what the date string should look like. You have a lot of options for the special, percent-sign-prefixed variables. You can see in the string I'm using month, day, and four-digit year. The slashes in the format string are not special variables - but just what we expect to find separating the fields in the date string.

Next we use the parser to parse our string.

expense.date = parser.parse(expense.date);
console.log(expense);
=> {name: "jim", amount: 34, date: Thu Nov 12 2015 00:00:00 GMT-0500 (EST)}

Cool! Now our date is actually a Date object.

Here are a few more time parsers to show the capabilities of D3's parsing.

Just the date:

var date = d3.time.format("%A, %B %-d, %Y").parse("Wednesday, November 12, 2014");
console.log(date);
=> Wed Nov 12 2014 00:00:00 GMT-0500 (EST)
This code is using d3.js

(The little dash in front of the d is there to remove the 0-padding.)

date = d3.time.format("%m/%y").parse("12/14");
console.log(date);
=> Mon Dec 01 2014 00:00:00 GMT-0500 (EST)

You can see it defaults to the first day of the month.

Just the time:

var time = d3.time.format("%I:%M%p").parse("12:34pm");
console.log(time);
=> Mon Jan 01 1900 12:34:00 GMT-0500 (EST)
This code is using d3.js

Gives you a somewhat strange default date.

Date and time:

time = d3.time.format("%m/%d/%Y %H:%M:%S").parse("01/02/2014 08:22:05");
console.log(time);
=> Thu Jan 02 2014 08:22:05 GMT-0500 (EST)
This code is using d3.js

This could also be done using some built in short-hands:

time = d3.time.format("%x %X").parse("01/02/2014 08:22:05");
console.log(time);
=> Thu Jan 02 2014 08:22:05 GMT-0500 (EST)
This code is using d3.js

You can see that d3.time.format gives you a lot of flexibility about what your time string will look like.

Modifying Time

In many cases, you might want to modify a date object. Perhaps you only want to display the hour from a date, or maybe you want to figure out what a week from now would be.

The d3.time.interval set of functions provides a starting point for these kinds of manipulations.

Intervals allow for modifying dates around specific time slices like minutes, hours, days, months, or years. We are given a number of functions to work with each interval, depending on what we might want to do.

So, to get the nearest hour from a date, we can use d3.time.hour.round

var hourParser = d3.time.format("%I:%M%p");
var time = hourParser.parse("10:34pm");
var hour = d3.time.hour.round(time);
console.log(hour);
=> Mon Jan 01 1900 23:00:00 GMT-0500
This code is using d3.js

It returns a date object that just contains the nearest hour (11:00pm). We can display this by using the d3.time.format parser to format the date object into a string (these formatters work in both directions).

console.log(hourParser(hour));
=> 11:00PM
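
Intervals also provide an offset function for shifting dates, so "a week from now" (here, a week from a given date) might look something like this sketch:

var nextWeek = d3.time.day.offset(new Date(2014, 10, 12), 7);
console.log(nextWeek);
=> Wed Nov 19 2014 00:00:00 GMT-0500 (EST)
This code is using d3.js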

Moment.js

Moment.js is another JavaScript library that could be better suited to your needs, if you happen to be doing a lot of time manipulations. Its syntax and capabilities seem a bit more intuitive for certain time manipulations.

Check it out if you need more time control power!
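
For a taste, parsing and shifting a date with Moment.js might look something like this (a rough sketch, not tied to a particular Moment version):

var m = moment("11/12/2015", "MM/DD/YYYY");
console.log(m.format("dddd, MMMM Do YYYY"));
=> Thursday, November 12th 2015
console.log(m.add(7, "days").format("MM/DD/YYYY"));
=> 11/19/2015
This code is using moment.js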

Next Task

Checking Data Assumptions








Checking Data Assumptions



Data processing is tricky business, full of pitfalls and gotchas. Hopefully the tasks in this guide help with getting started in this process. But you, I, and the entire world will make mistakes. It's natural.

But mistakes in data processing, like all other kinds of mistakes, can be painful. They can result in hours of bug hunting, days of reprocessing, and months of crying. Since we know mistakes happen and will continue to happen, what can we do to take away some of the pain?

In a word, padding. We need some padding to protect us from the bumps and bruises of data processing. And I would suggest that this padding come in the form of simple tests that check the assumptions you have about the shape and contents of your data.

Unless there is an extreme performance need, these tests should run in the data processing pipeline. Ideally, they would be easy to turn on and off, so that you can disable them if needed once your code is deployed.

Assertions

These tests can be created with assertions - functions that check the truthiness of a statement in code. Typically, they raise an error when an expected truth is not actually true.

JavaScript doesn't have a built-in assert function, but we can rectify this deficiency with a simple function of our own.

function assert(isTrue, message) {
  if(!isTrue) {
    console.log(message);
    return false;
  }
  return true;
}

This will output a given message if the input is not true. Typically assertions throw errors, but we will just log the message for demonstration purposes.

Data Content Assumptions

Now let's use our assert function to check some assumptions about the details of our data.

We can use lodash's suite of type checking functions to take care of performing the checks, passing the result of the check to assert to produce our errors.

Let's say our data importing process has made some mistakes:

var data = [{"name":"Dan",
             "age":23,
             "superhuman":false},
            {"name":"Sleepwalker",
              "age":NaN,
              "superhuman":"TRUE"}
];

Our first entry looks ok, where our second entry has some problems. The age parsing for the immortal Sleepwalker has left him with no age. Also, bad input data has left us with a string in superhuman, where we expect a boolean.

A simple assumption checking function that could be run on this data could look something like this:

function checkDataContent(data) {
  data.forEach(function(d) {
    var dString = JSON.stringify(d);
    assert(_.isString(d.name), dString + " has a bad name - should be a string");
    assert(_.isNumber(d.age), dString + " has a bad age - should be a number");
    assert(!_.isNaN(d.age), dString + " has a bad age - should not be NaN");
    assert(_.isBoolean(d.superhuman), dString + " has a bad superhuman - should be boolean");
  });
}

checkDataContent(data);
=> {"name":"Sleepwalker","age":null,"superhuman":"TRUE"} has a bad age - should not be NaN
{"name":"Sleepwalker","age":null,"superhuman":"TRUE"} has a bad superhuman - should be boolean
This code is using lodash

Again, the focus here is on detection of data problems. You want something quick and simple that will serve as an early warning sign.

Unfortunately, the JavaScript primitive NaN is indeed a number, and so additional checks need to be made. As more data comes in, this function will need to be updated to add more checks. This might get a bit tedious, but a little bit of checking can go a long way towards maintaining sanity.

Data Shape Assumptions

Just as you can test your assumptions about the content of your data elements, it can be a good idea to test your assumptions about the shape of your data. Here, shape just refers to the size and structure of your data. Rows and columns.

Something simple to perform this check could look like this:

function checkDataShape(data) {
  assert(data.length > 0, "data is empty");
  assert(data.length > 4, "data is too small");
  var keys = d3.keys(data[0]);
  assert(keys.length === 4, "wrong number of columns");
}

checkDataShape(data);
=> data is too small
wrong number of columns

The two assumption functions could easily be combined into one, but it's important to look at both aspects of your data.

More Assertions

If this is an approach that appeals to you, and your data might get really complicated (or really messy) you may want to explore using more complicated assertion code.

One useful library to explore is Chai which comes with a great collection of assertion helpers. These can help you check for more complicated things like whether two objects are equal or whether an object has or doesn't have a property.

For example:

assert.deepEqual({ tea: 'green' }, { tea: 'green' });
This code is using chai's assert library

The assertion above passes silently; chai would throw an error if the two objects were not deeply equal.

Next Task

Using Node








Analyzing Data with Node



As mentioned in the introduction, this guide is mostly geared for client-side data analysis, but with a few augmentations, the same tools can be readily used server-side with Node.

If the data is too large, this might in fact be your only option if you want to use JavaScript for your data analysis. Trying to deal with large data in the browser might result in your users having to wait for a long time. No user will wait for 5 minutes with a frozen browser, no matter how cool the analysis might be.

Setting up a Node Project

To get started with Node, ensure both node and npm, the Node package manager, are installed and available via the command line:

which node
# /usr/local/bin/node
which npm
# /usr/local/bin/npm

Your paths may be different than mine, but as long as which returns something, you should be good to go.

If node isn't installed on your machine, you can install it easily via a package manager.

Create a new directory for your data analysis project. In this example, we have a directory with a sub-directory called data which contains our animals.tsv file inside.

animals_analysis
|
 - data
   |
    - animals.tsv

Installing Node Modules

Next, we want to install our JavaScript tools, D3 and lodash. With Node, we can automate the process by using npm. Inside your data analysis directory run the following:

npm install d3
npm install lodash

You can see that npm creates a new sub-directory called node_modules by default, where your packages are installed. Everything is kept local, so you don't have to worry about problems with missing or out-of-date packages. Your analysis tools for each project are ready to go.

A package.json file can be useful for saving this kind of meta information about your project: dependencies, name, description, etc. Check out this interactive example or npm's documentation for more information.
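
For reference, a minimal package.json for this project might look something like this (the name and version numbers are just illustrative placeholders):

{
  "name": "animals_analysis",
  "version": "1.0.0",
  "description": "Analyzing animal data with d3 and lodash",
  "dependencies": {
    "d3": "^3.5.0",
    "lodash": "^3.10.0"
  }
}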

Requiring Modules

Now we create a separate JavaScript file to do our analysis in:

touch analyze.js

Inside this file, we first require our external dependencies.

var fs = require("fs");
var d3 = require("d3");
var _ = require("lodash");

We are requiring our locally installed d3 and lodash packages. Note how we assign them to variables, which are used to access their functions later in the code.

We also require the file system module. As we will see in a second, we need this to load our data - which is really the key difference between client-side and server-side use of these tools.

Loading Data in Node

D3's data loading functionality is based on XMLHttpRequest, which is great, but Node does not have XMLHttpRequest. There are packages that work around this mismatch, but a more elegant solution is to just use Node's built-in file system functionality to load the data, and then D3 to parse it.

fs.readFile("data/animals.tsv", "utf8", function(error, data) {
  data = d3.tsv.parse(data);
  console.log(JSON.stringify(data));
});

fs.readFile is asynchronous and takes a callback function when it is finished loading the data.

Like our Queue example in client-side reading, the parameters of this function start with error, which will be null unless there is an error.

The data returned by readFile is the raw string contents of the file.

We can use d3.tsv.parse, which takes a string and converts it into an array of data objects - just like what we are used to on the client side!

From this point on, we can use d3 and lodash functionality to analyze our data.

A full, but very simple script might look like this:

var fs = require("fs");
var d3 = require("d3");
var _  = require("lodash");

fs.readFile("data/animals.tsv", "utf8", function(error, data) {
  data = d3.tsv.parse(data);
  console.log(JSON.stringify(data));

  var maxWeight = d3.max(data, function(d) { return +d.avg_weight; }); // coerce the string to a number
  console.log(maxWeight);
});

Running the Analysis

Since this is not in a browser, we need to execute this script, much like you would with a script written in Ruby or Python.

From the command line, we can simply run it with node to see the results.

node analyze.js
=> [{"name":"tiger","type":"mammal","avg_weight":"260"},{"name":"hippo","type":"mammal","avg_weight":"3400"},{"name":"komodo dragon","type":"reptile","avg_weight":"150"}]
3400

Writing Data

Maybe the original data set is too big, but we can use Node to perform an initial pre-processing or filtering step and output the result to a new file to work with later.

Node has fs.writeFile that can perform this easily.

Inside the read callback, we can call this to write the data out.

var bigAnimals = data.filter(function(d) { return d.avg_weight > 300; });
var bigAnimalsString = JSON.stringify(bigAnimals);

fs.writeFile("big_animals.json", bigAnimalsString, function(err) {
  console.log("file written");
});

Running this should leave us with a big_animals.json file in our analysis folder.

This is fine if JSON is what you want, but often times you want to output TSV or CSV files for further analysis. D3 to the rescue again!

D3 includes d3.csv.format (and the equivalent for TSV and other file formats) which converts our array of data objects into a string - perfect for writing to a file.

Let's use it to make a CSV of our big animals.

var bigAnimals = data.filter(function(d) { return d.avg_weight > 300; });
var bigAnimalsString = d3.csv.format(bigAnimals);

fs.writeFile("big_animals.csv", bigAnimalsString, function(err) {
  console.log("file written");
});

Run this with the same node analyze.js command and now you should have a lovely little big_animals.csv file in your directory. It even takes care of the headers for you:

name,type,avg_weight
hippo,mammal,3400

Now even BIG data is no match for us - using the power of JavaScript!