learn js data
About this guide
This guide teaches the basics of manipulating data using JavaScript in the
browser, or in node.js.
Specifically, these tasks are geared around preparing
data for further analysis and visualization.
Here we demonstrate some basic techniques and how to implement them
using the core JavaScript API, the D3.js library, and lodash.
It assumes you already have some basic knowledge of JavaScript.
News
01/11/2019: We also have an amazing Observable version of the guide provided by the very talented @dakoop.
Observable is a great, interactive way to try out JavaScript.
Give it a try now!
01/10/2019: We've updated the guide to use D3v5.
The new changes mostly impact the "Reading in Data" section of the guide.
A very special thank you goes out to Erin Brown who contributed the PR to make this happen! We really appreciate the help!
03/20/2017: We've updated the guide to use D3v4!! Thanks very much to Kanit Ham Wong and others at the UW Interactive Data Lab for support, suggestions, and motivation for this process.
Thanks to Adam Pearce for doing most of the converting!
Code
Each document in this repo is executed when loaded into a browser.
Check it out by opening the Developer Tools Console.
You should see the output of the following code block:
console.log("This is the index!");
Check out the full source on github.
Why?
Is data cleaning and processing in JavaScript something you would actually want to do? Maybe.
There are other languages out there that do a great job with data wrangling:
R with the Tidyverse
Python with pandas
These tools are great and you should use them.
Often times, however, you are already familiar with a particular language (like JavaScript) and would like to get started with data, but want to take it one step at a time.
Additionally, sometimes you are already in a particular environment (like JavaScript) and don't have the luxury of switching to one of these other options.
In these cases, JavaScript could be considered a viable option for your data analysis.
And if you find yourself in one of these situations, or just want to try out JavaScript for data analysis for fun, then this guide is for you!
Check out some of the tasks, and see if JavaScript data analysis is something you want to try yourself.
Thanks!
This guide is the result of quite a team effort.
Its inspiration came from a ponderous tweet by the one and only Lynn Cherney, who lamented the dearth of JavaScript data guidance at the time.
The beautiful folks at Bocoup are the reason this guide exists.
They had the foresight to provide the all important luxury of time so that it could be written.
Thanks to Jory Burson and Boaz Sender for creating a culture that gave space for these kinds of things to be created.
The bulk of this guide was written while at Bocoup by Irene Ros, Yannick Assogba, and Jim Vallandingham.
Since that time, we have had numerous other contributors, who have seen something to improve, and made us all better for their help.
Perhaps hackneyed, but almost always true - It takes a village.
So to that village we want to say, "Thanks!".
Getting Started
About Tasks
This guide is broken up into a number of tasks, which we can think of as little modules or recipes.
Each task tries to encapsulate a concrete lesson around common data manipulation and analysis processes.
Tasks attempt to be self-contained and stay focused on the, well, task at hand.
This guide was built for client-side data processing (in the browser), but can easily be used in a server-side (Node) application with a bit of tweaking (you can check out the analyzing data with Node section for the details later).
Why D3?
D3.js is largely known for its data visualization capabilities - and for good reason.
It is quickly becoming the de facto standard for interactive visualization on the web.
Its core feature of binding data to visual representations happens to require a lot of manipulation of said data.
Thus, while this toolkit is focused around visualization, it is well suited for data processing as well!
And, a typical output for data manipulation is at least some sort of visualization of that data, in which case you are all ready to go.
A Note about D3v4
In the not too distant past, a major rewrite of the D3.js library was completed and released into the wild.
It includes quite a few API changes and a very modular structure (meaning in theory you can just use the bits of D3 that you want and not the rest).
This major rewrite makes D3 a lot better - but it also makes it more challenging to read and use old code and sometimes to understand the documentation.
But don't despair! We've maintained the old version of this guide using D3v3 in case you have old code you need help with.
Why lodash?
Lodash is fast, popular, and fills in some holes in D3's processing features.
Plus, its functional style and chaining capabilities make it work well alongside D3.
Code Snippets
There are a bunch of useful snippets in this guide.
Here is an example:
var theMax = d3.max([1,2,20,3]);
console.log(theMax);
=> 20
This code is using d3.js
We use a little arrow, =>, to indicate output from a code snippet.
You can view this same output by opening the console of your favorite web browser.
Snippets in this guide that are not pure JavaScript will be marked with the libraries used to make them work.
Preparing Your Site for Data Processing
To get started using these tools for your data processing, you are going to want to include them in your html file along with a JavaScript file to perform the analysis.
I typically download these scripts and include local copies in my page.
You can keep "libraries" that you are using but didn't write in a lib
folder, and the code you write yourself in a src
folder.
Then you want to load all these files on an HTML page.
To do this, you would want to have your HTML look something like this:
<!doctype html>
<html>
<head>
</head>
<body>
<script src='lib/d3.js'></script>
<script src='lib/lodash.js'></script>
<script src='src/analysis.js'></script>
</body>
</html>
src/analysis.js
would be where your analysis code goes.
I put these script
tags at the end of the body
- just so that if there is other content on the page, it won't be delayed in loading.
Typically, I name this main HTML file index.html
- so that it's loaded automatically as the root page.
Running a Local Server
D3's functions for reading data require you be running the page from a server.
You can do this on your own machine by running a local server out of the root directory of your site.
There are many options for easy-to-spin-up web servers:
SimpleHTTPServer for Python
httpd for Ruby
http-server for Node
Lately, I have been using that last option - http-server
.
If you have Node and npm installed, you can grab the required package by installing it from the command line:
npm install -g http-server
(The -g flag stands for global - it allows you to access http-server from any directory on your machine.)
Then cd
to your analysis directory and start it up!
cd /path/to/dir
http-server
In your web browser, open up http://0.0.0.0:8080 and you should be ready to go!
Next Task
Reading in Data
See Also
Installing Node - if you need some help getting http-server
on your machine.
Reading in Data
The first step in any data processing is getting the data! Here is how to parse and prepare common input formats using D3.js.
Parsing CSV Files
D3 has a bunch of filetypes it can support when loading data, and one of the most common is probably plain old CSV (comma separated values).
Let's say you had a csv file with some city data in it:
cities.csv:
city,state,population,land area
seattle,WA,652405,83.9
new york,NY,8405837,302.6
boston,MA,645966,48.3
kansas city,MO,467007,315.0
Use d3.csv to convert it into an array of objects
d3.csv("/data/cities.csv").then(function(data) {
console.log(data[0]);
});
=> {city: "seattle", state: "WA", population: "652405", land area: "83.9"}
This code is using d3.js
You can see that the headers of the original CSV have been used as the property names for the data objects.
Using d3.csv
in this manner requires that your CSV file has a header row.
If you look closely, you can also see that the values associated with these properties are all strings.
This is probably not what you want in the case of numbers.
When loading CSVs and other flat files, you have to do the type conversion.
We will see more of this in other tasks, but a simple way to do this is to use the + operator (unary plus).
forEach
can be used to iterate over the data array.
d3.csv("/data/cities.csv").then(function(data) {
data.forEach(function(d) {
d.population = +d.population;
d["land area"] = +d["land area"];
});
console.log(data[0]);
});
=> {city: "seattle", state: "WA", population: 652405, land area: 83.9}
This code is using d3.js
Dot notation is a useful way to access the properties of these data objects.
However, if your headers have spaces in them, then you will need to use bracket notation as shown.
This can also be done during the loading of the data, by d3.csv
directly.
This is done by providing an accessor function to d3.csv
, whose return value will be the individual data objects in our data array.
d3.csv("/data/cities.csv", function(d) {
return {
city : d.city,
state : d.state,
population : +d.population,
land_area : +d["land area"]
};
}).then(function(data) {
console.log(data[0]);
});
=> {city: "seattle", state: "WA", population: 652405, land_area: 83.9}
This code is using d3.js
In this form, you have complete control over the data objects and can rename properties (like land_area
) and convert values (like population
) willy-nilly.
On the other hand, you have to be quite explicit about which properties to return.
This may or may not be what you are into.
I typically allow D3 to load all the data, and then make modifications in a post-processing step, but it might be more effective for you to be more explicit with the modifications.
Reading TSV Files
CSV is probably the most common flat file format, but in no way the only one.
I often like to use TSV (tab separated files) - to get around the issues of numbers and strings often having commas in them.
D3 can parse TSV's with d3.tsv.
Here is animals.tsv, as an example:
name	type	avg_weight
tiger	mammal	260
hippo	mammal	3400
komodo dragon	reptile	150
Loading animals.tsv with d3.tsv
:
d3.tsv("/data/animals.tsv").then(function(data) {
console.log(data[0]);
});
=> {name: "tiger", type: "mammal", avg_weight: "260"}
This code is using d3.js
Reading Other Flat Files
In fact, d3.csv
and d3.tsv
are only the tip of the iceberg.
If you have a non-standard delimited flat file, you can parse them too using d3.dsv!
For example, here is a pipe-delimited file called animals_piped.txt
:
name|type|avg_weight
tiger|mammal|260
hippo|mammal|3400
komodo dragon|reptile|150
We first provide d3.dsv
with the delimiter, in this case, a pipe (|
), then read in our file:
d3.dsv("|", "/data/animals_piped.txt").then(function(data){
console.log(data[1]);
});
=> {name: "hippo", type: "mammal", avg_weight: "3400"}
This code is using d3.js
Reading JSON Files
For nested data, or for passing around data where you don't want to mess with data typing, it's hard to beat JSON.
JSON has become the language of the internet for good reason.
It's easy to understand, write, and parse.
And with d3.json - you too can harness its power.
Here is an example JSON file called employees.json
:
[
{"name":"Andy Hunt",
"title":"Big Boss",
"age": 68,
"bonus": true
},
{"name":"Charles Mack",
"title":"Jr Dev",
"age":24,
"bonus": false
}
]
Loading employees.json
with d3.json
:
d3.json("/data/employees.json").then(function(data) {
console.log(data[0]);
});
=> {name: "Andy Hunt", title: "Big Boss", age: 68, bonus: true}
This code is using d3.js
We can see that, unlike our flat file parsing, numeric types stay numeric.
Indeed, a JSON value can be a string, a number, a boolean value, an array, or another object.
This allows nested data to be dealt with easily.
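For example, if an employee record carried a nested address object (a hypothetical extension of the file above), you could reach into it with ordinary property access:
var employee = {"name":"Andy Hunt", "address":{"city":"Boston", "zip":"02110"}};
console.log(employee.address.city);
=> Boston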
Loading Multiple Files
D3's basic loading mechanism is fine for one file, but starts to get messy as we nest multiple callbacks.
For loading multiple files, we can use Promises to wait for multiple data sources to be loaded.
Promise.all([
d3.csv("/data/cities.csv"),
d3.tsv("/data/animals.tsv")
]).then(function(data) {
console.log(data[0][0]) // first row of cities
console.log(data[1][0]) // first row of animals
});
=> {city: "seattle", state: "WA", population: "652405", land area: "83.9"}
{name: "tiger", type: "mammal", avg_weight: "260"}
This code is using d3.js
Note that inside the Promise.all call we load two types of files - using two different loading functions - so this is an easy way to mix and match file types.
The resulting promise hands our callback an array of data sources.
The first item holds our cities; the second, our animals.
Next Task
Combining Data
See Also
D3 documentation
Loading XML with D3
Loading External SVG with D3 - SVG is just XML!
Combining Data
Note: This task was very generously contributed by Timo Grossenbacher - Geographer and Data Specialist extraordinaire.
Thanks very much Timo!
Often, you have to combine two or more different data sets because they contain complementary information.
Or, for example, the data come in chunks from the server and need to be reassembled on the client side.
Combining or merging data may involve one of the following tasks:
Combine data sets by one or more common attributes
Add together rows from different data sets
Combine attributes from different data sets
Combine data sets by one or more common attributes
Let's say we have a list of the following articles:
var articles = [{
"id": 1,
"name": "vacuum cleaner",
"weight": 9.9,
"price": 89.9,
"brand_id": 2
}, {
"id": 2,
"name": "washing machine",
"weight": 540,
"price": 230,
"brand_id": 1
}, {
"id": 3,
"name": "hair dryer",
"weight": 1.2,
"price": 24.99,
"brand_id": 2
}, {
"id": 4,
"name": "super fast laptop",
"weight": 400,
"price": 899.9,
"brand_id": 3
}];
And of the following brands:
var brands = [{
"id": 1,
"name": "SuperKitchen"
}, {
"id": 2,
"name": "HomeSweetHome"
}];
As you can see, in each article, brand_id
points to a particular brand, whose details are saved in another data set - which can be considered a lookup table in this case.
This is often how separate data schemes are stored in a server-side database.
Also note that the last article in the list has a brand_id
for which no brand is stored in brands
.
What we want to do now is to combine both datasets, so we can reference the brand's name
directly from an article.
There are several ways to achieve this.
Using native Array
functions
We can implement a simple join (left outer join in database terms) using native, i.e., already existing Array
functions as follows.
The method presented here modifies the articles
array in place by adding a new key-value-pair for brand
.
articles.forEach(function(article) {
var result = brands.filter(function(brand) {
return brand.id === article.brand_id;
});
delete article.brand_id;
article.brand = (result[0] !== undefined) ? result[0].name : null;
});
console.log(articles);
=> [{
"id": 1,
"name": "vacuum cleaner",
"weight": 9.9,
"price": 89.9,
"brand": "HomeSweetHome"
}, {
"id": 2,
"name": "washing machine",
"weight": 540,
"price": 230,
"brand": "SuperKitchen"
}, {
"id": 3,
"name": "hair dryer",
"weight": 1.2,
"price": 24.99,
"brand": "HomeSweetHome"
}, {
"id": 4,
"name": "super fast laptop",
"weight": 400,
"price": 899.9,
"brand": null
}];
First, we loop over each article
, where we take its brand_id
to look up the corresponding brand
using the native filter
function.
Note that this function returns an array and we expect it to have only one element.
In case there is no corresponding brand
, result[0]
will be undefined
, and in order to prevent an error (something like result[0] is undefined
), we use the ternary operator.
Also, as we no longer need brand_id
after the lookup has been done, we can safely delete it.
If we want to join by more than one attribute, we can modify the filter function to achieve this.
Hypothetically, this might look something like:
innerArray.filter(function(innerArrayItem) {
return (
innerArrayItem.idA === outerArrayItem.idA &&
innerArrayItem.idB === outerArrayItem.idB
);
});
Using a generic and more efficient approach
A more generic, and also more performant version of a join is proposed below (abbreviated from this StackOverflow answer).
Its output is equivalent to that of the above method.
function join(lookupTable, mainTable, lookupKey, mainKey, select) {
var l = lookupTable.length,
m = mainTable.length,
lookupIndex = [],
output = [];
for (var i = 0; i < l; i++) { // loop through l items
var row = lookupTable[i];
lookupIndex[row[lookupKey]] = row; // create an index for lookup table
}
for (var j = 0; j < m; j++) { // loop through m items
var y = mainTable[j];
var x = lookupIndex[y[mainKey]]; // get corresponding row from lookupTable
output.push(select(y, x)); // select only the columns you need
}
return output;
};
Because the above-defined function creates an index for the lookupTable
(in our case brands
) in its first loop, it runs considerably faster than the previously shown method.
Also, via a callback, it allows us to directly define which keys (or "attributes") we want to retain in the resulting, joined array (output
).
It is used like so:
var result = join(brands, articles, "id", "brand_id", function(article, brand) {
return {
id: article.id,
name: article.name,
weight: article.weight,
price: article.price,
brand: (brand !== undefined) ? brand.name : null
};
});
console.log(result);
=> [{
"id": 1,
"name": "vacuum cleaner",
"weight": 9.9,
"price": 89.9,
"brand": "HomeSweetHome"
}, {
"id": 2,
"name": "washing machine",
"weight": 540,
"price": 230,
"brand": "SuperKitchen"
}, {
"id": 3,
"name": "hair dryer",
"weight": 1.2,
"price": 24.99,
"brand": "HomeSweetHome"
}, {
"id": 4,
"name": "super fast laptop",
"weight": 400,
"price": 899.9,
"brand": null
}];
Note that we don't modify articles
in place but create a new array.
Add together rows from different data sets
Let's say we want to load a huge data set from the server, but because of network performance reasons, we load it in three chunks and reassemble it on the client side.
With D3v5 and later, we can use Promise.all() to run many requests concurrently, combining them after they have finished downloading using d3.merge().
Note that Promise.all()
takes an array of Promises, in this case supplied by calls to d3.csv()
.
Promise.all([
d3.csv("/data/big_data_1.csv"),
d3.csv("/data/big_data_2.csv"),
d3.csv("/data/big_data_3.csv")
]).then(function(allData) {
console.log(d3.merge(allData));
});
=> [{"a": "1", "b": "2"},{"a": "3", "b": "4"},{"a": "5", "b": "6"}]
This code is using d3.js
Note that the argument passed to d3.merge
must itself be an array of arrays - which is exactly what Promise.all hands to our callback.
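If it helps to see it in isolation, d3.merge simply flattens an array of arrays by one level:
console.log(d3.merge([[1, 2], [3, 4], [5]]));
=> [1, 2, 3, 4, 5]
This code is using d3.js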
Combine attributes from different data sets
In the last case, we have two or more data sets that contain attributes describing the same observations, or conceptual entities, and they need to be combined.
This implies that all data sets have the same length.
For example, dataset_1
below contains two observations of attribute type
and attribute model
, while dataset_2
contains the same two entities, but observed through attributes price
and weight
.
var dataset_1 = [{
'type': 'boat',
'model': 'Ocean Queen 2000'
}, {
'type': 'car',
'model': 'Ferrari'
}];
var dataset_2 = [{
'price': 23202020,
'weight': 5656.9
}, {
'price': 59988,
'weight': 1.9
}];
So in both data sets we essentially have separate information about the same conceptual entities, thus it makes sense to "merge" them, for which we can use lodash's merge
function:
var result = _.merge(dataset_1, dataset_2);
console.log(result);
=> [{
'type': 'boat',
'model': 'Ocean Queen 2000',
'price': 23202020,
'weight': 5656.9
}, {
'type': 'car',
'model': 'Ferrari',
'price': 59988,
'weight': 1.9
}];
This code is using lodash
Next Task
Summarizing Data
Summarizing Data
With the data loaded, we want to take a quick look at what we have.
D3 has a number of tools to use for quick data exploration.
To start, let's pretend we have loaded up a csv file - and have a dataset that looks something like:
var data = [
{"city":"seattle", "state":"WA", "population":652405, "land_area":83.9},
{"city":"new york", "state":"NY", "population":8405837, "land_area":302.6},
{"city":"boston", "state":"MA", "population":645966, "land_area":48.3},
{"city":"kansas city", "state":"MO", "population":467007, "land_area":315}
];
Min & Max
As it turns out, D3 comes to the rescue again, with d3.min and d3.max.
Use the callback function to indicate which property (or computed value based on the properties) to access.
var minLand = d3.min(data, function(d) { return d.land_area; });
console.log(minLand);
=> 48.3
This code is using d3.js
var maxLand = d3.max(data, function(d) { return d.land_area; });
console.log(maxLand);
=> 315
This code is using d3.js
If you want both of them at the same time, you can use d3.extent
var landExtent = d3.extent(data, function(d) { return d.land_area; });
console.log(landExtent);
=> [48.3, 315]
This code is using d3.js
This returns an array with the first element the minimum value and the second element the maximum.
Summary Statistics
D3 provides a few basic tools to analyze your data, all using the same format as the min and max functions.
Simply provide the property you would like to analyze, and you are good to go.
d3.mean
var landAvg = d3.mean(data, function(d) { return d.land_area; });
console.log(landAvg);
=> 187.45
This code is using d3.js
d3.median
var landMed = d3.median(data, function(d) { return d.land_area; });
console.log(landMed);
=> 193.25
This code is using d3.js
d3.deviation - for standard deviation
var landSD = d3.deviation(data, function(d) { return d.land_area; });
console.log(landSD);
=> 140.96553952414519
This code is using d3.js
Next Task
Iterating and Reducing
See Also
simple statistics - more JavaScript based stats written in easier to comprehend code.
Datalib - A Javascript utility library for data loading, type inference, common statistics, and string templates that was created to power Vega and Vega-Lite.
Iterating Over and Reducing Data
Most of the functions we used to summarize our data had to iterate over the entire dataset to generate their results - but the details were hidden behind the function.
Now let's look at how we might perform this iteration ourselves for other metrics and manipulations!
Again, we start with a basic data set already loaded:
var data = [
{"city":"seattle", "state":"WA", "population":652405, "land_area":83.9},
{"city":"new york", "state":"NY", "population":8405837, "land_area":302.6},
{"city":"boston", "state":"MA", "population":645966, "land_area":48.3},
{"city":"kansas city", "state":"MO", "population":467007, "land_area":315}
];
Iterating
First some basic iteration.
We already saw this in the data loading task, but a common way to process each data object is by using forEach
var count = 0;
data.forEach(function(d) {
count += 1;
});
console.log(count);
=> 4
Of course, data also has the property length
which would be the actual way to get the number of data elements in data
- but this is just an example.
console.log(data.length);
=> 4
Immutability
Let me sidetrack this task just a bit to talk about immutability.
forEach
provides a basic way to loop through our data set.
We can use this to modify the data in place, generate counts, or perform other manipulations that deal with each piece of data individually.
This works, but can get clunky and confusing fast.
Keeping straight what form the data is in at any given time can be confusing, as can side effects of modifying your data that you might not be aware of.
To combat this confusion, it can be useful to think of the data as immutable.
Immutable data cannot be modified once created.
Immutability seems a bit counterintuitive for a task where we want to coerce our data into the form we want - but it comes together with the concept of transformations.
The idea is simple: each immutable dataset can be transformed into another immutable dataset through the use of a transformation function that works on each component of the data.
This process helps simplify the data flow, but if you have to make a copy of your data object each time, it can make code a bit brittle as you have to keep track of every attribute of your dataset.
Cloning
To help with this issue of brittle transformations, lodash provides the clone function.
This function takes an object and returns a copy of that object.
That copy is now a separate data object that you can edit without affecting the original object.
var dataObject = {"name":"Carl", "age":"48", "salary":"12300"};
var copyOfData = _.clone(dataObject);
copyOfData.age = +copyOfData.age;
copyOfData.salary = +copyOfData.salary;
console.log(dataObject);
=> {name: "Carl", age: "48", salary: "12300"}
This code is using lodash
console.log(copyOfData);
=> {name: "Carl", age: 48, salary: 12300}
By default, the clone
function will not copy over nested objects.
Instead these nested objects are simply passed by reference - meaning the original and the copy will still share them.
var dataObject = {"name":"Saul", "stats":{"age":"55"}};
var shallowCopy = _.clone(dataObject);
shallowCopy.stats.age = +shallowCopy.stats.age;
console.log(dataObject);
=> {"name":"Saul","stats":{"age":55}}
This code is using lodash
console.log(shallowCopy);
=> {"name":"Saul","stats":{"age":55}}
Note that because stats
is a nested object the modification happened in both spots!
To prevent this "feature", we can pass true
as the second parameter to clone
to indicate that the copy should be deep and copy nested objects as well.
var dataObject = {"name":"Saul", "stats":{"age":"55"}};
var deepCopy = _.clone(dataObject, true);
deepCopy.stats.age = +deepCopy.stats.age;
console.log(dataObject);
=> {"name":"Saul","stats":{"age":"55"}}
This code is using lodash
console.log(deepCopy);
=> {"name":"Saul","stats":{"age":55}}
Older versions of lodash also let you pass true as a second argument to clone for a deep copy, but cloneDeep makes the deep-ness more explicit.
Mapping
JavaScript's map can be a very useful tool to implement this concept of a transformation on immutable data.
map
takes an array and produces another array which is the result of the callback function being executed on each element in the array.
var smallData = data.map(function(d,i) {
return {
name: d.city.toUpperCase(),
index: i + 1,
rounded_area: Math.round(d.land_area)
};
});
console.log(data[0]);
console.log(smallData[0]);
=> {city: "seattle", state: "WA", population: 652405, land_area: 83.9}
{name: "SEATTLE", index: 1, rounded_area: 84}
The callback function gets called for each element in the array, and also has access to the index of that element in the array.
The result is an array of returned values from the callback.
With plain JavaScript, the immutability of an array is just in the mind of the developer.
While map
does not modify the array, it is easy for your callback method to do so.
That is why we return a new object in the callback.
lodash's clone would be another approach to getting a copy of each data element as a starting point for the transformation.
Filtering
Select a subset of the data using the built in filter method.
This creates a new array of data (again see transformation talk above) with only the values that the callback function returns true
for.
var large_land = data.filter(function(d) { return d.land_area > 200; });
console.log(JSON.stringify(large_land));
=> [{"city":"new york","state":"NY","population":8405837,"land_area":302.6},
{"city":"kansas city","state":"MO","population":467007,"land_area":315}]
Sorting
Similar to filtering, sorting data based on attributes is something you'll want to do frequently.
The built in sort for arrays can do this.
A caveat to this function is that, unlike filter, map, and other functions, this modifies the array you are sorting in place, instead of returning a new array with the objects sorted.
To sort an array, you need a comparator function.
This is a function that takes two pieces of data and indicates which one you want higher in the list.
The comparator-function-way to do this is to return a negative value if the first value should go higher than the second value, and a positive value if the second value should go higher.
If they are equal, and you don't care, then return a 0.
Let's see it in action.
Here is a way to sort by population in a descending order (larger populations come first).
data.sort(function(a,b) {
return b.population - a.population;
});
console.log(JSON.stringify(data));
=> [{"city":"new york","state":"NY","population":8405837,"land_area":302.6},
{"city":"seattle","state":"WA","population":652405,"land_area":83.9},
{"city":"boston","state":"MA","population":645966,"land_area":48.3},
{"city":"kansas city","state":"MO","population":467007,"land_area":315}]
This b - a
thing is a pretty common way to generate this kind of sort.
But you could also do it more explicitly.
Thinking through it, if b's population is larger than a's, then the value returned by b.population - a.population
will be positive - so b will be sorted toward the top of the array.
If the reverse is true, then the result will be negative, and a will be sorted first.
Note again, that the sort happened on the original data, which I'm not a big fan of.
D3 also has a few helper functions to implement ascending and descending comparator functions - but (as far as I can tell) they only accept arrays of raw numbers instead of objects.
So to use d3.ascending or d3.descending you would have to do something like this:
var populations = data.map(function(d) { return d.population; });
console.log(populations);
=> [652405, 8405837, 645966, 467007]
populations.sort(d3.descending);
console.log(populations);
=> [8405837, 652405, 645966, 467007]
I'm usually looking to keep my data objects together, so I shy away from using these methods, but they might be great for what you are trying to do.
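That said, if you do want to keep your objects together and still lean on these helpers, one small sketch is to call them inside your own comparator:
data.sort(function(a,b) { return d3.descending(a.population, b.population); });
console.log(data[0].city);
=> new york
This code is using d3.js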
A big gotcha with sorting that you should watch out for is that if you do not pass a comparator function, the default sort converts values to strings and orders them alphabetically.
So, the array:
var nums = [3,1,10,20];
Would be sorted to:
console.log(nums.sort());
=> [1, 10, 20, 3]
This is never what you want for data sorting.
For this reason, you should never use sort without a comparator function.
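The fix is the same comparator pattern we used above - for plain numbers, that is simply:
console.log(nums.sort(function(a,b) { return a - b; }));
=> [1, 3, 10, 20]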
Reducing
The summary functions we saw earlier all take an array and reduce it down to a single number.
But what if that number isn't the one you want? Well, you can take this reduction into your own hands with reduce
!
The syntax for reduce is always hard for me to remember, so let's go over it with the classic example: summing up a value.
var landSum = data.reduce(function(sum, d) {
return sum + d.land_area;
}, 0);
console.log(landSum);
=> 749.8
The first parameter to reduce
is the callback function that will return the running "total" of the reduction.
This function is passed in the previous value returned from the last time the callback was called.
Here, that parameter - sum
provides the running total as we move through the array.
The second parameter to the callback d
is the current value of the array we are working on.
reduce
can take an initial value, which is the second parameter to the reduce
call.
For this example, we start the sum at 0.
If there is no starting value provided, then for the first execution of the callback (when there is no previous value) the first parameter to the callback will be the value of the first element of the array, and the reduction starts with the second element.
It always makes more sense to me to provide a starting value - unless you know what you are doing.
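To make that concrete, here is a tiny reduction without a starting value - the first element becomes the initial sum and iteration starts at the second:
var total = [1, 2, 3, 4].reduce(function(sum, d) { return sum + d; });
console.log(total);
=> 10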
You can also get the current index into the array (and the whole array itself) if that is useful to you.
var weirdString = data.reduce(function(str, d, i) {
var ending = (i % 2 === 0) ? " is cool." : " sucks." ;
return str + " " + d.city + ending;
}, "");
console.log(weirdString);
=> seattle is cool.
new york sucks.
boston is cool.
kansas city sucks.
And summing over a variable is only used as an example - you can always just use d3.sum for this instead.
Chaining Functions
One of the great things about these more functional functions is that it is possible to chain them together into one big data wrangling pipeline!
var bigCities = data.filter(function(d) { return d.population > 500000; })
.sort(function(a,b) { return a.population - b.population; })
.map(function(d) { return d.city; });
console.log(bigCities);
=> ["boston", "seattle", "new york"]
Since we are using sort
after filter
, sort is working on the returned array from filter
.
The sort function at least is nice enough to also return the array, so chaining is still possible.
Next Task
Grouping Data
See Also
Making Juice with Reduce - Tom MacWright's intro to the ill-used reduce
Immutable JS - if you want to get serious about immutable data structures in JavaScript
Ramda - a more functional approach to data processing in JS
Grouping Data
Grouping data is an important capability to have when doing data analysis.
Often times, you will want to break apart the data by a categorical variable and look at statistics or details for each group.
D3 includes the powerful d3.nest functionality to produce these groupings with a minimal amount of code.
Nest Basics
Fundamentally, d3.nest
is about taking a flat data structure and turning it into a nested one.
The user gets to decide how the nesting should occur, and how deep to nest.
This is a bit different than many group_by concepts, where only a single level of nesting is allowed.
Let's say we have the following CSV file of "expenses":
name,amount,date
jim,34.0,11/12/2015
carl,120.11,11/12/2015
jim,45.0,12/01/2015
stacy,12.00,01/04/2016
stacy,34.10,01/04/2016
stacy,44.80,01/05/2016
And that has been converted to a nice array of objects via our data reading powers into something like this:
var expenses = [{"name":"jim","amount":34,"date":"11/12/2015"},
{"name":"carl","amount":120.11,"date":"11/12/2015"},
{"name":"jim","amount":45,"date":"12/01/2015"},
{"name":"stacy","amount":12.00,"date":"01/04/2016"},
{"name":"stacy","amount":34.10,"date":"01/04/2016"},
{"name":"stacy","amount":44.80,"date":"01/05/2016"}
];
And now we want to slice up this data in different ways.
First, let's use nest to group by name
:
var expensesByName = d3.nest()
.key(function(d) { return d.name; })
.entries(expenses);
This code is using d3.js
Which results in a nested data structure:
expensesByName = [
{"key":"jim","values":[
{"name":"jim","amount":34,"date":"11/12/2015"},
{"name":"jim","amount":45,"date":"12/01/2015"}
]},
{"key":"carl","values":[
{"name":"carl","amount":120.11,"date":"11/12/2015"}
]},
{"key":"stacy","values":[
{"name":"stacy","amount":12.00,"date":"01/04/2016"},
{"name":"stacy","amount":34.10,"date":"01/04/2016"},
{"name":"stacy","amount":44.80,"date":"01/05/2016"}
]}
];
expensesByName
is an array of objects.
Each object has a key
property - which is what we used as the grouping value using the key
function.
Here, we used the values associated with the name
property as the key.
The values
property of these entries is an array containing all the original data objects that had that key.
Summarizing Groups
The nested structure can be great for visualizing your data, but might be a little underwhelming for analytical applications.
Never fear! nest.rollup is here!
With rollup
, you provide a function that takes the array of values for each group and it produces a value based on that array.
This provides for some very flexible group by functionality.
Here is a simple one to get back the counts for each name:
var expensesCount = d3.nest()
.key(function(d) { return d.name; })
.rollup(function(v) { return v.length; })
.entries(expenses);
console.log(JSON.stringify(expensesCount));
=> [{"key":"jim","values":2},{"key":"carl","values":1},{"key":"stacy","values":3}]
This code is using d3.js
The individual records are gone (for better or worse) and in their place are the values returned by our rollup function.
The naming stays the same (key and values) but the content is yours to specify.
Note that the value passed into the rollup
callback is the array of values for that key.
Here is another example where we get the average amount per person:
var expensesAvgAmount = d3.nest()
.key(function(d) { return d.name; })
.rollup(function(v) { return d3.mean(v, function(d) { return d.amount; }); })
.entries(expenses);
console.log(JSON.stringify(expensesAvgAmount));
=> [{"key":"jim","values":39.5},{"key":"carl","values":120.11},{"key":"stacy","values":30.3}]
This code is using d3.js
Pretty cool right? Any roll-up function you can think of, you can make happen.
And you don't need to stop at just one.
rollup
can return an object, so you can easily produce multiple metrics on your groups.
var expenseMetrics = d3.nest()
.key(function(d) { return d.name; })
.rollup(function(v) { return {
count: v.length,
total: d3.sum(v, function(d) { return d.amount; }),
avg: d3.mean(v, function(d) { return d.amount; })
}; })
.entries(expenses);
console.log(JSON.stringify(expenseMetrics));
=> [{"key":"jim","values":{"count":2,"total":79,"avg":39.5}},
{"key":"carl","values":{"count":1,"total":120.11,"avg":120.11}},
{"key":"stacy","values":{"count":3,"total":90.9,"avg":30.3}}]
This code is using d3.js
Object Output
The array output can be useful for using map or forEach.
But you can also have d3.nest
return an object of the results, for direct access.
Note the use of nest.object below.
var expensesTotal = d3.nest()
.key(function(d) { return d.name; })
.rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
.object(expenses);
console.log(JSON.stringify(expensesTotal));
=> {"jim":79,"carl":120.11,"stacy":90.9}
This code is using d3.js
And if you want to get real fancy, take a look at nest.map for getting a d3.map instance back.
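As a quick sketch of what that looks like, nest.map returns a d3.map whose values you pull out with get:
var expensesTotalMap = d3.nest()
.key(function(d) { return d.name; })
.rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
.map(expenses);
console.log(expensesTotalMap.get("jim"));
=> 79
This code is using d3.js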
Multi-Level Nesting
And you thought that single-level nesting was cool.
Wait till you try multiple levels!
By adding more keys, you can sub-divide your data even further.
Here is expense sums by name and then by date:
var expensesTotalByDay = d3.nest()
.key(function(d) { return d.name; })
.key(function(d) { return d.date; })
.rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
.object(expenses);
console.log(JSON.stringify(expensesTotalByDay));
=> {"jim":{"11/12/2015":34,"12/01/2015":45},
"carl":{"11/12/2015":120.11},
"stacy":{"01/04/2016":46.1,"01/05/2016":44.8}}
This code is using d3.js
Now the rollup
callback is called for each of our smaller subgroups.
The order of the nest.key
calls determines the order of the grouping.
If we reverse our keys, we get the totals by date and then by name:
var expensesTotalByDay = d3.nest()
.key(function(d) { return d.date; })
.key(function(d) { return d.name; })
.rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
.object(expenses);
console.log(JSON.stringify(expensesTotalByDay));
=> {"11/12/2015":{"jim":34,"carl":120.11},
"12/01/2015":{"jim":45},
"01/04/2016":{"stacy":46.1},
"01/05/2016":{"stacy":44.8}}
This code is using d3.js
Here the values are the same, but the mapping might be more convenient, depending on the questions you are trying to answer.
Derived Key Values
Remember, we are specifying our key value using a function.
This gives us the power to group on derived or otherwise on-the-fly keys.
For example, if we wanted to find out totals for all expenses for each year, we would just do some basic string manipulation on the date string:
var expensesByYear = d3.nest()
.key(function(d) { return d.date.split("/")[2]; })
.rollup(function(v) { return d3.sum(v, function(d) { return d.amount; }); })
.object(expenses);
console.log(JSON.stringify(expensesByYear));
=> {"2015":199.11,"2016":90.9}
This code is using d3.js
All this flexibility provides for a powerful toolkit for exploring your data.
Next Task
Working with Strings
See Also
Mister Nester - a d3.nest
power tool!
Phoebe Bright Nest Tutorial - lots more nest examples
Working with Strings
String cleaning is something you end up doing quite a lot.
Hopefully this task will help make the process less painful.
There are a near infinite number of transformations you might want to do with strings, so we won't get to everything, but this will serve as a starting point for common manipulations that will come up again and again.
We will start with generic JavaScript string functions and add in a bit of lodash magic to make things easier.
String Basics
Similar to arrays, the characters in strings are accessible via indexing
var aChar = "Hello There!"[6];
console.log(aChar);
=> T
Also, just like arrays, you have access to the powerful slice method, which is used to extract sub-sections based on indexes.
var aSlice = "Hello There!".slice(6,11);
console.log(aSlice);
=> There
The sliced string goes up to - but not including - the last index.
And, of course, string concatenation is done in JavaScript using the +
operator.
Use parentheses if you want to do actual arithmetic inside your concatenation.
var orderNum = 8;
console.log("You are number " + (orderNum + 1) + " in line.");
=> You are number 9 in line.
Check the documentation for all the other basic tools.
Stripping Whitespace
Often, you are going to have some surrounding whitespace that you don't want corrupting the rest of your data.
Reading CSV files gives a good example of this, as spaces are typically also used in conjunction with the commas to separate columns.
A data file like this:
cities_spaced.csv:
city ,state ,population,land area
seattle ,WA , 652405 ,83.9
new york,NY,8405837, 302.6
When read in can produce quite the messy dataset:
d3.csv("data/cities_spaced.csv", function(data) {
console.log(JSON.stringify(data));
});
=> [{"city ":" seattle ","state ":"WA ","population":" 652405 ","land area":"83.9"},
{"city ":"new york","state ":"NY","population":"8405837","land area":" 302.6"}]
This code is using d3.js
Note the spaces in the property names as well as the values.
In cases like this, it might be best to map the data back to a clean version.
Lodash's trim can help.
It removes that unsightly whitespace from the front and back of your strings.
Here is a version of the data loading function that removes whitespace.
It uses d3.keys to grab the property names and lodash's trim to clean up both the keys and the values.
d3.csv("data/cities_spaced.csv").then(function(data) {
var clean = data.map(function(d) {
var cleanD = {};
d3.keys(d).forEach(function(k) {
cleanD[_.trim(k)] = _.trim(d[k]);
});
return cleanD;
});
console.log(JSON.stringify(clean));
});
=> [{"city":"seattle","state":"WA","population":"652405","land area":"83.9"},
{"city":"new york","state":"NY","population":"8405837","land area":"302.6"}]
This code is using d3.js and lodash
The strings are now clear of those pesky spaces.
Find and Replace
Extracting data from strings can sometimes mean extracting pieces of strings.
Finding out if a string contains a keyword or sub-string of interest is a first step in quantifying the content of a body of text.
indexOf can be used to perform this searching.
You pass it a sub-string, and it'll tell you the location in the calling string where that sub-string starts.
-1
is returned if the sub-string can't be found.
You can use this to build a little string finder, by comparing the return value to -1
.
console.log("A man, a plan, a canal".indexOf("man") !== -1);
=> true
console.log("A man, a plan, a canal".indexOf("panama") !== -1);
=> false
Replace is the butter to find's bread.
We will see more replacing when we get to regular expressions, but replacing sections of a string can be done with the replace method.
console.log("A man, a plan, a canal".replace("canal", ""));
=> "A man, a plan, a"
Templating
When you need to create a more complicated string, such as an html snippet, it may
become too tedious to just combine strings by concatenating them with your variables.
Consider
the following example:
<div class="person">
<span class="name">Birdman</span>
<span class="occupation">Imaginary Super Hero</span>
</div>
If we wanted to build it using string concatenation, it might look like this:
var person = { name : "Birdman", occupation: "Imaginary Super Hero" };
var html_snippet = "<div class=\"person\">" +
"<span class=\"name\">" + person.name + "</span>" +
"<span class=\"occupation\">" + person.occupation + "</span>" +
"</div>";
console.log(html_snippet);
=> '<div class="person"><span class="name">Birdman</span><span class="occupation">Imaginary Super Hero</span></div>'
That's a lot of string escaping! You can imagine this gets pretty hard to manage
after a while.
In order to simplify this process, you can use lodash templates to define a "template"
that you can reuse with different data.
Using our example above, we might define it
like so:
var templateString = "<div class='person'>" +
" <span class='name'><%= name %></span>" +
" <span class='occupation'><%= occupation %></span>" +
"</div>";
var templateFunction = _.template(templateString);
Now you can use this template function with lots of data to generate the
same snippet of html:
console.log(templateFunction(person));
=> '<div class="person"><span class="name">Birdman</span><span class="occupation">Imaginary Super Hero</span></div>'
This code is using lodash
var anotherPerson = { name : "James.
James Bond", occupation: "Spy" };
console.log(templateFunction(anotherPerson));
=> '<div class="person"><span class="name">James.
James Bond</span><span class="occupation">Spy</span></div>'
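As an aside, if your environment supports ES2015 template literals (or you use TypeScript), you can get much the same convenience without a library. A minimal sketch using the same person object:
var literalSnippet = `<div class="person"><span class="name">${person.name}</span><span class="occupation">${person.occupation}</span></div>`;
console.log(literalSnippet);
=> '<div class="person"><span class="name">Birdman</span><span class="occupation">Imaginary Super Hero</span></div>'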
Next Task
Regular Expressions
See Also
Working With Strings - a great guide to more string basics
underscore.string - for all the other string functions you might want
underscore.template - for a deeper dive into underscore's template function
ES2015's Template Literal syntax, which allows template strings without the need for lodash/underscore if you use ES2015+ or TypeScript.
Regular Expressions
Regular expressions are used to match certain patterns of strings within other strings.
They can be a useful tool for extracting patterns rather than exact strings - for example: telephone numbers (sequences of numbers of a specific length), street numbers, or email addresses.
Finding Strings
var str = "how much wood would a woodchuck chuck if a woodchuck could chuck wood";
var regex = /wood/;
If we want to know whether the string "wood" appears in our larger string str
we
could do the following
if (regex.test(str)) {
console.log("we found 'wood' in the string!");
}
=> "we found 'wood' in the string!"
To see the actual matches we found in the string, we can use the match
method
to find all matches available:
var matches = str.match(regex);
console.log(matches);
=> ["wood"]
Note that this only returned one match, even though the word "wood" appears several
times in our original string.
In order to find all individual instances of wood, we need
to add the global flag, which we can do by adding a g
to the end of our expression:
regex = /wood/g;
console.log(str.match(regex));
=> ["wood", "wood", "wood", "wood"]
Now, note that two of those matches actually came from the word "woodchuck", which itself was not a part of our results.
If we wanted to extend our regular expression to match both
we could do so in a few ways:
regex = /wood.*?\b/g;
console.log(str.match(regex));
=> ["wood", "woodchuck", "woodchuck", "wood"]
In this regular expression we are matching everything that starts with the string "wood"
followed by 0 or more characters (.*?
) until a word break (\b
) occurs.
Alternatively, we could also just search for both words:
regex = /woodchuck|wood/g;
console.log(str.match(regex));
=> ["wood", "woodchuck", "woodchuck", "wood"]
Note the order in which we did the last search.
We used the word "woodchuch" before
the word "wood".
If we were to run our expression like so: /wood|woodchuck/g
, we would end up with ["wood", "wood", "wood", "wood"]
again, because alternation tries its options from left to right and "wood" matches first.
Replacing with regular expressions
If we wanted to replace the word "wood" in our original string, with the word
"nun", we could do it like so:
regex = /wood/g;
var newstr = str.replace(regex, "nun");
console.log(newstr);
=> "how much nun would a nunchuck chuck if a nunchuck could chuck nun"
Probably not what you'd be going for, but you get our drift.
Finding Numbers
Extracting numbers from strings is a common task when looking for things like
dollar amounts or any other numerical measurements that might be scattered about
in the text.
For example, if we wanted to extract the total amount of money spent
on groceries from this message:
var message = "I bought a loaf of bread for $3.99, some milk for $2.49 and" +
"a box of chocolate cookies for $6.95";
we could define a regular expression that looks for dollar amounts by defining a
pattern like so.
regex = /\$([0-9\.]+)\b/g;
this pattern looks for:
A dollar sign (\$) to indicate the beginning of a price.
A set of repeating characters that can be a number (0-9) or the period character (.). These can appear repeatedly (+). Note that we're not being particularly careful in making sure we only have one period in our string, for example.
A word break (\b) that would indicate the end of the price string.
If we wanted to find all the matches, we could use our string match
function like so:
matches = message.match(regex);
console.log(matches);
=> ["$3.99", "$2.49", "$6.95"]
This is great! We have all our dollar amounts.
While this gets us 90% there, we
can't really add them with those $
signs.
To remove them, we can use our trusty
reduce
function like so:
matches.reduce(function(sum, value) {
return sum + Number(value.slice(1));
}, 0);
=> 13.43
Useful special characters
We've used a few special characters so far, like \b to indicate a word break.
There are a few others that might be useful to you (a small example follows the list):
\d - any number character. Equivalent to [0-9].
\D - any non number character. Equivalent to [^0-9].
\s - any single space character. This includes a single space, tab, line feed or form feed.
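Here is a small, made-up example that uses \d with the global flag to pull every run of digits out of a string:
var order = "Table 12: 2 coffees, 1 bagel";
console.log(order.match(/\d+/g));
=> ["12", "2", "1"]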
You can see a full list of all special characters here:
MDN - Regular Expressions
Next Task
Working With Time
See Also
MDN - Regular Expressions - for more information about regular expressions
Working with Time
Time is one of those tricky programming things that seems like it should be easy, but usually turns out not to be.
We will use D3's built in time parsing and interval functions.
We will also take a look at the powerful Moment.js library, for when you just need more time power.
String to Date
The first task when dealing with dates is usually getting a Date object out of a string.
Most of the time, your data will have dates or times in a (mostly) arbitrary format, and you need to force that mess into an actual date.
D3 has d3.timeParse which provides a way to do this parsing.
It was a little confusing for me the first time I tried it.
You use this function to create a string parser, and then use the parser to actually convert the string.
In our nesting example, we saw data that had dates as strings:
var expense = {"name":"jim","amount":34,"date":"11/12/2015"};
To convert this date string to a Date object, we would need a parser that looks like:
var parser = d3.timeParse("%m/%d/%Y");
This code is using d3.js
The input string to d3.timeParse
indicates what the date string should look like.
You have a lot of options for the special, percent-sign-prefixed variables.
You can see in the string I'm using month, day, and four-digit year.
The slashes in the format string are not special variables - but just what we expect to find separating the fields in the date string.
Next we use the parser to parse our string.
expense.date = parser(expense.date);
console.log(expense);
=> {name: "jim", amount: 34, date: Thu Nov 12 2015 00:00:00 GMT-0500 (EST)}
This code is using d3.js
Note that the returned value of the d3.timeParse
function is itself a function, so we can just pass our date string to this function directly.
Also note that the timezone is dependent on your local browser, so you might see a different value if you live in a different timezone.
Cool! Now our date is actually a Date object.
Here are a few more time parsers to show the capabilities of D3's parsing.
Note again that we are creating a d3.timeParse
function and then passing in a string to parse, this time all on one line.
Just the date:
var date = d3.timeParse("%A, %B %-d, %Y")("Wednesday, November 12, 2014");
console.log(date);
=> Wed Nov 12 2014 00:00:00 GMT-0500 (EST)
This code is using d3.js
(The little dash in front of the d is to remove the 0-padding.)
date = d3.timeParse("%m/%y")("12/14");
console.log(date);
=> Mon Dec 01 2014 00:00:00 GMT-0500 (EST)
You can see it defaults to the first day of the month.
Just the time:
var time = d3.timeParse("%I:%M%p")("12:34pm");
console.log(time);
=> Mon Jan 01 1900 12:34:00 GMT-0500 (EST)
This code is using d3.js
Gives you a somewhat strange default date.
Date and time:
time = d3.timeParse("%m/%d/%Y %H:%M:%S %p")("1/2/2014 8:22:05 AM");
console.log(time);
=> Thu Jan 02 2014 08:22:05 GMT-0500 (EST)
This code is using d3.js
This could also be done using some built in short-hands:
time = d3.timeParse("%x %X")("1/2/2014 8:22:05 AM");
console.log(time);
=> Thu Jan 02 2014 08:22:05 GMT-0500 (EST)
This code is using d3.js
You can see that d3.timeParse
gives you a lot of flexibility about what your time string will look like.
Modifying Time
In many cases, you might want to modify a date object.
Perhaps you only want to display the hour from a date, or maybe you want to figure out what a week from now would be.
The d3.time set of functions provides a starting point for these kinds of manipulations.
Intervals allow for modifying dates around specific time slices like minutes, hours, days, months, or years.
We are given a number of functions to work with each interval, depending on what we might want to do.
So, to get the nearest hour from a date, we can use d3.timeHour.round
var hourParser = d3.timeParse("%I:%M%p");
var time = hourParser("10:34pm");
var hour = d3.timeHour.round(time);
console.log(hour);
=> Mon Jan 01 1900 23:00:00 GMT-0500
This code is using d3.js
It returns a date object that just contains the nearest hour (11:00pm).
We can display this by using a d3.timeFormat to format the date object into a string.
var hourFormater = d3.timeFormat("%I:%M%p")
console.log(hourFormater(hour));
=> 11:00PM
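The intervals can also shift dates, which covers the "week from now" case mentioned earlier. A small sketch using interval.offset and the expense date we parsed above:
var nextWeek = d3.timeWeek.offset(expense.date, 1);
console.log(nextWeek);
=> Thu Nov 19 2015 00:00:00 GMT-0500 (EST)
This code is using d3.js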
Moment.js
Moment.js is another JavaScript library that could be better suited to your needs, if you happen to be doing a lot of time manipulations.
Its syntax and capabilities seem a bit more intuitive for certain time manipulations.
Check it out if you need more time control power!
Next Task
Checking Data Assumptions
See Also
moment.js
Checking Data Assumptions
Data processing is tricky business, full of pitfalls and gotchas.
Hopefully the tasks in this guide help with getting started in this process.
But you, I, and the entire world will make mistakes.
It's natural.
But mistakes in data processing, like all other kinds of mistakes, can be painful.
They can result in hours of bug hunting, days of reprocessing, and months of crying.
Since we know mistakes happen and will continue to happen, what can we do to take away some of the pain?
In a word, padding.
We need some padding to protect us from the bumps and bruises of data processing.
And I would suggest that this padding come in the form of simple tests that check the assumptions you have about the shape and contents of your data.
Unless there is an extreme performance need, these tests should run in the data processing pipeline.
Optimally, they would be easy to turn on and off so that you can disable them if needed once your code is deployed.
Assertions
These tests can be created with assertions - functions that check the truthiness of a statement in code.
Typically, they raise an error when an expected truth is not actually true.
JavaScript doesn't have built-in assertions, but we can rectify this deficiency with a simple function.
function assert(isTrue, message) {
if(!isTrue) {
console.log(message);
return false;
}
return true;
}
This will output a given message if the input is not true.
Typically assertions throw errors, but we can just log the message for demonstration purposes.
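As mentioned above, you will want an easy way to switch these checks off once your code is deployed. Here is a minimal sketch of one way to do that, using a hypothetical ENABLE_CHECKS flag that wraps the assert function:
var ENABLE_CHECKS = true; // flip to false when deploying
function checkedAssert(isTrue, message) {
// only run the underlying assert when checks are enabled
if (!ENABLE_CHECKS) { return true; }
return assert(isTrue, message);
}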
Data Content Assumptions
Now let's use our assert
function to check some assumptions about the details of our data.
We can use lodash's suite of type checking functions to take care of performing the checks, passing the result of the check to assert
to produce our errors.
Let's say our data importing process has made some mistakes:
var data = [{"name":"Dan",
"age":23,
"superhuman":false},
{"name":"Sleepwalker",
"age":NaN,
"superhuman":"TRUE"}
];
Our first entry looks ok, while our second entry has some problems.
The age parsing for the immortal Sleepwalker has left him with no age.
Also, bad input data has left us with a string in superhuman
, where we expect a boolean.
A simple assumption checking function that could be run on this data could look something like this:
function checkDataContent(data) {
  data.forEach(function(d) {
    var dString = JSON.stringify(d);
    assert(_.isString(d.name), dString + " has a bad name - should be a string");
    assert(_.isNumber(d.age), dString + " has a bad age - should be a number");
    assert(!_.isNaN(d.age), dString + " has a bad age - should not be NaN");
    assert(_.isBoolean(d.superhuman), dString + " has a bad superhuman - should be boolean");
  });
}
checkDataContent(data);
=> {"name":"Sleepwalker","age":null,"superhuman":"TRUE"} has a bad age - should not be NaN
{"name":"Sleepwalker","age":null,"superhuman":"TRUE"} has a bad superhuman - should be boolean
This code is using lodash
Again, the focus here is on detection of data problems.
You want something quick and simple that will serve as an early warning sign.
Unfortunately, the JavaScript primitive NaN
is indeed a number, and so additional checks need to be made.
As more data comes in, this function will need to be updated to add more checks.
This might get a bit tedious, but a little bit of checking can go a long way towards maintaining sanity.
Data Shape Assumptions
Just as you can test your assumptions about the content of your data elements, it can be a good idea to test your assumptions about the shape of your data.
Here, shape just refers to the size and structure of your data.
Rows and columns.
Something simple to perform this check could look like this:
function checkDataShape(data) {
  assert(data.length > 0, "data is empty");
  assert(data.length > 4, "data is too small");
  var keys = d3.keys(data[0]);
  assert(keys.length === 4, "wrong number of columns");
}
checkDataShape(data);
=> data is too small
wrong number of columns
The two assumption functions could easily be combined into one, but it's important to look at both aspects of your data.
Data Equality Assumptions
Finally, it's often useful to check assumptions about data objects being equal.
Lodash comes to the rescue again with the isEqual function:
console.log(_.isEqual({ tea: 'green' }, { tea: 'green' }));
console.log(_.isEqual({ tea: 'earl' }, { tea: 'green' }));
=> true
false
More Assertions
If this is an approach that appeals to you, it might be worth exploring more powerful assertion libraries.
One such tool is Chai, which comes with a great collection of assertion helpers.
These can help you check for more complicated things, like whether two objects are equal or whether an object has or doesn't have a property, in a more succinct style.
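As a quick, hedged sketch of what that style looks like (assuming you have installed chai, for example with npm install chai):
// assumes chai is installed and required - the values here are just for illustration
var assert = require("chai").assert;
assert.isNumber(23, "age should be a number");
assert.deepEqual({ tea: "green" }, { tea: "green" });
assert.property({ name: "Dan", age: 23 }, "age");
This code is using chai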
Next Task
Using Node
See Also
Parsing raw data - a great guide that motivated this section
Chai - Chai's assert library
is.js - provides a great set of checking functions to complement lodash's set.
validate.io - provides a similar set of checking functions, but all as separate projects - so you can include only the checks you want to use.
Analyzing Data with Node
As mentioned in the introduction, this guide is mostly geared for client-side data analysis, but with a few augmentations, the same tools can be readily used server-side with Node.
If the data is too large, this might in fact be your only option if you want to use JavaScript for your data analysis.
Trying to deal with large data in the browser might result in your users having to wait for a long time.
No user will wait for 5 minutes with a frozen browser, no matter how cool the analysis might be.
Setting up a Node Project
To get started with Node, ensure both node and npm, the Node package manager, are installed and available via the command line:
which node
# /usr/local/bin/node
which npm
# /usr/local/bin/npm
Your paths may be different than mine, but as long as which returns something, you should be good to go.
If node isn't installed on your machine, you can install it easily via a package manager.
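For example, on macOS with Homebrew (just one of many options):
brew install node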
Create a new directory for your data analysis project.
In this example, we have a directory with a sub-directory called data, which contains our animals.tsv file inside.
animals_analysis
|
- data
  |
  - animals.tsv
Installing Node Modules
Next, we want to install our JavaScript tools, D3 and lodash.
With Node, we can automate the process by using npm.
Inside your data analysis directory run the following:
npm install d3
npm install lodash
You can see that npm creates a new sub-directory called node_modules
by default, where your packages are installed.
Everything is kept local, so you don't have to worry about problems with missing or out-of-date packages.
Your analysis tools for each project are ready to go.
A package.json
file can be useful for saving this kind of meta information about your project: dependencies, name, description, etc.
Check out this interactive example or npm's documentation for more information.
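For this hypothetical project, a minimal package.json might look something like this (the exact version numbers will depend on when you run npm install):
{
  "name": "animals_analysis",
  "version": "1.0.0",
  "description": "exploring animals.tsv with d3 and lodash",
  "dependencies": {
    "d3": "^5.9.0",
    "lodash": "^4.17.11"
  }
}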
Requiring Modules
Now we create a separate JavaScript file to do our analysis in:
touch analyze.js
Inside this file, we first require our external dependencies.
var fs = require("fs");
var d3 = require("d3");
var _ = require("lodash");
We are requiring our locally installed d3
and lodash
packages.
Note how we assign them to variables, which are used to access their functions later in the code.
We also require the file system module.
As we will see in a second, we need this to load our data - which is really the key difference between client-side and server-side use of these tools
Loading Data in Node
D3's data loading functionality is built on browser APIs - XMLHttpRequest in older versions, and the Fetch API in D3v5 - which Node does not provide out of the box.
There are packages that work around this mismatch, but a more elegant solution is to simply use Node's built-in file system functionality to load the data, and then use D3 to parse it.
fs.readFile("data/animals.tsv", "utf8", function(error, data) {
data = d3.tsvParse(data);
console.log(JSON.stringify(data));
});
fs.readFile is asynchronous and takes a callback function that runs when it has finished loading the data.
Like our Queue example in client-side reading, the parameters of this function start with error, which will be null unless there is an error.
The data returned by readFile
is the raw string contents of the file.
We can use d3.tsvParse, which takes a string and converts it into an array of data objects - just like what we are used to on the client side!
From this point on, we can use d3 and lodash functionality to analyze our data.
A full, but very simple script might look like this:
var fs = require("fs");
var d3 = require("d3");
var _ = require("lodash");

fs.readFile("data/animals.tsv", "utf8", function(error, data) {
  data = d3.tsvParse(data);
  console.log(JSON.stringify(data));

  var maxWeight = d3.max(data, function(d) { return d.avg_weight; });
  console.log(maxWeight);
});
Running the Analysis
Since this is not in a browser, we need to execute this script, much like you would with a script written in Ruby or Python.
From the command line, we can simply run it with node
to see the results.
node analyze.js
=> [{"name":"tiger","type":"mammal","avg_weight":"260"},{"name":"hippo","type":"mammal","avg_weight":"3400"},{"name":"komodo dragon","type":"reptile","avg_weight":"150"}]
3400
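One thing to note: d3.tsvParse leaves every value as a string, which is why avg_weight shows up as "3400" rather than 3400, and why d3.max is comparing strings here. The string comparison happens to work out in this case, but it can easily mislead (lexicographically, "90" is greater than "3400"). If you want a real numeric comparison, you could coerce inside the accessor, something like:
// coerce avg_weight to a number before taking the max
var maxWeight = d3.max(data, function(d) { return +d.avg_weight; });
console.log(maxWeight);
=> 3400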
Writing Data
Maybe the original data set is too big to work with comfortably in the browser; we can use Node to perform an initial pre-processing or filtering step and write the result to a new file to work with later.
Node has fs.writeFile that can perform this easily.
Inside the read callback, we can call this to write the data out.
var bigAnimals = data.filter(function(d) { return d.avg_weight > 300; });
var bigAnimalsString = JSON.stringify(bigAnimals);

// err will be null unless the write failed
fs.writeFile("big_animals.json", bigAnimalsString, function(err) {
  console.log("file written");
});
Running this should leave us with a big_animals.json
file in our analysis folder.
This is fine if JSON is what you want, but often times you want to output TSV or CSV files for further analysis.
D3 to the rescue again!
D3 includes d3.csvFormat (and the equivalent for TSV and other file formats) which converts our array of data objects into a string - perfect for writing to a file.
Let's use it to make a CSV of our big animals.
var bigAnimals = data.filter(function(d) { return d.avg_weight > 300; });
var bigAnimalsString = d3.csvFormat(bigAnimals);

fs.writeFile("big_animals.csv", bigAnimalsString, function(err) {
  console.log("file written");
});
Run this with the same node analyze.js
and now you should have a lovely little big_animals.csv
file in your directory.
It even takes care of the headers for you:
name,type,avg_weight
hippo,mammal,3400
Now even BIG data is no match for us - using the power of JavaScript!