recordlinkage
Must Watch!
MustWatch
Pre-processing with recordlinkage
Welcome to the first installment of a five part tutorial series on the recordlinkage python package.
Each tutorial will cover a specific stage of the data integration workflow.
The topics and links for each tutorial are included below:
Data pre-processing
Data pre-processing is a crucial step in any data analysis project, and record linkage problems are no different.
The main goal of pre-processing is to transform messy data into something that is usable in an analysis workflow.
In some senses, the entire process of record linkage can be considered part of the data pre-processing phase as it is preparing data for analysis.
However, the first part of record linkage involves cleaning and standardizing the data.
This can be relatively simple, such as removing extra white-space or more complex, such as parsing and standardizing address strings.
While cleaning and standardizing can be performed without the use of specialized packages, the recordlinkage python package has cleaning and standardizing functionality built-in.
In this tutorial, we will explore how this functionality can be used during the pre-processing phase of data integration.
Goals
By the end of this tutorial you should have an understanding of how to use the recordlinkage package to perform basic cleaning and standardization of Panda’s DataFrames in python.
To get the most out of the tutorials, I recommend you follow along by writing your own code in the python environment of your choice (script, console, notebook, etc).
Writing the code out can help you become familiar with the syntax and functions within recordlinkage.
Before you start
Before you start reading through the tutorial it is important that you have some background knowledge.
I recommend you check out the following resources:
recordlinkage’s standardise module documentation for an introduction to the functions we will be covering.
RegexOne for an introduction (or quick refresher) on regular expressions.
Pre-processing with recordlinkage
recordlinkage’s standardise
sub-module includes four built-in functions.
Each of these functions tackle a different aspect of pre-processing.
Below, we introduce potential uses for each of these functions, provide a basic code example, and present a couple of small problems to solve.
For a full overview of the functions, read the documentation.
The examples below make use of a small dataset.
The table below shows the data in its raw form.
name | phone_number | occupation | address |
1.
Rachel Green | 1(613)555 0149 | buyer (fashion) | “90 Bedford Street, Apt 20” |
2.
Ross Geller | +1-613-555-0138 | paleontogist | “100 Grove Street, Apartment 16” |
3.
Mnica Geller | 16135550185 | Chef | “90 Bedford Street, Apt 20” |
4.
Chandler BING | 1 613 555 0161 | ??? | “90 Bedford Street, Apt 19” |
5.
Pheobe Buffay | 1(613)5550114 | musician | “5 Morton Street, Apt.
14” |
6.
Joseph (Joey) Tribbiani | 1(613)555-0148 | actor | “90 Bedford Street, Apt 19” |
clean()
The clean()
function is used to clean a single column within a data frame, where the column contains strings.
By default, clean()
will turn all strings into lowercase and remove characters such as quotation marks and punctuation.
This function is especially useful for cleaning columns containing attributes such as names, addresses, abstracts, unstructured text, etc.
It is less useful for cleaning columns containing attributes such as phone numbers or lists.
Although not discussed in the documentation, the clean()
function also automatically removes leading and trailing white-space and collapses consecutive white-space into a single space.
Example
In this example we will first use default parameters to clean the name
column and store it as a new column.
Additionally, we will provide a custom replace_by_none
argument to clean the occupation
column, removing the brackets and their content and placing the result in a new column.
The code block and resulting data frame are shown below.
from recordlinkage.standardise import clean
import pandas as pd
df = pd.read_csv("friends.csv")
# Default Cleaning
df["name_clean_default"] = clean(df["name"])
# Clean the `occupation` column, but keep brackets and their contents.
df["occupation_clean"]= clean(df["occupation"],
replace_by_none='[^ \\-\\_\(\)A-Za-z0-9]+',
remove_brackets=False)
name | phone_number | occupation | address | name_clean_default | occupation_clean |
1.
Rachel Green | 1(613)555 0149 | buyer (fashion) | “90 Bedford Street, Apt 20” | 1 rachel green | buyer (fashion) |
2.
Ross Geller | +1-613-555-0138 | paleontogist | “100 Grove Street, Apartment 16” | 2 ross geller | paleontogist |
3.
M nica Geller | 16135550185 | Chef | “90 Bedford Street, Apt 20” | 3 mnica geller | chef |
4.
Chandler BING | 1 613 555 0161 | ??? | “90 Bedford Street, Apt 19” | 4 chandler bing | |
5.
Pheobe Buffay | 1(613)5550114 | musician | “5 Morton Street, Apt.
14” | 5 pheobe buffay | musician |
6.
Joseph (Joey) Tribbiani | 1(613)555-0148 | actor | “90 Bedford Street, Apt 19” | 6 joseph tribbiani | actor |
Test your knowledge
Try writing code to perform the following tasks:
Clean and strip accents from the name
column.
Replace the uncleaned column with the result.
Clean the name
column, removing all instances of numbers from the field.
Replace the uncleaned column with the result.
phonenumbers()
The phonenumbers()
function is used to remove all non-number characters from a cell, excluding the ‘+’ character.
To also remove ‘+’ you can use a custom version of the clean()
function.
Example
The example below cleans the phone_number
column and replaces the old column with the result.
# Clean the phone_number column and replaces the old column with the result.
from recordlinkage.standardise import phonenumbers
import pandas as pd
df = pd.read_csv("friends.csv")
df["phone_number"]= phonenumbers(df["phone_number"])
name | phone_number | occupation | address |
1.
Rachel Green | 16135550149 | buyer (fashion) | “90 Bedford Street, Apt 20” |
2.
Ross Geller | +16135550138 | paleontogist | “100 Grove Street, Apartment 16” |
3.
Mnica Geller | 16135550185 | Chef | “90 Bedford Street, Apt 20” |
4.
Chandler BING | 16135550161 | ??? | “90 Bedford Street, Apt 19” |
5.
Pheobe Buffay | 16135550114 | musician | “5 Morton Street, Apt.
14” |
6.
Joseph (Joey) Tribbiani | 16135550148 | actor | “90 Bedford Street, Apt 19” |
value_occurence()
The value_occurence
function is used to count the number of times each item in a column occurs.
This is a very general function, that can be used in many different situations.
My impression of the function is that it is most useful for solving data specific problems and extracting features from those data.
It could also be used to determine whether a column contains unique values and can act as a candidate key.
The function could also be used to assign a summary value to each attribute, such as the proportion of the dataset which has the same value as an observation.
Example
In the example below, value_occurence()
is used to find the size (in people) of each individual’s household.
from recordlinkage.standardise import value_occurence
from recordlinkage.standardise import value_occurence
import pandas as pd
df = pd.read_csv("friends.csv")
df["household_size"] = value_occurence(df["address"])
name | phone_number | occupation | address | household_size |
1.
Rachel Green | 1(613)555 0149 | buyer (fashion) | “90 Bedford Street, Apt 20” | 2 |
2.
Ross Geller | +1-613-555-0138 | paleontogist | “100 Grove Street, Apartment 16” | 1 |
3.
Mnica Geller | 16135550185 | Chef | “90 Bedford Street, Apt 20” | 2 |
4.
Chandler BING | 1 613 555 0161 | ??? | “90 Bedford Street, Apt 19” | 2 |
5.
Pheobe Buffay | 1(613)5550114 | musician | “5 Morton Street, Apt.
14” | 1 |
6.
Joseph (Joey) Tribbiani | 1(613)555-0148 | actor | “90 Bedford Street, Apt 19” | 2 |
Test your knowledge
Try writing code to find the number of roommates for each individual.
Place the result in a new column.
phonetic()
The phonetic()
function is used to convert strings into their corresponding phonetic codes.
This is particularly useful when comparing names where different possible spellings make it difficult to find exact matches (Ex.
Jillian and Gillian).
Note that some phonetic algorithms have been created specifically for use with English names.
Check out the algorithm before choosing whether to use for your own project.
All the algorithms have trade-offs, some have increased accuracy for specific languages, some build in understanding of non-English names, some run faster, etc.
Example
The example below will standardize the name
column, resulting in a new column called phonetic
, where each of the names have been standardized using the nysiis phonetic algorithm.
from recordlinkage.standardise import phonetic
import pandas as pd
# Read in the data
df = pd.read_csv("friends.csv")
# Clean the name column to remove numbers and strip accents
df["name"]= clean(df["name"], replace_by_none='[^ \\-\\_\(\)A-Za-z]+', strip_accents="unicode")
# Standardize using the nysiis phonetic algorithm
df["phonetic"] = phonetic(df["name"], method="nysiis")
name | phone_number | occupation | address | phonetic |
rachel green | 1(613)555 0149 | buyer (fashion) | “90 Bedford Street, Apt 20” | RACALGRAN |
ross geller | +1-613-555-0138 | paleontogist | “100 Grove Street, Apartment 16” | RASGALAR |
monica geller | 16135550185 | Chef | “90 Bedford Street, Apt 20” | MANACAGALAR |
chandler bing | 1 613 555 0161 | ??? | “90 Bedford Street, Apt 19” | CANDLARBANG |
pheobe buffay | 1(613)5550114 | musician | “5 Morton Street, Apt.
14” | FABABAFY |
joseph tribbiani | 1(613)555-0148 | actor | “90 Bedford Street, Apt 19” | JASAFTRABAN |
Test your knowledge
Try writing code to perform the following tasks:
Standardize the name
column using the metaphone algorithm, having not removed white-space prior to applying the phonetic
function.
Place the result in a new column called name_metaphone
.
Compare the results of all four phonetic algorithms.
Create a new column for the result of each algorithm.
What's next?
After you have sufficiently pre-processed and standardized your data, you will want to begin the record linking process.
The first step is to create a set of candidate links through a process called indexing.
For an introduction to indexing, check out the next tutorial in this series indexing candidate links with recordlinkage.
Indexing candidate links with recordlinkage
Welcome to the second installment of a five part tutorial series on the recordlinkage Python package.
Each tutorial will cover a specific stage of the data integration workflow.
The topics and links for each tutorial are included below:
What is data indexing?
When using recordlinkage for data integration or deduplication, indexing is the process of identifying pairs of data frame rows which might refer to the same real-world entity.
These matched rows are known as candidate links.
The goal of the indexing stage of data integration is to exclude obvious non-matches from the beginning of your analysis to improve computational efficiency.
The easiest, but least efficient, indexing method is to consider every possible link as a candidate link.
Using this method for integrating two data frames, we would be comparing every row in the first data frame with every row in the second.
So, if we had ten rows in the first data frame and five in the second, we would be considering 50 (10 × 5) candidate links.
This growth in complexity can be visualized simply by drawing a 10 × 5 rectangle where coloured squares indicate candidate links:
As you can see, every possible link is coloured.
With such a large set of candidate links, every comparison made between the two data frames must be made length(DataFrame A)
× length(DataFrame B)
times.
This isn’t a problem with only 50 candidate links.
However, most data integration problems are much larger.
If we instead had 10,000 records in data frame A and 100,000 records in data frame B, we would have 1,000,000,000 candidate links.
In the world of big data, 100,000 records barely qualifies as “medium-sized data” — and yet, our data integration problem has become large enough to be difficult on consumer laptops.
Luckily, there’s a better way.
In most data integration problems, the majority of possible links will be non-matches, and so we want to use our knowledge of the data we’re integrating to eliminate bad links from the outset.
Let’s consider an imaginary example in which we are integrating two datasets of employee information.
Supposing that we have accurate “city of residence” data for all employees, and assuming that each employee lives in only one city, we can rule out links between rows with mismatching cities.
Like before, we can visualize what this might look like if employees lived in three cities, where coloured blocks represent candidate links (see how these graphics were generated):
This indexing method is known as “blocking”.
By blocking on city, we were able to reduce our set of candidate links from 50 links to 18 links, or 36% of all possible links.
In our earlier example with larger datasets, this magnitude in reduction would reduce the number of candidate links from 1,000,000,000 to 360,000,000.
However, we’re likely to get even greater reductions if we use blocking variables with more categories, or were blocking on multiple variables.
For example, by blocking on given_name
in the febrl4
example datasets included with recordlinkage, we can reduce the set of candidate links from 25,000,000 (5000 × 5000) to a mere 77,249 (0.3% of all possible links).
However, keep in mind that fewer candidate links are not always better, since you don’t want to throw away actual matches.
The ideal indexing strategy will discard a large number of mismatching records but discard very few matching records.
It is important to have a deep understanding of your data, and the indexing methods that you apply to it, to ensure that you do not erroneously discard matching records.
Indexation methods
This table is a straight-up information dump, which overviews all of the indexing methods built-in to recordlinkage:
| Method | Description | Advantages | Limitations |
0 | Full Index | All possible links are kept as candidate links. | Convenient when performance isn't an issue. | Highly inefficient with large data sets. |
1 | Block Index | Only links exactly equal on specified values are kept as candidate links. | Extremely effective for high-quality, structured data. | Does not allow approximate matching. |
2 | Sorted Neighbourhood Index | Rows are ranked by some value, and candidate links are made between rows with nearby values. | Useful for making approximate matches. | More conceptually difficult. |
3 | Random Index | Creates a random set of candidate links. | Useful for developing your data integration workflow on a subset of candidate links, or creating training data for unsupervised learning models. | Not recommended for your final data integration workflow. |
Code examples and visualizations
We will now demonstrate how to use each indexing method in recordlinkage, and visualize the resulting set of candidate links.
The indexing process has two simple steps:
Create an Index
object to manage the indexing process.
We call this the indexer
in the following code examples.
Call indexer.index()
to return a Pandas MultiIndex
object.
This object identifies candidate links by their original indices in the data frames being linked.
Once you have created the MultiIndex
, you can use these candidate links to begin the record comparison stage of the data integration process.
Setup
We’ll demonstrate these indexing algorithms by indexing candidate links between two lists of names.
To start, we need to add this data to two Pandas data frames.
import pandas as pd
import recordlinkage as rl
# Name data for indexing
names_1 = ['alfred', 'bob', 'calvin', 'hobbes', 'rusty']
names_2 = ['alfred', 'danny', 'callum', 'hobie', 'rusty']
# Convert to DataFrames
df_a = pd.DataFrame(pd.Series(names_1, name='names'))
df_b = pd.DataFrame(pd.Series(names_2, name='names'))
Show df_a
| names |
0 | alfred |
1 | bob |
2 | calvin |
3 | hobbes |
4 | rusty |
Show df_b
| names |
0 | alfred |
1 | danny |
2 | callum |
3 | hobie |
4 | rusty |
Full index
The following code creates a set of candidate links using the full index method:
# Create indexing object
indexer = rl.FullIndex()
# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(df_a, df_b)
Show candidate_links
as a data frame
| 0 | 1 |
(0,0) | 0 | 0 |
(0,1) | 0 | 1 |
(0,2) | 0 | 2 |
(0,3) | 0 | 3 |
(0,4) | 0 | 4 |
(1,0) | 1 | 0 |
(1,1) | 1 | 1 |
(1,2) | 1 | 2 |
(1,3) | 1 | 3 |
(1,4) | 1 | 4 |
(2,0) | 2 | 0 |
(2,1) | 2 | 1 |
(2,2) | 2 | 2 |
(2,3) | 2 | 3 |
(2,4) | 2 | 4 |
(3,0) | 3 | 0 |
(3,1) | 3 | 1 |
(3,2) | 3 | 2 |
(3,3) | 3 | 3 |
(3,4) | 3 | 4 |
(4,0) | 4 | 0 |
(4,1) | 4 | 1 |
(4,2) | 4 | 2 |
(4,3) | 4 | 3 |
(4,4) | 4 | 4 |
The result is a set of candidate links containing all possible links.
The image above visualizes the set of candidate links as a network, with data frame rows as labelled nodes and candidate links as connections between them.
You can also click the button to see the candidate links by their original indices (i.e.
the contents of the MultiIndex
containing the candidate links).
Blocked index
The following code creates a set of candidate links using the block index method:
# Create indexing object
indexer = rl.BlockIndex(on='names')
# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(df_a, df_b)
Show candidate_links
as a data frame
Here, the set of candidate links is restricted to entries with the exact same name.
Think carefully about the quality of your data when using this method.
If you cannot guarantee exact matches between corresponding rows, you may exclude potential matches from your set of candidate links.
Random index
The following code creates a set of candidate links using the random index method:
# Create indexing object
indexer = rl.RandomIndex()
# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(df_a, df_b)
Show candidate_links
as a data frame
| 0 | 1 |
(0,1) | 0 | 1 |
(0,4) | 0 | 4 |
(3,2) | 3 | 2 |
(4,1) | 4 | 1 |
(4,2) | 4 | 2 |
(4,3) | 4 | 3 |
(4,4) | 4 | 4 |
Here, we have a random set of candidate links.
Sorted neighbourhood index
The sorted neighbourhood method is more complex than other indexing methods.
The following code creates a set of candidate links using the sorted neighbourhood index method:
# Create indexing object
indexer = rl.SortedNeighbourhoodIndex(on='names', window=3)
# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(df_a, df_b)
Show candidate_links
as a data frame
| 0 | 1 |
(1,2) | 1 | 2 |
(2,1) | 2 | 1 |
(3,3) | 3 | 3 |
(0,0) | 0 | 0 |
(4,4) | 4 | 4 |
(1,0) | 1 | 0 |
(2,2) | 2 | 2 |
(3,1) | 3 | 1 |
(4,3) | 4 | 3 |
Here, we have candidate links between names which would be adjacent in a sorted list.
This method is excellent if you want to reduce the number of candidate links considered, but cannot guarantee exact matches between potential matches.
However, it’s important to understand that this method has some quirks which should be kept in mind while using it.
To learn more about sorted neighbourhood indexing check out our deep dive tutorial.
What's next?
After you have created a set of candidate links, you’re ready to begin comparing the records associated with each candidate link.
The next step will look like this:
# Create pandas MultiIndex containing candidate links
candidate_links = indexer.index(df_a, df_b)
# Create comparing object
comp = rl.Compare(candidate_links, df_a, df_b)
For an introduction to record comparison, check out the next tutorial in this series, record comparison with recordlinkage.
Further reading
We highly recommend that you check out the recordlinkage documentation section on indexing.
For an in-depth look at the mechanics of each indexing class, check out the indexing page of the recordlinkage API Reference.
Record comparison with recordlinkage
Welcome to the third installment of a five part tutorial series on the recordlinkage python package.
Each tutorial will cover a specific stage of the data integration workflow.
The topics and links for each tutorial are included below:
Pre-processing with recordlinkage
Indexing candidate links with recordlinkage
Record comparison with recordlinkage
Record pair classification with recordlinkage
Data fusion (coming soon …)
Goal
By the end of this tutorial you should be comfortable using recordlinkage’s built-in methods and custom functions for record comparison.
Record comparison is an important part of the data integration workflow, as it provides the basis for classifying matches and the eventual fusion of data.
Both of these steps will be covered in subsequent tutorials.
In this tutorial, we will compare two data sets.
The first includes bibliographic data, which was retrieved using the metaknowledge python package.
This data includes information on affiliations, papers, authors, etc.
The second dataset is a collection of institutions from GRID.
Both data sets have been filtered to only include institutions from Canada.
Five rows from each of these datasets have been included below.
Affiliation data
| Affiliation | Country |
7 | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | Canada |
1 | departement de microbiologie infectiologie et immunologieuniversite de montreal montreal quebec h3c 3j7 canada electronic addressroxannecollinumontrealca | Canada |
10 | getting to know cancer room 229a 36 arthur st truro nova scotia b2n 1x5canada | Canada |
17 | michael smith laboratories university of british columbia vancouver britishcolumbia canada | Canada |
13 | gastrointestinal research group and inflammation research network department of physiology and pharmacology calvin joan and phoebe snyder institute for chronicdiseases cumming school of medicine university of calgary calgary albertacanada | Canada |
GRID data
| name | city | state | country |
2457 | holstein association of canada | Brantford | Ontario | Canada |
2499 | yellow island aquaculture | Victoria | British Columbia | Canada |
3094 | mitel | Ottawa | Ontario | Canada |
1370 | tekion | Toronto | Ontario | Canada |
1030 | toronto general hospital | Toronto | Ontario | Canada |
Comparing records
Preliminary work
There is a bit of work we need to do before we can compare records in our data sets.
First, we need to load and pre-process the data.
Then we will need to index the datasets to give us our list of candidate links.
Each of these topics have been covered in detailed in previous tutorials (links at the top of the page).
The script below contains the code used to prepare the data for comparisons.
import recordlinkage as rl
import pandas as pd
# ***************************
# Load Bib Data
# ***************************
bib = pd.read_csv("/path/to/canada_bib.csv") # Read bibliometric data
bib["Affiliation"] = clean(bib["Affiliation"]) # Clean Affiliation Data
# ***************************
# Load GRID Data
# ***************************
grid = pd.read_csv("/path/to/grid.csv") # Read GRID data
grid["name"] = clean(grid["name"]) # Clean GRID data
# ***************************
# Index with Full Index
# ***************************
indexer = rl.FullIndex()
candidate_links = indexer.index(bib, grid)
Set up the Compare object
In recordlinkage you must initiate a Compare
object prior to performing any comparison functionality between records.
This object stores both dataframes, the candidate links, and a vector containing comparison results.
Further, the Compare
object contains the methods for performing comparisons.
The code block below initializes the comparison object.
compare = rl.Compare(candidate_links, bib, grid)
Built-in comparison methods
Now that we have initiated a comparison method we can go ahead and start comparing records.
The simplest way to do comparisons is to use comparison methods that have been built-in to recordlinkage.
Currently there are five specific comparison methods within recordlinkage: Compare.exact()
, Compare.string()
, Compare.numeric()
, Compare.geo()
, and Compare.date()
.
These are all well documented in the documentation.
For this example we will be comparing the Affiliation
column from the bibliometric dataset and the name
column from the grid dataset.
We will start by using the Compare.exact()
comparison method.
This method is simple, if two values are an exact match a comparison score of 1 is returned, otherwise 0 is retured.
Since the affiliation data we are trying to match is quite messy, this method returns very low scores.
In fact, looking at the describe table included below the code block, you can see that no exact matches were found between the two columns.
Next, we will use the Compare.string()
comparison method.
This method is a bit more complicated and generates a score based on well known string-comparison algorithms.
For this example, we will use the Levenshtein method.
However, as shown in the describe table below, we still aren’t getting very high comparison scores, with the highest only being 0.52 (of a maximum 1.0).
# Use built-in comparison functions
compare.exact("Affiliation", "name", name="exact")
compare.string("Affiliation", "name", name="string")
# Print description
print(compare.vectors.describe())
| exact | string |
count | 87920 | 87920 |
mean | 0 | 0.152443 |
std | 0 | 0.0587379 |
min | 0 | 0 |
25% | 0 | 0.111111 |
50% | 0 | 0.15 |
75% | 0 | 0.19 |
max | 0 | 0.521739 |
Custom comparison methods
Based on the lack of high comparison scores, it doesn’t seem as though recordlinkage’s built-in methods are going to work for our current use case.
Thankfully, there is a general comparison method built-in to recordlinkage called Compare.compare()
.
This method takes in a comparison function, two columns to compare, and the name of the column the score should be stored in.
So, all we have to do is define and call a custom comparison function.
To demonstrate how the Compare.compare()
method works, we will use a couple of custom comparison function that has already been defined in NetLab’s labutils
package.
We will use the normed_lcss()
and normed_fuzzy_lcss()
functions.
normed_lcss
computes a comparison score based on the length of the longest common substring between two strings.
This score is normalized based on the length of the longest possible substring (the lesser of the two string lengths).
normed_fuzzy_lcss()
computes a score based on the prescence of similar substrings within the two strings.
Once again this is normalized based on the length of the longest possible substring.
For more information on this function see the documentation.
The code block below shows how each of these functions can be implemented using the generalized Compare.compare()
method in recordlinkage.
# Import custom functions
from labutils import normed_lcss, normed_fuzzy_lcss
# Perform the comparison
compare.compare(normed_lcss, "Affiliation", "name", name="lcss")
compare.compare(normed_fuzzy_lcss, "Affiliation", "name", name="fuzzy_lcss")
# Print new description, including the 99th and 99.99th percentiles
compare.vectors.describe([0.99, 0.9999])
| exact | string | lcss | fuzzy_lcss |
count | 87920 | 87920 | 87920 | 87920 |
mean | 0 | 0.152443 | 0.184052 | 0.193961 |
std | 0 | 0.0587379 | 0.0997552 | 0.102741 |
min | 0 | 0 | 0 | 0 |
50% | 0 | 0.15 | 0.163265 | 0.172414 |
99% | 0 | 0.321429 | 0.5 | 0.527778 |
99.99% | 0 | 0.5 | 0.7 | 0.875 |
max | 0 | 0.521739 | 0.9 | 0.9 |
As we can see in the table above, both custom methods are resulting in higher comparison scores, but are still spread out between 0 and 1.
The occurence of rare high scores helps to identify candidate links which likely correspond to true matches.
The table below shows the results of comparisons on four pairs of records, two having low scores and two having high scores.
| Affiliation_l | name_r | exact | string | lcss | fuzzy_lcss |
(26, 2507) | institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9 | steel structures education foundation | 0 | 0.202381 | 0.0810811 | 0.0810811 |
(22, 2225) | transgenic core facility clinical research institute of montreal montreal qc canada | grand river conservation authority | 0 | 0.228916 | 0.0882353 | 0.0882353 |
(7, 8) | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | ualberta | 0 | 0.105263 | 0.875 | 0.875 |
(17, 11) | michael smith laboratories university of british columbia vancouver britishcolumbia canada | university of british columbia | 0 | 0.333333 | 1 | 1 |
Defining custom comparison methods
Once you know how to use custom comparison methods, the next important step is knowing how to create a custom method.
There is additional information in the recordlinkage documentation, but I hope this example can provide some helpful additional information.
The code block below shows a template that can be used for create a custom comparison method.
A custom comparison method takes in two Pandas.Series
(columns) as well as any additional arguments that that specific function might need.
To use the template below all you need to do is add to the inner_apply()
method, writing the functionality to compare two values and produce a comparison score.
def my_compare(s1, s2):
# This combines the columns you are comparing into a single DataFrame
concat = pd.concat([s1, s2], axis=1, ignore_index=True)
def inner_apply(x):
"""
This is where your custom algorithm is housed.
Create a function to be applied to each pair in the DataFrame.
"""
val1 = x[0]
val2 = x[1]
# Do something to produce the result of comparing val1 and val2
# return the result
return concat.apply(inner_apply, axis=1)
As an example, we will use this template to create a new comparison method.
This method might not be terribly useful, but it should demonstrate how this template can be customixed for your own purposes.
This method will first tokenize both strings (splitting at spaces).
Then it will check that the first token from both strings appears in the other string.
If both appear in the other a score of 1 is given.
If neither appear a score of 0 is given.
Finally, if only one appears in the other, a score of 0.5 is given.
def first_token(s1, s2):
# This combines the columns you are comparing into a single DataFrame
concat = pd.concat([s1, s2], axis=1, ignore_index=True)
def apply_first_token(x):
"""
This is where your custom algorithm is housed.
Create a function to be applied to each pair in the DataFrame.
"""
val1 = x[0]
val2 = x[1]
# Do something to produce the result of comparing val1 and val2
tkn1 = val1.split()
tkn2 = val2.split()
score = 0
if tkn1[0] in tkn2:
score += 0.5
if tkn2[0] in tkn1:
score += 0.5
# return the result
return score
return concat.apply(apply_first_token, axis=1)
The table below shows the resuls of applying this function to our dataset, showing the affiliation and name being compared as well as the resulting score.
While the method itself isn’t very useful, it provides a good demonstration of how to define and use your own custom comparison function.
| Affiliation_l | name_r | first_token |
(17, 1425) | michael smith laboratories university of british columbia vancouver britishcolumbia canada | phenomenome discoveries | 0 |
(14, 2691) | department of surgery the university of british columbia vancouver bc canada | strathcona community hospital | 0 |
(26, 39) | institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9 | university of victoria | 0.5 |
(21, 343) | institut de recherches cliniques de montreal montreal qc canada | canada council | 0.5 |
(7, 798) | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | university of northern british columbia | 1 |
(7, 737) | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | university of ottawa | 1 |
What's next?
After you have compared attributes between a set of candidate links, you will want to determine which should be considered matches.
This is done through classification.
For an introduction to classification, check out the next tutorial in this series (coming soon …).
Futher reading
We highly recommend that you check out the recordlinkage documentation section on comparing records.
For an in-depth look at each comparison method, take a look at the comparing page of the recordlinkage API Reference.
For more information on types of comparison metrics, we suggest reading Chapter 5 of Peter Christen’s book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection Paperback.
A shorter and more general overview of the process is described in Section 3.5 in Foster et al.’s book Big Data and Social Science: A Practical Guide to Methods and Tools.
Record pair classification with recordlinkage
Welcome to the third installment of a five part tutorial series on the recordlinkage python package.
Each tutorial will cover a specific stage of the data integration workflow.
The topics and links for each tutorial are included below:
Pre-processing with recordlinkage
Indexing candidate links with recordlinkage
Record comparison with recordlinkage
Record pair classification with recordlinkage
Data fusion (coming soon …)
Goal
By the end of this tutorial you should be comfortable using both threshold and learning based methods to classify candidate record pairs as matching/non-matching.
Classification is an important step in the data integration processes, as it is how we determine which record pairs correspond to matches and non-matches (and sometimes possible matches).
Threshold-based methods offer a simple and easy to understand approach to classification.
Alternatively, supervised or unsupervised learning models offer a more sophisticated way to address the task of classification.
Regardless of the approach we use, the records which are determined to be matches will fused in the next step of data integration.
In this tutorial, we will be examining two product data sets, one from Amazon and one from Google.
Our task is to identify the products which appear in both datasets.
You can download both datasets (.zip).
Five rows from each data set have been included below.
Google
id | name | description | manufacturer | price |
http://www.google.com/base/feeds/
snippets/11125907881740407428 | learning quickbooks 2007 | learning quickbooks 2007 | intuit | 38.99 |
http://www.google.com/base/feeds/
snippets/11538923464407758599 | superstart! fun with reading & writing! | fun with reading & writing! is designed to help kids learn to read and write better through exercises puzzle-solving creative writing decoding and more! | | 8.49 |
http://www.google.com/base/feeds/
snippets/11343515411965421256 | qb pos 6.0 basic software | qb pos 6.0 basic retail mngmt software.
for retailers who need basic inventory sales and customer tracking. | intuit | 637.99 |
http://www.google.com/base/feeds/
snippets/12049235575237146821 | math missions: the amazing arcade adventure (grades 3-5) | save spectacle city by disrupting randall underling's plan to drive all the stores out of business and take over the city.
solve real-world math challenges in the uniquely entertaining stores and make theme successful again. | | 12.95 |
http://www.google.com/base/feeds/
snippets/12244614697089679523 | production prem cs3 mac upgrad | adobe cs3 production premium mac upgrade from production studio premium or standard | adobe software | 805.99 |
Amazon
id | title | description | manufacturer | price |
b000jz4hqo | clickart 950 000 - premier image pack (dvd-rom) | | broderbund | 0.00 |
b0006zf55o | ca international - arcserve lap/desktop oem 30pk | oem arcserve backup v11.1 win 30u for laptops ... | computer associates | 0.00 |
b00004tkvy | noah\'s ark activity center (jewel case ages 3-8) | | victory multimedia | 0.00 |
b000g80lqo | peachtree by sage premium accounting for nonpr ... | peachtree premium accounting for nonprofits 20 ... | sage software | 599.99 |
b0006se5bq | singing coach unlimited | singing coach unlimited - electronic learning ... | carry-a-tune technologies | 99.99 |
Preliminary work
The code below tackles the steps of pre-processing, indexing, and record comparison, each of which were introduced in previous blog posts.
Since we working with a fair-sized data set, this code may take a few minutes to run.
import recordlinkage as rl
from recordlinkage.preprocessing import clean
import pandas as pd
# ********************************
# Load Data
# ********************************
goog = pd.read_parquet('amazon-google/Google')
amzn = pd.read_parquet('amazon-google/Amazon')
# ********************************
# Pre-processing
# ********************************
goog['name'] = clean(goog['name'])
goog['description'] = clean(goog['description'])
goog['manufacturer'] = clean(goog['manufacturer'])
goog['price'] = pd.to_numeric(goog['price'], errors='coerce')
goog.columns = ['idGoogle', 'name', 'description', 'manufacturer', 'price']
goog = goog.set_index('idGoogle')
amzn['title'] = clean(amzn['title'])
amzn['description'] = clean(amzn['description'])
amzn['manufacturer'] = clean(amzn['manufacturer'])
amzn.columns = ['idAmazon', 'name', 'description', 'manufacturer', 'price']
amzn = amzn.set_index('idAmazon')
# ********************************
# Indexing
# ********************************
cl = rl.SortedNeighbourhoodIndex('name', window=251)
cl = cl.index(goog, amzn)
# ********************************
# Record Comparison
# ********************************
c = rl.Compare()
c.string('name', 'name', label='cmp_name')
c.string('description', 'description', method='jaro_winkler', label='cmp_description')
c.string('manufacturer', 'manufacturer', label='cmp_manufacturer')
c.numeric('price', 'price', method='linear', scale=1,
offset=10, label='cmp_price')
feature_vectors = c.compute(cl, goog, amzn)
Classification
We will cover three ways to classify our candidate record pairs as matches or non-matches.
Threshold-based methods
Supervised learning methods
Unsupervised learning methods
Once we classify each of the record pairs, it is important to evaluate the classification.
We will use three commonly used metrics: precision, recall, and F-measure.
Precision is the portion of record pairs classified as matching which are actually matching.
Recall is the portion of true matches which are identified as matching.
F-Measure combines precision and recall to balance the two in a single metric.
1.
Threshold-based methods
We will start with implementing a threshold-based approach to classification.
This works by finding some metric, setting a threshold, and then using that as the boundary to define our two classes.
An example of threshold-based classification could be determining whether an animal is a cat or a dog.
Our metric will be weight and 10 lbs (4.5 kgs) will be our threshold.
If an animal weighs more than 10 lbs it is classified as a dog, and if it weighs less it is classified as a cat.
This is a very straightforward approach, it is easy to understand and implement.
However, it's also error-prone.
For example, most Chihuahuas are going to be classified as cats and some well-fed cats will be classified as dogs.
For our recordlinkage problem, our metric will be the sum of all comparisons scores for each candidate record pair.
We will use a threshold of 2.5 (chosen arbitrarily).
predictions = feature_vectors[feature_vectors.sum(axis=1) > 2.5]
print("Threshold-Based: {} matches".format(len(predictions)))
Evaluate the predictions
Now, we should evaluate how this threshold-based approach performed.
First, we will read in the true matches from our dataset.
Then, we will use the three the three different metrics which were introduced above.
As a refresher, precision is the proportion of record pairs classified as matching which are actual matches.
Recall is the proportion of all the actual matches which are identified as matching.
Finally, F-measure can be thought of as metric which balances precision and recall.
# Load True Matches
mapping = pd.read_parquet('amazon-google/Amazon_Google_perfectMapping')
# Convert dataframe to index required by recordlinkage
matches = mapping.set_index(['idGoogle', 'idAmazon']).index
# Get the confusion matrix.
This is just the table with the numbers of True/False Postives and True/False Negatives.
confusion_matrix = rl.confusion_matrix(matches, predictions, len(feature_vectors))
# Print Metrics
print("Precision:", rl.precision(confusion_matrix))
print("Recall:", rl.recall(confusion_matrix))
print("F-Measure:", rl.fscore(confusion_matrix))
Precision: 0.28607594936708863
Recall: 0.08692307692307692
F-Measure: 0.13333333333333333
We want all three of these numbers to be as close to 1 as possible.
This is a pretty good indication that something in our process needs to change.
Perhaps we should revisit earlier steps (garbage in, garbage out).
But we can also consider use more sophisticated methods of classification.
2.
Supervised learning methods
Next, we will use a supervised-learning approach to classification.
Supervised learning models are trained based on data provided by a user.
This data includes a set of features as well as the proper classification.
You can think of a parent teaching a young child about animals as supervised learning.
The parent will point to an animal and say something like “Look, a dog!” or “Watch out! There is a goose!”.
Soon the child will be able to identify dogs from geese – they will have developed a successful classification model!
We will use a logistic regression model for classification.
There are many other supervised learning models, some of which are also available in recordlinkage.
Load true matches
First we will load the data containing all of the true matches in our data set.
# Load True Matches
mapping = pd.read_parquet('amazon-google/Amazon_Google_perfectMapping')
# Convert dataframe to index required by recordlinkage
match = mapping.set_index(['idGoogle', 'idAmazon']).index
Create training and testing datasets
Next we will split the our dataframe of feature vectors generated during the prelimary stage into our training and testing set.
The training set will be used to train the logistic regression model.
Then, the testing set will be used to evaluate how the model performs.
It is very important to know that you should never train and test your model on the same subset of your data.
To get a sense for why this shouldn’t be done, we can think about a professor giving their students a 10 question sample exam, along with all the solutions.
Then, come the final exam they pass out an exam made up of the exact same 10 questions.
Upon grading the exam the professor discovers every student got 1010 and concludes that the entire class has an amazing grasp of all course content.
However, this is probably not the case at all.
We can imagine that some of the students likely just memorized the 10 solutions and were able to ace the test regardless of whether they studied any other content at all!
This is an example of what is called overfitting.
Essentially, the model has memorized the provided examples but has failed to learn a more general model that is successful beyond these examples.
Once again, do not evaluate your model on the data it was trained on!
from sklearn.model_selection import train_test_split
# Create a training and test set
train, test = train_test_split(feature_vectors, test_size=0.25)
# Get the true pairs for each set
train_matches_index = train.index & match
test_matches_index = test.index & match
Build the model
Finally, we can build our classification model and train it.
# Logistic Regression
# ***********************
# Initialize the classifier
classifier = rl.LogisticRegressionClassifier()
# Train the classifier
classifier.learn(train, train_matches_index)
Evaluate the model
Now, we should evaluate how the model performs on the test set.
First, we use the model to make predictions on the test set.
Then we will use the three different metrics which were introduced above.
As a refresher, precision is the proportion of record pairs classified as matching which are actual matches.
Recall is the proportion of all the actual matches which are identified as matching.
Finally, F-measure can be thought of as metric which balances precision and recall.
# Make Predictions on a test set
predictions = classifier.predict(test)
# Get the confusion matrix.
This is just the table with the numbers of True/False Postives and True/False Negatives.
confusion_matrix = rl.confusion_matrix(test_matches_index, predictions, len(test))
# Print Metrics
print("Precision:", rl.precision(confusion_matrix))
print("Recall:", rl.recall(confusion_matrix))
print("F-Measure:", rl.fscore(confusion_matrix))
Precision: 0.5
Recall: 0.23170731707317074
F-Measure: 0.3166666666666667
We want all three of these numbers to be as close to 1 as possible.
As we can see, the model is not performing miracles here.
However, it is a pretty good improvement over our threhold-based classification.
We will likely want to revisit earlier parts of our process to see if we can improve these measures.
Improvements to the model could be achieved in a variety of ways, from adding new features to model (perhaps comparing product names using jaccard similarity) to changing model hyper-parameters.
Its important to realize that these metrics are evaluating how the classification model performs on the data it has been given.
It does not consider the fact that during our preliminary work we actually exclude a whole bunch of possible record pairs when indexing.
3.
Unsupervised learning methods
Finally, we will use unsupervised learning to classify our candidate record pairs.
Unsupervised models train themselves without the user having to provide training data to the model.
Once again, we can consider a child learning about animals.
Unsupervised learning would have the child learning about the world without the help of her parents.
Instead, she might be told that there are two types of animals she is likely to encounter.
She doesn’t know anything else about these animals, not even their names.
But by examining lots of different examples of animals she eventually groups them together.
Animals that have tails, chase sticks, and enjoy being approached are one type, while those which have wings, walk aimlessly across the street, and are scary when you get near them are another.
At this point, the child will have developed a successful method of distinguishing the two types of animals.
We will use a K Means Classifier for our example.
Create training and testing datasets
Just as we did for unsupervised learning, we will start with splitting our data into two parts.
By excluding some of our data from training, we will be able to evaluate how the model might perform if given data it has never seen before.
from sklearn.model_selection import train_test_split
# Create a training and test set
train, test = train_test_split(feature_vectors, test_size=0.25)
# Get the true pairs for the test set (Used for Evaluation)
test_matches_index = test.index & match
Build the model
# K-means Classifier
# *******************
# Build The Classifier
kmeans = rl.KMeansClassifier()
# Train the Model
result_kmeans = kmeans.learn(train)
Evaluate the model
Once again, we will make predictions on the test set and then use precision, recall, and F-measure to evaluate the performance.
# Make Predictions on a test set
predictions = kmeans.predict(test)
# Get the confusion matrix.
This is just the table with the numbers of True/False Postives and True/False Negatives.
confusion_matrix = rl.confusion_matrix(test_matches_index, predictions, len(test))
# Print Metrics
print("Precision:", rl.precision(confusion_matrix))
print("Recall:", rl.recall(confusion_matrix))
print("F-Measure:", rl.fscore(confusion_matrix))
Precision: 0.009804922888366867
Recall: 0.5853658536585366
F-Measure: 0.019286790557508787
Remember, we want all three of these numbers to be as close to 1 as possible.
Once again, this has offered an improvement over our original threshold-based approach.
However, it does have a worse F-measure than our supervised approach.
Once again, these poor scores offer an indication that we should revisit earlier steps in our process.
What’s next?
Now that you have your set of matching record pairs, you will likely want to combine the records into a single dataset for easy use.
This task is called data fusion.
For an introduction to fusion check out the next tutorial in this series (coming soon …).
Further reading
While recordlinkage provides some good basic classification algorithms, it does not offer a lot of opportunity to customize the algorithms.
To do this, I suggest using other pacakges built for classification, such as scikit learn.
Its important to remember that your goal is to classify these record pairs as matching or non-matching.
There is nothing saying that you can’t deviate from the traditional methods.
Often, these are good places to start, but be creative in how you design a classification system.
For example, you might want to find ways to combine multiple model types, using something like stacking or a committee-based approaches.