Pandas First Steps
- Install and import
pip install pandas import pandas as pdCore components of pandas: Series and DataFrames
The primary two components of pandas are theSeries
andDataFrame
. ASeries
is essentially a column, and aDataFrame
is a multi-dimensional table made up of a collection of Series.![]()
- Creating DataFrames from scratch
Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like: data = { 'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2] } And then pass it to the pandas DataFrame constructor: purchases = pd.DataFrame(data) purchases Out:
apples | oranges | |
---|---|---|
0 | 3 | 0 |
1 | 2 | 3 |
2 | 0 | 7 |
3 | 1 | 2 |
apples | oranges | |
---|---|---|
June | 3 | 0 |
Robert | 2 | 3 |
Lily | 0 | 7 |
David | 1 | 2 |
Unnamed: 0 | apples | oranges | |
---|---|---|---|
0 | June | 3 | 0 |
1 | Robert | 2 | 3 |
2 | Lily | 0 | 7 |
3 | David | 1 | 2 |
index_col
when reading:
df = pd.read_csv('purchases.csv', index_col=0)
df
Out:apples | oranges | |
---|---|---|
June | 3 | 0 |
Robert | 2 | 3 |
Lily | 0 | 7 |
David | 1 | 2 |
dict
— pandas can read this just as easily:
df = pd.read_json('purchases.json')
df
Out:apples | oranges | |
---|---|---|
David | 1 | 2 |
June | 3 | 0 |
Lily | 0 | 7 |
Robert | 2 | 3 |
pysqlite3
installed, so run this command in your terminal:pip install pysqlite3
Or run this cell if you're in a notebook:
pip install pysqlite3
sqlite3
is used to create a connection to a database which we can then use to generate a DataFrame through a SELECT
query.
So first we'll make a connection to a SQLite database file:
import sqlite3
con = sqlite3.connect("database.db")
psycopg2
(link) is a commonly used library for making connections to PostgreSQL.
Furthermore, you would make a connection to a database URI instead of a file like we did here with SQLite.
In this SQLite database we have a table called purchases, and our index is in a column called "index".
By passing a SELECT query and our con
, we can read from the purchases table:
df = pd.read_sql_query("SELECT * FROM purchases", con)
df
Out:index | apples | oranges | |
---|---|---|---|
0 | June | 3 | 0 |
1 | Robert | 2 | 3 |
2 | Lily | 0 | 7 |
3 | David | 1 | 2 |
index_col='index'
, but we can also set an index after-the-fact:
df = df.set_index('index')
df
Out:apples | oranges | |
---|---|---|
index | ||
June | 3 | 0 |
Robert | 2 | 3 |
Lily | 0 | 7 |
David | 1 | 2 |
set_index()
on any DataFrame using any column at any time.
Indexing Series and DataFrames is a very common task, and the different ways of doing it is worth remembering.
con
variable from before.Let's move on to importing some real-world data and detailing a few of the operations you'll be using a lot.
.head()
:
movies_df.head()
Out:Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
.head()
outputs the first five rows of your DataFrame by default, but we could also pass a number as well: movies_df.head(10)
would output the top ten rows, for example.To see the last five rows use .tail()
.
tail()
also accepts a number, and in this case we printing the bottom two rows.:
movies_df.tail(2)
Out:Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
Nine Lives | 1000 | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
.info()
should be one of the very first commands you run after loading your data:
movies_df.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank 1000 non-null int64
Genre 1000 non-null object
Description 1000 non-null object
Director 1000 non-null object
Actors 1000 non-null object
Year 1000 non-null int64
Runtime (Minutes) 1000 non-null int64
Rating 1000 non-null float64
Votes 1000 non-null int64
Revenue (Millions) 872 non-null float64
Metascore 936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB
.info()
provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.
Notice in our movies dataset we have some obvious missing values in the Revenue
and Metascore
columns.
Imagine you just imported some JSON and the integers were recorded as strings.
You go to do some arithmetic and find an "unsupported operand" Exception because you can't do math with strings.
Calling .info()
will quickly point out that your column you thought was all integers are actually string objects.
Another fast and useful attribute is .shape
, which outputs just a tuple of (rows, columns):
movies_df.shape
Out:
(1000, 11)
Note that .shape
has no parentheses and is a simple tuple of format (rows, columns).
So we have 1000 rows and 11 columns in our movies DataFrame.
You'll be going to .shape
a lot when cleaning and transforming data.
For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.
append()
will return a copy without affecting the original DataFrame.
We are capturing this copy in temp
so we aren't working with the real data.Notice call .shape
quickly proves our DataFrame rows have doubled.Now we can try dropping duplicates:
temp_df = temp_df.drop_duplicates()
temp_df.shape
Out:
(1000, 11)
Just like append()
, the drop_duplicates()
method will also return a copy of your DataFrame, but this time with duplicates removed.
Calling .shape
confirms we're back to the 1000 rows of our original dataset.
It's a little verbose to keep assigning DataFrames to the same variable like in this example.
For this reason, pandas has the inplace
keyword argument on many of its methods.
Using inplace=True
will modify the DataFrame object in place:
temp_df.drop_duplicates(inplace=True)
Now our temp_df
will have the transformed data automatically.
Another important argument for drop_duplicates()
is keep
, which has three possible options:
first
: (default) Drop duplicates except for the first occurrence.
last
: Drop duplicates except for the last occurrence.
False
: Drop all duplicates.
Since we didn't define the keep
arugment in the previous example it was defaulted to first
.
This means that if two rows are the same pandas will drop the second row and keep the first row.
Using last
has the opposite effect: the first row is dropped.keep
, on the other hand, will drop all duplicates.
If two rows are the same then both will be dropped.
Watch what happens to temp_df
:
temp_df = movies_df.append(movies_df) # make a new copy
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape
Out:
(0, 11)
Since all rows were duplicates, keep=False
dropped them all resulting in zero rows being left over.
If you're wondering why you would want to do this, one reason is that it allows you to locate all duplicates in your dataset.
When conditional selections are shown below you'll see how to do that.
.columns
come in handy if you want to rename columns by allowing for simple copy and paste, it's also useful if you need to understand why you are receiving a Key Error
when selecting data by column.
We can use the .rename()
method to rename certain or all columns via a dict
.
We don't want parentheses, so let's rename those:
movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
movies_df.columns
Out:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
dtype='object')
Excellent.
But what if we want to lowercase all names? Instead of using .rename()
we could also set a list of names to the columns like so:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
But that's too much work.
Instead of just renaming each column manually we can do a list comprehension:
movies_df.columns = [col.lower() for col in movies_df]
movies_df.columns
Out:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore'],
dtype='object')
list
(and dict
) comprehensions come in handy a lot when working with pandas and data in general.It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be working with a dataset for some time.
None
or NumPy's np.nan
, each of which are handled differently in some situations.
There are two options in dealing with nulls:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
Guardians of the Galaxy | False | False | False | False | False | False | False | False | False | False | False |
Prometheus | False | False | False | False | False | False | False | False | False | False | False |
Split | False | False | False | False | False | False | False | False | False | False | False |
Sing | False | False | False | False | False | False | False | False | False | False | False |
Suicide Squad | False | False | False | False | False | False | False | False | False | False | False |
isnull()
returns a DataFrame where each cell is either True or False depending on that cell's null status.
To count the number of nulls in each column we use an aggregate function for summing:
movies_df.isnull().sum()
Out:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 128
metascore 64
dtype: int64
.isnull()
just by iteself isn't very useful, and is usually used in conjunction with other methods, like sum()
.
We can see now that our data has 128 missing values for revenue_millions
and 64 missing values for metascore
.
inplace=True
in this method as well.So in the case of our dataset, this operation would remove 128 rows where revenue_millions
is null and 64 rows where metascore
is null.
This obviously seems like a waste since there's perfectly good data in the other columns of those dropped rows.
That's why we'll look at imputation next.Other than just dropping rows, you can also drop columns with null values by setting axis=1
:
movies_df.dropna(axis=1)
In our dataset, this operation would drop the revenue_millions
and metascore
columns
axis=1
parameter?It's not immediately obvious where axis
comes from and why you need it to be 1 for it to affect columns.
To see why, just look at the .shape
output:
movies_df.shape
Out: (1000, 11)
As we learned above, this is a tuple that represents the shape of the DataFrame, i.e.
1000 rows and 11 columns.
Note that the rows are at index zero of this tuple and columns are at index one of this tuple.
This is why axis=1
affects columns.
This comes from NumPy, and is a great example of why learning NumPy is worth your time.
revenue_millions
column.
First we'll extract that column into its own variable:
revenue = movies_df['revenue_millions']
Using square brackets is the general way we select columns in a DataFrame.If you remember back to when we created DataFrames from scratch, the keys of the dict
ended up as column names.
Now when we select columns of a DataFrame, we use brackets just like if we were accessing a Python dictionary.revenue
now contains a Series:
revenue.head()
Out:
Title
Guardians of the Galaxy 333.13
Prometheus 126.46
Split 138.12
Sing 270.32
Suicide Squad 325.02
Name: revenue_millions, dtype: float64
Slightly different formatting than a DataFrame, but we still have our Title
index.We'll impute the missing values of revenue using the mean.
Here's the mean value:
revenue_mean = revenue.mean()
revenue_mean
Out:
82.95637614678897
With the mean, let's fill the nulls using fillna()
:
revenue.fillna(revenue_mean, inplace=True)
We have now replaced all nulls in revenue
with the mean of the column.
Notice that by using inplace=True
we have actually affected the original movies_df
:
movies_df.isnull().sum()
Out:
rank 0
genre 0
description 0
director 0
actors 0
year 0
runtime 0
rating 0
votes 0
revenue_millions 0
metascore 64
dtype: int64
Imputing an entire column with the same value like this is a basic example.
It would be a better idea to try a more granular imputation by Genre or Director.For example, you would find the mean of the revenue generated in each genre individually and impute the nulls in each genre with that genre's mean.Let's now look at more ways to examine and understand the dataset.
describe()
on an entire DataFrame we can get a summary of the distribution of continuous variables:
movies_df.describe()
Out:rank | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1.000000e+03 | 1000.000000 | 936.000000 |
mean | 500.500000 | 2012.783000 | 113.172000 | 6.723200 | 1.698083e+05 | 82.956376 | 58.985043 |
std | 288.819436 | 3.205962 | 18.810908 | 0.945429 | 1.887626e+05 | 96.412043 | 17.194757 |
min | 1.000000 | 2006.000000 | 66.000000 | 1.900000 | 6.100000e+01 | 0.000000 | 11.000000 |
25% | 250.750000 | 2010.000000 | 100.000000 | 6.200000 | 3.630900e+04 | 17.442500 | 47.000000 |
50% | 500.500000 | 2014.000000 | 111.000000 | 6.800000 | 1.107990e+05 | 60.375000 | 59.500000 |
75% | 750.250000 | 2016.000000 | 123.000000 | 7.400000 | 2.399098e+05 | 99.177500 | 72.000000 |
max | 1000.000000 | 2016.000000 | 191.000000 | 9.000000 | 1.791916e+06 | 936.630000 | 100.000000 |
.describe()
can also be used on a categorical variable to get the count of rows, unique count of categories, top category, and freq of top category:
movies_df['genre'].describe()
Out:
count 1000
unique 207
top Action,Adventure,Sci-Fi
freq 50
Name: genre, dtype: object
This tells us that the genre column has 207 unique values, the top value is Action/Adventure/Sci-Fi, which shows up 50 times (freq)..value_counts()
can tell us the frequency of all values in a column:
movies_df['genre'].value_counts().head(10)
Out:
Action,Adventure,Sci-Fi 50
Drama 48
Comedy,Drama,Romance 35
Comedy 32
Drama,Romance 31
Action,Adventure,Fantasy 27
Comedy,Drama 27
Animation,Adventure,Comedy 27
Comedy,Romance 26
Crime,Drama,Thriller 24
Name: genre, dtype: int64
.corr()
we can generate the relationship between each continuous variable:
movies_df.corr()
rank | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|
rank | 1.000000 | -0.261605 | -0.221739 | -0.219555 | -0.283876 | -0.252996 | -0.191869 |
year | -0.261605 | 1.000000 | -0.164900 | -0.211219 | -0.411904 | -0.117562 | -0.079305 |
runtime | -0.221739 | -0.164900 | 1.000000 | 0.392214 | 0.407062 | 0.247834 | 0.211978 |
rating | -0.219555 | -0.211219 | 0.392214 | 1.000000 | 0.511537 | 0.189527 | 0.631897 |
votes | -0.283876 | -0.411904 | 0.407062 | 0.511537 | 1.000000 | 0.607941 | 0.325684 |
revenue_millions | -0.252996 | -0.117562 | 0.247834 | 0.189527 | 0.607941 | 1.000000 | 0.133328 |
metascore | -0.191869 | -0.079305 | 0.211978 | 0.631897 | 0.325684 | 0.133328 | 1.000000 |
rank
has a perfect correlation with itself, which is obvious.
On the other hand, the correlation between votes
and revenue_millions
is 0.6.
A little more interesting.Examining bivariate relationships comes in handy when you have an outcome or dependent variable in mind and would like to see the features most correlated to the increase or decrease of the outcome.
You can visually represent bivariate relationships with scatterplots (seen below in the plotting section).For a deeper look into data summarizations check out Essential Statistics for Data Science.Let's now look more at manipulating DataFrames.
fillna()
.
Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need be sure to know which type you are working with or else you will receive attribute errors.Let's look at working with columns first.
genre | rating | |
---|---|---|
Title | ||
Guardians of the Galaxy | Action,Adventure,Sci-Fi | 8.1 |
Prometheus | Adventure,Mystery,Sci-Fi | 7.0 |
Split | Horror,Thriller | 7.3 |
Sing | Animation,Comedy,Family | 7.2 |
Suicide Squad | Action,Adventure,Fantasy | 6.2 |
.loc
- locates by name
.iloc
- locates by numerical index
Remember that we are still indexed by movie Title, so to use .loc
we give it the Title of a movie:
prom = movies_df.loc["Prometheus"]
prom
Out:
rank 2
genre Adventure,Mystery,Sci-Fi
description Following clues to the origin of mankind, a te...
director Ridley Scott
actors Noomi Rapace, Logan Marshall-Green, Michael Fa...
year 2012
runtime 124
rating 7
votes 485820
revenue_millions 126.46
metascore 65
Name: Prometheus, dtype: object
On the other hand, with iloc
we give it the numerical index of Prometheus:
prom = movies_df.iloc[1]
loc
and iloc
can be thought of as similar to Python list
slicing.
To show this even further, let's select multiple rows.How would you do it with a list? In Python, just slice with brackets like example_list[1:4]
.
It's works the same way in pandas:
movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
movie_subset
Out:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
.loc
and .iloc
to select multiple rows is that .loc
includes the movie Sing in the result, but when using .iloc
we're getting rows 1:4 but the movie at index 4 (Suicide Squad) is not included.Slicing with .iloc
follows the same rules as slicing with lists, the object at the index at the end is not included.
isnull()
, this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him.We want to filter out all movies not directed by Ridley Scott, in other words, we don’t want the False films.
To return the rows where that condition is True we have to pass this operation into the DataFrame:
movies_df[movies_df['director'] == "Ridley Scott"]
Out:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | rating_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Title | ||||||||||||
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | bad |
The Martian | 103 | Adventure,Drama,Sci-Fi | An astronaut becomes stranded on Mars after hi... | Ridley Scott | Matt Damon, Jessica Chastain, Kristen Wiig, Ka... | 2015 | 144 | 8.0 | 556097 | 228.43 | 80.0 | good |
Robin Hood | 388 | Action,Adventure,Drama | In 12th century England, Robin and his band of... | Ridley Scott | Russell Crowe, Cate Blanchett, Matthew Macfady... | 2010 | 140 | 6.7 | 221117 | 105.22 | 53.0 | bad |
American Gangster | 471 | Biography,Crime,Drama | In 1970s America, a detective works to bring d... | Ridley Scott | Denzel Washington, Russell Crowe, Chiwetel Eji... | 2007 | 157 | 7.8 | 337835 | 130.13 | 76.0 | bad |
Exodus: Gods and Kings | 517 | Action,Adventure,Drama | The defiant leader Moses rises up against the ... | Ridley Scott | Christian Bale, Joel Edgerton, Ben Kingsley, S... | 2014 | 150 | 6.0 | 137299 | 65.01 | 52.0 | bad |
movies_df
where movies_df
director equals Ridley Scott.Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:
movies_df[movies_df['rating'] >= 8.6].head(3)
Out:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
Interstellar | 37 | Adventure,Drama,Sci-Fi | A team of explorers travel through a wormhole ... | Christopher Nolan | Matthew McConaughey, Anne Hathaway, Jessica Ch... | 2014 | 169 | 8.6 | 1047747 | 187.99 | 74.0 |
The Dark Knight | 55 | Action,Crime,Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart,Mi... | 2008 | 152 | 9.0 | 1791916 | 533.32 | 82.0 |
Inception | 81 | Action,Adventure,Sci-Fi | A thief, who steals corporate secrets through ... | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... | 2010 | 148 | 8.8 | 1583625 | 292.57 | 74.0 |
|
for "or" and &
for "and".Let's filter the the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()
Out:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
Interstellar | 37 | Adventure,Drama,Sci-Fi | A team of explorers travel through a wormhole ... | Christopher Nolan | Matthew McConaughey, Anne Hathaway, Jessica Ch... | 2014 | 169 | 8.6 | 1047747 | 187.99 | 74.0 |
The Dark Knight | 55 | Action,Crime,Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart,Mi... | 2008 | 152 | 9.0 | 1791916 | 533.32 | 82.0 |
The Prestige | 65 | Drama,Mystery,Sci-Fi | Two stage magicians engage in competitive one-... | Christopher Nolan | Christian Bale, Hugh Jackman, Scarlett Johanss... | 2006 | 130 | 8.5 | 913152 | 53.08 | 66.0 |
Inception | 81 | Action,Adventure,Sci-Fi | A thief, who steals corporate secrets through ... | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... | 2010 | 148 | 8.8 | 1583625 | 292.57 | 74.0 |
isin()
method we could make this more concise though:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
Out:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
Interstellar | 37 | Adventure,Drama,Sci-Fi | A team of explorers travel through a wormhole ... | Christopher Nolan | Matthew McConaughey, Anne Hathaway, Jessica Ch... | 2014 | 169 | 8.6 | 1047747 | 187.99 | 74.0 |
The Dark Knight | 55 | Action,Crime,Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart,Mi... | 2008 | 152 | 9.0 | 1791916 | 533.32 | 82.0 |
The Prestige | 65 | Drama,Mystery,Sci-Fi | Two stage magicians engage in competitive one-... | Christopher Nolan | Christian Bale, Hugh Jackman, Scarlett Johanss... | 2006 | 130 | 8.5 | 913152 | 53.08 | 66.0 |
Inception | 81 | Action,Adventure,Sci-Fi | A thief, who steals corporate secrets through ... | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... | 2010 | 148 | 8.8 | 1583625 | 292.57 | 74.0 |
rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | |
---|---|---|---|---|---|---|---|---|---|---|---|
Title | |||||||||||
3 Idiots | 431 | Comedy,Drama | Two friends are searching for their long lost ... | Rajkumar Hirani | Aamir Khan, Madhavan, Mona Singh, Sharman Joshi | 2009 | 170 | 8.4 | 238789 | 6.52 | 67.0 |
The Lives of Others | 477 | Drama,Thriller | In 1984 East Berlin, an agent of the secret po... | Florian Henckel von Donnersmarck | Ulrich Mühe, Martina Gedeck,Sebastian Koch, Ul... | 2006 | 137 | 8.5 | 278103 | 11.28 | 89.0 |
Incendies | 714 | Drama,Mystery,War | Twins journey to the Middle East to discover t... | Denis Villeneuve | Lubna Azabal, Mélissa Désormeaux-Poulin, Maxim... | 2010 | 131 | 8.2 | 92863 | 6.86 | 80.0 |
Taare Zameen Par | 992 | Drama,Family,Music | An eight-year-old boy is thought to be a lazy ... | Aamir Khan | Darsheel Safary, Aamir Khan, Tanay Chheda, Sac... | 2007 | 165 | 8.5 | 102697 | 1.20 | 42.0 |
.describe()
the 25th percentile for revenue was about 17.4, and we can access this value directly by using the quantile()
method with a float of 0.25.So here we have only four movies that match that criteria.
apply()
a function to the dataset.
For example, we could use a function to convert movies with an 8.0 or greater to a string value of "good" and the rest to "bad" and use this transformed values to create a new column.First we would create a function that, when given a rating, determines if it's good or bad:
def rating_function(x):
if x >= 8.0:
return "good"
else:
return "bad"
Now we want to send the entire rating column through this function, which is what apply()
does:
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(2)
Out:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | rating_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Title | ||||||||||||
Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 | good |
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | bad |
.apply()
method passes every value in the rating
column through the rating_function
and then returns a new Series.
This Series is then assigned to a new column called rating_category
.You can also use anonymous functions as well.
This lambda function achieves the same result as rating_function
:
movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')
movies_df.head(2)
Out:rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | rating_category | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Title | ||||||||||||
Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 | good |
Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | bad |
apply()
will be much faster than iterating manually over rows because pandas is utilizing vectorization.
Vectorization: a style of computer programming where operations are applied to whole arrays instead of individual elements —WikipediaA good example of high usage of
apply()
is during natural language processing (NLP) work.
You'll need to apply all sorts of text cleaning functions to strings to prepare for machine learning.
pip install matplotlib
):
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger
Now we can begin.
There won't be a lot of coverage on plotting, but it should be enough to explore you're data easily.
.plot()
on movies_df
with some info about how to construct the plot:
movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');
RESULT:
<matplotlib.axes._subplots.AxesSubplot at 0x26613b5cc18>
output when plotting in Jupyter notebooks.If we want to plot a simple Histogram based on a single column, we can call plot on a column:
movies_df['rating'].plot(kind='hist', title='Rating');
RESULT: