Python Pandas and SciPy Tutorial

Pandas First Steps

- Install and import

pip install pandas

import pandas as pd

Core components of pandas: Series and DataFrames

The primary two components of pandas are the Series and DataFrame. A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.
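As a quick, minimal sketch (the fruit names simply anticipate the example below), you can build a Series on its own and then combine several Series into a DataFrame:

import pandas as pd

# a Series is a single labeled column of values
apples = pd.Series([3, 2, 0, 1], name='apples')
oranges = pd.Series([0, 3, 7, 2], name='oranges')

# a DataFrame is a table built from a collection of Series sharing the same index
df = pd.DataFrame({'apples': apples, 'oranges': oranges})
print(df)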

- Creating DataFrames from scratch

Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like: data = { 'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2] } And then pass it to the pandas DataFrame constructor: purchases = pd.DataFrame(data) purchases Out:
|   | apples | oranges |
|---|--------|---------|
| 0 | 3      | 0       |
| 1 | 2      | 3       |
| 2 | 0      | 7       |
| 3 | 1      | 2       |
Let's have customer names as our index: purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David']) purchases Out:
|        | apples | oranges |
|--------|--------|---------|
| June   | 3      | 0       |
| Robert | 2      | 3       |
| Lily   | 0      | 7       |
| David  | 1      | 2       |
So now we could locate a customer's order by using their name: purchases.loc['June'] Out: apples 3 oranges 0 Name: June, dtype: int64

How to read in data

- Reading data from CSVs

With CSV files all you need is a single line to load in the data: df = pd.read_csv('purchases.csv') df Out:
|   | Unnamed: 0 | apples | oranges |
|---|------------|--------|---------|
| 0 | June       | 3      | 0       |
| 1 | Robert     | 2      | 3       |
| 2 | Lily       | 0      | 7       |
| 3 | David      | 1      | 2       |
CSVs don't have indexes like our DataFrames, so all we need to do is just designate the index_col when reading: df = pd.read_csv('purchases.csv', index_col=0) df Out:
|        | apples | oranges |
|--------|--------|---------|
| June   | 3      | 0       |
| Robert | 2      | 3       |
| Lily   | 0      | 7       |
| David  | 1      | 2       |
Here we're setting the index to be column zero. You'll find that most CSVs won't ever have an index column, so usually you don't have to worry about this step.

- Reading data from JSON

If you have a JSON file — which is essentially a stored Python dict — pandas can read this just as easily: df = pd.read_json('purchases.json') df Out:
|        | apples | oranges |
|--------|--------|---------|
| David  | 1      | 2       |
| June   | 3      | 0       |
| Lily   | 0      | 7       |
| Robert | 2      | 3       |
Notice that this time our index came through correctly, since JSON stores the index through nesting.
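As a quick round trip you can run yourself (the file name is just an example), writing a DataFrame out to JSON and reading it back shows the index surviving through that nesting:

import pandas as pd

purchases = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
                         index=['June', 'Robert', 'Lily', 'David'])

purchases.to_json('purchases.json')          # index labels become the nested JSON keys
round_trip = pd.read_json('purchases.json')  # the index is rebuilt from those keys
print(round_trip)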

- Reading data from a SQL database

If you're working with data from a SQL database you need to first establish a connection using an appropriate Python library, then pass a query to pandas. Here we'll use SQLite to demonstrate. Python ships with the sqlite3 module in its standard library, so there's nothing extra to install. sqlite3 is used to create a connection to a database which we can then use to generate a DataFrame through a SELECT query. So first we'll make a connection to a SQLite database file:

import sqlite3
con = sqlite3.connect("database.db")
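If you don't already have a database.db file on disk, one way to create it for following along (the file and table names here are just assumptions for this walkthrough) is to write the earlier purchases DataFrame into SQLite with to_sql():

import sqlite3
import pandas as pd

purchases = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
                         index=['June', 'Robert', 'Lily', 'David'])

con = sqlite3.connect("database.db")
# by default the unnamed index is written out as a column called "index",
# which matches the table queried below
purchases.to_sql('purchases', con, if_exists='replace')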

. . . SQL Tip

If you have data in PostgreSQL, MySQL, or some other SQL server, you'll need to obtain the right Python library to make a connection. For example, psycopg2 is a commonly used library for making connections to PostgreSQL. Furthermore, you would connect to a database URI instead of a file like we did here with SQLite. In this SQLite database we have a table called purchases, and our index is in a column called "index". By passing a SELECT query and our con, we can read from the purchases table:

df = pd.read_sql_query("SELECT * FROM purchases", con)
df

Out:
|   | index  | apples | oranges |
|---|--------|--------|---------|
| 0 | June   | 3      | 0       |
| 1 | Robert | 2      | 3       |
| 2 | Lily   | 0      | 7       |
| 3 | David  | 1      | 2       |
Just like with CSVs, we could pass index_col='index', but we can also set an index after-the-fact: df = df.set_index('index') df Out:
| index  | apples | oranges |
|--------|--------|---------|
| June   | 3      | 0       |
| Robert | 2      | 3       |
| Lily   | 0      | 7       |
| David  | 1      | 2       |
In fact, we could use set_index() on any DataFrame using any column at any time. Indexing Series and DataFrames is a very common task, and the different ways of doing it are worth remembering.

- Converting back to a CSV, JSON, or SQL

After you've explored or transformed your data, you'll often want to save it back out as a file of your choice. Similar to the ways we read in data, pandas provides intuitive commands to save it:

df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
df.to_sql('new_purchases', con)

When we save JSON and CSV files, all we have to input into those functions is our desired filename with the appropriate file extension. With SQL, we're not creating a new file but instead inserting a new table into the database using our con variable from before. Let's move on to importing some real-world data and detailing a few of the operations you'll be using a lot.

Most important DataFrame operations

DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis. Let's load in the IMDB movies dataset to begin: movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title") We're loading this dataset from a CSV and designating the movie titles to be our index.

- Viewing your data

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head(): movies_df.head() Out:
| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey, Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
.head() outputs the first five rows of your DataFrame by default, but we could also pass a number: movies_df.head(10) would output the top ten rows, for example. To see the last five rows use .tail(). tail() also accepts a number, and in this case we're printing the bottom two rows: movies_df.tail(2) Out:
| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch, Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
| Nine Lives | 1000 | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell, Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |
Typically when we load in a dataset, we like to view the first five or so rows to see what's under the hood. Here we can see the names of each column, the index, and examples of values in each row. You'll notice that the index in our DataFrame is the Title column, which you can tell by how the word Title sits slightly lower than the rest of the column names.

- Getting info about your data

.info() should be one of the very first commands you run after loading your data:

movies_df.info()

Out:
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank                  1000 non-null int64
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB

.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using. Notice in our movies dataset we have some obvious missing values in the Revenue and Metascore columns. Imagine you just imported some JSON and the integers were recorded as strings. You go to do some arithmetic and find an "unsupported operand" exception because you can't do math with strings. Calling .info() will quickly point out that the column you thought was all integers is actually made up of string objects. Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

movies_df.shape

Out: (1000, 11)

Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame. You'll be reaching for .shape a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.
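As a small sketch of that JSON scenario (the column and values here are hypothetical), pd.to_numeric converts a column of numeric strings back into real numbers, and .info() will then report the dtype you expect:

import pandas as pd

df = pd.DataFrame({'votes': ['757074', '485820', '157606']})  # integers stored as strings
df.info()                                  # dtype shows up as "object"

df['votes'] = pd.to_numeric(df['votes'])   # now int64, so arithmetic works
df.info()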

- Handling duplicates

This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate rows. To demonstrate, let's simply just double up our movies DataFrame by appending it to itself:

temp_df = movies_df.append(movies_df)
temp_df.shape

Out: (2000, 11)

Using append() will return a copy without affecting the original DataFrame. We are capturing this copy in temp_df so we aren't working with the real data. Notice that calling .shape quickly proves our DataFrame rows have doubled. Now we can try dropping duplicates:

temp_df = temp_df.drop_duplicates()
temp_df.shape

Out: (1000, 11)

Just like append(), the drop_duplicates() method will also return a copy of your DataFrame, but this time with duplicates removed. Calling .shape confirms we're back to the 1000 rows of our original dataset. It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this reason, pandas has the inplace keyword argument on many of its methods. Using inplace=True will modify the DataFrame object in place:

temp_df.drop_duplicates(inplace=True)

Now our temp_df will have the transformed data automatically. Another important argument for drop_duplicates() is keep, which has three possible options:

first: (default) Drop duplicates except for the first occurrence.
last: Drop duplicates except for the last occurrence.
False: Drop all duplicates.

Since we didn't define the keep argument in the previous example it defaulted to first. This means that if two rows are the same, pandas will drop the second row and keep the first row. Using last has the opposite effect: the first row is dropped. keep=False, on the other hand, will drop all duplicates. If two rows are the same then both will be dropped. Watch what happens to temp_df:

temp_df = movies_df.append(movies_df)  # make a new copy
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape

Out: (0, 11)

Since all rows were duplicates, keep=False dropped them all, resulting in zero rows being left over. If you're wondering why you would want to do this, one reason is that it allows you to locate all duplicates in your dataset. When conditional selections are shown below you'll see how to do that.
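One version note: DataFrame.append() was deprecated and has been removed in newer pandas releases, so if the calls above raise an error in your environment, pd.concat() is the supported way to get the same doubling (a minimal sketch, assuming movies_df is loaded as above):

import pandas as pd

temp_df = pd.concat([movies_df, movies_df])  # same result as movies_df.append(movies_df)
temp_df.shape                                # (2000, 11)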

- Column cleanup

Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names. Here's how to print the column names of our dataset:

movies_df.columns

Out: Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)', 'Metascore'], dtype='object')

Not only does .columns come in handy if you want to rename columns by allowing for simple copy and paste, it's also useful if you need to understand why you are receiving a KeyError when selecting data by column. We can use the .rename() method to rename certain or all columns via a dict. We don't want parentheses, so let's rename those:

movies_df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)

movies_df.columns

Out: Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime', 'Rating', 'Votes', 'Revenue_millions', 'Metascore'], dtype='object')

Excellent. But what if we want to lowercase all names? Instead of using .rename() we could also set a list of names to the columns like so:

movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 'rating', 'votes', 'revenue_millions', 'metascore']

movies_df.columns

Out: Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 'rating', 'votes', 'revenue_millions', 'metascore'], dtype='object')

But that's too much work. Instead of just renaming each column manually we can do a list comprehension:

movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns

Out: Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 'rating', 'votes', 'revenue_millions', 'metascore'], dtype='object')

list (and dict) comprehensions come in handy a lot when working with pandas and data in general. It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be working with a dataset for some time.
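A hedged alternative sketch: the columns Index also exposes vectorized string methods, so you can normalize every name in one pass without typing them out (the exact replacements here are just one choice, and they produce runtime_minutes rather than the shorter runtime used above):

# lowercase, drop parentheses, and replace spaces with underscores
movies_df.columns = (movies_df.columns
                     .str.lower()
                     .str.replace('(', '', regex=False)
                     .str.replace(')', '', regex=False)
                     .str.replace(' ', '_', regex=False))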

- How to work with missing values

When exploring data, you'll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan, each of which is handled differently in some situations. There are two options in dealing with nulls:

1. Get rid of rows or columns with nulls
2. Replace nulls with non-null values, a technique known as imputation
Let's calculate the total number of nulls in each column of our dataset. The first step is to check which cells in our DataFrame are null:

movies_df.isnull()

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Guardians of the Galaxy | False | False | False | False | False | False | False | False | False | False | False |
| Prometheus | False | False | False | False | False | False | False | False | False | False | False |
| Split | False | False | False | False | False | False | False | False | False | False | False |
| Sing | False | False | False | False | False | False | False | False | False | False | False |
| Suicide Squad | False | False | False | False | False | False | False | False | False | False | False |
Notice isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status. To count the number of nulls in each column we use an aggregate function for summing:

movies_df.isnull().sum()

Out:
rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

.isnull() just by itself isn't very useful, and is usually used in conjunction with other methods, like sum(). We can see now that our data has 128 missing values for revenue_millions and 64 missing values for metascore.
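The same pattern works row-wise: summing across axis=1 counts how many values are missing in each movie's row, which is handy for spotting the worst offenders (a small sketch using the DataFrame above):

# number of missing values per row, most incomplete movies first
movies_df.isnull().sum(axis=1).sort_values(ascending=False).head()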

. . . Removing null values

Data Scientists and Analysts regularly face the dilemma of dropping or imputing null values, and it's a decision that requires intimate knowledge of your data and its context. Overall, removing null data is only suggested if you have a small amount of missing data. Removing nulls is pretty simple:

movies_df.dropna()

This operation will delete any row with at least a single null value, but it will return a new DataFrame without altering the original one. You could specify inplace=True in this method as well. So in the case of our dataset, this operation would remove 128 rows where revenue_millions is null and 64 rows where metascore is null. This obviously seems like a waste since there's perfectly good data in the other columns of those dropped rows. That's why we'll look at imputation next. Other than just dropping rows, you can also drop columns with null values by setting axis=1:

movies_df.dropna(axis=1)

In our dataset, this operation would drop the revenue_millions and metascore columns.
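If you only care about nulls in particular columns, dropna() also accepts subset and thresh arguments, which are gentler than dropping on any null at all; a couple of sketches:

# drop only rows where revenue_millions is null; rows missing just metascore survive
movies_df.dropna(subset=['revenue_millions'])

# keep rows that have at least 10 of the 11 columns filled in
movies_df.dropna(thresh=10)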

. . . Intuition

What's with this axis=1 parameter? It's not immediately obvious where axis comes from and why you need it to be 1 for it to affect columns. To see why, just look at the .shape output:

movies_df.shape

Out: (1000, 11)

As we learned above, this is a tuple that represents the shape of the DataFrame, i.e. 1000 rows and 11 columns. Note that the rows are at index zero of this tuple and columns are at index one. This is why axis=1 affects columns. This comes from NumPy, and is a great example of why learning NumPy is worth your time.

. . . Imputation

Imputation is a conventional feature engineering technique used to keep valuable data that have null values. There may be instances where dropping every row with a null value removes too big a chunk from your dataset, so instead we can impute that null with another value, usually the mean or the median of that column. Let's look at imputing the missing values in the revenue_millions column. First we'll extract that column into its own variable:

revenue = movies_df['revenue_millions']

Using square brackets is the general way we select columns in a DataFrame. If you remember back to when we created DataFrames from scratch, the keys of the dict ended up as column names. Now when we select columns of a DataFrame, we use brackets just like if we were accessing a Python dictionary. revenue now contains a Series:

revenue.head()

Out:
Title
Guardians of the Galaxy    333.13
Prometheus                 126.46
Split                      138.12
Sing                       270.32
Suicide Squad              325.02
Name: revenue_millions, dtype: float64

Slightly different formatting than a DataFrame, but we still have our Title index. We'll impute the missing values of revenue using the mean. Here's the mean value:

revenue_mean = revenue.mean()
revenue_mean

Out: 82.95637614678897

With the mean, let's fill the nulls using fillna():

revenue.fillna(revenue_mean, inplace=True)

We have now replaced all nulls in revenue with the mean of the column. Notice that by using inplace=True we have actually affected the original movies_df:

movies_df.isnull().sum()

Out:
rank                 0
genre                0
description          0
director             0
actors               0
year                 0
runtime              0
rating               0
votes                0
revenue_millions     0
metascore           64
dtype: int64

Imputing an entire column with the same value like this is a basic example. It would be a better idea to try a more granular imputation by Genre or Director. For example, you would find the mean of the revenue generated in each genre individually and impute the nulls in each genre with that genre's mean. Let's now look at more ways to examine and understand the dataset.
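As an aside, here's a hedged sketch of that per-genre idea: group by genre and fill each group's missing revenue with that group's own mean (run this instead of the plain fillna() above if you want the granular version):

movies_df['revenue_millions'] = (
    movies_df.groupby('genre')['revenue_millions']
             .transform(lambda s: s.fillna(s.mean()))
)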

. . . Understanding your variables

Using describe() on an entire DataFrame we can get a summary of the distribution of continuous variables: movies_df.describe() Out:
|       | rank        | year        | runtime     | rating      | votes        | revenue_millions | metascore  |
|-------|-------------|-------------|-------------|-------------|--------------|------------------|------------|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1.000000e+03 | 1000.000000      | 936.000000 |
| mean  | 500.500000  | 2012.783000 | 113.172000  | 6.723200    | 1.698083e+05 | 82.956376        | 58.985043  |
| std   | 288.819436  | 3.205962    | 18.810908   | 0.945429    | 1.887626e+05 | 96.412043        | 17.194757  |
| min   | 1.000000    | 2006.000000 | 66.000000   | 1.900000    | 6.100000e+01 | 0.000000         | 11.000000  |
| 25%   | 250.750000  | 2010.000000 | 100.000000  | 6.200000    | 3.630900e+04 | 17.442500        | 47.000000  |
| 50%   | 500.500000  | 2014.000000 | 111.000000  | 6.800000    | 1.107990e+05 | 60.375000        | 59.500000  |
| 75%   | 750.250000  | 2016.000000 | 123.000000  | 7.400000    | 2.399098e+05 | 99.177500        | 72.000000  |
| max   | 1000.000000 | 2016.000000 | 191.000000  | 9.000000    | 1.791916e+06 | 936.630000       | 100.000000 |
Understanding which numbers are continuous also comes in handy when thinking about the type of plot to use to represent your data visually. .describe() can also be used on a categorical variable to get the count of rows, unique count of categories, top category, and freq of top category:

movies_df['genre'].describe()

Out:
count                         1000
unique                         207
top        Action,Adventure,Sci-Fi
freq                            50
Name: genre, dtype: object

This tells us that the genre column has 207 unique values, and the top value is Action,Adventure,Sci-Fi, which shows up 50 times (freq). .value_counts() can tell us the frequency of all values in a column:

movies_df['genre'].value_counts().head(10)

Out:
Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Action,Adventure,Fantasy      27
Comedy,Drama                  27
Animation,Adventure,Comedy    27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: genre, dtype: int64
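If proportions are more useful than raw counts, value_counts() can also normalize for you (a quick extra on the same column):

# share of all movies per genre combination instead of raw counts
movies_df['genre'].value_counts(normalize=True).head(10)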

. . . Relationships between continuous variables

By using the correlation method .corr() we can see the relationship between each pair of continuous variables:

movies_df.corr()

Out:
|                  | rank      | year      | runtime   | rating    | votes     | revenue_millions | metascore |
|------------------|-----------|-----------|-----------|-----------|-----------|------------------|-----------|
| rank             | 1.000000  | -0.261605 | -0.221739 | -0.219555 | -0.283876 | -0.252996        | -0.191869 |
| year             | -0.261605 | 1.000000  | -0.164900 | -0.211219 | -0.411904 | -0.117562        | -0.079305 |
| runtime          | -0.221739 | -0.164900 | 1.000000  | 0.392214  | 0.407062  | 0.247834         | 0.211978  |
| rating           | -0.219555 | -0.211219 | 0.392214  | 1.000000  | 0.511537  | 0.189527         | 0.631897  |
| votes            | -0.283876 | -0.411904 | 0.407062  | 0.511537  | 1.000000  | 0.607941         | 0.325684  |
| revenue_millions | -0.252996 | -0.117562 | 0.247834  | 0.189527  | 0.607941  | 1.000000         | 0.133328  |
| metascore        | -0.191869 | -0.079305 | 0.211978  | 0.631897  | 0.325684  | 0.133328         | 1.000000  |
Correlation tables are a numerical representation of the bivariate relationships in the dataset. Positive numbers indicate a positive correlation (one goes up, the other goes up) and negative numbers represent an inverse correlation (one goes up, the other goes down). 1.0 indicates a perfect correlation. So looking in the first row, first column we see rank has a perfect correlation with itself, which is obvious. On the other hand, the correlation between votes and revenue_millions is 0.6. A little more interesting. Examining bivariate relationships comes in handy when you have an outcome or dependent variable in mind and would like to see the features most correlated to the increase or decrease of the outcome. You can visually represent bivariate relationships with scatterplots (seen below in the plotting section). For a deeper look into data summarizations check out Essential Statistics for Data Science. Let's now look more at manipulating DataFrames.
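As one quick aside before the manipulation section: if revenue is the outcome you care about, you can pull its column out of the correlation matrix and sort it (on pandas 2.x you'll need numeric_only=True so the text columns are skipped; older versions drop them automatically and don't accept the argument):

# features most correlated with revenue, strongest first
movies_df.corr(numeric_only=True)['revenue_millions'].sort_values(ascending=False)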

- DataFrame slicing, selecting, extracting

Up until now we've focused on some basic summaries of our data. We've learned about simple column extraction using single brackets, and we imputed null values in a column using fillna(). Below are the other methods of slicing, selecting, and extracting you'll need to use constantly. It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to be sure to know which type you are working with or else you will receive attribute errors. Let's look at working with columns first.

. . . By column

You already saw how to extract a column using square brackets like this: genre_col = movies_df['genre'] type(genre_col) Out: pandas.core.series.Series This will return a Series. To extract a column as a DataFrame, you need to pass a list of column names. In our case that's just a single column: genre_col = movies_df[['genre']] type(genre_col) pandas.core.frame.DataFrame Since it's just a list, adding another column name is easy: subset = movies_df[['genre', 'rating']] subset.head() Out:
| Title | genre | rating |
|---|---|---|
| Guardians of the Galaxy | Action,Adventure,Sci-Fi | 8.1 |
| Prometheus | Adventure,Mystery,Sci-Fi | 7.0 |
| Split | Horror,Thriller | 7.3 |
| Sing | Animation,Comedy,Family | 7.2 |
| Suicide Squad | Action,Adventure,Fantasy | 6.2 |
Now we'll look at getting data by rows.

. . . By rows

For rows, we have two options:

.loc - locates by name
.iloc - locates by numerical index

Remember that we are still indexed by movie Title, so to use .loc we give it the Title of a movie:

prom = movies_df.loc["Prometheus"]
prom

Out:
rank                                                                2
genre                                        Adventure,Mystery,Sci-Fi
description         Following clues to the origin of mankind, a te...
director                                                 Ridley Scott
actors              Noomi Rapace, Logan Marshall-Green, Michael Fa...
year                                                             2012
runtime                                                           124
rating                                                              7
votes                                                          485820
revenue_millions                                               126.46
metascore                                                          65
Name: Prometheus, dtype: object

On the other hand, with iloc we give it the numerical index of Prometheus:

prom = movies_df.iloc[1]

loc and iloc can be thought of as similar to Python list slicing. To show this even further, let's select multiple rows. How would you do it with a list? In Python, just slice with brackets like example_list[1:4]. It works the same way in pandas:

movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
movie_subset

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey, Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
One important distinction between using .loc and .iloc to select multiple rows is that .loc includes the movie Sing in the result, but when using .iloc we're getting rows 1:4 and the movie at index 4 (Suicide Squad) is not included. Slicing with .iloc follows the same rules as slicing with lists: the object at the index at the end is not included.
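Both accessors also take an optional second position for selecting columns at the same time, which is worth keeping in mind; a few quick sketches:

movies_df.loc['Prometheus', 'genre']                      # a single cell, by labels
movies_df.loc['Prometheus':'Sing', ['genre', 'rating']]   # rows by label, two columns
movies_df.iloc[1:4, 0:3]                                  # rows 1-3 and the first three columns, by position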

. . . Conditional selections

We've gone over how to select columns and rows, but what if we want to make a conditional selection? For example, what if we want to filter our movies DataFrame to show only films directed by Ridley Scott or films with a rating greater than or equal to 8.0? To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an example of a Boolean condition:

condition = (movies_df['director'] == "Ridley Scott")
condition.head()

Out:
Title
Guardians of the Galaxy    False
Prometheus                  True
Split                      False
Sing                       False
Suicide Squad              False
Name: director, dtype: bool

Similar to isnull(), this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him. We want to filter out all movies not directed by Ridley Scott; in other words, we don't want the False films. To return the rows where that condition is True we have to pass this operation into the DataFrame:

movies_df[movies_df['director'] == "Ridley Scott"]

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | rating_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | bad |
| The Martian | 103 | Adventure,Drama,Sci-Fi | An astronaut becomes stranded on Mars after hi... | Ridley Scott | Matt Damon, Jessica Chastain, Kristen Wiig, Ka... | 2015 | 144 | 8.0 | 556097 | 228.43 | 80.0 | good |
| Robin Hood | 388 | Action,Adventure,Drama | In 12th century England, Robin and his band of... | Ridley Scott | Russell Crowe, Cate Blanchett, Matthew Macfady... | 2010 | 140 | 6.7 | 221117 | 105.22 | 53.0 | bad |
| American Gangster | 471 | Biography,Crime,Drama | In 1970s America, a detective works to bring d... | Ridley Scott | Denzel Washington, Russell Crowe, Chiwetel Eji... | 2007 | 157 | 7.8 | 337835 | 130.13 | 76.0 | bad |
| Exodus: Gods and Kings | 517 | Action,Adventure,Drama | The defiant leader Moses rises up against the ... | Ridley Scott | Christian Bale, Joel Edgerton, Ben Kingsley, S... | 2014 | 150 | 6.0 | 137299 | 65.01 | 52.0 | bad |
You can get used to looking at these conditionals by reading it like: "Select movies_df where movies_df director equals Ridley Scott." Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:

movies_df[movies_df['rating'] >= 8.6].head(3)

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Interstellar | 37 | Adventure,Drama,Sci-Fi | A team of explorers travel through a wormhole ... | Christopher Nolan | Matthew McConaughey, Anne Hathaway, Jessica Ch... | 2014 | 169 | 8.6 | 1047747 | 187.99 | 74.0 |
| The Dark Knight | 55 | Action,Crime,Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart, Mi... | 2008 | 152 | 9.0 | 1791916 | 533.32 | 82.0 |
| Inception | 81 | Action,Adventure,Sci-Fi | A thief, who steals corporate secrets through ... | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... | 2010 | 148 | 8.8 | 1583625 | 292.57 | 74.0 |
We can make some richer conditionals by using logical operators | for "or" and & for "and". Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:

movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Interstellar | 37 | Adventure,Drama,Sci-Fi | A team of explorers travel through a wormhole ... | Christopher Nolan | Matthew McConaughey, Anne Hathaway, Jessica Ch... | 2014 | 169 | 8.6 | 1047747 | 187.99 | 74.0 |
| The Dark Knight | 55 | Action,Crime,Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart, Mi... | 2008 | 152 | 9.0 | 1791916 | 533.32 | 82.0 |
| The Prestige | 65 | Drama,Mystery,Sci-Fi | Two stage magicians engage in competitive one-... | Christopher Nolan | Christian Bale, Hugh Jackman, Scarlett Johanss... | 2006 | 130 | 8.5 | 913152 | 53.08 | 66.0 |
| Inception | 81 | Action,Adventure,Sci-Fi | A thief, who steals corporate secrets through ... | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... | 2010 | 148 | 8.8 | 1583625 | 292.57 | 74.0 |
We need to make sure to group evaluations with parentheses so Python knows how to evaluate the conditional. Using the isin() method we could make this more concise though:

movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Interstellar | 37 | Adventure,Drama,Sci-Fi | A team of explorers travel through a wormhole ... | Christopher Nolan | Matthew McConaughey, Anne Hathaway, Jessica Ch... | 2014 | 169 | 8.6 | 1047747 | 187.99 | 74.0 |
| The Dark Knight | 55 | Action,Crime,Drama | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart, Mi... | 2008 | 152 | 9.0 | 1791916 | 533.32 | 82.0 |
| The Prestige | 65 | Drama,Mystery,Sci-Fi | Two stage magicians engage in competitive one-... | Christopher Nolan | Christian Bale, Hugh Jackman, Scarlett Johanss... | 2006 | 130 | 8.5 | 913152 | 53.08 | 66.0 |
| Inception | 81 | Action,Adventure,Sci-Fi | A thief, who steals corporate secrets through ... | Christopher Nolan | Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... | 2010 | 148 | 8.8 | 1583625 | 292.57 | 74.0 |
Let's say we want all movies that were released between 2005 and 2010, have a rating above 8.0, but made below the 25th percentile in revenue. Here's how we could do all of that:

movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 Idiots | 431 | Comedy,Drama | Two friends are searching for their long lost ... | Rajkumar Hirani | Aamir Khan, Madhavan, Mona Singh, Sharman Joshi | 2009 | 170 | 8.4 | 238789 | 6.52 | 67.0 |
| The Lives of Others | 477 | Drama,Thriller | In 1984 East Berlin, an agent of the secret po... | Florian Henckel von Donnersmarck | Ulrich Mühe, Martina Gedeck, Sebastian Koch, Ul... | 2006 | 137 | 8.5 | 278103 | 11.28 | 89.0 |
| Incendies | 714 | Drama,Mystery,War | Twins journey to the Middle East to discover t... | Denis Villeneuve | Lubna Azabal, Mélissa Désormeaux-Poulin, Maxim... | 2010 | 131 | 8.2 | 92863 | 6.86 | 80.0 |
| Taare Zameen Par | 992 | Drama,Family,Music | An eight-year-old boy is thought to be a lazy ... | Aamir Khan | Darsheel Safary, Aamir Khan, Tanay Chheda, Sac... | 2007 | 165 | 8.5 | 102697 | 1.20 | 42.0 |
If you recall from when we used .describe(), the 25th percentile for revenue was about 17.4, and we can access this value directly by using the quantile() method with a float of 0.25. So here we have only four movies that match that criteria.
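For what it's worth, the same filter can also be written with DataFrame.query(), which some people find easier to read once conditions pile up (the @ prefix refers to an ordinary Python variable):

revenue_25th = movies_df['revenue_millions'].quantile(0.25)
movies_df.query('2005 <= year <= 2010 and rating > 8.0 and revenue_millions < @revenue_25th')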

- Applying functions

It is possible to iterate over a DataFrame or Series as you would with a list, but doing so, especially on large datasets, is very slow. An efficient alternative is to apply() a function to the dataset. For example, we could use a function to convert movies with an 8.0 or greater to a string value of "good" and the rest to "bad" and use these transformed values to create a new column. First we would create a function that, when given a rating, determines if it's good or bad:

def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

Now we want to send the entire rating column through this function, which is what apply() does:

movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(2)

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | rating_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 | good |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | bad |
The .apply() method passes every value in the rating column through the rating_function and then returns a new Series. This Series is then assigned to a new column called rating_category. You can also use anonymous functions. This lambda function achieves the same result as rating_function:

movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')
movies_df.head(2)

Out:
| Title | rank | genre | description | director | actors | year | runtime | rating | votes | revenue_millions | metascore | rating_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 | good |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | bad |
Overall, using apply() will be much faster than iterating manually over rows because pandas is utilizing vectorization.
Vectorization: a style of computer programming where operations are applied to whole arrays instead of individual elements —Wikipedia
A good example of high usage of apply() is during natural language processing (NLP) work. You'll need to apply all sorts of text cleaning functions to strings to prepare for machine learning.
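As a tiny sketch of that kind of text-cleaning work (the new column name and the cleaning steps are arbitrary examples), you can apply() a function to a string column, or lean on pandas' vectorized .str methods for the same effect:

# normalize genre strings with apply()
movies_df['genre_clean'] = movies_df['genre'].apply(lambda g: g.lower().replace(',', '|'))

# the same thing with vectorized string methods
movies_df['genre_clean'] = movies_df['genre'].str.lower().str.replace(',', '|', regex=False)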

- Brief Plotting

Another great thing about pandas is that it integrates with Matplotlib, so you get the ability to plot directly off DataFrames and Series. To get started we need to import Matplotlib (pip install matplotlib):

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)})  # set font and plot size to be larger

Now we can begin. There won't be a lot of coverage on plotting, but it should be enough to explore your data easily.

. . . Plotting Tip

For categorical variables utilize Bar Charts* and Boxplots. For continuous variables utilize Histograms, Scatterplots, Line graphs, and Boxplots. Let's plot the relationship between ratings and revenue. All we need to do is call .plot() on movies_df with some info about how to construct the plot:

movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

RESULT: (scatterplot of revenue vs. rating)

What's with the semicolon? It's not a syntax error, just a way to hide the <matplotlib.axes._subplots.AxesSubplot at 0x26613b5cc18> output when plotting in Jupyter notebooks. If we want to plot a simple Histogram based on a single column, we can call plot on a column:

movies_df['rating'].plot(kind='hist', title='Rating');

RESULT: (histogram of ratings)

Recall what .describe() gives us on the ratings column:

movies_df['rating'].describe()

Out:
count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: rating, dtype: float64

Using a Boxplot we can visualize this data:

movies_df['rating'].plot(kind="box");

RESULT: (boxplot of ratings)
Source: *Flowing Data*
By combining categorical and continuous data, we can create a Boxplot of revenue that is grouped by the Rating Category we created above:

movies_df.boxplot(column='revenue_millions', by='rating_category');

RESULT: (boxplots of revenue for the "good" and "bad" rating categories)

That's the general idea of plotting with pandas; see the pandas visualization documentation for more information on what it can do.
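One small extra: plot() passes additional keyword arguments straight through to Matplotlib, so if the default histogram bins feel too coarse you can tune them, for example:

movies_df['rating'].plot(kind='hist', bins=20, title='Rating');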

Wrapping up

Exploring, cleaning, transforming, and visualizing data with pandas in Python is an essential skill in data science. Just cleaning and wrangling data is 80% of your job as a Data Scientist. After a few projects and some practice, you should be very comfortable with most of the basics. To keep improving, view the extensive tutorials offered by the official pandas docs, follow along with a few Kaggle kernels, and keep working on your own projects!

Statistical Analysis in Python

In this section, we introduce a few useful methods for analyzing your data in Python. Namely, we cover how to compute the mean, variance, and standard error of a data set. For more advanced statistical analysis, we cover how to perform a Mann-Whitney-Wilcoxon (MWW) RankSum test, how to perform an Analysis of variance (ANOVA) between multiple data sets, and how to compute bootstrapped 95% confidence intervals for non-normally distributed data sets.

- Python's SciPy Module

The majority of data analysis in Python can be performed with the SciPy module. SciPy provides a plethora of statistical functions and tests that will handle the majority of your analytical needs. If we don't cover a statistical function or test that you require for your research, SciPy's full statistical library is described in detail at: http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html

- Python's pandas Module

from pandas import *

# must specify that blank space " " is NaN
experimentDF = read_csv("parasite_data.csv", na_values=[" "])

- Accessing data in pandas DataFrames

You can directly access any column and row by indexing the DataFrame.

# show all entries in the Virulence column
print(experimentDF["Virulence"])

# show the entry at index 12 in the ShannonDiversity column
print(experimentDF["ShannonDiversity"][12])

You can also access all of the values in a column meeting a certain criterion.

# show all entries in the ShannonDiversity column > 2.0
print(experimentDF[experimentDF["ShannonDiversity"] > 2.0])

- Blank/omitted data (NA or NaN) in pandas DataFrames

Blank/omitted data is a piece of cake to handle in pandas. Here's an example data set with NA/NaN values.

import numpy as np
print(experimentDF[np.isnan(experimentDF["Virulence"])])

DataFrame methods automatically ignore NA/NaN values.

print("Mean virulence across all treatments:", experimentDF["Virulence"].mean())

Mean virulence across all treatments: 0.75

However, not all methods in Python are guaranteed to handle NA/NaN values properly.

from scipy import stats
print("Mean virulence across all treatments:", stats.sem(experimentDF["Virulence"]))

Mean virulence across all treatments: nan

Thus, it behooves you to take care of the NA/NaN values before performing your analysis. You can either:

(1) filter out all of the entries with NA/NaN

# NOTE: this drops the entire row if any of its entries are NA/NaN!
print(experimentDF.dropna())

If you only care about NA/NaN values in a specific column, you can specify the column name first.

print(experimentDF["Virulence"].dropna())

(2) replace all of the NA/NaN entries with a valid value

print(experimentDF.fillna(0.0)["Virulence"])

Take care when deciding what to do with NA/NaN entries. It can have a significant impact on your results!

print("Mean virulence across all treatments w/ dropped NaN:", experimentDF["Virulence"].dropna().mean())
print("Mean virulence across all treatments w/ filled NaN:", experimentDF.fillna(0.0)["Virulence"].mean())

Mean virulence across all treatments w/ dropped NaN: 0.75
Mean virulence across all treatments w/ filled NaN: 0.642857142857
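Putting those two ideas together: if you want the SciPy call from above to behave, drop the NaN entries first (a small sketch on the same column):

from scipy import stats

# SEM computed on only the non-NaN virulence values
print("SEM of virulence across all treatments:", stats.sem(experimentDF["Virulence"].dropna()))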

- Mean of a data set

The mean performance of an experiment gives a good idea of how the experiment will turn out on average under a given treatment. Conveniently, DataFrames have all kinds of built-in functions to perform standard operations on them en masse: `add()`, `sub()`, `mul()`, `div()`, `mean()`, `std()`, etc. The full list is located at: http://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats

Thus, computing the mean of a DataFrame only takes one line of code:

from pandas import *

print("Mean Shannon Diversity w/ 0.8 Parasite Virulence =",
      experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"].mean())

Mean Shannon Diversity w/ 0.8 Parasite Virulence = 1.2691338188

- Variance in a data set

The variance in the performance provides a measurement of how consistent the results of an experiment are. The lower the variance, the more consistent the results are, and vice versa. Computing the variance is also built in to pandas DataFrames:

from pandas import *

print("Variance in Shannon Diversity w/ 0.8 Parasite Virulence =",
      experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"].var())

Variance in Shannon Diversity w/ 0.8 Parasite Virulence = 0.611038433313

- Standard Error of the Mean (SEM)

Combined with the mean, the SEM enables you to establish a range around a mean that the majority of any future replicate experiments will most likely fall within. Recent pandas versions do provide a .sem() method, but more generally, since DataFrame rows/columns are treated as lists, you can use any NumPy/SciPy method you like on them.

from pandas import *
from scipy import stats

print("SEM of Shannon Diversity w/ 0.8 Parasite Virulence =",
      stats.sem(experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]))

SEM of Shannon Diversity w/ 0.8 Parasite Virulence = 0.110547585529

A single SEM will usually envelop 68% of the possible replicate means and two SEMs envelop 95% of the possible replicate means. Two SEMs are called the "estimated 95% confidence interval." The confidence interval is estimated because the exact width depends on how many replicates you have; this approximation is good when you have more than 20 replicates.
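A small sketch of turning that rule of thumb into numbers, using the same subset as above; the interval printed here is the "estimated 95% confidence interval" the text describes, not an exact one:

from scipy import stats

diversity = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]
mean = diversity.mean()
sem = stats.sem(diversity)

# mean plus/minus two standard errors
print("Estimated 95% CI: {:.3f} to {:.3f}".format(mean - 2 * sem, mean + 2 * sem))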

- Mann-Whitney-Wilcoxon (MWW) RankSum test

The MWW RankSum test is a useful test to determine if two distributions are significantly different or not. Unlike the t-test, the RankSum test does not assume that the data are normally distributed, potentially providing a more accurate assessment of the data sets. As an example, let's say we want to determine if the results of the two following treatments significantly differ or not:

# select two treatment data sets from the parasite data
treatment1 = experimentDF[experimentDF["Virulence"] == 0.5]["ShannonDiversity"]
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]

print("Data set 1:\n", treatment1)
print("Data set 2:\n", treatment2)

Data set 1: 0 0.059262 1 1.093600 2 1.139390 3 0.547651 ... 45 1.937930 46 1.284150 47 1.651680 48 0.000000 49 0.000000 Name: ShannonDiversity
Data set 2: 150 1.433800 151 2.079700 152 0.892139 153 2.384740 ... 196 2.077180 197 1.566410 198 0.000000 199 1.990900 Name: ShannonDiversity

A RankSum test will provide a P value indicating whether or not the two distributions are the same.

from scipy import stats

z_stat, p_val = stats.ranksums(treatment1, treatment2)
print("MWW RankSum P for treatments 1 and 2 =", p_val)

MWW RankSum P for treatments 1 and 2 = 0.000983355902735

If P <= 0.05, we are highly confident that the distributions significantly differ, and can claim that the treatments had a significant impact on the measured value. If the treatments do not significantly differ, we could expect a result such as the following:

treatment3 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]
treatment4 = experimentDF[experimentDF["Virulence"] == 0.9]["ShannonDiversity"]

print("Data set 3:\n", treatment3)
print("Data set 4:\n", treatment4)

Data set 3: 150 1.433800 151 2.079700 152 0.892139 153 2.384740 ... 196 2.077180 197 1.566410 198 0.000000 199 1.990900 Name: ShannonDiversity
Data set 4: 200 1.036930 201 0.938018 202 0.995956 203 1.006970 ... 246 1.564330 247 1.870380 248 1.262280 249 0.000000 Name: ShannonDiversity

# compute RankSum P value
z_stat, p_val = stats.ranksums(treatment3, treatment4)
print("MWW RankSum P for treatments 3 and 4 =", p_val)

MWW RankSum P for treatments 3 and 4 = 0.994499571124

With P > 0.05, we must say that the distributions do not significantly differ. Thus changing the parasite virulence between 0.8 and 0.9 does not result in a significant change in Shannon Diversity.

- One-way analysis of variance (ANOVA)

If you need to compare more than two data sets at a time, an ANOVA is your best bet. For example, we have the results from three experiments with overlapping 95% confidence intervals, and we want to confirm that the results for all three experiments are not significantly different.

treatment1 = experimentDF[experimentDF["Virulence"] == 0.7]["ShannonDiversity"]
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]
treatment3 = experimentDF[experimentDF["Virulence"] == 0.9]["ShannonDiversity"]

print("Data set 1:\n", treatment1)
print("Data set 2:\n", treatment2)
print("Data set 3:\n", treatment3)

Data set 1: 100 1.595440 101 1.419730 102 0.000000 103 0.000000 ... 146 0.000000 147 1.139100 148 2.383260 149 0.056819 Name: ShannonDiversity
Data set 2: 150 1.433800 151 2.079700 152 0.892139 153 2.384740 ... 196 2.077180 197 1.566410 198 0.000000 199 1.990900 Name: ShannonDiversity
Data set 3: 200 1.036930 201 0.938018 202 0.995956 203 1.006970 ... 246 1.564330 247 1.870380 248 1.262280 249 0.000000 Name: ShannonDiversity

# compute one-way ANOVA P value
from scipy import stats

f_val, p_val = stats.f_oneway(treatment1, treatment2, treatment3)
print("One-way ANOVA P =", p_val)

One-way ANOVA P = 0.381509481874

If P > 0.05, we can claim with high confidence that the means of the results of all three experiments are not significantly different.

- Bootstrapped 95% confidence intervals

Oftentimes in wet lab research, it's difficult to perform the 20 replicate runs recommended for computing reliable confidence intervals with SEM. In this case, bootstrapping the confidence intervals is a much more accurate method of determining the 95% confidence interval around your experiment's mean performance. Unfortunately, SciPy didn't have bootstrapping built into its standard library at the time this was written. However, there is a scikit for bootstrapping; install it with:

pip install scikits.bootstrap

Bootstrapping 95% confidence intervals around the mean with this function is simple:

# subset a list of 10 data points
treatment1 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"][:10]
print("Small data set:\n", treatment1)

Small data set: 150 1.433800 151 2.079700 152 0.892139 153 2.384740 154 0.006980 155 1.971760 156 0.000000 157 1.428470 158 1.715950 159 0.000000 Name: ShannonDiversity

import scipy
import scikits.bootstrap as bootstrap

# compute 95% confidence intervals around the mean
CIs = bootstrap.ci(data=treatment1, statfunction=scipy.mean)
print("Bootstrapped 95% confidence intervals\nLow:", CIs[0], "\nHigh:", CIs[1])

Bootstrapped 95% confidence intervals Low: 0.659028048 High: 1.722468024

Note that you can change the range of the confidence interval by setting the alpha:

# 80% confidence interval
CIs = bootstrap.ci(treatment1, scipy.mean, alpha=0.2)
print("Bootstrapped 80% confidence interval\nLow:", CIs[0], "\nHigh:", CIs[1])

Bootstrapped 80% confidence interval Low: 0.827291024 High: 1.5420059

And also modify the size of the bootstrapped sample pool that the confidence intervals are taken from:

# bootstrap 20,000 samples instead of only 10,000
CIs = bootstrap.ci(treatment1, scipy.mean, n_samples=20000)
print("Bootstrapped 95% confidence interval w/ 20,000 samples\nLow:", CIs[0], "\nHigh:", CIs[1])

Bootstrapped 95% confidence interval w/ 20,000 samples Low: 0.644756972 High: 1.7071459

Generally, bootstrapped 95% confidence intervals provide more accurate confidence intervals than 95% confidence intervals estimated from the SEM.
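scikits.bootstrap is fairly old at this point and scipy.mean has been removed from recent SciPy releases (newer SciPy also ships its own scipy.stats.bootstrap), so here is a hedged, dependency-free sketch of the same percentile bootstrap written with plain NumPy; the 10,000-resample default mirrors the example above:

import numpy as np

def bootstrap_ci(data, stat=np.mean, n_samples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a 1-D sample."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    # draw n_samples resamples with replacement and compute the statistic on each
    stats_ = np.array([stat(rng.choice(data, size=len(data), replace=True))
                       for _ in range(n_samples)])
    low, high = np.percentile(stats_, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high

low, high = bootstrap_ci(treatment1)   # treatment1: the 10-point subset from above
print("Bootstrapped 95% confidence interval\nLow:", low, "\nHigh:", high)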