Pandas Cheat Sheet
Detailed guide for using the Pandas library.
Installing Pandas
| Command | Description |
|---|---|
conda install pandas |
The best way to get pandas is via conda |
python3 -m pip install --upgrade pandas |
Via PyPI |
pip install pandas |
Standard pip installation |
Key Abbreviations and Package Imports
| Abbreviation | Description |
|---|---|
df |
Any Pandas DataFrame object |
s |
Any Pandas Series object |
Import Methods
| Command | Description |
|---|---|
import pandas as pd |
Import pandas with the alias pd |
import numpy as np |
Import numpy with the alias np |
Data Selection
| Command | Description |
|---|---|
df[col] |
Returns a column as a Series based on column name |
df[[col1, col2]] |
Returns multiple columns as a DataFrame |
s.iloc[0] |
Select data by position |
s.loc['index_one'] |
Select data by index |
df.iloc[0,:] |
Returns the first row |
df.iloc[0,0] |
Returns the first element of the first column |
Data Cleaning
| Command | Description |
|---|---|
df.columns = ['a','b','c'] |
Rename columns |
pd.isnull() |
Check for null values in a DataFrame; returns a Boolean array |
pd.notnull() |
Check for non-null values in a DataFrame; returns a Boolean array |
df.dropna() |
Delete all rows containing null values |
df.dropna(axis=1) |
Delete all columns containing null values |
df.dropna(axis=1, thresh=n) |
Delete all rows with fewer than n non-null values |
df.fillna(x) |
Replace all null values in a DataFrame with x |
s.astype(float) |
Convert Series data type to float |
s.replace(1, 'one') |
Replace all values equal to 1 with ‘one’ |
s.replace([1,3], ['one', 'three']) |
Replace 1 with ‘one’ and 3 with ’three’ |
df.rename(columns=lambda x: x + 1) |
Batch rename columns |
df.rename(columns={'old_name': 'new_name'}) |
Selectively rename columns |
df.set_index('column_one') |
Change the index column |
df.rename(index=lambda x: x + 1) |
Batch rename indices |
Importing Data
| Command | Description |
|---|---|
pd.read_csv(filename) |
Import data from a CSV file |
pd.read_table(filename) |
Import data from a delimited text file |
pd.read_excel(filename) |
Import data from an Excel file |
pd.read_sql(query, connection_object) |
Import data from a SQL table/database |
pd.read_json(json_string) |
Import data from a JSON string |
pd.read_html(url) |
Extract tables from a URL, string, or HTML file |
pd.read_clipboard() |
Get content from clipboard and pass to read_table() |
pd.DataFrame(dict) |
Import data from a dictionary (Keys are column names, Values are data) |
Exporting Data
| Command | Description |
|---|---|
df.to_csv(filename) |
Export data to a CSV file |
df.to_excel(filename) |
Export data to an Excel file |
df.to_sql(table_name, connection_object) |
Export data to a SQL table |
df.to_json(filename) |
Export data to a text file in JSON format |
Data Processing: Filter, Sort, and GroupBy
| Command | Description |
|---|---|
df[df[col] > 0.5] |
Select rows where the value in column ‘col’ is > 0.5 |
df.sort_values(col1) |
Sort data by column col1 (ascending by default) |
df.sort_values(col2, ascending=False) |
Sort data by column col2 in descending order |
df.sort_values([col1, col2], ascending=[True, False]) |
Sort by col1 ascending, then col2 descending |
df.groupby(col) |
Returns a GroupBy object grouped by column ‘col’ |
df.groupby([col1, col2]) |
Returns a GroupBy object grouped by multiple columns |
df.groupby(col1)[col2].mean() |
Returns the mean of column col2 for each group in col1 |
df.pivot_table(index=col1, values=[col2, col3], aggfunc=max) |
Create a pivot table grouped by col1, calculating max for col2 and col3 |
df.groupby(col1).agg(np.mean) |
Returns the mean for all columns grouped by col1 |
df.apply(np.mean) |
Apply function np.mean across each column |
df.apply(np.max, axis=1) |
Apply function np.max across each row |
Creating Test Objects
| Command | Description |
|---|---|
pd.DataFrame(np.random.rand(20,5)) |
Create a DataFrame with 20 rows and 5 columns of random numbers |
pd.Series(my_list) |
Create a Series from an iterable object my_list |
df.index = pd.date_range('1900/1/30', periods=df.shape[0]) |
Add a date index |
Viewing and Inspecting Data
| Command | Description |
|---|---|
df.head(n) |
View the first n rows of a DataFrame |
df.tail(n) |
View the last n rows of a DataFrame |
df.shape |
View the number of rows and columns |
df.info() |
View index, data types, and memory info |
df.describe() |
View summary statistics for numerical columns |
s.value_counts(dropna=False) |
View unique values and counts for a Series |
df.apply(pd.Series.value_counts) |
View unique values and counts for each column in a DataFrame |
Data Merging
| Command | Description |
|---|---|
df1.append(df2) |
Append rows from df2 to the end of df1 |
pd.concat([df1, df2], axis=1) |
Append columns from df2 to the end of df1 |
df1.join(df2, on=col1, how='inner') |
SQL-style join between df1 and df2 columns |
Data Statistics
| Command | Description |
|---|---|
df.describe() |
View summary statistics for numerical columns |
df.mean() |
Returns the mean of all columns |
df.corr() |
Returns the correlation coefficient between columns |
df.count() |
Returns the number of non-null values in each column |
df.max() |
Returns the maximum value of each column |
df.min() |
Returns the minimum value of each column |
df.median() |
Returns the median of each column |
df.std() |
Returns the standard deviation of each column |