Forked from bsweger/useful_pandas_snippets.md
Created
April 19, 2019 07:46
Revisions
-
bsweger revised this gist
Dec 6, 2018 . 1 changed file with 8 additions and 4 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -56,15 +56,19 @@ df = df.loc[df.column == 'somevalue'] Grab DataFrame rows where column value is present in a list ``` test_data = {'hi': 'yo', 'bye': 'later'} df = pd.DataFrame(list(d.items()), columns=['col1', 'col2']) valuelist = ['yo', 'heya'] df[df.col2.isin(valuelist)] ``` Grab DataFrame rows where column value is not present in a list ``` test_data = {'hi': 'yo', 'bye': 'later'} df = pd.DataFrame(list(d.items()), columns=['col1', 'col2']) valuelist = ['yo', 'heya'] df[~df.col2.isin(valuelist)] ``` Select from DataFrame using criteria from multiple columns -
bsweger revised this gist
Dec 6, 2018 . 1 changed file with 104 additions and 99 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,19 +1,9 @@ # Useful Pandas Snippets A personal diary of DataFrame munging over the years. ## Data Types and Conversion Convert Series datatype to numeric (will error if column has non-numeric values) (h/t [@makmanalp](https://github.com/makmanalp)) @@ -29,62 +19,60 @@ Convert Series datatype to numeric, changing non-numeric values to NaN pd.to_numeric(df['Column Name'], errors='coerce') ``` Change data type of DataFrame column ``` df.column_name = df.column_name.astype(np.int64) ``` ## Exploring and Finding Data Get a report of all duplicate records in a DataFrame, based on specific columns ``` dupes = df[df.duplicated( ['col1', 'col2', 'col3'], keep=False)] ``` List unique values in a DataFrame column (h/t [@makmanalp](https://github.com/makmanalp) for the updated syntax!) ``` df['Column Name'].unique() ``` For each unique value in a DataFrame column, get a frequency count ``` df['Column Name'].value_counts() ``` Grab DataFrame rows where column = a specific value ``` df = df.loc[df.column == 'somevalue'] ``` Grab DataFrame rows where column value is present in a list ``` valuelist = ['value1', 'value2', 'value3'] df = df[df.column.isin(valuelist)] ``` Grab DataFrame rows where column value is not present in a list ``` valuelist = ['value1', 'value2', 'value3'] df = df[~df.column.isin(value_list)] ``` Select from DataFrame using criteria from multiple columns (use `|` instead of `&` to do an OR) ``` newdf = df[(df['column_one']>2004) & (df['column_two']==9)] ``` Loop through rows in a DataFrame (if you must) @@ -101,19 +89,39 @@ for row in df.itertuples(): print(row) ``` Get top n for each group of columns in a sorted DataFrame (make sure DataFrame is sorted first) ``` top5 = df.groupby( ['groupingcol1', 'groupingcol2']).head(5) ``` Grab DataFrame rows where specific column is null/notnull ``` newdf = df[df['column'].isnull()] ``` Select from DataFrame using multiple keys of a hierarchical index ``` df.xs( ('index level 1 value','index level 2 value'), level=('level 1','level 2')) ``` Slice values in a DataFrame column (aka Series) ``` df.column.str[0:2] ``` Get quick count of rows in a DataFrame ``` len(df.index) ``` Get length of data in a DataFrame column @@ -122,69 +130,61 @@ Get length of data in a DataFrame column df.column_name.str.len() ``` ## Updating and Cleaning Data Delete column from DataFrame ``` del df['column'] ``` Rename several DataFrame columns ``` df = df.rename(columns = { 'col1 old name':'col1 new name', 'col2 old name':'col2 new name', 'col3 old name':'col3 new name', }) ``` Lower-case all DataFrame column names ``` df.columns = map(str.lower, df.columns) ``` Even more fancy DataFrame column re-naming lower-case all DataFrame column names (for example) ``` df.rename(columns=lambda x: x.split('.')[-1], inplace=True) ``` Lower-case everything in a DataFrame column ``` df.column_name = df.column_name.str.lower() ``` Sort DataFrame by multiple columns ``` df = df.sort_values( ['col1','col2','col3'],ascending=[1,1,0]) ``` Change all NaNs to None (useful before loading to a db) ``` df = df.where((pd.notnull(df)), None) ``` More pre-db insert cleanup...make a pass through the DataFrame, stripping whitespace from strings and changing any empty values to `None` (not especially recommended but including here b/c I had to do this in real life once) ``` df = df.applymap(lambda x: str(x).strip() if len(str(x).strip()) else None) ``` Get rid of non-numeric values throughout a DataFrame: @@ -214,21 +214,11 @@ df = df.fillna({ 'col6': '99' }) ``` Doing calculations with DataFrame columns that have missing values. In example below, swap in 0 for df['col1'] cells that contain null. ``` df['new_col'] = np.where( pd.isnull(df['col1']), 0, df['col1']) + df['col2'] ``` Split delimited values in a DataFrame column into two new columns @@ -245,27 +235,26 @@ Collapse hierarchical column indexes df.columns = df.columns.get_level_values(0) ``` ## Reshaping, Concatenating, and Merging Data Pivot data (with flexibility about what what becomes a column and what stays a row). ``` pd.pivot_table( df,values='cell_value', index=['col1', 'col2', 'col3'], #these stay as columns; will fail silently if any of these cols have null values columns=['col4']) #data values in this column become their own column ``` Concatenate two DataFrame columns into a new, single column (useful when dealing with composite keys, for example) (h/t [@makmanalp](https://github.com/makmanalp) for improving this one!) ``` df['newcol'] = df['col1'].astype(str) + df['col2'].astype(str) ``` ## Display and formatting Set up formatting so larger numbers aren't displayed in scientific notation (h/t [@thecapacity](https://github.com/thecapacity)) @@ -278,3 +267,19 @@ To display with commas and no decimals ``` pd.options.display.float_format = '{:,.0f}'.format ``` ## Creating DataFrames Create a DataFrame from a Python dictionary ``` df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2']) ``` Convert Django queryset to DataFrame ``` qs = DjangoModelName.objects.all() q = qs.values() df = pd.DataFrame.from_records(q) ``` -
bsweger revised this gist
Dec 6, 2018 . 2 changed files with 280 additions and 153 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,280 @@ # Useful Pandas Snippets A personal diary of DataFrame munging from over the years. List unique values in a DataFrame column (h/t [@makmanalp](https://github.com/makmanalp) for the updated syntax!) ``` df['Column Name'].unique() ``` For each unique value in a DataFrame column, get a frequency count ``` df['Column Name'].value_counts() ``` Convert Series datatype to numeric (will error if column has non-numeric values) (h/t [@makmanalp](https://github.com/makmanalp)) ``` pd.to_numeric(df['Column Name']) ``` Convert Series datatype to numeric, changing non-numeric values to NaN (h/t [@makmanalp](https://github.com/makmanalp) for the updated syntax!) ``` pd.to_numeric(df['Column Name'], errors='coerce') ``` Grab DataFrame rows where column = a specific value ``` df = df.loc[df.column == 'somevalue'] ``` Grab DataFrame rows where column value is present in a list ``` valuelist = ['value1', 'value2', 'value3'] df = df[df.column.isin(valuelist)] ``` Grab DataFrame rows where column value is not present in a list ``` valuelist = ['value1', 'value2', 'value3'] df = df[~df.column.isin(value_list)] ``` Delete column from DataFrame ``` del df['column'] ``` Select from DataFrame using criteria from multiple columns (use `|` instead of `&` to do an OR) ``` newdf = df[(df['column_one']>2004) & (df['column_two']==9)] ``` Rename several DataFrame columns ``` df = df.rename(columns = { 'col1 old name':'col1 new name', 'col2 old name':'col2 new name', 'col3 old name':'col3 new name', }) ``` Lower-case all DataFrame column names ``` df.columns = map(str.lower, df.columns) ``` Even more fancy DataFrame column re-naming lower-case all DataFrame column names (for example) ``` df.rename(columns=lambda x: x.split('.')[-1], inplace=True) ``` Loop through rows in a DataFrame (if you must) ``` for index, row in df.iterrows(): print (index, row['some column']) ``` Much faster way to loop through DataFrame rows if you can work with tuples (h/t [hughamacmullaniv](https://github.com/hughamacmullaniv)) ``` for row in df.itertuples(): print(row) ``` Next few examples show how to work with text data in Pandas. Full list of .str functions: http://pandas.pydata.org/pandas-docs/stable/text.html Slice values in a DataFrame column (aka Series) ``` df.column.str[0:2] ``` Lower-case everything in a DataFrame column ``` df.column_name = df.column_name.str.lower() ``` Get length of data in a DataFrame column ``` df.column_name.str.len() ``` Sort DataFrame by multiple columns ``` df = df.sort_values( ['col1','col2','col3'],ascending=[1,1,0]) ``` Get top n for each group of columns in a sorted DataFrame (make sure DataFrame is sorted first) ``` top5 = df.groupby( ['groupingcol1', 'groupingcol2']).head(5) ``` Grab DataFrame rows where specific column is null/notnull ``` newdf = df[df['column'].isnull()] ``` Select from DataFrame using multiple keys of a hierarchical index ``` df.xs( ('index level 1 value','index level 2 value'), level=('level 1','level 2')) ``` Change all NaNs to None (useful before loading to a db) ``` df = df.where((pd.notnull(df)), None) ``` More pre-db insert cleanup...make a pass through the DataFrame, stripping whitespace from strings and changing any empty values to `None` (not especially recommended but including here b/c I had to do this in real life one time) ``` df = df.applymap(lambda x: str(x).strip() if len(str(x).strip()) else None) ``` Get quick count of rows in a DataFrame ``` len(df.index) ``` Pivot data (with flexibility about what what becomes a column and what stays a row). Syntax works on Pandas >= .14 ``` pd.pivot_table( df,values='cell_value', index=['col1', 'col2', 'col3'], #these stay as columns; will fail silently if any of these cols have null values columns=['col4']) #data values in this column become their own column ``` Change data type of DataFrame column ``` df.column_name = df.column_name.astype(np.int64) ``` Get rid of non-numeric values throughout a DataFrame: ``` for col in refunds.columns.values: refunds[col] = refunds[col].replace( '[^0-9]+.-', '', regex=True) ``` Set DataFrame column values based on other column values (h/t: [@mlevkov](https://github.com/mlevkov)) ``` df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value ``` Clean up missing values in multiple DataFrame columns ``` df = df.fillna({ 'col1': 'missing', 'col2': '99.999', 'col3': '999', 'col4': 'missing', 'col5': 'missing', 'col6': '99' }) ``` Concatenate two DataFrame columns into a new, single column (useful when dealing with composite keys, for example) (h/t [@makmanalp](https://github.com/makmanalp) for improving this one!) ``` df['newcol'] = df['col1'].astype(str) + df['col2'].astype(str) ``` Doing calculations with DataFrame columns that have missing values In example below, swap in 0 for df['col1'] cells that contain null. ``` df['new_col'] = np.where( pd.isnull(df['col1']),0,df['col1']) + df['col2'] ``` Split delimited values in a DataFrame column into two new columns ``` df['new_col1'], df['new_col2'] = zip( *df['original_col'].apply( lambda x: x.split(': ', 1))) ``` Collapse hierarchical column indexes ``` df.columns = df.columns.get_level_values(0) ``` Convert Django queryset to DataFrame ``` qs = DjangoModelName.objects.all() q = qs.values() df = pd.DataFrame.from_records(q) ``` Create a DataFrame from a Python dictionary ``` df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2']) ``` Get a report of all duplicate records in a DataFrame, based on specific columns ``` dupes = df[df.duplicated( ['col1', 'col2', 'col3'], keep=False)] ``` Set up formatting so larger numbers aren't displayed in scientific notation (h/t [@thecapacity](https://github.com/thecapacity)) ``` pd.set_option('display.float_format', lambda x: '%.3f' % x) ``` To display with commas and no decimals ``` pd.options.display.float_format = '{:,.0f}'.format ``` This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,153 +0,0 @@ -
bsweger revised this gist
Nov 7, 2018 . 1 changed file with 3 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -2,6 +2,9 @@ # h/t @makmanalp for the updated syntax! df['Column Name'].unique() # For each unique value in a DataFrame column, get a frequency count df['Column Name'].value_counts() # Convert Series datatype to numeric (will error if column has non-numeric values) # h/t @makmanalp pd.to_numeric(df['Column Name']) -
bsweger revised this gist
May 22, 2018 . 1 changed file with 5 additions and 2 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -10,11 +10,14 @@ # h/t @makmanalp for the updated syntax! pd.to_numeric(df['Column Name'], errors='coerce') # Grab DataFrame rows where column = a specific value df = df.loc[df.column == 'somevalue'] # Grab DataFrame rows where column value is present in a list valuelist = ['value1', 'value2', 'value3'] df = df[df.column.isin(valuelist)] # Grab DataFrame rows where column value is not present in a list valuelist = ['value1', 'value2', 'value3'] df = df[~df.column.isin(value_list)] -
bsweger revised this gist
Mar 4, 2018 . 1 changed file with 3 additions and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -142,4 +142,6 @@ dupes = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)] # Set up formatting so larger numbers aren't displayed in scientific notation (h/t @thecapacity) pd.set_option('display.float_format', lambda x: '%.3f' % x) # To display with commas and no decimals pd.options.display.float_format = '{:,.0f}'.format -
bsweger revised this gist
Mar 4, 2018 . 1 changed file with 1 addition and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -63,7 +63,7 @@ df.column_name.str.len() # Sort dataframe by multiple columns df = df.sort_values(['col1','col2','col3'],ascending=[1,1,0]) # Get top n for each group of columns in a sorted dataframe # (make sure dataframe is sorted first) -
bsweger revised this gist
Sep 19, 2017 . 1 changed file with 7 additions and 2 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -2,8 +2,13 @@ # h/t @makmanalp for the updated syntax! df['Column Name'].unique() # Convert Series datatype to numeric (will error if column has non-numeric values) # h/t @makmanalp pd.to_numeric(df['Column Name']) # Convert Series datatype to numeric, changing non-numeric values to NaN # h/t @makmanalp for the updated syntax! pd.to_numeric(df['Column Name'], errors='coerce') # Grab DataFrame rows where column has certain values valuelist = ['value1', 'value2', 'value3'] -
bsweger revised this gist
Sep 19, 2017 . 1 changed file with 2 additions and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,5 +1,6 @@ # List unique values in a DataFrame column # h/t @makmanalp for the updated syntax! df['Column Name'].unique() # Convert Series datatype to numeric, getting rid of any non-numeric values df['col'] = df['col'].astype(str).convert_objects(convert_numeric=True) -
bsweger revised this gist
Sep 19, 2017 . 1 changed file with 2 additions and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -111,7 +111,8 @@ # Concatenate two DataFrame columns into a new, single column # (useful when dealing with composite keys, for example) # (h/t @makmanalp for improving this one!) df['newcol'] = df['col1'].astype(str) + df['col2'].astype(str) # Doing calculations with DataFrame columns that have missing values # In example below, swap in 0 for df['col1'] cells that contain null -
bsweger revised this gist
May 23, 2017 . 1 changed file with 7 additions and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -36,7 +36,13 @@ # Loop through rows in a DataFrame # (if you must) for index, row in df.iterrows(): print index, row['some column'] # Much faster way to loop through DataFrame rows # if you can work with tuples # (h/t hughamacmullaniv) for row in df.itertuples(): print(row) # Next few examples show how to work with text data in Pandas. # Full list of .str functions: http://pandas.pydata.org/pandas-docs/stable/text.html -
bsweger revised this gist
May 14, 2017 . 1 changed file with 5 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -67,6 +67,11 @@ # loading to a db) df = df.where((pd.notnull(df)), None) # More pre-db insert cleanup...make a pass through the dataframe, stripping whitespace # from strings and changing any empty values to None # (not especially recommended but including here b/c I had to do this in real life one time) df = df.applymap(lambda x: str(x).strip() if len(str(x).strip()) else None) # Get quick count of rows in a DataFrame len(df.index) -
bsweger revised this gist
Nov 2, 2016 . 1 changed file with 4 additions and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -121,4 +121,7 @@ df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2']) # Get a report of all duplicate records in a dataframe, based on specific columns dupes = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)] # Set up formatting so larger numbers aren't displayed in scientific notation (h/t @thecapacity) pd.set_option('display.float_format', lambda x: '%.3f' % x) -
bsweger revised this gist
Aug 19, 2016 . 1 changed file with 1 addition and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -75,7 +75,7 @@ # Syntax works on Pandas >= .14 pd.pivot_table( df,values='cell_value', index=['col1', 'col2', 'col3'], #these stay as columns; will fail silently if any of these cols have null values columns=['col4']) #data values in this column become their own column # Change data type of DataFrame column -
bsweger revised this gist
Aug 18, 2016 . 1 changed file with 39 additions and 38 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,93 +1,94 @@ # List unique values in a DataFrame column pd.unique(df.column_name.ravel()) # Convert Series datatype to numeric, getting rid of any non-numeric values df['col'] = df['col'].astype(str).convert_objects(convert_numeric=True) # Grab DataFrame rows where column has certain values valuelist = ['value1', 'value2', 'value3'] df = df[df.column.isin(valuelist)] # Grab DataFrame rows where column doesn't have certain values valuelist = ['value1', 'value2', 'value3'] df = df[~df.column.isin(value_list)] # Delete column from DataFrame del df['column'] # Select from DataFrame using criteria from multiple columns # (use `|` instead of `&` to do an OR) newdf = df[(df['column_one']>2004) & (df['column_two']==9)] # Rename several DataFrame columns df = df.rename(columns = { 'col1 old name':'col1 new name', 'col2 old name':'col2 new name', 'col3 old name':'col3 new name', }) # Lower-case all DataFrame column names df.columns = map(str.lower, df.columns) # Even more fancy DataFrame column re-naming # lower-case all DataFrame column names (for example) df.rename(columns=lambda x: x.split('.')[-1], inplace=True) # Loop through rows in a DataFrame # (if you must) for index, row in df.iterrows(): print index, row['some column'] # Next few examples show how to work with text data in Pandas. # Full list of .str functions: http://pandas.pydata.org/pandas-docs/stable/text.html # Slice values in a DataFrame column (aka Series) df.column.str[0:2] # Lower-case everything in a DataFrame column df.column_name = df.column_name.str.lower() # Get length of data in a DataFrame column df.column_name.str.len() # Sort dataframe by multiple columns df = df.sort(['col1','col2','col3'],ascending=[1,1,0]) # Get top n for each group of columns in a sorted dataframe # (make sure dataframe is sorted first) top5 = df.groupby(['groupingcol1', 'groupingcol2']).head(5) # Grab DataFrame rows where specific column is null/notnull newdf = df[df['column'].isnull()] # Select from DataFrame using multiple keys of a hierarchical index df.xs(('index level 1 value','index level 2 value'), level=('level 1','level 2')) # Change all NaNs to None (useful before # loading to a db) df = df.where((pd.notnull(df)), None) # Get quick count of rows in a DataFrame len(df.index) # Pivot data (with flexibility about what what # becomes a column and what stays a row). # Syntax works on Pandas >= .14 pd.pivot_table( df,values='cell_value', index=['col1', 'col2', 'col3'], #these stay as columns columns=['col4']) #data values in this column become their own column # Change data type of DataFrame column df.column_name = df.column_name.astype(np.int64) # Get rid of non-numeric values throughout a DataFrame: for col in refunds.columns.values: refunds[col] = refunds[col].replace('[^0-9]+.-', '', regex=True) # Set DataFrame column values based on other column values (h/t: @mlevkov) df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value # Clean up missing values in multiple DataFrame columns df = df.fillna({ 'col1': 'missing', 'col2': '99.999', @@ -97,12 +98,12 @@ 'col6': '99' }) # Concatenate two DataFrame columns into a new, single column # (useful when dealing with composite keys, for example) df['newcol'] = df['col1'].map(str) + df['col2'].map(str) # Doing calculations with DataFrame columns that have missing values # In example below, swap in 0 for df['col1'] cells that contain null df['new_col'] = np.where(pd.isnull(df['col1']),0,df['col1']) + df['col2'] # Split delimited values in a DataFrame column into two new columns @@ -111,13 +112,13 @@ # Collapse hierarchical column indexes df.columns = df.columns.get_level_values(0) # Convert Django queryset to DataFrame qs = DjangoModelName.objects.all() q = qs.values() df = pd.DataFrame.from_records(q) # Create a DataFrame from a Python dictionary df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2']) # Get a report of all duplicate records in a dataframe, based on specific columns dupes = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)] -
bsweger revised this gist
Aug 13, 2016 . 1 changed file with 4 additions and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -117,4 +117,7 @@ df = pd.DataFrame.from_records(q) #Create a DataFrame from a Python dictionary df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2']) #Get a report of all duplicate records in a dataframe, based on specific columns dupes = df[df.duplicated(['col1', 'col2', 'col3'], keep=False)] -
bsweger revised this gist
Aug 13, 2016 . 1 changed file with 2 additions and 2 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -84,8 +84,8 @@ for col in refunds.columns.values: refunds[col] = refunds[col].replace('[^0-9]+.-', '', regex=True) #Set DataFrame column values based on other column values (h/t: @mlevkov) df.loc[(df['column1'] == some_value) & (df['column2'] == some_other_value), ['column_to_change']] = new_value #Clean up missing values in multiple DataFrame columns df = df.fillna({ -
bsweger revised this gist
Aug 13, 2016 . 1 changed file with 1 addition and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -6,7 +6,7 @@ #Grab DataFrame rows where column has certain values valuelist = ['value1', 'value2', 'value3'] df = df[df.column.isin(valuelist)] #Grab DataFrame rows where column doesn't have certain values valuelist = ['value1', 'value2', 'value3'] -
bsweger revised this gist
Oct 12, 2015 . 1 changed file with 9 additions and 4 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -37,9 +37,18 @@ for index, row in df.iterrows(): print index, row['some column'] #Next few examples show how to work with text data in Pandas. #Full list of .str functions: http://pandas.pydata.org/pandas-docs/stable/text.html #Slice values in a DataFrame column (aka Series) df.column.str[0:2] #Lower-case everything in a DataFrame column df.column_name = df.column_name.str.lower() #Get length of data in a DataFrame column df.column_name.str.len() #Sort dataframe by multiple columns df = df.sort(['col1','col2','col3'],ascending=[1,1,0]) @@ -49,7 +58,6 @@ #Grab DataFrame rows where specific column is null/notnull newdf = df[df['column'].isnull()] #select from DataFrame using multiple keys of a hierarchical index df.xs(('index level 1 value','index level 2 value'), level=('level 1','level 2')) @@ -58,9 +66,6 @@ #loading to a db) df = df.where((pd.notnull(df)), None) #Get quick count of rows in a DataFrame len(df.index) -
bsweger revised this gist
Sep 24, 2015 . 1 changed file with 1 addition and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -34,7 +34,7 @@ #Loop through rows in a DataFrame #(if you must) for index, row in df.iterrows(): print index, row['some column'] #Lower-case everything in a DataFrame column -
bsweger revised this gist
Aug 28, 2015 . 1 changed file with 4 additions and 2 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -107,7 +107,9 @@ df.columns = df.columns.get_level_values(0) #Convert Django queryset to DataFrame qs = DjangoModelName.objects.all() q = qs.values() df = pd.DataFrame.from_records(q) #Create a DataFrame from a Python dictionary df = pd.DataFrame(list(a_dictionary.items()), columns = ['column1', 'column2']) -
bsweger revised this gist
Jul 6, 2015 . 1 changed file with 4 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -92,6 +92,10 @@ 'col6': '99' }) #Concatenate two DataFrame columns into a new, single column #(useful when dealing with composite keys, for example) df['newcol'] = df['col1'].map(str) + df['col2'].map(str) #Doing calculations with DataFrame columns that have missing values #In example below, swap in 0 for df['col1'] cells that contain null df['new_col'] = np.where(pd.isnull(df['col1']),0,df['col1']) + df['col2'] -
bsweger revised this gist
Jul 3, 2015 . 1 changed file with 3 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -99,6 +99,9 @@ # Split delimited values in a DataFrame column into two new columns df['new_col1'], df['new_col2'] = zip(*df['original_col'].apply(lambda x: x.split(': ', 1))) # Collapse hierarchical column indexes df.columns = df.columns.get_level_values(0) #Convert Django queryset to DataFrame #convert queryset to dataframe qs = DjangoModelName.objects.all() -
bsweger revised this gist
Dec 31, 2014 . 1 changed file with 0 additions and 4 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -32,10 +32,6 @@ #lower-case all DataFrame column names (for example) df.rename(columns=lambda x: x.split('.')[-1], inplace=True) #Loop through rows in a DataFrame #(if you must) for index, row in df: -
bsweger revised this gist
Dec 19, 2014 . 1 changed file with 1 addition and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -34,7 +34,7 @@ #set entire column values in a way that avoids the #SettingWithCopy Warning df.loc[:, 'column_name'] = 'new value' #Loop through rows in a DataFrame #(if you must) -
bsweger revised this gist
Dec 19, 2014 . 1 changed file with 4 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -32,6 +32,10 @@ #lower-case all DataFrame column names (for example) df.rename(columns=lambda x: x.split('.')[-1], inplace=True) #set entire column values in a way that avoids the #SettingWithCopy Warning df.loc[df.index, 'column_name'] = 'new value' #Loop through rows in a DataFrame #(if you must) for index, row in df: -
bsweger revised this gist
Dec 5, 2014 . 1 changed file with 3 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -12,6 +12,9 @@ valuelist = ['value1', 'value2', 'value3'] df = df[~df.column.isin(value_list)] #Delete column from DataFrame del df['column'] #Select from DataFrame using criteria from multiple columns newdf = df[(df['column_one']>2004) & (df['column_two']==9)] -
bsweger revised this gist
Dec 4, 2014 . 1 changed file with 3 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -55,6 +55,9 @@ #loading to a db) df = df.where((pd.notnull(df)), None) #Slice values in a DataFrame column (aka Series) df.column.str[0:2] #Get quick count of rows in a DataFrame len(df.index) -
bsweger revised this gist
Aug 20, 2014 . 1 changed file with 7 additions and 3 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -22,14 +22,18 @@ 'col3 old name':'col3 new name', }) #lower-case all DataFrame column names df.columns = map(str.lower, df.columns) #even more fancy DataFrame column re-naming #lower-case all DataFrame column names (for example) df.rename(columns=lambda x: x.split('.')[-1], inplace=True) #Loop through rows in a DataFrame #(if you must) for index, row in df: print index, row['some column'] #Lower-case everything in a DataFrame column df.column_name = df.column_name.str.lower() -
bsweger revised this gist
Aug 19, 2014 . 1 changed file with 3 additions and 0 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -69,6 +69,9 @@ for col in refunds.columns.values: refunds[col] = refunds[col].replace('[^0-9]+.-', '', regex=True) #Set DataFrame column values based on other column values df['column_to_change'][(df['column1'] == some_value) & (df['column2'] == some_other_value)] = new_value #Clean up missing values in multiple DataFrame columns df = df.fillna({ 'col1': 'missing',
NewerOlder