Skip to content

Instantly share code, notes, and snippets.

@Jorundur
Last active April 4, 2023 20:57
Show Gist options
  • Save Jorundur/d374269c0ab2d4e2aee1d141540d034c to your computer and use it in GitHub Desktop.
Save Jorundur/d374269c0ab2d4e2aee1d141540d034c to your computer and use it in GitHub Desktop.
Restructuring an Excel table with Python 3

Restructuring an Excel table with Python 3

Warning: Includes some dodgy code... can definitely be improved! I was asked for help restructuring a huge Excel table and this was the result.

Problem:

We had an Excel table of the format shown in the Input Table. It could have any number of experiments, subjects and trials (unique values under ExperimentsName, Subject and Trial columns) in addition to any number of score IDs (ScoreX columns). In the example below there are 3 experiments, 3 subjects, 4 trials and N score columns. This needed to be transformed into a table similar to the Output Table, where each row represents one subject.

Input Table:

ExperimentName Subject Trial Score1 Score2 Score3 Score4 ... ScoreN
FirstExperiment 1 1 1 5 6 3 ... ...
FirstExperiment 1 2 7 6 7 4 ... ...
FirstExperiment 1 3 6 3 5 7 ... ...
FirstExperiment 1 4 3 7 4 3 ... ...
SecondExperiment 1 1 6 3 7 4 ... ...
... (2 rows) ... ... ... ... ... ... ... ...
SecondExperiment 1 4 5 5 7 2 ... ...
ThirdExperiment 1 1 3 6 3 1 ... ...
... (2 rows) ... ... ... ... ... ... ... ...
ThirdExperiment 1 4 5 2 1 4 ... ...
FirstExperiment 2 1 7 6 6 5 ... ...
... (10 rows) ... ... ... ... ... ... ... ...
ThirdExperiment 2 4 7 1 2 5 ... ...
FirstExperiment 3 1 2 2 5 1 ... ...
... (10 rows) ... ... ... ... ... ... ... ...
ThirdExperiment 3 4 2 2 6 1 ... ...

Output Table (right scroll to see rest of table):

Subject FirstExperiment_Trial1_Score1 FirstExperiment_Trial1_Score2 FirstExperiment_Trial1_Score3 ... (several columns) FirstExperiment_Trial1_ScoreN FirstExperiment_Trial2_Score1 ... (many columns) FirstExperiment_Trial4_ScoreN SecondExperiment_Trial1_Score1 ... (even more columns) ThirdExperiment_Trial4_ScoreN
1 1 5 6 ... ... 7 ... ... 6 ... ...
2 7 6 6 ... ... ... ... ... ... ... ...
3 2 2 5 ... ... ... ... ... ... ... ...

Solution:

# 0) PREREQUISITES

# We'll use Pandas to manipulate Excel files.
import pandas as pd 

# Read Excel data from our input file called "input.xlsx" (using first sheet)
# and put to a Pandas dataframe.
input_df = pd.read_excel("input.xlsx", sheetname=0)

# 1) STRUCTURE OUR OUTPUT BY CREATING ALL COLUMNS
# E.g. "Subject", "FirstExperiment_Trial1_Score1", "FirstExperiment_Trial1_Score2", etc.

# Create empty pandas dataframe
# (which will later be filled and exported to a new Excel file).
output_df = pd.DataFrame()
# Create empty Subject column as the 1st column.
output_df["Subject"] = []

# Get unique experiment names, trial numbers and score columns (score IDs).
uniq_exp_names = input_df.ExperimentName.unique()
uniq_trials = input_df.Trial.unique()
columns = list(input_df.columns)
uniq_score_cols = [x for x in columns if x.startswith('Score')]

# Create rest of columns in output dataframe by concatenating
# ExperimentName, Trial number and Score ID (i.e. score column).
for exp in uniq_exp_names:
  for trial in uniq_trials:
    for score_col in uniq_score_cols:
      new_column_name = str(exp) + "_" + "Trial" + str(trial) + "_" + str(score_col)
      output_df[new_column_name] = []

# 2) FILL IN DATA TO OUTPUT DATAFRAME

# Get unique subject values.
uniq_subjects = input_df.Subject.unique()

for subject in uniq_subjects:
  # Create new pandas dataframe from input table
  # containing only data for current subject.
  subj_df = input_df.loc[input_df.Subject == subject]
  # Add new subject to Subject column in output dataframe.
  output_df.set_value(subject, "Subject", subject)
  for exp in uniq_exp_names:
    # Create new pandas dataframe from the one above
    # containing only data for current subject and current experiment.
    subj_exp_df = subj_df.loc[subj_df.ExperimentName == exp]
    for trial in uniq_trials:
      # Create new pandas dataframe from the one above 
      # containing only data for current subject,
      # current experiment and current trial (i.e. one row in input file).
      subj_exp_trial_df = subj_exp_df[subj_exp_df.Trial == trial]
      for score_col in uniq_score_cols:
        # For each Score column (score ID), get value and add to correct cell in output dataframe. 
        new_column_name = str(exp) + "_" + "Trial" + str(trial) + "_" + str(score_col)
        output_df.loc[output_df.Subject == subject, new_column_name] = subj_exp_trial_df.iloc[0][score_col]


# 3) WRITE OUTPUT DATAFRAME TO EXCEL FILE

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
# Move our dataframe to excel file
# (index=False indicates we don't want row numbers as first column).
output_df.to_excel(writer, index=False)

And that's it! The desired table can now be found in output.xlsx.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment