Warning: Includes some dodgy code... can definitely be improved! I was asked for help restructuring a huge Excel table and this was the result.
We had an Excel table of the format shown in the Input Table. It could have any number of experiments, subjects and trials (unique values under ExperimentsName
, Subject
and Trial
columns) in addition to any number of score IDs (ScoreX
columns). In the example below there are 3 experiments, 3 subjects, 4 trials and N score columns. This needed to be transformed into a table similar to the Output Table, where each row represents one subject.
Input Table:
ExperimentName | Subject | Trial | Score1 | Score2 | Score3 | Score4 | ... | ScoreN |
---|---|---|---|---|---|---|---|---|
FirstExperiment | 1 | 1 | 1 | 5 | 6 | 3 | ... | ... |
FirstExperiment | 1 | 2 | 7 | 6 | 7 | 4 | ... | ... |
FirstExperiment | 1 | 3 | 6 | 3 | 5 | 7 | ... | ... |
FirstExperiment | 1 | 4 | 3 | 7 | 4 | 3 | ... | ... |
SecondExperiment | 1 | 1 | 6 | 3 | 7 | 4 | ... | ... |
... (2 rows) | ... | ... | ... | ... | ... | ... | ... | ... |
SecondExperiment | 1 | 4 | 5 | 5 | 7 | 2 | ... | ... |
ThirdExperiment | 1 | 1 | 3 | 6 | 3 | 1 | ... | ... |
... (2 rows) | ... | ... | ... | ... | ... | ... | ... | ... |
ThirdExperiment | 1 | 4 | 5 | 2 | 1 | 4 | ... | ... |
FirstExperiment | 2 | 1 | 7 | 6 | 6 | 5 | ... | ... |
... (10 rows) | ... | ... | ... | ... | ... | ... | ... | ... |
ThirdExperiment | 2 | 4 | 7 | 1 | 2 | 5 | ... | ... |
FirstExperiment | 3 | 1 | 2 | 2 | 5 | 1 | ... | ... |
... (10 rows) | ... | ... | ... | ... | ... | ... | ... | ... |
ThirdExperiment | 3 | 4 | 2 | 2 | 6 | 1 | ... | ... |
Output Table (right scroll to see rest of table):
Subject | FirstExperiment_Trial1_Score1 | FirstExperiment_Trial1_Score2 | FirstExperiment_Trial1_Score3 | ... (several columns) | FirstExperiment_Trial1_ScoreN | FirstExperiment_Trial2_Score1 | ... (many columns) | FirstExperiment_Trial4_ScoreN | SecondExperiment_Trial1_Score1 | ... (even more columns) | ThirdExperiment_Trial4_ScoreN |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 5 | 6 | ... | ... | 7 | ... | ... | 6 | ... | ... |
2 | 7 | 6 | 6 | ... | ... | ... | ... | ... | ... | ... | ... |
3 | 2 | 2 | 5 | ... | ... | ... | ... | ... | ... | ... | ... |
# 0) PREREQUISITES
# We'll use Pandas to manipulate Excel files.
import pandas as pd
# Read Excel data from our input file called "input.xlsx" (using first sheet)
# and put to a Pandas dataframe.
input_df = pd.read_excel("input.xlsx", sheetname=0)
# 1) STRUCTURE OUR OUTPUT BY CREATING ALL COLUMNS
# E.g. "Subject", "FirstExperiment_Trial1_Score1", "FirstExperiment_Trial1_Score2", etc.
# Create empty pandas dataframe
# (which will later be filled and exported to a new Excel file).
output_df = pd.DataFrame()
# Create empty Subject column as the 1st column.
output_df["Subject"] = []
# Get unique experiment names, trial numbers and score columns (score IDs).
uniq_exp_names = input_df.ExperimentName.unique()
uniq_trials = input_df.Trial.unique()
columns = list(input_df.columns)
uniq_score_cols = [x for x in columns if x.startswith('Score')]
# Create rest of columns in output dataframe by concatenating
# ExperimentName, Trial number and Score ID (i.e. score column).
for exp in uniq_exp_names:
for trial in uniq_trials:
for score_col in uniq_score_cols:
new_column_name = str(exp) + "_" + "Trial" + str(trial) + "_" + str(score_col)
output_df[new_column_name] = []
# 2) FILL IN DATA TO OUTPUT DATAFRAME
# Get unique subject values.
uniq_subjects = input_df.Subject.unique()
for subject in uniq_subjects:
# Create new pandas dataframe from input table
# containing only data for current subject.
subj_df = input_df.loc[input_df.Subject == subject]
# Add new subject to Subject column in output dataframe.
output_df.set_value(subject, "Subject", subject)
for exp in uniq_exp_names:
# Create new pandas dataframe from the one above
# containing only data for current subject and current experiment.
subj_exp_df = subj_df.loc[subj_df.ExperimentName == exp]
for trial in uniq_trials:
# Create new pandas dataframe from the one above
# containing only data for current subject,
# current experiment and current trial (i.e. one row in input file).
subj_exp_trial_df = subj_exp_df[subj_exp_df.Trial == trial]
for score_col in uniq_score_cols:
# For each Score column (score ID), get value and add to correct cell in output dataframe.
new_column_name = str(exp) + "_" + "Trial" + str(trial) + "_" + str(score_col)
output_df.loc[output_df.Subject == subject, new_column_name] = subj_exp_trial_df.iloc[0][score_col]
# 3) WRITE OUTPUT DATAFRAME TO EXCEL FILE
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
# Move our dataframe to excel file
# (index=False indicates we don't want row numbers as first column).
output_df.to_excel(writer, index=False)
And that's it! The desired table can now be found in output.xlsx
.