@kavinsood
Created July 31, 2025 15:00

JSON/JSONL Shape Analysis and Data Merging Documentation

This document describes the analysis of JSON/JSONL file structures for consistency and the merging of two datasets into a unified, annotation-ready format.

Overview

The project involved:

  1. Shape Analysis: Analyzing the structure of JSON and JSONL files to check for consistency
  2. Data Merging: Combining two different JSON datasets into a unified format
  3. Schema Enhancement: Adding annotation fields to the merged dataset

Files Analyzed

JSON Files (json/ directory)

  • dpo_batch_exhaustive (1).json (266KB, 1007 lines)
  • dpo_merged_all_clean.json (1.1MB)
  • exhaustive_samples.json (134KB, 2420 lines)
  • dpo_trueoffmychest.json (231KB, 372 lines)
  • prompts.json (19KB, 311 lines)
  • triplets_with_llm_judgments.json (1.5MB)
  • triplets_with_metadata.json (1017KB, 5502 lines)

JSONL Files (jsonl/ directory)

  • annotated_groq_dataset-kavin.jsonl (25KB, 13 lines)
  • annotated_groq_dataset.jsonl (323KB, 123 lines)

1. Shape Analysis Scripts

Top-Level Shape Analysis

JSON Shape Checker (json/check_json_shapes.fish)

#!/usr/bin/env fish
set output_file "json_shape_report.txt"
rm -f $output_file
for file in *.json
    echo "File: $file" >> $output_file
    # One line per distinct key set, prefixed with its count (via uniq -c)
    set shapes (jq -c 'if type == "array" then .[] | keys | sort else keys | sort end' $file | sort | uniq -c)
    # printf emits each list element on its own line; a quoted echo would join them with spaces
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent shape" >> $output_file
    else
        echo "Inconsistent shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end
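
For illustration, this is what the top-level key extraction produces for a small inline sample (not one of the project files):

```shell
# Two objects with the same keys collapse to a single counted shape line.
echo '[{"b":1,"a":2},{"a":3,"b":4}]' \
  | jq -c 'if type == "array" then .[] | keys | sort else keys | sort end' \
  | sort | uniq -c
# each object yields ["a","b"], so uniq -c reports one shape with count 2
```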

JSONL Shape Checker (jsonl/check_jsonl_shapes.fish)

#!/usr/bin/env fish
set output_file "jsonl_shape_report.txt"
rm -f $output_file
for file in *.jsonl
    echo "File: $file" >> $output_file
    # Each JSONL line is its own JSON document, so keys can be taken directly
    set shapes (jq -c 'keys | sort' $file | sort | uniq -c)
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent shape" >> $output_file
    else
        echo "Inconsistent shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end

Deep Recursive Shape Analysis

Updated JSON Shape Checker (Deep Analysis)

#!/usr/bin/env fish
set output_file "json_shape_report.txt"
rm -f $output_file

# Define the recursive shape function
set shape_function '
def shape:
  if type == "object" then
    reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
  elif type == "array" then
    if length == 0 then [] else [.[] | shape] | unique end
  else
    type
  end;
'

for file in *.json
    echo "File: $file" >> $output_file
    echo "Analyzing deep recursive shapes..." >> $output_file

    # Handle both a top-level array and a single top-level object
    set shapes (jq -c "$shape_function
        if type == \"array\" then
            .[] | shape
        else
            shape
        end" $file | sort | uniq -c)

    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent deep shape" >> $output_file
    else
        echo "Inconsistent deep shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end

Updated JSONL Shape Checker (Deep Analysis)

#!/usr/bin/env fish
set output_file "jsonl_shape_report.txt"
rm -f $output_file

# Define the recursive shape function
set shape_function '
def shape:
  if type == "object" then
    reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
  elif type == "array" then
    if length == 0 then [] else [.[] | shape] | unique end
  else
    type
  end;
'

for file in *.jsonl
    echo "File: $file" >> $output_file
    echo "Analyzing deep recursive shapes..." >> $output_file

    set shapes (jq -c "$shape_function shape" $file | sort | uniq -c)
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent deep shape" >> $output_file
    else
        echo "Inconsistent deep shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end

2. Data Merging Script

Final Merging and Annotation Script

jq -n --slurpfile a exhaustive_samples.json --slurpfile b dpo_batch_exhaustive.json '
  def merge_objects($a_obj; $b_obj):
    {
      prompt: $a_obj.prompt // $b_obj.prompt // "",
      chosen: $b_obj.chosen // "",
      rejected: $b_obj.rejected // "",
      meta: {
        topic: $a_obj.meta.topic // "",
        sentiment: $a_obj.meta.sentiment // "",
        bias: $a_obj.meta.bias // "",
        opinion_type: $a_obj.meta.opinion_type // ""
      },
      annotator: {
        better_response: "",
        critical_thinking_score: {
          chosen: "",
          rejected: ""
        },
        fluency_score: {
          chosen: "",
          rejected: ""
        },
        sycophancy_present: {
          chosen: "",
          rejected: ""
        },
        rationale: ""
      }
    };

  [($a[] | to_entries) as $a_entries | ($b[] | to_entries) as $b_entries | 
   range(0; ([$a_entries | length, $b_entries | length] | min)) as $i |
   merge_objects($a_entries[$i].value; $b_entries[$i].value)]
' > final_annotated.json
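
The merge leans on jq's `//` alternative operator, which falls back to the right-hand operand when the left evaluates to null or false. A minimal illustration with inline values:

```shell
# null on the left falls through to the fallback; a present value is kept
jq -n -c '{prompt: (null // "fallback"), chosen: ("kept" // "unused")}'
# prints {"prompt":"fallback","chosen":"kept"}
```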

3. Shape Analysis Results

Top-Level Analysis Results

All files showed consistent top-level shapes:

JSON Files:

  • dpo_batch_exhaustive (1).json: 201 objects with keys ["chosen","prompt","rejected"]
  • dpo_merged_all_clean.json: 704 objects with keys ["chosen","prompt","rejected"]
  • dpo_trueoffmychest.json: 74 objects with keys ["chosen","prompt","rejected"]
  • exhaustive_samples.json: 211 objects with keys ["context","meta","prompt"]
  • prompts.json: 28 objects with keys ["context","meta","prompt"]
  • triplets_with_llm_judgments.json: 500 objects with keys ["better_response","chosen","critical_thinking_score","fluency_score","meta","prompt","rationale","rejected","sycophancy_present"]
  • triplets_with_metadata.json: 500 objects with keys ["chosen","meta","prompt","rejected"]

JSONL Files:

  • annotated_groq_dataset-kavin.jsonl: 12 objects with keys ["chosen","context","meta","prompt","rejected"]
  • annotated_groq_dataset.jsonl: 122 objects with keys ["body","chosen","id","prompt","rejected","source"]

Deep Recursive Analysis Results

All files showed consistent deep shapes. Apart from the meta sub-objects present in several files, every value is a primitive JSON type (string, number, boolean, null), with no deeper nesting or arrays.

4. Data Merging Process

Source Files

  1. exhaustive_samples.json (211 objects)

    • Contains: prompt, context, meta (with topic, sentiment, bias, opinion_type)
    • Missing: chosen, rejected
  2. dpo_batch_exhaustive.json (201 objects)

    • Contains: prompt, chosen, rejected
    • Missing: meta fields

Target Schema

{
  "prompt": "string",
  "chosen": "string",
  "rejected": "string",
  "meta": {
    "topic": "string",
    "sentiment": "string",
    "bias": "string",
    "opinion_type": "string"
  },
  "annotator": {
    "better_response": "string",
    "critical_thinking_score": {
      "chosen": "string",
      "rejected": "string"
    },
    "fluency_score": {
      "chosen": "string",
      "rejected": "string"
    },
    "sycophancy_present": {
      "chosen": "string",
      "rejected": "string"
    },
    "rationale": "string"
  }
}
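
One way to confirm that merged objects carry exactly the top-level keys of this schema is a `keys` comparison in jq. The sketch below uses an inline stand-in for one record; the same filter can be run against final_annotated.json directly:

```shell
# keys returns the sorted key names; unique collapses identical key sets
jq -n -c --argjson merged '[{"prompt":"","chosen":"","rejected":"","meta":{},"annotator":{}}]' '
  [$merged[] | keys] | unique == [["annotator","chosen","meta","prompt","rejected"]]'
# prints true
```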

Merging Strategy

  • Zipped merging: Objects are matched by index position in both arrays
  • Field mapping:
    • prompt from either file (should be identical)
    • chosen, rejected from dpo_batch_exhaustive.json
    • meta fields from exhaustive_samples.json
    • annotator fields added as empty strings for future annotation
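
Since the zipped merge assumes the prompts line up by index, a quick jq check can count mismatched positions. Inline sample arrays stand in for the two source files here; with the real files, pass them via --slurpfile and index with $a[0]/$b[0]:

```shell
# Count index positions where the two arrays disagree on "prompt"
jq -n --argjson a '[{"prompt":"p1"},{"prompt":"p2"}]' \
      --argjson b '[{"prompt":"p1"},{"prompt":"x"}]' '
  [range(0; ([($a | length), ($b | length)] | min))
   | select($a[.].prompt != $b[.].prompt)] | length'
# prints 1 (only the second pair differs)
```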

Final Output

  • File: final_annotated.json
  • Length: 201 objects (matching the shorter array length)
  • Structure: Complete schema with all fields populated from source files
  • Annotation fields: Empty and ready for manual annotation

5. Usage Instructions

Running Shape Analysis

# From json/ directory
fish check_json_shapes.fish

# From jsonl/ directory  
fish check_jsonl_shapes.fish

Running Data Merge

# From json/ directory
# Use the jq command above to create final_annotated.json

Verification Commands

# Check file lengths
jq 'length' exhaustive_samples.json
jq 'length' dpo_batch_exhaustive.json
jq 'length' final_annotated.json

# Check structure of first object
jq '.[0]' final_annotated.json

# Check structure of last object
jq '.[200]' final_annotated.json
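
A further check verifies that the annotation fields are still blank. This is a sketch with inline data; run the same `all(...)` filter against final_annotated.json:

```shell
# all(gen; cond) is true when every generated value satisfies the condition
jq -n --argjson m '[{"annotator":{"rationale":""}},{"annotator":{"rationale":""}}]' '
  all($m[]; .annotator.rationale == "")'
# prints true
```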

6. Key Findings

  1. All files are internally consistent - no shape variations found within any single file
  2. Shallow structure - Aside from the meta sub-objects, the data is flat (no deeper nesting or arrays)
  3. Primitive leaf values - All leaf fields contain primitive JSON types (strings, numbers, booleans, null)
  4. Successful merge - The zipped merging approach successfully combined the two datasets
  5. Schema compliance - The final output matches the target schema exactly

7. Files Generated

  • json/json_shape_report.txt - Top-level and deep shape analysis for JSON files
  • jsonl/jsonl_shape_report.txt - Top-level and deep shape analysis for JSONL files
  • json/final_annotated.json - Merged and annotated dataset ready for use