@kavinsood
Created July 31, 2025 15:00

JSON/JSONL Shape Analysis and Data Merging Documentation

This document describes the analysis of JSON/JSONL file structures for consistency and the merging of two datasets into a unified, annotation-ready format.

Overview

The project involved:

  1. Shape Analysis: Analyzing the structure of JSON and JSONL files to check for consistency
  2. Data Merging: Combining two different JSON datasets into a unified format
  3. Schema Enhancement: Adding annotation fields to the merged dataset

Files Analyzed

JSON Files (json/ directory)

  • dpo_batch_exhaustive (1).json (266KB, 1007 lines)
  • dpo_merged_all_clean.json (1.1MB)
  • exhaustive_samples.json (134KB, 2420 lines)
  • dpo_trueoffmychest.json (231KB, 372 lines)
  • prompts.json (19KB, 311 lines)
  • triplets_with_llm_judgments.json (1.5MB)
  • triplets_with_metadata.json (1017KB, 5502 lines)

JSONL Files (jsonl/ directory)

  • annotated_groq_dataset-kavin.jsonl (25KB, 13 lines)
  • annotated_groq_dataset.jsonl (323KB, 123 lines)

1. Shape Analysis Scripts

Top-Level Shape Analysis

JSON Shape Checker (json/check_json_shapes.fish)

#!/usr/bin/env fish
set output_file "json_shape_report.txt"
rm -f $output_file
for file in *.json
    echo "File: $file" >> $output_file
    # One line per distinct key set, prefixed with its count (via uniq -c)
    set shapes (jq -c 'if type == "array" then .[] | keys | sort else keys | sort end' $file | sort | uniq -c)
    # printf emits each list element on its own line; a quoted echo would join them with spaces
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent shape" >> $output_file
    else
        echo "Inconsistent shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end
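
For illustration, this is what the top-level key extraction produces for a small inline sample (not one of the project files):

```shell
# Two objects with the same keys collapse to a single counted shape line.
echo '[{"b":1,"a":2},{"a":3,"b":4}]' \
  | jq -c 'if type == "array" then .[] | keys | sort else keys | sort end' \
  | sort | uniq -c
# each object yields ["a","b"], so uniq -c reports one shape with count 2
```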

JSONL Shape Checker (jsonl/check_jsonl_shapes.fish)

#!/usr/bin/env fish
set output_file "jsonl_shape_report.txt"
rm -f $output_file
for file in *.jsonl
    echo "File: $file" >> $output_file
    # Each JSONL line is its own JSON document, so keys can be taken directly
    set shapes (jq -c 'keys | sort' $file | sort | uniq -c)
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent shape" >> $output_file
    else
        echo "Inconsistent shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end

Deep Recursive Shape Analysis

Updated JSON Shape Checker (Deep Analysis)

#!/usr/bin/env fish
set output_file "json_shape_report.txt"
rm -f $output_file

# Define the recursive shape function
set shape_function '
def shape:
  if type == "object" then
    reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
  elif type == "array" then
    if length == 0 then [] else [.[] | shape] | unique end
  else
    type
  end;
'

for file in *.json
    echo "File: $file" >> $output_file
    echo "Analyzing deep recursive shapes..." >> $output_file

    # Handle both a top-level array and a single top-level object
    set shapes (jq -c "$shape_function
        if type == \"array\" then
            .[] | shape
        else
            shape
        end" $file | sort | uniq -c)

    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent deep shape" >> $output_file
    else
        echo "Inconsistent deep shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end

Updated JSONL Shape Checker (Deep Analysis)

#!/usr/bin/env fish
set output_file "jsonl_shape_report.txt"
rm -f $output_file

# Define the recursive shape function
set shape_function '
def shape:
  if type == "object" then
    reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
  elif type == "array" then
    if length == 0 then [] else [.[] | shape] | unique end
  else
    type
  end;
'

for file in *.jsonl
    echo "File: $file" >> $output_file
    echo "Analyzing deep recursive shapes..." >> $output_file

    set shapes (jq -c "$shape_function shape" $file | sort | uniq -c)
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent deep shape" >> $output_file
    else
        echo "Inconsistent deep shapes found" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end

2. Data Merging Script

Final Merging and Annotation Script

jq -n --slurpfile a exhaustive_samples.json --slurpfile b dpo_batch_exhaustive.json '
  def merge_objects($a_obj; $b_obj):
    {
      prompt: $a_obj.prompt // $b_obj.prompt // "",
      chosen: $b_obj.chosen // "",
      rejected: $b_obj.rejected // "",
      meta: {
        topic: $a_obj.meta.topic // "",
        sentiment: $a_obj.meta.sentiment // "",
        bias: $a_obj.meta.bias // "",
        opinion_type: $a_obj.meta.opinion_type // ""
      },
      annotator: {
        better_response: "",
        critical_thinking_score: {
          chosen: "",
          rejected: ""
        },
        fluency_score: {
          chosen: "",
          rejected: ""
        },
        sycophancy_present: {
          chosen: "",
          rejected: ""
        },
        rationale: ""
      }
    };

  [($a[] | to_entries) as $a_entries | ($b[] | to_entries) as $b_entries | 
   range(0; ([$a_entries | length, $b_entries | length] | min)) as $i |
   merge_objects($a_entries[$i].value; $b_entries[$i].value)]
' > final_annotated.json
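
The merge leans on jq's `//` alternative operator, which falls back to the right-hand operand when the left evaluates to null or false. A minimal illustration with inline values:

```shell
# null on the left falls through to the fallback; a present value is kept
jq -n -c '{prompt: (null // "fallback"), chosen: ("kept" // "unused")}'
# prints {"prompt":"fallback","chosen":"kept"}
```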

3. Shape Analysis Results

Top-Level Analysis Results

All files showed consistent top-level shapes:

JSON Files:

  • dpo_batch_exhaustive (1).json: 201 objects with keys ["chosen","prompt","rejected"]
  • dpo_merged_all_clean.json: 704 objects with keys ["chosen","prompt","rejected"]
  • dpo_trueoffmychest.json: 74 objects with keys ["chosen","prompt","rejected"]
  • exhaustive_samples.json: 211 objects with keys ["context","meta","prompt"]
  • prompts.json: 28 objects with keys ["context","meta","prompt"]
  • triplets_with_llm_judgments.json: 500 objects with keys ["better_response","chosen","critical_thinking_score","fluency_score","meta","prompt","rationale","rejected","sycophancy_present"]
  • triplets_with_metadata.json: 500 objects with keys ["chosen","meta","prompt","rejected"]

JSONL Files:

  • annotated_groq_dataset-kavin.jsonl: 12 objects with keys ["chosen","context","meta","prompt","rejected"]
  • annotated_groq_dataset.jsonl: 122 objects with keys ["body","chosen","id","prompt","rejected","source"]

Deep Recursive Analysis Results

All files showed consistent deep shapes. Apart from the meta sub-objects present in several files, every value is a primitive JSON type (string, number, boolean, null), with no deeper nesting or arrays.

4. Data Merging Process

Source Files

  1. exhaustive_samples.json (211 objects)

    • Contains: prompt, context, meta (with topic, sentiment, bias, opinion_type)
    • Missing: chosen, rejected
  2. dpo_batch_exhaustive.json (201 objects)

    • Contains: prompt, chosen, rejected
    • Missing: meta fields

Target Schema

{
  "prompt": "string",
  "chosen": "string",
  "rejected": "string",
  "meta": {
    "topic": "string",
    "sentiment": "string",
    "bias": "string",
    "opinion_type": "string"
  },
  "annotator": {
    "better_response": "string",
    "critical_thinking_score": {
      "chosen": "string",
      "rejected": "string"
    },
    "fluency_score": {
      "chosen": "string",
      "rejected": "string"
    },
    "sycophancy_present": {
      "chosen": "string",
      "rejected": "string"
    },
    "rationale": "string"
  }
}
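
One way to confirm that merged objects carry exactly the top-level keys of this schema is a `keys` comparison in jq. The sketch below uses an inline stand-in for one record; the same filter can be run against final_annotated.json directly:

```shell
# keys returns the sorted key names; unique collapses identical key sets
jq -n -c --argjson merged '[{"prompt":"","chosen":"","rejected":"","meta":{},"annotator":{}}]' '
  [$merged[] | keys] | unique == [["annotator","chosen","meta","prompt","rejected"]]'
# prints true
```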

Merging Strategy

  • Zipped merging: Objects are matched by index position in both arrays
  • Field mapping:
    • prompt from either file (should be identical)
    • chosen, rejected from dpo_batch_exhaustive.json
    • meta fields from exhaustive_samples.json
    • annotator fields added as empty strings for future annotation
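
Since the zipped merge assumes the prompts line up by index, a quick jq check can count mismatched positions. Inline sample arrays stand in for the two source files here; with the real files, pass them via --slurpfile and index with $a[0]/$b[0]:

```shell
# Count index positions where the two arrays disagree on "prompt"
jq -n --argjson a '[{"prompt":"p1"},{"prompt":"p2"}]' \
      --argjson b '[{"prompt":"p1"},{"prompt":"x"}]' '
  [range(0; ([($a | length), ($b | length)] | min))
   | select($a[.].prompt != $b[.].prompt)] | length'
# prints 1 (only the second pair differs)
```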

Final Output

  • File: final_annotated.json
  • Length: 201 objects (matching the shorter array length)
  • Structure: Complete schema with all fields populated from source files
  • Annotation fields: Empty and ready for manual annotation

5. Usage Instructions

Running Shape Analysis

# From json/ directory
fish check_json_shapes.fish

# From jsonl/ directory  
fish check_jsonl_shapes.fish

Running Data Merge

# From json/ directory
# Use the jq command above to create final_annotated.json

Verification Commands

# Check file lengths
jq 'length' exhaustive_samples.json
jq 'length' dpo_batch_exhaustive.json
jq 'length' final_annotated.json

# Check structure of first object
jq '.[0]' final_annotated.json

# Check structure of last object
jq '.[200]' final_annotated.json
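
A further check verifies that the annotation fields are still blank. This is a sketch with inline data; run the same `all(...)` filter against final_annotated.json:

```shell
# all(gen; cond) is true when every generated value satisfies the condition
jq -n --argjson m '[{"annotator":{"rationale":""}},{"annotator":{"rationale":""}}]' '
  all($m[]; .annotator.rationale == "")'
# prints true
```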

6. Key Findings

  1. All files are internally consistent - no shape variations found within any single file
  2. Shallow structure - Aside from the meta sub-objects, the data is flat (no deeper nesting or arrays)
  3. Primitive leaf values - All leaf fields contain primitive JSON types (strings, numbers, booleans, null)
  4. Successful merge - The zipped merging approach successfully combined the two datasets
  5. Schema compliance - The final output matches the target schema exactly

7. Files Generated

  • json/json_shape_report.txt - Top-level and deep shape analysis for JSON files
  • jsonl/jsonl_shape_report.txt - Top-level and deep shape analysis for JSONL files
  • json/final_annotated.json - Merged and annotated dataset ready for use