This document summarizes the work completed on analyzing JSON/JSONL file structures and merging datasets.
The project involved:
- Shape Analysis: Analyzing the structure of JSON and JSONL files to check for consistency
- Data Merging: Combining two different JSON datasets into a unified format
- Schema Enhancement: Adding annotation fields to the merged dataset
Input files:

JSON Files:
- dpo_batch_exhaustive (1).json (266KB, 1007 lines)
- dpo_merged_all_clean.json (1.1MB)
- exhaustive_samples.json (134KB, 2420 lines)
- dpo_trueoffmychest.json (231KB, 372 lines)
- prompts.json (19KB, 311 lines)
- triplets_with_llm_judgments.json (1.5MB)
- triplets_with_metadata.json (1017KB, 5502 lines)

JSONL Files:
- annotated_groq_dataset-kavin.jsonl (25KB, 13 lines)
- annotated_groq_dataset.jsonl (323KB, 123 lines)
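For reference, the sizes and line counts above come from standard tools; a minimal sketch to reproduce them (du reports on-disk size, so figures may differ slightly from those listed):

```fish
# Print name, size, and line count for every JSON/JSONL file in the directory
for f in *.json *.jsonl
    echo $f (du -h $f | cut -f1) (wc -l < $f) lines
end
```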
Top-level shape check for JSON files (check_json_shapes.fish):

```fish
#!/usr/bin/env fish
set output_file "json_shape_report.txt"
rm -f $output_file

for file in *.json
    echo "File: $file" >> $output_file
    # Collect the key set of every top-level object (jq's `keys` output is
    # already sorted), then count how many distinct key sets appear
    set shapes (jq -c 'if type == "array" then .[] | keys else keys end' $file | sort | uniq -c)
    # $shapes is a fish list; print one distinct shape per line
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent shape" >> $output_file
    else
        echo "Inconsistent shapes found:" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end
```
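Each report entry is raw `uniq -c` output: a count followed by one distinct key set per line. Based on the results reported below, the entry for dpo_batch_exhaustive (1).json would look roughly like this (illustrative):

```
File: dpo_batch_exhaustive (1).json
    201 ["chosen","prompt","rejected"]
Consistent shape
-----------------------------
```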
Top-level shape check for JSONL files (check_jsonl_shapes.fish):

```fish
#!/usr/bin/env fish
set output_file "jsonl_shape_report.txt"
rm -f $output_file

for file in *.jsonl
    echo "File: $file" >> $output_file
    # jq treats each JSONL line as a separate input, so no `cat` is needed
    set shapes (jq -c 'keys' $file | sort | uniq -c)
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent shape" >> $output_file
    else
        echo "Inconsistent shapes found:" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end
```
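One caveat: `keys` aborts on any JSONL line holding a scalar (and on arrays it returns index lists rather than key names). If such lines are possible, this guarded filter (an assumption, not in the original script) is a drop-in replacement for the `set shapes` line inside the loop:

```fish
# Report sorted keys for object lines and just the JSON type for anything else
set shapes (jq -c 'if type == "object" then keys else type end' $file | sort | uniq -c)
```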
Deep recursive shape check for JSON files:

```fish
#!/usr/bin/env fish
set output_file "json_shape_report.txt"
rm -f $output_file

# Recursive jq function: replaces every leaf value with its type name, so two
# records share a "deep shape" only if their entire nested structure matches
set shape_function '
def shape:
  if type == "object" then
    reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
  elif type == "array" then
    if length == 0 then [] else [.[] | shape] | unique end
  else
    type
  end;
'

for file in *.json
    echo "File: $file" >> $output_file
    echo "Analyzing deep recursive shapes..." >> $output_file
    # Handle both a top-level array of records and a single top-level object
    set shapes (jq -c "$shape_function
        if type == \"array\" then
            .[] | shape
        else
            shape
        end" $file | sort | uniq -c)
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent deep shape" >> $output_file
    else
        echo "Inconsistent deep shapes found:" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end
```
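To see what `shape` produces, here is a quick check on a hand-written record (illustrative input, not taken from the datasets):

```fish
echo '{"prompt": "hi", "meta": {"topic": "x"}, "tags": ["a", "b"]}' | jq -c '
  def shape:
    if type == "object" then reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
    elif type == "array" then (if length == 0 then [] else [.[] | shape] | unique end)
    else type end;
  shape'
# => {"prompt":"string","meta":{"topic":"string"},"tags":["string"]}
```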
Deep recursive shape check for JSONL files:

```fish
#!/usr/bin/env fish
set output_file "jsonl_shape_report.txt"
rm -f $output_file

# Same recursive shape function as in the JSON variant
set shape_function '
def shape:
  if type == "object" then
    reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
  elif type == "array" then
    if length == 0 then [] else [.[] | shape] | unique end
  else
    type
  end;
'

for file in *.jsonl
    echo "File: $file" >> $output_file
    echo "Analyzing deep recursive shapes..." >> $output_file
    # Each JSONL line is a separate jq input, so `shape` runs once per record
    set shapes (jq -c "$shape_function shape" $file | sort | uniq -c)
    printf '%s\n' $shapes >> $output_file
    if test (count $shapes) -eq 1
        echo "Consistent deep shape" >> $output_file
    else
        echo "Inconsistent deep shapes found:" >> $output_file
    end
    echo "-----------------------------" >> $output_file
end
```
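Because deep shapes record value types, two lines with identical keys but different value types count as distinct shapes. A two-line hypothetical input makes this visible:

```fish
printf '%s\n' '{"a": 1}' '{"a": "x"}' | jq -c '
  def shape:
    if type == "object" then reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
    elif type == "array" then (if length == 0 then [] else [.[] | shape] | unique end)
    else type end;
  shape' | sort | uniq -c
# => 1 {"a":"number"}
#    1 {"a":"string"}
```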
Merge command (run from the json/ directory; writes final_annotated.json):

```fish
jq -n --slurpfile a exhaustive_samples.json --slurpfile b dpo_batch_exhaustive.json '
# Build one merged record: prompt from either file, chosen/rejected from $b,
# meta from $a, and empty annotator fields for later manual annotation.
def merge_objects($a_obj; $b_obj):
  {
    prompt: $a_obj.prompt // $b_obj.prompt // "",
    chosen: $b_obj.chosen // "",
    rejected: $b_obj.rejected // "",
    meta: {
      topic: $a_obj.meta.topic // "",
      sentiment: $a_obj.meta.sentiment // "",
      bias: $a_obj.meta.bias // "",
      opinion_type: $a_obj.meta.opinion_type // ""
    },
    annotator: {
      better_response: "",
      critical_thinking_score: { chosen: "", rejected: "" },
      fluency_score: { chosen: "", rejected: "" },
      sycophancy_present: { chosen: "", rejected: "" },
      rationale: ""
    }
  };

# --slurpfile wraps each file in an array, so $a[] / $b[] yield the top-level
# arrays. to_entries turns them into {key: index, value: record} pairs, which
# are zipped by index up to the length of the shorter array.
[($a[] | to_entries) as $a_entries | ($b[] | to_entries) as $b_entries |
 range(0; ([($a_entries | length), ($b_entries | length)] | min)) as $i |
 merge_objects($a_entries[$i].value; $b_entries[$i].value)]
' > final_annotated.json
```
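The index-pairing idiom at the end is easiest to see on toy inputs; note how the output stops at the shorter array (illustrative values):

```fish
jq -cn '[[10, 20, 30], ["a", "b"]] as [$xs, $ys] |
  [range(0; ([($xs | length), ($ys | length)] | min)) as $i |
   {x: $xs[$i], y: $ys[$i]}]'
# => [{"x":10,"y":"a"},{"x":20,"y":"b"}]
```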
All files showed consistent top-level shapes.

JSON Files:
- dpo_batch_exhaustive (1).json: 201 objects with keys ["chosen","prompt","rejected"]
- dpo_merged_all_clean.json: 704 objects with keys ["chosen","prompt","rejected"]
- dpo_trueoffmychest.json: 74 objects with keys ["chosen","prompt","rejected"]
- exhaustive_samples.json: 211 objects with keys ["context","meta","prompt"]
- prompts.json: 28 objects with keys ["context","meta","prompt"]
- triplets_with_llm_judgments.json: 500 objects with keys ["better_response","chosen","critical_thinking_score","fluency_score","meta","prompt","rationale","rejected","sycophancy_present"]
- triplets_with_metadata.json: 500 objects with keys ["chosen","meta","prompt","rejected"]

JSONL Files:
- annotated_groq_dataset-kavin.jsonl: 12 objects with keys ["chosen","context","meta","prompt","rejected"]
- annotated_groq_dataset.jsonl: 122 objects with keys ["body","chosen","id","prompt","rejected","source"]

All files also showed consistent deep shapes, with all leaf values being primitive types (strings, numbers, booleans, null) and no nesting beyond shallow sub-objects such as meta.
The merge combined two files (key sets spot-checked below):

- exhaustive_samples.json (211 objects)
  - Contains: prompt, context, meta (with topic, sentiment, bias, opinion_type)
  - Missing: chosen, rejected
- dpo_batch_exhaustive.json (201 objects)
  - Contains: prompt, chosen, rejected
  - Missing: meta fields
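These key sets can be spot-checked directly; a minimal sketch (assumed verification commands, not part of the original scripts):

```fish
# Each command lists every distinct top-level key set; expect exactly one per file
jq -c '[.[] | keys] | unique' exhaustive_samples.json      # [["context","meta","prompt"]]
jq -c '[.[] | keys] | unique' dpo_batch_exhaustive.json    # [["chosen","prompt","rejected"]]
```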
Target schema for the merged dataset:

```json
{
  "prompt": "string",
  "chosen": "string",
  "rejected": "string",
  "meta": {
    "topic": "string",
    "sentiment": "string",
    "bias": "string",
    "opinion_type": "string"
  },
  "annotator": {
    "better_response": "string",
    "critical_thinking_score": {
      "chosen": "string",
      "rejected": "string"
    },
    "fluency_score": {
      "chosen": "string",
      "rejected": "string"
    },
    "sycophancy_present": {
      "chosen": "string",
      "rejected": "string"
    },
    "rationale": "string"
  }
}
```
Merge strategy:
- Zipped merging: objects are matched by index position in both arrays
- Field mapping:
  - prompt from either file (should be identical)
  - chosen, rejected from dpo_batch_exhaustive.json
  - meta fields from exhaustive_samples.json
  - annotator fields added as empty strings for future annotation

Result:
- File: final_annotated.json
- Length: 201 objects (matching the shorter array length)
- Structure: complete schema with all fields populated from the source files
- Annotation fields: empty and ready for manual annotation (see the spot-check below)
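One way to confirm the annotator fields start out empty (a hypothetical spot-check on one field; the same pattern works for the others):

```fish
# Collect every distinct value of annotator.better_response; expect [""]
jq -c '[.[] | .annotator.better_response] | unique' final_annotated.json
```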
Usage:

```fish
# From json/ directory
fish check_json_shapes.fish

# From jsonl/ directory
fish check_jsonl_shapes.fish

# From json/ directory
# Use the jq command above to create final_annotated.json
```

Verification:

```fish
# Check file lengths
jq 'length' exhaustive_samples.json
jq 'length' dpo_batch_exhaustive.json
jq 'length' final_annotated.json

# Check structure of first object
jq '.[0]' final_annotated.json

# Check structure of last object
jq '.[200]' final_annotated.json
```
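Because final_annotated.json nests meta and annotator objects, the deep-shape function above is the more meaningful consistency check for it; a minimal one-off (assuming the merged file is in the current directory):

```fish
jq -c '
  def shape:
    if type == "object" then reduce keys_unsorted[] as $k ({}; . + { ($k): (.[$k] | shape) })
    elif type == "array" then (if length == 0 then [] else [.[] | shape] | unique end)
    else type end;
  .[] | shape' final_annotated.json | sort | uniq -c
# A single output line means every merged record shares one deep shape
```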
Key findings:
- All files are internally consistent: no shape variations were found within any single file
- Flat structure: all leaf values are primitive, with no nesting beyond shallow sub-objects
- Primitive values: all fields contain primitive JSON types (strings, numbers, booleans, null)
- Successful merge: the zipped merging approach successfully combined the two datasets
- Schema compliance: the final output matches the target schema exactly
Output files:
- json/json_shape_report.txt: top-level and deep shape analysis for JSON files
- jsonl/jsonl_shape_report.txt: top-level and deep shape analysis for JSONL files
- json/final_annotated.json: merged and annotated dataset ready for use