@mattbillenstein
Last active December 6, 2018 18:53
#!/usr/bin/env python3
import json
import time

start = time.time()

# Read newline-delimited JSON: one record per line into a list
L = []
i = 0
with open('in.json') as f:
    for line in f:
        L.append(json.loads(line))
        i += 1
        if i % 100000 == 0:
            print(i)
print('read', time.time() - start)

# Sort all records in memory by their 'id' field
L.sort(key=lambda x: x['id'])
print('sort', time.time() - start)

# Write records back out, one per line, with keys in a stable order
i = 0
with open('out.json', 'w') as f:
    for d in L:
        f.write(json.dumps(d, sort_keys=True) + '\n')
        i += 1
        if i % 100000 == 0:
            print(i)
print('write', time.time() - start)
nvictor commented Dec 6, 2018

Not my experience dealing with large newline-delimited JSON files. What's inside in.json? The bottleneck has always been the json module...
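
For what it's worth, one quick way to test that claim is to time the parse loop in isolation and against a faster JSON library. A minimal sketch, assuming orjson is installed (pip install orjson) and the same in.json is used; orjson is not part of the original script:

import json
import time

import orjson

# Time just the per-line parse: stdlib json vs. orjson
for name, loads in [('json', json.loads), ('orjson', orjson.loads)]:
    start = time.time()
    n = 0
    with open('in.json', 'rb') as f:  # both loaders accept bytes
        for line in f:
            loads(line)
            n += 1
    print(name, n, 'lines in', round(time.time() - start, 2), 's')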

@mattbillenstein (Author)

It's part of a db table dump -- just the largest line-delimited JSON file I had lying around. I was curious what a Python script could do re https://genius.engineering/faster-and-simpler-with-the-command-line-deep-comparing-two-5gb-json-files-3x-faster-by-ditching-the-code/
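
Once both dumps have been run through a script like the one above, records line up by id and keys serialize in a stable order, so a deep compare reduces to a plain line-by-line diff (e.g. diff out_a.json out_b.json). A minimal sketch in Python; the file names out_a.json and out_b.json are hypothetical:

import itertools

# Compare two normalized newline-delimited JSON files line by line
diffs = 0
with open('out_a.json') as fa, open('out_b.json') as fb:
    for lineno, (a, b) in enumerate(itertools.zip_longest(fa, fb), 1):
        if a != b:
            diffs += 1
            print('line', lineno, 'differs')
print('total differing lines:', diffs)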
