Last active October 7, 2024 09:28
Generate a large CSV file filled with random values. This script generates a file of around 250MB. You can adjust the two parameters `rows` and `columns` to produce a file of the desired size.
import csv
import random

# 1000000 and 52 == roughly 1GB (WARNING: takes a while, 30s+)
rows = 1000000
columns = 52


def generate_random_row(index, col):
    # One CSV row: the row index followed by `col` random floats in [0, 1).
    row = [index]
    for _ in range(col):
        row.append(random.random())
    return row


if __name__ == '__main__':
    # Write one row at a time so memory use stays flat regardless of file size.
    with open('sample.csv', 'w', newline='') as f:
        w = csv.writer(f, lineterminator='\n')
        for i in range(rows):
            w.writerow(generate_random_row(i, columns))
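To pick values for the two parameters without waiting for a full run, one option is to write a small sample to memory and extrapolate. The sketch below assumes the same row shape as the script (an index followed by random floats); `estimate_csv_bytes` and `rows_for_target` are hypothetical helper names, not part of the gist, and the result is only a rough estimate because the index column gets wider as the row number grows.

```python
import csv
import io
import random


def estimate_csv_bytes(rows, columns, sample_rows=1000):
    """Estimate the output size by writing a small sample to memory."""
    sample_rows = min(rows, sample_rows)
    if sample_rows == 0:
        return 0
    buf = io.StringIO()
    w = csv.writer(buf, lineterminator='\n')
    for i in range(sample_rows):
        # Same row shape as the script: index, then `columns` random floats.
        w.writerow([i] + [random.random() for _ in range(columns)])
    # The output is pure ASCII, so characters and bytes are the same count.
    bytes_per_row = len(buf.getvalue()) / sample_rows
    return int(bytes_per_row * rows)


def rows_for_target(target_bytes, columns):
    """Rough number of rows needed to reach `target_bytes`."""
    per_row = estimate_csv_bytes(1000, columns) / 1000
    return int(target_bytes / per_row)


print(estimate_csv_bytes(1_000_000, 52))   # estimated bytes for the defaults
print(rows_for_target(250 * 1024**2, 52))  # rows for roughly 250MB
```

Sampling is used here instead of a fixed bytes-per-value constant because the printed length of a float varies from value to value.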
Also, this script isn't really written to scale, as it needs RAM on the order of the size of the file you wish to write. The second you go above what you have, you'll OOM and crash. ;) I just tested, and it takes about 2.5GB of RAM to write a 1GB file. :)
Fixed it for you. Takes almost zero memory, can go to infinitely sized files, and it actually runs a little faster...
root@ip-172-100-182-190:~/1tb-project# time python3 generate-random-1gb-csv-old.py
..........
real 0m39.878s
user 0m38.671s
sys 0m1.200s
root@ip-172-100-182-190:~/1tb-project# time python3 generate-random-1gb-csv.py
..........
real 0m35.726s
user 0m35.026s
sys 0m0.572s
import csv
import random

# 1000000 and 52 == roughly 1GB (WARNING: takes a while, 30s+)
rows = 1000000
columns = 52


def generate_random_row(index, col):
    # One CSV row: the row index followed by `col` random floats in [0, 1).
    row = [index]
    for _ in range(col):
        row.append(random.random())
    return row


if __name__ == '__main__':
    # Write one row at a time so memory use stays flat regardless of file size.
    with open('sample.csv', 'w', newline='') as f:
        w = csv.writer(f, lineterminator='\n')
        for i in range(rows):
            w.writerow(generate_random_row(i, columns))
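For anyone who wants to see the memory difference on a small scale, here is a rough sketch using `tracemalloc`. The earlier in-memory version of the script is not shown in this thread, so `write_all_at_once` is a hypothetical stand-in for it, and `peak_bytes` is a made-up helper name. `tracemalloc` only counts allocations made through Python's allocator, so the numbers will not match process RSS or the 2.5GB figure above.

```python
import csv
import os
import random
import tempfile
import tracemalloc


def write_all_at_once(path, rows, columns):
    # Hypothetical "before" shape: build every row first, then write them all.
    data = [[i] + [random.random() for _ in range(columns)] for i in range(rows)]
    with open(path, 'w', newline='') as f:
        csv.writer(f, lineterminator='\n').writerows(data)


def write_streaming(path, rows, columns):
    # "After" shape: build and write one row at a time; nothing accumulates.
    with open(path, 'w', newline='') as f:
        w = csv.writer(f, lineterminator='\n')
        for i in range(rows):
            w.writerow([i] + [random.random() for _ in range(columns)])


def peak_bytes(writer, rows, columns):
    """Return the peak traced allocation (in bytes) while `writer` runs."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    tracemalloc.start()
    try:
        writer(path, rows, columns)
        _, peak = tracemalloc.get_traced_memory()
        return peak
    finally:
        tracemalloc.stop()
        os.remove(path)


all_peak = peak_bytes(write_all_at_once, 2000, 52)
stream_peak = peak_bytes(write_streaming, 2000, 52)
print(f"all at once: {all_peak} bytes, streaming: {stream_peak} bytes")
```

The streaming peak should stay roughly constant as `rows` grows, while the all-at-once peak grows with it.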
@AndrewFarley Thanks for your comments.
This code was the first Python code I ever wrote, and I fondly remember writing it back then because I needed a large CSV file for testing.
I completely agree with the improvements you pointed out and have fixed it.
Thank you :)
Wanted to let you know, this worked perfectly for what I needed. Thank you so much!
@momota Fun little script, only one issue is you have your col and row logic backwards in the `generate_random_array` function. The first `range` needs to be row, then the second `range` needs to be col. :)
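The loop order described here can be sketched as follows. The original `generate_random_array` is not shown on this page, so this is a guess at its shape based on the variable names that survive in the current script, not the author's actual code.

```python
import random


def generate_random_array(row, col):
    # Outer loop walks the rows; inner loop fills the columns of each row.
    a = []
    for i in range(row):
        l = [i]  # first cell is the row index
        for j in range(col):
            l.append(random.random())
        a.append(l)
    return a


table = generate_random_array(3, 5)
print(len(table), len(table[0]))  # prints "3 6": 3 rows, each 1 index + 5 values
```

With the two ranges swapped, the same call would produce 5 rows of 4 cells, which is why the file dimensions come out transposed.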