Recipe for (fast) bulk insert from python Pandas DataFrame to Postgres database
#!/usr/bin/env python

import os
from io import StringIO

import pandas as pd
import psycopg2

# Get a database connection
dsn = os.environ.get('DB_DSN')  # Use ENV vars: keep it secret, keep it safe
conn = psycopg2.connect(dsn)

# Do something to create your dataframe here...
df = pd.read_csv("file.csv")

# Initialize a string buffer
sio = StringIO()
sio.write(df.to_csv(index=False, header=False))  # Write the DataFrame as a CSV to the buffer
sio.seek(0)  # Be sure to reset the position to the start of the stream

# Copy the string buffer to the database, as if it were an actual file
with conn.cursor() as c:
    c.copy_from(sio, "schema.table", columns=df.columns, sep=',')
conn.commit()
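
Note: as of psycopg2 2.9, copy_from quotes the table name it is given, so a schema-qualified string like "schema.table" is rejected. A sketch of an alternative, reusing the same sio buffer and placeholder table name as above, is copy_expert, which takes the COPY statement verbatim:

with conn.cursor() as c:
    # copy_expert lets us write the COPY statement ourselves, including
    # the schema-qualified table name and the CSV format option
    c.copy_expert("COPY schema.table FROM STDIN WITH (FORMAT csv)", sio)
conn.commit()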
@rehoter-cyber could you try using the pandas.DataFrame.to_sql method with the parameter method=psql_insert_copy, where psql_insert_copy is the callable function defined in the pandas "Insertion method" documentation?
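
For reference, here is a sketch of that callable, adapted from the insertion-method example in the pandas documentation (it assumes to_sql is called with a SQLAlchemy engine or connection, so the underlying DBAPI connection is psycopg2):

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """COPY-based insert method for DataFrame.to_sql.

    table: pandas.io.sql.SQLTable
    conn: sqlalchemy.engine.Connection
    keys: list of column names
    data_iter: iterable of row tuples
    """
    # Get the underlying DBAPI (psycopg2) connection from SQLAlchemy
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)

Then something like df.to_sql("table", engine, schema="schema", if_exists="append", index=False, method=psql_insert_copy) should route the insert through COPY instead of row-by-row INSERTs.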
I can't get it to work :( I think it's because everything in the dataframe is a string, but the table has a mixture of data types.
This is the error I'm getting at the moment:
c.copy_from(sio, "public.sku", columns=df_ora.columns, sep=',')
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type numeric: ""
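
A likely cause: pandas writes NaN/missing values as empty fields, but copy_from's default NULL marker is \N, so Postgres tries to parse the empty field as a numeric and fails. One possible fix (a sketch reusing the df_ora dataframe and public.sku table from the traceback above) is to tell copy_from that empty fields mean NULL:

sio = StringIO()
df_ora.to_csv(sio, index=False, header=False)  # NaN becomes an empty field
sio.seek(0)
with conn.cursor() as c:
    # null='' makes Postgres read empty fields as NULL instead of
    # trying to parse them as numeric values
    c.copy_from(sio, "public.sku", columns=df_ora.columns, sep=',', null='')
conn.commit()

One caveat: copy_from speaks Postgres's text format rather than true CSV, so fields containing commas or quotes may still need copy_expert with FORMAT csv.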
LOL, thanks for your reply. First, applying copy_expert worked out quite well: it shortened the insert from 10 hours to 10 minutes, with no rejected rows. The reason for choosing Python/psycopg2 is that the original data is a bit big (700+ MB, split across 6 files) in TSV and JSON formats (I was working on the Kaggle IMDB dataset), and I also wanted to measure execution time and memory use while inserting, so Python/psycopg2 was a good fit for the task.
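
Since the source files are already TSV, one possible shortcut (a sketch; the file and table names below, title.basics.tsv and imdb.title_basics, are just illustrative stand-ins for the IMDB dataset) is to stream each file straight into copy_expert without building a DataFrame in memory first:

with open("title.basics.tsv", encoding="utf-8") as f, conn.cursor() as c:
    next(f)  # skip the header row
    # COPY's default text format is tab-delimited with \N for NULL,
    # which matches how the IMDB TSV dumps are written
    c.copy_expert("COPY imdb.title_basics FROM STDIN", f)
conn.commit()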