How can Python 3 + pandas speed up SQL Server reads and writes?

1. pandas.DataFrame.to_sql

Official docs | pandas.DataFrame.to_sql overview

DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)

Write records stored in a DataFrame to a SQL database.

Databases supported by SQLAlchemy [1] are supported. Tables can be newly created, appended to, or overwritten.

Parameters
name : str
Name of SQL table.

con : sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that library. Legacy support is provided for sqlite3.Connection objects. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable.

schema : str, optional
Specify the schema (if database flavor supports this). If None, use default schema.

if_exists : {'fail', 'replace', 'append'}, default 'fail'
How to behave if the table already exists.

fail: Raise a ValueError.

replace: Drop the table before inserting new values.

append: Insert new values to the existing table.

index : bool, default True
Write DataFrame index as a column. Uses index_label as the column name in the table.

index_label : str or sequence, default None
Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.

chunksize : int, optional
Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.

dtype : dict or scalar, optional
Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.

method : {None, 'multi', callable}, optional
Controls the SQL insertion clause used:

None : Uses standard SQL INSERT clause (one per row).

‘multi’: Pass multiple values in a single INSERT clause.

callable with signature (pd_table, conn, keys, data_iter).

Details and a sample callable implementation can be found in the section insert method.
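To make the callable form concrete, here is a minimal sketch of a custom insert method. The helper name `insert_count` and the demo table are made up for illustration, and an in-memory SQLite engine stands in for a real server:

```python
import pandas as pd
from sqlalchemy import create_engine


def insert_count(pd_table, conn, keys, data_iter):
    # custom `method` callable: a plain executemany that also reports
    # the DBAPI rowcount back to to_sql
    rows = [dict(zip(keys, row)) for row in data_iter]
    result = conn.execute(pd_table.table.insert(), rows)
    return result.rowcount


engine = create_engine("sqlite://")  # in-memory stand-in for a real server
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df.to_sql("demo", engine, index=False, method=insert_count)
```

The same hook is how people plug in database-specific fast paths (e.g. a COPY-based loader for Postgres) without leaving the to_sql API.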

Returns
None or int
Number of rows affected by to_sql. None is returned if the callable passed into method does not return an integer number of rows.

The number of rows affected is the sum of the rowcount attribute of the sqlite3.Cursor or SQLAlchemy connectable, which may not reflect the exact number of written rows as stipulated by sqlite3 or SQLAlchemy.

Example:

df_name.to_sql('table_name',
               schema='public',
               con=engine,
               index=False,
               if_exists='replace')

Problem description
I'm writing a 500,000-row dataframe to a Postgres AWS database and it takes a very, very long time to push the data through.

It is a fairly large SQL server and my internet connection is excellent so I've ruled those out as contributing to the problem.

In comparison, csv2sql or using cat and piping into psql on the command line is much quicker.

# don't forget to import event; fast_executemany is a pyodbc cursor
# attribute, so this requires the mssql+pyodbc dialect
from sqlalchemy import event, create_engine

engine = create_engine(connection_string)

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # enable pyodbc's fast path only for executemany() batches;
    # note: no commit here -- the statement hasn't executed yet
    if executemany:
        cursor.fast_executemany = True

In my code, the to_sql call was taking 7 minutes to execute, and now it takes only 5 seconds ;)

Github | Pandas to_sql is too slow #15276
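As a side note, SQLAlchemy 1.3+ exposes the same pyodbc speed-up as an engine flag on the mssql+pyodbc dialect, so the event listener above is no longer needed there. A sketch (the connection string is a placeholder):

```python
from sqlalchemy import create_engine

# mssql+pyodbc only: pass fast_executemany at engine creation
# instead of toggling it in a before_cursor_execute hook
engine = create_engine(
    "mssql+pyodbc://user:pws@server/db?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)
```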

1. First attempt: write to the database with pandas.DataFrame.to_sql, for example:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('mssql+pymssql://user:pws@server/db')
    data = pd.read_csv('file_name.csv')
    data.to_sql('TABLE_NAME', engine, if_exists='append', chunksize=10000)


It turns out the write speed is extremely slow: a DataFrame of roughly 110,000 rows took 400+ seconds to write over a LAN connection. Some digging showed that most of the time goes into type conversion; the odo documentation discusses this write-performance problem in detail:

 [Loading CSVs into SQL Databases](https://odo.readthedocs.io/en/latest/perf.html)

> There is an unnecessary and very expensive amount of data conversion going on here. First we convert our CSV into an iterator of DataFrames, then those DataFrames are converted into Python data structures compatible with SQLAlchemy. Those Python objects then need to be serialized in a way that’s compatible with the database they are being sent to. Before you know it, more time is spent converting data and serializing Python data structures than on reading data from disk.

2. In fact, loading a large CSV into the database can be done with BULK INSERT, but I lacked the permissions to test it; see BULK INSERT (Transact-SQL) for details.

3. Workaround: since BULK INSERT is off limits without the right permissions, the next best option is multi-row inserts. SQL Server 2008+ supports inserting multiple rows in a single statement:

INSERT INTO TABLE_NAME VALUES (a0, b0, c0, ...), (a1, b1, c1, ...), ..., (an, bn, cn, ...)
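pandas can generate this multi-row INSERT form itself via method='multi'. A small sketch, using an in-memory SQLite engine as a stand-in for the real server; note that SQL Server caps one INSERT ... VALUES statement at 1000 rows and roughly 2100 bound parameters, so chunksize should be chosen with the column count in mind:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # demo stand-in for SQL Server
df = pd.DataFrame({"a": range(5), "b": list("vwxyz")})

# method='multi' packs many rows into each INSERT statement;
# keep chunksize * number_of_columns under the server's parameter limit
df.to_sql("t", engine, index=False, if_exists="replace",
          chunksize=500, method="multi")
```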


When inserting, specify dtype explicitly, typically VARCHAR with a generous length:

from sqlalchemy.types import VARCHAR, DECIMAL

def set_d_type_dict(df):
    # map pandas dtypes to generous SQL column types
    type_dict = {}
    for col, dtype in zip(df.columns, df.dtypes):
        if "object" in str(dtype):
            type_dict[col] = VARCHAR(512)
        elif "float" in str(dtype):
            type_dict[col] = DECIMAL(19, 2)
        elif "int" in str(dtype):
            type_dict[col] = DECIMAL(19)
    return type_dict

d_type = set_d_type_dict(df)
df.to_sql('table_name', engine, if_exists='append', index=False, dtype=d_type)



References:

http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql

https://blog.csdn.net/chenKFKevin/article/details/72911525

---
Related articles:
[How can Python 3 + pandas speed up SQL Server reads and writes?](https://www.zhihu.com/question/39824989/answer/120884748)
[Github | Pandas to_sql is too slow #15276](https://github.com/pandas-dev/pandas/issues/15276)
[pandas to_sql inserts too slowly](https://blog.csdn.net/baidu_30809315/article/details/93745888)

Those who keep doing often succeed; those who keep walking often arrive.