pd.read_<file> and I/O APIs

Modin parallelizes read_csv, read_parquet, and several other readers (see the table). The remaining I/O methods default to the pandas implementation: the data is read serially into a single, non-distributed DataFrame and then distributed, which hurts performance. Many of these remaining methods can be parallelized relatively easily.

The following table has three columns. The first column contains the I/O method name. The second column is a flag for whether Modin implements that method: Y stands for yes, N stands for no, P stands for partial (meaning some parameters may not be supported yet), and D stands for default to pandas. The third column contains notes on the current implementation.
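The flag legend can be captured in a small lookup, which may be handy when deciding whether a given reader will run in parallel. The sketch below is not a Modin API: `FLAG_MEANING`, `IO_SUPPORT`, `describe`, and `defaults_to_pandas` are hypothetical names, and the dictionary is a hand-copied snapshot of a few rows from the table, so it can go stale as Modin evolves.

```python
# Flag legend, as described in the paragraph above.
FLAG_MEANING = {
    "Y": "implemented in Modin",
    "N": "not implemented",
    "P": "partially implemented (some parameters unsupported)",
    "D": "defaults to pandas",
}

# Hand-copied snapshot of a few table rows (illustrative only, not a Modin API).
IO_SUPPORT = {
    "read_csv": "Y",
    "read_parquet": "P",
    "read_json": "P",
    "read_excel": "D",
    "read_feather": "Y",
    "read_pickle": "D",
}


def describe(method):
    """Return a human-readable support status for an I/O method name."""
    flag = IO_SUPPORT.get(method)
    if flag is None:
        return f"{method}: not in this snapshot"
    return f"{method}: {FLAG_MEANING[flag]}"


def defaults_to_pandas(method):
    """True if the snapshot marks the method as defaulting to pandas."""
    return IO_SUPPORT.get(method) == "D"
```

For example, `defaults_to_pandas("read_excel")` is true in this snapshot, which signals that the file will be read serially before being distributed.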

Note

Fully asynchronous reading is supported for the following functions: read_csv, read_fwf, read_table, and read_custom_text. This mode is disabled by default; to enable it, set the environment variable MODIN_ASYNC_READ_MODE=True. Some parameter combinations are not supported, in which case the function is executed in synchronous mode.
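A minimal sketch of setting this variable from Python instead of the shell. It assumes Modin is installed and takes the conservative route of setting the variable before Modin is imported; `read_csv_async` is a hypothetical helper name used only for illustration.

```python
import os

# Set the flag before Modin is imported so the setting is in place
# by the time Modin reads its configuration (conservative assumption).
os.environ["MODIN_ASYNC_READ_MODE"] = "True"


def read_csv_async(path):
    """Read a CSV file with Modin while the async read flag is set.

    The import is deferred so that the environment variable above is
    already defined when Modin is first loaded. Requires Modin.
    """
    import modin.pandas as pd

    return pd.read_csv(path)
```

Because some parameter combinations are not supported in this mode, a call made through such a helper may still run synchronously.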

| IO method | Modin Implementation? (Y/N/P/D) | Notes for Current implementation |
|---|---|---|
| read_csv | Y | |
| read_fwf | Y | |
| read_table | Y | |
| read_parquet | P | Parameters besides filters and storage_options passed via **kwargs are not supported. use_nullable_dtypes == True is not supported. Experimental implementation: read_parquet_glob |
| read_json | P | Implemented for lines=True. Experimental implementation: read_json_glob |
| read_xml | D | Experimental implementation: read_xml_glob |
| read_html | D | |
| read_clipboard | D | |
| read_excel | D | |
| read_hdf | D | |
| read_feather | Y | |
| read_stata | D | |
| read_sas | D | |
| read_pickle | D | Experimental implementation: read_pickle_glob |
| read_sql | Y | |