pd.read_<file> and I/O APIs

A number of IO methods default to pandas. We have parallelized read_csv and read_parquet, and many of the remaining methods can be parallelized relatively easily. A method that defaults to the pandas implementation reads the data serially into a single, non-distributed DataFrame and then distributes it, so the read itself gains nothing from Modin's parallelism.
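To make the cost of the fallback path concrete, here is a minimal conceptual sketch of "read serially, then distribute". It is not Modin's implementation: the function name `read_serially_then_distribute`, the use of the standard-library `csv` module, and the row-wise partitioning scheme are all assumptions chosen for illustration.

```python
import csv
import io


def read_serially_then_distribute(text, num_partitions):
    """Hypothetical model of the fallback path: parse everything in one
    process, then split the parsed rows into row partitions."""
    # Step 1: a single, serial parse of the whole input (the slow part).
    rows = list(csv.DictReader(io.StringIO(text)))
    # Step 2: split the already-parsed rows. Ceiling division ensures
    # every row lands in exactly one partition.
    size = max(1, -(-len(rows) // num_partitions))
    return [rows[i:i + size] for i in range(0, len(rows), size)]


csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"
partitions = read_serially_then_distribute(csv_text, num_partitions=2)
print([len(p) for p in partitions])
```

In this model the partitioning happens only after the whole input has been parsed, so adding workers does not shorten the read. A parallelized reader would instead split the parsing work itself.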

The following table has three columns. The first column contains the method name. The second column is a flag indicating whether Modin implements that method: Y stands for yes, N stands for no, P stands for partial (meaning some parameters may not be supported yet), and D stands for default to pandas. The third column contains notes on the current implementation.

Note

Currently, the second column reflects the implementation status for the Ray and Dask engines. For the Hdk engine, treat a method as D unless the Notes column contains additional information.

| IO method | Modin Implementation? (Y/N/P/D) | Notes for Current implementation |
|---|---|---|
| read_csv | Y | Hdk: P, only basic cases and parameters supported: filepath_or_buffer can be local file only, sep, delimiter, header (partly), names, usecols, dtype, true/false_values, skiprows (partly), skip_blank_lines (partly), parse_dates (partly), compression (inferred automatically, should not be specified), quotechar, escapechar, doublequote, delim_whitespace |
| read_table | Y | |
| read_parquet | Y | |
| read_json | P | Implemented for lines=True |
| read_html | D | |
| read_clipboard | D | |
| read_excel | D | |
| read_hdf | D | |
| read_feather | Y | |
| read_msgpack | D | |
| read_stata | D | |
| read_sas | D | |
| read_pickle | D | Experimental implementation: read_pickle_distributed |
| read_sql | Y | |
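As a quick way to check which readers take the serial fallback path, the flags from the table can be transcribed into a small lookup. This is only an illustrative sketch: `IO_STATUS`, `defaults_to_pandas`, and `methods_with_status` are hypothetical names, not part of Modin's API, and the values reflect the Ray and Dask column as listed here.

```python
# Status flags transcribed from the table above (Ray and Dask engines).
# Y = implemented, P = partial, D = defaults to pandas.
IO_STATUS = {
    "read_csv": "Y",
    "read_table": "Y",
    "read_parquet": "Y",
    "read_json": "P",
    "read_html": "D",
    "read_clipboard": "D",
    "read_excel": "D",
    "read_hdf": "D",
    "read_feather": "Y",
    "read_msgpack": "D",
    "read_stata": "D",
    "read_sas": "D",
    "read_pickle": "D",
    "read_sql": "Y",
}


def defaults_to_pandas(method):
    """Return True if the method is flagged D, i.e. read serially by pandas."""
    return IO_STATUS[method] == "D"


def methods_with_status(flag):
    """List the methods carrying the given flag, in table order."""
    return [name for name, status in IO_STATUS.items() if status == flag]


print(defaults_to_pandas("read_excel"))
print(methods_with_status("P"))
```

A lookup like this can help decide, for example, whether converting an Excel or HDF input to Parquet first is worthwhile, since read_parquet is flagged Y while read_excel and read_hdf are flagged D.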