`pd.read_<file>` and I/O APIs

A number of I/O methods default to pandas. We have parallelized `read_csv` and `read_parquet`, and many of the remaining methods could be parallelized relatively easily. When an operation defaults to the pandas implementation, Modin reads the data serially into a single, non-distributed DataFrame and then distributes it, which reduces performance.
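
As a minimal sketch of the parallelized path described above, the snippet below writes a small CSV file and reads it back with `read_csv`. The file name, the sample data, and the fallback to plain pandas when Modin is not installed are assumptions made for illustration; they are not part of Modin itself.

```python
import os
import tempfile

try:
    # With Modin installed, read_csv is one of the parallelized readers.
    import modin.pandas as pd
except ImportError:
    # Assumption for illustration only: fall back to plain pandas so the
    # sketch still runs where Modin is not available.
    import pandas as pd

# Hypothetical sample data, used only to have something to read.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    with open(path, "w") as f:
        f.write(csv_text)

    # Same call signature as pandas; a local file path is passed here.
    df = pd.read_csv(path)

    # Materialize the results while the temporary file still exists.
    n_rows, n_cols = df.shape
    col_sum = int(df["a"].sum())

print(n_rows, n_cols, col_sum)
```

Because the call matches the pandas signature, the same code should work for methods that default to pandas; per the paragraph above, only the read performance differs.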

The table below has three columns. The first contains the method name. The second is a flag indicating whether Modin implements that method: Y stands for yes, N for no, P for partial (some parameters may not be supported yet), and D for default to pandas. The third holds notes on the current implementation.

Note

Currently, the second column reflects the implementation status for the Ray and Dask engines. For the Omnisci engine, treat a method as D unless the Notes column says otherwise.

IO method	Modin Implementation? (Y/N/P/D)	Notes for current implementation
read_csv	Y	Omnisci: `P`, only basic cases and parameters are supported: `filepath_or_buffer` (local files only), `sep`, `delimiter`, `header` (partly), `names`, `usecols`, `dtype`, `true_values`, `false_values`, `skiprows` (partly), `skip_blank_lines` (partly), `parse_dates` (partly), `compression` (inferred automatically, should not be specified), `quotechar`, `escapechar`, `doublequote`, `delim_whitespace`
read_table	Y
read_parquet	Y
read_json	P	Implemented for `lines=True`
read_html	D
read_clipboard	D
read_excel	D
read_hdf	D
read_feather	Y
read_msgpack	D
read_stata	D
read_sas	D
read_pickle	D	Experimental implementation: read_pickle_distributed
read_sql	Y
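
The `read_json` row notes that the implementation covers `lines=True`, that is, line-delimited JSON with one record per line. The sketch below shows what such a call might look like. The file name, the records, and the plain-pandas fallback are assumptions made for illustration.

```python
import json
import os
import tempfile

try:
    import modin.pandas as pd
except ImportError:
    # Assumption for illustration only: fall back to plain pandas so the
    # sketch still runs where Modin is not available.
    import pandas as pd

# Hypothetical records, written one JSON object per line.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.jsonl")
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # lines=True tells the reader to parse each line as a separate record.
    df = pd.read_json(path, lines=True)

    # Materialize the results while the temporary file still exists.
    ids = [int(v) for v in df["id"]]
    columns = list(df.columns)

print(columns, ids)
```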
