pd.read_<file> and I/O APIs

We have parallelized several IO methods, including read_csv and read_parquet, and many of the remaining methods could be parallelized relatively easily. The methods that are not yet implemented default to pandas: the data is read serially into a single, non-distributed DataFrame, which is then distributed. That serial read limits performance for those methods.
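To make the fallback path concrete, here is a rough illustration in plain Python of "read serially, then distribute". It is not Modin's code: the function name `read_serially_then_distribute` and the list-of-rows partitions are assumptions made for this sketch, which only shows why the read step itself gains nothing from parallelism.

```python
import csv
import io


def read_serially_then_distribute(text, num_partitions):
    """Hypothetical helper: parse every row in one serial pass, then split."""
    # Step 1: a single, non-distributed read of the whole input.
    rows = list(csv.DictReader(io.StringIO(text)))
    # Step 2: split the already-loaded rows into roughly equal partitions.
    # Ceiling division ensures every row lands in exactly one partition.
    size = max(1, -(-len(rows) // num_partitions))
    return [rows[i:i + size] for i in range(0, len(rows), size)]


csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"
partitions = read_serially_then_distribute(csv_text, num_partitions=2)
print([len(p) for p in partitions])
```

Only the second step spreads work out. The first step still touches every row in a single pass, which is where the performance cost comes from.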

The table below has three columns. The first column contains the method name. The second column is a flag for whether Modin implements that method: Y stands for yes, N stands for no, P stands for partial (meaning some parameters may not be supported yet), and D stands for default to pandas. The third column holds notes on the current implementation.

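============== =============================== ================================
IO method      Modin Implementation? (Y/N/P/D) Notes for Current implementation
============== =============================== ================================
read_csv       Y
read_table     Y
read_parquet   Y
read_json      P                               Implemented for lines=True
read_html      D
read_clipboard D
read_excel     D
read_hdf       Y
read_feather   Y
read_msgpack   D
read_stata     D
read_sas       D
read_pickle    D
read_sql       Y
============== =============================== ================================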
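
The table above can also be kept as a small lookup, for example to warn before calling a reader that defaults to pandas. This is a minimal sketch: the `IO_SUPPORT` dictionary and the `readers_with_status` helper are hypothetical names for this example, not part of Modin, and the flags are copied from the table as listed here.

```python
# Flags copied from the table: Y = implemented, P = partial, D = defaults to pandas.
IO_SUPPORT = {
    "read_csv": "Y",
    "read_table": "Y",
    "read_parquet": "Y",
    "read_json": "P",  # implemented for lines=True
    "read_html": "D",
    "read_clipboard": "D",
    "read_excel": "D",
    "read_hdf": "Y",
    "read_feather": "Y",
    "read_msgpack": "D",
    "read_stata": "D",
    "read_sas": "D",
    "read_pickle": "D",
    "read_sql": "Y",
}


def readers_with_status(status):
    """Hypothetical helper: return the method names carrying the given flag."""
    return sorted(name for name, flag in IO_SUPPORT.items() if flag == status)


print(readers_with_status("D"))
```

Calling `readers_with_status("D")` lists the methods that take the serial pandas path, and `readers_with_status("P")` singles out read_json.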