IO Module Description#

Dispatcher Classes Workflow Overview#

Calls to the read_* functions of execution-specific IO classes (for example, PandasOnRayIO for the Ray engine and pandas storage format) are forwarded to the _read function of the file-format-specific class (for example, CSVDispatcher for CSV files). There, function parameters are preprocessed to check whether they are supported (defaulting to pandas if not), and metadata common to all partitions is computed. The file is then split into chunks (the splitting mechanism is described below), and the chunk data is used to launch tasks on the remote workers. After the remote tasks finish, the results are postprocessed and a new query compiler with the imported data is returned.
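The workflow above can be pictured with a toy, pure-Python model. This is only a sketch of the control flow, not Modin code: toy_read_csv and parse_chunk are hypothetical names, a thread pool stands in for the remote workers, a plain dict stands in for the query compiler, and the data is split by lines rather than by bytes.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines):
    # Runs on a "worker": turn raw lines into rows of string fields.
    return [line.split(",") for line in lines]


def toy_read_csv(text, num_partitions=2, **unsupported):
    # 1. Preprocess parameters: anything this toy does not understand
    #    triggers a single-partition fallback (Modin would default to pandas).
    if unsupported:
        num_partitions = 1
    # 2. Compute metadata common to all partitions: the column names.
    header, _, body = text.partition("\n")
    columns = header.split(",")
    # 3. Split the remaining data into roughly equal chunks of lines.
    lines = body.splitlines()
    step = max(1, -(-len(lines) // num_partitions))  # ceiling division
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    # 4. Launch one parsing task per chunk on the "workers".
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(parse_chunk, chunks))
    # 5. Postprocess: stitch the partial results back together in order.
    rows = [row for part in parts for row in part]
    return {"columns": columns, "rows": rows}
```

Calling toy_read_csv("a,b\n1,2\n3,4\n5,6\n") parses two chunks in parallel and returns the rows in their original order.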

Data File Splitting Mechanism#

Modin’s file splitting mechanism differs depending on the data format type:

  • text format type - the file is split into byte ranges according to user-specified arguments. In the simplest case, when no row-related parameters (such as nrows or skiprows) are passed, data chunk limits (start and end bytes) are derived by dividing the file size by the number of partitions. Chunk sizes can differ slightly because the computed end byte usually falls inside a line, in which case the last byte of that line is used instead of the initial value. In other cases the same splitting mechanism is used, but chunk sizes are defined by the number of lines that each partition should contain.

  • columnar store type - the file is split so that each chunk contains approximately the same number of columns.

  • SQL type - chunking is obtained by wrapping the initial SQL query with a query that specifies the initial row offset and the number of rows in the chunk.
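The text-format case from the list above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not Modin's implementation: the helper name split_into_byte_ranges is hypothetical, and the sketch ignores quoting, headers and row-related parameters such as nrows or skiprows.

```python
import io


def split_into_byte_ranges(f, num_partitions):
    """Return (start, end) byte offsets whose ends fall on line boundaries."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # Initial chunk size: file size divided by the number of partitions.
    chunk_size = max(1, file_size // num_partitions)
    ranges = []
    start = 0
    while start < file_size:
        # Jump to the last byte of the tentative chunk ...
        f.seek(min(start + chunk_size - 1, file_size))
        # ... and extend the chunk to the end of the line it landed in.
        f.readline()
        end = f.tell()
        ranges.append((start, end))
        start = end
    return ranges


data = b"id,name\n1,alpha\n2,beta\n3,gamma\n4,delta\n"
ranges = split_into_byte_ranges(io.BytesIO(data), num_partitions=3)
```

Each resulting range ends right after a newline, so every worker receives whole lines, and the ranges together cover the file exactly once.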

After file splitting is complete, the chunk data is passed to the parser functions (PandasCSVParser.parse for the read_csv function with pandas storage format) for further processing on each worker.
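For the columnar store and SQL cases described above, the chunk definitions handed to the workers might look like the following. Both helpers (split_columns and chunk_queries) are hypothetical names, and the LIMIT/OFFSET wrapper is a generic, dialect-dependent illustration rather than the exact query text Modin generates.

```python
def split_columns(columns, num_partitions):
    """Group column names so each chunk has about the same number of columns."""
    base, extra = divmod(len(columns), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks get one additional column.
        width = base + (1 if i < extra else 0)
        if width:
            chunks.append(columns[start:start + width])
        start += width
    return chunks


def chunk_queries(sql, total_rows, num_partitions):
    """Wrap `sql` so that each query returns one contiguous block of rows."""
    rows_per_chunk = max(1, -(-total_rows // num_partitions))  # ceiling division
    return [
        f"SELECT * FROM ({sql}) AS _chunk LIMIT {rows_per_chunk} OFFSET {offset}"
        for offset in range(0, total_rows, rows_per_chunk)
    ]
```

With seven columns and three partitions, split_columns produces groups of three, two and two columns. chunk_queries needs the total row count up front, which in practice would come from a separate counting query.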

Submodules Description#

The modin.core.io module mostly stores utils and dispatcher classes for reading files of different formats.

  • io.py - class containing basic utils and default implementation of IO functions.

  • file_dispatcher.py - class for reading data from different kinds of files and handling util functions common to all formats. This class also contains the read function, which is the entry point for the _read functions of all dispatchers.

  • text - directory for storing all text file format dispatcher classes

    • text_file_dispatcher.py - class for reading text format files. This class holds the partitioned_file function for splitting text format files into chunks, the offset function for moving the file offset by the specified number of bytes, the _read_rows function for moving the file offset by the specified number of rows, and many other functions.

    • format/feature specific dispatchers: csv_dispatcher.py, excel_dispatcher.py, fwf_dispatcher.py and json_dispatcher.py.

  • column_stores - directory for storing all columnar store file format dispatcher classes

    • column_store_dispatcher.py - class for reading columnar type files. This class holds the build_query_compiler function, which performs file splitting, deploys remote tasks and postprocesses the results, along with many other functions.

    • format/feature specific dispatchers: feather_dispatcher.py, hdf_dispatcher.py and parquet_dispatcher.py.

  • sql - directory for storing SQL dispatcher class

    • sql_dispatcher.py - class for reading SQL queries or database tables.

Public API#

IO functions implementations.

class modin.core.io.BaseIO#

Class for basic utils and default implementation of IO functions.

classmethod from_arrow(at)#

Create a Modin query_compiler from a pyarrow.Table.

Parameters:

at (Arrow Table) – The Arrow Table to convert from.

Returns:

QueryCompiler containing data from the Arrow Table.

Return type:

BaseQueryCompiler

classmethod from_dask(dask_obj)#

Create a Modin query_compiler from a Dask DataFrame.

Parameters:

dask_obj (dask.dataframe.DataFrame) – The Dask DataFrame to convert from.

Returns:

QueryCompiler containing data from the Dask DataFrame.

Return type:

BaseQueryCompiler

Notes

A Dask DataFrame can only be converted to a Modin DataFrame if Modin uses a Dask engine. If another engine is used, a runtime exception will be raised.

classmethod from_dataframe(df)#

Create a Modin QueryCompiler from a DataFrame supporting the DataFrame exchange protocol __dataframe__().

Parameters:

df (DataFrame) – The DataFrame object supporting the DataFrame exchange protocol.

Returns:

QueryCompiler containing data from the DataFrame.

Return type:

BaseQueryCompiler

classmethod from_map(func, iterable, *args, **kwargs)#

Create a Modin query_compiler from a map function.

This method will construct a Modin query_compiler split by row partitions. The number of row partitions matches the number of elements in the iterable object.

Parameters:
  • func (callable) – Function to map across the iterable object.

  • iterable (Iterable) – An iterable object.

  • *args (tuple) – Positional arguments to pass in func.

  • **kwargs (dict) – Keyword arguments to pass in func.

Returns:

QueryCompiler containing data returned by map function.

Return type:

BaseQueryCompiler

classmethod from_non_pandas(*args, **kwargs)#

Create a Modin query_compiler from a non-pandas object.

Parameters:
  • *args (iterable) – Positional arguments to be passed into func.

  • **kwargs (dict) – Keyword arguments to be passed into func.

classmethod from_pandas(df)#

Create a Modin query_compiler from a pandas.DataFrame.

Parameters:

df (pandas.DataFrame) – The pandas DataFrame to convert from.

Returns:

QueryCompiler containing data from the pandas.DataFrame.

Return type:

BaseQueryCompiler

classmethod from_ray(ray_obj)#

Create a Modin query_compiler from a Ray Dataset.

Parameters:

ray_obj (ray.data.Dataset) – The Ray Dataset to convert from.

Returns:

QueryCompiler containing data from the Ray Dataset.

Return type:

BaseQueryCompiler

Notes

A Ray Dataset can only be converted to a Modin DataFrame if Modin uses a Ray engine. If another engine is used, a runtime exception will be raised.

classmethod read_clipboard(sep='\\s+', **kwargs)#

Read text from clipboard into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_clipboard for more.

classmethod read_csv(filepath_or_buffer, **kwargs)#

Read a comma-separated values (CSV) file into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler or TextParser with read data.

Return type:

BaseQueryCompiler or TextParser

Notes

See pandas API documentation for pandas.read_csv for more.

classmethod read_excel(**kwargs)#

Read an Excel file into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler or dict with read data.

Return type:

BaseQueryCompiler or dict

Notes

See pandas API documentation for pandas.read_excel for more.

classmethod read_feather(path, **kwargs)#

Load a feather-format object from the file path into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_feather for more.

classmethod read_fwf(filepath_or_buffer, *, colspecs='infer', widths=None, infer_nrows=100, dtype_backend=_NoDefault.no_default, iterator=False, chunksize=None, **kwds)#

Read a table of fixed-width formatted lines into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler or TextParser with read data.

Return type:

BaseQueryCompiler or TextParser

Notes

See pandas API documentation for pandas.read_fwf for more.

classmethod read_gbq(query: str, project_id=None, index_col=None, col_order=None, reauth=False, auth_local_webserver=False, dialect=None, location=None, configuration=None, credentials=None, use_bqstorage_api=None, private_key=None, verbose=None, progress_bar_type=None, max_results=None)#

Load data from Google BigQuery into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_gbq for more.

classmethod read_hdf(path_or_buf, key=None, mode: str = 'r', errors: str = 'strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)#

Read data from hdf store into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_hdf for more.

classmethod read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, **kwargs)#

Read HTML tables into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_html for more.

classmethod read_json(**kwargs)#

Convert a JSON string to query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_json for more.

classmethod read_parquet(**kwargs)#

Load a parquet object from the file path, returning a query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_parquet for more.

classmethod read_pickle(filepath_or_buffer, **kwargs)#

Load pickled pandas object (or any object) from file into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_pickle for more.

classmethod read_sas(filepath_or_buffer, *, format=None, index=None, encoding=None, chunksize=None, iterator=False, **kwargs)#

Read SAS files stored as either XPORT or SAS7BDAT format files into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sas for more.

classmethod read_spss(path, usecols, convert_categoricals, dtype_backend)#

Load an SPSS file from the file path, returning a query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_spss for more.

classmethod read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default, dtype=None)#

Read SQL query or database table into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sql for more.

classmethod read_sql_query(sql, con, **kwargs)#

Read SQL query into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sql_query for more.

classmethod read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default)#

Read SQL database table into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sql_table for more.

classmethod read_stata(filepath_or_buffer, **kwargs)#

Read Stata file into query compiler using pandas. For parameters description please refer to pandas API.

Returns:

QueryCompiler with read data.

Return type:

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_stata for more.

classmethod to_csv(obj, **kwargs)#

Write object to a comma-separated values (CSV) file using pandas.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_csv for more.

classmethod to_dask(modin_obj)#

Convert a Modin DataFrame/Series to a Dask DataFrame/Series.

Parameters:

modin_obj (modin.pandas.DataFrame, modin.pandas.Series) – The Modin DataFrame/Series to convert.

Returns:

Converted object with type depending on input.

Return type:

dask.dataframe.DataFrame or dask.dataframe.Series

Notes

A Modin DataFrame/Series can only be converted to a Dask DataFrame/Series if Modin uses a Dask engine. If another engine is used, a runtime exception will be raised.

classmethod to_json(obj, path, **kwargs)#

Convert the object to a JSON string.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_json for more.

classmethod to_parquet(obj, path, **kwargs)#

Write object to the binary parquet format using pandas.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_parquet for more.

classmethod to_pickle(obj: Any, filepath_or_buffer, **kwargs)#

Pickle (serialize) object to file.

Notes

See pandas API documentation for pandas.DataFrame.to_pickle for more.

classmethod to_ray(modin_obj)#

Convert a Modin DataFrame/Series to a Ray Dataset.

Parameters:

modin_obj (modin.pandas.DataFrame, modin.pandas.Series) – The Modin DataFrame/Series to convert.

Returns:

Converted object with type depending on input.

Return type:

ray.data.Dataset

Notes

A Modin DataFrame/Series can only be converted to a Ray Dataset if Modin uses a Ray engine. If another engine is used, a runtime exception will be raised.

classmethod to_sql(qc, name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)#

Write records stored in a DataFrame to a SQL database using pandas.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_sql for more.

classmethod to_xml(obj, path_or_buffer, **kwargs)#

Convert the object to an XML string.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_xml for more.

class modin.core.io.CSVDispatcher#

Class handles utils for reading .csv files.

class modin.core.io.ExcelDispatcher#

Class handles utils for reading excel files.

class modin.core.io.FWFDispatcher#

Class handles utils for reading of tables with fixed-width formatted lines.

classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int) Tuple[bool, Optional[str]]#

Check support of parameters of read_fwf function.

Parameters:
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_fwf function.

  • read_kwargs (dict) – Parameters of read_fwf function.

  • skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.

  • header_size (int) – Number of rows that are used by header.

Returns:

  • bool – Whether passed parameters are supported or not.

  • Optional[str] – None if parameters are supported, otherwise an error message describing why parameters are not supported.

class modin.core.io.FeatherDispatcher#

Class handles utils for reading .feather files.

class modin.core.io.FileDispatcher#

Class handles util functions for reading data from different kinds of files.

Notes

_read, deploy, parse and materialize are abstract methods and should be implemented in the child classes (function signatures can differ between child classes).

classmethod build_partition(partition_ids, row_lengths, column_widths)#

Build array with partitions of cls.frame_partition_cls class.

Parameters:
  • partition_ids (list) – Array with references to the partitions data.

  • row_lengths (list) – Partitions rows lengths.

  • column_widths (list) – Number of columns in each partition.

Returns:

Array with the same shape as partition_ids, filled with partition objects.

Return type:

np.ndarray

classmethod deploy(func, *args, num_returns=1, **kwargs)#

Deploy remote task.

Should be implemented in the task class (for example in the RayWrapper).

classmethod file_exists(file_path, storage_options=None)#

Check if file_path exists.

Parameters:
  • file_path (str) – String that represents the path to the file (paths to S3 buckets are also acceptable).

  • storage_options (dict, optional) – Keyword from read_* functions.

Returns:

Whether file exists or not.

Return type:

bool

classmethod file_size(f)#

Get the size of file associated with file handle f.

Parameters:

f (file-like object) – File-like object, that should be used to get file size.

Returns:

File size in bytes.

Return type:

int
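One common way to obtain the size of an already open file without disturbing the caller's position is to seek to the end and then restore the offset. The sketch below illustrates that technique on an in-memory buffer; it is an illustration of the idea and is not claimed to match Modin's code line by line.

```python
import io
import os


def file_size(f):
    """Return the size in bytes of the file behind handle `f`."""
    cur_pos = f.tell()            # remember where the caller was
    f.seek(0, os.SEEK_END)        # jump to the end of the file
    size = f.tell()               # the end offset equals the size in bytes
    f.seek(cur_pos, os.SEEK_SET)  # restore the original position
    return size


buf = io.BytesIO(b"hello\nworld\n")
buf.seek(3)
size = file_size(buf)
```

After the call, the handle is still positioned at byte 3, so the helper can be used in the middle of a read.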

classmethod get_path(file_path)#

Process file_path in accordance with its type.

Parameters:

file_path (str, os.PathLike[str] object or file-like object) – The file, or a path to the file. Paths to S3 buckets are also acceptable.

Returns:

Updated or verified file_path parameter.

Return type:

str

Notes

If file_path is a URL, the parameter is returned as is; otherwise the absolute path is returned.
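The rule in the note above can be sketched as follows. This is a hypothetical stand-in, not Modin's implementation: it assumes that a "URL" is any string starting with a scheme followed by ://, and it only handles strings and os.PathLike objects, not file-like objects.

```python
import os
import re

# Assumption for this sketch: a URL is any string of the form "<scheme>://...".
_URL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def get_path(file_path):
    """Return URLs unchanged and convert local paths to absolute paths."""
    if isinstance(file_path, str) and _URL_PATTERN.match(file_path):
        return file_path
    return os.path.abspath(os.fspath(file_path))
```

For example, get_path("s3://bucket/data.csv") returns the string unchanged, while get_path("data.csv") returns an absolute path based on the current working directory.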

classmethod materialize(obj_id)#

Get results from worker.

Should be implemented in the task class (for example in the RayWrapper).

parse(func, args, num_returns)#

Parse file’s data in the worker process.

Should be implemented in the parser class (for example in the PandasCSVParser).

classmethod read(*args, **kwargs)#

Read data according to the passed args and kwargs.

Parameters:
  • *args (iterable) – Positional arguments to be passed into _read function.

  • **kwargs (dict) – Keyword arguments to be passed into _read function.

Returns:

query_compiler – Query compiler with imported data for further processing.

Return type:

BaseQueryCompiler

Notes

read is a high-level function that calls the _read function specific to the defined storage format, engine and dispatcher class with the passed parameters, and then performs some postprocessing on the resulting query_compiler object.

class modin.core.io.HDFDispatcher#

Class handles utils for reading hdf data.

Inherits some util functions common to columnar store files from the ColumnStoreDispatcher class.

class modin.core.io.JSONDispatcher#

Class handles utils for reading .json files.

class modin.core.io.ParquetDispatcher#

Class handles utils for reading .parquet files.

classmethod build_index(dataset, partition_ids, index_columns, filters)#

Compute index and its split sizes of resulting Modin DataFrame.

Parameters:
  • dataset (Dataset) – Dataset object of Parquet file/files.

  • partition_ids (list) – Array with references to the partitions data.

  • index_columns (list) – List of index columns specified by pandas metadata.

  • filters (list) – List of filters to be used in reading the Parquet file/files.

Returns:

  • index (pandas.Index) – Index of resulting Modin DataFrame.

  • needs_index_sync (bool) – Whether the partition indices need to be synced with frame index because there’s no index column, or at least one index column is a RangeIndex.

Notes

See build_partition for more detail on the contents of partition_ids.

classmethod build_partition(partition_ids, column_widths)#

Build array with partitions of cls.frame_partition_cls class.

Parameters:
  • partition_ids (list) – Array with references to the partitions data.

  • column_widths (list) – Number of columns in each partition.

Returns:

Array with the same shape as partition_ids, filled with partition objects.

Return type:

np.ndarray

Notes

The second level of partition_ids contains a list of object references for each read call: partition_ids[i][j] -> [ObjectRef(df), ObjectRef(df.index), ObjectRef(len(df))].

classmethod build_query_compiler(dataset, columns, index_columns, **kwargs)#

Build query compiler from deployed tasks outputs.

Parameters:
  • dataset (Dataset) – Dataset object of Parquet file/files.

  • columns (list) – List of columns that should be read from file.

  • index_columns (list) – List of index columns specified by pandas metadata.

  • **kwargs (dict) – Parameters of deploying read_* function.

Returns:

new_query_compiler – Query compiler with imported data for further processing.

Return type:

BaseQueryCompiler

classmethod call_deploy(partition_files: list[list[ParquetFileToRead]], col_partitions: list[list[str]], storage_options: dict, engine: str, **kwargs)#

Deploy remote tasks to the workers with passed parameters.

Parameters:
  • partition_files (list[list[ParquetFileToRead]]) – List of arrays with files that should be read by each partition.

  • col_partitions (list[list[str]]) – List of arrays with columns names that should be read by each partition.

  • storage_options (dict) – Parameters for specific storage engine.

  • engine ({"auto", "pyarrow", "fastparquet"}) – Parquet library to use for reading.

  • **kwargs (dict) – Parameters of deploying read_* function.

Returns:

Array with references to the task deploy result for each partition.

Return type:

List

classmethod get_dataset(path, engine, storage_options)#

Retrieve Parquet engine specific Dataset implementation.

Parameters:
  • path (str, path object or file-like object) – The filepath of the parquet file in local filesystem or hdfs.

  • engine (str) – Parquet library to use (only ‘PyArrow’ is supported for now).

  • storage_options (dict) – Parameters for specific storage engine.

Returns:

Either a PyArrowDataset or FastParquetDataset object.

Return type:

Dataset

classmethod write(qc, **kwargs)#

Write a DataFrame to the binary parquet format.

Parameters:
  • qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_parquet on.

  • **kwargs (dict) – Parameters for pandas.to_parquet(**kwargs).

class modin.core.io.SQLDispatcher#

Class handles utils for reading SQL queries or database tables.

classmethod write(qc, **kwargs)#

Write records stored in the qc to a SQL database.

Parameters:
  • qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_sql on.

  • **kwargs (dict) – Parameters for pandas.to_sql(**kwargs).

class modin.core.io.TextFileDispatcher#

Class handles utils for reading text formats files.

classmethod build_partition(partition_ids, row_lengths, column_widths)#

Build array with partitions of cls.frame_partition_cls class.

Parameters:
  • partition_ids (list) – Array with references to the partitions data.

  • row_lengths (list) – Partitions rows lengths.

  • column_widths (list) – Number of columns in each partition.

Returns:

Array with the same shape as partition_ids, filled with partition objects.

Return type:

np.ndarray

classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int) Tuple[bool, Optional[str]]#

Check support of only general parameters of read_* function.

Parameters:
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_* function.

  • read_kwargs (dict) – Parameters of read_* function.

  • skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.

  • header_size (int) – Number of rows that are used by header.

Returns:

  • bool – Whether passed parameters are supported or not.

  • Optional[str] – None if parameters are supported, otherwise an error message describing why parameters are not supported.

classmethod compute_newline(file_like, encoding, quotechar)#

Compute byte or sequence of bytes indicating line endings.

Parameters:
  • file_like (file-like object) – File handle that should be used for line endings computing.

  • encoding (str) – Encoding of file_like.

  • quotechar (str) – Quotechar used for parsing file-like.

Returns:

line endings

Return type:

bytes

classmethod get_path_or_buffer(filepath_or_buffer)#

Extract path from filepath_or_buffer.

Parameters:

filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.

Returns:

verified filepath_or_buffer parameter.

Return type:

str or path object

Notes

Given a buffer, try and extract the filepath from it so that we can use it without having to fall back to pandas and share file objects between workers. Given a filepath, return it immediately.

classmethod offset(f, offset_size: int, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None)#

Move the file offset by the specified number of bytes.

Parameters:
  • f (file-like object) – File handle that should be used for offset movement.

  • offset_size (int) – Number of bytes to read and ignore.

  • quotechar (bytes, default: b'"') – Indicate quote in a file.

  • is_quoting (bool, default: True) – Whether or not to consider quotes.

  • encoding (str, optional) – Encoding of f.

  • newline (bytes, optional) – Byte or sequence of bytes indicating line endings.

Returns:

False if the file pointer reached the end of the file without finding a closing quote, True in any other case.

Return type:

bool
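A simplified sketch of a quote-aware offset move is shown below. It is an illustration under stated assumptions rather than Modin's code: it tracks whether the position is inside a quoted field by counting quote characters, always uses b"\n" as the line ending, and ignores the encoding and newline parameters.

```python
import io


def offset(f, offset_size, quotechar=b'"', is_quoting=True):
    """Skip `offset_size` bytes, then advance to the end of the current row.

    Returns False if the end of the file is reached inside an open quote.
    """
    chunk = f.read(offset_size)
    # An odd number of quote characters means we stopped inside a quoted field.
    outside_quotes = not is_quoting or chunk.count(quotechar) % 2 == 0
    while True:
        line = f.readline()
        if not line:
            # End of file: report whether every quote was closed.
            return outside_quotes
        if is_quoting and line.count(quotechar) % 2:
            outside_quotes = not outside_quotes
        if outside_quotes:
            # The row is complete only once we are outside any quoted field.
            return True


f = io.BytesIO(b'a,"x\ny",b\nc,d\n')
closed = offset(f, 4)
position = f.tell()
```

In this example the four skipped bytes end inside the quoted field "x\ny", so the helper keeps reading past the embedded newline and stops only after the real end of the first row, at byte 10.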

classmethod partitioned_file(f, num_partitions: int = None, nrows: int = None, skiprows: int = None, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None, header_size: int = 0, pre_reading: int = 0, get_metadata_kw: dict = None)#

Compute chunk sizes in bytes for every partition.

Parameters:
  • f (file-like object) – File handle of file to be partitioned.

  • num_partitions (int, optional) – Number of partitions to split the file into. If not specified, the value is taken from modin.config.NPartitions.get().

  • nrows (int, optional) – Number of rows of file to read.

  • skiprows (int, optional) – Specifies rows to skip.

  • quotechar (bytes, default: b'"') – Indicate quote in a file.

  • is_quoting (bool, default: True) – Whether or not to consider quotes.

  • encoding (str, optional) – Encoding of f.

  • newline (bytes, optional) – Byte or sequence of bytes indicating line endings.

  • header_size (int, default: 0) – Number of rows occupied by the header.

  • pre_reading (int, default: 0) – Number of rows between the header and the skipped rows that should be read.

  • get_metadata_kw (dict, optional) – Keyword arguments for cls.read_callback to compute metadata if needed. This option is not compatible with pre_reading!=0.

Returns:

  • list

    List with the following elements:

      • int : partition start read byte

      • int : partition end read byte

  • pandas.DataFrame or None – Dataframe from which metadata can be retrieved. Can be None if get_metadata_kw=None.

classmethod pathlib_or_pypath(filepath_or_buffer)#

Check if filepath_or_buffer is instance of py.path.local or pathlib.Path.

Parameters:

filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.

Returns:

Whether or not filepath_or_buffer is instance of py.path.local or pathlib.Path.

Return type:

bool

classmethod preprocess_func()#

Prepare a function for transmission to remote workers.

classmethod rows_skipper_builder(f, quotechar, is_quoting, encoding=None, newline=None)#

Build an object for skipping the passed number of lines.

Parameters:
  • f (file-like object) – File handle that should be used for offset movement.

  • quotechar (bytes) – Indicate quote in a file.

  • is_quoting (bool) – Whether or not to consider quotes.

  • encoding (str, optional) – Encoding of f.

  • newline (bytes, optional) – Byte or sequence of bytes indicating line endings.

Returns:

skipper object.

Return type:

object

Handling skiprows Parameter#

Handling of the skiprows parameter by pandas import functions can be very tricky, especially for the read_csv function, because of its interaction with the header parameter. This section covers the techniques used by both pandas and Modin to process skiprows.

Processing skiprows by pandas#

Let’s consider a simple snippet with pandas.read_csv in order to understand interconnection of header and skiprows parameters:

import pandas
from io import StringIO

data = """0
1
2
3
4
5
6
7
8
"""

# `header` parameter absence is equivalent to `header="infer"` or `header=0`
# rows 1, 5, 6, 7, 8 are read with header "0"
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4])
# rows 5, 6, 7, 8 are read with header "1", row 0 is skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=1)
# rows 6, 7, 8 are read with header "5", rows 0, 1 are skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=2)

In the examples above the list-like skiprows value is fixed and header is varied. In the first example, with no header provided, rows 2, 3, 4 are skipped and row 0 is considered the header. In the second example header == 1, so the zeroth row is skipped and the next available row is considered the header. The third example shows the general rule when both header and skiprows are present: the skiprows rows are dropped first, and the header is then derived from the remaining rows (rows before the header are skipped too).

The examples above consider only list-like skiprows and integer header parameters, but the same logic applies to other types of these parameters.
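The rule described above can be modelled in pure Python. The sketch below is not how pandas is implemented; emulate_skiprows_header is a hypothetical helper that only covers list-like skiprows and an integer header, but it reproduces the three results listed in the snippet.

```python
def emulate_skiprows_header(lines, skiprows, header=0):
    """Model the interplay of list-like `skiprows` and integer `header`."""
    to_skip = set(skiprows)
    # Step 1: drop the rows listed in `skiprows`.
    remaining = [line for i, line in enumerate(lines) if i not in to_skip]
    # Step 2: pick the header among the remaining rows; rows before it are lost.
    return remaining[header], remaining[header + 1:]


lines = [str(i) for i in range(9)]
first = emulate_skiprows_header(lines, [2, 3, 4])
second = emulate_skiprows_header(lines, [2, 3, 4], header=1)
third = emulate_skiprows_header(lines, [2, 3, 4], header=2)
```

Here first is ("0", ["1", "5", "6", "7", "8"]), second is ("1", ["5", "6", "7", "8"]) and third is ("5", ["6", "7", "8"]), matching the comments in the pandas snippet.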

Processing skiprows by Modin#

As can be seen, skipping rows in the pandas import functions is complicated, and distributing this logic across multiple workers can complicate it even more. For this reason, in some rare corner cases Modin uses the default pandas implementation to avoid excessive code complexity.

Modin uses two techniques for skipping rows:

1) During file partitioning (setting the file limits that each partition should read), specific rows can be excluded from the partitioning scope, so they are never read and can be considered skipped. This is the most efficient way of skipping rows since it requires no actual data reading or postprocessing, but in this case the skiprows parameter can only be an integer. Modin uses this approach whenever possible.

2) Rows to skip can be dropped after the full dataset is imported. This is more expensive since it requires extra IO work and postprocessing afterwards, but the skiprows parameter can be of any non-integer type supported by pandas.read_csv.

In some cases, if skiprows is a uniformly distributed array of consecutive rows (e.g. [1, 2, 3]), it can be “squashed” and represented as an integer, enabling a fast path that skips these rows during file partitioning (the first option). But if there is a gap between the last line of the header (which is skipped too, since the header is read by each partition to ensure metadata is defined properly) and the first row to skip, then the rows in this gap must be read first: the first partition is assigned to read them by setting the pre_reading parameter.

Let’s consider an example of skipping rows during partitioning when header="infer" and skiprows=[3, 4, 5]. In this case the fast path is possible since skiprows is a uniformly distributed array, so it can be “squashed” to an integer and the “partitioning” skiprows set to 3. But if no additional action is taken, these three rows will be skipped right after the header line, which corresponds to skiprows=[1, 2, 3]. To avoid this discrepancy, the first partition is assigned to read the data between the header line and the first row to skip by setting the special pre_reading parameter to 2. Then, after the rows marked for skipping during partitioning are skipped, the remaining data is divided among the remaining partitions. See the row assignment below:

0 - header line (skip during partitioning)
1 - pre reading (assign to read by the first partition)
2 - pre reading (assign to read by the first partition)
3 - "partitioning" skiprows (skip during partitioning)
4 - "partitioning" skiprows (skip during partitioning)
5 - "partitioning" skiprows (skip during partitioning)
6 - data to partition (divide between the rest of partitions)
7 - data to partition (divide between the rest of partitions)
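The “squashing” step and the resulting row assignment can be sketched as follows. Both helpers (squash_skiprows and describe_rows) are hypothetical and only model the reasoning above; they assume the header occupies the first header_size rows and that skiprows does not overlap it.

```python
def squash_skiprows(skiprows, header_size=1):
    """Return (pre_reading, skiprows_int) if the fast path applies, else None."""
    rows = sorted(skiprows)
    if not rows or rows[0] < header_size:
        return None
    # The fast path requires consecutive row numbers, e.g. [3, 4, 5].
    if any(b - a != 1 for a, b in zip(rows, rows[1:])):
        return None
    # Rows between the header and the first skipped row must be pre-read.
    return rows[0] - header_size, len(rows)


def describe_rows(total_rows, header_size, pre_reading, skiprows):
    """Label each row of the file with the role it plays during partitioning."""
    labels = []
    for i in range(total_rows):
        if i < header_size:
            labels.append("header")
        elif i < header_size + pre_reading:
            labels.append("pre_reading")
        elif i < header_size + pre_reading + skiprows:
            labels.append("skipped")
        else:
            labels.append("data")
    return labels


pre_reading, skip = squash_skiprows([3, 4, 5])
labels = describe_rows(8, 1, pre_reading, skip)
```

For skiprows=[3, 4, 5] the sketch yields pre_reading=2 and an integer skiprows of 3, and the labels reproduce the row assignment listed above. A non-consecutive input such as [1, 3, 5] returns None, meaning the rows would have to be dropped after import instead.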