IO Module Description#

Dispatcher Classes Workflow Overview#

Calls from read_* functions of execution-specific IO classes (for example, PandasOnRayIO for Ray engine and pandas storage format) are forwarded to the _read function of the file format-specific class (for example CSVDispatcher for CSV files), where function parameters are preprocessed to check if they are supported (defaulting to pandas if not) and common metadata is computed for all partitions. The file is then split into chunks (splitting mechanism described below) and the data is used to launch tasks on the remote workers. After the remote tasks finish, additional postprocessing is performed on the results, and a new query compiler with the imported data will be returned.
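The flow above can be illustrated with a toy dispatcher. This is a minimal sketch and not Modin's code: the class `ToyCSVDispatcher`, its `_is_supported` rule, and the thread pool standing in for remote workers are all hypothetical, and a plain list of rows stands in for the query compiler.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


class ToyCSVDispatcher:
    """Hypothetical format-specific dispatcher; not Modin's CSVDispatcher."""

    @classmethod
    def read(cls, text, n_chunks=2, **kwargs):
        # Entry point: forward the call to the format-specific _read.
        return cls._read(text, n_chunks, **kwargs)

    @classmethod
    def _is_supported(cls, kwargs):
        # Assumed rule for illustration: only `iterator=True` is unsupported.
        return not kwargs.get("iterator", False)

    @classmethod
    def _parse(cls, lines, header):
        # Parser function that runs on each worker.
        return list(csv.DictReader(lines, fieldnames=header.split(",")))

    @classmethod
    def _read(cls, text, n_chunks, **kwargs):
        # Metadata common to all partitions (here: the header line).
        header, *lines = text.splitlines()
        if not cls._is_supported(kwargs):
            # Fallback: parse everything at once (stands in for defaulting to pandas).
            return cls._parse(lines, header)
        # Split the data into chunks of roughly equal size.
        size = max(1, -(-len(lines) // n_chunks))  # ceiling division
        chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
        # Launch one parsing task per chunk (threads stand in for remote workers).
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(lambda chunk: cls._parse(chunk, header), chunks))
        # Postprocessing: combine per-chunk results into a single object.
        return [row for part in parts for row in part]
```

Calling `ToyCSVDispatcher.read("a,b\n1,2\n3,4\n")` goes through the chunked path, while passing `iterator=True` takes the fallback path; both return the same rows.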

Data File Splitting Mechanism#

Modin’s file splitting mechanism differs depending on the data format type:

  • text format type - the file is split into byte ranges according to user-specified arguments. In the simplest case, when no row-related parameters (such as nrows or skiprows) are passed, data chunk limits (start and end bytes) are derived by dividing the file size by the number of partitions (chunk sizes can differ slightly because the computed end byte usually falls inside a line, in which case the last byte of that line is used instead of the initial value). In other cases the same splitting mechanism is used, but chunk sizes are defined according to the number of lines that each partition should contain.

  • columnar store type - the file is split so that each chunk contains approximately the same number of columns.

  • SQL type - chunking is obtained by wrapping the initial SQL query with a query that specifies the initial row offset and the number of rows in the chunk.
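The columnar and SQL strategies from the list above can be sketched as follows. This is an illustration under assumptions rather than Modin's implementation: the helper names `split_columns` and `wrap_query` are hypothetical, and the exact SQL text Modin generates is not given in this document.

```python
import sqlite3


def split_columns(columns, num_partitions):
    """Split column names into at most num_partitions groups of near-equal size."""
    base, extra = divmod(len(columns), num_partitions)
    groups, start = [], 0
    for i in range(num_partitions):
        width = base + (1 if i < extra else 0)
        if width:
            groups.append(columns[start:start + width])
        start += width
    return groups


def wrap_query(sql, limit, offset):
    """Wrap a query so it returns one chunk of rows (illustrative SQL text)."""
    return f"SELECT * FROM ({sql}) AS _chunk LIMIT {limit} OFFSET {offset}"


# Demonstrate the SQL wrapping on an in-memory SQLite table with 10 rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

chunks = [
    con.execute(wrap_query("SELECT x FROM t ORDER BY x", 4, offset)).fetchall()
    for offset in range(0, 10, 4)
]
```

Each wrapped query can be executed independently, which is what makes it possible to hand one chunk to each worker.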

After file splitting is complete, chunk data is passed to the parser functions (PandasCSVParser.parse for the read_csv function with pandas storage format) for further processing on each worker.
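For text formats, the splitting and per-chunk parsing steps can be sketched with the standard library alone. This is a simplified sketch, not Modin's code: `split_byte_ranges` and `parse_chunk` are hypothetical helpers, quoting is ignored, and the `csv` module stands in for the pandas-based parser.

```python
import csv
import io


def split_byte_ranges(f, num_partitions):
    """Return (start, end) byte ranges after the header; each range ends on a line boundary."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    f.seek(0)
    f.readline()  # the header is shared metadata, not partitioned data
    start = f.tell()
    chunk_size = max(1, (file_size - start) // num_partitions)
    ranges = []
    while start < file_size:
        f.seek(min(start + chunk_size, file_size))
        f.readline()  # the end byte may fall inside a line: extend to the end of that line
        end = f.tell()
        ranges.append((start, end))
        start = end
    return ranges


def parse_chunk(f, start, end):
    """Parse one byte range; in a distributed setting this would run on a worker."""
    f.seek(start)
    data = f.read(end - start).decode("utf-8")
    return list(csv.reader(io.StringIO(data)))


source = io.BytesIO(b"a,b\n1,2\n3,4\n5,6\n7,8\n")
ranges = split_byte_ranges(source, 2)
rows = [row for start, end in ranges for row in parse_chunk(source, start, end)]
```

Because every range ends on a line boundary, each chunk can be parsed independently and the results concatenated in order.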

Submodules Description#

The modin.core.io module mostly stores utilities and dispatcher classes for reading files of different formats.

  • io.py - class containing basic utils and default implementation of IO functions.

  • file_dispatcher.py - class for reading data from different kinds of files and handling some utility functions common to all formats. This class also contains the read function, which is the entry point to the _read functions of all dispatchers.

  • text - directory for storing all text file format dispatcher classes

    • text_file_dispatcher.py - class for reading text format files. This class holds the partitioned_file function for splitting text format files into chunks, the offset function for moving the file offset by a specified number of bytes, the _read_rows function for moving the file offset by a specified number of rows, and many other functions.

    • format/feature specific dispatchers: csv_dispatcher.py, csv_glob_dispatcher.py (reading multiple files simultaneously, experimental feature), excel_dispatcher.py, fwf_dispatcher.py and json_dispatcher.py.

  • column_stores - directory for storing all columnar store file format dispatcher classes

    • column_store_dispatcher.py - class for reading columnar type files. This class holds the build_query_compiler function, which performs file splitting, deployment of remote tasks and postprocessing of results, as well as many other functions.

    • format/feature specific dispatchers: feather_dispatcher.py, hdf_dispatcher.py and parquet_dispatcher.py.

  • sql - directory for storing SQL dispatcher class

    • sql_dispatcher.py - class for reading SQL queries or database tables.

Public API#

IO functions implementations.

class modin.core.io.BaseIO#

Class for basic utils and default implementation of IO functions.

classmethod from_arrow(at)#

Create a Modin query_compiler from a pyarrow.Table.

Parameters

at (Arrow Table) – The Arrow Table to convert from.

Returns

QueryCompiler containing data from the Arrow Table.

Return type

BaseQueryCompiler

classmethod from_dataframe(df)#

Create a Modin QueryCompiler from a DataFrame supporting the DataFrame exchange protocol __dataframe__().

Parameters

df (DataFrame) – The DataFrame object supporting the DataFrame exchange protocol.

Returns

QueryCompiler containing data from the DataFrame.

Return type

BaseQueryCompiler

classmethod from_non_pandas(*args, **kwargs)#

Create a Modin query_compiler from a non-pandas object.

Parameters
  • *args (iterable) – Positional arguments to be passed into func.

  • **kwargs (dict) – Keyword arguments to be passed into func.

classmethod from_pandas(df)#

Create a Modin query_compiler from a pandas.DataFrame.

Parameters

df (pandas.DataFrame) – The pandas DataFrame to convert from.

Returns

QueryCompiler containing data from the pandas.DataFrame.

Return type

BaseQueryCompiler

classmethod read_clipboard(sep='\\s+', **kwargs)#

Read text from clipboard into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_clipboard for more.

classmethod read_csv(filepath_or_buffer, sep=NoDefault.no_default, delimiter=None, header='infer', names=NoDefault.no_default, index_col=None, usecols=None, squeeze=False, prefix=NoDefault.no_default, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)#

Read a comma-separated values (CSV) file into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler or TextParser with read data.

Return type

BaseQueryCompiler or TextParser

Notes

See pandas API documentation for pandas.read_csv for more.

classmethod read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skip_footer=0, skipfooter=0, convert_float=True, mangle_dupe_cols=True, na_filter=True, **kwds)#

Read an Excel file into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler or OrderedDict/dict with read data.

Return type

BaseQueryCompiler or dict/OrderedDict

Notes

See pandas API documentation for pandas.read_excel for more.

classmethod read_feather(path, columns=None, use_threads=True, storage_options=None)#

Load a feather-format object from the file path into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_feather for more.

classmethod read_fwf(filepath_or_buffer, colspecs='infer', widths=None, infer_nrows=100, **kwds)#

Read a table of fixed-width formatted lines into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler or TextParser with read data.

Return type

BaseQueryCompiler or TextParser

Notes

See pandas API documentation for pandas.read_fwf for more.

classmethod read_gbq(query: str, project_id=None, index_col=None, col_order=None, reauth=False, auth_local_webserver=False, dialect=None, location=None, configuration=None, credentials=None, use_bqstorage_api=None, private_key=None, verbose=None, progress_bar_type=None, max_results=None)#

Load data from Google BigQuery into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_gbq for more.

classmethod read_hdf(path_or_buf, key=None, mode: str = 'r', errors: str = 'strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)#

Read data from hdf store into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_hdf for more.

classmethod read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)#

Read HTML tables into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_html for more.

classmethod read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, encoding_errors='strict', lines=False, chunksize=None, compression='infer', nrows: Optional[int] = None, storage_options=None)#

Convert a JSON string to query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_json for more.

classmethod read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)#

Load a parquet object from the file path, returning a query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_parquet for more.

classmethod read_pickle(filepath_or_buffer, compression='infer', storage_options=None)#

Load pickled pandas object (or any object) from file into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_pickle for more.

classmethod read_sas(filepath_or_buffer, format=None, index=None, encoding=None, chunksize=None, iterator=False)#

Read SAS files stored as either XPORT or SAS7BDAT format files into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sas for more.

classmethod read_spss(path, usecols, convert_categoricals)#

Load an SPSS file from the file path, returning a query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_spss for more.

classmethod read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)#

Read SQL query or database table into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sql for more.

classmethod read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None)#

Read SQL query into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sql_query for more.

classmethod read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)#

Read SQL database table into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_sql_table for more.

classmethod read_stata(filepath_or_buffer, convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)#

Read Stata file into query compiler using pandas.

For parameters description please refer to pandas API.

Returns

QueryCompiler with read data.

Return type

BaseQueryCompiler

Notes

See pandas API documentation for pandas.read_stata for more.

classmethod to_csv(obj, **kwargs)#

Write object to a comma-separated values (CSV) file using pandas.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_csv for more.

classmethod to_parquet(obj, **kwargs)#

Write object to the binary parquet format using pandas.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_parquet for more.

classmethod to_pickle(obj: Any, filepath_or_buffer, compression: Optional[Union[Literal['infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd'], Dict[str, Any]]] = 'infer', protocol: int = 5, storage_options: Optional[Dict[str, Any]] = None)#

Pickle (serialize) object to file.

Notes

See pandas API documentation for pandas.DataFrame.to_pickle for more.

classmethod to_sql(qc, name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)#

Write records stored in a DataFrame to a SQL database using pandas.

For parameters description please refer to pandas API.

Notes

See pandas API documentation for pandas.DataFrame.to_sql for more.

class modin.core.io.CSVDispatcher#

Class handles utils for reading .csv files.

read_callback(**kwargs)#

Parse data on each partition.

Parameters
  • *args (list) – Positional arguments to be passed to the callback function.

  • **kwargs (dict) – Keyword arguments to be passed to the callback function.

Returns

Function call result.

Return type

pandas.DataFrame or pandas.io.parsers.TextParser

class modin.core.io.CSVGlobDispatcher#

Class contains utils for reading multiple .csv files simultaneously.

classmethod file_exists(file_path: str) bool#

Check if the file_path is valid.

Parameters

file_path (str) – String representing a path.

Returns

True if the path is valid.

Return type

bool

classmethod get_path(file_path: str) list#

Return the path of the file(s).

Parameters

file_path (str) – String representing a path.

Returns

List of strings of absolute file paths.

Return type

list

classmethod partitioned_file(files, fnames: List[str], num_partitions: Optional[int] = None, nrows: Optional[int] = None, skiprows: Optional[int] = None, skip_header: Optional[int] = None, quotechar: bytes = b'"', is_quoting: bool = True) List[List[Tuple[str, int, int]]]#

Compute chunk sizes in bytes for every partition.

Parameters
  • files (file or list of files) – File(s) to be partitioned.

  • fnames (str or list of str) – File name(s) to be partitioned.

  • num_partitions (int, optional) – Number of partitions to split the file into. If not specified, the value is taken from modin.config.NPartitions.get().

  • nrows (int, optional) – Number of rows of file to read.

  • skiprows (int, optional) – Specifies rows to skip.

  • skip_header (int, optional) – Specifies header rows to skip.

  • quotechar (bytes, default: b'"') – Indicate quote in a file.

  • is_quoting (bool, default: True) – Whether or not to consider quotes.

Returns

List, where each element of the list is a list of tuples. Each inner list of tuples contains the data file name of the chunk, the chunk start offset, and the chunk end offset for its corresponding file.

Return type

list

Notes

The logic gets really complicated if we try to use the TextFileDispatcher.partitioned_file.
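The shape of the return value described above, a list of partitions where each partition is a list of (file name, start offset, end offset) tuples, can be illustrated with a simplified sketch. It is not Modin's algorithm: `partition_files` is a hypothetical helper that assumes header-less files, ignores quoting, and supports none of the nrows, skiprows or skip_header parameters.

```python
import io


def partition_files(files, fnames, num_partitions):
    """Group line-aligned byte ranges from several files into partitions."""
    sizes = []
    for f in files:
        f.seek(0, io.SEEK_END)
        sizes.append(f.tell())
    budget = max(1, sum(sizes) // num_partitions)  # target bytes per partition
    partitions, current, room = [], [], budget
    for f, fname, size in zip(files, fnames, sizes):
        start = 0
        while start < size:
            f.seek(min(start + room, size))
            f.readline()  # extend the range to the end of the current line
            end = f.tell()
            current.append((fname, start, end))
            room -= end - start
            start = end
            if room <= 0:
                # The partition is full: close it and start a new one.
                partitions.append(current)
                current, room = [], budget
    if current:
        partitions.append(current)
    return partitions
```

A single partition may therefore hold ranges from more than one file, which is why each inner element carries its own file name.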

class modin.core.io.CustomTextExperimentalDispatcher#

Class handles utils for reading custom text files.

class modin.core.io.ExcelDispatcher#

Class handles utils for reading excel files.

class modin.core.io.FWFDispatcher#

Class handles utils for reading of tables with fixed-width formatted lines.

classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int)#

Check support of parameters of read_fwf function.

Parameters
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_fwf function.

  • read_kwargs (dict) – Parameters of read_fwf function.

  • skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.

  • header_size (int) – Number of rows that are used by header.

Returns

Whether passed parameters are supported or not.

Return type

bool

read_callback(**kwargs)#

Parse data on each partition.

Parameters
  • *args (list) – Positional arguments to be passed to the callback function.

  • **kwargs (dict) – Keyword arguments to be passed to the callback function.

Returns

Function call result.

Return type

pandas.DataFrame or pandas.io.parsers.TextFileReader

class modin.core.io.FeatherDispatcher#

Class handles utils for reading .feather files.

class modin.core.io.FileDispatcher#

Class handles util functions for reading data from different kinds of files.

Notes

_read, deploy, parse and materialize are abstract methods and should be implemented in the child classes (function signatures can differ between child classes).

classmethod build_partition(partition_ids, row_lengths, column_widths)#

Build array with partitions of cls.frame_partition_cls class.

Parameters
  • partition_ids (list) – Array with references to the partitions data.

  • row_lengths (list) – Partitions rows lengths.

  • column_widths (list) – Number of columns in each partition.

Returns

Array with the same shape as partition_ids, filled with partition objects.

Return type

np.ndarray

classmethod deploy(func, *args, num_returns=1, **kwargs)#

Deploy remote task.

Should be implemented in the task class (for example in the RayTask).

classmethod file_exists(file_path)#

Check if file_path exists.

Parameters

file_path (str) – String that represents the path to the file (paths to S3 buckets are also acceptable).

Returns

Whether file exists or not.

Return type

bool

classmethod file_size(f)#

Get the size of file associated with file handle f.

Parameters

f (file-like object) – File-like object, that should be used to get file size.

Returns

File size in bytes.

Return type

int
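One common way to get the size of a file from an open handle is to seek to its end and read the position. The sketch below shows that technique with a hypothetical `handle_size` helper; restoring the original position afterwards is a design choice of this sketch, not something this document states about file_size.

```python
import io


def handle_size(f):
    """Return the size in bytes of the file behind handle f, keeping its position."""
    current = f.tell()
    f.seek(0, io.SEEK_END)  # jump to the end of the file
    size = f.tell()
    f.seek(current, io.SEEK_SET)  # restore the caller's position
    return size
```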

classmethod get_path(file_path)#

Process file_path in accordance with its type.

Parameters

file_path (str, os.PathLike[str] object or file-like object) – The file, or a path to the file. Paths to S3 buckets are also acceptable.

Returns

Updated or verified file_path parameter.

Return type

str

Notes

If file_path is an S3 bucket, the parameter is returned as is; otherwise the absolute path is returned.
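The rule in the note above can be sketched as follows. `resolve_path` is a hypothetical helper, and detecting S3 paths by an `s3://` prefix is an assumption made for this illustration.

```python
import os


def resolve_path(file_path):
    """Return S3 paths unchanged and local paths as absolute paths."""
    # Assumption: an S3 location is recognized by its "s3://" prefix.
    if isinstance(file_path, str) and file_path.lower().startswith("s3://"):
        return file_path
    return os.path.abspath(file_path)
```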

classmethod materialize(obj_id)#

Get results from worker.

Should be implemented in the task class (for example in the RayTask).

parse(func, args, num_returns)#

Parse file’s data in the worker process.

Should be implemented in the parser class (for example in the PandasCSVParser).

classmethod read(*args, **kwargs)#

Read data according to the passed args and kwargs.

Parameters
  • *args (iterable) – Positional arguments to be passed into _read function.

  • **kwargs (dict) – Keyword arguments to be passed into _read function.

Returns

query_compiler – Query compiler with imported data for further processing.

Return type

BaseQueryCompiler

Notes

read is a high-level function that calls the _read function specific to the defined storage format, engine and dispatcher class with the passed parameters, and performs some postprocessing work on the resulting query_compiler object.

class modin.core.io.HDFDispatcher#

Class handles utils for reading hdf data.

Inherits some util functions common to columnar store files from the ColumnStoreDispatcher class.

class modin.core.io.JSONDispatcher#

Class handles utils for reading .json files.

class modin.core.io.ParquetDispatcher#

Class handles utils for reading .parquet files.

class modin.core.io.PickleExperimentalDispatcher#

Class handles utils for reading pickle files.

class modin.core.io.SQLDispatcher#

Class handles utils for reading SQL queries or database tables.

class modin.core.io.TextFileDispatcher#

Class handles utils for reading text format files.

classmethod build_partition(partition_ids, row_lengths, column_widths)#

Build array with partitions of cls.frame_partition_cls class.

Parameters
  • partition_ids (list) – Array with references to the partitions data.

  • row_lengths (list) – Partitions rows lengths.

  • column_widths (list) – Number of columns in each partition.

Returns

Array with the same shape as partition_ids, filled with partition objects.

Return type

np.ndarray

classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int) bool#

Check support of only general parameters of read_* function.

Parameters
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_* function.

  • read_kwargs (dict) – Parameters of read_* function.

  • skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.

  • header_size (int) – Number of rows that are used by header.

Returns

Whether passed parameters are supported or not.

Return type

bool

classmethod compute_newline(file_like, encoding, quotechar)#

Compute byte or sequence of bytes indicating line endings.

Parameters
  • file_like (file-like object) – File handle that should be used for line endings computing.

  • encoding (str) – Encoding of file_like.

  • quotechar (str) – Quotechar used for parsing file-like.

Returns

line endings

Return type

bytes

classmethod get_path_or_buffer(filepath_or_buffer)#

Extract path from filepath_or_buffer.

Parameters

filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.

Returns

verified filepath_or_buffer parameter.

Return type

str or path object

Notes

Given a buffer, try and extract the filepath from it so that we can use it without having to fall back to pandas and share file objects between workers. Given a filepath, return it immediately.

classmethod offset(f, offset_size: int, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None)#

Move the file offset by the specified number of bytes.

Parameters
  • f (file-like object) – File handle that should be used for offset movement.

  • offset_size (int) – Number of bytes to read and ignore.

  • quotechar (bytes, default: b'"') – Indicate quote in a file.

  • is_quoting (bool, default: True) – Whether or not to consider quotes.

  • encoding (str, optional) – Encoding of f.

  • newline (bytes, optional) – Byte or sequence of bytes indicating line endings.

Returns

False if the file pointer reached the end of the file without finding a closing quote; True in any other case.

Return type

bool
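The quote handling matters because a newline inside a quoted field does not end a record. The sketch below shows one way to honor that while moving the offset; it is an illustration, not Modin's implementation. `move_offset` is a hypothetical helper that assumes the handle starts at a record boundary and that quotes inside fields are escaped by doubling, and it omits the encoding and newline parameters.

```python
import io


def move_offset(f, offset_size, quotechar=b'"', is_quoting=True):
    """Advance f by offset_size bytes, then to the end of the current record.

    Returns False if the end of the file is reached inside an open quote.
    """
    chunk = f.read(offset_size)
    outside_quotes = True
    if is_quoting:
        # An odd number of quote characters means we stopped inside a quoted field.
        outside_quotes = chunk.count(quotechar) % 2 == 0
    while True:
        line = f.readline()
        if not line:
            return outside_quotes  # end of file
        if is_quoting and line.count(quotechar) % 2 == 1:
            outside_quotes = not outside_quotes
        if outside_quotes:
            return True  # stopped right after a real record terminator
```

With `is_quoting=False` the function simply stops at the first line ending after the requested offset.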

classmethod partitioned_file(f, num_partitions: int = None, nrows: int = None, skiprows: int = None, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None, header_size: int = 0, pre_reading: int = 0)#

Compute chunk sizes in bytes for every partition.

Parameters
  • f (file-like object) – File handle of file to be partitioned.

  • num_partitions (int, optional) – Number of partitions to split the file into. If not specified, the value is taken from modin.config.NPartitions.get().

  • nrows (int, optional) – Number of rows of file to read.

  • skiprows (int, optional) – Specifies rows to skip.

  • quotechar (bytes, default: b'"') – Indicate quote in a file.

  • is_quoting (bool, default: True) – Whether or not to consider quotes.

  • encoding (str, optional) – Encoding of f.

  • newline (bytes, optional) – Byte or sequence of bytes indicating line endings.

  • header_size (int, default: 0) – Number of rows occupied by the header.

  • pre_reading (int, default: 0) – Number of rows between the header and the skipped rows that should be read.

Returns

List with the following elements:

  • int : partition start read byte

  • int : partition end read byte

Return type

list

classmethod pathlib_or_pypath(filepath_or_buffer)#

Check if filepath_or_buffer is instance of py.path.local or pathlib.Path.

Parameters

filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.

Returns

Whether or not filepath_or_buffer is instance of py.path.local or pathlib.Path.

Return type

bool

classmethod rows_skipper_builder(f, quotechar, is_quoting, encoding=None, newline=None)#

Build object for skipping passed number of lines.

Parameters
  • f (file-like object) – File handle that should be used for offset movement.

  • quotechar (bytes) – Indicate quote in a file.

  • is_quoting (bool) – Whether or not to consider quotes.

  • encoding (str, optional) – Encoding of f.

  • newline (bytes, optional) – Byte or sequence of bytes indicating line endings.

Returns

skipper object.

Return type

object

Handling skiprows Parameter#

Handling the skiprows parameter in pandas import functions can be very tricky, especially for the read_csv function, because of its interconnection with the header parameter. This section covers how both pandas and Modin process skiprows.

Processing skiprows by pandas#

Let’s consider a simple snippet with pandas.read_csv in order to understand the interconnection of the header and skiprows parameters:

import pandas
from io import StringIO

data = """0
1
2
3
4
5
6
7
8
"""

# `header` parameter absence is equivalent to `header="infer"` or `header=0`
# rows 1, 5, 6, 7, 8 are read with header "0"
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4])
# rows 5, 6, 7, 8 are read with header "1", row 0 is skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=1)
# rows 6, 7, 8 are read with header "5", rows 0, 1 are skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=2)

In the examples above the list-like skiprows value is fixed and header is varied. In the first example, with no header provided, rows 2, 3, 4 are skipped and row 0 is considered the header. In the second example header == 1, so the zeroth row is skipped and the next available row is considered the header. The third example illustrates the general rule when both header and skiprows are present: the skiprows rows are dropped first, and then the header is derived from the remaining rows (rows before the header are skipped too).

In the examples above only list-like skiprows and integer header parameters are considered, but the same logic applies to other types of these parameters.
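The rule described above (drop the skiprows rows first, then pick the header from what remains) can be modeled in a few lines of plain Python. This sketch covers only the list-like skiprows and integer header case from the examples; `apply_skiprows_and_header` is a hypothetical helper, not a pandas function.

```python
def apply_skiprows_and_header(lines, skiprows=(), header=0):
    """Return (header_line, data_lines) following the skiprows-then-header rule."""
    skip = set(skiprows)
    # Step 1: drop the rows listed in skiprows (indices refer to the original file).
    remaining = [line for i, line in enumerate(lines) if i not in skip]
    # Step 2: the header index refers to the remaining rows; earlier rows are dropped too.
    return remaining[header], remaining[header + 1:]


lines = [str(i) for i in range(9)]
first = apply_skiprows_and_header(lines, skiprows=[2, 3, 4])
second = apply_skiprows_and_header(lines, skiprows=[2, 3, 4], header=1)
third = apply_skiprows_and_header(lines, skiprows=[2, 3, 4], header=2)
```

The three results reproduce the headers and data rows noted in the comments of the pandas snippet.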

Processing skiprows by Modin#

As can be seen, skipping rows in the pandas import functions is complicated, and distributing this logic across multiple workers can complicate it even more. Thus, in some rare corner cases Modin falls back to the default pandas implementation to avoid excessively complicating Modin code.

Modin uses two techniques for skipping rows:

1) During file partitioning (setting the file limits that should be read by each partition), exact rows can be excluded from the partitioning scope, so they are not read at all and can be considered skipped. This is the most efficient way of skipping rows since it doesn’t require any actual data reading or postprocessing, but in this case the skiprows parameter can only be an integer. Modin uses this approach whenever possible.

2) Rows to skip can be dropped after the full dataset is imported. This is more expensive since it requires extra IO work and postprocessing afterwards, but the skiprows parameter can be of any non-integer type supported by pandas.read_csv.

In some cases, if skiprows is a uniformly distributed array, that is, a run of consecutive row numbers (e.g. [1, 2, 3]), skiprows can be “squashed” and represented as an integer to enable a fastpath in which these rows are skipped during file partitioning (using the first option). But if there is a gap between the first row to skip and the last line of the header (which is skipped too, since the header is read by each partition to ensure metadata is defined properly), then the rows in this gap must be read first: the first partition is assigned to read them by setting the pre_reading parameter.

Let’s consider an example of skipping rows during partitioning when header="infer" and skiprows=[3, 4, 5]. In this specific case the fastpath can be used since skiprows is a uniformly distributed array, so we can “squash” it to an integer and set the “partitioning” skiprows to 3. But if no additional action is taken, these three rows will be skipped right after the header line, which corresponds to skiprows=[1, 2, 3]. To avoid this discrepancy, we need to assign the first partition to read the data between the header line and the first row to skip by setting the special pre_reading parameter to 2. Then, once the rows marked for skipping during partitioning have been skipped, the remaining data is divided among the remaining partitions; see the row assignment below:

0 - header line (skip during partitioning)
1 - pre reading (assign to read by the first partition)
2 - pre reading (assign to read by the first partition)
3 - "partitioning" skiprows (skip during partitioning)
4 - "partitioning" skiprows (skip during partitioning)
5 - "partitioning" skiprows (skip during partitioning)
6 - data to partition (divide between the rest of partitions)
7 - data to partition (divide between the rest of partitions)
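The "squashing" step and the derivation of pre_reading can be sketched as below. This is a minimal illustration under assumptions: `squash_skiprows` is a hypothetical helper, and returning None (meaning "no fastpath") when skiprows overlaps the header is a simplification made for this sketch.

```python
def squash_skiprows(skiprows, header_size=1):
    """Return (skiprows_int, pre_reading) for a consecutive run of rows, else None."""
    rows = sorted(set(skiprows))
    if not rows or rows[0] < header_size:
        return None  # empty, or overlaps the header: not handled in this sketch
    if rows != list(range(rows[0], rows[0] + len(rows))):
        return None  # rows are not consecutive, so they cannot be squashed
    # Rows between the header and the first skipped row must be read first.
    pre_reading = rows[0] - header_size
    return len(rows), pre_reading
```

For the example above, `squash_skiprows([3, 4, 5])` yields a "partitioning" skiprows of 3 and a pre_reading of 2, while `squash_skiprows([1, 2, 3])` needs no pre-reading.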