IO Module Description#
Dispatcher Classes Workflow Overview#
Calls from read_* functions of execution-specific IO classes (for example, PandasOnRayIO for the Ray engine and pandas storage format) are forwarded to the _read function of the file format-specific class (for example, CSVDispatcher for CSV files). There, function parameters are preprocessed to check whether they are supported (defaulting to pandas if not), and common metadata is computed for all partitions. The file is then split into chunks (the splitting mechanism is described below), and the chunk data is used to launch tasks on the remote workers. After the remote tasks finish, additional postprocessing is performed on the results, and a new query compiler with the imported data is returned.
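The workflow above can be sketched with a toy dispatcher. This is a minimal illustration, not Modin code: the class ToyCSVDispatcher, its unsupported set, and the dictionary it returns in place of a query compiler are all assumptions made for the example, and a thread pool stands in for the remote workers.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyCSVDispatcher:
    """Hypothetical stand-in for a format-specific dispatcher (not Modin code)."""

    # Assumed set of unsupported parameters, for illustration only.
    unsupported = {"chunksize", "iterator"}

    @classmethod
    def read(cls, text, **kwargs):
        # Entry point called by the read_* function of an IO class.
        return cls._read(text, **kwargs)

    @classmethod
    def _read(cls, text, num_partitions=2, **kwargs):
        lines = text.splitlines()
        # Common metadata (here: the column names) is computed once.
        header, body = lines[0].split(","), lines[1:]
        # 1. Preprocess parameters; fall back to a single-process parse if unsupported.
        if cls.unsupported.intersection(kwargs):
            return {"columns": header, "rows": cls._parse(body)}
        # 2. Split the data into roughly equal chunks of lines.
        step = max(1, -(-len(body) // num_partitions))  # ceiling division
        chunks = [body[i:i + step] for i in range(0, len(body), step)]
        # 3. Launch one parsing task per chunk on the "workers".
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(cls._parse, chunks))
        # 4. Postprocess: stitch the partition results together.
        rows = [row for part in parts for row in part]
        return {"columns": header, "rows": rows}

    @staticmethod
    def _parse(chunk):
        # Parser function that runs on each worker.
        return [line.split(",") for line in chunk]


result = ToyCSVDispatcher.read("a,b\n1,2\n3,4\n5,6\n")
```

Passing an unsupported keyword (for example chunksize=1) takes the fallback branch and yields the same structure without any partitioning.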
Data File Splitting Mechanism#
Modin’s file splitting mechanism differs depending on the data format type:
text format type - the file is split by byte ranges according to user-specified arguments. In the simplest case, when no row-related parameters (such as nrows or skiprows) are passed, data chunk limits (start and end bytes) are derived by dividing the file size by the number of partitions. Chunk sizes can differ slightly, because an end byte usually falls inside a line, and in that case the last byte of that line is used instead of the initial value. In other cases the same splitting mechanism is used, but chunk sizes are defined by the number of lines that each partition should contain.
columnar store type - the file is split so that each chunk contains approximately the same number of columns.
SQL type - chunks are obtained by wrapping the initial SQL query in a query that specifies the initial row offset and the number of rows in the chunk.
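The simplest text-format case (no row-related parameters) can be sketched as follows. The function split_by_bytes is a hypothetical helper written for this illustration, not a Modin API, and it ignores newlines inside quoted fields.

```python
import io


def split_by_bytes(f, num_partitions):
    """Return (start, end) byte offsets whose ends fall on line boundaries."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # Initial chunk size: file size divided by the number of partitions.
    chunk_size = max(1, file_size // num_partitions)
    ranges = []
    start = 0
    while start < file_size:
        # Jump to the tentative end byte, then extend it to the end of that line.
        f.seek(min(start + chunk_size, file_size))
        f.readline()
        end = f.tell()
        ranges.append((start, end))
        start = end
    return ranges


data = b"a,b\n1,2\n3,4\n5,6\n7,8\n"
ranges = split_by_bytes(io.BytesIO(data), num_partitions=2)
```

For this 20-byte input the tentative end byte 10 falls inside the line "3,4", so the first chunk is extended to byte 12, which is why chunk sizes can differ slightly.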
After file splitting is complete, the chunk data is passed to the parser functions (for example, PandasCSVParser.parse for the read_csv function with the pandas storage format) for further processing on each worker.
Submodules Description#
The modin.core.io module is used mostly for storing utils and dispatcher classes for reading files of different formats.
io.py - class containing basic utils and the default implementation of IO functions.
file_dispatcher.py - class for reading data from different kinds of files and handling some util functions common to all formats. This class also contains the read function, which is the entry point for all dispatchers' _read functions.
text - directory for storing all text file format dispatcher classes
text_file_dispatcher.py - class for reading text format files. This class holds the partitioned_file function for splitting text format files into chunks, the offset function for moving the file offset by the specified number of bytes, the _read_rows function for moving the file offset by the specified number of rows, and many other functions.
format/feature specific dispatchers: csv_dispatcher.py, excel_dispatcher.py, fwf_dispatcher.py and json_dispatcher.py.
column_stores - directory for storing all columnar store file format dispatcher classes
column_store_dispatcher.py - class for reading columnar type files. This class holds the build_query_compiler function, which performs file splitting, deploys remote tasks and postprocesses the results, and many other functions.
format/feature specific dispatchers: feather_dispatcher.py, hdf_dispatcher.py and parquet_dispatcher.py.
sql - directory for storing the SQL dispatcher class
sql_dispatcher.py - class for reading SQL queries or database tables.
Public API#
IO functions implementations.
- class modin.core.io.BaseIO#
Class for basic utils and default implementation of IO functions.
- classmethod from_arrow(at)#
Create a Modin query_compiler from a pyarrow.Table.
- Parameters:
at (Arrow Table) – The Arrow Table to convert from.
- Returns:
QueryCompiler containing data from the Arrow Table.
- Return type:
- classmethod from_dask(dask_obj)#
Create a Modin query_compiler from a Dask DataFrame.
- Parameters:
dask_obj (dask.dataframe.DataFrame) – The Dask DataFrame to convert from.
- Returns:
QueryCompiler containing data from the Dask DataFrame.
- Return type:
Notes
A Dask DataFrame can only be converted to a Modin DataFrame if Modin uses the Dask engine. If another engine is used, a runtime exception is raised.
- classmethod from_dataframe(df)#
Create a Modin QueryCompiler from a DataFrame supporting the DataFrame exchange protocol __dataframe__().
- Parameters:
df (DataFrame) – The DataFrame object supporting the DataFrame exchange protocol.
- Returns:
QueryCompiler containing data from the DataFrame.
- Return type:
- classmethod from_map(func, iterable, *args, **kwargs)#
Create a Modin query_compiler from a map function.
This method will construct a Modin query_compiler split by row partitions. The number of row partitions matches the number of elements in the iterable object.
- Parameters:
func (callable) – Function to map across the iterable object.
iterable (Iterable) – An iterable object.
*args (tuple) – Positional arguments to pass in func.
**kwargs (dict) – Keyword arguments to pass in func.
- Returns:
QueryCompiler containing data returned by map function.
- Return type:
- classmethod from_non_pandas(*args, **kwargs)#
Create a Modin query_compiler from a non-pandas object.
- Parameters:
*args (iterable) – Positional arguments to be passed into func.
**kwargs (dict) – Keyword arguments to be passed into func.
- classmethod from_pandas(df)#
Create a Modin query_compiler from a pandas.DataFrame.
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame to convert from.
- Returns:
QueryCompiler containing data from the pandas.DataFrame.
- Return type:
- classmethod from_ray(ray_obj)#
Create a Modin query_compiler from a Ray Dataset.
- Parameters:
ray_obj (ray.data.Dataset) – The Ray Dataset to convert from.
- Returns:
QueryCompiler containing data from the Ray Dataset.
- Return type:
Notes
A Ray Dataset can only be converted to a Modin DataFrame if Modin uses the Ray engine. If another engine is used, a runtime exception is raised.
- classmethod read_clipboard(sep='\\s+', **kwargs)#
Read text from clipboard into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_clipboard for more.
- classmethod read_csv(filepath_or_buffer, **kwargs)#
Read a comma-separated values (CSV) file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler or TextParser with read data.
- Return type:
BaseQueryCompiler or TextParser
Notes
See pandas API documentation for pandas.read_csv for more.
- classmethod read_excel(**kwargs)#
Read an Excel file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler or dict with read data.
- Return type:
BaseQueryCompiler or dict
Notes
See pandas API documentation for pandas.read_excel for more.
- classmethod read_feather(path, **kwargs)#
Load a feather-format object from the file path into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_feather for more.
- classmethod read_fwf(filepath_or_buffer, *, colspecs='infer', widths=None, infer_nrows=100, dtype_backend=_NoDefault.no_default, iterator=False, chunksize=None, **kwds)#
Read a table of fixed-width formatted lines into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler or TextParser with read data.
- Return type:
BaseQueryCompiler or TextParser
Notes
See pandas API documentation for pandas.read_fwf for more.
- classmethod read_gbq(query: str, project_id=None, index_col=None, col_order=None, reauth=False, auth_local_webserver=False, dialect=None, location=None, configuration=None, credentials=None, use_bqstorage_api=None, private_key=None, verbose=None, progress_bar_type=None, max_results=None)#
Load data from Google BigQuery into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_gbq for more.
- classmethod read_hdf(path_or_buf, key=None, mode: str = 'r', errors: str = 'strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)#
Read data from hdf store into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_hdf for more.
- classmethod read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, **kwargs)#
Read HTML tables into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_html for more.
- classmethod read_json(**kwargs)#
Convert a JSON string to query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_json for more.
- classmethod read_parquet(**kwargs)#
Load a parquet object from the file path, returning a query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_parquet for more.
- classmethod read_pickle(filepath_or_buffer, **kwargs)#
Load pickled pandas object (or any object) from file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_pickle for more.
- classmethod read_sas(filepath_or_buffer, *, format=None, index=None, encoding=None, chunksize=None, iterator=False, **kwargs)#
Read SAS files stored as either XPORT or SAS7BDAT format files into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_sas for more.
- classmethod read_spss(path, usecols, convert_categoricals, dtype_backend)#
Load an SPSS file from the file path, returning a query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_spss for more.
- classmethod read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default, dtype=None)#
Read SQL query or database table into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_sql for more.
- classmethod read_sql_query(sql, con, **kwargs)#
Read SQL query into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_sql_query for more.
- classmethod read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default)#
Read SQL database table into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_sql_table for more.
- classmethod read_stata(filepath_or_buffer, **kwargs)#
Read Stata file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns:
QueryCompiler with read data.
- Return type:
Notes
See pandas API documentation for pandas.read_stata for more.
- classmethod to_csv(obj, **kwargs)#
Write object to a comma-separated values (CSV) file using pandas.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_csv for more.
- classmethod to_dask(modin_obj)#
Convert a Modin DataFrame to a Dask DataFrame.
- Parameters:
modin_obj (modin.pandas.DataFrame, modin.pandas.Series) – The Modin DataFrame/Series to convert.
- Returns:
Converted object with type depending on input.
- Return type:
dask.dataframe.DataFrame or dask.dataframe.Series
Notes
A Modin DataFrame/Series can only be converted to a Dask DataFrame/Series if Modin uses the Dask engine. If another engine is used, a runtime exception is raised.
- classmethod to_json(obj, path, **kwargs)#
Convert the object to a JSON string.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_json for more.
- classmethod to_parquet(obj, path, **kwargs)#
Write object to the binary parquet format using pandas.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_parquet for more.
- classmethod to_pickle(obj: Any, filepath_or_buffer, **kwargs)#
Pickle (serialize) object to file.
Notes
See pandas API documentation for pandas.DataFrame.to_pickle for more.
- classmethod to_ray(modin_obj)#
Convert a Modin DataFrame/Series to a Ray Dataset.
- Parameters:
modin_obj (modin.pandas.DataFrame, modin.pandas.Series) – The Modin DataFrame/Series to convert.
- Returns:
Converted object with type depending on input.
- Return type:
ray.data.Dataset
Notes
A Modin DataFrame/Series can only be converted to a Ray Dataset if Modin uses the Ray engine. If another engine is used, a runtime exception is raised.
- classmethod to_sql(qc, name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)#
Write records stored in a DataFrame to a SQL database using pandas.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_sql for more.
- classmethod to_xml(obj, path_or_buffer, **kwargs)#
Convert the object to a XML string.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_xml for more.
- class modin.core.io.CSVDispatcher#
Class handles utils for reading .csv files.
- class modin.core.io.ExcelDispatcher#
Class handles utils for reading excel files.
- class modin.core.io.FWFDispatcher#
Class handles utils for reading of tables with fixed-width formatted lines.
- classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int) -> Tuple[bool, Optional[str]]#
Check support of parameters of read_fwf function.
- Parameters:
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_fwf function.
read_kwargs (dict) – Parameters of read_fwf function.
skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.
header_size (int) – Number of rows that are used by header.
- Returns:
bool – Whether passed parameters are supported or not.
Optional[str] – None if parameters are supported, otherwise an error message describing why parameters are not supported.
- class modin.core.io.FeatherDispatcher#
Class handles utils for reading .feather files.
- class modin.core.io.FileDispatcher#
Class handles util functions for reading data from different kinds of files.
Notes
_read, deploy, parse and materialize are abstract methods and should be implemented in the child classes (function signatures can differ between child classes).
- classmethod build_partition(partition_ids, row_lengths, column_widths)#
Build array with partitions of cls.frame_partition_cls class.
- Parameters:
partition_ids (list) – Array with references to the partitions data.
row_lengths (list) – Partitions rows lengths.
column_widths (list) – Number of columns in each partition.
- Returns:
Array with the same shape as partition_ids, filled with partition objects.
- Return type:
np.ndarray
- classmethod deploy(func, *args, num_returns=1, **kwargs)#
Deploy remote task.
Should be implemented in the task class (for example in the RayWrapper).
- classmethod file_exists(file_path, storage_options=None)#
Check if file_path exists.
- Parameters:
file_path (str) – String that represents the path to the file (paths to S3 buckets are also acceptable).
storage_options (dict, optional) – Keyword from read_* functions.
- Returns:
Whether file exists or not.
- Return type:
bool
- classmethod file_size(f)#
Get the size of file associated with file handle f.
- Parameters:
f (file-like object) – File-like object, that should be used to get file size.
- Returns:
File size in bytes.
- Return type:
int
- classmethod get_path(file_path)#
Process file_path according to its type.
- Parameters:
file_path (str, os.PathLike[str] object or file-like object) – The file, or a path to the file. Paths to S3 buckets are also acceptable.
- Returns:
Updated or verified file_path parameter.
- Return type:
str
Notes
If file_path is a URL, the parameter is returned as is; otherwise the absolute path is returned.
- classmethod materialize(obj_id)#
Get results from worker.
Should be implemented in the task class (for example in the RayWrapper).
- parse(func, args, num_returns)#
Parse file’s data in the worker process.
Should be implemented in the parser class (for example in the PandasCSVParser).
- classmethod read(*args, **kwargs)#
Read data according to the passed args and kwargs.
- Parameters:
*args (iterable) – Positional arguments to be passed into _read function.
**kwargs (dict) – Keyword arguments to be passed into _read function.
- Returns:
query_compiler – Query compiler with imported data for further processing.
- Return type:
Notes
read is a high-level function that calls the _read function specific to the defined storage format, engine and dispatcher class with the passed parameters, and performs some postprocessing work on the resulting query_compiler object.
- class modin.core.io.HDFDispatcher#
Class handles utils for reading hdf data.
Inherits some util functions common to columnar store files from the ColumnStoreDispatcher class.
- class modin.core.io.JSONDispatcher#
Class handles utils for reading .json files.
- class modin.core.io.ParquetDispatcher#
Class handles utils for reading .parquet files.
- classmethod build_index(dataset, partition_ids, index_columns, filters)#
Compute index and its split sizes of resulting Modin DataFrame.
- Parameters:
dataset (Dataset) – Dataset object of Parquet file/files.
partition_ids (list) – Array with references to the partitions data.
index_columns (list) – List of index columns specified by pandas metadata.
filters (list) – List of filters to be used in reading the Parquet file/files.
- Returns:
index (pandas.Index) – Index of resulting Modin DataFrame.
needs_index_sync (bool) – Whether the partition indices need to be synced with frame index because there’s no index column, or at least one index column is a RangeIndex.
Notes
See build_partition for more detail on the contents of partition_ids.
- classmethod build_partition(partition_ids, column_widths)#
Build array with partitions of cls.frame_partition_cls class.
- Parameters:
partition_ids (list) – Array with references to the partitions data.
column_widths (list) – Number of columns in each partition.
- Returns:
Array with the same shape as partition_ids, filled with partition objects.
- Return type:
np.ndarray
Notes
The second level of partition_ids contains a list of object references for each read call: partition_ids[i][j] -> [ObjectRef(df), ObjectRef(df.index), ObjectRef(len(df))].
- classmethod build_query_compiler(dataset, columns, index_columns, **kwargs)#
Build query compiler from deployed tasks outputs.
- Parameters:
dataset (Dataset) – Dataset object of Parquet file/files.
columns (list) – List of columns that should be read from file.
index_columns (list) – List of index columns specified by pandas metadata.
**kwargs (dict) – Parameters of deploying read_* function.
- Returns:
new_query_compiler – Query compiler with imported data for further processing.
- Return type:
- classmethod call_deploy(partition_files: list[list[ParquetFileToRead]], col_partitions: list[list[str]], storage_options: dict, engine: str, **kwargs)#
Deploy remote tasks to the workers with passed parameters.
- Parameters:
partition_files (list[list[ParquetFileToRead]]) – List of arrays with files that should be read by each partition.
col_partitions (list[list[str]]) – List of arrays with columns names that should be read by each partition.
storage_options (dict) – Parameters for specific storage engine.
engine ({"auto", "pyarrow", "fastparquet"}) – Parquet library to use for reading.
**kwargs (dict) – Parameters of deploying read_* function.
- Returns:
Array with references to the task deploy result for each partition.
- Return type:
List
- classmethod get_dataset(path, engine, storage_options)#
Retrieve Parquet engine specific Dataset implementation.
- Parameters:
path (str, path object or file-like object) – The filepath of the parquet file in local filesystem or hdfs.
engine (str) – Parquet library to use (only ‘PyArrow’ is supported for now).
storage_options (dict) – Parameters for specific storage engine.
- Returns:
Either a PyArrowDataset or FastParquetDataset object.
- Return type:
Dataset
- classmethod write(qc, **kwargs)#
Write a DataFrame to the binary parquet format.
- Parameters:
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_parquet on.
**kwargs (dict) – Parameters for pandas.to_parquet(**kwargs).
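The column-wise splitting used for columnar stores, and the shape of a col_partitions argument such as the one call_deploy accepts, can be sketched as below. The helper split_columns is hypothetical and only illustrates dividing column names into groups of approximately equal size; it does not reproduce the dispatcher's actual logic.

```python
def split_columns(columns, num_partitions):
    """Split column names into contiguous groups of near-equal size."""
    base, extra = divmod(len(columns), num_partitions)
    groups, start = [], 0
    for i in range(num_partitions):
        # The first `extra` groups get one additional column.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer columns than partitions
        groups.append(columns[start:start + size])
        start += size
    return groups


col_partitions = split_columns(["a", "b", "c", "d", "e"], num_partitions=2)
```

Each inner list names the columns that one partition would read.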
- class modin.core.io.SQLDispatcher#
Class handles utils for reading SQL queries or database tables.
- classmethod write(qc, **kwargs)#
Write records stored in the qc to a SQL database.
- Parameters:
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_sql on.
**kwargs (dict) – Parameters for pandas.to_sql(**kwargs).
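For reading, the SQL chunking idea (wrapping the initial query in a query that specifies a row offset and a row count) can be sketched with the standard-library sqlite3 module. The helper wrap_query, the alias _chunk and the table t are assumptions made for this example; the exact wrapping syntax depends on the SQL dialect.

```python
import sqlite3


def wrap_query(sql, limit, offset):
    """Wrap a query so that it returns `limit` rows starting at row `offset`."""
    return f"SELECT * FROM ({sql}) AS _chunk LIMIT {limit} OFFSET {offset}"


# Build a small in-memory table with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

query = "SELECT x FROM t ORDER BY x"
rows_per_chunk = 4
# One wrapped query per partition; each could run on a separate worker.
chunks = [
    conn.execute(wrap_query(query, rows_per_chunk, offset)).fetchall()
    for offset in range(0, 10, rows_per_chunk)
]
```

Concatenating the chunks reproduces the result of the original query.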
- class modin.core.io.TextFileDispatcher#
Class handles utils for reading text formats files.
- classmethod build_partition(partition_ids, row_lengths, column_widths)#
Build array with partitions of cls.frame_partition_cls class.
- Parameters:
partition_ids (list) – Array with references to the partitions data.
row_lengths (list) – Partitions rows lengths.
column_widths (list) – Number of columns in each partition.
- Returns:
Array with the same shape as partition_ids, filled with partition objects.
- Return type:
np.ndarray
- classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int) -> Tuple[bool, Optional[str]]#
Check support of only general parameters of read_* function.
- Parameters:
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_* function.
read_kwargs (dict) – Parameters of read_* function.
skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.
header_size (int) – Number of rows that are used by header.
- Returns:
bool – Whether passed parameters are supported or not.
Optional[str] – None if parameters are supported, otherwise an error message describing why parameters are not supported.
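The (bool, Optional[str]) return convention can be illustrated with a simplified checker. The function toy_check_parameters_support and the two rules it applies are assumptions chosen for this sketch; they are not the actual rules applied by the dispatcher, and the skiprows_md and header_size parameters are omitted for brevity.

```python
import os
from typing import Optional, Tuple


def toy_check_parameters_support(
    filepath_or_buffer, read_kwargs: dict
) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if supported, else (False, reason)."""
    # Assumed rule: only path-like inputs can be split across partitions.
    if not isinstance(filepath_or_buffer, (str, os.PathLike)):
        return False, "input is not a path"
    # Assumed rule: chunked reading is left to the fallback implementation.
    if read_kwargs.get("chunksize") is not None:
        return False, "`chunksize` parameter is not supported"
    return True, None


supported, reason = toy_check_parameters_support("data.csv", {"chunksize": 100})
```

A caller would fall back to the default implementation whenever the first element is False, using the second element as the explanation.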
- classmethod compute_newline(file_like, encoding, quotechar)#
Compute byte or sequence of bytes indicating line endings.
- Parameters:
file_like (file-like object) – File handle that should be used for line endings computing.
encoding (str) – Encoding of file_like.
quotechar (str) – Quotechar used for parsing file-like.
- Returns:
line endings
- Return type:
bytes
- classmethod get_path_or_buffer(filepath_or_buffer)#
Extract path from filepath_or_buffer.
- Parameters:
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.
- Returns:
verified filepath_or_buffer parameter.
- Return type:
str or path object
Notes
Given a buffer, try and extract the filepath from it so that we can use it without having to fall back to pandas and share file objects between workers. Given a filepath, return it immediately.
- classmethod offset(f, offset_size: int, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None)#
Move the file offset by the specified number of bytes.
- Parameters:
f (file-like object) – File handle that should be used for offset movement.
offset_size (int) – Number of bytes to read and ignore.
quotechar (bytes, default: b'"') – Indicate quote in a file.
is_quoting (bool, default: True) – Whether or not to consider quotes.
encoding (str, optional) – Encoding of f.
newline (bytes, optional) – Byte or sequence of bytes indicating line endings.
- Returns:
False if the file pointer reached the end of the file without finding a closing quote; True in any other case.
- Return type:
bool
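The documented behavior (skip a number of bytes, finish the current line, and report whether a quoted field was left open) can be sketched as follows. This standalone function is a simplified illustration written under those assumptions; it does not handle the encoding or newline parameters and is not the actual implementation.

```python
import io


def offset(f, offset_size, quotechar=b'"', is_quoting=True):
    """Skip `offset_size` bytes, then move to the end of the current row."""
    chunk = f.read(offset_size)
    outside_quotes = True
    if is_quoting:
        # An odd number of quote characters means we stopped inside a quoted field.
        outside_quotes = chunk.count(quotechar) % 2 == 0
    while True:
        line = f.readline()
        if not line:
            break  # end of file reached
        if is_quoting and line.count(quotechar) % 2:
            outside_quotes = not outside_quotes
        if outside_quotes:
            break  # the row is complete
    return outside_quotes


f = io.BytesIO(b'a,"x\ny",b\nc,d\n')
closed = offset(f, 4)
position = f.tell()
```

Here the first four bytes end inside the quoted field "x\ny", so reading continues past the embedded newline until the quote is closed and the row ends at byte 10.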
- classmethod partitioned_file(f, num_partitions: int = None, nrows: int = None, skiprows: int = None, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None, header_size: int = 0, pre_reading: int = 0, get_metadata_kw: dict = None)#
Compute chunk sizes in bytes for every partition.
- Parameters:
f (file-like object) – File handle of file to be partitioned.
num_partitions (int, optional) – Number of partitions to split the file into. If not specified, the value is taken from modin.config.NPartitions.get().
nrows (int, optional) – Number of rows of file to read.
skiprows (int, optional) – Specifies rows to skip.
quotechar (bytes, default: b'"') – Indicate quote in a file.
is_quoting (bool, default: True) – Whether or not to consider quotes.
encoding (str, optional) – Encoding of f.
newline (bytes, optional) – Byte or sequence of bytes indicating line endings.
header_size (int, default: 0) – Number of rows occupied by the header.
pre_reading (int, default: 0) – Number of rows between the header and the skipped rows that should be read.
get_metadata_kw (dict, optional) – Keyword arguments for cls.read_callback to compute metadata if needed. This option is not compatible with pre_reading!=0.
- Returns:
list – List with the following elements:
int : partition start read byte
int : partition end read byte
pandas.DataFrame or None – Dataframe from which metadata can be retrieved. Can be None if get_metadata_kw=None.
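When row-related parameters are involved, chunk limits are defined by line counts rather than by file size. A minimal sketch of that variant follows; split_by_rows is a hypothetical helper, it represents each chunk as a (start, end) pair of byte offsets, and it ignores quoting and metadata computation.

```python
import io


def split_by_rows(f, rows_per_partition, header_size=0):
    """Return (start, end) byte offsets, each covering `rows_per_partition` rows."""
    # Skip the header rows so they are not assigned to any partition.
    for _ in range(header_size):
        f.readline()
    ranges = []
    while True:
        start = f.tell()
        rows_read = 0
        while rows_read < rows_per_partition and f.readline():
            rows_read += 1
        if rows_read == 0:
            break  # nothing left to read
        ranges.append((start, f.tell()))
        if rows_read < rows_per_partition:
            break  # last, shorter chunk
    return ranges


data = b"h\n1\n2\n3\n4\n5\n"
row_ranges = split_by_rows(io.BytesIO(data), rows_per_partition=2, header_size=1)
```

Each worker can then seek to its start byte and read up to its end byte independently.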
- classmethod pathlib_or_pypath(filepath_or_buffer)#
Check if filepath_or_buffer is instance of py.path.local or pathlib.Path.
- Parameters:
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.
- Returns:
Whether or not filepath_or_buffer is instance of py.path.local or pathlib.Path.
- Return type:
bool
- classmethod preprocess_func()#
Prepare a function for transmission to remote workers.
- classmethod rows_skipper_builder(f, quotechar, is_quoting, encoding=None, newline=None)#
Build object for skipping passed number of lines.
- Parameters:
f (file-like object) – File handle that should be used for offset movement.
quotechar (bytes) – Indicate quote in a file.
is_quoting (bool) – Whether or not to consider quotes.
encoding (str, optional) – Encoding of f.
newline (bytes, optional) – Byte or sequence of bytes indicating line endings.
- Returns:
skipper object.
- Return type:
object
Handling skiprows Parameter#
Handling the skiprows parameter in pandas import functions can be very tricky, especially for the read_csv function, because of its interconnection with the header parameter. This section covers the techniques of skiprows processing used by both pandas and Modin.
Processing skiprows by pandas#
Let’s consider a simple snippet with pandas.read_csv in order to understand the interconnection of the header and skiprows parameters:
import pandas
from io import StringIO
data = """0
1
2
3
4
5
6
7
8
"""
# `header` parameter absence is equivalent to `header="infer"` or `header=0`
# rows 1, 5, 6, 7, 8 are read with header "0"
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4])
# rows 5, 6, 7, 8 are read with header "1", row 0 is skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=1)
# rows 6, 7, 8 are read with header "5", rows 0, 1 are skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=2)
In the examples above, the list-like skiprows value is fixed and header is varied. In the first example, with no header provided, rows 2, 3, 4 are skipped and row 0 is considered the header. In the second example header == 1, so the zeroth row is skipped and the next available row is considered the header. The third example illustrates the case when both the header and skiprows parameter values are present: the skiprows rows are dropped first, and then the header is derived from the remaining rows (rows before the header are skipped too).
The examples above consider only list-like skiprows and integer header parameters, but the same logic applies to other types of these parameters.
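The rule described above (drop the skiprows rows first, then pick the header from what remains) can be modeled in plain Python. The function split_header_and_data is a hypothetical model of this one case (list-like skiprows, integer header); it does not call pandas and does not cover the other parameter types or blank-line handling.

```python
def split_header_and_data(lines, skiprows, header=0):
    """Model which line becomes the header and which lines become data."""
    skip = set(skiprows)
    # Step 1: drop the rows listed in skiprows.
    remaining = [line for i, line in enumerate(lines) if i not in skip]
    # Step 2: the header is taken from the remaining rows;
    # rows before it are dropped as well.
    return remaining[header], remaining[header + 1:]


lines = [str(i) for i in range(9)]
first = split_header_and_data(lines, skiprows=[2, 3, 4])
second = split_header_and_data(lines, skiprows=[2, 3, 4], header=1)
third = split_header_and_data(lines, skiprows=[2, 3, 4], header=2)
```

The three results match the comments in the pandas snippet: headers "0", "1" and "5" respectively.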
Processing skiprows by Modin#
As can be seen, skipping rows in the pandas import functions is complicated, and distributing this logic across multiple workers can complicate it even more. Thus, in some rare corner cases the default pandas implementation is used in Modin to avoid excessive code complexity.
Modin uses two techniques for skipping rows:
1) During file partitioning (setting the file limits that should be read by each partition), exact rows can be excluded from the partitioning scope, so they won’t be read at all and can be considered skipped. This is the most efficient way of skipping rows since it doesn’t require any actual data reading or postprocessing, but in this case the skiprows parameter can only be an integer. Modin uses this approach whenever possible.
2) Rows to skip can be dropped after the full dataset is imported. This is more expensive since it requires extra IO work and postprocessing afterwards, but the skiprows parameter can be of any non-integer type supported by pandas.read_csv.
In some cases, if skiprows is a uniformly distributed array (e.g. [1, 2, 3]), skiprows can be “squashed” and represented as an integer to take a fast path by skipping these rows during file partitioning (using the first option). But if there is a gap between the first row to skip and the last line of the header (which is skipped too, since the header is read by each partition to ensure metadata is defined properly), then the rows in this gap must be read first, by assigning the first partition to read them via the pre_reading parameter.
Let’s consider an example of skipping rows during partitioning when header="infer" and skiprows=[3, 4, 5]. In this specific case the fast path can be taken since skiprows is a uniformly distributed array, so we can “squash” it to an integer and set the “partitioning” skiprows to 3. But if no additional action is taken, these three rows will be skipped right after the header line, which corresponds to skiprows=[1, 2, 3]. To avoid this discrepancy, we need to assign the first partition to read the data between the header line and the first row to skip by setting the special pre_reading parameter to 2. Then, after the rows marked for skipping during partitioning are skipped, the remaining data is divided between the rest of the partitions; see the row assignment below:
0 - header line (skip during partitioning)
1 - pre reading (assign to read by the first partition)
2 - pre reading (assign to read by the first partition)
3 - "partitioning" skiprows (skip during partitioning)
4 - "partitioning" skiprows (skip during partitioning)
5 - "partitioning" skiprows (skip during partitioning)
6 - data to partition (divide between the rest of partitions)
7 - data to partition (divide between the rest of partitions)
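The arithmetic behind this example can be sketched as follows. The function squash_skiprows is a hypothetical helper, assuming that "uniformly distributed" means a run of consecutive row numbers; it returns the pre_reading value and the integer "partitioning" skiprows, or None when the fast path does not apply.

```python
def squash_skiprows(skiprows, header_size):
    """Return (pre_reading, skiprows_int) for a consecutive run, else None."""
    rows = sorted(skiprows)
    if not rows:
        return 0, 0
    consecutive = all(b - a == 1 for a, b in zip(rows, rows[1:]))
    # Rows overlapping the header are outside the scope of this sketch.
    if not consecutive or rows[0] < header_size:
        return None
    # Rows between the header and the first skipped row must be pre-read.
    pre_reading = rows[0] - header_size
    return pre_reading, len(rows)


# header="infer" occupies one line (row 0), so header_size is 1.
fast_path = squash_skiprows([3, 4, 5], header_size=1)
```

For skiprows=[3, 4, 5] this gives pre_reading equal to 2 and an integer skiprows of 3, matching the row assignment above, while skiprows=[1, 2, 3] needs no pre-reading.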