IO Module Description
Dispatcher Classes Workflow Overview
Calls from the read_* functions of execution-specific IO classes (for example, PandasOnRayIO for the Ray engine and pandas storage format) are forwarded to the _read function of the file format-specific class (for example, CSVDispatcher for CSV files). There, function parameters are preprocessed to check whether they are supported (defaulting to pandas if not), and metadata common to all partitions is computed. The file is then split into chunks (the splitting mechanism is described below), and the chunk data is used to launch tasks on the remote workers. After the remote tasks finish, additional postprocessing is performed on the results, and a new query compiler with the imported data is returned.
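The flow above can be sketched in plain Python. Every name here is hypothetical and chosen only for illustration: ToyCSVDispatcher is not a Modin class, a thread pool stands in for the remote workers, and a plain dict stands in for the query compiler.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

SUPPORTED_KWARGS = {"delimiter"}  # assumption: the only option this toy dispatcher handles


def parse_chunk(chunk, delimiter):
    """Worker-side parser: turn one chunk of text into a list of rows."""
    return list(csv.reader(io.StringIO(chunk), delimiter=delimiter))


class ToyCSVDispatcher:
    """Hypothetical stand-in for a format-specific dispatcher."""

    @classmethod
    def read(cls, text, num_partitions=2, **kwargs):
        # 1. Preprocess parameters; fall back to a single-process read if unsupported.
        if set(kwargs) - SUPPORTED_KWARGS:
            return cls._fallback(text)
        delimiter = kwargs.get("delimiter", ",")
        # 2. Compute metadata shared by all partitions (here: the column names).
        header, _, body = text.partition("\n")
        columns = header.split(delimiter)
        # 3. Split the body into chunks made of whole lines.
        lines = body.splitlines(keepends=True)
        step = max(1, -(-len(lines) // num_partitions))  # ceiling division
        chunks = ["".join(lines[i:i + step]) for i in range(0, len(lines), step)]
        # 4. Launch one task per chunk (threads stand in for remote workers).
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(parse_chunk, chunks, [delimiter] * len(chunks)))
        # 5. Postprocess: concatenate the partition results in order.
        rows = [row for part in parts for row in part]
        return {"columns": columns, "rows": rows}

    @classmethod
    def _fallback(cls, text):
        """Default path: parse everything in the calling process."""
        rows = list(csv.reader(io.StringIO(text)))
        return {"columns": rows[0], "rows": rows[1:]}


text = "a,b\n1,2\n3,4\n5,6\n"
distributed = ToyCSVDispatcher.read(text, num_partitions=2)
defaulted = ToyCSVDispatcher.read(text, quotechar="'")  # unsupported option
```

Both calls produce the same result; the second takes the fallback path because quotechar is not in the set of options the toy dispatcher supports.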
Data File Splitting Mechanism
Modin’s file splitting mechanism differs depending on the data format type:
text format type - the file is split into byte ranges according to user-specified arguments. In the simplest case, when no row-related parameters (such as nrows or skiprows) are passed, data chunk limits (start and end bytes) are derived by dividing the file size by the number of partitions. Chunks can differ slightly in size because the computed end byte usually falls inside a line, in which case the last byte of that line is used instead of the initial value. In other cases the same splitting mechanism is used, but chunk sizes are defined according to the number of lines that each partition should contain.
columnar store type - the file is split so that each chunk contains approximately the same number of columns.
SQL type - chunking is obtained by wrapping the initial SQL query with a query that specifies the initial row offset and the number of rows in the chunk.
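As an illustration of the SQL case, the sketch below wraps a query with LIMIT and OFFSET clauses and runs the resulting chunk queries against an in-memory SQLite database. The function name chunk_queries, the alias _wrapped and the exact SQL text are assumptions made for this example; the SQL that Modin actually generates is not shown here and may differ.

```python
import sqlite3


def chunk_queries(sql, total_rows, num_partitions):
    """Wrap `sql` so that each resulting query returns one contiguous slice of rows."""
    rows_per_chunk = max(1, -(-total_rows // num_partitions))  # ceiling division
    return [
        f"SELECT * FROM ({sql}) AS _wrapped LIMIT {rows_per_chunk} OFFSET {offset}"
        for offset in range(0, total_rows, rows_per_chunk)
    ]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

query = "SELECT x FROM t ORDER BY x"
total = con.execute(f"SELECT COUNT(*) FROM ({query}) AS _wrapped").fetchone()[0]
chunks = [con.execute(q).fetchall() for q in chunk_queries(query, total, 3)]
```

Slicing by offset is only meaningful when the wrapped query returns rows in a stable order, which is why the example query carries an ORDER BY clause.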
After file splitting is complete, the chunk data is passed to the parser functions (for example, PandasCSVParser.parse for the read_csv function with the pandas storage format) for further processing on each worker.
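The simplest text-format case (no row-related parameters) can be sketched as follows. This is a minimal illustration rather than Modin's code: the helper name split_into_byte_ranges is hypothetical, header handling is omitted, and newlines inside quoted fields are ignored.

```python
import io


def split_into_byte_ranges(f, num_partitions):
    """Return (start, end) byte offsets whose boundaries fall on line ends."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    chunk_size = max(1, file_size // num_partitions)
    ranges = []
    start = 0
    while start < file_size:
        # Tentative end byte obtained from the even split.
        pos = min(start + chunk_size, file_size)
        f.seek(pos - 1)
        if f.read(1) != b"\n":
            # The tentative end fell inside a line: move to the end of that line.
            f.readline()
        end = f.tell()
        ranges.append((start, end))
        start = end
    return ranges


data = b"a,b\n1,2\n33,44\n555,666\n7,8\n"
ranges = split_into_byte_ranges(io.BytesIO(data), num_partitions=3)
chunks = [data[start:end] for start, end in ranges]
```

For this input the even split would place the second boundary in the middle of the 555,666 line, so the second chunk is extended to the end of that line and ends up larger than the other two.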
Submodules Description
The modin.core.io module is used mostly for storing utils and dispatcher classes for reading files of different formats.
io.py - class containing basic utils and default implementation of IO functions.
file_dispatcher.py - class reading data from different kinds of files and handling some util functions common to all formats. This class also contains the read function, which is the entry point for the _read functions of all dispatchers.
text - directory for storing all text file format dispatcher classes
    text_file_dispatcher.py - class for reading text format files. This class holds the partitioned_file function for splitting text format files into chunks, the offset function for moving the file offset by the specified number of bytes, the _read_rows function for moving the file offset by the specified number of rows, and many other functions.
    format/feature specific dispatchers: csv_dispatcher.py, excel_dispatcher.py, fwf_dispatcher.py and json_dispatcher.py.
column_stores - directory for storing all columnar store file format dispatcher classes
    column_store_dispatcher.py - class for reading columnar type files. This class holds the build_query_compiler function, which performs file splitting, deployment of remote tasks and postprocessing of results, and many other functions.
    format/feature specific dispatchers: feather_dispatcher.py, hdf_dispatcher.py and parquet_dispatcher.py.
sql - directory for storing SQL dispatcher class
    sql_dispatcher.py - class for reading SQL queries or database tables.
Public API
IO functions implementations.
- class modin.core.io.BaseIO#
Class for basic utils and default implementation of IO functions.
- classmethod from_arrow(at)#
Create a Modin query_compiler from a pyarrow.Table.
- Parameters
at (Arrow Table) – The Arrow Table to convert from.
- Returns
QueryCompiler containing data from the Arrow Table.
- Return type
BaseQueryCompiler
- classmethod from_dataframe(df)#
Create a Modin QueryCompiler from a DataFrame supporting the DataFrame exchange protocol __dataframe__().
- Parameters
df (DataFrame) – The DataFrame object supporting the DataFrame exchange protocol.
- Returns
QueryCompiler containing data from the DataFrame.
- Return type
BaseQueryCompiler
- classmethod from_non_pandas(*args, **kwargs)#
Create a Modin query_compiler from a non-pandas object.
- Parameters
*args (iterable) – Positional arguments to be passed into func.
**kwargs (dict) – Keyword arguments to be passed into func.
- classmethod from_pandas(df)#
Create a Modin query_compiler from a pandas.DataFrame.
- Parameters
df (pandas.DataFrame) – The pandas DataFrame to convert from.
- Returns
QueryCompiler containing data from the pandas.DataFrame.
- Return type
BaseQueryCompiler
- classmethod read_clipboard(sep='\\s+', **kwargs)#
Read text from clipboard into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_clipboard for more.
- classmethod read_csv(filepath_or_buffer, **kwargs)#
Read a comma-separated values (CSV) file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler or TextParser with read data.
- Return type
BaseQueryCompiler or TextParser
Notes
See pandas API documentation for pandas.read_csv for more.
- classmethod read_excel(**kwargs)#
Read an Excel file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler or OrderedDict/dict with read data.
- Return type
BaseQueryCompiler or dict/OrderedDict
Notes
See pandas API documentation for pandas.read_excel for more.
- classmethod read_feather(path, **kwargs)#
Load a feather-format object from the file path into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_feather for more.
- classmethod read_fwf(filepath_or_buffer, colspecs='infer', widths=None, infer_nrows=100, **kwds)#
Read a table of fixed-width formatted lines into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler or TextParser with read data.
- Return type
BaseQueryCompiler or TextParser
Notes
See pandas API documentation for pandas.read_fwf for more.
- classmethod read_gbq(query: str, project_id=None, index_col=None, col_order=None, reauth=False, auth_local_webserver=False, dialect=None, location=None, configuration=None, credentials=None, use_bqstorage_api=None, private_key=None, verbose=None, progress_bar_type=None, max_results=None)#
Load data from Google BigQuery into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_gbq for more.
- classmethod read_hdf(path_or_buf, key=None, mode: str = 'r', errors: str = 'strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)#
Read data from hdf store into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_hdf for more.
- classmethod read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, **kwargs)#
Read HTML tables into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_html for more.
- classmethod read_json(**kwargs)#
Convert a JSON string to query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_json for more.
- classmethod read_parquet(**kwargs)#
Load a parquet object from the file path, returning a query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_parquet for more.
- classmethod read_pickle(filepath_or_buffer, **kwargs)#
Load pickled pandas object (or any object) from file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_pickle for more.
- classmethod read_sas(filepath_or_buffer, format=None, index=None, encoding=None, chunksize=None, iterator=False, **kwargs)#
Read SAS files stored as either XPORT or SAS7BDAT format files into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_sas for more.
- classmethod read_spss(path, usecols, convert_categoricals)#
Load an SPSS file from the file path, returning a query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_spss for more.
- classmethod read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)#
Read SQL query or database table into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_sql for more.
- classmethod read_sql_query(sql, con, **kwargs)#
Read SQL query into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_sql_query for more.
- classmethod read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None)#
Read SQL database table into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_sql_table for more.
- classmethod read_stata(filepath_or_buffer, **kwargs)#
Read Stata file into query compiler using pandas. For parameters description please refer to pandas API.
- Returns
QueryCompiler with read data.
- Return type
BaseQueryCompiler
Notes
See pandas API documentation for pandas.read_stata for more.
- classmethod to_csv(obj, **kwargs)#
Write object to a comma-separated values (CSV) file using pandas.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_csv for more.
- classmethod to_parquet(obj, **kwargs)#
Write object to the binary parquet format using pandas.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_parquet for more.
- classmethod to_pickle(obj: Any, filepath_or_buffer, **kwargs)#
Pickle (serialize) object to file.
Notes
See pandas API documentation for pandas.DataFrame.to_pickle for more.
- classmethod to_sql(qc, name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)#
Write records stored in a DataFrame to a SQL database using pandas.
For parameters description please refer to pandas API.
Notes
See pandas API documentation for pandas.DataFrame.to_sql for more.
- class modin.core.io.CSVDispatcher#
Class handles utils for reading .csv files.
- class modin.core.io.ExcelDispatcher#
Class handles utils for reading excel files.
- class modin.core.io.FWFDispatcher#
Class handles utils for reading of tables with fixed-width formatted lines.
- classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int) -> Tuple[bool, Optional[str]]#
Check support of parameters of read_fwf function.
- Parameters
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_fwf function.
read_kwargs (dict) – Parameters of read_fwf function.
skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.
header_size (int) – Number of rows that are used by header.
- Returns
bool – Whether passed parameters are supported or not.
Optional[str] – None if parameters are supported, otherwise an error message describing why parameters are not supported.
- class modin.core.io.FeatherDispatcher#
Class handles utils for reading .feather files.
- class modin.core.io.FileDispatcher#
Class handles util functions for reading data from different kinds of files.
Notes
_read, deploy, parse and materialize are abstract methods and should be implemented in the child classes (function signatures can differ between child classes).
- classmethod build_partition(partition_ids, row_lengths, column_widths)#
Build array with partitions of cls.frame_partition_cls class.
- Parameters
partition_ids (list) – Array with references to the partitions data.
row_lengths (list) – Partitions rows lengths.
column_widths (list) – Number of columns in each partition.
- Returns
Array with shape equal to the shape of partition_ids and filled with partition objects.
- Return type
np.ndarray
- classmethod deploy(func, *args, num_returns=1, **kwargs)#
Deploy remote task.
Should be implemented in the task class (for example in the RayWrapper).
- classmethod file_exists(file_path, storage_options=None)#
Check if file_path exists.
- Parameters
file_path (str) – String that represents the path to the file (paths to S3 buckets are also acceptable).
storage_options (dict, optional) – Keyword from read_* functions.
- Returns
Whether file exists or not.
- Return type
bool
- classmethod file_size(f)#
Get the size of file associated with file handle f.
- Parameters
f (file-like object) – File-like object, that should be used to get file size.
- Returns
File size in bytes.
- Return type
int
- classmethod get_path(file_path)#
Process file_path according to its type.
- Parameters
file_path (str, os.PathLike[str] object or file-like object) – The file, or a path to the file. Paths to S3 buckets are also acceptable.
- Returns
Updated or verified file_path parameter.
- Return type
str
Notes
If file_path is a URL, the parameter is returned as is; otherwise the absolute path is returned.
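The rule in the note can be illustrated with a small sketch. The helper name normalize_path and the regular expression used to recognise a URL are assumptions made for this example; the sketch handles string paths only and does not reproduce how get_path itself detects URLs or treats file-like objects.

```python
import os
import re

# Assumption: anything that starts with "scheme://" is treated as a URL.
URL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def normalize_path(file_path):
    """Return URLs unchanged and turn local paths into absolute paths."""
    if URL_PATTERN.match(file_path):
        return file_path
    return os.path.abspath(file_path)


remote = normalize_path("s3://bucket/data.csv")
local = normalize_path("data.csv")
```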
- classmethod materialize(obj_id)#
Get results from worker.
Should be implemented in the task class (for example in the RayWrapper).
- parse(func, args, num_returns)#
Parse file’s data in the worker process.
Should be implemented in the parser class (for example in the PandasCSVParser).
- classmethod read(*args, **kwargs)#
Read data according to the passed args and kwargs.
- Parameters
*args (iterable) – Positional arguments to be passed into _read function.
**kwargs (dict) – Keyword arguments to be passed into _read function.
- Returns
query_compiler – Query compiler with imported data for further processing.
- Return type
BaseQueryCompiler
Notes
read is a high-level function that calls the _read function specific to the defined storage format, engine and dispatcher class with the passed parameters and performs some postprocessing work on the resulting query_compiler object.
- class modin.core.io.HDFDispatcher#
Class handles utils for reading hdf data.
Inherits some util functions common to columnar store files from the ColumnStoreDispatcher class.
- class modin.core.io.JSONDispatcher#
Class handles utils for reading .json files.
- class modin.core.io.ParquetDispatcher#
Class handles utils for reading .parquet files.
- classmethod build_index(dataset, partition_ids, index_columns)#
Compute index and its split sizes of resulting Modin DataFrame.
- Parameters
dataset (Dataset) – Dataset object of Parquet file/files.
partition_ids (list) – Array with references to the partitions data.
index_columns (list) – List of index columns specified by pandas metadata.
- Returns
index (pandas.Index) – Index of resulting Modin DataFrame.
needs_index_sync (bool) – Whether the partition indices need to be synced with frame index because there’s no index column, or at least one index column is a RangeIndex.
Notes
See build_partition for more detail on the contents of partition_ids.
- classmethod build_partition(partition_ids, column_widths)#
Build array with partitions of cls.frame_partition_cls class.
- Parameters
partition_ids (list) – Array with references to the partitions data.
column_widths (list) – Number of columns in each partition.
- Returns
Array with shape equal to the shape of partition_ids and filled with partition objects.
- Return type
np.ndarray
Notes
The second level of partition_ids contains a list of object references for each read call: partition_ids[i][j] -> [ObjectRef(df), ObjectRef(df.index), ObjectRef(len(df))].
- classmethod build_query_compiler(dataset, columns, index_columns, **kwargs)#
Build query compiler from deployed tasks outputs.
- Parameters
dataset (Dataset) – Dataset object of Parquet file/files.
columns (list) – List of columns that should be read from file.
index_columns (list) – List of index columns specified by pandas metadata.
**kwargs (dict) – Parameters of deploying read_* function.
- Returns
new_query_compiler – Query compiler with imported data for further processing.
- Return type
BaseQueryCompiler
- classmethod call_deploy(dataset, col_partitions, storage_options, **kwargs)#
Deploy remote tasks to the workers with passed parameters.
- Parameters
dataset (Dataset) – Dataset object of Parquet file/files.
col_partitions (list) – List of arrays with columns names that should be read by each partition.
storage_options (dict) – Parameters for specific storage engine.
**kwargs (dict) – Parameters of deploying read_* function.
- Returns
Array with references to the task deploy result for each partition.
- Return type
List
- classmethod get_dataset(path, engine, storage_options)#
Retrieve Parquet engine specific Dataset implementation.
- Parameters
path (str, path object or file-like object) – The filepath of the parquet file in local filesystem or hdfs.
engine (str) – Parquet library to use (only ‘PyArrow’ is supported for now).
storage_options (dict) – Parameters for specific storage engine.
- Returns
Either a PyArrowDataset or FastParquetDataset object.
- Return type
Dataset
- classmethod write(qc, **kwargs)#
Write a DataFrame to the binary parquet format.
- Parameters
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_parquet on.
**kwargs (dict) – Parameters for pandas.to_parquet(**kwargs).
- class modin.core.io.SQLDispatcher#
Class handles utils for reading SQL queries or database tables.
- classmethod write(qc, **kwargs)#
Write records stored in the qc to a SQL database.
- Parameters
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_sql on.
**kwargs (dict) – Parameters for pandas.to_sql(**kwargs).
- class modin.core.io.TextFileDispatcher#
Class handles utils for reading text format files.
- classmethod build_partition(partition_ids, row_lengths, column_widths)#
Build array with partitions of cls.frame_partition_cls class.
- Parameters
partition_ids (list) – Array with references to the partitions data.
row_lengths (list) – Partitions rows lengths.
column_widths (list) – Number of columns in each partition.
- Returns
Array with shape equal to the shape of partition_ids and filled with partition objects.
- Return type
np.ndarray
- classmethod check_parameters_support(filepath_or_buffer, read_kwargs: dict, skiprows_md: Union[Sequence, callable, int], header_size: int) -> Tuple[bool, Optional[str]]#
Check support of only general parameters of read_* function.
- Parameters
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_* function.
read_kwargs (dict) – Parameters of read_* function.
skiprows_md (int, array or callable) – skiprows parameter modified for easier handling by Modin.
header_size (int) – Number of rows that are used by header.
- Returns
bool – Whether passed parameters are supported or not.
Optional[str] – None if parameters are supported, otherwise an error message describing why parameters are not supported.
- classmethod compute_newline(file_like, encoding, quotechar)#
Compute byte or sequence of bytes indicating line endings.
- Parameters
file_like (file-like object) – File handle that should be used for line endings computing.
encoding (str) – Encoding of file_like.
quotechar (str) – Quotechar used for parsing file-like.
- Returns
line endings
- Return type
bytes
- classmethod get_path_or_buffer(filepath_or_buffer)#
Extract path from filepath_or_buffer.
- Parameters
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.
- Returns
verified filepath_or_buffer parameter.
- Return type
str or path object
Notes
Given a buffer, try and extract the filepath from it so that we can use it without having to fall back to pandas and share file objects between workers. Given a filepath, return it immediately.
- classmethod offset(f, offset_size: int, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None)#
Move the file offset by the specified number of bytes.
- Parameters
f (file-like object) – File handle that should be used for offset movement.
offset_size (int) – Number of bytes to read and ignore.
quotechar (bytes, default: b'"') – Indicate quote in a file.
is_quoting (bool, default: True) – Whether or not to consider quotes.
encoding (str, optional) – Encoding of f.
newline (bytes, optional) – Byte or sequence of bytes indicating line endings.
- Returns
False if the file pointer reached the end of the file without finding a closing quote; True in any other case.
- Return type
bool
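A quote-aware offset move of this kind can be sketched as below. This is an illustration of the idea, not the body of the offset method: the function name move_offset is hypothetical, only the quotechar option is modelled, and the starting position is assumed to be outside any quoted field.

```python
import io


def move_offset(f, offset_size, quotechar=b'"'):
    """Advance `f` by about `offset_size` bytes, stopping after a line end outside quotes."""
    chunk = f.read(offset_size)
    # An odd number of quote characters means the chunk ended inside a quoted field.
    outside_quotes = chunk.count(quotechar) % 2 == 0
    while True:
        line = f.readline()
        if not line:
            # End of file: report whether every opened quote was closed.
            return outside_quotes
        if line.count(quotechar) % 2 == 1:
            outside_quotes = not outside_quotes
        if outside_quotes:
            return True


data = b'id,text\n1,"multi\nline"\n2,plain\n'
f = io.BytesIO(data)
found_boundary = move_offset(f, 12)
boundary = f.tell()
```

Here the 12-byte offset lands inside the quoted field, so the newline after multi is not treated as a record end and the file position stops only after the closing quote's line, at byte 23.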
- classmethod partitioned_file(f, num_partitions: int = None, nrows: int = None, skiprows: int = None, quotechar: bytes = b'"', is_quoting: bool = True, encoding: str = None, newline: bytes = None, header_size: int = 0, pre_reading: int = 0, read_callback_kw: dict = None)#
Compute chunk sizes in bytes for every partition.
- Parameters
f (file-like object) – File handle of file to be partitioned.
num_partitions (int, optional) – Number of partitions to split the file into. If not specified, the value is taken from modin.config.NPartitions.get().
nrows (int, optional) – Number of rows of file to read.
skiprows (int, optional) – Specifies rows to skip.
quotechar (bytes, default: b'"') – Indicate quote in a file.
is_quoting (bool, default: True) – Whether or not to consider quotes.
encoding (str, optional) – Encoding of f.
newline (bytes, optional) – Byte or sequence of bytes indicating line endings.
header_size (int, default: 0) – Number of rows occupied by the header.
pre_reading (int, default: 0) – Number of rows between header and skipped rows, that should be read.
read_callback_kw (dict, optional) – Keyword arguments for cls.read_callback to compute metadata if needed. This option is not compatible with pre_reading!=0.
- Returns
list – List with the following elements:
int : partition start read byte
int : partition end read byte
pandas.DataFrame or None – Dataframe from which metadata can be retrieved. Can be None if read_callback_kw=None.
- classmethod pathlib_or_pypath(filepath_or_buffer)#
Check if filepath_or_buffer is instance of py.path.local or pathlib.Path.
- Parameters
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.
- Returns
Whether or not filepath_or_buffer is instance of py.path.local or pathlib.Path.
- Return type
bool
- classmethod preprocess_func()#
Prepare a function for transmission to remote workers.
- classmethod rows_skipper_builder(f, quotechar, is_quoting, encoding=None, newline=None)#
Build object for skipping passed number of lines.
- Parameters
f (file-like object) – File handle that should be used for offset movement.
quotechar (bytes) – Indicate quote in a file.
is_quoting (bool) – Whether or not to consider quotes.
encoding (str, optional) – Encoding of f.
newline (bytes, optional) – Byte or sequence of bytes indicating line endings.
- Returns
skipper object.
- Return type
object
Handling skiprows Parameter
Handling the skiprows parameter in pandas import functions can be tricky, especially for the read_csv function, because of its interconnection with the header parameter. This section covers the techniques of skiprows processing used by both pandas and Modin.
Processing skiprows by pandas
Let’s consider a simple snippet with pandas.read_csv in order to understand the interconnection of the header and skiprows parameters:
import pandas
from io import StringIO
data = """0
1
2
3
4
5
6
7
8
"""
# `header` parameter absence is equivalent to `header="infer"` or `header=0`
# rows 1, 5, 6, 7, 8 are read with header "0"
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4])
# rows 5, 6, 7, 8 are read with header "1", row 0 is skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=1)
# rows 6, 7, 8 are read with header "5", rows 0, 1 are skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=2)
In the examples above the list-like skiprows value is fixed and header is varied. In the first example, with no header provided, rows 2, 3 and 4 are skipped and row 0 is used as the header. In the second example header == 1, so the zeroth row is skipped and the next available row is used as the header. The third example illustrates the case when both the header and skiprows parameter values are present: the skiprows rows are dropped first, and then the header is derived from the remaining rows (rows before the header are skipped too).
The examples above consider only a list-like skiprows and an integer header, but the same logic applies to other types of these parameters.
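The ordering described above (drop the skiprows rows first, then locate the header among what remains) can be modelled in a few lines of pure Python. The function apply_skiprows_and_header is a hypothetical model written for this explanation; it is not how pandas is implemented and it covers only integer, list-like and callable skiprows with an integer header.

```python
def apply_skiprows_and_header(lines, skiprows, header=0):
    """Model the documented ordering: drop `skiprows` first, then locate the header."""
    if callable(skiprows):
        should_skip = skiprows
    elif isinstance(skiprows, int):
        # An integer means "skip this many rows from the start of the file".
        should_skip = range(skiprows).__contains__
    else:
        should_skip = set(skiprows).__contains__
    remaining = [line for i, line in enumerate(lines) if not should_skip(i)]
    # Rows located before the header among the remaining rows are dropped as well.
    return remaining[header], remaining[header + 1:]


lines = [str(i) for i in range(9)]
first = apply_skiprows_and_header(lines, [2, 3, 4])
second = apply_skiprows_and_header(lines, [2, 3, 4], header=1)
third = apply_skiprows_and_header(lines, [2, 3, 4], header=2)
with_callable = apply_skiprows_and_header(lines, lambda i: 2 <= i <= 4)
```

The first three calls reproduce the headers and rows listed in the comments of the pandas snippet, and the callable form gives the same result as the equivalent list.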
Processing skiprows by Modin
As can be seen, skipping rows in the pandas import functions is complicated, and distributing this logic across multiple workers can complicate it even more. Thus, in some rare corner cases Modin uses the default pandas implementation to avoid excessive code complexity.
Modin uses two techniques for skipping rows:
1) During file partitioning (setting the file limits that should be read by each partition), exact rows can be excluded from the partitioning scope, so they are not read at all and can be considered skipped. This is the most efficient way of skipping rows since it doesn’t require any actual data reading or postprocessing, but in this case the skiprows parameter can only be an integer. Modin uses this approach whenever possible.
2) Rows for skipping can be dropped after the full dataset is imported. This is more expensive since it requires extra IO work and postprocessing afterwards, but the skiprows parameter can be of any non-integer type supported by pandas.read_csv.
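The two techniques can be contrasted with a short sketch. Both helper names are hypothetical, the interaction with the header is left out, and the first helper still scans for line ends (it only avoids parsing and keeping the skipped rows); it is meant to show the difference in approach, not Modin's code.

```python
import io


def skip_during_partitioning(f, skiprows):
    """Technique 1: advance the file offset past `skiprows` lines before any chunking."""
    for _ in range(skiprows):
        f.readline()
    return f.tell()  # chunk limits would be computed from this byte onward


def skip_after_import(rows, skiprows):
    """Technique 2: import every row, then drop the unwanted ones by position."""
    should_skip = skiprows if callable(skiprows) else set(skiprows).__contains__
    return [row for i, row in enumerate(rows) if not should_skip(i)]


f = io.BytesIO(b"r0\nr1\nr2\nr3\n")
start = skip_during_partitioning(f, 2)
unskipped = f.read()

kept = skip_after_import(["r0", "r1", "r2", "r3"], [0, 2])
kept_callable = skip_after_import(["r0", "r1", "r2", "r3"], lambda i: i % 2 == 0)
```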
In some cases, if skiprows is a uniformly distributed array (e.g. [1, 2, 3]), skiprows can be “squashed” and represented as an integer, which enables a fastpath that skips these rows during file partitioning (the first option). But if there is a gap between the last line of the header (which is skipped too, since the header is read by each partition to ensure metadata is defined properly) and the first row to skip, then the rows in this gap must be read first; this is done by assigning them to the first partition via the pre_reading parameter.
Let’s consider an example of skipping rows during partitioning when header="infer" and skiprows=[3, 4, 5]. In this specific case the fastpath can be used since skiprows is a uniformly distributed array, so we can “squash” it to an integer and set the “partitioning” skiprows to 3. But if no additional action is taken, these three rows will be skipped right after the header line, which corresponds to skiprows=[1, 2, 3]. To avoid this discrepancy, we need to assign the first partition to read the data between the header line and the first row to skip by setting the special pre_reading parameter to 2. Then, after the rows marked as skipped during partitioning are passed over, the remaining data is divided among the rest of the partitions; see the row assignment below:
0 - header line (skip during partitioning)
1 - pre reading (assign to read by the first partition)
2 - pre reading (assign to read by the first partition)
3 - "partitioning" skiprows (skip during partitioning)
4 - "partitioning" skiprows (skip during partitioning)
5 - "partitioning" skiprows (skip during partitioning)
6 - data to partition (divide between the rest of partitions)
7 - data to partition (divide between the rest of partitions)
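The arithmetic behind this example can be written down as a small sketch. The functions squash_skiprows and label_rows are hypothetical helpers that follow the description above; the checks Modin actually performs before taking the fastpath may be stricter or different.

```python
def squash_skiprows(skiprows, header_size):
    """Return (partitioning_skiprows, pre_reading), or None when no fastpath applies."""
    rows = sorted(skiprows)
    if not rows or any(b - a != 1 for a, b in zip(rows, rows[1:])):
        return None  # rows are not consecutive: drop them after import instead
    pre_reading = rows[0] - header_size
    if pre_reading < 0:
        return None  # skipped rows overlap the header; out of scope for this sketch
    return len(rows), pre_reading


def label_rows(total_rows, header_size, pre_reading, partitioning_skiprows):
    """Assign each row index to the role it plays during partitioning."""
    labels = []
    for i in range(total_rows):
        if i < header_size:
            labels.append("header")
        elif i < header_size + pre_reading:
            labels.append("pre_reading")
        elif i < header_size + pre_reading + partitioning_skiprows:
            labels.append("skip")
        else:
            labels.append("data")
    return labels


# header="infer" occupies one row, so header_size is 1 in the example above.
squashed = squash_skiprows([3, 4, 5], header_size=1)
labels = label_rows(8, header_size=1, pre_reading=squashed[1], partitioning_skiprows=squashed[0])
```

For skiprows=[3, 4, 5] the sketch yields a "partitioning" skiprows of 3 and a pre_reading of 2, and the labels match the row assignment listed above; for skiprows=[1, 2, 3] no pre-reading is needed.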