Pandas Parsers Module Description#

High-Level Module Overview#

This module houses parser classes (classes used for data parsing on the workers) and utility functions for handling parsing results. PandasParser is the base class for parser classes with pandas storage format and contains the methods common to all child classes. The other classes in the module implement a parse function that parses data of a specific format based on the chunk information computed in the modin.core.io module. After the chunk is parsed, the resulting DataFrames are split into smaller DataFrames according to the num_splits parameter, the data type, or the number of rows/columns in the parsed chunk. These frames, along with some additional metadata, are then returned.
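
As an illustration of the splitting step, a minimal sketch is shown below; split_result is an illustrative name rather than a helper of this module, and the real splitting logic also accounts for the data type and the number of columns:

    import numpy as np
    import pandas

    def split_result(df, num_splits):
        # Split the parsed chunk row-wise into num_splits roughly equal
        # pieces, mirroring how parse results are divided before return.
        boundaries = np.linspace(0, len(df), num_splits + 1, dtype=int)
        return [df.iloc[start:stop] for start, stop in zip(boundaries, boundaries[1:])]

    df = pandas.DataFrame({"a": range(10), "b": range(10)})
    parts = split_result(df, num_splits=4)  # four frames of 2-3 rows each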

Note

If you are interested in the data parsing mechanism implementation details, please refer to the source code documentation.

Public API#

Module houses Modin parser classes that are used for data parsing on the workers.

Notes

The data parsing mechanism differs depending on the data format type:

  • text format type (CSV, EXCEL, FWF, JSON): File parsing begins by retrieving the start and end parameters from the parse kwargs; these parameters define the start and end bytes of the data file that should be read in the given partition. Using this information and the file handle obtained from fname, the binary data is read with Python's read function. The resulting data is then passed into a pandas.read_* function as an io.BytesIO object to get the corresponding pandas.DataFrame (this is needed because Modin partitions internally store data as pandas.DataFrame); see the sketch after this list.

  • columnar store type (FEATHER, HDF, PARQUET): In this case the data chunk to be read is defined by the column names passed as the columns parameter within the parse kwargs, so no additional action is needed and fname and kwargs are passed directly into the pandas.read_* function (in some corner cases a pyarrow.read_* function can be used).

  • SQL type: Chunking is incorporated into the sql parameter as part of the query, so the parse parameters are passed into the pandas.read_sql function without modification.
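
A minimal sketch of the text-format flow described above, assuming a CSV file; parse_text_chunk, start, end, and header_bytes are illustrative names rather than part of the module's API:

    import io
    import pandas

    def parse_text_chunk(fname, start, end, header_bytes=b"", **read_kwargs):
        # Read only the byte range assigned to this partition.
        with open(fname, "rb") as f:
            f.seek(start)
            chunk = f.read(end - start)
        # Prepend the header so the chunk parses as a standalone CSV file,
        # then hand the bytes to pandas as an in-memory buffer.
        return pandas.read_csv(io.BytesIO(header_bytes + chunk), **read_kwargs)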

class modin.core.storage_formats.pandas.parsers.PandasCSVParser#

Class for handling CSV files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, common_read_kwargs, **kwargs)#

Parse data on the workers.

Parameters:
  • fname (str or path object) – Name of the file or path to read.

  • common_read_kwargs (dict) – Common keyword parameters for read functions.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

static read_callback(*args, **kwargs)#

Parse data on each partition.

Parameters:
  • *args (list) – Positional arguments to be passed to the callback function.

  • **kwargs (dict) – Keyword arguments to be passed to the callback function.

Returns:

Function call result.

Return type:

pandas.DataFrame or pandas.io.parsers.TextFileReader

class modin.core.storage_formats.pandas.parsers.PandasExcelParser#

Class for handling excel files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

classmethod get_sheet_data(sheet, convert_float)#

Get raw data from the excel sheet.

Parameters:
  • sheet (openpyxl.worksheet.worksheet.Worksheet) – Sheet to get data from.

  • convert_float (bool) – Whether to convert floats to ints or not.

Returns:

List with sheet data.

Return type:

list
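
A hedged sketch of what retrieving raw sheet data with openpyxl may look like; Modin's actual cell handling is more involved, this only illustrates the idea behind the convert_float flag:

    def get_sheet_data_sketch(sheet, convert_float):
        # Walk the openpyxl worksheet row by row, collecting raw cell values.
        data = []
        for row in sheet.rows:
            values = []
            for cell in row:
                value = cell.value
                # Optionally fold floats without a fractional part into ints,
                # matching the semantics of the convert_float flag.
                if convert_float and isinstance(value, float) and value.is_integer():
                    value = int(value)
                values.append(value)
            data.append(values)
        return data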

static need_rich_text_param()#

Determine whether a rich_text parameter is required by the WorksheetReader constructor.

Return type:

bool
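
One way such a check can be implemented is by inspecting the constructor signature; the import path below is an assumption about openpyxl's internals, not necessarily what Modin does:

    import inspect

    from openpyxl.worksheet._reader import WorksheetReader  # assumed location

    def need_rich_text_param_sketch():
        # Newer openpyxl versions require a rich_text argument when
        # constructing WorksheetReader; older versions do not accept it.
        return "rich_text" in inspect.signature(WorksheetReader.__init__).parameters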

static parse(fname, **kwargs)#

Parse data on the workers.

Parameters:
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

class modin.core.storage_formats.pandas.parsers.PandasFWFParser#

Class for handling tables with fixed-width formatted lines on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, common_read_kwargs, **kwargs)#

Parse data on the workers.

Parameters:
  • fname (str or path object) – Name of the file or path to read.

  • common_read_kwargs (dict) – Common keyword parameters for read functions.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

static read_callback(*args, **kwargs)#

Parse data on each partition.

Parameters:
  • *args (list) – Positional arguments to be passed to the callback function.

  • **kwargs (dict) – Keyword arguments to be passed to the callback function.

Returns:

Function call result.

Return type:

pandas.DataFrame or pandas.io.parsers.TextFileReader

class modin.core.storage_formats.pandas.parsers.PandasFeatherParser#

Class for handling FEATHER files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)#

Parse data on the workers.

Parameters:
  • fname (str, path object or file-like object) – Name of the file, path or file-like object to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

class modin.core.storage_formats.pandas.parsers.PandasHDFParser#

Class for handling HDF data on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)#

Parse data on the workers.

Parameters:
  • fname (str, path object, pandas.HDFStore or file-like object) – Name of the file, path, pandas.HDFStore or file-like object to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

class modin.core.storage_formats.pandas.parsers.PandasJSONParser#

Class for handling JSON files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)#

Parse data on the workers.

Parameters:
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

class modin.core.storage_formats.pandas.parsers.PandasParquetParser#

Class for handling PARQUET data on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(files_for_parser, engine, **kwargs)#

Parse data on the workers.

Parameters:
  • files_for_parser (list) – List of files to be read.

  • engine (str) – Parquet library to use (either PyArrow or fastparquet).

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

class modin.core.storage_formats.pandas.parsers.PandasParser#

Base class for parser classes with pandas storage format.

static generic_parse(fname, **kwargs)#

Parse data on the workers.

Parameters:
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list

classmethod get_dtypes(dtypes_ids, columns)#

Get the dtype common to all partitions for each of the columns.

Parameters:
  • dtypes_ids (list) – Array with references to the partitions' dtypes objects.

  • columns (array-like or Index (1d)) – The names of the columns in this variable will be used for dtypes creation.

Returns:

frame_dtypes – Resulting dtype, or a pandas.Series where column names are used as the index and column types as the values, for the full resulting frame.

Return type:

pandas.Series, dtype or None
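
Conceptually, this boils down to promoting each column's per-partition dtypes to a common one; a simplified sketch under the assumption of plain numpy dtypes (the real code also has to handle categorical and extension dtypes):

    from functools import reduce

    import numpy as np
    import pandas

    def combine_partition_dtypes(partition_dtypes, columns):
        # partition_dtypes: list of pandas.Series mapping column name -> dtype,
        # one Series per partition. Promote dtypes column by column.
        combined = {
            col: reduce(np.promote_types, (dtypes[col] for dtypes in partition_dtypes))
            for col in columns
        }
        return pandas.Series(combined, index=list(columns))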

static get_types_mapper(dtype_backend)#

Get the types mapper to be used in read_parquet/read_feather.

Parameters:

dtype_backend ({"numpy_nullable", "pyarrow", lib.no_default}) –

Return type:

dict
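
The mapper is a dictionary from pyarrow types to pandas extension dtypes; an abbreviated, hypothetical example of what the "numpy_nullable" case may contain (the actual mapping covers many more types):

    import pandas
    import pyarrow

    numpy_nullable_mapper = {
        pyarrow.int64(): pandas.Int64Dtype(),
        pyarrow.float64(): pandas.Float64Dtype(),
        pyarrow.bool_(): pandas.BooleanDtype(),
        pyarrow.string(): pandas.StringDtype(),
    }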

infer_compression(compression: str | None) → str | None#

Get the compression method for filepath_or_buffer. If compression=’infer’, the inferred compression method is returned. Otherwise, the input compression method is returned unchanged, unless it’s invalid, in which case an error is raised.

Parameters:
  • filepath_or_buffer (str or file handle) – File path or object.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

Return type:

string or None

Raises:

ValueError – If an invalid compression method is specified.
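
For reference, the equivalent pandas helper infers compression from the file extension; a usage sketch assuming the pandas internal helper this docstring mirrors (its import path is pandas internals and may change):

    from pandas.io.common import infer_compression

    infer_compression("data.csv.gz", compression="infer")  # expected: "gzip"
    infer_compression("data.csv", compression="infer")     # expected: None
    infer_compression("data.csv", compression="bz2")       # expected: "bz2"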

classmethod single_worker_read(fname, *args, reason: str, **kwargs)#

Perform reading by single worker (default-to-pandas implementation).

Parameters:
  • fname (str, path object or file-like object) – Name of the file or file-like object to read.

  • *args (tuple) – Positional arguments to be passed into read_* function.

  • reason (str) – Message describing the reason for falling back to pandas.

  • **kwargs (dict) – Keyword arguments to be passed into the read_* function.

Returns:

  • BaseQueryCompiler or

  • dict or

  • pandas.io.parsers.TextFileReader – Object with imported data (or with a reference to the data) for further processing; the object type depends on the result type of the child class's parse function.
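
A simplified sketch of what default-to-pandas reading amounts to for a CSV source; the wrapping of the result back into a query compiler is elided, and single_worker_read_sketch is an illustrative name:

    import warnings

    import pandas

    def single_worker_read_sketch(fname, *args, reason, **kwargs):
        # Surface the reason for falling back, then let plain pandas perform
        # the whole read on a single worker instead of parsing in parallel.
        warnings.warn(reason)
        df = pandas.read_csv(fname, *args, **kwargs)
        # Modin would wrap this frame back into a query compiler here;
        # that conversion step is omitted from the sketch.
        return df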

class modin.core.storage_formats.pandas.parsers.PandasSQLParser#

Class for handling SQL queries or tables on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(sql, con, index_col, read_sql_engine, **kwargs)#

Parse data on the workers.

Parameters:
  • sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.

  • con (SQLAlchemy connectable, str, or sqlite3 connection) – Connection object to database.

  • index_col (str or list of str) – Column(s) to set as index (MultiIndex).

  • read_sql_engine (str) – Underlying engine (‘pandas’ or ‘connectorx’) used for fetching query result.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns:

List with split parse results and its metadata (index, dtypes, etc.).

Return type:

list
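
Since chunking is baked into the query itself, the sql argument a partition receives can already bound the rows it reads; an illustrative (not Modin-generated) example with a hypothetical SQLite database and table:

    import sqlite3

    import pandas

    con = sqlite3.connect("example.db")  # hypothetical database file
    # The query already limits the partition's row range, so pandas.read_sql
    # needs no extra chunking parameters.
    partition_query = "SELECT * FROM trips LIMIT 100000 OFFSET 200000"
    df = pandas.read_sql(partition_query, con, index_col=None)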

class modin.core.storage_formats.pandas.parsers.ParquetFileToRead(path: Any, row_group_start: int, row_group_end: int)#

Class to store path and row group information for parquet reads.

Parameters:
  • path (str, path object or file-like object) – Name of the file to read.

  • row_group_start (int) – Row group to start read from.

  • row_group_end (int) – Row group to stop reading at.

path: Any#

Alias for field number 0

row_group_end: int#

Alias for field number 2

row_group_start: int#

Alias for field number 1
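
A hedged usage sketch showing how such a record could drive a row-group-bounded read with pyarrow; the file name is hypothetical and row_group_end is assumed to be exclusive:

    import pyarrow.parquet as pq

    from modin.core.storage_formats.pandas.parsers import ParquetFileToRead

    task = ParquetFileToRead("data.parquet", row_group_start=0, row_group_end=2)

    parquet_file = pq.ParquetFile(task.path)
    # Read only the row groups assigned to this task.
    table = parquet_file.read_row_groups(
        list(range(task.row_group_start, task.row_group_end))
    )
    df = table.to_pandas()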

modin.core.storage_formats.pandas.parsers.find_common_type_cat(types)#

Find a common data type among the given dtypes.

Parameters:

types (array-like) – Array of dtypes.

Returns:

  • pandas.core.dtypes.dtypes.ExtensionDtype or

  • np.dtype or

  • None – dtype that is common for all passed types.
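
A hedged usage example, assuming numeric dtypes are promoted the same way pandas/numpy would promote them:

    import numpy as np

    from modin.core.storage_formats.pandas.parsers import find_common_type_cat

    common = find_common_type_cat([np.dtype("int64"), np.dtype("float64")])
    print(common)  # expected: float64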