Pandas Parsers Module Description

High-Level Module Overview

This module houses parser classes (classes used for data parsing on the workers) and utility functions for handling parsing results. PandasParser is the base class for parser classes with the pandas storage format and contains the methods common to all child classes. The other classes in the module implement a parse function that parses data of a specific format based on the chunk information computed in the modin.core.io module. After a data chunk is parsed, the resulting DataFrame is split into smaller DataFrames according to the num_splits parameter, the data type, and the number of rows/columns in the parsed chunk; these frames are then returned together with some additional metadata.
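The splitting step described above can be sketched as follows. This is a minimal illustration, not Modin's actual implementation: the helper name split_frame, the even column-wise split, and the exact metadata appended to the result are all assumptions made for the example.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, num_splits: int) -> list:
    """Split ``df`` column-wise into ``num_splits`` frames (hypothetical helper)."""
    n_cols = len(df.columns)
    # Ceiling division so that every column lands in some piece.
    chunk = max(1, -(-n_cols // num_splits))
    return [df.iloc[:, i * chunk:(i + 1) * chunk] for i in range(num_splits)]


parsed = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8], "e": [9, 10]}
)
pieces = split_frame(parsed, num_splits=2)

# Assumed return layout: the split frames followed by metadata
# (here the index and the dtypes of the parsed chunk).
result = pieces + [parsed.index, parsed.dtypes]
```

With five columns and two splits, the first piece holds three columns and the second holds the remaining two.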

Note

If you are interested in the data parsing mechanism implementation details, please refer to the source code documentation.

Public API

This module houses the Modin parser classes that are used for data parsing on the workers.

Notes

Data parsing mechanism differs depending on the data format type:

  • text format type (CSV, EXCEL, FWF, JSON): File parsing begins by retrieving the start and end parameters from the parse kwargs. These parameters define the start and end bytes of the data file that should be read into the given partition. Using these offsets and the file handle obtained from fname, the binary data is read with Python's read function. The resulting data is then passed to the pandas.read_* function as an io.BytesIO object to get the corresponding pandas.DataFrame (this is needed because Modin partitions internally store data as pandas.DataFrame).

  • columnar store type (FEATHER, HDF, PARQUET): In this case, the data chunk to be read is defined by the column names passed as the columns parameter in the parse kwargs, so no additional action is needed: fname and kwargs are passed directly to the pandas.read_* function (in some corner cases, a pyarrow.read_* function can be used).

  • SQL type: Chunking is incorporated into the sql parameter as part of the query, so the parse parameters are passed to the pandas.read_sql function without modification.
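As an illustration of the text format case, the sketch below reads a byte range from a file and hands it to pandas.read_csv through io.BytesIO. The helper parse_csv_chunk and the way the offsets are computed here are assumptions for the example; in Modin the offsets come from the chunk information prepared before the parser is called.

```python
import io
import os
import tempfile

import pandas as pd


def parse_csv_chunk(fname, start, end, **read_csv_kwargs):
    """Read bytes [start, end) of ``fname`` and parse them (hypothetical helper)."""
    with open(fname, "rb") as f:
        f.seek(start)
        raw = f.read(end - start)
    return pd.read_csv(io.BytesIO(raw), **read_csv_kwargs)


# Build a small CSV file to demonstrate on.
content = b"a,b\n1,2\n3,4\n5,6\n"
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

header_end = content.index(b"\n") + 1         # first byte after the header line
split = content.index(b"\n", header_end) + 1  # first byte after the first data row

# Chunks that start after the header carry no column names,
# so the names are supplied explicitly.
first = parse_csv_chunk(path, header_end, split, header=None, names=["a", "b"])
rest = parse_csv_chunk(path, split, len(content), header=None, names=["a", "b"])
os.remove(path)
```

Each call parses only its own byte range, so the two frames together cover all data rows exactly once.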

class modin.core.storage_formats.pandas.parsers.CustomTextExperimentalParser

Class for handling custom text on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser

Class for handling multiple CSV files simultaneously on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(chunks, **kwargs)

Parse data on the workers.

Parameters
  • chunks (list) – List in which each element is a list of tuples. Each tuple contains the data file name of the chunk, the chunk start offset, and the chunk end offset for its corresponding file.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list
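To make the shape of the chunks argument concrete, here is a minimal sketch that walks a nested list of (file name, start offset, end offset) tuples and concatenates the parsed pieces. It assumes the nested layout exactly as described in the parameter list; the helper parse_glob_chunks is hypothetical and is not Modin's implementation.

```python
import io
import os
import tempfile

import pandas as pd


def parse_glob_chunks(chunks, **read_csv_kwargs):
    """Parse every byte range in ``chunks`` and concatenate (hypothetical helper)."""
    frames = []
    for file_chunks in chunks:
        for fname, start, end in file_chunks:
            with open(fname, "rb") as f:
                f.seek(start)
                raw = f.read(end - start)
            frames.append(pd.read_csv(io.BytesIO(raw), **read_csv_kwargs))
    return pd.concat(frames, ignore_index=True)


with tempfile.TemporaryDirectory() as tmpdir:
    first_path = os.path.join(tmpdir, "part1.csv")
    second_path = os.path.join(tmpdir, "part2.csv")
    with open(first_path, "wb") as f:
        f.write(b"1,2\n3,4\n")
    with open(second_path, "wb") as f:
        f.write(b"5,6\n")

    # One inner list per file; each tuple is (file name, start, end).
    chunks = [[(first_path, 0, 8)], [(second_path, 0, 4)]]
    combined = parse_glob_chunks(chunks, header=None, names=["a", "b"])
```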

class modin.core.storage_formats.pandas.parsers.PandasCSVParser

Class for handling CSV files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasExcelParser

Class for handling Excel files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

classmethod get_sheet_data(sheet, convert_float)

Get raw data from the Excel sheet.

Parameters
  • sheet (openpyxl.worksheet.worksheet.Worksheet) – Sheet to get data from.

  • convert_float (bool) – Whether to convert floats to ints or not.

Returns

List with sheet data.

Return type

list

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasFWFParser

Class for handling tables with fixed-width formatted lines on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasFeatherParser

Class for handling FEATHER files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str, path object or file-like object) – Name of the file, path or file-like object to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasHDFParser

Class for handling HDF data on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str, path object, pandas.HDFStore or file-like object) – Name of the file, path, pandas.HDFStore or file-like object to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasJSONParser

Class for handling JSON files on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasParquetParser

Class for handling PARQUET data on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasParser

Base class for parser classes with pandas storage format.

static generic_parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

classmethod get_dtypes(dtypes_ids)

Get the dtype common to all partitions for each of the columns.

Parameters

dtypes_ids (list) – Array with references to the partitions' dtypes objects.

Returns

frame_dtypes – Resulting dtype, or a pandas.Series for the full resulting frame in which column names are the index and column types are the values.

Return type

pandas.Series or dtype
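The idea of reducing per-partition dtypes to one dtype per column can be sketched as below. This is an illustration under stated assumptions, not Modin's code: the helper combine_dtypes is hypothetical, it takes already materialized dtypes Series rather than references, and it relies on numpy.result_type, so it only covers NumPy dtypes.

```python
import numpy as np
import pandas as pd


def combine_dtypes(partition_dtypes):
    """Return one dtype per column that fits every partition (hypothetical helper)."""
    # One row per column name, one column per partition.
    table = pd.concat(partition_dtypes, axis=1)
    return pd.Series(
        {col: np.result_type(*table.loc[col]) for col in table.index},
        dtype=object,
    )


part1 = pd.DataFrame(
    {"x": pd.Series([1, 2], dtype="int64"), "y": pd.Series([3, 4], dtype="int64")}
)
part2 = pd.DataFrame(
    {"x": pd.Series([0.5], dtype="float64"), "y": pd.Series([5], dtype="int64")}
)

# Column "x" mixes int64 and float64; column "y" is int64 in both partitions.
frame_dtypes = combine_dtypes([part1.dtypes, part2.dtypes])
```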

infer_compression(compression: str | None) → str | None

Get the compression method for filepath_or_buffer. If compression='infer', the inferred compression method is returned. Otherwise, the input compression method is returned unchanged, unless it is invalid, in which case an error is raised.

Parameters
  • filepath_or_buffer (str or file handle) – File path or object.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘filepath_or_buffer’ path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, or ‘.zst’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, or zstandard.ZstdDecompressor, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    Changed in version 1.4.0: Zstandard support.

Return type

string or None

Raises

ValueError – If an invalid compression method is specified.
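The behaviour described above matches the pandas helper pandas.io.common.infer_compression, which takes filepath_or_buffer and compression as positional arguments. The sketch below calls that pandas helper directly; whether the Modin method simply delegates to it is an assumption, and pandas.io.common is not part of the documented public pandas API. The file names are placeholders and the files do not need to exist.

```python
from pandas.io.common import infer_compression

# 'infer' looks at the file extension.
gz = infer_compression("data.csv.gz", "infer")
plain = infer_compression("data.csv", "infer")

# An explicit, valid method is returned unchanged.
explicit = infer_compression("data.csv", "bz2")

# An unknown method raises ValueError.
try:
    infer_compression("data.csv", "not-a-method")
    raised = False
except ValueError:
    raised = True
```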

classmethod single_worker_read(fname, **kwargs)

Perform reading by a single worker (default-to-pandas implementation).

Parameters
  • fname (str, path object or file-like object) – Name of the file or file-like object to read.

  • **kwargs (dict) – Keyword arguments to be passed into the read_* function.

Returns

Object with imported data (or with a reference to the data) for further processing. The object type depends on the result type of the child class parse function.

Return type

BaseQueryCompiler or dict or pandas.io.parsers.TextFileReader

class modin.core.storage_formats.pandas.parsers.PandasPickleExperimentalParser

Class for handling pickled pandas objects on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(fname, **kwargs)

Parse data on the workers.

Parameters
  • fname (str or path object) – Name of the file or path to read.

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list

class modin.core.storage_formats.pandas.parsers.PandasSQLParser

Class for handling SQL queries or tables on the workers using pandas storage format.

Inherits common functions from PandasParser class.

static parse(sql, con, index_col, **kwargs)

Parse data on the workers.

Parameters
  • sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.

  • con (SQLAlchemy connectable, str, or sqlite3 connection) – Connection object to database.

  • index_col (str or list of str) – Column(s) to set as index (MultiIndex).

  • **kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.

Returns

List with the split parse results and their metadata (index, dtypes, etc.).

Return type

list
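Since the chunking lives inside the query itself, each worker only needs to run pandas.read_sql on its own query. The sketch below shows one possible way to express that with an in-memory SQLite database. The chunk_query helper and the use of ORDER BY with LIMIT/OFFSET are assumptions for illustration, not a description of how Modin builds its queries.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, val TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"v{i}") for i in range(6)])
con.commit()


def chunk_query(query, order_col, limit, offset):
    """Wrap ``query`` so it returns one row range (hypothetical helper)."""
    return (
        f"SELECT * FROM ({query}) AS chunk "
        f"ORDER BY {order_col} LIMIT {limit} OFFSET {offset}"
    )


base_query = "SELECT id, val FROM t"

# Each "worker" receives a ready-made query and passes it to pandas unchanged.
parts = [
    pd.read_sql(chunk_query(base_query, "id", 3, offset), con, index_col="id")
    for offset in (0, 3)
]
con.close()
```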

modin.core.storage_formats.pandas.parsers.find_common_type_cat(types)

Find a common data type among the given dtypes.

Parameters

types (array-like) – Array of dtypes.

Returns

dtype that is common for all passed types.

Return type

pandas.core.dtypes.dtypes.ExtensionDtype or np.dtype or None
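One way to think about a common type that also handles categorical dtypes is sketched below. It is only an approximation under stated assumptions: the helper common_type is hypothetical, it merges unordered categoricals with the public pandas.api.types.union_categoricals function, and it falls back to numpy.result_type, which covers NumPy dtypes only.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def common_type(types):
    """Find one dtype covering all of ``types`` (hypothetical helper)."""
    if all(isinstance(t, pd.CategoricalDtype) for t in types):
        # Merge the category sets by unioning empty categoricals of each dtype.
        empties = [pd.Categorical([], dtype=t) for t in types]
        return union_categoricals(empties).dtype
    # Fallback for plain NumPy dtypes.
    return np.result_type(*types)


cat_result = common_type(
    [pd.CategoricalDtype(["a", "b"]), pd.CategoricalDtype(["b", "c"])]
)
num_result = common_type([np.dtype("int32"), np.dtype("float64")])
```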