Pandas Parsers Module Description#
High-Level Module Overview#
This module houses parser classes (classes used for parsing data on the workers) and utility functions for handling parsing results. PandasParser is the base class for parser classes with pandas storage format and contains the methods common to all child classes. The other module classes implement a parse function that parses data of a specific format based on the chunk information computed in the modin.core.io module. After a chunk is parsed, the resulting DataFrames are split into smaller DataFrames according to the num_splits parameter, the data type, or the number of rows/columns in the parsed chunk. These frames, along with some additional metadata, are then returned.
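The splitting step can be illustrated with a minimal sketch. It assumes a column-wise split into equal-sized groups; the helper name split_frame and the exact metadata appended at the end are hypothetical, and the real Modin logic may differ.

```python
import numpy as np
import pandas as pd


def split_frame(df, num_splits):
    """Split ``df`` column-wise into ``num_splits`` frames (hypothetical helper)."""
    # np.array_split tolerates a column count that is not divisible by num_splits.
    column_groups = np.array_split(np.arange(df.shape[1]), num_splits)
    return [df.iloc[:, idx] for idx in column_groups]


chunk = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8], "e": [9, 10]})
parts = split_frame(chunk, num_splits=2)

# The smaller frames are returned together with metadata; the row count and
# dtypes are used here purely as an illustration.
result = parts + [len(chunk.index), chunk.dtypes]
```

With five columns and num_splits=2, the first frame holds three columns and the second holds two.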
Note
If you are interested in the data parsing mechanism implementation details, please refer to the source code documentation.
Public API#
This module houses Modin parser classes that are used for data parsing on the workers.
Notes
The data parsing mechanism differs depending on the data format type:
text format type (CSV, EXCEL, FWF, JSON): File parsing begins by retrieving the start and end parameters from the parse kwargs. These parameters define the start and end bytes of the data file that should be read in the given partition. Using these offsets and the file handle obtained from fname, the binary data is read with the Python read function. The resulting data is then passed to the pandas.read_* function as an io.BytesIO object to get the corresponding pandas.DataFrame (this is necessary because Modin partitions store data internally as pandas.DataFrame).
columnar store type (FEATHER, HDF, PARQUET): In this case the data chunk to be read is defined by the column names passed as the columns parameter in the parse kwargs, so no additional action is needed, and fname and kwargs are passed directly to the pandas.read_* function (in some corner cases a pyarrow.read_* function can be used).
SQL type: Chunking is incorporated into the sql parameter as part of the query, so the parse parameters are passed to the pandas.read_sql function without modification.
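The text-format path described above can be sketched as follows. This is a simplified illustration, not the Modin implementation: the helper name parse_text_chunk is hypothetical, and the byte offsets are chosen by hand for a tiny CSV file.

```python
import io
import os
import tempfile

import pandas as pd


def parse_text_chunk(fname, start, end, **read_kwargs):
    """Read bytes [start, end) of ``fname`` and parse them with pandas.read_csv."""
    with open(fname, "rb") as f:
        f.seek(start)
        raw = f.read(end - start)
    # The raw bytes are wrapped in io.BytesIO so pandas can treat them as a file.
    return pd.read_csv(io.BytesIO(raw), **read_kwargs)


data = b"x,y\n1,2\n3,4\n5,6\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(data)
path = tmp.name

# Bytes 4..11 hold the first two data rows; the header line is outside the
# chunk, so the column names are supplied explicitly.
first_rows = parse_text_chunk(path, start=4, end=12, header=None, names=["x", "y"])
os.remove(path)
```

Because a chunk that does not start at byte 0 contains no header line, the column names have to travel with the read kwargs in this sketch.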
- class modin.core.storage_formats.pandas.parsers.ExperimentalCustomTextParser#
Class for handling custom text on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(fname, **kwargs)#
Parse data on the workers.
- Parameters
fname (str or path object) – Name of the file or path to read.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.ExperimentalPandasPickleParser#
Class for handling pickled pandas objects on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(fname, **kwargs)#
Parse data on the workers.
- Parameters
fname (str or path object) – Name of the file or path to read.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.PandasCSVGlobParser#
Class for handling multiple CSV files simultaneously on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(chunks, **kwargs)#
Parse data on the workers.
- Parameters
chunks (list) – List in which each element is a list of tuples. Each tuple in an inner list contains the data file name of the chunk, the chunk start offset, and the chunk end offset for its corresponding file.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
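One possible reading of the chunks layout is sketched below. The helper parse_chunk_group, the file names, and the offsets are all hypothetical; the sketch only shows how byte ranges from several files could be gathered and parsed together, and it is not the Modin implementation.

```python
import io
import os
import tempfile

import pandas as pd


def parse_chunk_group(chunk_group, **read_kwargs):
    """Parse one inner list of (file name, start offset, end offset) tuples."""
    buffer = io.BytesIO()
    for fname, start, end in chunk_group:
        with open(fname, "rb") as f:
            f.seek(start)
            buffer.write(f.read(end - start))
    buffer.seek(0)
    return pd.read_csv(buffer, **read_kwargs)


with tempfile.TemporaryDirectory() as tmpdir:
    first = os.path.join(tmpdir, "a.csv")
    second = os.path.join(tmpdir, "b.csv")
    with open(first, "wb") as f:
        f.write(b"1,2\n3,4\n")
    with open(second, "wb") as f:
        f.write(b"5,6\n7,8\n")

    # Outer list: one entry per group; inner list: byte ranges for that group.
    chunks = [
        [(first, 0, 8), (second, 0, 4)],  # all of a.csv plus the first row of b.csv
        [(second, 4, 8)],                 # the remaining row of b.csv
    ]
    frames = [
        parse_chunk_group(group, header=None, names=["x", "y"]) for group in chunks
    ]
```

A single group may span several files, which is why each tuple carries its own file name.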
- class modin.core.storage_formats.pandas.parsers.PandasCSVParser#
Class for handling CSV files on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(fname, common_read_kwargs, **kwargs)#
Parse data on the workers.
- Parameters
fname (str or path object) – Name of the file or path to read.
common_read_kwargs (dict) – Common keyword parameters for read functions.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- static read_callback(*args, **kwargs)#
Parse data on each partition.
- Parameters
*args (list) – Positional arguments to be passed to the callback function.
**kwargs (dict) – Keyword arguments to be passed to the callback function.
- Returns
Function call result.
- Return type
pandas.DataFrame or pandas.io.parsers.TextParser
- class modin.core.storage_formats.pandas.parsers.PandasExcelParser#
Class for handling Excel files on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- classmethod get_sheet_data(sheet, convert_float)#
Get raw data from the Excel sheet.
- Parameters
sheet (openpyxl.worksheet.worksheet.Worksheet) – Sheet to get data from.
convert_float (bool) – Whether to convert floats to ints or not.
- Returns
List with sheet data.
- Return type
list
- static parse(fname, **kwargs)#
Parse data on the workers.
- Parameters
fname (str or path object) – Name of the file or path to read.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.PandasFWFParser#
Class for handling tables with fixed-width formatted lines on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(fname, common_read_kwargs, **kwargs)#
Parse data on the workers.
- Parameters
fname (str or path object) – Name of the file or path to read.
common_read_kwargs (dict) – Common keyword parameters for read functions.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- static read_callback(*args, **kwargs)#
Parse data on each partition.
- Parameters
*args (list) – Positional arguments to be passed to the callback function.
**kwargs (dict) – Keyword arguments to be passed to the callback function.
- Returns
Function call result.
- Return type
pandas.DataFrame or pandas.io.parsers.TextFileReader
- class modin.core.storage_formats.pandas.parsers.PandasFeatherParser#
Class for handling FEATHER files on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(fname, **kwargs)#
Parse data on the workers.
- Parameters
fname (str, path object or file-like object) – Name of the file, path or file-like object to read.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.PandasHDFParser#
Class for handling HDF data on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(fname, **kwargs)#
Parse data on the workers.
- Parameters
fname (str, path object, pandas.HDFStore or file-like object) – Name of the file, path, pandas.HDFStore or file-like object to read.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.PandasJSONParser#
Class for handling JSON files on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(fname, **kwargs)#
Parse data on the workers.
- Parameters
fname (str or path object) – Name of the file or path to read.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.PandasParquetParser#
Class for handling PARQUET data on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(files_for_parser, engine, **kwargs)#
Parse data on the workers.
- Parameters
files_for_parser (list) – List of files to be read.
engine (str) – Parquet library to use (either PyArrow or fastparquet).
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.PandasParser#
Base class for parser classes with pandas storage format.
- static generic_parse(fname, **kwargs)#
Parse data on the workers.
- Parameters
fname (str or path object) – Name of the file or path to read.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- classmethod get_dtypes(dtypes_ids, columns)#
Get the dtype that is common to all partitions for each of the columns.
- Parameters
dtypes_ids (list) – Array with references to the partitions' dtypes objects.
columns (array-like or Index (1d)) – The names of the columns in this variable will be used for dtypes creation.
- Returns
frame_dtypes – Resulting dtype or pandas.Series where column names are used as index and types of columns are used as values for full resulting frame.
- Return type
pandas.Series, dtype or None
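The idea of combining per-partition dtypes into one dtype per column can be sketched as follows. This is a simplified illustration under stated assumptions: the helper name combine_partition_dtypes is hypothetical, the partition dtypes are already materialized rather than passed as references, and np.result_type stands in for the real common-type logic, so only NumPy dtypes are handled.

```python
import numpy as np
import pandas as pd


def combine_partition_dtypes(partition_dtypes, columns):
    """Return a Series mapping each column to a dtype common to all partitions."""
    common = [
        # Promote the i-th dtype of every partition to a single dtype.
        np.result_type(*(dtypes.iloc[i] for dtypes in partition_dtypes))
        for i in range(len(columns))
    ]
    return pd.Series(common, index=columns)


part1 = pd.DataFrame({"a": [1], "b": np.array([1], dtype="int32")}).dtypes
part2 = pd.DataFrame({"a": [2.5], "b": np.array([2], dtype="int64")}).dtypes

frame_dtypes = combine_partition_dtypes([part1, part2], columns=["a", "b"])
```

Column a is promoted to float64 because one partition parsed it as integer and the other as float, while column b widens from int32 to int64.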
- infer_compression(compression: str | None) → str | None#
Get the compression method for filepath_or_buffer. If compression=’infer’, the inferred compression method is returned. Otherwise, the input compression method is returned unchanged, unless it’s invalid, in which case an error is raised.
- Parameters
filepath_or_buffer (str or file handle) – File path or object.
compression (str or dict, default 'infer') – For on-the-fly compression of the output data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.
New in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
- Return type
string or None
- Raises
ValueError – If an invalid compression is specified.
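The behavior described above can be approximated with a small standard-library sketch. The function name infer_compression_sketch and the lookup table are hypothetical, only string paths and string compression values are handled, and the real method may accept more inputs.

```python
from typing import Optional

# Extension-to-method table based on the extensions listed in the description.
# Longer suffixes come first so that ".tar.gz" is not mistaken for ".gz".
_EXTENSION_TO_METHOD = {
    ".tar.gz": "tar",
    ".tar.bz2": "tar",
    ".tar.xz": "tar",
    ".tar": "tar",
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
    ".zst": "zstd",
}


def infer_compression_sketch(path, compression: Optional[str]) -> Optional[str]:
    """Return the compression method for ``path`` (simplified illustration)."""
    if compression is None:
        return None
    if compression == "infer":
        if not isinstance(path, str):
            return None  # only path-like inputs can be inspected
        lowered = path.lower()
        for extension, method in _EXTENSION_TO_METHOD.items():
            if lowered.endswith(extension):
                return method
        return None  # unknown extension: no compression
    if compression in set(_EXTENSION_TO_METHOD.values()):
        return compression  # an explicit, valid method is returned unchanged
    raise ValueError(f"Unrecognized compression type: {compression}")
```

For instance, infer_compression_sketch("data.csv.gz", "infer") returns "gzip", and an unrecognized value such as "rar" raises ValueError.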
- classmethod single_worker_read(fname, *args, reason: str, **kwargs)#
Perform reading by single worker (default-to-pandas implementation).
- Parameters
fname (str, path object or file-like object) – Name of the file or file-like object to read.
*args (tuple) – Positional arguments to be passed into read_* function.
reason (str) – Message describing the reason for falling back to pandas.
**kwargs (dict) – Keyword arguments to be passed into the read_* function.
- Returns
BaseQueryCompiler or dict or pandas.io.parsers.TextFileReader – Object with imported data (or with a reference to the data) for further processing; the object type depends on the result type of the child class parse function.
- class modin.core.storage_formats.pandas.parsers.PandasSQLParser#
Class for handling SQL queries or tables on the workers using pandas storage format.
Inherits common functions from PandasParser class.
- static parse(sql, con, index_col, read_sql_engine, **kwargs)#
Parse data on the workers.
- Parameters
sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.
con (SQLAlchemy connectable, str, or sqlite3 connection) – Connection object to database.
index_col (str or list of str) – Column(s) to set as index (MultiIndex).
read_sql_engine (str) – Underlying engine (‘pandas’ or ‘connectorx’) used for fetching query result.
**kwargs (dict) – Keyword arguments to be used by the parse function or passed into the read_* function.
- Returns
List with split parse results and its metadata (index, dtypes, etc.).
- Return type
list
- class modin.core.storage_formats.pandas.parsers.ParquetFileToRead(path: Any, row_group_start: int, row_group_end: int)#
Class to store path and row group information for parquet reads.
- Parameters
path (str, path object or file-like object) – Name of the file to read.
row_group_start (int) – Row group to start reading from.
row_group_end (int) – Row group at which to stop reading.
- path: Any#
Alias for field number 0
- row_group_end: int#
Alias for field number 2
- row_group_start: int#
Alias for field number 1
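The "Alias for field number N" entries are what Sphinx emits for named-tuple fields. A stand-in with the same three fields can be sketched as below; the class name ParquetFileToReadSketch and the sample values are hypothetical.

```python
from typing import Any, NamedTuple


class ParquetFileToReadSketch(NamedTuple):
    """Stand-in with the same fields as the documented class."""

    path: Any
    row_group_start: int
    row_group_end: int


entry = ParquetFileToReadSketch("part-0.parquet", 0, 4)

# Fields can be read by name or by position, and the tuple can be unpacked.
path, start, end = entry
```

Here entry[1] and entry.row_group_start refer to the same value, which is what "Alias for field number 1" expresses.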
- modin.core.storage_formats.pandas.parsers.find_common_type_cat(types)#
Find a common data type among the given dtypes.
- Parameters
types (array-like) – Array of dtypes.
- Returns
pandas.core.dtypes.dtypes.ExtensionDtype or np.dtype or None – dtype that is common for all passed types.
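A rough sketch of finding a common dtype is shown below. It covers only two cases, all-categorical inputs and all-NumPy inputs, and the function name common_type_sketch is hypothetical. It uses union_categoricals and np.result_type as stand-ins, so the actual function may behave differently, especially for mixed inputs.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def common_type_sketch(types):
    """Return a dtype common to ``types`` (categorical-only or NumPy-only input)."""
    types = list(types)
    if all(isinstance(t, pd.CategoricalDtype) for t in types):
        # Empty categoricals carry only the categories, which are then unioned.
        empties = [pd.Categorical([], dtype=t) for t in types]
        return union_categoricals(empties).dtype
    # Assumption: all remaining inputs are NumPy dtypes.
    return np.result_type(*types)


merged = common_type_sketch(
    [pd.CategoricalDtype(["a", "b"]), pd.CategoricalDtype(["b", "c"])]
)
numeric = common_type_sketch([np.dtype("int64"), np.dtype("float32")])
```

In the categorical case the result keeps every category seen in any input, so values from all partitions remain representable.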