IO Module Description For Pandas-on-Ray Execution#

High-Level Module Overview#

This module houses experimental functionality with the pandas storage format and Ray engine. This functionality is concentrated in the ExperimentalPandasOnRayIO class, which contains methods that extend the typical pandas API to give the user more flexibility with IO operations.

Usage Guide#

To use the experimental features, modify the standard Modin import statement as follows:

# import modin.pandas as pd
import modin.experimental.pandas as pd

Submodules Description#

The modin.experimental.core.execution.ray.implementations.pandas_on_ray module mostly stores utilities and functions for the experimental IO class:

  • io.py - submodule containing the IO class and parser functions, which are responsible for data processing on the workers.

  • sql.py - submodule with utility functions for the experimental read_sql function.

Public API#

class modin.experimental.core.execution.ray.implementations.pandas_on_ray.io.io.ExperimentalPandasOnRayIO#

Class for handling experimental IO functionality with pandas storage format and Ray engine.

ExperimentalPandasOnRayIO inherits some utility functions and unmodified IO functions from the PandasOnRayIO class.

classmethod read_csv_glob(filepath_or_buffer, **kwargs)#

Read data simultaneously from multiple .csv files specified by filepath_or_buffer.

Parameters
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.

  • **kwargs (dict) – Parameters of read_csv function.

Returns

new_query_compiler – Query compiler with imported data for further processing.

Return type

BaseQueryCompiler
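The partitioning behind read_csv_glob can be illustrated with a small stdlib-only sketch. This is a toy analogue, not Modin's implementation: the function name read_csv_glob_sketch is hypothetical, and each matched file becomes a plain list of rows here, whereas Modin builds a query compiler and distributes the work across Ray workers.

```python
import csv
import glob
import os
import tempfile

def read_csv_glob_sketch(pattern):
    """Toy sketch: each file matched by the glob pattern becomes one
    partition (here, a list of rows); Modin instead parses the files on
    Ray workers and assembles a query compiler."""
    partitions = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            partitions.append(list(csv.reader(f)))
    return partitions

# Demo: two small csv files read through one glob pattern.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmpdir, f"part{i}.csv"), "w") as f:
        f.write("a,b\n1,2\n")

parts = read_csv_glob_sketch(os.path.join(tmpdir, "part*.csv"))
print(len(parts))  # one partition per matched file
```

The key point is that the wildcard pattern, not an explicit file list, determines how many inputs are read in parallel.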

classmethod read_custom_text(filepath_or_buffer, columns, custom_parser, **kwargs)#

Read data from filepath_or_buffer according to the passed read_custom_text kwargs parameters.

Parameters
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_custom_text function.

  • columns (list or callable(file-like object, **kwargs) -> list) – Column names as a list, or a callable that creates column names from the opened file and the passed kwargs.

  • custom_parser (callable(file-like object, **kwargs) -> pandas.DataFrame) – Function that takes as input a part of the filepath_or_buffer file loaded into memory in file-like object form.

  • **kwargs (dict) – Parameters of read_custom_text function.

Returns

Query compiler with imported data for further processing.

Return type

BaseQueryCompiler
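The callable contracts for columns and custom_parser can be sketched with plain Python. The helper names columns_from_header and parse_rows below are hypothetical, and parse_rows returns a list of rows for brevity, while a real custom_parser must return a pandas.DataFrame as documented above.

```python
import io

# Hypothetical columns callable matching the documented signature:
# callable(file-like object, **kwargs) -> list
def columns_from_header(f, sep=","):
    """Derive column names from the first line of the opened file."""
    return f.readline().rstrip("\n").split(sep)

# Hypothetical stand-in for custom_parser: consumes the part of the
# file loaded into memory (a real one returns a pandas.DataFrame).
def parse_rows(f, sep=","):
    return [line.rstrip("\n").split(sep) for line in f]

buf = io.StringIO("a|b|c\n1|2|3\n")
cols = columns_from_header(buf, sep="|")   # reads the header line
rows = parse_rows(buf, sep="|")            # parses the remaining lines
print(cols, rows)
```

Because both callables receive a file-like object plus the kwargs, the same sep keyword reaches the column factory and the parser.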

classmethod read_pickle_distributed(filepath_or_buffer, **kwargs)#

Read data from filepath_or_buffer according to kwargs parameters.

Parameters
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_pickle function.

  • **kwargs (dict) – Parameters of read_pickle function.

Returns

new_query_compiler – Query compiler with imported data for further processing.

Return type

BaseQueryCompiler

Notes

In experimental mode, the * wildcard can be used in the filename.

The number of partitions is equal to the number of input files.
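The two notes above can be demonstrated with a stdlib-only sketch: every file matched by the * pattern is loaded as its own partition, so the partition count equals the number of input files. This is an illustration of the documented behavior, not Modin's internal code.

```python
import glob
import os
import pickle
import tempfile

tmpdir = tempfile.mkdtemp()
# Write three pickle files that a wildcard pattern can match.
for i in range(3):
    with open(os.path.join(tmpdir, f"part{i}.pkl"), "wb") as f:
        pickle.dump({"chunk": i}, f)

# Sketch of the documented rule: each matched file becomes one partition.
paths = sorted(glob.glob(os.path.join(tmpdir, "part*.pkl")))
partitions = []
for p in paths:
    with open(p, "rb") as f:
        partitions.append(pickle.load(f))
print(len(partitions))  # equals the number of input files
```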

classmethod read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, partition_column=None, lower_bound=None, upper_bound=None, max_sessions=None)#

Read SQL query or database table into a DataFrame.

The function is extended with Spark-like parameters such as partition_column, lower_bound and upper_bound. With these parameters, the user can specify how to partition the imported data.

Parameters
  • sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.

  • con (SQLAlchemy connectable or str) – Connection to database (sqlite3 connections are not supported).

  • index_col (str or list of str, optional) – Column(s) to set as index (MultiIndex).

  • coerce_float (bool, default: True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.

  • params (list, tuple or dict, optional) – List of parameters to pass to execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249’s paramstyle, is supported.

  • parse_dates (list or dict, optional) –

    The behavior is as follows:

    • List of column names to parse as dates.

    • Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.

    • Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime. Especially useful with databases without native Datetime support, such as SQLite.

  • columns (list, optional) – List of column names to select from SQL table (only used when reading a table).

  • chunksize (int, optional) – If specified, return an iterator where chunksize is the number of rows to include in each chunk.

  • partition_column (str, optional) – Column name used for data partitioning between the workers (MUST be an INTEGER column).

  • lower_bound (int, optional) – The minimum value to be requested from the partition_column.

  • upper_bound (int, optional) – The maximum value to be requested from the partition_column.

  • max_sessions (int, optional) – The maximum number of simultaneous connections allowed.

Returns

A new query compiler with imported data for further processing.

Return type

BaseQueryCompiler
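The Spark-like range partitioning implied by partition_column, lower_bound and upper_bound can be sketched as follows. The helper split_bounds is hypothetical (not a Modin internal): it divides the inclusive integer range into near-equal sub-ranges, each of which would back one worker query.

```python
def split_bounds(lower_bound, upper_bound, num_partitions):
    """Sketch of Spark-like range partitioning on an integer column:
    divide [lower_bound, upper_bound] into near-equal sub-ranges,
    one per worker query (hypothetical helper)."""
    size, extra = divmod(upper_bound - lower_bound + 1, num_partitions)
    ranges, start = [], lower_bound
    for i in range(num_partitions):
        # The first `extra` partitions absorb one leftover value each.
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# Conceptually, each tuple becomes a predicate like:
#   WHERE partition_column BETWEEN start AND end
print(split_bounds(0, 9, 4))  # [(0, 2), (3, 5), (6, 7), (8, 9)]
```

This is why partition_column must be an integer column: the bounds are split arithmetically before any data is fetched.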

classmethod to_pickle_distributed(qc, **kwargs)#

When * is in the filename, each partition is written to its own separate file.

The filenames are determined as follows:

  • if * is in the filename, it will be replaced by the increasing sequence 0, 1, 2, …;

  • if * is not in the filename, the default implementation will be used.

Example: with 4 partitions and input filename="partition*.pkl.gz", the resulting filenames will be: partition0.pkl.gz, partition1.pkl.gz, partition2.pkl.gz, partition3.pkl.gz.

Parameters
  • qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_pickle_distributed on.

  • **kwargs (dict) – Parameters for pandas.to_pickle(**kwargs).
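The naming rule described above can be sketched in a few lines. The helper distributed_filenames is hypothetical and only models the documented filename substitution, not the actual pickling or partition writing.

```python
def distributed_filenames(filename, num_partitions):
    """Sketch of the documented rule: replace * with the increasing
    sequence 0, 1, 2, ...; without *, a single file is written via
    the default implementation (hypothetical helper)."""
    if "*" in filename:
        return [filename.replace("*", str(i), 1)
                for i in range(num_partitions)]
    return [filename]

print(distributed_filenames("partition*.pkl.gz", 4))
# ['partition0.pkl.gz', 'partition1.pkl.gz',
#  'partition2.pkl.gz', 'partition3.pkl.gz']
```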