Pandas-on-Ray Module Description

High-Level Module Overview

This module houses experimental functionality with the pandas backend and Ray engine. This functionality is concentrated in the ExperimentalPandasOnRayIO class, which contains methods that extend the typical pandas API to give the user more flexibility with IO operations.

Usage Guide

In order to use the experimental features, just modify the standard Modin import statement as follows:

# import modin.pandas as pd
import modin.experimental.pandas as pd

Implemented Operations

For now, ExperimentalPandasOnRayIO implements two methods: read_sql() and read_csv_glob(). The first method lets the user call the typical pandas.read_sql function extended with Spark-like parameters such as partition_column, lower_bound and upper_bound. With these parameters, the user can specify how to partition the imported data. The second method allows reading multiple CSV files simultaneously when a glob pattern is provided as a parameter.
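For example, read_csv_glob can be called through the experimental namespace; a minimal sketch, assuming a directory of CSV files matching a hypothetical glob pattern:

import modin.experimental.pandas as pd

# Read every CSV file matching the glob pattern into a single Modin DataFrame;
# the path "data/part*.csv" is a hypothetical example.
df = pd.read_csv_glob("data/part*.csv")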

Submodules Description

The modin.experimental.engines.pandas_on_ray module is used mostly for storing utilities and functions for the experimental IO class:

  • io_exp.py - submodule containing the IO class and parse functions, which are responsible for data processing on the workers.

  • sql.py - submodule with util functions for the experimental read_sql function.

Public API

class modin.experimental.engines.pandas_on_ray.io_exp.ExperimentalPandasOnRayIO

Class for handling experimental IO functionality with pandas backend and Ray engine.

ExperimentalPandasOnRayIO inherits some util functions and unmodified IO functions from PandasOnRayIO class.

classmethod read_csv_glob(filepath_or_buffer, **kwargs)

Read data from multiple .csv files passed with filepath_or_buffer simultaneously.

Parameters
  • filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of read_csv function.

  • **kwargs (dict) – Parameters of read_csv function.

Returns

new_query_compiler – Query compiler with imported data for further processing.

Return type

BaseQueryCompiler

read_parquet_remote_task(path, columns, num_splits, kwargs)

Read columns from Parquet file into a pandas.DataFrame using Ray task.

Parameters
  • path (str or List[str]) – The path of the Parquet file.

  • columns (List[str]) – The list of column names to read.

  • num_splits (int) – The number of partitions to split the column into.

  • kwargs (dict) – Keyword arguments to pass into the pyarrow.parquet.read function.

Returns

A list containing the split pandas.DataFrames and the Index as the last element.

Return type

list

Notes

pyarrow.parquet.read is used internally as the parse function.

classmethod read_pickle_distributed(filepath_or_buffer, **kwargs)

In experimental mode, we can use * in the filename.

Note: the number of partitions is equal to the number of input files.
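A minimal sketch, assuming read_pickle_distributed is exposed through the experimental modin.experimental.pandas namespace and that pickled partition files matching the wildcard already exist; the filenames are hypothetical:

import modin.experimental.pandas as pd

# Each file matching the wildcard (e.g. partition0.pkl.gz, partition1.pkl.gz, ...)
# becomes one partition of the resulting DataFrame.
df = pd.read_pickle_distributed("partition*.pkl.gz")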

classmethod read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, partition_column=None, lower_bound=None, upper_bound=None, max_sessions=None)

Read SQL query or database table into a DataFrame.

Parameters
  • sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.

  • con (SQLAlchemy connectable or str) – Connection to database (sqlite3 connections are not supported).

  • index_col (str or list of str, optional) – Column(s) to set as index (MultiIndex).

  • coerce_float (bool, default: True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.

  • params (list, tuple or dict, optional) – List of parameters to pass to execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249’s paramstyle, is supported.

  • parse_dates (list or dict, optional) –

    The behavior is as follows:

    • List of column names to parse as dates.

    • Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.

    • Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime. Especially useful with databases without native Datetime support, such as SQLite.

  • columns (list, optional) – List of column names to select from SQL table (only used when reading a table).

  • chunksize (int, optional) – If specified, return an iterator where chunksize is the number of rows to include in each chunk.

  • partition_column (str, optional) – Column name used for data partitioning between the workers (MUST be an INTEGER column).

  • lower_bound (int, optional) – The minimum value to be requested from the partition_column.

  • upper_bound (int, optional) – The maximum value to be requested from the partition_column.

  • max_sessions (int, optional) – The maximum number of simultaneous connections allowed to be used.

Returns

A new query compiler with imported data for further processing.

Return type

BaseQueryCompiler
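A minimal usage sketch of the partitioning parameters, assuming read_sql is called through the experimental modin.experimental.pandas namespace; the connection string, table name and integer column are hypothetical:

import modin.experimental.pandas as pd

# Partition the import by the integer column "id", requesting ids between
# 0 and 1000000 and using at most 10 concurrent connections
# (connection string and table are hypothetical examples).
df = pd.read_sql(
    "SELECT * FROM my_table",
    "postgresql://user:password@localhost:5432/my_db",
    partition_column="id",
    lower_bound=0,
    upper_bound=1000000,
    max_sessions=10,
)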

classmethod to_pickle_distributed(qc, **kwargs)

When * is in the filename, all partitions are written to their own separate files.

The filenames are determined as follows:

  • if * is in the filename, it will be replaced by the increasing sequence 0, 1, 2, …

  • if * is not in the filename, the default implementation will be used.

Example #1: with 4 partitions and input filename="partition*.pkl.gz", the output filenames will be: partition0.pkl.gz, partition1.pkl.gz, partition2.pkl.gz, partition3.pkl.gz.

Parameters
  • qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_pickle_distributed on.

  • **kwargs (dict) – Parameters for pandas.to_pickle(**kwargs).
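A minimal round-trip sketch, assuming to_pickle_distributed is available as a DataFrame method after the experimental import; the filenames are hypothetical:

import modin.experimental.pandas as pd

df = pd.DataFrame({"a": range(8), "b": range(8)})
# With a wildcard in the filename, each partition is written to its own file,
# e.g. partition0.pkl.gz, partition1.pkl.gz, ... (one file per partition).
df.to_pickle_distributed("partition*.pkl.gz")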