Pandas-on-Ray Module Description¶
High-Level Module Overview¶
This module houses experimental functionality with pandas backend and Ray
engine. This functionality is concentrated in the ExperimentalPandasOnRayIO
class, which contains methods that extend the typical pandas API to give the
user more flexibility with IO operations.
Usage Guide¶
To use the experimental features, simply modify the standard Modin import statement as follows:
# import modin.pandas as pd
import modin.experimental.pandas as pd
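After this import, the regular Modin pandas API remains available alongside the experimental IO functions, so existing workflows keep working unchanged:
import modin.experimental.pandas as pd

# The standard pandas API still works through the experimental module.
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.sum())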
Implemented Operations¶
For now, ExperimentalPandasOnRayIO implements two methods: read_sql() and read_csv_glob().
The first method allows the user to use the typical pandas.read_sql function extended with Spark-like parameters such as partition_column, lower_bound and upper_bound. With these parameters, the user can specify how to partition the imported data.
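For illustration, a minimal sketch of such a partitioned read (the connection string and table name here are hypothetical placeholders):
import modin.experimental.pandas as pd

# Hypothetical connection string and table name - substitute your own.
conn = "postgresql://user:password@localhost:5432/mydb"

df = pd.read_sql(
    "SELECT * FROM users",   # or just a table name
    conn,
    partition_column="id",   # MUST be an integer column
    lower_bound=0,
    upper_bound=1_000_000,
)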
The second method allows reading multiple CSV files simultaneously when a glob-style pattern is provided as a parameter.
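A minimal sketch of reading a directory of same-schema CSV files at once (the file pattern is a hypothetical example):
import modin.experimental.pandas as pd

# Every file matching the pattern is read simultaneously into one DataFrame.
df = pd.read_csv_glob("data/trips_2021-*.csv")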
Submodules Description¶
The modin.experimental.engines.pandas_on_ray module is used mostly for storing utils and functions for the experimental IO class:
io_exp.py – submodule containing the IO class and parse functions, which are responsible for data processing on the workers.
sql.py – submodule with util functions for the experimental read_sql function.
Public API¶
- class modin.experimental.engines.pandas_on_ray.io_exp.ExperimentalPandasOnRayIO¶
Class for handling experimental IO functionality with pandas backend and Ray engine. ExperimentalPandasOnRayIO inherits some util functions and unmodified IO functions from the PandasOnRayIO class.
- classmethod read_csv_glob(filepath_or_buffer, **kwargs)¶
Read data from multiple .csv files passed via filepath_or_buffer simultaneously.
- Parameters
filepath_or_buffer (str, path object or file-like object) – filepath_or_buffer parameter of the read_csv function.
**kwargs (dict) – Parameters of the read_csv function.
- Returns
new_query_compiler – Query compiler with imported data for further processing.
- Return type
BaseQueryCompiler
- read_parquet_remote_task(path, columns, num_splits, kwargs)¶
Read columns from a Parquet file into a pandas.DataFrame using a Ray task.
- Parameters
path (str or List[str]) – The path of the Parquet file.
columns (List[str]) – The list of column names to read.
num_splits (int) – The number of partitions to split the column into.
kwargs (dict) – Keyword arguments to pass into the pyarrow.parquet.read function.
- Returns
A list containing the split pandas.DataFrame objects and the Index as the last element.
- Return type
list
Notes
pyarrow.parquet.read is used internally as the parse function.
- classmethod read_pickle_distributed(filepath_or_buffer, **kwargs)¶
In experimental mode, we can use * in the filename.
Note: the number of partitions is equal to the number of input files.
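A minimal sketch of a distributed pickle read, assuming files produced by to_pickle_distributed() (the filename pattern is hypothetical):
import modin.experimental.pandas as pd

# Matches partition0.pkl, partition1.pkl, ...; the resulting DataFrame
# gets one partition per matched file.
df = pd.read_pickle_distributed("partition*.pkl")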
- classmethod read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, partition_column=None, lower_bound=None, upper_bound=None, max_sessions=None)¶
Read SQL query or database table into a DataFrame.
- Parameters
sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.
con (SQLAlchemy connectable or str) – Connection to database (sqlite3 connections are not supported).
index_col (str or list of str, optional) – Column(s) to set as index (MultiIndex).
coerce_float (bool, default: True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
params (list, tuple or dict, optional) – List of parameters to pass to the execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249's paramstyle, is supported.
parse_dates (list or dict, optional) – The behavior is as follows:
List of column names to parse as dates.
Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.
Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime. Especially useful with databases without native Datetime support, such as SQLite.
columns (list, optional) – List of column names to select from SQL table (only used when reading a table).
chunksize (int, optional) – If specified, return an iterator where chunksize is the number of rows to include in each chunk.
partition_column (str, optional) – Column name used for data partitioning between the workers (MUST be an INTEGER column).
lower_bound (int, optional) – The minimum value to be requested from the partition_column.
upper_bound (int, optional) – The maximum value to be requested from the partition_column.
max_sessions (int, optional) – The maximum number of simultaneous connections allowed.
- Returns
A new query compiler with imported data for further processing.
- Return type
BaseQueryCompiler
- classmethod to_pickle_distributed(qc, **kwargs)¶
When * is in the filename, all partitions are written to their own separate files.
The filenames are determined as follows:
- if * is in the filename, it is replaced by the increasing sequence 0, 1, 2, …
- if * is not in the filename, the default implementation is used.
Example #1: with 4 partitions and input filename="partition*.pkl.gz", the filenames will be: partition0.pkl.gz, partition1.pkl.gz, partition2.pkl.gz, partition3.pkl.gz.
- Parameters
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_pickle_distributed on.
**kwargs (dict) – Parameters for pandas.to_pickle(**kwargs).