PandasOnRayDataframePartitionManager#

This class is the specific implementation of GenericRayDataframePartitionManager using the Ray distributed engine. It is responsible for manipulating partitions and for applying functions to block, row, and column partitions.

Public API#

class modin.core.execution.ray.implementations.pandas_on_ray.partitioning.PandasOnRayDataframePartitionManager#

The class implements the interface in PandasDataframePartitionManager.

classmethod apply_func_to_indices_both_axis(partitions, func, row_partitions_list, col_partitions_list, item_to_distribute=_NoDefault.no_default, row_lengths=None, col_widths=None)#

Apply a function along both axes.

Parameters:

partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
row_partitions_list (iterable of tuples) –
Iterable of tuples, containing 2 values:
1. Integer row partition index.
2. Internal row indexer of this partition.
col_partitions_list (iterable of tuples) –
Iterable of tuples, containing 2 values:
1. Integer column partition index.
2. Internal column indexer of this partition.
item_to_distribute (np.ndarray or scalar, default: no_default) – The item to split up so it can be applied over both axes.
row_lengths (list of ints, optional) – Lengths of partitions for every row. If not specified this information is extracted from partitions itself.
col_widths (list of ints, optional) – Widths of partitions for every column. If not specified this information is extracted from partitions itself.

Returns:

A NumPy array with partitions.

Return type:

np.ndarray

Notes

For your func to operate directly on the indices provided, it must use row_internal_indices, col_internal_indices as keyword arguments.
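The keyword-argument contract in the note above can be sketched in plain Python. This sketch makes no Modin or Ray calls: a nested list stands in for one block partition, and `set_cells` and its `item` parameter are hypothetical names chosen only to show how such a func could receive `row_internal_indices` and `col_internal_indices`.

```python
def set_cells(block, row_internal_indices=(), col_internal_indices=(), item=0):
    """Return a copy of `block` with the addressed cells replaced by `item`.

    `block` is a list of rows standing in for one partition's data.
    The index arguments are positions inside this block, not global positions.
    """
    result = [list(row) for row in block]  # copy so the input is not mutated
    for r in row_internal_indices:
        for c in col_internal_indices:
            result[r][c] = item
    return result


block = [[1, 2, 3],
         [4, 5, 6]]
# Hypothetical direct call, mimicking how a partition manager could invoke the func.
updated = set_cells(block, row_internal_indices=[1], col_internal_indices=[0, 2], item=-1)
print(updated)
```

Because the func only ever sees block-local positions, the same func can be reused on any partition regardless of where that partition sits in the full frame.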

classmethod apply_func_to_select_indices(axis, partitions, func, indices, keep_remaining=False)#

Apply a function to select indices.

Parameters:

axis ({0, 1}) – Axis to apply the func over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply to these indices of partitions.
indices (dict) – The indices to apply the function to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.

Returns:

A NumPy array with partitions.

Return type:

np.ndarray

Notes

Your internal function must take a kwarg internal_indices for this to work correctly. This prevents information leakage of the internal index to the external representation.
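As an illustration of why the func receives `internal_indices` rather than global positions, the sketch below translates global positions into per-partition internal positions and then calls a func with that keyword. It is plain Python, not Modin's code: the helper `to_internal`, the func `double`, and the assumption that the translated indices form a dict keyed by partition number are all hypothetical. Global positions are assumed to be in range.

```python
from bisect import bisect_right
from itertools import accumulate


def to_internal(global_indices, lengths):
    """Group global positions by partition: {partition_idx: [internal positions]}."""
    starts = [0] + list(accumulate(lengths))  # start offset of each partition
    grouped = {}
    for g in global_indices:
        part = bisect_right(starts, g) - 1    # partition that contains position g
        grouped.setdefault(part, []).append(g - starts[part])
    return grouped


def double(block, internal_indices=()):
    """Func with an `internal_indices` kwarg; doubles the addressed items."""
    out = list(block)
    for i in internal_indices:
        out[i] *= 2
    return out


partitions = [[10, 11, 12], [13, 14], [15, 16, 17]]  # lengths 3, 2, 3
indices = to_internal([1, 4, 5], [len(p) for p in partitions])
result = [double(p, internal_indices=indices.get(i, [])) for i, p in enumerate(partitions)]
print(indices)
print(result)
```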

classmethod apply_func_to_select_indices_along_full_axis(axis, partitions, func, indices, keep_remaining=False)#

Apply a function to a select subset of full columns/rows.

Parameters:

axis ({0, 1}) – The axis to apply the function over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
indices (list-like) – The global indices to apply the func to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.

Returns:

A NumPy array with partitions.

Return type:

np.ndarray

Notes

This should be used when you need to apply a function that relies on some global information for the entire column/row, but only need to apply it to a subset. For your func to operate directly on the indices provided, it must use internal_indices as a keyword argument.
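A small sketch of the "global information" case follows. Centering a column on its mean needs every row of that column, even if only one column is selected. The code is plain Python with nested lists standing in for partitions; `center` and the manual stitching step are hypothetical and do not reproduce Modin's internals.

```python
def center(full_columns, internal_indices=()):
    """Subtract each selected column's mean; needs the whole column to compute it."""
    out = [list(row) for row in full_columns]
    n = len(out)
    for c in internal_indices:
        mean = sum(row[c] for row in out) / n
        for row in out:
            row[c] -= mean
    return out


# One column of partitions (axis=0): two row blocks holding the same two columns.
column_partitions = [
    [[1.0, 10.0], [3.0, 20.0]],
    [[5.0, 30.0]],
]
# Stitch the blocks so the func sees every row of the selected columns.
full = [row for block in column_partitions for row in block]
centered = center(full, internal_indices=[0])
print(centered)
```

Applying `center` to each row block separately would use two different means, which is why a full-axis view is required here.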

classmethod lazy_map_partitions(partitions, map_func, func_args=None, func_kwargs=None, enumerate_partitions=False)#

Apply map_func to every partition in partitions lazily.

Parameters:

partitions (NumPy 2D array) – Partitions of Modin Frame.
map_func (callable) – Function to apply.
func_args (iterable, optional) – Positional arguments for the ‘map_func’.
func_kwargs (dict, optional) – Keyword arguments for the ‘map_func’.
enumerate_partitions (bool, default: False) –

Returns:

An array of partitions.

Return type:

NumPy array
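To illustrate what "lazily" can mean here, the following sketch queues a function on each block and only runs it when a value is requested. It is a conceptual stand-in in plain Python: `LazyBlock`, `defer`, and `materialize` are hypothetical names, not Modin or Ray APIs.

```python
class LazyBlock:
    """Hypothetical partition stand-in that queues calls instead of running them."""

    def __init__(self, data, queue=()):
        self.data = data
        self.queue = tuple(queue)

    def defer(self, func, *args, **kwargs):
        # Return a new block with `func` appended to the pending calls.
        return LazyBlock(self.data, self.queue + ((func, args, kwargs),))

    def materialize(self):
        # Run the pending calls in order and return the final value.
        value = self.data
        for func, args, kwargs in self.queue:
            value = func(value, *args, **kwargs)
        return value


calls = []


def double(block):
    calls.append(1)  # record that the function really ran
    return [v * 2 for v in block]


grid = [[LazyBlock([1, 2]), LazyBlock([3])]]
lazy_grid = [[b.defer(double) for b in row] for row in grid]
ran_before = len(calls)  # nothing has run yet
values = [[b.materialize() for b in row] for row in lazy_grid]
ran_after = len(calls)
print(ran_before, ran_after, values)
```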

classmethod map_axis_partitions(axis, partitions, map_func, keep_partitioning=False, num_splits=None, lengths=None, enumerate_partitions=False, **kwargs)#

Apply map_func to every partition in partitions along given axis.

Parameters:

axis ({0, 1}) – Axis to perform the map across (0 - index, 1 - columns).
partitions (NumPy 2D array) – Partitions of Modin Frame.
map_func (callable) – Function to apply.
keep_partitioning (boolean, default: False) – The flag to keep partition boundaries for Modin Frame if possible. Setting it to True disables shuffling data from one partition to another in case the resulting number of splits is equal to the initial number of splits.
num_splits (int, optional) – The number of partitions to split the result into across the axis. If None, then the number of splits will be inferred automatically. If num_splits is None and keep_partitioning=True then the number of splits is preserved.
lengths (list of ints, default: None) –
The list of lengths to shuffle the object. Note:
1. Passing lengths causes num_splits to be ignored, because the number of splits is inferred from the number of integers present in lengths.
2. When passing lengths you must explicitly specify keep_partitioning=False.
enumerate_partitions (bool, default: False) – Whether or not to pass partition index into map_func. Note that map_func must be able to accept partition_idx kwarg.
**kwargs (dict) – Additional options that could be used by different engines.

Returns:

An array of new partitions for Modin Frame.

Return type:

NumPy array

Notes

This method should be used in the case when map_func relies on some global information about the axis.
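The interplay between a full-axis map and the `num_splits` and `lengths` parameters can be sketched as follows. This is plain Python with flat lists standing in for the partitions along one axis. `map_axis` and `running_total` are hypothetical, and the default of "as many splits as input blocks" is an assumption made for the sketch, not a statement about how Modin infers the count.

```python
from itertools import accumulate


def map_axis(blocks, map_func, num_splits=None, lengths=None):
    """Join `blocks` along one axis, apply `map_func`, then split the result again."""
    full = [x for block in blocks for x in block]      # global view of the axis
    mapped = map_func(full)
    if lengths is None:
        # Assumption for this sketch: default to the number of input blocks.
        num_splits = num_splits or len(blocks) or 1
        size = max(1, -(-len(mapped) // num_splits))   # ceiling division
        lengths = [len(mapped[i:i + size]) for i in range(0, len(mapped), size)]
    out, start = [], 0
    for n in lengths:                                  # len(lengths) fixes the split count
        out.append(mapped[start:start + n])
        start += n
    return out


def running_total(values):
    """Cumulative sum: each output depends on everything before it on the axis."""
    return list(accumulate(values))


blocks = [[1, 2], [3, 4], [5, 6]]
even_chunks = map_axis(blocks, running_total)
custom_chunks = map_axis(blocks, running_total, lengths=[1, 5])
print(even_chunks)
print(custom_chunks)
```

A cumulative sum is a useful test case because running it block by block would restart the total in every partition.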

classmethod map_partitions(partitions, map_func, func_args=None, func_kwargs=None)#

Apply map_func to partitions using different approaches to achieve the best performance.

Parameters:

partitions (NumPy 2D array) – Partitions housing the data of Modin Frame.
map_func (callable) – Function to apply.
func_args (iterable, optional) – Positional arguments for the ‘map_func’.
func_kwargs (dict, optional) – Keyword arguments for the ‘map_func’.

Returns:

An array of partitions.

Return type:

NumPy array
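Conceptually, a block-wise map calls `map_func` once per partition and forwards `func_args` and `func_kwargs` unchanged. The sketch below shows only that semantics, in plain Python with nested lists in place of the 2D array of partitions. `map_blocks` and `scale` are hypothetical, and none of the performance strategies mentioned above are modeled.

```python
def map_blocks(grid, map_func, func_args=None, func_kwargs=None):
    """Apply `map_func` to every block of a 2D grid, forwarding extra arguments."""
    args = tuple(func_args or ())
    kwargs = dict(func_kwargs or {})
    return [[map_func(block, *args, **kwargs) for block in row] for row in grid]


def scale(block, factor, offset=0):
    """Example func: block is a list of rows; scale and shift every value."""
    return [[v * factor + offset for v in row] for row in block]


# A 2x2 grid of blocks; each block is itself a small list of rows.
grid = [
    [[[1, 2]], [[3]]],
    [[[4, 5]], [[6]]],
]
mapped = map_blocks(grid, scale, func_args=(10,), func_kwargs={"offset": 1})
print(mapped)
```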

materialize_futures()#

Get the value of object from the Plasma store.

Parameters:

obj_id (ray.ObjectID) – Ray object identifier to get the value by.

Returns:

Whatever was identified by obj_id.

Return type:

object

classmethod n_ary_operation(left, func, right: list)#

Apply an n-ary operation to multiple PandasDataframe objects.

This method assumes that all the partitions of the dataframes in left and right have the same dimensions. For each position i, j in each dataframe’s partitions, the result has a partition at (i, j) whose data is func(left_partitions[i,j], *each_right_partitions[i,j]).

Parameters:

left (np.ndarray) – The partitions of left PandasDataframe.
func (callable) – The function to apply.
right (list of np.ndarray) – The list of partitions of other PandasDataframe.

Returns:

A NumPy array with new partitions.

Return type:

np.ndarray
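The position-wise rule described above, func(left_partitions[i,j], *each_right_partitions[i,j]), can be sketched in plain Python. Nested lists stand in for the partition arrays, and `n_ary` and `add_all` are hypothetical names used only for this illustration.

```python
def n_ary(left, func, right):
    """Combine same-shaped grids position by position."""
    return [
        [func(left[i][j], *(r[i][j] for r in right)) for j in range(len(left[i]))]
        for i in range(len(left))
    ]


def add_all(first, *others):
    """Element-wise sum of one block with any number of same-length blocks."""
    return [a + sum(rest) for a, *rest in zip(first, *others)]


# One row of two blocks per operand; blocks are flat lists for brevity.
left = [[[1, 2], [3]]]
right = [
    [[[10, 20], [30]]],
    [[[100, 200], [300]]],
]
combined = n_ary(left, add_all, right)
print(combined)
```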

classmethod split_pandas_df_into_partitions(df, row_chunksize, col_chunksize, update_bar)#

Split given pandas DataFrame according to the row/column chunk sizes into distributed partitions.

Parameters:

df (pandas.DataFrame) –
row_chunksize (int) –
col_chunksize (int) –
update_bar (callable(x) -> x) – Function that updates a progress bar.

Return type:

2D np.ndarray[PandasDataframePartition]
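The chunking arithmetic behind this kind of split can be sketched without pandas or Ray. Below, a list of rows stands in for the DataFrame and the blocks stay in local memory instead of being distributed. `split_table` is hypothetical, and calling `update_bar` once per block is an assumption made for the sketch.

```python
def split_table(rows, row_chunksize, col_chunksize, update_bar=lambda x: x):
    """Cut a table into a 2D grid of blocks of at most the given chunk sizes."""
    n_cols = len(rows[0]) if rows else 0
    grid = []
    for r in range(0, len(rows), row_chunksize):
        grid_row = []
        for c in range(0, n_cols, col_chunksize):
            block = [row[c:c + col_chunksize] for row in rows[r:r + row_chunksize]]
            # Assumption: the progress callback is invoked once per block.
            grid_row.append(update_bar(block))
        grid.append(grid_row)
    return grid


progress = []


def bar(block):
    progress.append(1)  # stand-in for advancing a progress bar
    return block


table = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
grid = split_table(table, row_chunksize=2, col_chunksize=2, update_bar=bar)
print(grid)
```

With a 3x3 table and chunk sizes of 2, the last row and last column of the grid hold the smaller leftover blocks.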

classmethod wait_partitions(partitions)#

Wait on the objects wrapped by partitions in parallel, without materializing them.

This method will block until all computations in the list have completed.

Parameters:

partitions (np.ndarray) – NumPy array with PandasDataframePartition objects.
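The idea of waiting without materializing can be illustrated with the Python standard library, which offers a close analog. The sketch below uses `concurrent.futures` threads rather than Ray, so it shows the blocking behavior only; `slow_square` is a hypothetical workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def slow_square(x):
    time.sleep(0.05)  # simulate a computation that takes a while
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_square, i) for i in range(4)]
    # Block until every task has finished; no result is fetched here.
    done, not_done = wait(futures)
    all_finished = all(f.done() for f in futures)

print(all_finished, len(done), len(not_done))
```

Results remain inside the futures after `wait` returns, so a later step can fetch only the ones it needs.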