PandasOnRayDataframePartitionManager
This class is the specific implementation of GenericRayDataframePartitionManager
using the Ray distributed engine. It is responsible for manipulating partitions and for applying a function to
block, row, and column partitions.
Public API
- class modin.core.execution.ray.implementations.pandas_on_ray.partitioning.PandasOnRayDataframePartitionManager
The class implements the interface in PandasDataframePartitionManager.
- classmethod apply_func_to_indices_both_axis(partitions, func, row_partitions_list, col_partitions_list, item_to_distribute=_NoDefault.no_default, row_lengths=None, col_widths=None)
Apply a function along both axes.
- Parameters
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
row_partitions_list (iterable of tuples) – Iterable of tuples, each containing 2 values:
- Integer row partition index.
- Internal row indexer of this partition.
col_partitions_list (iterable of tuples) – Iterable of tuples, each containing 2 values:
- Integer column partition index.
- Internal column indexer of this partition.
item_to_distribute (np.ndarray or scalar, default: no_default) – The item to split up so it can be applied over both axes.
row_lengths (list of ints, optional) – Lengths of partitions for every row. If not specified, this information is extracted from the partitions themselves.
col_widths (list of ints, optional) – Widths of partitions for every column. If not specified, this information is extracted from the partitions themselves.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
For your func to operate directly on the indices provided, it must use row_internal_indices, col_internal_indices as keyword arguments.
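As a minimal sketch of the keyword-argument contract described in this note, the hypothetical function below accepts row_internal_indices and col_internal_indices and zeroes the selected cells of a single block. The function name, the sample data, and the direct call (standing in for however the partition manager invokes func) are assumptions for illustration, not Modin code.

```python
import pandas as pd

def zero_selected_cells(df, row_internal_indices=(), col_internal_indices=()):
    """Hypothetical func: zero the cells at the given positions inside one block."""
    result = df.copy()  # leave the input block untouched
    rows = list(row_internal_indices)
    cols = list(col_internal_indices)
    result.iloc[rows, cols] = 0
    return result

# One block of a larger frame (sample data, assumed for illustration).
block = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Direct call standing in for the partition manager's invocation of func.
updated = zero_selected_cells(
    block, row_internal_indices=[0, 2], col_internal_indices=[1]
)
```

The positions are relative to the block, not to the whole frame, which is why they arrive as separate keyword arguments.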
- classmethod apply_func_to_select_indices(axis, partitions, func, indices, keep_remaining=False)
Apply a function to select indices.
- Parameters
axis ({0, 1}) – Axis to apply the func over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply to these indices of partitions.
indices (dict) – The indices to apply the function to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
Your internal function must take a kwarg internal_indices for this to work correctly. This prevents information leakage of the internal index to the external representation.
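The effect of keep_remaining and of the internal_indices keyword can be pictured with a toy model over a plain list of blocks. This is a sketch under stated assumptions: apply_to_selected and negate_columns are hypothetical helpers, and the shape of indices (block position mapped to positions inside that block) is assumed here, since the parameter description only says it is a dict.

```python
import pandas as pd

def negate_columns(df, internal_indices=()):
    """Hypothetical func: negate the columns at the given positions in one block."""
    out = df.copy()
    cols = list(internal_indices)
    out.iloc[:, cols] = -out.iloc[:, cols].to_numpy()
    return out

def apply_to_selected(blocks, func, indices, keep_remaining=False):
    """Toy model: `indices` maps block position -> positions inside that block (assumed)."""
    result = []
    for position, block in enumerate(blocks):
        if position in indices:
            result.append(func(block, internal_indices=indices[position]))
        elif keep_remaining:
            result.append(block)  # untouched block is carried over
    return result

blocks = [
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    pd.DataFrame({"c": [5, 6], "d": [7, 8]}),
]

only_results = apply_to_selected(blocks, negate_columns, {1: [0]})
with_rest = apply_to_selected(blocks, negate_columns, {1: [0]}, keep_remaining=True)
```

With keep_remaining left at False, only the block that func touched comes back; with True, the untouched block is returned alongside it.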
- classmethod apply_func_to_select_indices_along_full_axis(axis, partitions, func, indices, keep_remaining=False)
Apply a function to a select subset of full columns/rows.
- Parameters
axis ({0, 1}) – The axis to apply the function over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
indices (list-like) – The global indices to apply the func to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
This should be used when you need to apply a function that relies on some global information for the entire column/row, but only need to apply a function to a subset. For your func to operate directly on the indices provided, it must use internal_indices as a keyword argument.
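To illustrate why some functions need the full axis, the sketch below centers one selected column on its mean. The function center_columns and the sample blocks are hypothetical, and pd.concat stands in for whatever mechanism assembles a full column from its row blocks.

```python
import pandas as pd

def center_columns(full_axis_df, internal_indices=()):
    """Hypothetical func: subtract the column mean from the selected columns."""
    out = full_axis_df.astype(float)  # astype returns a new frame
    cols = list(internal_indices)
    selected = out.iloc[:, cols]
    out.iloc[:, cols] = (selected - selected.mean()).to_numpy()
    return out

# Two row blocks of the same columns (sample data, assumed for illustration).
top = pd.DataFrame({"x": [1, 2], "y": [10, 20]})
bottom = pd.DataFrame({"x": [3, 6], "y": [30, 40]})

# Full-axis view: the function sees every row, so the mean is the global mean.
full = pd.concat([top, bottom], ignore_index=True)
centered = center_columns(full, internal_indices=[0])

# Block-by-block application uses each block's own mean and gives a different answer.
per_block = pd.concat(
    [center_columns(top, internal_indices=[0]),
     center_columns(bottom, internal_indices=[0])],
    ignore_index=True,
)
```

Only column x is modified, matching the idea of applying a full-axis function to a subset of indices.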
- classmethod get_objects_from_partitions(partitions)
Get the objects wrapped by partitions in parallel.
This function assumes that each partition in partitions contains a single block.
- Parameters
partitions (np.ndarray) – NumPy array with PandasDataframePartition-s.
- Returns
The objects wrapped by partitions.
- Return type
list
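The idea of resolving every wrapped object at once can be sketched with the standard library, using a thread pool as a stand-in for Ray. ToyPartition, make_block, and get_objects are hypothetical names, and the row-by-row ordering of the returned list is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

class ToyPartition:
    """Hypothetical stand-in for a partition: wraps a future that yields one block."""
    def __init__(self, future):
        self.future = future

def make_block(value):
    return pd.DataFrame({"v": [value, value]})

def get_objects(partitions):
    # Flatten the grid row by row (assumed ordering) and resolve each future.
    return [partition.future.result() for partition in partitions.flatten()]

with ThreadPoolExecutor(max_workers=2) as pool:
    grid = np.empty((2, 2), dtype=object)
    for i in range(2):
        for j in range(2):
            # All four blocks are submitted before any result is requested.
            grid[i, j] = ToyPartition(pool.submit(make_block, 2 * i + j))
    blocks = get_objects(grid)
```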
- classmethod lazy_map_partitions(partitions, map_func)
Apply map_func to every partition in partitions lazily.
- Parameters
partitions (NumPy 2D array) – Partitions of Modin Frame.
map_func (callable) – Function to apply.
- Returns
An array of partitions.
- Return type
NumPy array
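One way to picture lazy application is a partition that queues functions and runs them only when its data is requested. The sketch below is a conceptual model, not Modin's implementation: LazyBlock, defer, materialize, and lazy_map are hypothetical names, and plain lists stand in for the wrapped data.

```python
import numpy as np

class LazyBlock:
    """Hypothetical partition stand-in that queues functions instead of running them."""
    def __init__(self, data, pending=()):
        self.data = data
        self.pending = tuple(pending)

    def defer(self, func):
        # Return a new block with func appended to the queue; nothing runs yet.
        return LazyBlock(self.data, self.pending + (func,))

    def materialize(self):
        result = self.data
        for func in self.pending:
            result = func(result)
        return result

calls = []  # records each real execution of the mapped function

def double(values):
    calls.append(1)
    return [v * 2 for v in values]

def lazy_map(grid, map_func):
    out = np.empty(grid.shape, dtype=object)
    for idx in np.ndindex(grid.shape):
        out[idx] = grid[idx].defer(map_func)
    return out

grid = np.empty((1, 2), dtype=object)
grid[0, 0] = LazyBlock([1, 2])
grid[0, 1] = LazyBlock([3])

mapped = lazy_map(grid, double)
calls_before = len(calls)  # still zero: mapping only queued the work
values = [mapped[idx].materialize() for idx in np.ndindex(mapped.shape)]
calls_after = len(calls)   # one execution per block after materializing
```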
- classmethod map_axis_partitions(axis, partitions, map_func, keep_partitioning=False, lengths=None, enumerate_partitions=False, **kwargs)
Apply map_func to every partition in partitions along given axis.
- Parameters
axis ({0, 1}) – Axis to perform the map across (0 - index, 1 - columns).
partitions (NumPy 2D array) – Partitions of Modin Frame.
map_func (callable) – Function to apply.
keep_partitioning (bool, default: False) – Whether to keep partitioning for Modin Frame. Setting it to True stops data shuffling between partitions.
lengths (list of ints, default: None) – List of lengths to shuffle the object.
enumerate_partitions (bool, default: False) – Whether or not to pass partition index into map_func. Note that map_func must be able to accept partition_idx kwarg.
**kwargs (dict) – Additional options that could be used by different engines.
- Returns
An array of new partitions for Modin Frame.
- Return type
NumPy array
Notes
This method should be used in the case when map_func relies on some global information about the axis.
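The partition_idx requirement from enumerate_partitions can be sketched with a toy driver that joins each grid column into one frame spanning the whole index before mapping. The helper map_full_columns, the function tag_with_partition, and this reading of mapping across the index are assumptions made for illustration; the real method returns a NumPy array of partitions rather than a list of frames.

```python
import numpy as np
import pandas as pd

def tag_with_partition(df, partition_idx=None):
    """Hypothetical map_func that accepts the partition_idx keyword argument."""
    return df.assign(partition=partition_idx)

def map_full_columns(grid, map_func, enumerate_partitions=False):
    """Toy model: join each grid column into one frame, then apply map_func."""
    results = []
    for j in range(grid.shape[1]):
        full_column = pd.concat(list(grid[:, j]), ignore_index=True)
        kwargs = {"partition_idx": j} if enumerate_partitions else {}
        results.append(map_func(full_column, **kwargs))
    return results

# A 2x2 grid of blocks (sample data, assumed for illustration).
grid = np.empty((2, 2), dtype=object)
grid[0, 0] = pd.DataFrame({"a": [1, 2]})
grid[1, 0] = pd.DataFrame({"a": [3]})
grid[0, 1] = pd.DataFrame({"b": [4, 5]})
grid[1, 1] = pd.DataFrame({"b": [6]})

tagged = map_full_columns(grid, tag_with_partition, enumerate_partitions=True)
```

Because map_func receives every row of its column group at once, it can compute anything that depends on the whole axis.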
- classmethod map_partitions(partitions, map_func)
Apply map_func to every partition in partitions.
- Parameters
partitions (NumPy 2D array) – Partitions housing the data of Modin Frame.
map_func (callable) – Function to apply.
- Returns
An array of partitions.
- Return type
NumPy array
- classmethod n_ary_operation(left, func, right: list)
Apply an n-ary operation to multiple PandasDataframe objects.
This method assumes that all the partitions of the dataframes in left and right have the same dimensions. For each position i, j in each dataframe’s partitions, the result has a partition at (i, j) whose data is func(left_partitions[i,j], *each_right_partitions[i,j]).
- Parameters
left (np.ndarray) – The partitions of left PandasDataframe.
func (callable) – The function to apply.
right (list of np.ndarray) – The list of partitions of other PandasDataframe.
- Returns
A NumPy array with new partitions.
- Return type
np.ndarray
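The position-wise rule in the description, func(left_partitions[i,j], *each_right_partitions[i,j]), can be sketched with NumPy object arrays that hold pandas frames directly. The helpers n_ary and make_grid are hypothetical, and real partitions wrap remote objects rather than in-memory frames.

```python
import numpy as np
import pandas as pd

def n_ary(left, func, right):
    """Toy model: combine blocks that share the same grid position."""
    out = np.empty(left.shape, dtype=object)
    for idx in np.ndindex(left.shape):
        out[idx] = func(left[idx], *(grid[idx] for grid in right))
    return out

def make_grid(offset):
    # Every grid has the same 1x2 layout, as the method assumes.
    grid = np.empty((1, 2), dtype=object)
    grid[0, 0] = pd.DataFrame({"a": [1 + offset, 2 + offset]})
    grid[0, 1] = pd.DataFrame({"b": [3 + offset]})
    return grid

left = make_grid(0)
right = [make_grid(10), make_grid(100)]

# Ternary operation: sum the three blocks at each position.
total = n_ary(left, lambda x, y, z: x + y + z, right)
```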
- classmethod wait_partitions(partitions)
Wait on the objects wrapped by partitions in parallel, without materializing them.
This method will block until all computations in the list have completed.
- Parameters
partitions (np.ndarray) – NumPy array with PandasDataframePartition-s.
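The difference between waiting and materializing can be shown with the standard library, where a thread pool stands in for Ray. In this sketch, concurrent.futures.wait blocks until every task has finished without retrieving any result; slow_square and the pool size are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def slow_square(x):
    time.sleep(0.05)  # simulate a computation that takes a while
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_square, n) for n in range(4)]

    # Block until all tasks are complete; no result is fetched here.
    done, not_done = wait(futures)
    all_finished = all(future.done() for future in futures)
```

Results stay inside the futures until something explicitly asks for them, which mirrors waiting on partitions without pulling their data back.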