PandasDataframeAxisPartition#

The class implements abstract interface methods from BaseDataframeAxisPartition giving the means for a sibling partition manager to actually work with the axis-wide partitions.

The class is base for any axis partition class of pandas storage format.

Subclasses must implement list_of_blocks which represents data wrapped by the PandasDataframePartition objects and creates something interpretable as a pandas.DataFrame.

See PandasOnRayDataframeAxisPartition for an example on how to override/use this class when the implementation needs to be augmented.

The PandasDataframeAxisPartition object has an invariant that requires that this object is never returned from a function. It assumes that there will always be PandasDataframeAxisPartition object stored and structures itself accordingly.

Public API#

class modin.core.dataframe.pandas.partitioning.axis_partition.PandasDataframeAxisPartition(list_of_partitions, get_ip=False, full_axis=True, call_queue=None, length=None, width=None)#

An abstract class is created to simplify and consolidate the code for axis partition that run pandas.

Because much of the code is similar, this allows us to reuse this code.

Parameters:
  • list_of_partitions (Union[list, PandasDataframePartition]) – List of PandasDataframePartition and PandasDataframeAxisPartition objects, or a single PandasDataframePartition.

  • get_ip (bool, default: False) – Whether to get node IP addresses to conforming partitions or not.

  • full_axis (bool, default: True) – Whether or not the axis partition encompasses the whole axis.

  • call_queue (list, optional) – A list of tuples (callable, args, kwargs) that contains deferred calls.

  • length (the future's type or int, optional) – Length, or reference to length, of wrapped pandas.DataFrame.

  • width (the future's type or int, optional) – Width, or reference to width, of wrapped pandas.DataFrame.

add_to_apply_calls(func, *args, length=None, width=None, **kwargs)#

Add a function to the call queue.

Parameters:
  • func (callable or a future type) – Function to be added to the call queue.

  • *args (iterable) – Additional positional arguments to be passed in func.

  • length (A future type or int, optional) – Length, or reference to it, of wrapped pandas.DataFrame.

  • width (A future type or int, optional) – Width, or reference to it, of wrapped pandas.DataFrame.

  • **kwargs (dict) – Additional keyword arguments to be passed in func.

Returns:

A new PandasDataframeAxisPartition object.

Return type:

PandasDataframeAxisPartition

apply(func, *args, num_splits=None, other_axis_partition=None, maintain_partitioning=True, lengths=None, manual_partition=False, **kwargs)#

Apply a function to this axis partition along full axis.

Parameters:
  • func (callable) – The function to apply.

  • *args (iterable) – Positional arguments to pass to func.

  • num_splits (int, default: None) – The number of times to split the result object.

  • other_axis_partition (PandasDataframeAxisPartition, default: None) – Another PandasDataframeAxisPartition object to be applied to func. This is for operations that are between two data sets.

  • maintain_partitioning (bool, default: True) – Whether to keep the partitioning in the same orientation as it was previously or not. This is important because we may be operating on an individual AxisPartition and not touching the rest. In this case, we have to return the partitioning to its previous orientation (the lengths will remain the same). This is ignored between two axis partitions.

  • lengths (iterable, default: None) – The list of lengths to shuffle the object.

  • manual_partition (bool, default: False) – If True, partition the result with lengths.

  • **kwargs (dict) – Additional keywords arguments to be passed in func.

Returns:

A list of PandasDataframePartition objects.

Return type:

list

classmethod deploy_axis_func(axis, func, f_args, f_kwargs, num_splits, maintain_partitioning, *partitions, min_block_size, lengths=None, manual_partition=False, return_generator=False)#

Deploy a function along a full axis.

Parameters:
  • axis ({0, 1}) – The axis to perform the function along.

  • func (callable) – The function to perform.

  • f_args (list or tuple) – Positional arguments to pass to func.

  • f_kwargs (dict) – Keyword arguments to pass to func.

  • num_splits (int) – The number of splits to return (see split_result_of_axis_func_pandas).

  • maintain_partitioning (bool) – If True, keep the old partitioning if possible. If False, create a new partition layout.

  • *partitions (iterable) – All partitions that make up the full axis (row or column).

  • min_block_size (int) – Minimum number of rows/columns in a single split.

  • lengths (list, optional) – The list of lengths to shuffle the object.

  • manual_partition (bool, default: False) – If True, partition the result with lengths.

  • return_generator (bool, default: False) – Return a generator from the function, set to True for Ray backend as Ray remote functions can return Generators.

Returns:

A list or generator of pandas DataFrames.

Return type:

list | Generator

classmethod deploy_func_between_two_axis_partitions(axis, func, f_args, f_kwargs, num_splits, len_of_left, other_shape, *partitions, min_block_size, return_generator=False)#

Deploy a function along a full axis between two data sets.

Parameters:
  • axis ({0, 1}) – The axis to perform the function along.

  • func (callable) – The function to perform.

  • f_args (list or tuple) – Positional arguments to pass to func.

  • f_kwargs (dict) – Keyword arguments to pass to func.

  • num_splits (int) – The number of splits to return (see split_result_of_axis_func_pandas).

  • len_of_left (int) – The number of values in partitions that belong to the left data set.

  • other_shape (np.ndarray) – The shape of right frame in terms of partitions, i.e. (other_shape[i-1], other_shape[i]) will indicate slice to restore i-1 axis partition.

  • *partitions (iterable) – All partitions that make up the full axis (row or column) for both data sets.

  • min_block_size (int) – Minimum number of rows/columns in a single split.

  • return_generator (bool, default: False) – Return a generator from the function, set to True for Ray backend as Ray remote functions can return Generators.

Returns:

A list or generator of pandas DataFrames.

Return type:

list | Generator

classmethod deploy_splitting_func(axis, split_func, f_args, f_kwargs, num_splits, *partitions, extract_metadata=False)#

Deploy a splitting function along a full axis.

Parameters:
  • axis ({0, 1}) – The axis to perform the function along.

  • split_func (callable(pandas.DataFrame) -> list[pandas.DataFrame]) – The function to perform.

  • f_args (list or tuple) – Positional arguments to pass to split_func.

  • f_kwargs (dict) – Keyword arguments to pass to split_func.

  • num_splits (int) – The number of splits the split_func return.

  • *partitions (iterable) – All partitions that make up the full axis (row or column).

  • extract_metadata (bool, default: False) – Whether to return metadata (length, width, ip) of the result. Note that True value is not supported in PandasDataframeAxisPartition class.

Returns:

A list of pandas DataFrames.

Return type:

list

classmethod drain(df: DataFrame, call_queue: list)#

Execute all operations stored in the call queue on the pandas object (helper function).

Parameters:
  • df (pandas.DataFrame) –

  • call_queue (list) – Call queue that needs to be executed on pandas DataFrame.

Return type:

pandas.DataFrame

drain_call_queue(num_splits=None)#

Execute all operations stored in this partition’s call queue.

Parameters:

num_splits (int, default: None) – The number of times to split the result object.

force_materialization(get_ip=False)#

Materialize partitions into a single partition.

Parameters:

get_ip (bool, default: False) – Whether to get node ip address to a single partition or not.

Returns:

An axis partition containing only a single materialized partition.

Return type:

PandasDataframeAxisPartition

length(materialize=True)#

Get the length of this partition.

Parameters:

materialize (bool, default: True) – Whether to forcibly materialize the result into an integer. If False was specified, may return a future of the result if it hasn’t been materialized yet.

Returns:

The length of the partition.

Return type:

int

property list_of_block_partitions: list#

Get the list of block partitions that compose this partition.

Returns:

A list of PandasDataframePartition.

Return type:

List

property list_of_blocks#

Get the list of physical partition objects that compose this partition.

Returns:

A list of physical partition objects (ray.ObjectRef, distributed.Future e.g.).

Return type:

list

mask(row_indices, col_indices)#

Create (synchronously) a mask that extracts the indices provided.

Parameters:
  • row_indices (list-like, slice or label) – The row labels for the rows to extract.

  • col_indices (list-like, slice or label) – The column labels for the columns to extract.

Returns:

A new PandasDataframeAxisPartition object, materialized.

Return type:

PandasDataframeAxisPartition

split(split_func, num_splits, f_args=None, f_kwargs=None, extract_metadata=False)#

Split axis partition into multiple partitions using the split_func.

Parameters:
  • split_func (callable(pandas.DataFrame) -> list[pandas.DataFrame]) – A function that takes partition’s content and split it into multiple chunks.

  • num_splits (int) – The number of splits the split_func return.

  • f_args (iterable, optional) – Positional arguments to pass to the split_func.

  • f_kwargs (dict, optional) – Keyword arguments to pass to the split_func.

  • extract_metadata (bool, default: False) – Whether to return metadata (length, width, ip) of the result. Passing False may relax the load on object storage as the remote function would return X times fewer futures (where X is the number of metadata values). Passing False makes sense for temporary results where you know for sure that the metadata will never be requested.

Returns:

List of wrapped remote partition objects.

Return type:

list

to_numpy()#

Convert the data in this partition to a numpy.array.

Return type:

NumPy array.

to_pandas()#

Convert the data in this partition to a pandas.DataFrame.

Return type:

pandas DataFrame.

wait()#

Wait completing computations on the object wrapped by the partition.

width(materialize=True)#

Get the width of this partition.

Parameters:

materialize (bool, default: True) – Whether to forcibly materialize the result into an integer. If False was specified, may return a future of the result if it hasn’t been materialized yet.

Returns:

The width of the partition.

Return type:

int