PandasDataframePartition#
This class is the base for every partition class of the pandas storage format. It is the lowest level of the partitioning hierarchy: operations passed down from the partition manager are performed here, on an individual block partition.
The class defines an API that child classes must override in order to manipulate the data and metadata they store.
The public API exposed by the children of this class is used in PandasDataframePartitionManager.
PandasDataframePartitionManager subclasses treat the objects wrapped by the child classes as immutable, and there is no logic for updating them in place.
Public API#
- class modin.core.dataframe.pandas.partitioning.partition.PandasDataframePartition#
An abstract class that is the base for any partition class of the pandas storage format. The class provides an API that has to be overridden by child classes.
- add_to_apply_calls(func, *args, length=None, width=None, **kwargs)#
Add a function to the call queue.
- Parameters:
func (callable) – Function to be added to the call queue.
*args (iterable) – Additional positional arguments to be passed in func.
length (reference or int, optional) – Length, or reference to length, of wrapped pandas.DataFrame.
width (reference or int, optional) – Width, or reference to width, of wrapped pandas.DataFrame.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns:
New PandasDataframePartition object with the function added to the call queue.
- Return type:
PandasDataframePartition
Notes
The queued functions are executed when apply is called, in the order in which they were inserted. The func passed to apply runs last, and its result is what apply returns.
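The ordering described in the notes above can be illustrated with a small sketch. `ToyPartition` is a hypothetical stand-in written only for this example, not Modin's implementation: it keeps the wrapped data in a plain attribute, ignores `length` and `width`, and returns a new object from every call so the wrapped data is never updated in place.

```python
class ToyPartition:
    """Hypothetical stand-in for a partition; not Modin's implementation."""

    def __init__(self, data, call_queue=None):
        self._data = data
        self._call_queue = list(call_queue or [])

    def add_to_apply_calls(self, func, *args, **kwargs):
        # Return a new partition with one more queued call; nothing runs yet.
        return ToyPartition(self._data, self._call_queue + [(func, args, kwargs)])

    def apply(self, func, *args, **kwargs):
        result = self._data
        # Queued calls run first, in the order they were inserted.
        for queued, q_args, q_kwargs in self._call_queue:
            result = queued(result, *q_args, **q_kwargs)
        # The function given to apply runs last; its result is wrapped and returned.
        return ToyPartition(func(result, *args, **kwargs))

    def get(self):
        # Simplification: returns the stored data without draining the queue.
        return self._data


base = ToyPartition([1, 2, 3])
queued = base.add_to_apply_calls(lambda xs, k: [x * k for x in xs], 10)
queued = queued.add_to_apply_calls(lambda xs: [x + 1 for x in xs])
total = queued.apply(sum)

print(total.get())  # multiply by 10, then add 1, then sum
print(base.get())   # the original partition is unchanged
```

In this sketch the multiplication runs before the increment because it was queued first, and `sum` runs last because it was passed to `apply`.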
- apply(func, *args, **kwargs)#
Apply a function to the object wrapped by this partition.
- Parameters:
func (callable) – Function to apply.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns:
New PandasDataframePartition object.
- Return type:
PandasDataframePartition
Notes
It is up to the implementation how kwargs are handled. They are an important part of many implementations. As of right now, they are not serialized.
- drain_call_queue()#
Execute all operations stored in the call queue on the object wrapped by this partition.
- classmethod empty()#
Create a new partition that wraps an empty pandas DataFrame.
- Returns:
New PandasDataframePartition object.
- Return type:
PandasDataframePartition
- get()#
Get the object wrapped by this partition.
- Returns:
The object that was wrapped by this partition.
- Return type:
object
Notes
This is the opposite of the classmethod put. E.g. if you assign x = PandasDataframePartition.put(1), x.get() should always return 1.
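The put/get round trip from the note above can be sketched with a toy class. `ToyStorePartition` and its dictionary-backed store are assumptions made for illustration; a real child class would place the object in the store of its execution engine instead.

```python
import itertools


class ToyStorePartition:
    """Hypothetical partition backed by an in-process dictionary store."""

    _store = {}                # stands in for an engine's object store
    _ids = itertools.count()   # generates unique references

    def __init__(self, ref):
        self._ref = ref

    @classmethod
    def put(cls, obj):
        # Place the object in the store and wrap its reference in a partition.
        ref = next(cls._ids)
        cls._store[ref] = obj
        return cls(ref)

    def get(self):
        # Resolve the reference back to the stored object.
        return self._store[self._ref]


x = ToyStorePartition.put(1)
print(x.get())
```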
- length(materialize=True)#
Get the length of the object wrapped by this partition.
- Parameters:
materialize (bool, default: True) – Whether to forcibly materialize the result into an integer. If False is specified, a future of the result may be returned if it hasn’t been materialized yet.
- Returns:
The length of the object.
- Return type:
int or its Future
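To make the effect of `materialize` concrete, here is a minimal sketch that uses `concurrent.futures` from the standard library in place of a distributed engine. `ToyLazyPartition` and its caching strategy are assumptions for this example only; actual child classes may cache and resolve lengths differently.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class ToyLazyPartition:
    """Hypothetical partition whose length is computed asynchronously."""

    def __init__(self, rows, executor):
        self._rows = rows
        self._executor = executor
        self._length_cache = None  # holds either an int or a Future

    def length(self, materialize=True):
        if self._length_cache is None:
            # Schedule the computation; the result is not awaited here.
            self._length_cache = self._executor.submit(len, self._rows)
        if materialize and isinstance(self._length_cache, Future):
            # Block until the value is ready and cache the plain integer.
            self._length_cache = self._length_cache.result()
        return self._length_cache


with ThreadPoolExecutor(max_workers=1) as pool:
    part = ToyLazyPartition([[1, 2], [3, 4], [5, 6]], pool)
    lazy_length = part.length(materialize=False)  # a Future
    exact_length = part.length()                  # an int

print(type(lazy_length).__name__, exact_length)
```

Returning the future lets a caller schedule further work without blocking on the length, which is the usual reason to pass `materialize=False`.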
- property list_of_blocks#
Get the list of physical partition objects that compose this partition.
- Returns:
A list of physical partition objects (e.g. ray.ObjectRef, distributed.Future).
- Return type:
list
- mask(row_labels, col_labels)#
Lazily create a mask that extracts the indices provided.
- Parameters:
row_labels (list-like, slice or label) – The row labels for the rows to extract.
col_labels (list-like, slice or label) – The column labels for the columns to extract.
- Returns:
New PandasDataframePartition object.
- Return type:
PandasDataframePartition
- classmethod preprocess_func(func)#
Preprocess a function before an apply call.
- Parameters:
func (callable) – Function to preprocess.
- Returns:
An object that can be accepted by apply.
- Return type:
callable
Notes
This is a classmethod because the definition of how to preprocess should be class-wide. Also, we may want to use this before we deploy a preprocessed function to multiple PandasDataframePartition objects.
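The note above mentions preprocessing a function once and then deploying it to several partitions. A rough sketch of that pattern follows. `ToyPartition`, the copy-making wrapper, and the `preprocess_count` counter are all hypothetical; what a real child class does in `preprocess_func` depends on its execution engine.

```python
class ToyPartition:
    """Hypothetical partition used to show class-wide preprocessing."""

    preprocess_count = 0  # counts how often preprocessing ran (illustration only)

    def __init__(self, data):
        self._data = data

    @classmethod
    def preprocess_func(cls, func):
        cls.preprocess_count += 1

        def wrapped(data, *args, **kwargs):
            # Assumed preprocessing step: pass a copy so stored data is never mutated.
            return func(list(data), *args, **kwargs)

        return wrapped

    def apply(self, func, *args, **kwargs):
        return ToyPartition(func(self._data, *args, **kwargs))

    def get(self):
        return self._data


partitions = [ToyPartition([3, 1, 2]), ToyPartition([9, 7])]
prepared = ToyPartition.preprocess_func(sorted)  # preprocess once...
results = [p.apply(prepared).get() for p in partitions]  # ...reuse everywhere

print(results, ToyPartition.preprocess_count)
```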
- classmethod put(obj)#
Put an object into a store and wrap it with a partition object.
- Parameters:
obj (object) – An object to be put.
- Returns:
New PandasDataframePartition object.
- Return type:
PandasDataframePartition
- split(split_func, num_splits, *args)#
Split the object wrapped by the partition into multiple partitions.
- Parameters:
split_func (Callable[pandas.DataFrame, List[Any]] -> List[pandas.DataFrame]) – The function that will split this partition into multiple partitions. The list contains pivots to split by, and will have the same dtype as the major column we are shuffling on.
num_splits (int) – The number of resulting partitions (may be empty).
*args (List[Any]) – Arguments to pass to split_func.
- Returns:
A list of partitions.
- Return type:
list
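As an illustration of splitting by pivots, the sketch below uses plain Python lists in place of pandas DataFrames. `ToyPartition` and `split_by_pivots` are hypothetical names introduced for this example, and the check against `num_splits` is an assumption about how an implementation might validate the result.

```python
import bisect


class ToyPartition:
    """Hypothetical partition wrapping a list of values."""

    def __init__(self, data):
        self._data = data

    def get(self):
        return self._data

    def split(self, split_func, num_splits, *args):
        # Delegate the actual splitting to split_func, then wrap each piece.
        pieces = split_func(self._data, *args)
        if len(pieces) != num_splits:
            raise ValueError("split_func returned an unexpected number of pieces")
        return [ToyPartition(piece) for piece in pieces]


def split_by_pivots(values, pivots):
    # N sorted pivots define N + 1 buckets; a bucket may stay empty.
    buckets = [[] for _ in range(len(pivots) + 1)]
    for value in values:
        buckets[bisect.bisect_right(pivots, value)].append(value)
    return buckets


part = ToyPartition([5, 1, 9, 3, 7])
pieces = part.split(split_by_pivots, 3, [3, 6])

print([piece.get() for piece in pieces])
```

Because two pivots are passed, `num_splits` is 3 here; values keep their original relative order inside each bucket.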
- to_numpy(**kwargs)#
Convert the object wrapped by this partition to a NumPy array.
- Parameters:
**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.
- Return type:
np.ndarray
Notes
If the underlying object is a pandas DataFrame, this will return a 2D NumPy array.
- to_pandas()#
Convert the object wrapped by this partition to a pandas.DataFrame.
- Return type:
pandas.DataFrame
Notes
If the underlying object is a pandas DataFrame, this will likely only need to call get.
- wait()#
Wait for completion of computations on the object wrapped by the partition.
- width(materialize=True)#
Get the width of the object wrapped by the partition.
- Parameters:
materialize (bool, default: True) – Whether to forcibly materialize the result into an integer. If False is specified, a future of the result may be returned if it hasn’t been materialized yet.
- Returns:
The width of the object.
- Return type:
int or its Future