PandasDataframe#

PandasDataframe is a direct descendant of ModinDataframe. Its purpose is to implement the abstract interfaces for use with all pandas-based storage formats. A specific implementation can inherit from PandasDataframe and extend it when it needs special handling for some behavior or better performance on a particular execution engine.

The class serves as the intermediate level between the pandas query compiler and the conforming partition manager. All queries formed at the query compiler layer are ingested by this class and then passed, together with the stored partitions, to the partition manager for processing. This class must not manipulate partitions directly, except in operations that are strictly private or protected and are called only from inside the class. The class provides a significantly reduced set of operations that covers many pandas operations.

The main tasks of PandasDataframe are storing partitions, manipulating axis labels, and providing a set of methods to perform operations on the internal data.

As mentioned above, PandasDataframe shouldn’t work with stored partitions directly; the responsibility for modifying the partitions array lies with PandasDataframePartitionManager. For example, the method broadcast_apply_full_axis() delegates function application to the partition manager's broadcast_axis_partitions() method.
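The division of labor described above can be pictured with a minimal, pure-Python sketch. The classes ToyPartitionManager and ToyFrame are hypothetical stand-ins invented for illustration, not Modin classes; only the delegation pattern mirrors the text.

```python
class ToyPartitionManager:
    """Hypothetical stand-in for a partition manager: the only place that touches blocks."""

    @classmethod
    def map_partitions(cls, partitions, func):
        # Apply func to every block of a 2D grid (a list of rows of blocks).
        return [[func(block) for block in row] for row in partitions]


class ToyFrame:
    """Hypothetical stand-in for the dataframe layer: stores partitions, delegates the work."""

    _partition_mgr_cls = ToyPartitionManager

    def __init__(self, partitions):
        self._partitions = partitions

    def map(self, func):
        # The frame never loops over blocks itself; it hands them to the manager.
        new_parts = self._partition_mgr_cls.map_partitions(self._partitions, func)
        return ToyFrame(new_parts)


frame = ToyFrame([[[1, 2], [3]], [[4, 5], [6]]])
doubled = frame.map(lambda block: [value * 2 for value in block])
print(doubled._partitions)
```

Keeping the block-level loop in one class means an engine-specific implementation only has to swap the manager, not the frame logic.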

A Modin PandasDataframe can be created from a pandas.DataFrame or a pyarrow.Table (using from_pandas() and from_arrow(), respectively). A PandasDataframe can also be converted to a NumPy array or a pandas.DataFrame (using to_numpy() and to_pandas(), respectively).

Axis labels are manipulated through internal methods, for example for replacing labels with new ones or adding prefixes/suffixes.

Public API#

class modin.core.dataframe.pandas.dataframe.dataframe.PandasDataframe(partitions, index=None, columns=None, row_lengths=None, column_widths=None, dtypes: Optional[Union[Series, ModinDtypes, Callable]] = None, pandas_backend: Optional[str] = None)#

An abstract class that represents the parent class for any pandas storage format dataframe class.

This class provides interfaces to run operations on dataframe partitions.

Parameters:
  • partitions (np.ndarray) – A 2D NumPy array of partitions.

  • index (sequence or callable, optional) – The index for the dataframe. Converted to a pandas.Index. Is computed from partitions on demand if not specified. If a callable of type () -> (pandas.Index, list of row lengths or None) is passed, the computation is deferred until self.index is accessed.

  • columns (sequence, optional) – The columns object for the dataframe. Converted to a pandas.Index. Is computed from partitions on demand if not specified.

  • row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.

  • column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.

  • dtypes (pandas.Series or callable, optional) – The data types for the dataframe columns.

  • pandas_backend ({"pyarrow", None}, optional) – Backend used by pandas.
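To make the relationship between partitions, row_lengths and column_widths concrete, here is a minimal sketch that builds a 2D NumPy array of plain pandas blocks by hand. Real partitions are engine-specific partition objects rather than raw pandas.DataFrame instances, so the block type here is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(5, 6), columns=list("abcdef"))

# Build a 2x2 grid of blocks by hand: rows split 3 + 2, columns split 4 + 2.
row_splits = [slice(0, 3), slice(3, 5)]
col_splits = [slice(0, 4), slice(4, 6)]
partitions = np.empty((2, 2), dtype=object)
for i, rows in enumerate(row_splits):
    for j, cols in enumerate(col_splits):
        partitions[i, j] = df.iloc[rows, cols]

# Heights come from the first column of blocks, widths from the first row.
row_lengths = [len(block) for block in partitions[:, 0]]
column_widths = [block.shape[1] for block in partitions[0, :]]
print(row_lengths, column_widths)
```

The lengths sum to the number of rows and the widths sum to the number of columns, which is why both can be computed from the partitions when they are not provided.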

apply_full_axis(axis, func, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions: bool = False, dtypes=None, keep_partitioning=True, num_splits=None, sync_labels=True, pass_axis_lengths_to_partitions=False) PandasDataframe#

Perform a function across an entire axis.

Parameters:
  • axis ({0, 1}) – The axis to apply over (0 - rows, 1 - columns).

  • func (callable) – The function to apply.

  • new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.

  • apply_indices (list-like, optional) – Indices of axis ^ 1 to apply function over.

  • enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must be able to accept a partition_idx kwarg.

  • dtypes (list-like or scalar, optional) – The data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.

  • keep_partitioning (boolean, default: True) – The flag to keep partition boundaries for Modin Frame if possible. Setting it to True disables shuffling data from one partition to another in case the resulting number of splits is equal to the initial number of splits.

  • num_splits (int, optional) – The number of partitions to split the result into across the axis. If None, then the number of splits will be inferred automatically. If num_splits is None and keep_partitioning=True then the number of splits is preserved.

  • sync_labels (boolean, default: True) – Synchronize external indexes (new_index, new_columns) with internal indexes. This could be used when you’re certain that the indices in partitions are equal to the provided hints in order to save time on syncing them.

  • pass_axis_lengths_to_partitions (bool, default: False) – Whether to pass partition lengths along axis ^ 1 to the kernel func. Note that func must be able to accept df, *axis_lengths.

Returns:

A new dataframe.

Return type:

PandasDataframe

Notes

The data shape may change as a result of the function.
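The following sketch illustrates why some functions need a full axis. It is not how Modin implements apply_full_axis(): the helper toy_apply_full_axis is hypothetical, the blocks are plain pandas frames, and only the case where each function call receives whole columns is shown.

```python
import pandas as pd


def toy_apply_full_axis(grid, func):
    """Hypothetical helper: stitch each column of blocks into one full-height frame, apply func."""
    results = []
    for j in range(len(grid[0])):
        full_column = pd.concat([row[j] for row in grid], axis=0)
        results.append(func(full_column))
    return pd.concat(results, axis=1)


df = pd.DataFrame({"a": [3, 1, 2, 0], "b": [10, 40, 20, 30]})
grid = [
    [df.iloc[:2, [0]], df.iloc[:2, [1]]],
    [df.iloc[2:, [0]], df.iloc[2:, [1]]],
]

# A cumulative maximum depends on all earlier rows, so it cannot be computed block by block.
result = toy_apply_full_axis(grid, lambda part: part.cummax())
print(result)
```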

apply_full_axis_select_indices(axis, func, apply_indices=None, numeric_indices=None, new_index=None, new_columns=None, keep_remaining=False, new_dtypes: Optional[Union[Series, ModinDtypes]] = None)#

Apply a function across an entire axis for a subset of the data.

Parameters:
  • axis (int) – The axis to apply over.

  • func (callable) – The function to apply.

  • apply_indices (list-like, optional) – The labels to apply over.

  • numeric_indices (list-like, optional) – The indices to apply over.

  • new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.

  • keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.

  • new_dtypes (ModinDtypes or pandas.Series, optional) – The data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.

Returns:

A new dataframe.

Return type:

PandasDataframe

apply_select_indices(axis, func, apply_indices=None, row_labels=None, col_labels=None, new_index=None, new_columns=None, new_dtypes: Optional[Series] = None, keep_remaining=False, item_to_distribute=_NoDefault.no_default) PandasDataframe#

Apply a function for a subset of the data.

Parameters:
  • axis ({0, 1}) – The axis to apply over.

  • func (callable) – The function to apply.

  • apply_indices (list-like, optional) – The labels to apply over. Must be given if axis is provided.

  • row_labels (list-like, optional) – The row labels to apply over. Must be provided with col_labels to apply over both axes.

  • col_labels (list-like, optional) – The column labels to apply over. Must be provided with row_labels to apply over both axes.

  • new_index (list-like, optional) – The index of the result, if known in advance.

  • new_columns (list-like, optional) – The columns of the result, if known in advance.

  • new_dtypes (pandas.Series, optional) – The dtypes of the result, if known in advance.

  • keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.

  • item_to_distribute (np.ndarray or scalar, default: no_default) – The item to split up so it can be applied over both axes.

Returns:

A new dataframe.

Return type:

PandasDataframe

astype(col_dtypes, errors: str = 'raise')#

Convert the columns dtypes to given dtypes.

Parameters:
  • col_dtypes (dictionary of {col: dtype,...} or str) – Where col is the column name and dtype is a NumPy dtype.

  • errors ({'raise', 'ignore'}, default: 'raise') – Control raising of exceptions on invalid data for provided dtype.

Returns:

Dataframe with updated dtypes.

Return type:

BaseDataFrame
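The col_dtypes mapping has the same {column: dtype} shape that pandas.DataFrame.astype accepts, so the plain-pandas call below is a reasonable mental model. It is only an analogy and does not exercise the Modin method itself.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["3", "4"]})

# Each key is a column name and each value is the target dtype for that column.
converted = df.astype({"a": "float64", "b": "int64"}, errors="raise")
print(converted.dtypes)
```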

property axes#

Get index and columns that can be accessed with an axis integer.

Returns:

List with two values: index and columns.

Return type:

list

broadcast_apply(axis, func, other, join_type='left', copartition=True, labels='keep', dtypes=None)#

Broadcast axis partitions of other to partitions of self and apply a function.

Parameters:
  • axis ({0, 1}) – Axis to broadcast over.

  • func (callable) – Function to apply.

  • other (PandasDataframe) – Modin DataFrame to broadcast.

  • join_type (str, default: "left") – Type of join to apply.

  • copartition (bool, default: True) – Whether to align indices/partitioning of the self and other frame. Disabling this may save some time, however, you have to be 100% sure that the indexing and partitioning are identical along the broadcasting axis, this might be the case for example if other is a projection of the self or vice-versa. If copartitioning is disabled and partitioning/indexing are incompatible then you may end up with undefined behavior.

  • labels ({"keep", "replace", "drop"}, default: "keep") – Whether to keep the labels from the self Modin DataFrame, replace them with the labels from the joined DataFrame, or drop them altogether so that they are computed lazily later.

  • dtypes ("copy", pandas.Series or None, optional) – Dtypes of the result. “copy” to keep old dtypes and None to compute them on demand.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

broadcast_apply_full_axis(axis, func, other, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions=False, dtypes=None, keep_partitioning=True, num_splits=None, sync_labels=True, pass_axis_lengths_to_partitions=False)#

Broadcast partitions of other Modin DataFrame and apply a function along full axis.

Parameters:
  • axis ({0, 1}) – Axis to apply over (0 - rows, 1 - columns).

  • func (callable) – Function to apply.

  • other (PandasDataframe or list) – Modin DataFrame(s) to broadcast.

  • new_index (list-like, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (list-like, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.

  • apply_indices (list-like, optional) – Indices of axis ^ 1 to apply function over.

  • enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must be able to accept a partition_idx kwarg.

  • dtypes (list-like or scalar, optional) – Data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.

  • keep_partitioning (boolean, default: True) – The flag to keep partition boundaries for Modin Frame if possible. Setting it to True disables shuffling data from one partition to another in case the resulting number of splits is equal to the initial number of splits.

  • num_splits (int, optional) – The number of partitions to split the result into across the axis. If None, then the number of splits will be inferred automatically. If num_splits is None and keep_partitioning=True then the number of splits is preserved.

  • sync_labels (boolean, default: True) – Synchronize external indexes (new_index, new_columns) with internal indexes. This could be used when you’re certain that the indices in partitions are equal to the provided hints in order to save time on syncing them.

  • pass_axis_lengths_to_partitions (bool, default: False) – Whether to pass partition lengths along axis ^ 1 to the kernel func. Note that func must be able to accept df, *axis_lengths.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

broadcast_apply_select_indices(axis, func, other: PandasDataframe, apply_indices=None, numeric_indices=None, keep_remaining=False, broadcast_all=True, new_index=None, new_columns=None) PandasDataframe#

Apply a function to select indices at specified axis and broadcast partitions of other Modin DataFrame.

Parameters:
  • axis ({0, 1}) – Axis to apply function along.

  • func (callable) – Function to apply.

  • other (PandasDataframe) – Partitions of which should be broadcasted.

  • apply_indices (list, optional) – List of labels to apply (if numeric_indices are not specified).

  • numeric_indices (list, optional) – Numeric indices to apply (if apply_indices are not specified).

  • keep_remaining (bool, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.

  • broadcast_all (bool, default: True) – Whether to broadcast the whole axis of the right frame to every partition or just a subset of it.

  • new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

case_when(caselist)#

Replace values where the conditions are True.

This is Series.case_when() implementation and, thus, it’s designed to work only with single-column DataFrames.

Parameters:

caselist (list of tuples) –

Return type:

PandasDataframe

property column_widths#

Compute the column partitions widths if they are not cached.

Returns:

A list of column partitions widths.

Return type:

list

property columns#

Get the columns from the cache object.

Returns:

An index object containing the column labels.

Return type:

pandas.Index

combine() PandasDataframe#

Create a single partition PandasDataframe from the partitions of the current dataframe.

Returns:

A single partition PandasDataframe.

Return type:

PandasDataframe

combine_and_apply(func, new_index=None, new_columns=None, new_dtypes=None)#

Combine all partitions into a single big one and apply the passed function to it.

Use this method with care, as it collects all the data on the same worker. It is only recommended for small or reduced datasets.

Parameters:
  • func (callable(pandas.DataFrame) -> pandas.DataFrame) – A function to apply to the combined partition.

  • new_index (sequence, optional) – Index of the result.

  • new_columns (sequence, optional) – Columns of the result.

  • new_dtypes (dict-like, optional) – Dtypes of the result.

Return type:

PandasDataframe
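The idea behind combine_and_apply() can be sketched with plain pandas blocks. The helper toy_combine_and_apply below is hypothetical and runs in a single process; it only illustrates merging a 2D grid into one frame before applying the function once.

```python
import pandas as pd


def toy_combine_and_apply(grid, func):
    """Hypothetical helper: merge a 2D grid of blocks into one frame, then apply func once."""
    combined = pd.concat([pd.concat(row, axis=1) for row in grid], axis=0)
    return func(combined)


df = pd.DataFrame({"a": [4, 1, 3, 2], "b": [40, 10, 30, 20]})
grid = [
    [df.iloc[:2, [0]], df.iloc[:2, [1]]],
    [df.iloc[2:, [0]], df.iloc[2:, [1]]],
]

# A global sort is straightforward once all rows live in a single partition.
result = toy_combine_and_apply(grid, lambda frame: frame.sort_values("a"))
print(result)
```

Because every block ends up in one place, memory use grows with the full dataset, which is the reason for the warning above.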

concat(axis: Union[int, Axis], others: Union[PandasDataframe, List[PandasDataframe]], how, sort) PandasDataframe#

Concatenate self with one or more other Modin DataFrames.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – Axis to concatenate over.

  • others (list) – List of Modin DataFrames to concatenate with.

  • how (str) – Type of join to use for the axis.

  • sort (bool) – Whether to sort the result or not.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

copy()#

Copy this object.

Returns:

A copied version of this object.

Return type:

PandasDataframe

copy_axis_cache(axis=0, copy_lengths=False)#

Copy the axis cache (index or columns).

Parameters:
  • axis (int, default: 0) –

  • copy_lengths (bool, default: False) – Whether to copy the stored partition lengths to the new index object.

Returns:

If there is a pandas.Index in the cache, then copying occurs.

Return type:

pandas.Index, callable or None

copy_columns_cache(copy_lengths=False)#

Copy the columns cache.

Parameters:

copy_lengths (bool, default: False) – Whether to copy the stored partition lengths to the new index object.

Returns:

If there is a pandas.Index in the cache, then copying occurs.

Return type:

pandas.Index or None

copy_dtypes_cache()#

Copy the dtypes cache.

Returns:

If there is a pandas.Series in the cache, then copying occurs.

Return type:

pandas.Series, callable or None

copy_index_cache(copy_lengths=False)#

Copy the index cache.

Parameters:

copy_lengths (bool, default: False) – Whether to copy the stored partition lengths to the new index object.

Returns:

If there is a pandas.Index in the cache, then copying occurs.

Return type:

pandas.Index, callable or ModinIndex

property dtypes#

Compute the data types if they are not cached.

Returns:

A pandas Series containing the data types for this dataframe.

Return type:

pandas.Series

explode(axis: Union[int, Axis], func: Callable) PandasDataframe#

Explode list-like entries along an entire axis.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis specifying how to explode. If axis=1, explode according to columns.

  • func (callable) – The function to use to explode a single element.

Returns:

A new dataframe with the list-like entries exploded.

Return type:

PandasDataframe

filter(axis: Union[Axis, int], condition: Callable) PandasDataframe#

Filter data based on the function provided along an entire axis.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to filter over.

  • condition (callable(row|col) -> bool) – The function to use for the filter. This function should filter the data itself.

Returns:

A new filtered dataframe.

Return type:

PandasDataframe

filter_by_types(types: List[Hashable]) PandasDataframe#

Allow the user to specify a type or set of types by which to filter the columns.

Parameters:

types (list) – The types to filter columns by.

Returns:

A new PandasDataframe from the filter provided.

Return type:

PandasDataframe

finalize()#

Perform all deferred calls on partitions.

This makes this Modin Dataframe independent of the history of queries that were used to build it.

fold(axis, func, new_index=None, new_columns=None, shape_preserved=False)#

Perform a function across an entire axis.

Parameters:
  • axis (int) – The axis to apply over.

  • func (callable) – The function to apply.

  • new_index (list-like, optional) – The index of the result.

  • new_columns (list-like, optional) – The columns of the result.

  • shape_preserved (bool, default: False) – Whether the shape of the dataframe is preserved or not after applying a function.

Returns:

A new dataframe.

Return type:

PandasDataframe

classmethod from_arrow(at)#

Create a Modin DataFrame from an Arrow Table.

Parameters:

at (pyarrow.Table) – Arrow Table.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

classmethod from_dataframe(df: ProtocolDataframe) PandasDataframe#

Convert a DataFrame implementing the dataframe exchange protocol to a Core Modin Dataframe.

See more about the protocol in https://data-apis.org/dataframe-protocol/latest/index.html.

Parameters:

df (ProtocolDataframe) – The DataFrame object supporting the dataframe exchange protocol.

Returns:

A new Core Modin Dataframe object.

Return type:

PandasDataframe

from_labels() PandasDataframe#

Convert the row labels to a column of data, inserted at the first position.

The result is similar to that of pandas.DataFrame.reset_index. Each level of self.index is added as a separate column of data.

Returns:

A PandasDataframe with new columns from index labels.

Return type:

PandasDataframe
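Since the result is described as similar to pandas.DataFrame.reset_index, the plain-pandas call below shows the expected shape of the output for a two-level index. It is an analogy only and does not call from_labels().

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["outer", "inner"])
df = pd.DataFrame({"value": [10, 20]}, index=index)

# Every index level becomes a leading data column; the row labels become positional.
flat = df.reset_index()
print(list(flat.columns))
```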

classmethod from_pandas(df)#

Create a Modin DataFrame from a pandas DataFrame.

Parameters:

df (pandas.DataFrame) – A pandas DataFrame.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

get_axis(axis: int = 0) Index#

Get index object for the requested axis.

Parameters:

axis ({0, 1}, default: 0) –

Return type:

pandas.Index

get_dtypes_set()#

Get a set of dtypes that are in this dataframe.

Return type:

set

groupby(axis: Union[int, Axis], internal_by: List[str], external_by: List[PandasDataframe], by_positions: List[int], operator: Callable, result_schema: Optional[Dict[Hashable, type]] = None, align_result_columns: bool = False, series_groupby: bool = False, add_missing_cats: bool = False, **kwargs: dict) PandasDataframe#

Generate groups based on values in the input column(s) and perform the specified operation on each.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to apply the grouping over.

  • internal_by (list of strings) – One or more column labels from the self dataframe to use for grouping.

  • external_by (list of PandasDataframes) – PandasDataframes to group by (may be specified along with or without internal_by).

  • by_positions (list of ints) –

    Specifies the order of grouping by internal_by and external_by columns. Each element in by_positions specifies an index from either external_by or internal_by. Indices for external_by are non-negative and start from 0. Indices for internal_by are negative and start from -1 (so to convert them to valid indices one should compute -idx - 1). For example, the arguments

        by_positions = [0, -1, 1, -2, 2, 3]
        internal_by = ["col1", "col2"]
        external_by = [sr1, sr2, sr3, sr4]

    correspond to

        df.groupby([sr1, "col1", sr2, "col2", sr3, sr4])

  • operator (callable(pandas.core.groupby.DataFrameGroupBy) -> pandas.DataFrame) – The operation to carry out on each of the groups. The operator is another algebraic operator with its own user-defined function parameter, depending on the output desired by the user.

  • result_schema (dict, optional) – Mapping from column labels to data types that represents the types of the output dataframe.

  • align_result_columns (bool, default: False) – Whether to manually align columns between all the resulted row partitions. This flag is helpful when dealing with UDFs as they can change the partition’s shape and labeling unpredictably, resulting in an invalid dataframe.

  • series_groupby (bool, default: False) – Whether to convert a one-column DataFrame to a Series before performing groupby.

  • add_missing_cats (bool, default: False) – Whether to add missing categories from by columns to the result.

  • **kwargs (dict) – Additional arguments to pass to the df.groupby method (besides the ‘by’ argument).

Returns:

A new PandasDataframe containing the groupings specified, with the operator applied to each group.

Return type:

PandasDataframe

Notes

No communication between groups is allowed in this algebra implementation.

The number of rows (columns if axis=1) returned by the user-defined function passed to the groupby may be at most the number of rows in the group, and may be as small as a single row.

Unlike the pandas API, an intermediate “GROUP BY” object is not present in this algebra implementation.
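The index convention of by_positions can be checked with a few lines of pure Python. The helper resolve_grouping_order is hypothetical and exists only to decode the convention described for that parameter; strings stand in for the external frames.

```python
def resolve_grouping_order(by_positions, internal_by, external_by):
    """Hypothetical helper that rebuilds the ordered `by` list from the three arguments."""
    ordered = []
    for idx in by_positions:
        if idx >= 0:
            # Non-negative values index into external_by directly.
            ordered.append(external_by[idx])
        else:
            # Negative values map to internal_by via -idx - 1 (-1 -> 0, -2 -> 1, ...).
            ordered.append(internal_by[-idx - 1])
    return ordered


by_positions = [0, -1, 1, -2, 2, 3]
internal_by = ["col1", "col2"]
external_by = ["sr1", "sr2", "sr3", "sr4"]  # strings stand in for the external frames

print(resolve_grouping_order(by_positions, internal_by, external_by))
# ['sr1', 'col1', 'sr2', 'col2', 'sr3', 'sr4']
```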

groupby_reduce(axis, by, map_func, reduce_func, new_index=None, new_columns=None, apply_indices=None)#

Group by another Modin DataFrame and aggregate the result.

Parameters:
  • axis ({0, 1}) – Axis to groupby and aggregate over.

  • by (PandasDataframe or None) – A Modin DataFrame to group by.

  • map_func (callable) – Map component of the aggregation.

  • reduce_func (callable) – Reduce component of the aggregation.

  • new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.

  • apply_indices (list-like, optional) – Indices of axis ^ 1 to apply groupby over.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe
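The split into map_func and reduce_func follows the usual map/reduce pattern for aggregations. The sketch below uses plain pandas row blocks and groups by a column of the same frame, which is a simplification: in the method above, by is a separate Modin DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "val": [1, 2, 3, 4, 5]})
row_parts = [df.iloc[:3], df.iloc[3:]]


def map_func(part):
    # Partial aggregation, computed independently on each row block.
    return part.groupby("key").sum()


def reduce_func(partials):
    # Partial results share group labels, so group again on the index and combine.
    return partials.groupby(level=0).sum()


partials = pd.concat([map_func(part) for part in row_parts])
result = reduce_func(partials)
print(result)
```

This works for sums because a sum of partial sums equals the total sum; aggregations without that property need a different pair of functions or a full-axis operation.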

has_axis_cache(axis=0) bool#

Check if the cache for the specified axis exists.

Parameters:

axis (int, default: 0) –

Return type:

bool

property has_columns_cache#

Check if the columns cache exists.

Return type:

bool

property has_dtypes_cache: bool#

Check if the dtypes cache exists.

Return type:

bool

property has_index_cache#

Check if the index cache exists.

Return type:

bool

property has_materialized_columns#

Check if dataframe has materialized columns cache.

Return type:

bool

property has_materialized_dtypes: bool#

Check if dataframe has materialized dtypes cache.

Return type:

bool

property has_materialized_index#

Check if dataframe has materialized index cache.

Return type:

bool

property index#

Get the index from the cache object.

Returns:

An index object containing the row labels.

Return type:

pandas.Index

infer_objects() PandasDataframe#

Attempt to infer better dtypes for object columns.

Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.

Returns:

A new PandasDataframe with the inferred schema.

Return type:

PandasDataframe

infer_types(col_labels: List[str]) PandasDataframe#

Determine the compatible type shared by all values in the specified columns, and coerce them to that type.

Parameters:

col_labels (list) – List of column labels to infer and induce types over.

Returns:

A new PandasDataframe with the inferred schema.

Return type:

PandasDataframe

join(axis: Union[int, Axis], condition: Callable, other: ModinDataframe, join_type: Union[str, JoinType]) PandasDataframe#

Join this dataframe with the other.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the join on.

  • condition (callable) – Function that determines which rows should be joined. The condition can be a simple equality, e.g. “left.col1 == right.col1” or can be arbitrarily complex.

  • other (ModinDataframe) – The other data to join with, i.e. the right dataframe.

  • join_type (string {"inner", "left", "right", "outer"} or modin.core.dataframe.base.utils.JoinType) – The type of join to perform.

Returns:

A new PandasDataframe that is the result of applying the specified join over the two dataframes.

Return type:

PandasDataframe

Notes

During the join, this dataframe is considered the left, while the other is treated as the right.

Only inner joins, left outer, right outer, and full outer joins are currently supported. Support for other join types (e.g. natural join) may be implemented in the future.

map(func: Callable, dtypes: Optional[str] = None, new_columns: Optional[Index] = None, func_args=None, func_kwargs=None, lazy=False) PandasDataframe#

Perform a function that maps across the entire dataset.

Parameters:
  • func (callable(row|col|cell) -> row|col|cell) – The function to apply.

  • dtypes (dtypes of the result, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.

  • new_columns (pandas.Index, optional) – New column labels of the result, its length has to be identical to the older columns. If not specified, old column labels are preserved.

  • func_args (iterable, optional) – Positional arguments for the ‘func’ callable.

  • func_kwargs (dict, optional) – Keyword arguments for the ‘func’ callable.

  • lazy (bool, default: False) – Whether to prefer lazy execution or not.

Returns:

A new dataframe.

Return type:

PandasDataframe

n_ary_op(op, right_frames: list[modin.core.dataframe.pandas.dataframe.dataframe.PandasDataframe], join_type='outer', sort=None, copartition_along_columns=True, labels='replace', dtypes: Optional[Series] = None) PandasDataframe#

Perform an n-ary operation by joining with other Modin DataFrame(s).

Parameters:
  • op (callable) – Function to apply after the join.

  • right_frames (list of PandasDataframe) – Modin DataFrames to join with.

  • join_type (str, default: "outer") – Type of join to apply.

  • sort (bool, default: None) – Whether to sort index and columns or not.

  • copartition_along_columns (bool, default: True) – Whether to perform copartitioning along columns or not. For some ops this isn’t needed (e.g., fillna).

  • labels ({"replace", "drop"}, default: "replace") – Whether to use the labels from the joined DataFrame or drop them altogether so that they are computed lazily later.

  • dtypes (pandas.Series, optional) – Dtypes of the resultant dataframe. This argument is passed when the resultant dtypes of the n-ary operation are precomputed.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

property num_parts: int#

Get the total number of partitions for this frame.

Return type:

int

numeric_columns(include_bool=True)#

Return the names of numeric columns in the frame.

Parameters:

include_bool (bool, default: True) – Whether to consider boolean columns as numeric.

Returns:

List of column names.

Return type:

list

reduce(axis: Union[int, Axis], function: Callable, dtypes: Optional[str] = None) PandasDataframe#

Perform a user-defined aggregation on the specified axis, where the axis reduces down to a singleton. Requires knowledge of the full axis for the reduction.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the reduce over.

  • function (callable(row|col) -> single value) – The reduce function to apply to each column.

  • dtypes (str, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.

Returns:

Modin series (1xN frame) containing the reduced data.

Return type:

PandasDataframe

Notes

The user-defined function must reduce to a single value.
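A median is a typical case where the reduction needs the full axis. The sketch below uses plain pandas row blocks (an assumption for illustration) and contrasts the full-axis result with a naive block-wise attempt.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3, 9], "b": [2, 2, 8, 4]})
row_parts = [df.iloc[:2], df.iloc[2:]]

# Full-axis reduce: every column is seen whole, and each reduces to a single value.
full = pd.concat(row_parts)
result = full.apply(lambda col: col.median(), axis=0)

# Naive block-wise attempt: the median of per-block medians is a different number.
block_medians = pd.concat([part.median().to_frame().T for part in row_parts])
naive = block_medians.median()

print(result.to_dict(), naive.to_dict())
```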

rename(new_row_labels: Optional[Union[Dict[Hashable, Hashable], Callable]] = None, new_col_labels: Optional[Union[Dict[Hashable, Hashable], Callable]] = None) PandasDataframe#

Replace the row and column labels with the specified new labels.

Parameters:
  • new_row_labels (dictionary or callable, optional) – Mapping or callable that relates old row labels to new labels.

  • new_col_labels (dictionary or callable, optional) – Mapping or callable that relates old col labels to new labels.

Returns:

A new PandasDataframe with the new row and column labels.

Return type:

PandasDataframe

property row_lengths#

Compute the row partitions lengths if they are not cached.

Returns:

A list of row partitions lengths.

Return type:

list

set_axis_cache(value, axis=0)#

Set cache for the specified axis (index or columns).

Parameters:
  • value (sequence, callable or None) –

  • axis (int, default: 0) –

set_columns_cache(columns)#

Set columns cache.

Parameters:

columns (sequence, callable or None) –

set_dtypes_cache(dtypes)#

Set dtypes cache.

Parameters:

dtypes (pandas.Series, ModinDtypes, callable or None) –

set_index_cache(index)#

Set index cache.

Parameters:

index (sequence, callable or None) –

sort_by(axis: Union[int, Axis], columns: Union[str, List[str]], ascending: bool = True, **kwargs) PandasDataframe#

Logically reorder rows (columns if axis=1) lexicographically by the data in a column or set of columns.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the sort over.

  • columns (string or list) – Column label(s) to use to determine lexicographical ordering.

  • ascending (boolean, default: True) – Whether to sort in ascending or descending order.

  • **kwargs (dict) – Keyword arguments to pass when sorting partitions.

Returns:

A new PandasDataframe sorted into lexicographical order by the specified column(s).

Return type:

PandasDataframe

support_materialization_in_worker_process() bool#

Whether it’s possible to call the to_pandas function during the pickling process, at the moment the object is recreated.

Return type:

bool

synchronize_labels(axis=None)#

Set the deferred axes variables for the PandasDataframe.

Parameters:

axis (int, optional) – The deferred axis. 0 for the index, 1 for the columns.

take_2d_labels_or_positional(row_labels: Optional[List[Hashable]] = None, row_positions: Optional[List[int]] = None, col_labels: Optional[List[Hashable]] = None, col_positions: Optional[List[int]] = None) PandasDataframe#

Lazily select columns or rows from given indices.

Parameters:
  • row_labels (list of hashable, optional) – The row labels to extract.

  • row_positions (list-like of ints, optional) – The row positions to extract.

  • col_labels (list of hashable, optional) – The column labels to extract.

  • col_positions (list-like of ints, optional) – The column positions to extract.

Returns:

A new PandasDataframe from the mask provided.

Return type:

PandasDataframe

Notes

If both row_labels and row_positions are provided, a ValueError is raised. The same rule applies for col_labels and col_positions.
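The argument rule in the note can be sketched on a plain pandas frame. The function toy_take is hypothetical and eager, whereas the method above is lazy; the error messages are also invented for illustration.

```python
import pandas as pd


def toy_take(df, row_labels=None, row_positions=None, col_labels=None, col_positions=None):
    """Hypothetical eager analogue: labels and positions are mutually exclusive per axis."""
    if row_labels is not None and row_positions is not None:
        raise ValueError("Pass either row_labels or row_positions, not both.")
    if col_labels is not None and col_positions is not None:
        raise ValueError("Pass either col_labels or col_positions, not both.")
    # Translate labels to positions so a single positional lookup handles both forms.
    if row_labels is not None:
        row_positions = df.index.get_indexer_for(row_labels)
    if col_labels is not None:
        col_positions = df.columns.get_indexer_for(col_labels)
    rows = slice(None) if row_positions is None else row_positions
    cols = slice(None) if col_positions is None else col_positions
    return df.iloc[rows, cols]


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["r0", "r1", "r2"])
picked = toy_take(df, row_labels=["r2", "r0"], col_positions=[1])
print(picked)
```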

to_labels(column_list: List[Hashable]) PandasDataframe#

Move one or more columns into the row labels. Previous labels are dropped.

Parameters:

column_list (list of hashable) – The list of column names to place as the new row labels.

Returns:

A new PandasDataframe that has the updated labels.

Return type:

PandasDataframe
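In plain pandas, the closest analogue is set_index with the default drop=True, shown below as a mental model only; it does not call to_labels().

```python
import pandas as pd

df = pd.DataFrame({"id": [7, 8], "value": [10, 20]}, index=["old1", "old2"])

# The chosen column becomes the row labels; the previous labels are discarded.
relabeled = df.set_index(["id"])
print(relabeled)
```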

to_numpy(**kwargs)#

Convert this Modin DataFrame to a NumPy array.

Parameters:

**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.

Return type:

np.ndarray

to_pandas()#

Convert this Modin DataFrame to a pandas DataFrame.

Return type:

pandas.DataFrame

transpose()#

Transpose the index and columns of this Modin DataFrame.

Reflect this Modin DataFrame over its main diagonal by writing rows as columns and vice-versa.

Returns:

New Modin DataFrame.

Return type:

PandasDataframe

tree_reduce(axis: Union[int, Axis], map_func: Callable, reduce_func: Optional[Callable] = None, dtypes: Optional[str] = None) PandasDataframe#

Apply function that will reduce the data to a pandas Series.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the tree reduce over.

  • map_func (callable(row|col) -> row|col) – Callable function to map the dataframe.

  • reduce_func (callable(row|col) -> single value, optional) – Callable function to reduce the dataframe. If None, then apply map_func twice.

  • dtypes (str, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.

Returns:

A new dataframe.

Return type:

PandasDataframe
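A count of non-null values shows why map_func and reduce_func may differ. The sketch below runs on plain pandas row blocks, which is an assumption for illustration and not the Modin execution path.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": [None, None, 1.0, 2.0]})
row_parts = [df.iloc[:2], df.iloc[2:]]


def map_func(part):
    # Per-block count of non-null values, one number per column.
    return part.count()


def reduce_func(partials):
    # Counts from different blocks are combined by summing, not by counting again.
    return partials.sum()


partials = pd.concat([map_func(part).to_frame().T for part in row_parts])
result = reduce_func(partials)
print(result.to_dict())
```

For operations such as sum or max, the same function can serve in both roles, which matches the rule that map_func is applied twice when reduce_func is None.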

wait_computations()#

Wait for all computations to complete without materializing data.

window(axis: Union[int, Axis], reduce_fn: Callable, window_size: int, result_schema: Optional[Dict[Hashable, type]] = None) PandasDataframe#

Apply a sliding window operator that acts as a GROUPBY on each window, and reduces down to a single row (column) per window.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to slide over.

  • reduce_fn (callable(rowgroup|colgroup) -> row|col) – The reduce function to apply over the data.

  • window_size (int) – The number of rows/columns to pass to the function (the size of the sliding window).

  • result_schema (dict, optional) – Mapping from column labels to data types that represents the types of the output dataframe.

Returns:

A new PandasDataframe with the reduce function applied over windows of the specified axis.

Return type:

PandasDataframe

Notes

The user-defined reduce function must reduce each window’s column (row if axis=1) down to a single value.
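The windowing semantics can be sketched on a plain pandas frame. The helper toy_window is hypothetical, and it assumes that windows advance one row at a time and that only full windows are emitted; neither detail is stated in the description above.

```python
import pandas as pd


def toy_window(df, reduce_fn, window_size):
    """Hypothetical helper: reduce each run of `window_size` consecutive rows to one row."""
    rows = [
        reduce_fn(df.iloc[start:start + window_size])
        for start in range(len(df) - window_size + 1)
    ]
    return pd.DataFrame(rows)


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# The reduce function collapses each window's columns to single values.
result = toy_window(df, lambda win: win.sum(), window_size=2)
print(result)
```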