PandasDataframe#

PandasDataframe is a direct descendant of ModinDataframe. Its purpose is to implement the abstract interfaces for use with all pandas-based storage formats. PandasDataframe can be subclassed and extended by any specific implementation that needs special handling of some behavior or better performance on a particular execution engine.

The class serves as the intermediate level between the pandas query compiler and the conforming partition manager. All queries formed at the query compiler layer are ingested by this class and then passed, together with the stored partitions, to the partition manager for processing. This class must not manipulate partitions directly, except in operations that are strictly private or protected and called only from inside the class. The class provides a significantly reduced set of operations that is sufficient to express many pandas operations.

The main tasks of PandasDataframe are storing partitions, manipulating axis labels, and providing a set of methods to perform operations on the internal data.

As mentioned above, PandasDataframe shouldn’t work with stored partitions directly; the responsibility for modifying the partitions array lies with PandasDataframePartitionManager. For example, the broadcast_apply_full_axis() method delegates applying the function to the broadcast_axis_partitions() method.
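
The division of labor described above can be pictured with a toy model. This is only a sketch in plain pandas and NumPy, not Modin's implementation: `ToyPartitionManager` and `ToyFrame` are hypothetical names, and the real classes do considerably more (label caching, lazy execution, engine-specific partitions).

```python
import numpy as np
import pandas as pd


class ToyPartitionManager:
    """Hypothetical stand-in for a partition manager: the only place that touches blocks."""

    @staticmethod
    def map_partitions(partitions, func):
        # Apply func to every block and return a new 2D array of blocks.
        out = np.empty(partitions.shape, dtype=object)
        for i in range(partitions.shape[0]):
            for j in range(partitions.shape[1]):
                out[i, j] = func(partitions[i, j])
        return out


class ToyFrame:
    """Hypothetical stand-in for the dataframe layer: stores blocks, delegates work."""

    _partition_mgr_cls = ToyPartitionManager

    def __init__(self, partitions):
        self._partitions = partitions

    def map(self, func):
        # The frame never loops over blocks itself; it forwards them to the manager.
        new_parts = self._partition_mgr_cls.map_partitions(self._partitions, func)
        return ToyFrame(new_parts)


blocks = np.empty((1, 2), dtype=object)
blocks[0, 0] = pd.DataFrame({"a": [1, 2]})
blocks[0, 1] = pd.DataFrame({"b": [3, 4]})

doubled = ToyFrame(blocks).map(lambda part: part * 2)
```

Keeping the block loop inside the manager is what lets a different execution engine swap in its own manager without touching the frame-level logic.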

A Modin PandasDataframe can be created from a pandas.DataFrame or a pyarrow.Table (using from_pandas() and from_arrow(), respectively). It can also be converted to a NumPy array or a pandas.DataFrame (using to_numpy() and to_pandas(), respectively).
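
Conceptually, creating the frame from a pandas object means cutting it into a 2D grid of blocks, and converting back means gluing the blocks together again. The sketch below illustrates that round trip with plain pandas; `split_into_blocks`, `assemble_blocks`, and the chunk sizes are illustrative assumptions, not the way Modin chooses its partitioning.

```python
import numpy as np
import pandas as pd


def split_into_blocks(df, row_chunk, col_chunk):
    """Cut a pandas DataFrame into a 2D object array of smaller DataFrames."""
    row_starts = range(0, len(df), row_chunk)
    col_starts = range(0, len(df.columns), col_chunk)
    blocks = np.empty((len(row_starts), len(col_starts)), dtype=object)
    for i, r in enumerate(row_starts):
        for j, c in enumerate(col_starts):
            blocks[i, j] = df.iloc[r : r + row_chunk, c : c + col_chunk]
    return blocks


def assemble_blocks(blocks):
    """Glue the blocks back together: columns first, then rows."""
    rows = [pd.concat(list(row), axis=1) for row in blocks]
    return pd.concat(rows, axis=0)


df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
blocks = split_into_blocks(df, row_chunk=2, col_chunk=2)
restored = assemble_blocks(blocks)
```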

Axis labels are manipulated using internal methods for replacing labels with new ones, adding prefixes/suffixes, etc.

Public API#

class modin.core.dataframe.pandas.dataframe.dataframe.PandasDataframe(partitions, index=None, columns=None, row_lengths=None, column_widths=None, dtypes=None)#

An abstract class that represents the parent class for any pandas storage format dataframe class.

This class provides interfaces to run operations on dataframe partitions.

Parameters
  • partitions (np.ndarray) – A 2D NumPy array of partitions.

  • index (sequence, optional) – The index for the dataframe. Converted to a pandas.Index. Is computed from partitions on demand if not specified.

  • columns (sequence, optional) – The columns object for the dataframe. Converted to a pandas.Index. Is computed from partitions on demand if not specified.

  • row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.

  • column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.

  • dtypes (pandas.Series, optional) – The data types for the dataframe columns.
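
To make the optional constructor arguments concrete, the following sketch shows how row lengths, column widths, and labels could be derived from a 2D array of blocks when they are not passed in. It uses plain pandas DataFrames as blocks, which is an assumption for illustration; real partitions are engine-specific wrapper objects.

```python
import numpy as np
import pandas as pd

# A 2x2 grid of blocks: two row partitions and two column partitions.
partitions = np.empty((2, 2), dtype=object)
partitions[0, 0] = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
partitions[0, 1] = pd.DataFrame({"c": [7, 8, 9]})
partitions[1, 0] = pd.DataFrame({"a": [10], "b": [11]}, index=[3])
partitions[1, 1] = pd.DataFrame({"c": [12]}, index=[3])

# Heights come from the first column of blocks, widths from the first row.
row_lengths = [len(part) for part in partitions[:, 0]]
column_widths = [len(part.columns) for part in partitions[0, :]]

# Labels can be recovered the same way when they were not passed in.
index = partitions[0, 0].index.append(partitions[1, 0].index)
columns = partitions[0, 0].columns.append(partitions[0, 1].columns)
```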

add_prefix(prefix, axis)#

Add a prefix to the current row or column labels.

Parameters
  • prefix (str) – The prefix to add.

  • axis (int) – The axis to update.

Returns

A new dataframe with the updated labels.

Return type

PandasDataframe

add_suffix(suffix, axis)#

Add a suffix to the current row or column labels.

Parameters
  • suffix (str) – The suffix to add.

  • axis (int) – The axis to update.

Returns

A new dataframe with the updated labels.

Return type

PandasDataframe
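
The effect of add_prefix() and add_suffix() on labels can be sketched in plain pandas. The helper `add_affix` below is hypothetical and exists only to show the label transformation along a chosen axis; it says nothing about how Modin updates its cached labels.

```python
import pandas as pd


def add_affix(df, affix, axis, where="prefix"):
    """Return a copy of df whose labels on `axis` carry the given prefix or suffix."""
    labels = df.axes[axis]
    if where == "prefix":
        new_labels = labels.map(lambda label: f"{affix}{label}")
    else:
        new_labels = labels.map(lambda label: f"{label}{affix}")
    return df.set_axis(new_labels, axis=axis)


df = pd.DataFrame({"a": [1], "b": [2]})
prefixed = add_affix(df, "col_", axis=1)
suffixed = add_affix(df, "_row", axis=0, where="suffix")
```

Only the labels change; the data and the original frame are left untouched.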

apply_full_axis(axis, func, new_index=None, new_columns=None, dtypes=None)#

Perform a function across an entire axis.

Parameters
  • axis ({0, 1}) – The axis to apply over (0 - rows, 1 - columns).

  • func (callable) – The function to apply.

  • new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.

  • dtypes (list-like, optional) – The data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.

Returns

A new dataframe.

Return type

PandasDataframe

Notes

The data shape may change as a result of the function.
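
A rough picture of a full-axis operation: blocks are first joined along the axis so that the function sees complete columns (or rows), and the results are then stitched back together. The helper `apply_full_axis_sketch` is a hypothetical plain-pandas illustration under that assumption, not Modin's code path.

```python
import numpy as np
import pandas as pd


def apply_full_axis_sketch(partitions, axis, func):
    """Join blocks along `axis` so func sees whole columns (axis=0) or whole rows (axis=1)."""
    if axis == 0:
        # One full-height partition per column of the block grid.
        full = [pd.concat(list(partitions[:, j]), axis=0) for j in range(partitions.shape[1])]
        return pd.concat([func(part) for part in full], axis=1)
    # One full-width partition per row of the block grid.
    full = [pd.concat(list(partitions[i, :]), axis=1) for i in range(partitions.shape[0])]
    return pd.concat([func(part) for part in full], axis=0)


df = pd.DataFrame({"a": [1, 5, 3, 7], "b": [8, 2, 6, 4]})
partitions = np.empty((2, 2), dtype=object)
partitions[0, 0], partitions[0, 1] = df.iloc[:2, :1], df.iloc[:2, 1:]
partitions[1, 0], partitions[1, 1] = df.iloc[2:, :1], df.iloc[2:, 1:]

# The result has two rows instead of four: the shape changed.
extremes = apply_full_axis_sketch(partitions, axis=0, func=lambda part: part.agg(["min", "max"]))
```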

apply_full_axis_select_indices(axis, func, apply_indices=None, numeric_indices=None, new_index=None, new_columns=None, keep_remaining=False)#

Apply a function across an entire axis for a subset of the data.

Parameters
  • axis (int) – The axis to apply over.

  • func (callable) – The function to apply.

  • apply_indices (list-like, default: None) – The labels to apply over.

  • numeric_indices (list-like, default: None) – The indices to apply over.

  • new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.

  • keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.

Returns

A new dataframe.

Return type

PandasDataframe

apply_select_indices(axis, func, apply_indices=None, row_labels=None, col_labels=None, new_index=None, new_columns=None, keep_remaining=False, item_to_distribute=_NoDefault.no_default)#

Apply a function for a subset of the data.

Parameters
  • axis ({0, 1}) – The axis to apply over.

  • func (callable) – The function to apply.

  • apply_indices (list-like, default: None) – The labels to apply over. Must be given if axis is provided.

  • row_labels (list-like, default: None) – The row labels to apply over. Must be provided with col_labels to apply over both axes.

  • col_labels (list-like, default: None) – The column labels to apply over. Must be provided with row_labels to apply over both axes.

  • new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.

  • keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.

  • item_to_distribute (np.ndarray or scalar, default: no_default) – The item to split up so it can be applied over both axes.

Returns

A new dataframe.

Return type

PandasDataframe

astype(col_dtypes)#

Convert the column dtypes to the given dtypes.

Parameters

col_dtypes (dictionary of {col: dtype,...}) – Where col is the column name and dtype is a NumPy dtype.

Returns

Dataframe with updated dtypes.

Return type

PandasDataframe
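
The shape of the col_dtypes argument matches the mapping accepted by pandas.DataFrame.astype, so its effect can be previewed in plain pandas. This is only an illustration of the mapping semantics; the column names and values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# Only the columns named in the mapping are converted; the rest keep their dtype.
col_dtypes = {"a": np.float64, "b": np.int64}
converted = df.astype(col_dtypes)
```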

property axes#

Get index and columns that can be accessed with an axis integer.

Returns

List with two values: index and columns.

Return type

list

broadcast_apply(axis, func, other, join_type='left', labels='keep', dtypes=None)#

Broadcast axis partitions of other to partitions of self and apply a function.

Parameters
  • axis ({0, 1}) – Axis to broadcast over.

  • func (callable) – Function to apply.

  • other (PandasDataframe) – Modin DataFrame to broadcast.

  • join_type (str, default: "left") – Type of join to apply.

  • labels ({"keep", "replace", "drop"}, default: "keep") – Whether to keep labels from this Modin DataFrame, replace them with labels from the joined DataFrame, or drop them altogether so that they are computed lazily later.

  • dtypes ("copy" or None, default: None) – Whether to keep the old dtypes or infer new dtypes from the data.

Returns

New Modin DataFrame.

Return type

PandasDataframe

broadcast_apply_full_axis(axis, func, other, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions=False, dtypes=None)#

Broadcast partitions of the other Modin DataFrame and apply a function along the full axis.

Parameters
  • axis ({0, 1}) – Axis to apply over (0 - rows, 1 - columns).

  • func (callable) – Function to apply.

  • other (PandasDataframe or list) – Modin DataFrame(s) to broadcast.

  • new_index (list-like, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (list-like, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.

  • apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply function over.

  • enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must accept a partition_idx keyword argument.

  • dtypes (list-like, default: None) – Data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.

Returns

New Modin DataFrame.

Return type

PandasDataframe

broadcast_apply_select_indices(axis, func, other, apply_indices=None, numeric_indices=None, keep_remaining=False, broadcast_all=True, new_index=None, new_columns=None)#

Apply a function to selected indices along the specified axis, broadcasting partitions of the other Modin DataFrame.

Parameters
  • axis ({0, 1}) – Axis to apply function along.

  • func (callable) – Function to apply.

  • other (PandasDataframe) – Modin DataFrame whose partitions should be broadcast.

  • apply_indices (list, default: None) – List of labels to apply (if numeric_indices are not specified).

  • numeric_indices (list, default: None) – Numeric indices to apply (if apply_indices are not specified).

  • keep_remaining (bool, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.

  • broadcast_all (bool, default: True) – Whether to broadcast the whole axis of the right frame to every partition or just a subset of it.

  • new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.

Returns

New Modin DataFrame.

Return type

PandasDataframe

property column_widths#

Compute the column partition widths if they are not cached.

Returns

A list of column partition widths.

Return type

list

property columns#

Get the columns from the cache object.

Returns

An index object containing the column labels.

Return type

pandas.Index

concat(axis: Union[int, Axis], others: Union[PandasDataframe, List[PandasDataframe]], how, sort) PandasDataframe#

Concatenate self with one or more other Modin DataFrames.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – Axis to concatenate over.

  • others (list) – List of Modin DataFrames to concatenate with.

  • how (str) – Type of join to use for the axis.

  • sort (bool) – Whether to sort the result.

Returns

New Modin DataFrame.

Return type

PandasDataframe

copy()#

Copy this object.

Returns

A copied version of this object.

Return type

PandasDataframe

property dtypes#

Compute the data types if they are not cached.

Returns

A pandas Series containing the data types for this dataframe.

Return type

pandas.Series

explode(axis: Union[int, Axis], func: Callable) PandasDataframe#

Explode list-like entries along an entire axis.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis specifying how to explode. If axis=1, explode according to columns.

  • func (callable) – The function to use to explode a single element.

Returns

A new dataframe with the list-like entries exploded.

Return type

PandasDataframe

filter(axis: Union[Axis, int], condition: Callable) PandasDataframe#

Filter data based on the function provided along an entire axis.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to filter over.

  • condition (callable(row|col) -> bool) – The function to use for the filter. This function should filter the data itself.

Returns

A new filtered dataframe.

Return type

PandasDataframe

filter_by_types(types: List[Hashable]) PandasDataframe#

Allow the user to specify a type or set of types by which to filter the columns.

Parameters

types (list) – The types to filter columns by.

Returns

A new PandasDataframe from the filter provided.

Return type

PandasDataframe

finalize()#

Perform all deferred calls on partitions.

This makes this Modin DataFrame independent of the history of queries that were used to build it.

fold(axis, func)#

Perform a function across an entire axis.

Parameters
  • axis (int) – The axis to apply over.

  • func (callable) – The function to apply.

Returns

A new dataframe.

Return type

PandasDataframe

Notes

The data shape is not changed (length and width of the table).

classmethod from_arrow(at)#

Create a Modin DataFrame from an Arrow Table.

Parameters

at (pyarrow.Table) – Arrow Table.

Returns

New Modin DataFrame.

Return type

PandasDataframe

classmethod from_dataframe(df: ProtocolDataframe) PandasDataframe#

Convert a DataFrame implementing the dataframe exchange protocol to a Core Modin Dataframe.

See more about the protocol in https://data-apis.org/dataframe-protocol/latest/index.html.

Parameters

df (ProtocolDataframe) – The DataFrame object supporting the dataframe exchange protocol.

Returns

A new Core Modin Dataframe object.

Return type

PandasDataframe

from_labels() PandasDataframe#

Convert the row labels to a column of data, inserted at the first position.

Produces a result similar to pandas.DataFrame.reset_index. Each level of self.index is added as a separate column of data.

Returns

A PandasDataframe with new columns from index labels.

Return type

PandasDataframe
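
Since the description above compares from_labels() to pandas.DataFrame.reset_index, the expected outcome can be previewed with pandas itself. The frame below is made-up example data with a two-level index.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 20]},
    index=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["letter", "number"]),
)

# Each index level becomes its own leading column; the row labels reset to 0..n-1.
flat = df.reset_index()
```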

classmethod from_pandas(df)#

Create a Modin DataFrame from a pandas DataFrame.

Parameters

df (pandas.DataFrame) – A pandas DataFrame.

Returns

New Modin DataFrame.

Return type

PandasDataframe

groupby(axis: Union[int, Axis], by: Union[str, List[str]], operator: Callable, result_schema: Optional[Dict[Hashable, type]] = None) PandasDataframe#

Generate groups based on values in the input column(s) and perform the specified operation on each.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to apply the grouping over.

  • by (string or list of strings) – One or more column labels to use for grouping.

  • operator (callable) – The operation to carry out on each of the groups. The operator is another algebraic operator with its own user-defined function parameter, depending on the output desired by the user.

  • result_schema (dict, optional) – Mapping from column labels to data types that represents the types of the output dataframe.

Returns

A new PandasDataframe containing the groupings specified, with the operator applied to each group.

Return type

PandasDataframe

Notes

No communication between groups is allowed in this algebra implementation.

The number of rows (columns if axis=1) returned by the user-defined function passed to the groupby may be at most the number of rows in the group, and may be as small as a single row.

Unlike the pandas API, an intermediate “GROUP BY” object is not present in this algebra implementation.
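
The notes above can be illustrated with a small plain-pandas sketch: each group is handed to the operator in isolation, and the per-group results are stacked without any intermediate "GROUP BY" object being exposed. `groupby_sketch` is a hypothetical helper, and the choice of operator is only an example.

```python
import pandas as pd


def groupby_sketch(df, by, operator):
    """Apply `operator` to each group independently and stack the results."""
    pieces = [operator(group) for _, group in df.groupby(by, sort=True)]
    return pd.concat(pieces, axis=0)


df = pd.DataFrame({"team": ["a", "b", "a", "b"], "score": [3, 9, 5, 1]})

# The operator returns one row per group: the row holding that group's top score.
best = groupby_sketch(df, "team", lambda g: g.nlargest(1, "score"))
```

Here the operator returns a single row per group, the smallest output the notes allow; returning the whole group unchanged would be the largest.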

groupby_reduce(axis, by, map_func, reduce_func, new_index=None, new_columns=None, apply_indices=None)#

Group by another Modin DataFrame and aggregate the result.

Parameters
  • axis ({0, 1}) – Axis to groupby and aggregate over.

  • by (PandasDataframe or None) – A Modin DataFrame to group by.

  • map_func (callable) – Map component of the aggregation.

  • reduce_func (callable) – Reduce component of the aggregation.

  • new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.

  • new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.

  • apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply groupby over.

Returns

New Modin DataFrame.

Return type

PandasDataframe
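
The split into map_func and reduce_func follows the usual map-reduce pattern: partial aggregates are computed inside each partition and then combined. The sketch below assumes the grouping column lives in the same frame and uses two hand-made row partitions; it is a plain-pandas illustration, not Modin's implementation.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "x": [1, 2, 3, 4, 5]})

# Two row partitions, as if the frame were stored in two blocks.
row_partitions = [df.iloc[:3], df.iloc[3:]]


def map_func(part):
    # Map step: partial sums inside one partition.
    return part.groupby("key").sum()


def reduce_func(partials):
    # Reduce step: combine partial sums that share a key.
    return partials.groupby(level=0).sum()


mapped = pd.concat([map_func(part) for part in row_partitions])
result = reduce_func(mapped)
```

This only works for aggregations that decompose this way, such as sum, count, min, and max; a mean, for example, needs sums and counts carried through the map step.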

property index#

Get the index from the cache object.

Returns

An index object containing the row labels.

Return type

pandas.Index

infer_objects() PandasDataframe#

Attempt to infer better dtypes for object columns.

Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.

Returns

A new PandasDataframe with the inferred schema.

Return type

PandasDataframe

infer_types(col_labels: List[str]) PandasDataframe#

Determine the compatible type shared by all values in the specified columns, and coerce them to that type.

Parameters

col_labels (list) – List of column labels to infer and induce types over.

Returns

A new PandasDataframe with the inferred schema.

Return type

PandasDataframe

join(axis: Union[int, Axis], condition: Callable, other: ModinDataframe, join_type: Union[str, JoinType]) PandasDataframe#

Join this dataframe with the other.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the join on.

  • condition (callable) – Function that determines which rows should be joined. The condition can be a simple equality, e.g. “left.col1 == right.col1” or can be arbitrarily complex.

  • other (ModinDataframe) – The other data to join with, i.e. the right dataframe.

  • join_type (string {"inner", "left", "right", "outer"} or modin.core.dataframe.base.utils.JoinType) – The type of join to perform.

Returns

A new PandasDataframe that is the result of applying the specified join over the two dataframes.

Return type

PandasDataframe

Notes

During the join, this dataframe is considered the left, while the other is treated as the right.

Only inner joins, left outer, right outer, and full outer joins are currently supported. Support for other join types (e.g. natural join) may be implemented in the future.

map(func: Callable, dtypes: Optional[str] = None) PandasDataframe#

Perform a function that maps across the entire dataset.

Parameters
  • func (callable(row|col|cell) -> row|col|cell) – The function to apply.

  • dtypes (dtypes of the result, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.

Returns

A new dataframe.

Return type

PandasDataframe

n_ary_op(op, right_frames: list, join_type='outer', copartition_along_columns=True)#

Perform an n-ary operation by joining with other Modin DataFrame(s).

Parameters
  • op (callable) – Function to apply after the join.

  • right_frames (list of PandasDataframe) – Modin DataFrames to join with.

  • join_type (str, default: "outer") – Type of join to apply.

  • copartition_along_columns (bool, default: True) – Whether to perform copartitioning along columns or not. For some ops this isn’t needed (e.g., fillna).

Returns

New Modin DataFrame.

Return type

PandasDataframe
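
The idea of joining first and applying op afterwards can be sketched with two small pandas frames. The steps below (join labels, align, apply) are an illustrative assumption about the general flow; copartitioning, which the real method also handles, is not modeled.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])

# Step 1: join the row labels (outer join keeps every label from both frames).
joined_index = left.index.join(right.index, how="outer")

# Step 2: reindex both frames to the joined labels so rows line up.
left_aligned = left.reindex(joined_index)
right_aligned = right.reindex(joined_index)

# Step 3: apply the operation to the aligned data.
result = left_aligned.add(right_aligned, fill_value=0)
```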

numeric_columns(include_bool=True)#

Return the names of numeric columns in the frame.

Parameters

include_bool (bool, default: True) – Whether to consider boolean columns as numeric.

Returns

List of column names.

Return type

list

reduce(axis: Union[int, Axis], function: Callable, dtypes: Optional[str] = None) PandasDataframe#

Perform a user-defined aggregation on the specified axis, where the axis reduces down to a singleton. Requires knowledge of the full axis for the reduction.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the reduce over.

  • function (callable(row|col) -> single value) – The reduce function to apply to each column.

  • dtypes (str, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.

Returns

Modin series (1xN frame) containing the reduced data.

Return type

PandasDataframe

Notes

The user-defined function must reduce to a single value.

rename(new_row_labels: Optional[Union[Dict[Hashable, Hashable], Callable]] = None, new_col_labels: Optional[Union[Dict[Hashable, Hashable], Callable]] = None, level: Optional[Union[int, List[int]]] = None) PandasDataframe#

Replace the row and column labels with the specified new labels.

Parameters
  • new_row_labels (dictionary or callable, optional) – Mapping or callable that relates old row labels to new labels.

  • new_col_labels (dictionary or callable, optional) – Mapping or callable that relates old col labels to new labels.

  • level (int or list of ints, optional) – Level(s) whose row labels to replace.

Returns

A new PandasDataframe with the new row and column labels.

Return type

PandasDataframe

Notes

If level is not specified, the default behavior is to replace row labels in all levels.

property row_lengths#

Compute the row partition lengths if they are not cached.

Returns

A list of row partition lengths.

Return type

list

sort_by(axis: Union[int, Axis], columns: Union[str, List[str]], ascending: bool = True) PandasDataframe#

Logically reorder rows (columns if axis=1) lexicographically by the data in a column or set of columns.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the sort over.

  • columns (string or list) – Column label(s) to use to determine lexicographical ordering.

  • ascending (boolean, default: True) – Whether to sort in ascending or descending order.

Returns

A new PandasDataframe sorted into lexicographical order by the specified column(s).

Return type

PandasDataframe
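
Lexicographical ordering by several columns means the first column decides the order and later columns only break ties. A plain-pandas preview with made-up data, using pandas.DataFrame.sort_values as a stand-in for the row-wise case:

```python
import pandas as pd

df = pd.DataFrame({"city": ["b", "a", "b", "a"], "year": [2, 2, 1, 1], "v": [1, 2, 3, 4]})

# Rows are ordered by "city" first; ties are broken by "year".
ordered = df.sort_values(["city", "year"], ascending=True)
```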

synchronize_labels(axis=None)#

Set the deferred axes variables for the PandasDataframe.

Parameters

axis (int, default: None) – The deferred axis. 0 for the index, 1 for the columns.

take_2d_labels_or_positional(row_labels: Optional[List[Hashable]] = None, row_positions: Optional[List[int]] = None, col_labels: Optional[List[Hashable]] = None, col_positions: Optional[List[int]] = None) PandasDataframe#

Lazily select columns or rows from given indices.

Parameters
  • row_labels (list of hashable, optional) – The row labels to extract.

  • row_positions (list-like of ints, optional) – The row positions to extract.

  • col_labels (list of hashable, optional) – The column labels to extract.

  • col_positions (list-like of ints, optional) – The column positions to extract.

Returns

A new PandasDataframe from the mask provided.

Return type

PandasDataframe

Notes

If both row_labels and row_positions are provided, a ValueError is raised. The same rule applies for col_labels and col_positions.
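
The rule in the note above, together with the label-or-position choice per axis, can be sketched as follows. `take_2d_sketch` is a hypothetical helper that operates eagerly on a pandas DataFrame and assumes unique labels that all exist; the real method works lazily on partitions.

```python
import pandas as pd


def take_2d_sketch(df, row_labels=None, row_positions=None, col_labels=None, col_positions=None):
    """Select rows and columns by labels or by positions, never both on one axis."""
    if row_labels is not None and row_positions is not None:
        raise ValueError("Pass either row_labels or row_positions, not both.")
    if col_labels is not None and col_positions is not None:
        raise ValueError("Pass either col_labels or col_positions, not both.")
    # Translate labels into positions so that the selection itself is positional.
    if row_labels is not None:
        row_positions = df.index.get_indexer(row_labels)
    if col_labels is not None:
        col_positions = df.columns.get_indexer(col_labels)
    rows = slice(None) if row_positions is None else row_positions
    cols = slice(None) if col_positions is None else col_positions
    return df.iloc[rows, cols]


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# Rows by label (in the requested order), columns by position.
picked = take_2d_sketch(df, row_labels=["z", "x"], col_positions=[1])
```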

to_labels(column_list: List[Hashable]) PandasDataframe#

Move one or more columns into the row labels. Previous labels are dropped.

Parameters

column_list (list of hashable) – The list of column names to place as the new row labels.

Returns

A new PandasDataframe that has the updated labels.

Return type

PandasDataframe
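
In plain pandas the closest analogue of moving columns into the row labels is pandas.DataFrame.set_index, which likewise discards the previous labels. The frame below is made-up example data used only to preview the outcome.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "y"], "value": [10, 20]}, index=["old0", "old1"])

# "key" moves out of the data and replaces the previous row labels.
relabeled = df.set_index(["key"])
```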

to_numpy(**kwargs)#

Convert this Modin DataFrame to a NumPy array.

Parameters

**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.

Return type

np.ndarray

to_pandas()#

Convert this Modin DataFrame to a pandas DataFrame.

Return type

pandas.DataFrame

transpose()#

Transpose the index and columns of this Modin DataFrame.

Reflect this Modin DataFrame over its main diagonal by writing rows as columns and vice-versa.

Returns

New Modin DataFrame.

Return type

PandasDataframe

tree_reduce(axis: Union[int, Axis], map_func: Callable, reduce_func: Optional[Callable] = None, dtypes: Optional[str] = None) PandasDataframe#

Apply function that will reduce the data to a pandas Series.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the tree reduce over.

  • map_func (callable(row|col) -> row|col) – Callable function to map the dataframe.

  • reduce_func (callable(row|col) -> single value, optional) – Callable function to reduce the dataframe. If None, map_func is applied twice.

  • dtypes (str, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.

Returns

A new dataframe.

Return type

PandasDataframe
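
The two-stage shape of a tree reduce, and the reason reduce_func sometimes has to differ from map_func, can be sketched in plain pandas. `tree_reduce_sketch` is a hypothetical helper working on a list of row partitions; it reduces over rows only and ignores the axis and dtypes parameters.

```python
import pandas as pd


def tree_reduce_sketch(row_partitions, map_func, reduce_func=None):
    """Map each partition to a partial result, then reduce the stacked partials."""
    if reduce_func is None:
        # Mirrors the documented default: reuse map_func for the reduce step.
        reduce_func = map_func
    partials = pd.concat([map_func(part).to_frame().T for part in row_partitions])
    return reduce_func(partials)


df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": [None, None, 1.0, 2.0]})
row_partitions = [df.iloc[:2], df.iloc[2:]]

# count() is not its own reducer: partial counts must be summed, not counted again.
counts = tree_reduce_sketch(row_partitions, lambda p: p.count(), lambda p: p.sum())

# sum() is its own reducer, so reduce_func can be left out.
totals = tree_reduce_sketch(row_partitions, lambda p: p.sum())
```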

window(axis: Union[int, Axis], reduce_fn: Callable, window_size: int, result_schema: Optional[Dict[Hashable, type]] = None) PandasDataframe#

Apply a sliding window operator that acts as a GROUPBY on each window, and reduces down to a single row (column) per window.

Parameters
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to slide over.

  • reduce_fn (callable(rowgroup|colgroup) -> row|col) – The reduce function to apply over the data.

  • window_size (int) – The number of rows/columns to pass to the function (the size of the sliding window).

  • result_schema (dict, optional) – Mapping from column labels to data types that represents the types of the output dataframe.

Returns

A new PandasDataframe with the reduce function applied over windows of the specified axis.

Return type

PandasDataframe

Notes

The user-defined reduce function must reduce each window’s column (row if axis=1) down to a single value.
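
A sliding-window reduction over rows can be sketched in plain pandas as below. `window_sketch` is a hypothetical helper, and it assumes that only complete windows are emitted and that the window advances one row at a time; the source does not state how partial windows at the edges are treated.

```python
import pandas as pd


def window_sketch(df, reduce_fn, window_size):
    """Slide a fixed-size window down the rows and reduce each window to one row."""
    rows = [
        reduce_fn(df.iloc[start : start + window_size])
        for start in range(len(df) - window_size + 1)
    ]
    return pd.DataFrame(rows).reset_index(drop=True)


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Each window of two rows collapses to a single row holding per-column sums.
sums = window_sketch(df, lambda win: win.sum(), window_size=2)
```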