PandasFrame¶

The class is the base for any frame class of the pandas backend and serves as the intermediate level between the pandas query compiler and the conforming partition manager. All queries formed at the query compiler layer are ingested by this class and then passed, together with the stored partitions, to the partition manager for processing. Direct manipulation of partitions by this class is prohibited, except when an operation is strictly private or protected and is called only inside the class. The class provides a significantly reduced set of operations that covers a wide range of pandas operations.

The main tasks of PandasFrame are storing partitions, manipulating axis labels, and providing a set of methods to perform operations on the internal data.

As mentioned above, PandasFrame shouldn’t work with stored partitions directly; the responsibility for modifying the partitions array lies with PandasFramePartitionManager. For example, the broadcast_apply_full_axis() method delegates applying the function to the PandasFramePartitionManager.broadcast_axis_partitions method.
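The delegation pattern described above can be illustrated with a toy sketch. ToyFrame and ToyPartitionManager are hypothetical stand-ins, not Modin classes; the only point is that the frame forwards its partitions to the manager instead of looping over them itself.

```python
import numpy as np


class ToyPartitionManager:
    """Hypothetical manager: the only place where blocks are touched."""

    @staticmethod
    def map_partitions(partitions, func):
        # Apply func to every block and return a new 2D object array.
        result = np.empty(partitions.shape, dtype=object)
        for i in range(partitions.shape[0]):
            for j in range(partitions.shape[1]):
                result[i, j] = func(partitions[i, j])
        return result


class ToyFrame:
    """Hypothetical frame: stores partitions and delegates the work."""

    _partition_mgr_cls = ToyPartitionManager

    def __init__(self, partitions):
        self._partitions = partitions

    def map(self, func):
        # The frame never modifies blocks itself; it hands them to the manager.
        new_partitions = self._partition_mgr_cls.map_partitions(self._partitions, func)
        return ToyFrame(new_partitions)


blocks = np.empty((1, 2), dtype=object)
blocks[0, 0] = np.array([[1, 2]])
blocks[0, 1] = np.array([[3]])
doubled = ToyFrame(blocks).map(lambda block: block * 2)
```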

PandasFrame can be created from a pandas.DataFrame or a pyarrow.Table (using from_pandas() and from_arrow(), respectively). It can also be converted to an np.ndarray or a pandas.DataFrame (using to_numpy() and to_pandas(), respectively).

Axis labels are manipulated through internal methods that replace labels with new ones, add prefixes or suffixes, and so on.

Public API¶

class modin.engines.base.frame.data.PandasFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)¶

An abstract class that represents the parent class for any pandas backend dataframe class.

This class provides interfaces to run operations on dataframe partitions.

Parameters

partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a pandas.Index.
columns (sequence) – The columns object for the dataframe. Converted to a pandas.Index.
row_lengths (list, optional) – The length of each partition along the rows, i.e. the “height” of each block partition. Computed if not provided.
column_widths (list, optional) – The width of each partition along the columns, i.e. the “width” of each block partition. Computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
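As a rough illustration of how partitions, row_lengths and column_widths relate, the sketch below cuts a pandas DataFrame into a 2D NumPy array of blocks and reassembles it. The split points are arbitrary assumptions chosen for the example; this is not how Modin itself chooses them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("abcd"))

# Assumed split: rows into chunks of 3 and 2, columns into chunks of 2 and 2.
row_lengths = [3, 2]
column_widths = [2, 2]
row_bounds = np.cumsum([0] + row_lengths)
col_bounds = np.cumsum([0] + column_widths)

# 2D object array: one pandas block per (row chunk, column chunk) pair.
partitions = np.empty((len(row_lengths), len(column_widths)), dtype=object)
for i in range(len(row_lengths)):
    for j in range(len(column_widths)):
        partitions[i, j] = df.iloc[
            row_bounds[i]:row_bounds[i + 1], col_bounds[j]:col_bounds[j + 1]
        ]

# Reassemble: join each row of blocks side by side, then stack the rows.
rebuilt = pd.concat([pd.concat(list(row), axis=1) for row in partitions], axis=0)
```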

add_prefix(prefix, axis)¶

Add a prefix to the current row or column labels.

Parameters

prefix (str) – The prefix to add.
axis (int) – The axis to update.

Returns

A new dataframe with the updated labels.

Return type

PandasFrame

add_suffix(suffix, axis)¶

Add a suffix to the current row or column labels.

Parameters

suffix (str) – The suffix to add.
axis (int) – The axis to update.

Returns

A new dataframe with the updated labels.

Return type

PandasFrame
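The effect of add_prefix() and add_suffix() on labels can be sketched with plain pandas. The helper relabel() is hypothetical; it assumes axis 0 refers to row labels and axis 1 to column labels, matching the order of the axes property.

```python
import pandas as pd


def relabel(df, axis, prefix="", suffix=""):
    # Build the new labels as strings and return a new frame; df is untouched.
    new_labels = df.axes[axis].map(lambda label: f"{prefix}{label}{suffix}")
    return df.set_axis(new_labels, axis=axis)


df = pd.DataFrame({"a": [1], "b": [2]})
with_prefix = relabel(df, axis=1, prefix="x_")
with_suffix = relabel(df, axis=0, suffix="_r")
```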

apply_full_axis(axis, func, new_index=None, new_columns=None, dtypes=None)¶

Perform a function across an entire axis.

Parameters

axis ({0, 1}) – The axis to apply over (0 - rows, 1 - columns).
func (callable) – The function to apply.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
dtypes (list-like, optional) – The data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.

Returns

A new dataframe.

Return type

PandasFrame

Notes

The data shape may change as a result of the function.
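A minimal sketch of a full-axis apply over a grid of pandas blocks follows. The helper name and the nested-list layout are assumptions for illustration, as is the reading that axis=0 lets the function see whole columns (as in pandas.DataFrame.apply). The example function collapses each column to a single row, so the result shape differs from the input.

```python
import pandas as pd


def apply_full_axis_sketch(partitions, axis, func):
    """partitions: nested list (rows of blocks) of pandas DataFrames."""
    if axis == 0:
        # Stack each column of blocks vertically so func sees whole columns.
        n_col_parts = len(partitions[0])
        full = [pd.concat([row[j] for row in partitions], axis=0) for j in range(n_col_parts)]
        return [[func(block) for block in full]]
    # Join each row of blocks horizontally so func sees whole rows.
    full = [pd.concat(row, axis=1) for row in partitions]
    return [[func(block)] for block in full]


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
partitions = [
    [df.iloc[:2, [0]], df.iloc[:2, [1]]],
    [df.iloc[2:, [0]], df.iloc[2:, [1]]],
]
result = apply_full_axis_sketch(partitions, 0, lambda d: d.sum().to_frame().T)
combined = pd.concat(result[0], axis=1)
```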

apply_full_axis_select_indices(axis, func, apply_indices=None, numeric_indices=None, new_index=None, new_columns=None, keep_remaining=False)¶

Apply a function across an entire axis for a subset of the data.

Parameters

axis (int) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over.
numeric_indices (list-like, default: None) – The indices to apply over.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.

Returns

A new dataframe.

Return type

PandasFrame
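The role of keep_remaining can be sketched with plain pandas. The helper below is hypothetical and assumes the applied function preserves the shape of the selected columns, which Modin itself does not require.

```python
import pandas as pd


def apply_to_selected_columns(df, func, labels, keep_remaining=False):
    # Apply func only to the selected columns.
    applied = func(df[labels])
    if not keep_remaining:
        # Columns that were not computed over are dropped.
        return applied
    # Otherwise write the results back next to the untouched columns.
    result = df.copy()
    result[labels] = applied
    return result


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
dropped = apply_to_selected_columns(df, lambda d: d.cumsum(), ["a", "c"])
kept = apply_to_selected_columns(df, lambda d: d.cumsum(), ["a", "c"], keep_remaining=True)
```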

apply_select_indices(axis, func, apply_indices=None, row_indices=None, col_indices=None, new_index=None, new_columns=None, keep_remaining=False, item_to_distribute=None)¶

Apply a function for a subset of the data.

Parameters

axis ({0, 1}) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over. Must be given if axis is provided.
row_indices (list-like, default: None) – The row indices to apply over. Must be provided with col_indices to apply over both axes.
col_indices (list-like, default: None) – The column indices to apply over. Must be provided with row_indices to apply over both axes.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.
item_to_distribute (optional) – The item to split up so it can be applied over both axes.

Returns

A new dataframe.

Return type

PandasFrame

astype(col_dtypes)¶

Convert the column dtypes to the given dtypes.

Parameters: col_dtypes (dictionary of {col: dtype,...}) – Where col is the column name and dtype is a NumPy dtype.
Returns: Dataframe with updated dtypes.
Return type: BaseDataFrame
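The shape of the col_dtypes mapping can be shown with the analogous pandas call. This is only the pandas equivalent, used here as an assumption about the expected input; columns missing from the mapping keep their dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# Mapping of column name to target NumPy dtype.
col_dtypes = {"a": np.float64, "b": np.float32}
converted = df.astype(col_dtypes)
```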

property axes¶

Get index and columns that can be accessed with an axis integer.

Returns: List with two values: index and columns.
Return type: list

binary_op(op, right_frame, join_type='outer')¶

Perform an operation that requires joining with another Modin DataFrame.

Parameters

op (callable) – Function to apply after the join.
right_frame (PandasFrame) – Modin DataFrame to join with.
join_type (str, default: "outer") – Type of join to apply.

Returns

New Modin DataFrame.

Return type

PandasFrame
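The "join, then apply op" sequence can be sketched with pandas alone. DataFrame.align stands in for the join step here, which is an assumption about the idea rather than Modin's actual implementation.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10, 20, 30]}, index=["b", "c", "d"])


def op(lhs, rhs):
    # The function applied once both frames share the same row labels.
    return lhs + rhs


# Step 1: bring both frames onto the joined ("outer") set of row labels.
left_aligned, right_aligned = left.align(right, join="outer", axis=0)
# Step 2: apply the operation to the aligned frames.
result = op(left_aligned, right_aligned)
```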

broadcast_apply(axis, func, other, join_type='left', preserve_labels=True, dtypes=None)¶

Broadcast axis partitions of other to partitions of self and apply a function.

Parameters

axis ({0, 1}) – Axis to broadcast over.
func (callable) – Function to apply.
other (PandasFrame) – Modin DataFrame to broadcast.
join_type (str, default: "left") – Type of join to apply.
preserve_labels (bool, default: True) – Whether to keep the labels from the self Modin DataFrame.
dtypes ("copy" or None, default: None) – Whether to keep the old dtypes ("copy") or infer new dtypes from the data (None).

Returns

New Modin DataFrame.

Return type

PandasFrame

broadcast_apply_full_axis(axis, func, other, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions=False, dtypes=None)¶

Broadcast partitions of the other Modin DataFrame and apply a function along the full axis.

Parameters

axis ({0, 1}) – Axis to apply over (0 - rows, 1 - columns).
func (callable) – Function to apply.
other (PandasFrame or list) – Modin DataFrame(s) to broadcast.
new_index (list-like, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply function over.
enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must accept a partition_idx keyword argument.
dtypes (list-like, default: None) – Data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.

Returns

New Modin DataFrame.

Return type

PandasFrame
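The enumerate_partitions contract can be sketched as follows. Both helpers and the (block, other, partition_idx) call shape are hypothetical; the source only states that func must accept a partition_idx keyword argument.

```python
import pandas as pd


def label_block(block, other, partition_idx=None):
    # Hypothetical func: accepts the partition_idx keyword argument.
    out = block + other.iloc[0]  # the single row of `other` is broadcast to the block
    out["partition"] = partition_idx
    return out


def apply_enumerated(blocks, other, func, enumerate_partitions=False):
    results = []
    for idx, block in enumerate(blocks):
        # Pass the block position only when the caller asked for it.
        kwargs = {"partition_idx": idx} if enumerate_partitions else {}
        results.append(func(block, other, **kwargs))
    return results


df = pd.DataFrame({"v": [1, 2, 3, 4]})
other = pd.DataFrame({"v": [100]})
blocks = [df.iloc[:2], df.iloc[2:]]
tagged = pd.concat(apply_enumerated(blocks, other, label_block, enumerate_partitions=True))
```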

broadcast_apply_select_indices(axis, func, other, apply_indices=None, numeric_indices=None, keep_remaining=False, broadcast_all=True, new_index=None, new_columns=None)¶

Apply a function to selected indices along the specified axis, broadcasting partitions of the other Modin DataFrame.

Parameters

axis ({0, 1}) – Axis to apply function along.
func (callable) – Function to apply.
other (PandasFrame) – The Modin DataFrame whose partitions should be broadcast.
apply_indices (list, default: None) – List of labels to apply (if numeric_indices are not specified).
numeric_indices (list, default: None) – Numeric indices to apply (if apply_indices are not specified).
keep_remaining (bool, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.
broadcast_all (bool, default: True) – Whether to broadcast the whole axis of the right frame to every partition or just a subset of it.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.

Returns

New Modin DataFrame.

Return type

PandasFrame

property columns¶

Get the columns from the cache object.

Returns: An index object containing the column labels.
Return type: pandas.Index

classmethod combine_dtypes(list_of_dtypes, column_names)¶

Describe how data types should be combined when they do not match.

Parameters

list_of_dtypes (list) – A list of pandas Series with the data types.
column_names (list) – The names of the columns that the data types map to.

Returns

A pandas Series containing the finalized data types.

Return type

pandas.Series
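One way to picture the combination of mismatched dtypes is sketched below. It uses numpy.result_type as a stand-in promotion rule and handles numeric dtypes only; this is an assumption for illustration, not necessarily the rule Modin applies.

```python
import numpy as np
import pandas as pd


def combine_dtypes_sketch(list_of_dtypes, column_names):
    # For each column, promote the dtypes seen across all inputs to a common one.
    combined = {
        name: np.result_type(*[dtypes[name] for dtypes in list_of_dtypes])
        for name in column_names
    }
    return pd.Series(combined, dtype=object)


left = pd.DataFrame({"a": [1, 2], "b": [True, False]})
right = pd.DataFrame({"a": [0.5, 1.5], "b": [1, 0]})
result = combine_dtypes_sketch([left.dtypes, right.dtypes], ["a", "b"])
```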

concat(axis, others, how, sort)¶

Concatenate self with one or more other Modin DataFrames.

Parameters

axis ({0, 1}) – Axis to concatenate over.
others (list) – List of Modin DataFrames to concatenate with.
how (str) – Type of join to use for the axis.
sort (bool) – Whether to sort the result.

Returns

New Modin DataFrame.

Return type

PandasFrame
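The interplay of axis, how and sort can be seen in the analogous pandas.concat call, where join plays the role of how. Treating the two as equivalent is an assumption made for this sketch.

```python
import pandas as pd

a = pd.DataFrame({"x": [1], "y": [2]})
b = pd.DataFrame({"y": [3], "z": [4]})

# Concatenate along rows; the join type decides which columns survive.
outer = pd.concat([a, b], axis=0, join="outer", sort=False)
inner = pd.concat([a, b], axis=0, join="inner", sort=False)
```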

copy()¶

Copy this object.

Returns: A copied version of this object.
Return type: PandasFrame

property dtypes¶

Compute the data types if they are not cached.

Returns: A pandas Series containing the data types for this dataframe.
Return type: pandas.Series

filter_full_axis(axis, func)¶

Filter data based on the function provided along an entire axis.

Parameters

axis (int) – The axis to filter over.
func (callable) – The function to use for the filter. This function should filter the data itself.

Returns

A new filtered dataframe.

Return type

PandasFrame

finalize()¶

Perform all deferred calls on partitions.

This makes the self Modin DataFrame independent of the history of queries that were used to build it.
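The idea of deferred calls can be sketched with a toy partition that queues functions instead of running them. LazyPartition and its methods are hypothetical names for this illustration, not Modin's partition API.

```python
class LazyPartition:
    """Hypothetical partition that records calls and runs them on demand."""

    def __init__(self, data, call_queue=None):
        self._data = data
        self.call_queue = list(call_queue or [])

    def defer(self, func):
        # Nothing is computed here; the call is only appended to the queue.
        return LazyPartition(self._data, self.call_queue + [func])

    def drain(self):
        # Run every queued call in order, then forget the history.
        for func in self.call_queue:
            self._data = func(self._data)
        self.call_queue = []


part = LazyPartition([1, 2, 3])
queued = part.defer(lambda data: [x * 2 for x in data]).defer(lambda data: data + [0])
pending_before = len(queued.call_queue)
queued.drain()
```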

fold(axis, func)¶

Perform a function across an entire axis.

Parameters

axis (int) – The axis to apply over.
func (callable) – The function to apply.

Returns

A new dataframe.

Return type

PandasFrame

Notes

The data shape is not changed (length and width of the table).
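A cumulative sum is a typical shape-preserving operation that needs the whole axis. The pandas sketch below, with an assumed split into two row blocks, contrasts a per-block application with a full-axis one.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]})
row_blocks = [df.iloc[:2], df.iloc[2:]]

# Per-block application restarts the running total in every block.
per_block = pd.concat([block.cumsum() for block in row_blocks])

# Full-axis application sees all rows at once; the shape stays the same.
folded = pd.concat(row_blocks).cumsum()
```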

fold_reduce(axis, func)¶

Apply a function that reduces the Frame Manager to a series but requires knowledge of the full axis.

Parameters

axis ({0, 1}) – The axis to apply the function to (0 - index, 1 - columns).
func (callable) – The function to reduce the Manager by. This function takes in a Manager.

Returns

Modin series (1xN frame) containing the reduced data.

Return type

PandasFrame

classmethod from_arrow(at)¶

Create a Modin DataFrame from an Arrow Table.

Parameters: at (pyarrow.Table) – Arrow Table.
Returns: New Modin DataFrame.
Return type: PandasFrame

from_labels() → modin.engines.base.frame.data.PandasFrame¶

Convert the row labels to a column of data, inserted at the first position.

Produces a result similar to pandas.DataFrame.reset_index. Each level of self.index is added as a separate column of data.

Returns: A PandasFrame with new columns from index labels.
Return type: PandasFrame
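Since the result is described as similar to pandas.DataFrame.reset_index, the pandas call itself gives a rough picture of what to expect for a multi-level index. This shows pandas behavior only.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("x", 1), ("y", 2)], names=["k1", "k2"])
df = pd.DataFrame({"v": [10, 20]}, index=index)

# Every index level becomes a leading data column; rows get default labels.
flat = df.reset_index()
```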

classmethod from_pandas(df)¶

Create a Modin DataFrame from a pandas DataFrame.

Parameters: df (pandas.DataFrame) – A pandas DataFrame.
Returns: New Modin DataFrame.
Return type: PandasFrame

groupby_reduce(axis, by, map_func, reduce_func, new_index=None, new_columns=None, apply_indices=None)¶

Group by another Modin DataFrame and aggregate the result.

Parameters

axis ({0, 1}) – Axis to groupby and aggregate over.
by (PandasFrame or None) – A Modin DataFrame to group by.
map_func (callable) – Map component of the aggregation.
reduce_func (callable) – Reduce component of the aggregation.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply groupby over.

Returns

New Modin DataFrame.

Return type

PandasFrame
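The map and reduce components of a grouped aggregation can be sketched with pandas on two row blocks. As a simplification, the grouping key is a column of the same frame rather than a separate frame passed as by.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "val": [1, 2, 3, 4, 5]})
row_blocks = [df.iloc[:3], df.iloc[3:]]


def map_func(block):
    # Partial aggregation inside a single block.
    return block.groupby("key").sum()


def reduce_func(partials):
    # Merge the partial results that share a group label.
    return partials.groupby(level=0).sum()


partials = pd.concat([map_func(block) for block in row_blocks])
result = reduce_func(partials)
```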

property index¶

Get the index from the cache object.

Returns: An index object containing the row labels.
Return type: pandas.Index

map(func, dtypes=None)¶

Perform a function that maps across the entire dataset.

Parameters

func (callable) – The function to apply.
dtypes (dtypes of the result, default: None) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.

Returns

A new dataframe.

Return type

PandasFrame

map_reduce(axis, map_func, reduce_func=None)¶

Apply a function that will reduce the data to a pandas Series.

Parameters

axis ({0, 1}) – 0 for columns and 1 for rows.
map_func (callable) – Callable function to map the dataframe.
reduce_func (callable, default: None) – Callable function to reduce the dataframe. If None, map_func is applied twice.

Returns

A new dataframe.

Return type

PandasFrame
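Counting non-null values is a simple case where the map and reduce steps differ: each block is counted, then the partial counts are summed. The sketch uses pandas on two assumed row blocks and reduces each column over its rows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": [None, None, 1.0, 2.0]})
row_blocks = [df.iloc[:2], df.iloc[2:]]


def map_func(block):
    # Non-null count per column, kept as a one-row frame.
    return block.count().to_frame().T


def reduce_func(stacked):
    # Add up the partial counts from all blocks.
    return stacked.sum()


stacked = pd.concat([map_func(block) for block in row_blocks])
counts = reduce_func(stacked)
```

For an operation such as a sum, the same function can serve both steps, which corresponds to leaving reduce_func as None.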

mask(row_indices=None, row_numeric_idx=None, col_indices=None, col_numeric_idx=None)¶

Lazily select columns or rows from given indices.

Parameters

row_indices (list of hashable, optional) – The row labels to extract.
row_numeric_idx (list of int, optional) – The row indices to extract.
col_indices (list of hashable, optional) – The column labels to extract.
col_numeric_idx (list of int, optional) – The column indices to extract.

Returns

A new PandasFrame from the mask provided.

Return type

PandasFrame

Notes

If both row_indices and row_numeric_idx are set, row_indices will be used. The same rule applies to col_indices and col_numeric_idx.
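The precedence rule in the note above can be sketched with pandas. The helper is hypothetical and selects eagerly, whereas mask() is described as lazy.

```python
import pandas as pd


def mask_sketch(df, row_indices=None, row_numeric_idx=None,
                col_indices=None, col_numeric_idx=None):
    out = df
    # Labels take precedence over positions on each axis.
    if row_indices is not None:
        out = out.loc[row_indices]
    elif row_numeric_idx is not None:
        out = out.iloc[row_numeric_idx]
    if col_indices is not None:
        out = out.loc[:, col_indices]
    elif col_numeric_idx is not None:
        out = out.iloc[:, col_numeric_idx]
    return out


df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["r0", "r1", "r2"],
    columns=["a", "b", "c"],
)
selected = mask_sketch(df, row_indices=["r2"], row_numeric_idx=[0], col_numeric_idx=[0, 2])
```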

numeric_columns(include_bool=True)¶

Return the names of numeric columns in the frame.

Parameters: include_bool (bool, default: True) – Whether to consider boolean columns as numeric.
Returns: List of column names.
Return type: list
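A sketch of the include_bool switch follows, built on pandas.api.types. The helper name is hypothetical; it relies on pandas treating boolean dtypes as numeric in is_numeric_dtype.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def numeric_columns_sketch(dtypes, include_bool=True):
    # Keep numeric columns; optionally filter out the boolean ones.
    return [
        col
        for col, dtype in dtypes.items()
        if is_numeric_dtype(dtype) and (include_bool or not is_bool_dtype(dtype))
    ]


df = pd.DataFrame({"i": [1], "f": [1.0], "b": [True], "s": ["x"]})
with_bool = numeric_columns_sketch(df.dtypes)
without_bool = numeric_columns_sketch(df.dtypes, include_bool=False)
```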

synchronize_labels(axis=None)¶

Synchronize labels by lazily applying the index object for a specific axis to self._partitions.

Adds a set_axis function to the call queue of each partition in self._partitions to apply the new axis.

Parameters: axis (int, default: None) – The axis to apply to. If None, applies to both axes.

to_labels(column_list: List[Hashable]) → modin.engines.base.frame.data.PandasFrame¶

Move one or more columns into the row labels. Previous labels are dropped.

Parameters: column_list (list of hashable) – The list of column names to place as the new row labels.
Returns: A new PandasFrame that has the updated labels.
Return type: PandasFrame
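The pandas counterpart of this operation is DataFrame.set_index, used below as an assumed analogy. With its default settings the previous row labels are discarded, matching the description above.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Rome"], "year": [2020, 2021], "v": [1.0, 2.0]},
    index=["r0", "r1"],
)

# Move two columns into the row labels; "r0" and "r1" are dropped.
relabeled = df.set_index(["city", "year"])
```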

to_numpy(**kwargs)¶

Convert this Modin DataFrame to a NumPy array.

Parameters: **kwargs (dict) – Additional keyword arguments to be passed in to_numpy.
Return type: np.ndarray

to_pandas()¶

Convert this Modin DataFrame to a pandas DataFrame.

Return type: pandas.DataFrame

transpose()¶

Transpose the index and columns of this Modin DataFrame.

Reflect this Modin DataFrame over its main diagonal by writing rows as columns and vice-versa.

Returns: New Modin DataFrame.
Return type: PandasFrame
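For a block-partitioned table, reflecting over the main diagonal amounts to transposing every block and swapping block positions in the grid. The NumPy sketch below illustrates this idea on an assumed 2x2 grid; it is not Modin's implementation.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Assumed 2x2 grid of blocks: row lengths [2, 1], column widths [3, 1].
grid = np.empty((2, 2), dtype=object)
grid[0, 0], grid[0, 1] = a[:2, :3], a[:2, 3:]
grid[1, 0], grid[1, 1] = a[2:, :3], a[2:, 3:]

# Transpose each block and move block (i, j) to position (j, i).
transposed = np.empty((2, 2), dtype=object)
for i in range(2):
    for j in range(2):
        transposed[j, i] = grid[i, j].T

rebuilt = np.block([[transposed[i, j] for j in range(2)] for i in range(2)])
```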