BasePandasFrame

This class is the base for every frame class of the pandas backend and serves as the intermediate level between the pandas query compiler and the conforming partition manager. All queries formed at the query compiler layer are ingested by this class and then passed, together with the stored partitions, to the partition manager for processing. Direct manipulation of partitions by this class is prohibited, except when an operation is strictly private or protected and is called only from inside the class. The class provides a significantly reduced set of operations that is sufficient to express a large number of pandas operations.

The main tasks of BasePandasFrame are storing partitions, manipulating the axis labels, and providing a set of methods to perform operations on the internal data.

As mentioned above, BasePandasFrame shouldn’t work with stored partitions directly; the responsibility for modifying the partitions array lies with BaseFrameManager. For example, the method broadcast_apply_full_axis() delegates applying the function to the BaseFrameManager.broadcast_axis_partitions method.
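The division of responsibilities described above can be illustrated with a toy model. This is only a sketch: ToyFrame, ToyPartitionManager, and their map/map_partitions methods are hypothetical names invented for this example and are not part of Modin. Blocks are plain pandas DataFrames held in a 2D NumPy object array.

```python
import numpy as np
import pandas as pd


class ToyPartitionManager:
    """Hypothetical stand-in for a partition manager: owns all block-level work."""

    @staticmethod
    def map_partitions(partitions, func):
        # Apply func to every block of the 2D partitions array.
        out = np.empty(partitions.shape, dtype=object)
        for i in range(partitions.shape[0]):
            for j in range(partitions.shape[1]):
                out[i, j] = func(partitions[i, j])
        return out


class ToyFrame:
    """Hypothetical stand-in for a frame: stores partitions and axis labels."""

    _manager = ToyPartitionManager

    def __init__(self, partitions, index, columns):
        self._partitions = partitions
        self.index = pd.Index(index)
        self.columns = pd.Index(columns)

    def map(self, func):
        # The frame never loops over blocks itself; it hands them to the manager.
        new_parts = self._manager.map_partitions(self._partitions, func)
        return ToyFrame(new_parts, self.index, self.columns)


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
parts = np.empty((2, 1), dtype=object)
parts[0, 0] = df.iloc[:2]
parts[1, 0] = df.iloc[2:]

frame = ToyFrame(parts, df.index, df.columns)
doubled = frame.map(lambda block: block * 2)
```

The frame keeps the labels and decides what to compute, while every loop over blocks lives in the manager.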

BasePandasFrame can be created from a pandas.DataFrame or a pyarrow.Table (using the methods from_pandas() and from_arrow(), respectively). It can also be converted to a NumPy array (np.ndarray) or a pandas.DataFrame (using the methods to_numpy() and to_pandas(), respectively).

Axis labels are manipulated through internal methods that replace labels with new ones, add prefixes or suffixes, and so on.

Public API

class modin.engines.base.frame.data.BasePandasFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)

An abstract class that represents the parent class for any pandas backend dataframe class.

This class provides interfaces to run operations on dataframe partitions.

Parameters:
  • partitions (np.ndarray) – A 2D NumPy array of partitions.
  • index (sequence) – The index for the dataframe. Converted to a pandas.Index.
  • columns (sequence) – The columns object for the dataframe. Converted to a pandas.Index.
  • row_lengths (list, optional) – The length of each partition along the rows, i.e. the “height” of each block partition. Computed if not provided.
  • column_widths (list, optional) – The width of each partition along the columns, i.e. the “width” of each block partition. Computed if not provided.
  • dtypes (pandas.Series, optional) – The data types for the dataframe columns.
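To make the partitions, row_lengths and column_widths parameters more concrete, the sketch below builds a 2 x 2 grid of blocks by hand and derives the lengths from it. It assumes that blocks are plain pandas DataFrames (real partitions are backend-specific objects), and the split points are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("abcd"))

# Split into a 2 x 2 grid: rows [0:3], [3:5] and columns [0:2], [2:4].
row_bounds = [(0, 3), (3, 5)]
col_bounds = [(0, 2), (2, 4)]
partitions = np.empty((2, 2), dtype=object)
for i, (r0, r1) in enumerate(row_bounds):
    for j, (c0, c1) in enumerate(col_bounds):
        partitions[i, j] = df.iloc[r0:r1, c0:c1]

# Heights are read from the first column of blocks, widths from the first row.
row_lengths = [len(block) for block in partitions[:, 0]]
column_widths = [block.shape[1] for block in partitions[0, :]]
```

The heights sum to the number of rows and the widths sum to the number of columns, which is why both can be recomputed from the blocks when they are not passed in.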
add_prefix(prefix, axis)

Add a prefix to the current row or column labels.

Parameters:
  • prefix (str) – The prefix to add.
  • axis (int) – The axis to update.
Returns:

A new dataframe with the updated labels.

Return type:

BasePandasFrame

add_suffix(suffix, axis)

Add a suffix to the current row or column labels.

Parameters:
  • suffix (str) – The suffix to add.
  • axis (int) – The axis to update.
Returns:

A new dataframe with the updated labels.

Return type:

BasePandasFrame
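As a rough illustration of what add_prefix and add_suffix do to the labels, the following sketch works on a plain pandas DataFrame. The helper name add_prefix_sketch is hypothetical, and converting each label to a string before prepending is an assumption modeled on pandas.DataFrame.add_prefix.

```python
import pandas as pd


def add_prefix_sketch(df, prefix, axis):
    """Return a copy of df with `prefix` prepended to the labels of `axis`."""
    new_labels = pd.Index([f"{prefix}{label}" for label in df.axes[axis]])
    out = df.copy()
    if axis == 0:
        out.index = new_labels
    else:
        out.columns = new_labels
    return out


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
by_columns = add_prefix_sketch(df, "col_", axis=1)
by_rows = add_prefix_sketch(df, "row_", axis=0)
```

Only the labels change; the data and the original frame are left untouched. A suffix variant would differ only in the order of the two parts of the formatted string.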

astype(col_dtypes)

Convert the columns dtypes to given dtypes.

Parameters:col_dtypes (dictionary of {col: dtype,..}) – Where col is the column name and dtype is a NumPy dtype.
Returns:Dataframe with updated dtypes.
Return type:BasePandasFrame
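The shape of the col_dtypes argument is the same as the one accepted by pandas.DataFrame.astype, so a plain pandas call is a reasonable way to preview the effect. This sketch uses pandas only and does not call the Modin method itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5.5, 6.5]})

# Only the listed columns are converted; "c" keeps its dtype.
col_dtypes = {"a": np.float64, "b": np.dtype("int32")}
converted = df.astype(col_dtypes)
```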
axes

Get index and columns that can be accessed with an axis integer.

Returns:List with two values: index and columns.
Return type:list
broadcast_apply(axis, func, other, join_type='left', preserve_labels=True, dtypes=None)

Broadcast axis partitions of other to partitions of self and apply a function.

Parameters:
  • axis ({0, 1}) – Axis to broadcast over.
  • func (callable) – Function to apply.
  • other (BasePandasFrame) – Modin DataFrame to broadcast.
  • join_type (str, default: "left") – Type of join to apply.
  • preserve_labels (bool, default: True) – Whether to keep the labels of the self Modin DataFrame.
  • dtypes ("copy" or None, default: None) – Whether to keep the old dtypes ("copy") or infer new dtypes from the data (None).
Returns:

New Modin DataFrame.

Return type:

BasePandasFrame

broadcast_apply_full_axis(axis, func, other, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions=False, dtypes=None)

Broadcast partitions of other Modin DataFrame and apply a function along full axis.

Parameters:
  • axis ({0, 1}) – Axis to apply over (0 - rows, 1 - columns).
  • func (callable) – Function to apply.
  • other (BasePandasFrame or list) – Modin DataFrame(s) to broadcast.
  • new_index (list-like, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
  • new_columns (list-like, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
  • apply_indices (list-like, default: None) – Indices along axis ^ 1 (the opposite axis) to apply the function over.
  • enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must then accept a partition_idx keyword argument.
  • dtypes (list-like, default: None) – Data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.
Returns:

New Modin DataFrame.

Return type:

BasePandasFrame
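The idea of a full-axis apply with a broadcast operand can be sketched on a hand-built grid of pandas blocks. Everything here is illustrative: full_axis_apply is a hypothetical helper, the blocks are plain DataFrames, and the mapping of axis values to strips (axis=0 glues each column of blocks into one tall strip) is an assumption of this example rather than a statement about Modin internals.

```python
import numpy as np
import pandas as pd


def full_axis_apply(partitions, other, func, axis=0):
    """Glue blocks along `axis` and call func(strip, other) once per strip."""
    # axis=0: each strip is a full column of blocks; axis=1: a full row of blocks.
    strips = partitions.T if axis == 0 else partitions
    return [func(pd.concat(list(strip), axis=axis), other) for strip in strips]


df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list("abcd"))
partitions = np.empty((2, 2), dtype=object)
partitions[0, 0] = df.iloc[:2, :2]
partitions[0, 1] = df.iloc[:2, 2:]
partitions[1, 0] = df.iloc[2:, :2]
partitions[1, 1] = df.iloc[2:, 2:]

# `other` is sent whole to every strip, which is what "broadcast" means here.
other = pd.DataFrame({"w": [10, 20, 30, 40]})
pieces = full_axis_apply(partitions, other, lambda strip, o: strip.mul(o["w"], axis=0))
result = pd.concat(pieces, axis=1)
```

Because each strip spans the whole axis, func can safely use operations that need to see every row at once, such as the row-aligned multiplication above.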

broadcast_apply_select_indices(axis, func, other, apply_indices=None, numeric_indices=None, keep_remaining=False, broadcast_all=True, new_index=None, new_columns=None)

Apply a function to selected indices along the specified axis, broadcasting partitions of the other Modin DataFrame.

Parameters:
  • axis ({0, 1}) – Axis to apply function along.
  • func (callable) – Function to apply.
  • other (BasePandasFrame) – Modin DataFrame whose partitions should be broadcast.
  • apply_indices (list, default: None) – List of labels to apply (if numeric_indices are not specified).
  • numeric_indices (list, default: None) – Numeric indices to apply (if apply_indices are not specified).
  • keep_remaining (bool, default: False) – Whether to keep (True) or drop (False) the data that is not computed over.
  • broadcast_all (bool, default: True) – Whether to broadcast the whole axis of the right frame to every partition or just a subset of it.
  • new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
  • new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
Returns:

New Modin DataFrame.

Return type:

BasePandasFrame

columns

Get the columns from the cache object.

Returns:An index object containing the column labels.
Return type:pandas.Index
classmethod combine_dtypes(list_of_dtypes, column_names)

Describe how data types should be combined when they do not match.

Parameters:
  • list_of_dtypes (list) – A list of pandas Series with the data types.
  • column_names (list) – The names of the columns that the data types map to.
Returns:

A pandas Series containing the finalized data types.

Return type:

pandas.Series
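One plausible way to combine mismatched dtypes is to promote each column to a type that can hold all of its inputs. The sketch below does that with numpy.result_type; this is an assumption made for illustration, not necessarily the rule Modin applies, and it only handles NumPy dtypes (pandas extension dtypes are out of scope). The name combine_dtypes_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def combine_dtypes_sketch(list_of_dtypes, column_names):
    """Promote the dtypes found at each column position to a common dtype."""
    combined = []
    for position in range(len(column_names)):
        candidates = [dtypes.iloc[position] for dtypes in list_of_dtypes]
        combined.append(np.result_type(*candidates))
    return pd.Series(combined, index=column_names)


first = pd.Series([np.dtype("int64"), np.dtype("float64")], index=["x", "y"])
second = pd.Series([np.dtype("float64"), np.dtype("object")], index=["x", "y"])
combined = combine_dtypes_sketch([first, second], ["x", "y"])
```

Here the integer and float inputs for "x" promote to float64, while "y" falls back to object because one of its inputs is already object.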

copy()

Copy this object.

Returns:A copied version of this object.
Return type:BasePandasFrame
dtypes

Compute the data types if they are not cached.

Returns:A pandas Series containing the data types for this dataframe.
Return type:pandas.Series
filter_full_axis(axis, func)

Filter data based on the function provided along an entire axis.

Parameters:
  • axis (int) – The axis to filter over.
  • func (callable) – The function to use for the filter. This function should filter the data itself.
Returns:

A new filtered dataframe.

Return type:

BasePandasFrame

finalize()

Perform all deferred calls on partitions.

This makes self Modin DataFrame independent of a history of queries that were used to build it.

classmethod from_arrow(at)

Create a Modin DataFrame from an Arrow Table.

Parameters:at (pyarrow.Table) – Arrow Table.
Returns:New Modin DataFrame.
Return type:BasePandasFrame
from_labels() → modin.engines.base.frame.data.BasePandasFrame

Convert the row labels to a column of data, inserted at the first position.

Produces a result similar to pandas.DataFrame.reset_index. Each level of self.index is added as a separate column of data.

Returns:A BasePandasFrame with new columns from index labels.
Return type:BasePandasFrame
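Since from_labels is described in terms of pandas.DataFrame.reset_index, the pandas call itself shows the expected outcome for a two-level index. This example uses pandas only; the frame and level names are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1)], names=["key", "num"]
    ),
)

# Each index level becomes a leading data column; rows get a default 0..n-1 index.
flat = df.reset_index()
```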
classmethod from_pandas(df)

Create a Modin DataFrame from a pandas DataFrame.

Parameters:df (pandas.DataFrame) – A pandas DataFrame.
Returns:New Modin DataFrame.
Return type:BasePandasFrame
groupby_reduce(axis, by, map_func, reduce_func, new_index=None, new_columns=None, apply_indices=None)

Group by another Modin DataFrame and aggregate the result.

Parameters:
  • axis ({0, 1}) – Axis to groupby and aggregate over.
  • by (BasePandasFrame or None) – A Modin DataFrame to group by.
  • map_func (callable) – Map component of the aggregation.
  • reduce_func (callable) – Reduce component of the aggregation.
  • new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
  • new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
  • apply_indices (list-like, default: None) – Indices along axis ^ 1 (the opposite axis) to apply the groupby over.
Returns:

New Modin DataFrame.

Return type:

BasePandasFrame
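The split into map_func and reduce_func follows the usual map-reduce shape: aggregate inside each partition first, then aggregate the partial results. The sketch below shows this for a sum over row partitions using plain pandas. It simplifies the real method in one respect: the grouping key is a column of the same frame, whereas by is a separate frame in the signature above.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "val": [1, 2, 3, 4, 5]})
row_partitions = [df.iloc[:3], df.iloc[3:]]


def map_func(part):
    # Map phase: partial sums computed independently inside each partition.
    return part.groupby("key").sum()


def reduce_func(partials):
    # Reduce phase: merge partial sums that share the same group label.
    return partials.groupby(level=0).sum()


partials = pd.concat([map_func(part) for part in row_partitions])
result = reduce_func(partials)
```

This two-phase form only works for aggregations that can be merged from partial results, such as sum, count, min, or max.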

index

Get the index from the cache object.

Returns:An index object containing the row labels.
Return type:pandas.Index
mask(row_indices=None, row_numeric_idx=None, col_indices=None, col_numeric_idx=None)

Lazily select columns or rows from given indices.

Parameters:
  • row_indices (list of hashable, optional) – The row labels to extract.
  • row_numeric_idx (list of int, optional) – The row indices to extract.
  • col_indices (list of hashable, optional) – The column labels to extract.
  • col_numeric_idx (list of int, optional) – The column indices to extract.
Returns:

A new BasePandasFrame from the mask provided.

Return type:

BasePandasFrame

Notes

If both row_indices and row_numeric_idx are set, row_indices will be used. The same rule applies to col_indices and col_numeric_idx.
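The precedence rule in the note can be sketched on a plain pandas DataFrame. The helper mask_sketch is hypothetical and eager, whereas the real method is lazy; it only mirrors the argument handling: labels are translated to positions and override any positions passed alongside them.

```python
import pandas as pd


def mask_sketch(df, row_indices=None, row_numeric_idx=None,
                col_indices=None, col_numeric_idx=None):
    # Labels win over positions when both are given for the same axis.
    if row_indices is not None:
        row_numeric_idx = df.index.get_indexer_for(row_indices)
    if col_indices is not None:
        col_numeric_idx = df.columns.get_indexer_for(col_indices)
    rows = slice(None) if row_numeric_idx is None else row_numeric_idx
    cols = slice(None) if col_numeric_idx is None else col_numeric_idx
    return df.iloc[rows, cols]


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# row_indices=["z"] takes priority over row_numeric_idx=[0].
out = mask_sketch(df, row_indices=["z"], row_numeric_idx=[0], col_numeric_idx=[1])
```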

reorder_labels(row_numeric_idx=None, col_numeric_idx=None)

Reorder the columns and/or rows in this DataFrame.

Parameters:
  • row_numeric_idx (list of int, optional) – The ordered list of current row positions: the value at position i is the current position of the row that is placed at position i in the result.
  • col_numeric_idx (list of int, optional) – The ordered list of current column positions: the value at position i is the current position of the column that is placed at position i in the result.
Returns:

A new BasePandasFrame with reordered columns and/or rows.

Return type:

BasePandasFrame
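A positional reordering of this kind can be previewed with pandas.DataFrame.iloc. The sketch assumes that each list entry names the current position of the row or column that should end up at that place in the result; that reading of the parameters is an assumption of this example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# New row order: current rows 2, 0, 1. New column order: current columns 1, 0.
row_numeric_idx = [2, 0, 1]
col_numeric_idx = [1, 0]
reordered = df.iloc[row_numeric_idx, col_numeric_idx]
```

Labels travel with their data, so the result has index z, x, y and columns b, a.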

to_labels(column_list: List[Hashable]) → modin.engines.base.frame.data.BasePandasFrame

Move one or more columns into the row labels. Previous labels are dropped.

Parameters:column_list (list of hashable) – The list of column names to place as the new row labels.
Returns:A new BasePandasFrame that has the updated labels.
Return type:BasePandasFrame
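The closest plain-pandas counterpart of to_labels is pandas.DataFrame.set_index, which also discards the previous row labels by default. The example below uses pandas only, with made-up data.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b"], "val": [1, 2]}, index=[10, 20])

# "key" moves out of the data and becomes the row labels; 10 and 20 are dropped.
column_list = ["key"]
relabeled = df.set_index(column_list)
```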
to_numpy(**kwargs)

Convert this Modin DataFrame to a NumPy array.

Parameters:**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.
Returns:A NumPy array holding the data of this Modin DataFrame.
Return type:np.ndarray
to_pandas()

Convert this Modin DataFrame to a pandas DataFrame.

Returns:A pandas DataFrame holding the data of this Modin DataFrame.
Return type:pandas.DataFrame
transpose()

Transpose the index and columns of this Modin DataFrame.

Reflect this Modin DataFrame over its main diagonal by writing rows as columns and vice-versa.

Returns:New Modin DataFrame.
Return type:BasePandasFrame
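On a block-partitioned frame, a transpose has two parts: the grid of blocks is transposed, and each block is transposed internally. The sketch below shows this on plain pandas blocks. The helper transpose_grid is hypothetical and eager, and says nothing about how or when Modin actually executes the operation.

```python
import numpy as np
import pandas as pd


def transpose_grid(partitions):
    """Swap block positions in the grid and transpose each block's contents."""
    out = np.empty(partitions.T.shape, dtype=object)
    for i in range(partitions.shape[0]):
        for j in range(partitions.shape[1]):
            out[j, i] = partitions[i, j].T
    return out


df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))
partitions = np.empty((2, 2), dtype=object)
partitions[0, 0] = df.iloc[:2, :2]
partitions[0, 1] = df.iloc[:2, 2:]
partitions[1, 0] = df.iloc[2:, :2]
partitions[1, 1] = df.iloc[2:, 2:]

flipped = transpose_grid(partitions)

# Reassemble the blocks to compare against the ordinary pandas transpose.
rebuilt = pd.concat([pd.concat(list(row), axis=1) for row in flipped], axis=0)
```

The row lengths and column widths swap roles as well: here the heights [2, 1] become the widths of the transposed grid.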