PandasFrame
This class is the base for any frame class of the pandas backend and serves as the intermediate level between the pandas query compiler and the conforming partition manager. All queries formed at the query compiler layer are ingested by this class and then passed, together with the stored partitions, to the partition manager for processing. This class must not manipulate partitions directly, except when an operation is strictly private or protected and is called only from inside the class. The class provides a significantly reduced set of operations that covers a large number of pandas operations.
The main tasks of PandasFrame are storing partitions, manipulating axis labels, and providing a set of methods to perform operations on the internal data.
As mentioned above, PandasFrame shouldn’t work with the stored partitions directly; the responsibility for modifying the partitions array lies with PandasFramePartitionManager. For example, the broadcast_apply_full_axis() method delegates applying the function to the PandasFramePartitionManager.broadcast_axis_partitions method.
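The delegation described above can be illustrated with a minimal sketch. ToyFrame and ToyPartitionManager are hypothetical names invented for this example and are not Modin classes; the sketch only shows a frame that stores blocks and labels while forwarding the actual block processing to a manager.

```python
import numpy as np
import pandas as pd


class ToyPartitionManager:
    """Hypothetical stand-in for a partition manager: owns all block-level work."""

    @staticmethod
    def map_partitions(partitions, func):
        # Apply func to every block of the 2D partition grid.
        result = np.empty(partitions.shape, dtype=object)
        for i in range(partitions.shape[0]):
            for j in range(partitions.shape[1]):
                result[i, j] = func(partitions[i, j])
        return result


class ToyFrame:
    """Hypothetical frame: stores partitions and labels, delegates processing."""

    _partition_mgr_cls = ToyPartitionManager

    def __init__(self, partitions, index, columns):
        self._partitions = partitions
        self.index = pd.Index(index)
        self.columns = pd.Index(columns)

    def map(self, func):
        # The frame never loops over blocks itself; it forwards the query.
        new_parts = self._partition_mgr_cls.map_partitions(self._partitions, func)
        return ToyFrame(new_parts, self.index, self.columns)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
parts = np.empty((2, 1), dtype=object)
parts[0, 0] = df.iloc[:2]
parts[1, 0] = df.iloc[2:]

frame = ToyFrame(parts, df.index, df.columns)
doubled = frame.map(lambda block: block * 2)
```

Keeping the loop over blocks inside the manager means a different manager could be substituted without changing the frame.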
PandasFrame can be created from a pandas.DataFrame or a pyarrow.Table (using the from_pandas() and from_arrow() methods, respectively). A PandasFrame can also be converted to an np.array or a pandas.DataFrame (using the to_numpy() and to_pandas() methods, respectively).
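As a rough illustration of what such a round trip involves, the sketch below splits a pandas DataFrame into a 2D grid of smaller DataFrames and glues the grid back together. The helper names split_into_blocks and blocks_to_pandas, as well as the chunk sizes, are assumptions made for this example; they are not part of the PandasFrame API, and the sketch assumes a non-empty input.

```python
import numpy as np
import pandas as pd


def split_into_blocks(df, row_chunk, col_chunk):
    """Split a pandas DataFrame into a 2D object array of smaller DataFrames."""
    row_starts = range(0, len(df.index), row_chunk)
    col_starts = range(0, len(df.columns), col_chunk)
    blocks = np.empty((len(row_starts), len(col_starts)), dtype=object)
    for i, r in enumerate(row_starts):
        for j, c in enumerate(col_starts):
            blocks[i, j] = df.iloc[r : r + row_chunk, c : c + col_chunk]
    return blocks


def blocks_to_pandas(blocks):
    """Reassemble the block grid into one pandas DataFrame."""
    rows = [pd.concat(list(row), axis=1) for row in blocks]
    return pd.concat(rows, axis=0)


df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
blocks = split_into_blocks(df, row_chunk=2, col_chunk=2)
restored = blocks_to_pandas(blocks)
```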
Axis labels are manipulated using internal methods for replacing labels with new ones, adding prefixes/suffixes, etc.
Public API
- class modin.engines.base.frame.data.PandasFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)
An abstract class that represents the parent class for any pandas backend dataframe class.
This class provides interfaces to run operations on dataframe partitions.
- Parameters
partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a pandas.Index.
columns (sequence) – The columns object for the dataframe. Converted to a pandas.Index.
row_lengths (list, optional) – The length of each partition in the rows, i.e. the “height” of each block partition. Computed if not provided.
column_widths (list, optional) – The width of each partition in the columns, i.e. the “width” of each block partition. Computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
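To make the relationship between partitions, row_lengths and column_widths concrete, here is a small sketch that derives the two lists from a block grid. The 2x2 layout and the use of plain pandas DataFrames as blocks are assumptions of this example; real partitions are backend-specific objects.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["a", "b", "c"])

# Hypothetical 2x2 block layout: rows split 3 + 2, columns split 2 + 1.
partitions = np.empty((2, 2), dtype=object)
partitions[0, 0] = df.iloc[:3, :2]
partitions[0, 1] = df.iloc[:3, 2:]
partitions[1, 0] = df.iloc[3:, :2]
partitions[1, 1] = df.iloc[3:, 2:]

# Heights come from the first block column, widths from the first block row.
row_lengths = [len(partitions[i, 0].index) for i in range(partitions.shape[0])]
column_widths = [len(partitions[0, j].columns) for j in range(partitions.shape[1])]

# The sums must match the full index and columns of the dataframe.
assert sum(row_lengths) == len(df.index)
assert sum(column_widths) == len(df.columns)
```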
- add_prefix(prefix, axis)
Add a prefix to the current row or column labels.
- Parameters
prefix (str) – The prefix to add.
axis (int) – The axis to update.
- Returns
A new dataframe with the updated labels.
- Return type
- add_suffix(suffix, axis)
Add a suffix to the current row or column labels.
- Parameters
suffix (str) – The suffix to add.
axis (int) – The axis to update.
- Returns
A new dataframe with the updated labels.
- Return type
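The effect of add_prefix and add_suffix on the labels can be sketched with plain pandas.Index objects. The helper with_affix is hypothetical, and the sketch assumes that axis 0 refers to the row labels and axis 1 to the column labels, following the order given for the axes property; only labels change, the data is untouched.

```python
import pandas as pd


def with_affix(index, columns, axis, prefix="", suffix=""):
    """Return new (index, columns) with the labels of one axis rewritten."""
    axes = [pd.Index(index), pd.Index(columns)]
    axes[axis] = axes[axis].map(lambda label: f"{prefix}{label}{suffix}")
    return axes[0], axes[1]


new_index, new_columns = with_affix([0, 1], ["a", "b"], axis=1, prefix="x_")
same_index, suffixed = with_affix([0, 1], ["a", "b"], axis=1, suffix="_y")
```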
- apply_full_axis(axis, func, new_index=None, new_columns=None, dtypes=None)
Perform a function across an entire axis.
- Parameters
axis ({0, 1}) – The axis to apply over (0 - rows, 1 - columns).
func (callable) – The function to apply.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
dtypes (list-like, optional) – The data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.
- Returns
A new dataframe.
- Return type
Notes
The data shape may change as a result of the function.
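A full-axis operation needs every block along the axis before the function can run. The sketch below illustrates the idea with pandas DataFrames as blocks; the function name is reused only for readability, and the axis convention (axis=0 stacks blocks vertically so the function sees complete columns) is an assumption of this example rather than a statement about the real implementation.

```python
import numpy as np
import pandas as pd


def apply_full_axis(blocks, axis, func):
    """Concatenate blocks along `axis`, then apply `func` to each full-axis piece.

    Assumption of this sketch: axis=0 stacks blocks vertically so `func` sees
    complete columns; axis=1 joins blocks horizontally so `func` sees complete rows.
    """
    if axis == 0:
        pieces = [pd.concat(list(blocks[:, j]), axis=0) for j in range(blocks.shape[1])]
        return pd.concat([func(p) for p in pieces], axis=1)
    pieces = [pd.concat(list(blocks[i, :]), axis=1) for i in range(blocks.shape[0])]
    return pd.concat([func(p) for p in pieces], axis=0)


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
blocks = np.empty((2, 1), dtype=object)
blocks[0, 0] = df.iloc[:2]
blocks[1, 0] = df.iloc[2:]

# A cumulative sum must cross the block boundary, so it needs the full axis.
running = apply_full_axis(blocks, 0, lambda p: p.cumsum())

# The function may also change the shape: here four rows collapse into one.
collapsed = apply_full_axis(blocks, 0, lambda p: p.sum().to_frame().T)
```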
- apply_full_axis_select_indices(axis, func, apply_indices=None, numeric_indices=None, new_index=None, new_columns=None, keep_remaining=False)
Apply a function across an entire axis for a subset of the data.
- Parameters
axis (int) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over.
numeric_indices (list-like, default: None) – The indices to apply over.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.
- Returns
A new dataframe.
- Return type
- apply_select_indices(axis, func, apply_indices=None, row_indices=None, col_indices=None, new_index=None, new_columns=None, keep_remaining=False, item_to_distribute=None)
Apply a function for a subset of the data.
- Parameters
axis ({0, 1}) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over. Must be given if axis is provided.
row_indices (list-like, default: None) – The row indices to apply over. Must be provided with col_indices to apply over both axes.
col_indices (list-like, default: None) – The column indices to apply over. Must be provided with row_indices to apply over both axes.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.
item_to_distribute (optional) – The item to split up so it can be applied over both axes.
- Returns
A new dataframe.
- Return type
- astype(col_dtypes)
Convert the column dtypes to the given dtypes.
- Parameters
col_dtypes (dictionary of {col: dtype,...}) – Where col is the column name and dtype is a NumPy dtype.
- Returns
Dataframe with updated dtypes.
- Return type
BaseDataFrame
- property axes
Get index and columns that can be accessed with an axis integer.
- Returns
List with two values: index and columns.
- Return type
list
- binary_op(op, right_frame, join_type='outer')
Perform an operation that requires joining with another Modin DataFrame.
- Parameters
op (callable) – Function to apply after the join.
right_frame (PandasFrame) – Modin DataFrame to join with.
join_type (str, default: "outer") – Type of join to apply.
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply(axis, func, other, join_type='left', preserve_labels=True, dtypes=None)
Broadcast axis partitions of other to partitions of self and apply a function.
- Parameters
axis ({0, 1}) – Axis to broadcast over.
func (callable) – Function to apply.
other (PandasFrame) – Modin DataFrame to broadcast.
join_type (str, default: "left") – Type of join to apply.
preserve_labels (bool, default: True) – Whether to keep the labels of the self Modin DataFrame.
dtypes ("copy" or None, default: None) – Whether to keep the old dtypes ("copy") or infer new dtypes from the data (None).
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply_full_axis(axis, func, other, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions=False, dtypes=None)
Broadcast partitions of other Modin DataFrame and apply a function along full axis.
- Parameters
axis ({0, 1}) – Axis to apply over (0 - rows, 1 - columns).
func (callable) – Function to apply.
other (PandasFrame or list) – Modin DataFrame(s) to broadcast.
new_index (list-like, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply function over.
enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must be able to accept a partition_idx kwarg.
dtypes (list-like, default: None) – Data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply_select_indices(axis, func, other, apply_indices=None, numeric_indices=None, keep_remaining=False, broadcast_all=True, new_index=None, new_columns=None)
Apply a function to selected indices along the specified axis and broadcast partitions of the other Modin DataFrame.
- Parameters
axis ({0, 1}) – Axis to apply function along.
func (callable) – Function to apply.
other (PandasFrame) – Modin DataFrame whose partitions should be broadcast.
apply_indices (list, default: None) – List of labels to apply over (if numeric_indices is not specified).
numeric_indices (list, default: None) – Numeric indices to apply over (if apply_indices is not specified).
keep_remaining (bool, default: False) – Whether to keep the data that is not computed over. If False, that data is dropped.
broadcast_all (bool, default: True) – Whether to broadcast the whole axis of the right frame to every partition or just a subset of it.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
- Returns
New Modin DataFrame.
- Return type
- property columns
Get the columns from the cache object.
- Returns
An index object containing the column labels.
- Return type
pandas.Index
- classmethod combine_dtypes(list_of_dtypes, column_names)
Describe how data types should be combined when they do not match.
- Parameters
list_of_dtypes (list) – A list of pandas Series with the data types.
column_names (list) – The names of the columns that the data types map to.
- Returns
A pandas Series containing the finalized data types.
- Return type
pandas.Series
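One plausible way to combine mismatched dtypes is to promote each column to a common type. The sketch below uses numpy.result_type for the promotion, which is an assumption of this example: the source does not say which promotion rule is applied, and this version only handles NumPy dtypes, not pandas extension dtypes.

```python
import numpy as np
import pandas as pd


def combine_dtypes(list_of_dtypes, column_names):
    """Promote the dtypes of each column to a single common NumPy dtype."""
    combined = [
        np.result_type(*[dtypes[name] for dtypes in list_of_dtypes])
        for name in column_names
    ]
    return pd.Series(combined, index=column_names, dtype=object)


first = pd.DataFrame(
    {
        "x": pd.Series([1], dtype="int64"),
        "y": pd.Series([1.5], dtype="float64"),
        "z": pd.Series([True], dtype="bool"),
    }
).dtypes
second = pd.DataFrame(
    {
        "x": pd.Series([2.5], dtype="float64"),
        "y": pd.Series([3.5], dtype="float64"),
        "z": pd.Series([7], dtype="int64"),
    }
).dtypes

combined = combine_dtypes([first, second], ["x", "y", "z"])
```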
- concat(axis, others, how, sort)
Concatenate self with one or more other Modin DataFrames.
- Parameters
axis ({0, 1}) – Axis to concatenate over.
others (list) – List of Modin DataFrames to concatenate with.
how (str) – Type of join to use for the axis.
sort (bool) – Whether to sort the result.
- Returns
New Modin DataFrame.
- Return type
- copy()
Copy this object.
- Returns
A copied version of this object.
- Return type
- property dtypes
Compute the data types if they are not cached.
- Returns
A pandas Series containing the data types for this dataframe.
- Return type
pandas.Series
- filter_full_axis(axis, func)
Filter data based on the function provided along an entire axis.
- Parameters
axis (int) – The axis to filter over.
func (callable) – The function to use for the filter. This function should filter the data itself.
- Returns
A new filtered dataframe.
- Return type
- finalize()
Perform all deferred calls on partitions.
This makes the self Modin DataFrame independent of the history of queries that were used to build it.
- fold(axis, func)
Perform a function across an entire axis.
- Parameters
axis (int) – The axis to apply over.
func (callable) – The function to apply.
- Returns
A new dataframe.
- Return type
Notes
The data shape is not changed (length and width of the table).
- fold_reduce(axis, func)
Apply a function that reduces the Frame Manager to a series but requires knowledge of the full axis.
- Parameters
axis ({0, 1}) – The axis to apply the function to (0 - index, 1 - columns).
func (callable) – The function to reduce the Manager by. This function takes in a Manager.
- Returns
Modin series (1xN frame) containing the reduced data.
- Return type
- classmethod from_arrow(at)
Create a Modin DataFrame from an Arrow Table.
- Parameters
at (pyarrow.Table) – Arrow Table.
- Returns
New Modin DataFrame.
- Return type
- from_labels() → modin.engines.base.frame.data.PandasFrame
Convert the row labels to a column of data, inserted at the first position.
Produces a result similar to pandas.DataFrame.reset_index. Each level of self.index is added as a separate column of data.
- Returns
A PandasFrame with new columns from index labels.
- Return type
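Since the description compares from_labels to pandas.DataFrame.reset_index, the pandas behaviour is a useful reference point. The example below uses only pandas, with made-up data, and shows each index level becoming a separate leading column.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1)], names=["key", "num"]
    ),
)

# Every level of the row labels is inserted as a column at the front.
flat = df.reset_index()
```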
- classmethod from_pandas(df)
Create a Modin DataFrame from a pandas DataFrame.
- Parameters
df (pandas.DataFrame) – A pandas DataFrame.
- Returns
New Modin DataFrame.
- Return type
- groupby_reduce(axis, by, map_func, reduce_func, new_index=None, new_columns=None, apply_indices=None)
Group by another Modin DataFrame and aggregate the result.
- Parameters
axis ({0, 1}) – Axis to groupby and aggregate over.
by (PandasFrame or None) – A Modin DataFrame to group by.
map_func (callable) – Map component of the aggregation.
reduce_func (callable) – Reduce component of the aggregation.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply groupby over.
- Returns
New Modin DataFrame.
- Return type
- property index
Get the index from the cache object.
- Returns
An index object containing the row labels.
- Return type
pandas.Index
- map(func, dtypes=None)
Perform a function that maps across the entire dataset.
- Parameters
func (callable) – The function to apply.
dtypes (dtypes of the result, default: None) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.
- Returns
A new dataframe.
- Return type
- map_reduce(axis, map_func, reduce_func=None)
Apply function that will reduce the data to a pandas Series.
- Parameters
axis ({0, 1}) – 0 for columns and 1 for rows.
map_func (callable) – Callable function to map the dataframe.
reduce_func (callable, default: None) – Callable function to reduce the dataframe. If None, map_func is applied twice.
- Returns
A new dataframe.
- Return type
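The map and reduce phases can be sketched as follows: the map function runs on each block independently, and the reduce function then combines the partial results along the axis. The helper names and the nested-list block layout are assumptions of this example, and only the column-wise case is shown.

```python
import pandas as pd


def map_reduce_columns(blocks, map_func, reduce_func=None):
    """Map over every block, then reduce the partial results per block column."""
    if reduce_func is None:
        # Mirrors the documented default: the map function is applied twice.
        reduce_func = map_func
    n_rows, n_cols = len(blocks), len(blocks[0])
    reduced = []
    for j in range(n_cols):
        partials = pd.concat(
            [map_func(blocks[i][j]) for i in range(n_rows)], axis=0
        )
        reduced.append(reduce_func(partials))
    return pd.concat(reduced, axis=1)


def column_sum(frame):
    """Sum each column and return the result as a one-row DataFrame."""
    return frame.sum(axis=0).to_frame().T


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
blocks = [[df.iloc[:2]], [df.iloc[2:]]]  # two row blocks, one column block

totals = map_reduce_columns(blocks, column_sum)
```

A sum works with the same function in both phases because a sum of partial sums equals the total; an operation such as a mean would need a different reduce function.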
- mask(row_indices=None, row_numeric_idx=None, col_indices=None, col_numeric_idx=None)
Lazily select columns or rows from given indices.
- Parameters
row_indices (list of hashable, optional) – The row labels to extract.
row_numeric_idx (list of int, optional) – The row indices to extract.
col_indices (list of hashable, optional) – The column labels to extract.
col_numeric_idx (list of int, optional) – The column indices to extract.
- Returns
A new PandasFrame from the mask provided.
- Return type
Notes
If both row_indices and row_numeric_idx are set, row_indices will be used. The same rule applies to col_indices and col_numeric_idx.
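The precedence rule in the note can be sketched on a plain pandas DataFrame. This is an eager illustration only (the real method is described as lazy), the function operates on a pandas.DataFrame rather than a PandasFrame, and it assumes every requested label exists.

```python
import pandas as pd


def mask(df, row_indices=None, row_numeric_idx=None,
         col_indices=None, col_numeric_idx=None):
    """Select rows/columns by label or position; labels win when both are given."""
    if row_indices is not None:
        # Translate labels into positions, overriding any numeric indices.
        row_numeric_idx = df.index.get_indexer_for(row_indices)
    if col_indices is not None:
        col_numeric_idx = df.columns.get_indexer_for(col_indices)
    rows = slice(None) if row_numeric_idx is None else row_numeric_idx
    cols = slice(None) if col_numeric_idx is None else col_numeric_idx
    return df.iloc[rows, cols]


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# row_indices takes precedence over row_numeric_idx, so row "z" is selected.
selected = mask(df, row_indices=["z"], row_numeric_idx=[0], col_numeric_idx=[1])
```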
- numeric_columns(include_bool=True)
Return the names of numeric columns in the frame.
- Parameters
include_bool (bool, default: True) – Whether to consider boolean columns as numeric.
- Returns
List of column names.
- Return type
list
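A selection like this can be sketched with the dtype helpers in pandas.api.types. The standalone function below takes a dtypes Series instead of a frame, which is an assumption of this example; it relies on pandas treating boolean dtypes as numeric, so they have to be filtered out explicitly when include_bool is False.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def numeric_columns(dtypes, include_bool=True):
    """Return the names of columns whose dtype is numeric."""
    return [
        col
        for col, dtype in dtypes.items()
        if is_numeric_dtype(dtype) and (include_bool or not is_bool_dtype(dtype))
    ]


dtypes = pd.DataFrame(
    {"i": [1], "f": [1.0], "b": [True], "s": ["text"]}
).dtypes

with_bool = numeric_columns(dtypes)
without_bool = numeric_columns(dtypes, include_bool=False)
```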
- synchronize_labels(axis=None)
Synchronize labels by lazily applying the index object for a specific axis to self._partitions.
Adds a set_axis function to the call queue of each partition in self._partitions to apply the new axis.
- Parameters
axis (int, default: None) – The axis to apply to. If None, applies to both axes.
- to_labels(column_list: List[Hashable]) → modin.engines.base.frame.data.PandasFrame
Move one or more columns into the row labels. Previous labels are dropped.
- Parameters
column_list (list of hashable) – The list of column names to place as the new row labels.
- Returns
A new PandasFrame that has the updated labels.
- Return type
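In plain pandas, the closest analogue to this behaviour is DataFrame.set_index, which replaces the existing row labels with the chosen columns. The comparison is an assumption made for illustration; the source does not state how to_labels is implemented.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "b", "c"], "value": [10, 20, 30]}, index=[7, 8, 9]
)

# Move the "key" column into the row labels; the old labels 7, 8, 9 are discarded.
relabeled = df.set_index(["key"])
```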
- to_numpy(**kwargs)
Convert this Modin DataFrame to a NumPy array.
- Parameters
**kwargs (dict) – Additional keyword arguments to be passed to to_numpy.
- Returns
- Return type
np.ndarray
- to_pandas()
Convert this Modin DataFrame to a pandas DataFrame.
- Returns
- Return type
- transpose()
Transpose the index and columns of this Modin DataFrame.
Reflect this Modin DataFrame over its main diagonal by writing rows as columns and vice-versa.
- Returns
New Modin DataFrame.
- Return type
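For a block-partitioned frame, reflecting over the main diagonal can be done block by block: each block is transposed and moved from grid position (i, j) to (j, i). The sketch below demonstrates this with pandas DataFrames as blocks; the helper names and the 2x2 layout are assumptions of this example.

```python
import numpy as np
import pandas as pd


def transpose_blocks(blocks):
    """Transpose every block and swap its grid position (i, j) -> (j, i)."""
    out = np.empty((blocks.shape[1], blocks.shape[0]), dtype=object)
    for i in range(blocks.shape[0]):
        for j in range(blocks.shape[1]):
            out[j, i] = blocks[i, j].T
    return out


def assemble(blocks):
    """Glue a block grid back into one pandas DataFrame."""
    return pd.concat([pd.concat(list(row), axis=1) for row in blocks], axis=0)


df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
blocks = np.empty((2, 2), dtype=object)
blocks[0, 0], blocks[0, 1] = df.iloc[:2, :2], df.iloc[:2, 2:]
blocks[1, 0], blocks[1, 1] = df.iloc[2:, :2], df.iloc[2:, 2:]

transposed = assemble(transpose_blocks(blocks))
```

Because no block needs data from any other block, the per-block transposes are independent of each other.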