PandasDataframe
PandasDataframe is a direct descendant of ModinDataframe. Its purpose is to implement the abstract interfaces for use with all pandas-based storage formats.
PandasDataframe can be inherited and extended by any specific implementation that needs special handling of some behavior or better performance on a certain execution engine.
The class serves as the intermediate level between the pandas query compiler and the conforming partition manager. All queries formed at the query compiler layer are ingested by this class and then passed, together with the stored partitions, to the partition manager for processing. This class must not manipulate partitions directly, except in operations that are strictly private or protected and are called only inside the class. The class provides a significantly reduced set of operations that covers a large number of pandas operations.
The main tasks of PandasDataframe are storing partitions, manipulating axis labels, and providing a set of methods to perform operations on the internal data.
As mentioned above, PandasDataframe shouldn’t work with stored partitions directly; the responsibility for modifying the partitions array lies with PandasDataframePartitionManager. For example, the method broadcast_apply_full_axis() delegates applying the function to the broadcast_axis_partitions() method.
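The division of responsibilities described above can be pictured with a small sketch. It is not Modin code: `SketchFrame` and `SketchPartitionManager` are hypothetical stand-ins, and blocks are modeled as plain lists of rows rather than real partition objects. The only point is that the frame layer stores blocks and labels, while every loop over blocks lives in the manager.

```python
class SketchPartitionManager:
    """Hypothetical stand-in for a partition manager: it alone touches the blocks."""

    @staticmethod
    def map_partitions(partitions, func):
        # Apply func to every block in the 2D grid and return a new grid.
        return [[func(block) for block in row] for row in partitions]


class SketchFrame:
    """Hypothetical stand-in for the dataframe layer: stores blocks and labels."""

    manager = SketchPartitionManager

    def __init__(self, partitions, index, columns):
        self._partitions = partitions
        self.index = list(index)
        self.columns = list(columns)

    def map(self, func):
        # The frame never loops over blocks itself; it hands them to the manager.
        new_partitions = self.manager.map_partitions(self._partitions, func)
        return SketchFrame(new_partitions, self.index, self.columns)


# A 2x2 grid of blocks; each block is a list of rows.
grid = [
    [[[1, 2]], [[3]]],
    [[[4, 5]], [[6]]],
]
frame = SketchFrame(grid, index=["a", "b"], columns=["x", "y", "z"])
doubled = frame.map(lambda block: [[v * 2 for v in row] for row in block])
```

The original `frame` is left untouched and `doubled` is a new object, mirroring the "returns a new dataframe" convention used throughout the API below.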
A Modin PandasDataframe can be created from a pandas.DataFrame or a pyarrow.Table (using the methods from_pandas() and from_arrow(), respectively). A PandasDataframe can also be converted to an np.array or a pandas.DataFrame (using the methods to_numpy() and to_pandas(), respectively).
Axis labels are manipulated through internal methods for replacing labels with new ones, adding prefixes or suffixes, and so on.
Public API
- class modin.core.dataframe.pandas.dataframe.dataframe.PandasDataframe(partitions, index=None, columns=None, row_lengths=None, column_widths=None, dtypes=None)#
An abstract class that represents the parent class for any pandas storage format dataframe class.
This class provides interfaces to run operations on dataframe partitions.
- Parameters
partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence, optional) – The index for the dataframe. Converted to a pandas.Index. Is computed from partitions on demand if not specified.
columns (sequence, optional) – The columns object for the dataframe. Converted to a pandas.Index. Is computed from partitions on demand if not specified.
row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.
column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
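To make the "height" and "width" wording concrete, here is a rough sketch of how row_lengths and column_widths could be derived from a 2D grid of blocks. It assumes each block is a plain list of rows and that the grid is well formed (all blocks in one grid row share a height, all blocks in one grid column share a width). The helper names are hypothetical and this is not how Modin computes them internally.

```python
def block_shape(block):
    # A block is a list of rows; every row has the same number of cells.
    n_rows = len(block)
    n_cols = len(block[0]) if n_rows else 0
    return n_rows, n_cols


def compute_row_lengths(partitions):
    # Heights are read from the first block in each row of the grid.
    return [block_shape(row[0])[0] for row in partitions]


def compute_column_widths(partitions):
    # Widths are read from the blocks in the first row of the grid.
    return [block_shape(block)[1] for block in partitions[0]]


partitions = [
    [[[1, 2], [3, 4]], [[5], [6]]],  # two blocks, each 2 rows high
    [[[7, 8]], [[9]]],               # two blocks, each 1 row high
]
row_lengths = compute_row_lengths(partitions)      # [2, 1]
column_widths = compute_column_widths(partitions)  # [2, 1]
```

The sums of the two lists give the overall shape of the frame, here 3 rows by 3 columns.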
- add_prefix(prefix, axis)#
Add a prefix to the current row or column labels.
- Parameters
prefix (str) – The prefix to add.
axis (int) – The axis to update.
- Returns
A new dataframe with the updated labels.
- Return type
- add_suffix(suffix, axis)#
Add a suffix to the current row or column labels.
- Parameters
suffix (str) – The suffix to add.
axis (int) – The axis to update.
- Returns
A new dataframe with the updated labels.
- Return type
- apply_full_axis(axis, func, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions: bool = False, dtypes=None, keep_partitioning=True, num_splits=None, sync_labels=True, pass_axis_lengths_to_partitions=False)#
Perform a function across an entire axis.
- Parameters
axis ({0, 1}) – The axis to apply over (0 - rows, 1 - columns).
func (callable) – The function to apply.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply function over.
enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must accept a partition_idx keyword argument.
dtypes (list-like, optional) – The data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.
keep_partitioning (boolean, default: True) – The flag to keep partition boundaries for Modin Frame if possible. Setting it to True disables shuffling data from one partition to another in case the resulting number of splits is equal to the initial number of splits.
num_splits (int, optional) – The number of partitions to split the result into across the axis. If None, then the number of splits will be inferred automatically. If num_splits is None and keep_partitioning=True then the number of splits is preserved.
sync_labels (boolean, default: True) – Synchronize external indexes (new_index, new_columns) with internal indexes. This could be used when you’re certain that the indices in partitions are equal to the provided hints in order to save time on syncing them.
pass_axis_lengths_to_partitions (bool, default: False) – Whether to pass partition lengths along axis ^ 1 to the kernel func. Note that func must accept the arguments df, *axis_lengths.
- Returns
A new dataframe.
- Return type
Notes
The data shape may change as a result of the function.
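A rough illustration of what "across an entire axis" means: the blocks along one axis are joined into a full view, the function runs on that view, and the result is split back along the original boundaries (the spirit of keep_partitioning=True). The sketch below is a simplification with hypothetical names. It only joins blocks top to bottom and only handles functions that keep the number of rows unchanged, whereas the real method allows the shape to change.

```python
def apply_over_full_columns(partitions, func):
    """Join every column of blocks top to bottom, apply func, and re-split."""
    n_block_rows = len(partitions)
    n_block_cols = len(partitions[0])
    lengths = [len(partitions[i][0]) for i in range(n_block_rows)]
    result = [[None] * n_block_cols for _ in range(n_block_rows)]
    for j in range(n_block_cols):
        # Build the full-axis view: all rows of this column of blocks.
        full = [row for i in range(n_block_rows) for row in partitions[i][j]]
        out = func(full)
        if len(out) != len(full):
            raise ValueError("this sketch only re-splits results of unchanged length")
        # Keep the original partition boundaries.
        start = 0
        for i, length in enumerate(lengths):
            result[i][j] = out[start:start + length]
            start += length
    return result


def cumsum_rows(rows):
    # Running total down each column; needs to see all rows in order.
    totals, out = [0] * len(rows[0]), []
    for row in rows:
        totals = [t + v for t, v in zip(totals, row)]
        out.append(totals)
    return out


grid = [
    [[[1, 10], [2, 20]], [[100], [200]]],
    [[[3, 30]], [[300]]],
]
result = apply_over_full_columns(grid, cumsum_rows)
```

A cumulative sum is a natural example because it cannot be computed block by block: the last block needs the totals accumulated in the blocks above it.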
- apply_full_axis_select_indices(axis, func, apply_indices=None, numeric_indices=None, new_index=None, new_columns=None, keep_remaining=False)#
Apply a function across an entire axis for a subset of the data.
- Parameters
axis (int) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over.
numeric_indices (list-like, default: None) – The indices to apply over.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether or not to drop the data that is not computed over.
- Returns
A new dataframe.
- Return type
- apply_select_indices(axis, func, apply_indices=None, row_labels=None, col_labels=None, new_index=None, new_columns=None, keep_remaining=False, item_to_distribute=_NoDefault.no_default)#
Apply a function for a subset of the data.
- Parameters
axis ({0, 1}) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over. Must be given if axis is provided.
row_labels (list-like, default: None) – The row labels to apply over. Must be provided with col_labels to apply over both axes.
col_labels (list-like, default: None) – The column labels to apply over. Must be provided with row_labels to apply over both axes.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether or not to drop the data that is not computed over.
item_to_distribute (np.ndarray or scalar, default: no_default) – The item to split up so it can be applied over both axes.
- Returns
A new dataframe.
- Return type
- astype(col_dtypes)#
Convert the columns dtypes to given dtypes.
- Parameters
col_dtypes (dictionary of {col: dtype,...}) – Where col is the column name and dtype is a NumPy dtype.
- Returns
Dataframe with updated dtypes.
- Return type
BaseDataFrame
- property axes#
Get index and columns that can be accessed with an axis integer.
- Returns
List with two values: index and columns.
- Return type
list
- broadcast_apply(axis, func, other, join_type='left', labels='keep', dtypes=None)#
Broadcast axis partitions of other to partitions of self and apply a function.
- Parameters
axis ({0, 1}) – Axis to broadcast over.
func (callable) – Function to apply.
other (PandasDataframe) – Modin DataFrame to broadcast.
join_type (str, default: "left") – Type of join to apply.
labels ({"keep", "replace", "drop"}, default: "keep") – Whether to keep the labels of the self Modin DataFrame, replace them with labels from the joined DataFrame, or drop them altogether so that they are computed lazily later.
dtypes ("copy" or None, default: None) – Whether to keep the old dtypes or infer new dtypes from the data.
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply_full_axis(axis, func, other, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions=False, dtypes=None, keep_partitioning=True, num_splits=None, sync_labels=True, pass_axis_lengths_to_partitions=False)#
Broadcast partitions of other Modin DataFrame and apply a function along full axis.
- Parameters
axis ({0, 1}) – Axis to apply over (0 - rows, 1 - columns).
func (callable) – Function to apply.
other (PandasDataframe or list) – Modin DataFrame(s) to broadcast.
new_index (list-like, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply function over.
enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func. Note that func must accept a partition_idx keyword argument.
dtypes (list-like, default: None) – Data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.
keep_partitioning (boolean, default: True) – The flag to keep partition boundaries for Modin Frame if possible. Setting it to True disables shuffling data from one partition to another in case the resulting number of splits is equal to the initial number of splits.
num_splits (int, optional) – The number of partitions to split the result into across the axis. If None, then the number of splits will be inferred automatically. If num_splits is None and keep_partitioning=True then the number of splits is preserved.
sync_labels (boolean, default: True) – Synchronize external indexes (new_index, new_columns) with internal indexes. This could be used when you’re certain that the indices in partitions are equal to the provided hints in order to save time on syncing them.
pass_axis_lengths_to_partitions (bool, default: False) – Whether to pass partition lengths along axis ^ 1 to the kernel func. Note that func must accept the arguments df, *axis_lengths.
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply_select_indices(axis, func, other, apply_indices=None, numeric_indices=None, keep_remaining=False, broadcast_all=True, new_index=None, new_columns=None)#
Apply a function to select indices at specified axis and broadcast partitions of other Modin DataFrame.
- Parameters
axis ({0, 1}) – Axis to apply function along.
func (callable) – Function to apply.
other (PandasDataframe) – Modin DataFrame whose partitions should be broadcast.
apply_indices (list, default: None) – List of labels to apply (if numeric_indices are not specified).
numeric_indices (list, default: None) – Numeric indices to apply (if apply_indices are not specified).
keep_remaining (bool, default: False) – Whether to drop the data that is not computed over or not.
broadcast_all (bool, default: True) – Whether to broadcast the whole axis of the right frame to every partition or just a subset of it.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
- Returns
New Modin DataFrame.
- Return type
- property column_widths#
Compute the column partitions widths if they are not cached.
- Returns
A list of column partitions widths.
- Return type
list
- property columns#
Get the columns from the cache object.
- Returns
An index object containing the column labels.
- Return type
pandas.Index
- concat(axis: Union[int, Axis], others: Union[PandasDataframe, List[PandasDataframe]], how, sort) PandasDataframe #
Concatenate self with one or more other Modin DataFrames.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – Axis to concatenate over.
others (list) – List of Modin DataFrames to concatenate with.
how (str) – Type of join to use for the axis.
sort (bool) – Whether to sort the result.
- Returns
New Modin DataFrame.
- Return type
PandasDataframe
- copy()#
Copy this object.
- Returns
A copied version of this object.
- Return type
- property dtypes#
Compute the data types if they are not cached.
- Returns
A pandas Series containing the data types for this dataframe.
- Return type
pandas.Series
- explode(axis: Union[int, Axis], func: Callable) PandasDataframe #
Explode list-like entries along an entire axis.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis specifying how to explode. If axis=1, explode according to columns.
func (callable) – The function to use to explode a single element.
- Returns
A new dataframe with the list-like entries exploded.
- Return type
PandasDataframe
- filter(axis: Union[Axis, int], condition: Callable) PandasDataframe #
Filter data based on the function provided along an entire axis.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to filter over.
condition (callable(row|col) -> bool) – The function to use for the filter. This function should filter the data itself.
- Returns
A new filtered dataframe.
- Return type
PandasDataframe
- filter_by_types(types: List[Hashable]) PandasDataframe #
Allow the user to specify a type or set of types by which to filter the columns.
- Parameters
types (list) – The types to filter columns by.
- Returns
A new PandasDataframe from the filter provided.
- Return type
PandasDataframe
- finalize()#
Perform all deferred calls on partitions.
This makes this Modin Dataframe independent of the history of queries that were used to build it.
- fold(axis, func, new_columns=None)#
Perform a function across an entire axis.
- Parameters
axis (int) – The axis to apply over.
func (callable) – The function to apply.
new_columns (list-like, optional) – The columns of the result. Must be the same length as the columns’ length of self. The column labels of self may change during an operation so we may want to pass the new column labels in (e.g., see cat.codes).
- Returns
A new dataframe.
- Return type
Notes
The data shape is not changed (length and width of the table).
- classmethod from_arrow(at)#
Create a Modin DataFrame from an Arrow Table.
- Parameters
at (pyarrow.Table) – Arrow Table.
- Returns
New Modin DataFrame.
- Return type
- classmethod from_dataframe(df: ProtocolDataframe) PandasDataframe #
Convert a DataFrame implementing the dataframe exchange protocol to a Core Modin Dataframe.
See more about the protocol in https://data-apis.org/dataframe-protocol/latest/index.html.
- Parameters
df (ProtocolDataframe) – The DataFrame object supporting the dataframe exchange protocol.
- Returns
A new Core Modin Dataframe object.
- Return type
PandasDataframe
- from_labels() PandasDataframe #
Convert the row labels to a column of data, inserted at the first position.
Produces a result similar to that of pandas.DataFrame.reset_index. Each level of self.index is added as a separate column of data.
- Returns
A PandasDataframe with new columns from index labels.
- Return type
PandasDataframe
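As a simplified picture of moving row labels into the data, the sketch below prepends a single-level index as the first column and replaces the row labels with positions. The function name, the tuple it returns, and the default column name "index" are all assumptions made for illustration. Multi-level indexes, which the real method expands into one column per level, are not handled.

```python
def labels_to_first_column(index, columns, rows, index_name="index"):
    """Prepend the row labels as a data column and reset labels to positions."""
    new_columns = [index_name] + list(columns)
    new_rows = [[label] + list(row) for label, row in zip(index, rows)]
    new_index = list(range(len(new_rows)))
    return new_index, new_columns, new_rows


index = ["a", "b"]
columns = ["x", "y"]
rows = [[1, 2], [3, 4]]
new_index, new_columns, new_rows = labels_to_first_column(index, columns, rows)
```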
- classmethod from_pandas(df)#
Create a Modin DataFrame from a pandas DataFrame.
- Parameters
df (pandas.DataFrame) – A pandas DataFrame.
- Returns
New Modin DataFrame.
- Return type
- groupby(axis: Union[int, Axis], by: Union[str, List[str]], operator: Callable, result_schema: Optional[Dict[Hashable, type]] = None) PandasDataframe #
Generate groups based on values in the input column(s) and perform the specified operation on each.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to apply the grouping over.
by (string or list of strings) – One or more column labels to use for grouping.
operator (callable) – The operation to carry out on each of the groups. The operator is another algebraic operator with its own user-defined function parameter, depending on the output desired by the user.
result_schema (dict, optional) – Mapping from column labels to data types that represents the types of the output dataframe.
- Returns
A new PandasDataframe containing the groupings specified, with the operator applied to each group.
- Return type
PandasDataframe
Notes
No communication between groups is allowed in this algebra implementation.
The number of rows (columns if axis=1) returned by the user-defined function passed to the groupby may be at most the number of rows in the group, and may be as small as a single row.
Unlike the pandas API, an intermediate “GROUP BY” object is not present in this algebra implementation.
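The notes above can be illustrated with a small, self-contained sketch. Rows are modeled as dictionaries and `groupby_apply` is a hypothetical helper, not a Modin function. Each group is processed independently (no communication between groups), the operator returns between one row and as many rows as the group holds, and no intermediate "GROUP BY" object is exposed.

```python
def groupby_apply(rows, by, operator):
    """Group dict-rows by the `by` key and apply operator to each group independently."""
    groups = {}
    for row in rows:
        groups.setdefault(row[by], []).append(row)
    result = []
    for group in groups.values():
        out = operator(group)
        # Per the notes: each group yields between one row and len(group) rows.
        if not 1 <= len(out) <= len(group):
            raise ValueError("operator returned an unexpected number of rows")
        result.extend(out)
    return result


def total_score(group):
    # Collapse a group to a single row holding the summed score.
    return [{"team": group[0]["team"], "score": sum(r["score"] for r in group)}]


rows = [
    {"team": "a", "score": 3},
    {"team": "b", "score": 5},
    {"team": "a", "score": 4},
]
totals = groupby_apply(rows, "team", total_score)
```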
- groupby_reduce(axis, by, map_func, reduce_func, new_index=None, new_columns=None, apply_indices=None)#
Group by another Modin DataFrame and aggregate the result.
- Parameters
axis ({0, 1}) – Axis to groupby and aggregate over.
by (PandasDataframe or None) – A Modin DataFrame to group by.
map_func (callable) – Map component of the aggregation.
reduce_func (callable) – Reduce component of the aggregation.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply groupby over.
- Returns
New Modin DataFrame.
- Return type
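The split into map_func and reduce_func follows the usual map-reduce pattern: a partial aggregation runs inside each partition, and the partial results are then combined. The sketch below shows that pattern for a grouped sum, assuming each partition is a list of (key, value) pairs. The function bodies and the data layout are illustrative choices, not the callables Modin passes internally.

```python
from functools import reduce


def map_func(partition):
    # Partial aggregation inside one partition: sum of values per key.
    partial = {}
    for key, value in partition:
        partial[key] = partial.get(key, 0) + value
    return partial


def reduce_func(left, right):
    # Combine two partial results by summing the values of matching keys.
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


partitions = [
    [("a", 1), ("b", 2)],
    [("a", 10), ("c", 3)],
    [("b", 20)],
]
partials = [map_func(p) for p in partitions]
combined = reduce(reduce_func, partials)
```

This only works for aggregations whose partial results can be merged, such as sum, count, min, or max.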
- property index#
Get the index from the cache object.
- Returns
An index object containing the row labels.
- Return type
pandas.Index
- infer_objects() PandasDataframe #
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.
- Returns
A new PandasDataframe with the inferred schema.
- Return type
PandasDataframe
- infer_types(col_labels: List[str]) PandasDataframe #
Determine the compatible type shared by all values in the specified columns, and coerce them to that type.
- Parameters
col_labels (list) – List of column labels to infer and induce types over.
- Returns
A new PandasDataframe with the inferred schema.
- Return type
PandasDataframe
- join(axis: Union[int, Axis], condition: Callable, other: ModinDataframe, join_type: Union[str, JoinType]) PandasDataframe #
Join this dataframe with the other.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the join on.
condition (callable) – Function that determines which rows should be joined. The condition can be a simple equality, e.g. “left.col1 == right.col1” or can be arbitrarily complex.
other (ModinDataframe) – The other data to join with, i.e. the right dataframe.
join_type (string {"inner", "left", "right", "outer"} or modin.core.dataframe.base.utils.JoinType) – The type of join to perform.
- Returns
A new PandasDataframe that is the result of applying the specified join over the two dataframes.
- Return type
PandasDataframe
Notes
During the join, this dataframe is considered the left, while the other is treated as the right.
Only inner joins, left outer, right outer, and full outer joins are currently supported. Support for other join types (e.g. natural join) may be implemented in the future.
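To show how a callable condition can drive a join, here is a minimal nested-loop sketch over lists of dictionary rows. `join_rows` is a hypothetical helper that covers only the "inner" and "left" join types. It treats the first argument as the left side, as the notes describe, and fills right-only columns with None for unmatched left rows.

```python
def join_rows(left, right, condition, join_type="inner"):
    """Nested-loop join of two lists of dict-rows; supports 'inner' and 'left' only."""
    if join_type not in ("inner", "left"):
        raise ValueError(f"unsupported join_type in this sketch: {join_type}")
    right_keys = sorted({key for row in right for key in row})
    result = []
    for l_row in left:
        matches = [r_row for r_row in right if condition(l_row, r_row)]
        if matches:
            result.extend({**l_row, **r_row} for r_row in matches)
        elif join_type == "left":
            # Keep the unmatched left row and fill right-only columns with None.
            filler = {key: None for key in right_keys if key not in l_row}
            result.append({**l_row, **filler})
    return result


def on_id(l_row, r_row):
    # A simple equality condition, analogous to "left.col1 == right.col1".
    return l_row["id"] == r_row["rid"]


left = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
right = [{"rid": 1, "value": 10}]
inner = join_rows(left, right, on_id, "inner")
left_join = join_rows(left, right, on_id, "left")
```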
- map(func: Callable, dtypes: Optional[str] = None) PandasDataframe #
Perform a function that maps across the entire dataset.
- Parameters
func (callable(row|col|cell) -> row|col|cell) – The function to apply.
dtypes (dtypes of the result, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.
- Returns
A new dataframe.
- Return type
PandasDataframe
- n_ary_op(op, right_frames: list, join_type='outer', copartition_along_columns=True, dtypes=None)#
Perform an n-ary operation by joining with other Modin DataFrame(s).
- Parameters
op (callable) – Function to apply after the join.
right_frames (list of PandasDataframe) – Modin DataFrames to join with.
join_type (str, default: "outer") – Type of join to apply.
copartition_along_columns (bool, default: True) – Whether to perform copartitioning along columns or not. For some ops this isn’t needed (e.g., fillna).
dtypes (series, default: None) – Dtypes of the resultant dataframe. This argument is passed when the resultant dtypes of the n-ary operation are precomputed.
- Returns
New Modin DataFrame.
- Return type
- numeric_columns(include_bool=True)#
Return the names of numeric columns in the frame.
- Parameters
include_bool (bool, default: True) – Whether to consider boolean columns as numeric.
- Returns
List of column names.
- Return type
list
- reduce(axis: Union[int, Axis], function: Callable, dtypes: Optional[str] = None) PandasDataframe #
Perform a user-defined aggregation on the specified axis, where the axis reduces down to a singleton. Requires knowledge of the full axis for the reduction.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the reduce over.
function (callable(row|col) -> single value) – The reduce function to apply to each column.
dtypes (str, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.
- Returns
Modin series (1xN frame) containing the reduced data.
- Return type
PandasDataframe
Notes
The user-defined function must reduce to a single value.
- rename(new_row_labels: Optional[Union[Dict[Hashable, Hashable], Callable]] = None, new_col_labels: Optional[Union[Dict[Hashable, Hashable], Callable]] = None, level: Optional[Union[int, List[int]]] = None) PandasDataframe #
Replace the row and column labels with the specified new labels.
- Parameters
new_row_labels (dictionary or callable, optional) – Mapping or callable that relates old row labels to new labels.
new_col_labels (dictionary or callable, optional) – Mapping or callable that relates old col labels to new labels.
level (int, optional) – Level whose row labels to replace.
- Returns
A new PandasDataframe with the new row and column labels.
- Return type
PandasDataframe
Notes
If level is not specified, the default behavior is to replace row labels in all levels.
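Since the label arguments accept either a mapping or a callable, a small sketch of that dispatch may help. `relabel` is a hypothetical helper that works on a flat list of labels. It assumes, as pandas' own rename does, that labels missing from a mapping are left unchanged. The level argument is not modeled.

```python
def relabel(labels, mapper=None):
    """Return new labels from a dict or callable mapper; None leaves them as is."""
    if mapper is None:
        return list(labels)
    if callable(mapper):
        return [mapper(label) for label in labels]
    # Mapping case: labels without an entry keep their old value.
    return [mapper.get(label, label) for label in labels]


columns = ["a", "b", "c"]
by_dict = relabel(columns, {"a": "A"})
by_callable = relabel(columns, str.upper)
untouched = relabel(columns)
```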
- property row_lengths#
Compute the row partitions lengths if they are not cached.
- Returns
A list of row partitions lengths.
- Return type
list
- sort_by(axis: Union[int, Axis], columns: Union[str, List[str]], ascending: bool = True, **kwargs) PandasDataframe #
Logically reorder rows (columns if axis=1) lexicographically by the data in a column or set of columns.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the sort over.
columns (string or list) – Column label(s) to use to determine lexicographical ordering.
ascending (boolean, default: True) – Whether to sort in ascending or descending order.
**kwargs (dict) – Keyword arguments to pass when sorting partitions.
- Returns
A new PandasDataframe sorted into lexicographical order by the specified column(s).
- Return type
PandasDataframe
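"Lexicographical order by a set of columns" means that rows are compared on the first column, with ties broken by the second, and so on. A minimal sketch, using dictionary rows and a hypothetical `sort_rows_by` helper built on Python's `sorted`:

```python
def sort_rows_by(rows, columns, ascending=True):
    """Sort dict-rows lexicographically by one column label or a list of labels."""
    if isinstance(columns, str):
        columns = [columns]
    # The tuple key compares the first column first, then the next on ties.
    return sorted(
        rows,
        key=lambda row: tuple(row[c] for c in columns),
        reverse=not ascending,
    )


rows = [
    {"city": "b", "n": 2},
    {"city": "a", "n": 3},
    {"city": "b", "n": 1},
]
asc = sort_rows_by(rows, ["city", "n"])
desc = sort_rows_by(rows, ["city", "n"], ascending=False)
```

`sorted` returns a new list, so the input order of `rows` is preserved, in line with the method returning a new PandasDataframe.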
- synchronize_labels(axis=None)#
Set the deferred axes variables for the PandasDataframe.
- Parameters
axis (int, default: None) – The deferred axis. 0 for the index, 1 for the columns.
- take_2d_labels_or_positional(row_labels: Optional[List[Hashable]] = None, row_positions: Optional[List[int]] = None, col_labels: Optional[List[Hashable]] = None, col_positions: Optional[List[int]] = None) PandasDataframe #
Lazily select columns or rows from given indices.
- Parameters
row_labels (list of hashable, optional) – The row labels to extract.
row_positions (list-like of ints, optional) – The row positions to extract.
col_labels (list of hashable, optional) – The column labels to extract.
col_positions (list-like of ints, optional) – The column positions to extract.
- Returns
A new PandasDataframe from the mask provided.
- Return type
PandasDataframe
Notes
If both row_labels and row_positions are provided, a ValueError is raised. The same rule applies for col_labels and col_positions.
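The argument rule in the note (labels or positions per axis, never both) can be sketched as follows. The helpers are hypothetical, they operate eagerly on a list of rows rather than lazily on partitions, and they assume axis labels are unique.

```python
def resolve_positions(labels, positions, axis_labels, name):
    """Turn labels or positions for one axis into a list of integer positions."""
    if labels is not None and positions is not None:
        raise ValueError(f"Both {name}_labels and {name}_positions were provided")
    if labels is not None:
        lookup = {label: pos for pos, label in enumerate(axis_labels)}
        return [lookup[label] for label in labels]
    if positions is not None:
        return list(positions)
    # Nothing specified for this axis: select everything.
    return list(range(len(axis_labels)))


def take_2d(rows, index, columns, row_labels=None, row_positions=None,
            col_labels=None, col_positions=None):
    row_idx = resolve_positions(row_labels, row_positions, index, "row")
    col_idx = resolve_positions(col_labels, col_positions, columns, "col")
    return [[rows[i][j] for j in col_idx] for i in row_idx]


rows = [[1, 2, 3], [4, 5, 6]]
index = ["a", "b"]
columns = ["x", "y", "z"]
picked = take_2d(rows, index, columns, row_labels=["b"], col_positions=[2, 0])
```

Mixing styles across axes is allowed, as in the call above, which selects rows by label and columns by position.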
- to_labels(column_list: List[Hashable]) PandasDataframe #
Move one or more columns into the row labels. Previous labels are dropped.
- Parameters
column_list (list of hashable) – The list of column names to place as the new row labels.
- Returns
A new PandasDataframe that has the updated labels.
- Return type
PandasDataframe
- to_numpy(**kwargs)#
Convert this Modin DataFrame to a NumPy array.
- Parameters
**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.
- Return type
np.ndarray
- to_pandas()#
Convert this Modin DataFrame to a pandas DataFrame.
- Return type
pandas.DataFrame
- transpose()#
Transpose the index and columns of this Modin DataFrame.
Reflect this Modin DataFrame over its main diagonal by writing rows as columns and vice-versa.
- Returns
New Modin DataFrame.
- Return type
- tree_reduce(axis: Union[int, Axis], map_func: Callable, reduce_func: Optional[Callable] = None, dtypes: Optional[str] = None) PandasDataframe #
Apply function that will reduce the data to a pandas Series.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to perform the tree reduce over.
map_func (callable(row|col) -> row|col) – Callable function to map the dataframe.
reduce_func (callable(row|col) -> single value, optional) – Callable function to reduce the dataframe. If None, then map_func is applied twice.
dtypes (str, optional) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.
- Returns
A new dataframe.
- Return type
PandasDataframe
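The two-phase idea behind a tree reduce, including the "apply map_func twice" fallback, can be sketched with plain lists. Each partition is assumed to be a list of rows, and `tree_reduce_sketch` and `column_sums` are hypothetical names. The map step produces one partial row per partition, and the reduce step collapses the stacked partial rows.

```python
def tree_reduce_sketch(partitions, map_func, reduce_func=None):
    """Map each row-partition to a partial result, then reduce the stacked partials."""
    if reduce_func is None:
        # Mirrors the documented fallback: reuse map_func for the reduce step.
        reduce_func = map_func
    partials = [map_func(partition) for partition in partitions]
    stacked = [row for partial in partials for row in partial]
    return reduce_func(stacked)


def column_sums(rows):
    # Collapse a list of rows into a single row of per-column sums.
    return [[sum(col) for col in zip(*rows)]]


partitions = [
    [[1, 2], [3, 4]],
    [[5, 6]],
]
result = tree_reduce_sketch(partitions, column_sums)
```

A sum works with the fallback because summing partial sums gives the total. A count, for example, would need a separate reduce_func that sums the partial counts.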
- window(axis: Union[int, Axis], reduce_fn: Callable, window_size: int, result_schema: Optional[Dict[Hashable, type]] = None) PandasDataframe #
Apply a sliding window operator that acts as a GROUPBY on each window, and reduces down to a single row (column) per window.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to slide over.
reduce_fn (callable(rowgroup|colgroup) -> row|col) – The reduce function to apply over the data.
window_size (int) – The number of rows/columns to pass to the function (the size of the sliding window).
result_schema (dict, optional) – Mapping from column labels to data types that represents the types of the output dataframe.
- Returns
A new PandasDataframe with the reduce function applied over windows of the specified axis.
- Return type
PandasDataframe
Notes
The user-defined reduce function must reduce each window’s column (row if axis=1) down to a single value.
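A sliding-window reduction can be sketched in a few lines. The version below slides over rows and emits one output row per full window. How incomplete windows at the edges are treated is not stated in this reference, so producing only full windows is an assumption of the sketch, and `sliding_window_reduce` is a hypothetical name.

```python
def sliding_window_reduce(rows, reduce_fn, window_size):
    """Apply reduce_fn to every full window of consecutive rows."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    return [
        reduce_fn(rows[start:start + window_size])
        for start in range(len(rows) - window_size + 1)
    ]


def column_max(window):
    # Reduce each column of the window down to a single value.
    return [max(col) for col in zip(*window)]


rows = [[1, 9], [4, 2], [3, 7], [8, 5]]
windowed = sliding_window_reduce(rows, column_max, window_size=2)
```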