PandasDataframe

PandasDataframe is a direct descendant of ModinDataframe. It implements the abstract interfaces for use with all pandas-based storage formats. Specific implementations can inherit from PandasDataframe and extend it when they need special handling of some behavior or better performance on a particular execution engine.

The class serves as the intermediate layer between the pandas query compiler and the conforming partition manager. All queries formed at the query compiler layer are ingested by this class and then passed, together with the stored partitions, to the partition manager for processing. The class must not manipulate partitions directly, except in operations that are strictly private or protected and are called only from inside the class. The class provides a significantly reduced set of operations that covers a large number of pandas operations.

The main tasks of PandasDataframe are storing partitions, manipulating axis labels, and providing a set of methods to perform operations on the internal data.

As mentioned above, PandasDataframe shouldn’t work with stored partitions directly; the responsibility for modifying the partitions array lies with PandasDataframePartitionManager. For example, the broadcast_apply_full_axis() method delegates applying the function to the broadcast_axis_partitions() method.
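The delegation described above can be illustrated with a minimal, pure-Python sketch. ToyFrame and ToyPartitionManager are hypothetical stand-ins invented for this example; they are not Modin classes, and the blocks are plain lists rather than real partitions. The sketch only shows the shape of the pattern: the frame hands its partition grid to the manager instead of looping over blocks itself.

```python
class ToyPartitionManager:
    """Hypothetical stand-in for a partition manager (not Modin's class)."""

    @classmethod
    def map_partitions(cls, partitions, func):
        # Apply func to every block and return a new 2D grid of blocks.
        return [[func(block) for block in row] for row in partitions]


class ToyFrame:
    """Hypothetical stand-in for a frame that owns a 2D grid of blocks."""

    # The frame only knows which manager class to delegate to.
    _partition_mgr_cls = ToyPartitionManager

    def __init__(self, partitions):
        self._partitions = partitions

    def map(self, func):
        # No direct block manipulation here: the grid is handed to the manager.
        new_partitions = self._partition_mgr_cls.map_partitions(
            self._partitions, func
        )
        return ToyFrame(new_partitions)


# A 2x2 grid of blocks, each block being a plain list of values.
frame = ToyFrame([[[1, 2], [3]], [[4, 5], [6]]])
doubled = frame.map(lambda block: [2 * value for value in block])
```

Because the frame touches blocks only through `_partition_mgr_cls`, a subclass in this sketch could swap in a different manager without changing `map()` itself.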

A Modin PandasDataframe can be created from a pandas.DataFrame or a pyarrow.Table (using the from_pandas() and from_arrow() methods, respectively). A PandasDataframe can also be converted to an np.array or a pandas.DataFrame (using the to_numpy() and to_pandas() methods, respectively).

Axis labels are manipulated through internal methods that replace labels with new ones, add prefixes or suffixes, and so on.
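As a rough illustration of label manipulation, the sketch below adds a prefix to the labels of one axis and returns a new object instead of mutating the original. ToyLabels is a hypothetical class invented for this example, and the mapping of axis 0 to row labels and axis 1 to column labels is an assumption borrowed from the usual pandas convention.

```python
class ToyLabels:
    """Hypothetical holder of row and column labels (not a Modin class)."""

    def __init__(self, index, columns):
        self.index = list(index)
        self.columns = list(columns)

    def add_prefix(self, prefix, axis):
        # Assumption: axis 0 means row labels, axis 1 means column labels.
        if axis not in (0, 1):
            raise ValueError("axis must be 0 or 1")
        if axis == 0:
            new_index = [str(prefix) + str(label) for label in self.index]
            return ToyLabels(new_index, self.columns)
        new_columns = [str(prefix) + str(label) for label in self.columns]
        return ToyLabels(self.index, new_columns)


labels = ToyLabels(index=[0, 1, 2], columns=["a", "b"])
by_columns = labels.add_prefix("col_", axis=1)
by_rows = labels.add_prefix("row_", axis=0)
```

Only the labels change in this sketch; no data would need to be touched for such an operation.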

Public API

class modin.core.dataframe.pandas.dataframe.dataframe.PandasDataframe(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)

An abstract class that represents the parent class for any pandas storage format dataframe class.

This class provides interfaces to run operations on dataframe partitions.

Parameters

partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a pandas.Index.
columns (sequence) – The columns object for the dataframe. Converted to a pandas.Index.
row_lengths (list, optional) – The length of each partition along the rows, i.e. the “height” of each block partition. Computed if not provided.
column_widths (list, optional) – The width of each partition along the columns, i.e. the “width” of each block partition. Computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
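To make the relationship between the partition grid, row_lengths, and column_widths concrete, here is a small sketch under simplifying assumptions: each block is represented only by its (n_rows, n_cols) shape, and the grid is regular, so every block in a grid row has the same height and every block in a grid column has the same width. The helper compute_axis_lengths is hypothetical and is not part of Modin.

```python
def compute_axis_lengths(block_shapes):
    """Derive row lengths and column widths from a 2D grid of block shapes.

    block_shapes is a 2D list of (n_rows, n_cols) tuples, one per block.
    """
    if not block_shapes or not block_shapes[0]:
        return [], []
    # The height of each grid row is read from the first block in that row.
    row_lengths = [row[0][0] for row in block_shapes]
    # The width of each grid column is read from the blocks in the first row.
    column_widths = [shape[1] for shape in block_shapes[0]]
    return row_lengths, column_widths


# A 2x2 grid: the top blocks have 3 rows and the bottom blocks have 2 rows;
# the left blocks have 2 columns and the right blocks have 1 column.
shapes = [[(3, 2), (3, 1)], [(2, 2), (2, 1)]]
row_lengths, column_widths = compute_axis_lengths(shapes)
total_rows = sum(row_lengths)
total_columns = sum(column_widths)
```

In this sketch, the sums of the two lists give the overall shape of the frame, which is why they must agree with the lengths of index and columns.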

add_prefix(prefix, axis)

Add a prefix to the current row or column labels.

Parameters

prefix (str) – The prefix to add.
axis (int) – The axis to update.

Returns

A new dataframe with the updated labels.

Return type