HdkOnNativeDataframe

Public API

class modin.experimental.core.execution.native.implementations.hdk_on_native.dataframe.dataframe.HdkOnNativeDataframe(partitions=None, index=None, columns=None, row_lengths=None, column_widths=None, dtypes=None, op=None, index_cols=None, uses_rowid=False, force_execution_mode=None, has_unsupported_data=False)

Lazy dataframe based on Arrow table representation and embedded HDK storage format.

Currently, a materialized dataframe always has a single partition. This partition can hold either an Arrow table or a pandas dataframe.

Operations on a dataframe are not executed immediately; they build an operations tree instead. When the frame’s data is accessed, this tree is transformed into a query that is executed in the HDK storage format. For simple transformations, the Arrow API can be used instead of the HDK storage format.

Since frames are used as an input for other frames, all operations produce new frames and are not executed in-place.
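
The deferred-execution model described above can be illustrated with a small toy. This is only a sketch under stated assumptions: ToyNode and ToyLazyFrame are hypothetical names invented for this example, they model a single column held in a Python list, and they do not use DFAlgNode, Arrow, or HDK.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToyNode:
    """One node of a toy operations tree (hypothetical, not DFAlgNode)."""

    name: str
    fn: Optional[Callable[[List[int]], List[int]]] = None
    parent: Optional["ToyNode"] = None
    data: Optional[tuple] = None  # set only on the leaf ("materialized") node


class ToyLazyFrame:
    """Toy single-column frame that records operations instead of running them."""

    def __init__(self, op: ToyNode):
        self._op = op

    @classmethod
    def from_list(cls, values):
        # A materialized frame is represented by a single leaf node.
        return cls(ToyNode(name="frame", data=tuple(values)))

    def map(self, name, fn):
        # Every operation returns a NEW frame; the receiver is left unchanged.
        return ToyLazyFrame(ToyNode(name=name, fn=fn, parent=self._op))

    def plan(self):
        # Walk from this node up to the leaf and report the tree, leaf first.
        names, node = [], self._op
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))

    def execute(self):
        # Execution is deferred until the data is actually requested.
        chain, node = [], self._op
        while node.parent is not None:
            chain.append(node)
            node = node.parent
        values = list(node.data)
        for step in reversed(chain):
            values = step.fn(values)
        return values


base = ToyLazyFrame.from_list([1, 2, 3])
doubled = base.map("mul2", lambda xs: [x * 2 for x in xs])
shifted = doubled.map("add1", lambda xs: [x + 1 for x in xs])
```

Here shifted.plan() lists the recorded steps without running them, and only shifted.execute() evaluates the chain. Because base is used as an input for doubled, it is never modified in place.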

Parameters:
  • partitions (np.ndarray, optional) – Partitions of the frame.

  • index (pandas.Index, optional) – Index of the frame to be used as an index cache. If None, it is computed on demand.

  • columns (pandas.Index, optional) – Columns of the frame.

  • row_lengths (np.ndarray, optional) – Partition lengths. Should be None if lengths are unknown.

  • column_widths (np.ndarray, optional) – Partition widths. Should be None if widths are unknown.

  • dtypes (pandas.Index, optional) – Column data types.

  • op (DFAlgNode, optional) – A tree describing how the frame is computed. For materialized frames it is always FrameNode.

  • index_cols (list of str, optional) – A list of columns included in the frame’s index. None means a default index (row id is used as an index).

  • uses_rowid (bool, default: False) – True for frames which require access to the virtual ‘rowid’ column for their execution.

  • force_execution_mode (str or None) – Used by tests to control frame’s execution process.

  • has_unsupported_data (bool) – True for frames holding data not supported by Arrow or HDK storage format.

id

ID of the frame. Used for debug prints only.

Type:

int

_op

A tree to be used to compute the frame. For materialized frames it is always FrameNode.

Type:

DFAlgNode

_partitions

Partitions of the frame. For materialized dataframes it holds a single partition. None for frames requiring execution.

Type:

numpy.ndarray or None

_index_cols

Names of index columns. None for default index. Index columns have mangled names to handle labels which cannot be directly used as an HDK table column name (e.g. non-string labels, SQL keywords etc.).

Type:

list of str or None

_table_cols

A list of all the frame’s columns, including index columns if any. Index columns are always at the head of the list.

Type:

list of str

_index_cache

Materialized index of the frame, or None when the index is not materialized. If a callable of the form callable() -> (pandas.Index, list of row lengths or None) is given, it is evaluated in __init__.

Type:

pandas.Index, callable or None

_has_unsupported_data

True for frames holding data not supported by Arrow or HDK storage format. Operations on such frames are not allowed and should be defaulted to pandas instead.

Type:

bool

_dtypes

Column types.

Type:

pandas.Series

_uses_rowid

True for frames which require access to the virtual ‘rowid’ column for their execution.

Type:

bool

_force_execution_mode

Used by tests to control frame’s execution process. Value “lazy” is used to raise RuntimeError if execution is triggered for the frame. The values “arrow” and “hdk” are used to force the corresponding execution mode.

Type:

str or None
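
The three documented values of _force_execution_mode can be pictured as a small dispatcher. This is a hedged sketch, not Modin's code: choose_execution and its simple_transform flag are hypothetical, and both the fallback heuristic and the ValueError for unknown values are assumptions made for the example.

```python
from typing import Optional


def choose_execution(force_execution_mode: Optional[str], simple_transform: bool) -> str:
    """Pick a toy execution path (hypothetical helper, not part of Modin)."""
    if force_execution_mode == "lazy":
        # Documented behaviour: triggering execution on such a frame is an error.
        raise RuntimeError("unexpected execution triggered on lazy frame")
    if force_execution_mode in ("arrow", "hdk"):
        # Documented behaviour: the corresponding execution mode is forced.
        return force_execution_mode
    if force_execution_mode is not None:
        # Assumption of this sketch: any other value is rejected.
        raise ValueError(f"unknown mode: {force_execution_mode!r}")
    # Assumption: without an override, simple transformations use Arrow.
    return "arrow" if simple_transform else "hdk"
```

A test can therefore pass "lazy" to assert that a sequence of operations never triggers execution.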

agg(agg)

Perform specified aggregation along columns.

Parameters:

agg (str) – Name of the aggregation function to perform.

Returns:

New frame containing the result of aggregation.

Return type:

HdkOnNativeDataframe

astype(col_dtypes, **kwargs)

Cast frame columns to specified types.

Parameters:
  • col_dtypes (dict) – Maps column names to new data types.

  • **kwargs (dict) – Keyword args. Not used.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

bin_op(other, op_name, **kwargs)

Perform binary operation.

An arithmetic binary operation or a comparison operation to perform on columns.

Parameters:
  • other (scalar, list-like, or HdkOnNativeDataframe) – The second operand.

  • op_name (str) – An operation to perform.

  • **kwargs (dict) – Keyword args.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

cat_codes()

Extract codes for a category column.

The frame should have a single data column.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

property columns

Return column labels of the frame.

Return type:

pandas.Index

concat(axis: Union[int, Axis], other_modin_frames: List[HdkOnNativeDataframe], join: Optional[str] = 'outer', sort: Optional[bool] = False, ignore_index: Optional[bool] = False)

Concatenate frames along a particular axis.

Parameters:
  • axis (int or modin.core.dataframe.base.utils.Axis) – The axis to concatenate along.

  • other_modin_frames (list of HdkOnNativeDataframe) – Frames to concat.

  • join ({"outer", "inner"}, default: "outer") – How to handle mismatched indexes on other axis.

  • sort (bool, default: False) – Sort non-concatenation axis if it is not already aligned when join is ‘outer’.

  • ignore_index (bool, default: False) – Ignore index along the concatenation axis.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe
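
As a rough illustration of how join and sort affect the non-concatenation axis, the sketch below resolves column labels for a row-wise concatenation. It is a toy model on plain lists: concat_columns is a hypothetical helper, not the method above, and the sorting rule only mirrors the parameter description.

```python
def concat_columns(frames_columns, join="outer", sort=False):
    """Resolve column labels when stacking frames row-wise (toy model)."""
    if join not in ("outer", "inner"):
        raise ValueError(f"unsupported join: {join!r}")
    first, *rest = frames_columns
    result = list(first)
    for cols in rest:
        if join == "outer":
            # Union, preserving the order of first appearance.
            result += [c for c in cols if c not in result]
        else:
            # Intersection, preserving the order of the first frame.
            result = [c for c in result if c in cols]
    # Sort only when requested, for an outer join, and if labels are not aligned.
    aligned = all(list(cols) == list(first) for cols in rest)
    if sort and join == "outer" and not aligned:
        result = sorted(result)
    return result
```

With columns ["a", "b"] and ["b", "c"], an outer join keeps all three labels while an inner join keeps only "b".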

copy(partitions=_NoDefault.no_default, index=_NoDefault.no_default, columns=_NoDefault.no_default, dtypes=_NoDefault.no_default, op=_NoDefault.no_default, index_cols=_NoDefault.no_default, uses_rowid=_NoDefault.no_default, has_unsupported_data=_NoDefault.no_default)

Copy this DataFrame.

Parameters:
  • partitions (np.ndarray, optional) – Partitions of the frame.

  • index (pandas.Index or list, optional) – Index of the frame to be used as an index cache. If None, it is computed on demand.

  • columns (pandas.Index or list, optional) – Columns of the frame.

  • dtypes (pandas.Index or list, optional) – Column data types.

  • op (DFAlgNode, optional) – A tree describing how the frame is computed. For materialized frames it is always FrameNode.

  • index_cols (list of str, optional) – A list of columns included in the frame’s index. None means a default index (row id is used as an index).

  • uses_rowid (bool, optional) – True for frames which require access to the virtual ‘rowid’ column for their execution.

  • has_unsupported_data (bool, optional) – True for frames holding data not supported by Arrow or HDK storage format.

Returns:

A copy of this DataFrame.

Return type:

HdkOnNativeDataframe

dropna(subset, how='any')

Drop rows with NULLs.

Parameters:
  • subset (list of str) – Columns to check.

  • how ({"any", "all"}, default: "any") – Whether a row is dropped when at least one of the checked columns is NULL ("any") or only when all of them are NULL ("all").

Returns:

The new frame.

Return type:

HdkOnNativeDataframe
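
The difference between how="any" and how="all" can be sketched on plain Python rows. This is an illustrative model only: dropna_rows is a hypothetical helper, rows are dicts, and None stands in for NULL.

```python
def dropna_rows(rows, subset, how="any"):
    """Drop rows whose `subset` columns contain NULLs (None in this toy model)."""
    if how not in ("any", "all"):
        raise ValueError(f"invalid how: {how!r}")
    # "any": one NULL is enough to drop; "all": every checked column must be NULL.
    check = any if how == "any" else all
    return [row for row in rows if not check(row[col] is None for col in subset)]


rows = [
    {"a": 1, "b": None},
    {"a": None, "b": None},
    {"a": 2, "b": 3},
]
```

Only the columns named in subset are inspected, so dropna_rows(rows, ["a"]) keeps the first row even though its "b" value is NULL.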

dt_extract(obj)

Extract a date or a time unit from a datetime value.

Parameters:

obj (str) – Datetime unit to extract.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

property dtypes

Return column data types.

Returns:

A pandas Series containing the data types for this dataframe.

Return type:

pandas.Series

fillna(value=None, method=None, axis=None, limit=None, downcast=None)

Replace NULLs with the specified value.

Parameters:
  • value (dict or scalar, optional) – A value to replace NULLs with. Can be a dictionary to assign different values to columns.

  • method (None, optional) – Should be None.

  • axis ({0, 1}, optional) – Should be 0.

  • limit (None, optional) – Should be None.

  • downcast (None, optional) – Should be None.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe
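
The two accepted forms of value (a scalar, or a dictionary mapping columns to values) can be sketched as follows. fillna_rows is a hypothetical helper on plain dict rows with None standing in for NULL; leaving columns that are absent from the dictionary untouched is an assumption of this sketch.

```python
def fillna_rows(rows, value):
    """Replace None cells; `value` is a scalar or a {column: value} dict."""

    def fill(col, cell):
        if cell is not None:
            return cell
        if isinstance(value, dict):
            # Assumption: columns missing from the dict keep their NULLs.
            return value.get(col)
        return value

    # Build new rows rather than modifying the input in place.
    return [{col: fill(col, cell) for col, cell in row.items()} for row in rows]


data = [{"a": None, "b": 1}, {"a": 5, "b": None}]
```

A scalar fills every NULL with the same value, while a dictionary such as {"a": 0} fills only the listed columns.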

filter(key)

Filter rows by a boolean key column.

Parameters:

key (HdkOnNativeDataframe) – A frame with a single bool data column used as a filter.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

force_import() -> DbTable

Force table import.

Returns:

The imported table.

Return type:

DbTable

classmethod from_arrow(at, index_cols=None, index=None, columns=None, encode_col_names=True)

Build a frame from an Arrow table.

Parameters:
  • at (pyarrow.Table) – Source table.

  • index_cols (list of str, optional) – List of index columns in the source table which are ignored in transformation.

  • index (pandas.Index, optional) – An index to be used by the new frame. Should be present if index_cols is not None.

  • columns (Index or array-like, optional) – Column labels to use for resulting frame.

  • encode_col_names (bool, default: True) – Encode column names.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

classmethod from_dataframe(df: ProtocolDataframe) -> HdkOnNativeDataframe

Convert a DataFrame implementing the dataframe exchange protocol to a Core Modin Dataframe.

See more about the protocol in https://data-apis.org/dataframe-protocol/latest/index.html.

Parameters:

df (ProtocolDataframe) – The DataFrame object supporting the dataframe exchange protocol.

Returns:

A new Core Modin Dataframe object.

Return type:

HdkOnNativeDataframe

classmethod from_pandas(df)

Build a frame from a pandas.DataFrame.

Parameters:

df (pandas.DataFrame) – Source frame.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

get_dtype(col)

Get data type for a column.

Parameters:

col (str) – Column name.

Return type:

dtype

get_index_name()

Get the name of the index column.

Returns None for default index and multi-index.

Return type:

str or None

get_index_names()

Get index column names.

Return type:

list of str

groupby_agg(by, axis, agg, groupby_args, **kwargs)

Groupby with aggregation operation.

Parameters:
  • by (DFAlgQueryCompiler or list-like of str) – Grouping keys.

  • axis ({0, 1}) – Only rows groupby is supported, so should be 0.

  • agg (str or dict) – Aggregates to compute.

  • groupby_args (dict) – Additional groupby args.

  • **kwargs (dict) – Keyword args. Currently ignored.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

has_multiindex()

Check for multi-index usage.

Return True if the frame has a multi-index (index with multiple columns) and False otherwise.

Return type:

bool

id_str()

Return string identifier of the frame.

Used for debug dumps.

Return type:

str

property index

Get the index of the frame in pandas format.

Materializes the frame if required.

Return type:

pandas.Index

insert(loc, column, value)

Insert a constant column.

Parameters:
  • loc (int) – Inserted column location.

  • column (str) – Inserted column name.

  • value (scalar) – Inserted column value.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

invert()

Apply bitwise inverse to each column.

Return type:

HdkOnNativeDataframe

isna(invert)

Detect missing values.

Parameters:

invert (bool) – If True, invert the result, i.e. detect non-missing values instead.

Return type:

HdkOnNativeDataframe

join(other: HdkOnNativeDataframe, how: Optional[Union[str, JoinType]] = JoinType.INNER, left_on: Optional[List[str]] = None, right_on: Optional[List[str]] = None, sort: Optional[bool] = False, suffixes: Optional[Tuple[str]] = ('_x', '_y'))

Join operation.

Parameters:
  • other (HdkOnNativeDataframe) – A frame to join with.

  • how (str or modin.core.dataframe.base.utils.JoinType, default: JoinType.INNER) – A type of join.

  • left_on (list of str, optional) – A list of columns for the left frame to join on.

  • right_on (list of str, optional) – A list of columns for the right frame to join on.

  • sort (bool, default: False) – Sort the result by join keys.

  • suffixes (list-like of str, default: ("_x", "_y")) – A length-2 sequence of suffixes to add to overlapping column names of left and right operands respectively.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe
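
The effect of suffixes on overlapping column names can be sketched on plain label lists. This is a simplified toy: joined_columns is a hypothetical helper, and it assumes the key columns carry the same names on both sides (a single on list rather than separate left_on and right_on), so each key appears once in the result.

```python
def joined_columns(left_cols, right_cols, on, suffixes=("_x", "_y")):
    """Compute result labels for a join (toy model with shared key names)."""
    left_suffix, right_suffix = suffixes
    # Non-key labels present in both frames need disambiguation.
    overlap = (set(left_cols) & set(right_cols)) - set(on)
    left_out = [c + left_suffix if c in overlap else c for c in left_cols]
    # Key columns are emitted once, from the left frame.
    right_out = [c + right_suffix if c in overlap else c for c in right_cols if c not in on]
    return left_out + right_out
```

For example, joining frames with columns ["id", "v"] and ["id", "v", "w"] on "id" would yield "v_x" and "v_y" for the clashing "v" column under the default suffixes.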

ref(col)

Return an expression referencing a frame’s column.

Parameters:

col (str) – Column name.

Return type:

InputRefExpr

reset_index(drop)

Set the default index for the frame.

Parameters:

drop (bool) – If True then drop current index columns, otherwise make them data columns.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

set_index_name(name)

Set new name for the index column.

Shouldn’t be called for frames with multi-index.

Parameters:

name (str or None) – New index name.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

set_index_names(names)

Set index labels for frames with multi-index.

Parameters:

names (list of str) – New index labels.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

sort_rows(columns, ascending, ignore_index, na_position)

Sort rows of the frame.

Parameters:
  • columns (str or list of str) – Sorting keys.

  • ascending (bool or list of bool) – Sort order.

  • ignore_index (bool) – Drop index columns.

  • na_position ({"first", "last"}) – NULLs position.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

take_2d_labels_or_positional(row_labels: Optional[List[Hashable]] = None, row_positions: Optional[List[int]] = None, col_labels: Optional[List[Hashable]] = None, col_positions: Optional[List[int]] = None) -> HdkOnNativeDataframe

Mask rows and columns in the dataframe.

Allow users to perform selection and projection on the row and column labels (named notation), in addition to the row and column number (positional notation).

Parameters:
  • row_labels (list of hashable, optional) – The row labels to extract.

  • row_positions (list of int, optional) – The row positions to extract.

  • col_labels (list of hashable, optional) – The column labels to extract.

  • col_positions (list of int, optional) – The column positions to extract.

Returns:

The new frame.

Return type:

HdkOnNativeDataframe

Notes

If both row_labels and row_positions are provided, a ValueError is raised. The same rule applies for col_labels and col_positions.
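
The mutual-exclusion rule in the note above can be sketched as a per-axis argument check. resolve_selection is a hypothetical helper, not the method's implementation; treating "neither argument given" as "keep the whole axis" is an assumption of this sketch.

```python
def resolve_selection(labels, positions, axis_labels, axis_name):
    """Turn a label-based or positional request into a list of positions."""
    if labels is not None and positions is not None:
        # Documented rule: labels and positions must not be combined.
        raise ValueError(
            f"Both {axis_name}_labels and {axis_name}_positions were provided"
        )
    if labels is not None:
        # Named notation: look each label up on the axis.
        return [axis_labels.index(label) for label in labels]
    if positions is not None:
        # Positional notation: use the numbers as given.
        return list(positions)
    # Assumption: nothing requested on this axis means the whole axis is kept.
    return list(range(len(axis_labels)))


columns = ["a", "b", "c"]
```

The same check would be applied once for rows and once for columns, so a call may mix row positions with column labels without raising.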

to_pandas()

Transform the frame to pandas format.

Return type:

pandas.DataFrame