HdkOnNativeDataframe#
Public API#
- class modin.experimental.core.execution.native.implementations.hdk_on_native.dataframe.dataframe.HdkOnNativeDataframe(partitions=None, index=None, columns=None, row_lengths=None, column_widths=None, dtypes=None, op=None, index_cols=None, uses_rowid=False, force_execution_mode=None, has_unsupported_data=False)#
Lazy dataframe based on Arrow table representation and embedded HDK storage format.
Currently, a materialized dataframe always has a single partition. This partition can hold either an Arrow table or a pandas dataframe.
Operations on a dataframe are not executed immediately; they build an operations tree instead. When the frame’s data is accessed, this tree is transformed into a query that is executed in the HDK storage format. For simple transformations, the Arrow API can be used instead of the HDK storage format.
Since frames are used as an input for other frames, all operations produce new frames and are not executed in-place.
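The lazy behaviour described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Modin's implementation: `ToyFrame`, `ToyFrameNode`, and `ToyMapNode` are hypothetical names, and "execution" here is an ordinary Python loop rather than an HDK query.

```python
class ToyFrameNode:
    """Leaf node: wraps already materialized data (a list of numbers)."""

    def __init__(self, data):
        self.data = list(data)

    def execute(self):
        return list(self.data)


class ToyMapNode:
    """Operation node: applies `func` to every value produced by `child`."""

    def __init__(self, child, func):
        self.child = child
        self.func = func

    def execute(self):
        return [self.func(v) for v in self.child.execute()]


class ToyFrame:
    """Hypothetical lazy frame: operations only extend the tree."""

    executions = 0  # counts how many times a tree was actually executed

    def __init__(self, op):
        self._op = op

    def map(self, func):
        # No computation happens here: a NEW frame with a larger tree is returned.
        return ToyFrame(ToyMapNode(self._op, func))

    def to_list(self):
        # Accessing the data triggers execution of the whole tree.
        ToyFrame.executions += 1
        return self._op.execute()


base = ToyFrame(ToyFrameNode([1, 2, 3]))
derived = base.map(lambda v: v + 1).map(lambda v: v * 10)
executions_before_access = ToyFrame.executions  # still zero at this point
result = derived.to_list()
```

Because every operation returns a new `ToyFrame`, `base` can still be used as the input of other frames after `derived` has been built.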
- Parameters
partitions (np.ndarray, optional) – Partitions of the frame.
index (pandas.Index, optional) – Index of the frame to be used as an index cache. If None then will be computed on demand.
columns (pandas.Index, optional) – Columns of the frame.
row_lengths (np.ndarray, optional) – Partition lengths. Should be None if lengths are unknown.
column_widths (np.ndarray, optional) – Partition widths. Should be None if widths are unknown.
dtypes (pandas.Index, optional) – Column data types.
op (DFAlgNode, optional) – A tree describing how the frame is computed. For materialized frames it is always FrameNode.
index_cols (list of str, optional) – A list of columns included into the frame’s index. None value means a default index (row id is used as an index).
uses_rowid (bool, default: False) – True for frames which require access to the virtual ‘rowid’ column for their execution.
force_execution_mode (str or None) – Used by tests to control frame’s execution process.
has_unsupported_data (bool) – True for frames holding data not supported by Arrow or HDK storage format.
- id#
ID of the frame. Used for debug prints only.
- Type
int
- _op#
A tree to be used to compute the frame. For materialized frames it is always FrameNode.
- Type
DFAlgNode
- _partitions#
Partitions of the frame. For materialized dataframes it holds a single partition. None for frames requiring execution.
- Type
numpy.ndarray or None
- _index_cols#
Names of index columns. None for default index. Index columns have mangled names to handle labels which cannot be directly used as an HDK table column name (e.g. non-string labels, SQL keywords etc.).
- Type
list of str or None
- _table_cols#
A list of all frame’s columns. It includes index columns if any. Index columns are always in the head of the list.
- Type
list of str
- _index_cache#
Materialized index of the frame or None when the index is not materialized. If it is of the callable() -> (pandas.Index, list of row lengths or None) type, then the calculation will be done in __init__.
- Type
pandas.Index, callable or None
- _has_unsupported_data#
True for frames holding data not supported by Arrow or HDK storage format. Operations on such frames are not allowed and should be defaulted to pandas instead.
- Type
bool
- _dtypes#
Column types.
- Type
pandas.Series
- _uses_rowid#
True for frames which require access to the virtual ‘rowid’ column for their execution.
- Type
bool
- _force_execution_mode#
Used by tests to control frame’s execution process. Value “lazy” is used to raise RuntimeError if execution is triggered for the frame. Value “arrow” is used to raise RuntimeError if execution is triggered and cannot be done using Arrow API (i.e. HDK would have to be used for execution).
- Type
str or None
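The two test modes can be pictured as a guard that runs before a frame is executed. The following is only a sketch of that rule, not Modin's code: `choose_engine` and its `arrow_can_execute` flag are hypothetical stand-ins for the real decision logic.

```python
def choose_engine(force_execution_mode, arrow_can_execute):
    """Return the engine a toy frame would use, or raise RuntimeError.

    Hypothetical helper: `arrow_can_execute` stands in for the check of
    whether the operations tree is simple enough for the Arrow API.
    """
    if force_execution_mode == "lazy":
        # Any execution at all is an error in "lazy" mode.
        raise RuntimeError("execution triggered for a frame in 'lazy' mode")
    if force_execution_mode == "arrow" and not arrow_can_execute:
        # Falling back to HDK is an error in "arrow" mode.
        raise RuntimeError("frame in 'arrow' mode requires HDK execution")
    return "arrow" if arrow_can_execute else "hdk"


def raises_runtime_error(mode, arrow_can_execute):
    """Small helper used to observe the failure cases."""
    try:
        choose_engine(mode, arrow_can_execute)
    except RuntimeError:
        return True
    return False
```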
- agg(agg)#
Perform specified aggregation along columns.
- Parameters
agg (str) – Name of the aggregation function to perform.
- Returns
New frame containing the result of aggregation.
- Return type
HdkOnNativeDataframe
- astype(col_dtypes, **kwargs)#
Cast frame columns to specified types.
- Parameters
col_dtypes (dict) – Maps column names to new data types.
**kwargs (dict) – Keyword args. Not used.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- bin_op(other, op_name, **kwargs)#
Perform binary operation.
An arithmetic binary operation or a comparison operation to perform on columns.
- Parameters
other (scalar, list-like, or HdkOnNativeDataframe) – The second operand.
op_name (str) – An operation to perform.
**kwargs (dict) – Keyword args.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- cat_codes()#
Extract codes for a category column.
The frame should have a single data column.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- property columns#
Return column labels of the frame.
- Return type
pandas.Index
- concat(axis: Union[int, Axis], other_modin_frames: List[HdkOnNativeDataframe], join: Optional[str] = 'outer', sort: Optional[bool] = False, ignore_index: Optional[bool] = False)#
Concatenate frames along a particular axis.
- Parameters
axis (int or modin.core.dataframe.base.utils.Axis) – The axis to concatenate along.
other_modin_frames (list of HdkOnNativeDataframe) – Frames to concat.
join ({"outer", "inner"}, default: "outer") – How to handle mismatched indexes on other axis.
sort (bool, default: False) – Sort non-concatenation axis if it is not already aligned when join is ‘outer’.
ignore_index (bool, default: False) – Ignore index along the concatenation axis.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
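The `join`, `sort` and `ignore_index` parameters carry the same names as those of `pandas.concat`. Assuming they follow the same semantics (an assumption, since this page does not spell them out), the effect on the non-concatenation axis can be seen with plain pandas:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"b": [5], "c": [6]})

# "outer" keeps the union of the column labels; missing cells become NULL.
outer = pd.concat([left, right], axis=0, join="outer", sort=False, ignore_index=True)

# "inner" keeps only the column labels present in every frame.
inner = pd.concat([left, right], axis=0, join="inner", ignore_index=True)

outer_columns = list(outer.columns)
inner_columns = list(inner.columns)
```

With `ignore_index=True` the result gets a fresh default index along the concatenation axis instead of the stacked original labels.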
- copy(partitions=_NoDefault.no_default, index=_NoDefault.no_default, columns=_NoDefault.no_default, dtypes=_NoDefault.no_default, op=_NoDefault.no_default, index_cols=_NoDefault.no_default)#
Copy this DataFrame.
- Parameters
partitions (np.ndarray, optional) – Partitions of the frame.
index (pandas.Index, optional) – Index of the frame to be used as an index cache. If None then will be computed on demand.
columns (pandas.Index, optional) – Columns of the frame.
dtypes (pandas.Index, optional) – Column data types.
op (DFAlgNode, optional) – A tree describing how frame is computed. For materialized frames it is always
FrameNode
.index_cols (list of str, optional) – A list of columns included into the frame’s index. None value means a default index (row id is used as an index).
- Returns
A copy of this DataFrame.
- Return type
HdkOnNativeDataframe
- dropna(subset, how='any')#
Drop rows with NULLs.
- Parameters
subset (list of str) – Columns to check.
how ({"any", "all"}, default: "any") – Whether a row is removed when at least one of the checked columns is NULL ("any") or only when all of them are NULL ("all").
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
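The difference between the two `how` values is easiest to see on a small example. The sketch below uses `pandas.DataFrame.dropna`, whose `subset` and `how` parameters have the same names; that this method mirrors the pandas behaviour is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, None, None],
        "b": [1.0, 2.0, None],
        "c": [None, 8.0, 9.0],  # not in `subset`, so its NULL is ignored
    }
)

# "any": drop a row if at least one checked column is NULL.
kept_any = df.dropna(subset=["a", "b"], how="any")

# "all": drop a row only if every checked column is NULL.
kept_all = df.dropna(subset=["a", "b"], how="all")

rows_kept_any = list(kept_any.index)
rows_kept_all = list(kept_all.index)
```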
- dt_extract(obj)#
Extract a date or a time unit from a datetime value.
- Parameters
obj (str) – Datetime unit to extract.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- property dtypes#
Return column data types.
- Returns
A pandas Series containing the data types for this dataframe.
- Return type
pandas.Series
- fillna(value=None, method=None, axis=None, limit=None, downcast=None)#
Replace NULLs with the given value.
- Parameters
value (dict or scalar, optional) – A value to replace NULLs with. Can be a dictionary to assign different values to columns.
method (None, optional) – Should be None.
axis ({0, 1}, optional) – Should be 0.
limit (None, optional) – Should be None.
downcast (None, optional) – Should be None.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
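The two accepted shapes of `value` (a scalar, or a dictionary keyed by column) can be sketched in plain Python. `fill_nulls` is a hypothetical helper that models a frame as a dict of lists with `None` standing for NULL; it is not part of Modin.

```python
def fill_nulls(columns, value):
    """Return a copy of `columns` with None replaced according to `value`.

    columns: dict mapping column name -> list of values (None means NULL).
    value:   a scalar applied to every column, or a dict mapping column
             names to per-column replacements.
    """
    filled = {}
    for name, values in columns.items():
        if isinstance(value, dict):
            if name not in value:
                # Columns missing from the dict are left untouched.
                filled[name] = list(values)
                continue
            replacement = value[name]
        else:
            replacement = value
        filled[name] = [replacement if v is None else v for v in values]
    return filled


frame = {"a": [1, None], "b": [None, "x"]}
scalar_filled = fill_nulls(frame, 0)
dict_filled = fill_nulls(frame, {"b": "missing"})
```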
- filter(key)#
Filter rows by a boolean key column.
- Parameters
key (HdkOnNativeDataframe) – A frame with a single bool data column used as a filter.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- classmethod from_arrow(at, index_cols=None, index=None, columns=None, encode_col_names=True)#
Build a frame from an Arrow table.
- Parameters
at (pyarrow.Table) – Source table.
index_cols (list of str, optional) – List of index columns in the source table which are ignored in transformation.
index (pandas.Index, optional) – An index to be used by the new frame. Should be present if index_cols is not None.
columns (Index or array-like, optional) – Column labels to use for resulting frame.
encode_col_names (bool, default: True) – Encode column names.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- classmethod from_dataframe(df: ProtocolDataframe) -> HdkOnNativeDataframe#
Convert a DataFrame implementing the dataframe exchange protocol to a Core Modin Dataframe.
See more about the protocol in https://data-apis.org/dataframe-protocol/latest/index.html.
- Parameters
df (ProtocolDataframe) – The DataFrame object supporting the dataframe exchange protocol.
- Returns
A new Core Modin Dataframe object.
- Return type
HdkOnNativeDataframe
- classmethod from_pandas(df)#
Build a frame from a pandas.DataFrame.
- Parameters
df (pandas.DataFrame) – Source frame.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- get_dtype(col)#
Get data type for a column.
- Parameters
col (str) – Column name.
- Return type
dtype
- get_index_name()#
Get the name of the index column.
Returns None for default index and multi-index.
- Return type
str or None
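The relationship between the list of index columns, `get_index_name()` and `has_multiindex()` (documented further down this page) can be summarised in a few lines. This is a simplified sketch with hypothetical free functions; it ignores the name mangling applied to index columns.

```python
def get_index_name(index_cols):
    """Name of the single index column, or None.

    None is returned both for a default index (index_cols is None)
    and for a multi-index (more than one index column).
    """
    if index_cols is None or len(index_cols) != 1:
        return None
    return index_cols[0]


def has_multiindex(index_cols):
    """True when the index consists of more than one column."""
    return index_cols is not None and len(index_cols) > 1
```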
- get_index_names()#
Get index column names.
- Return type
list of str
- groupby_agg(by, axis, agg, groupby_args, **kwargs)#
Groupby with aggregation operation.
- Parameters
by (DFAlgQueryCompiler or list-like of str) – Grouping keys.
axis ({0, 1}) – Only row-wise groupby is supported, so this should be 0.
agg (str or dict) – Aggregates to compute.
groupby_args (dict) – Additional groupby args.
**kwargs (dict) – Keyword args. Currently ignored.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- has_multiindex()#
Check for multi-index usage.
Return True if the frame has a multi-index (index with multiple columns) and False otherwise.
- Return type
bool
- id_str()#
Return string identifier of the frame.
Used for debug dumps.
- Return type
str
- property index#
Get the index of the frame in pandas format.
Materializes the frame if required.
- Return type
pandas.Index
- insert(loc, column, value)#
Insert a constant column.
- Parameters
loc (int) – Inserted column location.
column (str) – Inserted column name.
value (scalar) – Inserted column value.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- join(other: HdkOnNativeDataframe, how: Optional[Union[str, JoinType]] = JoinType.INNER, left_on: Optional[List[str]] = None, right_on: Optional[List[str]] = None, sort: Optional[bool] = False, suffixes: Optional[Tuple[str]] = ('_x', '_y'))#
Join operation.
- Parameters
other (HdkOnNativeDataframe) – A frame to join with.
how (str or modin.core.dataframe.base.utils.JoinType, default: JoinType.INNER) – A type of join.
left_on (list of str, optional) – A list of columns for the left frame to join on.
right_on (list of str, optional) – A list of columns for the right frame to join on.
sort (bool, default: False) – Sort the result by join keys.
suffixes (list-like of str, default: ("_x", "_y")) – A length-2 sequence of suffixes to add to overlapping column names of left and right operands respectively.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
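How `suffixes` resolves overlapping column names can be sketched as follows. `joined_columns` is a hypothetical helper that follows the convention of `pandas.merge` for the case where both frames join on the same key names; whether this method orders the result columns identically is an assumption.

```python
def joined_columns(left_cols, right_cols, on, suffixes=("_x", "_y")):
    """Column labels of a join result, pandas.merge style.

    Key columns listed in `on` appear once. Any other label present in
    both frames gets the left or right suffix appended.
    """
    left_suffix, right_suffix = suffixes
    overlap = (set(left_cols) & set(right_cols)) - set(on)

    result = [c + left_suffix if c in overlap else c for c in left_cols]
    for c in right_cols:
        if c in on:
            continue  # shared key columns are not repeated
        result.append(c + right_suffix if c in overlap else c)
    return result


columns = joined_columns(["key", "v", "w"], ["key", "v", "z"], on=["key"])
```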
- ref(col)#
Return an expression referencing a frame’s column.
- Parameters
col (str) – Column name.
- Return type
- reset_index(drop)#
Set the default index for the frame.
- Parameters
drop (bool) – If True then drop current index columns, otherwise make them data columns.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
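The effect of `drop` on the frame's columns can be modelled with two lists. `reset_index_columns` is a hypothetical helper; placing former index columns before the data columns is an assumption borrowed from pandas.

```python
def reset_index_columns(index_cols, data_cols, drop):
    """Return (new_index_cols, new_data_cols) after resetting the index.

    The new index is always the default one, represented here by None.
    """
    if index_cols is None or drop:
        # Nothing to keep: either there was no explicit index or it is dropped.
        return None, list(data_cols)
    # Former index columns become ordinary data columns.
    return None, list(index_cols) + list(data_cols)
```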
- set_index_name(name)#
Set new name for the index column.
Shouldn’t be called for frames with multi-index.
- Parameters
name (str or None) – New index name.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- set_index_names(names)#
Set index labels for frames with multi-index.
- Parameters
names (list of str) – New index labels.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
- sort_rows(columns, ascending, ignore_index, na_position)#
Sort rows of the frame.
- Parameters
columns (str or list of str) – Sorting keys.
ascending (bool or list of bool) – Sort order.
ignore_index (bool) – Drop index columns.
na_position ({"first", "last"}) – NULLs position.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
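The interaction between `ascending` and `na_position` can be shown on a single column. `sort_values` below is a hypothetical plain-Python helper with `None` standing for NULL; it assumes, as in pandas, that NULL placement does not depend on the sort direction.

```python
def sort_values(values, ascending=True, na_position="last"):
    """Sort a list, placing None (NULL) entries first or last."""
    nulls = [v for v in values if v is None]
    present = sorted((v for v in values if v is not None), reverse=not ascending)
    if na_position == "first":
        return nulls + present
    return present + nulls
```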
- take_2d_labels_or_positional(row_labels: Optional[List[Hashable]] = None, row_positions: Optional[List[int]] = None, col_labels: Optional[List[Hashable]] = None, col_positions: Optional[List[int]] = None) -> HdkOnNativeDataframe#
Mask rows and columns in the dataframe.
Allow users to perform selection and projection on the row and column labels (named notation), in addition to the row and column number (positional notation).
- Parameters
row_labels (list of hashable, optional) – The row labels to extract.
row_positions (list of int, optional) – The row positions to extract.
col_labels (list of hashable, optional) – The column labels to extract.
col_positions (list of int, optional) – The column positions to extract.
- Returns
The new frame.
- Return type
HdkOnNativeDataframe
Notes
If both row_labels and row_positions are provided, a ValueError is raised. The same rule applies for col_labels and col_positions.
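The rule in the note above amounts to a per-axis check that at most one kind of selector is given. A minimal sketch follows; `pick_selector` is a hypothetical helper, and treating "neither given" as "keep the whole axis" is an assumption.

```python
def pick_selector(labels=None, positions=None, axis_name="row"):
    """Validate the selectors for one axis and return the one to use."""
    if labels is not None and positions is not None:
        raise ValueError(
            f"Both {axis_name}_labels and {axis_name}_positions were provided"
        )
    if labels is not None:
        return ("labels", list(labels))
    if positions is not None:
        return ("positions", list(positions))
    return None  # assumption: nothing requested means the whole axis is kept


by_label = pick_selector(labels=["a", "b"])
by_position = pick_selector(positions=[0, 2], axis_name="col")

try:
    pick_selector(labels=["a"], positions=[0])
    conflict_detected = False
except ValueError:
    conflict_detected = True
```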
- to_pandas()#
Transform the frame to pandas format.
- Return type
pandas.DataFrame