PandasOnUnidist Execution#
Queries that perform data transformation, data ingress or data egress using the pandas on Unidist execution pass through the Modin components detailed below.
To enable pandas on Unidist execution, please refer to the usage section in pandas on Unidist.
Data Transformation#
When a user calls any DataFrame API, a query starts forming at the API layer to be executed at the Execution layer. The API layer is responsible for processing the query appropriately, for example, determining whether the final result should be a DataFrame or Series object. This layer is also responsible for sanitizing the input to the PandasQueryCompiler, e.g. validating a parameter from the query and defining specific intermediate values to provide more context to the query compiler.
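The two API-layer duties described above (sanitizing input and deciding the shape of the result) can be illustrated with a small standalone sketch. It does not use Modin; `sanitize_axis`, `result_kind`, `build_query` and the `REDUCTIONS` set are hypothetical names chosen for illustration only.

```python
from typing import Any

# Hypothetical lookup tables for this sketch; not part of Modin.
VALID_AXES = {0: 0, "index": 0, "rows": 0, 1: 1, "columns": 1}
REDUCTIONS = {"sum", "mean", "count"}


def sanitize_axis(axis: Any) -> int:
    """Validate a user-supplied axis and normalize it to 0 or 1."""
    if axis not in VALID_AXES:
        raise ValueError(f"No axis named {axis!r}")
    return VALID_AXES[axis]


def result_kind(op: str) -> str:
    """A reduction collapses one axis, so its result is one-dimensional."""
    return "Series" if op in REDUCTIONS else "DataFrame"


def build_query(op: str, axis: Any = 0) -> dict:
    """Bundle the sanitized parameters that a query compiler would receive."""
    return {"op": op, "axis": sanitize_axis(axis), "result": result_kind(op)}


query = build_query("sum", axis="columns")
```

Here `query` carries a normalized axis and the expected result type, which is the kind of extra context the API layer hands to the query compiler.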
The PandasQueryCompiler is responsible for processing the query received from the DataFrame API layer to determine how to apply it to a subset of the data, either cell-wise or along an axis-wise partition backed by the pandas storage format. The PandasQueryCompiler maps the query to one of the Core Algebra Operators of the PandasOnUnidistDataframe, which inherits generic functionality from the GenericUnidistDataframe and the PandasDataframe.
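The difference between a cell-wise and an axis-wise application can be sketched with plain Python lists standing in for partitions. This is a conceptual illustration only: `map_cells`, `reduce_rows`, `compile_query` and the `OPERATORS` table are hypothetical and do not mirror the real operator signatures.

```python
from typing import Callable, List

Block = List[List[float]]   # a small 2D chunk of cells
Grid = List[List[Block]]    # blocks addressed as grid[row_part][col_part]


def map_cells(grid: Grid, func: Callable[[float], float]) -> Grid:
    """Cell-wise operator: every block can be processed independently."""
    return [
        [[[func(v) for v in row] for row in block] for block in row_part]
        for row_part in grid
    ]


def join_row_partition(blocks: List[Block]) -> Block:
    """Concatenate the blocks of one row partition side by side."""
    return [sum((b[i] for b in blocks), []) for i in range(len(blocks[0]))]


def reduce_rows(grid: Grid, func: Callable[[List[float]], float]) -> List[float]:
    """Axis-wise operator: func needs whole rows, so blocks are joined first."""
    return [func(row) for blocks in grid for row in join_row_partition(blocks)]


# Hypothetical mapping from a query name to an operator kind and a function.
OPERATORS = {"abs": ("map", abs), "row_sum": ("reduce", sum)}


def compile_query(grid: Grid, op: str):
    """Pick the operator that matches the query, as a query compiler would."""
    kind, func = OPERATORS[op]
    return map_cells(grid, func) if kind == "map" else reduce_rows(grid, func)


# Two row partitions, each split into two column blocks of one row.
grid = [[[[1, -2]], [[3]]], [[[-4, 5]], [[-6]]]]
absolute = compile_query(grid, "abs")
row_sums = compile_query(grid, "row_sum")
```

The cell-wise path never needs to look outside a block, while the axis-wise path must first assemble complete rows from several blocks.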
PandasOnUnidist Dataframe implementation#
Modin implements Dataframe, PartitionManager, VirtualPartition (a specific kind of AxisPartition with the capability to combine smaller partitions into a single "virtual" one) and Partition classes specifically for the PandasOnUnidist execution.
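As a rough illustration of what combining smaller partitions into a "virtual" one means, the sketch below treats several row-stacked blocks as a single unit. `ToyPartition` and `ToyVirtualPartition` are hypothetical stand-ins, not the Modin classes, and the real implementation works with remote objects rather than in-memory lists.

```python
from typing import Callable, Iterable, List

Rows = List[List[int]]


class ToyPartition:
    """Hypothetical block partition that holds a list of rows."""

    def __init__(self, rows: Rows) -> None:
        self.rows = rows

    def apply(self, func: Callable[[Rows], Rows]) -> "ToyPartition":
        return ToyPartition(func(self.rows))


class ToyVirtualPartition:
    """Presents several row-stacked block partitions as one partition."""

    def __init__(self, parts: Iterable[ToyPartition]) -> None:
        self.parts = list(parts)

    def materialize(self) -> ToyPartition:
        # Combine the smaller partitions only when the full data is needed.
        combined = [row for part in self.parts for row in part.rows]
        return ToyPartition(combined)

    def apply(self, func: Callable[[Rows], Rows]) -> ToyPartition:
        # func sees all rows at once, as if they had never been split.
        return self.materialize().apply(func)


virtual = ToyVirtualPartition([ToyPartition([[1], [2]]), ToyPartition([[3]])])
reversed_rows = virtual.apply(lambda rows: rows[::-1]).rows
```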
Data Ingress#
Data Egress#
When a user calls any IO function from the modin.pandas.io module, the API layer queries the FactoryDispatcher, which defines a factory specific to the execution, namely the PandasOnUnidistFactory. The factory, in turn, exposes the PandasOnUnidistIO class, whose responsibility is to perform a parallel read/write from/to a file.
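The dispatcher, factory and IO class chain described above follows a common lookup pattern, sketched below with toy classes. All names prefixed with `Toy` are hypothetical, and the registry keyed by storage format and engine is an assumption made for this sketch rather than a description of Modin's internals.

```python
class ToyIO:
    """Stand-in for an execution-specific IO class."""

    @classmethod
    def read_csv(cls, path: str) -> str:
        return f"{cls.__name__} reads {path}"


class ToyUnidistIO(ToyIO):
    """IO class that would read and write in parallel on one execution."""


class ToyFactory:
    io_cls = ToyIO


class ToyUnidistFactory(ToyFactory):
    # The factory exposes the IO class that belongs to its execution.
    io_cls = ToyUnidistIO


class ToyDispatcher:
    """Chooses the factory for the configured execution and forwards calls."""

    _factories = {("pandas", "unidist"): ToyUnidistFactory}

    @classmethod
    def get_factory(cls, storage_format: str, engine: str):
        try:
            return cls._factories[(storage_format, engine)]
        except KeyError:
            raise ValueError(
                f"No factory for {storage_format} on {engine}"
            ) from None

    @classmethod
    def read_csv(cls, path, storage_format="pandas", engine="unidist"):
        return cls.get_factory(storage_format, engine).io_cls.read_csv(path)


message = ToyDispatcher.read_csv("data.csv")
```

The API layer only ever talks to the dispatcher, so switching the execution changes which factory is returned without touching the calling code.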
When reading data from a CSV file, for example, the PandasOnUnidistIO class forwards the user query to the _read() method of CSVDispatcher, where the query's parameters are preprocessed to check if they are supported by the execution (defaulting to pandas if they are not) and some metadata common to all partitions to be read is computed. Then, the file is split into row chunks, and this data is used to launch remote tasks on the Unidist workers via the deploy() method of UnidistWrapper. On each Unidist worker, the PandasCSVParser parses its chunk of the data. After the remote tasks are finished, additional result postprocessing is performed, and a new query compiler with the data read is returned.
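The read flow (split the file into row chunks, parse each chunk in a separate task, then combine the results) can be sketched with the standard library alone. This is a simplified illustration, not Modin's code: a thread pool stands in for the Unidist workers, the function names are hypothetical, and the splitter assumes that no quoted field contains a newline.

```python
import csv
import io
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_row_chunks(path, num_chunks):
    """Return the header and byte ranges that each end on a row boundary."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = next(csv.reader([f.readline().decode()]))
        start = f.tell()
        step = max((size - start) // num_chunks, 1)
        ranges = []
        while start < size:
            f.seek(min(start + step, size))
            f.readline()  # advance to the end of the row we landed in
            end = f.tell()
            ranges.append((start, end))
            start = end
    return header, ranges


def parse_chunk(path, byte_range, header):
    """Task run per chunk: read only the assigned bytes and parse them."""
    start, end = byte_range
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(end - start).decode()
    reader = csv.reader(io.StringIO(text, newline=""))
    return [dict(zip(header, row)) for row in reader if row]


def read_csv_parallel(path, num_chunks=4):
    """Split, parse in parallel, then combine the parts in file order."""
    header, ranges = split_row_chunks(path, num_chunks)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda r: parse_chunk(path, r, header), ranges))
    return [row for part in parts for row in part]


# Usage: write a small file, then read it back in three row chunks.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    with open(csv_path, "w", newline="") as f:
        f.write("a,b\n" + "".join(f"{i},{i * i}\n" for i in range(10)))
    rows = read_csv_parallel(csv_path, num_chunks=3)
```

The header is parsed once and shared with every task, which corresponds to the metadata common to all partitions, and the final concatenation plays the role of the postprocessing step.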
When writing data to a CSV file, for example, the PandasOnUnidistIO processes the user query to execute it on Unidist workers. Then, the PandasOnUnidistIO asks the PandasOnUnidistDataframe to decompose the data into row-wise partitions that will be written into the file in parallel by Unidist workers.
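The write flow can be sketched in the same spirit. The example below decomposes rows into row-wise partitions, serializes each partition in a separate task, and joins the pieces in order. It is a simplification under stated assumptions: a thread pool stands in for the Unidist workers, the function names are hypothetical, and the result is returned as text instead of being written through the synchronization that a real parallel file write requires.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def row_partitions(rows, num_parts):
    """Split rows into at most num_parts contiguous row-wise partitions."""
    size = max(-(-len(rows) // num_parts), 1)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def serialize(partition, header=None):
    """Render one partition as CSV text; only the first carries the header."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    if header is not None:
        writer.writerow(header)
    writer.writerows(partition)
    return buf.getvalue()


def to_csv_parallel(header, rows, num_parts=4):
    """Serialize partitions in parallel and join them in their original order."""
    parts = row_partitions(rows, num_parts) or [[]]
    headers = [header] + [None] * (len(parts) - 1)
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(serialize, parts, headers))
    return "".join(chunks)


csv_text = to_csv_parallel(["a", "b"], [[1, 2], [3, 4], [5, 6]], num_parts=2)
```

Because `pool.map` yields results in submission order, the partitions end up in the output in the same order as in the source data even though they are serialized concurrently.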