PandasOnPython Execution#
Queries that perform data transformation, data ingress or data egress using the pandas on Python execution pass through the Modin components detailed below.
pandas on Python execution is sequential and is intended for debugging. To enable pandas on Python execution, please refer to the usage section in pandas on Python.
Data Transformation#
When a user calls any DataFrame API, a query starts forming at the API layer to be executed at the Execution layer. The API layer is responsible for processing the query appropriately, for example, determining whether the final result should be a DataFrame or Series object. This layer is also responsible for sanitizing the input to the PandasQueryCompiler, e.g. validating a parameter from the query and defining specific intermediate values to provide more context to the query compiler.
The PandasQueryCompiler is responsible for processing the query received from the DataFrame API layer and determining how to apply it to a subset of the data, either cell-wise or along an axis-wise partition backed by the pandas storage format. The PandasQueryCompiler maps the query to one of the Core Algebra Operators of the PandasOnPythonDataframe, which inherits generic functionality from the PandasDataframe.
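The layering described above can be sketched with a toy model. The names `SketchFrame`, `SketchQueryCompiler`, `api_abs` and `api_sum` are hypothetical stand-ins invented for this illustration; they are not Modin classes, and the real operators are considerably richer. The sketch assumes row-wise partitions held as plain pandas objects and only shows how a cell-wise query and an axis-wise query could be routed to different operators.

```python
import pandas as pd


class SketchFrame:
    """Hypothetical dataframe layer: a list of row partitions (pandas objects)."""

    def __init__(self, partitions):
        self.partitions = partitions

    def map(self, func):
        # Cell-wise operator: each partition can be processed on its own.
        return SketchFrame([func(part) for part in self.partitions])

    def reduce_columns(self, func):
        # Axis-wise operator: needs full columns, so row partitions are joined first.
        return func(pd.concat(self.partitions))

    def to_pandas(self):
        return pd.concat(self.partitions)


class SketchQueryCompiler:
    """Hypothetical query compiler: maps a query to one of the frame operators."""

    def __init__(self, frame):
        self.frame = frame

    def abs(self):
        return SketchQueryCompiler(self.frame.map(lambda part: part.abs()))

    def sum(self):
        return self.frame.reduce_columns(lambda full: full.sum())


def api_abs(query_compiler):
    # API layer: abs() keeps the shape, so the result is a DataFrame.
    return query_compiler.abs().frame.to_pandas()


def api_sum(query_compiler):
    # API layer: a column-wise sum collapses the rows, so the result is a Series.
    return query_compiler.sum()


data = pd.DataFrame({"a": [-1, 2, -3, 4], "b": [5, -6, 7, -8]})
qc = SketchQueryCompiler(SketchFrame([data.iloc[:2], data.iloc[2:]]))
abs_result = api_abs(qc)
sum_result = api_sum(qc)
```

In this sketch the API-layer functions decide the result type, while the query compiler only chooses which operator to use.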
PandasOnPython Dataframe implementation#
This page describes the implementation of Modin PandasDataframe Objects specific to PandasOnPython execution. Since the Python engine does not parallelize computation, operations on partitions are performed sequentially. Without parallelization there is no performance speed-up, so PandasOnPython is used for testing purposes only.
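As a rough illustration of what "sequential operations on partitions" means, the sketch below splits a pandas DataFrame into row blocks and applies a function to them one at a time in the calling process. The helpers `split_rows` and `apply_sequentially` are hypothetical names made up for this example and do not mirror Modin's internal partitioning code.

```python
import pandas as pd


def split_rows(df, n_parts):
    """Split df into at most n_parts contiguous row blocks."""
    step = max(1, -(-len(df) // n_parts))  # ceiling division
    return [df.iloc[start:start + step] for start in range(0, len(df), step)]


def apply_sequentially(partitions, func):
    """Apply func to each partition in order, with no workers or futures."""
    results = []
    for part in partitions:
        results.append(func(part))
    return results


df = pd.DataFrame({"x": range(6), "y": range(6, 12)})
parts = split_rows(df, 3)
doubled = pd.concat(apply_sequentially(parts, lambda part: part * 2))
```

Because every partition is processed in a plain loop, the result matches what pandas would produce on the whole frame, which is what makes this style of execution convenient for debugging.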
Data Ingress#
Data Egress#
When a user calls any IO function from the modin.pandas.io module, the API layer queries the FactoryDispatcher, which defines a factory specific to the execution, namely the PandasOnPythonFactory. The factory, in turn, exposes the PandasOnPythonIO class, which is responsible for reading from and writing to files.
When reading data from a CSV file, for example, the PandasOnPythonIO class reads the data using the corresponding pandas function (pandas.read_csv() in this case). After the read completes, a new query compiler is created from the pandas object using from_pandas() and returned.
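The read path can be approximated as follows. `ToyQueryCompiler` and `ToyIO` are hypothetical classes written for this sketch (only `pandas.read_csv()` is a real library call here); the actual Modin classes take additional arguments and do more work. An in-memory buffer is used in place of a file so the example is self-contained.

```python
import io

import pandas as pd


class ToyQueryCompiler:
    """Hypothetical query compiler that simply wraps a pandas DataFrame."""

    def __init__(self, pandas_df):
        self._pandas_df = pandas_df

    @classmethod
    def from_pandas(cls, pandas_df):
        return cls(pandas_df)

    def to_pandas(self):
        return self._pandas_df


class ToyIO:
    """Hypothetical IO class: delegates parsing to pandas, then wraps the result."""

    @classmethod
    def read_csv(cls, filepath_or_buffer, **kwargs):
        pandas_df = pd.read_csv(filepath_or_buffer, **kwargs)  # pandas does the read
        return ToyQueryCompiler.from_pandas(pandas_df)  # wrap for the upper layers


csv_text = "a,b\n1,2\n3,4\n"
qc = ToyIO.read_csv(io.StringIO(csv_text))
```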
When writing data to a CSV file, for example, the PandasOnPythonIO class converts a query compiler to a pandas object using to_pandas(). pandas then writes the data to the file using the corresponding function (pandas.DataFrame.to_csv() in this case).
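The write path mirrors the read path. In the sketch below, `ToyQueryCompiler` and the module-level `to_csv` helper are hypothetical and exist only for illustration; the one real library call is `pandas.DataFrame.to_csv()`. An in-memory buffer again stands in for a file.

```python
import io

import pandas as pd


class ToyQueryCompiler:
    """Hypothetical query compiler that simply wraps a pandas DataFrame."""

    def __init__(self, pandas_df):
        self._pandas_df = pandas_df

    def to_pandas(self):
        return self._pandas_df


def to_csv(query_compiler, path_or_buf, **kwargs):
    """Materialize the query compiler as pandas, then let pandas write it."""
    pandas_df = query_compiler.to_pandas()
    return pandas_df.to_csv(path_or_buf, **kwargs)


buffer = io.StringIO()
qc = ToyQueryCompiler(pd.DataFrame({"a": [1, 3], "b": [2, 4]}))
to_csv(qc, buffer, index=False)
written = buffer.getvalue()
```

Since the whole frame is converted to a single pandas object before writing, this approach assumes the data fits in memory, which is consistent with the debugging role of this execution.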