PyArrow-on-Ray Module Description#
High-Level Module Overview#
This module houses experimental functionality with PyArrow storage format and Ray
engine. The biggest difference from core engines is that internally each partition
is represented as pyarrow.Table
put in the Ray
Plasma store.
Why to Use PyArrow Tables#
As it was mentioned
by the pandas creator, pandas internal architecture is not optimal and sometimes
needs up to ten times more memory than the original dataset size
(note, that pandas rule of thumb: have 5 to 10 times as much RAM as the size of your
dataset). In order to fix this issue (or at least to reduce needed memory amount and
needed data copying), PyArrow-on-Ray
module was added. Due to the optimized architecture
of PyArrow Tables, no additional copies are needed in some
corner cases, which can significantly improve Modin performance. The downside of this approach
is that PyArrow and pandas do not support the same APIs and some functions/parameters may have
different signatures or output different results, so for now the PyArrow-on-Ray
engine is
under development and marked as experimental.