:orphan:
PyArrow-on-Ray Module Description
"""""""""""""""""""""""""""""""""
High-Level Module Overview
''''''''''''''''''''''''''
This module houses experimental functionality with PyArrow storage format and Ray
engine. The biggest difference from core engines is that internally each partition
is represented as ``pyarrow.Table`` put in the ``Ray`` Plasma store.
Why to Use PyArrow Tables
'''''''''''''''''''''''''
As it was `mentioned `_
by the pandas creator, pandas internal architecture is not optimal and sometimes
needs up to ten times more memory than the original dataset size
(note, that pandas rule of thumb: `have 5 to 10 times as much RAM as the size of your
dataset`). In order to fix this issue (or at least to reduce needed memory amount and
needed data copying), ``PyArrow-on-Ray`` module was added. Due to the optimized architecture
of PyArrow Tables, `no additional copies are needed
`_ in some
corner cases, which can significantly improve Modin performance. The downside of this approach
is that PyArrow and pandas do not support the same APIs and some functions/parameters may have
different signatures or output different results, so for now the ``PyArrow-on-Ray`` engine is
under development and marked as experimental.