PyArrow-on-Ray Module Description#

High-Level Module Overview#

This module houses experimental functionality with PyArrow storage format and Ray engine. The biggest difference from core engines is that internally each partition is represented as pyarrow.Table put in the Ray Plasma store.

Why to Use PyArrow Tables#

As it was mentioned by the pandas creator, pandas internal architecture is not optimal and sometimes needs up to ten times more memory than the original dataset size (note, that pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset). In order to fix this issue (or at least to reduce needed memory amount and needed data copying), PyArrow-on-Ray module was added. Due to the optimized architecture of PyArrow Tables, no additional copies are needed in some corner cases, which can significantly improve Modin performance. The downside of this approach is that PyArrow and pandas do not support the same APIs and some functions/parameters may have different signatures or output different results, so for now the PyArrow-on-Ray engine is under development and marked as experimental.