Out of Core in Modin

If you are working with very large files or would like to exceed your memory, you may change the primary location of the DataFrame. If you would like to exceed memory, you can use your disk as an overflow for the memory.

Starting Modin with out of core enabled

Out of core is now enabled by default for both Ray and Dask engines.

Disabling Out of Core

Out of core is enabled by the compute engine selected. To disable it, start your preferred compute engine with the appropriate arguments. For example:

import modin.pandas as pd
import ray

ray.init(_plasma_directory="/tmp")  # setting to disable out of core in Ray
df = pd.read_csv("some.csv")

If you are using Dask, you have to modify local configuration files. Visit the Dask documentation on object spilling to see how.

Running an example with out of core

Before you run this, please make sure you follow the instructions listed above.

import modin.pandas as pd
import numpy as np
frame_data = np.random.randint(0, 100, size=(2**20, 2**8)) # 2GB each
df = pd.DataFrame(frame_data).add_prefix("col")
big_df = pd.concat([df for _ in range(20)]) # 20x2GB frames
nan_big_df = big_df.isna() # The performance here represents a simple map
print(big_df.groupby("col1").count()) # group by on a large dataframe

This example creates a 40GB DataFrame from 20 identical 2GB DataFrames and performs various operations on them. Feel free to play around with this code and let us know what you think!