Modin Configuration Settings#

To adjust Modin’s default behavior, set a config value either through an environment variable or through the modin.config API. To print all available Modin configs with their descriptions, run python -m modin.config.

Public API#

In principle, configs could come from any source, but for now only environment variables are implemented. Every environment-variable config derives from EnvironmentVariable, which contains most of the config API implementation.
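To make the idea of an environment-variable-backed config concrete, here is a minimal sketch in plain Python. It is not Modin’s implementation: the class EnvConfigSketch, its subclasses, and the SKETCH_* variable names are hypothetical, and the sketch assumes that a config simply reads its variable, decodes the raw string, and falls back to a default when the variable is unset.

```python
import os
from typing import Any, Callable


class EnvConfigSketch:
    """Hypothetical, simplified stand-in for an env-variable-backed config."""

    varname: str = ""
    default: Any = None
    decode: Callable[[str], Any] = staticmethod(str)

    @classmethod
    def get(cls) -> Any:
        # Fall back to the default when the variable is not set.
        raw = os.environ.get(cls.varname)
        if raw is None:
            return cls.default
        return cls.decode(raw)


class NPartitionsSketch(EnvConfigSketch):
    varname = "SKETCH_NPARTITIONS"  # hypothetical variable name
    default = 4
    decode = staticmethod(int)


class ProgressBarSketch(EnvConfigSketch):
    varname = "SKETCH_PROGRESS_BAR"  # hypothetical variable name
    default = False
    # Assumption: a few common spellings are accepted as "true".
    decode = staticmethod(lambda raw: raw.strip().lower() in {"true", "yes", "1"})


os.environ.pop("SKETCH_NPARTITIONS", None)
unset_value = NPartitionsSketch.get()      # default is used

os.environ["SKETCH_NPARTITIONS"] = "16"
set_value = NPartitionsSketch.get()        # decoded from the environment

os.environ["SKETCH_PROGRESS_BAR"] = "True"
flag_value = ProgressBarSketch.get()
```

Each subclass only declares a variable name, a default, and a decoder, which mirrors the idea that most of the logic lives in the shared base class.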

class modin.config.envvars.EnvironmentVariable#

Base class for environment variables-based configuration.

classmethod get() Any#

Get config value.

Returns:

Decoded and verified config value.

Return type:

Any

classmethod get_help() str#

Generate user-presentable help for the config.

Return type:

str

classmethod get_value_source() ValueSource#

Get value source of the config.

Return type:

ValueSource

classmethod once(onvalue: Any, callback: Callable) None#

Execute callback if config value matches onvalue value.

Otherwise accumulate callbacks associated with the given onvalue in the _once container.

Parameters:
  • onvalue (Any) – Config value that triggers the callback.

  • callback (callable) – Callable that should be executed if config value matches onvalue.

classmethod put(value: Any) None#

Set config value.

Parameters:

value (Any) – Config value to set.

classmethod subscribe(callback: Callable) None#

Add callback to the _subs list and then execute it.

Parameters:

callback (callable) – Callable to execute.
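The callback semantics of subscribe and once can be illustrated with a toy model. The sketch below is not Modin’s code: ConfigSketch is a hypothetical class, and it assumes that callbacks receive the config class as their only argument and that put fires callbacks only when the value actually changes.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class ConfigSketch:
    """Hypothetical toy config mimicking the documented callback semantics."""

    _value: Any = "Ray"
    _subs: List[Callable] = []
    _once: DefaultDict[Any, List[Callable]] = defaultdict(list)

    @classmethod
    def get(cls) -> Any:
        return cls._value

    @classmethod
    def subscribe(cls, callback: Callable) -> None:
        # Register the callback, then execute it immediately.
        cls._subs.append(callback)
        callback(cls)

    @classmethod
    def once(cls, onvalue: Any, callback: Callable) -> None:
        # Run now if the value already matches; otherwise defer.
        if cls.get() == onvalue:
            callback(cls)
        else:
            cls._once[onvalue].append(callback)

    @classmethod
    def put(cls, value: Any) -> None:
        # Assumption: nothing fires if the value did not change.
        old, cls._value = cls._value, value
        if old == value:
            return
        for callback in cls._subs:
            callback(cls)
        # Deferred `once` callbacks run a single time and are then dropped.
        for callback in cls._once.pop(value, []):
            callback(cls)


events = []
ConfigSketch.subscribe(lambda cfg: events.append(("sub", cfg.get())))
ConfigSketch.once("Dask", lambda cfg: events.append(("once", cfg.get())))
ConfigSketch.put("Dask")  # fires the subscriber and the deferred `once` callback
ConfigSketch.put("Ray")   # fires the subscriber only
ConfigSketch.put("Dask")  # the `once` callback does not fire a second time
print(events)
```

In this model a subscriber runs on registration and on every change, while a once callback runs at most one time, when the value first equals onvalue.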

Modin Configs List#

Config Name

Env. Variable Name

Default Value

Description

Options

AsvDataSizeConfig

MODIN_ASV_DATASIZE_CONFIG

Allows overriding the default size of data (shapes).

AsvImplementation

MODIN_ASV_USE_IMPL

modin

Selects the library to use for performance testing.

(‘modin’, ‘pandas’)

AsyncReadMode

MODIN_ASYNC_READ_MODE

False

When enabled, reading functions do not wait until the data has been fully read from the source.

That is, the reading function only launches the tasks that read/create the dataframe; it does not ensure that construction is finished by the time it returns the dataframe.

This option was introduced to improve the performance of reading/constructing Modin DataFrames. However, it may also:

1. Increase peak memory consumption, because garbage collection of the temporary objects created during reading also becomes lazy and only happens once the reading/construction has actually finished.

2. Break code in which the source is deleted manually after the reading function returns a result, for example, when reading inside a context block that deletes the file on __exit__().
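The second pitfall can be reproduced without Modin. The following sketch is only an analogy in plain Python: lazy_read is a hypothetical helper that defers the actual read, standing in for a reading function that returns before the data has been loaded.

```python
import os
import tempfile


def lazy_read(path):
    """Return a zero-argument function that reads the file only when called."""
    def materialize():
        with open(path) as f:
            return f.read()
    return materialize


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w") as f:
        f.write("a,b\n1,2\n")
    pending = lazy_read(path)  # nothing has been read yet
# The directory and the file are removed when the block exits.

try:
    pending()
    outcome = "read succeeded"
except FileNotFoundError:
    outcome = "source deleted before the read finished"
print(outcome)
```

Forcing the read inside the with block, before the source is removed, avoids the failure in this analogy.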

BenchmarkMode

MODIN_BENCHMARK_MODE

False

Whether or not to perform computations synchronously.

CIAWSAccessKeyID

AWS_ACCESS_KEY_ID

foobar_key

Set to AWS_ACCESS_KEY_ID when running mock S3 tests for Modin in GitHub CI.

CIAWSSecretAccessKey

AWS_SECRET_ACCESS_KEY

foobar_secret

Set to AWS_SECRET_ACCESS_KEY when running mock S3 tests for Modin in GitHub CI.

CpuCount

MODIN_CPUS

multiprocessing.cpu_count()

How many CPU cores to use during initialization of the Modin engine.

DaskThreadsPerWorker

MODIN_DASK_THREADS_PER_WORKER

1

Number of threads per Dask worker.

DoUseCalcite

MODIN_USE_CALCITE

True

Whether to use Calcite for HDK queries execution.

DocModule

MODIN_DOC_MODULE

pandas

The module to use for docstrings.

The value set here must be a valid, importable module. It should provide DataFrame, Series, and/or several top-level APIs (e.g. read_csv) directly.

Engine

MODIN_ENGINE

Ray

Distribution engine used to run queries.

(‘Ray’, ‘Dask’, ‘Python’, ‘Native’, ‘Unidist’)

ExperimentalGroupbyImpl

MODIN_EXPERIMENTAL_GROUPBY

False

Set to true to use Modin’s range-partitioning group by implementation.

This parameter is deprecated. Use RangePartitioningGroupby instead.

ExperimentalNumPyAPI

MODIN_EXPERIMENTAL_NUMPY_API

False

Set to true to use Modin’s implementation of NumPy API.

This parameter is deprecated. Use ModinNumpy instead.

GithubCI

MODIN_GITHUB_CI

False

Set to true when running Modin in GitHub CI.

GpuCount

MODIN_GPUS

How many GPU devices to utilize across the whole distribution.

HdkFragmentSize

MODIN_HDK_FRAGMENT_SIZE

How big a fragment in HDK should be when creating a table (in rows).

HdkLaunchParameters

MODIN_HDK_LAUNCH_PARAMETERS

{‘enable_union’: 1, ‘enable_columnar_output’: 1, ‘enable_lazy_fetch’: 0, ‘null_div_by_zero’: 1, ‘enable_watchdog’: 0, ‘enable_thrift_logs’: 0, ‘enable_multifrag_execution_result’: 1, ‘cpu_only’: 1, ‘enable_lazy_dict_materialization’: 0, ‘log_dir’: ‘pyhdk_log’}

Additional command line options for the HDK engine.

Please visit OmniSci documentation for the description of available parameters: https://docs.omnisci.com/installation-and-configuration/config-parameters#configuration-parameters-for-omniscidb

IsDebug

MODIN_DEBUG

Force Modin engine to be “Python” unless specified by $MODIN_ENGINE.

IsExperimental

MODIN_EXPERIMENTAL

Whether to turn on experimental features.

IsRayCluster

MODIN_RAY_CLUSTER

Whether Modin is running on a pre-initialized Ray cluster.

LazyExecution

MODIN_LAZY_EXECUTION

Auto

Lazy execution mode.

Supported values:

  • Auto - the execution mode is chosen by the engine for each operation (default value).

  • On - lazy execution is performed wherever possible.

  • Off - lazy execution is disabled.

(‘Auto’, ‘On’, ‘Off’)

LogFileSize

MODIN_LOG_FILE_SIZE

10

Max size of logs (in MBs) to store per Modin job.

LogMemoryInterval

MODIN_LOG_MEMORY_INTERVAL

5

Interval (in seconds) to profile memory utilization for logging.

LogMode

MODIN_LOG_MODE

disable

Logging is opt-in; set this value to enable it.

(‘enable’, ‘disable’, ‘enable_api_only’)

Memory

MODIN_MEMORY

How much memory (in bytes) to give to an execution engine.

Notes:

  • In Ray case: the amount of memory to start the Plasma object store with.

  • In Dask case: the amount of memory that is given to each worker depending on CPUs used.

MinPartitionSize

MODIN_MIN_PARTITION_SIZE

32

Minimum number of rows/columns in a single pandas partition split.

Once a partition for a pandas dataframe has more than this many elements, Modin adds another partition.
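The interplay between MinPartitionSize and NPartitions can be pictured with a small calculation. This is an assumption-based illustration rather than Modin’s partitioning code: it assumes the chunk size along an axis is the axis length divided by NPartitions (rounded up), but never smaller than MinPartitionSize. Both function names are hypothetical.

```python
import math


def sketch_chunk_size(axis_len, npartitions, min_partition_size=32):
    """Assumed rule: even split, but never below the minimum partition size."""
    return max(math.ceil(axis_len / npartitions), min_partition_size)


def sketch_partition_count(axis_len, npartitions, min_partition_size=32):
    """Number of partitions that result from the assumed chunk size."""
    chunk = sketch_chunk_size(axis_len, npartitions, min_partition_size)
    return max(1, math.ceil(axis_len / chunk))


for rows in (20, 32, 33, 100, 1000):
    print(rows, sketch_partition_count(rows, npartitions=8))
```

Under these assumptions, an axis of up to 32 elements stays in one partition, 33 elements produce a second one, and only sufficiently long axes reach the full NPartitions count.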

ModinNumpy

MODIN_NUMPY

False

Set to true to use Modin’s implementation of NumPy API.

NPartitions

MODIN_NPARTITIONS

equal to the value of MODIN_CPUS

How many partitions to use for a Modin DataFrame (along each axis).

PersistentPickle

MODIN_PERSISTENT_PICKLE

False

Whether serialization should be persistent.

ProgressBar

MODIN_PROGRESS_BAR

False

Whether or not to show the progress bar.

RangePartitioning

MODIN_RANGE_PARTITIONING

False

Set to true to use Modin’s range-partitioning implementation where possible.

Please refer to the documentation for cases where enabling this option would be beneficial: https://modin.readthedocs.io/en/stable/flow/modin/experimental/range_partitioning_groupby.html

RangePartitioningGroupby

MODIN_RANGE_PARTITIONING_GROUPBY

False

Set to true to use Modin’s range-partitioning group by implementation.

Experimental groupby is implemented using a range-partitioning technique. Note that it may not always perform better than Modin’s original TreeReduce and FullAxis implementations. For more information, visit the corresponding section of Modin’s documentation: TODO: add a link to the section once it’s written.

RayRedisAddress

MODIN_REDIS_ADDRESS

Redis address to connect to when running in Ray cluster.

RayRedisPassword

MODIN_REDIS_PASSWORD

random string

What password to use for connecting to Redis.

ReadSqlEngine

MODIN_READ_SQL_ENGINE

Pandas

Engine to run read_sql.

(‘Pandas’, ‘Connectorx’)

StorageFormat

MODIN_STORAGE_FORMAT

Pandas

Engine to run on a single node of distribution.

(‘Pandas’, ‘Hdk’, ‘Cudf’)

TestDatasetSize

MODIN_TEST_DATASET_SIZE

Dataset size for running some tests.

(‘Small’, ‘Normal’, ‘Big’)

TestReadFromPostgres

MODIN_TEST_READ_FROM_POSTGRES

False

Set to true to test reading from Postgres.

TestReadFromSqlServer

MODIN_TEST_READ_FROM_SQL_SERVER

False

Set to true to test reading from SQL server.

TrackFileLeaks

MODIN_TEST_TRACK_FILE_LEAKS

True

Whether to track leaks of open file handles during testing.

Usage Guide#

The example below shows how to interact with Modin configs. A config value can be set either through its environment variable or through the config API.

import os

# Setting `MODIN_STORAGE_FORMAT` environment variable.
# Also can be set outside the script.
os.environ["MODIN_STORAGE_FORMAT"] = "Hdk"

import modin.config
import modin.pandas as pd

# Checking initially set `StorageFormat` config,
# which corresponds to `MODIN_STORAGE_FORMAT` environment
# variable
print(modin.config.StorageFormat.get()) # prints Hdk

# Checking default value of `NPartitions`,
# which equals the number of CPUs Modin uses
print(modin.config.NPartitions.get()) # prints e.g. 8 on an 8-core machine

# Changing value of `NPartitions`
modin.config.NPartitions.put(16)
print(modin.config.NPartitions.get()) # prints 16