Modin Configuration Settings#

To adjust Modin’s default behavior, set Modin configs either through environment variables or through the modin.config API. To list all available configs with their descriptions, run python -m modin.config.

Public API#

In principle, config values could come from any source, but for now only environment variables are implemented. Every environment-variable-based config derives from EnvironmentVariable, which contains most of the config API implementation.
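To make the idea of an environment-variable-backed config concrete, here is a minimal, self-contained sketch. It is not Modin's implementation: `ToyEnvConfig`, `ToySource`, and the `TOY_NPARTITIONS` variable are hypothetical names invented for illustration. The only behavior borrowed from this page is that a value can come from a default, from the environment, or from an explicit `put`, and that the config can report which of these applies.

```python
import os
from enum import Enum
from typing import Any, Optional


class ToySource(Enum):
    """Where the current value came from (hypothetical stand-in for ValueSource)."""

    DEFAULT = "default"
    ENVIRONMENT = "environment"
    SET_BY_USER = "set by user"


class ToyEnvConfig:
    """Toy config backed by one environment variable; not Modin's code."""

    varname = "TOY_NPARTITIONS"  # hypothetical variable name
    default = 4
    _value: Optional[int] = None  # filled in only by put()

    @classmethod
    def get(cls) -> Any:
        """Return the explicit value, else the decoded env value, else the default."""
        if cls._value is not None:
            return cls._value
        raw = os.environ.get(cls.varname)
        return int(raw) if raw is not None else cls.default

    @classmethod
    def put(cls, value: int) -> None:
        """Override whatever the environment or the default says."""
        cls._value = int(value)

    @classmethod
    def get_value_source(cls) -> ToySource:
        """Report which of the three sources get() is currently using."""
        if cls._value is not None:
            return ToySource.SET_BY_USER
        if cls.varname in os.environ:
            return ToySource.ENVIRONMENT
        return ToySource.DEFAULT


os.environ.pop(ToyEnvConfig.varname, None)  # start from a clean environment
default_state = (ToyEnvConfig.get(), ToyEnvConfig.get_value_source())

os.environ[ToyEnvConfig.varname] = "8"
env_state = (ToyEnvConfig.get(), ToyEnvConfig.get_value_source())

ToyEnvConfig.put(16)
user_state = (ToyEnvConfig.get(), ToyEnvConfig.get_value_source())

print(default_state, env_state, user_state)
```

The sketch re-reads the environment on every `get`; a real implementation may cache the decoded value instead.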

class modin.config.envvars.EnvironmentVariable#

Base class for environment-variable-based configuration.

classmethod get() → Any#

Get config value.

Returns: Decoded and verified config value.
Return type: Any

classmethod get_help() → str#

Generate user-presentable help for the config.

Return type: str

classmethod get_value_source() → ValueSource#

Get value source of the config.

Return type: ValueSource

classmethod once(onvalue: Any, callback: Callable) → None#

Execute callback if config value matches onvalue value.

Otherwise accumulate callbacks associated with the given onvalue in the _once container.

Parameters:

onvalue (Any) – Config value to match.
callback (callable) – Callable that should be executed if config value matches onvalue.

classmethod put(value: Any) → None#

Set config value.

Parameters: value (Any) – Config value to set.

classmethod subscribe(callback: Callable) → None#

Add callback to the _subs list and then execute it.

Parameters: callback (callable) – Callable to execute.
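The callback semantics of `subscribe` and `once` described above can be illustrated with a small toy class. This is a simplified stand-in, not Modin's code: `ToyConfig` is a hypothetical name, passing the config class to each callback is an assumption, and so is the behavior of `put` (re-running subscribers and releasing the accumulated one-shot callbacks when the value changes).

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class ToyConfig:
    """Toy stand-in that mimics the callback behavior described above."""

    _value: Any = "Ray"
    _subs: List[Callable] = []
    _once: DefaultDict[Any, List[Callable]] = defaultdict(list)

    @classmethod
    def get(cls) -> Any:
        return cls._value

    @classmethod
    def subscribe(cls, callback: Callable) -> None:
        # Register the callback, then run it right away with the config class.
        cls._subs.append(callback)
        callback(cls)

    @classmethod
    def once(cls, onvalue: Any, callback: Callable) -> None:
        # Run immediately on a match; otherwise park the callback under onvalue.
        if cls.get() == onvalue:
            callback(cls)
        else:
            cls._once[onvalue].append(callback)

    @classmethod
    def put(cls, value: Any) -> None:
        # Assumption: a changed value re-runs subscribers and releases the
        # parked one-shot callbacks registered for the new value.
        if value == cls._value:
            return
        cls._value = value
        for callback in cls._subs:
            callback(cls)
        for callback in cls._once.pop(value, []):
            callback(cls)


events: List[str] = []
ToyConfig.subscribe(lambda cfg: events.append(f"sub:{cfg.get()}"))
ToyConfig.once("Dask", lambda cfg: events.append("once:Dask"))
ToyConfig.put("Dask")  # value changes: subscriber and one-shot callback fire
ToyConfig.put("Ray")   # value changes again: only the subscriber fires
ToyConfig.put("Dask")  # the one-shot callback was consumed, so it stays silent
print(events)
```

In this toy model the subscriber runs once at registration and again on every change, while the `once` callback runs a single time, the first time the value becomes "Dask".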

Modin Configs List#

Config Name	Env. Variable Name	Default Value	Description	Options
AsvDataSizeConfig	MODIN_ASV_DATASIZE_CONFIG		Allows overriding the default size of data (shapes).
AsvImplementation	MODIN_ASV_USE_IMPL	modin	Selects the library used for performance testing.	(‘modin’, ‘pandas’)
AsyncReadMode	MODIN_ASYNC_READ_MODE	False	When enabled, the reading function does not wait for reading from the source to finish. It only launches the tasks that read/create the dataframe and does not ensure that construction is finalized by the time it returns a dataframe. This option was introduced to improve the performance of reading/constructing Modin DataFrames; however, it may also: 1. Increase peak memory consumption, since garbage collection of the temporary objects created during reading is also lazy and is performed only when the reading/construction actually finishes. 2. Break code in which the source is manually deleted after the reading function returns a result, for example, when reading inside a context block that deletes the file on `__exit__()`.
AutoSwitchBackend	MODIN_AUTO_SWITCH_BACKENDS	False	Whether automatic backend switching is allowed. When this flag is set, a Modin backend can attempt to automatically choose an appropriate backend for different operations based on features of the input data. When disabled, backends should avoid implicit backend switching outside of explicit operations like to_pandas and to_ray.
Backend	MODIN_BACKEND	Ray	An alias for execution, i.e. the combination of StorageFormat and Engine. Setting Backend may change StorageFormat and/or Engine to the corresponding values, and setting Engine or StorageFormat may change Backend. Modin’s built-in backends include: "Ray" <-> (StorageFormat="Pandas", Engine="Ray"); "Dask" <-> (StorageFormat="Pandas", Engine="Dask"); "Python_Test" <-> (StorageFormat="Pandas", Engine="Python"), an execution mode meant for testing only; "Unidist" <-> (StorageFormat="Pandas", Engine="Unidist"); "Pandas" <-> (StorageFormat="Native", Engine="Native").	(‘Ray’, ‘Dask’, ‘Python_Test’, ‘Unidist’, ‘Pandas’)
BackendJoinConsiderAllBackends	MODIN_BACKEND_JOIN_CONSIDER_ALL_BACKENDS	True	Whether to consider all active backends when performing a pre-operation switch for join operations. Only used when AutoSwitchBackend is active. By default, only backends already present in the arguments of a join operation are considered when switching backends. Enabling this flag will allow join operations that are registered as pre-op switches to consider backends other than those directly present in the arguments.
BackendMergeCastInPlace	MODIN_BACKEND_MERGE_CAST_IN_PLACE	True	Whether to cast a DataFrame in-place when performing a merge when using hybrid mode. This flag modifies the behavior of a cast performed on operations involving more than one type of query compiler. If enabled the actual cast will be performed in-place and the input DataFrame will have a new backend. If disabled the original DataFrame will remain on the same underlying engine.
BenchmarkMode	MODIN_BENCHMARK_MODE	False	Whether or not to perform computations synchronously.
CIAWSAccessKeyID	AWS_ACCESS_KEY_ID	foobar_key	Set to AWS_ACCESS_KEY_ID when running mock S3 tests for Modin in GitHub CI.
CIAWSSecretAccessKey	AWS_SECRET_ACCESS_KEY	foobar_secret	Set to AWS_SECRET_ACCESS_KEY when running mock S3 tests for Modin in GitHub CI.
CpuCount	MODIN_CPUS	multiprocessing.cpu_count()	How many CPU cores to use during initialization of the Modin engine.
DaskThreadsPerWorker	MODIN_DASK_THREADS_PER_WORKER	1	Number of threads per Dask worker.
DocModule	MODIN_DOC_MODULE	pandas	The module to use for docstrings. The value set here must be a valid, importable module. It should provide DataFrame, Series, and/or several APIs directly (e.g. read_csv).
DynamicPartitioning	MODIN_DYNAMIC_PARTITIONING	False	Set to true to use Modin’s dynamic-partitioning implementation where possible. Please refer to the documentation for cases where enabling this option would be beneficial: https://modin.readthedocs.io/en/stable/usage_guide/optimization_notes/index.html#dynamic-partitioning-in-modin
Engine	MODIN_ENGINE	Ray	Distribution engine to run queries with.	(‘Ray’, ‘Dask’, ‘Python’, ‘Unidist’, ‘Native’)
GithubCI	MODIN_GITHUB_CI	False	Set to true when running Modin in GitHub CI.
GpuCount	MODIN_GPUS		How many GPU devices to utilize across the whole distribution.
IsDebug	MODIN_DEBUG		Force Modin engine to be “Python” unless specified by $MODIN_ENGINE.
IsExperimental	MODIN_EXPERIMENTAL		Whether to turn on experimental features.
IsRayCluster	MODIN_RAY_CLUSTER		Whether Modin is running on a pre-initialized Ray cluster.
LazyExecution	MODIN_LAZY_EXECUTION	Auto	Lazy execution mode. Supported values: Auto - the execution mode is chosen by the engine for each operation (default value). On - the lazy execution is performed wherever it’s possible. Off - the lazy execution is disabled.	(‘Auto’, ‘On’, ‘Off’)
LogFileSize	MODIN_LOG_FILE_SIZE	10	Max size of logs (in MBs) to store per Modin job.
LogMemoryInterval	MODIN_LOG_MEMORY_INTERVAL	5	Interval (in seconds) to profile memory utilization for logging.
LogMode	MODIN_LOG_MODE	disable	Set `LogMode` to ‘enable’ to opt in to logging.	(‘enable’, ‘disable’)
Memory	MODIN_MEMORY		How much memory (in bytes) to give to an execution engine. Notes: With Ray: the amount of memory to start the Plasma object store with. With Dask: the amount of memory given to each worker, depending on the CPUs used.
MetricsMode	MODIN_METRICS_MODE	enable	Set `MetricsMode` value to disable/enable metrics collection. Metric handlers are registered through add_metric_handler and can be used to record graphite-style timings or values. It is the responsibility of the handler to define how those emitted metrics are handled.	(‘enable’, ‘disable’)
MinColumnPartitionSize	MODIN_MIN_COLUMN_PARTITION_SIZE	32	Minimum number of columns in a single pandas partition split. Once a partition for a pandas dataframe has more than this many elements, Modin adds another partition.
MinPartitionSize	MODIN_MIN_PARTITION_SIZE	32	Minimum number of rows/columns in a single pandas partition split. Once a partition for a pandas dataframe has more than this many elements, Modin adds another partition.
MinRowPartitionSize	MODIN_MIN_ROW_PARTITION_SIZE	32	Minimum number of rows in a single pandas partition split. Once a partition for a pandas dataframe has more than this many elements, Modin adds another partition.
ModinNumpy	MODIN_NUMPY	False	Set to true to use Modin’s implementation of NumPy API.
NPartitions	MODIN_NPARTITIONS	value of MODIN_CPUS	How many partitions to use for a Modin DataFrame (along each axis).
NativePandasDeepCopy	MODIN_NATIVE_DEEP_COPY	False	Whether to perform deep copies when transferring data with the native pandas backend. Copies occur when constructing a Modin frame from a native pandas object with pd.DataFrame(pandas.DataFrame([])), or when creating a native pandas frame from a Modin one via df.modin.to_pandas(). Leaving this flag disabled produces significant performance improvements by reducing the number of copy operations performed. However, it may create unexpected results if the user mutates the Modin frame or native pandas frame in-place. >>> import pandas >>> import modin.pandas as pd >>> from modin.config import Backend >>> Backend.put("Pandas") >>> pandas.set_option("mode.copy_on_write", False) >>> native_df = pandas.DataFrame([0]) >>> modin_df = pd.DataFrame(native_df) >>> native_df.loc[0, 0] = -1 >>> modin_df 0 0 -1
NativePandasMaxRows	MODIN_NATIVE_MAX_ROWS	10000000	Maximum number of rows that can be processed using local, native pandas.
NativePandasTransferThreshold	MODIN_NATIVE_MAX_XFER_ROWS	10000000	Targeted max number of dataframe rows which should be transferred between engines. This is often the same value as MODIN_NATIVE_MAX_ROWS but it can be independently set to change how transfer costs are considered.
PersistentPickle	MODIN_PERSISTENT_PICKLE	False	Whether serialization should be persistent.
ProgressBar	MODIN_PROGRESS_BAR	False	Whether or not to show the progress bar.
RangePartitioning	MODIN_RANGE_PARTITIONING	False	Set to true to use Modin’s range-partitioning implementation where possible. Please refer to the documentation for cases where enabling this option would be beneficial: https://modin.readthedocs.io/en/stable/flow/modin/experimental/range_partitioning_groupby.html
RayInitCustomResources	MODIN_RAY_INIT_CUSTOM_RESOURCES		Ray node’s custom resources to initialize with. Visit Ray documentation for more details: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources Notes: If you rely on Modin to initialize Ray, set this config so that Ray is initialized with the custom resources.
RayRedisAddress	MODIN_REDIS_ADDRESS		Redis address to connect to when running in Ray cluster.
RayRedisPassword	MODIN_REDIS_PASSWORD	random string	What password to use for connecting to Redis.
RayTaskCustomResources	MODIN_RAY_TASK_CUSTOM_RESOURCES		Custom resources of a Ray node to request in tasks or actors. Visit Ray documentation for more details: https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#custom-resources Notes: You can use this config to limit the parallelism for the entire workflow by setting the config at the very beginning. >>> import modin.config as cfg >>> cfg.RayTaskCustomResources.put({"special_hardware": 0.001}) This way each single remote task or actor will require 0.001 of “special_hardware” to run. You can also use this config to limit the parallelism for a certain operation by setting the config with a context. >>> with cfg.context(RayTaskCustomResources={"special_hardware": 0.001}): ... df.<op> This way each single remote task or actor will require 0.001 of “special_hardware” to run within the context only.
ReadSqlEngine	MODIN_READ_SQL_ENGINE	Pandas	Engine to run read_sql.	(‘Pandas’, ‘Connectorx’)
ShowBackendSwitchProgress	MODIN_BACKEND_SWITCH_PROGRESS	True	Whether to show progress when switching between backends. When enabled, progress messages are displayed during backend switches to inform users about data transfer operations. When disabled, backend switches occur silently.
StorageFormat	MODIN_STORAGE_FORMAT	Pandas	Engine to run on a single node of distribution.	(‘Pandas’, ‘Native’)
TestDatasetSize	MODIN_TEST_DATASET_SIZE		Dataset size for running some tests.	(‘Small’, ‘Normal’, ‘Big’)
TestReadFromPostgres	MODIN_TEST_READ_FROM_POSTGRES	False	Set to true to test reading from Postgres.
TestReadFromSqlServer	MODIN_TEST_READ_FROM_SQL_SERVER	False	Set to true to test reading from SQL server.
TrackFileLeaks	MODIN_TEST_TRACK_FILE_LEAKS	True	Whether to track leakage of open file handles during testing.
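The correspondence between Backend and the (StorageFormat, Engine) pair listed in the Backend row can be written down as a plain lookup table. This is only an illustration of the mapping from the table, not a Modin API: `BACKEND_TO_EXECUTION` and `backend_for` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

# (StorageFormat, Engine) pairs copied from the Backend row of the table.
BACKEND_TO_EXECUTION: Dict[str, Tuple[str, str]] = {
    "Ray": ("Pandas", "Ray"),
    "Dask": ("Pandas", "Dask"),
    "Python_Test": ("Pandas", "Python"),  # meant for testing only
    "Unidist": ("Pandas", "Unidist"),
    "Pandas": ("Native", "Native"),
}

# Reverse lookup: which backend name corresponds to a given execution pair.
EXECUTION_TO_BACKEND: Dict[Tuple[str, str], str] = {
    pair: name for name, pair in BACKEND_TO_EXECUTION.items()
}


def backend_for(storage_format: str, engine: str) -> Optional[str]:
    """Return the built-in backend name for a pair, or None if there is none."""
    return EXECUTION_TO_BACKEND.get((storage_format, engine))


print(BACKEND_TO_EXECUTION["Pandas"])
print(backend_for("Pandas", "Dask"))
print(backend_for("Native", "Ray"))
```

Because the mapping is one-to-one for the built-in backends, changing either side determines the other, which is why setting Engine or StorageFormat may change Backend and vice versa.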

Usage Guide#

The example below shows how to interact with Modin configs. A config value can be set either through an environment variable or through the config API.

import os

# Setting `MODIN_ENGINE` environment variable.
# Also can be set outside the script.
os.environ["MODIN_ENGINE"] = "Dask"

import modin.config
import modin.pandas as pd

# Checking initially set `Engine` config,
# which corresponds to `MODIN_ENGINE` environment
# variable
print(modin.config.Engine.get()) # prints 'Dask'

# Checking default value of `NPartitions`
print(modin.config.NPartitions.get()) # prints the default, which equals the CPU count (e.g. '8')

# Changing value of `NPartitions`
modin.config.NPartitions.put(16)
print(modin.config.NPartitions.get()) # prints '16'

Configs can also be set through a context manager so that a value applies only to a certain part of the code:

import modin.config as cfg

# Default value for this config is 'False'
print(cfg.RangePartitioning.get()) # False

# Set the config to 'True' inside of the context-manager
with cfg.context(RangePartitioning=True):
    print(cfg.RangePartitioning.get()) # True
    df.merge(...) # `df` is an existing Modin DataFrame; the merge will use the range-partitioning impl

# Once the context is over, the config gets back to its previous value
print(cfg.RangePartitioning.get()) # False

# You can also set multiple configs at once by passing several keyword arguments to 'cfg.context'
print(cfg.AsyncReadMode.get()) # False

with cfg.context(RangePartitioning=True, AsyncReadMode=True):
    print(cfg.RangePartitioning.get()) # True
    print(cfg.AsyncReadMode.get()) # True
print(cfg.RangePartitioning.get()) # False
print(cfg.AsyncReadMode.get()) # False
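The restore-on-exit behavior shown above can be sketched in plain Python with contextlib. This is a toy model, not Modin's implementation of cfg.context: the `settings` dictionary and `toy_context` are hypothetical names, and only the observable behavior (override inside the block, previous values afterwards) is taken from the example.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator

# Hypothetical in-memory settings store standing in for real configs.
settings: Dict[str, Any] = {"RangePartitioning": False, "AsyncReadMode": False}


@contextmanager
def toy_context(**overrides: Any) -> Iterator[None]:
    """Temporarily apply overrides, then restore the previous values."""
    previous = {name: settings[name] for name in overrides}
    settings.update(overrides)
    try:
        yield
    finally:
        # Runs even if the body raises, so overrides never leak out.
        settings.update(previous)


with toy_context(RangePartitioning=True, AsyncReadMode=True):
    inside = dict(settings)  # snapshot taken while the overrides are active
after = dict(settings)       # snapshot taken once the block has exited

print(inside)
print(after)
```

Restoring in a `finally` clause is the design choice that matters here: the previous values come back even when the code inside the block raises an exception.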