Modin in the Cloud#

Modin can move computation to the cloud with minimal effort. Note that this feature is experimental, and its behavior and interfaces may change.

Prerequisites#

Sign up with a cloud provider and obtain a credentials file. Only AWS is currently supported; more providers are planned. (AWS credentials file format)

Setup environment#

pip install modin[remote]

This command installs the following dependencies:

  • RPyC - performs remote procedure calls.

  • Cloudpickle - allows pickling of functions and classes, which is used in our distributed runtime.

  • Boto3 - creates and sets up AWS cloud machines. It is an optional library for the Ray Autoscaler.

Notes:
  • The Ray Autoscaler component is also required. It is installed implicitly with Ray (note that Ray from conda currently lacks this component). See the Ray docs for more information.

Architecture#

[Figure: Architecture of Modin in the Cloud]

Notes:
  • For maximum performance, minimize the amount of data transferred between the local and remote environments.

  • For correct operation, all Python libraries and the interpreter itself must have the same versions in the local and remote environments.
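The version requirement in the second note can be checked before spawning a cluster. The following is a minimal sketch that uses only the standard library; the helper names `environment_fingerprint` and `find_mismatches` are hypothetical and are not part of Modin, and obtaining the fingerprint of the remote environment (for example, by running the same helper there) is left to the reader.

```python
import sys
from importlib import metadata


def environment_fingerprint(packages):
    """Return interpreter and package versions for the current environment."""
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return versions


def find_mismatches(local, remote):
    """List the names whose versions differ between two fingerprints."""
    return sorted(
        name
        for name in set(local) | set(remote)
        if local.get(name) != remote.get(name)
    )


# Fingerprint of the local side; the package list is an illustrative choice.
local = environment_fingerprint(["modin", "pandas", "ray", "cloudpickle", "rpyc"])
print(local)
```

Calling `find_mismatches(local, remote)` with a fingerprint gathered on the remote side returns the names that need attention; an empty list means the two environments agree on everything that was checked.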

Public interface#

exception modin.experimental.cloud.CannotDestroyCluster(*args, cause: Optional[BaseException] = None, traceback: Optional[str] = None, **kw)#

Raised when cluster cannot be destroyed in the cloud

exception modin.experimental.cloud.CannotSpawnCluster(*args, cause: Optional[BaseException] = None, traceback: Optional[str] = None, **kw)#

Raised when cluster cannot be spawned in the cloud

exception modin.experimental.cloud.ClusterError(*args, cause: Optional[BaseException] = None, traceback: Optional[str] = None, **kw)#

Generic cluster operating exception

modin.experimental.cloud.create_cluster(provider: Union[Provider, str], credentials: Optional[str] = None, region: Optional[str] = None, zone: Optional[str] = None, image: Optional[str] = None, project_name: Optional[str] = None, cluster_name: str = 'modin-cluster', workers: int = 4, head_node: Optional[str] = None, worker_node: Optional[str] = None, add_conda_packages: Optional[list] = None, cluster_type: str = 'rayscale') BaseCluster#

Creates an instance of a cluster with the desired characteristics in a cloud. Upon entering a context via a with statement, Modin redirects its work to the remote cluster. A spawned cluster can be destroyed manually; otherwise it is destroyed when the program exits.

Parameters
  • provider (str or instance of Provider class) – The name of the provider to use, or a Provider object. If a Provider object is given, then credentials, region and zone are ignored.

  • credentials (str, optional) – Path to the file that holds the credentials used by the given cloud provider. If not specified, the cloud provider uses its default means of finding credentials on the system.

  • region (str, optional) – Region in the cloud where the cluster is spawned. If omitted, the default for the given provider is used.

  • zone (str, optional) – Availability zone (part of a region) where the cluster is spawned. If omitted, the default for the given provider and region is used.

  • image (str, optional) – Image to use for spawning head and worker nodes. If omitted, the default for the given provider is used.

  • project_name (str, optional) – Project name to assign to the cluster in cloud, for easier manual tracking.

  • cluster_name (str, optional) – Name to give to the cluster. To spawn multiple clusters in a single region and zone, use different names.

  • workers (int, optional) – Number of worker nodes to spawn in the cluster. The head node is not counted.

  • head_node (str, optional) – What machine type to use for head node in the cluster.

  • worker_node (str, optional) – What machine type to use for worker nodes in the cluster.

  • add_conda_packages (list, optional) – Custom conda packages for remote environments. By default, the remote Modin version is the same as the local version.

  • cluster_type (str, optional) – How to spawn the cluster. Currently, spawning by the Ray autoscaler is supported (“rayscale” for general clusters and “hdk” for HDK-based ones).

Returns

An object that knows how to destroy the cluster and how to activate it as a remote context. Note that by default, spawning and destroying the cluster happen in the background, as these are usually rather lengthy operations.

Return type

BaseCluster descendant
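As an illustration of the parameters above, the following sketch collects keyword arguments for create_cluster and uses the returned object as a context manager. The parameter names come from the signature; every value (region, cluster name, machine types, credentials path) is an assumption chosen for the example, and `run_remote_example` is a hypothetical helper. Running it requires Modin with the remote extras and valid AWS credentials, so the call is left commented out.

```python
# Keyword arguments named after the documented create_cluster parameters.
# All values are illustrative assumptions, not recommended settings.
cluster_options = {
    "provider": "aws",
    "credentials": "aws_credentials",  # assumed path to a credentials file
    "region": "us-west-2",             # assumed AWS region
    "cluster_name": "modin-demo",      # hypothetical cluster name
    "workers": 2,                      # worker nodes; the head node is extra
    "head_node": "m5.large",           # assumed AWS machine types
    "worker_node": "m5.large",
}


def run_remote_example(options):
    """Spawn a cluster and run a tiny computation inside its context."""
    # Imported lazily so the options can be inspected without Modin installed.
    import modin.pandas as pd
    from modin.experimental.cloud import create_cluster

    remote_cluster = create_cluster(**options)
    with remote_cluster:
        # Inside the context, Modin redirects its work to the remote cluster.
        df = pd.DataFrame({"x": [1, 2, 3, 4]})
        return int(df["x"].sum())


# Uncomment to run against a real cloud account (this spawns paid resources):
# print(run_remote_example(cluster_options))
```

Parameters that are left out, such as zone and image here, fall back to the defaults described in the parameter list.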

Notes

Cluster computation also works when proxies are required to access the cloud. Before calling the function, set the usual “http_proxy”/“https_proxy” environment variables for HTTP/HTTPS proxies and the “MODIN_SOCKS_PROXY” variable for a SOCKS proxy.

Using SOCKS proxy requires Ray newer than 0.8.6, which might need to be installed manually.
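The proxy variables can also be set from Python before the cluster is created. This is a minimal sketch: `configure_proxies` is a hypothetical helper, the proxy addresses are placeholders, and the `host:port` form used for MODIN_SOCKS_PROXY is an assumption, since the expected value format is not stated here.

```python
import os


def configure_proxies(http_proxy=None, https_proxy=None, socks_proxy=None):
    """Export proxy settings to the environment, skipping any that are None."""
    settings = {
        "http_proxy": http_proxy,
        "https_proxy": https_proxy,
        "MODIN_SOCKS_PROXY": socks_proxy,
    }
    applied = {name: value for name, value in settings.items() if value is not None}
    os.environ.update(applied)
    return applied


# Placeholder addresses; replace them with the proxies of your own network.
applied = configure_proxies(
    http_proxy="http://proxy.example.com:3128",
    https_proxy="http://proxy.example.com:3128",
    socks_proxy="proxy.example.com:1080",  # value format is an assumption
)
print(sorted(applied))
# create_cluster(...) should be called only after this point.
```

Setting the variables in the shell before starting Python has the same effect; the helper only keeps the configuration next to the code that needs it.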

modin.experimental.cloud.get_connection()#

Returns an RPyC connection object to execute Python code remotely on the active cluster.

Usage examples#

"""
This is a very basic sample script for running things remotely.
It requires an `aws_credentials` file to be present in the current working directory.
For the credentials file format, see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-where
"""
import logging
import modin.pandas as pd
from modin.experimental.cloud import cluster

# Enable verbose logging so that the Ray autoscaler reports its progress
# and we can see that the cluster is being spawned.
logging.basicConfig(format="%(asctime)s %(message)s")
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Spawn an AWS cluster using the credentials file in the working directory.
example_cluster = cluster.create("aws", "aws_credentials")
with example_cluster:
    # Inside the context, Modin redirects its work to the remote cluster.
    remote_df = pd.DataFrame([1, 2, 3, 4])
    print(len(remote_df))  # len() is executed remotely

More examples can be found in the examples/cloud folder.