Frequently Asked Questions (FAQs)#

Below, you will find answers to the most commonly asked questions about Modin. If you still cannot find the answer you are looking for, please post your question on the #support channel on our Slack community or open a Github issue.

FAQs: Why choose Modin?#

What’s wrong with pandas and why should I use Modin?#

While pandas works extremely well on small datasets, as soon as you start working with medium to large datasets that are more than a few GBs, pandas can become painfully slow or run out of memory. This is because pandas is single-threaded. In other words, you can only process your data with one core at a time. This approach does not scale to larger data sets and adding more hardware does not lead to more performance gain.

The DataFrame is a highly scalable, parallel DataFrame. Modin transparently distributes the data and computation so that you can continue using the same pandas API while being able to work with more data faster. Modin lets you use all the CPU cores on your machine, and because it is lightweight, it often has less memory overhead than pandas. See this page to learn more about how Modin is different from pandas.

Why not just improve pandas?#

pandas is a massive community and well established codebase. Many of the issues we have identified and resolved with pandas are fundamental to its current implementation. While we would be happy to donate parts of Modin that make sense in pandas, many of these components would require significant (or total) redesign of the pandas architecture. Modin’s architecture goes beyond pandas, which is why the pandas API is just a thin layer at the user level. To learn more about Modin’s architecture, see the architecture documentation.

How much faster can I go with Modin compared to pandas?#

Modin is designed to scale with the amount of hardware available. Even in a traditionally serial task like read_csv, we see large gains by efficiently distributing the work across your entire machine. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores. This speedup scales efficiently to larger machines with more cores. We have several published papers that include performance results and comparisons against pandas.

How much more data would I be able to process with Modin?#

Often data scientists have to use different tools for operating on datasets of different sizes. This is not only because processing large dataframes is slow, but also pandas does not support working with dataframes that don’t fit into the available memory. As a result, pandas workflows that work well for prototyping on a few MBs of data do not scale to tens or hundreds of GBs (depending on the size of your machine). Modin supports operating on data that does not fit in memory, so that you can comfortably work with hundreds of GBs without worrying about substantial slowdown or memory errors. For more information, see out-of-memory support for Modin.

How does Modin compare to Dask DataFrame and Koalas?#

TLDR: Modin has better coverage of the pandas API, has a flexible backend, better ordering semantics, and supports both row and column-parallel operations. Check out this page detailing the differences!

How does Modin work under the hood?#

Modin is logically separated into different layers that represent the hierarchy of a typical Database Management System. User queries which perform data transformation, data ingress or data egress pass through the Modin Query Compiler which translates queries from the top-level pandas API Layer that users interact with to the Modin Core Dataframe layer. The Modin Core DataFrame is our efficient DataFrame implementation that utilizes a partitioning schema which allows for distributing tasks and queries. From here, the Modin DataFrame works with engines like Ray or Dask to execute computation, and then return the results to the user.

For more details, take a look at our system architecture.

FAQs: How to use Modin?#

If I’m only using my laptop, can I still get the benefits of Modin?#

Absolutely! Unlike other parallel DataFrame systems, Modin is an extremely light-weight, robust DataFrame. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores and allows you to work on data that doesn’t fit in your laptop’s RAM.

How do I use Jupyter or Colab notebooks with Modin?#

You can take a look at this Google Colab installation guide and this notebook tutorial. Once Modin is installed, simply replace your pandas import with Modin import:

# import pandas as pd
import modin.pandas as pd

Which execution engine (Ray or Dask) should I use for Modin?#

Modin lets you effortlessly speed up your pandas workflows with either Ray’s or Dask’s execution engine. You don’t need to know anything about either engine in order to use it with Modin. If you only have one engine installed, Modin will automatically detect which engine you have installed and use that for scheduling computation. If you don’t have a preference, we recommend starting with Modin’s default Ray engine. If you want to use a specific compute engine, you can set the environment variable MODIN_ENGINE and Modin will do computation with that engine:

pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
export MODIN_ENGINE=ray  # Modin will use Ray

pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
export MODIN_ENGINE=dask  # Modin will use Dask

This can also be done with:

from modin.config import Engine

Engine.put("ray")  # Modin will use Ray
Engine.put("dask")  # Modin will use Dask

We also have an experimental HDK-based engine of Modin you can read about here. We plan to support more execution engines in future. If you have a specific request, please post on the #feature-requests channel on our Slack community.

How do I connect Modin to a database via read_sql?#

To read from a SQL database, you have two options:

  1. Pass a connection string, e.g. postgresql://reader:NWDMCE5xdipIjRrp@hh-pgsql-public.ebi.ac.uk:5432/pfmegrnargs

  2. Pass an open database connection, e.g. for psycopg2, psycopg2.connect("dbname=pfmegrnargs user=reader password=NWDMCE5xdipIjRrp host=hh-pgsql-public.ebi.ac.uk")

The first option works with both Modin and pandas. If you try the second option in Modin, Modin will default to pandas because open database connections cannot be pickled. Pickling is required to send connection details to remote workers. To handle the unique requirements of distributed database access, Modin has a distributed database connection called ModinDatabaseConnection:

import modin.pandas as pd
from modin.db_conn import ModinDatabaseConnection
con = ModinDatabaseConnection(
    'psycopg2',
    host='hh-pgsql-public.ebi.ac.uk',
    dbname='pfmegrnargs',
    user='reader',
    password='NWDMCE5xdipIjRrp')
df = pd.read_sql("SELECT * FROM rnc_database",
        con,
        index_col=None,
        coerce_float=True,
        params=None,
        parse_dates=None,
        chunksize=None)

The ModinDatabaseConnection will save any arguments you supply it and forward them to the workers to make their own connections.

How can I contribute to Modin?#

Modin is currently under active development. Requests and contributions are welcome!

If you are interested in contributing please check out the Contributing Guide and then refer to the Development Documentation, where you can find system architecture, internal implementation details, and other useful information. Also check out the Github to view open issues and make contributions.