Benchmarking Modin#
Summary#
To benchmark a single Modin function, turning on the configuration variable BenchmarkMode will often suffice.
There is no simple way to benchmark more complex Modin workflows, though benchmark mode or calling modin.utils.execute on Modin objects may be useful.
The Modin logs may help you identify bottlenecks in your code, and they may also help profile the execution of each Modin function.
Modin’s execution and benchmark mode#
Most of Modin’s execution happens asynchronously, i.e. in separate processes that run independently of the main program flow. Some execution is also lazy, meaning that it doesn’t start immediately once the user calls a Modin function. While Modin provides the same API as pandas, lazy and asynchronous execution can often make it hard to tell how much time each Modin function call takes, as well as to compare Modin’s performance to pandas and other similar libraries.
Note
All examples in this doc use the system specified at the bottom of this page.
Consider the following ipython script:
import modin.pandas as pd
from modin.config import MinRowPartitionSize
import time
import ray
# See the Ray documentation for the Ray configuration that best suits your use case.
ray.init()
df = pd.DataFrame(list(range(MinRowPartitionSize.get() * 2)))
%time result = df.map(lambda x: time.sleep(0.1) or x)
%time print(result)
Modin takes just 2.68 milliseconds for the map, and 3.78 seconds to print the result. However, if we run this script in pandas by replacing import modin.pandas as pd with import pandas as pd, the map takes 6.63 seconds, and printing the result takes just 5.53 milliseconds.
Both pandas and Modin start executing the map as soon as the interpreter evaluates it. While pandas blocks until the map has finished, Modin just kicks off asynchronous functions in remote Ray processes. Printing the function result is fairly fast in pandas and Modin, but before Modin can print the data, it has to wait until all the remote functions complete.
To time how long Modin takes for a single operation, you should typically use benchmark mode. Benchmark mode will wait for all asynchronous remote execution to complete. You can turn benchmark mode on at any point as follows:
from modin.config import BenchmarkMode
BenchmarkMode.put(True)
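Modin configuration options can usually also be set through environment variables before Modin is imported. As a minimal sketch (assuming MODIN_BENCHMARK_MODE is the environment variable backing BenchmarkMode in your Modin version), the same setting would look like this:
import os
# Assumed environment variable for BenchmarkMode; set it before importing Modin.
os.environ["MODIN_BENCHMARK_MODE"] = "True"
import modin.pandas as pd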
Rerunning the script above with benchmark mode on, the Modin map takes 3.59 seconds, and the print takes 183 milliseconds. These timings better reflect where Modin is spending its execution time.
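Outside of IPython, where %time is unavailable, a plain timer gives the same kind of measurement once benchmark mode is on. A minimal sketch mirroring the script above:
import time
import modin.pandas as pd
from modin.config import BenchmarkMode, MinRowPartitionSize
BenchmarkMode.put(True)
df = pd.DataFrame(list(range(MinRowPartitionSize.get() * 2)))
start = time.perf_counter()
# With benchmark mode on, map blocks until all remote execution has finished,
# so the elapsed time reflects the full cost of the operation.
result = df.map(lambda x: time.sleep(0.1) or x)
print(f"map took {time.perf_counter() - start:.2f} seconds")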
A caveat about benchmark mode#
While benchmark mode is often good for measuring the performance of a single Modin function call, it can underestimate Modin’s performance in cases where asynchronous execution improves it. Consider the following script with benchmark mode on:
import numpy as np
import time
import ray
from io import BytesIO
import modin.pandas as pd
from modin.config import BenchmarkMode, MinRowPartitionSize
BenchmarkMode.put(True)
start = time.time()
df = pd.DataFrame(list(range(MinRowPartitionSize.get())), columns=['A'])
result1 = df.map(lambda x: time.sleep(0.2) or x + 1)
result2 = df.map(lambda x: time.sleep(0.2) or x + 2)
result1.to_parquet(BytesIO())
result2.to_parquet(BytesIO())
end = time.time()
print(f'map and write to parquet took {end - start} seconds.')
The script does two slow map operations on a dataframe and then writes each result to a buffer. The whole script takes 13 seconds with benchmark mode on, but just 7 seconds with benchmark mode off. Because Modin can run the map calls asynchronously, it can start writing the first result to its buffer while it’s still computing the second result. With benchmark mode on, Modin has to execute every function synchronously instead.
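If this caveat matters for your workload, one option is to enable benchmark mode only around the call you want to measure and turn it off again afterwards, so the rest of the workflow keeps its asynchronous overlap. A minimal sketch adapting the script above:
import time
from io import BytesIO
import modin.pandas as pd
from modin.config import BenchmarkMode, MinRowPartitionSize
df = pd.DataFrame(list(range(MinRowPartitionSize.get())), columns=['A'])
# Measure just the first map synchronously...
BenchmarkMode.put(True)
start = time.time()
result1 = df.map(lambda x: time.sleep(0.2) or x + 1)
print(f'first map took {time.time() - start} seconds.')
# ...then turn benchmark mode back off so the second map and the writes
# can overlap asynchronously again.
BenchmarkMode.put(False)
result2 = df.map(lambda x: time.sleep(0.2) or x + 2)
result1.to_parquet(BytesIO())
result2.to_parquet(BytesIO())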
How to benchmark complex workflows#
Typically, to benchmark Modin’s overall performance on your workflow, you should start by looking at end-to-end performance with benchmark mode off. It’s common for Modin workflows to end with writing results to one or more files, or with printing some Modin objects to an interactive console. Such endpoints are natural ways to make sure that all of the Modin execution that you require is complete.
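For example, an end-to-end measurement with benchmark mode off might look like the following sketch, where the input file and the workflow steps are hypothetical and the final write is the natural point by which all execution must have finished:
import time
from io import BytesIO
import modin.pandas as pd
start = time.perf_counter()
df = pd.read_csv("data.csv")      # hypothetical input file
result = df.groupby("A").sum()    # hypothetical workflow step
result.to_parquet(BytesIO())      # the write completes only once all execution is done
print(f"end-to-end workflow took {time.perf_counter() - start:.2f} seconds")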
To measure more fine-grained performance, it can be helpful to turn benchmark mode on, but beware that doing so may reduce your script’s overall performance and thus may not reflect where Modin is normally spending execution time, as pointed out above.
Turning on Modin logging and using the Modin logs can also help you profile your workflow. The logs can give a detailed breakdown of the performance of each Modin function at each Modin layer. Log mode is more useful when used in conjunction with benchmark mode.
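A minimal sketch of enabling both together, assuming the LogMode configuration described in Modin’s logging documentation:
from modin.config import BenchmarkMode, LogMode
# Record each Modin function call in the Modin logs...
LogMode.enable()
# ...and make execution synchronous so logged timings are not skewed
# by deferred or asynchronous work.
BenchmarkMode.put(True)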
Sometimes, if you don’t have a natural end-point to your workflow, you can just call modin.utils.execute on the workflow’s final Modin objects. That will typically block on any asynchronous computation:
import time
import ray
from io import BytesIO
import modin.pandas as pd
from modin.config import MinRowPartitionSize, NPartitions
import modin.utils
MinRowPartitionSize.put(32)
NPartitions.put(16)
def slow_add_one(x):
    if x == 5000:
        time.sleep(10)
    return x + 1
# See the Ray documentation for the Ray configuration that best suits your use case.
ray.init()
df1 = pd.DataFrame(list(range(10_000)), columns=['A'])
result = df1.map(slow_add_one)
# %time modin.utils.execute(result)
%time result.to_parquet(BytesIO())
Writing the result to a buffer takes 9.84 seconds. However, if you uncomment the %time modin.utils.execute(result) before the to_parquet call, the to_parquet takes just 23.8 milliseconds!
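The same idea can be wrapped in a small helper that blocks on a Modin object before reporting elapsed time. A sketch reusing df1 and slow_add_one from the script above; the timed helper is hypothetical and not part of Modin’s API:
import time
import modin.utils
def timed(operation, *args, **kwargs):
    # Hypothetical helper: run a Modin operation, then block on its result
    # with modin.utils.execute so the elapsed time includes remote execution.
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    modin.utils.execute(result)
    return result, time.perf_counter() - start
result, seconds = timed(df1.map, slow_add_one)
print(f"map took {seconds:.2f} seconds, including asynchronous execution.")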
Note
If you see any Modin documentation touting Modin’s speed without using benchmark mode or otherwise guaranteeing that Modin is finishing all asynchronous and deferred computation, you should file an issue on the Modin GitHub. It’s not fair to compare the speed of an async Modin function call to an equivalent synchronous call using another library.
Appendix: System Information#
The example scripts here were run on the following system:
OS Platform and Distribution: macOS Monterey 12.4
Modin version: d6d503ac7c3028d871c34d9e99e925ddb0746df6
Ray version: 2.0.0
Python version: 3.10.4
Machine: MacBook Pro (16-inch, 2019)
Processor: 2.3 GHz 8-core Intel Core i9 processor
Memory: 16 GB 2667 MHz DDR4