In today's world, data is critical to how any business or organization is run. How quickly and efficiently you can clean and analyze data can determine how relevant and successful your business is. There is also a lot of it: your dataset is likely more than a few hundred megabytes in size. What if it is upwards of a terabyte? Datasets this large can be incredibly cumbersome to analyze with the pandas library directly, because pandas does not natively take advantage of all of the cores on your host machine.
Modin is a Python library that parallelizes your pandas workloads. It is a drop-in replacement for pandas and parallelizes most of the pandas API. Even on a simple laptop, you can process data up to around four times as fast using Modin! A full exploration of the pandas library is beyond the scope of this guide, but you can check out the pandas documentation.
In the following sections, you will learn how to get started using Modin. You will also learn how to install and configure Modin to work in tandem with either the Ray or Dask computation engine. Let's dive in!
At the time of this writing, the Modin library works with the following Python versions:
You can install Modin by using pip or conda.
To install using pip:
pip install modin
To install using conda:
conda install -c conda-forge modin
Modin works in tandem with either the Ray or Dask computation engine. You can alter your install of Modin to include the computation engine of your choice by changing your pip install to include one of the following targets:
pip install modin[ray]
pip install modin[dask]
pip install modin[all]
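After installing, it can be useful to confirm what actually landed in your environment. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `available_engines` are hypothetical and are not part of Modin. It only checks whether the packages are present, not whether Modin can use them.

```python
from importlib import metadata, util


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def available_engines(candidates=("ray", "dask")):
    """List the candidate engine packages that can be imported here."""
    return [name for name in candidates if util.find_spec(name) is not None]


# Report what is installed; None or an empty list means nothing was found.
print("modin:", installed_version("modin"))
print("engines:", available_engines())
```

If the engine list comes back empty, re-running pip with one of the targets above should add an engine.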
Once the Modin library is installed, you are ready to replace the pandas import in your code.
import pandas as pd can now become
import modin.pandas as pd
And there you have it, you're on your way to a much faster experience using pandas.
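To illustrate the drop-in swap, here is a small sketch in which only the import line differs from ordinary pandas code. The fallback to plain pandas and the tiny made-up dataset are assumptions added for illustration so the snippet also runs where Modin is not installed; on data this small you should not expect any speedup.

```python
# Fall back to plain pandas when Modin is not installed. This fallback is an
# addition for illustration; normally you would just swap the import.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Made-up sample data; the calls below are ordinary pandas API calls.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Pune"],
        "sales": [120, 80, 200, 50],
    }
)

# Group and aggregate exactly as you would with pandas.
totals = df.groupby("city")["sales"].sum()
oslo_total = int(totals["Oslo"])
print(oslo_total)
```

The rest of the code does not need to know which library `pd` refers to, which is the point of a drop-in replacement.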
Modin is a layer of abstraction over Ray and Dask, two different parallel computation engines. Which engine you use is of almost no consequence, as Modin abstracts away all of the complexity of using these engines. The only limitation you will find is that, at the time of this writing, Ray only supports macOS and Linux, so if you are using Windows you are limited to Dask.
Modin is smart enough to detect your installed engine, but if you want more fine-grained control over which engine is used, you can tell Modin explicitly. Modin looks for the MODIN_ENGINE environment variable. On macOS or Linux, you can set this variable by running export MODIN_ENGINE=ray or export MODIN_ENGINE=dask. On Windows, this can be achieved by
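On any platform, the same variable can also be set from Python itself, as long as that happens before Modin is imported. The sketch below uses only the standard library; the helper `default_engine_for` is hypothetical and simply encodes this guide's rule of thumb (Dask on Windows, Ray elsewhere).

```python
import os
import sys


def default_engine_for(platform_name):
    """Pick an engine name following the guide's rule of thumb:
    Dask on Windows, Ray on other platforms."""
    return "dask" if platform_name.startswith("win") else "ray"


# Only set the variable when no engine has been chosen already, so an
# explicit `export MODIN_ENGINE=...` in the shell still takes precedence.
engine = os.environ.setdefault("MODIN_ENGINE", default_engine_for(sys.platform))
print("MODIN_ENGINE =", engine)

# Import Modin only after the variable is set, for example:
# import modin.pandas as pd
```

Using `setdefault` rather than a plain assignment is a design choice here: it lets the script supply a sensible default without overriding a value you set deliberately.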
But what if you have a particular instance of Ray or Dask resident in-memory and you want to utilize it as your parallelization engine? This is easily possible as well! All you have to do is instantiate your Ray or Dask client before you import Modin. Here is an example of how to achieve this with Ray:
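import ray
# Start Ray with your own settings before Modin is imported.
ray.init(plasma_directory="/tmp/plasma", num_cpus=20, num_gpus=2, object_store_memory=10**10)

# Modin picks up the Ray instance that is already running.
import modin.pandas as pd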
Here is an example utilizing Dask:
from distributed import Client
# Create the Dask client before Modin is imported.
dask_client = Client(n_workers=4)

# Modin attaches to the existing Dask client.
import modin.pandas as pd
In this guide, you learned about the Python library Modin and how it can be used on top of either Ray or Dask to drastically speed up computation on your pandas workloads. Enjoy faster data analysis by using this simple, lightweight API on top of pandas! Modin is actively developed and has a bright future that includes plans to provide a SQL API on top of pandas. For more information, examples, and graphs depicting the possible speedups and future work, check out the Modin documentation.