Explore Python Libraries: Make Your DataFrames Parallel With Modin

Jul 23, 2020 • 4 Minute Read

Introduction

Data is central to how any business or organization runs, and how quickly you can clean and analyze it affects how well you can act on it. There is also a lot of it: your dataset is likely larger than a few hundred megabytes, and it may run to a terabyte or more. Datasets of that size are cumbersome to analyze with pandas directly, because pandas does not natively take advantage of all of the cores on your host machine.

Enter Modin.

Modin is a Python library that parallelizes operations on your pandas DataFrames. It is designed as a drop-in replacement for pandas and parallelizes most of the pandas API. Even on a simple laptop, you can process data up to around four times as fast using Modin, depending on your core count and workload. A full exploration of the pandas library is beyond the scope of this guide, but you can learn more in the pandas documentation.

In the following sections, you will learn how to get started using Modin. You will also learn how to install and configure Modin to work in tandem with either the Ray or Dask computation engine. Let's dive in!

Installation

At the time of this writing, the Modin library works with the following Python versions:

  • >=3.6.1

You can install Modin by using pip or conda.

To install using pip:

      pip install modin
    

To install using conda:

      conda install -c conda-forge modin
    

Modin works in tandem with either the Ray or Dask computation engine. To install Modin together with the engine of your choice, add one of the following extras to your pip install. The quotes keep shells such as zsh from treating the square brackets as a glob pattern:

  • pip install "modin[ray]"
  • pip install "modin[dask]"
  • pip install "modin[all]"

Once Modin is installed, you are ready to replace the pandas import in your code: change the line "import pandas as pd" to "import modin.pandas as pd".

And there you have it: you're on your way to a much faster experience using pandas.
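
Because Modin mirrors the pandas API, the import swap can also be made conditional. The following is a minimal sketch rather than anything Modin provides: the helper names is_installed and load_pandas_api are hypothetical, and falling back to plain pandas when Modin is missing is an assumption about what you would want.

```python
import importlib
import importlib.util


def is_installed(package):
    """Return True if a top-level package can be found, without importing it."""
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec(package) is not None


def load_pandas_api():
    """Return modin.pandas when Modin is installed, otherwise plain pandas.

    Hypothetical helper for illustration; it raises ImportError if
    neither library is available.
    """
    name = "modin.pandas" if is_installed("modin") else "pandas"
    return importlib.import_module(name)
```

With this in place, pd = load_pandas_api() gives you a module you can use as usual (pd.read_csv, pd.DataFrame, and so on), and the rest of your code does not need to know which library is behind it.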

Choosing A Computation Engine

Modin is a layer of abstraction over Ray and Dask, two different parallel computation engines. Which engine you use matters little in practice, because Modin abstracts away the complexity of working with them. The main limitation, at the time of this writing, applies to Windows: there you are limited to Dask, as Ray only supports macOS and Linux.

Modin detects your installed engine automatically, but if you want more fine-grained control over which engine is used, you can set the MODIN_ENGINE environment variable to either ray or dask. On macOS or Linux, set it by running export MODIN_ENGINE=<ray|dask>. On Windows, run set MODIN_ENGINE=<ray|dask>.
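
The same variable can also be set from Python, as long as that happens before Modin is imported. Here is a minimal sketch under a few assumptions: the helper names default_engine and select_engine are hypothetical, the list of engines is limited to the two covered in this guide, and the Windows rule simply encodes the restriction described in this section.

```python
import os
import platform

# Only the two engines discussed in this guide (an assumption of this sketch).
SUPPORTED_ENGINES = ("ray", "dask")


def default_engine(system=None):
    """Pick an engine by operating system: Dask on Windows, Ray elsewhere."""
    system = system or platform.system()
    return "dask" if system == "Windows" else "ray"


def select_engine(engine=None):
    """Set MODIN_ENGINE for this process and return the value that was set."""
    engine = (engine or default_engine()).lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    os.environ["MODIN_ENGINE"] = engine
    return engine


select_engine("dask")
# import modin.pandas as pd   # import Modin only after the variable is set
```

Setting the variable in code keeps the choice next to the import it affects, which can be easier to track than a shell profile when several projects share a machine.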

But what if you already have a Ray or Dask instance running and you want Modin to use it as its parallelization engine? This is possible as well. All you have to do is initialize Ray or create your Dask client before you import Modin. Here is an example of how to achieve this with Ray:

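      import ray

      # Start Ray first. The resource values are illustrative, and the accepted
      # arguments vary between Ray releases, so check ray.init for your version.
      ray.init(plasma_directory="/tmp/plasma", num_cpus=20, num_gpus=2, object_store_memory=10**10)

      # Import Modin only after Ray is running so that it uses this instance.
      import modin.pandas as pd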
    

Here is an example utilizing Dask:

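      from distributed import Client

      # Create the Dask client first, here with four workers.
      dask_client = Client(n_workers=4)

      # Import Modin only after the client exists so that it uses this client.
      import modin.pandas as pd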
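
The ordering in the Dask example (client first, Modin second) can be wrapped in a function so that it is hard to get wrong. This is only a sketch: worker_count and start_modin_on_dask are hypothetical helpers, leaving one core free is an arbitrary choice, and setting MODIN_ENGINE inside the function is a precaution for machines where both engines are installed.

```python
import os


def worker_count(reserve=1, total=None):
    """Workers to start, leaving `reserve` cores free for the rest of the system."""
    total = total if total is not None else (os.cpu_count() or 1)
    return max(1, total - reserve)


def start_modin_on_dask(n_workers=None):
    """Create a Dask client, then import Modin so that it uses that client."""
    from distributed import Client  # needs the dask and distributed packages

    client = Client(n_workers=n_workers or worker_count())
    os.environ["MODIN_ENGINE"] = "dask"

    import modin.pandas as pd  # imported only after the client exists

    return client, pd
```

In a script, you would call client, pd = start_modin_on_dask() once at startup, ideally under an if __name__ == "__main__": guard, since starting worker processes at import time can cause problems on some platforms.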
    

Conclusion

In this guide, you learned about the Python library Modin and how it runs on top of either Ray or Dask to speed up computation on your pandas DataFrames. Enjoy faster data analysis by using this simple, lightweight API on top of pandas! At the time of this writing, Modin is actively developed, and its roadmap includes plans to provide a SQL API on top of pandas. For more information, examples, graphs depicting the possible speedups, and details on future work, check out the Modin documentation.

Zachary Bennett

Zach is currently a Lead Software Developer at OpalSoft where he uses tools such as Scala, TypeScript, Python, Docker, Node, and Angular. Zach has a passion for GIS programming along with open-source software. You can view some of his work on GitHub (https://github.com/zbennett10) and Stack Overflow (https://stackoverflow.com/users/6879849/zachary-bennett).