Machine Learning Scalability

Ray, Dask, and Modin provide lower-level primitives for scaling machine learning.

Jul 28, 2020

Listen to the Software Daily podcast on iTunes or Google.

We are resuming our newsletter with a summary of emergent themes from previous episodes.

One problem machine learning practitioners encounter is scaling their Python-based workflows. Today’s newsletter reviews our episodes about data science and machine learning scalability.

Dask: Scalable Python with Matthew Rocklin

Python is the most widely used language for data science, and Python data scientists commonly use several libraries, including NumPy, Pandas, and scikit-learn. These libraries improve the experience of a Python data scientist by giving them access to high-level APIs.

Data science is often performed over huge datasets, and the data structures that are instantiated with those datasets need to be spread across multiple machines. To manage large distributed datasets, a library such as scikit-learn can use a system called Dask. Dask provides distributed data structures such as the Dask dataframe and the Dask array, which mirror the Pandas and NumPy interfaces while partitioning the underlying data.

Matthew Rocklin is the creator of Dask. He joins the show to talk about distributed computing with Dask, its use cases, and the Python ecosystem. He also provides a detailed comparison between Dask and Spark, which is also used for distributed data science.

You have tools that are good, really good for smaller scale data and they’re really intuitive. They give you a real feel for having this kind of grasp of your data. At the large scale, you have these really big batch processing engines that kind of put all of these extra restrictions on what you can do, because there are a lot of things that are kind of challenging to scale that these smaller scale tools do. So we get this kind of rift in data science, where a lot of companies have these pipelines where they start in Pandas, or they might start in Spark and run a batch job to kind of filter down their data so that they can use it in Pandas. And then they use Pandas to kind of do some preliminary analysis and understand the data. And then they hand it off to somebody else to rewrite that workflow into Spark. And this rift here is happening because there’re different requirements for different scales of data. I think data science, it’s in kind of a – I don’t know if weird is the right term for it, but it’s in an interesting place where we don’t necessarily have the right abstractions for performing data science at all scales using the same kind of toolset.

Ray Applications with Richard Liaw

Ray is a general-purpose distributed computing framework. At a low level, Ray provides fault-tolerant primitives that support applications running across multiple processors. At a higher level, Ray supports scalable machine learning workloads such as reinforcement learning and hyperparameter tuning.

In a previous episode, we explored the primitives of Ray as well as Anyscale, the business built around Ray and reinforcement learning. In today’s episode, Richard Liaw explores some of the libraries and applications that sit on top of Ray.

RLlib gives APIs for reinforcement learning such as policy serving and multi-agent environments. Tune gives developers an easy way to do scalable hyperparameter tuning, which is necessary for exploring different types of deep learning configurations. In a future show, we will explore Tune in more detail.

Ray solves a larger problem of making distributed computing simple, and it provides two particular interfaces: the task interface and an actor interface, which allows us to do both stateless and stateful computations across a large cluster while seemingly programming like you’re on a single process. So Tune specifically leverages these stateful interfaces, so this actor interface that Ray provides to scale out the evaluation and tuning of different hyperparameters and neural networks. And furthermore, leverages this idea of the stateful computation and being able to interact with these stateful processes to implement a lot of distributed hyperparameter tuning algorithms that otherwise would need custom-made solutions or custom-made implementations that might not be available for the general public.

Anyscale with Ion Stoica

Machine learning applications are widely deployed across the software industry.

Most of these applications use supervised learning, a process in which labeled data sets are used to find correlations between the labels and the trends in the underlying data. But supervised learning is only one application of machine learning. Another broad set of machine learning methods is described by the term “reinforcement learning.”

Reinforcement learning involves an agent interacting with its environment. As the agent interacts with the environment, it learns to make better decisions over time based on a reward function. Newer AI applications will need to operate in increasingly dynamic environments, and react to changes in those environments, which makes reinforcement learning a useful technique.
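To make the agent, environment, and reward loop concrete, here is a small sketch using an epsilon-greedy agent on a multi-armed bandit, one of the simplest reinforcement learning settings. The reward probabilities, step count, and function name are assumptions made up for this illustration.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit whose arms pay 1.0 with given probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was taken
    values = [0.0] * len(reward_probs)  # running estimate of each action's reward

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: take the action with the best current estimate.
            action = max(range(len(values)), key=values.__getitem__)

        # The environment responds with a reward.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Incremental mean update of the estimate for the chosen action.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values, counts


values, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print(best_action, counts)
```

Over time the agent's estimates approach the true reward probabilities, so it takes the best-paying action more and more often.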

Reinforcement learning has several attributes that make it a distinctly different engineering problem than supervised learning. Reinforcement learning relies on simulation and distributed training to rapidly examine how different model parameters could affect the performance of a model in different scenarios.

Ray is an open source project for distributed applications. Although Ray was designed with reinforcement learning in mind, the potential use cases go beyond machine learning, and could be as influential and broadly applicable as distributed systems projects like Apache Spark or Kubernetes. Ray is a project from the Berkeley RISELab, whose predecessor, the AMPLab, gave rise to Spark, Mesos, and Alluxio.

The RISELab is led by Ion Stoica, a professor of computer science at Berkeley. He is also the co-founder of Anyscale, a company started to commercialize Ray by offering tools and services for enterprises looking to adopt Ray. Ion Stoica returns to the show to discuss reinforcement learning, distributed computing, and the Ray project.

You can have millions of tasks and actors being scheduled. For that one, again, in many systems before, what you have the scheduler, it’s kind of centralized. You have only one entity. Imagine that if I want to schedule something, I’m going to send to this guy, say, it’s called master or something like that, and ask that guy, master, to schedule on my behalf. That’s an easier problem because the master will have more of a global state. There is only one guy to schedule the task or actors. However, like we discussed, the problem with that is that doesn’t scale. Now, you have to distribute the scheduling and this is what makes it difficult and is one of the innovations in Ray: having the distributed scheduler where you allow every node to independently schedule and then there is a way in which you are going to allow mistakes to happen, say, for instance too many tasks are scheduled on a node and then to correct these mistakes by taking some of these tasks and moving to another node till everything is more or less load balanced.
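The idea of letting nodes schedule independently and then correcting imbalances can be illustrated with a toy simulation. This is not Ray's actual scheduling algorithm; the `rebalance` function and the starting loads are hypothetical, and the sketch only shows the "move tasks off overloaded nodes until roughly balanced" step described in the quote.

```python
def rebalance(loads, tolerance=1):
    """Move tasks one at a time from the busiest node to the least busy node.

    Stops once the gap between the busiest and least busy node is at most
    `tolerance`. Returns the new per-node loads and the number of moves.
    """
    if tolerance < 1:
        # With a tolerance of 0 the loop may never end when the total
        # number of tasks does not divide evenly across nodes.
        raise ValueError("tolerance must be at least 1")

    loads = list(loads)
    moves = 0
    while max(loads) - min(loads) > tolerance:
        src = loads.index(max(loads))  # overloaded node
        dst = loads.index(min(loads))  # underloaded node
        loads[src] -= 1
        loads[dst] += 1
        moves += 1
    return loads, moves


# Each node scheduled its own tasks locally, so node 0 ended up overloaded.
initial = [9, 1, 2, 0]
balanced, moves = rebalance(initial)
print(balanced, moves)  # [3, 3, 3, 3] 6
```

A centralized scheduler would avoid the initial imbalance by using global state, at the cost of becoming a bottleneck; the distributed approach accepts temporary mistakes and fixes them afterward.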

Modin: Pandas Scalability with Devin Petersohn

Pandas is a Python data analysis library, and an essential tool in data science. Pandas allows users to load large quantities of data into a data structure called a dataframe, over which the user can call mathematical operations. When the data fits entirely into memory this works well, but sometimes there is too much data for a single box.

The Modin project scales Pandas workflows to multiple cores and machines by running on Dask or Ray, which provide distributed computing primitives for Python programs. Modin builds an execution plan for operations over large dataframes, which makes data science considerably easier for these large data sets.

Devin Petersohn started the Modin project, and he joins the show to talk about data science with Python, and his work in the Berkeley RISELab.

The problem with tools like Spark at that scale is that they require the user to be very intimate with the running environment that they’re operating in, and also with the data layout. Spark puts some restrictions on requiring to understand kind of partitioning and trying to understand when do you want to trigger computation. The kind of lazy, the lazy evaluation paradigm. I mean, Spark is a great tool, and I do not have anything ill against Spark. But what we’re seeing in the real-world and a lot of business applications is actually that the typical Pandas user is not an expert in distributed computing. And Spark is a tool that kind of requires you to have this distributed computing understanding to not be penalized for your performance.

Jeff Daily
