MLeap: Deploy ML Pipelines to Production
<a href="https://combust.github.io/mleap-docs/"><img src="logo.png" alt="MLeap Logo" width="176" height="70" /></a>
Deploying machine learning data pipelines and algorithms should not be a time-consuming or difficult task. MLeap allows data scientists and engineers to deploy machine learning pipelines from Spark and Scikit-learn to a portable format and execution engine.
Documentation
Documentation is available at https://combust.github.io/mleap-docs/.
Read the "Serializing a Spark ML Pipeline and Scoring with MLeap" guide to gain a full sense of what is possible.
Introduction
Using the MLeap execution engine and serialization format, we provide a performant, portable and easy-to-integrate production library for machine learning data pipelines and algorithms.
For portability, we build our software on the JVM and only use serialization formats that are widely-adopted.
We also provide a high level of integration with existing technologies.
Our goals for this project are:
- Allow researchers, data scientists, and engineers to continue building data pipelines and training algorithms with Spark and Scikit-learn
- Extend Spark/Scikit-learn/TensorFlow by providing serialization/deserialization of ML pipelines to/from a common framework (Bundle.ML)
- Use the MLeap Runtime to execute your pipeline and algorithm without dependencies on Spark or Scikit-learn (NumPy, pandas, etc.)
Overview
- Core execution engine implemented in Scala
- Spark, PySpark and Scikit-Learn support
- Export a model with Scikit-learn or Spark and execute it using the MLeap Runtime (without dependencies on the SparkContext or on sklearn/numpy/pandas)
- Choose from two portable serialization formats (JSON and Protobuf)
- Implement your own custom data types and transformers for use with MLeap data frames and transformer pipelines
- Extensive test coverage with full parity tests for Spark and MLeap pipelines
- Optional Spark transformer extension to extend Spark's default transformer offerings
Dependency Compatibility Matrix
Other versions besides those listed below may also work (especially more recent Java versions for the JRE), but these are the configurations that MLeap tests.
| MLeap Version | Spark Version | Scala Version | Java Version | Python Version | XGBoost Version | TensorFlow Version |
|---------------|---------------|------------------|--------------|----------------|-----------------|--------------------|
| 0.24.0 | 4.0.1 | 2.13.16 | 17 | 3.9 - 3.13 | 2.0.3 | 2.10.1 |
| 0.23.4 | 3.4.4 | 2.12.18 | 11 | 3.7 - 3.12 | 1.7.6 | 2.10.1 |
| 0.23.3 | 3.4.0 | 2.12.18 | 11 | 3.7, 3.8 | 1.7.6 | 2.10.1 |
| 0.23.2 | 3.4.0 | 2.12.18 | 11 | 3.7, 3.8 | 1.7.6 | 2.10.1 |
| 0.23.1 | 3.4.0 | 2.12.18 | 11 | 3.7, 3.8 | 1.7.6 | 2.10.1 |
| 0.23.0 | 3.4.0 | 2.12.13 | 11 | 3.7, 3.8 | 1.7.3 | 2.10.1 |
| 0.22.0 | 3.3.0 | 2.12.13 | 11 | 3.7, 3.8 | 1.6.1 | 2.7.0 |
| 0.21.1 | 3.2.0 | 2.12.13 | 11 | 3.7 | 1.6.1 | 2.7.0 |
| 0.21.0 | 3.2.0 | 2.12.13 | 11 | 3.6, 3.7 | 1.6.1 | 2.7.0 |
| 0.20.0 | 3.2.0 | 2.12.13 | 8 | 3.6, 3.7 | 1.5.2 | 2.7.0 |
| 0.19.0 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.3.1 | 2.4.1 |
| 0.18.1 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.0.0 | 2.4.1 |
| 0.18.0 | 3.0.2 | 2.12.13 | 8 | 3.6, 3.7 | 1.0.0 | 2.4.1 |
| 0.17.0 | 2.4.5 | 2.11.12, 2.12.10 | 8 | 3.6, 3.7 | 1.0.0 | 1.11.0 |
Setup
Link with Maven or SBT
SBT
libraryDependencies += "ml.combust.mleap" %% "mleap-runtime" % "0.24.0"
Maven
<dependency>
<groupId>ml.combust.mleap</groupId>
<artifactId>mleap-runtime_2.13</artifactId>
<version>0.24.0</version>
</dependency>
For Spark Integration
SBT
libraryDependencies += "ml.combust.mleap" %% "mleap-spark" % "0.24.0"
Maven
<dependency>
<groupId>ml.combust.mleap</groupId>
<artifactId>mleap-spark_2.13</artifactId>
<version>0.24.0</version>
</dependency>
PySpark Integration
Install MLeap from PyPI
$ pip install mleap
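Releases on PyPI track the compatibility matrix above, so it can help to pin the version that matches your Spark and Python setup; for example, pinning the latest row of the matrix:

$ pip install mleap==0.24.0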
Using the Library
For more complete examples, see our other Git repository: MLeap Demos
Create and Export a Spark Pipeline
The first step is to create our pipeline in Spark. For our example we will manually build a simple Spark ML pipeline.
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.bundle.SparkBundleContext
import org.apache.spark.ml.feature.{Binarizer, StringIndexer}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.util.Using
val datasetName = "./examples/spark-demo.csv"
val dataframe: DataFrame = spark.sqlContext.read.format("csv")
.option("header", true)
.load(datasetName)
.withColumn("test_double", col("test_double").cast("double"))
// Use out-of-the-box Spark transformers like you normally would
val stringIndexer = new StringIndexer().
setInputCol("test_string").
setOutputCol("test_index")
val binarizer = new Binarizer().
setThreshold(0.5).
setInputCol("test_double").
setOutputCol("test_bin")
val pipelineEstimator = new Pipeline()
.setStages(Array(stringIndexer, binarizer))
val pipeline = pipelineEstimator.fit(dataframe)
// then serialize pipeline
val sbc = SparkBundleContext().withDataset(pipeline.transform(dataframe))
Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bf =>
pipeline.writeBundle.save(bf)(sbc).get
}
The dataset used for training can be found at examples/spark-demo.csv in the MLeap repository.
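The save call above uses the bundle writer's default serialization format. Bundle.ML supports both JSON and Protobuf (the two formats noted in the overview); here is a minimal sketch of selecting one explicitly, assuming the same pipeline and sbc as above and that your MLeap version exposes the format selector on the bundle writer:

import ml.combust.bundle.serializer.SerializationFormat

// same save as before, but explicitly requesting the JSON flavor of Bundle.ML
Using(BundleFile("jar:file:/tmp/simple-json-pipeline.zip")) { bf =>
  pipeline.writeBundle.format(SerializationFormat.Json).save(bf)(sbc).get
}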
Spark pipelines are not meant to be run outside of Spark: they require a DataFrame, and therefore a SparkContext, both of which are expensive to include in a project. With MLeap, there is no dependency on Spark to execute a pipeline; MLeap's dependencies are lightweight, and it uses fast data structures to execute your ML pipelines.
PySpark Integration
Import the MLeap library in your PySpark job
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
See the PySpark Integration section of python/README.md for more.
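Once imported, mleap.pyspark attaches serialization methods to Spark's pipeline models. A brief sketch of exporting and re-importing a bundle, assuming a fitted PipelineModel named fittedPipeline and a training DataFrame df (both names are illustrative, not from this README):

# serializing a transformed DataFrame along with the model lets MLeap
# record the pipeline's input/output schema in the bundle
fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                 fittedPipeline.transform(df))

# a bundle can also be loaded back into PySpark
from pyspark.ml import PipelineModel
deserializedPipeline = PipelineModel.deserializeFromBundle(
    "jar:file:/tmp/pyspark.example.zip")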
Create and Export a Scikit-Learn Pipeline
import pandas as pd
from mleap.sklearn.pipeline import Pipeline
from mleap.sklearn.preprocessing.data import FeatureExtractor, LabelEncoder, ReshapeArrayToN1
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame(['a', 'b', 'c'], columns=['col_a'])
categorical_features = ['col_a']
feature_extractor_tf = FeatureExtractor(input_scalars=categorical_features,
output_vector='imputed_features',
output_vector_items=categorical_features)
# Label encoder for the col_a labels
label_encoder_tf = LabelEncoder(input_features=feature_extractor_tf.output_vector_items,
output_features='{}_label_le'.format(categorical_features[0]))
# Reshape the output of the LabelEncoder to N-by-1 array
reshape_le_tf = ReshapeArrayToN1()
# One-hot encode the label-encoded col_a column
one_hot_encoder_tf = OneHotEncoder(sparse=False)
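# note: newer scikit-learn releases (>= 1.2) rename this flag to sparse_output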
one_hot_encoder_tf.mlinit(prior_tf = label_encoder_tf,
output_features = '{}_label_one_hot_encoded'.format(categorical_features[0]))
one_hot_encoder_pipeline_x0 = Pipeline([
(feature_extractor_tf.name, feature_extractor_tf),
(label_encoder_tf.name, label_encoder_tf),
(reshape_le_tf.name, reshape_le_tf),
(one_hot_encoder_tf.name, one_hot_encoder_tf)
])
one_hot_encoder_pipeline_x0.mlinit()
one_hot_encoder_pipeline_x0.fit_transform(data)
one_hot_encoder_pipeline_x0.serialize_to_bundle('/tmp', 'mleap-scikit-test-pipeline', init=True)
# array([[ 1., 0., 0.],
# [ 0., 1., 0.],
# [ 0., 0., 1.]])
Load and Transform Using MLeap
Because we export Spark and Scikit-learn pipelines to a standard format, we can use either our Spark-trained pipeline or our Scikit-learn pipeline from the previous steps to demonstrate usage of MLeap in this section. The choice is yours!
import ml.combust.bundle.BundleFile
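import ml.combust.mleap.core.types._
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import scala.util.Using

// A minimal sketch of scoring with the MLeap Runtime, with no SparkContext
// required. It assumes the Spark bundle saved earlier and mirrors that
// example's column names.
val bundle = Using(BundleFile("jar:file:/tmp/simple-spark-pipeline.zip")) { bf =>
  bf.loadMleapBundle().get
}.get

// build a LeapFrame, MLeap's lightweight data frame
val schema = StructType(StructField("test_string", ScalarType.String),
  StructField("test_double", ScalarType.Double)).get
// test_string values must be labels the StringIndexer saw during training
val data = Seq(Row("hello", 0.6), Row("MLeap", 0.2))
val frame = DefaultLeapFrame(schema, data)

// run the pipeline and inspect the resulting rows
val mleapPipeline = bundle.root
val transformed = mleapPipeline.transform(frame).get
transformed.dataset.foreach(println)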