Quantization

A deep dive into Apple's coremltools quantization: Reduce the size of a Core ML model without losing (too much) accuracy and performance

Last year Apple gave us Core ML, an easy-to-use framework for running trained models on our devices. However, the technology was not without its challenges. There was limited integration with third-party frameworks, training was still a non-trivial process (we covered how to train your own Core ML model last year), and model sizes could run into hundreds of MBs.

This year Apple introduced an array of solutions to address these challenges. Among them are support for more third-party ML frameworks, the ability to define custom models and layers, the introduction of Create ML for easy training, and quantization for size reduction.

In this post we are going to dig a little deeper into one of these new features: quantization. Model size is one of the most common reasons for skipping a local model and opting for an online cloud solution. Fully trained models can run into hundreds of MBs and can easily deter potential users from downloading our app. However, if you followed WWDC’s What’s new in Core ML session, you got a taste of quantization: an approach that can possibly cut the size of a fully trained model by two thirds without losing much accuracy or performance.

So let’s test it out together. We're going to take a previously trained model for food classification and see what kind of size/accuracy trade-off we can get through quantization.

But first, let's go over quantization and what it really means to quantize a model. Perhaps the simplest way to explain the idea is as “reducing the resolution of a model”. Each trained Core ML model comes with a finite number of weights that are set when the model is trained. Imagine each of these weights represents 1 cm^2 of an image. In a high resolution image you can fit a lot of pixels into that space and get a crisp, clear picture of a pizza. However, if the only purpose of the image is for the viewer to recognize that they're looking at a pizza, you don't need that many pixels in that 1 cm^2. You can opt for fewer pixels in that space and still get something that resembles a pizza. In fact, you can reduce the pixel count by quite a bit and still see pizza. It's at the lower end where things get more complicated and the image starts to look like something that could be a plate of pasta or lasagna. We will see similar behavior later on.

Depending on the model, you could be dealing with tens of millions of weights, which by default are stored as Float32 (since iOS 11.2, weights can also be stored as half precision Float16). A Float32 is a 32-bit single precision floating point number that takes 4 bytes. With 32 bits there are roughly 4.3 billion (2^32) possible bit patterns, so each weight can take on billions of distinct values. It turns out we can reduce the possibilities to a much smaller subset and retain most of our accuracy.
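To make the arithmetic concrete, here is a minimal sketch of how the storage needed for the weights scales with the number of bits per weight. The weight count of 10 million is a made-up round number chosen for illustration, not a measurement of any particular model, and a real model file also contains data other than weights.

```python
def weights_size_mb(num_weights, bits_per_weight):
    """Approximate storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_weights * bits_per_weight / 8 / 1e6


# Hypothetical weight count, chosen only to illustrate the arithmetic.
NUM_WEIGHTS = 10_000_000

for bits in (32, 16, 8, 4):
    print("{:>2} bits per weight: {:5.1f} MB".format(bits, weights_size_mb(NUM_WEIGHTS, bits)))
```

Halving the bits per weight halves the weight storage, which is why the bit size is the main lever for shrinking a model.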

<img src ="https://s3.amazonaws.com/pixpit/quantized/quantize.png" width ="100%" style ="margin: 0 auto"> *(What's new in Core ML, WWDC 2018)*

When we quantize a model, we iterate through its weights and store them in a number format with lower precision. Float32 weights can be reduced to half precision (16 bits), to 8 bits, or lower. The quantized values can be distributed linearly, or through a lookup table that is linear, generated by k-means, or defined by a custom function.
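The linear case can be sketched in a few lines of plain Python. This is a simplified illustration of the general idea (one shared scale and bias mapping floats onto evenly spaced integer codes), not the implementation used inside coremltools; the function names and sample weights are made up for this example.

```python
def linear_quantize(weights, nbits):
    """Map floats to integer codes in [0, 2**nbits - 1] using a shared scale and bias."""
    lo, hi = min(weights), max(weights)
    levels = (1 << nbits) - 1
    # Guard against a zero range when all weights are identical.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [int(round((w - lo) / scale)) for w in weights]
    return codes, scale, lo


def linear_dequantize(codes, scale, bias):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + bias for c in codes]


# Made-up weights, quantized down to 2 bits (4 possible values).
weights = [-1.0, -0.4, 0.1, 0.3, 1.0]
codes, scale, bias = linear_quantize(weights, nbits=2)
restored = linear_dequantize(codes, scale, bias)
print(codes)
print(restored)
```

Each restored weight lands on the nearest of the evenly spaced levels, so the error per weight is at most half the spacing between levels. Fewer bits means wider spacing and larger error.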

We can see that there are multiple options available to us. We have to pick the bit size we want to quantize down to and the function we want to use for the quantization distribution. It's important not to forget that reducing precision doesn’t come free; it will affect how the model performs. However, we can reduce precision by quite a bit before we notice a major reduction in accuracy.

So if there is a sweet spot between accuracy and quantization, where is it? How can we find it? The bad news is there is no simple formula; a lot of this will depend on your model and how it's used. The good news is that quantizing a model and testing it can be done fairly quickly. So let's Goldilocks it.

We will quantize a model into all its possible bit levels and functions. Then we will run a test against each quantized model and compare its accuracy against the full precision model. We then use the data collected to find the Goldilocks model: the one that gives the largest reduction in size for the least loss in accuracy.
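One way to phrase that selection as code: keep only the models whose accuracy drop stays within a tolerance, then take the smallest. This is a sketch under assumptions. The `pick_goldilocks` helper, the dictionary keys, the tolerance and all the numbers below are placeholders invented for illustration; they are not results from this article.

```python
def pick_goldilocks(results, baseline_accuracy, max_loss):
    """Return the smallest model whose accuracy drop stays within max_loss, or None."""
    acceptable = [r for r in results if baseline_accuracy - r["accuracy"] <= max_loss]
    if not acceptable:
        return None
    return min(acceptable, key=lambda r: r["size_mb"])


# Placeholder numbers for illustration only; not measurements.
results = [
    {"name": "model_a", "size_mb": 2.0, "accuracy": 0.80},
    {"name": "model_b", "size_mb": 1.0, "accuracy": 0.79},
    {"name": "model_c", "size_mb": 0.5, "accuracy": 0.40},
]
best = pick_goldilocks(results, baseline_accuracy=0.80, max_loss=0.02)
print(best["name"])
```

How much accuracy loss is acceptable is a judgment call that depends on the app, which is why the tolerance is a parameter rather than a fixed value.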

For this example I will be using a SqueezeNet model that I’ve trained to classify 101 different dishes. I have already converted the model to Core ML and I’m ready to quantize it.

Before we can quantize a model we need to get the latest version of coremltools. At the time of writing, we are on 2.0b1 which is in beta. To get this version we need to run pip install coremltools==2.0b1

The method we are interested in is quantize_weights. Let's look at its documentation. The documentation lists four different modes for quantize_weights. However, at the time of this writing the modes mentioned in the documentation differ from what is actually available in coremltools. The modes in the documentation are linear, linear_lut, kmeans_lut and custom_lut. The modes that are actually available are linear, kmeans, linear_lut, custom_lut and dequantization. We will omit custom_lut and dequantization since they are beyond the scope of this article and focus on linear, linear_lut and kmeans.

LUT stands for lookup table.
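To show what a lookup table buys us, here is a rough sketch of a k-means generated table in plain Python. Each weight is replaced by a small index into a table of 2^nbits shared values. This is a textbook 1-D k-means written for illustration, with made-up function names and sample weights; it is an assumption about the general technique, not the code coremltools runs.

```python
def nearest(lut, w):
    """Index of the lookup table entry closest to w."""
    return min(range(len(lut)), key=lambda j: abs(w - lut[j]))


def kmeans_lut(weights, nbits, iterations=20):
    """Build a lookup table of 2**nbits centroids with plain 1-D k-means."""
    k = 1 << nbits
    lo, hi = min(weights), max(weights)
    # Start from evenly spaced centroids so the result is deterministic.
    lut = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each weight points at its nearest centroid.
        indices = [nearest(lut, w) for w in weights]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [w for w, i in zip(weights, indices) if i == j]
            if members:
                lut[j] = sum(members) / len(members)
    # Final assignment against the finished table.
    return lut, [nearest(lut, w) for w in weights]


# Made-up weights, quantized to 2 bits (a table of 4 values).
weights = [-0.9, -0.8, -0.1, 0.0, 0.1, 0.8, 0.9, 1.0]
lut, indices = kmeans_lut(weights, nbits=2)
print(lut)
print(indices)
```

Unlike the linear case, the table entries are not evenly spaced: they move to wherever the weights cluster, which can preserve more accuracy at very low bit sizes.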

Once coremltools version 2.0b1 is installed, we can run the following Python script. Ensure that the script is located in the same folder as our original model. The script will create every combination of bit size and function available for quantizing the model.

import coremltools
from coremltools.models.neural_network.quantization_utils import quantize_weights

# Name of the model file, without its .mlmodel extension.
model_name = "food"

model = coremltools.models.MLModel(model_name + ".mlmodel")

functions = ["linear", "linear_lut", "kmeans"]

for function in functions:
    for bit in [16, 8, 7, 6, 5, 4, 3, 2, 1]:
        print("processing {} on {} bits.".format(function, bit))
        # Quantize the full precision model down to `bit` bits per weight.
        quant_model = quantize_weights(model, bit, function)
        quant_model.short_description = "{} bit per quantized weight, using {}.".format(bit, function)
        quant_model.save("{}_{}_{}.mlmodel".format(model_name, function, bit))

First we set model_name to the name of the model. This should be the same as the name of the file without its mlmodel extension.

Then we run python run.py to create all the permutations.

In less than ten minutes, we’re the proud owners of 27 new models, all in different sizes. We can see that quantization can result in a substantial reduction in size: every quantized model is considerably smaller than the full precision model.

Just by looking at the data, it seems that halving the precision to 16 bits reduced the model size by 40%. This reveals just how much of a model is actually composed of weights.
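If you'd rather not eyeball file sizes in Finder, a short script can list every model next to its size relative to the original. This is a minimal sketch that assumes the quantized models were saved in the same folder as the original model; `size_report` is a helper name made up for this example.

```python
import os


def size_report(folder, original_name):
    """Return (file name, size in bytes, fraction of original size) for each .mlmodel file."""
    original_size = os.path.getsize(os.path.join(folder, original_name))
    report = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".mlmodel"):
            size = os.path.getsize(os.path.join(folder, name))
            report.append((name, size, size / original_size))
    return report


# Only runs when the original model is present in the current folder.
if os.path.exists("food.mlmodel"):
    for name, size, ratio in size_report(".", "food.mlmodel"):
        print("{:<40} {:>12} bytes  {:6.1%} of original".format(name, size, ratio))
```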

Of these 27 models, one holds the most reduction in size for the least reduction in accuracy. The question is, which one?

There are a few options available. The first is a method provided by coremltools called compare_models. We pass it the original full precision model, the quantized model and a folder of sample images, and it reports how well the two models match.

compare_models(model, lin_quant_model, 'testing_data/pizza')

Analyzing 100 images
Running Analysis this may take a while ...


Analyzed 100/100

Output prob:
--------------
Top 5 Agreement: 100.0%

Output classLabel:
--------------------
Top 1 Agreement: 98.0%

The problem with this method is that there isn't much we can do with it beyond observing what it prints to the console. Nothing else is returned.
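If all you need is the agreement figure as a value you can store, the calculation itself is simple to reproduce. The sketch below assumes you have already collected the top predicted label from each model for the same images, in the same order (for example by calling each model's predict method on macOS). The `top1_agreement` helper and the sample labels are made up for illustration.

```python
def top1_agreement(labels_a, labels_b):
    """Percentage of inputs for which two models predict the same top label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must be evaluated on the same inputs")
    if not labels_a:
        return 0.0
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return 100.0 * matches / len(labels_a)


# Made-up predictions, purely to exercise the function.
full_precision = ["pizza", "pizza", "lasagna", "pizza"]
quantized = ["pizza", "pizza", "pizza", "pizza"]
print(top1_agreement(full_precision, quantized))
```

Note that agreement measures how closely the quantized model tracks the original, not how often either model is actually right.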

If you want more data and a more comprehensive comparison between multiple models, there is another powerful tool at your disposal: Xcode Playgrounds.

At the time of writing we're on Xcode 10.0 Beta (10L176w) and macOS Mojave 10.14 Beta (18A293u)

One of the many great things about Xcode Playgrounds is that you can perform inference on a fully trained Core ML model directly from the playground. There is no need to create a full-fledged iOS or macOS app.

So with that in mind we are going to start a new Playground. We will iterate through the models, test their accuracy against our data and save the information we've collected from the tests into a CSV file. I have posted one way this can be done below. Although it may seem like a lot of code, it doesn't do anything beyond what I just described. If you're interested in playing around with it (no pun intended) here is a link to the repo with the Playground file, models and the test data.

import Vision
import CoreML
import Cocoa

let testingFolder = "/Users/rezashirazian/Projects/Practice/Quantize/testing_data/"

let modelFolder = "/Users/rezashirazian/Projects/Practice/Quantize/"

func getCIImage(url: URL) -> CIImage {
    guard let image = NSImage(contentsOf: url) else {
        fatalError()
    }
    let data = image.tiffRepresentation!
    let bitmap = NSBitmapImageRep(data: data)!
    let ciimage = CIImage(bitmapImageRep: bitmap)!
    return ciimage
}

func getFoldersInDirectory(path: String) -> [String:URL]  {
    guard let contents = try? FileManager.default.contentsOfDirectory(atPath: path) else {
        print("Make sure you hav
