As machine learning problems get more complex and training data gets more massive, machine learning models are growing in size and variety as well. This creates bigger challenges for serving models in production. To achieve better speed, we turn to GPUs. To serve bigger models, and more of them, we look to multiple GPUs. In this article, we will discuss how to serve multiple large ML models across multiple GPUs with Tensorflow Serving on a multi-GPU machine.
Tensorflow Serving is a high-performance model serving framework, developed by the Tensorflow team, for serving machine learning models in production environments. It offers a lot of nice features, including out-of-the-box support for serving multiple models and multiple versions of a model, both gRPC and REST endpoints, and request batching.
Tensorflow Serving was released back in 2016 and has been around for quite a few years. New features have been continuously added along with the development of the Tensorflow framework itself. With the features it currently offers, you can easily use it to serve one or several machine learning models on a single GPU.
However, when it comes to using Tensorflow Serving to serve multiple large ML models across multiple GPUs on a multi-GPU machine, things get a little messy and the solution is not that straightforward. Here is my experience of how to resolve it.
The problem we are trying to solve is serving multiple large machine learning models across multiple GPUs on a multi-GPU machine with Tensorflow Serving. If all of your models together fit comfortably into the memory of a single GPU, then this situation probably does not apply to you.
The ML serving system we are having trouble with when using Tensorflow Serving is one where the combined size of the models is larger than the memory of any single GPU, so the models have to be spread across several GPUs on the same machine.
With that said, imagine that our ML system looks like the following: 12 models, roughly 4 GB each (about 48 GB in total), to be served on a single machine with 4 GPUs, each with about 15 GB of memory.
Notice that in this ML system, we can't use just 1 GPU to serve all 12 models, because they don't fit. We have to use all 4 GPUs to serve those 12 models. Let's take a look at how Tensorflow Serving can be used to serve our 12 models.
If you have already been using Tensorflow Serving, then you are probably familiar with the typical ways of running the Tensorflow Serving server.
To serve multiple models using the Tensorflow Serving GPU docker container:
docker run --runtime=nvidia -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  --mount type=bind,source=/path/to/my/models.config,target=/models/models.config \
  -t tensorflow/serving:latest-gpu \
  --model_config_file=/models/models.config
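For reference, the models.config file mounted above uses Tensorflow Serving's standard model_config_list in protobuf text format. A minimal sketch, with hypothetical model names and paths, might look like this:

model_config_list {
  config {
    name: "model_0"
    base_path: "/models/model_0"
    model_platform: "tensorflow"
  }
  config {
    name: "model_1"
    base_path: "/models/model_1"
    model_platform: "tensorflow"
  }
  # ... one config entry per model, up to model_11
}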
If for some reason you can't use docker or the Tensorflow Serving docker container, then you can serve multiple models using the tensorflow_model_server binary with GPU support:
tensorflow_model_server --port=8500 --rest_api_port=8501 --model_config_file=/models/models.config
The problem is, Tensorflow Serving by default allocates only 1 GPU ("/GPU:0") to load all the models, and as soon as we start our model server with all 12 models, the model server crashes. Remember that the total size of our 12 models is 48 GB, while a single GPU has only 15 GB of memory. Because the combined size of the 12 models is bigger than a single GPU's memory, the GPU runs out of memory very quickly, before the model server is able to fully load all the models.
Normally, if we are using regular Tensorflow only, we can easily specify which GPU loads which model, and evenly split the models across the GPUs, by doing the following:
/GPU:0 will load models 0, 4, 8
/GPU:1 will load models 1, 5, 9
/GPU:2 will load models 2, 6, 10
/GPU:3 will load models 3, 7, 11

with tf.device('/GPU:0'):
    ...  # load model 0, model 4, model 8
with tf.device('/GPU:1'):
    ...  # load model 1, model 5, model 9
with tf.device('/GPU:2'):
    ...  # load model 2, model 6, model 10
with tf.device('/GPU:3'):
    ...  # load model 3, model 7, model 11
However, with Tensorflow Serving, there is no interface or command-line argument to control which GPU loads which model. At least not yet. The tensorflow_model_server command takes all 12 of our models and by default tries to load all of them onto a single GPU instead.
Although tensorflow_model_server does not offer a way to split our models across multiple GPUs, we can split the models across the GPUs ourselves when we save them after training.
During training, if we don't specify a GPU device, Tensorflow will use the default GPU device ("/GPU:0") to place the graph, and when the model is saved, no device placement is recorded either. When serving the models, Tensorflow Serving, upon not seeing a specific device placement, will again choose the default GPU device ("/GPU:0") to load the models. This is not good if the total size of our models is bigger than a single GPU's memory.
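If you want to check what placement (if any) ended up in an exported model, one way — a sketch, assuming a SavedModel with the usual serving_default signature at a hypothetical path — is to look at the device strings recorded on its graph nodes:

import tensorflow as tf

loaded = tf.saved_model.load("/path/to/model_1/1")  # hypothetical path
serving_fn = loaded.signatures["serving_default"]
# Collect the device strings recorded on the graph's nodes; an empty set means
# no explicit placement was saved, so the default device is used at load time.
devices = {node.device for node in serving_fn.graph.as_graph_def().node if node.device}
print(devices)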
With that understanding, the solution to our problem is straightforward: save our models with a specific device placement during training. A quick illustration of how that looks:
/GPU:0 will be used to train and save models 0, 4, 8
/GPU:1 will be used to train and save models 1, 5, 9
/GPU:2 will be used to train and save models 2, 6, 10
/GPU:3 will be used to train and save models 3, 7, 11

with tf.device('/GPU:0'):
    ...  # train and save model 0, model 4, model 8
with tf.device('/GPU:1'):
    ...  # train and save model 1, model 5, model 9
with tf.device('/GPU:2'):
    ...  # train and save model 2, model 6, model 10
with tf.device('/GPU:3'):
    ...  # train and save model 3, model 7, model 11
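Putting it together, here is a minimal sketch of the training side, using a tiny stand-in Keras model (the model definition, data, and export paths are placeholders for your real ones); the important part is that building, training, and saving all happen inside an explicit tf.device() scope:

import tensorflow as tf

NUM_GPUS = 4

for i in range(12):
    gpu_id = i % NUM_GPUS  # same round-robin split: models 0, 4, 8 -> /GPU:0, etc.
    with tf.device(f"/GPU:{gpu_id}"):
        # Stand-in model and data; replace with your real model and training loop.
        model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")
        model.fit(tf.random.normal((32, 8)), tf.random.normal((32, 1)), epochs=1, verbose=0)
        # Export with a numeric version subdirectory, the layout Tensorflow Serving expects.
        tf.saved_model.save(model, f"/path/to/model_{i}/1")

The exported directories are then what the base_path entries in models.config point to when starting the model server.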
Now that each model graph has a specific device placement, Tensorflow Serving will honor that placement when loading the models and load each model only on its designated GPU. As a result, Tensorflow Serving splits the models across multiple GPUs, which lets us avoid the out-of-memory problem and successfully start the model server.
Tensorflow Serving is a great tool for serving machine learning models in production. While I hope the tool is soon improved to natively support serving multiple large models across multiple GPUs, the workaround I shared here can hopefully get you unblocked and let you keep using Tensorflow Serving for your production serving system.