As machine learning problems get more complex and training data gets more massive, machine learning models are growing in size and variety as well. This creates bigger challenges for serving models in production. To achieve better speed, we turn to GPUs. To serve bigger models, and more of them, we look to multiple GPUs. In this article, we will discuss how to serve multiple large ML models across multiple GPUs with Tensorflow Serving on a multi-GPU machine.
Tensorflow Serving is a high-performance model serving framework, developed by the Tensorflow team, for serving machine learning models in production environments. It offers a lot of nice features, including out-of-the-box support for serving multiple models and multiple versions of a model, both gRPC and REST endpoints, and request batching.
Tensorflow Serving was released back in 2016 and has been around for quite a few years. New features have been continuously added along with the development of the Tensorflow framework itself. With the features it currently offers, you can easily use it to serve one or several machine learning models on a single GPU.
However, when it comes to using Tensorflow Serving to serve multiple large ML models across multiple GPUs on a multi-GPU machine, things get a little messy and the solution is not that straightforward. Here is my experience of how to resolve it.
The problem we are trying to solve is serving multiple large machine learning models across multiple GPUs on a multi-GPU machine with Tensorflow Serving. If all of your models together fit comfortably into the memory of a single GPU, then this situation probably does not apply to you.
The ML serving system we are having trouble with when using Tensorflow Serving is one where the combined size of the models is larger than the memory of any single GPU, so the models have to be spread across several GPUs on the same machine.
With that said, imagine that our ML system looks like the following: 12 models, roughly 4 GB each (about 48 GB in total), to be served on a single machine with 4 GPUs, each with about 15 GB of memory.
Notice that in this ML system, we can't use just 1 GPU to serve all 12 models, because they don't fit. We have to use all 4 GPUs to serve those 12 models. Let's take a look at how Tensorflow Serving can be used to serve our 12 models.
If you have already been using Tensorflow Serving, then you are probably familiar with the typical ways of running the Tensorflow Serving server.
To serve multiple models using the Tensorflow Serving GPU docker container:
docker run --runtime=nvidia -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/my_model/,target=/models/my_model \
  --mount type=bind,source=/path/to/my/models.config,target=/models/models.config \
  -t tensorflow/serving:latest-gpu \
  --model_config_file=/models/models.config
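For reference, the models.config file mounted above uses Tensorflow Serving's standard model_config_list in protobuf text format. A minimal sketch, with hypothetical model names and paths, might look like this:

model_config_list {
  config {
    name: "model_0"
    base_path: "/models/model_0"
    model_platform: "tensorflow"
  }
  config {
    name: "model_1"
    base_path: "/models/model_1"
    model_platform: "tensorflow"
  }
  # ... one config entry per model, up to model_11
}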
If for some reason you can't use docker or the Tensorflow Serving docker container, then you can serve multiple models using the tensorflow_model_server binary with GPU support:
tensorflow_model_server --port=8500 --rest_api_port=8501 --model_config_file=/models/models.config
The problem is, Tensorflow Serving by default allocates only 1 GPU ("/GPU:0") to load all the models, and as soon as we start our model server with all 12 models, the model server crashes. Remember that the total size of our 12 models is 48 GB, while a single GPU has only 15 GB of memory. Because the combined size of the 12 models is bigger than a single GPU's memory, the GPU runs out of memory very quickly, before the model server is able to fully load all the models.
Normally, if we are using regular Tensorflow only, we can easily specify which GPU loads which model, and evenly split the models across the GPUs, by doing the following:
/GPU:0 will load models 0, 4, 8
/GPU:1 will load models 1, 5, 9
/GPU:2 will load models 2, 6, 10
/GPU:3 will load models 3, 7, 11

with tf.device('/GPU:0'):
    ...  # load model 0, model 4, model 8
with tf.device('/GPU:1'):
    ...  # load model 1, model 5, model 9
with tf.device('/GPU:2'):
    ...  # load model 2, model 6, model 10
with tf.device('/GPU:3'):
    ...  # load model 3, model 7, model 11
However, with Tensorflow Serving, there is no interface or command-line argument to control which GPU loads which model. At least not yet. The tensorflow_model_server command takes all 12 of our models and by default tries to load all of them onto a single GPU instead.
Although tensorflow_model_server does not offer a way to split our models across multiple GPUs, we can split the models across the GPUs ourselves when we save them after training.
During training, if we don't specify a GPU device, Tensorflow will use the default GPU device ("/GPU:0") to place the graph, and when the model is saved, no device placement is recorded either. When serving the models, Tensorflow Serving, upon not seeing a specific device placement, will again choose the default GPU device ("/GPU:0") to load the models. This is not good if the total size of our models is bigger than a single GPU's memory.
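If you want to check what placement (if any) ended up in an exported model, one way — a sketch, assuming a SavedModel with the usual serving_default signature at a hypothetical path — is to look at the device strings recorded on its graph nodes:

import tensorflow as tf

loaded = tf.saved_model.load("/path/to/model_1/1")  # hypothetical path
serving_fn = loaded.signatures["serving_default"]
# Collect the device strings recorded on the graph's nodes; an empty set means
# no explicit placement was saved, so the default device is used at load time.
devices = {node.device for node in serving_fn.graph.as_graph_def().node if node.device}
print(devices)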
With that understanding, the solution to our problem is straightforward: save our models with a specific device placement during training. A quick illustration of how that looks:
/GPU:0 will be used to train and save models 0, 4, 8
/GPU:1 will be used to train and save models 1, 5, 9
/GPU:2 will be used to train and save models 2, 6, 10
/GPU:3 will be used to train and save models 3, 7, 11

with tf.device('/GPU:0'):
    ...  # train and save model 0, model 4, model 8
with tf.device('/GPU:1'):
    ...  # train and save model 1, model 5, model 9
with tf.device('/GPU:2'):
    ...  # train and save model 2, model 6, model 10
with tf.device('/GPU:3'):
    ...  # train and save model 3, model 7, model 11
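Putting it together, here is a minimal sketch of the training side, using a tiny stand-in Keras model (the model definition, data, and export paths are placeholders for your real ones); the important part is that building, training, and saving all happen inside an explicit tf.device() scope:

import tensorflow as tf

NUM_GPUS = 4

for i in range(12):
    gpu_id = i % NUM_GPUS  # same round-robin split: models 0, 4, 8 -> /GPU:0, etc.
    with tf.device(f"/GPU:{gpu_id}"):
        # Stand-in model and data; replace with your real model and training loop.
        model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])
        model.compile(optimizer="adam", loss="mse")
        model.fit(tf.random.normal((32, 8)), tf.random.normal((32, 1)), epochs=1, verbose=0)
        # Export with a numeric version subdirectory, the layout Tensorflow Serving expects.
        tf.saved_model.save(model, f"/path/to/model_{i}/1")

The exported directories are then what the base_path entries in models.config point to when starting the model server.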
Now that each model graph has a specific device placement, Tensorflow Serving will honor that placement when loading the models and load each model only on its designated GPU. As a result, Tensorflow Serving splits the models across multiple GPUs, which lets us avoid the out-of-memory problem and successfully start the model server.
Tensorflow Serving is a great tool for serving machine learning models in production. While I hope the tool is soon improved to natively support serving multiple large models across multiple GPUs, the workaround I shared here can hopefully get you unblocked and let you keep using Tensorflow Serving for your production serving system.