Data Science, Machine Learning & AI
Content Hub
Blog Post

Building a scaleable Deep Learning Serving Environment for Keras models using NVIDIA TensorRT Server and Google Cloud

  • Autor
  • Datum 19. November 2018
  • Topic Cloud TechnologyCodingDeep Learning
  • Type Blog
  • Theme TechnologyTechnology
Building a scaleable Deep Learning Serving Environment for Keras models using NVIDIA TensorRT Server and Google Cloud

In a recent project at STATWORX, I’ve developed a large scale deep learning application for image classification using Keras and Tensorflow. After developing the model, we needed to deploy it in a quite complex pipeline of data acquisition and preparation routines in a cloud environment. We decided to deploy the model on a prediction server that exposes the model through an API. Thereby, we came across NVIDIA TensorRT Server (TRT Server), a serious alternative to good old TF Serving (which is an awesome product, by the way!). After checking the pros and cons, we decided to give TRT Server a shot. TRT Server has sevaral advantages over TF Serving, such as optimized inference speed, easy model management and ressource allocation, versioning and parallel inference handling. Furthermore, TensorRT Server is not “limited” to TensorFlow (and Keras) models. It can serve models from all major deep learning frameworks, such as TensorFlow, MxNet, pytorch, theano, Caffe and CNTK.

Despite the load of cool features, I found it a bit cumbersome to set up the TRT server. The installation and documentation is scattered to quite a few repositories, documetation guides and blog posts. That is why I decided to write this blog post about setting up the server and get your predictions going!

NVIDIA TensorRT Server

TensorRT Inference Server is NVIDIA’s cutting edge server product to put deep learning models into production. It is part of the NVIDIA’s TensorRT inferencing platform and provides a scaleable, production-ready solution for serving your deep learning models from all major frameworks. It is based on NVIDIA Docker and contains everything that is required to run the server from the inside of the container. Furthermore, NVIDIA Docker allows for using GPUs inside a Docker container, which, in most cases, significantly speeds up model inference. Talking about speed – TRT Server can be considerably faster than TF Serving and allows for multiple inferences from multiple models at the same time, using CUDA streams to exploit GPU scheduling and serialization (see image below).

Visualization of model serialization and parallelism

With TRT Server you can specify the number of concurrent inference computations using so called instance groups that can be configured on the model level (see section “Model Configuration File”) . For example, if you are serving two models and one model gets significantly more inference requests, you can assign more GPU ressources to this model allowing you to compute more multiple requests in parallel. Furthermore, instance groups allow you to specify, whether a model should be executed on CPU or GPU, which can be a very interesting feature in more complex serving environments. Overall, TRT Server has a bunch of great features that makes it interesting for production usage.

NVIDIA architecture

The upper image illustrates the general architecture of the server. One can see the HTTP and gRPC interfaces that allow you to integrate your models into other applications that are connected to the server over LAN or WAN. Pretty cool! Furthermore, the server exposes a couple of sanity features such as health status checks etc., that also come in handy in production.

Setting up the Server

As mentioned before, TensorRT Server lives inside a NVIDIA Docker container. In order to get things going, you need to complete several installation steps (in case you are starting with a blank machine, like here). The overall process is quite long and requires a certain amount of “general cloud, network and IT knowledge”. I hope, that the following steps make the installation and setup process clear to you.

Launch a Deep Learning VM on Google Cloud

For my project, I used a Google Deep Learning VM that comes with preinstalled CUDA as well as TensorFlow libraries. You can launch a cloud VM using the Google Cloud SDK or in the GCP console (which is pretty easy to use, in my opinion). The installation of the GCP SDK can be found here. Please note, that it might take some time until you can connect to the server because of the CUDA installation process, which takes several minutes. You can check the status of the VM in the cloud logging console.

# Create project
gcloud projects create tensorrt-server

# Start instance with deep learning image
gcloud compute instances create tensorrt-server-vm 
	--project tensorrt-server 
	--zone your-zone 
	--machine-type n1-standard-4 
	--image-family tf-latest-gpu 
	--maintenance-policy TERMINATE

After successfully setting up your instance, you can SSH into the VM using the terminal. From there you can execute all the neccessary steps to install the required components.

# SSH into instance
gcloud compute ssh tensorrt-server-vm --project tensorrt-server --zone your-zone

Note: Of course, you have to adapt the script for your project and instance names.

Install Docker

After setting up the GCP cloud VM, you have to install the Docker service on your machine. The Google Deep Learning VM uses Debian as OS. You can use the following code to install Docker on the VM.

# Install Docker
sudo apt-get update
sudo apt-get install 
curl -fsSL | sudo apt-key add -
sudo add-apt-repository 
   "deb [arch=amd64] 
   $(lsb_release -cs) 
sudo apt-get update
sudo apt-get install docker-ce

You can verify that Docker has been successfully installed by running the following command.

sudo docker run --rm hello-world

You should see a “Hello World!” from the docker container which should give you something like this:

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
d1725b59e92d: Already exists 
Digest: sha256:0add3ace90ecb4adbf7777e9aacf18357296e799f81cabc9fde470971e499788
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:

For more examples and ideas, visit:

Congratulations, you’ve just installed Docker successfully!

Install NVIDIA Docker

Unfortunately, Docker has no “out of the box” support for GPUs connected to the host system. Therefore, the installation of the NVIDIA Docker runtime is required to use TensorRT Server’s GPU capabilities within a containerized environment. NVIDIA Docker is also used for TF Serving, if you want to use your GPUs for model inference. The following figure illustrates the architecture of the NVIDIA Docker Runtime.

NVIDIA docker

You can see, that the NVIDIA Docker Runtime is layered around the Docker engine allowing you to use standard Docker as well as NVIDIA Docker containers on your system.

Since the NVIDIA Docker Runtime is a proprietary product of NVIDIA, you have to register at NVIDIA GPU Cloud (NGC) to get an API key in order to install and download it. To authenticate against NGC execute the following command in the server command line:

# Login to NGC
sudo docker login

You will be prompted for username and API key. For username you have to enter $oauthtoken, the password is the generated API key. After you have successfully logged in, you can install the NVIDIA Docker components. Following the instructions on the NVIDIA Docker GitHub repo, you can install NVIDIA Docker by executing the following script (Ubuntu 14.04/16.04/18.04, Debian Jessie/Stretch).

# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repositories
curl -s -L | 
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L$distribution/nvidia-docker.list | 
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
sudo docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

Installing TensorRT Server

The next step, after successfully installing NVIDIA Docker, is to install TensorRT Server. It can be pulled from the NVIDIA Container Registry (NCR). Again, you need to be authenticated against NGC to perform this action.

# Pull TensorRT Server (make sure to check the current version)
sudo docker pull

After pulling the image, TRT Server is ready to be started on your cloud machine. The next step is to create a model that will be served by TRT Server.

Model Deployment

After installing the required technical components and pulling the TRT Server container you need to take care of your model and the deployment. TensorRT Server manages it’s models in a folder on your server, the so called model repository.

Setting up the Model Repository

The model repository contains your exported TensorFlow / Keras etc. model graphs in a specific folder structure. For each model in the model repository, a subfolder with the corresponding model name needs to be defined. Within those model subfolders, the model schema files (config.pbtxt), label definitions (labels.txt) as well as model version subfolders are located. Those subfolders allow you to manage and serve different model versions. The file labels.txt contains strings of the target labels in appropriate order, corresponding to the output layer of the model. Within the version subfolder a file named model.graphdef (the exported protobuf graph) is stored. model.graphdef is actually a frozen tensorflow graph, that is created after exporting a TensorFlow model and needs to be named accordingly.

Remark: I did not manage to get a working serving from a tensoflow.python.saved_model.simple_save() or tensorflow.python.saved_model.builder.SavedModelBuilder() export with TRT Server due to some variable initialization error. We therefore use the “freezing graph” approach, which converts all TensorFlow variable inside a graph to constants and outputs everything into a single file (which is model.graphdef).

|-   model_1/
|--      config.pbtxt
|--      labels.txt
|--      1/
|---		model.graphdef

Since the model repository is just a folder, it can be located anywhere the TRT Server host has a network connection to. For exmaple, you can store your exported model graphs in a cloud repository or a local folder on your machine. New models can be exported and deployed there in order to be servable through the TRT Server.

Model Configuration File

Within your model repository, the model configuration file (config.pbtxt) sets important parameters for each model on the TRT Server. It contains technical information about your servable model and is required for the model to be loaded properly. There are sevaral things you can control here:

name: "model_1"
platform: "tensorflow_graphdef"
max_batch_size: 64
input [
      name: "dense_1_input"
      data_type: TYPE_FP32
      dims: [ 5 ]
output [
      name: "dense_2_output"
      data_type: TYPE_FP32
      dims: [ 2 ]
      label_filename: "labels.txt"
instance_group [
      kind: KIND_GPU
      count: 4

First, name defines the tag under the model is reachable on the server. This has to be the name of your model folder in the model repository. platform defines the framework, the model was built with. If you are using TensorFlow or Keras, there are two options: (1) tensorflow_savedmodel and tensorflow_graphdef. As mentioned before, I used tensorflow_graphdef (see my remark at the end of the previous section). batch_size, as the name says, controls the batch size for your predictions. input defines your model’s input layer node name, such as the name of the input layer (yes, you should name your layers and nodes in TensorFlow or Keras), the data_type, currently only supporting numeric types, such as TYPE_FP16, TYPE_FP32, TYPE_FP64 and the input dims. Correspondingly, output defines your model’s output layer name, it’s data_type and dims. You can specify a labels.txt file that holds the labels of the output neurons in appropriate order. Since we only have two output classes here, the file looks simply like this:


Each row defines a single class label. Note, that the file does not contain any header. The last section instance_group lets you define specific GPU (KIND_GPU)or CPU (KIND_CPU) ressources to your model. In the example file, there are 4 concurrent GPU threads assigned to the model, allowing for four simultaneous predictions.

Building a simple model for serving

In order to serve a model through TensorRT server, you’ll first need – well – a model. I’ve prepared a small script that builds a simple MLP for demonstration purposes in Keras. I’ve already used TRT Server successfully with bigger models such as InceptionResNetV2 or ResNet50 in production and it worked very well.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import InputLayer, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.utils import to_categorical

# Make toy data
X, y = make_classification(n_samples=1000, n_features=5)

# Make target categorical
y = to_categorical(y)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Scale inputs
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model definition
model_1 = Sequential()
model_1.add(Dense(input_shape=(X_train.shape[1], ),
                  units=16, activation='relu', name='dense_1'))
model_1.add(Dense(units=2, activation='softmax', name='dense_2'))
model_1.compile(optimizer='adam', loss='categorical_crossentropy')

# Early stopping
early_stopping = EarlyStopping(patience=5)
model_checkpoint = ModelCheckpoint(filepath='model_checkpoint.h5',
callbacks = [early_stopping, model_checkpoint]

# Fit model and load best weights, y=y_train, validation_data=(X_test, y_test),
            epochs=50, batch_size=32, callbacks=callbacks)

# Load best weights after early stopping

# Export model'model_1.h5')

The script builds some toy data using sklearn.datasets.make_classification and fits a single layer MLP to the data. After fitting, the model gets saved for further treatment in a separate export script.

Freezing the graph for serving

Serving a Keras (TensorFlow) model works by exporting the model graph as a separate protobuf file (.pb-file extension). A simple way to export the model into a single file, that contains all the weights of the network, is to “freeze” the graph and write it to disk. Thereby, all the tf.Variables in the graph are converted to tf.constant which are stored together with the graph in a single file. I’ve modified this script for that purpose.

import os
import shutil
import keras.backend as K
import tensorflow as tf
from keras.models import load_model
from tensorflow.python.framework import graph_util
from tensorflow.python.framework import graph_io

def freeze_model(model, path):
    """ Freezes the graph for serving as protobuf """
    # Remove folder if present
    if os.path.isdir(path):
        shutil.copy('config.pbtxt', path)
        shutil.copy('labels.txt', path)
    # Disable Keras learning phase
    # Load model
    model_export = load_model(model)
    # Get Keras sessions
    sess = K.get_session()
    # Output node name
    pred_node_names = ['dense_2_output']
    # Dummy op to rename the output node
    dummy = tf.identity(input=model_export.outputs[0], name=pred_node_names)
    # Convert all variables to constants
    graph_export = graph_util.convert_variables_to_constants(
                         logdir=path + '/1',

# Freeze Model
freeze_model(model='model_1.h5', path='model_1')

# Upload to GCP
os.system('gcloud compute scp model_1 tensorrt-server-vm:~/models/ --project tensorrt-server --zone us-west1-b --recurse')

The freeze_model() function takes the path to the saved Keras model file model_1.h5 as well as the path for the graph to be exported. Furthermore, I’ve enhanced the function in order to build the required model repository folder structure containing the version subfolder, config.pbtxt as well as labels.txt, both stored in my project folder. The function loads the model and exports the graph into the defined destination. In order to do so, you need to define the output node’s name and then convert all variables in the graph to constants using graph_util.convert_variables_to_constants, which uses the respective Keras backend session, that has to be fetched using K.get_session(). Furthermore, it is important to disable the Keras learning mode using K.set_learning_phase(0) prior to export. Lastly, I’ve included a small CLI command that uploads my model folder to my GCP instance to the model repository /models.

Starting the Server

Now that everything is installed, set up and configured, it is (finally) time to launch our TRT prediciton server. The following command starts the NVIDIA Docker container and maps the model repository to the container.

sudo nvidia-docker run --rm --name trtserver -p 8000:8000 -p 8001:8001 
-v ~/models:/models trtserver 

--rm removes existing containers of the same name, given by --name. -p exposes ports 8000 (REST) and 8001 (gRPC) on the host and maps them to the respective container ports. -v mounts the model repository folder on the host, which is /models in my case, to the container into /models, which is then referenced by --model-store as the location to look for servable model graphs. If everything goes fine you should see similar console output as below. If you don’t want to see the output of the server, you can start the container in detached model using the -d flag on startup.

== TensorRT Inference Server ==

NVIDIA Release 18.09 (build 688039)

Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2018 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

I1014 10:38:55.951258 1] Initializing TensorRT Inference Server
I1014 10:38:55.951339 1] Reporting prometheus metrics on port 8002
I1014 10:38:56.524257 1] found 1 GPUs supported power usage metric
I1014 10:38:57.141885 1]   GPU 0: Tesla K80
I1014 10:38:57.142555 1] Starting server 'inference:0' listening on
I1014 10:38:57.142583 1]  localhost:8001 for gRPC requests
I1014 10:38:57.143381 1]  localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[ : 235] RAW: Entering the event loop ...
I1014 10:38:57.880877 1] Adding/updating models.
I1014 10:38:57.880908 1]  (Re-)adding model: model_1
I1014 10:38:57.981276 1] Successfully reserved resources to load servable {name: model_1 version: 1}
I1014 10:38:57.981313 1] Approving load for servable version {name: model_1 version: 1}
I1014 10:38:57.981326 1] Loading servable version {name: model_1 version: 1}
I1014 10:38:57.982034 1] Creating instance model_1_0_0_gpu0 on GPU 0 (3.7) using model.savedmodel
I1014 10:38:57.982108 1] Attempting to load native SavedModelBundle in bundle-shim from: /models/model_1/1/model.savedmodel
I1014 10:38:57.982138 1] Reading SavedModel from: /models/model_1/1/model.savedmodel
I1014 10:38:57.983817 1] Reading meta graph with tags { serve }
I1014 10:38:58.041695 1] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I1014 10:38:58.042145 1] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
I1014 10:38:58.042177 1] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
I1014 10:38:58.042192 1] Device interconnect StreamExecutor with strength 1 edge matrix:
I1014 10:38:58.042200 1]      0 
I1014 10:38:58.042207 1] 0:   N 
I1014 10:38:58.067349 1] Restoring SavedModel bundle.
I1014 10:38:58.074260 1] Running LegacyInitOp on SavedModel bundle.
I1014 10:38:58.074302 1] SavedModel load for tags { serve }; Status: success. Took 92161 microseconds.
I1014 10:38:58.075314 1] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
I1014 10:38:58.075343 1] Device interconnect StreamExecutor with strength 1 edge matrix:
I1014 10:38:58.075348 1]      0 
I1014 10:38:58.075353 1] 0:   N 
I1014 10:38:58.083451 1] Successfully loaded servable version {name: model_1 version: 1}

There is also a warning showing that you should start the container using the following arguments

--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864

You can do this of course. However, in this example I did not use them.

Installing the Python Client

Now it is time to test our prediction server. TensorRT Server comes with several client libraries that allow you to send data to the server and get predictions. The recommended method of building the client libraries is again – Docker. To use the Docker container, that contains the client libraries, you need to clone the respective GitHub repo using:

git clone

Then, cd into the folder dl-inference-server and run

docker build -t inference_server_clients .

This will build the container on your machine (takes some time). To use the client libraries within the container on your host, you need to mount a folder to the container. First, start the container in an interactive session (-it flag)

docker run --name tensorrtclient --rm -it -v /tmp:/tmp/host inference_server_clients

Then, run the following commands in the container’s shell (you may have to create /tmp/host first):

cp build/image_client /tmp/host/.
cp build/perf_client /tmp/host/.
cp build/dist/dist/tensorrtserver-*.whl /tmp/host/.
cd /tmp/host

The code above copies the prebuilt image_client and perf_client libraries into the mounted folder and makes it accessible from the host system. Lastly, you need to install the Python client library using

pip install tensorrtserver-0.6.0-cp35-cp35m-linux_x86_64.whl

on the container system. Finally! That’s it, we’re ready to go (sounds like it was an easy way)!

Inference using the Python Client

Using Python, you can easily perform predictions using the client library. In order to send data to the server, you need an InferContext() from the inference_server.api module that takes the TRT Server IP and port as well as the desired model name. If you are using the TRT Server in the cloud, make sure, that you have appropriate firewall rules allowing for traffic on ports 8000 and 8001.

from tensorrtserver.api import *
import numpy as np

# Some parameters
outputs = 2
batch_size = 1

# Init client
trt_host = '123.456.789.0:8000' # local or remote IP of TRT Server
model_name = 'model_1'
ctx = InferContext(trt_host, ProtocolType.HTTP, model_name)

# Sample some random data
data = np.float32(np.random.normal(0, 1, [1, 5]))

# Get prediction
# Layer names correspond to the names in config.pbtxt
response =
    {'dense_1_input': data}, 
    {'dense_2_output': (InferContext.ResultFormat.CLASS, outputs)},

# Result
{'output0': [[(0, 1.0, 'class_0'), (1, 0.0, 'class_1')]]}

Note: It is important that the data you are sending to the server matches the floating point precision, previously defined for the input layer in the model definition file. Furthermore, the names of the input and output layers must exactly match those of your model. If everything went well, returns a dictionary of predicted values, which you would further postprocess according to your needs.

Conclusion and Outlook

Wow, that was quite a ride! However, TensorRT Server is a great product for putting your deep learning models into production. It is fast, scaleable and full of neat features for production usage. I did not go into details regarding inference performance. If you’re interested in more, make sure to check out this blog post from NVIDIA. I must admit, that in comparison to TRT Server, TF Serving is much more handy when it comes to installation, model deployment and usage. However, compared to TRT Server it lacks some functionalities that are handy in production. Bottom line: my team and I will definitely add TRT Server to our production tool stack for deep learning models.

If you have any comments or questions on my story, feel free to comment below! I will try to answer them. Also, feel free to use my code or share this story with your peers on social platforms of your choice.

If you’re interested in more content like this, join our mailing list, constantly bringing you new data science, machine learning and AI reads and treats from me and my team at STATWORX right into your inbox!

Lastly, follow me on LinkedIn or Twitter, if you’re interested to connect with me.


Erfahre mehr!

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Lorem ipsum dolor sit amet.
Über uns