Deploying ONNX Models with Triton Inference Server: A Comprehensive Guide
My experience with MLOps in the past year has been limited due to my focus on developing large-scale systems using a microservices architecture. However, in my new role, I will be exploring the domains of deep learning and symbolic AI. This transition from backend services to building data pipelines and working with MLOps presents both a challenge and an opportunity to expand my skill set. As a novice, I'm eager to start my learning journey by leveraging existing fine-tuned open-source models and tools for experimentation, aiming to assess their practicality for real-world applications.
This post serves as a documentation of my learning experiences, chronicling my efforts to use an existing model and host it via an open-source inference server. Throughout this process, I've gained insights into ML terminology, tools, and software components that facilitate the seamless hosting of models. Acknowledging my status as a learner, I recognize the possibility of misinterpretations and am receptive to insights from seasoned experts in the field.
Although the subsequent sections may present a seemingly straightforward process, my actual experience was a challenging weekend-long immersion that required overcoming various hurdles. Hopefully, my insights can be helpful for novice users and save them some time in the future.
Step-1: Download an existing fine-tuned open source model
Identify and download the ML model you need from HuggingFace. I chose this model to analyze the sentiment of product reviews.
Here’s the link: bert-base-uncased-finetuned-review-sentiment-analysis
git lfs install
git clone https://huggingface.co/DataMonke/bert-base-uncased-finetuned-review-sentiment-analysis
Step-2: Convert the model to ONNX format
This model is PyTorch-based, and we need to convert it to ONNX format so that it can run efficiently on CPU-only systems. ONNX stands for "Open Neural Network Exchange" and is an open representation format for machine learning models. It allows for portability; in other words, an ONNX model can run almost everywhere.
Ideally, knowing the model internals helps, but we can hack our way through with a simple Python script that converts pytorch_model.bin to a .onnx file.
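Here's a minimal conversion sketch using the transformers library and torch.onnx.export. The checkpoint path, tensor names, and opset version below are assumptions for illustration; the input/output names produced by your export may differ (Netron, used in the next step, will show the actual ones):
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Path to the cloned Hugging Face checkpoint (assumed local location)
model_dir = "bert-base-uncased-finetuned-review-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# return_dict=False makes the model return plain tuples, which export more cleanly
model = AutoModelForSequenceClassification.from_pretrained(model_dir, return_dict=False)
model.eval()

# Dummy input used only to trace the graph
dummy = tokenizer("I love this product", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"],),                  # export with input_ids as the single input
    "sentiment-analysis.onnx",
    input_names=["input.1"],                # illustrative names; verify with Netron
    output_names=["1345"],
    dynamic_axes={"input.1": {0: "batch", 1: "sequence"}},
    opset_version=14,
)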
In addition to the ONNX file, we need a configuration file that specifies the input and output, among a few other things. Here, Netron will be very useful. As a viewer for neural networks and machine learning models, it generates beautiful visualizations that you can use to clearly communicate the structure of your neural network. Using the tool, get the input/output configs to populate the config.pbtxt file required to host the model.
CLI:
$ netron sentiment-analysis.onnx
Serving 'sentiment-analysis.onnx' at http://localhost:8081
Output: [partial Netron view showing the model graph with its input and output tensors]
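Using the names, data types, and shapes Netron reports, here's a minimal config.pbtxt sketch for this model. The tensor names and dims below match my particular export and are assumptions you should verify against your own:
name: "sentiment-analysis_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input.1"
    data_type: TYPE_INT64
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "1345"
    data_type: TYPE_FP32
    dims: [ -1, 5 ]
  }
]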
Step-3: Setup Model Repository
We will be using Triton Inference Server, which requires a model repository set up in the format shown below:
model_repository
└── sentiment-analysis_onnx
├── 1
│ └── model.onnx
└── config.pbtxt
3 directories, 2 files
Copy/move the ONNX model into the version 1 directory and rename it to model.onnx, as shown above.
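For example, assuming the exported file from Step-2 is named sentiment-analysis.onnx and the config.pbtxt sits in the current directory, the layout can be created with:
mkdir -p model_repository/sentiment-analysis_onnx/1
cp sentiment-analysis.onnx model_repository/sentiment-analysis_onnx/1/model.onnx
cp config.pbtxt model_repository/sentiment-analysis_onnx/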
Step-4: Setup Triton Inference Server
Triton Inference Server is an open source inference serving software that streamlines AI inferencing. Triton enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton Inference Server supports inference across cloud, data center, edge and embedded devices on NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia.
Download and build Nvidia Triton Inference Server from here:
https://github.com/triton-inference-server/server.git
By default, the build targets GPUs, and you need to modify build.py to disable that flag for a CPU-only build. For some reason, the server I built locally did not start and produced no errors; I'm not sure whether additional flags need to be disabled. Hence, I decided to use the pre-built Docker image.
Here’s the command to start Triton using a pre-built docker image:
docker run --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /Users/mlstudent/opensrc/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models
At this point, you will have Triton running and ready to receive inference requests. Here are a few commands to check the server state:
curl -v localhost:8000/v2
curl -v localhost:8000/v2/health/ready
curl localhost:8000/v2/models/<model_name>/config  # replace <model_name>, e.g. sentiment-analysis_onnx
Step-5: Test Inference Requests
For inference, we need to provide token encodings as input. Since I wanted to try out inference using curl, I used a small script to generate the input encodings for a given text.
Here’s a snippet of the code:
from transformers import AutoTokenizer

# Load the tokenizer bundled with the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("DataMonke/bert-base-uncased-finetuned-review-sentiment-analysis")
# Input text
text = "I love this product"
# Tokenize and encode the text
inputs = tokenizer(text, return_tensors="pt")
print(inputs)
Here’s the curl request using the input encodings:
Input:
curl localhost:8000/v2/models/sentiment-analysis_onnx/infer -H "Content-Type: application/json" -d '{
"inputs": [
{"name": "input.1", "shape": [1, 6], "datatype": "INT64", "data": [101, 151, 11157, 10372, 20058, 102]}
]
}'
Output:
{"model_name":"sentiment-analysis_onnx","model_version":"1","outputs":[{"name":"1345","datatype":"FP32","shape":[1,5],"data":[-2.8784561157226564,-3.1829323768615724,-1.5491331815719605,1.99672269821167,4.558711051940918]}]}
The model's output tensor, named "1345", has shape [1, 5]: a 2D array with one row and five columns, one score per sentiment class. These values are raw logits rather than probabilities; larger values indicate higher confidence, and the index of the largest value is taken as the predicted class (here index 4, the most positive class). Applying a softmax converts the logits into probabilities.
For comparison, here's the predicted output using the original PyTorch model:
Input text: “I love this product”
Predicted class: 4
Probabilities: [[0.0005451233009807765, 0.00040203420212492347, 0.0020597416441887617, 0.0714099332690239, 0.9255831241607666]]
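As a sanity check, applying a softmax to the logits returned by Triton reproduces the PyTorch probabilities above. A quick NumPy sketch:
import numpy as np

# Raw logits returned by Triton for "I love this product"
logits = np.array([-2.8784561157226564, -3.1829323768615724,
                   -1.5491331815719605, 1.99672269821167, 4.558711051940918])
# Softmax turns the logits into probabilities
probs = np.exp(logits) / np.exp(logits).sum()
print(int(probs.argmax()))  # 4, matching the PyTorch prediction
print(probs)                # ~[0.00055, 0.00040, 0.00206, 0.0714, 0.9256]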
Conclusion
By adopting the Kaizen approach, I aspire to optimize my learning journey toward becoming an MLOps expert and eventually evolving into an ML specialist. With each iteration, I aim to refine my understanding of the subject through experimentation with novel tools, and solidify my expertise in the ever-evolving realm of machine learning. During my attempt to host a simple model, I gained insights into PyTorch models, the ONNX format, and NVIDIA's Triton Inference Server. This marks the beginning of a new chapter in my professional journey.