Optimization Tools for AI Model Inference: Triton Server & TensorRT

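Once an AI model has finished training, what it takes to serve that model in a real production environment differs from what training required. The simplest approach is to call .predict()/.forward(). But once you start caring about speed and TPS (transactions per second) and looking for a better way, questions like the following come up.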

Is there something more we can do with our model now that we don’t need to train anymore?
Is there something better we can do than calling a high level .predict()/.forward() function?

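TRT (TensorRT) and TRTIS (TensorRT Inference Server, since renamed Triton Inference Server) are frameworks for improving performance when a fully trained model is used only for inference, and they are optimized for NVIDIA GPUs. OpenAI reportedly also serves models with TRTIS, and benchmark tests can show very large speedups.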

1. TensorRT Workflow

The NVIDIA website describes TensorRT as follows: "NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications." The key terms here are the inference optimizer and the runtime.

The workshop materials describe it along the same lines.

High-performance framework makes it easy to develop GPU-accelerated inference
Optimized inference for a given trained neural network and target GPU
Supports deployment of TF32, FP32, FP16, and INT8 inference
Includes both converters and optimized runtimes

The figure to keep in mind for conversion and runtime is shown below: it lays out three options for conversion and three options for the runtime.

NVIDIA describes the main features of TRT as follows.

[Figure] TensorRT overview

To build an inference server with TRT, a model trained in TensorFlow or PyTorch is first exported to a format such as ONNX or SavedModel, then converted into a TensorRT engine, and finally served by bringing up an inference server on a runtime. The figure below shows the details; for the items labeled like Notebook 1, refer to the GitHub link at the top of this post.

[Figure] TensorRT Workflow

With TensorRT you can optimize a trained model and build an inference server. Along the way, keep the following questions in mind.

How should I export my model to a format TRT understands?
What batch size am I using?
What precision am I using?
How am I converting my model?
What runtime am I using?
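
To make the precision question more concrete, here is a minimal plain-Python sketch of symmetric INT8 quantization. It is a simplified illustration, not TensorRT's actual calibration procedure, and the activation values are made up. It only shows the trade-off: values are stored as integers in [-127, 127] at the cost of a rounding error bounded by half the scale.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical activations, chosen only for illustration.
activations = [0.02, -1.27, 0.635, 0.9, -0.31]

quantized, scale = quantize_int8(activations)
restored = dequantize(quantized, scale)
max_error = max(abs(a - r) for a, r in zip(activations, restored))

print("scale:", scale)
print("quantized:", quantized)
print("max round-trip error:", max_error)
```

Lower precision generally means less memory traffic and faster kernels, so the practical question is whether an error of this size is acceptable for the model's accuracy.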

2. TensorRT Behind the Scenes

1) Graph Optimization

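TensorRT optimizes a trained model as follows.

Fundamentally, to cut the cost of computation it helps to merge many small operations so that the work is done on one large matrix. To that end, layers that are used in common across several layers are fused into a single operation. This is possible because inference does not need to compute gradients, so the intermediate results of each small operation do not have to be kept.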
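
One concrete case of this kind of fusion is folding a batch-normalization layer into the linear layer that precedes it. The sketch below is a plain-Python illustration with made-up parameters, not TensorRT's implementation. It assumes inference mode, where the batch-norm statistics are fixed, and checks that one fused layer gives the same output as the two separate layers.

```python
import math


def linear(x, weight, bias):
    """y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch norm: uses fixed running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm scale and shift into the linear layer's parameters."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    fused_weight = [[s * w for w in row] for row, s in zip(weight, scale)]
    fused_bias = [s * (b - m) + bt for s, b, m, bt in zip(scale, bias, mean, beta)]
    return fused_weight, fused_bias


# Hypothetical parameters for a layer with 3 inputs and 2 outputs.
weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, 3.0]

two_step = batch_norm(linear(x, weight, bias), gamma, beta, mean, var)
fused_weight, fused_bias = fold_batch_norm(weight, bias, gamma, beta, mean, var)
one_step = linear(x, fused_weight, fused_bias)

max_diff = max(abs(a - b) for a, b in zip(two_step, one_step))
print("max difference between fused and unfused outputs:", max_diff)
```

The fused layer does the same arithmetic in a single pass, which is the sense in which the graph gets smaller without changing the result.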

2) Auto-Tuning

To optimize the graph, TensorRT repeatedly feeds in dummy data and measures the performance of each layer. It uses these measurements to arrive at the best-performing graph optimization.
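
The idea of timing candidates on dummy data can be sketched in plain Python. This is a toy illustration, not how TensorRT selects GPU kernels: the two candidate functions and the helper `pick_fastest` are hypothetical, and the timing runs on the CPU.

```python
import time


def sum_squares_loop(values):
    """Candidate 1: explicit loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_generator(values):
    """Candidate 2: built-in sum over a generator expression."""
    return sum(v * v for v in values)


def pick_fastest(candidates, dummy_input, repeats=5):
    """Time every candidate on the same dummy input and return the fastest one."""
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(dummy_input)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get)
    return winner, timings


candidates = {"loop": sum_squares_loop, "generator": sum_squares_generator}
dummy_input = list(range(10_000))  # dummy data: only shape and size matter here

winner, timings = pick_fastest(candidates, dummy_input)
print("fastest candidate:", winner)
print("best times (s):", timings)
```

Both candidates compute the same result, so the choice between them can be made purely on measured speed. Which one wins depends on the machine, which is also why a tuned engine is tied to the hardware it was measured on.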

3. TensorRT Conversion

0) Introduction

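There are three main ways to convert a trained model to TensorRT. Of these, converting to ONNX and then to TRT is the most widely used, and it is the recommended route unless you have a special case.

The figure below is explained in terms of TensorFlow, but the same applies to PyTorch.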

	Flexible and Automatic (TF-TRT): Can automatically work around/ignore unsupported layers
	Fast and Automatic (ONNX): but requires plugins for unsupported layers (the most widely used approach)
	Fast and Flexible (TensorRT API): but not automatic – must manually construct a network using TRT ops
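
For the ONNX route, one common way to build an engine is the trtexec command-line tool that ships with TensorRT. The sketch below only assembles the command in Python. The file names and the helper `build_trtexec_command` are assumptions for illustration, and the conversion itself runs only if trtexec and the ONNX file are actually present.

```python
import os
import shutil
import subprocess


def build_trtexec_command(onnx_path, engine_path, fp16=False):
    """Assemble a trtexec call that parses an ONNX file and saves a TensorRT engine."""
    command = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        # Let TensorRT use FP16 kernels in addition to FP32.
        command.append("--fp16")
    return command


# Hypothetical file names, used only for illustration.
command = build_trtexec_command("model.onnx", "model.plan", fp16=True)
print(" ".join(command))

# Run the conversion only when trtexec is installed and the ONNX file is present.
if shutil.which("trtexec") and os.path.exists("model.onnx"):
    subprocess.run(command, check=True)
```

The resulting engine is built for a specific GPU and TensorRT version, so it is usually generated on the machine, or the same kind of machine, that will serve it.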

1) What is ONNX?

ONNX is a project that was started so that model parameters and operations would not be tied to any one framework. It is also a file representation format (a graph).

ONNX – Open Neural Network eXchange (.onnx) format
Framework-agnostic format that TRT supports directly (community effort standardized across many organizations)
PyTorch can export to it using torch.onnx
TensorFlow can export to it using tf2onnx or keras2onnx
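
As a rough sketch of the PyTorch path, the function below exports a small model with torch.onnx.export. The model, the tensor names "input" and "output", and the file name are all assumptions for illustration. PyTorch is imported inside the function so the snippet can be loaded on a machine without it; calling the function requires PyTorch to be installed.

```python
def export_to_onnx(onnx_path="model.onnx"):
    """Export a tiny stand-in model to ONNX and return the output path."""
    # Imported lazily so this file loads even where PyTorch is not installed.
    import torch

    # Hypothetical stand-in for a trained model; eval() switches to inference mode.
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32),
        torch.nn.ReLU(),
        torch.nn.Linear(32, 4),
    ).eval()

    # The exporter traces the model with an example input of the right shape.
    dummy_input = torch.randn(1, 16)

    torch.onnx.export(
        model,
        (dummy_input,),
        onnx_path,
        input_names=["input"],
        output_names=["output"],
        # Mark dimension 0 as a variable batch size instead of fixing it to 1.
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    )
    return onnx_path
```

Calling export_to_onnx() writes the .onnx file that the conversion step then consumes. Leaving the batch dimension dynamic keeps the batch-size decision open until the engine is built.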

4. Runtime Environments

0) Introduction

There are three main ways to set up an inference server with the converted TRT model. If you have a model converted via ONNX and no special circumstances, NVIDIA's Triton is the recommended choice.

	Possible only when using TF-TRT, default option for TF-TRT models (for PyTorch, TorchServe can be used)
	Serving and Load Balancing (Triton): Great for serving models over HTTP ...or doing multi-GPU inference!
	Fast, but requires more effort (TensorRT API): i.e. memory management, device-host copies, etc

Other options include ONNX Runtime from Microsoft and TRTorch (since renamed Torch-TensorRT).

2) Triton Workflow
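
As a minimal sketch of the first step in serving a TensorRT engine with Triton, the script below lays out a model repository and writes a config.pbtxt. The model name, tensor names, and shapes are assumptions for illustration and must match the real engine. The helper `create_model_repository` is hypothetical, and the repository is created in a temporary directory.

```python
import tempfile
from pathlib import Path

# Triton model configuration in protobuf text format.
# Tensor names, dims, and batch size below are illustrative assumptions.
CONFIG_TEMPLATE = """\
name: "{name}"
platform: "tensorrt_plan"
max_batch_size: {max_batch_size}
input [
  {{
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }}
]
output [
  {{
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }}
]
"""


def create_model_repository(root, name, max_batch_size=8):
    """Create <root>/<name>/config.pbtxt and the version directory <root>/<name>/1/."""
    model_dir = Path(root) / name
    version_dir = model_dir / "1"  # numeric sub-directory = model version
    version_dir.mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.pbtxt"
    config_path.write_text(
        CONFIG_TEMPLATE.format(name=name, max_batch_size=max_batch_size)
    )
    # The serialized TensorRT engine is expected at this path.
    return config_path, version_dir / "model.plan"


repo_root = tempfile.mkdtemp()
config_path, engine_path = create_model_repository(repo_root, "my_trt_model")
print(config_path.read_text())
print("copy the TensorRT engine to:", engine_path)
```

Once the engine file is copied into place, the server is typically started with tritonserver --model-repository pointing at the repository root. Clients can then send inference requests over HTTP or gRPC.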

5. Conclusion

https://www.nvidia.com/en-us/on-demand/session/gtcspring21-se2690/

Introduction to TensorRT and Triton: A Walkthrough of Optimizing Your First Deep Learning Inference Model | NVIDIA On-Demand


Source: NVIDIA, 끄적끄적 fine 애플 blog

뜨리스땅

https://tristanchoi.tistory.com/662

Ways to Improve the Inference Performance of Deep Learning/AI Models

https://tristanchoi.tistory.com/654

Ways to Use (Serve) AI Models in a Service: Serving Optimization Methods
