- Increasing throughput and reducing latency during inference.
- Throughput: the number of inferences completed per second.
- Latency: the time it takes to execute a single inference.
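
The two definitions above can be turned into a small measurement helper. This is a minimal sketch in plain C++, not TensorRT code: `BenchmarkResult`, `summarize`, and `timeCallsSeconds` are hypothetical names introduced only for illustration, and the callable passed in is assumed to block until the inference result is ready.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical result type for this sketch (not part of TensorRT).
struct BenchmarkResult {
    double meanLatencyMs;     // average time of one inference call, in milliseconds
    double throughputPerSec;  // inputs processed per second
};

// Derive both metrics from a measured total time.
// `calls` is how many times inference ran; `batchSize` is inputs per call.
inline BenchmarkResult summarize(double totalSeconds, std::size_t calls, std::size_t batchSize) {
    assert(totalSeconds > 0.0 && calls > 0 && batchSize > 0);
    BenchmarkResult r;
    r.meanLatencyMs = totalSeconds * 1000.0 / static_cast<double>(calls);
    r.throughputPerSec = static_cast<double>(calls * batchSize) / totalSeconds;
    return r;
}

// Time `calls` invocations of `infer` with a monotonic clock.
// Assumption: `infer` returns only after the result is available. For GPU
// work this means it must synchronize, otherwise only the enqueue cost is timed.
template <typename Infer>
double timeCallsSeconds(Infer&& infer, std::size_t calls) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < calls; ++i) {
        infer();
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

For example, `summarize(2.0, 100, 8)` reports a mean latency of 20 ms per call and a throughput of 400 inputs per second, which shows how a larger batch can raise throughput without lowering per-call latency.
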
- Flow: torch model -> ONNX -> TensorRT -> apply optimizations and generate engine -> evaluate TensorRT model inference speed on GPU.
- ONNX is a standard for representing deep learning models, enabling them to be transferred between frameworks.
- ONNX has a protobuf dependency. (.pb stands for protobuf.) - link
- TensorRT components
- ONNX parser: parses an ONNX model into a TensorRT network definition. - link
- Builder: takes the TensorRT network definition as input and outputs an engine optimized for the target GPU.
- Engine: performs the inference. Input: data; output: inference result.
- Logger: logs messages during the build and inference phases (builder / engine).

```cpp
// TensorRT (builder phase)
// Declare the CUDA engine
SampleUniquePtr<nvinfer1::ICudaEngine> mEngine{nullptr};
// Create the CUDA engine from the network definition and builder config
mEngine = SampleUniquePtr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config));
```

- Engine creation takes the ONNX model as input.
- SimpleOnnx::buildEngine: parses the ONNX model and saves its information to the network object.
- If you want to use dynamic input shapes, set them up through the builder class. (The builder must be configured with an optimization profile for this.)
- What is optimization in the TensorRT pipeline?
- Optimum, minimum, and maximum input dimensions.
- The builder selects the kernels that will be used at runtime.
- That is, the builder class sets inference parameters such as batch size, input size, and min / max dims. (This is called 'optimization'.)
- So the built TensorRT engine already has these inference parameters set. (They are fixed by the builder, not at the engine phase.)
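
The min / opt / max idea above can be modeled in a few lines. This is a minimal sketch that does not call TensorRT: `Shape4`, `ShapeProfile`, `isValid`, and `accepts` are hypothetical names, and the sketch only illustrates the rule that a runtime shape has to fall inside the range fixed at build time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4-D input shape (N, C, H, W); not a TensorRT type.
using Shape4 = std::array<std::int64_t, 4>;

// Hypothetical model of the dimensions fixed for one input at build time.
struct ShapeProfile {
    Shape4 min;  // smallest shape the engine must accept
    Shape4 opt;  // shape the builder tunes for
    Shape4 max;  // largest shape the engine must accept
};

// A profile is well-formed when min <= opt <= max in every dimension.
inline bool isValid(const ShapeProfile& p) {
    for (std::size_t i = 0; i < 4; ++i) {
        if (p.min[i] > p.opt[i] || p.opt[i] > p.max[i]) {
            return false;
        }
    }
    return true;
}

// A runtime shape is usable only if it lies inside [min, max] in every dimension.
inline bool accepts(const ShapeProfile& p, const Shape4& shape) {
    assert(isValid(p));
    for (std::size_t i = 0; i < 4; ++i) {
        if (shape[i] < p.min[i] || shape[i] > p.max[i]) {
            return false;
        }
    }
    return true;
}
```

With a profile of batch size 1 (min), 8 (opt), and 32 (max) on 3x224x224 inputs, a batch of 16 is accepted and a batch of 64 is rejected, because the range was decided when the engine was built.
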
- In the engine phase, create an execution context for inference.

```cpp
// Engine phase
// Declare the execution context
SampleUniquePtr<nvinfer1::IExecutionContext> mContext{nullptr};
// Create the execution context from the built engine
mContext = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
```

- For inference, inputs are copied from host (CPU) to device (GPU), and the work is queued with enqueueV2.
- enqueue does some CPU work to prepare for GPU kernel launches. - enqueue issue
- enqueueV2 submits the request to a CUDA stream. The call determines the runtime input batch size, the pointers to the input and output buffers, and the CUDA stream used for kernel execution.
- Inference requests can be queued on the GPU asynchronously through the context.
- Real applications commonly batch inputs (rather than sending single inputs) to achieve higher performance and efficiency.
- Batched inputs can be computed in parallel.
- Larger batches generally enable more efficient use of GPU resources.
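
Batching as described above usually means packing several inputs into one contiguous buffer before a single inference call. This is a minimal host-side sketch in plain C++: `packBatches` is a hypothetical helper, and each sample is assumed to be a flat vector of `sampleSize` floats.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Pack individual samples into contiguous batch buffers holding at most
// `batchSize` samples each. The last batch may be smaller than the others.
inline std::vector<std::vector<float>> packBatches(
    const std::vector<std::vector<float>>& samples,
    std::size_t sampleSize,
    std::size_t batchSize) {
    assert(batchSize > 0);
    std::vector<std::vector<float>> batches;
    for (std::size_t start = 0; start < samples.size(); start += batchSize) {
        const std::size_t end = std::min(start + batchSize, samples.size());
        std::vector<float> batch;
        batch.reserve((end - start) * sampleSize);
        for (std::size_t i = start; i < end; ++i) {
            assert(samples[i].size() == sampleSize);
            // Append sample i directly after the previous one.
            batch.insert(batch.end(), samples[i].begin(), samples[i].end());
        }
        batches.push_back(std::move(batch));
    }
    return batches;
}
```

Each resulting buffer can then be copied to the device in one transfer, so five samples with a batch size of two give three transfers (2, 2, and 1 samples) instead of five.
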
- latency, throughput, ...
- Consider the following when evaluating latency.
- Data is transferred between the GPU and CPU before inference starts and after inference completes.
- Pre-fetch data to the GPU, overlap compute with data transfer, and hide the data-transfer overhead.
- CUDA events can be used to compute the elapsed time between two events. link
- cudaEventRecord() takes place asynchronously, so there is no guarantee that the measured latency covers exactly the interval between the two events.
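
The effect of hiding transfer overhead can be estimated with simple arithmetic. This is an idealized sketch, not a measurement: `StageTimesMs`, `serialTotalMs`, and `overlappedTotalMs` are hypothetical names, every batch is assumed to take the same time in each stage, and the overlap is assumed to be perfect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical per-batch stage times in milliseconds.
struct StageTimesMs {
    double hostToDevice;  // copy inputs CPU -> GPU
    double compute;       // run inference on the GPU
    double deviceToHost;  // copy outputs GPU -> CPU
};

// No overlap: every batch runs copy-in, compute, and copy-out back to back.
inline double serialTotalMs(const StageTimesMs& t, std::size_t batches) {
    return static_cast<double>(batches) * (t.hostToDevice + t.compute + t.deviceToHost);
}

// Ideal overlap: while one batch computes, the next one is already being
// copied in and the previous one copied out. The first batch pays for all
// three stages; every later batch only adds the time of the slowest stage.
inline double overlappedTotalMs(const StageTimesMs& t, std::size_t batches) {
    assert(batches > 0);
    const double slowest = std::max({t.hostToDevice, t.compute, t.deviceToHost});
    return t.hostToDevice + t.compute + t.deviceToHost +
           static_cast<double>(batches - 1) * slowest;
}
```

With stage times of 1 ms, 4 ms, and 1 ms over 10 batches, the serial estimate is 60 ms and the overlapped estimate is 42 ms. Real numbers depend on the hardware and on how well the transfers actually overlap, so they still need to be measured.
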
- Best Practices for TensorRT Performance. - link
- Use mixed precision computation.
- Change the workspace size.
- Reuse the TensorRT engine. (Keep it in GPU memory?)
- FP16 and INT8 precision can be used for inference (default: FP32).
- Computations can also mix precisions (FP32 + FP16 / FP16 + INT8 / FP32 + INT8, ...).
- Increase resources.
- Several workloads could share the GPU at the same time.
- Serialize the engine. (This shortens the pipeline, since a saved engine can be loaded without rebuilding it.)
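
Serializing the engine comes down to saving a byte blob once and loading it on later runs. This is a minimal sketch of the file handling only: `saveBlob` and `loadBlob` are hypothetical helpers, and the blob is assumed to come from TensorRT's `ICudaEngine::serialize()` and to be handed to `IRuntime::deserializeCudaEngine` after loading, neither of which is called here.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <iterator>
#include <string>
#include <vector>

// Write an opaque byte blob (for example a serialized engine) to `path`.
// Returns true when every byte was written.
inline bool saveBlob(const std::string& path, const std::vector<char>& blob) {
    assert(!path.empty());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) {
        return false;
    }
    out.write(blob.data(), static_cast<std::streamsize>(blob.size()));
    return static_cast<bool>(out);
}

// Read the blob back. Returns an empty vector when the file cannot be
// opened, which a caller can treat as "no cached engine, build one".
inline std::vector<char> loadBlob(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return {};
    }
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}
```

A typical pattern is to call `loadBlob` at startup and run the builder only when the result is empty. A serialized engine is generally tied to the TensorRT version and GPU it was built for, so the cache should be rebuilt when either changes.
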