Triton Inference Server

Background

Triton是什么？Triton是 NVIDIA 推出的 Inference Server，专门做 AI 模型的部署服务，第一次面世的时间大约在2018年，至今为止开源社区一直在维护。

客户端可以HTTP/REST或gRPC的方式来请求服务，特性包括以下方面：

支持多种框架，例如 Tensorflow、TensoRT、Pytorch、ONNX甚至自定义框架后端；
支持 GPU 和 CPU 方式运行，能最大化利用硬件资源；
容器化部署，集成 k8s，可以方便的进行编排和扩展；
支持并发模型，支持多种模型或同一模型的不同实例在同一GPU上运行
支持多种批处理算法，可以提高推理吞吐量；

Demo

下载triton镜像和示例模型

1
2
3

docker pull nvcr.io/nvidia/tritonserver:21.12-py3
cd docs/examples
./fetch_models.sh #下载模型

启动container

#GPU
docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v{PATH}/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:21.12-py3 tritonserver --model-repository=/models

#CPU
docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v{PATH}/examples/model_repository:/models nvcr.io/nvidia/tritonserver:21.12-py3 tritonserver --model-repository=/models

检查健康情况

1	curl -v localhost:8000/v2/health/ready

下载用户端镜像

docker pull nvcr.io/nvidia/tritonserver:21.12-py3-sdk
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:21.12-py3-sdk
#此时已经在容器里
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
#Request 0, batch size 1
#Image '/workspace/images/mug.jpg':
#    15.349564 (504) = COFFEE MUG
#    13.227464 (968) = CUP
#    10.424892 (505) = COFFEEPOT

Step into container

container的运行command

1	/opt/tritonserver/nvidia_entrypoint.sh tritonserver --model-repository=/models

# vi /opt/tritonserver/nvidia_entrypoint.sh
...

if [[ $# -eq 0 ]]; then
  exec "/bin/bash"
else
  exec "$@" # 执行 tritonserver --model-repository=/models
fi

tritonserver的路径在/opt/tritonserver/bin/tritonserver

local client build

The offical website give the client image. However, in some situations, we want to build the client environment locally because it will reduce the storage and communication.

wget https://github.com/triton-inference-server/client/archive/refs/heads/main.zip
unzip client-main.zip
cd client-main.zip

pip install tritonclient[all]
pip install pillow
pip install attrdict

# 使用HTTP协议
python client-main/src/python/examples/image_client.py -m inception_graphdef -s INCEPTION ./mug.jpg
# Request 1, batch size 1
#    0.826452 (505) = COFFEE MUG
# PASS

python client-main/src/python/examples/image_client.py -m cifar_simple -s INCEPTION ./dog.jpg
# Request 1, batch size 1
#    1.323445 (5) = dog
# PASS

# 使用gRPC(建议，性能优于HTTP)
python image_client.py -m cifar_simple -s INCEPTION -u 127.0.0.1:8001 -i gRPC ~/triton-exp/dog.jpg

What is Triton Inference Server Backend

A Triton backend is the implementation that executes a model. A backend can be a wrapper around a deep-learning framework, like PyTorch, TensorFlow, TensorRT or ONNX Runtime. Or a backend can be custom C/C++ logic performing any operation (for example, image pre-processing).

What is TensorRT

The TensorRT backend is used to execute TensorRT models. The server repo contains the source for the backend.

TensorRT是英伟达针对自家平台做的加速包，TensorRT主要做了这么两件事情，来提升模型的运行速度。
TensorRT支持INT8和FP16的计算。深度学习网络在训练时，通常使用 32 位或 16 位数据。TensorRT则在网络的推理时选用不这么高的精度，达到加速推断的目的。
TensorRT对于网络结构进行了重构，把一些能够合并的运算合并在了一起，针对GPU的特性做了优化。现在大多数深度学习框架是没有针对GPU做过性能优化的，而英伟达，GPU的生产者和搬运工，自然就推出了针对自己GPU的加速工具TensorRT。一个深度学习模型，在没有优化的情况下，比如一个卷积层、一个偏置层和一个reload层，这三层是需要调用三次cuDNN对应的API，但实际上这三层的实现完全是可以合并到一起的，TensorRT会对一些可以合并网络进行合并。

Backend Introduction

简单、通用

ONNX Runtime

The ONNX Runtime backend is used to execute ONNX models. The onnxruntime_backend repo contains the documentation and source for the backend.

ONNX是一种针对机器学习所设计的开放式的文件格式，用于存储训练好的模型。它使得不同的人工智能框架（如Pytorch, MXNet）可以采用相同格式存储模型数据并交互。 ONNX的规范及代码主要由微软，亚马逊，Facebook 和 IBM 等公司共同开发，以开放源代码的方式托管在Github上。目前官方支持加载ONNX模型并进行推理的深度学习框架有： Caffe2, PyTorch, MXNet，ML.NET，TensorRT 和 Microsoft CNTK，并且 TensorFlow 也非官方的支持ONNX。—维基百科
ONNX使用的是Protobuf这个序列化数据结构去存储神经网络的权重信息
当我们加载了一个ONNX之后，我们获得的就是一个ModelProto，它包含了一些版本信息，生产者信息和一个GraphProto。在GraphProto里面又包含了四个repeated数组，它们分别是node(NodeProto类型)，input(ValueInfoProto类型)，output(ValueInfoProto类型)和initializer(TensorProto类型)，其中node中存放了模型中所有的计算节点，input存放了模型的输入节点，output存放了模型中所有的输出节点，initializer存放了模型的所有权重参数。

OpenVINO

The OpenVINO backend is used to execute OpenVINO models. The openvino_backend repo contains the documentation and source for the backend.

Inter

常用深度学习框架

TensorFlow

The TensorFlow backend is used to execute TensorFlow models in both GraphDef and SavedModel formats. The same backend is used to execute both TensorFlow 1 and TensorFlow 2 models. The tensorflow_backend repo contains the documentation and source for the backend.

PyTorch
The PyTorch backend is used to execute TorchScript models. The pytorch_backend repo contains the documentation and source for the backend.

TorchScript类似于ONNX，存储了模型的参数和结构等，但仅支持tensor的操作

Python

The Python backend allows you to write your model logic in Python. For example, you can use this backend to execute pre/post processing code written in Python, or to execute a PyTorch Python script directly (instead of first converting it to TorchScript and then using the PyTorch backend). The python_backend repo contains the documentation and source for the backend.

加速预处理框架

DALI

DALI is a collection of highly optimized building blocks and an execution engine that accelerates the pre-processing of the input data for deep learning applications. The DALI backend allows you to execute your DALI pipeline within Triton. The dali_backend repo contains the documentation and source for the backend.

NVIDIA https://www.jiqizhixin.com/articles/2020-02-04-6

加速树模型框架

FIL

The FIL (Forest Inference Library) backend is used to execute a variety of tree-based ML models, including XGBoost models, LightGBM models, Scikit-Learn random forest models, and cuML random forest models. The fil_backend repo contains the documentation and source for the backend. https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35

How to write Triton Inference Server Backend

Triton Backend API

A Triton backend must implement the C interface defined in tritonbackend.h

what is in tritonbackend.h

// vi triton/core/tritonbackend.h

#include "triton/core/tritonserver.h"
struct TRITONBACKEND_MemoryManager;
struct TRITONBACKEND_Input;
struct TRITONBACKEND_Output;
struct TRITONBACKEND_State;
struct TRITONBACKEND_Request;
struct TRITONBACKEND_ResponseFactory;
/// A response factory allows a request
/// to be released before all responses have been sent. Releasing a
/// request as early as possible releases all input tensor data and
/// therefore may be desirable in some cases.
struct TRITONBACKEND_Response;
struct TRITONBACKEND_Backend;
struct TRITONBACKEND_Model;
struct TRITONBACKEND_ModelInstance;

what is in tritonserver.h

// vi triton/core/tritonserver.h

struct TRITONSERVER_Error;
struct TRITONSERVER_InferenceRequest;
struct TRITONSERVER_InferenceResponse;
struct TRITONSERVER_InferenceTrace;
struct TRITONSERVER_Message;
struct TRITONSERVER_Metrics;
struct TRITONSERVER_ResponseAllocator;
struct TRITONSERVER_Server;
struct TRITONSERVER_ServerOptions;

cd backends
# cmake //dir
# src   //dir
# CMakeLists.txt
以下折叠
CMakeLists.txt是cmake的配置文件，用于自动编译c++工程
以trition中的CMakeLists.txt为例，我们看一下里面有什么
Plaintext
cmake_minimum_required(VERSION 3.17) # cmake版本要求

project(tutorialminimalbackend LANGUAGES C CXX) # 项目名字和指定的语言

#
# Options
#
# Must include options required for this project as well as any
# projects included in this one by FetchContent.
#
# GPU support is disabled by default because minimal backend doesn't
# use GPUs.
#
option(TRITON_ENABLE_GPU "Enable GPU support in backend" OFF) # 是否使用某些库
option(TRITON_ENABLE_STATS "Include statistics collections in backend" ON)
# 定义一些变量，CACHE指的是作用于整个cmake的全局变量
set(TRITON_COMMON_REPO_TAG "main" CACHE STRING "Tag for triton-inference-server/common repo")
set(TRITON_CORE_REPO_TAG "main" CACHE STRING "Tag for triton-inference-server/core repo")
set(TRITON_BACKEND_REPO_TAG "main" CACHE STRING "Tag for triton-inference-server/backend repo")

if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE Release)
endif()

#
# Dependencies
#
# FetchContent requires us to include the transitive closure of all
# repos that we depend on so that we can override the tags.
#
#  include 指令用来载入并运行来自于文件或模块的 CMake 代码
include(FetchContent)

# 子项目声明
FetchContent_Declare(
  repo-common
  GIT_REPOSITORY https://github.com/triton-inference-server/common.git
  GIT_TAG ${TRITON_COMMON_REPO_TAG}
  GIT_SHALLOW ON
)
FetchContent_Declare(
  repo-core
  GIT_REPOSITORY https://github.com/triton-inference-server/core.git
  GIT_TAG ${TRITON_CORE_REPO_TAG}
  GIT_SHALLOW ON
)
FetchContent_Declare(
  repo-backend
  GIT_REPOSITORY https://github.com/triton-inference-server/backend.git
  GIT_TAG ${TRITON_BACKEND_REPO_TAG}
  GIT_SHALLOW ON
)
FetchContent_MakeAvailable(repo-common repo-core repo-backend)

#
# The backend must be built into a shared library. Use an ldscript to
# hide all symbols except for the TRITONBACKEND API.
#
configure_file(src/libtriton_minimal.ldscript libtriton_minimal.ldscript COPYONLY)

add_library(
  triton-minimal-backend SHARED
  src/minimal.cc
)

add_library(
  TutorialMinimalBackend::triton-minimal-backend ALIAS triton-minimal-backend
)

target_include_directories(
  triton-minimal-backend
  PRIVATE
    ${CMAKE_CURRENT_SOURCE_DIR}/src
)

target_compile_features(triton-minimal-backend PRIVATE cxx_std_11)
target_compile_options(
  triton-minimal-backend PRIVATE
  $<$<OR:$<CXX_COMPILER_ID:Clang>,$<CXX_COMPILER_ID:AppleClang>,$<CXX_COMPILER_ID:GNU>>:
    -Wall -Wextra -Wno-unused-parameter -Wno-type-limits -Werror>
  $<$<CXX_COMPILER_ID:MSVC>:/Wall /D_WIN32_WINNT=0x0A00 /EHsc>
)

target_link_libraries(
  triton-minimal-backend
  PRIVATE
    triton-core-serverapi   # from repo-core
    triton-core-backendapi  # from repo-core
    triton-core-serverstub  # from repo-core
    triton-backend-utils    # from repo-backend
)

if(WIN32)
  set_target_properties(
    triton-minimal-backend PROPERTIES
    POSITION_INDEPENDENT_CODE ON
    OUTPUT_NAME triton_minimal
  )
else()
  set_target_properties(
    triton-minimal-backend PROPERTIES
    POSITION_INDEPENDENT_CODE ON
    OUTPUT_NAME triton_minimal
    LINK_DEPENDS ${CMAKE_CURRENT_BINARY_DIR}/libtriton_minimal.ldscript
    LINK_FLAGS "-Wl,--version-script libtriton_minimal.ldscript"
  )
endif()

#
# Install
#
include(GNUInstallDirs)
set(INSTALL_CONFIGDIR ${CMAKE_INSTALL_LIBDIR}/cmake/TutorialMinimalBackend)

install(
  TARGETS
    triton-minimal-backend
  EXPORT
    triton-minimal-backend-targets
  LIBRARY DESTINATION ${CMAKE_INSTALL_PREFIX}/backends/minimal
  RUNTIME DESTINATION ${CMAKE_INSTALL_PREFIX}/backends/minimal
)

install(
  EXPORT
    triton-minimal-backend-targets
  FILE
    TutorialMinimalBackendTargets.cmake
  NAMESPACE
    TutorialMinimalBackend::
  DESTINATION
    ${INSTALL_CONFIGDIR}
)

include(CMakePackageConfigHelpers)
configure_package_config_file(
  ${CMAKE_CURRENT_LIST_DIR}/cmake/TutorialMinimalBackendConfig.cmake.in
  ${CMAKE_CURRENT_BINARY_DIR}/TutorialMinimalBackendConfig.cmake
  INSTALL_DESTINATION ${INSTALL_CONFIGDIR}
)

install(
  FILES
  ${CMAKE_CURRENT_BINARY_DIR}/TutorialMinimalBackendConfig.cmake
  DESTINATION ${INSTALL_CONFIGDIR}
)

#
# Export from build tree
#
export(
  EXPORT triton-minimal-backend-targets
  FILE ${CMAKE_CURRENT_BINARY_DIR}/TutorialMinimalBackendTargets.cmake
  NAMESPACE TutorialMinimalBackend::
)

export(PACKAGE TutorialMinimalBackend)

Concept

TRITONBACKEND_Backend

A TRITONBACKEND_Backend object represents the backend itself. The same backend object is shared across all models that use the backend. The associated API, like TRITONBACKEND_BackendName, is used to get information about the backend and to associate a user-defined state with the backend.

A backend can optionally implement TRITONBACKEND_Initialize and TRITONBACKEND_Finalize to get notification of when the backend object is created and destroyed (for more information see backend lifecycles).

TRITONBACKEND_Model

A TRITONBACKEND_Model object represents a model. Each model loaded by Triton is associated with a TRITONBACKEND_Model. Each model can use the TRITONBACKEND_ModelBackend API to get the backend object representing the backend that is used by the model.

The same model object is shared across all instances of that model. The associated API, like TRITONBACKEND_ModelName, is used to get information about the model and to associate a user-defined state with the model.

Most backends will implement TRITONBACKEND_ModelInitialize and TRITONBACKEND_ModelFinalize to initialize the backend for a given model and to manage the user-defined state associated with the model (for more information see backend lifecycles).

The backend must take into account threading concerns when implementing TRITONBACKEND_ModelInitialize and TRITONBACKEND_ModelFinalize. Triton will not perform multiple simultaneous calls to these functions for a given model; however, if a backend is used by multiple models Triton may simultaneously call the functions with a different thread for each model. As a result, the backend must be able to handle multiple simultaneous calls to the functions. Best practice for backend implementations is to use only function-local and model-specific user-defined state in these functions, as is shown in the tutorial.

TRITONBACKEND_ModelInstance

A TRITONBACKEND_ModelInstance object represents a model instance. Triton creates one or more instances of the model based on the instance_group settings specified in the model configuration. Each of these instances is associated with a TRITONBACKEND_ModelInstance object.

The only function that the backend must implement is TRITONBACKEND_ModelInstanceExecute. The TRITONBACKEND_ModelInstanceExecute function is called by Triton to perform inference/computation on a batch of inference requests. Most backends will also implement TRITONBACKEND_ModelInstanceInitialize and TRITONBACKEND_ModelInstanceFinalize to initialize the backend for a given model instance and to manage the user-defined state associated with the model (for more information see backend lifecycles).

The backend must take into account threading concerns when implementing TRITONBACKEND_ModelInstanceInitialize, TRITONBACKEND_ModelInstanceFinalize and TRITONBACKEND_ModelInstanceExecute. Triton will not perform multiple simultaneous calls to these functions for a given model instance; however, if a backend is used by a model with multiple instances or by multiple models Triton may simultaneously call the functions with a different thread for each model instance. As a result, the backend must be able to handle multiple simultaneous calls to the functions. Best practice for backend implementations is to use only function-local and model-specific user-defined state in these functions, as is shown in the tutorial.

TRITONBACKEND_Request

A TRITONBACKEND_Request object represents an inference request made to the model. The backend takes ownership of the request object(s) in TRITONBACKEND_ModelInstanceExecute and must release each request by calling TRITONBACKEND_RequestRelease. However, the ownership of request object is returned back to Triton in case TRITONBACKEND_ModelInstanceExecute returns an error. See Inference Requests and Responses for more information about request lifecycle.

The Triton Backend API allows the backend to get information about the request as well as the input and request output tensors of the request. Each request input is represented by a TRITONBACKEND_Input object.

TRITONBACKEND_Response

A TRITONBACKEND_Response object represents a response sent by the backend for a specific request. The backend uses the response API to set the name, shape, datatype and tensor values for each output tensor included in the response. The response can indicate either a failed or a successful request. See Inference Requests and Responses for more information about request-response lifecycle.

MinoBackend

// backend/examples/backends/minimal/src/minimal.cc

// Copyright 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
// are met:
//  * Redistributions of source code must retain the above copyright
//    notice, this list of conditions and the following disclaimer.
//  * Redistributions in binary form must reproduce the above copyright
//    notice, this list of conditions and the following disclaimer in the
//    documentation and/or other materials provided with the distribution.
//  * Neither the name of NVIDIA CORPORATION nor the names of its
//    contributors may be used to endorse or promote products derived
//    from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
// OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

#include "triton/backend/backend_common.h"
#include "triton/backend/backend_input_collector.h"
#include "triton/backend/backend_model.h"
#include "triton/backend/backend_model_instance.h"
#include "triton/backend/backend_output_responder.h"
#include "triton/core/tritonbackend.h"

namespace triton { namespace backend { namespace minimal {

//
// Minimal backend that demonstrates the TRITONBACKEND API. This
// backend works for any model that has 1 input called "IN0" with
// INT32 datatype and shape [ 4 ] and 1 output called "OUT0" with
// INT32 datatype and shape [ 4 ]. The backend supports both batching
// and non-batching models.
//
// For each batch of requests, the backend returns the input tensor
// value in the output tensor.
//

/////////////

//
// ModelState
//
// State associated with a model that is using this backend. An object
// of this class is created and associated with each
// TRITONBACKEND_Model. ModelState is derived from BackendModel class
// provided in the backend utilities that provides many common
// functions.
//
class ModelState : public BackendModel {
 public:
  static TRITONSERVER_Error* Create(
      TRITONBACKEND_Model* triton_model, ModelState** state);
  virtual ~ModelState() = default;

 private:
  ModelState(TRITONBACKEND_Model* triton_model) : BackendModel(triton_model) {}
};

TRITONSERVER_Error*
ModelState::Create(TRITONBACKEND_Model* triton_model, ModelState** state)
{
  try {
    *state = new ModelState(triton_model);
  }
  catch (const BackendModelException& ex) {
    RETURN_ERROR_IF_TRUE(
        ex.err_ == nullptr, TRITONSERVER_ERROR_INTERNAL,
        std::string("unexpected nullptr in BackendModelException"));
    RETURN_IF_ERROR(ex.err_);
  }

  return nullptr;  // success
}
// extern是C/C++语言中表明函数和全局变量作用范围（可见性）的关键字，该关键字告诉编译器，其声明的函数和变量可以在本模块或其它模块中使用
// "C"表明以下代码按照C的规定来编译和连接
extern "C" {

// Triton calls TRITONBACKEND_ModelInitialize when a model is loaded
// to allow the backend to create any state associated with the model,
// and to also examine the model configuration to determine if the
// configuration is suitable for the backend. Any errors reported by
// this function will prevent the model from loading.
//
TRITONSERVER_Error*
TRITONBACKEND_ModelInitialize(TRITONBACKEND_Model* model)
{
  // Create a ModelState object and associate it with the
  // TRITONBACKEND_Model. If anything goes wrong with initialization
  // of the model state then an error is returned and Triton will fail
  // to load the model.
  ModelState* model_state;
  RETURN_IF_ERROR(ModelState::Create(model, &model_state));
  RETURN_IF_ERROR(
      TRITONBACKEND_ModelSetState(model, reinterpret_cast<void*>(model_state)));
 // 创建模型
  return nullptr;  // success
}

// Triton calls TRITONBACKEND_ModelFinalize when a model is no longer
// needed. The backend should cleanup any state associated with the
// model. This function will not be called until all model instances
// of the model have been finalized.
//
TRITONSERVER_Error*
TRITONBACKEND_ModelFinalize(TRITONBACKEND_Model* model)
{
  void* vstate;
  RETURN_IF_ERROR(TRITONBACKEND_ModelState(model, &vstate));
  ModelState* model_state = reinterpret_cast<ModelState*>(vstate);
  delete model_state;
  // 销毁模型
  return nullptr;  // success
}

}  // extern "C"

/////////////

//
// ModelInstanceState
//
// State associated with a model instance. An object of this class is
// created and associated with each
// TRITONBACKEND_ModelInstance. ModelInstanceState is derived from
// BackendModelInstance class provided in the backend utilities that
// provides many common functions.
//
class ModelInstanceState : public BackendModelInstance {
 public:
  static TRITONSERVER_Error* Create(
      ModelState* model_state,
      TRITONBACKEND_ModelInstance* triton_model_instance,
      ModelInstanceState** state);
  virtual ~ModelInstanceState() = default;

  // Get the state of the model that corresponds to this instance.
  ModelState* StateForModel() const { return model_state_; }

 private:
  ModelInstanceState(
      ModelState* model_state,
      TRITONBACKEND_ModelInstance* triton_model_instance)
      : BackendModelInstance(model_state, triton_model_instance),
        model_state_(model_state)
  {
  }

  ModelState* model_state_;
};

TRITONSERVER_Error*
ModelInstanceState::Create(
    ModelState* model_state, TRITONBACKEND_ModelInstance* triton_model_instance,
    ModelInstanceState** state)
{
  try {
    *state = new ModelInstanceState(model_state, triton_model_instance);
  }
  catch (const BackendModelInstanceException& ex) {
    RETURN_ERROR_IF_TRUE(
        ex.err_ == nullptr, TRITONSERVER_ERROR_INTERNAL,
        std::string("unexpected nullptr in BackendModelInstanceException"));
    RETURN_IF_ERROR(ex.err_);
  }

  return nullptr;  // success
}

extern "C" {

// Triton calls TRITONBACKEND_ModelInstanceInitialize when a model
// instance is created to allow the backend to initialize any state
// associated with the instance.
//
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceInitialize(TRITONBACKEND_ModelInstance* instance)
{
  // Get the model state associated with this instance's model.
  TRITONBACKEND_Model* model;
  RETURN_IF_ERROR(TRITONBACKEND_ModelInstanceModel(instance, &model));

  void* vmodelstate;
  RETURN_IF_ERROR(TRITONBACKEND_ModelState(model, &vmodelstate));
  ModelState* model_state = reinterpret_cast<ModelState*>(vmodelstate);

  // Create a ModelInstanceState object and associate it with the
  // TRITONBACKEND_ModelInstance.
  ModelInstanceState* instance_state;
  RETURN_IF_ERROR(
      ModelInstanceState::Create(model_state, instance, &instance_state));
  RETURN_IF_ERROR(TRITONBACKEND_ModelInstanceSetState(
      instance, reinterpret_cast<void*>(instance_state)));

  return nullptr;  // success
}

// Triton calls TRITONBACKEND_ModelInstanceFinalize when a model
// instance is no longer needed. The backend should cleanup any state
// associated with the model instance.
//
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceFinalize(TRITONBACKEND_ModelInstance* instance)
{
  void* vstate;
  RETURN_IF_ERROR(TRITONBACKEND_ModelInstanceState(instance, &vstate));
  ModelInstanceState* instance_state =
      reinterpret_cast<ModelInstanceState*>(vstate);
  delete instance_state;

  return nullptr;  // success
}

}  // extern "C"

/////////////

extern "C" {

// When Triton calls TRITONBACKEND_ModelInstanceExecute it is required
// that a backend create a response for each request in the batch. A
// response may be the output tensors required for that request or may
// be an error that is returned in the response.
//
TRITONSERVER_Error*
TRITONBACKEND_ModelInstanceExecute(
    TRITONBACKEND_ModelInstance* instance, TRITONBACKEND_Request** requests,
    const uint32_t request_count)
{
  // Triton will not call this function simultaneously for the same
  // 'instance'. But since this backend could be used by multiple
  // instances from multiple models the implementation needs to handle
  // multiple calls to this function at the same time (with different
  // 'instance' objects). Best practice for a high-performance
  // implementation is to avoid introducing mutex/lock and instead use
  // only function-local and model-instance-specific state.
  ModelInstanceState* instance_state;
  RETURN_IF_ERROR(TRITONBACKEND_ModelInstanceState(
      instance, reinterpret_cast<void**>(&instance_state)));
  ModelState* model_state = instance_state->StateForModel();

  // 'responses' is initialized as a parallel array to 'requests',
  // with one TRITONBACKEND_Response object for each
  // TRITONBACKEND_Request object. If something goes wrong while
  // creating these response objects, the backend simply returns an
  // error from TRITONBACKEND_ModelInstanceExecute, indicating to
  // Triton that this backend did not create or send any responses and
  // so it is up to Triton to create and send an appropriate error
  // response for each request. RETURN_IF_ERROR is one of several
  // useful macros for error handling that can be found in
  // backend_common.h.

  std::vector<TRITONBACKEND_Response*> responses;
  responses.reserve(request_count);
  for (uint32_t r = 0; r < request_count; ++r) {
    TRITONBACKEND_Request* request = requests[r];
    TRITONBACKEND_Response* response;
    RETURN_IF_ERROR(TRITONBACKEND_ResponseNew(&response, request));
    responses.push_back(response);
  }

  // At this point, the backend takes ownership of 'requests', which
  // means that it is responsible for sending a response for every
  // request. From here, even if something goes wrong in processing,
  // the backend must return 'nullptr' from this function to indicate
  // success. Any errors and failures must be communicated via the
  // response objects.
  //
  // To simplify error handling, the backend utilities manage
  // 'responses' in a specific way and it is recommended that backends
  // follow this same pattern. When an error is detected in the
  // processing of a request, an appropriate error response is sent
  // and the corresponding TRITONBACKEND_Response object within
  // 'responses' is set to nullptr to indicate that the
  // request/response has already been handled an no futher processing
  // should be performed for that request. Even if all responses fail,
  // the backend still allows execution to flow to the end of the
  // function. RESPOND_AND_SET_NULL_IF_ERROR, and
  // RESPOND_ALL_AND_SET_NULL_IF_ERROR are macros from
  // backend_common.h that assist in this management of response
  // objects.

  // The backend could iterate over the 'requests' and process each
  // one separately. But for performance reasons it is usually
  // preferred to create batched input tensors that are processed
  // simultaneously. This is especially true for devices like GPUs
  // that are capable of exploiting the large amount parallelism
  // exposed by larger data sets.
  //
  // The backend utilities provide a "collector" to facilitate this
  // batching process. The 'collector's ProcessTensor function will
  // combine a tensor's value from each request in the batch into a
  // single contiguous buffer. The buffer can be provided by the
  // backend or 'collector' can create and manage it. In this backend,
  // there is not a specific buffer into which the batch should be
  // created, so use ProcessTensor arguments that cause collector to
  // manage it.

  BackendInputCollector collector(
      requests, request_count, &responses, model_state->TritonMemoryManager(),
      false /* pinned_enabled */, nullptr /* stream*/);

  // To instruct ProcessTensor to "gather" the entire batch of IN0
  // input tensors into a single contiguous buffer in CPU memory, set
  // the "allowed input types" to be the CPU ones (see tritonserver.h
  // in the triton-inference-server/core repo for allowed memory
  // types).
  std::vector<std::pair<TRITONSERVER_MemoryType, int64_t>> allowed_input_types =
      {{TRITONSERVER_MEMORY_CPU_PINNED, 0}, {TRITONSERVER_MEMORY_CPU, 0}};

  const char* input_buffer;
  size_t input_buffer_byte_size;
  TRITONSERVER_MemoryType input_buffer_memory_type;
  int64_t input_buffer_memory_type_id;

  RESPOND_ALL_AND_SET_NULL_IF_ERROR(
      responses, request_count,
      collector.ProcessTensor(
          "IN0", nullptr /* existing_buffer */,
          0 /* existing_buffer_byte_size */, allowed_input_types, &input_buffer,
          &input_buffer_byte_size, &input_buffer_memory_type,
          &input_buffer_memory_type_id));

  // Finalize the collector. If 'true' is returned, 'input_buffer'
  // will not be valid until the backend synchronizes the CUDA
  // stream or event that was used when creating the collector. For
  // this backend, GPU is not supported and so no CUDA sync should
  // be needed; so if 'true' is returned simply log an error.
  const bool need_cuda_input_sync = collector.Finalize();
  if (need_cuda_input_sync) {
    LOG_MESSAGE(
        TRITONSERVER_LOG_ERROR,
        "'minimal' backend: unexpected CUDA sync required by collector");
  }

  // 'input_buffer' contains the batched "IN0" tensor. The backend can
  // implement whatever logic is necesary to produce "OUT0". This
  // backend simply returns the IN0 value in OUT0 so no actual
  // computation is needed.

  LOG_MESSAGE(
      TRITONSERVER_LOG_INFO,
      (std::string("model ") + model_state->Name() + ": requests in batch " +
       std::to_string(request_count))
          .c_str());
  std::string tstr;
  IGNORE_ERROR(BufferAsTypedString(
      tstr, input_buffer, input_buffer_byte_size, TRITONSERVER_TYPE_INT32));
  LOG_MESSAGE(
      TRITONSERVER_LOG_INFO,
      (std::string("batched IN0 value: ") + tstr).c_str());

  const char* output_buffer = input_buffer;
  TRITONSERVER_MemoryType output_buffer_memory_type = input_buffer_memory_type;
  int64_t output_buffer_memory_type_id = input_buffer_memory_type_id;

  // This backend supports models that batch along the first dimension
  // and those that don't batch. For non-batch models the output shape
  // will be [ 4 ]. For batch models the output shape will be [ -1, 4
  // ] and the backend "responder" utility below will set the
  // appropriate batch dimension value for each response.
  std::vector<int64_t> output_batch_shape;
  bool supports_first_dim_batching;
  RESPOND_ALL_AND_SET_NULL_IF_ERROR(
      responses, request_count,
      model_state->SupportsFirstDimBatching(&supports_first_dim_batching));
  if (supports_first_dim_batching) {
    output_batch_shape.push_back(-1);
  }
  output_batch_shape.push_back(4);

  // Because the OUT0 values are concatenated into a single contiguous
  // 'output_buffer', the backend must "scatter" them out to the
  // individual response OUT0 tensors.  The backend utilities provide
  // a "responder" to facilitate this scattering process.

  // The 'responders's ProcessTensor function will copy the portion of
  // 'output_buffer' corresonding to each request's output into the
  // response for that request.

  BackendOutputResponder responder(
      requests, request_count, &responses, model_state->TritonMemoryManager(),
      supports_first_dim_batching, false /* pinned_enabled */,
      nullptr /* stream*/);

  responder.ProcessTensor(
      "OUT0", TRITONSERVER_TYPE_INT32, output_batch_shape, output_buffer,
      output_buffer_memory_type, output_buffer_memory_type_id);

  // Finalize the responder. If 'true' is returned, the OUT0
  // tensors' data will not be valid until the backend synchronizes
  // the CUDA stream or event that was used when creating the
  // responder. For this backend, GPU is not supported and so no
  // CUDA sync should be needed; so if 'true' is returned simply log
  // an error.
  const bool need_cuda_output_sync = responder.Finalize();
  if (need_cuda_output_sync) {
    LOG_MESSAGE(
        TRITONSERVER_LOG_ERROR,
        "'minimal' backend: unexpected CUDA sync required by responder");
  }

  // Send all the responses that haven't already been sent because of
  // an earlier error.
  for (auto& response : responses) {
    if (response != nullptr) {
      LOG_IF_ERROR(
          TRITONBACKEND_ResponseSend(
              response, TRITONSERVER_RESPONSE_COMPLETE_FINAL, nullptr),
          "failed to send response");
    }
  }

  // Done with the request objects so release them.
  for (uint32_t r = 0; r < request_count; ++r) {
    auto& request = requests[r];
    LOG_IF_ERROR(
        TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL),
        "failed releasing request");
  }

  return nullptr;  // success
}

}  // extern "C"

}}}  // namespace triton::backend::minimal

资料

快速入门
https://github.com/triton-inference-server/server/blob/main/docs/quickstart.md

教程

https://blog.csdn.net/qq_38032876/article/details/109597875?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_title~default-4.no_search_link&spm=1001.2101.3001.4242.3&utm_relevant_index=7

If you like this blog or find it useful for you, you are welcome to comment on it. You are also welcome to share this blog, so that more people can participate in it. If the images used in the blog infringe your copyright, please contact the author to delete them. Thank you !

Triton Inference Server

Model inference tool.

Triton Inference Server

Background

Demo

下载triton镜像和示例模型

启动container

检查健康情况

下载用户端镜像

Step into container

local client build

What is Triton Inference Server Backend

What is TensorRT

Backend Introduction

简单、通用

常用深度学习框架

加速预处理框架

加速树模型框架

How to write Triton Inference Server Backend

Triton Backend API

what is in tritonbackend.h

what is in tritonserver.h

Concept

MinoBackend

资料

教程

FEATURED TAGS

FRIENDS