## Description

This repo contains a set of practice inference graphs implemented with Seldon Core's inference graph. The graphs in the `seldon` folder are built with Seldon's 1st-gen custom Python wrapper, and the pipelines in the `mlserver` folder are built as custom models on MLServer, Seldon's newer serving platform, combined with the Seldon inference graph.

NOTE: This repo is shared for learning purposes; some of the pipelines implemented here might not have a real-world use case, and they are not fully tested.

Pull requests, suggestions, and additions to the list of pipelines for future implementation are highly appreciated.

## Inference graphs implemented using 1st-gen Seldon

Pipelines from InferLine: latency-aware provisioning and scaling for prediction serving pipelines

  1. Cascade
  2. Ensemble (see the combiner sketch after this list)
  3. Preprocess
  4. Video Monitoring
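For reference, the Ensemble pattern maps to a Seldon combiner. Here is a minimal, hypothetical sketch of such a combiner in the 1st-gen Python wrapper (the class name and averaging logic are illustrative assumptions, not this repo's exact code):

```python
# ensemble_combiner.py -- hypothetical sketch, not this repo's exact code.
import numpy as np


class EnsembleCombiner:
    # In the 1st-gen Python wrapper, a combiner implements `aggregate`,
    # which receives the outputs of all child models in the graph.
    def aggregate(self, Xs, features_names=None):
        # Merge the children's predictions by simple averaging.
        return np.mean(np.array(Xs), axis=0)
```

Such a class would typically be packaged into a container and referenced from the SeldonDeployment graph as a node of type COMBINER.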


In addition, the following pipelines are implemented:


  1. audio-qa: Audio to text -> Question Answering (a single-stage sketch follows this list)
  2. audio-sent: Audio to text -> Sentiment Analysis
  3. nlp: Language Identification -> French-to-English Translation -> Summarisation
  4. sum-qa: Summarisation -> Question Answering
  5. video: Object Detection -> Object Classification
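Each stage in these chains is a plain Python class exposing `predict`; Seldon routes one node's output into the next node's input. Below is a minimal, hypothetical sketch of a single stage (the payload handling and placeholder output are illustrative assumptions):

```python
# audio_to_text.py -- hypothetical sketch of one chain stage.
class AudioToText:
    def __init__(self):
        # A real implementation would load a speech-to-text model here.
        self.model = None

    def predict(self, X, features_names=None):
        # X carries the decoded request payload (e.g., audio samples);
        # the return value becomes the input of the next node in the graph.
        return ["transcribed text placeholder"]
```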

## Inference graphs implemented using MLServer

  1. audio-qa: Audio to text -> Question Answering
  2. audio-sent: Audio to text -> Sentiment Analysis (see the runtime sketch after this list)
  3. nlp: Language Identification -> French-to-English Translation -> Summarisation
  4. sum-qa: Summarisation -> Question Answering
  5. video: Object Detection -> Object Classification
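With MLServer, each stage is instead a custom runtime subclassing `MLModel` with async `load`/`predict` hooks. A minimal, hypothetical sketch (the runtime name, placeholder inference, and codec choices are illustrative assumptions, not this repo's exact code):

```python
# sentiment_runtime.py -- hypothetical sketch of an MLServer custom runtime.
from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class SentimentRuntime(MLModel):
    async def load(self) -> bool:
        # A real implementation would load a sentiment model here.
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the incoming text, run the (placeholder) model,
        # and encode the labels back into a V2 inference response.
        texts = StringCodec.decode_input(payload.inputs[0])
        labels = ["positive" for _ in texts]  # placeholder inference
        return InferenceResponse(
            model_name=self.name,
            outputs=[StringCodec.encode_output("sentiment", labels)],
        )
```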

## DockerHub

Pre-built container images are also available here. If you are just trying things out, you can therefore deploy the YAML files on your K8S cluster as they are.

## Relevant Projects

Some academic and industrial projects that could serve as a source of inference pipelines for future implementations.

### Systems-related Academic Papers

  1. InferLine: latency-aware provisioning and scaling for prediction serving pipelines
  2. GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks
  3. FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees
  4. Rim: Offloading Inference to the Edge
  5. Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
  6. Scrooge: A Cost-Effective Deep Learning Inference System
  7. Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis
  8. VideoEdge: Processing Camera Streams using Hierarchical Clusters
  9. Live Video Analytics at Scale with Approximation and Delay-Tolerance
  10. Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
  11. XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse

### ML-theory-related Academic Papers

  1. On Human Intellect and Machine Failures: Troubleshooting Integrative Machine Learning Systems
  2. Fixes That Fail: Self-Defeating Improvements in Machine-Learning Systems
  3. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  4. PaLM: Scaling Language Modeling with Pathways
  5. Language Model Cascades

### Software-engineering-related Academic Papers

  1. Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems
  2. PromptChainer: Chaining Large Language Model Prompts through Visual Programming
  3. 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows
  4. Feature Interactions on Steroids: On the Composition of ML Models

### Industrial Projects

  1. Clarifai Workflows
  2. Facebook DLRM
  3. Nvidia DeepStream

## Load Tester

This repo also includes a small async load tester for sending workloads to the models/pipelines. You can find it under the async load tester folder.
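To illustrate the idea, here is a minimal, hypothetical async load generator built on `asyncio` and `aiohttp` (the endpoint URL, payload, and request counts are illustrative assumptions, not the repo's actual tester):

```python
# load_test.py -- hypothetical sketch of an async load tester.
import asyncio

import aiohttp

URL = "http://localhost:8080/v2/models/audio-sent/infer"  # assumed endpoint
PAYLOAD = {"inputs": [{"name": "text", "shape": [1], "datatype": "BYTES",
                       "data": ["hello world"]}]}


async def send_one(session: aiohttp.ClientSession) -> int:
    # POST a single inference request and return the HTTP status code.
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()
        return resp.status


async def main(total_requests: int = 100, concurrency: int = 10) -> None:
    # Keep at most `concurrency` requests in flight at any time.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(session: aiohttp.ClientSession) -> int:
        async with sem:
            return await send_one(session)

    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(
            *(bounded(session) for _ in range(total_requests))
        )
    # Summarise the status codes, e.g. {200: 100}.
    print({code: statuses.count(code) for code in set(statuses)})


if __name__ == "__main__":
    asyncio.run(main())
```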

## Sources of Models

### Audio and Text Models

Source:

  1. HuggingFace

### Image Models

Source:

  1. Timm
  2. TorchVision
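The models behind the pipelines are typically pulled straight from these hubs. For example (the specific model names below are illustrative, not necessarily the ones used in this repo):

```python
# Hypothetical examples of pulling models from the hubs listed above.
from transformers import pipeline  # HuggingFace
import timm                        # Timm
from torchvision import models     # TorchVision

qa = pipeline("question-answering")  # text model from HuggingFace
classifier = timm.create_model("resnet50", pretrained=True)  # image classifier
detector = models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")  # detector
```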

Please give a star if this repo helped you learn something new :)

## TODOs (sorted by priority)

  1. Re-implement pipelines in Seldon V2
  2. Add an example of using shared models in pipelines using V2
  3. Example of multi-model request propagation
  4. Example implementation using Nvidia Triton Server as the base containers instead of MLServer
  5. Examples of model load/unload in Triton and MLServer
  6. GPU examples with fractional GPUs
  7. Send image/audio/text in a compressed format
  8. Add performance evaluation scripts and load tester
  9. Complete unfinished pipelines
  10. Examples of using Triton Client to interact with the MLServer examples
  11. Examples of using Triton Inference Server as the serving backend
  12. Pipelines implementation in the upcoming Seldon Core V2
  13. Examples of integration with autoscalers (built-in autoscaler, VPA, and event-driven autoscalers like KEDA)
  14. Implement a GPT-2 -> DALL-E pipeline inspired by dalle-runtime