
How does this compare to Huggingface's Text Embedding Inference? #108

Open
alpayariyak opened this issue Feb 21, 2024 · 10 comments

@alpayariyak

Hi,

Thank you for your amazing work!

We'd like to add an embedding template for users to deploy on RunPod, and we're deciding between Infinity and HF's Text Embedding Inference. How would you say Infinity compares, especially in performance?

@michaelfeil
Owner

michaelfeil commented Feb 21, 2024

Hey @alpayariyak,
great question! TEI is a great project that started slightly later than this one, and I like it (apart from its license).

Benchmarking is pretty subjective; e.g., a single-sentence, 10-token query is not something you should benchmark on.
We typically deploy BERT-large on Nvidia L4 instances.
Sending batches of ~256 texts with ~380 tokens each, batch throughput/latency is likely the only metric you should care about, since you need to serve under high load to get anything back for your money.
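For illustration, a minimal client sketch for this kind of load test, assuming an OpenAI-compatible /embeddings route on localhost:7997; the URL, model name, and sizes are placeholders rather than a fixed benchmark setup:

```python
# Minimal throughput sketch: send large batches to an embedding server and
# measure requests/second. Endpoint, port, and model name are assumptions;
# adjust to whatever your deployment actually exposes.
import time
import requests

URL = "http://localhost:7997/embeddings"   # OpenAI-compatible route (assumed)
MODEL = "BAAI/bge-large-en-v1.5"           # placeholder model name
BATCH_SIZE = 256                           # ~256 texts per request, as above
N_REQUESTS = 20

# Roughly 380 tokens per text; repeating a short phrase is a crude stand-in
# for realistic passages.
text = "benchmarking long passages under load " * 60
batch = [text] * BATCH_SIZE

start = time.perf_counter()
for _ in range(N_REQUESTS):
    resp = requests.post(URL, json={"model": MODEL, "input": batch}, timeout=120)
    resp.raise_for_status()
elapsed = time.perf_counter() - start

print(f"{N_REQUESTS / elapsed:.2f} requests/s, "
      f"{N_REQUESTS * BATCH_SIZE / elapsed:.1f} embeddings/s")
```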

CPU:

Infinity is around 3x faster on CPU when using the optimum engine. Candle/torch is not that great at CPU inference; ONNX has an edge here.
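A minimal sketch of selecting the ONNX/optimum backend via the infinity_emb Python API; treat the exact EngineArgs fields and the model name as assumptions to verify against your installed version:

```python
# Sketch: run Infinity with the ONNX/optimum backend for CPU inference.
# Field names follow infinity_emb's AsyncEmbeddingEngine/EngineArgs API;
# treat the exact arguments as assumptions and check your installed version.
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path="BAAI/bge-small-en-v1.5",  # placeholder model
        engine="optimum",   # ONNX runtime backend, faster than torch on CPU
        device="cpu",
    )
)

async def main():
    async with engine:  # starts the batching loop
        embeddings, usage = await engine.embed(["some query", "some passage"])
        print(len(embeddings), usage)

asyncio.run(main())
```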

CUDA:

TEI is around 2-5% faster: 0.55 requests per second on TEI vs 0.52 on Infinity. You will need to choose the right image for this and know that, e.g., compute capability 8.9 is what you should go for on an Nvidia L4.

startup:

The startup time is slightly faster / the same order of magnitude. This is for the GPU image. For roberta-large, the gap is similar. The Docker image of TEI is smaller; torch+CUDA is a real heavyweight.

Additional features that TEI lacks:

  • AMD GPUs (no Docker image yet, but TEI likely never will support them), AWS Inf2, Mac Metal inference
  • fast inference on GPU
  • runs custom architectures and any new models with trust_remote_code=True (see the sketch after this list)
  • caching
  • under an open license (MIT)
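A minimal sketch of the trust_remote_code path, assuming the EngineArgs field of that name and using a custom-architecture model purely as an example:

```python
# Sketch: serving a model with a custom architecture by passing
# trust_remote_code=True. The EngineArgs field name and the example model
# (jina-embeddings-v2 ships custom modeling code) are assumptions to verify.
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path="jinaai/jina-embeddings-v2-base-en",
        engine="torch",
        trust_remote_code=True,  # allow the repo's custom modeling code
    )
)

async def main():
    async with engine:
        embeddings, usage = await engine.embed(["custom architecture test"])
        print(len(embeddings[0]), "dims,", usage, "tokens")

asyncio.run(main())
```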

@michaelfeil
Owner

michaelfeil commented Feb 21, 2024

@alpayariyak I invested about 4-5 hours on this and set up an extra doc. Can I please have your feedback on it?
https://michaelfeil.eu/infinity/latest/benchmarking/

@indranilr

indranilr commented Feb 28, 2024

> @alpayariyak I invested about 4-5 hours on this and set up an extra doc. Can I please have your feedback on it? https://michaelfeil.eu/infinity/latest/benchmarking/

The benchmark link seems dead, could you please repost it?

@michaelfeil
Owner

Fixed!

@Jimmy-Newtron

Your project is amazing! 🚀

I ❤️ your LICENSE, which is better compared to the one of TEI (👎).

Have you ever thought of adding an API endpoint that can also serve as a TextSplitter?
It would remove the need to load the same model into memory twice, once for the text chunker and once for the embedder.

https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token#sentencetransformers
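For context, a minimal sketch of the duplication this would avoid, using the LangChain splitter linked above; the model name is a placeholder:

```python
# Sketch of the duplication being described: the LangChain splitter loads a
# SentenceTransformer just to count tokens, while the embedding side loads
# the same weights again. Model names are placeholders.
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from sentence_transformers import SentenceTransformer

MODEL = "sentence-transformers/all-mpnet-base-v2"

# First copy of the model: used only to tokenize and split.
splitter = SentenceTransformersTokenTextSplitter(model_name=MODEL, tokens_per_chunk=256)
chunks = splitter.split_text("A very long document ... " * 200)

# Second copy of the model: used to embed the chunks.
embedder = SentenceTransformer(MODEL)
vectors = embedder.encode(chunks)
print(len(chunks), "chunks,", vectors.shape)
```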

@michaelfeil
Owner

@Jimmy-Newtron Can you open another issue for that?

@Jimmy-Newtron

Jimmy-Newtron commented Apr 4, 2024

#193

@michaelfeil
Owner

Would the integration be in LangChain?
What would be the expected usage? To count tokens?

@Jimmy-Newtron

The main goal would be to avoid loading the same model into memory twice:

  • once to embed the chunks (passages), which is mandatory for the vector store
  • a second time via the SentenceTransformers splitter, which loads the same model again

> Would the integration be in LangChain?

Yes, I suppose that a LangChain integration would be required.

> What would be the expected usage? To count tokens?

To optimize the resources used (GPU, VRAM), it would be nice for the Infinity server to be able to chunk long input sequences into smaller pieces that fit the context window of the chosen embedding model.
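One possible direction (not an existing Infinity endpoint, just a sketch): chunk with the model's tokenizer only, so the weights are never loaded a second time; the function name and window sizes below are illustrative:

```python
# Sketch of tokenizer-only chunking: split a long text into windows that fit
# the embedding model's context size without loading the model weights a
# second time. Function name and window sizes are illustrative, not an
# existing Infinity API.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")  # placeholder

def chunk_by_tokens(text: str, max_tokens: int = 510, overlap: int = 32) -> list[str]:
    """Slice `text` into overlapping windows of at most `max_tokens` tokens."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start : start + max_tokens]
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
        if start + max_tokens >= len(ids):
            break
    return chunks

long_text = "Infinity could expose such a splitter next to the embeddings route. " * 300
print(len(chunk_by_tokens(long_text)), "chunks")
```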

I have found an implementation of a similar concept in the AI21 Studio Text Segmentation, which is already available in the LangChain integrations.

Here is some source code that may be of interest for conceiving a solution:

@Jimmy-Newtron

> Great question! TEI is a great project that started slightly later than this one, and I like it (apart from its license).

huggingface/text-embeddings-inference#232
huggingface/text-embeddings-inference@3c385a4

Does this mean that there will be a convergence of the two projects?
