In each of these tests, every document consists of a single field (text or image). The images were hosted locally on a Python image server.
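For reference, the image hosting can be as simple as Python's built-in HTTP server; the directory and port below are placeholders, not the exact setup used.

```python
# Minimal local image server for the benchmark (illustrative, not the exact
# server used): serves ./test_images at http://localhost:8222/<filename>
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = functools.partial(SimpleHTTPRequestHandler, directory="./test_images")
HTTPServer(("localhost", 8222), handler).serve_forever()
```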
Model Name | Image Indexing Time (CBS = 100) | Text Indexing Time (CBS = 100) | Image Indexing Time (CBS = 50) | Text Indexing Time (CBS = 50) | Image Indexing Time (CBS = 10) | Text Indexing Time (CBS = 10) | Image Indexing Time (CBS = 1) | Text Indexing Time (CBS = 1) |
---|---|---|---|---|---|---|---|---|
Vit-B/32 * | 18 | 7 | 19 | 8 | 26 | 14 | 70 | 65 |
fast/Vit-B/32 ** | 17 | 6 | 36 | 8 | 44 | 14 | 80 | 80 |
Vit-L/14 | 74 | 9 | 74 | 11 | 80 | 15 | 129 | 65 |
fast/Vit-L/14 | 58 | 9 | 410 | 10 | 420 | 28 | 500 | 139 |
openclip/Vit-L/14 | 76 | 11.8 | 78 | 13 | 89 | 22 | 220 | 14 |
opencv/Vit-L-14/cuda | 73 | 9 | 77 | 11 | 88 | 15 | 218 | 65 |
opencv/Vit-L-14/trt (TODO) | 73 | 9 | 77 | 11 | 88 | 15 | 218 | 65 |
onnx/ViT-L/14 | 64 | 9 | 60 | 10 | 71 | 28 | 226 | 139 |
For onnx/ViT-L/14, the processing speed converges over time: indexing starts at around 150 ms/doc and converges to about 64 ms/doc after 40 batches.
Note:
- CBS == client_batch_size
- Indexing times are in ms per document
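For context, a minimal sketch of how the indexing times above could be collected, assuming the Marqo Python client's add_documents accepts a client_batch_size argument; the index name, URLs, and document count are placeholders.

```python
# Sketch of the indexing benchmark (illustrative harness, not the exact script):
# time add_documents() for each client batch size and report ms per document.
import time
import marqo

mq = marqo.Client(url="http://localhost:8882")
docs = [{"_id": str(i), "image": f"http://localhost:8222/img_{i}.jpg"}
        for i in range(1000)]

for cbs in (100, 50, 10, 1):
    start = time.perf_counter()
    mq.index("benchmark-index").add_documents(docs, client_batch_size=cbs)
    ms_per_doc = (time.perf_counter() - start) * 1000 / len(docs)
    print(f"CBS={cbs}: {ms_per_doc:.1f} ms/doc")
```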
Models | Time cost | Difference | Comments |
---|---|---|---|
ViT-L/14 | 18.6 ms ± 60.2 µs | N/A | The inference speed is unexpectedly fast in this unit test |
open-clip/ViT-L/14 | 66.9 ms ± 435 µs | N/A | This is a more reasonable speed on pytorch |
cuda:onnx/ViT-L/14 | 55.7 ms ± 166 µs | 9e-6 | Using clip_onnx package |
tensorrt:onnx/ViT-L/14 | 47.7 ms ± 639 µs | 9e-6 | The environment is unstable; it has very strict version requirements on onnxruntime, CUDA, and TensorRT |
TorchDynamo | 21 ms ± 234 µs | N/A | Basically this is just another route to the onnx/tensorrt backends, so it does not help; link |
kernl.ai | N/A | N/A | Requires Python > 3.9 and GPU compute capability > 8 (a g5 instance, maybe); link |
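For the onnx rows, a rough sketch of the conversion path using the clip_onnx package (usage follows the package's README; the model choice, file paths, and dummy inputs are placeholders):

```python
# Sketch of the cuda:onnx/ViT-L/14 path via clip_onnx (illustrative).
import clip
import numpy as np
from PIL import Image
from clip_onnx import clip_onnx

model, preprocess = clip.load("ViT-L/14", device="cpu", jit=False)
image = preprocess(Image.open("example.jpg")).unsqueeze(0)   # dummy image input
text = clip.tokenize(["a diagram", "a dog", "a cat"])        # dummy text input

onnx_model = clip_onnx(model, visual_path="visual.onnx", textual_path="textual.onnx")
onnx_model.convert2onnx(image, text, verbose=True)
# Swap in TensorrtExecutionProvider for the tensorrt row.
onnx_model.start_sessions(providers=["CUDAExecutionProvider"])

image_features = onnx_model.encode_image(image.detach().cpu().numpy().astype(np.float32))
```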
TRANSFORMS | TIME (PNG file, 2162 × 762) | TIME (JPG file, 640 × 425) | Comments |
---|---|---|---|
original_clip | 27.4 ms ± 94.8 µs | 4.39 ms ± 15 µs | |
our_clip_implementation | 27.4 ms ± 49.8 µs | 4.4 ms ± 16.8 µs | |
opencv_based | 4.8 ms ± 194 µs | 1.08 ms ± 3.02 µs | |
script_based | 11.8 ms ± 51.2 µs | 2.26 ms ± 21.1 µs | |
rgb_conversion | 18.4 ms ± 28.4 µs | 4.47 ms ± 13 µs | |
grey_conversion | 12.7 ms ± 15.5 µs | 3 ms ± 60.1 µs | |
read_from_cv | 672 µs ± 143 µs | 652 µs ± 70.4 µs | |
Model Name | Text-to-image score (single-label) | Text-to-image score (double-label) | Text-to-image score (triple-label) | Image-to-text score | Image-to-Image score |
---|---|---|---|---|---|
Vit-B/32 | 92.5 | 78.75 | 46.7 | 91 | good |
Vit-L/14 | 97.5 | 82.5 | 52.3 | 91 | good |
fast/Vit-B/32 | 97.5 | 72.5 | 48 | 88 | good |
fast/Vit-L/14 | 90 | 81.25 | 52.3 | 88 | good |
openclip/Vit-L/14 | 97.5 | 82.5 | 52.3 | 91 | good |
opencv/Vit-L-14 | 90 | 81.25 | 52.3 | 88 | good |
onnx/ViT-L/14 | 97.5 | 82.5 | 52.3 | 91 | good |
INFO:marqo.s2_inference.s2_inference:The client gives 1 documents to vectorise
INFO:marqo.s2_inference.clip_utils:It takes about 0.005s to load all images. The average time for each image is 0.005s
INFO:marqo.s2_inference.clip_utils:It takes about 0.005s to preprocess all images. The average time for each image is 0.005s
INFO:marqo.s2_inference.clip_utils:It take about 0.011s to encode all images. The average time for each image is 0.011s
INFO:marqo.s2_inference.clip_utils:It takes 0.049s to convert the output with float32 to ndarray from cuda
INFO:marqo.s2_inference.s2_inference:It take about 0.071s to vectorise all documents. The average time for each document is 0.071s
INFO:marqo.s2_inference.s2_inference:The client gives 1 documents to vectorise
INFO:marqo.s2_inference.clip_utils:It takes about 0.005s to load all images. The average time for each image is 0.005s
INFO:marqo.s2_inference.clip_utils:It takes about 0.005s to preprocess all images. The average time for each image is 0.005s
INFO:marqo.s2_inference.clip_utils:It take about 0.012s to encode all images. The average time for each image is 0.012s
INFO:marqo.s2_inference.clip_utils:It takes 0.004s to convert the output with float16 to ndarray from cuda
INFO:marqo.s2_inference.s2_inference:It take about 0.026s to vectorise all documents. The average time for each document is 0.026s
Difference between the float32 and float16 outputs: np.abs(np_a - np_b).sum() = 0.13
INFO:marqo.s2_inference.s2_inference:The client gives 1 documents to vectorise
INFO:marqo.s2_inference.clip_utils:It takes about 0.005s to load all images. The average time for each image is 0.005s
INFO:marqo.s2_inference.clip_utils:It takes about 0.005s to preprocess all images. The average time for each image is 0.005s
INFO:marqo.s2_inference.clip_utils:It take about 0.011s to encode all images. The average time for each image is 0.011s
INFO:marqo.s2_inference.clip_utils:It takes 0.051s to convert the output with float16 to ndarray from cuda
INFO:marqo.s2_inference.s2_inference:It take about 0.072s to vectorise all documents. The average time for each document is 0.072s
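The log blocks above contrast converting the CUDA output to a numpy array in float32 versus float16; a small sketch of that comparison follows (the tensor here is a stand-in for the CLIP embedding, so only the shape of the check matches the logs):

```python
# Compare the float32 and float16 conversion paths for a GPU embedding.
import numpy as np
import torch

output = torch.randn(1, 768, device="cuda")           # stand-in for the CLIP output

np_a = output.detach().cpu().numpy()                  # float32 path (slower in the logs)
np_b = output.detach().half().cpu().numpy()           # float16 path (cast on GPU first)
print(np.abs(np_a - np_b.astype(np.float32)).sum())   # small; ~0.13 in the logs above
```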
We test the time to add a document to the index under different client batch sizes.
CBS = client_batch_size; indexing times are in ms per document.
Model Name | Image Indexing Time (CBS = 50) | Text Indexing Time (CBS = 50) | Image Indexing Time (CBS = 10) | Text Indexing Time (CBS = 10) | Image Indexing Time (CBS = 1) | Text Indexing Time (CBS = 1) |
---|---|---|---|---|---|---|
Vit-B/32 * | 64 | 41 | 66 | 64 | 117 | 171 |
Vit-L/14 | 335 | 55 | 345 | 61 | 672 | 128 |
fast/Vit-B/32 ** | 36 | 22 | 44 | 27 | 80 | 80 |
fast/Vit-L/14 | 410 | 41 | 420 | 48 | 500 | 95 |
openclip/Vit-L/14 | 295 | 52 | 306 | 63 | 360 | 105 |
opencv/Vit-L-14 | 280 | 49 | 285 | 66 | 347 | 105 |
onnx/ViT-L/14 | 426 | 41 | 636 | 58 | 488 | 91 |
Model Name | Text-to-image score (single-label) | Text-to-image score (double-label) | Text-to-image score (triple-label) | Image-to-text score | Image-to-Image score |
---|---|---|---|---|---|
Vit-B/32 | 92.5 | 78.75 | 46.7 | 91 | good |
Vit-L/14 | 97.5 | 82.5 | 52.3 | 91 | good |
fast/Vit-B/32 | 97.5 | 72.5 | 48 | 88 | good |
fast/Vit-L/14 | 90 | 81.25 | 52.3 | 88 | good |
openclip/Vit-L/14 | 97.5 | 82.5 | 52.3 | 91 | good |
opencv/Vit-L-14 | 90 | 81.25 | 52.3 | 88 | good |
onnx/ViT-L/14 | 97.5 | 82.5 | 52.3 | 91 | good |
*ViT-B/32 and ViT-L/14 are the OpenAI implementations of CLIP.
**fast means the model uses OpenCV preprocessing and an ONNX model for inference.
Fastclip, with OpenCV preprocessing and an ONNX model, can reduce the preprocessing time of ViT-B/32 without losing retrieval performance.
However, the ONNX model actually increases the inference time for ViT-L/14.
OpenCV preprocessing affects the scores slightly, but the results are still acceptable.
This section compares different image preprocessing methods.
TRANSFORMS | TIME (ms) | PROCESSED DIFF (mean) | ENCODE DIFF (mean) |
---|---|---|---|
original_clip | 14.6 | 0.0 | 0.0 |
our_clip_implementation | 14.7 | 0.0 | 0.0 |
opencv_based | 4.67 | 1.22 | 0.19 |
script_based | 8.07 | 0.037 | 0.0526 |
rgb_conversion | 12.1 | 0.031 | 0.0475 |
grey_conversion | 5.33 | 0.053 | 0.121 |
read_from_cv | 0.940 | 1.22 | 0.19 |
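For reference, a minimal sketch of what an OpenCV-based CLIP preprocessor can look like next to the original torchvision pipeline; this is an assumption about the opencv_based transform, not the exact benchmark code, and the interpolation and crop details may differ.

```python
# OpenCV-based CLIP preprocessing sketch: read with cv2, resize the short side
# to 224, centre-crop, then normalise with CLIP's mean/std.
import cv2
import numpy as np
import torch

MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def opencv_preprocess(path: str, n_px: int = 224) -> torch.Tensor:
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    h, w = img.shape[:2]
    scale = n_px / min(h, w)
    img = cv2.resize(img, (round(w * scale), round(h * scale)),
                     interpolation=cv2.INTER_CUBIC)
    h, w = img.shape[:2]
    top, left = (h - n_px) // 2, (w - n_px) // 2
    img = img[top:top + n_px, left:left + n_px]
    img = (img.astype(np.float32) / 255.0 - MEAN) / STD
    return torch.from_numpy(img).permute(2, 0, 1)      # CHW tensor for the model
```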
Models | Time cost | Comments | Links | Difference |
---|---|---|---|---|
ViT-B/32 | 7.76 ms ± 127 µs | N/A | N/A | N/A |
onnx/ViT-B/32 | 4.16 ms ± 152 µs | Using clip_onnx package | link | 9e-6 |
open_clip/ViT-B-32/openai | 8.05 ms ± 104 µs | N/A | N/A | N/A |
Pytorch Dynamic Quantization | N/A | Does not support GPU (CPU only) | link | N/A |
Neural Magic | N/A | Does not support GPU (CPU only) | link | N/A |
DeepSpeed | N/A | Could not get it to work on Windows | link | N/A |
Optimized onnx | 4.12 ms ± 152 µs | No difference from the plain onnx model | link | 9e-6 |
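As a footnote on the dynamic-quantization row, this is the kind of call that was evaluated; it only targets CPU inference, which is why it is marked N/A for the GPU comparison (a sketch, not the exact code used):

```python
# PyTorch dynamic quantization sketch (CPU only): quantize the Linear layers
# of a CLIP model to int8.
import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu", jit=False)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```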