
[bug]: "Expected all tensors to be on the same device, but found at least two devices" #67

Open
ga-it opened this issue Aug 20, 2024 · 4 comments
Labels: bug (Something isn't working)

ga-it commented Aug 20, 2024

Describe the bug
Context_chat_backend starts giving 500 internal server errors under load from multiple jobs.

Running on a server with two P40 GPUs (24 GB each).

To Reproduce
Steps to reproduce the behavior:

  1. Queue multiple jobs via the Nextcloud Assistant.
  2. Errors start to occur as queries are executed simultaneously.

Expected behavior
This appears to be a fairly common error.
It could be because a device map needs to be defined appropriately. See here and here.

Server logs (if applicable)

``` NA ```

Context Chat Backend logs (if applicable, from the docker container)

```
TRACE: 192.168.0.73:60766 - ASGI [318] Send {'type': 'http.response.start', 'status': 500, 'headers': '<...>'}
INFO: 192.168.0.73:60766 - "PUT /loadSources HTTP/1.1" 500 Internal Server Error
TRACE: 192.168.0.73:60766 - ASGI [318] Send {'type': 'http.response.body', 'body': '<6070 bytes>'}
TRACE: 192.168.0.73:60766 - ASGI [318] Raised exception
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/protocols/http/h11_impl.py", line 396, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/message_logger.py", line 84, in __call__
    raise exc from None
  File "/usr/local/lib/python3.11/dist-packages/uvicorn/middleware/message_logger.py", line 80, in __call__
    await self.app(scope, inner_receive, inner_send)
  File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/app/context_chat_backend/ocs_utils.py", line 75, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.11/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.11/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/usr/local/lib/python3.11/dist-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.11/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "/app/context_chat_backend/utils.py", line 74, in wrapper
    return func(*args, **kwargs)
  File "/app/context_chat_backend/controller.py", line 276, in _
    result = embed_sources(db, sources)
  File "/app/context_chat_backend/chain/ingest/injest.py", line 175, in embed_sources
    return _process_sources(vectordb, sources_filtered)
  File "/app/context_chat_backend/chain/ingest/injest.py", line 151, in _process_sources
    doc_ids = user_client.add_documents(split_documents)
  File "/usr/local/lib/python3.11/dist-packages/langchain_core/vectorstores.py", line 147, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/langchain_community/vectorstores/chroma.py", line 276, in add_texts
    embeddings = self._embedding_function.embed_documents(texts)
  File "/usr/local/lib/python3.11/dist-packages/langchain_community/embeddings/huggingface.py", line 202, in embed_documents
    embeddings = self.client.encode(
  File "/usr/local/lib/python3.11/dist-packages/InstructorEmbedding/instructor.py", line 539, in encode
    out_features = self.forward(features)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/sentence_transformers/models/Dense.py", line 38, in forward
    features.update({'sentence_embedding': self.activation_function(self.linear(features['sentence_embedding']))})
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```


Setup Details:
 - Nextcloud Version: Nextcloud Hub 8 (29.0.4)
 - AppAPI Version: 3.1.0
 - Context Chat PHP Version: 8.2.22
 - Context Chat Backend Version: 3.0.1
 - Nextcloud deployment method: Docker latest stable
 - Context Chat Backend deployment method: manual, remote
 - chroma.sqlite3 is 175 GB (the Nextcloud file repository is TB+)


ga-it added the bug label on Aug 20, 2024

ga-it (Author) commented Aug 22, 2024

Potential method:

  1. Automatic Device Allocation:
    Write a function that automatically distributes the model's layers across available GPUs via a dynamic device map (a sketch follows below the key improvements).
  2. Layer Size Analysis:
    The allocation function should analyze the size of each layer (in terms of memory consumption) to make informed placement decisions. PyTorch's torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() functions can be helpful here (see the memory-query sketch after this list).
  3. Balanced Distribution:
    Strive for a balanced distribution of layers across GPUs to maximize parallel processing and minimize potential bottlenecks.
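
For reference, a minimal sketch of those per-GPU memory queries (not from the backend's code; torch.cuda.mem_get_info is an additional call assumed here):

```python
import torch

# Minimal sketch: inspect per-GPU memory; these numbers could feed a placement heuristic.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)     # bytes free / total on device i
    allocated = torch.cuda.memory_allocated(i)   # bytes currently allocated by this process
    peak = torch.cuda.max_memory_allocated(i)    # peak allocation observed so far
    print(f"cuda:{i}: {allocated / 2**20:.0f} MiB allocated, "
          f"{peak / 2**20:.0f} MiB peak, {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")
```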

Key Improvements:

  1. Dynamic GPU Count: Handles any number of available GPUs.
  2. Layer Size Awareness: Distributes layers based on their memory needs.
  3. Balanced Distribution: Aims for an even distribution of memory usage across GPUs.
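
A minimal sketch of such an allocation function, assuming the model's top-level submodules can be placed independently and that parameter size is a reasonable proxy for memory footprint (the name build_balanced_device_map and the approach are illustrative, not existing backend code):

```python
import torch
from torch import nn

def build_balanced_device_map(model: nn.Module) -> dict[str, str]:
    """Assign each top-level submodule to the GPU with the least memory assigned so far."""
    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        return {name: 'cpu' for name, _ in model.named_children()}

    load = [0] * n_gpus  # bytes assigned to each GPU so far
    device_map = {}
    for name, module in model.named_children():
        # Estimate the submodule's footprint from its parameter sizes.
        size = sum(p.numel() * p.element_size() for p in module.parameters())
        target = min(range(n_gpus), key=load.__getitem__)
        load[target] += size
        device_map[name] = f'cuda:{target}'
    return device_map
```

A map like this could then be handed to whatever loads the model (Accelerate-style device_map handling, for example), though inputs would still need to be moved at the boundaries between devices, which is exactly where the Dense layer in the traceback above fails.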

kyteinsky (Contributor) commented

Hello, no offence, but this looks like an LLM's reply.

On to the topic: we don't support multi-GPU setups with the default configuration. However, you can use these params in the config to do that: https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L73-L75

For your system, it could be tensor_split: [1, 1]. The rest is fine at the defaults.
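
To illustrate how those parameters map onto llama-cpp-python itself (a sketch only, not the backend's config format; the model path is a placeholder):

```python
from llama_cpp import Llama

# Sketch: the backend would normally set these through its config file.
llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path
    n_gpu_layers=-1,                   # offload all layers to the GPUs
    tensor_split=[1, 1],               # split tensors evenly across the two P40s
)
```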

> Write a function that automatically distributes the model's layers across available GPUs via a dynamic device map.

I can review a PR if you wish to write such a function. It is very much possible, but it might not take the processing speed of those devices into account. Anyhow, it is welcome.

ga-it (Author) commented Sep 3, 2024

Thanks @kyteinsky

I will try making the change to handle a dynamic number of GPUs.

I am interested in doing this not just for the LLM but particularly for the embedding, given the server load related to it.

I believe this is important for improving the scalability of Context Chat, and scalability matters to anybody using Nextcloud as a company document repository, for knowledge management, and as the core of their RAG implementation.

Clearly the reduction of duplication is priority number one in this regard and then processing efficiency follows.

kyteinsky (Contributor) commented Sep 19, 2024

I think you can get by just using the options in the instructor's config. Inside embedding -> instructor -> encode_kwargs, add options as needed from here, probably just the batch size: https://github.com/xlang-ai/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L580-L586
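
As an illustration, a hedged sketch of how a batch_size under encode_kwargs reaches the instructor model through the LangChain wrapper (the model name is a placeholder; the backend's actual config keys may differ):

```python
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

# Sketch: encode_kwargs is forwarded to INSTRUCTOR.encode() on every embedding call.
embedder = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",  # placeholder model
    model_kwargs={"device": "cuda"},       # where the model is loaded
    encode_kwargs={"batch_size": 16},      # smaller batches reduce per-request GPU pressure
)
```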

We can still benefit from an enhancement that spawns multiple instances of the embedder if the GPU memory allows it. That would be a nice change.
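
A rough sketch of what multiple embedder instances could look like, assuming one instance pinned per GPU and round-robin dispatch (names here are illustrative, not existing backend code):

```python
import itertools
import torch
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

# One embedder per GPU; successive calls rotate across them round-robin.
embedders = [
    HuggingFaceInstructEmbeddings(
        model_name="hkunlp/instructor-large",   # placeholder model
        model_kwargs={"device": f"cuda:{i}"},   # pin this instance to GPU i
    )
    for i in range(torch.cuda.device_count())
]
_next = itertools.cycle(embedders)

def embed_documents(texts: list[str]) -> list[list[float]]:
    # Hypothetical dispatcher: each call goes to the next GPU-pinned instance.
    return next(_next).embed_documents(texts)
```

Each instance holds a full copy of the model on its GPU, which is why this only helps if the GPU memory allows it, as noted above.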

LLM scalability is being worked on. We switched to using Nextcloud's Task Processing API to generate the response, so it can use whatever is configured for text generation in Nextcloud, like llm2 or integration_openai. An example is in the default config.

> Clearly the reduction of duplication is priority number one in this regard and then processing efficiency follows.

That is our no. 1 priority right now, yes. We're working on it.
