-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue]: <title> Local LLM and Loacl embedding error? #605
Comments
The local search with embeddings from Ollama now works. |
你 端口号是不是错了? |
If you want to use open-source models, I've created a repository for deploying Hugging Face models to local endpoints, offering functionality similar to OpenAI APIs. You can find the repo here: https://github.com/rushizirpe/open-llm-server Also, I've prepared a Colab notebook for the Graphrag Demo. You might want to take a look: https://colab.research.google.com/drive/1uhFDnih1WKrSRQHisU-L6xw6coapgR51?usp=sharing. |
Consolidating alternate model issues here: #657 |
Describe the issue
Use vllm to launch a local large model, in the style of openai,but it won't work
Steps to reproduce
step1:python -m vllm.entrypoints.openai.api_server --max-model-len 6144 --gpu-memory-utilization 0.95 --disable-log-stats --served-model-name Qwen2-7B-Instruct --model /mnt/workspace/Qwen2-7B-Instruct
step2: start embedding
import os
from contextlib import asynccontextmanager
from typing import List, Union
import tiktoken
import torch
import uvicorn
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
from sse_starlette.sse import EventSourceResponse
Set up limit request time
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', '/mnt/workspace/m3e-base')
@asynccontextmanager
async def lifespan(app: FastAPI):
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=[""],
allow_credentials=True,
allow_methods=[""],
allow_headers=["*"],
)
class CompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class EmbeddingResponse(BaseModel):
data: list
model: str
object: str
usage: CompletionUsage
class EmbeddingRequest(BaseModel):
input: Union[List[str], str]
model: str
@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
if isinstance(request.input, str):
embeddings = [embedding_model.encode(request.input)]
else:
embeddings = [embedding_model.encode(text) for text in request.input]
embeddings = [embedding.tolist() for embedding in embeddings]
if name == "main":
# load Embedding
embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda")
uvicorn.run(app, host='0.0.0.0', port=8001, workers=1)
step3:pip install graphrag
step4:mkdir -p ./ragtest/input
step5:curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
step6:python -m graphrag.index --init --root ./ragtest
step7: Modify the yml file
GraphRAG Config Used
No response
Logs and screenshots
encoding_model: cl100k_base
skip_workflows: []
llm:
api_key: ${GRAPHRAG_API_KEY}
type: openai_chat # or azure_openai_chat
model: Qwen2-7B-Instruct
model_supports_json: false # recommended if this is available for your model.
max_tokens: 2000
request_timeout: 180.0
api_base: http://localhost:8000/v1/
api_version: 2024-02-15-preview
organization: <organization_id>
deployment_name: <azure_model_deployment_name>
tokens_per_minute: 150_000 # set a leaky bucket throttle
requests_per_minute: 10_000 # set a leaky bucket throttle
max_retries: 10
max_retry_wait: 10.0
sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
concurrent_requests: 25 # the number of parallel inflight requests that may be made
parallelization:
stagger: 0.3
num_threads: 50 # the number of threads to use for parallel processing
async_mode: threaded # or asyncio
embeddings:
parallelization: override the global parallelization settings for embeddings
async_mode: threaded # or asyncio
llm:
api_key: ${GRAPHRAG_API_KEY}
type: openai_embedding # or azure_openai_embedding
model: m3e-base
api_base: http://localhost:8001/v1/
# api_version: 2024-02-15-preview
# organization: <organization_id>
# deployment_name: <azure_model_deployment_name>
# tokens_per_minute: 150_000 # set a leaky bucket throttle
# requests_per_minute: 10_000 # set a leaky bucket throttle
# max_retries: 10
# max_retry_wait: 10.0
# sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
# concurrent_requests: 25 # the number of parallel inflight requests that may be made
# batch_size: 16 # the number of documents to send in a single request
# batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
# target: required # or optional
Additional Information
The text was updated successfully, but these errors were encountered: