diff --git a/docs/_static/list-score-traces-ragas.png b/docs/_static/list-score-traces-ragas.png new file mode 100644 index 0000000000..eb296a4771 Binary files /dev/null and b/docs/_static/list-score-traces-ragas.png differ diff --git a/docs/_static/traces-score-ragas.png b/docs/_static/traces-score-ragas.png new file mode 100644 index 0000000000..d6a4dec420 Binary files /dev/null and b/docs/_static/traces-score-ragas.png differ diff --git a/docs/getstarted/evaluation.md b/docs/getstarted/evaluation.md index 97cfa66e81..8cc45de8b0 100644 --- a/docs/getstarted/evaluation.md +++ b/docs/getstarted/evaluation.md @@ -15,9 +15,9 @@ pip install ragas ``` Ragas also uses OpenAI for running some metrics so make sure you have your openai key ready and available in your environment + ```python import os - os.environ["OPENAI_API_KEY"] = "your-openai-key" ``` ## The Data diff --git a/docs/getstarted/monitoring.md b/docs/getstarted/monitoring.md index e859f5ba6c..37593c6bfa 100644 --- a/docs/getstarted/monitoring.md +++ b/docs/getstarted/monitoring.md @@ -1,35 +1,27 @@ (get-started-monitoring)= # Monitoring -Maintaining the quality and performance of an LLM application in a production environment can be challenging. Ragas provides a solution through production quality monitoring, offering valuable insights into your application's performance. This is achieved by constructing custom, smaller, more cost-effective, and faster models. - -[**Get Early Access**](https://calendly.com/shahules/30min) - -```{admonition} **Faithfulness** -:class: note - -This feature assists in identifying and quantifying instances of hallucinations. -``` - -```{admonition} **Bad retrieval** -:class: note - -This feature helps identify and quantify poor context retrievals. -``` - -```{admonition} **Bad response** -:class: note - -This feature helps in recognizing and quantifying evasive, harmful, or toxic responses. -``` - -```{admonition} **Bad format** -:class: note - -This feature helps in detecting and quantifying responses with incorrect formatting. -``` - -```{admonition} **Custom use-case** -:class: hint - -For monitoring other critical aspects that are specific to your use case. [Talk to founders](https://calendly.com/shahules/30min) +Maintaining the quality and performance of an LLM application in a production environment can be challenging. Ragas provides basic building blocks that you can use for production-quality monitoring, offering valuable insights into your application's performance. This is achieved by constructing custom, smaller, more cost-effective, and faster models. + +:::{note} +This feature is still in beta. You can request +[**early access**](https://calendly.com/shahules/30min) to try it out. +::: + +The Ragas metrics can also be used with other LLM observability tools like +[Langsmith](https://www.langchain.com/langsmith) and +[Langfuse](https://langfuse.com/) to get model-based feedback about various +aspects of your application, like those mentioned below. + +:::{seealso} +[Langfuse Integration](../howtos/integrations/langfuse.ipynb) shows Ragas +monitoring in action within the Langfuse dashboard and how to set it up. +::: + +## Aspects to Monitor + +1. Faithfulness: helps identify and quantify instances of hallucination. +2. Bad retrieval: helps identify and quantify poor context retrievals. +3. Bad response: helps recognize and quantify evasive, harmful, or toxic responses. +4. 
Bad format: helps detect and quantify responses with incorrect formatting. +5. Custom use-case: for monitoring other critical aspects that are specific to your use case. [Talk to founders](https://calendly.com/shahules/30min) diff --git a/docs/howtos/integrations/index.md b/docs/howtos/integrations/index.md index 37135abf8b..fa17a5bdbd 100644 --- a/docs/howtos/integrations/index.md +++ b/docs/howtos/integrations/index.md @@ -9,4 +9,5 @@ happy to look into it 🙂 llamaindex.ipynb langchain.ipynb langsmith.ipynb +langfuse.ipynb ::: diff --git a/docs/howtos/integrations/langfuse.ipynb b/docs/howtos/integrations/langfuse.ipynb new file mode 100644 index 0000000000..e230f1650e --- /dev/null +++ b/docs/howtos/integrations/langfuse.ipynb @@ -0,0 +1,710 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1079e444-91e1-4b81-a28a-2ce4763f4bc4", + "metadata": {}, + "source": [ + "# Online Evaluation of RAG with Langfuse\n", + "\n", + "Langfuse offers the ability to score your traces and spans. Scores can be used in multiple ways across Langfuse:\n", + "1. Displayed on the trace to provide a quick overview\n", + "2. Segment all execution traces by score, e.g. to find all traces with a low-quality score\n", + "3. Analytics: Detailed score reporting with drill downs into use cases and user segments\n", + "\n", + "Ragas is an open-source tool that can help you run [Model-Based Evaluation](https://langfuse.com/docs/scores/model-based-evals) on your traces/spans, especially for RAG pipelines. Ragas can perform reference-free evaluations of various aspects of your RAG pipeline. Because it is reference-free, you don't need ground truths when running the evaluations and can run them on production traces that you've collected with Langfuse.\n", + "\n", + "## The Environment" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "017dc09a-c59c-4e5f-a632-d8a5110f931d", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# get keys for your project from https://cloud.langfuse.com\n", + "os.environ[\"LANGFUSE_PUBLIC_KEY\"] = \"your-langfuse-public-key\"\n", + "os.environ[\"LANGFUSE_SECRET_KEY\"] = \"your-langfuse-secret-key\"\n", + " \n", + "# your openai key\n", + "# os.environ[\"OPENAI_API_KEY\"] = \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "90a9536a-4997-47a4-82a7-3970c1145dab", + "metadata": { + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%pip install datasets ragas llama_index python-dotenv --upgrade" + ] + }, + { + "cell_type": "markdown", + "id": "580b6d2a-06e2-4682-8e03-47d054d7f240", + "metadata": {}, + "source": [ + "## The Data\n", + "\n", + "For this example, we are going to use a dataset that has already been prepared by querying a RAG system and gathering its outputs. See below for instructions on how to fetch your production data from Langfuse.\n", + "\n", + "The dataset contains the following columns:\n", + "- `question`: list[str] - These are the questions your RAG pipeline will be evaluated on.\n", + "- `answer`: list[str] - The answer generated from the RAG pipeline and given to the user.\n", + "- `contexts`: list[list[str]] - The contexts which were passed into the LLM to answer the question.\n", + "- `ground_truths`: list[list[str]] - The ground truth answer to the questions. However, this can be ignored for online evaluations since we will not have access to ground-truth data in our case.\n",
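+    "\n",
+    "If you are assembling a similar dataset from your own records, the rough shape Ragas expects looks like this (a minimal sketch with made-up values; the column names match the list above):\n",
+    "\n",
+    "```python\n",
+    "from datasets import Dataset\n",
+    "\n",
+    "# hypothetical hand-built sample; replace with your own records\n",
+    "eval_data = {\n",
+    "    \"question\": [\"What is the capital of France?\"],\n",
+    "    \"answer\": [\"The capital of France is Paris.\"],\n",
+    "    \"contexts\": [[\"Paris is the capital and largest city of France.\"]],\n",
+    "}\n",
+    "eval_dataset = Dataset.from_dict(eval_data)\n",
+    "```"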
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "ebfb8207-8ddc-4b61-bcbc-f257820bf671", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['question', 'ground_truths', 'answer', 'contexts'],\n", + " num_rows: 30\n", + "})" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from datasets import load_dataset\n", + "\n", + "fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")['baseline']\n", + "fiqa_eval" + ] + }, + { + "cell_type": "markdown", + "id": "b889cb2a-f718-4104-a62b-7357f76742d5", + "metadata": {}, + "source": [ + "## The Metrics\n", + "We are going to measure the following aspects of a RAG system. These metrics are from the Ragas library.\n", + "\n", + "1. [faithfulness](https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html): This measures the factual consistency of the generated answer against the given context.\n", + "2. [answer_relevancy](https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html): Answer Relevancy focuses on assessing how to-the-point and relevant the generated answer is to the given prompt.\n", + "3. [context precision](https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html): Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally, all the relevant chunks must appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.\n", + "4. [aspect_critique](https://docs.ragas.io/en/latest/concepts/metrics/critique.html): This is designed to assess submissions based on predefined aspects such as harmlessness and correctness. Additionally, users have the flexibility to define their own aspects for evaluating submissions according to their specific criteria.\n", + "\n", + "Refer to the [metrics documentation](https://docs.ragas.io/en/latest/concepts/metrics/index.html) to learn more about these metrics and how they work." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "33c49997-a491-4aae-bc7f-01adb61071f5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "faithfulness\n", + "answer_relevancy\n", + "context_precision\n", + "harmfulness\n" + ] + } + ], + "source": [ + "# import metrics\n", + "from ragas.metrics import faithfulness, answer_relevancy, context_precision\n", + "from ragas.metrics.critique import SUPPORTED_ASPECTS, harmfulness\n", + "\n", + "# metrics you chose\n", + "metrics = [faithfulness, answer_relevancy, context_precision, harmfulness]\n", + "\n", + "for m in metrics:\n", + " print(m.name)\n", + " # also init the metrics\n", + " m.init_model()" + ] + }, + { + "cell_type": "markdown", + "id": "23e21d8b-f72a-4043-b42e-4cf4c72d7dc8", + "metadata": {}, + "source": [ + "## The Setup\n", + "You can use model-based evaluation with Ragas in 2 ways:\n", + "1. Score each Trace: This means you will run the evaluations for each trace item. This gives you a much better idea of how each call to your RAG pipeline is performing, but can be expensive.\n", + "2. Score as Batch: In this method we will take a random sample of traces on a periodic basis and score them. 
This brings down cost and gives you a rough estimate of the performance of your app, but it can miss important samples.\n", + "\n", + "In this cookbook, we'll show you how to set up both." + ] + }, + { + "cell_type": "markdown", + "id": "1e482dec-f02c-4fa1-bd07-d292a863cde5", + "metadata": {}, + "source": [ + "## Score with Trace\n", + "\n", + "Let's take a small example of a single trace and see how you can score it with Ragas. First, let's load the data." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "184b901f-9f08-4ab1-96c4-1c50586753a3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('How to deposit a cheque issued to an associate in my business into my business account?',\n", + " '\\nThe best way to deposit a cheque issued to an associate in your business into your business account is to open a business account with the bank. You will need a state-issued \"dba\" certificate from the county clerk\\'s office as well as an Employer ID Number (EIN) issued by the IRS. Once you have opened the business account, you can have the associate sign the back of the cheque and deposit it into the business account.')" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "row = fiqa_eval[0]\n", + "row['question'], row['answer']" + ] + }, + { + "cell_type": "markdown", + "id": "4f6138d0-2ace-4b0b-beb1-da9c20bd14a0", + "metadata": {}, + "source": [ + "Now let's initialize the Langfuse client SDK to instrument your app." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "fb0e3d09-fdd6-4093-8dfe-917bc58129e9", + "metadata": {}, + "outputs": [], + "source": [ + "from langfuse import Langfuse\n", + " \n", + "langfuse = Langfuse()" + ] + }, + { + "cell_type": "markdown", + "id": "65cf5034-c2f2-4e5e-b0bc-af34dd414020", + "metadata": {}, + "source": [ + "Here we define a utility function to score your trace with the metrics you chose." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "d9b0af86-f152-4e63-a7db-264e7f0ccb18", + "metadata": {}, + "outputs": [], + "source": [ + "def score_with_ragas(query, chunks, answer):\n", + " scores = {}\n", + " for m in metrics:\n", + " print(f\"calculating {m.name}\")\n", + " scores[m.name] = m.score_single(\n", + " {'question': query, 'contexts': chunks, 'answer': answer}\n", + " )\n", + " return scores" + ] + }, + { + "cell_type": "markdown", + "id": "b3b1e219-8319-4dfb-997b-0449a927e113", + "metadata": {}, + "source": [ + "You can compute the scores with each request. Below I've outlined a dummy application that does the following steps:\n", + "1. gets a question from the user\n", + "2. fetches context from the database or vector store that can be used to answer the question\n", + "3. passes the question and the contexts to the LLM to generate the answer\n", + "\n", + "All these steps are logged as spans in a single trace in Langfuse. You can read more about traces and spans in the [Langfuse documentation](https://langfuse.com/docs/tracing).\n",
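+    "\n",
+    "The retrieval and generation calls only appear as comments in the next cell (`get_similar_chunks`, `get_response_from_llm`); they stand in for your own stack. A rough, self-contained sketch of what they might look like is below; the keyword-overlap \"retrieval\" and the `ChatOpenAI` call are illustrative assumptions, not part of the original example (it assumes `langchain` is available in the environment and an `OPENAI_API_KEY` is set):\n",
+    "\n",
+    "```python\n",
+    "from langchain.chat_models import ChatOpenAI\n",
+    "\n",
+    "# tiny in-memory corpus standing in for a real vector store\n",
+    "DOCUMENTS = [\n",
+    "    \"Just have the associate sign the back and then deposit it.\",\n",
+    "    \"You will need a state-issued dba certificate and an EIN to open a business account.\",\n",
+    "]\n",
+    "\n",
+    "def get_similar_chunks(question: str, k: int = 2) -> list[str]:\n",
+    "    # naive keyword-overlap ranking; swap in your vector store query here\n",
+    "    q_words = set(question.lower().split())\n",
+    "    ranked = sorted(DOCUMENTS, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)\n",
+    "    return ranked[:k]\n",
+    "\n",
+    "def get_response_from_llm(question: str, chunks: list[str]) -> str:\n",
+    "    # a single LLM call grounded on the retrieved chunks\n",
+    "    llm = ChatOpenAI()\n",
+    "    context = \" \".join(chunks)\n",
+    "    prompt = f\"Answer the question using only this context: {context} Question: {question}\"\n",
+    "    return llm.predict(prompt)\n",
+    "```"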
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "f21a1623-5b47-4425-a6ed-e71a7d5bc25d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "calculating faithfulness\n", + "calculating answer_relevancy\n", + "calculating context_precision\n", + "calculating harmfulness\n" + ] + }, + { + "data": { + "text/plain": [ + "{'faithfulness': 0.6666666666666667,\n", + " 'answer_relevancy': 0.9763805367382368,\n", + " 'context_precision': 0.9999999999,\n", + " 'harmfulness': 0}" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langfuse.model import CreateTrace, CreateSpan, CreateGeneration, CreateEvent, CreateScore\n", + "\n", + "# start a new trace when you get a question\n", + "question = row['question']\n", + "trace = langfuse.trace(CreateTrace(name = \"rag\"))\n", + "\n", + "# retrieve the relevant chunks\n", + "# chunks = get_similar_chunks(question)\n", + "contexts = row['contexts']\n", + "# pass it as span\n", + "trace.span(CreateSpan(\n", + " name = \"retrieval\", input={'question': question}, output={'contexts': contexts}\n", + "))\n", + "\n", + "# use llm to generate an answer with the chunks\n", + "# answer = get_response_from_llm(question, chunks)\n", + "answer = row['answer']\n", + "trace.span(CreateSpan(\n", + " name = \"generation\", input={'question': question, 'contexts': contexts}, output={'answer': answer}\n", + "))\n", + "\n", + "# compute scores for the question, context, answer tuple\n", + "ragas_scores = score_with_ragas(question, contexts, answer)\n", + "ragas_scores" + ] + }, + { + "cell_type": "markdown", + "id": "6403f6ac-807c-441c-9fe2-a6bf2f6771f3", + "metadata": {}, + "source": [ + "Once the scores are computed you can add them to the trace in Langfuse:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "1f0d222b-1db3-44e9-ba98-3d2ba6ee00d2", + "metadata": {}, + "outputs": [], + "source": [ + "# send the scores\n", + "for m in metrics:\n", + " trace.score(CreateScore(name=m.name, value=ragas_scores[m.name]))" + ] + }, + { + "cell_type": "markdown", + "id": "19a0e5d0-3a67-46e0-b5c6-baf272dbea3b", + "metadata": {}, + "source": [ + "![Trace with RAGAS scores](https://langfuse.com/images/docs/ragas-trace-score.png)" + ] + }, + { + "cell_type": "markdown", + "id": "4fd68b13-9743-424f-830a-c6d32e3d09c6", + "metadata": {}, + "source": [ + "Note that scoring is blocking, so make sure you send the generated answer back to the user before waiting for the scores to be computed. Alternatively, you can run `score_with_ragas()` in a separate thread and pass in the trace_id to log the scores; a rough sketch of this is shown a little further below.\n", + "\n", + "Or you can consider scoring as a batch, described in the next section." + ] + }, + { + "cell_type": "markdown", + "id": "605416ed-e0ba-43d2-8cc9-b2343ebbe309", + "metadata": {}, + "source": [ + "## Scoring as batch\n", + "\n", + "Scoring each production trace can be time-consuming and costly depending on your application architecture and traffic. In that case, it's better to start off with a batch scoring method. Decide on a timespan you want to run the batch process over and the number of traces you want to _sample_ from that time slice. Create a dataset and call `ragas.evaluate` to analyse the results.\n", + "\n", + "You can run this periodically to keep track of how the scores are changing across time slices and figure out if there are any discrepancies.\n", + "\n", + "To create demo data in Langfuse, let's first create ~10 traces with the fiqa dataset.\n",
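+    "\n",
+    "But before that, here is the rough sketch of the threaded per-trace scoring mentioned in the previous section. It assumes the `metrics`, `score_with_ragas` and `langfuse` objects defined earlier; `log_scores_async` is a hypothetical helper name, and you need the id of the trace you created for the request:\n",
+    "\n",
+    "```python\n",
+    "import threading\n",
+    "\n",
+    "from langfuse.model import InitialScore\n",
+    "\n",
+    "def log_scores_async(trace_id, question, chunks, answer):\n",
+    "    # compute the Ragas scores off the request path and attach them to the trace by id\n",
+    "    def _run():\n",
+    "        scores = score_with_ragas(question, chunks, answer)\n",
+    "        for name, value in scores.items():\n",
+    "            langfuse.score(InitialScore(name=name, value=value, trace_id=trace_id))\n",
+    "    threading.Thread(target=_run, daemon=True).start()\n",
+    "\n",
+    "# usage: respond to the user first, then score in the background\n",
+    "# log_scores_async(trace.id, question, contexts, answer)\n",
+    "```"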
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "66910fc2-b84f-4a44-947a-854ca6fb48e2", + "metadata": {}, + "outputs": [], + "source": [ + "from langfuse.model import CreateTrace, CreateSpan, CreateGeneration, CreateEvent, CreateScore\n", + "# fiqa traces\n", + "for interaction in fiqa_eval.select(range(10, 20)):\n", + " trace = langfuse.trace(CreateTrace(name = \"rag\"))\n", + " trace.span(CreateSpan(\n", + " name = \"retrieval\", \n", + " input={'question': interaction['question']}, \n", + " output={'contexts': interaction['contexts']}\n", + " ))\n", + " trace.span(CreateSpan(\n", + " name = \"generation\", \n", + " input={'question': interaction['question'], 'contexts': interaction['contexts']}, \n", + " output={'answer': interaction['answer']}\n", + " ))\n", + "\n", + "# wait until the Langfuse SDK has processed all events before trying to retrieve them in the next step\n", + "langfuse.flush()" + ] + }, + { + "cell_type": "markdown", + "id": "061d56ac-8acc-4537-b6c0-591d92cb8806", + "metadata": {}, + "source": [ + "Now that the dataset is uploaded to Langfuse, you can retrieve it as needed with this handy function." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "25f13042-b8b8-44bb-803d-b790c39c39f4", + "metadata": {}, + "outputs": [], + "source": [ + "def get_traces(name=None, limit=None, user_id=None):\n", + " all_data = []\n", + " page = 1 \n", + "\n", + " while True:\n", + " response = langfuse.client.trace.list(\n", + " name=name, page=page, user_id=user_id\n", + " )\n", + " if not response.data:\n", + " break\n", + " page += 1\n", + " all_data.extend(response.data)\n", + " # stop early once enough traces are collected (limit=None fetches everything)\n", + " if limit is not None and len(all_data) > limit:\n", + " break\n", + "\n", + " return all_data[:limit]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "c0b778cc-cb02-43d6-a81b-fc586905d81e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from random import sample\n", + "\n", + "NUM_TRACES_TO_SAMPLE = 3\n", + "traces = get_traces(name='rag', limit=5)\n", + "traces_sample = sample(traces, NUM_TRACES_TO_SAMPLE)\n", + "\n", + "len(traces_sample)" + ] + }, + { + "cell_type": "markdown", + "id": "d37337dd-fe8a-4a4e-b2fc-f7435cf0eb51", + "metadata": {}, + "source": [ + "Now let's make a batch and score it. Ragas uses a Hugging Face `Dataset` object to build the dataset and run the evaluation. 
If you run this on your own production data, use the right keys to extract the question, contexts and answer from the trace." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "d3928d37-1b60-471e-a99d-8a158ea8652b", + "metadata": {}, + "outputs": [], + "source": [ + "# score on a sample\n", + "from random import sample\n", + "\n", + "evaluation_batch = {\n", + " \"question\": [],\n", + " \"contexts\": [],\n", + " \"answer\": [],\n", + " \"trace_id\": [],\n", + "}\n", + "\n", + "for t in traces_sample:\n", + " observations = [langfuse.client.observations.get(o) for o in t.observations]\n", + " for o in observations:\n", + " if o.name == 'retrieval':\n", + " question = o.input['question']\n", + " contexts = o.output['contexts']\n", + " if o.name == 'generation':\n", + " answer = o.output['answer']\n", + " evaluation_batch['question'].append(question)\n", + " evaluation_batch['contexts'].append(contexts)\n", + " evaluation_batch['answer'].append(answer)\n", + " evaluation_batch['trace_id'].append(t.id)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "bf83f16d-4c15-484d-9ed2-e696aa8a0e28", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "evaluating with [faithfulness]\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.91s/it]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "evaluating with [answer_relevancy]\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.04s/it]\n" + ] + } + ], + "source": [ + "# run ragas evaluate\n", + "from datasets import Dataset\n", + "from ragas import evaluate\n", + "from ragas.metrics import faithfulness, answer_relevancy\n", + "\n", + "ds = Dataset.from_dict(evaluation_batch)\n", + "r = evaluate(ds, metrics=[faithfulness, answer_relevancy])" + ] + }, + { + "cell_type": "markdown", + "id": "1a4da691-4965-4fcd-bfd0-9a5f3e454f48", + "metadata": {}, + "source": [ + "And that is it! You can see the scores over a time period." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "ecf51162-edc3-4c54-9fce-89595d01004d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ragas_score': 0.9309, 'faithfulness': 0.8889, 'answer_relevancy': 0.9771}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "r" + ] + }, + { + "cell_type": "markdown", + "id": "d50f869a-50ab-4040-8b48-fb5c8eecb9b8", + "metadata": {}, + "source": [ + "You can also push the scores back into Langfuse or use the exported pandas dataframe to run further analysis." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "e9137f38-275f-4f74-be17-530db0fcc3e1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
questioncontextsanswerfaithfulnessanswer_relevancytrace_id
0How to deposit a cheque issued to an associate...[Just have the associate sign the back and the...\\nThe best way to deposit a cheque issued to a...1.0000000.976908670bc2b5-f25a-4ec4-9567-27b44b3d9fd8
1How to deposit a cheque issued to an associate...[Just have the associate sign the back and the...\\nThe best way to deposit a cheque issued to a...1.0000000.977494cb00e976-10b5-4d85-a35c-b3e8dd3dc955
2How to deposit a cheque issued to an associate...[Just have the associate sign the back and the...\\nThe best way to deposit a cheque issued to a...0.6666670.9768932a00707d-a5e5-49ed-8bc2-67f133136330
\n", + "
" + ], + "text/plain": [ + " question \\\n", + "0 How to deposit a cheque issued to an associate... \n", + "1 How to deposit a cheque issued to an associate... \n", + "2 How to deposit a cheque issued to an associate... \n", + "\n", + " contexts \\\n", + "0 [Just have the associate sign the back and the... \n", + "1 [Just have the associate sign the back and the... \n", + "2 [Just have the associate sign the back and the... \n", + "\n", + " answer faithfulness \\\n", + "0 \\nThe best way to deposit a cheque issued to a... 1.000000 \n", + "1 \\nThe best way to deposit a cheque issued to a... 1.000000 \n", + "2 \\nThe best way to deposit a cheque issued to a... 0.666667 \n", + "\n", + " answer_relevancy trace_id \n", + "0 0.976908 670bc2b5-f25a-4ec4-9567-27b44b3d9fd8 \n", + "1 0.977494 cb00e976-10b5-4d85-a35c-b3e8dd3dc955 \n", + "2 0.976893 2a00707d-a5e5-49ed-8bc2-67f133136330 " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = r.to_pandas()\n", + "\n", + "# add the langfuse trace_id to the result dataframe\n", + "df[\"trace_id\"] = ds[\"trace_id\"]\n", + "\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "dce92841-2ef6-4392-875f-cb7184c3cd22", + "metadata": {}, + "outputs": [], + "source": [ + "from langfuse.model import InitialScore\n", + "\n", + "for _, row in df.iterrows():\n", + " for metric_name in [\"faithfulness\", \"answer_relevancy\"]:\n", + " langfuse.score(InitialScore(\n", + " name=metric_name,\n", + " value=row[metric_name],\n", + " trace_id=row[\"trace_id\"]))" + ] + }, + { + "cell_type": "markdown", + "id": "e9b02f19-c421-41fd-9ab3-70466839170c", + "metadata": {}, + "source": [ + "![List of traces with RAGAS scores](https://langfuse.com/images/docs/ragas-list-score-traces.png)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/src/ragas/llms/llamaindex.py b/src/ragas/llms/llamaindex.py index 2a42d9efdb..1c36ba583f 100644 --- a/src/ragas/llms/llamaindex.py +++ b/src/ragas/llms/llamaindex.py @@ -3,7 +3,6 @@ import typing as t from langchain.schema.output import Generation, LLMResult -from llama_index.llms.base import LLM as LiLLM from ragas.async_utils import run_async_tasks from ragas.llms.base import BaseRagasLLM @@ -11,6 +10,7 @@ if t.TYPE_CHECKING: from langchain.callbacks.base import Callbacks from langchain.prompts import ChatPromptTemplate + from llama_index.llms.base import LLM as LiLLM class LlamaIndexLLM(BaseRagasLLM):