AttributeError('StringIO' object has no attribute 'classifications') #1688

Closed
timelesshc opened this issue Nov 19, 2024 · 13 comments
Labels: bug (Something isn't working), module-metrics (this is part of metrics module)

Comments

@timelesshc

timelesshc commented Nov 19, 2024

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
I'm using the latest ragas version and have been encountering the AttributeError('StringIO' object has no attribute 'classifications') error message when evaluating metrics.

I'm using chatglm APIs and wonder if there is a compatibility issue.

Ragas version: 0.2.5
Python version: 3.12

Code to Reproduce

from datasets import Dataset
import pandas as pd
from ragas import evaluate, RunConfig

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.chat_models import ChatZhipuAI
from langchain_community.embeddings import ZhipuAIEmbeddings

from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextRecall,
    LLMContextPrecisionWithReference,
)

def read_excel(file_path, sheet_name='Sheet1'):
    data = pd.read_excel(file_path, sheet_name=sheet_name)
    data.fillna('', inplace=True)
    data = [each_row._asdict() for each_row in data.itertuples(index=False)]
    return data

def write_xlsx(file_path, data, sheet_name='Sheet1'):
    data = pd.DataFrame(data)
    data.to_excel(file_path, sheet_name=sheet_name, index=False)

def get_dataset(data):
    questions = []
    answers = []
    contexts = []
    ground_truths = []

    for each in data:
        questions.append(each['query'])
        answers.append(each['answer'])
        contexts.append([each['context']])
        ground_truths.append(each['reference'])

    data = {
        "user_input": questions, 
        "response": answers, 
        "retrieved_contexts": contexts,
        "reference": ground_truths
    }
    dataset = Dataset.from_dict(data)
    return dataset

def setup_llm_and_embedder():
    llm = LangchainLLMWrapper(ChatZhipuAI(
        base_url="https://open.bigmodel.cn/api/paas/v4/", 
        api_key="xxx",  # API Key
        model="glm-4-plus",  
        max_tokens= 8000
    ))

    text_embedder = LangchainEmbeddingsWrapper(ZhipuAIEmbeddings(
        api_base="https://open.bigmodel.cn/api/paas/v4/",
        api_key="xxx",  # API Key
        model="embedding-2"
        
    ))
    return llm, text_embedder

if __name__ == "__main__":
    file = 'xxx'
    save_file = 'xxx'
    data = read_excel(file)
    dataset = get_dataset(data)
    llm, text_embedder = setup_llm_and_embedder()
    run_config = RunConfig(
        max_retries=5,
        max_wait=120,
        timeout=500,
        max_workers=8
    )
    result = evaluate(
        dataset = dataset,
        llm = llm,
        run_config=run_config,
        embeddings = text_embedder,
        metrics=[
            Faithfulness(llm=llm),
            ResponseRelevancy(llm=llm),
            LLMContextRecall(llm=llm),
            LLMContextPrecisionWithReference(llm=llm),
        ],
    )
    df = result.to_pandas()
    write_xlsx(save_file, df)

Error trace
Evaluating: 2%|█▍ | 14/792 [00:51<33:06, 2.55s/it]Exception raised in Job[10]: AttributeError('StringIO' object has no attribute 'classifications')
Evaluating: 5%|████▏ | 41/792 [02:50<1:32:51, 7.42s/it]Exception raised in Job[42]: AttributeError('StringIO' object has no attribute 'classifications')
Evaluating: 5%|████▎ | 42/792 [03:01<1:42:38, 8.21s/it]Exception raised in Job[46]: AttributeError('StringIO' object has no attribute 'classifications')
Evaluating: 7%|█████▋ | 54/792 [04:01<46:35, 3.79s/it]Exception raised in Job[54]: AttributeError('StringIO' object has no attribute 'classifications')
Evaluating: 7%|██████ | 58/792 [04:16<43:52, 3.59s/it]Exception raised in Job[50]: AttributeError('StringIO' object has no attribute 'classifications')
Evaluating: 8%|███████ | 67/792 [04:46<37:19, 3.09s/it]Exception raised in Job[62]: AttributeError('StringIO' object has no attribute 'classifications')

@timelesshc timelesshc added the bug Something isn't working label Nov 19, 2024
@dosubot dosubot bot added the module-metrics this is part of metrics module label Nov 19, 2024
@cruiser1174

Also having this same problem while evaluating ragas faithfulness through the giskard.rag.evaluate function.

@Squire-tomsk

Squire-tomsk commented Dec 5, 2024

It looks like this occurs because the fix_output_format_prompt object has StringIO as its output_model instead of the type defined in the pydantic_object field of the RagasOutputParser class. I can't follow the logic of the FixOutputFormat class for now.
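
To illustrate the failure mode described above, here is a minimal sketch using only the standard library. It does not use ragas itself: the `StringIO` and `ClassificationOutput` classes and the `parse_with_wrong_model` function are hypothetical stand-ins, assumed only for illustration. The point is that if a repair step hands back a generic text-only model rather than the model the metric asked for, reading the expected field fails with the same message as in the error trace.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StringIO:
    """Hypothetical stand-in for a generic text-only output model."""
    text: str


@dataclass
class ClassificationOutput:
    """Hypothetical stand-in for the structured output a metric expects."""
    classifications: List[dict] = field(default_factory=list)


def parse_with_wrong_model(raw: str) -> StringIO:
    # Simulates a repair step that returns the generic model
    # instead of the model the caller actually needs.
    return StringIO(text=raw)


def describe_failure(raw: str) -> str:
    parsed = parse_with_wrong_model(raw)
    try:
        # The metric code expects a ClassificationOutput here.
        return str(parsed.classifications)
    except AttributeError as exc:
        return str(exc)


message = describe_failure('{"classifications": []}')
print(message)
```

If this reading is right, the input data would not need to be malformed for the error to appear; it would be enough for the output-repair path to be taken.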

@timelesshc
Author

@jjmachan @shahules786
Any inputs on this issue? Thanks

@baptvit

baptvit commented Dec 14, 2024

I'm also facing the same issue.

@lailanelkoussy

lailanelkoussy commented Dec 16, 2024

I am also facing the same issue while evaluating ragas metrics through the giskard.rag.evaluate function.

@cruiser1174

cruiser1174 commented Dec 16, 2024

I solved the particular problem I had while evaluating via giskard. When saving model results in an AgentAnswer object, I was consolidating all of the contexts into a single string, whereas they should be saved as a list of strings. Converting to a list of strings solved the problem for me.

Here is a relevant extract from my model class - see comments in all caps

    # Import path assumed from giskard's RAG toolkit
    from giskard.rag import AgentAnswer, KnowledgeBase, QATestset, evaluate

    def wrap_rag_model(self, question: str, history=None):
        # Convert the chat history into the input/output pairs the model expects.
        # history defaults to None to avoid a shared mutable default argument.
        messages = []
        for message in history or []:
            if message["role"] == "user":
                messages.append({"inputs": {"chat_input": message["content"]}})
            elif message["role"] == "assistant":
                messages[-1]["outputs"] = {"chat_output": message["content"]}

        # Generate a response using Azure OpenAI
        response = self.call(user_prompt=question, history=messages)

        # Ensure that documents is a list of strings
        documents = [str(d) for d in (self.get_response_context(response) or [])]

        # Instead of returning a simple string, we return the AgentAnswer object, which
        # allows us to specify the retrieved context that is used by RAGAS metrics
        return AgentAnswer(
            message=self.get_response_text(response),
            documents=documents,  # HERE ENSURE DOCUMENTS IS A LIST OF STRINGS
        )

    def scan_rag_model(
        self,
        testset: QATestset,
        knowledgebase: KnowledgeBase,
        ragas_metrics: list,
    ):
        self.rag_scan = evaluate(
            self.wrap_rag_model,  # HERE THE FUNCTION MUST RETURN AN AgentAnswer object
            testset=testset,
            knowledge_base=knowledgebase,
            metrics=ragas_metrics,
            agent_description=self.description,
        )

        return self.rag_scan
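
For anyone who wants to apply the "list of strings" check outside of a model class, here is a small standalone sketch. The helper name `normalize_contexts` is hypothetical and not part of giskard or ragas; it only shows one way to coerce whatever a retriever returns into a list of strings before it is handed to the evaluation.

```python
from typing import Iterable, List, Optional, Union


def normalize_contexts(raw: Optional[Union[str, Iterable[object]]]) -> List[str]:
    """Coerce retrieved contexts into a list of strings."""
    if raw is None:
        return []
    if isinstance(raw, str):
        # A single string becomes a one-element list rather than being
        # iterated character by character.
        return [raw] if raw else []
    return [str(item) for item in raw]


print(normalize_contexts("one consolidated context"))
print(normalize_contexts(["doc a", 42]))
```

Note that this only fixes the shape of the input. As later comments in this thread show, the error can still occur when the contexts are already a list of strings.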
        

@jjmachan
Member

hey folks - taking a look at this now

@lailanelkoussy

I tried the fix described by @cruiser1174 above (saving the contexts as a list of strings instead of a single string) in my case and it still did not work (I am using giskard).

@tim-hilde

The issue is still occurring as of 0.2.10: #1831

@andreped
Contributor

andreped commented Jan 14, 2025

I'm seeing issues specifically with Faithfulness. The metric works perfectly fine for very simple contexts (List[str]), but as the content inside the list gets increasingly complex, something goes wrong. It doesn't matter if I force-cast the content within the list to str; it still fails downstream.

It would be great if there were more verbose output on what exactly goes wrong, as this is close to impossible to debug...


EDIT: No wait, it seems to fail even with very simple List[str] contexts, and maybe another metric is causing it... No idea what's wrong anymore...

@andreped
Contributor

andreped commented Jan 14, 2025

So after debugging this for way too long, I managed to get around the issue for one of my applications by computing the context_recall metric separately. No idea why that was the issue. But if I do something like this, I am able to compute the metrics:

score = ragas.evaluate(
    dataset,
    metrics=[
        answer_correctness,
        faithfulness,
        answer_similarity,
        context_precision,
        answer_relevancy,
        # context_recall,   # <- DON'T include this here
    ],
    llm=[...],
    embeddings=[...],
    raise_exceptions=True,
)
result = score.to_pandas()

# compute context recall separately
context_recall_score = ragas.evaluate(
    dataset,
    metrics=[context_recall],  # <- Include this metric here instead
    llm=[...],
    embeddings=[...],
    raise_exceptions=True,
)

# merge context recall score with result
result["context_recall"] = context_recall_score.to_pandas()["context_recall"]

And of course, there is no need to use evaluate() in the second step for this exact case, but maybe others experience issues with multiple metrics and this could be a quick-fix until the real issue has been resolved. Hopefully this gets resolved very soon!

Tested with ragas==0.2.10.


NOTE: This is by no means a fix for the underlying issue, but rather a temporary workaround/hack.
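
The same idea can be written as a small generic helper. The sketch below uses only the standard library and a fake evaluation function, so `evaluate_in_groups` and `fake_evaluate` are hypothetical names and not part of ragas. In real use, the callable passed in would wrap a call such as the `ragas.evaluate(...)` calls shown above and return one list of scores per metric.

```python
from typing import Callable, Dict, List, Sequence

Scores = Dict[str, List[float]]


def evaluate_in_groups(
    evaluate_fn: Callable[[List[str]], Scores],
    metric_groups: Sequence[Sequence[str]],
) -> Scores:
    """Run each group of metrics in its own call and merge the score columns."""
    merged: Scores = {}
    for group in metric_groups:
        merged.update(evaluate_fn(list(group)))
    return merged


def fake_evaluate(metric_names: List[str]) -> Scores:
    # Stand-in for a real evaluation call: a constant score for two rows.
    return {name: [0.5, 0.5] for name in metric_names}


merged_scores = evaluate_in_groups(
    fake_evaluate,
    [["faithfulness", "answer_relevancy"], ["context_recall"]],
)
print(sorted(merged_scores))
```

Each group costs a separate pass over the dataset, so this is only worth doing while the workaround is needed.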

@Manon-56

I tried the workaround from @andreped above (computing context_recall in a separate evaluate() call). In my case I get "'StringIO' object has no attribute 'statements'", an error related to faithfulness. Unfortunately, as I feared, this workaround did not work for me.

@jjmachan
Member

Closed with "fix: output parser bug" (Pull Request #1864 in explodinggradients/ragas, by jjmachan). It will be released with v0.2.12 🙂

I'm closing this for now but if the issue is still persisting, please do let me know - really sorry about the delay
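
For readers landing here later, a quick way to check whether an installed ragas already includes that release is to compare the installed version against 0.2.12. This is a minimal sketch using the standard library's `importlib.metadata`; the helper names are hypothetical, and the version parsing is deliberately naive (numeric components only, so a pre-release such as `0.2.12rc1` is treated like `0.2.12`).

```python
import re
from importlib import metadata
from typing import Optional, Tuple

FIXED_IN = (0, 2, 12)  # release named in the comment above


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_parser_fix(installed: str) -> bool:
    return version_tuple(installed) >= FIXED_IN


def installed_ragas_version() -> Optional[str]:
    try:
        return metadata.version("ragas")
    except metadata.PackageNotFoundError:
        return None


current = installed_ragas_version()
if current is None:
    print("ragas is not installed in this environment")
else:
    print(current, "includes the fix" if has_parser_fix(current) else "predates the fix")
```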
