[X] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug
The bug is simple to explain and reproduce but may require a design decision to solve properly. In the Answer-Relevance class, all generated questions are identical, which defeats the purpose: we want to compute the mean cosine similarity between many diverse generated questions and the original query. This happens because the LangchainLLMWrapper/BaseRagasLLM class does not account for the temperature preset on the LangChain OpenAI class.
In the BaseRagasLLM.generate/generate_text function, the temperature attribute of the passed LLM is not checked, causing the temperature to default to a value close to 0. This results in identical generated questions, which contradicts the expected behavior of diversity in the generated questions for answer relevance.
Where the Bug First Occurs
The bug manifests in the _ascore function of the answer-relevance class. A naive solution that demonstrates the issue involves explicitly passing the temperature from the LangchainLLMWrapper object to the generate function:
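The original snippet isn't reproduced here, but the naive fix looks roughly like the following sketch (based on ragas 0.2.x; the exact call site and argument names inside _ascore, such as prompt_input, are approximations):

# Inside ResponseRelevancy._ascore -- sketch only, exact signature may differ
response = await self.question_generation.generate(
    data=prompt_input,
    llm=self.llm,
    # explicitly forward the temperature preset on the wrapped LangChain LLM
    temperature=getattr(self.llm.langchain_llm, "temperature", None),
    callbacks=callbacks,
)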
While this resolves the issue, it is not an elegant or scalable solution because it redundantly overrides a property that should ideally be encapsulated within the wrapper itself.
Related Commit
This issue is related to a previously closed issue addressed in this commit. However, the commit overwrites the temperature of the langchain_llm object without addressing the root problem of ensuring that the preset temperature is respected.
Proposed Solution
Modify LangchainLLMWrapper or BaseRagasLLM
Encapsulate the logic for handling temperature within the wrapper itself:
class LangchainLLMWrapper(BaseRagasLLM):
    def __init__(self, llm):
        self.llm = llm
        self.temperature = getattr(llm, "temperature", None)

    async def generate(self, prompt, n=1, temperature=None, **kwargs):
        # Use the passed temperature or fall back to the LLM's default
        effective_temperature = temperature if temperature is not None else self.temperature
        return await self.llm.generate(prompt, n=n, temperature=effective_temperature, **kwargs)
Refactor the generate_multiple Method
Ensure temperature is propagated consistently in PydanticPrompt:
async def generate_multiple(
    self,
    llm: BaseRagasLLM,
    data: InputModel,
    n: int = 1,
    temperature: t.Optional[float] = None,
    stop: t.Optional[t.List[str]] = None,
    callbacks: t.Optional[Callbacks] = None,
    retries_left: int = 3,
) -> t.List[OutputModel]:
    ...
    # Use temperature from the wrapper
    resp = await llm.generate(
        prompt_value,
        n=n,
        temperature=temperature,  # Wrapper handles default fallback
        stop=stop,
        callbacks=callbacks,
    )
    # Rest of the logic remains the same...
Advantages of the Proposed Solution
1. Encapsulation: By moving temperature handling into the wrapper, the logic in higher-level components becomes simpler and more modular.
2. Flexibility: This approach respects user-defined defaults and allows overrides when needed.
3. Readability: Redundant checks and assignments are removed, making the code cleaner and more maintainable.
Ragas version: 0.2.9
Python version: 3.9
Code to Reproduce
from ragas.metrics import ResponseRelevancy
from ragas.llms import LangchainLLMWrapper
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=1.0))
sample_to_evaluate = SingleTurnSample(
    # illustrative placeholder data; replace with real values from your dataset
    user_input="When was the first Super Bowl played?",
    response="The first Super Bowl was held on January 15, 1967.",
)
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-large", dimensions=3072))
scorer = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
answer_relevance_metric_score = scorer.single_turn_score(sample_to_evaluate)
To reproduce: put a breakpoint in the _ascore function at the line "response = await self.question_generation.generate" and observe that the temperature set on the LangChain LLM is ignored.
Error trace
N/A – No error trace is produced, but the lack of diversity in the generated questions demonstrates the issue.
Expected behavior
Generated questions in the Answer-Relevance flow should exhibit diversity and should not be generated identically N times; the temperature used in generation should reflect the preset temperature of the LLM object.
Additional context
Setting the temperature to a high value like 1 might produce diverse questions, but it may also make the noncommittal part of the answer-relevance prompt more random. In that case, we could split the flow into two steps: a low-temperature call for the noncommittal check, and, if that passes, question generation with a higher temperature. It might then make more sense for the answer-relevance class to hold two temperature properties and forward the appropriate one for each case; a rough sketch of that split is included below.
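For illustration only, the two temperature attributes and the separate noncommittal prompt below are hypothetical; they are not part of the current Ragas API:

# Hypothetical sketch -- attribute names and committal_prompt are illustrative
class ResponseRelevancy(MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric):
    committal_temperature: float = 0.0   # deterministic check for noncommittal answers
    question_temperature: float = 1.0    # higher temperature for diverse question generation

    async def _ascore(self, row, callbacks):
        # 1) check whether the answer is noncommittal at a low temperature
        committal_check = await self.committal_prompt.generate(
            data=..., llm=self.llm,
            temperature=self.committal_temperature, callbacks=callbacks,
        )
        # 2) if the answer is committal, generate N diverse questions at a higher temperature
        responses = await self.question_generation.generate_multiple(
            data=..., n=self.strictness, llm=self.llm,
            temperature=self.question_temperature, callbacks=callbacks,
        )
        ...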
Let me know what you think of this issue and how you would like to solve it. I don't mind creating a PR with a solution once we have agreed on one.