Releases · confident-ai/deepeval

Synthetic Data generation. Generate synthetic data from documents easily: https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data
caching. If you're running 10k test cases and it fails at the 9999th test case, you no longer have to rerun the first 9999 test case as you can just read from cache using the -c flag: https://docs.confident-ai.com/docs/evaluation-introduction#cache
repeats. If you want to repeat each test case for statistical significant, use the -r flag: https://docs.confident-ai.com/docs/evaluation-introduction#repeats
LLM Benchmarks. Supporting popular benchmarks such as MMLU, HellaSwag, and BIG-BH so anyone can evaluate ANY model on research backed benchmarks in a few lines of code.
G-Eval improvements. The G-Eval metric now supports using logprobs of tokens to find the weighted summed score.

Assets 2

09 Mar 17:27

penguine-ip

v0.20.85

bb40704

Async Support for Prod

In deepeval v0.20.85:

asynchronous support throughout deepeval, and no longer using threads. Users can also call individual metrics asynchronously: https://docs.confident-ai.com/docs/metrics-introduction#measuring-metrics-in-async
improved the way in which you create a custom LLM for evaluation. You'll now have to implement an asynchronous generate() method to use deepeval's async features: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm
strict mode for all metrics!
improve evaluate() function for more customizability: https://docs.confident-ai.com/docs/evaluation-introduction#evaluating-without-pytest

Assets 2

04 Mar 18:04

penguine-ip

v0.20.80

4757393

Conversational Metrics and Synthetic Data Generation

In DeepEval's latest release, there is now:

conversational metrics: https://docs.confident-ai.com/docs/metrics-knowledge-retention. This metric evaluates whether your LLM is able to retain factual information presented to it throughout a conversation
synthetic data generation. Generate evaluation datasets from scratch: https://docs.confident-ai.com/docs/evaluation-datasets#generate-an-evaluation-dataset

Assets 2

25 Feb 11:18

penguine-ip

v0.20.73

564b108

Production Stability

For the newest release, deepeval now is now stable for production use:

reduced package size
separated functionality of pytest vs deepeval test run command
included coverage score for summarization
fix contextual precision node error
released docs for better transparency into metrics calculation
allows users to configure RAGAS metrics for custom embedding models: https://docs.confident-ai.com/docs/metrics-ragas#example
fixed bugs with checking for package updates

Assets 2

14 Feb 06:05

penguine-ip

v0.20.68

2a6da83

Hugging Face and LlamaIndex integration

For the latest release, DeepEval:

Supports Hugging Face users by providing real-time evaluations during fine-tuning: https://docs.confident-ai.com/docs/integrations-huggingface
Supports LlamaIndex users by allowing unit testing of LlamaIndex apps in CI/CD, and offer metrics in LlamaIndex's evaluators: https://docs.confident-ai.com/docs/integrations-llamaindex
Improvements to accuracy and reliability in Faithfulness and Answer Relevancy
Summarization Metric now offers explanation
You can now use ANY LLM for evaluation: https://docs.confident-ai.com/docs/metrics-introduction#using-a-custom-llm

Assets 2

16 Jan 11:22

penguine-ip

v0.20.57

be8e95c

LLM-Evals now support all LangChain chatmodels

LLM-Evals (LLM evaluated metrics) now support all of langchain's chat models.
LLMTestCase now has execution_time and cost, useful for those looking to evaluate on these parameters
minimum_score is now threshold instead, meaning you can now create custom metrics that either have a "minimum" or "maximum" threshold
LLMEvalMetric is now GEval
Llamaindex Tracing integration: (https://docs.llamaindex.ai/en/stable/module_guides/observability/observability.html#deepeval)

Assets 2

28 Dec 11:50

penguine-ip

v0.20.43

ab16dc3

ALL RAG Metrics now offers score reasoning, and a lot more.

In this release:

Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, and Contextual Recall, all offer a reasoning for its given score.
Azure OpenAI now supported via a single command in the CLI: https://docs.confident-ai.com/docs/metrics-introduction#using-azure-openai
New Summarization Metric that uses the QAG framework for its implementation: https://docs.confident-ai.com/docs/metrics-summarization
Pulling datasets from Confident AI now offers an intermediate step for additional data processing before evaluation: https://docs.confident-ai.com/docs/confident-ai-evaluate-datasets#pull-your-dataset-from-confident-ai
Decoupled imports from transformers, sentence_transformers, and pandas to reduce package size

Assets 2

14 Dec 10:50

penguine-ip

v0.20.35

c5045b1

Lots of new features

Lots of new features this release:

JudgementalGPT now allows for different languages - useful for our APAC and European friends
RAGAS metrics now supports all OpenAI models - useful for those running into context length issues
LLMEvalMetric now returns a reasoning for its score
deepeval test run now has hooks that call on test run completion
evaluate now displays retrieval_context for RAG evaluation
RAGAS metric now displays metric breakdown for all its distinct metrics

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Releases: confident-ai/deepeval

Agentic Evaluation Metric, Custom Evaluation LLMs, and Async for Synthetic Data Generation

Verbosity in Metrics, Hyperparameter Logging, Improved Synthetic Data Generation, Better Async Support

Synthetic Data, Caching, Benchmarks, and GEval improvement

Async Support for Prod

Conversational Metrics and Synthetic Data Generation

Production Stability

Hugging Face and LlamaIndex integration

LLM-Evals now support all LangChain chatmodels

ALL RAG Metrics now offers score reasoning, and a lot more.

Lots of new features