A comprehensive, expert-validated dataset of over 1,000 entries designed specifically for the field of chemistry. The dataset was curated by first generating and filtering a Question-Answer-Context (QAC) dataset with an automated GPT-4-based framework, followed by rigorous evaluation by chemistry experts. We also provide two supplementary datasets: ChemLit-QA-neg, focused on negative data, and ChemLit-QA-multi, focused on multihop reasoning tasks for LLMs, further enhancing the resources available for advanced scientific research.
We provide the full ChemLit-QA dataset and its variants in this repository, as well as the exact train-test split of ChemLit-QA used in the fine-tuning task.
Field | Description |
---|---|
chunk | The text chunk from which the Question-Answer-Context (QAC) triple is generated. |
Reasoning_type | Expert-corrected reasoning type. Includes 7 categories: Explanatory, Comparative, Causal, Conditional, Analogical, Evaluative, Predictive |
Question | LLM-generated question |
Answer | Expert-corrected answer |
Difficulty | Expert-assigned difficulty. Includes 3 categories: Easy, Medium, Hard |
Context | Expert-corrected context. Contains the full sentences that support the answer. |
A_start_end | The start-end indices of the answer (most similar sentences) in the chunk |
similar_chunks | The top 6 most similar chunks to the given chunk in terms of cosine similarity |
Cluster_labels | 2-level hierarchical label describing the topic of this chunk |
ID | Identifier of the entry |
Answer Relevancy Scores_gpt-4o | How relevant the answer is to the question, assessed by GPT-4o |
Faithfulness Scores_gpt-4o | How faithful the answer is to the context, assessed by GPT-4o |
Hallucination Scores_gpt-4o | How much information in the answer is not mentioned in the context, assessed by GPT-4o |
Question Faithfulness Scores_gpt-4o | How faithful the question is to the context, assessed by GPT-4o |
SE_penalized | Penalized Semantic Entropy of the question |
Keywords | Keywords of the question |
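A minimal sketch of working with entries that follow this schema. The field names come from the table above; the entry values and the filtering logic are invented for illustration, not taken from the dataset:

```python
# Toy entries following the ChemLit-QA schema described above
# (values are invented; only the field names come from the dataset).
entries = [
    {"ID": "qa-001", "Reasoning_type": "Causal", "Difficulty": "Easy",
     "Question": "Why does the catalyst increase the rate?",
     "Answer": "Because it lowers the activation energy."},
    {"ID": "qa-002", "Reasoning_type": "Comparative", "Difficulty": "Hard",
     "Question": "How does solvent A compare to solvent B?",
     "Answer": "Solvent A is more polar."},
]

# Filter by the expert-assigned difficulty, e.g. to build a
# hard-question evaluation subset.
hard = [e for e in entries if e["Difficulty"] == "Hard"]
print([e["ID"] for e in hard])  # → ['qa-002']
```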
Metric | Mean ± std. dev |
---|---|
Answer Relevancy Score (GPT-4o) | 0.99 ± 0.02 |
Faithfulness Score (GPT-4o) | 0.99 ± 0.01 |
Hallucination Score (GPT-4o) | 0.0 ± 0.0 |
Question Faithfulness Score (GPT-4o) | 0.93 ± 0.10 |
Penalized semantic entropy (GPT-4o) | 0.20 ± 0.44 |
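The mean ± standard deviation figures above can be reproduced from the per-entry score columns with the standard library. The scores below are placeholders, since the real values live in the dataset files:

```python
from statistics import mean, stdev

# Placeholder per-entry scores standing in for one of the
# "... Scores_gpt-4o" columns of ChemLit-QA.
scores = [0.99, 1.0, 0.97, 1.0, 0.99]

mu, sigma = mean(scores), stdev(scores)
print(f"{mu:.2f} ± {sigma:.2f}")  # → 0.99 ± 0.01
```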