@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. natural-questions Public

    Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 916 151

  2. conceptual-captions Public

    Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

    Shell 513 26

  3. Objectron Public

    Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

    Jupyter Notebook 2.2k 263

  4. wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    994 40

  5. paws Public

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

    Python 545 52

  6. dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 541 123

Repositories

Showing 10 of 161 repositories
  • sanpo_dataset Public
    Python 39 Apache-2.0 1 3 2 Updated Sep 19, 2024
  • SeeGULL-Multilingual Public

    SeeGULL Multilingual is a multilingual and multicultural dataset of stereotypes. It consists of stereotypes in 20 languages with human annotations across 23 languages, including annotations on their degree of offensiveness.

    3 CC-BY-4.0 1 0 0 Updated Sep 19, 2024
  • ToTTo Public

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

    435 37 6 0 Updated Sep 11, 2024
  • indic-gen-bench Public

    IndicGenBench is a high-quality, multilingual, multi-way parallel benchmark for evaluating Large Language Models (LLMs) on 4 user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families.

    41 6 0 0 Updated Sep 1, 2024
  • hiertext Public

    The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.

    Jupyter Notebook 260 CC-BY-SA-4.0 24 0 1 Updated Aug 30, 2024
  • cf_triviaqa Public

    The CF-TriviaQA dataset accompanies the paper "Hallucination Augmented Recitations for Language Models" (https://arxiv.org/abs/2311.07424). It is a counterfactual open-book QA dataset generated from the TriviaQA dataset using the Hallucination Augmented Recitations (HAR) approach, with the purpose of improving attribution in LLMs.

    2 Apache-2.0 1 1 0 Updated Aug 30, 2024
  • BamTwoogle Public

    The BamTwoogle dataset accompanies the paper "ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent" (https://arxiv.org/abs/2312.10003). It was written as a complementary, slightly more challenging sequel to the Bamboogle dataset. It addresses some of the shortcomings of Bamboogle that we discovered while performing human evals for the paper.

    3 CC-BY-4.0 1 0 0 Updated Aug 14, 2024
  • mittens Public

    Datasets for measuring misgendering in translation

    5 0 0 0 Updated Aug 13, 2024
  • visage Public

    Visage is a dataset of images with human annotations on whether or not certain attributes are present or depicted in each image. An attribute may be either stereotypical or non-stereotypical with respect to the identity group in the image. It also contains a list of attributes in English along with annotations about whether they are visual.

    7 Apache-2.0 2 0 0 Updated Aug 13, 2024
  • SPICE Public

    SPICE is a stereotype dataset in English containing stereotypes collected in India with community engagement. It spans identity groups and stereotypes unique to India, as well as other stereotypes about gender and nationalities.

    2 CC-BY-4.0 1 0 0 Updated Jul 26, 2024
