Vidhai is a collaborative open science initiative to identify, collect, and create datasets for Tamil. Our objective is to enable and promote cutting-edge AI research in Tamil by building foundation models and tools for native speakers of Tamil globally. We aim to have a single repository of Tamil datasets that can be used for different tasks like language modeling, QA, summarization, etc. All the datasets identified as part of the initiative will be added to the AI Tamil Nadu HuggingFace organization.
This is a long-term initiative, but to keep it outcome-based, we will run it in phases, with each phase focusing on a different aspect of data curation. The primary language of contribution is Tamil, but we will also accept bilingual datasets aimed at specific tasks after a detailed review.
In Phase 1, our objective is to curate all the existing datasets (collect metadata) and find sources for creating new datasets in Tamil that can enable further research. The different tasks that you could contribute to are listed below:
- You can propose creating a new dataset or share information about an existing data subset (part of a multilingual or monolingual dataset). Upon submission of the proposal form, our team will review the dataset.
- Once the data proposal is accepted, we will share the format in which the data has to be scraped (if new data) or converted (if existing data).
- You will then follow the guidelines, reformat the data, and provide metadata through the metadata form; the data will then be added to the final `vidhai_dataset`.
If you do not know where to start, feel free to choose from the issues: you can work on the listed sources and create datasets out of them.
- Once the data is scraped or the text is obtained through OCR (a rough OCR sketch follows this list), it should be converted to the prescribed format and reviewed.
- Also, provide metadata through the metadata form, and the data will be added to the final `vidhai_dataset`.
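As a rough illustration of the OCR path, here is a minimal sketch. It assumes Tesseract is installed with its Tamil language pack (`tam`) and that `pytesseract` and `Pillow` are available; the file names are hypothetical, and the raw output still needs review and conversion to the prescribed format.

```python
# Minimal OCR sketch for a scanned Tamil page; illustrative only.
# Assumes Tesseract is installed with the Tamil language pack ("tam").
from PIL import Image
import pytesseract

# Hypothetical input file; use your own scanned page.
page = Image.open("scanned_page.png")

# lang="tam" selects Tesseract's Tamil model; the raw output
# usually needs manual review before reformatting.
text = pytesseract.image_to_string(page, lang="tam")

with open("page_text.txt", "w", encoding="utf-8") as f:
    f.write(text)
```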
Note: If a dataset from the Phase 1 tasks has quality, format, or content issues, it will be moved to Phase 2 to fix them. Once the identified issues are fixed, the data will be added to the final `vidhai_dataset`.
Once the proposal form is submitted, the dataset is approved, and the reformatting is done, follow the steps below to submit the dataset to Phase 1 of the initiative.
- **Setup:**
  - First, fork the repository on GitHub! 🍴
  - The forked repository will appear under your GitHub repositories as `PATH_TO_YOUR_FORK`.
  - Next, clone the forked repository and create a branch named after the dataset, i.e. `vidhai_<dataset_name>`:
    ```sh
    git clone $PATH_TO_YOUR_FORK
    cd Vidhai
    git checkout -b vidhai_<dataset_name>
    ```
- **Creating folder structure for datasets:**
  - Get into the `datasets` folder and create a sub-folder with the dataset name:
    ```sh
    cd datasets
    mkdir vidhai_<dataset_name>
    ```
  - Create two sub-folders under `vidhai_<dataset_name>`, namely `data_original` and `data_processed`:
    ```sh
    cd vidhai_<dataset_name>
    mkdir data_original
    mkdir data_processed
    ```
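  - By the end of this walkthrough, the dataset folder should look roughly like this (placing the processor script at the dataset root is our assumption; follow the reviewed guidelines if they say otherwise):
    ```text
    datasets/
    └── vidhai_<dataset_name>/
        ├── README.md
        ├── requirements.txt
        ├── vidhai_<dataset_name>_processor.py
        ├── data_original/
        └── data_processed/
    ```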
- **Preprocessing data:**
  - Place the original dataset that you have scraped or identified in the `data_original` folder.
  - Create a Python script named `vidhai_<dataset_name>_processor.py` and add the relevant code to perform any preprocessing of the original data (a sketch follows this step).
  - Once the preprocessing is completed and the data is reformatted to the agreed-upon format, place the processed data in `data_processed`.
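  As a starting point, here is a minimal sketch of such a processor. It assumes, purely for illustration, that the raw data is plain `.txt` files and that the agreed-upon output format is JSON Lines with a `text` field; the actual format is shared by the team after your proposal is approved.
  ```python
  # vidhai_<dataset_name>_processor.py -- illustrative sketch only.
  # Assumes raw .txt files in data_original/ and JSONL output with a
  # "text" field; replace both assumptions with the agreed-upon format.
  import json
  from pathlib import Path

  RAW_DIR = Path("data_original")
  OUT_DIR = Path("data_processed")

  def clean(text: str) -> str:
      # Collapse whitespace; add dataset-specific cleanup here.
      return " ".join(text.split())

  def main() -> None:
      OUT_DIR.mkdir(exist_ok=True)
      with open(OUT_DIR / "data.jsonl", "w", encoding="utf-8") as out:
          for raw_file in sorted(RAW_DIR.glob("*.txt")):
              text = clean(raw_file.read_text(encoding="utf-8"))
              if text:
                  record = {"text": text, "source_file": raw_file.name}
                  out.write(json.dumps(record, ensure_ascii=False) + "\n")

  if __name__ == "__main__":
      main()
  ```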
- **Adding data card and requirements.txt:**
  - Create a detailed data card describing the dataset and add it as `vidhai_<dataset_name>/README.md`. Refer to the sample data card and fill in the relevant information.
  - If you use external libraries for preprocessing, add them with their version numbers to `vidhai_<dataset_name>/requirements.txt`, as sketched below.
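  A hypothetical `requirements.txt` is shown below; the library names and versions are placeholders, so pin whatever your processor actually imports:
  ```text
  beautifulsoup4==4.12.3
  pandas==2.2.2
  ```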
- **Create a PR:**
  - First, commit and push your changes:
    ```sh
    git add vidhai_<dataset_name>/README.md
    git add vidhai_<dataset_name>/requirements.txt
    git add vidhai_<dataset_name>/vidhai_<dataset_name>_processor.py
    git commit -m "Added new preprocessed dataset - vidhai_<dataset_name>"
    git push --set-upstream origin vidhai_<dataset_name>
    ```
  - Finally, submit a pull request. The last `git push` command prints a URL that you can open in a browser to initiate the pull request. Alternatively, you can submit a PR from the GitHub website.
✨ Congratulations, you've submitted a dataset to Vidhai! ✨
In Phase 2, our objective is to fix the quality, format, and content issues in the approved datasets. We will need annotators, translators, NLP experts, and linguists to work on these issues.
The datasets with quality issues from Phase 1 will be moved to Phase 2 for quality fixes: for example, removing words in other languages, unwanted links, or references from a scraped or OCRed Tamil dataset, or even from existing multilingual subsets. A sketch of one such fix follows this paragraph.
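As a rough illustration of this kind of cleanup (a sketch under our own assumptions, not the portal's actual workflow), the snippet below strips URLs and drops lines whose letters are mostly non-Tamil; the 0.5 threshold is illustrative, not an official guideline.

```python
# Sketch of a Phase 2 quality fix: strip URLs and drop lines that are
# mostly non-Tamil. The threshold is illustrative, not official.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAMIL_CHAR_RE = re.compile(r"[\u0B80-\u0BFF]")  # Tamil Unicode block

def clean_line(line, min_tamil_ratio=0.5):
    """Return the cleaned line, or None if it should be dropped."""
    line = URL_RE.sub("", line).strip()
    letters = [ch for ch in line if ch.isalpha()]
    if not letters:
        return None
    tamil_count = sum(1 for ch in letters if TAMIL_CHAR_RE.match(ch))
    # Keep the line only if most of its letters are Tamil.
    return line if tamil_count / len(letters) >= min_tamil_ratio else None

# Example: the English-only line is dropped, the Tamil line is kept.
print(clean_line("see https://example.com for details"))    # -> None
print(clean_line("தமிழ் ஒரு செம்மொழி https://example.com"))  # -> "தமிழ் ஒரு செம்மொழி"
```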
- If you're interested in this task, you can go to the portal and sign up.
- Instructions on how to fix the quality issues will be provided on the portal.
- Follow the instructions and fix issues in as many data points as you can.
- Everyone can join and contribute to this initiative between 1st March 2024 and 31st December 2024. Phase 1 will start first; Phase 2 will begin once we start identifying quality issues in the submitted data.
- If you contribute quality work according to our guidelines, you stand a chance to win some swag and become a co-author of our upcoming paper.
We greatly appreciate your help! We recognize that some datasets require more effort than others, so please reach out to us on Discord if you have any questions. Our goal is to be inclusive with credits and value everyone's time!
For more information, please reach us at coimbatoreai@gmail.com or on Discord.