Selecting Diverse Instructions

This repository contains the official code for the paper: Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement.

Dataset

To download the datasets used in this project, run this script. We used the Alpaca, ShareGPT, and WizardLM datasets for training and evaluation.

After downloading, the datasets are stored in the data/processed directory of the project.

Coreset Selection

Hyperparameters and configurations are managed by Hydra, and the configuration files are stored in selection/config/. Run the code by executing main.py from the selection directory. Any hyperparameter can also be overridden with command-line arguments, as in the example below.

cd selection
python main.py data=[sharegpt|wizardlm] encoder=miniLM coreset=random

The selected indices are stored under selection/indices/.
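
For intuition, the coreset=random option corresponds to drawing a uniform random subset of example indices. The following is a minimal sketch under stated assumptions, not the repository's implementation: the function name select_random_indices, the fraction parameter, and the seed are all hypothetical and chosen only for illustration.

```python
import random


def select_random_indices(num_examples: int, fraction: float, seed: int = 0) -> list[int]:
    """Return a sorted, uniformly random subset of indices in [0, num_examples).

    Hypothetical helper: the names and parameters are illustrative assumptions.
    """
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one example even for very small fractions.
    budget = max(1, int(num_examples * fraction))
    # A local RNG keeps the selection reproducible without touching global state.
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_examples), budget))


if __name__ == "__main__":
    indices = select_random_indices(num_examples=1000, fraction=0.1, seed=42)
    print(len(indices), indices[:5])
```

Fixing the seed makes the selection repeatable, which helps when comparing a random baseline against other selection strategies on the same budget.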

Finetuning

# Llama-2-7b-hf (with accelerate and deepspeed)
bash scripts/finetune_llama_with_accelerate.sh [INDICES]
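
Conceptually, finetuning on a coreset means restricting the training set to the selected indices. The sketch below illustrates that step only. It assumes the indices are a plain list of integer positions into the dataset; the actual file format under selection/indices/ is not described here, and subset_by_indices is a hypothetical helper rather than a function from this repository.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def subset_by_indices(dataset: Sequence[T], indices: Sequence[int]) -> list[T]:
    """Return the examples at the given positions, in the order the indices appear.

    Hypothetical helper: assumes indices are integer positions into `dataset`.
    """
    n = len(dataset)
    for i in indices:
        # Reject out-of-range (and negative) positions instead of silently wrapping.
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dataset of size {n}")
    return [dataset[i] for i in indices]


if __name__ == "__main__":
    examples = [{"instruction": f"task {i}"} for i in range(5)]
    print(subset_by_indices(examples, [0, 3]))
```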

Iterative selection is implemented in the scripts/iter/ directory.

Evaluation

bash scripts/eval/{eval}.sh

Reference

This code is based on the following repository:

open-instruct

Citation

If you find this code useful, please cite our paper:

@misc{yu2024diversify,
    title={Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement},
    author={Simon Yu and Liangyu Chen and Sara Ahmadian and Marzieh Fadaee},
    year={2024},
    eprint={2409.11378},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
