- Models and Dataset at HuggingFace
- Paper: arXiv
- Try Maya Model: Demo
Maya is a multimodal LLM supporting 8 languages: English, Chinese, French, Spanish, Russian, Japanese, Arabic, and Hindi.
The following steps were tested with CUDA 12.4.
- Clone this repository and navigate to the maya directory:
```bash
git clone https://github.com/nahidalam/maya
cd maya
```
- Install Package
```bash
conda create -n maya python=3.10 -y
conda activate maya
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages for training:
```bash
pip install -e ".[train]"
pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
```
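A quick sanity check of the environment can catch CUDA or flash-attn problems before a training launch; the snippet below is a minimal sketch, assuming PyTorch was pulled in as a package dependency.

```bash
# Confirm PyTorch sees the GPU and that flash-attn imports cleanly.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"
```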
To pretrain the projection layer:
- Get the pretraining dataset from HuggingFace and keep it in /dev/data/LLaVA_Pretrain
- Get the images and keep them in /dev/data/images:
```bash
wget https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/images.zip
```
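One way to fetch both pieces is sketched below; it assumes the huggingface_hub CLI is installed (pip install -U "huggingface_hub[cli]") and uses the target paths above.

```bash
# Download the pretraining dataset repo, then unpack the images.
huggingface-cli download liuhaotian/LLaVA-Pretrain \
    --repo-type dataset --local-dir /dev/data/LLaVA_Pretrain

mkdir -p /dev/data/images
unzip /dev/data/LLaVA_Pretrain/images.zip -d /dev/data/images
```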
Then run
```bash
bash scripts/maya/pretrain_aya_siglip.sh
```
For instruction finetuning, please download the annotations from MBZUAI/palo_multilingual_dataset and all of the images using the links below.
- COCO: train2017
- GQA: images
- OCR-VQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
After downloading all of them, organize the data as follows in /dev/data/instruction_tune_dataset/:
```
instruction_tune_dataset
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
Put the palo_multilingual_dataset.json in /dev/data/annotations/palo_multilingual_dataset.json.
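A minimal setup sketch follows, assuming the downloaded archives sit in the current directory; the archive file names are placeholders and may differ from what the links above actually serve.

```bash
# Placeholder archive names; rename to match your downloads.
DATA=/dev/data/instruction_tune_dataset
mkdir -p "$DATA" /dev/data/annotations

unzip train2017.zip        -d "$DATA"/coco      # COCO    -> coco/train2017
unzip gqa_images.zip       -d "$DATA"/gqa       # GQA     -> gqa/images
unzip train_val_images.zip -d "$DATA"/textvqa   # TextVQA -> textvqa/train_images
unzip vg_part1.zip         -d "$DATA"/vg        # VG      -> vg/VG_100K
unzip vg_part2.zip         -d "$DATA"/vg        # VG      -> vg/VG_100K_2
mkdir -p "$DATA"/ocr_vqa/images                 # filled by the OCR-VQA download script

cp palo_multilingual_dataset.json /dev/data/annotations/
```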
Make sure the model you pretrained above is kept at a path that you specify in the scripts/maya/finetune_aya_siglip.sh script through the --pretrain_mm_mlp_adapter flag.
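To double-check where the flag is set and that the checkpoint it points to exists, something like the following helps; the projector path is a placeholder for your own pretraining output.

```bash
# Locate the flag inside the finetuning script.
grep -n "pretrain_mm_mlp_adapter" scripts/maya/finetune_aya_siglip.sh

# Placeholder path: substitute the value you actually pass to the flag.
PROJECTOR=./checkpoints/maya_pretrain/mm_projector.bin
test -f "$PROJECTOR" || echo "missing projector checkpoint: $PROJECTOR"
```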
Then run
```bash
bash scripts/maya/finetune_aya_siglip.sh
```
For multilingual evaluation using the PALO multilingual test dataset:
- Download the PALO evaluation dataset. Create the directory LLaVA/playground/data/eval if it doesn't exist, and clone the benchmark into it:
```bash
git clone https://huggingface.co/datasets/MBZUAI/multilingual-llava-bench-in-the-wild
```
- Specifically, the test images can be found here.
- Run the evaluation script:
```bash
bash scripts/v1_5/eval/eval_all_languages.sh \
    "model_base" \
    "model_path" \
    "model_name" \
    "your-openai-api-key"
```
If you find Maya useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112},
}
```
Contributors, in no particular order:
- Team Leads: Nahid Alam, Karthik Reddy, Surya Guthikonda
- Timothy Chung
- Abhipsha Das
- Bala Krishna S Vegesna
- Iftekhar Uddin
- Drishti Sharma
- Roshan Santhosh
- Shayekh Bin Islam
- Isha Chaturvedi
- Chen Liu
- Snegha A
- Anthony Susevski
- Ashvanth.S
- Genta Indra Winata
- Ryan Chan
- Sangyeon Kim
- Snehanshu
- This codebase is based on LLaVA. Thank you for the easily understandable codebase.
- This project would not be possible without the support of Cohere and their Aya-35B API grant. We are thankful to Sara Hooker, Madeline, Shivalika, Shristhi and the entire Cohere for AI team for their support.
- We thank Pytho for their generous GPU grant.