diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..95cf77a
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,3 @@
+.idea/
+__pycache__/
+checkpoints/
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 0000000..6abf4f1
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "viper/GLIP"]
+	path = viper/GLIP
+	url = https://github.com/sachit-menon/GLIP.git
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1ece941
--- /dev/null
+++ b/README.md
@@ -0,0 +1,138 @@
+# VDebugger
+
+This repo is for **VDebugger: Harnessing Execution Feedback for Debugging Visual Programs**
+
+[Paper](), [Website](https://shirley-wu.github.io/vdebugger/index.html)
+
+The training data and models are uploaded to Hugging Face: https://huggingface.co/VDebugger
+
+## Outlines
+
+- [Environment Setup](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#environment-setup)
+- [Dataset Setup](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#dataset-setup)
+- [Generation and Execution of Visual Programs](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#generation-and-execution-of-visual-programs)
+- [Inference of VDebugger](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#inference-of-vdebugger)
+
+## Environment Setup
+
+This code is partially adapted from [ViperGPT](https://github.com/cvlab-columbia/viper). We sincerely thank the authors for their great work!
+
+To set up the environment, you should:
+1. Clone this repository recursively:
+```bash
+git clone --recurse-submodules https://github.com/shirley-wu/vdebugger.git
+```
+2. Install PyTorch based on your own environment. We installed `torch==2.1.2` with CUDA 12.1.
+3. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+4. Set up the ViperGPT environment:
+```bash
+cd viper
+bash download_models.sh
+export PATH=/usr/local/cuda/bin:$PATH
+cd GLIP
+python setup.py clean --all build develop --user
+```
+5. If you need to use OpenAI APIs, write your API key into `viper/api.key`.
+
+## Dataset Setup
+
+Please follow the guidelines below to download each dataset:
+1. GQA: https://cs.stanford.edu/people/dorarad/gqa/download.html. The file structure should look as follows:
+```
+gqa/
+├── questions
+│   ├── readme.txt
+│   ├── {val, test, testdev, challenge}_{all, balanced}_questions.json
+│   ├── submission_all_questions.json
+│   ├── train_balanced_questions.json
+│   ├── train_all_questions/
+└── images
+    └── *.jpg
+```
+2. TallyQA: https://github.com/manoja328/TallyQA_dataset. The file structure should look as follows:
+```
+tallyqa/
+├── {test, train}.json
+└── {train2014, val2014, VG_100K, VG_100K_2}/
+    └── *.jpg
+```
+3. NLVRv2: https://github.com/lil-lab/nlvr/tree/master/nlvr2. The file structure should look as follows:
+```
+nlvr2/
+├── balanced_{dev, test1, test2, train}.jsonl
+└── {dev, test1, test2, train}/
+    └── *.png
+```
+4. RefCOCO*: https://github.com/lichengunc/refer. The file structure should look as follows:
+```
+refer/
+├── refcoco
+│   ├── instances.json
+│   ├── refs(google).p
+│   └── refs(unc).p
+├── refcoco+
+│   ├── instances.json
+│   └── refs(unc).p
+├── refcocog
+│   ├── instances.json
+│   ├── refs(google).p
+│   └── refs(umd).p
+└── {train2014, train2017, val2014, val2017}/
+    └── *.jpg
+```
+5. COVR: https://covr-dataset.github.io/. The file structure should look as follows:
+```
+covr/
+├── {train, val, test}.jsonl
+├── gqa_images
+│   └── *.jpg
+└── imSitu_images
+    └── {adjusting, ...}/
+        └── *.jpg
+```
+6. 
RSVG: https://github.com/ZhanYang-nwpu/RSVG-pytorch. The file structure should look as follows:
+```
+rsvg/
+├── {train, val, test}.txt
+├── Annotations/
+│   └── *.xml
+└── JPEGImages/
+    └── *.jpg
+```
+
+## Generation and Execution of Visual Programs
+
+Go to `viper/` for this step. We recommend first generating and then executing the visual programs in two separate steps. Take the GQA dataset as an example:
+1. Generate programs:
+```bash
+CONFIG_NAMES=generate/gqa python main_batch_generate.py
+```
+This script will load the configuration under `config/generate/gqa.yaml`. Please remember to change YOUR_DATA_DIR to your data directory. The generated code will be saved in a CSV file under the `code` field.
+2. Execute and evaluate programs:
+```bash
+CONFIG_NAMES=execute/gqa python main_batch_execute.py
+```
+This script will load the configuration under `config/execute/gqa.yaml`. Please also remember to update YOUR_DATA_DIR, and set the `cached_codex_path:` field to the CSV produced in step 1. The accuracy / IoU will be computed.
+3. If you want to obtain execution feedback:
+```bash
+CONFIG_NAMES=execute/gqa python main_batch_trace.py A_RANDOM_STAMP
+```
+You can use the same configuration as in step 2. If you want to run multiple `main_batch_trace.py` processes at the same time, please use a different `A_RANDOM_STAMP` for each process. The execution feedback will be saved in a CSV file under the `traced` field.
+
+## Inference of VDebugger
+
+For inference with VDebugger, you first need to generate and execute visual programs and obtain a CSV file containing the `traced` field. Take the GQA dataset and VDebugger/VDebugger-{critic, refiner}-generalist-13B as an example:
+```bash
+# Step 1: infer critic
+python infer_critic.py VDebugger/VDebugger-critic-generalist-13B --input YOUR_CSV_CONTAINING_TRACED_FIELD --dataset gqa # output file will be written to critic-infer.csv
+# Step 2: infer refiner
+python infer_refine.py critic-infer.csv VDebugger/VDebugger-refiner-generalist-13B # output file will be written to critic-refine-infer.csv
+```
+Then you can execute the programs in `critic-refine-infer.csv` as in step 2 of [Generation and Execution of Visual Programs](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#generation-and-execution-of-visual-programs).
+
+## Training of VDebugger
+
+If you want to reproduce our training of VDebugger, please use `vdebugger/training_scripts/train_{critic, refiner}.sh`. You will need to install `deepspeed==0.14.0`.
diff --git a/docs/index.html b/docs/index.html
new file mode 100644
index 0000000..621b792
--- /dev/null
+++ b/docs/index.html
@@ -0,0 +1,242 @@
+
+
+Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems.
+However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning.
+To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique.
+Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task.
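For intuition, the mask-best decoding mentioned above can be pictured as greedy decoding where, at a chosen position, the most likely ("best") token is masked out so the model is forced onto a plausible but incorrect continuation. The sketch below is only an illustration, not the released data pipeline: the model name, the single flip position, and pure greedy decoding elsewhere are all simplifying assumptions.

```python
# Minimal sketch of mask-best error injection (assumptions: any HF causal code LLM,
# one flipped position, greedy decoding everywhere else).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "codellama/CodeLlama-7b-Python-hf"  # assumption: not necessarily the model used by the authors
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def mask_best_decode(prompt: str, flip_position: int = 10, max_new_tokens: int = 128) -> str:
    """Greedy decoding, except that at `flip_position` the top-1 token is masked out,
    injecting a plausible error into an otherwise correct program."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    for step in range(max_new_tokens):
        logits = model(input_ids=ids).logits[0, -1]  # next-token distribution
        if step == flip_position:
            logits[logits.argmax()] = float("-inf")  # mask the best token at this position
        next_id = logits.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```

Pairs of (injected-error program, execution feedback) built this way can then supervise the critic to localize errors and the refiner to correct them.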
+ +Main results.
The two baselines, SelfDebug and LDB, slightly hurt performance, while our VDebugger consistently improves performance on every dataset, by up to 3.2% accuracy.
+
Ablation study.
The critic consistently achieves high accuracy, but the refiner's success rate is less reliable.
+ Execution feedback consistently benefits critic accuracy and the final performance, but its benefits to refiner performance are minimal.
+ This shows that the remaining challenges mainly lie in correcting the program after the errors are identified.
Generalization to unseen LLMs: VDebugger can debug visual programs generated by larger LLMs, including CodeLlama-70b, DeepSeek-Coder-33B and GPT-3.5.
+Generalization to unseen tasks: when trained on all six datasets, the generalist VDebugger can generalize to two unseen tasks:
+ (1) RSVG, visual grounding for remote sensing images, and (2) COVR, an unseen task format requiring question answering based on a variable number of images.
+Sources of errors.
Program errors significantly affect the end performance. VDebugger consistently reduces program errors on all datasets, and can also help recover from foundation VLM errors, especially on RefCOCOg.
Example where VDebugger fixes a program error.
+Example where VDebugger recovers from a foundation model error.
The question answering model yields the incorrect answer "vanity" in the original program. By detecting this error, VDebugger invokes the foundation VLMs in an alternative way and thus obtains the correct answer.
+ TODO
+
+ Question
+${question}
+ `; + else + html = ` +Question
+${question} (unit: ${unit})
+ `; + return html; +} + +function make_img(path) { + if (path === null) return ""; + let html = ``; + return html; +} + +function make_box(contents, cls = "") { + if (contents.join("").length === 0) return ""; + let html = ` +Choices
Choices
${choice}
`; + return html; +} + +function make_answer(answer) { + let html = `Answer
${answer}
`; + return html; +} \ No newline at end of file diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..8c9c3a5 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,39 @@ +accelerate==0.21.0 +backoff==2.2.1 +bitsandbytes==0.38.1 +git+https://github.com/openai/CLIP.git +decord==0.6.0 +dill==0.3.6 +einops==0.6.0 +ftfy==6.1.1 +h5py==3.8.0 +inflect==7.2.0 +ipython==8.11.0 +ipykernel==6.22.0 +jupyter==1.0.0 +joblib==1.2.0 +kornia==0.6.9 +matplotlib==3.6.2 +nltk==3.8.1 +num2words==0.5.12 +numpy==1.23.5 +omegaconf==2.3.0 +openai==0.28.0 +pandas==1.5.2 +Pillow==9.4.0 +prettytable==3.6.0 +pycocotools==2.0.6 +PyYAML==6.0 +qd==0.8.9 +regex==2022.10.31 +requests==2.28.1 +rich==13.3.2 +scipy==1.9.3 +tensorboardX==2.6 +tensorflow==2.11.1 +timm==0.6.12 +tqdm==4.64.1 +transformers==4.39.3 +wandb==0.13.9 +word2number==1.1 +yacs==0.1.8 \ No newline at end of file diff --git a/vdebugger/finetune.py b/vdebugger/finetune.py new file mode 100644 index 0000000..268fbab --- /dev/null +++ b/vdebugger/finetune.py @@ -0,0 +1,734 @@ +#!/usr/bin/env python +# coding=utf-8 + +import argparse +import logging +import math +import os +import random +from datetime import timedelta +from functools import partial + +import datasets +import deepspeed +import torch +import transformers +from accelerate import Accelerator +from accelerate.logging import get_logger +from accelerate.utils import set_seed, InitProcessGroupKwargs +from datasets import load_dataset +from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training +from torch.utils.data import DataLoader +from tqdm.auto import tqdm +from transformers import ( + AutoConfig, + AutoModelForCausalLM, + AutoTokenizer, + LlamaTokenizer, + LlamaTokenizerFast, + CodeLlamaTokenizer, + CodeLlamaTokenizerFast, + SchedulerType, + DataCollatorForSeq2Seq, + get_scheduler, + GPTNeoXTokenizerFast, + GPT2Tokenizer, + OPTForCausalLM, + BitsAndBytesConfig, +) + +logger = get_logger(__name__) + + +# try: +# from hf_olmo import OLMoTokenizerFast +# except ImportError: +# logger.warning("OLMo not installed. Ignore if using a different model.") + +def parse_args(): + parser = argparse.ArgumentParser(description="Finetune a transformers model on a causal language modeling task") + parser.add_argument( + "--dataset_name", + type=str, + default=None, + help="The name of the dataset to use (via the datasets library).", + ) + parser.add_argument( + "--dataset_config_name", + type=str, + default=None, + help="The configuration name of the dataset to use (via the datasets library).", + ) + parser.add_argument( + "--train_file", type=str, default=None, help="A csv or a json file containing the training data." 
+ ) + parser.add_argument( + "--model_name_or_path", + type=str, + help="Path to pretrained model or model identifier from huggingface.co/models.", + required=False, + ) + parser.add_argument( + "--config_name", + type=str, + default=None, + help="Pretrained config name or path if not the same as model_name", + ) + parser.add_argument( + "--use_lora", + action="store_true", + help="If passed, will use LORA (low-rank parameter-efficient training) to train the model.", + ) + parser.add_argument( + "--lora_rank", + type=int, + default=64, + help="The rank of lora.", + ) + parser.add_argument( + "--lora_alpha", + type=float, + default=16, + help="The alpha parameter of lora.", + ) + parser.add_argument( + "--lora_dropout", + type=float, + default=0.1, + help="The dropout rate of lora modules.", + ) + parser.add_argument( + "--use_flash_attn", + action="store_true", + help="If passed, will use flash attention to train the model.", + ) + parser.add_argument( + "--tokenizer_name", + type=str, + default=None, + help="Pretrained tokenizer name or path if not the same as model_name", + ) + parser.add_argument( + "--use_slow_tokenizer", + action="store_true", + help="If passed, will use a slow tokenizer (not backed by the 🤗 Tokenizers library).", + ) + parser.add_argument( + "--max_seq_length", + type=int, + default=512, + help="The maximum total sequence length (prompt+completion) of each training example.", + ) + parser.add_argument( + "--per_device_train_batch_size", + type=int, + default=8, + help="Batch size (per device) for the training dataloader.", + ) + parser.add_argument( + "--learning_rate", + type=float, + default=5e-5, + help="Initial learning rate (after the potential warmup period) to use.", + ) + parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay to use.") + parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.") + parser.add_argument( + "--max_train_steps", + type=int, + default=None, + help="Total number of training steps to perform. If provided, overrides num_train_epochs.", + ) + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.", + ) + parser.add_argument( + "--lr_scheduler_type", + type=SchedulerType, + default="linear", + help="The scheduler type to use.", + choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], + ) + parser.add_argument( + "--warmup_ratio", type=float, default=0, help="Ratio of total training steps used for warmup." 
+ ) + parser.add_argument("--output_dir", type=str, default=None, help="Where to store the final model.") + parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.") + parser.add_argument( + "--preprocessing_num_workers", + type=int, + default=None, + help="The number of processes to use for the preprocessing.", + ) + parser.add_argument( + "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets" + ) + parser.add_argument( + "--checkpointing_steps", + type=str, + default=None, + help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.", + ) + parser.add_argument( + "--logging_steps", + type=int, + default=None, + help="Log the training loss and learning rate every logging_steps steps.", + ) + parser.add_argument( + "--resume_from_checkpoint", + type=str, + default=None, + help="If the training should continue from a checkpoint folder.", + ) + parser.add_argument( + "--with_tracking", + action="store_true", + help="Whether to enable experiment trackers for logging.", + ) + parser.add_argument( + "--report_to", + type=str, + default="all", + help=( + 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + "Only applicable when `--with_tracking` is passed." + ), + ) + parser.add_argument( + "--low_cpu_mem_usage", + action="store_true", + help=( + "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded." + "If passed, LLM loading time and RAM consumption will be benefited." + ), + ) + parser.add_argument( + "--gradient_checkpointing", + action="store_true", + help=( + "Turn on gradient checkpointing. Saves memory but slows training." + ), + ) + parser.add_argument( + "--use_qlora", + action="store_true", + help=( + "Use qLoRA training - main thing is initialising model in quantised form. Not compatible with deepspeed." + ), + ) + parser.add_argument( + '--clip_grad_norm', + type=float, + default=-1, + help='Clip gradient norm. Not compatible with deepspeed (use deepspeed config instead).', + ) + parser.add_argument( + '--use_8bit_optimizer', + action='store_true', + help='Use 8bit optimizer from bitsandbytes. Not compatible with deepspeed (use deepspeed config instead).', + ) + parser.add_argument( + '--timeout', + type=int, + default=1800, + help='Timeout for the training process. Useful if tokenization process is long. Default is 1800 seconds (30 minutes).', + ) + parser.add_argument( + '--trust_remote_code', + action='store_true', + help='Trust remote code when loading pretrained models and tokenizers. Use only when you trust the remote code.', + ) + parser.add_argument( + '--reduce_loss', + default='mean', + choices=['mean', 'sum'], + help='How to reduce loss over tokens. Default is mean, but using sum can improve chat model performance.', + ) + args = parser.parse_args() + + # Sanity checks + if args.dataset_name is None and args.train_file is None: + raise ValueError("Need either a dataset name or a training file.") + else: + if args.train_file is not None: + extension = args.train_file.split(".")[-1] + assert extension in ["json", "jsonl"], "`train_file` should be a json/jsonl file." 
+ return args + + +def encode_with_prompt_completion_format(example, tokenizer, max_seq_length): + ''' + Here we assume each example has 'prompt' and 'completion' fields. + We concatenate prompt and completion and tokenize them together because otherwise prompt will be padded/trancated + and it doesn't make sense to follow directly with the completion. + ''' + tokenized_prompt = [tokenizer.bos_token_id, ] + tokenizer(example['prompt'], add_special_tokens=False).input_ids + tokenized_inst = [] if example['inst'] is None else tokenizer(example['inst'], add_special_tokens=False).input_ids + tokenized_completion = tokenizer(example['completion'], add_special_tokens=False).input_ids + if example.get('eos', True) is not False: + tokenized_completion += [tokenizer.eos_token_id, ] + # assert tokenized_prompt + tokenized_inst + tokenized_completion == \ # fine... + # tokenizer(example['prompt'] + example['inst'] + example['completion']).input_ids + + assert len(tokenized_inst + tokenized_completion) < max_seq_length + tokenized_prompt = tokenized_prompt[: max_seq_length - 256 - len(tokenized_inst)] + # assumes max output length is 256 + input_ids = tokenized_prompt + tokenized_inst + tokenized_completion + labels = [-100, ] * len(tokenized_prompt) + tokenized_inst + tokenized_completion + + input_ids = torch.LongTensor(input_ids) + labels = torch.LongTensor(labels) + attention_mask = torch.ones_like(input_ids) + return { + 'input_ids': input_ids, + 'labels': labels, + 'attention_mask': attention_mask, + } + + +def save_with_accelerate(accelerator, model, tokenizer, output_dir, args): + unwrapped_model = accelerator.unwrap_model(model) + # When doing multi-gpu training, we need to use accelerator.get_state_dict(model) to get the state_dict. + # Otherwise, sometimes the model will be saved with only part of the parameters. + # Also, accelerator needs to use the wrapped model to get the state_dict. + state_dict = accelerator.get_state_dict(model) + if args.use_lora: + # When using lora, the unwrapped model is a PeftModel, which doesn't support the is_main_process + # and has its own save_pretrained function for only saving lora modules. + # We have to manually specify the is_main_process outside the save_pretrained function. + if accelerator.is_main_process: + unwrapped_model.save_pretrained(output_dir, state_dict=state_dict) + else: + # don't use safetensors for saving for now + unwrapped_model.save_pretrained( + output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save, + state_dict=state_dict, + safe_serialization=False + ) + + +def main(): + args = parse_args() + + # Initialize the accelerator. We will let the accelerator handle device placement for us in this example. + # If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers + # in the environment + accelerator_log_kwargs = {} + + if args.with_tracking: + accelerator_log_kwargs["log_with"] = args.report_to + accelerator_log_kwargs["project_dir"] = args.output_dir + + # if you get timeouts (e.g. due to long tokenization) increase this. + timeout_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=args.timeout)) + + accelerator = Accelerator( + gradient_accumulation_steps=args.gradient_accumulation_steps, + **accelerator_log_kwargs, + kwargs_handlers=[timeout_kwargs] + ) + # Make one log on every process with the configuration for debugging. 
+ logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO, + ) + logger.info(accelerator.state, main_process_only=False) + if accelerator.is_local_main_process: + datasets.utils.logging.set_verbosity_warning() + transformers.utils.logging.set_verbosity_info() + else: + datasets.utils.logging.set_verbosity_error() + transformers.utils.logging.set_verbosity_error() + + # If passed along, set the training seed now. + if args.seed is not None: + set_seed(args.seed) + + if accelerator.is_main_process: + if args.output_dir is not None: + os.makedirs(args.output_dir, exist_ok=True) + + accelerator.wait_for_everyone() + + if args.dataset_name is not None: + # Downloading and loading a dataset from the hub. + raw_datasets = load_dataset( + args.dataset_name, + args.dataset_config_name, + ) + else: + data_files = {} + dataset_args = {} + if args.train_file is not None: + data_files["train"] = args.train_file + raw_datasets = load_dataset( + "json", + data_files=data_files, + **dataset_args, + ) + + # Load pretrained model and tokenizer + if args.config_name: + config = AutoConfig.from_pretrained(args.config_name, trust_remote_code=args.trust_remote_code) + elif args.model_name_or_path: + config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code) + else: + raise ValueError( + "You are instantiating a new config instance from scratch. This is not supported by this script." + ) + + if args.tokenizer_name: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, trust_remote_code=args.trust_remote_code, + use_fast=not args.use_slow_tokenizer) + elif args.model_name_or_path: + tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code, + use_fast=not args.use_slow_tokenizer) + else: + raise ValueError( + "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You can do it from another script, save it, and load it from here, using --tokenizer_name." + ) + + if args.model_name_or_path: + if args.use_qlora: + bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_compute_dtype=torch.bfloat16, + ) + device_index = accelerator.local_process_index + device_map = {"": device_index} # force data-parallel training. + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + from_tf=bool(".ckpt" in args.model_name_or_path), + config=config, + load_in_4bit=True, + quantization_config=bnb_config, + device_map=device_map, + trust_remote_code=args.trust_remote_code, + torch_dtype=torch.bfloat16, + use_flash_attention_2=True if args.use_flash_attn else False, + ) + else: + model = AutoModelForCausalLM.from_pretrained( + args.model_name_or_path, + from_tf=bool(".ckpt" in args.model_name_or_path), + config=config, + trust_remote_code=args.trust_remote_code, + low_cpu_mem_usage=args.low_cpu_mem_usage, + use_flash_attention_2=True if args.use_flash_attn else False, + torch_dtype=torch.bfloat16, + ) + else: + logger.info("Training new model from scratch") + model = AutoModelForCausalLM.from_config(config) + + # no default pad token for llama! 
+ # here we add all special tokens again, because the default ones are not in the special_tokens_map + if isinstance(tokenizer, LlamaTokenizer) or isinstance(tokenizer, LlamaTokenizerFast) or \ + isinstance(tokenizer, CodeLlamaTokenizer) or isinstance(tokenizer, CodeLlamaTokenizerFast): + num_added_tokens = tokenizer.add_special_tokens({ + "bos_token": "