Releases: huggingface/optimum-habana
v1.14.1: Patch release
- Enable DeepSpeed for image-to-text example #1455 @schoi-habana
- Fix bug when loading 4bit checkpoint quantized in INC #1447 @xin3he
- Fixes 'Tokenizer does not have padding token' introduced by #1444 for Llama3.1 #1457 @MohitIntel
Full Changelog: v1.14.0...v1.14.1
v1.14.0: Transformers v4.45, SynapseAI v1.18, Qwen2-MoE, text-to-video generation
Transformers v4.45
SynapseAI v1.18
Qwen2-MoE
Text-to-video generation
- Enabling Text to Video Diffusion Model Generation #1109 @pi314ever
- Porting Stable Video Diffusion ControlNet to HPU #1037 @wenbinc-Bin
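A minimal sketch of the new text-to-video path (#1109). `GaudiTextToVideoSDPipeline` and its kwargs below follow the usual optimum-habana Diffusers conventions (`use_habana`, `use_hpu_graphs`, `gaudi_config`), but treat the class name and defaults as assumptions rather than a verified API:

```python
# Hedged sketch of text-to-video generation on Gaudi (per #1109).
# Assumption: GaudiTextToVideoSDPipeline mirrors the other Gaudi Diffusers pipelines.
import torch
from optimum.habana.diffusers import GaudiTextToVideoSDPipeline

pipe = GaudiTextToVideoSDPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",  # illustrative model choice
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)
frames = pipe(prompt="An astronaut riding a horse", num_inference_steps=25).frames
```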
Depth-to-image generation
- Depth to Image Generation #1175 @pi314ever
Model optimizations
- Enable FusedSDPA for Mpt #1101 @Jianhong-Zhang
- Mixtral fp8 #1269 @imangohari1
- Prevent Graph break in Llama when using flash attention #1301 @pramodkumar-habanalabs
- Boost SDXL speed with initialized schedule step reset #1284 @dsocek
- Improve MPT fp8 #1256 @atakaha
- Add Whisper static generation #1275 @Spycsh
- Gemma: enabled HPU Graphs and Flash Attention #1173 @dsmertin
- Recommend jemalloc for gpt-neox-20b 8x #1350 @hsubramony
- Optimized inference of GPT-Neo model on HPU #1319 @XinyuYe-Intel
- Fix graph breaks for BART in torch.compile mode. #1379 @astachowiczhabana
- Gpt_bigcode: added internal_bucketing support #1218 @mgonchar
- refine bucket_internal for mpt #1194 @Jing1Ling
- Qwen finetuning bucketing #1130 @ssarkar2
- Enable FusedSDPA fp8 in Llama FT #1388 @pbielak
- Added gemma specific fp8 quantization file #1445 @yeonsily
Intel Neural Compressor
- Enable INC for Llava models and switch softmax to torch.nn.functional.softmax, which INC supports #1325 @tthakkal
- Load INC GPTQ checkpoint & rename params #1364 @HolyFalafel
- Fix INC load-weights compile error due to Transformers 4.45 upgrade #1421 @jiminha
Vera/LN-tuning
Other
- Add callable workflow to post comments when code quality check failed #1263 @regisss
- Fix failed code quality check comment workflow #1264 @regisss
- Accelerate Diffusers CI #1265 @regisss
- Add profiler to SD3 #1267 @atakaha
- Fix profiling step with device finish execution for text-generation #1283 @libinta
- Update FusedSDPA calling method as Gaudi documentation #1285 @yeonsily
- Switch failed code quality check comment to workflow_run #1297 @regisss
- Potential fix for the failed code quality check comment workflow #1299 @regisss
- Fix text-generation example lm_eval evaluation #1308 @changwangss
- Add section to README about Transformers development branch #1307 @regisss
- Fix eager mode in run_generation by removing graph logs #1231 @Vasud-ha
- Fix bug when running google/paligemma-3b-mix-224 #1279 @kaixuanliu
- Use native checkpointing under compile mode #1313 @xinyu-intel
- fixed fused_qkv object AttributeError due to 'LlamaConfig' #1203 @rkumar2patel
- Image to Image Generation Enabling #1196 @pi314ever
- Diffusers timing #1277 @imangohari1
- Fix eos issue in finetune/generation #1253 @sywangyi
- Update CI, tests and examples #1315 @regisss
- Fix Sentence Transformer HPU graphs for training with PEFT model #1320 @nngokhale
- Fix ZeroDivisionError in constrained beam search with static shapes #1317 @skavulya
- Update esmfold model not to use param_buffer_assignment #1324 @jiminha
- Falcon inference crash fix for falcon-40b model #1161 @yeonsily
- Add --use_kv_cache to image-to-text pipeline #1292 @KimBioInfoStudio
- Trl upgrade #1245 @sywangyi
- Fix uint4 url typo. #1340 @kding1
- Use eager attention for wav2vec2 #1333 @skaulintel
- Add _reorder_cache back to Llama for HPU #1233 @jiminha
- SDXL CI script throughput #1296 @imangohari1
- Add image so that transformers tests can run #1338 @skaulintel
- Fixes the no attribute error with the falcon multicard test #1344 @mounikamandava
- Add profiler to sdxl mlperf pipeline #1339 @Jianhong-Zhang
- Fix decoder only generation #948 @tjs-intel
- Upgrade gradient checkpointing #1347 @yafshar
- Run_generation example: fixed graph compilation statistics reporting #1352 @mgonchar
- Fix DeepSpeed crash with Sentence Transformer Trainer #1328 @nngokhale
- fea(ci): reduced slow test_diffusers timing. minor fixes #1330 @imangohari1
- Flash attn args for GaudiGemmaForCausalLM #1356 @kkoryun
- Transformer models generation supports user-provided input embeddings #1276 @zongwave
- Fixed the expected values for the img2img slice #1332 @imangohari1
- Gpt_big_code: make flash attention impl quantization friendly #1282 @mgonchar
- Fix OOM when inference with llama-3.1-70b #1302 @harborn
- Fix the conditional #1362 @yafshar
- Revert "use native checkpointing under compile mode" #1365 @xinyu-intel
- Remove repetitive pip install commands #1367 @MohitIntel
- Minor UX enhancement #1373 @MohitIntel
- Fix bug when running image-to-text example #1371 @kaixuanliu
- Gpt_bigcode: fixed wrong indentation #1376 @mgonchar
- Support for transformers without self.model to torch.compile #1380 @astachowiczhabana
- Only pass the use_kv_cache True to generator #1366 @yafshar
- Clean up the code and remove unnecessary class #1382 @yafshar
- Add the diffusers examples of inference Tech #1244 @yuanwu2017
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 07654de) #1387 @rkumar2patel
- Enhance transformers test suite in Optimum-habana-4.43.4 (auto PR 8926a4b) #1386 @rkumar2patel
- Add README.md for Sentence transformer examples with HPU device #1355 @ZhengHongming888
- Change Falcon/GPT-Neox rotary embedding function to use seq_len #1368 @yeonsily
- Enhance Optimum-habana as per transformers-4.43.4 #1381 @rkumar2patel
- CI fix - Install stable-diffusion reqs #1389 @vidyasiv
- Fix error caused by uninitialized attn_weights #1391 @hsubramony
- Replace flash attention flag #1393 @skaulintel
- Fix DeepSpeed CI on Gaudi2 #1395 @regisss
- Truncate the cached max seq len #1394 @astachowiczhabana
- Fix gpt-neox training accuracy issue. #1397 @yeonsily
- Simplify HQT config files #1219 @Tiefen-boop
- unify_measurements.py script support to unify PCQ 70B 8x #1322 @Yantom1
- Add misc. training args #1346 @SanityRemnants
- Add quantization config for low bs case #1377 @ulivne
- Remove HQT from OHF #1257 @Yantom1
- Valid sequence length for sdpa #1183 @ssarkar2
- Multiple fixes (dynamo graph break, qwen-moe, multicard) #1410 @ssarkar2
- Change the image path for transformers tests back to the correct location #1401 @skaulintel
- Fix Gaudi2 regression tests #1403 @regisss
- Reverting some of transformer pytest funcs/values #1399 @imangohari1
- Fix StarCoder2 inference #1405 @regisss
- Change the order for test_diffusers #1406 @hsubramony
- Fix llama model text generation error #1402 @zongwave
- Datasets downgrade version to 2.21.0 #1413 @hsubramony
- Update ci sentence_transformer.sh #1424 @ZhengHongming888
- Update language-modeling README.md, add trust_remote_code for flan-t5-xl #1422 @hsubramony
- Update unify_measurements.py support info #1425 @shepark
- Fix GPT_neox incorrect output with batch query #1358 @Jianhong-Zhang
- Fix text-to-image example #1429 @regisss
- Add flag to run inference with partial dataset #1420 @pramodkumar-habanalabs
- Add peft generation example #1427 @sywangyi
- Added missing allocate_kv_cache() call in CausalLM class #1431 @yeonsily
- Fix merge error and update text-to-speech readme #1436 @hsubramony
- Fix OOM error for code llama #1437 @jiminha
- Fix error on 4bit checkpoint load with run_lm_eval on TF4.45.2 #1439 @jiminha
- GPT2 torch.compile fix #1434 @dsmertin
- Update text-gen README.md to add auto-gptq fork install steps #1442 @hsubramony
- Fix scoped linear all-reduce for starcoder model #1432 @skavulya
- Fixed recursion error in SentenceTransformer #1428 @yafshar
- Fix Llama 3.1 generation #1444 @regisss
- Remove cache folder from image data folder #1446 @shepark
v1.13.2: Patch release
Llava(-next) improvements
This patch release adds multi-card support for Llava(-next) and lets users turn recomputing for flash attention on or off (see the sketch after the list below).
- Llava: Added flash_attention_recompute arg to provide an option to enable/disable recompute #1278 @tthakkal
- Add the deepspeed injection_policy of mistral #1309 @yuanwu2017
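A hedged sketch of the new switch, assuming `flash_attention_recompute` is exposed as a `generate()` kwarg on the Gaudi-optimized Llava models as in the image-to-text example; the exact plumbing in #1278 may differ:

```python
import torch
from PIL import Image
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoProcessor, LlavaForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch Transformers with the Gaudi-optimized models

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("hpu")

image = Image.new("RGB", (336, 336))  # placeholder image for illustration
inputs = processor(
    images=image, text="USER: <image>\nDescribe the image. ASSISTANT:", return_tensors="pt"
).to("hpu", torch.bfloat16)  # BatchFeature.to casts only floating-point tensors

outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    use_flash_attention=True,        # Habana fused SDPA path
    flash_attention_recompute=True,  # assumption: the on/off switch added in #1278
)
print(processor.batch_decode(outputs, skip_special_tokens=True))
```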
Full Changelog: v1.13.1...v1.13.2
v1.13.1: Patch release
Fixed memory regressions
- Remove _expand_inputs_for_generation for greedy search #1266 @libinta
- Fix memory regression for modeling llama #1271 @libinta
FSDP
FSDP checkpoint saving is fixed.
Known limitations
- ESMFold does not work on Gaudi1; this will be fixed in a future version
Full Changelog: v1.13.0...v1.13.1
v1.13.0: Stable Diffusion 3, Sentence Transformers, SAM, DETR, Kubernetes example
SynapseAI 1.17
- Upgrade SynapseAI version to 1.17.0 #1217
Transformers 4.43
Diffusers 0.29
Stable Diffusion 3
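A minimal SD3 inference sketch, assuming the `GaudiStableDiffusion3Pipeline` class follows the same conventions as the other Gaudi Diffusers pipelines:

```python
import torch
from optimum.habana.diffusers import GaudiStableDiffusion3Pipeline

pipe = GaudiStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)
image = pipe(prompt="A photo of a red panda", num_inference_steps=28).images[0]
image.save("red_panda.png")
```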
Training with Sentence Transformers
- Enable Sentence Transformer Trainer with Gaudi #1111 @ZhengHongming888
Model optimizations
- Fix starcoder2 accuracy issue and optimize performance with fused rope #1095 @mandy-li
- Enable FusedRoPE using float32 for gpt-neox model #1104 @yeonsily
- Mamba initial enablement. #1122 @libinta
- Adding fused qkv support along with config #1102 @bhargaveede
- Enhance Qwen2 with fastsoftmax and bf16 RoPE and cache optimization #1087 @Zhiwei35
- Enable fp8 inference for Llava-Next and add Fused_SDPA #1120 @tthakkal
- Support bucket_internal for MPT #1137 @pk1d3v
- Enable Flash Attention (Fused SDPA) for Starcoder #1114 @abhilash1910
- gpt_bigcode: added FusedSDPA kernel #1138 @mgonchar
- Enable torch.compile for Granite20B #1185 @dvarshney-habana
- Refine use cache for mpt model #1158 @Jing1Ling
- GPT-J support reuse_cache #1094 @atakaha
- Use fast softmax only on prefill #1159 @jaygala223
- Starcoder2 : KVCache and flash attention (FusedSDPA) enablement #1149 @abhatkal
- Gpt bigcode fused sdpa #1260 @yeonsily
SAM, FastViT, VideoMAE, OpenCLIP, DETR, Table Transformer, DeciLM
- Add an example of Segment Anything Model [Inference] #814 @cfgfung
- Add an example of FastViT model (Inference) #826 @cfgfung
- VideoMAE Model Enabling and Examples #922 @pi314ever
- OpenCLIP sample for visual question answering #977 @vidyasiv
- Enabled DETR (Object Detection) model #1046 @cfgfung
- Table transformer enabling #978 @pi314ever
- deciLM support #1133 @sywangyi
Stable Diffusion inpainting, unconditional image generation
- Add the Stable diffusion inpaint support #869 @yuanwu2017
- Enable Unconditional Image Generation on Gaudi 2 [Diffuser/Tasks] #859 @cfgfung
Text feature extraction example
- Feature extraction enabling #994 @pi314ever
Tensor parallelism
- Tensor parallel distributed strategy without using deepspeed #1121 @kalyanjk
- Disable torch.compile for all_reduce when parallel_strategy is set to "tp" #1174 @kalyanjk
Kubernetes cluster example
- Adds a helm chart, dockerfile, and instructions for running examples using a Kubernetes cluster #1099 @dmsuehir
- Fix PyTorch version in the Kubernetes docker-compose to match image #1246 @dmsuehir
FP8 training
- TE FP8 integration #1096 @SanjuCSudhakaran
Other
- Updates run_lora_clm.py with enhanced dataset support #955 @dmsuehir
- Fix prefix tuning finetune issue and update test #975 @sywangyi
- Fix throughput calculation in image-to-text example #1070 @regisss
- SDXL training: fixed CI, changed gated dataset, fixes for non-square datasets #1038 @imangohari1
- Updating batch_size of Albert-XXL in README #1063 @vineethanandh
- Fix the error of running run_pipeline.py of text_generation example #1055 @yuanwu2017
- Add a test for llama finetuning with FP8 precision #1106 @SanjuCSudhakaran
- Beam-search fix #1113 @ssarkar2
- Add chat format support dataset in SFT #1066 @libinta
- Fix nan loss of gemma and crash if dataset_concatenation is not set #1088 @sywangyi
- torch.compile: keep input mutation in graph to avoid unnecessary memcpy #1069 @sushildubey171
- Updated langchain text-generation pipeline to work with latest release 0.2.5 #1084 @rbrugaro
- Add the MC example #891 @yuanwu2017
- Fix recompiles if limit_hpu_graph is False #1129 @ssarkar2
- Update examples batchsize in README #1123 @shepark
- Fix OOM error in SDXL Fine-Tuning validation stage #1134 @dsocek
- Added an example code to demonstrate how to use deterministic image generation #878 @cfgfung
- SD image variation/InstructPix2Pix/StableDiffusionXLImg2ImgPipeline pipeline #988 @sywangyi
- Add ci test for trl rewarding and ppo, fix backward failure in ppo caused by rmsfusion #1020 @sywangyi
- Llama adapter #983 @sywangyi
- torch.flip issue is fixed in SynapseAI 1.16, so remove the WA #1092 @sywangyi
- Fix test CausalLanguageModelingLORAExampleTester KeyError #1139 @dmsuehir
- fix(ci): new runs-on #1136 @XciD
- Add trust_remote_code for loading datasets in the audio classification example #1074 @regisss
- Generation example: print number of warmup iterations #1145 @mgonchar
- CI updates: text-gen to receive ranks/bs, updated bs/metric for baselines #1140 @imangohari1
- Support for custom files for run_lora_clm.py #1039 @vidyasiv
- Change the device_id for FSDP plugin #1086 @ckvermaAI
- Set KV Cache update as static method #1160 @ulivne
- Fix CPU tensor issue #1157 @mkumargarg
- Adding missing __init__.py to mistral and mixtral test package #1188 @rkumar2patel
- Add example of multitask_prompt/poly tuning #915 @sywangyi
- Fix data-type mismatch for mlperf_inference accuracy test #1146 @kalyanjk
- Fix spawn MP context, limit cpu and download data #1131 @polisettyvarma
- T5 multi card #1222 @yafshar
- Add trust_remote_code for t5 poly-tuning test #1220 @yafshar
- Resolve "empty tensor optional" error with hpu_graphs + kv cache for StarCoder #1181 @vidyasiv
- Fix VIT, add wav2vec comment #1223 @ssarkar2
- Roberta tests were running on CPU #1229 @ssarkar2
- Fix bert/roberta contrastive search tests #1226 @skavulya
- Remove the default env variable to trust remote code by default #1225 @yafshar
- Improve style check workflow #1230 @regisss
- Added scheduler selection for SDXL fine-tuning #867 @kplau1128
- Clear help msg for ignore_eos to avoid misunderstanding @sywangyi
- Support loading hugging face checkpoint #1165 @ulivne
- Change triggering event for code style check #1238 @regisss
- gptj: fix missing token_idx #1234 @envsp
- fix(nltk): fixed the version to working one #1247 @imangohari1
- Updating to avoid hardcoding tests in CI framework #1221 @vidyasiv
- Fix FSDP graph error due to Transformers 4.43 update #1251 @jiminha
- Fix SD README commands #1250 @imangohari1
- Fix spelling errors #1252 @changwangss
- Set HLS_MODULE_ID only if it wasn't set previously #1254 @astachowiczhabana
- Fix overflow of steps in SDXL for default diffusers scheduler @dsocek
- fix(test_diffusers): automated the checking for tests without upstream HF #1232 @imangohari1
- fix(nltk): Revert 1247. Updated the version. added the punkt_tab download #1258 @imangohari1
- Set input_embeds before it gets used #1261 @tthakkal
- Update README and more changes, rebase to main #1259 @shepark
Known limitations
- For Llama, some big batch sizes lead to out-of-memory errors whereas they used to work
v1.12.1: Patch release
Fix first-token latency measurement
Fix for Mixtral
- Mixtral typo fix #1107 @schoi-habana
Other
Full Changelog: v1.12.0...v1.12.1
v1.12: Qwen2, Gemma, SVD, Dreambooth, speculative sampling
SynapseAI v1.16
Transformers 4.40
Speculative Sampling
- Speculative sampling on Gaudi using Optimum-Habana #973 @nraste
- Fix assisted decoding generation error #1080 @libinta
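Speculative sampling drafts tokens with a small assistant model and verifies them with the target model, so outputs match regular decoding while running fewer target-model steps. A hedged sketch using the standard Transformers `assistant_model` argument (the bundled run_generation example may expose this through its own flags instead):

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch Transformers with the Gaudi-optimized models

target_id, draft_id = "meta-llama/Llama-2-7b-hf", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(target_id)
model = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16).to("hpu")
assistant = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16).to("hpu")

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to("hpu")
# Draft tokens come from the assistant and are accepted/rejected by the target model.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```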
Model optimizations
- Add --bucket_size support for gpt_bigcode #802 @jiminha
- Optimize StableLM model inference #805 @XinyuYe-Intel
- Enable google/gemma-7b. #747 @lkk12014402
- Enable llava static generation. #767 @lkk12014402
- Fix perf drop in flan-t5 summarization #908 @MohitIntel
- Enable Qwen2 model #774 @XinyuYe-Intel
- Extend bucket_internal to SAMPLE generation mode #819 @xt574chen
- SpeechT5 static consistent dropout #824 @Spycsh
- Optimize inference of Persimmon model #822 @XinyuYe-Intel
- Enable OWL-ViT graph mode on Gaudi platform #783 @cfgfung
- Support mixtral kvcache reuse and remove kv_cache_fp8 #898 @jychen21
- Add fp8 related changes to mistral for text-generation #918 @skaulintel
- Optimization for phi series models: support fp8 kv cache and reuse kv cache #902 @yuwenzho
- Support Mistral 32K input token #931 @jiminha
- Support mixtral long sequence 32k with bs 4 #903 @jychen21
- Adapt Mixtral long sequence handling for Mistral #985 @jiminha
- Fix performance issue in mistral #1030 @jiminha
- Optimized inference of Starcoder2 model #829 @XinyuYe-Intel
- Add support for IBM Granite #1045 @regisss
- Enable fp8 inference for Llava-hf 7B and 13B in 1.16 release #951 @Luca-Calabria
- FusedRoPE input in bf16 #1026 @ssarkar2
- Enhance Qwen2 model with FSDPA and bucket #1033 @Zhiwei35
- Optimize seamless-m4t/vits model for text-to-speech generation #825 @sywangyi
- cache_optimization #1028 @ssarkar2
- Ensure KV cache is not returned as output tensor during decode phase for Falcon #993 @schoi-habana
- Fast softmax #972 @wszczurekhabana
- Falcon optimization #974 @libinta
- Quantization for FSDPA #976 @dudilester
- Falcon update park #1052 @ssarkar2
- Add the Llava_next support #1041 @yuanwu2017
- Improve torch compile performance #1082 @libinta
Stable Video Diffusion
PEFT
- Add ia3 and adalora support #809 @sywangyi
- Enable prompt tuning/prefix tuning/p tuning clm and example #758 @sywangyi
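These methods plug in through the standard peft API; a minimal prompt-tuning sketch (the model choice and init text are illustrative, and on Gaudi the training loop would go through GaudiTrainer as in the bundled language-modeling examples):

```python
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify if the sentence is positive or negative:",
    num_virtual_tokens=8,
    tokenizer_name_or_path="bigscience/bloom-560m",
)
model = get_peft_model(model, peft_config)  # only the virtual prompt tokens are trainable
model.print_trainable_parameters()
```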
TRL
Object Segmentation Example
Dreambooth
Others
- Text generation pipeline: Extended functionality to align with run_generation script #782 @mgonchar
- Enable clip mediapipe and update G2 baseline #856 @MohitIntel
- Add ci test for SFT and DPO #857 @sywangyi
- Fix SFT, DPO CI on Gaudi1 #893 @regisss
- Add SDXL in README #894 @regisss
- Fix falcon 180b oom issue if peft > 0.6.2 #895 @sywangyi
- Enabled additional models in CI #879 @MohitIntel
- Add static shape support for vision_encoder_decoder generation if decoder supports static shape #834 @sywangyi
- Add HabanaProfile to Stable Diffusion and XL #828 @atakaha
- Pytest accuracy updates for Falcon, T5, GPT2 #916 @Luca-Calabria
- Update text-generation readme with torch.compile info. #884 @libinta
- Update Wav2Vec2ModelTest::test_initialization #919 @malkomes
- Add linear and dynamic RoPE to Mistral and Mixtral #892 @regisss
- Fix for wav2vec2 test cases #923 @lqnguyen
- Add nograd() to prevent backward backend #897 @astachowiczhabana
- Assisted decoding not implemented #910 @tjs-intel
- Disable wav2vec2 symbolic tracing test #904 @tjs-intel
- Add support for symbolic tracing of GPT2 models #913 @tjs-intel
- Utils: return a more reasonable error when attempting to load a non-PyTorch model #921 @mgonchar
- Pytest accuracy updates for Bridgetower, Swin, Vit #927 @Luca-Calabria
- Text generation: added langchain pipeline script #887 @mgonchar
- Fix for AST models #914 @vidyasiv
- Fix AttributeError for wav2vec test #929 @Jianhong-Zhang
- Fix ValueError for test_summarization #939 @Jianhong-Zhang
- Grad norm tensor fix #938 @yeonsily
- Add information to the audio-classification examples README about --ddp_find_unused_parameters parameter #941 @Alberto-Villarreal
- Add leaderboard link #947 @echarlaix
- Fix formatting of arg parse help strings in the PEFT example #944 @dmsuehir
- Use new Habana llama and falcon model configs #940 @skaulintel
- Update based on legal requirements. #900 @libinta
- Update test generation config to raise ValueError #949 @malkomes
- Add --trust_remote_code for text generation examples #870 @yangulei
- Added Llama-2 fp8 text-generation test cases #934 @yeonsily
- Upgrade SD output image verification with CLIP score #920 @MohitIntel
- Llama Guard for text classification example #871 @dsmertin
- Update README logo #950 @regisss
- Add Gaudi CI for Sentence Transformers #928 @regisss
- Get iteration times through generate() #899 @hsubramony
- Update speech recognition seq2seq example #953 @regisss
- Fix wrong all_gather for Mixtral finetuning #965 @ccrhx4
- Add intel-mila ProtST example #860 @sywangyi
- Small CI refacto #968 @regisss
- Llama-70B on one card: infer device map with max memory limitation #963 @Yantom1
- Map list to tensors #926 @ssarkar2
- Fix fsdp lora torch compile issue #971 @sywangyi
- Fix for the simulate_dyn_prompt flag assertion #984 @alekseyfa
- Initial enablement with FP8 Training (port from OHF #91) #936 @libinta
- Warn user when using --disk_offload without hqt #964 @Yantom1
- Assign grad_norm for logging only if it's a single element tensor #992 @yeonsily
- Update examples #998 @regisss
- Fix warmup for diffusers when batch size < throughput_warmup_steps #960 @dsocek
- Add torch.compile instructions for Roberta-Large #981 @MohitIntel
- Fix gpt_neox, stablelm inference regression caused by RoPE dtype #999 @mandy-li
- fea(examples): Updated the READMEs with requirements.txt installation #1000 @imangohari1
- Initial commit for fp8 CI #995 @yeonsily
- Fixed 'MixtralConfig' object has no attribute 'rope_scaling' #1009 @aslanxie
- Use the length of timesteps as the inference step number #986 @yuanwu2017
- Fix the bug of output_type=np or latent. #996 @yuanwu2017
- Fix wav2vec test load adapter #937 @malkomes
- Mark scale as const and remove --fp8 flag usage #962 @Yantom1
- Add per step time collection to other methods #1004 @ssarkar2
- Fix first token time #1019 @ssarkar2
- Fix text-generation example #1025 @regisss
- Updates test_beam_search to transformers_4.40 #1017 @malkomes
- Fix eos problem #1034 @sywangyi
- fp8 textgen ci structure update #1029 @jiminha
- Fix a return value issue caused by PR 973 #1040 @yafshar
- Add no_checks for sub dataset in lvwerra/stack-exchange-paired since it does not contain test split #1003 @sywangyi
- Readme Update for FSDP #980 @hlahkar
- Add unifier script and disk offload flag usages to README. #1023 @libinta
- Add mixtral for meta device load due to mixtral-8x22b model size #909 @libinta
- Update unifier script #1010 @Yantom1
- Update text-generation CI configuration for falcon and Mixtral #1044 @yeonsily
- Update multi-node README to check ssh connection issue #1048 @yeonsily
- Infra upgrade workflows #480 @glegendre01
- Update test_text_generation_example.py #1051 @ssarkar2
- BERT training migrated to torch.compile #990 @ANSHUMAN87
- Update test_examples.py #1053 @ssarkar2
- Update modeling_llama.py: deepspeed fix for codellama #1054 @ssarkar2
- No shapes in profilings by default #1050 @astachowiczhabana
- Change the way to unset environment variable for gpt-neox ci #1060 @yeonsily
- Update README for Albert torch.compile mode #1061 @MohitIntel
- Fix lm_evaluation_harness to specific commit (#240) #1064 @astachowiczhabana
- Fix text-generation example README.md #1081 @shepark
v1.11.1: Patch release
Llama3 has been validated on Gaudi
Fix issue with pytest
The latest SynapseAI Docker images come with Pytest v8 preinstalled, which is incompatible with the Transformers library and leads to errors in a few non-test cases. As a temporary workaround, Pytest is pinned to an earlier version and made a hard dependency.
Other
- Fp8 merge fix #863 @libinta
- Fixed "reuse_cache" Bug #888 @Danielohayon
- Remove deprecated AOT_HPU_TRAINING_BACKEND #877 @astachowiczhabana
- Add mark step and inplace residual add in llama model code #833 @puneeshkhanna
- Enable Flash Attention in recompute and causal modes #862 @wszczurekhabana
- Add mark_step for llama inference #875 @libinta
Full Changelog: v1.11.0...v1.11.1
v1.11: SDXL fine-tuning, Whisper, Phi, ControlNet
SynapseAI v1.15
The codebase is fully validated for the latest version of Habana SDK, SynapseAI v1.15.0.
SDXL fine-tuning
Whisper
- Support speech recognition with whisper models and seq2seq #704 @emascarenhas
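The example centers on run_speech_recognition_seq2seq.py for fine-tuning; for plain inference, a hedged sketch of Whisper on HPU (standard Transformers API, with the Gaudi wiring via adapt_transformers_to_gaudi):

```python
import numpy as np
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoProcessor, WhisperForConditionalGeneration
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch Transformers with the Gaudi-optimized models

processor = AutoProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small", torch_dtype=torch.bfloat16
).to("hpu")

audio = np.zeros(16000, dtype=np.float32)  # one second of 16 kHz silence, for illustration
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to("hpu", dtype=torch.bfloat16))
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```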
Phi
- Enable phi series models #732 @lkk12014402
ControlNet
Transformers v4.38
The codebase is fully validated for Transformers v4.38.
Model optimizations
- Add optimization for blip text model generation #653 @sywangyi
- Enable internal kv bucket in llama #720 @xt574chen
- Enable Mixtral-8x7B #739 @jychen-habana
- Update Mixtral-8x7B fp8 hqt example #756 @jychen-habana
- Further fixes for performance with internal bucketing #781 @puneeshkhanna
- speecht5 optimization #722 @sywangyi
- move img_mask@get_attn_mask() to hpu #795 @hsubramony
- Mistral optimizations #804 @ssarkar2
Image-to-text and VQA examples
torch.compile
- Enable torch_compile mode for distributed #659 @kalyanjk
- Fix graph breaks in torch compile mode #806 @hlahkar
- Fix torch.compile for text generation #811 @regisss
- Add Llama7b FSDP test for torch.compile mode #818 @pankd
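A hedged sketch of the torch.compile path. The backend name "hpu_backend" is an assumption based on Habana's Dynamo backend, and the bundled examples usually enable compilation through script flags (e.g. --torch_compile) rather than calling torch.compile directly:

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch Transformers with the Gaudi-optimized models

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16).to("hpu")
compiled_model = torch.compile(model, backend="hpu_backend")  # assumption: backend name

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("hpu")
with torch.no_grad():
    logits = compiled_model(**inputs).logits  # first call triggers graph compilation
```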
Bug fixes
- Fix beam search crash and incorrect output in decoder-only and encoder-decoder models #627 @sywangyi
- Fix translation models #710 @vidyasiv
- Fix throughput calculation for diffusion models #715 @skavulya
- Fix crash in llama mode in llava image-to-text generation #755 @sywangyi
- Fix backward error in DDP when running reward model finetune in RLHF #507 @sywangyi
- Fix get_dtype and convert_into_dtypes #769 @regisss
- Override sdpa option in Gaudi #771 @jiminha
- Fix Llama-70B-FSDP model loading issue #752 @hlahkar
- Fix FSDP in transformer4.38 #812 @libinta
- Delay importing deepspeed comm to improve perf #810 @jiminha
- Fix llama rotary pos emb issue for transformers 4.38 #813 @libinta
- Fix torch.full issue when running DeepSpeed ZeRO-3 for Llama #820 @libinta
- Fix profile issue with 1st step #837 @libinta
- Fix mistral after syn1.15 update #858 @ssarkar2
Others
- Small test_text_generation_example.py refacto #725 @regisss
- Update README, add PPO support #721 @sywangyi
- Update the Mistral model naming #726 @yafshar
- Changing backend name #708 @vivekgoe
- Update ppo_trainer.py #718 @skaulintel
- Add seed in SFT example, make SFT result reproducible #735 @sywangyi
- Adding a flag whether to save checkpoint or not in run_lora_clm.py #736 @yeonsily
- Refactor and update CI for encoder-decoders #742 @regisss
- Expose Llama Fused OPs control from run_lora_clm.py #751 @hlahkar
- Fixing tests by making static_shapes False #778 @bhargaveede
- Fix ControlNet README #785 @regisss
- Workaround for RoPE computed in bf16 for GPT-NeoX #746 @regisss
- Add Whisper and SpeechT5 to model table #790 @regisss
- Update summarization example README #791 @srajabos
- Block torchscript pytest because of seg fault issue #793 @yeonsily
- Fix test_encoder_decoder.py for opus-mt-zh-en #798 @regisss
- Replacing obsolete API for mediapipe #796 @MohitIntel
- Add --distribution_strategy fast_ddp in contrastive-image-text README and BridgeTower test #799 @regisss
- Fix redundant bucket internal and hpu graph setting #797 @puneeshkhanna
- Add Llama test for fsdp #761 @hlahkar
- Enable dynamic shapes for esmfold #803 @hsubramony
- Add Llama/Llama2 support in Question-Answering #745 @kplau1128
- Update MLM example #830 @regisss
- Revert Wav2Vec2 TDNNLayer forward function same as transformer v4.37.2 #827 @yeonsily
- Save CI test output image #835 @MohitIntel
- Update ckpt loading #773 @schoi-habana
- Skip SDXL test in CI #840 @regisss
- Fix FSDP test on Gaudi1 #841 @regisss
- Remove installation from source for Diffusers in CI #846 @regisss
- Fix fp8 ci #852 @regisss
- Fix PR #848 #853 @regisss
- Disable safe loading tests in CI #854 @regisss
- Add warmup for eval #855 @libinta
Known issue
- A crash may occur with unify_measurements.py
v1.10.4: Patch release
Fix Llama memory issue with DeepSpeed ZeRO-3
- Fix Llama initialization #712
Full Changelog: v1.10.2...v1.10.4