This PR contains the following updates:
vllm: ==0.6.1 -> ==0.6.2
Release Notes
vllm-project/vllm (vllm)
v0.6.2
Compare Source
Highlights
Model Support
Support Llama 3.2 models (#8811, #8822)
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
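Once the server above is running, it can be queried through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default endpoint http://localhost:8000/v1, the official openai Python client, and a placeholder image URL:

```python
# Sketch: query the vLLM OpenAI-compatible server started with `vllm serve` above.
# Assumes the default endpoint http://localhost:8000/v1; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored by vLLM

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # Placeholder URL; replace with a real, reachable image.
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```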
Beam search has been soft-deprecated. We are moving towards a more performant implementation of beam search that also simplifies vLLM's core. (#8684, #8763, #8713)
Support for the Solar model (#8386), MiniCPM3 (#8297), and LLaVA-OneVision (#8486)
Enhancements: pipeline parallelism for Qwen2-VL (#8696), multiple images for Qwen-VL (#8247), Mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), and tensor parallelism with bitsandbytes quantization (#8434)
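To illustrate the last two enhancements, here is a hedged offline-inference sketch. The model name is illustrative, and the keyword arguments (quantization, load_format, tensor_parallel_size) are assumed to match this release's engine arguments:

```python
# Sketch only: bitsandbytes quantization combined with tensor parallelism.
# Assumes quantization="bitsandbytes" with load_format="bitsandbytes" is accepted
# and that two GPUs are available; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-2-9b-it",   # illustrative model (bnb support for Gemma2: #8338)
    quantization="bitsandbytes",    # quantize weights on load
    load_format="bitsandbytes",
    tensor_parallel_size=2,         # shard across 2 GPUs (#8434)
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```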
Hardware Support
Production Engine
Performance
MQLLMEngine for the API server boosts throughput by 30% in single-step and 7% in multi-step mode (#8157, #8761, #8584)
Others
What's Changed
IQ1_M quantization implementation to GGUF kernel by @Isotr0py in https://github.com/vllm-project/vllm/pull/8357
MQLLMEngine to avoid asyncio OH by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8157
dead_error property to engine client by @joerunde in https://github.com/vllm-project/vllm/pull/8574
collect_env.py by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8649
PromptInputs to PromptType, and inputs to prompt by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8673
SequenceData and Sequence by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8675
SequenceData.from_token_counts to create dummy data by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8687
Revert "PromptInputs to PromptType, and inputs to prompt" by @simon-mo in https://github.com/vllm-project/vllm/pull/8750
replace_parameters by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/8748
PromptInputs and inputs, with backwards compatibility by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8760 (see the sketch after the changelog link)
New Contributors
Full Changelog: vllm-project/vllm@v0.6.1...v0.6.2
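Because the inputs-to-prompt rename above was reverted (#8750) and then re-applied with backwards compatibility (#8760), user code is safest passing the prompt positionally. A minimal sketch under that assumption about the public LLM.generate signature; the model name is illustrative:

```python
# Sketch: passing prompts positionally keeps code working across the
# inputs -> prompt keyword rename (#8673, reverted in #8750,
# re-applied with backwards compatibility in #8760).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small illustrative model
params = SamplingParams(max_tokens=32)

# Positional form works on both sides of the rename.
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```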
v0.6.1.post2
Compare Source
Highlights
What's Changed
Full Changelog: vllm-project/vllm@v0.6.1.post1...v0.6.1.post2
v0.6.1.post1
Compare Source
Highlights
This release features important bug fixes and enhancements for:
--max_num_batched_tokens 16384 with --max-model-len 16384
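For the flag combination called out above, a hedged sketch of the equivalent configuration through vLLM's Python entry point; the keyword names are assumed to mirror the CLI flags, and the model is a placeholder:

```python
# Sketch: the highlighted configuration expressed as engine keyword arguments.
# Assumes max_num_batched_tokens / max_model_len are forwarded to the engine
# arguments mirroring the CLI flags; the model name is illustrative.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative model
    max_model_len=16384,
    max_num_batched_tokens=16384,
)
```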
Also
engine_use_ray
(#8126)
What's Changed
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.