Releases · intel/neural-speed
Intel® Neural Speed v1.0 Release
Highlights
- Support models from ModelScope (see the loading sketch below)
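With ModelScope support, models can be pulled from ModelScope through the same Transformers-style flow used for Hugging Face checkpoints. Below is a minimal sketch patterned on the project's documented Python usage; the Qwen model ID is illustrative, and `model_hub="modelscope"` is the switch that selects the hub:

```python
from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "qwen/Qwen-7B"  # ModelScope model ID (illustrative)
prompt = "Once upon a time, there existed a little girl,"

# model_hub="modelscope" downloads from ModelScope instead of Hugging Face
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```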
Examples
- Enable Mistral-base-v0.2 (ee40f28)
Validated Configurations
- Python 3.9, 3.10, 3.11
- Ubuntu 22.04
Intel® Neural Speed v1.0a Release
Highlights
- Improve performance on client CPUs
- Support batching and submit GPT-J results to MLPerf v4.0
Improvements
- Support continuous batching and beam search inference (7c2199); see the sketch after this list
- Improve performance on AVX2 platforms (bc5ee16, aa4a8a, 35c6d10)
- Support FFN fusion for ChatGLM2 (96fadd)
- Enable loading models from ModelScope (ad3d19)
- Extend the supported input token length (eb41b9, e76a58e)
- [BesTLA] Improve RTN quantization accuracy for int4 and int3 (a90aea)
- [BesTLA] New thread pool and hybrid dispatcher (fd19a44)
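For the batching and beam-search items above, here is a minimal sketch of beam-search decoding through the `neural_speed.Model` Python API. The model ID is illustrative, and treating `num_beams` as a pass-through Transformers-style generation kwarg is an assumption, not a guarantee of the runtime's full kwarg coverage:

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative model ID
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("The capital of France is", return_tensors="pt").input_ids

model = Model()
# init() converts and quantizes the checkpoint on first use
model.init(model_name, weight_dtype="int4", compute_dtype="int8")

# num_beams > 1 requests the beam-search decoding path (assumed kwarg)
outputs = model.generate(inputs, num_beams=4, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```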
Examples
- Enable Mixtral 8x7B (9bcb612)
- Enable Mistral-GPTQ (96dc55)
- Implement the YaRN rope scaling feature (6c36f54)
- Enable Qwen 1.5 (750b35)
- Support GPTQ & AWQ inference for Qwen v1, v1.5 and Mixtral-8x7B (a129213); see the sketch after this list
- Support GPTQ for Baichuan2-13B, Falcon 7B & Phi-1.5 (eed9b3)
- Enable Baichuan-7B and refactor Baichuan-13B (8d5fe2d)
- Enable StableLM2-1.6B, StableLM2-Zephyr-1.6B & StableLM-3B (872876)
- Enable ChatGLM3 (94e74d)
- Enable Gemma-2B (e4c5f71)
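For the GPTQ/AWQ items above, a minimal sketch of loading a pre-quantized checkpoint through the Transformers-style front end in intel-extension-for-transformers. It assumes the front end picks up the checkpoint's quantization config automatically; the repo ID is illustrative:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Pre-quantized GPTQ checkpoint on the Hugging Face Hub (illustrative)
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# The GPTQ weights are converted to the runtime's own format at load time
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```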
Bug Fixing
- Fix convert_quantized model bug (37d01f3)
- Fix AutoRound accuracy regression (991c35)
- Fix Qwen load error (2309fbb)
- Fix the GGUF convert issue (5293ffa)
Validated Configurations
- Python 3.9, 3.10, 3.11
- Ubuntu 22.04
Intel® Neural Speed v0.3 Release
Highlights
- Contributed GPT-J inference to MLPerf v4.0 submission (mlperf commits)
- Enabled 3-bit low precision inference (ee40f28)
Improvements
- Optimize layer normalization (98ffee45)
- Update the Qwen Python API (51088a)
- Load processed models automatically (662553)
- Support continuous batching in offline and server modes (66cb9f5)
- Support loading models directly from Hugging Face (bb80273); see the sketch after this list
- Support AutoRound (e2d3652)
- Enable OpenMP in BesTLA (3afae427)
- Enable logging via NEURAL_SPEED_VERBOSE (a8d9e7)
- Add the YaRN rope scaling data structure (8c846d6)
- Improvements targeting Windows (464239)
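For the direct Hugging Face loading and NEURAL_SPEED_VERBOSE items above, a minimal sketch using the `neural_speed.Model` API. The model ID is illustrative, and the meaning of verbose level "1" is an assumption; the accepted levels are documented in the repository:

```python
import os
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

# Assumption: "1" enables runtime performance logging; see the repo docs
os.environ["NEURAL_SPEED_VERBOSE"] = "1"

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# init() fetches the checkpoint from Hugging Face, converts and quantizes it
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```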
Examples
- Enable Qwen 1.8B (ea4b713)
- Enable Phi-2, Phi-1.5 and Phi-1 (c212d8)
- Support 3-bit & 4-bit GPTQ for GPT-J 6B (4c9070)
- Support Solar 10.7B with GPTQ (26c68c7, 90f5cbd)
- Support Qwen GGUF inference (cd67b92)
Bug Fixing
- Fix performance problem introduced by log level (6833b2f, 6f85518f)
- Fix straightforward-API issues (4c082b7)
- Fix a blocker on Windows platforms (4adc15)
- Fix the Whisper Python API (c97dbe)
- Fix Qwen loading & Mistral GPTQ convert (d47984c)
- Fix clang-tidy issues (ad54a1f)
- Fix Mistral online loading issues (0470b1f)
- Handle models that require an HF access token (33ffaf07)
- Fix the GGUF convert issue (5293ffa5)
- Fix GPTQ & AWQ convert issue (150e752)
Validated Configurations
- Python 3.10
- Ubuntu 22.04
Intel® Neural Speed v0.2 Release
Highlights
- Support Q4_0, Q5_0 and Q8_0 GGUF models and AWQ; see the GGUF loading sketch after this list
- Enhance tensor parallelism with shared memory across multiple sockets in a single node
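A minimal sketch of running one of the supported GGUF files, patterned on the project's documented usage; the repo, file and tokenizer IDs are illustrative:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# GGUF repo and file on the Hugging Face Hub (illustrative)
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
model_file = "llama-2-7b-chat.Q4_0.gguf"  # a Q4_0/Q5_0/Q8_0 file
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# model_file selects the specific GGUF file inside the repo
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```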
Improvements
- Rename BesTLA files and their usage (d5c26d4)
- Update the Python API and reorganize scripts (40663e)
- Enable AWQ with a Llama2 example (9be307f)
- Enable clang-tidy (227e89)
- Support multi-node tensor parallelism (6dbaa0)
- Support accuracy calculation for GPTQ models (7b124aa)
- Enable logging via NEURAL_SPEED_VERBOSE (a8d9e7)
Examples
- Add Magicoder example (749caca)
- Enable Whisper large example (24b270)
- Add Dockerfile and README (f57d4e1)
- Support multi-batch ChatGLM-V1 inference (c9fb9d)
Bug Fixing
- Fix avx512-s8-dequant and asymmetric-related bugs (fad80b14)
- Fix warmup prompt length and add ns_log_level control (070b6b)
- Fix convert: remove hardcoded AWQ handling (7729bb)
- Fix the ChatGLM convert issue (7671467)
- Fix the BesTLA Windows compile issue (760e5f)
Validated Configurations
- Python 3.10
- Ubuntu 22.04
Intel® Neural Speed v0.1 Release
Highlights
- Created Neural Speed project, spinning off from Intel Extension for Transformers
Features
- Support GPTQ models
- Enable beam search post-processing
- Add MX formats (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4)
- Refactor Transformers Extension for Low-bit Inference Runtime based on the latest Jblas
- Support tensor parallelism with Jblas and shared memory
- Improve performance on client CPUs
- Enable streaming LLM for the runtime
- Enhance QLoRA on CPU with an optimized dropout operator
- Add a script for PPL evaluation
- Refine the Python API; see the quantization sketch after this list
- Allow CompileBF16 on GCC 11
- Multi-round chat with ChatGLM2
- Shift-RoPE-based streaming LLM
- Enable MHA fusion for LLMs
- Support AVX_VNNI and AVX2
- Optimize the QBits backend
- Support GELU
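Several of the features above (the low-bit inference runtime, MX formats, refined Python API) surface through weight-only quantization parameters. A minimal sketch, assuming the `weight_dtype`, `compute_dtype` and `group_size` arguments of `Model.init()`; the exact accepted values are documented in the repository, and the model ID is illustrative:

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative model ID
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Hello, my name is", return_tensors="pt").input_ids

model = Model()
# weight_dtype picks the low-bit storage format (e.g. "int4"; "nf4"/"fp4"
# mirror the MX formats above), compute_dtype the kernel precision
model.init(model_name, weight_dtype="int4", compute_dtype="int8",
           group_size=128)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```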
Examples
- Enable fine-tuning for Qwen-7B-Chat on CPU
- Enable the Whisper C++ API
- Apply the STS task to BAAI/BGE models
- Enable Qwen graph
- Enable instruction-tuning Stable Diffusion examples
- Enable Mistral-7B
- Enable Falcon-180B
- Enable Baichuan/Baichuan2 example
Validated Configurations
- Python 3.9, 3.10, 3.11
- GCC 13.1, 11.1
- CentOS 8.4 & Ubuntu 20.04 & Windows 10