Note: This repository was archived by the owner on Aug 30, 2024, and is now read-only.

Releases: intel/neural-speed

Intel® Neural Speed v1.0 Release

29 Mar 2024, 11:54 · 79c3537

Examples

  • Enable Mistral-base-v0.2 (ee40f28); see the generation sketch below
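
To sanity-check a newly enabled model, generation through the Transformers-style front end looks roughly as follows. This is a minimal sketch, assuming the intel_extension_for_transformers AutoModelForCausalLM wrapper described in the project README, and using the mistralai/Mistral-7B-Instruct-v0.2 checkpoint purely as an illustration:

```python
# Minimal generation sketch; API surface assumed from the project README,
# checkpoint name is illustrative only.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit routes inference through Neural Speed's int4 weight-only path
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```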

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • Ubuntu 22.04

Intel® Neural Speed v1.0a Release

22 Mar 2024, 11:10 · 1051182


Highlights

  • Improve performance on client CPUs
  • Support batching and submit GPT-J results to MLPerf v4.0

Improvements

  • Support continuous batching and beam-search inference (7c2199)
  • Improve performance on the AVX2 platform (bc5ee16, aa4a8a, 35c6d10)
  • Support FFN fusion for ChatGLM2 (96fadd)
  • Enable loading models from ModelScope (ad3d19); see the sketch after this list
  • Extend supported input token length (eb41b9, e76a58e)
  • [BesTLA] Improve RTN quantization accuracy for int4 and int3 (a90aea)
  • [BesTLA] New thread pool and hybrid dispatcher (fd19a44)
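
For the ModelScope item above, loading should differ only in where the weights come from. A minimal sketch, assuming the model_hub="modelscope" argument from the project documentation and the qwen/Qwen-7B ModelScope id as an example:

```python
# Sketch of loading from ModelScope instead of Hugging Face; the
# model_hub="modelscope" argument is assumed from the project docs.
from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "qwen/Qwen-7B"  # ModelScope model id, illustrative
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, model_hub="modelscope"
)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```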

Examples

  • Enable Mixtral 8x7B (9bcb612)
  • Enable Mistral-GPTQ (96dc55)
  • Implement the YaRN RoPE scaling feature (6c36f54)
  • Enable Qwen 1.5 (750b35)
  • Support GPTQ & AWQ inference for Qwen v1, v1.5 and Mixtral-8x7B (a129213); see the sketch after this list
  • Support GPTQ for Baichuan2-13B & Falcon 7B & Phi-1.5 (eed9b3)
  • Enable Baichuan-7B and refactor Baichuan-13B (8d5fe2d)
  • Enable StableLM2-1.6B & StableLM2-Zephyr-1.6B & StableLM-3B (872876)
  • Enable ChatGLM3 (94e74d)
  • Enable Gemma-2B (e4c5f71)
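
For the GPTQ/AWQ inference items above, pre-quantized checkpoints are meant to load directly, without re-quantizing. A minimal sketch, assuming the neural_speed.Model Pythonic API and its use_gptq flag (both taken from the project README; verify against this release before relying on them), with TheBloke/Llama-2-7B-Chat-GPTQ as an illustrative checkpoint:

```python
# Direct GPTQ inference sketch; neural_speed.Model and the use_gptq flag
# are assumed from the project README.
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative GPTQ checkpoint
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = Model()
# Convert/load the GPTQ weights into Neural Speed's int4 runtime format.
model.init(model_name, weight_dtype="int4", compute_dtype="int8", use_gptq=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```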

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • Ubuntu 22.04

Intel® Neural Speed v0.3 Release

23 Feb 2024, 12:57 · 150e752


Highlights

  • Contributed GPT-J inference to the MLPerf v4.0 submission (mlperf commits)
  • Enabled 3-bit low-precision inference (ee40f28)

Improvements

  • Optimize layer normalization (98ffee45)
  • Update the Qwen Python API (51088a)
  • Load processed models automatically (662553)
  • Support continuous batching in offline and server modes (66cb9f5)
  • Support loading models directly from Hugging Face (bb80273)
  • Support AutoRound quantization (e2d3652)
  • Enable OpenMP in BesTLA (3afae427)
  • Enable logging via NEURAL_SPEED_VERBOSE (a8d9e7); see the sketch after this list
  • Add the YaRN RoPE scaling data structure (8c846d6)
  • Improvements targeting Windows (464239)
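
The NEURAL_SPEED_VERBOSE switch above is an environment variable read at runtime, so it only needs to be set before the model is loaded. A minimal sketch; the level semantics (0 = full tracing, 1 = evaluation time, 2 = per-operator profiling) are an assumption based on the project documentation:

```python
# Sketch: enable Neural Speed's verbose logging through the environment.
# Assumed levels (from project docs): 0 = all tracing information,
# 1 = evaluation time only, 2 = per-operator profiling.
import os
os.environ["NEURAL_SPEED_VERBOSE"] = "1"  # set before the model is loaded

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Hello,", return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)  # timings go to stdout
```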

Examples

  • Enable Qwen 1.8B (ea4b713)
  • Enable Phi-2, Phi-1.5 and Phi-1 (c212d8)
  • Support 3-bit & 4-bit GPTQ for GPT-J 6B (4c9070)
  • Support Solar 10.7B with GPTQ (26c68c7, 90f5cbd)
  • Support Qwen GGUF inference (cd67b92)

Bug Fixing

  • Fix a performance problem introduced by the log level (6833b2f, 6f85518f)
  • Fix issues in the straightforward API (4c082b7)
  • Fix a blocker on Windows platforms (4adc15)
  • Fix the Whisper Python API (c97dbe)
  • Fix Qwen loading & Mistral-GPTQ conversion (d47984c)
  • Fix clang-tidy issues (ad54a1f)
  • Fix Mistral online-loading issues (0470b1f)
  • Handle models that require a Hugging Face access token (33ffaf07)
  • Fix a GGUF conversion issue (5293ffa5)
  • Fix GPTQ & AWQ conversion issues (150e752)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04

Intel® Neural Speed v0.2 Release

22 Jan 2024, 14:41 · abcc0f4


Highlights

  • Support Q4_0, Q5_0 and Q8_0 GGUF models and AWQ; see the loading sketch after this list
  • Enhance tensor parallelism with shared memory across multiple sockets in a single node
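
Loading a pre-quantized GGUF file goes through the same front end by naming the file inside the model repository. A minimal sketch, assuming the model_file argument shown in the project README, with TheBloke's Llama-2 GGUF repository as an illustration (the tokenizer comes from the original FP16 repo):

```python
# Sketch: run a Q4_0 GGUF model; the model_file argument is assumed from
# the project README, model/tokenizer names are illustrative.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "TheBloke/Llama-2-7B-Chat-GGUF"      # GGUF repo on Hugging Face
model_file = "llama-2-7b-chat.Q4_0.gguf"          # Q4_0 / Q5_0 / Q8_0 work
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"  # tokenizer source
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```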

Improvements

  • Rename BesTLA files and their usage (d5c26d4)
  • Update the Python API and reorganize scripts (40663e)
  • Enable AWQ with a Llama2 example (9be307f)
  • Enable clang-tidy (227e89)
  • Support multi-node tensor parallelism (6dbaa0)
  • Support accuracy calculation for GPTQ models (7b124aa)
  • Enable logging via NEURAL_SPEED_VERBOSE (a8d9e7)

Examples

  • Add a Magicoder example (749caca)
  • Enable a Whisper large example (24b270)
  • Add a Dockerfile and README (f57d4e1)
  • Support multi-batch ChatGLM-V1 inference (c9fb9d)

Bug Fixing

  • Fix avx512-s8-dequant and asymmetric-quantization bugs (fad80b14)
  • Fix warmup prompt length and add ns_log_level control (070b6b)
  • Fix conversion: remove hardcoded AWQ settings (7729bb)
  • Fix the ChatGLM conversion issue (7671467)
  • Fix a BesTLA Windows compile issue (760e5f)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04

Intel® Neural Speed v0.1 Release

22 Dec 2023, 14:47 · 6d8bb4a


Highlights

  • Created the Neural Speed project as a spin-off from Intel Extension for Transformers

Features

  • Support GPTQ models
  • Enable beam-search post-processing
  • Add MX formats (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4)
  • Refactor the Transformers Extension for Low-bit Inference Runtime based on the latest Jblas
  • Support tensor parallelism with Jblas and shared memory
  • Improve performance on client CPUs
  • Enable streaming LLM for the runtime
  • Enhance QLoRA on CPU with an optimized dropout operator
  • Add a script for PPL evaluation
  • Refine the Python API
  • Allow CompileBF16 on GCC 11
  • Support multi-round chat with ChatGLM2
  • Add shift-RoPE-based streaming LLM
  • Enable MHA fusion for LLMs
  • Support AVX_VNNI and AVX2
  • Optimize the QBits backend
  • Support GELU

Examples

  • Enable fine-tuning for Qwen-7B-Chat on CPU
  • Enable the Whisper C++ API
  • Apply the STS task to BAAI/BGE models
  • Enable the Qwen graph
  • Enable instruction-tuning Stable Diffusion examples
  • Enable Mistral-7B
  • Enable Falcon-180B
  • Enable Baichuan/Baichuan2 examples

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • GCC 13.1, 11.1
  • CentOS 8.4 & Ubuntu 20.04 & Windows 10