Mlas int4 int8 with avx2/512 #20687

Merged: 48 commits into main from liqun/mlas-q4-tile-avx, Aug 2, 2024

Changes from 8 commits

Commits (48)
293f121  quick adapt llama.cpp to experiment performance. Only works with blkl… (liqunfu, May 3, 2024)
04c2e56  fire (liqunfu, May 6, 2024)
cdfda6f  tile 2x4 SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/Symme… (liqunfu, May 7, 2024)
92dad97  use one_16_epi16 and accumulate_2blk_dot: SQNBITGEMM<4>/BlkLen:32/M:2… (liqunfu, May 8, 2024)
5418e9c  apply to M1, BQuant layout pack block (subblk) larger than blklen: SQ… (liqunfu, May 9, 2024)
0401f72  use new AQuant layout (not work if total M is not RangeCountM): SQNBI… (liqunfu, May 10, 2024)
a57eeba  apply blksum to blklen32 and 64: SQNBITGEMM<4>/BlkLen:32/M:2048/N:409… (liqunfu, May 13, 2024)
f2c33af  blklen16 (liqunfu, May 15, 2024)
0ca24f4  impl avx512: SQNBITGEMM<4>/BlkLen:32/M:2048/N:4096/K:4096/Threads:1/S… (liqunfu, May 26, 2024)
7f89d5f  matmul_nbit & fix alignment for sgemm (liqunfu, Jun 1, 2024)
ed0e666  merge main (liqunfu, Jun 4, 2024)
35d02a6  fix mlas benchmark not using multi threads (liqunfu, Jun 10, 2024)
b9493ad  profiling (liqunfu, Jun 10, 2024)
c443eb5  Merge branch 'liqun/mlas-q4-tile-avx' of https://github.com/microsoft… (liqunfu, Jun 10, 2024)
ac66951  sgemm after sq4bit for avx2 (liqunfu, Jun 16, 2024)
42a1305  avx512 (liqunfu, Jun 17, 2024)
740031a  layout to follow compute, M1 separate with M > 1 (liqunfu, Jun 27, 2024)
1a6031e  make avx512 run (liqunfu, Jun 28, 2024)
283fd2d  Merge branch 'main' into liqun/mlas-q4-tile-avx (liqunfu, Jun 28, 2024)
d035939  avx512 blklen64 pass (liqunfu, Jul 4, 2024)
f329d2d  pass avx512 blklen32 (liqunfu, Jul 5, 2024)
27cfd9c  pass avx512 blklen 16, 128, 256 (liqunfu, Jul 5, 2024)
edee319  pass fp32, refactor sqnbitgemm (liqunfu, Jul 11, 2024)
fb9221a  merge main (liqunfu, Jul 12, 2024)
c109b4b  avx512vnni (liqunfu, Jul 18, 2024)
6654d22  merge main (liqunfu, Jul 18, 2024)
4b91bed  avxvnni (liqunfu, Jul 20, 2024)
8674b9f  rm unused ComputeParallelTasksSGemm (liqunfu, Jul 23, 2024)
e26e29e  avoid _mm256_dpbusds_avx_epi32 in avx512vnni (liqunfu, Jul 24, 2024)
2b0307e  fix linux build (liqunfu, Jul 24, 2024)
40df782  Merge branch 'main' into liqun/mlas-q4-tile-avx (liqunfu, Jul 26, 2024)
51e97c8  refactor for Arm64 (liqunfu, Jul 26, 2024)
48e8639  more refactor for Arm64 (liqunfu, Jul 26, 2024)
705aa1f  hsum_float_16 (liqunfu, Jul 29, 2024)
012e9c4  hsum_float_16 (liqunfu, Jul 29, 2024)
21b9138  condition for -mavxvnni (liqunfu, Jul 30, 2024)
1fb1c83  CMAKE_CXX_COMPILER_VERSION VERSION_GREATER 10 (liqunfu, Jul 30, 2024)
85918e9  missed 2 files from (__GNUC__ > 10) (liqunfu, Jul 30, 2024)
9530ac5  missed _mm256_dpbusds_avx_epi32 and print out cmake msgs (liqunfu, Jul 30, 2024)
f77cffd  unused zp, etc. (liqunfu, Jul 30, 2024)
a6fd378  unused zp, etc. (liqunfu, Jul 30, 2024)
c875e5c  remove test code changes (liqunfu, Jul 30, 2024)
3b56710  remove test code changes (liqunfu, Jul 30, 2024)
746562f  lint (liqunfu, Jul 30, 2024)
52fc7fa  lint (liqunfu, Jul 30, 2024)
0933a6b  code name (liqunfu, Jul 30, 2024)
2b35c82  update reviewers' comments (liqunfu, Jul 31, 2024)
caeb35e  Merge branch 'main' into liqun/mlas-q4-tile-avx (liqunfu, Aug 1, 2024)
2 changes: 2 additions & 0 deletions cmake/onnxruntime_mlas.cmake
@@ -38,6 +38,8 @@ onnxruntime_add_static_library(onnxruntime_mlas
${MLAS_SRC_DIR}/qdwconv_kernelsize.cpp
${MLAS_SRC_DIR}/sqnbitgemm.h
${MLAS_SRC_DIR}/sqnbitgemm.cpp
${MLAS_SRC_DIR}/llama.cpp.sgemm.h
${MLAS_SRC_DIR}/llama.cpp.sgemm.cpp
)

target_sources(onnxruntime_mlas PRIVATE
8 changes: 8 additions & 0 deletions onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc
@@ -199,10 +199,18 @@
packed_b_ = IAllocator::MakeUniquePtr<void>(alloc, packed_b_size_, true);
MlasSQNBitGemmPackQuantBData(N_, K_, nbits_, block_size_, compute_type, qptr, packed_b_.get());
if (prepacked_weights) {
// TODO: cannot use packed_b_ after
assert(false);
prepacked_weights->buffers_.push_back(std::move(packed_b_));
prepacked_weights->buffer_sizes_.push_back(packed_b_size_);
}
is_packed = true;
} else if (input_idx == 2) {
// MlasSQNBitGemmPackQuantBData with scales
assert(false);
} else if (input_idx == 3) {
// MlasSQNBitGemmPackQuantBData with zp
assert(false);
}
#endif // defined(ORT_NEURAL_SPEED)

17 changes: 16 additions & 1 deletion onnxruntime/core/mlas/inc/mlas_qnbit.h
@@ -48,6 +48,7 @@ struct MLAS_SQNBIT_GEMM_DATA_PARAMS {
const void* QuantBData = nullptr; ///< address of quantized B (quantized n-bit int values)
const float* QuantBScale = nullptr; ///< address of scale values of quantized B, one per block
const void* QuantBZeroPoint = nullptr; ///< optional address of zero point values of quantized B, one per block
const float* QuantBBlkSum = nullptr; ///< optional address of scale * zp, one per block
const float* Bias = nullptr; ///< optional address of Bias, vector size N
float* C = nullptr; ///< address of result matrix
size_t ldc = 0; ///< leading dimension of C
@@ -159,6 +160,18 @@ MlasSQNBitGemmPackQuantBDataSize(
/**
* @brief Packs the quantized B data in a format that the kernel expects.
*
* If the function is called without QuantBScale and QuantBZeroPoint,
* it just packs QuantBData into PackedQuantBDataAndOrBlkSum.
*
* If the function is called with QuantBData, QuantBScale, and QuantBZeroPoint,
* an additional BlkSum (scale * zero point) is computed and stored in the second part of PackedQuantBDataAndOrBlkSum.
*
* Because ORT OpKernel::PrePack is called for each input (in this case QuantBData,
* QuantBScale, and QuantBZeroPoint) separately, this function may be called 3 times: first with QuantBData,
* then with QuantBScale, and finally with QuantBZeroPoint. On the second call, with QuantBScale,
* BlkSum is computed with the default zero point of 8 and stored in the second part of PackedQuantBDataAndOrBlkSum.
* If there is a third call with QuantBZeroPoint, BlkSum is recomputed/adjusted with the provided zero points.
*
* @param[in] N column size of matrix B and C
* @param[in] K column size of matrix A and row size of matrix B
* @param[in] BlkBitWidth quantized value bit width (e.g., 4 means 4 bit ints)
@@ -176,6 +189,8 @@ MlasSQNBitGemmPackQuantBData(
size_t BlkLen,
MLAS_SQNBIT_GEMM_COMPUTE_TYPE ComputeType,
const void* QuantBData,
void* PackedQuantBData,
void* PackedQuantBDataAndOrBlkSum,
const void* QuantBScale,
const void* QuantBZeroPoint,
MLAS_THREADPOOL* ThreadPool = nullptr
);
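As a side note on the packing comment above: the BlkSum it describes is simply the per-block scale multiplied by the block's zero point, with 8 used when no zero points have been provided yet. A minimal standalone sketch of that arithmetic (not MLAS code; the function name is illustrative, and zero points are taken as one unpacked uint8_t per block for simplicity):

// Standalone illustration only (not part of this PR): per-block BlkSum as described
// in the packing comment above, i.e. scale * zero point, defaulting the zero point
// to 8 when QuantBZeroPoint has not been supplied yet.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<float> ComputeBlkSums(const std::vector<float>& blk_scales,
                                  const uint8_t* blk_zero_points /* may be nullptr */)
{
    std::vector<float> blk_sums(blk_scales.size());
    for (size_t b = 0; b < blk_scales.size(); ++b) {
        // Use the provided zero point when present, otherwise the default of 8.
        const float zp = blk_zero_points ? static_cast<float>(blk_zero_points[b]) : 8.0f;
        blk_sums[b] = blk_scales[b] * zp;
    }
    return blk_sums;
}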
321 changes: 321 additions & 0 deletions onnxruntime/core/mlas/lib/llama.cpp.sgemm.cpp
@@ -0,0 +1,321 @@
// ported/adapted from https://github.com/ggerganov/llama.cpp/pull/6414
#define __AVX2__ 1

#include <assert.h>
#include <immintrin.h>
#include "llama.cpp.sgemm.h"

Check warning on line 6 in onnxruntime/core/mlas/lib/llama.cpp.sgemm.cpp

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Include the directory when naming header files [build/include_subdir] [4] Raw Output: onnxruntime/core/mlas/lib/llama.cpp.sgemm.cpp:6: Include the directory when naming header files [build/include_subdir] [4]
#include "sqnbitgemm.h"

Check warning on line 7 in onnxruntime/core/mlas/lib/llama.cpp.sgemm.cpp

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Include the directory when naming header files [build/include_subdir] [4] Raw Output: onnxruntime/core/mlas/lib/llama.cpp.sgemm.cpp:7: Include the directory when naming header files [build/include_subdir] [4]
//#include "sqnbitgemm_kernel_avx_common.h"
#include <algorithm>
#include <cassert>

#ifdef _MSC_VER
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((__noinline__))
#endif

#if defined(__ARM_NEON) || defined(__AVX512F__)
#define VECTOR_REGISTERS 32
#else
#define VECTOR_REGISTERS 16
#endif
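// Note: AVX-512 (zmm0-zmm31) and AArch64 NEON expose 32 vector registers, while
// SSE/AVX2 expose only 16; this bounds the register tile sizes mnpack() can pick below.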

////////////////////////////////////////////////////////////////////////////////////////////////////
// VECTORIZED FUSED MULTIPLY ADD

/**
* Computes a * b + c.
*/
template <typename T, typename U>
inline U
madd(T a, T b, U c)
{
return add(mul(a, b), c);
}

#if defined(__AVX__) || defined(__AVX2__) || defined(__AVX512F__)
template <>
inline __m256
madd(__m256 a, __m256 b, __m256 c)
{
return _mm256_fmadd_ps(a, b, c);
}
#endif
#if defined(__AVX512F__)
template <>
inline __m512
madd(__m512 a, __m512 b, __m512 c)
{
return _mm512_fmadd_ps(a, b, c);
}
#endif


template <typename TA, typename TB, typename TC>
class tinyBLAS_Q0_AVX2
{
public:
tinyBLAS_Q0_AVX2(int64_t k, const TA *A, int64_t lda, const TB *B, int64_t ldb, TC *C, int64_t ldc,
const float *QuantBScale, int64_t StrideQuantBScale)
: A_q4_(A), B_q8_(B), C(C), k(k), lda_q4_(lda), ldb_q8_(ldb), ldc_(ldc),
Quant4Scale_(QuantBScale), StrideQuant4Scale_(StrideQuantBScale)
{
}

void matmul(int64_t m, int64_t n)
{
mnpack(0, m, 0, n);
}

private:
void mnpack(int64_t m0, int64_t m, int64_t n0, int64_t n)
{
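// Encode the remaining tile shape as 0xMN (each of m and n remaining clamped to 4)
// and dispatch to the largest register-tiled gemm<RM, RN> kernel available; the two
// recursive mnpack calls at the end handle the leftover edge tiles.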
int64_t mc, nc, mp, np;
switch ((std::min(m - m0, (int64_t)4) << 4) | std::min(n - n0, (int64_t)4)) {
#if VECTOR_REGISTERS == 32
case 0x44:
mc = 4;
nc = 4;
gemm<4, 4>(m0, m, n0, n);
break;
case 0x43:
mc = 4;
nc = 3;
gemm<4, 3>(m0, m, n0, n);
break;
case 0x34:
mc = 3;
nc = 4;
gemm<3, 4>(m0, m, n0, n);
break;
case 0x33:
mc = 3;
nc = 3;
gemm<3, 3>(m0, m, n0, n);
break;
case 0x42:
mc = 4;
nc = 2;
gemm<4, 2>(m0, m, n0, n);
break;
case 0x24:
mc = 2;
nc = 4;
gemm<2, 4>(m0, m, n0, n);
break;
#else
case 0x44:
case 0x43:
case 0x42:
mc = 4;
nc = 2;
gemm<4, 2>(m0, m, n0, n);
break;
case 0x34:
case 0x24:
mc = 2;
nc = 4;
gemm<2, 4>(m0, m, n0, n);
break;
case 0x33:
#endif
case 0x32:
mc = 3;
nc = 2;
gemm<3, 2>(m0, m, n0, n);
break;
case 0x23:
mc = 2;
nc = 3;
gemm<2, 3>(m0, m, n0, n);
break;
case 0x41:
mc = 4;
nc = 1;
gemm<4, 1>(m0, m, n0, n);
break;
case 0x22:
mc = 2;
nc = 2;
gemm<2, 2>(m0, m, n0, n);
break;
case 0x14:
mc = 1;
nc = 4;
gemm<1, 4>(m0, m, n0, n);
break;
case 0x31:
mc = 3;
nc = 1;
gemm<3, 1>(m0, m, n0, n);
break;
case 0x13:
mc = 1;
nc = 3;
gemm<1, 3>(m0, m, n0, n);
break;
case 0x21:
mc = 2;
nc = 1;
gemm<2, 1>(m0, m, n0, n);
break;
case 0x12:
mc = 1;
nc = 2;
gemm<1, 2>(m0, m, n0, n);
break;
case 0x11:
mc = 1;
nc = 1;
gemm<1, 1>(m0, m, n0, n);
break;
default:
return;
}
mp = m0 + (m - m0) / mc * mc;
np = n0 + (n - n0) / nc * nc;
mnpack(mp, m, n0, np);
mnpack(m0, m, np, n);
}

template <int RM, int RN>
NOINLINE void gemm(int64_t m0, int64_t m, int64_t n0, int64_t n)
{
constexpr size_t BlkBitWidth4 = 4;
constexpr size_t BlkLen32 = 32;
constexpr size_t BlkDataSizeInBytes16 = MlasQNBitBlkDataSizeInBytes(BlkBitWidth4, BlkLen32);
int64_t ytiles = (m - m0) / RM;
int64_t xtiles = (n - n0) / RN;
int64_t tiles = xtiles * ytiles;
for (int64_t tile = 0; tile < tiles; ++tile) {
int64_t ii = m0 + tile / xtiles * RM;
int64_t jj = n0 + tile % xtiles * RN;
__m256 Cv[RN][RM] = {};
for (int64_t l = 0; l < k; ++l) // blk count (BlockCountK)
for (int64_t j = 0; j < RN; ++j) //
for (int64_t i = 0; i < RM; ++i) {
const std::byte *Quant4ABlk = A_q4_ + lda_q4_ * (ii + i) + l * BlkDataSizeInBytes16;
const std::byte *Quant8BBlk = B_q8_ + ldb_q8_ * (jj + j) + l * Q8BlkSize(BlkLen32);
const float &scale_q8 = Q8BlkScale(Quant8BBlk);
const float &scale_q4 = *(Quant4Scale_ + (ii + i) * StrideQuant4Scale_ + l);

const int8_t zp = 8;
const __m256i q4_v = load_q4(Quant4ABlk, zp);
const __m256i q8_v = load_q8(Quant8BBlk);
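// _mm256_sign_epi8(q4_v, q4_v) yields |q4| (the unsigned operand required by the
// u8 x s8 dot product), while applying q4's sign to q8_v preserves the sign of
// each product term.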
Cv[j][i] = madd(
_mm256_set1_ps(scale_q8 * scale_q4),
updot(_mm256_sign_epi8(q4_v, q4_v), _mm256_sign_epi8(q8_v, q4_v)),
Cv[j][i]
);
}
for (int64_t j = 0; j < RN; ++j)
for (int64_t i = 0; i < RM; ++i)
C[ldc_ * (jj + j) + (ii + i)] = hsum(Cv[j][i]);
}
}

inline float hsum(__m128 x)
{
x = _mm_add_ps(x, _mm_movehl_ps(x, x));
x = _mm_add_ss(x, _mm_movehdup_ps(x));
return _mm_cvtss_f32(x);
}
inline float hsum(__m256 x)
{
return hsum(_mm_add_ps(_mm256_extractf128_ps(x, 1), _mm256_castps256_ps128(x)));
}

inline __m256i load_q8(const std::byte *Quant8Blk)
{
return _mm256_loadu_si256((const __m256i *)Q8BlkData(Quant8Blk));
}

inline __m256i load_q4(const std::byte *Quant4DataPtr, const int8_t zp)
{
// | v0 v16 | v1 v17 | ... | v14 v30 | v15 v31 |
// | v32 v48 | v33 v49 | ... | v46 v62 | v47 v63 |
const __m128i bv_packed0 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(Quant4DataPtr));

const __m128i low_mask = _mm_set1_epi8(15);
const __m128i bv_lo0 = _mm_and_si128(bv_packed0, low_mask); // 0, 1, 2, 3,...
const __m128i bv_hi0 = _mm_and_si128(_mm_srli_epi16(bv_packed0, 4), low_mask); // 16, 17, 18, 19,...
__m256i bv_32_epi8 = _mm256_set_m128i(bv_hi0, bv_lo0);
const __m256i bzp0 = _mm256_set1_epi8(zp);
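// Subtracting the zero point below turns the unsigned nibbles into signed values
// (e.g. the range [-8, 7] for the default zero point of 8).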
bv_32_epi8 = _mm256_sub_epi8(bv_32_epi8, bzp0);

return bv_32_epi8;
}

inline __m256 updot(__m256i u, __m256i s)
{
__m256i res;
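// Unsigned x signed byte dot product accumulated into 32-bit lanes: uses the
// AVX-VNNI dpbusd instruction when available, otherwise the maddubs (u8*s8 -> s16
// pairs) plus madd-by-1 (s16 -> s32) fallback.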
#if defined(__AVXVNNI__) || (defined(__AVX512VNNI__) && defined(__AVX512VL__))
res = _mm256_dpbusd_epi32(_mm256_setzero_si256(), u, s);
#else
res = _mm256_madd_epi16(_mm256_set1_epi16(1), _mm256_maddubs_epi16(u, s));
#endif
return _mm256_cvtepi32_ps(res);
}

const TA *const A_q4_;
const TB *const B_q8_;
TC *const C;
const int64_t k;
const int64_t lda_q4_;
const int64_t ldb_q8_;
const int64_t ldc_;
const float *Quant4Scale_;
int64_t StrideQuant4Scale_;
};

/**
* Performs optimized matrix multiplication on CPU.
*
* This subroutine may compute C = Aᵀ * B with column major ordering.
* Despite its name, this isn't a generalized implementation. Work is
* only performed when a handwritten kernel is written and available.
* Otherwise the caller should fall back to a general matmul routine.
*
* In this port the original GGML thread/task/type parameters have been dropped;
* the per-block scales of the 4-bit quantized matrix and their stride are passed instead.
*
* @param m is rows in `A` and `C`
* @param n is cols in `B` and `C`
* @param k is cols in `A` and rows in `B`
* @param A is first input matrix (always transposed)
* @param lda is row stride of `A`
* @param B is second input matrix (never transposed)
* @param ldb is row stride of `B`
* @param C is input/output array of output matrices
* @param ldc is row stride of `C`
* @param QuantBScale is address of the per-block scales of the 4-bit quantized matrix `A`
* @param StrideQuantBScale is stride, in blocks, between rows of `QuantBScale`
* @return true if this function was able to service the matmul request
*/
bool
llamafile_sgemm(
int64_t m,
int64_t n,
int64_t k,
const std::byte *A,
int64_t lda,
const std::byte *B,
int64_t ldb,
float *C,
int64_t ldc,
const float *QuantBScale,
int64_t StrideQuantBScale
)
{
tinyBLAS_Q0_AVX2<std::byte, std::byte, float> tb{k, A, lda, B, ldb, C, ldc, QuantBScale, StrideQuantBScale};
tb.matmul(m, n);
return true;
}
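For orientation, a hedged sketch of how a caller might drive this ported entry point, assuming the quantized buffers have already been laid out the way the kernel above expects (the wrapper name and the stride choice are illustrative, not from the PR):

// Illustrative wrapper only: A holds M rows of packed 4-bit blocks (lda bytes apart),
// B holds N rows of Q8 blocks (ldb bytes apart), C is a column-major M x N output with
// leading dimension ldc, and QuantBScale holds BlockCountK floats per row of A.
#include <cstddef>
#include <cstdint>
#include "llama.cpp.sgemm.h"

bool RunQ4Q8Gemm(int64_t M, int64_t N, int64_t BlockCountK,
                 const std::byte* A, int64_t lda,
                 const std::byte* B, int64_t ldb,
                 float* C, int64_t ldc,
                 const float* QuantBScale)
{
    // k is the number of BlkLen-32 blocks along K; scales are indexed per (row of A, block).
    return llamafile_sgemm(M, N, BlockCountK, A, lda, B, ldb, C, ldc,
                           QuantBScale, /*StrideQuantBScale*/ BlockCountK);
}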
5 changes: 5 additions & 0 deletions onnxruntime/core/mlas/lib/llama.cpp.sgemm.h
@@ -0,0 +1,5 @@
#include <cstdint>
#include <cstddef>

bool
llamafile_sgemm(int64_t m, int64_t n, int64_t k, const std::byte *A, int64_t lda, const std::byte *B, int64_t ldb, float *C, int64_t ldc, const float *QuantBScale, int64_t StrideQuantBScale);