Releases: halide/Halide

v19.0.0

17 Dec 02:03

Major improvements

  • Halide is now available for both C++ and Python usage via Pip. Try pip install halide today!
  • The Vulkan backend has matured substantially.
  • The HTML "conceptual statement" output now supports dark mode viewing.
  • For developers, CMake 3.28 is now required and we no longer require an internet connection during the build.
  • Thread pool improvements mean that workloads that do a small number of small tasks in parallel (e.g. a cheap operation applied to a small image) are up to 3x faster. If you have schedules that do not use parallelism for small inputs because you found it didn't provide any speedup, you may want to re-benchmark.
  • You can now query properties of the compiled-for target as Exprs, simplifying helper code that wants to do different things depending on the target architecture. Example: f(x) = select(target_arch_is(Target::ARM), 3, 7). Helpers include target_arch_is, target_os_is, target_has_feature, target_bits, and target_natural_vector_size. These are resolved to constants at compile-time and simplified away. Use with care, as this (intentionally) results in different behavior on different platforms.
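The target-query item above can be pictured with a toy model. This is a minimal plain-Python sketch, not the Halide API: `Target`, `target_arch_is`, `select`, and `compile_for` here are local stand-ins that only borrow the names from the release note. It assumes a query is a deferred value that becomes a constant once the compile target is known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Stand-in for a compile target (hypothetical, not Halide's Target)."""
    arch: str
    bits: int


def target_arch_is(arch):
    # A deferred query: it cannot be answered until a target is supplied.
    return lambda target: target.arch == arch


def select(cond, if_true, if_false):
    # A deferred choice between two constants, driven by a deferred query.
    return lambda target: if_true if cond(target) else if_false


def compile_for(expr, target):
    # "Compiling" resolves every query against the target, leaving a constant.
    return expr(target)


expr = select(target_arch_is("arm"), 3, 7)
on_arm = compile_for(expr, Target(arch="arm", bits=64))
on_x86 = compile_for(expr, Target(arch="x86", bits=64))
```

The same expression yields a different constant per target, which is the platform-dependent behavior the note warns about.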

Breaking changes

  • We now distribute libGenGen.a rather than GenGen.cpp.
    • Downstream users should link to this library with /WHOLEARCHIVE: or -Wl,--whole-archive rather than build GenGen.cpp themselves.
    • Users of the CMake package should be unaffected.
  • In keeping with our LLVM support policy, support for LLVM 16 has been removed.
  • We no longer use the le64/le32 generic targets for compiling runtime modules to LLVM. These targets were removed in LLVM upstream.

What's Changed

Apps and tests

Autoschedulers

  • Consider all Exprs a func uses, not just the RHS, in Li2018 by @abadams in #8326

Build system

CodeGen

Debugging

Documentation

Frontend

Hardware backends

LLVM

Python

Runtime

  • Fix profiler to report time spent on GPU kernels again instead of on 'wait for parallel tasks' by @mcourteaux in #8453
  • Don't spin on the main mutex while waiting for new work by @abadams in #8433

Minor bugfixes / other cleanup

Halide v18.0.0

17 Jul 20:31
8c651b4

Changes Of Note since Halide 17

  • Ring-buffering now supported in schedules (Func::ring_buffer()). This is distinct from fold_storage in that it folds across time (the loop variables) rather than folding across space (the pure vars of the Func).
  • Fixed a longstanding bug in lossless_cast()
  • Lots of fixes for Vulkan backend
  • OpenGLCompute is no longer supported
  • Added support for ARM SVE2
  • Added (basic) support for Intel APX and AVX10
  • Added support for Hexagon HVX v68
  • Added support for numpy's .npy format to .debug_to_file() and the code in halide_image_io.h
  • Python bindings now support bfloat and int64 properly
  • Hacky code that auto-named Funcs, Vars etc via DWARF introspection was removed
  • The profiler was revamped to behave better when multiple Halide pipelines are in flight at the same time.
  • Numerous lowering passes were sped up, resulting in faster compilation for large pipelines. However, time spent in LLVM is still the long pole for most pipelines.
  • Fixed-point instruction selection has been improved via tracking constant integer bounds of expressions.
  • Adds feature detection for ARM CPUs to the runtime library and to the host target feature computation. Supports Windows, macOS, Linux, iOS, and Android.
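The ring-buffering item at the top of this list (Func::ring_buffer()) folds storage across loop iterations rather than across a Func's spatial coordinates. A minimal plain-Python sketch of that idea follows. It makes no Halide calls, `produce_tile` and `run_ring_buffered` are hypothetical names, and it assumes the producer for iteration t+1 runs before the consumer of iteration t, as it might under an overlapped (asynchronous) schedule.

```python
def produce_tile(t, width):
    # Hypothetical producer: tile t holds t*width .. t*width + width - 1.
    return [t * width + i for i in range(width)]


def run_ring_buffered(num_iters, width, extent):
    # Storage lives outside the loop and has `extent` slots, one tile each.
    ring = [[0] * width for _ in range(extent)]
    ring[0] = produce_tile(0, width)  # prime the slot for iteration 0
    sums = []
    for t in range(num_iters):
        # Produce the next tile first; it lands in slot (t + 1) % extent.
        if t + 1 < num_iters:
            ring[(t + 1) % extent] = produce_tile(t + 1, width)
        # Consume the current tile from slot t % extent.
        sums.append(sum(ring[t % extent]))
    return sums


double_buffered = run_ring_buffered(num_iters=5, width=4, extent=2)
single_slot = run_ring_buffered(num_iters=5, width=4, extent=1)
```

With two slots the consumer always sees the tile for its own iteration. With one slot the early producer overwrites the tile before it is read, so the sums come out wrong. The slot index depends on the loop iteration t, not on a coordinate within the tile, which is the distinction from fold_storage drawn above.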

Deprecations / Removals

  • tuple_select() has been removed in favor of overloads to select().
  • Various fixed-point operators have been removed from the Halide::Internal namespace and are now in the public Halide namespace.

What's Changed

Halide v17.0.2

25 Jun 15:30
b2e6d2a

What's Changed

  • Backport a fix for the simpler bug in lossless_cast by @abadams in #8264
  • Fix Vulkan SIMT mappings for GPU loop vars; avoid formatting the GPU kernel to a string for Vulkan (since its binary SPIR-V needs to remain intact) by @derek-gerstmann in #8270

Full Changelog: v17.0.1...v17.0.2

Halide v17.0.1

20 Feb 19:50
5254117

What's Changed

  • Changes to make WebGPU code compliant with recent versions of Emscripten (#8106)
  • Fix rfactor adding too many pure loops (#8107)
  • Forward the partition methods from generator outputs (#8090)
  • Fix reduce_expr_modulo of vector in Solve.cpp (#8107)

Full Changelog: v17.0.0...v17.0.1

Halide v17.0.0

02 Feb 00:30

Changes Of Note

  • ParamMap has been removed entirely from the public API. All users of ParamMap should migrate to Callable instead.
  • Halide::Parameter has been moved to the public Halide API (it was formerly "internal" and not intended for public use).
  • New scheduling primitives:
    • Func::partition() and friends: Set the loop partition policy, which controls how/whether a loop is split into three loops (prologue/steady-state/epilogue). Loop partitioning can be useful to optimize boundary conditions (e.g. clamp_edge).
    • Func::hoist_storage() and friends: Allows a function's storage to be moved to a given loop level. Unlike Func::store_at(), no optimizations are triggered (e.g. sliding window).
  • New TailStrategy options for existing scheduling directives:
    • ShiftInwardsAndBlend: Equivalent to ShiftInwards, but protects values that would be re-evaluated by loading the memory location that would be stored to, modifying only the elements not contained within the overlap, and then storing the blended result. Unlike ShiftInwards, this is valid to use in update definitions.
    • RoundUpAndBlend: Equivalent to RoundUp, but protects values that would be written beyond the end by loading the memory location that would be stored to, modifying only the elements within the region being computed, and then storing the blended result. Unlike RoundUp, this is valid to use on non-outermost splits in update definitions.
  • Substantially improved performance and display in the VizIR output.
  • Profiler improvements:
    • Substantially nicer text output
    • Injects timing into calls for copy_to_host and copy_to_device so you can measure host<->device copy overhead
    • Allows optional sorting via the HL_PROFILER_SORT env var
  • Substantially faster codegen for several GPU backends.
  • Experimental serialization/deserialization feature allows for saving of Halide IR code.
  • Various bug fixes and improvements in the Anderson2021 autoscheduler.
  • Improved ARM codegen, including: better patterns for sdot/udot; improved shift/mul codegen.
  • Support for Zen4 architecture in the x86 backend.
  • Updates to the ONNX app.
  • Various fixes and improvements to sliding-window and storage-folding.
  • Improvements to slow gather operations for some x86 variants.
  • Improvements to correctness for the .async() scheduling directive.
  • Improved codegen for float16 conversion, especially on x86.
  • Several compile-time warnings of dubious usefulness disabled.
  • WebAssembly codegen now defaults to assuming that saturating-float-to-int and sign-extension instruction sets are always available.
  • Target now does some reality-checking that it doesn't contain obviously nonsensical Feature combinations
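The difference between the new blending TailStrategy options and their older counterparts, described earlier in this list, shows up most clearly on an in-place update such as f(x) += 1. The following is a minimal plain-Python model, not Halide code: `apply_update` and the strategy strings are hypothetical, and the padded list stands in for memory beyond the end of the region. It assumes a loop of extent n split by a fixed factor, with each tile handled as one vector load, compute, and store.

```python
def apply_update(n, factor, strategy):
    """Apply buf[x] += 1 for x in [0, n), tile by tile, under a tail strategy."""
    assert n >= factor
    num_tiles = -(-n // factor)  # ceiling division
    # Elements at index >= n model memory past the end of the region.
    buf = [0] * (num_tiles * factor)
    for tile in range(num_tiles):
        start = tile * factor
        overlap = 0
        if strategy.startswith("shift_inwards"):
            # Slide the last tile back so it ends exactly at n.
            shifted = min(start, n - factor)
            overlap = start - shifted  # lanes already updated by the previous tile
            start = shifted
        old = buf[start:start + factor]  # vector load
        new = [v + 1 for v in old]       # vector compute
        if strategy == "shift_inwards_and_blend":
            # Keep the loaded values in the overlap, update only the rest.
            new = [o if i < overlap else v for i, (o, v) in enumerate(zip(old, new))]
        elif strategy == "round_up_and_blend":
            # Keep the loaded values for lanes past the end of the region.
            new = [v if start + i < n else o for i, (o, v) in enumerate(zip(old, new))]
        buf[start:start + factor] = new  # vector store
    return buf


shift = apply_update(10, 4, "shift_inwards")
shift_blend = apply_update(10, 4, "shift_inwards_and_blend")
round_up = apply_update(10, 4, "round_up")
round_up_blend = apply_update(10, 4, "round_up_and_blend")
```

In this model the plain shift applies the update twice to the two overlapping elements, and the plain round-up writes past index 9. Both blended variants leave every element in [0, 10) updated exactly once and the padding untouched.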

What's Changed

Halide v16.0.0

24 Jun 01:10
027547f

What's Changed

General Notes

  • Support for the Vulkan API (w/SPIR-V codegen)
  • Support for WebGPU (experimental)
  • Improved Halide IR HTML Visualization
  • Fixed a regression in the Adams2019 auto-scheduler that disabled sub-tiling
  • Added GPU auto-scheduler (Anderson2021)

Efficient Automatic Scheduling of Imaging and Vision Pipelines for the GPU
Luke Anderson, Andrew Adams, Karima Ma, Tzu-Mao Li, Tian Jin, Jonathan Ragan-Kelley
Proceedings of the ACM on Programming Languages (OOPSLA 2021)

Deprecations / Removals

  • OpenGLCompute has been deprecated
  • ParamMap has been deprecated
  • Deprecated HVX_shared_object feature has been removed
  • References to deprecated fixed-point operators have been removed
  • Deprecated halide_target_feature_disable_llvm_loop_opt has been removed
  • Deprecated MIPS device support has been removed

Notable Fixes & Changes

New Contributors

Halide v15.0.1

07 Apr 23:21
4c63f1b

What's Changed

  • The Python binding of compile_to_callable() was not properly copying from device to host for output buffers, so output was typically black (or garbage) when used with a GPU target. (#7213)
  • The bin directory was missing from the installs.
  • Upgraded LLVM to 15.0.7
  • New in 15.0.0, but restated here for visibility: The target flag disable_llvm_loop_opt is deprecated, as it's now the default behavior. This means that we have turned off llvm's autovectorization and loop unrolling. This should not affect any schedules with manually-specified vectorization and unrolling, other than trimming code size a little. However, schedules that do not vectorize or unroll may slow down because they were (intentionally or not) relying on llvm to do it automatically. If you see a performance regression with Halide 15, try turning on the enable_llvm_loop_opt target flag.

Halide v15.0.0

06 Mar 23:38
d7651f4

What's Changed

General Notes

  • Support for RISC V Vector architectures.

  • Python-related:

    • Halide builds for Python are now published to PyPI, so the Halide Python bindings can be installed simply by running pip install halide.
    • Major improvements were made to the Python bindings, with many missing or incomplete sections of the API added or filled in.
    • We now support the use of Generators from Python (for both JIT and AOT usage).
    • The standard CMake rules now support generating a Python extension directly.
    • Support for Python was removed from Halide's Makefiles; you must use CMake to build the Python bindings.
  • Halide::Func now allows you to (optionally) constrain the type(s) of Exprs that the Func can contain, and/or the dimensionality of the Func.

  • Added a new way to use the JIT (compile_to_callable) that allows calling a jitted function with the same syntax as for AOT-compiled functions, allowing more control over JIT lifespan, as well as thread-safe arguments without requiring ParamMap

  • General improvements to SIMD codegen

  • Several rarely-used parts of the C++ Generator API were deprecated, and the way that autoschedulers are specified for AOT compilation is now completely different (but better for future expandability).

  • CMake builds now require >= v3.22

  • WABT usage requires >= v1.0.30

  • LLVM 12 is no longer supported

  • The target flag disable_llvm_loop_opt is deprecated, as it's now the default behavior. This means that we have turned off llvm's autovectorization and loop unrolling. This should not affect any schedules with manually-specified vectorization and unrolling, other than trimming code size a little. However, schedules that do not vectorize or unroll may slow down because they were (intentionally or not) relying on llvm to do it automatically. If you see a performance regression with Halide 15, try turning on the enable_llvm_loop_opt target flag.

Notable bug fixes

  • Make Halide::round behave as documented (#7012)
  • Incorrect folding of saturating_sub (#6883)
  • The check for race conditions didn't consider where clauses (#6808)
  • Performance regression for x86 for certain LLVM versions (#6783)
  • Fusing a specialization drops compute_withs from generated code (#6770)
  • Incorrect output when realize condition depends on tuple call (#6915)
  • Python extensions should default to throwing exceptions rather than calling abort() for errors (#6986)
  • Python bindings didn't support bool buffers (#7006)
  • Python bindings didn't support float16 buffers (#7060)
  • Python extensions that executed on GPU didn't copy back to host properly (#6869)
  • Fix bugs in div_round_to_zero and fast_integer_divide_round_to_zero (#7008)
  • Bugs in add_requirement() (#7045)

Major changes

Minor changes

Halide 14.0.0

07 Apr 23:02
6b9ed2a

What's Changed

Major changes

  • @abadams
    • Add ability to pass a user context in JIT mode (#6313)
    • Reenable warning about unscheduled update definitions (#6602)
  • @alexreinking
    • Add helper for cross-compiling Halide generators. (#6366)
  • @LebedevRI
    • Implement SanitizerCoverage support (Refs. #6513) (#6517)
  • @steven-johnson
    • Expand optional static-typing for Buffer to include dimensionality (#6574)
    • Deprecate the Generator::build() method (#6580)
    • Move GeneratorContext into a standalone class (#6618)
    • Python Bindings didn't allow for zero-D Funcs, ImageParams, Buffers (#6633)
  • @zvookin
    • Timer based profiler (#6642)

Minor changes

  • @abadams
    • Deprecate JIT runtime override methods that take void * (#6344)
    • Allow users to use their own cuda contexts and streams in JIT mode (#6345)
    • Add --help flag to rungenmain, fixing #5323 (#6354)
    • Do target-specific lowering of lerp (#6432)
    • Reduce overhead of sampling profiler by having only one thread do it (#6433)
    • Skip custom cuda context test on older GPUs (#6437)
    • Avoid needless gather in fast_integer_divide lowering (#6441)
    • Fixes for c++20 (#6446)
    • Add a fast integer divide that rounds to zero (#6455)
    • Let lerp lowering incorporate a final cast. (#6480)
    • Try removing optional buffer added to closure (#6481)
    • rounding shift rights should use rounding halving add (#6494)
    • Make random faster by putting the innermost var last (#6504)
    • Make it possible to interpret a wide type as multiple smaller elements (#6506)
    • Handle mixed-width args to mul-shift-right (#6526)
    • Attempted redo of faster noise (#6539)
    • Better default lowering of absd (#6545)
    • Make HALIDE_REGISTER_GENERATOR work with multiple template args (#6556)
    • Rename Output to OutputFileType and deprecated Output (#6568)
    • Remove incorrect not-multiple-of-16 claim (#6573)
    • Fix bug in mul_shift_right matching (#6610)
  • @alexreinking
    • Add super-build for cross-compiling HANNK (#6374)
    • Fix empty INSTALL_COMMAND in hannk super-build (#6387)
    • Remove halide_config.cmake from Makefile build. Fixes #6615 (#6616)
    • Make IRComparer consider nans to be less than non-nans. (#6626)
  • @ashishUthama
    • Include LICENSE.txt in package (#6428)
  • @dsharletg
    • Fix description of rounding_shift_left/rounding_shift_right (#6549)
  • @Elarnon
    • Only commutative reductions can be parallelized (#6609)
  • @jinderek
    • Support new warp shuffle intrinsics after CUDA Volta architecture (#6505)
  • @knzivid
    • python_bindings: Fix SIGSEGV in HalidePythonCompileTimeErrorReporter (#6635)
  • @LebedevRI
    • [CMake] Deduplicate Halide_LLVM_VERSION and LLVM_PACKAGE_VERSION (#6646)
  • @masahi
    • [APP] Fix hexagon_benchmarks build (use two-var prefetch) (#6563)
  • @mcleary
    • Add support for AMX instructions (#5818)
  • @mcourteaux
    • Include GPU source kernels in Stmt and StmtHtml file. (#6444)
    • Syntax highlighting for embedded PTX code. (#6447)
  • @mgharbi
    • Fixes the Pytorch Wrapper Codegen for CPU-only machines. (#6590)
  • @OmarEmaraDev
    • Fix default device wrap native function (#6310)
    • Fix wrong type in Ramp CodeGen for OpenGLCompute (#6349)
    • Vectorize Ramp in OpenGLCompute backend (#6372)
    • Support vectorization in OpenGLCompute backend (#6348)
    • Support vectorized Select in OpenGLCompute backend (#6371)
  • @rootjalex
    • Make bounds of let visitor use unique_name() (#6583)
    • Remove incorrect docs on widening_add (#6625)
    • Disallow Type::narrow() and Type::widen() from producing bitwidths between 1 and 8 bits (#6622)
    • Wild match object should not be foldable (#6623)
    • Clear bounds info on casts when value bounds are undefined for overflow types (#6640)
  • @slomp
    • decommissioning StackPrinter (#6470)
  • @steven-johnson
    • [hannk] Fix MeanOp (#6336)
    • Add using OpVisitor::visit; to various OpVisitors to avoid overload warnings for some compilers (#6337)
    • [hannk] Add a prepare() method for ops and interp (#6338)
    • Fix WASM datalayout for top-of-tree LLVM (#6339)
    • Make halide_type_t and halide_type_of constexpr (#6340)
    • Harvest IWYU changes for LLVM, WABT (#6341)
    • Fix HelloWasm (#6342)
    • Fix Makefile for LLVM11 (injection from #5818) (#6343)
    • [hannk] requantize() should never skip the operation (#6350)
    • [hannk] augment SoftmaxOp to allow specifying axis (#6351)
    • Use Node instead of d8 for Wasm AOT testing (#6356)
    • [hannk] Add missing call to Interpreter::prepare in benchmark app (#6358)
    • [hannk] Allow disabling TFLite+Delegate build in CMake (#6360)
    • [hannk] Add support for building/running for wasm (#6361)
    • Update Emscripten settings (#6362)
    • [hannk] Clean up aliasing (v2) (#6364)
    • [hannk] tests should only process .tflite files (#6368)
    • Revamp Hannk IR (#6379)
    • Fix for top-of-tree LLVM (#6380)
    • Remove halide_assert() from halide_default_device_wrap_native (#6381)
    • Rename halide_assert -> halide_abort_if_false (#6382)
    • Convert various halide_assert -> static_assert (#6383)
    • Fix for top-of-tree LLVM (#6386)
    • Check results of all runtime function calls (#6389)
    • Add halide_debug_assert() macro (#6390)
    • [hannk] Have CMake emit .s, .stmt, .ll files (#6392)
    • [hannk] Upgrade hannk to use TFLite 2.7.0 by default (#6393)
    • Clean up CodeGen_LLVM names to match ASAN nomenclature changes (#6395)
    • Drop support for LLVM11 (#6396)
    • Move PyTorch test into standalone tests (#6397)
    • Remove halide_abort_if_false() usage in runtime/metal (#6398)
    • Fix OGLC debug builds (#6399)
    • Add defensive checks to halide_buffer_copy_already_locked (#6401)
    • _halide_buffer_crop() needs to check for runtime failures (v2) (#6403)
    • Fix broken ASAN code (#6408)
    • [hannk] Pacify clang-tidy (#6412)
    • One more ASAN fix (#6413)
    • [hannk] Fix lower_tflite_fullyconnected (#6414)
    • Fix Introspection issues (#6424)
    • Don't remap the function name or the target in the metadata (#6430)
    • Set up SANITIZER_FLAGS and OPTIMIZE for apps/Makefile.inc (#6435)
    • Ensure that halide_start_clock() is called before halide_current_time… (#6438)
    • Codegen_C: buffer compilation needs to special-case scalar buffers (#6442)
    • Add operator<< for Closure (#6443)
    • Re-enable performance_async_gpu for D3D12Compute (#6450)
    • Tweak Hexagon codegen output (#6461)
    • Add LinkageType::ExternalPlusArgv (#6452) (#6463)
    • Fix Closure API (#6464)
    • Move null check from Printer to halide_string_to_string() (#6467)
    • Deal with Printer::scratch (#6469) (#6472)
    • Restore support for using V8 as the Wasm JIT interpreter (#6478)
    • Fail if no_bounds_query specified for HL_JIT_TARGET (#6489)
    • Document the usage of llvm::legacy::PassManager (#6491)
    • Update WABT to 1.0.25 (#6497)
    • Grab Bag of minor cleanups to LowerParallelTasks (#6498)
    • Update simd_op_check for arm64 upz1 code generation (#6499) (#6500)
    • Fix size_t -> int conversion warning (#6501)
    • Fix simd-op-check for top-of-tree LLVM (#6529)
    • Revert "Make random faster by putting the innermost var last" (#6538)
    • Fix GeneratorOutput_Buffer::set_estimates() (#6540)
    • Revert "Make it possible to interpret a wide type as multiple smaller elements" (#6541)
    • Convert apps/hannk/Elementwise to use generate() (#6543)
    • Fixes for top-of-tree LLVM (#6546) (#6548)
    • Fix deprecation warnings in Python tutorials (#6552)
    • Use add_halide_generator() everywhere in apps/ (#6554)
    • Fix for top-of-tree LLVM (#6561)
    • Enable simd_op_check test for wasm i8x16.popcnt (#6562)
    • Revert "Fix for top-of-tree LLVM" (#6564)
    • wasm simd cleanup (#6566)
    • Add support for wasm-simd ops for integer-integer widening (#6567)
    • Add explicit to a handful of Generator-related ctors. (#6569)
    • Fix typo in comment in HalideBuffer.h (#6570)
    • Allow calling scheduling methods on Output<Buffer[]> (#6577)
    • Fix for top-of-tree LLVM (#6579)
    • Fix Win32-specific breakage in top-of-tree LLVM (#6581)
    • Convert apps/ to use static Buffer dims where useful (#6585)
    • Various fixes to static-dimensioned Buffer (#6589)
    • Convert Buffer<> usage in python_bindings/ to use static dimensions (#6591)
    • Convert Buffer<> usage in test/generators to use static dimensions (#6592)
    • Rename BufferDimsUnconstrained -> AnyDims (#6594)
    • Allow building with LLVM15 (#6603)
    • Update WasmExecutor for WABT API changes (#6612)
    • Minor Generator cleanup (#6613)
    • Unbreak WABT again by using main instead of a commit (#6614)
    • Update apps/hannk to use TFLite 2.8.0 (#6617)
    • Update WABT version to the just-released 1.027 (instead of main) (#6619)
    • Clean up python_binding Makefile (#6634)
    • Fix const-correctness in C/C++ backend (Issue #6636) (#6638)
    • Convert most remaining Generators to prefer statically-dimensioned In… (#6641)
    • Allow profiler feature under wasm iff wasm_threads is enabled (#6643)
    • Fix UB in hannk FillWithRandom operation. (#6645)
    • Update initialization of WABT store field to work with top-of-tree (#6649)
    • Fix apparent typo in PR #6294 (#6653)
    • Eliminate some unnecessary clamping in ClampUnsafeAccesses (#6297) (#6654)
    • Python Bindings: fix Python bool -> Expr implicit conversion (#6657)
    • Fix 'variable set but not used` warning/error (#6658)
    • Allow make test_apps to work with ASAN (#6659)
    • Add optional runtime H::R::Buffer access checks (#6660)
    • Add ldscript code for Python extensions in CMake (#6665)
    • Remove the nobuild/partialbuildmethod tests from...
Halide 13.0.4

22 Jan 00:51

This is a patch release that fixes a single bug relating to multiple outputs that depend on each other (#6375).