Dear Community, thanks to everyone's effort in the past few months. This is a proposal to do a v0.6 release.
This release will be managed by the TVM PMC, with @yzhliu and myself as moderators. In the next few days we will be populating the release notes in this thread. Most of the release note content will be derived from our monthly reports.
We also encourage everyone in the community to reply to the thread about pending PRs that should be included in the v0.6.
This is our first release after moving to the Apache repo, so the main goal is to pass the general reviews and make sure the released product meets the ASF requirements. We hope to use this release to streamline future releases.
New Features
Relay in Production
Relay is a functional, differentiable programming language designed to be an expressive intermediate representation for machine learning systems. Relay supports algebraic data types, closures, control flow, and recursion, allowing it to directly represent more complex models than computation graph-based IRs (e.g., NNVM) can. In TVM v0.6, Relay is in a stable phase and is ready for production.
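As a minimal sketch of the workflow (using the v0.6 Python API; the toy add function, shapes, and target below are only illustrative):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Build a tiny Relay function: f(x, y) = x + y over 2x2 float32 tensors.
x = relay.var("x", shape=(2, 2), dtype="float32")
y = relay.var("y", shape=(2, 2), dtype="float32")
func = relay.Function([x, y], relay.add(x, y))
mod = relay.Module.from_expr(func)

# Compile for a CPU target and execute through the graph runtime.
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target="llvm")

rt = graph_runtime.create(graph, lib, tvm.cpu())
rt.set_input("x", np.ones((2, 2), dtype="float32"))
rt.set_input("y", np.ones((2, 2), dtype="float32"))
rt.run()
print(rt.get_output(0))
```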
- Algebraic Data Types (ADT) support (#2442, #2575). ADT provides an expressive, efficient, and safe way to realize recursive computation (e.g., RNN). Refer to https://docs.tvm.ai/langref/relay_adt.html for more information.
- Pass manager for Relay (#2546, #3226, #3234, #3191)
- Most frontend frameworks are supported in Relay, including ONNX, Keras, TensorFlow, Caffe2, CoreML, NNVMv1, and MXNet (#2246).
- Explicitly manifest memory and tensor allocations in Relay. (#3560)
Relay Virtual Machine
The Relay Virtual Machine (Relay VM) is the new generation of runtime to strike a balance between performance and flexibility when deploying and executing Relay programs. Previously, the graph runtime was able to exploit the fully static nature of the input graphs to perform aggressive optimizations such as fully static allocation and optimal memory reuse. When we introduce models that make use of control flow, recursion, dynamic shapes, and dynamic allocation, we must change how execution works.
Relay VM is now usable and is able to achieve decent performance for a variety of models and targets.
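For example, a Relay program can be executed on the VM instead of the graph runtime. This is only a rough sketch; it assumes the relay.create_executor entry point with kind="vm" and its evaluate() helper behave as in the v0.6 Python API:

```python
import numpy as np
import tvm
from tvm import relay

# Any Relay program can run on the VM; it is most useful for programs with
# control flow, recursion, or dynamic shapes that the graph runtime cannot handle.
x = relay.var("x", shape=(4,), dtype="float32")
func = relay.Function([x], relay.nn.relu(x))
mod = relay.Module.from_expr(func)

# kind="vm" selects the Relay Virtual Machine executor.
ex = relay.create_executor(kind="vm", mod=mod, ctx=tvm.cpu(), target="llvm")
out = ex.evaluate()(np.array([-1.0, 0.0, 2.0, -3.0], dtype="float32"))
print(out)
```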
- Design (#2810, #2915) and a first version of the implementation (#2889)
- Add VM runtime for Relay and compiler support (#3120, #3121, #2889, #3139)
- Relay VM (pattern matching #3470, port to python #3391, serialization #3647)
- Relay VM Profiler (#3727)
- Support execution on devices for Relay VM (#3678)
- [Relay][VM] Add more passes to VMCompiler (#4058)
- [relay][vm] Separate VM runtime with executable (#4100)
- Port VM, VM compiler, and Object into Python (#3391)
- VM: Add AllocTensor instruction and better instruction printer (#3306)
- [Relay][VM][Interpreter] Enable first-class constructors in VM and interpreter via eta expansion. (#4218)
- [Relay][VM] Clean up the VM and VM profiler code (#4391)
Training
Relay is designed to natively support first-order and higher-order differentiation. The automatic differentiation infrastructure is now usable, and a number of operators with gradient support are available in the v0.6 release.
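As a rough sketch of what the infrastructure enables (assuming the relay.transform.gradient entry point and pass-manager API of this release), the gradient of a small function can be derived as a new Relay function:

```python
import tvm
from tvm import relay

# f(x) = sum(x * x); its gradient with respect to x is 2 * x.
x = relay.var("x", shape=(3,), dtype="float32")
func = relay.Function([x], relay.sum(x * x))

# Type inference is required before differentiation.
mod = relay.Module.from_expr(func)
mod = relay.transform.InferType()(mod)

# The pass returns a function computing the original value together with
# the gradients of all inputs.
grad_func = relay.transform.gradient(mod["main"], mod=mod, mode="higher_order")
print(grad_func)
```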
- Higher order reverse mode automatic differentiation that works with control flow (#2496)
- Higher order continuation passing style (#3456, #3485 )
- Relay gradient registration (clip #3509, max_pool2d and avg_pool2d #3601)
- Relay AD algorithm (#3585)
- Relay Training - allow gradient to return a tuple (#3600), numerical gradient check (#3630)
- Improve AD for concatenate (#3729)
- [Relay][Training] Add missing gradient check to gradient pass (#4169)
- As a part of Relay's automatic differentiation system, we are adding primal gradients for Relay operators. Please refer to #2562 for tracking the progress.
- Gradient for Conv2d (#3636)
- Add gradient operators (#3857, #3894, #3901, #3915)
- Add gradient for log-softmax (#4069)
- [Relay][Training] Add gradient for Crossentropy (#3925)
- [Relay][Training] Add and fix gradients (#4126)
Quantization
Low-bit inference is increasingly popular as it benefits both performance and storage usage. TVM now supports two types of quantization. 1. Automatic quantization takes a floating-point precision model, performs per-layer calibration, and generates a low-bit model. 2. TVM can also import pre-quantized models from TensorFlow and MXNet; a new dialect, QNN, is introduced to handle further lowering to normal operators.
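For automatic quantization, a typical invocation looks roughly like the following sketch; mod and params stand for a float32 model imported from any frontend, and the qconfig settings shown are only illustrative:

```python
from tvm import relay

def quantize_model(mod, params):
    """Quantize a float32 Relay module to a low-bit one (illustrative settings)."""
    with relay.quantize.qconfig(nbit_input=8,
                                nbit_weight=8,
                                global_scale=8.0,
                                skip_conv_layers=[0]):
        qmod = relay.quantize.quantize(mod, params=params)
    return qmod
```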
- Automatic Quantization
- Low-bit automatic quantization supported. (#2116). The workflow includes annotation, calibration and transformation.
- Refactor quantization codebase and fix model accuracy. (#3543)
- KL-divergence-based per-layer calibration. (#3538)
- Add option to select which convolution layers are quantized. (#3173)
- [Relay][Quantize] Integrate data-aware calibration into quantization. (#4295)
- Pre-quantized model support (QNN operators and legalize pass).
- Add a legalize pass to Relay (#3672)
- Qnn Concatenate, quantize, dequantize and requantize operators (#3819, #3730, #3745, #3531)
- QNNtoRelay & QNNLegalize Pass utility (#3838, #3782)
- Requantize: Optimize lowering for some corner cases. (#3864)
- New quantized operator support: conv2d, add, dense (#3580, #3736, #3896, #3910)
- Do type checking for the input and kernel in the qnn conv2d (#3904)
- Legalize and AlterOpLayout for Intel int8. (#3961)
- Renaming tests to follow the Relay nomenclature. (#3975)
- Fix padding changes due to #3739 (#3989)
- Memorizing quantize node mapping to avoid duplicated simulated quantization (#3233)
- Infrastructure to support pre-quantized models (QNN) (#3971).
- [Relay][AlterOp] NHWC to NCHWc support for Pool, concatenate, sum. (#4059)
- [QNN][TFLite] Parsing QNN Add op. Adding MobilenetV2. (#4142)
- [TOPI][x86] Cascade lake support. (#4123)
- [TOPI][x86] Legalize - Support int8xint8 convolution to use VNNI inst (#4196)
- Qnn dequantize with min max using Mxnet flavor to support Mxnet prequantized models. (#3945)
- Improve the lowering of Qnn Dense (#4213)
- Adding support for dequantizing from int32 to float32. (#4130)
- [QNN] Refactor fixed point multiplication in requantize (#4073)
- [Relay][Quantize] Use fixed point multiplications (#4160)
- Add support for quantized multiply to Relay (#4141)
- Use legalize to handle NHWC layout for arm_cpu (#3754)
- [QNN][TFLite] Parsing TFLite quantized models. (#3900)
- [QNN][Legalize] Specialize for Platforms w/o fast Int8 support (#4307)
- [QNN] Use Int16 upcast in Fallback Conv2D. (#4329)
- Retain input kernel scales in QNN dialect (#4292)
- [QNN] Lowering for Depthwise Convolution. (#4351)
Accelerator and Microcontroller Support
TSIM is introduced to improve software and hardware integration and simulation accuracy. It integrates the hardware development process into the software stack. TSIM enables VTA to provide more accurate performance feedback, i.e., clock cycles, compared to the traditional functional model of a hardware accelerator. Moreover, a Chisel implementation of VTA is available and runs on top of TSIM.
There has been a proliferation of resource-constrained and embedded devices that do not have operating systems or a mature software stack. MicroTVM is intended to support TVM on such bare-metal devices.
- [TSIM] Enabling Cycle-Accurate Hardware Simulation for VTA (#3010, #3206, #3242)
- Chisel implementation for VTA and runs on top of TSIM (#3258, #3347)
- MicroTVM (#3227)
- Relay Compilation + AutoTVM compatible operator libraries for VTA (#3135)
- ChangeBatch pass for batched VTA compilation (#3656, #3660)
- VTA fast simulator statistics (#3481)
- TSIM improvements and fixes (#3505)
- Chisel VTA enhancements and fixes (32bit support #3558, alu instruction generation #3592, coherence support #3593, separate types #3605, tensor issue/commit #3637, uop load request #3643, uop dma requests #3654)
- VTA Runtime refactor for non-shared memory FPGAs (#3590)
- VTA HLS codebase refactor for Ultra96 (#3496)
- VTA support for batched inference (#3661)
- VTA bitstream compilation for Intel FPGA (#3494)
- TSIM: Introduce Virtual Memory for TSIM Driver (#3686)
- Parallel TSIM hardware compilation with macOS and debug support (#3797)
- Chisel: scale dram base address in hardware instead of runtime (#3772)
- Chisel: run all unittests by default (#3766)
- Chisel: improved Data Gen, Added ALU Test (#3743)
- Chisel dependencies for TSIM CI (#3721)
- Chisel: Added Module Unit Test Infrastructure (#3698)
- Add ISA BitPat generation (#3891)
- de10-nano driver (#3394)
- Extending Vision model coverage compilation for VTA (#3740)
- Conv2d transpose (deconvolution) operator support (#3777)
- Support TLPP in function simulator. (#3555)
- [VTA][Chisel] TSIM VTA Source Refactor (#4163)
- [VTA][TSIM] Serial GEMM Application Added (#4082)
Rust Support
Rust language support in TVM includes two parts: 1. the frontend wraps the current C API and exposes a Rust programming model; 2. the backend serves as an alternative to the C++ runtime and provides a standalone WASM module and security support, e.g., SGX.
- Rust frontend (#2292).
- Unify types between bindings and pure Rust impl (#2616)
- Rust: load syslib modules at compile time (#3274)
- Rustify PackedFunc & Friends (#2969)
- Rust DSO module (#2976)
Operator Support
- A special operator annotation.stop_fusion to prevent it being fused with previous expressions (#2624).
- batch_matmul supported (#2561).
- reverse_reshape supported (#2503).
- Faster-RCNN proposal operator for CUDA (#2420).
- Vision operator for YOLO yolo_reorg (#1941).
- slice operator for MXNet (#2662).
- arange supported (#2621).
- Vision operator roi_align (#2618).
- where operator for MXNet (#2647).
- Deformable conv2d (#2908)
- Faster-RCNN Proposal OP (#2725)
- ROI Pool operator (#2811)
- Gluoncv SSD support on CPU (#2353)
- shape, reverse, and sign op (#2749, #2800, #2775)
- tile and repeat op (#2720)
- logical operators (#2743, #2453)
- stack op (#2729)
- NCHWc upsampling (#2806)
- clip and wrap mode support in take (#2858)
- AlterLayout support for intel_graphics conv2d, depthwise conv2d (#2729, #2806)
- Add foldr1 operator (#2928)
- Add rsqrt operator (#2949)
- Add clip and wrap mode support in take (#2858)
- Gather_nd exposed to relay (#2945)
- bitserial_conv2d move to autotvm template and updates (#2819)
- Port x86 NCHWc to AutoTVM for Task Extraction (#2664)
- Implement relay nn.bias_add compute in C++ (#3027)
- Rename output tensors for better readability (#3006)
- int8 dense on CUDA & Dense op quantization (#2877)
- Bitserial dense operators for CPU (#3051)
- Enhance upsample operator to adapt onnx opset v9 (#2968)
- Add adaptive pooling operator (#3085)
- Add all operator (#3124)
- Add cblas batch_matmul (#3210)
- Add packing for int8 1x1 convolution and support the int8 group convolution on X86 (#2991)
- Add op size (#3094)
- x86 TOPI (roi_align #3475, conv2d_transpose #3491 )
- Intel INT8 (dilation in conv2d #3510, type checking #3516)
- Reinterpretation of tensor elements (#3599)
- Sparse-Dense for block-sparse multiplication (#3566)
- Winograd matrix computation (#3553)
- CUDA schedule for pool_grad (#3622), group_conv2d (#3663)
- Bitserial operations conv2d, dense and bitpack (#3844)
- Improve numeric gradient check (#3856)
- Resize rework (#3788)
- Improve conv2d_transpose CUDA schedule template (#3796)
- SpaceToDepth and MirrorPad Operators (#3718)
- Add variance and layer norm op (#3700)
- Add sparse_transpose for Square CSR matrices (#3707)
- TOPI: Memoize winograd matrix (#3687)
- New TOPI operators: erf, logical_and, logical_or, logical_not, isnan (#3702, #3929, #3979)
- Improve ceil_divide in tile/split (#3842)
- [Relay][Frontend][TF] Add tensor array ops (#3798, #4309)
- [TF][Op] Op where (#4045)
- [TOPI]Add op argwhere (#3994)
- [Relay] crossentropy_with_logits and its gradient (#4075)
- [Relay][Op] Enhance Upsample Operator to support float scales (#4206)
- [Relay][Op] Add instance norm op (#4004)
Frontend and User Interface
- Frontend darknet (#2773)
- Support tf.gather (#2935)
- Support tf.where (#2936)
- Adding ADD operator to tflite frontend for compiling the MobileNetV2 (#2919)
- Support SpaceToBatchND/BatchToSpaceND in Tensorflow frontend (#2943)
- Simplify TF get_output_names (#3025)
- TF Tile Round Sign Pow Exp Reverse (#2960)
- GluonCV SSD support on the GPU (#2784)
- Allow an op as loop var in Tensorflow (#3056)
- Add FULLY_CONNECTED op into tflite frontend (#3019)
- Add MXNet converter for RNN layer ops (#3125)
- Add log op in tf frontend (#3111)
- Add SoftPlus Sqrt in Tensorflow frontend (#3187)
- Add onnx elemwise greater/less (#3186)
- Add PlaceholderWithDefault (limited) implementation in TensorFlow (#3184)
- Support tf.math.reduce_prod (#3166)
- Better shape inference in TensorFlow Frontend (#3176)
- Get list of unsupported ONNX operators (#2995)
- Implement ONNX MaxPool-v8 and MaxPool-v10 (#3114)
- Convert TFLite NCHW to NHWC (#3141)
- Add Crop op converter (#3241)
- TFLite frontend operator support: PAD, RESIZE, MUL, Reduce (min, max, mean, prod), LOGISTIC, elemwise operators (Sub, Divide, Power, Max, Min) (#3310, #3370, #3304, #3421, #3313, #3357)
- Tensorflow frontend operator support: Abs, FloorDiv, GatherND, LeftShift, LogSoftmax, Max, Min, Mod, RightShift, ZerosLike, TruncateMod, Neg, ClipByValue, ResizeNearestNeighbor (#3270, #3211, #3393)
- TFLite: Add fused_activation_function for ADD, SUB, MUL, DIV (#3372)
- Support bidirectional RNN layer for MXNet (#3397)
- TFLite operator support (pack #3521, split #3520 )
- Keras operator support (permute, softmax #3618)
- TF operator support (BatchMatMul #3634)
- TFLite frontend operator support: tile, transpose (#3814, #3705)
- ONNX frontend operator support: PReLU for NNVM, Not, Sign, Equal (#3813, #3836, #3760)
- Keras frontend operator support: Dot (#3668)
- Add more cases to Keras _convert_reshape (#3846)
- TensorFlow frontend operator support: OneHot, log1p, cos, sin (#3781, #3614)
- Support BatchMatMul with input dimensions larger than 3 for TensorFlow (#3732)
- ONNX new operator support: And, Tile, Erf (#3878, #3941, #3988)
- MXNet new operator support: pad, conv1d, deconv1d (#3739)
- TFLite new operator support: batch_to_space_nd, space_to_batch_nd, tanh, greater, relu (#3850, #3996, #3963, #4022)
- TFLite: Support depthwise convolution multiplier greater than 1 (#3922)
- Keras: Fix ReLU in Keras Converter missed the case (#3917)
- Keras: frontend upsample and 1 channel conv2d fixes (#3937)
- Tensorflow: Convert scalar Const into tvm.relay.const (#3885)
- TensorFlow: Add support for SquaredDifference (#3930)
- [relay][frontend] clean up tf frontend (#3710)
- [Relay][Topi][TensorFlow][ONNX][Lang] Add support for Any op (#4205)
- [Relay][Frontend][ONNX] Add support for op Where (#4184)
- [Relay][TopHub] Add switch to disable TopHub download (#4015)
- Add parser support for CAST tflite operator (#4096)
- Add parser support for zeros_like tflite operator (#4042)
- Add parser support for SUM tflite operator (#4182)
- Add support for tf.assert (as no-op) and tf.no_op to TF Relay frontend. (#4172)
- [Relay][Frontend][ONNX] New Operators and Opsets to Support BERT (#4197)
- [Relay][Params] Add APIs for storing and retrieving parameters from individual functions. (#4194)
- Add build_create_shared_func to tvm/contrib/cc.py (#3840)
- TensorFlow saved model support for NNVM (#2493) and Relay (#2586).
- Introduced HybridModule (#2477) so that a normal TVM schedule can be compiled to the hybrid target, run, and dumped to Hybrid Script.
- [Relay][Frontend][Tensorflow] Add operator add_n (#4181)
- [Relay][Frontend][Tensorflow] StopGradient (#4238)
- [Relay][Frontend][ONNX] Add support for broadcasting to Where and MatMul (#4267)
- [TFLite] Support PRelu (#4298)
- [Frontend][MxNet] support mxnet cond op (#4311)
- Add support for quant.mul operator in tflite frontend (#4283)
- [Relay][Frontend][ONNX] operator support: DepthToSpace, SpaceToDepth (#4271)
- [Relay][Frontend][Tensorflow] Add conv2d_transpose. (#4300)
- [Frontend] Add TensorFlow FloorMod (#4308)
Runtime and Backend Support
- Make it easier for external libraries to extend TVM's NDArray (#2613).
- Improvements for NNPACK integration, including CI tests and winograd (#2846, #2868, #2856, #2721)
- Improvements for OpenCL runtime (#2741, #2737)
- GraphRuntime: Enable sharing parameters of a model among multiple threads (#3384)
- Android runtime argsort support (#3472)
- GraphRuntime enhancements (set_input_zero_copy #3416)
- Add AVX512VNNI support for TVM (#3388)
- Enable miopen Group Convolution (#3987)
- Minimal runtime (~12kb .text on ARMv7/x86) for subset of TVM models (#3567)
- [RUNTIME] Separate runtime related contrib into runtime/contrib (#4207)
- [topi] add ARM v8.2 udot (uint8) support (#3978)
- [codegen] Add multiple operands and function support when using fp16 compilation (#4056)
- [TOPI] Added support for Mali Bifrost target (#4047)
- [topi] enable fp16 sort for arm (#4084)
- Add OpenOCD Low-Level Device (RISC-V Support) (#3756)
- Add wave 32 bc for AMD ROCm backend (#3984)
- [RUNTIME] Support C++ RPC (#4281)
- [TOPI][OP] Support Faster-RCNN Proposal OP on CPU (#4297)
Language and Architecture
- Support custom datatypes (#2900)
- Add the acc16 intrinsic support (#3081)
- Handle float16 constants & fix BatchNorm (#3260)
- Structural hash - incorporate the var type into its hash (#3267)
- Relay C++ Build Module (#3082, #3144, #3174)
- Enable decorating python class to be a Relay Pass (#3364)
- Make Partial Eval support interprocedural optimization and termination check. (#3033)
- Introduce feature manager to Relay. (#3236)
- Use Relay parser to define the Relay prelude (#3043)
- Mechanism to detect incomplete expression match in Relay (#3203)
- EQ/NE operators support for StringImm expressions (#3283)
- Introduce CanonicalizeCast pass to formally reduce memory overhead introduced by fused cast operations (#3280)
- Support overloading comparison operations in Relay (#3168)
- MAC count: provide a pass to calculate the number of multiply-accumulate operations in a network (#2609).
- Add Tuple pattern (#3596)
- Text format support for ADTs and prelude (#3863, #3939)
- Add new IR pass CombineParallelDense (#3862)
- Add support for EQ op in the deduce bound and the loop partition (#3775)
- Introduce base-class IRMutatorWithAnalyzer (#3969)
- Define more standard global functions in the Relay prelude, including foldr1, hd, tl, nth, and list update (#2928, #2917, #2771, #2866)
- Add SkipVectorize pass (#3222, #3228)
- [Relay][Pass] Add pass to remove unused functions in relay module (#4334)
Feature Improvement
Symbolic shape enhancement
- Add shape function for symbolic shape. It enables certain cases for broadcast with symbolic shapes. (#3606)
- [tvm][any] broadcast with values other than one (#3967)
- Symbolic shape support (broadcast op #3389)
- Support reshape for dynamic shape in tf converter (#4185)
- Runtime Shape Functions (#4179)
Language and Architecture
- An optimization pass to eliminate expressions which have the same functionality and same inputs (#2639).
- Refactor text printer to add stream-like API and FunctionType support (#2605, #2882)
- Build a scaffold for structured error handling (#2838). The new mechanism detects and rewrites error messages so that the C++ and Python stack traces are unified and not redundant. Guidelines and conventions for error handling are also discussed.
- Higher order reverse mode automatic differentiation that works with control flow (#2496)
- Integer arithmetic analyzers, including modular set analysis, const integer bound analysis, and the rewrite simplifier (#2904, #2851, #2768, #2722, #2668, #2860)
- Improve operator fusion for TupleGetItem in Relay (#2914, #2929)
- Compute FLOP of autotvm template for int8 models (#2776)
- Common subexpression elimination pass in Relay (#2639)
- Improve quantization in Relay (#2723)
- Refactor build_func in measure module of autotvm to better support cross compilers (#2927)
- Quantize all fields of concatenate (#2913)
- Remove stale verilog generator (#2964)
- Improve Relay printing (#2984, #2881, #3030, #3041)
- Add min_num_branches option in CombineParallelConv2D (#2961)
- Add expr_visitor, fix expr_functor exponential blowup problem (#2988)
- Support Deriving channels when it is not provided in AlterLayout. (#2972)
- Enhance BoundDeduce algorithm (#2795)
- Enhance loop partition algorithm (#2956)
- Better tuple fusion implementation (#3092)
- Enhance fusion rule that starts from elemwise and broadcast (#2932)
- Remove on_device op after annotation in heterogeneous pass (#3204)
- Improve canonical and rewrite simplifier (#3132, #3149)
- Capture constant external python variables in hybrid script (#3157)
- Remove Peano nats from the prelude (#3045)
- Macro to define NodeRef methods, constructor style example (#3224)
- Consistent RAII scoping API (#3231)
- Register all operators' attributes in Python (#3175)
- Add module support in relay.build (#3424)
- Relay pass infrastructure improvement (#3319, #3336, #3430, #3353)
- Migrate Relay passes to pass manager (#3323, #3289, #3251, #3406)
- Improve heterogeneous annotation by using visitor (#3261)
- Support export ADT value in Python (#3299)
- Extend TensorComputeOp to allow scalar inputs (#3300)
- Transitioning low-level IR away from HalideIR (#3533, #3535)
- Tags for ADT constructors (#3369)
- IR dumping for debugging (#3493)
- Pretty printer and parser roundtrip (#3460, #3536)
- Relay type checking (conv2d weight dimension #3511, any shape #3221)
- Relay Module enhancements (remove free variables #3476)
- LLVM DWARF debug information (#3420)
- Printer for Layout/BijectiveLayout (#3582)
- Type inference escape hatch (#3571)
- Making iterators compatible with constructors of STL containers (#3624)
- Moving Conv, Dense, Concatenate InferTypes to header (#3783)
- Simplify casts of constants 0 and 1 (#3758)
- Conditionally replace reduction init axis. (#3408)
- Improve Partial Evaluator (#3749, #3703)
- Strict mode in Relay pattern matching (#3620)
- Quit and clean when TVM is interrupted (#3640)
- Make Type Relation catch more errors (#3899, #3699)
- Refactor the way we interface between different modules of Relay (#3906)
- Introduce schedule_injective_from_existing and unify external schedules for all targets (#3983)
- [NODE][REFACTOR] Refactor reflection system in node. (#4189)
- Unify node system and object (#4161, #4115, #4128)
- [Relay][Refactor] Rename Datatype to ADT (#4156)
- [Relay] fix exponential blowup in interpreter (#3559)
- [Relay] Fix memory leak in the interpreter (#4155)
- [rpc] use callback func to do send & recv (#4147)
- Add lift_if_then_else pass to improve loop partitioning (#3865)
- Decrease the complexity of CalcDep from exponential to linear (#4053)
- [IR] Make iterators compatible with constructors of STL containers (#3624)
- [Relay][Pass] Avoid FoldConstant folding some ops (#4245)
- [Relay][Prelude] More dtypes support in tensor_t (#4233)
- [NODE][REFACTOR] Rename IRFunctor->NodeFunctor, use func pointer (#4247)
- [RUNTIME][REFACTOR] Use object protocol to support runtime::Module (#4289)
- [CodeGen] Add build config option disable_assert to control whether to generate asserts. (#4340)
Arithmetic Analysis
- Formalize Integer Arithmetic Analysis (RFC: #2588). It aims to perform better context-dependent analysis, bound analysis, centralized arithmetic logic, and arithmetic simplification; see the sketch after this list. (#3272, #3463, #3464, #3368, #3503, #3504, #3502, #3479, #3568)
- Introduce FloorDiv/Mod, TruncDiv/Mod, and IndexDiv/Mod for better arithmetic simplification (#3976, #3986, #4000, #4014, #4008, #4028)
- [ARITH] Use floordiv for the deduce bound (#4025)
- [Simplifier] Rewrite simplification rule to eliminate unnecessary conditionals. (#4076)
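A small illustration of the analyzer as exposed in Python (a sketch only; the expressions are arbitrary and the tvm.arith.Analyzer methods shown are assumed to match this release):

```python
import tvm

ana = tvm.arith.Analyzer()
x = tvm.var("x")

# Canonical simplification folds the linear terms: (x + 1) * 4 - x * 4 -> 4.
print(ana.canonical_simplify((x + 1) * 4 - x * 4))

# Constant integer bound analysis: floormod(x, 8) always lies in [0, 7].
print(ana.const_int_bound(tvm.floormod(x, 8)))

# Modular set analysis: x * 4 + 2 is congruent to 2 modulo 4.
print(ana.modular_set(x * 4 + 2))
```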
Runtime and Backend Support
- Provide error msg for failure function call in tvm4j (#2967)
- Expose backtrace symbols in Debug mode (#3001)
- C++ GraphRuntimeCodegen, Deprecate Python2 (#2986)
- Ensure interpreted functions can take values that are not TensorValues (#3015)
- Make OpenCL runtime Compatible with OpenCL2.0 (#2897)
- Handle INF and NAN in CUDA and OpenCL (#3194)
- Update debug graph runtime for more precise layerwise timing (#3232)
- ROCM support (llvm printing #3662, ld.lld finding #3664, save to file #3665)
- Threadpool: make spin_count configurable (#3577)
- RPC worker children termination (#3669)
- Vulkan runtime reimplementation (stream approach) (#3849)
- Vulkan backend supports Call::reinterpret and vectorized comparison (#3795)
- Support MKL on Windows (#3837)
- Vulkan IR builder (bool to float #3513)
- Force code_object_v2 for AMD GPU backend (#4099)
- [Codegen][cuda-fp16] fallback to fp32 simulation when cuda arch < sm53 (#4268)
- Fix and refactoring for AMD gpu backend (#4305, #4321, #4341, #4342)
- [Debugger] Sorting op-time breakdown for quicker analysis. (#4352)
- [nvcc] enable multiple arch in one fatbin (#4377)
Frontend and User Interface
- Relay now supports saving and loading parameter dictionaries (#2620); see the sketch after this list.
- Add max_num_threads to Hybrid Script, which allows users to get the max number of threads for GPU targets (#2672).
- Improvements for TensorFlow frontend (#2830, #2757, #2586), including decompiling TF control flow (#2830)
- Improvements for MXNet frontend (#2844, #2777, #2772, #2706, #2704, #2709, #2739)
- Improvements for Keras frontend (#2842, #2854)
- Improvements for DarkNet frontend (#2673)
- Improvements for ONNX frontend (#2843, #2840)
- Better profile result dump in Chrome Tracing format (#2922, #2863)
- Unified error handling in NNVM and Relay frontends (#2828)
- Improve NNVM to Relay conversion (#2734)
- Remove input_0d_mismatch special handling for TF frontend (#3087)
- Bumped ONNX version from 1.1.0 to 1.4.1 (#3286)
- Simplify parameter handling in Tensorflow frontend (#2993)
- CoreML improvement for image scaler and padding (#3800)
- Clean up TensorFlow frontend (#3710)
- Darknet: Solve tvm parsing darknet resnext failure bug (#3778)
- Frontend changes to get_workload (#3483)
- [TF][Relay][Op] Pass module when inferring shape (#4287)
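Regarding the parameter dictionary support mentioned at the top of this list, usage looks roughly like the following sketch; the file name and the toy weight are placeholders:

```python
import numpy as np
import tvm
from tvm import relay

# `params` would normally come from a frontend import or relay.build.
params = {"weight": tvm.nd.array(np.random.rand(3, 3).astype("float32"))}

# Serialize the dictionary to bytes and write it to disk.
param_bytes = relay.save_param_dict(params)
with open("deploy_params.bin", "wb") as f:
    f.write(param_bytes)

# Load it back as a dict of parameter name -> tvm.nd.NDArray.
with open("deploy_params.bin", "rb") as f:
    loaded = relay.load_param_dict(bytearray(f.read()))
print(loaded["weight"].asnumpy())
```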
AutoTVM
- Support override in register_topi_compute and register_topi_schedule. (#3292)
- Improve graph tuner dealing with Tuple. (#3649)
- Add AutoTVM template for conv2d Intel int8. (#3955)
- Add AutoTVM template for dense on CUDA. (#3923)
- Add AutoTVM template for conv2d on Intel graphics. (#3839)
- Optimizing autotvm task extraction speed. (#4138)
- [AutoTVM] Add batch_matmul to tunable operations. (#4242)
- Selecting tuning templates when extracting task. (#4338)
Performance Improvements
- Enable AlterOpLayout pass for x86 on Relay (#2585). It is essential to get decent performance for CNN-based models on Intel CPUs.
- Better intrinsic matching for x86 CPU and ARM CPU, including variants of vcvtph2ps and vmlal.s16 (#2925, #2748).
- Improve injective schedule for ARM CPU (#2801)
- Core functionality for Graph tuner (#2184)
- Fast tanh implementation (#3255)
- Improve multi-batch conv2d on x86 (#3308)
- Improve non_max_suppression and get_valid_counts for CPU (#3305)
- Improve roi_align performance for CPU (#3296)
- Improve nms and get_valid_count performance (#3282)
- Graph tuner for multiple subgraphs (#3490)
- For sparsity, fast transpose for square CSR matrices has now been merged, which is a good starting point for more general sparse type support.
- Reduce set_input and set_input_zero_copy overhead (#3805)
- Parallelize batch axis for ARM (#3931)
- Support cuBLAS BatchMatMul (#3936)
- Add AVX512VNNI support for TVM (#3388)
- Enhance tuning space of split (#3949)
- Enable miopen transpose convolution and fp16 support (#3952)
- Improve conv2d_transpose schedule on X86 and CUDA (#3948)
- Expose llvm.nearbyint intrinsic (#4001)
- [TOPI][X86] Pool operator parallel support. (#4090)
- Improve layout for several operators (#4103, #4040, #4080)
- [Relay][VM] Fix constant folding issue in VM compiler (#4077)
- [relay][vm] Reuse allocated device memory (#4170)
- [Runtime] Enable option to use OpenMP thread pool (#4089)
- [PERF] Parallelize reduction for CPU (#4158)
- [TOPI] Tunable Template for Conv2D HWCN on CUDA (#4168)
- [TOPI] Add valid auto tvm for Intel Graphics (#4078)
- [TOPI] FIFO buffer op, to accelerate sequence modeling with dilated convolutions (#4039)
- TensorCore Support using Intrinsic (#4136)
- Auto TensorCore CodeGen (#4234)
- Use cblas for dense and batch_matmul (#3787)
- Update TOPI softmax compute and CPU schedule (#3680)
- [VTA] Performance optimization, remove unnecessary contiguous memory use. (#4246)
- [TOPI][AlterOpLayout][ARM] Enabling NHWC to NCHW layout transformation. (#4249)
- [ThreadPool] Solve thread transitions issue (#4344)