This is a transformer-based large language model accelerator developed using Chisel HDL as the development language.
-
PE (Processing Element): A simple MAC (Multiply-Accumulate) unit implementation, including the necessary input and output interfaces to form a systolic array.
-
ALU (Arithmetic Logic Unit):
- Floating-point computation unit: A single-precision floating-point computation unit following the IEEE-754 standard. It supports basic operations such as addition, subtraction, multiplication, and division. The execution cycles are as follows: 4 cycles for addition and subtraction, 2 cycles for multiplication, and 5 cycles for division. TODO: The conversion module between floating-point and fixed-point numbers is still under development.
- Fixed-point computation unit: A standard fixed-point computation unit with configurable bit width and decimal point position. It supports addition, subtraction, multiplication, division, and square root operations. The execution cycles are as follows: 2 cycles for addition and subtraction, 2 cycles for multiplication,
WOI + WOF + 5
cycles for division, and(WII + 2 - 1) / 2 + WIF + 3
cycles for square root. These cycle counts include two cycles for the number scaling module. Additionally, it supports integer division with remainder using the SRT-16 algorithm, with an execution time of 2 to5 + (len + 3) / 2
cycles. - GEMM (General Matrix Multiply) unit: A systolic array algorithm for matrix multiplication of
n x n
square matrices implemented with fixed-pointUInt
. It produces the result in 3n cycles. - GEMV (General Matrix-Vector Multiply) unit: Implements a vector-matrix multiplication algorithm similar to a reduction tree, using both fixed-point
UInt
andFXP
. The result is available inlog(n) + 2
cycles. - Softmax computation unit: Uses a special exponential function unit based on the algorithm from I-BERT to compute exponentiation. The normalization is computed using the aforementioned
FXP
library's square root and division modules. The exact cycle count depends on the specific implementation. - Deprecated: Matrix computation modules for SPMM (Sparse-Dense Matrix Multiplication) and SDDMM (Sampled Dense-Dense Matrix Multiplication), implemented with MAC units and mask vectors.
-
Memory (Mem):
- Custom SRAM module: A customizable SRAM module where both the size and data type can be defined. Ensure there is sufficient on-chip space when using this module.
- HBM/DRAM streaming read/write module: This module allows for high-bandwidth memory (HBM) or dynamic random-access memory (DRAM) streaming, where burst length and AXI maximum bit width limitations can be ignored. The module automatically slices and transfers data according to the defined size.
- AXI reader and writer interfaces: Native AXI interfaces for reading and writing data.
这是一个使用Chisel HDL作为开发语言的 transformer-based 大语言模型加速器
- PE: 简单的 mac 单元实现,包含构成脉动阵列必须的出口和入口。
- ALU:
- 浮点计算单元,遵循 IEEE-754 标准的单精度浮点数计算单元。可以实现浮点数的加,减,乘,除,四种基本运算,耗时周期分别为:4(加减法),2(乘法),5(除法)。TODO:目前浮点数与定点数的相互转换模块实现还待完善。
- 定点数计算单元,标准的定点计算单元,可以自由缩放定点数位宽以及小数点位置。可以实现定点数的加,减,乘,除,开方运算,耗时周期分别为:2(加减法),2(乘法),WOI+WOF + 5(除法),(WII+2-1)/2 + WIF + 3(开方)。以上周期均包含两个周期的数字缩放模块用时。同时还拥有使用SRT-16算法得到的带余数的整数除法,运算周期为 2 ~ (5 + (len+3) / 2)
- GEMM 运算单元,使用定点数 UInt 实现的 n * n 正方形矩阵的脉动阵列算法,可以在 3n 个周期后得到矩阵结果。
- GEMV 运算单元,使用定点数UInt和定点数FXP均实现了类似于归并树的向量矩阵乘算法,可以在(logn+2)周期后得到结果。
- Softmax 运算单元,使用特殊的指数运算单元 I-Bert中的算法计算指数函数的值,使用上述FXP库中的开方以及除法计算正规化的值,具体周期数与实际代码相关。
- deprcated:使用mac单元和mask向量计算的SPMM和SDDMM矩阵计算模块
- Mem:
- 自定义的 SRAM 模块,可以自定义SRAM的大小和数据类型,需保证使用时片上有足够的空间
- 类似流式读取的 HBM/DRAM 读写模块,使用时可以忽略 burst length 和 AXI 最大位宽的限制,模块将自动切片并将数据用设定好的大小传输
- AXI对应的 reader 和 writer 原生接口