Sunday, 15 September 2024

AX650N - Sipeed Maix-IV (AXeraPi-Pro) NPU teardown

After spending a significant amount of time reverse-engineering the RK3588 NPU and examining Rockchip's 6 TOPS claim, the AXera AX650N SoC piqued my interest, largely due to AXera's ambitious claims about the NPU's processing power: "72T mixed precision computing power, native support for Transformer intelligent processing platform". Upon closer inspection, the AX650N delivers 72.0 TOPS@INT4 and 18.0 TOPS@INT8. Interestingly, that makes the step from INT8 to INT4 a 4x gain rather than the typical 2x improvement (a conventional doubling of 18 TOPS@INT8 would give 36 TOPS@INT4, not 72). There is an ongoing effort to port some of the smaller Transformer models to showcase its capabilities. However, given the performance claims, I would have expected some larger models to be showcased, but that doesn't seem to be the case.

Sunday, 19 May 2024

RK3588 - Reverse engineering the RKNN - Running llama2.c with TinyStories

I've finally reached a point with the reverse engineering where we can start evaluating the usefulness of the NPU for LLMs. I've crafted a basic library (rk3588-npu) with sufficient features for initial integration. A good reference application for integration is llama2.c because it's a single C file, the code structure is straightforward to follow and, more importantly, to modify. We will use TinyStories (stories110m) for testing since the models are relatively small, making it easier to troubleshoot when the outputs go astray. Credit to karpathy for providing llama2.c.

Note: I've set the CPU and NPU cores to their maximum clock speeds.


We'll start by running run.c against stories110m; this is the fp32 version using the CPU with a single thread (single core). As we can see, it manages roughly 9.7 tokens/second.

Next, I converted run.c to use fp16 (_Float16), along with converting the weights. We see a slight drop in performance to roughly 9.3 tokens/s, as the arithmetic operations require a conversion back to fp32.
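
For reference, the core of the change looks roughly like the following (a minimal sketch, not the exact code from my port): weights and activations are held as _Float16, and each product is widened back to fp32 for accumulation, which is where the small slowdown comes from.

#include <stddef.h>

/* Sketch of an fp16 variant of llama2.c's matmul: data is stored as
 * _Float16 to halve the memory traffic, but every multiply is widened
 * to fp32 before being accumulated. */
static void matmul_fp16(float *out, const _Float16 *x, const _Float16 *w,
                        size_t n, size_t d) {
    for (size_t i = 0; i < d; i++) {          /* one output element per weight row */
        float acc = 0.0f;
        const _Float16 *row = w + i * n;      /* row i of the [d x n] weight matrix */
        for (size_t j = 0; j < n; j++) {
            acc += (float)row[j] * (float)x[j];   /* fp16 -> fp32 conversion on each term */
        }
        out[i] = acc;
    }
}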

As with the fp32 version, a single core runs at 100% as the code is single-threaded.

The next step was to offload all FP16 multiplications to the NPU. With a vocabulary size of 32,000, the largest multiplication is 768 x 32,000, with others being 768 x 768, 768 x 2048, and 2048 x 768. For efficient execution, the model weights need to reside entirely in memory, accessible to the NPU. This requires them to be within the 4GB address space, which can be problematic for larger models. In our case, the weights are roughly 256MB, requiring an expansion of the kernel CMA memory allocation to 512MB. Additionally, the weights needed conversion to the NPU format.
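
To put the memory numbers in context (rough figures, assuming 2 bytes per fp16 weight):

  110M parameters x 2 bytes = ~220MB of raw weights (~256MB once converted to the NPU layout, presumably down to padding/alignment)
  768 x 32,000 x 2 bytes = ~47MB for the largest single matrix (the vocabulary projection)

That comfortably exceeds the default CMA pool, hence the bump to 512MB (typically done via the cma= kernel command-line parameter or the device tree).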

These changes result in a further uplift, to roughly 21-23 tokens/s depending on the length of the output, as per the video below. Conservatively, we could call it a doubling.

CPU usage on the single core fluctuates between 30-60%. The CPU is still critical for a number of reasons:

1. We're still having to rely on memory copies to send/receive the remaining data for the multiplication to occur on the NPU.

2. Invocation of the NPU kernel driver requires CPU cycles.

3. The rest of the llama2.c code still runs on the CPU.


Although the results look promising, we need to bear in mind that TinyStories is a very small model in terms of architecture. Furthermore, it's fortunate that the converted weights fit in memory without having to shuffle weights between userspace and physical memory. In addition, the fp16 format further limits the possibility of running larger models efficiently. So the conclusion so far: there is some uplift, but mileage will vary depending on model size and number of layers.

Thursday, 8 February 2024

RK3588 - Reverse engineering the RKNN (Rockchip Neural Processing Unit)


The internal operations and capabilities of the RK3588 NPUs are mainly concealed within a closed-source SDK known as RKNPU2. Given the huge interest in Large Language Models (LLMs) and the quest for optimal matrix multiplications for transformer models, I was curious to understand the implementation of the newly introduced matrix multiplication API (rknn_matmul_run) in the SDK. A thorough examination of the RKNN section in the TRM (Technical Reference Manual) reveals no native mechanism for matrix multiplication, especially for vectors.

To grasp what's going on, the initial step was to understand how the NPU functioned. While the TRM furnished a detailed list of registers and a brief overview of the core units constituting the NPU, it notably lacked essential information on programming the registers to execute operations. For example, there were no specifics about deriving or calculating register values based on factors such as data formats (e.g., int8 vs. float16) or the size of input data or weights. Furthermore, there was no information on how to construct a pipeline for the NPU to execute. Fortunately, I had a slight advantage from a previous reverse engineering attempt on the V831 NPU. Nevertheless, even armed with this knowledge, it still required several months of trial and error, extensive analysis of data streams, a few dead ends, and numerous attempts at reverse engineering. Finally, I managed to understand how to activate the NPU and get it to execute simple operations.

The RK3588 NPU seems to be a distant cousin of the NVDLA architecture: some of the terminology is similar, and the core units have similar functions and pipelines to NVDLA, although they have been named differently. One of the primary differences is that we can give the NPU a list of tasks (RKNN terminology) to execute and then wait for completion. For example, if I have a simple neural network consisting of 3 layers, each consisting of convolution + bias, then it is possible to feed 3 tasks (each performing convolution + bias) to the NPU along with the necessary input, weight and bias values. Subsequently, we just wait for the NPU to notify us when it's complete.
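
To make that concrete, here is a purely conceptual C sketch of what a task list looks like from the caller's point of view. The struct is illustrative only and is not the hardware's actual task/register layout; in practice the list is handed to the kernel driver in one submission (the RKNPU_SUBMIT ioctl that appears in the test output further down).

/* Conceptual illustration only -- not the real task format. Each task
 * bundles one layer's convolution + bias and points at the buffers it
 * needs; a 3-layer network becomes a 3-entry list that is submitted
 * once, after which the CPU simply waits for the completion signal. */
struct npu_task_sketch {
    const void *input;    /* feature data for this layer      */
    const void *weights;  /* convolution weights              */
    const void *bias;     /* bias values                      */
    void       *output;   /* where this layer's result lands  */
};

struct npu_task_sketch tasks[3];   /* one entry per layer */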


The image presented above is extracted from the TRM and has been altered, because the description provided in the TRM doesn't entirely align with their diagram and, more crucially, the register naming convention. Here is my interpretation: each NPU comprises three distinct units:
  • CNA - Convolution Network Accelerator (including the CORE rectangle). The TRM refers to this as the Neural Network Accelerating Engine; the CNA itself isn't described.
  • DPU - Data Processing Unit
  • PPU - Planar Processing Unit

Based on the above, the NPU is primarily designed for running conventional Convolutional Neural Networks (CNNs). This is attributed to the CNA core feature, which revolves around executing convolutions by inputting image or feature data along with the corresponding weights. The emphasis on CNNs is further evident from the majority of RKNPU2 samples provided, such as YOLOX, Mobilenet, and ResNet. The CNA output can be directed to the DPU, where element-wise operations such as addition, multiplication, and ReLU can be carried out. Subsequently, the DPU's output can be channeled to the PPU, where operations like min, max, and average pooling are executed. Additionally, there is the option to directly feed data to the DPU or PPU without necessitating a convolution step.
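
As a mental model of that dataflow (and emphatically not how the hardware is programmed), here is a small CPU-side C sketch of what the DPU and PPU stages do to a CNA output:

#include <stddef.h>

/* DPU stage: element-wise ops on the convolution output -- here a
 * per-channel bias add followed by ReLU. x is laid out as c planes of
 * hw elements each. */
static void dpu_stage(float *x, const float *bias, size_t c, size_t hw) {
    for (size_t i = 0; i < c; i++)
        for (size_t j = 0; j < hw; j++) {
            float v = x[i * hw + j] + bias[i];
            x[i * hw + j] = v > 0.0f ? v : 0.0f;   /* ReLU */
        }
}

/* PPU stage: planar pooling -- here a 2x2 max pool over one h x w plane. */
static void ppu_maxpool2x2(const float *in, float *out, size_t h, size_t w) {
    for (size_t y = 0; y < h / 2; y++)
        for (size_t x = 0; x < w / 2; x++) {
            float m = in[(2 * y) * w + 2 * x];
            if (in[(2 * y) * w + 2 * x + 1] > m) m = in[(2 * y) * w + 2 * x + 1];
            if (in[(2 * y + 1) * w + 2 * x] > m) m = in[(2 * y + 1) * w + 2 * x];
            if (in[(2 * y + 1) * w + 2 * x + 1] > m) m = in[(2 * y + 1) * w + 2 * x + 1];
            out[y * (w / 2) + x] = m;
        }
}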

To execute convolutions efficiently, the CNA employs multiply-accumulate (MAC) operations. The performance of a CNA is partially determined by the number of MAC units used. According to the TRM, for a single NPU core the count of MAC operations depends on the input data type:

  • 1024 int8 MAC operations per cycle
  • 512 float16 MAC operations per cycle
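
As a sanity check on the headline figures, counting each MAC as two operations (multiply + add) and assuming the commonly quoted ~1 GHz NPU clock (my assumption):

  1024 int8 MACs/cycle x 2 ops x ~1 GHz = ~2 TOPS per core, x 3 cores = ~6 TOPS
  512 float16 MACs/cycle x 2 ops x ~1 GHz = ~1 TOPS per core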

Each MAC cell caches 1x1x16 weight bytes: for int8 that's 16 values, whilst for float16 it reduces to 8. We require 2 MAC cells to perform a float16 operation, hence the reduction in operations per cycle. Internally, feature and weight data must conform to Rockchip's NC1HWC2 format, where C2 is the aforementioned value. One 1x1x16 cube of feature data is then shared by all MAC cells to calculate partial sums, which are then sent to the accumulator. At a higher level, the CNA appears to execute a block operation; in my tests, for instance, the MAC caches 32 channels of weight data for fp16, hence the requirement to lay out weights in kernel groups, each with 32 channels.
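
For the curious, my current understanding of the NC1HWC2 layout boils down to the index calculation below (C2 = 16 for int8, 8 for fp16). Treat it as a sketch from my reverse-engineering notes rather than an official definition.

#include <stddef.h>

/* Offset of element (n, c, h, w) of an NCHW tensor once packed into
 * NC1HWC2, where the channel dimension is split into C1 = ceil(C/C2)
 * groups of C2 channels. */
static size_t nc1hwc2_offset(size_t n, size_t c, size_t h, size_t w,
                             size_t C, size_t H, size_t W, size_t C2) {
    size_t C1 = (C + C2 - 1) / C2;   /* number of C2-wide channel groups  */
    size_t c1 = c / C2;              /* which group this channel falls in */
    size_t c2 = c % C2;              /* position within that group        */
    return ((((n * C1 + c1) * H + h) * W + w) * C2) + c2;
}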

Performance is also affected by the access time to input and weight data, so the CNA incorporates a second-level cache known as the convolution buffer (cbuf). In the above diagram, the 384KB onboard memory is partly for that purpose. Importantly, the number of MAC units plus the cbuf size influence how large a convolution can be completed in one task.
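
As a rough illustration of that limit (and my reading of why the benchmark in point 4 further down needs two tasks): an fp16 [512 x 512] weight matrix alone is 512 x 512 x 2 bytes = 512KB, which already exceeds the 384KB cbuf before any feature data is accounted for, so the work has to be split across at least two tasks.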

Some of you may have already deduced that the matrix multiplication API is essentially executed through a 2D convolution. For instance, let's consider matrix A as [M x K] and matrix B as [K x N]. Matrix A represents the feature data arranged in an Mx1xK (hwc) format, while matrix B denotes the weight data organized in a 1x1xNxK (hwck) format. Consequently, the resulting matrix C [M x N] is arranged as Mx1xN. I'm at the point where I have a simple test running which asks the NPU to perform a matrix multiplication. I'm using matrix data derived from a GGML test case (test-mul-mat.cpp) to verify the output is correct. To run the test, check out my repo and build; sadly, I'm still testing against kernel 5.10 on a Rock-5b. If the test runs, the output should be as below (and the screenshot above).

rock@rock-5b:~/rk3588-npu/build$ ./matmul_4_36_16
drm name is rknpu - 20220829 - RKNPU driver
input dma is ffff8000, output dma is ffffa000, weights dma is ffff9000
Size of npu_regs 112
RKNPU_SUBMIT returned 0
=========================================================================================================
 1224.0 1023.0 1158.0 1259.0 1359.0 1194.0 1535.0 1247.0 1185.0 1029.0  889.0 1182.0  955.0 1179.0 1147.0
 1216.0 1087.0 1239.0 1361.0 1392.0 1260.0 1247.0 1563.0 1167.0 1052.0  942.0 1214.0 1045.0 1134.0 1264.0
 1125.0  966.0 1079.0 1333.0 1287.0 1101.0 1185.0 1167.0 1368.0  990.0  967.0 1121.0  971.0 1086.0 1130.0
  999.0  902.0 1020.0 1056.0 1076.0  929.0 1029.0 1052.0  990.0 1108.0  823.0  989.0  759.0 1041.0 1003.0
=========================================================================================================
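
To make the feature/weight mapping described above concrete, here is a rough C sketch (ignoring the additional NC1HWC2 tiling discussed earlier). The names are mine for illustration, not from the repo: a row-major A [M x K] is already the Mx1xK (hwc) feature layout, while B [K x N] has to be transposed into 1x1xNxK (hwck), i.e. one 1x1xK "kernel" per output column. The plain reference matmul is what the NPU output gets compared against.

#include <stddef.h>

/* Repack row-major B [K x N] into hwck order: N kernels of K values. */
static void pack_weights_hwck(const float *B, float *w_hwck,
                              size_t K, size_t N) {
    for (size_t n = 0; n < N; n++)       /* one kernel per output column */
        for (size_t k = 0; k < K; k++)
            w_hwck[n * K + k] = B[k * N + n];
}

/* Reference matmul used to verify the NPU result: C[M x N] = A * B. */
static void matmul_ref(const float *A, const float *B, float *C,
                       size_t M, size_t K, size_t N) {
    for (size_t m = 0; m < M; m++)
        for (size_t n = 0; n < N; n++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[m * K + k] * B[k * N + n];
            C[m * N + n] = acc;
        }
}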

Regarding reverse engineering, I've reached a stage where I understand the majority of register settings that impact convolution when dealing with feature data as input. The primary uncertainty lies in determining the bank sizes for feature/weight data; however, I'm hopeful that this can be deduced. After dedicating a significant amount of time to analyzing the NPU, here is a list of key areas that you should be aware of:

1. All data pointers within the NPU (e.g., input, weights, outputs, task lists) are 32-bit and must reference physical memory. Consequently, this restricts the memory range to 4GB, making it impractical to leverage a board with 16/32GB memory for the NPU to use. Moreover, it potentially imposes limitations on the types of models that can be executed on the NPU.

2. The claim of 6 TOPS should be approached with caution. While each NPU core is rated at 2 TOPS, there are registers that could potentially enable convolution across all 3 cores. However, after analyzing the data streams generated by the SDK, it appears that this feature is never utilized. Additionally, there doesn't seem to be a similar capability available for the DPU/PPU units, which would restrict its usability. In my view, the most effective approach is to treat them as individual cores and execute models on each one, taking into account the memory constraints mentioned earlier.

3. The SDK matrix multiplication API, in certain aspects, represents an inefficient utilization of the NPU. There is the overhead of memory allocation, a kernel call, and instructing the NPU to execute a single convolution. Ideally, the NPU should be tasked with executing multiple operations and provided with all the data those operations need up front. Typically this is how the NPU is utilized when running a CNN model (i.e. YOLOvX). The caveat here is that the converted model is limited to containing layers whose operations are supported by the NPU.

4. Initial benchmarking for the multiplication of two fp16 [512 x 512] matrices suggests that I could achieve completion in a respectable time of around 1ms. Please note, this involves sending 2 tasks to the NPU, as mentioned earlier, due to the cbuf limitation. Unfortunately, this is only part of the story when it comes to using vector data as input. The costly operations involve converting the matrices to feature and weight data formats, and vice versa for the output, if done at runtime. I made an effort to create a highly optimized routine for vector-to-feature-data conversion. According to my benchmarks, this process takes approximately 2ms for fp16 [512 x 512] matrices. I would estimate 12-15ms to perform all the conversions for the matrices mentioned above. Ideally, the matrix for the weight data should be converted ahead of time to reduce conversion overhead and, if possible, persisted for reuse.

5. I was hoping there was the capability to use a programmable core to perform custom operations. Unfortunately this isn't the case, and you're left with using OpenCL as the alternative. This brings its own challenges if you need to shuffle data between OpenCL and the NPU.

There is still more to discover about the other units (DPU/PPU) and I'll spend time doing that. Lastly, TRM v1.0 contains numerous gaps and inconsistencies for RKNN; if anyone has a later version, it would be greatly appreciated.