Tiny Devices: September 2024

After spending a significant amount of time reverse-engineering the RK3588 NPU and examining Rockchip's 6 TOPS claim. The AXera AX650N SoC piqued my interest due to AXera's ambitious claims about the NPU's processing power, boasting "72T mixed precision computing power, native support for Transformer intelligent processing platform". Upon closer inspection, the AX650N delivers 72.0 TOPS@INT4 and 18.0 TOPS@INT8. Interestingly, the performance claim from INT8 to INT4 is a 4x gain, rather than the typical 2x improvement. There is an ongoing effort to port some of the smaller Transformer model to showcase its capabilities. However, given the performance claims I would have expected some larger models to be showcased, but that doesn't seem to be the case.

AX650N

The AX650N is part of AXera's vision processor lineup, featuring a unique capability for intelligent black light vision, which appears to generate color images in low-light conditions. From what I understand, this capability likely stems from its ISP (Image Signal Processor) unit utilizing RGB+IR data from the image sensor. This opens up potential for performing object detection in low-light environments. The AX650N is powered by 8 A55 cores and features an unusual configuration with two DDR controllers, each capable of addressing 16GB. This design allows for the potential to double data transfer rates, as the controllers operate independently and don't share the same address bus. Additionally, the SoC includes dual DSP cores (Tensilica Vision Q7 DSP) to further boost its vision processing capabilities. However, since it's primarily a vision processor, I suspect AXera may be positioning it to also capitalize on the growing interest in large language models (LLMs).

Maxi-IV board

AXera collaborated with Sipeed to produce a developer board, the Sipeed Maxi-IV, also known as the AXeraPi-Pro. The board is set to boot from eMMC by default and comes with a preinstalled Busybox image, however, the image is quite minimal. Fortunately there is a Ubuntu image is available for flashing to the eMMC. The main issue with the Ubuntu image is that it doesn't allocate enough space for the different mount points for the root filesystem for it to be useful. To fix this, I had to create a secondary image on a USB drive, boot from it (painfully slow) and then repartition the eMMC. The CPU fan runs continuously as soon as power is applied (with no PWM control), making it a noticeable distraction.

Upon booting the board, the surprising discovery is that Linux recognizes just 4GB of memory, despite the board having 8GB of onboard RAM. The remaining 4GB is reserved as CMM (Contiguous Memory Model), a large block of physical memory allocated for onboard peripherals like the ISP, video encoder/decoder, and NPU. It’s important to note that CMM differs from CMA, which is typically reserved by the Linux kernel. The ratio of CMM to Linux memory can be adjusted. The main challenge with 4GB (or less) available for Linux is the difficulty in fully utilizing all 8 CPU cores.

SDK support

To begin development the AX650N SDK is essential, but accessing it can be challenging if you're outside of China. For reasons unknown, Sipeed only hosts the SDK on their Baidu account. I reached out to Sipeed support with no response, email was sent on July 17, 2024 and still no response. I encountered the same lack of response from the Sipeed Telegram channel. Contacting AXera doesn't help they point you back to Sipeed. After weeks of trying I finally manage to get hold of the SDK without Sipeed.

The SDK includes a small kernel patch that gets applied against kernel 5.15.73, however the main irritation is that most peripheral drivers are provided as pre-compiled binaries, making the kernel effectively closed source. The rest of the SDK provides a set of closed source user space libraries exposing APIs to the different IP blocks. The primary drawback with a closed source approach is addressing security concerns as highlighted on Sipeeds NanoKVM. SDK documentation is primarily in Chinese with occasional English versions.

Transformers Support

As mentioned earlier, my main focus is the NPU architecture and performance of Transformer models. There are number of pre-built transformer models (Qwen1.5-1.8B, Qwen2-0.5B, MiniCPM-1B/2B, Phi-3-mini) available here, unfortunately again hosted on Baidu. The largest one is Phi-3 mini which requires increasing the CCM space to 5 or 6GB, due to other kernel modules reserving CCM space while the kernel boots. Executing Phi-3 mini results in a token generation rate of approximately 4.4 tokens per second.

The token rate appears to be lower than expected given the AX650N's TOPS claims. Phi-3 Mini seems to be using INT8 quantization, although it's unclear which specific approach, I'm guessing it's w8a8 with SmoothQuant. In comparison, Phi-3 Mini on the RK3588, also using INT8 (w8a8), reportedly runs at up to 6.46 tokens per second. The AX650N claims 18.0TOPS@int8, while the RK3588 is rated at 6 TOPS@int8. Although the performance may be lower, it's important to also consider the quality of the quantized models by evaluating each model's perplexity on both platforms, which I haven't done. Furthermore the vendor for both SOCs aren't particularly forthcoming on perplexity evaluations when releasing models. Regardless of perplexity, the expectation was that the additional 12 TOPS would lead to better performance. The next step is to take a closer look at the NPU architecture to gain a clearer understanding of what might be causing this discrepancy. Given the platform's closed-source nature, this presents several challenges.

To deploy models to the NPU, AXera rely on their Pulsar toolchain, which converts ONNX files into their proprietary axmodel format. At a high level. here's my understanding of how the axmodel files are processed:

The axmodel files contain a mix of ONNX data and an internal graph representation of the model, which is sent to the NPU kernel driver. To execute an LLM model, each layer is stored in its own axmodel file, containing descriptions of the input/output parameters and corresponding weights. The input/output parameters and weights are loaded into CCM. To run the layer, the kernel driver receives references to the input/output parameters, weights, and a list of NPU commands, all of which are included in the file.

Note, there is no api to directly interface with NPU, the only mechanism is by generating a axmodel file. For example to instruct the NPU to multiple 2 int8 matrices the SDK contains a bunch of sample which surprisingly rely on loading an axmodel file.

To efficiently run quantized models on the NPU, it ideally needs the ability to execute a Gemm operation. What's notable from the Pulsar documentation is that the ONNX Gemm operation is restricted as outlined below, suggesting that de-quantization cannot be performed while applying FMA.

alpha: Not supported yet

beta: Not supported yet

Power Consumption

After the board completes booting, the idle current consumption is about 450mA, equating to 5.4 watts (12V x 450mA). While running Phi-3 Mini, the consumption averages 680mA or 8.6 watts. For running a small LLM, the AX650N NPU is highly efficient, with the NPU consuming just 2.71 watts and benefiting from direct access to CPU memory. With a well-designed board, these figures could be significantly reduced.

Neutron NPU

The AXera NPU architecture, known as Neutron, has different iterations across the AX6XXX chipsets. It is primarily designed to support vision applications by executing Convolutional Neural Networks (CNNs). Documentation is sparse, but for the AX650N, there's a brief description in Pulsar along with the diagram below:

The AX650 and M76H NPUs are mainly composed of three Conv convolution cores and three groups of Vector cores. These Conv and Vector cores are allocated in a 1:1 ratio and divided into three groups of vNPUs

The diagram states that the NPU consists of 3 convolution cores, 3 vector cores, and 3 SDMA (System direct memory access) units. We’ll need to investigate further to confirm if this is accurate. The NPU can be set up to function either as 3 distinct Virtual NPUs (vNPUs) or as a single NPU with access to all cores.

The SDK documentation provides more details, highlighting the NPU performance as:

Max. 43.2 TOPS @INT4 and 10.8 TOPS @INT8.

This differs from the original claim of 18.0 TOPS @INT8, with only 10.8 TOPS @INT8 coming from the NPU. I speculate that the remaining 7.2 TOPS might be partially attributed to the dual DSP cores, which feature 256 MACs @INT8. However, I’m still unsure about the remaining TOPS and remain unconvinced that the performance figures are genuine. I suspect there may be some creative accounting at play!

Examination of the memory map offers additional insights into the architecture.

The EU0-EU12 labels denote a total of 13 Execution Units (EUs), along with 3 SDMA units, bringing the overall count to 15. The NPU also features its own local memory through OCM (On-Chip Memory), with an address map indicating 11.5MB (0xAFFFFF). Data transfers between the CMM and OCM must be managed by the SDMA units both before and after instructing the NPU cores to execute commands. There will be an overhead of moving each layers weight data to OCM especially for larger LLM models.

I had hoped the Execution Units were general-purpose compute cores akin to those in a GPU. Although this isn't documented, extensive debugging suggests they are fixed-function IP blocks. The 13 EUs are divided as follows, with some names and descriptions being educated guesses based on the limited information at hand.

Convolution Unit (3x) - Capable of performing convolution operations ( Depthwise/Group Conv, Dilation, and ConvTranspose)
Computer Vision Unit (3x) - Image normalization, reszing, cliping & CV remap/warp.
Tensor Unit (3x) - Operations to support activation functions, pooling & elementwise calculations & reduction calcuations.
MAU (Matrix Arithmetic Unit) (x1) - Multiply two vectors, int8/int16 inputs and fp16/f32 outputs, topN (N <32) outputs

For LLMs, I assume the workload is distributed across the Convolution and Tensor Execution Units (EU), possibly involving the Matrix Arithmetic Unit (MAU). If the MAU is used, it could create a bottleneck due to the single instance. Additionally, I suspect the EUs can't always operate in parallel—for example, the Convolution EU likely needs to complete its output before the Tensor EU can use it as input. Moreover, the limited OCM memory might prevent EUs of the same type from running in parallel when dealing with large weight data.

Wrap Up

The AX650N is an intriguing SoC for image and vision applications, though its suitability for LLMs feels like an afterthought or marketing hype. The performance claims are hard to verify, and my testing doesn't seem to support the TOPs figures. Additionally, the closed-source nature of the software makes it challenging, if not impossible to develop for. This raises doubts in my mind about its suitability for use in a commercial product. The lack of support from Sipeed for the Maix-IV only exacerbates these issues. Ideally, the Maix-IV should have 16GB of memory, evenly split between Linux and CCM. This would enable running larger LLMs while fully utilizing all 8 CPU cores. In its current incarnation its very limited for this purpose.

I welcome AXera to reach out and clarify or address the points raised in this post.

Fleetwood was kind enough to donate a board for me to review as we both shared an interest in validating the performance claims for LLMs.

Tiny Devices

Sunday, 15 September 2024

AX650N - Sipeed Maix-IV (AXeraPi-Pro) NPU teardown