After spending a significant amount of time reverse-engineering the RK3588 NPU and examining Rockchip's 6 TOPS claim, I turned my attention to the AXera AX650N SoC, which piqued my interest due to AXera's ambitious claims about the NPU's processing power, boasting "72T mixed precision computing power, native support for Transformer intelligent processing platform". Upon closer inspection, the AX650N delivers 72.0 TOPS@INT4 and 18.0 TOPS@INT8. Interestingly, the claimed gain from INT8 to INT4 is 4x, rather than the typical 2x improvement. There is an ongoing effort to port some of the smaller Transformer models to showcase its capabilities. However, given the performance claims I would have expected some larger models to be showcased as well, but that doesn't seem to be the case.
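A quick back-of-the-envelope check (my own arithmetic, not a vendor figure) shows why that ratio stands out: INT4 typically doubles INT8 throughput, because two 4-bit operands pack into each 8-bit lane.

```python
# INT4 usually packs two operands per INT8 lane, giving a 2x gain.
int8_tops = 18.0
expected_int4 = 2 * int8_tops   # 36.0 TOPS with the typical packing gain
claimed_int4 = 72.0
print(expected_int4, claimed_int4 / int8_tops)  # 36.0, yet a 4.0x ratio is claimed
```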
AX650N
The AX650N is part of AXera's vision processor lineup, featuring a unique capability for intelligent black light vision, which appears to generate color images in low-light conditions. From what I understand, this capability likely stems from its ISP (Image Signal Processor) utilizing RGB+IR data from the image sensor, which opens up the potential for object detection in low-light environments.

The AX650N is powered by eight Cortex-A55 cores and features an unusual configuration with two DDR controllers, each capable of addressing 16GB. Because the controllers operate independently and don't share an address bus, this design can potentially double data transfer rates. Additionally, the SoC includes dual DSP cores (Tensilica Vision Q7) to further boost its vision processing capabilities. Although it's primarily a vision processor, I suspect AXera may also be positioning it to capitalize on the growing interest in large language models (LLMs).
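For a sense of the bandwidth this dual-controller layout could offer, here's a rough estimate; the channel width and speed grade are my assumptions, as AXera doesn't publish them.

```python
# Peak-bandwidth guess for the dual DDR controllers. The 32-bit LPDDR4x
# channel at 4266 MT/s per controller is an assumption, not a datasheet value.
mts = 4266e6                          # transfers per second
channel_bytes = 4                     # 32-bit channel width
per_controller = mts * channel_bytes  # ~17 GB/s per controller
aggregate = 2 * per_controller        # ~34 GB/s with both controllers busy
print(per_controller / 1e9, aggregate / 1e9)
```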
Maix-IV board
AXera collaborated with Sipeed to produce a developer board, the Sipeed Maix-IV, also known as the AXera-Pi Pro. The board boots from eMMC by default and comes with a preinstalled Busybox image; however, the image is quite minimal. Fortunately, an Ubuntu image is available for flashing to the eMMC. The main issue with the Ubuntu image is that it doesn't allocate enough space to the root filesystem's mount points to be useful. To fix this, I had to create a secondary image on a USB drive, boot from it (painfully slow), and then repartition the eMMC. The CPU fan runs continuously as soon as power is applied (with no PWM control), making it a noticeable distraction.
SDK support
To begin development the AX650N SDK is essential, but accessing it can be challenging if you're outside of China. For reasons unknown, Sipeed only hosts the SDK on their Baidu account. I reached out to Sipeed support by email on July 17, 2024 and have yet to receive a response; I encountered the same silence on the Sipeed Telegram channel. Contacting AXera doesn't help either; they simply point you back to Sipeed. After weeks of trying I finally managed to get hold of the SDK without Sipeed's help.
The SDK includes a small kernel patch that applies against kernel 5.15.73; however, the main irritation is that most peripheral drivers are provided as pre-compiled binaries, making the kernel effectively closed source. The rest of the SDK provides a set of closed-source user-space libraries exposing APIs to the different IP blocks. The primary drawback of a closed-source approach is addressing security concerns, as highlighted by Sipeed's NanoKVM. SDK documentation is primarily in Chinese with occasional English versions.
Transformers Support
As mentioned earlier, my main focus is the NPU architecture and the performance of Transformer models. There are a number of pre-built transformer models (Qwen1.5-1.8B, Qwen2-0.5B, MiniCPM-1B/2B, Phi-3-mini) available here, unfortunately again hosted on Baidu. The largest is Phi-3 mini, which requires increasing the CMM space to 5 or 6GB, because other kernel modules reserve CMM space while the kernel boots. Executing Phi-3 mini results in a token generation rate of approximately 4.4 tokens per second.
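The 5 to 6GB figure lines up with a rough sizing sketch (one byte per weight is my assumption about the quantization):

```python
# Rough CMM sizing for Phi-3 mini, assuming INT8 weights (1 byte/param).
params = 3.8e9              # Phi-3 mini's published parameter count
weights_gb = params / 1e9   # ~3.8 GB for the weights alone
print(weights_gb)
# Adding the KV cache, runtime buffers and the CMM that other kernel
# modules reserve at boot makes a 5-6GB reservation plausible.
```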
The token rate appears lower than expected given the AX650N's TOPS claims. Phi-3 Mini seems to be using INT8 quantization, although it's unclear which specific approach; I'm guessing it's w8a8 with SmoothQuant. In comparison, Phi-3 Mini on the RK3588, also using INT8 (w8a8), reportedly runs at up to 6.46 tokens per second. The AX650N claims 18.0 TOPS@INT8, while the RK3588 is rated at 6 TOPS@INT8. Although the performance may be lower, it's important to also consider the quality of the quantized models by evaluating each model's perplexity on both platforms, which I haven't done. Furthermore, the vendors of both SoCs aren't particularly forthcoming with perplexity evaluations when releasing models. Regardless of perplexity, the expectation was that the additional 12 TOPS would lead to better performance. The next step is to take a closer look at the NPU architecture to gain a clearer understanding of what might be causing this discrepancy. Given the platform's closed-source nature, this presents several challenges.
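One plausible explanation is that single-batch token generation is bandwidth bound rather than compute bound, in which case extra TOPS are irrelevant. A sketch of that argument, under my own assumptions about the quantization:

```python
# Roofline sketch: at batch size 1, each generated token re-reads all the
# weights, so tokens/s is capped at effective bandwidth / bytes per token.
weight_bytes = 3.8e9        # Phi-3 mini at INT8, ~1 byte per weight
measured_tps = 4.4
implied_bw = measured_tps * weight_bytes
print(implied_bw / 1e9)     # ~16.7 GB/s of effective DRAM bandwidth
# If that is close to what the DDR subsystem sustains in practice, the NPU
# cores are starved rather than slow, and 18 vs 6 TOPS hardly matters.
```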
Power Consumption
After the board completes booting, the idle current consumption is about 450mA, equating to 5.4 watts (12V x 450mA). While running Phi-3 Mini, the consumption averages 680mA, or roughly 8.16 watts. For running a small LLM, the AX650N NPU is highly efficient, with the NPU consuming just 2.76 watts and benefiting from direct access to CPU memory. With a well-designed board, these figures could be significantly reduced.
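The NPU figure is simply the load/idle delta at the 12V input:

```python
# Power derived from measured input current at 12V.
volts = 12.0
idle_w = volts * 0.450   # 5.4 W after boot, NPU idle
load_w = volts * 0.680   # ~8.16 W while running Phi-3 mini
npu_w = load_w - idle_w  # ~2.76 W attributable to inference
print(idle_w, load_w, npu_w)
```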
Neutron NPU
The AXera NPU architecture, known as Neutron, has different iterations across the AX6XXX chipsets. It is primarily designed to support vision applications by executing Convolutional Neural Networks (CNNs). Documentation is sparse, but for the AX650N there's a brief description in the Pulsar documentation, along with the diagram below:
The diagram shows the NPU consisting of 3 convolution cores, 3 vector cores, and 3 SDMA (system direct memory access) units; we'll need to investigate further to confirm this is accurate. The NPU can be configured either as 3 distinct Virtual NPUs (vNPUs) or as a single NPU with access to all cores.
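Conceptually, the two configurations look like this. Note that this is illustrative pseudocode of the partitioning, not AXera's actual SDK API, and the per-vNPU core split is my guess.

```python
# Illustrative pseudocode only: not the AXera SDK API.
class NeutronNpu:
    def __init__(self, virtual: bool):
        if virtual:
            # Presumably one conv core, one vector core and one SDMA unit
            # per vNPU, although the docs don't spell out the exact split.
            self.partitions = [{"conv": 1, "vector": 1, "sdma": 1}
                               for _ in range(3)]
        else:
            # A single NPU owning all of the cores.
            self.partitions = [{"conv": 3, "vector": 3, "sdma": 3}]

npu = NeutronNpu(virtual=True)  # e.g. detection, tracking, classification
print(len(npu.partitions))      # 3 isolated vNPUs instead of one large NPU
```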
The SDK documentation provides more details, highlighting the NPU performance as:
Max. 43.2 TOPS @INT4 and 10.8 TOPS @INT8.
This differs from the original claim of 18.0 TOPS @INT8, with only 10.8 TOPS @INT8 coming from the NPU. I speculate that the remaining 7.2 TOPS might be partially attributed to the dual DSP cores, which feature 256 MACs @INT8. However, I’m still unsure about the remaining TOPS and remain unconvinced that the performance figures are genuine. I suspect there may be some creative accounting at play!
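A rough estimate supports that suspicion: even with a generous clock (the Vision Q7's clock on the AX650N isn't published, so 1GHz is my assumption), the two DSPs cover only a fraction of the gap.

```python
# How much of the missing 7.2 TOPS could the two Vision Q7 DSPs supply?
macs_per_dsp = 256   # INT8 MACs per DSP, per the SDK docs
ops_per_mac = 2      # a MAC counts as multiply + accumulate
clock_hz = 1.0e9     # assumed 1 GHz clock
dsp_tops = 2 * macs_per_dsp * ops_per_mac * clock_hz / 1e12
print(dsp_tops)      # ~1.0 TOPS, a fraction of the missing 7.2
```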
Examination of the memory map offers additional insights into the architecture.
The EU0-EU12 labels denote a total of 13 Execution Units (EUs), along with 3 SDMA units, bringing the overall count to 16. The NPU also features its own local memory in the form of OCM (On-Chip Memory), with an address map indicating 11.5MB (0xAFFFFF). Data transfers between the CMM and OCM must be managed by the SDMA units both before and after instructing the NPU cores to execute commands. There will be an overhead in moving each layer's weight data to OCM, especially for larger LLMs (see the estimate after this list). The EUs break down as follows:
- Convolution Unit (3x) - performs convolution operations (Depthwise/Group Conv, Dilation, and ConvTranspose)
- Computer Vision Unit (3x) - image normalization, resizing, clipping & CV remap/warp
- Tensor Unit (3x) - operations to support activation functions, pooling, elementwise calculations & reduction calculations
- MAU (Matrix Arithmetic Unit) (x1) - multiplies two vectors, INT8/INT16 inputs and FP16/FP32 outputs, topN (N < 32) outputs
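To put the 11.5MB OCM into perspective, here's a rough estimate of a single Phi-3 mini transformer layer's weight footprint, using shapes from the model's published configuration:

```python
# One Phi-3 mini transformer layer at INT8 versus the 11.5MB OCM.
hidden, ffn = 3072, 8192               # from the public Phi-3 mini config
attn = 4 * hidden * hidden             # q, k, v and o projections
mlp = 2 * hidden * ffn + ffn * hidden  # gate+up and down projections
layer_mb = (attn + mlp) / 1e6          # ~113 MB at 1 byte per weight
print(layer_mb, layer_mb / 11.5)       # ~10x the OCM capacity
# So the SDMA units must stream every layer's weights from CMM in tiles,
# on every generated token, adding overhead the TOPS figures ignore.
```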
For LLMs, I assume the workload is distributed across the Convolution and Tensor EUs, possibly involving the Matrix Arithmetic Unit (MAU). If the MAU is used, it could become a bottleneck since there is only a single instance. Additionally, I suspect the EUs can't always operate in parallel; for example, the Convolution EU likely needs to complete its output before the Tensor EU can consume it as input. Moreover, the limited OCM might prevent EUs of the same type from running in parallel when dealing with large weight data.
Wrap Up
The AX650N is an intriguing SoC for image and vision applications, though its suitability for LLMs feels like an afterthought or marketing hype. The performance claims are hard to verify, and my testing doesn't seem to support the TOPS figures. Additionally, the closed-source nature of the software makes it challenging, if not impossible, to develop for. This raises doubts in my mind about its suitability for use in a commercial product. The lack of support from Sipeed for the Maix-IV only exacerbates these issues. Ideally, the Maix-IV should have 16GB of memory, evenly split between Linux and CMM. This would enable running larger LLMs while fully utilizing all 8 CPU cores. In its current incarnation it's very limited for this purpose.
I welcome AXera to reach out and clarify or address the points raised in this post.
Fleetwood was kind enough to donate a board for me to review as we both shared an interest in validating the performance claims for LLMs.
Hello, I also have an AX650N dev board bought from Sipeed. When using its NPU for neural network model inference, I noticed that the system's load average is particularly high. Additionally, the system's built-in serial port loses data when receiving external input. I used cyclictest to test the system's real-time performance and found that the latency reached 9000~13000 microseconds, which is significantly worse than products like the Jetson Nano or Rockchip SoCs. I tried contacting AXera and Sipeed but received no response. I would like to ask if you've encountered this issue before, or, based on your experience, what might be the cause of this problem? Thank you, and I hope to hear back from you.
When I tested the Phi-3 mini, I didn't encounter the issue you're experiencing; in fact, the CPU load was very low, and the serial port worked fine. That said, I suspect the kernel drivers may not be well-coded, but without access to the source code, this can't be confirmed. Which model(s) were you trying to run?
In fact, I encountered this issue while running the FRTDemo included in the SDK. The test command is /opt/bin/FRTDemo/run.sh -s 0 -p 0, and the effect is to pull video from the camera and use the NPU for inference. It seems that this command requires two external MIPI cameras to run properly. After running this command, the system's load average becomes particularly high (around 16~18), and the vmstat 1 command shows that the system's context switching is extremely frequent.
I would also like to confirm which version of the system SDK you are using. I have encountered this issue on SDK versions 1.40, 1.45, and 2.0.2.