I've finally reached a point with the reverse engineering where we can start evaluating the usefulness of the NPU for LLMs. I've crafted a basic library (rk3588-npu) with sufficient features for initial integration. A good reference application for integration is llama2.c, because it's a single C file and the code structure is straightforward to follow and, more importantly, to modify. We will use TinyStories (stories110m) for testing since the models are relatively small, making it easier to troubleshoot when the outputs go astray. Credit to karpathy for providing llama2.c.
As with the fp32 version, a single core runs at 100% because the code is single-threaded.
The next step was to offload all FP16 multiplications to the NPU. With a vocabulary size of 32,000, the largest multiplication is 768 x 32,000, with others being 768 x 768, 768 x 2048, and 2048 x 768. For efficient execution, the model weights need to reside entirely in memory, accessible to the NPU. This requires them to be within the 4GB address space, which can be problematic for larger models. In our case, the weights are roughly 256MB, requiring an expansion of the kernel CMA memory allocation to 512MB. Additionally, the weights needed conversion to the NPU format.
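For reference, below is a minimal sketch of what enlarging the CMA pool can look like. The exact boot file to edit varies by board and bootloader (extlinux.conf, uEnv.txt, armbianEnv.txt, etc.), so treat the command-line fragment as illustrative rather than board-specific instructions.

```sh
# Append to the kernel command line (where depends on the bootloader):
#     ... cma=512M
# After a reboot, check that the kernel actually reserved the region:
grep Cma /proc/meminfo
# CmaTotal:    524288 kB
# CmaFree:     ...
```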
These changes result in an additional uplift of roughly 21 to 23 tokens/s, depending on the length of the output, as shown in the video below. Conservatively, we could say a doubling.
CPU usage fluctuates between 30% and 60% on the single core. The CPU is still critical for a number of reasons (a sketch of the offload path follows this list):
1. We still have to rely on memory copies to send/receive the remaining data before the multiplication can occur on the NPU.
2. Invocation of the NPU kernel driver requires CPU cycles.
3. The rest of the llama2.c code still runs on the CPU.
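To make points 1 and 2 concrete, here is a minimal sketch of what the offload path can look like when llama2.c's matmul() is redirected to the NPU. The npu_buf struct and the npu_* helpers are illustrative stand-ins, not the actual rk3588-npu API, and the sketch assumes the ARM half-precision __fp16 type available with GCC/Clang on aarch64.

```c
/* Hedged sketch -- not the actual rk3588-npu API. npu_alloc()/npu_matmul_fp16()
 * and struct npu_buf are hypothetical names; the point is the CPU work that
 * wraps every NPU multiplication. */
#include <stddef.h>
#include <stdint.h>

typedef __fp16 fp16;                         /* ARMv8 half-precision type (GCC/Clang) */

struct npu_buf { fp16 *cpu; uint64_t dma; }; /* CPU mapping + DMA address of a CMA buffer */

/* Assumed helpers (hypothetical signatures). */
extern struct npu_buf npu_alloc(size_t bytes);
extern void npu_matmul_fp16(uint64_t w_dma, uint64_t x_dma, uint64_t y_dma,
                            int d, int n);   /* y(d) = W(d,n) . x(n), all fp16 */

/* Drop-in for llama2.c's matmul(): xout(d) = W(d,n) . x(n), with W already
 * resident in NPU-visible memory in the converted layout. */
void matmul_npu(float *xout, const float *x, const struct npu_buf *w, int n, int d) {
    static struct npu_buf xin, yout;
    static size_t xin_cap = 0, yout_cap = 0;
    if ((size_t)n > xin_cap)  { xin  = npu_alloc((size_t)n * sizeof(fp16)); xin_cap  = n; }
    if ((size_t)d > yout_cap) { yout = npu_alloc((size_t)d * sizeof(fp16)); yout_cap = d; }

    /* CPU cost #1: convert and copy the activations into DMA-visible memory. */
    for (int i = 0; i < n; i++) xin.cpu[i] = (fp16)x[i];

    /* CPU cost #2: submit the job to the NPU kernel driver and wait for it. */
    npu_matmul_fp16(w->dma, xin.dma, yout.dma, d, n);

    /* CPU cost #3: convert the fp16 result back to fp32 for the rest of llama2.c. */
    for (int i = 0; i < d; i++) xout[i] = (float)yout.cpu[i];
}
```

Even in this simplified form, all three steps run on the CPU for every token, which is consistent with the 30-60% utilisation observed above.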
Although the results look promising, we need to bear in mind that TinyStories is a very small model in terms of architecture. Furthermore, it's fortunate that the converted weights fit in memory without having to shuffle weights between userspace and physical memory. In addition, the fp16 format would further limit the possibility of running larger models efficiently. So, the conclusion so far: there is some uplift, but mileage will vary depending on model size and the number of layers.
Hi, great job. How can I convert the weights to NPU format? Do you have any tools similar to what llama2.c provides?
As of yet there's no tool; you can use the code from this function: https://github.com/mtx512/rk3588-npu/blob/5d86093190c203a62c0259036c2659acc3900e9a/src/npu_matmul.c#L485 . An example is here: https://github.com/mtx512/rk3588-npu/blob/5d86093190c203a62c0259036c2659acc3900e9a/tests/matmul_fp16_fp16.c#L192
Thank you.
Hi. Excellent article series. Do you know about tinygrad? It might be a framework to integrate this custom chip into; it makes it easy to implement just the needed operations while the framework takes care of the rest. I have a couple of rk3588 boards myself and am tinkering with Kubernetes and MLOps on this board. Unfortunately, using the NPU is challenging, and running custom models is a conversion pain with the libraries they provide. I think many people are interested in this topic, as this board is currently the best for performance/watt (if you don't want to buy Apple products; otherwise the M4 Mac mini is probably the best ATM). It would be possible to run all kinds of workflows (home NAS applications, for instance) much more efficiently than using the CPU (with onnx-runtime) all the time. Hope you continue with this project :)