I've finally reached a point with reverse engineering where we can start evaluating the usefulness of the NPU for LLMs. I've crafted a basic library (rk3588-npu) with sufficient features for initial integration. A good reference application for integration is llama2.c, because it's a single C file and the code structure is straightforward to follow and, more importantly, to modify. We will use TinyStories (stories110m) for testing since the models are relatively small, making it easier to troubleshoot when the outputs go astray. Credit to karpathy for providing llama2.c.
As with the fp32 version, a single core runs at 100% as the code is single-threaded.
The next step was to offload all FP16 multiplications to the NPU. With a vocabulary size of 32,000, the largest multiplication is 768 x 32,000, with others being 768 x 768, 768 x 2048, and 2048 x 768. For efficient execution, the model weights need to reside entirely in memory, accessible to the NPU. This requires them to be within the 4GB address space, which can be problematic for larger models. In our case, the weights are roughly 256MB, requiring an expansion of the kernel CMA memory allocation to 512MB. Additionally, the weights needed conversion to the NPU format.
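Expanding the CMA pool is usually done via the kernel command line (e.g. `cma=512M`) or by rebuilding the kernel with a larger `CONFIG_CMA_SIZE_MBYTES`. To give a feel for the shape of the llama2.c change, here is a minimal sketch of routing `matmul` through the NPU. Note that `npu_weight_handle`, `npu_register_weights_fp16` and `npu_matmul_fp16` are hypothetical stand-ins, not the actual rk3588-npu API:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical handle to a weight matrix that has already been converted
 * to the NPU's fp16 layout and placed in CMA-backed memory below 4GB. */
typedef struct npu_weight_handle npu_weight_handle;

/* Hypothetical library calls -- stand-ins for the real rk3588-npu functions. */
npu_weight_handle *npu_register_weights_fp16(const float *w, int d, int n);
void npu_matmul_fp16(npu_weight_handle *w, const _Float16 *x, _Float16 *out,
                     int n, int d);

/* Original llama2.c matmul: W (d,n) @ x (n,) -> xout (d,). Kept as the
 * CPU fallback for matrices that were not registered with the NPU. */
static void matmul_cpu(float *xout, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) val += w[i * n + j] * x[j];
        xout[i] = val;
    }
}

/* NPU-routed variant: activations are narrowed to fp16, the multiply runs
 * on the NPU against the resident weights, and the result is widened back
 * to fp32 for the rest of the forward pass. */
static void matmul_npu(float *xout, const float *x, npu_weight_handle *w, int n, int d) {
    _Float16 *x16 = malloc(n * sizeof(_Float16));
    _Float16 *o16 = malloc(d * sizeof(_Float16));
    for (int j = 0; j < n; j++) x16[j] = (_Float16)x[j];
    npu_matmul_fp16(w, x16, o16, n, d);
    for (int i = 0; i < d; i++) xout[i] = (float)o16[i];
    free(x16);
    free(o16);
}
```

Each weight matrix only needs converting and registering once at load time, so per token only the activations and results cross the userspace/NPU boundary.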
The changes result in an additional uplift of roughly 21 to 23 tokens/s depending on the length of the output, as per the video below. Conservatively, we could say a doubling.
CPU usage fluctuates between 30-60% for the single core. The CPU is still critical for a number of reasons:
1. We're still having to rely on memory copies to send/receive the remaining data for the multiplication to occur on the NPU (a rough sketch of this flow follows the list).
2. Invocation of the NPU kernel driver requires CPU cycles.
3. The rest of the llama2.c code still runs on the CPU.
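To make points 1 and 2 concrete, the per-call flow on the CPU side looks roughly like the following. This is only an illustrative sketch: the job descriptor, the `RKNPU_SUBMIT` request code and the mapped buffer layout are placeholders, not the real rknpu driver interface:

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

struct npu_task {            /* placeholder job descriptor */
    uint64_t weights_dma;    /* DMA address of the resident weights       */
    uint64_t input_dma;      /* DMA address of the activation buffer      */
    uint64_t output_dma;     /* DMA address of the result buffer          */
    uint32_t n, d;           /* matrix dimensions                         */
};

/* Placeholder ioctl request code, not the real rknpu UAPI. */
#define RKNPU_SUBMIT _IOW('N', 0x00, struct npu_task)

static void run_one_matmul(int npu_fd, void *input_map, void *output_map,
                           const _Float16 *x, _Float16 *out,
                           struct npu_task *task, int n, int d) {
    /* 1. Memory copy: stage the activations into the DMA-able buffer. */
    memcpy(input_map, x, n * sizeof(_Float16));

    /* 2. Kernel driver invocation: submit the job and wait for completion.
     *    Both steps burn CPU cycles even though the multiply itself runs
     *    on the NPU. */
    ioctl(npu_fd, RKNPU_SUBMIT, task);

    /* 3. Memory copy: pull the fp16 result back out for the rest of the
     *    CPU-resident llama2.c forward pass. */
    memcpy(out, output_map, d * sizeof(_Float16));
}
```

The two memcpy calls and the ioctl round trip happen for every offloaded multiplication, which is why the CPU never drops anywhere near idle while the NPU is busy.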
Although the results look promising, we need to bear in mind that TinyStories is a very small model in terms of its architecture. Furthermore, it's fortunate that the converted weights fit in memory without having to shuffle weights between userspace and physical memory. In addition, the fp16 format would further limit the possibility of larger models running efficiently. So the conclusion so far: there is some uplift, but mileage will vary depending on model size and number of layers.