I took up the challenge posted on the sipeed twitter feed
"We are reversing V831's NPU register, and make opensource AI toolchian based on NCNN~ If you are interested in making opensource AI toolchain and familiar with NCNN, please contact support at sipeed.com, we will send free sample board for you to debug"
Sipeed were kind enough to send me one of the initial prototype board of the MAXI-II. To give you a brief introduction the V831 is a camera SOC targeting video encoding applications (cctv, body cams, etc.). It comprises of a Cortex A7 processor combined with 64MB of embedded RAM and for those interested full details of the V831 capabilities can be found in the datasheet.
The datasheet is sparse on information about the NPU :
- V831: Maximum performance up to 0.2Tops
- Supports Conv, Activation, Pooling, BN, LRN, FC/Inner Product
In addition the datsheet briefly mentions two registers in refer to the NPU, one for enabling/resetting the NPU and the other for setting the clock source. No mention of how it can be programmed to perform the operations specified in the datasheet.
Fortunately the registers listed in the sipeed twitter post provided a first clue and after many months of trial and error, endless deciphering of data dumps, a few dead ends and numerous reverse engineering attempts, parts of the NPU operations have been decoded. Fundamentally a large portion of the NPU is a customised implementation of Nvidia Deep Learning Accelerator (NVDLA) architecture. More details about the project can be found on the NVDLA site and here is a quote of it aims :
The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators.
What I have determined so far about the NPU is:
1. The NPU clock can be set between 100-1200 Mhz with the code defaulting to 400 Mhz. My hunch is that this may tie to the clock speed of the onboard DDR2 memory.
2. NPU is implemented with nv_small configuration (NV Small Model) and relies on system memory for all data operations. Importantly CPU and NPU are sharing the memory bus.
3. It supports both int8 and int16, however I haven't verified if FP16 is supported or not. Theoretically int8 should be twice as fast as int16 while also preserving memory given the V831 limited onboard memory (64Mb).
4. Number of MACs is 64 (Atomic-C * Atomic-K)
5. NPU registers are memory mapped and therefore can be programmed from userspace which proved to be extremely useful for initial debugging & testing.
6. NPU requires physical address locations when referencing weights & input/output data locations therefore kernel memory needs to be allocated and the physical addresses retrieved if accessed from userspace.
7. NPU weights and input/output data follow a similar layout to the NVDLA private formats. Therefore well knows formats like nhwc or nchw require transformation before they can be fed to the NPU.
Initial code from my endeavours is located in this repo v831-npu and should be treated as work in progress. Hopefully this forms the basis of the fully open source implementation. The tests directory has code from my initial attempts to interact with the hardware and is redundant. However it can be used as an initial introduction to how the hardware units works and what configuration is required. So far I have decoded the CONV, SDP and PDP units which allow for the following operations (tested with int8 data type) :
1. Direct Convolutions
2. Bias addition
3. Relu/Prelu
4. Element wise operations
5. Max/Average pooling
To verify most of the above I ported across the cifar10 example (see examples directory) from ARMs CMSIS_5 NN library. Furthermore I have managed to removed all dependencies on closed AllWinner libraries, this is partially achieved by implementing a simple ION memory allocation utility. Instructions to build cifar10 for deploying on the MAXI-II are below (assuming you using a linux machine) :
1. Clone the SDK toolchain git repo from here. We are still dependent on the SDK toolchain as the MAXI-II kernel/rootfs is built with this toolchain.
2. Export PATH to include 'lindenis-v536-prebuilt/gcc/linux-x86/arm/toolchain-sunxi-musl/toolchain/bin' so that arm-openwrt-linux-gcc can be found.
3. Run 'make'
4. Copied built executable 'nna_cifar10' to MAXI-II
5. Run './nna_cifar10', output should be as below given the input image was a boat:
There is still quite a bit of work left to be done such as :
1. Weight and input/output data conversion utility
2. The NPU should support pixel input formats which needs to be verified.
2. Decoding remaining hardware units
3. Possibly integrating with an existing AI framework or writing a compiler.
By the way the new Beagle V is also spec'd to implement NVDLA with a larger MAC size of 1024.
I would like to thank sipeed for providing the hardware/software.
I liked to thank motiveorder.com for sponsoring the development time for this work.