Thursday, 29 April 2021

Reverse engineering the V831 NPU (Neural Processor Unit)

I took up the challenge posted on the Sipeed Twitter feed:

"We are reversing V831's NPU register, and make opensource AI toolchian based on NCNN~ If you are interested in making opensource AI toolchain and familiar with NCNN, please contact support at, we will send free sample board for you to debug"

Sipeed were kind enough to send me one of the initial prototype boards of the MAXI-II. To give you a brief introduction, the V831 is a camera SoC targeting video encoding applications (CCTV, body cams, etc.). It comprises a Cortex-A7 processor combined with 64MB of embedded RAM. For those interested, full details of the V831's capabilities can be found in the datasheet.

The datasheet is sparse on information about the NPU:

  • V831: Maximum performance up to 0.2Tops
  • Supports Conv, Activation, Pooling, BN, LRN, FC/Inner Product

In addition the datasheet briefly mentions two registers relating to the NPU, one for enabling/resetting the NPU and the other for setting the clock source. There is no mention of how it can be programmed to perform the operations specified in the datasheet.

Fortunately the registers listed in the Sipeed Twitter post provided a first clue, and after many months of trial and error, endless deciphering of data dumps, a few dead ends and numerous reverse engineering attempts, parts of the NPU operations have been decoded. Fundamentally, a large portion of the NPU is a customised implementation of the NVIDIA Deep Learning Accelerator (NVDLA) architecture. More details about the project can be found on the NVDLA site, and here is a quote of its aims:

The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators.

What I have determined so far about the NPU is:

1. The NPU clock can be set between 100-1200 MHz with the code defaulting to 400 MHz. My hunch is that this may be tied to the clock speed of the onboard DDR2 memory.

2. The NPU is implemented with the nv_small configuration (NV Small Model) and relies on system memory for all data operations. Importantly, the CPU and NPU share the memory bus.

3. It supports both int8 and int16; however, I haven't verified whether FP16 is supported. Theoretically int8 should be twice as fast as int16 while also conserving memory, which matters given the V831's limited onboard memory (64MB).

4. Number of MACs is 64 (Atomic-C * Atomic-K)

5. NPU registers are memory mapped and can therefore be programmed from userspace, which proved to be extremely useful for initial debugging & testing.

6. The NPU requires physical addresses when referencing weights & input/output data locations. Therefore, if it is driven from userspace, kernel memory needs to be allocated and the physical addresses retrieved.

7. NPU weights and input/output data follow a similar layout to the NVDLA private formats. Therefore well-known formats like NHWC or NCHW require transformation before they can be fed to the NPU.
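
As a rough sanity check on the clock range and MAC count listed above, peak throughput for a MAC array is commonly estimated as two operations (one multiply, one add) per MAC per cycle. The sketch below applies that rule of thumb. The two-ops-per-MAC convention and the assumption that every MAC is busy on every cycle are mine and do not come from the datasheet.

```python
def peak_tops(num_macs: int, clock_mhz: float, ops_per_mac: int = 2) -> float:
    """Theoretical peak in tera-operations per second.

    Assumes full utilisation of every MAC on every cycle (an idealisation).
    """
    return num_macs * ops_per_mac * clock_mhz * 1e6 / 1e12


if __name__ == "__main__":
    # 64 MACs at the minimum, default and maximum clocks quoted above.
    for mhz in (100, 400, 1200):
        print(f"{mhz:>5} MHz: {peak_tops(64, mhz):.4f} TOPS")
```

Under these assumptions 64 MACs at 1200 MHz works out to roughly 0.15 TOPS, so treat the result as an order-of-magnitude check rather than a reconciliation with the datasheet figure.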

Initial code from my endeavours is located in the v831-npu repo and should be treated as work in progress. Hopefully this forms the basis of a fully open source implementation. The tests directory has code from my initial attempts to interact with the hardware and is now redundant. However, it can be used as an initial introduction to how the hardware units work and what configuration is required. So far I have decoded the CONV, SDP and PDP units, which allow for the following operations (tested with the int8 data type):

1. Direct Convolutions

2. Bias addition

3. ReLU/PReLU

4. Element-wise operations

5. Max/Average pooling
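
When checking hardware output for operations like those listed above, it helps to have a slow but obvious software reference to compare against. The following is a minimal sketch of a direct convolution with bias addition, ReLU and int8 saturation in plain Python. The tensor shapes, the function name and the use of a plain right shift for output scaling are illustrative assumptions and are not taken from the v831-npu code.

```python
def conv2d_int8_ref(x, w, bias, out_shift=0, relu=True):
    """Direct convolution reference (stride 1, no padding).

    x: [C][H][W], w: [K][C][R][S], bias: [K], all plain Python ints.
    out_shift is a hypothetical stand-in for the hardware's output scaling.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    out = []
    for k in range(K):
        plane = []
        for oy in range(H - R + 1):
            row = []
            for ox in range(W - S + 1):
                acc = bias[k]  # bias addition
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            acc += x[c][oy + r][ox + s] * w[k][c][r][s]
                acc >>= out_shift  # arithmetic shift, rounds toward -infinity
                if relu:
                    acc = max(acc, 0)
                row.append(max(-128, min(127, acc)))  # saturate to int8
            plane.append(row)
        out.append(plane)
    return out
```

A dump of the NPU output buffer, once converted back to a plain channel-major layout, could then be compared element by element against the list this function returns.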

To verify most of the above I ported across the cifar10 example (see the examples directory) from ARM's CMSIS_5 NN library. Furthermore, I have managed to remove all dependencies on closed AllWinner libraries; this was partially achieved by implementing a simple ION memory allocation utility. Instructions to build cifar10 for deployment on the MAXI-II are below (assuming you are using a Linux machine):

1. Clone the SDK toolchain git repo from here. We are still dependent on the SDK toolchain as the MAXI-II kernel/rootfs is built with this toolchain.

2. Export PATH to include 'lindenis-v536-prebuilt/gcc/linux-x86/arm/toolchain-sunxi-musl/toolchain/bin' so that arm-openwrt-linux-gcc can be found.

3. Run 'make'

4. Copy the built executable 'nna_cifar10' to the MAXI-II

5. Run './nna_cifar10'. The output should be as below, given that the input image was a boat:

There is still quite a bit of work left to be done, such as:

1. Weight and input/output data conversion utility

2. The NPU should support pixel input formats, which needs to be verified.

3. Decoding the remaining hardware units

4. Possibly integrating with an existing AI framework or writing a compiler.
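
As a starting point for the data conversion utility listed above, here is a minimal sketch that repacks a flat NCHW tensor (batch of one) into a channel-blocked layout of the kind the NVDLA documentation describes for feature data: channels are grouped into fixed-size "atoms", and within each group the data is ordered by row, then column, then channel. The atom size of 8, the zero padding and the function name are assumptions for illustration. The exact layout and any line or surface stride alignment the V831 expects would need to be confirmed against the hardware.

```python
def nchw_to_blocked(data, channels, height, width, atom=8, pad=0):
    """Repack a flat CHW list into a flat [C/atom][H][W][atom] list.

    Channels beyond the real channel count are filled with `pad`.
    atom=8 is an assumed value, not a confirmed V831 parameter.
    """
    surfaces = (channels + atom - 1) // atom  # number of channel groups
    out = []
    for s in range(surfaces):
        for y in range(height):
            for x in range(width):
                for a in range(atom):
                    c = s * atom + a
                    if c < channels:
                        out.append(data[(c * height + y) * width + x])
                    else:
                        out.append(pad)
    return out
```

For a 3-channel image this produces a single channel group in which each pixel's 3 channel values are followed by 5 padding values.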

By the way, the new BeagleV is also spec'd to implement NVDLA with a larger MAC size of 1024.

I would like to thank Sipeed for providing the hardware/software.

I would like to thank for sponsoring the development time for this work.

Sunday, 8 March 2020

ESP32 impersonates a Particle Xenon

With the announcement that Particle will no longer manufacture the Xenon development board and will drop their OpenThread-based mesh networking solution, we decided to see if we could impersonate existing claimed Xenons (i.e. ones that are already registered on the cloud) on alternative hardware. Hence the idea of 'bring your own device' to connect to the cloud.

After reviewing the device-os source code for a few months, it turned out that to get a proof of concept working I needed to implement at minimum the following:

1. Port across the DTLS protocol layer, as it turns out the Gen 3 devices create a secure UDP socket connection over DTLS.
2. Extract the device's private key and the cloud public key (no certificates are stored). Particle's implementation of the DTLS handshake relies purely on Raw Public Key support (RFC 7250).
3. Implement a CoAP layer, as the 'Spark protocol' is built on top of this.
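
To give a feel for what the CoAP layer in the last item involves, the sketch below packs and parses the fixed 4-byte CoAP header defined in RFC 7252 (version, type, token length, code, message ID) followed by the token. It is a generic illustration in Python rather than the ESP-IDF code used here, it covers no options or payload, and it does not reproduce any actual 'Spark protocol' message.

```python
import struct

COAP_VERSION = 1
TYPE_CON, TYPE_NON, TYPE_ACK, TYPE_RST = 0, 1, 2, 3


def coap_code(cls, detail):
    """Combine a code class and detail, e.g. (0, 1) is 0.01 GET."""
    return (cls << 5) | detail


def build_coap_header(msg_type, code, message_id, token=b""):
    """Return the 4-byte CoAP header followed by the token."""
    if len(token) > 8:
        raise ValueError("CoAP tokens are at most 8 bytes")
    first = (COAP_VERSION << 6) | (msg_type << 4) | len(token)
    return struct.pack("!BBH", first, code, message_id) + token


def parse_coap_header(packet):
    """Split a CoAP packet into its header fields and token."""
    first, code, message_id = struct.unpack("!BBH", packet[:4])
    token_len = first & 0x0F
    return {
        "version": first >> 6,
        "type": (first >> 4) & 0x03,
        "code": code,
        "message_id": message_id,
        "token": packet[4:4 + token_len],
    }
```

On a Gen 3 device these bytes would travel inside the DTLS session from the first item rather than over a plain UDP socket.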

The above was implemented as a set of library functions using the ESP-IDF and I reused the ESP32 (LILYGO TTGO) from the previous post, which fortunately hosts an OLED 128x64 display. In the video we demonstrate the ESP32 as it:

1. Connects to a wifi access point.
2. Retrieves time from an SNTP server.
3. Connects to the Particle Cloud via a DTLS handshake.
4. Sends a number of 'Spark protocol' messages to let the cloud know the Xenon is alive.
5. Awaits commands from the Cloud, including ping and signal operations. When receiving the signal command the screen scrolls the text from left to right.
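
The time retrieval step in the sequence above is simple at the wire level. As a hedged illustration (in Python, not the ESP-IDF SNTP client used on the device), the sketch below builds a minimal 48-byte SNTP client request and extracts the transmit timestamp from a response, converting from the NTP epoch (1900) to the Unix epoch (1970). The function names are my own.

```python
import struct

NTP_UNIX_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_sntp_request():
    """Minimal client request: LI=0, version=3, mode=3, everything else zero."""
    return bytes([0x1B]) + bytes(47)


def parse_sntp_transmit_time(response):
    """Return the server transmit timestamp as Unix time in seconds."""
    if len(response) < 48:
        raise ValueError("SNTP responses are at least 48 bytes")
    # The transmit timestamp is the last 8 bytes of the 48-byte header:
    # 32-bit seconds followed by a 32-bit binary fraction of a second.
    seconds, fraction = struct.unpack("!II", response[40:48])
    return seconds - NTP_UNIX_OFFSET + fraction / 2**32
```

The request would be sent to UDP port 123 of the chosen server and the reply passed to parse_sntp_transmit_time.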

I would like to thank for sponsoring the hardware and development time for this article.

Saturday, 30 November 2019

Particle Xenon - Adding WIFI support with an ESP32

The preferred option for WIFI support with Gen 3 devices is to deploy a Particle Argon. The Argon consists of a Nordic nRF52840 paired with an Espressif ESP32. The ESP32 simply provides the WIFI interface and runs a customised version of the ESP-AT firmware (argon-ncp-firmware). The nRF52840 communicates with the ESP32 over one of its serial ports using four pins: TX, RX, CTS & RTS. The challenge here was to see if we could enable WIFI support on a Particle Xenon by connecting it to an ESP32 running the argon-ncp-firmware. As demonstrated in the video it was possible, although it required jumping through a number of hoops.

Unfortunately the only spare ESP32 board I had was a LILYGO TTGO, which is a 16M board with an OLED display. So the first task was porting the argon-ncp-firmware and refactoring the pin mappings to support this board. Once this was complete it was fairly easy to validate that the firmware was functioning by simply executing the AT commands the Argon issues to establish WIFI connectivity.

For the Xenon the primary change was to port across the Argon ESP32 networking code. This turned out to be more challenging than envisaged, primarily because the Xenon firmware isn't expecting a WIFI configuration and the command line tools don't support provisioning a WIFI connection for a Xenon. After 4 weeks of effort I finally had a working build of the Xenon firmware. It took another 2 weeks to get the Xenon provisioned with a WIFI configuration so it could connect to the Particle Cloud.

The main drawback of this approach is that the firmware on both the Xenon and the ESP32 is customised, therefore any updates from the Cloud would override the changes. Hence a customised rebuild is required when new firmware is released.

I would like to thank for sponsoring the hardware and development time for this article.

Tuesday, 26 November 2019

Particle Xenon - Enable Ethernet connectivity with a low cost W5500 module

The preferred option to enable a Xenon to act as a Gateway is to deploy the Particle Ethernet FeatherWing. Unfortunately I didn't have one to hand, however after reviewing the schematics it turns out this FeatherWing simply relies on the WIZnet W5500 Ethernet controller.

From a previous project I did have a W5500 Ethernet Module (which seems to be widely available and relatively cheap), so the challenge was to see if the Xenon could work with it.

In the end it turned out to be relatively simple to connect the Xenon to the module through the exposed SPI interface. The back of the PCB indicates the pin out details for the W5500. The diagram below details which pins on the Xenon need to be connected to the W5500 Module.

This post on the Particle site covers how to enable Ethernet and fingers crossed your Xenon should connect to the Particle cloud as mine did.
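
If the Xenon does not come up on Ethernet, one way to rule out wiring mistakes is to read the W5500 version register over SPI from any host that can drive the bus. The sketch below only builds the bytes of a W5500 SPI frame (16-bit address, control byte, data) as described in the WIZnet datasheet and leaves the actual SPI transfer to whatever library is at hand. The function name is hypothetical and this is not how Device OS itself probes the chip.

```python
COMMON_REG_BLOCK = 0x00  # block select bits for the common register block
VERSIONR = 0x0039        # chip version register
EXPECTED_VERSION = 0x04  # value the W5500 reports in VERSIONR


def w5500_frame(address, block, write, payload=b"", read_len=0):
    """Build a variable-length-mode W5500 SPI frame.

    Control byte: block select in bits 7..3, read/write in bit 2 (1 = write),
    operation mode in bits 1..0 (00 = variable length data mode).
    For reads, dummy zero bytes are appended to clock the data out.
    """
    control = ((block & 0x1F) << 3) | ((1 if write else 0) << 2)
    header = bytes([(address >> 8) & 0xFF, address & 0xFF, control])
    return header + (bytes(payload) if write else bytes(read_len))
```

Clocking out w5500_frame(VERSIONR, COMMON_REG_BLOCK, write=False, read_len=1) with chip select held low should return EXPECTED_VERSION in the final received byte when the wiring is correct.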

Tuesday, 27 August 2019

Jetson Nano - Developing a Pi v1.3 camera driver Part 2

I would like to thank for sponsoring the hardware and development time for this article.

Following on from my previous post, I am finally in a position to release an alpha version of the driver, unfortunately at this stage only in binary form. Development of the driver has been complicated by the fact that determining the correct settings for the OV5647 is extremely time consuming given the lack of good documentation.

The driver supports the following resolutions:

2592 x 1944 @15 fps
1920 x 1080 @30 fps
1280 x 960  @45 fps
1280 x 720  @60 fps

I have added support for 720p because most of the clone cameras seem to be targeting 1080p or 720p based on the lens configuration. I mainly tested with an original RPI V1.3 camera to ensure backward compatibility.
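
To compare the supported modes, it can be useful to look at the active pixel rate each one implies. The sketch below is simple arithmetic over the four resolutions listed above. It ignores blanking intervals and protocol overhead, and the 10 bits per pixel default assumes raw 10-bit Bayer output, so the numbers are a lower bound on the real link rate rather than measured values.

```python
MODES = [(2592, 1944, 15), (1920, 1080, 30), (1280, 960, 45), (1280, 720, 60)]


def active_pixel_rate(width, height, fps):
    """Active pixels per second, ignoring horizontal and vertical blanking."""
    return width * height * fps


def raw_data_rate_mbps(width, height, fps, bits_per_pixel=10):
    """Payload data rate in megabits per second for the active pixels only."""
    return active_pixel_rate(width, height, fps) * bits_per_pixel / 1e6


if __name__ == "__main__":
    for w, h, fps in MODES:
        print(f"{w}x{h}@{fps}: {raw_data_rate_mbps(w, h, fps):.1f} Mbit/s")
```

By this measure the full-resolution 15 fps mode carries the most data, while the 960p and 720p modes have identical active pixel rates.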

The driver is pre-compiled against the latest L4T R32.2 release, so there is a requirement to deploy a kernel plus modules along with a new dtb file. Therefore I recommend you do some background reading to understand the process before deploying. Furthermore, I recommend you have access to the linux console via the UART interface in case the new kernel fails to boot or the camera is not recognised.

Deployment of the kernel and modules will be done on the Nano itself while flashing of the dtb file has to be done from a Linux machine where the SDK Manager is installed.

Download nano_ov5647.tar.gz and extract to your nano:

mkdir ov5647
cd ov5647

tar -xvf ../nano_ov5647.tar.gz

After extraction you will see the following files:

-rw-r--r-- 1 user group 291462110 Aug 26 17:23 modules_4_9_140.tar.gz
-rw-r--r-- 1 user group 200225    Aug 26 17:26 tegra210-p3448-0000-p3449-0000-a02.dtb
-rw-r--r-- 1 user group  34443272 Aug 26 17:26 Image-ov5647

Copy the kernel to the /boot directory:

sudo cp Image-ov5647 /boot/Image-ov5647

Change the boot configuration file to load our kernel by editing /boot/extlinux/extlinux.conf. Comment out the existing LINUX line and add the new kernel, so the change is from this:

      LINUX /boot/Image

to this:

      #LINUX /boot/Image
      LINUX /boot/Image-ov5647

Next step is to extract the kernel modules:

cd /lib/modules/
sudo tar -xvf <path to where files were extracted>/modules_4_9_140.tar.gz

The last step is to flash the dtb file, tegra210-p3448-0000-p3449-0000-a02.dtb. As discussed in the comments section (below) by jiangwei, it is possible to copy the dtb file directly to the Nano; refer to this link on how this can be achieved. See the section "Flash custom DTB on the Jetson Nano".

Alternatively you can use SDK Manager. Flashing requires copying the dtb file to the linux host machine into the directory Linux_for_Tegra/kernel/dtb/ where your SDK is installed. Further instructions on how to flash the dtb are covered in a post I made here; however, since we don't want to replace the kernel, the command to use is:

sudo ./ --no-systemimg -r -k DTB jetson-nano-qspi-sd mmcblk0p1

There seems to be some confusion about how to put the nano into recovery mode. The steps to do that are:

1. Power down nano
2. J40 - Connect recovery pins 3-4 together
3. Power up nano
4. J40 - Disconnect pins 3-4
5. Flash file

After flashing the dtb the nano should boot the new kernel and hopefully the desktop will reappear. To verify the new kernel we can run the following command:

uname -a

It should report the kernel version as 4.9.140+:

Linux jetson-desktop 4.9.140+

If successful, power down the Nano and connect your camera to FPC connector J13. Power up the nano and once the desktop reappears verify the camera is detected by:

dmesg | grep ov5647

It should report the following:

[    3.584908] ov5647 6-0036: tegracam sensor driver:ov5647_v2.0.6
[    3.603566] ov5647 6-0036: Found ov5647 with model id:5647 process:11 version:1
[    5.701298] vi subdev ov5647 6-0036 bound

The above indicates the camera was detected and initialised. Finally we can try streaming; commands for the different resolutions are below:

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=2592, height=1944, framerate=15/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=2592, height=1944' ! nvvidconv ! nvegltransform ! nveglglessink -e

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1920, height=1080, framerate=30/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1920, height=1080' ! nvvidconv ! nvegltransform ! nveglglessink -e

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1280, height=960, framerate=45/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1280, height=960' ! nvvidconv ! nvegltransform ! nveglglessink -e

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1280, height=720, framerate=60/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1280, height=720' ! nvvidconv ! nvegltransform ! nveglglessink -e

The driver supports controlling of the analogue gain which has a range of 16 to 128. This can be set using the 'gainrange' property, example below:

gst-launch-1.0 nvarguscamerasrc gainrange="16 16" ! 'video/x-raw(memory:NVMM),width=1280, height=720, framerate=60/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1280, height=720' ! nvvidconv ! nvegltransform ! nveglglessink -e
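
Since the pipelines above differ only in width, height, frame rate and the optional gain setting, they can be generated rather than copied. The helper below is a small convenience sketch, not part of the driver package. It reproduces the command lines shown above as strings and rejects gain values outside the 16 to 128 range mentioned for the driver.

```python
def build_pipeline(width, height, fps, gain=None):
    """Return a gst-launch-1.0 command line for one of the driver's modes.

    gain, if given, is applied as a fixed analogue gain via 'gainrange'.
    """
    src = "nvarguscamerasrc"
    if gain is not None:
        if not 16 <= gain <= 128:
            raise ValueError("analogue gain must be between 16 and 128")
        src += f' gainrange="{gain} {gain}"'
    return (
        f"gst-launch-1.0 {src} ! "
        f"'video/x-raw(memory:NVMM),width={width}, height={height}, framerate={fps}/1' ! "
        f"nvvidconv flip-method=0 ! 'video/x-raw,width={width}, height={height}' ! "
        f"nvvidconv ! nvegltransform ! nveglglessink -e"
    )


if __name__ == "__main__":
    print(build_pipeline(1280, 720, 60, gain=16))
```

The resulting string can be pasted into a shell on the Nano or passed to a process launcher.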

If you require commercial support please contact

Sunday, 23 June 2019

Jetson Nano - Developing a Pi v1.3 camera driver Part 1

I would like to thank for sponsoring the hardware and development time for this article.

The jetson nano is a fairly capable device considering its appealing price point. In fact it's one of the few ARM devices which out of the box provides a decent (and usable) X11 graphics stack (even though the drivers are closed source).
Although the jetson nano supports the same 15 pin CSI connector as the RPI, camera support is currently limited to Pi V2 cameras, which host the imx219. The older Pi v1.3 cameras are appealing partly because there are numerous low cost clones available and partly because there are numerous add ons such as lenses and night mode options.

The v1.3 cameras use the OV5647, which apparently has been discontinued by OmniVision; furthermore the full datasheet isn't freely available (only under NDA). There is a preliminary datasheet on the internet but it seems to be incomplete or, worse, inconsistent in places. This does hinder the process somewhat, as debugging errors can be very time consuming and at times frustrating.

One noticeable difference is that the v1.3 camera hosts a 25MHz crystal where most non rpi OV5647 boards use a standard 24MHz. This can make the tuning more difficult as some of the default settings need adjustments.

The first step in bringing up the camera was ensuring the board was powered on so that it could be detected through its I2C interface (address 0x36). After numerous attempts the OV5647 finally appeared:

Warning: Can't use SMBus Quick Write command, will skip some addresses
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-6.
I will probe address range 0x03-0x77.
Continue? [Y/n] Y
     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
30: -- -- -- -- -- -- UU --                        
50: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

The second step was to develop enough of a skeleton kernel driver to initialise the OV5647 and enable it as a v4l2 device. Although this may sound easy, it turned out to be extremely time consuming for two reasons. Firstly, there is a lack of documentation for the OV5647, and secondly, the NVIDIA camera driver documentation is also poor and in a number of cases doesn't match the code. Finally after a few weeks a v4l2 device appeared:

jetson-nano@jetsonnano-desktop:~$ v4l2-ctl -d /dev/video0 -D
Driver Info (not using libv4l2):
    Driver name   : tegra-video
    Card type     : vi-output, ov5647 6-0036
    Bus info      :
    Driver version: 4.9.140
    Capabilities  : 0x84200001
        Video Capture
        Extended Pix Format
        Device Capabilities
    Device Caps   : 0x04200001
        Video Capture
        Extended Pix Format

The next step was to put the camera in test pattern mode and capture a raw image. The OV5647 outputs raw bayer format, in our case 10 bit, so the captured raw data file needs to be converted to a displayable format. Conversion can be done using a utility like bayer2rgb. Finally I arrived at a valid test pattern.
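
Before a raw capture can be fed to a demosaicing tool, the 10-bit samples usually need reducing to 8 bits. The sketch below shows one way to do that. It assumes each sample sits in the low 10 bits of a little-endian 16-bit word, which is an assumption about the capture format and should be checked against the actual dump (some platforms place the sample in the upper bits instead). The output is still a Bayer mosaic and needs demosaicing afterwards.

```python
def raw10_words_to_8bit(buf, shift=2):
    """Convert little-endian 16-bit words holding 10-bit samples to bytes.

    Assumes the sample occupies bits 9..0 of each word (an assumption).
    shift=2 keeps the 8 most significant of the 10 bits.
    """
    if len(buf) % 2:
        raise ValueError("expected an even number of bytes")
    out = bytearray(len(buf) // 2)
    for i in range(0, len(buf), 2):
        sample = (buf[i] | (buf[i + 1] << 8)) & 0x3FF
        out[i // 2] = sample >> shift
    return bytes(out)
```

The returned bytes, one per pixel, can be written to a file and passed to a Bayer conversion utility along with the frame width and height.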

The next stage was to configure the OV5647 to a valid resolution for image capture, which again has been extremely challenging for the reasons stated above. Some of the images from numerous attempts are shown on the left and right.

Current progress is that the camera is outputting 1920x1080@30fps; however, this is work in progress as the driver is in a primitive state and the output format requires further improvements. On the plus side it is now possible to stream with the nvarguscamerasrc gstreamer plugin. Below is a 1080 recording from the OV5647 with a pipeline based on nvarguscamerasrc and nvv4l2h264enc.

Update: In my 2nd post we have a driver that you can test with.

Friday, 29 March 2019

Machine learning with the i.MX6 and the Intel NCS2

Last October Intel released an upgraded Neural Compute Stick known as NCS2 hosting the Movidius Myriad X VPU (MA2485). Intel claim "NCS 2 delivers up to eight times the performance boost compared to the previous generation NCS". Intel also provide OpenVINO, an open visual inference and neural network optimization toolkit with multiplatform support for Intel based hardware. With release R5 of OpenVINO, support was added for NCS2/NCS and the ARMv7-A CPU architecture through the introduction of library support for Raspberry Pi boards. As a progression from my previous post, this gives us the opportunity to test NCS2 with OpenVINO on the i.mx6 platform. The first video above shows the sample security_barrier_camera_demo and the second runs the model vehicle-detection-adas-0002.xml. These are executed on an imx6q board (BCM AR6MXQ).


To maximise performance from NCS2, ideally it should be connected to a USB 3.0 port. Unfortunately the i.mx6 doesn't host native support for USB 3.0; however, most of the i.mx6 range do support a PCIE interface. So our plan was to deploy a mini PCIE to USB 3.0 card, in our case using the NEC UPD720202 chipset. Using PCIE also avoids saturating the USB bus when testing inference with a USB camera.

The target board for testing was the BCM AR6MXQ, which hosts an on-board mini PCIE interface. The mini PCIE card hosts a 20 pin USB connector and a SATA connector for USB power. We used an adapter card to expose two USB 3.0 ports, hence the NCS2 ending up in an upright position.

OpenVINO provides an easy to use interface to OpenCV via python and C++. In our case, for an embedded platform, C++ is best suited for optimum performance. Testing was done using a number of the existing OpenVINO samples with the primary code modification being to accelerate resizing the camera input and rendering of the OpenCV buffer to screen.

The face recognition video above is using object_detection_demo_ssd_async with the face-detection-retail-0004.xml model, which is rated at 1.067 GFLOPs complexity. NCS2 inference times average 22 ms, although the model lacks some accuracy given its inability to distinguish between a human face and 'Dora'. The overall fps rate at 19 is pretty good. With regard to CPU usage on an i.mx6q, only one of the 4 cores is fully occupied, as suggested by the output of 'top'.

What is nice about OpenVINO is that we can easily compare these benchmarks against the original NCS by simply plugging it in and re-running the test.

As shown above, the inference times rise from 22 ms to 62 ms on the original NCS, although from our testing the trade-off seems to be a rise in power consumption and heat dissipation between the two releases of the NCS.
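
The figures quoted above (19 fps, 22 ms and 62 ms inference) can be put into perspective with a little arithmetic. The sketch below uses a deliberately simple serial model in which each frame costs inference time plus everything else (capture, resize, render). That model is my assumption. The demo used here is asynchronous and overlaps stages, so the real split between inference and other work will differ.

```python
def frame_time_ms(fps):
    """Total time budget per frame at a given frame rate."""
    return 1000.0 / fps


def non_inference_ms(fps, inference_ms):
    """Time per frame left for everything except inference (serial model)."""
    return frame_time_ms(fps) - inference_ms


def speedup(slow_ms, fast_ms):
    """How many times faster the fast device is than the slow one."""
    return slow_ms / fast_ms


if __name__ == "__main__":
    print(f"frame budget at 19 fps: {frame_time_ms(19):.1f} ms")
    print(f"non-inference share:    {non_inference_ms(19, 22):.1f} ms")
    print(f"NCS2 vs NCS speedup:    {speedup(62, 22):.2f}x")
```

On these numbers the NCS2 is a little under three times faster than the original NCS for this model, and inference accounts for less than half of the per-frame budget at 19 fps.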

I would like to thank for sponsoring the hardware and development time for this article.