Sunday, 15 January 2023

RK3588 - Decoding & rendering 16 1080p streams



I'm currently working on a video application for the RK3588, given it is one of the few processors on the market that currently has native HDMI input support (up to 4K30). As part of that work, one of the first tasks has been trying to render video efficiently within a Wayland/Weston window (not full screen). I reverted to Wayland for video because, from my testing on X11, it can result in tearing if not played full screen, as the graphics stack (ARM Mali) has no ability to vsync. The existing Rockchip SDK patches the GStreamer waylandsink plugin to provide video rendering support for Wayland. However, there are a number of challenges in getting waylandsink to render to a Weston window, as by default it resorts to full screen, resulting in a Weston application launching a secondary full screen window to display video within. Whilst trying to find a solution to this problem I came across a number of claims about the video decoder (part of the VPU):

Up to 32-channel 1080P@30fps decoding (FireFly ROC-RK3588-PC)

x32 1080P@60fps channels (H.265/VP9) (Khadas Edge 2)

Up to 32 channels 1080P@30fps decoding (PEPPER JOBS X3588)

After reviewing the RK3588 datasheet and TRM I can't find any mention of this capability by Rockchip, so I'd assume this is a derived figure based on this statement in the datasheet: "Multi-channel decoder in parallel for less resolution". According to the datasheet the maximum decode resolution is 8K@30 for H264 and 8K@60 for H265. Since an 8K frame holds as many pixels as 16 1080p frames, that would theoretically mean 16 channels of 1080p@30 for H264 and possibly 32 for H265.
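As a quick sanity check of that estimate, here is a minimal sketch that compares raw pixel throughput only. It assumes 8K means 7680x4320 and that decoder capacity scales linearly with pixels per second. The datasheet does not state either of these, so treat the result as an upper bound rather than a guarantee. The helper names are my own.

```python
# Rough estimate: how many 1080p streams fit in an 8K decode budget,
# ASSUMING capacity scales linearly with pixels per second.


def pixel_rate(width: int, height: int, fps: int) -> int:
    """Pixels per second for a given video mode."""
    return width * height * fps


def max_streams(budget: int, width: int, height: int, fps: int) -> int:
    """Number of whole streams of the given mode that fit in the budget."""
    return budget // pixel_rate(width, height, fps)


# Datasheet maximums quoted above (8K taken as 7680x4320, an assumption).
H264_BUDGET = pixel_rate(7680, 4320, 30)
H265_BUDGET = pixel_rate(7680, 4320, 60)

if __name__ == "__main__":
    print("H264 1080p30 streams:", max_streams(H264_BUDGET, 1920, 1080, 30))
    print("H265 1080p30 streams:", max_streams(H265_BUDGET, 1920, 1080, 30))
    print("H265 1080p60 streams:", max_streams(H265_BUDGET, 1920, 1080, 60))
```

Note that a 60 fps stream costs twice the budget of a 30 fps one, so a mix of frame rates reduces the achievable channel count.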

So the challenge turned out to be: could I decode 16 1080p streams and render each within its own window on a 1080p@60 display? As you can tell from the above video, it is possible. This is a custom Weston application running on a Rock 5B board; each video is read and decoded from a separate file (there is a mixture of trailers, open videos and an fps test video) and then rendered. Initially I tried resizing each video using RGA3 (Raster Graphics Acceleration), however this turned out to be non-performant as RGA doesn't seem to cope well with more than a few videos. It turns out the only way to render is to use AFBC (Arm Frame Buffer Compression). For this test there are 14 H264 streams (a mixture of 30 & 60 fps) and 2 H265 60fps streams.

Friday, 26 August 2022

Inside another fake ELM327 adaptor (filled with Air)

I'd ordered a couple of ELM327 compatible adapters from AliExpress expecting that these would be similar to the item in the image below. Normally these contain a PCB that fits the enclosure, populated with an unknown MCU (covered with epoxy), a Bluetooth chip, a CAN transceiver and the necessary circuitry to support a K-Line interface.

After dissecting the received adapters, here is what we have: 80% air and a small PCB.

Pictures of the small PCB reveal a single 16-pin SOP package (and a 24 MHz crystal) with the chip marking etched out, and no BLE chip or CAN transceiver present 😒. Is this one chip doing all the work?

From a software point of view the device reports itself as ELM V2.1 and I managed to retrieve the firmware version as TDA99 V0.34.0628C (not sure what it means though). The firmware is extremely buggy and feature-wise incomplete for ELM V2.1.

The intriguing question was: could a 16-pin chip replace a number of discrete components? After days of research it turns out the chip seems to be a repurposed Bluetooth audio/toy chip (possibly from ZhuHai Jieli Technology). The same unmarked chip seems to be present on the Thinmi ELM327C, with the chip referred to as QBD255. I can't locate any information on the QBD255. Worse still, the CAN implementation seems to be written completely in software (hence no CAN transceiver) and is therefore prone to timing errors and limited data rates. Furthermore this chip must have limited memory/flash, hence the incomplete implementation of ELM features.

Buyer beware!

I suspect this chip may be the Jieli AC6329F or AC6329C, but I still need to prove it somehow.

Update 28-08-2022: 

There seems to be another chipset floating around from YMIOT, described as "ELM327 V2.1 Bluetooth universal diagnostic adapter with 16-pin YM1130 1343E38 chip".

History of this chipset is below:

2017 - YM1120 (131G76)
2018 - YM1122 (1218F57) & YM1121
2019 - YM1130 (1343E38)

Thursday, 29 April 2021

Reverse engineering the V831 NPU (Neural Processor Unit)

I took up the challenge posted on the Sipeed Twitter feed:

"We are reversing V831's NPU register, and make opensource AI toolchian based on NCNN~ If you are interested in making opensource AI toolchain and familiar with NCNN, please contact support at, we will send free sample board for you to debug"

Sipeed were kind enough to send me one of the initial prototype boards of the MAXI-II. To give you a brief introduction, the V831 is a camera SoC targeting video encoding applications (CCTV, body cams, etc.). It comprises a Cortex A7 processor combined with 64MB of embedded RAM, and for those interested full details of the V831's capabilities can be found in the datasheet.

The datasheet is sparse on information about the NPU:

  • V831: Maximum performance up to 0.2Tops
  • Supports Conv, Activation, Pooling, BN, LRN, FC/Inner Product

In addition the datasheet briefly mentions two registers relating to the NPU, one for enabling/resetting the NPU and the other for setting the clock source. There is no mention of how it can be programmed to perform the operations specified in the datasheet.

Fortunately the registers listed in the Sipeed Twitter post provided a first clue, and after many months of trial and error, endless deciphering of data dumps, a few dead ends and numerous reverse engineering attempts, parts of the NPU operations have been decoded. Fundamentally a large portion of the NPU is a customised implementation of the NVIDIA Deep Learning Accelerator (NVDLA) architecture. More details about the project can be found on the NVDLA site, and here is a quote of its aims:

The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators.

What I have determined so far about the NPU is:

1. The NPU clock can be set between 100 and 1200 MHz, with the code defaulting to 400 MHz. My hunch is that this may tie to the clock speed of the onboard DDR2 memory.

2. NPU is implemented with nv_small configuration (NV Small Model) and relies on system memory for all data operations. Importantly CPU and NPU are sharing the memory bus.

3. It supports both int8 and int16, however I haven't verified whether FP16 is supported. Theoretically int8 should be twice as fast as int16 while also saving memory, which matters given the V831's limited onboard memory (64MB).

4. Number of MACs is 64 (Atomic-C * Atomic-K)

5. NPU registers are memory mapped and therefore can be programmed from userspace which proved to be extremely useful for initial debugging & testing.

6. NPU requires physical address locations when referencing weights & input/output data locations therefore kernel memory needs to be allocated and the physical addresses retrieved if accessed from userspace.

7. NPU weights and input/output data follow a similar layout to the NVDLA private formats. Therefore well-known formats like nhwc or nchw require transformation before they can be fed to the NPU.
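To illustrate the kind of transformation point 7 implies, here is a minimal sketch that repacks a planar CHW tensor into a channel-blocked layout. The 8-element atom and the [surface][h][w][channel] ordering are assumptions based on the public NVDLA feature data description. Neither is confirmed for the V831, and `chw_to_blocked` is a hypothetical helper name rather than anything from the v831-npu repo.

```python
# Sketch: repack a CHW int8 tensor into a channel-blocked layout,
# [C/ATOM][H][W][ATOM], similar in spirit to the NVDLA feature format.
# ATOM = 8 is an ASSUMPTION (8-byte atom with int8 data); verify on hardware.

ATOM = 8


def chw_to_blocked(data, channels, height, width, atom=ATOM):
    """Return a flat list in [surface][h][w][c_in_atom] order.

    Channels are grouped into 'surfaces' of `atom` channels each; the last
    surface is zero padded when `channels` is not a multiple of `atom`.
    """
    if len(data) != channels * height * width:
        raise ValueError("data length does not match the given shape")
    surfaces = (channels + atom - 1) // atom
    out = [0] * (surfaces * height * width * atom)
    for c in range(channels):
        s, k = divmod(c, atom)  # surface index, position inside the atom
        for h in range(height):
            for w in range(width):
                src = (c * height + h) * width + w
                dst = ((s * height + h) * width + w) * atom + k
                out[dst] = data[src]
    return out


if __name__ == "__main__":
    # 3 channels, 1x2 image: channel 0 = [1, 2], 1 = [3, 4], 2 = [5, 6]
    print(chw_to_blocked([1, 2, 3, 4, 5, 6], channels=3, height=1, width=2))
```

An NHWC source would only need a different `src` index; the destination indexing stays the same.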

Initial code from my endeavours is located in this repo, v831-npu, and should be treated as work in progress. Hopefully this forms the basis of a fully open source implementation. The tests directory has code from my initial attempts to interact with the hardware and is now redundant. However it can be used as an initial introduction to how the hardware units work and what configuration is required. So far I have decoded the CONV, SDP and PDP units, which allow for the following operations (tested with the int8 data type):

1. Direct Convolutions

2. Bias addition

3. Relu/Prelu

4. Element wise operations

5. Max/Average pooling

To verify most of the above I ported across the cifar10 example (see the examples directory) from ARM's CMSIS_5 NN library. Furthermore I have managed to remove all dependencies on closed AllWinner libraries; this is partially achieved by implementing a simple ION memory allocation utility. Instructions to build cifar10 for deploying on the MAXI-II are below (assuming you are using a Linux machine):

1. Clone the SDK toolchain git repo from here. We are still dependent on the SDK toolchain as the MAXI-II kernel/rootfs is built with this toolchain.

2. Export PATH to include 'lindenis-v536-prebuilt/gcc/linux-x86/arm/toolchain-sunxi-musl/toolchain/bin' so that arm-openwrt-linux-gcc can be found.

3. Run 'make'

4. Copy the built executable 'nna_cifar10' to the MAXI-II

5. Run './nna_cifar10', output should be as below given the input image was a boat:

There is still quite a bit of work left to be done such as :

1. Weight and input/output data conversion utility

2. The NPU should support pixel input formats which needs to be verified.

3. Decoding remaining hardware units

4. Possibly integrating with an existing AI framework or writing a compiler.

By the way the new Beagle V is also spec'd to implement NVDLA with a larger MAC size of 1024.

I would like to thank sipeed for providing the hardware/software.

I'd like to thank for sponsoring the development time for this work.

Sunday, 8 March 2020

ESP32 impersonates a Particle Xenon

With the announcement that Particle will no longer manufacture the Xenon development board and will drop their OpenThread based mesh networking solution, we decided to see if we could impersonate an existing claimed Xenon (i.e. one that is already registered on the cloud) on alternative hardware. Hence the idea of 'bring your own device' to connect to the cloud.

After reviewing the device-os source code for a few months, it turned out that to get a proof of concept working I needed to implement, at minimum, the following:

1. Port across the DTLS protocol layer, as it turns out the Gen 3 devices create a secure UDP socket connection over DTLS.
2. Extract the device's private key and the cloud public key (no certificates are stored). Particle's implementation of the DTLS handshake relies purely on Raw Public Key support (RFC 7250).
3. Implement a CoAP layer, as the 'Spark protocol' is built on top of this.
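To give a feel for the third item, here is a minimal sketch of the fixed 4-byte CoAP header from RFC 7252, followed by an optional token and payload. It deliberately omits CoAP options and says nothing about how the 'Spark protocol' uses the messages. The function names are my own and are not taken from device-os.

```python
import struct

# CoAP constants from RFC 7252.
COAP_VERSION = 1
TYPE_CON, TYPE_NON, TYPE_ACK, TYPE_RST = 0, 1, 2, 3
PAYLOAD_MARKER = b"\xff"


def coap_code(code_class: int, detail: int) -> int:
    """Pack a 'c.dd' code such as 0.01 (GET) into one byte."""
    return (code_class << 5) | detail


def build_message(msg_type: int, code: int, message_id: int,
                  token: bytes = b"", payload: bytes = b"") -> bytes:
    """Build a CoAP message without options: header, token, payload."""
    if len(token) > 8:
        raise ValueError("CoAP tokens are at most 8 bytes")
    # Byte 0: version (2 bits) | type (2 bits) | token length (4 bits)
    first = (COAP_VERSION << 6) | (msg_type << 4) | len(token)
    message = struct.pack("!BBH", first, code, message_id) + token
    if payload:
        message += PAYLOAD_MARKER + payload
    return message


def parse_header(data: bytes) -> dict:
    """Decode the fixed 4-byte header of a CoAP message."""
    first, code, message_id = struct.unpack("!BBH", data[:4])
    return {
        "version": first >> 6,
        "type": (first >> 4) & 0x3,
        "token_length": first & 0xF,
        "code": code,
        "message_id": message_id,
    }


if __name__ == "__main__":
    # An empty confirmable message is what RFC 7252 calls a CoAP ping.
    ping = build_message(TYPE_CON, coap_code(0, 0), 0x1234)
    print(ping.hex(), parse_header(ping))
```

In a real client these bytes would be carried inside the DTLS session rather than sent over a plain UDP socket.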

The above was implemented as a set of library functions using the ESP-IDF, and I reused the ESP32 (LILYGO TTGO) from the previous post, which fortunately hosts a 128x64 OLED display. In the video the ESP32:

1. Connects to a WIFI access point.
2. Retrieves time from an SNTP server.
3. Connects to the Particle Cloud via a DTLS handshake.
4. Sends a number of 'Spark protocol' messages to let the cloud know the Xenon is alive.
5. Awaits commands from the Cloud, including ping and signal operations. When receiving the signal command the screen scrolls the text from left to right.

I'd like to thank for sponsoring the hardware and development time for this article.

Saturday, 30 November 2019

Particle Xenon - Adding WIFI support with an ESP32

The preferred option for WIFI support with Gen 3 devices is to deploy a Particle Argon. The Argon consists of a Nordic nRF52840 paired with an Espressif ESP32. The ESP32 simply provides the WIFI interface and runs a customised version of the ESP-AT firmware (argon-ncp-firmware). The nRF52840 communicates with the ESP32 over one of its serial ports using four pins: TX, RX, CTS & RTS. The challenge here was to see if we could enable WIFI support on a Particle Xenon by connecting it to an ESP32 running the argon-ncp-firmware. As demonstrated in the video it was possible, although it required jumping through a number of hoops.

Unfortunately the only spare ESP32 board I had was a LILYGO TTGO, which is a 16M board with an OLED display. So the first task was porting the argon-ncp-firmware and refactoring the pin mappings to support this board. Once this was complete it was fairly easy to validate that the firmware was functioning by simply executing the AT commands the Argon issues to establish WIFI connectivity.

For the Xenon the primary change was to port across the Argon ESP32 networking code. This turned out to be more challenging than envisaged, primarily because the Xenon firmware isn't expecting a WIFI configuration and the command line tools don't support provisioning a WIFI connection for a Xenon. After 4 weeks of effort I finally had a working build of the Xenon firmware. It took another 2 weeks to get the Xenon provisioned with a WIFI configuration so it could connect to the Particle Cloud.

The main drawback of this approach is that the firmware on both the Xenon and the ESP32 is customised, therefore any updates from the Cloud would overwrite the changes. Hence a customised rebuild is required when new firmware is released.

I'd like to thank for sponsoring the hardware and development time for this article.

Tuesday, 26 November 2019

Particle Xenon - Enable Ethernet connectivity with a low cost W5500 module

The preferred option to enable a Xenon to act as a Gateway is to deploy the Particle Ethernet FeatherWing. Unfortunately I didn't have one to hand, however after reviewing the schematics it turns out this FeatherWing simply relies on the WIZnet W5500 Ethernet controller.

From a previous project I did have a W5500 Ethernet Module (which seems to be widely available and relatively cheap), so the challenge was to see if the Xenon could work with it.

In the end it turned out to be relatively simple to connect the Xenon to the module through the exposed SPI interface. The back of the PCB indicates the pinout details for the W5500. The diagram below details which pins on the Xenon need to be connected to the W5500 module.

This post on the Particle site covers how to enable Ethernet and fingers crossed your Xenon should connect to the Particle cloud as mine did.

Tuesday, 27 August 2019

Jetson Nano - Developing a Pi v1.3 camera driver Part 2

I'd like to thank for sponsoring the hardware and development time for this article.

Following on from my previous post, I am finally in a position to release an alpha version of the driver, although unfortunately at this stage only in binary form. Development of the driver has been complicated by the fact that determining the correct settings for the OV5647 is extremely time consuming, given the lack of good documentation.

The driver supports the following resolutions

2592 x 1944 @15 fps
1920 x 1080 @30 fps
1280 x 960  @45 fps
1280 x 720  @60 fps

I have added support for 720p because most of the clone cameras seem to be targeting 1080p or 720p based on the lens configuration. I mainly tested with an original RPI V1.3 camera to ensure backward compatibility.

The driver is pre-compiled against the latest L4T R32.2 release, so you need to deploy a new kernel and modules along with a new dtb file. Therefore I recommend you do some background reading to understand the process before deploying. Furthermore, I recommend you have access to the Linux console via the UART interface in case the new kernel fails to boot or the camera is not recognised.

Deployment of the kernel and modules will be done on the Nano itself while flashing of the dtb file has to be done from a Linux machine where the SDK Manager is installed.

Download nano_ov5647.tar.gz and extract to your nano :

mkdir ov5647
cd ov5647

tar -xvf ../nano_ov5647.tar.gz

After extraction you will see the following files:

-rw-r--r-- 1 user group 291462110 Aug 26 17:23 modules_4_9_140.tar.gz
-rw-r--r-- 1 user group 200225    Aug 26 17:26 tegra210-p3448-0000-p3449-0000-a02.dtb
-rw-r--r-- 1 user group  34443272 Aug 26 17:26 Image-ov5647

Copy kernel to /boot directory :

sudo cp  Image-ov5647 /boot/Image-ov5647

Change the boot configuration file to load our kernel by editing /boot/extlinux/extlinux.conf. Comment out the following line and add the new kernel, so the change is from this:

      LINUX /boot/Image

to this:

      #LINUX /boot/Image
      LINUX /boot/Image-ov5647

Next step is to extract the kernel modules:

cd /lib/modules/
sudo tar -xvf <path to where files were extracted>/modules_4_9_140.tar.gz

The last step is to flash the dtb file, tegra210-p3448-0000-p3449-0000-a02.dtb. As discussed in the comments section (below) by jiangwei, it is possible to copy the dtb file directly to the Nano; refer to this link on how this can be achieved (see the section "Flash custom DTB on the Jetson Nano").

Alternatively you can use SDK Manager. Flashing requires copying the dtb file to the Linux host machine, into the directory Linux_for_Tegra/kernel/dtb/ where your SDK is installed. Further instructions on how to flash the dtb are covered in a post I made here, however since we don't want to replace the kernel the command to use is:

sudo ./ --no-systemimg -r -k DTB jetson-nano-qspi-sd mmcblk0p1

There seems to be some confusion about how to put the nano into recovery mode. The steps to do that are:

1. Power down nano
2. J40 - Connect recovery pins 3-4 together
3. Power up nano
4. J40 - Disconnect pins 3-4
5. Flash file

After flashing the dtb the nano should boot the new kernel and hopefully the desktop will reappear. To verify the new kernel we can run the following command:

uname -a

It should report the kernel version as 4.9.140+:

Linux jetson-desktop 4.9.140+

If successful, power down the Nano and connect your camera to FPC connector J13. Power up the Nano and, once the desktop reappears, verify the camera is detected with:

dmesg | grep ov5647

It should report the following:

[    3.584908] ov5647 6-0036: tegracam sensor driver:ov5647_v2.0.6
[    3.603566] ov5647 6-0036: Found ov5647 with model id:5647 process:11 version:1
[    5.701298] vi subdev ov5647 6-0036 bound

The above indicates the camera was detected and initialised. Finally we can try streaming; commands for the different resolutions are below:

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=2592, height=1944, framerate=15/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=2592, height=1944' ! nvvidconv ! nvegltransform ! nveglglessink -e

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1920, height=1080, framerate=30/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1920, height=1080' ! nvvidconv ! nvegltransform ! nveglglessink -e

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1280, height=960, framerate=45/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1280, height=960' ! nvvidconv ! nvegltransform ! nveglglessink -e

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),width=1280, height=720, framerate=60/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1280, height=720' ! nvvidconv ! nvegltransform ! nveglglessink -e

The driver supports controlling of the analogue gain which has a range of 16 to 128. This can be set using the 'gainrange' property, example below:

gst-launch-1.0 nvarguscamerasrc gainrange="16 16" ! 'video/x-raw(memory:NVMM),width=1280, height=720, framerate=60/1' ! nvvidconv flip-method=0 ! 'video/x-raw,width=1280, height=720' ! nvvidconv ! nvegltransform ! nveglglessink -e
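Since the pipelines above differ only in resolution, frame rate and gain, a small helper can generate them. This is a convenience sketch rather than part of the driver: `MODES` and `build_command` are hypothetical names, and the mode table simply mirrors the resolutions listed earlier.

```python
# Sketch: generate the gst-launch-1.0 command lines shown above.
# MODES mirrors the driver's supported resolutions: (width, height) -> fps.
MODES = {
    (2592, 1944): 15,
    (1920, 1080): 30,
    (1280, 960): 45,
    (1280, 720): 60,
}


def build_command(width, height, gain=None):
    """Return the shell command for one sensor mode.

    `gain` optionally pins the analogue gain via 'gainrange' (16 to 128).
    Raises KeyError for a resolution the driver does not list.
    """
    fps = MODES[(width, height)]
    source = "nvarguscamerasrc"
    if gain is not None:
        if not 16 <= gain <= 128:
            raise ValueError("analogue gain must be between 16 and 128")
        source += f' gainrange="{gain} {gain}"'
    return (
        f"gst-launch-1.0 {source} ! "
        f"'video/x-raw(memory:NVMM),width={width}, height={height}, "
        f"framerate={fps}/1' ! "
        f"nvvidconv flip-method=0 ! "
        f"'video/x-raw,width={width}, height={height}' ! "
        "nvvidconv ! nvegltransform ! nveglglessink -e"
    )


if __name__ == "__main__":
    for (w, h) in MODES:
        print(build_command(w, h))
```

The returned string is meant to be pasted into a shell on the Nano; the single quotes around the caps are kept so the parentheses in `memory:NVMM` are not interpreted by the shell.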

If you require commercial support please contact