Tiny Devices

Friday, 2 May 2025

RK3588 - Implementing a Vectorscope for processing video in real time

Following on from my previous post covering decoding and rendering HDMI input on the RK3588, I encountered a new requirement: implementing a real-time vectorscope to visualize chrominance information from the video stream. This proved to be a challenging task due to the need for efficient frame processing and rendering without impacting video playback performance.

Extracting UV Data from Video Frames

The challenges to overcome were:

Accessing U and V (chrominance) values for every pixel in the video stream.
For RGB-formatted frames, a costly color space conversion (RGB to YUV) is required.
The processing overhead increases significantly with higher-resolution frames.

To minimize CPU overhead and offload the color space conversion, I utilized RGA3, which efficiently converts each RGB frame into NV12 or NV16 format. This step significantly reduces the processing time needed to access UV data.

Once converted, the UV plane needed to be imported into an OpenGL ES texture for further processing and visualization. To preserve performance and avoid unnecessary memory copies, the goal was to directly bind the UV data as a texture.
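To make the layout concrete, here is a minimal CPU-side sketch of where the chroma samples sit in an NV12 buffer. It assumes a tightly packed frame with no stride padding, and the helper name split_nv12 is hypothetical. The real pipeline binds the UV plane as a texture rather than copying it like this.

```python
import numpy as np

def split_nv12(buf: bytes, width: int, height: int):
    """Return (y, u, v) views of a tightly packed NV12 frame (no stride padding assumed)."""
    y_size = width * height
    frame = np.frombuffer(buf, dtype=np.uint8)
    # Full-resolution luma plane comes first.
    y = frame[:y_size].reshape(height, width)
    # NV12 stores one interleaved UV plane, subsampled by 2 horizontally and vertically.
    uv = frame[y_size:y_size + y_size // 2].reshape(height // 2, width // 2, 2)
    return y, uv[..., 0], uv[..., 1]

# Synthetic 4x4 frame: 16 luma bytes followed by 4 (U, V) pairs.
w, h = 4, 4
y_plane = bytes(range(16))
uv_plane = bytes([100, 200] * 4)
y, u, v = split_nv12(y_plane + uv_plane, w, h)
print(u.shape, int(u[0, 0]), int(v[0, 0]))
```

For NV16 the UV plane keeps full vertical resolution, so the reshape would use height rather than height // 2.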

Computing the UV Histogram

The primary processing steps here were:

Computing a UV histogram in real time based on pixel data from the video frame.
Normalizing the histogram values so they scale properly regardless of resolution or luminance.

Building a real-time UV histogram from each frame requires scanning and processing a large volume of pixel data efficiently. Traditional OpenGL fragment shaders are not well-suited for this type of arbitrary data accumulation, especially when dealing with high-resolution video.

Given that the Mali-G610 GPU is compliant with OpenGL ES 3.1, I explored the use of compute shaders, a more flexible approach for performing general-purpose GPU (GPGPU) operations like histogram generation.

Compute shaders for OpenGL ES are sparsely documented, and practical usage examples, especially for embedded platforms, are limited. As a result, much of the development involved experimentation and iterative debugging, making it feel like a bit of a black art. Finally, I managed to chain together 3 compute shaders, each performing one step in the pipeline.
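As a CPU reference for what the GPU has to compute, the histogram and normalization can be sketched in a few lines of NumPy. This is only an illustration: the 256x256 bin layout and the linear peak normalization are assumptions, and the two functions are not a description of the three shaders in the actual pipeline.

```python
import numpy as np

def uv_histogram(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Count how many pixels fall into each (V, U) bin of a 256x256 grid."""
    hist = np.zeros((256, 256), dtype=np.uint32)
    # np.add.at accumulates correctly when the same bin appears more than once,
    # which is the role atomic adds would play in a compute shader.
    np.add.at(hist, (v.ravel().astype(np.intp), u.ravel().astype(np.intp)), 1)
    return hist

def normalize(hist: np.ndarray) -> np.ndarray:
    """Scale counts to 0..1 by the peak bin so brightness is resolution independent."""
    peak = hist.max()
    if peak == 0:
        return np.zeros(hist.shape, dtype=np.float32)
    return (hist / peak).astype(np.float32)

# Tiny chroma planes: three neutral samples (128, 128) and one coloured sample.
u = np.array([[128, 128], [10, 128]], dtype=np.uint8)
v = np.array([[128, 128], [20, 128]], dtype=np.uint8)
hist = uv_histogram(u, v)
scope = normalize(hist)
print(int(hist[128, 128]), int(hist[20, 10]), float(scope.max()))
```

Dividing by the peak bin rather than by the pixel count keeps the brightest point of the trace at full intensity whatever the frame size.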

Rendering the Vectorscope Output

The final step involved visualizing the processed chroma data in a way that mirrors a traditional vectorscope. This required overlaying the normalized UV histogram along with reference markers without disrupting video playback.

I drew inspiration from the OBS monitor plugin, which includes a vectorscope feature. Its rendering approach informed how I structured the visualization pipeline, particularly around how histogram data is mapped to screen coordinates.

The demo video shows the Vectorscope is capable of processing a 1080p@60 video stream.

Work was carried out on a ROCK 5B board running a tailored Ubuntu image maintained by Joshua Riek.

Thursday, 1 May 2025

RK3588 - Building a simple HDMI analyzer

While developing and debugging HDMI drivers and custom video applications, you quickly come to appreciate the convenience of having a straightforward way to debug the HDMI output, both video and audio, in real time. This led me to develop a utility specifically designed to display basic HDMI metadata and render the incoming video and audio stream. The goal was to create a tool that could simplify the process of diagnosing HDMI-related issues when connected to an HDMI source.

One of the undervalued features of the RK3588 SoC is its built-in HDMI receiver, an often overlooked capability that eliminates the need for external HDMI-to-CSI adapters like the TC358743XBG or RK628D. This built-in receiver makes the RK3588 an ideal platform for developing such a utility, as it allows direct access to an HDMI input source without additional hardware.

While the BSP HDMI RX kernel driver may not be of the highest quality, it offers several useful IOCTLs for detecting the presence of a valid signal in addition to providing valuable video and audio metadata. With this in mind, the application was developed as a Weston (Wayland) client, which turned out to be more challenging than initially anticipated. The primary difficulties arose from:

Video Input Rendering – Optimizing the pipeline to minimize latency between frame acquisition and on-screen rendering. For rendering, I went with GStreamer pipelines paired with WaylandSink, to ease integration with the Wayland compositor.
Overlaying graphical information on top of video playback – synchronizing real-time overlays with live video in a Wayland compositor like Weston required careful handling of surface layers and rendering pipelines. For rendering overlays, I adopted an approach based on Weston subsurfaces.
Converting and scaling the video input – processing the raw video feed and resizing it dynamically while maintaining real-time performance introduced both technical and performance challenges. To handle conversion and scaling, I resorted to developing a custom GStreamer plugin built around RGA3, deliberately bypassing RGA2 due to the known constraints with 32-bit memory addressing (it's no different to the NPU).
Audio Detection – The HDMI RX kernel driver monitors for the presence of an audio stream. The challenge was to dynamically add or remove audio from the GStreamer pipeline based on its availability and without disrupting the ongoing video stream.

As illustrated in the image below, basic metadata for both the video and audio streams is displayed in the overlay window.

In the first video (above), the HDMI output from a Lenovo Windows laptop is used as the input source. The laptop is configured to duplicate its display over HDMI, functioning as a secondary screen. This setup effectively demonstrates minimal latency in both user interaction and video playback, highlighting the responsiveness of the HDMI capture pipeline.

The second video showcases the utility’s scaling capabilities. With the output display set to 1080p, I used a Lindy HDMI 18G Signal Generator to feed input signals at various resolutions—720p, 1080p, 4K@30Hz, and 4K@40Hz—in multiple formats including RGB, NV16, and NV12. All input signals were successfully scaled to 1080p.

During testing, I discovered a limitation: although the RK3588’s HDMI RX receiver supports NV24 format input, neither RGA3 nor RGA2 (according to documentation and my own validation) is capable of processing NV24. Interestingly, the RK3576 is equipped with the newer RGA2-Pro, which supports NV24, despite lacking an HDMI RX receiver.

Work was performed on a ROCK 5B board running a tailored Ubuntu image maintained by Joshua Riek.

Sunday, 15 September 2024

AX650N - Sipeed Maix-IV (AXeraPi-Pro) NPU teardown

After spending a significant amount of time reverse-engineering the RK3588 NPU and examining Rockchip's 6 TOPS claim, the AXera AX650N SoC piqued my interest due to AXera's ambitious claims about the NPU's processing power, boasting "72T mixed precision computing power, native support for Transformer intelligent processing platform". Upon closer inspection, the AX650N delivers 72.0 TOPS@INT4 and 18.0 TOPS@INT8. Interestingly, the performance claim from INT8 to INT4 is a 4x gain, rather than the typical 2x improvement. There is an ongoing effort to port some of the smaller Transformer models to showcase its capabilities. However, given the performance claims I would have expected some larger models to be showcased, but that doesn't seem to be the case.

AX650N

The AX650N is part of AXera's vision processor lineup, featuring a unique capability for intelligent black light vision, which appears to generate color images in low-light conditions. From what I understand, this capability likely stems from its ISP (Image Signal Processor) unit utilizing RGB+IR data from the image sensor. This opens up potential for performing object detection in low-light environments. The AX650N is powered by 8 A55 cores and features an unusual configuration with two DDR controllers, each capable of addressing 16GB. This design allows for the potential to double data transfer rates, as the controllers operate independently and don't share the same address bus. Additionally, the SoC includes dual DSP cores (Tensilica Vision Q7 DSP) to further boost its vision processing capabilities. However, since it's primarily a vision processor, I suspect AXera may be positioning it to also capitalize on the growing interest in large language models (LLMs).

Maix-IV board

AXera collaborated with Sipeed to produce a developer board, the Sipeed Maix-IV, also known as the AXeraPi-Pro. The board is set to boot from eMMC by default and comes with a preinstalled Busybox image; however, the image is quite minimal. Fortunately, there is an Ubuntu image available for flashing to the eMMC. The main issue with the Ubuntu image is that it doesn't allocate enough space to the different mount points of the root filesystem for it to be useful. To fix this, I had to create a secondary image on a USB drive, boot from it (painfully slow) and then repartition the eMMC. The CPU fan runs continuously as soon as power is applied (with no PWM control), making it a noticeable distraction.

Upon booting the board, the surprising discovery is that Linux recognizes just 4GB of memory, despite the board having 8GB of onboard RAM. The remaining 4GB is reserved as CMM (Contiguous Memory Model), a large block of physical memory allocated for onboard peripherals like the ISP, video encoder/decoder, and NPU. It’s important to note that CMM differs from CMA, which is typically reserved by the Linux kernel. The ratio of CMM to Linux memory can be adjusted. The main challenge with 4GB (or less) available for Linux is the difficulty in fully utilizing all 8 CPU cores.

SDK support

To begin development, the AX650N SDK is essential, but accessing it can be challenging if you're outside of China. For reasons unknown, Sipeed only hosts the SDK on their Baidu account. I reached out to Sipeed support by email on July 17, 2024 and have still had no response. I encountered the same lack of response from the Sipeed Telegram channel. Contacting AXera doesn't help, as they point you back to Sipeed. After weeks of trying, I finally managed to get hold of the SDK without Sipeed.

The SDK includes a small kernel patch that gets applied against kernel 5.15.73; however, the main irritation is that most peripheral drivers are provided as pre-compiled binaries, making the kernel effectively closed source. The rest of the SDK provides a set of closed source user space libraries exposing APIs to the different IP blocks. The primary drawback with a closed source approach is addressing security concerns, as highlighted on Sipeed's NanoKVM. SDK documentation is primarily in Chinese with occasional English versions.

Transformers Support

As mentioned earlier, my main focus is the NPU architecture and performance of Transformer models. There are a number of pre-built transformer models (Qwen1.5-1.8B, Qwen2-0.5B, MiniCPM-1B/2B, Phi-3-mini) available here, unfortunately again hosted on Baidu. The largest one is Phi-3 mini, which requires increasing the CMM space to 5 or 6GB, due to other kernel modules reserving CMM space while the kernel boots. Executing Phi-3 mini results in a token generation rate of approximately 4.4 tokens per second.

The token rate appears to be lower than expected given the AX650N's TOPS claims. Phi-3 Mini seems to be using INT8 quantization, although it's unclear which specific approach; I'm guessing it's w8a8 with SmoothQuant. In comparison, Phi-3 Mini on the RK3588, also using INT8 (w8a8), reportedly runs at up to 6.46 tokens per second. The AX650N claims 18.0 TOPS@INT8, while the RK3588 is rated at 6 TOPS@INT8. Although the performance may be lower, it's important to also consider the quality of the quantized models by evaluating each model's perplexity on both platforms, which I haven't done. Furthermore, the vendors of both SoCs aren't particularly forthcoming on perplexity evaluations when releasing models. Regardless of perplexity, the expectation was that the additional 12 TOPS would lead to better performance. The next step is to take a closer look at the NPU architecture to gain a clearer understanding of what might be causing this discrepancy. Given the platform's closed-source nature, this presents several challenges.
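To put the gap in perspective, the quoted figures can be reduced to tokens per second per claimed INT8 TOPS. This is simple arithmetic on the numbers above, not a new measurement, and it ignores memory bandwidth and model quality entirely.

```python
# Figures quoted in the text: Phi-3 Mini token rate and the vendor's INT8 TOPS claim.
figures = {
    "AX650N": {"tokens_per_s": 4.4, "int8_tops": 18.0},
    "RK3588": {"tokens_per_s": 6.46, "int8_tops": 6.0},
}

def tokens_per_tops(entry: dict) -> float:
    """Token rate divided by the claimed INT8 TOPS figure."""
    return entry["tokens_per_s"] / entry["int8_tops"]

for name, entry in figures.items():
    print(f"{name}: {tokens_per_tops(entry):.2f} tokens/s per claimed INT8 TOPS")

# How much more each claimed TOPS delivers on the RK3588 than on the AX650N.
ratio = tokens_per_tops(figures["RK3588"]) / tokens_per_tops(figures["AX650N"])
print(f"ratio: {ratio:.1f}x")
```

On these numbers the RK3588 delivers roughly four times as many tokens per claimed TOPS, which is why the AX650N figure looks suspicious.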

To deploy models to the NPU, AXera rely on their Pulsar toolchain, which converts ONNX files into their proprietary axmodel format. At a high level, here's my understanding of how the axmodel files are processed:

The axmodel files contain a mix of ONNX data and an internal graph representation of the model, which is sent to the NPU kernel driver. To execute an LLM model, each layer is stored in its own axmodel file, containing descriptions of the input/output parameters and corresponding weights. The input/output parameters and weights are loaded into CMM. To run the layer, the kernel driver receives references to the input/output parameters, weights, and a list of NPU commands, all of which are included in the file.

Note, there is no API to directly interface with the NPU; the only mechanism is by generating an axmodel file. For example, to instruct the NPU to multiply 2 int8 matrices, the SDK contains a bunch of samples which, surprisingly, rely on loading an axmodel file.

To efficiently run quantized models on the NPU, it ideally needs the ability to execute a Gemm operation. What's notable from the Pulsar documentation is that the ONNX Gemm operation is restricted as outlined below, suggesting that de-quantization cannot be performed while applying FMA.

alpha: Not supported yet

beta: Not supported yet
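To illustrate why the missing alpha matters, recall that ONNX Gemm computes Y = alpha * A * B + beta * C. With int8 operands, the product of the two quantization scales is exactly the kind of factor alpha could absorb. The sketch below is my reading of that point, using an assumed symmetric per-tensor int8 scheme; it says nothing about how Pulsar actually quantizes.

```python
import numpy as np

def gemm(a, b, c=None, alpha=1.0, beta=1.0):
    """Reference ONNX-style Gemm without transposes: Y = alpha * (A @ B) + beta * C."""
    y = alpha * (a @ b)
    if c is not None:
        y = y + beta * c
    return y

def quantize(x):
    """Symmetric per-tensor int8 quantization (an assumed scheme for illustration)."""
    scale = float(np.abs(x).max()) / 127.0
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
a_q, s_a = quantize(a)
b_q, s_b = quantize(b)

# With alpha available, de-quantization folds into the Gemm itself.
fused = gemm(a_q.astype(np.float32), b_q.astype(np.float32), alpha=s_a * s_b)

# Without alpha, the integer product needs a separate scaling pass afterwards.
acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
two_step = acc.astype(np.float32) * (s_a * s_b)

print(float(np.abs(fused - a @ b).max()))
```

Both paths give the same result; the difference is that the second one needs an extra elementwise operation after the matrix multiply.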

Power Consumption

After the board completes booting, the idle current consumption is about 450mA, equating to 5.4 watts (12V x 450mA). While running Phi-3 Mini, the consumption averages 680mA, or about 8.16 watts (12V x 680mA). For running a small LLM, the AX650N NPU is highly efficient, with the NPU consuming just 2.71 watts and benefiting from direct access to CPU memory. With a well-designed board, these figures could be significantly reduced.

Neutron NPU

The AXera NPU architecture, known as Neutron, has different iterations across the AX6XXX chipsets. It is primarily designed to support vision applications by executing Convolutional Neural Networks (CNNs). Documentation is sparse, but for the AX650N, there's a brief description in Pulsar along with the diagram below:

The AX650 and M76H NPUs are mainly composed of three Conv convolution cores and three groups of Vector cores. These Conv and Vector cores are allocated in a 1:1 ratio and divided into three groups of vNPUs

The diagram states that the NPU consists of 3 convolution cores, 3 vector cores, and 3 SDMA (System direct memory access) units. We’ll need to investigate further to confirm if this is accurate. The NPU can be set up to function either as 3 distinct Virtual NPUs (vNPUs) or as a single NPU with access to all cores.

The SDK documentation provides more details, highlighting the NPU performance as:

Max. 43.2 TOPS @INT4 and 10.8 TOPS @INT8.

This differs from the original claim of 18.0 TOPS @INT8, with only 10.8 TOPS @INT8 coming from the NPU. I speculate that the remaining 7.2 TOPS might be partially attributed to the dual DSP cores, which feature 256 MACs @INT8. However, I’m still unsure about the remaining TOPS and remain unconvinced that the performance figures are genuine. I suspect there may be some creative accounting at play!

Examination of the memory map offers additional insights into the architecture.

The EU0-EU12 labels denote a total of 13 Execution Units (EUs), along with 3 SDMA units, bringing the overall count to 15. The NPU also features its own local memory through OCM (On-Chip Memory), with an address map indicating 11.5MB (0xAFFFFF). Data transfers between the CMM and OCM must be managed by the SDMA units both before and after instructing the NPU cores to execute commands. There will be an overhead in moving each layer's weight data to OCM, especially for larger LLM models.

I had hoped the Execution Units were general-purpose compute cores akin to those in a GPU. Although this isn't documented, extensive debugging suggests they are fixed-function IP blocks. The 13 EUs are divided as follows, with some names and descriptions being educated guesses based on the limited information at hand.

Convolution Unit (3x) - Capable of performing convolution operations (Depthwise/Group Conv, Dilation, and ConvTranspose)
Computer Vision Unit (3x) - Image normalization, resizing, clipping & CV remap/warp.
Tensor Unit (3x) - Operations to support activation functions, pooling, elementwise calculations & reduction calculations.
MAU (Matrix Arithmetic Unit) (1x) - Multiply two vectors, int8/int16 inputs and fp16/fp32 outputs, topN (N < 32) outputs

For LLMs, I assume the workload is distributed across the Convolution and Tensor Execution Units (EU), possibly involving the Matrix Arithmetic Unit (MAU). If the MAU is used, it could create a bottleneck due to the single instance. Additionally, I suspect the EUs can't always operate in parallel—for example, the Convolution EU likely needs to complete its output before the Tensor EU can use it as input. Moreover, the limited OCM memory might prevent EUs of the same type from running in parallel when dealing with large weight data.

Wrap Up

The AX650N is an intriguing SoC for image and vision applications, though its suitability for LLMs feels like an afterthought or marketing hype. The performance claims are hard to verify, and my testing doesn't seem to support the TOPS figures. Additionally, the closed-source nature of the software makes it challenging, if not impossible, to develop for. This raises doubts in my mind about its suitability for use in a commercial product. The lack of support from Sipeed for the Maix-IV only exacerbates these issues. Ideally, the Maix-IV should have 16GB of memory, evenly split between Linux and CMM. This would enable running larger LLMs while fully utilizing all 8 CPU cores. In its current incarnation it's very limited for this purpose.

I welcome AXera to reach out and clarify or address the points raised in this post.

Fleetwood was kind enough to donate a board for me to review as we both shared an interest in validating the performance claims for LLMs.

Sunday, 19 May 2024

RK3588 - Reverse engineering the RKNN - Running llama2.c with TinyStories

I've finally reached a point with reverse engineering where we can start evaluating the usefulness of the NPU for LLMs. I've crafted a basic library (rk3588-npu) with sufficient features for initial integration. A good reference application for integration is llama2.c because it's a single C file, and the code structure is straightforward to follow and, more importantly, modify. We will use TinyStories (stories110m) for testing since the models are relatively small, making it easier to troubleshoot when the outputs go astray. Credit to karpathy for providing llama2.c.

Note, I've set the cpu and npu cores to max clk speeds.

We'll start with running run.c against stories110m; this is the fp32 version using the CPU with a single thread (single core). As we can see, it achieves roughly 9.7 tokens/s.

Next I converted run.c to use fp16 (_Float16) along with the weights. We see a slight drop in performance to roughly 9.3 tokens/s as arithmetic operations require a conversion back to fp32.

As with the fp32 version, a single core runs at 100% as the code is single threaded.

The next step was to offload all FP16 multiplications to the NPU. With a vocabulary size of 32,000, the largest multiplication is 768 x 32,000, with others being 768 x 768, 768 x 2048, and 2048 x 768. For efficient execution, the model weights need to reside entirely in memory, accessible to the NPU. This requires them to be within the 4GB address space, which can be problematic for larger models. In our case, the weights are roughly 256MB, requiring an expansion of the kernel CMA memory allocation to 512MB. Additionally, the weights needed conversion to the NPU format.
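As a rough sanity check on the weight size, the matrix dimensions above can be totalled up. The layer count of 12 and the breakdown of matrices per layer (four attention projections and three feed-forward matrices) are my assumptions about the stories110m configuration, not figures from this post, and the small normalization vectors are ignored.

```python
# Dimensions quoted in the text.
DIM, HIDDEN, VOCAB = 768, 2048, 32_000
# Assumptions: layer count and per-layer matrix breakdown for stories110m.
N_LAYERS = 12
BYTES_FP16 = 2

attention = 4 * DIM * DIM        # assumed: query, key, value and output projections
feed_forward = 3 * DIM * HIDDEN  # assumed: three feed-forward matrices
per_layer = attention + feed_forward
embedding = VOCAB * DIM          # 768 x 32,000 table, also used by the classifier

shared = N_LAYERS * per_layer + embedding  # classifier shares the embedding table
unshared = shared + embedding              # classifier stored as a second copy

for label, params in (("shared", shared), ("unshared", unshared)):
    mib = params * BYTES_FP16 / 2**20
    print(f"{label}: {params:,} parameters, {mib:.1f} MiB in fp16")
```

Under these assumptions the fp16 weights come to somewhere between about 209 and 256 MiB depending on whether the classifier weights are stored separately, which is in the same ballpark as the figure quoted above.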

The changes result in an additional uplift, to roughly 21 to 23 tokens/s depending on the length of the output, as per the video below. Conservatively, we could say a doubling.

CPU usage fluctuates between 30-60% for the single core. The CPU is still critical for a number of reasons:

1. We're still having to rely on memory copies to send/receive the remaining data for the multiplication to occur on the NPU.

2. Invocation of the NPU kernel driver requires CPU cycles.

3. The rest of the llama2.c code still runs on the CPU.

Although the results look promising, we need to bear in mind that TinyStories is a very small model, as per the architecture. Furthermore, it's fortunate that the converted weights can fit in memory without having to shuffle weights between userspace and physical memory. In addition, the fp16 format would further limit the possibility for larger models to run efficiently. So the conclusion so far is that there is some uplift, but mileage will vary depending on model size and number of layers.

Thursday, 8 February 2024

RK3588 - Reverse engineering the RKNN (Rockchip Neural Processing Unit)

The internal operations and capabilities of the RK3588 NPUs are mainly concealed within a closed-source SDK known as RKNPU2. Given the huge interest in Large Language Models (LLMs) and the quest for optimal matrix multiplications for transformer models, I was curious to understand the implementation of the matrix multiplication API (rknn_matmul_run) newly introduced to the SDK. A thorough examination of the RKNN section in the TRM (Technical Reference Manual) reveals no native mechanism for matrix multiplication, especially for vectors.

To grasp what's going on, the initial step was to understand how the NPU functioned. While the TRM furnished a detailed list of registers and a brief overview of the core units constituting the NPU, it notably lacked essential information on programming the registers for executing operations. For example, there were no specifics about deriving or calculating register values based on factors such as data formats (e.g., int8 vs. float16) or the size of input data or weights. Furthermore, there was no information on how to construct a pipeline for the NPU to execute. Fortunately, I had a slight advantage from a previous reverse engineering attempt on the V831 NPU. Nevertheless, even armed with this knowledge, it has still required several months of trial and error, extensive analysis of data streams, encountering a few dead ends, and numerous attempts at reverse engineering. Finally, I managed to understand how to activate the NPU and get it to execute simple operations.

The RK3588 NPU seems to be a distant cousin of the NVDLA architecture, in that some of the terminology is similar and the core units have similar functions and pipelines to NVDLA, although they have been named differently. One of the primary differences is that we can give the NPU a list of tasks (RKNN terminology) to execute and then wait for completion. For example, if I have a simple neural network consisting of 3 layers and each layer consists of convolution + bias, then it is possible to feed 3 tasks (each performing convolution + bias) to the NPU along with the necessary input, weight and bias values. Subsequently we just wait for the NPU to notify us when it's complete.

The image presented above is extracted from the TRM and has been altered because the description provided in the TRM doesn't entirely align with their diagram and, more crucially, the register naming convention. Here is my interpretation; each NPU comprises three distinct units:

CNA - Convolution Network Accelerator (includes the CORE rectangle). The TRM refers to it as the Neural Network Accelerating Engine; CNA itself isn't described.
DPU - Data Processing Unit
PPU - Planar Processing Unit

Based on the above, the NPU is primarily designed for running conventional Convolutional Neural Networks (CNNs). This is attributed to the CNA core feature, which revolves around executing convolutions by inputting image or feature data along with the corresponding weights. The emphasis on CNNs is further evident by the majority of RKNPU2 samples provided, such as YOLOX, Mobilenet, and ResNet. The CNA output can be directed to the DPU, where element-wise operations such as addition, multiplication, and RELU can be carried out. Subsequently, the DPU's output can be channeled to the PPU, where operations like min, max, and average pooling are executed. Additionally, there is the option to directly feed data to the DPU or PPU without necessitating a convolution step.

To execute convolutions efficiently, the CNA employs multiply-accumulate (MAC) operations. The performance of a CNA is partially determined by the number of MAC units used. According to the TRM, for a single NPU core the count of MAC operations depends on the input data type:

1024 int8 MAC operations per cycle
512 float16 MAC operations per cycle

Each MAC cell caches 1x1x16 weight bytes; for int8 that is 16 values, whilst for float16 it reduces to 8. We require 2 MAC cells to perform a float16 operation, hence the reduction in operations per cycle. Internally, feature and weight data must conform to Rockchip's NC1HWC2 format, where C2 is the aforementioned value. One 1x1x16 cube of feature data is then shared by all MAC cells to calculate partial sums, which are then sent to the accumulator. At a higher level the CNA appears to execute a block operation, as observed in my tests where, for instance, the MAC caches 32 channels of weight data for fp16. Hence the requirement to lay out weights in kernel groups, each with 32 channels.
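For readers unfamiliar with this style of layout, here is a minimal sketch of how an HWC tensor might be repacked into NC1HWC2. It assumes the conventional meaning of the name: the channel axis is split into C1 groups of C2 channels, with zero padding when C is not a multiple of C2. The function name and the padding behaviour are assumptions, not taken from the TRM.

```python
import numpy as np

def hwc_to_nc1hwc2(x: np.ndarray, c2: int) -> np.ndarray:
    """Repack an (H, W, C) tensor as (1, C1, H, W, C2), zero-padding channels to a multiple of C2."""
    h, w, c = x.shape
    c1 = -(-c // c2)  # ceiling division: number of channel groups
    padded = np.zeros((h, w, c1 * c2), dtype=x.dtype)
    padded[..., :c] = x
    # Split channels into (C1, C2), move the group axis in front of H and W, add a batch axis.
    return padded.reshape(h, w, c1, c2).transpose(2, 0, 1, 3)[np.newaxis]

# 20 fp16 channels with C2 = 8 gives 3 groups, the last one half empty.
x = np.arange(2 * 3 * 20, dtype=np.float16).reshape(2, 3, 20)
packed = hwc_to_nc1hwc2(x, c2=8)
print(packed.shape)
```

With this arrangement the innermost C2 values for one pixel are contiguous in memory, which matches the idea of feeding one small cube of feature data to the MAC cells at a time.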

Performance is also affected by the access time to input and weight data, so the CNA incorporates a second level cache known as the convolution buffer (cbuf). In the above diagram the 384KB onboard memory is partly for that purpose. Importantly, the number of MAC units plus the cbuf influence how large a convolution can be completed in one task.

Some of you may have already deduced that the matrix multiplication API is essentially executed through a 2D convolution. For instance, let's consider matrix A as [M x K] and matrix B as [K x N]. Matrix A represents the feature data arranged in an Mx1xK (hwc) format, while matrix B denotes the weight data organized in a 1x1xNxK (hwck) format. Consequently, the resulting matrix C [M x N] is arranged as Mx1xN. I'm at the point where I have a simple test running which asks the NPU to perform a matrix multiplication. I'm using matrix data derived from a GGML testcase (test-mul-mat.cpp) to verify the output is correct. To run the test, check out my repo and build; sadly I'm still testing against kernel 5.10 on a Rock-5b. If the test runs, the output should be as below (and as in the screenshot above).

rock@rock-5b:~/rk3588-npu/build$ ./matmul_4_36_16
drm name is rknpu - 20220829 - RKNPU driver
input dma is ffff8000, output dma is ffffa000, weights dma is ffff9000
Size of npu_regs 112
RKNPU_SUBMIT returned 0
=========================================================================================================
1224.0 1023.0 1158.0 1259.0 1359.0 1194.0 1535.0 1247.0 1185.0 1029.0 889.0 1182.0 955.0 1179.0 1147.0
1216.0 1087.0 1239.0 1361.0 1392.0 1260.0 1247.0 1563.0 1167.0 1052.0 942.0 1214.0 1045.0 1134.0 1264.0
1125.0 966.0 1079.0 1333.0 1287.0 1101.0 1185.0 1167.0 1368.0 990.0 967.0 1121.0 971.0 1086.0 1130.0
999.0 902.0 1020.0 1056.0 1076.0 929.0 1029.0 1052.0 990.0 1108.0 823.0 989.0 759.0 1041.0 1003.0
=========================================================================================================
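The mapping from matrix multiplication to a 1x1 convolution described above can be checked on the CPU with a few lines of NumPy. This is a naive reference sketch: the loop-based convolution and the kernel-major index order are chosen for readability and say nothing about how the NPU or the SDK lays data out internally.

```python
import numpy as np

def conv2d_1x1(feature: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Naive 1x1 convolution. feature: (H, W, C); weights: (kernels, 1, 1, C) -> (H, W, kernels)."""
    h, w, _ = feature.shape
    k = weights.shape[0]
    out = np.zeros((h, w, k), dtype=np.float32)
    for y in range(h):
        for x in range(w):
            for n in range(k):
                # Each output channel is the dot product of one pixel's channels with one kernel.
                out[y, x, n] = np.dot(feature[y, x, :], weights[n, 0, 0, :])
    return out

M, K, N = 4, 36, 16  # arbitrary sizes for the check
rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=(M, K)).astype(np.float32)
b = rng.integers(0, 10, size=(K, N)).astype(np.float32)

feature = a.reshape(M, 1, K)        # A as an M x 1 image with K channels
weights = b.T.reshape(N, 1, 1, K)   # each column of B becomes one 1x1 kernel
c = conv2d_1x1(feature, weights).reshape(M, N)

print(np.array_equal(c, a @ b))
```

Each row of A is treated as a single pixel with K channels and each column of B as a 1x1 kernel, so the N output channels of that pixel form the corresponding row of C.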

Regarding reverse engineering, I've reached a stage where I understand the majority of register settings that impact convolution when dealing with feature data as input. The primary uncertainty lies in determining the bank sizes for feature/weight data, however I'm hopeful that this can be deduced. After dedicating a significant amount of time to analyzing the NPU, here is a list of key areas that you should be aware of:

1. All data pointers within the NPU (e.g., input, weights, outputs, task lists) are 32-bit and must reference physical memory. Consequently, this restricts the memory range to 4GB, making it impractical to leverage a board with 16/32GB memory for the NPU to use. Moreover, it potentially imposes limitations on the types of models that can be executed on the NPU.

2. The claim of 6 TOPS should be approached with caution. While each NPU core is rated at 2 TOPS, there are registers that could potentially enable convolution across all 3 cores. However, after analyzing the data streams generated by the SDK, it appears that this feature is never utilized. Additionally, there doesn't seem to be a similar capability available for the DPU/PPU units, which would restrict its usability. In my view, the most effective approach is to treat them as individual cores and execute models on each one, taking into account the memory constraints mentioned earlier.

3. The SDK matrix multiplication API, in certain aspects, represents an inefficient utilization of the NPU. There is the overhead of memory allocation, a kernel call, and instructing the NPU to execute a single convolution. Ideally, the NPU should be tasked with executing multiple operations and provided with all the data for those operations. Typically this is how the NPU is utilized when running a CNN model (i.e. YOLOvX). The caveat here is that the converted model is limited to containing layers whose operations are supported by the NPU.

4. Initial benchmarking for the multiplication of two fp16 [512 x 512] matrices suggests that I could achieve completion in a respectable time of around 1ms. Please note, this involves sending 2 tasks to the NPU, as mentioned earlier, due to the cbuf limitation. Unfortunately, this is only part of the story when it comes to using vector data as input. The costly operations involve converting the matrices to feature and weight data formats, and vice versa for the output, if done at runtime. I made an effort to create a highly optimized conversion routine for vector to feature data conversion. According to my benchmarks, this process takes approximately 2ms for fp16 [512 x 512] matrices. I would estimate 12-15ms to perform all the conversions for the matrices mentioned above. Ideally, the matrix for the weight data should be converted ahead of time to reduce conversion overhead and, if possible, persisted for reuse.

5. I was hoping there was the capability to use a programmable core to perform custom operations. Unfortunately this isn't the case, and you're left with using OpenCL as the alternative. This brings its own challenges if you need to shuffle data between OpenCL and the NPU.
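To put the timings in point 4 into context, they can be converted into an effective throughput. This is back-of-the-envelope arithmetic on the figures quoted above; the 1 GHz clock is an assumption, and the peak is derived from the 512 float16 MACs per cycle per core.

```python
M = K = N = 512
ops = 2 * M * K * N                 # one multiply and one add per MAC

npu_time_s = 1e-3                   # ~1 ms for the multiplication itself
conversion_time_s = 12e-3           # lower end of the 12-15 ms conversion estimate

# Assumption: 512 fp16 MACs per cycle on one core at a 1 GHz clock.
peak_fp16_tops = 512 * 1e9 * 2 / 1e12

achieved_tops = ops / npu_time_s / 1e12
end_to_end_tops = ops / (npu_time_s + conversion_time_s) / 1e12

print(f"NPU only:   {achieved_tops:.3f} TOPS ({achieved_tops / peak_fp16_tops:.0%} of single-core peak)")
print(f"end to end: {end_to_end_tops:.3f} TOPS")
```

Under these assumptions the multiplication alone reaches roughly a quarter of a single core's fp16 peak, and including the runtime conversions cuts the effective rate by more than a factor of ten, which is why converting the weights ahead of time matters.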

There is still more to discover about the other units (DPU/PPU) and I'll spend time doing that. Lastly, TRM v1.0 contains numerous gaps and inconsistencies for RKNN; if anyone has a later version, it would be greatly appreciated.

Wednesday, 7 June 2023

RK3588 - RKNN Object detection on multiple video streams

Having previously reverse engineered the V831 NPU, let's now examine the RK3588 NPU. While the RK3588 RKNN advertises 6 TOPS@int8, it is not entirely clear what this figure represents since the RKNN processing unit comprises a tri-core NPU. Referring to the Technical Reference Manual (TRM), we can gather further information:

1024x3 integer 8 MAC operations per cycle

The RKNN clock is 1 GHz, therefore based on the standard TOPS formula:

TOPS = MACs * Frequency * 2

= (1024 x 3) * 1 GHz * 2

= 6.144 x 10^12 operations per second, or roughly 6 TOPS

If all three cores (1024x3) are utilized, the total computational power reaches 6 TOPS. The RKNN framework offers various workload configurations, including tri-core, dual-core, and single-core. However, upon reviewing the RKNN documentation, it appears that out of the 43 operators, only around 10 support tri-core or dual-core execution (as of v1.5.0 of RKNPU SDK) :

Conv, DepthwiseConvolution, Add, Concat, Relu, Clip, Relu6, ThresholdedRelu, Prelu, LeakyRelu

Deploying a single RKNN model in tri-core mode allows for achieving a maximum computational power of 6 TOPS, but this relies on encountering operators that support tri-core execution or having the model compiler identify parallelizable operations. Consequently, the utilization of the full 6 TOPS may be limited to specific scenarios. Given this constraint, an alternative approach could be running three instances of the model, with each instance allocated to a core. Although this approach increases memory usage, it may provide improved efficiency. For instance, when running rknn_benchmark against yolov5s-640-640.rknn for 1000 iterations with a core mask of 7 (tri-core), the results observed are (v1.5.0 sdk) :

Avg Time 9.86ms, Avg FPS = 101.416

Running 3 separate instances of rknn_benchmark for the same model with core masks 1, 2 & 4 (single core), the average per instance is:

Avg Time 18.84ms, Avg FPS = 53.084

The initial benchmark results suggest a potential improvement with this approach, as running three object detection streams in parallel could yield better overall throughput. Furthermore, this opens up the possibility of multi-stream object detection. However, it is crucial to acknowledge that the frames per second (fps) figures reported by the benchmark are quite optimistic, primarily because the test input is a static pre-cropped RGB (640x640) image and the outputs are not sorted based on confidence levels. Hence, in a real-world deployment, additional pre- and post-processing steps would be necessary and would affect the overall processing time.
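To make the comparison concrete, the sketch below adds up the per-instance figures. It assumes the three single-core instances really do run concurrently without slowing each other down beyond what the benchmark already measured; the variable names are mine, the numbers are the rknn_benchmark results quoted above.

```python
TRI_CORE_FPS = 101.416      # one instance, core mask 7
SINGLE_CORE_FPS = 53.084    # average per instance, core masks 1, 2 and 4
INSTANCES = 3

def aggregate_fps(per_instance_fps: float, instances: int) -> float:
    """Total inferences per second across independent instances."""
    return per_instance_fps * instances

aggregate = aggregate_fps(SINGLE_CORE_FPS, INSTANCES)
speedup = aggregate / TRI_CORE_FPS

print(f"3 x single core: {aggregate:.1f} inferences/s")
print(f"tri-core:        {TRI_CORE_FPS:.1f} inferences/s")
print(f"ratio:           {speedup:.2f}x")
```

Each individual inference is slower on a single core (18.84ms against 9.86ms), but the combined throughput is higher, which is what matters when the goal is several streams rather than one fast one.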

In order to assess the feasibility of the aforementioned approach, I developed a C++ application that performs several tasks concurrently. This application includes the decoding of an H264 stream, resizing and converting each frame to RGB (640x640), running the yolov5 model on each frame for object detection whilst simultaneously rendering the video. It's worth noting that video playback occurs independently of rendering the rectangles generated by yolov5 through an overlay. The primary challenge encountered during development was optimizing the frequency of frame conversions and resizing for both inference and rendering. This optimization was crucial to ensure that the output rectangles from yolov5 remained synchronized with the corresponding video frame intended for rendering. Otherwise fast moving objects in the video stream are noticeably out of sync with the detected rectangle for that frame. The main argument passed to the application is the core mask, which allows the selection of which NPU core(s) to utilize for the processing tasks.
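The core mask values used in this post (1, 2, 4 and 7) read naturally as a bitmask with one bit per NPU core. The sketch below decodes a mask on that assumption; the bit-to-core mapping and the helper name are my own illustration rather than something taken from the SDK headers.

```python
NPU_CORES = 3  # the RK3588 NPU is tri-core

def cores_from_mask(mask: int) -> list[int]:
    """Return the NPU core indices selected by a core mask (bit n = core n, assumed)."""
    if mask <= 0 or mask >= (1 << NPU_CORES):
        raise ValueError(f"core mask must be between 1 and {(1 << NPU_CORES) - 1}")
    return [core for core in range(NPU_CORES) if mask & (1 << core)]

for mask in (1, 2, 4, 7):
    print(f"core mask {mask}: cores {cores_from_mask(mask)}")
```

Under this reading, three instances started with masks 1, 2 and 4 each get a core to themselves, while a single instance with mask 7 spans all three.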

As shown in the showcase video above, by running three instances of the application, each assigned a single NPU core, we were able to achieve sufficient performance to keep up (well, almost in the case of the 60fps stream) with the video playback rate. The application was tested on the following boards running under weston:

Mekotronics R58 Mini HDD
Radxa Rock 5-b

The test videos, sourced from the kangle site, are either 1080p at 60 or 30 frames per second (fps). To fit all the videos on the same display (1080p resolution), they are not resized back to their original format. The detected objects are color-coded as follows:

Red: person
Green: car, truck, bus, bicycle
Blue: anything else

Benchmarks from concurrently running 3 instances show an average per instance of:

Avg Time = 25.20ms Avg FPS = 38.49

Compared to a single instance running with NPU in tri-core mode

Avg Time = 15.92ms Avg FPS = 61.42

Based on my testing, it is possible to run object detection on 3 video streams at 1080p@30, assuming the inference time of your model on a single NPU core is less than 25ms. This work was done as part of a suite of video applications that I'm developing for the RK3588.
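A rough way to reason about this is to compare the per-frame processing time with the time available per frame at the stream's frame rate. The sketch below is a simplification: it treats the measured 25.20ms as the whole per-frame cost and ignores any overlap between decoding, inference and rendering. The helper names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame at the given frame rate."""
    return 1000.0 / fps

def keeps_up(processing_ms: float, fps: float) -> bool:
    """True if every frame can be processed before the next one arrives."""
    return processing_ms <= frame_budget_ms(fps)

MEASURED_MS = 25.20  # average per instance with 3 instances running concurrently

for fps in (30, 60):
    verdict = "keeps up" if keeps_up(MEASURED_MS, fps) else "falls behind"
    print(f"{fps}fps: budget {frame_budget_ms(fps):.2f}ms, measured {MEASURED_MS:.2f}ms -> {verdict}")
```

This matches what the video shows: the 30fps streams are comfortable, while a 60fps stream has a budget of under 17ms per frame and cannot have every frame processed.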

CPU usage while running the 3 instances:

Tasks: 236 total,   2 running, 234 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.6 us, 3.3 sy, 0.0 ni, 84.5 id, 1.3 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem :   7691.7 total,   6531.9 free,    558.4 used,    601.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   6942.5 avail Mem

    PID USER      PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   1422 rock       1 -19 1153180 128604 89000 S 32.1   1.6   0:18.35 subsurf+
   1439 rock       1 -19 1158676 128880 89480 S 31.8   1.6   0:12.97 subsurf+
   1404 rock       1 -19 1161504 132664 89456 S 28.1   1.7   0:24.37 subsurf+
   1000 rock      20   0 705784 99024 76756 S 21.2   1.3   0:56.09 weston
    363 root      20   0   94212 48112 47004 R   4.6   0.6   0:24.04 systemd+
    212 root     -51   0       0      0      0 S   4.3   0.0   0:10.57 irq/34-+
    927 rock      20   0   16096   4608   3416 S   0.7   0.1   0:00.60 sshd
   1100 root      20   0       0      0      0 I   0.7   0.0   0:01.05 kworker+
   1395 root       0 -20       0      0      0 I   0.7   0.0   0:01.23 kworker+
   1402 root       0 -20       0      0      0 I   0.7   0.0   0:00.76 kworker+
    139 root      20   0       0      0      0 S   0.3   0.0   0:01.07 queue_w+
    371 root       0 -20       0      0      0 I   0.3   0.0   0:00.16 kworker+
    910 rock      20   0   16096   4600   3408 S   0.3   0.1   0:00.89 sshd
   1329 root       0 -20       0      0      0 I   0.3   0.0   0:00.43 kworker+
   1330 root      20   0       0      0      0 I   0.3   0.0   0:00.76 kworker+
   1403 root      20   0    7124   3128   2364 R   0.3   0.0   0:00.60 top
   1421 root      20   0       0      0      0 I   0.3   0.0   0:00.46 kworker+

Sunday, 30 April 2023

RK3588 - Adventures with an external GPU through PCIE Gen3 x4 (Radxa Rock-5b)

One of the interesting features of the RK3588 is the pcie controller because of its support for a Gen3 x4 link. I'd started looking into using the controller for a forthcoming project, and this subsequently led me to the idea of testing the controller against an external GPU card to gain an understanding of its limitations and potential. From what I understand, Jeff Geerling has been on a similar journey with the RPI CM4 and has had limited success with help from numerous developers. Furthermore, there was a Radxa tweet which gave a teasing glimpse of a working GPU. So let's see what is or isn't possible using a Rock-5b.

I'd managed to get hold of a Radeon R7 520 (XFX R7 250 low-profile) card along with an M.2 Key M Extender Cable to PCIE x16 Graphics Card Riser Adapter. To power the card I reused an old LR1007 120W 12VDC ATX board which was to hand. With the setup shown below, we reuse the nvme slot for the m.2 adapter and revert back to an sd card for booting an OS. I used the Radxa debian image with a custom compiled Radxa kernel to include the graphics card drivers and fixes. Having reviewed the pcie BAR definitions in the rk3588.dtsi, there should be enough address space available for the card to use. After removing the hdmi and mali drivers from the kernel config, I initially tried the amdgpu driver, but that reports an error and gives no display output:

[   11.844163] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[   11.844378] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v6_0> failed -110
[   11.844383] amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[   11.844388] amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[   11.844414] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
[   11.846559] [drm] amdgpu: ttm finalized
[   11.848018] amdgpu: probe of 0000:01:00.0 failed with error -110

The radeon driver fared slightly better, with a similar error but at least display output for a console login:

[ 12.059398] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)
[ 12.059408] radeon 0000:01:00.0: disabling GPU acceleration

This was puzzling as the card relies on pcie memory mapped I/O, which the RK3588 should see as standard memory and be able to read from and write to. It turns out that Peter Geis, who was attempting to mainline a pcie driver for the RK3566, raised 2 issues per this thread, to which Rockchip replied. The same issues weren't improved/fixed on the RK3588 as mentioned here. In simple terms for our requirements:

1. For pcie dma transfers, memory allocations are limited to 32-bit addresses, so a 4GB board might not see an issue. On an 8GB board like mine, however, the kernel could pick an address range above 4GB.

2. AMD cards rely on pcie snooping, but there is no CPU snooping on the RK3588 interconnect. So any cached copies of the same device memory block won't get updated to remain in sync.
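The first issue boils down to a simple range check: a buffer is only usable for dma if it lies entirely below the 4GB boundary. The sketch below illustrates that check in Python; the addresses are made-up examples and do not reflect the actual RK3588 physical memory map, and this is not the code from the driver workaround.

```python
DMA32_LIMIT = 1 << 32  # first address that no longer fits in 32 bits (4GB)

def dma32_reachable(addr: int, size: int) -> bool:
    """True if the buffer [addr, addr + size) lies entirely below 4GB."""
    return addr >= 0 and size > 0 and addr + size <= DMA32_LIMIT

# Hypothetical buffer placements, for illustration only.
examples = {
    "below 4GB":       (0x8000_0000, 0x10_0000),
    "above 4GB":       (0x1_2000_0000, 0x10_0000),
    "straddles 4GB":   (0xFFFF_F000, 0x2000),
}

for name, (addr, size) in examples.items():
    print(f"{name}: addr={addr:#x} size={size:#x} reachable={dma32_reachable(addr, size)}")
```

On a board with 4GB or less there may be nothing above the boundary to pick from, which is why the problem tends to show up only on larger memory configurations.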

If we hack the Radeon driver to work around these issues we get:

[   12.529087] [drm] ring test on 0 succeeded in 1 usecs
[   12.529094] [drm] ring test on 1 succeeded in 1 usecs
[   12.529102] [drm] ring test on 2 succeeded in 1 usecs
[   12.529121] [drm] ring test on 3 succeeded in 8 usecs
[   12.529132] [drm] ring test on 4 succeeded in 3 usecs
[   12.706419] [drm] ring test on 5 succeeded in 2 usecs
[   12.706427] [drm] UVD initialized successfully.
[   12.816582] [drm] ring test on 6 succeeded in 18 usecs
[   12.816625] [drm] ring test on 7 succeeded in 5 usecs
[   12.816627] [drm] VCE initialized successfully.
[   12.816879] [drm:si_irq_set [radeon]] si_irq_set: sw int gfx
[   12.816921] [drm] ib test on ring 0 succeeded in 0 usecs
[   12.816989] [drm:si_irq_set [radeon]] si_irq_set: sw int cp1
[   12.817028] [drm] ib test on ring 1 succeeded in 0 usecs
[   12.817088] [drm:si_irq_set [radeon]] si_irq_set: sw int cp2
[   12.817127] [drm] ib test on ring 2 succeeded in 0 usecs
[   12.817185] [drm:si_irq_set [radeon]] si_irq_set: sw int dma
[   12.817224] [drm] ib test on ring 3 succeeded in 0 usecs
[   12.817281] [drm:si_irq_set [radeon]] si_irq_set: sw int dma1
[   12.817319] [drm] ib test on ring 4 succeeded in 0 usecs
[   13.477677] [drm] ib test on ring 5 succeeded
[   13.984454] [drm] ib test on ring 6 succeeded
[   14.491404] [drm] ib test on ring 7 succeeded
...
[   14.549296] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 1

So potentially we have graphics acceleration ... let's try kmstest

rock@rock-5b:~$ kmstest
trying to open device 'i915'...failed
trying to open device 'amdgpu'...failed
trying to open device 'radeon'...done
main: All ok!

Next (fingers crossed) kmscube

rock@rock-5b:~$ kmscube
Using display 0x55b67f0020 with EGL version 1.5
===================================
EGL information:
version: "1.5"
vendor: "Mesa Project"
client extensions: "EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses EGL_EXT_client_extensions EGL_KHR_debug EGL_EXT_platform_device EGL_EXT_platform_wayland EGL_KHR_platform_wayland EGL_EXT_platform_x11 EGL_KHR_platform_x11 EGL_MESA_platform_gbm EGL_KHR_platform_gbm EGL_MESA_platform_surfaceless"
display extensions: "EGL_ANDROID_blob_cache EGL_EXT_buffer_age EGL_EXT_create_context_robustness EGL_EXT_image_dma_buf_import EGL_EXT_image_dma_buf_import_modifiers EGL_KHR_cl_event2 EGL_KHR_config_attribs EGL_KHR_create_context EGL_KHR_create_context_no_error EGL_KHR_fence_sync EGL_KHR_get_all_proc_addresses EGL_KHR_gl_colorspace EGL_KHR_gl_renderbuffer_image EGL_KHR_gl_texture_2D_image EGL_KHR_gl_texture_3D_image EGL_KHR_gl_texture_cubemap_image EGL_KHR_image EGL_KHR_image_base EGL_KHR_image_pixmap EGL_KHR_no_config_context EGL_KHR_reusable_sync EGL_KHR_surfaceless_context EGL_EXT_pixel_format_float EGL_KHR_wait_sync EGL_MESA_configless_context EGL_MESA_drm_image EGL_MESA_image_dma_buf_export EGL_MESA_query_driver EGL_WL_bind_wayland_display "
===================================
OpenGL ES 2.x information:
version: "OpenGL ES 3.2 Mesa 20.3.5"
shading language version: "OpenGL ES GLSL ES 3.20"
vendor: "AMD"
renderer: "AMD VERDE (DRM 2.50.0, 5.10.110-99-rockchip-g6e21553c2116, LLVM 11.0.1)"
extensions: "GL_EXT_blend_minmax GL_EXT_multi_draw_arrays GL_EXT_texture_filter_anisotropic GL_EXT_texture_compression_s3tc GL_EXT_texture_compression_dxt1 GL_EXT_texture_compression_rgtc GL_EXT_texture_format_BGRA8888 GL_OES_compressed_ETC1_RGB8_texture GL_OES_depth24 GL_OES_element_index_uint GL_OES_fbo_render_mipmap GL_OES_mapbuffer GL_OES_rgb8_rgba8 GL_OES_standard_derivatives GL_OES_stencil8 GL_OES_texture_3D GL_OES_texture_float GL_OES_texture_float_linear GL_OES_texture_half_float GL_OES_texture_half_float_linear GL_OES_texture_npot GL_OES_vertex_half_float GL_EXT_draw_instanced GL_EXT_texture_sRGB_decode GL_OES_EGL_image GL_OES_depth_texture GL_AMD_performance_monitor GL_OES_packed_depth_stencil GL_EXT_texture_type_2_10_10_10_REV GL_NV_conditional_render GL_OES_get_program_binary GL_APPLE_texture_max_level GL_EXT_discard_framebuffer GL_EXT_read_format_bgra GL_EXT_frag_depth GL_NV_fbo_color_attachments GL_OES_EGL_image_external GL_OES_EGL_sync GL_OES_vertex_array_object GL_OES_viewport_array GL_ANGLE_pack_reverse_row_order GL_ANGLE_texture_compression_dxt3 GL_ANGLE_texture_compression_dxt5 GL_EXT_occlusion_query_boolean GL_EXT_robustness GL_EXT_texture_rg GL_EXT_unpack_subimage GL_NV_draw_buffers GL_NV_read_buffer GL_NV_read_depth GL_NV_read_depth_stencil GL_NV_read_stencil GL_EXT_draw_buffers GL_EXT_map_buffer_range GL_KHR_debug GL_KHR_robustness GL_KHR_texture_compression_astc_ldr GL_NV_pixel_buffer_object GL_OES_depth_texture_cube_map GL_OES_required_internalformat GL_OES_surfaceless_context GL_EXT_color_buffer_float GL_EXT_sRGB_write_control GL_EXT_separate_shader_objects GL_EXT_shader_group_vote GL_EXT_shader_implicit_conversions GL_EXT_shader_integer_mix GL_EXT_tessellation_point_size GL_EXT_tessellation_shader GL_ANDROID_extension_pack_es31a GL_EXT_base_instance GL_EXT_compressed_ETC1_RGB8_sub_texture GL_EXT_copy_image GL_EXT_draw_buffers_indexed GL_EXT_draw_elements_base_vertex GL_EXT_gpu_shader5 GL_EXT_polygon_offset_clamp 
GL_EXT_primitive_bounding_box GL_EXT_render_snorm GL_EXT_shader_io_blocks GL_EXT_texture_border_clamp GL_EXT_texture_buffer GL_EXT_texture_cube_map_array GL_EXT_texture_norm16 GL_EXT_texture_view GL_KHR_blend_equation_advanced GL_KHR_context_flush_control GL_KHR_robust_buffer_access_behavior GL_NV_image_formats GL_OES_copy_image GL_OES_draw_buffers_indexed GL_OES_draw_elements_base_vertex GL_OES_gpu_shader5 GL_OES_primitive_bounding_box GL_OES_sample_shading GL_OES_sample_variables GL_OES_shader_io_blocks GL_OES_shader_multisample_interpolation GL_OES_tessellation_point_size GL_OES_tessellation_shader GL_OES_texture_border_clamp GL_OES_texture_buffer GL_OES_texture_cube_map_array GL_OES_texture_stencil8 GL_OES_texture_storage_multisample_2d_array GL_OES_texture_view GL_EXT_blend_func_extended GL_EXT_buffer_storage GL_EXT_float_blend GL_EXT_geometry_point_size GL_EXT_geometry_shader GL_EXT_shader_samples_identical GL_KHR_no_error GL_KHR_texture_compression_astc_sliced_3d GL_OES_EGL_image_external_essl3 GL_OES_geometry_point_size GL_OES_geometry_shader GL_OES_shader_image_atomic GL_EXT_clip_cull_distance GL_EXT_disjoint_timer_query GL_EXT_texture_compression_s3tc_srgb GL_EXT_window_rectangles GL_MESA_shader_integer_functions GL_EXT_clip_control GL_EXT_color_buffer_half_float GL_EXT_memory_object GL_EXT_memory_object_fd GL_EXT_texture_compression_bptc GL_KHR_parallel_shader_compile GL_NV_alpha_to_coverage_dither_control GL_EXT_EGL_image_storage GL_EXT_texture_sRGB_R8 GL_EXT_texture_shadow_lod GL_INTEL_blackhole_render GL_MESA_framebuffer_flip_y GL_EXT_depth_clamp GL_EXT_texture_query_lod "
===================================
Using modifier ffffffffffffff
Modifiers failed!
Bus error

The 'bus error' indicates a memory alignment issue and turns out to be a bit of a rabbit hole. To fix the Radeon kernel driver we ensure the card's memory is mapped as 'Device memory' of type Device-nGnRnE, which does not permit unaligned access. If it were 'Normal Memory' then unaligned access would be allowed. This implies fixing up userspace drivers/applications as these errors are encountered, because these applications can directly manipulate the card's memory. This particular bus error was caused by a memcpy in the radeon gallium driver; with a fix applied there, kmscube runs as shown in the video:

===================================
Using modifier ffffffffffffff
Modifiers failed!
Using modifier ffffffffffffff
Modifiers failed!
Rendered 120 frames in 2.000246 sec (59.992635 fps)
Rendered 240 frames in 4.000428 sec (59.993577 fps)
Rendered 361 frames in 6.016865 sec (59.998019 fps)
Rendered 481 frames in 8.017015 sec (59.997390 fps)
Rendered 601 frames in 10.017050 sec (59.997704 fps)
Rendered 721 frames in 12.017079 sec (59.997942 fps)
Rendered 841 frames in 14.017118 sec (59.998067 fps)
Rendered 961 frames in 16.017314 sec (59.997574 fps)
Rendered 1082 frames in 18.033850 sec (59.998280 fps)
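To illustrate why a plain memcpy can trip over this, the sketch below plans a copy into device memory so that every access is naturally aligned: single bytes until the offset is aligned, then the widest aligned accesses that fit, then a byte tail. This is only an illustration of the idea, written in Python for readability; it is not the actual change made to the gallium driver, and the function name is hypothetical.

```python
def aligned_access_plan(offset: int, length: int, max_width: int = 8) -> list[tuple[int, int]]:
    """Split [offset, offset + length) into (offset, width) accesses that are all naturally aligned."""
    plan = []
    pos, end = offset, offset + length
    while pos < end:
        width = max_width
        # Shrink the access until it is aligned to its own width and fits in the range.
        while pos % width != 0 or pos + width > end:
            width //= 2
        plan.append((pos, width))
        pos += width
    return plan

# A 14-byte copy starting at an odd offset.
for pos, width in aligned_access_plan(offset=3, length=14):
    print(f"offset {pos:2d}: {width}-byte access")
```

An optimized memcpy is free to issue wide loads and stores at whatever offset it is given, which is fine for Normal memory but faults on a Device mapping.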

Similar fixes were applied to glmark2-drm & glmark2-es2-drm so that they run successfully (1680x1050 resolution), although the terrain scene displayed a bunch of colored bars on the screen.

=======================================================
    glmark2 2021.12
=======================================================
    OpenGL Information
    GL_VENDOR:      AMD
    GL_RENDERER:    AMD VERDE (DRM 2.50.0, 5.10.110-99-rockchip-g6e21553c2116, LLVM 11.0.1)
    GL_VERSION:     4.5 (Compatibility Profile) Mesa 20.3.5
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   1680x1050 fullscreen
=======================================================
[build] use-vbo=false: FPS: 939 FrameTime: 1.066 ms
[build] use-vbo=true: FPS: 2411 FrameTime: 0.415 ms
[texture] texture-filter=nearest: FPS: 1957 FrameTime: 0.511 ms
[texture] texture-filter=linear: FPS: 1958 FrameTime: 0.511 ms
[texture] texture-filter=mipmap: FPS: 2003 FrameTime: 0.499 ms
[shading] shading=gouraud: FPS: 1975 FrameTime: 0.506 ms
[shading] shading=blinn-phong-inf: FPS: 1973 FrameTime: 0.507 ms
[shading] shading=phong: FPS: 1976 FrameTime: 0.506 ms
[shading] shading=cel: FPS: 1974 FrameTime: 0.507 ms
[bump] bump-render=high-poly: FPS: 1739 FrameTime: 0.575 ms
[bump] bump-render=normals: FPS: 2373 FrameTime: 0.422 ms
[bump] bump-render=height: FPS: 2330 FrameTime: 0.429 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1254 FrameTime: 0.798 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 707 FrameTime: 1.415 ms
[pulsar] light=false:quads=5:texture=false: FPS: 1338 FrameTime: 0.747 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 456 FrameTime: 2.194 ms
[desktop] effect=shadow:windows=4: FPS: 600 FrameTime: 1.667 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 214 FrameTime: 4.684 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 233 FrameTime: 4.306 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 347 FrameTime: 2.885 ms
[ideas] speed=duration: FPS: 1430 FrameTime: 0.700 ms
[jellyfish] <default>: FPS: 806 FrameTime: 1.242 ms
[terrain] <default>: FPS: 150 FrameTime: 6.706 ms
[shadow] <default>: FPS: 843 FrameTime: 1.188 ms
[refract] <default>: FPS: 115 FrameTime: 8.718 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 1970 FrameTime: 0.508 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 1980 FrameTime: 0.505 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 1972 FrameTime: 0.507 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 1979 FrameTime: 0.505 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 1971 FrameTime: 0.507 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 1972 FrameTime: 0.507 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 1972 FrameTime: 0.507 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 1968 FrameTime: 0.508 ms
=======================================================
                                  glmark2 Score: 1450
=======================================================

Next up was to see if startx would run. Unfortunately it drops out with a shader compiler error. It looks like glamor is using egl but encounters an opengl shader to compile; this requires further investigation.

[ 7916.924] (II) modeset(0): Modeline "360x202"x119.0   11.25 360 372 404 448 202 204 206 211 doublescan -hsync +vsync (25.1 kHz d)
[ 7916.924] (II) modeset(0): Modeline "360x202"x118.3   10.88 360 384 400 440 202 204 206 209 doublescan +hsync -vsync (24.7 kHz d)
[ 7916.924] (II) modeset(0): Modeline "320x180"x119.7    9.00 320 332 360 400 180 181 184 188 doublescan -hsync +vsync (22.5 kHz d)
[ 7916.924] (II) modeset(0): Modeline "320x180"x118.6    8.88 320 344 360 400 180 181 184 187 doublescan +hsync -vsync (22.2 kHz d)
[ 7916.925] (II) modeset(0): Output DVI-D-1 status changed to disconnected.
[ 7916.925] (II) modeset(0): EDID for output DVI-D-1
[ 7916.939] (II) modeset(0): Output VGA-1 status changed to disconnected.
[ 7916.939] (II) modeset(0): EDID for output VGA-1
[ 7916.939] (II) modeset(0): Output HDMI-1 connected
[ 7916.939] (II) modeset(0): Output DVI-D-1 disconnected
[ 7916.939] (II) modeset(0): Output VGA-1 disconnected
[ 7916.939] (II) modeset(0): Using exact sizes for initial modes
[ 7916.939] (II) modeset(0): Output HDMI-1 using initial mode 1680x1050 +0+0
[ 7916.939] (==) modeset(0): Using gamma correction (1.0, 1.0, 1.0)
[ 7916.939] (==) modeset(0): DPI set to (96, 96)
[ 7916.939] (II) Loading sub module "fb"
[ 7916.939] (II) LoadModule: "fb"
[ 7916.940] (II) Loading /usr/lib/xorg/modules/libfb.so
[ 7916.944] (II) Module fb: vendor="X.Org Foundation"
[ 7916.944]    compiled for 1.20.11, module version = 1.0.0
[ 7916.944]    ABI class: X.Org ANSI C Emulation, version 0.4
[ 7916.964] Failed to compile VS: 0:1(1): error: syntax error, unexpected NEW_IDENTIFIER

[ 7916.964] Program source:
precision highp float;
attribute vec4 v_position;
attribute vec4 v_texcoord;
varying vec2 source_texture;

void main()
{
    gl_Position = v_position;
    source_texture = v_texcoord.xy;
}
[ 7916.964] (EE)
Fatal server error:
[ 7916.964] (EE) GLSL compile failure
[ 7916.964] (EE)

Lastly I installed vaapi to attempt video playback. Unfortunately, even after fixing a couple of bus errors in gallium, there's more to fix. So this pretty much sums up the nature of the problem to address. Furthermore, this does raise the question of whether the tweet from Radxa was using accelerated graphics, given the hardware restrictions of the RK3588.