Wednesday 7 June 2023

RK3588 - RKNN Object detection on multiple video streams


Having previously reverse engineered the V831 NPU, let's now examine the RK3588 NPU. While the RK3588 RKNN advertises 6 TOPS@int8, it is not entirely clear what this figure represents, since the RKNN processing unit comprises a tri-core NPU. The Technical Reference Manual (TRM) gives further information:

1024x3 integer 8 MAC operations per cycle

The RKNN clock is 1 GHz, so based on the standard TOPS formula:

TOPS = MACs * Frequency * 2

          = (1024x3) * 1 GHz * 2

          = 6.144 TOPS
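As a quick check of the arithmetic, here is a minimal sketch of the formula applied to one, two and three cores. The function name `peak_tops` is made up for illustration, and the figures are theoretical peaks derived from the TRM number quoted above, not measurements.

```python
def peak_tops(macs_per_cycle: int, clock_hz: float) -> float:
    """Theoretical peak in TOPS; each MAC counts as two operations (multiply + add)."""
    return macs_per_cycle * clock_hz * 2 / 1e12


MACS_PER_CORE = 1024  # int8 MACs per cycle per core, from the TRM figure quoted above
CLOCK_HZ = 1e9        # 1 GHz NPU clock

for cores in (1, 2, 3):
    tops = peak_tops(MACS_PER_CORE * cores, CLOCK_HZ)
    print(f"{cores} core(s): {tops:.3f} TOPS")
```

On these assumptions a single core peaks at roughly 2 TOPS, which is why the advertised 6 TOPS only applies when all three cores are busy.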

If all three cores (1024x3) are utilized, the total computational power reaches 6 TOPS. The RKNN framework offers various workload configurations, including tri-core, dual-core, and single-core. However, according to the RKNN documentation, only around 10 of the 43 operators support tri-core or dual-core execution (as of v1.5.0 of the RKNPU SDK):

Conv, DepthwiseConvolution, Add, Concat, Relu, Clip, Relu6, ThresholdedRelu, Prelu, LeakyRelu

Deploying a single RKNN model in tri-core mode allows for achieving a maximum computational power of 6 TOPS, but this relies on encountering operators that support tri-core execution or having the model compiler identify parallelizable operations. Consequently, the utilization of the full 6 TOPS may be limited to specific scenarios. Given this constraint, an alternative approach could be running three instances of the model, with each instance allocated to a core. Although this approach increases memory usage, it may provide improved efficiency. For instance, when running rknn_benchmark against yolov5s-640-640.rknn for 1000 iterations with a core mask of 7 (tri-core), the results observed are (v1.5.0 sdk) :

Avg Time 9.86ms, Avg FPS = 101.416

Running three separate instances of rknn_benchmark for the same model with core masks 1, 2 and 4 (one core each), the average per instance is:

Avg Time 18.84ms, Avg FPS = 53.084
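To make the comparison concrete, the two benchmark results can be turned into an aggregate figure. This is only arithmetic on the numbers reported above; the variable names are illustrative.

```python
TRI_CORE_FPS = 101.416    # one instance, core mask 7
SINGLE_CORE_FPS = 53.084  # average per instance, core masks 1, 2 and 4
INSTANCES = 3

# Total frames per second across the three single-core instances.
aggregate_fps = SINGLE_CORE_FPS * INSTANCES
# How that total compares with one tri-core instance.
speedup = aggregate_fps / TRI_CORE_FPS

print(f"aggregate: {aggregate_fps:.1f} FPS, {speedup:.2f}x the tri-core figure")
```

Three single-core instances together deliver roughly 1.5 times the throughput of one tri-core instance, at the cost of a higher per-frame latency in each instance.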

The initial benchmark results suggest a potential improvement with this approach, as running three object detection streams in parallel could yield better overall performance. Furthermore, this opens up the possibility of multi-stream object detection. However, the frames per second (fps) figures reported by the benchmark are quite optimistic, primarily because the test input is a static pre-cropped RGB (640x640) image and the outputs are not sorted by confidence level. In a real-world deployment, additional pre- and post-processing steps would be necessary and would affect the overall processing time.

To assess the feasibility of this approach, I developed a C++ application that performs several tasks concurrently: decoding an H264 stream, resizing and converting each frame to RGB (640x640), running the yolov5 model on each frame for object detection, and simultaneously rendering the video. Video playback occurs independently of rendering the rectangles generated by yolov5, which are drawn through an overlay. The primary challenge during development was optimizing the frequency of frame conversions and resizing for both inference and rendering. This was crucial to keep the output rectangles from yolov5 synchronized with the corresponding video frame intended for rendering; otherwise fast-moving objects in the video stream are noticeably out of sync with the detected rectangle for that frame. The main argument passed to the application is the core mask, which selects which NPU core(s) to use for processing.
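The core mask is a small bit field. The sketch below decodes a mask into core indices, assuming bit 0 selects core 0, bit 1 core 1 and bit 2 core 2, which is consistent with the masks 1, 2, 4 and 7 used in the benchmarks above. The helper name `cores_in_mask` is hypothetical and is not part of the RKNN API.

```python
from typing import List

# Assumed bit assignment: one bit per NPU core.
CORE_BITS = {0: 0x1, 1: 0x2, 2: 0x4}


def cores_in_mask(mask: int) -> List[int]:
    """Return the core indices selected by a core mask in the range 1..7."""
    if not 1 <= mask <= 7:
        raise ValueError(f"core mask must be between 1 and 7, got {mask}")
    return [core for core, bit in CORE_BITS.items() if mask & bit]


for mask in (1, 2, 4, 7):
    print(f"mask {mask}: cores {cores_in_mask(mask)}")
```

Launching three instances with masks 1, 2 and 4 therefore pins each instance to its own core, while mask 7 asks for all three at once.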

As shown in the showcase video above, by running three instances of the application, each assigned a single NPU core, we were able to achieve sufficient performance to keep up with the video playback rate (well, almost, in the case of the 60fps stream). The application was tested on the following boards running under weston:

The test videos, sourced from the kangle site, are either 1080p at 60 or 30 frames per second (fps). To fit all the videos on the same display (1080p resolution), they are not resized back to their original format. The detected objects are color-coded as follows:
  • Red: person
  • Green: car, truck, bus, bicycle
  • Blue: anything else
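The colour coding above amounts to a small lookup. A minimal sketch, assuming the detector reports class labels as lowercase strings; the names and RGB tuples are illustrative and not taken from the application.

```python
RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 255)

# Labels drawn in green, per the list above.
VEHICLES = {"car", "truck", "bus", "bicycle"}


def box_colour(label: str):
    """Pick the rectangle colour for a detected class label."""
    if label == "person":
        return RED
    if label in VEHICLES:
        return GREEN
    return BLUE


for label in ("person", "bus", "dog"):
    print(label, box_colour(label))
```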

Benchmarks from concurrently running 3 instances show an average per instance of:

Avg Time = 25.20ms   Avg FPS = 38.49

Compared to a single instance running with the NPU in tri-core mode:

Avg Time = 15.92ms   Avg FPS = 61.42

Based on my testing, it is possible to run object detection on 3 video streams at 1080p@30, assuming the inference time of your model on a single NPU core is less than 25ms. This work was done as part of a suite of video applications that I'm developing for the RK3588.
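The reasoning behind that threshold is a per-frame time budget: a stream at N fps leaves 1000/N milliseconds per frame. The sketch below is a simplified check that ignores decode, resize and rendering overhead and assumes one inference per frame; the function names are made up for illustration.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def keeps_up(inference_ms: float, stream_fps: float) -> bool:
    """True if one inference fits inside one frame period."""
    return inference_ms <= frame_budget_ms(stream_fps)


AVG_INFERENCE_MS = 25.20  # per-instance average measured with three concurrent instances

for fps in (30, 60):
    verdict = "keeps up" if keeps_up(AVG_INFERENCE_MS, fps) else "falls behind"
    print(f"{fps} fps: budget {frame_budget_ms(fps):.2f} ms, {verdict}")
```

At 30 fps the budget is about 33 ms, so a 25 ms inference leaves some headroom for pre- and post-processing; at 60 fps the budget is under 17 ms, which matches the observation that the 60fps stream only almost keeps up.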

CPU usage while running the 3 instances:

Tasks: 236 total,   2 running, 234 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.6 us,  3.3 sy,  0.0 ni, 84.5 id,  1.3 wa,  0.0 hi,  0.3 si,  0.0 st
MiB Mem :   7691.7 total,   6531.9 free,    558.4 used,    601.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   6942.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1422 rock       1 -19 1153180 128604  89000 S  32.1   1.6   0:18.35 subsurf+
   1439 rock       1 -19 1158676 128880  89480 S  31.8   1.6   0:12.97 subsurf+
   1404 rock       1 -19 1161504 132664  89456 S  28.1   1.7   0:24.37 subsurf+
   1000 rock      20   0  705784  99024  76756 S  21.2   1.3   0:56.09 weston  
    363 root      20   0   94212  48112  47004 R   4.6   0.6   0:24.04 systemd+
    212 root     -51   0       0      0      0 S   4.3   0.0   0:10.57 irq/34-+
    927 rock      20   0   16096   4608   3416 S   0.7   0.1   0:00.60 sshd    
   1100 root      20   0       0      0      0 I   0.7   0.0   0:01.05 kworker+
   1395 root       0 -20       0      0      0 I   0.7   0.0   0:01.23 kworker+
   1402 root       0 -20       0      0      0 I   0.7   0.0   0:00.76 kworker+
    139 root      20   0       0      0      0 S   0.3   0.0   0:01.07 queue_w+
    371 root       0 -20       0      0      0 I   0.3   0.0   0:00.16 kworker+
    910 rock      20   0   16096   4600   3408 S   0.3   0.1   0:00.89 sshd    
   1329 root       0 -20       0      0      0 I   0.3   0.0   0:00.43 kworker+
   1330 root      20   0       0      0      0 I   0.3   0.0   0:00.76 kworker+
   1403 root      20   0    7124   3128   2364 R   0.3   0.0   0:00.60 top     
   1421 root      20   0       0      0      0 I   0.3   0.0   0:00.46 kworker+

Sunday 30 April 2023

RK3588 - Adventures with an external GPU through PCIE Gen3 x4 (Radxa Rock-5b)

One of the interesting features of the RK3588 is the PCIe controller because of its support for a Gen3 x4 link. I'd started looking into using the controller for a forthcoming project, and this led me to the idea of testing the controller against an external GPU card to gain an understanding of its limitations and potential. From what I understand, Jeff Geerling has been on a similar journey with the RPI CM4 and has had limited success with help from numerous developers. Furthermore, there was a Radxa tweet which gave a teasing glimpse of a working GPU. So let's see what is or isn't possible using a Rock-5b.



I'd managed to get hold of a Radeon R7 520 (XFX R7 250 low-profile) card along with an M.2 Key M Extender Cable to PCIE x16 Graphics Card Riser Adapter. To power the card I reused an old LR1007 120W 12VDC ATX board which was to hand. In the setup shown below, we reuse the NVMe slot for the M.2 adapter and revert to an SD card for booting the OS. I used the Radxa Debian image with a custom compiled Radxa kernel to include the graphics card drivers and fixes. Having reviewed the PCIe BAR definitions in rk3588.dtsi, there should be enough address space available for the card to use. After removing the HDMI and Mali drivers from the kernel config, I initially tried the amdgpu driver, but it reports an error and gives no display output:

[   11.844163] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[   11.844378] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v6_0> failed -110
[   11.844383] amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[   11.844388] amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[   11.844414] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
[   11.846559] [drm] amdgpu: ttm finalized
[   11.848018] amdgpu: probe of 0000:01:00.0 failed with error -110

The radeon driver fared slightly better: it hits a similar error, but at least provides display output for a console login:

[   12.059398] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x850C)=0xCAFEDEAD)
[   12.059408] radeon 0000:01:00.0: disabling GPU acceleration

This was puzzling, as the card relies on PCIe memory-mapped I/O, which the RK3588 should see as standard memory and be able to read from and write to. It turns out that Peter Geis, who was attempting to mainline a PCIe driver for the RK3566, raised 2 issues in this thread, which Rockchip replied to. The same issues weren't improved or fixed on the RK3588, as mentioned here. In simple terms, for our requirements:

1. For PCIe DMA transfers, memory allocations are limited to 32-bit addresses, so a 4GB board might not see an issue, while on an 8GB board like mine the kernel could pick an address range above 4GB.

2. AMD cards rely on PCIe snooping, but there is no CPU snooping on the RK3588 interconnect, so any cached copies of the same device memory block won't be updated to remain in sync.
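The first issue can be pictured as a simple range check: a buffer is only usable for DMA if it lies entirely below the 4 GiB boundary. The sketch below is purely illustrative; the helper name and the example addresses are hypothetical and do not reflect the actual RK3588 memory map or any kernel API.

```python
DMA_LIMIT = 1 << 32  # first address that no longer fits in 32 bits (4 GiB)


def dma_reachable(phys_addr: int, size: int) -> bool:
    """True if the whole buffer [phys_addr, phys_addr + size) lies below 4 GiB."""
    return phys_addr >= 0 and size > 0 and phys_addr + size <= DMA_LIMIT


# Hypothetical buffer addresses, chosen only to illustrate the boundary.
examples = {
    "low buffer": (0x8000_0000, 0x10_0000),
    "straddles 4 GiB": (0xFFFF_F000, 0x2000),
    "high buffer": (0x1_4000_0000, 0x10_0000),
}

for name, (addr, size) in examples.items():
    verdict = "ok" if dma_reachable(addr, size) else "not reachable with 32-bit DMA"
    print(f"{name}: {verdict}")
```

On a board with more than 4GB of RAM, nothing stops an allocation from landing in the "high buffer" case unless the driver constrains where its buffers come from.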

If we hack the Radeon driver to work around these issues we get:

[   12.529087] [drm] ring test on 0 succeeded in 1 usecs
[   12.529094] [drm] ring test on 1 succeeded in 1 usecs
[   12.529102] [drm] ring test on 2 succeeded in 1 usecs
[   12.529121] [drm] ring test on 3 succeeded in 8 usecs
[   12.529132] [drm] ring test on 4 succeeded in 3 usecs
[   12.706419] [drm] ring test on 5 succeeded in 2 usecs
[   12.706427] [drm] UVD initialized successfully.
[   12.816582] [drm] ring test on 6 succeeded in 18 usecs
[   12.816625] [drm] ring test on 7 succeeded in 5 usecs
[   12.816627] [drm] VCE initialized successfully.
[   12.816879] [drm:si_irq_set [radeon]] si_irq_set: sw int gfx
[   12.816921] [drm] ib test on ring 0 succeeded in 0 usecs
[   12.816989] [drm:si_irq_set [radeon]] si_irq_set: sw int cp1
[   12.817028] [drm] ib test on ring 1 succeeded in 0 usecs
[   12.817088] [drm:si_irq_set [radeon]] si_irq_set: sw int cp2
[   12.817127] [drm] ib test on ring 2 succeeded in 0 usecs
[   12.817185] [drm:si_irq_set [radeon]] si_irq_set: sw int dma
[   12.817224] [drm] ib test on ring 3 succeeded in 0 usecs
[   12.817281] [drm:si_irq_set [radeon]] si_irq_set: sw int dma1
[   12.817319] [drm] ib test on ring 4 succeeded in 0 usecs
[   13.477677] [drm] ib test on ring 5 succeeded
[   13.984454] [drm] ib test on ring 6 succeeded
[   14.491404] [drm] ib test on ring 7 succeeded

[   14.549296] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 1

So potentially we have graphics acceleration ... let's try kmstest:

rock@rock-5b:~$ kmstest
trying to open device 'i915'...failed
trying to open device 'amdgpu'...failed
trying to open device 'radeon'...done
main: All ok!

Next (fingers crossed) kmscube

rock@rock-5b:~$ kmscube
Using display 0x55b67f0020 with EGL version 1.5
EGL information:
  version: "1.5"
  vendor: "Mesa Project"
  client extensions: "EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses EGL_EXT_client_extensions EGL_KHR_debug EGL_EXT_platform_device EGL_EXT_platform_wayland EGL_KHR_platform_wayland EGL_EXT_platform_x11 EGL_KHR_platform_x11 EGL_MESA_platform_gbm EGL_KHR_platform_gbm EGL_MESA_platform_surfaceless"
  display extensions: "EGL_ANDROID_blob_cache EGL_EXT_buffer_age EGL_EXT_create_context_robustness EGL_EXT_image_dma_buf_import EGL_EXT_image_dma_buf_import_modifiers EGL_KHR_cl_event2 EGL_KHR_config_attribs EGL_KHR_create_context EGL_KHR_create_context_no_error EGL_KHR_fence_sync EGL_KHR_get_all_proc_addresses EGL_KHR_gl_colorspace EGL_KHR_gl_renderbuffer_image EGL_KHR_gl_texture_2D_image EGL_KHR_gl_texture_3D_image EGL_KHR_gl_texture_cubemap_image EGL_KHR_image EGL_KHR_image_base EGL_KHR_image_pixmap EGL_KHR_no_config_context EGL_KHR_reusable_sync EGL_KHR_surfaceless_context EGL_EXT_pixel_format_float EGL_KHR_wait_sync EGL_MESA_configless_context EGL_MESA_drm_image EGL_MESA_image_dma_buf_export EGL_MESA_query_driver EGL_WL_bind_wayland_display "
OpenGL ES 2.x information:
  version: "OpenGL ES 3.2 Mesa 20.3.5"
  shading language version: "OpenGL ES GLSL ES 3.20"
  vendor: "AMD"
  renderer: "AMD VERDE (DRM 2.50.0, 5.10.110-99-rockchip-g6e21553c2116, LLVM 11.0.1)"
  extensions: "GL_EXT_blend_minmax GL_EXT_multi_draw_arrays GL_EXT_texture_filter_anisotropic GL_EXT_texture_compression_s3tc GL_EXT_texture_compression_dxt1 GL_EXT_texture_compression_rgtc GL_EXT_texture_format_BGRA8888 GL_OES_compressed_ETC1_RGB8_texture GL_OES_depth24 GL_OES_element_index_uint GL_OES_fbo_render_mipmap GL_OES_mapbuffer GL_OES_rgb8_rgba8 GL_OES_standard_derivatives GL_OES_stencil8 GL_OES_texture_3D GL_OES_texture_float GL_OES_texture_float_linear GL_OES_texture_half_float GL_OES_texture_half_float_linear GL_OES_texture_npot GL_OES_vertex_half_float GL_EXT_draw_instanced GL_EXT_texture_sRGB_decode GL_OES_EGL_image GL_OES_depth_texture GL_AMD_performance_monitor GL_OES_packed_depth_stencil GL_EXT_texture_type_2_10_10_10_REV GL_NV_conditional_render GL_OES_get_program_binary GL_APPLE_texture_max_level GL_EXT_discard_framebuffer GL_EXT_read_format_bgra GL_EXT_frag_depth GL_NV_fbo_color_attachments GL_OES_EGL_image_external GL_OES_EGL_sync GL_OES_vertex_array_object GL_OES_viewport_array GL_ANGLE_pack_reverse_row_order GL_ANGLE_texture_compression_dxt3 GL_ANGLE_texture_compression_dxt5 GL_EXT_occlusion_query_boolean GL_EXT_robustness GL_EXT_texture_rg GL_EXT_unpack_subimage GL_NV_draw_buffers GL_NV_read_buffer GL_NV_read_depth GL_NV_read_depth_stencil GL_NV_read_stencil GL_EXT_draw_buffers GL_EXT_map_buffer_range GL_KHR_debug GL_KHR_robustness GL_KHR_texture_compression_astc_ldr GL_NV_pixel_buffer_object GL_OES_depth_texture_cube_map GL_OES_required_internalformat GL_OES_surfaceless_context GL_EXT_color_buffer_float GL_EXT_sRGB_write_control GL_EXT_separate_shader_objects GL_EXT_shader_group_vote GL_EXT_shader_implicit_conversions GL_EXT_shader_integer_mix GL_EXT_tessellation_point_size GL_EXT_tessellation_shader GL_ANDROID_extension_pack_es31a GL_EXT_base_instance GL_EXT_compressed_ETC1_RGB8_sub_texture GL_EXT_copy_image GL_EXT_draw_buffers_indexed GL_EXT_draw_elements_base_vertex GL_EXT_gpu_shader5 GL_EXT_polygon_offset_clamp 
GL_EXT_primitive_bounding_box GL_EXT_render_snorm GL_EXT_shader_io_blocks GL_EXT_texture_border_clamp GL_EXT_texture_buffer GL_EXT_texture_cube_map_array GL_EXT_texture_norm16 GL_EXT_texture_view GL_KHR_blend_equation_advanced GL_KHR_context_flush_control GL_KHR_robust_buffer_access_behavior GL_NV_image_formats GL_OES_copy_image GL_OES_draw_buffers_indexed GL_OES_draw_elements_base_vertex GL_OES_gpu_shader5 GL_OES_primitive_bounding_box GL_OES_sample_shading GL_OES_sample_variables GL_OES_shader_io_blocks GL_OES_shader_multisample_interpolation GL_OES_tessellation_point_size GL_OES_tessellation_shader GL_OES_texture_border_clamp GL_OES_texture_buffer GL_OES_texture_cube_map_array GL_OES_texture_stencil8 GL_OES_texture_storage_multisample_2d_array GL_OES_texture_view GL_EXT_blend_func_extended GL_EXT_buffer_storage GL_EXT_float_blend GL_EXT_geometry_point_size GL_EXT_geometry_shader GL_EXT_shader_samples_identical GL_KHR_no_error GL_KHR_texture_compression_astc_sliced_3d GL_OES_EGL_image_external_essl3 GL_OES_geometry_point_size GL_OES_geometry_shader GL_OES_shader_image_atomic GL_EXT_clip_cull_distance GL_EXT_disjoint_timer_query GL_EXT_texture_compression_s3tc_srgb GL_EXT_window_rectangles GL_MESA_shader_integer_functions GL_EXT_clip_control GL_EXT_color_buffer_half_float GL_EXT_memory_object GL_EXT_memory_object_fd GL_EXT_texture_compression_bptc GL_KHR_parallel_shader_compile GL_NV_alpha_to_coverage_dither_control GL_EXT_EGL_image_storage GL_EXT_texture_sRGB_R8 GL_EXT_texture_shadow_lod GL_INTEL_blackhole_render GL_MESA_framebuffer_flip_y GL_EXT_depth_clamp GL_EXT_texture_query_lod "
Using modifier ffffffffffffff
Modifiers failed!
Bus error

The 'bus error' indicates a memory alignment issue and turns out to be a bit of a rabbit hole. To fix the Radeon kernel driver we ensure the card's memory is mapped as 'Device memory' of type Device-nGnRnE, which does not allow unaligned access; if it were 'Normal memory', unaligned access would be allowed. This implies fixing up userspace drivers and applications as these errors are encountered, because they can directly manipulate the card's memory. This particular bus error was caused by a memcpy in the radeon gallium driver; with a fix applied there, kmscube runs as shown in the video:

Using modifier ffffffffffffff
Modifiers failed!
Using modifier ffffffffffffff
Modifiers failed!
Rendered 120 frames in 2.000246 sec (59.992635 fps)
Rendered 240 frames in 4.000428 sec (59.993577 fps)
Rendered 361 frames in 6.016865 sec (59.998019 fps)
Rendered 481 frames in 8.017015 sec (59.997390 fps)
Rendered 601 frames in 10.017050 sec (59.997704 fps)
Rendered 721 frames in 12.017079 sec (59.997942 fps)
Rendered 841 frames in 14.017118 sec (59.998067 fps)
Rendered 961 frames in 16.017314 sec (59.997574 fps)
Rendered 1082 frames in 18.033850 sec (59.998280 fps)

Similar fixes were applied to glmark2-drm & glmark2-es2-drm so they run successfully (1680x1050 resolution), although the terrain scene displayed a bunch of colored bars on the screen.

    glmark2 2021.12
    OpenGL Information
    GL_VENDOR:      AMD
    GL_RENDERER:    AMD VERDE (DRM 2.50.0, 5.10.110-99-rockchip-g6e21553c2116, LLVM 11.0.1)
    GL_VERSION:     4.5 (Compatibility Profile) Mesa 20.3.5
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   1680x1050 fullscreen
[build] use-vbo=false: FPS: 939 FrameTime: 1.066 ms
[build] use-vbo=true: FPS: 2411 FrameTime: 0.415 ms
[texture] texture-filter=nearest: FPS: 1957 FrameTime: 0.511 ms
[texture] texture-filter=linear: FPS: 1958 FrameTime: 0.511 ms
[texture] texture-filter=mipmap: FPS: 2003 FrameTime: 0.499 ms
[shading] shading=gouraud: FPS: 1975 FrameTime: 0.506 ms
[shading] shading=blinn-phong-inf: FPS: 1973 FrameTime: 0.507 ms
[shading] shading=phong: FPS: 1976 FrameTime: 0.506 ms
[shading] shading=cel: FPS: 1974 FrameTime: 0.507 ms
[bump] bump-render=high-poly: FPS: 1739 FrameTime: 0.575 ms
[bump] bump-render=normals: FPS: 2373 FrameTime: 0.422 ms
[bump] bump-render=height: FPS: 2330 FrameTime: 0.429 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1254 FrameTime: 0.798 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 707 FrameTime: 1.415 ms
[pulsar] light=false:quads=5:texture=false: FPS: 1338 FrameTime: 0.747 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 456 FrameTime: 2.194 ms
[desktop] effect=shadow:windows=4: FPS: 600 FrameTime: 1.667 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 214 FrameTime: 4.684 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 233 FrameTime: 4.306 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 347 FrameTime: 2.885 ms
[ideas] speed=duration: FPS: 1430 FrameTime: 0.700 ms
[jellyfish] <default>: FPS: 806 FrameTime: 1.242 ms
[terrain] <default>: FPS: 150 FrameTime: 6.706 ms
[shadow] <default>: FPS: 843 FrameTime: 1.188 ms
[refract] <default>: FPS: 115 FrameTime: 8.718 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 1970 FrameTime: 0.508 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 1980 FrameTime: 0.505 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 1972 FrameTime: 0.507 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 1979 FrameTime: 0.505 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 1971 FrameTime: 0.507 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 1972 FrameTime: 0.507 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 1972 FrameTime: 0.507 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 1968 FrameTime: 0.508 ms
                                  glmark2 Score: 1450

Next up was to see if startx would run. Unfortunately it drops out with a shader compiler error. It looks like glamor is using EGL but encounters an OpenGL shader to compile; this requires further investigation.

[  7916.924] (II) modeset(0): Modeline "360x202"x119.0   11.25  360 372 404 448  202 204 206 211 doublescan -hsync +vsync (25.1 kHz d)
[  7916.924] (II) modeset(0): Modeline "360x202"x118.3   10.88  360 384 400 440  202 204 206 209 doublescan +hsync -vsync (24.7 kHz d)
[  7916.924] (II) modeset(0): Modeline "320x180"x119.7    9.00  320 332 360 400  180 181 184 188 doublescan -hsync +vsync (22.5 kHz d)
[  7916.924] (II) modeset(0): Modeline "320x180"x118.6    8.88  320 344 360 400  180 181 184 187 doublescan +hsync -vsync (22.2 kHz d)
[  7916.925] (II) modeset(0): Output DVI-D-1 status changed to disconnected.
[  7916.925] (II) modeset(0): EDID for output DVI-D-1
[  7916.939] (II) modeset(0): Output VGA-1 status changed to disconnected.
[  7916.939] (II) modeset(0): EDID for output VGA-1
[  7916.939] (II) modeset(0): Output HDMI-1 connected
[  7916.939] (II) modeset(0): Output DVI-D-1 disconnected
[  7916.939] (II) modeset(0): Output VGA-1 disconnected
[  7916.939] (II) modeset(0): Using exact sizes for initial modes
[  7916.939] (II) modeset(0): Output HDMI-1 using initial mode 1680x1050 +0+0
[  7916.939] (==) modeset(0): Using gamma correction (1.0, 1.0, 1.0)
[  7916.939] (==) modeset(0): DPI set to (96, 96)
[  7916.939] (II) Loading sub module "fb"
[  7916.939] (II) LoadModule: "fb"
[  7916.940] (II) Loading /usr/lib/xorg/modules/
[  7916.944] (II) Module fb: vendor="X.Org Foundation"
[  7916.944]    compiled for 1.20.11, module version = 1.0.0
[  7916.944]    ABI class: X.Org ANSI C Emulation, version 0.4
[  7916.964] Failed to compile VS: 0:1(1): error: syntax error, unexpected NEW_IDENTIFIER

[  7916.964] Program source:
precision highp float;
attribute vec4 v_position;
attribute vec4 v_texcoord;
varying vec2 source_texture;

void main()
    gl_Position = v_position;
    source_texture = v_texcoord.xy;
[  7916.964] (EE)
Fatal server error:
[  7916.964] (EE) GLSL compile failure
[  7916.964] (EE)

Lastly I installed vaapi to attempt video playback. Unfortunately, even after fixing a couple of bus errors in gallium, there's more to fix. So this pretty much sums up the nature of the problem to address. Furthermore, it raises the question of whether the tweet from Radxa shows accelerated graphics, given the hardware restrictions of the RK3588.

Sunday 15 January 2023

RK3588 - Decoding & rendering 16 1080p streams



I'm currently working on a video application for the RK3588, given it is one of the few processors on the market that currently has native HDMI input support (up to 4K30). One of the first tasks has been trying to render video efficiently within a Wayland/Weston window (not full screen). I reverted to Wayland for video because, from my testing, X11 can result in tearing if video is not played full screen, as the graphics stack (ARM Mali) has no ability to vsync. The existing Rockchip SDK patches the gstreamer waylandsink plugin to provide video rendering support for Wayland. However, there are a number of challenges in getting waylandsink to render to a Weston window, as by default it resorts to full screen, resulting in a Weston application launching a secondary full screen window to display video within. Whilst trying to find a solution to this problem I came across a number of claims about the video decoder (part of the VPU):

Up to 32-channel 1080P@30fps decoding (FireFly ROC-RK3588-PC)

x32 1080P@60fps channels (H.265/VP9) (Khadas Edge 2)

Up to 32 channels 1080P@30fps decoding (PEPPER JOBS X3588)

After reviewing the RK3588 datasheet and TRM I can't find a mention of this capability by Rockchip, so I assume this is a derived figure based on this statement in the datasheet: "Multi-channel decoder in parallel for less resolution". From the datasheet, the maximum H264 decode resolution is 8K@30 and for H265 it is 8K@60. Theoretically that would mean 16 channels for H264 at 1080p@30 and possibly 32 for H265 if each stream is 1080p@30.
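The 16 and 32 channel figures fall out of a pixel-rate comparison. The sketch below assumes "8K" means 7680x4320 and that decoder capacity scales linearly with pixels per second, which is a simplification; the function names are made up for illustration.

```python
def pixel_rate(width: int, height: int, fps: int) -> int:
    """Pixels decoded per second for a stream of the given size and frame rate."""
    return width * height * fps


def max_streams(decoder_rate: int, stream_rate: int) -> int:
    """How many streams of stream_rate fit into decoder_rate."""
    return decoder_rate // stream_rate


H264_MAX = pixel_rate(7680, 4320, 30)  # 8K@30, assumed resolution
H265_MAX = pixel_rate(7680, 4320, 60)  # 8K@60, assumed resolution
FHD_30 = pixel_rate(1920, 1080, 30)
FHD_60 = pixel_rate(1920, 1080, 60)

print("H264 1080p@30 streams:", max_streams(H264_MAX, FHD_30))
print("H265 1080p@30 streams:", max_streams(H265_MAX, FHD_30))
print("H265 1080p@60 streams:", max_streams(H265_MAX, FHD_60))
```

By this estimate, 32 channels is only plausible for H265 at 30fps; at 60fps the same budget halves to 16.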

So the challenge turned out to be: could I decode 16 1080p streams and render each within its own window on a 1080p@60 display? As you can tell from the above video, it is possible. This is a custom Weston application running on a Rock 5B board. Each video is read and decoded from a separate file (there is a mixture of trailers, open videos and an fps test video) and then rendered. Initially I tried resizing each video using RGA3 (Raster Graphics Acceleration), however this turned out to be non-performant, as RGA doesn't seem to cope well with more than a few videos. It turns out the only way to render is to use AFBC (Arm Frame Buffer Compression). For this test there are 14 H264 streams (a mixture of 30 & 60 fps) and 2 H265 60fps streams.

Friday 26 August 2022

Inside another fake ELM327 adaptor (filled with Air)

I'd ordered a couple of ELM327 compatible adapters from Aliexpress expecting that these would be similar to the item in the image below. Normally these contain a PCB that fits the enclosure, populated with an unknown MCU (covered with epoxy), a Bluetooth chip, a CAN transceiver and the necessary circuitry to support a K-Line interface.

After dissecting the received adapters here is what we have: 80% air and a small PCB.

Pictures of the small PCB reveal a single 16-pin SOP package (and a 24 MHz crystal) with the chip marking etched out, and no BLE chip or CAN transceiver present 😒. Is this one chip doing all the work?

From a software point of view the device reports itself as ELM V2.1 and I managed to retrieve the firmware version as TDA99 V0.34.0628C (not sure what it means though). The firmware is extremely buggy and feature wise incomplete for ELM V2.1.

The intriguing question was: could a 16 pin chip replace a number of discrete components? After days of research it turns out the chip seems to be a repurposed Bluetooth audio/toy chip (possibly from ZhuHai Jieli Technology). The same unmarked chip seems to be present on the Thinmi ELM327C, with the chip referred to as QBD255. I can't locate any information on the QBD255. Worse still, the CAN implementation seems to be written completely in software (hence no CAN transceiver) and is therefore prone to timing errors and limited data rates. Furthermore, this chip must have limited memory/flash, hence the incomplete implementation of ELM features.

Buyer beware!

I suspect this chip may be the Jieli AC6329F or AC6329C but need to prove it somehow?

Update 28-08-2022: 

There seems to be another chipset  floating around from YMIOT, described as "ELM327 V2.1 Bluetooth universal diagnostic adapter with 16-pin YM1130 1343E38 chip"

History of this chipset is below:

2017 - YM1120 (131G76)
2018 - YM1122 (1218F57) & YM1121
2019 - YM1130 (1343E38)

Thursday 29 April 2021

Reverse engineering the V831 NPU (Neural Processor Unit)

I took up the challenge posted on the sipeed twitter feed:

"We are reversing V831's NPU register, and make opensource AI toolchian based on NCNN~ If you are interested in making opensource AI toolchain and familiar with NCNN, please contact support at, we will send free sample board for you to debug"

Sipeed were kind enough to send me one of the initial prototype boards of the MAXI-II. To give you a brief introduction, the V831 is a camera SoC targeting video encoding applications (CCTV, body cams, etc.). It comprises a Cortex A7 processor combined with 64MB of embedded RAM. For those interested, full details of the V831's capabilities can be found in the datasheet.

The datasheet is sparse on information about the NPU :

  • V831: Maximum performance up to 0.2Tops
  • Supports Conv, Activation, Pooling, BN, LRN, FC/Inner Product

In addition, the datasheet briefly mentions two registers relating to the NPU, one for enabling/resetting the NPU and the other for setting the clock source. There is no mention of how it can be programmed to perform the operations specified in the datasheet.

Fortunately the registers listed in the sipeed twitter post provided a first clue, and after many months of trial and error, endless deciphering of data dumps, a few dead ends and numerous reverse engineering attempts, parts of the NPU operations have been decoded. Fundamentally, a large portion of the NPU is a customised implementation of the Nvidia Deep Learning Accelerator (NVDLA) architecture. More details about the project can be found on the NVDLA site, and here is a quote of its aims:

The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators.

What I have determined so far about the NPU is:

1. The NPU clock can be set between 100-1200 MHz with the code defaulting to 400 MHz. My hunch is that this may tie to the clock speed of the onboard DDR2 memory.

2. NPU is implemented with nv_small configuration (NV Small Model) and relies on system memory for all data operations. Importantly CPU and NPU are sharing the memory bus.

3. It supports both int8 and int16, however I haven't verified whether FP16 is supported. Theoretically int8 should be twice as fast as int16 while also saving memory, given the V831's limited onboard memory (64MB).

4. Number of MACs is 64 (Atomic-C * Atomic-K)

5. NPU registers are memory mapped and therefore can be programmed from userspace which proved to be extremely useful for initial debugging & testing.

6. NPU requires physical address locations when referencing weights & input/output data locations therefore kernel memory needs to be allocated and the physical addresses retrieved if accessed from userspace.

7. NPU weights and input/output data follow a similar layout to the NVDLA private formats. Therefore well-known formats like nhwc or nchw require transformation before they can be fed to the NPU.
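Combining the MAC count and clock range from the list above with the common rule of thumb (operations per second = MACs x frequency x 2) gives a rough idea of peak throughput. This is back-of-the-envelope arithmetic, not a measurement; it assumes every MAC unit completes one int8 multiply-accumulate per cycle, and the function name is made up for illustration.

```python
def peak_gops(macs: int, clock_mhz: float) -> float:
    """Theoretical peak in GOPS; each MAC counts as two operations."""
    return macs * clock_mhz * 1e6 * 2 / 1e9


MACS = 64  # Atomic-C * Atomic-K, from the list above

for clock_mhz in (100, 400, 1200):
    print(f"{clock_mhz} MHz: {peak_gops(MACS, clock_mhz):.1f} GOPS")
```

On these assumptions the default 400 MHz clock gives about 51 GOPS and the 1200 MHz maximum about 154 GOPS, somewhat below the 0.2 TOPS quoted in the datasheet; this sketch cannot say how the datasheet figure was derived.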

Initial code from my endeavours is located in this repo, v831-npu, and should be treated as a work in progress. Hopefully this forms the basis of a fully open source implementation. The tests directory has code from my initial attempts to interact with the hardware and is redundant. However, it can be used as an initial introduction to how the hardware units work and what configuration is required. So far I have decoded the CONV, SDP and PDP units, which allow for the following operations (tested with the int8 data type):

1. Direct Convolutions

2. Bias addition

3. Relu/Prelu

4. Element wise operations

5. Max/Average pooling

To verify most of the above I ported across the cifar10 example (see the examples directory) from ARM's CMSIS_5 NN library. Furthermore, I have managed to remove all dependencies on closed AllWinner libraries, partially by implementing a simple ION memory allocation utility. Instructions to build cifar10 for deployment on the MAXI-II are below (assuming you are using a Linux machine):

1. Clone the SDK toolchain git repo from here. We are still dependent on the SDK toolchain as the MAXI-II kernel/rootfs is built with this toolchain.

2. Export PATH to include  'lindenis-v536-prebuilt/gcc/linux-x86/arm/toolchain-sunxi-musl/toolchain/bin' so that arm-openwrt-linux-gcc can be found.

3. Run 'make'

4. Copy the built executable 'nna_cifar10' to the MAXI-II

5. Run './nna_cifar10', output should be as below given the input image was a boat:

There is still quite a bit of work left to be done such as :

1. Weight and input/output data conversion utility

2. Verifying the pixel input formats, which the NPU should support.

3. Decoding remaining hardware units

4. Possibly integrating with an existing AI framework or writing a compiler.

By the way the new Beagle V is also spec'd to implement NVDLA with a larger MAC size of 1024.

I would like to thank sipeed for providing the hardware/software.

I would like to thank for sponsoring the development time for this work.

Sunday 8 March 2020

ESP32 impersonates a Particle Xenon

Following the announcement that Particle will no longer manufacture the Xenon development board and will drop their OpenThread based mesh networking solution, we decided to see if we could impersonate an existing claimed Xenon (ie one that is already registered on the cloud) on alternative hardware. Hence the idea of 'bring your own device' to connect to the cloud.

After reviewing the device-os source code for a few months, it turned out that to get a proof of concept working I needed to implement, at minimum, the following:

1. Port across the DTLS protocol layer, as it turns out the Gen 3 devices create a secure UDP socket connection over DTLS.
2. Extract the device's private key and the cloud public key (no certificates are stored). Particle's implementation of the DTLS handshake relies purely on Raw Public Key support (RFC 7250).
3. Implement a CoAP layer, as the 'Spark protocol' is built on top of this.
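
As a rough illustration of the third item, the sketch below packs and parses the fixed 4-byte CoAP header defined in RFC 7252 (version, type, token length, code and message ID, followed by the token). It is generic CoAP only. The helper names are made up for this example, and it does not reproduce the Particle specific messages, options or payloads.

```python
import struct

COAP_VERSION = 1
TYPE_CON, TYPE_NON, TYPE_ACK, TYPE_RST = 0, 1, 2, 3


def make_code(cls, detail):
    """Build the code byte: 3-bit class and 5-bit detail (0.01 is GET)."""
    return (cls << 5) | detail


def pack_header(msg_type, code, message_id, token=b""):
    """Return the 4-byte CoAP header followed by the token."""
    if len(token) > 8:
        raise ValueError("CoAP tokens are at most 8 bytes")
    first = (COAP_VERSION << 6) | (msg_type << 4) | len(token)
    # Network byte order: two single bytes then a 16-bit message ID.
    return struct.pack("!BBH", first, code, message_id) + token


def parse_header(packet):
    """Split a packet into its header fields and token."""
    first, code, message_id = struct.unpack("!BBH", packet[:4])
    token_length = first & 0x0F
    return {
        "version": first >> 6,
        "type": (first >> 4) & 0x03,
        "code": (code >> 5, code & 0x1F),
        "message_id": message_id,
        "token": packet[4:4 + token_length],
    }
```

Options and payload follow the token in a real message and are left out here to keep the sketch short.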

The above was implemented as a set of library functions using ESP-IDF, and I reused the ESP32 (LILYGO TTGO) from the previous post, which fortunately hosts a 128x64 OLED display. In the video we demonstrate the following:

1. Connects to a wifi access point.
2. Retrieves time from a SNTP server.
3. Connects to the Particle Cloud via a DTLS handshake.
4. Sends a number of 'Spark protocol' messages to let the cloud know the Xenon is alive.
5. Awaits commands from the Cloud, including ping and signal operations. When receiving the signal command the screen scrolls the text from left to right.

I would like to thank for sponsoring the hardware and development time for this article.

Saturday 30 November 2019

Particle Xenon - Adding WIFI support with an ESP32

The preferred option for WIFI support with Gen 3 devices is to deploy a Particle Argon. The Argon consists of a Nordic nRF52840 paired with an Espressif ESP32. The ESP32 simply provides the WIFI interface and is running a customised version of the ESP-AT firmware (argon-ncp-firmware). The nRF52840 communicates with the ESP32 through one of its serial ports using four pins: TX, RX, CTS and RTS. The challenge here was to see if we could enable WIFI support on a Particle Xenon by connecting it to an ESP32 running the argon-ncp-firmware. As demonstrated in the video it was possible, although it required jumping through a number of hoops.

Unfortunately the only spare ESP32 board I had was a LILYGO TTGO, which is a 16M board with an OLED display. So the first task was porting the argon-ncp-firmware and re-factoring the pin mappings to support this board. Once this was complete it was fairly easy to validate that the firmware was functioning by simply executing the AT commands the Argon issues to establish WIFI connectivity.
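
For readers unfamiliar with AT style firmware, the following Python sketch builds the kind of command strings involved and checks a reply for the final OK. It uses the stock ESP-AT commands AT, AT+CWMODE and AT+CWJAP as an assumption. The exact sequence the Argon issues is not reproduced here, and the customised argon-ncp-firmware may use different or additional commands. The helper names are hypothetical.

```python
def at_escape(text):
    """Backslash-escape the characters ESP-AT treats as special in strings."""
    out = []
    for ch in text:
        if ch in '\\,"':
            out.append("\\")
        out.append(ch)
    return "".join(out)


def wifi_join_commands(ssid, password):
    """Return an assumed command sequence to join an access point."""
    return [
        "AT\r\n",                 # basic liveness check
        "AT+CWMODE=1\r\n",        # station mode
        'AT+CWJAP="{}","{}"\r\n'.format(at_escape(ssid), at_escape(password)),
    ]


def is_final_ok(response):
    """True if the last non-empty line of a reply is OK."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "OK"
```

Each string would be written to the serial port in turn, waiting for a reply that passes is_final_ok before sending the next one.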

For the Xenon the primary changes were to port across the Argon ESP32 networking code. This turned out to be more challenging than envisaged, primarily because the Xenon firmware isn't expecting a WIFI configuration and the command line tools don't support provisioning a WIFI connection for a Xenon. After 4 weeks of effort I finally had a working build of the Xenon firmware. It took another 2 weeks to get the Xenon provisioned with a WIFI configuration so it could connect to the Particle Cloud.

The main drawback of this approach is that the firmware on both the Xenon and the ESP32 is customised, therefore any updates from the Cloud would override the changes. Hence a customised rebuild is required when new firmware is released.

I would like to thank for sponsoring the hardware and development time for this article.