Having previously reverse engineered the V831 NPU, let's now examine the RK3588 NPU. While the RK3588 NPU ("RKNN") is advertised at 6 TOPS@int8, it is not entirely clear what this figure represents, since the processing unit is actually a tri-core NPU. Referring to the Technical Reference Manual (TRM), we can gather further information:
- 1024 int8 MAC operations per cycle per core (3 cores)
- NPU clock of 1 GHz

Therefore, based on the standard TOPS formula:

TOPS = MACs x Frequency x 2
     = (1024 x 3) x 1 GHz x 2
     = 6.144 TOPS
If all three cores (1024x3) are utilized, the total computational power reaches 6 TOPS. The RKNN framework offers various workload configurations: tri-core, dual-core and single-core. However, upon reviewing the RKNN documentation, it appears that of the 43 operators, only around 10 support tri-core or dual-core execution (as of v1.5.0 of the RKNPU SDK):
Conv, DepthwiseConvolution, Add, Concat, Relu, Clip, Relu6, ThresholdedRelu, Prelu, LeakyRelu
Deploying a single RKNN model in tri-core mode can in principle reach the full 6 TOPS, but this relies on the model containing operators that support tri-core execution, or on the model compiler identifying parallelizable operations. Consequently, utilization of the full 6 TOPS may be limited to specific scenarios. Given this constraint, an alternative approach is to run three instances of the model, with each instance pinned to one core. Although this approach increases memory usage, it may provide better overall throughput. For instance, when running rknn_benchmark against yolov5s-640-640.rknn for 1000 iterations with a core mask of 7 (tri-core), the results observed are (v1.5.0 SDK):
Running 3 separate instances of rknn_benchmark for the same model with core masks 1, 2 & 4 (single core), the average per instance is:
Avg Time = 18.84ms, Avg FPS = 53.084
These initial benchmark results suggest a potential improvement with this approach, as running three object detection streams in parallel could yield better overall throughput; it also opens up the possibility of multi-stream object detection. However, it is crucial to acknowledge that the frames per second (FPS) figures reported by the benchmark are quite optimistic, primarily because the test input is a static, pre-cropped RGB (640x640) image and the outputs are not sorted by confidence. In a real-world deployment, additional pre- and post-processing steps would be necessary and would affect the overall processing time.
To assess the feasibility of this approach, I developed a C++ application that performs several tasks concurrently: decoding an H264 stream, resizing and converting each frame to RGB (640x640), and running the yolov5 model on each frame for object detection, all whilst rendering the video. Video playback occurs independently of the rectangles generated by yolov5, which are drawn through an overlay. The primary challenge during development was optimizing how often frames are converted and resized for both inference and rendering. This was crucial to keep the rectangles output by yolov5 synchronized with the video frame they were computed from; otherwise, fast-moving objects in the video stream are noticeably out of sync with their detected rectangles. The main argument passed to the application is the core mask, which selects the NPU core(s) to utilize.
As shown in the showcase video above, by running three instances of the application, each assigned a single NPU core, we were able to achieve sufficient performance to keep up (well, almost in the case of a 60fps stream) with the video playback rate. The application was tested on the following boards running under weston:
- Mekotronics R58 Mini HDD
- Radxa Rock 5-b
Detection rectangles are colour-coded as follows:
- Red: person
- Green: car, truck, bus, bicycle
- Blue: anything else
Benchmarks from concurrently running 3 instances show an average per instance of:
Avg Time = 25.20ms Avg FPS = 38.49
Compared to a single instance running with the NPU in tri-core mode:
Avg Time = 15.92ms Avg FPS = 61.42
Based on my testing, it is possible to run object detection on 3 video streams (1080p@30), assuming the inference time of your model on a single NPU core is less than 25ms. This work was done as part of a suite of video applications that I'm developing for the RK3588.
CPU usage while running the 3 instances:
Tasks: 236 total, 2 running, 234 sleeping, 0 stopped, 0 zombie
%Cpu(s): 10.6 us, 3.3 sy, 0.0 ni, 84.5 id, 1.3 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 7691.7 total, 6531.9 free, 558.4 used, 601.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 6942.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1422 rock 1 -19 1153180 128604 89000 S 32.1 1.6 0:18.35 subsurf+
1439 rock 1 -19 1158676 128880 89480 S 31.8 1.6 0:12.97 subsurf+
1404 rock 1 -19 1161504 132664 89456 S 28.1 1.7 0:24.37 subsurf+
1000 rock 20 0 705784 99024 76756 S 21.2 1.3 0:56.09 weston
363 root 20 0 94212 48112 47004 R 4.6 0.6 0:24.04 systemd+
212 root -51 0 0 0 0 S 4.3 0.0 0:10.57 irq/34-+
927 rock 20 0 16096 4608 3416 S 0.7 0.1 0:00.60 sshd
1100 root 20 0 0 0 0 I 0.7 0.0 0:01.05 kworker+
1395 root 0 -20 0 0 0 I 0.7 0.0 0:01.23 kworker+
1402 root 0 -20 0 0 0 I 0.7 0.0 0:00.76 kworker+
139 root 20 0 0 0 0 S 0.3 0.0 0:01.07 queue_w+
371 root 0 -20 0 0 0 I 0.3 0.0 0:00.16 kworker+
910 rock 20 0 16096 4600 3408 S 0.3 0.1 0:00.89 sshd
1329 root 0 -20 0 0 0 I 0.3 0.0 0:00.43 kworker+
1330 root 20 0 0 0 0 I 0.3 0.0 0:00.76 kworker+
1403 root 20 0 7124 3128 2364 R 0.3 0.0 0:00.60 top
1421 root 20 0 0 0 0 I 0.3 0.0 0:00.46 kworker+