7 March 2025
Today I find the WCH -F4P6 dev board has clocked over 35 billion loops without hanging up.
The STK system timer is available in all the CH32V devices. On the QingKe V2 devices, such as our -003 test subject, it is a 32 bit counter that counts up and can trigger an interrupt when it hits a particular value. This makes it very useful for basic timing tasks as well as providing periodic interrupts or a measure of uptime. The STK on the QingKe V4 devices is 64 bits long.
There’s not a lot needed to initialize the STK as there just aren’t that many options. One choice is whether to use the system clock directly as its clock source, or to divide it by eight. We’re only going to be using it to measure an approximately 50 microsecond pulse, and it doesn’t have to be excruciatingly precise. I’ll use the prescaled clock as the timing source.
The only other options are to have it trigger an interrupt or compare the current count to a value, neither of which I need at the moment. That leaves the clock source selection as the only configuration bit in the STK_CTLR control register that I need to deal with, other than the “STE” system timer enable control bit.
Time to add more enumerated values to my collection in my CH32V003.h header file:
# STK - System Timer
STK_STE = (1 << 0) # STK enable
STK_STIE = (1 << 1) # interrupt enable
STK_STCLK = (1 << 2) # clock source selection
STK_STRE = (1 << 3) # auto-reload counter enable
STK_SWIE = (1 << 31) # software interrupt trigger
# STK_STCLK values
STK_STCLK_HCLK_8 = (0 << 2) # clock source is HCLK / 8
STK_STCLK_HCLK = (1 << 2) # clock source is HCLK
The code to initialize the STK is pretty simple:
# initialize STK - clock = HCLK/8 = 6 MHz
la x3, STK_BASE
li x4, STK_STCLK_HCLK_8 | STK_STE
sw x4, STK_CTLR(x3)
Technically, we can omit the STK_STCLK_HCLK_8 parameter, as it is a zero, but I like to include it to make my intention clearer to Future Me.
The delay_us function takes the requested number of microseconds, passed in via function argument register a0, and multiplies it by six, as there are six STK timer clock cycles per microsecond. It then adds that time duration to the current time, as represented by the value in the STK_CNTL register.
The function then loops until the current timer count is no longer less than the calculated ‘future time’.
I also added a quick exit in the case of the caller asking for a zero microsecond delay. We’ll still be late getting back, but not as late as if we went ahead and preserved all the registers, etc.
Here is the code for the delay_us function:
delay_us: # delay in microseconds
# on entry: a0 delay time in microseconds
# on exit: none
# register usage:
# x3: pointer to STK_BASE
# x4: calculated end time
# x5: read timer count
STK_TICKS_PER_MICROSECOND = ((HCLK / 1000000) / 8)
beqz a0, 9f # exit on 0 microsecond request
addi sp, sp, -16 # allocate space on stack
sw ra, 12(sp) # preserve return address
sw x3, 8(sp) # preserve x3
sw x4, 4(sp) # preserve x4
sw x5, 0(sp) # preserve x5
la x3, STK_BASE
# calculate future end time
slli x4, a0, 1 # x4 = a0 * 2
slli x5, a0, 2 # x5 = a0 * 4
add x4, x4, x5 # x4 = x4 + x5
lw x5, STK_CNTL(x3) # read current timer count
add x4, x4, x5
1: lw x5, STK_CNTL(x3) # read system timer count
blt x5, x4, 1b # loop if x5 < x4, i.e., end time not yet reached
lw ra, 12(sp) # restore return address
lw x3, 8(sp) # restore x3
lw x4, 4(sp) # restore x4
lw x5, 0(sp) # restore x5
addi sp, sp, 16 # restore stack pointer
9: ret # return from function
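One thing to watch in the loop above is that blt is a signed comparison. Assuming the STK count free-runs and rolls over after 0xFFFFFFFF (auto-reload is not enabled in the initialization code), a calculated end time that crosses from 0x7FFFFFFF to 0x80000000 looks negative while the current count still looks positive, so the loop would exit at once and the delay would come up short. At 6 MHz the count passes that point roughly once every 716 seconds. Here is a sketch of a variant that measures elapsed ticks and compares them with the unsigned bltu instead. The name delay_us_elapsed is made up for illustration, and it uses a0 as a scratch register, so the argument is not preserved:

```asm
delay_us_elapsed:                       # hypothetical wrap-tolerant variant of delay_us
        beqz    a0, 9f                  # exit on 0 microsecond request
        addi    sp, sp, -16             # allocate space on stack
        sw      x3, 8(sp)               # preserve x3
        sw      x4, 4(sp)               # preserve x4
        sw      x5, 0(sp)               # preserve x5
        la      x3, STK_BASE
        lw      x5, STK_CNTL(x3)        # x5 = timer count at start
        slli    x4, a0, 1               # x4 = microseconds * 2
        slli    a0, a0, 2               # a0 = microseconds * 4
        add     x4, x4, a0              # x4 = microseconds * 6 = ticks to wait
1:      lw      a0, STK_CNTL(x3)        # a0 = current timer count
        sub     a0, a0, x5              # a0 = ticks elapsed since start, modulo 2^32
        bltu    a0, x4, 1b              # loop while elapsed < ticks to wait (unsigned)
        lw      x3, 8(sp)               # restore x3
        lw      x4, 4(sp)               # restore x4
        lw      x5, 0(sp)               # restore x5
        addi    sp, sp, 16              # restore stack pointer
9:      ret                             # return from function
```

Because the subtraction wraps the same way the counter does, the elapsed value stays correct across a rollover, as long as the requested delay is shorter than one full trip around the counter.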
And while I am providing a perfectly mathematical solution to the question of how many STK cycles or ‘ticks’ are in a single microsecond, via the STK_TICKS_PER_MICROSECOND symbol (the answer is six here), the QingKe V2 does not support the ‘mul’ (integer multiply) instruction.
If you put an integer multiply instruction in the code, the assembler assembles it, as assemblers do, but the chip throws an exception when it tries to execute it. But why does the assembler allow it to get that far down the chain?
It’s most likely because I just copy/pasted the makefile from another project and it specifically says that the architecture of the chip is “-march=rv32imac_zicsr”, which it most decidedly is not. Changing the “AS_OPTS” variable in the makefile to “-march=rv32ec_zicsr” fixes this, and the assembler throws the very correct error:
src/F4-WS2812B-SPI-asm.S:10: Error: unrecognized opcode `mul a0,a0,a0'
It now also catches my earlier error when I used the non-existent s2 register. These are powerful tools if you will just let them be so.
Even with no integer multiply instruction, it’s not too terribly difficult to multiply two integers together using shifts and adds. In fact, with a constant multiplier such as six, it’s just a matter of shifting the multiplicand to the left, one time using a single bit shift and then again using two bit shifts, then adding those two numbers together.
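For a multiplier that is not known ahead of time, the same idea can be turned into a loop. This is only a sketch: the name mul_shift_add and its register assignments are made up for illustration, it uses only registers that exist on RV32E, and it overwrites a1, a2 and a3:

```asm
mul_shift_add:                          # hypothetical helper: a0 = a0 * a1 (low 32 bits)
        mv      a2, a0                  # a2 = multiplicand
        li      a0, 0                   # a0 = running product
1:      beqz    a1, 9f                  # done when no multiplier bits remain
        andi    a3, a1, 1               # a3 = lowest bit of the multiplier
        beqz    a3, 2f                  # skip the add when that bit is clear
        add     a0, a0, a2              # product += multiplicand
2:      slli    a2, a2, 1               # multiplicand * 2 for the next bit position
        srli    a1, a1, 1               # move the next multiplier bit into position
        j       1b
9:      ret                             # return from function
```

Calling it with 50 in a0 and 6 in a1 should leave 300 in a0, the same tick count that the two shifts and an add produce for a 50 microsecond delay.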
Now we have a reasonably accurate delay function that does nothing but waste time for a reasonably accurate amount of time. We can use that to send the ~50 us reset signal to the WS2812B LEDs by sending out a 0x00 via the SPI and then just waiting it out. It makes for a pretty simple function:
ws2812b_reset: # send 'reset' signal to WS2812B LED
# on entry: none
# on exit: none
# register usage:
# x3: preserved, but not otherwise used
addi sp, sp, -16 # allocate space on stack
sw ra, 12(sp) # preserve return address
sw x3, 8(sp) # preserve x3
li a0, 0x00
call spi_send # set SDO low
li a0, 50
call delay_us # ~ 50 us low level
lw ra, 12(sp) # restore return address
lw x3, 8(sp) # restore x3
addi sp, sp, 16 # restore stack pointer
ret # return from function
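The delay_us function is technically a ‘leaf’ function as it does not ‘branch’ out to any other functions in the performance of its duties. So I could have skipped the ‘preservation’ of the return address register there and it would have worked perfectly. But I tend to leave it in as it’s fast and it’s better to have it and not need it than to need it and not have it. The ws2812b_reset function, on the other hand, calls both spi_send and delay_us, and each call overwrites the return address register, so here the preservation is required.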
I would really like to come up with a way to streamline the creation of these assembly language functions as they do contain a moderate quantity of boiler-plate code.
If you’ll recall, I had originally built up a hierarchy of function calls to send the right wave forms to the LEDs, but then de-optimized the code to eliminate perceived overhead. Well, that was in the C programming language, and it tends to encourage that sort of algebraic abstraction. At least, it encourages me to do so. Now we’re in the Wild West of bare-metal assembly language and everything comes at a price. So to keep the complexity of each function to a minimum, I’ll reinvent my cascade of function calls here.
The lowest level function sends out an encoded one or a zero. A zero has a shorter high period and a one has a longer high period. We are using the bit patterns 0x60 and 0x7E as zero and one, respectively. Here is the ws2812b_bit function:
ws2812b_bit: # send an encoded zero or one to the WS2812B LED via SPI
WS2812B_ZERO = 0x60
WS2812B_ONE = 0x7E
# on entry: a0[0] bit to transmit
# on exit: a0[7..0] bit pattern sent
# register usage:
# x3: function arguments
addi sp, sp, -16 # allocate space on stack
sw ra, 12(sp) # preserve return address
sw x3, 8(sp) # preserve x3
li x3, WS2812B_ZERO # assume it's a zero
beqz a0, 1f
li x3, WS2812B_ONE # well it wasn't
1: mv a0, x3
call spi_send
lw ra, 12(sp) # restore return address
lw x3, 8(sp) # restore x3
addi sp, sp, 16 # restore stack pointer
ret # return from function
I preload the x3 register with a bit pattern for a zero, WS2812B_ZERO or 0x60, assuming that it will be a zero. If it is a zero, it skips the next instruction, which loads the WS2812B_ONE code, or 0x7E. In either case, the contents of x3 are mv’d (moved) over to function argument a0 and the spi_send function is called.
Now that we can write a bit, let’s write a byte. It’s not too terribly difficult, but I think you’re starting to see why I wanted to split this medium-sized problem up into tiny-problem chunks. Tiny problems I can handle. Here’s the ws2812b_byte function:
ws2812b_byte: # send a byte's worth of encoded ones and zeros to the WS2812B LED
# on entry: a0[7..0] byte to transmit, MSB first
# on exit: none
# register usage:
# x3: argument save
# x4: bit counter
addi sp, sp, -16 # allocate space on stack
sw ra, 12(sp) # preserve return address
sw x3, 8(sp) # preserve x3
sw x4, 4(sp) # preserve x4
sw a0, 0(sp) # preserve a0
mv x3, a0 # save byte argument in x3
li x4, 8 # initialize bit counter
1: andi a0, x3, 0x80 # test MSB
snez a0, a0 # convert 0x00/0x80 to 0/1
call ws2812b_bit # transmit the bit
slli x3, x3, 1 # shift all bits one place toward MSB
addi x4, x4, -1 # decrement bit counter
bnez x4, 1b # loop if needed
lw ra, 12(sp) # restore return address
lw x3, 8(sp) # restore x3
lw x4, 4(sp) # restore x4
lw a0, 0(sp) # restore a0
addi sp, sp, 16 # restore stack pointer
ret # return from function
I took the extra step of preserving the function argument so that the caller can just load up a single value and call the function three times in a row without having to reload the argument. That’s just to make the debugging easier, as the final form won’t need that.
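As a sketch of what that debugging shortcut looks like from the caller's side (the value 0x20 is an arbitrary choice for illustration):

```asm
        li      a0, 0x20                # arbitrary test value, used for all three colors
        call    ws2812b_byte            # first byte on the wire (green)
        call    ws2812b_byte            # second byte (red); a0 was restored, no reload needed
        call    ws2812b_byte            # third byte (blue)
        call    ws2812b_reset           # ~50 us low level so the LED latches the data
```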
And here is the final form: the ws2812b_rgb function, wherein the caller sends the three bytes representing the red, green and blue components of the color they want on the LED:
ws2812b_rgb: # send red, green and blue color components to WS2812B LEDs
# on entry:
# a0[7..0] red data
# a1[7..0] green data
# a2[7..0] blue data
# on exit: none
# register usage:
# a0: function argument
# x3: swap register
addi sp, sp, -16 # allocate space on stack
sw ra, 12(sp) # preserve return address
sw x3, 8(sp) # preserve x3
mv x3, a0 # save red data
mv a0, a1 # green data
call ws2812b_byte # send green data
mv a0, x3 # return red data
call ws2812b_byte # send red data
mv a0, a2
call ws2812b_byte # send blue data
lw ra, 12(sp) # restore return address
lw x3, 8(sp) # restore x3
addi sp, sp, 16 # restore stack pointer
ret # return from function
Note the register-swapping shenanigans to be able to state the color data as R, then G, then B, but transmit in GRB order, as the WS2812B thinks proper.
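A caller might use it along these lines. This is a sketch only: the color values are arbitrary and the label test_loop is made up. The a0 register is reloaded on every pass because ws2812b_reset overwrites it, and a1 and a2 are reloaded to be safe, since whether they survive depends on spi_send, which is not shown here:

```asm
test_loop:                              # hypothetical test loop
        li      a0, 0x10                # red
        li      a1, 0x00                # green
        li      a2, 0x20                # blue
        call    ws2812b_rgb             # goes out on the wire as green, red, blue
        call    ws2812b_reset           # ~50 us low level latches the color
        j       test_loop               # repeat
```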
Now to let the little chip send this sequence a bazillion times and see if it gets confused. I’m actually feeling sort of confident that it won’t at this point, but the proper thing to do is to test it.