After investigating the timing of the WS2812 protocol in the previous part, the question is now how to use this knowledge for an optimized software implementation of a controller. An obvious approach would be to use an inner loop that uses a switch statement to branch into separate functions to emit either a “0” symbol or a “1” symbol. But, as is often the case, there is another solution that is both more elegant and simpler.
The image above shows the timing of both the “0” and the “1” code. The cycle starts at t0, the rising edge, for both symbols. The output has to be set high regardless of the symbol. At t1, the output has to be set to low for a “0” and can be unchanged for a “1”. At t2 the output goes low for the “1”. Since it is already low for a “0” we can set the output to low, regardless of the symbol. Finally, at t3 the complete symbol has been sent and the output can be left unchanged.
So, in the end there is only one point in time where the output is influenced by the symbol type: t1. Everything else remains unchanged. This means that special case handling can be limited to a very small part of the code.
This is what I ended up with in AVR assembler code:
        ldi   %0,8        // Loop 8 times for one byte
    loop:
        out   %2,%3       // t0: Set output Hi
        ...wait1...
        sbrs  %1,7        // [02/03] - Skip t1 if bit 7 is set
        out   %2,%4       // t1: Set output Low
        ...wait2...
        lsl   %1          // Shift out next bit
        out   %2,%4       // t2: Set output Low
        ...wait3...
        dec   %0          // Decrement the bit counter
        brne  loop        // t3: Loop
This code outputs one byte of data, which has to be loaded into %1 (the C compiler takes care of this). Since the protocol sends data MSB first, bit 7 is tested. If it is “1”, the out instruction at t1 is skipped. That’s it: only seven instructions are needed in the inner loop.
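The logic of the loop can be modelled on a PC. The following is only a sketch in C, not the assembler routine itself: the function name and the 'H'/'L' notation are made up for illustration. It records the output level written at t0, t1 and t2 for every bit, MSB first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of the inner loop (not the AVR code).
 * For each bit it stores the level the output has after t0, t1 and t2.
 * `levels` must have room for 8 * 3 characters plus the terminator. */
size_t ws2812_model_byte(uint8_t data, char *levels)
{
    size_t n = 0;
    assert(levels != NULL);
    for (uint8_t i = 0; i < 8; i++) {
        levels[n++] = 'H';                       /* t0: always set high       */
        levels[n++] = (data & 0x80) ? 'H' : 'L'; /* t1: low only for a "0"    */
        data <<= 1;                              /* shift out the next bit    */
        levels[n++] = 'L';                       /* t2: always set low        */
    }
    levels[n] = '\0';
    return n;
}
```

A “0” bit therefore shows up as HLL and a “1” bit as HHL. The only data-dependent write is the one at t1, which is exactly the special case described above.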
What is left now is to correct the timing. To do that, nops have to be inserted at positions wait1..wait3. As shown in the previous part, the most critical timing is that of the “0”, where the delay between t0 and t1 must not exceed 500 ns. The minimum achievable delay, when no nops are inserted at wait1, is two cycles. This equals 500 ns at 4 MHz and less at higher clock speeds. All other timings may exceed the minimum timing required by the data sheet.
This means that even this simple loop is able to control WS2812 LEDs at only 4 MHz! This is quite an achievement, since it was previously considered to be difficult to control WS2812 LEDs even at 8 MHz. Note that the 500 ns is safe on the WS2812B, but may be critical on the WS2812(S). It worked with my devices, though.
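The arithmetic behind the two-cycle figure is easy to check. This small helper is an illustration only (the function name is an assumption, it is not part of the library); it converts a cycle count into nanoseconds for a given clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: duration of `cycles` CPU cycles in nanoseconds
 * at a clock of f_cpu_hz, rounded down. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / f_cpu_hz);
}
```

With two cycles between t0 and t1 this gives 500 ns at 4 MHz, 250 ns at 8 MHz and 125 ns at 16 MHz, which is why faster clocks need nops at wait1 while 4 MHz needs none.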
To make the final implementation as flexible as possible, I opted to calculate the exact number of nops to insert at compile time from the F_CPU define, which is usually set to the CPU clock speed in the AVR-GCC toolchain. You can find the implementation here. The C code tries to adjust the timing according to the following rules, which allow a margin of at least 150 ns for both the WS2812 and the WS2812B timing:
    350 ns <  t1-t0 <= 500 ns
    900 ns <= t2-t0
    1250 ns <= t3-t0
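To see how these rules turn into nop counts, here is a sketch that does the calculation at run time on a PC. It is not the library's preprocessor code; the type and function names are invented for this example. It assumes the usual AVR cycle counts for the loop shown earlier, so that without nops t1 comes 2 cycles, t2 comes 4 cycles and t3 comes 8 cycles after t0.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle offsets from t0 with no nops inserted (assumed from the loop:
 * out = 1, sbrs = 1 or 2, lsl = 1, dec = 1, taken brne = 2 cycles). */
enum { FIXED_T1 = 2, FIXED_T2 = 4, FIXED_T3 = 8 };

typedef struct {              /* hypothetical result type */
    uint32_t wait1, wait2, wait3; /* nops to insert at each wait position */
    int t1_in_window;             /* nonzero if t1-t0 <= 500 ns is met    */
} nop_plan;

/* Smallest whole number of cycles that lasts at least `ns` nanoseconds. */
static uint32_t cycles_at_least(uint32_t f_cpu_hz, uint32_t ns)
{
    return (uint32_t)(((uint64_t)f_cpu_hz * ns + 999999999ULL) / 1000000000ULL);
}

nop_plan plan_nops(uint32_t f_cpu_hz)
{
    nop_plan p;
    uint32_t c1, c2, c3;
    assert(f_cpu_hz > 0);

    /* t1-t0 must be strictly longer than 350 ns: floor(350 ns) + 1 cycles. */
    c1 = (uint32_t)(((uint64_t)f_cpu_hz * 350u) / 1000000000ULL) + 1;
    if (c1 < FIXED_T1) c1 = FIXED_T1;

    /* t2-t0 >= 900 ns, but never less than the instructions need. */
    c2 = cycles_at_least(f_cpu_hz, 900);
    if (c2 < c1 + (FIXED_T2 - FIXED_T1)) c2 = c1 + (FIXED_T2 - FIXED_T1);

    /* t3-t0 >= 1250 ns, same reasoning. */
    c3 = cycles_at_least(f_cpu_hz, 1250);
    if (c3 < c2 + (FIXED_T3 - FIXED_T2)) c3 = c2 + (FIXED_T3 - FIXED_T2);

    p.wait1 = c1 - FIXED_T1;
    p.wait2 = c2 - c1 - (FIXED_T2 - FIXED_T1);
    p.wait3 = c3 - c2 - (FIXED_T3 - FIXED_T2);
    /* Upper bound: c1 cycles must not last longer than 500 ns. */
    p.t1_in_window = (uint64_t)c1 * 1000000000ULL <= (uint64_t)f_cpu_hz * 500u;
    return p;
}
```

Under these assumptions the sketch asks for no nops at all at 4 MHz, and for 1, 3 and 0 nops at 8 MHz. The counts the actual library produces may differ, since its rounding is not reproduced here.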
The outer loop is implemented in pure C, since it can be safely assumed not to take more than 5 µs. This way maximum flexibility is retained.
14 thoughts on “Light_WS2812 library V2.0 – Part II: The Code”
Looks like using a microcontroller with SPI and a DMA engine should make it trivial to implement this — just encode 3 bits for every input bit and DMA the whole mess out the SPI port at a fixed clock rate.
Nice to read this!
This has been my very successful method for years.
It works fine on several STM32 devices using cyclic DMA.
No CPU interactions required. Any SPI baud rate between
2 MHz and 2.857 MHz fits.
I am often asked how I implemented the 3 bit encoding.
Here it is:
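#include <stdint.h>

// Encoding one byte for WS2812B
// 0 -> 100
// 1 -> 110
uint32_t encode(uint8_t intensity)
{
    uint32_t v = intensity;   // widen first so the shifts also work with a 16-bit int
    return 0x924924UL         // fixed leading 1 of every 3-bit group (bits 23, 20, ..., 2)
         | (v << 15 & 1UL << 22)   // middle bit of each group comes from the data
         | (v << 13 & 1UL << 19)
         | (v << 11 & 1UL << 16)
         | (v << 9 & 1UL << 13)
         | (v << 7 & 1UL << 10)
         | (v << 5 & 1UL << 7)
         | (v << 3 & 1UL << 4)
         | (v << 1 & 1UL << 1);
}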
Hope it will be useful for some of you.
It's very efficient on ARM processors, but may be inefficient on some 8-bit devices.
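For readers who want to try the 3-bit encoding (0 -> 100, 1 -> 110) on a whole buffer, here is a loop-based sketch. The function names are made up for this example, and the byte order written to the buffer assumes an SPI peripheral that shifts each byte out MSB first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one data byte into 24 SPI bits: each data bit becomes
 * 110 (for a 1) or 100 (for a 0), most significant bit first. */
uint32_t spi_encode_byte(uint8_t value)
{
    uint32_t out = 0;
    for (int bit = 7; bit >= 0; bit--) {
        out <<= 3;
        out |= ((value >> bit) & 1u) ? 0x6u : 0x4u;
    }
    return out;
}

/* Encode `len` data bytes into 3 * len SPI bytes.
 * `dst` must provide room for 3 * len bytes. */
void spi_encode_buffer(const uint8_t *src, size_t len, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        uint32_t e = spi_encode_byte(src[i]);
        *dst++ = (uint8_t)(e >> 16);   /* first byte on the wire */
        *dst++ = (uint8_t)(e >> 8);
        *dst++ = (uint8_t)e;
    }
}
```

The price of this approach is memory: the SPI buffer is three times the size of the colour data, so one LED needs 9 bytes instead of 3.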
I am glad I stumbled on this today. I have been toying with these LED strips recently, making LED scrolling sign programs and such. My frustration with the Adafruit (and other) implementations is the memory usage and how that limits the number of LEDs you can control. Since 24-bit color is a bit overkill for many uses, I propose using a palette and then compressing the data (one byte per pixel instead of three, for 256 colors). Then, I’d pull each byte and look up the 3-byte value to send out.
While I programmed 6809 assembly, I have yet to touch AVR. Finding articles and projects like yours is a good starting point for me. Hopefully I can figure out enough to do what I am wanting.
Thanks for sharing!
Sure, you could use a look-up table. The existing code could easily be altered to use one. However, a LUT with 256 entries already uses 768 bytes of memory; you could drive 256 LEDs directly with that instead.
Very good point – though one could go with a 16 or 30 color table to save space (for an LED sign, maybe 256 colors would be overkill anyway), and I was thinking of having the palette table compiled into EEPROM.
I adapted your code to work with a LUT in PROGMEM, works like a charm! Thank you SO much for your efforts – I stand on the shoulders of a giant.
If anyone else is interested, use the RGBW structure for the LUT. It makes the index math easier (*4 instead of *3), which (even at 8 MHz) keeps the cycles within scope.
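The palette idea from this thread could look roughly like the sketch below. It is written for a PC so it can be tried without hardware: all names are hypothetical, record_byte merely stands in for the real bit-banging routine, and the table lives in ordinary memory (on an AVR with the table in PROGMEM the entries would have to be read with pgm_read_byte instead). The entries are stored in G, R, B order, which is the order the WS2812 expects on the wire, plus one pad byte so that the offset is index*4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-byte palette entry; `pad` is unused and only keeps
 * the entry size at a power of two. */
typedef struct { uint8_t g, r, b, pad; } palette_entry;

/* Stand-in for the real output routine: records the bytes instead of
 * sending them, so the sketch can be run on a PC. */
uint8_t recorded[16];
size_t recorded_len;

void record_byte(uint8_t b)
{
    if (recorded_len < sizeof recorded)
        recorded[recorded_len++] = b;
}

/* Expand one-byte palette indices into three colour bytes per pixel
 * and hand each byte to `send_byte`. */
void send_indexed(const uint8_t *pixels, size_t count,
                  const palette_entry *palette, size_t palette_size,
                  void (*send_byte)(uint8_t))
{
    for (size_t i = 0; i < count; i++) {
        assert(pixels[i] < palette_size);
        const palette_entry *e = &palette[pixels[i]];
        send_byte(e->g);
        send_byte(e->r);
        send_byte(e->b);
    }
}
```

With a 16-entry table this costs 64 bytes for the palette plus one byte per LED, instead of three bytes per LED. Whether the lookup fits between two bytes on real hardware depends on the clock speed and has to be checked against the timing limits discussed in the post.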
Can I change the brightness with your implementation? The Adafruit library lets you do it.
No, the purpose of the library is really to be “light weight”. It only sends data to the LEDs and does not do any computation.
Am I right in thinking that the LED controller does not have an address and that the only way to change, say, the third LED is to set one, two and three?
That suggests sending a duplicate of the information in one and two so as not to change them.
If I’m reading this correctly, the data has to be sent to the ws2812 in a serial fashion? We can’t address each LED individually if we wanted to? (so, we can’t turn LED 10 on, but leave 0-9 untouched?)
This means we always need to have an array to keep the value of all of our “pixels” limiting the size of the strip on a smaller processor (so I still need 3 bytes of storage for each pixel in the array even if I am tracking those pixels elsewhere).
Any chance that we could get a little documentation around ws2812_sendarray_mask?
Yes, indeed. That is how the LEDs work.
Do you have a specific question about ws2812_sendarray_mask?
Currently I have a string of 256 LEDs, but I could see that being significantly more. I figure the library is using about 768 bytes (38% of memory). This does eat into my memory budget quite a bit (took me a while to realize that the memory requirements of the library aren’t reported by the compiler).
Due to what I’m doing, I could generate the RGB values to send to each LED at runtime without storing them in an array. For instance, I could store the brightness of a particular LED, but set the RGB values as I loop through to render. This would reduce the memory requirements significantly.
In a perfect world I’d want to create a new sendarray_mask function, and I comprehend in very basic terms what’s going on inside of the ws2812_sendarray_mask function.
*data = a pointer to your array of RGB values
datlen = the length of that array
maskhi = ?? not sure what this is
*port = I assume this is the pointer to the port we’re banging the bits out on
*portreg = Another pointer for an additional parameter for the port?
From a timing stand-point, can I throw some code in that while loop without messing up timing?
Honestly, I’m on the edge of comprehension here (just enough to be dangerous kind of thing).