Chaining/mirroring panels

I have a setup where I output to 128x64 panels and, for ease of wiring, I use the data shifted out of the first panel to feed the 2nd panel.
Because of BCM, the time the shifted frame is displayed on the 2nd panel is the wrong amount of time, but in real life, the output is mostly usable.

Here’s the quick patch I had to write:

--- a/lib/framebuffer.cc
+++ b/lib/framebuffer.cc
@@ -554,10 +554,10 @@ static void InitFM6127(GPIO *io, const struct HardwareMapping &h, int columns) {
                                               int columns) {
   if (!panel_type || panel_type[0] == '\0') return;
   if (strncasecmp(panel_type, "fm6126", 6) == 0) {
-    InitFM6126(io, *hardware_mapping_, columns);
+    InitFM6126(io, *hardware_mapping_, columns*2);
   }
   else if (strncasecmp(panel_type, "fm6127", 6) == 0) {
-    InitFM6127(io, *hardware_mapping_, columns);
+    InitFM6127(io, *hardware_mapping_, columns*2);
   }

I’m wondering whether some --led-pwm-bits and --led-pwm-dither-bits settings might make the output a bit better on the 2nd panel.
Maybe @hzeller or others can suggest options.
Obviously one way to make it work perfectly would be to disable BCM and go back to pure PWM, at the expense of CPU time.

In real life, I should make a cable that splices the output from the first panel, but that makes the wiring a fair amount more complicated, so I’m using this for now.
Maybe the dirty patch can help someone who might need something similar.

Here are some pictures, the right panels are connected to the rPi, and the left ones are chained to the right one (3 parallel chains).

What’s interesting is that the BCM timing on the bitplanes is much more visible on the cube demo:

@hzeller if I can’t improve the shifted-out copy of the FB with --led-pwm-bits and --led-pwm-dither-bits, is it reasonably easy to turn off BCM? Yes, I know that for 7-bit colors that means I’ll get 128 interrupts per frame instead of 7, but if my CPU is fast enough, do I really care, as long as it gives me a perfect-looking image on the shift-out to the 2nd panel?

@hzeller So, I’m now at a point in my project where I need to design the production wiring.
For a bunch of reasons, the wiring is a lot easier/better if I can chain a 2nd set of panels that get the bits output from the first panel. I know that because of BCM, the bitplanes on the 2nd panels will be displayed the wrong amount of time, but it turns out that with SmartMatrix they still look mostly identical (and I thought SmartMatrix also used BCM).
With your lib, indeed some colors look wrong enough that I can’t go production with that (a shame because 90-95% looks close enough, but the remaining times look pretty wrong).
Do you think any command line arguments might make this work better, and if not, would it be a fair amount of work to turn off BCM in the code and do regular PWM with obviously a lot more interrupts?

You can hack around this by sending the same data multiple times by adding a little hack to lib/framebuffer.cc

Around the place where the row data is clocked out, make a loop that does it twice for instance:

--- a/lib/framebuffer.cc
+++ b/lib/framebuffer.cc
@@ -850,6 +850,7 @@ void Framebuffer::DumpToMatrix(GPIO *io, int pwm_low_bit) {
     // Rows can't be switched very quickly without ghosting, so we do the
     // full PWM of one row before switching rows.
     for (int b = start_bit; b < kBitPlanes; ++b) {
+      for (int i = 0; i < 2; ++i) {  // Output the same stuff twice
         gpio_bits_t *row_data = ValueAt(d_row, 0, b);
         // While the output enable is still on, we can already clock in the next
         // data.
@@ -859,6 +860,7 @@ void Framebuffer::DumpToMatrix(GPIO *io, int pwm_low_bit) {
           io->SetBits(h.clock);               // Rising edge: clock color in.
         }
         io->ClearBits(color_clk_mask);    // clock back to normal.
+      }
 
       // OE of the previous row-data must be finished before strobe.

I think this is too specialized to add to the regular library, but if you add this hack to your locally modified version, this will be better. Depending on the settings, this might cost you some refresh rate. If that is too much, then only splicing the wiring will help.

(Note: PWM is never possible with these panels, as a group of LEDs is affected by the same timing, so BCM is the only option.)

Thanks @hzeller, I didn’t even think about that as I was fixated on getting the mirroring ‘for free’ like I do with smartmatrix (however it outputs its bitplanes, using mirror out of a panel, really does work, no refresh speed lost).
When you say “it might cost some refresh rate”, I’m confused: isn’t it guaranteed to divide the speed by 2?
I use
./examples-api-use/demo --led-gpio-mapping=regular --led-rows=64 --led-cols=128 --led-row-addr-type=0 --led-chain=1 --led-show-refresh --led-slowdown-gpio=2 --led-pwm-bits=7 --led-panel-type=FM6126A --led-parallel=3 --led-chain=1 --led-pwm-lsb-nanoseconds=100 --led-pwm-dither-bits=2 -D0
and I get 200Hz instead of 400Hz now, but the output is perfect as expected.

Now, when you say PWM is not possible, I’m confused:
Let’s say we have 4bpp, with PWM I can have 16 interrupts of the same length (let’s say 1ns) and a value of 15/15 means you keep the LED on all 16 times. If you want 9/15, you keep it on 9 interrupts out of 15 and so forth.
BCM saves interrupts because the first interrupt is 8ns, the 2nd one 4ns, the 3rd one 2ns, and the last one 1ns, and you can still get all 4 bits of resolution, which is what you do.
But what’s wrong with going back to 16 interrupts of 1ns per scan instead of 4 interrupts of 8, 4, 2, and 1ns?
Doing that would ensure that the bits shifted out look just as good on chained panels, and still keep the original 400Hz refresh rate.
I agree though that it’s a non trivial code change that would probably only be useful to me :slight_smile:
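To make the comparison in the question concrete, here is a minimal sketch (not code from the library; the function names `BcmOnTime`, `PwmOnTime`, `BcmUpdates` and `PwmUpdates` are made up for illustration). It counts how long an LED is lit per row scan, in units of the shortest slot, for BCM and for equal-slot PWM, and how often the row data has to be clocked in for each. Note that 15 equal slots are enough to represent the values 0..15; the post rounds that to 16.

```cpp
#include <cassert>
#include <cstdint>

// Time an LED is lit during one row scan, in units of the LSB slot, when
// driven with BCM: bitplane b is displayed for 2^b slots.
int BcmOnTime(uint32_t value, int bits) {
  assert(bits > 0 && bits < 31);
  int on = 0;
  for (int b = 0; b < bits; ++b) {
    if (value & (1u << b)) on += 1 << b;
  }
  return on;
}

// Same for equal-slot PWM: 2^bits - 1 slots of identical length, with the
// LED lit during the first `value` of them.
int PwmOnTime(uint32_t value, int bits) {
  assert(bits > 0 && bits < 31);
  const int slots = (1 << bits) - 1;
  int on = 0;
  for (int s = 0; s < slots; ++s) {
    if (static_cast<uint32_t>(s) < value) ++on;
  }
  return on;
}

// How many times the row data must be clocked in per row scan.
int BcmUpdates(int bits) { return bits; }
int PwmUpdates(int bits) { return (1 << bits) - 1; }
```

For a 4-bit value of 9, both schemes light the LED for 9 slots; BCM needs 4 data updates per row and equal-slot PWM needs 15. The on-time is the same, and only the number of updates (and with it the amount of data to clock in) differs.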

The only thing I’m confused about is that SmartMatrix on teensy/ESP32 does BCM I’m pretty sure, and yet its output can be chained into another panel and looks just about as good.

I asked Louis (SmartMatrix) if he can chime in on what his driver does differently, but when I chain with his, I get this. If you look super carefully you’ll notice that the colors are ever so slightly different, but not enough for me to care (and I get same full refresh speed)
I’m not sure what’s the difference between the 2 drivers, that cause such a visible difference.

It could be the order that the bitplanes are shifted out. I shift out the bitplanes LSB first, MSB last (at least for the ESP32). Your panels feeding off the data from the main panels will be displaying the previous bitplane for the current time. So for MSB’s time they’ll be displaying MSB-1’s data. When the row transitions, they’ll be displaying the previous row’s MSB data, but only for the LSB time (a single very short pulse of OE).

If the order were reversed from MSB to LSB, things would probably look a lot worse.

(Hopefully this is correct, I didn’t double check my assumptions)
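The mechanism described above can be sketched as a small model. This is only an illustration under stated assumptions, not code from either driver: the chained panel always shows the bitplane that was clocked in one step earlier, for the duration of the current bitplane, and every row holds the same value (so "previous row's data" equals the current row's). `ChainedPanelOnTime`, `LsbFirst` and `MsbFirst` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// On-time (in LSB slots) seen on a panel fed from the first panel's
// shift-out. plane_order lists the bitplanes in the order the driver
// shifts them out; plane b is displayed for 2^b slots (BCM).
int ChainedPanelOnTime(unsigned value, const std::vector<int> &plane_order) {
  assert(!plane_order.empty());
  int on = 0;
  // At the start of a row the chained panel still holds the last plane of
  // the previous row (assumed here to carry the same pixel value).
  int held_plane = plane_order.back();
  for (int shown_plane : plane_order) {
    const int slot_length = 1 << shown_plane;  // timing of the current plane
    if (value & (1u << held_plane)) on += slot_length;  // data of the old one
    held_plane = shown_plane;
  }
  return on;
}

std::vector<int> LsbFirst(int bits) {
  std::vector<int> order;
  for (int b = 0; b < bits; ++b) order.push_back(b);
  return order;
}

std::vector<int> MsbFirst(int bits) {
  std::vector<int> order;
  for (int b = bits - 1; b >= 0; --b) order.push_back(b);
  return order;
}
```

With 4 bits, a value of 1 comes out as 2 on the chained panel in LSB-first order but as 8 in MSB-first order, because in MSB-first order the least significant bit ends up displayed during the longest slot. In this model neither order reproduces the original values exactly, so it illustrates the mechanism rather than settling which order looks better on real images.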

Thanks louis (@Pixelmatix), that makes perfect sense, rpi-rgb-panel must be MSB first, which is why it looks way off.
@hzeller, before I dive into the code, would you be ok with inverting the order in the lib to be LSB first? (either you if you have time and you feel inspired, or me if you don’t, and I’m smart enough to figure out the code :slight_smile: )

Thanks @hzeller, I didn’t even think about that as I was fixated on getting the mirroring ‘for free’ like I do with smartmatrix (however it outputs its bitplanes, using mirror out of a panel, really does work, no refresh speed lost).
When you say “it might cost some refresh rate”, I’m confused: isn’t it guaranteed to divide the speed by 2?

Not necessarily, only when clocking in the data is the vastly dominant part of the time spent. Typically, though, the dominant part is waiting for the longest pulse, during which the data clocking would otherwise be idle. During a long output-enable pulse, you can clock in more data while the pulse is still ongoing. Longer panels tend to be more dominated by the clock time.

In your case you use very few PWM bits and you have a long panel (in terms of data to be pushed), so you probably spend most or all of your time clocking data. In your particular case you will therefore probably see about half the speed.
If you try this with the standard number of PWM bits, you’ll see that your refresh rate goes down by a smaller fraction.
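A rough way to see this is the following simplified model (a sketch, not the library's actual scheduling): for each bitplane, the time spent is whichever is longer, the output-enable pulse or the clocking of the next data, since the two overlap. `RowTimeNs` and `SlowdownFactor` are hypothetical helpers, and the row clocking time is an input you have to estimate for your own panel.

```cpp
#include <algorithm>
#include <cassert>

// Simplified time to display all bitplanes of one row, in nanoseconds.
// Assumes clocking overlaps with the output-enable pulse, so each bitplane
// costs max(pulse length, clocking time). `repeats` is how often the same
// row data is clocked out (2 for the "output twice" hack).
double RowTimeNs(int pwm_bits, double lsb_ns, double clock_row_ns,
                 int repeats) {
  assert(pwm_bits > 0 && pwm_bits < 31 && repeats > 0);
  double total = 0;
  for (int b = 0; b < pwm_bits; ++b) {
    const double pulse_ns = lsb_ns * (1 << b);  // BCM pulse for plane b
    const double clocking_ns = clock_row_ns * repeats;
    total += std::max(pulse_ns, clocking_ns);
  }
  return total;
}

// Factor by which the refresh rate drops when every row is clocked twice.
double SlowdownFactor(int pwm_bits, double lsb_ns, double clock_row_ns) {
  return RowTimeNs(pwm_bits, lsb_ns, clock_row_ns, 2) /
         RowTimeNs(pwm_bits, lsb_ns, clock_row_ns, 1);
}
```

Assuming it takes roughly 5120 ns to clock one row and an LSB pulse of 100 ns, `SlowdownFactor(7, 100, 5120)` comes out just under 2, while `SlowdownFactor(11, 100, 5120)` is only about 1.15, which matches the qualitative point that more PWM bits hide more of the extra clocking behind the long pulses.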

Now, when you say PWM is not possible, I’m confused:
Let’s say we have 4bpp, with PWM I can have 16 interrupts of the same length (let’s say 1ns) and a value of 15/15 means you keep the LED on all 16 times. If you want 9/15, you keep it on 9 interrupts out of 15 and so forth.

Mmh, so yes, it could work with fewer PWM bits on smaller panels.

Clocking in the data is the dominant time you spend. Let’s take your 128x64 panel above.
Say we can clock data at 25 MHz into a panel with 128 pixels across: then clocking in one row takes about 128 / 25 MHz = 5.12μs.

That is essentially now your led-pwm-lsb-nanoseconds … 5120ns.

(The way the panels are constructed, we can in the meantime output-enable the previously clocked-in row, so we don’t lose any time switching it on. Good.)

For 7 bit, you have to clock that in 128 times, so about 655μs. For the usual 11 bits, that would be 10.5ms.

Now you have to do that 32 times (if your panel is 64 pixel high), so about 20ms total time or 48Hz refresh on 7 bit (335ms; 3Hz refresh on 11 bit…)

So even with only 7 bit, it would already create an unusable frame-rate, let alone 11 bit… With only 4 bits however, you could reach about 380Hz, which might be doable.
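The arithmetic above can be reproduced with a short sketch. The helper names are made up for illustration; the 25 MHz clock is the estimate used in the text, and the code assumes, as the text does, 2^bits equal slots per row and that a 64-pixel-high panel needs 32 row scans.

```cpp
#include <cassert>

// Time to clock one row of `columns` pixels at `clock_mhz`, in microseconds.
double RowClockTimeUs(int columns, double clock_mhz) {
  assert(columns > 0 && clock_mhz > 0);
  return columns / clock_mhz;
}

// Refresh rate in Hz for equal-slot PWM, where every slot lasts as long as
// clocking in one row and each row needs 2^pwm_bits slots.
double EqualSlotPwmRefreshHz(int columns, int panel_rows, int pwm_bits,
                             double clock_mhz) {
  assert(panel_rows >= 2 && pwm_bits > 0 && pwm_bits < 31);
  const double slot_us = RowClockTimeUs(columns, clock_mhz);
  const double slots_per_row = 1 << pwm_bits;
  const int scanned_rows = panel_rows / 2;  // 64 pixels high -> 32 scans
  const double frame_us = slot_us * slots_per_row * scanned_rows;
  return 1e6 / frame_us;
}
```

For a 128x64 panel at 25 MHz this gives roughly 48 Hz at 7 bits, 3 Hz at 11 bits and 380 Hz at 4 bits, in line with the numbers quoted above.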

The advantage is now, that all timing pulses are the same length, so if you chain things the way you’d like to do, showing the previous data in the shift register will not result in color artifacts.

Anyway, unless you really only show line graphics and no images, low PWM bit counts do not result in a good image.

BCM saves interrupts because the first interrupt is 8ns, the 2nd one 4ns, the 3rd one 2ns, and the last one 1ns, and you can still get all 4 bits of resolution, which is what you do.
But what’s wrong with going back to 16 interrupts of 1ns per scan instead of 4 interrupts of 8, 4, 2, and 1ns?

(Note the dominant time is clocking in the data, so 5120ns is now your LSB)
As shown above, with very few bits (<< 7), it might result in an OK refresh rate, but with the 11 bits needed for usable image quality, you have to clock in 2048 times per row instead of 11, so roughly 186x the load…

Doing that would ensure that the bits shifted out look just as good on chained panels, and still keep the original 400Hz refresh rate.

Well, with your example above with the 7 bit setting you’d get 48Hz, not 400Hz. But that would stay constant, indeed :slight_smile:

I agree though that it’s a non trivial code change that would probably only be useful to me :slight_smile:

Code change is probably not too hard, but it is so limited in use (small panels, low bit count), that I am not sure if it is worthwhile.
It might result in brighter panels overall though, as now our lsb nanoseconds is 5120ns, so less switching issues and less super-short pulses on OE.

One problem with this, however, is that you now use 100% CPU all the time, which will not make the kernel happy (and will result in regular glitches).
In the other case, we at least have time to relax in the MSB-ish bit planes, as waiting then dominates.

The only thing I’m confused about is that SmartMatrix on teensy/ESP32 does BCM I’m pretty sure, and yet its output can be chained into another panel and looks just about as good.

They might not overlap clocking in data with the pulse output of the previous row, which might give a slightly more viewable result, albeit at a slower refresh rate.

Thanks for the details, it’s indeed not as simple as I thought (and yes, I made the 1ns value up, it was just for easy math).
As for keeping one core constantly busy, for a 4 core rPi3/rPi4 and the fact that the lib is pinned to one core anyway, would that be an issue? Would it overheat and be clocked down or something?

Either way, I think you kind of convinced me that I just got lucky with @Pixelmatix’s lib working out for shifted data, and trying to replicate this with your lib without pushing the data twice, may not be trivial, or possible.
Maybe I’ll just eat the refresh rate cost, splicing and soldering ribbon cables is no fun, and I’m afraid it won’t be very solid/reliable.

I think it is possible to crimp two cables at once onto a single IDC connector (they have these little knives that cut into the wires), so it might actually be quite simple.

I agree. You might have trouble with the capacitance of longer IDC cables with two panels connected, but it’s worth a shot.

Here’s some advice copied from my Continuum Instructable:

  • Longer 16-pin IDC ribbon cables
    • You’ll need longer cables than are typically supplied with the HUB75 panels to connect the HUB75 panels between rows
    • The cheapest option is probably to get a roll of 16-conductor ribbon cable and a pack of 16-pin IDC connectors, and to crimp your own. Note that if you can’t find 16-conductor cable, you can buy a wider one (e.g. 20-conductor) and just separate out the 16 wires you need
    • You can get a special IDC crimping tool, or just use a bench vice

Thanks, that is worth a shot.