Andrew Zonenberg

Security and open source at the hardware/software interface. Embedded sec @ IOActive. Lead dev of ngscopeclient/libscopehal. GHz probe designer. Open source networking hardware. "So others may live"

Toots searchable on tootfinder.

Public Key

npub1cddglts94qutscms0qpmk87lel9m8xku7q0wr20u2th5fxvvunqqxz9vpd

NIP-05 Address

[email protected]

Profile Code

nprofile1qqsvxk504cz6sw9cvdc8sqamrl0uljanntw0q8hp48799m6ynxxwfsqpz4mhxue69uhhyetvv9ujuerfw36x7tnsw43qfqrw8c

Publishing to

relay.ditto.pub

Last Notes

2026-02-06 13:10:43 UTC

- reply

@nprofile…azw7 i'm sure one will get killed off, just like microchip buying atmel and dead ending the MIPS PIC32 line in favor of the new ARM based PIC32s that are just SAMs in a trench coat

2026-02-06 08:07:47 UTC

- reply

@nprofile…vnlz what do you mean there's nothing wrong with in band signaling
+++ATH0

2026-02-06 08:05:35 UTC

- reply

@nprofile…fz8u just watch out for flyback :p

2026-02-06 08:04:51 UTC

- reply

@nprofile…fz8u yeah i have the scopehal / ngscopeclient channel bridged to discord because some users demand it, but I'm an irc lifer for sure

2026-02-06 07:47:54 UTC

- reply

@nprofile…fz8u what about IRC?

2026-02-06 07:46:43 UTC

- reply

@nprofile…vntm @nprofile…0f4l (I could totally see the pilots being half awake, seeing the autoland in progress, then deciding it would probably fly better than they would in their current state and sitting back to watch the view)

2026-02-06 07:42:23 UTC

- reply

@nprofile…vntm @nprofile…0f4l And I'll take this outcome over an F-18 following the thing until it runs out of fuel any day

2026-02-06 07:41:34 UTC

- reply

@nprofile…vntm @nprofile…0f4l Yeah it's all speculation since all we have is statements of an obviously unreliable witness. Without CVR tapes or something we'll never know how conscious (or not) they were.
Either way, it worked as designed.

2026-02-06 07:23:18 UTC

- reply

@nprofile…0f4l But of course this is based on pilot statements and the brain is weird on hypoxia.
So it's entirely possible they *did* zone out for a bit and come to a bit when the pressurization alarm hit, and just don't remember.

2026-02-06 07:20:10 UTC

- reply

@nprofile…0f4l see https://www.ainonline.com/aviation-news/business-aviation/2025-12-23/king-air-b200-lands-after-garmin-autoland-activation
and https://avweb.com/aviation-news/garmin-autoland-activation-crew-decision/

2026-02-06 07:17:34 UTC

- reply

@nprofile…0f4l From the sources I've seen, it wasn't a true incapacitation (but it was very close to being one).
There was a pressurization failure, which (along with no control inputs for X amount of time, a manual button press, etc) is one of the triggers for the autoland system.
The pilots recognized the issue and put on their oxygen masks and never lost consciousness, but once the system activated they sat back for the ride since they were going to have to make an emergency landing anyway. They were conscious and prepared to resume hand flying if it didn't work.
But it was a legitimate in flight emergency in which the autoland flew the plane to a safe landing with no human intervention.

2026-02-06 06:58:37 UTC

- reply

@nprofile…k6gw @nprofile…vnlz That's what superb bowel sunday is about! Have you had a colon cancer screening lately? If not, consider this a reminder.

2026-02-06 06:34:16 UTC

- reply

@nprofile…mvvp thaaat's not cursed at all...

2026-02-06 05:25:37 UTC

- reply

@nprofile…l9z3 If you haven't seen my short story "Terminal Pacifism" you might appreciate it... https://serd.es/2025/10/31/Terminal-Pacifism.html

2026-02-06 05:18:16 UTC

- reply

@nprofile…vnlz No idea, but super bowel sunday is coming up so I thought I'd share some trivia.

2026-02-06 03:23:30 UTC

Did you know the average human large intestine is about 1.5 meters long? [1]
#SuperBowelFacts
[1] https://www.chp.edu/our-services/transplant/intestine/education/about-small-large-intestines

2026-02-06 03:19:54 UTC

- reply

@nprofile…au3r @nprofile…3qha DU is... Not nearly spicy enough to do that

2026-02-06 03:17:29 UTC

- reply

Also the Ubuntu 24.04 tests are failing to initialize vulkan while the fedora tests can do so just fine.
None of that code has changed recently

2026-02-06 03:09:28 UTC

I have a test case for libscopehal that has recently started failing in the GitHub CI environment with a SIGSEGV.
The same test, run on any of my machines, passes even when run under asan.
Anybody have ideas on how to debug? The limited visibility into the CI environment is annoying, I can't like ssh in and run gdb or something.

2026-02-05 16:10:07 UTC

- reply

@nprofile…huwm The "coupler de-embed" block takes in forward and reverse port waveforms plus an s4p file of the coupler.
First, it does an FFT based de-embed of the coupled port to input path (e.g. TX to TX monitor, and RX to RX monitor). This gives a first iteration estimate of the input signal, given the output, with the coupler's insertion loss and frequency response corrected. But you still have the leakage (which manifests as crosstalk between TX and RX waveforms).
So the next step is, the block does channel emulation of the input-to-opposite-coupled path (TX to RX monitor, and RX to TX monitor), using the estimate of the input signal from step 1.
This gives a prediction of the crosstalk signal. The prediction isn't perfect, since it was based on a signal which itself had some crosstalk. But assuming the coupler had *some* directivity, this prediction will contain mostly the crosstalk signal and very little of the desired signal. And since it's based on a full S-parameter model it will be phase-correct across the operating frequency band.
Finally, the magic happens: we subtract the predicted crosstalk from the measured signal (giving an estimate of what the scope would have seen had the coupler had better directivity) and repeat the step 1 de-embed operation to correct insertion loss and frequency response of the new signal.
This gives a refined, de-embedded prediction of the coupler's input with most of the crosstalk (and a small amount of the desired signal, unfortunately) subtracted. I'm not a math expert but I *think* the net effect of this operation is a squaring of the directivity, i.e. a coupler with 10 dB of directivity gives you 20 dB of directivity after this operation assuming you had clean data with perfect SNR and no quantization noise etc.
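
A toy sketch of the three steps above, collapsed to a single frequency so every S-parameter path becomes one complex number. This is not the ngscopeclient implementation: the symmetric coupler model, the function names, and the numeric values are all assumptions made for illustration. The real block works on full waveforms with a complete s4p model.

```python
import math

def deembed_forward(m_fwd, m_rev, coupling, leakage):
    """Return (first-pass, refined) estimates of the forward wave.

    m_fwd / m_rev: phasors seen at the forward / reverse monitor ports.
    coupling: input -> same-side monitor gain (assumed equal for both sides).
    leakage:  input -> opposite-side monitor gain (the imperfect directivity).
    """
    # Step 1: de-embed the coupled path to get first-pass input estimates
    a_first = m_fwd / coupling
    b_first = m_rev / coupling
    # Step 2: emulate the leakage path to predict crosstalk in the forward monitor
    xtalk_pred = leakage * b_first
    # Step 3: subtract the predicted crosstalk, then repeat the step 1 de-embed
    a_refined = (m_fwd - xtalk_pred) / coupling
    return a_first, a_refined

def error_db(estimate, truth):
    """Estimation error relative to the true signal, in dB."""
    return 20 * math.log10(abs(estimate - truth) / abs(truth))

# Hypothetical coupler with 10 dB of directivity, equal-amplitude waves
coupling = 0.1
leakage = coupling / 10 ** (10 / 20)
a_true, b_true = 1 + 0j, 1j
m_fwd = coupling * a_true + leakage * b_true
m_rev = coupling * b_true + leakage * a_true

a_first, a_refined = deembed_forward(m_fwd, m_rev, coupling, leakage)
first_err_db = error_db(a_first, a_true)      # about -10 dB
refined_err_db = error_db(a_refined, a_true)  # about -20 dB
```

In this idealized model the error drops from the directivity ratio to its square, and what remains is a small loss of the desired signal rather than leftover crosstalk. Noise, quantization, and model mismatch would eat into that in practice.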

2026-02-05 15:51:59 UTC

- reply

@nprofile…huwm Dual directional coupler :D split the signal into separate forward and reverse outputs.
The only problem is, the separation isn't perfect: they leak a bit of the opposite side into the output.
So we have a filter block in ngscopeclient to improve directivity in post.

2026-02-05 09:24:44 UTC

- reply

@nprofile…mcmd As the owner of an oscilloscope with a list price (when it was new) in that same ballpark... I feel your pain lol.

2026-02-05 06:57:33 UTC

15.6 kWh solar generation today, new record (unsurprisingly... the days are getting longer plus we are having a sunny spell).
This month so far, production has averaged around 10% of my consumption although I expect both to be climbing soon (my big scope is back from service, and spring is coming next month) so it'll be interesting to see what happens after that.
Total YTD production is 194.8 kWh although that includes 12 days of zero production with the system offline between initial post-install commissioning and the final inspection from the power company.
I burned 2.5 MWh in that same window.
But this is winter in the pacific northwest, I'm honestly quite happy with the amount of juice I'm getting so far.

2026-02-05 05:38:42 UTC

Finally had time to sit down and do a GPU version of the constellation diagram filter.
I'm now running this filter graph (4 channels -> 2 differential legs, S-parameter de-embed of dual directional coupler, 3 dB FIR equalizer, 4x sin(x)/x upsample, PAM-3 edge detection, CDR PLL, PAM-3 eye pattern, demux to 2D-PAM3 channels, 2D-PAM3 constellation, 100baseT1 single pair ethernet protocol decode) at about 6.3 WFM/s on 4 channels * 20M points.
This is 504 Msps or a touch over half way to real time on the ThunderScope.
The big bottleneck is the baseT1 decode which is completely single threaded and running on the CPU; every other filter in the graph is largely or entirely shader driven. This one block takes as much time to run as the whole rest of the filter graph, and since it's at the end of the dependency chain it can't start until everything but the constellation has finished running.
So figuring out how to GPU at least the initial 2D-PAM3 demodulation, descrambling, and start-of-frame detection should be a nice speedup.
https://files.ioc.exchange/media_attachments/files/116/016/386/001/203/451/original/2acda688a5d1d217.png
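
The throughput figure above is just waveforms per second times channels times points per channel. A quick sketch of the arithmetic; the 1 Gsps "real time" rate is an assumption inferred from the "half way" remark, and the function name is made up.

```python
def throughput_sps(wfm_per_sec, channels, points_per_channel):
    """Aggregate samples per second pushed through the filter graph."""
    return wfm_per_sec * channels * points_per_channel

rate = throughput_sps(6.3, 4, 20_000_000)  # about 504e6 samples/s

ASSUMED_REALTIME_SPS = 1e9  # assumption, not stated in the post
fraction_of_realtime = rate / ASSUMED_REALTIME_SPS  # about 0.5
```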

2026-02-05 04:26:31 UTC

- reply

@nprofile…8xvk 4.5 might be a little young for a cloud chamber.
But I'll definitely play with one later on

2026-02-05 03:55:03 UTC

- reply

@nprofile…97sf i think you mean kdoption

2026-02-05 03:54:06 UTC

- reply

@nprofile…k6gw whatever maglite aftermarket upgrade kits shipped with circa 2009?
Probably somewhere between 0 and garbage.
If i really cared i could run it on the ASEQ spectrometer down in the lab and get an actual calibrated CRI.
But that's a bit beyond preschool science class.

2026-02-05 03:05:38 UTC

- reply

We've looked at absorbance spectra of a bunch of magnatiles but she doesn't quite get all the details I'm trying to explain.
The joys of trying to teach a preschooler physics lol.

2026-02-05 03:04:14 UTC

- reply

Comparing spectra of the incandescent and white LED bulbs
https://files.ioc.exchange/media_attachments/files/116/015/798/988/220/101/original/86f6a17ad09932dd.jpg
https://files.ioc.exchange/media_attachments/files/116/015/799/205/971/683/original/b29ea6208d3a192e.jpg

2026-02-05 03:01:19 UTC

Fine tuning the Lego spectrometer more.
Using a maglite as the source eliminates the need for a separate collimating lens.
I also switched away from the very flaky variable slit made of a sliding brick that wouldn't stay in place if you looked at it wrong.
After some failed experiments with aluminum foil, I discovered that the round tree trunk Duplo brick was a tiny bit smaller than a nominal square 2x2 brick. When placed next to a square brick and locked in place with a plate on top you get a very consistent, repeatable slit that I'd ballpark as somewhere around 200 μm.
https://files.ioc.exchange/media_attachments/files/116/015/784/962/035/549/original/0f7fc3b6a6212e6b.jpg
https://files.ioc.exchange/media_attachments/files/116/015/785/269/678/181/original/f7a9460cf602ee2c.jpg

2026-02-05 01:56:47 UTC

- reply

@nprofile…4424 Vulkan is mandatory for the application to run at all.
The only corner case where this would matter is something that has Vulkan 1.0 or better but without those two extensions, and has a CPU with AVX512F. As far as I can tell this is an extremely improbable combination of hardware if it exists at all (think recent ish Xeon + obsolete low end dGPU)
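
A rough sketch of the selection logic being discussed, to show why the AVX512 path is so hard to reach. The function name, the flag strings, and the ordering among the CPU paths are assumptions; only "GPU with both int64 extensions always wins" comes from the thread.

```python
def pick_eye_backend(gpu_has_int64, gpu_has_atomic_int64, cpu_flags):
    """Pick an eye pattern implementation (hypothetical names throughout)."""
    # The shader is always used when both 64-bit extensions are present
    if gpu_has_int64 and gpu_has_atomic_int64:
        return "gpu"
    # CPU fallbacks, widest SIMD first (this ordering is an assumption)
    if "avx512f" in cpu_flags:
        return "avx512"
    if "avx2" in cpu_flags and "fma" in cpu_flags:
        return "avx2+fma"
    if "avx2" in cpu_flags:
        return "avx2"
    return "generic"

# AVX512 only runs on a box whose GPU lacks the extensions AND whose CPU has avx512f
corner_case = pick_eye_backend(False, False, {"avx2", "fma", "avx512f"})
typical = pick_eye_backend(True, True, {"avx2", "fma", "avx512f"})
```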

2026-02-04 23:52:52 UTC

- reply

@nprofile…kpre Not large, but I'm doing refactoring around that code and if it's going to get removed, now is the time.
There's currently no way to force fallback, if your GPU advertises support for int64 it's going to use the shader not the AVX. The only way to run that code path is to have a GPU without those extensions, and a CPU that advertises avx512f.
(Unrelated: nice profile pic... Mead & Conway? I recognize the color scheme)

2026-02-04 22:52:26 UTC

- reply

@nprofile…nhhd Yeah I'm currently leaning towards keeping the avx2/fma versions but removing the 512

2026-02-04 22:39:42 UTC

- reply

@nprofile…tm9j So what i'm hearing is, the space of systems that would be affected by me removing the AVX512 eye pattern block, keeping the generic, AVX2, AVX2+FMA, and GPU-with-int64 implementations, is the empty set or close enough it makes no difference?

2026-02-04 22:03:14 UTC

- reply

@nprofile…tm9j I know old, like Haswell era, Intel iGPUs lacked int64 support but I thought (could be wrong) current models have it. And most of those don't have AVX512 (AVX2 makes more sense to keep at least for now).
So I guess part of the question is: did Intel ever make a chip where the iGPU does not support int64, but the CPU supports avx512f?

2026-02-04 21:36:45 UTC

Debating removing some of the AVX accelerated filter implementations in libscopehal for blocks where we now have GPU support.
For starters, the eye pattern: what are the odds that we have any reasonable number of users where a) they have a CPU with AVX512F (or even AVX2), and b) do NOT have a GPU with GL_ARB_gpu_shader_int64 and GL_EXT_shader_atomic_int64?

2026-02-04 20:06:56 UTC

- reply

@nprofile…thd8 I read that as "branch prediction"

2026-02-04 17:46:05 UTC

- reply

@nprofile…u57d Yeah I can't imagine an oil company being friendly to that lol.
I'm pretty much maxed out on solar. My back yard gets way too much shade, the front yard has a small tree on one side and the other is pretty small. A little carport thingie over the driveway miiight be feasible as a future expansion, that area does get a fair bit of sun.
My next focus will be reducing consumption by deploying a commercial VRF heat pump so I can heat the house during the winter with waste heat from the lab, rather than air conditioning the lab with one heat pump while heating the house with another and just dumping the heat outside that I could be using elsewhere (which I do now).
And then replacing my aging Cisco switches in the lab with my own FPGA-based design, whenever I finish it, will probably save another couple hundred watts.

2026-02-04 17:33:50 UTC

- reply

@nprofile…u57d And if you had no partial shading to worry about, you could probably use string inverters vs microinverters which would also be cheaper.
My setup has so many trees nearby one or more of the panels are almost always shaded so a microinverter architecture was the only sane option.
Large commercial installs like you're dreaming about normally use string inverters (and are on big enough plots of cleared land that tree/building shade isn't a huge concern)

2026-02-04 17:28:43 UTC

- reply

@nprofile…u57d So first order estimate (you might have some savings due to economies of scale) you're talking about a suburban block of rooftops.
Or if ground mounted, (1.9m^2 * 41 panels) = 77.9 m^2 (my array), times 10 = 779 m^2 or 0.2 acres of array for a 150 kW farm and 2337 m^2 / 0.6 acres for a 450 kW farm.
Assuming roughly comparable parts+labor cost to my setup, no tax incentives, and no economies of scale for a larger deployment, roughly $0.5M / $1.5M install cost respectively.
Ground mounting would definitely save you a bunch vs rooftop.
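
The area numbers above as a small script, scaling the 41 panel / 15 kW AC array from the earlier reply. The constant and function names are made up for illustration, and the acre conversion factor is the standard 4046.86 m^2.

```python
PANEL_AREA_M2 = 1.9     # per-panel area used in the estimate above
PANELS = 41             # panels in the rooftop array
ARRAY_AC_KW = 15.0      # max continuous AC output of that array
M2_PER_ACRE = 4046.86

def farm_area(scale):
    """Return (m^2, acres) of panel area for `scale` copies of the array."""
    area_m2 = PANEL_AREA_M2 * PANELS * scale
    return area_m2, area_m2 / M2_PER_ACRE

def farm_ac_kw(scale):
    """AC output in kW for `scale` copies of the array."""
    return ARRAY_AC_KW * scale

small_m2, small_acres = farm_area(10)  # about 779 m^2, about 0.2 acres, 150 kW
large_m2, large_acres = farm_area(30)  # about 2337 m^2, about 0.6 acres, 450 kW
```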

2026-02-04 17:20:45 UTC

- reply

@nprofile…u57d That would need quite the farm and/or a huge battery bank.
My... 41 panel? rooftop array is 18 kW theoretical DC peak, 15 kW max continuous AC output (which should not be a bottleneck because the array is oriented in all 4 directions and there's almost always some shading due to surrounding trees).
I think current gen fast chargers top out at 350 kW so you'd be looking at an array of ~23x larger size to be able to keep up with the charger assuming continuous full sun.
You'd need even more (maybe 30x) to allow for suboptimal sun angles, dirt, conversion losses, etc although if you didn't plan to run the charger nonstop during daylight, and had batteries for peak shedding, you could probably get away with more like 10x depending on how heavily you expected to load it.

2026-02-04 15:27:03 UTC

- reply

@nprofile…4424 I'm not an expert on this since I don't normally do my own terminations, I pay Koaxis to deal with this for me :P
There might be a trick to cleanly removing the foil layer... perhaps @nprofile…k7h5 has ideas?

2026-02-04 08:49:03 UTC

- reply

This was still way faster than the CPU version: the v0.1.1 CDR PLL took 75.5 ms on the CPU to run this benchmark, while the unevenly partitioned GPU version took about 12ms.
But the fixed version took about 5.8 ms.
I saw a similar, less striking speedup on the PAM edge detector which had slightly less imbalance with the sample size and thread count in this particular benchmark, going from 259 ms for the CPU version to 12 ms imbalanced to 10 ms balanced.

2026-02-04 08:38:14 UTC

- reply

The bug turned out to be calculating the number of samples processed per thread as (numSamples / numThreads) with integer math, rounding down.
Simplifying a bit, each thread processes samples i*blocksize to (i+1)*blocksize - 1, with the last thread clamping to the number of samples if it doesn't divide evenly.
Therein lies the problem. Suppose you have 8192 threads and 8388607 (8192*1024 - 1) samples. Divide this by 8192 and you get 1023.9998, which rounds down to 1023.
But let's look at what this means for workload partitioning: threads 0...8190 will each process 1023 samples.
And then poor thread 8191 is left with a block that starts at sample 8191*1023 = 8379393, meaning it has to crunch a whopping 9214 samples instead of 1023 like everyone else, and will run nine times longer than them.
The fix is to round up instead of down, which means each thread processes 1024 samples instead of 1023 (taking infinitesimally longer). Thread 8191 now starts at sample (8191*1024) = 8387584 and processes 1023 samples.
With some thread counts and workload partitioning you end up having the last thread do nothing at all (which caused a bug I had to fix where it assumed at least one sample was present), but it still runs a lot faster than having one overworked thread taking up everyone else's rounding errors.
https://files.ioc.exchange/media_attachments/files/116/011/423/843/314/576/original/e1f91abe157a704e.png
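
The partitioning arithmetic above, as a CPU-side sketch that just counts how many samples each thread ends up with. This is not the shader code; the function name and the clamping details are assumptions reconstructed from the description.

```python
def partition(num_samples, num_threads, round_up):
    """Return the number of samples handled by each thread."""
    if round_up:
        block = (num_samples + num_threads - 1) // num_threads  # ceiling division
    else:
        block = num_samples // num_threads                      # floor division (the bug)
    counts = []
    for i in range(num_threads):
        start = min(i * block, num_samples)
        if i == num_threads - 1:
            end = num_samples  # last thread picks up whatever is left
        else:
            end = min((i + 1) * block, num_samples)
        counts.append(end - start)
    return counts

buggy = partition(8388607, 8192, round_up=False)
fixed = partition(8388607, 8192, round_up=True)
worst_buggy = max(buggy)  # 9214, all of it on the last thread
worst_fixed = max(fixed)  # 1024
```

With small inputs such as `partition(10, 8, round_up=True)` the trailing threads get zero samples, which is the "last thread does nothing" case mentioned above.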

2026-02-04 08:29:44 UTC

Just fixed an interesting and subtle shader performance bug.
Here I'm trying to recover a clock from a PAM-3 signal which consists of two consecutive filter blocks: the PAM edge detector (find level crossings and interpolate, accounting for the fact that the threshold changes depending on start/end symbol), and then the CDR PLL proper.
If you're not familiar with NSight Systems especially on complex multithreaded applications, there's a lot going on here even though I cropped down to only the important bits.
What we're looking at here is time going from left to right, and a whole slew of GPU and CPU performance metrics and state values that can help understand what's going on.
The "NVTX" line near the top is trace output data from ngscopeclient, showing what filter is currently executing. Keep in mind that the ThunderScope driver is also running in the background and pushing work to the GPU via a separate Vulkan queue, so we'll see interference from that.
The PAM edge detector consists of four consecutive shader invocations, which can be seen clearly as distinct activity patterns in the "SM Occupancy (TPC View)" graph about halfway down.
1) Find every crossing of any level threshold. This will double-count some edges e.g. going from 0 to 2 on a PAM-3 signal will be recorded as 0-1, 1-2. This produces a separate variable-length output list for each thread
2) Merge these variable-length lists into a linear buffer so it's easier to index (this step might get omitted in a future optimization)
3) Merge double-counted transitions so we only have one sample at the midpoint of each edge, producing another set of variable-length lists
4) Final merge into a linear list of edges that we can pass to the CDR PLL
Ignore the burst of 100% occupancy accompanied by spikes in blue (PCIe) under "Unit Throughputs" and PCIe Read Bandwidth. This is the ThunderScope driver pushing a new channel of waveform data to the GPU and the scheduler pre-empting the filter graph to process it.
But note there's a big gap between each of these shaders, especially the first and second.
The same is true if you look to the right at the clock recovery PLL, there's bursts of activity with long idle periods between them.
But, it turns out, these are not actually idle periods (note "GPU Active" is still high). There is still one compute warp active, on one SM! This means a single GPU thread in the shader is taking significantly longer than the others, slowing things down.
https://files.ioc.exchange/media_attachments/files/116/011/372/022/573/382/original/b9481c8575ebeaec.png
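
A serial sketch of what steps 1 and 3 of the edge detector compute, ignoring the per-thread variable-length lists and the merge passes (steps 2 and 4). The function names, the `max_gap` parameter, and the example levels and thresholds are assumptions for illustration, not the shader's actual interface.

```python
def find_crossings(samples, thresholds):
    """Step 1: every threshold crossing as (interpolated time, direction)."""
    crossings = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        direction = 1 if b > a else -1
        hits = []
        for th in thresholds:
            if (a < th) != (b < th):  # this sample pair straddles the threshold
                hits.append(i + (th - a) / (b - a))  # linear interpolation
        crossings.extend((t, direction) for t in sorted(hits))
    return crossings

def merge_edges(crossings, max_gap):
    """Step 3: collapse crossings belonging to one edge into its midpoint."""
    edges, group = [], []
    for t, d in crossings:
        # A direction change or a large time gap starts a new edge
        if group and (d != group[-1][1] or t - group[-1][0] > max_gap):
            edges.append((group[0][0] + group[-1][0]) / 2)
            group = []
        group.append((t, d))
    if group:
        edges.append((group[0][0] + group[-1][0]) / 2)
    return edges

# PAM-3 levels -1 / 0 / +1 with thresholds halfway between them
samples = [-1, -1, 1, 1, 0, 0, -1, -1]
crossings = find_crossings(samples, [-0.5, 0.5])
edges = merge_edges(crossings, max_gap=1.0)  # [1.5, 3.5, 5.5]
```

The -1 to +1 jump crosses both thresholds and gets recorded twice (at 1.25 and 1.75), then merges to a single edge at 1.5. The single-level transitions pass through unchanged.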

2026-02-04 01:40:07 UTC

- reply

@nprofile…8vnr @nprofile…hucu That block diagram looks a lot more like the actual part than I'm used to.
Wonder why the receiver and tristate buffer are different dies

2026-02-03 16:56:33 UTC

- reply

@nprofile…8f6p Definitely won't be staying that long although I might be coming up the weekend before

2026-02-03 16:15:10 UTC

- reply

@nprofile…6cfx @nprofile…3tsk stealth dicing ftw?

2026-02-03 16:07:32 UTC

- reply

@nprofile…6cfx @nprofile…3tsk I've seen more deep cracks in substrate than BEOL, but sure that's a consideration as well

2026-02-03 16:04:34 UTC

Looks like I'm gonna be going to HARRIS at MPI-SP in Bochum again this year (March 23-25)
Anybody going, or in the greater Bochum/Dusseldorf/Cologne area, and interested in a meetup around that time?

2026-02-03 15:57:39 UTC

- reply

@nprofile…3tsk (the trick to not cracking the die is to go slowly, do the initial cut if any well short of the plane of interest, then use fine grits, like P1200 and smaller, as you approach the target... using a diamond saw or large abrasive to go through the silicon gives exactly the kind of cracks shown in the original image and they tend to propagate pretty far making it difficult to polish them out... not that they tried, that surface finish looks like P800 or something)

2026-02-03 15:55:41 UTC

- reply

@nprofile…3tsk like this QFN is what I consider "eh, it'll do but I'm not thrilled, i've seen better" cross section quality
https://files.ioc.exchange/media_attachments/files/116/007/508/860/443/325/original/cfd6a426b29b454e.jpg

2026-02-03 15:53:12 UTC

- reply

@nprofile…3tsk I get upset with myself when I do a package cross section and I can see a hairline scratch in the silicon under 100x magnification rather than a flawless mirror finish

2026-02-03 15:52:35 UTC

- reply

@nprofile…3tsk i'm screaming internally at the cracks in the die and the poor surface finish on the polish.
Like it probably gets the point across, but come on have a bit of pride in your work and do a good job with sample prep

2026-02-03 15:49:33 UTC

- reply

@nprofile…0hj6 TIL what the e in e-core stands for

2026-02-03 15:43:13 UTC

- reply

@nprofile…0hj6 It took me a minute to realize that you weren't torturing your laptop by forcing it to "compile more than it ever should have on e".
Like, i guess if your laptop wants to be a girl good for it? but i don't understand how that affects compile performance

2026-02-03 09:47:21 UTC

- reply

@nprofile…q57v I have a somewhat more proper spectrometer in the lab from ASEQ Instruments (inexpensive, but "real").
But this is more appropriate for a 4.5 year old to play with.

2026-02-03 09:44:11 UTC

- reply

@nprofile…q57v Just a random cheap prism from amazon, nothing special.
Basic idea is, put a narrow slit in front of the light source (ideally also a collimating lens to focus it a bit better) to get a tall skinny beam of light, then into the prism, then put something white wherever the rainbow lands. Play with slit size and prism alignment until you get a good rainbow.
Then insert various colored transparent objects into the beam path and watch what they do to the reference spectrum. This is basically a crude form of absorbance/transmission spectroscopy.

2026-02-03 03:04:56 UTC

- reply

@nprofile…gftk you know, if you actually spend your money to do useful things you won't have to worry about a wealth tax

2026-02-03 02:48:17 UTC

- reply

Some measurements from the lego spectrometer
https://files.ioc.exchange/media_attachments/files/116/004/406/847/093/717/original/1a9822e48c71558e.jpg
https://files.ioc.exchange/media_attachments/files/116/004/407/495/876/818/original/10aa799c9507980d.jpg

2026-02-03 02:45:57 UTC

Introducing the kiddo to optical spectroscopy
https://files.ioc.exchange/media_attachments/files/116/004/399/366/467/355/original/47fb4a4ad9eab76d.jpg
https://files.ioc.exchange/media_attachments/files/116/004/399/719/884/995/original/6fc01da0566cf313.jpg
https://files.ioc.exchange/media_attachments/files/116/004/400/391/905/270/original/d074cf48d224406b.jpg

2026-02-03 01:11:44 UTC

- reply

@nprofile…3tsk the layout guy was not thrilled at the K_{64} topology we ended up with instead lol.
4032x 32 Gbps NRZ SERDES channels (8064 diff pairs) on a single board.

2026-02-03 01:06:12 UTC

- reply

@nprofile…3tsk it was basically an extension of an in-package NoC, we would have had all IO be handled by packages on the edges of the array.
The core would only have had power and serdes to adjacent tiles

2026-02-03 01:04:29 UTC

- reply

@nprofile…3tsk you would have been able to basically just put the packages next to each other and have like 2mm diffpairs between them

2026-02-03 01:04:05 UTC

- reply

@nprofile…3tsk it was for a HPC project with a lot of fast interconnect... one proposed topology was a hexagonal lattice of chips on a PCB with the packages basically butting against each other, and short-reach SERDES die to die from each one to its neighbors on all six sides

2026-02-03 00:57:04 UTC

- reply

@nprofile…3tsk Years ago I worked on a silicon project that contemplated (but did not actually manufacture) a hexagonal BGA.
Not like a hexagonal pin lattice on a square package, the actual component would have had six sides and ~2000 balls

2026-02-03 00:56:19 UTC

- reply

@nprofile…gj0l And now I'm confused because I was very sure R+Y was "crew have mutinied" but now I can't find any references to it anymore and I see the same combo used for "keep clear at low speed"

2026-02-02 18:36:50 UTC

If "codices" is the plural of "codex" does that mean the plural of "mutex" should be "mutices"?

2026-02-02 16:05:30 UTC

- reply

New version: directly map CPU side staging buffer and run the shader against that buffer without moving to local memory first.
This version has 98% SM occupancy and takes just over 1ms per buffer, GPU memory write bandwidth is 7%.
Overall application performance is higher (my benchmark went from 7.7-9.0 to 8.6-9.2 WFM/s) with less jitter due to contention between waveform download and the filter graph. It also saved around 100 MB of VRAM that had been used for the staging buffers.
Seems like a pretty clear win all around, and I'll probably want to do similar optimizations elsewhere on other shaders that have read-once / write-once buffers.
https://files.ioc.exchange/media_attachments/files/116/001/876/432/562/067/original/43f9db4ceb80f21a.png

2026-02-02 16:01:42 UTC

So, I have an answer to my previous question about GPU transfer efficiency.
Original code: write data to staging buffer on CPU, vkCopyBuffer to GPU local memory, run int-float32 conversion on GPU out of that buffer. The copy operation shows 50% SM occupancy by compute warps, 50% unallocated warp slots in active SMs.
GPU memory write bandwidth is sitting around 2%, about 1.9 ms copy/shader run time.
https://files.ioc.exchange/media_attachments/files/116/001/860/778/213/806/original/22e18aeb99fb6085.png

2026-02-02 03:22:17 UTC

- reply

@nprofile…8f6p that requires it be plugged in directly, not remoted.
But it's on the list of things to explore

2026-02-02 03:10:21 UTC

- reply

@nprofile…vnlz did you hit a 32 bit offset limit in gtkwave or something

2026-02-02 03:09:54 UTC

- reply

@nprofile…vnlz oh dear how big

2026-02-02 00:13:34 UTC

- reply

@nprofile…r9vg RoCE is a long term plan but for now let's assume that's not an option

2026-02-01 23:58:37 UTC

- reply

(the raw samples are written once per trigger then read once in the shader)

2026-02-01 23:51:14 UTC

Thinking about a potential performance improvement for overlapping compute and transfer operations in ngscopeclient.
Right now, if you use a ThunderScope (ignoring unified memory platforms where the issue is moot) when a new waveform shows up we write it into CPU side pinned memory.
Then we vkCopyBuffer it into local memory, barrier on that transfer, and run the ConvertNBitSamples shader to convert the raw adc codes to float32.
The problem is, this burns scratch buffer space in local memory needlessly and also on nvidia microarchitectures, submitting a transfer to a compute queue seems to use the SMs for copying which means we effectively run a memcpy shader blocking other compute tasks followed by the conversion shader.
So there's two possible other ways to do this: first, do the transfer operations in a dedicated transfer queue. The DMA will compete for memory bandwidth with other shaders but not burn SM occupancy during the PCIe transfer. But I'm still doing a write-then-read and wasting local memory. And i need to synchronize between a compute and transfer queue this way.
The other possibility would be to pass the CPU side buffer directly to the conversion shader. This means reading CPU side memory in a shader which is slow (meaning I'm blocking SMs while i wait for PCIe) but the transfer has to happen at some point and this frees a temporary buffer. And it's probably still faster than using the SMs for copying and then running a second compute shader.
Thoughts before I spend time implementing either?

2026-02-01 23:32:26 UTC

- reply

@nprofile…thd8 it's from 2014, my oldest rig, and I've generally targeted a ten year minimum lifespan. I think I'll be keeping a lot of my current hardware longer than that though

2026-02-01 23:31:30 UTC

- reply

@nprofile…thd8 yeah i had originally been planning to maybe upgrade my microscope bench workstation in 2026 but... That's not looking too likely right now

2026-02-01 23:28:29 UTC

- reply

@nprofile…thd8 yeah i haven't done a new build without ecc in a long time. It's just not worth the risk

2026-02-01 21:09:32 UTC

Ea-Nasir spotted
https://files.ioc.exchange/media_attachments/files/115/997/417/917/868/871/original/017cb6aa339d4b34.jpg

2026-02-01 09:23:44 UTC

- reply

@nprofile…3y57 hardened tool steel and flexures on the same part? Now I'm curious...

2026-02-01 08:53:51 UTC

Are there any old school non-neural-network based speech recognition or TTS tools still out there (ideally open source)? Curious how much of the dark arts have been lost

2026-02-01 05:45:40 UTC

Another evening, another few filters getting OOM speedups.
Tonight it was invert (27.3x, trivial memory bound shader that just outputs negative x[i] for each output sample) and the 8B/10B decode (didn't even bother to GPU it, just removing the redundant sampling operation by using the new CDR recovered-data output was enough for a 12.1x speedup and 6ms on my 50M point benchmark is fast enough I'm in no hurry to GPU... but at faster data rates it may still be worth it)

2026-02-01 04:46:59 UTC

- reply

@nprofile…vnlz that was my reaction on reading the stm32mp2 official software toolchain docs.
I threw it all out the window, built my own firmware with aarch64-none-eabi-g++, slapped a bootrom header on it, and called it a day

2026-02-01 04:18:15 UTC

- reply

@nprofile…nhxp If someone actually wanted to play the capitalism game with this pitch, you'd take VC money, pay it all out to the engineers performing the service, and offer your services at a steep discount to the public to attract customers.
Worst case, the company goes under but a bunch of engineers got nice jobs for a while, VCs lost money, and a bunch of money didn't go to the AI industry. Seems like an all-around win.

2026-02-01 04:13:14 UTC

- reply

@nprofile…nhxp (The point of the joke was to turn the AI grift on its head, tricking tech bros into hiring actual engineers and paying them a fair wage by telling them they're a bot lol).
But yeah, getting people to pay to support open source development is a perpetual challenge. ngscopeclient certainly has been a net loss to me. We've got some donations, but nowhere near enough to even cover what i've personally spent on the project, much less the value of my time and that of all of the other contributors...

2026-02-01 03:49:18 UTC

- reply

@nprofile…fz8u ROFL that's a fun confusion.

2026-02-01 03:33:28 UTC

- reply

(whether people would be willing to pay what it costs to pay the staff a fair industry rate for their services is another question, of course)

2026-02-01 03:32:07 UTC

- reply

Jokes aside i feel like there would be a market for a "consultant at your fingertips" service that had vetted professionals in a variety of disciplines on tap, for an hourly fee, for rapidly answering questions and bouncing ideas off

2026-02-01 03:30:18 UTC

Somebody should make an "AI chatbot" that actually connects you to a real expert in the field.
Then when it gets popular for giving much better results than the state of the art LLMs, expose the "grift"

2026-02-01 01:35:17 UTC

- reply

@nprofile…8244 oh I was thinking more like how probable it was that i know somebody who knows the elongated muskrat

2026-02-01 01:33:25 UTC

- reply

@nprofile…dcqh Definitely good advice.
Also just making sure people know your plans, estimated return time, etc. If the PLB (or you) is damaged to the point you can't use it to call for help, the last resort is to have somebody realize you never came home.
When I was in the outing club in school we had a designated WIMP (Worry If Missing Person) for each trip. This was somebody with no plans that weekend who had contact info for the trip leader. At the prearranged check-in time, if they hadn't heard that everyone had come home safe, they'd try to call the leader. If that didn't work, and subsequent attempts to reach them or other group members failed by a second designated "panic" time, they'd alert local authorities to send a search party.
On the other side of the fence... my SAR unit once went out to look for somebody who went out for a day hike on a Monday and wasn't reported missing until the following Weds because she lived alone and hadn't told anyone where she was going. We didn't get paged until Saturday night, when a park ranger found her car (she was officially missing, but nobody knew where to look so none of the wilderness search teams could be dispatched). We found her mid afternoon Sunday.
Somewhat surprisingly that incident had a happy ending. But she openly admitted in a newspaper interview that not telling anyone where she had gone was a mistake.

2026-02-01 01:16:26 UTC

- reply

@nprofile…8244 although if you only count publications it's probably a lot larger

2026-02-01 01:07:48 UTC

- reply

@nprofile…8244 what is your erdos number, out of curiosity? Mine is 3 (via Bulent Yener, Mark Goldberg).
I don't particularly want to know my epstein number. But if i had to guess, it's probably in the same ballpark depending on how casual an acquaintance you count... I know a fair number of people in silicon valley tech circles from work, and many years ago I worked indirectly for, and briefly met, one of the owners of (redacted 2 digit address) Wall Street.
So between those two groups of people, I'd be shocked if none of them knew someone who knew him (or even knew him directly).

2026-01-31 17:03:13 UTC

- reply

@nprofile…czzc That is actually something I've thought about, being able to take explicit thresholds as an optional parameter.
But if you're doing 10-90 or 20-80% rise *and* fall measurements I still want those levels cacheable to avoid recomputing them or forcing the user to pass a whole bunch of temporaries around the graph explicitly.

2026-01-31 16:29:05 UTC

- reply

@nprofile…czzc About 38 last time I checked, it may be possible to optimize those shaders further.
But that was a lot faster than doing it the old way with separate GetBaseVoltage and GetTopVoltage redundantly histogramming and minmaxxing the signal twice (and not having any GPU acceleration for either).
But if you do e.g. rise and fall time separately it'll be redundant still. I have an open ticket for trying to cache this sort of analysis so you only ever have to average, min, max, etc. a signal once no matter how many filters use those values internally down the road.
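
A rough sketch of that caching idea, in Python rather than the project's actual code. A hypothetical `WaveformStats` wrapper computes min/max, mean, and a histogram-based base/top lazily and at most once, so a rise-time and a fall-time measurement sharing one wrapper never rescan the samples. Every name here is an illustrative assumption, not a libscopehal API, and picking the most populated histogram bin in each half of the range is just one common way to estimate base and top.

```python
from functools import cached_property


class WaveformStats:
    """Hypothetical per-waveform statistics cache (not a libscopehal class)."""

    def __init__(self, samples, bins=64):
        self.samples = list(samples)
        self.bins = bins
        self.computations = 0  # how many statistics were actually computed

    @cached_property
    def min_max(self):
        self.computations += 1
        return min(self.samples), max(self.samples)

    @cached_property
    def mean(self):
        self.computations += 1
        return sum(self.samples) / len(self.samples)

    @cached_property
    def base_top(self):
        """Most populated histogram bin in the lower and upper half of the range."""
        lo, hi = self.min_max  # reuses the cached min/max
        if hi == lo:
            return lo, hi
        self.computations += 1
        counts = [0] * self.bins
        scale = (self.bins - 1) / (hi - lo)
        for v in self.samples:
            counts[round((v - lo) * scale)] += 1
        half = self.bins // 2
        base_bin = max(range(half), key=counts.__getitem__)
        top_bin = max(range(half, self.bins), key=counts.__getitem__)
        return lo + base_bin / scale, lo + top_bin / scale

    def thresholds(self, low_frac=0.1, high_frac=0.9):
        """Absolute levels for e.g. 10-90% or 20-80% edge measurements."""
        base, top = self.base_top
        swing = top - base
        return base + low_frac * swing, base + high_frac * swing


stats = WaveformStats([0.0] * 20 + [0.5] + [1.0] * 20)
rise_levels = stats.thresholds()  # computes min/max and the histogram
fall_levels = stats.thresholds()  # served entirely from the cache
```

After both calls `stats.computations` is 2 (one min/max, one histogram), and it stays at 2 no matter how many more measurements ask for thresholds.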

2026-01-31 07:44:34 UTC

Another evening of filter refactoring and optimization, a few more nice performance jumps.
I've now done a first pass (remove deprecated method signatures, add explicit input location, add NVTX trace data, do easy GPU optimization if I see an obvious low effort win) on all filters A-F alphabetically plus a few later on that were priorities for one reason or other.
90 down (of which 23 were optimized and the rest just refactored), 115 to go.
Some of the remaining ones should be straightforward duals of ones I've already refactored, e.g. base/top and rise/fall are pretty much inverses of each other so the shaders will mostly be copy-paste with a few signs and conditionals flipped.
Then I can get on to the rest of the v0.2 priorities.
https://files.ioc.exchange/media_attachments/files/115/988/571/057/490/958/original/f28decdc836bee09.png
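
To illustrate the rise/fall duality mentioned above, here is a minimal CPU-side sketch in Python (not the actual shaders). `rise_samples` and `fall_samples` are hypothetical helpers that count whole samples with no interpolation; the point is only that the fall measurement is the rise measurement with the signs flipped and the thresholds swapped.

```python
def rise_samples(samples, low, high):
    """Samples from the last point at or below `low` to the first later
    point at or above `high`. Returns None if no rising edge is found."""
    start = None
    for i, v in enumerate(samples):
        if v <= low:
            start = i  # still at or below the low threshold
        elif v >= high and start is not None:
            return i - start
    return None


def fall_samples(samples, low, high):
    """Dual of rise: negate the samples, then swap and negate the thresholds."""
    return rise_samples([-v for v in samples], -high, -low)


rising = [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0]
falling = list(reversed(rising))
print(rise_samples(rising, 0.1, 0.9), fall_samples(falling, 0.1, 0.9))
```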

2026-01-31 07:35:48 UTC

- reply

Got it down to 50 ms, a 15.8x speedup. I'd still like faster, but this will do for now.

2026-01-31 03:42:36 UTC

- reply

@nprofile…kdgg @nprofile…2fsd does it still work? Don't crack it open if it still runs

2026-01-31 03:07:57 UTC

- reply

@nprofile…2fsd (YIG is opaque iirc, roughly the same appearance as tungsten carbide i think?)