This version has 98% SM occupancy and takes just over 1 ms per buffer; GPU memory write bandwidth utilization is 7%.
Overall application performance is higher (my benchmark went from 7.7-9.0 to 8.6-9.2 WFM/s), with less of the jitter caused by contention between waveform download and the filter graph. It also saved around 100 MB of VRAM that had been used for the staging buffers.
Seems like a pretty clear win all around, and I'll probably want to do similar optimizations on other shaders that have read-once / write-once buffers.