Queue-latency demo frame from a 9K city-scale trajectory

3D Gaussian Splatting Rasterization

PolySplat

Workload-regime-aware CUDA rasterization for robust 3DGS rendering across extreme primitive counts, high resolutions, and interactive 60 Hz latency constraints.

76 benchmark targets
56.5M max Gaussians
9K max long edge
6.48x vs. reference 3DGS
0.00% 60 Hz deadline misses
Abstract

Rasterization tuned for workload regimes

Rasterization dictates the interactive budget of 3D Gaussian Splatting (3DGS). However, the comparative speed of modern CUDA rasterizers is typically evaluated on a narrow canonical benchmark, masking severe regime-dependent performance reversals.

PolySplat introduces a warp-saturating tile-key emitter with adaptive three-way dispatch, asynchronous shared-memory staging in the render kernel, and a persistent kernel architecture with centralized atomic dispatch to bound tail latency. On an extended 76-target benchmark, PolySplat achieves dataset-balanced geometric-mean speedups of 1.26x to 6.48x over state-of-the-art rasterizers at lossless visual quality. Under sustained 60 Hz interaction at extreme resolutions, PolySplat meets 1-vsync display deadlines where existing renderers suffer queue divergence and input-to-photon lag.

Demo

Sustained 60 Hz on a 9K city-scale trajectory

Mill19 Rubble, 3.19M Gaussians, 9216 x 6912 rendering, strict-order 1-vsync queue protocol.

60.00 fps PolySplat
28.13 fps gsplat
0.00% misses
Method

Three CUDA mechanisms target the hidden bottlenecks

Warp-saturating tile-key emission

Adaptive dispatch routes Gaussians into three specialized paths by tile coverage (a single tile, 2-31 tiles, or a larger multi-tile footprint) so that every warp lane emits useful tile tasks.
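A minimal host-side Python model of the three-way routing described above, not the CUDA kernel itself. The 2-31 range comes from the text; treating 32 or more tiles (one full warp) as the large path is an assumption, and the names `route` and `pack_tasks` are hypothetical.

```python
WARP_SIZE = 32  # lanes per CUDA warp

def route(tile_counts):
    """Split Gaussian ids into three paths by the number of tiles each covers."""
    paths = {"single": [], "small": [], "large": []}
    for gid, n in enumerate(tile_counts):
        if n <= 0:
            continue                      # culled: covers no tile
        if n == 1:
            paths["single"].append(gid)
        elif n < WARP_SIZE:               # 2-31 tiles, as in the text
            paths["small"].append(gid)
        else:                             # assumed cutoff: at least one full warp
            paths["large"].append(gid)
    return paths

def pack_tasks(tile_counts, gids):
    """Flatten (gaussian, local tile) tasks into batches of up to 32 lanes."""
    tasks = [(g, t) for g in gids for t in range(tile_counts[g])]
    return [tasks[i:i + WARP_SIZE] for i in range(0, len(tasks), WARP_SIZE)]

tile_counts = [1, 0, 5, 40, 31, 1, 64]   # illustrative per-Gaussian tile counts
paths = route(tile_counts)
batches = pack_tasks(tile_counts, paths["small"])
print(paths)
print([len(b) for b in batches])
```

In this model the small path packs tile tasks rather than Gaussians, so a 5-tile and a 31-tile Gaussian together fill one full 32-lane batch plus a short remainder instead of leaving most lanes idle.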

Asynchronous shared-memory staging

Double-buffered feature staging overlaps irregular global-memory gathers with pixel shading, removing exposed latency from the render critical path.
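The benefit of double buffering can be sketched with a simple timing model in Python. This is an idealized illustration, assuming fixed per-batch load and shading costs in arbitrary time units; the function names and the numbers are hypothetical and are not measurements from PolySplat.

```python
def serial_time(n_batches, load, shade):
    """Each batch is loaded, then shaded, with no overlap."""
    return n_batches * (load + shade)

def double_buffered_time(n_batches, load, shade):
    """Shade batch k from one buffer while batch k+1 loads into the other."""
    if n_batches == 0:
        return 0
    t = load                              # prologue: the first load is exposed
    for k in range(n_batches):
        has_next = k + 1 < n_batches
        # In steady state the slower of the two stages sets the pace;
        # the last batch has nothing left to prefetch.
        t += max(load, shade) if has_next else shade
    return t

n, load, shade = 8, 3, 5                  # illustrative units
print(serial_time(n, load, shade))
print(double_buffered_time(n, load, shade))
```

When shading takes at least as long as a load, only the first load remains exposed in this model, which is the sense in which staging removes gather latency from the critical path.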

Persistent centralized scheduling

Resident blocks pull tiles from an atomic global queue, reducing terminal imbalance in heavy-tailed spatial workloads.
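The scheduling idea can be modeled on the host with a short Python simulation. This is a sketch under stated assumptions: `next_tile` stands in for the atomic global counter, per-tile costs are made up, and the fixed round-robin baseline is a hypothetical comparison point rather than a scheduler the source measures against.

```python
import heapq

def dynamic_makespan(costs, n_workers):
    """Workers pull the next tile index from a shared counter when they go idle."""
    idle_at = [(0, w) for w in range(n_workers)]   # (time free, worker id)
    heapq.heapify(idle_at)
    next_tile = 0                                  # models the atomic counter
    finish = 0
    while next_tile < len(costs):
        t, w = heapq.heappop(idle_at)              # earliest idle worker
        t += costs[next_tile]
        next_tile += 1                             # models atomicAdd(&counter, 1)
        finish = max(finish, t)
        heapq.heappush(idle_at, (t, w))
    return finish

def static_makespan(costs, n_workers):
    """Tiles are assigned round-robin before any work starts."""
    loads = [0] * n_workers
    for i, c in enumerate(costs):
        loads[i % n_workers] += c
    return max(loads)

costs = [9, 1, 1, 1, 9, 1, 1, 1]   # heavy-tailed: two expensive tiles
print(static_makespan(costs, 4))
print(dynamic_makespan(costs, 4))
```

With the fixed assignment both expensive tiles land on the same worker, while pulling from the counter lets idle workers absorb the cheap tiles, shortening the tail.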

Paper Figures

System design and benchmark coverage

Adaptive three-way dispatch routes uneven tile coverage into specialized paths that feed uniform 32-lane task batches.
Async double-buffer pipeline overlaps feature loads with shading once steady state begins.
Persistent tile dispatch keeps resident blocks fed through a centralized counter and reduces terminal scheduling tails.
Results

Robust speedups across all dataset buckets

Dataset-balanced geometric means show PolySplat accelerating the forward pass over FlashGS, gsplat, Flash3DGS, and the INRIA 3DGS reference while preserving visual quality.

1.26x vs. FlashGS
2.27x vs. gsplat
4.58x vs. Flash3DGS
6.48x vs. 3DGS
Per-dataset forward-pass wall-clock on the 76-target benchmark.
Dataset-balanced mean per-kernel time by pipeline stage.
Dataset              Targets  Cams  PolySplat ms  FlashGS ms  gsplat ms  Speedup vs gsplat
dl3dv                      2    40          1.86        2.24       4.20              2.27x
eyeful                     2    40          8.04        8.66      23.66              2.95x
flashgs_data              13   260          1.79        2.21       4.18              2.26x
h3dgs                      4    80          7.26       12.18       9.74              1.37x
llff                       8   160          1.80        2.08       5.04              2.80x
mill19                     2    40          2.89        3.12      10.71              3.70x
nerfstudio                12   240          1.08        1.49       2.29              2.14x
seathru-nerf               4    78          0.76        1.08       1.54              2.03x
tanksandtemples           21   420          1.03        1.62       1.69              1.66x
urbanscene3d               1    20          5.31        5.70      14.00              2.64x
worldengine_navtest        3    60          0.63        0.70       1.52              2.41x
zipnerf                    4    80          0.72        0.95       1.40              1.96x
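As a worked check on the headline number, the following Python sketch takes the geometric mean of the per-dataset "Speedup vs gsplat" values from the table. It assumes that "dataset-balanced" means each dataset contributes one equally weighted value regardless of its target count; that reading is an interpretation, not something the page states.

```python
import math

# Per-dataset speedups vs gsplat, copied from the table above.
speedup_vs_gsplat = {
    "dl3dv": 2.27, "eyeful": 2.95, "flashgs_data": 2.26, "h3dgs": 1.37,
    "llff": 2.80, "mill19": 3.70, "nerfstudio": 2.14, "seathru-nerf": 2.03,
    "tanksandtemples": 1.66, "urbanscene3d": 2.64,
    "worldengine_navtest": 2.41, "zipnerf": 1.96,
}

def geometric_mean(values):
    """exp of the mean log, so each dataset counts once."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

balanced = geometric_mean(speedup_vs_gsplat.values())
print(f"{balanced:.2f}x")
```

Under that assumption the result lands close to the reported 2.27x; a small gap would be expected because the table values are already rounded to two decimals.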
Latency Case Study

Queue stability under sustained interaction

On a 360-frame, 6.0 s Mill19 Rubble trajectory at 9216 x 6912, PolySplat keeps every render below the 16.67 ms 1-vsync budget. gsplat sits above budget on nearly every frame, causing queue divergence and seconds of input-to-photon lag.

Deadline-miss ratio: 0.00% (PolySplat) vs 99.72% (gsplat)
Effective throughput: 60.00 fps (PolySplat) vs 28.13 fps (gsplat)
Time above 100 ms: 0.00% (PolySplat) vs 96.94% (gsplat)
Max input-to-photon: 16.67 ms (PolySplat) vs 3.22 s (gsplat)
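Why a renderer that exceeds the vsync budget diverges, rather than merely running slow, can be illustrated with a small strict-order queue model in Python. This is a sketch under assumptions: render times are constant, inputs arrive once per vsync, a frame is shown at the first vsync at or after it finishes, and the 12 ms and 35 ms render times are hypothetical. The measurement protocol behind the figures above may differ, so the model is not expected to reproduce them.

```python
VSYNC_US = 16_667  # 60 Hz period, rounded to whole microseconds

def input_to_photon_latencies(render_us, n_frames, period_us=VSYNC_US):
    """Strict-order queue: frame i cannot start before frame i-1 finishes."""
    latencies = []
    gpu_free_at = 0
    for i in range(n_frames):
        arrival = i * period_us                    # one input per vsync
        start = max(arrival, gpu_free_at)
        gpu_free_at = start + render_us
        # Present at the first vsync at or after completion (integer ceil).
        photon = -(-gpu_free_at // period_us) * period_us
        latencies.append(photon - arrival)
    return latencies

def miss_ratio(latencies, period_us=VSYNC_US):
    """Fraction of frames shown later than one vsync after their input."""
    return sum(lat > period_us for lat in latencies) / len(latencies)

fast = input_to_photon_latencies(12_000, 360)      # hypothetical: under budget
slow = input_to_photon_latencies(35_000, 360)      # hypothetical: over budget
print(max(fast), miss_ratio(fast))
print(slow[0], slow[-1], miss_ratio(slow))
```

In this model the under-budget renderer holds a constant one-vsync latency, while the over-budget renderer falls further behind with every frame because the queue never drains.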
Resources

Self-contained folder for GitHub Pages

The site includes converted web figures, the demo MP4, and an anonymous code link for paper review. It can be served as plain static files; the paper and data links are reserved for the public release.