Designing a Shepard-tone test for time-stretch / pitch-shift quality

How we built a spectral-envelope-preservation metric for MondoLoop's DSP bench using Shepard tones, FFT peak picking, and a log-frequency Gaussian fit — and why the first version's 69-cent "error" was actually the test's fault, not the stretcher's.

Code and reproducer commands: the benchmark harness is open-source at github.com/syshlted/timepitch-bench (Apache-2.0). Every test in this post has an exact one-line reproducer in docs/USAGE.md. Independent results welcome — see CONTRIBUTING.md.

Why a new test

MondoLoop’s DSP bench already compared Signalsmith Stretch, SoundTouch, and Rubber Band on sine, sweep, impulse, and white-noise signals, reporting CPU, peak RSS, and pitch error in cents. The trouble: on pure sines, all three libraries land within about a cent of the target. Quality on synthetic tones is essentially equivalent, so the existing battery couldn’t tell us anything about how the libraries actually colour music.

We needed a signal that exercises the whole spectrum at once: many partials with a known amplitude shape, rather than a single tone.

The Shepard tone, weaponised

The idea: octave-spaced partials, swept continuously upward at 0.5 oct/sec, under a fixed Gaussian envelope in log-frequency (center = log₂(500 Hz), sigma = 2 octaves). For an ideal stretcher this gives clean predictions:

  • Time-stretch: instantaneous partial frequencies at any equivalent input position should be preserved; only the sweep rate changes.
  • Pitch-shift by R: every partial multiplied by R; octave structure preserved.
  • FFT check: averaged output spectrum should show peaks at octave spacing.

Figure: The test signal in log-frequency space. The shaded curve is the fixed Gaussian envelope; the red stems are the discrete octave-spaced partials, with amplitudes sampled from the envelope at each octave.

Closed-form phase for the exponential sweep keeps things numerically clean:

φ(t) = 2π f₀ (2^(rate·t) − 1) / (rate·ln 2)

Here f₀ is a partial's starting frequency in Hz and rate is the sweep rate in oct/sec, so the instantaneous frequency is f₀·2^(rate·t).

Figure: Spectrogram of the swept signal. Each partial rises linearly on the log-frequency axis, giving parallel ramps spaced one octave apart, while the envelope keeps mid-band partials loudest throughout.
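As a rough illustration of the closed-form phase, here is a minimal Python sketch of one output sample of the swept signal. The function names, the 31.25 Hz base partial, and the partial count are assumptions made for the example, not values from the harness, and the sketch does not wrap partials around once they leave the band. The rate = 0 branch uses the limit of the formula, 2π·f₀·t, because the closed form divides by rate.

```python
import math

def sweep_phase(t, f0, rate):
    """Phase (radians) at time t of a partial that starts at f0 Hz and rises at `rate` oct/sec."""
    if rate == 0.0:
        # Limit of the closed form as rate -> 0: a plain stationary sine.
        return 2.0 * math.pi * f0 * t
    return 2.0 * math.pi * f0 * (2.0 ** (rate * t) - 1.0) / (rate * math.log(2.0))

def instantaneous_freq(t, f0, rate):
    """Frequency (Hz) of the same partial at time t."""
    return f0 * 2.0 ** (rate * t)

def envelope_gain(freq_hz, center_hz=500.0, sigma_oct=2.0):
    """Fixed Gaussian envelope in log2-frequency."""
    d = (math.log2(freq_hz) - math.log2(center_hz)) / sigma_oct
    return math.exp(-0.5 * d * d)

def shepard_sample(t, base_hz=31.25, n_partials=9, rate=0.5):
    """One sample of the signal: octave-spaced partials under the fixed envelope.

    base_hz and n_partials are illustrative choices only.
    """
    total = 0.0
    for k in range(n_partials):
        f0 = base_hz * 2.0 ** k
        gain = envelope_gain(instantaneous_freq(t, f0, rate))
        total += gain * math.sin(sweep_phase(t, f0, rate))
    return total
```

Differentiating sweep_phase with respect to t and dividing by 2π gives instantaneous_freq back, which is a quick sanity check on the formula.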

Implementation landed across signals.{h,cpp} (the synthesis), fft.{h,cpp} (a parabolic-interpolated peak picker ranked by magnitude), and an extended QualityReport carrying input/output peak lists, median adjacent-octave ratio, and observed pitch ratio.
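For readers unfamiliar with parabolic peak interpolation, the idea can be sketched in a few lines of Python. This is not the code in fft.{h,cpp}; the function names and the choice to interpolate whatever magnitude values are passed in (linear or log) are assumptions for illustration. A parabola is fitted through a local maximum and its two neighbours, and the vertex gives a sub-bin frequency and a refined magnitude.

```python
def refine_peak(mags, k, bin_hz):
    """Refine local maximum at bin k to (freq_hz, magnitude) via a 3-point parabola."""
    a, b, c = mags[k - 1], mags[k], mags[k + 1]
    denom = a - 2.0 * b + c
    # Vertex offset in bins, in [-0.5, 0.5] for a true local maximum.
    offset = 0.0 if denom == 0.0 else 0.5 * (a - c) / denom
    peak = b - 0.25 * (a - c) * offset
    return (k + offset) * bin_hz, peak

def pick_peaks(mags, bin_hz, max_peaks):
    """Return up to max_peaks refined peaks, strongest first."""
    peaks = []
    for k in range(1, len(mags) - 1):
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            peaks.append(refine_peak(mags, k, bin_hz))
    peaks.sort(key=lambda p: p[1], reverse=True)
    return peaks[:max_peaks]
```

Interpolating on log-magnitudes is a common choice because many analysis windows have a main lobe that is close to parabolic in dB, but which scale the harness uses is not stated here.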

First run: two surprises

Octave preservation was excellent across all three libraries (median adjacent-peak ratio 1.999–2.001 everywhere). But the pitch-ratio detection produced two confusing results.

The +1-octave shift looked broken. Output partials coincide exactly with the input partials one octave above, so the nearest-input-peak matcher snapped to a ratio of ~1, not 2. That’s the Shepard illusion working as designed: a pitch shift of an exact octave on this signal is spectrally indistinguishable from identity. Lesson: probe pitch with non-octave shifts.
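The octave ambiguity is easy to reproduce on idealised peak lists. The Python sketch below assumes a matcher that pairs each output peak with the nearest input peak in log-frequency and reports the median ratio; the harness's matcher may differ in detail, and the partial frequencies are made up for the example.

```python
import math
import statistics

def observed_ratio(input_peaks_hz, output_peaks_hz):
    """Median ratio of each output peak to its nearest input peak (nearest in log2-frequency)."""
    ratios = []
    for f_out in output_peaks_hz:
        nearest = min(input_peaks_hz, key=lambda f_in: abs(math.log2(f_out / f_in)))
        ratios.append(f_out / nearest)
    return statistics.median(ratios)

# Idealised octave-spaced partials (illustrative values).
partials = [62.5 * 2.0 ** k for k in range(8)]
octave_up = [f * 2.0 for f in partials]      # exact-octave shift
fourth_up = [f * 1.3348 for f in partials]   # non-octave shift

print(observed_ratio(partials, octave_up))   # snaps to the partial one octave up: ratio 1
print(observed_ratio(partials, fourth_up))   # recovers a ratio close to 1.3348
```

With this kind of matcher, any shift larger than half an octave (ratio above √2) also snaps to the neighbouring partial, so probe ratios are best kept below that.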

Signalsmith showed a 69-cent error at pitch ×1.3348. Suspicious: the existing sine pitch test at the same ratio puts it at 1.02 cents, nowhere near 69. What went wrong?

Cross-checked against the sine baseline:

Library       Error (cents) on sine, ×1.3348
signalsmith   1.02
soundtouch    0.17
rubberband    0.16

So signalsmith’s pitch accuracy is fine. The 69 cents came from comparing mid-windows of a sweeping signal across stretchers with different algorithmic latencies. Signalsmith’s 60 ms latency means its mid-output window corresponds to a slightly earlier point in the input sweep than the other libraries’. At 0.5 oct/sec, even small time misalignments turn into tens of cents.
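Converting a timing offset into an apparent pitch error is simple arithmetic, sketched here in Python with hypothetical helper names that are not part of the harness. At 0.5 oct/sec the sweep covers 600 cents per second, so 60 ms of misalignment is worth about 36 cents on its own, and a 69-cent discrepancy corresponds to roughly 115 ms of total offset between the compared windows.

```python
def sweep_offset_cents(time_offset_s, sweep_rate_oct_per_s=0.5):
    """Apparent pitch error (cents) caused by analysing a sweep time_offset_s too early or late."""
    return time_offset_s * sweep_rate_oct_per_s * 1200.0

def offset_for_cents(cents, sweep_rate_oct_per_s=0.5):
    """Timing offset (seconds) that would explain a given apparent error on the sweep."""
    return cents / (sweep_rate_oct_per_s * 1200.0)

print(sweep_offset_cents(0.060))  # about 36 cents for 60 ms
print(offset_for_cents(69.0))     # about 0.115 s for 69 cents
```

The sketch only converts between time and cents; it does not model where the harness places its analysis windows.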

Stationary mode

Solution: add --shepard-sweep-rate <oct/sec> (default 0.5). Passing 0 produces stationary partials, eliminating sweep-vs-latency interaction entirely. Stationary results:

Library       identity   pitch ×1.3348      pitch ×2.0   time ×1.5
signalsmith   1.0000     1.3364 (+2¢)       1.0005       1.0000
soundtouch    1.0000     1.3348 (perfect)   1.0001       1.0000
rubberband    1.0000     1.3348 (perfect)   1.0000       1.0000

Confirmed: the earlier 69-cent error was a sweep-vs-latency artifact of the test, not a quality issue. What remains for Signalsmith in stationary mode is about 2 cents.

The envelope-preservation metric

The Shepard tone’s amplitude envelope is Gaussian in log-frequency by construction:

amplitude_k = exp(-½·((log₂ f_k − log₂ 500) / 2)²)

After a pitch shift R, the envelope should be centered at log₂(500·R) with sigma still ≈ 2 octaves. A pure time-stretch should leave it unchanged.

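Fitting ln(magnitude) = A + B·log₂(f) + C·log₂(f)² to the detected peaks via least squares and matching coefficients against the Gaussian gives C = −1/(2σ²) and B = μ/σ², so μ = −B/(2C) and σ = √(−1/(2C)). The natural log matters here: with a different log base the σ formula picks up a constant factor. The result is a clean characterisation of how each stretcher preserves spectral balance.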

Figure: The fit in action. Red points are detected output peaks (log-magnitude versus log₂ frequency); the blue curve is the quadratic least-squares fit. The recovered μ (green dashed line) sits where the parabola peaks; ±σ (orange band) is its half-width. Any drift from the input’s μ is spectral coloration.
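A minimal Python sketch of such a fit, using only the standard library, might look like the following. It is an illustration under stated assumptions rather than the harness's implementation: it takes the natural log of the magnitudes, centres log₂(f) before building the normal equations to keep them well conditioned, and needs at least three peaks. The function names are hypothetical.

```python
import math

def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / a[r][r]
    return x

def fit_log_gaussian(freqs_hz, mags):
    """Fit ln(mag) = A + B*u + C*u^2 with u = log2(f) - mean; return (mu, sigma) in log2 units."""
    xs = [math.log2(f) for f in freqs_hz]
    ys = [math.log(m) for m in mags]
    x0 = sum(xs) / len(xs)
    us = [x - x0 for x in xs]
    # Power sums for the least-squares normal equations.
    s = [sum(u ** p for u in us) for p in range(5)]
    t = [sum(y * u ** p for u, y in zip(us, ys)) for p in range(3)]
    normal = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    _, b, c = solve3(normal, t)
    if c >= 0.0:
        raise ValueError("fit is not concave, so there is no Gaussian peak")
    mu = x0 - b / (2.0 * c)           # vertex, shifted back to uncentred log2(f)
    sigma = math.sqrt(-1.0 / (2.0 * c))
    return mu, sigma
```

On noise-free peaks generated from the envelope formula, the fit returns μ = log₂(500) and σ = 2 to within rounding; real FFT peaks will deviate, which is the bias discussed next.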

First results were promising, but absolute centers were biased ~1 Hz low at identity (FFT-window amplitude bias on edge peaks affects the fit). Sigma was biased ~0.05 oct low for the same reason. Critically, the bias was consistent across all three stretchers, which suggested it would cancel in a relative measurement.

Input-relative comparison

Fit the same Gaussian to the input peaks too. Report:

  • center_error_cents = (observed_shift_oct − expected_shift_oct) · 1200
  • sigma_error = output_sigma − input_sigma

By comparing output to input rather than to theory, the analysis-side bias cancels. Identity errors dropped from 1–1.5 Hz on the absolute center (roughly 3.5–5 cents at 500 Hz) to 0.2–1.4 cents input-relative across all three libraries. Clean separation of stretcher-side spectral coloration finally emerged.
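The two reported quantities can be written out as a short Python sketch. The function name and argument layout are assumptions for illustration; μ and σ are the fitted center and width in log₂-frequency units (octaves), and the expected shift is log₂ of the requested pitch ratio (zero for identity or a pure time-stretch).

```python
import math

def envelope_errors(mu_in, sigma_in, mu_out, sigma_out, pitch_ratio=1.0):
    """Input-relative envelope errors: (center_error_cents, sigma_error_oct)."""
    observed_shift_oct = mu_out - mu_in
    expected_shift_oct = math.log2(pitch_ratio)
    center_error_cents = (observed_shift_oct - expected_shift_oct) * 1200.0
    sigma_error = sigma_out - sigma_in
    return center_error_cents, sigma_error

# A constant analysis bias added to both fits drops out of the result.
bias = -0.004  # octaves, illustrative
clean = envelope_errors(8.9658, 2.0, 8.9658 + math.log2(1.3348) + 0.001, 2.01, 1.3348)
biased = envelope_errors(8.9658 + bias, 2.0 - 0.05, 8.9658 + math.log2(1.3348) + 0.001 + bias, 2.01 - 0.05, 1.3348)
```

Both calls give a center error of about 1.2 cents and a sigma error of about 0.01 octaves, which is the cancellation the input-relative comparison relies on.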

When to use each mode

  • Stationary Shepard (--shepard-sweep-rate 0) for envelope-precision work. Tightest, lowest-noise metric. Use this to compare libraries.
  • Sweeping Shepard (default 0.5 oct/sec) for perceptual stress and transient-coherence testing. Noisier, but probes window-vs-sweep interaction in a way that mimics real material.

The actual library comparison — Rubber Band vs SoundTouch vs Signalsmith — is the next post.
