The deepSTRF metrics paradigm

This note documents how deepSTRF computes losses and performance metrics on predictions produced by encoding models, and what invariants any metric function must respect. It is the third leg of the data/model/metrics contract, alongside data_paradigm.md and model_paradigm.md, and is the single reference for the public functional API in deepSTRF.metrics.

1. Scope: a functional, NaN-aware, single-axis API

deepSTRF.metrics ships pure functions that map predictions and ground truth onto either a per-neuron tensor (reduction='none') or a single scalar (reduction='mean'|'sum'). Two flavours:

Flavour	Examples	Gradient?	NaN handling
Losses	`mse_loss`, `poisson_loss`	yes	Internal: NaN-positions are dropped before reducing
Metrics	`corrcoef`, `normalized_corrcoef`, `fve`, `signal_power`, `noise_power`, `snr`, `coherence`	no — `@torch.no_grad()`	Internal, same as losses

Out of scope for v1 (deferred to follow-ups, listed at the end): torchmetrics-style stateful module API, feature_decorrelation, pop_corrcoef. Anything population-aggregated. Cross-batch accumulators.

2. Input shape conventions

The metrics API matches the data and model contracts exactly. There is no “reshape your tensors before calling a metric” step.

Tensor	Shape	Source
`pred`	`(B, N, 1, T)`	model `forward()` output (`R = 1` axis from `model_paradigm.md` §3)
`responses`	`(B, N, R, T)`	dataloader output, NaN-padded
`gt_psth`	`(B, N, 1, T)`	typically `responses.nanmean(dim=2, keepdim=True)`
`valid_mask`	`(B, N, R, T)` bool	dataloader output, derived from `~responses.isnan()`

The R-axis is the only friction point. Most metrics need a PSTH (one ground-truth time series per neuron); the repeat-aware metrics (signal_power, noise_power, snr, normalized_corrcoef) need the raw repeats. The two functional signatures handle this without a class hierarchy:

# prediction-vs-PSTH metrics: gt may be the PSTH (B, N, 1, T) OR raw responses (B, N, R, T).
# When R > 1 the metric collapses to PSTH internally via nanmean(dim=2, keepdim=True).
corrcoef(pred, gt, mask=None, reduction='mean')                   # (B, N, 1, T) × (B, N, 1|R, T)
fve(pred, gt, mask=None, reduction='mean')                        # (B, N, 1, T) × (B, N, 1|R, T)
mse_loss(pred, gt, mask=None, reduction='mean')                   # (B, N, 1, T) × (B, N, 1|R, T)
poisson_loss(pred, gt, mask=None, reduction='mean')               # (B, N, 1, T) × (B, N, 1|R, T)

# repeat-aware metrics: ground truth keeps its R dimension.
signal_power(responses, mask=None, reduction='mean')              # (B, N, R, T)
noise_power(responses, mask=None, reduction='mean')               # (B, N, R, T)
snr(responses, mask=None, reduction='mean')                       # (B, N, R, T)
normalized_corrcoef(pred, responses, method='schoppe',            # (B, N, 1, T) × (B, N, R, T)
                    mask=None, reduction='mean')

# eval-only frequency-domain metric: needs a regular grid, NaN-free, PSTH-only
coherence(pred, gt_psth, dt_ms, reduction='mean')                  # (B, N, 1, T) × (B, N, 1, T)

The R = 1 singleton on pred mirrors what the model emits. We do not silently squeeze it — that would break a future probabilistic model that populates the axis with multiple samples.

Auto-PSTH collapse on the `gt` argument

The four prediction-vs-PSTH metrics (mse_loss, poisson_loss, corrcoef, fve) accept gt in either canonical shape:

(B, N, 1, T) — pre-computed PSTH or a deliberately chosen single-trial target — used as-is.
(B, N, R, T) with R > 1 — raw responses — collapsed to PSTH internally via nanmean(dim=2, keepdim=True) before the formula is applied.

This means the canonical training step is

loss = mse_loss(pred, responses)            # one line, auto-PSTH

with no caller-side responses.nanmean(...) boilerplate. NaN-padded missing-trial slabs flow through the nanmean cleanly: a position where every repeat is NaN comes out NaN at the collapsed PSTH and is dropped by the metric’s NaN-derived mask (§4). When gt is already (B, N, 1, T) the collapse is a no-op, so passing a custom target (e.g. for a non-PSTH fitting objective) keeps working unchanged.

The repeat-aware metrics (signal_power, noise_power, snr, normalized_corrcoef) and coherence do not auto-collapse: the first three need the per-trial structure for their formulas, and coherence is documented as PSTH-only and NaN-intolerant (§6.8).

3. The reduction over time happens inside

Every metric computes one scalar per neuron by reducing across the entire batch and time axis, in a way that depends on the metric’s mathematical definition. The user does not pick this. Specifically:

corrcoef, fve, normalized_corrcoef, coherence, signal_power, noise_power, snr all flatten (B, T) (and R, when present) into one long pseudo-time series per neuron, and compute their formula on that. Boolean indexing with valid_mask makes this safe even when stim durations vary or some neurons are uncorded for some stims.
mse_loss and poisson_loss compute the per-element residual, then apply reduction over the valid positions only.

This matches Schoppe (2016)’s original derivation, which assumes one concatenated long signal per neuron. Per-stim CC values that are then averaged across stims (the “Pennington recipe”, §6.4) are a separate quantity; we do not ship it in v1, and document the difference if a future follow-up adds it.

The takeaway: the time axis is collapsed by the metric’s formula, not by reduction. What reduction controls is the neuron axis only.

4. NaN handling

The data paradigm uses NaN as the universal sentinel for missingness (data_paradigm.md §3–4). All metrics in deepSTRF.metrics are NaN-aware by default and require no preprocessing:

loss = mse_loss(pred, gt_psth)         # picks up NaNs from gt_psth automatically
cc   = corrcoef(pred, gt_psth)         # same
sp   = signal_power(responses)         # NaN-padded repeats handled internally

Internally, each metric derives valid = ~gt.isnan() (or the analogous mask on responses) and uses boolean indexing per-neuron — i.e. for neuron n:

pred_n = pred[..., n, :, :][valid[..., n, :, :]]   # 1-D
gt_n   = gt  [..., n, :, :][valid[..., n, :, :]]   # 1-D
result[n] = formula(pred_n, gt_n)

This is the same recipe data_paradigm.md §6 recommends for the training loop. Per-neuron flattening keeps every metric robust to:

Uncorded (stim, neuron) pairs (full-NaN slabs in responses).
Variable T_s across stims (right-NaN-padded by neural_collate).
Variable R_{s, n} across neurons within a stim (right-NaN-padded).

`mask=` override

For cases where the canonical “valid iff not NaN” rule is not what the user wants — for example, masking out the first 50 ms of every stim to exclude onset transients — every metric accepts an optional mask kwarg:

mask = ~gt_psth.isnan() & (time_idx >= onset_offset)
cc   = corrcoef(pred, gt_psth, mask=mask)

mask must be a bool tensor broadcastable to the ground-truth tensor’s shape. If mask=None (the default), the metric falls back to ~gt.isnan(). If mask is provided, it replaces (not augments) the NaN-derived mask: the user is now responsible for keeping the NaN-positions out — typically by &ing them in as shown above. We document this explicitly because the alternative (“mask is always intersected with ~isnan”) is harder to reason about when the user deliberately wants to mark NaN positions as valid (impossible — they would contaminate the result; we do not protect against this).

Loud failure on misuse

If the user passes a mask that marks a NaN position as valid, the metric’s per-element computation is performed at that position ((pred − NaN)**2 = NaN, gt · log(pred + ε) with gt = NaN = NaN, etc.). NaN propagates through the per-neuron sum into the per-neuron output. The result is then a visibly NaN per-neuron value — not a silently-dropped or zero-contaminated number. This matches the loud-failure ethos of the data paradigm (data_paradigm.md §4): users spot misuse via NaN in their reported metric, not via subtle drift.

Implementation note: this is achieved by masked_fill(~valid, 0.0) → sum / count rather than nanmean, which would silently drop NaN.

Boolean indexing vs multiplicative masking

Two natural ways to compute a masked reduction:

# (a) Boolean indexing — flat 1-D vector of valid positions
ssr = ((pred[valid] - gt[valid]) ** 2).mean()

# (b) Multiplicative masking — preserves axis structure
per_element = (pred - gt.nan_to_num(0.0)) ** 2 * valid.float()
per_axis    = per_element.sum(dim=AXIS) / valid.float().sum(dim=AXIS)

The shipped metrics use boolean indexing internally for the per-neuron formula, because:

The flat 1-D form keeps cov / var / Pearson correlation textbook-clean (no per-row .nansum() denominators).
It is numerically robust against 0 · NaN = NaN traps that multiplicative masking can hit when NaN has leaked into pred.

Multiplicative masking is the right choice when you need to preserve the (B, N, T) axis structure for further aggregation (e.g. per-neuron losses with custom batch reduction outside the metric API). In that case, work directly with the dataloader’s valid_mask rather than calling a deepSTRF metric — the metrics are designed to terminate at per-neuron scalars.

5. Reduction semantics (PyTorch convention)

Every public metric has a reduction keyword with three values, matching torch.nn.functional.mse_loss and friends:

`reduction`	Returned shape	Meaning
`'none'`	`(N,)`	one value per neuron
`'mean'`	`()` (scalar)	mean across neurons
`'sum'`	`()` (scalar)	sum across neurons

Reduction is over the neuron axis only. Reduction over time is decided by the metric’s mathematical formula and is never controllable via reduction (§3 above).

Edge case: when one or more neurons have zero valid positions under the chosen mask (e.g. no val data for a particular cell), the per-neuron result is NaN. Under 'mean' and 'sum' we use torch.nanmean / torch.nansum so the scalar reduction stays well-defined. Under 'none', the NaN is preserved — the caller can see which cells were dropped.

6. Per-metric definitions

Each metric is documented with its formula, shape contract, edge cases, and reference. Where the existing implementation has a known bug or band-aid (abs(ttrc), capitalized 'None', etc.), the rewrite either fixes or removes the band-aid; this section is the authoritative spec.

6.1 `mse_loss(pred, gt, mask=None, reduction='mean')`

Boolean-masked mean squared error.

For each neuron, take the per-element squared residual, drop positions where mask is False, and reduce.

mse_n = mean_{(b,t) ∈ valid_n}  (pred[b, n, 0, t] - gt[b, n, 0, t])²

reduction='none': returns (N,).
reduction='mean': scalar, nanmean(mse_n, dim=0).
reduction='sum': scalar, nansum(mse_n, dim=0).

Differentiable. No @torch.no_grad.

6.2 `poisson_loss(pred, gt, mask=None, reduction='mean', log_input=False, validate_input=False, eps=1e-8)`

Negative Poisson log-likelihood, dropping the log(gt!) term. Two parameterisations of the prediction are supported, matching the canonical-link logic of generalised linear models:

`log_input=False` (default): pred is the rate `λ`

poisson_n = mean_{(b,t) ∈ valid_n}  pred[b, n, 0, t]  -  gt[b, n, 0, t] · log(pred[b, n, 0, t] + ε)

Requires pred ≥ 0 for the log to be meaningful. The implementation silently clamps pred to ≥ ε inside the log to prevent NaN from propagating when the model temporarily emits a small negative prediction. The linear pred term outside the log keeps its sign, so the gradient still pushes negative predictions toward positive values. With validate_input=True, the loss instead raises ValueError on the first negative prediction at a masked-in position — useful for debugging an output-activation bug, but adds a per-step CPU sync, so it is opt-in.

`log_input=True`: pred is the log-rate `η = log(λ)`

poisson_n = mean_{(b,t) ∈ valid_n}  exp(pred[b, n, 0, t])  -  gt[b, n, 0, t] · pred[b, n, 0, t]

Derivation: starting from the Poisson NLL λ − gt · log(λ) and substituting λ = exp(η), the loss becomes exp(η) − gt · η, well-defined for any real-valued η. This is the canonical link for the Poisson distribution in GLM language — same role that the logit link plays for Bernoulli. Practically: pair log_input=True with any readout (Linear, ParametricSigmoid, etc.) and read the model output as log-rate. To recover rate for visualisation or downstream metrics expecting rate (e.g. corrcoef against a count-valued PSTH), exp(pred) first.

Stirling term

The full Poisson NLL has a gt · log(gt) − gt + 0.5·log(2π·gt) term (Stirling’s approximation to log(gt!)). It is not added by default because:

It is constant in pred, so it does not affect optimisation.
It is meaningless when gt is non-integer (e.g. trial-averaged PSTH binned counts — the typical deepSTRF target). log(gt!) is defined only on non-negative integers.

Users who want the full likelihood for AIC/BIC model comparison on integer count targets can add the term outside the loss.

Pairing with output activations

Output activation	Recommended `log_input`	Why
`nn.Softplus`	`False`	Output already `> 0`; rate interpretation natural
`ParametricSoftplus` with `non_negative_output=True` (default)	`False`	Same; default zoo activation (NS1, NAT4, …)
`ParametricSigmoid` with `non_negative_output=True`	`False`	Same
`ParametricDoubleExponential` with `non_negative_output=True`	`False`	Same
`nn.Identity` / Linear	`True`	Output unbounded; treat as log-rate
`nn.Sigmoid` (output in [0, 1])	`False`	Output already `> 0`
Any with `non_negative_output=False`	`True`	Output may go negative

The default log_input=False matches PyTorch’s torch.nn.PoissonNLLLoss(log_input=False) and the historical deepSTRF behaviour.

6.3 `corrcoef(pred, gt, mask=None, reduction='mean')`

Pearson correlation coefficient between prediction and PSTH, computed per-neuron over all valid positions:

ρ_n = corr(pred_n, gt_n)

where pred_n, gt_n are the 1-D vectors of valid (b, t) positions for neuron n (§4). Uses torch.corrcoef, denominator regularized with a small floor to avoid 0/0 on constant predictions.

mean / sum reduce over N.
A neuron with zero variance in either pred_n or gt_n returns NaN for that index — propagation rules from §5 apply.

6.4 `normalized_corrcoef(pred, responses, method='schoppe', mask=None, reduction='mean')`

Noise-corrected correlation coefficient. Two formulations are shipped; they target the same quantity with different bias profiles.

The shape contract differs from corrcoef: ground truth is the raw repeat-bearing tensor (B, N, R, T), not the PSTH. The PSTH is computed internally as responses.nanmean(dim=2, keepdim=True), but the per-trial information is needed for the noise correction.

`method='schoppe'`

Schoppe et al. 2016, “Measuring the Performance of Neural Models”, Frontiers in Neuroscience.

CCnorm_n = cov(pred_n, psth_n) / sqrt( var(pred_n) · SP_n )

where pred_n, psth_n are the 1-D vectors of valid (b, t) positions for neuron n (concatenated across stims via boolean indexing, §3), and SP_n is the Sahani–Linden signal power for neuron n (§6.5, length-weighted across stims). The aggregation convention for the denominator is chosen to match the natural sample-count weighting that the numerator’s concatenated cov and var perform automatically — see §11. Replaces the noise-inflated var(psth) in the denominator of the raw Pearson correlation with the unbiased signal-only variance estimator. Cells with SP_n ≤ 0 (the noise-floor regime where the unbiased estimator fails) return NaN.

`method='hsu'`

Hsu, Borst & Theunissen 2004, with the CCmax formulation popularized by Schoppe et al. 2016.

CCnorm_n     = corr(pred_n, psth_n) / CCmax_n
CCmax_{b,n}  = sqrt( 2 · ρ_half_{b,n} / (1 + ρ_half_{b,n}) )           (Spearman–Brown, per stim)
ρ_half_{b,n} = E_{i,j}[ corr( psth_half_{i,b,n}, psth_half_{j,b,n} ) ] (over disjoint half-trial splits)
CCmax_n      = sum_b ( T_b · CCmax_{b,n} ) / sum_b T_b                 (length-weighted, §11)

Computed per-stim by sampling up to ccmax_iters=126 disjoint half-trial pairs, then length-weighted across stims with T_b as the weight. Same robustness convention as signal_power (§6.5). Per-stim cells with ρ_half ≤ 0 (worse-than-chance ceiling — too noisy to estimate) are dropped from the weighted average rather than contaminating it.

Single-trial degenerate case

When R = 1 (or every cell of responses[..., n, :, :] has only one valid repeat after masking), the noise-correction factor is undefined: both SP_n and CCmax_n would require ≥ 2 trials. In that case normalized_corrcoef returns the raw corrcoef(pred, psth) for that neuron. This is a worst-case assumption: with no information about trial-to-trial variability, we cannot know how much of the prediction gap is noise vs model error, so we report the un-corrected number. It is the same convention used by the existing implementation and by Schoppe (2016) §3.5.

Pennington 2023 — not shipped in v1

The Pennington & David (2023) noise-corrected prediction correlation divides by sqrt(TTRC) where TTRC is the average pairwise trial-pair correlation. The existing implementation includes this as pennington_prediction_correlation, with an abs(ttrc) band-aid that the audit flags as wrong. Re-derivation: TTRC can be slightly negative in the low-SNR regime and Pennington’s paper is silent on what to do; the field’s typical fix is to clip-to-ε with a warning, but this changes the reported number for poorly-driven cells in a way that has been benchmarked in deepSTRF. We defer the Pennington method to a follow-up rather than ship a rushed fix.

6.5 `signal_power(responses, mask=None, reduction='mean')`

Sahani & Linden 2003, NIPS, “How linear are auditory cortical responses?”.

For one (b, n) cell with R valid repeats, the variance of the multi-trial response decomposes as

TP_{b,n} = E_r [ var_t( y_{r, b, n}(t) ) ]            = SP_{b,n} + NP_{b,n}             (single-trial variance)
           var_t( E_r[ y_{r, b, n}(t) ] )             = SP_{b,n} + NP_{b,n} / R          (variance of the PSTH)

Solving for SP_{b,n}:

SP_{b,n} = ( R · var_t(psth_{b,n}) - TP_{b,n} ) / (R - 1)

The per-neuron quantity is the length-weighted average of the per-stim estimates:

SP_n = sum_b ( T_b · SP_{b,n} ) / sum_b T_b

where T_b is the number of valid time bins for stim b. This is the BLUE estimator under the assumption that every per-stim SP_{b,n} is unbiased: weighting by sample count gives the minimum-variance combiner. It also matches the natural sample weighting that cov and var apply in the numerator of corrcoef / normalized_corrcoef (which run over the full concatenated (b, t) series), so a 500 ms clip drives the denominator the same way it drives the numerator. See §11 for the derivation.

A given (b, n) cell is included iff it has ≥ 2 valid repeats and ≥ 2 valid time bins. Cells without any qualifying stim return NaN under reduction='none'.

6.6 `noise_power(responses, mask=None, reduction='mean')`

NP_{b,n} = TP_{b,n} - SP_{b,n}
NP_n     = sum_b ( T_b · NP_{b,n} ) / sum_b T_b

Same length-weighted aggregation as signal_power.

6.7 `snr(responses, mask=None, reduction='mean')`

SNR_n = SP_n / NP_n

Returns +inf for noiseless cells (NP_n = 0) — caller decides whether to filter.

6.8 `coherence(pred, gt, dt_ms, reduction='mean')`

Magnitude-squared coherence between prediction and PSTH per neuron, as a scalar per neuron computed as the mean over frequency bins. Uses scipy.signal.coherence internally, so this is eval-only (no gradient).

Coherence requires a regular time grid and does not tolerate NaN. The caller is responsible for picking a NaN-free subset (e.g. one fully recorded validation stim, or pre-flatten across stims of equal length). We document this explicitly and raise ValueError if NaN appears in the input. Future follow-up: a chunked-coherence variant that handles NaN-padded variable-length stims.

This metric is preserved largely as it was — the existing implementation is correct for its supported shape — but adapted to the new (B, N, 1, T) × (B, N, 1, T) contract.

7. Recommended training-loop pattern

Mirroring data_paradigm.md §6, the canonical loop now looks like:

from deepSTRF.metrics import mse_loss, corrcoef, normalized_corrcoef

for batch in loader:                                             # batch is a dict
    stims, responses = batch['stims'], batch['responses']
    pred = model(stims)                                          # (B, N, 1, T)

    loss = mse_loss(pred, responses)                             # auto-PSTH inside
    loss.backward()
    optimizer.step()

    if val_step:
        cc       = corrcoef(pred, responses, reduction='none')    # (N,)  auto-PSTH
        cc_norm  = normalized_corrcoef(pred, responses,           # (N,)  needs raw repeats
                                       method='schoppe',
                                       reduction='none')

Notes:

No manual masking and no manual PSTH. Pass responses directly to any prediction-vs-PSTH metric; the auto-PSTH collapse (§2) and the NaN-derived mask (§4) are both internal. The dataloader’s valid_mask is informational at this point.
Pass responses.nanmean(dim=2, keepdim=True) explicitly only when you want a non-default target — a custom objective, a single-trial fit, etc. The (B, N, 1, T) shape is treated as authoritative and passed through unchanged.
reduction='none' is the default for val metrics. Per-neuron numbers are usually what you want for diagnostics (best/worst cells, distribution plots). Aggregate yourself with cc.nanmean() if a single number is needed.
Use mask= only when overriding the default NaN rule. The 90 % case (mask = ~isnan) is automatic.

8. Invariants for metric authors

Anyone adding a new metric to deepSTRF.metrics must respect:

Shape signature matches §2. Either (B, N, 1, T) × (B, N, 1, T) for prediction-vs-PSTH metrics, or (B, N, R, T) for repeat-aware metrics. No other input shape is supported in v1.
NaN-aware by default. Derive valid internally; expose mask= for override.
reduction='none'|'mean'|'sum' over the neuron axis. Use nanmean / nansum for mean / sum to handle dropped cells gracefully.
@torch.no_grad() on metrics; no no_grad on losses. Losses participate in backprop; metrics do not. Mixing them up is the single biggest bug pattern in the existing module.
Reduction is over neurons, never over time. Time is collapsed by the metric’s formula. If your formula needs a different time-axis convention, document it explicitly and add a test.
Loud failure on shape mismatch. Raise ValueError with the expected vs received shape; do not auto-broadcast.
Tests under tests/test_metrics.py assert: (a) shape contracts, (b) NaN handling on a synthetic full-NaN slab, (c) reduction semantics, (d) at least one analytic correctness check (e.g. corrcoef of a tensor with itself = 1, signal_power on a pure noise signal ≈ 0).

9. Known bugs in the existing implementation (audit)

This section is the rewrite checklist. The existing metrics/performance.py and metrics/losses.py will be rewritten — not patched in place — to satisfy the contracts above. Issues found during the 2026-04-28 audit:

@torch.no_grad() on losses. NegativePoissonLogLikelihood is correctly a nn.Module (no no_grad), but every “performance” function in performance.py has @torch.no_grad(). Splitting losses vs metrics into two modules makes the contract obvious.
No NaN handling anywhere. Silently produces NaN if any input element is NaN. Critical given the data paradigm.
covariance uses (T - 1) denominator; var(dim=-1) is Bessel-corrected by default. Mixing biased and unbiased estimators in the same formula introduces a small but real bias. Pick one (Bessel) and apply consistently.
reduction='None' (capitalized) in correlation_coefficient and normalized_correlation_coefficient is a typo. Switch to PyTorch convention 'none'|'mean'|'sum'.
pennington_prediction_correlation uses abs(ttrc) — a band-aid for sign issues that shouldn’t appear in a correctly-derived formula. Re-derive (or defer; see §6.4).
compute_CCmax uses 1/sqrt(cchalf²) = 1/|cchalf| instead of 1/cchalf. Same band-aid pattern as #5; equivalent to |ρ_half| in the Spearman–Brown formula. Replace with the textbook form sqrt(2·ρ_half / (1+ρ_half)) and clip-or-NaN low-SNR cells.
fill_missing_repeats is a dataset-side helper that the modernized NaN-sentinel path obsoletes. Drop from metrics.
coherence returns a (B, N, F) tensor that is then averaged over B, not over F — different from the per-neuron-scalar contract here. Adapt to §6.8.
NegativePoissonLogLikelihood computes log(-prediction + 1e-9) — negation inside log is a bug. Should be log(prediction + ε). Also uses 'None' (capitalized) and computes mean over (-2, -1) then over batch, which is 'mean' over all elements — fine for the loss but the reduction semantics are inconsistent with the new convention.

10. Deferred items

These were considered for v1 and explicitly cut. Restoration will re-open this doc.

Module API (torchmetrics-style stateful update() / compute()). Useful for streaming val metrics across an epoch without materializing every batch. The functional API is enough for v1; module API is easy to add later as a thin wrapper.
feature_decorrelation / redundancy-reduction regularizer. Not part of the original deepSTRF scope and uses a different design pattern (forward hooks on model.core).
pop_corrcoef (correlation of population-summed PSTH). Uncommon metric; no immediate use case.
Pennington-style noise-corrected prediction correlation (§6.4). Re-derivation pending; current abs(ttrc) is wrong, simple ε-clip is not vetted.
Per-stim-then-averaged metrics. Some of the literature reports mean_s CC(stim_s) rather than CC(concat_s). Different quantity; not shipped in v1.
Chunked-coherence for NaN-padded variable-length stims (§6.8).

11. Why length-weighting? (BLUE derivation)

Several v1 metrics combine per-stim estimates into a per-neuron number: signal_power, noise_power, snr, and normalized_corrcoef(method='hsu'). The aggregation across stims is length-weighted:

X_n = sum_b ( T_b · X_{b,n} ) / sum_b T_b

with T_b the number of valid time bins for stim b. This subsection records why, since “average across stims” is also a valid choice and the literature is split.

Setup

Let X_{b,n} be a per-stim estimate of some neuron-level quantity X_n (e.g. signal power). Treat X_{b,n} as an unbiased estimator with variance Var(X_{b,n}) ∝ 1 / T_b — i.e. estimator variance shrinks linearly with the number of time samples used. This is the standard behaviour of variance-of-variance estimators on stationary signals: doubling T halves the noise on the variance estimate.

The general weighted estimator

X̂_n = sum_b ( w_b · X_{b,n} ) / sum_b w_b

is unbiased for any positive w_b summing to one. Its variance is minimised — i.e. it is the Best Linear Unbiased Estimator (BLUE) — when w_b ∝ 1 / Var(X_{b,n}) = T_b. So weighting by T_b extracts the most information possible from the per-stim estimates under that variance assumption.

Coherence with the numerator

corrcoef(pred_n, psth_n) and var(pred_n) in the Schoppe formula operate over the concatenated valid (b, t) time series for neuron n. A 500 ms clip contributes 10× the samples of a 50 ms clip and correspondingly drives the numerator’s cov / var 10× harder. If we were to combine per-stim SPs with equal weights instead of length weights, the numerator and denominator would weight stims differently, which makes the resulting CCnorm value awkward to interpret — especially when comparing models on heterogeneous-length validation sets. Length-weighting on the denominator brings everything into agreement.

The equal-weight alternative

mean_b X_{b,n} is also defensible:

Pros: each stim contributes one answer, not one fact. If you trust short-stim estimates as much as long-stim estimates (e.g. your short stims are repeated many more times so the per-stim variance gain compensates), equal weighting reflects that.
Cons: ignores the variance structure described above; short noisy stims can pull the average around.

We pick length-weighting as the default because it composes correctly with the natural sample-count weighting elsewhere in the pipeline. A future kwarg weighting='length' | 'equal' could expose the choice if a clear use case emerges.

Practical consequence

A neuron with two clips — a 500 ms validation stim with SP=10 and a 50 ms validation stim with SP=1 — gets SP_n ≈ (500·10 + 50·1)/550 ≈ 9.18 under length-weighting, vs 5.5 under equal weighting. The length-weighted answer matches what var(psth_n) over the concatenated time series would have given (modulo the 1/(R−1) Bessel correction); the equal-weighted answer would produce an internally inconsistent CCnorm.

12. References

Sahani, M. & Linden, J. F. (2003). “How linear are auditory cortical responses?” Advances in Neural Information Processing Systems (NIPS). — Original signal/noise power decomposition.
Hsu, A., Borst, A. & Theunissen, F. E. (2004). “Quantifying variability in neural responses and its application for the validation of model predictions.” Network: Computation in Neural Systems. — CCmax formulation via Spearman–Brown.
Schoppe, O., Harper, N. S., Willmore, B. D. B., King, A. J. & Schnupp, J. W. H. (2016). “Measuring the performance of neural models.” Frontiers in Computational Neuroscience. — CCnorm (signal-power formulation), reconciliation with Hsu et al.
Pennington, J. R. & David, S. V. (2023). “A convolutional neural network provides a generalizable model of natural sound coding by neural populations in auditory cortex.” PLOS Computational Biology. — TTRC-based noise-corrected prediction correlation.
Gill, P. et al. (2025). “Sound representation methods for spectro-temporal receptive field estimation.” Journal of Computational Neuroscience. — Coherence-information metric (deferred follow-up).