# The deepSTRF metrics paradigm

This note documents how deepSTRF computes losses and performance metrics on
predictions produced by encoding models, and what invariants any metric
function must respect. It is the third leg of the data/model/metrics
contract, alongside [`data_paradigm.md`](data_paradigm.md) and
[`model_paradigm.md`](model_paradigm.md), and is the single reference for
the public functional API in `deepSTRF.metrics`.

## 1. Scope: a functional, NaN-aware, single-axis API

`deepSTRF.metrics` ships **pure functions** that map predictions and ground
truth onto either a per-neuron tensor (`reduction='none'`) or a single
scalar (`reduction='mean'|'sum'`). Two flavours:

| Flavour       | Examples                                           | Gradient? | NaN handling                                         |
|---            |---                                                  |---        |---                                                   |
| **Losses**    | `mse_loss`, `poisson_loss`                          | yes       | Internal: NaN-positions are dropped before reducing  |
| **Metrics**   | `corrcoef`, `normalized_corrcoef`, `fve`, `signal_power`, `noise_power`, `snr`, `coherence` | no — `@torch.no_grad()` | Internal, same as losses                             |

**Out of scope for v1** (deferred to follow-ups, listed at the end):
torchmetrics-style stateful module API, `feature_decorrelation`,
`pop_corrcoef`. Anything population-aggregated. Cross-batch accumulators.

## 2. Input shape conventions

The metrics API matches the data and model contracts exactly. There is no
"reshape your tensors before calling a metric" step.

| Tensor              | Shape                  | Source                                                      |
|---                  |---                      |---                                                          |
| `pred`              | `(B, N, 1, T)`          | model `forward()` output (`R = 1` axis from `model_paradigm.md` §3) |
| `responses`         | `(B, N, R, T)`          | dataloader output, NaN-padded                               |
| `gt_psth`           | `(B, N, 1, T)`          | typically `responses.nanmean(dim=2, keepdim=True)`           |
| `valid_mask`        | `(B, N, R, T)` bool     | dataloader output, derived from `~responses.isnan()`         |

The R-axis is the only friction point. Most metrics need a PSTH (one
ground-truth time series per neuron); the repeat-aware metrics
(`signal_power`, `noise_power`, `snr`, `normalized_corrcoef`) need the raw
repeats. The two functional signatures handle this without a class
hierarchy:

```python
# prediction-vs-PSTH metrics: gt may be the PSTH (B, N, 1, T) OR raw responses (B, N, R, T).
# When R > 1 the metric collapses to PSTH internally via nanmean(dim=2, keepdim=True).
corrcoef(pred, gt, mask=None, reduction='mean')                   # (B, N, 1, T) × (B, N, 1|R, T)
fve(pred, gt, mask=None, reduction='mean')                        # (B, N, 1, T) × (B, N, 1|R, T)
mse_loss(pred, gt, mask=None, reduction='mean')                   # (B, N, 1, T) × (B, N, 1|R, T)
poisson_loss(pred, gt, mask=None, reduction='mean')               # (B, N, 1, T) × (B, N, 1|R, T)

# repeat-aware metrics: ground truth keeps its R dimension.
signal_power(responses, mask=None, reduction='mean')              # (B, N, R, T)
noise_power(responses, mask=None, reduction='mean')               # (B, N, R, T)
snr(responses, mask=None, reduction='mean')                       # (B, N, R, T)
normalized_corrcoef(pred, responses, method='schoppe',            # (B, N, 1, T) × (B, N, R, T)
                    mask=None, reduction='mean')

# eval-only frequency-domain metric: needs a regular grid, NaN-free, PSTH-only
coherence(pred, gt_psth, dt_ms, reduction='mean')                  # (B, N, 1, T) × (B, N, 1, T)
```

The `R = 1` singleton on `pred` mirrors what the model emits. We do *not*
silently squeeze it — that would break a future probabilistic model that
populates the axis with multiple samples.

### Auto-PSTH collapse on the `gt` argument

The four prediction-vs-PSTH metrics (`mse_loss`, `poisson_loss`,
`corrcoef`, `fve`) accept `gt` in either canonical shape:

- `(B, N, 1, T)` — pre-computed PSTH or a deliberately chosen
  single-trial target — used as-is.
- `(B, N, R, T)` with `R > 1` — raw responses — collapsed to PSTH
  internally via `nanmean(dim=2, keepdim=True)` before the formula is
  applied.

This means the canonical training step is

```python
loss = mse_loss(pred, responses)            # one line, auto-PSTH
```

with no caller-side `responses.nanmean(...)` boilerplate. NaN-padded
missing-trial slabs flow through the `nanmean` cleanly: a position where
every repeat is NaN comes out NaN at the collapsed PSTH and is dropped
by the metric's NaN-derived mask (§4). When `gt` is already
`(B, N, 1, T)` the collapse is a no-op, so passing a custom target
(e.g. for a non-PSTH fitting objective) keeps working unchanged.

The repeat-aware metrics (`signal_power`, `noise_power`, `snr`,
`normalized_corrcoef`) and `coherence` do **not** auto-collapse: the
first three need the per-trial structure for their formulas, and
`coherence` is documented as PSTH-only and NaN-intolerant (§6.8).

## 3. The reduction over time happens inside

Every metric computes one scalar per neuron by reducing across the *entire
batch and time axis*, in a way that depends on the metric's mathematical
definition. The user does not pick this. Specifically:

- `corrcoef`, `fve`, `normalized_corrcoef`, `coherence`, `signal_power`,
  `noise_power`, `snr` all flatten `(B, T)` (and `R`, when present) into
  one long pseudo-time series per neuron, and compute their formula on
  that. Boolean indexing with `valid_mask` makes this safe even when stim
  durations vary or some neurons are uncorded for some stims.
- `mse_loss` and `poisson_loss` compute the per-element residual, then
  apply `reduction` over the *valid* positions only.

This matches Schoppe (2016)'s original derivation, which assumes one
concatenated long signal per neuron. Per-stim CC values that are then
averaged across stims (the "Pennington recipe", §6.4) are a *separate*
quantity; we do not ship it in v1, and document the difference if a future
follow-up adds it.

The takeaway: **the time axis is collapsed by the metric's formula, not by
`reduction`.** What `reduction` controls is the *neuron* axis only.

## 4. NaN handling

The data paradigm uses NaN as the universal sentinel for missingness
(`data_paradigm.md` §3–4). All metrics in `deepSTRF.metrics` are
NaN-aware by default and require **no** preprocessing:

```python
loss = mse_loss(pred, gt_psth)         # picks up NaNs from gt_psth automatically
cc   = corrcoef(pred, gt_psth)         # same
sp   = signal_power(responses)         # NaN-padded repeats handled internally
```

Internally, each metric derives `valid = ~gt.isnan()` (or the analogous
mask on `responses`) and uses boolean indexing per-neuron — i.e. for
neuron `n`:

```python
pred_n = pred[..., n, :, :][valid[..., n, :, :]]   # 1-D
gt_n   = gt  [..., n, :, :][valid[..., n, :, :]]   # 1-D
result[n] = formula(pred_n, gt_n)
```

This is the same recipe `data_paradigm.md` §6 recommends for the training
loop. Per-neuron flattening keeps every metric robust to:

- Uncorded `(stim, neuron)` pairs (full-NaN slabs in `responses`).
- Variable `T_s` across stims (right-NaN-padded by `neural_collate`).
- Variable `R_{s, n}` across neurons within a stim (right-NaN-padded).

### `mask=` override

For cases where the canonical "valid iff not NaN" rule is not what the
user wants — for example, masking out the first 50 ms of every stim to
exclude onset transients — every metric accepts an optional `mask` kwarg:

```python
mask = ~gt_psth.isnan() & (time_idx >= onset_offset)
cc   = corrcoef(pred, gt_psth, mask=mask)
```

`mask` must be a bool tensor broadcastable to the ground-truth tensor's
shape. If `mask=None` (the default), the metric falls back to
`~gt.isnan()`. If `mask` is provided, it **replaces** (not augments) the
NaN-derived mask: the user is now responsible for keeping the
NaN-positions out — typically by `&`ing them in as shown above. We
document this explicitly because the alternative ("mask is always
intersected with `~isnan`") is harder to reason about when the user
deliberately wants to mark NaN positions as valid (impossible — they would
contaminate the result; we do not protect against this).

### Loud failure on misuse

If the user passes a `mask` that marks a NaN position as valid, the
metric's per-element computation is performed at that position
(`(pred − NaN)**2 = NaN`, `gt · log(pred + ε)` with `gt = NaN` = NaN,
etc.). NaN propagates through the per-neuron sum into the per-neuron
output. The result is then a visibly NaN per-neuron value — *not* a
silently-dropped or zero-contaminated number. This matches the
loud-failure ethos of the data paradigm (`data_paradigm.md` §4): users
spot misuse via NaN in their reported metric, not via subtle drift.

Implementation note: this is achieved by `masked_fill(~valid, 0.0) →
sum / count` rather than `nanmean`, which would silently drop NaN.

### Boolean indexing vs multiplicative masking

Two natural ways to compute a masked reduction:

```python
# (a) Boolean indexing — flat 1-D vector of valid positions
ssr = ((pred[valid] - gt[valid]) ** 2).mean()

# (b) Multiplicative masking — preserves axis structure
per_element = (pred - gt.nan_to_num(0.0)) ** 2 * valid.float()
per_axis    = per_element.sum(dim=AXIS) / valid.float().sum(dim=AXIS)
```

The shipped metrics use **boolean indexing** internally for the per-neuron
formula, because:

- The flat 1-D form keeps `cov` / `var` / Pearson correlation textbook-clean
  (no per-row `.nansum()` denominators).
- It is numerically robust against `0 · NaN = NaN` traps that
  multiplicative masking can hit when NaN has leaked into `pred`.

Multiplicative masking is the right choice when you need to preserve the
`(B, N, T)` axis structure for further aggregation (e.g. per-neuron
losses with custom batch reduction outside the metric API). In that
case, work directly with the dataloader's `valid_mask` rather than
calling a deepSTRF metric — the metrics are designed to terminate at
per-neuron scalars.

## 5. Reduction semantics (PyTorch convention)

Every public metric has a `reduction` keyword with three values, matching
`torch.nn.functional.mse_loss` and friends:

| `reduction` | Returned shape | Meaning                                    |
|---          |---              |---                                         |
| `'none'`    | `(N,)`          | one value per neuron                       |
| `'mean'`    | `()` (scalar)   | mean across neurons                        |
| `'sum'`     | `()` (scalar)   | sum across neurons                         |

**Reduction is over the neuron axis only.** Reduction over time is decided
by the metric's mathematical formula and is never controllable via
`reduction` (§3 above).

Edge case: when one or more neurons have **zero valid positions** under
the chosen mask (e.g. no val data for a particular cell), the per-neuron
result is `NaN`. Under `'mean'` and `'sum'` we use `torch.nanmean` /
`torch.nansum` so the scalar reduction stays well-defined. Under
`'none'`, the NaN is preserved — the caller can see which cells were
dropped.

## 6. Per-metric definitions

Each metric is documented with its formula, shape contract, edge cases,
and reference. Where the existing implementation has a known bug or
band-aid (`abs(ttrc)`, capitalized `'None'`, etc.), the rewrite either
fixes or removes the band-aid; this section is the authoritative spec.

### 6.1 `mse_loss(pred, gt, mask=None, reduction='mean')`

Boolean-masked mean squared error.

For each neuron, take the per-element squared residual, drop positions
where `mask` is False, and reduce.

```text
mse_n = mean_{(b,t) ∈ valid_n}  (pred[b, n, 0, t] - gt[b, n, 0, t])²
```

- `reduction='none'`: returns `(N,)`.
- `reduction='mean'`: scalar, `nanmean(mse_n, dim=0)`.
- `reduction='sum'`: scalar, `nansum(mse_n, dim=0)`.

Differentiable. No `@torch.no_grad`.

### 6.2 `poisson_loss(pred, gt, mask=None, reduction='mean', log_input=False, validate_input=False, eps=1e-8)`

Negative Poisson log-likelihood, dropping the `log(gt!)` term. Two
parameterisations of the prediction are supported, matching the
canonical-link logic of generalised linear models:

#### `log_input=False` (default): pred is the rate `λ`

```text
poisson_n = mean_{(b,t) ∈ valid_n}  pred[b, n, 0, t]  -  gt[b, n, 0, t] · log(pred[b, n, 0, t] + ε)
```

Requires `pred ≥ 0` for the `log` to be meaningful. The implementation
*silently clamps* `pred` to `≥ ε` inside the `log` to prevent NaN from
propagating when the model temporarily emits a small negative
prediction. The linear `pred` term outside the `log` keeps its sign,
so the gradient still pushes negative predictions toward positive
values. With `validate_input=True`, the loss instead raises
`ValueError` on the first negative prediction at a masked-in position
— useful for debugging an output-activation bug, but adds a per-step
CPU sync, so it is opt-in.

#### `log_input=True`: pred is the log-rate `η = log(λ)`

```text
poisson_n = mean_{(b,t) ∈ valid_n}  exp(pred[b, n, 0, t])  -  gt[b, n, 0, t] · pred[b, n, 0, t]
```

Derivation: starting from the Poisson NLL `λ − gt · log(λ)` and
substituting `λ = exp(η)`, the loss becomes `exp(η) − gt · η`, well-defined
for any real-valued `η`. This is the **canonical link** for the Poisson
distribution in GLM language — same role that the logit link plays for
Bernoulli. Practically: pair `log_input=True` with **any** readout
(Linear, ParametricSigmoid, etc.) and read the model output as
log-rate. To recover rate for visualisation or downstream metrics
expecting rate (e.g. `corrcoef` against a count-valued PSTH),
`exp(pred)` first.

#### Stirling term

The full Poisson NLL has a `gt · log(gt) − gt + 0.5·log(2π·gt)` term
(Stirling's approximation to `log(gt!)`). It is **not** added by
default because:

- It is constant in `pred`, so it does not affect optimisation.
- It is meaningless when `gt` is non-integer (e.g. trial-averaged
  PSTH binned counts — the typical deepSTRF target). `log(gt!)` is
  defined only on non-negative integers.

Users who want the full likelihood for AIC/BIC model comparison on
integer count targets can add the term outside the loss.

#### Pairing with output activations

| Output activation                                | Recommended `log_input` | Why                                              |
|---                                               |---                       |---                                                |
| `nn.Softplus`                                     | `False`                  | Output already `> 0`; rate interpretation natural |
| `ParametricSoftplus` with `non_negative_output=True` (default) | `False`        | Same; default zoo activation (NS1, NAT4, ...)     |
| `ParametricSigmoid` with `non_negative_output=True` | `False`                | Same                                              |
| `ParametricDoubleExponential` with `non_negative_output=True` | `False`        | Same                                              |
| `nn.Identity` / Linear                            | `True`                   | Output unbounded; treat as log-rate               |
| `nn.Sigmoid` (output in [0, 1])                   | `False`                  | Output already `> 0`                              |
| Any with `non_negative_output=False`              | `True`                   | Output may go negative                            |

The default `log_input=False` matches PyTorch's
`torch.nn.PoissonNLLLoss(log_input=False)` and the historical deepSTRF
behaviour.

### 6.3 `corrcoef(pred, gt, mask=None, reduction='mean')`

Pearson correlation coefficient between prediction and PSTH, computed
per-neuron over all valid positions:

```text
ρ_n = corr(pred_n, gt_n)
```

where `pred_n`, `gt_n` are the 1-D vectors of valid `(b, t)` positions for
neuron `n` (§4). Uses `torch.corrcoef`, denominator regularized with a
small floor to avoid `0/0` on constant predictions.

- `mean` / `sum` reduce over `N`.
- A neuron with zero variance in either `pred_n` or `gt_n` returns NaN
  for that index — propagation rules from §5 apply.

### 6.4 `normalized_corrcoef(pred, responses, method='schoppe', mask=None, reduction='mean')`

Noise-corrected correlation coefficient. Two formulations are shipped;
they target the same quantity with different bias profiles.

The shape contract differs from `corrcoef`: ground truth is the raw
repeat-bearing tensor `(B, N, R, T)`, not the PSTH. The PSTH is computed
internally as `responses.nanmean(dim=2, keepdim=True)`, but the per-trial
information is needed for the noise correction.

#### `method='schoppe'`

Schoppe et al. 2016, "Measuring the Performance of Neural Models",
Frontiers in Neuroscience.

```text
CCnorm_n = cov(pred_n, psth_n) / sqrt( var(pred_n) · SP_n )
```

where `pred_n`, `psth_n` are the 1-D vectors of valid `(b, t)` positions
for neuron `n` (concatenated across stims via boolean indexing, §3),
and `SP_n` is the Sahani–Linden signal power for neuron `n` (§6.5,
length-weighted across stims). The aggregation convention for the
denominator is chosen to match the natural sample-count weighting that
the numerator's concatenated `cov` and `var` perform automatically — see
§11. Replaces the noise-inflated `var(psth)` in the denominator of the
raw Pearson correlation with the unbiased signal-only variance estimator.
Cells with `SP_n ≤ 0` (the noise-floor regime where the unbiased
estimator fails) return NaN.

#### `method='hsu'`

Hsu, Borst & Theunissen 2004, with the `CCmax` formulation popularized
by Schoppe et al. 2016.

```text
CCnorm_n     = corr(pred_n, psth_n) / CCmax_n
CCmax_{b,n}  = sqrt( 2 · ρ_half_{b,n} / (1 + ρ_half_{b,n}) )           (Spearman–Brown, per stim)
ρ_half_{b,n} = E_{i,j}[ corr( psth_half_{i,b,n}, psth_half_{j,b,n} ) ] (over disjoint half-trial splits)
CCmax_n      = sum_b ( T_b · CCmax_{b,n} ) / sum_b T_b                 (length-weighted, §11)
```

Computed per-stim by sampling up to `ccmax_iters=126` disjoint
half-trial pairs, then length-weighted across stims with `T_b` as the
weight. Same robustness convention as `signal_power` (§6.5). Per-stim
cells with `ρ_half ≤ 0` (worse-than-chance ceiling — too noisy to
estimate) are dropped from the weighted average rather than
contaminating it.

#### Single-trial degenerate case

When `R = 1` (or every cell of `responses[..., n, :, :]` has only one
valid repeat after masking), the noise-correction factor is undefined:
both `SP_n` and `CCmax_n` would require ≥ 2 trials. In that case
`normalized_corrcoef` returns the raw `corrcoef(pred, psth)` for that
neuron. This is a *worst-case* assumption: with no information about
trial-to-trial variability, we cannot know how much of the prediction
gap is noise vs model error, so we report the un-corrected number. It is
the same convention used by the existing implementation and by
Schoppe (2016) §3.5.

#### Pennington 2023 — *not shipped in v1*

The Pennington & David (2023) noise-corrected prediction correlation
divides by `sqrt(TTRC)` where `TTRC` is the average pairwise trial-pair
correlation. The existing implementation includes this as
`pennington_prediction_correlation`, with an `abs(ttrc)` band-aid that
the audit flags as wrong. Re-derivation: TTRC can be slightly negative
in the low-SNR regime and Pennington's paper is silent on what to do; the
field's typical fix is to clip-to-ε with a warning, but this changes the
reported number for poorly-driven cells in a way that has been
benchmarked in deepSTRF. We **defer** the Pennington method to a
follow-up rather than ship a rushed fix.

### 6.5 `signal_power(responses, mask=None, reduction='mean')`

Sahani & Linden 2003, NIPS, "How linear are auditory cortical responses?".

For one `(b, n)` cell with `R` valid repeats, the variance of the
multi-trial response decomposes as

```text
TP_{b,n} = E_r [ var_t( y_{r, b, n}(t) ) ]            = SP_{b,n} + NP_{b,n}             (single-trial variance)
           var_t( E_r[ y_{r, b, n}(t) ] )             = SP_{b,n} + NP_{b,n} / R          (variance of the PSTH)
```

Solving for `SP_{b,n}`:

```text
SP_{b,n} = ( R · var_t(psth_{b,n}) - TP_{b,n} ) / (R - 1)
```

The per-neuron quantity is the **length-weighted average** of the
per-stim estimates:

```text
SP_n = sum_b ( T_b · SP_{b,n} ) / sum_b T_b
```

where `T_b` is the number of valid time bins for stim `b`. This is the
BLUE estimator under the assumption that every per-stim `SP_{b,n}` is
unbiased: weighting by sample count gives the minimum-variance combiner.
It also matches the natural sample weighting that `cov` and `var` apply
in the numerator of `corrcoef` / `normalized_corrcoef` (which run over
the full concatenated `(b, t)` series), so a 500 ms clip drives the
denominator the same way it drives the numerator. See §11 for the
derivation.

A given `(b, n)` cell is included iff it has ≥ 2 valid repeats and ≥ 2
valid time bins. Cells without any qualifying stim return NaN under
`reduction='none'`.

### 6.6 `noise_power(responses, mask=None, reduction='mean')`

```text
NP_{b,n} = TP_{b,n} - SP_{b,n}
NP_n     = sum_b ( T_b · NP_{b,n} ) / sum_b T_b
```

Same length-weighted aggregation as `signal_power`.

### 6.7 `snr(responses, mask=None, reduction='mean')`

```text
SNR_n = SP_n / NP_n
```

Returns `+inf` for noiseless cells (`NP_n = 0`) — caller decides whether
to filter.

### 6.8 `coherence(pred, gt, dt_ms, reduction='mean')`

Magnitude-squared coherence between prediction and PSTH per neuron, as a
scalar per neuron computed as the mean over frequency bins. Uses
`scipy.signal.coherence` internally, so this is **eval-only** (no
gradient).

Coherence requires a regular time grid and **does not tolerate NaN**.
The caller is responsible for picking a NaN-free subset (e.g. one fully
recorded validation stim, or pre-flatten across stims of equal length).
We document this explicitly and raise `ValueError` if NaN appears in the
input. Future follow-up: a chunked-coherence variant that handles
NaN-padded variable-length stims.

This metric is preserved largely as it was — the existing implementation
is correct for its supported shape — but adapted to the new
`(B, N, 1, T)` × `(B, N, 1, T)` contract.

## 7. Recommended training-loop pattern

Mirroring `data_paradigm.md` §6, the canonical loop now looks like:

```python
from deepSTRF.metrics import mse_loss, corrcoef, normalized_corrcoef

for batch in loader:                                             # batch is a dict
    stims, responses = batch['stims'], batch['responses']
    pred = model(stims)                                          # (B, N, 1, T)

    loss = mse_loss(pred, responses)                             # auto-PSTH inside
    loss.backward()
    optimizer.step()

    if val_step:
        cc       = corrcoef(pred, responses, reduction='none')    # (N,)  auto-PSTH
        cc_norm  = normalized_corrcoef(pred, responses,           # (N,)  needs raw repeats
                                       method='schoppe',
                                       reduction='none')
```

Notes:

1. **No manual masking and no manual PSTH.** Pass `responses` directly to
   any prediction-vs-PSTH metric; the auto-PSTH collapse (§2) and the
   NaN-derived mask (§4) are both internal. The dataloader's `valid_mask`
   is informational at this point.
2. **Pass `responses.nanmean(dim=2, keepdim=True)` explicitly only when
   you want a non-default target** — a custom objective, a single-trial
   fit, etc. The `(B, N, 1, T)` shape is treated as authoritative and
   passed through unchanged.
3. **`reduction='none'` is the default for val metrics.** Per-neuron
   numbers are usually what you want for diagnostics (best/worst cells,
   distribution plots). Aggregate yourself with `cc.nanmean()` if a
   single number is needed.
4. **Use `mask=` only when overriding the default NaN rule.** The 90 %
   case (mask = `~isnan`) is automatic.

## 8. Invariants for metric authors

Anyone adding a new metric to `deepSTRF.metrics` must respect:

1. **Shape signature matches §2.** Either `(B, N, 1, T)` × `(B, N, 1, T)`
   for prediction-vs-PSTH metrics, or `(B, N, R, T)` for repeat-aware
   metrics. No other input shape is supported in v1.
2. **NaN-aware by default.** Derive `valid` internally; expose `mask=`
   for override.
3. **`reduction='none'|'mean'|'sum'` over the neuron axis.** Use
   `nanmean` / `nansum` for `mean` / `sum` to handle dropped cells
   gracefully.
4. **`@torch.no_grad()` on metrics; no `no_grad` on losses.** Losses
   participate in backprop; metrics do not. Mixing them up is the
   single biggest bug pattern in the existing module.
5. **Reduction is over neurons, never over time.** Time is collapsed by
   the metric's formula. If your formula needs a different time-axis
   convention, document it explicitly and add a test.
6. **Loud failure on shape mismatch.** Raise `ValueError` with the
   expected vs received shape; do not auto-broadcast.
7. **Tests under `tests/test_metrics.py`** assert: (a) shape contracts,
   (b) NaN handling on a synthetic full-NaN slab, (c) reduction
   semantics, (d) at least one analytic correctness check
   (e.g. `corrcoef` of a tensor with itself = 1, `signal_power` on a
   pure noise signal ≈ 0).

## 9. Known bugs in the existing implementation (audit)

This section is the rewrite checklist. The existing `metrics/performance.py`
and `metrics/losses.py` will be rewritten — not patched in place — to
satisfy the contracts above. Issues found during the 2026-04-28 audit:

1. **`@torch.no_grad()` on losses.** `NegativePoissonLogLikelihood` is
   correctly a `nn.Module` (no `no_grad`), but every "performance" function
   in `performance.py` has `@torch.no_grad()`. Splitting losses vs metrics
   into two modules makes the contract obvious.
2. **No NaN handling anywhere.** Silently produces NaN if any input
   element is NaN. Critical given the data paradigm.
3. **`covariance` uses `(T - 1)` denominator; `var(dim=-1)` is
   Bessel-corrected by default.** Mixing biased and unbiased estimators in
   the same formula introduces a small but real bias. Pick one (Bessel)
   and apply consistently.
4. **`reduction='None'` (capitalized)** in `correlation_coefficient` and
   `normalized_correlation_coefficient` is a typo. Switch to PyTorch
   convention `'none'|'mean'|'sum'`.
5. **`pennington_prediction_correlation` uses `abs(ttrc)`** — a band-aid
   for sign issues that shouldn't appear in a correctly-derived formula.
   Re-derive (or defer; see §6.4).
6. **`compute_CCmax` uses `1/sqrt(cchalf²)` = `1/|cchalf|`** instead of
   `1/cchalf`. Same band-aid pattern as #5; equivalent to `|ρ_half|` in
   the Spearman–Brown formula. Replace with the textbook form
   `sqrt(2·ρ_half / (1+ρ_half))` and clip-or-NaN low-SNR cells.
7. **`fill_missing_repeats`** is a dataset-side helper that the modernized
   NaN-sentinel path obsoletes. Drop from metrics.
8. **`coherence` returns a `(B, N, F)` tensor that is then averaged over
   B, not over F** — different from the per-neuron-scalar contract here.
   Adapt to §6.8.
9. **`NegativePoissonLogLikelihood` computes `log(-prediction + 1e-9)`** —
   negation inside `log` is a bug. Should be `log(prediction + ε)`. Also
   uses `'None'` (capitalized) and computes mean over `(-2, -1)` then over
   batch, which is `'mean'` over all elements — fine for the loss but the
   reduction semantics are inconsistent with the new convention.

## 10. Deferred items

These were considered for v1 and explicitly cut. Restoration will
re-open this doc.

- **Module API (torchmetrics-style stateful `update()` / `compute()`).**
  Useful for streaming val metrics across an epoch without materializing
  every batch. The functional API is enough for v1; module API is easy
  to add later as a thin wrapper.
- **`feature_decorrelation`** / redundancy-reduction regularizer. Not
  part of the original deepSTRF scope and uses a different design pattern
  (forward hooks on `model.core`).
- **`pop_corrcoef`** (correlation of population-summed PSTH). Uncommon
  metric; no immediate use case.
- **Pennington-style noise-corrected prediction correlation** (§6.4).
  Re-derivation pending; current `abs(ttrc)` is wrong, simple ε-clip is
  not vetted.
- **Per-stim-then-averaged metrics.** Some of the literature reports
  `mean_s CC(stim_s)` rather than `CC(concat_s)`. Different quantity;
  not shipped in v1.
- **Chunked-coherence for NaN-padded variable-length stims** (§6.8).

## 11. Why length-weighting? (BLUE derivation)

Several v1 metrics combine per-stim estimates into a per-neuron number:
`signal_power`, `noise_power`, `snr`, and `normalized_corrcoef(method='hsu')`.
The *aggregation* across stims is length-weighted:

```text
X_n = sum_b ( T_b · X_{b,n} ) / sum_b T_b
```

with `T_b` the number of valid time bins for stim `b`. This subsection
records *why*, since "average across stims" is also a valid choice and
the literature is split.

### Setup

Let `X_{b,n}` be a per-stim estimate of some neuron-level quantity
`X_n` (e.g. signal power). Treat `X_{b,n}` as an unbiased estimator
with variance `Var(X_{b,n}) ∝ 1 / T_b` — i.e. estimator variance shrinks
linearly with the number of time samples used. This is the standard
behaviour of variance-of-variance estimators on stationary signals:
doubling `T` halves the noise on the variance estimate.

The general weighted estimator

```text
X̂_n = sum_b ( w_b · X_{b,n} ) / sum_b w_b
```

is unbiased for any positive `w_b` summing to one. Its variance is
minimised — i.e. it is the **Best Linear Unbiased Estimator (BLUE)** —
when `w_b ∝ 1 / Var(X_{b,n}) = T_b`. So weighting by `T_b` extracts
the most information possible from the per-stim estimates under that
variance assumption.

### Coherence with the numerator

`corrcoef(pred_n, psth_n)` and `var(pred_n)` in the Schoppe formula
operate over the **concatenated** valid `(b, t)` time series for neuron
`n`. A 500 ms clip contributes 10× the samples of a 50 ms clip and
correspondingly drives the numerator's `cov` / `var` 10× harder. If we
were to combine per-stim SPs with equal weights instead of length
weights, the numerator and denominator would weight stims differently,
which makes the resulting CCnorm value awkward to interpret —
especially when comparing models on heterogeneous-length validation
sets. Length-weighting on the denominator brings everything into
agreement.

### The equal-weight alternative

`mean_b X_{b,n}` is also defensible:

- Pros: each stim contributes one *answer*, not one *fact*. If you
  trust short-stim estimates as much as long-stim estimates (e.g. your
  short stims are repeated many more times so the per-stim variance
  gain compensates), equal weighting reflects that.
- Cons: ignores the variance structure described above; short noisy
  stims can pull the average around.

We pick length-weighting as the default because it composes correctly
with the natural sample-count weighting elsewhere in the pipeline. A
future kwarg `weighting='length' | 'equal'` could expose the choice if a
clear use case emerges.

### Practical consequence

A neuron with two clips — a 500 ms validation stim with `SP=10` and a
50 ms validation stim with `SP=1` — gets `SP_n ≈ (500·10 + 50·1)/550 ≈
9.18` under length-weighting, vs `5.5` under equal weighting. The
length-weighted answer matches what `var(psth_n)` over the concatenated
time series would have given (modulo the `1/(R−1)` Bessel correction);
the equal-weighted answer would produce an internally inconsistent
CCnorm.

## 12. References

- **Sahani, M. & Linden, J. F. (2003).** "How linear are auditory cortical
  responses?" *Advances in Neural Information Processing Systems (NIPS)*.
  — Original signal/noise power decomposition.
- **Hsu, A., Borst, A. & Theunissen, F. E. (2004).** "Quantifying
  variability in neural responses and its application for the validation
  of model predictions." *Network: Computation in Neural Systems*. —
  CCmax formulation via Spearman–Brown.
- **Schoppe, O., Harper, N. S., Willmore, B. D. B., King, A. J. &
  Schnupp, J. W. H. (2016).** "Measuring the performance of neural
  models." *Frontiers in Computational Neuroscience*. — CCnorm
  (signal-power formulation), reconciliation with Hsu et al.
- **Pennington, J. R. & David, S. V. (2023).** "A convolutional neural
  network provides a generalizable model of natural sound coding by
  neural populations in auditory cortex." *PLOS Computational Biology*. —
  TTRC-based noise-corrected prediction correlation.
- **Gill, P. et al. (2025).** "Sound representation methods for
  spectro-temporal receptive field estimation." *Journal of Computational
  Neuroscience*. — Coherence-information metric (deferred follow-up).