# Dataset concatenation

> Pool recordings across species, labs, and preparations into a single
> deepSTRF dataset, and fit one model on the union.

deepSTRF supports concatenating two or more `NeuralDataset` instances along
**both** the stim and neuron axes. The result is a "chimeric" dataset that
keeps each source's data intact and fills the cross-blocks with the
canonical NaN sentinel — the same one used everywhere else in the library
to mark missing `(stim, neuron)` pairs.

To our knowledge, this feature is unique to deepSTRF among
neurophysiology DNN benchmark libraries. We hope it makes one class of
experiments easier to run: training a single model on data from different
species, labs, or recording sessions to ask which computational
principles generalise.

## What concatenation actually means here

Concatenating datasets in deepSTRF is **not** appending stimuli into one
long playlist (that would be `torch.cat`-style), and **not** adding more
neurons that observed the same stimuli. It is the more general 2-D case:

| Example                | Stim axis  | Neuron axis | Cross-block        |
|------------------------|------------|-------------|--------------------|
| `aa1 + aa2`            | A's ⊕ B's  | A's ⊕ B's   | NaN sentinels      |
| `aa1 + aa2 + ns1`      | three-way  | three-way   | NaN sentinels      |

For `k` input datasets, each of shape `(S_i, N_i)` (stimuli × neurons), the
result has `S = Σ S_i` stimuli and `N = Σ N_i` neurons arranged
block-diagonally:

```
              neurons of A │ neurons of B │ neurons of C
stims of A │   real data   │      NaN     │      NaN
stims of B │      NaN      │   real data  │      NaN
stims of C │      NaN      │      NaN     │   real data
```

`combined.nrn_masks` reflects exactly that pattern, since the property is
derived on the fly from `combined.responses`.
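
The layout above can be reproduced on a toy grid. This is a minimal sketch in plain Python, not deepSTRF's implementation: `block_diagonal_grid` and `mask_from_grid` are hypothetical helpers, and each response is reduced to a single float, whereas the library stores tensors.

```python
import math

def block_diagonal_grid(blocks):
    """Arrange per-source (S_i x N_i) grids block-diagonally, NaN elsewhere."""
    total_neurons = sum(len(block[0]) for block in blocks)
    grid = []
    col_offset = 0
    for block in blocks:
        n_cols = len(block[0])
        for row in block:
            full_row = [math.nan] * total_neurons          # start as all-missing
            full_row[col_offset:col_offset + n_cols] = row  # drop in the real data
            grid.append(full_row)
        col_offset += n_cols
    return grid

def mask_from_grid(grid):
    """Derive the validity mask from the grid: True where a response exists."""
    return [[not math.isnan(value) for value in row] for row in grid]

a = [[1.0, 2.0], [3.0, 4.0]]   # toy source A: 2 stims x 2 neurons
b = [[5.0]]                     # toy source B: 1 stim x 1 neuron
grid = block_diagonal_grid([a, b])
mask = mask_from_grid(grid)
```

Because the mask is computed from the grid rather than stored next to it, the two cannot drift apart.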

## Why this is useful

- **More data, same model.** Auditory neurons in zebra finch (CRCNS AA1)
  and in ferret (NS1) share computational principles even though the
  experimental preparations differ. Concatenating the datasets and fitting
  a single backbone lets the model exploit that overlap, while output
  heads remain neuron-specific.
- **Same species, more recordings.** The CRCNS AA series (AA1 / AA2 /
  AA4 / AA5) are all zebra finch but from different cohorts; concatenation
  produces a substantially larger pooled dataset than any of them alone.
- **Cross-lab benchmarks.** Train once, evaluate the same model on
  per-source held-out sets — a cleaner benchmark than fitting four models
  separately and averaging.
- **Lossless missing-data accounting.** The block-diagonal mask is
  built into the loss path automatically (via `valid_mask` from
  `neural_collate`), so the model never sees a fake "this neuron heard
  this stim" signal.
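
The last point can be pictured as a loss that skips missing cells. How `valid_mask` is consumed inside deepSTRF is not shown on this page, so the following is only a sketch of the general idea: `masked_mse` is a hypothetical function, and the mask is inferred from NaN targets rather than passed in.

```python
import math

def masked_mse(predictions, targets):
    """Mean squared error over valid (non-NaN) targets only."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if math.isnan(target):   # cross-block cell: no recording exists
                continue
            total += (pred - target) ** 2
            count += 1
    if count == 0:
        raise ValueError("no valid (stim, neuron) pairs in this batch")
    return total / count

# toy 2 x 2 batch with a block-diagonal missingness pattern
targets = [[1.0, math.nan], [math.nan, 3.0]]
predictions = [[1.5, 9.9], [9.9, 2.0]]
loss = masked_mse(predictions, targets)
```

The predictions at the NaN positions (9.9 here) have no effect on the result, which is the property the bullet describes.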

## Public API

```python
from deepSTRF.utils.data import concat_neural_datasets

# N-ary form — recommended for clarity
combined = concat_neural_datasets([aa1, aa2, ns1])

# pairwise sugar via __add__ — convenient for ad-hoc work
combined = aa1 + aa2
```

Both forms produce identical results. The N-ary form is preferred because
it reads more clearly when more than two datasets are involved and does
not depend on chaining the `__add__` operator.
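
One plausible way for the two forms to relate is for `__add__` to delegate to the N-ary function. The sketch below assumes that design; `ToyDataset` and `concat_toy` are hypothetical stand-ins and say nothing about how `NeuralDataset.__add__` is actually written.

```python
class ToyDataset:
    """Hypothetical stand-in that only records which sources it contains."""

    def __init__(self, sources):
        self.sources = list(sources)

    def __add__(self, other):
        if not isinstance(other, ToyDataset):
            return NotImplemented          # let Python raise a TypeError
        return concat_toy([self, other])   # pairwise sugar over the N-ary form

def concat_toy(datasets):
    """N-ary concatenation: merge all sources in one pass."""
    merged = []
    for dataset in datasets:
        merged.extend(dataset.sources)
    return ToyDataset(merged)

aa1, aa2, ns1 = ToyDataset(["aa1"]), ToyDataset(["aa2"]), ToyDataset(["ns1"])
n_ary = concat_toy([aa1, aa2, ns1])
chained = aa1 + aa2 + ns1   # evaluated as (aa1 + aa2) + ns1, via an intermediate object
```

In this toy, the chained form builds one throwaway intermediate per extra operand, which is one reason to prefer the list form for three or more datasets.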

## Compatibility requirements

deepSTRF refuses to concatenate datasets that disagree on properties whose
mismatch would make the result silently wrong. These checks are
**hard asserts**; nothing is adapted automatically:

| Requirement              | Where checked                                   | Caller must…                           |
|--------------------------|-------------------------------------------------|----------------------------------------|
| Same `dt_ms`             | `NeuralDataset._concat_check_compat`            | re-instantiate with matching bin width |
| Same `F` (audio)         | `AudioNeuralDataset._concat_check_compat`       | re-instantiate with matching `n_mels`  |
| Same `(H, W)` (video)    | `VideoNeuralDataset._concat_check_compat` (TBD) | re-instantiate / pre-resample          |
| All inputs are datasets  | `concat_neural_datasets`                        | only pass `NeuralDataset`s             |

Resampling responses to a common `dt_ms` or warping spectrograms to a
common `F` is intentionally **not** done by deepSTRF — different choices
have different scientific implications, and we would rather the user
makes them explicitly than have the library hide them.
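
The fail-fast behaviour can be sketched as follows. `ToyAudioDataset` and `check_compat` are hypothetical and are not the library's `_concat_check_compat`; the sketch also raises `ValueError` where the library is described as asserting.

```python
from dataclasses import dataclass

@dataclass
class ToyAudioDataset:
    name: str
    dt_ms: float   # response bin width in milliseconds
    F: int         # number of frequency bands in the spectrogram

def check_compat(datasets):
    """Raise instead of adapting when sources disagree on dt_ms or F."""
    reference = datasets[0]
    for ds in datasets[1:]:
        if ds.dt_ms != reference.dt_ms:
            raise ValueError(
                f"dt_ms mismatch: {reference.name}={reference.dt_ms}, "
                f"{ds.name}={ds.dt_ms}"
            )
        if ds.F != reference.F:
            raise ValueError(
                f"F mismatch: {reference.name}={reference.F}, {ds.name}={ds.F}"
            )

a = ToyAudioDataset("a", dt_ms=5.0, F=64)
b = ToyAudioDataset("b", dt_ms=10.0, F=64)   # different bin width
try:
    check_compat([a, b])
    error = None
except ValueError as exc:
    error = str(exc)   # the caller must re-instantiate one dataset to fix this
```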

## Return type

The result's concrete type is the **most-specific common ancestor** of
the inputs:

| Inputs                                       | Result type                  |
|----------------------------------------------|------------------------------|
| `aa1, aa1` (or two of any single subclass)   | `CRCNSAA1Dataset` (preserved) |
| `aa1, aa2` (different audio subclasses)      | `AudioNeuralDataset`         |
| `aa1, video_ds`                              | `NeuralDataset`              |

Subclass-specific attributes and methods that don't make sense on the
merged object (e.g. `aa1.areas`) are simply not present on the result. The
core API (`stims`, `responses`, `stim_meta`, `nrn_meta`, `nrn_masks`,
`select_*`, `__len__`, `__getitem__`) works identically.
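
For a single-inheritance hierarchy, the most-specific common ancestor can be found by walking the method resolution order of one input. The sketch below uses empty stand-in classes that reuse the names from the table; `CRCNSAA2Dataset` is an assumed name, and `most_specific_common_ancestor` is a hypothetical helper, not the library's own resolution logic.

```python
# Empty stand-ins mirroring the hierarchy described in the table above.
class NeuralDataset: ...
class AudioNeuralDataset(NeuralDataset): ...
class VideoNeuralDataset(NeuralDataset): ...
class CRCNSAA1Dataset(AudioNeuralDataset): ...
class CRCNSAA2Dataset(AudioNeuralDataset): ...   # assumed class name

def most_specific_common_ancestor(classes):
    """Return the first class in the first input's MRO shared by all inputs."""
    for candidate in classes[0].__mro__:   # ordered from most to least specific
        if all(issubclass(cls, candidate) for cls in classes):
            return candidate
    return object  # unreachable: every class derives from object

same = most_specific_common_ancestor([CRCNSAA1Dataset, CRCNSAA1Dataset])
audio = most_specific_common_ancestor([CRCNSAA1Dataset, CRCNSAA2Dataset])
mixed = most_specific_common_ancestor([CRCNSAA1Dataset, VideoNeuralDataset])
```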

## Iterating only one source's data

`__len__` and `__getitem__` filter by the current neuron selection — they
expose only stimuli for which at least one currently selected neuron has
valid response data (see [`data_paradigm.md`](data_paradigm.md) §8 for
the general rule). This makes selecting a sub-population on a chimeric
dataset Just Work: the cross-block stims, which are full-NaN against the
selected neurons, are hidden automatically.

```python
combined = aa1 + aa2     # 30 + 117 = 147 stims, 100 + 494 = 594 neurons
len(combined)            # 147 — full pool by default

# select only AA1's neurons -> AA2's stims disappear from iteration
combined.select_population(list(range(aa1.N_neurons)))
len(combined)            # 30
combined[0]              # an AA1 stim, with valid responses for the selection
combined[30]             # IndexError, not a fully-NaN AA2 stim

# select MLd neurons across both sources (AA1 has 'MLd', AA2 has 'mld')
mld = [n for n, m in enumerate(combined.nrn_meta)
       if m["area"].lower() == "mld"]
combined.select_population(mld)
len(combined)            # all stims that any MLd neuron heard, in either source
```

A `DataLoader` over the chimeric dataset under any of these selections
visits only the relevant stims — no manual filtering, no risk of training
on a fully-NaN batch item.
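
The visibility rule itself is small enough to write out. The following sketch applies it to a toy chimeric grid of floats; `visible_stims` is a hypothetical helper, not the code behind `__len__` and `__getitem__`.

```python
import math

NAN = math.nan
# toy chimeric grid: rows are stims, columns are neurons
# source A contributes stims 0-1 and neurons 0-1; source B contributes stim 2 and neuron 2
grid = [
    [0.1, 0.2, NAN],
    [0.3, 0.4, NAN],
    [NAN, NAN, 0.5],
]

def visible_stims(grid, selected_neurons):
    """Indices of stims with a valid response from at least one selected neuron."""
    return [
        s for s, row in enumerate(grid)
        if any(not math.isnan(row[n]) for n in selected_neurons)
    ]

all_neurons = visible_stims(grid, [0, 1, 2])   # full pool
only_a = visible_stims(grid, [0, 1])           # source A's neurons only
only_b = visible_stims(grid, [2])              # source B's neuron only
```

Under this rule the dataset length is just the number of visible stims, so an index past that count has nothing to return.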

The same idea works in the other direction. Filtering on a stim attribute
(`select_stims_by_attr`, `select_stim`, `select_stims`) narrows iteration
to the matching stims, and the bidirectional rule automatically hides
neurons that have no responses left in the selection:

```python
# train only on conspecific stims across the chimeric pool
combined = aa1 + aa2
combined.select_stims_by_attr("type", "conspecific")
# - neurons with responses to >=1 conspecific stim survive
# - neurons with responses only to flatrip / non-conspecific stims are hidden
```
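
The stim-to-neuron direction can be sketched the same way. Here `select_by_attr` is a hypothetical function operating on plain lists; it mirrors the described behaviour of `select_stims_by_attr` but is not its implementation.

```python
import math

NAN = math.nan
stim_meta = [{"type": "conspecific"}, {"type": "flatrip"}, {"type": "conspecific"}]
# rows are stims, columns are neurons; neuron 1 only has a response to the flatrip stim
responses = [
    [0.2, NAN, NAN],
    [0.4, 0.7, NAN],
    [NAN, NAN, 0.9],
]

def select_by_attr(stim_meta, responses, key, value):
    """Keep matching stims, then keep neurons with any valid response to them."""
    stims = [s for s, meta in enumerate(stim_meta) if meta.get(key) == value]
    n_neurons = len(responses[0])
    neurons = [
        n for n in range(n_neurons)
        if any(not math.isnan(responses[s][n]) for s in stims)
    ]
    return stims, neurons

stims, neurons = select_by_attr(stim_meta, responses, "type", "conspecific")
```

Neuron 1 drops out because its only valid response belongs to a stim that the filter removed.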

## Caller invariants we don't enforce

- **Neuron and stim UIDs should be disjoint across sources.**
  Pooling a dataset with its own subset is degenerate (use constructor
  arguments instead). Duplicate UIDs across sources are silently accepted
  but produce a misleading view of coverage. A concat-time check would be
  possible, but it would force every dataset to carry a canonical UID
  field, which is not yet the case.
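
Until such a check exists, callers can run one themselves before concatenating. The sketch below is a hypothetical helper: since there is no canonical UID field, the caller supplies `uid_of`, and the `"uid"` key in the toy metadata is an assumption made for illustration.

```python
from collections import Counter

def duplicate_uids(sources, uid_of):
    """Return the UIDs that occur in more than one source, sorted."""
    seen_in = Counter()
    for metas in sources:
        for uid in {uid_of(meta) for meta in metas}:   # count each source at most once
            seen_in[uid] += 1
    return sorted(uid for uid, n_sources in seen_in.items() if n_sources > 1)

# toy per-neuron metadata for two sources; the "uid" key is hypothetical
aa1_meta = [{"uid": "aa1/cell_01"}, {"uid": "aa1/cell_02"}]
aa2_meta = [{"uid": "aa2/cell_01"}, {"uid": "aa1/cell_02"}]   # deliberate clash
clashes = duplicate_uids([aa1_meta, aa2_meta], uid_of=lambda meta: meta["uid"])
```

An empty result means the sources are disjoint under the chosen key; the same helper works for stim metadata.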

## Memory cost

Concatenation is **eager**: the result holds its own complete `(S, N)` grid
of response references. At deepSTRF scales (S, N in the low hundreds) this
is negligible — cross-block entries are single-element `(1, 1)` NaN
tensors weighing ~8 bytes each. The full `aa1 + aa2 + ns1` cross-block
overhead is on the order of kilobytes.
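
A back-of-the-envelope check for the two-source case, using the AA1 and AA2 sizes quoted in the selection example on this page and the 8-byte per-cell figure above. `cross_block_cells` is a hypothetical helper, and NS1 is left out because its size is not stated here.

```python
def cross_block_cells(shapes):
    """Number of NaN-filled cells when (S_i, N_i) blocks are laid out diagonally."""
    total_stims = sum(s for s, _ in shapes)
    total_neurons = sum(n for _, n in shapes)
    real_cells = sum(s * n for s, n in shapes)
    return total_stims * total_neurons - real_cells

shapes = [(30, 100), (117, 494)]   # (S, N) for AA1 and AA2
cells = cross_block_cells(shapes)
approx_bytes = cells * 8           # assumes ~8 bytes per cross-block cell
```

For these two sources the cross-blocks contain 26,520 cells, or roughly 212 kB at 8 bytes each.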

## See also

- [`examples/dataset_concatenation.ipynb`](../../examples/dataset_concatenation.ipynb)
  — runnable end-to-end demo on AA1 + AA2.
- [`data_paradigm.md`](data_paradigm.md) — the broader storage and
  missingness conventions that make concatenation natural.