Dataset concatenation

Pool recordings across species, labs, and preparations into a single deepSTRF dataset, and fit one model on the union.

deepSTRF supports concatenating two or more NeuralDataset instances along both the stim and neuron axes. The result is a “chimeric” dataset that keeps each source’s data intact and fills the cross-blocks with the canonical NaN sentinel — the same one used everywhere else in the library to mark missing (stim, neuron) pairs.

To our knowledge, this feature is unique to deepSTRF among neurophysiology DNN benchmark libraries. We hope it makes one class of experiments easier than it has been: training a single model on data from different species, labs, or recording sessions to ask which computational principles generalise.

What concatenation actually means here

Concatenating datasets in deepSTRF means neither appending stimuli into one long playlist (that would be torch.cat-style) nor adding more neurons that observed the same stimuli. It is the more general 2-D case:

Example	Stim axis	Neuron axis	Cross-block
`aa1 + aa2`	A’s ⊕ B’s	A’s ⊕ B’s	NaN sentinels
`aa1 + aa2 + ns1`	three-way	three-way	NaN sentinels

For k input datasets with (S_i, N_i) each, the result has S = Σ S_i stimuli and N = Σ N_i neurons arranged block-diagonally:

              neurons of A │ neurons of B │ neurons of C
stims of A │   real data   │      NaN     │      NaN
stims of B │      NaN      │   real data  │      NaN
stims of C │      NaN      │      NaN     │   real data

combined.nrn_masks reflects exactly that pattern, since the property is derived on the fly from combined.responses.
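
The block-diagonal layout can be sketched in a few lines of pure Python. This is an illustration only, not deepSTRF's implementation: `block_diagonal_mask` is a hypothetical helper, and the boolean grid merely stands in for the pattern that combined.nrn_masks is described as reporting.

```python
from itertools import accumulate


def block_diagonal_mask(shapes):
    """Build an S x N boolean grid from per-source (S_i, N_i) shapes.

    True marks a (stim, neuron) pair from the same source (real data);
    False marks a cross-block pair (a NaN sentinel in the real dataset).
    """
    # Running totals give the first row / column index of each source block.
    stim_offsets = [0, *accumulate(s for s, _ in shapes)]
    nrn_offsets = [0, *accumulate(n for _, n in shapes)]
    n_stims, n_neurons = stim_offsets[-1], nrn_offsets[-1]

    mask = [[False] * n_neurons for _ in range(n_stims)]
    for k, (s, n) in enumerate(shapes):
        for i in range(stim_offsets[k], stim_offsets[k] + s):
            for j in range(nrn_offsets[k], nrn_offsets[k] + n):
                mask[i][j] = True
    return mask


# Two toy sources: A has 2 stims x 3 neurons, B has 1 stim x 2 neurons.
mask = block_diagonal_mask([(2, 3), (1, 2)])
for row in mask:
    print("".join("#" if valid else "." for valid in row))
# ###..
# ###..
# ...##
```

The result has S = 2 + 1 = 3 rows and N = 3 + 2 = 5 columns, matching S = Σ S_i and N = Σ N_i.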

Why this is useful

More data, same model. Auditory neurons in zebra finch (CRCNS AA1) and in ferret (NS1) share computational principles even though the experimental preparations differ. Concatenating the datasets and fitting a single backbone lets the model exploit that overlap, while output heads remain neuron-specific.
Same species, more recordings. The CRCNS AA series (AA1 / AA2 / AA4 / AA5) are all zebra finch but from different cohorts; concatenation produces a substantially larger pooled dataset than any of them alone.
Cross-lab benchmarks. Train once, evaluate the same model on per-source held-out sets — a cleaner benchmark than fitting four models separately and averaging.
Lossless missing-data accounting. The block-diagonal mask is built into the loss path automatically (via valid_mask from neural_collate), so the model never sees a fake “this neuron heard this stim” signal.
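
To make the last point concrete, here is a minimal sketch of a masked loss. It is not deepSTRF's loss code: `masked_mse` is a hypothetical function, and it derives validity directly from NaN targets rather than consuming the valid_mask produced by neural_collate.

```python
import math


def masked_mse(predictions, targets):
    """Mean squared error over entries whose target is not NaN."""
    valid_pairs = [
        (p, t) for p, t in zip(predictions, targets) if not math.isnan(t)
    ]
    if not valid_pairs:
        # Nothing valid to score; returning 0.0 is a choice made for this sketch.
        return 0.0
    return sum((p - t) ** 2 for p, t in valid_pairs) / len(valid_pairs)


# The middle target is a cross-block NaN sentinel and is ignored,
# so the loss averages over the two valid entries only.
loss = masked_mse([1.0, 2.0, 5.0], [1.0, math.nan, 3.0])
print(loss)  # 2.0
```

Averaging over valid entries only means that the cross-blocks contribute nothing to the loss value, however large they are.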

Public API

from deepSTRF.utils.data import concat_neural_datasets

# N-ary form — recommended for clarity
combined = concat_neural_datasets([aa1, aa2, ns1])

# pairwise sugar via __add__ — convenient for ad-hoc work
combined = aa1 + aa2

Given the same inputs, both forms produce identical results. The N-ary form is preferred because it reads more clearly when more than two datasets are involved and does not rely on chained __add__ calls.

Compatibility requirements

deepSTRF refuses to concatenate datasets that disagree on any property whose mismatch would make the result silently wrong. These checks are hard assertions; nothing is adapted automatically:

Requirement	Where checked	Caller must…
Same `dt_ms`	`NeuralDataset._concat_check_compat`	re-instantiate with a matching bin width
Same `F` (audio)	`AudioNeuralDataset._concat_check_compat`	re-instantiate with matching `n_mels`
Same `(H, W)` (video)	`VideoNeuralDataset._concat_check_compat` (TBD)	re-instantiate / pre-resample
All inputs are datasets	`concat_neural_datasets`	only pass `NeuralDataset`s

deepSTRF intentionally does not resample responses to a common dt_ms or warp spectrograms to a common F. Different choices have different scientific implications, and we would rather users make them explicitly than have the library hide them.
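
A fail-fast check of this kind might look like the sketch below. Everything in it is hypothetical: `ToyAudioDataset` is a stand-in carrying only the compared fields, `check_concat_compat` is not the library's `_concat_check_compat`, and it raises ValueError where the library is described as using asserts.

```python
from dataclasses import dataclass


@dataclass
class ToyAudioDataset:
    """Hypothetical stand-in carrying only the fields being compared."""
    name: str
    dt_ms: float
    n_freqs: int  # plays the role of F


def check_concat_compat(datasets):
    """Raise on the first mismatch instead of adapting anything."""
    reference = datasets[0]
    for ds in datasets[1:]:
        if ds.dt_ms != reference.dt_ms:
            raise ValueError(
                f"{ds.name}: dt_ms={ds.dt_ms} differs from "
                f"{reference.name}: dt_ms={reference.dt_ms}"
            )
        if ds.n_freqs != reference.n_freqs:
            raise ValueError(
                f"{ds.name}: F={ds.n_freqs} differs from "
                f"{reference.name}: F={reference.n_freqs}"
            )


a = ToyAudioDataset("a", dt_ms=5.0, n_freqs=64)
b = ToyAudioDataset("b", dt_ms=5.0, n_freqs=64)
c = ToyAudioDataset("c", dt_ms=10.0, n_freqs=64)

check_concat_compat([a, b])  # compatible: returns silently

try:
    check_concat_compat([a, c])
except ValueError as err:
    print(f"refused: {err}")
```

The error message names both datasets and both values, so the caller knows which one to re-instantiate.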

Return type

The result’s concrete type is the most-specific common ancestor of the inputs:

Inputs	Result type
`aa1, aa1` (or two of any single subclass)	`CRCNSAA1Dataset` (preserved)
`aa1, aa2` (different audio subclasses)	`AudioNeuralDataset`
`aa1, video_ds`	`NeuralDataset`
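
One way to compute a most-specific common ancestor is to walk the method resolution order of the first input's class. The sketch below assumes a single-inheritance hierarchy; the classes are empty stand-ins that borrow the names used above (`CRCNSAA2Dataset` is an assumed name for AA2's class), and `most_specific_common_ancestor` is a hypothetical helper, not a deepSTRF function.

```python
# Empty stand-ins for the real classes; only the inheritance shape matters.
class NeuralDataset: ...
class AudioNeuralDataset(NeuralDataset): ...
class VideoNeuralDataset(NeuralDataset): ...
class CRCNSAA1Dataset(AudioNeuralDataset): ...
class CRCNSAA2Dataset(AudioNeuralDataset): ...  # assumed class name


def most_specific_common_ancestor(*classes):
    """Return the first class in classes[0]'s MRO that all inputs derive from."""
    # __mro__ is ordered from most to least specific and always ends with
    # `object`, so the loop is guaranteed to return.
    for candidate in classes[0].__mro__:
        if all(issubclass(cls, candidate) for cls in classes):
            return candidate


print(most_specific_common_ancestor(CRCNSAA1Dataset, CRCNSAA1Dataset).__name__)
# CRCNSAA1Dataset
print(most_specific_common_ancestor(CRCNSAA1Dataset, CRCNSAA2Dataset).__name__)
# AudioNeuralDataset
print(most_specific_common_ancestor(CRCNSAA1Dataset, VideoNeuralDataset).__name__)
# NeuralDataset
```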

Subclass-specific methods that don’t make sense on the merged object (e.g. aa1.areas) are simply not present on the result. The core API — stims, responses, stim_meta, nrn_meta, nrn_masks, select_*, __len__, __getitem__ — works identically.

Iterating only one source’s data

__len__ and __getitem__ filter by the current neuron selection — they expose only stimuli for which at least one currently selected neuron has valid response data (see data_paradigm.md §8 for the general rule). This makes selecting a sub-population on a chimeric dataset Just Work: the cross-block stims, which are full-NaN against the selected neurons, are hidden automatically.

combined = aa1 + aa2     # 30 + 117 = 147 stims, 100 + 494 = 594 neurons
len(combined)            # 147 — full pool by default

# select only AA1's neurons -> AA2's stims disappear from iteration
combined.select_population(list(range(aa1.N_neurons)))
len(combined)            # 30
combined[0]              # an AA1 stim, with valid responses for the selection
combined[30]             # IndexError, not a fully-NaN AA2 stim

# select MLd neurons across both sources (AA1 has 'MLd', AA2 has 'mld')
mld = [n for n, m in enumerate(combined.nrn_meta)
       if m["area"].lower() == "mld"]
combined.select_population(mld)
len(combined)            # all stims that any MLd neuron heard, in either source

A DataLoader over the chimeric dataset under any of these selections visits only the relevant stims — no manual filtering, no risk of training on a fully-NaN batch item.
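
The visibility rule for stims can be written down in a few lines. This is a toy sketch, not the library's implementation: `visible_stims` is a hypothetical helper, and each cell is a single float here, whereas the real grid holds response tensors.

```python
import math

NAN = math.nan


def visible_stims(responses, selected_neurons):
    """Indices of stims with a non-NaN response from >= 1 selected neuron."""
    return [
        s for s, row in enumerate(responses)
        if any(not math.isnan(row[n]) for n in selected_neurons)
    ]


# 3 stims x 3 neurons: source A owns stims 0-1 and neurons 0-1,
# source B owns stim 2 and neuron 2. Values are made-up placeholders.
responses = [
    [0.1, 0.4, NAN],
    [0.3, 0.2, NAN],
    [NAN, NAN, 0.9],
]

print(visible_stims(responses, [0, 1, 2]))  # [0, 1, 2]
print(visible_stims(responses, [0, 1]))     # [0, 1]
print(visible_stims(responses, [2]))        # [2]
```

Selecting only source A's neurons hides source B's stim, which mirrors the drop from 147 to 30 in the example above.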

The same idea works in the other direction. Filtering on a stim attribute (select_stims_by_attr, select_stim, select_stims) narrows iteration to the matching stims, and the bidirectional rule auto-hides cells that have no responses left in the selection:

# train only on conspecific stims across the chimeric pool
combined = aa1 + aa2
combined.select_stims_by_attr("type", "conspecific")
# - cells that have responses to >=1 conspecific stim survive
# - cells that only have responses to flatrip / non-conspecific stims are hidden
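
The stim-to-neuron direction follows the same pattern. In this toy sketch, `stims_with_attr` and `visible_neurons` are hypothetical helpers, stim metadata is assumed to be a list of dicts, and the response values are placeholders.

```python
import math

NAN = math.nan

stim_meta = [
    {"type": "conspecific"},
    {"type": "flatrip"},
    {"type": "conspecific"},
]
responses = [
    [0.1, NAN, NAN],  # conspecific stim, heard by neuron 0
    [0.3, 0.2, NAN],  # flatrip stim, heard by neurons 0 and 1
    [NAN, NAN, 0.9],  # conspecific stim, heard by neuron 2
]


def stims_with_attr(stim_meta, key, value):
    """Indices of stims whose metadata has meta[key] == value."""
    return [s for s, meta in enumerate(stim_meta) if meta.get(key) == value]


def visible_neurons(responses, selected_stims):
    """Indices of neurons with a non-NaN response to >= 1 selected stim."""
    n_neurons = len(responses[0])
    return [
        n for n in range(n_neurons)
        if any(not math.isnan(responses[s][n]) for s in selected_stims)
    ]


selected = stims_with_attr(stim_meta, "type", "conspecific")
print(selected)                              # [0, 2]
print(visible_neurons(responses, selected))  # [0, 2]
```

Neuron 1 only has a response to the flatrip stim, so it drops out once the selection is restricted to conspecific stims.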

Caller invariants we don’t enforce

Neuron and stim UIDs should be disjoint across sources. Pooling a dataset with its own subset is degenerate (use constructor arguments instead). Duplicate UIDs across sources are silently accepted but produce a misleading view of coverage. We could check this at concat time, but only at the cost of forcing every dataset to have a canonical UID field, which is not yet the case.
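
Until such a check exists, callers can run their own before concatenating. The sketch below assumes you can produce a list of UIDs per source yourself; `find_shared_uids` is a hypothetical helper and the UID strings are made up for illustration.

```python
from itertools import combinations


def find_shared_uids(uids_by_source):
    """Map each pair of source names to the sorted UIDs they have in common.

    Pairs with no overlap are omitted, so an empty dict means all clear.
    """
    shared = {}
    for (name_a, uids_a), (name_b, uids_b) in combinations(
        uids_by_source.items(), 2
    ):
        overlap = set(uids_a) & set(uids_b)
        if overlap:
            shared[(name_a, name_b)] = sorted(overlap)
    return shared


# Made-up UIDs: "aa1_n002" wrongly appears in both sources.
uids = {
    "aa1": ["aa1_n001", "aa1_n002"],
    "aa2": ["aa2_n001", "aa1_n002"],
}
print(find_shared_uids(uids))  # {('aa1', 'aa2'): ['aa1_n002']}
```

The same function works for stim UIDs; run it once per axis.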

Memory cost

Concatenation is eager: the result holds its own complete (S, N) grid of response references. At deepSTRF scales (S, N in the low hundreds) this is negligible — cross-block entries are single-element (1, 1) NaN tensors weighing ~8 bytes each. The full aa1 + aa2 + ns1 cross-block overhead is on the order of kilobytes.
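
As a worked check of that estimate, the snippet below counts cross-block cells for aa1 + aa2, using the shapes quoted in the selection example (30 × 100 and 117 × 494) and taking the ~8 bytes per cell at face value. `cross_block_cells` is a hypothetical helper, and the byte count covers the quoted payload figure only.

```python
def cross_block_cells(shapes):
    """Number of (stim, neuron) cells that fall outside the diagonal blocks."""
    total_stims = sum(s for s, _ in shapes)
    total_neurons = sum(n for _, n in shapes)
    diagonal_cells = sum(s * n for s, n in shapes)
    return total_stims * total_neurons - diagonal_cells


shapes = [(30, 100), (117, 494)]  # (stims, neurons) for aa1 and aa2
bytes_per_cell = 8                # approximate per-cell figure quoted above

cells = cross_block_cells(shapes)
print(cells)                   # 26520
print(cells * bytes_per_cell)  # 212160
```

That is 147 × 594 = 87,318 cells in total, of which 26,520 are cross-block, or roughly 212 KB of sentinel payload for this pair.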