Dataset concatenation

Pool recordings across species, labs, and preparations into a single deepSTRF dataset, and fit one model on the union.

deepSTRF supports concatenating two or more NeuralDataset instances along both the stim and neuron axes. The result is a “chimeric” dataset that keeps each source’s data intact and fills the cross-blocks with the canonical NaN sentinel — the same one used everywhere else in the library to mark missing (stim, neuron) pairs.

This is, to our knowledge, a feature unique to deepSTRF among neurophysiology DNN benchmark libraries. We hope it makes one class of experiments easier to run: training a single model on data from different species, labs, or recording sessions, and asking which computational principles generalise.

What concatenation actually means here

Concatenating datasets in deepSTRF is neither appending stimuli into one long playlist (torch.cat-style concatenation along a single axis) nor adding more neurons that observed the same stimuli. It is the more general 2-D case, in which both axes grow at once:

Example         │ Stim axis │ Neuron axis │ Cross-block
────────────────┼───────────┼─────────────┼──────────────
aa1 + aa2       │ A’s ⊕ B’s │ A’s ⊕ B’s   │ NaN sentinels
aa1 + aa2 + ns1 │ three-way │ three-way   │ NaN sentinels

For k input datasets, where dataset i has S_i stimuli and N_i neurons, the result has S = Σ S_i stimuli and N = Σ N_i neurons arranged block-diagonally:

              neurons of A │ neurons of B │ neurons of C
stims of A │   real data   │      NaN     │      NaN
stims of B │      NaN      │   real data  │      NaN
stims of C │      NaN      │      NaN     │   real data

combined.nrn_masks reflects exactly that pattern, since the property is derived on the fly from combined.responses.
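The block-diagonal layout can be reproduced with a few lines of plain Python. This is only an illustrative sketch, not deepSTRF code: `block_diagonal_mask` is a hypothetical helper, and the shapes passed to it are toy values chosen to keep the printout small.

```python
from itertools import accumulate

def block_diagonal_mask(shapes):
    """Return an S x N grid of booleans, True where real data would live.

    `shapes` is a list of (S_i, N_i) pairs, one per source dataset.
    """
    total_stims = sum(s for s, _ in shapes)
    total_neurons = sum(n for _, n in shapes)
    # Offsets of each source's block along the stim and neuron axes.
    stim_starts = [0] + list(accumulate(s for s, _ in shapes))
    nrn_starts = [0] + list(accumulate(n for _, n in shapes))

    mask = [[False] * total_neurons for _ in range(total_stims)]
    for i, (s, n) in enumerate(shapes):
        for row in range(stim_starts[i], stim_starts[i] + s):
            for col in range(nrn_starts[i], nrn_starts[i] + n):
                mask[row][col] = True
    return mask

# Two toy sources: (2 stims, 3 neurons) and (1 stim, 2 neurons).
mask = block_diagonal_mask([(2, 3), (1, 2)])
for row in mask:
    print("".join("#" if valid else "." for valid in row))
# ###..
# ###..
# ...##
```

Every `.` corresponds to a cross-block cell that the real library would fill with the NaN sentinel.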

Why this is useful

  • More data, same model. Auditory neurons in zebra finch (CRCNS AA1) and in ferret (NS1) share computational principles even though the experimental preparations differ. Concatenating the datasets and fitting a single backbone lets the model exploit that overlap, while output heads remain neuron-specific.

  • Same species, more recordings. The CRCNS AA series (AA1 / AA2 / AA4 / AA5) are all zebra finch but from different cohorts; concatenation produces a substantially larger pooled dataset than any of them alone.

  • Cross-lab benchmarks. Train once, evaluate the same model on per-source held-out sets — a cleaner benchmark than fitting four models separately and averaging.

  • Lossless missing-data accounting. The block-diagonal mask is built into the loss path automatically (via valid_mask from neural_collate), so the model never sees a fake “this neuron heard this stim” signal.
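To make the last point concrete, here is a minimal sketch of a loss that skips missing cells. It is not deepSTRF's loss path: `masked_mse` is a hypothetical function, and it recomputes validity from the NaN entries directly rather than receiving a `valid_mask` from `neural_collate`.

```python
import math

def masked_mse(predictions, responses):
    """Mean squared error over the cells where `responses` is not NaN."""
    total, count = 0.0, 0
    for pred_row, resp_row in zip(predictions, responses):
        for pred, resp in zip(pred_row, resp_row):
            if math.isnan(resp):      # cross-block cell: no recording exists
                continue
            total += (pred - resp) ** 2
            count += 1
    return total / count if count else math.nan

nan = math.nan
responses   = [[1.0, nan], [nan, 3.0]]   # toy 2 x 2 block-diagonal grid
predictions = [[0.0, 9.0], [9.0, 1.0]]
print(masked_mse(predictions, responses))   # (1 + 4) / 2 = 2.5
```

The predictions of 9.0 in the cross-block cells do not affect the result, which is the behaviour the mask is meant to guarantee.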

Public API

from deepSTRF.utils.data import concat_neural_datasets

# N-ary form — recommended for clarity
combined = concat_neural_datasets([aa1, aa2, ns1])

# pairwise sugar via __add__ — convenient for ad-hoc work
combined = aa1 + aa2

Both produce identical results. The N-ary form is preferred because it reads more clearly when more than two datasets are involved and does not rely on chaining the __add__ operator.
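One way such pairwise sugar can be layered on an N-ary function is sketched below. `ToyDataset` and `toy_concat` are hypothetical stand-ins that only track the (S_i, N_i) shape of each source block; they say nothing about how deepSTRF implements `__add__` internally.

```python
from functools import reduce
import operator

class ToyDataset:
    """Stand-in that only tracks the (S_i, N_i) shape of each source block."""
    def __init__(self, blocks):
        self.blocks = list(blocks)

    def __add__(self, other):
        # Pairwise sugar: delegate to the N-ary function.
        return toy_concat([self, other])

def toy_concat(datasets):
    """N-ary concatenation: collect every input's source blocks in order."""
    if len(datasets) < 2:
        raise ValueError("need at least two datasets")
    return ToyDataset([b for ds in datasets for b in ds.blocks])

# Three toy sources with made-up shapes.
a, b, c = ToyDataset([(2, 3)]), ToyDataset([(1, 2)]), ToyDataset([(4, 1)])
n_ary = toy_concat([a, b, c])
chained = reduce(operator.add, [a, b, c])   # evaluates as (a + b) + c
print(n_ary.blocks == chained.blocks)       # True
```

In this toy, chaining builds an intermediate object for each `+`, whereas the N-ary call builds the result once.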

Compatibility requirements

deepSTRF refuses to concatenate datasets that disagree on properties whose mismatch would make the result silently wrong. These checks are hard assertions; nothing is adapted automatically:

Requirement             │ Where checked                                 │ Caller must…
────────────────────────┼───────────────────────────────────────────────┼────────────────────────────────────
Same dt_ms              │ NeuralDataset._concat_check_compat            │ re-instantiate with matching bin
Same F (audio)          │ AudioNeuralDataset._concat_check_compat       │ re-instantiate with matching n_mels
Same (H, W) (video)     │ VideoNeuralDataset._concat_check_compat (TBD) │ re-instantiate / pre-resample
All inputs are datasets │ concat_neural_datasets                        │ only pass NeuralDatasets

Resampling responses to a common dt_ms or warping spectrograms to a common F is intentionally not done by deepSTRF: different choices have different scientific implications, and we would rather the user make them explicitly than have the library hide them.
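A caller who wants to catch a mismatch before calling the library could run a pre-flight check along these lines. This is a sketch under assumptions: `check_compatible` is a hypothetical helper, the stand-in objects are plain namespaces, and it assumes `dt_ms` and `F` are readable as attributes with those names.

```python
from types import SimpleNamespace

def check_compatible(datasets, fields=("dt_ms", "F")):
    """Raise ValueError if any listed attribute differs across datasets."""
    for field in fields:
        values = {getattr(ds, field) for ds in datasets}
        if len(values) > 1:
            raise ValueError(f"datasets disagree on {field}: {sorted(values)}")

# Stand-in objects with made-up values; real datasets would be passed instead.
ds_a = SimpleNamespace(dt_ms=5, F=64)
ds_b = SimpleNamespace(dt_ms=5, F=64)
ds_c = SimpleNamespace(dt_ms=10, F=64)

check_compatible([ds_a, ds_b])        # passes silently
try:
    check_compatible([ds_a, ds_c])
except ValueError as err:
    print(err)                        # datasets disagree on dt_ms: [5, 10]
```

When the check fails, the remedy is the one listed in the table: re-instantiate one of the datasets with matching parameters.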

Return type

The result’s concrete type is the most-specific common ancestor of the inputs:

Inputs                                   │ Result type
─────────────────────────────────────────┼────────────────────────────
aa1, aa1 (or two of any single subclass) │ CRCNSAA1Dataset (preserved)
aa1, aa2 (different audio subclasses)    │ AudioNeuralDataset
aa1, video_ds                            │ NeuralDataset

Subclass-specific methods that don’t make sense on the merged object (e.g. aa1.areas) are simply not present on the result. The core API — stims, responses, stim_meta, nrn_meta, nrn_masks, select_*, __len__, __getitem__ — works identically.
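The "most-specific common ancestor" rule can be illustrated with Python's method resolution order. The classes below are empty stand-ins that mirror the names used in this section (`CRCNSAA2Dataset` is an assumed name for the AA2 subclass), and `most_specific_common_ancestor` is a hypothetical helper, not the library's own resolution code.

```python
# Empty stand-ins mirroring the hierarchy described above.
class NeuralDataset: ...
class AudioNeuralDataset(NeuralDataset): ...
class VideoNeuralDataset(NeuralDataset): ...
class CRCNSAA1Dataset(AudioNeuralDataset): ...
class CRCNSAA2Dataset(AudioNeuralDataset): ...   # assumed name

def most_specific_common_ancestor(*classes):
    """First class in the MRO of classes[0] that every input subclasses."""
    # The MRO is ordered from most to least specific, so the first hit wins.
    for candidate in classes[0].__mro__:
        if all(issubclass(cls, candidate) for cls in classes):
            return candidate
    return object   # not reached in practice: every class subclasses object

print(most_specific_common_ancestor(CRCNSAA1Dataset, CRCNSAA1Dataset).__name__)
# CRCNSAA1Dataset
print(most_specific_common_ancestor(CRCNSAA1Dataset, CRCNSAA2Dataset).__name__)
# AudioNeuralDataset
print(most_specific_common_ancestor(CRCNSAA1Dataset, VideoNeuralDataset).__name__)
# NeuralDataset
```

The three printed names match the three rows of the table above.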

Iterating only one source’s data

__len__ and __getitem__ filter by the current neuron selection — they expose only stimuli for which at least one currently selected neuron has valid response data (see data_paradigm.md §8 for the general rule). This makes selecting a sub-population on a chimeric dataset Just Work: the cross-block stims, which are full-NaN against the selected neurons, are hidden automatically.

combined = aa1 + aa2     # 30 + 117 = 147 stims, 100 + 494 = 594 neurons
len(combined)            # 147 — full pool by default

# select only AA1's neurons -> AA2's stims disappear from iteration
combined.select_population(list(range(aa1.N_neurons)))
len(combined)            # 30
combined[0]              # an AA1 stim, with valid responses for the selection
combined[30]             # IndexError, not a fully-NaN AA2 stim

# select MLd neurons across both sources (AA1 has 'MLd', AA2 has 'mld')
mld = [n for n, m in enumerate(combined.nrn_meta)
       if m["area"].lower() == "mld"]
combined.select_population(mld)
len(combined)            # all stims that any MLd neuron heard, in either source

A DataLoader over the chimeric dataset under any of these selections visits only the relevant stims — no manual filtering, no risk of training on a fully-NaN batch item.

The same idea works in the other direction. Filtering on a stim attribute (select_stims_by_attr, select_stim, select_stims) narrows iteration to the matching stims, and the bidirectional rule auto-hides cells that have no responses left in the selection:

# train only on conspecific stims across the chimeric pool
combined = aa1 + aa2
combined.select_stims_by_attr("type", "conspecific")
# - cells that have responses to >=1 conspecific stim survive
# - cells that only have responses to flatrip / non-conspecific stims are hidden
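The bidirectional visibility rule can be stated in a few lines over a boolean validity mask. The sketch below is illustrative only: `visible_stims` and `visible_neurons` are hypothetical helpers operating on a nested list, not on a real NeuralDataset, and the mask is built from the AA1 and AA2 sizes quoted in the example above.

```python
def visible_stims(mask, selected_neurons):
    """Stims with at least one valid response among the selected neurons."""
    return [s for s, row in enumerate(mask)
            if any(row[n] for n in selected_neurons)]

def visible_neurons(mask, selected_stims):
    """Neurons with at least one valid response among the selected stims."""
    n_neurons = len(mask[0]) if mask else 0
    return [n for n in range(n_neurons)
            if any(mask[s][n] for s in selected_stims)]

# Block-diagonal validity mask: source A is 30 stims x 100 neurons,
# source B is 117 stims x 494 neurons.
S_A, N_A, S_B, N_B = 30, 100, 117, 494
mask = [[(s < S_A) == (n < N_A) for n in range(N_A + N_B)]
        for s in range(S_A + S_B)]

print(len(visible_stims(mask, range(N_A + N_B))))   # 147: full pool
print(len(visible_stims(mask, range(N_A))))         # 30: only A's stims remain
print(len(visible_neurons(mask, range(S_A))))       # 100: only A's neurons remain
```

The first two numbers reproduce the `len(combined)` values in the neuron-selection example; the third shows the same rule applied from the stim side.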

Caller invariants we don’t enforce

  • Neuron and stim UIDs should be disjoint across sources. Pooling a dataset with its own subset is degenerate (use constructor arguments instead). Duplicate UIDs across sources are silently accepted but produce a misleading view of coverage. We could check this at concat time, but that would force every dataset to have a canonical UID field, which is not yet the case.

Memory cost

Concatenation is eager: the result holds its own complete (S, N) grid of response references. At deepSTRF scales (S, N in the low hundreds) this is negligible — cross-block entries are single-element (1, 1) NaN tensors weighing ~8 bytes each. The full aa1 + aa2 + ns1 cross-block overhead is on the order of kilobytes.
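As a rough worked example, the cross-block cell count for aa1 + aa2 follows directly from the sizes used earlier on this page (30 × 100 and 117 × 494). The sketch takes the ~8 bytes per cell figure above at face value and ignores any per-object overhead, which is an assumption; `cross_block_cells` is a hypothetical helper.

```python
def cross_block_cells(shapes):
    """Number of (stim, neuron) cells outside the diagonal blocks."""
    total_stims = sum(s for s, _ in shapes)
    total_neurons = sum(n for _, n in shapes)
    real_cells = sum(s * n for s, n in shapes)
    return total_stims * total_neurons - real_cells

BYTES_PER_CELL = 8   # the approximate per-sentinel figure quoted above

cells = cross_block_cells([(30, 100), (117, 494)])   # aa1 + aa2
print(cells)                          # 26520
print(cells * BYTES_PER_CELL / 1024)  # 207.1875 (KiB)
```

Under that assumption, the aa1 + aa2 cross-blocks come to roughly 200 KiB.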

See also