# Conventions One of the main contributions of this library is a growing zoo of off-the-shelf electrophysiology (and EEG) datasets, compiled from various public sources, preprocessed, and exposed through a single PyTorch API. A major obstacle in sensory-response fitting is the sheer variability of preprocessing methods and formats — different signals (extracellular spikes, multi-unit activity, intracellular potential, scalp EEG), different labs, and different experimental setups. deepSTRF hides that behind one common contract so the same model and training loop work across datasets. ## One base class, one shape Every dataset subclasses {class}`~deepSTRF.datasets.neural_dataset.NeuralDataset` (itself a `torch.utils.data.Dataset`), so it composes with the usual PyTorch utilities (splitting, shuffling, concatenation, …). Internally each dataset stores, per stimulus: * the **stimulus** presented (a spectrogram `(1, F, T)`, or a raw waveform `(1, T_audio)` when `return_waveform=True`); * the **trial-resolved responses**, time-aligned to the stimulus; * per-stimulus and per-neuron **metadata** dicts (`stim_meta`, `nrn_meta`). Datasets are sparse and ragged — neurons may not all hear every stimulus, and stimuli vary in duration and repeat count. deepSTRF handles this with **NaN sentinels** for missing (stimulus, neuron) trials and zero-/NaN-padding at collate time. The full contract — the canonical `(B, N, R, T)` response shape, the NaN-sentinel rules, and the bidirectional neuron/stimulus selection — is documented in {doc}`data_paradigm`. Read that before touching any response-path code. ## What a batch looks like Pair a dataset with `neural_collate` and a `DataLoader`. Each batch is a **dict** (not a positional tuple), so future per-trial variables can be added as new keys without breaking your unpacking: ```python from torch.utils.data import DataLoader from deepSTRF.utils.data import neural_collate loader = DataLoader(ds, batch_size=8, collate_fn=neural_collate) for batch in loader: stims = batch['stims'] # (B, ..., T) zero-padded, no NaN responses = batch['responses'] # (B, N, R, T) NaN-padded valid_mask = batch['valid_mask'] # (B, N, R, T) bool ~responses.isnan() metas = batch['stim_meta'] # length-B list of per-stim dicts ... ``` Indexing the dataset directly (`ds[i]`) returns the same keys for a single item. `CCmax` / `TTRC`-style normalisation is **not** precomputed at the dataloader boundary — it is derived on demand from `responses` by the metrics in {doc}`metrics_paradigm` (e.g. `normalized_corrcoef`). ## Selecting neurons and stimuli The set of units (and stimuli) to work with is chosen through the selection API rather than a fixed constructor argument — by default all neurons are selected. The selection drives both `len(ds)` and iteration. See {doc}`data_paradigm` for the exact semantics; the common entry points are: ```python ds.select_population([0, 1, 2]) # by index (or a single int) ds.select_pop_by_nrn_attr("area", "Field_L") # by metadata label ds.select_pop_by_nrn_predicate(lambda n: n.get("snr", 0) > 0.5) # by threshold ds.select_stims_by_attr("type", "human_speech") # restrict the stimulus set ```