Computational neural response models
We provide a handful of computational models that can be trained to fit biological auditory neural responses from stimuli in the form of a spectrogram, or “cochleagram”.
Conventions
All models are available inside deepSTRF/models/models.py, and are PyTorch classes inheriting from a parent class
named AudioResponseModel, itself a subclass of torch.nn.Module.
As a result, they all implement inference inside the forward() method, following the standard PyTorch convention.
This method expects a (B, 1, F, T) input spectrogram tensor and outputs a (B, N, T) neuronal response tensor.
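As a rough illustration of this shape contract, the sketch below defines a stand-in module that maps a (B, 1, F, T) spectrogram to a (B, N, T) response. `ToyResponseModel` and its constructor arguments are hypothetical names chosen for this example; it subclasses torch.nn.Module directly rather than AudioResponseModel, whose constructor signature is not described here.

```python
import torch
import torch.nn as nn


class ToyResponseModel(nn.Module):
    """Hypothetical stand-in illustrating the I/O shapes; not a deepSTRF class."""

    def __init__(self, n_freqs: int, n_neurons: int):
        super().__init__()
        # A single linear readout across frequency, applied at every timestep.
        self.readout = nn.Linear(n_freqs, n_neurons)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, F, T) input spectrogram
        x = x.squeeze(1)           # (B, F, T)
        x = x.transpose(1, 2)      # (B, T, F)
        y = self.readout(x)        # (B, T, N)
        return y.transpose(1, 2)   # (B, N, T) neuronal responses


spec = torch.randn(4, 1, 34, 100)  # B=4 stimuli, F=34 bins, T=100 timesteps
resp = ToyResponseModel(n_freqs=34, n_neurons=7)(spec)
print(resp.shape)
```

Any model following the same convention can be swapped in without changing the surrounding training loop, since only the tensor shapes are part of the contract.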
Models share basic common attributes, such as:
- self.T –> the past temporal context used to predict each timestep / convolution kernel size along the time dimension
- self.F –> the number of frequency bins expected for the input spectrograms
- self.O –> the number of output neurons whose activity is predicted simultaneously
- self.H –> generally the number of hidden units
- …
Models can be defined with or without the AdapTrans model, a prefiltering of the spectrogram that accounts for
ON/OFF responses and improves fitting performance. When enabled, it computes a bipolar ON/OFF spectrogram
before the backbone. Hyperparameters for AdapTrans can be given through the prefiltering argument of the constructor
as a dictionary. For example, {'prefiltering': 'AdapTrans', 'dt': 1.0, 'min_freq': 500, 'max_freq': 20000, 'scale': 'mel'}.
Models also generally implement a STRFs(...) method which returns the explicit spectro-temporal weighting of the cochleagram.
Model Zoo
Linear (L)
Arguably the most naive model of the bunch, also known as the Spectro-Temporal Receptive Field (STRF) model. The predicted activity of this model at each timestep is simply a linear combination of past spectro-temporal bins of the input spectrogram, plus a bias representing baseline activity. It has long been a popular model because of its simplicity, but often fails to capture the activity of real neurons, which are highly nonlinear most of the time.
To mitigate overfitting caused by a large number of learnable parameters, several techniques exist. For now, this repo only offers an in-house parameterization called DCLS, but we aim to implement more common ones in the future, such as separable kernels. As with AdapTrans prefiltering, hyperparameters for the parameterization can be passed as an argument to the class constructor.
Torch class: Linear; Parameterization available
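The linear combination described above can be sketched as a single causal 2D convolution whose kernel spans every frequency bin and a fixed number of past time bins. This is a minimal illustration under assumptions: `ToyLinearSTRF`, `n_freqs`, `n_lags` and `n_neurons` are hypothetical names, and the repo's own Linear class (including its DCLS parameterization) may be organised differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyLinearSTRF(nn.Module):
    """Hypothetical illustration of an STRF model; not deepSTRF's Linear class."""

    def __init__(self, n_freqs: int, n_lags: int, n_neurons: int):
        super().__init__()
        self.n_lags = n_lags
        # One (n_freqs x n_lags) kernel per neuron; the bias plays the role
        # of baseline activity.
        self.conv = nn.Conv2d(1, n_neurons, kernel_size=(n_freqs, n_lags))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, F, T). Left-pad the time axis so the output at time t
        # only sees bins t - n_lags + 1 ... t (no access to the future).
        x = F.pad(x, (self.n_lags - 1, 0))
        return self.conv(x).squeeze(2)  # (B, N, 1, T) -> (B, N, T)


model = ToyLinearSTRF(n_freqs=8, n_lags=5, n_neurons=3)
x = torch.randn(2, 1, 8, 20)
y = model(x)

# Perturbing the stimulus from t = 10 onwards should leave earlier predictions untouched.
x_future = x.clone()
x_future[..., 10:] += 1.0
y_future = model(x_future)
print(y.shape, torch.allclose(y[..., :10], y_future[..., :10], atol=1e-6))
```

In this sketch the learned kernel `model.conv.weight`, of shape (N, 1, F, n_lags), is itself the explicit spectro-temporal weighting, which is what makes the linear model easy to inspect.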
Linear-Nonlinear (LN)
Consists of a Linear model, with an added output activation which makes it nonlinear. The latter often takes the form of a sigmoid or parameterized function (see e.g. Rahman et al. or Willmore et al.).
Torch class: LinearNonlinear; Parameterization available
Network Receptive Field (NRF)
In a nutshell, an LN model with several hidden units.
Torch class: NetworkReceptiveField; Parameterization available; Original paper: Harper et al. (2016)
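One way to picture this: each hidden unit is a small LN model with its own spectro-temporal kernel, and each output neuron applies a second nonlinearity to a weighted sum of the hidden units. The sketch below assumes sigmoid activations at both stages and uses hypothetical names (`ToyNRF`, `n_hidden`); it is not the repo's NetworkReceptiveField class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyNRF(nn.Module):
    """Hypothetical two-stage LN cascade; not deepSTRF's NetworkReceptiveField."""

    def __init__(self, n_freqs: int, n_lags: int, n_hidden: int, n_neurons: int):
        super().__init__()
        self.n_lags = n_lags
        # Stage 1: one full-frequency spectro-temporal kernel per hidden unit.
        self.hidden = nn.Conv2d(1, n_hidden, kernel_size=(n_freqs, n_lags))
        # Stage 2: per-timestep linear mixing of hidden units (1x1 convolution).
        self.out = nn.Conv1d(n_hidden, n_neurons, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (self.n_lags - 1, 0))             # causal padding in time
        h = torch.sigmoid(self.hidden(x).squeeze(2))   # (B, H, T)
        return torch.sigmoid(self.out(h))              # (B, N, T)


nrf = ToyNRF(n_freqs=8, n_lags=5, n_hidden=4, n_neurons=3)
nrf_out = nrf(torch.randn(2, 1, 8, 20))
print(nrf_out.shape)
```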
Dynamic Network (DNet)
In a nutshell, an NRF model in which hidden and output units follow leaky dynamics (as in LIF spiking neurons, but without spikes), with learnable time constants.
Torch class: DNet; Parameterization available; Original paper: Rahman et al. (2019)
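To make "leaky dynamics with learnable time constants" concrete, here is a minimal sketch of a leaky integrator layer. It assumes a discrete exponential leak, v[t] = alpha * v[t-1] + (1 - alpha) * drive[t] with alpha = exp(-1 / tau) and tau expressed in time bins; the exact update rule and parameterization used by DNet and by Rahman et al. (2019) may differ, and `LeakyUnits` is a hypothetical name.

```python
import math

import torch
import torch.nn as nn


class LeakyUnits(nn.Module):
    """Hypothetical leaky-integrator layer (no spiking, no reset)."""

    def __init__(self, n_units: int, init_tau: float = 5.0):
        super().__init__()
        # Learn log(tau) so the time constant stays positive during training.
        self.log_tau = nn.Parameter(torch.full((n_units,), math.log(init_tau)))

    def forward(self, drive: torch.Tensor) -> torch.Tensor:
        # drive: (B, U, T) input current for each unit
        alpha = torch.exp(-1.0 / torch.exp(self.log_tau))  # (U,), decay in (0, 1)
        v = torch.zeros_like(drive[..., 0])                # (B, U) initial state
        states = []
        for t in range(drive.shape[-1]):
            v = alpha * v + (1.0 - alpha) * drive[..., t]
            states.append(v)
        return torch.stack(states, dim=-1)                 # (B, U, T)


layer = LeakyUnits(n_units=3, init_tau=5.0)
step_response = layer(torch.ones(2, 3, 50)).detach()
print(step_response[0, 0, :5])
```

With a constant unit drive, each unit relaxes toward 1 following 1 - alpha^(t+1), so a larger time constant gives a slower, smoother response.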
2D-CNN
Unlike the other models, this one is not based on the STRF model: it is composed of successive convolutional stages whose kernels do not span all frequencies of the input spectrogram. A fully connected prediction head follows the convolutional feature-extraction stage.
Torch class: ConvNet2D; Original paper: Pennington & David (2023)
Recurrent / state-space network (StateNet)
A per-timestep spectral encoder feeding a recurrent backbone (GRU / LSTM / Mamba / S4 / LMU). It captures long-range temporal dependencies through the recurrent state and is the strongest model in the zoo on NS1.
Torch class: StateNet; Original paper: Rançon et al. (2025)
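The overall data flow can be sketched with a GRU backbone as follows. This is only an illustration under assumptions: the encoder here is a single linear layer with a ReLU, the readout is linear, and `ToyStateNet` and its arguments are hypothetical; the actual StateNet layers and its other backbones (LSTM, Mamba, S4, LMU) are not reproduced.

```python
import torch
import torch.nn as nn


class ToyStateNet(nn.Module):
    """Hypothetical encoder + GRU + readout; not deepSTRF's StateNet."""

    def __init__(self, n_freqs: int, n_hidden: int, n_neurons: int):
        super().__init__()
        # Spectral encoder applied independently to every timestep.
        self.encoder = nn.Sequential(nn.Linear(n_freqs, n_hidden), nn.ReLU())
        # Unidirectional recurrence: the state at t summarises bins 0..t only.
        self.rnn = nn.GRU(n_hidden, n_hidden, batch_first=True)
        self.readout = nn.Linear(n_hidden, n_neurons)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.squeeze(1).transpose(1, 2)        # (B, 1, F, T) -> (B, T, F)
        z = self.encoder(x)                     # (B, T, H)
        h, _ = self.rnn(z)                      # (B, T, H)
        return self.readout(h).transpose(1, 2)  # (B, N, T)


torch.manual_seed(0)
net = ToyStateNet(n_freqs=8, n_hidden=16, n_neurons=3).eval()
stim = torch.randn(2, 1, 8, 30)
stim_cut = stim.clone()
stim_cut[..., 20:] = 0.0  # silence everything from t = 20 onwards
with torch.no_grad():
    full, cut = net(stim), net(stim_cut)
print(full.shape, torch.allclose(full[..., :20], cut[..., :20], atol=1e-6))
```

Because no fixed kernel length bounds the temporal context, the recurrent state can in principle carry information over arbitrarily long spans.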
Transformer
Patch-embedding + sinusoidal positional encoding + a per-forward causal
self-attention mask, so it generalizes to any sequence length. An optional
finite context_window makes attention band-causal.
Torch class: Transformer; Architecture: Vaswani et al. (2017); designed as the attention baseline in Rançon et al. (2025)
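The masking idea can be sketched as below. The helper name `band_causal_mask` is hypothetical, and the choice that `context_window` counts the current timestep is an assumption made for this example. The mask follows the boolean `attn_mask` convention of `torch.nn.MultiheadAttention`, where True marks positions that must not be attended to. Building the mask from the sequence length at call time is what allows one model to handle any input length.

```python
import torch


def band_causal_mask(seq_len: int, context_window=None) -> torch.Tensor:
    """Return a (seq_len, seq_len) boolean mask where True means "blocked".

    Hypothetical helper: rows are query positions, columns are key positions.
    """
    query = torch.arange(seq_len).unsqueeze(1)  # (L, 1)
    key = torch.arange(seq_len).unsqueeze(0)    # (1, L)
    blocked = key > query                       # never attend to the future
    if context_window is not None:
        # Also block keys older than the last `context_window` steps
        # (the current step is counted as part of the window here).
        blocked = blocked | (key < query - (context_window - 1))
    return blocked


mask = band_causal_mask(4, context_window=2)
print(mask)
```

With `context_window=None` the mask is plain lower-triangular causal attention; a finite window keeps only a diagonal band, so each timestep attends to itself and a bounded number of preceding steps.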
ICNet
A deep, waveform-native model: a SincNet + strided-conv encoder + bottleneck feeding a per-neuron readout with a Poisson (softplus) head. It consumes raw audio directly (no precomputed spectrogram). Designed for midbrain (IC) recordings; it ports cleanly into deepSTRF but is oversized for small cortical datasets like NS1.
Torch class: ICNet; Original paper: Drakopoulos et al. (2025)
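The readout stage alone can be illustrated as follows. The sketch assumes encoder features of shape (B, C, T) and covers only a per-neuron linear readout with a softplus rate, trained with a Poisson negative log-likelihood; the SincNet and strided-convolution encoder is omitted, and `ToyPoissonHead` is a hypothetical name rather than part of ICNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPoissonHead(nn.Module):
    """Hypothetical readout producing non-negative firing rates."""

    def __init__(self, n_features: int, n_neurons: int):
        super().__init__()
        # 1x1 convolution = an independent linear readout per neuron and timestep.
        self.readout = nn.Conv1d(n_features, n_neurons, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T) -> rates: (B, N, T); softplus keeps rates positive.
        return F.softplus(self.readout(feats))


head = ToyPoissonHead(n_features=16, n_neurons=5)
rates = head(torch.randn(2, 16, 50))
counts = torch.poisson(torch.full_like(rates, 2.0))  # synthetic spike counts

# log_input=False because the head already outputs rates, not log-rates.
loss = nn.PoissonNLLLoss(log_input=False)(rates, counts)
print(rates.shape, loss.item())
```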
Waveform front-ends (wav2spec)
Every audio model defaults to consuming a precomputed spectrogram, but can
instead take a raw waveform through its wav2spec slot — a front-end that
maps audio to a neural-rate spectrogram and that can itself be learned
end-to-end with the model. Shipped front-ends:
| Front-end | Learnable | Notes |
|---|---|---|
| | no | strictly-causal log-mel (Rahman 2019 defaults) |
| | no | ERB gammatone filterbank; optional causal PCEN |
| | filter cutoffs | parametric bandpass (Ravanelli & Bengio 2018) |
| | full | learnable Gabor + Gaussian pooling + sPCEN (Zeghidour 2021) |
All are strictly causal. See the dedicated wav2spec page for the
slot contract, the make_wav2spec factory, and the NS1 benchmarks. Waveform-native
models additionally expose waveform_gradmap(...), a listenable time-domain
receptive field (the waveform analogue of STRF_gradmap).