Ferret AC encoding subspace (Wingert 2026)

Dataset Source: Wingert et al. 2026 — Zenodo 18331549

Original paper:

“Convolutional neural network models describe the encoding subspace of local circuits in auditory cortex” by Jereme C. Wingert, Satyabrata Parida, Sam V. Norman-Haignere & Stephen V. David. Nature Neuroscience (2026).

Dataset Details

Description of Stimuli:

Concatenated natural-sound sequences drawn from the Audioset Core 3 Complete corpus and the Pro Sound Effects library, crossfaded with a 10 ms Hanning window. Each sequence is 20 s of sound flanked by 1 s of silence on either side. The older recording cohort (47 sites) is stored as the 20 s sound window only, with no silence flanks; the newer cohort (21 sites) is stored as 22 s = 1 s pre + 20 s sound + 1 s post. Both cohorts share the same gtgram bins (dt = 10 ms, F = 32 log-spaced bands from 200 Hz to 20 kHz). The values in stim.h5 are the raw linear gammatone-gram; the loader applies the paper’s preprocessing on top (see “Preprocessing” below).
Each site presents ~100 unique single-rep estimation sequences (one presentation each, STIM_seqNNNN.wav) plus a subset of up to 6 test sequences (STIM_00seq*.wav), with the repeat count R varying from 5 to 30 across sites. The 6 test files share their source audio, but each session re-rasterizes its own copy. The loader therefore treats (session, stim_name) as the canonical stim key and emits separate stim entries per session, which handles both duration cohorts under the deepSTRF ragged-T paradigm.

Description of Neurons:

2 128 A1 + 746 PEG + 217 AC + 37 HC single-units across 67 recording sites in 4 ferrets. The paper headlines A1 + PEG; the smaller AC and HC subsets are exposed for completeness but documented as less-curated.
131 additional cells from three otherwise-unrepresented PRN sessions (PRN010b, PRN011b, PRN020b) ship without an area label. Use include_unlabeled=True to surface them.
Sites range from 5 to 256 stimuli; cell counts per site span 8 to 65.
Each cell carries the published goodpred flag (auditory-responsive to the sound set, ~79% of cells). The two-probe SLJ032a recording contributes 76 cells under site='SLJ032a' (probe A) plus 47 under site='SLJ032a-B' (probe B).

Available Data:

One self-contained .tgz archive per recording site under recordings/, in NEMS recording format (<site>.meta.json, <site>.resp.h5, <site>.resp.epoch.csv, <site>.stim.h5, <site>.stim.epoch.csv, <site>.stim.json). Spike trains are stored as per-cell spike-time arrays (PointProcess) at fs = 100 Hz; spectrograms are stored as a TiledSignal h5 with one (F=32, T) array per unique stim.
cell_list.csv — per-cell metadata: cellid, siteid, area, layer, depth, narrow, celltype, sw (spike width), and goodpred. Authoritative source for the site field of nrn_meta.
wav.zip (raw 44.1 kHz waveforms) and models.zip (published CNN / LN / subspace fits) are not used by deepSTRF and not fetched by download=True.

deepSTRF parses the archive directly — nems0 is not required.
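
To make the spike-time storage concrete, the sketch below bins one presentation's spike times into 10 ms counts (fs = 100 Hz). It is an illustration only, not the loader's code: the helper name `rasterize` and the spike times are made up, and it assumes times are in seconds relative to the start of the stored window.

```python
import numpy as np

def rasterize(spike_times_s, duration_s, fs=100):
    """Bin spike times (seconds) into counts per 1/fs-second bin."""
    n_bins = int(round(duration_s * fs))
    # Convert times to integer bin indices and drop spikes outside the window.
    idx = np.floor(np.asarray(spike_times_s, dtype=float) * fs).astype(int)
    idx = idx[(idx >= 0) & (idx < n_bins)]
    return np.bincount(idx, minlength=n_bins)

# Hypothetical spike times for one presentation of a 20 s stimulus;
# the last spike falls outside the window and is discarded.
counts = rasterize([0.004, 0.012, 0.015, 19.999, 20.5], duration_s=20.0)
print(counts.shape, counts[:3], counts.sum())
```

Stacking one such vector per repeat would give the (R, T) layout mentioned under "Memory and load-time expectations".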

Setup

Easiest path — auto-download from Zenodo into the platformdirs cache:

from deepSTRF.datasets.audio import Wingert2026Dataset

ds_a1  = Wingert2026Dataset(area='A1',  download=True)   # ~4.35 GB recordings.zip
ds_peg = Wingert2026Dataset(area='PEG', download=True)   # (no extra download)

download=True is idempotent and fetches only the files the loader needs (recordings.zip + cell_list.csv); it skips the 3.7 GB wav.zip and the 0.1 MB models.zip. The default cache directory is platformdirs.user_cache_dir('deepSTRF')/Wingert2026, overridable via $DEEPSTRF_DATA_DIR. To use a custom path explicitly:

ds = Wingert2026Dataset('/path/to/wingert2026/', area='A1', download=True)

If you already have the data laid out manually, just pass the path:

ds = Wingert2026Dataset('/path/to/wingert2026/', area='A1')

Expected files in the data dir:

recordings/<SITE>_<hash>.tgz (one per recording site)
cell_list.csv

Filters: area, site, subset

area and site compose by intersection. Both accept a string or an iterable of strings; either can be None to pass through.

# all 2128 A1 cells in 50 sessions, the headline cohort
ds = Wingert2026Dataset(area='A1')

# A1 + PEG together (the paper's "auditory cortex" cohort, 2874 cells)
ds = Wingert2026Dataset(area=['A1', 'PEG'])

# a single recording site
ds = Wingert2026Dataset(site='CLT027c')

# the two-probe SLJ032a session — both probes share the same .tgz, so
# loading both is one .tgz read (no stim duplication across probes)
ds = Wingert2026Dataset(site=['SLJ032a', 'SLJ032a-B'])

# opt-in 131 area=None cells from PRN010b / PRN011b / PRN020b
ds = Wingert2026Dataset(area=None, include_unlabeled=True)   # N = 3259
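
The composition rule (intersection of area and site, with None passing everything through) can be sketched in plain Python. The helpers `as_set` and `filter_cells` and the three rows are hypothetical and only illustrate the semantics; the loader's internals may differ.

```python
def as_set(value):
    """Normalize a filter argument: None -> None, str -> {str}, iterable -> set."""
    if value is None:
        return None
    if isinstance(value, str):
        return {value}
    return set(value)

def filter_cells(cells, area=None, site=None):
    """Keep cells matching both filters; a None filter passes everything."""
    areas, sites = as_set(area), as_set(site)
    return [
        c for c in cells
        if (areas is None or c["area"] in areas)
        and (sites is None or c["site"] in sites)
    ]

# Made-up rows for illustration only (not taken from cell_list.csv).
cells = [
    {"cell_id": "u1", "area": "A1", "site": "siteX"},
    {"cell_id": "u2", "area": "PEG", "site": "siteX"},
    {"cell_id": "u3", "area": "A1", "site": "siteY"},
]
both = filter_cells(cells, area="A1", site=["siteX"])
print([c["cell_id"] for c in both])
```

Treating a bare string as a one-element set avoids the common pitfall of iterating over its characters.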

subset='est' keeps the single-rep STIM_seq* estimation stims; subset='val' keeps the high-rep STIM_00* test stims.

Per-site R for the test stims varies dramatically (5–30 across the release), and many sites only ever heard 1–2 of the 6 test stims. The deepSTRF data paradigm handles this naturally — cells whose session didn’t present a given test stim get a (1, 1) NaN sentinel for that (stim, cell) pair, and the bidirectional select_stims_by_attr rule hides cells with no real data for the selected stim subset.

ds = Wingert2026Dataset(area='A1')
ds.select_stims_by_attr('subset', 'val')   # cells with no val data hidden
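
The hiding behaviour can be pictured with a toy response grid. This is a sketch of one direction of the rule only (selecting stims hides cells without data), written with NumPy arrays rather than tensors; `visible_cells`, `MISSING` and the grid contents are hypothetical stand-ins, not deepSTRF objects.

```python
import numpy as np

MISSING = np.full((1, 1), np.nan)          # stand-in for the (1, 1) NaN sentinel

def visible_cells(grid, stim_idx):
    """Cells with at least one non-sentinel response among the chosen stims."""
    n_cells = len(grid[0])
    return [
        c for c in range(n_cells)
        if any(not np.isnan(grid[s][c]).all() for s in stim_idx)
    ]

real = np.ones((5, 10))                    # made-up (R, T) response
subsets = ["est", "val", "val"]            # toy stim_meta 'subset' values
grid = [                                   # grid[stim][cell]
    [real, real, real],                    # est stim heard by all 3 cells
    [real, MISSING, MISSING],              # val stim heard by cell 0 only
    [MISSING, MISSING, real],              # val stim heard by cell 2 only
]
val_idx = [s for s, name in enumerate(subsets) if name == "val"]
print(visible_cells(grid, val_idx))
```

In this toy case cell 1 never heard a val stim, so it drops out of the val selection while remaining available for est.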

Per-cell metadata

nrn_meta[n] carries the cell-list-canonical fields plus parsed cell-id components:

Field	Example	Notes
`cell_id`	`'CLT027c-009-1'`	Raw cell id. 3-segment for most cells, 4-segment for SLJ032a.
`site`	`'CLT027c'`	Authoritative siteid from `cell_list.csv`. The two-probe SLJ032a recording assigns `'SLJ032a'` to probe A and `'SLJ032a-B'` to probe B.
`session`	`'CLT027c'`	Recording-session label (first dash-separated segment of `cell_id`). Equals `site` everywhere except SLJ032a-B, whose session is `'SLJ032a'`.
`area`	`'A1'`	`'A1'`, `'PEG'`, `'AC'`, `'HC'`, or `None` (only when `include_unlabeled=True`).
`layer`	`'56'`	Cortical layer string (`'1-3'`, `'4'`, `'56'`). `None` for unlabeled cells.
`depth`	`500.0`	Depth in μm relative to L3/4 boundary. `None` for unlabeled cells.
`narrow`	`False`	Putative-inhibitory flag (spike width < 0.35 / 0.375 ms cutoff, probe-dependent). `None` for unlabeled cells.
`celltype`	`'RD'`	One of `'RD'`, `'RS'`, `'ND'`, `'NS'` (Regular / Narrow × Deep / Superficial). `None` for unlabeled cells.
`sw`	`0.756`	Spike width in ms. `None` for unlabeled cells.
`goodpred`	`True`	Published auditory-responsive flag (~79% of cells). Always populated.
`animal`	`'CLT'`	3-letter animal code (`CLT`, `LMD`, `PRN`, `SLJ`). Parsed from cell id.
`electrode`	`9`	Probe-channel index. Parsed from cell id.
`unit_in_electrode`	`1`	Unit-on-channel index. Parsed from cell id.
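
As a rough guide to how the parsed fields relate to the raw id, the sketch below splits a 3-segment cell id into the components listed in the table. The function `parse_cell_id` is hypothetical, and the 4-segment SLJ032a ids are deliberately not handled because their segment layout is not described here.

```python
def parse_cell_id(cell_id):
    """Split a 3-segment id such as 'CLT027c-009-1' into its components."""
    parts = cell_id.split("-")
    if len(parts) != 3:
        # 4-segment ids (SLJ032a) need layout details this sketch does not assume.
        raise ValueError(f"only 3-segment ids are handled here: {cell_id!r}")
    session, electrode, unit = parts
    return {
        "session": session,              # first dash-separated segment
        "animal": session[:3],           # 3-letter animal code
        "electrode": int(electrode),     # '009' -> 9
        "unit_in_electrode": int(unit),
    }

print(parse_cell_id("CLT027c-009-1"))
```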

Use the standard filter API:

ds.select_pop_by_nrn_attr('area', 'A1')                 # one area
ds.select_pop_by_nrn_attr('goodpred', True)             # auditory-responsive cells
ds.select_pop_by_nrn_attr('narrow', True)               # putative inhibitory
ds.select_pop_by_nrn_predicate(lambda n: n['depth'] is not None
                                          and n['depth'] < 200)  # superficial

Per-stim metadata

stim_meta[s] carries:

Field	Example	Notes
`name`	`'STIM_seq0032.wav'`	Source-wav file name as it appears in the NEMS archive.
`subset`	`'est'` / `'val'`	File-name prefix: `STIM_00*` → `'val'`, else `'est'`.
`session`	`'CLT027c'`	Recording session that presented this stim copy.

Stim tensors have shape (1, F=32, T) with T ∈ {2000, 2200} — ragged on T by design (the two recording cohorts use different silence flanks). The default neural_collate zero-pads on the right.
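
Right zero-padding of ragged stims amounts to the following. This is a NumPy sketch of the idea, not the neural_collate implementation; the helper `pad_right` is hypothetical, and returning the original lengths is an added convenience for masking padded bins.

```python
import numpy as np

def pad_right(stims):
    """Stack (1, F, T_i) arrays into (B, 1, F, T_max), zero-padding on the right."""
    t_max = max(s.shape[-1] for s in stims)
    shape = (len(stims),) + stims[0].shape[:-1] + (t_max,)
    batch = np.zeros(shape, dtype=stims[0].dtype)
    for i, s in enumerate(stims):
        batch[i, ..., : s.shape[-1]] = s       # copy real bins, leave the tail at 0
    lengths = np.array([s.shape[-1] for s in stims])
    return batch, lengths

short = np.ones((1, 32, 2000), dtype=np.float32)   # older-cohort length
long_ = np.ones((1, 32, 2200), dtype=np.float32)   # newer-cohort length
batch, lengths = pad_right([short, long_])
print(batch.shape, lengths)
```

Keeping the true lengths alongside the batch makes it possible to exclude the padded bins from a loss or a correlation metric.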

Preprocessing

The loader reproduces the paper’s preprocessing (see aud_subspace_fit_demo.ipynb) and has been validated against the NEMS reference loader to within float32 precision (max|diff| ≈ 1e-7):

Stimulus — the raw linear gammatone-gram in stim.h5 is log-compressed with log((x + d) / d), d = 10^log_offset (log_offset = -1 → log(10·x + 1), the NEMS log_compress default), then per-band min–max normalized to [0, 1]. Each of the 32 frequency bands is scaled independently (statistics taken across the whole stimulus set, est + val), and post-norm values < 1e-6 are forced to exactly 0 (“quiet → zero”, matching NEMS). Disable the log step with log_compress=False.
Response — per-neuron min–max to [0, 1], statistics taken across all repeats and all stims for that neuron. Per-neuron (rather than global) scaling is the NEMS choice; it leaves correlation-based metrics (cc / cc_norm) unchanged but balances each cell’s contribution to an MSE training loss.
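
The stimulus steps above can be sketched in NumPy as follows. This is an approximation for illustration rather than the loader's code: it assumes the natural logarithm, uses (F, T) arrays without the leading channel axis, adds a guard for constant bands that the text does not mention, and the function name `preprocess_stims` is hypothetical.

```python
import numpy as np

def preprocess_stims(stims, log_offset=-1.0, quiet_floor=1e-6):
    """Log-compress, then min-max each frequency band across all stims."""
    d = 10.0 ** log_offset
    logged = [np.log((s + d) / d) for s in stims]       # each (F, T_i)
    # Per-band statistics pooled over every stimulus (est + val together).
    lo = np.min([x.min(axis=1) for x in logged], axis=0)[:, None]
    hi = np.max([x.max(axis=1) for x in logged], axis=0)[:, None]
    span = np.where(hi > lo, hi - lo, 1.0)              # assumed guard: constant band
    out = []
    for x in logged:
        y = (x - lo) / span
        y[y < quiet_floor] = 0.0                        # "quiet -> zero"
        out.append(y)
    return out

# Made-up ragged-T stimuli (4 bands instead of 32 to keep the example small).
rng = np.random.default_rng(0)
stims = [rng.random((4, 50)), rng.random((4, 60))]
norm = preprocess_stims(stims)
print(norm[0].shape, norm[1].shape)
```

Because the band statistics are pooled over the whole list, a single stimulus need not span the full [0, 1] range on its own.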

Normalization statistics are computed on the full loaded set before the subset='est'|'val' filter, so the [0, 1] scaling is identical regardless of which subset you request — matching NEMS’ normalize-then-split order.
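
A small numerical check of the two claims about response scaling (normalize before splitting, and correlation being unaffected) might look like this. The data are synthetic, the repeat axis is omitted for brevity, and `cc` is a plain Pearson correlation rather than the cc_norm metric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 3 neurons, 5 stims, 40 time bins each.
resp = rng.poisson(2.0, size=(3, 5, 40)).astype(np.float64)
pred = resp + rng.normal(0.0, 1.0, size=resp.shape)     # stand-in model output

# Per-neuron min-max over all stims and bins, computed BEFORE splitting.
lo = resp.min(axis=(1, 2), keepdims=True)
hi = resp.max(axis=(1, 2), keepdims=True)
resp_norm = (resp - lo) / (hi - lo)

est, val = resp_norm[:, :4], resp_norm[:, 4:]            # split afterwards

def cc(a, b):
    """Pearson correlation between two flattened arrays."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

# Correlation is invariant to the positive affine rescaling of each neuron.
same = [np.isclose(cc(resp[n], pred[n]), cc(resp_norm[n], pred[n]))
        for n in range(resp.shape[0])]
print(est.shape, val.shape, all(same))
```

Since the statistics come from the full set, the est and val slices share one scale per neuron, and either slice alone may not reach 0 or 1.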

Note on the PRN018a / PRN018b duplicate

The release ships two .tgz files for site PRN018a (PRN018a_*.tgz and PRN018b_*.tgz) with bit-identical contents (same 40 cells, same 256 stims, same spike-time arrays). The loader detects this on its first session-map scan and drops PRN018b_*.tgz with a UserWarning. To silence it, call warnings.filterwarnings("ignore", category=UserWarning) before constructing the dataset; note that this suppresses every UserWarning, not only this one.
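
If a global filter is too broad, the standard-library `warnings.catch_warnings` context manager limits the suppression to the construction call. In the sketch below, `load_with_duplicate_warning` is a hypothetical stand-in for the dataset constructor and its message text is invented.

```python
import warnings

def load_with_duplicate_warning():
    """Stand-in for a constructor that emits a UserWarning (hypothetical)."""
    warnings.warn("duplicate archive dropped", UserWarning)
    return "dataset"

# The filter applies only inside the with-block; previous filters are
# restored on exit, so UserWarnings elsewhere in the program stay visible.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    ds = load_with_duplicate_warning()

print(ds)
```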

Three PRN sessions (PRN015a, PRN017a, PRN018a) have a .tgz whose filename does not match the cell-id prefix (PRN015b_*.tgz, etc.; the file basename uses the next session suffix). The loader resolves these by mapping on the cell-id prefix, not the filename, so users see 'PRN015a' everywhere in nrn_meta and stim_meta.

Memory and load-time expectations

The loader rasterizes spike trains into per-(stim, cell) (R, T) tensors at construction time. Approximate cost on the local mirror (Linux + Python 3.10 + 10 ms bins):

Scope	N	S	Load time	Peak RSS
One small site	20	11	~7 s	<0.5 GB
Two adjacent sites	77	117	~8 s	~0.8 GB
5 A1 sites	231	685	~15 s	~1.3 GB
Full A1 (50 sites)	2 128	~5 300	~3 min	~7–10 GB

For repeated experimentation against the full cohort, persist the dataset object once with torch.save and reload it, because instantiating a fresh Wingert2026Dataset(area='A1') re-runs the rasterizer every time. Alternatively, use select_pop_by_nrn_attr / select_pop_by_nrn_predicate to filter down to the experimentally relevant subset before training.

All (1, 1) NaN sentinels in the response grid share a single underlying tensor object — the response grid for an A1 instance costs ~80 MB of Python list pointers rather than the 5+ GB a naive per-slot torch.full(...) would consume. The behaviour is enforced by a regression test (tests/test_wingert2026.py::test_sentinels_share_one_reference).
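
The shared-sentinel idea can be shown on a toy grid. This sketch uses NumPy arrays and nested lists instead of the loader's tensors, and all names and shapes in it are made up for illustration.

```python
import numpy as np

n_stims, n_cells = 4, 3
SENTINEL = np.full((1, 1), np.nan, dtype=np.float32)    # allocated once

# Every slot starts out pointing at the same sentinel object.
grid = [[SENTINEL for _ in range(n_cells)] for _ in range(n_stims)]

# Slots with real data get their own (R, T) array (made-up shape here).
grid[0][1] = np.zeros((5, 2000), dtype=np.float32)

def is_missing(resp):
    """True for the (1, 1) all-NaN placeholder."""
    return resp.shape == (1, 1) and bool(np.isnan(resp).all())

n_missing = sum(is_missing(r) for row in grid for r in row)
n_unique = len({id(r) for row in grid for r in row})
print(n_missing, n_unique)
```

The flip side of sharing one object is that an in-place write to the sentinel would show up in every empty slot, so it is best treated as read-only.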