Ferret AC encoding subspace (Wingert 2026)

Dataset Source: Wingert et al. 2026 — Zenodo 18331549

Original paper:

Dataset Details

Description of Stimuli:

  • Concatenated natural-sound sequences drawn from the Audioset Core 3 Complete corpus and the Pro Sound Effects library, crossfaded with a 10 ms Hanning window. Each sequence contains 20 s of sound. The older recording cohort (47 sites) stores the 20 s window alone, with no silence flanks; the newer cohort (21 sites) stores 22 s = 1 s pre-silence + 20 s sound + 1 s post-silence. Both cohorts share the same gtgram bins (dt = 10 ms, F = 32 log-spaced bands from 200 Hz to 20 kHz). The values in stim.h5 are the raw linear gammatone-gram; the loader applies the paper’s preprocessing on top (see “Preprocessing” below).

  • Each site presents ~100 unique single-rep estimation sequences (one presentation each, STIM_seqNNNN.wav) plus up to 6 test sequences (STIM_00seq*.wav), with the repeat count R varying from 5 to 30 across sites. The 6 test files share their source audio, but each session re-rasterizes its own copy. The loader therefore treats (session, stim_name) as the canonical stim key and emits separate stim entries per session, which handles both duration cohorts under the deepSTRF ragged-T paradigm.

Description of Neurons:

  • 2 128 A1 + 746 PEG + 217 AC + 37 HC single-units across 67 recording sites in 4 ferrets. The paper headlines A1 + PEG; the smaller AC and HC subsets are exposed for completeness but documented as less-curated.

  • 131 additional cells from three otherwise-unrepresented PRN sessions (PRN010b, PRN011b, PRN020b) ship without an area label. Use include_unlabeled=True to surface them.

  • Sites range from 5 to 256 stimuli; cell counts per site span 8 to 65.

  • Each cell carries the published goodpred flag (auditory-responsive to the sound set, ~79% of cells). The two-probe SLJ032a recording contributes 76 cells under site='SLJ032a' (probe A) plus 47 under site='SLJ032a-B' (probe B).

Available Data:

  • One self-contained .tgz archive per recording site under recordings/, in NEMS recording format (<site>.meta.json, <site>.resp.h5, <site>.resp.epoch.csv, <site>.stim.h5, <site>.stim.epoch.csv, <site>.stim.json). Spike trains are stored as per-cell spike-time arrays (PointProcess) at fs = 100 Hz; spectrograms are stored as a TiledSignal h5 with one (F=32, T) array per unique stim.

  • cell_list.csv — per-cell metadata: cellid, siteid, area, layer, depth, narrow, celltype, sw (spike width), and goodpred. Authoritative source for the site field of nrn_meta.

  • wav.zip (raw 44.1 kHz waveforms) and models.zip (published CNN / LN / subspace fits) are not used by deepSTRF and not fetched by download=True.

deepSTRF parses the archive directly — nems0 is not required.
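Because the archive stores spike times rather than binned counts, some rasterization step is needed before training. The following is a minimal NumPy sketch of binning at 10 ms, not the loader's actual code: the function name rasterize and the assumption that spike times are given in seconds from stimulus onset are illustrative only.

```python
import numpy as np

def rasterize(spike_times_s, duration_s, fs=100):
    """Bin spike times (seconds from stimulus onset) into counts at `fs` Hz."""
    n_bins = int(round(duration_s * fs))
    edges = np.arange(n_bins + 1) / fs          # bin edges in seconds
    counts, _ = np.histogram(spike_times_s, bins=edges)
    return counts.astype(np.float32)

# Toy spike train: one spike in bin 0, two in bin 1, two later spikes.
spikes = np.array([0.003, 0.012, 0.014, 0.255, 19.999])
raster = rasterize(spikes, duration_s=20.0)
print(raster.shape, raster[:3], raster.sum())
```

A 20 s stimulus yields T = 2000 bins and a 22 s stimulus yields T = 2200, matching the two duration cohorts described above.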

Setup

Easiest path — auto-download from Zenodo into the platformdirs cache:

from deepSTRF.datasets.audio import Wingert2026Dataset

ds_a1  = Wingert2026Dataset(area='A1',  download=True)   # ~4.35 GB recordings.zip
ds_peg = Wingert2026Dataset(area='PEG', download=True)   # (no extra download)

download=True is idempotent and fetches only the files the loader needs (recordings.zip + cell_list.csv); it skips the 3.7 GB wav.zip and the 0.1 MB models.zip. The default cache directory is platformdirs.user_cache_dir('deepSTRF')/Wingert2026, overridable via $DEEPSTRF_DATA_DIR. To use a custom path explicitly:

ds = Wingert2026Dataset('/path/to/wingert2026/', area='A1', download=True)

If you already have the data laid out manually, just pass the path:

ds = Wingert2026Dataset('/path/to/wingert2026/', area='A1')

Expected files in the data dir:

  • recordings/<SITE>_<hash>.tgz (one per recording site)

  • cell_list.csv
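Before constructing the dataset from a manual layout, it can help to confirm that both expected pieces are present. The helper below is a hypothetical standard-library sketch (check_layout is not part of deepSTRF), and the archive names in the demo use made-up hashes.

```python
import tempfile
from pathlib import Path

def check_layout(data_dir):
    """Return the sorted filename prefixes found under recordings/, or raise."""
    root = Path(data_dir)
    if not (root / "cell_list.csv").is_file():
        raise FileNotFoundError(f"missing {root / 'cell_list.csv'}")
    archives = sorted((root / "recordings").glob("*.tgz"))
    if not archives:
        raise FileNotFoundError(f"no .tgz archives under {root / 'recordings'}")
    # '<SITE>_<hash>.tgz' -> '<SITE>'; split on the first underscore only.
    return sorted({p.name.split("_", 1)[0] for p in archives})

# Demo on a throwaway directory with hypothetical archive names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "recordings").mkdir()
    (root / "cell_list.csv").touch()
    for name in ("CLT027c_0a1b2c.tgz", "PRN015b_3d4e5f.tgz"):
        (root / "recordings" / name).touch()
    prefixes = check_layout(root)
print(prefixes)
```

The filename prefix is only a rough guide: for a few PRN archives it differs from the session label the loader reports.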

Filters: area, site, subset

area and site compose by intersection. Both accept a string or an iterable of strings; either can be None to pass through.

# all 2128 A1 cells in 50 sessions, the headline cohort
ds = Wingert2026Dataset(area='A1')

# A1 + PEG together (the paper's "auditory cortex" cohort, 2874 cells)
ds = Wingert2026Dataset(area=['A1', 'PEG'])

# a single recording site
ds = Wingert2026Dataset(site='CLT027c')

# the two-probe SLJ032a session — both probes share the same .tgz, so
# loading both is one .tgz read (no stim duplication across probes)
ds = Wingert2026Dataset(site=['SLJ032a', 'SLJ032a-B'])

# opt-in 131 area=None cells from PRN010b / PRN011b / PRN020b
ds = Wingert2026Dataset(area=None, include_unlabeled=True)   # N = 3259
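
The intersection semantics can be illustrated on toy metadata. This is a sketch of the composition rule only, assuming each filter is normalized to a set and None means "no constraint"; the cell and site labels are invented, the helper names are hypothetical, and include_unlabeled is not modeled.

```python
def as_set(value):
    """None passes through; a string becomes a one-element set."""
    if value is None:
        return None
    return {value} if isinstance(value, str) else set(value)

def select(cells, area=None, site=None):
    """Keep cells that satisfy every filter that is not None."""
    areas, sites = as_set(area), as_set(site)
    return [c for c in cells
            if (areas is None or c["area"] in areas)
            and (sites is None or c["site"] in sites)]

# Invented metadata rows, for illustration only.
cells = [
    {"cell_id": "u1", "area": "A1", "site": "siteX"},
    {"cell_id": "u2", "area": "PEG", "site": "siteX"},
    {"cell_id": "u3", "area": "A1", "site": "siteY"},
]
both = select(cells, area="A1", site=["siteX"])     # intersection of the two
by_area = select(cells, area=["A1", "PEG"])         # site=None passes through
print([c["cell_id"] for c in both], len(by_area))
```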

subset='est' keeps the single-rep STIM_seq* estimation stims; subset='val' keeps the high-rep STIM_00* test stims.

Per-site R for the test stims varies dramatically (5–30 across the release), and many sites only ever heard 1–2 of the 6 test stims. The deepSTRF data paradigm handles this naturally — cells whose session didn’t present a given test stim get a (1, 1) NaN sentinel for that (stim, cell) pair, and the bidirectional select_stims_by_attr rule hides cells with no real data for the selected stim subset.

ds = Wingert2026Dataset(area='A1')
ds.select_stims_by_attr('subset', 'val')   # cells with no val data hidden
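
To make the sentinel rule concrete, here is a small NumPy sketch of one direction of it: given a stim selection, find the cells that have real data. It uses NumPy arrays as a stand-in for the loader's tensors, and the grid layout (a list of per-stim rows, one slot per cell) and function names are assumptions for illustration.

```python
import numpy as np

SENTINEL = np.full((1, 1), np.nan, dtype=np.float32)

def is_sentinel(resp):
    """True for the (1, 1) all-NaN placeholder."""
    return resp.shape == (1, 1) and bool(np.isnan(resp).all())

def visible_cells(grid, stim_indices):
    """Indices of cells with real data for at least one selected stim."""
    n_cells = len(grid[0])
    return [c for c in range(n_cells)
            if any(not is_sentinel(grid[s][c]) for s in stim_indices)]

est = np.zeros((1, 2000), dtype=np.float32)   # toy single-rep response
val = np.zeros((5, 2000), dtype=np.float32)   # toy R=5 test response
grid = [
    [est, est, SENTINEL],            # stim 0: heard by cells 0 and 1
    [val, SENTINEL, SENTINEL],       # stim 1 (test stim): only cell 0
    [SENTINEL, SENTINEL, est],       # stim 2: another session, cell 2
]
print(visible_cells(grid, [1]), visible_cells(grid, [0, 2]))
```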

Per-cell metadata

nrn_meta[n] carries the cell-list-canonical fields plus parsed cell-id components:

| Field | Example | Notes |
| --- | --- | --- |
| cell_id | 'CLT027c-009-1' | Raw cell id. 3-segment for most cells, 4-segment for SLJ032a. |
| site | 'CLT027c' | Authoritative siteid from cell_list.csv. The two-probe SLJ032a recording assigns 'SLJ032a' to probe A and 'SLJ032a-B' to probe B. |
| session | 'CLT027c' | Recording-session label (first dash-separated segment of cell_id). Equals site everywhere except SLJ032a-B, whose session is 'SLJ032a'. |
| area | 'A1' | 'A1', 'PEG', 'AC', 'HC', or None (only when include_unlabeled=True). |
| layer | '56' | Cortical layer string ('1-3', '4', '56'). None for unlabeled cells. |
| depth | 500.0 | Depth in μm relative to L3/4 boundary. None for unlabeled cells. |
| narrow | False | Putative-inhibitory flag (spike width < 0.35 / 0.375 ms cutoff, probe-dependent). None for unlabeled cells. |
| celltype | 'RD' | One of 'RD', 'RS', 'ND', 'NS' (Regular / Narrow × Deep / Superficial). None for unlabeled cells. |
| sw | 0.756 | Spike width in ms. None for unlabeled cells. |
| goodpred | True | Published auditory-responsive flag (~79% of cells). Always populated. |
| animal | 'CLT' | 3-letter animal code (CLT, LMD, PRN, SLJ). Parsed from cell id. |
| electrode | 9 | Probe-channel index. Parsed from cell id. |
| unit_in_electrode | 1 | Unit-on-channel index. Parsed from cell id. |
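
The parsed fields can be reproduced from a cell id with plain string handling. The sketch below is not the loader's parser: it assumes the electrode and unit are always the last two dash-separated segments (which would also cover 4-segment ids) and that the animal code is the first three characters of the session label.

```python
def parse_cell_id(cell_id):
    """Split a cell id such as 'CLT027c-009-1' into its components."""
    parts = cell_id.split("-")
    session = parts[0]                      # first dash-separated segment
    return {
        "session": session,
        "animal": session[:3],              # 3-letter animal code
        "electrode": int(parts[-2]),        # assumed: second-to-last segment
        "unit_in_electrode": int(parts[-1]),
    }

parsed = parse_cell_id("CLT027c-009-1")
print(parsed)
```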

Use the standard filter API:

ds.select_pop_by_nrn_attr('area', 'A1')                 # one area
ds.select_pop_by_nrn_attr('goodpred', True)             # auditory-responsive cells
ds.select_pop_by_nrn_attr('narrow', True)               # putative inhibitory
ds.select_pop_by_nrn_predicate(lambda n: n['depth'] is not None
                                          and n['depth'] < 200)  # superficial

Per-stim metadata

stim_meta[s] carries:

| Field | Example | Notes |
| --- | --- | --- |
| name | 'STIM_seq0032.wav' | Source-wav file name as it appears in the NEMS archive. |
| subset | 'est' / 'val' | Derived from the file-name prefix: STIM_00* → 'val', else 'est'. |
| session | 'CLT027c' | Recording session that presented this stim copy. |
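
The subset rule amounts to a prefix test on the file name. A minimal sketch follows; the function name stim_subset and the second file name are hypothetical, chosen only to match the STIM_00seq*.wav pattern.

```python
def stim_subset(name):
    """Classify a stim by file-name prefix: STIM_00* is 'val', the rest 'est'."""
    return "val" if name.startswith("STIM_00") else "est"

labels = [stim_subset(n) for n in ("STIM_seq0032.wav", "STIM_00seq01.wav")]
print(labels)
```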

Stim tensors have shape (1, F=32, T) with T ∈ {2000, 2200}. They are ragged on T by design, since the two recording cohorts use different silence flanks. The default neural_collate zero-pads on the right.
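
Right zero-padding of ragged stims can be sketched as follows. This is an illustration in NumPy, not the actual neural_collate: the function name pad_right and the choice to return the original lengths alongside the batch are assumptions.

```python
import numpy as np

def pad_right(stims):
    """Stack (1, F, T_i) arrays into (B, 1, F, T_max), zero-padded on the right."""
    t_max = max(s.shape[-1] for s in stims)
    batch = np.zeros((len(stims), *stims[0].shape[:-1], t_max),
                     dtype=stims[0].dtype)
    for i, s in enumerate(stims):
        batch[i, ..., : s.shape[-1]] = s     # copy data, leave the tail at zero
    lengths = np.array([s.shape[-1] for s in stims])
    return batch, lengths

short = np.ones((1, 32, 2000), dtype=np.float32)   # older-cohort length
long_ = np.ones((1, 32, 2200), dtype=np.float32)   # newer-cohort length
batch, lengths = pad_right([short, long_])
print(batch.shape, lengths)
```

Keeping the true lengths makes it possible to mask the padded bins out of a loss later on.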

Preprocessing

The loader reproduces the paper’s preprocessing (see aud_subspace_fit_demo.ipynb) and has been validated against the NEMS reference loader, agreeing to float32 precision (max|diff| on the order of 1e-7):

  • Stimulus: the raw linear gammatone-gram in stim.h5 is log-compressed with log((x + d) / d), where d = 10^log_offset. With the NEMS log_compress default, log_offset = -1, this reduces to log(10·x + 1). The result is then per-band min–max normalized to [0, 1]: each of the 32 frequency bands is scaled independently (statistics taken across the whole stimulus set, est + val), and post-norm values < 1e-6 are forced to exactly 0 (“quiet → zero”, matching NEMS). Disable the log step with log_compress=False.

  • Response: per-neuron min–max to [0, 1], with statistics taken across all repeats and all stims for that neuron. Per-neuron (rather than global) scaling is the NEMS choice; it leaves correlation-based metrics (cc / cc_norm) unchanged but balances each cell’s contribution to an MSE training loss.

Normalization statistics are computed on the full loaded set before the subset='est'|'val' filter, so the [0, 1] scaling is identical regardless of which subset you request — matching NEMS’ normalize-then-split order.
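
The two normalization steps can be sketched in NumPy as below. This is an illustrative reimplementation under stated assumptions, not the loader's code: the function names are invented, the natural log is assumed (the base cancels under the subsequent min–max), and NaN sentinels are not handled.

```python
import numpy as np

def log_compress(x, log_offset=-1):
    """log((x + d) / d) with d = 10**log_offset; equals log(10*x + 1) at -1."""
    d = 10.0 ** log_offset
    return np.log((x + d) / d)

def minmax_per_band(stims, floor=1e-6):
    """Scale each band to [0, 1] with stats pooled over all (F, T_i) stims."""
    lo = np.min([s.min(axis=1) for s in stims], axis=0)[:, None]
    hi = np.max([s.max(axis=1) for s in stims], axis=0)[:, None]
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant bands
    out = []
    for s in stims:
        z = (s - lo) / span
        z[z < floor] = 0.0                   # "quiet -> zero"
        out.append(z)
    return out

def minmax_per_neuron(resps):
    """Scale one neuron's (R, T_i) responses to [0, 1] over all reps and stims."""
    lo = min(r.min() for r in resps)
    hi = max(r.max() for r in resps)
    span = (hi - lo) if hi > lo else 1.0
    return [(r - lo) / span for r in resps]

# Toy data: two stims with the two cohort lengths, 32 bands each.
rng = np.random.default_rng(0)
raw = [rng.random((32, 2000)) * 50, rng.random((32, 2200)) * 50]
norm = minmax_per_band([log_compress(s) for s in raw])
```

Pooling the minimum and maximum over the full stim list before any est/val split is what keeps the scaling independent of the requested subset.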

Note on the PRN018a / PRN018b duplicate

The release ships two .tgz files for site PRN018a (PRN018a_*.tgz and PRN018b_*.tgz) with bit-identical contents (same 40 cells, same 256 stims, same spike-time arrays). The loader detects this on its first session-map scan and drops PRN018b_*.tgz with a UserWarning. Call warnings.filterwarnings("ignore", category=UserWarning) to silence it if needed.
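
One way to silence the warning without changing global filter state is to scope the filter to the constructor call. The sketch below uses only the standard warnings module; load_quietly and fake_build are hypothetical stand-ins, and the warning text shown is not the loader's actual message.

```python
import warnings

def load_quietly(build):
    """Call `build()` with UserWarning suppressed, restoring filters afterwards."""
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=UserWarning)
        return build()

def fake_build():
    # Stand-in for the dataset constructor; emits a placeholder warning.
    warnings.warn("duplicate archive dropped", UserWarning)
    return "dataset"

# Record everything that escapes load_quietly to show nothing leaks out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_quietly(fake_build)
print(result, len(caught))
```

In real use, build would be something like lambda: Wingert2026Dataset(area='A1'). Note that this suppresses every UserWarning raised during construction, not only the duplicate notice.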

Three other PRN sessions (PRN015a, PRN017a, PRN018a) have a .tgz filename that doesn’t match the cell-id prefix (PRN015b_*.tgz, etc. — the file basename uses the next session suffix). The loader resolves these by mapping on the cell-id prefix, not the filename, so users see 'PRN015a' everywhere in nrn_meta and stim_meta.

Memory and load-time expectations

The loader rasterizes spike trains into per-(stim, cell) (R, T) tensors at construction time. Approximate cost on the local mirror (Linux + Python 3.10 + 10 ms bins):

| Scope | N | S | Load time | Peak RSS |
| --- | --- | --- | --- | --- |
| One small site | 20 | 11 | ~7 s | <0.5 GB |
| Two adjacent sites | 77 | 117 | ~8 s | ~0.8 GB |
| 5 A1 sites | 231 | 685 | ~15 s | ~1.3 GB |
| Full A1 (50 sites) | 2 128 | ~5 300 | ~3 min | ~7–10 GB |

For repeated experimentation against the full cohort, persist the dataset object once with torch.save and reload it, since instantiating a fresh Wingert2026Dataset(area='A1') re-runs the rasterizer every time. Alternatively, use select_pop_by_nrn_attr / select_pop_by_nrn_predicate to filter down to the experimentally relevant subset before training.
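
The build-once, reload-afterwards pattern can be sketched with a small helper. To stay self-contained this version uses pickle as a stand-in for torch.save / torch.load, and load_or_build is a hypothetical name. If you swap in the torch calls, be aware that recent PyTorch releases may require weights_only=False in torch.load to restore arbitrary Python objects.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(cache_path, build):
    """Return the cached object if present; otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    obj = build()
    with cache_path.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj

calls = []

def expensive_build():
    # Stand-in for Wingert2026Dataset(area='A1'); counts how often it runs.
    calls.append(1)
    return {"n_cells": 2128}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "wingert_a1.pkl"
    first = load_or_build(path, expensive_build)    # builds and writes
    second = load_or_build(path, expensive_build)   # reads from disk
print(len(calls), first == second)
```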

All (1, 1) NaN sentinels in the response grid share a single underlying tensor object — the response grid for an A1 instance costs ~80 MB of Python list pointers rather than the 5+ GB a naive per-slot torch.full(...) would consume. The behaviour is enforced by a regression test (tests/test_wingert2026.py::test_sentinels_share_one_reference).
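
The memory argument rests on every empty slot pointing at the same object. The toy sketch below shows the idea with NumPy arrays standing in for tensors; the grid dimensions are arbitrary and the layout is an assumption for illustration.

```python
import numpy as np

SENTINEL = np.full((1, 1), np.nan, dtype=np.float32)

n_stims, n_cells = 4, 3
# Every slot starts as a reference to the same sentinel object.
grid = [[SENTINEL for _ in range(n_cells)] for _ in range(n_stims)]
grid[0][1] = np.zeros((1, 2000), dtype=np.float32)   # one slot with real data

shared = sum(slot is SENTINEL for row in grid for slot in row)
distinct = len({id(slot) for row in grid for slot in row})
print(shared, distinct)
```

As a rough consistency check, a full A1 grid of about 5 300 stims × 2 128 cells has about 11.3 million slots; at 8 bytes per list pointer that comes to roughly 90 MB, the same order of magnitude as the figure quoted above.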