Le 2025 Dataset
Dataset Source: figshare 29203457
Companion code (paper analyses): melizalab/auditory-restoration (also archived at Zenodo 15864044).
Citation:
Le, B.; Bjoring, M. C.; Meliza, C. D. (2025). The zebra finch auditory cortex
reconstructs occluded syllables in conspecific song. Nature Communications,
16:8452. https://doi.org/10.1038/s41467-025-63182-y
Dataset Details
Population fitting: ✅
Description of Stimuli:
8 conspecific song motifs (“natural-syntax”, nat8a / nat8b stim sets) plus 8 scrambled-syntax pseudo-motifs (synth8b); each motif 837–1200 ms, high-pass filtered at 500 Hz, RMS-normalized to −27 dBFS (cohorts 1 & 2) or −30 dBFS (cohort 3).
2 critical intervals (CI1, CI2) per motif, 58–100 ms duration each.
Up to 7 variants per CI, probing the auditory restoration illusion:
| code | name | content | cohorts |
|---|---|---|---|
| C | Continuous | Unmodified motif (shared across both CIs) | all |
| G | Gap | CI replaced by silence | all |
| N | Noise | CI-duration noise burst in isolation | all |
| GB | Gap + Burst | CI replaced by noise within the motif (illusion) | all |
| CB | Continuous + Burst | Motif unchanged, noise added on top of CI | all |
| GM | Gap-Masked | Whole motif masked by noise, CI deleted | nat8b / synth8b |
| CM | Continuous-Masked | Whole motif masked, CI intact | nat8b / synth8b |
Wav files at native sample rate (40 kHz nat8a, 48 kHz nat8b, 44.1 kHz synth8b).
10 presentation repeats per (motif, variant) per neuron; some trials were curated out, so the effective R ranges from 4 to 10.
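To make the variant codes concrete, here is a minimal sketch of how the G / N / GB / CB waveforms relate to a continuous motif. It is an illustration only, not the stimulus-generation code used for the dataset: the function name `make_variants`, the white-noise burst, the `noise_rms` level and the absence of onset/offset ramps are all assumptions.

```python
import numpy as np

def make_variants(motif, ci_onset_s, ci_offset_s, sample_rate_hz,
                  noise_rms=0.1, seed=0):
    """Build illustrative C / G / N / GB / CB waveforms for one critical interval.

    Hypothetical helper: white noise and hard edges are assumptions.
    """
    rng = np.random.default_rng(seed)
    start = int(round(ci_onset_s * sample_rate_hz))
    stop = int(round(ci_offset_s * sample_rate_hz))
    # Noise burst spanning exactly the critical interval.
    burst = rng.standard_normal(stop - start) * noise_rms

    gap = motif.copy()
    gap[start:stop] = 0.0             # G: CI replaced by silence

    gap_burst = gap.copy()
    gap_burst[start:stop] = burst     # GB: CI replaced by noise

    cont_burst = motif.copy()
    cont_burst[start:stop] += burst   # CB: noise added on top of the intact CI

    return {"C": motif.copy(), "G": gap, "N": burst,
            "GB": gap_burst, "CB": cont_burst}

# Toy 1 s "motif" (a 2 kHz tone) with a made-up 80 ms critical interval.
sr = 48_000
t = np.arange(sr) / sr
motif = 0.05 * np.sin(2 * np.pi * 2000 * t)
variants = make_variants(motif, 0.40, 0.48, sr)
```

GB and CB differ only inside the critical interval, which is why comparing responses to the two probes whether the occluded syllable is restored.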
Description of Neurons:
Extracellular single-unit recordings from anesthetized adult zebra finch auditory pallium (NCM/L3, L2a/b, CM/L1; some sites unlabelled).
Three cohorts × two stim sets = four (cohort, stim-set) pairings:
| sub-experiment | response dirs | cohort(s) | birds | units |
|---|---|---|---|---|
| nat8a | nat8a-{alpha,beta}-responses/ | 1 + 2 | 14 + 10 | 407 + 387 |
| nat8b | nat8b-responses/ | 3 | ~8 | 445 |
| synth8b | synth8b-responses/ | 3 | ~5 | 476 |
Cohort 1 (nat8a-alpha) had a familiarity manipulation: half the birds were socially housed with half of the 8 stimulus singers before recording. Cohorts 2 and 3 are entirely unfamiliar (singers had died by then).
Spike sorting:
Cohort 1: MountainSort 4 + phy curation.
Cohorts 2 & 3: Kilosort 2.5 + phy curation.
Spike times stored per-unit as JSON pprox files. Two pprox schemas coexist: cohort 1 uses the legacy spec:2/pprox (spike times in ms, condition field encodes the variant); everything else uses spec:2/stimtrial (spike times in seconds, stimulus.name carries the full filename stem). The dataset class dispatches on $schema and presents a uniform interface to callers.
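The schema dispatch described above could look roughly like the sketch below. It is not the dataset class's actual code: the helper name `normalize_unit`, the top-level `pprox` trial list and the per-trial `events` key are assumptions about the file layout, and the example schema strings and stimulus name are made up.

```python
def normalize_unit(unit):
    """Return a list of {"label", "spikes_s"} dicts from one parsed pprox unit.

    Assumed layout: a top-level "$schema" string and a "pprox" list of
    trials, each holding an "events" list of spike times.
    """
    schema = unit.get("$schema", "")
    if "stimtrial" in schema:
        divisor = 1.0                       # spike times already in seconds
        get_label = lambda trial: trial["stimulus"]["name"]
    elif "pprox" in schema:
        divisor = 1000.0                    # legacy files store milliseconds
        get_label = lambda trial: trial["condition"]
    else:
        raise ValueError(f"unrecognized pprox schema: {schema!r}")
    return [
        {"label": get_label(trial),
         "spikes_s": [t / divisor for t in trial["events"]]}
        for trial in unit["pprox"]
    ]

# Two toy units, one per schema, carrying the same spikes.
legacy = {"$schema": "spec:2/pprox",
          "pprox": [{"condition": "GB", "events": [12.5, 40.0]}]}
modern = {"$schema": "spec:2/stimtrial",
          "pprox": [{"stimulus": {"name": "example_stem"},
                     "events": [0.0125, 0.04]}]}
legacy_trials = normalize_unit(legacy)
modern_trials = normalize_unit(modern)
```

After normalization both units expose spike times in seconds, so downstream binning does not need to know which cohort a file came from.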
Processing needed (Dataset class __init__):
Wav → gammatone spectrogram (50 log-spaced bands, 1–8 kHz, 2.5 ms window, hop = dt_ms, log(P+1) compression). Matches the paper’s gammatone package recipe (Methods p. 10).
pprox → (R, T) PSTH per (stim, neuron), aligned to stim onset, binned at dt_ms.
Optional 21 ms Hanning smoothing (Hsu / Borst / Theunissen 2004 convention).
Optional per-neuron Sahani-Linden signal_power / noise_power / snr, length-weighted across stims (∼1 min for nat8b; skip with compute_reliability=False).
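The binning and reliability steps can be sketched for a single (stim, neuron) pair as follows. This is a minimal illustration under stated assumptions, not the class's implementation: `bin_trials` and `sahani_linden` are hypothetical names, spike times are taken to be in ms relative to stimulus onset, and the length-weighting across stimuli, the smoothing and the snr value are not shown.

```python
import numpy as np

def bin_trials(trials_ms, duration_ms, dt_ms):
    """Bin R spike-time lists (ms from stimulus onset) into an (R, T) count array."""
    n_bins = int(np.ceil(duration_ms / dt_ms))
    edges = np.arange(n_bins + 1) * dt_ms
    # Spikes outside [0, n_bins * dt_ms] are dropped by np.histogram.
    return np.stack([np.histogram(t, bins=edges)[0] for t in trials_ms]).astype(float)

def sahani_linden(psth):
    """Signal and noise power of an (R, T) response array (Sahani & Linden estimator)."""
    n_rep = psth.shape[0]
    if n_rep < 2:
        raise ValueError("need at least two repeats")
    power_of_mean = psth.mean(axis=0).var()   # power of the trial-averaged response
    mean_of_power = psth.var(axis=1).mean()   # average single-trial power
    signal = (n_rep * power_of_mean - mean_of_power) / (n_rep - 1)
    noise = mean_of_power - signal
    return signal, noise

# Two identical toy trials: all response power is signal, none is noise.
trials = [[1.0, 7.0, 7.5], [1.0, 7.0, 7.5]]
psth = bin_trials(trials, duration_ms=20.0, dt_ms=5.0)
signal_power, noise_power = sahani_linden(psth)
```

With identical repeats the estimator attributes every bit of variance to the stimulus; trial-to-trial variability moves power from the signal term into the noise term.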
Benchmark results
Quick baseline established from examples/le_2025_baseline.ipynb:
nat8b, dt_ms=5, 50-band gammatone, train on 6 motifs, val/test 1 each,
Continuous variant only (8 unoccluded motifs, no occlusion).
| Area | Model backbone | Rank | Remarks | Params / nrn | Perfs | Paper (backbone) |
|---|---|---|---|---|---|---|
| nat8b (all areas) | StateNet (GRU) | 🥇 | pop, 6-train | 1,030 | 18.3 / 46.5 (median) | this notebook |
This is a no-tuning baseline; proper leave-one-out cross-validation (LOOCV) across all 8 motifs and a hyperparameter sweep are expected to push CCnorm appreciably higher.
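One way to set up leave-one-motif-out folds with the same 6 / 1 / 1 train / val / test proportions is sketched below. The helper `motif_folds` is hypothetical, and taking the next motif in the list as the validation motif is an arbitrary choice made for illustration, not the notebook's procedure.

```python
def motif_folds(motifs):
    """Yield one fold per motif: that motif is the test set, the next one is val."""
    n = len(motifs)
    for i in range(n):
        test = motifs[i]
        val = motifs[(i + 1) % n]
        train = [m for m in motifs if m not in (test, val)]
        yield {"train": train, "val": val, "test": test}

# The nat8b motifs are named nat8mk0 .. nat8mk7.
motifs = [f"nat8mk{i}" for i in range(8)]
folds = list(motif_folds(motifs))
```

Each motif serves as the held-out test stimulus exactly once, so per-neuron scores can be averaged over the 8 folds.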
Setup
The figshare archive (zipped 105 MB; unpacks to ~720 MB) is auto-downloaded by the dataset class:
from deepSTRF.datasets.audio import Le2025Dataset
ds = Le2025Dataset(experiment='nat8b', download=True)
Default cache dir is
platformdirs.user_cache_dir('deepSTRF') / 'Le_2025', overridable via
$DEEPSTRF_DATA_DIR. download=True is idempotent — already-unpacked
trees are reused.
Manual install:
Download zebf-auditory-restoration-1.zip from figshare 29203457.
Unzip; the archive expands to a zebf-auditory-restoration-1/ directory containing metadata/, *-responses/, *-stimuli/.
ds = Le2025Dataset(path='/path/to/zebf-auditory-restoration-1', experiment='nat8b')
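The unzip step can be scripted so that repeated runs reuse an existing tree. This is a minimal standard-library sketch, not the downloader shipped with the dataset class; `unpack_once` is a hypothetical helper, and it only checks that the top-level directory exists, so a partially extracted tree would not be detected.

```python
import zipfile
from pathlib import Path

def unpack_once(zip_path, dest_dir, root_name="zebf-auditory-restoration-1"):
    """Extract the archive into dest_dir unless its root directory already exists."""
    root = Path(dest_dir) / root_name
    if not root.is_dir():
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest_dir)   # creates dest_dir and subdirectories as needed
    return root

# Example call (paths are placeholders):
# root = unpack_once("/path/to/zebf-auditory-restoration-1.zip", "/path/to")
```

The returned path is what would be passed as the path argument shown in the last step above.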
gammatone>=1.0 is required for the paper-faithful spectrogram and is
shipped as an optional extra:
pip install 'deepSTRF[le]'
Filtering
Each stim_meta dict carries name, motif, critical_interval,
variant, syntax ("natural" / "scrambled"), experiment,
sample_rate_hz, duration_s, ci_onset_s, ci_offset_s.
Each nrn_meta dict carries cell_id, animal_id, animal_uuid,
cohort, experiment, area, hemisphere, familiar_motifs, sex,
age_days, pprox_file, plus signal_power / noise_power / snr when
compute_reliability=True (default).
These fields combine with the base-class selection API and the variant-aware helpers:
ds.select_variant('C') # only unoccluded motifs (8 stims)
ds.select_motif('nat8mk0') # all 12 variants of one motif
ds.select_critical_interval(1) # CI=1 + the CI-independent C / CM
ds.select_pop_by_nrn_attr('cohort', 1) # familiarity sub-experiment
ds.select_pop_by_nrn_attr('area', 'NCM/L3') # one auditory subdivision
# The paper's core restoration analysis runs on a (motif, CI) × {C, CB, GB, GM} quartet:
ds.select_restoration_quartet('nat8mk0', 1)
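For readers who want to see what such a selection amounts to, the sketch below picks the same quartet from a plain list of stim_meta-style dicts. It is not the library's implementation: `restoration_quartet` is a hypothetical stand-alone function, and representing CI-independent variants with `critical_interval=None` is an assumption.

```python
QUARTET = ("C", "CB", "GB", "GM")

def restoration_quartet(stim_meta, motif, ci):
    """Map each quartet variant to the index of its stimulus for one (motif, CI)."""
    picked = {}
    for idx, meta in enumerate(stim_meta):
        if meta["motif"] != motif or meta["variant"] not in QUARTET:
            continue
        # Assumption: CI-independent variants (such as C) carry critical_interval=None.
        if meta["critical_interval"] in (ci, None):
            picked[meta["variant"]] = idx
    return picked

# Toy metadata for one motif; only the three fields used here are filled in.
toy_meta = [
    {"motif": "nat8mk0", "variant": "C", "critical_interval": None},
    {"motif": "nat8mk0", "variant": "CB", "critical_interval": 1},
    {"motif": "nat8mk0", "variant": "CB", "critical_interval": 2},
    {"motif": "nat8mk0", "variant": "GB", "critical_interval": 1},
    {"motif": "nat8mk0", "variant": "GM", "critical_interval": 1},
    {"motif": "nat8mk0", "variant": "G", "critical_interval": 1},
    {"motif": "nat8mk1", "variant": "GB", "critical_interval": 1},
]
quartet = restoration_quartet(toy_meta, "nat8mk0", 1)
```
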
Filter on the precomputed reliability stats to drop weakly responsive cells:
good = [n for n, meta in enumerate(ds.nrn_meta)
        if meta.get('snr', 0) > 0.1]
ds.select_population(good)
Cohort / sub-experiment layout
The handoff between experiment and on-disk directories:
zebf-auditory-restoration-1/
├── metadata/
│ ├── recordings.csv area / hemisphere by site (cohorts 2 & 3)
│ ├── song-birds.csv motif-name mapping (nat8a ↔ nat8b) + CI timings in ms
│ └── ephys-birds.csv cohort / sex / age / familiarity-group by bird
├── nat8a-alpha-responses/ cohort 1 (familiarity, 407 units, legacy pprox)
├── nat8a-beta-responses/ cohort 2 (unfamiliar, 387 units, stimtrial pprox)
├── nat8a-stimuli/ shared by alpha + beta; bird-named motifs
├── nat8b-responses/ cohort 3 (445 units, natural syntax)
├── nat8b-stimuli/ same motifs as nat8a, renamed nat8mk0..7
├── synth8b-responses/ cohort 3 (476 units, scrambled syntax)
└── synth8b-stimuli/ synth8mk0..7
experiment='nat8a' unifies the alpha and beta response dirs (they share
the same stimulus set); the per-unit cohort field distinguishes them.
Cross-experiment concatenation works via the standard
concat_neural_datasets:
nat = Le2025Dataset('...', experiment='nat8a')
syn = Le2025Dataset('...', experiment='synth8b')
both = nat + syn # block-diagonal coverage matrix; bidirectional rule applies
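The block-diagonal structure can be pictured with a small NumPy sketch. It assumes the coverage matrix is a boolean stimulus × neuron array marking which neurons were recorded for which stimuli; that reading, the helper name `block_diagonal_coverage` and the toy sizes are illustrative assumptions, not the library's internals.

```python
import numpy as np

def block_diagonal_coverage(shapes):
    """Build a boolean (stims, neurons) matrix from per-dataset (n_stims, n_neurons)."""
    n_stims = sum(s for s, _ in shapes)
    n_nrns = sum(n for _, n in shapes)
    coverage = np.zeros((n_stims, n_nrns), dtype=bool)
    row = col = 0
    for s, n in shapes:
        # Each dataset's neurons only heard that dataset's own stimuli.
        coverage[row:row + s, col:col + n] = True
        row += s
        col += n
    return coverage

# Toy sizes: 2 stimuli x 3 neurons for one experiment, 1 stimulus x 2 neurons for the other.
coverage = block_diagonal_coverage([(2, 3), (1, 2)])
```

Off-diagonal blocks stay False because no neuron in one experiment was presented with the other experiment's stimuli.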