# Le 2025 Dataset

**Dataset Source:** [figshare 29203457](https://doi.org/10.6084/m9.figshare.29203457)

**Companion code (paper analyses):** [melizalab/auditory-restoration](https://github.com/melizalab/auditory-restoration)
(also archived at [Zenodo 15864044](https://doi.org/10.5281/zenodo.15864044)).

**Citation:**
```text
Le, B.; Bjoring, M. C.; Meliza, C. D. (2025). The zebra finch auditory cortex
reconstructs occluded syllables in conspecific song. Nature Communications,
16:8452. https://doi.org/10.1038/s41467-025-63182-y
```


## Dataset Details

**Population fitting:** ✅

**Description of Stimuli:**
- 8 conspecific song motifs ("natural-syntax", `nat8a` / `nat8b` stim sets)
  plus 8 scrambled-syntax pseudo-motifs (`synth8b`); each motif 837–1200 ms,
  high-pass filtered at 500 Hz, RMS-normalized to −27 dBFS (cohorts 1 & 2) or
  −30 dBFS (cohort 3).
- 2 critical intervals (CI1, CI2) per motif, 58–100 ms duration each.
- Up to **7 variants per CI**, probing the auditory restoration illusion:

  | code | name                  | content                                          | cohorts |
  | ---- | --------------------- | ------------------------------------------------ | ------- |
  | C    | Continuous            | Unmodified motif (shared across both CIs)        | all     |
  | G    | Gap                   | CI replaced by silence                           | all     |
  | N    | Noise                 | CI-duration noise burst in isolation             | all     |
  | GB   | Gap + Burst           | CI replaced by noise within the motif (illusion) | all     |
  | CB   | Continuous + Burst    | Motif unchanged, noise added on top of CI        | all     |
  | GM   | Gap-Masked            | Whole motif masked by noise, CI deleted          | 3 (`nat8b` / `synth8b`) |
  | CM   | Continuous-Masked     | Whole motif masked, CI intact                    | 3 (`nat8b` / `synth8b`) |

- Wav files at native sample rate (40 kHz `nat8a`, 48 kHz `nat8b`,
  44.1 kHz `synth8b`).
- 10 presentation repeats per (motif, variant) per neuron; because some
  trials were removed during curation, the effective repeat count `R`
  ranges from 4 to 10.
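
As a quick sanity check on the variant table, the sketch below enumerates the (variant, CI) combinations available for one motif. It assumes, following the `select_critical_interval` comment in the Filtering section, that `C` and `CM` are CI-independent while the other five variants exist once per critical interval. The helper name and the tuple representation are illustrative and not part of the dataset class.

```python
# Hypothetical helper: enumerate (variant, critical_interval) pairs for one motif.
CI_INDEPENDENT = ("C", "CM")           # assumed to be shared across both CIs
PER_CI = ("G", "N", "GB", "CB", "GM")  # assumed to exist once per critical interval
MASKED = {"GM", "CM"}                  # listed only for nat8b / synth8b


def variants_for_motif(experiment):
    """Return (variant, ci) pairs; ci is None for CI-independent variants."""
    has_masked = experiment in ("nat8b", "synth8b")

    def keep(variant):
        return has_masked or variant not in MASKED

    pairs = [(v, None) for v in CI_INDEPENDENT if keep(v)]
    pairs += [(v, ci) for ci in (1, 2) for v in PER_CI if keep(v)]
    return pairs


print(len(variants_for_motif("nat8b")))  # 12
print(len(variants_for_motif("nat8a")))  # 9
```

Under these assumptions a `nat8b` or `synth8b` motif has 2 + 5 × 2 = 12 stimuli, and a `nat8a` motif has 1 + 4 × 2 = 9.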

**Description of Neurons:**
- Extracellular single-unit recordings from anesthetized adult zebra finch
  auditory pallium (NCM/L3, L2a/b, CM/L1; some sites unlabelled).
- Three cohorts and three stimulus sets give four (cohort, stim-set) pairings:

  | sub-experiment | response dirs            | cohort(s) | birds | units |
  | -------------- | ------------------------ | --------- | ----- | ----- |
  | `nat8a`        | `nat8a-{alpha,beta}-responses/` | 1 + 2 | 14 + 10 | 407 + 387 |
  | `nat8b`        | `nat8b-responses/`              | 3     | ~8      | 445   |
  | `synth8b`      | `synth8b-responses/`            | 3     | ~5      | 476   |

  *Cohort 1 (`nat8a-alpha`) had a familiarity manipulation*: half the birds
  were socially housed with half of the 8 stimulus singers before recording.
  In cohorts 2 and 3 all stimulus songs were unfamiliar (the singers had died by then).

**Spike sorting:**
- Cohort 1: MountainSort 4 + `phy` curation.
- Cohorts 2 & 3: Kilosort 2.5 + `phy` curation.
- Spike times stored per-unit as JSON
  [pprox](https://meliza.org/spec:2/pprox/) files.
  *Two pprox schemas coexist*: cohort 1 uses the legacy `spec:2/pprox` (spike
  times in ms, `condition` field encodes the variant); everything else uses
  `spec:2/stimtrial` (spike times in seconds, `stimulus.name` carries the
  full filename stem). The dataset class dispatches on `$schema` and
  presents a uniform interface to callers.
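
To illustrate the schema dispatch described above, here is a minimal sketch that converts spike times from either schema to seconds. It is not the dataset class's code: the function name is hypothetical, and the schema URLs and the top-level `pprox` / per-trial `events` keys are assumptions based on the pprox specification.

```python
# Hypothetical normaliser: per-trial spike times in seconds for either schema.
def spike_times_seconds(doc):
    """`doc` is a parsed pprox JSON object, e.g. the result of json.load()."""
    schema = doc.get("$schema", "")
    # Assumption: stimtrial files store seconds, legacy pprox files store ms.
    divisor = 1.0 if "stimtrial" in schema else 1000.0
    return [[t / divisor for t in trial.get("events", [])]
            for trial in doc["pprox"]]


# Toy documents with made-up values, one per schema.
legacy = {"$schema": "https://meliza.org/spec:2/pprox.json#",
          "pprox": [{"condition": "GB", "events": [12.5, 40.0]}]}
modern = {"$schema": "https://meliza.org/spec:2/stimtrial.json#",
          "pprox": [{"stimulus": {"name": "example-stem"},
                     "events": [0.0125, 0.04]}]}

print(spike_times_seconds(legacy))
print(spike_times_seconds(modern))
```

Both calls should yield the same spike times once the units are harmonised.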

**Processing needed (Dataset class `__init__`):**
- Wav → gammatone spectrogram (50 log-spaced bands, 1–8 kHz,
  2.5 ms window, hop = `dt_ms`, `log(P+1)` compression). Matches the
  paper's `gammatone` package recipe (Methods p. 10).
- pprox → `(R, T)` PSTH per (stim, neuron), aligned to stim onset,
  binned at `dt_ms`.
- Optional 21 ms Hanning smoothing
  (Hsu / Borst / Theunissen 2004 convention).
- Optional per-neuron Sahani-Linden `signal_power` / `noise_power` / `snr`,
  length-weighted across stims (~1 min for `nat8b`; skip with
  `compute_reliability=False`).
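
The binning, smoothing, and reliability steps listed above can be sketched with NumPy as follows. This is a rough illustration under stated assumptions, not the dataset class's implementation: the function names are hypothetical, the 21 ms window is rounded to an odd number of bins, and the power estimates use the Sahani-Linden estimator as it is commonly written (variance of the trial mean versus mean per-trial variance).

```python
import numpy as np


def bin_spikes(trials_s, duration_s, dt_ms=5.0):
    """Bin per-trial spike times (seconds from stim onset) into an (R, T) array."""
    n_bins = int(round(duration_s * 1000.0 / dt_ms))
    edges = np.arange(n_bins + 1) * dt_ms / 1000.0
    return np.stack([np.histogram(t, bins=edges)[0] for t in trials_s]).astype(float)


def hann_smooth(resp, width_ms=21.0, dt_ms=5.0):
    """Smooth along time with a unit-sum Hann window (assumed odd length)."""
    n = max(1, int(round(width_ms / dt_ms)))
    if n % 2 == 0:
        n += 1
    win = np.hanning(n + 2)[1:-1]   # drop the zero-valued endpoints
    win = win / win.sum()
    return np.apply_along_axis(
        lambda row: np.convolve(row, win, mode="same"), -1, resp)


def signal_noise_power(resp):
    """Signal and noise power for one (R, T) response array; needs R >= 2."""
    n_trials = resp.shape[0]
    power_of_mean = resp.mean(axis=0).var()
    mean_power = resp.var(axis=1).mean()
    signal = (n_trials * power_of_mean - mean_power) / (n_trials - 1)
    noise = mean_power - signal
    return signal, noise


counts = bin_spikes([[0.001, 0.012], [0.007]], duration_s=0.02, dt_ms=5.0)
print(counts.shape)  # (2, 4)
```

Note that `np.convolve(..., mode="same")` returns the longer of its two inputs, so this smoothing sketch assumes the response has at least as many bins as the window.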


## Benchmark results

Quick baseline from `examples/le_2025_baseline.ipynb`:
`nat8b`, `dt_ms=5`, 50-band gammatone, train on 6 motifs, val/test 1 each,
**Continuous variant only** (the 8 unoccluded motifs).

|   **Area**    | **Model backbone** | **Rank** | **Remarks** | **Params / nrn** | **Perfs <br/>(CCraw / CCnorm) [%]** |                **Paper (backbone)**                |
|:-------------:|:------------------:|:--------:|:-----------:|:----------------:|:-----------------------------------:|:--------------------------------------------------:|
| **nat8b (all areas)** |  StateNet (GRU) | 🥇 | pop, 6-train | 1,030 | 18.3 / 46.5 (median) | this notebook |

This is an untuned baseline; proper LOOCV across all 8 motifs and a
hyperparameter sweep are expected to raise CCnorm.
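
For reference, a leave-one-motif-out split over the 8 motifs could be generated as in the sketch below. The generator name is hypothetical, the motif names follow the `nat8mk0..7` naming of the `nat8b` stimuli, and how a validation motif is carved out of each training set is left open.

```python
# Hypothetical fold generator: hold out each motif once as the test set.
motifs = [f"nat8mk{i}" for i in range(8)]


def leave_one_motif_out(names):
    """Yield (train_motifs, test_motif) pairs, one per motif."""
    for held_out in names:
        train = [m for m in names if m != held_out]
        yield train, held_out


folds = list(leave_one_motif_out(motifs))
print(len(folds))  # 8
```

Each fold trains on 7 motifs and tests on the remaining one, so every motif serves as test data exactly once.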


## Setup

The figshare archive (zipped 105 MB; unpacks to ~720 MB) is auto-downloaded
by the dataset class:

```python
from deepSTRF.datasets.audio import Le2025Dataset

ds = Le2025Dataset(experiment='nat8b', download=True)
```

The default cache dir is the `Le_2025` subdirectory of
`platformdirs.user_cache_dir('deepSTRF')`, overridable via
`$DEEPSTRF_DATA_DIR`. `download=True` is idempotent: already-unpacked
trees are reused.

Manual install:
1. Download `zebf-auditory-restoration-1.zip` from
   [figshare 29203457](https://doi.org/10.6084/m9.figshare.29203457).
2. Unzip — the archive expands to a `zebf-auditory-restoration-1/`
   directory containing `metadata/`, `*-responses/`, `*-stimuli/`.
3. Point the dataset class at the unpacked directory:
   `ds = Le2025Dataset(path='/path/to/zebf-auditory-restoration-1', experiment='nat8b')`.
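
After a manual install, a short check like the one below can confirm that the unpacked directory has the expected top-level layout before it is passed to the dataset class. The helper is hypothetical and uses only the standard library; it tests for a `metadata/` directory and at least one `*-responses/` and `*-stimuli/` directory, as described in step 2.

```python
from pathlib import Path


def check_layout(root):
    """Return a list of problems found under the unpacked archive root."""
    root = Path(root)
    problems = []
    if not (root / "metadata").is_dir():
        problems.append("missing metadata/")
    for suffix in ("responses", "stimuli"):
        if not any(p.is_dir() for p in root.glob(f"*-{suffix}")):
            problems.append(f"no *-{suffix}/ directory")
    return problems


# An empty list means the layout looks plausible.
print(check_layout("/path/to/zebf-auditory-restoration-1"))
```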

`gammatone>=1.0` is required for the paper-faithful spectrogram and is
shipped as an optional extra:

```bash
pip install 'deepSTRF[le]'
```


## Filtering

Each `stim_meta` dict carries `name`, `motif`, `critical_interval`,
`variant`, `syntax` (`"natural"` / `"scrambled"`), `experiment`,
`sample_rate_hz`, `duration_s`, `ci_onset_s`, `ci_offset_s`.

Each `nrn_meta` dict carries `cell_id`, `animal_id`, `animal_uuid`,
`cohort`, `experiment`, `area`, `hemisphere`, `familiar_motifs`, `sex`,
`age_days`, `pprox_file`, plus `signal_power` / `noise_power` / `snr` when
`compute_reliability=True` (default).

These fields combine with the
[base-class selection API](data_paradigm.md#8-iteration-honours-the-current-selection-bidirectional)
and the variant-aware helpers:

```python
ds.select_variant('C')                       # only unoccluded motifs (8 stims)
ds.select_motif('nat8mk0')                   # all 12 variants of one motif
ds.select_critical_interval(1)               # CI=1 + the CI-independent C / CM
ds.select_pop_by_nrn_attr('cohort', 1)       # familiarity sub-experiment
ds.select_pop_by_nrn_attr('area', 'NCM/L3')  # one auditory subdivision

# The paper's core restoration analysis runs on a (motif, CI) × {C, CB, GB, GM} quartet:
ds.select_restoration_quartet('nat8mk0', 1)
```

Filter on the precomputed reliability stats to drop weakly responsive cells:

```python
good = [n for n, meta in enumerate(ds.nrn_meta)
        if meta.get('snr', 0) > 0.1]
ds.select_population(good)
```
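
The `stim_meta` dicts can be inspected the same way with plain Python. The sketch below runs on mock metadata so that it is self-contained; the helper name is hypothetical, the values are made up, and representing CI-independent variants with `critical_interval=None` is an assumption about the class.

```python
# Mock metadata using the documented keys; all values are illustrative only.
stim_meta = [
    {"name": "s0", "motif": "nat8mk0", "critical_interval": None, "variant": "C"},
    {"name": "s1", "motif": "nat8mk0", "critical_interval": 1, "variant": "GB"},
    {"name": "s2", "motif": "nat8mk0", "critical_interval": 2, "variant": "GB"},
    {"name": "s3", "motif": "nat8mk1", "critical_interval": 1, "variant": "GB"},
]


def stim_indices(meta, **criteria):
    """Indices of stimuli whose metadata matches every key=value criterion."""
    return [i for i, m in enumerate(meta)
            if all(m.get(key) == value for key, value in criteria.items())]


print(stim_indices(stim_meta, variant="GB", critical_interval=1))  # [1, 3]
print(stim_indices(stim_meta, motif="nat8mk0"))                    # [0, 1, 2]
```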


## Cohort / sub-experiment layout

How `experiment` maps to on-disk directories:

```text
zebf-auditory-restoration-1/
├── metadata/
│   ├── recordings.csv           area / hemisphere by site (cohorts 2 & 3)
│   ├── song-birds.csv           motif-name mapping (nat8a ↔ nat8b) + CI timings in ms
│   └── ephys-birds.csv          cohort / sex / age / familiarity-group by bird
├── nat8a-alpha-responses/       cohort 1 (familiarity, 407 units, legacy pprox)
├── nat8a-beta-responses/        cohort 2 (unfamiliar, 387 units, stimtrial pprox)
├── nat8a-stimuli/               shared by alpha + beta; bird-named motifs
├── nat8b-responses/             cohort 3 (445 units, natural syntax)
├── nat8b-stimuli/               same motifs as nat8a, renamed nat8mk0..7
├── synth8b-responses/           cohort 3 (476 units, scrambled syntax)
└── synth8b-stimuli/             synth8mk0..7
```
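
The tree above can be transcribed into a small lookup table, for example when writing scripts that walk the archive directly. The table and helper below are illustrative, not constants exported by the library; directory names and unit counts are copied from the tree.

```python
from pathlib import Path

# Illustrative lookup table transcribed from the directory tree above.
LAYOUT = {
    "nat8a": {"responses": ["nat8a-alpha-responses", "nat8a-beta-responses"],
              "stimuli": "nat8a-stimuli", "units": 407 + 387},
    "nat8b": {"responses": ["nat8b-responses"],
              "stimuli": "nat8b-stimuli", "units": 445},
    "synth8b": {"responses": ["synth8b-responses"],
                "stimuli": "synth8b-stimuli", "units": 476},
}


def response_dirs(root, experiment):
    """Paths of the response directories belonging to one experiment."""
    return [Path(root) / name for name in LAYOUT[experiment]["responses"]]


total_units = sum(entry["units"] for entry in LAYOUT.values())
print(total_units)  # 1715
```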

`experiment='nat8a'` unifies the alpha and beta response dirs (they share
the same stimulus set); the per-unit `cohort` field distinguishes them.
Cross-experiment concatenation works via the standard
[`concat_neural_datasets`](data_paradigm.md#concatenation):

```python
from deepSTRF.datasets.audio import Le2025Dataset

nat = Le2025Dataset(path='...', experiment='nat8a')
syn = Le2025Dataset(path='...', experiment='synth8b')
both = nat + syn   # block-diagonal coverage matrix; bidirectional rule applies
```
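
To make "block-diagonal coverage matrix" concrete, the NumPy sketch below builds such a matrix for two datasets that share neither stimuli nor neurons. It assumes that coverage is a boolean stimulus × neuron array marking which pairs were recorded; the function name, the row/column convention, and the toy sizes are illustrative and not taken from the library.

```python
import numpy as np


def block_diagonal_coverage(*shapes):
    """Build a boolean (stimuli x neurons) matrix from (n_stim, n_nrn) blocks."""
    n_stim = sum(s for s, _ in shapes)
    n_nrn = sum(n for _, n in shapes)
    coverage = np.zeros((n_stim, n_nrn), dtype=bool)
    row = col = 0
    for s, n in shapes:
        coverage[row:row + s, col:col + n] = True  # pairs recorded together
        row += s
        col += n
    return coverage


# Toy sizes: dataset A has 3 stimuli x 2 neurons, dataset B has 2 x 4.
cov = block_diagonal_coverage((3, 2), (2, 4))
print(cov.astype(int))
```

The off-diagonal blocks stay `False` because neurons from one experiment were never presented with the other experiment's stimuli.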