# A Natural Sound Dataset: A1 & PEG (NAT4)

**Dataset Source:** [NAT4 Dataset](https://doi.org/10.5281/zenodo.8044773)

**Original Papers:**
- ["Can deep learning provide a generalizable model for dynamic sound encoding in auditory cortex?"](https://doi.org/10.1101/2022.06.10.495698) by Jacob R. Pennington, Stephen V. David.
- ["A convolutional neural network provides a generalizable model of natural sound coding by neural populations in auditory cortex"](https://doi.org/10.1371/journal.pcbi.1011110) by Pennington JR, David SV.

## Dataset Details:

**Description of Stimuli:**
- 20 repetitions of 18 sounds + 1 repetition of 577 sounds.
- Each stimulus is 1.5 seconds in duration.

**Description of Neurons:**
- Total Number of Neurons: 849 for A1, 398 for PEG
- Valid Neurons: 777 for A1, 339 for PEG
- Valid Neurons Criteria: Auditory neurons (see the papers for further details)

**Available Data:**
- One population recording per area (`_NAT4_ozgf.fs100.ch18.tgz`) packaging the full population time series with the 18 val stimuli pre-averaged over 20 reps in the first 27 s, then 575 est stimuli at R=1 each. CSV + JSON inside the tarball.
- One per-site recording per recording session (`_single_sites/.tgz`) with raw 1 ms-resolution spike times stored as HDF5, used for trial-resolved val responses (R=20).
- Per-area `_pred_correlation.csv` flagging the 777 A1 and 339 PEG neurons that the dataset's authors classified as auditory.

**deepSTRF parses the NAT4 archive directly; NEMS0 is no longer required.**

## Benchmark results

| **Area** | **Model backbone** | **Rank** | **Remarks** | **Params / nrn** | **Perfs (CCraw / CCnorm) [%]** | **Paper (backbone)** |
|:--------:|:------------------:|:--------:|:-------------------------------------------:|:----------------:|:------------------------------:|:---------------------------------------------------------------------------------------------------:|
| **A1** | StateNet | 🥇 | LSTM, pop | 40,271 | 46.6 / 65.1 | [Rançon et al.](https://doi.org/10.1101/2025.01.08.631909) |
| | 2D-CNN | 🥈 | [AdapTrans](docs_md/README_AdapTrans.md), pop | XX,XXX | 46.4 / 64.5 | [Pennington et al.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011110) |
| | Transformer | 🥉 | pop | 28,437 | 46.6 / 64.4 | [Rançon et al.](https://doi.org/10.1101/2025.01.08.631909) |
| **PEG** | Transformer | 🥇 | pop | 28,437 | 39.7 / 55.5 | [Pennington et al.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011110) |
| | 2D-CNN | 🥈 | [AdapTrans](docs_md/README_AdapTrans.md), pop | XX,XXX | 39.2 / 55.2 | [Pennington et al.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011110) |
| | StateNet | 🥉 | LSTM, pop | 40,271 | 38.9 / 54.7 | [Pennington et al.](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011110) |

## Setup

Easiest path: auto-download from Zenodo into the platformdirs cache:

```python
from deepSTRF.datasets.audio import NAT4Dataset

ds_a1 = NAT4Dataset(area='A1', download=True)    # ~108 MB
ds_peg = NAT4Dataset(area='PEG', download=True)  # ~50 MB
```

Default cache dir is `platformdirs.user_cache_dir('deepSTRF')/NAT4`, overridable via `$DEEPSTRF_DATA_DIR`. To use a custom path explicitly:

```python
ds = NAT4Dataset('/path/to/your/data/', area='A1', download=True)
```

`download=True` is idempotent: it skips files / dirs that already exist, so re-instantiating the dataset is cheap.

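
The skip-if-present behaviour amounts to an existence check per expected entry. The sketch below illustrates that pattern only; it is not deepSTRF's implementation. `missing_nat4_files` is a hypothetical helper, the file names follow the expected-files layout given in this README, and the name of the already-extracted tarball directory is an assumption.

```python
import tempfile
from pathlib import Path


def missing_nat4_files(data_dir, area):
    """Return the expected NAT4 entries for `area` that are absent from `data_dir`."""
    data_dir = Path(data_dir)
    stem = f"{area}_NAT4_ozgf.fs100.ch18"
    missing = []
    # Population recording: either the tarball or an extracted directory
    # (assumption: the extracted directory is named after the tarball stem).
    if not ((data_dir / f"{stem}.tgz").exists() or (data_dir / stem).is_dir()):
        missing.append(f"{stem}.tgz")
    # Per-site recordings live in an extracted directory.
    if not (data_dir / f"{area}_single_sites").is_dir():
        missing.append(f"{area}_single_sites/")
    # Per-area table carrying the auditory flag.
    if not (data_dir / f"{area}_pred_correlation.csv").is_file():
        missing.append(f"{area}_pred_correlation.csv")
    return missing


# Demo on a throwaway directory: everything is missing at first,
# and an entry drops off the list once it exists.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_nat4_files(tmp, "A1")
    (Path(tmp) / "A1_pred_correlation.csv").touch()
    after = missing_nat4_files(tmp, "A1")
```

A downloader built this way would fetch only the entries returned by the helper, which is why a second call on a complete directory has nothing left to do.
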
If you already have the data laid out manually, just pass the path:

```python
ds = NAT4Dataset('/path/to/your/data/', area='A1')
```

Expected files in the data dir:

* `A1_NAT4_ozgf.fs100.ch18.tgz` (or already-extracted directory)
* `A1_single_sites/` (extracted from `A1_single_sites.zip`)
* `A1_pred_correlation.csv`
* (likewise for PEG)

## Estimation vs validation subsets

NAT4 has two stim subsets: **est** (575 stims, R=1) and **val** (18 stims, R=20). Each `stim_meta` entry carries a `"subset"` field equal to `"est"` or `"val"` so you can filter at either load time or iteration time.

```python
# load only one subset (skips the per-site spike-time pass under subset='est')
ds_est = NAT4Dataset(area='A1', subset='est')   # 575 stims
ds_val = NAT4Dataset(area='A1', subset='val')   # 18 stims

# or load everything and filter later
ds = NAT4Dataset(area='A1')                     # 593 stims
ds.select_stims_by_attr('subset', 'val')        # __len__ -> 18
                                                # 33 val-less A1 cells auto-hidden
                                                # via the bidirectional rule
ds.reset_stim_selection()                       # back to 593
```

Note: 33 of the 849 A1 cells have no val data. Under `subset='val'` (or `select_stims_by_attr('subset', 'val')`), the bidirectional rule in the base class hides them from `__getitem__` automatically, so training loops only see cells that have val data. See [the data paradigm doc](data_paradigm.md#8-iteration-honours-the-current-selection-bidirectional) for the full contract.

## Per-cell metadata

`nrn_meta[n]` carries the raw NEMS `cell_id` plus parsed components:

| Field | Example | Notes |
|---------------------|------------------|-------------------------------------------------------|
| `cell_id` | `'ARM029a-04-1'` | Raw NEMS id. |
| `area` | `'A1'` | Cortical area (A1 or PEG). |
| `auditory` | `True` | Per-cell flag from `_pred_correlation.csv`. |
| `site` | `'ARM029a'` | Recording site (animal + recording number + session). |
| `animal` | `'ARM'` | 3-letter animal code (e.g. ARM, CRD, DRX, TNC). |
| `electrode` | `4` | Electrode index, parsed from cell_id. |
| `unit_in_electrode` | `1` | Unit index on that electrode. |

Best-effort: any field whose source is missing or unparseable is `None` for that neuron. Combine with `select_pop_by_nrn_attr`, e.g.:

```python
ds.select_pop_by_nrn_attr('animal', 'ARM')    # all cells from animal ARM
ds.select_pop_by_nrn_attr('auditory', True)   # the 777 / 339 auditory cells
```
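
To make the `cell_id` decomposition concrete, here is a minimal sketch of how an id such as `'ARM029a-04-1'` could be split into the fields in the table. It is not deepSTRF's parser: `parse_cell_id` and the regular expression are hypothetical, the pattern is inferred from the single example id shown here, and unlike the per-field fallback described above this sketch returns `None` for every field when the id does not match.

```python
import re

# Assumed layout: <3-letter animal><recording number><optional session letter>-<electrode>-<unit>
_CELL_ID = re.compile(
    r"^(?P<site>(?P<animal>[A-Za-z]{3})\d+[a-z]?)-(?P<electrode>\d+)-(?P<unit>\d+)$"
)


def parse_cell_id(cell_id):
    """Split a NEMS-style cell id into site, animal, electrode and unit fields."""
    fields = {"site": None, "animal": None, "electrode": None, "unit_in_electrode": None}
    match = _CELL_ID.match(cell_id or "")
    if match is None:
        # Unparseable or missing id: leave every field as None.
        return fields
    fields["site"] = match.group("site")
    fields["animal"] = match.group("animal")
    fields["electrode"] = int(match.group("electrode"))       # '04' -> 4
    fields["unit_in_electrode"] = int(match.group("unit"))
    return fields


print(parse_cell_id("ARM029a-04-1"))
# {'site': 'ARM029a', 'animal': 'ARM', 'electrode': 4, 'unit_in_electrode': 1}
```

Ids that deviate from this pattern would need their own handling; the dataset's `nrn_meta` remains the authoritative source for these fields.
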