# CRCNS AA4 Dataset

**Dataset Source:** [AA4 Dataset](https://crcns.org/data-sets/aa/aa-4/about-aa-4)

**Citation:**

```text
Elie J E and Theunissen F E (2019), Simultaneous extracellular recordings of avian auditory neurons in zebra finches presented with all the repertoire of vocalizations used by this species for vocal communication. CRCNS.org. http://dx.doi.org/10.6080/K00C4T06
```

**Papers Using the Dataset:**

- ["Meaning in the avian auditory cortex: Neural representation of communication calls"](https://onlinelibrary.wiley.com/doi/10.1111/ejn.12812) (2015) by Julie Elie and Frédéric Theunissen
- ["Invariant neural responses for sensory categories revealed by the time-varying information for communication calls"](https://dx.plos.org/10.1371/journal.pcbi.1006698) (2019) by Julie Elie and Frédéric Theunissen

## Dataset Details

**Population fitting:** ✅

**Batching:** ✅

**Description of Stimuli:** A total of 170 different clips of conspecific vocalizations (songs and calls) and clips of artificial ripple noise, up to 3 s duration.

- Sample rate @ 24.4 kHz
- Transformed into 32-band mel spectrograms followed by a compression function
- Around 10 response trials on average for a given stimulus

**Description of Neurons:**

- Extracellular single-unit recordings from 4 male and 2 female zebra finches
- Anesthetized subjects
- Total Number of units: 1401 (including 914 single units)
- Targeted avian auditory cortical regions included:
    - **Field L** (including the thalamo recipient L2, the primary auditory regions L1 and L3),
    - **caudolateral and caudomedial mesopallium** (CLM and CMM),
    - **caudomedial nidopallium** (NCM)
- However, neurons were _not_ individually assigned one of these specific regions.
| **Animal**      | **Sex** | **#units** | **#stims** |
|:---------------:|:-------:|:----------:|:----------:|
| **BlaBro09xxF** |    F    |    151     |    130     |
| **GreBlu9508M** |    M    |    355     |    130     |
| **LblBlu2028M** |    M    |     53     |    137     |
| **WhiBlu5396M** |    M    |    198     |     73     |
| **WhiWhi4522M** |    M    |    304     |    131     |
| **YelBlu6903F** |    F    |    282     |    129     |

**Available data:**

- Full Python preprocessing.
- One folder for each animal subject, containing several .h5 files of neural recordings (one for each unit)

**Processing needed (Dataset constructor):**

- Transforming the sound waveform (.wav file) into a 32-band spectrogram.
- Choosing neurons based on stimulus type and animal.
- Transforming the spike times of each repeat of each stimulus into PSTHs
    - Remove pre-onset spikes
    - Align trials temporally
    - Pad/cut to the right (present/future time steps) so that trials have the same duration

## Setup

**Requirements**: a [CRCNS account](https://crcns.org/register).

The easiest path is auto-download via the CRCNS NERSC mirror:

```python
from deepSTRF.datasets.audio import CRCNSAA4Dataset, AA4_ANIMAL_IDS

ds = CRCNSAA4Dataset(
    download=True,
    dt_ms=5,
    crcns_username="your_username",
    crcns_password="your_password",
)
```

Alternatively set `$CRCNS_USERNAME` / `$CRCNS_PASSWORD`. Default cache dir is `platformdirs.user_cache_dir('deepSTRF')/CRCNS_AA4`, overridable via `$DEEPSTRF_DATA_DIR`. `download=True` is idempotent.

If you already have the data laid out manually, the `data/` folder should look like this:

```
data/
|____ BlaBro09xxF/
|____ GreBlu9508M/
|____ LblBlu2028M/
|____ WhiBlu5396M/
|____ YelBlu6903F/
|______ ...
```

```python
ds = CRCNSAA4Dataset('/path/to/data', stimuli=('song', 'call'), animals=(AA4_ANIMAL_IDS[0],))
```

## Filtering

Each `stim_meta` dict carries `name` (the stimulus md5, which is the canonical identifier; the wav filename is per-animal and not unique across the corpus), `type` (e.g. `"song"`, `"call"`), `class` (broader category), and `duration_s`.
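To make the `stim_meta` layout concrete, here is a minimal sketch that filters a list of such dicts in plain Python. It does not use the deepSTRF selection API: `filter_stims` is a hypothetical helper, and the records below (md5 strings, the `"ripple"` type label, the `class` values, the durations) are made up for illustration. Only the four field names come from the description above.

```python
# Hypothetical records that follow the stim_meta field layout described above.
# All values are invented for illustration; real md5 names and labels differ.
EXAMPLE_STIM_META = [
    {"name": "md5_aaa", "type": "song", "class": "vocalization", "duration_s": 2.1},
    {"name": "md5_bbb", "type": "call", "class": "vocalization", "duration_s": 0.3},
    {"name": "md5_ccc", "type": "ripple", "class": "synthetic", "duration_s": 2.0},
]


def filter_stims(stim_metas, types=None, max_duration_s=None):
    """Return the `name` (md5) of every stimulus that matches the criteria.

    A criterion left as None is not applied.
    """
    wanted = None if types is None else set(types)
    selected = []
    for meta in stim_metas:
        if wanted is not None and meta["type"] not in wanted:
            continue
        if max_duration_s is not None and meta["duration_s"] > max_duration_s:
            continue
        selected.append(meta["name"])
    return selected


# Keep songs and calls only, then keep clips of at most one second.
vocal_names = filter_stims(EXAMPLE_STIM_META, types=("song", "call"))
short_names = filter_stims(EXAMPLE_STIM_META, max_duration_s=1.0)
print(vocal_names)
print(short_names)
```

Keying the result on `name` rather than on a wav filename follows the note above that the md5 is the only identifier that is unique across animals.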
Each `nrn_meta` dict carries:

- `cell_id`: the basename of the source h5 file
- `animal_id`: one of `AA4_ANIMAL_IDS`
- `sex`: `"M"` or `"F"`, taken from the last character of `animal_id`
- `site`: recording site label, e.g. `"Site1"`
- `electrode`: int in 1-32, numbered across both hemisphere arrays (16 channels each) in 5 of the 6 birds; the sixth bird has a single array
- `ldepth` / `rdepth`: left- and right-array depth in µm at this site
- `sort_type`: `"single"` or `"multi"`; `"noise"` / `"tdt"` units are filtered out at load
- `sort_id`: online-sort id, int
- `subsort_id`: offline spike-sorting id parsed from the trailing `_ss` of the filename; `None` if absent

The dataset paper does **not** publish a per-cell brain-area assignment, so neurons cannot be filtered by area; the natural axis to slice by is `animal_id`.

Otherwise the full selection API from [the data paradigm doc](data_paradigm.md#8-iteration-honours-the-current-selection-bidirectional) is available: select neurons by metadata (`select_pop_by_nrn_attr`), select stims by metadata (`select_stims_by_attr`), and rely on the bidirectional rule, which auto-hides cells that have no responses to the current stim selection.
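As an illustration of how two of the derived `nrn_meta` fields could be computed, and of slicing by `animal_id`, here is a small sketch in plain Python. It is not the library's implementation. The helper names are hypothetical, the `_ss<digits>` filename pattern and the integer return type are assumptions based only on the description above, and the example records are invented.

```python
import re
from collections import Counter

# Assumed pattern: a trailing "_ss<digits>", optionally followed by ".h5".
_SUBSORT_RE = re.compile(r"_ss(\d+)(?:\.h5)?$")


def sex_from_animal_id(animal_id):
    """Read the sex from the last character of the animal id ("M" or "F")."""
    sex = animal_id[-1]
    if sex not in ("M", "F"):
        raise ValueError(f"cannot read sex from animal id {animal_id!r}")
    return sex


def subsort_id_from_cell_id(cell_id):
    """Return the trailing sub-sort id as an int, or None if there is none."""
    match = _SUBSORT_RE.search(cell_id)
    return int(match.group(1)) if match else None


# Invented records using a subset of the nrn_meta fields.
EXAMPLE_NRN_META = [
    {"cell_id": "unit_a_ss1", "animal_id": "BlaBro09xxF", "sort_type": "single"},
    {"cell_id": "unit_b", "animal_id": "BlaBro09xxF", "sort_type": "multi"},
    {"cell_id": "unit_c_ss2", "animal_id": "GreBlu9508M", "sort_type": "single"},
]

# Count well-isolated (single) units per animal, the natural slicing axis.
single_per_animal = Counter(
    meta["animal_id"] for meta in EXAMPLE_NRN_META if meta["sort_type"] == "single"
)
print(dict(single_per_animal))
print([subsort_id_from_cell_id(meta["cell_id"]) for meta in EXAMPLE_NRN_META])
```

In practice the same per-animal selection would go through `select_pop_by_nrn_attr`; the sketch only shows what the metadata fields make possible.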