# CRCNS AA2 Dataset

**Dataset Source:** [AA2 Dataset](https://crcns.org/data-sets/aa/aa-2/about)

**Citation:**
```text
Theunissen, Frederic E.; Gill, Patrick; Noopur, Amin; Zhang, Junli; Woolley, Sarah M. N.; Fremouw, Thane (2011): Single-unit recordings from multiple auditory areas in male zebra finches. CRCNS.org.
http://dx.doi.org/10.6080/K0JW8BSC
```

**Papers Using the Dataset:**
- ["Sound representation methods for spectro-temporal receptive field estimation"](https://doi.org/10.1007/s10827-006-7059-4) (2006) by Patrick Gill, Junli Zhang, Sarah M. N. Woolley, Thane Fremouw and Frédéric E. Theunissen
- ["Role of the Zebra Finch Auditory Thalamus in Generating Complex Representations for Natural Sounds"](https://doi.org/10.1152/jn.00128.2010) (2010) by Noopur Amin, Patrick Gill and Frederic E. Theunissen

## Dataset Details

**Population fitting:** ✅

**Description of Stimuli:**
- 72 clips of conspecific vocalizations, 20 clips of flat ripples, and 25 clips of song ripples up to 5 s duration.
- Sampled at 32 kHz with 16-bit precision.
- Roughly 10 to 20 response trials per stimulus.

**Description of Neurons:**
- Extracellular single-unit recordings from 57 male zebra finches
- Total number of neurons: 494
  - 143 MLd
  - 59 OV
  - 189 L
    - 17 L1 
    - 53 L2a
    - 42 L2b
    - 43 L3
    - 31 others ("L")
  - 37 CM
  - 66 others ("None")

**Available data:**
- Full Python preprocessing.
- One plain `.txt` file for each cell (unit) response:
  - Spike timestamp relative to stimulus onset
  - Each line corresponds to one response trial
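A minimal sketch of reading such a file, assuming the timestamps on each line are whitespace-separated numbers and that an empty line is a trial with no spikes. Neither detail is stated above, and `parse_spike_file` is a hypothetical helper, not part of deepSTRF:

```python
def parse_spike_file(text):
    """Parse the contents of one response file into a list of trials.

    Assumption: one trial per line, timestamps separated by whitespace.
    An empty line is kept as a trial without spikes.
    """
    trials = []
    for line in text.splitlines():
        trials.append([float(tok) for tok in line.split()])
    return trials


# Made-up example content: three trials, the second one silent.
example_text = "-12.5 3.0 40.25\n\n7.5\n"
example_trials = parse_spike_file(example_text)
```

Negative timestamps in the example stand for spikes fired before stimulus onset.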

**Processing needed (done in the dataset class's `__init__()` method):**
- Transforming the sound waveform (.wav file) into a 32-band spectrogram.
- Choosing neurons based on their recording site, stimulus type, and animal.
- Transforming the spike times of each repeat of each stimulus into PSTHs:
  - Remove pre-onset spikes.
  - Align trials temporally.
  - Pad or cut on the right (present/future time steps) so that all trials have the same duration.
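The PSTH steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the deepSTRF implementation: `spikes_to_psth` is a hypothetical helper, and spike times are assumed to be in milliseconds relative to stimulus onset. Binning every trial on the same fixed edges aligns the trials, zero-pads short ones, and drops spikes past the stimulus end in one step.

```python
import numpy as np


def spikes_to_psth(trials_ms, duration_ms, dt_ms=5.0):
    """Average per-trial spike counts on a shared time grid.

    Assumptions: spike times are in ms relative to stimulus onset,
    and `duration_ms` is the stimulus duration.
    """
    n_bins = int(np.ceil(duration_ms / dt_ms))
    edges = np.arange(n_bins + 1) * dt_ms  # shared bin edges, starting at onset
    counts = np.zeros((len(trials_ms), n_bins))
    for i, spikes in enumerate(trials_ms):
        spikes = np.asarray(spikes, dtype=float)
        spikes = spikes[spikes >= 0.0]  # remove pre-onset spikes
        # Spikes beyond the last edge are ignored (cut on the right);
        # bins without spikes stay at zero (pad on the right).
        counts[i] = np.histogram(spikes, bins=edges)[0]
    return counts.mean(axis=0)  # mean spike count per bin across trials


# Made-up example: two trials, 20 ms stimulus, 5 ms bins.
psth = spikes_to_psth([[-3.0, 1.0, 7.0], [2.0, 18.0, 42.0]], duration_ms=20.0)
```

Dividing the result by the bin width in seconds would turn the mean counts into a firing rate in spikes per second.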


## Setup

**Requirements**: a [CRCNS account](https://crcns.org/register).

The easiest path is to auto-download via the CRCNS NERSC mirror:

```python
from deepSTRF.datasets.audio import CRCNSAA2Dataset

ds = CRCNSAA2Dataset(
    download=True, dt_ms=5,
    crcns_username="your_username",
    crcns_password="your_password",
)
```

Alternatively, set `$CRCNS_USERNAME` / `$CRCNS_PASSWORD` instead of passing
the credentials as arguments. The default cache dir is
`platformdirs.user_cache_dir('deepSTRF')/CRCNS_AA2`,
overridable via `$DEEPSTRF_DATA_DIR`. `download=True` is idempotent.

If you already have the data laid out manually:
1. Download from [the dataset page](https://crcns.org/data-sets/aa/aa-2/about).
2. Extract `all_stims/` and `all_cells/` into a `data/` folder.
3. Point the dataset at that folder: `ds = CRCNSAA2Dataset('/path/to/data', dt_ms=5)`.

## Filtering

Each `stim_meta` dict carries `name` (stimulus identifier), `type`
(`"conspecific"` or `"songrip"`; the latter covers the reversed-song /
pitch-shifted controls), `sample_rate`, `n_samples`, and `duration_s` (the
last three come from `data/stim_data.csv`). Each `nrn_meta` dict carries
`cell_id` (the raw cell name from the dataset), `animal_id`, `area`
(`"MLd"`, `"OV"`, `"L"`, `"CM"`, or one of the smaller secondary areas; see
AA1 for the parsing details), `cell_seq` (within-animal cell index), and
`rig` (often `None` in AA2).

The full selection API from [the data paradigm doc](data_paradigm.md#8-iteration-honours-the-current-selection-bidirectional)
is available on AA2: filter neurons by metadata (`select_pop_by_nrn_attr`)
or by stim coverage (`select_pop_by_stim_attr`), filter stims by metadata
(`select_stims_by_attr`), and rely on the bidirectional rule so that
narrowing the stim space automatically hides cells with no responses
left in it.
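
The bidirectional rule can be illustrated in plain Python. The sketch below does not use the deepSTRF API: the helpers `select_stims` and `visible_cells` and the `responses` mapping are hypothetical stand-ins that only mimic the behaviour described above.

```python
def select_stims(stim_metas, **attrs):
    """Names of stimuli whose metadata match every given attribute."""
    return [
        meta["name"]
        for meta in stim_metas
        if all(meta.get(key) == value for key, value in attrs.items())
    ]


def visible_cells(responses, stim_names):
    """Cells that still have at least one response among the selected stimuli."""
    keep = set(stim_names)
    return sorted({cell for (cell, stim) in responses if stim in keep})


# Made-up metadata and responses, keyed by (cell_id, stimulus name).
stim_metas = [
    {"name": "stim_01", "type": "conspecific"},
    {"name": "stim_02", "type": "songrip"},
]
responses = {
    ("cell_A", "stim_01"): [[1.0, 4.0]],
    ("cell_A", "stim_02"): [[2.5]],
    ("cell_B", "stim_01"): [[0.5]],
}

songrip_stims = select_stims(stim_metas, type="songrip")
remaining_cells = visible_cells(responses, songrip_stims)
```

Here `cell_B` was only recorded with the conspecific stimulus, so narrowing the selection to `"songrip"` leaves `cell_A` as the only visible cell.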