# Alice EEG Dataset

**Dataset Source:** [The Alice Datasets — Deep Blue Data (UMich)](https://deepblue.lib.umich.edu/data/concern/data_sets/bg257f92t),
re-released by Brodbeck et al. as a preprocessed MNE-Python deposit at the
University of Maryland DRUM repository
([DOI 10.13016/pulf-lndn](https://doi.org/10.13016/pulf-lndn)).

**Citation:**
```text
Bhattasali, S., Brennan, J. R., Luh, W.-M., Franzluebbers, B., & Hale, J. T.
(2020). The Alice Datasets: fMRI & EEG Observations of Natural Language
Comprehension. Proceedings of the 12th Conference on Language Resources and
Evaluation (LREC), 120–125.
```

**Papers using the dataset:**
- ["Hierarchical structure guides rapid linguistic predictions during naturalistic listening"](https://doi.org/10.1371/journal.pone.0207741)
  by Brennan & Hale (2019), PLOS ONE.
- ["Eelbrain, a Python toolkit for time-continuous analysis with temporal response functions"](https://doi.org/10.7554/eLife.85012)
  by Brodbeck et al. (2023), eLife. (Uses the same Alice EEG release that
  deepSTRF consumes; the benchmark numbers below are taken from this paper.)


## Dataset Details

**Population fitting:** ✅ (across subjects, across channels, or both)

**Description of Stimuli:**
- First chapter of *Alice in Wonderland* read by a single narrator, split
  into 12 audio segments, totalling ~12.4 minutes (2129 words).
- Mono 44.1 kHz `.wav`, plus a word-onset / n-gram-surprisal CSV table.

**Description of Responses:**
- 33 human participants listened passively to the chapter while EEG was
  recorded.
- 61 EEG channels per subject (10–20-like extended montage).
- `R = 1` per (subject, channel) — each segment was played once per
  subject; no within-subject repeats.

**Available data:**
- Per-subject MNE-Python `.fif` files containing the continuous recording,
  channel montage, segment-onset annotations, bad-channel list, and
  artifact-rejection annotations.

**Processing performed by the dataset class:**
- Audio segments are loaded and converted to log-power ERB-band
  spectrograms (a frequency-domain Gaussian approximation of the gammatone
  spectrogram used in the eelbrain analysis; same band structure as
  Brodbeck Fig 4 panels).
- Per-subject EEG is downsampled to `1000 / dt_ms` Hz (default 100 Hz),
  segmented at the 12 audio-onset annotations, and aligned to the
  spectrogram time grid.
- Bad channels (`raw.info['bads']`) and bad-window annotations (`BAD_*`)
  are converted to NaN at the response level, following the deepSTRF
  [data paradigm](data_paradigm.md) — single source of truth, no separate
  mask.
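
The NaN convention above can be sketched in plain Python. This is an illustration of the idea, not the dataset class's actual code: `sampling_rate_hz`, `mask_bad_to_nan`, their arguments, and the toy data are hypothetical, and a real pipeline would read the bad channels and `BAD_*` windows from the MNE `Raw` object.

```python
import math

def sampling_rate_hz(dt_ms):
    """Response sampling rate implied by the bin width: 1000 / dt_ms."""
    return 1000.0 / dt_ms

def mask_bad_to_nan(resp, bad_channels, bad_windows_s, dt_ms):
    """Return a copy of resp (channels x time bins) with bad data set to NaN.

    bad_channels:  channel indices that are bad for the whole recording
    bad_windows_s: (start_s, stop_s) intervals in seconds, bad on every channel
    """
    fs = sampling_rate_hz(dt_ms)
    out = [list(row) for row in resp]          # copy, leave the input untouched
    n_bins = len(out[0]) if out else 0
    for ch in bad_channels:
        out[ch] = [math.nan] * n_bins
    for start_s, stop_s in bad_windows_s:
        first = max(0, math.floor(start_s * fs))
        last = min(n_bins, math.ceil(stop_s * fs))
        for row in out:
            for t in range(first, last):
                row[t] = math.nan
    return out

# Toy data: 3 channels, 10 bins of 10 ms each (100 Hz).
resp = [[float(c * 10 + t) for t in range(10)] for c in range(3)]
masked = mask_bad_to_nan(resp, bad_channels=[1],
                         bad_windows_s=[(0.02, 0.04)], dt_ms=10)
```

Because the NaNs live in the response array itself, any downstream metric only needs to skip NaN bins; no second mask has to be kept in sync.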

## Two modes: subjects-as-neurons vs subjects-as-repeats

Alice EEG sits at the intersection of two natural ways to organise the
data. `AliceEEGDataset` exposes both via the `treat_subjects_as` kwarg.

### `treat_subjects_as="neurons"` (default)

Every `(subject, channel)` pair becomes one entry in the neuron axis.
`N = sum_s(channels_s)`, `R = 1`. This is the standard deepSTRF view —
each "neuron" has one trial, and bad-channel-on-subject combos carry the
structural-NaN sentinel. The metrics to report are
[`corrcoef`](metrics_paradigm.md#63-corrcoefpred-gt-masknone-reductionmean)
and
[`fve`](metrics_paradigm.md#scope-a-functional-nan-aware-single-axis-api).
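
For orientation, here is a minimal NaN-aware sketch of the two metrics. It is not deepSTRF's implementation: the function names are hypothetical, and the definition of FVE used here (`1 - SS_res / SS_tot` over valid bins) is an assumption that may differ from the library's.

```python
import math

def _valid_pairs(pred, gt):
    """Keep only the time bins where the measured response is not NaN."""
    return [(p, g) for p, g in zip(pred, gt) if not math.isnan(g)]

def nan_corrcoef(pred, gt):
    """Pearson r between prediction and response, ignoring NaN response bins."""
    pairs = _valid_pairs(pred, gt)
    if len(pairs) < 2:
        return math.nan  # e.g. a structurally bad channel: nothing to score
    mp = sum(p for p, _ in pairs) / len(pairs)
    mg = sum(g for _, g in pairs) / len(pairs)
    cov = sum((p - mp) * (g - mg) for p, g in pairs)
    vp = sum((p - mp) ** 2 for p, _ in pairs)
    vg = sum((g - mg) ** 2 for _, g in pairs)
    if vp == 0 or vg == 0:
        return math.nan
    return cov / math.sqrt(vp * vg)

def nan_fve(pred, gt):
    """Fraction of variance explained, 1 - SS_res / SS_tot, on non-NaN bins."""
    pairs = _valid_pairs(pred, gt)
    if len(pairs) < 2:
        return math.nan
    mg = sum(g for _, g in pairs) / len(pairs)
    ss_res = sum((g - p) ** 2 for p, g in pairs)
    ss_tot = sum((g - mg) ** 2 for _, g in pairs)
    return math.nan if ss_tot == 0 else 1.0 - ss_res / ss_tot

# Toy response with one NaN bin (an artifact window) and a close prediction.
gt = [1.0, 2.0, math.nan, 4.0, 5.0]
pred = [1.1, 1.9, 3.0, 4.2, 4.8]
r = nan_corrcoef(pred, gt)
fve = nan_fve(pred, gt)
all_bad = nan_corrcoef(pred, [math.nan] * 5)   # fully bad channel -> NaN score
```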

### `treat_subjects_as="repeats"`

Channels-as-neurons, subjects-as-repeats. `N = n_montage_channels` (61),
`R = n_subjects`. Bad `(channel, subject)` cells become NaN at the repeat
slot, and the deepSTRF metrics handle them transparently.
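
The layout can be pictured with a toy example. The helper below is a hypothetical sketch, not the class's code: it arranges per-subject recordings as `layout[channel][subject][time]` and fills a missing or bad `(channel, subject)` cell with NaN. The channel names and sample values are invented.

```python
import math

def to_repeats_layout(per_subject, montage, n_bins):
    """Arrange recordings as layout[channel][subject][time], NaN-filling gaps.

    per_subject: {subject_id: {channel_name: [samples...]}}; a channel that is
                 bad for a subject is simply absent from that subject's dict.
    montage:     ordered channel names defining the neuron axis.
    """
    subjects = sorted(per_subject)
    layout = []
    for ch in montage:
        rows = []
        for subj in subjects:
            trace = per_subject[subj].get(ch)
            rows.append(list(trace) if trace is not None else [math.nan] * n_bins)
        layout.append(rows)
    return subjects, layout

montage = ["Fz", "Cz", "Pz"]                    # toy 3-channel montage
per_subject = {
    "S01": {"Fz": [0.1, 0.2, 0.3, 0.4],
            "Cz": [0.0, 0.1, 0.0, -0.1],
            "Pz": [0.2, 0.1, 0.0, 0.1]},
    "S02": {"Fz": [0.2, 0.2, 0.4, 0.3],
            "Pz": [0.1, 0.1, 0.1, 0.0]},       # Cz is bad for S02
}
subjects, layout = to_repeats_layout(per_subject, montage, n_bins=4)
n_neurons, n_repeats = len(layout), len(layout[0])   # N = channels, R = subjects
```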

This mode enables **inter-subject reliability** analysis via
[`normalized_corrcoef(method='schoppe')`](metrics_paradigm.md#methodschoppe)
— predictions are scored against an inter-subject signal-power ceiling
analogous to (but **not** the same as) the single-unit trial-reliability
ceiling.

**Interpretive caveat:** the noise model underpinning the Schoppe
correction assumes iid trial noise around a shared deterministic signal.
Between-subject variability is structured (anatomy, source orientation,
attention) and only approximately iid. The math runs and gives a useful
group-level ceiling, but the resulting `CCnorm` should be read as
"how well does the model predict the shared, across-subject EEG
component", not as a trial-reliability bound on a single recording. Use
it as a model-comparison axis, not as an absolute predictive-power
ceiling.

## Benchmark targets (Brodbeck et al. 2023, eLife)

Brodbeck et al. report **% variability explained** per channel, averaged
across 33 subjects. The key thing to internalise about these numbers, and
about the [envelope-tracking literature](#what-good-looks-like-on-scalp-eeg)
generally, is the **unit**:

> **% variability explained = 100 · r²**, *not* Pearson `r`.

Figure 4B's "envelope predictive power" topography has a colorbar that
**maxes at 1 %**. That is `r² = 0.01`, i.e. a per-channel Pearson `r ≈
0.10`. The incremental panels (4C/4D: the predictive-power *gain* from
adding onsets / spectrogram over the envelope-only model) span **±0.1 %**
Δ-variance. Taken as a standalone effect that is `r ≈ 0.03`; stacked on
a 1 % baseline it moves `r` only from ≈ 0.100 to ≈ 0.105, because
increments add in `r²`, not in `r`. So the real headline ceiling is:

| Predictor model | % variability explained | equiv. Pearson `r` |
|---|---|---|
| Envelope alone | up to ~1 % | up to ~0.10 |
| + acoustic-onset | + ~0.1 % | ~0.03 standalone (≈ +0.005 on a 1 % baseline) |
| + spectrogram | + ~0.1 % | ~0.03 standalone (≈ +0.005 on a 1 % baseline) |
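
The unit conversion is easy to check numerically. The snippet below only applies `% variability = 100 · r²`; the helper names are ad hoc. It also separates two readings of a 0.1 % gain: the `r` of a standalone 0.1 % effect, and the change in `r` when that 0.1 % is added on top of a 1 % baseline.

```python
import math

def pct_var_to_r(pct):
    """Pearson r equivalent to a given % variability explained (100 * r**2)."""
    return math.sqrt(pct / 100.0)

def r_to_pct_var(r):
    """% variability explained for a given Pearson r."""
    return 100.0 * r ** 2

envelope_r = pct_var_to_r(1.0)       # colorbar maximum of 1 %
standalone_r = pct_var_to_r(0.1)     # a 0.1 % effect taken on its own
stacked_gain = pct_var_to_r(1.1) - pct_var_to_r(1.0)  # 0.1 % added to a 1 % baseline

print(f"1 %   -> r = {envelope_r:.3f}")
print(f"0.1 % -> r = {standalone_r:.3f} (standalone)")
print(f"1 % -> 1.1 % changes r by {stacked_gain:.4f}")
```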

These absolute prediction accuracies are tiny, and inherently so: scalp
EEG single-trial envelope tracking is a low-SNR problem (see below). The
science is in the *significance of the increment* and the *shape of the
recovered TRF*, not the absolute predictive power.

### What this library reproduces

We verified empirically (subject S01, single 9/1/2 stim split, Heeris
gammatone-8 spectrogram, 0.5–20 Hz bandpass) that **four independent
estimators converge on the same per-channel test accuracy**:

| Estimator | Mean test `r` | Max | Mean % var |
|---|---|---|---|
| deepSTRF `Linear` STRF (AdamW + MSE) | 0.084 | 0.16 | 0.86 % |
| `sklearn` Ridge (α-grid) | 0.084 | 0.16 | 0.97 % |
| deepSTRF `StateNet` GRU (C=14, 6.8k params) | 0.086 | 0.17 | 0.74 % |
| **eelbrain `boosting`** (Brodbeck's own method) | **0.090** | 0.17 | **1.05 %** |

The `Δ` between the deepSTRF Linear STRF and eelbrain's L1-boosting +
50 ms-basis pipeline is **≈ +0.006 `r`** (0.090 vs 0.084 in the table
above), **within run-to-run noise.** There
is **no algorithmic gap and no spec-pipeline gap**: deepSTRF reaches the
published single-subject envelope-TRF ceiling on this dataset. The
recurrent `StateNet` matches the linear STRF with **7× fewer
parameters**, which makes it the sample-efficient choice for the small
per-subject data, and the natural backbone for multi-subject pooling.
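
One detail helps when reading the table: the "Mean % var" column need not equal `100 · (mean r)²`. The sketch below assumes that column is the channel average of `100 · r²`, which this page does not spell out. The per-channel values are hypothetical, not taken from the table.

```python
def summarise_channels(r_per_channel):
    """Mean r, max r, and mean % variability (mean of 100 * r**2) across channels."""
    n = len(r_per_channel)
    mean_r = sum(r_per_channel) / n
    max_r = max(r_per_channel)
    mean_pct = sum(100.0 * r ** 2 for r in r_per_channel) / n
    return mean_r, max_r, mean_pct

# Hypothetical per-channel test correlations for one subject.
r_vals = [0.02, 0.05, 0.08, 0.11, 0.16]
mean_r, max_r, mean_pct = summarise_channels(r_vals)
naive_pct = 100.0 * mean_r ** 2   # squaring the mean r underestimates mean % var
```

Under that assumption, the average of the squares is at least the square of the average, so two estimators with the same mean `r` can still differ in mean % variability when their accuracy is spread differently across channels.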

The accompanying [example notebook](../../examples/alice_eeg_tutorial.ipynb)
walks through a linear baseline and a `StateNet` core on the same data
pipeline. The deepSTRF reframing of the onset-spectrogram condition is
`AdapTrans + Linear` — a learnable peripheral adaptation front-end in
place of the hand-engineered Fishbach-2001 onset detector.

## What "good" looks like on scalp EEG

If you come to this dataset from single-unit or ECoG work, the
per-channel `r ≈ 0.1` ceiling will look like a broken fit. It is not.
Single-trial scalp-EEG envelope-TRF prediction `r` is **0.05–0.15**
across the foundational literature (Lalor & Foxe 2010; Ding & Simon
2012; Di Liberto et al. 2015; Crosse et al. 2016; Broderick et al.
2018; Brodbeck et al. 2018). `r ≥ 0.3` only happens with **intracranial
recordings** (ECoG/sEEG) or **unit-level** data with many trial
repeats — the cortical envelope-tracking response is ~1 µV against
30–50 µV of background, so even `r² = 1 %` is a real, replicable signal.

Because absolute prediction accuracy is low, the EEG/MEG-TRF field
reports its results differently from the spike-prediction `cc_norm` that
deepSTRF's auditory datasets use:

1. **The TRF/STRF kernel shape itself** is the primary deliverable. The
   recovered response function has interpretable, replicable peaks
   (P1/M50 ≈ 50 ms, N1/M100 ≈ 100 ms, P2 ≈ 200 ms); their latencies and
   topographies *are* the science. The kernel is estimated precisely
   even when single-trial prediction is poor. deepSTRF surfaces this via
   [`AudioEncodingModel.STRF_gradmap`](README_gradmap_strf.md).
2. **Predictive power as a significance test, not a score.** With n=33
   subjects, even `r² = 0.5 %` can reach `p ≪ 0.001` at the group level
   when the effect is consistent across subjects. The question is "does
   adding predictor X *significantly increase* predictive power over the
   model without it?", i.e. the Δ%-variance you see in Brodbeck
   Fig 4C/4D, tested across subjects.
3. **Nested-model comparison / variance partitioning.** "Does a
   phoneme-surprisal predictor explain variance *beyond* the acoustic
   envelope?" Fit nested models, compare predictive power. This is the
   main thing TRFs are *for* — disentangling overlapping, correlated
   acoustic / lexical / semantic predictors.
4. **Backward (decoding) models + attention decoding.** Reconstructing
   the stimulus envelope *from* EEG pools all channels and reaches
   higher `r` (≈ 0.1–0.3); auditory-attention decoding then reports
   *classification accuracy* (often 80–90 % in 60 s windows), not `r`.
5. **Group-level cluster statistics** over the (time-lag × sensor) TRF
   (TFCE / cluster-permutation corrected) — the result is a significant
   spatiotemporal cluster, reported as a topography + time-course.

For deepSTRF's purposes, the per-channel `r` we measure is the right
*sanity* number; the scientifically useful outputs on EEG are (1) the
recovered TRF kernels and (3) nested-model predictive-power comparisons.
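
Items 2 and 3 come down to testing a per-subject difference in predictive power between a full and a reduced model. A sign-flip permutation test is one common way to do that; the sketch below is a generic illustration, not necessarily the test used in the paper, and the per-subject gains are synthetic.

```python
import random

def sign_flip_p_value(diffs, n_perm=2000, seed=0):
    """One-sided sign-flip permutation p-value for mean(diffs) > 0.

    diffs: one value per subject, e.g. % variability of the full model
           minus % variability of the reduced model.
    """
    rng = random.Random(seed)
    n = len(diffs)
    observed = sum(diffs) / n
    hits = 0
    for _ in range(n_perm):
        # Under the null, the sign of each subject's difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if flipped >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic per-subject gains in % variability, for illustration only.
rng = random.Random(1)
gains = [0.05 + rng.gauss(0.0, 0.05) for _ in range(33)]
p = sign_flip_p_value(gains)
```

A small but consistent gain across subjects yields a small p-value even though each subject's absolute predictive power is tiny, which is the sense in which predictive power acts as a significance test here.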

## Setup

**Requirements:** the `[eeg]` optional extra (pulls in MNE-Python):

```bash
pip install "deepSTRF[eeg]"
```

Easiest path — auto-download from the UMd DRUM mirror (~2.5 GiB total,
anonymous HTTPS, idempotent):

```python
from deepSTRF.datasets.audio import AliceEEGDataset

ds = AliceEEGDataset(download=True, dt_ms=10, n_frequency_bands=8)
```

Default cache dir is
`platformdirs.user_cache_dir('deepSTRF')/Alice_EEG`, overridable via
`$DEEPSTRF_DATA_DIR`.

If the data is already laid out manually:

```python
ds = AliceEEGDataset(path="/path/to/brodbeck_eelbrain_elife", dt_ms=10)
```

Expected layout under `path`:
```
brodbeck_eelbrain_elife/
├── eeg.0/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.1/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.2/eeg/Sxx/Sxx_alice-raw.fif
└── stimuli/{1..12}.wav  +  AliceChapterOne-EEG.csv
```
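
Before pointing `path` at a manual download, the tree can be checked with a short script. `check_alice_layout` is a hypothetical helper, not part of deepSTRF; it only tests for the file patterns listed above, and the demo runs on a throwaway directory of empty files.

```python
import tempfile
from pathlib import Path

def check_alice_layout(root):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    problems = []
    if not list(root.glob("eeg.*/eeg/S*/S*_alice-raw.fif")):
        problems.append("no eeg.*/eeg/Sxx/Sxx_alice-raw.fif files found")
    for i in range(1, 13):
        if not (root / "stimuli" / f"{i}.wav").is_file():
            problems.append(f"missing stimuli/{i}.wav")
    if not (root / "stimuli" / "AliceChapterOne-EEG.csv").is_file():
        problems.append("missing stimuli/AliceChapterOne-EEG.csv")
    return problems

# Demo: mimic the expected tree with empty files, leaving out 12.wav on purpose.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "brodbeck_eelbrain_elife"
    subject_dir = root / "eeg.0" / "eeg" / "S01"
    subject_dir.mkdir(parents=True)
    (subject_dir / "S01_alice-raw.fif").touch()
    (root / "stimuli").mkdir()
    for i in range(1, 12):
        (root / "stimuli" / f"{i}.wav").touch()
    (root / "stimuli" / "AliceChapterOne-EEG.csv").touch()
    problems = check_alice_layout(root)

print(problems)
```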

## Filtering

Each `stim_meta` dict carries `name`, `type` (`"alice_chapter1"`),
`sample_rate`, `n_samples`, `duration_s`. Each `nrn_meta` dict
carries `channel_id`, `subject` (or `None` in repeats mode), `area`
(`"EEG"`), and `xyz` (channel position from the standard 10–20 montage,
or `None` if not in the montage). These fields combine with the
constructor arguments and the
[base-class selection API](data_paradigm.md#8-iteration-honours-the-current-selection-bidirectional):

```python
from deepSTRF.datasets.audio import AliceEEGDataset

# default: all subjects, treat_subjects_as="neurons"
ds = AliceEEGDataset(download=True)

# only one subject
ds = AliceEEGDataset(download=True, subjects=["S20"])

# inter-subject reliability mode
ds = AliceEEGDataset(download=True, treat_subjects_as="repeats")

# post-construction: restrict the population to a single channel, by id
ds.select_pop_by_nrn_attr("channel_id", "1")
```
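
To show what attribute-based selection amounts to, here is a stand-in that works on plain metadata dicts. It does not call deepSTRF, and `select_indices` is a hypothetical helper: how `select_pop_by_nrn_attr` matches values internally, and whether it accepts several values at once, is not documented on this page. The metadata entries are invented but use the keys listed above.

```python
def select_indices(nrn_meta, **criteria):
    """Indices of entries whose metadata matches every criterion.

    A criterion value may be a single value or a set/list/tuple of accepted values.
    """
    def matches(meta, key, wanted):
        if isinstance(wanted, (set, list, tuple)):
            return meta.get(key) in wanted
        return meta.get(key) == wanted

    return [i for i, meta in enumerate(nrn_meta)
            if all(matches(meta, k, v) for k, v in criteria.items())]

# Hypothetical neurons-mode metadata: one entry per (subject, channel).
nrn_meta = [
    {"channel_id": "1", "subject": "S01", "area": "EEG", "xyz": None},
    {"channel_id": "2", "subject": "S01", "area": "EEG", "xyz": None},
    {"channel_id": "1", "subject": "S20", "area": "EEG", "xyz": None},
    {"channel_id": "3", "subject": "S20", "area": "EEG", "xyz": None},
]
one_channel = select_indices(nrn_meta, channel_id="1")
cluster_s20 = select_indices(nrn_meta, channel_id={"1", "3"}, subject="S20")
```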

## Status

The shipped `AliceEEGDataset` + canonical preprocessing (0.5–20 Hz
bandpass, base-class `standardize_stims` + `normalize_responses`)
correctly loads the data and feeds the deepSTRF model API, and **reaches
the published single-subject envelope-TRF ceiling** (see the
[four-estimator comparison above](#what-this-library-reproduces)).
There is no accuracy gap to close against Brodbeck on the acoustic-only
models — `r ≈ 0.09` mean per channel is what the data supports, and
deepSTRF's `Linear` and `StateNet` both get there.

What *would* extend the analysis (in the directions the EEG-TRF field
actually cares about — kernel recovery and nested-model comparison
rather than raw predictive power), in increasing order of effort:

1. **Word-onset / surprisal predictors** from
   `stimuli/AliceChapterOne-EEG.csv`. Reproduces Brodbeck Fig 5–6
   (TRF-of-discrete-events; function vs content words). This is the
   nested-model-comparison story — "does a lexical predictor explain
   variance *beyond* acoustics?" — and is where TRFs earn their keep.
2. **Hamming-basis STRF kernel.** Add a `BasisKernel` to
   `deepSTRF.models.layers` that constrains the temporal axis of the
   STRF to a sparse basis of overlapping Hamming windows (eelbrain's
   `basis=0.050`). It won't move the per-channel `r` much — we're at the
   ceiling — but it produces **smoother, more interpretable TRF
   kernels**, which is the actual deliverable on EEG.
3. **Subject embeddings + shared StateNet backbone.** A learned
   per-subject context vector concatenated to the GRU input — true
   multi-subject pretraining, beyond the per-subject readouts the `N`
   axis already provides. The most promising route to a meaningfully
   *higher* number, by pooling the across-subject shared response.
4. **`eelbrain.boosting` wrapper as an alternative `Fitter`.** The
   apples-to-apples cross-check used to validate this dataset (see
   `untracked/alice_eeg_eelbrain_compare.py`); worth promoting to a
   reusable utility + regression test for future EEG/MEG work.
5. **Topomap helper** using `mne.viz.plot_topomap` from
   `nrn_meta['xyz']` — for the eLife-style scalp figures.
6. **Per-subject `download=True`** instead of all 2.5 GiB at once.
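
The basis idea in item 2 can be sketched without any library: sparse per-lag coefficients are expanded into a smooth kernel by placing a Hamming window at each lag. This is an illustration of the concept only. The helper names are hypothetical, and the unit-sum normalisation and edge handling are assumptions that need not match eelbrain's `basis=0.050` or a future `BasisKernel`.

```python
import math

def hamming_window(n_samples):
    """Symmetric Hamming window, normalised to unit sum."""
    if n_samples == 1:
        return [1.0]
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n_samples - 1))
         for i in range(n_samples)]
    total = sum(w)
    return [v / total for v in w]

def expand_with_basis(weights, window):
    """Kernel = sum of windows centred on each lag, scaled by that lag's weight."""
    half = len(window) // 2
    kernel = [0.0] * len(weights)
    for lag, wt in enumerate(weights):
        if wt == 0.0:
            continue
        for j, wv in enumerate(window):
            t = lag + j - half
            if 0 <= t < len(kernel):      # drop window samples past the edges
                kernel[t] += wt * wv
    return kernel

dt_ms = 10
window = hamming_window(round(50 / dt_ms))   # 50 ms basis -> 5 samples at 100 Hz
weights = [0.0] * 30                         # 300 ms of lags
weights[5], weights[10] = 1.0, -0.6          # two sparse coefficients (toy values)
kernel = expand_with_basis(weights, window)
```

Each sparse coefficient becomes a smooth 50 ms bump, so the fitted kernel has far fewer effective degrees of freedom than one free weight per lag.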

The accompanying [tutorial notebook](../../examples/alice_eeg_tutorial.ipynb)
exercises the dataset end-to-end. It is a **library-on-EEG
demonstration** — showing that the same model API that fits ferret A1
spikes also fits human scalp EEG, lands at the field-standard ceiling,
and recovers interpretable TRF kernels.