Alice EEG Dataset

Dataset Source: The Alice Datasets — Deep Blue Data (UMich), re-released by Brodbeck et al. as a preprocessed MNE-Python deposit at the University of Maryland DRUM repository (DOI 10.13016/pulf-lndn).

Citation:

Bhattasali, S., Brennan, J. R., Luh, W.-M., Franzluebbers, B., & Hale, J. T.
(2020). The Alice Datasets: fMRI & EEG Observations of Natural Language
Comprehension. Proceedings of the 12th Conference on Language Resources and
Evaluation (LREC), 120–125.

Papers using the dataset:

Dataset Details

Population fitting: ✅ (across subjects, across channels, or both)

Description of Stimuli:

  • First chapter of Alice in Wonderland read by a single narrator, split into 12 audio segments, totalling ~12.4 minutes (2129 words).

  • Mono 44.1 kHz .wav, plus a word-onset / n-gram-surprisal CSV table.

Description of Responses:

  • 33 human participants listened passively to the chapter while EEG was recorded.

  • 61 EEG channels per subject (10–20-like extended montage).

  • R = 1 per (subject, channel) — each segment was played once per subject; no within-subject repeats.

Available data:

  • Per-subject MNE-Python .fif files containing the continuous recording, channel montage, segment-onset annotations, bad-channel list, and artifact-rejection annotations.

Processing performed by the dataset class:

  • Audio segments are loaded and converted to log-power ERB-band spectrograms (a frequency-domain Gaussian approximation of the gammatone spectrogram used in the eelbrain analysis; same band structure as Brodbeck Fig 4 panels).

  • Per-subject EEG is downsampled to 1000 / dt_ms Hz (default 100 Hz), segmented at the 12 audio-onset annotations, and aligned to the spectrogram time grid.

  • Bad channels (raw.info['bads']) and bad-window annotations (BAD_*) are converted to NaN at the response level, following the deepSTRF data paradigm — single source of truth, no separate mask.
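The resampling and NaN-masking steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the dataset class's actual code: `sampling_rate_hz` and `mask_bad_windows` are hypothetical helper names, and the bad-window format (onset and duration in seconds, relative to the segment start) is an assumption chosen for the example.

```python
import math

def sampling_rate_hz(dt_ms):
    """Sampling rate implied by a time-bin width in milliseconds (10 ms -> 100 Hz)."""
    return 1000.0 / dt_ms

def mask_bad_windows(response, bad_windows_s, dt_ms):
    """Return a copy of `response` with every bin touched by a bad window set to NaN.

    response      : list of floats, one value per time bin of one channel
    bad_windows_s : list of (onset_s, duration_s) tuples (assumed format)
    """
    sfreq = sampling_rate_hz(dt_ms)
    masked = list(response)
    for onset_s, duration_s in bad_windows_s:
        # floor/ceil so that partially covered bins are masked as well
        start = max(0, math.floor(onset_s * sfreq))
        stop = min(len(masked), math.ceil((onset_s + duration_s) * sfreq))
        for i in range(start, stop):
            masked[i] = float("nan")
    return masked

# Hypothetical one-second channel at dt_ms=10 with a bad window from 0.25 s to 0.50 s.
channel = [0.0] * 100
masked = mask_bad_windows(channel, [(0.25, 0.25)], dt_ms=10)
n_nan = sum(math.isnan(v) for v in masked)
```

Because the NaN lives in the response itself, any downstream metric only has to skip NaN bins; there is no second mask array to keep in sync.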

Two modes: subjects-as-neurons vs subjects-as-repeats

Alice EEG sits at the intersection of two natural ways to organise the data. AliceEEGDataset exposes both via the treat_subjects_as kwarg.

treat_subjects_as="neurons" (default)

Every (subject, channel) pair becomes one entry in the neuron axis. N = sum_s(channels_s), R = 1. This is the standard deepSTRF view — each “neuron” has one trial, and bad-channel-on-subject combos carry the structural-NaN sentinel. The metrics to report are corrcoef and fve.
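To make the NaN sentinel concrete, here is a minimal sketch of a NaN-aware Pearson correlation in plain Python. It is an illustration of the idea only; `nan_corrcoef` is a hypothetical name and is not claimed to match the signature or internals of deepSTRF's own corrcoef metric.

```python
import math

def nan_corrcoef(y_true, y_pred):
    """Pearson r computed only over time bins where the response is not NaN."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if not math.isnan(t)]
    if len(pairs) < 2:
        return float("nan")  # e.g. a channel that is bad for the whole recording
    mean_t = sum(t for t, _ in pairs) / len(pairs)
    mean_p = sum(p for _, p in pairs) / len(pairs)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_p = sum((p - mean_p) ** 2 for _, p in pairs)
    if var_t == 0 or var_p == 0:
        return float("nan")
    return cov / math.sqrt(var_t * var_p)

nan = float("nan")
response = [1.0, 2.0, nan, 4.0, 5.0]        # third bin was marked bad
prediction = [2.0, 4.0, 100.0, 8.0, 10.0]   # the model output there is ignored
r = nan_corrcoef(response, prediction)
```

A fully bad (subject, channel) entry yields NaN rather than a misleading score, so it can be excluded when averaging across the neuron axis.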

treat_subjects_as="repeats"

Channels-as-neurons, subjects-as-repeats. N = n_montage_channels (61), R = n_subjects. Bad (channel, subject) cells become NaN at the repeat slot, and the deepSTRF metrics handle them transparently.

This mode enables inter-subject reliability analysis via normalized_corrcoef(method='schoppe') — predictions are scored against an inter-subject signal-power ceiling analogous to (but not the same as) the single-unit trial-reliability ceiling.

Interpretive caveat: the noise model underpinning the Schoppe correction assumes iid trial noise around a shared deterministic signal. Between-subject variability is structured (anatomy, source orientation, attention) and only approximately iid. The math runs and gives a useful group-level ceiling, but the resulting CCnorm is interpreted as “how well does the model predict the shared, across-subject EEG component” — not as a trial-reliability bound on a single recording. Use it as a model-comparison axis, not as an absolute predictive-power ceiling.

Benchmark targets (Brodbeck et al. 2023, eLife)

Brodbeck reports % variability explained per channel, averaged across 33 subjects. The key thing to internalise about these numbers — and the envelope-tracking literature generally — is the unit:

% variability explained = 100 · r², not Pearson r.

Figure 4B’s “envelope predictive power” topography has a colorbar that maxes at 1 %. That is r² = 0.01, i.e. a per-channel Pearson r ≈ 0.10. The incremental panels (4C/4D: the predictive-power gain from adding onsets / spectrogram over the envelope-only model) span ±0.1 % Δ-variance, i.e. another r ≈ ±0.03. So the real headline ceiling is:

Predictor model     % variability explained   equiv. Pearson r
------------------  ------------------------  ----------------
Envelope alone      up to ~1 %                up to ~0.10
+ acoustic-onset    + ~0.1 %                  + ~0.03
+ spectrogram       + ~0.1 %                  + ~0.03

These are tiny absolute prediction accuracies by design — scalp EEG single-trial envelope tracking is a low-SNR problem (see below). The science is in the significance of the increment and the shape of the recovered TRF, not the absolute predictive power.
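The unit conversion behind the table is simple enough to check by hand. The two helpers below are hypothetical names introduced only for this illustration; they apply the relation % variability explained = 100 · r² in both directions.

```python
import math

def pct_var_to_r(pct):
    """Pearson r equivalent of a (non-negative) % variability explained."""
    return math.sqrt(pct / 100.0)

def r_to_pct_var(r):
    """% variability explained equivalent of a Pearson r."""
    return 100.0 * r * r

envelope_r = pct_var_to_r(1.0)    # the 1 % colorbar maximum
increment_r = pct_var_to_r(0.1)   # a 0.1 % increment converted on its own
```

Note that averaging and squaring do not commute: a mean % variance taken across channels generally differs from 100 times the squared mean r, so the two columns of a results table cannot be derived from each other exactly.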

What this library reproduces

We verified empirically (subject S01, single 9/1/2 stim split, Heeris gammatone-8 spectrogram, 0.5–20 Hz bandpass) that four independent estimators converge on the same per-channel test accuracy:

Estimator                                   Mean test r   Max    Mean % var
------------------------------------------  -----------   ----   ----------
deepSTRF Linear STRF (AdamW + MSE)          0.084         0.16   0.86 %
sklearn Ridge (α-grid)                      0.084         0.16   0.97 %
deepSTRF StateNet GRU (C=14, 6.8k params)   0.086         0.17   0.74 %
eelbrain boosting (Brodbeck’s own method)   0.090         0.17   1.05 %

The Δ between the deepSTRF Linear STRF and eelbrain’s L1-boosting + 50 ms-basis pipeline is +0.005 r, which is within run-to-run noise. There is no algorithmic gap and no spectrogram-pipeline gap: deepSTRF reaches the published single-subject envelope-TRF ceiling on this dataset. The recurrent StateNet matches the linear STRF with 7× fewer parameters, which makes it the sample-efficient choice for the small per-subject data and the natural backbone for multi-subject pooling.

The accompanying example notebook walks through a linear baseline and a StateNet core on the same data pipeline. The deepSTRF reframing of the onset-spectrogram condition is AdapTrans + Linear — a learnable peripheral adaptation front-end in place of the hand-engineered Fishbach-2001 onset detector.

What “good” looks like on scalp EEG

If you come to this dataset from single-unit or ECoG work, the per-channel r ≈ 0.1 ceiling will look like a broken fit. It is not. Single-trial scalp-EEG envelope-TRF prediction r is 0.05–0.15 across the foundational literature (Lalor & Foxe 2010; Ding & Simon 2012; Di Liberto et al. 2015; Crosse et al. 2016; Broderick et al. 2018; Brodbeck et al. 2018). r > 0.3 only happens with intracranial recordings (ECoG/sEEG) or unit-level data with many trial repeats. The cortical envelope-tracking response is ~1 µV against 30–50 µV of background, so even r² = 1 % is a real, replicable signal.

Because absolute prediction accuracy is low, the EEG/MEG-TRF field reports its results differently from the spike-prediction cc_norm that deepSTRF’s auditory datasets use:

  1. The TRF/STRF kernel shape itself is the primary deliverable. The recovered response function has interpretable, replicable peaks (P1/M50 ≈ 50 ms, N1/M100 ≈ 100 ms, P2 ≈ 200 ms); their latencies and topographies are the science. The kernel is estimated precisely even when single-trial prediction is poor. deepSTRF surfaces this via AudioEncodingModel.STRF_gradmap.

  2. Predictive power as a significance test, not a score. With n=33 subjects, even r² = 0.5 % is p < 0.001 at the group level. The question is “does adding predictor X significantly increase predictive power over the model without it?”, i.e. the Δ%-variance you see in Brodbeck Fig 4C/4D, tested across subjects.

  3. Nested-model comparison / variance partitioning. “Does a phoneme-surprisal predictor explain variance beyond the acoustic envelope?” Fit nested models, compare predictive power. This is the main thing TRFs are for — disentangling overlapping, correlated acoustic / lexical / semantic predictors.

  4. Backward (decoding) models + attention decoding. Reconstructing the stimulus envelope from EEG pools all channels and reaches higher r (≈ 0.1–0.3); auditory-attention decoding then reports classification accuracy (often 80–90 % in 60 s windows), not r.

  5. Group-level cluster statistics over the (time-lag × sensor) TRF (TFCE / cluster-permutation corrected) — the result is a significant spatiotemporal cluster, reported as a topography + time-course.

For deepSTRF’s purposes, the per-channel r we measure is the right sanity number; the scientifically useful outputs on EEG are (1) the recovered TRF kernels and (3) nested-model predictive-power comparisons.
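The nested-model logic of items 2 and 3 reduces to a paired test on per-subject differences in predictive power. The sketch below uses an exact sign-flip permutation test as a simplified stand-in; it is not the statistical procedure used by Brodbeck et al. or by eelbrain, and the per-subject Δ values are made up for illustration.

```python
from itertools import product

def sign_flip_p_value(deltas):
    """Exact one-sided sign-flip permutation p-value for mean(deltas) > 0.

    deltas: per-subject difference in predictive power, full model minus
    reduced model. Under the null hypothesis each sign is equally likely.
    """
    n = len(deltas)
    observed = sum(deltas) / n
    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        permuted = sum(s * d for s, d in zip(signs, deltas)) / n
        at_least_as_large += permuted >= observed
        total += 1
    return at_least_as_large / total

# Hypothetical Δ % variance for 10 subjects (full model minus envelope-only model).
deltas = [0.02, 0.05, 0.01, 0.03, 0.04, 0.06, 0.02, 0.01, 0.03, 0.05]
p = sign_flip_p_value(deltas)
```

Exact enumeration visits 2^n sign patterns, which is fine for 10 subjects but not for 33; at that size one would draw a few thousand random sign patterns instead. The point of the example is that a tiny but consistent increment across subjects is highly significant even when each individual value is far below 1 %.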

Setup

Requirements: the [eeg] optional extra (pulls in MNE-Python):

pip install "deepSTRF[eeg]"

Easiest path — auto-download from the UMd DRUM mirror (~2.5 GiB total, anonymous HTTPS, idempotent):

from deepSTRF.datasets.audio import AliceEEGDataset

ds = AliceEEGDataset(download=True, dt_ms=10, n_frequency_bands=8)

Default cache dir is platformdirs.user_cache_dir('deepSTRF')/Alice_EEG, overridable via $DEEPSTRF_DATA_DIR.

If the data is already laid out manually:

ds = AliceEEGDataset(path="/path/to/brodbeck_eelbrain_elife", dt_ms=10)

Expected layout under path:

brodbeck_eelbrain_elife/
├── eeg.0/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.1/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.2/eeg/Sxx/Sxx_alice-raw.fif
└── stimuli/{1..12}.wav  +  AliceChapterOne-EEG.csv
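Before pointing the dataset class at a manual copy, it can help to check the layout. The helper below is a hypothetical utility written against the tree shown above, using only pathlib; it is not part of deepSTRF, and the demo builds a throwaway directory just to show the return values.

```python
import tempfile
from pathlib import Path

def find_alice_files(root):
    """Return ({subject_id: fif_path}, [missing stimulus files]) for a data root."""
    root = Path(root)
    fifs = sorted(root.glob("eeg.*/eeg/S*/S*_alice-raw.fif"))
    subjects = {p.parent.name: p for p in fifs}
    expected = [root / "stimuli" / f"{i}.wav" for i in range(1, 13)]
    expected.append(root / "stimuli" / "AliceChapterOne-EEG.csv")
    missing = [p.relative_to(root).as_posix() for p in expected if not p.is_file()]
    return subjects, missing

# Demo on a deliberately incomplete fake tree: one subject, one audio segment.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "brodbeck_eelbrain_elife"
    fif = root / "eeg.0" / "eeg" / "S01" / "S01_alice-raw.fif"
    fif.parent.mkdir(parents=True)
    fif.touch()
    (root / "stimuli").mkdir()
    (root / "stimuli" / "1.wav").touch()
    subjects, missing = find_alice_files(root)
    found_ids = sorted(subjects)
    n_missing = len(missing)
```

On a complete download the missing list should be empty and the subject dictionary should contain one entry per recorded participant.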

Filtering

Each stim_meta dict carries name, type ("alice_chapter1"), sample_rate, n_samples, duration_s. Each nrn_meta dict carries channel_id, subject (or None in repeats mode), area ("EEG"), and xyz (channel position from the standard 10–20 montage, or None if not in the montage). Combined with the base-class selection API:

# default: all subjects, treat_subjects_as="neurons"
ds = AliceEEGDataset(download=True)

# only one subject
ds = AliceEEGDataset(download=True, subjects=["S20"])

# inter-subject reliability mode
ds = AliceEEGDataset(download=True, treat_subjects_as="repeats")

# post-construction: select a single channel by its channel_id
ds.select_pop_by_nrn_attr("channel_id", "1")
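A frontal cluster can be derived from the xyz metadata before calling the selection API. The sketch below works on plain dicts shaped like the nrn_meta entries described above. It assumes that xyz is expressed in a head frame whose +y axis points anterior (as in MNE's head coordinates); the function name, threshold, and sample positions are all hypothetical. How the resulting ids are passed back to the dataset depends on the base-class API and is not shown.

```python
def frontal_channel_ids(nrn_meta, min_y=0.0):
    """Channel ids whose position lies in front of the y = min_y plane.

    Entries without a montage position (xyz is None) are skipped.
    Assumes xyz = (x, y, z) with +y anterior; check this for your montage.
    """
    ids = {
        m["channel_id"]
        for m in nrn_meta
        if m.get("xyz") is not None and m["xyz"][1] > min_y
    }
    return sorted(ids)

# Hypothetical metadata for four channels of one subject.
nrn_meta = [
    {"channel_id": "1", "subject": "S20", "area": "EEG", "xyz": (0.00, 0.08, 0.04)},
    {"channel_id": "2", "subject": "S20", "area": "EEG", "xyz": (0.03, 0.05, 0.07)},
    {"channel_id": "3", "subject": "S20", "area": "EEG", "xyz": (0.00, -0.07, 0.05)},
    {"channel_id": "4", "subject": "S20", "area": "EEG", "xyz": None},
]
frontal = frontal_channel_ids(nrn_meta, min_y=0.02)
```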

Status

The shipped AliceEEGDataset + canonical preprocessing (0.5–20 Hz bandpass, base-class standardize_stims + normalize_responses) correctly loads the data and feeds the deepSTRF model API, and reaches the published single-subject envelope-TRF ceiling (see the four-estimator comparison above). There is no accuracy gap to close against Brodbeck on the acoustic-only models: r ≈ 0.09 mean per channel is what the data supports, and deepSTRF’s Linear and StateNet both get there.

What would extend the analysis (in the directions the EEG-TRF field actually cares about — kernel recovery and nested-model comparison rather than raw predictive power), in increasing order of effort:

  1. Word-onset / surprisal predictors from stimuli/AliceChapterOne-EEG.csv. Reproduces Brodbeck Fig 5–6 (TRF-of-discrete-events; function vs content words). This is the nested-model-comparison story — “does a lexical predictor explain variance beyond acoustics?” — and is where TRFs earn their keep.

  2. Hamming-basis STRF kernel. Add a BasisKernel to deepSTRF.models.layers that constrains the temporal axis of the STRF to a sparse basis of overlapping Hamming windows (eelbrain’s basis=0.050). It won’t move the per-channel r much — we’re at the ceiling — but it produces smoother, more interpretable TRF kernels, which is the actual deliverable on EEG.

  3. Subject embeddings + shared StateNet backbone. A learned per-subject context vector concatenated to the GRU input — true multi-subject pretraining, beyond the per-subject readouts the N axis already provides. The most promising route to a meaningfully higher number, by pooling the across-subject shared response.

  4. eelbrain.boosting wrapper as an alternative Fitter. The apples-to-apples cross-check used to validate this dataset (see untracked/alice_eeg_eelbrain_compare.py); worth promoting to a reusable utility + regression test for future EEG/MEG work.

  5. Topomap helper using mne.viz.plot_topomap from nrn_meta['xyz'] — for the eLife-style scalp figures.

  6. Per-subject download=True instead of all 2.5 GiB at once.
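As a rough illustration of item 2, the sketch below builds a basis of overlapping Hamming windows along the lag axis and expands a coefficient vector into a smooth temporal kernel. It is plain Python with hypothetical function names; the proposed BasisKernel layer does not exist yet, and this does not claim to reproduce eelbrain's exact window placement or normalisation. The 5-sample width assumes 50 ms windows at dt_ms=10.

```python
import math

def hamming_window(width):
    """Symmetric Hamming window of `width` samples (peak value 1 for odd widths)."""
    if width == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (width - 1)) for n in range(width)]

def hamming_basis(n_lags, width):
    """One basis vector per lag: a Hamming window centred there, clipped at the edges."""
    window = hamming_window(width)
    half = width // 2
    basis = []
    for centre in range(n_lags):
        vec = [0.0] * n_lags
        for j, w in enumerate(window):
            lag = centre - half + j
            if 0 <= lag < n_lags:
                vec[lag] = w
        basis.append(vec)
    return basis

def expand_kernel(coefs, basis):
    """Temporal kernel as the coefficient-weighted sum of basis vectors."""
    n_lags = len(basis[0])
    return [sum(c * b[lag] for c, b in zip(coefs, basis)) for lag in range(n_lags)]

# Assumed setup: 40 lags (400 ms at 100 Hz), 5-sample (50 ms) windows.
basis = hamming_basis(n_lags=40, width=5)
coefs = [0.0] * 40
coefs[10] = 1.0                       # a single active coefficient at 100 ms
kernel = expand_kernel(coefs, basis)
n_nonzero = sum(abs(v) > 1e-12 for v in kernel)
```

A single coefficient produces a 50 ms bump instead of a one-sample spike, which is where the smoothness of the recovered TRF comes from; a learnable layer would fit the coefficients and keep the basis fixed.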

The accompanying tutorial notebook exercises the dataset end-to-end. It is a library-on-EEG demonstration — showing that the same model API that fits ferret A1 spikes also fits human scalp EEG, lands at the field-standard ceiling, and recovers interpretable TRF kernels.