Alice EEG Dataset

Dataset Source: The Alice Datasets — Deep Blue Data (UMich), re-released by Brodbeck et al. as a preprocessed MNE-Python deposit at the University of Maryland DRUM repository (DOI 10.13016/pulf-lndn).

Citation:

Bhattasali, S., Brennan, J. R., Luh, W.-M., Franzluebbers, B., & Hale, J. T.
(2020). The Alice Datasets: fMRI & EEG Observations of Natural Language
Comprehension. Proceedings of the 12th Conference on Language Resources and
Evaluation (LREC), 120–125.

Papers using the dataset:

“Hierarchical structure guides rapid linguistic predictions during naturalistic listening” by Brennan & Hale (2019), PLOS ONE.
“Eelbrain, a Python toolkit for time-continuous analysis with temporal response functions” by Brodbeck et al. (2023), eLife. (Uses the same Alice EEG release that deepSTRF consumes; the benchmark numbers below are taken from this paper.)

Dataset Details

Population fitting: ✅ (across subjects, across channels, or both)

Description of Stimuli:

First chapter of Alice in Wonderland read by a single narrator, split into 12 audio segments, totalling ~12.4 minutes (2129 words).
Mono 44.1 kHz .wav, plus a word-onset / n-gram-surprisal CSV table.

Description of Responses:

33 human participants listened passively to the chapter while EEG was recorded.
61 EEG channels per subject (10–20-like extended montage).
R = 1 per (subject, channel) — each segment was played once per subject; no within-subject repeats.

Available data:

Per-subject MNE-Python .fif files containing the continuous recording, channel montage, segment-onset annotations, bad-channel list, and artifact-rejection annotations.

Processing performed by the dataset class:

Audio segments are loaded and converted to log-power ERB-band spectrograms (a frequency-domain Gaussian approximation of the gammatone spectrogram used in the eelbrain analysis; same band structure as Brodbeck Fig 4 panels).
Per-subject EEG is downsampled to 1000 / dt_ms Hz (default 100 Hz), segmented at the 12 audio-onset annotations, and aligned to the spectrogram time grid.
Bad channels (raw.info['bads']) and bad-window annotations (BAD_*) are converted to NaN at the response level, following the deepSTRF data paradigm — single source of truth, no separate mask.
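The segmentation and NaN-marking steps above can be sketched with plain NumPy. This is an illustration only, not the dataset class's code: the function name `mask_and_segment` and its arguments are hypothetical, and the bad channels and bad windows are passed in directly rather than read from `raw.info['bads']` and the `BAD_*` annotations.

```python
import numpy as np

def mask_and_segment(eeg, fs, bad_channels, bad_windows, seg_onsets_s, seg_durations_s):
    """Return one (n_samples, n_channels) array per segment, with bad data set to NaN.

    eeg          : (n_times, n_channels) continuous recording sampled at `fs` Hz
    bad_channels : channel indices to blank over the whole recording
    bad_windows  : (onset_s, duration_s) pairs to blank on every channel
    """
    resp = np.array(eeg, dtype=float)            # copy so the input is untouched
    for ch in bad_channels:
        resp[:, ch] = np.nan                     # whole channel is missing
    for onset_s, dur_s in bad_windows:
        start = int(round(onset_s * fs))
        stop = int(round((onset_s + dur_s) * fs))
        resp[start:stop, :] = np.nan             # artifact window on every channel
    segments = []
    for onset_s, dur_s in zip(seg_onsets_s, seg_durations_s):
        start = int(round(onset_s * fs))
        n = int(round(dur_s * fs))
        segments.append(resp[start:start + n])   # aligned to the stimulus time grid
    return segments

dt_ms = 10
fs = 1000 / dt_ms                                # 100 Hz response grid
eeg = np.random.default_rng(0).normal(size=(1000, 4))    # synthetic: 10 s, 4 channels
segments = mask_and_segment(eeg, fs, bad_channels=[2], bad_windows=[(1.0, 0.5)],
                            seg_onsets_s=[0.0, 5.0], seg_durations_s=[3.0, 4.0])
```

Because the NaNs live in the response array itself, any downstream metric only needs to skip non-finite samples; no second mask array has to be kept in sync.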

Two modes: subjects-as-neurons vs subjects-as-repeats

Alice EEG sits at the intersection of two natural ways to organise the data. AliceEEGDataset exposes both via the treat_subjects_as kwarg.

`treat_subjects_as="neurons"` (default)

Every (subject, channel) pair becomes one entry on the neuron axis: N = sum_s(channels_s), R = 1. This is the standard deepSTRF view, in which each “neuron” has a single trial and any channel marked bad for a given subject carries the structural-NaN sentinel. The metrics to report are corrcoef and fve.

`treat_subjects_as="repeats"`

Channels-as-neurons, subjects-as-repeats. N = n_montage_channels (61), R = n_subjects. Bad (channel, subject) cells become NaN at the repeat slot, and the deepSTRF metrics handle them transparently.
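To make the two layouts concrete, here is a small NumPy sketch on synthetic data. The (N, R, T) axis order and the three made-up subjects are assumptions chosen for illustration; the actual tensor layout inside AliceEEGDataset is not specified here.

```python
import numpy as np

n_times, n_montage = 500, 61
rng = np.random.default_rng(0)

# Hypothetical per-subject recordings, (n_times, 61), with NaN columns for bad channels.
subjects = {}
for name, bad in [("S01", [3]), ("S02", []), ("S03", [10, 11])]:
    data = rng.normal(size=(n_times, n_montage))
    for ch in bad:
        data[:, ch] = np.nan
    subjects[name] = data

# "neurons": every (subject, channel) pair is its own unit with a single trial.
neurons = np.concatenate([d.T[:, None, :] for d in subjects.values()], axis=0)

# "repeats": the 61 montage channels are the units, subjects fill the repeat axis.
repeats = np.stack([d.T for d in subjects.values()], axis=1)

print(neurons.shape)   # (N = 3 * 61, R = 1, T)
print(repeats.shape)   # (N = 61, R = 3, T)
```

The same NaN cells appear in both arrays; only their position (a whole unit versus one repeat slot of a unit) changes.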

This mode enables inter-subject reliability analysis via normalized_corrcoef(method='schoppe') — predictions are scored against an inter-subject signal-power ceiling analogous to (but not the same as) the single-unit trial-reliability ceiling.

Interpretive caveat: the noise model underpinning the Schoppe correction assumes iid trial noise around a shared deterministic signal. Between-subject variability is structured (anatomy, source orientation, attention) and only approximately iid. The math runs and gives a useful group-level ceiling, but the resulting CCnorm is interpreted as “how well does the model predict the shared, across-subject EEG component” — not as a trial-reliability bound on a single recording. Use it as a model-comparison axis, not as an absolute predictive-power ceiling.
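For reference, the normalized correlation coefficient of Schoppe et al. (2016) can be written in a few lines of NumPy. This is a sketch of the published formula applied to one channel with subjects in the repeat axis; it is not deepSTRF's implementation, the function name `ccnorm_schoppe` is hypothetical, and unlike the library metric it does not handle NaN repeat slots.

```python
import numpy as np

def ccnorm_schoppe(responses, prediction):
    """responses: (n_repeats, n_times); prediction: (n_times,). Assumes no NaNs."""
    n = responses.shape[0]
    mean_resp = responses.mean(axis=0)
    # Signal power: the part of the variance that is shared across repeats.
    sp = (np.var(responses.sum(axis=0), ddof=1)
          - np.var(responses, axis=1, ddof=1).sum()) / (n * (n - 1))
    cov = np.cov(mean_resp, prediction, ddof=1)[0, 1]
    return cov / np.sqrt(np.var(prediction, ddof=1) * sp)

rng = np.random.default_rng(0)
shared = rng.normal(size=5000)                            # across-subject component
responses = shared + 2.0 * rng.normal(size=(33, 5000))    # 33 "repeats" (subjects)

cc_perfect = ccnorm_schoppe(responses, shared)            # model equals the shared signal
cc_partial = ccnorm_schoppe(responses, shared + rng.normal(size=5000))
```

On this synthetic data the noise really is iid across repeats, so a prediction equal to the shared component scores close to 1. On real subjects the caveat above applies, and the value is best read relative to other models.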

Benchmark targets (Brodbeck et al. 2023, eLife)

Brodbeck reports % variability explained per channel, averaged across 33 subjects. The key thing to internalise about these numbers — and the envelope-tracking literature generally — is the unit:

% variability explained = 100 · r², not Pearson r.

Figure 4B’s “envelope predictive power” topography has a colorbar that maxes out at 1 %. That is r² = 0.01, i.e. a per-channel Pearson r ≈ 0.10. The incremental panels (4C/4D: the predictive-power gain from adding onsets / spectrogram over the envelope-only model) span ±0.1 % Δ-variance, which corresponds to |r| ≈ 0.03 when the increment is read as a stand-alone r². Note that r values do not add: a model explaining 1.1 % in total has r ≈ 0.105, not 0.13. So the real headline ceiling is:

Predictor model	% variability explained	equiv. Pearson `r`
Envelope alone	up to ~1 %	up to ~0.10
+ acoustic-onset	+ ~0.1 %	+ ~0.03
+ spectrogram	+ ~0.1 %	+ ~0.03
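The unit conversion behind the table is a one-liner; the helper name `pct_var_to_r` is made up for this sketch.

```python
import math

def pct_var_to_r(pct):
    """Pearson r equivalent of a percent-variance-explained value (pct = 100 * r**2)."""
    return math.sqrt(pct / 100.0)

envelope_r = pct_var_to_r(1.0)     # envelope-only ceiling
increment_r = pct_var_to_r(0.1)    # a 0.1 % increment read as a stand-alone r**2
combined_r = pct_var_to_r(1.1)     # envelope plus one 0.1 % increment

print(round(envelope_r, 3), round(increment_r, 3), round(combined_r, 3))
# 0.1 0.032 0.105
```

The last value shows why the r-equivalents in the right-hand column should not be summed: a 0.1 % gain on top of 1 % moves r from about 0.100 to about 0.105.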

These are tiny absolute prediction accuracies by design — scalp EEG single-trial envelope tracking is a low-SNR problem (see below). The science is in the significance of the increment and the shape of the recovered TRF, not the absolute predictive power.

What this library reproduces

We verified empirically (subject S01, a single 9/1/2 split of the 12 stimulus segments, 8-band Heeris gammatone spectrogram, 0.5–20 Hz bandpass) that four independent estimators converge on the same per-channel test accuracy:

Estimator	Mean test `r`	Max	Mean % var
deepSTRF `Linear` STRF (AdamW + MSE)	0.084	0.16	0.86 %
`sklearn` Ridge (α-grid)	0.084	0.16	0.97 %
deepSTRF `StateNet` GRU (C=14, 6.8k params)	0.086	0.17	0.74 %
eelbrain `boosting` (Brodbeck’s own method)	0.090	0.17	1.05 %

The Δ between the deepSTRF Linear STRF and eelbrain’s L1-boosting + 50 ms-basis pipeline is about +0.006 r (0.090 vs 0.084), within run-to-run noise. There is no algorithmic gap and no spectrogram-pipeline gap: deepSTRF reaches the published single-subject envelope-TRF ceiling on this dataset. The recurrent StateNet matches the linear STRF with 7× fewer parameters, which makes it the sample-efficient choice for the small per-subject data and the natural backbone for multi-subject pooling.
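For readers unfamiliar with the linear baseline, a forward STRF reduces to ridge regression on a time-lagged copy of the spectrogram. The sketch below uses synthetic data and a closed-form solve in NumPy; it is not the deepSTRF `Linear` model or the sklearn pipeline used for the table, and the lag count, kernel scale and α are arbitrary illustrative values.

```python
import numpy as np

def lagged_design(stim, n_lags):
    """Stack delayed copies of stim (n_times, n_bands) into (n_times, n_bands * n_lags)."""
    n_times, n_bands = stim.shape
    X = np.zeros((n_times, n_bands, n_lags))
    for lag in range(n_lags):
        X[lag:, :, lag] = stim[:n_times - lag]   # column `lag` holds the stimulus delayed by `lag`
    return X.reshape(n_times, n_bands * n_lags)

def fit_ridge(X, y, alpha):
    """Closed-form ridge weights (no intercept; inputs assumed centred)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
n_times, n_bands, n_lags = 6000, 8, 30           # 60 s at 100 Hz, lags 0-290 ms
stim = rng.normal(size=(n_times, n_bands))
true_kernel = 0.02 * rng.normal(size=(n_bands, n_lags))
X = lagged_design(stim, n_lags)
eeg = X @ true_kernel.ravel() + rng.normal(size=n_times)   # one low-SNR channel

split = int(0.75 * n_times)                      # chronological train/test split
w = fit_ridge(X[:split], eeg[:split], alpha=100.0)
test_r = np.corrcoef(X[split:] @ w, eeg[split:])[0, 1]
strf = w.reshape(n_bands, n_lags)                # recovered (frequency x lag) kernel
```

Even in this toy setting the held-out `r` stays modest while the recovered kernel correlates well with the true one, which mirrors the point that kernel recovery can be good when single-trial prediction is poor.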

The accompanying example notebook walks through a linear baseline and a StateNet core on the same data pipeline. The deepSTRF reframing of the onset-spectrogram condition is AdapTrans + Linear — a learnable peripheral adaptation front-end in place of the hand-engineered Fishbach-2001 onset detector.

What “good” looks like on scalp EEG

If you come to this dataset from single-unit or ECoG work, the per-channel r ≈ 0.1 ceiling will look like a broken fit. It is not. Single-trial scalp-EEG envelope-TRF prediction r is 0.05–0.15 across the foundational literature (Lalor & Foxe 2010; Ding & Simon 2012; Di Liberto et al. 2015; Crosse et al. 2016; Broderick et al. 2018; Brodbeck et al. 2018). r ≥ 0.3 only happens with intracranial recordings (ECoG/sEEG) or unit-level data with many trial repeats — the cortical envelope-tracking response is ~1 µV against 30–50 µV of background, so even r² = 1 % is a real, replicable signal.

Because absolute prediction accuracy is low, the EEG/MEG-TRF field reports its results differently from the spike-prediction cc_norm that deepSTRF’s auditory datasets use:

1. The TRF/STRF kernel shape itself is the primary deliverable. The recovered response function has interpretable, replicable peaks (P1/M50 ≈ 50 ms, N1/M100 ≈ 100 ms, P2 ≈ 200 ms); their latencies and topographies are the science. The kernel is estimated precisely even when single-trial prediction is poor. deepSTRF surfaces this via AudioEncodingModel.STRF_gradmap.
2. Predictive power as a significance test, not a score. With n=33 subjects, even r² = 0.5 % is p ≪ 0.001 at the group level. The question is “does adding predictor X significantly increase predictive power over the model without it?”, i.e. the Δ%-variance you see in Brodbeck Fig 4C/4D, tested across subjects.
3. Nested-model comparison / variance partitioning. “Does a phoneme-surprisal predictor explain variance beyond the acoustic envelope?” Fit nested models, compare predictive power. This is the main thing TRFs are for: disentangling overlapping, correlated acoustic / lexical / semantic predictors.
4. Backward (decoding) models + attention decoding. Reconstructing the stimulus envelope from EEG pools all channels and reaches higher r (≈ 0.1–0.3); auditory-attention decoding then reports classification accuracy (often 80–90 % in 60 s windows), not r.
5. Group-level cluster statistics over the (time-lag × sensor) TRF (TFCE / cluster-permutation corrected). The result is a significant spatiotemporal cluster, reported as a topography + time-course.

For deepSTRF’s purposes, the per-channel r we measure is the right sanity number; the scientifically useful outputs on EEG are (1) the recovered TRF kernels and (3) nested-model predictive-power comparisons.
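A nested-model comparison ends in a test of the per-subject predictive-power gains against zero. One simple way to do that is a sign-flip permutation test, sketched here in NumPy on synthetic gains. The choice of test and the function name `sign_flip_pvalue` are assumptions for illustration; published analyses may use t-tests or the cluster-based statistics mentioned above.

```python
import numpy as np

def sign_flip_pvalue(deltas, n_perm=10000, seed=0):
    """One-sided p-value for mean(deltas) > 0 under random per-subject sign flips."""
    deltas = np.asarray(deltas, dtype=float)
    rng = np.random.default_rng(seed)
    observed = deltas.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, deltas.size))
    null = (signs * deltas).mean(axis=1)          # null distribution of the mean gain
    # Count the observed value itself so the p-value is never exactly zero.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

rng = np.random.default_rng(1)
# Synthetic per-subject gains in % variance explained (full minus reduced model), n = 33.
delta_pct = rng.normal(loc=0.1, scale=0.1, size=33)
p_gain = sign_flip_pvalue(delta_pct)

# A control with no systematic gain: alternating +1 / -1.
p_null = sign_flip_pvalue([1.0, -1.0] * 10)
```

The input is one number per subject (test-set % variance of the full model minus that of the reduced model), so the tiny absolute values do not matter; only their consistency across subjects does.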

Setup

Requirements: the [eeg] optional extra (pulls in MNE-Python):

pip install "deepSTRF[eeg]"

Easiest path — auto-download from the UMd DRUM mirror (~2.5 GiB total, anonymous HTTPS, idempotent):

from deepSTRF.datasets.audio import AliceEEGDataset

ds = AliceEEGDataset(download=True, dt_ms=10, n_frequency_bands=8)

Default cache dir is platformdirs.user_cache_dir('deepSTRF')/Alice_EEG, overridable via $DEEPSTRF_DATA_DIR.

If the data is already laid out manually:

ds = AliceEEGDataset(path="/path/to/brodbeck_eelbrain_elife", dt_ms=10)

Expected layout under path:

brodbeck_eelbrain_elife/
├── eeg.0/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.1/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.2/eeg/Sxx/Sxx_alice-raw.fif
└── stimuli/{1..12}.wav  +  AliceChapterOne-EEG.csv
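Before pointing `path` at a manual copy, it can help to check that the tree matches the layout above. The helper below is a hypothetical convenience written for this page with the standard library only; it is not part of deepSTRF.

```python
from pathlib import Path

def check_alice_layout(root):
    """Return a list of problems found under `root`; an empty list means the layout matches."""
    root = Path(root)
    problems = []
    # At least one per-subject recording somewhere under eeg.*/eeg/Sxx/
    if not list(root.glob("eeg.*/eeg/S*/S*_alice-raw.fif")):
        problems.append("no Sxx_alice-raw.fif files under eeg.*/eeg/Sxx/")
    # The 12 audio segments
    for i in range(1, 13):
        if not (root / "stimuli" / f"{i}.wav").is_file():
            problems.append(f"missing stimuli/{i}.wav")
    # The word-onset / surprisal table
    if not (root / "stimuli" / "AliceChapterOne-EEG.csv").is_file():
        problems.append("missing stimuli/AliceChapterOne-EEG.csv")
    return problems
```

Usage would be along the lines of `check_alice_layout("/path/to/brodbeck_eelbrain_elife")`, printing the returned list if it is not empty.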

Filtering

Each stim_meta dict carries name, type ("alice_chapter1"), sample_rate, n_samples, duration_s. Each nrn_meta dict carries channel_id, subject (or None in repeats mode), area ("EEG"), and xyz (channel position from the standard 10–20 montage, or None if not in the montage). Combined with the base-class selection API:

# default: all subjects, treat_subjects_as="neurons"
ds = AliceEEGDataset(download=True)

# only one subject
ds = AliceEEGDataset(download=True, subjects=["S20"])

# inter-subject reliability mode
ds = AliceEEGDataset(download=True, treat_subjects_as="repeats")

# post-construction: select a single channel by its id
ds.select_pop_by_nrn_attr("channel_id", "1")
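The `xyz` field makes spatial selections possible as well. The sketch below picks the ids of frontal channels from a list of metadata dicts shaped like the `nrn_meta` records described above. The records and coordinates are invented, and the rule "+y points toward the nose" is an assumption borrowed from MNE's head coordinate frame; check the actual frame of `nrn_meta['xyz']` before relying on it.

```python
# Hypothetical metadata records (positions in metres, made up for illustration).
nrn_meta = [
    {"channel_id": "1", "subject": "S20", "area": "EEG", "xyz": (0.00, 0.08, 0.04)},
    {"channel_id": "2", "subject": "S20", "area": "EEG", "xyz": (0.03, 0.06, 0.06)},
    {"channel_id": "30", "subject": "S20", "area": "EEG", "xyz": (0.00, -0.07, 0.05)},
    {"channel_id": "45", "subject": "S20", "area": "EEG", "xyz": None},  # not in the montage
]

def frontal_channel_ids(meta, y_min=0.03):
    """IDs of channels positioned in front of `y_min` (assumes +y is anterior)."""
    return sorted({m["channel_id"] for m in meta
                   if m["xyz"] is not None and m["xyz"][1] > y_min})

frontal = frontal_channel_ids(nrn_meta)
print(frontal)   # ['1', '2']
```

Whether `select_pop_by_nrn_attr` accepts several values in one call is not stated here, so the sketch stops at producing the list of ids.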

Status

The shipped AliceEEGDataset + canonical preprocessing (0.5–20 Hz bandpass, base-class standardize_stims + normalize_responses) correctly loads the data and feeds the deepSTRF model API, and reaches the published single-subject envelope-TRF ceiling (see the four-estimator comparison above). There is no accuracy gap to close against Brodbeck on the acoustic-only models — r ≈ 0.09 mean per channel is what the data supports, and deepSTRF’s Linear and StateNet both get there.

What would extend the analysis (in the directions the EEG-TRF field actually cares about — kernel recovery and nested-model comparison rather than raw predictive power), in increasing order of effort:

Word-onset / surprisal predictors from stimuli/AliceChapterOne-EEG.csv. Reproduces Brodbeck Fig 5–6 (TRF-of-discrete-events; function vs content words). This is the nested-model-comparison story — “does a lexical predictor explain variance beyond acoustics?” — and is where TRFs earn their keep.
Hamming-basis STRF kernel. Add a BasisKernel to deepSTRF.models.layers that constrains the temporal axis of the STRF to a sparse basis of overlapping Hamming windows (eelbrain’s basis=0.050). It won’t move the per-channel r much — we’re at the ceiling — but it produces smoother, more interpretable TRF kernels, which is the actual deliverable on EEG.
Subject embeddings + shared StateNet backbone. A learned per-subject context vector concatenated to the GRU input — true multi-subject pretraining, beyond the per-subject readouts the N axis already provides. The most promising route to a meaningfully higher number, by pooling the across-subject shared response.
eelbrain.boosting wrapper as an alternative Fitter. The apples-to-apples cross-check used to validate this dataset (see untracked/alice_eeg_eelbrain_compare.py); worth promoting to a reusable utility + regression test for future EEG/MEG work.
Topomap helper using mne.viz.plot_topomap from nrn_meta['xyz'] — for the eLife-style scalp figures.
Per-subject download=True instead of all 2.5 GiB at once.
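As a rough picture of what the Hamming-basis item would do, the sketch below smooths a sparse lag-weight vector with a 50 ms Hamming window at the default 100 Hz grid. It is one possible way to realise such a constraint, written in NumPy for illustration; the proposed BasisKernel layer does not exist yet, the function name is hypothetical, and eelbrain's own basis construction may differ in detail.

```python
import numpy as np

def hamming_basis_kernel(weights, basis_s=0.050, dt_ms=10):
    """Smooth a (possibly sparse) lag-weight vector with a unit-area Hamming window."""
    width = max(int(round(basis_s * 1000 / dt_ms)), 1)   # 50 ms at 100 Hz -> 5 samples
    window = np.hamming(width)
    window /= window.sum()
    return np.convolve(weights, window, mode="same")

weights = np.zeros(40)        # lags 0-390 ms at 100 Hz
weights[10] = 1.0             # a single impulse at 100 ms
kernel = hamming_basis_kernel(weights)
```

Each nonzero weight turns into a smooth bump centred on its lag, so a sparse solution (as produced by boosting or an L1 penalty) still yields a smooth TRF.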

The accompanying tutorial notebook exercises the dataset end-to-end. It is a library-on-EEG demonstration — showing that the same model API that fits ferret A1 spikes also fits human scalp EEG, lands at the field-standard ceiling, and recovers interpretable TRF kernels.