Alice EEG Dataset
Dataset Source: The Alice Datasets — Deep Blue Data (UMich), re-released by Brodbeck et al. as a preprocessed MNE-Python deposit at the University of Maryland DRUM repository (DOI 10.13016/pulf-lndn).
Citation:
Bhattasali, S., Brennan, J. R., Luh, W.-M., Franzluebbers, B., & Hale, J. T.
(2020). The Alice Datasets: fMRI & EEG Observations of Natural Language
Comprehension. Proceedings of the 12th Conference on Language Resources and
Evaluation (LREC), 120–125.
Papers using the dataset:
“Hierarchical structure guides rapid linguistic predictions during naturalistic listening” by Brennan, Hale & Bolhuis (2019), PLOS ONE.
“Eelbrain, a Python toolkit for time-continuous analysis with temporal response functions” by Brodbeck et al. (2023), eLife. (Uses the same Alice EEG release that deepSTRF consumes; the benchmark numbers below are taken from this paper.)
Dataset Details
Population fitting: ✅ (across subjects, across channels, or both)
Description of Stimuli:
First chapter of Alice in Wonderland read by a single narrator, split into 12 audio segments, totalling ~12.4 minutes (2129 words).
Mono 44.1 kHz .wav, plus a word-onset / n-gram-surprisal CSV table.
Description of Responses:
33 human participants listened passively to the chapter while EEG was recorded.
61 EEG channels per subject (10–20-like extended montage).
R = 1 per (subject, channel): each segment was played once per subject, with no within-subject repeats.
Available data:
Per-subject MNE-Python .fif files containing the continuous recording, channel montage, segment-onset annotations, bad-channel list, and artifact-rejection annotations.
Processing performed by the dataset class:
Audio segments are loaded and converted to log-power ERB-band spectrograms (a frequency-domain Gaussian approximation of the gammatone spectrogram used in the eelbrain analysis; same band structure as Brodbeck Fig 4 panels).
Per-subject EEG is downsampled to 1000 / dt_ms Hz (default 100 Hz), segmented at the 12 audio-onset annotations, and aligned to the spectrogram time grid.
Bad channels (raw.info['bads']) and bad-window annotations (BAD_*) are converted to NaN at the response level, following the deepSTRF data paradigm: the response array is the single source of truth, with no separate mask.
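As a rough illustration of the spectrogram step, the following NumPy sketch builds a log-power ERB-band spectrogram from Gaussian frequency weights. It is not the AliceEEGDataset implementation. The function names, the 25 ms Hann analysis window, the 20–5000 Hz band range, and the Gaussian width (half the ERB at each centre frequency) are all assumptions chosen for the example; only dt_ms and the number of bands mirror parameters named above.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg & Moore 1990)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at centre frequency f_hz."""
    return 24.7 * (0.00437 * f_hz + 1.0)

def erb_log_spectrogram(wave, sr, dt_ms=10, n_bands=8,
                        f_min=20.0, f_max=5000.0, win_ms=25.0):
    """Return (log_power, centres); log_power has shape (n_bands, n_frames)."""
    hop = int(round(sr * dt_ms / 1000))   # one frame every dt_ms
    win = int(round(sr * win_ms / 1000))  # assumed analysis window length
    n_frames = 1 + (len(wave) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, n_bins)
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    # Centre frequencies equally spaced on the ERB-rate scale.
    centres = erb_rate_to_hz(
        np.linspace(hz_to_erb_rate(f_min), hz_to_erb_rate(f_max), n_bands))
    sigma = erb_bandwidth(centres) / 2.0                   # assumed width
    weights = np.exp(
        -0.5 * ((freqs[None, :] - centres[:, None]) / sigma[:, None]) ** 2)
    weights /= weights.sum(axis=1, keepdims=True)          # unit-sum per band
    band_power = power @ weights.T                         # (n_frames, n_bands)
    return np.log(band_power + 1e-10).T, centres

# Toy input: a 1 s, 1 kHz tone at 16 kHz (not the Alice audio).
sr = 16000
tone = np.sin(2 * np.pi * 1000.0 * np.arange(sr) / sr)
spec, centres = erb_log_spectrogram(tone, sr)
```

With the real 44.1 kHz audio and dt_ms=10, the hop works out to exactly 441 samples, so the spectrogram frames land on the same 100 Hz grid as the downsampled EEG.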
Two modes: subjects-as-neurons vs subjects-as-repeats
Alice EEG sits at the intersection of two natural ways to organise the
data. AliceEEGDataset exposes both via the treat_subjects_as kwarg.
treat_subjects_as="neurons" (default)
Every (subject, channel) pair becomes one entry in the neuron axis.
N = sum_s(channels_s), R = 1. This is the standard deepSTRF view —
each “neuron” has one trial, and bad-channel-on-subject combos carry the
structural-NaN sentinel. The metrics to report are corrcoef and fve.
treat_subjects_as="repeats"
Channels-as-neurons, subjects-as-repeats. N = n_montage_channels (61),
R = n_subjects. Bad (channel, subject) cells become NaN at the repeat
slot, and the deepSTRF metrics handle them transparently.
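The two layouts can be pictured with a toy NumPy array. This is only a sketch of the bookkeeping: the (N, R, T) axis order, the array names, and the tiny sizes are assumptions for the example, and every subject is given the same channels, which the real data (where N = sum_s(channels_s)) does not guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_channels, n_time = 3, 4, 50   # toy sizes, not the real 33 x 61
eeg = rng.standard_normal((n_subjects, n_channels, n_time))
eeg[1, 2, :] = np.nan                       # channel 2 is bad for subject 1

# "neurons" view: every (subject, channel) pair is one unit, one trial each.
as_neurons = eeg.reshape(n_subjects * n_channels, 1, n_time)   # (N=12, R=1, T)

# "repeats" view: channels are units, subjects fill the repeat axis.
as_repeats = eeg.transpose(1, 0, 2)                            # (N=4, R=3, T)

# NaN-aware group average over the repeat axis: the bad (channel, subject)
# cell is skipped and the remaining subjects still contribute.
group_mean = np.nanmean(as_repeats, axis=1)                    # (N=4, T)
```

Because the bad cell is stored as NaN in the response array itself, any NaN-aware reduction over the repeat axis ignores it without a separate mask.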
This mode enables inter-subject reliability analysis via
normalized_corrcoef(method='schoppe'): predictions are scored against
an inter-subject signal-power ceiling analogous to (but not the same as)
the single-unit trial-reliability ceiling.
Interpretive caveat: the noise model underpinning the Schoppe
correction assumes iid trial noise around a shared deterministic signal.
Between-subject variability is structured (anatomy, source orientation,
attention) and only approximately iid. The math runs and gives a useful
group-level ceiling, but the resulting CCnorm is interpreted as
“how well does the model predict the shared, across-subject EEG
component” — not as a trial-reliability bound on a single recording. Use
it as a model-comparison axis, not as an absolute predictive-power
ceiling.
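For orientation, here is a minimal NumPy sketch of the signal-power-normalised correlation in the style of Schoppe et al. (2016), with subjects standing in for repeats. It is not deepSTRF's normalized_corrcoef: the function names are hypothetical, and dropping any repeat that contains a NaN is a simplifying assumption made only for this example. The data are synthetic and satisfy the iid-noise assumption exactly, which real between-subject variability does not.

```python
import numpy as np

def signal_power(responses):
    """Signal-power estimate from a (R, T) array of repeats without NaNs."""
    n = responses.shape[0]
    var_of_sum = np.var(responses.sum(axis=0), ddof=1)
    sum_of_var = np.var(responses, axis=1, ddof=1).sum()
    return (var_of_sum - sum_of_var) / (n * (n - 1))

def cc_norm(responses, prediction):
    """Covariance with the mean response, normalised by signal power."""
    # Simplification for this sketch: drop repeats that contain any NaN.
    responses = responses[~np.isnan(responses).any(axis=1)]
    sp = signal_power(responses)
    if sp <= 0:                       # no reliable shared signal
        return np.nan
    cov = np.cov(responses.mean(axis=0), prediction, ddof=1)[0, 1]
    return cov / np.sqrt(np.var(prediction, ddof=1) * sp)

# Synthetic check: a shared signal plus strong iid noise per "subject".
rng = np.random.default_rng(0)
n_repeats, n_time = 20, 5000
signal = rng.standard_normal(n_time)
responses = signal + 3.0 * rng.standard_normal((n_repeats, n_time))
prediction = signal.copy()            # a perfect model of the shared part

raw_r = np.corrcoef(responses.mean(axis=0), prediction)[0, 1]
ccn = cc_norm(responses, prediction)
```

On this toy data the raw correlation with the across-repeat mean stays clearly below 1 because of residual noise in the mean, while the normalised value sits close to 1, which is the sense in which it measures prediction of the shared component.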
Benchmark targets (Brodbeck et al. 2023, eLife)
Brodbeck reports % variability explained per channel, averaged across 33 subjects. The key thing to internalise about these numbers — and the envelope-tracking literature generally — is the unit:
% variability explained = 100 · r², not Pearson r.
Figure 4B’s “envelope predictive power” topography has a colorbar that
maxes at 1 %. That is r² = 0.01, i.e. a per-channel Pearson r ≈ 0.10. The incremental panels (4C/4D: the predictive-power gain from
adding onsets / spectrogram over the envelope-only model) span ±0.1 %
Δ-variance, a magnitude that corresponds to r ≈ 0.03 (√0.001) when taken on its own. So the real headline ceiling is:
| Predictor model | % variability explained | equiv. Pearson r |
|---|---|---|
| Envelope alone | up to ~1 % | up to ~0.10 |
| + acoustic-onset | + ~0.1 % | + ~0.03 |
| + spectrogram | + ~0.1 % | + ~0.03 |
These are tiny absolute prediction accuracies by design — scalp EEG single-trial envelope tracking is a low-SNR problem (see below). The science is in the significance of the increment and the shape of the recovered TRF, not the absolute predictive power.
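The unit conversion is simple enough to check by hand. The short sketch below (the helper name is illustrative) converts the percentages quoted above into Pearson r, and also shows why increments should be compared in % variance rather than in r.

```python
import math

def pct_var_to_r(pct):
    """Convert '% variability explained' (100 * r**2) to Pearson r."""
    return math.sqrt(pct / 100.0)

r_envelope = pct_var_to_r(1.0)           # envelope-only ceiling
r_increment_alone = pct_var_to_r(0.1)    # a 0.1 % gain viewed in isolation
r_combined = pct_var_to_r(1.0 + 0.1)     # the same gain stacked on 1 %

print(f"{r_envelope:.3f} {r_increment_alone:.3f} {r_combined:.3f}")
```

Variance explained accumulates across nested predictors, but r does not: adding 0.1 % to a 1 % baseline moves r from 0.100 to roughly 0.105, not to 0.13. The "equivalent r" of an increment is therefore only a sense of scale, not something to add to the baseline r.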
What this library reproduces
We verified empirically (subject S01, single 9/1/2 stim split, Heeris gammatone-8 spectrogram, 0.5–20 Hz bandpass) that four independent estimators converge on the same per-channel test accuracy:
| Estimator | Mean test r | Max r | Mean % var |
|---|---|---|---|
| deepSTRF | 0.084 | 0.16 | 0.86 % |
| | 0.084 | 0.16 | 0.97 % |
| deepSTRF | 0.086 | 0.17 | 0.74 % |
| eelbrain | 0.090 | 0.17 | 1.05 % |
The Δ between the deepSTRF Linear STRF and eelbrain’s L1-boosting +
50 ms-basis pipeline is +0.005 r — within run-to-run noise. There
is no algorithmic gap and no spec-pipeline gap: deepSTRF reaches the
published single-subject envelope-TRF ceiling on this dataset. The
recurrent StateNet matches the linear STRF with 7× fewer
parameters — the sample-efficient choice for the small per-subject
data, and the natural backbone for multi-subject pooling.
The accompanying example notebook
walks through a linear baseline and a StateNet core on the same data
pipeline. The deepSTRF reframing of the onset-spectrogram condition is
AdapTrans + Linear — a learnable peripheral adaptation front-end in
place of the hand-engineered Fishbach-2001 onset detector.
What “good” looks like on scalp EEG
If you come to this dataset from single-unit or ECoG work, the
per-channel r ≈ 0.1 ceiling will look like a broken fit. It is not.
Single-trial scalp-EEG envelope-TRF prediction r is 0.05–0.15
across the foundational literature (Lalor & Foxe 2010; Ding & Simon
2012; Di Liberto et al. 2015; Crosse et al. 2016; Broderick et al.
2018; Brodbeck et al. 2018). r ≥ 0.3 only happens with intracranial
recordings (ECoG/sEEG) or unit-level data with many trial
repeats — the cortical envelope-tracking response is ~1 µV against
30–50 µV of background, so even r² = 1 % is a real, replicable signal.
Because absolute prediction accuracy is low, the EEG/MEG-TRF field
reports its results differently from the spike-prediction cc_norm that
deepSTRF’s auditory datasets use:
1. The TRF/STRF kernel shape itself is the primary deliverable. The recovered response function has interpretable, replicable peaks (P1/M50 ≈ 50 ms, N1/M100 ≈ 100 ms, P2 ≈ 200 ms); their latencies and topographies are the science. The kernel is estimated precisely even when single-trial prediction is poor. deepSTRF surfaces this via AudioEncodingModel.STRF_gradmap.
2. Predictive power as a significance test, not a score. With n=33 subjects, even r² = 0.5 % is p ≪ 0.001 at the group level. The question is “does adding predictor X significantly increase predictive power over the model without it?”, i.e. the Δ%-variance you see in Brodbeck Fig 4C/4D, tested across subjects.
3. Nested-model comparison / variance partitioning. “Does a phoneme-surprisal predictor explain variance beyond the acoustic envelope?” Fit nested models, compare predictive power. This is the main thing TRFs are for: disentangling overlapping, correlated acoustic / lexical / semantic predictors.
4. Backward (decoding) models + attention decoding. Reconstructing the stimulus envelope from EEG pools all channels and reaches higher r (≈ 0.1–0.3); auditory-attention decoding then reports classification accuracy (often 80–90 % in 60 s windows), not r.
5. Group-level cluster statistics over the (time-lag × sensor) TRF (TFCE / cluster-permutation corrected). The result is a significant spatiotemporal cluster, reported as a topography + time-course.
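To make the "significance test, not a score" idea concrete, here is a small sketch of a group-level test on per-subject Δ%-variance between a full and a reduced model. It uses a one-sided sign-flip permutation test written in plain NumPy; that choice, the function name, and the synthetic per-subject values are assumptions for illustration and are not taken from Brodbeck et al. or from deepSTRF.

```python
import numpy as np

def sign_flip_pvalue(delta, n_perm=10000, seed=0):
    """One-sided p-value for mean(delta) > 0 under random sign flips."""
    rng = np.random.default_rng(seed)
    delta = np.asarray(delta, dtype=float)
    observed = delta.mean()
    # Under the null, each subject's difference is equally likely +/-.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, delta.size))
    null = (signs * delta).mean(axis=1)
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Synthetic example: 33 subjects, full-minus-reduced gain in % variance.
rng = np.random.default_rng(1)
delta_pct = 0.1 + 0.05 * rng.standard_normal(33)    # made-up values
p_gain = sign_flip_pvalue(delta_pct)

# Control: differences that cancel exactly give no evidence of a gain.
p_null = sign_flip_pvalue(np.tile([-1.0, 1.0], 16))
```

The point of the sketch is the scale: a gain of a tenth of a percent of variance is small in absolute terms, yet it is easily detected when it is consistent in sign across subjects.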
For deepSTRF’s purposes, the per-channel r we measure is the right
sanity number; the scientifically useful outputs on EEG are (1) the
recovered TRF kernels and (3) nested-model predictive-power comparisons.
Setup
Requirements: the [eeg] optional extra (pulls in MNE-Python):
pip install "deepSTRF[eeg]"
Easiest path — auto-download from the UMd DRUM mirror (~2.5 GiB total, anonymous HTTPS, idempotent):
from deepSTRF.datasets.audio import AliceEEGDataset
ds = AliceEEGDataset(download=True, dt_ms=10, n_frequency_bands=8)
Default cache dir is
platformdirs.user_cache_dir('deepSTRF')/Alice_EEG, overridable via
$DEEPSTRF_DATA_DIR.
If the data is already laid out manually:
ds = AliceEEGDataset(path="/path/to/brodbeck_eelbrain_elife", dt_ms=10)
Expected layout under path:
brodbeck_eelbrain_elife/
├── eeg.0/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.1/eeg/Sxx/Sxx_alice-raw.fif
├── eeg.2/eeg/Sxx/Sxx_alice-raw.fif
└── stimuli/{1..12}.wav + AliceChapterOne-EEG.csv
Filtering
Each stim_meta dict carries name, type ("alice_chapter1"),
sample_rate, n_samples, duration_s. Each nrn_meta dict
carries channel_id, subject (or None in repeats mode), area
("EEG"), and xyz (channel position from the standard 10–20 montage,
or None if not in the montage). Combined with the
base-class selection API:
# default — all subjects, both modes
ds = AliceEEGDataset(download=True)
# only one subject
ds = AliceEEGDataset(download=True, subjects=["S20"])
# inter-subject reliability mode
ds = AliceEEGDataset(download=True, treat_subjects_as="repeats")
# post-construction: select channels by a metadata attribute
ds.select_pop_by_nrn_attr("channel_id", "1") # one channel by id
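Conceptually, attribute-based selection amounts to filtering the list of nrn_meta dicts described above. The plain-Python sketch below mimics that with hand-written records. The helper select_indices is hypothetical and is not how select_pop_by_nrn_attr is implemented, and the xyz values are placeholders rather than real montage positions.

```python
# Hand-written records with the nrn_meta fields listed above (toy values).
nrn_meta = [
    {"channel_id": "1", "subject": "S01", "area": "EEG", "xyz": (0.0, 0.1, 0.0)},
    {"channel_id": "2", "subject": "S01", "area": "EEG", "xyz": None},
    {"channel_id": "1", "subject": "S20", "area": "EEG", "xyz": (0.0, 0.1, 0.0)},
    {"channel_id": "2", "subject": "S20", "area": "EEG", "xyz": None},
]

def select_indices(meta, attr, value):
    """Return positions on the neuron axis whose metadata matches."""
    return [i for i, m in enumerate(meta) if m.get(attr) == value]

channel_1 = select_indices(nrn_meta, "channel_id", "1")    # one per subject
subject_20 = select_indices(nrn_meta, "subject", "S20")
no_position = select_indices(nrn_meta, "xyz", None)        # not in montage
```

In neurons mode the same channel_id appears once per subject, so selecting by channel returns one entry per subject rather than a single unit.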
Status
The shipped AliceEEGDataset + canonical preprocessing (0.5–20 Hz
bandpass, base-class standardize_stims + normalize_responses)
correctly loads the data and feeds the deepSTRF model API, and reaches
the published single-subject envelope-TRF ceiling (see the
four-estimator comparison above).
There is no accuracy gap to close against Brodbeck on the acoustic-only
models — r ≈ 0.09 mean per channel is what the data supports, and
deepSTRF’s Linear and StateNet both get there.
What would extend the analysis (in the directions the EEG-TRF field actually cares about — kernel recovery and nested-model comparison rather than raw predictive power), in increasing order of effort:
Word-onset / surprisal predictors from
stimuli/AliceChapterOne-EEG.csv. Reproduces Brodbeck Fig 5–6 (TRF-of-discrete-events; function vs content words). This is the nested-model-comparison story — “does a lexical predictor explain variance beyond acoustics?” — and is where TRFs earn their keep.Hamming-basis STRF kernel. Add a
BasisKerneltodeepSTRF.models.layersthat constrains the temporal axis of the STRF to a sparse basis of overlapping Hamming windows (eelbrain’sbasis=0.050). It won’t move the per-channelrmuch — we’re at the ceiling — but it produces smoother, more interpretable TRF kernels, which is the actual deliverable on EEG.Subject embeddings + shared StateNet backbone. A learned per-subject context vector concatenated to the GRU input — true multi-subject pretraining, beyond the per-subject readouts the
Naxis already provides. The most promising route to a meaningfully higher number, by pooling the across-subject shared response.eelbrain.boostingwrapper as an alternativeFitter. The apples-to-apples cross-check used to validate this dataset (seeuntracked/alice_eeg_eelbrain_compare.py); worth promoting to a reusable utility + regression test for future EEG/MEG work.Topomap helper using
mne.viz.plot_topomapfromnrn_meta['xyz']— for the eLife-style scalp figures.Per-subject
download=Trueinstead of all 2.5 GiB at once.
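As a sketch of what the Hamming-basis item could look like numerically, the snippet below builds a matrix of overlapping Hamming windows and expresses a temporal kernel as that basis times a weight vector. It is one simple reading of the idea, not the proposed BasisKernel layer and not eelbrain's implementation: the function name, the one-window-per-lag layout, and the per-column normalisation are assumptions. At dt_ms=10, a 50 ms window is 5 samples.

```python
import numpy as np

def hamming_basis(n_lags, width):
    """(n_lags, n_lags) matrix; column k is a Hamming window centred on lag k."""
    window = np.hamming(width)
    half = width // 2
    basis = np.zeros((n_lags, n_lags))
    for k in range(n_lags):
        lo, hi = k - half, k - half + width        # window span in lag samples
        src_lo = max(0, -lo)                       # clip at the first lag
        src_hi = width - max(0, hi - n_lags)       # clip at the last lag
        basis[max(lo, 0):min(hi, n_lags), k] = window[src_lo:src_hi]
    # Normalise each column to unit sum so weights keep their scale.
    return basis / basis.sum(axis=0, keepdims=True)

n_lags, width = 50, 5                # 500 ms of lags, 50 ms windows at 100 Hz
basis = hamming_basis(n_lags, width)

# A single non-zero weight becomes a smooth 50 ms bump in the kernel.
weights = np.zeros(n_lags)
weights[20] = 1.0
kernel = basis @ weights             # temporal kernel, shape (n_lags,)
```

Learning the weights instead of the raw kernel keeps the model linear while ruling out sample-to-sample jitter in the recovered TRF, which is why the item above expects smoother kernels rather than a higher r.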
The accompanying tutorial notebook exercises the dataset end-to-end. It is a library-on-EEG demonstration — showing that the same model API that fits ferret A1 spikes also fits human scalp EEG, lands at the field-standard ceiling, and recovers interpretable TRF kernels.