Scale matters: Large language models with billions (rather than millions) of parameters better match neural representations of natural language

Zhuoqiao Hong; Haocheng Wang; Zaid Zada; Harshvardhan Gazula; David Turner; Bobbi Aubrey; Leonard Niekerken; Werner Doyle; Sasha Devore; Patricia Dugan; Daniel Friedman; Orrin Devinsky; Adeen Flinker; Uri Hasson; Samuel A Nastase; Ariel Goldstein

doi:10.7554/eLife.101204.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Reviewing Editor
Nai Ding
Zhejiang University, Hangzhou, China
Senior Editor
Yanchao Bi
Beijing Normal University, Beijing, China

Reviewer #1 (Public review):

Summary:

The authors perform an analysis of the relationship between the size of an LMM and the predictive performance of an ECoG encoding model made using the representations from that LMM. They find a logarithmic relationship between model size and prediction performance, consistent with previous findings in fMRI. They additionally observe that as the model size increases, the location of the "peak" encoding performance typically moves further back into the model in terms of percent layer depth, an interesting result worthy of further analysis into these representations.

Strengths:

The evidence is quite convincing, consistent across model families, and complementary to other work in this field. This sort of analysis for ECoG is needed and supports the decade-long enduring trend of the "virtuous cycle" between neuroscience and AI research, where more powerful AI models have consistently yielded more effective predictions of responses in the brain. The lag analysis showing that optimal lags do not change with model size is a nice result using the higher temporal resolution of ECoG compared to other methods like fMRI.

Weaknesses:

I would have liked to have seen the data scaling trends explored a bit too, as this is somewhat analogous to the main scaling results. While better performance with more data might be unsurprising, showing good data scaling would be a strong and useful justification for additional data collection in the field, especially given the extremely limited amount of existing language ECoG data. I realize that the data here is somewhat limited (only 30 minutes per subject), but authors could still in principle train models on subsets of this data.

Separately, it would be nice to have better justification of some of these trends, in particular the peak layerwise encoding performance trend and the overall upside-down U-trend of encoding performance across layers more generally. There is clearly something very fundamental going on here, about the nature of abstraction patterns in LLMs and in the brain, and this result points to that. I don't see the lack of justification here as a critical issue, but the paper would certainly be better with some theoretical explanation for why this might be the case.

Lastly, I would have wanted to see a similar analysis here done for audio encoding models using Whisper or WavLM as this is the modality where you might see real differences between ECoG and other slower scanning approaches. Again, I do not see this omission as a fundamental issue, but it does seem like the sort of analysis for which the higher temporal resolution of ECoG might grant some deeper insight.

https://doi.org/10.7554/eLife.101204.1.sa3

Reviewer #2 (Public review):

Summary:

This paper investigates whether large language models (LLMs) of increasing size more accurately align with brain activity during naturalistic language comprehension. The authors extracted word embeddings from LLMs for each word in a 30-minute story and regressed them against electrocorticography (ECoG) activity time-locked to each word as participants listened to the story. The findings reveal that larger LLMs more effectively predict ECoG activity, reflecting the scaling laws observed in other natural language processing tasks.

Strengths:

(1) The study compared model activity with ECoG recordings, which offer much better temporal resolution than other neuroimaging methods, allowing for the examination of model encoding performance across various lags relative to word onset.

(2) The range of LLMs tested is comprehensive, spanning from 82 million to 70 billion parameters. This serves as a valuable reference for researchers selecting LLMs for brain encoding and decoding studies.

(3) The regression methods used are well-established in prior research, and the results demonstrate a convincing scaling law for the brain encoding ability of LLMs. The consistency of these results after PCA dimensionality reduction further supports the claim.

Weaknesses:

(1) Some claims of the paper are less convincing. The authors suggested that "scaling could be a property that the human brain, similar to LLMs, can utilize to enhance performance", however, many other animals have brains with more neurons than the human brain, making it unlikely that simple scaling alone leads to better language performance. Additionally, the authors claim that their results show 'larger models better predict the structure of natural language.' However, it remains unclear to what extent the embeddings of LLMs capture the "structure" of language better than the lexical semantics of language.

(2) The study lacks control LLMs with randomly initialized weights and control regressors, such as word frequency and phonetic features of speech, making it unclear what the baseline is for the model-brain correlation.

(3) The finding that peak encoding performance tends to occur in relatively earlier layers in larger models is somewhat surprising and requires further explanation. Since more layers mean more parameters, if the later layers diverge from language processing in the brain, it raises the question of what aspects of the larger models make them more brain-like.

https://doi.org/10.7554/eLife.101204.1.sa2

Reviewer #3 (Public review):

This manuscript studies the connection between neural activity collected through electrocorticography and hidden vector representations from autoregressive language models, with the specific aim of studying the influence of language model size on this connection. Neural activity was measured from subjects who listened to a segment from a podcast, and the representations from language models were calculated using the written transcription as the input text. The ability of vector representations to predict neural activity was evaluated using 10-fold cross-validation with ridge regression models.

The main results are that (as well summarized in section headings):

(1) Larger models predict neural activity better.

(2) The ability of language model representations to predict neural activity differs across electrodes and brain regions.

(3) The layer that best predicts neural activity differs according to model size, with the "SMALL" model showing a correspondence between layer number and the language processing hierarchy.

(4) There seems to be a similar relationship between the time lag and the ability of language model representations to predict neural activity across models.

Strengths:

(1) The experimental and modeling protocols generally seem solid, which yielded results that answer the authors' primary research question.

(2) Electrocorticography data is especially hard to collect, so these results make a nice addition to recent functional magnetic resonance imaging studies.

Weaknesses:

(1) The interpretation of some results seems unjustified, although this may just be a presentational issue.

a) Figure 2B: The authors interpret the results as "a plateau in the maximal encoding performance," when some readers might interpret this rather as a decline after 13 billion parameters. Can this be further supported by a significance test like that shown in Figure 4B?

b) Figure S1A: It looks like the drop in PCA max correlation is larger for larger models, which may suggest to some readers that the same trend observed for ridge max correlation may not hold, contra the authors' claim that all results replicate. Why not include a similar figure as Figure 2B as part of Figure S1?

(2) Discussion of what might be driving the main result about the influence of model size appears to be missing (cf. the authors aim to provide an explanation of what seems to drive the influence of the layer location in Paragraph 3 of the Discussion section). What explanations have been proposed in the previous functional magnetic resonance imaging studies? Do those explanations also hold in the context of this study?

(3) The GloVe-based selection of language-sensitive electrodes (at least to me) isn't explained/motivated clearly enough (I think a more detailed explanation should be included in the Materials and Methods section). If the electrodes are selected based on GloVe embeddings, then isn't the main experiment just showing that representations from larger language models track more closely with GloVe embeddings? What justifies this methodology?

(4) (Minor weakness) The main experiments are largely replications of previous functional magnetic resonance imaging studies, with the exception of the one lag-based analysis. Is there anything else that the electrocorticography data can reveal that functional magnetic resonance imaging data can't?

https://doi.org/10.7554/eLife.101204.1.sa1

Author response:

We thank the reviewers for their thoughtful feedback and valuable comments. We plan to fully address their concerns by including the following experiments and analyses:

Reviewer 1 suggested exploring data scaling trends for encoding models, as successful scaling would justify larger datasets for language ECoG studies. To estimate scaling effects, we will develop encoding models on subsets of our data.

Reviewer 2 expressed uncertainty about the baseline for model-brain correlation and recommended adding control LLMs with randomly initialized weights. In response, we will generate embeddings using untrained LLMs to establish a more robust baseline for encoding results.

Reviewer 2 also proposed incorporating control regressors such as word frequency and phonetic features of speech. We will re-run our modeling analysis using control regressors for word frequency, 8 syntactic features (e.g., part of speech, dependency, prefix/suffix), and 3 phonetic features (e.g., phonemes, place/manner of articulation) to assess how much these features contribute to encoding performance.

Reviewer 3 raised concerns that the “plateau in maximal encoding performance” was actually a decline for the largest models. We will add significance tests in Figure 2B to clarify this issue.

Reviewer 3 also noted that in Supplementary Figure 1A, the decline in encoding performance was more pronounced when using PCA to reduce embedding dimensionality, in contrast to the trend observed when using ridge regression. To address this, we will attempt to replicate the observed scaling trends in Figure 2B using PCA combined with OLS.

Additionally, we will provide a point-by-point response and revise the manuscript with updated analyses and figures in the near future.

https://doi.org/10.7554/eLife.101204.1.sa0

Scale matters: Large language models with billions (rather than millions) of parameters better match neural representations of natural language

Peer review process

Editors

Be the first to read new articles from eLife