Introduction

Since speech is a continuous signal, one of the infants’ first challenges during language acquisition is to break it down into smaller units, notably to be able to extract words. Parsing has been shown to rely on prosodic cues (e.g., pitch and duration changes) but also on the identification of regular patterns across perceptual units. Almost three decades ago, Saffran, Aslin, and Newport (1996) demonstrated that infants are sensitive to local regularities between syllables. After hearing a continuous, monotonous stream of syllables constructed by concatenating four tri-syllabic pseudo-words, 8-month-old infants distinguished a list of these triplets from a list of triplets straddling a word boundary, formed by the end of one pseudo-word and the beginning of another (called part-words). Indeed, for the correct triplets (called words), the transitional probability (TP) between syllables was 1, whereas it dropped to 1/3 for the transition encompassing two words, as in the part-words. Since this seminal study, statistical learning has been regarded as an essential mechanism for language acquisition because it allows for the extraction of regular patterns without prior knowledge.

During the last two decades, many studies have extended this finding by demonstrating sensitivity to statistical regularities in sequences across domains and species. For example, segmentation capacities analogous to those observed for a syllable stream are observed throughout life in the auditory modality for tones (Kudo et al., 2011; Saffran et al., 1999) and in the visual domain for shapes (Bulf et al., 2011; Fiser and Aslin, 2002; Kirkham et al., 2002) and actions (Baldwin et al., 2008; Monroy et al., 2017). Non-human animals, such as cotton-top tamarins (Hauser et al., 2001), rats (Toro and Trobalón, 2005), dogs (Boros et al., 2021), and chicks (Santolin et al., 2016), are also sensitive to TPs. While the level of complexity that each species can track might differ, statistical learning between events appears to be a general learning mechanism for auditory and visual sequence processing (for a review of statistical learning capacities across species, see Santolin and Saffran, 2018).

Using near-infrared spectroscopy (NIRS) and electroencephalography (EEG), we have shown that statistical learning is observed in sleeping neonates (Fló et al., 2019, 2022), highlighting the automaticity of this mechanism. We also discovered that tracking statistical probabilities might not lead to stream segmentation in the case of quadrisyllabic words, in both neonates and adults, revealing an unsuspected limitation of this mechanism (Benjamin et al., 2022). Here, we aimed to further characterise this mechanism in order to shed light on its role in the early stages of language acquisition. In particular, we wanted to clarify, first, whether statistical learning in human neonates is a general learning mechanism applicable to any speech feature or whether there is a bias in favour of computations over linguistic content to extract words; and second, at what level newborns compute transitions between syllables: at a low auditory level, i.e., between the presented acoustic events, or later in the processing chain, at the phonetic level, after normalisation across an irrelevant dimension such as voice.

We have, therefore, taken advantage of the fact that syllables convey two important pieces of information for humans: what is being said and who is speaking, i.e., linguistic information and information about the speaker’s identity. While statistical learning can be helpful for word extraction, a statistical relationship between successive voices is of no obvious use and could even hinder word extraction if instances of a word uttered by different speakers are treated independently. However, as auditory processing is organised along several hierarchical and parallel pathways integrating different spectro-temporal dimensions (Belin et al., 2000; DeWitt and Rauschecker, 2012; Norman-Haignere et al., 2015; Zatorre and Belin, 2001), statistical learning might be computed on one dimension independently of variation in the other, along the linguistic and voice pathways in parallel. Given the numerous behavioural and brain-imaging studies showing phonetic normalisation across speakers in infants (Dehaene-Lambertz and Pena, 2001; Gennari et al., 2021; Kuhl and Miller, 1982), as well as the statistical learning studies in the second half of the first year using different voices (Estes and Lew-Williams, 2015) and natural speech, in which production already varies from one instance to another (Hay et al., 2011; Pelucchi et al., 2015), we expected that, even if each syllable were produced by a different speaker, TPs between syllables would be computed after a normalisation process at the syllable or phonetic level, even in neonates. The predictions for learning the TPs between different voices were more open. Either statistical learning is universal and can be computed similarly over any dimension, including voices, or listening to a speech stream favours the processing of phonetic regularities over other, non-linguistic dimensions of speech and thus hinders the possibility of computing regularities over voices.

To study these possibilities, we constructed artificial streams based on the random concatenation of three bi-syllabic pseudo-words (duplets). To build the duplets, we used six consonant-vowel (CV) syllables produced by six voices (Tables S1 and S2), resulting in 36 possible tokens. To form the streams, tokens were combined either by considering their phonetic content (Experiment 1: Structure over Phonemes) or their voice content (Experiment 2: Structure over Voices), while the other dimension varied randomly (Figure 1). For example, in Experiment 1, one duplet could be petu, with pe and tu uttered by a random voice each time. In contrast, in Experiment 2, one duplet could be the combination [yellow voice-red voice], each voice uttering any of the syllables at random.
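To make the factorisation concrete, the following minimal MATLAB sketch builds one example duplet under each structure type. The syllable/voice pairings are invented for illustration (the actual lists are given in Table S2), the sixth syllable is a placeholder because it is not spelled out in the Methods, and the colour-to-voice mapping of Figure 1 is unknown to us.

```matlab
% Minimal sketch of the 6 x 6 token space and the two structure types.
% 'xx' is a placeholder for the sixth syllable; the pairings below are
% illustrative, not the actual lists A/B of Table S2.
syllables = {'pe','tu','bo','da','ki','xx'};
voices    = {'fr1','fr2','fr3','fr4','fr7','it4'};

% Experiment 1 (Structure over Phonemes): "petu" is a word, and each of
% its syllables is uttered by a voice drawn at random.
petu = {struct('syl','pe','voice',voices{randi(6)}), ...
        struct('syl','tu','voice',voices{randi(6)})};

% Experiment 2 (Structure over Voices): a voice pair (e.g. fr1 -> fr2) is
% a word, and each position carries a syllable drawn at random.
voicePair = {struct('syl',syllables{randi(6)},'voice','fr1'), ...
             struct('syl',syllables{randi(6)},'voice','fr2')};
```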

Experimental protocol.

The experiments started with a Random stream (120 s) in which both syllables and voices changed randomly, followed by a long Structured stream (120 s). Then, ten short familiarisation streams (30 s) were presented, each followed by a test block comprising 18 isolated duplets (SOA 2-2.3 s). Example streams are shown to illustrate their construction, with different colours representing different voices. In Experiment 1, the Structured stream had a statistical structure based on phonemes (TPs alternated between 1 and 0.5), while the voices changed randomly (uniform TPs of 0.2). For example, the two syllables of the word “petu” were produced by different voices, which changed randomly at each presentation of the word. In Experiment 2, the statistical structure was based on voices (TPs alternated between 1 and 0.5), while the syllables changed randomly (uniform TPs of 0.2). For example, the “green” voice was always followed by the “red” voice, but the syllables they uttered changed randomly (“boda” then “tupe” in our example). The test duplets were either Words (TP = 1) or Part-words (TP = 0.5). Words and Part-words were defined in terms of phonetic content for Experiment 1 and voice content for Experiment 2.

If infants at birth compute regularities over the raw auditory signal, this implies computing TPs over the 36 tokens, i.e., a 36 × 36 TP matrix relating each acoustic event to the next, with TPs alternating between 1/6 within words and 1/12 between words. With this type of computation, we predict that infants should fail the task in both experiments, since previous studies showing successful segmentation in infants used high TPs within words (usually 1) and far fewer elements (4 to 12 in most studies) (Saffran and Kirkham, 2018). If, instead, speech input is processed along the two studied dimensions in distinct pathways, two independent 6 × 6 TP matrices can be computed, one between the six voices and one between the six syllables. These computations would result in TPs alternating between 1 and 1/2 for the informative feature and uniform at 1/5 for the uninformative feature, leading to stream segmentation based on the informative dimension.
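These predicted values can be checked numerically. Below is a toy MATLAB sketch (our own, not the stimulus-generation code) that builds a stream with structure over syllables, using duplets 1-2, 3-4, and 5-6 with no immediate duplet repetition, while voices are drawn freely at random, and then estimates the three TP matrices. Because the toy draws voices without the no-repetition constraint used in the actual streams, the uniform voice TPs come out at 1/6 rather than 0.2.

```matlab
% Toy check of the predicted TP matrices (estimateTP defined below).
nDup = 20000;
order = zeros(1, nDup); order(1) = randi(3);
for k = 2:nDup                                 % no immediate duplet repeat
    opts = setdiff(1:3, order(k-1));
    order(k) = opts(randi(2));
end
dup = [1 2; 3 4; 5 6];
syl = reshape(dup(order, :)', 1, []);          % syllable index sequence
voi = randi(6, size(syl));                     % voices fully random here
tok = sub2ind([6 6], syl, voi);                % 36 token identities

TPsyl = estimateTP(syl, 6);   % alternates ~1 (within) and ~0.5 (between)
TPvoi = estimateTP(voi, 6);   % ~1/6 everywhere (0.2 with a no-repeat rule)
TPtok = estimateTP(tok, 36);  % ~1/6 within words, ~1/12 between words

function TP = estimateTP(seq, n)
% TP(i,j) = P(next = j | current = i), estimated from bigram counts
counts = accumarray([seq(1:end-1)' seq(2:end)'], 1, [n n]);
TP = counts ./ max(sum(counts, 2), 1);
end
```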

As in our previous experiments (Benjamin et al., 2022; Fló et al., 2022), we used high-density EEG (128 electrodes) to study speech segmentation abilities. An artificial language whose syllables have a fixed duration elicits steady-state evoked potentials (SSEPs) at the syllable rate. Crucially, if the artificial language presents a regular structure between syllables (i.e., regular drops in TPs marking word boundaries) and this structure is perceived, then the neural response also reflects the slower frequency of the words (Buiatti et al., 2009). In other words, brain activity becomes phase-locked to the regular input, increasing the inter-trial coherence (ITC) and power at the frequencies of the input regularities. Under these circumstances, analysis in the frequency domain is advantageous since fast and periodic responses can easily be investigated by looking at the target frequencies without considering their specific timing (Kabdebon et al., 2022). The phenomenon is also named frequency tagging or neural entrainment in the literature. Here, we will refer to it interchangeably as SSEP or neural entrainment since we make no hypothesis on the origin of the response (i.e., pure evoked response or phase reset of endogenous oscillations (Giraud and Poeppel, 2012)). Our study used an orthogonal design across two groups of 1-4-day-old neonates. In Experiment 1 (34 infants), the regularities in the speech stream were based on the phonetic content, while the voices varied randomly (Phoneme group). Conversely, in Experiment 2 (33 infants), regularities were based on voices, while the phonemes changed randomly (Voice group). Both experiments started with a control stream in which both features varied randomly (i.e., Random stream, 120 s). Next, neonates were exposed to the Structured stream (120 s), with statistical structure over one or the other feature. The experiments ended with ten sets of 18 test duplets presented in isolation, preceded by short Structured streams (30 s) to maintain learning (Figure 1). Half of the test duplets corresponded to familiar regularities (Words, TP = 1), and the other half were duplets present in the language but straddling a drop in TPs (Part-words, TP = 0.5).

To investigate online learning, we quantified the ITC as a measure of neural entrainment at the syllable (4 Hz) and word (2 Hz) rates during the presentation of the continuous streams. To probe the recall process, we compared ERPs to Word and Part-word duplets. We also tested 57 adult participants in a comparable behavioural experiment to investigate adults’ segmentation capacities under the same conditions.

Results

Neural entrainment during the familiarisation phase

To measure neural entrainment, we quantified the ITC in non-overlapping epochs of 7.5 s. We compared the ITC at each frequency of interest (syllabic rate, 4 Hz, or duplet rate, 2 Hz) with the ITC in the 12 adjacent frequency bins, following the same methodology as in our previous studies.

For the Random streams, we observed significant entrainment at the syllable rate (4 Hz) over a broad set of electrodes in both experiments (p < 0.05, FDR corrected) and no enhanced activity at the duplet rate for any electrode (p > 0.05, FDR corrected). During the Structured streams, significant entrainment emerged at both the syllable and duplet rates (p < 0.05, FDR corrected) in both experiments (Figure 2A, 2B). The duplet effect was localised over occipital and central-left electrodes in the Phoneme group and over occipital and temporal-right electrodes in the Voice group.

Neural entrainment during the random and structured streams.

(A) SNR of the ITC during the Random and Structured streams of Experiment 1 (structure over phonetic content). The topographies represent the entrainment in electrode space at the syllabic (4 Hz) and duplet (2 Hz) rates. Crosses indicate the electrodes showing enhanced neural entrainment (cross: p < 0.05, one-sided paired permutation test, FDR corrected by the number of electrodes; dot: p < 0.05, without FDR correction). Colour scale limits [-1.8, 1.8]. The entrainment for each electrode is shown in light grey. The thick orange line shows the mean over the electrodes with significant entrainment relative to the adjacent frequency bins at the syllabic rate (4 Hz) (p < 0.05, FDR corrected). The thick green line shows the mean over the electrodes with significant entrainment relative to the adjacent frequency bins at the duplet rate (2 Hz) (p < 0.05, FDR corrected). The asterisks indicate frequency bins with entrainment significantly higher than in the adjacent frequency bins for the average across electrodes (p < 0.05, one-sided permutation test, FDR corrected for the number of frequency bins). (B) Analogous to (A) for Experiment 2 (structure over voice content). (C) The first two rows show the topographies of the difference in entrainment between the Structured and Random streams at 4 Hz and 2 Hz for both experiments. Crosses indicate the electrodes showing stronger entrainment during the Structured stream (cross: p < 0.05, one-sided paired permutation test, FDR corrected by the number of electrodes; dot: p < 0.05, without FDR correction). The bottom row shows the interaction effect, comparing the difference in entrainment between the Structured and Random streams across Experiments 1 and 2. Crosses indicate significant differences (cross: p < 0.05, two-sided unpaired permutation test, FDR corrected by the number of electrodes; dot: p < 0.05, without FDR correction). (D) Time course of the neural entrainment at 4 Hz, averaged over the electrodes showing significant entrainment during the Random stream, and at 2 Hz, averaged over the electrodes showing significant entrainment during the Structured stream (Phoneme group: green line; Voice group: blue line). The shaded areas represent standard errors. The horizontal lines at the bottom indicate when the entrainment was larger than 0 (p < 0.05, one-sided t-test, FDR corrected by the number of time points).

We also directly compared the ITC at both frequencies of interest between the Random and Structured conditions (Figure 2C). We found electrodes with significantly higher ITC at the duplet rate during the Structured streams than the Random streams in both experiments (p < 0.05, FDR corrected). We also found electrodes with higher entrainment at the syllable rate during the Structured than the Random streams in both experiments (p < 0.05, FDR corrected). This effect might result from stronger or more phase-locked responses to syllables over those electrodes when the input is structured. While the first harmonic of the duplet rate coincides with the syllable rate and could also contribute to this effect, this contribution is unlikely since these electrodes differ from those showing enhanced duplet-rate activity at 2 Hz.

Cluster-based permutation analysis of ERPs to isolated duplets during recall

The topographies show the difference between the two conditions corresponding to each main effect. Results obtained from the cluster-based permutation analyses are shown at the bottom of each panel. Thick lines correspond to the grand averages for the two main tested conditions. Shaded areas correspond to the standard error across participants. Thin lines show the ERPs separated by duplet type and familiarisation type. The shaded areas between the thick lines show the temporal extent of the cluster. The topographies correspond to the difference between conditions over the temporal extent of the cluster. The electrodes belonging to the cluster are marked with a cross. Significant clusters are indicated with an asterisk. Colour scale limits [-0.07, 0.07] a.u. (A) Main effect of test duplets (Words - Part-words) over a frontal-right positive cluster (p = 0.019) and a left temporal negative cluster (p = 0.0056). (B) Main effect of familiarisation (Phonemes - Voices) over a posterior negative cluster (p = 0.018). The frontal positive cluster did not reach significance (p = 0.12). Results are highly comparable to the ROI-based analysis presented in SI (Fig S3).

Adults’ behavioural experiment.

Each subject’s average score attributed to the Words (blue) and Part-words (orange) is represented: on the right, for the group familiarised with the Phoneme structure, and on the left, for the group familiarised with the Voice structure. The difference between test duplets was significant for the Phoneme group (p = 0.007) and only marginally significant for the Voice group (p = 0.050).

Finally, we looked for an interaction between group and condition (Structured vs. Random streams) (Figure 2C). A few electrodes showed differential responses between groups, reflecting the topographical differences observed in the previous analysis, notably a trend for stronger ITC at 2 Hz over the left central electrodes in the Phoneme group compared to the Voice group, but none survived multiple-comparison correction.

Learning time-course

To investigate the time course of learning, we computed neural entrainment at the duplet rate in sliding time windows of 2 minutes with a 1 s step across both the Random and Structured streams (Figure 2D). Notice that, because the integration window was two minutes long, the entrainment during the first minute of the Structured stream included data from the Random stream. To test whether the ITC at 2 Hz increased during the long Structured familiarisation (120 s), we fitted a linear mixed model (LMM) with a fixed effect of time and random slopes and intercepts for individual subjects: ITC ∼ −1 + time + (1 + time|subject). In the Phoneme group, we found a significant time effect (β = 4.16×10⁻³, 95% CI = [2.06×10⁻³, 6.29×10⁻³], SE = 1.05×10⁻³, p = 4×10⁻⁴), as well as in the Voice group (β = 2.46×10⁻³, 95% CI = [2.6×10⁻⁴, 4.66×10⁻³], SE = 1.09×10⁻³, p = 0.03). To test for differences in the time effect between groups, we included all data in a single LMM: ITC ∼ −1 + time × group + (1 + time|subject). The model showed a significant fixed effect of time for the Phoneme group, consistent with the previous results (β = 4.22×10⁻³, 95% CI = [1.07×10⁻³, 7.37×10⁻³], SE = 1.58×10⁻³, p = 0.0096), while the fixed effect estimating the difference between the Phoneme and Voice groups was not significant (β = -1.84×10⁻³, 95% CI = [-6.29×10⁻³, 2.61×10⁻³], SE = 2.24×10⁻³, p = 0.4).

ERPs during the test phase

To test the recall process, we measured ERPs to isolated duplets presented after the streams. The average ERP across all conditions merged is shown in Figure S1. We investigated (1) the main effect of test duplet (Word vs. Part-word) across both experiments, (2) the main effect of familiarisation structure (Phoneme group vs. Voice group), and (3) the interaction between these two factors. We used non-parametric cluster-based permutation analyses (i.e., without a priori ROIs) (Oostenveld et al., 2011).

The difference between Words and Part-words consisted of a dipole with a midline positivity and a left temporal negativity extending from 400 to 1500 ms, with a maximum around 800-900 ms (Fig 3). Cluster-based permutations recovered two significant clusters around 500-1500 ms: a frontal-right positive cluster (p = 0.019) and a left temporal negative cluster (p = 0.0056). A difference between groups was also observed, consisting of a dipole that started with a right temporal positivity and a left temporo-occipital negativity around 300 ms and rotated anti-clockwise, bringing the positivity over the frontal electrodes and the negativity to the back of the head (500-800 ms) (Fig S2). Cluster-based permutations on the Phoneme group vs. Voice group contrast recovered a posterior cluster (p = 0.018) around 500 ms, with no positive cluster reaching significance (p > 0.10). A cluster-based permutation analysis of the interaction effect, i.e., comparing Words - Part-words between the two experiments, showed no significant clusters (p > 0.1).

As cluster-based statistics are not very sensitive, we also analysed the ERPs over seven ROIs defined on the grand-average ERP of all conditions merged (see Methods). The results replicated those of the cluster-based permutation analysis, with similar differences between Words and Part-words, a similar effect of familiarisation, and no significant interactions. Results are presented in SI. The temporal progression of voltage topographies for all ERPs is presented in Figure S2. To verify that the effects were not driven by a single group or duplet type, we ran a mixed two-way ANOVA on the average activity in each ROI and significant time window, with duplet type (Word/Part-word) as a within-subjects factor and familiarisation as a between-subjects factor. We did not observe significant interactions in any case. Future studies should consider a within-subject design to gain sensitivity to possible interaction effects.

Adults’ behavioural performance in the same task

Adult participants heard a Structured learning stream lasting 120 s and then ten sets of 18 test duplets, each set preceded by a short Structured stream (30 s). For each test duplet, they rated its familiarity on a scale from 1 to 6. For the group familiarised with the Phoneme structure, there was a significant difference between the scores attributed to Words and Part-words (t(26) = 2.92, p = 0.007, Cohen’s d = 0.562). The difference was marginally significant for the group familiarised with the Voice structure (t(29) = 2.04, p = 0.050, Cohen’s d = 0.373). A two-way ANOVA with test duplet and familiarisation as factors revealed a main effect of test duplet (F(1,55) = 12.52, p = 0.0008, η² = 0.039), no effect of familiarisation (F(1,55) < 1), and a significant test duplet × familiarisation interaction (F(1,55) = 5.28, p = 0.025, ηg² = 0.017).

Discussion

Statistical learning is a general learning mechanism

In two experiments, we compared statistical learning over a linguistic and a non-linguistic dimension in sleeping neonates. We took advantage of the possibility of constructing streams based on the same tokens, the only difference between the experiments being the arrangement of the tokens in the streams. We showed that neonates were sensitive to regularities based either on the phonetic or on the voice dimension of speech, even in the presence of an uninformative feature that had to be disregarded.

Parsing based on statistical information was revealed by steady-state evoked potentials at the duplet rate, observed around 2 min after the onset of the familiarisation stream, and by different ERPs to Words and Part-words presented during the test, in both experiments. Statistical learning was possible despite variations in the other dimension, showing that this mechanism operates at a stage at which these dimensions have already been separated along different processing pathways. Our results thus reveal that linguistic content and voice identity are computed independently and in parallel. This result confirms that, even in newborns, the syllable is not a holistic unit (Gennari et al., 2021) but that the rich spectro-temporal content of speech is processed in parallel along different networks, probably using different integration windows (Boemio et al., 2005; Moerel et al., 2012; Zatorre and Belin, 2001).

Second, we observed no obvious advantage for the linguistic dimension in neonates. These results reveal the universality of statistical learning. While statistical learning has already been described in many domains and species (Frost et al., 2015; Ren and Wang, 2023; Santolin and Saffran, 2018), we add here that even sleeping neonates apply it independently to possibly all the dimensions along which speech is factorised in the auditory cortex (Gennari et al., 2021; Gwilliams et al., 2022). This mechanism gives them a powerful tool to create associations between recurrent events.

Differences between statistical learning over voices and over phonemes

While the main pattern of results was comparable between experiments, we did observe some differences. The word-rate steady-state response (2 Hz) was left-lateralised over central electrodes in the group of infants exposed to structure over phonemes, whereas the group hearing structure over voices showed entrainment mostly over right temporal electrodes. These results are compatible with statistical learning being computed within different lateralised neural networks processing the phonetic and voice content of speech. Recent brain-imaging studies in infants do indeed show precursors of later networks with some hemispheric biases (Blasi et al., 2011; Dehaene-Lambertz et al., 2010), even if specialisation increases during development (Shultz et al., 2014; Sylvester et al., 2023). The hemispheric differences reported here should nevertheless be considered cautiously, since the group comparison did not survive multiple-comparison correction. Future work investigating the neural networks involved should implement a within-subject design to gain statistical power.

The time course of the entrainment at the duplet rate revealed that entrainment emerged at a similar time for both statistical structures. While this duplet-rate response seemed more stable in the Phoneme group (i.e., the ITC at the word rate was higher than zero in a sustained way only in the Phoneme group, and the slope of the increase was steeper), no significant difference was observed between groups. Since we did not observe group differences in the ERPs to Words and Part-words during the test, it is unlikely that these differences during learning were due to a poorer computation of the statistical transitions for the voice stream relative to the phoneme stream. An alternative explanation might be related to the nature of the duplet-rate entrainment. Entrainment might result from a differential response to low and high TPs and/or from a response to chunks in the stream (i.e., “Words”). In a previous study (Benjamin et al., 2022), we showed that in some circumstances neonates compute TPs, but entrainment does not emerge, likely due to the absence of chunking. It is thus possible that chunking was less stable when the regularity was over voices, consistent with previous studies reporting challenges with voice identification in both infants and adults (Johnson et al., 2011; Mahmoudzadeh et al., 2016).

Phoneme regularities might trigger a lexical search

In the test phase on isolated duplets, we also observed a significant difference between groups: a dipole, consisting of a posterior negative pole and a frontal positivity, was observed around 500 ms following linguistic duplets but not voice duplets. Since the acoustic properties of the duplets were the same in the two experiments, the difference can only be due to an endogenous process that modulates stimulus processing. Given its topography, its latency, and the context in which it appears, we hypothesise that this component is congruent with an N400, a component elicited by lexico-semantic manipulations in adults (Kutas and Federmeier, 2011). As is often the case in infants, the component’s latency is delayed and its topography more posterior relative to later ages (Friedrich and Friederici, 2005; Junge et al., 2021). A larger posterior negativity has been reported in infants as young as five months when they hear their own name compared to a stranger’s name (Parise et al., 2010), when a pseudo-word is consistently vs. inconsistently associated with an object (Friedrich and Friederici, 2011), and when they see unexpected vs. expected actions (Reid et al., 2009), suggesting that such a negativity might be related to semantic processing.

There is also evidence that infants extract and store possible word-forms (Jusczyk and Hohne, 1997). A stronger fMRI activation for forward than backward speech in the left angular gyrus of 3-month-olds has been interpreted as the activation of possible word-forms in a proto-lexicon for native-language sentences (Dehaene-Lambertz et al., 2002). As elegantly shown by Shukla et al. (2011), the chunks extracted from the speech stream are candidate words to which meanings can be attached: these authors showed that 6-month-olds spontaneously associate a non-word extracted from natural sentences with a visual object. Although infants used prosodic cues to extract the word in their experiment, and statistical cues here, a spontaneous bias to treat possible word-forms as referring to a meaning (see also Bergelson and Aslin, 2017) might trigger activation along a lexical pathway and explain the difference seen here between the two groups. Only speech chunks based on phonetic regularities, and not voice regularities, would be considered possible candidates.

A similar interpretation of an N400 induced by possible words, even without clear semantics, explains the observation of an N400 in adult participants listening to artificial languages. Sanders et al. (2002) observed an N400 in adults listening to an artificial language only when they had previously been exposed to the isolated pseudo-words. Other studies reported larger N400 amplitudes when adult participants listened to a structured stream compared to a random sequence of syllables (Cunillera et al., 2009, 2006), tones (Abla et al., 2008), or shapes (Abla and Okanoya, 2009). Our results show an N400 for both Words and Part-words in the post-learning phase, possibly related to a top-down effect induced by the familiarisation stream. Since computing ERPs during the streams raises inherent baseline issues, which become critical in young infants’ EEG recordings due to the large amplitude of slow waves at this age (Eisermann et al., 2013), we could not perform the ERP analysis during the stream as in the previous adult studies, preventing a direct comparison. However, the component we observed for duplets presented after the familiarisation streams might result from a related phenomenon. Learning the phonetic structure, but not the voice sequence, might induce a lexical search during the presentation of the isolated duplets, as happens during Structured vs. Random streams in the adult studies mentioned above (Abla et al., 2008; Abla and Okanoya, 2009; Cunillera et al., 2009, 2006). A lexical entry might also explain the more sustained activity during the familiarisation stream in the Phoneme group: the chunk might be encoded as a putative word in this admittedly rudimentary but already present lexical store, with the neural entrainment reflecting not only the TPs but also the retrieval of the “lexical” item.

Finally, we would like to point out that it is not natural for a word not to be produced by a single speaker, nor for speakers to stand in statistical relationships of the kind we used here. Neonates, who have little experience and therefore few (or no) expectations or constraints, probably reveal the possibilities opened up by statistical learning better than older participants. In fact, adults obtained better results for the phoneme structure than for the voice structure, perhaps because of an effective auditory normalisation process or the use of a written code for phonemes but not for voices. It is also possible that the difference between neonates and adults is related to the behavioural test being a more explicit measure of word recognition than the implicit task allowed by EEG recordings. In any case, the results show that even adults displayed some learning of the voice duplets.

Altogether, our results show that statistical learning operates similarly on different speech features, with no clear advantage for computing linguistically relevant regularities in speech. This supports the idea that statistical learning is a general and evolutionarily ancient learning mechanism, probably operating on common computational principles but within different neural networks with different chains of operations: phonetic regularities induce a supplementary component, not seen in the case of voice regularities, that we relate to a lexical N400. Understanding how statistical learning computations over linguistically relevant dimensions, such as the phonetic content of speech, are extracted and passed on to other networks might be fundamental to understanding what enables the infant brain to acquire language. Further work is needed to understand how the extraction of regularities over different features articulates with the language network.

Acknowledgements

We want to thank all the families who participated in the study. This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 695710).

Author contributions

A.F. and G.D.L. conceptualised the research; A.F., L.B. and M.P. performed the research; A.F. analysed the data; and A.F., L.B., and G.D.L. wrote the paper.

Declaration of interests

The authors declare no competing interests.

Methods

Participants

Participants were healthy full-term neonates with normal pregnancy and birth (gestational age (GA) > 38 weeks, Apgar scores ≥ 7/8 at 1/5 minutes, birth weight > 2.5 kg, cranial perimeter ≥ 33.0 cm), tested at the Port Royal Maternity (AP-HP) in Paris, France. Parents provided informed consent. The regional ethical committee for biomedical research (Comité de Protection des Personnes Region Centre Ouest 1, EudraCT/ID RCB: 2017-A00513-50) approved the protocol, and the study was carried out according to the relevant guidelines and regulations. Sixty-seven participants (34 in Experiment 1 and 33 in Experiment 2) who provided enough data without motion artefacts were included (Experiment 1: 19 females; 1 to 4 days old; mean GA: 39.3 weeks; mean weight: 3387 g; Experiment 2: 15 females; 1 to 4 days old; mean GA: 39.0 weeks; mean weight: 3363 g). Twelve other infants were excluded from the analyses (11 due to fussiness; 1 due to bad data quality).

Stimuli

The stimuli were synthesised using the MBROLA diphone database (Dutoit et al., 1996). Syllables had a consonant-vowel structure and lasted 250 ms (consonants 90 ms, vowels 160 ms). Six different syllables (ki, da, pe, tu, bo, ) and six different voices (fr3, fr1, fr7, fr2, it4, fr4) were used, resulting in a total of 36 syllable-voice combinations, henceforth tokens. The voices were female or male, with three different pitch levels (low, middle, and high) (Table S1). The 36 tokens were synthesised independently in MBROLA, their intensity was normalised, and the first and last 5 ms were ramped to zero to avoid clicks. The streams were synthesised by concatenating the tokens’ audio files and were ramped up and down over the first and last 5 s so that the start and end of the stream could not serve as perceptual anchors.

The Structured streams were created by concatenating the tokens such that they formed a semi-random concatenation of the duplets (i.e., pseudo-words) defined over one feature (syllable/voice), while the other feature (voice/syllable) varied semi-randomly. In other words, in Experiment 1, the order of the tokens was such that transitional probabilities (TPs) between syllables alternated between 1 (within duplets) and 0.5 (between duplets), while TPs between voices were uniformly 0.2. The design was orthogonal for the Structured streams of Experiment 2 (i.e., TPs between voices alternated between 1 and 0.5, while TPs between syllables were uniformly 0.2). The Random streams were created by semi-randomly concatenating the 36 tokens to achieve uniform TPs of 0.2 over both features. The semi-random concatenation implied that the same element could not appear twice in a row and that the same two elements could not alternate more than two times (i.e., the sequence XkXjXkXj, where Xk and Xj are two elements, was forbidden). Note that ‘element’ here refers to a duplet for the structured feature and to a single voice or syllable identity for the unstructured feature. The same statistical structures were used for both experiments, only changing the dimension over which the structure was applied. The learning stream lasted 120 seconds, with each duplet appearing 80 times. The ten short structured streams lasted 30 seconds each, with each duplet appearing 20 times per stream (200 times in total). The same Random stream was used for both experiments and lasted 120 seconds.
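A minimal MATLAB sketch of this ordering constraint follows; it uses rejection sampling and, as a simplification, does not enforce the exact 80-per-duplet appearance counts of the actual streams.

```matlab
% Draw a 240-element order (120 s at two duplets per second) in which no
% element repeats and the pattern XkXjXkXj never occurs.
seq = randi(3);
while numel(seq) < 240
    x = randi(3);
    if isValidNext(seq, x)
        seq(end+1) = x; %#ok<SAGROW>
    end
end

function ok = isValidNext(seq, x)
% Reject immediate repetitions (.. Xk Xk) and two full alternations
% (Xk Xj Xk followed by Xj).
ok = ~(seq(end) == x) && ...
     ~(numel(seq) >= 3 && seq(end-2) == seq(end) && seq(end-1) == x);
end
```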

In Experiment 1, the duplets were created so that no specific phonetic feature could facilitate stream segmentation. In each experiment, two different structured streams (lists A and B) were used, differing in how the syllables/voices were combined to form the duplets (Table S2). Crucially, the Words/duplets of list A were the Part-words of list B and vice versa; any difference between these two conditions can thus not be caused by acoustic differences. Participants were randomly assigned to and balanced across lists and experiments.

The test items were duplets formed by concatenating two tokens such that they constituted a Word or a Part-word according to the structured feature.

Procedure and data acquisition

Scalp electrophysiological activity was recorded using a 128-electrode net (Electrical Geodesics, Inc.) referenced to the vertex, with a sampling frequency of 250 Hz. Neonates were tested in a soundproof booth while sleeping or during quiet rest. The study involved (1) 120 s of the Random stream, (2) 120 s of a Structured stream, and (3) ten series of 30 s Structured streams, each followed by 18 test duplets (SOA 2-2.3 s).

Data pre-processing

Data were band-pass filtered between 0.1 and 40 Hz and pre-processed using custom MATLAB scripts based on the EEGLAB toolbox 2021.0, following the APICE pre-processing pipeline to recover as much artefact-free data as possible (Fló et al., 2022).

Neural entrainment

The pre-processed data were further high-pass filtered at 0.2 Hz. The data were then segmented from the beginning of each phase into 0.5 s long segments (240 duplets for the Random stream, 240 duplets for the long Structured stream, and 600 duplets for the short Structured streams). Segments containing samples flagged as bad data in more than 30% of the channels were rejected, and the remaining channels with artefacts were spatially interpolated.

Neural entrainment per condition

The 0.5 s epochs belonging to the same condition were reshaped into non-overlapping epochs (Benjamin et al., 2021) of 7.5 s (15 duplets, 30 syllables), retaining the chronological order and thus the timing of the steady-state response. Subjects who did not provide at least 50% artefact-free epochs for each condition (at least 8 long epochs during the Random stream and 28 during the Structured streams) were excluded from the entrainment analysis (32 subjects included in Experiment 1 and 32 in Experiment 2). The retained subjects for Experiment 1 provided on average 13.59 epochs for the Random condition (SD 2.07, range [8, 16]) and 48.16 for the Structured conditions (SD 5.89, range [33, 55]). The retained subjects for Experiment 2 provided on average 13.78 epochs for the Random condition (SD 1.93, range [8, 16]) and 46.88 for the Structured conditions (SD 5.62, range [36, 55]). After data rejection, data were re-referenced to the average and normalised by dividing by the standard deviation within an epoch across electrodes and time. Next, data were converted to the frequency domain using the Fast Fourier Transform (FFT), and the ITC was estimated for each electrode during each condition (Random, Structured) as ITC(f) = |(1/N) Σi exp(jφ(f,i))|, where N is the number of trials and φ(f,i) is the phase at frequency f in trial i. The ITC ranges from 0 to 1 (from completely desynchronised to perfectly phase-locked activity). Since we aimed to detect an increase in signal synchronisation at specific frequencies, the SNR was computed relative to the twelve adjacent frequency bins (six on each side, corresponding to 0.8 Hz) (Kabdebon et al., 2022). This procedure also corrects for differences in ITC due to different numbers of trials. Specifically, the SNR was SNR(f) = (ITC(f) - mean(ITCnoise(f))) / std(ITCnoise(f)), where ITCnoise(f) is the ITC over the adjacent frequency bins. For statistical analysis, we compared the SNR at the syllable rate (4 Hz) and the duplet rate (2 Hz) against the average SNR over the 12 adjacent frequency bins using a one-tailed paired permutation test (5000 permutations). We also directly compared the entrainment during the two conditions to identify the electrodes showing greater entrainment during the Structured than the Random streams. We evaluated the interaction between stream type (Random vs. Structured) and familiarisation type (structure over phonemes or voices) by comparing the difference in entrainment between Structured and Random streams across the two experiments, using a two-sided unpaired permutation test (5000 permutations). All p-values were FDR-corrected across electrodes.
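As an illustration of these two formulas, here is a minimal MATLAB sketch run on synthetic data standing in for the 7.5 s epochs; at 250 Hz an epoch has 1875 samples, giving a 0.133 Hz resolution, so six bins per side indeed span 0.8 Hz. The variable names are ours, not those of the analysis scripts.

```matlab
% Synthetic epochs with a weak phase-locked 2 Hz component.
fs = 250; nCh = 128; nEp = 40; nSamp = 7.5 * fs;
t = (0:nSamp-1) / fs;
data  = randn(nCh, nSamp, nEp) + 0.3 * sin(2*pi*2*t);
freqs = (0:nSamp-1) * fs / nSamp;

% ITC(f) = |(1/N) sum_i exp(j*phi(f,i))|, per electrode and frequency.
X   = fft(data, [], 2);                        % FFT along time
ITC = abs(mean(exp(1i * angle(X)), 3));        % [nCh x nFreq], in [0, 1]

% SNR at the duplet rate relative to the 12 adjacent bins (6 per side).
[~, f0] = min(abs(freqs - 2));
noise = ITC(:, [f0-6:f0-1, f0+1:f0+6]);
SNR = (ITC(:, f0) - mean(noise, 2)) ./ std(noise, 0, 2);
```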

Neural entrainment time course

The 0.5 s epochs were concatenated chronologically (2 minutes of the Random stream, 2 minutes of the long Structured stream, and 5 minutes of short Structured blocks). The same analysis as above was performed in sliding time windows of 2 minutes with a 1 s step. A time window was considered valid if at least 8 of its 16 epochs were free of motion artefacts. Missing values due to motion artefacts were linearly interpolated. The entrainment time course at the syllable rate was then computed as the average over the electrodes showing significant entrainment at 4 Hz during the Random condition, and at the duplet rate as the average over the electrodes showing significant entrainment at 2 Hz during the Structured condition. Finally, the time course was smoothed over a 30 s window.
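A sketch of this sliding-window procedure, assuming segs holds the chronologically ordered 0.5 s segments ([nChannels × 125 × nSegments] at 250 Hz) and itc2hz is a hypothetical helper applying the ITC/SNR computation above to a set of 7.5 s epochs:

```matlab
segPerWin = 240;                               % 2 min of 0.5 s segments
step = 2;                                      % 1 s step
nWin = floor((size(segs,3) - segPerWin) / step) + 1;
tc = nan(1, nWin);
for w = 1:nWin
    idx = (w-1)*step + (1:segPerWin);
    win = segs(:, :, idx);                     % one 2-min window
    ep  = reshape(win, size(segs,1), 125*15, 16);  % sixteen 7.5 s epochs
    tc(w) = itc2hz(ep);                        % hypothetical helper
end
tc = movmean(tc, 30);                          % smooth over 30 s (1 s spacing)
```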

To investigate the increase in neural activity locked to the regularity during the long familiarisation, we fitted an LMM for each group of subjects. We included time as a fixed effect and random slopes and intercepts for individual subjects: ITC ∼ −1 + time + (1 + time|subject). We then compared the time effect between groups by including all data in a single LMM with time and group as fixed effects: ITC ∼ −1 + time × group + (1 + time|subject).
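With the Statistics and Machine Learning Toolbox, models of this form can be fitted with fitlme; the sketch below uses synthetic data in place of the per-window 2 Hz SNR values, and the combined-model formula reflects our reading of the group term.

```matlab
% Synthetic stand-in: 34 subjects, 120 one-second windows each, with a
% positive mean slope and subject-specific random slopes.
nSubj = 34; nT = 120;
[s, t] = ndgrid(1:nSubj, 1:nT);
slope = 0.004 + 0.002 * randn(nSubj, 1);
itc = slope(s(:)) .* t(:) + 0.5 * randn(nSubj*nT, 1);
tbl = table(categorical(s(:)), t(:), itc, ...
            'VariableNames', {'subject', 'time', 'ITC'});

% Per-group model, as in the text:
lme = fitlme(tbl, 'ITC ~ -1 + time + (1 + time | subject)');
disp(lme.Coefficients)                 % estimate, CI and p-value for time

% Combined model across groups (group being a categorical column):
% fitlme(tblAll, 'ITC ~ -1 + time*group + (1 + time | subject)')
```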

ERPs to test words

The pre-processed data were filtered between 0.2 and 20 Hz and epoched between [-0.2, 2.0] s from the onset of the duplets. Epochs containing samples identified as artefacts by the APICE procedure were rejected. Subjects who did not provide at least half of the trials (45) per condition were excluded (34 subjects kept in Experiment 1 and 33 in Experiment 2). No subject was excluded on this criterion in the Phoneme group, and one subject was excluded in the Voice group. For Experiment 1, we retained on average 77.47 trials (SD 9.98, range [52, 89]) for the Word condition and 77.12 trials (SD 10.04, range [56, 89]) for the Part-word condition. For Experiment 2, we retained on average 73.73 trials (SD 10.57, range [47, 90]) for the Word condition and 74.18 trials (SD 11.15, range [46, 90]) for the Part-word condition. Data were re-referenced to the average and normalised within each epoch by dividing by the standard deviation across electrodes and time.

Since the grand-average response across both groups and conditions returned to the pre-stimulus level at around 1500 ms, we defined [0, 1500] ms as the analysis time window. We first analysed the data using a non-parametric cluster-based permutation analysis (Oostenveld et al., 2011) in the [0, 1500] ms time window (cluster-forming alpha 0.10, neighbour distance ≤ 2.5 cm, minimum cluster size 3, 5,000 permutations).
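The paper cites FieldTrip (Oostenveld et al., 2011) for this analysis; a configuration along the following lines would implement the reported parameters. The mapping of "minimum cluster size 3" onto cfg.minnbchan and the per-subject timelock structures wordERP/partwordERP (with electrode positions attached) are our assumptions, not the authors' scripts.

```matlab
% Neighbourhood definition from electrode distances (<= 2.5 cm).
cfgN = [];
cfgN.method        = 'distance';
cfgN.neighbourdist = 2.5;                      % same units as the montage
neighbours = ft_prepare_neighbours(cfgN, wordERP{1});

% Word vs Part-word, within-subject cluster-based permutation test.
cfg = [];
cfg.method           = 'montecarlo';
cfg.statistic        = 'ft_statfun_depsamplesT';
cfg.correctm         = 'cluster';
cfg.clusteralpha     = 0.10;                   % cluster-forming threshold
cfg.minnbchan        = 3;                      % assumed minimum cluster size
cfg.numrandomization = 5000;
cfg.latency          = [0 1.5];
cfg.neighbours       = neighbours;
nSubj = numel(wordERP);
cfg.design = [1:nSubj, 1:nSubj; ones(1, nSubj), 2*ones(1, nSubj)];
cfg.uvar = 1;                                  % unit (subject) variable
cfg.ivar = 2;                                  % condition variable
stat = ft_timelockstatistics(cfg, wordERP{:}, partwordERP{:});
```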

We also analysed the data in seven ROIs to ensure that no effects were missed by the cluster-based permutation analysis. By inspecting the grand-average ERP across both experiments and conditions, we identified three characteristic topographies: (a) positivity over central electrodes, (b) positivity over frontal electrodes and negativity over occipital electrodes, and (c) positivity over prefrontal electrodes and negativity over temporal electrodes (Figure S1). We then defined seven symmetric ROIs: Central, Frontal Left, Frontal Right, Occipital, Prefrontal, Temporal Left, and Temporal Right. We evaluated the main effect of test-word type by comparing the ERPs to Words and Part-words (paired t-test) and the main effect of familiarisation type by comparing the ERPs between Experiment 1 (structure over phonemes) and Experiment 2 (structure over voices) (unpaired t-test). All p-values were FDR-corrected by the number of time points (n = 376) and ROIs (n = 7). To test for possible interaction effects, we compared the Word - Part-word difference between the two groups. To verify that the main effects were not driven by one condition or group, we averaged the activity within each time window where a main effect was identified, considering both the test-word type and familiarisation type factors, and ran a two-way ANOVA (test-word type × familiarisation type). Results are presented in SI (Figure S3).
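For reference, a minimal Benjamini-Hochberg sketch of the FDR correction applied to the pooled time-point × ROI p-values; the function name and the stand-in p-values are ours.

```matlab
p = rand(376, 7);                    % stand-in p-values (time points x ROIs)
h = fdrBH(p, 0.05);                  % logical map of tests surviving FDR

function h = fdrBH(p, q)
% Benjamini-Hochberg: h(i) is true where p(i) survives FDR at level q.
[ps, order] = sort(p(:));
m = numel(ps);
k = find(ps <= (1:m)' * q / m, 1, 'last');     % largest i: p_(i) <= i*q/m
h = false(size(p));
if ~isempty(k), h(order(1:k)) = true; end
end
```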

Adults’ behavioural experiment

Fifty-seven French-speaking adults were tested in an online experiment analogous to the infant study, run on the Prolific platform. All participants provided informed consent and received monetary compensation for their participation. The study was approved by the Ethical Research Committee of Paris Saclay University under the reference CER-Paris-Saclay-2019-063. The same stimuli as in the infant experiment were used. Participants first heard 2 min of familiarisation with the Structured stream. Then, they completed ten sessions of re-familiarisation and testing. Each re-familiarisation lasted 30 s, and in each test session all 18 test words were presented. The structure could be over either the phonetic or the voice content, and two lists were used (see Table S2). Participants were randomly assigned to one of the groups and to one list. The Phoneme group included 27 participants, and the Voice group 30. Before starting the experiment, subjects were instructed to pay attention to an invented language because they would later have to judge whether different sequences adhered to the structure of the language. During the test phase, subjects rated their familiarity with each test word by clicking on a scale from 1 to 6.