Abstract
Novelty is a double-edged sword for agents and animals alike: they might benefit from untapped resources or face unexpected costs or dangers such as predation. The conventional exploration/exploitation tradeoff is thus coloured by risk-sensitivity. A wealth of experiments has shown how animals solve this dilemma, for example using intermittent approach. However, there are large individual differences in the nature of approach, and modeling has yet to elucidate how this might be based on animals’ differing prior expectations about reward and threat, and differing degrees of risk aversion. To capture these factors, we built a Bayes adaptive Markov decision process model with three key components: an adaptive hazard function capturing potential predation, an intrinsic reward function providing the urge to explore, and a conditional value at risk (CVaR) objective, which is a contemporary measure of trait risk-sensitivity. We fit this model to a coarse-grain abstraction of the behaviour of 26 animals who freely explored a novel object in an open-field arena (Akiti et al. Neuron 110, 2022). We show that the model captures both quantitative (frequency, duration of exploratory bouts) and qualitative (stereotyped tail-behind) features of behavior, including the substantial idiosyncrasies that were observed. We find that “brave” animals, though varied in their behavior, are generally more risk neutral, and enjoy a flexible hazard prior. They begin with cautious exploration, and quickly transition to confident approach to maximize exploration for reward. On the other hand, “timid” animals, characterized by risk aversion and high and inflexible hazard priors, display self-censoring that leads to the sort of asymptotic maladaptive behavior that is often associated with psychiatric illnesses such as anxiety and depression. Explaining risk-sensitive exploration using factorized parameters of reinforcement learning models could aid in the understanding, diagnosis, and treatment of psychiatric abnormalities in humans and other animals.
1 Introduction
In naturalistic environments, novelty can be a source of both reward and danger. Despite these duelling aspects, investigations of novelty in reinforcement learning (RL) have mostly focused on neophilia, driven by optimism in the face of uncertainty, and thus on information-seeking (Duff, 2002a; Dayan and Sejnowski, 1996; Gottlieb et al., 2013; Wilson et al., 2014). Neophobia has attracted fewer computational studies, apart from some interesting evolutionary analyses (Greggor et al., 2015).
Excessive novelty seeking and excessive novelty avoidance can both be maladaptive – they are flip sides of a disturbed balance. Here, we seek to examine potential sources of such disturbances, for instance, in distorted priors about the magnitude or probabilities of rewards (which have been linked to mania; Radulescu and Niv, 2019; Bennett and Niv, 2020; Eldar et al., 2016) or threats (linked to anxiety and depression; Bishop and Gagne, 2018; Paulus and Yu, 2012), or in extreme risk attitudes (Gagne and Dayan, 2022).
To do this, we take advantage of a recent study by Akiti et al. (2022) on the behaviour of mice exploring a familiar open-field arena after the introduction of a novel object near to one corner. The mice could move freely and interact with the object at will. Akiti et al. (2022) performed detailed analyses of how individual animals’ trajectories reflected the novel object, including using DeepLabCut (Mathis et al., 2018) to track the orientation of the mice relative to the object and MoSeq (Wiltschko et al., 2020) to extract behavioural ‘syllables’ whose prevalence was affected by it. The animals differed markedly in how they approached the object, and in what pattern. For the former, Akiti et al. (2022) observed two characteristic positionings of the animals when near to the object: ‘tail-behind’ and ‘tail-exposed’, associated respectively with cautious risk-assessment and engagement. For the latter, there was substantial heterogeneity along a spectrum of timidity, with all animals initially performing tail-behind approach, but some taking much longer (or failing altogether) to transition to tail-exposed approach.
Akiti et al. (2022) provide a model-free account of their data, focusing on the prediction of threat and its realization in the tail of the striatum. In contrast, we provide a model-based account, focusing on the rich details of the dynamics of approach carefully characterized by Akiti et al. (2022). These include intermittency (i.e., why animals retreat from the object), approach drive (why animals approach in the first place), the significant long-run approach of timid animals despite having reached the “avoid” state, and how the intensity of approach increases when brave animals transition from risk-assessment to engagement and then decreases in the long run of the “engagement” phase. Our model also provides an alternative explanation for why animals learn to avoid the novel object in a completely benign environment. Through modeling these additional statistics and behaviors, we reveal the multidimensional nature of timidity in exploration, which cannot be captured just in terms of time spent at the object.
We model an abstract depiction of the behaviour of individual mice by combining the Bayes-adaptive Markov Decision Process (BAMDP) treatment of rational exploration (Dearden et al., 2013; Duff, 2002a; Guez et al., 2013) with two sources of risk-sensitivity: the prior over the potential hazard associated with the object, and the conditional value at risk (CVaR) probability distortion mechanism (Artzner et al., 1999; Chow et al., 2015; Gagne and Dayan, 2022; Bellemare et al., 2023).
In a BAMDP, the agent maintains a belief about the possible rewards, costs and transitions in the environment, and decides upon optimal actions based on these beliefs. Since the agent can optionally reuse or abandon incompletely known actions based on what it discovers about them, these actions traditionally enjoy an exploration bonus or “value of information”, which generalizes the famous Gittins indices (Gittins, 1979; Weber, 1992). In addition to beliefs about reward, the agent also maintains a belief about potential hazard which is the first source of risk-sensitivity. These beliefs are initialized as prior expectations about the environment; and so are readily subject to individual differences.
In addition to beliefs about hazards which may be specific to a particular environment, we include a second source of trait risk-sensitivity. We consider optimizing the CVaR, in which agents concentrate on the average value within lower (risk-averse) or upper (risk-seeking) quantiles of the distribution of potential outcomes (Rigter et al., 2021). In the context of a BAMDP, this can force agents to pay particular attention to hazards. More extreme quantiles are associated with more extreme risk-sensitivity; and again are a potential locus of individual differences (as examined in regular Markov decision processes in the context of anxiety disorders in Gagne and Dayan, 2022).
Here, we present a behavioral model of risk sensitive exploration. Our agent computes optimal actions using the BAMDP framework under the CVaR objective. This model provides a normative explanation of individual variability – the agent makes decisions by trading off potential reward and threat in a principled way. Different priors and risk sensitivities lead to different exploratory schedules, from timid (indicative of neophobia) to brave. The model captures differences in duration, frequency, and type of approach (risk-assessment versus engagement) across animals, and through time. We report features of the different behavioural trajectories the model is able to capture, providing mechanistic insight into how the trade-off between potential reward and threat leads to rational exploratory schedules. Behavioral phenotypes emerge from the interaction of the separate computational mechanisms elucidated by our model-based treatment. This paves the way for future experimental investigations of these mechanisms, including the unexpected non-identifiability of our two sources of risk-sensitivity: hazard priors and CVaR.
2 Results
2.1 Behavior Phases and Animal Groups
Our goal is to provide a computational account of the exploratory behavior of individual mice under the assumption that they have different prior expectations and risk sensitivities. We start from Akiti et al. (2022)’s observation that the animal approaches and remains within a threshold distance (determined by them to be 7 cm) of the object in “bouts” which can be characterized as “cautious” or tail-behind (if the animal’s nose lies between the object and tail) or otherwise “confident” or tail-exposed. We sought to capture both these qualitative differences (cautious versus confident) and aspects of the quantitative changes in bout durations and frequencies as the animals learn about their environment.
In order to focus narrowly on interaction with the object, we abstracted away the details of the spatial interaction with the object, instead fitting boxcar functions to the percentages of time g_cau(t), g_con(t) that the animal spends in cautious and confident bouts around time t in the apparatus. We can then encompass the behaviour of most animals via four coarse phases of behaviour that arise from two binary factors: whether the animal is mainly performing cautious or confident approaches, and whether bouts happen frequently, at a peak rate, or at a lower, steady-state rate. The time an animal spends near the object in one of these phases reflects the product of how frequently it visits the object, and how long it stays per visit. We average these two factors within each phase.
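As an illustration, the following minimal Python sketch fits such a boxcar abstraction with a single change point by exhaustive search; the toy series, threshold, and noise level are ours, and the original analysis pipeline is not reproduced here:

```python
import numpy as np

def fit_boxcar(g, t_min=5):
    """Fit a one-change-point piecewise-constant (boxcar) function to an
    occupancy time series g(t) by exhaustive search; returns the change
    point and the levels before and after it."""
    g = np.asarray(g, dtype=float)
    best = None
    for t1 in range(t_min, len(g) - t_min):
        pre, post = g[:t1].mean(), g[t1:].mean()
        sse = ((g[:t1] - pre) ** 2).sum() + ((g[t1:] - post) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t1, pre, post)
    return best[1:]

# toy series: cautious approach time drops after minute 40
rng = np.random.default_rng(0)
g_cau = np.r_[np.full(40, 0.25), np.full(60, 0.05)] + 0.02 * rng.standard_normal(100)
print(fit_boxcar(g_cau))  # change point near t = 40
```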
Consider the behaviour of the animal in Fig 1a. Here, g_cau(t) (top graph) makes a transition from an initial level (during the “cautious” phase) to a final steady-state level (which we simplify as being at a transition point t = t1). At the same timepoint, g_con(t) (second row) makes a transition from 0 to a peak level of confident approach (defining the “peak confident” phase). Finally, there is another transition at time t2 from peak to a steady-state confident approach time (in the “steady-state confident” phase). The lower two rows of Fig 1a show the duration of the bouts in the relevant phases, and the frequency per unit time of such bouts. The upper panel of Fig 1b shows the same data in a more convenient manner. The colours in the top row indicate the type of approach (green is cautious; blue is confident). The second and third rows indicate the duration and frequency of approach. Darker colours represent higher values.
The orange coloured lines in Fig 1a and the lower panel in Fig 1b render the abstracted behaviour of this animal in an integrated form, showing how we generate “phase-level” statistics from minute-to-minute statistics. Averaging statistics over phases ignores idiosyncrasies of behavior and allows us to fit the high-level statistics of behavior: phase-transition times, and phase-averaged durations and frequencies. We consider animal 25 to be a “brave” animal because of its transition to peak and then steady-state confident approach. There were 12 brave mice out of the 26 in total.
Fig 1c shows an example of another characteristic “intermediate” animal. This animal makes a transition from cautious to confident approach (where both duration and frequency of visits can change), but the approach time during the confident phase does not decrease. Hence, intermediate animals do not have a transition from peak to steady-state confident phase. There were 5 such intermediate mice.
Fig 1d shows the behaviour of an example of the last class of “timid” animals. This animal never makes a transition to confident approach. Hence, for it, g_con(t) = 0. However, the cautious approach time makes a transition to a non-zero steady state, often via a change in frequency, defining the fourth phase (“steady-state cautious”). There were 9 such timid mice.
Fig 2 summarizes our categorization of the animals into the three groups: brave, intermediate, and timid, based on the phases identified in the animals’ exploratory trajectories. Timid animals spend no time in confident approach. Brave animals differ from intermediate animals in that their approach time during the first ten minutes of the confident phase is greater than during the last ten minutes (steady-state phase).
2.2 A Bayes-adaptive Model-based Model for Exploration and Timidity
2.2.1 State description
We use a model-based Bayes-adaptive reinforcement learning model (BAMDP) to provide a mechanistic account of the behavior of the mice under threat of predation. This extends the model-free description of threat in Akiti et al. (2022) by constructing various mechanisms to explain additional facets of the dynamics of the behavior.
Underlying the BAMDP is a standard multi-step decision-making problem of the sort that is the focus of a huge wealth of studies (Russell and Norvig, 2016). We cartoon the problem with the four real and four counterfactual states shown in Fig 3. The nest is a place of safety (modelling all places in the environment away from the object, and ignoring, for instance, the change to thigmotactic behaviour that the mice exhibit when the object is introduced). The animal can choose to stay at the nest (possibly for multiple steps) or choose to make a cautious or confident approach.
At an approach state, the modelled agent can either stay, or return to the nest via the retreat state; the latter happens anyhow after four steps. The animal also imagines the (in reality, counter-factual) possibility of being detected by a potential predator. It can then either manage to escape back to the nest, or alternatively expire. We parameterize costs associated with the various movements; and also the probability of unsuccessful escape starting from confident (p1) or cautious (p2 < p1) approach.
We describe the dilemma between cautious and confident approach as a calculation of the risk and reward trade-off between the two types of approaches. Cautious approach (the “cautious object” state) has a lower (informational) reward (e.g. because in the cautious state the animal spends more cognitive effort monitoring for lurking predators rather than exploring the object). However, cautious approach leads to a lower probability of expiring if detected than does confident approach (the “confident object” state) (e.g. because in the cautious state the animal is better poised to escape). Risk aversion modulates the agent’s choice of approach type.
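To make this cartoon concrete, the following sketch encodes its states and actions as plain data structures; the numerical values (p1, p2, the bout cap) are illustrative placeholders rather than fitted quantities:

```python
# A compact encoding of the cartoon decision problem of Fig 3. State names
# follow the text; probabilities are illustrative placeholders.
STATES = ["nest", "cautious_object", "confident_object", "retreat",
          "detected", "escaped", "expired"]   # last three are counterfactual
ACTIONS = {
    "nest": ["stay", "approach_cautious", "approach_confident"],
    "cautious_object": ["stay", "retreat"],
    "confident_object": ["stay", "retreat"],
}
P_EXPIRE_IF_DETECTED = {
    "confident_object": 0.5,   # p1: engaged, poorly poised to escape
    "cautious_object": 0.3,    # p2 < p1: risk-assessment, better poised
}
MAX_TURNS_AT_OBJECT = 4        # retreat happens anyhow after four steps
```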
The next sections describe the components of the BAMDP model: a characterization of the time-dependent risk of predation, an informational reward for exploration, and a method for handling risk sensitivity. Finally, we will discuss the way we fitted individual mice, and present a full analysis of their behaviour. We report on recovery simulations in the supplement.
2.2.2 Modeling Threat with a Bayesian, Generalizing Hazard Function
Whilst exploring the novel object in the “object” state, the decision problem allows for the possibility of detection, and then attack, by a predator whose appearance is governed by a temporal hazard function (see Fig 4).
Formally, the probability of detection given either cautious or confident approach is modelled using the hazard function hτ, where τ is the number of steps the animal has so far spent at the object in the current bout. In a key simplification, this probability resets back to baseline upon a return to the nest. We treat the hazard function as being learned in a Bayesian manner, from the experience (in this case, of not being detected). We assume that the animal has the inductive bias that the hazard function is increasing over time, reflecting a potential predator’s evidence accumulation process about the prey. Therefore, we derive it from a succession of independent
Beta-distributed random variables θ1 = 0; θτ ∼ Beta(μτ, στ), τ > 1 as:

hτ = 1 − ∏_{j ≤ τ} (1 − θj),   (1)

or, recursively,

hτ = hτ−1 + (1 − hτ−1) θτ,   (2)

rather as in what is known as a stick-breaking process. Note that, for convenience, we parameterize the Beta distribution in terms of its mean μ and standard deviation σ rather than its pseudocounts, as is perhaps more common.
Eq 2 shows that the hazard function is always increasing. As we will see, the duration of bouts at the object depends on the (discrete) slope of the hazard function, with steep hazard functions leading to short bouts. In our model, the agent can stay at the object 2, 3 or 4 turns (we take θ1 = 0 as a way of coding actual approach). [We therefore sometimes refer to cautious−k or confident−k bouts, in which the model animal spends k = {2, 3, 4} steps at the object.] Hence the collection of random variables, hτ, is derived from six parameters (the mean μτ and the standard deviation στ of the Beta distribution for each turn τ = 2, 3, 4). These start at initial prior values, and are subject to an update from experience. Here, that experience is exclusively negative, since there is no actual predator; this implies that the update has a simple, closed form (see Methods). The animals’ initial ignorance, which is mitigated by learning, makes the problem a BAMDP, whose solution is a risk-averse itinerant policy.
A particular characteristic of the noisy-or hazard function of Eq 1 is that the derived bout duration increases progressively. This is because not being detected at τ = 2, say, provides information that θ2 is small, and so reduces the hazard function for longer bouts τ > 2.
Fig 4 shows the fitted priors of a brave (top) and timid (bottom) animal, as well as the posteriors after ten exploratory bouts. The brave animal starts with a high variance prior. This flexibility allows it to transition from short, cautious bouts (duration τ = 2) to longer confident bouts (duration τ = 3, 4), reducing the hazard function to near zero. The timid animal has a low variance prior, and does not stay long enough at the object to build sufficient confidence (only performing duration τ = 2 bouts). As a result, its posterior hazard function remains similar to its prior.
2.2.3 Modeling the Motivation to Approach
We model the mouse’s drive to approach the object as stemming from its belief that the object might be rewarding. In a fully Bayesian treatment, the agent would maintain a posterior over the possibility of rewards and would enjoy a conventional, informational, Bayes-adaptive exploration bonus encouraging it to approach the object. However, this would add substantial computational complexity. Thus, instead, we use a simple, heuristic, exploration bonus G(t) (Kakade and Dayan, 2002). The model mouse moves from the “nest” state to the “object” state when this exploration bonus exceeds the costs implied by the risk of being attacked.
We characterize the exploration bonus as coming from an initial ‘pool’ G0 that becomes depleted when the animal is at the object, as it experiences a lack of reward, but is replenished at a steady rate f when the animal is at the nest, through forgetting or potential change. We model the animal as harvesting this exploration bonus pool more quickly under confident than cautious approaches, for instance since it can pay more attention to the object (an issue captured in more explicit detail in the context of foraging by Lloyd and Dayan (2018)). This underpins the transition between the two types of approach for non-timid animals. In simulations, when G(t) is high, the agent has a high motivation to explore the object. The depletion from G0 substantially influences the time point at which approach makes a transition from peak to steady-state; the steady-state time then depends on the dynamics of depletion (when at the object) and replenishment (when at the nest).
Finally, the animal is also motivated to approach by informational reward from the hazard function (which can be exploited to collect more future reward) – according to a standard Bayes-adaptive bonus mechanism (Duff, 2002a).
2.2.4 Conditional Value at Risk Sensitivity
Along with varying degrees of pessimism in their prior over the hazard function, the mice could have different degrees of risk sensitivity in the aspect of the return that they seek to optimize. There are various ways in which the mice might be risk sensitive. Following Gagne and Dayan (2022), we consider a form called nested conditional value at risk (nCVaR). In general, CVaRα, for risk sensitivity 0 ≤ α ≤ 1, measures the expected value in the lower α quantile of returns – thus over-weighting the worse outcomes. The lower α, the more extreme the risk-aversion; with α = 1 being associated with the conventional, risk-neutral, expected value of the return. Section 4.2 details the optimization procedure concerned – it operates by upweighting the probabilities of outcomes with low returns – which come here from detection and expiration. Thus, when α is low, confident and longer bouts are costly, inducing shorter, cautious ones. nCVaRα affects behavior in a similar manner to pessimistic hazard priors, except that nCVaRα acts on both the aleatoric uncertainty of expiring and epistemic uncertainty of detection, while priors only affect the latter. As we will see, despite this difference, we were not able to differentiate pessimistic priors from risk sensitivity using the data in (Akiti et al., 2022).
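As a concrete illustration of CVaRα for a discrete return distribution (the bout returns and probabilities below are hypothetical, not taken from the model fits):

```python
import numpy as np

def cvar(values, probs, alpha):
    """Expected value within the lower alpha-quantile of a discrete return
    distribution; alpha = 1 recovers the risk-neutral expectation."""
    v = np.asarray(values, float)
    p = np.asarray(probs, float)
    order = np.argsort(v)                  # worst outcomes first
    v, p = v[order], p[order]
    cum = np.cumsum(p)
    # probability mass of each outcome that falls inside the alpha tail
    mass = np.clip(cum, None, alpha) - np.clip(cum - p, None, alpha)
    return float((v * mass).sum() / alpha)

# hypothetical bout returns: expire, find nothing, find reward
returns, probs = [-10.0, 0.0, 1.0], [0.05, 0.75, 0.20]
print(cvar(returns, probs, 1.0))   # -0.3: risk-neutral expectation
print(cvar(returns, probs, 0.1))   # -5.0: dominated by the disaster outcome
```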
2.2.5 Model Fitting
The output of each simulation is a sequence of states which we use to derive summary statistics that can be compared directly with our abstraction of the behavior of a mouse (as in figure 1). This requires us to model transition points in this behavior, and the times involved in each state.
In the model, the transition point from cautious to confident approach happens when the agent first ventures a confident approach; this switch is rarely reversed. Peak to steady-state transition points occur when the model mouse decreases its frequency of bouts, which tends to happen abruptly in the model. We fit the transition points in mouse data by mapping the length of a step in the model to wall-clock time. As in the abstraction of the experimental data, we average the duration (number of turns at the object) and frequency statistics in each phase. We characterize the relative frequencies of the bouts across phase transitions. Frequency mainly governs the total time at or away from the object and is formally defined as the inverse of the number of steps the model spends at the object and the nest.
We use a form of Approximate Bayesian Computation Sequential Monte Carlo (ABCSMC; Toni et al., 2009) to fit the elements of our abstraction of the approach behaviour of the mice (section 2.1), namely change points, peak and steady-state durations, as well as relative frequencies of bouts. See the Methods section 4.5 for details on the fitted statistics. At the core of ABCSMC is the ability to simulate the behaviour of model mice for given parameters. We do this by solving the underlying BAMDP problem approximately using receding-horizon tree search with a maximum depth of 5 steps (which covers the longest allowable bout, defined as a subsequence of states where the model mouse goes from the nest to the object and back to the nest).
The full set of parameters includes 6 for the prior over the hazard function (given that we limit to four the number of time steps the model mouse can stay at the object), the risk sensitivity parameter α for CVaRα, the initial reward pool G0 and the forgetting rate f.
2.2.6 A Spectrum of Risk-Sensitive Exploration Trajectories
Fig 5 shows model fits on the 26 mice from Akiti et al. (2022). The animal ranking is sorted first by animal group, and second by total time spent near the object. We call this ranking the group-timidity animal index – it slightly differs from the timidity index used in Akiti et al. (2022) which is only based on total time spent near the object. The model captures many details of the data across the entire spectrum of courage to timidity, explaining the behavior mechanistically. Differing schedules of exploration emerge because of the battle between learning about threat and reward.
All animals initially assess risk with cautious approach, since potential predation significantly outweighs potential rewards. Brave animals assess risk with either short (length 2) or medium (length 3) bouts, depending on the hazard priors (Fig 6a and b versus c and d). If E[h3] is high, then the animal performs cautious length 2 bouts; otherwise, it performs cautious length 3 bouts. With more bout experience, the posterior hazard function becomes more optimistic (since there is no actual predator to observe; Fig 4), empowering the animal to take on more risk by staying even longer at the object and performing confident approach. Animals with low E[h4] perform the longest, confident, length 4 bouts instead of length 3 bouts (Fig 6a and c versus b and d). How long brave animals spend assessing risk depends on the hazard priors and the risk sensitivity nCVaRα.
Fig 7 shows that the fitted hazard priors and nCVaRα relate to the group-timidity animal index. Brave animals are fitted with higher nCVaRα and a low slope and high variance (flexibility) hazard prior. In other words, the model brave mouse believes that the hazard probability for long bouts is low in its environment. Timid animals are fitted by lower nCVaRα and a higher slope, inflexible hazard prior. The parameters for intermediate animals lie between those for brave and timid animals.
G0 determines how much time brave animals spend in the peak-confident exploration phase, or the peak to steady-state change point. Animals with larger G0 tend to have high bout frequencies for a longer period (see Fig 8). Finally, how often brave animals revisit the object, which is related to the relative steady-state frequency, is determined by the forgetting rate.
Timid animals have short bouts and continue to assess risk with cautious approach in the steady-state. Fig 7 shows that their hazard priors are inflexible (low variance), with a high slope, and that they have low nCVaRα. The priors are slow to update, and risk sensitivity causes timid agents to overestimate the probability of bad outcomes, leading to their prolonged cautious behavior. Hence, the reward exploration pool is depleted (i.e. the agent transitions to the steady-state phase) before the agent overcomes its priors. This particular dynamic of approach-drive and hazard function updating leads to self-censoring and neophobia. In the steady-state phase, the agent stays long periods at the nest (how long depends again on the forgetting rate). As a result, the animal (at least during the course of the experiment) never accumulates sufficient evidence to learn the safety of the object or whether the object yields rewards. Akiti et al.’s experiment did not last long enough to answer the question of whether all animals, even the most timid ones, eventually perform confident approach. Our model predicts that they will, since the agent only accumulates negative evidence for the hazard function. However, with sufficiently low nCVaRα or sufficiently pessimistic priors, this may take a very long time.
Intermediate animals, like brave animals, eventually switch to confident approach to maximize information gained about potential rewards. Similar to brave animals, the cautious to confident transition tends to be later with lower nCVaRα and steeper, less flexible priors. Intermediate animals perform both cautious and confident bouts with medium duration. This is captured by a hazard prior with smaller E[h3] and larger E[h4]. The percentage of time spent at the object is relatively constant throughout the experiment for intermediate animals. This can be explained by either large G0 or a high forgetting rate. In other words, the animal is either slow to update its belief about the potential reward at the object, or it expects the reward probability to change quickly.
Fig 5 also illustrates several limitations of the model. In particular, the duration of bouts can only increase, whereas a few animals exhibit decreasing bout duration between confident-peak and confident-steady-state phases. Furthermore, the model has trouble capturing abrupt changes in duration (from 2 turns to 4) coinciding with an animal’s transition from cautious to confident approach.
2.2.7 Risk Sensitivity versus Prior Belief Pessimism
We found that risk sensitivity and prior pessimism could not be teased apart in our model fits. This is illustrated in Fig 9. In the ABCSMC posterior distributions, nCVaRα is correlated with the mean μ2 for timid and intermediate animals, μ3 for cautious-2/confident-4 and cautious-2/confident-3 animals, and μ4 for cautious-2/confident-4 and cautious-3/confident-4 animals. In other words, lower nCVaRα (higher risk-sensitivity) can be traded off against lower (more optimistic) priors to explain the observed risk-aversion in animals.
In ablation studies (not shown), we found that it is possible to fit the full range of the behavior equally well with a risk-neutral nCVaR1.0 objective, only varying the hazard priors. The only advantage of fitting both nCVaRα and hazard priors to each animal is greater diversity in the particles discovered by ABCSMC. While the model with nCVaR1.0 is simpler, one might suspect, on general grounds, that both risk sensitivity and belief pessimism affect mice behavior – and they would be distinguishable under other conditions.
By contrast, we found that nCVaRα alone, with the same hazard prior for all animals, is incapable of fitting the full range of animal behavior (results not shown). This can be explained by the fact that nCVaRα cannot model the different slopes in the hazard function. For example, a cautious-2/confident-3 animal must be modeled using a high value of μ4. Starting with the parameters for a cautious-2/confident-4 animal and decreasing nCVaRα will not create a cautious-2/confident-3 animal. Instead, decreasing nCVaRα will delay the cautious-to-confident transition of the cautious-2/confident-4 animal and eventually create a cautious-2 timid animal. Therefore, in our task, structured prior beliefs are required to model the detailed behavior of animals. It is not clear, in general, in which environments one can expect nCVaRα and priors to be identifiable, given the complex interaction of these two sources of risk-sensitivity.
2.2.8 Familiar Object Novel Context
As a contrast with their main experiment, in which mice were exposed to an unfamiliar object in a novel context (UONC), Akiti et al. (2022) also looked at the consequences of exposing animals to a familiar object in a novel context (FONC), where the animals still habituate in the arena over two days but the combination of the object and arena is novel. We fit the behavior of the 9 FONC animals and, as the closest match, compared this with that of the 11 brave animals in the UONC condition. Figure 10 shows that there are 1 intermediate and 8 brave FONC animals, with the latter having exploration schedules similar to the bravest UONC animals. The 8 brave FONC animals have confident-peak and confident-steady-state phases, meaning their approach decreases in the steady-state, suggesting that they are reinvestigating the familiar object for reward.
Figure 11 compares the posteriors of the ABCSMC fit of brave UONC and FONC animals. The x-axis shows the group-timidity animal index, but split by experiment condition (UONC then FONC). Compared to brave UONC animals, FONC animals are fitted with higher nCVaRα and lower hazard priors (average posterior parameters across animals are significantly different according to the Kolmogorov-Smirnov test, p < 0.05). Both the hazard prior means and variances are lower for the FONC animals, indicating that these animals are more certain of the safety of the object compared to UONC animals. For 3 animals the hazard prior means are nearly zero, indicating belief of almost certain safety. This is similar to the hazard function of a brave UONC animal at the end of the experiment. For the other 6 FONC animals, the hazard prior is high enough to warrant initial cautious bouts, suggesting that the novelty of the context has increased their beliefs about the threat level of the familiar object. However, even these animals transition faster to confident approach than the brave UONC animals, as can be seen in Figure 10. Figure 11b shows that FONC animals also have, on average, a lower (p < 0.05) exploration pool than brave UONC animals. Taken together, these results show that pre-exposure to the object decreases both the animals’ beliefs about potential hazards and their motivation to explore the object for reward.
3 Discussion
We combined a Bayes adaptive Markov decision process framework with beliefs about hazards, and a conditional value at risk objective, to capture many facets of an abstraction of the substantially different risk-sensitive exploration of individual animals reported by Akiti et al. (2022). In the model, behaviour reflects a battle between learning about potential threat and potential reward (neither of which actually exists). The substantial individual variability in the schedules of exploratory approach was explained by different risk sensitivities, forgetting rates, exploration bonuses and prior beliefs about an assumed hazard associated with a novel object. Neophilia arises from a form of optimism in the face of uncertainty, and neophobia from the hazard. Critically, the hazard function is generalizing (reducing the τ = 2 hazard reduces the τ = 4 hazard) and monotonic. The former property induces an increasing approach duration over time (Arsenian, 1943). Furthermore, the exploration bonus associated with the object regenerates, as if the subjects consider its affordance to be non-stationary (Dayan et al., 2000). This encourages even the most timid animals to continue revisiting it.
A main source of persistent timidity is a sort of path-dependent self-censoring (Dayan et al., 2020). That is, the agents could be so pessimistic about the object that they never visit it for long enough to overturn their negative beliefs. This can, in principle, arise from either excessive risk-sensitivity or overly pessimistic priors. We found that it was not possible to use the model to disentangle the extent to which these two were responsible for the behavior of the mice, since they turn out to have very similar behavioural phenotypes in this task. One key difference is that risk aversion continues to affect behaviour at the asymptote of learning; something that might be revealed by due choice of a series of environments. Certainly, according to the model, forced exposure (Huys et al., 2022) would hasten convergence to the true hazard function and the transition to confident approach.
Due to the complexity of the dataset, we made several rather substantial simplifying assumptions. First, the model employs a particular set of state abstractions, for instance representing thigmotaxis as a notional “nest” (Simon et al., 1994). Second, the model only allows the frequency of approach, and not its duration, to decrease during the steady-state phase; some animals are better fit by decreasing duration. This limitation could be remedied in future models with, for example, a mechanism for boredom causing the animal to retreat when little potential reward remains at the object. Third, the probability of being detected was the same between cautious and confident approaches, which may not be true in general. Note that the agent decides the type of approach before the bout, and is incapable of switching from cautious to confident mid-bout or vice versa. This is consistent with behavior reported in Akiti et al. (2022). Fourth, we restricted ourselves to a monotonic hazard function for the predator. It would be interesting to experiment with a non-monotonic hazard function instead, as would arise, for instance, if the agent believed that if the predator has not shown up after a long time, then there actually is no predator. Of course, a sophisticated predator would exploit the agent’s inductive bias about the hazard function – by waiting until the agent’s posterior distribution has settled. In more general terms, the hazard function is a first-order approximation to a complex game-theoretic battle between prey and predator, which could be modeled, for instance, using an interactive POMDP (I-POMDP; Gmytrasiewicz and Doshi, 2005). How the predator’s belief about the whereabouts of the prey diminishes could also be modeled game-theoretically, leading to partial hazard resetting rather than the simplified complete resetting in our model.
Our account is model-based, with the mice assumed to be learning the statistics of the environment and engaging in prospective planning (Mobbs et al., 2020). By contrast, Akiti et al. (2022) provide a model-free account of the same data. They suggest that the mice learn the values of threat using an analogue of temporal difference learning (Sutton, 1988), and explain individual variability as differences in value initialization (Akiti et al., 2022). The initial values are generalizations from previous experiences with similar objects, and are implemented by activity of dopamine in the tail of the striatum (TS) responding to stimuli salience (Akiti et al., 2022). By contrast, our model encompasses extra features of behavior such as bout duration, frequency, and type of approach – ultimately arriving at a different mechanistic explanation of neophobia. In the context of our model, TS dopamine could still respond to the physical salience of the novel object but might then affect choices by determining the potential cost of the encountered threat (a parameter we did not explore here) or perhaps the prior on the hazard function. An analogous mechanism may set the exploration pool or the prior belief about reward - perhaps involving projections from other dopamine neurons, which have been implicated in novelty in the context of exploration bonuses (Kakade and Dayan, 2002) and information-seeking for reward (Ogasawara et al., 2022; Bromberg-Martin and Hikosaka, 2009).
As reported in Akiti et al. (2022), animals in the FONC condition, in which the object is familiar (though the context is less so), transition quickly to tail-exposed approach and therefore spend more time near the object compared to animals in the UONC condition. Akiti et al. (2022) model the FONC animals using low initial mean threat and high initial threat uncertainty. We directly compare the behavior of FONC animals against that of the 11 brave UONC animals, showing that FONC animals make choices that are comparable to the bravest UONC animals. FONC behavior is fit by significantly higher nCVaRα than brave UONC behavior. It is also characterized by both lower hazard prior means and standard deviations, implying greater certainty about the object’s safety. Furthermore, FONC behavior is fitted with lower exploration pools than brave UONC behavior. Taken together, we can understand the FONC animals as having both lower uncertainty about hazard and reward compared to the brave UONC animals at the start of the experiment. However, the hazard and reward uncertainties are higher than what we might expect of UONC animals at the end of the experiment, suggesting that the novel context modulates both of these uncertainties. However, heterogeneity exists between FONC individuals in terms of nCVaRα, hazard priors, and exploration pool, which allows another possibility: that both hazard and reward uncertainty are restored by forgetting during the time that passed between pre-exposure and the experiment.
Our model-based account recovers several behavioral phenotypes in addition to those considered in Akiti et al. (2022). First, intermittency in our model emerges from the fact that the (possibly CVaR perturbed) hazard function increases with time spent at the object. Therefore, it is rational for the model mice to retreat to the nest when the probability of detection becomes too high and wait until (they believe) the “predator has forgotten about them”, before venturing to the object again.
Second, we offer an alternative explanation for why animals avoid the object after risk-assessment in a benign environment. In Akiti et al. (2022)’s model, timid animals perform risk-assessment because of the delay in model-free value updating from the initial threat at the object (at timestep t = 10 in their account) to the time of decision (t = 8). In our model, avoidance arises from a rational trade-off between potential risk and reward: timid animals perform risk-assessment because of the potential reward at the object and, having found none, cease to approach because, although potential threat is lower than at the outset, it still outweighs the even further-reduced potential reward. The same exhaustion of the exploration bonus explains why the brave animals decrease their approach during the steady-state of engagement. If the potential reward is low, there is no reason to return to the object at the initial, high rate of engagement.
Third, the temporally evolving battle between reward and threat also explains why brave animals increase their duration of approach when transitioning from risk-assessment to engagement. During confident approach, the animals harvest the exploration pool faster, at the cost of an increased probability of expiring. For brave animals, the hazard posterior decreases faster than the depletion of the exploration pool, and hence brave animals decide to save on travel costs by exploring the object longer in each bout.
Fourth, timid animals return to the object in the steady-state of “avoidance”, albeit at a lower rate than during risk-assessment. This was not considered in Akiti et al. (2022)’s account. In our model, timid animals’ steady-state approach is explained by the regenerating exploration pool. Such regeneration is natural if the animals assume that the environment is non-stationary, allowing reward structures to change and thus potentially repaying occasional returns to the object if the potential threat has become sufficiently low. Similarly, the animal may believe that threat is non-stationary. Threat forgetting may act on longer time-scales than reward forgetting in our studied environment, and is one possible explanation for the initial non-zero hazard functions of some brave animals in the FONC condition.
Finally, our model shows the multi-faceted nature of timidity during exploration. Not only do animals differ in time spent near the object but also in how quickly they transition from cautious to confident approach, and their duration and frequency of approach along their exploration schedules. These proxies for timidity are imperfectly correlated. Indeed, an animal could believe that short bouts (τ = 2) are very safe while long bouts (τ = 4) certainly lead to expiration.
Of course, agents do not need to be fully model-free or model-based. They can truncate model-based planning using model-free values at leaf nodes (Keramati et al., 2016). Furthermore, replay-like prioritized model-based updates can update a model-free policy when environmental contingencies change (Antonov and Dayan, 2023). Finally, while online BAMDP planning can be computationally expensive, a model-based agent may simply amortize planning into a model-free policy which it can reuse in similar environments, or even precompile model-based strategies into an efficient model-free policy using meta-learning (Wang et al., 2017). Agents may have faced many different exploration environments, with differing reward and threat trade-offs, through their lifetimes and even over evolution, which they could have used to create fast, instinctive model-free policies that resemble prospective, model-based behavior (Rusu et al., 2016; Mattar and Daw, 2018). In turn, TS dopamine might reflect aspects of model-free (MF) values or prediction errors that had been trained by a model-based (MB) system following the precepts we outlined.
In Akiti et al. (2022), ablating TS-projecting dopamine neurons made mice “braver”. They spent more time near the object, performed more tail-exposed approach and transitioned faster to tail-exposed approach compared to control. In Menegas et al. (2018), TS ablation affected the learning dynamics for actual, rather than predicted, threat. Both ablated and control animals initially demonstrated retreat responses towards airpuffs, but only control mice maintained this response (Menegas et al., 2018). After airpuff punishment, ablated individuals surprisingly did not decrease their choices of water ports associated with airpuffs (while controls did). One possibility is that this additional exposure could have caused acclimatization to the airpuffs in the same way that brave animals in our study acclimatize to the novel object by approaching more, and timid animals fail to acclimatize because of self-censoring. Indeed, future experiments might investigate why punishment-avoidance does not occur in ablated animals and whether the same holds in risk-sensitive exploration settings (Menegas et al., 2018). In other words, would mice decrease approach after reaching the “detected” state, as expected by our model, or would they maladaptively continue the same rate of approach? Finally, while our study has focused on threat, Menegas et al. (2017) showed that TS also responds to novelty and salience in the context of rewards and neutral stimuli. That TS ablated animals spend more, rather than less, time near the novel object suggests that the link from novelty to neophilia and exploration bonuses might not be mediated by this structure.
The behaviour of the mice in Akiti et al. (2022) somewhat resembles attachment behaviour in toddlers (Ainsworth, 1964; Bowlby, 1955), albeit with the care-giver’s trusty leg (a secure base from which to explore) replaced by thigmotaxis (or, in our case, the notional ‘nest’). Characteristic to this behaviour is an intermittent exploration strategy, with babies venturing away from the leg for a period before retreating back to its safety. Through the time course of exposure to a novel environment, toddlers progressively venture out longer and farther away, spending more time actively playing with the toys rather than passively observing them in hesitation (Arsenian, 1943). This is another example of a dynamic exploratory strategy, putatively arising again from differential updates to beliefs about threats and the rewards in the environment (Arsenian, 1943; Ainsworth, 1964).
Variability in timidity during exploration has been reported in other animal species and can be caused by differences in both prior experience and genotype. Fish from predator-dense environments tend to make more inspection approaches but stay further away, avoid dangerous areas (attack-cone avoidance) and approach in larger shoals compared to fish from predator-sparse environments (Magurran and Seghers, 1990; Dugatkin, 1988; Magurran, 1986). Dugatkin (1988) and Magurran (1986) report significant within-population differences in the inspection behavior of guppies and minnows respectively. Brown and Dreier (2002) directly manipulated the predator experience of glowlight tetras, leading to changes in inspection behavior. Similar inter- and intra-population differences in timidity have been reported in mammals. In Coss and Biardi (1997), the squirrel population sympatric with the tested predators stayed further away and spent less time facing the predator compared to the allopatric population. Furthermore, the number of inspection bouts differed between litters, between individuals within the same litter, and even between the same individuals at different times during development (Coss and Biardi, 1997). In Kemp and Kaplan (2011), marmosets differed in risk-aversion when inspecting a potential (taxidermic) predator, but risk-aversion was not stable across contexts for some individuals. FitzGibbon (1994) reports age differences in inspection behavior: adolescent gazelles inspected cheetahs more than adults or half-growns. Finally, Mazza et al. (2019) and Eccard et al. (2020) report substantial individual differences in the foraging behavior of voles in risky environments, and Lloyd and Dayan (2018) provide a somewhat general model of foraging under risk.
In conclusion, our model shows that risk-sensitive, normative, reinforcement learning can account for individual variability in exploratory schedules of animals, providing a crisp account of the competition between neophilia and neophobia that characterizes many interactions with an incompletely known world.
4 Materials and methods
4.1 BAMDP Hyperstate
A Bayes-Adaptive Markov Decision Process (BAMDP; Duff, 2002b; Guez et al., 2013) is an extension of a model-based MDP, and a special case of a Partially Observable Markov Decision Process (POMDP; Kaelbling et al., 1998), in which the agent models its uncertainty about the (unchanging) transition dynamics. In a BAMDP, the agent extends its state representation into a hyperstate consisting of the original MDP state s, and the belief over the transition dynamics b(T).
In our model, s is the conjunction of the “physical state” (the location of the agent, as shown in Fig 3) and the number of turns τ that the agent has spent at the object so far. In the general case, T is a |S| × |A| × |S| tensor where each element is p(s′ | s, a), and S and A are the sets of states and actions respectively. Therefore, b(T) is a probability distribution over (possibly infinite) transition tensors. In our model, all transition probabilities are assumed fixed except for the hazard function probabilities. Therefore, a belief over transition tensors b(T) is a belief over hazard functions b(h). We use a noisy-or hazard function parameterized by a vector of Beta distribution parameters (μj, σj), j = 2, 3, 4. In totality, the belief over transition tensors b(T) is a belief over parameter vectors.
However, to maintain generality in the next section, we derive the Bellman updates using the notation b(T).
Our hyperstate additionally contains the nCVaR static risk preference α, and the parameters of the heuristic exploration bonus (see Section 4.4).
4.2 Bellman Updates for BAMDP nCVaR
As for a conventional MDP, the nCVaR objective for a BAMDP can be solved using Bellman updates. We use Eq 4, which assumes a deterministic, state-dependent reward:

V_α(s, b) = max_a { r(s) + min_{ξ ∈ U_CVaR(α, p̄)} Σ_{s′} p̄(s′ | s, a, b) ξ(s′) V_{α ξ(s′)}(s′, b′) },   (4)

where s′ is the next state, b′(T) is the posterior belief over transition dynamics after observing the transition (s, a, s′), and p̄(s′ | s, a, b) = E_{T ∼ b}[T(s, a, s′)] is the expected transition probability.
Proof of Eq 4. The CVaR minimization is in principle over perturbations ξ(T, s′) of the joint density b(T) T(s, a, s′) over transition tensors and next states, where U_CVaR(α, p) = {ξ : ξ ∈ [0, 1/α], E_p[ξ] = 1} is the risk envelope for CVaR (Chow et al., 2015). But the successor value V_{α ξ}(s′, b′) depends on T only through s′, because b′ is a deterministic function of (s, a, s′) and b; the optimal ξ can therefore be chosen constant in T given s′. Hence we can drop the independent integration over T, and only integrate over s′, with p̄(s′ | s, a, b) taking the place of the joint density.
Epistemic uncertainty about the transitions only generates risk in as much as it affects the probabilities of realizable transitions in the environment.
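The upweighting of bad outcomes can be made explicit by computing the minimizing perturbation ξ in the risk envelope; the following sketch (with illustrative values) verifies that E_p[ξ · V] recovers CVaRα:

```python
import numpy as np

def optimal_xi(values, probs, alpha):
    """Minimizing perturbation xi in the CVaR risk envelope:
    xi(s') in [0, 1/alpha], sum_s' xi(s') p(s') = 1.
    Weight 1/alpha goes on the worst outcomes, a fractional weight at the
    quantile boundary, and zero elsewhere."""
    xi = np.zeros(len(values))
    remaining = alpha                      # tail mass left to allocate
    for i in np.argsort(values):           # worst outcomes first
        take = min(probs[i], remaining)
        xi[i] = take / (alpha * probs[i]) if probs[i] > 0 else 0.0
        remaining -= take
        if remaining <= 0:
            break
    return xi

values = np.array([-10.0, 0.0])            # expire vs. nothing
probs = np.array([0.05, 0.95])
xi = optimal_xi(values, probs, alpha=0.1)
print(xi)                                  # [10.0, ~0.526]: disaster upweighted 10x
print((probs * xi * values).sum())         # -5.0 = CVaR_0.1
```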
4.3 Noisy-Or Hazard Function
In our model, the hazard function defines a binary detection event Xτ for each number of turns the agent spends at the object τ = 2, 3, 4. The predator detects the agent when Xτ = 1. We use a noisy-or hazard function which defines Xτ as the union of Bernoulli random variables Zj ∼ Bernoulli(θj):

Xτ = 1 − ∏_{j ≤ τ} (1 − Zj),   (6)

with priors θj ∼ Beta(μj, σj) for j = 2, 3, 4. Fig 12 shows the relationships between the random variables in plate notation.
Posterior inference for the noisy-or model is intractable in the general case (Jaakkola and Jordan, 1999). However, there is a closed-form solution for the posterior when the agent only makes negative observations, meaning xτ = 0 (in our case, since there is no actual predator). For example, given a single observation xτ = 0,

p(θj | xτ = 0) = Beta(θj; aj, bj + 1), for all j ≤ τ.   (7)
Here we switch back to the pseudocount parameterization of the Beta distribution Beta(θ; a, b) to exploit its conjugacy.
Hence the posterior update simply increments the Beta pseudocounts for the ‘0’ outcomes. The hazard probability is the posterior predictive distribution h(τ) = p(xτ = 1 | D), where D is a set of observations of X1, X2, …, Xτ:
h(τ) = 1 − ∏_{j ≤ τ} (1 − μj),   (8)

where μj = 𝔼[θj] is the expected value of the posterior on θj.
Proof of Eq 8.

p(xτ = 1 | D) = 1 − p(xτ = 0 | D) = 1 − ∏_{j ≤ τ} 𝔼[1 − θj | D] = 1 − ∏_{j ≤ τ} bj / (aj + bj),

where bj are the pseudocounts of negative observations after updating the Beta prior with D using Eq 7. It can be shown that h(τ) is recursive:

h(τ) = h(τ − 1) + (1 − h(τ − 1)) μτ.   (9)
This recursion has two implications. First, the hazard function is monotonic, since (1 − h(τ − 1)) > 0 and μτ > 0. Second, the hazard function generalizes. From Eq 9 it is clear that if h(τ − 1) increases, then h(τ) increases. It is this generalization that allows the agent to progressively spend more turns at the object.
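A compact sketch of this learning rule (Eqs 7 and 9); the prior pseudocounts here are arbitrary illustrations rather than fitted values:

```python
class NoisyOrHazard:
    """Beta beliefs over theta_j (j = 2..4) for the noisy-or hazard.
    Negative observations give the closed-form update of Eq 7; the
    posterior-predictive hazard uses the recursion of Eq 9."""
    def __init__(self, a, b):
        self.a = dict(a)   # pseudocounts of detections, keyed by turn j
        self.b = dict(b)   # pseudocounts of non-detections

    def update_no_detection(self, tau):
        # surviving tau turns implies z_j = 0 for every j <= tau (Eq 7)
        for j in self.b:
            if j <= tau:
                self.b[j] += 1

    def hazard(self, tau):
        h = 0.0
        for j in sorted(self.a):
            if j > tau:
                break
            mu = self.a[j] / (self.a[j] + self.b[j])  # posterior mean of theta_j
            h = h + (1.0 - h) * mu                    # Eq 9
        return h

belief = NoisyOrHazard(a={2: 1, 3: 1, 4: 1}, b={2: 3, 3: 3, 4: 3})
print([round(belief.hazard(t), 3) for t in (2, 3, 4)])   # prior hazard
for _ in range(10):                                      # ten uneventful length-3 bouts
    belief.update_no_detection(3)
print([round(belief.hazard(t), 3) for t in (2, 3, 4)])   # h(4) falls too: generalization
```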
4.3.1 Transforming μ, σ to Pseudocount Parameterization of Beta Distribution
We use the mean μ and variance v = σ2 parameterization of the Beta distribution to get a more uniform sampling of the prior parameter space for ABCSMC fitting. We sample μ and σ from uniform distributions. However, it is more convenient to work with pseudocounts for computing the hazard posterior. Therefore, we transform μ and σ to pseudocounts a, b using the identities below:

a = μ (μ(1 − μ)/v − 1),  b = (1 − μ) (μ(1 − μ)/v − 1).

Note that v must be less than μ − μ2 to avoid negative values of a, b.
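In code, the transformation is a one-liner (method of moments); the example values are arbitrary:

```python
def beta_pseudocounts(mu, sigma):
    """Convert mean/std of a Beta distribution to pseudocounts (a, b).
    Requires sigma**2 < mu * (1 - mu)."""
    v = sigma ** 2
    assert v < mu * (1 - mu), "variance too large for a Beta distribution"
    nu = mu * (1 - mu) / v - 1   # total pseudocount a + b
    return mu * nu, (1 - mu) * nu

print(beta_pseudocounts(0.25, 0.2))  # (0.922, 2.766): mean 0.25, variance 0.04
```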
4.4 Heuristic Exploration Bonus Pool
The heuristic reward function approximates the sort of exploration bonus (Gittins, 1979) that would arise from uncertainty about potential exploitable benefits of the object. It incentivizes approach and engagement. In the experiment, there is no actual reward, so the motivation is purely intrinsic (Oudeyer and Kaplan, 2007). The exploration bonus depletes as the agent learns about the object; but regenerates if the agent believes that the object can change over time (or, equivalently, if the agent forgets what it has learnt). This regenerating uncertainty can be modeled normatively using POMDPs but is only approximated here. Since we imagine the agent as finding out more about the object through confident than cautious approach, the former generates a greater bonus per step, but also depletes it more quickly.
We model the exploration-based reward as an exponentially decreasing resource. G(t) is the “exploration bonus pool” and can be interpreted as the agent’s remaining motivation to explore in the future. We fit the size of the initial exploration pool G(0) = G0 to the behavior of each animal. During planning, the agent imagines receiving rewards at the cautious and confident object states proportional to G(t).
On every turn at the cautious or confident object states, the agent extracts reward rcautious(t) or rconfident(t) from its budget G, depleting G at rate ωcautious or ωconfident respectively. This leads to an exponential decrease in G(t) with turns spent at the object, which is clear from Eq 17. For example, at the cautious object state the update to G(t) is

G(t + 1) = (1 − ωcautious) G(t).   (17)
However, a secondary factor affects the update to G(t): G linearly regenerates back towards G0 at the forgetting rate f, which we also fit for each animal. The full update to the reward pool for spending one turn at the cautious object state is

G(t + 1) = min{(1 − ωcautious) G(t) + f, G0}.
Note that G(t) regenerates by f in all states, not only at the object states. We use linear forgetting for its simplicity although other mechanisms such as exponential forgetting are possible.
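A sketch of these pool dynamics, assuming the min-clamped update reconstructed above; the rates, bout schedule, and initial pool are illustrative rather than fitted:

```python
def step_pool(G, at_object, omega, f, G0):
    """One-turn update of the exploration bonus pool G(t): exponential
    depletion while at the object, linear regeneration toward G0 in all
    states. omega is the depletion rate of the current approach type."""
    if at_object:
        G = (1 - omega) * G
    return min(G + f, G0)

G, G0, f = 10.0, 10.0, 0.05
for t in range(60):
    at_object = t % 4 < 2          # hypothetical schedule: 2 turns on, 2 off
    G = step_pool(G, at_object, omega=0.15, f=f, G0=G0)
print(round(G, 2))                 # settles at a steady state set by f and omega
```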
Finally, for completeness in other environments, the reward the agent imagines receiving also depends on the actual reward it has received in the past. Let n1 and n0 be the number of times the agent has received one or zero reward at the object state, analogous to the pseudocounts of a Beta posterior in a fully Bayesian treatment of reward. Furthermore, let n̄1 and n̄0 be the (fitted) values of these pseudocounts at t = 0. The agent imagines receiving reward

rcautious(t) = ωcautious G(t) (n1 + n̄1) / (n1 + n̄1 + n0 + n̄0)
after spending one turn in the cautious object state. A similar equation applies to the confident object state.
We define the depletion rates such that ωcautious = K ⋅ ωconfident, with constants R = 1.1 and K = 0.89 < 1.0, where R scales the per-turn bonus extracted under confident relative to cautious approach. These values were fitted to capture the full range of behavior of the 26 animals.
4.5 Data Fitting
Data fitting aims to elucidate individual differences and population patterns in behavior by searching for the model parameters that best describe the behavior of each animal. We map the behavior of model and animals to a shared abstract space using a common set of statistics and then fit the model to data using ABCSMC.
4.5.1 Animal Statistics
To extract animal statistics, we first coarse-grain behavior into phases and subsequently classify the animals into three groups: brave, intermediate, and timid (as described in the main text). This allows us to maintain the temporal dynamics of the behavior while reducing the dimension of the data. We average the approach type, duration, and frequency over each phase and fit a subset of statistics that capture the high-level temporal dynamics of behavior of animals in each group.
The behavior of brave animals comes in three phases: cautious, confident-peak and confident-steady-state. We fit five statistics: the transition time from cautious to confident-peak phase tcautious-to-confident, the transition time from confident-peak to confident-steady-state phase tpeak-to-steady, the average durations during the cautious and confident-peak phases dcautious, dpeak-confident, and the ratio of the confident-peak and confident-steady-state phases’ frequencies fpeak-confident / fsteady-confident.
Intermediate animals exhibit only two phases: cautious and confident. We fit four statistics: the transition time from the cautious to the confident phase, tcautious-to-confident; the durations of the two phases, dcautious and dconfident; and the ratio of the cautious and confident phases’ frequencies. However, one limitation of the model is that frequency can only decrease, not increase, because of the dynamics of depletion and replenishment of the exploration bonus pool. Hence we instead fit this frequency ratio clipped from below at one, so that the model is not penalized for frequency increases it cannot produce.
Timid animals also exhibit only two phases, albeit different ones from the intermediate animals: cautious-peak and cautious-steady-state. We fit four statistics: the transition time from the cautious-peak to the cautious-steady-state phase, tpeak-to-steady; the durations of the two phases, dcautious-peak and dcautious-steady; and the ratio of the frequencies of the two phases.
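As an illustration of how these phase-averaged statistics fit together, here is a sketch for a brave animal; the bout data layout (a list of dicts with phase labels, onset times, durations, and periods) is a hypothetical convenience, not the authors' format.

```python
import numpy as np

def brave_statistics(bouts):
    """bouts: list of dicts with keys 'phase' (one of 'cautious',
    'confident-peak', 'confident-steady'), 'time' (bout onset, minutes),
    'duration' (seconds), and 'period' (minutes between bout onsets)."""
    by_phase = lambda p: [b for b in bouts if b["phase"] == p]
    cautious = by_phase("cautious")
    peak = by_phase("confident-peak")
    steady = by_phase("confident-steady")

    return {
        "t_cautious_to_confident": peak[0]["time"],
        "t_peak_to_steady": steady[0]["time"],
        "d_cautious": np.mean([b["duration"] for b in cautious]),
        "d_peak_confident": np.mean([b["duration"] for b in peak]),
        # Frequencies are inverse periods, so the peak/steady frequency
        # ratio is the inverse ratio of the phase-averaged periods.
        "freq_ratio_peak_steady": np.mean([b["period"] for b in steady])
        / np.mean([b["period"] for b in peak]),
    }
```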
4.5.2 Model Statistics
By design, our BAMDP agent also enjoys a notion of bouts and behavioral phases. We map the behavior of the agent to the same abstract space of duration, frequency, and transition time statistics as the animals to allow the fitting.
We consider the agent to be performing a bout when it leaves the nest, stays at the object state for some number of turns, and finally returns to the nest. We parse bouts and behavioral phases from the agent's overall state trajectory, which, like the animals', exhibits contiguous periods of cautious or confident approach and of low or high approach frequency.
The transition from the cautious to the confident phase (measured in numbers of turns) occurs when the model begins visiting the confident-object state rather than the cautious-object state (this transition never happens for low nCVaRα). The transition from the peak to the steady-state phase occurs when the model starts spending more than one consecutive turn at the nest (to regenerate G), which happens when G reaches its steady-state value, determined by the forgetting rate. We linearly map the agent's transition times (in units of turns) to the space of the animals' transition times (in units of minutes) using the relationship 2 turns = 1 minute. The agent is therefore simulated for 200 turns, corresponding to 100 minutes in the experiment.
Bout duration is naturally defined as the number of consecutive turns the agent spends at the object. Because the agent lives in discrete time, we map its durations (in units of turns) to the space of animal durations (in units of seconds) using the linear formula

$$d_{\text{seconds}} = 0.75 \times d_{\text{turns}}.$$

Hence the agent can produce durations from 0.75 to 3.75 seconds, which captures a large range of the animals' phase-averaged durations.
We define the momentary frequency with which the agent visits the object as the inverse of the period, namely the number of turns between two consecutive bouts (the sum of the turns spent at the nest and at the object). Frequency ratios are computed by dividing the average periods of two phases (in units of turns) and are therefore unitless, so no mapping between agent and animal frequency ratios is necessary.
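The following sketch parses an agent's state trajectory into bouts and applies the unit mappings above (2 turns per minute for transition times; the assumed 0.75 seconds per turn for durations); the state labels are our shorthand for the model's object states.

```python
OBJECT_STATES = {"cautious-object", "confident-object"}  # shorthand labels

def parse_bouts(trajectory):
    """Split a per-turn state trajectory into bouts: maximal runs of
    consecutive turns spent at an object state."""
    bouts, start = [], None
    for t, state in enumerate(trajectory):
        if state in OBJECT_STATES and start is None:
            start = t
        elif state not in OBJECT_STATES and start is not None:
            bouts.append({"start_turn": start, "turns": t - start,
                          "type": trajectory[start]})
            start = None
    if start is not None:  # trajectory ends mid-bout
        bouts.append({"start_turn": start,
                      "turns": len(trajectory) - start,
                      "type": trajectory[start]})
    return bouts

def to_minutes(turns):   # transition times: 2 turns = 1 minute
    return turns / 2.0

def to_seconds(turns):   # durations: 0.75 seconds per turn (assumed)
    return 0.75 * turns
```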
4.5.3 Approximate Bayesian Computation
We fit each of the 26 animals from Akiti et al. (2022) separately, using the ABCSMC algorithm of Toni et al. (2009). We use an adaptive acceptance-threshold schedule that sets ϵt to the 30th percentile of the distances d(x, x0) in the previous population, and a Gaussian perturbation kernel Kt(θ|θ*) = 𝒩(θ*, Σ), where the bandwidth of Σ is set using the Silverman heuristic. Priors over all parameters were uniform. We ran ABCSMC for T = 30 populations for each animal, although most animals converged earlier. Table 1 lists the ABCSMC parameters.
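For concreteness, here is a stripped-down sketch of a single ABCSMC population update in the style of Toni et al. (2009); `simulate` and `distance` stand in for the BAMDP simulator and the normalized L1 distance described next, and the importance-weight updates are omitted for brevity.

```python
import numpy as np

def abcsmc_population(particles, weights, simulate, distance, x0,
                      in_prior, n_particles, eps):
    """One ABCSMC population update (simplified): resample, perturb with
    a Gaussian kernel whose bandwidth follows Silverman's rule, and
    accept particles whose simulated statistics fall within eps of the
    animal statistics x0.

    particles: (n, d) array of the previous population.
    weights:   length-n array of normalized importance weights.
    """
    n, d = particles.shape
    # Silverman rule-of-thumb bandwidth for the perturbation kernel.
    sigma = particles.std(axis=0) * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    cov = np.diag(sigma ** 2)

    accepted, dists = [], []
    while len(accepted) < n_particles:
        theta_star = particles[np.random.choice(n, p=weights)]
        theta = np.random.multivariate_normal(theta_star, cov)
        if not in_prior(theta):          # uniform prior: reject outside box
            continue
        dist = distance(simulate(theta), x0)
        if dist <= eps:
            accepted.append(theta)
            dists.append(dist)

    # Adaptive schedule: the next threshold is the 30th percentile of
    # the accepted distances in this population.
    eps_next = np.percentile(dists, 30)
    return np.array(accepted), eps_next
```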
Given agent statistics x and animal statistics x0 in a joint space, we compute the ABC distance d(x, x0) using a normalized L1 distance function

$$d(x, x_0) = \sum_i \frac{\lvert x_i - x_{0,i} \rvert}{C_i(x_i)},$$

where i indexes the statistics and Ci(xi) is a normalization constant that depends on the statistic and possibly on the value xi. Normalization is necessary because the statistics have different units and ranges of values.
We normalize durations using the constant Ci(xi) = 4.0 seconds. We normalize transition times using a piecewise-linear function, to prevent extremely small or large values from dominating the distance, and we likewise normalize frequency ratios using a piecewise-linear function.
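A sketch of the distance computation follows; the breakpoints of the piecewise-linear normalizers below are illustrative placeholders, since the paper's exact values are not reproduced here.

```python
import numpy as np

def normalized_l1(x, x0, normalizers):
    """d(x, x0) = sum_i |x_i - x0_i| / C_i(x_i), with one normalizer
    per statistic."""
    return sum(abs(xi - x0i) / C(xi)
               for xi, x0i, C in zip(x, x0, normalizers))

def duration_C(xi):
    return 4.0                          # seconds (constant normalizer)

def transition_time_C(xi, lo=10.0, hi=40.0):
    # Illustrative piecewise-linear normalizer: proportional to the
    # statistic in the mid-range, clamped so that extremely small or
    # large transition times do not dominate the distance.
    return float(np.clip(xi, lo, hi))

def frequency_ratio_C(xi, lo=0.5, hi=2.0):
    return float(np.clip(xi, lo, hi))   # same idea for frequency ratios
```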
Acknowledgements
We are grateful to Chris Gagne, Vikki Neville, Mike Mendl, Elizabeth S. Paul, Richard Gao, and particularly Mitsuko Watabe-Uchida for their helpful discussions and feedback. Funding was from the Max Planck Society and the Humboldt Foundation. Open access funding was provided by the Max Planck Society. PD is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 39072764, and of the Else Kröner Medical Scientist Kolleg “ClinbrAIn: Artificial Intelligence for Clinical Brain Research”. We thank the IT team of the Max Planck Institute for Biological Cybernetics for technical support.
Appendix 1
A Recovery Analysis
We performed a recovery analysis on our ABCSMC fits. The recovery targets were the best-fitting particles for each of the 26 mice. We ran ABCSMC a second time, using the same hyperparameters, to check that we could recover these targets.
Fig. 1 compares the recovery targets against the closest particles in the posterior of the (recovery) ABCSMC fit. Each subplot shows one of the nine fitted parameters: nCVaRα, G0, the forgetting rate f, the three hazard prior means, and the three hazard prior deviations. In general, the ABCSMC fitting algorithm recovers the recovery targets reasonably well for all animals, with a minimum R² value of 0.72.
Fig. 2 compares the recovery targets against the (marginal) means of the ABCSMC posterior. The exploration pool size G0 and the forgetting rate f are well recovered. However, recoverability is poor for nCVaRα and the hazard prior parameters, owing to non-identifiability. This is further illustrated in Fig. 3 for a single brave animal, which plots the univariate and bivariate marginals of the ABCSMC posterior. As expected, the recovery targets lie within a narrow range of the posterior distributions for G0 and f. For nCVaRα and the prior parameters, the recovery targets are farther from the posterior means but still lie within a region where the posterior has support.
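Schematically, the recovery analysis reduces to refitting statistics simulated from each recovery target and comparing the result with the generating parameters; a minimal sketch follows, with `fit_abcsmc` standing in for the full fitting pipeline.

```python
import numpy as np

def recovery_analysis(recovery_targets, simulate, fit_abcsmc):
    """For each recovery target (the best-fitting particle of one
    animal), simulate statistics, refit with identical hyperparameters,
    and compare the target against both the closest posterior particle
    (cf. Fig. 1) and the marginal posterior mean (cf. Fig. 2)."""
    recovered_best, recovered_mean = [], []
    for theta_true in recovery_targets:
        x_sim = simulate(theta_true)
        posterior = fit_abcsmc(x_sim)    # (m, d) array of particles
        dists = np.linalg.norm(posterior - theta_true, axis=1)
        recovered_best.append(posterior[np.argmin(dists)])
        recovered_mean.append(posterior.mean(axis=0))
    return np.array(recovered_best), np.array(recovered_mean)
```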
References
- Patterns of attachment behavior shown by the infant in interaction with his mother. Merrill-Palmer Quarterly of Behavior and Development 10:51–58
- Striatal dopamine explains novelty-induced behavioral dynamics and individual variability in threat prediction. Neuron 110:3789–3804. https://doi.org/10.1016/j.neuron.2022.08.022
- Exploring Replay. bioRxiv
- Young children in an insecure situation. The Journal of Abnormal and Social Psychology 38
- Coherent measures of risk. Mathematical Finance 9:203–228
- Distributional Reinforcement Learning. MIT Press
- Opening Burton’s clock: Psychiatric insights from computational cognitive models. The Cognitive Neurosciences: 439–450
- Anxiety, Depression, and Decision Making: A Computational Perspective. Annual Review of Neuroscience 41:371–388. https://doi.org/10.1146/annurev-neuro-080317-062007
- The Growth of Independence in the Young Child. Journal (Royal Society of Health) 76:587–591
- Midbrain Dopamine Neurons Signal Preference for Advance Information about Upcoming Rewards. Neuron 63:119–126. https://doi.org/10.1016/j.neuron.2009.06.009
- Predator inspection behaviour and attack cone avoidance in a characin fish: the effects of predator diet and prey experience. Animal Behaviour 63:1175–1181
- Risk-sensitive and robust decision-making: a CVaR optimization approach. Advances in Neural Information Processing Systems 28
- Individual variation in the antisnake behavior of California ground squirrels (Spermophilus beecheyi). Journal of Mammalogy 78:294–310
- Learning and selective attention. Nature Neuroscience 3:1218–1223
- The first steps on long marches: The costs of active observation. In: Psychiatry Reborn: Biopsychosocial psychiatry in modern medicine. Oxford University Press. https://doi.org/10.1093/med/9780198789697.003.0014
- Exploration bonuses and dual control. Machine Learning 25:5–22
- Model-based Bayesian exploration. arXiv
- Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. University of Massachusetts Amherst
- Do guppies play TIT FOR TAT during predator inspection visits? Behavioral Ecology and Sociobiology 23:395–399
- Among-individual differences in foraging modulate resource exploitation under perceived predation risk. Oecologia 194:621–634
- Mood as representation of momentum. Trends in Cognitive Sciences 20:15–24
- The costs and benefits of predator inspection behaviour in Thomson’s gazelles. Behavioral Ecology and Sociobiology 34:139–148
- Peril, prudence and planning as risk, avoidance and worry. Journal of Mathematical Psychology 106. https://doi.org/10.1016/j.jmp.2021.102617
- Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society Series B: Statistical Methodology 41:148–164
- A Framework for Sequential Planning in Multi-Agent Settings. Journal of Artificial Intelligence Research 24:49–79. https://doi.org/10.1613/jair.1579
- Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends in Cognitive Sciences 17:585–593
- Neophobia is not only avoidance: improving neophobia tests by combining cognition and ecology. Current Opinion in Behavioral Sciences 6:82–89. https://doi.org/10.1016/j.cobeha.2015.10.007
- Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research 48:841–883
- Components of Behavioral Activation Therapy for Depression Engage Specific Reinforcement Learning Mechanisms in a Pilot Study. Computational Psychiatry. https://doi.org/10.5334/cpsy.81
- Variational probabilistic inference and the QMR-DT network. Journal of Artificial Intelligence Research 10:291–322
- Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99–134
- Dopamine: generalization and bonuses. Neural Networks 15:549–559
- Individual modulation of anti-predator responses in common marmosets. International Journal of Comparative Psychology 24
- Adaptive integration of habits into depth-limited planning defines a habitual-goal–directed spectrum. Proceedings of the National Academy of Sciences 113:12868–12873
- Interrupting behaviour: Minimizing decision costs via temporal commitment and low-level interrupts. PLoS Computational Biology 14
- Predator inspection behaviour in minnow shoals: differences between populations and individuals. Behavioral Ecology and Sociobiology 19:267–273
- Population differences in predator recognition and attack cone avoidance in the guppy Poecilia reticulata. Animal Behaviour 40:443–452
- DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience 21:1281–1289
- Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience 21. https://doi.org/10.1038/s41593-018-0232-z
- Individual variation in cognitive style reflects foraging and anti-predator strategies in a small mammal. Scientific Reports 9
- Dopamine neurons projecting to the posterior striatum reinforce avoidance of threatening stimuli. Nature Neuroscience 21:1421–1430
- Opposite initialization to novel cues in dopamine signaling in ventral and posterior striatum in mice. eLife 6
- Space, time, and fear: survival computations along defensive circuits. Trends in Cognitive Sciences 24:228–241
- A primate temporal cortex–zona incerta pathway for novelty seeking. Nature Neuroscience 25:50–60. https://doi.org/10.1038/s41593-021-00950-1
- What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics 1
- Emotion and decision-making: affect-driven belief systems in anxiety and depression. Trends in Cognitive Sciences 16:476–483
- State representation in mental illness. Current Opinion in Neurobiology 55:160–166. https://doi.org/10.1016/j.conb.2019.03.011
- Risk-averse Bayes-adaptive reinforcement learning. Advances in Neural Information Processing Systems 34:1142–1154
- Artificial intelligence: a modern approach. Malaysia: Pearson Education Limited
- Policy Distillation. arXiv
- Thigmotaxis as an index of anxiety in mice. Influence of dopaminergic transmissions. Behavioural Brain Research 61:59–64. https://doi.org/10.1016/0166-4328(94)90008-6
- Learning to predict by the methods of temporal differences. Machine Learning 3:9–44
- Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6:187–202
- Learning to reinforcement learn. arXiv
- On the Gittins Index for Multiarmed Bandits. The Annals of Applied Probability 2:1024–1033. https://doi.org/10.1214/aoap/1177005588
- Humans use directed and random exploration to solve the explore–exploit dilemma. Journal of Experimental Psychology: General 143
- Revealing the structure of pharmacobehavioral space through motion sequencing. Nature Neuroscience 23:1433–1443. https://doi.org/10.1038/s41593-020-00706-3
Copyright
© 2024, Tingke Shen & Peter Dayan
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.