As explained in the main text, our null model started with just action perseveration (AP; the tendency to stay with the last action taken), which was included as a process in all subsequent models. Next, we established key components of the model-based (MB) system that were not explained in the main text. First, we found that an MB system that learns only from chosen feedback and disregards counterfactual feedback (‘MB no CF’) fit the data far worse than a model that includes counterfactual feedback (‘MB 1 LR’). Subsequently, we showed that an MB agent that learns from counterfactual feedback at a different learning rate (‘MB’) outperforms an agent that constrains the learning rate to be equal across factual and counterfactual feedback (‘MB 1 LR’). After arriving at the winning model in the stepwise fashion described in Results in the main text, we tested a variety of other models by remodeling certain processes within each system. We first instantiated the resource-rational learning described in the main text either just within the goal-perseveration system (‘MB + MF + GPRR’) or in both the GP and MB systems (‘MBRR + MF + GPRR’). We then sought to determine whether the resource-rational shift in MB resources was a function of the most recently experienced goals (which could vary within a task block; ‘MB + RFHist + MF + GP’). To do so, the agent computed a running prediction of the upcoming goal, which was updated by a goal prediction error, defined as the disparity between the goal outcome (1 if reward pursuit, −1 if punishment avoidance) and the predicted goal quantity, and by a goal learning rate, δ; this update is sketched below.
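A minimal reconstruction of this update, written as a delta rule with $\hat{G}_t$ denoting the goal prediction and $G_t$ the goal outcome (these symbol choices are ours, and the exact form is an assumption rather than the fitted equation), is:

\[
\hat{G}_{t+1} \;=\; \hat{G}_{t} \;+\; \delta\,\bigl(G_{t} - \hat{G}_{t}\bigr), \qquad G_{t} \in \{+1, -1\}.
\]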
This goal prediction was then used to modulate model-based utilization weights on a trial-by-trial basis. Unique reward pursuit and punishment avoidance weights defined the direction and magnitude with which the goal prediction quantity modulated the model-based utilization weights, as sketched below.
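One plausible parameterization of this modulation, in our notation (the baseline weights $\omega^{\mathrm{MB}}_{\mathrm{rew}}$ and $\omega^{\mathrm{MB}}_{\mathrm{pun}}$, the modulation weights $\kappa_{\mathrm{rew}}$ and $\kappa_{\mathrm{pun}}$, and the linear form are assumptions rather than the fitted equations), is:

\[
\omega^{\mathrm{MB}}_{\mathrm{rew},t} \;=\; \omega^{\mathrm{MB}}_{\mathrm{rew}} + \kappa_{\mathrm{rew}}\,\hat{G}_{t},
\qquad
\omega^{\mathrm{MB}}_{\mathrm{pun},t} \;=\; \omega^{\mathrm{MB}}_{\mathrm{pun}} + \kappa_{\mathrm{pun}}\,\hat{G}_{t}.
\]

Under a form like this, the predicted abundance effect corresponds to the two modulation weights taking opposite signs, so that model-based control shifts toward the more recently abundant goal.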
If goals were modulated according to goal abundance (as predicted), the reward pursuit and punishment avoidance modulation weights should take opposite signs, such that the model-based utilization weight decreases for reward pursuit and increases for punishment avoidance when an agent’s recent goal history involves more punishment avoidance trials. Next, we sought to determine whether differences in goal valence were better explained by differences in learning (via two separate learning rates) than by our original implementation, in which goal valence differences were encoded in the reward pursuit and punishment avoidance utilization weights. To do so, we had to exclude the resource-rational component because the fitted learning rates greatly exceeded 1 and extended far into the negative range (i.e., <0), which rendered the parameters uninterpretable. As such, we compared two models: (1) a model with separate learning rates for reward and punishment features and a single utilization weight each for the MB and GP systems (‘MBLR + MF + GPLR’; Bayesian information criterion [BIC] = 33,098.176) versus (2) a model with a single learning rate for reward and punishment features and separate utilization weights for reward pursuit and punishment avoidance in both the MB and GP systems, as was true of the winning model (‘MB + MF + GP’; BIC = 32,810.86). Both models also included action perseveration and model-free control, in line with the winning model. This comparison demonstrated that, as in the winning model, goal pursuit differences encoded in utilization weights rather than in learning rates explained the data better. Note that it was not possible to fit separate learning rates and utilization weights for each goal within the MB and GP systems, as these parameters would be unrecoverable due to trade-offs between them. Subsequently, we tested a model in which action values for each goal were learned in a model-free way (see the sketch after this paragraph). That is, instead of the two action values in the original model-free system, there were four action values, two for each goal (‘GPRR + G-MF’). Updates to the action values for a given goal occurred only when that goal was instructed on the current trial. Similarly, decision-making was influenced only by the two action values relevant to the instructed goal. As in the winning model, the action values for each goal had their own utilization weights, which shifted across task blocks in the same way as in the winning model, via change parameters. The model additionally encoded action perseveration. We next sought to model the MF system in a way that encoded a GP-like signature; namely, that experiencing a punishment feature was always punishing (−1 point) and experiencing a reward feature was always rewarding (+1 point), irrespective of the instructed goal (‘MBRR + MF3’). Thus, on each trial there were three action values in the MF system: (1) the action value as originally implemented, (2) an action value predicting whether one’s action leads to reward features, and (3) an action value predicting whether one’s action leads to punishment features. These three action values were integrated with the action values from the MB system and AP to determine choices. Next, we sought to verify that the order in which we included controllers did not affect the modeling results. To do so, we excluded from the winning model just the MF system (‘MBRR + GP’), just the GP system (‘MBRR + MF’), and just the MB system (‘GPRR + MF’). All of these models were inferior to the winning model.
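For concreteness, the following is a minimal Python sketch of the goal-conditioned model-free bookkeeping described above for the ‘GPRR + G-MF’ variant; the variable names, learning rate, and softmax decision rule are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

# Sketch of the goal-conditioned model-free ('G-MF') system: two action
# values per goal (four in total), updated and used only on trials where
# that goal is instructed.

N_ACTIONS = 2
GOALS = ("reward_pursuit", "punishment_avoidance")

# Four action values in total: two actions x two goals.
q_values = {goal: np.zeros(N_ACTIONS) for goal in GOALS}

def choose(goal, beta=3.0):
    """Softmax choice using only the instructed goal's two action values."""
    logits = beta * q_values[goal]
    p = np.exp(logits - np.max(logits))
    p /= p.sum()
    return np.random.choice(N_ACTIONS, p=p)

def update(goal, action, outcome, alpha=0.3):
    """Delta-rule update applied only to the instructed goal's chosen action."""
    q_values[goal][action] += alpha * (outcome - q_values[goal][action])

# Example trial: punishment avoidance is instructed, an action is chosen from
# that goal's values, and only that goal's action values are updated.
a = choose("punishment_avoidance")
update("punishment_avoidance", a, outcome=-1.0)
```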
Finally, we tested an alternative model in which GP behavior may derive from an MB strategy that occasionally forgets the reward function (‘Forgetful-MB + MF + AP’), allowing this forgetting to occur at different rates during reward and punishment goals. This model-based agent includes two additional parameters, fR and fP, which govern the probability of forgetting the presented reward function on reward pursuit trials and punishment avoidance trials, respectively. Thus, on each trial, the model replaces the instructed goal with the opposite goal (e.g., if the actual goal was [−1, 0], the participant used [0, 1]) with some fixed probability (either fR or fP, depending on the trial type), as sketched below. We again found that this model fit worse than the winning model, confirming that a model in which the model-based controller forgets the current reward function at different rates on reward and punishment trials does not account as well for our results supporting GP.
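The following is a minimal Python sketch of this forgetting mechanism as described above; the goal vector coding, function names, and parameter values are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

# Sketch of the forgetting step in the 'Forgetful-MB + MF + AP' model: with
# probability fR (reward pursuit trials) or fP (punishment avoidance trials),
# the instructed reward function is replaced by the opposite goal's reward
# function before model-based planning.

REWARD_PURSUIT = np.array([0.0, 1.0])         # assumed coding of the reward pursuit goal
PUNISHMENT_AVOIDANCE = np.array([-1.0, 0.0])  # assumed coding of the punishment avoidance goal

def effective_reward_function(instructed_goal, f_R, f_P, rng=np.random):
    """Return the reward function actually used for planning on this trial."""
    if np.array_equal(instructed_goal, REWARD_PURSUIT):
        forget_prob, opposite = f_R, PUNISHMENT_AVOIDANCE
    else:
        forget_prob, opposite = f_P, REWARD_PURSUIT
    # With probability forget_prob, the instructed goal is forgotten and the
    # opposite goal's reward function is used instead.
    return opposite if rng.random() < forget_prob else instructed_goal

# Example: on a punishment avoidance trial with fP = 0.2, the agent plans with
# the reward pursuit function on roughly 20% of trials.
used = effective_reward_function(PUNISHMENT_AVOIDANCE, f_R=0.1, f_P=0.2)
```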