Medicine

Research: Adequate statistical power in clinical trials is associated with the combination of a male first author and a female last author

University Medical Center Utrecht/Utrecht University, Netherlands
VU University, Netherlands
University Medical Center Utrecht/Utrecht University, The Netherlands

Jun 5, 2018

Open access

Abstract

Clinical trials have a vital role in ensuring the safety and efficacy of new treatments and interventions in medicine. A key characteristic of a clinical trial is its statistical power. Here we investigate whether the statistical power of a trial is related to the gender of first and last authors on the paper reporting the results of the trial. Based on an analysis of 31,873 clinical trials published between 1974 and 2017, we find that adequate statistical power was most often present in clinical trials with a male first author and a female last author (20.6%, 95% confidence interval 19.4-21.8%), and that this figure was significantly higher than the percentage for other gender combinations (12.5-13.5%; P<0.0001). The absolute number of female authors in clinical trials gradually increased over time, with the percentage of female last authors rising from 20.7% (1975-85) to 28.5% (after 2005). Our results demonstrate the importance of gender diversity in research collaborations and emphasize the need to increase the number of women in senior positions in medicine.

https://doi.org/10.7554/eLife.34412.001

Introduction

There is increasing awareness that many clinical trials have systematic methodological flaws and that their results may be biased, exaggerated, and difficult to reproduce (Ioannidis et al., 2014). Clinical trials are the result of complex group efforts. Male and female researchers differ in their collaborative strategies, which depend on their level of expertise and on whether they hold a junior or senior position (Zeng et al., 2016; Bozeman and Gaughan, 2011). There are indications that mixed-gender teams may make the best use of personal knowledge and skills (Nielsen et al., 2017), an effect also reported in a scientific research context (Woolley et al., 2010; Campbell et al., 2013). Even though this may in turn positively influence the quality of clinical research (Nielsen et al., 2017), no studies have systematically investigated whether collaborations between male and female researchers affect the quality of clinical trials. This topic is important in light of the diversity challenges that currently exist in the biomedical research field (Valantine and Collins, 2015).

In this study, we therefore aimed to quantify the effect of collaborations across gender combinations of junior and senior authors on the methodological quality of clinical trials. To this end, we determined the percentage of adequately powered trials among 31,873 clinical trials published between 1974 and 2017, based on Cochrane meta-analyses. As statistical power reflects the chance of detecting a true effect, it is regarded as one of the key elements of responsible research (Button et al., 2013) and is considered essential in reproducible clinical research (Halpern et al., 2002). We found that the probability of having adequate statistical power for one combination (male first author, female last author) was significantly higher than that for the other three possible combinations. Moreover, this effect was present across countries and most medical fields.

Results

Statistical power and gender combinations in all clinical trials (N=31,873)

In our 31,873 trials, the percentage of published clinical trials with adequate statistical power (>80%) was generally low (12–13%; Figure 1A, left panel). The exception was the set of trials with a male first author combined with a female last author, in which 20.6% of outcomes were adequately powered (CI 19.4–21.8). This percentage was significantly higher than in the three other combinations (highest odds ratio 2.08, CI 1.87–2.30, P<0.0001). Cut-off values for adequate power set to either 70% or 90% yielded comparable results (P<0.0001; Figure 1B). The percentage of adequately powered trials in which the gender combination was unknown was 13.8% (CI 13.6–14.1; Figure 1C). Irrespective of the gender of the first author, clinical trials with female last authors were more often adequately powered than trials with male last authors: 16.6% (CI 15.9–17.4) versus 12.9% (CI 12.6–13.3; Figure 2). The average statistical power of clinical trials with missing gender was comparable to that of trials with known gender combinations (Figures 1C and 2). Slightly higher odds for adequately powered trials were also found for the author combinations ‘both males’ and ‘female – male (last)’ in comparison to the reference group ‘both females’: odds ratios 1.28 (CI 1.17–1.41, P<0.0001) and 1.25 (CI 1.13–1.39, P<0.0001), respectively (Table 1). In the sensitivity analysis, model estimates were slightly lower (relative estimate difference 2.3% to 4.8%; Table 2).

Figure 1

Percentage of adequately powered trials for the four different gender combinations of first and last author.

(A) Percentage of trials with power > 0.8 published between 1974 and 2017 for the four gender combinations (left panel) and for four periods (1975–1985; 1985–1995; 1995–2005; >2005) during this time (right panel). (B) Percentage of trials published with power > 0.7 (*left*) and power > 0.9 (*right*) for the four gender combinations. (C) Percentage of trials with power > 0.8 for the four gender combinations, including the trials where gender could not be determined for the first and/or last author (‘unknown’). Error bars represent the 95% confidence interval for proportions for all panels.

https://doi.org/10.7554/eLife.34412.002

Figure 2

Percentage of adequately powered trials when the gender of the first and last author is male, female or unknown.

*Left*: Percentage of trials with power > 0.8 plotted for the gender of the first author (top) and the last author (bottom). *Right*: Percentage of trials with power > 0.8 plotted for four periods (1975–1985; 1985–1995; 1995–2005; >2005) for the gender of the first author (top) and the last author (bottom). Error bars represent the 95% confidence interval for proportions for all panels.

https://doi.org/10.7554/eLife.34412.003

Table 1

Model estimates for the variables fitted against adequately powered trials.

https://doi.org/10.7554/eLife.34412.004

Variables	Odds ratio	95% CI lower	95% CI upper	Z value	P value
Author combination
Both females	1.00 (ref.)
Both males	1.28	1.17	1.41	5.22	<0.0001
Female - male (last)	1.25	1.13	1.39	4.29	<0.0001
Male - female (last)	2.08	1.87	2.30	13.94	<0.0001
Time
Publication year	1.03	1.02	1.03	12.05	<0.0001
Country group
Anglosphere	1.00 (ref.)
Europe	0.76	0.71	0.81	8.87	<0.0001
Non-western	0.87	0.80	0.94	3.69	<0.0001
Medical discipline
Allergy & intolerance	1.00 (ref.)
Blood disorders	0.45	0.34	0.62	5.11	<0.0001
Child health	0.47	0.36	0.61	5.68	<0.0001
Complementary medicine	0.23	0.17	0.31	9.14	<0.0001
Consumer strategies	0.66	0.41	1.03	1.80	0.072
Dentistry & oral health	1.05	0.68	1.59	0.21	0.832
Developmental problems	0.69	0.47	1.00	1.98	0.048
Ear, nose & throat	0.37	0.24	0.55	4.77	<0.0001
Effective health systems	0.75	0.53	1.07	1.57	0.115
Endocrine & metabolic	0.29	0.20	0.42	6.51	<0.0001
Eyes & vision	0.56	0.38	0.81	3.02	0.003
Gastroenterology & hepatology	0.49	0.38	0.65	5.07	<0.0001
Genetic disorders	0.19	0.12	0.30	7.09	<0.0001
Gynaecology	0.69	0.52	0.92	2.58	0.01
Health & safety at work	0.24	0.13	0.42	4.74	<0.0001
Heart & circulation	0.29	0.22	0.39	8.24	<0.0001
Infectious disease	0.61	0.47	0.80	3.62	<0.0001
Kidney disease	0.80	0.58	1.12	1.28	0.201
Lungs & airways	0.35	0.27	0.46	7.56	<0.0001
Mental health	0.53	0.40	0.71	4.40	<0.0001
Neonatal care	0.47	0.34	0.64	4.68	<0.0001
Neurology	0.56	0.42	0.74	4.08	<0.0001
Orthopaedics & trauma	0.79	0.60	1.05	1.63	0.103
Pain & anaesthesia	0.64	0.49	0.84	3.23	0.001
Pregnancy & childbirth	0.58	0.44	0.77	3.76	<0.0001
Public health	1.23	0.78	1.92	0.89	0.372
Rheumatology	0.75	0.57	1.00	2.02	0.043
Skin disorders	0.89	0.65	1.23	0.69	0.488
Tobacco, drugs & alcohol	0.34	0.26	0.46	7.39	<0.0001
Urology	1.04	0.74	1.45	0.21	0.834
Wounds	0.36	0.21	0.61	3.73	<0.0001

Table 2

Model estimates from the sensitivity analysis (with individual countries) for the variables fitted against adequately powered trials.

https://doi.org/10.7554/eLife.34412.005

Variables	Odds ratio	95% CI lower	95% CI upper	Z value	P value
Author combination
Both females	1.00 (ref.)
Both males	1.25	1.13	1.37	4.58	<0.001
Female - male (last)	1.19	1.07	1.32	3.28	0.001
Male - female (last)	1.98	1.78	2.19	12.95	<0.001
Time
Publication year	1.02	1.02	1.03	14.5	<0.001
Country
Argentina	1.00 (ref.)
Australia	0.79	0.52	1.19	1.12	0.261
Austria	1.31	0.84	2.02	1.19	0.232
Bangladesh	3.29	2.00	5.41	4.69	<0.001
Belgium	0.94	0.61	1.45	0.29	0.775
Brazil	0.98	0.63	1.51	0.10	0.92
Canada	1.16	0.78	1.72	0.72	0.474
Chile	0.74	0.39	1.42	0.89	0.371
China	1.20	0.8	1.81	0.87	0.383
Colombia	1.95	1.17	3.26	2.55	0.011
Costa Rica	0.00	0.00	Inf	0.14	0.891
Croatia	0.47	0.22	1.03	1.88	0.06
Czech Republic	0.71	0.45	1.13	1.45	0.147
Denmark	1.24	0.82	1.87	1.03	0.303
Egypt	1.78	1.13	2.79	2.50	0.013
Finland	0.88	0.58	1.32	0.63	0.527
France	0.91	0.61	1.37	0.44	0.663
Gambia	1.05	0.56	1.99	0.16	0.87
Germany	0.90	0.6	1.34	0.53	0.593
Ghana	0.84	0.48	1.48	0.61	0.544
Greece	0.46	0.29	0.75	3.12	0.002
Hong Kong	1.37	0.89	2.11	1.44	0.15
Hungary	2.87	1.75	4.7	4.18	<0.001
India	0.89	0.58	1.35	0.56	0.573
Indonesia	0.71	0.34	1.48	0.93	0.354
Iran	1.14	0.73	1.79	0.59	0.557
Ireland	0.80	0.49	1.32	0.87	0.387
Israel	0.80	0.51	1.26	0.98	0.328
Italy	1.03	0.69	1.53	0.15	0.881
Japan	0.35	0.22	0.53	4.83	<0.001
Jordan	3.91	2.09	7.32	4.27	<0.001
Kenya	0.42	0.18	1.00	1.97	0.049
Korea	1.56	1.02	2.39	2.07	0.038
Lebanon	1.36	0.75	2.46	1.01	0.311
Malawi	0.12	0.03	0.52	2.83	0.005
Malaysia	0.78	0.34	1.79	0.59	0.552
Mali	0.75	0.29	1.91	0.61	0.543
Mexico	1.07	0.62	1.85	0.25	0.8
Netherlands	0.71	0.47	1.07	1.62	0.106
New Zealand	1.28	0.76	2.14	0.94	0.349
Nigeria	1.32	0.70	2.48	0.87	0.386
Norway	0.89	0.56	1.41	0.49	0.624
Pakistan	0.93	0.48	1.83	0.20	0.844
Papua New Guinea	0.00	0.00	Inf	0.10	0.918
Peru	0.99	0.57	1.7	0.04	0.967
Poland	0.39	0.22	0.68	3.29	0.001
Portugal	3.17	1.84	5.45	4.17	<0.001
Qatar	0.00	0.00	Inf	0.11	0.916
Saudi Arabia	0.54	0.30	0.98	2.02	0.043
Singapore	1.14	0.68	1.93	0.50	0.614
Slovenia	0.00	0.00	Inf	0.11	0.91
South Africa	1.24	0.79	1.96	0.93	0.355
Spain	1.08	0.71	1.62	0.35	0.73
Sweden	1.24	0.83	1.85	1.03	0.301
Switzerland	0.66	0.43	1.02	1.89	0.059
Taiwan	0.45	0.29	0.71	3.43	0.001
Thailand	1.53	0.99	2.37	1.93	0.053
Turkey	0.64	0.42	0.98	2.06	0.039
Uganda	1.27	0.56	2.88	0.58	0.56
UK	1.25	0.84	1.85	1.10	0.273
USA	1.42	0.96	2.10	1.78	0.076
Venezuela	5.25	3.22	8.54	6.67	<0.001
Vietnam	0.00	0.00	Inf	0.12	0.907
Zimbabwe	1.93	0.90	4.12	1.70	0.089
Other countries	0.75	0.48	1.17	1.28	0.201
Medical discipline
Allergy & intolerance	1.00 (ref.)
Blood disorders	0.49	0.39	0.63	5.79	<0.001
Child health	0.55	0.45	0.67	5.87	<0.001
Complementary medicine	0.26	0.20	0.33	11.21	<0.001
Consumer strategies	0.94	0.66	1.34	0.35	0.73
Dentistry & oral health	1.43	1.07	1.92	2.41	0.016
Developmental problems	0.78	0.58	1.05	1.64	0.101
Ear, nose & throat	0.51	0.39	0.68	4.66	<0.001
Effective health systems	0.85	0.63	1.14	1.11	0.269
Endocrine & metabolic	0.4	0.30	0.53	6.58	<0.001
Eyes & vision	0.51	0.38	0.70	4.27	<0.001
Gastroenterology & hepatology	0.56	0.46	0.69	5.41	<0.001
Genetic disorders	0.29	0.20	0.42	6.51	<0.001
Gynaecology	0.82	0.66	1.01	1.84	0.066
Health & safety at work	0.54	0.37	0.79	3.16	0.002
Heart & circulation	0.34	0.27	0.43	9.43	<0.001
Infectious disease	0.8	0.65	0.99	2.09	0.036
Kidney disease	0.71	0.55	0.92	2.59	0.01
Lungs & airways	0.47	0.38	0.58	7.04	<0.001
Mental health	0.6	0.48	0.75	4.58	<0.001
Neonatal care	0.38	0.29	0.48	7.81	<0.001
Neurology	0.7	0.57	0.87	3.26	0.001
Orthopaedics & trauma	1.18	0.96	1.46	1.56	0.12
Pain & anaesthesia	0.73	0.60	0.90	2.92	0.003
Pregnancy & childbirth	0.69	0.55	0.85	3.40	0.001
Public health	1.72	1.24	2.37	3.27	0.001
Rheumatology	0.97	0.78	1.20	0.31	0.757
Skin disorders	1.26	0.99	1.59	1.89	0.058
Tobacco, drugs & alcohol	0.40	0.32	0.50	7.97	<0.001
Urology	1.27	1.00	1.63	1.92	0.054
Wounds	0.8	0.59	1.08	1.44	0.15

Trends across countries

The world map in Figure 3 shows the geographical distribution of the trials in our sample (based on the affiliation of the first author). The percentage of trials originating from Anglosphere countries (United States, United Kingdom, Canada, Australia and New Zealand) was 46.9%; the percentage from European countries was 32.9%; and the percentage from Non-western countries was 20.2% (with the top five being Turkey, Japan, India, China and Israel). European trials had lower odds of adequate statistical power compared to Anglosphere trials (odds ratio: 0.76, CI 0.71–0.81, P<0.0001; Figure 4A). This was also the case for trials from Non-western countries (odds ratio: 0.87, CI 0.80–0.94, P<0.0001; Table 1). Individual country data from the sensitivity analysis are provided in Table 2.

Figure 3

The proportion of included trials mapped per country on a white to red color scale (range: 0 – 24%).

The highest proportion of first authors were affiliated with an institution in the United States. Countries not present in any affiliation are plotted in gray.

https://doi.org/10.7554/eLife.34412.006

Figure 4

The influence of geography on the percentage of trials that are adequately powered.

(A) Percentage of trials with power > 0.8 for the four gender combinations of first and last author within the three country groups. Error bars represent the 95% confidence interval for proportions. (B) A logistic regression multivariable model (see "Data analysis and statistical model" below) can be used to predict the probability that a trial will have a power above a certain value. Here the predicted probabilities that trials will have power > 0.8 are plotted as a function of year for the four gender combinations in the three country groups. The predicted probabilities are averaged across medical disciplines and plotted as mean and 95% confidence intervals.

https://doi.org/10.7554/eLife.34412.007

Trends over time

The percentage of adequately powered trials with a male first author and a female last author increased over time, and was higher than the percentage for other combinations in the last three decades of the study (Figure 1A, right panel). According to a multivariable logistic regression model (see "Data analysis and statistical model"), the odds of adequate statistical power increased each year (odds ratio: 1.03, CI: 1.02–1.03, P<0.0001; Figure 4B).

Trends across medical fields

The higher percentage of adequately powered clinical trials with a combination of a male first author and a female last author was not restricted to specific medical disciplines, although the effect sizes differed across disciplines (Figure 5). The medical fields with relatively low odds of adequate statistical power in general, as determined with the multivariable model, were ‘complementary medicine’, ‘endocrine & metabolic’, ‘gastroenterology & hepatology’, ‘genetic disorders’, ‘health & safety at work’ and ‘heart & circulation’, all with significant odds ratios below 0.3 compared to the reference field ‘allergy & intolerance’ (Table 1). The fields in which statistical power was most pronouncedly higher for the male first and female last author combination were ‘pregnancy & childbirth’, ‘gynaecology’, ‘lungs & airways’, ‘gastroenterology & hepatology’ and ‘tobacco, drugs & alcohol’. The total number of trials was not equally distributed over the four gender combinations. Most trials were published by the male–male author combination (Figure 6), and this inequality in the gender of authors was found across major medical disciplines (Figure 6). Nevertheless, the percentage of clinical trials with a male first and last author decreased from 64.8% in the period 1975–1985 (CI 61.9–67.6) to 49.0% after 2005 (CI 48.4–49.6; Figure 7).

Figure 5

Percentage of adequately powered trials, for the four gender combinations of the first and the last author, within 21 major medical disciplines.

Error bars represent the 95% confidence interval for proportions.

https://doi.org/10.7554/eLife.34412.008

Figure 6

The percentage of the total number of trials underlying the four gender combinations within 21 major medical disciplines.

Error bars represent the 95% confidence interval for proportions.

https://doi.org/10.7554/eLife.34412.009

Figure 7

The percentage of trials for the different gender combinations and periods studied.

*Left:* The number and percentage of trials underlying the power calculations for the four gender combinations. *Right*: The corresponding percentage of the total number of trials underlying the four gender combinations for the four periods studied (1975–1985; 1985–1995; 1995–2005; >2005). Error bars represent the 95% confidence interval for proportions for both panels.

https://doi.org/10.7554/eLife.34412.010

Correction for potential confounders

To correct for the potential confounders country, year of publication and medical discipline, logistic regression was performed. The linear combination of the variables ‘author combination’, ‘year of publication’, ‘country’ and ‘medical discipline’ explained the presence or absence of adequate statistical power well in a multivariable logistic regression model (χ² = 1146.5, 36 degrees of freedom, P<0.0001). The four author combinations were overall different from each other (χ² = 440.5, 4 degrees of freedom, P<0.0001). The model estimates are provided in Table 1. A sensitivity analysis with ‘country’ defined as individual countries rather than groups of countries did not significantly change the model estimates of the other variables (Table 2). The sensitivity model also explained the presence or absence of adequate statistical power very well (χ² = 3638.6, 101 degrees of freedom, P<0.0001). The four author combinations in the sensitivity analysis were also overall different from each other (χ² = 488.2, 4 degrees of freedom, P<0.0001).

Discussion

The analysis of 31,873 clinical trials published between 1974 and 2017 demonstrates that adequately powered clinical trials are relatively more often published by a combination of a male first author and a female last author than by other gender combinations. This effect was robust: it was present across countries and most medical fields. Even though the average statistical power was generally low, the overall percentage of adequately powered trials increased slightly over the past four decades.

In line with recent literature (West et al., 2013), the absolute number of clinical trials published by female authors remained relatively low, even though it increased over time. The effects of equal representation of male and female scientists are not only important to better understand the success of collaborative efforts, but are also pressing in light of the persistent gender gap in medicine. Despite improvements, female scientists continue to face unequal pay (Rimmer, 2017) and funding disparities (Shen, 2013), and remain underrepresented in clinical medicine in terms of clinical faculty positions and first author publications (Jagsi et al., 2006; Reed et al., 2011), even though gains in participation have been made in recent years (Filardo et al., 2016). Independent of gender, the overall percentage of adequately powered clinical trials was disappointingly low, notwithstanding the fact that the practice of conducting clinical trials with low statistical power has been denounced for a long time (Halpern et al., 2002; Ioannidis, 2005). On a more positive note, the percentage of adequately powered trials did increase slightly over the past four decades. A possible reason for this increase is the obligation to register clinical trials (e.g., on platforms like clinicaltrials.gov). This may have led to more pre-registrations and to research protocols of higher quality, with a stronger commitment to the original research plan and proposed sample size.

Our results support previous reports that gender differences exist and may influence the quality of clinical trials (Campbell et al., 2013; Nielsen et al., 2017). Trial quality may also be influenced by patterns in collaboration style, as differences exist between men and women in mixed-sex interactions (Balliet et al., 2011). Firm evidence on the influence of collaborative styles is still lacking (Zeng et al., 2016; Araújo et al., 2017), and the impact of social behavior between clinical researchers on trial outcomes, particularly in relation to gender, is still a largely unexplored area. It is important to note that not all studies have found convincing evidence for gender differences in science (Hyde, 2005), for example with regard to bias (Fanelli et al., 2017). From our results, it could be hypothesized that collaborations between male and female researchers are beneficial with respect to cross-fertilization, team productivity and research efficacy. However, our understanding of the social and gender-related factors that underlie clinical trial quality is probably still limited, which is underlined by our finding that the statistical power of trials is relatively low when both first and last author are female.

Because our analyses are based on a comprehensive body of clinical trials published over a 40-year period, across medical fields, the results provide a representative picture of the relation between gender collaborations and statistical power. Nonetheless, there are several limitations. First, we only investigated one aspect of methodological rigor. Even though statistical power is an important sign of sound trial conduct, other domains, including pre-post registration mismatch and other sources of bias, also determine methodological rigor. These parameters, however, are more difficult to quantify. Second, the gender of the first and last author could not be determined for most included clinical trials (almost 70%, see flow diagram), as not all first name records were available. However, the statistical power of trials with missing gender data was not different from that of clinical trials with known gender. Third, first and last authorships only provide a relatively rough proxy for junior and senior positions. The actual hierarchical relations in a clinical trial may differ in a subset of cases: for instance, in some disciplinary fields authors are listed alphabetically, or the persons in charge of the actual conduct of the clinical trial in daily practice are not the last author on the resulting publication. Fourth, we have only included clinical trials, and although these results can be extrapolated to other types of research, other research types and other academic disciplinary fields may have other unwritten rules for determining the authors’ positions on a paper. Fifth, we do not have data on the gender of the authors between the first and last author, which may influence collaboration patterns within and between research groups.

Even though adequate power in clinical trials is of vital importance (Ioannidis, 2014), future studies on gender collaborations should also take other methodological outcomes into account, such as the risk of bias and deviations from the pre-registered protocol. Also, to further determine how the gender of a researcher affects scientific methodological quality, a more qualitative research design would be necessary to explore on a deeper level why the methodological quality of clinical trials depends on the gender of researchers and clinicians. This would include interviews and observation studies of clinical trial teams with men and women in leadership positions. Our findings demonstrate the importance of gender diversity in research collaborations and emphasize the need for more women in senior positions in medicine (Nature, 2013).

Materials and methods

Selection of clinical trials

The selection of trials for this analysis is shown in a flow chart (Figure 8). Clinical trials were extracted from the Cochrane Database of Systematic Reviews (CDSR). These reviews cover all medical fields and have high quality standards and methodological rigor with elaborate search protocols, and rigorously identify and summarize comparable trials (Jørgensen et al., 2006). Moreover, these reviews perform meta-analyses on individual clinical trials to generate an estimated effect size of interventions. All clinical trials with a PubMed ID included in a systematic review published in the second issue of the 2017 CDSR were extracted using an in-house developed, open-source Cochrane Library website parser. For each individual clinical trial, we extracted publication year, outcome estimates (odds or risk ratio, risk difference or standardized mean difference), and Cochrane’s medical discipline classifications. Only the subset of trials for which the first names of both the first and the last author were reported was included in the analysis.

Figure 8

Flow diagram of the 31,873 trials selected for analysis.

Trials were analyzed if they were published after 1974, were included in a significant meta-analysis in a systematic review, and had extractable gender data for both the first and the last author.

https://doi.org/10.7554/eLife.34412.011

Statistical power of individual clinical trials

Statistical power was assessed in clinical trials, published after 1974, which were included in a Cochrane meta-analysis with a significant overall estimate (i.e., a meta-analytic P-value of <0.05). All data and scripts are available via the Open Science Framework (WM Otte, Temporal RCT power, Open Science Framework, https://osf.io/ud2jw/. Update 17-03-04 11:19 AM). We included only significant meta-analyses to exclude bias from interventions with no proven effects. In other words, if the confidence interval of a meta-analysis contains 0, the point estimate of the overall effect size is neither reliable nor known and may not be used to estimate the power of the individual studies included in that meta-analysis. Nevertheless, inclusion of non-significant meta-analyses did not affect our findings (data not shown).

The power of an individual clinical trial was calculated from the sample sizes in both trial arms, using a 5% α threshold and the meta-analytic estimate as an approximation of the true effect size. Trials with a statistical power lower than 80% were considered to be underpowered based on historical arguments (Moher et al., 1994). This cut-off is standard but also relatively arbitrary. We therefore also performed analyses using a less conservative and a more conservative cut-off of 70% and 90%, respectively. The statistical power is presented in all plots with 95% confidence intervals determined with Wilson’s score method (Wilson, 1927).
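The two calculations described above can be illustrated with a minimal sketch. It is not the analysis code of this study (which is available via the Open Science Framework) and rests on assumptions: the effect is taken to be a standardized mean difference, power is obtained from a normal approximation to a two-sided test, and the function names `power_two_arm` and `wilson_interval` are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_arm(effect_size: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power of a two-arm trial.

    Assumption: effect_size is a standardized mean difference (for example
    the meta-analytic estimate) and the test statistic is approximately
    normal with standard error sqrt(1/n1 + 1/n2).
    """
    se = math.sqrt(1.0 / n1 + 1.0 / n2)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(effect_size) / se
    # Probability that the statistic falls in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a proportion (successes out of n)."""
    z = _STD_NORMAL.inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical trial: 64 participants per arm, assumed true effect of 0.5.
power = power_two_arm(0.5, 64, 64)
adequately_powered = power > 0.80

# Hypothetical count: 10 adequately powered trials out of 100.
low, high = wilson_interval(10, 100)
```

Under this approximation, the hypothetical trial with 64 participants per arm and an assumed effect of 0.5 sits just above the 80% cut-off. Other outcome types, such as odds ratios or risk differences, would need their own standard error formula.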

Gender extraction

All included trials had multiple authors. We considered the first author of a clinical trial publication to be a junior researcher and the last author to be a senior researcher. This assumption most likely reflects the hierarchical relationship in the majority of cases. Placing the senior author last in publications has long been the practice in medicine. Typically, the person conducting the practical research, analyzing the data, and drafting the first manuscript is the first author, while the last author is the senior researcher responsible for overall oversight.

For the gender of authors, first names were extracted for the first and the last author for all included clinical trials using the online interface PubReMiner (http://hgserver2.amc.nl/cgi-bin/miner/miner2.cgi) (Slater, 2014). First names were then converted to male and female probabilities with the application programming interface (API) of Genderize (http://genderize.io/). This API compares first names against a database containing over 216,000 distinct names from 79 countries and 89 languages based on millions of public profiles and their gender data in major social networks. Accuracy of female and male classification with this API, compared with open-source gender prediction tools, is excellent (Wais, 2016). A recent validation study reported female and male classification precisions of 95% and 98%, respectively (Karimi et al., 2016). Gender probabilities were dichotomized to obtain binary male/female labels. Trials with unknown gender data for either the first or last author were not included in the analysis. Missing first names caused most of the unknown genders. For some first names no gender data was available in the gender database (<5%).
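The dichotomization step and the resulting author combinations can be sketched as follows. This is an illustration under assumptions, not the pipeline used in this study: the input is taken to be the probability that a first name is male, the cut-off of 0.5 is an assumed value (the threshold is not stated in the text), and the helper names are hypothetical. No call to the Genderize API is made here.

```python
from typing import Optional


def dichotomize(prob_male: Optional[float], threshold: float = 0.5) -> Optional[str]:
    """Turn a probability of 'male' into a binary label.

    None stands for a missing first name or a name without gender data.
    The 0.5 threshold is an assumption made for this sketch.
    """
    if prob_male is None:
        return None
    return "male" if prob_male >= threshold else "female"


def author_combination(first_prob_male: Optional[float],
                       last_prob_male: Optional[float]) -> str:
    """Label a trial by the gender combination of its first and last author."""
    first = dichotomize(first_prob_male)
    last = dichotomize(last_prob_male)
    if first is None or last is None:
        # Such trials are excluded from the main analysis.
        return "unknown"
    if first == last:
        return "Both males" if first == "male" else "Both females"
    return f"{first.capitalize()} - {last} (last)"


# Hypothetical probabilities for three trials.
labels = [
    author_combination(0.98, 0.03),
    author_combination(0.10, 0.20),
    author_combination(None, 0.90),
]
```

The labels follow the wording of Table 1, so they could be used directly as levels of the ‘author combination’ factor.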

Data analysis and statistical model

Clinical trials with adequate statistical power, more than 80%, were identified for all four combinations of the gender of first and last author (i.e. female–female, male–male, female–male and male–female).

To correct for potential cultural differences, we determined the author’s institutional country based on the given affiliation. We only determined this for the first author, as affiliations for co-authors have been added to the PubMed database only since 2014. We categorized the countries into three main groups based on prevalence. The Anglosphere countries are those where English is the main native language. The European countries, excluding the United Kingdom but including Ireland, were categorized in a second group. The remaining countries were labeled as Non-western.
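The grouping rule can be written down as a small lookup. The sketch below is illustrative only: the Anglosphere set follows the five countries named in the Results, whereas the European set is a deliberately incomplete, assumed subset, because the full list of countries per group is not given in the text. The names `ANGLOSPHERE`, `EUROPE_EXAMPLES` and `country_group` are hypothetical.

```python
# The five Anglosphere countries named in the Results section.
ANGLOSPHERE = {"United States", "United Kingdom", "Canada", "Australia", "New Zealand"}

# Assumed, incomplete subset for illustration; Ireland is grouped with
# Europe, as stated in the text.
EUROPE_EXAMPLES = {"Netherlands", "Germany", "France", "Ireland", "Italy", "Spain"}


def country_group(country: str) -> str:
    """Assign the country of the first author's affiliation to a group."""
    if country in ANGLOSPHERE:
        return "Anglosphere"
    if country in EUROPE_EXAMPLES:
        return "Europe"
    # With the incomplete European subset above, European countries that
    # are not listed would wrongly end up here; a full list is needed
    # for real use.
    return "Non-western"


groups = [country_group(c) for c in ("United Kingdom", "Ireland", "Japan")]
```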

We classified the trials using the 21 standard Cochrane major medical discipline classifications. To exclude selection bias, statistical power was determined for all clinical trials with missing gender data. To ascertain that results were not due to disproportionate female underrepresentation in older trials, the absolute number of clinical trials for the four different gender combinations was also calculated.

The data were modeled with logistic regression. In this model, the log odds of the dichotomous outcome variable, namely trial ‘adequate power’, were modeled as a linear combination of predictor variables. We used the glm function in R software version 3.2.0. The variable ‘author combination’ was added as a factor to the model, with the author combination ‘both females’ as the reference group. The three covariates included were ‘publication year’, ‘country’ and ‘medical field’. The model fit was investigated with the significance of the overall model. This χ² test determines whether the model with predictors fits significantly better than a so-called null model with just an intercept. The 95% confidence intervals for the estimated coefficients were determined with the profiled log-likelihood function (Venzon and Moolgavkar, 1988). The estimates were exponentiated to interpret them as odds ratios. The overall effect of ‘author combination’ in the model was tested with the Wald test. We determined the model’s predicted probabilities and their 95% confidence intervals over time. We considered a P-value<0.005 as significant (Benjamin et al., 2018). We performed a sensitivity analysis in which the ‘country’ variable was specified not as three main categories but as individual countries, provided that a minimum of fifty entries per country were present.
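For orientation, the model described in this section can be written out as below. The notation is an assumption introduced here and does not come from the original analysis: \(p_i\) denotes the probability that trial \(i\) has power > 0.8, and each categorical coefficient is zero for its reference level (for example ‘both females’).

```latex
\[
\log\frac{p_i}{1-p_i}
  = \beta_0
  + \beta_{\mathrm{comb}[i]}
  + \beta_{\mathrm{year}}\,\mathrm{year}_i
  + \beta_{\mathrm{country}[i]}
  + \beta_{\mathrm{field}[i]},
\qquad
\mathrm{OR} = e^{\beta}
\]
```

Read this way, the odds ratio of 2.08 reported in Table 1 for ‘male - female (last)’ corresponds to a coefficient of roughly ln 2.08 ≈ 0.73 on the log-odds scale. Likewise, an odds ratio of 1.03 per publication year would compound to roughly 1.03^10 ≈ 1.34 over a decade, if the rounded estimate is taken at face value.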

Data sharing

Open-source code to reproduce our processing pipeline is provided via the Open Science Framework (WM Otte, Temporal RCT power, Open Science Framework, https://osf.io/ud2jw/. Update 17-03-04 11:19 AM). Data extraction from the Cochrane Database of Systematic Reviews requires full text access.

Data availability

Data and scripts are available via the Open Science Framework (https://osf.io/ud2jw/). Parsing of genders can be done via genderize.io.

The following previously published data sets were used

1. Willem M. Otte
(2017) Temporal RCT power
Available at Open Science Framework.

https://osf.io/ud2jw/

References

(2017) Gender differences in scientific collaborations: Women are more egalitarian than men
PLoS One 12:e0176791.

https://doi.org/10.1371/journal.pone.0176791
(2011) Sex differences in cooperation: a meta-analytic review of social dilemmas
Psychological Bulletin 137:881–909.

https://doi.org/10.1037/a0025354
1. Benjamin DJ
2. Berger JO
3. Johannesson M
4. Nosek BA
5. Wagenmakers E-J
6. Berk R
7. Bollen KA
8. Brembs B
9. Brown L
10. Camerer C
11. Cesarini D
12. Chambers CD
13. Clyde M
14. Cook TD
15. De Boeck P
16. Dienes Z
17. Dreber A
18. Easwaran K
19. Efferson C
20. Fehr E
21. Fidler F
22. Field AP
23. Forster M
24. George EI
25. Gonzalez R
26. Goodman S
27. Green E
28. Green DP
29. Greenwald AG
30. Hadfield JD
31. Hedges LV
32. Held L
33. Hua Ho T
34. Hoijtink H
35. Hruschka DJ
36. Imai K
37. Imbens G
38. Ioannidis JPA
39. Jeon M
40. Jones JH
41. Kirchler M
42. Laibson D
43. List J
44. Little R
45. Lupia A
46. Machery E
47. Maxwell SE
48. McCarthy M
49. Moore DA
50. Morgan SL
51. Munafó M
52. Nakagawa S
53. Nyhan B
54. Parker TH
55. Pericchi L
56. Perugini M
57. Rouder J
58. Rousseau J
59. Savalei V
60. Schönbrodt FD
61. Sellke T
62. Sinclair B
63. Tingley D
64. Van Zandt T
65. Vazire S
66. Watts DJ
67. Winship C
68. Wolpert RL
69. Xie Y
70. Young C
71. Zinman J
72. Johnson VE
(2018) Redefine statistical significance
Nature Human Behaviour 2:6–10.

https://doi.org/10.1038/s41562-017-0189-z
1. Bozeman B
2. Gaughan M
(2011) How do men and women differ in research collaborations? an analysis of the collaborative motives and strategies of academic researchers
Research Policy 40:1393–1402.

https://doi.org/10.1016/j.respol.2011.07.002
1. Button KS
2. Ioannidis JP
3. Mokrysz C
4. Nosek BA
5. Flint J
6. Robinson ES
7. Munafò MR
(2013) Power failure: why small sample size undermines the reliability of neuroscience
Nature Reviews Neuroscience 14:365–376.

https://doi.org/10.1038/nrn3475
(2013) Gender-heterogeneous working groups produce higher quality science
PLoS One 8:e79147.

https://doi.org/10.1371/journal.pone.0079147
(2017) Meta-assessment of bias in science
PNAS 114:3714–3719.

https://doi.org/10.1073/pnas.1618569114
1. Filardo G
2. da Graca B
3. Sass DM
4. Pollock BD
5. Smith EB
6. Martinez MA
(2016) Trends and comparison of female first authorship in high impact medical journals: observational study (1994-2014)
BMJ 352:i847.

https://doi.org/10.1136/bmj.i847
(2002) The continuing unethical conduct of underpowered clinical trials
JAMA 288:358–362.

https://doi.org/10.1001/jama.288.3.358
1. Hyde JS
(2005) The gender similarities hypothesis
American Psychologist 60:581–592.

https://doi.org/10.1037/0003-066X.60.6.581
1. Ioannidis JP
2. Greenland S
3. Hlatky MA
4. Khoury MJ
5. Macleod MR
6. Moher D
7. Schulz KF
8. Tibshirani R
(2014) Increasing value and reducing waste in research design, conduct, and analysis
The Lancet 383:166–175.

https://doi.org/10.1016/S0140-6736(13)62227-8
1. Ioannidis JP
(2005) Why most published research findings are false
PLoS Medicine 2:e124.

https://doi.org/10.1371/journal.pmed.0020124
1. Ioannidis JP
(2014) How to make more published research true
PLoS Medicine 11:e1001747.

https://doi.org/10.1371/journal.pmed.1001747
(2006) Cochrane reviews compared with industry supported meta-analyses and other meta-analyses of the same drugs: systematic review
BMJ 333:782.

https://doi.org/10.1136/bmj.38973.444699.0B
1. Jagsi R
2. Guancial EA
3. Worobey CC
4. Henault LE
5. Chang Y
6. Starr R
7. Tarbell NJ
8. Hylek EM
(2006) The "gender gap" in authorship of academic medical literature--a 35-year perspective
New England Journal of Medicine 355:281–287.

https://doi.org/10.1056/NEJMsa053910
(2016) Inferring gender from names on the web: a comparative evaluation of gender detection methods
WWW’16 Companion pp. 53–54.

https://doi.org/10.1145/2872518.2889385
(1994) Statistical power, sample size, and their reporting in randomized controlled trials
JAMA 272:122–124.

https://doi.org/10.1001/jama.1994.03520020048013
1. Nature
(2013) Science for all
Nature 495:5.
(2017) Opinion: Gender diversity leads to better science
PNAS 114:1740–1742.

https://doi.org/10.1073/pnas.1700616114
1. Reed DA
2. Enders F
3. Lindor R
4. McClees M
5. Lindor KD
(2011) Gender differences in academic productivity and leadership appointments of physicians throughout academic careers
Academic Medicine 86:43–47.

https://doi.org/10.1097/ACM.0b013e3181ff9ff2
Website
1. Rimmer A
(2017) The gender pay gap: female doctors still earn a third less than male doctors
BMJ. Accessed October 3, 2017.

http://careers.bmj.com/careers/advice/The_gender_pay_gap%3A_female_doctors_still_earn_a_third_less_than_male_doctors
1. Shen H
(2013) Inequality quantified: Mind the gender gap
Nature 495:22–24.

https://doi.org/10.1038/495022a
1. Slater L
(2014) Product review: pubmed PubReMiner
Journal of the Canadian Health Libraries Association 33:106–107.

https://doi.org/10.5596/c2012-014
1. Valantine HA
2. Collins FS
(2015) National Institutes of Health addresses the science of diversity
PNAS 112:12240–12242.

https://doi.org/10.1073/pnas.1515612112
1. Venzon DJ
2. Moolgavkar SH
(1988) A method for computing profile-likelihood-based confidence intervals
Applied Statistics 37:87–94.

https://doi.org/10.2307/2347496
1. Wais K
(2016) Gender prediction methods based on first names with genderizeR
The R Journal 8:17–37.
1. West JD
2. Jacquet J
3. King MM
4. Correll SJ
5. Bergstrom CT
(2013) The role of gender in scholarly authorship
PLoS One 8:e66212.

https://doi.org/10.1371/journal.pone.0066212
1. Wilson EB
(1927) Probable inference, the law of succession, and statistical inference
Journal of the American Statistical Association 22:209–212.

https://doi.org/10.1080/01621459.1927.10502953
(2010) Evidence for a collective intelligence factor in the performance of human groups
Science 330:686–688.

https://doi.org/10.1126/science.1193147
1. Zeng XH
2. Duch J
3. Sales-Pardo M
4. Moreira JA
5. Radicchi F
6. Ribeiro HV
7. Woodruff TK
8. Amaral LA
(2016) Differences in Collaboration Patterns across Discipline, Career Stage, and Gender
PLoS Biology 14:e1002573.

https://doi.org/10.1371/journal.pbio.1002573

Decision letter

Peter Rodgers

Reviewing Editor; eLife, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Opposite sexes attract: male-female collaborations and the statistical power of 31,873 clinical trials" to eLife for consideration as a Feature Article. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by myself as the eLife Features Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Andreas Neef (Reviewer #1).

The reviewers have discussed the reviews with one another and I have drafted this decision to help you prepare a revised submission.

Summary:

The study presented is an analysis of over 30,000 clinical trials to assess whether gender pairings of first and last authors are associated with sufficient statistical power. The study employs a large data set and uses the API Genderize to assign gender. The authors find that overall a relatively low number of studies had sufficient statistical power (12-13%). Similar to other studies looking at gender pairings, the authors find that most studies are comprised of male first-male last authors pairings. Over the forty years analyzed, the authors observed a gradual increase in the number of female first and last authors in clinical trials. When assessed for sufficient power, studies comprised of a first male author and female last author were significantly more prevalent compared to other gender pairings.

However, a number of concerns need to be addressed before the article can be accepted for publication. In particular, there is a need for more transparency with respect to data inclusion/exclusion, and the authors need to pay more careful attention to potential confounders and possible selection bias (see points 1-8 below). The Discussion section and the references cited also need extensive attention (points 9 and 10). The Title also needs to be revised (point 11).

Essential revisions:

Confounders: I see three possible confounders that may bias the results of this study. Time, discipline-specific traditions and country variation. I will briefly touch upon each of these possible confounders below and suggest a suitable solution.

1) As mentioned by the authors, the underrepresentation of female authors in older trials may bias the results given the general increase in the percentage share of sufficiently powered trials over time. The authors account for this in Figure 1 (right panel) by dividing the results into periodical bins. But what about the possible variation within these ten-year periodical bins? A more refined approach could be to use publication year as a covariate in a binary logistic regression (I return to this).

2) Some medical disciplines may be more likely to conduct sufficiently powered trials than others, irrespective of author-gender composition. It is difficult to infer whether this is the case from Figure 2. But four of the disciplines with the most pronounced higher statistical power for studies with female last authors (gynaecology, pregnancy and childbirth and tobacco and drugs and infectious diseases) also have generally good female author representation as presented in Figure 6. Consequently, the general difference detected in your study may possibly be explained by a higher general proportion of women authors in disciplines that have relatively high general proportions of sufficiently powered studies. Or alternatively, the actual differences may possibly be larger than what you report, if the areas where women authors are well-represented are less likely to reach sufficient statistical power. Again, a more refined approach would be to do a binary logistic regression where you include covariates for the average proportion of female last authors and female authors per discipline.

3) Your study does not adjust for country-variations, which makes sense given your primary emphasis on descriptive statistics. My worry is that women's general representation as trial authors may be higher in countries that score relatively high with respect to proportion of sufficiently powered studies. If, for instance, North America (Canada and US) and Northern Europe have good numbers of women trial authors, while also being "top-performers" with respect to sufficiently powered studies, this may bias your results. Here you could also calculate a covariate adjusting for women's proportional participation as authors in a given geographic region (North America, North-Western Europe, South-East-Asia, Oceania etc.) and use this in a binary log reg.

To briefly summarize I suggest that you adjust for these possible confounders in a binary logistic regression with the outcome variable: 0=insufficiently powered 1= sufficiently powered. The regression could include your author-composition categories + the following covariates: publication year, proportion of women authors in the discipline, proportion of women last authors in the discipline, proportion of women authors in the geographical region where the trial took place, and proportion of women last authors in the geographical region where the trial took place.

4) Possible selection bias: In subsection “Statistical power of individual clinical trials” it is mentioned that you only included significant meta-analyses to exclude bias from interventions with no proven effects. I miss some sort of justification for this choice. Why would interventions with no proven effect bias your results? It would also be useful to know whether the gender composition of authors contributing to studies with no effects is similar to that of the analysed sample. Is female authors' representation higher in the sample with no proven effects relative to the sample used in the analysis?

5) Transparency with respect to data inclusion and exclusion: It would be helpful to have some sort of flow-diagram specifying the data exclusion steps. (1) How many trials did you start out with? (2) were you able to extract the necessary data from all eligible clinical trials identified in Cochrane? (3) How many studies were excluded due to statistical insignificance? (4) How many studies were excluded due to missing gender specification?

6) It would aid interpretation if the breakdown of numbers (studies excluded, numbers per category of pairing, those with sufficient statistical power….) were provided in a Table or in a Figure (similar to how presented in Figure 7).

7) Please also ensure that the following information is available in the manuscript:

- information on the definition of confidence intervals (was bootstrap used?)

- information on the statistical test applied. χ²-test is mentioned, to me it is unclear if multiple comparison correction was used and why the continuity correction has to be used when actually numbers of studies are tested. Also unclear is whether the bimodal distribution of statistical power in studies (see Button et al., 2013) would still have us expect a normal distribution (or hypergeometric) of the number of cases in each class. Even more concerning is the potential internal structure in the dataset: how many different seniors carry the effect? etc.

So to address all this within one analysis: Why not use randomized gender label swapping?

8) A more elaborate description of the gender disambiguation would also strengthen the study. In subsection “Gender extraction” you mention that "gender probabilities were dichotomized to obtain binary male/female labels". I presume this means that your gender algorithm provides numerical name-to-gender accuracy estimations? (please describe in more detail how the algorithm works). If yes, what threshold did you use? It would also strengthen the study if you could manually check a sub-sample of authors to verify the accuracy of the algorithm with the chosen threshold.

9) A number of references (e.g., Helmer et al., 2017; Wuchty, Jones and Uzzi, 2007; Jones, Wuchty and Uzzi, 2008; Woolley et al., 2010 and Rhoten and Pfirman, 2007) are quoted out of context. Woolley et al., 2010 does not focus on research teams: rather, it finds that the collective intelligence of teams increases with women's representation, not with gender diversity. Rhoten and Pfirman, 2007 finds that women (not gender diverse teams) are more likely to do interdisciplinary research.

10) The Discussion section needs to be completely rewritten to focus on: i) how the results in the present manuscript relate to the existing literature (including possible explanations for the results); ii) the shortcomings and caveats associated with the present approach (e.g., just one type of article, just one area of science, just the first and last authors); and iii) open questions for future research.

As mentioned above, please ensure that any papers cited are directly relevant to the subject being discussed.

11) The Title needs to be revised to better reflect the content of the article. Also, please note that punctuation like colons, semi-colons and hyphens/dashes are not allowed in the title of eLife articles.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for submitting a revised version of your manuscript "Adequate statistical power of clinical trials is associated with the combination of a male first author and a female last author". The revised version has been reviewed again by one of the referees who reviewed the original version. Once you have satisfactorily addressed the comments from this reviewer (see below) and some editorial comments from myself (also below), we should be able to accept your article.

The authors have done a good job addressing my main concerns about confounders. The logistic regression analysis has strengthened the study a lot. I think the paper is ready for publication with some revisions.

1) The authors used a categorical variable with three values to adjust for geo-cultural variation. I am not convinced that this is sufficient, and given the size of the data-set, I suggest that they run the same log-reg model with dummy variables for all countries to ensure that potential confounders at the country level are sufficiently adjusted for. If this model leads to the same results, they can choose to report the model as it is given now.

2) I suggest that the authors revise the Results section so that descriptive findings are presented first, followed by the outcomes of the logistic regression analysis. Mixing the results from the different approaches is somewhat confusing. The log-reg should be used to validate the descriptive results.

3) Regarding the Discussion section: I am still not convinced by the arguments about gender differences in collaboration style patterns. It is difficult for me to understand how your research adds to discussions of collaboration style. Collaboration style is a question of process; your study can only contribute to the understanding of outcomes. I would argue that your study contributes to the emerging literature about how gender diversity may influence research outcomes. Here are some references for other relevant studies addressing this issue (one of which – Campbell et al., 2013 – is already cited):

# Valantine and Collins (2015) (This paper could work quite well as a motivation for your manuscript)

# Nielsen et al., (2017)

# Joshi, (2014)

# Campbell et al., (2013)

4) Likewise, I am not convinced by the speculative argument that pairs of female first and last authors have less powered studies due to women seniors being more critical of female employees. How would being critical towards female coauthors lower statistical power? I can't see the link, and I suggest that you skip this reflection entirely.

5) Woolley et al., 2010 is not about research teams, but teams in general.

6) I feel the Title would read better if the fourth word was changed from "of" to "in" so that the title read as follows:

Adequate statistical power in clinical trials is associated with the combination of a male first author and a female last author.

However, please feel free to keep the present title if you wish.

7) I feel that the abstract and introduction would read better as follows:

Abstract

“Clinical trials have a vital role in ensuring the safety and efficacy of new medical treatments and interventions. A key characteristic of a clinical trial is its statistical power. Here we investigate whether the statistical power of a trial is related to the gender of first and last authors on the paper reporting the results of the trial. Based on an analysis of 31,873 clinical trials published between 1974 and 2017, we find that adequate statistical power was most often present in clinical trials with a male first author and a female last author (20.6%, 95% confidence interval 19.4-21.8%), and that the difference between this figure and the figure for other gender combinations was significant (12.5-13.5%; P < 0.0001). The absolute number of female authors in clinical trials also increased gradually over time, with the percentage of female last authors rising from 20.7% (1975-85) to 28.5% (after 2005). Our results demonstrate the importance of gender diversity in research collaborations and emphasize the need to increase the number of women in senior positions in medicine.”

Introduction section

“Clinical trials are complex projects that often involve collaborations between researchers who have different areas of expertise and different levels of seniority. The statistical power of a clinical trial reflects the chance of detecting a true effect, so adequate statistical power is regarded as one of the key elements of responsible research [8] and is considered essential in reproducible clinical research [9]. However, there is an increasing awareness that many clinical trials have systematic methodological flaws, including a lack of adequate statistical power, and that their results may be biased, exaggerated, and difficult to reproduce [1].

Male and female researchers differ in their collaborative strategies in ways that depend on their levels of expertise and whether they have a junior or senior position [4,5]. There are also indications that mixed gender research groups may make the best use of personal knowledge and skills [6,7], but possible relationships between the gender balance of collaborations and the quality of clinical trials – as measured by their statistical power – have not been investigated. In this study, we examined 31,873 clinical trials published between 1974 and 2017 to see if there was any relation between the gender of the first and last authors and statistical power. We found that the probability of having adequate statistical power for one combination – male first author, female last author – was significantly higher than that for the other three possible combinations. Moreover, this effect was present across countries and most medical fields.”

Please feel free to revise the Abstract and Introduction section further, but please try to avoid undue overlap between the two. Also, please note that my suggestions would mean that the following sentence (and the references therein) would be deleted: "Indeed, collaborations on trial design, conduct and report are important, and the publications resulting from team efforts and multi-university research teams are more often cited and have more impact [2,3]."

8) Textual changes and clarifications:

# Please list all the Anglophone countries in parentheses the first time the word Anglophone is used.

# Please clarify if the UK and Ireland are included with the Anglophone or European countries.

# Please list the five biggest non-Western countries in parentheses the first time the phrase non-Western countries is used.

# In the third paragraph of the Discussion, please reword the passage "It is important to note that gender assumptions are not black and white" to be more precise.

# In the fourth paragraph of the Discussion section, please replace the first two sentences ("Our analyses […] statistical power") with one shorter sentence.

9) Figures

To help the presentation of the article, please do the following:

# combine Figure 1, Figure 2 and Figure 3 into one figure with three panels.

# combine Figure 6 and Figure 7 into one figure with two panels.

# please add a colour scale to the present Figure 5.

# please expand the caption for the present Figure 7 to better explain what has been measured and what is being predicted.

https://doi.org/10.7554/eLife.34412.015

Author response

Summary:

The study presented is an analysis of over 30,000 clinical trials to assess whether gender pairings of first and last authors are associated with sufficient statistical power. The study employs a large data set and uses the API Genderize to assign gender. The authors find that overall a relatively low number of studies had sufficient statistical power (12-13%). Similar to other studies looking at gender pairings, the authors find that most studies are comprised of male first-male last authors pairings. Over the forty years analyzed, the authors observed a gradual increase in the number of female first and last authors in clinical trials. When assessed for sufficient power, studies comprised of a first male author and female last author were significantly more prevalent compared to other gender pairings.

However, a number of concerns need to be addressed before the article can be accepted for publication. In particular, there is a need for more transparency with respect to data inclusion/exclusion, and the authors need to pay more careful attention to potential confounders and possible selection bias (see points 1-8 below). The Discussion section and the references cited also need extensive attention (points 9 and 10). The Title also needs to be revised (point 11).

We thank the editor and all reviewers for their thoughtful and constructive feedback. Our responses are provided below.

Essential revisions:

[…] To briefly summarize I suggest that you adjust for these possible confounders in a binary logistic regression with the outcome variable: 0=insufficiently powered 1= sufficiently powered. The regression could include your author-composition categories + the following covariates: publication year, proportion of women authors in the discipline, proportion of women last authors in the discipline, proportion of women authors in the geographical region where the trial took place, and proportion of women last authors in the geographical region where the trial took place.

We agree with the reviewers that the factors ‘study year’, ‘country of author’ and ‘medical discipline’ may affect the association between ‘author type’ and the dependent variable ‘sufficiently powered’. For instance, as seen in Figure 1, there appears to be a change in adequately powered trials over time.

We re-analyzed the data with logistic regression (with outcome variable: 0=insufficiently powered 1= sufficiently powered) where ‘study year’, ‘country of author’ and ‘medical discipline’ were included as covariates. We have added Table 1 with the corresponding odds ratios, 95% confidence intervals and P-values to the revised manuscript. We have also added predicted probability plots, with the 95% prediction intervals to help understand the model.

4) Possible selection bias: In subsection “Statistical power of individual clinical trials” it is mentioned that you only included significant meta-analyses to exclude bias from interventions with no proven effects. I miss some sort of justification for this choice. Why would interventions with no proven effect bias your results? It would also be useful to know whether the gender composition of authors contributing to studies with no effects is similar to that of the analysed sample. Is female authors' representation higher in the sample with no proven effects relative to the sample used in the analysis?

In the frequentist framework we cannot infer information from the point estimate of a meta-analysis if the confidence interval contains 0. As we do not know the true effect size (which could also be 0), we also cannot reliably estimate the statistical power, as this requires an estimate of the true effect size. Even so, inclusion of non-significant meta-analyses did not change any of the results. We have added text and clarification in the article.
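Why a power estimate depends on an assumed true effect can be seen in a short Python sketch. It uses the normal approximation for a two-sided comparison of two group means with equal group sizes. The standardized effect size, the group size, and the function name are illustrative assumptions; the sketch does not reproduce the power computation used in the study.

```python
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the standardized mean difference (an assumed true effect);
    n_per_group is the number of participants in each arm.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic under the assumed effect.
    shift = effect_size * (n_per_group / 2.0) ** 0.5
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# With a true effect of zero, "power" collapses to the type I error rate.
print(two_sample_power(0.0, 64))
# With a hypothetical medium effect and 64 participants per arm.
print(two_sample_power(0.5, 64))
```

If the assumed effect is zero, the function returns only the type I error rate, which is why a meta-analytic estimate indistinguishable from zero gives no usable basis for a power calculation.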

We have added data from the trials where we have no gender information. The average power for those studies is very similar to the categories ‘both females/both males/female-male (last)’, indicating that our data is not biased by these ‘unknowns’ (see Figure 7).

5) Transparency with respect to data inclusion and exclusion: It would be helpful to have some sort of flow-diagram specifying the data exclusion steps. (1) How many trials did you start out with? (2) were you able to extract the necessary data from all eligible clinical trials identified in Cochrane? (3) How many studies were excluded due to statistical insignificance? (4) How many studies were excluded due to missing gender specification?

We have constructed a flow diagram (Figure 1) and additional descriptive figures in the revised manuscript. Furthermore, we explain briefly why our gender identification software was unable to detect gender in almost 70% of the included trials (Discussion section).

6) It would aid interpretation if the breakdown of numbers (studies excluded, numbers per category of pairing, those with sufficient statistical power….) were provided in a Table or in a Figure (similar to how presented in Figure 6).

These numbers are now given in the flow diagram (Figure 8).

7) Please also ensure that the following information is available in the manuscript:

- information on the definition of confidence intervals (was bootstrap used?)

All confidence intervals are 95% confidence intervals; we have added this to the manuscript. The CIs for the proportions are determined with the standard R function prop.test from the default ‘stats’ package (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prop.test.html). The confidence intervals for the odds ratios from the logistic regression model are determined with the confint function in R from the ‘stats’ package (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/confint.html).
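For readers without R, the score interval on which prop.test is based (Wilson, 1927) can be sketched in Python. This minimal version omits the continuity correction that prop.test applies by default, so its bounds are slightly narrower than the R output. The function name and the example counts are hypothetical.

```python
from statistics import NormalDist

def wilson_interval(successes, total, conf_level=0.95):
    """Wilson score interval for a proportion, without continuity correction."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    p_hat = successes / total
    denom = 1.0 + z * z / total
    # The interval is centred on a value shrunk towards 0.5, not on p_hat.
    centre = (p_hat + z * z / (2.0 * total)) / denom
    half_width = (
        z
        * (p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total)) ** 0.5
        / denom
    )
    return centre - half_width, centre + half_width

# Hypothetical example: 50 adequately powered trials out of 100.
lower, upper = wilson_interval(50, 100)
print(round(lower, 4), round(upper, 4))
```

Unlike the simple normal-approximation interval, the score interval never extends below 0 or above 1, which matters for proportions close to either bound.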

- information on the statistical test applied. χ²-test is mentioned, to me it is unclear if multiple comparison correction was used and why the continuity correction has to be used when actually numbers of studies are tested. Also unclear is whether the bimodal distribution of statistical power in studies (see Button et al., 2013) would still have us expect a normal distribution (or hypergeometric) of the number of cases in each class. Even more concerning is the potential internal structure in the dataset: how many different seniors carry the effect? etc.

So, to address all this within one analysis: Why not use randomized gender label swapping?

Because we changed the statistical testing to logistic regression, the chi-square analysis has been removed. We fitted only one model in our current manuscript version. We have lowered the α cutoff threshold to 0.005, based on a recent discussion on the too liberal threshold of 0.05 (Benjamin et al., 2018).

8) A more elaborate description of the gender disambiguation would also strengthen the study. In subsection “Gender extraction” you mention that "gender probabilities were dichotomized to obtain binary male/female labels". I presume this means that your gender algorithm provides numerical name-to-gender accuracy estimations? (please describe in more detail how the algorithm works). If yes, what threshold did you use? It would also strengthen the study if you could manually check a sub-sample of authors to verify the accuracy of the algorithm with the chosen threshold.

Gender extraction based on first names with genderize.io has the highest accuracy of the methods currently available. Comparison and validation work is reported by Wais (2016). The genderize.io method is compared with the methods of West et al. (2013) and Larivière et al. (2013). The area under the ROC curve – a general measure of predictiveness – for the genderize.io software is 0.927. The predictive power is thus excellent. Another study that validated genderize.io gender prediction is the work by Fell and König (2016). Yet another study, on gender predictions in the mathematical sciences, reports an accuracy of 97.5% for genderize.io (Topaz and Sen, 2016). The genderize.io software provides a probability for a first name being male or female (summing to 1.0). We labeled first names based on the highest gender probability.
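The labeling rule described above can be sketched in Python as follows. The function names, the input format (a pair of probabilities summing to 1.0), and the handling of exact ties are assumptions made for illustration; this is not the genderize.io client or its response format.

```python
def label_gender(prob_male, prob_female):
    """Return the label with the highest probability, or None on an exact tie."""
    if abs(prob_male + prob_female - 1.0) > 1e-9:
        raise ValueError("probabilities are expected to sum to 1.0")
    if prob_male == prob_female:
        return None  # assumption: an exact tie is left unlabeled
    return "male" if prob_male > prob_female else "female"

def author_combination(first_label, last_label):
    """Combine first- and last-author labels into one of four categories."""
    if first_label is None or last_label is None:
        return None  # gender unknown for at least one author
    return f"{first_label} first, {last_label} last"

# Hypothetical probabilities for a first and a last author.
first = label_gender(0.93, 0.07)
last = label_gender(0.20, 0.80)
print(author_combination(first, last))
```

Trials for which either label is missing would fall into the group with unknown gender rather than into one of the four author combinations.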

9) A number of references (e.g., Helmer et al., 2017; Wuchty, Jones and Uzzi, 2007; Jones, Wuchty and Uzzi, 2008; Woolley et al., 2010 and Rhoten and Pfirman, 2007) are quoted out of context. Woolley et al., 2010 does not focus on research teams: rather, it finds that the collective intelligence of teams increase with women's representation, not with gender diversity. Rhoten and Pfirman, 2007 finds that women (not gender diverse teams) are more likely to do interdisciplinary research.

We appreciate the feedback. We have adjusted the references and text to fit the context better. As for West, Jacquet and King, 2013, we think the referee was actually discussing another paper (Araujo, Araujo and Moreira, 2017). This article does show that when an article has more collaborators, the chance of having interdisciplinary collaboration is higher for female scientists. We believe that our use of this reference is justified, stating that “our results do support previous findings that gender differences exist in collaboration style patterns.”

10) The Discussion section needs to be completely rewritten to focus on: i) how the results in the present manuscript relate to the existing literature (including possible explanations for the results); ii) the shortcomings and caveats associated with the present approach (e.g., just one type of article, just one area of science, just the first and last authors); iii) open questions for future research.

As mentioned above, please ensure that any papers cited are directly relevant to the subject being discussed.

We have rewritten the Discussion section, addressed the abovementioned shortcomings and put a focus on future research.

11) The Title needs to be revised to better reflect the content of the article. Also, please note that punctuation like colons, semi-colons and hyphens/dashes are not allowed in the title of eLife articles.

We have changed the title to “Adequate statistical power of clinical trials is associated with the combination of a male first author and a female last author.”

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The authors have done a good job addressing my main concerns about confounders. The logistic regression analysis has strengthened the study a lot. I think the paper is ready for publication with some revisions.

1) The authors used a categorical variable with three values to adjust for geo-cultural variation. I am not convinced that this is sufficient, and given the size of the data-set, I suggest that they run the same log-reg model with dummy variables for all countries to ensure that potential confounders at the country level are sufficiently adjusted for. If this model leads to the same results, they can choose to report the model as it is given now.

Table 2 has been added to the manuscript with individual countries as potential confounders. The main effect remains and is not influenced by individual countries. We have also addressed this in the Results section.

2) I suggest that the authors revise the Results section so that descriptive findings are presented first, followed by the outcomes of the logistic regression analysis. Mixing the results from the different approaches is somewhat confusing. The log-reg should be used to validate the descriptive results.

This is a valuable suggestion. We have changed the Results section; the section “model performance” was renamed “correction for potential confounders” and moved to the end of the Results section.

3) Regarding the Discussion section: I am still not convinced by the arguments about gender differences in collaboration style patterns. It is difficult for me to understand how your research adds to discussions of collaboration style. Collaboration style is a question of process, your study can only contribute to the understanding of outcomes? I would argue that your study contributes to the emerging literature about how gender diversity may influence research outcomes. Here are some references for other relevant studies addressing this issue (one of which – Campbell et al., 2013 – is already cited):

# Valantine and Collins, (2015) (This paper could work quite well as a motivation for your manuscript)

# Nielsen et al., (2017)

# Nielsen et al., (2017)

# Joshi, (2014)

# Campbell et al., (2013)

We have added your suggested references to further improve the manuscript by adding text to the discussion. Now we state that the results suggest that gender differences may alter research outcomes. See the Discussion section.

4) Likewise, I am not convinced by the speculative argument that pairs of female first and last authors have less powered studies due to women seniors being more critical of female employees. How would being critical towards female coauthors lower statistical power? I can't see the link, and I suggest that you skip this reflection entirely.

We have removed the speculative arguments (Discussion section).

5) Woolley et al., 2010 is not about research teams, but teams in general.

We changed the sentence in the Introduction section to: “There are indications that mixed gender teams may make the best use of personal knowledge and skills, an effect also reported in a scientific research context.”

6) I feel the Title would read better if the fourth word was changed from "of" to "in" so that the title read as follows:

Adequate statistical power in clinical trials is associated with the combination of a male first author and a female last author.

However, please feel free to keep the present title if you wish.

Thank you for the suggestion, we changed the Title accordingly.

7) I feel that the abstract and introduction would read better as follows:

Abstract

“Clinical trials have a vital role in ensuring the safety and efficacy of new medical treatments and interventions. A key characteristic of a clinical trial is its statistical power. Here we investigate whether the statistical power of a trial is related to the gender of first and last authors on the paper reporting the results of the trial. Based on an analysis of 31,873 clinical trials published between 1974 and 2017, we find that adequate statistical power was most often present in clinical trials with a male first author and a female last author (20.6%, 95% confidence interval 19.4-21.8%), and that the difference between this figure and the figure for other gender combinations was significant (12.5-13.5%; P < 0.0001). The absolute number of female authors in clinical trials also increased gradually over time, with the percentage of female last authors rising from 20.7% (1975-85) to 28.5% (after 2005). Our results demonstrate the importance of gender diversity in research collaborations and emphasize the need to increase the number of women in senior positions in medicine.”

Introduction section

“Clinical trials are complex projects that often involve collaborations between researchers who have different areas of expertise and different levels of seniority. The statistical power of a clinical trial reflects the chance of detecting a true effect, so adequate statistical power is regarded as one of the key elements of responsible research [8] and is considered essential in reproducible clinical research [9]. However, there is an increasing awareness that many clinical trials have systematic methodological flaws, including a lack of adequate statistical power, and that their results may be biased, exaggerated, and difficult to reproduce [1].

Male and female researchers differ in their collaborative strategies in ways that depend on their levels of expertise and whether they have a junior or senior position [4,5]. There are also indications that mixed gender research groups may make the best use of personal knowledge and skills [6,7], but possible relationships between the gender balance of collaborations and the quality of clinical trials – as measured by their statistical power – have not been investigated. In this study, we examined 31,873 clinical trials published between 1974 and 2017 to see if there was any relation between the gender of the first and last authors and statistical power. We found that the probability of having adequate statistical power for one combination – male first author, female last author – was significantly higher than that for the other three possible combinations. Moreover, this effect was present across countries and most medical fields.”

Please feel free to revise the Abstract and Introduction section further, but please try to avoid undue overlap between the two. Also, please note that my suggestions would mean that the following sentence (and the references therein) would be deleted: "Indeed, collaborations on trial design, conduct and report are important, and the publications resulting from team efforts and multi-university research teams are more often cited and have more impact [2,3]."

Many thanks for improving the readability of the text. The Abstract is completely incorporated, and we have added some results in the Abstract. The Introduction section was also amended. We have added one line to the Acknowledgments section in appreciation of your and the reviewers’ suggestions.

8) Textual changes and clarifications:

# Please list all the Anglophone countries in parentheses the first time the word Anglophone is used.

We removed the mix of "anglophone" and "anglosphere" from the text and kept "Anglosphere". The Anglosphere is formally defined as the United States, Canada, Australia, New Zealand and the United Kingdom. We have listed these countries the first time the word Anglosphere is used.

# Please clarify if the UK and Ireland are included with the Anglophone or European countries.

Ireland is analyzed as being part of the European category. We have added this to the text.

# Please list the five biggest non-Western countries in parentheses the first time the phrase non-Western countries is used.

We have added the five biggest non-Western countries: Turkey, Japan, India, China, and Israel.

# In the third paragraph of the Discussion section, please reword the passage "It is important to note that gender assumptions are not black and white" to be more precise.

# In the fourth paragraph of the Discussion section, please replace the first two sentences ("Our analyses […] statistical power") with one shorter sentence.

The abovementioned changes have been made.

9) Figures

To help the presentation of the article, please do the following:

# combine Figure 1, Figure 2 and Figure 3 into one figure with three panels.

# combine Figure 6 and Figure 7 into one figure with two panels.

# please add a colour scale to the present Figure 5.

# please expand the caption for the present Figure 7 to better explain what has been measured and what is being predicted.

The abovementioned changes have been made.

https://doi.org/10.7554/eLife.34412.016

Article and author information

Author details

Willem M Otte

Willem M Otte is in the Biomedical MR Imaging and Spectroscopy Group, Center for Image Sciences and the Department of Child Neurology, Brain Center Rudolf Magnus, University Medical Center Utrecht/Utrecht University, Utrecht, The Netherlands

Contribution
Conceptualization, Formal analysis, Supervision, Funding acquisition, Investigation, Methodology, Writing—original draft, Writing—review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0003-1511-6834

Joeri K Tijdink

Joeri K Tijdink is in the Department of Philosophy, VU University, Amsterdam, The Netherlands

Contribution
Conceptualization, Investigation, Writing—original draft, Writing—review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0002-1826-2274

Paul L Weerheim

Paul L Weerheim is in the Biomedical MR Imaging and Spectroscopy Group, Center for Image Sciences, University Medical Center Utrecht/Utrecht University, Utrecht, The Netherlands

Contribution
Formal analysis, Methodology

Competing interests
No competing interests declared

Herm J Lamberink

Herm J Lamberink is in the Biomedical MR Imaging and Spectroscopy Group, Center for Image Sciences and the Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht/Utrecht University, Utrecht, The Netherlands

Contribution
Conceptualization, Formal analysis, Methodology, Writing—original draft, Writing—review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0003-1379-3487

Christiaan H Vinkers

Christiaan H Vinkers is in the Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht/Utrecht University, Utrecht, The Netherlands

Contribution
Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing—original draft, Project administration, Writing—review and editing

For correspondence
c.h.vinkers@umcutrecht.nl

Competing interests
No competing interests declared

ORCID iD: 0000-0003-3698-0744

Funding

ZonMw (445001002)

Willem M Otte
Joeri K Tijdink
Herm J Lamberink
Christiaan H Vinkers

VENI (016.168.038)

Willem M Otte

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We appreciate the valuable input and suggestions by the reviewers. We gratefully acknowledge the possibility that the all-male author list may have negatively affected the scientific quality of our work.

Publication history

Received: December 15, 2017
Accepted: May 17, 2018
Version of Record published: June 5, 2018

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

Views: 6,707
Downloads: 242
Citations: 6

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Cite this article

Willem M Otte, Joeri K Tijdink, Paul L Weerheim, Herm J Lamberink, Christiaan H Vinkers (2018) Research: Adequate statistical power in clinical trials is associated with the combination of a male first author and a female last author. eLife 7:e34412. https://doi.org/10.7554/eLife.34412

Categories and tags

Research organism

Human

Part of Collection
Equity, Diversity and Inclusion

Edited by Julia Deathridge

Percentage of adequately powered trials for the four different gender combinations of first and last author.

Percentage of adequately powered trials when the gender of the first and last author is male, female or unknown.

Model estimates for the variables fitted against adequately powered trials.

Model estimates from the sensitivity analysis (with individual countries) for the variables fitted against adequately powered trials.

The proportion of included trials mapped per country on a white to red color scale (range: 0 – 24%).

The influence of geography on the percentage of trials that are adequately powered.

Percentage of adequately powered trials, for the four gender combinations of the first and the last author, within 21 major medical disciplines.

The percentage of the total number of trials underlying the four gender combinations within 21 major medical disciplines.

The percentage of trials for the different gender combinations and periods studied.

Flow diagram of the 31,873 trials selected for analysis.
