Prediction of type 2 diabetes mellitus onset using logistic regression-based scorecards

  1. Yochai Edlitz
  2. Eran Segal (corresponding author)
  1. Weizmann Institute of Science, Israel

Abstract

Background: Type 2 diabetes (T2D) accounts for ~90% of all cases of diabetes, resulting in an estimated 6.7 million deaths in 2021, according to the International Diabetes Federation (IDF). Early detection of patients at high risk of developing T2D can reduce the incidence of the disease through changes in lifestyle, diet, or medication. Since populations of lower socio-demographic status are more susceptible to T2D and might have limited means or limited access to sophisticated computational resources, there is a need for accurate yet accessible prediction models.

Methods: In this study, we analyzed data from 44,709 non-diabetic UK Biobank participants aged 40-69, predicting the risk of T2D onset within a selected timeframe (mean of 7.3 years, standard deviation of 2.3 years). We started with 798 features that we identified as potential predictors of T2D onset. We first analyzed the data using gradient boosting decision trees, survival analysis, and logistic regression methods. We devised one non-laboratory model accessible to the general population and one more precise yet simple model that uses laboratory tests. We simplified both models to an accessible scorecard form, tested the models on normoglycemic and prediabetes sub-cohorts, and compared the results with those of the general cohort. We established the non-laboratory model using the following covariates: sex, age, weight, height, waist size, hip circumference, waist-to-hip ratio (WHR), and body mass index (BMI). For the laboratory model, we used age and sex together with four common blood tests: high-density lipoprotein (HDL), gamma-glutamyl transferase, glycated hemoglobin, and triglycerides. As an external validation dataset, we used the electronic medical record database of Clalit Health Services.
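
As an illustration of how a logistic regression model can be reduced to a scorecard, the following minimal Python sketch bins three of the anthropometric covariates, converts each bin's log-odds contribution to integer points, and maps the summed points back to a risk estimate. The bin boundaries, log-odds values, intercept, and scaling factor are hypothetical placeholders chosen for the example. They are not the coefficients fitted in this study, and the function names are likewise illustrative.

```python
import math

# HYPOTHETICAL values for illustration only, not the published model.
# Each feature maps to (exclusive upper bound, log-odds contribution) bins.
HYPOTHETICAL_BINS = {
    "age": [(50, 0.0), (60, 0.4), (math.inf, 0.7)],
    "bmi": [(25, 0.0), (30, 0.6), (math.inf, 1.3)],
    "whr": [(0.85, 0.0), (0.95, 0.5), (math.inf, 1.0)],
}
HYPOTHETICAL_INTERCEPT = -5.0   # log-odds of the lowest-risk profile
POINTS_PER_LOG_ODDS_UNIT = 10   # scaling from log-odds to integer points


def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def whr(waist_cm, hip_cm):
    """Waist-to-hip ratio."""
    return waist_cm / hip_cm


def bin_points(feature, value):
    """Integer points for the first bin whose upper bound exceeds the value."""
    for upper_bound, log_odds in HYPOTHETICAL_BINS[feature]:
        if value < upper_bound:
            return round(log_odds * POINTS_PER_LOG_ODDS_UNIT)
    raise ValueError(f"no bin for {feature}={value}")


def score(age, weight_kg, height_cm, waist_cm, hip_cm):
    """Total scorecard points: the sum of the per-feature bin points."""
    values = {
        "age": age,
        "bmi": bmi(weight_kg, height_cm),
        "whr": whr(waist_cm, hip_cm),
    }
    return sum(bin_points(name, value) for name, value in values.items())


def score_to_risk(points):
    """Map total points back to a probability with the logistic function."""
    log_odds = HYPOTHETICAL_INTERCEPT + points / POINTS_PER_LOG_ODDS_UNIT
    return 1 / (1 + math.exp(-log_odds))
```

With these placeholder values, a call such as `score_to_risk(score(65, 95, 170, 105, 100))` sums the points for one participant and converts them to a risk estimate. Rounding to integer points is what makes the scorecard usable by hand, at the cost of a small loss of precision relative to the underlying regression.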

Results: The non-laboratory scorecard model achieved an area under the receiver operating characteristic curve (auROC) of 0.81 (95% confidence interval (CI) 0.77-0.84) and an odds ratio (OR) between the upper and fifth prevalence deciles of 17.2 (95% CI 5-66). Using this model, we classified the cohort into three risk groups, with a 1% (0.8-1%), 5% (3-6%), and 9% (7-12%) risk of developing T2D. We further analyzed the contribution of laboratory tests and devised a blood-test scorecard model that includes age, sex, glycated hemoglobin (HbA1c%), gamma-glutamyl transferase, triglycerides, and HDL cholesterol. Using this model, we achieved an auROC of 0.87 (95% CI 0.85-0.90) and a deciles' OR of 48 (95% CI 12-109). With this model, we classified the cohort into four risk groups with the following risks of developing T2D: 0.5% (0.4%-7%); 3% (2-4%); 10% (8-12%); and a high-risk group of 23% (10-37%). When applying the blood-test model to the external validation cohort (Clalit), we achieved an auROC of 0.75 (95% CI 0.74-0.75). We analyzed several additional comprehensive models, which included genotyping data and other environmental factors, and found that these models did not provide cost-efficient benefits over the four-blood-test model. The commonly used German Diabetes Risk Score (GDRS) and Finnish Diabetes Risk Score (FINDRISC) models, trained using our data, achieved auROCs of 0.73 (0.69-0.76) and 0.66 (0.62-0.70), respectively, inferior to the results achieved by the four-blood-test model and by the anthropometry-based model.
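
The two headline metrics above can be computed from per-participant risk scores and observed outcomes. The sketch below is a minimal, assumption-laden illustration: it computes the auROC with the pairwise rank definition and the odds ratio between the top and fifth score deciles. The function names and the simple equal-size decile split are choices made for this example, not a description of the authors' analysis code.

```python
def auroc(scores, labels):
    """auROC as the probability that a random case outscores a random
    non-case, counting ties as one half (pairwise rank definition)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def decile_odds_ratio(scores, labels, reference_decile=5):
    """Odds of the outcome in the top score decile divided by the odds in
    the reference decile (deciles are 1-indexed, lowest scores first).

    Raises ZeroDivisionError if a decile has no controls or the reference
    decile has no cases; a real analysis would need to handle that."""
    ranked = sorted(zip(scores, labels))
    n = len(ranked)

    def odds(decile):
        chunk = ranked[(decile - 1) * n // 10 : decile * n // 10]
        n_cases = sum(y for _, y in chunk)
        n_controls = len(chunk) - n_cases
        return n_cases / n_controls

    return odds(10) / odds(reference_decile)
```

The pairwise auROC is quadratic in cohort size, which is fine for a demonstration but slow for a cohort of tens of thousands; a rank-based implementation would be preferable there. Confidence intervals such as those reported above would additionally require resampling, for example bootstrapping, which this sketch omits.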

Conclusions: The four-blood-test and anthropometric models outperformed the commonly used non-laboratory models, the FINDRISC and the GDRS. We suggest that our models be used as tools for decision-makers to assess populations at elevated T2D risk and thus improve medical strategies. These models might also provide a personal catalyst for lifestyle, diet, or medication changes that lower the risk of T2D onset.

Funding: No Funders. The funders had no role in study design, data collection, interpretation, or the decision to submit the work for publication.

Data availability

All data that we used to develop the models in this research are available through the UK Biobank database. The external validation cohort is from Clalit Health Services. The two databases can be accessed upon specific request and approval, as described below.

UK Biobank - The UK Biobank data are available from UK Biobank subject to standard procedures (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access). The UK Biobank resource is open to all bona fide researchers at bona fide research institutes to conduct health-related research in the public interest. UK Biobank welcomes applications from academia and commercial institutes.

Clalit - The data that support the findings of the external Clalit cohort originate from Clalit Health Services (http://clalitresearch.org/about-us/our-data/). Due to restrictions, these data can be accessed only by request to the authors and/or Clalit Health Services. Requests for access to all or parts of the Clalit datasets should be addressed to Clalit Health Services via the Clalit Research Institute (http://clalitresearch.org/contact/). The Clalit Data Access committee will consider requests in light of the Clalit data-sharing policy.

Source code for the analysis is available at https://github.com/yochaiedlitz/T2DM_UKB_predictions

Article and author information

Author details

  1. Yochai Edlitz

    Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Yavne, Israel
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0001-7733-3995
  2. Eran Segal

    Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Yavne, Israel
    For correspondence
    eran.segal@weizmann.ac.il
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0002-6859-1164

Funding

Feinberg Graduate School, Weizmann Institute of Science

  • Eran Segal

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

© 2022, Edlitz & Segal

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

Edlitz Y, Segal E (2022) Prediction of type 2 diabetes mellitus onset using logistic regression-based scorecards. eLife 11:e71862. https://doi.org/10.7554/eLife.71862
