The LOTUS initiative for open knowledge management in natural products research

  1. Adriano Rutz
  2. Maria Sorokina
  3. Jakub Galgonek
  4. Daniel Mietchen
  5. Egon Willighagen
  6. Arnaud Gaudry
  7. James G Graham
  8. Ralf Stephan
  9. Roderic Page
  10. Jiří Vondrášek
  11. Christoph Steinbeck
  12. Guido F Pauli
  13. Jean-Luc Wolfender
  14. Jonathan Bisson (corresponding author)
  15. Pierre-Marie Allard (corresponding author)

  1. School of Pharmaceutical Sciences, University of Geneva, Switzerland
  2. Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Switzerland
  3. Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Germany
  4. Institute of Organic Chemistry and Biochemistry of the CAS, Czech Republic
  5. Ronin Institute, United States
  6. Leibniz Institute of Freshwater Ecology and Inland Fisheries, Germany
  7. School of Data Science, University of Virginia, United States
  8. Department of Bioinformatics-BiGCaT, Maastricht University, Netherlands
  9. Center for Natural Product Technologies and WHO Collaborating Centre for Traditional Medicine (WHO CC/TRM), Pharmacognosy Institute; College of Pharmacy, University of Illinois at Chicago, United States
  10. Department of Pharmaceutical Sciences, College of Pharmacy, University of Illinois at Chicago, United States
  11. Ontario Institute for Cancer Research (OICR), University Ave Suite, Canada
  12. University of Glasgow, United Kingdom
  13. Department of Biology, University of Fribourg, Switzerland
8 figures, 6 tables and 3 additional files

Figures

Blueprint of the LOTUS initiative.

Data undergo a four-stage process: (1) Harmonization, (2) Processing, (3) Validation, and (4) Dissemination. The process was designed to incorporate future contributions (5), either by the addition of new data from within Wikidata (a) or new sources (b) or via curation of existing data (c). The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_blueprint.svg.

Alluvial plot of the data transformation flow within LOTUS during the automated curation and validation processes.

The figure also reflects the relative proportions of the data stream in terms of the contributions from the various sources (‘source’ block, left), the composition of the harmonized subcategories (‘original subcategory’ block, middle) and the validated data after curation (‘processed category’ block, right). Automatically validated entries are represented in green, rejected entries in blue. The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_alluvial_plot.svg.

Illustration of the ‘found in taxon’ statement section on the Wikidata page of erysodine (Q27265641), showing a selection of erysodine-containing taxa and the references documenting these occurrences.

Distribution of ‘structures per organism’ and ‘organisms per structure’.

The number of organisms linked to the planar structure of β-sitosterol (KZJWDPNRJALLNS) and the number of chemical structures in Arabidopsis thaliana are two exemplary highlights. A. thaliana contains 687 different short InChIKeys (i.e. 2D structures) and KZJWDPNRJALLNS is reported in 3979 distinct organisms. Less than 10% of the species contain more than 80% of the structural diversity present within LOTUS. In parallel, 80% of the species present in LOTUS are linked to less than 10% of the structures. The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_structure_organism_distribution.svg.
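
Both distributions can be derived from a list of structure-organism pairs by truncating each InChIKey to its first block (the short InChIKey, i.e. the 2D structure) and counting in both directions. The following is a minimal sketch on made-up pairs; the function names, the placeholder organism names, and the second InChIKey blocks are illustrative assumptions, not LOTUS code or data.

```python
from collections import defaultdict


def short_inchikey(inchikey: str) -> str:
    """Return the first block of an InChIKey, which encodes the 2D (planar) structure."""
    return inchikey.split("-")[0]


def count_distributions(pairs):
    """Count distinct 2D structures per organism and distinct organisms per 2D structure."""
    structures_per_organism = defaultdict(set)
    organisms_per_structure = defaultdict(set)
    for inchikey, organism in pairs:
        key = short_inchikey(inchikey)
        structures_per_organism[organism].add(key)
        organisms_per_structure[key].add(organism)
    return (
        {org: len(keys) for org, keys in structures_per_organism.items()},
        {key: len(orgs) for key, orgs in organisms_per_structure.items()},
    )


# Toy pairs: the second blocks (AAAAAAAAAA, BBBBBBBBBB) are placeholders, not real keys.
pairs = [
    ("KZJWDPNRJALLNS-AAAAAAAAAA-N", "organism_a"),
    ("KZJWDPNRJALLNS-BBBBBBBBBB-N", "organism_a"),  # stereo variant, same 2D structure
    ("KZJWDPNRJALLNS-AAAAAAAAAA-N", "organism_b"),
    ("VFIIVOHWCNHINZ-UHFFFAOYSA-N", "organism_b"),
]
per_organism, per_structure = count_distributions(pairs)
print(per_organism)
print(per_structure)
```

Using sets rather than counters means that stereochemical variants sharing a short InChIKey are counted once per organism, which matches the "2D structures" wording of the caption.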

UpSet plots of the individual contribution of electronic NP resources to the planar structures found in Arabidopsis thaliana (A) and to organisms reported to contain the planar structure of β-sitosterol (KZJWDPNRJALLNS) (B).

UpSet plots extend Venn diagrams to represent intersections between multiple sets. The horizontal bars on the lower left represent the number of corresponding entries per electronic NP resource. The dots and their connecting lines represent the intersections between source and consolidated sets. The vertical bars indicate the number of entries at each intersection. For example, 479 organisms containing the planar structure of β-sitosterol are present in both UNPD and NAPRALERT, whereas these two resources report 1349 and 2330 such organisms, respectively. The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_upset_plot.svg.
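
The quantities an UpSet plot displays can be computed directly from the per-resource sets. The sketch below counts, for toy data, the size of each set (horizontal bars) and the number of items found in exactly one combination of sets (vertical bars). It assumes the common "exclusive intersection" convention of UpSet plots; the caption does not state which convention the figure uses, and the source names and items are hypothetical.

```python
from collections import Counter


def exclusive_intersections(sets_by_source):
    """Map each combination of sources to the number of items found in exactly that combination."""
    membership = {}
    for source, items in sets_by_source.items():
        for item in items:
            membership.setdefault(item, set()).add(source)
    return Counter(frozenset(sources) for sources in membership.values())


# Hypothetical resources and organisms, for illustration only.
sets_by_source = {
    "source_a": {"org1", "org2", "org3"},
    "source_b": {"org2", "org3", "org4"},
    "source_c": {"org3"},
}

set_sizes = {source: len(items) for source, items in sets_by_source.items()}
intersections = exclusive_intersections(sets_by_source)

print(set_sizes)
for combination, size in intersections.items():
    print(sorted(combination), size)
```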

TMAP visualizations of the chemical diversity present in LOTUS.

Each dot corresponds to a chemical structure. A highly specific quassinoid (NXZXPYYKGQCDRO) (light green star) and a ubiquitous stigmastane steroid (KZJWDPNRJALLNS) (dark green diamond) are mapped as examples in all visualizations. In panel A, compounds (dots) are colored according to the NPClassifier chemical class they belong to. In panel B, compounds that are mostly reported in the Simaroubaceae family are highlighted in blue. Finally, in panel C, the compounds are colored according to the specificity score of chemical classes found in biological organisms. This biological specificity score at a given taxonomic level for a given chemical class is calculated as a Jensen-Shannon divergence. A score of 1 suggests that compounds are highly specific, a score of 0 that they are ubiquitous. Zooms on a group of compounds with a high biological specificity score (in pink) and on compounds of low specificity (blue) are depicted. An interactive HTML visualization of the LOTUS TMAP is available at https://lotus.nprod.net/post/lotus-tmap/ and archived at https://doi.org/10.5281/zenodo.5801807 (Rutz and Gaudry, 2021). The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_biologically_interpreted_chemical_tree.svg.
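
To illustrate why a Jensen-Shannon divergence yields a score between 0 and 1, here is a minimal sketch of the divergence between two discrete distributions, using base-2 logarithms. Which distributions LOTUS actually compares is not specified in this caption; the two-taxon example vectors below are purely hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical occurrence profiles over two taxa.
only_in_taxon_1 = [1.0, 0.0]
only_in_taxon_2 = [0.0, 1.0]
evenly_spread = [0.5, 0.5]

print(jensen_shannon_divergence(only_in_taxon_1, only_in_taxon_2))  # disjoint profiles
print(jensen_shannon_divergence(evenly_spread, evenly_spread))      # identical profiles
```

Disjoint profiles reach the upper bound of 1 and identical profiles give 0, which matches the reading of the score in the caption (1 for highly specific, 0 for ubiquitous).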

Figure 7 with 1 supplement
LOTUS provides new means of exploring and representing chemical and biological diversity.

The tree generated from current LOTUS data builds on biological taxonomy and uses the kingdom as tip label color (only families containing 50+ chemical structures were considered). The outer bars correspond to the most specific chemical class found in the biological family. The height of each bar is proportional to a specificity score corresponding to a Jaccard index between the chemical class and the biological family. The bar color corresponds to the chemical pathway of the most specific chemical class in the NPClassifier classification system. The size of the leaf nodes corresponds to the number of genera reported in the family. The figure is vectorized and zoomable for detailed inspection and is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_chemically_interpreted_biological_tree.svg.
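
A Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below assumes, for illustration only, that the two sets are the structures belonging to a chemical class and the structures reported in a biological family; the caption does not detail how the sets are built, and the identifiers are placeholders.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder structure identifiers.
structures_in_class = {"s1", "s2", "s3"}
structures_in_family = {"s2", "s3", "s4", "s5"}

score = jaccard_index(structures_in_class, structures_in_family)
print(score)  # 2 shared structures out of 5 distinct ones
```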

Figure 7—figure supplement 1
Distribution of β-sitosterol and related chemical parents among families with at least 50 reported compounds present in LOTUS.

The script used to generate each tree in this figure (src/4_visualizing/plot_magicTree.R) is the same as for Figure 7, as the two figures are related. The figure is available under the CC0 license at https://commons.wikimedia.org/wiki/File:Lotus_initiative_1_chemically_interpreted_biological_tree_supplement.svg.

Author response image 1
Left: heuristic. Right: overlap index.

Tables

Table 1
Example of a referenced structure-organism pair before and after curation.
| | Structure | Organism | Reference |
| --- | --- | --- | --- |
| Before curation | Cyathocaline | Stem bark of Cyathocalyx zeylanica CHAMP. ex HOOK. f. & THOMS. (Annonaceae) | Wijeratne E. M. K., de Silva L. B., Kikuchi T., Tezuka Y., Gunatilaka A. A. L., Kingston D. G. I., J. Nat. Prod., 58, 459–462 (1995). |
| After curation | VFIIVOHWCNHINZ-UHFFFAOYSA-N | Cyathocalyx zeylanicus | 10.1021/NP50117A020 |

Table 2
Potential questions about structure-organism relationships and corresponding Wikidata queries.
| Question | Wikidata SPARQL query |
| --- | --- |
| What are the compounds present in Mouse-ear cress (Arabidopsis thaliana) or its child taxa? | https://w.wiki/4Vcv |
| Which organisms are known to contain β-sitosterol? | https://w.wiki/4VFn |
| Which organisms are known to contain stereoisomers of β-sitosterol? | https://w.wiki/4VFq |
| Which pigments are found in which taxa, according to which reference? | https://w.wiki/4VFx |
| What are examples of organisms where compounds were found in an organism sharing the same parent taxon, but not in the organism itself? | https://w.wiki/4Wt3 |
| Which Zephyranthes species lack compounds known from at least two species in the genus? | https://w.wiki/4VG3 |
| How many compounds are structurally similar to compounds labeled as antibiotics? (grouped by the parent taxon of the containing organism) | https://w.wiki/4VG4 |
| Which organisms contain indolic scaffolds? Count occurrences, group and order the results by the parent taxon. | https://w.wiki/4VG9 |
| Which compounds with known bioactivities were isolated from Actinobacteria, between 2014 and 2019, with related organisms and references? | https://w.wiki/4VGC |
| Which compounds labeled as terpenoids were found in Aspergillus species, between 2010 and 2020, with related references? | https://w.wiki/4VGD |
| Which are the available referenced structure-organism pairs on Wikidata? (example limited to 1000 results) | https://w.wiki/4VFh |

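
Questions of this kind can also be sent to the Wikidata Query Service programmatically. The sketch below builds a query that lists the taxa linked to a compound through the ‘found in taxon’ property (P703) and resolves each taxon name (P225), using the erysodine item (Q27265641) as an example. It is not the query behind any of the short links in Table 2; the function names and the User-Agent string are illustrative assumptions, and `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def found_in_taxon_query(compound_qid: str, limit: int = 10) -> str:
    """Build a SPARQL query listing taxa in which the given compound item was found."""
    return (
        "SELECT ?taxon ?taxon_name WHERE {\n"
        f"  wd:{compound_qid} wdt:P703 ?taxon .\n"  # P703: found in taxon
        "  ?taxon wdt:P225 ?taxon_name .\n"         # P225: taxon name
        "}\n"
        f"LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET request asking for JSON results."""
    return WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def run_query(query: str) -> list:
    """Send the query and return the taxon names (requires network access)."""
    request = urllib.request.Request(
        request_url(query),
        headers={"User-Agent": "lotus-example/0.1 (illustrative sketch)"},  # hypothetical agent
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        data = json.load(response)
    return [row["taxon_name"]["value"] for row in data["results"]["bindings"]]


query = found_in_taxon_query("Q27265641")
print(query)
print(request_url(query))
```

Calling `run_query(query)` would return the taxon names currently recorded on Wikidata for that item, so results may change as the data are curated.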
Table 3
Distribution and specificity of chemical structures across four important NP reservoirs: plants, fungi, animals, and bacteria.

A chemical structure or chemical class was counted as ‘specific’ when it appeared in only one group and not in the three others. Chemical classes were attributed with NPClassifier.

| Group | Organisms | 2D Structure-Organism pairs | 2D chemical structures | Specific 2D chemical structures | Chemical classes | Specific chemical classes |
| --- | --- | --- | --- | --- | --- | --- |
| Plantae | 28,439 | 342,891 | 95,191 | 90,672 (95%) | 545 | 59 (11%) |
| Fungi | 4,003 | 36,950 | 22,594 | 20,194 (89%) | 417 | 19 (5%) |
| Animalia | 2,716 | 24,114 | 15,242 | 11,822 (78%) | 455 | 14 (3%) |
| Bacteria | 1,555 | 23,198 | 15,895 | 14,130 (89%) | 385 | 43 (11%) |

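
The ‘specific’ counts and their percentages can be reproduced with a few lines of set arithmetic. The sketch below first applies the "appears in only one group" rule to toy sets (placeholder identifiers), then recomputes the percentages from the counts in Table 3. Function and variable names are illustrative assumptions.

```python
def specific_items(items_by_group):
    """For each group, keep the items that appear in no other group."""
    result = {}
    for group, items in items_by_group.items():
        others = set().union(*(v for g, v in items_by_group.items() if g != group))
        result[group] = set(items) - others
    return result


def percent(part, whole):
    """Percentage rounded to the nearest integer, as displayed in the table."""
    return round(100 * part / whole)


# Toy example with placeholder structure identifiers.
toy = {
    "Plantae": {"s1", "s2", "s3"},
    "Fungi": {"s3", "s4"},
    "Animalia": {"s5"},
    "Bacteria": {"s4", "s6"},
}
toy_specific = specific_items(toy)

# Counts from Table 3: (2D structures, specific 2D structures, classes, specific classes).
table_3 = {
    "Plantae": (95191, 90672, 545, 59),
    "Fungi": (22594, 20194, 417, 19),
    "Animalia": (15242, 11822, 455, 14),
    "Bacteria": (15895, 14130, 385, 43),
}
percentages = {
    group: (percent(spec_s, s), percent(spec_c, c))
    for group, (s, spec_s, c, spec_c) in table_3.items()
}
print(toy_specific)
print(percentages)
```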
Key resources table
| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
| --- | --- | --- | --- | --- |
| Software, algorithm | Lotus-processor code | This work (https://github.com/lotusnprod/lotus-processor, Rutz, 2022a) | | Archived at https://doi.org/10.5281/zenodo.5802107 |
| Software, algorithm | Lotus-web code | This work (https://github.com/lotusnprod/lotus-web, Rutz, 2022b) | | Archived at https://doi.org/10.5281/zenodo.5802119 |
| Software, algorithm | Lotus-wikidata-interact code | This work (https://github.com/lotusnprod/lotus-wikidata-interact, Rutz, 2022c) | | Archived at https://doi.org/10.5281/zenodo.5802113 |
| Software, algorithm | Global Names Architecture | https://globalnames.org | QID:Q65691453 | See Additional executable files |
| Software, algorithm | Java | https://www.java.com | QID:Q251 | |
| Software, algorithm | Kotlin | https://kotlinlang.org | QID:Q3816639 | See Kotlin packages |
| Software, algorithm | Manubot | https://manubot.org | QID:Q96473455 RRID:SCR_018553 | Repository available at https://github.com/lotusnprod/lotus-manuscript |
| Software, algorithm | NPClassifier | https://npclassifier.ucsd.edu | | See https://doi.org/10.1021/acs.jnatprod.1c00399 |
| Software, algorithm | OPSIN | https://github.com/dan2097/opsin | QID:Q26481302 | See Additional executable files |
| Software, algorithm | Python Programming Language | https://www.python.org | QID:Q28865 RRID:SCR_008394 | See Python packages |
| Software, algorithm | R Project for Statistical Computing | https://www.r-project.org | QID:Q206904 RRID:SCR_001905 | See R packages |
| Software, algorithm | Molconvert | https://docs.chemaxon.com/display/docs/molconvert.md | QID:Q55377678 | See Chemical structures |
| Software, algorithm | Wikidata | https://www.wikidata.org | QID:Q2013 RRID:SCR_018492 | Project page https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/Natural_products |
| Other | Lotus custom dictionaries | This work | | Archived at https://doi.org/10.5281/zenodo.5801798 |
| Other | Chemical identifier resolver | https://cactus.nci.nih.gov/chemical/structure | | See Chemical structures |
| Other | CrossRef | https://www.crossref.org | QID:Q5188229 RRID:SCR_003217 | See References |
| Other | PubChem | https://pubchem.ncbi.nlm.nih.gov | QID:Q278487 RRID:SCR_004284 | LOTUS data https://pubchem.ncbi.nlm.nih.gov/source/25132 |
| Other | PubMed | https://pubmed.ncbi.nlm.nih.gov | QID:Q180686 RRID:SCR_004846 | See References |
| Other | Taxonomic data sources | https://resolver.globalnames.org/data_sources | | See Translation |
| Other | Natural Products data sources | | | See Appendix 1 |

Appendix 1—table 1
Data sources list.
| Database | Type | Initial retrieved unique entries | Cleaned referenced structure-organism pairs | Pairs validated for Wikidata export | Actual validated pairs on Wikidata | Website | Article | Retrieval | License status | Contact | Dump | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| afrotryp | open | 313 | 93 | 55 | 54 | - | article (Ibezim et al., 2017) | download | license_copyright | Fidele Ntie-Kang or Ngozi Justina Nwodo | YES | unmaintained |
| alkamid | open | 4,416 | 2,639 | 2,309 | 2,160 | website | article (Boonen et al., 2012) | script | license_copyright | Bart De Spiegeleer | NO | maintained |
| antibase | commercial | 46,956 | 45,221 | - | - | - | - | - | - | - | NO | unmaintained |
| antimarin | commercial | 73,017 | 67,559 | - | - | - | - | - | - | - | NO | unmaintained |
| biofacquim | open | 531 | 519 | 519 | 511 | website (old version) | article_old; article_new (Pilón-Jiménez et al., 2019) | download | license_CCBY_4.0 | José Medina-Franco | YES | maintained |
| biophytmol | open | 543 | 558 | 322 | 308 | website | article (Sharma et al., 2014) | script | license_CCBY | Anshu Bhardwaj | NO | unmaintained |
| carotenoiddb | open | 2,922 | 639 | 530 | 485 | website | article (Yabuzaki, 2017) | script | license_copyright | yzjunko@gmail.com | NO | maintained |
| coconut | open | 5,757,872 | 5,723,691 | 153,981 | 140,877 | website | article (Sorokina and Steinbeck, 2020b) | download | license_CCBY_4.0 | Maria Sorokina | YES | maintained |
| cyanometdb | open | 1,905 | 1,631 | 1,621 | 1,605 | - | article (Jones et al., 2021) | download | license_CCBY_4.0 | elisabeth.janssen@eawag.ch | YES | maintained |
| datawarrior | open | 589 | 541 | 71 | 60 | website | article (Sander et al., 2015) | download | no_license | thomas.sander@idorsia.com | YES | retired |
| dianatdb | open | 290 | 323 | 115 | 111 | website | article (Madariaga-Mazón et al., 2021) | download | license_CCBY_NC | amadariaga@iquimica.unam.mx or kmtzm@unam.mx | YES | maintained |
| dnp | commercial | 205,072 | 254,573 | - | - | website | - | script | - | support@taylorfrancis.com | NO | maintained |
| drduke | open | 90,675 | 9,660 | 6,184 | 5,222 | website | - | download | license_CC0 | agref@usda.gov | YES | maintained |
| foodb | restricted | 81,941 | 39,662 | - | - | website | - | download | license_CCBY_NC | jreid3@ualberta.ca (Jennifer) | YES | unmaintained |
| inflamnat | open | 665 | 632 | 306 | 268 | - | article (Zhang et al., 2019) | download | license_copyright | xiaoweilie@ynu.edu.cn | YES | unmaintained |
| knapsack | open | 132,127 | 139,336 | 59,945 | 55,186 | website | article (Shinbo et al., 2006) | script | license_copyright | skanaya@gtc.naist.jp | NO | maintained |
| metabolights | open | 38,208 | 37,704 | 6,241 | 5,687 | website | article (Haug et al., 2020) | download | license_copyright | - | YES | maintained |
| mibig | open | 1,310 | 1,139 | 638 | 535 | website | article (Kautsar et al., 2020) | download | license_CCBY_4.0 | Tilmann Weber or Marnix Medema | YES | unmaintained |
| mitishamba | open | 1,071 | 534 | 294 | 291 | website | article (Derese et al., 2019) | script | license_copyright | - | NO | defunct |
| nanpdb | open | 5,752 | 6,383 | 5,937 | 5,283 | website | article (Ntie-Kang et al., 2017) | script | license_copyright | ntiekfidele@gmail.com stefan.guenther@pharmazie.uni-freiburg.de | NO | maintained |
| napralert | commercial | 681,401 | 392,498 | 294,818 | 270,743 | website | article (Graham and Farnsworth, 2010) | - | license_copyright | napralert@uic.edu | NO | defunct |
| npass | open | 290,535 | 30,185 | 25,429 | 23,612 | website | article (Zeng et al., 2018) | download | license_CCBY_NC | phacyz@nus.edu.sg jiangyy@sz.tsinghua.edu.cn iaochen@163.com | YES | unmaintained |
| npatlas | open | 32,539 | 34,726 | 34,548 | 33,087 | website | article (van Santen et al., 2019; ) | download | license_CCBY_4.0 | rliningt@sfu.ca | YES | maintained |
| npcare | open | 7,763 | 5,878 | 3,790 | 3,538 | website | article (Choi et al., 2017) | download | license_CCBY_4.0 | choihwanho@gmail.com | YES | unmaintained |
| npedia | open | 82 | 99 | 28 | 28 | website | article (Tomiki et al., 2006) | script | no_license | hisyo@riken.jp npd@riken.jp | NO | defunct |
| nubbe | open | 2,189 | 2,340 | 2,340 | 2,119 | website | article (Pilon et al., 2017) | - | license_copyright | Vanderlan S. Bolzani | NO | maintained |
| pamdb | open | 3,046 | 2,820 | 24 | 24 | website | article (Huang et al., 2018) | download | license_CCBY_NC | awilks@rx.umaryland.edu aoglesby@rx.umaryland.edu mkane@rx.umaryland.edu | YES | unmaintained |
| phenolexplorer | open | 8,077 | 8,700 | 7,123 | 5,721 | website | article (Rothwell et al., 2013) | download | license_copyright | scalberta@iarc.fr | YES | unmaintained |
| phytohub | open | 2,349 | 1,145 | 132 | 94 | website | article (Giacomoni et al., 2017) | script | no_license | claudine.manach@inra.fr | YES | unmaintained |
| procardb | open | 6,556 | 6,278 | 60 | 55 | website | article (Nupur et al., 2016) | script | license_CCBY_4.0 | Anil Kumar Pinnaka, Ashwani Kumar | NO | unmaintained |
| respect | open | 2,759 | 1,064 | 634 | 547 | website | article (Sawada et al., 2012) | download | license_CCBY_NC_2.1_Japan | ksaito@psc.riken.jp | YES | unmaintained |
| sancdb | open | 860 | 925 | 747 | 732 | website | article (Hatherley et al., 2015) | script | license_CCBY_4.0 | Özlem Tastan Bishop | NO | unmaintained |
| streptomedb | open | 71,638 | 33,217 | 20,715 | 18,395 | website | article (Klementz et al., 2016) | download | license_copyright | stefan.guenther@pharmazie.uni-freiburg.de | YES | maintained |
| swmd | open | 1,075 | 1,751 | 1,597 | 1,479 | website | article (Davis and Vasanthi, 2011) | script | license_CCBY_4.0 | Dicky.John@gmail.com | NO | unmaintained |
| tmdb | open | 2,116 | 533 | 26 | 24 | website | article (Yue et al., 2014) | script | license_copyright | Xiao-Chun Wan, Guan-Hu Bao | NO | unmaintained |
| tmmc | open | 15,033 | 7,833 | 5,826 | 4,015 | website | article (Kim et al., 2015a) | download | license_copyright | Jeong-Ju Lee | YES | unmaintained |
| tppt | open | 27,182 | 23,872 | 684 | 641 | website | article (Günthardt et al., 2018) | download | license_copyright | thomas.bucheli@agroscope.admin.ch | YES | unmaintained |
| unpd | open | 331,242 | 304,683 | 211,158 | 197,710 | website | article (Gu et al., 2013) | - | license_CCBY_4.0 | lirongc@pku.edu.cn xiaojxu@pku.edu.cn | NO | defunct |
| wakankensaku | open | 367 | 224 | 208 | 202 | website | - | script | - | - | NO | defunct |
| Wikidata | open | 951,268 | 960,611 | 959,747 | 919,752 | website | - | download | license_CC0 | - | YES | maintained |

Appendix 2—table 1
Summary of the Validation Statistics.
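Columns marked (1st) refer to the first validation dataset (n = 420); columns marked (2nd) refer to the second validation dataset (n = 100).

| Reference type | True positive (1st) | False positive (1st) | False negative (1st) | True negative (1st) | Ratio | Precision | Recall | F0.5 score | True positive (2nd) | False negative (2nd) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 80 | 6 | 7 | 11 | 0.31 | 0.93 | 0.92 | 0.92 | 38 | 1 |
| Pubmed | 37 | 1 | 5 | 6 | 0.30 | 0.97 | 0.88 | 0.92 | 5 | 1 |
| DOI | 115 | 6 | 0 | 6 | 0.19 | 0.95 | 1.00 | 0.97 | 43 | 1 |
| Title | 38 | 2 | 0 | 16 | 0.12 | 0.95 | 1.00 | 0.97 | 7 | 0 |
| Split | 8 | 0 | 15 | 27 | 0.08 | 1.00 | 0.35 | 0.52 | 4 | 0 |
| Publishing details | 1 | 0 | 1 | 32 | 0.01 | 1.00 | 0.50 | 0.67 | 0 | 0 |
| Total | 279 | 15 | 28 | 98 | 1.00 | - | - | - | 97 | 3 |
| Corrected total | - | - | - | - | - | 0.96 | 0.89 | 0.91 | - | - |
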
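
The precision and recall columns follow from the confusion counts of the first validation dataset, as the sketch below shows. It also tests one possible reading of the ‘Corrected total’ row, namely a sum of the tabulated per-type values weighted by the ‘Ratio’ column. That reading is an assumption made here for illustration; the table itself does not state how the corrected totals were obtained.

```python
def precision(tp, fp):
    """Fraction of accepted entries that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of correct entries that were accepted."""
    return tp / (tp + fn)


# First validation dataset: (TP, FP, FN, ratio, tabulated precision, tabulated recall).
rows = {
    "Original": (80, 6, 7, 0.31, 0.93, 0.92),
    "Pubmed": (37, 1, 5, 0.30, 0.97, 0.88),
    "DOI": (115, 6, 0, 0.19, 0.95, 1.00),
    "Title": (38, 2, 0, 0.12, 0.95, 1.00),
    "Split": (8, 0, 15, 0.08, 1.00, 0.35),
    "Publishing details": (1, 0, 1, 0.01, 1.00, 0.50),
}

# Recompute precision and recall per reference type from the counts.
recomputed = {
    name: (round(precision(tp, fp), 2), round(recall(tp, fn), 2))
    for name, (tp, fp, fn, _ratio, _p, _r) in rows.items()
}

# Assumed reading of the 'Corrected total' row: ratio-weighted sum of tabulated values.
weighted_precision = sum(ratio * p for (_tp, _fp, _fn, ratio, p, _r) in rows.values())
weighted_recall = sum(ratio * r for (_tp, _fp, _fn, ratio, _p, r) in rows.values())

print(recomputed)
print(round(weighted_precision, 2), round(weighted_recall, 2))
```

Under this assumption the weighted sums round to the precision and recall shown in the ‘Corrected total’ row.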

Adriano Rutz, Maria Sorokina, Jakub Galgonek, Daniel Mietchen, Egon Willighagen, Arnaud Gaudry, James G Graham, Ralf Stephan, Roderic Page, Jiří Vondrášek, Christoph Steinbeck, Guido F Pauli, Jean-Luc Wolfender, Jonathan Bisson, Pierre-Marie Allard (2022)
The LOTUS initiative for open knowledge management in natural products research
eLife 11:e70780.
https://doi.org/10.7554/eLife.70780