DataSheet2_Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer.CSV (5.8 kB)
Download file

DataSheet2_Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer.CSV

Download (5.8 kB)
dataset
posted on 19.08.2021, 05:02 by Andre Lamurias, Sofia Jesus, Vanessa Neveu, Reza M. Salek, Francisco M. Couto

Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process.

Methods: The manually retrieved corpus of scientific publications used in the Exposome-Explorer was used as training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article’s relevance based on different datasets made of titles, abstracts and metadata.

Results: The top performance classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database.

Conclusion: Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while only misclassifying 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarkers datasets or be adapted to assist the manual curation process of similar chemical or disease databases.

History