Data Sheet 1_A machine-learning approach for pancreatic neoplasia classification based on plasma extracellular vesicles.pdf
Pancreatic cancer (PC) is a lethal disease developing from either exocrine or endocrine cells. Efforts to assist early diagnosis focus on liquid biopsy methods, especially on the detection of extracellular vesicles (EVs) secreted by cancer cells into their microenvironment and accumulating in the systemic circulation. Multiple studies explore how EV size, surface biomarkers, or content determine their role and function in the recipient cell’s gene expression, metabolism, and behavior, thereby affecting cancer development. This study aimed to develop a machine learning (ML)-driven pipeline that uses clinical variables and EV-based features from patients’ plasma to predict the presence of pancreatic tumors of different nature (exocrine/endocrine), compared with patients with benign lesions or age-matched non-oncological patients.
Methods: All available plasma samples (N=126) and variables were collected prior to surgery. EVs were detected and characterized by flow cytometry-immunostaining. Data including size and a unique set of biomarkers (CD45, CD63 and EphA2) were combined with hematological/biochemical data and processed under two use cases, each formulated as a 3-class classification problem for patient risk stratification. The first use case aimed to classify patients as having benign lesions or exocrine/endocrine neoplasms. The second use case aimed to distinguish patients with exocrine/endocrine neoplasms from non-oncological patients. Various ML methods were applied, including Logistic Regression, Random Forest, Support Vector Machines, and Extreme Gradient Boosting (XGBoost). Evaluation metrics, such as the area under the receiver operating characteristic curve (AUC-ROC), were computed, and Shapley values were used to determine the features with the greatest impact on the discrimination of outcome groups.
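As an illustration of how such metrics can be computed for a 3-class problem, the following is a minimal sketch, not the study's actual pipeline. It assumes a one-vs-rest, macro-averaged definition of AUC-ROC and a mean and standard deviation taken over cross-validation folds; the function names, class labels, toy probabilities, and fold accuracies are all hypothetical and do not come from the study data.

```python
from statistics import mean, stdev


def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, proba, classes):
    """Macro-averaged one-vs-rest AUC (assumed averaging scheme).

    proba[i][k] is the predicted probability of classes[k] for sample i.
    """
    per_class = []
    for k, c in enumerate(classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[k] for row in proba]
        per_class.append(binary_auc(labels, scores))
    return mean(per_class)


def summarize(fold_scores):
    """Mean and sample standard deviation of a metric across folds."""
    return mean(fold_scores), stdev(fold_scores)


# Hypothetical toy data: six samples, three outcome classes.
classes = ["benign", "exocrine", "endocrine"]
y_true = ["benign", "exocrine", "endocrine", "benign", "exocrine", "endocrine"]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
]
auc = macro_ovr_auc(y_true, proba, classes)

# Hypothetical per-fold accuracies, reported as mean +/- SD.
fold_mean, fold_sd = summarize([0.92, 0.96, 1.00])
print(f"macro OvR AUC = {auc:.4f}; accuracy = {fold_mean:.2f} +/- {fold_sd:.2f}")
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for a cohort of this size; a library routine would normally be preferred in practice.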
Results: Analyses identified hematological and biochemical features among the significant predictors. Models based on plasma EV subpopulations achieved high accuracy and AUC-ROC values: the Random Forest and XGBoost algorithms scored over 0.90 in accuracy, reaching 0.96 ± 0.03 in the first use case and 0.93 ± 0.04 in the second.
Discussion: By leveraging advanced ML-driven analytical approaches and integrating diverse data types, this study achieved high accuracy, assisting patient risk estimation and supporting the feasibility of early detection of pancreatic cancer. Going beyond currently used biomarkers such as CEA or CA19.9, EV-based features add value by offering increased diagnostic capacity.