Computational Approaches for Cancer Workshop 2023

CAFCW23

Hyperparameter Optimization for Deep-Learning Models Predictive of Anti-Cancer Drug Responses

Rylie Weaver, Rohan Gnanaolivu, Rajeev Jain, Chen Wang, Oleksandr Narykov

Abstract
Artificial intelligence (AI) and deep learning (DL) have emerged as powerful tools for analyzing complex molecular and genetic data, particularly in cancer therapeutic research. Every DL model has hyper-parameters that control its high-level learning behavior and neural network (NN) architecture, and consequently its performance. Despite the central role of hyper-parameters in DL models, selecting the best hyper-parameters is computationally costly, as each hyper-parameter configuration requires training and evaluating an individual DL instance. To fully exploit DL's predictive potential, this study examines a hyper-parameter optimization (HPO) framework for DL models that predict anti-cancer drug response: the process of finding the hyper-parameter values that maximize performance while minimizing computational cost relative to a random search.

The inputs of the developed HPO framework are DL models containerized in a Singularity instance, the hyper-parameters to tune, and value ranges that constrain the optimization search space. Within the HPO framework, we devised a genetic algorithm (GA), which first initializes a population of hyper-parameter configurations inside the hyper-parameter space and then simulates the evolution of configurations through iterations of mutation, mating, and selection. Using a fitness function based on the DL model's validation loss, we designed the GA to iteratively improve configurations and choose the best HPO solution over multiple generations.
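The evolutionary loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual implementation: the search space, operator rates, and function names are assumptions, and the fitness callable stands in for a containerized DL training run whose validation loss would be minimized.

```python
import random

# Hypothetical search space; the real framework takes these ranges as inputs.
SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "optimizer": ["adam", "sgd"],
    "dropout": [0.0, 0.2, 0.5],
}

def random_config():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg, rate=0.25):
    # Resample each hyper-parameter independently with probability `rate`.
    return {k: (random.choice(SPACE[k]) if random.random() < rate else v)
            for k, v in cfg.items()}

def mate(a, b):
    # Uniform crossover: each hyper-parameter comes from either parent.
    return {k: random.choice([a[k], b[k]]) for k in SPACE}

def evolve(fitness, pop_size=10, generations=5):
    """Minimize `fitness` (a stand-in for the DL model's validation loss)."""
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)              # selection: keep the fittest half
        parents = pop[: pop_size // 2]
        children = [mutate(mate(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return min(pop, key=fitness)
```

In the real framework the fitness function would launch a training run inside the container and return the validation loss; any callable that scores a configuration works here.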

As a use-case evaluation, the developed HPO framework is applied to the DrugCell and DeepTTA DL models. DrugCell is an interpretable method that constructs a visible NN following gene-ontology relationships and predicts anti-cancer drug responses by incorporating the biochemical properties and structures of compounds. DeepTTA focuses on creating robust drug representations, achieved via a transformer architecture applied to Explainable Substructure Partition Fingerprints (ESPF). This method also utilizes gene expression data from tumors and creates a separate embedding for biological samples; both drug and tumor representations are then used for regression. We examined a set of critical hyper-parameters of these models (learning rate, batch size, optimizer, and dropout) using a training-validation-testing schema on the CCLE and GDSC cancer cell line / drug datasets. With HPO, we improved the concordance of predicted versus measured CCL-drug responses compared to the default settings and a random search in both models. Through a landscape analysis of the HPO results, we identified the learning rate and batch size as the most influential of the chosen hyper-parameters. Our study demonstrates the importance of HPO and the robustness of GAs in finding optimal hyper-parameters for anti-cancer drug research, and offers hyper-parameter configuration guidance for researchers with similar models. Moreover, after improvement with HPO, the interpretability and effectiveness of our models give further insight into the effectiveness of cancer drugs and the mechanisms behind cancer-drug prediction.
Presented by
Rylie Weaver <rylieweaver9@gmail.com>
Institution
Argonne National Laboratory, Mayo Clinic
Hashtags
#machinelearning #HyperparameterOptimization #cancer #deeplearning
Chat with Presenter
Available November 12th 3:00-4:00 MST

https://clinicalunitmapping.com/show/COVID19_Ensemble_Latest.html

Jacob Barhak

Abstract
The Reference Model for disease progression was initially a diabetes model. It used the approach of assembling models and validating them against different populations from clinical trials.

The model performs simulation at the individual level while modeling entire populations using the MIcro-Simulation Tool (MIST), employing High Performance Computing (HPC), and using machine learning techniques to combine models.

The Reference Model technology was adapted to model COVID-19 near the start of the pandemic. The model is now composed of multiple models from multiple contributors that represent different phenomena: it includes infectiousness models, transmission models, human response/behavior models, mortality models, and observation models. Some of those models were computed at different scales, including the cell, organ, individual, and population scales.
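The assembly of contributed models can be pictured as a weighted mixture of their predictions. The sketch below is only illustrative, assuming fixed weights per model; the Reference Model's actual machine-learning procedure for fitting model influence against clinical-trial populations is more involved.

```python
def ensemble_predict(predictions, weights):
    """Combine component-model predictions as a normalized weighted average.

    `predictions` maps a model name to its predicted outcome (e.g. a
    mortality rate); `weights` maps the same names to non-negative
    influence weights. Names and the averaging rule are illustrative.
    """
    total = sum(weights.values())
    return sum(predictions[m] * w / total for m, w in weights.items())
```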

The Reference Model is therefore, to our knowledge, the first multi-scale ensemble model for COVID-19. This project is ongoing, and this presentation is updated for each venue. To access the most recent publication please use this link https://www.clinicalunitmapping.com/show/COVID19_Ensemble_Latest.html

This is an interactive presentation - please explore the tabs above and interact with the figures - they have sliders, widgets, and hover information. Following the tabs in order from left to right will tell the story.
Presented by
Jacob Barhak <jacob.barhak@gmail.com>
Institution
Jacob Barhak Analytics
Hashtags
Chat with Presenter
Available on Nov 12th, 12 pm - 1:30 pm Mountain Standard Time (MST) and 3:00 pm - 3:30 pm MST, and after the conference on Nov 13-17, 10 am - noon Central time, or by appointment

Finding Novel Drug Discovery Experiments with QSAR

John Marinelli, Thomas Passaro, Rida Saifullah, Daniel Salinas Duron

Abstract
Novel drug discovery is essential for advancing cancer treatment. However, current high-throughput screening of compounds is time-consuming and costly, making it inaccessible to smaller labs. We developed a tool to identify potential inhibitors for targets strongly associated with a given disease. The tool utilizes a database of disease-target-compound-assay associations from the OpenTargets and ChEMBL platforms. By training a graph convolutional network model, it predicts IC50 values for specific targets and suggests novel inhibitors not yet tested against the proposed target. This enables researchers to effectively sift through potential inhibitors in silico, reducing the need for costly screening. Furthermore, the tool provides information on the commercial availability of the proposed compound through a ZINC database lookup. Assays to test compound effectiveness are proposed through a clustering algorithm. This will empower small labs to conduct screening on a reduced set of promising compounds for the desired disease.
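The final ranking step described above, surfacing untested compounds by predicted potency, can be sketched independently of the model. This is an assumed interface, not the tool's actual code; the compound IDs and the use of pIC50 (higher means more potent) are illustrative.

```python
def suggest_inhibitors(predicted_pic50, tested, top_n=5):
    """Rank compounds not yet assayed against the target by predicted potency.

    `predicted_pic50` maps a compound ID to the model-predicted pIC50;
    `tested` is the set of compounds already screened against the target.
    Stands in for the graph convolutional network's output described above.
    """
    novel = {c: v for c, v in predicted_pic50.items() if c not in tested}
    return sorted(novel, key=novel.get, reverse=True)[:top_n]
```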
Presented by
Thomas Passaro <t.j.passaro@email.msmary.edu>
Institution
Mount Saint Mary's University, Science Department
Hashtags
#DeepChem #OpenTargets #Lamba #ChEMBL

Influencing factors on false positive rates when classifying tumor cell line response to drug treatment

Priyanka Vasanthakumari, Thomas Brettin, Yitan Zhu, Hyunseung Yoo, Maulik Shukla, Alexander Partin, Fangfang Xia, Oleksandr Narykov, Rick L. Stevens

Abstract
Informed selection of drug candidates for laboratory experimentation will provide an efficient means of identifying suitable anti-cancer treatments. The advancement of machine learning and artificial intelligence has led to the development of computational models to predict cancer cell line response to drug treatment. To use computational response models for guiding drug screening experiments, it is important to analyze the false positive rate (FPR) to help increase the number of effective treatments discovered through experiments and to minimize unnecessary laboratory experimentation. Such analysis will also aid in identifying drugs or cancer types that require more data collection to improve model predictions. The dataset used in this work is constructed by including 21 cancer types that have the largest numbers of cell lines with RNA-Seq and drug response data available in a combined set of five drug screening studies. The cell lines are represented by gene expression profiles and the drugs are represented by molecular descriptors. Based on the normalized area under the dose-response curve (AUC), the experiments/samples are categorized into non-responders (AUC >= 0.5) and responders (AUC < 0.5). An attention-based neural network classification model is used to predict responsive vs. non-responsive treatments. The model construction and evaluation are performed through 100 10-fold cross-validation trials. The model prediction performance is measured by the Matthews correlation coefficient (MCC). Two data-filtering techniques were applied: removing samples whose dose-response curves are poorly fitted, and removing samples whose AUC values are marginal around 0.5 from the training set. Our analysis shows that both filtering approaches help improve the model prediction performance.
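The labeling and filtering rules above can be sketched as a small preprocessing function. The field names and thresholds are as described in the abstract, but the function itself is an illustrative stand-in, not the study's pipeline; in the study the marginal-AUC exclusion applies to the training set only.

```python
def prepare_samples(samples, r2_min=0.9, margin=(0.4, 0.6)):
    """Label and filter dose-response samples as described above.

    Each sample is a dict with the fitted-curve `r2` and normalized `auc`.
    Drops poorly fitted curves (r2 < r2_min) and marginal AUCs inside
    `margin`, then labels responders (AUC < 0.5) as 1, non-responders as 0.
    """
    kept, labels = [], []
    for s in samples:
        if s["r2"] < r2_min:
            continue                      # poorly fitted dose-response curve
        lo, hi = margin
        if lo <= s["auc"] <= hi:
            continue                      # marginal response, excluded from training
        kept.append(s)
        labels.append(1 if s["auc"] < 0.5 else 0)  # 1 = responder
    return kept, labels
```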
The highest MCC value of 0.635 is achieved in the analysis that removes samples whose R-squared values from dose-response curve fitting are smaller than 0.9 and excludes samples with AUC values in the range [0.4, 0.6] from the training set. Based on its prediction results, we summarized the FPRs of the 21 cancer types and 96 drug mechanism-of-action (MoA) categories. The FPR by cancer type spans 0.262 to 0.5189, while the FPR by drug MoA category spans almost the full range of [0, 1]. This study helps identify cancer types and drug MoAs with high FPRs. To further improve drug response models, a reasonable strategy is to generate more data for cancer types and drug MoA categories with high FPRs, which is expected to improve the models for predicting responses of these cancer types and drug MoA categories.
Presented by
Priyanka Vasanthakumari <pvasanthakumari@anl.gov>
Institution
Argonne National Laboratory
Hashtags
#DrugResponsePrediction #ErrorAnalysis
Chat with Presenter
Available November 12, 3:00 to 3:30 PM, MST

Building an Online Interactive Volumetric Surface Viewer to Visualize the Spatial Distribution of Brain Metastases

William Delery, Ricky Savjani

Abstract
Purpose/Objective: A common coordinate framework for the human body could enable a new frontier of volumetric oncological population-level inference across patients. With an atlas (or template) that each individual patient could be deformed onto, the precise spatial distribution at the millimeter level of particular oncological processes could be investigated.

Current management and treatment of cancers rely on individual physician expertise, years of training and experience, and integrated feedback from successful treatments versus those that induced significant toxicities. This clinical gestalt, however, is difficult to access, quantify, and teach. What if vital information from every cancer patient ever treated could be accessed instantly?

We propose to build an interactive visual search database for brain metastases. This transcends text-based spreadsheets to allow interrogation of population-based responses in an intuitive web viewer. Clinicians would be able to dynamically view the spatial distribution of tumors and corresponding radiotherapy treatments on a 3D surface. We propose to build the backend of the database to inspect clinical outcomes: One could click a region of the brain and know how likely patients with brain metastases in this region are to experience seizures, radiation necrosis, local/distant recurrence, and death. All of this information would be available in an intuitive visual online portal, facilitating dynamic data-driven treatment decisions.

Materials/Methods: We first built a custom VoxelMorph model to non-linearly register patients' pre-contrast T1-weighted (T1w) MRIs onto a standard template brain. This framework integrates velocity fields into a deep U-Net to register brain MRIs onto an atlas in under 1 second on a GPU. We used an NVIDIA DGX A100 Station to train models on over 1,000 patients to normalize brains onto a common atlas.

Results/Conclusion: We have piloted our workflow on 647 patients with intracranial brain lesions treated with radiotherapy and created a population surface viewer on PyCortex. We thresholded the dose maps at the 95% level, binarized them to create masks, and generated a spatial distribution of brain metastases. We have launched this interactive viewer here.
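The dose-map thresholding step above amounts to binarizing each voxel against a fraction of the prescription dose. A minimal pure-Python sketch, with a 2D list standing in for the actual volumetric arrays:

```python
def dose_to_mask(dose_map, prescription, level=0.95):
    """Binarize a dose map at `level` of the prescription dose.

    Voxels receiving at or above 95% of the prescribed dose become 1,
    others 0. A toy stand-in for the per-patient mask generation above;
    real dose maps are 3D volumes, typically handled as NumPy arrays.
    """
    cutoff = level * prescription
    return [[1 if d >= cutoff else 0 for d in row] for row in dose_map]
```

Summing the resulting masks across atlas-registered patients yields the spatial distribution of treated metastases.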

This will allow, for example, a clinician to quickly see the spatial distribution of which areas in the brain are most likely to induce seizures. These data combined will also help differentiate radiation-induced changes from tumor progression. Studies on autosegmentation of brain metastases can be rapidly deployed on our data and improved upon. Recent work on predicting the primary site of origin can also be extended. We have collected longitudinal imaging for patients at their clinical standard of care three-month follow-up imaging and plan to evaluate and visualize the evolution of these tumors spatially. Our data would be the largest in class to involve over 1,000 patients treated over 10 years, extending the work of small recent studies showing the spatial distribution of brain lesions.
Presented by
William Delery <wdelery@mednet.ucla.edu>
Institution
University of California Los Angeles, Department of Radiation Oncology
Hashtags
#voxelmorph #pycortex
Chat with Presenter
Available November 12, 4-5 pm MST

Exploration of the containerized ATOM Modeling Pipeline's accessibility and cross-compatibility

J. Jedediah Smith

Abstract
This project aims to assess the containerization of the ATOM Modeling PipeLine (AMPL), explore its compatibility with other programs, and improve its accessibility through the development of additional educational resources. Containers are sandbox workspaces isolated from the host machine that contain a specific program and any necessary dependencies. They are used to improve accessibility, reproducibility, and transportability. With the container management tool Docker, we verified that the containerized version of AMPL linked on its GitHub was functional. Next, we explored the compatibility of the AMPL container with the cloud-computing platforms Google Cloud Platform (GCP) and Microsoft Azure, as well as another, more secure container tool known as Singularity. All three avenues are promising to different degrees, but further work is needed to confirm that the AMPL Docker container works properly in each environment. In summary, we confirmed the original AMPL image runs with Docker, expanded upon its existing GitHub instructions, and determined that GCP, Azure, and Singularity may all be compatible with the AMPL Docker container.
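For readers unfamiliar with the tooling, the workflow above boils down to a handful of commands. The image name and tag below are illustrative placeholders rather than the exact image linked from the AMPL GitHub; `singularity pull` with a `docker://` URI is the standard way to convert a Docker image into a Singularity image.

```shell
# Pull and run an AMPL-style image with Docker (name/tag are placeholders;
# use the image linked from the AMPL GitHub repository).
docker pull atomsci/atomsci-ampl:latest
docker run -it -p 8888:8888 atomsci/atomsci-ampl:latest

# Build a Singularity image directly from the same Docker image.
singularity pull docker://atomsci/atomsci-ampl:latest
```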
Presented by
Jedediah Smith
Institution
Frederick National Laboratory, ATOM Consortium
Hashtags
#atom #ampl #containers #pipeline #bioinformatics
Chat with Presenter
Available Nov. 12, 3:00 - 3:30 MST

Data Modeling and Analytics Towards Patient Selection For Cancer End-Of-Life Care Study

Denise S. Davis, PhD Candidate in Health Informatics

Abstract
Purpose: This paper describes data modeling and analysis methods developed for patient selection and feature extraction for cancer decedents from 2016-2022. Features extracted included hospital admissions, ICU stays, cancer-related metrics, emergency room visits, hospice stays, palliative care consults, advanced care planning, diagnoses, procedures, therapies, and treatment visits. The aim of this project was to develop a DataMart for an End-of-Life (EOL) care study that could address disparities in EOL care, including sociodemographic, physical-health, and clinical factors. An added goal was to simplify access to the data for use in machine learning algorithms.

Context: Our multidisciplinary research team comprised a medical oncologist, a palliative care physician, a statistician, a health informatician, and a data engineer (doctoral student). Data acquisition for this study was driven by the team's collaborative efforts.

Methods: First, the research grant requirements were analyzed and compared to existing EPIC Cogito data elements. Next, a manual retrospective chart review was conducted on 121 decedents from September 2022 to December 2022. Taking both into consideration, a multidimensional relational database was designed to enforce integrity, simplify querying, reporting, and visualization, and speed up aggregation. Extract-Transform-Load (ETL) processes were written in SQL to transform cancer registry data into this analytical clinical schema. The ETL processes included one-hot encoding to handle the sparse data that EHRs (electronic health records) are notorious for, and the capture of timed events used as indicators of EOL care. Patient inclusion and exclusion criteria were based on measures of aggressive EOL care defined in the grant, such as more than one hospital stay in the last thirty days. Additional factors included anomalies and missing values discovered during the data harmonization process. Incomplete data elements were analyzed by the medical oncologist and palliative care physician for suitability of inclusion based on their clinical knowledge. The ETL process was developed iteratively according to the system development life cycle.
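One-hot encoding of a sparse categorical field, the step mentioned above, can be illustrated with a small pure-Python function. The field and category names are hypothetical; the study's ETL performs this in SQL.

```python
def one_hot(records, field, categories):
    """One-hot encode a sparse categorical EHR field.

    `records` are dicts that may lack `field` entirely (EHR data is
    notoriously sparse); a missing value yields an all-zero row rather
    than an error, preserving row alignment for downstream ML.
    """
    rows = []
    for rec in records:
        value = rec.get(field)
        rows.append([1 if value == c else 0 for c in categories])
    return rows
```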

Results: The entire last quarter of 2022 was extracted, transformed, and analyzed, yielding 789 decedents. Then the last seven years, 2016-2022, were processed, yielding 15,306 decedents. Large imbalances with respect to race and ethnicity were observed in both datasets; however, both datasets were balanced by gender. Preliminary hypothesis testing showed that the proportion of decedents in the last quarter of 2022 with more than one hospital stay in the last 30 days was not significantly different from the 53.9% observed over the last seven years.
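The comparison above is a two-sample test of proportions; a minimal pure-Python version of the pooled z statistic is sketched below. The counts in the usage example are illustrative, not the study's actual data, and the study's exact test may differ.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for comparing two proportions with a pooled standard error.

    x1/n1 and x2/n2 are the observed proportions in the two samples
    (e.g. Q4-2022 decedents vs. the full 2016-2022 cohort).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

A |z| below roughly 1.96 corresponds to no significant difference at the 5% level, consistent with the finding reported above.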

Conclusion: Real-world data can be useful for integrating into clinical research. There is much planning and research upfront to prepare the data before it can be used. Multidisciplinary collaborations can help demystify patient records to avoid misuse and biases associated with secondary uses. The clinical research development framework using real-world data involves analytical algorithms, data management systems, data harmonization, human-centered applications, visualization tools, clinical knowledge, shadowing, and conversations. Data modeling and analytics are important preparation steps towards unlocking knowledge in real-world data.

Disclaimer: This work was supported by a grant from BDHSC and received approval from institutional IRB to use decedent data as the source of research.
Presented by
Denise S. Davis <denisesd@email.sc.edu>
Institution
University of South Carolina, College of Engineering and Computing
Hashtags
#datamodels #analytics
Chat with Presenter
Available Nov 12, 12-1:30pm, 3-3:30pm

Diabetes and the Social Link to Cancer

Victoria M Conerly

Abstract
Minorities are often hit hardest by diagnosed cases of cancer and diabetes, particularly in the hotter southern parts of the country. The goal of "Diabetes and the Social Link to Cancer" is to bring more awareness to others, with the hope of eradicating both diseases at the (cytosis) level. It takes a focused approach to food intake in hotter weather and how it may affect our metabolism in ways that can contribute to cancer and/or diabetes. At a minimum, it aims to find and utilize new technology for better treatment options, along with new preventative care for people living with cancer and/or diabetes. I plan to develop and implement programs that bring awareness to both diseases locally, in communities, by creating workshops, involving public health officials, and educating people about how food and beverage consumption, along with their demographic region and environment, can affect their health.
Presented by
Victoria Conerly <conerlyandassociates@yahoo.com>
Institution
Conerly & Associates SP
Hashtags
#diabetesandthesociallinktoCA

Predicting HOMO-LUMO Gap Values Using An Advanced Machine Learning Platform for Drug Discovery: ATOM Modeling PipeLine (AMPL)

Renate Toldo, Sarah Norris, Justin Overhulse, Ph.D., Chloe Thangavelu, Ph.D.

Abstract
The HOMO-LUMO gap is a quantum mechanical (QM) property of compounds which can predict compound stability and reactivity. For this reason, gap values can be useful for determining which lead compounds may become viable drugs. The process of determining QM properties, such as the HOMO-LUMO gap, is lengthy and computationally expensive; this leads to increased time and cost requirements for drug development. We are addressing the need for a more streamlined process by training a machine learning (ML) model on the QM9 dataset, which includes many drug-like molecules, to predict a molecule’s HOMO-LUMO gap value from its SMILES string. We used the ATOM Modeling PipeLine (AMPL) to featurize the dataset and split the data to best train our model. The best model we produced was a scaffold split neural network (NN) model resulting in a test R2 value of 0.798. We cross-examined the QM9 model and the curated dataset against two other models and their datasets to determine how our model performed with new data. Pereira’s dataset contains 37,486 compounds and was calculated at the B3LYP/6-31G* level of quantum chemistry. QM9 was calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry and contains 129,435 compounds. The largest dataset used, QMugs, totals 437,714 compounds and was calculated at the ωB97X-D/def2-SVP level. When tested on the QM9 model, the R2 of the QMugs dataset was 0.161, and the R2 of Pereira’s dataset was 0.190. This process revealed a greater need for diverse datasets made with the most recent level of QM theory calculations.
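The R2 values reported above are the standard coefficient of determination. A minimal pure-Python version, as a reminder of what the 0.798, 0.161, and 0.190 scores measure (libraries such as scikit-learn provide the same metric):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```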
Presented by
Renate Toldo <rtoldo@butler.edu>
Institution
Butler University; Frederick National Laboratory for Cancer Research
Hashtags

Comparison of neural networks with tree-based machine learning approaches for predictive drug response models

Vineeth Gutta, Satish Ranganathan, Sara Jones, Matthew Beyers, Sunita Chandrasekaran

Abstract
Using deep learning in cancer research to tackle scientific challenges is becoming an increasingly popular technique. Advances in enhanced data generation, machine learning algorithms, and compute infrastructure have led to an acceleration in the use of deep learning in various domains of cancer research, such as drug response problems. In our study, we explored tree-based models to improve the accuracy of a single-drug response model and demonstrate that tree-based models such as XGBoost have advantages over deep learning models, such as a convolutional neural network (CNN), for single-drug response problems. However, comparing models is not a trivial task. To make training and comparing CNNs and XGBoost (eXtreme Gradient Boosting) more accessible to users, we developed an open-source library called UNNT (A novel Utility for comparing Neural Net and Tree-based models).
Presented by
Vineeth Gutta
Institution
University of Delaware
Hashtags
Chat with Presenter
Available Nov 12 3:00 - 3:30 PM MST

Prediction of key metastatic genes in head and neck squamous cell carcinoma (HNSCC) using a deep learning based context-aware foundation model for network biology

Tarak Nandi, Christina Theodoris, Alex Rodriguez and Ravi Madduri

Abstract
Cancer metastasis is responsible for most cancer-related deaths, yet the genes driving metastasis remain poorly defined. Identifying the key genes that promote metastasis is critical for developing effective treatments, and recent advances in single-cell RNA sequencing (scRNA-Seq) methods provide information about gene expression in individual cells, enabling the modeling of complex gene regulatory networks. Here we apply Geneformer [1], an attention-based foundational deep learning model, to identify the core genes involved in the transition from primary to metastatic tumors using scRNA-Seq data from head and neck squamous cell carcinoma (HNSCC).

Geneformer is pretrained on ~30 million diverse human single-cell transcriptomes (using cells with a low mutational burden) to learn fundamental properties of gene network dynamics and gene hierarchy, and fine-tuned versions of the model have been shown to perform well on several genomic tasks, including identification of therapeutic targets for cardiomyopathy. For each single-cell transcriptome, the model takes as input a list of genes sorted by their normalized expression values and embeds each gene into a 256-dimensional space that encodes the characteristics of the gene specific to the context of that cell. Subsequently, the embeddings of the genes expressed in each cell are integrated to generate cell-level embeddings that encode the characteristics of the cell state.
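The rank-ordered input described above can be sketched in a few lines. This is an illustrative simplification of the encoding, with placeholder gene names; the actual Geneformer pipeline also normalizes expression against gene-specific medians and truncates to a fixed input length.

```python
def rank_encode(expression, vocab=None):
    """Order genes by normalized expression, highest first.

    `expression` maps a gene to its normalized expression in one cell;
    the transformer consumes the genes as this rank-ordered list.
    Optionally restrict to genes in the model's vocabulary.
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    if vocab is not None:
        ranked = [g for g in ranked if g in vocab]
    return ranked
```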

We extend the scope of the Geneformer model by re-training it (using a transfer-learning approach) on a corpus of approximately 6 million single cells (normal and cancer cells combined) spanning 10 cancer types [2], to teach the model to delineate healthy and cancerous states. Subsequent fine-tuning on much smaller amounts of scRNA-Seq data collected from both primary tumor sites and (early) metastatic sites (cervical lymph node) for HNSCC [3] enabled Geneformer to distinguish between primary-tumor and metastatic-tumor cells and to learn the gene hierarchy involved in metastasis. The Geneformer model is particularly effective in sparse, disease-specific data settings, thanks to its knowledge of gene network dynamics acquired through extensive pretraining across a wide variety of cells.

We present results from an in-silico perturbation approach aimed at identifying genes that drive HNSCC metastasis, particularly the epithelial-to-mesenchymal transition (EMT), which confers migratory and invasive properties on cancer cells. Genes from the cells collected at the metastatic site whose deletion from the model shifts the cell-level output embeddings toward those of the non-metastatic state (or whose over-expression shifts the embeddings of the primary cells toward those of the metastatic cells) are considered potential therapeutic targets.
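The scoring behind this perturbation scheme can be sketched as an embedding-shift comparison. The vectors and the cosine-difference score below are illustrative stand-ins for Geneformer's 256-dimensional embeddings and its actual shift metric:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def perturbation_shift(perturbed, metastatic_ref, primary_ref):
    """Score an in-silico gene deletion by where it moves the cell embedding.

    A positive score means the perturbed embedding sits closer to the
    primary (non-metastatic) reference state than to the metastatic one:
    the signature of a candidate metastasis driver in the scheme above.
    """
    return cosine(perturbed, primary_ref) - cosine(perturbed, metastatic_ref)
```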

The next phase of our work will involve the creation of a more extensive and refined cancer cell dataset to improve cancer-cell-specific predictions, a more detailed study of the genes identified as key for EMT to determine the affected pathways and confirm their relevance, and analysis of the model's attention weights to predict gene-gene interactions driving the metastatic transition.



References:

1. C. V. Theodoris, L. Xiao, A. Chopra, M. D. Chaffin, Z. R. Al Sayed, M. C. Hill, H. Mantineo et al., "Transfer learning enables predictions in network biology," Nature, vol. 618, pp. 616-624, 2023.

2. Chan Zuckerberg Initiative, "CZ CELLxGENE Discover," Online: https://cellxgene.cziscience.com/, Accessed: Aug. 11, 2023.

3. H. S. Quah, E. Y. Cao, L. Suteja, C. H. Li, H. S. Leong, F. T. Chong, S. Gupta et al., "Single cell analysis in head and neck cancer reveals potential immune evasion mechanisms during early metastasis," Nature Communications, vol. 14, no. 1, p. 1680, 2023.

Presented by
Tarak Nandi
Institution
Argonne National Laboratory, Gladstone Institutes (Institute of Cardiovascular Disease, Institute of Data Science and Biotechnology), University of California San Francisco
Hashtags
#cancer #metastasis #deeplearning #languagemodel #transformers #singlecell #genomics #genenetworks
Chat with Presenter
Available November 12th, 12-1:30 PM MST and 3-3:30 PM MST

Transformer Based Reinforcement Learner for Dynamic Cancer Treatment

Sarang Gawane, Xinhua Zhang, Guadalupe Canahuate, Andrew Wentzel, Clifton Fuller, Mohamed Naser, Elisa Tardini, Lisanne Van Dijk, Abdallah Mohamed

Abstract
Dynamic Treatment Regimes (DTR) are models that help clinicians determine the course of action while treating patients, especially where a sequence of decisions is involved. These models, or collections of models, can guide clinicians' decision making using the medical history and a clinical decision support system, so that an appropriate intervention leading to an optimal treatment can be chosen for a specific desired outcome. DTR models have been gaining popularity lately and have been used to treat various conditions such as depression and ADHD.

The Transformer-based Meta-Reinforcement Learner (TMRL) is a novel solution for DTRs for cancer patients that harnesses the computational prowess of the encoder Transformer architecture. Coupled with a meta-learning framework and reinforcement learning algorithms, we can train our model on the medical histories of patients previously treated. The idea is a framework with a bilayered understanding of the various medical decisions taken while treating a patient. We utilized a dataset curated by the University of Texas MD Anderson Cancer Center between 2005 and 2013, containing the toxicity levels, medical histories, and treatment procedures of 536 patients with head and neck cancer. In the existing setup, decisions on whether to proceed with procedures such as radiotherapy, concurrent chemotherapy, and induction chemotherapy are generally made by a diverse board of clinicians after analyzing a large number of metrics and pre-treatment variables. Since treatment is conducted largely case by case, a meta-RL framework could be considerably effective for such a problem.

We therefore frame this problem as a Markov Decision Process in which the patient's condition represents the agent's state, the medical interventions are its actions, and the reward is a weighted sum of two conflicting outcomes: survival and quality of life. The former refers to the longevity of the patient's life, which could be improved at the cost of the latter, or compromised, leading to a relatively painless but shorter life span. Patients generally have different tradeoff preferences.
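The reward described above is a convex combination of the two outcomes. A minimal sketch, assuming both outcomes are normalized to a common scale and a single scalar preference; the actual study's reward shaping may differ:

```python
def reward(survival, quality_of_life, preference=0.5):
    """Weighted sum of the two conflicting outcomes.

    `preference` in [0, 1] encodes a patient's tradeoff: 1.0 weights
    survival fully, 0.0 weights quality of life fully. Both outcome
    values are assumed to be normalized to the same scale.
    """
    return preference * survival + (1 - preference) * quality_of_life
```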

The meta-Transformer architecture is premised on the idea that its attention mechanism can build a memory contextualization of the various treatment scenarios, along with the tradeoff preferences of a similar cohort of patients. Using proximal policy optimization, we trained the Transformer to find a policy that delivers superior cumulative rewards, compared with the state of the art, over multiple epochs and across various difficulty levels of distinguishing the preference factor. The attention weights also allowed us to locally interpret and visualize the learned treatment policy based on similar patients' experiences. Due to the large state space, we conducted the experiments on the COMPaaS DLV system, a high-performance computing platform at the Electronic Visualization Laboratory at the University of Illinois Chicago, equipped with 64 NVIDIA V100 and T4 GPUs.
Presented by
Sarang Gawane
Institution
University of Illinois Chicago
Hashtags
#meta-reinforcement-learning
Chat with Presenter
Available 12 pm – 1:30 pm MST and during the break

Transformer Based Reinforcement Learner for Dynamic Cancer Treatment

Sarang Gawane, Xinhua Zhang, Elisa Tardini, Guadalupe Canahuate, Abdallah S R Mohamed, Lisanne Van Dijk, Clifton D Fuller G Elisabeta Marai, Mohamed Naser

Abstract
Dynamic Treatment Regimes (DTR) are models that help clinicians determine the course of action while treating patients, especially where a sequence of decisions are involved. The models or collections of models can be used to guide decision making for doctors using the medical history and clinical decision support system where an appropriate intervention leading to an optimal treatment can be materialized given a specific desired outcome. The DTR models have been gaining popularity lately and can be seen utilized in treating various conditions such as depression and ADHD.

The Transformer based Meta-Reinforcement Learner (TMRL) is a novel solution for DTRs of cancer patients that harnesses the computational prowess of the encoder Transformer architecture. Coupled with a meta-learning framework and reinforcement learning algorithms, the model is trained on the medical histories of patients who have previously undergone treatment. The idea is a framework with a bilayered understanding of the various medical decisions taken while treating a patient. We utilised a dataset curated by the University of Texas MD Anderson Cancer Center between 2005 and 2013, containing toxicity levels, medical histories, and treatment procedures for 536 patients suffering from head and neck cancer. In current practice, decisions on whether to proceed with medical procedures such as radiotherapy, concurrent chemotherapy, and induction chemotherapy are generally made by a diverse board of clinicians who weigh a large number of metrics and pre-treatment variables. Since treatment is conducted largely on a case-by-case basis, a meta-RL framework could be considerably effective in tackling such a problem.

We therefore frame this problem as a Markov Decision Process in which the patient's condition represents the state, the medical interventions are the agent's actions, and the reward is the weighted sum of two conflicting outcomes: survival rate and quality of life. The former refers to the longevity of the patient's life, which can be improved at the cost of the latter, or compromised in favor of a relatively painless but shorter life span. Patients generally have different tradeoff preferences.
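The weighted-sum reward described above can be sketched in a few lines. This is an illustrative example only, not the authors' implementation; the parameter name `w` for the patient's preference weight is an assumption.

```python
def scalarized_reward(survival: float, quality_of_life: float, w: float) -> float:
    """Scalarize the two conflicting outcomes into a single MDP reward.

    w = 1.0 rewards longevity only; w = 0.0 rewards quality of life only.
    Intermediate values of w encode a patient's tradeoff preference.
    """
    assert 0.0 <= w <= 1.0, "preference weight must lie in [0, 1]"
    return w * survival + (1.0 - w) * quality_of_life
```

Learning a single policy that performs well across the whole range of `w` is what makes this a meta-reinforcement-learning problem rather than a standard one.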

The Meta Transformer architecture is premised on the idea that its attention mechanism can build a memory contextualization of the various treatment scenarios, along with the tradeoff preferences of a similar cohort of patients. Using proximal policy optimization, we trained the Transformer to find a policy that delivers superior cumulative rewards compared with the state of the art, over multiple epochs and across various difficulty levels of distinguishing the preference factor. The attention weights also allowed us to locally interpret and visualize the learned treatment policy based on similar patients' experiences. Because of the large state space, we conducted the experiments on the COMPaaS DLV system, a high-performance computing platform at the Electronic Visualization Laboratory at the University of Illinois Chicago, which is equipped with 64 Nvidia V100 and T4 GPUs.
Presented by
Sarang Gawane
Institution
University of Illinois at Chicago
Hashtags
#reinforcementlearning #dynamiccancertreatment
Chat with Presenter
Available November 12th, 12pm-1pm

Towards Physiology and Synthesis-Informed Generative Modeling in Drug Discovery

Nolan English, Belinda Akpa, Zach Fox

Abstract
The promise of generative AI models for drug design lies in their ability to explore a vast, largely uncharted chemical space. Current estimates suggest the existence of up to 10^60 chemical compounds with potential therapeutic effects; only approximately 10^8 of these have ever been synthesized. Although generative models can propose and assess novel drug candidates in a high-throughput manner, most candidates fail to meet two criteria crucial for a successful drug. First, generative models frequently propose molecules that are not synthesizable. The integration of chemical knowledge into a generative model often relies on embedding a "chemical language" into the model using an autoencoder. Unfortunately, the syntax and grammar of chemistry often get lost in the encoding process. Consequently, proposed structures may violate fundamental rules of organic chemistry, rendering them invalid; alternatively, valid molecules may lack known synthesis pathways. Second, most generative workflows are designed to optimize drug properties that do not represent human-level outcomes. This limits the likelihood that the proposed molecules, including those targeting specific cancerous tissues, will exhibit efficacy in clinical trials.

Coupling generative AI with graph-based retrosynthetic and differential equation-based human systems models could address these shortcomings. Retrosynthetic pathway models establish whether a path exists to synthesize a given molecule from known building blocks, and these models can score synthesis pathways based on important factors such as cost. Human systems models, especially physiologically based pharmacokinetic (PBPK) models, are crucial in predicting drug disposition in specific tissues, including tumors, and their potential effect on disease. While retrosynthesis and human systems models could address two key barriers to successful molecular design, these models are computationally expensive, and thus they create bottlenecks in generative modeling workflows.

In this presentation, I will discuss the challenges faced with embedding cross-discipline knowledge into a generative modeling framework. I will introduce how physiologically based pharmacokinetic (PBPK) models can better inform the optimization criteria for generative drug discovery. I will discuss how retrosynthetic analysis can pre-filter drug candidates by synthetic accessibility as part of the generation process. Lastly, I will present how we have integrated these models into our generative modeling framework with a client-server approach meant to compensate for the significant differences in runtime. This modular framework not only paves the way for more effective drug design through the inclusion of future problem-specific models but also allows for easier interdisciplinary collaboration within the same framework.
Presented by
Nolan English
Institution
Oak Ridge National Laboratory
Hashtags

GDC-GPT(v0.2) A large language model for querying the Genomic Data Commons

Aarti Venkat, Anirudh Subramanyam, Robert Grossman

Abstract
Large language models (LLMs) have revolutionized the field of natural language processing, but their potential for cancer research is only beginning to be understood. Here, we propose a simple framework for continual pre-training of GPT-2 on data from the Genomic Data Commons (GDC), which contains genomic data from 78 projects spanning over 86,000 cases. Using various API endpoints in the GDC, we demonstrate a method of composing a training corpus of clinical and genomic data, consisting of gene-level somatic mutations and their impact on transcript function, exposure information such as alcohol history, demographics and ethnicity, pathological stage of mutation, primary diagnosis, and associated treatment. Preliminary evaluations on randomized chunk-completion prompts yield an accuracy of 99.6%, suggesting the model is capable of generating accurate variant annotations and clinical descriptions of chromosomal mutations, and it outperforms the baseline GPT-2 model. The framework we propose can easily be extended with additional training data from other multi-omic API endpoints in the GDC, such as gene expression, copy number variants, and survival, and provides insights on strategies for fine-tuning an LLM on data from a commons. The GDC-GPT(v0.2) model could provide a simple means to search and explore data in the GDC, including for researchers unfamiliar with its APIs.
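The corpus-composition step described above relies on the GDC's public REST API. As a minimal, hedged sketch (not the authors' pipeline), the following Python builds a query payload for the GDC `cases` endpoint; the specific field names are illustrative examples of the categories the abstract mentions and should be checked against the GDC API documentation.

```python
import json

# Public GDC REST API base; the `cases` endpoint returns clinical metadata.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"

def build_cases_query(project_id: str, size: int = 100) -> dict:
    """Assemble request parameters for the GDC `cases` endpoint.

    The field names below (demographics, diagnosis, exposure) are
    illustrative; consult the GDC API docs for the authoritative list.
    """
    filters = {
        "op": "in",
        "content": {"field": "cases.project.project_id", "value": [project_id]},
    }
    fields = ",".join([
        "demographic.ethnicity",
        "demographic.race",
        "diagnoses.primary_diagnosis",
        "exposures.alcohol_history",
    ])
    # The GDC API expects `filters` as a JSON-encoded string.
    return {"filters": json.dumps(filters), "fields": fields,
            "size": size, "format": "JSON"}
```

The payload can then be sent with any HTTP client, and the JSON responses flattened into text chunks for the continual pre-training corpus.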
Presented by
Aarti Venkat, Anirudh Subramanyam
Institution
University of Chicago
Hashtags
Chat with Presenter
Available Nov 12th 3-3.30 pm MST

Evaluating Algorithmic Bias on Triple-Negative Breast Cancer Data in Six SEER Registries

Jordan Tschida∗, Mayanka Chandrashekar∗, Alina Peluso∗, Zachary Fox∗, Charles Wiggins†, Antoinette M. Stroup‡, Stephen M. Schwartz§, Eric B. Durbin¶, Xiao-Cheng Wu∥, Heidi A. Hanson∗

Abstract
Racial bias in artificial intelligence (AI) for clinical care and public health may create or exacerbate health disparities. Black women are more likely to be diagnosed with triple-negative breast cancer (TNBC) than women of other races. We investigate algorithmic bias in a deep-learning-based classifier for TNBC pathology reports. The data consist of 594,875 breast cancer pathology reports from six Surveillance, Epidemiology, and End Results (SEER) registries. We find that racial differences in classification accuracy exist, with more accurate TNBC predictions for Black women relative to White women.
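A bias audit of this kind reduces to stratifying a standard metric by a demographic attribute. As a minimal illustrative sketch (not the study's actual evaluation code):

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Classification accuracy stratified by a demographic attribute.

    y_true, y_pred: parallel sequences of labels; groups: the group
    membership of each example. Returns {group: accuracy}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}
```

Comparing the per-group accuracies (and, in practice, stratified sensitivity and specificity as well) is what surfaces the racial performance differences the abstract reports.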

Clinical relevance: Investigating AI model performance on minority populations ensures racial disparities in the data are not exacerbated when models are deployed in a clinical setting.
Presented by
Jordan Tschida <tschidajl@ornl.gov>
Institution
∗Advanced Computing for Health Sciences, Oak Ridge National Lab; †Department of Internal Medicine, University of New Mexico; ‡New Jersey State Cancer Registry, Rutgers Cancer Institute of New Jersey; §Division of Public Health Sciences, Fred Hutchinson Cancer Center; ¶Cancer Research Informatics Shared Resource Facility, Markey Cancer Center; University of Kentucky; ∥Louisiana State University
Hashtags
#bias #breastcancer #SEER #artificialintelligence #biomarkers
Chat with Presenter
Available 12 pm – 1:30 pm MST

Enhancing Authenticity in Cancer-Related Information Retrieval Using Retrieval Augmented Generation LLM Framework

Ashish Mahabal(1), Asitang Mishra(2), Kristen Anton(3), Maureen Ryan Colbert(3), Sean Kelly(2), Heather Kincaid(2), Daniel Crichton(2), and the EDRN Team

Abstract
In the rapidly advancing field of computational oncology, the authenticity and accuracy of generated content are crucial. We present our work employing the Retrieval Augmented Generation (RAG) Large Language Model (LLM) framework, specifically adapted to minimize content hallucinations — erroneous or invented pieces of information — by marrying the robustness of RAG-LLM with the granularity of fine-tuning, drawing from two primary sources: (1) a comprehensive corpus of academic papers, and (2) a curated set of both precise and broader questions paired with expert-validated answers.

Importantly, our methodology's intrinsic design allows for adaptability, rendering it applicable across various domains in the broader scientific spectrum. Through this strategy, we aim to ensure that the information generated is both authentic and reflective of established scientific knowledge, thereby aiding in the quest for trustworthy computational insights in the fight against cancer. Our method also offers an avenue for illuminating gaps in our current understanding. By examining the inconsistencies and areas where the model struggles, we can identify uncharted territories in the literature that may require further empirical scrutiny. This may motivate relevant areas for future research, driving innovation in the field.

Our initial application centers on the pivotal research on biomarkers for the early detection of cancer. We are utilizing the large body of data and published results generated by the NIH-funded Early Detection Research Network (EDRN) consortium, for which we serve as the informatics center. In particular, in the current study we can point to underuse of certain data types, e.g., genomic, proteomic, etc. by comparing EDRN output with that of broader NCI initiatives for which data is publicly accessible, like the Human Tumor Atlas Network (HTAN). Our method may point to lacunae and/or contradictions within the corpora. We will explore these as future steps. We welcome feedback that might further solidify the authenticity of generated content and pave the way for groundbreaking discoveries in computational cancer research including early detection.
Presented by
Ashish Mahabal
Institution
1: California Institute of Technology, Pasadena, CA, USA 2: University of North Carolina, Chapel Hill, NC, USA 3: Jet Propulsion Laboratory, California Institute of Technology, CA, USA
Hashtags
#cancer #LLM

Quantum-Assisted Prediction of Pharmacokinetic Parameters for Plant-Based Small Molecules Targeting Cancer Protein using ATOM Modeling PipeLine (AMPL)

Priyanka Banerjee3, Vijay P Bhatkar10, Anagha Bhuvanagiri6, Saanvi Gadila6, Jaspreet Kaur Dhanjal2, Dimple Khona6, H Kim Lyerly11, Asheet Kumar Nath1, Ana Maria Lopez5, Amita Pathak6, Koninika Ray6, *Amit Saxena1, Smita Saxena4, Akshay Seetharam6, Anil Srivastava6, Eric Stahlberg9, Aanya Tiwari6, Richa Tripathi7, Zhao Zheng8

Abstract
In this study, we introduce an innovative approach for the prediction of crucial safety and pharmacokinetic parameters associated with phytochemicals targeting cancer proteins. Our methodology leverages the versatile ATOM Modeling PipeLine (AMPL) and combines classical machine learning techniques with quantum machine learning to enhance the predictive capabilities of our model.

The initial phase of our approach involves the meticulous curation and standardization of plant-based small molecules, or phytochemicals, known to possess anti-cancer activity mediated through specific protein targets. Subsequent data refinement and exploratory chemical-space analysis have contributed to an improved understanding of the dataset's underlying characteristics. To augment ligand representations and optimize model performance, we employ quantum feature mapping using prominent quantum computing libraries.
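The abstract does not specify which quantum feature map is used. As an illustrative, classically simulated toy example of the general idea, an angle-encoding map turns each real-valued feature into the cosine/sine amplitude pair of a single-qubit rotation; this sketch is an assumption for exposition, not the authors' method.

```python
import math

def angle_encode(features):
    """Toy angle-encoding feature map, simulated classically.

    Each input feature x is mapped to the pair (cos x, sin x), doubling
    the feature dimension. Real quantum feature maps (e.g. those in
    quantum computing libraries) build richer, entangled encodings.
    """
    out = []
    for x in features:
        out.extend((math.cos(x), math.sin(x)))
    return out
```

The enriched vectors can then be fed to any classical regressor, which is the "quantum-enhanced features into a classical ML model" pattern the abstract describes.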

Building upon the enriched ligand representations, we proceed to train the ML Regressor. By seamlessly integrating quantum-enhanced features, our approach harnesses the inherent power of quantum computing to capture intricate ligand-protein interactions, significantly elevating prediction accuracy. The results of our study showcase the immense potential of quantum machine learning in the domain of drug discovery and development. The fusion of quantum feature mapping with classical machine learning enables more accurate predictions of critical safety and pharmacokinetic parameters for phytochemicals targeting the cancer proteins. This methodology not only advances our understanding of quantum-assisted drug discovery but also presents a transformative avenue for the identification and optimization of therapeutic agents derived from natural products.
Presented by
Amit Saxena
Institution
1) Centre for Development of Advanced Computing (C-DAC) 2) Indraprastha Institute of Information Technology Delhi (IIIT-Delhi) 3) Charité - University Medicine Berlin 4) SP-Pune University 5) Thomas Jefferson University 6) Open Health Systems Laboratory (OHSL) 7) All India Institute of Ayurveda (AIIA) 8) University of Virginia 9) National Cancer Institute (NCI) 10) Multiversity 11) Duke University School of Medicine
Hashtags
#cancer #Quantum #Plant-based

PieVal: an Open-Source, Efficient, Secure, Gamified, Rapid Document Classification Annotation Tool

Albert William Riedl, MS; Aaron Seth Rosenberg, MD; JP Graff, DO; Matthew S Renquist; Joseph M Cawood; Nicholas R Anderson, PhD

Abstract
As much as 83% of biomedical information is contained within clinical notes [1], and is thus inaccessible to typical structured-data approaches for electronic health record (EHR) based phenotyping. Recent advancements in compute architectures and associated software frameworks have led to the commoditization of Natural Language Processing (NLP) tools to address this challenge. However, these efforts are bottlenecked by their need for annotated datasets for model training and evaluation. Gold-standard data, annotated by Subject Matter Experts (SMEs), is expensive due to the need for highly trained personnel and the time required for annotation. We aimed to decrease annotation time by streamlining the annotation process. We therefore developed PieVal, a web-based, secure, high-throughput tool to both capture new document-level text classification labels and evaluate previously generated ones. Our goal is to extract actionable information from clinical text. We chose to frame our Information Extraction (IE) processes as document classification tasks. This allows for a simple annotation schema that facilitates both initial annotation (training data) and evaluation of machine-generated labels (model performance and drift over time). PieVal was designed specifically to work within this framework and includes two innovations that both improve its fit for purpose and reduce annotation time: 1) annotations are framed as assertion tests, simplifying the user interface; 2) direct support for testing data enrichment strategies, which can reduce the volume of text required for review while still allowing access to the full original text when needed.

We implemented PieVal in support of a clinical NLP task assessing the presence or absence of clonal plasma cells in bone marrow biopsy (BMBx) reports. Initially, 100 BMBx reports were loaded into PieVal and 30 were duplicated to capture intra-operator statistics. Text enrichment was performed, truncating the overall text to focus on the kappa and lambda stains the SMEs felt were most salient for identifying plasma cell clonality. Two SMEs, a hematologist and a hematopathologist, annotated the reports within the PieVal interface, and the data were used to train a document classifier. After training, the model was deployed over all BMBx reports, and 100 machine-labeled reports were reviewed in PieVal by both SMEs. Across a total of 460 unique annotation events, we achieved an intra-operator agreement of 0.93 and an inter-operator agreement of 0.81. The trained NLP model achieved an accuracy of 0.94, a c-statistic of 0.94, and an F1 score of 0.91, indicating robust signal capture. PieVal has since been used on a variety of tasks, capturing over 2,000 annotations with an average annotation time of 15-30 seconds. All annotators reported that the tool is easy and pleasant to use, and appreciated the built-in annotator scoreboard, which incentivizes engagement through competition and reward. In conclusion, we demonstrate rapid annotation time, efficient measurement of annotation quality, a competently trained machine learner, and qualitative measures of annotator engagement. PieVal represents an annotation tool to fuel future work incorporating clinical note data for EHR-based phenotyping.

1. Murdoch TB, Detsky AS. The Inevitable Application of Big Data to Health Care. JAMA 2013;309:1351--
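The abstract reports intra- and inter-operator agreement but does not state which agreement statistic was used. As one common choice, Cohen's kappa for two annotators can be computed in plain Python; this is an illustrative sketch, not PieVal's code.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    labels_a, labels_b: parallel lists of labels (e.g. True/False for the
    assertion-test style annotations described above).
    """
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label marginals.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single label throughout
    return (observed - expected) / (1.0 - expected)
```

Framing annotations as binary assertion tests, as PieVal does, is what makes such pairwise agreement statistics straightforward to compute.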
Presented by
Bill Riedl <awriedl@ucdavis.edu>
Institution
University of California, Davis
Hashtags
Chat with Presenter
Available November 12th 330-4pm to align with cafcw break

ClinicalUnitMapping.Com Takes a Small Step Towards Machine Comprehension of Clinical Trial Data

Jacob Barhak & Joshua Schertz

Abstract
ClinicalTrials.Gov is the database storing data from clinical trials; many clinical trials are required by U.S. law to report their findings in it. On 2022-08-26 this database held 425,969 clinical trials, 55,248 of which had numeric results. However, the data is not standardized, and the numerical data cannot be machine-comprehended because the units of measure are not standardized. There were 36,752 unique units of measure, compared to 24,548 on 2019-04-12 - an increase of 12,204 units over roughly 40 months, or almost 10 new unique units of measure added per day. To use the numerical data in disease modeling, machine support is needed to standardize this data; such standardization is necessary if we wish to use the data in computational models. ClinicalUnitMapping.com is a web tool constructed to help standardize this data and merge it with the following standards and specifications: UCUM, RTMMS / IEEE 11073-10101, BIOUO, and CDISC (IEEE 11073-10101 - adapted and reprinted with permission from IEEE; copyright IEEE 2019, all rights reserved). This presentation will discuss how Python tools are used to: 1) process and index the data, 2) find similar units using NLP and machine learning, 3) create a web interface to support user mapping of those units, and 4) use advanced machine learning tools such as transformers for Natural Language Processing (NLP) to drive inference, and core-sets to speed up labeling and quickly set up an inference engine. This project is ongoing and this presentation is constantly updated for each venue; this version focuses on improvement of supervised learning using transformers and accelerated labeling using core-sets. The latest interactive presentation with results is accessible through: https://www.clinicalunitmapping.com/show/Unit_Mapping_Latest.html
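The actual tool uses NLP and machine learning models for step 2; as a minimal stdlib illustration of the similar-unit-finding idea, `difflib.SequenceMatcher` can rank candidate standard units by string similarity to a raw unit reported in a trial. The candidate list here is a hypothetical example, not drawn from UCUM.

```python
from difflib import SequenceMatcher

def most_similar_units(query, candidates, k=3):
    """Rank candidate standard units by string similarity to a raw unit.

    Case is normalized before comparison so that 'mg/dl' and 'mg/dL'
    score as identical; real unit mapping needs far richer normalization.
    """
    def score(candidate):
        return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    return sorted(candidates, key=score, reverse=True)[:k]
```

Suggestions like these can pre-populate a mapping interface, leaving the human curator to confirm or correct, which is the workflow the site supports.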

The intention is to unify unit standards and machine learning tools that will be able to map all units reported by clinical trials. With such capabilities, the data in this important clinical trials database would become machine comprehensible.

This is an interactive presentation - please explore the tabs above and interact with the figures; they have sliders, widgets, and hover information. Following the tabs in order from left to right will tell the story.
Presented by
Jacob Barhak <jacob.barhak@gmail.com>
Institution
Jacob Barhak Analytics
Hashtags
Chat with Presenter
Available On Nov 12th 12 pm - 1:30 pm Mountain Standard Time (MST) and 3:00 pm - 3:30 pm MST and after the conference on Nov 13-17 10am-noon Central time or by appointment