Machine Learning (ML)

Working Group leaders: Fuchang (Frank) Gao and Audrey Fu

Group members: Audrey Fu, Min Xian, Aleksandar Vakanski, Linh Nguyen, Boyu Zhang, Esteban Hernandez Vargas

Originated: August 2018

Description:

This group studies machine learning methods and models and their applications, with two primary goals:

  1. Bring together researchers in machine learning to discuss the most recent models, algorithms and applications.
  2. Facilitate research collaboration among participants from different disciplines.

Dr. Gao states, “It is important that CMCI continue to support this kind of working group which facilitates the collaboration among researchers from different disciplines. It also creates a learning and research atmosphere of data science on campus.”

A Causal Network Approach to Understanding Transcription and Methylation in Breast Cancer

Project Team: Audrey Fu, Md. Bahadur Badsha, Evan Martin

Complex diseases often involve changes in DNA sequence, in transcription, and in DNA methylation, an epigenetic process that can both regulate and be regulated by gene expression. These changes result in a wide range of symptoms or in multiple subtypes of the same disease. In breast cancer, for example, different patterns of gene expression and DNA methylation characterize subtypes that differ in tumor progression and treatment. To develop more effective treatments for different subtypes, it is necessary to understand the genes and processes (i.e., transcription and methylation) that drive the differences between subtypes. It is therefore of immense interest to understand how genetic variation influences disease through gene regulatory networks. Unfortunately, the identification of genes and processes that are key to diseases is often compromised because inference is based on correlation rather than causation.

Our long-term goal is to develop computational methods to infer gene regulatory networks that are potentially causal for multiple clinical phenotypes using genomic and clinical data of complex diseases. In this project, we will develop new statistical approaches based on the principle of Mendelian randomization to systematically identify regulatory networks involving both transcription and methylation that are potentially causal for disease subtype. We will use breast cancer as the disease model and apply our methods to genomic data. The principle of Mendelian randomization assumes that the alleles of a genetic variant are randomly assigned to individuals in a population, analogous to a natural randomization experiment. This principle has gained increasing attention in genomics, given its power to separate correlation due to causation from correlation not due to causation.
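To make the randomization argument concrete, the following is a minimal simulated sketch, not the project's method. It assumes a single genetic variant used as an instrument, a hypothetical unmeasured confounder, and made-up effect sizes; the function names (`simulate`, `naive_slope`, `wald_ratio`) are illustrative only. It compares an ordinary regression slope, which absorbs the confounding, with the textbook Wald ratio estimate, which uses the variant as an instrument.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def simulate(n=20000, true_effect=0.3, seed=0):
    """Simulate genotype, exposure and outcome with a shared confounder.

    All numeric values here are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    genotype, exposure, outcome = [], [], []
    for _ in range(n):
        # Two alleles, each carrying the variant with probability 0.3,
        # drawn independently of everything else (the randomization step).
        g = (rng.random() < 0.3) + (rng.random() < 0.3)
        u = rng.gauss(0, 1)  # unmeasured confounder
        x = 1.0 * g + u + rng.gauss(0, 1)  # exposure, e.g. expression level
        y = true_effect * x + u + rng.gauss(0, 1)  # outcome
        genotype.append(g)
        exposure.append(x)
        outcome.append(y)
    return genotype, exposure, outcome


def naive_slope(x, y):
    """Least-squares slope of y on x; biased when a confounder drives both."""
    return covariance(x, y) / covariance(x, x)


def wald_ratio(g, x, y):
    """Wald ratio: effect of g on y divided by effect of g on x."""
    return covariance(g, y) / covariance(g, x)


if __name__ == "__main__":
    g, x, y = simulate()
    print("naive slope:", round(naive_slope(x, y), 3))
    print("Wald ratio: ", round(wald_ratio(g, x, y), 3))
```

In this simulation the naive slope should land well above the true effect of 0.3, while the Wald ratio should land near it, because the genotype is independent of the confounder by construction.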

The models and algorithms developed here will allow us to make causal statements about the two processes at the single gene level and account for confounding variables, which similar studies have not examined. These methods will help to identify key genes for specific breast cancer subtypes and elucidate the roles of transcription and methylation when many genes are involved, offering insights into genes and processes that could better inform subtype classification, cancer diagnosis and development of novel drug targets. These methods are not limited to breast cancer but are applicable to complex diseases in general.

Mountain West Mine Tailings, Watersheds and Adverse Human Health Outcomes

Project Team: Alan Kolok, Lucas Sheneman, Chantal Vella

The long-term goal of this program is to model the associations among metal contamination (as a consequence of mining), watershed geography and adverse human health impacts across the Rocky Mountains. In this pilot project, we will focus on generating a predictive classifier model that includes data from Oregon, Washington, Idaho and Western Montana. Our central hypothesis is that geospatial models that incorporate the occurrence of metal contamination in large watersheds can be predictive of adverse health outcomes, including birth defects, pediatric cancers and cardiovascular disease. To test this hypothesis, we will address the following aims.

Aim 1: Derive a collection of interoperable digital map layers of the northwestern United States that effectively integrate adverse health outcomes and hydrologic units (watersheds).
Aim 2: Use supervised machine learning on the data layers derived in Aim 1 to build and train a spatially explicit classifier that assigns mountain west hydrologic regions to discrete categories of estimated relative health risk by correlating adverse health outcomes with identified hydrologic units.

We will comprehensively evaluate the data available from Public Health Departments in Idaho, Oregon, Washington and Montana. We will also acquire data on premature mortality from the National Vital Statistics System via the publicly available CDC WONDER database. Data on the prevalence of pediatric cancer and birth defects will be gathered from state registries, where available. Spatially nested hydrologic unit (watershed) maps at varying scales (regions, sub-regions, accounting units, cataloging units, etc.) will be harvested from the publicly available USGS Watershed Boundary Dataset (WBD). All combined source watershed and public health data will be centrally stored, catalogued, transformed, and managed in collaboration with the Northwest Knowledge Network (NKN) at UI.

We will build a discrete classifier in Esri ArcGIS and/or R using the spatially transformed health data from Aim 1. The trained classifier will assign each uniquely identified hydrologic unit, at multiple scales and following USGS HUC naming conventions, a relative health-risk label on a gradient from low to high. Applying an effective trained classifier across the full input dataset will yield a geospatial data layer that estimates and discretely labels overall relative human health risk within identified watershed boundaries.
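As a rough illustration of the labeling step, here is a toy sketch in Python rather than ArcGIS or R. Everything in it is an assumption made for illustration: the two features, the unit names, the training values, and the choice of k-nearest-neighbors voting, which the project description does not specify.

```python
import math
from collections import Counter

# Ordered risk gradient, from lowest to highest.
RISK_LEVELS = ["low", "moderate", "high"]

# Hypothetical training rows: (metal_index, upstream_mine_density) -> label.
# Both features are invented and scaled to the range 0..1.
TRAINING_UNITS = [
    ((0.10, 0.00), "low"),
    ((0.20, 0.10), "low"),
    ((0.15, 0.20), "low"),
    ((0.50, 0.40), "moderate"),
    ((0.45, 0.50), "moderate"),
    ((0.55, 0.60), "moderate"),
    ((0.90, 0.80), "high"),
    ((0.85, 0.90), "high"),
    ((0.95, 1.00), "high"),
]


def knn_predict(train, features, k=3):
    """Label one unit by majority vote among its k nearest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied vote, prefer the higher risk level (a cautious assumption).
    return max(votes, key=lambda lab: (votes[lab], RISK_LEVELS.index(lab)))


def label_units(train, units, k=3):
    """Return {unit_id: risk_label} for every unlabeled hydrologic unit."""
    return {unit_id: knn_predict(train, feats, k) for unit_id, feats in units.items()}


if __name__ == "__main__":
    # Placeholder identifiers, not real HUC codes.
    unlabeled = {
        "unit_A": (0.12, 0.05),
        "unit_B": (0.88, 0.85),
        "unit_C": (0.50, 0.50),
    }
    for unit_id, label in label_units(TRAINING_UNITS, unlabeled).items():
        print(unit_id, label)
```

A real workflow would derive the features from the Aim 1 map layers and would need to validate the labels against held-out health data; the sketch only shows how a trained classifier turns per-unit features into discrete labels.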

Undergraduate Research In Action

University of Idaho senior Emmanuel Ijezie is double majoring in molecular biology and biotechnology. He’s also one of many undergraduate students working directly with CMCI-related research. He said he specifically chose to attend U of I because of the opportunity to be directly involved with cutting-edge research projects.

Bioinformatic Analysis of Immune-Cell-Derived, Regeneration-Specific Transcripts in Zebrafish

Project Team: Diana Mitchell (PI), Ousseini Isaaka

Start Date: August 1, 2018

For the 19 genes without significant similarity to genes in humans, the following analyses will be performed in order to identify their putative functions:

  1. identify conserved proteins or protein domains
  2. predict protein localization
  3. search for functional domains
  4. predict cellular and functional pathways
  5. perform synteny analysis for hypothetical proteins.

The team will identify suitable published RNA-seq datasets for zebrafish macrophages to probe for the presence or absence of these nonorthologous transcripts in other macrophage-mediated immune responses.

After identifying conserved proteins or protein domains, we will go on to look for species that are known to have regenerative capacity.