Machine learning

SMARTbiomed summer school 2026

Machine Learning with applications to multi-omics and clinical prediction

Overview

This course offers a comprehensive introduction to clinical prediction models, from model development to the evaluation of predictive performance and clinical utility, an overview of the health AI landscape, and a deep dive into modern machine learning approaches used in the analysis and integration of high-dimensional omics data. After a general introduction to clinical prediction models, the program gives participants a solid grounding in predictive modelling specifically for genomics, transcriptomics, proteomics, and related datasets, along with practical experience building and evaluating robust prediction models. The fourth day focuses on unsupervised learning, highlighting the role of Variational Autoencoders (VAEs) in uncovering latent biological structure, reducing dimensionality, and handling noisy, complex molecular measurements. The final day introduces foundation models, including Large Language Models, and explores their rapidly growing influence in multi-omics research, from multimodal data integration to generative modelling and biological discovery. Through a combination of lectures and hands-on practicals, participants will gain the conceptual foundations, technical skills, and practical insight needed to apply cutting-edge machine learning to contemporary omics challenges.

Prerequisites

Basic proficiency in R and Python.

Learning Objectives

Clinical Prediction

Understand the fundamental steps of developing a clinical prediction model
Understand and compare clinically relevant predictive performance metrics and differentiate between predictive performance and impact
Learn and discuss potential pitfalls from model development to implementation
Gain knowledge of clinically relevant topics beyond predictive performance (e.g. algorithmic fairness)
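
As an illustration of the difference between predictive performance and clinical utility, the sketch below computes two performance metrics (AUC for discrimination, Brier score for overall accuracy) and one utility measure (net benefit at a chosen risk threshold, as used in decision curve analysis). The function names and the toy data are assumptions made for this example and are not taken from the course material.

```python
def auc(y_true, y_prob):
    """Share of (event, non-event) pairs in which the event has the
    higher predicted risk; ties count as half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold.
    False positives are weighted by the odds of the threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy outcomes and predicted risks (invented for illustration only).
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]

print("AUC:", auc(y_true, y_prob))
print("Brier score:", brier_score(y_true, y_prob))
print("Net benefit at 30% threshold:", net_benefit(y_true, y_prob, 0.3))
```

AUC and the Brier score describe how well the predictions match the outcomes, whereas net benefit depends on the decision threshold and so reflects the consequences of acting on the predictions.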

Supervised Learning

Understand the characteristics of high-dimensional biological data (e.g., genomics, transcriptomics, proteomics).
Learn the theoretical principles behind supervised learning for omics, including regularisation, model selection, and avoiding overfitting.
Explore commonly used models for omics prediction tasks (e.g., elastic net, random forests, kernel methods, shallow neural nets).
Gain hands-on experience preprocessing omics datasets for prediction.
Build, tune, and evaluate a supervised prediction model for a biological outcome.
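
A minimal sketch of what building, tuning and evaluating a regularised model can look like, assuming scikit-learn (listed under Required Software) and a synthetic dataset standing in for an omics matrix. The dataset sizes and parameter values are illustrative assumptions, not the course's own practical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for omics data: many more features than samples.
X, y = make_regression(
    n_samples=100, n_features=500, n_informative=10,
    noise=10.0, random_state=0,
)

# Scaling sits inside the pipeline so it is re-fitted on each training
# fold, which avoids leaking test-fold information into preprocessing.
# ElasticNetCV tunes the penalty strength with an inner cross-validation.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=0.5, cv=3, max_iter=5000),
)

# The outer cross-validation estimates performance on unseen samples.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=outer_cv, scoring="r2")

print("R^2 per outer fold:", scores)
print("Mean R^2:", scores.mean())
```

Keeping tuning in an inner loop and evaluation in an outer loop is one common way to reduce optimistic bias when the number of features far exceeds the number of samples.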

Unsupervised Learning

Understand the goals of unsupervised learning in omics (structure discovery, denoising, latent variable modelling).
Learn the mathematical foundations of Variational Autoencoders (VAEs), including latent space modelling, encoder-decoder structures, and loss decomposition.
Understand how VAEs handle high-dimensional biological inputs (e.g., gene-expression matrices, proteomics profiles).
Implement preprocessing steps for unsupervised modelling (normalisation, scaling, batch correction awareness).
Build a simple variational autoencoder and interpret its latent space.
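
The loss decomposition mentioned above can be sketched in plain Python. This assumes a diagonal Gaussian encoder, a standard normal prior and a squared-error reconstruction term. The function names and the optional `beta` weight are assumptions for illustration, and a real VAE would compute these quantities on tensors in a deep learning framework.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def reconstruction_error(x, x_hat):
    """Squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), so that sampling
    stays differentiable with respect to mu and log_var."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction term plus beta-weighted KL term (beta=1 gives the
    usual VAE objective under the assumptions stated above)."""
    return reconstruction_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


# Hypothetical encoder outputs for one sample with a 2-D latent space.
rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]
z = reparameterise(mu, log_var, rng)
print("Sampled latent vector:", z)
print("KL term:", kl_to_standard_normal(mu, log_var))
print("Loss with perfect reconstruction:", vae_loss([1.0, 2.0], [1.0, 2.0], mu, log_var))
```

The reconstruction term rewards faithful decoding, while the KL term keeps the latent distribution close to the prior, and the balance between the two shapes how interpretable the latent space is.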

Foundation Models

Introduction to the premise and concept of foundation models.
Mathematical overview of transformers.
Introduction to Large Language Models (LLMs) and applications in omics integration.

Learning Outcomes

By the end of the course, participants will be able to:

Clinical Prediction

Discuss the steps and challenges of developing and evaluating clinical prediction models
Identify relevant evaluation strategies for predictive performance and clinical utility
Navigate resources for clinical prediction research (e.g. TRIPOD+AI, PROBAST+AI)
Identify and discuss barriers to model implementation
Define algorithmic fairness and discuss it in different healthcare scenarios

Supervised Learning

Describe the challenges and properties of high-dimensional omics datasets and how supervised models can address these issues.
Pre-process omics data for predictive tasks, including managing sparsity, scaling, and feature selection.
Build, tune and evaluate supervised learning models using appropriate regularisation and model validation strategies.
Interpret model performance and assess generalisability within a biological context.

Unsupervised Learning

Explain the aims of unsupervised learning in omics, including structure discovery, denoising and latent variable modelling.
Describe VAE architecture, latent space representations and the roles of reconstruction and KL-divergence losses.
Build and train a simple VAE and utilise its latent space for interpretation or downstream analyses.
Compare VAEs to traditional unsupervised approaches such as PCA, clustering and manifold learning techniques.

Foundation Models

Explain the concept and motivation behind foundation models and how they differ from task-specific machine learning methods.
Describe the structure and functioning of Large Language Models and how they may be adapted for biological data.
Summarise current state-of-the-art applications of foundation models in omics, including multimodal integration, representation learning and generative modelling.
Critically evaluate the opportunities, limitations and ethical considerations associated with applying foundation models to biological research.

General Skills Across the Course

Implement end-to-end machine learning workflows for high-dimensional biological datasets.
Apply best practices in reproducibility, model evaluation and clear scientific communication.
Select appropriate machine learning approaches for different omics questions and data modalities.
Confidently navigate emerging AI methodologies and assess their relevance to future multi-omics analyses.

Required Software

Anaconda

Packages: pandas, scikit-learn, umap-learn, jupyterlab, matplotlib, shap, xgboost

All packages are to be downloaded and installed prior to the start of the course.
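
One way to confirm that the Python packages are installed is a short check like the sketch below, run inside the Anaconda environment intended for the course. The helper name `missing_packages` is an assumption for this example. Note that scikit-learn and umap-learn are imported under the module names `sklearn` and `umap`.

```python
from importlib.util import find_spec

# Install name -> module name used in an import statement.
REQUIRED = {
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "umap-learn": "umap",
    "jupyterlab": "jupyterlab",
    "matplotlib": "matplotlib",
    "shap": "shap",
    "xgboost": "xgboost",
}


def missing_packages(required=REQUIRED):
    """Return install names whose module cannot be found in this environment."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required Python packages were found.")
```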

R, RStudio.

Teachers

Professor Chris Yau, Nuffield Department of Women’s & Reproductive Health, University of Oxford.

Chris Yau is Professor of Artificial Intelligence at the Big Data Institute in the Nuffield Department of Women’s & Reproductive Health. His research lies at the interface of statistics, machine learning, genomics, and biomedical data. His lab develops statistical and AI-based methods to interpret high-dimensional molecular, genomic, and health-record data. These tools aim to improve understanding of disease mechanisms, particularly in cancer and other complex diseases, and to support prediction and decision-making in the clinic.

Associate Professor Adam Hulman, Department of Public Health, Aarhus University.

Adam Hulman is an associate professor at the Department of Public Health at Aarhus University. His lab focuses on integrating multimodal health data and modern computing methods to improve disease risk prediction and clinical decision-making. With a background in applied mathematics and over a decade of experience in diabetes epidemiology, Adam is committed to turning complex health and epidemiological data into actionable clinical insights. His research uses advanced statistical and machine-learning methods to combine clinical, longitudinal, and registry data, with the aim of forecasting disease trajectories and complications in real-world populations.

Revised 06.05.2026
