SMARTbiomed summer school: “Causal inference, statistical genetics, and machine learning in common disease epidemiology and biology”

 

Module: Applied Machine Learning with Python

Module Summary

Machine learning is a subset of artificial intelligence that enables computers to learn from data and improve their performance over time without being explicitly programmed for each task. It involves developing algorithms and statistical models that identify patterns in datasets, allowing systems to make predictions or decisions on new input data. Because these models adapt and are refined as more data become available, machine learning can automate complex processes and provide valuable insights in fields such as image and speech recognition, natural language processing, and predictive analytics.

Prerequisites

Basic knowledge of Python.

Module content

The four-day introductory course on applied machine learning with Python will provide a foundational understanding of machine learning concepts. The course will cover core topics such as supervised and unsupervised learning, dimensionality reduction, clustering, key learning algorithms, and model evaluation techniques. The accompanying tutorial sessions will provide hands-on experience with Python and commonly used libraries and packages. Participants will gain practical skills in applying these concepts to typical problems using the datasets provided.

Day 1 (6 hours):

  • Morning (3 hours): Foundations
    • Introduction to the course and overview of machine learning concepts.
    • Supervised Learning I (Regression): Linear regression.
    • Supervised Learning II (Classification): Logistic regression, decision trees and ensemble methods.
    • Model selection and evaluation; performance metrics.
  • Afternoon (3 hours): Practical session
    • Data handling and preprocessing in Python.
    • Regression and classification tasks, cross-validation, and model performance evaluation.
    • Day 1 recap/Q&A.
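
As a preview of the kind of task covered in the Day 1 practical session, the following minimal sketch runs 5-fold cross-validation for a logistic regression classifier. The dataset (scikit-learn's bundled breast cancer data), the scaling step, and all variable names are illustrative assumptions, not the course materials.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled toy dataset stands in for the course datasets.
X, y = load_breast_cancer(return_X_y=True)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation; accuracy is the default score for classifiers.
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Placing the scaler inside the pipeline means it is refitted on the training part of each fold, so no information from the held-out part leaks into preprocessing.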

Day 2 (5 hours):

  • Morning (3 hours): Advanced techniques
    • Unsupervised Learning: clustering and dimensionality reduction.
    • Introduction to gradient boosting machines.
    • Hyperparameter tuning and model explainability.
    • Building integrated ML pipelines in scikit-learn.
  • Afternoon (2 hours)
    • Final project: an end-to-end workflow.
    • Wrap-up, further resources, and concluding Q&A.
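
To illustrate how several Day 2 topics fit together, here is a minimal sketch of a scikit-learn pipeline that chains scaling, dimensionality reduction, and a gradient boosting classifier, then tunes hyperparameters with a grid search. The dataset, the step names, and the parameter grid are assumptions chosen for illustration. It uses scikit-learn's own `GradientBoostingClassifier` rather than xgboost to keep the example to a single library.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled toy dataset stands in for the course datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing, dimensionality reduction, and the model in one estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=10)),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# Hypothetical grid; keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "reduce__n_components": [5, 10],
    "clf__n_estimators": [50, 100],
    "clf__learning_rate": [0.05, 0.1],
}

# 3-fold cross-validated search over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
test_accuracy = search.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

The held-out test set is touched only once, after the search has selected its hyperparameters on the training data.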

Required Software 

  • Anaconda
  • Packages: 
    • pandas, scikit-learn, umap-learn, jupyterlab, matplotlib, shap, xgboost

All packages are to be downloaded and installed prior to the start of the course.
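
One way to confirm the installation before the course is a short check like the sketch below, which reports whether each package in the list above can be found by the active Python environment. The script is an illustrative suggestion, not an official course tool. Note that scikit-learn and umap-learn are imported under the module names `sklearn` and `umap`.

```python
from importlib.util import find_spec

# Maps each package name from the list above to the module name it is imported under.
PACKAGES = {
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "umap-learn": "umap",
    "jupyterlab": "jupyterlab",
    "matplotlib": "matplotlib",
    "shap": "shap",
    "xgboost": "xgboost",
}

# find_spec returns None when a top-level module is not installed.
status = {name: find_spec(module) is not None for name, module in PACKAGES.items()}

for name, installed in status.items():
    print(f"{name:<14}{'OK' if installed else 'MISSING'}")
```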
 

Teachers

Dr Andrey Kormilitzin, University of Oxford

Andrey Kormilitzin is a Group Leader for health data science and machine learning in the Department of Psychiatry and an affiliate member of the Mathematical Institute at the University of Oxford. He has developed an interdisciplinary research programme that integrates statistical machine learning, natural language processing, image analysis, human-computer interaction, and the mathematics of evolving multidimensional data streams. Dr Kormilitzin’s focus is on translating advanced computational methods to address clinical challenges, with the aim of improving the detection, treatment and monitoring of interventions for mental illness and of developing deployable technology for clinical decision support. His broader contributions include an open-source Python NLP package for clinical information extraction (‘Med7’), service as programme and committee chair for international conferences on machine learning (NeurIPS, HealTAC), and organisation of Patient and Public Involvement and Engagement events on applications of AI tools in healthcare.