PhD Research: Probabilistic Machine Learning and Deep Learning for Health and Genetics
Deep Learning · Probabilistic ML · Health & Genetics
My PhD research was motivated by my practical experience deploying AI systems and by recognising the need for general solutions to recurring technical problems. The methods developed in my research are applicable across many domains, but I took a particular interest in advancing AI/ML for impact in health, computational biology and personalised medicine, where unique data complexities call for novel approaches.
I undertook my PhD at the Department of Computer Science at Aalto University, Finland, under the supervision of Professor Samuel Kaski (Director, ELLIS Institute Finland; Professor, Aalto University and University of Manchester). I was fortunate to work on interdisciplinary research with collaborators at the Probabilistic Machine Learning group at Aalto University, the Finnish Center for Artificial Intelligence (FCAI), the Institute for Molecular Medicine Finland at the University of Helsinki, the pan-European INTERVENE consortium, the Broad Institute of MIT and Harvard, and several other international research institutions.
Thesis & Publications
Lectio Praecursoria
In Finland, a public PhD defence begins with the Lectio Praecursoria — a lecture delivered by the doctoral candidate to introduce the research topic to a general audience.
The video below is my Lectio Praecursoria from my PhD defence, delivered on 30.4.2025. The opponent for my defence was Professor Hanna Suominen from the Australian National University.
Honoured Custos, honoured opponent, esteemed audience.
The last time you visited the doctor perhaps you left with a prescription, a referral or a diagnosis.
Behind the scenes, this information was probably captured in your electronic health record – a digital version of your medical history.
But have you ever stopped to wonder: could that data do more than just document your past?
Could it help predict—and even prevent—your next illness?
Imagine this:
Suppose you've been feeling a bit off lately and you visit the doctor.
Your doctor consults your digital health profile – compiled from past medical records and genetic testing done years ago.
Behind the scenes, machine learning models are making predictions about your health, based on patterns learned from millions of other patients.
These algorithms aren't replacing the doctor – they're augmenting the doctor by offering predictions tailored to your unique health profile.
The machine learning system flags a subtle pattern that's easily missed in routine care.
Thanks to this precision diagnostics technology, your doctor identifies a diagnosis that may have taken years to detect using traditional methods.
But that's not all. Based on your genetic testing, another algorithm has helped identify a treatment plan that avoids medications that would have caused adverse effects.
What is the result of all of this? A timely diagnosis and a quicker recovery with fewer side effects.
Personalised medicine is a system where diagnosis, prevention, and treatment are tailored to the individual in order to optimise their health outcomes.
In some areas of medicine, like oncology, cardiology and rare diseases, personalised medicine is having an impact on patient outcomes, for example, by identifying targeted treatments with genetic testing.
And yet, a significant opportunity remains to leverage artificial intelligence and machine learning in enabling the widespread adoption of personalised medicine at scale. Genetics data and other data captured in electronic health records can be analysed by machine learning algorithms to assess a patient's risk of disease, inform targeted screening and preventative programs, and diagnose diseases in the earlier stages to improve patient outcomes.
A key ingredient for achieving this is high quality data, which is where resources like genetic biobanks and national health registries play a vital role.
The research presented in my dissertation was carried out in an interdisciplinary manner and made use of data from genetic biobanks including the UK Biobank and FinnGen, which each contain genomics and health data for around 500,000 individuals.
These genetic biobanks are also linked to longitudinal data derived from electronic health records. These electronic health records contain medical diagnoses, prescriptions and other health-related information spanning several decades, making this data suitable for predictive modelling of future health events from an individual's past medical history.
My research also used nationwide registry data from the FinRegistry initiative, which aggregates information across 19 registries for 7.2 million individuals. This resource enabled the integrated analysis of graph structured data from population-scale networks of family relationships.
Machine learning algorithms can utilise this vast amount of health and genetics data for personalised medicine.
But working with this type of data for critical personalised medicine applications raises a key machine learning challenge, which is the subject of this thesis and which I'll now illustrate with a simple example.
In this plot each panel is one patient and the dots show data collected for two variables – for example, this could be a drug dose on the X axis and a blood pressure reading on the Y axis. Perhaps a doctor is interested in predicting how a patient responds to a change in drug dose. Even in this small sample of 18 individuals you can see how the data trend varies across patients. It's hard to confidently make out the statistical relationships between X and Y for each patient because there are very few data points observed and influences from other factors – like genetics, medical history and lifestyle factors – haven't been measured here.
We could attempt to learn the relationship between X and Y from each patient's own data. Here we see the best-fit line for each patient, and the wide blue bands represent our confidence intervals for these lines. This approach captures the “personalised” part of personalised medicine, but because each patient only has a few data points these models are very uncertain in regions where there is little information.
Alternatively, we could pool together data from all 18 patients and create a “one-size-fits-all” model. For some patients this works well because the overall trends are similar, but for others the model systematically over- or under-predicts. In medicine this is comparable to handing out the same drug to every patient – which may be “good enough” for the “average” patient, yet inevitably leaves many individuals without optimal health outcomes because it does not consider what makes each of us unique.
Instead we can think of our ideal algorithm as one that borrows strength across a larger cohort of patients to learn the patterns that are shared across multiple individuals, while still allowing for patient-specific effects. This is like the sweet spot between the two extremes we just saw. It provides more confidence in the predictions by borrowing statistical power from similar patients and also considers the health data of the individual.
While this was a simple example with two variables, in the general setting of the thesis, the availability of biobank and registry datasets gives the benefit of being able to utilise many X variables to predict Y. Electronic health records measure thousands of variables representing diagnoses, medications and lab results, and for genetics data we are talking on the order of millions of genetic variants. However, this high-dimensional setting poses its own challenges – and raises the question of how to effectively utilise the statistical power of large datasets while ensuring that the model delivers robust statistical inference tailored to individual needs.
The example you saw on the previous slides shows one way of approaching this, using Bayesian hierarchical models. We can fit an individual-specific model with parameters phi_i to each individual i's own data, and for more reliable estimates we also borrow statistical strength from a wider group of individuals, represented in these equations by the shared parameters theta. This principle can be implemented in deep learning algorithms, which are effective at handling high-dimensional data and the various data modalities explored in the thesis, like sequence and graph data, but even then, standard algorithms can still miss many factors that explain patient-specific effects.
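As an editorial aside, the partial-pooling structure behind this example can be sketched as follows (keeping the phi_i and theta notation from the talk; the symbols y_ij and x_ij for individual i's j-th observation are added purely for illustration, and the exact likelihood and priors depend on the model at hand):

$$
y_{ij} \sim p(y \mid x_{ij}, \phi_i), \qquad
\phi_i \sim p(\phi \mid \theta), \qquad
\theta \sim p(\theta)
$$

Pooling all patients into a one-size-fits-all model corresponds to forcing every phi_i to equal theta, while fitting each patient separately corresponds to dropping the shared prior entirely; the hierarchical model sits between these two extremes.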
Across three publications, my thesis introduces specialised machine learning methods for handling these types of challenges in personalised medicine applications. Publication I addresses a generative machine learning problem: how to create synthetic data for a patient's genetic sequence and disease phenotypes. Publications II and III focus on predictive machine learning problems, predicting a patient's health-related outcomes from their electronic health record data. Publication II uses graph neural networks to model the genetic and environmental effects shared across families, while Publication III introduces a hierarchical model that uses similarity in causal mechanisms to improve the generalisability of machine learning methods in new settings.
These contributions address the following three research questions of the thesis.
The first research question asks how a new probabilistic machine learning approach can be developed to efficiently generate large synthetic datasets of millions of individuals from high-dimensional genetics datasets with limited samples.
This work was motivated by the need for synthetic data to facilitate the development of polygenic risk scoring algorithms, which consider the effects of millions of genetic variants across the genome when estimating an individual's genetic predisposition to a certain trait or disease.
I carried out this research in collaboration with the INTERVENE consortium, which developed polygenic risk scoring methods using 1.7 million genomes in various biobanks across Europe. Because of the highly sensitive nature of genetics data, my collaborators needed synthetic datasets that resembled the real data but could be shared freely across research institutions without privacy concerns.
The main contribution of Publication I was a software tool called HAPNEST for simulating genetic sequence and disease phenotype data for over 1 million individuals.
While previous software tools provide methods for generating high-fidelity data, they were not designed for the scale of this project, which required new approaches to efficiently generate very large synthetic datasets without compromising quality. This is achieved with statistical models of the underlying generative processes of genetic sequences and complex disease phenotypes, and simulation-based inference techniques to fit model parameters for diverse ancestry groups. This work also introduced a comprehensive workflow for evaluating the quality of synthetic datasets across fidelity, generalisability and diversity metrics.
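To give a flavour of what simulation-based inference means here, below is a minimal, hypothetical Python sketch of rejection ABC (approximate Bayesian computation) with a toy generative model and made-up summary statistics; HAPNEST's actual generative models and fitting procedure are substantially more involved.

import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n_variants=5000):
    # Toy stand-in for a genetic data simulator: allele frequencies drawn
    # from a Beta distribution whose shape is controlled by theta.
    return rng.beta(theta, theta, size=n_variants)

def summary(x):
    # Summary statistics used to compare simulated and reference data.
    return np.array([x.mean(), x.std()])

# "Reference" data (simulated here with a known theta purely for illustration).
reference_stats = summary(simulate(theta=2.0))

# Rejection ABC: draw parameters from the prior, simulate, and keep draws
# whose summary statistics land close to the reference summaries.
accepted = []
for _ in range(5000):
    theta = rng.uniform(0.5, 5.0)  # prior draw
    if np.linalg.norm(summary(simulate(theta)) - reference_stats) < 0.02:
        accepted.append(theta)

print(f"Approximate posterior mean for theta: {np.mean(accepted):.2f}")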
The second research question examines a different perspective on the role of genetics in predicting and explaining health outcomes by asking how graph neural networks can be used to model relationships between individuals, and what benefits this provides compared to alternative approaches.
This work was specifically motivated by the opportunity to use Finland's nationwide registry data to study population-scale family networks.
You may have been asked by your doctor if anyone in your family has had a particular condition – like heart disease, cancer or diabetes – because it offers clues about your risk for certain illnesses. When there is electronic health record data available for family members it becomes possible to use machine learning algorithms to more accurately model how family history captures both genetic and environmental influences on disease, such as through inherited genetic variants, shared lifestyle factors and access to healthcare.
On a mathematical level this type of relational information can be represented using graphs. Publication II introduces a graph neural network based algorithm that models each patient as a graph, where nodes represent the patient and up to third-degree relatives, edges represent family relationships, and node features are sequences of medical histories derived from electronic health records.
Importantly, because each patient is associated with their own graph, this enables individualised disease risk predictions from family information, aligning with the goals of personalised medicine. In a downstream analysis of the neural network predictions, maximising a mutual information objective helped identify the most influential family members and risk factors for a patient's disease risk.
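For readers who prefer code, here is a minimal sketch of the per-patient graph structure described above, written with PyTorch Geometric. The feature dimension, the toy family edges, the GraphConv layers and the node-0 readout are illustrative assumptions for this sketch, not the exact architecture of Publication II.

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GraphConv

# One graph per target patient: node 0 is the patient, the remaining nodes are
# relatives up to third degree. Node features are fixed-length vectors
# summarising each person's medical history (dimension chosen arbitrarily here).
num_nodes, num_features = 6, 32
x = torch.randn(num_nodes, num_features)

# Undirected family relationships, encoded as pairs of directed edges.
edge_index = torch.tensor(
    [[0, 1, 0, 2, 1, 3, 1, 4, 2, 5],
     [1, 0, 2, 0, 3, 1, 4, 1, 5, 2]], dtype=torch.long)

patient_graph = Data(x=x, edge_index=edge_index)

class FamilyGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.conv1 = GraphConv(in_dim, hidden_dim)
        self.conv2 = GraphConv(hidden_dim, hidden_dim)
        self.readout = torch.nn.Linear(hidden_dim, 1)

    def forward(self, data):
        h = self.conv1(data.x, data.edge_index).relu()
        h = self.conv2(h, data.edge_index).relu()
        # Read out the embedding of node 0 (the target patient) to produce an
        # individualised disease risk prediction.
        return torch.sigmoid(self.readout(h[0]))

model = FamilyGNN(num_features)
risk = model(patient_graph)  # predicted risk for this patient's family graph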
This was applied to a case study of Finland's nationwide registry data, for prediction of five common diseases. This involved constructing over 5 million graphs, with 1,500 features per node derived from 20 years of disease endpoint data. The graph neural network model was compared to clinically inspired baselines, such as querying first-degree family history, and to non-graph-based deep learning algorithms. The results showed that graph neural networks outperformed these baselines, and the performance gains were largest for diseases with higher heritability.
While the deep learning techniques used in Publication II make predictions by identifying statistical relationships that suggest potential associations with disease risk, research question 3 more explicitly addresses how to develop deep learning algorithms that generalise across different settings. In practice this requires new techniques that incorporate models of the causes and effects of disease – not necessarily to interpret causal effects, but to make more robust predictions.
Health data is not independent and identically distributed: there is variability arising from different subtypes of disease, and from different underlying disease mechanisms and treatment responses. Publication III formalises this by treating different disease subtypes or individual patients as different prediction tasks, and proposes learning these tasks jointly using an algorithm that both shares information across tasks and allows each task to specialise.
This approach is evaluated in a case study for UK Biobank and FinnGen data, where tasks correspond to predicting the risk of different stroke types. Each subtype has distinct biological mechanisms and clinical presentations, but also shares overlapping risk factors like age and hypertension – enabling improved prediction through shared learning.
This is related to the hierarchical models I described earlier, consisting of task-specific neural network parameters phi_t and global neural network parameters theta shared across multiple tasks. Publication III introduces a Bayesian meta-learning algorithm where learning is structured around two phases:
1. meta-training, where the model is exposed to a training set of tasks like predicting different disease subtypes, and
2. meta-testing, where the meta-trained models can be applied directly to new patients or adapted to them with fine-tuning.
Task similarity weights are incorporated into the hierarchical model and variational inference algorithm, which act as a guide for how much information should be shared between tasks.
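As a rough illustration of this structure, below is a deliberately simplified, point-estimate sketch in Python (PyTorch): shared parameters theta, task-specific parameters phi_t, and task-similarity weights that control how strongly each task is pulled towards the shared solution. The toy tasks, the quadratic shrinkage penalty and all hyperparameters are assumptions made for the sketch; the algorithm in Publication III is Bayesian and uses variational inference.

import torch
import torch.nn.functional as F

def make_task(n=200, in_dim=10, seed=0):
    # Hypothetical supervised task: logistic-regression-style risk prediction.
    g = torch.Generator().manual_seed(seed)
    X = torch.randn(n, in_dim, generator=g)
    w_true = torch.randn(in_dim, generator=g)
    y = (X @ w_true + 0.5 * torch.randn(n, generator=g) > 0).float()
    return X, y

in_dim = 10
tasks = [make_task(in_dim=in_dim, seed=s) for s in range(3)]  # e.g. disease subtypes
similarity = torch.tensor([1.0, 0.8, 0.2])                    # task-similarity weights (assumed given)

theta = torch.zeros(in_dim, requires_grad=True)                  # shared parameters
phis = [torch.zeros(in_dim, requires_grad=True) for _ in tasks]  # task-specific parameters
optimiser = torch.optim.Adam([theta, *phis], lr=0.05)

# Meta-training: fit all tasks jointly; each phi_t is shrunk towards the shared
# theta with a strength proportional to its similarity weight, so dissimilar
# tasks are allowed to specialise more.
for step in range(500):
    optimiser.zero_grad()
    loss = torch.zeros(())
    for (X, y), phi, sim in zip(tasks, phis, similarity):
        loss = loss + F.binary_cross_entropy_with_logits(X @ phi, y)
        loss = loss + sim * (phi - theta).pow(2).sum()  # hierarchical shrinkage term
    loss.backward()
    optimiser.step()

# Meta-testing (not shown): a new, related task would start from theta and be
# fine-tuned on its own, possibly small, dataset.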
A key question is: what form of task similarity is beneficial? Machine learning algorithms applied to electronic health records are prone to learning from spurious associations in the data – such as shared patient demographics – which can degrade the algorithm's generalisability when tested in new settings outside of the training dataset. Publication III examined this issue by comparing causal-inference-based techniques for estimating task similarity, and demonstrated how this can improve generalisability. The causal methods use auxiliary information in addition to the electronic health record data to identify tasks with similar underlying cause-and-effect mechanisms. For example, Mendelian Randomization uses information about genetic variants associated with the task's disease phenotype to identify tasks with similar biological mechanisms of disease, and the resulting similarities were shown to differ significantly from those produced by non-causal methods.
Collectively, the thesis introduces machine learning methods that are purpose-designed for the needs of large-scale health and genetics data. Realising the potential of artificial intelligence for personalised medicine demands both methodological advances and collaborative efforts from diverse human teams, to address data quality issues, privacy and ethical concerns, and statistical biases, and to ensure safe AI implementation. The case studies in the publications touch on several of these challenges and demonstrate how technical advances can be translated into practical applications towards personalised medicine. AI for health and biology is one of the most promising frontiers for societal impact from AI and machine learning innovations, and I hope that the findings from my research will help guide effective and responsible adoption of AI as these systems become ever more present in our everyday lives.
I ask you, honoured Professor Hanna Suominen appointed as opponent by Aalto University School of Science to present the observations that you consider appropriate for this doctoral thesis.
Key Contributions
Research Question 1
How can a new probabilistic machine learning approach be developed to efficiently generate large synthetic datasets (millions of individuals) from high-dimensional reference datasets with limited samples, offering superior computational scalability compared to previous methods, while achieving comparable data quality (fidelity)?
- Methods & software: HAPNEST software for generating synthetic genotypes + phenotypes
- Practical applications: Sharable, high-fidelity synthetic data for polygenic risk score method development
Research Question 2
How can complex, individual-level biological relationships be modelled using graph neural networks to improve predictions and explanations of predictions for various health outcomes, compared to methods that consider these relationships using simple approaches or not at all?
- Methods & software: Graph representation learning for disease prediction from population-scale networks
- Practical applications: Personalised disease risk predictions and explanations from family history
Research Question 3
How can Bayesian meta-learning utilise similarity in causal mechanisms underlying supervised learning tasks, and how does this aid (i) generalisability to new patients, and (ii) the capacity to mitigate negative transfer, compared to standard meta-learning or equivalent single-task learning baselines?
- Methods & software: Bayesian meta-learning for task-specific and shared learning across tasks, based on similarity in causal mechanisms
- Practical applications: Predictive models for EHR data that generalise to new settings
Datasets
Part of the reason I was able to conduct this research was that I had access to exceptional data resources, including health and genetics data for millions of individuals:
- Genetics biobanks: UK Biobank and FinnGen, which each contain genomics and health data for around 500,000 individuals
- Linkage to electronic health records (EHRs): longitudinal data for medical diagnoses, prescriptions and other health-related information spanning several decades; enabling modelling of an individual's medical history
- Nationwide registries: FinRegistry initiative, which aggregates information across 19 registries for 7.2 million individuals (covering public health care visits, health conditions, medications, vaccinations, laboratory responses, demographics, familial relations and socioeconomic variables); enabling graph-based modelling of genetic and environmental factors of disease from networks of family relationships
- Human genetics datasets: 1000 Genomes Project and Human Genome Diversity Project