The first STM report from the 2019-2020 call has been evaluated. Claudia Swart-Coipan from the National Institute for Public Health and the Environment (RIVM), The Netherlands, visited Nadejda Lupolova at The Roslin Institute, The University of Edinburgh, UK, from 18-25 August 2019 to study source attribution for Campylobacter using machine learning algorithms.

OBJECTIVE: 

The proposed Short Term Mission (STM) will promote the development of skills in the field of big data epidemiology for the home institute.

The project will apply machine learning techniques, developed and implemented at the Roslin Institute in Scotland, to whole-genome sequences of bacteria isolated at the home institute in The Netherlands from humans, animal reservoirs (livestock, wildlife, companion animals) and environmental sources, i.e. various types of surface water.

REPORT: 

The project applied machine learning techniques, developed and implemented at the host institute in Scotland (Lupolova et al. 2016, 2017), to whole-genome sequences of bacteria isolated at the home institute in The Netherlands from humans, animal reservoirs (livestock, wildlife, companion animals) and environmental sources, i.e. various types of surface water.

A first version of a supervised machine learning algorithm, Random Forest, was tested on the accessory genome of 786 environmental Campylobacter jejuni isolates, in which the classes were unbalanced. A second version aimed to correct the class imbalance by using an upsampled version of the dataset, in which all classes were resampled with replacement to reach the number of isolates in the most numerous class. The accuracy of the second model was approximately 80%. Applying this model to 272 human isolates gave results comparable to those of other source attribution models based on bacterial population genetic structure (e.g. STRUCTURE). Further refinements of the model will test various methods of correcting for the class imbalance and assess the uncertainty in the probabilities associated with each class, with the ultimate aim of improving source attribution for the various Campylobacter genotypes using high-resolution typing data (WGS).
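
The upsampling step described in the report can be illustrated with a minimal Python sketch. It is not the code used during the mission: the function name `upsample_to_majority`, the toy presence/absence profiles and the class labels are all hypothetical, and only the standard library is used. Each class is resampled with replacement until it has as many isolates as the most numerous class; a classifier such as Random Forest would then be trained on the balanced set.

```python
import random
from collections import Counter, defaultdict


def upsample_to_majority(samples, labels, seed=0):
    """Resample every class with replacement up to the size of the largest class.

    Hypothetical helper for illustration; not taken from the report.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Target size is the number of isolates in the most numerous class.
    target = max(len(members) for members in by_class.values())

    out_samples, out_labels = [], []
    for label, members in by_class.items():
        # random.Random.choices draws with replacement.
        out_samples.extend(rng.choices(members, k=target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy accessory-gene presence/absence profiles (invented values and labels).
samples = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 0, 0), (0, 0, 1), (1, 1, 1)]
labels = ["water", "water", "water", "water", "poultry", "cattle"]

balanced_samples, balanced_labels = upsample_to_majority(samples, labels)
counts = Counter(balanced_labels)
print(counts)
```

Because upsampling duplicates isolates, it is generally safer to apply it to the training portion only, after any train/test split, so that copies of the same isolate do not appear on both sides and inflate the estimated accuracy.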