The EIOS System: What's Cooking?

Analyzing radio data for epidemic surveillance and using speech recognition and natural language processing

Benjamin Huynh, PhD Candidate in Biomedical Informatics, Stanford University

Public radio remains a predominant form of media and communication in many parts of the world, representing an untapped channel for epidemic surveillance. Monitoring rural areas through their radio communications could improve epidemic surveillance, but the sheer volume of data is too large to parse manually. Modern machine learning techniques, however, present the opportunity to conduct epidemic surveillance using raw, unstructured radio data. We present a data analysis pipeline that transcribes radio broadcasts using automatic speech recognition and analyzes the resulting text using machine learning. We discuss preliminary and proposed work based on radio data from Uganda.
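The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hypothetical transcribe() function stands in for an automatic speech recognition model, and a simple keyword flagger stands in for the downstream machine learning analysis.

```python
def transcribe(audio_clip: bytes) -> str:
    """Hypothetical ASR step: a real pipeline would run a speech
    recognition model over the radio audio here."""
    return "reports of a suspected ebola case in the northern district"


# Stand-in for the text-analysis stage: flag transcripts that mention
# surveillance-relevant terms.
HEALTH_TERMS = {"ebola", "cholera", "measles", "fever", "outbreak"}


def flag_for_surveillance(audio_clip: bytes) -> bool:
    """Return True if the transcript contains any health keyword."""
    words = set(transcribe(audio_clip).lower().split())
    return bool(words & HEALTH_TERMS)


print(flag_for_surveillance(b""))  # the canned transcript mentions "ebola"
```

In a production setting the keyword match would be replaced by a trained classifier, but the structure — audio in, transcript out, text model on top — is the same.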

Towards Anomaly Detection in EIOS: Natural Language Processing and Supervised Learning can Help Direct Signals

Stephane Ghozzi, Research Assistant in Data Science, Robert Koch Institut

In the context of the research presented here, "anomaly detection" means applying algorithms trained to predict which EIOS articles would be identified as "signals" by experts at WHO. The texts of the articles are vectorized following both a bag-of-words and a word-embeddings approach. Overall, 78 combinations of data preparation steps and classification algorithms are evaluated using a series of scores. Although precision is very low, when the focus is on high recall the multilayer perceptron, applied to the bag-of-words representation without upsampling or standardization, reaches a specificity of 0.84 at a recall of 0.91, among other good performances, indicating that it could already be useful for sorting articles.
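The bag-of-words-plus-perceptron configuration highlighted above can be sketched with scikit-learn. This is an illustrative toy example, not the study's code: the corpus and labels are invented, and the metrics are computed on the training data only to show how recall and specificity are derived.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import recall_score, confusion_matrix

# Invented stand-in corpus; labels: 1 = expert-flagged signal, 0 = other.
texts = [
    "cholera outbreak reported in coastal district",
    "officials confirm measles cases rising sharply",
    "local festival draws record crowds this weekend",
    "new road construction project announced",
    "unexplained fever cluster under investigation",
    "stock market closes higher on trade news",
]
labels = [1, 1, 0, 0, 1, 0]

# Bag-of-words vectorization (no upsampling or standardization).
X = CountVectorizer().fit_transform(texts)

# Multilayer perceptron classifier.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, labels)
pred = clf.predict(X)

# Recall (sensitivity) and specificity, the scores quoted in the abstract.
recall = recall_score(labels, pred)
tn, fp, fn, tp = confusion_matrix(labels, pred).ravel()
specificity = tn / (tn + fp)
print(f"recall={recall:.2f} specificity={specificity:.2f}")
```

In the actual evaluation these scores would be computed on held-out articles rather than the training set.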

Using human expertise and text classification algorithms to identify the noise in the Epidemic Intelligence from Open Sources (EIOS) system

Scott Lee (Statistician, Centers for Disease Control and Prevention, USA) & Emilie Peron (Epidemiologist/Evaluation Coordinator, World Health Organization)

In 2019, the EIOS team, in collaboration with six organizations, conducted an evaluation of the proportion of media reports incorrectly included in the surveillance system, known colloquially as "noise reports". In this presentation, we will provide some background on that noise evaluation and describe how we explored the use of text classification algorithms for identifying such reports. We tried several machine learning models, from traditional approaches like a random forest with n-gram features to newer models like Google's Bidirectional Encoder Representations from Transformers (BERT), and show that good identification is indeed possible. After discussing the results of the analysis, we propose several ways similar models could be integrated into the EIOS processing chain to identify noise and improve the performance of the system.
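One of the traditional baselines named above, a random forest over n-gram features, can be sketched with scikit-learn. This is an illustrative example, not the evaluation's actual code; the reports and labels below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "health ministry confirms new cholera cases in the region",
    "hospital reports surge in patients with acute watery diarrhea",
    "celebrity chef opens restaurant downtown",                 # noise
    "football team wins championship after penalty shootout",  # noise
    "investigators examine unusual respiratory illness cluster",
    "weather forecast predicts sunny weekend",                  # noise
]
labels = [0, 0, 1, 1, 0, 1]  # 1 = noise report, 0 = relevant

# Unigram + bigram TF-IDF features feeding a random forest classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# Score an unseen report: estimated probability that it is noise.
p_noise = model.predict_proba(["movie premiere draws huge crowds"])[0, 1]
print(f"P(noise) = {p_noise:.2f}")
```

Deployed in the EIOS processing chain, such a score could be thresholded to filter reports outright or to rank them for human review.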