DATA150_Serena

Methodology Paper

Introduction

The human development topic I have selected is the impact of vector-borne infectious diseases on people in Africa. Vector-borne pathogens include viruses, bacteria, and parasites. Vector-borne diseases are also a main cause of emerging diseases: they account for more than 17% of all infectious diseases and cause more than 700,000 deaths annually. (WHO, 2020) Although these diseases have received attention, it is very difficult, perhaps impossible, to eliminate them completely. It is hard to determine where the initial vector or pathogen came from, and even when health bureaus manage to bring a disease under control, seasonal climate changes and constant human mobility make it very likely to reappear. Moreover, most attention to this problem focuses on tropical and subtropical regions, while the poorest places, which are also vulnerable to transmission, receive insufficient attention. Data science and the analysis of datasets may help people recognize and respond to such problems. My research topic is how people can use data science methods to control the transmission of infectious diseases. The methods I concentrate on are all machine learning methods: in the literature I reviewed for previous assignments, many researchers applied a wide range of algorithms to build their models.

Regression

The first group of methods is regression, of which logistic regression and linear regression are the most commonly used. Logistic regression is used for predictive analysis when the dependent variable is dichotomous. It can also be used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables. In Python, logistic regression is a fundamental classification technique. It belongs to the group of linear classifiers and is somewhat similar to polynomial and linear regression. Regression models are used extensively in the article Utilizing general human movement models to predict the spread of emerging infectious diseases in resource poor settings. The epidemiological data were collected from the World Health Organization (WHO), and one aim of the paper is to test whether human movement metrics from other regions can be used to predict the dynamics of Ebola virus disease (EVD), which may be carried by fruit bats. The utility of generalized human mobility metrics is tested with several models. Because the researchers did not have access to data on the movement of people (i.e., population counts between regions) from the countries affected by EVD, their estimates reflect the overall movement of people between districts. (Kraemer, M.U.G., Golding, N., Bisanzio, D. et al. 2019)

The first model the researchers use is the gravity model, which assumes that the relative flow between two regions is a log-linear function of their population sizes and the distance between them. The model is given by the following formula.

[Image: gravity model equation; screenshot from 2021-04-24, 11:01:19 PM]
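As a rough illustration, the gravity model can be sketched in a few lines of Python. The sketch assumes the standard gravity form T_(i,j) = k · N_i^α · N_j^β / d_(i,j)^γ, with the symbols defined in the paragraph that follows. The scaling constant k, the function names, and all numbers are my own assumptions for illustration and are not taken from the paper.

```python
import math


def gravity_flow(n_i, n_j, d_ij, alpha, beta, gamma, k=1.0):
    """Expected flow T_(i,j) from district i to district j.

    k is a scaling constant added for illustration (an assumption).
    """
    return k * (n_i ** alpha) * (n_j ** beta) / (d_ij ** gamma)


def log_gravity_flow(n_i, n_j, d_ij, alpha, beta, gamma, k=1.0):
    """The same model on the log scale, where it is linear in the logs."""
    return (math.log(k) + alpha * math.log(n_i)
            + beta * math.log(n_j) - gamma * math.log(d_ij))


# Hypothetical districts: populations, distance and exponents are made up.
flow = gravity_flow(n_i=200_000, n_j=50_000, d_ij=120.0,
                    alpha=1.0, beta=1.0, gamma=2.0)
log_flow = log_gravity_flow(n_i=200_000, n_j=50_000, d_ij=120.0,
                            alpha=1.0, beta=1.0, gamma=2.0)
```

Taking logarithms turns the product into a sum, which is why the exponents can be estimated with an ordinary linear regression on log-transformed flows.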

Here i and j denote two districts and T_(i,j) is the total number of individuals moving from district i to district j. N_i and N_j are the population sizes at the origin and the destination, d_(i,j) is the distance between them, and α, β and γ are exponents fitted to the data. This model has proved very useful under local conditions because it can capture daily commuting patterns, long-term movement, and the overall population diffusion process. By fitting the model to empirical data (datasets derived from call detail records), the researchers simulate the impact of human mobility on the geographic distribution and transmission rate of EVD. They build an invasion model and a transmission model to characterize introductions into previously unaffected districts and the secondary cases expected to arise from these introductions. The invasion model estimates the probability p_i(t) of identifying one or more cases in a previously disease-free district i at time t (whether there are new cases is represented by Y_i(t)). This probability is a function of the number of cases C_(−i)(t − 1) in all other districts at the previous biweekly time point, the corresponding values of the mobility covariates x_(j,−i) for each covariate j, the regression coefficients b_j, and a fixed intercept term c. This gives the following logistic regression model:

[Image: logistic regression equation for the invasion model; screenshot from 2021-04-24, 11:04:06 PM]
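A logistic model of this kind can be sketched as follows. This is only a generic illustration: the exact way the paper combines case counts with the mobility covariates is given in the equation image and is not reproduced here. The sketch assumes each covariate has already been reduced to a single number per district, and all values are hypothetical.

```python
import math


def invasion_probability(covariates, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(c + sum_j b_j * x_j)))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical values for one disease-free district at one time step.
x = [0.8, 2.5]   # mobility-weighted case covariates (made up)
b = [1.2, 0.4]   # regression coefficients b_j (made up)
c = -3.0         # fixed intercept term

p = invasion_probability(x, b, c)
log_odds = math.log(p / (1.0 - p))  # recovers c + sum_j b_j * x_j
```

The logistic function keeps the estimate between 0 and 1, so it can be read directly as the probability that the district reports its first case in that time step.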

For all districts reporting one or more infection cases, a biweekly transmission model is built. I_(t,i) is the number of infected and infectious individuals and S_(t−1,i) is the number of susceptible individuals, where t is again the time step and i the district. N_i is the population of district i, and β_(t,i) is the covariate-driven mobility rate, characterized by a linear combination of the mobility metrics described above. α_i accounts for the discretization of a continuous process and can be seen as an approximation of the contact rate of the population in district i. The error terms ϵ_(i,j) are independent, identically log-normally distributed random variables with N(0, σ²).

[Image: transmission model equation; screenshot from 2021-04-24, 11:05:10 PM]

These two equations are then converted into a linear regression form to calculate the relative weights of each district:

[Image: linear regression form of the model; screenshot from 2021-04-24, 11:06:10 PM]

For any district i, each X_(i,j) is one of k district-specific covariates that combines the number of cases in all other districts, weighted by a mobility matrix. These district-specific covariates are recalculated at each time step. The β_(t,i) terms were generally fitted with the covariates entering linearly, and all model fitting was conducted in R. The analysis found that generalized human movement models derived from data outside the affected geographic area can explain a considerable part of the observed EVD outbreak dynamics. The following figure shows the predictions at the 95% confidence level. (Kraemer, M.U.G., Golding, N., Bisanzio, D. et al. 2019)

[Images: model predictions at the 95% confidence level; screenshots from 2021-04-24, 1:14:58 PM and 1:15:55 PM]
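The idea of a covariate that combines cases in all other districts, weighted by a mobility matrix, can be sketched as below. The paper fitted its models in R; this Python sketch is only an illustration. The direction of the weighting (flow from each other district into district i), the function name, and the numbers are assumptions of mine.

```python
def mobility_weighted_cases(cases, mobility, i):
    """Sum the cases in every district other than i, weighted by flow into i.

    cases[m]       : cases reported in district m at the previous time step
    mobility[m][i] : relative flow from district m to district i (assumed)
    """
    return sum(mobility[m][i] * cases[m]
               for m in range(len(cases)) if m != i)


# Three hypothetical districts; all values are made up.
cases = [10, 0, 4]
mobility = [[0.0, 0.3, 0.1],
            [0.2, 0.0, 0.5],
            [0.4, 0.6, 0.0]]

# Covariate for district 1, built from cases in districts 0 and 2 only.
x_1 = mobility_weighted_cases(cases, mobility, i=1)
```

Because the case counts change every two weeks, a covariate like this would be recomputed at each time step before refitting or predicting.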

K-Nearest-Neighbor

In the article Machine Learning Model for Imbalanced Cholera Dataset in Tanzania, cholera data were collected in Dar es Salaam from January 2015 to December 2017. The data include seasonal meteorological variables from the Tanzania Meteorological Agency (TMA), such as temperature, rainfall, humidity, and wind direction. The dataset covers 2951 patients and 9 predictors, and the authors checked it for data entry errors, including missing values and spelling errors. Because the transmission of cholera is closely related to the environment (contaminated food or water, temperature, humidity, etc.), many factors must be taken into account to understand the re-emergence of cholera outbreaks. The procedure for building the model is shown in the following figure. The data are divided into 30 folds and split into training data (to build the model) and testing data (to show the predictive performance of the model). K-NN is a simple non-parametric algorithm for classification and regression that is usually successful when the decision boundary is very irregular. (Judith Leo, Edith Luhanga, Kisangiri Michael, 2019)

[Image: model-building procedure; screenshot from 2021-04-24, 11:08:28 PM]
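To make the K-NN idea concrete, here is a minimal sketch of a K-NN classifier in plain Python. It is not the authors' implementation. The two features (temperature and rainfall), the labels, and the choice of k = 3 are hypothetical values chosen only to show how the majority vote works.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up records: (temperature in Celsius, rainfall in mm).
train_x = [(30.0, 120.0), (31.0, 150.0), (29.5, 140.0),
           (24.0, 10.0), (25.0, 20.0), (23.5, 5.0)]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = cholera case, 0 = no case (hypothetical)

wet_season = knn_predict(train_x, train_y, (30.5, 130.0))
dry_season = knn_predict(train_x, train_y, (24.5, 15.0))
```

The sketch stores every training point and measures the distance to all of them for each prediction, which is why the method becomes slow on large datasets and why features on very different scales usually need to be rescaled first.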

Because the classes are not equally represented in the data, there is an imbalance problem. The authors therefore performed oversampling using the adaptive synthetic sampling approach (ADASYN). Given the nature of the cholera dataset, the researchers used balanced accuracy, sensitivity, and specificity to evaluate the performance of the models. K-NN is one of the models that performs best on these metrics. This study improves our understanding of the important role of machine learning strategies in health-care data. However, the data are of low quality because they were collected only during a certain time period. The health system should provide more frequently updated, real-time data to support quality data collection. A drawback is that K-NN is not suitable for large datasets, data with non-uniform features, high-dimensional data, or unbalanced classes. In addition, K-NN cannot handle missing values, and irrelevant features may greatly reduce its accuracy. (Judith Leo, Edith Luhanga, Kisangiri Michael, 2019)

[Image: screenshot from 2021-04-24, 9:46:44 PM]
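The three evaluation metrics named above can be computed from the confusion counts, as in the following minimal sketch. The labels and predictions are made up for illustration and do not come from the cholera dataset.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp, fn, tn, fp


def evaluate(y_true, y_pred):
    """Sensitivity, specificity, balanced accuracy and plain accuracy."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # share of true cases that were detected
    specificity = tn / (tn + fp)   # share of non-cases correctly rejected
    balanced = (sensitivity + specificity) / 2
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, balanced, accuracy


# Hypothetical imbalanced sample: 2 positive cases out of 10 records.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

sensitivity, specificity, balanced, accuracy = evaluate(y_true, y_pred)
```

On imbalanced data, plain accuracy can look good even when half of the true cases are missed, while balanced accuracy weights both classes equally. This is one reason such metrics suit an imbalanced dataset.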

Whether machine learning can successfully predict the occurrence of cholera and its correlation with seasonal weather changes depends on good use of data and machine learning classifiers. To get the best results, the right machine learning model must be selected for the right problem. The results of this study indicate that outbreaks of infectious disease transmission are closely related to season, temperature, and rainfall level, so when tackling transmission of this kind, we can allocate effort according to these factors.

Reflection

Many studies, like those mentioned above, are based on unproven assumptions. For example, in the article Mapping internal connectivity through human migration in malaria endemic countries (which also makes extensive use of logistic regression models), the whole study rests on two assumptions: (i) the census samples are considered to be representative at the administrative unit level at which migration was recorded and (ii) the percentage of people migrating between administrative units is considered to be constant over time. (Sorichetta, A., Bird, T., Ruktanonchai, N. et al. 2016) We need to keep in mind that estimates derived from census data collected years ago may lead to inaccurate conclusions. In addition, the algorithms themselves may still have limitations (as I discussed in the K-NN section). These are all issues to consider when using algorithmic models to help people manage the transmission of vector-borne infectious diseases.