In this project, we decide to analyze what variables might have an effect on the probability of having angina or coronary heart disease. Based on life experience, we are interested in whether level of obesity measured by BMI (Body Mass Index), exercise frequency, cholesterol level, hypertension, and the consumption of green food are associated with angina or coronary heart disease and if so, how strong are their relationships? We also intend to build models that can predict the probability of having angina or coronary heart disease based on relevant variables.
According to Heart and Stroke Foundation of Canada, “Angina is the medical term for chest pain or discomfort caused by a temporary disruption in the flow of blood and oxygen to the heart. People describe angina discomfort as a squeezing, suffocating or burning feeling – usually in the center of the chest, behind the breastbone.”
According to University of Ottawa Heart Institute, “Coronary heart disease, also known as coronary artery disease (CAD), is a condition which affects the arteries that supply the heart with blood. It is usually caused by atherosclerosis which is a buildup of plaque inside the artery walls. This buildup causes the inside of the arteries to become narrower and slows down the flow of blood.”
According to CDC:
Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States.
One person dies every 36 seconds in the United States from cardiovascular disease.
Coronary heart disease is the most common type of heart disease, killing 360,900 people in 2019.
About 18.2 million adults age 20 and older have CAD (about 6.7%).
From CDC statistics, it is obvious that coronary heart disease poses a major threat to humans’ health. If we could prove what factors are associated with coronary heart disease, we could gain a deeper insight into coronary heart disease and how we could reduce the risk of having coronary heart disease, hence our research question.
In order to answer our research question, we first perform explanatory data analysis (EDA). Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, summarize important characteristics, and come up with potential research questions with the help of graphical representations and other tools. In this part, we take a glance at the whole data set, study the codebook, and come up with a research question which we are interested in (coronary heart disease), after which we perform data cleaning where we focus on the selected variables, cope with missing data, and process the variables so that they are ready for subsequent procedures.
With the selected variables ready, we perform significance test, which includes chi-square test, z-test, t-test, etc. The purpose of significance test is to check whether coronary heart disease is associated with the explanatory variables. It serves as a preliminary model to help us gain a deeper insight into the relationships between different variables. In this part, we first check whether the conditions for significance are met (randomness, large sample size, independence), write down the null hypotheses and alternative hypotheses, calculate the p-value, and conclude whether there is an association between the tested variables. However, this model cannot predict the probability of having coronary heart disease since it can only prove the existence of an association but not the type of the association. Therefore, more appropriate models are required in order to predict the probability of having coronary heart disease, hence the following step: statistical modelling.
In statistical modelling, we build different models (logistic regression, KNN, SVM and random forest) using all the explanatory variables to examine their joint influence on coronary heart disease and try to predict the probability of having coronary heart disease. The process of building these models comprise the following steps:
Make assumptions and check whether the conditions for modelling are met.
Divide the samples into training, test, and validation sets using undersampling technique since the response variable is extremely imbalanced.
Construct models using the training set and compute their accuracy using the validation set.
Optimize the models (improve accuracy) by tuning the parameters and selecting most relevant predictors based on various selection criterion.
Using the test set, evaluate the performance of each model separately by plotting their corresponding ROC curve (Receiver Operating Characteristic curve) and calculating corresponding AUC (Area Under Curve), and perform model diagnostics.
Compare the performance of all models, state the most influential predictors, and select the best model taking into account accuracy, AUC, computational costs, etc.
Lastly, based on results of EDA, significance test, and statistical modelling, we summarize our work and answer our research question.
According to BRFSS Overview, “The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews. Over time, the number of states participating in the survey increased; by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point- in-time (usually one to three months). In this document, the term “state” is used to refer to all areas participating in BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico.
The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.
Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger."
There are several assumptions that we have made prior to performing statistical analysis.
Since BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US, any conclusion we draw is only applicable to the US.
Since respondents are identified through telephone-based methods based on stratified sampling (a method of sampling from a population which can be partitioned into subpopulations), and an overall estimated 97.5% of US households had telephone service in 2012, we can assume that the respondents are randomly selected. However, the use of landline telephone and cellular telephone means that people who do not have telephone service could not be interviewed. However, since most Americans around 2013 have telephone service, the bias caused by not collecting data from people who do not have telephone service is small, and thus could be ignored.
Since this is an observational study and the answers from the respondents may not be accurate, we cannot make definitive statements of fact about the safety, efficacy, or effectiveness of a practice. Nevertheless, the 2013 BRFSS data can still provide information on human’s health and behaviors, detect signals about the benefits and risks of certain practices in the general population, and help formulate hypotheses to be tested in subsequent experiments.
Firstly, there are a total number of 491775 rows, or respondents, and a total number of 330 columns in the brfss2013
data set.
# Take a look at the dimension of the data set
dim(brfss2013)
## [1] 491775 330
We can also take a look at the head and the tail of the data set to get a broad sense of the data.
# Take a look at the head and the tail of the data set
head(brfss2013)
tail(brfss2013)
We now take a look at the race distribution of the respondents. Below is a table of the race distribution of the respondents.
brfss <- brfss2013
levels(brfss$X_race)<-c(levels(brfss$X_race),"Missing Data")
brfss$X_race[is.na(brfss$X_race)] <- "Missing Data"
race_table = dplyr::count(brfss, vars= X_race)
colnames(race_table) = c("Race", "Count")
formattable(race_table, align =c("l","c"), list(`Race` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
Race | Count |
---|---|
White only, non-Hispanic | 376253 |
Black only, non-Hispanic | 39339 |
American Indian or Alaskan Native only, Non-Hispanic | 7676 |
Asian only, non-Hispanic | 9508 |
Native Hawaiian or other Pacific Islander only, Non-Hispanic | 1547 |
Other race only, non-Hispanic | 2701 |
Multiracial, non-Hispanic | 9121 |
Hispanic | 37060 |
Missing Data | 8570 |
The bar chart for the race distribution below shows that the majority of the respondents are white only, non-Hispanic (376253). The second largest and the third largest races are black only, non-Hispanic, and Hispanic respectively.
ggplot(brfss, aes(x = X_race, fill = X_race), stat = "count" ) +
geom_bar() +
geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
ylim(c(0,450000))+
scale_fill_brewer(palette="Set1") +
labs(y = "Count", x = "Responses") +
ggtitle("Race Distribuion of Respondents")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=15) )+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
theme(legend.position="none")
We now take a look at the age distribution of the respondents. Below is a table for all the respondents of different age groups.
levels(brfss$X_age_g)<-c(levels(brfss$X_age_g),"Missing Data")
brfss$X_age_g[is.na(brfss$X_age_g)] <- "Missing Data"
age_table = dplyr::count(brfss, vars= X_age_g)
colnames(age_table) = c("Age", "Count")
formattable(age_table, align =c("l","c"), list(`Age` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
Age | Count |
---|---|
Age 18 to 24 | 27203 |
Age 25 to 34 | 50164 |
Age 35 to 44 | 60416 |
Age 45 to 54 | 83769 |
Age 55 to 64 | 109464 |
Age 65 or older | 160747 |
Missing Data | 12 |
From the bar chart below, we can see that the largest age group of the respondents is age 65 or older, the second largest and the third largest age groups are age 55 to 64 and age 45 to 54. The age distribution of the respondents is slightly different from the true adults age distribution in the US in 2013 (https://www.census.gov/data/tables/2013/demo/age-and-sex/2013-age-sex-composition.html), so bias may occur. However, for simplicity, we will ignore the bias caused by age.
ggplot(brfss, aes(x = X_age_g, fill = X_age_g), stat = "count" ) +
geom_bar() +
geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
ylim(c(0,200000))+
scale_fill_brewer(palette="Set1") +
labs(y = "Count", x = "Age Distributions") +
ggtitle("Age Distribuion of Respondents")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=15) )+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 9))+
theme(legend.position="none")
age_group <- c("Age 20 to 24", "Age 25 to 34", "Age 35 to 44",
"Age 45 to 54", "Age 55 to 64", "Age 65 or older")
age_true <- c(7.1/73.5, 13.4/73.5, 12.8/73.5, 14.0/73.5, 12.4/73.5, 13.8/73.5)
age_true_propor <- data.frame(age_group, age_true)
ggplot(data = age_true_propor, mapping = aes(x = age_group, y =age_true,fill=age_group))+
geom_text(aes(label = paste0(sprintf("%1.1f", age_true*100),"%")),vjust = -0.2,size=6)+
scale_fill_brewer(palette = "Set1")+
ylim(c(0,0.25))+
geom_col()+labs(x = "Category",y = "Percent")+
ggtitle("True Adults Age Distribuion In the US In 2013")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=15) )+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 9))+
theme(legend.position="none")
#grid.arrange(age_plot1, age_plot2, ncol = 1)
We now take a look at the sex ratio of the respondents. Below is a table for respondents’ gender.
levels(brfss$sex)<-c(levels(brfss$sex),"Missing Data")
brfss$sex[is.na(brfss$sex)] <- "Missing Data"
sex_table = dplyr::count(brfss, vars= sex)
colnames(sex_table) = c("Sex", "Count")
formattable(sex_table, align =c("l","c"), list(`Sex` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
Sex | Count |
---|---|
Male | 201313 |
Female | 290455 |
Missing Data | 7 |
The bar chart below shows that there are more females than males in the survey. The ratio is about 20 million males to 29 million females. The sex ratio of the respondents is slightly different from the true adults sex ratio in the US in 2013 (https://www.census.gov/data/tables/2013/demo/age-and-sex/2013-age-sex-composition.html), so bias may occur. However, for simplicity, we will ignore the bias caused by gender.
ggplot(brfss, aes(x = sex, fill = sex), stat = "count" ) +
geom_bar() +
geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
ylim(c(0,350000)) +
scale_fill_brewer(palette="Set1") +
labs(y = "Count", x = "Age Distributions") +
ggtitle("Sex Distribuion of Respondents")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=15) )+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 12))+
theme(legend.position="none")
sex_group <- c("Males", "Females")
sex_true <- c(152335/311216, 158781/311216)
sex_true_propor <- data.frame(sex_group, sex_true)
ggplot(data = sex_true_propor, mapping = aes(x =reorder(sex_group,sex_true),
y =sex_true,fill=sex_group))+
geom_text(aes(label = paste0(sprintf("%1.1f", sex_true*100),"%")),vjust = -0.2,size=6)+
scale_fill_manual(values = c("blue", "red") ) +
ylim(c(0,0.6))+
geom_col()+labs(x = "Category",y = "Percent")+
ggtitle("True Adults Sex Ratio In the US In 2013")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=15) )+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 9))+
theme(legend.position="none")
The variables we choose to answer our research question include both categorical variables and quantitative variables.
Categorical variables:
cvdcrhd4
: Ever diagnosed with angina or coronary heart disease
X_bmi5cat
: Computed body mass index categories
exerany2
: During the past month, other than your regular job, did you participate in any physical activities or exercises
X_rfchol
: High cholesterol calculated variable
bphigh4
: Ever told blood pressure high
Quantitative variables:
fruitju1
: How many times did you drink 100 percent pure fruit juices
fruit1
: How many times did you eat fruit
fvbeans
: How many times did you eat beans or lentils
fvgreen
: How many time did you eat dark green vegetables
fvorang
: How many times did you eat orange-colored vegetables
vegetab1
: How many times did you eat other vegetables
We combine the six quantitative variables related to green food consumption since they are collinear (correlated). Some other variables are also collinear (correlated) and of the same topic, so adding more than one of these variables to the model would not add much value to the model.
The response variable in our analysis is cvdcrhd4
(Ever diagnosed with angina or coronary heart disease) and the explanatory variables are the rest variables.
We now take a deeper look into these variables.
Firstly, the cvdcrhd4
variable, which corresponds to the question “ever diagnosed with angina or coronary heart disease”, has three categories, namely “Yes”, “No”, and missing data. There are a total number of 491775 responses, 29064 of which answered “Yes”, 458288 of which answered “No”, and 4423 of which are missing data. We only want the samples that answered “Yes” or “No” to this question. Again, for simplicity, we will consider those who answered “Yes” to have angina or coronary heart disease and those who answered “No” to not have angina or coronary heart disease.
levels(brfss$cvdcrhd4)<-c(levels(brfss$cvdcrhd4),"Missing Data")
brfss$cvdcrhd4[is.na(brfss$cvdcrhd4)] <- "Missing Data"
brfss$cvdcrhd4 <- as.factor( recode(brfss$cvdcrhd4, "Yes" = "Coronary",
"No" = "Healthy") )
chd_table = dplyr::count(brfss, vars= cvdcrhd4)
colnames(chd_table) = c("Response", "Count")
formattable(chd_table, align =c("l","c"), list(`Response` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
Response | Count |
---|---|
Coronary | 29064 |
Healthy | 458288 |
Missing Data | 4423 |
The bar chart below gives a good overview of the cvdcrhd4
variable:
#mapping = aes(x = cvdcrhd4, fill = cvdcrhd4), stat = "count"
ggplot(brfss, aes(x = cvdcrhd4, fill = cvdcrhd4), stat = "count") +
geom_bar() +
geom_text(stat = "count", aes(label = ..count..), vjust = -1, colour = "black",size=6)+
scale_x_discrete(name = " ", limits = c("Healthy","Coronary", "Missing Data")) +
scale_fill_brewer(palette="Set1") +
labs(y = "Count", x = "Responses") +
ylim(c(0,500000))+
ggtitle("Ever Diagnosed with Angina or Coronary Heart Disease")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.2,
colour='black', size=15) )+
theme(legend.position="none")
The pie chart below gives another overview of the cvdcrhd4
variable:
ggplot(brfss, aes(x="xxx", fill=cvdcrhd4)) +
geom_bar( width=1, color="white") +
coord_polar("y", start=0) +
theme_void() + # remove background, grid, numeric labels
guides(fill = guide_legend(title = "Coronary Heart Disease"))+
ggtitle("Ever Diagnosed with Angina or Coronary Heart Disease")+
theme(plot.title=element_text(family='', face='bold', hjust = -0.1,
colour='black', size=15) )
The X_bmi5cat
, which corresponds to the computed body mass index categories.The categorical variable X_bmi5cat
has 4 categories as shown below.
bmi_tab<-data.frame(table(brfss2013$X_bmi5cat))
colnames(bmi_tab)=c('Category','Frequency')
bmi_cat <- gt(data = bmi_tab)
bmi_cat <-
bmi_cat %>%
tab_header(
title = "Contingency Table of the Categories of `X_bmi5cat`"
)
bmi_cat
Contingency Table of the Categories of `X_bmi5cat` | |
---|---|
Category | Frequency |
Underweight | 8267 |
Normal weight | 154898 |
Overweight | 167084 |
Obese | 134799 |
The sample numbers of every category are given by the bar plot above. Since the factor we surveyed is the obesity level and the number of Underweight group is small, we omitted the Underweight group.
bmi_plot <- ggplot(data = bmi_tab, mapping = aes(x = Category, y =Frequency,fill=Category))+
geom_text(aes(label = Frequency),vjust = -0.2,size=6)+
scale_fill_brewer(palette = "Pastel1")+
ylim(c(0,200000))+
geom_col()+labs(title = "Level of Obesity",x = "Category",y = "Frequency")+
theme(plot.title = element_text(hjust = 0.5))
bmi_plot
To avoid the bias caused by the missing data, we calculated the missing ratio of each category in X_bmi5cat
and cvdcrhd4
. Seeing from the bar plot below, the difference of the missing ratios in each category is acceptable. Therefore, there doesn’t exist some extreme imbalance of the missing data.
obmit_under=subset(brfss2013,brfss2013$X_bmi5cat!="Underweight"|is.na(brfss2013$X_bmi5cat)==TRUE)
cvd_na=subset(obmit_under,is.na(obmit_under$cvdcrhd4)==TRUE)
normal_na_ratio=table(cvd_na$X_bmi5cat)[2]/table(obmit_under$X_bmi5cat)[2]
over_na_ratio=table(cvd_na$X_bmi5cat)[3]/table(obmit_under$X_bmi5cat)[3]
obese_na_ratio=table(cvd_na$X_bmi5cat)[4]/table(obmit_under$X_bmi5cat)[4]
bmi_missing<-data.frame(Ratio=c(normal_na_ratio,over_na_ratio,obese_na_ratio),Category=c('Normal weight','Overweight','Obese'))
bmi_missing_plot <- ggplot(data = bmi_missing, mapping = aes( x=Category, y =Ratio,fill=Category))+
scale_fill_brewer(palette = "Pastel1")+
geom_col()+
theme(plot.margin=unit(rep(0,8),'lines'))+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
labs(title = "Missing Ratio of `X_bmi5cat`")+
theme(plot.title = element_text(hjust = 0.5))+
geom_text(aes(label=paste0(sprintf("%1.1f", Ratio*100),"%")),vjust = -0.2)
bmi_na=subset(obmit_under,is.na(obmit_under$X_bmi5cat)==TRUE)
cvd_na_ratio=table(bmi_na$cvdcrhd4)[1]/table(obmit_under$cvdcrhd4)[1]
nocvd_na_ratio=table(bmi_na$cvdcrhd4)[2]/table(obmit_under$cvdcrhd4)[2]
cvd_missing<-data.frame(Ratio=c(cvd_na_ratio,nocvd_na_ratio),Category=c('Yes','No'))
cvd_missing_plot <- ggplot(data = cvd_missing, mapping = aes( x=Category, y =Ratio,fill=Category))+scale_fill_brewer(palette = "Pastel1")+
geom_col()+
theme(plot.margin=unit(rep(1,4),'lines'))+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
labs(title = "Missing Ratio of `cvdcrhd4`")+
theme(plot.title = element_text(hjust = 0.5))+
geom_text(aes(label=paste0(sprintf("%1.1f", Ratio*100),"%")),vjust = -0.2)
grid.arrange(bmi_missing_plot,cvd_missing_plot,nrow=1)
cvd_bmi_na=subset(obmit_under,is.na(obmit_under$cvdcrhd4)==TRUE | is.na(obmit_under$X_bmi5cat)==TRUE)
all_na_ratio=length(cvd_bmi_na$cvdcrhd4)/length(obmit_under$cvdcrhd4)
The overall missing ratio, 0.0635046, is relatively small. As a result, in the following analysis, we would ignore the bias causing by the missing samples. Then we constructed a subset omitting the samples with missing X_bmi5cat
and cvdcrhd4
.
The stack bar diagram below roughly shows that the proportion of people who ever had angina or coronary heart disease increases as the level of obesity increases.
bmi_subset=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & is.na(brfss2013$cvdcrhd4)==FALSE & brfss2013$X_bmi5cat!="Underweight",select = c(X_bmi5cat,cvdcrhd4))
bmi_subset$cvdcrhd4 <- as.factor( recode(bmi_subset$cvdcrhd4, "Yes" = "Coronary",
"No" = "Healthy") )
dt0 <- bmi_subset%>%
dplyr::group_by(X_bmi5cat, cvdcrhd4)%>%
dplyr::tally()%>%
dplyr::mutate(percent=n/sum(n))
ggplot(data = dt0,aes(x= X_bmi5cat, y = n,fill = cvdcrhd4))+
guides(fill = guide_legend(title = "Health Status"))+
geom_bar(stat="identity",position = "fill")+
geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
position=position_fill(vjust=0.5), colour="black", size =6)+
scale_fill_brewer(palette = "Paired")+labs(title = "Stack Bar Plot of Disease Condition and Level of Obesity",x = "Category",y = "Percentage",fill="Health Condition")+
theme(plot.title = element_text(hjust = 0.5))
The categorical variable exerany2
has two categories as shown below.
levels(brfss$exerany2)<-c(levels(brfss$exerany2), "Missing Data")
brfss$exerany2[is.na(brfss$exerany2)] <- "Missing Data"
ex_tab<-data.frame(dplyr::count(brfss, vars= exerany2))
colnames(ex_tab)=c('Category','Frequency')
ex_tab
We plot the bar chart of categorical variable exerany2
as follow.
ggplot(data = ex_tab,
mapping = aes(x = Category, y=Frequency, fill = Category), stat = "count") +
scale_fill_brewer(palette = "Set1")+ geom_col( ) +
geom_text(aes(label = Frequency),size=6, vjust = -0.2) + ylim(c(0,350000))+
labs(y = "Frequency", x = "Category") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) + # Plot the bar graph
ggtitle("Exercise In Past 30 Days")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=15) )
ex_misratio<-34029/length(brfss2013$exerany2)
ex_misratio
## [1] 0.06919628
There are some missing data, the ratio of the missing data is 0.0691963. The ratio is relatively small. As a result, in the following analysis, we would ignore the difference caused by deleting the missing data.
ex <- which((brfss2013$exerany2 == "Yes" | brfss2013$exerany2 == "No") & (
brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt = filter(brfss2013[ex, ]) # Filter out unwanted samples
exercise <- filt[, c("exerany2", "cvdcrhd4")] # Filter out unwanted variables
#glimpse(blood_pressure)
exercise$exerany2 <- droplevels(exercise$exerany2)
exercise$exerany2 <- as.factor( recode(exercise$exerany2, "Yes" = "exercise", "No" = "No exercise") )
exercise$cvdcrhd4 <- as.factor( recode(exercise$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy") )
samples<-491775 - length(exercise$exerany2) # Calculate the number of responses that have been deleted
samples_ratio<-(491775 - length(exercise$exerany2)) / 491775 # Calculate the proportion of responses that have been deleted
We have deleted a total number of 3.8043^{4} samples from these two variables, and the overall missing data ratio is 0.0773585. Obviously, it is relatively small so that we can ignore these samples in the following analysis.
We draw a bar chart of the relationship between exerany2
and cvdcrhd4
.
ggplot(exercise, aes(x = exerany2, fill = cvdcrhd4), stat = "count") +
geom_bar(position = "dodge")+
geom_text(position = position_dodge(width = .9),stat = "count",
aes(label = ..count..),size=6, vjust = -1, colour = "black")+
ylim(c(0,350000))+
guides(fill = guide_legend(title = "Health Condition"))+
xlab("Category")+ ylab("Frequency")+
ggtitle("Gropued Bar Chart for Exerise and Coronary Heart Disease")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=13) )
As we can roughly see from two bar chart below, people who had exercised in the past 30 days had a much lower rate of angina or coronary heart disease than people who had not exercised.
dt1 <- exercise%>%
dplyr::group_by(exerany2, cvdcrhd4)%>%
dplyr::tally()%>%
dplyr::mutate(percent=n/sum(n))
ggplot(data = dt1,aes(x= exerany2, y = n,fill = cvdcrhd4))+
guides(fill = guide_legend(title = "Health Status"))+
geom_bar(stat="identity",position = "fill")+
xlab("Exercise")+
ylab("Percent Value")+
geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
position=position_fill(vjust=0.5), colour="black", size =6)+
ggtitle("Percent Stacked Bar Chart for Exercise and Coronary Heart Disease")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=13) )
The valid categories and their corresponding sample sizes in X_rfchol
are shown in the table below:
X_rfchol_NA_process <- brfss2013$X_rfchol
X_rfchol_NA_process[is.na(X_rfchol_NA_process)] <- NA
rfchol_tab <- data.frame(table(X_rfchol_NA_process))
colnames(rfchol_tab) <- c("Category", "Frequency")
rfchol_cat <- gt(data = rfchol_tab)
rfchol_cat <-
rfchol_cat %>%
tab_header(
title = "Contingency Table of the Categories of \"X_rfchol\""
)
rfchol_cat
Contingency Table of the Categories of "X_rfchol" | |
---|---|
Category | Frequency |
No | 236614 |
Yes | 183497 |
We also make a bar plot to make it more visual.
levels(brfss$X_rfchol)<-c(levels(brfss$X_rfchol),"Missing Data")
brfss$X_rfchol[is.na(brfss$X_rfchol)] <- "Missing Data"
brfss$X_rfchol <- as.factor( recode(brfss$X_rfchol, "Yes" = "High Cholesterol",
"No" = "Normal Cholesterol") )
ggplot(brfss, aes(x = X_rfchol, fill = X_rfchol),stat = "count") +
geom_bar(stat = "count") +
ylim(c(0,300000))+
geom_text(stat = "count", aes(label = ..count..), vjust = -1, colour = "black",size=6)+
labs(y = "Count", x = "Responses") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) +
scale_fill_brewer(palette = "Pastel1")+
ggtitle("Ever Told Cholesterol Level High") +
theme(plot.title = element_text(
family = "", face = "bold", hjust = 0.5,
colour = "black", size = 15
))
There are 183497 respondents answered that they have been told to have high cholesterol level, 236614 answered that they have not been told to have high cholesterol level and 71664 data are missing.
To further analysis the data, we need to due with the missing data from the original data set. The missing ratio of variable X_rfchol
is 0.1457252, which is not big. Therefore, we can just throw out these missing data.
Moreover, we need to make sure that the missing ratios of cvdcrhd4
corresponding to the categories in X_rfchol
are not too high and have a small difference.
The plot "Missing Ratios Corresponding to X_rfchol's Categories"
below shows the missing ratio of cvdcrhd4
corresponding to the “Yes” and “No” categories of X_rfchol
. The “missing” here stands for the samples who have valid X_rfchol
data but miss cvdcrhd4
data.
yes <- subset(brfss2013, brfss2013$X_rfchol == "Yes")
no <- subset(brfss2013, brfss2013$X_rfchol == "No")
yes_na <- subset(brfss2013, brfss2013$X_rfchol == "Yes" & is.na(brfss2013$cvdcrhd4))
no_na <- subset(brfss2013, brfss2013$X_rfchol == "No" & is.na(brfss2013$cvdcrhd4))
missing <- c(nrow(yes_na) / nrow(yes), nrow(no_na) / nrow(no))
names <- c("Yes", "No")
missing_ratio <- data.frame(missing, names)
c1 <- ggplot(missing_ratio, aes(x = names, y = missing, fill = names)) +
geom_bar(stat = "identity") +
ylim(c(0,0.015))+
geom_text(aes(label=paste0(sprintf("%1.1f", missing*100),"%")),vjust = -0.2,size=6)+
labs(y = "Missing ratio", x = "X_rfchol") +
scale_fill_brewer(palette = "Pastel1")+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) +
ggtitle("Missing Ratios Corresponding to X_rfchol's Categories") +
theme(plot.title = element_text(
family = "", face = "bold", hjust = 0.5,
colour = "black", size = 8
))
According to the plot, the missing ratios of cvdcrhd4
corresponding to categories of X_rfchol
are both small (less than 2%) and close to each other.
The plot "Missing Ratios Corresponding to cvdcrhd4's Categories"
below shows the missing ratio of X_rfchol
corresponding to the “Yes” and “No” categories of cvdcrhd4
. The “missing” here stands for the samples who have valid cvdcrhd4
data but miss X_rfchol
data.
yes <- subset(brfss2013, brfss2013$cvdcrhd4 == "Yes")
no <- subset(brfss2013, brfss2013$cvdcrhd4 == "No")
yes_na <- subset(brfss2013, brfss2013$cvdcrhd4 == "Yes" & is.na(brfss2013$X_rfchol))
no_na <- subset(brfss2013, brfss2013$cvdcrhd4l == "No" & is.na(brfss2013$X_rfchol))
missing <- c(nrow(yes_na) / nrow(yes), nrow(no_na) / nrow(no))
names <- c("Yes", "No")
missing_ratio <- data.frame(missing, names)
c2 <- ggplot(missing_ratio, aes(x = names, y = missing, fill = names)) +
ylim(c(0,0.04))+
geom_bar(stat = "identity") +
geom_text(aes(label=paste0(sprintf("%1.1f", missing*100),"%")),vjust = -0.2,size=6)+
labs(y = "Missing ratio", x = "cvdcrhd4") +
scale_fill_brewer(palette = "Pastel1")+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
ggtitle("Missing Ratios Corresponding to cvdcrhd4's Categories") +
theme(plot.title = element_text(
family = "", face = "bold", hjust = 0.5,
colour = "black", size = 8
))
grid.arrange(c1,c2,nrow=1)
According to the plot, the missing ratios of X_rfchol
corresponding to categories of X_rfchol
are both small and close to each other as well. As a result, the missing data have a small influence on the final result. We can just throw out all samples with missing data in X_rfchol
and cvdcrhd4
.
rfchol <- which((brfss2013$X_rfchol == "Yes" | brfss2013$X_rfchol == "No") & (
brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt <- filter(brfss2013[rfchol, ]) # Filter out unwanted samples
rfchol_data <- filt[, c("X_rfchol", "cvdcrhd4")] # Filter out unwanted variables
# glimpse(blood_pressure)
rfchol_data$X_rfchol <- droplevels(rfchol_data$X_rfchol)
rfchol_data$X_rfchol <- as.factor(recode(rfchol_data$X_rfchol, "Yes" = "High Cholesterol Level", "No" = "Normal Cholesterol Level"))
rfchol_data$cvdcrhd4 <- as.factor(recode(rfchol_data$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy"))
We make a percent stacked bar plot to show the association between them. From the plot we found that cholesterol level the rate of getting angina or coronary heart disease related to different cholesterol level is different. We can get a preliminary conclusion that whether the cholesterol level is high do affect the rate of getting angina or coronary heart disease.
dt5 <- rfchol_data%>%
dplyr::group_by(X_rfchol, cvdcrhd4)%>%
dplyr::tally()%>%
dplyr::mutate(percent=n/sum(n))
ggplot(data = dt5,aes(x= X_rfchol, y = n,fill = cvdcrhd4))+
guides(fill = guide_legend(title = "Health Status"))+
geom_bar(stat="identity",position = "fill")+
xlab("Cholesterol Level")+
ylab("Percent Value")+
geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
position=position_fill(vjust=0.5), colour="black", size =6)+
ggtitle("Percent Stacked Bar Chart for Cholesterol Level and Coronary Heart Disease")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.2,
colour='black', size=13) )
bphigh4
variable, which corresponds to the question “ever told blood pressure high” The bphigh4
is a categorical variable which has five categories, namely “Yes”, “Yes, but female told only during pregnancy”, “No”, “Told borderline or pre-hypertensive”, and missing data. There are a total number of 491775 responses. 198921 people answered “Yes”, 3680 people answered “Yes, but female told only during pregnancy”, 282687 people answered “No”, 5067 people answered “Told borderline or pre-hypertensive”, and 1420 are missing data. Those who answered “Yes, but female told only during pregnancy” and “Told borderline or pre-hypertensive” can be considered as special cases, so we can ignore them. In addition, we need to be aware of the missing data. Based on the analysis above, we only want the samples that answered “Yes” or “No” to this question. Although it might not always be the case, for simplicity, we will consider those who answered “Yes” to have high blood pressure and those who answered “No” to have normal blood pressure.
levels(brfss$bphigh4)<-c(levels(brfss$bphigh4), "Missing Data")
brfss$bphigh4[is.na(brfss$bphigh4)] <- "Missing Data"
bp_table = dplyr::count(brfss, vars= bphigh4) # Overview of "bphigh4" variable
colnames(bp_table) = c("Response", "Count")
formattable(bp_table, align =c("l","c"), list(`Response` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
Response | Count |
---|---|
Yes | 198921 |
Yes, but female told only during pregnancy | 3680 |
No | 282687 |
Told borderline or pre-hypertensive | 5067 |
Missing Data | 1420 |
The bar chart below gives a good overview of the blood pressure data:
ggplot(brfss, aes(x = bphigh4, fill = bphigh4), stat = "count" ) +
geom_bar() +
geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
ylim(c(0,350000))+
scale_x_discrete(name = " ", limits = c("No","Yes", "Told borderline or pre-hypertensive", "Yes, but female told only during pregnancy", "Missing Data")) +
scale_fill_brewer(palette="Set1") +
labs(y = "Count", x = "Responses") +
ggtitle("Ever Told Blood Pressure High")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=15) )+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
theme(legend.position="none")
There are a total number of 4423 missing coronary heart disease data. To the “ever told blood pressure high” question, 3025 out of 4423 people answered “Yes”, 1186 out of 4423 people answered “No”, 45 out of 4423 people answered “Told borderline or pre-hypertensive”, 17 out of 4423 people answered “Yes, but female told only during pregnancy”, and 150 out of 4423 are missing data. These missing data ratios are all very low except for “missing data” category.
filt_2 <- filter( brfss2013[which(is.na(brfss2013$cvdcrhd4)), ] )
coronary_deleted <- filt_2[,c("bphigh4","cvdcrhd4")]
coronary_deleted <- droplevels(coronary_deleted)
missing_coronary_table = dplyr::count(coronary_deleted, var=bphigh4)
colnames(missing_coronary_table) = c("Response", "Count")
formattable(missing_coronary_table, align =c("l","c"), list(`Response` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
Response | Count |
---|---|
Yes | 3025 |
Yes, but female told only during pregnancy | 17 |
No | 1186 |
Told borderline or pre-hypertensive | 45 |
NA | 150 |
per_of_missing_coronary_res <- c(3025/198921, 17/3680, 1186/282687, 45/5067, 150/1420)
missing_coronary_response <- c("Yes",
"Yes, but female told only during pregnancy",
"No", "Told borderline or pre-hypertensive",
"Missing Data")
missing_coronary_data <- data.frame(per_of_missing_coronary_res,
missing_coronary_response) # Data frame for
plot_1 <- ggplot(data = missing_coronary_data, aes(x = reorder(missing_coronary_response, -per_of_missing_coronary_res),
y = per_of_missing_coronary_res, fill = missing_coronary_response )) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette="Set3") +
labs(y = "Percent", x = "Responses") +
theme(axis.text.x = element_text(size = 8, hjust = 0.5)) + # Plot the bar graph
ggtitle("Missing Ratio of Blood Pressure Data")+
theme(axis.text.x = element_text(angle = 20, hjust = 1, size = 8)) +
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=10) ) +
theme(legend.position="none")+
geom_text(aes(label=paste0(sprintf("%1.1f", per_of_missing_coronary_res*100),"%")),vjust = -0.2)
For the blood pressure data, there are a total number of 1420 missing data. To the “ever diagnosed with angina or coronary heart disease” question, 76 out of 1420 people answered “Yes”, 1194 out of 1420 people answered “No”, and 150 out of 1420 are missing data. These missing data ratios are all very low.
filt_4 <- filter( brfss[which(brfss$bphigh4=="Missing Data"), ] )
bloodp_deleted <- filt_4[,c("bphigh4","cvdcrhd4")]
bloodp_deleted <- droplevels(bloodp_deleted)
missing_bp_table = dplyr::count(bloodp_deleted, var=cvdcrhd4)
colnames(missing_bp_table) = c("Response", "Count")
formattable(missing_bp_table, align =c("l","c"), list(`Response` = formatter(
"span", style = ~ style(color = "grey",font.weight = "bold"))
))
Response | Count |
---|---|
Coronary | 76 |
Healthy | 1194 |
Missing Data | 150 |
per_of_missing_bp_res <- c(76/29064, 1194/458288, 150/4423)
missing_bp_response <- c("Coronary", "Healthy", "Missing Data")
missing_bp_data <- data.frame(per_of_missing_bp_res,
missing_bp_response) # Data frame for
plot_2 <- ggplot(data = missing_bp_data, aes(x = reorder(missing_bp_response,
-per_of_missing_bp_res),
y = per_of_missing_bp_res, fill = missing_bp_response )) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette="Set3") +
labs(y = "Percent", x = "Responses") +
theme(axis.text.x = element_text(size = 8, hjust = 0.5)) + # Plot the bar graph
ggtitle("Missing Ratio of Coronary Data")+
theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) +
theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
colour='black', size=10) ) +
theme(legend.position="none")+
geom_text(aes(label=paste0(sprintf("%1.1f", per_of_missing_bp_res*100),"%")),vjust = -0.2)
grid.arrange(plot_1, plot_2, nrow = 1)
Based on our assumptions, the only responses that we are interested in are those who answered “Yes” or “No” to both questions and we will leave out the other responses. After selection, we deleted 14378 samples, which accounts for only 2.92% of the total number of people surveyed. Also, the missing ratios are relatively small. Therefore, we can confidently delete these responses and continue our analysis.
#str(brfss2013$bphigh4)
bp <- which((brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") & (
brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt = filter(brfss2013[bp, ]) # Filter out unwanted samples
blood_pressure <- filt[, c("bphigh4", "cvdcrhd4")] # Filter out unwanted variables
#glimpse(blood_pressure)
blood_pressure$bphigh4 <- droplevels(blood_pressure$bphigh4)
blood_pressure$bphigh4 <- as.factor( recode(blood_pressure$bphigh4, "Yes" = "High Blood Pressure", "No" = "Normal Blood Pressure") )
blood_pressure$cvdcrhd4 <- as.factor( recode(blood_pressure$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy") )
491775 - length(blood_pressure$bphigh4) # Calculate the number of responses that have been deleted
## [1] 14378
(491775 - length(blood_pressure$bphigh4)) / 491775 # Calculate the proportion of responses that have been deleted
## [1] 0.02923695
#count(blood_pressure, vars= cvdcrhd4)
From the percent stacked bar chart below, it seems that there is an association between blood pressure and angina or coronary heart disease. In addition, the proportion of people who, with high blood pressure, have angina or coronary heart disease, seems to be higher than the proportion of people who, with normal blood pressure, have angina or coronary heart disease. However, we need to confirm our hypotheses by using chi-square test and confidence interval.
dt2 <- blood_pressure%>%
dplyr::group_by(bphigh4, cvdcrhd4)%>%
dplyr::tally()%>%
dplyr::mutate(percent=n/sum(n))
ggplot(data = dt2,aes(x= bphigh4, y = n,fill = cvdcrhd4))+
guides(fill = guide_legend(title = "Health Status"))+
geom_bar(stat="identity",position = "fill")+
xlab("Blood Pressure")+
ylab("Percent Value")+
geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
position=position_fill(vjust=0.5), colour="black", size =6)+
ggtitle("Percent Stacked Bar Chart for Blood Pressure and Coronary Heart Disease")+
theme(plot.title=element_text(family='', face='bold', hjust = 0.2,
colour='black', size=13) )
People’s eating time on green food such as vegetable, fruit, and fruit juice, are respectively recorded in the quantitative variables of cvdinfr4
fruitju1
fruit1
fvbeans
fvgreen
fvorang
and vegetab1
. We define a new variable eating
, which is the sum of the 6 variables.
fruitju1
: How Many Times Did You Drink 100 Percent Pure Fruit Juices?
fruit1
: How Many Times Did You Eat Fruit?
fvbeans
: How Many Times Did You Eat Beans Or Lentils?
fvgreen
: How Many Times Did You Eat Dark Green Vegetables?
fvorang
: How Many Times Did You Eat Orange-Colored Vegetables?
vegetab1
: How Many Times Did You Eat Other Vegetables?
#非NA的数据
test = subset(brfss2013,is.na(brfss2013$cvdcrhd4)==FALSE &is.na(brfss2013$fruitju1)==FALSE & is.na(brfss2013$fruit1)==FALSE&is.na(brfss2013$fvbeans)==FALSE & is.na(brfss2013$fvgreen)==FALSE&is.na(brfss2013$fvorang)==FALSE & is.na(brfss2013$vegetab1)==FALSE)
#有个NA的数据
test_rev = subset(brfss2013,!(is.na(brfss2013$cvdcrhd4)==FALSE &is.na(brfss2013$fruitju1)==FALSE & is.na(brfss2013$fruit1)==FALSE&is.na(brfss2013$fvbeans)==FALSE & is.na(brfss2013$fvgreen)==FALSE&is.na(brfss2013$fvorang)==FALSE & is.na(brfss2013$vegetab1)==FALSE))
#饮食数据变成所有加起来
eating = test$fruitju1 + test$fruit1 + test$fvbeans + test$fvgreen + test$fvorang + test$vegetab1
eating_rev = test_rev$fruitju1 + test_rev$fruit1 + test_rev$fvbeans + test_rev$fvgreen + test_rev$fvorang + test_rev$vegetab1
health = test$cvdcrhd4
health_rev = test_rev$cvdcrhd4
The missing ratio of the the coronary~eating is 11.4%, which is at an acceptable level.
Then we analyze the samples that we know whether they have coronary heart disease or not, but not know their ‘eating’.
#对有数据没数据做下卡方检验
test1 = data.frame(eating,health)
#test1是吃喝疾病都齐全的
yes = subset(test1,test1$health == "Yes")
no = subset(test1,test1$health == "No")
#yes是病, no是没病
test1$health<-factor(test1$health,labels=c("Coronary","healthy"))
test_rev1 = data.frame(eating_rev,health_rev)
#test_rev1是吃喝疾病有点不全的
test_rev1_no_eating = subset(test_rev1,is.na(test_rev1$eating_rev)==TRUE)
#这个是那些没有吃的信息的
no_eating_ill=subset(test_rev1_no_eating,test_rev1_no_eating$health_rev=="Yes")
no_eating_noill=subset(test_rev1_no_eating,test_rev1_no_eating$health_rev=="No")
ill_ratio_isdata =length(yes$eating)/(length(no$eating)+length(yes$eating))
ill_ratio_noeatingdata=length(no_eating_ill$eating_rev)/(length(no_eating_ill$eating_rev)+length(no_eating_noill$eating_rev))
We are to check if the missing data of eating do have a difference with the sample not missing eating data.
\(H_0\): There is no relationship between coronary and the missing data of eating
\(H_a\): There is no relationship between coronary and the missing data of eating
We set \(\alpha\) = 0.1
Missing = c(3472,53204,3472+53204)
No_missing = c(25592,405084,25592+405084)
Sum = c(3472+25592,53204+405084,3472+25592+53204+405084)
chi_tab = data.frame(Missing,No_missing,Sum,row.names = c("Coronary","Healthy","Sum"))
chi_tab_name = rownames(chi_tab)
chi_tab = cbind(chi_tab,chi_tab_name)
tab <- gt(data=chi_tab,rowname_col="chi_tab_name")
tab<-
tab%>%
tab_header(title="Contingency Table of Health And Eating")
tab
Contingency Table of Health And Eating | |||
---|---|---|---|
Missing | No_missing | Sum | |
Coronary | 3472 | 25592 | 29064 |
Healthy | 53204 | 405084 | 458288 |
Sum | 56676 | 430676 | 487352 |
missing = data.frame(Percent=c(3472/(3472+53204),53204/(3472+53204)),Category=c("Coronary","Healthy"))
missing_plot <-ggplot(data = missing,mapping = aes( x=Category, y =Percent,fill=Category))+
scale_fill_brewer(palette = "Pastel1")+ geom_col()+
theme(plot.margin=unit(rep(1,4),'lines'))+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
labs(title = "Lost Data")+theme(plot.title = element_text(hjust = 0.5))+
geom_text(aes(label=paste0(sprintf("%1.1f", Percent*100),"%")),vjust = -0.2)+xlab("Health State")
no_missing = data.frame(Percent=c(25592/(25592+405084),405084/(25592+405084)),Category=c("Coronary","Healthy"))
no_missing_plot <-ggplot(data = no_missing,mapping = aes( x=Category, y =Percent, fill=Category))+
scale_fill_brewer(palette = "Pastel1")+ geom_col()+
theme(plot.margin=unit(rep(1,4),'lines'))+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
labs(title = "No Lost Data")+theme(plot.title = element_text(hjust = 0.5))+
geom_text(aes(label=paste0(sprintf("%1.1f", Percent*100),"%")),vjust = -0.2)+xlab("Health State")
grid.arrange(missing_plot,no_missing_plot,ncol=2,nrow=1)
missing = c(3472,53204)
no_missing = c(25592,405084)
chi_exercise<-data.frame(missing,no_missing,row.names = c("cvd","no_Cvd"))
chisq.test(chi_exercise)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: chi_exercise
## X-squared = 2.9832, df = 1, p-value = 0.08413
According the P-value, which is 0.08<0.1, we can conclude that there is no evidence relation in coronary illness between the missing data and the not missing data.
Then we are going to view the data that are not missing at all.
We observe the boxplot of the sample that have coronary and healthy, and finf that they look almost the same.
#观察有数据部分,看看有病没病的分布
fill<-"#4271AE"
p = ggplot(test1,aes(x=health,y=eating))+geom_boxplot(fill = fill)+scale_y_continuous(name="The sum of eating times")+scale_x_discrete(name="Health Status")+ggtitle("Boxplot of the eating by health status")
p
Then we observe the eating
distribution of coronary patient, healthy people, unknown-state people.
We can see that their distribution is quite normal and look similar.
tt=subset(test_rev1,is.na(test_rev1$health_rev)==TRUE&is.na(test_rev1$eating_rev)==FALSE)
ttf = data.frame(tt)
yesf = data.frame(yes)
nof=data.frame(no)
p1<-ggplot(yesf,aes(x=eating,y=..density..))+
geom_histogram(fill="green", bins = 30)+geom_density(size =1)+xlim(0,2500)+ylim(0,0.002)+ggtitle("The people have Coronary")+xlab("Eating times")
p2<-ggplot(nof,aes(x=eating,y=..density..))+
geom_histogram(fill="green", bins = 30)+geom_density(size =1)+xlim(0,2500)+ylim(0,0.002)+ggtitle("The people do not have Coronary")+xlab("Eating times")
p3<-ggplot(ttf,aes(x=eating_rev,y=..density..))+
geom_histogram(fill="green", bins = 30)+geom_density(size =1)+xlim(0,2500)+ylim(0,0.002)+ggtitle("The people do not konw if they have Coronary")+xlab("Eating times")
grid.arrange(p1,p2,p3, nrow = 3)
Before we carry out any tests, we can plot a correlation matrix for all the variables that we are going to analyze to get a better sense of the relationships between these variables. Since this is only an overview of the correlation between these variables, we simply filter out all the missing data, and we are only concerned about the strength of the correlations but not about whether the correlation is positive or negative.
From the correlation matrix below, we can see that the correlation between coronary heart disease and the other variables are small (less than 0.2), indicating that their linear relationships are very weak. Therefore, we can build logistic regression model and other classification models to predict the probability of having coronary heart disease. What’s more, the correlation between the predictors (blood pressure, exercise, cholesterol level, obesity level, and green food) is small as well (less than 0.3), indicating weak linear relationships and low collinearity.
brfss_cor <- brfss2013
brfss_cor$eating <- brfss_cor$fruitju1 + brfss_cor$fruit1 + brfss_cor$fvbeans + brfss_cor$fvgreen + brfss_cor$fvorang + brfss_cor$vegetab1
cor_select_0 <- which((!is.na(brfss_cor$bphigh4)) & (!is.na(brfss_cor$cvdcrhd4)) &
(!is.na(brfss_cor$exerany2)) & (!is.na(brfss_cor$X_rfchol)) &
(!is.na(brfss_cor$X_bmi5cat)) & (!is.na(brfss_cor$eating)) )# Select only the samples who responded to all selected questions
#cor_select_1 <- cor_select_0
cor_data = filter(brfss_cor[cor_select_0, ]) # Filter out unwanted samples
cor_data <- cor_data[, c("cvdcrhd4", "bphigh4", "X_rfchol",
"X_bmi5cat", "exerany2", "eating")] # Filter out unwanted variables
cor_data[] <- lapply(cor_data,as.integer)
corrplot(cor(cor_data), method = "pie", order="hclust")
corrplot(cor(cor_data), method = "number", order="hclust")
To assess the association between health indexes or living habits and angina or coronary heart diseases, we perform significance tests between selected variables corresponding to health indexes and cvdcrhd4
.
We could conduct the tests if the following conditions are met:
Random: The samples are randomly selected from the population.
Large Sample Size: Sample sizes and all expected counts in the table are much greater than 5.
Independent: Knowing the values of both variables for one person in the study gives us no meaningful information about the values of the variables for another person. So individual observations are independent.
All variables we selected meet these conditions.
To assess the association between the level of obesity and coronary heart or angina diseases statistically, we conducted a Chi-squared test of these two variables.
After statistical conversion, the data table is as follows:
set.seed(109)
# bmi_samp=bmi_subset[sample(nrow(bmi_subset),10000,replace=FALSE),]
bmi_samp <- bmi_subset <- subset(brfss2013, is.na(brfss2013$X_bmi5cat) == FALSE & is.na(brfss2013$cvdcrhd4) == FALSE & brfss2013$X_bmi5cat != "Underweight", select = c(X_bmi5cat, cvdcrhd4))
X1 <- sum(bmi_samp$cvdcrhd4 == "Yes" & bmi_samp$X_bmi5cat == "Normal weight")
X2 <- sum(bmi_samp$cvdcrhd4 == "Yes" & bmi_samp$X_bmi5cat == "Overweight")
X3 <- sum(bmi_samp$cvdcrhd4 == "Yes" & bmi_samp$X_bmi5cat == "Obese")
X4 <- sum(bmi_samp$cvdcrhd4 == "No" & bmi_samp$X_bmi5cat == "Normal weight")
X5 <- sum(bmi_samp$cvdcrhd4 == "No" & bmi_samp$X_bmi5cat == "Overweight")
X6 <- sum(bmi_samp$cvdcrhd4 == "No" & bmi_samp$X_bmi5cat == "Obese")
Normalweight <- c(X1, X4, X1 + X4)
Overweight <- c(X2, X5, X2 + X5)
Obese <- c(X3, X6, X3 + X6)
Sum <- c(X1 + X2 + X3, X4 + X5 + X6, X1 + X2 + X3 + X4 + X5 + X6)
chi_bmi_tab <- data.frame(Normalweight, Overweight, Obese, Sum, row.names = c("Yes", "No", "Sum"))
cvd_sta <- rownames(chi_bmi_tab)
rownames(chi_bmi_tab) <- NULL
chi_bmi_tab <- cbind(cvd_sta, chi_bmi_tab)
bmi_tbl <- gt(data = chi_bmi_tab, rowname_col = "cvd_sta")
bmi_tbl <-
bmi_tbl %>%
tab_header(
title = md("**Contingency Table of the Level of Obesity and the Health Condition**")
)
bmi_tbl
Contingency Table of the Level of Obesity and the Health Condition | ||||
---|---|---|---|---|
Normalweight | Overweight | Obese | Sum | |
Yes | 6778 | 10264 | 10670 | 27712 |
No | 147020 | 155390 | 122681 | 425091 |
Sum | 153798 | 165654 | 133351 | 452803 |
We want to perform a test with the following hypotheses:
\(H_0\): There is no relationship between angina or coronary heart disease and the level of obesity.
\(H_a\): There is a relationship between angina or coronary heart disease and the level of obesity.
Then we looked into the conditional distribution (in proportions) of the health condition for each treatment. The following three plots only show slight differences in the percentage of people who ever had angina or coronary heart diseases.
normal <- data.frame(Percent = c(X4 / (X1 + X4), X1 / (X1 + X4)), Category = c("Health", "Coronary"))
normal_plot <- ggplot(data = normal, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
labs(title = "Normal Weight") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
Over <- data.frame(Percent = c(X5 / (X2 + X5), X2 / (X2 + X5)), Category = c("Health", "Coronary"))
over_plot <- ggplot(data = Over, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
labs(title = "Overweight") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
# over_plot
Obese <- data.frame(Percent = c(X6 / (X3 + X6), X3 / (X3 + X6)), Category = c("Health", "Coronary"))
obese_plot <- ggplot(data = Obese, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
labs(title = "Obese") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
# over_plot
grid.arrange(normal_plot, over_plot, obese_plot, ncol = 3, nrow = 1)
Normal <- c(X1, X4)
Over <- c(X2, X5)
Obese <- c(X3, X6)
chi_bmi <- data.frame(Normal, Over, Obese, row.names = c("cvd", "no_Cvd"))
output1 <- chisq.test(chi_bmi)
The calculated test statistic \[\chi^2=1608.6451307\] Using \(df=2\), the corresponding P-value is less than2.2e-16, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the obesity level and angina or coronary heart diseases are associated.
To assess the association between exercise and coronary heart or angina diseases statistically, we conducted a Chi-squared test of these two variables.
After statistical conversion, the data table is as follows:
ex <- which((brfss2013$exerany2 == "Yes" | brfss2013$exerany2 == "No") & (
brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt <- filter(brfss2013[which((brfss2013$exerany2 == "Yes" | brfss2013$exerany2 == "No") & (
brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")), ]) # Filter out unwanted samples
exercise <- filt[, c("exerany2", "cvdcrhd4")] # Filter out unwanted variables
# glimpse(blood_pressure)
exercise$exerany2 <- droplevels(exercise$exerany2)
exercise$exerany2 <- as.factor(recode(exercise$exerany2, "Yes" = "exercise", "No" = "No exercise"))
exercise$cvdcrhd4 <- as.factor(recode(exercise$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy"))
X2 <- sum(exercise$cvdcrhd4 == "Coronary" & exercise$exerany2 == "exercise")
X3 <- sum(exercise$cvdcrhd4 == "Coronary" & exercise$exerany2 == "No exercise")
X4 <- sum(exercise$cvdcrhd4 == "Healthy" & exercise$exerany2 == "exercise")
X5 <- sum(exercise$cvdcrhd4 == "Healthy" & exercise$exerany2 == "No exercise")
Exercise <- c(X2, X4, X2 + X4)
No_exercise <- c(X3, X5, X3 + X5)
Sum <- c(X2 + X3, X4 + X5, X2 + X3 + X4 + X5)
chi_exercise_tab <- data.frame(Exercise, No_exercise, Sum, row.names = c("Coronary", "Healthy", "Sum"))
cvd_sta <- rownames(chi_exercise_tab)
rownames(chi_exercise_tab) <- NULL
chi_exercise_tab <- cbind(cvd_sta, chi_exercise_tab)
exercise_tbl <- gt(data = chi_exercise_tab, rowname_col = "cvd_sta")
exercise_tbl <-
exercise_tbl %>%
tab_header(
title = md("**Contingency Table of Exercise and the Health Condition**")
)
exercise_tbl
Contingency Table of Exercise and the Health Condition | |||
---|---|---|---|
Exercise | No_exercise | Sum | |
Coronary | 16688 | 10733 | 27421 |
Healthy | 313500 | 112811 | 426311 |
Sum | 330188 | 123544 | 453732 |
We want to perform a test with the following hypotheses:
\(H_0\): There is no relationship between angina or coronary heart disease and exercise.
\(H_a\): There is a relationship between angina or coronary heart disease and exercise.
Then we looked into the conditional distribution (in proportions) of the health condition for each treatment. The following two plots only show slight differences in the percentage of people who ever had angina or coronary heart diseases.
Exercise <- data.frame(Percent = c(X2 / (X2 + X4), X4 / (X2 + X4)), Category = c("Coronary", "Healthy"))
Exercise_plot <- ggplot(data = Exercise, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
labs(title = "Exercise") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
No_exercise <- data.frame(Percent = c(X3 / (X3 + X5), X5 / (X3 + X5)), Category = c("Coronary", "Healthy"))
No_exercise_plot <- ggplot(data = No_exercise, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
labs(title = "No_Exercise") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
# over_plot
grid.arrange(Exercise_plot, No_exercise_plot, ncol = 2, nrow = 1)
Exercise <- c(X2, X4)
No_exercise <- c(X3, X5)
chi_exercise <- data.frame(Exercise, No_exercise, row.names = c("cvd", "no_Cvd"))
output2 <- chisq.test(chi_exercise)
The calculated test statistic \[\chi^2=2089.7317508\]
Using \(df=(2-1)(2-1)=2\), the corresponding P-value is less than 2.2e-16, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the obesity level and angina or coronary heart diseases are associated.
After statistical conversion, the data table is as follows:
rfcholcvdcrhd <- subset(brfss2013, is.na(brfss2013$X_rfchol) == FALSE & is.na(brfss2013$cvdcrhd4) == FALSE)
analysis <- table(rfcholcvdcrhd$X_rfchol, rfcholcvdcrhd$cvdcrhd4)
analysis_ <- addmargins(table(rfcholcvdcrhd$X_rfchol, rfcholcvdcrhd$cvdcrhd4))
Sum <- analysis_[, 3]
High_col <- analysis_[1, ]
norm_col <- analysis_[2, ]
Sum <- analysis_[3, ]
ori_chol_tab <- data.frame(High_col, norm_col, Sum, row.names = c("Coronary", "Healthy", "Sum"))
names(ori_chol_tab) <- c("Normal Cholesterol Level", "High Coolesterol Level", "Sum")
chol_sta <- rownames(ori_chol_tab)
rownames(ori_chol_tab) <- NULL
ori_chol_tab <- cbind(chol_sta, ori_chol_tab)
chol_tbl <- gt(data = ori_chol_tab, rowname_col = "chol_sta")
chol_tbl <-
chol_tbl %>%
tab_header(
title = md("**Contingency Table of the Cholesterol Level and the Coronary Heart Disease**")
)
chol_tbl
Contingency Table of the Cholesterol Level and the Coronary Heart Disease | |||
---|---|---|---|
Normal Cholesterol Level | High Coolesterol Level | Sum | |
Coronary | 7229 | 20838 | 28067 |
Healthy | 228129 | 160131 | 388260 |
Sum | 235358 | 180969 | 416327 |
We want to perform a test with the following hypotheses:
\(H_0\): There is no association between cholesterol level and angina or coronary heart disease.
\(H_a\): There is an association between cholesterol level and angina or coronary heart disease.
Then we looked into the conditional distribution (in proportions) of the health condition for each treatment. The following two plots show apparent differences in the percentage of people who ever had angina or coronary heart diseases.
normal <- data.frame(Percent = c(analysis_[1, 2] / analysis_[1, 3], analysis_[1, 1] / analysis_[1, 3]), Category = c("Healthy", "Coronary"))
normal_plot <- ggplot(data = normal, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
labs(title = "Normal Cholesterol Level") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
High <- data.frame(Percent = c(analysis_[2, 2] / analysis_[2, 3], analysis_[2, 1] / analysis_[2, 3]), Category = c("Healthy", "Coronary"))
high_plot <- ggplot(data = High, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
labs(title = "High Cholesterol Level") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
grid.arrange(normal_plot, high_plot, ncol = 2, nrow = 1)
Finally we perform the Chi-Square test.
chi_chol <- data.frame(norm_col, High_col)
chi_output <- chisq.test(chi_chol)
The calculated test statistic \[\chi^2=1.1600108\times 10^{4}\] Using \(df=2\), the corresponding P-value is 0, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the obesity level and angina or coronary heart diseases are associated.
# str(brfss2013$bphigh4)
bp <- which((brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") & (
brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt <- filter(brfss2013[bp, ]) # Filter out unwanted samples
blood_pressure <- filt[, c("bphigh4", "cvdcrhd4")] # Filter out unwanted variables
# glimpse(blood_pressure)
blood_pressure$bphigh4 <- droplevels(blood_pressure$bphigh4)
blood_pressure$bphigh4 <- as.factor(recode(blood_pressure$bphigh4, "Yes" = "High Blood Pressure", "No" = "Normal Blood Pressure"))
blood_pressure$cvdcrhd4 <- as.factor(recode(blood_pressure$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy"))
# 491775 - length(blood_pressure$bphigh4) # Calculate the number of responses that have been deleted
# (491775 - length(blood_pressure$bphigh4)) / 491775 # Calculate the proportion of responses that have been deleted
After statistical conversion, the data table is as follows:
bp_summary <- table(blood_pressure$cvdcrhd4, blood_pressure$bphigh4)
bp_summary_sum <- addmargins(table(blood_pressure$cvdcrhd4, blood_pressure$bphigh4))
a <- as.data.frame.matrix(bp_summary_sum)
new_col <- c("Coronary", "Healthy", "Sum")
a <- cbind(new_col, a)
colnames(a)[1] <- " "
a %>%
gt(rowname_col = " ") %>%
tab_header(
title = md("**Contingency Table of Blood Pressure and Coronart Heart Disease**")
)
Contingency Table of Blood Pressure and Coronart Heart Disease | |||
---|---|---|---|
High Blood Pressure | Normal Blood Pressure | Sum | |
Coronary | 22500 | 6234 | 28734 |
Healthy | 173396 | 275267 | 448663 |
Sum | 195896 | 281501 | 477397 |
We want to perform a test with the following hypotheses:
\(H_0\): There is no relationship between angina or coronary heart disease and the level of obesity.
\(H_a\): There is a relationship between angina or coronary heart disease and the level of obesity.
Then we looked into the conditional distribution (in proportions) of the health condition for each treatment. The following three plots only show slight differences in the percentage of people who ever had angina or coronary heart diseases.
normal <- data.frame(Percent = c(bp_summary_sum[2, 2] / bp_summary_sum[3, 2], bp_summary_sum[1, 2] / bp_summary_sum[3, 2]), Category = c("Healthy", "Coronary"))
normal_plot <- ggplot(data = normal, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
labs(title = "Normal Blood Pressure") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
High <- data.frame(Percent = c(bp_summary_sum[2, 1] / bp_summary_sum[3, 1], bp_summary_sum[1, 1] / bp_summary_sum[3, 1]), Category = c("Healthy", "Coronary"))
high_plot <- ggplot(data = High, mapping = aes(x = Category, y = Percent, fill = Category)) +
scale_fill_brewer(palette = "Pastel1") +
geom_col() +
theme(plot.margin = unit(rep(1, 4), "lines")) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
labs(title = "High Blood Pressure") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = guide_legend(title = "Health Status")) +
xlab("Health Status")
grid.arrange(normal_plot, high_plot, ncol = 2, nrow = 1)
output3 <- summary(bp_summary)
The calculated test statistic \[\chi^2=1.7552304\times 10^{4}\] Using \(df=2\), the corresponding P-value is extreme close to 0, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the blood pressure level and angina or coronary heart diseases are associated.
# 非NA的数据
test <- subset(brfss2013, is.na(brfss2013$cvdcrhd4) == FALSE & is.na(brfss2013$fruitju1) == FALSE & is.na(brfss2013$fruit1) == FALSE & is.na(brfss2013$fvbeans) == FALSE & is.na(brfss2013$fvgreen) == FALSE & is.na(brfss2013$fvorang) == FALSE & is.na(brfss2013$vegetab1) == FALSE)
# 有个NA的数据
test_rev <- subset(brfss2013, !(is.na(brfss2013$cvdcrhd4) == FALSE & is.na(brfss2013$fruitju1) == FALSE & is.na(brfss2013$fruit1) == FALSE & is.na(brfss2013$fvbeans) == FALSE & is.na(brfss2013$fvgreen) == FALSE & is.na(brfss2013$fvorang) == FALSE & is.na(brfss2013$vegetab1) == FALSE))
# 饮食数据变成所有加起来
eating <- test$fruitju1 + test$fruit1 + test$fvbeans + test$fvgreen + test$fvorang + test$vegetab1
eating_rev <- test_rev$fruitju1 + test_rev$fruit1 + test_rev$fvbeans + test_rev$fvgreen + test_rev$fvorang + test_rev$vegetab1
health <- test$cvdcrhd4
# 对有数据没数据做下卡方检验
test1 <- data.frame(eating, health)
# test1是吃喝疾病都齐全的
yes <- subset(test1, test1$health == "Yes")
no <- subset(test1, test1$health == "No")
# yes是病, no是没病
test1$health <- factor(test1$health, labels = c("Coronary", "healthy"))
Since the variable eating
is a numerical variable and its standard deviation is unknown, it is better to conduct a t-test to figure out whether there is an association with angina and coronary diseases.
We want to perform a t-test with the following hypotheses:
\(H_0\): There is an association between angina or coronary heart disease and the amount of fruits and vegetables eaten
\(H_a\): There is no association between angina or coronary heart disease and the amount of fruits and vegetables eaten
out_t <- t.test(test1$eating ~ test1$health)
Using the t-test we can get the P-value 0.6258454, we cannot reject \(H_0\), which indicates that there is no strong evidence that angina or coronary heart diseases are associated with the amount of fruits and vegetables eaten.
The resultant P-values of the significance tests for the whole dataset is listed in the table below.
out1 <- c(output1$p.value)
out2 <- c(output2$p.value)
out3 <- c(chi_output$p.value)
out4 <- c(output3$p.value)
out44 <- c(out_t$p.value)
conclusion_tab <- data.frame(out1, out2, out3, out4, out44, row.names = c("P-Value"))
names(conclusion_tab) <- c("Obesity Level", "Exercise", "Cholesterol Level", "Blood Pressure", "Green Food")
conclu_rowname <- c("P-Value")
conclusion_tab <- cbind(conclu_rowname, conclusion_tab)
conclusion_tab <- gt(data = conclusion_tab, rowname_col = "conclu_rowname")
conclusion_tbl <-
conclusion_tab %>%
tab_header(
title = md("**P-values of Whole Dataset**")
)
conclusion_tbl
P-values of Whole Dataset | |||||
---|---|---|---|---|---|
Obesity Level | Exercise | Cholesterol Level | Blood Pressure | Green Food | |
P-Value | 0 | 0 | 0 | 0 | 0.6258454 |
From the table above, it is clear that variables obesity level(bmi_cat5
), exercise(exerany2
), cholesterol level(X_rfchol
) and blood pressure level (bphigh4
) all have extremely small p-value, which indicates that they are strongly associated with angina or coronary heart diseases. Meanwhile, amount of fruits and vegetables eaten (eating
) has a p-value larger than the significance level of 5%, which indicates that we are not confident that is is associated with angina or coronary heart diseases.
Through the implementation of Chi-squared test and significance test, we can make some conclusions whether the variables are associated with angina or coronary heart diseases. However, Chi-squared test and significance test can only decide whether there is an association between two variables, and it cannot excavate the potential association between them. As a result, we can hardly use this model to predict the probability of getting angina or coronary heart diseases based on other variables. In this project, we only take it as a preliminary model to remove some insignificant variables.
Variable eating
is removed from our further analysis since it is tested to be insignificant variable in the significance tests. We perform our modeling process with explanatory variables X_bmi5cat
, exerany2
, bphigh4
and X_rfchol
and response variable cvdcrhd4
.
Logistic regression is used in statistical software to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities using a logistic regression equation. This type of statistical analysis (also known as logit model) is often used for predictive analytics and modeling, which can return the likelihood of an event happening or a choice being made. A logistic approach fits best when the task that the machine is learning is based on two values, or a binary classification. The logistic approach is more powerful compared to the linear regression when the dependent variable is limited to a range of values or categorical — A or B…or A, B, C or D. In our project, the dependent variable cvdcrhd4
is a categorical variable with two possible outcomes. Therefore, we could apply logistic regression to build up the prediction model, which also contains the information of the effect of the independent variables, so as to answer the research problem.
The logistic regression method assumes that:
cvdcrhd4
has two outcomes, “Yes” and “No” after EDA process filtering out the missing data.For the assumption of (b) and (c), we assumes that they hold true and conduct diagnostics to check the validity after building up the model.
The sigmoid function is used in the logistic regression model. Seeing from the following plot, we could find that it is a symmetric function about (0,0.5). Therefore, the model performs best when the threshold is set to be 0.5. However, due to extreme imbalance in the cvdcrhd4
variable, if we simply select 0.5 as the threshold without resample, the model means nothing although it appears to have a high accuracy by predicting all of the samples to be the outcomes with a larger number in the data set. As a result, we need to conduct an stratified sampling process of the data to deal with the issue, which respectively draws nearly equal samples from the healthy and coronary groups. The sampled data has a size of 50,296. And we take a random sample among these samples with a size of 1800 to avoid overfitting.
sigmoid = function(x) {
1 / (1 + exp(-x))
}
x <- seq(-5, 5, 0.01)
data=data.frame(x=x,y=sigmoid(x))
p <- ggplot(data = data, mapping = aes(
x = x, y = y))
p+geom_line(size = 1.2,color="blue")+labs(title = "Sigmoid Function")+theme(plot.title = element_text(hjust = 0.5))
pre_subset1=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" & is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="Yes"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(X_bmi5cat,cvdcrhd4,exerany2,bphigh4,X_rfchol))
pre_subset2=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" &is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="No"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(X_bmi5cat,cvdcrhd4,exerany2,bphigh4,X_rfchol))
set.seed(1206)
ratio=length(pre_subset1$cvdcrhd4)/length(pre_subset2$cvdcrhd4)
ind=sample(2,length(pre_subset2$cvdcrhd4),length(pre_subset1$cvdcrhd4),prob=c(ratio,1-ratio))
pre_subset2=pre_subset2[ind==1,]
pre_subset=rbind(pre_subset1,pre_subset2)
#table(pre_subset$cvdcrhd4)
The resampled data is shown as below:
out1 <- c(table(pre_subset$cvdcrhd4)[1])
out2 <- c(table(pre_subset$cvdcrhd4)[2])
#out3 <- c(chi_output$p.value)
#out4 <- c(output3$p.value)
#out44 <- c(out_t$p.value)
conclusion_tab <- data.frame(out1, out2, row.names = c("Number"))
names(conclusion_tab) <- c("Yes", "No")
conclu_rowname <- c("Number")
conclusion_tab <- cbind(conclu_rowname, conclusion_tab)
conclusion_tab <- gt(data = conclusion_tab, rowname_col = "conclu_rowname")
conclusion_tbl <-
conclusion_tab %>%
tab_header(
title = md("Resampled Data")
)
conclusion_tbl
Resampled Data | ||
---|---|---|
Yes | No | |
Number | 25156 | 25140 |
We divided this group of samples into training set,validation set and test set with a ratio of 0.6,0.2 and 0.2, and then used the training set to train the logistic regression model and the test set to test and evaluate the model.
Here we apply the backward method to select variables based on the AIC value of each model.Seeing from the result, we can select the model with minimum AIC that contains variables exerany2
,bphigh4
and X_rfchol
.
set.seed(999)
pre_subset=pre_subset[sample(nrow(pre_subset),1800,replace = FALSE),]
N=length(pre_subset$cvdcrhd4)
ind=sample(3,N,replace=TRUE,prob=c(0.6,0.2,0.2))
aus_train=pre_subset[ind==1,]
aus_val=pre_subset[ind==2,]
aus_test=pre_subset[ind==3,]
library(MASS)
pre=glm(cvdcrhd4~X_bmi5cat+exerany2+bphigh4+X_rfchol,data=aus_train,family=binomial(link="logit"))
stepAIC(pre,direction = "backward")
## Start: AIC=1293.11
## cvdcrhd4 ~ X_bmi5cat + exerany2 + bphigh4 + X_rfchol
##
## Df Deviance AIC
## - X_bmi5cat 2 1283.7 1291.7
## <none> 1281.1 1293.1
## - exerany2 1 1300.2 1310.2
## - bphigh4 1 1349.0 1359.0
## - X_rfchol 1 1360.9 1370.9
##
## Step: AIC=1291.67
## cvdcrhd4 ~ exerany2 + bphigh4 + X_rfchol
##
## Df Deviance AIC
## <none> 1283.7 1291.7
## - exerany2 1 1302.2 1308.2
## - bphigh4 1 1354.9 1360.9
## - X_rfchol 1 1364.3 1370.3
##
## Call: glm(formula = cvdcrhd4 ~ exerany2 + bphigh4 + X_rfchol, family = binomial(link = "logit"),
## data = aus_train)
##
## Coefficients:
## (Intercept) exerany2No bphigh4No X_rfcholYes
## 0.4555 -0.6295 1.1655 -1.2309
##
## Degrees of Freedom: 1087 Total (i.e. Null); 1084 Residual
## Null Deviance: 1508
## Residual Deviance: 1284 AIC: 1292
pre=glm(cvdcrhd4~exerany2+bphigh4+X_rfchol,data=aus_train,family=binomial(link="logit"))
summary(pre)
##
## Call:
## glm(formula = cvdcrhd4 ~ exerany2 + bphigh4 + X_rfchol, family = binomial(link = "logit"),
## data = aus_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8981 -0.8704 -0.6625 0.9911 1.8024
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4555 0.1338 3.405 0.000661 ***
## exerany2No -0.6295 0.1471 -4.279 1.88e-05 ***
## bphigh4No 1.1655 0.1401 8.318 < 2e-16 ***
## X_rfcholYes -1.2309 0.1394 -8.827 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1508.0 on 1087 degrees of freedom
## Residual deviance: 1283.7 on 1084 degrees of freedom
## AIC: 1291.7
##
## Number of Fisher Scoring iterations: 4
The summary of this model is shown below. We can find that all of the variables have a P-value smaller than 0.05, indicating all of the selected variables are significant predictors. The model can be also expressed in the following equation: \[log(\frac{p}{1-p})=0.4555-0.6295*exerany2No+1.1655*bphigh4No-1.2309*X\_rfcholYes\] where p indicates the probability of being healthy. By observing the coefficients of this model, we found that no exercise in the past 30 days, and high cholesterol were all negatively correlated with the absence of disease, while the absence of hypertension was positively correlated with the absence of coronary heart disease. This result is in accord with people’s common sense, that is, less exercise, higher cholesterol level and worse condition of hypertension are associated with a higher risk to suffer from coronary heart disease.
Here we used the test set to check the validity assumptions. Since the selected variables are categorical, which cannot show strong linearity, we only look into the relationship between logit and the sum of the multiplication of values and corresponding coefficients. As shown in the following plot, linear relationship between the logit of the outcome and sum of the predictor variables are strong and there is no extreme outliers. Therefore, the assumptions are reasonable for this model. We should also have some intuitive feelings from the plot that the healthy people generally have higher predicted probabilites, which suggests that logit well distinguishes two groups of people.
predict_=predict.glm(pre,type="response",newdata=aus_val)
predict=ifelse(predict_>0.5,1,0)
true=ifelse(aus_val$cvdcrhd4=="No",1,0)
aus_val$true=true
aus_val$predict=predict
true_value=aus_val[,"true"]
predict_value=aus_val[,"predict"]
error=predict_value-true_value
accuracy1=(nrow(aus_val)-sum(abs(error)))/nrow(aus_val)
aus_val$regression=ifelse(aus_val$exerany2=="No",-0.6295,0)
aus_val$regression=ifelse(aus_val$bphigh4=="No",1.1655,0)+aus_val$regression
aus_val$regression=ifelse(aus_val$X_rfchol=="Yes",-1.2309,0)+aus_val$regression+0.4555
aus_val$logit=log(predict_/(1-predict_))
aus_val$color=ifelse(aus_val$cvdcrhd4=="No","blue","red")
aus_val$cvdcrhd4=factor(aus_val$cvdcrhd4)
new_score=jitter(aus_val$regression)
p <- ggplot(data = aus_val, mapping = aes(
x=regression, y=logit))
p+geom_smooth(method = lm)+geom_jitter(color=aus_val$color)+labs(title = "the Result of Logistic Regression")+theme(plot.title = element_text(hjust = 0.5))
To check if the threshold is valid and if our sample method is reasonable, we use plot the relationship of the prediction accuracy and threshold. Then we find that our sample method is reasonable and nearly cause no bias. The accuracy reaches the peak at the threshold of 0.5.
accu_val = data.frame()
thre_val=data.frame()
for (n in 1:10)
{
ratio=n/10
predict_=predict.glm(pre,type="response",newdata=aus_val)
predict=ifelse(predict_>ratio,1,0)
true=ifelse(aus_val$cvdcrhd4=="No",1,0)
aus_val$true=true
aus_val$predict=predict
true_value=aus_val[,"true"]
predict_value=aus_val[,"predict"]
error=predict_value-true_value
accuracy=(nrow(aus_val)-sum(abs(error)))/nrow(aus_val)
accu_val=rbind(accu_val,accuracy)
thre_val=rbind(thre_val,ratio)
}
#plot(c(as.numeric(unlist(thre_val))),c(as.numeric(unlist(accu_val))))
#as.numeric(unlist(thre_val)
plot_=data.frame(x=as.numeric(unlist(thre_val)),y=as.numeric(unlist(accu_val)))
p <- ggplot(data = plot_, mapping = aes(
x = x, y = y))
p+geom_point(size=2)+labs(title = "Relationship between the Accuracy and the Threshold")+theme(plot.title = element_text(hjust = 0.5))+geom_line()
predict_=predict.glm(pre,type="response",newdata=aus_test)
predict=ifelse(predict_>0.5,1,0)
true=ifelse(aus_test$cvdcrhd4=="No",1,0)
#aus_val$true=true
#aus_val$predict=predict
#true_value=aus_val[,"true"]
#predict_value=aus_val[,"predict"]
error=predict-true
accuracy1=(nrow(aus_test)-sum(abs(error)))/nrow(aus_test)
We can get the accuracy of the test set is around 0.7643505. And we also plot the ROC curve and calculate the AUC value,0.801 for this model. And we conclude from this model that less exercise, higher cholesterol level and worse condition of hypertension are associated with a higher risk to suffer from coronary heart disease.
library(pROC)
modelroc <- roc(true,predict_)
plot(modelroc, print.auc=TRUE, auc.polygon=TRUE,legacy.axes=TRUE,
max.auc.polygon=TRUE,
auc.polygon.col="skyblue", print.thres=TRUE,smooth=TRUE)
pre_subset1=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" & is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="Yes"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))
pre_subset2=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" &is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="No"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))
set.seed(1206)
ratio=length(pre_subset1$cvdcrhd4)/length(pre_subset2$cvdcrhd4)
ind=sample(2,length(pre_subset2$cvdcrhd4),length(pre_subset1$cvdcrhd4),prob=c(ratio,1-ratio))
pre_subset2=pre_subset2[ind==1,]
pre_subset=rbind(pre_subset1,pre_subset2)
#table(pre_subset$cvdcrhd4)
KNN (K-Nearest Neighbors) is a machine learning technique and algorithm that can be used for both regression and classification tasks. K-Nearest Neighbors examines the labels of a chosen number of data points surrounding a target data point, in order to make a prediction about the class that the data point falls into. Since KNN is a non-parametric algorithm. This means that no assumptions about the data set are made when the model is used. For our project, we have some intuitions that people who have similar living habits and physiological indexes may have similar health states. Therefore, we could build up a KNN model and check if it can predict people’s health states.
Since the data set is extremely imbalanced in the research variable cvdcrhd4
, which can have a large impact on the KNN model that performs well when there is no bias caused by the original data, we need to conduct an stratified sampling process of the data to deal with the issue, which respectively draws nearly equal samples from the healthy and coronary groups. Moreover, to avoid overfitting, the data set is randomly sampled with a size of 4500. Here we divided the data set into three parts, the training set, the validation set and the test set. The corresponding ratio is 0.6, 0.2 and 0.2. We used the validation set to adjust the hyper-parameters in the models and the test set is for checking the validity of this model. Some of the variables are categorical variables so they are converted into dummy variables for modelling.
set.seed(20)
pre_subset=pre_subset[sample(nrow(pre_subset),4500,replace = FALSE),]
dummyV1 <- model.matrix(~.,pre_subset)
pre_subset=data.frame(dummyV1[,c(-1,-7,-9)])
N=dim(pre_subset)[1]
ind=sample(3,N,replace=TRUE,prob=c(0.6,0.2,0.2))
aus_train=pre_subset[ind==1,]
aus_test=pre_subset[ind==3,]
aus_val=pre_subset[ind==2,]
Firstly, we defined a suitable K value for this model. Seeing from the following plot, we selected the K-value with highest accuracy, 40, as the hyper-parameter.
library(class)
k=data.frame()
accu=data.frame()
train=aus_train
val=aus_val
#head(val)
for(n in 1:50)
{
kresult=knn(train = train[,-1],test=val[,-1],cl=train[,1],k=n,use.all=TRUE)
accuracy=sum(diag(table(val[,1],kresult)))/sum(table(val[,1],kresult))
k=rbind(k,n)
accu=rbind(accu,accuracy)
}
k=c(as.numeric(unlist(k)))
accu=c(as.numeric(unlist(accu)))
plot_dat=data.frame(x=k,y=accu)
ggplot(data = plot_dat,aes(x=k,y=accu))+
geom_point(size=3)+labs(title = "Accuracy vs K-value",x="K-value",y="Accuracy of the Validation Set")+theme(plot.title = element_text(hjust = 0.5))+geom_point(aes(40,accu[40]),color="red")
To optimize the model, we should minimize the model size nearly without reducing the accuracy. Here we applies backward method.
ACCURACY=accu[40]
var_select=data.frame()
var_accu=data.frame()
head(val)
for (col in 2:7)
{
var_select=rbind(var_select,col)
train=aus_train[,-col]
val=aus_val[,-col]
kresult=knn(train = train[,-1],test=val[,-1],cl=train[,1],k=40,use.all=TRUE)
accuracy=sum(diag(table(val[,1],kresult)))/sum(table(val[,1],kresult))
var_accu=rbind(var_accu,ACCURACY-accuracy)
}
var_select=c(as.numeric(unlist(var_select)))
var_name=c("X_bmi5catNormal.weight","X_bmi5catOverweight","X_bmi5catObese","exerany2No","bphigh4No",
"X_rfcholYes")
var_accu=c(as.numeric(unlist(var_accu)))
plot_dat=data.frame(x=var_name,y=var_accu)
p <- ggplot(data = plot_dat, mapping = aes(
x = var_name,
y = var_accu ,
fill=var_name))
p + geom_col()+labs(title = "Accuracy Reduction",x="Removed Variables",y="Accuracy Reduction")+theme(plot.title = element_text(hjust = 0.5))+theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+scale_fill_brewer(palette = "Set3")
From the bar plot above, removing the X_bmi5cat
related variables has little influence on the accuracy reduction, which is smaller than 1%. Therefore, we build up the KNN model with the rest three variables:exerany2No
,bphigh4No
,X_rfcholYes
.
To evaluate this optimized model, we used the test set and applied ROC curve together with the AUC value. As shown in the following graph, the AUC value that equals to the prediction accuracy in this model is approximately 0.702, which only indicates a general performance of prediction. However, we could conclude from the model that exercise habits, blood pressure and cholesterol level have associations with angina or coronary heart disease.
library(pROC)
train=aus_train[,c(-2,-3,-4)]
val=aus_val[,c(-2,-3,-4)]
test=aus_test[,c(-2,-3,-4)]
kresult=knn(train = train[,-1],test=test[,-1],cl=train[,1],k=40,use.all=TRUE)
roc1<-roc(kresult,test[,1])
plot(roc1,print.auc=TRUE,plot=TRUE,print.thres=TRUE,auc.polygon=TRUE,xlim=c(1,0),auc.polygon.col="skyblue",legacy.axes=TRUE)
Support Vector Machine (SVM) is a relatively simple Supervised Machine Learning Algorithm used for classification and/or regression. It is more preferred for classification but is sometimes very useful for regression as well. Basically, SVM finds a hyper-plane that creates a boundary between the types of data. Since SVM is based on vectors, there are also no assumptions about the data set are made when the model is used. Therefore, we could build up an SVM model and check if it can predict people’s health states.
pre_subset1=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" & is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="Yes"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))
pre_subset2=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" &is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="No"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))
set.seed(1206)
ratio=length(pre_subset1$cvdcrhd4)/length(pre_subset2$cvdcrhd4)
ind=sample(2,length(pre_subset2$cvdcrhd4),length(pre_subset1$cvdcrhd4),prob=c(ratio,1-ratio))
pre_subset2=pre_subset2[ind==1,]
pre_subset=rbind(pre_subset1,pre_subset2)
Since the data set is extremely imbalanced in the research variable cvdcrhd4
, we need to conduct an stratified sampling process of the data to deal with the issue, which respectively draws nearly equal samples from the healthy and coronary groups. Moreover, to avoid overfitting, the data set is randomly sampled with a size of 15000. Here we divided the data set into two parts, the training set and the test set. The corresponding ratio is 0.7 and 0.3. The test set is for checking the validity of this model. Some of the variables are categorical variables so they are converted into dummy variables for modelling.
set.seed(2021)
pre_subset=pre_subset[sample(nrow(pre_subset),15000,replace = FALSE),]
dummyV1 <- model.matrix(~.,pre_subset)
pre_subset=data.frame(dummyV1[,c(-1,-7,-9)])
N=dim(pre_subset)[1]
ind=sample(2,N,replace=TRUE,prob=c(0.7,0.3))
aus_train=pre_subset[ind==1,]
aus_test=pre_subset[ind==2,]
# pre_subset <- as.h2o(pre_subset)
# test <- as.h2o(test)
# aus_train <- as.h2o(aus_train)
# aus_test <- as.h2o(aus_test)
# x <- setdiff(names(aus_train), c("cvdcrhd4Yes", "int_rate"))
# y <- "cvdcrhd4Yes"
# iris.dl <- h2o.deeplearning(x = x, y = y, training_frame = pre_subset)
# predictions <- h2o.predict(iris.dl, pre_subset)
# performance = h2o.performance(model = iris.dl,newdata = test)
# print(performance)
library("e1071")
aus_train$cvdcrhd4No<- as.factor(aus_train$cvdcrhd4No)
train1 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
test1 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train1,cost=0.25,gamma=0.125,type = 'C')
test1_=subset(test1,select=-cvdcrhd4No,)
results_test1<-predict(object=model,test1_,type="class")
res_test1<-table(results_test1,test1$cvdcrhd4No)
accurary1<-sum(diag(res_test1))/sum(res_test1)
train2 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
test2 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train2,cost=0.25,gamma=0.125,type = 'C')
test2_=subset(test2,select=-cvdcrhd4No,)
results_test2<-predict(object=model,test2_,type="class")
res_test2<-table(results_test2,test2$cvdcrhd4No)
accurary2<-sum(diag(res_test2))/sum(res_test2)
train3 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
test3 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train3,cost=0.25,gamma=0.125,type = 'C')
test3_=subset(test3,select=-cvdcrhd4No,)
results_test3<-predict(object=model,test3_,type="class")
res_test3<-table(results_test3,test3$cvdcrhd4No)
accurary3<-sum(diag(res_test3))/sum(res_test3)
train4 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,exerany2No,bphigh4No,X_rfcholYes))
test4 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catNormal.weight,X_bmi5catOverweight,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train4,cost=0.25,gamma=0.125,type = 'C')
test4_=subset(test4,select=-cvdcrhd4No,)
results_test4<-predict(object=model,test4_,type="class")
res_test4<-table(results_test4,test4$cvdcrhd4No)
accurary4<-sum(diag(res_test4))/sum(res_test4)
train5 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,bphigh4No,X_rfcholYes))
test5 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train5,cost=0.25,gamma=0.125,type = 'C')
test5_=subset(test5,select=-cvdcrhd4No,)
results_test5<-predict(object=model,test5_,type="class")
res_test5<-table(results_test5,test5$cvdcrhd4No)
accurary5<-sum(diag(res_test5))/sum(res_test5)
train6 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,X_rfcholYes))
test6 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train6,cost=0.25,gamma=0.125,type = 'C')
test6_=subset(test6,select=-cvdcrhd4No,)
results_test6<-predict(object=model,test6_,type="class")
res_test6<-table(results_test6,test6$cvdcrhd4No)
accurary6<-sum(diag(res_test6))/sum(res_test6)
train7 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No))
test7 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No))
model<-svm(cvdcrhd4No~.,data=train7,cost=0.25,gamma=0.125,type = 'C')
test7_=subset(test7,select=-cvdcrhd4No,)
results_test7<-predict(object=model,test7_,type="class")
res_test7<-table(results_test7,test7$cvdcrhd4No)
accurary7<-sum(diag(res_test7))/sum(res_test7)
redu<- c(-(accurary2-accurary1),-(accurary3-accurary1),-(accurary4-accurary1),-(accurary5-accurary1),-(accurary6-accurary1),-(accurary7-accurary1))
var_name=c("X_bmi5catNormal.weight","X_bmi5catOverweight","X_bmi5catObese","exerany2No","bphigh4No","X_rfcholYes")
plot_data <- data.frame(x=var_name,y=redu)
p <- ggplot(data = plot_data, mapping = aes(
x = var_name,
y = redu ,
fill=var_name))
p + geom_col()+labs(title = "Accuracy Reduction",x="Removed Variables",y="Accuracy Reduction")+theme(plot.title = element_text(hjust = 0.5))+theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+scale_fill_brewer(palette = "Set3")
From the bar plot above, removing the X_bmi5cat
related variables and exerany2No
have little influence on the accuracy reduction. Therefore, we build up the SVM model with the rest three variables:bphigh4No
,X_rfcholYes
.
To evaluate this optimized model, we used the test set and applied ROC curve together with the AUC value. As shown in the following graph, the AUC value that equals to the prediction accuracy in this model is approximately 0.681, which only indicates a general performance of prediction. However, we could conclude from the model that blood pressure and cholesterol level have associations with angina or coronary heart disease.
train <- subset(aus_train,select=c(cvdcrhd4No,bphigh4No,X_rfcholYes))
test <- subset(aus_test,select=c(cvdcrhd4No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train,cost=0.1975309,gamma=0.1975309,type = 'C')
test_=subset(test,select=-cvdcrhd4No,)
results_test<-predict(object=model,test_,type="class")
res_test<-table(results_test,test$cvdcrhd4No)
accurary<-sum(diag(res_test))/sum(res_test)
model_roc <- roc(aus_test$cvdcrhd4No,as.numeric(results_test))
plot(model_roc,print.auc=TRUE,auc.polygon=TRUE,max.auc.polygon=TRUE,auc.polygon.col="skyblue",print.thres=TRUE)
Random forest is a commonly-used machine learning algorithm, which combines the output of multiple decision trees to reach a single result. What’s more, it reduce the over fitting of decision tree and more steady. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems. Therefore, we use random forest to predict people health.
library(randomForest)
new = data.frame(brfss2013$cvdcrhd4,brfss2013$X_bmi5cat,brfss2013$exerany2,brfss2013$bphigh4,brfss2013$X_rfchol)
colnames(new)=c("ill","bmi","exercise","pressure","cholesterol")
new=subset(new,is.na(new$ill)==FALSE&is.na(new$bmi)==FALSE&is.na(new$exercise)==FALSE&is.na(new$pressure)==FALSE&is.na(new$cholesterol)==FALSE)
ind <- sample(2, nrow(new), replace=TRUE, prob=c(0.1, 0.9))
new = data.frame(new,ind)
First we select the sample of coronary and no-coronary of 1:1, then we use random forest model to train. The training set have at least 5000 sample.
Comparing the Mean Decrease Gini of each variable which ralte to the importance of variable, we get that the importance of bmiX_bmi5cat
is highest, then is the exerciseexerany2
and cholesterolX_rfchol
, the blood pressurebphigh4
have lowest influence in the tree of random forest. The total error rate when the train sample have patient portion to healthy people equals to 1:1 is 30.06%. The accuracy in training set when predicting correctly on coronary patient is 69.3% while 70.5% on healthy people.
set.seed(123)
trainset = subset(new,new$ind==1)
testset = subset(new,new$ind==2)
train_no = subset(trainset,trainset$ill=="No")
train_yes = subset(trainset,trainset$ill=="Yes")
no_size = dim(train_no)[1]
yes_size = dim(train_yes)[1]
yess = floor(yes_size)
train_no = train_no[c(1:yess),]
#print(i)
trainset = rbind(train_yes,train_no)
trainset = trainset[,c(1:5)]
testset = testset[,c(1:5)]
rf <- randomForest(ill~.,data = trainset,ntree=100,proximity=TRUE)
rf
##
## Call:
## randomForest(formula = ill ~ ., data = trainset, ntree = 100, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 30.39%
## Confusion matrix:
## Yes No class.error
## Yes 1854 788 0.2982589
## No 818 1824 0.3096139
Mean_Decrease_GINI = rf
gini_tab = data.frame(importance(Mean_Decrease_GINI))
gini_tab
varImpPlot(Mean_Decrease_GINI)
Then we filtrate the variables, we wipe off the blood pressurebphigh4
,the model have both decrease on the accuracy of patient and healthy people, it may indicate that the random forest tree can be less influence by the variable which is less important.
To evaluate this optimized model, we used the test set and applied ROC curve together with the AUC value. We change the portion of coronary sample in training set. We found that it have different outcome in prediction. We plot the True positive rate and False positive rate, then we plot the ROC curve. Using the AUC definition, we also get the AUC value of random forest on the test set is 0.74.
Normally we use the training model of train sample have patient portion to healthy people equals to 1:1 to evaluate the people, it have accuracy of 69.94%.
x=rep(0,75)
dim(x) = c(25,3)
roccurve = data.frame(x)
colnames(roccurve)=c("The sample ratio of the negative/positive","p","n")
for(i in 1:25){
set.seed(123)
trainset = subset(new,new$ind==1)
testset = subset(new,new$ind==2)
train_no = subset(trainset,trainset$ill=="No")
train_yes = subset(trainset,trainset$ill=="Yes")
no_size = dim(train_no)[1]
yes_size = dim(train_yes)[1]
yess = floor(i*yes_size/5)
train_no = train_no[c(1:yess),]
#print(i)
trainset = rbind(train_yes,train_no)
trainset = trainset[,c(1:5)]
testset = testset[,c(1:5)]
rf <- randomForest(ill~.,data = trainset,ntree=10,proximity=TRUE)
outcome = data.frame(predict(rf,testset),testset$ill)
colnames(outcome)=c("predict","actual")
ill_ill = sum(outcome$actual=="Yes"&outcome$predict == "Yes")
ill_noill = sum(outcome$actual=="Yes"&outcome$predict == "No")
noill_ill = sum(outcome$actual=="No"&outcome$predict == "Yes")
noill_noill = sum(outcome$actual=="No"&outcome$predict == "No")
roccurve[i,]=c(i/5,ill_ill/(ill_noill+ill_ill),noill_ill/(noill_ill+noill_noill))
}
#rocname = rownames(roccurve)
#roctab = cbind(roccurve,rocname)
roc_tab = roccurve[c(2,4,6,8,10,12,14,16,18,20),]
colnames(roc_tab)=c("Sample ratio of the \nNegative/Positive","True Positive rate","False Positive rate")
cat <- gt::gt(roc_tab)
cat<-
cat%>%
tab_header("The False~True positive rate")
cat
The False~True positive rate | ||
---|---|---|
Sample ratio of the Negative/Positive | True Positive rate | False Positive rate |
0.4 | 0.9397428683 | 0.6726509571 |
0.8 | 0.7341673521 | 0.3547347243 |
1.2 | 0.7037790572 | 0.3181219227 |
1.6 | 0.6267261158 | 0.2538663273 |
2.0 | 0.4440933293 | 0.1665445383 |
2.4 | 0.2461798191 | 0.0815078355 |
2.8 | 0.2452274793 | 0.0808168460 |
3.2 | 0.1209038570 | 0.0407780220 |
3.6 | 0.0022076966 | 0.0005174386 |
4.0 | 0.0003895935 | 0.0001349840 |
roccurve[1,]=c(0,1,1)
auc_value = 0
for( i in 2:25){
auc_value = auc_value+(roccurve[i-1,2]+roccurve[i,2])*(roccurve[i-1,3]-roccurve[i,3])/2
}
ggplot(roccurve, aes(x=n, y=p)) + geom_line()+
geom_area(fill="Green",alpha=0.75)+
xlim(0,1)+
ylim(0,1)+
xlab("False positive rate")+
ylab("True positive rate")+
ggtitle("ROC curve")+
theme(
plot.title =element_text(hjust = 0.5,vjust = 1),plot.subtitle = element_text(hjust = 0.5))+
annotate("text", x=0.5 , y=0.5 ,label= c("AUC="))+
annotate("text", x=0.5 , y=0.4 ,label= round(auc_value,2))
Comparing the results of four models, we prefer to select logistic regression model as our final model for its highest AUC value and prediction accuracy.Besides, because the other three models are non-parametric models and have higher randomness, we cannot look into the detailed effects of the variables. Instead, from the logistic regression model, we conclude that less exercise, higher cholesterol level and worse condition of hypertension are associated with a higher risk to suffer from coronary heart disease and people’s degree of obesity have weak association with angina or coronary heart diseases.
accuracy <- c(0.764,0.702, 0.681,0.699)
auc <- c(0.801,0.702, 0.681,0.74)
variables <- c('exerany2+bphigh4+X_rfchol','exerany2+bphigh4+X_rfchol','bphigh4+X_rfchol','X_bmi5cat+exerany2+bphigh4+X_rfchol')
comp_tbl <- data.frame(accuracy,auc,variables,row.names = c("Logistic Regression","KNN","SVM", "Random Forest"))
comp_names <- rownames(comp_tbl)
names(comp_tbl)<-c("Accuracy","AUC","Chosen Variables")
rownames(comp_tbl) <- NULL
comp_tbl <- cbind(comp_names, comp_tbl)
comp_tbl_tbl <- gt(data = comp_tbl, rowname_col = "comp_names")
comp_tbl_tbl <-
comp_tbl_tbl %>%
tab_header(
title = "Model Comparison"
)
comp_tbl_tbl
Model Comparison | |||
---|---|---|---|
Accuracy | AUC | Chosen Variables | |
Logistic Regression | 0.764 | 0.801 | exerany2+bphigh4+X_rfchol |
KNN | 0.702 | 0.702 | exerany2+bphigh4+X_rfchol |
SVM | 0.681 | 0.681 | bphigh4+X_rfchol |
Random Forest | 0.699 | 0.740 | X_bmi5cat+exerany2+bphigh4+X_rfchol |
The reliability and validity of the data: there are several aspects of the data that may cause bias. The age distribution of the respondents is slightly different from the true adults age distribution in the US in 2013, so bias caused by age may occur. The race distribution of the data may not reflect the true race distribution in the US in 2013, thus bias caused by race may occur. The sex ratio of the respondents is slightly different from the true adults sex ratio in the US in 2013 so bias caused by gender may occur. Since BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US, any conclusion we draw is only applicable to the US. Since respondents are identified through telephone-based methods, people who do not have telephone service could not be interviewed, thus bias may occur. Since this is an observational study and the answers from the respondents may not be accurate, so bias may occur.
The variables that we use in our statistical modelling procedure are mostly categorical variables. However, these variables (blood pressure, cholesterol level, exercise frequency, obesity level) could be measured by continuous numbers which would provide more details about one’s health. In addition, If the variables we used in modelling procedure were continuous variables, our model may have better performance.
Appropriateness of the statistical analysis: After preliminary experiments, reliable data samples are obtained, which avoids the effect of extreme values and outliers of samples on the analysis results. Meanwhile, the sampling method also ensures the randomness and balance of samples. By making reasonable assumptions, the problem is simplified and the model can be applied more efficiently so the valid analysis results can be obtained. The model parameters were optimized by training set and validation set, according to the standard of AIC and some unnecessary variables were removed. Test sets are used to evaluate the reliability of the model and check the validity of the hypothesis to ensure the accuracy of the model. Since the samples are independent, the prediction results of the test set are representative to some extent. By comparing the prediction accuracy of the models and AUC values, appropriate models were selected to predict the disease. Therefore, the most valid model can be selected.
Potential limitations: the variables we selected are almost categorical ones. Therefore, during modelling, we had to convert them into dummy variables, which causes the coefficients to be not that meaningful to indicate the average increasing rate of each level since the levels are actually absolute states. And the linear relationship between the predicted logit values and the categorical variables is unable to check, or say the linear regression results mean nothing. Therefore, the conditions of logistic regression may be insufficient to meet. What’s more, due to the extreme imbalance of the data set, we increased the ratio of people with angina or coronary heart disease since our goal is to predict the probability of having diseases. But the actual ratio of people having diseases is not that high, using this model may result in the excessively conservative prediction results.
Suggestions for improvements: in this project, the selected variables are limited. So for further improvements, we can take more factors into considerations and apply more methods to avoid overfitting and build up a more stable model, such as cross validation.
Through the significance test, we already figured out that Green food eating is an insignificant factor with angina or coronary heart disease. Although degree of obesity (X_bmi5cat
) seems related to angina or coronary heart disease in the significance test, it does not contribute to the model construction. Therefore, we may conclude that degree of obesity is also not a significant factor associated with angina or coronary heart disease. Through the models we constructed, it is concluded that variables cholesterol level(X-rfchol
) and blood pressure (bphigh4
) are significant factors associated with angina or coronary heart disease, and exercise (exerany2
) is a secondary factor, comparing the P-value in the logistic regression model.
However, since the factors of getting diseases are complex and the accuracy of our model is not high enough, these selected factors are still insufficient to predict the occurrence of angina or coronary heart disease. But it provides us with some statistical information about the related factors of angina or coronary heart disease, which is illuminating for both medical research and people’s life.