Research Question

In this project, we decide to analyze what variables might have an effect on the probability of having angina or coronary heart disease. Based on life experience, we are interested in whether level of obesity measured by BMI (Body Mass Index), exercise frequency, cholesterol level, hypertension, and the consumption of green food are associated with angina or coronary heart disease and if so, how strong are their relationships? We also intend to build models that can predict the probability of having angina or coronary heart disease based on relevant variables.

What Is Angina or Coronary Heart Disease?

According to Heart and Stroke Foundation of Canada, “Angina is the medical term for chest pain or discomfort caused by a temporary disruption in the flow of blood and oxygen to the heart. People describe angina discomfort as a squeezing, suffocating or burning feeling – usually in the center of the chest, behind the breastbone.”

According to University of Ottawa Heart Institute, “Coronary heart disease, also known as coronary artery disease (CAD), is a condition which affects the arteries that supply the heart with blood. It is usually caused by atherosclerosis which is a buildup of plaque inside the artery walls. This buildup causes the inside of the arteries to become narrower and slows down the flow of blood.”

Why Are We Interested In Angina or Coronary Heart Disease

According to CDC:

Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States.
One person dies every 36 seconds in the United States from cardiovascular disease.
Coronary heart disease is the most common type of heart disease, killing 360,900 people in 2019.
About 18.2 million adults age 20 and older have CAD (about 6.7%).

From CDC statistics, it is obvious that coronary heart disease poses a major threat to humans’ health. If we could prove what factors are associated with coronary heart disease, we could gain a deeper insight into coronary heart disease and how we could reduce the risk of having coronary heart disease, hence our research question.

Methodology

In order to answer our research question, we first perform explanatory data analysis (EDA). Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, summarize important characteristics, and come up with potential research questions with the help of graphical representations and other tools. In this part, we take a glance at the whole data set, study the codebook, and come up with a research question which we are interested in (coronary heart disease), after which we perform data cleaning where we focus on the selected variables, cope with missing data, and process the variables so that they are ready for subsequent procedures.

With the selected variables ready, we perform significance test, which includes chi-square test, z-test, t-test, etc. The purpose of significance test is to check whether coronary heart disease is associated with the explanatory variables. It serves as a preliminary model to help us gain a deeper insight into the relationships between different variables. In this part, we first check whether the conditions for significance are met (randomness, large sample size, independence), write down the null hypotheses and alternative hypotheses, calculate the p-value, and conclude whether there is an association between the tested variables. However, this model cannot predict the probability of having coronary heart disease since it can only prove the existence of an association but not the type of the association. Therefore, more appropriate models are required in order to predict the probability of having coronary heart disease, hence the following step: statistical modelling.

In statistical modelling, we build different models (logistic regression, KNN, SVM and random forest) using all the explanatory variables to examine their joint influence on coronary heart disease and try to predict the probability of having coronary heart disease. The process of building these models comprise the following steps:

Make assumptions and check whether the conditions for modelling are met.
Divide the samples into training, test, and validation sets using undersampling technique since the response variable is extremely imbalanced.
Construct models using the training set and compute their accuracy using the validation set.
Optimize the models (improve accuracy) by tuning the parameters and selecting most relevant predictors based on various selection criterion.
Using the test set, evaluate the performance of each model separately by plotting their corresponding ROC curve (Receiver Operating Characteristic curve) and calculating corresponding AUC (Area Under Curve), and perform model diagnostics.
Compare the performance of all models, state the most influential predictors, and select the best model taking into account accuracy, AUC, computational costs, etc.

Lastly, based on results of EDA, significance test, and statistical modelling, we summarize our work and answer our research question.

Data Background

According to BRFSS Overview, “The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews. Over time, the number of states participating in the survey increased; by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point- in-time (usually one to three months). In this document, the term “state” is used to refer to all areas participating in BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico.

The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger."

Assumptions

There are several assumptions that we have made prior to performing statistical analysis.

Since BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US, any conclusion we draw is only applicable to the US.
Since respondents are identified through telephone-based methods based on stratified sampling (a method of sampling from a population which can be partitioned into subpopulations), and an overall estimated 97.5% of US households had telephone service in 2012, we can assume that the respondents are randomly selected. However, the use of landline telephone and cellular telephone means that people who do not have telephone service could not be interviewed. However, since most Americans around 2013 have telephone service, the bias caused by not collecting data from people who do not have telephone service is small, and thus could be ignored.
Since this is an observational study and the answers from the respondents may not be accurate, we cannot make definitive statements of fact about the safety, efficacy, or effectiveness of a practice. Nevertheless, the 2013 BRFSS data can still provide information on human’s health and behaviors, detect signals about the benefits and risks of certain practices in the general population, and help formulate hypotheses to be tested in subsequent experiments.

Explanatory Data Analysis

Data Overview

Firstly, there are a total number of 491775 rows, or respondents, and a total number of 330 columns in the brfss2013 data set.

# Take a look at the dimension of the data set
dim(brfss2013)

## [1] 491775    330

We can also take a look at the head and the tail of the data set to get a broad sense of the data.

# Take a look at the head and the tail of the data set
head(brfss2013)

tail(brfss2013)

Race Distribution

We now take a look at the race distribution of the respondents. Below is a table of the race distribution of the respondents.

brfss <- brfss2013
levels(brfss$X_race)<-c(levels(brfss$X_race),"Missing Data")
brfss$X_race[is.na(brfss$X_race)] <- "Missing Data"
race_table = dplyr::count(brfss, vars= X_race)
colnames(race_table) = c("Race", "Count")
formattable(race_table, align =c("l","c"), list(`Race` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Race	Count
White only, non-Hispanic	376253
Black only, non-Hispanic	39339
American Indian or Alaskan Native only, Non-Hispanic	7676
Asian only, non-Hispanic	9508
Native Hawaiian or other Pacific Islander only, Non-Hispanic	1547
Other race only, non-Hispanic	2701
Multiracial, non-Hispanic	9121
Hispanic	37060
Missing Data	8570

The bar chart for the race distribution below shows that the majority of the respondents are white only, non-Hispanic (376253). The second largest and the third largest races are black only, non-Hispanic, and Hispanic respectively.

ggplot(brfss, aes(x = X_race, fill = X_race), stat = "count" ) + 
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
  ylim(c(0,450000))+
  scale_fill_brewer(palette="Set1") +
  labs(y = "Count", x = "Responses") +
  ggtitle("Race Distribuion of Respondents")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=15) )+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  theme(legend.position="none")

Age Distribution

We now take a look at the age distribution of the respondents. Below is a table for all the respondents of different age groups.

levels(brfss$X_age_g)<-c(levels(brfss$X_age_g),"Missing Data")
brfss$X_age_g[is.na(brfss$X_age_g)] <- "Missing Data"
age_table = dplyr::count(brfss, vars= X_age_g)
colnames(age_table) = c("Age", "Count")
formattable(age_table, align =c("l","c"), list(`Age` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Age	Count
Age 18 to 24	27203
Age 25 to 34	50164
Age 35 to 44	60416
Age 45 to 54	83769
Age 55 to 64	109464
Age 65 or older	160747
Missing Data	12

From the bar chart below, we can see that the largest age group of the respondents is age 65 or older, the second largest and the third largest age groups are age 55 to 64 and age 45 to 54. The age distribution of the respondents is slightly different from the true adults age distribution in the US in 2013 (https://www.census.gov/data/tables/2013/demo/age-and-sex/2013-age-sex-composition.html), so bias may occur. However, for simplicity, we will ignore the bias caused by age.

ggplot(brfss, aes(x = X_age_g, fill = X_age_g), stat = "count" ) + 
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
  ylim(c(0,200000))+
  scale_fill_brewer(palette="Set1") +
  labs(y = "Count", x = "Age Distributions") +
  ggtitle("Age Distribuion of Respondents")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=15) )+
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 9))+
  theme(legend.position="none")

age_group <- c("Age 20 to 24", "Age 25 to 34", "Age 35 to 44",
                "Age 45 to 54", "Age 55 to 64", "Age 65 or older")
age_true <- c(7.1/73.5, 13.4/73.5, 12.8/73.5, 14.0/73.5, 12.4/73.5, 13.8/73.5)
age_true_propor <- data.frame(age_group, age_true)

ggplot(data = age_true_propor, mapping = aes(x = age_group, y =age_true,fill=age_group))+
  geom_text(aes(label = paste0(sprintf("%1.1f", age_true*100),"%")),vjust = -0.2,size=6)+
  scale_fill_brewer(palette = "Set1")+ 
  ylim(c(0,0.25))+
  geom_col()+labs(x = "Category",y = "Percent")+
  ggtitle("True Adults Age Distribuion In the US In 2013")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=15) )+
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 9))+
  theme(legend.position="none")

#grid.arrange(age_plot1, age_plot2, ncol = 1)

Sex Ratio

We now take a look at the sex ratio of the respondents. Below is a table for respondents’ gender.

levels(brfss$sex)<-c(levels(brfss$sex),"Missing Data")
brfss$sex[is.na(brfss$sex)] <- "Missing Data"
sex_table = dplyr::count(brfss, vars= sex)
colnames(sex_table) = c("Sex", "Count")
formattable(sex_table, align =c("l","c"), list(`Sex` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Sex	Count
Male	201313
Female	290455
Missing Data	7

The bar chart below shows that there are more females than males in the survey. The ratio is about 20 million males to 29 million females. The sex ratio of the respondents is slightly different from the true adults sex ratio in the US in 2013 (https://www.census.gov/data/tables/2013/demo/age-and-sex/2013-age-sex-composition.html), so bias may occur. However, for simplicity, we will ignore the bias caused by gender.

ggplot(brfss, aes(x = sex, fill = sex), stat = "count" ) + 
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
  ylim(c(0,350000)) +
  scale_fill_brewer(palette="Set1") +
  labs(y = "Count", x = "Age Distributions") +
  ggtitle("Sex Distribuion of Respondents")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=15) )+
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 12))+
  theme(legend.position="none")

sex_group <- c("Males", "Females")
sex_true <- c(152335/311216, 158781/311216)
sex_true_propor <- data.frame(sex_group, sex_true)

ggplot(data = sex_true_propor, mapping = aes(x =reorder(sex_group,sex_true), 
                                             y =sex_true,fill=sex_group))+
  geom_text(aes(label = paste0(sprintf("%1.1f", sex_true*100),"%")),vjust = -0.2,size=6)+
  scale_fill_manual(values = c("blue", "red") ) +
  ylim(c(0,0.6))+
  geom_col()+labs(x = "Category",y = "Percent")+
  ggtitle("True Adults Sex Ratio In the US In 2013")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=15) )+
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 9))+
  theme(legend.position="none")

Explanatory Variables and Response Variable

The variables we choose to answer our research question include both categorical variables and quantitative variables.

Categorical variables:

cvdcrhd4: Ever diagnosed with angina or coronary heart disease
X_bmi5cat: Computed body mass index categories
exerany2: During the past month, other than your regular job, did you participate in any physical activities or exercises
X_rfchol: High cholesterol calculated variable
bphigh4: Ever told blood pressure high

Quantitative variables:

fruitju1: How many times did you drink 100 percent pure fruit juices
fruit1: How many times did you eat fruit
fvbeans: How many times did you eat beans or lentils
fvgreen: How many time did you eat dark green vegetables
fvorang: How many times did you eat orange-colored vegetables
vegetab1: How many times did you eat other vegetables

We combine the six quantitative variables related to green food consumption since they are collinear (correlated). Some other variables are also collinear (correlated) and of the same topic, so adding more than one of these variables to the model would not add much value to the model.

The response variable in our analysis is cvdcrhd4 (Ever diagnosed with angina or coronary heart disease) and the explanatory variables are the rest variables.

We now take a deeper look into these variables.

(1) Ever Diagnosed with Angina or Coronary Heart Disease

Firstly, the cvdcrhd4 variable, which corresponds to the question “ever diagnosed with angina or coronary heart disease”, has three categories, namely “Yes”, “No”, and missing data. There are a total number of 491775 responses, 29064 of which answered “Yes”, 458288 of which answered “No”, and 4423 of which are missing data. We only want the samples that answered “Yes” or “No” to this question. Again, for simplicity, we will consider those who answered “Yes” to have angina or coronary heart disease and those who answered “No” to not have angina or coronary heart disease.

levels(brfss$cvdcrhd4)<-c(levels(brfss$cvdcrhd4),"Missing Data")
brfss$cvdcrhd4[is.na(brfss$cvdcrhd4)] <- "Missing Data"
brfss$cvdcrhd4 <- as.factor( recode(brfss$cvdcrhd4, "Yes" = "Coronary", 
                                    "No" = "Healthy") )
chd_table = dplyr::count(brfss, vars= cvdcrhd4)
colnames(chd_table) = c("Response", "Count")
formattable(chd_table, align =c("l","c"), list(`Response` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Response	Count
Coronary	29064
Healthy	458288
Missing Data	4423

The bar chart below gives a good overview of the cvdcrhd4 variable:

#mapping = aes(x = cvdcrhd4, fill = cvdcrhd4), stat = "count"
ggplot(brfss, aes(x = cvdcrhd4, fill = cvdcrhd4), stat = "count") + 
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), vjust = -1, colour = "black",size=6)+
  scale_x_discrete(name = " ", limits = c("Healthy","Coronary", "Missing Data")) +
  scale_fill_brewer(palette="Set1") +
  labs(y = "Count", x = "Responses") +
  ylim(c(0,500000))+
  ggtitle("Ever Diagnosed with Angina or Coronary Heart Disease")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.2,
                                colour='black', size=15) )+
  theme(legend.position="none")

The pie chart below gives another overview of the cvdcrhd4 variable:

ggplot(brfss, aes(x="xxx", fill=cvdcrhd4)) +
  geom_bar( width=1, color="white") +
  coord_polar("y", start=0) +
  theme_void() + # remove background, grid, numeric labels
  guides(fill = guide_legend(title = "Coronary Heart Disease"))+
  ggtitle("Ever Diagnosed with Angina or Coronary Heart Disease")+
  theme(plot.title=element_text(family='', face='bold', hjust = -0.1,
                                colour='black', size=15) )

(2) Obesity Level

The X_bmi5cat, which corresponds to the computed body mass index categories.The categorical variable X_bmi5cat has 4 categories as shown below.

bmi_tab<-data.frame(table(brfss2013$X_bmi5cat))
colnames(bmi_tab)=c('Category','Frequency')
bmi_cat <- gt(data = bmi_tab)

bmi_cat <-
  bmi_cat %>%
  tab_header(
    title = "Contingency Table of the Categories of `X_bmi5cat`"
  )
bmi_cat

Contingency Table of the Categories of `X_bmi5cat`
Category	Frequency
Underweight	8267
Normal weight	154898
Overweight	167084
Obese	134799

The sample numbers of every category are given by the bar plot above. Since the factor we surveyed is the obesity level and the number of Underweight group is small, we omitted the Underweight group.

bmi_plot <- ggplot(data = bmi_tab, mapping = aes(x = Category, y =Frequency,fill=Category))+
  geom_text(aes(label = Frequency),vjust = -0.2,size=6)+
  scale_fill_brewer(palette = "Pastel1")+ 
  ylim(c(0,200000))+
  geom_col()+labs(title = "Level of Obesity",x = "Category",y = "Frequency")+
  theme(plot.title = element_text(hjust = 0.5))
bmi_plot

Missing Data Processing

To avoid the bias caused by the missing data, we calculated the missing ratio of each category in X_bmi5cat and cvdcrhd4. Seeing from the bar plot below, the difference of the missing ratios in each category is acceptable. Therefore, there doesn’t exist some extreme imbalance of the missing data.

obmit_under=subset(brfss2013,brfss2013$X_bmi5cat!="Underweight"|is.na(brfss2013$X_bmi5cat)==TRUE)
cvd_na=subset(obmit_under,is.na(obmit_under$cvdcrhd4)==TRUE)
normal_na_ratio=table(cvd_na$X_bmi5cat)[2]/table(obmit_under$X_bmi5cat)[2]
over_na_ratio=table(cvd_na$X_bmi5cat)[3]/table(obmit_under$X_bmi5cat)[3]
obese_na_ratio=table(cvd_na$X_bmi5cat)[4]/table(obmit_under$X_bmi5cat)[4]

bmi_missing<-data.frame(Ratio=c(normal_na_ratio,over_na_ratio,obese_na_ratio),Category=c('Normal weight','Overweight','Obese'))
bmi_missing_plot <- ggplot(data = bmi_missing, mapping = aes( x=Category, y =Ratio,fill=Category))+
  scale_fill_brewer(palette = "Pastel1")+
  geom_col()+
  theme(plot.margin=unit(rep(0,8),'lines'))+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  labs(title = "Missing Ratio of `X_bmi5cat`")+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(aes(label=paste0(sprintf("%1.1f", Ratio*100),"%")),vjust = -0.2)


bmi_na=subset(obmit_under,is.na(obmit_under$X_bmi5cat)==TRUE)
cvd_na_ratio=table(bmi_na$cvdcrhd4)[1]/table(obmit_under$cvdcrhd4)[1]
nocvd_na_ratio=table(bmi_na$cvdcrhd4)[2]/table(obmit_under$cvdcrhd4)[2]

cvd_missing<-data.frame(Ratio=c(cvd_na_ratio,nocvd_na_ratio),Category=c('Yes','No'))

cvd_missing_plot <- ggplot(data = cvd_missing, mapping = aes( x=Category, y =Ratio,fill=Category))+scale_fill_brewer(palette = "Pastel1")+
  geom_col()+
  theme(plot.margin=unit(rep(1,4),'lines'))+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  labs(title = "Missing Ratio of `cvdcrhd4`")+
  theme(plot.title = element_text(hjust = 0.5))+
  geom_text(aes(label=paste0(sprintf("%1.1f", Ratio*100),"%")),vjust = -0.2)


grid.arrange(bmi_missing_plot,cvd_missing_plot,nrow=1)

cvd_bmi_na=subset(obmit_under,is.na(obmit_under$cvdcrhd4)==TRUE | is.na(obmit_under$X_bmi5cat)==TRUE)
all_na_ratio=length(cvd_bmi_na$cvdcrhd4)/length(obmit_under$cvdcrhd4)

The overall missing ratio, 0.0635046, is relatively small. As a result, in the following analysis, we would ignore the bias causing by the missing samples. Then we constructed a subset omitting the samples with missing X_bmi5cat and cvdcrhd4.

Association Overview

The stack bar diagram below roughly shows that the proportion of people who ever had angina or coronary heart disease increases as the level of obesity increases.

bmi_subset=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & is.na(brfss2013$cvdcrhd4)==FALSE & brfss2013$X_bmi5cat!="Underweight",select = c(X_bmi5cat,cvdcrhd4))
bmi_subset$cvdcrhd4 <- as.factor( recode(bmi_subset$cvdcrhd4, "Yes" = "Coronary", 
                                    "No" = "Healthy") )
dt0 <- bmi_subset%>%
  dplyr::group_by(X_bmi5cat, cvdcrhd4)%>%
  dplyr::tally()%>%
  dplyr::mutate(percent=n/sum(n))

ggplot(data = dt0,aes(x= X_bmi5cat, y = n,fill = cvdcrhd4))+
  guides(fill = guide_legend(title = "Health Status"))+
  geom_bar(stat="identity",position = "fill")+
  geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
                     position=position_fill(vjust=0.5), colour="black", size =6)+
  scale_fill_brewer(palette = "Paired")+labs(title = "Stack Bar Plot of Disease Condition and Level of Obesity",x = "Category",y = "Percentage",fill="Health Condition")+
  theme(plot.title = element_text(hjust = 0.5))

(3) Any Exercise In The Past 30 Days

The categorical variable exerany2 has two categories as shown below.

levels(brfss$exerany2)<-c(levels(brfss$exerany2), "Missing Data")
brfss$exerany2[is.na(brfss$exerany2)] <- "Missing Data"
ex_tab<-data.frame(dplyr::count(brfss, vars= exerany2))
colnames(ex_tab)=c('Category','Frequency')
ex_tab

We plot the bar chart of categorical variable exerany2 as follow.

ggplot(data = ex_tab, 
       mapping = aes(x = Category, y=Frequency, fill = Category), stat = "count") + 
  scale_fill_brewer(palette = "Set1")+ geom_col( ) +
  geom_text(aes(label = Frequency),size=6, vjust = -0.2) + ylim(c(0,350000))+
  labs(y = "Frequency", x = "Category") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) + # Plot the bar graph
  ggtitle("Exercise In Past 30 Days")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=15) )

Missing Data Processing

ex_misratio<-34029/length(brfss2013$exerany2)
ex_misratio

## [1] 0.06919628

There are some missing data, the ratio of the missing data is 0.0691963. The ratio is relatively small. As a result, in the following analysis, we would ignore the difference caused by deleting the missing data.

ex <- which((brfss2013$exerany2 == "Yes" | brfss2013$exerany2 == "No") & ( 
      brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt = filter(brfss2013[ex, ]) # Filter out unwanted samples
exercise <- filt[, c("exerany2", "cvdcrhd4")] # Filter out unwanted variables
#glimpse(blood_pressure)
exercise$exerany2 <- droplevels(exercise$exerany2)
exercise$exerany2 <- as.factor( recode(exercise$exerany2, "Yes" = "exercise", "No" = "No exercise") )
exercise$cvdcrhd4 <- as.factor( recode(exercise$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy") )
samples<-491775 - length(exercise$exerany2) # Calculate the number of responses that have been deleted
samples_ratio<-(491775 - length(exercise$exerany2)) / 491775 # Calculate the proportion of responses that have been deleted

We have deleted a total number of 3.8043^{4} samples from these two variables, and the overall missing data ratio is 0.0773585. Obviously, it is relatively small so that we can ignore these samples in the following analysis.

We draw a bar chart of the relationship between exerany2 and cvdcrhd4.

ggplot(exercise, aes(x = exerany2, fill = cvdcrhd4), stat = "count") + 
  geom_bar(position = "dodge")+
  geom_text(position = position_dodge(width = .9),stat = "count", 
            aes(label = ..count..),size=6, vjust = -1, colour = "black")+
  ylim(c(0,350000))+
  guides(fill = guide_legend(title = "Health Condition"))+
  xlab("Category")+ ylab("Frequency")+
  ggtitle("Gropued Bar Chart for Exerise and Coronary Heart Disease")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=13) )

Association Overview

As we can roughly see from two bar chart below, people who had exercised in the past 30 days had a much lower rate of angina or coronary heart disease than people who had not exercised.

dt1 <- exercise%>%
  dplyr::group_by(exerany2, cvdcrhd4)%>%
  dplyr::tally()%>%
  dplyr::mutate(percent=n/sum(n))

ggplot(data = dt1,aes(x= exerany2, y = n,fill = cvdcrhd4))+
  guides(fill = guide_legend(title = "Health Status"))+
  geom_bar(stat="identity",position = "fill")+
  xlab("Exercise")+
  ylab("Percent Value")+
  geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
                     position=position_fill(vjust=0.5), colour="black", size =6)+
  ggtitle("Percent Stacked Bar Chart for Exercise and Coronary Heart Disease")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=13) )

(4) Ever Told Cholesterol High

The valid categories and their corresponding sample sizes in X_rfchol are shown in the table below:

X_rfchol_NA_process <- brfss2013$X_rfchol
X_rfchol_NA_process[is.na(X_rfchol_NA_process)] <- NA
rfchol_tab <- data.frame(table(X_rfchol_NA_process))
colnames(rfchol_tab) <- c("Category", "Frequency")
rfchol_cat <- gt(data = rfchol_tab)

rfchol_cat <-
  rfchol_cat %>%
  tab_header(
    title = "Contingency Table of the Categories of \"X_rfchol\""
  )
rfchol_cat

Contingency Table of the Categories of "X_rfchol"
Category	Frequency
No	236614
Yes	183497

We also make a bar plot to make it more visual.

levels(brfss$X_rfchol)<-c(levels(brfss$X_rfchol),"Missing Data")
brfss$X_rfchol[is.na(brfss$X_rfchol)] <- "Missing Data"
brfss$X_rfchol <- as.factor( recode(brfss$X_rfchol, "Yes" = "High Cholesterol", 
                                    "No" = "Normal Cholesterol") )
ggplot(brfss, aes(x = X_rfchol, fill = X_rfchol),stat = "count") +
  geom_bar(stat = "count") +
  ylim(c(0,300000))+
  geom_text(stat = "count", aes(label = ..count..), vjust = -1, colour = "black",size=6)+
  labs(y = "Count", x = "Responses") +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) +
  scale_fill_brewer(palette = "Pastel1")+
  ggtitle("Ever Told Cholesterol Level High") +
  theme(plot.title = element_text(
    family = "", face = "bold", hjust = 0.5,
    colour = "black", size = 15
  ))

There are 183497 respondents answered that they have been told to have high cholesterol level, 236614 answered that they have not been told to have high cholesterol level and 71664 data are missing.

Missing Data Processing

To further analysis the data, we need to due with the missing data from the original data set. The missing ratio of variable X_rfchol is 0.1457252, which is not big. Therefore, we can just throw out these missing data.

Moreover, we need to make sure that the missing ratios of cvdcrhd4 corresponding to the categories in X_rfchol are not too high and have a small difference.

The plot "Missing Ratios Corresponding to X_rfchol's Categories" below shows the missing ratio of cvdcrhd4 corresponding to the “Yes” and “No” categories of X_rfchol. The “missing” here stands for the samples who have valid X_rfchol data but miss cvdcrhd4 data.

yes <- subset(brfss2013, brfss2013$X_rfchol == "Yes")
no <- subset(brfss2013, brfss2013$X_rfchol == "No")
yes_na <- subset(brfss2013, brfss2013$X_rfchol == "Yes" & is.na(brfss2013$cvdcrhd4))
no_na <- subset(brfss2013, brfss2013$X_rfchol == "No" & is.na(brfss2013$cvdcrhd4))
missing <- c(nrow(yes_na) / nrow(yes), nrow(no_na) / nrow(no))
names <- c("Yes", "No")
missing_ratio <- data.frame(missing, names)
c1 <- ggplot(missing_ratio, aes(x = names, y = missing, fill = names)) +
  geom_bar(stat = "identity") +
  ylim(c(0,0.015))+
  geom_text(aes(label=paste0(sprintf("%1.1f", missing*100),"%")),vjust = -0.2,size=6)+
  labs(y = "Missing ratio", x = "X_rfchol") +
  scale_fill_brewer(palette = "Pastel1")+
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) +
  ggtitle("Missing Ratios Corresponding to X_rfchol's Categories") +
  theme(plot.title = element_text(
    family = "", face = "bold", hjust = 0.5,
    colour = "black", size = 8
  ))

According to the plot, the missing ratios of cvdcrhd4 corresponding to categories of X_rfchol are both small (less than 2%) and close to each other.

The plot "Missing Ratios Corresponding to cvdcrhd4's Categories" below shows the missing ratio of X_rfchol corresponding to the “Yes” and “No” categories of cvdcrhd4. The “missing” here stands for the samples who have valid cvdcrhd4 data but miss X_rfchol data.

yes <- subset(brfss2013, brfss2013$cvdcrhd4 == "Yes")
no <- subset(brfss2013, brfss2013$cvdcrhd4 == "No")
yes_na <- subset(brfss2013, brfss2013$cvdcrhd4 == "Yes" & is.na(brfss2013$X_rfchol))
no_na <- subset(brfss2013, brfss2013$cvdcrhd4l == "No" & is.na(brfss2013$X_rfchol))
missing <- c(nrow(yes_na) / nrow(yes), nrow(no_na) / nrow(no))
names <- c("Yes", "No")
missing_ratio <- data.frame(missing, names)
c2 <- ggplot(missing_ratio, aes(x = names, y = missing, fill = names)) +
  ylim(c(0,0.04))+
  geom_bar(stat = "identity") +
  geom_text(aes(label=paste0(sprintf("%1.1f", missing*100),"%")),vjust = -0.2,size=6)+
  labs(y = "Missing ratio", x = "cvdcrhd4") +
  scale_fill_brewer(palette = "Pastel1")+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  ggtitle("Missing Ratios Corresponding to cvdcrhd4's Categories") +
  theme(plot.title = element_text(
    family = "", face = "bold", hjust = 0.5,
    colour = "black", size = 8
  ))
grid.arrange(c1,c2,nrow=1)

According to the plot, the missing ratios of X_rfchol corresponding to categories of X_rfchol are both small and close to each other as well. As a result, the missing data have a small influence on the final result. We can just throw out all samples with missing data in X_rfchol and cvdcrhd4.

rfchol <- which((brfss2013$X_rfchol == "Yes" | brfss2013$X_rfchol == "No") & (
  brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt <- filter(brfss2013[rfchol, ]) # Filter out unwanted samples
rfchol_data <- filt[, c("X_rfchol", "cvdcrhd4")] # Filter out unwanted variables
# glimpse(blood_pressure)
rfchol_data$X_rfchol <- droplevels(rfchol_data$X_rfchol)
rfchol_data$X_rfchol <- as.factor(recode(rfchol_data$X_rfchol, "Yes" = "High Cholesterol Level", "No" = "Normal Cholesterol Level"))
rfchol_data$cvdcrhd4 <- as.factor(recode(rfchol_data$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy"))

Association Overview

We make a percent stacked bar plot to show the association between them. From the plot we found that cholesterol level the rate of getting angina or coronary heart disease related to different cholesterol level is different. We can get a preliminary conclusion that whether the cholesterol level is high do affect the rate of getting angina or coronary heart disease.

dt5 <- rfchol_data%>%
  dplyr::group_by(X_rfchol, cvdcrhd4)%>%
  dplyr::tally()%>%
  dplyr::mutate(percent=n/sum(n))

ggplot(data = dt5,aes(x= X_rfchol, y = n,fill = cvdcrhd4))+
  guides(fill = guide_legend(title = "Health Status"))+
  geom_bar(stat="identity",position = "fill")+
  xlab("Cholesterol Level")+
  ylab("Percent Value")+
  geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
                     position=position_fill(vjust=0.5), colour="black", size =6)+
  ggtitle("Percent Stacked Bar Chart for Cholesterol Level and Coronary Heart Disease")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.2,
                                colour='black', size=13) )

(5) Ever Told Blood Pressure High

bphigh4 variable, which corresponds to the question “ever told blood pressure high” The bphigh4 is a categorical variable which has five categories, namely “Yes”, “Yes, but female told only during pregnancy”, “No”, “Told borderline or pre-hypertensive”, and missing data. There are a total number of 491775 responses. 198921 people answered “Yes”, 3680 people answered “Yes, but female told only during pregnancy”, 282687 people answered “No”, 5067 people answered “Told borderline or pre-hypertensive”, and 1420 are missing data. Those who answered “Yes, but female told only during pregnancy” and “Told borderline or pre-hypertensive” can be considered as special cases, so we can ignore them. In addition, we need to be aware of the missing data. Based on the analysis above, we only want the samples that answered “Yes” or “No” to this question. Although it might not always be the case, for simplicity, we will consider those who answered “Yes” to have high blood pressure and those who answered “No” to have normal blood pressure.

levels(brfss$bphigh4)<-c(levels(brfss$bphigh4), "Missing Data")
brfss$bphigh4[is.na(brfss$bphigh4)] <- "Missing Data"
bp_table = dplyr::count(brfss, vars= bphigh4) # Overview of "bphigh4" variable
colnames(bp_table) = c("Response", "Count")
formattable(bp_table, align =c("l","c"), list(`Response` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Response	Count
Yes	198921
Yes, but female told only during pregnancy	3680
No	282687
Told borderline or pre-hypertensive	5067
Missing Data	1420

The bar chart below gives a good overview of the blood pressure data:

ggplot(brfss, aes(x = bphigh4, fill = bphigh4), stat = "count" ) + 
  geom_bar() +
  geom_text(stat = "count", aes(label = ..count..), size=6,vjust = -1, colour = "black")+
  ylim(c(0,350000))+
  scale_x_discrete(name = " ", limits = c("No","Yes", "Told borderline or pre-hypertensive", "Yes, but female told only during pregnancy", "Missing Data")) +
  scale_fill_brewer(palette="Set1") +
  labs(y = "Count", x = "Responses") +
  ggtitle("Ever Told Blood Pressure High")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=15) )+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  theme(legend.position="none")

Missing Data Processing

There are a total number of 4423 missing coronary heart disease data. To the “ever told blood pressure high” question, 3025 out of 4423 people answered “Yes”, 1186 out of 4423 people answered “No”, 45 out of 4423 people answered “Told borderline or pre-hypertensive”, 17 out of 4423 people answered “Yes, but female told only during pregnancy”, and 150 out of 4423 are missing data. These missing data ratios are all very low except for “missing data” category.

filt_2 <- filter( brfss2013[which(is.na(brfss2013$cvdcrhd4)), ] )
coronary_deleted <- filt_2[,c("bphigh4","cvdcrhd4")]
coronary_deleted <- droplevels(coronary_deleted)
missing_coronary_table = dplyr::count(coronary_deleted, var=bphigh4)
colnames(missing_coronary_table) = c("Response", "Count")
formattable(missing_coronary_table, align =c("l","c"), list(`Response` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Response	Count
Yes	3025
Yes, but female told only during pregnancy	17
No	1186
Told borderline or pre-hypertensive	45
NA	150

per_of_missing_coronary_res <- c(3025/198921, 17/3680, 1186/282687, 45/5067, 150/1420)
missing_coronary_response <- c("Yes", 
                               "Yes, but female told only during pregnancy",
                               "No", "Told borderline or pre-hypertensive", 
                               "Missing Data")
missing_coronary_data <- data.frame(per_of_missing_coronary_res, 
                                    missing_coronary_response) # Data frame for

plot_1 <- ggplot(data = missing_coronary_data, aes(x = reorder(missing_coronary_response, -per_of_missing_coronary_res), 
              y = per_of_missing_coronary_res, fill = missing_coronary_response )) + 
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette="Set3") +
  labs(y = "Percent", x = "Responses") +
  theme(axis.text.x = element_text(size = 8, hjust = 0.5)) + # Plot the bar graph
  ggtitle("Missing Ratio of Blood Pressure Data")+
    theme(axis.text.x = element_text(angle = 20, hjust = 1, size = 8)) +
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=10) ) +
  theme(legend.position="none")+
  geom_text(aes(label=paste0(sprintf("%1.1f", per_of_missing_coronary_res*100),"%")),vjust = -0.2)

For the blood pressure data, there are a total number of 1420 missing data. To the “ever diagnosed with angina or coronary heart disease” question, 76 out of 1420 people answered “Yes”, 1194 out of 1420 people answered “No”, and 150 out of 1420 are missing data. These missing data ratios are all very low.

filt_4 <- filter( brfss[which(brfss$bphigh4=="Missing Data"), ] )
bloodp_deleted <- filt_4[,c("bphigh4","cvdcrhd4")]
bloodp_deleted <- droplevels(bloodp_deleted)

missing_bp_table = dplyr::count(bloodp_deleted, var=cvdcrhd4)
colnames(missing_bp_table) = c("Response", "Count")
formattable(missing_bp_table, align =c("l","c"), list(`Response` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold")) 
))

Response	Count
Coronary	76
Healthy	1194
Missing Data	150

per_of_missing_bp_res <- c(76/29064, 1194/458288, 150/4423)
missing_bp_response <- c("Coronary", "Healthy", "Missing Data")
missing_bp_data <- data.frame(per_of_missing_bp_res, 
                                    missing_bp_response) # Data frame for 

plot_2 <- ggplot(data = missing_bp_data, aes(x = reorder(missing_bp_response, 
              -per_of_missing_bp_res), 
              y = per_of_missing_bp_res, fill = missing_bp_response )) + 
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette="Set3") +
  labs(y = "Percent", x = "Responses") +
  theme(axis.text.x = element_text(size = 8, hjust = 0.5)) + # Plot the bar graph
  ggtitle("Missing Ratio of Coronary Data")+
    theme(axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8)) +
  theme(plot.title=element_text(family='', face='bold', hjust = 0.5,
                                colour='black', size=10) ) +
  theme(legend.position="none")+
  geom_text(aes(label=paste0(sprintf("%1.1f", per_of_missing_bp_res*100),"%")),vjust = -0.2)
grid.arrange(plot_1, plot_2, nrow = 1)

Based on our assumptions, the only responses that we are interested in are those who answered “Yes” or “No” to both questions and we will leave out the other responses. After selection, we deleted 14378 samples, which accounts for only 2.92% of the total number of people surveyed. Also, the missing ratios are relatively small. Therefore, we can confidently delete these responses and continue our analysis.

#str(brfss2013$bphigh4)
bp <- which((brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") & ( 
      brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt = filter(brfss2013[bp, ]) # Filter out unwanted samples
blood_pressure <- filt[, c("bphigh4", "cvdcrhd4")] # Filter out unwanted variables
#glimpse(blood_pressure)
blood_pressure$bphigh4 <- droplevels(blood_pressure$bphigh4)
blood_pressure$bphigh4 <- as.factor( recode(blood_pressure$bphigh4, "Yes" = "High Blood Pressure", "No" = "Normal Blood Pressure") )
blood_pressure$cvdcrhd4 <- as.factor( recode(blood_pressure$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy") )
491775 - length(blood_pressure$bphigh4) # Calculate the number of responses that have been deleted

## [1] 14378

(491775 - length(blood_pressure$bphigh4)) / 491775 # Calculate the proportion of responses that have been deleted

## [1] 0.02923695

#count(blood_pressure, vars= cvdcrhd4)

Association Overview

From the percent stacked bar chart below, it seems that there is an association between blood pressure and angina or coronary heart disease. In addition, the proportion of people who, with high blood pressure, have angina or coronary heart disease, seems to be higher than the proportion of people who, with normal blood pressure, have angina or coronary heart disease. However, we need to confirm our hypotheses by using chi-square test and confidence interval.

dt2 <- blood_pressure%>%
  dplyr::group_by(bphigh4, cvdcrhd4)%>%
  dplyr::tally()%>%
  dplyr::mutate(percent=n/sum(n))

ggplot(data = dt2,aes(x= bphigh4, y = n,fill = cvdcrhd4))+
  guides(fill = guide_legend(title = "Health Status"))+
  geom_bar(stat="identity",position = "fill")+
  xlab("Blood Pressure")+
  ylab("Percent Value")+
  geom_text(aes(label=paste0(sprintf("%1.1f", percent*100),"%")),
                     position=position_fill(vjust=0.5), colour="black", size =6)+
  ggtitle("Percent Stacked Bar Chart for Blood Pressure and Coronary Heart Disease")+
  theme(plot.title=element_text(family='', face='bold', hjust = 0.2,
                                colour='black', size=13) )

(6) Green Food

People’s eating time on green food such as vegetable, fruit, and fruit juice, are respectively recorded in the quantitative variables of cvdinfr4 fruitju1 fruit1 fvbeans fvgreen fvorang and vegetab1. We define a new variable eating, which is the sum of the 6 variables.

fruitju1: How Many Times Did You Drink 100 Percent Pure Fruit Juices?
fruit1: How Many Times Did You Eat Fruit?
fvbeans: How Many Times Did You Eat Beans Or Lentils?
fvgreen: How Many Times Did You Eat Dark Green Vegetables?
fvorang: How Many Times Did You Eat Orange-Colored Vegetables?
vegetab1: How Many Times Did You Eat Other Vegetables?

#非NA的数据
test = subset(brfss2013,is.na(brfss2013$cvdcrhd4)==FALSE &is.na(brfss2013$fruitju1)==FALSE & is.na(brfss2013$fruit1)==FALSE&is.na(brfss2013$fvbeans)==FALSE & is.na(brfss2013$fvgreen)==FALSE&is.na(brfss2013$fvorang)==FALSE & is.na(brfss2013$vegetab1)==FALSE)
#有个NA的数据
test_rev = subset(brfss2013,!(is.na(brfss2013$cvdcrhd4)==FALSE &is.na(brfss2013$fruitju1)==FALSE & is.na(brfss2013$fruit1)==FALSE&is.na(brfss2013$fvbeans)==FALSE & is.na(brfss2013$fvgreen)==FALSE&is.na(brfss2013$fvorang)==FALSE & is.na(brfss2013$vegetab1)==FALSE))


#饮食数据变成所有加起来
eating = test$fruitju1 + test$fruit1 + test$fvbeans + test$fvgreen + test$fvorang + test$vegetab1

eating_rev = test_rev$fruitju1 + test_rev$fruit1 + test_rev$fvbeans + test_rev$fvgreen + test_rev$fvorang + test_rev$vegetab1


health = test$cvdcrhd4

health_rev = test_rev$cvdcrhd4

Missing Data Processing

The missing ratio of the the coronary~eating is 11.4%, which is at an acceptable level.

Then we analyze the samples that we know whether they have coronary heart disease or not, but not know their ‘eating’.

#对有数据没数据做下卡方检验
test1 = data.frame(eating,health)
#test1是吃喝疾病都齐全的
yes = subset(test1,test1$health == "Yes")
no = subset(test1,test1$health == "No")

#yes是病， no是没病
test1$health<-factor(test1$health,labels=c("Coronary","healthy"))

test_rev1 = data.frame(eating_rev,health_rev)
#test_rev1是吃喝疾病有点不全的

test_rev1_no_eating = subset(test_rev1,is.na(test_rev1$eating_rev)==TRUE)
#这个是那些没有吃的信息的


no_eating_ill=subset(test_rev1_no_eating,test_rev1_no_eating$health_rev=="Yes")
no_eating_noill=subset(test_rev1_no_eating,test_rev1_no_eating$health_rev=="No")

ill_ratio_isdata =length(yes$eating)/(length(no$eating)+length(yes$eating))

ill_ratio_noeatingdata=length(no_eating_ill$eating_rev)/(length(no_eating_ill$eating_rev)+length(no_eating_noill$eating_rev))

We are to check if the missing data of eating do have a difference with the sample not missing eating data.

\(H_0\): There is no relationship between coronary and the missing data of eating

\(H_a\): There is no relationship between coronary and the missing data of eating

We set \(\alpha\) = 0.1

Missing = c(3472,53204,3472+53204)
No_missing = c(25592,405084,25592+405084)
Sum = c(3472+25592,53204+405084,3472+25592+53204+405084)
chi_tab = data.frame(Missing,No_missing,Sum,row.names = c("Coronary","Healthy","Sum"))
chi_tab_name = rownames(chi_tab)
chi_tab = cbind(chi_tab,chi_tab_name)
tab <- gt(data=chi_tab,rowname_col="chi_tab_name")
tab<-
  tab%>%
  tab_header(title="Contingency Table of Health And Eating")

tab

Contingency Table of Health And Eating
	Missing	No_missing	Sum
Coronary	3472	25592	29064
Healthy	53204	405084	458288
Sum	56676	430676	487352

missing = data.frame(Percent=c(3472/(3472+53204),53204/(3472+53204)),Category=c("Coronary","Healthy"))
missing_plot <-ggplot(data = missing,mapping = aes( x=Category, y =Percent,fill=Category))+
  scale_fill_brewer(palette = "Pastel1")+ geom_col()+
  theme(plot.margin=unit(rep(1,4),'lines'))+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  labs(title = "Lost Data")+theme(plot.title = element_text(hjust = 0.5))+
  geom_text(aes(label=paste0(sprintf("%1.1f", Percent*100),"%")),vjust = -0.2)+xlab("Health State")


no_missing = data.frame(Percent=c(25592/(25592+405084),405084/(25592+405084)),Category=c("Coronary","Healthy"))
no_missing_plot <-ggplot(data = no_missing,mapping = aes( x=Category, y =Percent, fill=Category))+
  scale_fill_brewer(palette = "Pastel1")+ geom_col()+
  theme(plot.margin=unit(rep(1,4),'lines'))+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+
  labs(title = "No Lost Data")+theme(plot.title = element_text(hjust = 0.5))+
  geom_text(aes(label=paste0(sprintf("%1.1f", Percent*100),"%")),vjust = -0.2)+xlab("Health State")


grid.arrange(missing_plot,no_missing_plot,ncol=2,nrow=1)

missing = c(3472,53204)
no_missing = c(25592,405084)
chi_exercise<-data.frame(missing,no_missing,row.names = c("cvd","no_Cvd"))
chisq.test(chi_exercise)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  chi_exercise
## X-squared = 2.9832, df = 1, p-value = 0.08413

According the P-value, which is 0.08<0.1, we can conclude that there is no evidence relation in coronary illness between the missing data and the not missing data.

Association Overview

Then we are going to view the data that are not missing at all.

We observe the boxplot of the sample that have coronary and healthy, and finf that they look almost the same.

#观察有数据部分，看看有病没病的分布

fill<-"#4271AE"
p = ggplot(test1,aes(x=health,y=eating))+geom_boxplot(fill = fill)+scale_y_continuous(name="The sum of eating times")+scale_x_discrete(name="Health Status")+ggtitle("Boxplot of the eating by health status")

p

Normal Distribution

Then we observe the eating distribution of coronary patient, healthy people, unknown-state people.

We can see that their distribution is quite normal and look similar.

tt=subset(test_rev1,is.na(test_rev1$health_rev)==TRUE&is.na(test_rev1$eating_rev)==FALSE)
ttf = data.frame(tt)

yesf = data.frame(yes)
nof=data.frame(no)

p1<-ggplot(yesf,aes(x=eating,y=..density..))+
geom_histogram(fill="green", bins = 30)+geom_density(size =1)+xlim(0,2500)+ylim(0,0.002)+ggtitle("The people have Coronary")+xlab("Eating times")

p2<-ggplot(nof,aes(x=eating,y=..density..))+
geom_histogram(fill="green", bins = 30)+geom_density(size =1)+xlim(0,2500)+ylim(0,0.002)+ggtitle("The people do not have Coronary")+xlab("Eating times")

p3<-ggplot(ttf,aes(x=eating_rev,y=..density..))+
geom_histogram(fill="green", bins = 30)+geom_density(size =1)+xlim(0,2500)+ylim(0,0.002)+ggtitle("The people do not konw if they have Coronary")+xlab("Eating times")
grid.arrange(p1,p2,p3, nrow = 3)

Correlation Overview

Before we carry out any tests, we can plot a correlation matrix for all the variables that we are going to analyze to get a better sense of the relationships between these variables. Since this is only an overview of the correlation between these variables, we simply filter out all the missing data, and we are only concerned about the strength of the correlations but not about whether the correlation is positive or negative.

From the correlation matrix below, we can see that the correlation between coronary heart disease and the other variables are small (less than 0.2), indicating that their linear relationships are very weak. Therefore, we can build logistic regression model and other classification models to predict the probability of having coronary heart disease. What’s more, the correlation between the predictors (blood pressure, exercise, cholesterol level, obesity level, and green food) is small as well (less than 0.3), indicating weak linear relationships and low collinearity.

brfss_cor <- brfss2013
brfss_cor$eating <- brfss_cor$fruitju1 + brfss_cor$fruit1 + brfss_cor$fvbeans + brfss_cor$fvgreen + brfss_cor$fvorang + brfss_cor$vegetab1
cor_select_0 <- which((!is.na(brfss_cor$bphigh4)) & (!is.na(brfss_cor$cvdcrhd4)) &
                    (!is.na(brfss_cor$exerany2)) & (!is.na(brfss_cor$X_rfchol)) & 
                    (!is.na(brfss_cor$X_bmi5cat)) & (!is.na(brfss_cor$eating)) )# Select only the samples who responded to all selected questions
#cor_select_1 <- cor_select_0

cor_data = filter(brfss_cor[cor_select_0, ]) # Filter out unwanted samples
cor_data <- cor_data[, c("cvdcrhd4", "bphigh4", "X_rfchol",
                     "X_bmi5cat", "exerany2", "eating")] # Filter out unwanted variables

cor_data[] <- lapply(cor_data,as.integer)
corrplot(cor(cor_data), method = "pie", order="hclust")

corrplot(cor(cor_data), method = "number", order="hclust")

Significance Test

To assess the association between health indexes or living habits and angina or coronary heart diseases, we perform significance tests between selected variables corresponding to health indexes and cvdcrhd4.

We could conduct the tests if the following conditions are met:

Random: The samples are randomly selected from the population.
Large Sample Size: Sample sizes and all expected counts in the table are much greater than 5.
Independent: Knowing the values of both variables for one person in the study gives us no meaningful information about the values of the variables for another person. So individual observations are independent.

All variables we selected meet these conditions.

(1) Obesity Level

To assess the association between the level of obesity and coronary heart or angina diseases statistically, we conducted a Chi-squared test of these two variables.

After statistical conversion, the data table is as follows:

set.seed(109)
# bmi_samp=bmi_subset[sample(nrow(bmi_subset),10000,replace=FALSE),]
bmi_samp <- bmi_subset <- subset(brfss2013, is.na(brfss2013$X_bmi5cat) == FALSE & is.na(brfss2013$cvdcrhd4) == FALSE & brfss2013$X_bmi5cat != "Underweight", select = c(X_bmi5cat, cvdcrhd4))
X1 <- sum(bmi_samp$cvdcrhd4 == "Yes" & bmi_samp$X_bmi5cat == "Normal weight")
X2 <- sum(bmi_samp$cvdcrhd4 == "Yes" & bmi_samp$X_bmi5cat == "Overweight")
X3 <- sum(bmi_samp$cvdcrhd4 == "Yes" & bmi_samp$X_bmi5cat == "Obese")
X4 <- sum(bmi_samp$cvdcrhd4 == "No" & bmi_samp$X_bmi5cat == "Normal weight")
X5 <- sum(bmi_samp$cvdcrhd4 == "No" & bmi_samp$X_bmi5cat == "Overweight")
X6 <- sum(bmi_samp$cvdcrhd4 == "No" & bmi_samp$X_bmi5cat == "Obese")

Normalweight <- c(X1, X4, X1 + X4)
Overweight <- c(X2, X5, X2 + X5)
Obese <- c(X3, X6, X3 + X6)
Sum <- c(X1 + X2 + X3, X4 + X5 + X6, X1 + X2 + X3 + X4 + X5 + X6)
chi_bmi_tab <- data.frame(Normalweight, Overweight, Obese, Sum, row.names = c("Yes", "No", "Sum"))
cvd_sta <- rownames(chi_bmi_tab)
rownames(chi_bmi_tab) <- NULL
chi_bmi_tab <- cbind(cvd_sta, chi_bmi_tab)
bmi_tbl <- gt(data = chi_bmi_tab, rowname_col = "cvd_sta")

bmi_tbl <-
  bmi_tbl %>%
  tab_header(
    title = md("**Contingency Table of the Level of Obesity and the Health Condition**")
  )

bmi_tbl

Contingency Table of the Level of Obesity and the Health Condition
	Normalweight	Overweight	Obese	Sum
Yes	6778	10264	10670	27712
No	147020	155390	122681	425091
Sum	153798	165654	133351	452803

We want to perform a test with the following hypotheses:

\(H_0\): There is no relationship between angina or coronary heart disease and the level of obesity.

\(H_a\): There is a relationship between angina or coronary heart disease and the level of obesity.

Then we looked into the conditional distribution (in proportions) of the health condition for each treatment. The following three plots only show slight differences in the percentage of people who ever had angina or coronary heart diseases.

normal <- data.frame(Percent = c(X4 / (X1 + X4), X1 / (X1 + X4)), Category = c("Health", "Coronary"))
normal_plot <- ggplot(data = normal, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  labs(title = "Normal Weight") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")
Over <- data.frame(Percent = c(X5 / (X2 + X5), X2 / (X2 + X5)), Category = c("Health", "Coronary"))
over_plot <- ggplot(data = Over, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  labs(title = "Overweight") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")

# over_plot

Obese <- data.frame(Percent = c(X6 / (X3 + X6), X3 / (X3 + X6)), Category = c("Health", "Coronary"))
obese_plot <- ggplot(data = Obese, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
  labs(title = "Obese") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")

# over_plot

grid.arrange(normal_plot, over_plot, obese_plot, ncol = 3, nrow = 1)

Normal <- c(X1, X4)
Over <- c(X2, X5)
Obese <- c(X3, X6)
chi_bmi <- data.frame(Normal, Over, Obese, row.names = c("cvd", "no_Cvd"))
output1 <- chisq.test(chi_bmi)

The calculated test statistic \[\chi^2=1608.6451307\] Using \(df=2\), the corresponding P-value is less than2.2e-16, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the obesity level and angina or coronary heart diseases are associated.

(2) Exercise

To assess the association between exercise and coronary heart or angina diseases statistically, we conducted a Chi-squared test of these two variables.

After statistical conversion, the data table is as follows:

ex <- which((brfss2013$exerany2 == "Yes" | brfss2013$exerany2 == "No") & (
  brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt <- filter(brfss2013[which((brfss2013$exerany2 == "Yes" | brfss2013$exerany2 == "No") & (
  brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")), ]) # Filter out unwanted samples
exercise <- filt[, c("exerany2", "cvdcrhd4")] # Filter out unwanted variables
# glimpse(blood_pressure)
exercise$exerany2 <- droplevels(exercise$exerany2)
exercise$exerany2 <- as.factor(recode(exercise$exerany2, "Yes" = "exercise", "No" = "No exercise"))
exercise$cvdcrhd4 <- as.factor(recode(exercise$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy"))
X2 <- sum(exercise$cvdcrhd4 == "Coronary" & exercise$exerany2 == "exercise")
X3 <- sum(exercise$cvdcrhd4 == "Coronary" & exercise$exerany2 == "No exercise")
X4 <- sum(exercise$cvdcrhd4 == "Healthy" & exercise$exerany2 == "exercise")
X5 <- sum(exercise$cvdcrhd4 == "Healthy" & exercise$exerany2 == "No exercise")
Exercise <- c(X2, X4, X2 + X4)
No_exercise <- c(X3, X5, X3 + X5)
Sum <- c(X2 + X3, X4 + X5, X2 + X3 + X4 + X5)
chi_exercise_tab <- data.frame(Exercise, No_exercise, Sum, row.names = c("Coronary", "Healthy", "Sum"))
cvd_sta <- rownames(chi_exercise_tab)
rownames(chi_exercise_tab) <- NULL
chi_exercise_tab <- cbind(cvd_sta, chi_exercise_tab)
exercise_tbl <- gt(data = chi_exercise_tab, rowname_col = "cvd_sta")

exercise_tbl <-
  exercise_tbl %>%
  tab_header(
    title = md("**Contingency Table of Exercise and the Health Condition**")
  )

exercise_tbl

Contingency Table of Exercise and the Health Condition
	Exercise	No_exercise	Sum
Coronary	16688	10733	27421
Healthy	313500	112811	426311
Sum	330188	123544	453732

We want to perform a test with the following hypotheses:

\(H_0\): There is no relationship between angina or coronary heart disease and exercise.

\(H_a\): There is a relationship between angina or coronary heart disease and exercise.

Then we looked into the conditional distribution (in proportions) of the health condition for each treatment. The following two plots only show slight differences in the percentage of people who ever had angina or coronary heart diseases.

Exercise <- data.frame(Percent = c(X2 / (X2 + X4), X4 / (X2 + X4)), Category = c("Coronary", "Healthy"))
Exercise_plot <- ggplot(data = Exercise, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  labs(title = "Exercise") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")


No_exercise <- data.frame(Percent = c(X3 / (X3 + X5), X5 / (X3 + X5)), Category = c("Coronary", "Healthy"))
No_exercise_plot <- ggplot(data = No_exercise, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  labs(title = "No_Exercise") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")

# over_plot

grid.arrange(Exercise_plot, No_exercise_plot, ncol = 2, nrow = 1)

Exercise <- c(X2, X4)
No_exercise <- c(X3, X5)

chi_exercise <- data.frame(Exercise, No_exercise, row.names = c("cvd", "no_Cvd"))
output2 <- chisq.test(chi_exercise)

The calculated test statistic \[\chi^2=2089.7317508\]

Using \(df=(2-1)(2-1)=2\), the corresponding P-value is less than 2.2e-16, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the obesity level and angina or coronary heart diseases are associated.

(3) Cholesterol Level

After statistical conversion, the data table is as follows:

rfcholcvdcrhd <- subset(brfss2013, is.na(brfss2013$X_rfchol) == FALSE & is.na(brfss2013$cvdcrhd4) == FALSE)
analysis <- table(rfcholcvdcrhd$X_rfchol, rfcholcvdcrhd$cvdcrhd4)
analysis_ <- addmargins(table(rfcholcvdcrhd$X_rfchol, rfcholcvdcrhd$cvdcrhd4))

Sum <- analysis_[, 3]
High_col <- analysis_[1, ]
norm_col <- analysis_[2, ]
Sum <- analysis_[3, ]
ori_chol_tab <- data.frame(High_col, norm_col, Sum, row.names = c("Coronary", "Healthy", "Sum"))
names(ori_chol_tab) <- c("Normal Cholesterol Level", "High Coolesterol Level", "Sum")
chol_sta <- rownames(ori_chol_tab)
rownames(ori_chol_tab) <- NULL
ori_chol_tab <- cbind(chol_sta, ori_chol_tab)
chol_tbl <- gt(data = ori_chol_tab, rowname_col = "chol_sta")

chol_tbl <-
  chol_tbl %>%
  tab_header(
    title = md("**Contingency Table of the Cholesterol Level and the Coronary Heart Disease**")
  )

chol_tbl

Contingency Table of the Cholesterol Level and the Coronary Heart Disease
	Normal Cholesterol Level	High Coolesterol Level	Sum
Coronary	7229	20838	28067
Healthy	228129	160131	388260
Sum	235358	180969	416327

We want to perform a test with the following hypotheses:

\(H_0\): There is no association between cholesterol level and angina or coronary heart disease.

\(H_a\): There is an association between cholesterol level and angina or coronary heart disease.

Then we looked into the conditional distribution (in proportions) of the health condition for each treatment. The following two plots show apparent differences in the percentage of people who ever had angina or coronary heart diseases.

normal <- data.frame(Percent = c(analysis_[1, 2] / analysis_[1, 3], analysis_[1, 1] / analysis_[1, 3]), Category = c("Healthy", "Coronary"))
normal_plot <- ggplot(data = normal, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  labs(title = "Normal Cholesterol Level") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")


High <- data.frame(Percent = c(analysis_[2, 2] / analysis_[2, 3], analysis_[2, 1] / analysis_[2, 3]), Category = c("Healthy", "Coronary"))
high_plot <- ggplot(data = High, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  labs(title = "High Cholesterol Level") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")

grid.arrange(normal_plot, high_plot, ncol = 2, nrow = 1)

Finally we perform the Chi-Square test.

chi_chol <- data.frame(norm_col, High_col)
chi_output <- chisq.test(chi_chol)

The calculated test statistic \[\chi^2=1.1600108\times 10^{4}\] Using \(df=2\), the corresponding P-value is 0, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the obesity level and angina or coronary heart diseases are associated.

(4) Blood Pressure Level

# str(brfss2013$bphigh4)
bp <- which((brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") & (
  brfss2013$cvdcrhd4 == "Yes" | brfss2013$cvdcrhd4 == "No")) # Select only the samples whose responded "Yes" or "No" to both questions
filt <- filter(brfss2013[bp, ]) # Filter out unwanted samples
blood_pressure <- filt[, c("bphigh4", "cvdcrhd4")] # Filter out unwanted variables
# glimpse(blood_pressure)
blood_pressure$bphigh4 <- droplevels(blood_pressure$bphigh4)
blood_pressure$bphigh4 <- as.factor(recode(blood_pressure$bphigh4, "Yes" = "High Blood Pressure", "No" = "Normal Blood Pressure"))
blood_pressure$cvdcrhd4 <- as.factor(recode(blood_pressure$cvdcrhd4, "Yes" = "Coronary", "No" = "Healthy"))
# 491775 - length(blood_pressure$bphigh4) # Calculate the number of responses that have been deleted
# (491775 - length(blood_pressure$bphigh4)) / 491775 # Calculate the proportion of responses that have been deleted

After statistical conversion, the data table is as follows:

bp_summary <- table(blood_pressure$cvdcrhd4, blood_pressure$bphigh4)
bp_summary_sum <- addmargins(table(blood_pressure$cvdcrhd4, blood_pressure$bphigh4))

a <- as.data.frame.matrix(bp_summary_sum)
new_col <- c("Coronary", "Healthy", "Sum")
a <- cbind(new_col, a)
colnames(a)[1] <- " "

a %>%
  gt(rowname_col = " ") %>%
  tab_header(
    title = md("**Contingency Table of Blood Pressure and Coronart Heart Disease**")
  )

Contingency Table of Blood Pressure and Coronart Heart Disease
	High Blood Pressure	Normal Blood Pressure	Sum
Coronary	22500	6234	28734
Healthy	173396	275267	448663
Sum	195896	281501	477397

We want to perform a test with the following hypotheses:

\(H_0\): There is no relationship between angina or coronary heart disease and the level of obesity.

\(H_a\): There is a relationship between angina or coronary heart disease and the level of obesity.

normal <- data.frame(Percent = c(bp_summary_sum[2, 2] / bp_summary_sum[3, 2], bp_summary_sum[1, 2] / bp_summary_sum[3, 2]), Category = c("Healthy", "Coronary"))
normal_plot <- ggplot(data = normal, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  labs(title = "Normal Blood Pressure") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")


High <- data.frame(Percent = c(bp_summary_sum[2, 1] / bp_summary_sum[3, 1], bp_summary_sum[1, 1] / bp_summary_sum[3, 1]), Category = c("Healthy", "Coronary"))
high_plot <- ggplot(data = High, mapping = aes(x = Category, y = Percent, fill = Category)) +
  scale_fill_brewer(palette = "Pastel1") +
  geom_col() +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8)) +
  labs(title = "High Blood Pressure") +
  theme(plot.title = element_text(hjust = 0.5)) +
  guides(fill = guide_legend(title = "Health Status")) +
  xlab("Health Status")

grid.arrange(normal_plot, high_plot, ncol = 2, nrow = 1)

output3 <- summary(bp_summary)

The calculated test statistic \[\chi^2=1.7552304\times 10^{4}\] Using \(df=2\), the corresponding P-value is extreme close to 0, which is much smaller than the given level of significance \(\alpha=0.05\). We reject \(H_0\) and conclude that the blood pressure level and angina or coronary heart diseases are associated.

(5) Green Food

# 非NA的数据
test <- subset(brfss2013, is.na(brfss2013$cvdcrhd4) == FALSE & is.na(brfss2013$fruitju1) == FALSE & is.na(brfss2013$fruit1) == FALSE & is.na(brfss2013$fvbeans) == FALSE & is.na(brfss2013$fvgreen) == FALSE & is.na(brfss2013$fvorang) == FALSE & is.na(brfss2013$vegetab1) == FALSE)
# 有个NA的数据
test_rev <- subset(brfss2013, !(is.na(brfss2013$cvdcrhd4) == FALSE & is.na(brfss2013$fruitju1) == FALSE & is.na(brfss2013$fruit1) == FALSE & is.na(brfss2013$fvbeans) == FALSE & is.na(brfss2013$fvgreen) == FALSE & is.na(brfss2013$fvorang) == FALSE & is.na(brfss2013$vegetab1) == FALSE))


# 饮食数据变成所有加起来
eating <- test$fruitju1 + test$fruit1 + test$fvbeans + test$fvgreen + test$fvorang + test$vegetab1

eating_rev <- test_rev$fruitju1 + test_rev$fruit1 + test_rev$fvbeans + test_rev$fvgreen + test_rev$fvorang + test_rev$vegetab1


health <- test$cvdcrhd4

# 对有数据没数据做下卡方检验
test1 <- data.frame(eating, health)
# test1是吃喝疾病都齐全的
yes <- subset(test1, test1$health == "Yes")
no <- subset(test1, test1$health == "No")

# yes是病， no是没病
test1$health <- factor(test1$health, labels = c("Coronary", "healthy"))

Since the variable eating is a numerical variable and its standard deviation is unknown, it is better to conduct a t-test to figure out whether there is an association with angina and coronary diseases.

We want to perform a t-test with the following hypotheses:

\(H_0\): There is an association between angina or coronary heart disease and the amount of fruits and vegetables eaten

\(H_a\): There is no association between angina or coronary heart disease and the amount of fruits and vegetables eaten

out_t <- t.test(test1$eating ~ test1$health)

Using the t-test we can get the P-value 0.6258454, we cannot reject \(H_0\), which indicates that there is no strong evidence that angina or coronary heart diseases are associated with the amount of fruits and vegetables eaten.

Significance Test Evaluation and Conclusion

The resultant P-values of the significance tests for the whole dataset is listed in the table below.

out1 <- c(output1$p.value)
out2 <- c(output2$p.value)
out3 <- c(chi_output$p.value)
out4 <- c(output3$p.value)
out44 <- c(out_t$p.value)
conclusion_tab <- data.frame(out1, out2, out3, out4, out44, row.names = c("P-Value"))
names(conclusion_tab) <- c("Obesity Level", "Exercise", "Cholesterol Level", "Blood Pressure", "Green Food")
conclu_rowname <- c("P-Value")
conclusion_tab <- cbind(conclu_rowname, conclusion_tab)
conclusion_tab <- gt(data = conclusion_tab, rowname_col = "conclu_rowname")
conclusion_tbl <-
  conclusion_tab %>%
  tab_header(
    title = md("**P-values of Whole Dataset**")
  )
conclusion_tbl

P-values of Whole Dataset
	Obesity Level	Exercise	Cholesterol Level	Blood Pressure	Green Food
P-Value	0	0	0	0	0.6258454

From the table above, it is clear that variables obesity level(bmi_cat5), exercise(exerany2), cholesterol level(X_rfchol) and blood pressure level (bphigh4) all have extremely small p-value, which indicates that they are strongly associated with angina or coronary heart diseases. Meanwhile, amount of fruits and vegetables eaten (eating) has a p-value larger than the significance level of 5%, which indicates that we are not confident that is is associated with angina or coronary heart diseases.

Through the implementation of Chi-squared test and significance test, we can make some conclusions whether the variables are associated with angina or coronary heart diseases. However, Chi-squared test and significance test can only decide whether there is an association between two variables, and it cannot excavate the potential association between them. As a result, we can hardly use this model to predict the probability of getting angina or coronary heart diseases based on other variables. In this project, we only take it as a preliminary model to remove some insignificant variables.

Modeling

Variable eating is removed from our further analysis since it is tested to be insignificant variable in the significance tests. We perform our modeling process with explanatory variables X_bmi5cat, exerany2, bphigh4 and X_rfchol and response variable cvdcrhd4.

Logistic Regression

Logistic regression is used in statistical software to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities using a logistic regression equation. This type of statistical analysis (also known as logit model) is often used for predictive analytics and modeling, which can return the likelihood of an event happening or a choice being made. A logistic approach fits best when the task that the machine is learning is based on two values, or a binary classification. The logistic approach is more powerful compared to the linear regression when the dependent variable is limited to a range of values or categorical — A or B…or A, B, C or D. In our project, the dependent variable cvdcrhd4 is a categorical variable with two possible outcomes. Therefore, we could apply logistic regression to build up the prediction model, which also contains the information of the effect of the independent variables, so as to answer the research problem.

Model Assumptions

The logistic regression method assumes that:

The outcome is a binary or dichotomous variable like yes vs no, positive vs negative, 1 vs 0, which has been met because cvdcrhd4 has two outcomes, “Yes” and “No” after EDA process filtering out the missing data.
There is a linear relationship between the logit of the outcome and each predictor variables. Recall that the logit function is \[logit(p) = log(\frac{p}{1-p})\] where p is the probabilities of the outcome.
There is no influential values (extreme values or outliers) in the continuous predictors.
There is no high intercorrelations (i.e. multicollinearity) among the predictors. This assumption has been checked in the EDA process.

For the assumption of (b) and (c), we assumes that they hold true and conduct diagnostics to check the validity after building up the model.

Data Processing

The sigmoid function is used in the logistic regression model. Seeing from the following plot, we could find that it is a symmetric function about (0,0.5). Therefore, the model performs best when the threshold is set to be 0.5. However, due to extreme imbalance in the cvdcrhd4 variable, if we simply select 0.5 as the threshold without resample, the model means nothing although it appears to have a high accuracy by predicting all of the samples to be the outcomes with a larger number in the data set. As a result, we need to conduct an stratified sampling process of the data to deal with the issue, which respectively draws nearly equal samples from the healthy and coronary groups. The sampled data has a size of 50,296. And we take a random sample among these samples with a size of 1800 to avoid overfitting.

sigmoid = function(x) {
   1 / (1 + exp(-x))
}
x <- seq(-5, 5, 0.01)
data=data.frame(x=x,y=sigmoid(x))
p <- ggplot(data = data, mapping = aes(
  x = x, y = y))
p+geom_line(size = 1.2,color="blue")+labs(title = "Sigmoid Function")+theme(plot.title = element_text(hjust = 0.5))

pre_subset1=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" & is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="Yes"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(X_bmi5cat,cvdcrhd4,exerany2,bphigh4,X_rfchol))

pre_subset2=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" &is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="No"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(X_bmi5cat,cvdcrhd4,exerany2,bphigh4,X_rfchol))


set.seed(1206)
ratio=length(pre_subset1$cvdcrhd4)/length(pre_subset2$cvdcrhd4)

ind=sample(2,length(pre_subset2$cvdcrhd4),length(pre_subset1$cvdcrhd4),prob=c(ratio,1-ratio))
pre_subset2=pre_subset2[ind==1,]

pre_subset=rbind(pre_subset1,pre_subset2)
#table(pre_subset$cvdcrhd4)

The resampled data is shown as below:

out1 <- c(table(pre_subset$cvdcrhd4)[1])
out2 <- c(table(pre_subset$cvdcrhd4)[2])
#out3 <- c(chi_output$p.value)
#out4 <- c(output3$p.value)
#out44 <- c(out_t$p.value)
conclusion_tab <- data.frame(out1, out2, row.names = c("Number"))
names(conclusion_tab) <- c("Yes", "No")
conclu_rowname <- c("Number")
conclusion_tab <- cbind(conclu_rowname, conclusion_tab)
conclusion_tab <- gt(data = conclusion_tab, rowname_col = "conclu_rowname")
conclusion_tbl <-
  conclusion_tab %>%
  tab_header(
    title = md("Resampled Data")
  )
conclusion_tbl

Resampled Data
	Yes	No
Number	25156	25140

We divided this group of samples into training set,validation set and test set with a ratio of 0.6,0.2 and 0.2, and then used the training set to train the logistic regression model and the test set to test and evaluate the model.

Model Selection

Here we apply the backward method to select variables based on the AIC value of each model.Seeing from the result, we can select the model with minimum AIC that contains variables exerany2,bphigh4 and X_rfchol.

set.seed(999)
pre_subset=pre_subset[sample(nrow(pre_subset),1800,replace = FALSE),]
N=length(pre_subset$cvdcrhd4)
ind=sample(3,N,replace=TRUE,prob=c(0.6,0.2,0.2))
aus_train=pre_subset[ind==1,]
aus_val=pre_subset[ind==2,]
aus_test=pre_subset[ind==3,]

library(MASS)
pre=glm(cvdcrhd4~X_bmi5cat+exerany2+bphigh4+X_rfchol,data=aus_train,family=binomial(link="logit"))
stepAIC(pre,direction = "backward")

## Start:  AIC=1293.11
## cvdcrhd4 ~ X_bmi5cat + exerany2 + bphigh4 + X_rfchol
## 
##             Df Deviance    AIC
## - X_bmi5cat  2   1283.7 1291.7
## <none>           1281.1 1293.1
## - exerany2   1   1300.2 1310.2
## - bphigh4    1   1349.0 1359.0
## - X_rfchol   1   1360.9 1370.9
## 
## Step:  AIC=1291.67
## cvdcrhd4 ~ exerany2 + bphigh4 + X_rfchol
## 
##            Df Deviance    AIC
## <none>          1283.7 1291.7
## - exerany2  1   1302.2 1308.2
## - bphigh4   1   1354.9 1360.9
## - X_rfchol  1   1364.3 1370.3

## 
## Call:  glm(formula = cvdcrhd4 ~ exerany2 + bphigh4 + X_rfchol, family = binomial(link = "logit"), 
##     data = aus_train)
## 
## Coefficients:
## (Intercept)   exerany2No    bphigh4No  X_rfcholYes  
##      0.4555      -0.6295       1.1655      -1.2309  
## 
## Degrees of Freedom: 1087 Total (i.e. Null);  1084 Residual
## Null Deviance:       1508 
## Residual Deviance: 1284  AIC: 1292

pre=glm(cvdcrhd4~exerany2+bphigh4+X_rfchol,data=aus_train,family=binomial(link="logit"))
summary(pre)

## 
## Call:
## glm(formula = cvdcrhd4 ~ exerany2 + bphigh4 + X_rfchol, family = binomial(link = "logit"), 
##     data = aus_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8981  -0.8704  -0.6625   0.9911   1.8024  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4555     0.1338   3.405 0.000661 ***
## exerany2No   -0.6295     0.1471  -4.279 1.88e-05 ***
## bphigh4No     1.1655     0.1401   8.318  < 2e-16 ***
## X_rfcholYes  -1.2309     0.1394  -8.827  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1508.0  on 1087  degrees of freedom
## Residual deviance: 1283.7  on 1084  degrees of freedom
## AIC: 1291.7
## 
## Number of Fisher Scoring iterations: 4

The summary of this model is shown below. We can find that all of the variables have a P-value smaller than 0.05, indicating all of the selected variables are significant predictors. The model can be also expressed in the following equation: \[log(\frac{p}{1-p})=0.4555-0.6295*exerany2No+1.1655*bphigh4No-1.2309*X\_rfcholYes\] where p indicates the probability of being healthy. By observing the coefficients of this model, we found that no exercise in the past 30 days, and high cholesterol were all negatively correlated with the absence of disease, while the absence of hypertension was positively correlated with the absence of coronary heart disease. This result is in accord with people’s common sense, that is, less exercise, higher cholesterol level and worse condition of hypertension are associated with a higher risk to suffer from coronary heart disease.

Regression diagnostics

Here we used the test set to check the validity assumptions. Since the selected variables are categorical, which cannot show strong linearity, we only look into the relationship between logit and the sum of the multiplication of values and corresponding coefficients. As shown in the following plot, linear relationship between the logit of the outcome and sum of the predictor variables are strong and there is no extreme outliers. Therefore, the assumptions are reasonable for this model. We should also have some intuitive feelings from the plot that the healthy people generally have higher predicted probabilites, which suggests that logit well distinguishes two groups of people.

predict_=predict.glm(pre,type="response",newdata=aus_val)
predict=ifelse(predict_>0.5,1,0)
true=ifelse(aus_val$cvdcrhd4=="No",1,0)
aus_val$true=true
aus_val$predict=predict
true_value=aus_val[,"true"]
predict_value=aus_val[,"predict"]
error=predict_value-true_value
accuracy1=(nrow(aus_val)-sum(abs(error)))/nrow(aus_val)


aus_val$regression=ifelse(aus_val$exerany2=="No",-0.6295,0)
aus_val$regression=ifelse(aus_val$bphigh4=="No",1.1655,0)+aus_val$regression
aus_val$regression=ifelse(aus_val$X_rfchol=="Yes",-1.2309,0)+aus_val$regression+0.4555 
aus_val$logit=log(predict_/(1-predict_))

aus_val$color=ifelse(aus_val$cvdcrhd4=="No","blue","red")
aus_val$cvdcrhd4=factor(aus_val$cvdcrhd4)
new_score=jitter(aus_val$regression)
p <- ggplot(data = aus_val, mapping = aes(
  x=regression, y=logit))

p+geom_smooth(method = lm)+geom_jitter(color=aus_val$color)+labs(title = "the Result of Logistic Regression")+theme(plot.title = element_text(hjust = 0.5))

To check if the threshold is valid and if our sample method is reasonable, we use plot the relationship of the prediction accuracy and threshold. Then we find that our sample method is reasonable and nearly cause no bias. The accuracy reaches the peak at the threshold of 0.5.

accu_val = data.frame()
thre_val=data.frame()
for (n in 1:10)
{
  ratio=n/10
  predict_=predict.glm(pre,type="response",newdata=aus_val)
  predict=ifelse(predict_>ratio,1,0)
  true=ifelse(aus_val$cvdcrhd4=="No",1,0)
  aus_val$true=true
  aus_val$predict=predict
  
  true_value=aus_val[,"true"]
  predict_value=aus_val[,"predict"]
  error=predict_value-true_value
  accuracy=(nrow(aus_val)-sum(abs(error)))/nrow(aus_val)
  accu_val=rbind(accu_val,accuracy)
  thre_val=rbind(thre_val,ratio)
  
}

#plot(c(as.numeric(unlist(thre_val))),c(as.numeric(unlist(accu_val))))
#as.numeric(unlist(thre_val)
plot_=data.frame(x=as.numeric(unlist(thre_val)),y=as.numeric(unlist(accu_val)))
p <- ggplot(data = plot_, mapping = aes(
  x = x, y = y))
p+geom_point(size=2)+labs(title = "Relationship between the Accuracy and the Threshold")+theme(plot.title = element_text(hjust = 0.5))+geom_line()

Model Evaluation and conclusions

predict_=predict.glm(pre,type="response",newdata=aus_test)
predict=ifelse(predict_>0.5,1,0)
true=ifelse(aus_test$cvdcrhd4=="No",1,0)
#aus_val$true=true
#aus_val$predict=predict
#true_value=aus_val[,"true"]
#predict_value=aus_val[,"predict"]
error=predict-true
accuracy1=(nrow(aus_test)-sum(abs(error)))/nrow(aus_test)

We can get the accuracy of the test set is around 0.7643505. And we also plot the ROC curve and calculate the AUC value,0.801 for this model. And we conclude from this model that less exercise, higher cholesterol level and worse condition of hypertension are associated with a higher risk to suffer from coronary heart disease.

library(pROC) 
modelroc <- roc(true,predict_) 
plot(modelroc, print.auc=TRUE, auc.polygon=TRUE,legacy.axes=TRUE, 
max.auc.polygon=TRUE, 
auc.polygon.col="skyblue", print.thres=TRUE,smooth=TRUE)

pre_subset1=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" & is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="Yes"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))

pre_subset2=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" &is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="No"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))


set.seed(1206)
ratio=length(pre_subset1$cvdcrhd4)/length(pre_subset2$cvdcrhd4)

ind=sample(2,length(pre_subset2$cvdcrhd4),length(pre_subset1$cvdcrhd4),prob=c(ratio,1-ratio))
pre_subset2=pre_subset2[ind==1,]

pre_subset=rbind(pre_subset1,pre_subset2)
#table(pre_subset$cvdcrhd4)

K-NearestNeighbor (KNN)

KNN (K-Nearest Neighbors) is a machine learning technique and algorithm that can be used for both regression and classification tasks. K-Nearest Neighbors examines the labels of a chosen number of data points surrounding a target data point, in order to make a prediction about the class that the data point falls into. Since KNN is a non-parametric algorithm. This means that no assumptions about the data set are made when the model is used. For our project, we have some intuitions that people who have similar living habits and physiological indexes may have similar health states. Therefore, we could build up a KNN model and check if it can predict people’s health states.

Data Processing

Since the data set is extremely imbalanced in the research variable cvdcrhd4, which can have a large impact on the KNN model that performs well when there is no bias caused by the original data, we need to conduct an stratified sampling process of the data to deal with the issue, which respectively draws nearly equal samples from the healthy and coronary groups. Moreover, to avoid overfitting, the data set is randomly sampled with a size of 4500. Here we divided the data set into three parts, the training set, the validation set and the test set. The corresponding ratio is 0.6, 0.2 and 0.2. We used the validation set to adjust the hyper-parameters in the models and the test set is for checking the validity of this model. Some of the variables are categorical variables so they are converted into dummy variables for modelling.

set.seed(20)
pre_subset=pre_subset[sample(nrow(pre_subset),4500,replace = FALSE),]
dummyV1 <- model.matrix(~.,pre_subset)

pre_subset=data.frame(dummyV1[,c(-1,-7,-9)])
N=dim(pre_subset)[1]
ind=sample(3,N,replace=TRUE,prob=c(0.6,0.2,0.2))
aus_train=pre_subset[ind==1,]
aus_test=pre_subset[ind==3,]
aus_val=pre_subset[ind==2,]

Hyper-parameter Selection

Firstly, we defined a suitable K value for this model. Seeing from the following plot, we selected the K-value with highest accuracy, 40, as the hyper-parameter.

library(class)

k=data.frame()
accu=data.frame()
train=aus_train
val=aus_val
#head(val)
  for(n in 1:50)
{
  kresult=knn(train = train[,-1],test=val[,-1],cl=train[,1],k=n,use.all=TRUE)
  accuracy=sum(diag(table(val[,1],kresult)))/sum(table(val[,1],kresult))
  k=rbind(k,n)
  accu=rbind(accu,accuracy)
  }
k=c(as.numeric(unlist(k)))
accu=c(as.numeric(unlist(accu)))
plot_dat=data.frame(x=k,y=accu)
ggplot(data = plot_dat,aes(x=k,y=accu))+
geom_point(size=3)+labs(title = "Accuracy vs K-value",x="K-value",y="Accuracy of the Validation Set")+theme(plot.title = element_text(hjust = 0.5))+geom_point(aes(40,accu[40]),color="red")

Model Optimization

To optimize the model, we should minimize the model size nearly without reducing the accuracy. Here we applies backward method.

ACCURACY=accu[40]
var_select=data.frame()
var_accu=data.frame()
head(val)
for (col in 2:7)
{
  var_select=rbind(var_select,col)
  train=aus_train[,-col]
  val=aus_val[,-col]
  kresult=knn(train = train[,-1],test=val[,-1],cl=train[,1],k=40,use.all=TRUE)
  accuracy=sum(diag(table(val[,1],kresult)))/sum(table(val[,1],kresult))
  var_accu=rbind(var_accu,ACCURACY-accuracy)
}
var_select=c(as.numeric(unlist(var_select)))
var_name=c("X_bmi5catNormal.weight","X_bmi5catOverweight","X_bmi5catObese","exerany2No","bphigh4No",
                                                     "X_rfcholYes")
var_accu=c(as.numeric(unlist(var_accu)))
plot_dat=data.frame(x=var_name,y=var_accu)
p <- ggplot(data = plot_dat, mapping = aes(
  x = var_name,
  y = var_accu ,
  fill=var_name))
p + geom_col()+labs(title = "Accuracy Reduction",x="Removed Variables",y="Accuracy Reduction")+theme(plot.title = element_text(hjust = 0.5))+theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+scale_fill_brewer(palette = "Set3")

From the bar plot above, removing the X_bmi5cat related variables has little influence on the accuracy reduction, which is smaller than 1%. Therefore, we build up the KNN model with the rest three variables:exerany2No,bphigh4No,X_rfcholYes.

Model Evaluation and Conclusions

To evaluate this optimized model, we used the test set and applied ROC curve together with the AUC value. As shown in the following graph, the AUC value that equals to the prediction accuracy in this model is approximately 0.702, which only indicates a general performance of prediction. However, we could conclude from the model that exercise habits, blood pressure and cholesterol level have associations with angina or coronary heart disease.

library(pROC)
train=aus_train[,c(-2,-3,-4)]
val=aus_val[,c(-2,-3,-4)]
test=aus_test[,c(-2,-3,-4)]
kresult=knn(train = train[,-1],test=test[,-1],cl=train[,1],k=40,use.all=TRUE)
roc1<-roc(kresult,test[,1])
plot(roc1,print.auc=TRUE,plot=TRUE,print.thres=TRUE,auc.polygon=TRUE,xlim=c(1,0),auc.polygon.col="skyblue",legacy.axes=TRUE)

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a relatively simple Supervised Machine Learning Algorithm used for classification and/or regression. It is more preferred for classification but is sometimes very useful for regression as well. Basically, SVM finds a hyper-plane that creates a boundary between the types of data. Since SVM is based on vectors, there are also no assumptions about the data set are made when the model is used. Therefore, we could build up an SVM model and check if it can predict people’s health states.

pre_subset1=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" & is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="Yes"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))

pre_subset2=subset(brfss2013,is.na(brfss2013$X_bmi5cat)==FALSE & brfss2013$X_bmi5cat!="Underweight" &is.na(brfss2013$cvdcrhd4)==FALSE & is.na(brfss2013$exerany2)==FALSE & is.na(brfss2013$bphigh4)==FALSE & is.na(brfss2013$X_rfchol)==FALSE & brfss2013$cvdcrhd4=="No"& (brfss2013$bphigh4 == "Yes" | brfss2013$bphigh4 == "No") ,select=c(cvdcrhd4,X_bmi5cat,exerany2,bphigh4,X_rfchol))


set.seed(1206)
ratio=length(pre_subset1$cvdcrhd4)/length(pre_subset2$cvdcrhd4)

ind=sample(2,length(pre_subset2$cvdcrhd4),length(pre_subset1$cvdcrhd4),prob=c(ratio,1-ratio))
pre_subset2=pre_subset2[ind==1,]


pre_subset=rbind(pre_subset1,pre_subset2)

Data Processing

Since the data set is extremely imbalanced in the research variable cvdcrhd4, we need to conduct an stratified sampling process of the data to deal with the issue, which respectively draws nearly equal samples from the healthy and coronary groups. Moreover, to avoid overfitting, the data set is randomly sampled with a size of 15000. Here we divided the data set into two parts, the training set and the test set. The corresponding ratio is 0.7 and 0.3. The test set is for checking the validity of this model. Some of the variables are categorical variables so they are converted into dummy variables for modelling.

set.seed(2021)
pre_subset=pre_subset[sample(nrow(pre_subset),15000,replace = FALSE),]
dummyV1 <- model.matrix(~.,pre_subset)
pre_subset=data.frame(dummyV1[,c(-1,-7,-9)])
N=dim(pre_subset)[1]
ind=sample(2,N,replace=TRUE,prob=c(0.7,0.3))
aus_train=pre_subset[ind==1,]
aus_test=pre_subset[ind==2,]

# pre_subset <- as.h2o(pre_subset)
# test <- as.h2o(test)
# aus_train <- as.h2o(aus_train)
# aus_test <- as.h2o(aus_test)
# x <- setdiff(names(aus_train), c("cvdcrhd4Yes", "int_rate"))
# y <- "cvdcrhd4Yes"
# iris.dl <- h2o.deeplearning(x = x, y = y, training_frame = pre_subset)
# predictions <- h2o.predict(iris.dl, pre_subset) 
# performance = h2o.performance(model = iris.dl,newdata = test)
# print(performance)

Model Optimization

library("e1071")
aus_train$cvdcrhd4No<- as.factor(aus_train$cvdcrhd4No)
train1 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
test1 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train1,cost=0.25,gamma=0.125,type = 'C')
test1_=subset(test1,select=-cvdcrhd4No,)
results_test1<-predict(object=model,test1_,type="class")
res_test1<-table(results_test1,test1$cvdcrhd4No)
accurary1<-sum(diag(res_test1))/sum(res_test1)

train2 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
test2 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train2,cost=0.25,gamma=0.125,type = 'C')
test2_=subset(test2,select=-cvdcrhd4No,)
results_test2<-predict(object=model,test2_,type="class")
res_test2<-table(results_test2,test2$cvdcrhd4No)
accurary2<-sum(diag(res_test2))/sum(res_test2)

train3 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
test3 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catObese,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train3,cost=0.25,gamma=0.125,type = 'C')
test3_=subset(test3,select=-cvdcrhd4No,)
results_test3<-predict(object=model,test3_,type="class")
res_test3<-table(results_test3,test3$cvdcrhd4No)
accurary3<-sum(diag(res_test3))/sum(res_test3)

train4 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,exerany2No,bphigh4No,X_rfcholYes))
test4 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catNormal.weight,X_bmi5catOverweight,exerany2No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train4,cost=0.25,gamma=0.125,type = 'C')
test4_=subset(test4,select=-cvdcrhd4No,)
results_test4<-predict(object=model,test4_,type="class")
res_test4<-table(results_test4,test4$cvdcrhd4No)
accurary4<-sum(diag(res_test4))/sum(res_test4)

train5 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,bphigh4No,X_rfcholYes))
test5 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train5,cost=0.25,gamma=0.125,type = 'C')
test5_=subset(test5,select=-cvdcrhd4No,)
results_test5<-predict(object=model,test5_,type="class")
res_test5<-table(results_test5,test5$cvdcrhd4No)
accurary5<-sum(diag(res_test5))/sum(res_test5)

train6 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,X_rfcholYes))
test6 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train6,cost=0.25,gamma=0.125,type = 'C')
test6_=subset(test6,select=-cvdcrhd4No,)
results_test6<-predict(object=model,test6_,type="class")
res_test6<-table(results_test6,test6$cvdcrhd4No)
accurary6<-sum(diag(res_test6))/sum(res_test6)

train7 <- subset(aus_train,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No))
test7 <- subset(aus_test,select=c(cvdcrhd4No,X_bmi5catNormal.weight,X_bmi5catOverweight,X_bmi5catObese,exerany2No,bphigh4No))
model<-svm(cvdcrhd4No~.,data=train7,cost=0.25,gamma=0.125,type = 'C')
test7_=subset(test7,select=-cvdcrhd4No,)
results_test7<-predict(object=model,test7_,type="class")
res_test7<-table(results_test7,test7$cvdcrhd4No)
accurary7<-sum(diag(res_test7))/sum(res_test7)

redu<- c(-(accurary2-accurary1),-(accurary3-accurary1),-(accurary4-accurary1),-(accurary5-accurary1),-(accurary6-accurary1),-(accurary7-accurary1))

var_name=c("X_bmi5catNormal.weight","X_bmi5catOverweight","X_bmi5catObese","exerany2No","bphigh4No","X_rfcholYes")
plot_data <- data.frame(x=var_name,y=redu)
p <- ggplot(data = plot_data, mapping = aes(
  x = var_name,
  y = redu ,
  fill=var_name))
p + geom_col()+labs(title = "Accuracy Reduction",x="Removed Variables",y="Accuracy Reduction")+theme(plot.title = element_text(hjust = 0.5))+theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 8))+scale_fill_brewer(palette = "Set3")

From the bar plot above, removing the X_bmi5cat related variables and exerany2No have little influence on the accuracy reduction. Therefore, we build up the SVM model with the rest three variables:bphigh4No,X_rfcholYes.

Model Evaluation and Conclusions

To evaluate this optimized model, we used the test set and applied ROC curve together with the AUC value. As shown in the following graph, the AUC value that equals to the prediction accuracy in this model is approximately 0.681, which only indicates a general performance of prediction. However, we could conclude from the model that blood pressure and cholesterol level have associations with angina or coronary heart disease.

train <- subset(aus_train,select=c(cvdcrhd4No,bphigh4No,X_rfcholYes))
test <- subset(aus_test,select=c(cvdcrhd4No,bphigh4No,X_rfcholYes))
model<-svm(cvdcrhd4No~.,data=train,cost=0.1975309,gamma=0.1975309,type = 'C')
test_=subset(test,select=-cvdcrhd4No,)
results_test<-predict(object=model,test_,type="class")
res_test<-table(results_test,test$cvdcrhd4No)
accurary<-sum(diag(res_test))/sum(res_test)
model_roc <- roc(aus_test$cvdcrhd4No,as.numeric(results_test))
plot(model_roc,print.auc=TRUE,auc.polygon=TRUE,max.auc.polygon=TRUE,auc.polygon.col="skyblue",print.thres=TRUE)

Random Forest (RF)

Random forest is a commonly-used machine learning algorithm, which combines the output of multiple decision trees to reach a single result. What’s more, it reduce the over fitting of decision tree and more steady. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems. Therefore, we use random forest to predict people health.

library(randomForest)

Data processing

new = data.frame(brfss2013$cvdcrhd4,brfss2013$X_bmi5cat,brfss2013$exerany2,brfss2013$bphigh4,brfss2013$X_rfchol)
colnames(new)=c("ill","bmi","exercise","pressure","cholesterol")

new=subset(new,is.na(new$ill)==FALSE&is.na(new$bmi)==FALSE&is.na(new$exercise)==FALSE&is.na(new$pressure)==FALSE&is.na(new$cholesterol)==FALSE)

ind <- sample(2, nrow(new), replace=TRUE, prob=c(0.1, 0.9))

new = data.frame(new,ind)

First we select the sample of coronary and no-coronary of 1:1, then we use random forest model to train. The training set have at least 5000 sample.

Comparing the Mean Decrease Gini of each variable which ralte to the importance of variable, we get that the importance of bmiX_bmi5cat is highest, then is the exerciseexerany2 and cholesterolX_rfchol, the blood pressurebphigh4 have lowest influence in the tree of random forest. The total error rate when the train sample have patient portion to healthy people equals to 1:1 is 30.06%. The accuracy in training set when predicting correctly on coronary patient is 69.3% while 70.5% on healthy people.

set.seed(123)
trainset = subset(new,new$ind==1)
testset = subset(new,new$ind==2)

train_no = subset(trainset,trainset$ill=="No")
train_yes = subset(trainset,trainset$ill=="Yes")
no_size = dim(train_no)[1]
yes_size = dim(train_yes)[1]

yess = floor(yes_size)

train_no = train_no[c(1:yess),]
#print(i)

trainset = rbind(train_yes,train_no)
trainset = trainset[,c(1:5)]
testset = testset[,c(1:5)]
  
rf <- randomForest(ill~.,data = trainset,ntree=100,proximity=TRUE)
rf

## 
## Call:
##  randomForest(formula = ill ~ ., data = trainset, ntree = 100,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 30.39%
## Confusion matrix:
##      Yes   No class.error
## Yes 1854  788   0.2982589
## No   818 1824   0.3096139

Mean_Decrease_GINI = rf


gini_tab = data.frame(importance(Mean_Decrease_GINI))
gini_tab

varImpPlot(Mean_Decrease_GINI)

Then we filtrate the variables, we wipe off the blood pressurebphigh4,the model have both decrease on the accuracy of patient and healthy people, it may indicate that the random forest tree can be less influence by the variable which is less important.

Model Evaluation and Conclusions

To evaluate this optimized model, we used the test set and applied ROC curve together with the AUC value. We change the portion of coronary sample in training set. We found that it have different outcome in prediction. We plot the True positive rate and False positive rate, then we plot the ROC curve. Using the AUC definition, we also get the AUC value of random forest on the test set is 0.74.

Normally we use the training model of train sample have patient portion to healthy people equals to 1:1 to evaluate the people, it have accuracy of 69.94%.

x=rep(0,75)
dim(x) = c(25,3)
roccurve = data.frame(x)

colnames(roccurve)=c("The sample ratio of the negative/positive","p","n")
for(i in 1:25){
  
  set.seed(123)
 trainset = subset(new,new$ind==1)
testset = subset(new,new$ind==2)

train_no = subset(trainset,trainset$ill=="No")
train_yes = subset(trainset,trainset$ill=="Yes")
no_size = dim(train_no)[1]
yes_size = dim(train_yes)[1]

yess = floor(i*yes_size/5)

train_no = train_no[c(1:yess),]
#print(i)

trainset = rbind(train_yes,train_no)
trainset = trainset[,c(1:5)]
testset = testset[,c(1:5)]
  
  rf <- randomForest(ill~.,data = trainset,ntree=10,proximity=TRUE)
  outcome = data.frame(predict(rf,testset),testset$ill)
  colnames(outcome)=c("predict","actual")

  ill_ill = sum(outcome$actual=="Yes"&outcome$predict == "Yes")
  ill_noill = sum(outcome$actual=="Yes"&outcome$predict == "No")
  noill_ill = sum(outcome$actual=="No"&outcome$predict == "Yes")
  noill_noill = sum(outcome$actual=="No"&outcome$predict == "No")
  
  roccurve[i,]=c(i/5,ill_ill/(ill_noill+ill_ill),noill_ill/(noill_ill+noill_noill))
}

#rocname = rownames(roccurve)
#roctab = cbind(roccurve,rocname)

roc_tab = roccurve[c(2,4,6,8,10,12,14,16,18,20),]
colnames(roc_tab)=c("Sample ratio of the \nNegative/Positive","True Positive rate","False Positive rate")
cat <- gt::gt(roc_tab)
cat<-
  cat%>%
  tab_header("The False~True positive rate")

cat

The False~True positive rate
Sample ratio of the Negative/Positive	True Positive rate	False Positive rate
0.4	0.9397428683	0.6726509571
0.8	0.7341673521	0.3547347243
1.2	0.7037790572	0.3181219227
1.6	0.6267261158	0.2538663273
2.0	0.4440933293	0.1665445383
2.4	0.2461798191	0.0815078355
2.8	0.2452274793	0.0808168460
3.2	0.1209038570	0.0407780220
3.6	0.0022076966	0.0005174386
4.0	0.0003895935	0.0001349840

roccurve[1,]=c(0,1,1)
auc_value = 0
for( i in 2:25){
  auc_value = auc_value+(roccurve[i-1,2]+roccurve[i,2])*(roccurve[i-1,3]-roccurve[i,3])/2
}

ggplot(roccurve, aes(x=n, y=p)) + geom_line()+ 
  geom_area(fill="Green",alpha=0.75)+ 
  xlim(0,1)+
  ylim(0,1)+
  xlab("False positive rate")+ 
  ylab("True positive rate")+
  ggtitle("ROC curve")+
  theme(
  plot.title =element_text(hjust = 0.5,vjust = 1),plot.subtitle = element_text(hjust = 0.5))+
  annotate("text", x=0.5 , y=0.5 ,label= c("AUC="))+
  annotate("text", x=0.5 , y=0.4 ,label= round(auc_value,2))

Final Model Selection

Comparing the results of four models, we prefer to select logistic regression model as our final model for its highest AUC value and prediction accuracy.Besides, because the other three models are non-parametric models and have higher randomness, we cannot look into the detailed effects of the variables. Instead, from the logistic regression model, we conclude that less exercise, higher cholesterol level and worse condition of hypertension are associated with a higher risk to suffer from coronary heart disease and people’s degree of obesity have weak association with angina or coronary heart diseases.

accuracy <- c(0.764,0.702, 0.681,0.699)
auc <- c(0.801,0.702, 0.681,0.74)
variables <- c('exerany2+bphigh4+X_rfchol','exerany2+bphigh4+X_rfchol','bphigh4+X_rfchol','X_bmi5cat+exerany2+bphigh4+X_rfchol')
comp_tbl <- data.frame(accuracy,auc,variables,row.names = c("Logistic Regression","KNN","SVM", "Random Forest"))
comp_names <- rownames(comp_tbl)
names(comp_tbl)<-c("Accuracy","AUC","Chosen Variables")
rownames(comp_tbl) <- NULL
comp_tbl <- cbind(comp_names, comp_tbl)
comp_tbl_tbl <- gt(data = comp_tbl, rowname_col = "comp_names")

comp_tbl_tbl <-
  comp_tbl_tbl %>%
  tab_header(
    title = "Model Comparison"
  )

comp_tbl_tbl

Model Comparison
	Accuracy	AUC	Chosen Variables
Logistic Regression	0.764	0.801	exerany2+bphigh4+X_rfchol
KNN	0.702	0.702	exerany2+bphigh4+X_rfchol
SVM	0.681	0.681	bphigh4+X_rfchol
Random Forest	0.699	0.740	X_bmi5cat+exerany2+bphigh4+X_rfchol

Discussion

The reliability and validity of the data: there are several aspects of the data that may cause bias. The age distribution of the respondents is slightly different from the true adults age distribution in the US in 2013, so bias caused by age may occur. The race distribution of the data may not reflect the true race distribution in the US in 2013, thus bias caused by race may occur. The sex ratio of the respondents is slightly different from the true adults sex ratio in the US in 2013 so bias caused by gender may occur. Since BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US, any conclusion we draw is only applicable to the US. Since respondents are identified through telephone-based methods, people who do not have telephone service could not be interviewed, thus bias may occur. Since this is an observational study and the answers from the respondents may not be accurate, so bias may occur.

The variables that we use in our statistical modelling procedure are mostly categorical variables. However, these variables (blood pressure, cholesterol level, exercise frequency, obesity level) could be measured by continuous numbers which would provide more details about one’s health. In addition, If the variables we used in modelling procedure were continuous variables, our model may have better performance.

Appropriateness of the statistical analysis: After preliminary experiments, reliable data samples are obtained, which avoids the effect of extreme values and outliers of samples on the analysis results. Meanwhile, the sampling method also ensures the randomness and balance of samples. By making reasonable assumptions, the problem is simplified and the model can be applied more efficiently so the valid analysis results can be obtained. The model parameters were optimized by training set and validation set, according to the standard of AIC and some unnecessary variables were removed. Test sets are used to evaluate the reliability of the model and check the validity of the hypothesis to ensure the accuracy of the model. Since the samples are independent, the prediction results of the test set are representative to some extent. By comparing the prediction accuracy of the models and AUC values, appropriate models were selected to predict the disease. Therefore, the most valid model can be selected.

Potential limitations: the variables we selected are almost categorical ones. Therefore, during modelling, we had to convert them into dummy variables, which causes the coefficients to be not that meaningful to indicate the average increasing rate of each level since the levels are actually absolute states. And the linear relationship between the predicted logit values and the categorical variables is unable to check, or say the linear regression results mean nothing. Therefore, the conditions of logistic regression may be insufficient to meet. What’s more, due to the extreme imbalance of the data set, we increased the ratio of people with angina or coronary heart disease since our goal is to predict the probability of having diseases. But the actual ratio of people having diseases is not that high, using this model may result in the excessively conservative prediction results.

Suggestions for improvements: in this project, the selected variables are limited. So for further improvements, we can take more factors into considerations and apply more methods to avoid overfitting and build up a more stable model, such as cross validation.

Conclusion

Through the significance test, we already figured out that Green food eating is an insignificant factor with angina or coronary heart disease. Although degree of obesity (X_bmi5cat) seems related to angina or coronary heart disease in the significance test, it does not contribute to the model construction. Therefore, we may conclude that degree of obesity is also not a significant factor associated with angina or coronary heart disease. Through the models we constructed, it is concluded that variables cholesterol level(X-rfchol) and blood pressure (bphigh4) are significant factors associated with angina or coronary heart disease, and exercise (exerany2) is a secondary factor, comparing the P-value in the logistic regression model.

However, since the factors of getting diseases are complex and the accuracy of our model is not high enough, these selected factors are still insufficient to predict the occurrence of angina or coronary heart disease. But it provides us with some statistical information about the related factors of angina or coronary heart disease, which is illuminating for both medical research and people’s life.

Group 9 Project Report

李欣忆-201830880507, 郑旻哲-201930640223, 宋永懿-201930640124, 陈澍浚- 201930640032, 张伟捷-201930640216

12/5/2021

Research Question

What Is Angina or Coronary Heart Disease?

Why Are We Interested In Angina or Coronary Heart Disease

Methodology

Data Background

Related Resources

Assumptions

Explanatory Data Analysis

Data Overview

Race Distribution

Age Distribution

Sex Ratio

Explanatory Variables and Response Variable

(1) Ever Diagnosed with Angina or Coronary Heart Disease

(2) Obesity Level

Missing Data Processing

Association Overview

(3) Any Exercise In The Past 30 Days

Missing Data Processing

Association Overview

(4) Ever Told Cholesterol High

Missing Data Processing

Association Overview

(5) Ever Told Blood Pressure High

Missing Data Processing

Association Overview

(6) Green Food

Missing Data Processing

Association Overview

Normal Distribution

Correlation Overview

Significance Test

(1) Obesity Level

(2) Exercise

(3) Cholesterol Level

(4) Blood Pressure Level

(5) Green Food

Significance Test Evaluation and Conclusion

Modeling

Logistic Regression

Model Assumptions

Data Processing

Model Selection

Regression diagnostics

Model Evaluation and conclusions

K-NearestNeighbor (KNN)

Data Processing

Hyper-parameter Selection

Model Optimization

Model Evaluation and Conclusions

Support Vector Machine (SVM)

Data Processing

Model Optimization

Model Evaluation and Conclusions

Random Forest (RF)

Data processing

Model Evaluation and Conclusions

Final Model Selection

Discussion

Conclusion