
## Applied Statistics and Data Analysis for Public Health

**Introduction**

Statistics is the branch of applied mathematics concerned with collecting, processing, summarizing, interpreting and presenting data. It describes and analyses data and draws conclusions from the information contained in a sample (Peck, Olsen & Devore, 2015). Statistics is mainly used to collect data appropriately and to present complex data in suitable graphs, tables and diagrams so that it is clear and easy to understand. It also helps to explain the complexity and pattern of variation in natural phenomena, and to plan a statistical analysis correctly and efficiently in any area of research (James et al., 2013). Statistical methods play a vital role in monitoring and estimation in public safety: they are used to classify populations at risk, identify emerging health risks, organise public health interventions and assess their performance and effectiveness, and manage policy budgets and funds (Kellar & Kelvin, 2013). They are therefore widely used in public health and clinical medicine, and they help public health administrators understand what the populations in their care are experiencing (Armitage & Berry, 1994).

Statistics is broadly classified into two groups: descriptive statistics and inferential statistics. Descriptive statistics are the techniques that deal with the enumeration, organization, summarization and graphical description of collected data, and so help to describe the features of a specific data set. Inferential statistics, on the other hand, is concerned with drawing conclusions about a population from sample information; it is usually more complex and more prone to error. Descriptive statistics provide the basis for inferential statistics, so the two are interrelated (Fisher & Marshall, 2009). In this report both types of statistics will be used, according to the nature of the question and the type of the variables.

Statistical analysis of a collected data set always requires software. Choosing the right software is crucial, as the wrong choice can produce errors and false results (Cavaliere, 2015). For this report the data is provided in IBM SPSS, which manipulates data, quickly generates tables and graphs to summarize it, and performs statistical analysis ranging from basic descriptive statistics to advanced inferential statistics.

**Aim of the Report**

The purpose of this report is to apply appropriate statistical techniques to a dataset from a hypothetical study, in order to evaluate a healthy lifestyle education intervention intended to encourage a healthier way of living among university students. The key objectives of the study were to increase health education and to reduce the weight of the selected participants. To address the questions, different statistical techniques will be used and critically discussed to investigate the effects of the health education intervention on Body Mass Index (BMI), diabetes status and health literacy across both groups. Any change in BMI in the intervention group will be compared with the change in BMI in the control group. The report will also examine whether health literacy, age and sex can predict post-intervention differences in participants' BMI.

**Preliminary Analysis and Investigations**

Designing a measurable, clear and concise question, setting clear measurement priorities and collecting the data are the initial steps of data analysis, and these are already defined in the provided task. After data collection, data screening and cleaning is the next important step: it detects outliers and miscoded or missing values, assesses the normality of each variable and checks for possible errors, which helps to ensure the reliability and validity of the data used for testing causal theory (Odom & Henson, 2002). Data screening gives a general impression of the collected data, which helps in selecting and conducting a suitable analytical method and improves the performance of statistical techniques (Abubakar et al., 2017). The type of each variable must be determined before screening, because it determines which descriptive and analytical methods are used to summarize and analyse the data (Mayya et al., 2017). According to McDonald (2009), there are two main categories of variables, numerical and categorical, and both have subcategories. Categorical data takes a descriptive approach to expressing information; its values may be coded as numbers, but these codes have qualitative properties and no mathematical meaning. Categorical data is described as unstructured or semi-structured data that lacks a standardized order scale, and it can be visualized using bar charts and pie charts when showing frequencies and percentages respectively. Numerical data, on the other hand, is structured data and is compatible with more statistical methods than categorical data. It takes numeric values with numerical properties on a standardized order scale and is visualized using scatter plots and line graphs (Campbell & Swinscow, 2009).

The study included 81 participants, grouped into a control group and an intervention group, coded 1 and 2 respectively. Intervention status, gender (male and female), location (campus A and campus B), smoking status (not smoking and smoking), asthmadiag, i.e. asthma status (no asthma and asthma), and diabdiag, i.e. diabetes status (no diabetes and diabetes), are the categorical (nominal) variables (Table 1.3) and are appropriately coded. Participants' age, health literacy, height, weight1 (weight before the intervention) and weight2 (weight after the intervention) are numeric variables (Table 1.1). Both categorical and numerical data will be displayed graphically so that the data can be understood, remembered and compared at a glance. The general description, coding and labelling of all variables in the dataset are displayed in **Table 1**.

**Presentation of continuous variables**

Descriptive statistics are used to summarize numerical data, which is displayed in the form of histograms, line graphs, scatter plots and box-and-whisker plots. Summaries of numerical data usually present the distribution, central tendency and dispersion, which are the three major sample characteristics of each variable (Mishra et al., 2018).

**Table 1.1** Descriptive statistics for continuous variables in the healthy lifestyle data.

Table 1.1 gives a general impression of the descriptive statistics for the continuous variables in the study. Participants' ages ranged from **18** years (lowest) to **44** years (highest), giving a range of **26** (the difference between the highest and lowest values), with a mean age of **25**. The minimum and maximum heights were **1.40** m and **1.90** m, with a mean of **1.67** m and a range of **0.50** m.

The minimum health literacy score was **28.57** and the maximum was **92.86**, with a mean of **58.73** across the control and intervention groups. The scores follow a normal (Gaussian) distribution curve (Figures 1.1 and 1.2), which means that values near the mean are more frequent than values far from it.

**Figure 1.1**

**Figure 1.2**

Before the intervention, the minimum weight was **41.5 kg** (weight1); after the intervention it was **42 kg** (weight2). The maximum weight was **133.3 kg** before the intervention (weight1) and **129 kg** after it (weight2). Weight1 had a mean of **67.7 kg**, while weight2 had a mean of **67 kg** (Table 1.1).

Three new continuous variables were added to the dataset to evaluate the effect of the intervention on BMI in the intervention and control groups: BMI1, BMI2 and BMI_Diff, which represent BMI before the intervention, BMI after the intervention and the change in BMI between the two measurements respectively. Table 1.2 shows that the lowest and highest BMI values before the intervention were **15.62** and **38.95**, with a mean of **23.87**. After the intervention, the lowest and highest values were **15.81** and **37.69**, with a mean of **23.62**. The change in BMI among participants in both groups follows a normal distribution curve (Figures 1.5 and 1.6).
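The derived BMI variables can be illustrated with a short sketch. It assumes the standard definition of BMI (weight in kilograms divided by height in metres squared) and assumes that BMI_Diff is BMI1 minus BMI2; the report does not state the direction of the subtraction. The function names and the example participant are hypothetical and are not taken from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_change(weight1_kg: float, weight2_kg: float, height_m: float) -> float:
    """Assumed definition of BMI_Diff: BMI before minus BMI after."""
    return bmi(weight1_kg, height_m) - bmi(weight2_kg, height_m)


# Hypothetical participant (illustrative values, not a record from the study).
bmi1 = bmi(70.0, 1.70)
bmi2 = bmi(68.0, 1.70)
diff = bmi_change(70.0, 68.0, 1.70)
print(round(bmi1, 2), round(bmi2, 2), round(diff, 2))
```

Under this sign convention a positive BMI_Diff means that BMI fell over the study period.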

**Table 1.2.** Descriptive statistics for participants' BMI before and after the intervention in the healthy lifestyle education study.

**Figure 1.5**

**Figure 1.6**

**Presentation of categorical variables**

Categorical or qualitative data is usually displayed in the form of frequency tables, bar charts, pie charts and percentages (Duquia et al., 2014). Categorical data cannot follow a normal distribution because the variables are not continuous.

**Table 1.3** Frequencies for categorical variables

**Table 1.4**

**Figure 1.7**

**Table 1.5**

**Figure 1.8**

**Table 1.6**

**Figure 1.9**

**Table 1.7**

**Figure 1.10**

**Table 1.8**

**Figure 1.11**

**Table 1.9**

**Figure 1.12**

Categorical variables are displayed in frequency tables and presented as pie charts. Table 1.3 shows one missing value in the location variable. In total, 81 students participated in the study, of whom 62 (77.5%) were from campus A and 18 (22.5%) from campus B (Table 1.7). 26 (32%) participants were male and 55 (68%) were female (Table 1.8). 47 (58%) students were in the control group and 34 (42%) received the intervention (Table 1.9). 57 (70.4%) students were smokers and 24 (29.6%) were non-smokers (Table 1.4). 4 (5%) participants were diagnosed as diabetic, while 77 (95%) were non-diabetic (Table 1.5). 69 (85%) students were known asthmatics, whereas 12 (15%) were non-asthmatic (Table 1.6).

**Hypothesis formulation and hypothesis testing**

Before answering any research question, formulating a hypothesis is crucial to give the data analysis a specific direction and focus (Farrugia et al., 2010). For this study, a **null hypothesis** will be formulated for each of the following questions. To test these hypotheses, parametric and non-parametric tests will be conducted. According to Sheskin (2003), parametric tests are statistically more powerful than non-parametric tests if the data fulfil the required assumptions. These assumptions include normally distributed data, a large sample size, continuous variables and a random, independent sample. Non-parametric tests make fewer distributional assumptions and are used when the assumptions for parametric tests are not met. A complete understanding of assumption checks is needed to select and justify the right statistical test.

__Questions__

**Q1a. Evaluate the effect of the intervention on body BMI in intervention group.**

**Introduction**

This question requires an evaluation of the effect of the intervention on BMI before and after the intervention in the same participants within the intervention group, so the observations are dependent (paired). The data is filtered to select the intervention group (intervention = 2). A p-value is needed to decide whether to reject the formulated null hypothesis; a smaller p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis.

**H₀**: The intervention has no significant effect on BMI in the intervention group.

**Preliminary investigation and assumption analysis**

To use the more powerful statistical test, its assumptions must be satisfied. SPSS can check the normality of continuous data with two tests (Kolmogorov-Smirnov and Shapiro-Wilk) and present it graphically as a histogram and a Q-Q plot in a single step. Table 1A shows the descriptive statistics for the BMI difference in the intervention group. The bell-shaped curve on the histogram (Figure 1A), the points falling along the straight line on the Q-Q plot (Figure 1B) and **p > 0.05** in the Kolmogorov-Smirnov and Shapiro-Wilk tests (Table 1B) indicate that the continuous variable is normally distributed.
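The idea behind the Q-Q plot can be sketched in a few lines. This is not the SPSS procedure and it does not reproduce the Kolmogorov-Smirnov or Shapiro-Wilk tests; it only pairs sorted observations with the quantiles expected under a normal distribution and summarises how closely the points follow a straight line. The data are simulated, and the function names, the plotting positions and the use of a correlation as a summary are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pair each sorted observation with its theoretical normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n avoid the undefined quantiles at 0 and 1.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def pearson_r(pairs):
    """Pearson correlation between the two coordinates of each point."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Simulated BMI differences (hypothetical values, not the study data).
random.seed(1)
simulated_diff = [random.gauss(0.75, 0.6) for _ in range(34)]
points = qq_points(simulated_diff)
r = pearson_r(points)
print(f"Q-Q correlation: {r:.3f}")
```

A correlation close to 1 corresponds to points lying near the straight line on the Q-Q plot, which is the visual pattern the report relies on.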

**Table 1A**

**Table 1B**

**Figure 1A**

**Figure 1B**

**Test of choice**

The paired t-test and the Wilcoxon signed-rank test are usually used for paired group differences. Since the assumptions are satisfied, the more powerful paired t-test is appropriate.

**Table 1C**

**Table 1D**

**Table 1E**

**Interpretation**

The paired t-test showed a statistically significant difference in BMI in the intervention group before **(mean = 24.43 ± 4.67)** and after **(mean = 24.68 ± 4.49)** the intervention. The result **[t(33) = 7.5; p < .001; 95% CI, .55-.96]** shows a statistically significant reduction in BMI (Table 1E). The large t value and small p value indicate strong evidence against the null hypothesis.
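How a paired t statistic is formed can be shown with a minimal sketch. It uses five hypothetical before and after BMI values rather than the study data, and the function name is an assumption; SPSS additionally reports the p-value and confidence interval, which are omitted here.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Return (mean difference, t statistic, degrees of freedom)."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = mean(diffs)
    # Standard error of the mean difference uses the sample SD (n - 1).
    se = stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / se, n - 1


# Hypothetical BMI values for five participants (not the study data).
bmi_before = [25.1, 27.3, 22.8, 30.2, 24.5]
bmi_after = [24.4, 26.5, 22.1, 29.6, 23.6]
mean_diff, t_stat, df = paired_t(bmi_before, bmi_after)
print(f"mean difference = {mean_diff:.2f}, t({df}) = {t_stat:.2f}")
```

The test works on the within-participant differences only, which is why it suits repeated measurements on the same people.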

**Q1b. Evaluate the effect of the intervention on body mass index in the control group**

This question evaluates the change in BMI (BMI_Diff) before and after the intervention period in participants of the control group, who were not exposed to the intervention; the observations are again dependent (paired). The data was filtered to select the control group (intervention = 1). A p-value is needed to decide whether to reject the null hypothesis.

**H₀**: The intervention has no significant effect on BMI in the control group.

**Preliminary investigation and assumption analysis.**

The bell-shaped curve on the histogram (Figure 1C), the points falling along the straight line on the Q-Q plot (Figure 1D) and **p ≥ 0.05** in the Kolmogorov-Smirnov and Shapiro-Wilk tests (Table 1G) indicate that the continuous variable is normally distributed.

**Table 1F**

**Table 1G**

**Figure 1C**

**Figure 1D**

**Test of choice**

The paired t-test and the Wilcoxon signed-rank test are usually used for paired group differences. Since the assumptions are satisfied, the more powerful paired t-test is appropriate.

**Table 1H**

**Table 1I**

**Table 1J**

**Interpretation**

The paired t-test showed a difference in BMI before **(M = 22.75, SD = 3.23)** and after **(M = 22.86, SD = 3.19)** the intervention period in the control group. The result **[t(46) = 3.22; p = .002; 95% CI, .04-.19]** shows a statistically significant increase in BMI (Table 1J). The large t value and small p value indicate strong evidence against the null hypothesis.

**Q2. Evaluate the effect of the intervention on body mass index in the intervention compared to the control group.**

**Introduction**

This question involves two unrelated groups, because the control and intervention groups are not associated with each other. Intervention status is the independent variable, whereas BMI_Diff is the dependent variable. Assumption analysis will help to select the analytical technique for this question.

**H₀**: The intervention has no significant effect on BMI in the intervention group compared with the control group.

**Preliminary investigation and assumption analysis**

The bell-shaped curves on the histograms and the points falling along the straight line on the Q-Q plots for both the control and intervention groups (Figures 2A to 2D), together with **p > 0.05** in the Kolmogorov-Smirnov and Shapiro-Wilk tests (Table 2B), indicate that the continuous variable is normally distributed.

**Table 2A**

**Table 2B**

**Figure 2A** (control group)

**Figure 2B** (control group)

**Figure 2C** (intervention group)

**Figure 2D** (intervention group)

**Test of choice**

The independent t-test and the Mann-Whitney U test are used to compare differences between two independent groups. Since the sample sizes of the two groups are unequal and the standard deviation of the dependent variable is not equal across the two independent groups, the Wilcoxon-Mann-Whitney U test will be performed instead of the independent Student's t-test.

**Table 2C**

**Table 2D**

**Table 2E**

**Interpretation**

The **Ranks** table gives the mean rank and sum of ranks for each group, which are the basis of the Mann-Whitney U test output. The group with the highest mean rank, i.e. the intervention group, had the highest BMI change. The test statistics table shows the *U* statistic **(U = 102)** and an asymptotic significance (2-tailed) *p*-value **< .05**, which indicates strong evidence against the null hypothesis.
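The U statistic can be sketched by direct pair counting, which is equivalent to the rank-sum calculation behind the Ranks table. The values below are hypothetical BMI changes, not the study data, and the function name is an assumption. The sketch returns only the statistic and does not compute the asymptotic p-value that SPSS reports.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) by direct pair counting; ties count one half."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical BMI changes (not the study data).
intervention_change = [0.9, 1.1, 0.7, 1.3]
control_change = [0.1, -0.2, 0.8, 0.0, 0.3]
u_int, u_ctrl = mann_whitney_u(intervention_change, control_change)
u_reported = min(u_int, u_ctrl)  # the smaller U is the one usually reported
print(u_int, u_ctrl, u_reported)
```

A small U relative to the product of the two group sizes means that one group's values are almost always larger than the other's, which is the pattern behind a low reported U.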

**Q3. Examine whether there is a statistically significant difference in the proportion of diabetes cases between the intervention and control groups**

**Introduction and preliminary investigation**

This question involves two categorical variables: diabetes status, labelled diabdiag (0: no diabetes, 1: diabetes), and intervention status (1: control group, 2: intervention group).

**H₀**: The proportion of diabetes cases has no significant association with intervention status.

**Test of choice**

The chi-squared test is the test of choice here, as there is no parametric option for testing whether two categorical variables are associated with each other. The chi-squared test is based on frequencies and assesses, through the p-value, whether the difference between observed and expected frequencies is real or due to chance. It tests hypotheses about the independence of variables and is not useful for estimation.

**Table 3A**

**Table 3B**

**Figure 3A**

**Figure 3B**

**Interpretation**

The cross-tabulation showed that a total of 4 participants (3 from the control group and 1 from the intervention group) were diagnosed as diabetic. The chi-squared test gave p > 0.05, showing no statistical significance, so the null hypothesis is not rejected. However, the chi-squared test violated its assumption because 2 cells had expected counts < 5; therefore Fisher's exact test was used, which also gave p > 0.05, indicating no significant association between the proportion of diabetes cases and intervention status.
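The expected-count check and Fisher's exact test can be sketched from the counts given in the text. The diabetic counts (3 and 1) and the group sizes (47 and 34) come from the report; the non-diabetic counts are derived from them and assume no missing diabetes values. The function names are hypothetical, and this is a plain hypergeometric calculation rather than the SPSS implementation.

```python
from math import comb

# Rows: control, intervention. Columns: no diabetes, diabetes.
# Derived non-diabetic counts assume no missing values (an assumption).
observed = [[47 - 3, 3], [34 - 1, 1]]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table."""
    (a, b), (c, d) = table
    row1, row2, col2 = a + b, c + d, b + d
    total = comb(row1 + row2, col2)

    def prob(k):  # probability that k of the col2 cases fall in row 1
        return comb(row1, k) * comb(row2, col2 - k) / total

    p_obs = prob(b)
    k_min, k_max = max(0, col2 - row2), min(col2, row1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


expected = expected_counts(observed)
small_cells = sum(e < 5 for row in expected for e in row)
p_value = fisher_exact_two_sided(observed)
print(small_cells, round(p_value, 3))
```

With these counts the two diabetes cells have expected counts below 5, and the exact p-value is well above 0.05, in line with the interpretation given in the text.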

**Q4. Determine whether there is a statistically significant difference in health literacy between the intervention and control groups**

**Introduction**

This question involves two independent groups with one independent variable (intervention status) and one dependent variable (health literacy score). The former is categorical and the latter is continuous.

**H₀**: Health literacy score has no significant association with intervention status.

**Preliminary investigation and assumption analysis**

The bell-shaped curves on the histograms and the points falling along the straight line on the Q-Q plots for both the control and intervention groups indicate that the continuous variable is normally distributed. The Kolmogorov-Smirnov test gave p ≤ 0.05 in both groups, which suggests a non-normal distribution; however, the Shapiro-Wilk test gave p > 0.05, so the data is considered to be normally distributed.

**Test of choice**

The **independent t-test** and the **Mann-Whitney U test** are the most appropriate tests for independent samples. As the assumptions for a parametric test are satisfied, the statistically more powerful t-test will be performed.

**Table 4A**

**Table 4B**

**Figure 4A** (control group)

**Figure 4B** (control group)

**Figure 4C** (intervention group)

**Figure 4D** (intervention group)

**Table 4D**

**Table 4E**

**Interpretation**

The test showed a significant difference in health literacy scores between the control (M = 55.47, SD = 14.53) and intervention (M = 63.23, SD = 15.23) groups, with higher scores in the intervention group. The result, **p < 0.05** **[t(79) = 2.3; p = 0.02; 95% CI, -14.43 to -14.2]**, shows significant evidence against the null hypothesis.
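The size of the t statistic can be cross-checked from the summary figures alone. The sketch below assumes equal variances (the pooled Student's t-test) and takes the group sizes of 47 and 34 from the frequency summary earlier in the report; the function name is hypothetical.

```python
import math


def pooled_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Student's t (equal variances assumed) from group means, SDs and sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Means and SDs as reported for the control and intervention groups;
# group sizes (47 control, 34 intervention) come from the frequency summary.
t_stat, df = pooled_t_from_summary(55.47, 14.53, 47, 63.23, 15.23, 34)
print(f"t({df}) = {t_stat:.2f}")
```

The statistic is negative here because the control mean is entered first; its magnitude of roughly 2.3 on 79 degrees of freedom agrees with the reported value.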

**Q5. Investigate and explain how health literacy, age and sex can be used to predict post-intervention body mass index among participants in the study.**

This question requires predicting the value of a dependent variable, usually known as the outcome variable, from three independent variables known as predictors. BMI_Diff (the change in BMI before and after the intervention) is the outcome variable, whereas health literacy, age and sex are the predictors.

**H₀**: Health literacy, age and sex are poor predictors of BMI change.

**Preliminary investigation and assumption analysis.**

Multiple linear regression is a statistical technique used to analyse the dependency of an outcome variable on several exposure variables; it is used instead of simple linear regression because there is more than one exposure variable. Its assumptions must be satisfied before it is conducted: two or more independent variables (continuous or categorical), a linear relationship, normally distributed residuals and the absence of multicollinearity.

**Figure 5A**

**Figure 5B**

**Figure 5C**

**Figure 5D**

**Figures 5A, 5B and 5C** show no clear pattern that would establish a linear or non-linear relationship between the variables, indicating the independence of the variables. **Figure 5D** shows a normal distribution curve with slight right skewness. The VIF values (approximately 1) in **Table 5E** indicate the absence of multicollinearity. The assumptions mentioned above are therefore satisfied for multiple regression analysis.

**Table 5A**

**Table 5B**

**Table 5C**

**Table 5D**

**Table 5E**

**Table 5F**

**Table 5G**

**Figure 5E**

**Interpretation**

The results of the multiple regression analysis showed **BMI_Diff = .81**, **health literacy = .006**, **age = .003** and **sex = .64**, which are nearly zero **correlation coefficients**. **R² = 0.028** (only 2.8% of the variation in the dependent variable is attributed to the independent variables), the nearly zero coefficient values and **p > 0.05** indicate no statistically significant relationship between the dependent and independent variables. There is no strong evidence against the null hypothesis.
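The link between the reported R² and the overall significance of the model can be sketched with the standard F formula for a linear regression. R² = 0.028 and the three predictors are taken from the report; n = 81 assumes that every participant had complete data for all model variables, which the report does not state. The function names are hypothetical.

```python
def overall_f_from_r2(r_squared, n, k):
    """Overall F statistic of a linear model with k predictors and n cases."""
    df_model = k
    df_resid = n - k - 1
    f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
    return f_stat, df_model, df_resid


def adjusted_r2(r_squared, n, k):
    """Adjusted R-squared penalises R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# R-squared and the number of predictors are from the report;
# n = 81 is an assumption (no missing values in the model variables).
f_stat, df1, df2 = overall_f_from_r2(0.028, 81, 3)
adj = adjusted_r2(0.028, 81, 3)
print(f"F({df1}, {df2}) = {f_stat:.2f}, adjusted R^2 = {adj:.3f}")
```

An F statistic below 1 is far short of conventional critical values, and the adjusted R² falls slightly below zero, both of which agree with the conclusion that the three predictors explain very little of the change in BMI.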


**Conclusion**

This report aimed to perform appropriate statistical procedures and interpret the results in order to assess a healthy lifestyle education intervention intended to encourage a healthy lifestyle among university students. The results indicate that the intervention group had a significant reduction in Body Mass Index (BMI) compared with the control group. The proportion of diabetes cases had no association with intervention status; however, the intervention group had relatively higher health literacy scores than the control group. These results contrast with the multiple regression analysis, which showed health literacy, age and sex to be poor predictors of BMI change. The divergent results could be due to an inadequate sample size that was randomly and unevenly distributed between the two groups. Selecting a relatively large sample from a diverse population, allocating participants appropriately to groups and following them for an appropriate time period can help to obtain accurate results. Moreover, these results are based only on p-values, which indicate only whether the data are compatible with the null hypothesis and do not measure the probability that the null hypothesis is correct.


**References**

ABUBAKAR, A., SAIDIN, S.Z. & AHMI, A., 2017. Performance Management Antecedents and Public Sector Organizational Performance: Data Screening and Preliminary Analysis. *International Journal of Academic Research in Business and Social Sciences* [online]. **7**(9), pp.19-31. [viewed 22 April 2020]. Available from: https://pdfs.semanticscholar.org/774d/5b8fcb0cc5786702271953c8229c16848d31.pdf

ARMITAGE, P. & BERRY, G., 1994. *Statistical Methods in Medical Research*. Oxford: Blackwell Science.

CAMPBELL, M.J. & SWINSCOW, T.D., 2009. *Statistics at Square One*. 11th ed. Oxford: Wiley-Blackwell.

CAVALIERE, R., 2015. How to choose the right statistical software - a method increasing the post-purchase satisfaction. *Journal of thoracic disease* [online]. **7**(12), pp.E585–E598. [viewed 20 April 2020]. Available from: https://doi.org/10.3978/j.issn.2072-1439.2015.11.57

FARRUGIA, P., PETRISOR, B.A., FARROKHYAR, F. & BHANDARI, M., 2010. Practical tips for surgical research: Research questions, hypotheses and objectives. *Canadian journal of surgery* [online]. **53**(4), pp.278–281. [viewed 23 April 2020]. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912019/

FISHER, M.J. & MARSHALL, A.P., 2009. Understanding descriptive statistics. *Australian Critical Care*. **22**(2), pp.93-97.

JAMES, G., WITTEN, D., HASTIE, T. & TIBSHIRANI, R., 2013. *An introduction to statistical learning*. New York: Springer.

KELLAR, S.P. & KELVIN, E.A., 2013. *Munro’s Statistical methods for health care research*. 6th ed. Philadelphia: Wolters Kluwer/Lippincott Williams & Wilkins.

MAYYA, S.S., MONTEIRO, A.D. & GANAPATHY, S., 2017. Types of biological variables. *Journal of thoracic disease* [online]. **9**(6), pp.1730–1733. [viewed 22 April 2020]. Available from: https://doi.org/10.21037/jtd.2017.05.75

MISHRA, P., PANDEY, C.M., SINGH, U. & GUPTA, A., 2018. Scales of measurement and presentation of statistical data. *Annals of cardiac anaesthesia* [online]. **21**(4), pp.419–422. [viewed 22 April 2020]. Available from: https://doi.org/10.4103/aca.ACA_131_18

ODOM, L.R. & HENSON, R.K., 2002. *Data Screening: Essential Techniques for Data Review and Preparation* [online]. ERIC. [viewed 22 April 2020]. Available from: https://files.eric.ed.gov/fulltext/ED466781.pdf

PECK, R., OLSEN, C. & DEVORE, J.L., 2015. *Introduction to statistics and data analysis*. Boston, Massachusetts: Cengage Learning.

SHESKIN, D.J., 2003. *Handbook of parametric and nonparametric statistical procedures*. 3rd ed. Boca Raton: CRC Press.