Year: 2018 | Volume: 4 | Issue: 1 | Page: 60-63
Parampreet Kaur1, Jill Stoltzfus2, Vikas Yellapu1
1 Department of Research and Innovation, The Research Institute, St. Luke's University Health Network, Bethlehem, PA 18015, USA
2 Department of Research and Innovation, The Research Institute, St. Luke's University Health Network; Temple University School of Medicine, Bethlehem, PA 18015, USA
Date of Submission: 19-Feb-2018
Date of Acceptance: 06-Mar-2018
Date of Web Publication: 23-Apr-2018
Dr. Parampreet Kaur
St. Luke's University Health Network, 801 Ostrum Street, Bethlehem, PA 18015
Source of Support: None, Conflict of Interest: None
Descriptive statistics are used to summarize data in an organized manner by describing the relationship between variables in a sample or population. Calculating descriptive statistics represents a vital first step when conducting research and should always occur before making inferential statistical comparisons. Descriptive statistics include types of variables (nominal, ordinal, interval, and ratio) as well as measures of frequency, central tendency, dispersion/variation, and position. Since descriptive statistics condense data into a simpler summary, they enable health-care decision-makers to assess specific populations in a more manageable form.
The following core competencies are addressed in this article: Practice-based learning and improvement, Medical knowledge.
Keywords: Descriptive statistics, measures of central tendency, measures of dispersion/variance, measures of frequency, measures of position
How to cite this article:
Kaur P, Stoltzfus J, Yellapu V. Descriptive statistics. Int J Acad Med 2018;4:60-3
Introduction
Quantitative research provides important statistical information to health-care decision-makers that enables them to accomplish tasks such as budget justification, departmental and network needs assessments, and allocation of medical resources. In addition, health-care statistics are critical to both quality improvement and product development. Many hospitals use the results of statistical analysis to measure their performance outcomes and to implement quality improvement programs that raise their efficiency. Health-care statistics are also helpful for pharmaceutical and technology companies in developing new products and conducting market research analysis of their products.
Within the health-care context, as in other sectors, there are two main approaches to statistical methodology: (1) descriptive analysis, which summarizes raw data from a sample or population and (2) inferential analysis, which draws causative, associative, or other conclusions from the data. Descriptive analysis is a prerequisite for, and provides the foundation of, inferential statistics.
Variable Type
Before analyzing any dataset, one should be familiar with different types of variables.
Categorical variables (also known as qualitative or discrete) may be further classified as nominal, ordinal, or dichotomous. Nominal variables, which are the simplest in nature, include two or more categories that lack intrinsic order (e.g., types of wounds: abrasion, laceration, puncture, or avulsion). Dichotomous nominal variables have only two categories (e.g., male or female). Ordinal variables have two or more categories that can be ranked or ordered, but there is no objective value to the rankings (e.g., a patient satisfaction scale with “strongly disagree,” “disagree,” “unsure,” “agree,” and “strongly agree”).
Continuous variables (also known as quantitative or numerical) are further categorized as either interval or ratio. Interval variables can be measured along a continuum and have a numeric value, but no true zero point (e.g., temperature measured in Celsius or Fahrenheit). Ratio variables have all the properties of interval variables as well as a true zero point (e.g., height, weight, fasting glucose).
In addition to variable type, descriptive statistics include measures of frequency, central tendency, dispersion/variation, and position [Table 1].
Measures of Frequency
Absolute frequency is the number of times a particular value occurs in the data. In contrast, relative frequency is the number of times a particular value occurs in the data (absolute frequency) relative to the total number of values for that variable. The relative frequency may be expressed in ratios, rates, proportions, and percentages.
Ratios compare the frequency of one value for a variable with another value for the same variable. For example, in thirty participants, the ratio of an experimental drug's adverse effects to no adverse effects is 2:28; conversely, the ratio of no adverse effects to adverse effects is 28:2.
Rate is the measurement of one value for a variable in relation to the entire sample of values within a given period. For example, in a total of thirty participants, there are 2 who show adverse effects after taking an experimental drug; therefore, the rate of adverse effects is 2/30 participants.
Proportion is the fraction of a total sample that has some value. For example, in a total of thirty participants, with two participants having adverse drug effects, the proportion of adverse effects is 2/30 ≈ 0.067.
Percentage is another way of expressing a proportion, as a fraction of 100. The percentages across all categories of a variable should always add up to 100%. For example, in a total of thirty participants, where 2 experience adverse drug effects, (2/30) × 100 ≈ 6.7% of participants experience adverse effects.
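The frequency measures above can be checked with a few lines of Python. This is only a minimal sketch using the thirty-participant example from the text; the variable names are illustrative and not part of the original article.

```python
# Counts taken from the example in the text:
# 30 participants, 2 of whom show adverse effects.
total_participants = 30
with_adverse_effects = 2
without_adverse_effects = total_participants - with_adverse_effects

# Ratio: one category compared with the other category.
ratio = (with_adverse_effects, without_adverse_effects)

# Proportion: the category count as a fraction of the whole sample.
proportion = with_adverse_effects / total_participants

# Percentage: the same proportion expressed per 100.
percentage = proportion * 100

print(f"ratio      = {ratio[0]}:{ratio[1]}")
print(f"proportion = {proportion:.3f}")
print(f"percentage = {percentage:.1f}%")
```

Note that 2/30 is a repeating decimal, so the printed proportion and percentage are rounded values.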
The above measures of frequency are often expressed visually in the form of tables, histograms (for quantitative variables), or bar graphs (for qualitative variables) to make the information more easily interpretable.
Measures of Central Tendency
Central tendency is the value that describes the entire set of data as a single measurement. The three primary measures of central tendency are the mean, median, and mode.
The following example will be used to demonstrate these three measures.
- Sample A (age in years) - 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
- Sample B (age in years) - 52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Mean is the arithmetic average or the sum of values in a dataset divided by the total number of observations. Using the above example, the mean of Sample A is 54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60 = 623, divided by 11 (total number of observations) = 56.6 years.
The mean should only be reported with interval and ratio data that are normally distributed (i.e., look like a “bell-shaped” curve) since this measure of central tendency is strongly affected by outliers and skewed distributions.
Median is the middle value in a distribution when the data are ranked in order from highest to lowest (or vice versa). If there is an odd number of values, the median is the exact middle value; however, if there is an even number of values, the median is the average of the two middle values. In the example above, the median for Sample A is 57 and for Sample B is (56 + 57)/2 = 56.5.
Since the median is less affected by outliers and skewed distributions, it is the appropriate measure to report when data do not follow a “bell-shaped” curve. The median should also be reported with ordinal data.
Mode is the most common value in a dataset. In the above examples, the mode for Samples A and B is 54.
Although the mode may be used for both qualitative and quantitative variables, it may not accurately represent the center of the distribution. Using the above example, the Sample A mode is 54, but the center of distribution is 57 years.
Sometimes, there may not be a single mode, either because all values are different or because the sample is bimodal or multimodal (signifying peaks at two or more places in the data distribution). In such cases, one may report the mean or median as appropriate.
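As a quick check of the worked examples, the three measures for Samples A and B can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative, and `multimode` is used (rather than `mode`) so that a bimodal or multimodal sample would return every tied value.

```python
import statistics

# Ages in years, copied from Samples A and B in the text.
sample_a = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

for name, ages in (("A", sample_a), ("B", sample_b)):
    mean = statistics.mean(ages)        # sum of values / number of values
    median = statistics.median(ages)    # middle value, or average of the two middle values
    modes = statistics.multimode(ages)  # list of the most common value(s)
    print(f"Sample {name}: mean={mean:.1f}, median={median}, mode(s)={modes}")
```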
As illustrated previously, the shape of the data distribution may influence the measures of central tendency. When the distribution is symmetrical (i.e., “bell-shaped”), the mean, median, and mode are all in the middle [[Figure 1], center]. When most values cluster at the low end and a long tail extends toward high values (positive skew), the mode remains the most common value, and the median remains the middle value, but the mean is pulled toward the right tail of the distribution [[Figure 1], right]. When most values cluster at the high end and a long tail extends toward low values (negative skew), the mean is pulled toward the left tail of the distribution [[Figure 1], left].
Figure 1: Normal and skewed distribution of the data. (a) Normal distribution. (b) Positive skew. (c) Negative skew
Outliers, which are extreme or unusual values, may also influence the measures of central tendency. The mean is more sensitive to outliers than the median and mode. However, even in the presence of outliers, the mean is still appropriate to report for interval or ratio data as long as the overall distribution is normal/“bell-shaped.”
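The different sensitivity of the mean and median can be illustrated with a small sketch. The outlier below (a 95-year-old participant) is a hypothetical value added for illustration only; it does not appear in the original samples.

```python
import statistics

# Sample A from the text (ages in years).
sample_a = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Hypothetical outlier, NOT from the article: one much older participant.
with_outlier = sample_a + [95]

mean_shift = statistics.mean(with_outlier) - statistics.mean(sample_a)
median_shift = statistics.median(with_outlier) - statistics.median(sample_a)

print(f"mean moves by   {mean_shift:.1f} years")
print(f"median moves by {median_shift:.1f} years")
```

In this particular case the median does not move at all, while the mean rises by roughly three years. Other outlier values or samples could shift the median slightly.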
Measures of Dispersion/Variation
Although measures of central tendency provide important information when describing one's data, they fail to capture variability within a dataset. Measures of dispersion/variation describe the degree to which a variable's values are similar or diverse. This type of measure only applies to ordinal, interval, and ratio data that can be ranked and includes the range, variance, and standard deviation.
The range is the difference between the lowest and the highest values in a dataset. For example, the range of Sample A above is 6 (60–54 = 6), while the range of Sample B is 8 (60–52 = 8).
The variance and standard deviation are measures of spread that reveal how close each observed value is to the mean of the entire dataset. In datasets with small spread, all values are close to the mean, yielding smaller variance and standard deviation. In contrast, datasets with greater spread of values away from the mean have larger variance and standard deviation. Therefore, if all values of a dataset are the same, the variance and standard deviation will be zero.
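The range, variance, and standard deviation can be sketched as follows. The text does not say whether the sample (n − 1) or population (n) formula is intended; this sketch assumes the sample formula, which is what `statistics.variance` and `statistics.stdev` compute. The helper `value_range` and the `constant` dataset are illustrative names and values, not from the article.

```python
import statistics

# Ages in years, copied from Samples A and B in the text.
sample_a = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]


def value_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


print("range A:", value_range(sample_a))
print("range B:", value_range(sample_b))

# Assumption: sample variance and standard deviation (divide by n - 1).
variance_a = statistics.variance(sample_a)
std_dev_a = statistics.stdev(sample_a)
print(f"variance A: {variance_a:.2f}")
print(f"std dev A:  {std_dev_a:.2f}")

# Hypothetical dataset in which every value is identical: no spread at all.
constant = [57, 57, 57, 57, 57]
print("variance of identical values:", statistics.variance(constant))
```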
In a normally distributed dataset, approximately 68% of the values are within one standard deviation on either side of the mean, 95% are within two standard deviations, and 99.7% are within three standard deviations.
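These coverage figures can be reproduced from the cumulative distribution function of a normal distribution. The sketch below uses `statistics.NormalDist` (available in Python 3.8 and later); the dictionary name `coverage` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of values lying within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```

Because the result depends only on distance from the mean in standard deviation units, the same shares hold for any normal distribution, whatever its mean and standard deviation.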
Measures of Position
Determining the position of values in a dataset may be accomplished in three main ways.
Percentiles divide the dataset into 100 equal sections, deciles divide it into ten equal parts, and quartiles divide an ordered dataset into four equal parts. The differences between percentiles and quartiles are minor and often disappear with a large number of values in a dataset. One may clearly see how they are associated as follows:
The lower quartile, Q1 (25th percentile), is the point between the lowest 25% and highest 75% of values. The second quartile, Q2 (50th percentile), is the median (middle of the dataset). The upper quartile, Q3 (75th percentile), is the point between the lowest 75% and highest 25% of values. If the quartile falls between two values, the average of those values represents the quartile value. Using the previous example, in Sample B, Q1 is 54 ((54 + 54)/2 = 54); Q2 is 56.5 ((56 + 57)/2 = 56.5); and Q3 is 58 ((58 + 58)/2 = 58).
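The quartile calculation for Sample B can be sketched as the median of each half of the ordered data. The function name `quartiles` is illustrative. How to treat the middle value when the number of observations is odd is not stated in the text, so excluding it from both halves is an assumption here.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves approach.

    Assumption: for an odd number of values, the middle value is left
    out of both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Skip the middle element when the count is odd.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


# Sample B from the text (ages in years).
sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
q1, q2, q3 = quartiles(sample_b)
print(f"Q1={q1}, Q2={q2}, Q3={q3}")
```

Statistical packages use several different interpolation rules for quartiles (for example, the `method` option of `statistics.quantiles`), so other tools may report slightly different values for small samples.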
The interquartile range is the difference between the upper and lower quartiles and describes the middle 50% of values when ordered from lowest to highest. It is considered a better measure of dispersion than the range because it is far less affected by outliers. For example, in Sample B, the interquartile range (Q3 − Q1) is 4 (58 − 54 = 4).
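To illustrate why the interquartile range is more robust than the range, the sketch below recomputes both for Sample B after adding one extreme value. The outlier (95) is hypothetical and not from the article, and the median-of-halves quartile rule, which leaves out the middle value when the count is odd, is an assumption, as are the function names.

```python
import statistics


def interquartile_range(values):
    """Q3 - Q1 using the median-of-halves approach (an assumed convention)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Skip the middle element when the count is odd.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return statistics.median(upper) - statistics.median(lower)


def value_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


# Sample B from the text (ages in years).
sample_b = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Hypothetical outlier, NOT from the article.
with_outlier = sample_b + [95]

for label, data in (("Sample B", sample_b), ("with outlier", with_outlier)):
    print(f"{label}: range={value_range(data)}, IQR={interquartile_range(data)}")
```

With this hypothetical outlier, the range grows from 8 to 43, while the interquartile range changes only from 4 to 5.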
Box plots are often useful for interpreting descriptive data in graphical form. As seen in [Figure 2], box plots are constructed using the 25th percentile (lower quartile), the median (50th percentile), the 75th percentile (upper quartile), the minimum data value, and the maximum data value. Box plots also show outlier values.
Conclusion
Descriptive statistics are a critical part of initial data analysis and provide the foundation for comparing variables with inferential statistical tests. Therefore, as part of good research practice, it is essential that one report the most appropriate descriptive statistics using a systematic approach to reduce the likelihood of presenting misleading results. Since the results of statistical analysis are fundamental in influencing the future of public health and health sciences, the appropriate use of descriptive statistics allows health-care administrators and providers to more effectively weigh the impact of health policies and programs.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.