Statistical glossary

This glossary is based on numerous sources. It is by no means perfect but serves as a guide for my students, and where possible, explains statistical terminology in layman’s terms. Inevitably, that means that there might be compromises. As such this glossary will also be in continued development.

Contact me if you believe that there are errors or if you would like an entry included. I still need to work on some entries after W.

ABSOLUTE VALUE

The numerical value of a number, disregarding its sign. The absolute values of −8 and 8 are both 8.

ALPHA (α)

The probability of a Type I error, that is, the probability of rejecting the null hypothesis when it is true. (also see significance level)

A PRIORI HYPOTHESIS

A hypothesis specified before any testing is done. This is opposed to HARKing: Hypothesising After the Results are Known.

BAYES’ THEOREM

A formula that allows the reversal of conditional probabilities.
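
As a rough illustration of what "reversing" a conditional probability means, the Python sketch below turns P(positive test | disease) into P(disease | positive test). The function name and all numbers (prevalence, sensitivity, false positive rate) are made up for this example.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) from P(A), P(B | A) and P(B | not A) using Bayes' theorem."""
    # Total probability of observing B (law of total probability).
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% prevalence, 90% sensitivity, 5% false positive rate.
p_disease_given_positive = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p_disease_given_positive, 3))  # 0.154
```

Even with a fairly accurate test, the probability of disease given a positive result stays low here because the condition is rare.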

BAYESIAN STATISTICS

A statistical approach based on Bayes’ theorem, where prior information or beliefs are combined with new data to provide estimates of unknown parameters.

BETA (β)

In statistical research design, the probability of a Type II error, that is, the probability of failing to reject the null hypothesis when it is false. 1 − β = power.
In regression, the term for a standardised B coefficient (slope of the regression line).

BIAS

Error that is systematic and might lead to incorrect interpretation of results.

BINARY VARIABLES

Variables that can take only two values; also called dichotomous variables.

BOX AND WHISKER PLOT

A graph that depicts the minimum and maximum or a range of uncertainty (whiskers), lower and upper quartiles (box) and the median (horizontal line in the box) for a set of data.

CENSUS

A study that includes the whole population rather than a sample of that population.

CENTRAL LIMIT THEOREM

If you take repeated samples from a population with finite variance and calculate their averages, then the averages will be approximately normally distributed, and the approximation improves as the sample size increases. This is called the central limit theorem. It is the reason why many parametric test statistics become robust to deviations from normality as the sample size increases.
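
A small simulation can make this concrete. The sketch below (Python, standard library only) repeatedly samples from a strongly skewed exponential population and looks at the sample means. The population, sample size, number of samples and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible


def sample_means(n_samples, sample_size):
    """Draw repeated samples from an exponential population (mean 1, SD 1)
    and return the mean of each sample."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


means = sample_means(n_samples=2000, sample_size=50)

# The means cluster around the population mean (1.0), with a spread close to
# the population SD divided by the square root of the sample size (about 0.14).
print(statistics.fmean(means), statistics.stdev(means))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is far from normal.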

CHI-SQUARED GOODNESS OF FIT TEST

A statistical test used to investigate whether a frequency distribution follows a specific theoretical distribution.

CHI-SQUARED TEST

A statistical test used to investigate the association between two categorical variables.

CLUSTER ANALYSIS

A statistical method used to identify groups or clusters of individuals who have common features in terms of known variables.

CONFOUNDING VARIABLE

In research design, a variable that correlates with both the independent and dependent variables and is not in the causal pathway between them.

CONSTRUCT VALIDITY

The extent to which a measurement, or series of measurements, adequately measures a construct (such as intelligence or personality).

CONTROL VARIABLES

Variables included in a study design not because they are the focus of interest but because they are believed to influence the variables of interest and the researcher wants to control for their effect. These should ideally be pre-specified.

COX PROPORTIONAL HAZARDS REGRESSION

A multifactorial regression model used with a time-to-event outcome. This could be, for example, the time to relapse after a substance abuse treatment or the time to finding employment after job loss. It is one of the most widely used methods within ‘Survival Analysis’.

CRITERION VALIDITY

The extent to which a measurement correlates with something else. For instance, how well scores on a test correlate with grades in school.

CRONBACH’S ALPHA

A statistic used to measure the degree of internal consistency between items in a questionnaire. Alpha will usually be between 0 and 1. Common guidelines are that values below .7 suggest inadequate reliability. Values above .9 suggest that items are very similar, in which case fewer items could perhaps be used to measure the same construct.
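
The entry above does not give a formula. The sketch below uses the standard textbook one, alpha = k / (k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data are hypothetical.

```python
import statistics


def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, with respondents
    in the same order in every list."""
    k = len(items)
    item_variances = [statistics.variance(scores) for scores in items]
    # Total score per respondent across all items.
    totals = [sum(person) for person in zip(*items)]
    return k / (k - 1) * (1 - sum(item_variances) / statistics.variance(totals))


# Hypothetical responses of five people to three items.
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))  # 0.94
```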

CROSS-SECTIONAL STUDY

A study in which data is collected at a single point in time. This is contrasted with a longitudinal design.

DEGREES OF FREEDOM (df OR DF)

The number of values which are free to vary in an equation or statistic.

DEPENDENT VARIABLE

In research design, a variable that is assumed to be influenced by another, independent, variable (or more) included in the design.

DETERMINISTIC AND STOCHASTIC

A phenomenon is deterministic when its outcome is inevitable and all observations will take a specific value. A phenomenon is stochastic when its outcome may take different values in accordance with some probability distribution.

DICHOTOMOUS, CATEGORICAL, ORDINAL, METRIC DATA

Dichotomous data have two values and take the form “yes or no,” “got better or got worse.” Also known as binary variables.
Categorical data have two or more categories such as yes, no, and undecided. Categorical data may be ordered (opposed, indifferent, in favor) or unordered (dichotomous, categorical, ordinal, metric). Preferences can be placed on an ordered or ordinal scale such as strongly opposed, opposed, indifferent, in favor, strongly in favor.
Metric data can be placed on a scale that permits meaningful subtraction; for example, while “in favor” minus “indifferent” may not be meaningful, 35.6 pounds minus 30.2 pounds is.

DISCRETE DATA

Data that do not lie on a continuum and can only take certain values, usually counts (integers). For these types of data non-parametric statistics are often used.

DUMMY VARIABLES

Typically used in regression modelling to enable a categorical predictor variable to be included. A variable with n categories is converted into n–1 binary variables, where one category is the reference category.
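
A minimal Python sketch of this conversion is shown below. The helper `dummy_code` and the example data are invented for illustration. In practice, statistical software usually does this step for you.

```python
def dummy_code(values, reference):
    """Convert a categorical variable with n categories into n - 1 binary
    (0/1) variables, leaving out the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical predictor with three categories; "blue" is the reference.
colours = ["red", "green", "blue", "green"]
dummies = dummy_code(colours, reference="blue")
print(dummies)  # {'green': [0, 1, 0, 1], 'red': [1, 0, 0, 0]}
```

A case in the reference category scores 0 on every dummy variable, so each regression coefficient is interpreted relative to that category.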

ECOLOGICAL FALLACY

An error whereby a researcher assumes that a statistical pattern found at an aggregate level must translate to a lower level of analysis. Suppose that a researcher finds an association between meat consumption and wealth at the national level, i.e. richer countries consume more meat. It would then be a fallacy to assume that the same relationship holds at the individual level, i.e. richer individuals eat more meat. The converse is the atomistic fallacy.

FACTOR ANALYSIS

A statistical method which is part of multivariate statistics used to identify unknown underlying factors within a set of data. Factor analysis is often used when researchers want to reduce the complexity of a dataset. There is a distinction between “exploratory” and “confirmatory” factor analysis. (see SEM)

FACTORIAL DESIGN

A design including two or more categorical variables and their interactions. These designs can be analysed via ANOVA, among other techniques. In a full factorial design, all possible combinations of the variables are included in the study.

FISHER’S EXACT TEST

A statistical test that can be used to investigate the association between two categorical variables when the sample is small.

FOREST PLOT

A graph in meta-analysis used to display individual study estimates and confidence intervals, and the pooled estimate and confidence interval.

FREQUENTIST STATISTICS

A statistical approach where the data alone are used to provide estimates of unknown parameters. This is as opposed to Bayesian statistics. Typically some form of significance testing is used.

FUNNEL PLOT

In meta-analysis, a simple graphical method for exploring the results from studies to see if publication bias might be present.

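GAMMA DISTRIBUTION

A flexible family of continuous probability distributions for positive values, defined by a shape parameter and a scale (or rate) parameter. Because it can take many different right-skewed shapes, it can accommodate many different types of data.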

GENERALIZED ESTIMATING EQUATIONS (GEEs)

An alternative approach to multilevel modelling for data with a hierarchical structure or clusters, or serial measurements, that gives population average estimates.

HAZARD RATIO

In survival analysis (Cox regression), the ratio of hazards or risks of outcome in two groups.

HETEROGENEITY

Term used in meta-analysis, among other things, referring to statistical variability between estimates. In meta-analysis, when there is unexplained heterogeneity researchers often perform meta-regressions to explain some of the observed heterogeneity.

HISTOGRAM

A graph depicting the frequency distribution of a variable, with the length of each bar typically representing the number of cases or the expected number of cases (‘density’).

HYPOTHESIS, NULL HYPOTHESIS, ALTERNATIVE HYPOTHESIS

The dictionary definition of a hypothesis is a proposition, or set of propositions, put forth as an explanation for certain phenomena. For statisticians, a simple hypothesis would be that the distribution from which an observation is drawn takes a specific form, for example, that F[x] is N(0,1). In the majority of cases, a statistical hypothesis will be compound rather than simple, for example, that the distribution from which an observation is drawn has a mean of zero. Often, it is more convenient to test a null hypothesis, for example, that there is no or null difference between the parameters of two populations. There is no point in performing an experiment or conducting a survey unless one also has one or more alternative hypotheses in mind.

INCIDENCE

The number of new cases of a given condition occurring within a specific time period.

INDEPENDENT DATA

A set of separate data values that are not related to each other, such as the weight of each man in a random sample of men. Many statistical approaches require that data are independently sampled. If that is not the case, researchers often aim to model the source of the non-independence. For example, children in the same classroom might be more similar to each other than children in a random sample. Researchers could then use a multilevel model to account for this non-independence.

INDEPENDENT VARIABLE

In research design, a variable that is believed to exert an influence on another variable, the dependent variable.

INTERACTION VARIABLE

A variable for which the relationships between two other variables are different, depending on the category or score of the interaction variable.

INTERQUARTILE RANGE (IQR)

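The range of values that includes the middle 50% of values when they are arranged in ascending order, i.e. the upper quartile minus the lower quartile. Outliers are often defined as those cases more than 1.5 × IQR below the lower quartile or above the upper quartile. Extreme values can be defined as those cases more than 3 × IQR beyond the quartiles.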
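
The sketch below computes the IQR and flags cases lying more than 1.5 × IQR below the lower quartile or above the upper quartile. Software packages differ in how they calculate quartiles. The choice of `method="inclusive"` here is an assumption, as are the function name and the data.

```python
import statistics


def iqr_outliers(data, multiplier=1.5):
    """Return the IQR and the values lying more than multiplier * IQR
    below the lower quartile or above the upper quartile."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
    return iqr, [x for x in data if x < lower or x > upper]


# Hypothetical data with one unusually large value.
values = [1, 3, 4, 5, 6, 7, 8, 9, 25]
iqr, outliers = iqr_outliers(values)
print(iqr, outliers)  # 4.0 [25]
```

Calling `iqr_outliers(values, multiplier=3)` applies the stricter rule for extreme values.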

KAPLAN–MEIER CURVE

A graph demonstrating survival probabilities over time.

LIKERT SCALE

A type of ordinal rating scale developed by the psychologist Rensis Likert. A Likert scale presents a statement and asks people to indicate their agreement or disagreement using an ordered scale.

LOGISTIC REGRESSION

A regression model used with a binary outcome.

MCNEMAR’S TEST

A statistical test used to compare two paired proportions, for example the proportion of the same participants answering ‘yes’ before and after an intervention.

MEAN

The arithmetic average of a set of numbers.

MEDIAN

The central value of a set of numbers when they are ordered by value.

META-ANALYSIS

A statistical analysis which combines the results of several independent studies examining the same question. Often presented as part of a systematic review.

MODE

The most common value of a variable.
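
The three measures of central tendency defined above (mean, median and mode) can be compared on a small, made-up dataset using Python's standard `statistics` module:

```python
import statistics

# Hypothetical dataset, slightly skewed by the value 10.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic average: 30 / 6
median = statistics.median(data)  # midpoint of the two central values, 3 and 5
mode = statistics.mode(data)      # most common value

print(mean, median, mode)  # 5 4.0 3
```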

MULTILEVEL MODELS

A statistical modelling approach for data with a hierarchical structure or clusters, or serial measurements. Sometimes referred to as random effects or mixed models.

NOMINAL DATA

Data that do not have numeric meaning and for which numeric values serve only as labels (such as gender or color). Also called categorical data.

NONPARAMETRIC STATISTICS

Statistics not based on assumptions about the distribution of the population(s) from which the study data have been drawn or which make less stringent assumptions than parametric statistics.

NONPROBABILITY SAMPLING

Sampling in which the probability of selection for any unit or combination of units is unknown. An example would be convenience sampling, such as advertising a study on social media. Many statistical techniques require probability sampling for statistical inference.

NORMAL DISTRIBUTION

A continuous probability distribution with a symmetrical bell shape, which is followed by many naturally occurring variables, for example, stature. Sometimes referred to as a Gaussian distribution.

NULL HYPOTHESIS

The baseline hypothesis that is tested in a statistical significance test and which is usually of the form ‘there is no difference between samples’, ‘these samples are from the same distribution’, or ‘there is no association’.

NUMBER NEEDED TO HARM

The number of patients who need to be treated in order that one additional patient has a negative outcome.

NUMBER NEEDED TO TREAT

The number of patients who need to be treated in order that one additional patient has a positive outcome.
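
The definition above does not say how the number is obtained. It is commonly calculated as 1 divided by the absolute risk reduction, and the sketch below assumes that formula. The function name and the risks are hypothetical.

```python
def number_needed_to_treat(risk_control, risk_treated):
    """NNT as the reciprocal of the absolute risk reduction, where both
    risks are the probability of the bad outcome in each group."""
    absolute_risk_reduction = risk_control - risk_treated
    return 1 / absolute_risk_reduction


# Hypothetical trial: 30% relapse without treatment, 20% with treatment.
nnt = number_needed_to_treat(risk_control=0.30, risk_treated=0.20)
print(round(nnt))  # 10
```

The number needed to harm can be sketched the same way, using the absolute increase in the risk of a negative outcome.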

OBSERVATIONAL STUDY

A study in which subjects are observed, with exposures and outcomes measured, without any intervention by the researcher.

ODDS

The probability of an event occurring divided by the probability of it not occurring.

ODDS RATIO (OR)

A measure of the difference in odds between two groups, calculated by dividing the odds in one group by the odds in another group. The odds ratio is a measure of effect size, and formulae exist to convert it to a Pearson correlation.
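
The definitions of odds and the odds ratio translate directly into code. The probabilities in this short Python sketch are invented.

```python
def odds(p):
    """Odds of an event: probability of it occurring divided by the
    probability of it not occurring."""
    return p / (1 - p)


def odds_ratio(p_group1, p_group2):
    """Odds in group 1 divided by odds in group 2."""
    return odds(p_group1) / odds(p_group2)


# Hypothetical: the event has probability 0.5 in group 1 and 0.2 in group 2.
print(odds(0.5))             # 1.0, i.e. "even odds"
print(odds_ratio(0.5, 0.2))  # approximately 4: odds are four times higher in group 1
```

The ratio of the probabilities here is only 2.5, so an odds ratio should not be read as a ratio of probabilities.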

OPERATIONALIZATION

The process of specifying how a concept will be defined and measured.

ORDINAL VARIABLE

A variable that can be ordered, that is, ranked in size but without the assumption of equal intervals between consecutive values. For example, the grading of Pokemon cards by experts, into Mint, Near Mint, Good, Poor.

PARAMETRIC, NONPARAMETRIC, AND SEMIPARAMETRIC MODELS

Models can be subdivided into two components, one systematic and one random. The systematic component can be a function of certain predetermined parameters (a parametric model), be parameter-free (nonparametric), or be a mixture of the two types (semiparametric). The definitions in the following section apply to the random component.

PARAMETRIC, NONPARAMETRIC, AND SEMIPARAMETRIC STATISTICAL PROCEDURES

Parametric statistical procedures concern the parameters of distributions of a known form. One may want to estimate the variance of a normal distribution or the number of degrees of freedom of a chi-squared distribution. Student’s t, the F ratio, and maximum likelihood are typical parametric procedures.

Nonparametric procedures concern distributions whose form is unspecified. One might use a nonparametric procedure like the bootstrap to obtain an interval estimate for a mean or a median or to test that the distributions of observations drawn from two different populations are the same. Nonparametric procedures are often referred to as distribution-free, though not all distribution-free procedures are nonparametric in nature.

Semiparametric statistical procedures concern the parameters of distributions whose form is not specified. Permutation methods and U statistics are typically employed in a semiparametric context.

P VALUE

See SIGNIFICANCE LEVEL AND p VALUE.

PEARSON(’S) CORRELATION (r)

A measure of the strength of the linear relationship between two continuous variables. It varies between −1 and 1 and can also be used as an indication of effect size.
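
For readers who want to see what goes into r, the sketch below computes it from its textbook definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The variable names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: hours of revision and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)
print(round(r, 2))  # 0.99
```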

POISSON REGRESSION

A regression model used to model rates or count data based on a Poisson distribution. An example would be when we want to model the number of driving tests an individual has completed.

POSTERIOR DISTRIBUTION

A term used in Bayesian statistics. A probability distribution obtained by combining prior evidence with new information.

POWER

The probability that a statistical test will find a significant difference if a real difference of a given size exists, i.e. the null hypothesis is false. Power = 1 − β.

PREDICTOR VARIABLE

In regression analysis, a variable which is used to predict the value of an outcome variable. See INDEPENDENT VARIABLE.

PRINCIPAL COMPONENTS ANALYSIS

A statistical method used to reduce a dataset with many inter-correlated variables to a smaller set of uncorrelated variables that explain the overall variability almost as well.

PRIOR DISTRIBUTION

A term used in Bayesian statistics. The distribution representing prior beliefs or existing information, which is combined with new data to provide the posterior distribution.

PROBABILITY SAMPLING

Sampling methods in which all combinations of members of the population have a known probability of selection.

PROSPECTIVE STUDY

A study in which individuals are followed (and data collected) moving forward in time.

PUBLICATION BIAS

A term used in meta-analysis. A bias that occurs when the papers which are published on a topic are an incomplete subset of all the studies which have been conducted on that topic. There are tests which allow examining the potential occurrence of publication bias (see FUNNEL PLOT).

QUALITATIVE RESEARCH

Research that generates non-numerical data which are not analysed using statistical methods. For example, recorded in-depth interviews may be examined to identify common themes.

QUANTITATIVE DATA

Data which can be expressed numerically and are usually either measured or counted.

QUANTITATIVE RESEARCH

Research that generates numerical data which can be analysed using statistical methods.

RANDOM ERROR

Error that is due to chance. Random error makes measurement less precise but does not introduce bias. The opposite of systematic error.

RATIO

A method of expressing the relationship between the magnitude of two numbers. The numbers do not need to share a common unit (for instance, number of pet dogs per 1,000 population).

RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE

A graph plotting the sensitivity against 1–specificity for a diagnostic test at different cut-off points. This relates to signal detection theory.

SAMPLE

A sub-group of cases selected from a population. Often researchers want the sample to be randomly drawn from the population in order to infer patterns. For example, suppose that we want to find out whether the average height of male Northumbria students is greater than, say, 180 cm. We would then want to randomly sample individuals; for instance, we could use a computer to randomly select 5% of cases in a database of male student IDs. A biased sample would be to only sample those individuals enrolled in a sports science degree. With some exceptions, most statistical analyses rely on random sampling. If we sampled all individuals in a population, then no statistics with regards to ‘uncertainty’ would need to be calculated. In our example, if we measured all male Northumbria students, then there would be no ‘uncertainty’ and we would know the answer to our question: we would know the population average!

SENSITIVITY ANALYSIS

A way of testing assumptions made in statistical analyses by doing several analyses based on different assumptions, and comparing the results.

SERIAL DATA

Repeated measurements taken over time. Such data requires longitudinal analysis, often referred to as time series analysis.

SIGNIFICANCE LEVEL AND p VALUE

The significance level is the probability of making a Type I error. It is a characteristic of a statistical procedure.
The p value is a random variable that depends both upon the sample and the statistical procedure that is used to analyze the sample.

If one repeatedly applies a statistical procedure at a specific significance level to distinct samples taken from the same population when the hypothesis is true and all assumptions are satisfied, then the p value will be less than or equal to the significance level with the frequency given by the significance level.
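
The repeated-sampling statement above can be checked with a small simulation. The sketch below assumes a two-sided z-test for a mean with known population standard deviation, applied to many samples drawn while the null hypothesis is true. The test, sample size, number of repetitions and seed are all illustrative choices.

```python
import math
import random

random.seed(42)  # fixed seed so the sketch is reproducible


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p value of a z-test for the mean, known population SD."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Standard normal CDF written in terms of the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)


alpha = 0.05      # significance level
n_studies = 2000  # number of repeated "studies"
rejections = 0
for _ in range(n_studies):
    # The null hypothesis (mean 0, SD 1) is true for every sample.
    sample = [random.gauss(0, 1) for _ in range(30)]
    if z_test_p_value(sample, mu0=0, sigma=1) <= alpha:
        rejections += 1

rejection_rate = rejections / n_studies
print(rejection_rate)  # should be close to alpha
```

The proportion of p values at or below 0.05 comes out near 5%, which is the long-run Type I error rate that the significance level describes.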

SELECTION BIAS

Bias due to the way a sample is selected. For example, advertising a study with potentially large financial rewards might attract participants who are disproportionately poorer than the overall population.

SKEWED DATA

Data that do not follow a symmetrical distribution. This can violate the assumptions of (parametric) statistical tests.

STANDARD DEVIATION (SD)

A measure of dispersion used for continuous data. It is equal to the square root of the variance.

STANDARD ERROR (SE)

A measure of precision. It is the standard deviation of the sampling distribution of the sample mean.
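
The relationship between variance, standard deviation and standard error can be sketched with Python's `statistics` module. The data are made up. Note that the sample versions (`variance`, `stdev`) divide by n − 1, whereas the population versions (`pvariance`, `pstdev`) divide by n.

```python
import math
import statistics

# Hypothetical measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population formulas: mean squared difference from the mean, and its root.
pop_variance = statistics.pvariance(data)  # 4
pop_sd = statistics.pstdev(data)           # 2.0

# Sample formulas (divide by n - 1), as used when estimating from a sample.
sample_sd = statistics.stdev(data)

# Standard error of the mean: sample SD divided by the square root of n.
standard_error = sample_sd / math.sqrt(len(data))

print(pop_variance, pop_sd, round(sample_sd, 3), round(standard_error, 3))
```

The standard error shrinks as the sample grows, while the standard deviation describes the spread of the data themselves and does not.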

STEM AND LEAF PLOT

A graph which uses the data values themselves to depict the shape of a frequency distribution.

SYSTEMATIC ERROR

Error due to some cause other than chance. Systematic error can make observed values consistently higher or lower than true values and thus introduce bias. The opposite of random error.

SYSTEMATIC REVIEW

A literature review which aims to identify and critically appraise all (published) research answering a given question.

TYPE I AND TYPE II ERROR

A Type I error is made when the hypothesis is rejected even though it is true. A Type II error is made when the hypothesis is accepted even though an alternative hypothesis is true. Thus, the probability of a Type II error depends on the alternative.

TYPE II ERROR AND POWER

The power of a test for a given alternative hypothesis is the probability of rejecting the original hypothesis when the alternative is true. A Type II error is made when the original hypothesis is accepted even though the alternative is true. Thus, power is one minus the probability of making a Type II error. (also see BETA)
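
To show how power depends on the alternative, the sketch below computes power for a deliberately simple case: a one-sided z-test for a mean with known population standard deviation. This test and the numbers are assumptions chosen for illustration. Real power analyses are usually done with dedicated software.

```python
import math
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true mean shift of `effect`
    with population SD `sigma` and sample size `n`."""
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1 - alpha)
    # Probability that the test statistic exceeds the critical value
    # when the alternative (a shift of `effect`) is true.
    return 1 - standard_normal.cdf(z_crit - effect * math.sqrt(n) / sigma)


power = z_test_power(effect=0.5, sigma=1, n=25)
beta = 1 - power  # probability of a Type II error for this alternative
print(round(power, 2), round(beta, 2))  # 0.8 0.2
```

With an effect of zero the function returns alpha, and power rises as either the effect or the sample size grows.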

UNIQUE IDENTIFIER

A code or variable used to identify all the records belonging to a single unit of analysis (for instance, a student ID to identify all the courses a single student is enrolled in). Often, a unique identifier is needed to link various datasets together.

VALIDITY

How closely a measurement actually measures what it is intended to measure.

VARIABLE

A quantity that is measured or observed and which varies (or can vary) from case to case. The opposite is a constant.

VARIANCE

A measure of the variability of a set of numbers, calculated as the mean squared difference from the mean. The square root of the variance is the standard deviation.

VOLUNTEER BIAS

A type of selection bias resulting from collecting data from a sample of volunteers rather than the general population. Volunteers could differ in all sorts of characteristics from the overall population.

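WILCOXON RANK SUM TEST

Also known as the Mann-Whitney U test or Mann-Whitney-Wilcoxon test. A statistical test comparing ordinal data from two independent groups. A non-parametric alternative to the independent samples t-test. Not to be confused with the Wilcoxon signed rank test, which is used for paired data and is a non-parametric alternative to the paired samples t-test.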

References

The above is compiled based on the following sources:

Boslaugh, S. (2012). Statistics in a nutshell: A desktop quick reference (2nd ed.). Cambridge, UK: O’Reilly.

Crawley, M. J. (2013). The R book (2nd ed.). New York, NY: John Wiley & Sons.

Good, P. I., & Hardin, J. W. (2012). Common errors in statistics (and how to avoid them). Hoboken, NJ: John Wiley & Sons.

Peacock, J. L., & Peacock, P. J. (2011). Oxford handbook of medical statistics. Oxford, UK: Oxford University Press.