Statistical analysis

Introduction

Statistics is a science derived from mathematics that can be divided into two branches: descriptive and inferential. Descriptive statistics are commonly used as the first step in data analysis. They are measures that summarize and characterize a set of data, and they answer questions like: How many people have the disease? How often did the event occur? What is the spread of test results in the population? Descriptive statistics can point to similarities and differences between groups, but on their own they are not enough to confirm or refute a hypothesis. Inferential statistics allows hypothesis testing based on probability theory. It answers the key question: How likely is it that this difference (that we observed between two groups) is due to chance alone?

Before analyzing your data you should have a Statistical Analysis Plan (see below for links to templates, guidelines and examples). This document describes what statistical tests you will perform and assigns them to a hierarchy of primary, secondary and, sometimes, exploratory analyses. Many researchers publish the statistical analysis plan, either summarized as part of a design paper or study protocol, or by making it available online. This ensures transparency and encourages rigorous scientific method.

Your primary analysis relates to your primary study outcome. It is essential to spell this out from the start. Remember that a significance threshold of P < 0.05 means that, when there is no true effect, a single test still has a one in twenty chance of giving a ‘positive’ result. If you do twenty independent statistical tests, you should therefore expect about one of them to be ‘positive’ with a P-value < 0.05 by chance alone. So it is important to specify your primary outcome before doing the analysis, so that readers know that you have not simply performed many statistical tests and chosen the one with the significant P-value. Analyses that you plan in advance, alongside the primary analysis, are called secondary analyses. Any analysis that you decide to do only after looking at the data is always considered exploratory, meaning that it can only ever provide a suggestion for further research and should never be used as proof.
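The arithmetic behind the one-in-twenty warning can be sketched in a few lines of Python. This is only an illustration: it assumes the tests are independent and that no true effect exists, and the function names are made up for this example.

```python
# Illustrative sketch: how false positives accumulate over many tests.
# Assumptions (not from the text above): tests are independent and
# there is no true effect in any of them.

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one 'positive' result by chance alone."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Average number of chance 'positives' across n_tests tests."""
    return alpha * n_tests


if __name__ == "__main__":
    for n_tests in (1, 5, 20):
        rate = familywise_error_rate(0.05, n_tests)
        count = expected_false_positives(0.05, n_tests)
        print(f"{n_tests:>2} tests: P(at least one false positive) = {rate:.3f}, "
              f"expected false positives = {count:.2f}")
```

With twenty tests the chance of at least one spurious ‘positive’ is roughly two in three, which is why a single pre-specified primary analysis carries more weight than the best result picked from many.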

An important step before applying any statistical test is to identify the dependent and independent variables. In clinical trials, the dependent variable is the outcome of the study (e.g. mortality rate; change in renal function) and the independent variables are the factors under investigation that could modify the outcome, the most important of which is the randomized treatment allocation (i.e. intervention vs. control). Other independent variables (e.g. proteinuria, blood pressure) may be used in multivariable analysis. However, this is almost always a secondary analysis in a clinical trial, because randomization is intended to balance all other factors between groups.

Each variable should be classified by type (e.g. continuous, ordinal, categorical, dichotomous) and distribution (normal [also known as Gaussian, and suited to parametric tests] or non-normal). This will help you determine the right statistical test to use. It should be considered when planning the study, because the type of data determines which statistical tests are possible and how the results can be interpreted. For example, if only CKD stage is collected then one must use a categorical data analysis, whereas if eGFR is collected then one can either use a continuous data analysis (usually more powerful) or convert the values to CKD categories and use a categorical analysis. Therefore, a careful plan to adjust the study design according to the research question and the characteristics of the study variables is always desirable.
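The eGFR example can be made concrete with a short Python sketch that turns continuous eGFR values into categories. The cut-points (15, 30, 45, 60 and 90 mL/min/1.73 m²) are the commonly used GFR category boundaries and are an assumption added here, not something stated in the text above. The function name and the sample values are hypothetical. Note that full CKD staging also takes albuminuria and chronicity into account, so this covers the GFR category only.

```python
from bisect import bisect_right
from collections import Counter

# Assumed GFR category boundaries in mL/min/1.73 m^2 (lower bound of each
# category above G5). These are illustrative and not taken from the text.
_CUT_POINTS = [15, 30, 45, 60, 90]
_CATEGORIES = ["G5", "G4", "G3b", "G3a", "G2", "G1"]


def gfr_category(egfr: float) -> str:
    """Map a continuous eGFR value to a GFR category label."""
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    # bisect_right counts how many cut-points are <= egfr,
    # which is exactly the index of the matching category.
    return _CATEGORIES[bisect_right(_CUT_POINTS, egfr)]


if __name__ == "__main__":
    # Hypothetical eGFR measurements, for illustration only.
    sample = [92.0, 74.5, 58.0, 44.9, 29.0, 12.3, 61.0]
    categories = [gfr_category(value) for value in sample]
    print(categories)
    print(Counter(categories))
```

The conversion only works in one direction: once the data are stored as categories the original eGFR values cannot be recovered, which is one reason to collect the continuous measurement when the study design allows it.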