Tips for a high-quality statistical analysis

Gabrielle Teixeira
Published in PlayKids Tech Blog
Sep 30, 2019 · 8 min read
Image from O Estatístico

Statistical analysis is increasingly present in companies: a well-managed analysis can extract gold from data and empower decision making. However, some mistakes are common in this process, and taking the necessary precautions can improve the accuracy of the results, the management of the company, and your career. In this article, we describe the conceptual tools and what to do (and not do) with them for a good data analysis.

Statistical analysis can be divided into two classes: descriptive analysis and modeling (or inference).

Descriptive analysis

Descriptive analysis summarizes the data through graphs and tables so that you can visualize and interpret it. Listed below are some mistakes you can make at this stage and how to avoid them:

  • Using mode, average, and median as if they were the same thing: these three values may be very similar in your data set, but they do not represent the same thing. If your data has many outliers or a skewed distribution, these values will be very different, and using only one of them to describe the data may give false impressions (see the first sketch after this list).
  • Misunderstanding standard deviation: this statistic measures how far the data deviates from the mean, showing the dispersion. However, it can be very misleading for data that does not follow a normal distribution, so it should not be interpreted in isolation.
  • Wrongly assuming correlations: two errors can occur here: ignoring correlations that exist and seeing correlations that do not. Don't assume cause-and-effect relations in your research; test causality to see if one variable actually drives the other. Without this precaution, you may end up with spurious correlations like the one in the chart below.
Correlation does not imply causality. Source: http://www.tylervigen.com/spurious-correlations
  • Not defining the format of your variables: planning which data is essential to your analysis, and in what format, is key to a good result. You need to define a single representation for each variable. For example, in a salary field you can get answers like 2 minimum wages, $3,000 monthly, $24,000 annually, no income, family income, and so on. All of these refer to salary, but they cannot be compared directly. Putting your data into a single pattern for each variable, such as defining that salary will always be expressed in dollars and as a monthly income, will always help the analysis (see the second sketch after this list). For qualitative data, it is also good to define a finite set of accepted values, for example high, medium, and low frequency. The result will be more complete and easier to interpret.
  • Forgetting to consider null and zero fields: in an analysis, always define whether your variables accept null or zero values. Sometimes you may want to calculate the average only over the fields that have values, such as computing the average usage time of an app only for people who watched a video (the second sketch after this list does exactly this kind of filtering).
  • Beware of the Hawthorne effect: when collecting experimental data, make sure it is not biased. People may change their behavior or respond differently if they feel they are being studied. Some ways to avoid bias in the responses are to use unobtrusive observation techniques, such as naturalistic observation, or to guarantee the confidentiality and anonymity of the responses.
  • Not understanding probability: a big mistake in statistical analysis is having the correct data but interpreting it incorrectly, for example confusing chance with certainty. If a hypothesis has a 90% chance of being true, that does not mean it cannot be false. It means that, on average, about 90 out of 100 similar events will conform to the hypothesis, and yours may well fall among the 10 that do not.
  • Believing that the average explains the whole group: a common misinterpretation is assuming that the average represents every point in your data, and therefore thinking the mean is wrong when you find values far away from it in your data set. An individual experience should never be generalized: if ten people not wearing seat belts survived a car accident, we cannot conclude that, on average, people should not wear seat belts. The statistic that wearing seat belts saves lives is still far more relevant. Another common example: a novice analyst trying to prove that a product is well accepted may take a sample of five satisfied people and claim that the average satisfaction has reached a new level, when in fact that is a biased subgroup that says nothing about the overall average. Draw conclusions from reliable statistics about the whole group.
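
To make the first two points concrete, here is a minimal Python sketch (with made-up usage numbers, not data from the article) showing how the mean, median, mode, and standard deviation can tell very different stories when a single outlier skews the distribution.

```python
import statistics

# Monthly usage minutes for ten users; one heavy user skews the distribution.
usage_minutes = [10, 12, 12, 15, 18, 20, 22, 25, 30, 400]

print("mean:  ", statistics.mean(usage_minutes))    # 56.4 -- pulled up by the outlier
print("median:", statistics.median(usage_minutes))  # 19.0 -- the middle value
print("mode:  ", statistics.mode(usage_minutes))    # 12   -- the most frequent value
print("stdev: ", statistics.stdev(usage_minutes))   # ~121 -- dominated by the outlier
```

Describing these users only by their mean (or only by their standard deviation) would give a false impression of how a typical user behaves.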
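A second sketch covers variable formats and null fields: a hypothetical to_monthly_usd helper (the function name, the parsing rules, and the minimum-wage value are assumptions for illustration, not part of the article) maps mixed salary answers to one pattern, monthly income in dollars, and then averages only the records that actually have a value.

```python
MINIMUM_WAGE_MONTHLY = 1320.0  # assumed reference value, not from the article

def to_monthly_usd(raw):
    """Return a monthly salary in dollars, or None when there is no usable value."""
    if raw is None or raw in ("no income", "family income"):
        return None
    if raw.endswith("minimum wages"):
        return float(raw.split()[0]) * MINIMUM_WAGE_MONTHLY
    amount = float(raw.strip("$").split()[0].replace(",", ""))
    if raw.endswith("annually"):
        return amount / 12
    if raw.endswith("monthly"):
        return amount
    return None  # unknown pattern: treat it as a null field

answers = ["2 minimum wages", "$3,000 monthly", "$24,000 annually", "no income", None]
salaries = [v for v in (to_monthly_usd(a) for a in answers) if v is not None]
print(sum(salaries) / len(salaries))  # average over the fields that have a value only
```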

Modeling (or inference)

Modeling (or inference) is the evaluation of the data: looking for trends, testing hypotheses, and drawing conclusions. It evaluates the current data and makes predictions based on the numbers. Some tips for good modeling/inference analysis are listed below:

  • Estimating the sample size: in statistical inference, not all data is considered, as that would be time-consuming and costly. A sample is selected for the study and its results are generalized to the whole group. Determining the right sample size is essential for an accurate analysis. To estimate this size, you calculate the standard deviation of your data and define the acceptable margin of error (usually 5%) and the desired confidence level (see the first sketch after this list). You should also be cautious about assumptions involving new and untested variables.
  • Hypothesis test: a hypothesis test evaluates whether a premise holds for a data set. It then checks the significance of the test result, meaning how likely it would be to reach that result by chance (the p-value). The care to take in hypothesis testing is deciding in advance which statistical propositions are taken as estimates, which confidence level is accepted, and what the prediction range is (see the second sketch after this list).
  • Regression: in regression the data is generally plotted on a scatter plot to model the relationships between dependent and explanatory variables, showing whether the associations between them are strong or weak. However, regression cannot be used to explain every kind of data. A fitted regression tends to smooth over abnormal values such as outliers, and depending on the study those outliers can be very important, like the best-selling product. Another caution is that the variables should be normalized so that some of them do not weigh more than others (see the third sketch after this list).
  • Statistical software: technology brings agility and accuracy to the analysis of large amounts of data, but it should not be treated as a substitute for the analyst. Before feeding your data into a model, you must look at it with a critical eye. Knowing the theory behind what is being calculated and the prerequisites for its validity, understanding your business, and doing exploratory statistics such as plotting the data on graphs are the starting steps of a good analysis. Software is just a processing tool, not a substitute for judgment, knowledge, and common sense.
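
A first sketch, for the sample-size point, assuming we are estimating a mean: the usual formula n = (z · σ / E)², where σ is the estimated standard deviation, E the acceptable margin of error, and z the critical value for the chosen confidence level. The numbers below are illustrative.

```python
import math
from scipy.stats import norm

sigma = 15.0            # standard deviation estimated from the data
margin_of_error = 0.75  # acceptable error, here 5% of an assumed mean of 15
confidence = 0.95       # desired confidence level

z = norm.ppf(1 - (1 - confidence) / 2)        # two-sided critical value, ~1.96
n = (z * sigma / margin_of_error) ** 2
print("required sample size:", math.ceil(n))  # round up: 1537
```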
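A second sketch, for hypothesis testing: a two-sample t-test with SciPy, reading the p-value against a significance level agreed on in advance. The group names and values are made up for illustration.

```python
from scipy import stats

# Average session length (minutes) for two groups of users in an experiment.
group_a = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7, 13.2, 12.0]
group_b = [13.4, 13.9, 12.8, 14.1, 13.6, 13.0, 14.3, 13.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05  # significance level decided before looking at the results
if p_value < alpha:
    print("Reject the null hypothesis: the difference is unlikely to be chance.")
else:
    print("Fail to reject: the observed difference could easily be chance.")
```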
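And a third sketch, for regression: the explanatory variables are standardized before fitting so that the variable on the larger scale does not weigh more than the other. The library choice (scikit-learn) and the variable names are assumptions, not something prescribed by the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
ad_spend = rng.uniform(1_000, 50_000, size=200)  # dollars -- large scale
discount = rng.uniform(0, 0.3, size=200)         # fraction -- small scale
sales = 0.05 * ad_spend + 10_000 * discount + rng.normal(0, 500, size=200)

X = np.column_stack([ad_spend, discount])
X_scaled = StandardScaler().fit_transform(X)     # both variables now comparable

model = LinearRegression().fit(X_scaled, sales)
print("standardized coefficients:", model.coef_)
print("R^2:", model.score(X_scaled, sales))
```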

In addition to these pieces of advice, some tips should always be followed for any analysis:

  • Have a clear goal: before starting an analysis, you must keep in mind the questions that must be answered.
  • Don’t seek a perfect database: you will never have a perfect database and should not expect one. You are working with statistics, looking for a model that explains the general behavior of your data. As it is commonly known, there is no silver bullet, no quick solution for a big and difficult problem.
  • Believe that a simpler model can bring a satisfactory result: simple algorithms, such as a decision tree or a logistic regression, are often enough for a machine learning project. With a more complex approach like deep learning, the risk of overfitting is much higher. Exhaust the simple methods before moving to a sophisticated one (see the sketch after this list).
  • Do not force results: many companies, trying to stand out in the market, collect the data and then apply whatever narrative they want regardless of what it demonstrates. For example, they may change a chart from dollars to euros to achieve a certain curve, using a single variable to explain the whole picture. Torturing your data until it says what you want will not change reality, and the bill always comes: the consequences can be disastrous. The 2008 financial crisis happened largely because banks sold high-risk CDOs as if they were low risk, resulting in a worldwide crisis and a loss of $1.4 trillion in market value in 4 days. Use data to test hypotheses, not to hit targets.
  • Don’t ignore results because you don’t like them: a company can go bankrupt simply because its owners refuse to accept the data. You may not like the data and you may try to justify it, but you cannot ignore it.
  • Don’t try to answer all questions with a single model: creating a model is a learning and evolutionary process. It requires improvement and time to mature the use of technology and information. Remember that often the knowledge gained in the creation process generates more value than the model itself. In order not to lose motivation in the short term, start by setting some quick-win goals and then go deeper.
  • Accept the limitations of your data: if the data is not accurate, it is pointless to perform an analysis, since it will generate false answers. An analysis is built on comparisons, and the longer your history and the higher your data quality, the better.
  • Update your model periodically: as we work with dynamic data in a dynamic world, it is normal for your model to become obsolete and lose its predictive power. Always update your models and test their validity.
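
As a sketch of "exhaust the simple methods first", here is a logistic regression baseline on a toy dataset; only if such a baseline is clearly insufficient is it worth paying the cost, and the overfitting risk, of something more sophisticated. The dataset and parameters are illustrative choices, not from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simple, interpretable baseline: scaling + logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```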

Whether you like statistics or not, it is increasingly present in data analysis, and it is worth devoting effort to using it correctly to leverage your career and your business.

“He uses statistics as a drunken man uses lamp-posts — for support rather than illumination.” — Andrew Lang
