Data will talk if you’re willing to listen…

Jaya Verma
Women Data Greenhorns
Jul 19, 2018

It’s said that it’s easier to collect data than to discover knowledge. Data science tackles the harder part: it uses scientific methods, algorithms and processes to extract knowledge and insights from data. This, in turn, helps us make better decisions.

Research is closely related to data science because research seeks to understand existing behaviours and factors which influence change. Data science, in turn, seeks to find trends in data which can be used for making future predictions. Hence, both research and data science go hand in hand.

Visualising data to draw meaningful insights

What is Research?

The word research is derived from the 16th-century French word recerche meaning “act of searching closely”. It is a “systematic investigation into and study of materials and sources in order to establish facts and reach new conclusions.”

For example, if a company wants to increase the sales of its product, it can conduct research on the product’s market coverage, the effect of packaging and other factors that influence buying. The company can then experiment with the most decisive factors to increase sales.

Valid Research Study

For conducting a valid research study, we need to take the following into consideration:

  1. A good sample size
  2. A representative sample
  3. A sound methodology

We also need to consider extraneous factors, or lurking variables, while selecting our sample. These are external factors which can significantly affect the outcome if unaccounted for. To reduce their impact on test results, it is better to keep the conditions the same for everyone taking the test.

For example, suppose we are conducting a memory test and the temperature of the room is above normal. This may cause the participants to perform poorly in the test even though temperature does not influence memory directly. Thus temperature, in this case, is a lurking variable.

A population parameter is any summary number, like an average or a percentage, which describes the entire population; the population mean is written μ. A sample statistic is a summary number which describes the sample; the sample mean is written x̄. The difference between the sample statistic and the population parameter is known as the sampling error. For a good sample, the sampling error should be negligible.

Taking a sample of scores

Consider the above example, in which we are selecting three scores out of nine. Here, μ=71.33 and x̄=97. The sampling error x̄-μ=25.67 is very high because our sample includes only the larger numbers. It does not represent the population, which has large, medium and small values. This could have been avoided by including some smaller numbers in the sample or by taking a larger sample that covers all kinds of numbers.
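The arithmetic behind the sampling error can be sketched in a few lines of Python. The nine scores below are hypothetical (they are not the values from the figure), and the helper name `sampling_error` is only an illustration.

```python
from statistics import mean

# Hypothetical population of nine test scores (not the values from the figure).
population = [40, 45, 52, 60, 68, 75, 88, 92, 95]

def sampling_error(sample, population):
    """Return the sample mean minus the population mean (x̄ - μ)."""
    return mean(sample) - mean(population)

# A sample made only of the three largest scores overestimates the mean.
top_three = sorted(population)[-3:]
# A sample that mixes small, medium and large scores lands much closer.
mixed = [45, 68, 95]

mu = mean(population)
biased_error = sampling_error(top_three, population)
mixed_error = sampling_error(mixed, population)

print(f"mu = {mu:.2f}")
print(f"top three scores: sampling error = {biased_error:.2f}")
print(f"mixed scores:     sampling error = {mixed_error:.2f}")
```

With these made-up numbers, the sample of top scores misses the population mean by a wide margin, while the mixed sample stays close to it.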

The chosen sample should accurately represent the population. For this, we need to select a random sample. The best example of randomness is the arrangement of drops of rain. If we mark any two areas then both of them will have an equal probability of being struck by raindrops. Similarly, in a random sample, each subject has an equal chance of being selected.

A random sample is the opposite of a convenience sample in which subjects who are convenient to find are used in the study. Such a sample may not represent the population accurately.
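The contrast between the two kinds of sample can be sketched as follows. The population, its age groups and the idea that the first 200 people are the "convenient" ones are all assumptions made up for this illustration.

```python
import random

# Hypothetical population: 1,000 people tagged with an age group.
# The first 200 entries stand for the people who are convenient to find
# (for example, students on one campus), and all of them are young.
population = ["18-25"] * 200 + ["26-40"] * 400 + ["41-65"] * 400

def share(sample, group):
    """Fraction of the sample that belongs to the given group."""
    return sample.count(group) / len(sample)

rng = random.Random(42)  # fixed seed so the run is repeatable

convenience_sample = population[:100]        # whoever is easiest to reach
random_sample = rng.sample(population, 100)  # every person equally likely

print("young share in population:        ", share(population, "18-25"))
print("young share in convenience sample:", share(convenience_sample, "18-25"))
print("young share in random sample:     ", share(random_sample, "18-25"))
```

The convenience sample consists entirely of young people, while the random sample should land much closer to the population's true share.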

For example, if we are conducting research on a country’s population, our sample should contain people from all economic groups, ages, genders and cultures. The sample also needs to be large enough to include people from different socioeconomic groups. Ideally, a sample would include the entire population, but this is rarely possible. In general, a larger random sample represents the population better.
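A small simulation can illustrate why larger random samples tend to represent the population better. This is only a sketch: the population of scores is randomly generated, and the function name `average_abs_error` and the sample sizes are arbitrary choices.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 test scores between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(10_000)]
mu = mean(population)

def average_abs_error(sample_size, trials=200):
    """Average |x̄ - μ| over many random samples of the given size."""
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        errors.append(abs(mean(sample) - mu))
    return mean(errors)

results = {n: average_abs_error(n) for n in (5, 50, 500)}
for n, err in results.items():
    print(f"sample size {n:>3}: average sampling error = {err:.2f}")
```

The average sampling error should shrink as the sample size grows, although any single small sample can still get lucky or unlucky.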

Types of Research

The type of research method used depends on the objective of our study.

To show relationships between variables, we can conduct observational studies or surveys.

To show causation, that is, that one variable causes a change in another, we use controlled experiments.

Difference between observational study and controlled experiment

Observational Studies

In an observational study, we observe subjects or analyse already existing data without intervening. The members of the sample are not affected by the study.

An example of an observational study is taking a random sample of people and examining their social media habits to classify them as light, moderate or heavy social media users.

Surveys

In survey research, people are the main subjects and we ask them questions. The data collected is then analysed to draw meaningful conclusions. Surveys are often used to analyse constructs.

A construct is an abstract idea, such as happiness, love or guilt, which can be defined and measured in many ways. An operational definition is a description of a construct which allows us to measure it. For example, happiness can be defined as the number of times a person smiles in a day.

Benefits of surveys:

  • Easy way to get information about a population.
  • Inexpensive
  • Can be conducted remotely.
  • Anyone can access and analyse survey results if the survey owner is willing to share the data.

Disadvantages of surveys:

  • Untruthful responses
  • Biased responses
  • Response bias when respondents do not understand the questions.
  • Non-Response bias when respondents refuse to answer.

Controlled Experiments

In a controlled experiment, we assign people or things to groups and apply a treatment to one group, while the other group receives a fake treatment, or placebo, instead. The placebo is used to account for the placebo effect: if a person believes a fake treatment is real, their expectation of recovery can trigger physiological responses that make them feel better.

The group which receives the treatment is called the experimental group, and the group which receives the placebo is called the control group. If the treatment shows a significant benefit compared to the placebo, it is considered effective.

The participants are kept unaware of which treatment they have received. This is called blinding. It prevents bias from participants’ expectations: someone who knows they received a placebo might expect no improvement, and that expectation can influence the results of the experiment.

To reduce bias further, the researchers observing the participants can also be kept unaware of which treatment each participant has received. When only the participants are unaware of their treatment condition, it is a single-blind experiment. When both participants and researchers are unaware of the treatment condition, it is a double-blind experiment.

An example of a controlled experiment is studying the effect of sleep medication on a group of people. They are divided into two groups. One group receives the medicated pill and the other group receives an inactive pill (placebo). The researchers then analyse the results and compare which group slept better.
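The sleep medication example can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: the participant IDs, the baseline sleep time and the size of the drug effect are invented, so the output says nothing about any real medication.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the run is repeatable

# 40 hypothetical participant IDs.
participants = list(range(1, 41))

# Random assignment: shuffle, then split in half, so that lurking
# variables are spread across both groups on average.
rng.shuffle(participants)
experimental_group = participants[:20]  # receives the medicated pill
control_group = participants[20:]       # receives the placebo

def simulated_sleep_hours(treated):
    """Made-up outcome model: baseline sleep plus an assumed drug effect."""
    baseline = rng.gauss(6.5, 0.8)
    assumed_effect = 0.7 if treated else 0.0
    return baseline + assumed_effect

experimental_sleep = [simulated_sleep_hours(True) for _ in experimental_group]
control_sleep = [simulated_sleep_hours(False) for _ in control_group]

difference = mean(experimental_sleep) - mean(control_sleep)
print(f"experimental mean: {mean(experimental_sleep):.2f} h")
print(f"control mean:      {mean(control_sleep):.2f} h")
print(f"difference:        {difference:.2f} h")
```

In a real study, the difference between the group means would still need a significance test before the treatment could be called effective.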

Sometimes we can conduct an experiment twice with the same person in different conditions and analyse the effect of those conditions. This is called within-subject design. For example, we can test the effect of sleep medication on a single person at different times of the day and analyse how the medication influences sleep.

Visualising Relationships

We can visualise the relationship between variables using graphs. For example, in the graph below we are trying to predict sleep time from the number of caffeinated drinks consumed.

Relationship between Caffeinated drinks consumed and Sleep time

An independent variable is a factor which is controlled by researchers to see its effect on the dependent variable. In the above example, the number of caffeinated drinks consumed is an independent variable.

A dependent variable, or outcome variable, is a factor which is affected by changes in the independent variable. In the above example, sleep time is a dependent variable.

Independent and Dependent variables in an experiment

When visualising relationships, the independent/predictor variable is plotted on the x-axis and the dependent variable is plotted on the y-axis.
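To make the idea of predicting the dependent variable from the independent one concrete, here is a minimal least-squares sketch. The observations are hypothetical (they are not read from the graph), and `fit_line` and `predict_sleep` are illustrative helper names.

```python
# Hypothetical observations: (caffeinated drinks per day, hours of sleep).
observations = [(0, 8.0), (1, 7.5), (2, 7.0), (3, 6.5), (4, 5.5), (5, 5.0)]

drinks = [d for d, _ in observations]       # independent variable (x-axis)
sleep_hours = [s for _, s in observations]  # dependent variable (y-axis)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(drinks, sleep_hours)

def predict_sleep(n_drinks):
    """Predicted hours of sleep for a given number of drinks."""
    return slope * n_drinks + intercept

print(f"slope = {slope:.2f} hours per drink, intercept = {intercept:.2f} hours")
print(f"predicted sleep after 3 drinks: {predict_sleep(3):.2f} hours")
```

A negative slope here only describes the made-up data; on its own, a fitted line does not show that caffeine causes shorter sleep.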

Correlation does not imply Causation

Sometimes events which coincide with each other are not necessarily caused by each other. We also need to take the lurking variables into account.

An example is the Golden Arches Theory of Conflict Prevention, which states that no two countries that both have a McDonald’s have gone to war with each other since McDonald’s opened there. A plausible explanation is that countries with McDonald’s are open to globalisation and foreign investment, and are therefore less inclined to go to war with other countries. This does not imply that McDonald’s itself has any influence on peace between countries.
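The role of a lurking variable can also be sketched with simulated data. In this made-up scenario, temperature drives both ice cream sales and sunburn cases; neither quantity influences the other, yet they end up strongly correlated. All numbers and variable names are assumptions for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Lurking variable: daily temperature over a hypothetical 100 days.
temperature = [rng.uniform(15, 40) for _ in range(100)]

# Both quantities depend on temperature, but not on each other.
ice_cream_sales = [20 * t + rng.gauss(0, 40) for t in temperature]
sunburn_cases = [3 * t + rng.gauss(0, 8) for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation between ice cream sales and sunburn cases: {r:.2f}")
```

A high correlation here comes entirely from the shared dependence on temperature, which is why a strong correlation alone is not evidence of causation.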

Below are some examples of spurious correlations which happen to be just coincidences.

Correlation between US spending on science, space and technology and suicides by hanging, strangulation and suffocation
Correlation between per capita cheese consumption and no. of people who died by becoming tangled in their bed sheets
Correlation between divorce rate in Maine and per capita consumption of margarine

Conclusion

Research methods can be used to extract knowledge from data, but not every pattern the data presents is true. A pattern can be the result of mere coincidence rather than an actual relationship between the variables. The results can also be distorted by external factors unknown to the researcher.

So to conclude, data will talk to you if you’re willing to listen but not everything which it says is the truth.
