Survey and Sampling

Why does a Data Scientist have to know about survey methodology?

Fabrice Mesidor
Analytics Vidhya
8 min read · Sep 4, 2019


An unprecedented amount of data is being produced every day, to the delight of data scientists. We have so much data available that we can integrate an almost unimaginable number of variables into our models. The more the better. But will we always be able to find existing data for our research? Are there cases where connecting to a database and using our SQL knowledge will not be enough to gather the data we need? How do we solve that? Well, we will have to collect the data ourselves. In what follows, I will share a bit of knowledge about survey methodology and show why it is important for a data scientist.

Quick definition

A survey is “a systematic method for gathering information from (a sample of) individuals for the purposes of describing the attributes of the larger population of which the individuals are members.” (Enanoria, 2005)

A single survey is made of at least a sample (or full population in the case of a census), a method of data collection (e.g., a questionnaire) and individual questions or items that become data that can be analyzed statistically. A single survey may focus on different types of topics such as preferences (e.g., for a presidential candidate), opinions (e.g., should abortion be legal?), behavior (smoking and alcohol use), or factual information (e.g., income), depending on the purpose of our research or model.

As the survey is based on a sample of the population (in order to determine its attributes), the success of the research is highly dependent on the representativeness of the sample.

To choose the sample, we need a list of all members of the population of interest: the sampling frame. Remember that the goal of a survey is not to describe the sample, but the larger population, much as in any statistical test or model. We want to estimate the value of some attribute of the population, but we use only a part of it to reach our conclusion. The population parameter is the true value of a population attribute. What we get from the survey is the sample statistic: an estimate, based on sample data, of a population parameter. It is important to always keep this principle in mind, as it will guide us in the choice of our sample and the questions that we formulate.

This ability to generalize depends on the representativeness of the sample, as stated earlier. One common source of error is selection bias. Selection bias results when the procedures used to select a sample lead to over-representation or under-representation of some significant aspect of the population. For instance, if the population of interest consists of 65% females and 35% males, and the sample consists of 40% females and 60% males, females are underrepresented while males are overrepresented. Different sampling techniques can be used to minimize selection bias.
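As a rough illustration of the example above, the short sketch below compares the share of each group in the sample with its share in the population. The function name `representation_gap` is made up for this illustration, and the numbers are simply the ones from the paragraph.

```python
def representation_gap(population_share, sample_share):
    """Return sample share minus population share, in percentage points, per group."""
    return {group: sample_share[group] - population_share[group]
            for group in population_share}


# Shares (in %) taken from the example in the text.
population = {"female": 65, "male": 35}
sample = {"female": 40, "male": 60}

gaps = representation_gap(population, sample)
for group, gap in gaps.items():
    if gap > 0:
        status = "over-represented"
    elif gap < 0:
        status = "under-represented"
    else:
        status = "balanced"
    print(f"{group}: {gap:+d} points ({status})")
```

A check like this only flags imbalance on the attributes we already know about the population; it says nothing about attributes we cannot compare.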

In a few words, a sample is a smaller part of a population used to describe what the population looks like. A sample has to be representative of the population it came from.

How to conduct a survey

There are several ways of administering a survey. The choice between administration modes is influenced by several factors, including costs (what is our budget), coverage of the target population (what percentage do we want to cover), flexibility of asking questions, respondents’ willingness to participate and response accuracy.

The different methods have different advantages and have an impact on how respondents react to the questions. The most common modes of administration are telephone, mail (post), online surveys, personal in-home surveys, personal mall or street intercept surveys, and hybrids of the above.

Questionnaires are the most commonly used tool to gather the data needed in research. A questionnaire is a set of questions used in a survey. It is a data gathering method used to collect the different views of a group of people from a particular population, so that those views can be analyzed and interpreted. However, the results of a survey are worthless if the questionnaire is poorly written. Questionnaires should produce valid and reliable measures of demographic variables and of individual differences; otherwise the results may be misinterpreted.

Because the questionnaire is a critical element of the inquiry, writing it can be challenging. It should gather all the necessary information, but at the same time it should not bore or confuse your respondents. The data collected with the questionnaire are analyzed statistically, and the results will be used for the development of an individual or a community.

Polls and survey questionnaires are both methods of gathering information and data from an audience or population, but the two have some slight differences.

Burgess (2001) offers a basic structure to be considered when developing a survey: 1) Define your research aims. 2) Identify the population and sample. 3) Decide how to collect replies. 4) Design your questionnaire. 5) Run a pilot survey. 6) Carry out the main survey. 7) Analyze the data. To complete the list, we would add 8) Critical evaluation.

Survey Sampling Methods

The sampling method is the way we select observations from the population before conducting the survey. Sampling methods fall into two categories: probability and non-probability sampling.

How we select our observations from the population: sampling method

With probability sampling methods, each population element has a known (non-zero) chance of being part of the sample. The main types of probability sampling methods are simple random sampling, stratified sampling, cluster sampling, multistage sampling, and systematic random sampling.

Simple random sampling. Simple random sampling refers to any sampling method that has the following properties: the population consists of N objects, the sample consists of n objects, and all possible samples of n objects are equally likely to occur. One way to draw such a sample is the lottery method.
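A minimal sketch of the lottery idea, using Python's standard `random.sample`, which draws without replacement. The population of numbered IDs and the helper name `simple_random_sample` are assumptions made for the illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed is used only to make the example repeatable
    return rng.sample(population, n)


# Hypothetical sampling frame: N = 100 member IDs.
population = list(range(1, 101))
sample = simple_random_sample(population, n=10, seed=42)
print(sample)
```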

Stratified sampling. With stratified sampling, the population is divided into groups, based on some characteristic (geography, gender…). Then, within each group, a probability sample (often a simple random sample) is selected. In stratified sampling, the groups are called strata.
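To make this concrete, here is a small sketch that groups a hypothetical population by gender and then draws a simple random sample of the same fraction inside each stratum (proportional allocation). The helper `stratified_sample`, the 20% fraction and the toy population are illustrative assumptions, not a prescribed procedure.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Take a simple random sample of the given fraction within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for name in sorted(strata):
        members = strata[name]
        # At least one unit per stratum, never more than the stratum holds.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 65 females and 35 males, each unit is (gender, id).
population = [("F", i) for i in range(65)] + [("M", i) for i in range(35)]
sample = stratified_sample(population, stratum_of=lambda u: u[0], fraction=0.2, seed=1)

females = sum(1 for gender, _ in sample if gender == "F")
males = sum(1 for gender, _ in sample if gender == "M")
print(females, males)
```

With proportional allocation the sample keeps the 65/35 split of the population, which is exactly what the selection bias example earlier was missing.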

Cluster sampling. With cluster sampling, every member of the population is assigned to one, and only one, group. Each group is called a cluster. A sample of clusters is chosen, using a probability method (often simple random sampling). Only individuals within sampled clusters are surveyed.
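The following sketch shows one way this could look in code: whole clusters are drawn at random, and everyone inside a chosen cluster is surveyed. The schools, the student labels and the helper `cluster_sample` are all hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical population: 5 schools (clusters) with 4 students each.
clusters = {
    f"school_{c}": [f"school_{c}_student_{s}" for s in range(4)]
    for c in "ABCDE"
}
surveyed = cluster_sample(clusters, n_clusters=2, seed=7)
for school, students in surveyed.items():
    print(school, students)
```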

Multistage sampling. With multistage sampling, we select a sample by using combinations of different sampling methods. For example, in Stage 1, we might use cluster sampling to choose clusters from a population. Then, in Stage 2, we might use simple random sampling to select a subset of elements from each chosen cluster for the final sample.
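The two-stage example just described might be sketched as follows: stage 1 draws clusters at random, stage 2 draws a simple random sample inside each chosen cluster. The districts, household labels and the helper `two_stage_sample` are invented for the illustration.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters. Stage 2: simple random sample within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1
    sample = []
    for name in chosen:
        members = clusters[name]
        k = min(n_per_cluster, len(members))
        sample.extend(rng.sample(members, k))  # stage 2
    return sample


# Hypothetical population: 6 districts (clusters) with 10 households each.
clusters = {
    f"district_{d}": [(f"district_{d}", h) for h in range(10)]
    for d in range(6)
}
sample = two_stage_sample(clusters, n_clusters=3, n_per_cluster=2, seed=11)
print(sample)
```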

Systematic random sampling. With systematic random sampling, we create a list of every member of the population. From the list, we randomly select the first sample element from the first k elements on the population list. Thereafter, we select every kth element on the list. This method is different from simple random sampling, since not every possible sample of n elements is equally likely.
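A small sketch of this procedure, assuming the population list is already in hand: a random start is picked among the first k positions, then every kth element is taken. The helper `systematic_sample` and the list of 100 IDs are assumptions made for the example.

```python
import random


def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k elements, then every k-th element."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based index within the first k elements
    return population[start::k]


# Hypothetical population list of N = 100 IDs; k = 10 gives n = 10 elements.
population = list(range(1, 101))
sample = systematic_sample(population, k=10, seed=3)
print(sample)
```

Only k different samples can ever come out of this procedure (one per possible start), which is why the method differs from simple random sampling.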

With non-probability sampling methods, we do not know the probability that each population element will be chosen, and/or we cannot be sure that each population element has a non-zero chance of being chosen.

Voluntary sample. A voluntary sample is made up of people who self-select into the survey. Often, these people have a strong interest in the main topic of the survey (for example, an online poll).

Convenience sample. A convenience sample is made up of people who are easy to reach (for example, shoppers interviewed at a local mall).

The key benefit of probability sampling methods is that, because every element has a known chance of being selected, the sample can be expected to be representative of the population, up to a sampling error that can be quantified. This is what makes the statistical conclusions valid.

Non-probability sampling methods offer two potential advantages — convenience and cost. The main disadvantage is that non-probability sampling methods do not allow you to estimate the extent to which sample statistics are likely to differ from population parameters. Only probability sampling methods permit that kind of analysis.

As a Data Scientist, why master survey methods?

Recently, I shared some of the skills that a data scientist needs. In case you missed it: click here. Long story short: data scientists live at the intersection of programming, statistics, and domain knowledge. As Josh Wills put it, “data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” Although survey methodology belongs mainly to the field of statistics, it can be of great help to any data scientist. Let me give some other reasons (beyond the one mentioned above) why I think it adds value.

Even with the expansion of data available for commercial use, online or via social media, surveys are still an important source of data. Some enterprises are not yet at a stage where they have their data stored and ready for use. Some research will investigate new phenomena. The data scientist should be able to lead the data collection and provide relevant guidelines for gathering the data needed to conduct the studies. Even when a statistician is responsible for this task, it is incumbent on the data scientist to understand all the steps, to avoid any surprises during the data exploration and model building phases.

Table showing the different types of data used from a survey in 2014

Data scientists are interested in understanding what people think and do, and surveys are one of the most straightforward ways to collect that kind of data. It is important to understand the ideas behind the different survey methods in order to know how and when to use them. Social media might contain a lot of information about people’s sentiments or habits. However, a survey can be more specific (the questions are chosen based on the research needs), and the data collected can be of better value for estimating the population parameter. Many online tools are available that make conducting a survey cheaper and less time-consuming.

Also, I have to mention that when using administrative data or data from a database, it is wise to question the source of the data. The data may well come from a survey. Having a look at the questionnaire, understanding the sampling method, and knowing how the sample size was calculated would help considerably when interpreting our model results.

Survey methods and sampling constitute a specific domain of statistics, and I cannot share all the details in one post. Sample size (I didn’t want to share any mathematical formulas), sampling error, and types of survey questions are some of the points that I didn’t cover here. Understanding the fundamentals of survey methods and deepening those notions can increase your worth as a data scientist.

Thoughts? Leave a comment and we can discuss them.
