Understanding SQL Data Analytics (Part 1)

Anmol Adhikari · Published in Analytics Vidhya · Aug 25, 2020
Data Analytics Zero to Hero (Part 1)

Introduction to Data

Data has fundamentally changed the 21st century. Using data, insights that would have been practically impossible to derive 50 years ago can now be found with just a few lines of computer code. Two of the most essential tools in this revolution are the relational database and its primary language, Structured Query Language (SQL).
While we could, in theory, analyze all data by hand, computers are far better at the task and are by far the preferred tool for storing, organizing, and processing data. Among the most essential of these data tools are the relational database and the language used to access it, SQL. These two technologies have been cornerstones of data processing and continue to be the backbone of most companies that deal with large amounts of data.
Companies use SQL as the primary method for storing much of their data. Furthermore, companies now take much of this data and put it into specialized databases called data warehouses and data lakes so that they can perform advanced analytics on it. Virtually all of these data warehouses and data lakes are accessed using SQL. We'll be looking at how to work with SQL using analytics platforms such as data warehouses.

Here, we will learn about data and its types.

Understanding Data

Data is a collection of facts, such as numbers, words, measurements, observations, or just descriptions of things. For example, each student's test score is one piece of data.

Types of Data

Data can be broken down into two main categories, quantitative and qualitative:

Quantitative data is a measurement that can be described as a number; qualitative data is data that is described by non-numerical values, such as text. Your height is data that would be described as quantitative, while describing yourself as either a "smoker" or a "non-smoker" would be considered qualitative data.

Quantitative data can be further classified into two subcategories: discrete and continuous. Discrete quantitative values are values that can take on a fixed level of precision, usually integers. A continuous variable is a value that, in theory, could be divided into an arbitrary level of precision. For example, the number of doctor visits a patient makes is discrete, while a patient's height is continuous; in general, if a value could be described with ever higher precision, it is considered continuous.
It's important to note that qualitative data can often be converted into quantitative data, and quantitative data can likewise be converted into qualitative data.
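As a minimal sketch of what these conversions look like in SQL, the following queries encode a qualitative label as a number and bucket a quantitative value into labels. The patients table and the smoker_status and height_cm columns are assumptions for illustration; the CASE syntax is standard SQL:

```sql
-- Qualitative -> quantitative: encode smoking status as 0/1
-- (hypothetical "patients" table and columns)
SELECT
    CASE WHEN smoker_status = 'smoker' THEN 1 ELSE 0 END AS is_smoker
FROM patients;

-- Quantitative -> qualitative: turn a continuous height into categories
SELECT
    CASE
        WHEN height_cm < 160 THEN 'short'
        WHEN height_cm < 180 THEN 'average'
        ELSE 'tall'
    END AS height_category
FROM patients;
```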

Understanding Data Analytics and Statistics

Raw data by itself is really just a group of values. However, it is not very interesting in this form. It is only when we start to find patterns in the data and begin to interpret them that we can do interesting things, such as make predictions about the future and identify unexpected changes. These patterns in the data are referred to as information. A large, organized collection of persistent and extensive information and experience that can be used to describe and predict phenomena in the real world is referred to as knowledge.

Data analysis is the process by which we convert data into information and, thereafter, knowledge. When data analysis is combined with making predictions, we have data analytics. There are many tools available to make sense of data. One of the most powerful tools of data analysis is applying mathematical methods to datasets. One of these mathematical methods is statistics.

Types of Statistics

Statistics can be further divided into two subcategories: descriptive statistics and inferential statistics.

Descriptive statistics are used to describe data. Descriptive statistics on a single variable in a dataset are referred to as univariate analysis, while descriptive statistics that look at two or more variables at the same time are referred to as multivariate analysis.

Inferential statistics, by contrast, think of datasets as a sample, or a small portion of measurements, from a larger group called a population. For example:

  • A survey of 30,000 voters in a national election is a sample of the entire population of voters in a country. Inferential statistics are used to try to infer the properties of a population based on the properties of a sample.
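As a hedged illustration of the sample/population distinction in SQL, the following query draws a random sample from a hypothetical voters table (the table name is an assumption; ORDER BY random() is PostgreSQL syntax and can be slow on very large tables):

```sql
-- Draw a random sample of 30,000 rows from the full population
SELECT *
FROM voters
ORDER BY random()
LIMIT 30000;
```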

Assume that you are a health policy analyst and are given the following dataset with information about patients:

Health Data (Packt)

When a dataset is provided, it's often useful to classify the underlying data. In this case, the unit of observation for the dataset is an individual patient, because each row represents an individual observation, which is a unique patient. There are 10 data points, each with 5 variables. Three of the columns, Year of Birth, Height, and Number of Doctor Visits, are quantitative because they are represented by numbers. Two of the columns, Eye Color and Country of Birth, are qualitative.
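As a sketch of how this dataset might be stored in a relational database (the column names and types are assumptions based on the description above):

```sql
CREATE TABLE patients (
    year_of_birth     INTEGER,        -- quantitative, discrete
    height_cm         NUMERIC(5, 1),  -- quantitative, continuous
    doctor_visits     INTEGER,        -- quantitative, discrete
    eye_color         TEXT,           -- qualitative
    country_of_birth  TEXT            -- qualitative
);
```

The later SQL sketches in this article query this hypothetical table.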

Descriptive Statistics

Descriptive statistics is one of the ways in which we can analyze data in order to understand it. Both univariate analysis and multivariate analysis can give us insight into what might be going on with a phenomenon. In this section, we will take a closer look at the basic mathematical techniques that we can use to better understand and describe a dataset.

Univariate Analysis

One of the main branches of statistics is univariate analysis. These methods are used to understand a single variable in a dataset. Here, we will look at some of the major techniques.

Data Frequency Distribution

The distribution of data is simply a count of the number of values that are in a dataset. For example, say that we have a dataset of 1,000 medical records, and one of the variables in the dataset is eye color. If we look at the dataset and find that 700 people have brown eyes, 200 people have green eyes, and 100 people have blue eyes, then we have just described the distribution of the dataset. Specifically, we have described the absolute frequency distribution. If we were to describe the counts not by the actual number of occurrences in the dataset but as the proportion of the total number of data points, then we are describing its relative frequency distribution. In the eye color example, the relative frequency distribution would be 70% brown eyes, 20% green eyes, and 10% blue eyes.

It's easy to calculate the distribution when the variable can take on a small number of fixed values, such as eye color. But what about a quantitative variable that can take on many different values, such as height? The general way to calculate distributions for these types of variables is to make interval "buckets" that the values can be assigned to, and then calculate distributions using these buckets. For example, height can be broken down into 5-cm interval buckets to make an absolute distribution. We can then divide each row in the table by the total number of data points (that is, 10,000) to get the relative distribution.
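As a sketch in SQL (PostgreSQL syntax, against the hypothetical patients table from earlier), frequency distributions map naturally onto GROUP BY:

```sql
-- Absolute and relative frequency distribution of eye color
SELECT
    eye_color,
    COUNT(*)                                 AS absolute_frequency,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS relative_frequency_pct
FROM patients
GROUP BY eye_color
ORDER BY absolute_frequency DESC;

-- Bucketing a continuous variable (height) into 5-cm intervals
SELECT
    FLOOR(height_cm / 5) * 5 AS bucket_start_cm,
    COUNT(*)                 AS absolute_frequency
FROM patients
GROUP BY 1
ORDER BY 1;
```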

What is Central Tendency?

One of the most common questions asked of a variable in a dataset is what a typical value for that variable is. This value is often described as the central tendency of the variable. There are many numbers calculated from a dataset that are often used to describe its central tendency, each with its own advantages and disadvantages. Some of the ways to measure central tendency include the following (see the SQL sketch after this list):

  1. Mode: The mode is simply the value that comes up most often in the distribution of a variable. In the eye color example above, the mode would be "brown eyes" because it occurs most often in the dataset. If multiple values are tied for the most common value, then the variable is called multimodal, and all of the most common values are reported. If no value is repeated, then there is no mode for that set of values. The mode tends to be useful when a variable can take on a small, fixed number of values. However, it is tricky to calculate when a variable is a continuous quantitative variable, such as in our height example. With these variables, other calculations are more suitable for determining the central tendency.
  2. Mean: The average of a variable (also known as the mean) is the value calculated when you take the sum of all of the values of the variable and divide by the number of data points. For example, say you had a small dataset of ages: 26, 25, 31, 35, and 29. The average of these ages would be 29.2, because that is the number you get when you sum the five numbers and then divide by 5, that is, the number of data points. The mean is easy to calculate and, generally, does a good job of describing a "typical" value for a variable. No wonder it is one of the most commonly reported descriptive statistics in the literature.
  3. Median: The median is a somewhat unusual measure of central tendency, but it has some serious advantages over the mean. To calculate the median, take the values of a variable and sort them from lowest to highest, and then determine the middle number. For an odd number of data points, this is simply the middle value of the ordered data. If there is an even number of data points, take the average of the two middle numbers. While the median is a bit unwieldy to calculate, it is less affected by outliers than the mean. To illustrate this, we can calculate the median of the skewed age dataset 26, 25, 31, 35, 29, and 82. This time, the median of the dataset is 30. This value is much closer to the typical value of the dataset than the mean of 38.
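Here is a minimal SQL sketch of all three measures, assuming the hypothetical patients table also has an age column; MODE() and PERCENTILE_CONT are PostgreSQL ordered-set aggregates and may be spelled differently in other databases:

```sql
SELECT
    MODE() WITHIN GROUP (ORDER BY eye_color)         AS mode_eye_color,
    AVG(age)                                         AS mean_age,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY age) AS median_age
FROM patients;
```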

What is Dispersion?

Another property that is of interest in a dataset is how close together the data points of a variable are. For example, the number sets [100, 100, 100] and [50, 100, 150] both have an average of 100, but the numbers in the second group are more spread out than those in the first. This property of describing how the data is spread out is called dispersion. Here are some of the most common ways to assess dispersion (see the SQL sketch after this list):

  1. Range: The range is simply the difference between the highest and lowest values for a variable. It is incredibly easy to calculate but is very susceptible to outliers. It also does not provide much information about the spread of values in the middle of the dataset.
  2. Standard deviation: The standard deviation is simply the square root of the average of the squared differences between each data point and the mean. Its value ranges from 0 all the way to positive infinity. The closer the standard deviation is to 0, the less the numbers in the dataset vary; if the standard deviation is 0, all of the values for a dataset variable are the same. One subtle distinction to note is that there are two different formulas for the standard deviation, the population standard deviation and the sample standard deviation. When in doubt, use the sample standard deviation, as it is considered to be more conservative; in practice, the difference between the two formulas is very small when there are many data points. The standard deviation is generally the quantity used most often to describe dispersion. However, like the range, it can be affected by outliers, though not as severely, and it can be fairly involved to calculate. Modern tools, however, usually make it easy to compute. One final note is that, occasionally, you may see a related value, the variance, listed as well. This quantity is simply the square of the standard deviation:
Formula for Standard Deviation (Slideshare)
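Since the figure is not reproduced here, the two standard forms (a reconstruction of the well-known formulas, not taken from the original image) are:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

where $\sigma$ is the population standard deviation and $s$ is the sample standard deviation.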

  3. Interquartile range (IQR): The interquartile range is the difference between the first quartile, Q1 (also known as the lower quartile), and the third quartile, Q3 (also known as the upper quartile). The IQR, unlike the range and standard deviation, is robust to outliers, and so, while it is one of the most complicated of these quantities to calculate, it provides a more robust way to measure the spread of datasets. In fact, the IQR is often used to define outliers: if a value in a dataset is smaller than Q1 - 1.5 × IQR or larger than Q3 + 1.5 × IQR, then the value is considered an outlier.
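All three dispersion measures can be computed in one pass. The following is a sketch in PostgreSQL against the hypothetical patients table (STDDEV_SAMP, STDDEV_POP, and PERCENTILE_CONT are PostgreSQL aggregates; other dialects vary):

```sql
SELECT
    MAX(height_cm) - MIN(height_cm) AS height_range,
    STDDEV_SAMP(height_cm)          AS sample_stddev,
    STDDEV_POP(height_cm)           AS population_stddev,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY height_cm)
      - PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY height_cm) AS iqr
FROM patients;
```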

What is Bivariate Analysis?

Here, we will discuss how to find patterns with two variables using bivariate analysis.

Scatterplots

A general principle you will find in analytics is that graphs are incredibly helpful for finding patterns. Just as histograms can help you to understand a single variable, scatterplots can help you to understand two variables. Scatterplots can be produced quite easily using your favorite spreadsheet. Many different patterns are worth looking out for within a scatterplot. The most common pattern people look for is an upward or downward trend between the two variables; that is, as one variable increases, does the other variable increase or decrease? Such a trend indicates that there may be a predictable mathematical relationship between the two variables. For a better understanding, look at the image below:

Upward and downward trends between two variables (Visme)

There are also lots of trends that are worth looking out for that are not linear, including quadratic, exponential, inverse, and logistic trends.

Trends (freecodecamp)

While trends are useful for understanding and predicting patterns, detecting changes in trends is often more important. Changes in trends usually indicate a critical change in whatever you are measuring and are worth examining further for an explanation. A real-world example of such a change would be when the stock of a company begins to drop after rising for a long time.

Example of a change in trend (Packt)

Another pattern people tend to look for is periodicity, that is, repeating patterns in the data. Such patterns can indicate that two variables may have cyclical behavior and can be useful in making predictions. The following figure shows an example of periodic behavior:

Example of Periodic Behaviour (Researchgate)

Scatterplots are also used to detect outliers. When most points in a graph appear to be in a specific region of the graph, but some points are quite far away from it, those points may be outliers with regard to the two variables. When doing further bivariate analysis, it may be wise to remove these points in order to reduce noise and produce better insights. The following figure shows some points that may be considered outliers:

A scatterplot with two outliers
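One hedged way to automate this cleanup in SQL is to filter with the 1.5 × IQR rule described earlier (PostgreSQL syntax, hypothetical patients table):

```sql
-- Keep only rows whose height falls inside the IQR-based outlier fences
WITH bounds AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY height_cm) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY height_cm) AS q3
    FROM patients
)
SELECT p.*
FROM patients p, bounds b
WHERE p.height_cm BETWEEN b.q1 - 1.5 * (b.q3 - b.q1)
                      AND b.q3 + 1.5 * (b.q3 - b.q1);
```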

Pearson Correlation Coefficient

One of the most common trends in analyzing bivariate data is the linear trend. Often, though, some linear trends are weak, while others are strong in terms of how well a linear trend fits the data. We will see examples of scatterplots with their line of best fit. This is a line calculated using a technique known as ordinary least squares (OLS) regression. Although OLS is beyond the scope of this course, understanding how well bivariate data fits a linear trend is a valuable tool for understanding the relationship between two variables:

Strong Linear Trend

Now, let's look at a weak linear trend:

Weak Linear Trend

Now, let's see both figures together:

Strong & Weak Correlation
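As a hedged aside, while OLS regression itself is beyond the scope of this course, PostgreSQL exposes the line of best fit directly through built-in regression aggregates (the weight_kg column is a hypothetical addition to the patients table):

```sql
SELECT
    REGR_SLOPE(weight_kg, height_cm)     AS slope,
    REGR_INTERCEPT(weight_kg, height_cm) AS intercept
FROM patients;
```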

One method for quantifying linear correlation is to use the Pearson correlation coefficient. The Pearson correlation coefficient, often represented by the letter r, is a number ranging from -1 to 1 that indicates how well a scatterplot fits a linear trend. To calculate the Pearson correlation coefficient, r, we use the following formula:

Formula to calculate the Pearson correlation coefficient (Datascience)
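Since the formula image is not reproduced here, the standard form (a reconstruction of the well-known definition, not taken from the original figure) is:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

In SQL, this calculation is usually a single aggregate; for example, PostgreSQL's CORR(y, x) returns the Pearson correlation coefficient (hypothetical columns on the patients table):

```sql
SELECT CORR(weight_kg, height_cm) AS pearson_r
FROM patients;
```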

In this lesson, we learned about data and the key concepts behind SQL data analytics. We will continue this in the next part.

Stay tuned and follow Analytics Vidhya.
