Back to the basics — Knowing your DATA
So you have a dataset and you are trying to make sense out of it? Let's start with the very basics and understand what exactly your data consists of.
A dataset is a collection of entities. A sales dataset is a collection of sales records, and a university dataset is a collection of student records.
Entities are described by attributes (also known as columns, features, or variables). Each attribute holds one specific piece of information collected for an entity. For example, in a sales dataset an entity can be a record that represents the sales for a store, and its attributes can be Store Name, Region, Sales, etc. A collected value for an attribute is also known as an observation.
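Discrete Attributes: These have a finite or countably infinite set of values, for example the number of items sold.
Continuous Attributes: If the attribute is not discrete, it's continuous (haha!). These do not have a countable set of values. They are typically real numbers that can take any value within a range.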
An attribute can contain different kinds of values, and the type of value in an attribute can be:
- Nominal: Values that represent a name or category. These do not have a meaningful order. Nominal attributes are also known as categorical attributes. For example, Name, City, etc. Nominal attributes can also be stored as numbers, for example a coding of gender (0 = Male / 1 = Female) or of colors (1 = Red / 2 = Black). In such cases, a mathematical aggregation such as the sum or the mean of the codes does not produce anything sensible and should be avoided. The mode, however, can still be used as a measure of central tendency (more on this below).
- Ordinal: The values have a meaningful order or ranking, but the magnitude of the difference between them is not known. For example, the size of a cup: small / medium / large. We know that large ranks above medium, but not by how much. The central tendency of these variables can be described using the median or the mode.
- Binary: There are only two possible values, for example True / False. If both values are equally important and carry the same weight, the attribute is said to be symmetric. If one outcome matters more than the other, for example a positive versus a negative medical test result, the attribute is called asymmetric.
- Numeric: Numbers! These are measurable quantities represented by integer or real values, for example the Sales for a store. Numeric attributes are often called continuous variables, although integer-valued ones such as counts are discrete. The central tendency of numeric attributes can be computed using the mean, median, or mode.
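To make the nominal and ordinal cases concrete, here is a minimal sketch using only Python's standard library. The color coding, the cup-size ranks, and the sample values are made up for illustration.

```python
from statistics import mode, median_low

# Nominal attribute stored as numeric codes (hypothetical coding: 1 = Red, 2 = Black).
color_codes = [1, 2, 2, 1, 2]
color_names = {1: "Red", 2: "Black"}

# The mode is meaningful for nominal data. The mean of the codes (1.6) is not a color.
most_common_color = color_names[mode(color_codes)]

# Ordinal attribute: map each label to its rank so that the order is respected.
size_rank = {"small": 0, "medium": 1, "large": 2}
rank_to_size = {rank: size for size, rank in size_rank.items()}
cup_sizes = ["small", "large", "medium", "medium", "large"]

# median_low always returns an actual data point, so it maps back to a label.
median_size = rank_to_size[median_low([size_rank[s] for s in cup_sizes])]

print(most_common_color)  # Black
print(median_size)        # medium
```

The ranks are only used for ordering. Averaging them would pretend that the gap between small and medium equals the gap between medium and large, which an ordinal attribute does not tell us.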
So now we know the types of variables (attributes) that our dataset can consist of. To get a deeper understanding of our data, we use measures of central tendency and dispersion. Together, these measures help us understand how the data is distributed and whether it contains any outliers.
Measures of Central Tendency
- Mean: Also known as the arithmetic mean, this is the most widely used measure of the center of the data. Computing it is super easy: just sum the values and divide by the count of the values :). In Python, both NumPy and the built-in statistics module provide a mean function.
- Median: The middle value of an attribute. It is useful when the dataset has outliers, because the mean is pulled toward extreme values at the high or low end and can give a misleading picture of the data. To compute the median, sort the attribute values in ascending order and pick the middle value. With an even number of values, the median is the average of the two middle values. In Python, both NumPy and the statistics module provide a median function.
- Mode: The value that occurs most frequently. A dataset may have more than one mode: with two modes it is bimodal, with three it is trimodal, and in general it is multimodal. The mode is used mainly for categorical values. In Python, the statistics module provides mode and multimode functions.
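The three measures above can be sketched with the standard library's statistics module. The sales figures are hypothetical, with one deliberately extreme value to show how the mean and the median react to an outlier. NumPy's mean and median functions should give the same results on this data.

```python
import statistics

# Hypothetical daily sales figures; 400 is a deliberate outlier.
sales = [12, 15, 15, 18, 20, 400]

mean_sales = statistics.mean(sales)      # sum of the values / count of the values
median_sales = statistics.median(sales)  # even count: average of the two middle values
mode_sales = statistics.mode(sales)      # most frequent value

print(mean_sales)    # 80
print(median_sales)  # 16.5
print(mode_sales)    # 15

# multimode (Python 3.8+) returns every value tied for the highest frequency.
modes = statistics.multimode([1, 1, 2, 2, 3])
print(modes)         # [1, 2]
```

A single extreme value drags the mean up to 80, while the median stays at 16.5, close to the typical day.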
Measures of Dispersion or Spread
- Variance and Standard deviation: They indicate how spread out the data distribution is. Variance is the mean of the squared differences between the observations and the overall mean.
Standard deviation is the square root of the variance, so it is expressed in the same units as the attribute. A higher value means the data is spread over a large range of values, and a lower value means the values are close to the mean. In Python, both NumPy and the statistics module provide functions for the variance and the standard deviation.
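Here is a minimal sketch of the definition above, again with made-up observations. It computes the variance by hand and then with the statistics module. Note that the definition in the text (dividing by the number of observations) is the population variance. The sample variance divides by n - 1 instead.

```python
import statistics

# Hypothetical observations; their mean is 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance by hand: mean of the squared differences from the mean.
mean = statistics.mean(values)
manual_variance = sum((x - mean) ** 2 for x in values) / len(values)

pop_variance = statistics.pvariance(values)    # divides by n
pop_std = statistics.pstdev(values)            # square root of the population variance
sample_variance = statistics.variance(values)  # divides by n - 1

print(manual_variance)  # 4.0
print(pop_variance)     # 4
print(pop_std)          # 2.0
```

If you use NumPy instead, numpy.var and numpy.std compute the population versions by default. Pass ddof=1 to get the sample versions.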
- Quantiles and Quartiles: Quantiles are the data points that split the data distribution into equal-sized consecutive sets. Quartiles are the three quantiles that split the data into four equal parts, so that each part holds a quarter of the data distribution. The first quartile (Q1) is the 25th percentile and cuts off the lowest 25% of the data. The second quartile (Q2) is the median, or 50th percentile. The third quartile (Q3) is the 75th percentile and cuts off the lowest 75% of the data. The distance between Q1 and Q3, that is Q3 - Q1, is the range covered by the middle half of the data and is known as the Interquartile Range, or IQR.
The minimum, Q1, median, Q3, and maximum together are known as the five-number summary of an attribute. Taken together, they give a more complete picture of the attribute than any single measure. They can be calculated with functions available in Python or read off a box plot.
The lower and upper ends of the box represent the Q1 and Q3 values, and the red line inside the box is the median, or Q2. The lines extending from the box are known as whiskers. The whisker below Q1 ends at the minimum value and the whisker above Q3 ends at the maximum value. Most plotting libraries follow Tukey's convention here and take that minimum and maximum only over the points that lie within 1.5 × IQR of the box. The two points plotted beyond the whiskers are outliers.
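The five-number summary and the outlier check can be sketched as follows. The data is hypothetical, and the 1.5 × IQR fences are the common Tukey convention, used here as an assumption rather than the only possible rule.

```python
import statistics

# Hypothetical observations; 30 is a deliberate outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]

# method="inclusive" interpolates linearly between data points,
# the same approach numpy.percentile uses by default.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(data), q1, q2, q3, max(data))

# Tukey's rule: points more than 1.5 * IQR outside the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 30)
print(iqr)                  # 4.0
print(outliers)             # [30]
```

Different quantile methods can give slightly different quartiles on small datasets, so it is worth checking which method your library uses before comparing results.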
Thanks for reading!