Essential Concepts for Data Scientists: Location Estimates and Data Types

Megan Carroll
Nov 6 · 5 min read

Interest in data is burgeoning. Some have been so bold as to say “data is the new oil,” referring to the value lying within it as well as its potential to revolutionize the way we live, much as the internal combustion engine did. And much like oil, data must be refined before we can make use of its power.

Unstructured data is data that is not easily searchable: it may have internal structure, but it is not organized by a pre-defined data model or schema. Examples of unstructured data include e-mail, text files, social media data and mobile data. This data could be text, images, sound, video or other formats. Typically, unstructured data comes from applications, NoSQL databases and data lakes.

Structured data is data contained in a Relational Database Management System or data warehouse. This data may be human- or machine-generated as long as the data is created within a schema or model. This data can be queried, sorted and searched through using database tools. Examples of structured data include dates, phone numbers, social security numbers, credit card numbers, customer names, product names and numbers.

There are two basic types of structured data: numeric and categorical. Numeric data comes in two forms: continuous and discrete. Categorical data takes only a fixed set of values.

Structured and unstructured data can both be generated by humans or machines.

Semi-structured data combines some of the attributes of structured and unstructured data. This type of data maintains internal tags and markings that identify separate data elements, enabling information grouping and hierarchical arrangement. Examples of semi-structured data include markup languages like XML, data-interchange formats like JSON, and the documents stored in many NoSQL databases.

Key Terms for Data Types

  1. Continuous: Data that can take on any value in an interval. Also called interval, float or numeric data.
  2. Discrete: Data that can take on only integer values, such as counts. Also called integer or count data.
  3. Categorical: Data that can take on only a specific set of values representing a set of possible categories. Also called enumerated, factor, nominal or polychotomous data.
  4. Binary: A special case of categorical data with just two categories of values, either one or zero, true or false. Also called dichotomous, logical, indicator or boolean data.
  5. Ordinal: Categorical data that has an explicit ordering. Also called an ordered factor.
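
To make the terms above concrete, here is a minimal Python sketch of how each kind of data might be represented. The record, its field names and the `kind` helper are all hypothetical, invented only for illustration; real projects often rely on a data-frame library's own type system instead.

```python
from enum import Enum, IntEnum


class Color(Enum):
    """Categorical: a fixed set of values with no ordering."""
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


class Size(IntEnum):
    """Ordinal: categories with an explicit ordering."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


# A hypothetical record with one value of each kind.
record = {
    "wind_speed_mph": 12.7,      # continuous
    "visit_count": 4,            # discrete
    "shirt_color": Color.GREEN,  # categorical
    "is_member": True,           # binary
    "shirt_size": Size.MEDIUM,   # ordinal
}


def kind(value):
    """Return a rough label for the kind of data a single value holds."""
    # bool and IntEnum are both subclasses of int, so they are checked first.
    if isinstance(value, bool):
        return "binary"
    if isinstance(value, IntEnum):
        return "ordinal"
    if isinstance(value, Enum):
        return "categorical"
    if isinstance(value, int):
        return "discrete"
    if isinstance(value, float):
        return "continuous"
    return "unknown"


kinds = {name: kind(value) for name, value in record.items()}
```

Because `Size` is an `IntEnum`, comparisons such as `Size.SMALL < Size.LARGE` work, which is what separates ordinal data from plain categorical data like `Color`.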

Rectangular data is the typical frame of reference for an analysis in data science. It is, essentially, a two-dimensional matrix with rows indicating records and columns indicating features. Unstructured data must be processed and manipulated so that it can be represented as a set of features in the rectangular data.
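
As a rough illustration of turning non-rectangular input into rectangular data, the sketch below flattens a couple of nested JSON records into rows and columns. The sample records, the dotted column names and the `flatten` helper are assumptions made for this example, not a standard recipe.

```python
import json

# Hypothetical semi-structured input: nested objects and lists.
raw = """
[
  {"id": 1, "user": {"name": "Ana", "country": "PT"}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "Raj"}, "tags": []}
]
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # Reduce a list to a simple numeric feature: its length.
            flat[name + "_count"] = len(value)
        else:
            flat[name] = value
    return flat


records = [flatten(r) for r in json.loads(raw)]

# Columns are the union of all keys; missing values become None.
columns = sorted({key for r in records for key in r})
rows = [[r.get(col) for col in columns] for r in records]
```

Each entry in `rows` is one record and each position in `columns` is one feature, which is the two-dimensional layout described above.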

Predictor variables are used in a model to predict a response or dependent variable; equivalently, features are used to predict a target. Because the terminology for data can be confusing, clarifying what you mean when discussing a project makes it easier to work across engineering teams.

Key Terms for Rectangular Data

  1. Data frame: Rectangular data is the basic data structure for statistical and machine learning models.
  2. Feature: A column in the table is commonly referred to as a feature. Also called an attribute, input, predictor or variable.
  3. Outcomes: Many data science projects involve predicting an outcome. The features are sometimes used to predict the outcome in an experiment or study.
  4. Records: A row in the table is commonly referred to as a record. Also called a case, instance, observation, pattern or sample.
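
The terms above map onto code fairly directly. The following is a small sketch with a made-up housing table; the column names and values are hypothetical, and the `X` and `y` names are just a common convention.

```python
# Hypothetical table: each dict is one record (row).
table = [
    {"sqft": 850, "bedrooms": 2, "price": 210000},
    {"sqft": 1200, "bedrooms": 3, "price": 305000},
    {"sqft": 1600, "bedrooms": 4, "price": 389000},
]

feature_names = ["sqft", "bedrooms"]  # columns used as predictors
outcome_name = "price"                # column we want to predict

# X holds the features for every record, y holds the outcome.
X = [[row[name] for name in feature_names] for row in table]
y = [row[outcome_name] for row in table]
```

Here `X` has one row per record and one column per feature, while `y` has one outcome value per record.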

Non-rectangular data structures also exist within the realm of data science. Time series data records successive measurements of the same variable. Spatial data structures are used in mapping and location analytics. There are different representations of spatial data: in the object representation, the focus of the data is an object, while in the field view, the focus is on small units of space and the value of a relevant metric. Graph data structures, or network data structures, are used to represent physical, social and abstract relationships.
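
For a feel of how two of these structures might look in code, here is a minimal sketch of a time series and a graph using plain Python containers. The temperature readings and the small "follows" network are invented for illustration.

```python
from datetime import date

# Time series: successive measurements of the same variable, in time order.
temperature = [
    (date(2023, 1, 1), 3.5),
    (date(2023, 1, 2), 4.1),
    (date(2023, 1, 3), 2.8),
]

# The ordering matters: compare each reading with the one before it.
day_over_day = [
    later[1] - earlier[1]
    for earlier, later in zip(temperature, temperature[1:])
]

# Graph: an adjacency list mapping each node to the nodes it points to.
follows = {
    "ana": {"raj", "li"},
    "raj": {"li"},
    "li": set(),
}

# Relationships are queried by walking edges, not by reading columns.
followers_of_li = sorted(user for user, out in follows.items() if "li" in out)
```

Neither structure fits neatly into rows and columns without losing something: the time series depends on order, and the graph depends on the connections between nodes.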

Estimates of location are a basic way to explore your data. Variables with measured or count data might have thousands of distinct values. A basic step in exploring your data is getting a typical value for each feature: an estimate of where most of the data is located.

Key Terms for Estimates of Location

  1. Mean: The sum of all values divided by the number of values.
  2. Weighted mean: The sum of each value times its weight, divided by the sum of the weights.
  3. Median: The value such that one-half of the data lies above it and one-half below it.
  4. Weighted median: The value such that one-half of the sum of the weights lies above it and one-half below it in the sorted data.
  5. Trimmed mean: The average of all values after dropping a fixed number of extreme values.
  6. Robust: Not sensitive to extreme values.
  7. Outlier: A data value that is very different from most of the data.
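
A short sketch can show why the median is called robust and how a weighted median might be computed. The income figures are made up, and `weighted_median` is a hypothetical helper that uses one simple convention (return the first sorted value at which the running weight reaches half the total); other conventions exist.

```python
from statistics import mean, median

incomes = [32, 35, 38, 41, 44]   # hypothetical values, in thousands
with_outlier = incomes + [900]   # the same data plus one extreme value

mean_before, mean_after = mean(incomes), mean(with_outlier)
median_before, median_after = median(incomes), median(with_outlier)


def weighted_median(values, weights):
    """Return the sorted value at which cumulative weight reaches half the total."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
```

With the single outlier added, the mean jumps from 38 to roughly 181.7, while the median only moves from 38 to 39.5.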

The mean is the average value: the sum of all the values divided by the number of values. Written as a formula, mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where x-bar represents the mean of a sample from the population. A variation is the trimmed mean, which is calculated by dropping a fixed number of sorted values at either end and then taking the average of the remaining values. A trimmed mean eliminates the influence of extreme values.
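
The trimmed mean described above can be sketched in a few lines of Python. The `trimmed_mean` function, its parameter `p` (the number of values dropped from each end) and the sample scores are assumptions made for this example.

```python
def trimmed_mean(values, p):
    """Average the values left after dropping the p smallest and p largest."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= 2 * p < len(values)")
    kept = sorted(values)[p:len(values) - p]
    return sum(kept) / len(kept)


scores = [2, 55, 60, 62, 65, 70, 400]   # hypothetical data with two extremes
plain = sum(scores) / len(scores)       # ordinary mean
trimmed = trimmed_mean(scores, 1)       # drop one value from each end
```

Calling `trimmed_mean(scores, 0)` drops nothing and gives the ordinary mean, so `p` controls how aggressively extreme values are ignored.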

A weighted mean is calculated by multiplying each data value by a weight and dividing their sum by the sum of the weights. There are two motivations for using weighted means. The formula for a weighted mean is:

  $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
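
In code, the weighted mean described in this paragraph might look like the sketch below. The `weighted_mean` helper and the sample readings and weights are hypothetical, chosen only to show the arithmetic.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


readings = [10.0, 12.0, 20.0]   # hypothetical measurements
weights = [3, 2, 1]             # hypothetical weights, e.g. how much each is trusted

result = weighted_mean(readings, weights)
unweighted = weighted_mean(readings, [1, 1, 1])
```

When every weight is equal, the weighted mean reduces to the ordinary mean, which is a handy sanity check.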
