Some basic concepts in Data Analysis & Data Science — Part 1

ENTANGO
10 min read · Mar 19, 2022


As I approached the world of data science, I encountered some basic concepts that every beginner should know. These concepts constitute a core set from which to start deepening the subject.

Now let’s go into these concepts in more detail:

1.DATASET:

Data science is a branch of science that applies the scientific method to data with the goal of studying the relationships between different features and drawing out meaningful conclusions based on these relationships. Data is, therefore, the key component in data science. A dataset is a particular instance or set of data that is used for analysis or model building at any given time. A dataset can be of different types such as numerical data (quantitative data), categorical data (qualitative data), text data, image data, voice data, and video data.

A dataset can be static (not changing) or dynamic (changing with time, for example stock prices, or changing with space, for example temperature data that depends on geographical position). In data science projects the most popular type of dataset is one containing numerical data. Numerical data is typically stored in a file (for example, a comma-separated values (CSV) file) or in a database (for example, a DBMS table or set of tables).
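As a minimal sketch, loading a numerical CSV dataset with pandas might look like this (the file name and columns are hypothetical):

```python
import pandas as pd

# Hypothetical CSV file containing numerical stock-price data
df = pd.read_csv("stock_prices.csv", parse_dates=["date"])

print(df.dtypes)      # type of each column (numerical, categorical, dates, ...)
print(df.describe())  # quick statistical summary of the numerical columns
```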

2.DATA PREPARATION:

Data preparation is the act of manipulating (or pre-processing) datasets (which may come from disparate data sources) into a form that can readily and accurately be analysed, e.g. for business purposes.

Data preparation is the first step in data analytics or data science projects and can include many discrete tasks such as loading data or data ingestion, data fusion, data cleaning, data augmentation, and data delivery.

[source: https://en.wikipedia.org/wiki/Data_preparation]

Now let’s look at some of these data preparation steps in more detail:

Data loading:

There are many sources of business data within any organization. Examples include endpoint data, customer data, marketing data, sales data, and all their associated repositories. This first essential data preparation step involves identifying the necessary data and its repositories. This is not simply identifying all possible data sources and repositories, but identifying all that are applicable to the desired analysis. This means that there must first be a plan that includes the specific questions to be answered by the data analysis.

Data ingestion:

Once the data is identified, it needs to be brought into the analysis tools. The data will likely be some combination of structured and semi-structured data in different types of repositories. Importing it all into a common repository is necessary for the subsequent steps in the pipeline. Access and ingest tend to be manual processes with significant variations in exactly what needs to be done. Data preparation steps require a combination of business and IT expertise and are therefore best done by a small team. This step is also the first opportunity for data validation.
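As a rough sketch of this step, the snippet below reads a structured CSV file and a semi-structured JSON file (both hypothetical) and ingests them into a common SQLite repository used as a simple landing zone:

```python
import json
import sqlite3
import pandas as pd

# Hypothetical sources: a structured CSV file and a semi-structured JSON file
sales = pd.read_csv("sales.csv")
with open("customers.json") as f:
    customers = pd.json_normalize(json.load(f))  # flatten nested JSON records

# Common repository: a local SQLite database acting as a simple landing zone
con = sqlite3.connect("landing_zone.db")
sales.to_sql("sales_raw", con, if_exists="replace", index=False)
customers.to_sql("customers_raw", con, if_exists="replace", index=False)

# First opportunity for validation: basic row-count checks
print(f"{len(sales)} sales rows and {len(customers)} customer rows ingested")
con.close()
```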

Data cleaning:

Cleaning the data ensures that the dataset can provide valid answers when the data is analyzed. This step could be done manually for small data sets but requires automation for most realistically sized data sets. There are software tools available for this processing. If custom processing is needed, many data engineers rely on applications coded in Python. There are many different problems possible with the ingested data. There could be missing values, out-of-range values, nulls, and whitespaces that obfuscate values, as well as outlier values that could skew analysis results. Outliers are particularly challenging when they are the result of combining two or more variables in the data set. Data engineers need to plan carefully for how they are going to cleanse their data.
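A minimal cleaning sketch in pandas, using a small made-up table, could address several of the problems mentioned above (the imputation and outlier rules are assumptions, not fixed recipes):

```python
import pandas as pd

# Hypothetical ingested data with typical quality problems
df = pd.DataFrame({
    "customer": [" alice ", "bob", None, "dana", "eve", "frank"],
    "age":      [34.0, -5.0, 41.0, 29.0, 52.0, 61.0],       # -5 is out of range
    "amount":   [120.0, 95.0, 80.0, None, 110.0, 15000.0],  # a missing value and a likely outlier
})

df["customer"] = df["customer"].str.strip()                 # remove obfuscating whitespace
df = df.dropna(subset=["customer"])                         # drop rows missing key fields
df.loc[~df["age"].between(0, 120), "age"] = float("nan")    # mark out-of-range values as missing
df["amount"] = df["amount"].fillna(df["amount"].median())   # impute missing amounts

# Flag potential outliers with a simple interquartile-range rule (a judgment call)
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

print(df)
```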

Data delivery:

When the data set has been cleaned and formatted, it may be transformed by merging, splitting, or joining the input sets. Once the combining step is complete, the data is ready to be moved to the data warehouse staging area. Once data is loaded into the staging area, there is a second opportunity for validation.
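Continuing the hypothetical example from the ingestion step, a sketch of the delivery step might join the cleaned input sets and move the result into a staging table, with a second validation check (the table and column names are assumptions):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("landing_zone.db")  # hypothetical database from the ingestion step

# Transform: join the cleaned input sets on a shared key
sales = pd.read_sql("SELECT * FROM sales_raw", con)
customers = pd.read_sql("SELECT * FROM customers_raw", con)
combined = sales.merge(customers, on="customer_id", how="left")

# Deliver: load the combined data into the warehouse staging area
combined.to_sql("staging_sales_enriched", con, if_exists="replace", index=False)

# Second opportunity for validation: the join should not lose any sales rows
assert len(combined) == len(sales), "row count changed during the join"
con.close()
```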

3.DATA MODELING:

As a result of data preparation we have a set of data whose elements may not yet be related to each other. The purpose of data modeling is to build relationships between the data, oriented toward the analysis that is to be done.

Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures. The goal is to illustrate the types of data used and stored within the system, the relationships among these data types, the ways the data can be grouped and organized and its formats and attributes.

Data models are built around business needs. Rules and requirements are defined upfront through feedback from business stakeholders so they can be incorporated into the design of a new system or adapted in the iteration of an existing one.

Data can be modeled at various levels of abstraction. The process begins by collecting information about business requirements from stakeholders and end users. These business rules are then translated into data structures to formulate a concrete database design. A data model can be compared to a roadmap, an architect’s blueprint or any formal diagram that facilitates a deeper understanding of what is being designed.

Data modeling employs standardized schemas and formal techniques. This provides a common, consistent, and predictable way of defining and managing data resources across an organization, or even beyond.

Data modeling process:

As a discipline, data modeling invites stakeholders to evaluate data processing and storage in painstaking detail. Data modeling techniques have different conventions that dictate which symbols are used to represent the data, how models are laid out, and how business requirements are conveyed. All approaches provide formalized workflows that include a sequence of tasks to be performed in an iterative manner. Those workflows generally look like this:

  1. Identify the entities: The process of data modeling begins with the identification of the things, events or concepts that are represented in the data set that is to be modeled. Each entity should be cohesive and logically discrete from all others.
  2. Identify key properties of each entity: Each entity type can be differentiated from all others because it has one or more unique properties, called attributes. For instance, an entity called “customer” might possess such attributes as a first name, last name, telephone number and salutation, while an entity called “address” might include a street name and number, a city, state, country and zip code.
  3. Identify relationships among entities: The earliest draft of a data model will specify the nature of the relationships each entity has with the others. In the above example, each customer “lives at” an address. If that model were expanded to include an entity called “orders,” each order would be shipped to and billed to an address as well. These relationships are usually documented via unified modeling language (UML).
  4. Map attributes to entities completely: This will ensure the model reflects how the business will use the data. Several formal data modeling patterns are in widespread use. Object-oriented developers often apply analysis patterns or design patterns, while stakeholders from other business domains may turn to other patterns.
  5. Assign keys as needed, and decide on a degree of normalization that balances the need to reduce redundancy with performance requirements: Normalization is a technique for organizing data models (and the databases they represent) in which numerical identifiers, called keys, are assigned to groups of data to represent relationships between them without repeating the data. For instance, if customers are each assigned a key, that key can be linked to both their address and their order history without having to repeat this information in the table of customer names. Normalization tends to reduce the amount of storage space a database will require, but it can come at a cost to query performance (see the sketch after the source note below).
  6. Finalize and validate the data model: Data modeling is an iterative process that should be repeated and refined as business needs change.

[Source: https://www.ibm.com/cloud/learn/data-modeling]
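To make step 5 more concrete, here is a small sketch of normalization with pandas: a made-up denormalized orders table is split into a customers table and an orders table linked by a surrogate key, so the address is no longer repeated:

```python
import pandas as pd

# Hypothetical denormalized table: the customer's address is repeated on every order
orders_flat = pd.DataFrame({
    "customer":    ["Ada", "Ada", "Bob"],
    "address":     ["1 Main St", "1 Main St", "2 Oak Ave"],
    "order_total": [50, 75, 20],
})

# Normalization: assign each customer a key and store the address only once
customers = (orders_flat[["customer", "address"]]
             .drop_duplicates()
             .reset_index(drop=True))
customers["customer_id"] = customers.index + 1   # surrogate key

# Orders now reference customers by key instead of repeating their data
orders = orders_flat.merge(customers, on=["customer", "address"])[["customer_id", "order_total"]]

print(customers)
print(orders)
```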

Furthermore, in data science the concept of a data model is extended by machine learning algorithms, which are useful for exploring, analyzing, and interpreting data in light of certain business needs.

Examples of these algorithms include the following:

1.Supervised Learning
It is based on the results of previous operations related to the existing business operation. Based on these previous patterns, Supervised Learning aids in the prediction of an outcome. Some of the Supervised Learning Algorithms are:

  • Linear Regression
  • Random Forest
  • Support Vector Machines
  • k-Nearest Neighbors (KNN)
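A minimal sketch of supervised learning with scikit-learn, using made-up labelled data, fits a linear regression to past input/outcome pairs and predicts the outcome for a new input:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical labelled data: outcomes (y) observed for known inputs (X)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

model = LinearRegression()
model.fit(X, y)                 # learn the pattern from previous results

print(model.predict([[6.0]]))   # predict the outcome for a new input
```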

2.Unsupervised Learning
This form of learning has no pre-existing outcome or pattern. Instead, it concentrates on examining the interactions and connections between the available data points. Some of the Unsupervised Learning Algorithms are:

  • K-means Clustering
  • Hierarchical Clustering
  • Anomaly Detection
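As a sketch of unsupervised learning, k-means clustering groups made-up, unlabelled points purely by their similarity, with no pre-existing outcome to predict:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # group points by similarity alone

print(labels)                    # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)
```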

3.Reinforcement Learning
It is a fascinating Machine Learning technique in which the system learns from dynamic data generated by interacting with the real world. In simple terms, it is a mechanism by which a system learns from its mistakes and improves over time. Some of the Reinforcement Learning Algorithms are:

  • Q-Learning
  • State-Action-Reward-State-Action (SARSA)
  • Deep Q Network
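A toy sketch of Q-learning illustrates the idea of learning from mistakes over time: an agent in a five-state corridor (an invented environment) is rewarded only for reaching the rightmost state and gradually learns to move right:

```python
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # the Q-table starts with no knowledge
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != 4:
        # Epsilon-greedy: explore occasionally, otherwise use what has been learned
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: correct the estimate using the observed reward
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))   # the learned policy should choose "right" (1) in states 0-3
```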

4.DATA VISUALIZATION:

Data visualization is an interdisciplinary field that deals with the graphic representation of data. It is a particularly efficient way of communicating when the data are numerous, as in a time series, for example.

From an academic point of view, this representation can be considered as a mapping between the original data (usually numerical) and graphic elements (for example, lines or points in a chart). The mapping determines how the attributes of these elements vary according to the data. In this light, a bar chart is a mapping of the magnitude of a variable to the length of a bar. Since the graphic design of the mapping can adversely affect the readability of a chart, mapping is a core competency of data visualization.
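As a tiny illustration of this mapping, the matplotlib sketch below (with invented categories and values) maps the magnitude of each value to the length of a bar:

```python
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]   # hypothetical categories
values = [23, 45, 12, 36]           # hypothetical magnitudes

plt.bar(categories, values)         # the mapping: magnitude -> bar length
plt.ylabel("Value")
plt.title("Mapping magnitude to bar length")
plt.show()
```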

Data visualization has its roots in the field of statistics and is therefore generally considered a branch of descriptive Statistics. However, because both design skills and statistical and computing skills are required to visualize effectively, it is argued by authors such as Gershon and Page that it is both an art and a science.

Research into how people read and misread various types of visualizations is helping to determine what types and features of visualizations are most understandable and effective in conveying information.

[Source: https://en.wikipedia.org/wiki/Data_visualization]

From the point of view of data visualization design, data visualization is a form of communication that portrays dense and complex information in graphical form. The resulting visuals are designed to make it easy to compare data and use it to tell a story — both of which can help users in decision making.

Data visualization can express data of varying types and sizes: from a few data points to large multivariate datasets.

Types of visualization

Data visualization can be expressed in different forms. Charts are a common way of expressing data, as they depict different data varieties and allow data comparison.

The type of chart you use depends primarily on two things: the data you want to communicate, and what you want to convey about that data. The sections below describe various types of charts and their use cases.

Types of chart

Change over time

Change over time charts show data over a period of time, such as trends or comparisons across multiple categories.

Common use cases include:

  • Stock price performance
  • Health statistics
  • Chronologies

Change over time charts include:

1. Line charts
2. Bar charts
3. Stacked bar charts
4. Candlestick charts
5. Area charts
6. Timelines
7. Horizon charts
8. Waterfall charts
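For instance, a simple change-over-time line chart could be sketched in matplotlib with made-up daily closing prices:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily closing prices
dates = pd.date_range("2022-01-01", periods=10, freq="D")
prices = [100, 102, 101, 105, 107, 106, 110, 108, 112, 115]

plt.plot(dates, prices)
plt.xlabel("Date")
plt.ylabel("Closing price")
plt.title("Stock price performance over time")
plt.show()
```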

Category comparison

Category comparison charts compare data between multiple distinct categories.

Use cases include:

  • Income across different countries
  • Popular venue times
  • Team allocations

Category comparison charts include:

1. Bar charts
2. Grouped bar charts
3. Bubble charts
4. Multi-line charts
5. Parallel coordinate charts
6. Bullet charts

Ranking

Ranking charts show an item’s position in an ordered list.

Use cases include:

  • Election results
  • Performance statistics

Ranking charts include:

1. Ordered bar charts
2. Ordered column charts
3. Parallel coordinate charts

Part-to-whole

Part-to-whole charts show how partial elements add up to a total.

Use cases include:

  • Consolidated revenue of product categories
  • Budgets

Part-to-whole charts include:

1. Stacked bar charts
2. Pie charts
3. Donut charts
4. Stacked area charts
5. Treemap charts
6. Sunburst charts

Correlation

Correlation charts show correlation between two or more variables.

Use cases include:

  • Income and life expectancy

Correlation charts include:

1. Scatterplot charts
2. Bubble charts
3. Column and line charts
4. Heatmap charts
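A scatterplot is the simplest correlation chart; the sketch below uses invented income and life-expectancy figures:

```python
import matplotlib.pyplot as plt

# Hypothetical income vs. life-expectancy figures
income = [5, 12, 20, 28, 35, 44, 52]             # thousands per capita
life_expectancy = [62, 66, 70, 73, 75, 78, 80]   # years

plt.scatter(income, life_expectancy)
plt.xlabel("Income (thousands per capita)")
plt.ylabel("Life expectancy (years)")
plt.title("Income vs. life expectancy")
plt.show()
```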

Distribution

Distribution charts show how often each value occurs in a dataset.

Use cases include:

  • Population distribution
  • Income distribution

Distribution charts include:

1. Histogram charts
2. Box plot charts
3. Violin charts
4. Density charts
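As an example, a histogram of synthetically generated incomes shows how often each range of values occurs:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, skewed income data for illustration only
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=0.5, size=1000)

plt.hist(incomes, bins=30)
plt.xlabel("Income")
plt.ylabel("Number of people")
plt.title("Income distribution")
plt.show()
```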

Flow

Flow charts show movement of data between multiple states.

Use cases include:

  • Fund transfers
  • Vote counts and election results

Flow charts include:

1. Sankey charts
2. Gantt charts
3. Chord charts
4. Network charts

Relationship

Relationship charts show how multiple items relate to one another.

Use cases include

  • Social networks
  • Word charts

Relationship charts include:

1. Network charts
2. Venn diagrams
3. Chord charts
4. Sunburst charts

[Source: https://material.io/design/communication/data-visualization.html#principles]
