Recipe to become a Data Analyst — All about data and its ways

Vaishnave Jonnalagadda
9 min read · Apr 26, 2022

Welcome back to Episode 3

Recap: You’ve asked all the right questions, applied structured thinking and you’re completely in sync with your stakeholders. You’re off to a great start. Now here comes the next step in the secret recipe of data analysis: Preparing the data correctly.

As a reminder, the six phases of data analysis are Ask, Prepare, Process, Analyze, Share and Act.

Prepare is the phase where understanding the different types of data and data structures comes in. Knowing this lets you figure out what type of data is right for the question you’re answering.

Data is enormous and abundantly available everywhere. Right now, data is being generated all around the world; we’re talking tons of data. All this data is collected through interviews, observations, forms, questionnaires, surveys, cookies and so on. As a real-world analyst, you’ll have all kinds of data right at your fingertips. Knowing how it’s been generated can help add context to the data, and knowing how to collect it can make the data analysis process more efficient.

The three types of data sources that you should be aware of are:

  • First-party data: data collected by an individual or group using their own resources
  • Second-party data: data collected by a group directly from its audience and then sold
  • Third-party data: data provided by outside sources that did not collect it directly

As data analysts, it’s our responsibility to collect the right kind of data for every project. That means choosing the data that can help you find the right answers and solve the problems, without getting distracted by unwanted data.

For most projects, datasets are given to you. However, if you’re working on personal projects, you will have to go through public datasets or collect data ethically yourself. Either way, your data needs to be inspected for accuracy, bias, credibility and trustworthiness. Let’s dive into inspecting it.

Data Types, Formats and Structures: A few things to know about data

A data type is a specific kind of data attribute that tells what kind of value the data is. In other words, a data type tells you what kind of data you’re working with. For example:

  • Text/String/Characters
  • Number/Integer/Decimals
  • Boolean
  • Datetime
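To make these data types concrete, here is a minimal Python sketch. The record, its field names and its values are all made up for illustration; they don't come from any real dataset.

```python
from datetime import datetime

# A single hypothetical record; every field name and value is invented.
record = {
    "customer_name": "Asha",                 # text / string
    "orders": 3,                             # number (integer)
    "total_spent": 149.50,                   # number (decimal)
    "is_member": True,                       # Boolean
    "signup": datetime(2022, 4, 26, 9, 30),  # datetime
}

# Map each field to the name of its Python data type.
field_types = {field: type(value).__name__ for field, value in record.items()}

for field, type_name in field_types.items():
    print(f"{field}: {type_name}")
```

Checking types like this is a quick first step when you open a new dataset, because a number stored as text can't be summed or averaged until it is converted.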

In the last blog we learned about qualitative and quantitative data. There are many other data types and formats that you will come across as a data professional. When you think about the word “format,” a lot of things might come to mind. Think of an advertisement at an IPL or Super Bowl event. You might find it in the form of a print ad, a billboard, or even a commercial. The information is presented in the format that works best for you to take it in. The format of a dataset is a lot like that, and choosing the right format will help you manage and use your data in the best way possible.

Data Formats

Most of the data being generated right now is unstructured. Audio files, video files, emails, photos, and social media are all examples of unstructured data. These can be harder to analyze in their unstructured format. Transforming data into a structured format is usually the job of a data engineer, which we shall talk about in a different blog. But here’s the good news for analysts: you’ll be working with structured data most of the time.

Difference between structured and unstructured data

Structured data works nicely within a data model, a model that is used for organizing data elements and how they relate to one another.

What are data elements? They’re pieces of information, such as people’s names, account numbers, and addresses.

Data models help to keep data consistent and provide a map of how data is organized. This makes it easier for analysts and other stakeholders to make sense of their data and use it for business purposes. In addition to working well within data models, structured data is also useful for databases. This makes it easy for analysts to enter, query, and analyze the data whenever they need to. This also helps make data visualization pretty easy because structured data can be applied directly to charts, graphs, heat maps, dashboards and most other visual representations of data.
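Here is a small sketch of what "enter, query, and analyze" can look like with structured data, using Python's built-in sqlite3 module and an in-memory database. The table name, column names and rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (account_number INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Enter: every row carries the same data elements in the same order.
rows = [
    (1001, "Asha", "Hyderabad"),
    (1002, "Ben", "Chicago"),
    (1003, "Chen", "Hyderabad"),
]
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", rows)

# Query and analyze: count the accounts in each city.
accounts_per_city = conn.execute(
    "SELECT city, COUNT(*) FROM accounts GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(accounts_per_city)
```

Because every record follows the same model, a single query can summarize the whole table, and the result can go straight into a chart or dashboard.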

As a data analyst, you will come across multiple types of data, as discussed above. An important part of the “Prepare” step is data transformation, which is the catalyst for our analysis.

What is data transformation? It is the process of changing the data’s format, structure, or values.

Data transformation usually involves:

  • Adding, copying, or replicating data
  • Deleting fields or records
  • Standardizing the names of variables
  • Renaming, moving, or combining columns in a database
  • Converting data from one type or format to another
  • Joining one set of data with another
  • Saving a file in a different format. For example, saving a spreadsheet as a comma-separated values (CSV) file.
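A few of the steps above can be sketched in plain Python. This is only an illustration under made-up assumptions: the column names, the sample rows and the `customers` lookup are all hypothetical, and a real project might do the same work in a spreadsheet, SQL or a library such as pandas.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data: inconsistent column names, everything stored as text.
raw_sales = [
    {"Cust ID": "1", "Sale Amt": "120.50", "Sale Date": "26/04/2022"},
    {"Cust ID": "2", "Sale Amt": "80.00", "Sale Date": "27/04/2022"},
]
# Hypothetical second dataset, keyed by customer ID.
customers = {1: "Asha", 2: "Ben"}


def transform(row):
    """Standardize names, convert types, and join in the customer name."""
    customer_id = int(row["Cust ID"])  # text -> integer
    sale_date = datetime.strptime(row["Sale Date"], "%d/%m/%Y").date()
    return {
        "customer_id": customer_id,               # renamed column
        "customer_name": customers[customer_id],  # joined from the other dataset
        "sale_amount": float(row["Sale Amt"]),    # text -> decimal
        "sale_date": sale_date.isoformat(),       # DD/MM/YYYY -> YYYY-MM-DD
    }


clean_sales = [transform(row) for row in raw_sales]

# Save in a different format: write the cleaned rows out as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(clean_sales[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(clean_sales)
csv_text = buffer.getvalue()

print(csv_text)
```

The example writes to an in-memory buffer so it runs anywhere; swapping the buffer for a file opened with `open("sales.csv", "w", newline="")` would save it to disk instead.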

Why transform data?

  • Data organization: better-organized data is easier to use
  • Data compatibility: different applications or systems can then use the same data.
  • Data migration: data with matching formats can be moved from one system to another.
  • Data merging: data with the same organization can be merged
  • Data enhancement: data can be displayed with more detailed fields
  • Data comparison: apples-to-apples comparisons of the data can then be made

We talked about how to prepare data in a way that helps you tell a meaningful story. Like all good plays, our data story will be filled with characters, questions, challenges, conflict, and hopefully a resolution. The trick is to avoid the conflict, overcome the challenges and answer the questions. This can be done by ensuring data integrity and analyzing data for bias and credibility.

This is very important because even the most sound data can be skewed or misinterpreted. We need to understand the difference between good and bad data, much as we learned to tell good from bad when we were kids. That means exploring good data sources and learning how to steer clear of their nemesis, bad data.

To steer clear of bad data, the course offers a set of criteria for identifying good data, known by the acronym ROCCC: Reliable, Original, Comprehensive, Current and Cited. If you have original data from a reliable organization and it’s comprehensive, current, and cited, it ROCCCs! There are lots of places that are known for having good data. Your best bet is to go with vetted public datasets, academic papers, financial data, and governmental agency data.

In case you’re still wondering what bad data is: it’s data that is flat-out wrong, filled with human error, incomplete or biased.
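Some signs of bad data can be spotted with a quick check. The sketch below counts missing values per column and exact duplicate rows. The survey rows and column names are invented for illustration, and treating `None` or an empty string as "missing" is an assumption of this example; bias and factual errors need human judgment and can't be caught this way.

```python
# Hypothetical survey rows; None or "" marks a value that was never filled in.
rows = [
    {"respondent": "R1", "age": 34, "city": "Pune"},
    {"respondent": "R2", "age": None, "city": "Austin"},
    {"respondent": "R3", "age": 29, "city": ""},
    {"respondent": "R1", "age": 34, "city": "Pune"},  # exact copy of the first row
]


def is_missing(value):
    """Assumption for this sketch: None and empty strings count as missing."""
    return value is None or value == ""


# Incomplete data: how many values are missing in each column?
missing_per_column = {
    column: sum(is_missing(row[column]) for row in rows) for column in rows[0]
}

# Human error: how many rows are exact copies of an earlier row?
seen = set()
duplicate_rows = 0
for row in rows:
    key = tuple(sorted(row.items()))
    if key in seen:
        duplicate_rows += 1
    seen.add(key)

print(missing_per_column)
print(duplicate_rows)
```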

A little about Data Ethics

As data professionals, it’s necessary for us to understand data ethics and privacy because, in our work, we’ll make a lot of decisions on the correct use and application of data. We have to think about bias and fairness from the moment we start collecting data to the time we present our conclusions. After all, those conclusions can have serious implications.

People have a personal code of ethics that helps them navigate the world. When we’re young, it could be as simple as never lying, cheating or stealing, but as we get older, it’s a much broader list of dos and don’ts. Our ethics evolves and becomes more rational, giving us a moral compass to use as we face life’s questions, challenges, and opportunities. When we analyze data, we’re also faced with questions, challenges, and opportunities, but we have to rely on more than just our code of ethics to address them.

Data ethics refers to well-founded standards of right and wrong that dictate how data is collected, shared, and used.

Aspects of Data ethics

  • Ownership: Individuals own the raw data they provide, and they have primary control over its usage, how it’s processed and how it’s shared.
  • Transaction transparency: All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
  • Consent: An individual’s right to know explicit details about how and why their data will be used before agreeing to provide it. (How will it be used? How long will it be stored?)
  • Currency: Individuals should be aware of financial transactions resulting from the use of their data and the scale of these transactions.
  • Privacy: Preserving a data subject’s information and activity any time a data transaction occurs.
  • Openness: Free access, usage, and sharing of data.

Importance of Data Ethics

What is data bias?

It is a type of error that systematically skews results in a certain direction.

The biases we have as individuals can end up creating biased data; we all become biased when we have our own preferences. But when data is biased, it can systematically incline results in a certain direction, making them unreliable and unrealistic. The most common bias data professionals face is sampling bias, which occurs when a sample isn’t representative of the population as a whole. To avoid favouring one particular outcome, the sample should be chosen at random so that all parts of the population have an equal chance of being included. We will read more about this in the next episode.
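Here is a small sketch of why random selection matters. The population below is entirely made up (1,000 customers, 30% of them in a "south" region), and the seed value is arbitrary. It compares a convenience sample, taken from the top of the list, with a simple random sample drawn using Python's `random` module.

```python
import random

# Hypothetical population: 1,000 customers, 700 in "north" and 300 in "south".
population = [
    {"id": i, "region": "north" if i < 700 else "south"} for i in range(1000)
]

# Biased sample: just take the first 100 records (all of them are "north").
biased_sample = population[:100]

# Random sample: every customer has an equal chance of being picked.
rng = random.Random(42)  # fixed, arbitrary seed so reruns give the same sample
random_sample = rng.sample(population, k=100)


def share_south(sample):
    """Fraction of the sample that comes from the "south" region."""
    return sum(person["region"] == "south" for person in sample) / len(sample)


print("population:", share_south(population))
print("biased sample:", share_south(biased_sample))
print("random sample:", share_south(random_sample))
```

The biased sample contains no "south" customers at all, so any conclusion drawn from it ignores 30% of the population. The random sample's share will typically land near the true 30%, though the exact figure depends on the seed.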

A few other types of bias that we may come across are:

  • Observer bias (also called experimenter or research bias): a tendency for different people to observe things differently (for example, two people examining the same sample under a microscope and reporting different results)
  • Interpretation bias: a tendency to always interpret ambiguous situations in a positive or negative way (for example, two people reading the same tone of voice differently)
  • Confirmation bias: a tendency to search for or interpret information in a way that confirms pre-existing beliefs (for example, confirming our beliefs through media or like-minded people)

The four types of data bias we covered are all unique, but they do have one thing in common. They each affect the way we collect and make sense of the data. That’s why we have data ethics, an important aspect of analytics.

The battle between security and data analytics

Data security means protecting data from unauthorized access or corruption by putting safety measures in place. Usually, the purpose of data security is to keep unauthorized users from accessing or viewing sensitive data. Data professionals have to find a way to balance data security with their actual analysis needs. This can be tricky — we want to keep our data safe and secure, but we also want to use it as soon as possible so that we can make meaningful and timely observations.

To do this, companies need to find ways to balance their data security measures with their data access needs. Luckily, there are many security measures that can help companies do just that.

Who wins?

Let’s take a moment to celebrate how far we’ve come and everything we’ve learned from this course: the different types, formats, structures and sources of data; the importance of data transformation, bias and ethics; how to differentiate good data from bad; and, last but not least, data security. All of this will help prepare our data for the next step in the data analysis life cycle: processing. Processing our data to make sure that it’s clean and complete is the last step before we start analyzing it.

Until then keep governing your data.


Vaishnave Jonnalagadda

Hello, Feel free to read my content on Data and how it’s impacting you and how you can create an impact using data.