Introduction to Data Analysis

A gentle introduction to data analysis from a beginner’s perspective

Payal Kumari
Geek Culture
8 min read · Nov 16, 2021


Prologue

I am an emerging data scientist with an academic background in biomedical engineering. I love to solve problems using data. Therefore, I have written this article to introduce data analysis to students from a non-technical background.

Who should read?

This article is for students who have no prior knowledge of programming. Also, students who want to pursue a career in data science will benefit from this article.

What to expect?

This article gives a basic introduction to data analysis. You will also learn why data analysis is needed and which tools it requires.

Why data analysis?

Have you ever wondered why data analysis is important? Numerous companies across the globe generate huge amounts of data. In its raw form, this data is of little use to anyone, yet companies depend on it to make crucial decisions that can impact their businesses. The data therefore needs to be converted into meaningful information before organizations can use it. That conversion is done by analyzing the data, and this is what we call data analysis.

Now the next question arises: what is data analysis?

What is data analysis?

Data analysis is not just a single step but a set of processes.

Wikipedia defines data analysis as a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Let’s analyze the definition piece by piece.

We will begin with inspecting, cleansing, and transforming data.

The first part of the data analysis process is frequently the most time-consuming. It starts with gathering data, cleaning it (by cleaning I mean removing irrelevant data), and then transforming it into meaningful information. We can compare this to putting together a jigsaw puzzle, where we fit all the pieces together to create a beautiful picture. This is where Python and the PyData tools excel. We’ll be using pandas to read, clean, and transform our data.
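To make the read, clean, and transform steps concrete, here is a minimal sketch using pandas. The order data, column names, and cleaning rules are all made up for illustration; a real project would load its own file and choose its own rules.

```python
import io

import pandas as pd

# Hypothetical raw export; in practice this would come from a file,
# for example pd.read_csv("orders.csv").
RAW_CSV = """order_id,customer,amount,order_date
1,Asha,120.50,2021-01-05
2,Ben,,2021-01-06
2,Ben,,2021-01-06
3,Chen,80.00,2021-02-10
4,Dee,45.25,2021-02-11
"""

# Read: load the raw text into a DataFrame and parse the date column.
orders = pd.read_csv(io.StringIO(RAW_CSV), parse_dates=["order_date"])

# Clean: drop exact duplicate rows, then rows with no amount recorded.
clean = orders.drop_duplicates().dropna(subset=["amount"])

# Transform: total order amount per calendar month.
monthly_total = clean.groupby(clean["order_date"].dt.to_period("M"))["amount"].sum()

print(monthly_total)
```

Dropping rows with a missing amount is only one possible choice; depending on the question being asked, it may be better to fill in or investigate missing values instead.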

Now, we will look at modeling data.

Modeling data means applying real-world scenarios to information systems and looking for patterns or models using inferential statistics. We’ll use pandas’ statistical analysis features as well as matplotlib and seaborn visuals for data modeling. Inferences are then drawn from the processed data using the constructed models, by looking for intriguing patterns and anomalies in the data.
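As a small illustration of looking for patterns and anomalies, the sketch below summarizes a made-up series of daily visitor counts with pandas and flags unusual values. The numbers are invented, and the 1.5 × IQR rule is just one common rule of thumb that I am assuming here, not the only way to find anomalies.

```python
import pandas as pd

# Hypothetical daily visitor counts; the last value is deliberately unusual.
visits = pd.Series([102, 98, 105, 99, 101, 97, 103, 100, 250], name="daily_visits")

# Summary statistics: count, mean, standard deviation, quartiles, min and max.
print(visits.describe())

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = visits.quantile(0.25), visits.quantile(0.75)
iqr = q3 - q1
outliers = visits[(visits < q1 - 1.5 * iqr) | (visits > q3 + 1.5 * iqr)]

print(outliers)
```

This is a descriptive summary rather than a full statistical model, but it shows the kind of first look that usually comes before any deeper modeling.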

It is time to understand how we discover useful information in data.

The word “information” is crucial here. We’re attempting to convert data into knowledge. This is the most important part of data analysis. Every day, companies produce data, and they make full use of it by applying many concepts to analyze what they have collected. They convert their raw data into meaningful data that can help them increase their profits. Companies use different data analysis tools to discover useful insights from data. For instance, recorded data flows in an application can be used to understand meaningful patterns and trends, which may subsequently be used to increase sales or awareness of products and services. These patterns, which innately exist in the data, can also serve as an indicator that tells the service provider how satisfied customers are and whether they are likely to switch away from the provided services. Thus, getting useful information out of data can be very important when working in companies.
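Here is a minimal sketch of how a trend in recorded usage data might be turned into a simple signal. The customers, session counts, and the rule "usage fell between the first and last month" are all assumptions made for illustration; real satisfaction or churn analysis would be far more careful.

```python
import pandas as pd

# Hypothetical usage log: number of sessions per customer per month.
usage = pd.DataFrame(
    {
        "customer": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "month": [1, 2, 3, 4, 1, 2, 3, 4],
        "sessions": [20, 22, 25, 27, 30, 24, 15, 9],
    }
)

# Pivot so each row is a customer and each column is a month.
by_month = usage.pivot(index="customer", columns="month", values="sessions")

# Trend: change in sessions from the first to the last recorded month.
change = by_month[4] - by_month[1]

# Customers whose usage fell may deserve a closer look (a simplistic, assumed rule).
declining = change[change < 0].index.tolist()

print(declining)
```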

Let’s now learn how to inform our conclusions and support decision-making.

This is the final objective of data analysis. We need to back up our results with evidence, develop comprehensible reports and dashboards, and share the information we’ve acquired with the company. The analysis will be used by a variety of people, including marketing, sales, and accounting executives, to optimize overall performance. Each of them may require a different perspective on the same data, and they may all want different reports or levels of detail.
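The idea of different reports from the same data can be sketched with pandas as follows. The sales figures and the choice of who reads which report are hypothetical; the point is only that one dataset can be summarized at several levels of detail.

```python
import pandas as pd

# Hypothetical sales records shared by several teams.
sales = pd.DataFrame(
    {
        "region": ["North", "North", "South", "South"],
        "product": ["Tea", "Coffee", "Tea", "Coffee"],
        "revenue": [100, 150, 80, 120],
    }
)

# High-level report, e.g. for executives: one total figure per region.
by_region = sales.groupby("region")["revenue"].sum()

# Detailed report, e.g. for a sales team: revenue per region and product.
by_region_product = sales.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)

print(by_region)
print(by_region_product)
```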

What are data analysis tools?

To achieve the goals of data analysis, companies use a number of data analysis tools to gather their data and transform it into meaningful insights. So the question is: which tools should you choose to analyze data? Or better still, which tools should you learn if you want to make a career in this field?

Here are a few tools I found:

Tableau

Tableau is data visualization software from a company founded in January 2003 in Mountain View, California. It is used for data science and business intelligence and can create a wide range of visualizations to interactively present data and showcase insights.

Now let’s look at some of the features of Tableau: Tableau allows for quick data analysis and visualization in the form of dashboards and worksheets. It creates interactive dashboards that allow users to gain real-time insights. It can translate queries into visuals and accepts input data of all ranges and sizes. Tableau gives you the ability to ask questions, see trends, and spot opportunities. Using Tableau Online, you can connect to cloud databases such as Amazon Redshift and Google BigQuery. Tableau is currently used by Deloitte, Adobe, Cisco, LinkedIn, Amazon, etc.

Power BI

Power BI is a self-service business analytics tool developed by Microsoft that lets you analyze and visualize data and share insights across your organization. It can connect to hundreds of data sources and bring your data to life with live dashboards and reports.

Now let’s look at some of the aspects of Power BI: Power BI includes simple drag-and-drop functionality as well as data visualization features. You can make reports without knowing how to program in any language. It allows users to view not only what has occurred in the past and what is occurring now, but also what may occur in the future. Like Tableau, it has a large number of detailed and appealing visuals to choose from when creating reports and dashboards. With its machine learning capabilities, Power BI can recognize patterns in data and use those patterns to produce intelligent forecasts and run what-if scenarios. It supports multiple data sources such as Excel, text/CSV, Oracle, PDF, and XML files. The platform integrates with other popular business management tools like SharePoint, Office 365, and Dynamics 365, as well as non-Microsoft products like Spark, Hadoop, Google Analytics, SAP, Salesforce, and MailChimp. Power BI is currently used by Adobe, Axa, Carlsberg, Capgemini, Nestle, etc.

QlikView

QlikView is a business discovery platform that provides self-service BI for all business users and organizations. With QlikView, you can analyze data and use your data discoveries to support decision-making.

Now, let’s look at its features: With in-memory storage technology, QlikView enables interactive guided analytics. The software assists the user during the data discovery and interpretation process by suggesting possible interpretations. It uses a patented in-memory architecture for data storage: all the data from the different sources is loaded into the system’s RAM, ready to be retrieved from there. It is also capable of efficient social and mobile data discovery. Social data discovery means sharing individual data insights within a group or outside it, and a user can add annotations to someone else’s insights on a particular data report. QlikView supports mobile data discovery through an HTML5-enabled touch feature, which lets the user search the data, conduct data discovery interactively, and explore other server-based applications. QlikView provides OLAP and ETL features to perform analytical operations, extract data from multiple sources, transform it for use, and load it into a data warehouse. QlikView is currently used by Mercedes-Benz, Citibank, Cognizant, Accenture, etc.

Apache Spark

Apache Spark is an open-source engine developed specifically for large-scale data processing and analytics. It lets you store and process data in real time across clusters of computers using simple programming constructs. Apache Spark is designed to accelerate analytics on Hadoop while providing a complete suite of complementary tools that includes a fully featured machine learning library, a graph processing engine, and stream processing.

Now, let’s look at its features: Spark stores data in random access memory, so it can access the data quickly and speed up analytics. It supports multiple languages and allows developers to write applications in Java, Scala, R, or Python. It also supports richer analytics because it offers SQL queries, machine learning algorithms, complex analytics, etc. Apache Spark is currently used by Netflix, IKEA, eBay, etc.

R and Python

R is a programming language that is also used for analysis. It has traditionally been used in academia and research. Python is a high-level programming language that has a Python data analysis library, pandas. It is used for everything from importing data from Excel spreadsheets to processing it for analysis.

Now, let’s look at their features: R and Python are completely free, so they can be used without any license. R used to compute everything in memory, which limited the computations it could perform, but this has changed. Both R and Python now have options for parallel computation and good data handling capabilities. R and Python are currently used by Uber, Google, Facebook, Instagram, Amazon, etc.

Statistical Analysis System (SAS)

SAS is software developed by the SAS Institute. It facilitates analysis, reporting, and predictive modeling with the help of powerful visualizations and dashboards. In SAS, data is extracted and categorized, which helps in identifying and analyzing data patterns.

Now, let’s look at its features: SAS enables better data analysis using SAS SQL and automatic code generation. It integrates with Microsoft Office, letting you create reports in Office and distribute them through it. SAS makes complex data easier to understand and allows you to create interactive dashboards and reports. Statistical Analysis System is currently used by Genpact, IQVIA, Accenture, IBM, etc.

Microsoft Excel

At some point, we have all used Microsoft Excel. Developed by Microsoft, it is easy to use and one of the best tools for data analysis. Microsoft Excel is basically a spreadsheet program that is used to create grids of numbers, text, and formulae. It is one of the most widely used tools, be it in a small or large setup.

Now moving on to the features of Excel: Excel works with almost every other piece of software in the Office suite. We can easily add Excel spreadsheets to Word documents and PowerPoint presentations to create more visually appealing reports and presentations. It supports programming through VBA, which enables spreadsheet manipulation. The biggest benefit of Excel is that it allows you to display analysis results in the form of line graphs, charts, and histograms. Microsoft Excel is currently used by Wipro, UrbanPro, Amazon, etc.

Thank you for sticking with me this far. I hope you found this article useful and that it gives you some idea about data analysis. Because I’m new at this, I’d appreciate your opinion on any mistakes I’ve made and how I might improve. I’m looking forward to hearing from you! :)
