Starting a Career in Data Analytics

“Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway” — Geoffrey Moore

Modupeola Alade
5 min readMar 29, 2019
Image from Google search

When we think of data analytics, what comes into mind? Detailed and well-constructed data? Data turned information turned insights? If these are your thoughts, you couldn’t be more right. Data Analytics, according to Wikipedia, is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Therefore, a data analyst interprets data and turns it into information which can offer ways to improve a business, thus affecting business decisions. There are various software that are used to achieve data analysis but Microsoft Excel is probably the most well-known tool for working with data. The simplicity of Excel lowers the entrance barrier and allows to quickly get started with data manipulation.

Excel is one of the top tools for data analysis and the built-in pivot tables are arguably the most popular analytic tool. Learn more on analyzing data using Pivot tables. When the concepts and ideas of pivot tables has been understood, you can move unto the advanced course and lessons on pivot tables.

At the start of the Code n’ Wine Data Analytics Bootcamp Curriculum organized by GitGirl Initiative which I enrolled in to pursue a career in data analytics, the first module covered how to manipulate data using Excel Spreadsheets. To achieve this, we are given a dataset called Titanic. As the name suggests, this dataset provides the data on all the passengers who were aboard the RMS Titanic when it sank on 15 April 1912 after colliding with an iceberg in the North Atlantic ocean. It is the most commonly used and referred to data set for beginners in data science. With 891 rows and 12 columns, this data set provides a combination of variables based on personal characteristics such as age, class of ticket and sex, and tests one’s classification skills.

The objective of the module was to:

  1. Download the dataset from its source
  2. Import the dataset containing sheets “Train.csv” and “test.csv” from CSV format into google sheets. To see how this was generated, see Import datasets and spreadsheets
  3. Understand the data set
  4. Construct a statement on your understanding of the dataset and describe what the data tells you. (What does it mean to you? What are its features? — how many rows and columns?)
  5. Predict the survival of the passengers aboard RMS Titanic

Understanding the dataset

If you are working with a ‘not too large’ dataset, you can scroll roll and study it. If Not, there are other advanced ways of understanding a dataset. Here, let’s look at some steps we can take when working with a ‘not too large’ dataset.

  1. Take a detailed good look at the whole dataset.
  2. After this has been done, take into note the things that are common(i.e. appearing a considerable great amount of time). Such things are used to create insights by carrying out some Excel functions on them.
  3. Likewise, dissect the dataset preferably column by column and see if any column(s) affects the output of another column(s).

My statement

train.csv contains the data of the passengers of Titanic. It shows data in 891 rows and 12 columns. There are 12 columns well represented with details of each passengers such as “PassengerId”, “Survived”, “PClass”, “Name “, “Sex “, “Age”, “SibSp”, “Parch” “Ticket”, “Fare”, “Cabin” and “Embarked”. The column Survived shows details in which 1 represent that the passenger survived while O denotes otherwise.

As for test.csv, it has the same columns as train.csv except the column Survived which is not included. Therefore, we can’t determine their survival status.

Furthermore information gotten from database:

The total number of males on the Titanic is 843 while females were 466. The total number of passengers is 1309. From the Survived column, survivors is 342 while those who didn’t survive is 549. Likewise, the number of male survivors is 109 while that of female survivors is 233 (which means the number of female survivors is twice and a little more than that of male survivors) The minimum age in the database is 0.17 which means we have children on board of the Titanic. The maximum age on board is 80. From the data, I could also deduce that most of the females were their 20’s while the male were in their late 20’s, 30’s or 40’s. From the name column, I realized that we had passengers with titles such as “Master” who in total are 61. Unmarried females with the title “Miss” were 260 while the married females with the title “Mrs” were 197. The men with the title “Mr” were 757 in numbers. Others with our titles were 34 in number. Each passenger details is different from the next and the PassengerId Column is the unique key that each passenger is identified by.

In data analysis, your intuition is often a great place to start. If you have ever seen the film Titanic, they try to save the women and children first. This would be a good start for one’s predictions.

My predictions

From the train.csv file, we can deduce that almost 75% of the females survived while 19% of the male passengers survived. From this, we can say that the type of gender is a strong survival factor with females being highly considered than the males. In the test.csv file, a model was created where the Sex column was considered in stating that if gender is female, survival status ‘Yes’ represented by 1 should be assigned to it in the newly created column Survived. Likewise if the gender is male, 0 should be assigned to the column as the survival status. Therefore in the file test.csv, it was concluded that if you are a female, you survived, while if you a male, you didn’t survive.

In populating the Survived column, an IF statement was used. The IF function can perform a logical test and return one value for a TRUE result, and another for a FALSE result. The syntax is:

=IF (logical_test, [value_if_true], [value_if_false])

Flowchart of IF-statement

In this case, the IF statement was:

  • =IF (E2 = “female”, 1, 0) where E2 is an instance of the Sex column
Bar chart visualization for the number of survivors against non-survivors

Excel is a good way to start out a career in data analysis, so if you are seeking on venturing into this career path, start with it. Take courses and watch videos on Advanced Excel skills and you are one step away from achieving your goal.

Reference

  • www.google.com

--

--

Modupeola Alade

Data Analyst | Power BI Specialist | Python Enthusiast | Transforming Messy Data into Actionable Insights