[Data Analysis] Defining Objectives (1/9)

Sam Taylor
6 min read · Nov 15, 2023

--

Learn how to define objectives, goals, and questions for your data analysis journey. Apply these steps to the Iris Flower dataset and unlock the power of insights.

[This guide is part 1 of a 9-article walkthrough.]

Key concepts:
Data analysis · Machine learning · Data analysis process · Data analysis projects · VS Code · Python

Photo by Glenn Carstens-Peters on Unsplash

Embarking on a data analysis project can be exhilarating, but defining clear objectives and goals is crucial for success. In this guide, we’ll walk through the process step by step: from framing business objectives and selecting questions, to choosing potential evaluation metrics, using the famous Iris Flower dataset as our playground.

As a reminder of where defining objectives fits in, here is a general outline of the data analysis process:

  1. ➡️ Define Objectives: Clearly understand the goals of your analysis.
    ◦ What questions are you trying to answer?
    ◦ What data do you have?
    ◦ What insights are you looking to uncover?
    ◦ Define the scope of your analysis.
  2. Data Acquisition: Obtain the dataset you’ll be working with.
  3. Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
  4. Data Cleaning: Preprocess the data to ensure its quality and consistency.
  5. Data Visualization: Create visualizations to gain insights into the data.
  6. Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
  7. Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
  8. Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm, train the model, and evaluate its performance using metrics like accuracy, precision, recall, or F1-score.
  9. Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.

Step 1: Identify the Project Objective(s)

The first step is to understand the problem at hand and to define the overarching purpose of your analysis.

For instance:
◦ Are you aiming to classify iris flowers based on their features?
◦ Do you want to be able to predict house prices from a set of features?
◦ Do you want to be able to detect if an email is spam or not?

In a work setting, the objective is often provided to you by a stakeholder, colleague or boss. If you’re working alone on a project, then you might be the one setting the objective yourself.

Objectives should be clear, concise and unambiguous, so that the end goal of the analysis is obvious to anyone who reads them.

In our case:

Objective:
Use four features (sepal_width, petal_width, sepal_length, petal_length) to classify unseen iris flower data as one of three different species (iris_setosa, iris_versicolor, iris_virginica).

Step 2: Understand the Dataset

The second step is to understand the dataset that you are working with.
◦ If you are in a work setting, you can ask your colleagues and stakeholders for information about the dataset. In some cases, there might be a subject matter expert (SME) you can consult.
◦ If you are working alone, you could have a look at the dataset to see how it is structured and what variables there are.

Some questions to ask:

  • How much data is there?
  • Is the data online (constantly updating with new records) or batch (a single, static dataset)?
  • What does each column/feature represent?
  • Has the data been edited in any way?
    ◦ New columns created by combining other columns?
    ◦ Binary columns created from categorical features?
  • Is there missing data?
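
Several of these questions can be answered with a few lines of code. The following is a minimal sketch using only the Python standard library; the `summarise` helper and the three-row sample are made up for illustration (they are not real Iris records), and in practice you would point the same logic at your own file.

```python
import csv
import io

# Made-up sample for illustration only (not real Iris records).
# The second row is deliberately missing its sepal_width value.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.0,3.4,1.5,0.2,iris_setosa
6.1,,4.7,1.4,iris_versicolor
6.5,3.0,5.5,1.8,iris_virginica
"""


def summarise(csv_text):
    """Hypothetical helper: row count, column names and missing cells per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    missing = {
        col: sum(1 for row in rows if row[col] in ("", None))
        for col in columns
    }
    return {"n_rows": len(rows), "columns": columns, "missing_per_column": missing}


summary = summarise(SAMPLE_CSV)
print("How much data is there?", summary["n_rows"], "rows")
print("Which columns are there?", summary["columns"])
print("Is there missing data?", summary["missing_per_column"])
```

A summary like this does not replace a conversation with a subject matter expert, but it gives you concrete numbers to bring to that conversation.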

In our case, the Iris dataset is a well-known dataset in the field of machine learning and consists of measurements of four features of three species of iris flowers.

The iris dataset
  1. Features:
    The dataset includes four features, which represent various dimensions of the iris flowers:
    ◦ Sepal Length (in centimeters)
    ◦ Sepal Width (in centimeters)
    ◦ Petal Length (in centimeters)
    ◦ Petal Width (in centimeters)
  2. Species:
    There are three species of iris flowers in the dataset:
    ◦ Setosa
    ◦ Versicolor
    ◦ Virginica
  3. Observations:
    There are a total of 150 observations, with 50 samples from each species. There is no missing data.
  4. Structure:
    Each observation includes the measurements of the four features and is labeled with the corresponding species.
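
If you want to check these facts yourself, one option is the copy of the dataset bundled with scikit-learn. This is a sketch that assumes scikit-learn and NumPy are installed; note that this copy names the species `setosa`, `versicolor` and `virginica` and the features `sepal length (cm)` and so on, which differs slightly from the names used in this article.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the copy of the Iris dataset that ships with scikit-learn.
iris = load_iris()
X, y = iris.data, iris.target  # X: measurements, y: species encoded as 0, 1, 2

n_rows, n_features = X.shape

# Count how many observations belong to each species.
samples_per_species = dict(
    zip(iris.target_names.tolist(), np.bincount(y).tolist())
)

# Count missing (NaN) measurements across the whole feature matrix.
n_missing = int(np.isnan(X).sum())

print(f"Observations: {n_rows}, features: {n_features}")
print(f"Feature names: {iris.feature_names}")
print(f"Samples per species: {samples_per_species}")
print(f"Missing values: {n_missing}")
```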

Step 3: Formulate Specific Goals

Break down your objective into measurable goals. Specify the tasks required to achieve your objectives:

  • Goal 1: Explore and visualize the dataset
  • Goal 2: Preprocess data for machine learning
  • Goal 3: Build a classification model

Step 4: Pose Analytical Questions

Frame questions that align with your goals. These will guide your analysis and help you draw meaningful insights:

  1. [Goal 1] What is the distribution of each feature in the dataset?
  2. [Goal 1] What are the characteristics of each iris species?
  3. [Goal 1] How are the features correlated with each other?
  4. [Goal 2] Are there any outliers?
  5. [Goal 2] Are there any missing values?
  6. [Goal 2] Are the features on the same scale?
  7. [Goal 3] Which classification algorithm is most suitable for this task?
  8. [Goal 3] How will we assess the accuracy of our model?
  9. [Goal 3] How accurate is our model?
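
One lightweight way to keep goals and questions traceable is to store them as plain data and check that every goal has at least one question. The sketch below is only an illustration: the `questions_by_goal` helper and the data layout are assumptions, not part of any library.

```python
goals = {
    1: "Explore and visualize the dataset",
    2: "Preprocess data for machine learning",
    3: "Build a classification model",
}

# Each question is tagged with the goal it supports.
questions = [
    (1, "What is the distribution of each feature in the dataset?"),
    (1, "What are the characteristics of each iris species?"),
    (1, "How are the features correlated with each other?"),
    (2, "Are there any outliers?"),
    (2, "Are there any missing values?"),
    (2, "Are the features on the same scale?"),
    (3, "Which classification algorithm is most suitable for this task?"),
    (3, "How will we assess the accuracy of our model?"),
    (3, "How accurate is our model?"),
]


def questions_by_goal(goals, questions):
    """Hypothetical helper: group questions under the goal they belong to."""
    grouped = {goal_id: [] for goal_id in goals}
    for goal_id, text in questions:
        if goal_id not in grouped:
            raise ValueError(f"Question refers to unknown goal {goal_id}: {text}")
        grouped[goal_id].append(text)
    return grouped


grouped = questions_by_goal(goals, questions)

# Goals with no questions attached are a sign that something is missing.
uncovered = [goals[goal_id] for goal_id, qs in grouped.items() if not qs]

for goal_id, qs in grouped.items():
    print(f"Goal {goal_id}: {goals[goal_id]} ({len(qs)} questions)")
print("Goals without questions:", uncovered)
```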

Step 5: Validate and Refine

Re-read your objective, goals and questions, and adjust them as necessary to ensure clarity and relevance.
◦ If you are in a work setting, also share your objectives, goals, and questions with colleagues, relevant stakeholders, or your boss, and adjust them according to their feedback.

The aim here is to confirm that your objectives and questions match the overall project objective. That is, have you fully understood what has been asked of you?

It’s essential that everyone is aligned from the very beginning. Otherwise, you might get halfway through your analysis, discover that you haven’t quite understood the goals, and have to start over!

Step 6: Frame your analysis

With the information we’ve gathered above, we can already start to frame our analysis.

We know that this analysis will be:

  • A (multi-class) classification problem
    ◦ Classification: As we want to predict the correct iris species, given input data.
    ◦ Multi-class: As what we want to predict is one of three possible categories.
  • A supervised learning task
    ◦ Our dataset has labelled examples (iris ‘species’ column) of the categories we wish to predict, that we can use to train our model.
  • A multivariate dataset
    ◦ We have several features available (sepal_length, petal_length, sepal_width, etc.) that we will use to predict the species category.

This will help us decide what kind of machine learning model we can choose and how we will assess its performance.

For example, the following might be useful:

  • [Model] Random Forest
    ◦ Versatile and effective for a wide range of classification tasks.
  • [Performance metric] F1 score
    ◦ Given that we have a classification problem, and because there is no business need for us to favour better performance on one species over the others, we could select the F1 score as our performance metric.
    ◦ We could also look at precision, recall and accuracy, to give extra context to our F1 score.
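
To make the metric concrete: for each class, F1 is the harmonic mean of precision and recall, and with three classes the per-class scores have to be combined somehow. The sketch below assumes a macro average (an unweighted mean across species), which is one reasonable choice when no species matters more than another. The function names and the six labels are made up for illustration.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_scores(y_true, y_pred, lab)[2] for lab in labels) / len(labels)


# Made-up labels for illustration: one versicolor is misclassified as virginica.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

for species in sorted(set(y_true)):
    p, r, f = per_class_scores(y_true, y_pred, species)
    print(f"{species}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

score = macro_f1(y_true, y_pred)
print(f"Macro F1: {score:.3f}")
```

In this toy example the single mistake lowers recall for versicolor and precision for virginica, which is exactly the kind of context that precision and recall add to a single F1 number.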

This gives us a great starting point for our analysis.
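
As a rough preview of how this framing could translate into code, here is a minimal sketch assuming scikit-learn is installed. The 80/20 split, the stratification, the 100 trees, the fixed random seed and the macro averaging are all illustrative assumptions rather than tuned or recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Features (four measurements) and labels (species encoded as 0, 1, 2).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as "unseen" flowers; stratify keeps the
# three species equally represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Supervised, multi-class classifier trained on the labelled examples.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Macro averaging treats all three species as equally important.
macro_f1 = f1_score(y_test, y_pred, average="macro")
accuracy = accuracy_score(y_test, y_pred)

print(f"Test samples: {len(y_test)}")
print(f"Macro F1: {macro_f1:.3f}")
print(f"Accuracy: {accuracy:.3f}")
```

The exact scores depend on the split and the seed, so treat a single run as a sanity check rather than a final result.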

Conclusion

🎉 Congratulations! You’ve laid the foundation for a successful data analysis project. Stay tuned for future posts where we delve into each goal and question, bringing this Iris Flower dataset to life.

Reference(s):

Géron, A. (2023). ‘Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems’. O’Reilly Media.


Sam Taylor

Operations Analyst & Data Enthusiast. Sharing insights to support aspiring data analysts on their journey 🚀. Discover more at: https://samtaylor92.github.io