Start Your Data Science Journey in Seconds with This CLI Template

Victor Barbarosh
Practical Coder’s Chronicles
9 min read · Oct 30, 2023
Credit — imgflip.com, generated by geeks.sw.gig

— Listen, I’ve got a million-dollar idea and need a data scientist who will validate it.
— Tell me more about this!

Back to reality. We’ve all been there, right? During a typical workday, a flash of inspiration strikes, and you’re hit by two or three brilliant ideas. They all seem like they could be turned into something fantastic in just a few days. Yet, more often than not, those ideas fizzle into nothingness.

Have you ever contemplated scraping the web for apartment rental prices in specific neighborhoods of your city to make more informed moving decisions? Or perhaps you’ve pondered diving into the realm of car prices in your area, seeking to understand price trends, ideal models, or that elusive gem under $20,000. While these ideas may come and go, some remain in progress, like my car price project (stay tuned for that in a future post).

But here’s the catch: Taking that first actionable step, especially when data collection plays a vital role, can be a daunting hurdle to clear. It becomes even more challenging when there’s no external motivator like a paycheck, or the looming specter of a boss withholding that paycheck.

The good news is, it’s 2023, and we can draw inspiration from history and psychology to set ourselves up for success. In his book “Atomic Habits,” James Clear emphasizes the significance of our environment in building good habits or shedding bad ones. Consider a peculiar case in a suburb near Amsterdam: some homeowners consumed 35% less electricity than their neighbors, despite having homes of identical size and cost. The explanation was surprisingly simple — the lower-consuming homes had their electrical meter in the main hallway, in plain sight, while the higher-consuming homes had theirs tucked away in the basement.

This tale underscores the power of environmental cues in shaping our habits and keeping us on track. Now, you might be wondering, “What does the location of an electrical meter in Amsterdam have to do with launching a data science project within seconds of a brilliant idea?” Well, it turns out that it can make a world of difference.

I’m here to provide you with the virtual “main hallway electrical meter” for your data science projects, ensuring that you never have to utter the dreaded words, “I’ll start tomorrow; I need this and that, and it’ll take too long to set up.” Tomorrow never comes unless we act today.

Before we dive into this, let’s take a quick tour of what a data science project entails and the crucial phases it traverses.

What a data science project consists of

Starting a data science project can often feel like embarking on an abstract journey with unclear waypoints. What’s the first step? The second? When do you consider it “done”? What exactly is a data science project?

Let’s demystify it. A data science project is essentially a comprehensive approach to extract valuable insights, knowledge, and even predictions from data. While this process can be outlined in varying levels of detail, it generally consists of five key phases, as depicted in the diagram below (yes, there are five phases, with the sixth arrow indicating the potential for iteration if needed).

Phases of a Data Science Project, according to me (Geek On SW)

Phase 1 — Question To Answer / Problem Definition

As highlighted in the diagram, the initial and most critical step involves defining the question you aim to answer or the problem you intend to solve. Many data science projects start and then languish because they kick off with the wrong premise. Instead of beginning with “What question do we need to answer?” or “What problem must we address?” the starting point often becomes “Hey, we have some data; what can we do with it?”

To clarify, there’s nothing wrong with discovering new insights from existing data. However, there’s a difference between leveraging data from an already profitable process and diving into an untouched database with the hope of stumbling upon something valuable. The former approach has a higher likelihood of success because it starts with a meaningful and realistic objective.

Phase 2 — Work With The Data

After formulating the problem you aim to tackle, the natural next step is to engage with the data itself. This includes gathering, processing, and cleaning the data. You should have a clear understanding of what data you need, in what format, and the key metrics you seek.
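To make this concrete, it helps to write the target shape down before collecting anything. Here is a minimal sketch (my own illustration, not part of any template) of how a single data point for the apartment-rental idea from earlier might be pinned down in Python:

from dataclasses import dataclass

@dataclass
class RentalListing:
    """One collected data point; hypothetical fields for the apartment-rental idea."""
    neighborhood: str
    monthly_price: float
    bedrooms: int
    listed_date: str  # ISO 8601, e.g. "2023-10-27"

Once the shape is explicit like this, both the data-gathering code and the cleaning scripts have a clear target to aim for.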

Phase 3 — Work With The Model

You don’t need to wait until you’ve processed all the data to start working with your model. Once you’ve collected a few batches of data, you can swiftly begin selecting and fine-tuning your model.

Whether it’s a clustering, classification, or reinforcement model, the decision regarding the model type, or more broadly, the family of models, should be made early on. It all comes back to the crucial first step.
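To illustrate how early this can start, here is a hedged sketch of fitting a first baseline on a stand-in batch. The scikit-learn library and the synthetic data are my assumptions for the example, not something the template ships with:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your first few collected batches
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A simple baseline model: cheap to fit, and it anchors later comparisons
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print('Baseline accuracy:', model.score(X_test, y_test))

A rough baseline fitted this early tells you quickly whether the data you are collecting carries any signal at all.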

Phase 4 — Work With The Results

As your model runs, you’ll accumulate results, which can be compiled into periodic reports, feedback loops, and knowledge-sharing activities. Monitoring your pipeline can be done either through automated systems or manual checks, depending on your organization’s policies, project scale, and deadlines.

Phase 5 — Document everything

This fifth phase is intertwined with all the others, as documentation should be an ongoing process from the project’s inception. Questions like “What’s the structure of a data point in my dataset?” or “What are the model setup details?” demand documentation. Waiting until the project is complete is a poor strategy, as our memories are often fallible. In a high-paced software development environment (and data science projects are a part of software development), it’s easy to become distracted by numerous concurrent tasks. Early and thorough documentation is a favor you do yourself for the weeks ahead.

Phase 6 — Iterate

The sixth arrow in the diagram doesn’t represent a distinct phase. It’s a reminder that at any point during these five phases, you might need to revisit square one based on your discoveries so far. This might involve refining your initial steps, enhancing your data gathering, or even redefining your project’s objectives.

Now, why did I walk you through all these phases without providing a concrete example? Well, it’s because, in both a data science project and in life itself, as the ancient philosopher Lao Tzu astutely noted over 2000 years ago, “A journey of a thousand miles begins with a single step.”

You might already feel that you’ve initiated your project by posing the right question. But is your environment optimized for taking this crucial first step? Can you launch your project swiftly with just a few clicks?

Chances are, you can’t. To get started, you’d typically need to:

  • Pick a programming language (with Python being the most popular choice).
  • Create a project folder.
  • Design a dedicated space for storing collected data.
  • Develop a package for making API calls to gather data.
  • Set up a package for data processing scripts.
  • Create a separate unit tests package that you’ll gradually build as you work on the source code’s functionality.

But what if I told you that all these steps could be streamlined with a single, straightforward command? Moreover, you’d receive a ready-made codebase to execute these commands via the command line, enabling you to work in a user-friendly, CLI-style environment.

CLI Template — GitHub Repo description

Let’s start the actionable steps by cloning the CLI Template from GitHub.

git clone https://github.com/vBarbaros/cli-template-py.git cli-datascience-intro

# expected output
Cloning into 'cli-datascience-intro'...
remote: Enumerating objects: 44, done.
remote: Counting objects: 100% (44/44), done.
remote: Compressing objects: 100% (31/31), done.
remote: Total 44 (delta 12), reused 44 (delta 12), pack-reused 0
Receiving objects: 100% (44/44), 10.29 KiB | 3.43 MiB/s, done.
Resolving deltas: 100% (12/12), done.

Now, here is what you’ll get after cloning it.

$ tree cli-datascience-intro

# expected output
cli-datascience-intro
├── README.md
├── src
│   ├── __init__.py
│   ├── data
│   │   ├── __init__.py
│   │   ├── input.txt
│   │   └── out.txt
│   ├── run.py
│   └── scripts
│       ├── __init__.py
│       ├── api_calls.py
│       └── io_ops.py
└── test
    ├── __init__.py
    ├── test_data.py
    ├── test_scripts.py
    └── test_src.py

The src/ directory contains:

  • a data/ folder, marking where your collected data will live
  • a scripts/ folder with ready-to-adapt api_calls.py and io_ops.py scripts
  • a run.py script that handles the main CLI commands and dispatches to the specified scripts/methods (a simplified sketch of its shape follows below)

The test/ directory contains three unit-test stubs (test_data.py, test_scripts.py, and test_src.py), one for each part of the src/ tree.
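For orientation, the dispatch pattern in run.py looks roughly like the sketch below. Treat it as my simplified reconstruction and check the file in the repo for the exact details:

import argparse

from scripts.api_calls import get_generic

def main():
    parser = argparse.ArgumentParser(description='Data science CLI template')
    parser.add_argument('--get', help='target for a GET request')
    args = parser.parse_args()

    # Each CLI flag maps to one script method
    if args.get:
        get_generic(args.get)
        return

if __name__ == '__main__':
    main()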

So far, I’ve only shown you a folder with some directories, along with a few promises about how quickly you can get up and running. But here comes the best part: proving that what I promised is true!

Demo with AlphaVantage API

In the api_calls.py script we have the following method.

import requests

def get_generic(path_var):
    # Make a GET request to the API endpoint
    try:
        response = requests.get('https://api.example.com/endpoint' + path_var)
        # Check the status code of the response
        if response.status_code == 200:
            # Print the JSON data returned by the API
            print(response.json())
        else:
            # Print the status code if the request was not successful
            print('Error:', response.status_code)
    except requests.exceptions.ConnectionError:
        print('Connection Error: Please provide a valid URL')

To adapt it to our needs, I found a public API that requires no authorization or any other setup — https://www.alphavantage.co/

Disclaimer: I selected the AlphaVantage platform solely to simplify the setup process and enhance your experience when experimenting with the project. I have no financial or other incentives associated with the use of this API.

Change the GET-performing method to a more meaningful name and adapt the code as follows:

def get_alphavantage_demo(hostname):
    # Make a GET request to the API endpoint
    try:
        response = requests.get('https://www.' + hostname + '.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo')
        # Check the status code of the response
        if response.status_code == 200:
            # Return the JSON data returned by the API
            return response.json()
        else:
            # Print the status code if the request was not successful
            print('Error:', response.status_code)
    except requests.exceptions.ConnectionError:
        print('Connection Error: Please provide a valid URL')
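Since the template already ships with stubs under test/, this is also a good moment to pin the new method down with a unit test. Here is a hedged sketch (the exact stub names and import paths in the repo may differ), mocking requests.get so no real network call is made:

import unittest
from unittest.mock import MagicMock, patch

from src.scripts.api_calls import get_alphavantage_demo

class TestApiCalls(unittest.TestCase):
    @patch('src.scripts.api_calls.requests.get')
    def test_get_alphavantage_demo_returns_json(self, mock_get):
        # Fake a successful response so the test never touches the network
        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.json.return_value = {'Meta Data': {}}
        mock_get.return_value = mock_response

        self.assertEqual(get_alphavantage_demo('alphavantage'), {'Meta Data': {}})

if __name__ == '__main__':
    unittest.main()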

Ensure you have a method at hand that takes the JSON response and writes it to a file. Define this helper method in io_ops.py:

import json
import os

def write_to_file(data, out_file):
    # parentdir is assumed to be defined elsewhere in io_ops.py, pointing at src/
    out_file = os.path.join(parentdir, 'data', out_file)
    with open(out_file, 'w') as output_file:
        json.dump(data, output_file)

Finally, adapt the CLI parameters so that your run.py script performs exactly what you expect from it. Change the generic code for the GET command to the following:

    if args.get:
        json_data = get_alphavantage_demo(args.get)
        write_to_file(json_data, 'demo_data.json')
        return
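One detail worth double-checking: after the rename, the imports at the top of run.py have to match the new names. The module paths below are my assumption; mirror whatever the repo already uses:

from scripts.api_calls import get_alphavantage_demo
from scripts.io_ops import write_to_file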

And with that, you are done!

Once you run your CLI command as follows, you’ll end up with some data collected under the data/ folder.

Run It!!!

$ python run.py --get alphavantage 

In the data/ folder you can see a new file, demo_data.json, with content similar to the following:

{
    "Meta Data": {
        "1. Information": "Intraday (5min) open, high, low, close prices and volume",
        "2. Symbol": "IBM",
        "3. Last Refreshed": "2023-10-27 19:35:00",
        "4. Interval": "5min",
        "5. Output Size": "Compact",
        "6. Time Zone": "US/Eastern"
    },
    "Time Series (5min)": {
        "2023-10-27 19:35:00": {
            "1. open": "142.3000",
            "2. high": "142.3000",
            "3. low": "142.3000",
            "4. close": "142.3000",
            "5. volume": "12"
        },
        "2023-10-27 19:10:00": {
            "1. open": "142.4100",
            "2. high": "142.4100",
            "3. low": "142.4100",
            "4. close": "142.4100",
            "5. volume": "10"
        },
        ...
Congratulations! You completed your very first step in the long data science journey that you are just about to start!

Conclusion

In the realm of data science, where ideas can transform into insights and innovations, the journey begins with a single step. It’s easy to be inspired by a brilliant concept, but the real challenge lies in taking that vital first action, especially when data collection stands as a significant part of the journey. This is where the environment plays a crucial role, setting us up for success by simplifying our path.

We have the opportunity to harness the power of this environment to supercharge our data science projects. Just as a well-placed electrical meter in a home can drive energy efficiency, the right tools can empower us to start our data science endeavors in seconds after that spark of genius ignites. The solution is at hand, removing the barriers that delay the realization of your brilliant ideas.

Call to Action

If you found this interesting, click the follow button, give a round of applause (aka clapping), and share your thoughts in the comments.

Ready to start your data science adventure? Grab the GitHub repository at https://github.com/vBarbaros/cli-template-py and launch your projects with ease, or go straight to the demo version and run the example used above.


I am a full-stack software dev, writing about code, money & code of money, in all their forms and beauty!👉🚀