Web scraping has three simple steps:

  • Step 1: Access the webpage
  • Step 2: Locate and parse the items to be scraped
  • Step 3: Save the scraped items to a file

The top Python libraries for web scraping are requests, Selenium, Beautiful Soup, pandas, and Scrapy. Today, we will only cover the first four and save the fifth, Scrapy, for another post (it requires more documentation and is relatively complex). Our goal here is to quickly understand how these libraries work and to try them for ourselves.
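To make the three steps concrete, here’s a minimal sketch using requests and Beautiful Soup. The URL, the CSS selector, and the output file name are placeholders, not the actual pages from the job post:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Step 1: Access the webpage (placeholder URL)
    response = requests.get("https://example.com/companies")
    response.raise_for_status()

    # Step 2: Locate and parse the items to be scraped
    # (the selector below is hypothetical; adjust it to the page you are scraping)
    soup = BeautifulSoup(response.text, "html.parser")
    names = [cell.get_text(strip=True) for cell in soup.select("td.company-name")]

    # Step 3: Save the scraped items to a file
    pd.DataFrame({"company": names}).to_csv("companies.csv", index=False)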

As a practice project, we will use this $20 job post from Upwork.


There are two links that the client wants to scrape, and we will focus on the second one. It’s a webpage for publicly traded companies listed in…


If you’re like me, then you use spreadsheets as a database. Yes, I said it. And no, no one will judge you here. This is a safe place for us heathens. After all, who needs a full-blown database for projects that will either die or evolve, right?

The problem with data, regardless of where it’s stored, is that it’s only useful if it’s visible. And pushing data in front of people can sometimes be a challenge. Good thing it’s easy to do when the people are in Slack and the data is in a spreadsheet!
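The post walks through doing this from Google Sheets itself; purely to illustrate the idea, here’s a minimal Python sketch that pushes a message to a Slack incoming webhook (the webhook URL and the message are placeholders):

    import requests

    # Hypothetical Slack incoming-webhook URL; replace with your own
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

    def notify_slack(message):
        # Incoming webhooks accept a simple JSON payload with a "text" field
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

    notify_slack("Budget sheet updated: someone just edited the Q3 tab")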

Here are some other use cases:

  • You have a sensitive budget sheet and you want to get alerted whenever someone is changing a…

Our goal in this post is not to predict, as accurately as possible, the price of bitcoin tomorrow. Instead we want to see how we can use a machine learning algorithm called Random Forest to create a model that can predict bitcoin prices using historical data on bitcoin supply and demand.

Nota bene: The random forest algorithm, while awesome in many ways, has no awareness of time, which means the price predictions in this post will ignore seasonality. Again, our goal is NOT to accurately predict bitcoin prices but to see random forest at work.

Specifically, we are interested in how the following factors affect the average market price of bitcoin across major bitcoin…
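As a rough sketch of the modelling step (the CSV file and the feature column names below are placeholders for the supply-and-demand data, not the exact ones used in the post):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical dataframe of historical bitcoin data
    df = pd.read_csv("bitcoin_history.csv")
    features = ["total_bitcoins", "n_transactions", "hash_rate", "miners_revenue"]
    X, y = df[features], df["market_price"]

    # Random forest has no awareness of time, so a plain random split is used here
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # R^2 on the held-out rows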


In this tutorial we will create our own trivia bot! The questions and answers will be stored in a Google Spreadsheet and we will write our program inside its script editor. We will be using webhooks to connect to Telegram. If you have no idea how to do this, don’t worry — I wrote a seven-step tutorial here.

We will trigger the trivia bot by simply sending a message. Our bot will evaluate whether the message is the correct answer. If so, it will send the next question; otherwise, it will repeat the current question.
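The tutorial itself lives in the Sheet’s script editor (Google Apps Script); just to sketch the reply logic in a few lines, here it is in Python with hypothetical question data:

    # Hypothetical questions standing in for the rows of the Google Spreadsheet
    QUESTIONS = [
        {"question": "What is the capital of France?", "answer": "paris"},
        {"question": "2 + 2 = ?", "answer": "4"},
    ]

    def handle_message(text, current_index):
        """Return the bot's reply and the index of the question to ask next."""
        current = QUESTIONS[current_index]
        if text.strip().lower() == current["answer"]:
            next_index = (current_index + 1) % len(QUESTIONS)
            return "Correct! Next question: " + QUESTIONS[next_index]["question"], next_index
        return "Try again: " + current["question"], current_index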

There are multiple ways to combine data in Pandas:

  • By appending — df.append()
  • By concatenating — pd.concat()
  • By joining — df.join()
  • By merging — pd.merge() or df.merge()

By Appending

An append stacks one dataframe on top of the other. It’s what you’d use if you want to combine dataframes vertically. Very straightforward, just like its syntax: df1.append(df2, sort=False)
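As a quick sketch with two made-up dataframes (note that DataFrame.append was removed in pandas 2.0, where pd.concat is the replacement):

    import pandas as pd

    df1 = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})
    df2 = pd.DataFrame({"name": ["Carol"], "score": [88]})

    # Stack df2 underneath df1 (pandas < 2.0)
    stacked = df1.append(df2, sort=False)

    # Equivalent on pandas >= 2.0, where DataFrame.append no longer exists
    stacked = pd.concat([df1, df2], sort=False)
    print(stacked)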

It takes minimal coding to create your own Telegram bot. In fact, you don’t even need to have a code editor installed to start building one. By the end of this post, you’ll have learned how to create your personal interactive Telegram bot with just a Google Spreadsheet. The final product is a bot that can reply to your messages.


Before I hash out the step-by-step instructions, it’s important that you have a conceptual understanding of how your bot is going to work. …

There are three main ways to group and aggregate data in Pandas.

  • Using the groupby() function
  • Using the pd.pivot_table() function
  • Using the pd.crosstab() function

There’s not a lot of difference between these functions apart from performance and readability. The groupby() function has the fastest runtime of the three, but the difference is barely noticeable if you are running it against a small dataframe. In this post we will go through the syntax of each function so you can decide which one is most convenient for you.

But first, let us be clear on what we mean by “group and aggregate.” …
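To make the comparison concrete, here’s a small sketch with a made-up sales dataframe; each of the three functions can produce the same aggregation:

    import pandas as pd

    # Hypothetical dataframe; the columns are made up for illustration
    sales = pd.DataFrame({
        "region": ["North", "North", "South", "South", "South"],
        "product": ["A", "B", "A", "A", "B"],
        "amount": [100, 150, 200, 50, 300],
    })

    # 1. groupby(): total amount per region and product
    sales.groupby(["region", "product"])["amount"].sum()

    # 2. pd.pivot_table(): the same aggregation, different syntax
    pd.pivot_table(sales, index="region", columns="product", values="amount", aggfunc="sum")

    # 3. pd.crosstab(): counts by default, but it can aggregate values too
    pd.crosstab(index=sales["region"], columns=sales["product"],
                values=sales["amount"], aggfunc="sum")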

There are multiple ways to filter data inside a Dataframe:

  • Using the filter() function
  • Using boolean indexing
  • Using the query() function
  • Using the str.contains() function
  • Using the isin() function
  • Using the apply() function (but we will save this for another post)

Using the filter() function

The name of this function is often a source of confusion. Contrary to what you might expect, the filter function cannot filter values inside a Dataframe. It can only filter the row and column labels.

To demonstrate what I mean, we will use a dataframe called books that contains data on the top 100 books from 1990 to 2010.


With the filter() function, I can filter the columns I want to see — for example, if I’m interested to know which authors made it to the list, I filter for the Author
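Here’s a sketch of that call on a made-up stand-in for the books dataframe (the rows and the extra columns are invented for illustration):

    import pandas as pd

    # Hypothetical stand-in for the books dataframe
    books = pd.DataFrame({
        "Title": ["Book One", "Book Two", "Book Three"],
        "Author": ["A. Writer", "B. Scribe", "C. Penner"],
        "Year": [1991, 2003, 2010],
    })

    # filter() works on labels, not values: keep only the Author column
    books.filter(items=["Author"])

    # It also accepts label patterns, e.g. every column whose name contains "Ye"
    books.filter(like="Ye", axis=1)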

There are two main ways to locate data inside a dataframe.

  • By integer location — integers that start at zero and increment by one. All rows and columns have an integer location. Rows are indexed by integer location by default.
  • By labels — typically strings that describe the location of the data. Columns are indexed by labels by default.

Let’s work with a small example dataframe.


By integer location

To select data by integer location, we will use the iloc method which, yep, literally translates to “integer location”.

df.iloc[ ]

.iloc accepts:

  • integer
  • list of integers
  • slice notation using integers as the start and stop values

Using .iloc with an integer will select a single row of data. Here we selected our first row using the integer location, 0. …
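As a quick sketch on a made-up dataframe:

    import pandas as pd

    # Hypothetical dataframe standing in for the example dataframe
    df = pd.DataFrame({"name": ["Alice", "Bob", "Carol", "Dan"],
                       "age": [34, 29, 41, 25]})

    df.iloc[0]        # a single integer: the first row
    df.iloc[[0, 2]]   # a list of integers: the first and third rows
    df.iloc[1:3]      # slice notation: rows at positions 1 and 2 (the stop is exclusive)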

Data exploration can be overwhelming for anyone who has little to no background in data analysis. There are technically no limits to what you can explore and no guidelines for what you should be looking for. But as the saying goes, all journeys begin with a single step. This post attempts to lay down not one but five easy steps you can follow when you’re exploring a dataset. I call it LAUGH:

Step 1: L-oad the data

Step 2: A-sk for definitions

Step 3: U-se questions

Step 4: G-et a feel

Step 5: H-ave next actions

By the end of this post, you would have learned what you should care about when exploring a dataset and how to do basic data exploration in a Jupyter Notebook. …
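For the “load the data” and “get a feel” steps, a typical first cell in a Jupyter Notebook looks something like this (the file name is a placeholder):

    import pandas as pd

    # Step 1: L-oad the data (placeholder file name)
    df = pd.read_csv("dataset.csv")

    # Step 4: G-et a feel for the data
    df.head()       # peek at the first few rows
    df.info()       # column types and missing values
    df.describe()   # summary statistics for the numeric columns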
