All you need is a little python, some APIs and/or a database. 😉
Let’s pretend you work at a company that doesn't’ have the resources quite yet to have a data engineer available to you or have one of the big BI tools (Looker, Tableau, Power BI etc.). You’re a BI analyst who will do whatever it takes to get the job done and you have a lot on your plate. You don’t want to be stuck with all this manual reporting so how do you solve this problem?
One approach is to use python to take data from an api or a database, transform that data and upload it to a google sheet (ETL). That sheet can then be shared across the organization and then further manipulated by the end user and/or visualized. You can even create dynamic charts in google sheets.
This can of course all be done automatically every day and can run on the weekends, when you’re on vacation (like I currently am and no I don’t want to leave Puerto Rico) or when you get another project and don’t have the time to update it yourself. Pretty neat.
I’ll show an example of what that would look like with an api and a database so let’s get started. The full code and dependencies can be found on my github page.
Automating an API
As an example I’m going to use the News Api to get all the articles published by day about ‘artificial intelligence’ and then write a script to automatically add those to a google sheet. News API is a paid service but they have a developer option. For developers it’s free to use for non-commercial projects and in-development commercial projects. The data is limited compared to the tiers you have to pay for obviously but will be fine for demonstration purposes.
We’ll put that data into a dataframe first where you can do whatever transformations you want like a group by to get the count of the sites that are posting the most articles about artificial intelligence, how many articles are published by day and if it’s increasing/decreasing, any NLP you want like topic modeling to see what topics within AI are being written about, named entity recognition etc.
Go ahead and register for a developer account to get the api key. Once you have that you can get started. We will be using the wonderful ‘requests’ python library to call the News API along with the datetime library to get yesterday’s date automatically.
Let’s define the api call. We will be hitting the ‘everything’ endpoint so we specify that in our base url. The query string parameters come next and these are right from the documentation. The ‘q’ is the query and we put ‘artificial intelligence’ in there becuase we want articles related to that. In the ‘to’ and ‘from’ parameters we add the latest date which will give us all the articles for that day.
The rest is pretty self explanatory, we pass in the api key (replace that with yours), we only want english language articles so we specify that and we also set the page size to 100. The default page size is 20 and developer accounts are limited to 100 responses so if we had more we would just need to loop through the different pages.
Next we put the api results which are in a json format into a df. There’s a few ways to do this but I like to flatten each field and put it into a list and then make the df from those lists. Most of the data here is flattened with the exception of the ‘name’ field which is nested within the ‘source’ field as you can see below. There are also some articles with no content and just a title so we check for the ‘content’ key and if it’s not there we add in a NaN to avoid a key error.
Automating SQL Queries
The next example shows how you can automate a SQL query to run everyday with python. You can connect directly to the most popular databases with python like mySQL, SQLite, postgreSQL, and mongoDB. I use BigQuery mostly so that’s what I’m going to use here.
BigQuery actually has an api so the process is pretty similar to using any api. To get your query to automatically grab the latest day’s data, all you have to do is use a regex and sub in the latest date since the query will be a string. ✨
The first step is to create a service account. Why a service account and not just your personal account? If someone at your company does this with their personal account and then leaves, that account will be deactivated and everything will break. So yeah, use a service account. 😂
Once you do that you’ll get a json file with the necessary credentials that you will use to authenticate the api call. Save that to wherever you want to store it. I’ll use one of the public datasets for this example and we’ll send a query to the api, get back the data and put it into a df like last time.
OK so now you have the data from either an api or a database and it’s set up to get the latest day’s data. It’s in a df so you can do whatever transformations you need to do. Next you need to set up the sheets api so you can put it there.
First step is to create a service account that will be authorized to read/write to a sheet via the api. If you already have one from using BQ then you can use that if it’s part of the same project. You just need to enable the sheets api for that service account. For those who don’t though, below are the instructions on how to set one up for the sheets api.
- Create a service account by going to https://console.developers.google.com/
- Create or select a project. Ours will be called ‘Analytics’
- Enable the google sheets api
4. Search for and select the sheets api and enable it
5. Search for ‘credentials’ then select ‘Create credentials’ then select ‘Service account key’
6. You will then be directed to the below page where you have to give your service account a name (we’ll call ours sheets-api-reporting) and you have to select the roles. In the project section we will select the editor role which will allow us to do what we need to (you never want to give a service account more permissions than it needs). Leave the ‘Key type’ as json and click ‘Create’. A json file will automatically download to your computer with the credentials you need. Rename it and put it in your desired folder.
Create a new google sheet and grant the service account e-mail address access to it. The service account email address can be found in the json file in the ‘client_email’ field.
You can now read and write data to that sheet with the api!!!!
Now is a good time to test that you just set up everything correctly. The below test script will add 1, 2, 3, 4, 5 to columns A1:E1.
You have to install the required libraries, add the scopes for google sheets, define the path to the credentials file, add the spreadsheet id and the sheet name. You can get the spreadsheet id from the url of the sheet you created:
In terms of how to add the data in the correct format etc. it’s all in the api reference: https://developers.google.com/sheets/api/reference/rest/ but you basically have to pass in a range of columns and rows where the data will live and then pass in the data as an array of arrays.
Putting It All Together
I know that seems like a lot but most of it is in the initial set up and the ROI will be worth it and you’ll be more productive in the long run. 💰
The below script will grab all of the ‘artificial intelligence’ articles from the news api and append them to a google sheet for yesterday.
In terms of running this automatically you have a few options. You can run it automatically with windows task scheduler, set up a cron job if you’re on Linux or you can run it as a cloud function on GCP/lambda on AWS. Running it as a cloud function/lambda has the benefit of being serverless and it will run even if you’re computer is off/crashes. GCP has a free tier where they give you $300 for 12 months. I have about 8 cloud functions running everyday and I don’t know the exact cost but it’s maybe a few dollars a month at the most so you could set up a lot for free.