Turning News Into Data — Day 4

Norman Benbrahim

Aug 13, 2022

I re-ran the script and still got rate-limited, which means they probably rate limit for 24 hours; it's only been about 15 hours since my last run. Too bad. The response includes a key-value pair that looks like this:

'x-ratelimit-hit-reset': '2022-08-13 16:39:00 +0000'

So I won't be able to reach their API until later today.

This is another reason why storing the data myself once I make requests is a good idea: if I implement some sort of check that goes through the local data first, I won't hit any rate limits for info I already have.
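
Something like this rough sketch is what I have in mind (the cache directory and the call_aylien_api helper are placeholders, not actual project code yet):

import json
import os

CACHE_DIR = 'data'  # hypothetical folder where raw responses get dumped


def fetch_stories(query):
    # go through local data first so we don't burn the rate limit
    cache_file = os.path.join(CACHE_DIR, f'{query}.json')
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    # only hit the Aylien API on a cache miss, then store the response locally
    response = call_aylien_api(query)  # placeholder for the actual SDK call
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_file, 'w') as f:
        json.dump(response, f)
    return response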

Anyway, I suppose I could take some time to think about design & stuff. One thing I've always wanted to do with Python is build an application using fastapi. Making any sort of backend with Flask or Django leaves something to be desired, especially performance-wise, but I hear very good things about fastapi. No time like the present.

I will check GitHub to see if there's a solid production-ready template someone has already written.

This one looks promising, but when learning a new framework I think it's best to start with something that isn't as specific as serving an ML app, and get to know the quirks first.

I am mostly used to NoSQL databases, but thinking ahead to when feature vectors might get built from this data, it's probably a good idea to use a structured database. Luckily this repo seems to have that, and it's updated pretty regularly. There's even a caching feature, which is awesome. Additionally, I can always have a separate store that keeps the raw responses as JSON.
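
Something like this rough SQLAlchemy sketch is what I have in mind for the raw-response store (table and column names are placeholders):

import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class RawResponse(Base):
    # hypothetical table that keeps each Aylien response verbatim,
    # alongside whatever structured tables get built later
    __tablename__ = 'raw_responses'

    id = sa.Column(sa.Integer, primary_key=True)
    query = sa.Column(sa.String(255), index=True)
    payload = sa.Column(sa.JSON)  # the response body, stored as-is
    fetched_at = sa.Column(sa.DateTime, server_default=sa.func.now())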

Got the API running; it also comes with Swagger docs, since fastapi has them built in.

Personally I'm not a huge fan of Pipfiles and pipenv, which is what this repo uses. Pipenv is slow and gets pretty messy when trying to do multi-stage builds in Dockerfiles, so I've collected the requirements manually instead:

aylien-news-api==5.1.1
sqlalchemy[asyncio]==1.4.40
alembic==1.8.1
pyjwt==2.4.0
uvicorn[standard]==0.18.2
fastapi==0.79.0
celery==5.2.7
gunicorn==20.1.0
fastapi-event==0.1.3
pythondi==1.2.4
aioredis==2.0.1
ujson==5.4.0
aiomysql==0.1.1

I like fastapi already. It has pretty advanced typing support. Take the health route, for example:

from fastapi import APIRouter, Response, Depends
from core.fastapi.dependencies import PermissionDependency, AllowAll

home_router = APIRouter()


@home_router.get(
    "/health",
    dependencies=[Depends(PermissionDependency([AllowAll]))]
)
async def home():
    return Response(status_code=200)

All in all this repo is solid. The only thing missing is testing, which is no big deal; I'll add it later.
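
When I do, it'll probably start with something as simple as this (fastapi ships a TestClient; the import path for the app instance is a guess based on the repo layout):

from fastapi.testclient import TestClient

from app.server import app  # wherever the boilerplate exposes the app instance

client = TestClient(app)


def test_health():
    # the health route from above should just return an empty 200
    response = client.get('/health')
    assert response.status_code == 200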

Deployment-wise, I want to try serverless first, using Cloud Run, because managing Kubernetes clusters manually is a huge pain, so having a managed platform is a plus. I have no idea how the extra features in this API (Celery, Redis, SQL, etc.) will work with Cloud Run, but I suppose I'll learn along the way.
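
If that pans out, deploying should boil down to roughly a one-liner like this (project, region, and image names here are made up for illustration):

gcloud run deploy backend-template \
    --image gcr.io/my-project/backend-template:latest \
    --region us-central1 \
    --port 8000 \
    --allow-unauthenticated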

I don't like the idea of managing auth within the app, since there are whole teams dedicated to just managing authentication. Using Google's OAuth2 scheme is probably a better idea. For now I'll focus on just deploying as is, with the Aylien API code embedded somewhere.
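
When I do get around to it, the rough idea would be a dependency like this that lets Google do the token validation (a sketch using the google-auth package, which isn't in the requirements yet; the client ID is a placeholder):

from fastapi import Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from google.auth.transport import requests as google_requests
from google.oauth2 import id_token

GOOGLE_CLIENT_ID = 'placeholder.apps.googleusercontent.com'
bearer = HTTPBearer()


def verify_google_token(credentials: HTTPAuthorizationCredentials = Depends(bearer)):
    # let Google validate the token's signature, expiry, and audience for us
    try:
        return id_token.verify_oauth2_token(
            credentials.credentials, google_requests.Request(), GOOGLE_CLIENT_ID
        )
    except ValueError:
        raise HTTPException(status_code=401, detail='Invalid token')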

For this I'll create a brand new repo and put all the code from the boilerplate in it. I'll call the project backend-template and use it to build a codebase that has everything I need in a backend, deployment-wise.

The structure is as follows:

├── src/
│   ├── ... (the boilerplate repo)
│   └── tests/ (where unit/integration tests will go)
├── dev_scripts/
│   └── (some shell scripts that help with local deployments)
├── Dockerfile
├── .gitignore
├── .dockerignore
├── requirements.txt
├── requirements-dev.txt
├── nginx.conf
├── .gitlab-ci.yml
└── project-name.txt

I will try to keep all deployment-related things inside the CI file, but I may need a deployment folder.

project-name.txt just contains the repo name. Python doesn't have a package.json to pull the name from, so I'll use this file during builds.

Ok let’s build out the Dockerfile:

## build stage
FROM python:3.9-alpine as build

WORKDIR /app

# tell python not to write .pyc bytecode files
ENV PYTHONDONTWRITEBYTECODE 1
# force stdout and stderr to be unbuffered
ENV PYTHONUNBUFFERED 1

COPY . /app
RUN pip wheel --no-cache-dir --no-deps --wheel-dir /app/wheels -r requirements.txt

## main stage
FROM python:3.9-alpine

# the port comes in as a build argument (set in CI, more below)
ARG PORT
ENV PORT=${PORT}

WORKDIR /app
COPY --from=build /app/wheels /wheels
COPY --from=build /app .
RUN pip install --no-cache /wheels/*

EXPOSE ${PORT}

CMD python src/app.py

Notice that I don't explicitly set the port in the Dockerfile. If someone were to stumble onto your code and know which ports to attack, that wouldn't be good. So the way I like to do it is to set the $PORT variable in the project's CI/CD variables, then pass it to my docker build command as a build argument (more on this in the .gitlab-ci.yml file below). That reminds me, I've been meaning to try out HashiCorp Vault as a solution for storing this kind of sensitive info.
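
As a rough preview, the build job in the CI file will look something like this (the docker-in-docker setup and image versions are assumptions on my part; the registry push stage is left out):

build:
  stage: build
  image: docker:20.10
  services:
    - docker:20.10-dind
  script:
    - export APP_NAME=$(cat project-name.txt)
    # $PORT is set in the project's CI/CD variables, never committed to the repo
    - docker build --build-arg PORT=${PORT} -t ${APP_NAME} .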

Anyway, I will also create some shell scripts that help with building containers & stuff locally. You may be wondering why I don't just use a docker-compose.yml file. Three reasons, really: 1) I want scripts that can tear down containers & images, 2) I want to minimize the number of files in the repo (the .gitlab-ci.yml file will hopefully contain all the deployment-related code necessary), and 3) I will have a stage in the CI where images get pushed to Google's registry.

dev_scripts/rm_container.sh:

#!/bin/sh

#### this script tears down all existing images & containers

# collect the container and image ids (-a includes stopped containers)
CONTAINERS=$(docker container ls -aq)
IMAGES=$(docker image ls -q)

# first delete the containers
if [ -z "$CONTAINERS" ]
then
    echo 'No containers found'
else
    for container in $CONTAINERS
    do
        docker rm $container --force
    done
fi

# then the images
if [ -z "$IMAGES" ]
then
    echo 'No images found'
else
    for image in $IMAGES
    do
        docker rmi $image --force
    done
fi

echo ''
echo 'Done'

dev_scripts/launch_container.sh

#!/bin/sh

export APP_NAME=$(cat project-name.txt)

if [ -z "${PORT}" ]; then
    echo 'Set the variable PORT'
    exit 1
fi

echo 'Building container image'
# pass the port along as a build argument, since the Dockerfile expects it
docker build --build-arg PORT=${PORT} -t ${APP_NAME} .

# you need -it to make it interactive so you can run shell commands directly within docker
echo ''
echo 'Deploying image as container'
docker run -it -dp ${PORT}:${PORT} ${APP_NAME}

And I make sure to make them executable:

chmod +x dev_scripts/*

The container doesn't run properly with this. And after digging into the repo a little more (the docker folder), it turns out they do this smarter than what I have above, using a utility called dockerize: it lets you wait for other Unix processes (like the database) to come up before launching the app.
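
For reference, the idea boils down to something like this in the container's CMD (the hosts and timeout here are examples, not the repo's exact values):

# wait for MySQL and Redis to accept connections, then start the app
dockerize -wait tcp://db:3306 -wait tcp://redis:6379 -timeout 30s \
    python3 src/app.py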

Ok, at this point I'm getting tired, so I'll continue this tomorrow.
