Improving “Hokatsu” as a Developer

Published in henngeblog · 5 min read · Jul 14, 2023

Explaining the background

Have you ever heard of the Japanese word “Hokatsu”?

Hokatsu is the process parents and guardians go through to get their children into daycare centers. We need to do this because daycare centers do not have enough slots to accept all children, especially in Tokyo.

The steps are:
- Look for an available daycare center nearby.
- Contact and visit the center.
- Apply to the ward office.

As the father of a 2-year-old son, I went through this too, and fortunately I was able to enroll him in a daycare center.
I was lucky: the center that accepted him is just a 10-minute walk from my house.

An article in The New York Times says:

Increasingly desperate women are forced into an annual competition for day care slots that is grueling enough to merit its own name, “hokatsu,” and is said by some to surpass the notorious, stress-filled job hunt endured by Japanese college students.

Yes, it’s stressful.

The painful “Hokatsu”

There are several kinds of daycare centers in Japan.
Some are approved by the national government or the prefectures because they meet standards for size, number of staff, and so on.
The biggest benefit of applying to these centers is that we can get subsidies if our child is enrolled in one.
However, their application process is also the most cumbersome, since it’s basically paper-based.

Let’s see how you can start “Hokatsu”.
To find out which daycare centers have openings, you have to visit the website of your ward office.

The screenshot of the Ota City website, where I currently live

If you click the link on the page, you get a PDF file that looks like this.

The list of available slots and other information

The columns 0歳 (0-year-old) to 5歳 (5-year-old) tell us how many slots are available at each center. As you can see, there’s a lot of irrelevant information if you only want to know about availability.

The thing is, we only have access to the PDF files. If we had other formats like CSV, the whole process could be faster.

I decided to write some code to improve this step.
I thought it would be great to have a table with features such as sorting and filtering.
So I decided to extract the data from the PDF file, store it in a database, and then show it in a user-friendly way.

Implementation

Application Architecture

I wanted to build this application on AWS, since I didn’t have much experience with it.
I deployed the table data extraction to AWS Lambda, scheduled it to run periodically, and stored the extracted data in DynamoDB.
For the frontend, I built the app with Next.js and deployed it to Vercel.
To test and deploy the Lambda function, I used AWS SAM (Serverless Application Model).
Thanks to the AWS documentation and ChatGPT, I got the application working even though it was my first time building on AWS.
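
The storage step itself isn’t shown in this post, so here is a minimal sketch of how a Lambda handler might write extracted records to DynamoDB with boto3. The table name daycare_slots, the helper names, and the empty records placeholder are assumptions for illustration, not the code I actually deployed. One detail worth knowing: boto3 rejects Python floats for DynamoDB numbers, so the sketch converts them to Decimal first.

```python
import json
from decimal import Decimal


def to_dynamodb_item(record):
    """Round-trip through JSON so every float becomes a Decimal."""
    return json.loads(json.dumps(record), parse_float=Decimal)


def save_records(table, records):
    """Write records through a boto3 Table's batch_writer and return the count."""
    with table.batch_writer() as batch:
        for record in records:
            batch.put_item(Item=to_dynamodb_item(record))
    return len(records)


def lambda_handler(event, context):
    # Imported lazily so the helpers above can be exercised without AWS access.
    import boto3

    # "daycare_slots" is a hypothetical table name.
    table = boto3.resource("dynamodb").Table("daycare_slots")
    records = []  # placeholder: the extraction steps would produce this list
    return {"saved": save_records(table, records)}
```

Passing the table in as an argument keeps save_records easy to test with a stand-in object.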

Getting the PDF file data from the website

I’ve implemented the logic in Python this time.
First of all, I did web scraping using Beautiful Soup.

import requests
import io
from bs4 import BeautifulSoup

baseUrl = (
    "https://www.city.ota.tokyo.jp/seikatsu/kodomo/hoiku/hoikushisetsu_nyukibo/"
)
url = baseUrl + "aki-joho.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a")
# download every linked PDF; pdfFile ends up holding the last one found
for link in links:
    href = link.get("href", "")  # anchors without href yield an empty string
    if ".pdf" in href:
        response = requests.get(baseUrl + href)
        pdfFile = io.BytesIO(response.content)

This will break if the ward office changes the URL or the page structure, but for now, it’s fine.
Now I have the actual contents of the PDF file in memory.
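
To make the scraping step a little less fragile, the link handling can be separated from the download and checked on its own. The sketch below is not what I deployed: it swaps Beautiful Soup for the standard library’s HTMLParser so it runs without extra packages, and the names PdfLinkParser, find_pdf_urls, and looks_like_pdf are made up for illustration. It resolves relative links against the page URL and checks that a downloaded body really starts with the PDF signature.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags whose URL path ends with .pdf."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if urlparse(href).path.lower().endswith(".pdf"):
            self.hrefs.append(href)


def find_pdf_urls(html, page_url):
    """Return absolute URLs for every PDF link found in the HTML text."""
    parser = PdfLinkParser()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.hrefs]


def looks_like_pdf(content):
    """PDF files begin with the bytes %PDF-."""
    return content.startswith(b"%PDF-")
```

If find_pdf_urls returns an empty list or looks_like_pdf returns False, the function can fail loudly instead of passing bad data to the next step.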

Tabular data extraction

Then I read the PDF file using tabula-py. Camelot was another option, but tabula-py worked fine for me.
Fortunately, the table in the file isn’t an image, so both libraries could read and detect it. If it were, I would need a tool that supports OCR to scan the file.

At first, I wasn’t sure it would work because the data is written in Japanese, but it did, although an empty column was detected.
However, as you can see in the code, handling the Japanese data took a few extra steps.

import json
import pandas
import hashlib
import uuid
from tabula import read_pdf

# read_pdf returns one DataFrame per detected table; concat merges them
df = read_pdf(pdfFile, pages="all", lattice=True)
dataList = pandas.concat(df)

# renaming the Japanese column titles to English
data = dataList.rename(
    {
        "番号": "list_number",
        # ... other translations
    },
    axis="columns",
).drop(columns=["Unnamed: 0", "Unnamed: 1"])

# converting marker characters to booleans
data.loc[:, "can_extend"] = data["can_extend"] == "*"
data.loc[:, "emergency"] = data["emergency"] == "★"

# adding a deterministic UUID derived from the MD5 hash of the phone number
data["id"] = data.apply(
    lambda x: str(
        uuid.UUID(hex=hashlib.md5(repr(x["phone"]).encode("UTF-8")).hexdigest())
    ),
    axis=1,
)

# converting string-typed numbers etc. to numeric values
attributes = [
    "list_number",
    # ... other attributes to be converted
]
for attribute in attributes:
    # missing cells become 0 and the "×" marker becomes -1
    data.loc[pandas.isna(data[attribute]), attribute] = 0
    data.loc[data[attribute] == "×", attribute] = -1
    data[attribute] = pandas.to_numeric(
        data[attribute], errors="coerce", downcast="signed"
    )

# converting to JSON and back to get a list of plain dicts
result = data.to_json(orient="records")
records = json.loads(result)
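
Two of the rules in the code above are easier to see on single values. The following is a plain-Python sketch of the same ideas, not the pandas code itself: stable_id and normalize_slot are hypothetical helper names, and the phone number in the usage note is made up. Unlike the pandas version, which coerces unparseable values to NaN, this sketch returns None for them.

```python
import hashlib
import math
import uuid


def stable_id(phone):
    """Derive the same UUID string from the same phone number on every run."""
    digest = hashlib.md5(repr(phone).encode("UTF-8")).hexdigest()
    return str(uuid.UUID(hex=digest))


def normalize_slot(value):
    """Map a raw cell to an int: missing -> 0, "×" -> -1, numeric -> that number.

    Returns None when the value cannot be parsed as an integer.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    if value == "×":
        return -1
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```

Because stable_id("03-0000-0000") returns the same string on every call, a record keeps its id across scheduled runs as long as the phone number in the PDF does not change.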

After finishing this part, I ran into a problem when running the function. tabula-py needs a Java runtime, and the function didn’t work when I invoked the Lambda function locally.
Docker helped me get past it.
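
A quick way to confirm this kind of problem is to check whether a java executable is visible from inside the function’s environment. This is only a diagnostic sketch using the standard library; java_version is a hypothetical helper, not something tabula-py provides.

```python
import shutil
import subprocess


def java_version():
    """Return the first line printed by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # The JVM usually writes its version banner to stderr rather than stdout.
    output = result.stderr or result.stdout
    lines = output.splitlines()
    return lines[0] if lines else ""
```

Logging the return value at the start of the handler shows immediately whether the runtime is missing or just misconfigured.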

I wrote the Dockerfile like this:

FROM public.ecr.aws/lambda/python:3.10

RUN yum update -y && yum install -y java-1.8.0-openjdk

ENV JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk

ENV PATH=$JAVA_HOME/bin:$PATH

WORKDIR /var/task

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY ota.py .
CMD [ "ota.lambda_handler" ]

and wrote the template.yaml, which AWS SAM needs to define the app’s behavior, like this:

Resources:
  Ota:
    Type: AWS::Serverless::Function
    Properties:
      PackageType: Image
      Architectures:
        - x86_64
      Policies: AmazonDynamoDBFullAccess
      Events:
        ScheduleEvent:
          Type: Schedule
          Properties:
            Schedule: cron(30 4 ? * MON-FRI *)
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: ./lambda
      DockerTag: python3.10-v1
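
The cron expression in this template is evaluated in UTC, so cron(30 4 ? * MON-FRI *) fires at 04:30 UTC on weekdays. The small sketch below converts one such run to Japan Standard Time to show when that is locally; the date is just an example Friday, and the fixed +9 offset is safe because Japan does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Japan Standard Time is a fixed UTC+9 offset.
JST = timezone(timedelta(hours=9), "JST")

# One scheduled run: Friday, 14 July 2023 at 04:30 UTC.
utc_run = datetime(2023, 7, 14, 4, 30, tzinfo=timezone.utc)
jst_run = utc_run.astimezone(JST)

print(jst_run.strftime("%a %H:%M %Z"))  # prints "Fri 13:30 JST"
```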

Frontend

I built the frontend app with the latest version of Next.js.
I used the app directory feature (the App Router) that was added in a recent release.

App Router (nextjs.org): Use the new App Router with Next.js' and React's latest features, including Layouts, Server Components, Suspense, and…

I implemented the API that reads the data from DynamoDB by adding an api directory to the project.

Route Handlers (nextjs.org): Create custom request handlers for a given route using the Web's Request and Response APIs.

Then I just passed the records to the MUI DataGrid component.

Finally, I have a table showing the same data in a cooler way!
It already supports sorting and filtering.

Conclusion

The app is already more useful than the original PDF file, but there are a bunch of things I still want to do. I’m pretty sure it needs some refactoring to become scalable.

In addition, I want to add a feature to send notifications about monthly updates to the registered email.

Thanks for reading!