Dashboarding surfing in Japan

Aki Kutvonen
8 min read · Jan 20, 2023


A Python Dash app to explore surf statistics in Japan, plus assorted data-science development and workflow tips, and how to serve the dashboard using Python, Dash, Pandas, Docker & AWS.

In Shonan, surf’s up every day no matter the conditions. Photo by Ye Linn Wai on Unsplash.

Since the start of the pandemic, I too have been thinking of moving out of Tokyo. Since I surf, it would be nice to live close to a surfable beach. But where is a good place for consistent waves? If I ask local surfers, I get “yea man, waves are good here”. Can’t trust that kind of chatter, way too positive to be true. Better to scrape some data, build dashboards and get the truth out. Stoke is real.

If you just want to know about the spots and probabilities, and aren’t interested in any error analysis or tech stuff, you might as well skip directly to the app at http://18.183.42.104:5000/ (running on a small instance, so it’s a bit slow for now).

Screencap of the app. The main tab lets the user select a spot and a wave type, and configure default swell and wind directions for the spot. The result on the right shows the probability of surf as a matrix of time of day and month. The other tabs let you dig deeper into wave and wind patterns or compare different spots.

In a nutshell, the result for most spots is: the probability of good waves is inversely proportional to pleasantness. Winter and mornings are better.

Data source and accuracy

In part 1, I explain more about surfable wave conditions. To investigate the probability of getting waves at a certain spot, we need daily wave information (wave height, period and direction) from nearby measurement spots. In addition, we need wind direction and speed for the same day, ideally hourly, since the wind in particular changes during the day.

Luckily for this project, the Japan Meteorological Agency has a website which provides wind and wave data (it takes some time to click/script through the downloads, though). For the moment, the data used in this project consists of twice-daily wave information and hourly wind information for the spots listed below, covering the last 25 years. More wind data could be downloaded and spots added later if there is demand for it.

Wave and wind measurement spots used on the website. Most of the blue wave measurement points were not used.

The wind measurements should be pretty accurate and close to the surf spots. The main errors come from the wave report part of the analysis:

  • The wave data only consists of the primary swell. Often there is also a secondary swell which can create surfable waves, so I expect that in reality you can catch waves a bit more often than the app states.
  • The wave measurement spots are on the open sea. Spot by spot, you can change the acceptable swell angle, but in reality the waves diffract, for example into bays, depending on the wave period and ocean floor geometry. Accounting for this is complicated.

Tech part

Going through all parts of the project in detail would take too long for the scope of this post. Instead, I will keep it general and discuss a few selected workflow-related points.

Setting up the environment & workflow

I’m a big fan of Docker and prefer using it over Conda environments whenever possible. I have noticed, though, that the new Mac M1 sometimes works better with Conda environments.

Typically my data-science setup consists of two different containers. In the development container I mount a local volume during development, edit the source code via VS Code and run scripts from a Jupyter notebook. Changes to the code are then propagated without tedious rebuilding of the container: just use importlib.reload(your_module) in the notebook to reload the changes made to your_module.py.
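As a minimal sketch of that reload pattern in a notebook cell (your_module and some_function are placeholders for your own code):

import importlib

import your_module  # your own code on the PYTHONPATH, e.g. /work/src/your_module.py

importlib.reload(your_module)   # pick up the edits made in VS Code
your_module.some_function()     # hypothetical function, now running the fresh code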

In the production container, only the source code, packages and data needed for inference/serving are copied into the container, making it a “standalone package”. Typically there are packages you only need in development, and these can be cut away from the final image, making the container lighter and cleaner.

Another point I would like to mention about the workflow is that I might initially write a few lines in a Jupyter notebook, but I try to move the code into reusable functions in .py files as fast as possible. For example, some data transformation steps typically need to be done at inference time as well, so time can be saved by making those reusable from the beginning.

Writing the code in an editor has the benefit of editor features such as autocomplete and multi-cursor editing, and you can still easily call your functions from the notebook. I like to keep the data wrangling, EDA and training as a readable story, a kind of readme in notebook form, possibly with graphs and images, but the details should live inside the functions in the .py files.
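A hypothetical sketch of what this looks like in practice (the module, function and column names below are made up for illustration):

# src/transforms.py — a reusable transformation used in both the notebook and at inference time
import pandas as pd

def bucket_wave_period(waves_df: pd.DataFrame) -> pd.DataFrame:
    """Add a categorical short/mid/long bucket for the wave period."""
    out = waves_df.copy()
    out["period_bucket"] = pd.cut(out["Period"],
                                  bins=[0, 8, 12, 30],
                                  labels=["short", "mid", "long"])
    return out

In the notebook the story then stays short: import the function and call it on the dataframe, and reuse the exact same function later when serving.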

Below are examples mainly demonstrating the difference between the dev and prod environments, in terms of the Dockerfile and of running the container:

FROM python:3.9.7-slim-buster

# set working directory in container
WORKDIR /work

# Copy and install packages
# ** You might have separate _dev and _prod requirements
COPY requirements_dev.txt /
RUN pip install -r /requirements_dev.txt

# ** In prod we copy the source code to the container,
# in dev just skip this
COPY /src /work/src

# ** in dev just start the bash where we can run jupyter notebook
# or do what we want
CMD ["bash"]

# ** in prod we would start the Dash app in this case
CMD gunicorn --chdir /work/src --bind 0.0.0.0:5000 --workers=3 --timeout 90 namiaru_app:server

# for the development container, I mount the code and data directory into the
# container and then start jupyter notebook from bash inside the running container
docker run -v $PWD:/work \
-p 8888:8888 -it --entrypoint /bin/bash MY_DEV_CONTAINER

# for production, just start the container and forward port 5000 for the app
docker run -p 5000:5000 MY_PROD_CONTAINER

Data wrangling

Once the Dash server starts, the dataframes are loaded and operations such as filtering are performed on them. To reduce at least the RAM needed on the server, I suggest reducing the memory footprint by downcasting datatypes and saving as pickle (or some fancier format) instead of CSV. A simple example:

wind_df = wind_df.astype({col: 'int16' for col in wind_df.select_dtypes('int64').columns})
wind_df = wind_df.astype({col: 'float32' for col in wind_df.select_dtypes('float64').columns})
wind_df = wind_df.astype({col: 'category' for col in wind_df.select_dtypes('object').columns})

wind_df.to_pickle("winds.pkl")
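To see what the downcasting buys you, a quick sanity check of the in-memory size before and after (not part of the original script):

# total in-memory size of the dataframe in megabytes
print(round(wind_df.memory_usage(deep=True).sum() / 1e6, 1), "MB")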

Visualizations (plotly express)

The graphs for Dash are best made using Plotly. For this project I used Plotly Express, which is a simplified API for Plotly and works great, especially with dataframes. Below is an example of a function returning a bar plot of a column we pass as input against the dataframe’s “Month” column.

import plotly.express as px

def p(waves_df, column):
    # bar plot of the given column by month
    fig = px.bar(waves_df, x="Month", y=column)
    return fig

Dashboard app (Dash)

Dash is a (relatively) low-code framework for building data apps in Python. The official site with its tutorials and readmes is an excellent reference, so I will just introduce the main concepts here.

To start with, every app has a layout: basically, we define positions for things like buttons, text and graphs. Some basic understanding of HTML helps. It took me a while to understand how to make, for example, responsive apps which scale automatically to the device specs. If that is the goal, it’s better to spend a few minutes on YouTube with Dash tutorials on making responsive apps before starting.

Below we create a layout with a text heading (‘My interactive graph:’), a graph and a selector where we can select any column of waves_df.

import dash
from dash import dcc, html

# define the dash app
app = dash.Dash()

app.layout = html.Div([
    html.H1('My interactive graph:'),
    dcc.Graph(id='graph1'),
    # dropdown to pick a column of waves_df
    dcc.Dropdown(
        id="column_selector",
        options=[{"label": x, "value": x} for x in waves_df.columns],
    ),
])

Updating graphs, and all other functionality, uses callback functions. Once an input changes, the function runs and returns the output. In this example, when we change the value of the column selector, the selected value is passed as input to the function we built before, and the output is a figure which is placed into ‘graph1’.

from dash import Input, Output

@app.callback(
    Output('graph1', 'figure'),
    Input('column_selector', 'value'))
def update_figure(selected_column):
    # use the function we made in the "visualizations" part
    return p(waves_df, selected_column)

Dash is an excellent framework and you can add countless features: editable tables, downloads and uploads, cool maps, user management. It’s multi-tenant ready by default (scalable), and, most importantly in my opinion, you can easily add your own functions and API calls to your apps, something which is difficult in commonly used dashboarding tools.

Serving (Docker + AWS)

To run the app straight from docker run, we need to add the magic line to namiaru_app.py:

server = app.server
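For context, the relevant end of namiaru_app.py would look roughly like this sketch (only the server = app.server line is strictly required for Gunicorn; the rest is the usual Dash boilerplate):

# namiaru_app.py (sketch)
import dash
from dash import html

app = dash.Dash(__name__)
app.layout = html.Div([html.H1('My interactive graph:')])  # plus the graph, dropdown and callbacks from above

server = app.server  # the underlying Flask server that gunicorn serves as namiaru_app:server

if __name__ == "__main__":
    # handy for local development; in production gunicorn starts the app instead
    app.run_server(host="0.0.0.0", port=5000)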

The last line in the production Dockerfile ensures that the app is served with Gunicorn when we run the Docker container.

# Serve the app on port 5000
CMD gunicorn --bind 0.0.0.0:5000 --workers=3 --timeout 90 namiaru_app:server

The only thing left to do is to push the container image to ECR and pull it onto the EC2 instance from there. Running the container will start the app at server_ip:5000, in my case http://18.183.42.104:5000/ .

The last workflow tip to share is that I normally keep an environment variables file and use it in my scripts with the “source” command. This makes it easy to reuse the same templates across many projects; basically, I often only need to change the project name for a new project. For example, the push-to-ECR script:

#!/bin/bash
ENV_VAR_FILE="config/env_vars.sh"

source $ENV_VAR_FILE
echo "Using env vars in: $ENV_VAR_FILE"

IMAGE_NAME=$PROD_NAME
AWS_IMAGE_NAME=$PROD_NAME

AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com

docker tag ${IMAGE_NAME}:latest ${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/${AWS_IMAGE_NAME}:latest
docker push ${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/${AWS_IMAGE_NAME}:latest

The env_vars.sh it sources looks like this:

export PROJECT_NAME="namiaru"
export PYTHONPATH=".:/work:/work/src:/work/config:/work/test"
export BASE_PATH="/work"
export PROD_NAME="${PROJECT_NAME}_prod"
export REGION="ap-northeast-1"

Similarly, my docker build script consists of:

ENV_VAR_FILE="config/env_vars.sh"
source $ENV_VAR_FILE

docker build -t $PROJECT_NAME \
--build-arg BASE_PATH=$BASE_PATH \
--build-arg PYTHONPATH=$PYTHONPATH .

# to use the build args, remember to also add these lines to the Dockerfile:

# ARG BASE_PATH
# ARG PYTHONPATH
# ENV BASE_PATH=$BASE_PATH
# ENV PYTHONPATH=$PYTHONPATH

I also reference these variables inside the container; for example, a Python config file would have lines like DATA_DIR = os.path.join(os.environ["BASE_PATH"], "data").
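A minimal sketch of such a config file (the extra directory name is just illustrative):

# config/config.py (sketch) — resolves paths from the env vars baked into the image
import os

BASE_PATH = os.environ["BASE_PATH"]         # "/work", set via ENV in the Dockerfile
DATA_DIR = os.path.join(BASE_PATH, "data")
SRC_DIR = os.path.join(BASE_PATH, "src")    # illustrative extra path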

If you’re interested in the surfing part, please check the app for more details; the previous post might be of interest to you as well. On the tech side, this was just a small intro to Dash and a collection of workflow-related tips. Dash is a great tool, but prepare to spend some time when using it for the first time. Enjoy, and contact me if you want to explore surf or data nerding, or if you have some development needs. Thank you for reading. Shakashaka.


Aki Kutvonen

Founder of Hyouka, the more fun customer insights platform. Former theoretical physicist, tech lead and a product manager.