CitySpire is a city data aggregation service that aims to be a one-stop shop for any city data a user might want to see. It was built for Lambda School, and our team consisted of two web engineers, two iOS engineers, and two data scientists. Working on a team with such a diverse set of skills and responsibilities really helped build teamwork and coordination.
To prepare for this project we used Trello extensively. As a team, we thought about different user stories that would be applicable to CitySpire. For example:
“As a potential user I would like to be able to see the population size of the city that I am contemplating moving to so that I can select the appropriate size city for me.”
For this project, I contributed to four key areas:
I was responsible for the initial deployment to AWS Elastic Beanstalk as well as deploying project updates. If you don’t run into any strange bugs that need to be squashed, this is a fairly straightforward procedure using the command line.
1. The first step is to install the AWS Elastic Beanstalk CLI
pip install awsebcli
2. You then need to get your AWS credentials using these steps here
3. Configure the AWS CLI
4. Initialize the application, replacing “CHOOSE-YOUR-NAME” with whatever name you want
eb init --platform python-3.7 --region us-east-1 CHOOSE-YOUR-NAME
5. Now create the environment, again replacing “CHOOSE-YOUR-NAME” with whatever name you want
eb create --region us-east-1 CHOOSE-YOUR-NAME
I found that if the project was working locally then it would most likely work once it was deployed. Any updates can be pushed to AWS with
One thing that I found useful was pushing an update without committing it using the following:
eb deploy --staged
I only used this to update the Pipfile.lock, which had somehow become corrupted.
I was also responsible for setting up the PostgreSQL database on AWS RDS. This was also a simple process and can be done in the AWS console. I pretty much just followed the AWS guide step by step and didn’t run into any problems.
I used dbdesigner.net to help design the database structure. This was an incredibly useful tool, and I can’t recommend it enough. It made it simple to keep track of foreign keys and how the different tables interacted.
Gathering and cleaning data was the most time-consuming and labor-intensive part of this project. I focused on:
- population data
- crime data
- location data
- the list of cities to include in the database
For each one, I used an official government source to make sure that the data I was using was accurate. For the population, location, and the list of cities to include, I used the Census Bureau. For the crime data I used the FBI’s UCR table 8.
The data was cleaned and added to the database using the following two Jupyter notebooks.
Our API was built using FastAPI and I built the following endpoints:
- livability (the basic structure as well as the crime and population components)
The API is structured around a main.py file that runs the API and imports the various endpoints, which are defined in other files. You can see the code below:
We decided to put all of the endpoints that are solely calls to the database into a single file. Each one follows the same basic format: write a query, then use SQLAlchemy to connect to the database and fetch the results.
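To make the query-then-fetch pattern concrete, here is a minimal sketch. It deliberately swaps in Python's built-in sqlite3 module for SQLAlchemy and PostgreSQL so that it runs anywhere. The `cities` table, its columns, the `get_population` helper, and the sample row are all hypothetical and are not taken from the CitySpire code.

```python
import sqlite3
from typing import Optional


def get_population(conn: sqlite3.Connection, city: str, state: str) -> Optional[int]:
    """Run one parameterized query and return the population, or None if the city is unknown."""
    query = "SELECT population FROM cities WHERE city = ? AND state = ?"
    row = conn.execute(query, (city, state)).fetchone()
    return row[0] if row else None


# In-memory database with a made-up row, standing in for the real RDS instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, state TEXT, population INTEGER)")
conn.execute("INSERT INTO cities VALUES (?, ?, ?)", ("Exampleville", "EX", 12345))
print(get_population(conn, "Exampleville", "EX"))
```

Passing the city and state as query parameters, rather than formatting them into the SQL string, keeps user input from being interpreted as SQL.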
For the livability score, we decided to weight the different components equally. It is constructed so that if a given component is not available for a certain city, the score is simply the average of the components that are available.
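The averaging rule can be sketched in a few lines. This is only an illustration of the idea, not the project's actual function: the component names, the 0-100 scale, and the use of `None` to mark missing data are assumptions.

```python
from statistics import mean
from typing import Mapping, Optional


def livability_score(components: Mapping[str, Optional[float]]) -> Optional[float]:
    """Equal-weight average of whichever component scores are available."""
    available = [score for score in components.values() if score is not None]
    if not available:
        return None  # no component data at all for this city
    return mean(available)


# Hypothetical component scores on a shared 0-100 scale; None marks missing data.
example = {"crime": 80.0, "population": 60.0, "walkability": None}
print(livability_score(example))
```

Because missing components are dropped before averaging, a city with incomplete data is neither penalized nor rewarded for the gap.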
While working on this project, I ran into a couple of problems that had to be resolved. The most challenging was an odd glitch with Jupyter notebooks. For some reason, my notebook was failing to load modules/packages that I knew for certain were installed. Before this project, I was comfortable using conda for environment management, but here we were required to use pipenv. I never did figure out whether this issue was purely due to pipenv, but I did figure out what was causing the packages to fail to load: they were not on the path that the Jupyter notebook was using. It took a little while, but I managed to figure out how to update the notebook's path using the following:
import os
import sys
PATH = os.path.join(os.getcwd(), '..', '..', '.local',
                    'share', 'virtualenvs', 'cityspire-ds-h-NIlzhGdy', 'lib',
This is the actual code that I used in my project, so if you need to use it in your own project, just change PATH to point to the packages on your local machine.
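As a general version of the same fix, here is a small sketch that adds a directory to the notebook's import path only if it is not already there. The `add_to_sys_path` helper and the placeholder directory are hypothetical. Running `pipenv --venv` in the project folder prints the location of the virtualenv, which is one way to find the right directory.

```python
import os
import sys


def add_to_sys_path(directory: str) -> bool:
    """Append directory to sys.path if missing; return True if it was added."""
    directory = os.path.abspath(directory)
    if directory in sys.path:
        return False
    sys.path.append(directory)
    return True


# Hypothetical location; substitute the site-packages folder of your own virtualenv.
site_packages = os.path.join(os.path.expanduser("~"), "path", "to", "virtualenv", "site-packages")
add_to_sys_path(site_packages)
```

Run this in the first cell of the notebook, before any imports that depend on the virtualenv's packages.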
The other main issues with this project revolved around deploying to AWS. The first problem had an easy solution, but it took way too long to figure out. When I first entered the environment variables, I copied them straight from my .env file. I didn’t put much thought into this, as I figured it was just a basic step to complete, and as a result I included the apostrophes (single quotes) around each value. Apparently, AWS already treats environment variable values as strings, so the apostrophes are not needed.
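If you are copying values out of a .env file, a small helper like the one below can strip the wrapping quotes first. This is a hedged sketch: `strip_wrapping_quotes` is a hypothetical helper, not part of any AWS tooling, and the sample value is made up.

```python
def strip_wrapping_quotes(value: str) -> str:
    """Remove one pair of matching quotes that wraps the whole value, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        return value[1:-1]
    return value


# Made-up value as it might appear on the right-hand side of a .env entry.
print(strip_wrapping_quotes("'my-secret-value'"))
```

Only a matching pair at both ends is removed, so a quote that is legitimately part of the value is left alone.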
The other issue I had with AWS deployment was also an easy fix. A couple of times, our Pipfile.lock file caused an error. The easiest solution was to delete it and then use pipenv to recreate it.
pipenv install --dev
The most obvious way to continue work on this project is to build more predictive models. I started this project excited about building a model that would try to predict future population levels. Unfortunately, half of the data science team quit the project, so the scope had to change. Instead of focusing solely on population data, I also had to source and clean crime data. The basic idea I had planned to start with was a time series model that uses only the previous year's value, built with scikit-learn. I was then thinking about expanding the lookback period to include multiple years and using an LSTM neural net to try to improve on the simple baseline.
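To illustrate what that one-year-lookback baseline could look like, here is a sketch that fits each year's value against the previous year's value with ordinary least squares. It is written in plain Python rather than scikit-learn so it has no dependencies. The function names are hypothetical, and the population numbers are made up purely for illustration.

```python
from typing import List, Tuple


def fit_lag1(series: List[float]) -> Tuple[float, float]:
    """Least-squares fit of series[t] = intercept + slope * series[t - 1].

    Needs at least three observations whose previous-year values are not all equal.
    """
    x = series[:-1]  # previous-year values
    y = series[1:]   # current-year values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_next(series: List[float]) -> float:
    """Predict the next value from the most recent one using the fitted line."""
    intercept, slope = fit_lag1(series)
    return intercept + slope * series[-1]


# Made-up yearly populations growing 10% per year, for illustration only.
population = [100_000.0, 110_000.0, 121_000.0, 133_100.0]
print(round(predict_next(population)))
```

A baseline like this gives a reference error that a multi-year or LSTM model would have to beat to justify its extra complexity.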
Another way to improve this project would be to build a single endpoint that contains all the data, in addition to the data being segregated into individual endpoints. This could work as a summary and only include the most important parts of each data source. For example, the “crime data” endpoint includes both raw numbers and per capita numbers for a multitude of different crime statistics. It is possible that end users only care about the per capita total crime level, in which case the summary could contain only that aspect of the data.
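As a rough sketch of that summary idea, the function below collapses detailed crime counts into a single per capita figure. The field names, the per-100,000 convention, and the counts are assumptions made for illustration and do not reflect the real endpoint's schema.

```python
from typing import Dict


def per_capita_rate(count: int, population: int, per: int = 100_000) -> float:
    """Number of crimes per `per` residents."""
    return count * per / population


def crime_summary(crime_counts: Dict[str, int], population: int) -> Dict[str, float]:
    """Collapse detailed crime counts into a single per capita total."""
    total = sum(crime_counts.values())
    return {"total_crime_per_100k": per_capita_rate(total, population)}


# Made-up counts for illustration only.
counts = {"violent_crime": 500, "property_crime": 2_500}
print(crime_summary(counts, population=200_000))
```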
The project would also be greatly improved by increasing the variety of data that is aggregated. Currently it only has:
- Rental Rates
- Walk Score
- Livability Score
Potential data that can be added includes:
- Income Levels
- Employment Levels
- Number of Restaurants/Points of Interest
The composite livability score is currently just an average of the data that is included in it, but different users have different preferences. For example, the score currently assumes that where population is concerned, the bigger the city the better. However, it is highly likely that some people would prefer to live in small or medium-sized cities.
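One possible direction is to let each user supply their own weights. The sketch below shows a weighted average that still skips missing components. The `weighted_livability` function, the component names, and the weights are hypothetical and only meant to show the shape of the idea.

```python
from typing import Mapping, Optional


def weighted_livability(
    components: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> Optional[float]:
    """Weighted average over components that have both a score and a positive weight."""
    total_weight = 0.0
    weighted_sum = 0.0
    for name, score in components.items():
        weight = weights.get(name, 0.0)
        if score is None or weight <= 0:
            continue  # skip missing data and components the user does not care about
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Hypothetical scores (0-100) and user preferences; None marks missing data.
scores = {"crime": 80.0, "population": 40.0, "walkability": None}
prefs = {"crime": 3.0, "population": 1.0, "walkability": 2.0}
print(weighted_livability(scores, prefs))
```

Setting every weight to the same value reproduces the current equal-weight behavior, so this would be a backward-compatible extension.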
In conclusion, this was an interesting project to work on, and I learned a lot about AWS and about collaborating with a multidisciplinary team. If you would like to take a closer look at this project, check out the GitHub repo here.