Import Yelp data into your local DynamoDB

AWS offers a really helpful feature: you can set up a local DynamoDB database! It’s really useful, whether you want to try out DynamoDB or simply test your database before going live.

The aim of this post is to show you how to import data into this local database.

First, you need to have created the local database. You need an AWS account, of course. The documentation for the setup is here.

In order to use dynamoDB locally, you will need to:

  • Download and extract the DynamoDB archive (available here).
  • Create a group of users for DynamoDB.
AWS has very useful “predefined” group policies that you can use.
  • Add a user to this group, and retrieve its access key and secret access key. If you have just created the user, you can download a csv file with this information.
  • Configure the AWS CLI with these credentials:
$ aws configure
AWS Access Key ID [None]: ###MY_ACCESS_KEY###
AWS Secret Access Key [None]: ###MY_SECRET_ACCESS_KEY###
Default region name [None]: ###MY_REGION###
Default output format [None]: json

Well, now that you have configured the AWS CLI, you can start your local DynamoDB from the command line. Go to the directory where you extracted the DynamoDB files, then run the following command:

Linux: java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb

Windows: java -D"java.library.path=./DynamoDBLocal_lib" -jar DynamoDBLocal.jar -sharedDb

Well! You should get something like this:

DynamoDB running.

As you can see in the start-up output, the port is 8000.
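If you want to double-check from Python that the local instance is actually listening, here is a quick sanity check (just a sketch, assuming the default port 8000 mentioned above):

import socket

#Quick connectivity check against the local DynamoDB endpoint
#(assumes the default port 8000)
with socket.create_connection(("localhost", 8000), timeout=2):
    print("Local DynamoDB is reachable on port 8000.")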

We will import some data from Yelp. You can find the dataset on the Yelp website; the documentation is available here.

Well, now we can start. We will first set up some variables:

import pandas as pd
import os

dataset_folder = ###Path to the folder where you have extracted the Yelp json files###
key_file = pd.read_csv(###path to csv file with keys###)
MY_ACCESS_KEY_ID = key_file['Access key ID'] #AWS Access key
MY_SECRET_ACCESS_KEY = key_file['Secret access key'] #AWS secret key
endpoint = 'http://localhost:8000' #endpoint of the local DynamoDB
dynamoDB_folder = ###path to the folder dynamodb_local_latest/###

First, we will connect to the client. This client will give us access to the DynamoDB API. We will use boto3, the AWS SDK for Python. You can find more information about boto3 here.

import boto3

client = boto3.client('dynamodb', endpoint_url=endpoint)

We can check if some tables are already existing:

print(client.list_tables())

{'TableNames': [], 'ResponseMetadata': {###infos###}}

We can see that there are no tables currently: the database is empty. Now we can upload the Yelp data. We will use the business data.

import json

link = dataset_folder + '/' + 'yelp_academic_dataset_business.json'
with open(link, encoding='utf-8') as f:
    data = [json.loads(line) for line in f]
print("There's", len(data), "items in the file.")

There's 209393 items in the file.

This is only an exercise, so we will keep only the first 10,000 rows and have a look at them.

data = data[:10000]
data[0] #Have a look at the first item.
{'business_id': 'f9NumwFMBDn751xgFiRbNA',
'name': 'The Range At Lake Norman',
'address': '10913 Bailey Rd',
'city': 'Cornelius',
'state': 'NC',
'postal_code': '28031',
'latitude': 35.4627242,
'longitude': -80.8526119,
'stars': 3.5,
'review_count': 36,
'is_open': 1,
'attributes': {'BusinessAcceptsCreditCards': 'True',
'BikeParking': 'True',
'GoodForKids': 'False',
'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
'ByAppointmentOnly': 'False',
'RestaurantsPriceRange2': '3'},
'categories': 'Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping',
'hours': {'Monday': '10:0-18:0',
'Tuesday': '11:0-20:0',
'Wednesday': '10:0-18:0',
'Thursday': '11:0-20:0',
'Friday': '11:0-20:0',
'Saturday': '11:0-20:0',
'Sunday': '13:0-18:0'}}

Ok! We can see that this is a nested document, with different types of data, including floats. Floats are not supported by DynamoDB and need to be converted before you can upload the data, so it’s important to identify them. We can retrieve their names in a list:

list_float = []
for key, value in data[0].items():
    if isinstance(value, float):
        list_float.append(key)
    elif isinstance(value, dict):
        #Also look one level down, inside nested dictionaries
        for nested_key, nested_value in value.items():
            if isinstance(nested_value, float):
                list_float.append(nested_key)


print('Float variables:',list_float)
Float variables: ['latitude', 'longitude', 'stars']

We need to keep that in mind.
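A quick aside before we fix them: when we convert these floats later, we will go through str before calling Decimal. This short snippet (purely illustrative, not part of the import) shows why: building a Decimal directly from a float drags in its binary representation.

from decimal import Decimal

#Going through str keeps the value as written
print(Decimal(str(0.1))) #0.1
#Converting the float directly exposes its binary representation
print(Decimal(0.1)) #0.1000000000000000055511151231257827021181583404541015625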

We will now create the business table in our database, starting with an empty table. You will need to choose a unique identifier: the partition key.

If your data has a unique identifier (like business_id in our business json file), it’s quite easy: you choose this field as the partition key. If not, you can create a composite primary key: a combination of the partition key and the sort key. This combination needs to provide unique values.

The sort key is the key DynamoDB uses to decide which data are stored physically close together. It is used to improve query performance, so you need to think about the queries you would like to perform on the table in order to choose this key.

I know that I will query a lot based on city, so I choose ‘city’ as my sort key.

So here we go: first, we set up the service resource. It will allow us to interact with the database in an object-oriented way.

resource = boto3.resource('dynamodb', endpoint_url=endpoint)

We can now create our business table.

try:
    table = resource.create_table(
        TableName='business_yelp', #add here the name of your table
        KeySchema=[
            {
                'AttributeName': 'business_id',
                'KeyType': 'HASH' #partition key
            },
            {
                'AttributeName': 'city',
                'KeyType': 'RANGE' #sort key
            }
        ],
        AttributeDefinitions=[
            {
                'AttributeName': 'business_id',
                'AttributeType': 'S'
            },
            {
                'AttributeName': 'city',
                'AttributeType': 'S'
            },
        ],
        ProvisionedThroughput={
            'ReadCapacityUnits': 1,
            'WriteCapacityUnits': 1
        }
    )
except OSError:
    pass

Since we are running locally, we can put whatever values we want in ‘ReadCapacityUnits’ and ‘WriteCapacityUnits’. If you don’t add a try/except, you will see an OSError due to the local run. Even if you see this error, your table should have been created. You can easily check it:

print("Tables names:",client.list_tables()['TableNames'])Tables names: ['business_yelp']

Great!
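If you want to go a bit further than the table name, you can also inspect the table definition itself (an optional check, not required for the import):

#Optional: inspect the table definition and status
description = client.describe_table(TableName='business_yelp')['Table']
print(description['TableStatus']) #should be ACTIVE
print(description['KeySchema']) #the partition and sort keys we just defined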

Now, we will populate this table with the Yelp data. First, we check that our table is empty.

table = resource.Table('business_yelp')
print("Number of items in the", 'business_yelp', "table:", len(table.scan()['Items']), '\n')

Number of items in the business_yelp table: 0

Ok, the table is empty. Now we can try to populate it with the put_item method, starting with the first item.

table.put_item(Item = data[0])

You should get an error (TypeError). Remember, DynamoDB doesn’t support floats! We need to convert the floats before loading the data into the table.

from decimal import Decimal

def convert_floats(item, list_float=list_float):
    for var in item:
        if var in list_float:
            item[var] = Decimal(str(item[var]))
    return item

table.put_item(Item = convert_floats(data[0]))

Now we can check how many items we have in our table:

print("Number of items in the",table_name,"table:",len(table.scan()['Items']))Number of items in the business_yelp table: 1

Great! I have my item. I can have a look at it:

table.scan()['Items']

[{'hours': {'Monday': '10:0-18:0',
'Thursday': '11:0-20:0',
'Friday': '11:0-20:0',
'Sunday': '13:0-18:0',
'Wednesday': '10:0-18:0',
'Tuesday': '11:0-20:0',
'Saturday': '11:0-20:0'},
'address': '10913 Bailey Rd',
'city': 'Cornelius',
'is_open': Decimal('1'),
'latitude': Decimal('35.4627242'),
'review_count': Decimal('36'),
'stars': Decimal('3.5'),
'name': 'The Range At Lake Norman',
'attributes': {'BikeParking': 'True',
'BusinessAcceptsCreditCards': 'True',
'ByAppointmentOnly': 'False',
'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",
'GoodForKids': 'False',
'RestaurantsPriceRange2': '3'},
'state': 'NC',
'categories': 'Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping',
'postal_code': '28031',
'business_id': 'f9NumwFMBDn751xgFiRbNA',
'longitude': Decimal('-80.8526119')}]

Great! I have everything.

I will now upload the other data into the table. For one item, I use put_item. For multiple items, it’s better to use the batch_writer() method, which is optimized for bulk uploads and faster than calling put_item repeatedly. As it can take a while, I will add a progress bar to show the progression. I use the log-progress widget; you can find information about it here. You will need to copy/paste the code available at the link (and run it) before being able to use it.

with table.batch_writer() as batch:
    for item in log_progress(data[1:]): #log_progress is used to show the progression
        batch.put_item(Item=convert_floats(item))

print("Number of items in the", 'business_yelp', "table:", len(table.scan()['Items']))

You should end up with your 10,000 items. However, as you are running locally and may hit memory issues, you can end up with fewer items uploaded, so you need to check it.
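As a last, optional step, here is a minimal sketch of how you could read the data back through the composite key we chose earlier. It uses the business_id and city of the first item shown above, purely to illustrate the partition key / sort key mechanics; adapt it to the queries you actually need.

from boto3.dynamodb.conditions import Key

#get_item needs the full primary key: partition key (business_id) + sort key (city)
response = table.get_item(
    Key={'business_id': 'f9NumwFMBDn751xgFiRbNA', 'city': 'Cornelius'}
)
print(response.get('Item', {}).get('name')) #The Range At Lake Norman

#query needs at least an equality condition on the partition key;
#here we also constrain the sort key
response = table.query(
    KeyConditionExpression=Key('business_id').eq('f9NumwFMBDn751xgFiRbNA')
                           & Key('city').eq('Cornelius')
)
print(response['Count'], "item(s) found")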
