Build a Goodreads Clone with Spring Boot and Astra DB — Part 5
Author: Pieter Humphrey
This is the fifth post in a series that walks you through building a simple, highly available Spring Boot application that can handle millions of data records. In this post, we will set up our database schema and load our data into the app. To understand how to set up an Astra DB instance and connect to it, check part four.
Now we are going to use Spring Data to connect to Apache Cassandra® using entity classes and the repository pattern. As a shortcut, we’re going to reuse a lot of work from our previous post to load our data. Since data loading performance is not the focus of this tutorial, our approach of a Spring Boot Data Loader app with Spring Data Cassandra works just fine. The Data Loader will post these data elements or records to our Cassandra instance on DataStax Astra DB.
There are two tables we have to create:
- author_by_id, from the authors dump
- book_by_id, from the works dump
We have two files that we want to import into our Cassandra instance: a data dump of all the authors, and a data dump of all the works, or books. The works file is the most important one. We need to load the works data into a table called book_by_id, which maps an ID (the primary key) to a book. Unfortunately, the works data does not contain the author name, just the Author ID.
We will need to create the author_by_id table first, as it will contain both the Author ID and the Author Name. The Author ID in the book_by_id table can then be mapped to the Author Name in the author_by_id table. We are going to do all this using Spring Data, which means we will be creating an entity class.
Create author entity class
The Author entity class contains the Author ID and the Author Name, and we will use that to define the Cassandra table. To save Author data to Cassandra, we are going to create new instances of Author, and use a repository pattern to persist each Author instance to the table.
First, we need to create an Author package with an Author class. We want everything related to the Author to be in this package. Similarly, we are going to create a Book package.
Then, we need to give this entity class annotations. These annotations tell the Spring Data dependency what the backing tables are in the database. The Author class is mapped to the author_by_id table using the @Table annotation.
Now we can define the shape of this table by providing some more annotations. Using @Column annotations, we specify the names of the columns in our table and indicate the primary key. The @CassandraType annotation indicates the data type of each column.
There are three columns in the author_by_id table:
- author_id, which is also the primary key
- author_name, of data type TEXT
- personal_name, of data type TEXT
Generate the getters and setters. If you are using Visual Studio Code, just follow these steps:
- Right click
- Source Action
- Generate getters and setters
- Select all properties
You can see an example of what the full code would look like for Authors.java.
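As a rough sketch of what that entity class might look like (the package name is hypothetical, and the exact annotation imports can differ between Spring Data Cassandra versions):

```java
package com.example.betterreadsdataloader.author; // hypothetical package name

import org.springframework.data.cassandra.core.mapping.CassandraType;
import org.springframework.data.cassandra.core.mapping.Column;
import org.springframework.data.cassandra.core.mapping.PrimaryKey;
import org.springframework.data.cassandra.core.mapping.Table;

// Maps this entity to the author_by_id table described above.
@Table(value = "author_by_id")
public class Author {

    @PrimaryKey
    @Column("author_id")
    @CassandraType(type = CassandraType.Name.TEXT)
    private String id;

    @Column("author_name")
    @CassandraType(type = CassandraType.Name.TEXT)
    private String name;

    @Column("personal_name")
    @CassandraType(type = CassandraType.Name.TEXT)
    private String personalName;

    // Getters and setters, generated via Source Action in VS Code.
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getPersonalName() { return personalName; }
    public void setPersonalName(String personalName) { this.personalName = personalName; }
}
```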
Create a repository
Now that we have our model class, we can create a repository for it. Go back to the Author package and add a new Java class. Create an AuthorRepository. This is going to be an interface. AuthorRepository is going to extend CassandraRepository, which takes two type parameters:
- The entity class (Author)
- The type of the ID (String)
Now we have a repository that we can use for fetching data from Cassandra, as well as persisting data to Cassandra. To save a row of Author data, you just need to create a new instance of the Author class and tell the AuthorRepository to persist it. Since we are in a Spring context, we will mark the interface with @Repository, so that Spring will take care of its lifecycle.
See an example of the full code for AuthorRepository.java.
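A minimal sketch of that interface might look like this (again, the package name is an assumption):

```java
package com.example.betterreadsdataloader.author; // hypothetical package name

import org.springframework.data.cassandra.repository.CassandraRepository;
import org.springframework.stereotype.Repository;

// Spring Data generates the CRUD implementation at runtime; the two type
// parameters are the entity class and the type of its primary key.
@Repository
public interface AuthorRepository extends CassandraRepository<Author, String> {
}
```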
Now go back to the main method (in BetterreadsDataLoaderApplication.java). We want to create a method that runs when the application starts, which we do by using the @PostConstruct annotation. When we annotate a method in a Spring bean with @PostConstruct, it gets executed after the bean is initialized.
What we can do now is create a new test Author and persist it using the AuthorRepository. We will dependency-inject the repository using the @Autowired annotation. Create a new author in the start method (you can give it some random values to check that it works). Finally, tell the AuthorRepository to save the new Author.
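Put together, the main application class could look something like the sketch below (the class name follows the series; the javax.annotation import may be jakarta.annotation on newer Spring Boot versions):

```java
import javax.annotation.PostConstruct;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class BetterreadsDataLoaderApplication {

    @Autowired
    private AuthorRepository authorRepository;

    public static void main(String[] args) {
        SpringApplication.run(BetterreadsDataLoaderApplication.class, args);
    }

    // Runs once, after the Spring context is initialized.
    @PostConstruct
    public void start() {
        Author author = new Author();
        author.setId("id-test");              // placeholder values just to
        author.setName("Test Name");          // verify the round trip works
        author.setPersonalName("Test Personal Name");
        authorRepository.save(author);
    }
}
```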
Once we run this, the Cassandra schema should first be created from our application.yml file (if it doesn't already exist). A new instance of the Author model will be created and persisted, and there will be a new row in the author_by_id table in our database. You can use the CQL console to see if it worked.
We now have a mechanism to save data to Cassandra. Next, we are going to use it on our author file. We are going to open the file, parse it line by line, and build an Author instance for each line. Each instance will then be persisted, so that all the Author information gets saved to our database.
Saving all the authors in the world to Cassandra
At this point, we have:
- A model class
- A repository set up
- A file with data that needs to be persisted to the database
The start() method, using the Author entity, is going to run through the list of author data in the file. It will create the Author entities and persist them to the database.
Create a property in our properties file (application.yml) and call it datadump.location.author. This is where we give the location of our author file. You can add the works file location while you are in there.
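In application.yml, the two properties might look like this (the file paths below are placeholders for wherever you saved the dumps):

```yaml
datadump:
  location:
    author: /path/to/authors-dump.txt
    works: /path/to/works-dump.txt
```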
We can access these properties using the @Value annotation in BetterreadsDataLoaderApplication.java, our main application file. Now, when we try to read the files, the locations will be ready for us to use.
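Inside the main application class, the injected fields could be declared like this (field names are our own; Spring resolves the ${...} placeholders against application.yml at startup):

```java
// Fields inside BetterreadsDataLoaderApplication
@Value("${datadump.location.author}")
private String authorDumpLocation;

@Value("${datadump.location.works}")
private String worksDumpLocation;
```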
We are going to create a new method called initAuthors(), and we will create the same for the works later. This method takes on the job of loading all the author values into the database. We will call it from the start() method. initAuthors() first gets the author file into a Path variable and then reads the file line by line. Each line will be sent as a record to our Astra DB instance to be saved.
We now have a stream of lines coming from our Author file, and for each one we need to:
- Read and parse the line
- Construct an Author object
- Persist using the Repository
We need to do this for each and every line in the authors dump file.
Read and parse every line. On each line, everything from the first curly brace to the end of the line is a JSON blob. We want to construct a JSON object from that blob and use it to populate the author data. So, we find the position of the first curly brace in each line and take the substring from there to the end of the line. That gives us a JSON string to create a JSON object from.
Next, we use the JSON API to create a JSON object from this, and put it into a local variable. We are able to pluck properties from this JSON object (name, personal name, and so on). To get the key, we need to remove “/author/” in front of it.
Construct the author object. Create a new instance of Author. We will get the name, personal_name, and key from the JSON object. We need to remove the “/author/” portion from the key first before we can use it. Replace it with an empty string.
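The substring and key clean-up steps can be isolated into two small pure-Java helpers (the class and method names here are our own, for illustration):

```java
public class AuthorLineParser {

    // Everything from the first curly brace to the end of the line
    // is the JSON blob for this record.
    public static String extractJson(String line) {
        return line.substring(line.indexOf('{'));
    }

    // The key carries a "/author/" prefix, as described above,
    // which we replace with an empty string before using it as the ID.
    public static String cleanKey(String key) {
        return key.replace("/author/", "");
    }
}
```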
Persist using repository. This is very easy. All we need to do is just save the author we just created.
Now we have the processing logic for getting each line from the file, creating an Author object from it, and saving it to our Cassandra database. To test with only ten records, you can use lines.limit(10) to get just the first ten lines in the file. We also need to surround the per-line processing in a try/catch, so that one bad line won't break the whole load. If a line fails, we just keep going.
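A sketch of initAuthors() under those assumptions (the org.json library for JSON parsing, and the @Value-injected authorDumpLocation field) might look like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.json.JSONException;
import org.json.JSONObject;

private void initAuthors() {
    Path path = Paths.get(authorDumpLocation);
    try (Stream<String> lines = Files.lines(path)) {
        lines.limit(10).forEach(line -> {   // limit(10) is for testing only
            try {
                // The JSON blob starts at the first curly brace on the line.
                String jsonString = line.substring(line.indexOf("{"));
                JSONObject jsonObject = new JSONObject(jsonString);

                Author author = new Author();
                author.setName(jsonObject.optString("name"));
                author.setPersonalName(jsonObject.optString("personal_name"));
                // Strip the "/author/" prefix from the key, as described above.
                author.setId(jsonObject.getString("key").replace("/author/", ""));

                authorRepository.save(author);
            } catch (JSONException e) {
                // One malformed line should not abort the whole load.
                e.printStackTrace();
            }
        });
    } catch (IOException e) {
        e.printStackTrace();
    }
}
```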
Now run your application. If you check CQL console, you should see that you have 10 authors in your table.
We are able to get data from the file and save it to the database. For testing purposes, we want a fresh schema and data every time we run this, so change the schema action to recreate in application.yml.
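With the standard Spring Boot property names, that setting would look something like this (the exact key may differ depending on your Spring Boot version and whether you use the Astra starter):

```yaml
spring:
  data:
    cassandra:
      schema-action: recreate
```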
Now remove lines.limit(10) from your code, and print out each author's name as the rows are being saved.
Now let's run the application and let it work on loading our data. Check the author_by_id table in the CQL console again afterward to ensure that it works.
Now that we have all of our author data in our Astra DB instance, it is time that we did the same with the book data, using a very similar method. Next, we will set up the entity and repository to fetch book information, given a single book ID.
That's what we will be covering in our next post. In the meantime, check out the GitHub repository with the full code for this project.