Schema versioning and upgrade in document store — Implementation with Java, MongoDB and SpringData

Published in ELCA IT, 7 min read, Dec 22, 2023

In our previous article, we showed how schema versioning and upgrading in a document store can work from a conceptual point of view. In this article, we show how this can be implemented using Java, MongoDB and Spring Data.

The source code of the sample application is available on GitHub: https://github.com/ELCAIT/document-store-schema-migration

Sample application

The sample application manages a set of trains that stop at or pass through a series of train stations at specific times.

The UI of the application is a Swagger interface that offers the following operations:

Data setup operations:

PUT: Load the list of trains from a JSON file (included in the application)
DELETE: Delete all trains

Business operations:

GET: Get a list of all train numbers
GET: Get a train by train number
PUT: Update a train
PUT: Modify the label of a train by train number

Schema version operations:

GET: Count the number of documents per schema version
PUT: Upgrade the documents from a source to a target schema version
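
The count per schema version is essentially a grouping on the schema version of each document. The following plain-Java sketch shows the idea on an in-memory list. The record Doc, the class name and the method name are assumptions made for this illustration; the sample application performs the count against MongoDB.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative sketch only: counts documents per schema version in memory.
final class SchemaVersionCountSketch {

    // Hypothetical, minimal stand-in for a stored document
    record Doc(String id, String schemaVersion) {}

    // Groups the documents by schema version and counts each group
    static Map<String, Long> countBySchemaVersion(List<Doc> documents) {
        return documents.stream()
            .collect(Collectors.groupingBy(
                Doc::schemaVersion,
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Doc> documents = List.of(
            new Doc("a1", "V1"),
            new Doc("a2", "V1"),
            new Doc("a3", "V2"));

        System.out.println(countBySchemaVersion(documents)); // prints {V1=2, V2=1}
    }
}
```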

Document Store: MongoDB

In MongoDB, the data is stored in one collection: Train

The collection can be initialized with data from a JSON file and initially contains only documents in schema version V1.

Depending on the schema-compatibility-version configured in the application.yml (field “schemaCompatibilityVersion”), the application may read and write documents either in schema version V1 or V2.

Schema-Versions

The schema version of each document is stored in the attribute “schemaVersion”, having the two possible values “V1” or “V2”.

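For the purpose of this example, the versions differ only in attribute names. For instance, the train number is stored in the attribute “trainNumber” in V1 and in the attribute “number” in V2.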

Indexes

The collection has two separate indexes, one per schema version. The indexes are defined directly in the entity classes (V2: Train, V1: TrainV1):

Current version (V2):

@Document(collection="train")
@CompoundIndexes({
  @CompoundIndex(
    name = "v2_number",
    def = "{'schemaVersion': 1, 'number': 1}",
    partialFilter = "{'schemaVersion': 'V2'}"
  )
})
public class Train {
  ...
}

Previous version (V1):


@Document(collection="train")
@CompoundIndexes({
  @CompoundIndex(
    name = "v1_trainNumber",
    def = "{'schemaVersion': 1, 'trainNumber': 1}",
    partialFilter = "{'schemaVersion': 'V1'}"
  )
})
public class TrainV1 {
  ...
}

The indexes have the following properties:

The first attribute is the schema version
The second attribute is the train number. Note that the attribute name differs between the two versions
They are partial indexes, containing only data from the respective schema version (using the property “partialFilter”). This limits the size of each index and ensures that the index field names correspond to the actual documents of the respective schema version. For example, the index “v1_trainNumber” will only reference documents with schema version “V1”, each having an attribute “trainNumber”.

Queries

In order to find data for the respective schema version and to make use of the indexes, the queries need to include the schema version and use the correct attribute name for the train number:

Current version (V2):

{'schemaVersion': 'V2', 'number': 701}

Previous version (V1):


{'schemaVersion': 'V1', 'trainNumber': 701}

Note: Each query will only return documents of the respective schema version. In order to find all documents matching the given criterion (TrainNumber: 701), both queries have to be executed and their result sets have to be combined.
It is also possible to perform a combined query, for example:


{ $or: [ {'schemaVersion' : 'V2', 'number': 701}, {'schemaVersion': 'V1', 'trainNumber': 701} ]}
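
To make the matching rules concrete, the following plain-Java sketch emulates the three filters above on in-memory maps that stand in for MongoDB documents. The class and method names are made up for this illustration, and no MongoDB API is involved.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch: emulates the filters in memory, no MongoDB involved.
final class VersionAwareQuerySketch {

    // Emulates {'schemaVersion': 'V2', 'number': n}
    static Predicate<Map<String, Object>> v2ByNumber(int n) {
        return doc -> "V2".equals(doc.get("schemaVersion"))
            && Integer.valueOf(n).equals(doc.get("number"));
    }

    // Emulates {'schemaVersion': 'V1', 'trainNumber': n}
    static Predicate<Map<String, Object>> v1ByTrainNumber(int n) {
        return doc -> "V1".equals(doc.get("schemaVersion"))
            && Integer.valueOf(n).equals(doc.get("trainNumber"));
    }

    // Emulates the combined $or query
    static Predicate<Map<String, Object>> anyVersionByNumber(int n) {
        return v2ByNumber(n).or(v1ByTrainNumber(n));
    }

    // Returns all documents of the in-memory collection that match the filter
    static List<Map<String, Object>> find(List<Map<String, Object>> collection,
                                          Predicate<Map<String, Object>> filter) {
        return collection.stream().filter(filter).toList();
    }

    public static void main(String[] args) {
        Map<String, Object> oldDocument = Map.of("schemaVersion", "V1", "trainNumber", 701);
        Map<String, Object> newDocument = Map.of("schemaVersion", "V2", "number", 701);
        Map<String, Object> otherDocument = Map.of("schemaVersion", "V2", "number", 702);
        List<Map<String, Object>> collection = List.of(oldDocument, newDocument, otherDocument);

        System.out.println(find(collection, v2ByNumber(701)).size());         // prints 1
        System.out.println(find(collection, v1ByTrainNumber(701)).size());    // prints 1
        System.out.println(find(collection, anyVersionByNumber(701)).size()); // prints 2
    }
}
```

Each single-version filter sees only its own documents, which is why the two result sets have to be combined to find every train with number 701.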

Note: For the application, having to include the schema version in the queries is not a problem, since the queries are hard-coded in the Spring Data MongoDB repositories. However, users who execute ad-hoc queries directly in the database have to be aware that they need to include the schema version in the query. Of course, it is also possible to execute queries without specifying the schema version; each condition then simply matches the documents that contain the given attribute, for example:

{ $or: [ { 'number': 701}, {'trainNumber': 701} ]}

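Since MongoDB can only use a partial index when the query conditions guarantee a match with the index's filter, such queries cannot use the indexes defined above. For better performance, they might therefore require the definition of additional indexes.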

Java Application

The following diagram illustrates the different classes involved in the Java application:

**Separate data access services and Spring Data MongoDB repositories for each schema version**

Entity classes and Spring Data MongoDB Repositories

In order to simplify the application and take advantage of Spring Data MongoDB, the application uses a separate entity class and a separate Spring Data MongoDB repository for each schema version, all of them accessing the same MongoDB collection (train).

As for the naming conventions, the classes for older schema versions (V1) use the schema version as suffix (e.g. “TrainV1”), whereas the classes for the current schema version (V2) don’t use such a suffix. In that way, the business code can always use the current version and is not cluttered with schema version suffixes.

Current version (V2):

@Value
@Builder(toBuilder = true)
@Document(collection="train")
public class Train {
 
  @NonNull
  String id;
 
  @NonNull
  SchemaVersion schemaVersion;
 
  @Version
  Integer optimisticLockingVersion;
 
  @NonNull
  Integer number;
 
  ...
}
 
@EnableMongoRepositories
@Repository
public interface TrainMongoRepository extends MongoRepository<Train, String> {
 
  @Query("{'schemaVersion' : 'V2', 'number': ?0}")
  List<Train> findByNumber(int number);
 
  ...
}

Previous version (V1):

@Value
@Builder(toBuilder = true)
@Document(collection="train")
public class TrainV1 {
 
  @NonNull
  String id;
 
  @NonNull
  SchemaVersion schemaVersion;
 
  @Version
  Integer optimisticLockingVersion;
 
  @NonNull
  Integer trainNumber;  
 
  ...
}
 
@EnableMongoRepositories
@Repository
public interface TrainV1MongoRepository extends MongoRepository<TrainV1, String> {
 
  @Query("{'schemaVersion' : 'V1', 'trainNumber': ?0}")
  List<TrainV1> findByTrainNumber(int trainNumber);
 
  ...
}

Data access services

Since the data access is split over two separate Spring Data MongoDB repositories, the data access is encapsulated in a service that handles the combination of the two versions.

There is a data access service for each schema version:

Current version (V2): TrainService
Previous version (V1): TrainV1Service

Both services provide the same methods for access to the documents, using the current version of the entity classes (Train):

TrainService provides access to documents of both schema versions V2 and V1:
  for documents having schema version V2, TrainService uses the Spring Data repository (TrainMongoRepository)
  for documents having schema version V1, TrainService delegates the call to the V1 access service (TrainV1Service)
TrainV1Service provides access only to documents having schema version V1 and uses the corresponding Spring Data repository (TrainV1MongoRepository)

For reading operations, two separate queries are performed on the MongoDB collection, one for each schema version, and the results are combined:

public class TrainService {
 
    public List<Train> findByNumber(int trainNumber) {
        List<Train> trainsV2 = trainMongoRepository.findByNumber(trainNumber);
        List<Train> trainsV1 = trainV1Service.findByTrainNumber(trainNumber);
 
        return Stream.concat(trainsV2.stream(), trainsV1.stream()).toList();
    }
    ...
}

The access service for version V1 reads the documents from the repository and converts them into schema version V2:

public class TrainV1Service {
 
    public List<Train> findByTrainNumber(int trainNumber) {
        // Read the V1 documents and convert each of them to the current model (V2)
        return trainV1MongoRepository.findByTrainNumber(trainNumber).stream()
            .map(trainV1Converter::fromV1)
            .toList();
    }
     
    ...
}

Note: This approach requires that the new schema version is backward compatible with the previous schema version (e.g. each new mandatory field needs to be either derived from the content of the previous document or set to a useful default value). However, a non-backward-compatible schema change would pose the same problem for any type of upgrade mechanism (e.g. an SQL upgrade script). Therefore, one should try to keep the schema versions backward compatible as far as possible.
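
As an illustration of the default-value case, consider a hypothetical V2 that introduces a mandatory attribute “category”. This attribute does not exist in the sample application; it is assumed here only to show how a converter can keep the change backward compatible. The simplified records below are likewise assumptions and omit most attributes of the real entities.

```java
import java.util.Objects;

// Illustrative sketch: "category" is a hypothetical attribute, not part of the sample application.
final class BackwardCompatibleUpgradeSketch {

    // Simplified V1 document: has no category
    record TrainV1(String id, int trainNumber) {}

    // Simplified V2 document: category is mandatory
    record Train(String id, int number, String category) {
        Train {
            Objects.requireNonNull(id, "id is mandatory");
            Objects.requireNonNull(category, "category is mandatory");
        }
    }

    // Default used for documents written before the attribute existed
    static final String DEFAULT_CATEGORY = "UNKNOWN";

    // The renamed attribute is copied, the new mandatory attribute gets the default
    static Train fromV1(TrainV1 trainV1) {
        return new Train(trainV1.id(), trainV1.trainNumber(), DEFAULT_CATEGORY);
    }

    public static void main(String[] args) {
        Train train = fromV1(new TrainV1("a1", 701));
        System.out.println(train.number() + " " + train.category()); // prints 701 UNKNOWN
    }
}
```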

For writing operations, only one operation is performed on the MongoDB collection, depending on the schema-compatibility-version. In that way, the schema-compatibility-version determines in which schema version the documents are written:


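public class TrainService {
 
    public Train save(Train train) {
        // The schema-compatibility-version decides in which schema version the document is written
        switch (schemaCompatibilityVersionConfiguration.getSchemaCompatibilityVersion()) {
            case V1:
                return trainV1Service.save(train);
 
            case V2:
                return trainMongoRepository.save(train);
 
            default:
                // Required by the compiler so that every path returns or throws
                throw new IllegalStateException("Unsupported schema-compatibility-version");
        }
    }
 
    ...
}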
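
With Java 14 or later, such a dispatch can also be written as a switch expression, which the compiler checks for exhaustiveness over the enum constants. The following self-contained sketch shows this variant with an in-memory map in place of the MongoDB collection. All names except the attribute names “schemaVersion”, “trainNumber” and “number” are assumptions made for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: the write path decided by the schema-compatibility-version.
final class WriteDispatchSketch {

    enum SchemaVersion { V1, V2 }

    // Simplified current-version entity
    record Train(String id, int number) {}

    // Stand-in for the MongoDB collection: id -> stored attributes
    private final Map<String, Map<String, Object>> collection = new LinkedHashMap<>();
    private final SchemaVersion schemaCompatibilityVersion;

    WriteDispatchSketch(SchemaVersion schemaCompatibilityVersion) {
        this.schemaCompatibilityVersion = schemaCompatibilityVersion;
    }

    static Map<String, Object> toV1Document(Train train) {
        return Map.of("schemaVersion", "V1", "trainNumber", train.number());
    }

    static Map<String, Object> toV2Document(Train train) {
        return Map.of("schemaVersion", "V2", "number", train.number());
    }

    Train save(Train train) {
        // No default branch: adding an enum constant without handling it here fails to compile
        Map<String, Object> document = switch (schemaCompatibilityVersion) {
            case V1 -> toV1Document(train);
            case V2 -> toV2Document(train);
        };
        collection.put(train.id(), document);
        return train;
    }

    Map<String, Object> stored(String id) {
        return collection.get(id);
    }

    public static void main(String[] args) {
        WriteDispatchSketch service = new WriteDispatchSketch(SchemaVersion.V1);
        service.save(new Train("a1", 701));
        System.out.println(service.stored("a1").get("schemaVersion")); // prints V1
    }
}
```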

The access service for version V1 converts the entity from schema version V2 to V1 before storing it in the repository:

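public class TrainV1Service {
 
    public Train save(Train train) {
        // Convert to V1, store the document, and return the saved state in the current model
        TrainV1 trainV1 = trainV1Converter.toV1(train);
        return trainV1Converter.fromV1(trainV1MongoRepository.save(trainV1));
    }    
 
    ...
}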

Note: the concept of “schema-compatibility-version” is described in the related article https://medium.com/@bernhard.ruch/d15a2cecd4e9

The conversion from schema version V1 to V2 and vice versa is implemented by TrainV1Converter:

public class TrainV1Converter {
 
    public Train fromV1(TrainV1 trainV1) {
        ...
    }
 
    public TrainV1 toV1(Train train) {
        ...
    }
}

Note: there are tools like MapStruct that could be used for converting objects from one model to another.
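
Because the two versions differ only in attribute names, a hand-written converter amounts to a lossless renaming. The sketch below uses simplified records with only three attributes each; the real entity classes contain more attributes, and the bodies shown here are an assumption about what such a converter could look like, not the code from the repository.

```java
// Illustrative sketch of a hand-written converter between simplified V1 and V2 entities.
final class TrainV1ConverterSketch {

    enum SchemaVersion { V1, V2 }

    // Simplified entities: only the attributes needed for the illustration
    record TrainV1(String id, SchemaVersion schemaVersion, Integer trainNumber) {}

    record Train(String id, SchemaVersion schemaVersion, Integer number) {}

    // V1 -> V2: "trainNumber" becomes "number", the schema version is set to V2
    static Train fromV1(TrainV1 trainV1) {
        return new Train(trainV1.id(), SchemaVersion.V2, trainV1.trainNumber());
    }

    // V2 -> V1: "number" becomes "trainNumber", the schema version is set to V1
    static TrainV1 toV1(Train train) {
        return new TrainV1(train.id(), SchemaVersion.V1, train.number());
    }

    public static void main(String[] args) {
        TrainV1 original = new TrainV1("a1", SchemaVersion.V1, 701);
        TrainV1 roundTripped = toV1(fromV1(original));
        System.out.println(original.equals(roundTripped)); // prints true
    }
}
```

A round-trip check like the one in main is a cheap way to verify in a unit test that no attribute gets lost in either direction.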

Data migration: upgrade schema version

The upgrade of documents with older schema version (V1) to the current schema version (V2) is straightforward and can be done by the document access service (TrainService):

Find the ids of the documents having an older schema version
For each document: read the document and write a new version

In that way, the document will be written in the schema version determined by the schema-compatibility-version. If the application is running with schema-compatibility-version V2, then the documents having schema version V1 will be upgraded to V2.

public class TrainService {
 
  public void upgradeDocuments(SchemaVersion sourceSchemaVersion) {
    // Read documents from source-schema-version and write them again in schema-compatibility-version
    List<String> ids = trainMongoRepository.findIdsBySchemaVersion(sourceSchemaVersion);
    for (String id : ids) {
      findById(id, sourceSchemaVersion)
        .ifPresent(this::upgradeDocument);
    }
  }
 
  private Optional<Train> findById(String id, SchemaVersion sourceSchemaVersion) {
    switch (sourceSchemaVersion) {
      case V1:
        return trainV1Service.findById(id);
 
      case V2:
        return trainMongoRepository.findById(id);
 
      default:
        // Required by the compiler so that every path returns or throws
        throw new IllegalArgumentException("Unsupported schema version: " + sourceSchemaVersion);
    }
  }
 
  private void upgradeDocument(Train train) {
    // Write the document in the schema version given by the schema-compatibility-version
    switch (schemaCompatibilityVersionConfiguration.getSchemaCompatibilityVersion()) {
      case V1:
        trainV1Service.save(train);
        break;
 
      case V2:
        trainMongoRepository.save(train);
        break;
    }
  }
 
  ...
}
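
The effect of the upgrade can be tried out without a database. The following sketch replays the two steps (collect the ids, then read and rewrite each document) on an in-memory map. The record StoredTrain, the class name and the returned counter are assumptions made for this illustration; the skip of documents that changed in the meantime is an addition of the sketch, not something taken from the sample application.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: upgrade of stored documents, simulated in memory.
final class UpgradeSketch {

    enum SchemaVersion { V1, V2 }

    // Hypothetical stored document; "number" holds the train number in either version
    record StoredTrain(String id, SchemaVersion schemaVersion, int number) {}

    private final Map<String, StoredTrain> collection = new LinkedHashMap<>();
    private final SchemaVersion schemaCompatibilityVersion;

    UpgradeSketch(SchemaVersion schemaCompatibilityVersion) {
        this.schemaCompatibilityVersion = schemaCompatibilityVersion;
    }

    void insert(StoredTrain train) {
        collection.put(train.id(), train);
    }

    // Step 1: find the ids of the documents having the source schema version
    List<String> findIdsBySchemaVersion(SchemaVersion schemaVersion) {
        return collection.values().stream()
            .filter(train -> train.schemaVersion() == schemaVersion)
            .map(StoredTrain::id)
            .toList();
    }

    // Step 2: read each document and write it again in the schema-compatibility-version.
    // Returns the number of rewritten documents.
    int upgradeDocuments(SchemaVersion sourceSchemaVersion) {
        int rewritten = 0;
        for (String id : findIdsBySchemaVersion(sourceSchemaVersion)) {
            StoredTrain current = collection.get(id);
            // Skip documents that were deleted or rewritten since the ids were collected
            if (current == null || current.schemaVersion() != sourceSchemaVersion) {
                continue;
            }
            collection.put(id, new StoredTrain(id, schemaCompatibilityVersion, current.number()));
            rewritten++;
        }
        return rewritten;
    }

    long count(SchemaVersion schemaVersion) {
        return findIdsBySchemaVersion(schemaVersion).size();
    }

    public static void main(String[] args) {
        UpgradeSketch service = new UpgradeSketch(SchemaVersion.V2);
        service.insert(new StoredTrain("a1", SchemaVersion.V1, 701));
        service.insert(new StoredTrain("a2", SchemaVersion.V1, 702));
        service.insert(new StoredTrain("a3", SchemaVersion.V2, 703));

        System.out.println(service.upgradeDocuments(SchemaVersion.V1)); // prints 2
        System.out.println(service.count(SchemaVersion.V1));            // prints 0
        System.out.println(service.count(SchemaVersion.V2));            // prints 3
    }
}
```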

Written by Bernhard Ruch