Building a Python-Based Backend Service to Collect Audit Logs from MongoDB Atlas

Noy Meir
Gong Tech Blog

--

Introduction

In the world of data management, ensuring the security and integrity of our data is essential, especially when handling vast amounts of information across multiple MongoDB clusters. At our company, we identified the need for comprehensive audit logging to monitor our MongoDB Atlas clusters. However, we faced several challenges in efficiently collecting and storing these logs. In this blog post, we’ll guide you through our journey of enabling audit logging in MongoDB Atlas, gathering logs from all our databases and clusters, and securely storing them in Amazon S3. By sharing our experience, we hope to offer valuable insights and tips for other DevOps engineers facing similar challenges.

Goal: Collect and Store Logs from All Databases and Clusters Across All Regions

Our objective was to develop a solution capable of gathering logs from multiple sources across all regions of our MongoDB Atlas clusters and ensuring that all collected logs are securely stored in Amazon S3, making them easily accessible when needed.

Challenges

  1. No direct log forwarding from Atlas: MongoDB Atlas provides an API for querying logs but does not support direct log forwarding to external storage solutions.
  2. Unfriendly log collection API: The API requires specifying timeframes for log retrieval and does not maintain state across requests. Additionally, it collects logs per server, while our requirement was to collect logs at the database level.
    Example of the API request:
    https://cloud.mongodb.com/api/atlas/v1.0/groups/<company_id>/clusters/<cluster_id>.mongodb.net/logs/mongos.gz?startDate=1660126415&endDate=1660126474
  3. Diverse log types per cluster instance:
  • Mongod: Logs from the primary daemon process that handles data requests and background management operations.
  • Mongos: Logs from the shard utility process (mongos), which acts as the query router for sharded clusters.
  • Mongos-audit-logs: Audit logs specific to Mongos instances.
  • Mongod-audit-logs: Audit logs specific to Mongod instances.

The Solution

To address these challenges, we developed a custom Python-based service with the following capabilities (a sketch of the main loop follows the list):

  1. Listing all servers: The service queries the API to get a list of all instances for all clusters in all of our projects.
  2. Querying the API: The service queries the API endpoint for each server in our database list, collecting logs based on the timestamp of the last successful run.
  3. Locking Mechanism: To prevent concurrent processes from querying the same API endpoint, we implemented a robust locking mechanism using DynamoDB.
  4. Timestamp Management: The service saves the timestamp of the last successful log collection for each server and file type, ensuring seamless continuity in log retrieval.
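
Putting these capabilities together, here is a minimal sketch of the main loop. list_hosts() and collect_and_store() are hypothetical helper names, the lock and timestamp methods are sketched in the sections below, and the file names correspond to the four log types the Atlas logs endpoint serves:

```python
import time

# File names the Atlas logs endpoint serves for the four log types.
LOG_TYPES = ["mongodb", "mongos", "mongodb-audit-log", "mongos-audit-log"]

def run_once():
    for group_id, hostname in list_hosts():          # capability 1: every host in every project
        for log_name in LOG_TYPES:
            if not lock_object(hostname, log_name):  # capability 3: skip if another worker holds it
                continue
            try:
                start = get_timestamp(hostname, log_name)   # capability 4: resume from last success
                end = int(time.time())
                collect_and_store(group_id, hostname, log_name, start, end)  # capability 2
                set_timestamp(hostname, log_name, end)
            finally:
                unlock_object(hostname, log_name)
```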

Deployment and Architecture Challenges

1. Where to keep the latest run’s timestamp?

We opted for Amazon DynamoDB due to its speed, simplicity, and flexibility.

Each entry includes attributes to store the lock status and the last successful timestamp. The service interacts with these attributes using methods like get_timestamp() and set_timestamp(). DynamoDB’s flexibility allowed us to add attributes as needed, keeping our datastore adaptable to changing requirements.

[Image: updating the last run's timestamp in DynamoDB]
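
For illustration, minimal versions of these methods with boto3 might look like the following. The table and attribute names here are assumptions rather than our exact schema; each item is keyed by hostname and filename, matching the index design described below:

```python
import boto3

# Hypothetical table name; see the indexing section below for the key design.
table = boto3.resource("dynamodb").Table("atlas-log-collector-state")

def get_timestamp(hostname: str, filename: str) -> int:
    """Return the epoch of the last successful collection, or 0 if never run."""
    item = table.get_item(Key={"hostname": hostname, "filename": filename}).get("Item")
    return int(item["last_timestamp"]) if item else 0

def set_timestamp(hostname: str, filename: str, epoch: int) -> None:
    """Persist the time of the last successful collection for this host/file pair."""
    table.update_item(
        Key={"hostname": hostname, "filename": filename},
        UpdateExpression="SET last_timestamp = :ts",
        ExpressionAttributeValues={":ts": epoch},
    )
```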

2. Implementing a Locking Mechanism

Using the same DynamoDB table, we implemented a locking mechanism with a boolean locked attribute. The service includes methods to manage this attribute:

  • is_locked(): Checks if the object is currently locked.
  • lock_object(): Sets the lock status to True.
  • unlock_object(): Sets the lock status to False.

This mechanism ensures that no two processes attempt to collect logs from the same server and file simultaneously, preventing data inconsistencies and potential API throttling.

[Image: locking mechanism flow]
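
A minimal sketch of these methods, reusing the assumed table from above: the key idea is to acquire the lock with a DynamoDB conditional write, so exactly one process can win it even when several attempt simultaneously:

```python
from botocore.exceptions import ClientError

def is_locked(hostname: str, filename: str) -> bool:
    """Check whether another process currently holds the lock for this item."""
    item = table.get_item(Key={"hostname": hostname, "filename": filename}).get("Item")
    return bool(item and item.get("locked"))

def lock_object(hostname: str, filename: str) -> bool:
    """Atomically set locked=True; returns False if another process holds the lock."""
    try:
        table.update_item(
            Key={"hostname": hostname, "filename": filename},
            UpdateExpression="SET locked = :t",
            ConditionExpression="attribute_not_exists(locked) OR locked = :f",
            ExpressionAttributeValues={":t": True, ":f": False},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # lost the race: another process acquired the lock first
        raise

def unlock_object(hostname: str, filename: str) -> None:
    """Release the lock after a collection run completes."""
    table.update_item(
        Key={"hostname": hostname, "filename": filename},
        UpdateExpression="SET locked = :f",
        ExpressionAttributeValues={":f": False},
    )
```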

3. Indexing Strategy in DynamoDB

We considered several factors for our DynamoDB index:

  • Discovery of new hosts and files: Our indexing strategy needed to accommodate dynamic changes in our infrastructure, such as the addition of new hosts and log files.
  • Search Efficiency: Optimizing the search for keys to ensure quick retrieval and minimizing query costs.
  • Partition Sizing: Managing partition sizes to balance performance and cost, ensuring the database scales effectively with our growing data needs.

Our final design uses the hostname as the partition key and filename as the sort key. This approach ensures efficient data retrieval and storage, accommodating both existing and new log sources.

Each entry has the following key attributes:

  • Partition Key: hostname
  • Sort Key: filename
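
For reference, creating a table with this key schema via boto3 could look like the sketch below; the table name and billing mode are assumptions:

```python
import boto3

boto3.client("dynamodb").create_table(
    TableName="atlas-log-collector-state",  # hypothetical name
    KeySchema=[
        {"AttributeName": "hostname", "KeyType": "HASH"},   # partition key
        {"AttributeName": "filename", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "hostname", "AttributeType": "S"},
        {"AttributeName": "filename", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",  # a small state table rarely justifies provisioned capacity
)
```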

4. Service vs. Cronjob

We debated whether to run the log collection as a cron job or as a persistent service in Kubernetes. Even though the locking mechanism makes concurrent runs safe, we opted for a long-lived Deployment for several reasons:

  • Scalability: Running as a service in Kubernetes allows for easier scaling and management.
  • Monitoring and Health Checks: We wrapped the collection script in a small Flask application that runs it in a while True loop and exposes a /health endpoint returning HTTP 200 when the collector is functioning properly. This makes Kubernetes liveness probes and monitoring straightforward (a sketch of this wrapper follows the list).
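
A stripped-down version of this wrapper might look like the following, where run_once() stands in for the collection loop sketched earlier, and the polling interval and health threshold are assumed values:

```python
import logging
import threading
import time

from flask import Flask

app = Flask(__name__)
last_success = 0.0

def collection_loop():
    """Run the collection script forever, recording when it last succeeded."""
    global last_success
    while True:
        try:
            run_once()  # the collection entry point sketched earlier
            last_success = time.time()
        except Exception:
            logging.exception("log collection cycle failed")
        time.sleep(60)  # assumed polling interval

@app.route("/health")
def health():
    # Report healthy only if a collection cycle succeeded within the last 10 minutes.
    if time.time() - last_success < 600:
        return "OK", 200
    return "unhealthy", 503

if __name__ == "__main__":
    threading.Thread(target=collection_loop, daemon=True).start()
    app.run(host="0.0.0.0", port=8080)
```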

Detailed Implementation

  1. Flask Application: The Flask application acts as the entry point for the log collection service. It runs the log collection script in a loop, ensuring continuous log collection and providing health checks.
  2. Log Collection Script: The script queries the MongoDB Atlas API for each server, collects logs based on the last successful timestamp, and saves them to Amazon S3. It updates the timestamp in DynamoDB after each successful run (see the sketch after this list).
  3. Error Handling: We implemented robust error handling to manage API failures, network issues, and data inconsistencies. The service retries failed requests and logs errors for further analysis.
  4. Performance Optimization: To optimize performance, we implemented concurrent requests to the API and efficient data processing techniques. This ensures that the service can handle large volumes of log data without significant delays.
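
To make the collection step concrete, here is a hedged sketch of the collect_and_store() helper used in the main loop. The Atlas Admin API (v1.0) authenticates with HTTP digest using a programmatic API key; the bucket name, S3 key layout, environment variable names, and retry policy are illustrative assumptions:

```python
import os
import time

import boto3
import requests
from requests.auth import HTTPDigestAuth

AUTH = HTTPDigestAuth(os.environ["ATLAS_PUBLIC_KEY"], os.environ["ATLAS_PRIVATE_KEY"])
S3_BUCKET = "example-atlas-audit-logs"  # hypothetical bucket name

def collect_and_store(group_id, hostname, log_name, start, end, retries=3):
    """Download one compressed log file from Atlas and upload it to S3."""
    url = (f"https://cloud.mongodb.com/api/atlas/v1.0/groups/{group_id}"
           f"/clusters/{hostname}/logs/{log_name}.gz")
    for attempt in range(retries):
        try:
            resp = requests.get(
                url,
                params={"startDate": start, "endDate": end},
                auth=AUTH,
                headers={"Accept": "application/gzip"},
                timeout=120,
            )
            resp.raise_for_status()
            break
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # let the caller log the failure and retry next cycle
            time.sleep(2 ** attempt)  # simple exponential backoff between retries
    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key=f"{hostname}/{log_name}/{start}-{end}.gz",  # illustrative key layout
        Body=resp.content,
    )
```

For the concurrency in point 4, the same function can be fanned out across host/file pairs with concurrent.futures.ThreadPoolExecutor, with the worker count capped to stay within Atlas API rate limits.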

Conclusion

Building a robust and efficient log collection system for MongoDB Atlas was a challenging but rewarding endeavor. Our custom Python-based service, leveraging DynamoDB for state management and AWS S3 for storage, has proven effective in addressing the unique challenges posed by Atlas’s API and diverse log types. This solution ensures comprehensive audit logging while providing a scalable and maintainable approach to managing our database logs.

By sharing our journey and technical solutions, we hope to inspire and assist other teams facing similar challenges in their DevOps and data management efforts.
