Enterprise Application Logging Best Practices (A Support Engineer’s Perspective)

Let’s write meaningful log messages that everyone loves!

Thilina Ashen Gamage
Dec 2, 2020 · 15 min read

Application logs reveal information about both internal and external events that are visible to the application during its runtime. When a bug, security breach, or anomaly is present in a deployment, application logs are the most helpful and reliable evidence for conducting a proper root cause analysis of the incident.

In this article, let’s understand how to write meaningful application log messages that everyone loves.

1. What to Log

When components communicate with each other via message passing, both incoming and outgoing messages must be recorded with API endpoint URLs, request parameters, request origin and intermediary IPs, request headers, auth info, request and response bodies, business context, timestamps, and internal processing steps. Most importantly, a unique message ID must be assigned to each message (usually generated at the API manager/service facade level) so that the message's full cycle/round trip can be traced across multiple services in the system.
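As a rough illustration, here is a minimal Java/SLF4J sketch of that idea: the facade assigns a message ID to an incoming request and logs both the inbound and outbound legs with it. The class and method names (ApiFacade, forwardToBackend, etc.) are hypothetical, not part of any specific framework.

import java.util.Map;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ApiFacade {

    private static final Logger logger = LoggerFactory.getLogger(ApiFacade.class);

    // Assign a unique message ID at the facade level and log both legs of the round trip
    public String handle(String endpoint, String originIp, Map<String, String> params) {
        String messageId = UUID.randomUUID().toString();

        logger.info("[{}] Incoming request: endpoint={}, origin={}, params={}",
                messageId, endpoint, originIp, params);

        String responseStatus = forwardToBackend(messageId, params);

        logger.info("[{}] Outgoing response: status={}", messageId, responseStatus);
        return responseStatus;
    }

    private String forwardToBackend(String messageId, Map<String, String> params) {
        // the same messageId travels with the message to downstream services
        logger.debug("[{}] Forwarding to backend with {} parameters", messageId, params.size());
        return "200 OK";
    }
}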

When a service or function is invoked, it is good practice to log its context at a lower log level, mostly for debugging purposes (use TRACE or DEBUG). Having these logs helps us investigate issues related to business logic easily, especially when we don’t have the privilege of attaching a debugger to our application (e.g., deployments to testing, staging, and pre-prod environments).
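For example, a service method might log its invocation context at DEBUG. A minimal Java/SLF4J sketch (the PaymentService class and its placeholder business rule are invented for illustration):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentService {

    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);

    public boolean authorize(String accountId, double amount, String currency) {
        // Invocation context goes to DEBUG so it is available in test/staging
        // environments without flooding production logs
        logger.debug("authorize() called: accountId={}, amount={}, currency={}",
                accountId, amount, currency);

        boolean approved = amount > 0; // placeholder business rule
        logger.debug("authorize() result for accountId={}: {}", accountId, approved);
        return approved;
    }
}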

Every application has its unique business cases and user journeys, and they reveal many insights to the domain experts in the system. For instance, whether a certain transaction takes too long or whether end users always get stuck at some functionality are very critical pieces of information in terms of user experience. Other business-related information, like transaction volumes and active users and their stages, is important for deriving business insights and can even be used for business intelligence purposes.

In most enterprise applications, for security and compliance reasons, it is mandatory to keep a separate log for data-related operations with all important information like access IDs (user/system), exact service instances and role privileges used, timestamps, data layer queries, and snapshots of both previous and new states of the changed dataset (diff). An audit trail must capture all data-related attempts (access, import, export, etc.), and CRUD operations (creates, reads, updates, deletes) performed by the users, as well as by the other systems and services.
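One common way to implement such an audit trail, sketched below assuming an SLF4J-style framework, is a dedicated, separately named audit logger that can be routed to its own file/appender. The logger name "AUDIT" and the field layout are assumptions for illustration, not a standard.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AccountRepository {

    // A dedicated, separately named logger lets the audit trail be routed to its own
    // appender/file, away from regular application logs
    private static final Logger audit = LoggerFactory.getLogger("AUDIT");

    public void updateAccountStatus(String actorId, String role, String recordId,
                                    String oldStatus, String newStatus) {
        // ... perform the actual data-layer update here ...

        audit.info("actor={} role={} action=UPDATE entity=account id={} before={} after={}",
                actorId, role, recordId, oldStatus, newStatus);
    }
}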

System events must capture information on behaviour events (startups, stops, restarts, security events), changeover modes (cold, warm, hot), inter-service communication (handshakes, connection building statuses — connect, disconnect, reconnect, retry events), service instance IDs, actively serving APIs, actively listening IP and port ranges, configurations loaded (initial load up and dynamic updates), overall service health, and anything that helps to understand the behaviour of the system.

Diligence is a great characteristic of computing devices, but they may not be as perfect as we were taught at school. Performance anomalies or sudden, unexpected degradations in services (mostly due to unhandled errors and corrupted data) can happen at any time. To identify them, it is always recommended to publish stats on overall system health and performance. Such stats may include information like API call counts (both successfully served ones and a separate failure count), network latencies, average round-trip times, memory consumption, and other application-specific information (mostly decided by the business context).

Revealing threats and vulnerabilities with the help of the application runtime and log messages is an art that every enterprise software developer must learn. In general, security breaches and catastrophic failures in applications do not happen suddenly. Most of the time, there are clues that no one notices in the first place. Therefore, we must always log suspicious human activities (e.g., failed authentication and verification attempts, with all low-level information like the networks used, request origins, and the user roles and privileges attached) as well as system behaviours (e.g., growing spikes in resource consumption, high loads on web servers, services getting choked randomly). When a suspicious event is noticed, make sure the logs capture all information related to it, ideally a full stack trace along with parameter values and additional information available from the application context.
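With SLF4J-style frameworks, passing the exception object as the last argument prints the full stack trace along with the formatted message. A minimal sketch (class, method, and field names are hypothetical):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoginService {

    private static final Logger logger = LoggerFactory.getLogger(LoginService.class);

    public void login(String userId, String originIp) {
        try {
            verifyCredentials(userId);
        } catch (SecurityException e) {
            // Passing the exception as the last argument makes SLF4J print the full stack trace
            logger.warn("Failed authentication attempt: userId={}, origin={}, reason={}",
                    userId, originIp, e.getMessage(), e);
        }
    }

    private void verifyCredentials(String userId) {
        throw new SecurityException("invalid credentials");
    }
}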

2. What Not to Log

Almost all the privacy laws (e.g., GDPR, CCPA) clearly advise developers to keep PII away from the logs. PII includes information like first name, last name, username, gender, birthday, mailing and billing addresses, emails, phone numbers, social security numbers (SSN), and credit card numbers.

Make sure you avoid logging information like business names, related personnel (employees, clients, suppliers, etc.), and business and personal contact information. Logs should never expose the business relationships and related party transactions to outsiders. To trace back specific transactions, instead of using real business names and identifiers, make sure you use an event ID generated by the system and pass it through the other services.

By law, all financial data must be fully hidden/masked in the logs. Exposing such information in logs can easily result in serious lawsuits (it can even be interpreted as a criminal offence). Therefore, avoid such cases at all times.

Security credentials and auth tokens are considered sensitive information, and exposing them via logs helps intruders carry out easy security breaches in the system. Therefore, always be mindful of such sensitive information and keep it away from the logs.

Note: Specifying what information to hide from logs will be easy if you can attach an attribute to each field specifying its visibility level (e.g., show, mask, hide, encrypt). If you have such a mechanism, you can enable and disable the visibility of fields just by updating a property in configurations. If you have a requirement for logging some of the user data in non-production environments, especially for testing and debugging purposes, this is the way to go. Alternatively, you can write parsers to filter the log messages and handle the sensitive fields according to the pre-defined instructions based on the environment.
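Below is a minimal, self-contained sketch of the attribute-per-field idea using a custom annotation and reflection. The LogVisibility/Loggable names and the masking format are invented for illustration; a real implementation would likely plug into your serialization or logging layer and read the visibility rules from configuration instead.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

enum LogVisibility { SHOW, MASK, HIDE }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Loggable {
    LogVisibility value() default LogVisibility.MASK;
}

class UserRecord {
    @Loggable(LogVisibility.SHOW)  String userId = "U-1001";
    @Loggable(LogVisibility.MASK)  String email = "jane@example.com";
    @Loggable(LogVisibility.HIDE)  String ssn = "999-99-9999";
}

public class SafeLogFormatter {
    // Builds a log-safe string honouring each field's visibility attribute
    public static String toLogString(Object obj) throws IllegalAccessException {
        StringBuilder sb = new StringBuilder();
        for (Field f : obj.getClass().getDeclaredFields()) {
            Loggable rule = f.getAnnotation(Loggable.class);
            if (rule == null || rule.value() == LogVisibility.HIDE) continue;
            f.setAccessible(true);
            String value = rule.value() == LogVisibility.MASK ? "XXXXX" : String.valueOf(f.get(obj));
            sb.append(f.getName()).append('=').append(value).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IllegalAccessException {
        System.out.println(toLogString(new UserRecord()));
        // prints: userId=U-1001 email=XXXXX   (ssn is hidden entirely)
    }
}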

3. Best Practices

The log level is used to denote the severity of each event in the system. In most logging frameworks, the following levels are available.

  • FATAL: denotes very severe error events that will presumably lead the application to abort. Usually, these end up in catastrophic failures.
  • ERROR: denotes error events that might still allow the application to continue running, with reduced capabilities in the affected paths.
  • WARN: denotes events that are less harmful than errors. Usually, they do not lead to any degradation of capabilities or complete failure of the application. However, they are still red flags and must be investigated.
  • INFO: denotes important event banners and informational messages about application behaviour.
  • DEBUG: denotes specific and detailed information, mainly used for debugging purposes. These logs help us step through the code.
  • TRACE: denotes the most low-level information, like stack traces of code, to provide the fullest picture of a certain event/context. These logs help us inspect variable values and full error stacks.

Linux Syslog, by comparison, defines even more severity levels: Emergency, Alert, Critical, Error, Warning, Notice, Informational, and Debug.

Regardless of the complexity and depth of each log level implementation, we must properly set them in our code to provide the optimum level of information based on each scenario. For instance, all information used by developers for debugging and tech analysis purposes must go under DEBUG or TRACE levels, while system-level banners can go under INFO level.

Some tools and terminal consoles do not support printing and storing log messages with certain Unicode characters. Hence localization and other advanced features may be challenging at the logging level. Therefore, make sure you stick to the English language and always use a widely accepted character set for writing log messages.

2020-11-11 13:52:18 DEBUG  App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG Adapters:20 - Successfully loaded all

If we log too little, we may not be able to capture adequate information to build the context of each important event. If we log too much, we end up with performance issues. (Writing a huge log file means higher I/O and disk storage consumption; if the file system cannot keep up, it degrades the overall performance of the system.)

To optimize log messages, develop a thorough understanding of the functional and non-functional expectations of the system and plan for the right quality and quantity of log messages. Make every log message meaningful and related to the context — be short and sweet always.

2020-11-11 13:52:27 DEBUG  Users:18 - Successfully created new user (RecordID: 5f5a0d594d17e22b848ee570)
2020-11-11 13:52:27 ERROR  Users:34 - Failed to update DB (E: inactive user, RecordID: 5f5a0d594d17e22b848ee570)

Instead of logging full descriptions all the time, try to introduce a reference ID or simplified template to represent lengthy descriptions that appear repeatedly in your log. This reduces the overall log message count and length while providing the flexibility to keep selected information hidden. For instance, rather than publishing vulnerabilities in detailed text format to a log file, it is always safer to use a code name or common reference ID to represent them so that only the domain experts will understand the actual scenario.

2020-11-11 13:52:27 ERROR  DB:22 - Failed to write (E:ORA-01017)
// ORA-01017 denotes the "invalid username/password; logon denied" error message
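If your system does not already have well-known error codes like Oracle's ORA-xxxxx, you can maintain a small internal catalogue of reference codes and log only the code. A hedged Java sketch (the enum values and their meanings are invented for illustration):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DbWriter {

    private static final Logger logger = LoggerFactory.getLogger(DbWriter.class);

    // Internal catalogue of error reference codes; the full descriptions live in
    // documentation available to domain experts rather than in the log file
    enum ErrorCode {
        E1001, // downstream connection refused
        E1002, // authentication rejected by the database
        E1003  // schema validation failed
    }

    public void reportFailure(ErrorCode code, String recordId) {
        logger.error("Failed to write (E:{}, RecordID: {})", code, recordId);
    }
}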

Timestamps provide us with an understanding of the sequence of log events and are necessary for debugging and analytics purposes. It is recommended to use the most verbose granularity when capturing time (e.g., millisecond or microsecond accuracy) so that adjacent log events can be distinguished more easily. Also, make sure that the timestamp is placed at the beginning of the log message, in a consistent format such as yyyy-mm-dd HH:mm:ss.SSS. Always mention the timezone if you are not using the server default (UTC).

// with default server time (UTC)
2020-11-11 13:52:12 INFO  XYZ Integration API Manager v2.0.0
// with timezone (e.g., PST - Pacific Standard Time)
2020-11-11 13:52:12PST INFO  XYZ Integration API Manager v2.0.0

Mentioning the log source with DEBUG, TRACE, and ERROR logs is very helpful for pinpointing the exact location that triggered the corresponding log message. Some log frameworks can include very detailed source information (e.g., file names with line numbers), but mentioning only the class/function/file name also works most of the time. A pattern-layout sketch covering both the timestamp and the source follows the example below.

2020-11-11 13:52:12 INFO  app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO app - Loading configurations..
2020-11-11 13:52:18 INFO app - *** InstanceID APIM_V2_I02
2020-11-11 13:52:18 INFO app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO app - *** LogLevel 05 (DEBUG)
2020-11-11 13:52:12 DEBUG App:22 - Initializing Swagger UI..
2020-11-11 13:52:14 DEBUG App:28 - Generating schemata..
2020-11-11 13:52:14 DEBUG App:30 - Initializing REST services..
2020-11-11 13:52:15 DEBUG App:32 - Generating documentation..
2020-11-11 13:52:18 DEBUG App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG Adapters:20 - Successfully loaded all
2020-11-11 13:52:31 INFO app - Started listening...
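Both the timestamp and the log source usually come from the logging framework's pattern layout rather than from hand-written prefixes. As one possibility, a Logback-style pattern covering millisecond timestamps, level, logger name, and line number might look like the sketch below (exact tokens vary by framework):

<!-- logback.xml (sketch): timestamp first, then level, logger name and line number -->
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger{20}:%line - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="DEBUG">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>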

One mistake most newbie developers make is copy-pasting the same log message into multiple files, so the final log aggregate fills with similar log lines coming from many different parts of the system. When that happens, it is not easy to trace the exact location in the code that triggered the event. If the wording cannot be changed, at least mention the log source alongside the message so the final log lines look different from each other. Also, if the logging functionality is handled by a parent class, make sure you pass an identifier during initialization and use it when writing log messages for child behaviours.

2020-11-11 13:52:18 DEBUG  App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG Adapters:20 - Successfully loaded all

Usually, when an event/message enters the system, a unique ID is assigned to it. This ID is carried forward through the subsequent stages of processing, and it helps us trace the flow of an event throughout the system, which is especially useful for debugging, concurrency handling, and data-related operations. Ideally, in a single system, there should be only one unique ID for an event across all modules and services.

[87d4s-a98d7-9a8742jsd] Request Body: {
"request_id": "87d4s-a98d7-9a8742jsd",
"app_id": "TX001",
"option_val": "IBM",
"req_type_id": "0013",
"data": {...........}
}
[87d4s-a98d7-9a8742jsd] Sending request to RefData: http://10.244.2.168:8280/v1
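Rather than concatenating the ID into every message by hand, frameworks with a mapped diagnostic context (MDC) can attach it automatically to every log line written by the processing thread. A minimal SLF4J sketch, assuming the key name "requestId" and a pattern that includes %X{requestId}:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class RequestProcessor {

    private static final Logger logger = LoggerFactory.getLogger(RequestProcessor.class);

    public void process(String requestId, String body) {
        // Put the ID into the MDC once; a pattern such as "[%X{requestId}] %msg"
        // then prefixes every log line written by this thread with the ID
        MDC.put("requestId", requestId);
        try {
            logger.info("Request Body: {}", body);
            logger.info("Sending request to RefData: http://10.244.2.168:8280/v1");
        } finally {
            MDC.remove("requestId");  // always clean up, since threads are usually pooled
        }
    }
}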

In certain cases, especially when an event is passed from one system to another system, event IDs get changed according to the convention of each system. At such transition points, we must clearly mention the mapping between the previous event ID and the newly assigned ID in a separate log message with the required context.

[1000508] ***** Incoming request from 10.244.2.168:3000 *****
[1000508] Origin Id: 87d4s-a98d7-9a8742jsd -> System ID: 1000508
[1000508] Transaction successfully added to Rabbit Queue

Most enterprise systems operate as distributed computing platforms, and there are multiple instances of the same service with a variety of app configs, resources, versions, and network properties. To identify each one, it is recommended to assign an instance ID and use it during the inter-service communications.

2020-11-11 13:52:12 INFO  app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO app - Loading configurations..
2020-11-11 13:52:18 INFO app - *** InstanceID APIM_V2_I02
2020-11-11 13:52:18 INFO app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO app - *** LogLevel 05 (DEBUG)

Based on the deployment environment, the active log level must also change. The recommended convention for production environments is to print logs up to the INFO level. In other environments, logs are printed up to either DEBUG or TRACE level, according to the level of granularity preferred by the Dev and SysOps teams. A configuration sketch follows the example output below.

// Active Log Level = DEBUG
2020-11-11 13:52:12 INFO  app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO app - Loading configurations..
2020-11-11 13:52:18 INFO app - *** InstanceID APIM_V2_I02
2020-11-11 13:52:18 INFO app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO app - *** LogLevel 05 (DEBUG)
2020-11-11 13:52:12 DEBUG App:22 - Initializing Swagger UI..
2020-11-11 13:52:14 DEBUG App:28 - Generating schemata..
2020-11-11 13:52:14 DEBUG App:30 - Initializing REST services..
2020-11-11 13:52:15 DEBUG App:32 - Generating documentation..
2020-11-11 13:52:18 DEBUG App:36 - Loading adapters..
2020-11-11 13:52:21 DEBUG Adapters:10 - Loading adapter docs/v1
2020-11-11 13:52:22 DEBUG Adapters:16 - Loading adapter mongo/v1
2020-11-11 13:52:26 DEBUG Docs:38 - docs adapter initialized
2020-11-11 13:52:27 DEBUG Mongo:38 - mongo adapter initialized
2020-11-11 13:52:22 DEBUG Adapters:20 - Successfully loaded all
2020-11-11 13:52:31 INFO app - Started listening...
// Active Log Level = INFO
2020-11-11 13:52:12 INFO app - XYZ Integration API Manager v2.0.0
2020-11-11 13:52:15 INFO app - Loading configurations..
2020-11-11 13:52:18 INFO app - *** InstanceID APIM_V2_I02
2020-11-11 13:52:18 INFO app - *** BaseURL http://10.244.2.168:3000
2020-11-11 13:52:19 INFO app - *** LogLevel 04 (INFO)
2020-11-11 13:52:31 INFO app - Started listening...
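One way to achieve this without code changes, assuming a Logback-style configuration, is to drive the root level from an environment variable with a safe production default (the variable name LOG_LEVEL is an assumption):

<!-- logback.xml (sketch): level comes from the LOG_LEVEL environment variable, default INFO -->
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder><pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger{20} - %msg%n</pattern></encoder>
  </appender>
  <!-- Production leaves LOG_LEVEL unset (or sets INFO); test/staging can export LOG_LEVEL=DEBUG -->
  <root level="${LOG_LEVEL:-INFO}">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>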

Errors and failures deserve best-effort inspection. To make that possible, the application must provide the domain experts of the system with sufficient background information and the relevant business and technology contexts. For instance, if a request or message fails to process, then in addition to the error message, it is very helpful to log the failed request body too.

[1000508] ***** Incoming request from 10.244.2.168:3000 *****
[1000508] Origin Id: 87d4s-a98d7-9a8742jsd -> System ID: 1000508
[1000508] Failed to validate msg body (E: Uncaught ReferenceError: missing params - option_val)
[1000508] Failed Message: {
"request_id": "87d4s-a98d7-9a8742jsd",
"app_id": "TX001",
"req_type_id": "0013",
"data": {...........}
}

Each time a data-related operation happens, do not blindly accept its completion; make sure you validate the end state with evidence. For instance, when you create/update/delete a record in the database, it generally returns the count of records changed along with the updated record. In such cases, always run validations against the expected counts or results. Another example is inserting a record into a data structure (say, a queue). Instead of blindly assuming the insert succeeded, take the queue size and validate that it increased by one. (Suppose your system is concurrent, but the underlying queue does not support concurrent operations. In such scenarios you can silently lose records, and having something like length validations is the only way to detect such hidden errors in the code.)

DEBUG  BatchWriter:28 - Successfully connected to DB. Trying to insert 24 accounts...
DEBUG  BatchWriter:35 - Successfully inserted 24 accounts. Total DB rows affected: 24
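A JDBC-style sketch of this practice is shown below: it compares the row count reported by the database against the expected count instead of assuming success. The table name, SQL, and class names are placeholders.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BatchWriter {

    private static final Logger logger = LoggerFactory.getLogger(BatchWriter.class);

    public void insertAccounts(Connection conn, List<String> accountIds) throws SQLException {
        logger.debug("Successfully connected to DB. Trying to insert {} accounts...", accountIds.size());

        int affected = 0;
        try (PreparedStatement ps = conn.prepareStatement("INSERT INTO accounts (id) VALUES (?)")) {
            for (String id : accountIds) {
                ps.setString(1, id);
                affected += ps.executeUpdate();
            }
        }

        // Validate the end state with evidence instead of assuming the writes happened
        if (affected != accountIds.size()) {
            logger.error("Row count mismatch: expected {} inserts but DB reported {}",
                    accountIds.size(), affected);
        } else {
            logger.debug("Successfully inserted {} accounts. Total DB rows affected: {}",
                    accountIds.size(), affected);
        }
    }
}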

In most cases, it is a legal requirement for systems to mask and/or encrypt the sensitive information of the users and the internal systems. Based on the industry you operate in and the geographical region you belong to, the compliance requirements may change. Therefore, do proper research and implement the correct procedures in your application. In some cases, before getting the application up and running, you may need to present the logging strategy to security authorities and get their approvals/certifications, too.

[1000508] ***** Incoming request from 10.244.2.168:3000 *****
[1000508] Origin Id: 87d4s-a98d7-9a8742jsd -> System ID: 1000508
[1000508] Request Body: {
"user_id":"XXXXXXXXXX",
"personal_details":{
"firstName":"XXXXXXXXXX",
"lastName":"XXXXXXXXXX",
"DOB":"XXXXXXXXXX",
"gender":"Male",
"profession":"Software Architect",
"industry":"IT",
"SSN":"XXXXXXXXXX"
},
"address_history":[
{"streetAddress":"Street No 1","zipcode":"XXXXX","state":"CA"},
{"streetAddress":"Street No 2","zipcode":"XXXXX","state":"NY"},
{"streetAddress":"Street No 2","zipcode":"XXXXX","state":"AL"}
],
"card_info":[
{"type":"amex","card_number":"XXXXXXXXX","credit_limit":"XXXXX"},
{"type":"visa","card_number":"XXXXXXXXX","credit_limit":"XXXXX"}
]
}

If you’re writing logs to a file, make sure they are stored on a separate disk that has no impact on the running application (e.g., you can take a dedicated location from a separate server and mount it to the application servers). Also, understand the log frequency and growth pattern of the log files. Make sure you have a log rotation policy with proper file-naming conventions (e.g., append a timestamp to the file name when creating each log file) to keep log files at a manageable size and count. Also, there should be mechanisms to back up old logs to safe locations and to clean up log storage regularly. Based on the industry you operate in, you can decide the backup expiry time (usually a few months or years) and, at the end of that period, destroy all the past log files. A rolling-policy configuration sketch follows the listing below.

[ec2-user@ip-XXX-XX-X-XXX logs]$ ls 
..
APIM_V2_I02-2020-11-20_04:38:43.log
APIM_V2_I02-2020-11-23_02:05:35.log
APIM_V2_I02-2020-11-24_04:38:17.log
APIM_V2_I02-2020-11-27_03:28:37.log
APIM_V2_I02-2020-11-27_12:06:45.log
...
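As one possibility, a Logback-style rolling-policy configuration covering timestamped file names, size/time-based rotation, and automatic cleanup could look like the sketch below; the paths, retention period, and size caps are assumptions to be adjusted to your own compliance requirements.

<!-- logback.xml (sketch): daily rotation, timestamped file names, 90 days retention -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>/mnt/logs/APIM_V2_I02.log</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
    <fileNamePattern>/mnt/logs/APIM_V2_I02-%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
    <maxFileSize>100MB</maxFileSize>
    <maxHistory>90</maxHistory>
    <totalSizeCap>10GB</totalSizeCap>
  </rollingPolicy>
  <encoder>
    <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %logger{20}:%line - %msg%n</pattern>
  </encoder>
</appender>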

4. Advanced Recommendations

Having a central, accessible server/location to aggregate the logs is a very common practice among enterprise software developers. Usually, these log aggregators keep track of not only the application logs but also other log data such as device/OS logs (e.g., Linux Syslog), network/firewall logs, database logs, etc. They also decouple the log files from the application servers and let us store all log data in more secure, organized, and efficient formats for a longer period of time. In some industries (e.g., banking and finance), it is mandatory to keep these logs both locally and in central storage, so that intruders and cybercriminals would have to access both locations to delete the evidence of their activity. Thanks to this log redundancy, a mismatch between the two locations raises a red flag and helps prevent breaches from going unnoticed.

Writing custom log parsers and filters is a feature integrated into most log monitoring tools — a.k.a. security information and event management (SIEM) integration. Such parsers let us store log data in more organized formats, and querying becomes a lot easier and quicker. Properly organized log data can also be fed into log monitoring and anomaly detection systems to proactively monitor the system and even forecast future events. These tools are so advanced that they provide a great visual experience via interactive dashboards built on time-series data and real-time analysis of events from the log data and other sources.

Almost all the log monitoring tools include features to define custom thresholds at certain levels. When the system hits those levels, the monitoring tool will proactively detect them with the help of log data and notify SysAdmins via alarms, push notification APIs (e.g., Slack Audit Logs API), emails, etc. Also, they can be preconfigured to trigger automated processes like dynamic scaling, system backup, changeovers, etc. However, if you invest in commercial software for log monitoring purposes, make sure you do a proper analysis because, for most small to medium software systems, this can be overkill.

5. Recommended Tools

JavaScript/TypeScript ---> Log4js 😍 / pino
Java --------------------> Log4j 😍
Golang ------------------> Logrus 😐
Serverless Functions ----> aws-lambda-powertools-python / Build it yourself 🤔
.....
...
.
👉 Let me know your favorites. I'd love to list them here!
CLI Tools ----> less, vim
Cloud Tools --> Fluentd, AWS CloudWatch
SIEM Tools ---> SolarWinds, Splunk, McAfee ESM, DataDog, IBM QRadar
Other --------> ELK Stack 😍 (ElasticSearch, Logstash, Kibana, Beats) , Loggly
.....
...
.
👉 Let me know your favorites. I'd love to list them here!

Stay tuned for the next programming tip. Until then, happy coding!
