Validation Rules!

Peter Arts
SafetyCulture Engineering
10 min read · Sep 19, 2023
Pile of Lego minifigure heads.
There are good and bad characters out there. Who/which do you let in? https://unsplash.com/photos/v8pL84kvTTc

Secure and reliable apps with input validation

Intro / TL;DR
Input validation is powerful. It can prevent invalid, malformed data from reaching an application, which greatly increases its reliability and security. What if there were a way to make it substantially harder, or even impossible, to exploit injection vulnerabilities such as SQL injection or Cross-Site Scripting (XSS)?

I’m a Security Engineer at SafetyCulture and one of the benefits of a smaller security team is that it provides the opportunity to work across many different areas of security, instead of being siloed and focusing on only one area.

I specialize in application security and I’m a fan of input validation. Without fully understanding all the different data inputs that an application has to handle, I find it tricky to be confident about its security, resiliency, and the quality of the data it potentially stores. I know it’s not everyone’s cup of tea, so if you make it to the end of this post and you’re still not convinced of the powers of input validation, I’ll give you five reasons why you want to embrace it.

This is the first post in a series where I will explain why input validation is important for the security of an application by using a SQL injection vulnerability as an example. In future posts, I’ll share how we built a validator module that is used in most of our microservices (it’s open source so stay tuned) and increased input validation coverage across our microservices landscape from 40% to 70% by leveraging custom Semgrep rules.

Background

In order to handle data in a secure manner, it is important to understand potential data inputs that an application has to deal with. Using input validation, we can restrict application inputs to adhere to a predefined pattern consisting of a permitted length, valid characters, special symbols, encoding such as UTF-8, etc.
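As a rough illustration of such a pattern, here is a minimal Python sketch that checks encoding, length, and permitted characters for a name field. The limit `MAX_NAME_LENGTH`, the set `EXTRA_ALLOWED`, and the function `is_valid_name` are hypothetical and chosen only for this example; this is not the SafetyCulture validator.

```python
import unicodedata

MAX_NAME_LENGTH = 100        # hypothetical limit; pick one that fits your data
EXTRA_ALLOWED = set(" -'.")  # hypothetical punctuation permitted in names


def is_valid_name(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 and only contains permitted characters."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects invalid byte sequences
    except UnicodeDecodeError:
        return False
    if not 1 <= len(text) <= MAX_NAME_LENGTH:
        return False
    for ch in text:
        # Unicode general category, e.g. "Lu"/"Ll"/"Lo" (letters), "Mn" (marks),
        # "So" (symbols such as emoji), "Cc" (control characters).
        category = unicodedata.category(ch)
        is_letter_or_mark = category[0] in ("L", "M")
        if not (is_letter_or_mark or ch in EXTRA_ALLOWED):
            return False
    return True
```

Because the check is an allowlist on Unicode letter and mark categories, names written in non-Latin scripts pass, while emoji, control characters such as null bytes, and invalid encodings are rejected.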

Funky names

Have I lost you already? Ok, go to your favourite web application or an app that you’re working on and enter an arbitrary emoji in any input field, or try using the following as your name when signing up for an account:

♙♘ 𝐒нσᵘ𝐥Ⓓ𝔸𝓋ά𝓁ίĐⒶᵗ𝓔 🐲💲

Would you consider that a valid name? What about Nguyễn Tấn or 毛泽东 ?

Names with less common characters often cause issues, especially when they are exported to PDF/Word or shared via the application in an email. Text can also look fine in the web app but not in the mobile app. When there is an option to share something in a text message (SMS), it is almost guaranteed to fail with characters like these. Searching or filtering based on these characters often does not work well either.

When I tested this about 2 years ago in the SafetyCulture app, there were a few places that showed the infamous replacement character � or something like ⎕ instead of an emoji. I tried it again today and couldn’t find any display issues anymore. 🙌

Example of a PDF document with broken characters.
SafetyCulture inspection PDF report with broken characters back in 2021.
Example of a PDF document that displays special characters correctly.
Today, special characters in PDF reports are working perfectly fine.

Admittedly, the above is a quality issue and not so much a security concern, but you’ll be able to connect the pieces shortly. When trying to prioritize security work, it often helps to also demonstrate the related quality issues that often result in a poor customer experience to get product managers and senior leadership’s attention. I’m not saying that they don’t care about security, but they most certainly care about a broken application or poor user experience!

Code vs. Data

Note that while some of the characters seen above might not be acceptable in a person’s name, they shouldn’t necessarily be a concern when looking at injection vulnerabilities given that these characters don’t have a meaning in code.

  • In SQL or HTML, “code” is represented by a limited subset of ASCII characters. For instance, <h1> is HTML syntax (code), whereas ˂h1˃ is not code but data. 🤯

Can you spot the difference?
How obvious the difference is depends on the font, but in the latter case we used modifier letters that look quite similar to angle brackets in common fonts.
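The difference is easy to confirm in code. The following Python sketch compares the two strings, assuming the lookalike characters are U+02C2 and U+02C3 (the code points are an assumption about the example above); the helper `describe` is hypothetical.

```python
import unicodedata

markup = "<h1>"               # ASCII angle brackets: HTML syntax
lookalike = "\u02c2h1\u02c3"  # assumed lookalike code points: plain data


def describe(text: str) -> list[str]:
    """List the code point and Unicode name of every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


print(markup == lookalike)   # the strings are not equal
print(markup.isascii())      # only the first one is pure ASCII
print(lookalike.isascii())
print(describe(markup))
print(describe(lookalike))
```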

Chrome’s syntax highlighting actually shows this in a pretty neat way:

Chrome browser’s view source showing the contents of an HTML file that contains two different forms of a header tag.
Note that the second <h1> is not highlighted in blue meaning it’s not code but data.

VS Code actually highlights Unicode/non-basic ASCII characters, I suspect to detect “invisible” source code vulnerabilities, nice! Also notice that its default font type represents the signs in a different way, more like superscript:

VS Code editor displaying HTML file contents highlighting that some characters are not ASCII characters.

When rendering the HTML in the browser, the first line is displayed as an H1 header (the HTML markup is not visible because it was “executed”), while the second line is treated as plain text because it doesn’t contain any markup:

A web browser displaying an HTML file that renders one of the two representations of the header tag.

Note that a similar confusion is also used in phishing emails to bypass filters and other controls, e.g. 𝒩𝑒𝓉𝒻𝓁𝒾𝓍 𝒶𝓁𝑒𝓇𝓉:

A JavaScript snippet demonstrating that a brand name written in a different letter type is not equal to the ASCII representation of the same brand name.

Or Amazon impersonations (notice the ticks between some of the letters):

A phishing email leveraging characters that look similar to their ASCII alternative for malicious purposes.

While character replacement is used for bad practices, we can also turn it into something beneficial. In the SafetyCulture validator, it is used to “magically” make input “secure”. With the <h1> example above still in your mind, think about this for a minute and you might understand where this is going! More on that in a later post.

Allow vs. Deny

When validating data, it is crucial to use an allowlist of valid inputs and not rely on a denylist with invalid inputs as it’s impossible to define all invalid inputs (unknown unknowns). This might sound like common sense but unfortunately, I still regularly see denylist approaches that don’t work effectively.
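To make the difference concrete, here is a small Python sketch that runs the same input through a denylist and an allowlist. The denylist entries, the username pattern, and the function names are hypothetical and only serve the example.

```python
import re

DENYLIST = ("<script", "javascript:")                 # hypothetical "known bad" fragments
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,30}")  # hypothetical allowlist for a username


def passes_denylist(value: str) -> bool:
    """Accept anything that does not contain a known bad fragment."""
    lowered = value.lower()
    return not any(fragment in lowered for fragment in DENYLIST)


def passes_allowlist(value: str) -> bool:
    """Accept only values that match the expected pattern in full."""
    return USERNAME_PATTERN.fullmatch(value) is not None


payload = "<img src=x onerror=alert(1)>"
print(passes_denylist(payload))   # the denylist did not anticipate this payload
print(passes_allowlist(payload))  # the allowlist rejects it without knowing about it
```

The denylist only stops what its author thought of, whereas the allowlist rejects everything that was never expected in the first place.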

Now this is the challenging part as often developers, product managers, and even users don’t know what a valid pattern looks like, while they might be able to tell you some inputs that should not get accepted (e.g. don’t allow emojis in a legal full name). For that reason, it is quite common to see a lack of input validation, denylists, or weak validation patterns that don’t offer much benefit.

In a later post in this series, I’ll explain how we leveraged data analytics to find a pattern that is suitable for many pre-existing string fields in our main application. The pattern allows more than 99.95% of previously seen characters across our most used product feature and does not block legitimate customer input, while still rejecting bad data such as control characters (e.g. null bytes) and invalid character encoding.

Effective input validation

While input validation should not be the only defense against injection vulnerabilities, it often protects the application when other protections or best practices, such as output encoding and the use of safe APIs, are lacking or failing. As is common in a defense-in-depth approach, it is one of many layers and a useful control for further hardening an application.

To demonstrate how effective input validation can be, imagine an application that suffers from a SQL injection vulnerability, a critical issue that can often be leveraged to extract the entire contents of a database or run arbitrary code on the server. Yikes! To protect against it, we embed any reasonable defenses and security measures wherever possible. This makes successful exploitation of flaws extra difficult, requiring more motivated, sophisticated adversaries with equally sophisticated tools to succeed.

The following Python code snippet suffers from a theoretical SQL injection vulnerability (line 6) in a query that takes a sorting order (e.g. ascending or descending) from the user’s request (line 2). This is quite a common use case and unfortunately, this involves a part of the query that generally doesn’t allow for query parameters.
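The original snippet is not reproduced in this text, so the line numbers mentioned above refer to the author’s code and not to what follows. As a stand-in, here is a minimal sketch of what such vulnerable code might look like, assuming Python’s built-in sqlite3 module and a hypothetical `planets` table; the function names are made up for the example.

```python
import sqlite3


def list_planets(conn: sqlite3.Connection, request_args: dict) -> list[str]:
    # The sorting order comes straight from the request, so the user controls it.
    sort = request_args.get("sort", "asc")
    # VULNERABLE: the value is concatenated into the SQL text because the
    # ORDER BY direction cannot be passed as a bound query parameter.
    query = "SELECT name FROM planets ORDER BY name " + sort
    return [row[0] for row in conn.execute(query)]


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE planets (name TEXT)")
    conn.executemany(
        "INSERT INTO planets (name) VALUES (?)",
        [("Mars",), ("Earth",), ("Venus",)],
    )
    return conn


conn = demo_connection()
print(list_planets(conn, {"sort": "desc"}))         # intended use
print(list_planets(conn, {"sort": "asc limit 1"}))  # request text now runs as SQL
```

The value `asc limit 1` is a deliberately harmless stand-in: it only shows that whatever arrives in the sort parameter becomes part of the executed statement.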

Acceptable inputs are the values asc and desc (generally case insensitive). The value is typically sent by a UI component, but it is under the direct control of the user, which means that any value can be submitted to the application.

A typical user request might look like the first URL below; the second is an example of a malicious request:
https://example.com/planets?sort=asc
https://example.com/planets?sort=asc+union+select+*+from+users

Guess you can work out what the latter does. 💥

While this problem must be fixed in the code, input validation could make it impossible to exploit this vulnerability. A great input validation rule would only allow the exact values asc and desc as that’s really all that is ever expected or needed and then reject anything else.

You won’t get many validation patterns that are much easier than this. If your validation library allows for it, this can be defined as an enum or a regex, for instance using Joi.string().valid('asc', 'desc') in the popular Joi Node.js library or ^(asc|desc)$ when using regex patterns. If you don’t like regexes, don’t worry too much, as a good validation library can mostly abstract them away from the places where input validation is implemented. That said, do invest a bit of time in learning how to read and write regex patterns! It’s not that complex and it’s an essential tool in your engineer’s tool belt.
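For readers who prefer Python, both variants can be sketched with the standard library alone. The function names below are hypothetical. One detail worth knowing: in Python’s `re` module, `$` also matches just before a trailing newline, so this sketch uses `fullmatch` instead of relying on `^` and `$`.

```python
import re

ALLOWED_SORT_ORDERS = {"asc", "desc"}   # enum-style allowlist
SORT_PATTERN = re.compile(r"asc|desc")  # regex equivalent, applied with fullmatch


def is_valid_sort_enum(value: str) -> bool:
    """Accept only the exact values asc and desc."""
    return value in ALLOWED_SORT_ORDERS


def is_valid_sort_regex(value: str) -> bool:
    """Same rule as a regex; fullmatch requires the whole string to match."""
    return SORT_PATTERN.fullmatch(value) is not None


for candidate in ("asc", "desc", "asc union select * from users", "asc\n"):
    print(repr(candidate), is_valid_sort_enum(candidate), is_valid_sort_regex(candidate))
```

If the field should be case insensitive, one option is to lowercase the value before checking it, so that only the normalized value ever reaches the query.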

Alternatively, even a more generic pattern that only allows alphanumeric characters (such as a-z) would prevent this vulnerability from being exploitable. Sure, most values other than asc/desc would still cause a syntax error, but without permitting a space or any characters other than a-z, the input can’t be leveraged to create a syntactically correct, executable query (feel free to challenge me on this one, you SQLi gurus).
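A sketch of such a generic pattern in Python, using a hypothetical helper `is_lowercase_word`, shows that the attack payload fails on its spaces and asterisk alone:

```python
import re

LOWERCASE_LETTERS = re.compile(r"[a-z]+")  # generic pattern: one or more of a-z


def is_lowercase_word(value: str) -> bool:
    """Accept only non-empty strings made up entirely of the letters a-z."""
    return LOWERCASE_LETTERS.fullmatch(value) is not None


print(is_lowercase_word("asc"))                            # accepted
print(is_lowercase_word("asc union select * from users"))  # rejected: spaces and *
print(is_lowercase_word("ascunionselect"))                 # accepted, but a single word
```

The last value passes validation, yet it reaches the query as one unbroken word rather than as separate SQL keywords.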

Let’s visualize this in flow diagrams. A legitimate user requests a list of planets sorted in ascending order which looks as follows:

Flow of a user request to a web application, resulting in a database lookup with a list of planets in the response.

This time, a malicious user injects SQL code in the sort parameter and the SQLi vulnerability results in the SQL code being executed against the database, returning a list of planets (as expected) and dumping all users (certainly not expected):

Flow of a manipulated user request to a web application, resulting in a database lookup using an injected SQL query that returns a list of planets and users in the response.

Now let’s see how input validation can save the day! Even if the app is still vulnerable, the request gets rejected because it doesn’t satisfy the required pattern (either asc or desc) for the sort field:

A manipulated user request is rejected by a web app that implemented input validation.

The attack payload used in the example has been simplified for readability. Also note that not all database engines accept UNION after an ORDER BY clause although it’s generally still exploitable.

I hope this example demonstrates how powerful input validation can be.

Impact

Now that we have seen a rapid increase in the coverage of input validation in SafetyCulture, more and more often I’m being “hindered” by input validation when running internal security assessments. This can be pretty “annoying” when testing something or crafting a proof of concept for an issue and it often reminds me how effective and valuable input validation is as it will be similarly annoying for a bad actor!

Even without a vulnerability, input validation can support early detection of suspicious behavior when you log and monitor failed validation attempts, although you probably want to filter out noise generated by automated tooling such as Nuclei or sqlmap, or by erroneous use of an API.
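A minimal sketch of what that logging could look like, using Python’s standard logging module. The logger name, the function `validate_and_log`, and its parameters are hypothetical; how the log lines are monitored and filtered is left out.

```python
import logging
import re

logger = logging.getLogger("input_validation")  # hypothetical logger name
SORT_PATTERN = re.compile(r"asc|desc")


def validate_and_log(field: str, value: str, pattern: re.Pattern, client_id: str) -> bool:
    """Validate value against pattern and log every rejection for monitoring."""
    if pattern.fullmatch(value) is not None:
        return True
    # %r logs an escaped representation and the slice caps the length, so
    # control characters or huge payloads cannot distort the log line.
    logger.warning(
        "validation failed field=%s client=%s value=%r",
        field,
        client_id,
        value[:64],
    )
    return False


validate_and_log("sort", "asc", SORT_PATTERN, "client-1")
validate_and_log("sort", "asc union select * from users", SORT_PATTERN, "client-2")
```

Counting these warnings per client is one possible way to separate an occasional typo from systematic probing.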

5 Benefits of Input Validation

As promised, here is the list of five (other) benefits you’ll get from input validation:

#1 Quality
Data consistency, testability, healthy data and data integrity.

#2 Reliability, Resiliency, Performance, Availability
Supports stable, responsive and trustworthy applications; malformed or excessive data can have implications on application performance and availability.

It also prevents errors, retries, and exhausting rate limits when unvalidated data is sent to another system (upstream or third party API) that does not accept the data. It can also reduce the risk of poor email reputation when sending many emails to invalid addresses (bounces).

#3 Localization/internationalization
Wake up, native English speakers: many other languages use letters that you don’t find on your keyboard, such as é (diacritics), Balinese letters, or even right-to-left scripts.

Defining validation rules requires you to think about whether you should be supporting these. Many fonts don’t support all possible letters, resulting in a poor and broken user experience or misbehaving/crashing apps (e.g. PDF generation).

#4 Usability
Improves the user experience as unexpected data can result in application or display errors, data corruption and customer churn. It’s also important to show data/input requirements to a user so it is clear what data they can enter or submit and what would not be acceptable. It can be super frustrating to receive an error without a clear hint why data is not accepted.

#5 Security
Read all about it in this post.

Bonus reason: Cost savings
A quick way to save on your next AWS / GCP / Azure bill: when proper data length checks are enforced and invalid/malformed data is kept out, the application requires fewer computing resources and less storage.

In the next post, you’ll read how we leveraged an interceptor in our Golang microservices to automatically validate gRPC proto messages (requests), the inner workings of the validator, some tricky validation patterns, and how we enforce the implementation of input validation via Semgrep. I’ll update with a link here as soon as it’s live!

With input validation, we only accept good characters. https://unsplash.com/photos/I6gWLztKrYc

Peter Arts
SafetyCulture Engineering

A passionate and pragmatic Application Security Engineer.