NoVerify: Dynamic Rules for Static Analysis

VK Tech
VK Tech
Dec 9, 2019 · 12 min read

VK has one of the largest PHP codebases in Russia. When you have several million lines of code, even seemingly simple refactoring tasks can turn into an exciting journey. Put a large monolith in a monorepo, add several tightly connected system parts, and even the smallest change couldrequire your full self-confidence to merge.

NoVerify is a static analysis tool that helps VK developers. Not only does it assist in finding potential problems in the code but it also helps maintain a consistent programming style. This past spring, the linter was made available in opensource on GitHub.

Of course, we tried using some existing solutions at first. The most suitable of them in terms of performance was Psalm, but we wanted an even faster static analysis tool, one developed using a statically typed language.

NoVerify is written in Go, which allows us to process our entire repo in mere seconds. It runs for the entire project, performing a global analysis in two stages: indexing and the actual analysis itself. Metadata gathered during the indexing stage is cached, making subsequent rechecks much faster.

Our analyzer is extremely strict. It wants as many types to be specified in your PHP program as possible before its execution. If some types can’t be inferred automatically, it expects you to add @var or a similar phpdoc comment.

Today we’re going to talk about a new killer feature — a declarative way of describing inspections that doesn’t require code compilation or interacting with what NoVerify has under the hood.

To get your attention, I’m going to show you the description of one simple but useful inspection:

/** @warning duplicated sub-expressions inside boolean expression */
$x && $x;

This inspection finds all logical && expressions where the left and right operands are identical.

Premise

When even the simplest new check requires several dozen lines of Go code, you start to wonder if there’s another way.

We used Go to write type inference, the entire linter pipeline, the metadata cache and many other important features that are essential for NoVerify to work. These components are unique, unlike tasks such as “prohibit calling function X with Y arguments”. We added the dynamic rule mechanism for these simple kinds of tasks.

Dynamic rules allow you to separate complex components from trivial tasks. The definition file can be stored and versioned separately and can be edited by developers who aren’t involved in NoVerify development. Each rule performs a code inspection (which we’ll sometimes refer to as a check). With the rule definition language, we could always end up writing a semantically incorrect template or miss some type restrictions, which will lead to false positives. Nevertheless, the data race or dereferencing the nil pointer can’t be implemented through the rule language.

Template description language (TDL)

TDL is syntactically compatible with PHP, which simplifies its learning curve and also makes it possible to edit the rulesfiles using PhpStorm.

We recommend inserting a directive at the beginning of the rules file:

<?php/**
* Disable all the inspections for this file,
* because we have non-executable PHP code.
*
* @noinspection ALL
*/
// ...the rules themselves are below.

My first experiment with the syntax and possible filters for templates was phpgrep. It can be useful by itself, but inside NoVerify, it became even more interesting: it now has access to type information. Some of my colleagues had already been using phpgrep, which became another reason for choosing this syntax. phpgrep is a gogrep adaptation for PHP (youmight be interested in cgrep as well). Using this program, you can search code using syntax templates.

PhpStorm’s structural search and replace (SSR) syntax could’ve been an alternative to this. The main benefit is that it’s an existing format, but I only found out about this feature after implementing phpgrep. It’s possible, of course, to provide a technical explanation: its syntax is incompatible with PHP, and our parser isn’t going to work; however, the real reasononly showed up after we reinvented the wheel.

We could require the template to be displayed with the exact PHP code, or invent another language, for example, with the syntax of S-expressions.

PHP-like    Lisp-like
-----------------------------
$x = $y | (expr = $x $y)
fn($x, 1) | (expr call fn $x 1)
We could express the types and branching right inside the templates:(or (expr == (type string (expr)) (expr))
(expr == (expr) (type string (expr))))

In the end, I decided that template readability is important, and we could add filters through phpdoc attributes. clang-query is an example of such an idea, but it uses a more traditional syntax.

Creating and running our diagnostics

Let’s try to implement our new diagnostics for the analyzer. To do that, you need to install NoVerify. Get the binary if you do not have the Go Toolchain (otherwise you can compile everything from the source).

If you are not going to install NoVerify, you can keep reading and just pretend that you are following the steps and admiring the results!

Problem description

PHP has many interesting functions, with one of them being parse_str. Its signature:

// Parses encoded_string as if it were the query string
// passed via a URL and sets variables in the current scope
// (or in the array if the result parameter is provided.
parse_str ( string $encoded_string [, array &$result ] ) : void

You’ll understand what’s wrong here if you look at this example from the documentation:

$str = "first=value&arr[]=foo+bar&arr[]=baz";parse_str($str);
echo $first; // value
echo $arr[0]; // foo bar
echo $arr[1]; // baz

Hmm, the string parameters end up in the current scope. To prevent this from happening, we will require our new check to use the second parameter of the function, $result, so that the result is written to this array.

Creating our diagnostics

Let’s create the myrules.php file:

<?php/** @warning parse_str without second argument */
parse_str($_);

The rules file is a top-level list of expressions, each of which is interpreted as a phpgrep template. A special phpdoc comment is expected for each of the templates. The only required attribute is the error category with a warning text.

There are currently 4 levels: error, warning, info, and maybe. The first two are critical: the linter will return a non-zero code after execution if at least one of the critical rules is triggered. The warning text is written after the attribute and will be displayed by the linter in case the template is triggered.

In the template we created, we use $_, an anonymous template variable. We could name it $x for example, but because we’re not doing anything with it, we can just give it an “empty” name. The difference between template variables and PHP variables is that the first ones are matched with absolutely any kind of expression and not just the “literal” variable. This is convenient since most of the time we need to search for unknown expressions instead of specific variables.

Running new diagnostics

Let’s create a small test file for debugging, test.php:

<?phpfunction f($x) {
parse_str($x); // Here's where our linter will show an error
}

Next, let’s run NoVerify with our rules on this file:

$ noverify -rules myrules.php test.php

Our warning will look something like this:

WARNING myrules.php:4: parse_str without second argument at test.php:4
parse_str($x);
^^^^^^^^^^^^^

By default, the check name is the same as the name of the rules file and the line that points to that test. In our case, it is myrules.php:4.

You can set a custom name using the @name <name> attribute.

/**
* @name parseStrResult
* @warning parse_str without second argument
*/

parse_str($_);
WARNING parseStrResult: parse_str without second argument at test.php:4
parse_str($x);
^^^^^^^^^^^^^

Named rules follow the same principles as the rest of the diagnostics:

  • They can be disabled using –exclude-checks
  • The criticality level can be redefined using –critical

Working with types

The previous example is good for a “hello world” kind of code, but more often, we need to know the expression types so we can decrease the number of times the diagnostics are triggered. For example, for the in_array function, we require the argument $strict=true when the first argument ($needle) has the string type. We have result filters for this.

One of such filters is @type <type> <var>. It allows us to filter out everything that does not match the listed types.

/**
* @warning 3rd arg of in_array must be true when comparing strings
* @type string $needle
*/

in_array($needle, $_);

Here we set a name for the first argument of the in_array call to assign a type filter to it. A warning will only be displayed when $needle has the string type.

Filter sets can be combined with @or:

/**
*
Each check can be given a description.
*
* @warning strings must be compared using '===' operator
* @type string $x
* @or
* @type string $y
*/

$x == $y;

In the example above, the template will only match with == expressions where both left and right operands have the string type. Without @or, all of the filters are combined with @and, but there is no need to specify it.

Limiting the diagnostic scope

You can specify @scope <name> for each test:

  • @scope all — default value, the check works everywhere
  • @scope root — only runs at the top level
  • @scope local — only runs inside functions and methods

Let’s suppose that we want to report return outside of the function body. This is sometimes necessary in PHP, for example, when including a file from a function, but we don’t like doing this.

/**
* @warning don't use return outside of functions
* @scope root
*/

return $_;

Let’s see how this rule behaves:

<?php

function f() {
return "OK";
}

return "NOT OK"; // Gives a warning

class C {
public function m() {
return "ALSO OK";
}
}

Similarly, we can set up a preference to use *_once instead of require and include:

/**
* @maybe prefer require_once over require
* @scope root
*/

require $_;

/**
* @maybe prefer include_once over include
* @scope root
*/

include $_;

Template grouping

When phpdoc comment duplication happens in templates, combining templates comes to the rescue.

A simple example:

Before

/** @maybe don't use exit or die */
die($_);
/** @maybe don't use exit or die */
exit($_);

After (with grouping)

/** @maybe don't use exit or die */
{
die($_);
exit($_);
}

Now imagine how unpleasant it would be to describe a rule in the next example without this feature!

/**
* @warning don't compare arrays with numeric types
* @type array $x
* @type int|float $y
* @or
* @type int|float $x
* @type array $y
*/

{
$x > $y;
$x < $y;
$x >= $y;
$x <= $y;
$x == $y;
}

The format provided in this article is just one of the suggested options. If you want to suggest an option, you have that opportunity: you only need to vote +1 for the suggestions you like. You can find more information about this here.

Dynamic rule integration

When NoVerify is launched, it attempts to find the rules file, which is specified in the rules argument.

Next, this file is parsed like a regular PHP script and a set of rule objects with phpgrep templates assigned to them is compiled from the resulting AST. The analyzer then starts working as usual. The only difference is that for some of the checked parts of the code it runs a set of assigned rules. If a rule is triggered, a warning is displayed.

A successful match of a phpgrep template and passing at least one of the filter sets (separated by @or) is considered a trigger. At this stage, the rules mechanism does not slow down the linter much, even if there are a lot of dynamic rules.

Matching algorithm

With a naive approach, we need to apply all the dynamic rules for each AST node. This is a very inefficient implementation because most of the work is done in vain: many templates have defined prefixes, which can be used to cluster the rules. This is similar to parallel matching, but instead of honest NFA building, we only “parallelize” the first computing step.

Let’s look at this in an example with three rules:

/** @warning duplicated then/else parts of ternary */
$_ ? $x : $x;

/** @warning don't call explode with delim="" */
explode("", ${"*"});

/** @maybe suspicious empty body of the if statement */
if ($_);

If we have N elements and M rules, with the naive approach, we have N*M operations to perform. In theory, this complexity can be reduced to linear, and we can get O(N) if we combine all the templates into one and perform matching the way it is done by the Go regexp package for example.

However, in practice, I have only partially completed implementing this approach. This approach will allow us to divide rules into three categories and assign the fourth, empty, category to the AST elements that do not match with any rules. Thanks to this, only one rule is assigned to one element.

If we have thousands of rules and we feel a significant slowdown, the algorithm will have to be improved. For now, though, I am satisfied with the simplicity and increased performance of this solution.

The problem of choice, or a little bit about declaring @type

The task is to choose a good phpdoc annotation syntax for filters.

The current syntax duplicates @var and @param, but we might need new operators, such as “type not equals” for example. Let’s think about how this might look.

We have at least two priorities:

  1. Readable and brief annotation syntax.
  2. The best possible IDE support without extra effort.

There is a plugin for PhpStorm called php-annotations. It adds auto-completion, navigation to class annotations and other useful tools for working with phpdoc comments.

In practice, priority (2) means that you make decisions that do not contradict the IDE expectations and plugins. For example, it is possible to make an annotation in a format that can be recognized by the php-annotations plugin:

/**
* Type is a filter that checks that $value
* satisfies the given type constraints.
*
* @Annotation
*/

class Filter {
/** Variable name that is being filtered */
public $value;

/** Check that value type is equal to $type */
public $type;
/** Check that value text is equal to $text */
public $text;
}

This way, applying type filters will look like this:

@Type($needle, eq=string)
@Type($x, not_eq=Foo)

Users could go to the Filter definition and the possible parameters list (type/text/etc.) would be displayed.

Alternative ways, some of which were suggested by my colleagues:

@type string $needle
@type !Foo $x

@type $needle == string
@type $x != Foo

@type(==) string $needle
@type(!=) Foo $x

@type($needle) == string
@type($x) != Foo

@filter type($needle) == string
@filter type($x) != Foo

We then got a bit distracted and forgot that this is all inside phpdoc and ended up with this:

(eq string (typeof $needle))
(neq Foo (typeof $x))

However, the option with postfix annotation was jokingly mentioned too. The language for describing type limits and values could be called sixth:

@eval string $needle typeof =
@eval Foo $x typeof <>

The search for the best solution continues…

Jokes aside, we went with !<type> syntax for “not type” constraints.

Comparing extensibility with Phan

As was described in the article, “Static analysis of PHP code using PHPStan, Phan and Psalm”, one of the advantages of using Phan is extensibility.

This is what was implemented in the plugin example:

We wanted to evaluate how ready our code was for PHP 7.3 (in particular, to find out if it has case-insensitive constants). We were sure that there are no such constants, but a lot has happened over the past 12 years, so we needed to check. We wrote a plugin for Phan that will trigger an alert whenever a third parameter is used in define().

This what the plugin code looks like (formatting optimized to match the width):

<?php

use Phan\AST\ContextNode;
use Phan\CodeBase;
use Phan\Language\Context;
use Phan\Language\Element\Func;
use Phan\PluginV2;
use Phan\PluginV2\AnalyzeFunctionCallCapability;
use ast\Node;

class DefineThirdParamTrue
extends PluginV2
implements AnalyzeFunctionCallCapability {

public function getAnalyzeFunctionCallClosures(CodeBase $code_base) {
$def = function(CodeBase $cb, Context $ctx, Func $fn, $args) {
if (count($args) < 3) {
return;
}
$this->emitIssue(
$cb, $ctx,
'PhanDefineCaseInsensitiv',
'define with 3 arguments', []
);
};
return ['define' => $def];
}
}

return new DefineThirdParamTrue();

And this is how it is done using NoVerify:

<?php

/** @warning define with 3 arguments */
define($_, $_, $_);

This is the result I wanted to achieve: to be able to solve trivial tasks as easily as possible.

Conclusion

  • Try using NoVerify in your project.
  • If you have any ideas for improvement or you want to report a bug, contact us.
  • If you want to participate in the development, you are welcome to join!

Useful links

We collected important links, some of which were mentioned in this article, and put them all here for your convenience.

If you need more examples of rules that can be implemented, you can check out NoVerify tests.

VK Tech

Written by

VK Tech

About vk.com technologies and infrastructure

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade