Automated type inference for dynamically typed programs

Recently, we released a tool called Psalm that’s designed to check PHP code for potential errors. We use it at Vimeo as part of our build process, and it helps us stop bugs from making it into production code.

A number of similar tools exist for dynamically typed languages:

  • Hack (created by Facebook) has an OCaml-based type-checker that does an excellent job of analysing that language’s PHP-like syntax.
  • TypeScript (Microsoft) and Flow (also Facebook) are great for checking a Javascript-like syntax.
  • For PHP code: Phan (from Etsy) and PhpStan.

Psalm is designed to have a better understanding of the codebase than existing PHP static analysis tools, and most the work on Psalm has involved figuring out how to safely infer types in a large and varied codebase. I hope this brief introduction might shed some light on how that type inference works, and thus how Psalm interprets a given chunk of code.

Basics

To get a full picture of the program state in a block of code, we need to keep track of two separate but related concepts:

  • type assignments, e.g. $a = 1
  • type assertions, e.g. if ($a) {}

Type Assignments

$meaning_of_life = 42;

In the above assignment we know, regardless of any other factors, that $meaning_of_life is assigned the type int.

We can make this a little more abstract, and say that a set variable 𝔸 contains all the type assignments that we know about. In the above example 𝔸 = { ($meaning_of_life: int) }.

Each new expression E (as opposed to a statement, which can contain multiple expressions) potentially gives us new information about type assignments. We can say more generally that, if there’s a conflict, we can create a new set of assignments 𝔸ⁿ based on the previously known assertions 𝔸ⁿ⁻¹:

𝔸ⁿ = merge(getAssignments(E(n), 𝔸ⁿ⁻¹), 𝔸ⁿ⁻¹)

Here we also must define two functions: getAssignments extracts assignments from an expression and merge creates a new set of assertions substituting the new information for old.

In the context of the block below:

$a = 5;
$a = "five";
$b = $a;

we can calculate how the set of assertions changes line by line:

𝔸⁰ = {}
𝔸¹ = merge(getAssignments($a = 5, {}), {}) 
= merge({ ($a: int) }, {})
= { ($a: int) }
𝔸² = merge(
getAssignments(
$a = "five",
{ ($a: int) }
),
{ ($a: int) }
)
= merge({ ($a: string) }, { ($a: int) })
= { ($a: string) }

𝔸³ = merge(
getAssignments(
$b = $a,
{ ($a: string) }
),
{ ($a: string) }
)
= merge({ ($b: string) }, { ($a: string) })
= { ($a: string), ($b: string) }

Type Assertions

Conditional expressions often contain information about types that we can use. For example, given the function:

function foo($f) {
if (is_int($f)) {
// some code
} else {
// some other code
}
}

we can easily see that $f is an integer inside the if statement. But this is not just a type assignment (of ($f: int)), because we know something extra — that $f can never be an int in the else block.

Instead, we need to keep track of all currently valid type assertions, 𝕋 (stored in Conjunctive Normal Form for easy manipulation).

To see how assignments and assertions interact, let’s analyse an if block in more depth.

If blocks

Let’s start with the simplest incarnation of an if statement — one with no elseif/else branches. How can we determine that the string return type of the following function is correct?

function foo(?string $a, ?string $b) : string {
if (!$a || $b === "hello") {
$a = "goodbye";
}
return $a;
}

Inside the if statement we need to calculate a new set of assignments based on the type assertions contained in the conditional:

𝔸ⁿ = merge(reconcile(𝕋, 𝔸ⁿ⁻¹), 𝔸ⁿ⁻¹)

Here we use a new function, reconcile. reconcile takes a set of assertions and creates a new set of type assignments given those truths.

Given the parameter type for $a, we have:

𝔸⁰ = { ($a: ?string), ($b: ?string) }

The if statement conditional contains the group of possible truths:

𝕋 = ((!$a) ∨ ($b === "hello"))

and inside the if we have a new set of assertions 𝔸¹:

𝔸¹ = merge( 
reconcile(
((!$a) ∨ ($b === "hello")),
{ ($a: ?string), ($b: ?string) }
),
{ ($a: ?string), ($b: ?string) }
)
= merge(
{ ($a: ?string), ($b: string) },
{ ($a: ?string), ($b: ?string) },
)
= { ($a: ?string), ($b: string) }

And then we encounter the expression $a = "goodbye";. When we see this expression, we can trivially calculate a new set of assertions as above:

𝔸¹ = { ($a: string), ($b: string) }

But we also need to update the group of truths within the scope of the if block because, by asserting that $a is now a string, the truth ((!$a) ∨ ($b === "hello")) is no longer applicable.

𝕋ⁿ = removeTruths(𝕋ⁿ⁻¹, getChangedVariables(𝔸ⁿ, 𝔸ⁿ⁻¹))

𝕋² = removeTruths(
((!$a) ∨ ($b === "hello")),
getChangedVariables(
{ ($a: string), ($b: string) },
{ ($a: ?string), ($b: string) }
)
)
= removeTruths(
((!$a) ∨ ($b === "hello")),
{ $a }
)
= {}

After the if block

After this specific if block we must update our set of assertions again. Specifically we must update the set of assertions 𝔸 with the knowledge we gained from the if block. Given:

  • 𝕋⁰, the set of assertions in the if block’s conditional
  • 𝔸⁰ being the set of assignments before the if block
  • 𝔸¹ being the set of assignments inside the if block

We can calculate 𝔸² as

𝔸² = merge(findContradictions(𝕋⁰, 𝔸¹), 𝔸⁰)

where the function findContradictions finds all assertions that contradict the current truths.

𝔸² = merge( 
findContradictions(
((!$a) ∨ ($b === “hello”)),
{ ($a: string), ($b: string) }
),
{ ($a: ?string), ($b: ?string) }
)
= merge(
{ ($a: string) },
{ ($a: ?string), ($b: ?string) },
)
= { ($a: string), ($b: ?string) }

Now we can look at the statement return $a; and ascertain that it does, indeed, return a string.

Handling exit statements

Instead of asserting a non-null value, what if we instead return a value:

function foo(?string $a, ?string $b) : string {
if (!$a || $b === “hello”) {
return “goodbye”;
}

return $a;
}

That first return statement means that, after the if, our assignments now depend on the negation of 𝕋, ¬𝕋, and the initial assignments 𝔸⁰ before the if:

𝔸² = merge(reconcile(¬𝕋⁰, 𝔸⁰), 𝔸⁰)

And substituting in those values:

𝔸² = merge(  
reconcile(
(($a) ∧ ($b !== "hello")),
{ ($a: ?string), ($b: ?string) }
),
{ ($a: ?string), ($b: ?string) }
)
= merge(
{ ($a: string) },
{ ($a: ?string), ($b: ?string) }
)
= { ($a: string), ($b: ?string) }

Which allows us, in a different context, to again assert that return $a; returns a string.

Wrapping up

We saw how you can analyse simple if statements if you keep track of variable type assignments and assertions.

More complicated rules are needed to analyse complex if statements (for example where variables are first assigned to within if blocks), and loops (for, foreach & while) present many other challenges.

If you’re interested, feel free to browse Psalm’s source code, and let us know if you have questions or further ideas: no ifs, ands, or strings are off limits!