Automated type inference for dynamically typed programs
Recently, we released a tool called Psalm that’s designed to check PHP code for potential errors. We use it at Vimeo as part of our build process, and it helps us stop bugs from making it into production code.
A number of similar tools exist for dynamically typed languages:
- Hack (created by Facebook) has an OCaml-based type-checker that does an excellent job of analysing that language’s PHP-like syntax.
- TypeScript (Microsoft) and Flow (also Facebook) are great for checking a Javascript-like syntax.
- For PHP code: Phan (from Etsy) and PhpStan.
Psalm is designed to have a better understanding of the codebase than existing PHP static analysis tools, and most the work on Psalm has involved figuring out how to safely infer types in a large and varied codebase. I hope this brief introduction might shed some light on how that type inference works, and thus how Psalm interprets a given chunk of code.
Basics
To get a full picture of the program state in a block of code, we need to keep track of two separate but related concepts:
- type assignments, e.g.
$a = 1
- type assertions, e.g.
if ($a) {}
Type Assignments
$meaning_of_life = 42;
In the above assignment we know, regardless of any other factors, that $meaning_of_life
is assigned the type int
.
We can make this a little more abstract, and say that a set variable 𝔸 contains all the type assignments that we know about. In the above example 𝔸 = { ($meaning_of_life: int) }
.
Each new expression E
(as opposed to a statement, which can contain multiple expressions) potentially gives us new information about type assignments. We can say more generally that, if there’s a conflict, we can create a new set of assignments 𝔸ⁿ based on the previously known assertions 𝔸ⁿ⁻¹:
𝔸ⁿ = merge(getAssignments(E(n), 𝔸ⁿ⁻¹), 𝔸ⁿ⁻¹)
Here we also must define two functions: getAssignments
extracts assignments from an expression and merge
creates a new set of assertions substituting the new information for old.
In the context of the block below:
$a = 5;
$a = "five";
$b = $a;
we can calculate how the set of assertions changes line by line:
𝔸⁰ = {}𝔸¹ = merge(getAssignments($a = 5, {}), {})
= merge({ ($a: int) }, {})
= { ($a: int) }𝔸² = merge(
getAssignments(
$a = "five",
{ ($a: int) }
),
{ ($a: int) }
)
= merge({ ($a: string) }, { ($a: int) })
= { ($a: string) }
𝔸³ = merge(
getAssignments(
$b = $a,
{ ($a: string) }
),
{ ($a: string) }
)
= merge({ ($b: string) }, { ($a: string) })
= { ($a: string), ($b: string) }
Type Assertions
Conditional expressions often contain information about types that we can use. For example, given the function:
function foo($f) {
if (is_int($f)) {
// some code
} else {
// some other code
}
}
we can easily see that $f
is an integer inside the if
statement. But this is not just a type assignment (of ($f: int)
), because we know something extra — that $f
can never be an int
in the else
block.
Instead, we need to keep track of all currently valid type assertions, 𝕋 (stored in Conjunctive Normal Form for easy manipulation).
To see how assignments and assertions interact, let’s analyse an if
block in more depth.
If
blocks
Let’s start with the simplest incarnation of an if statement — one with no elseif
/else
branches. How can we determine that the string
return type of the following function is correct?
function foo(?string $a, ?string $b) : string {
if (!$a || $b === "hello") {
$a = "goodbye";
}
return $a;
}
Inside the if
statement we need to calculate a new set of assignments based on the type assertions contained in the conditional:
𝔸ⁿ = merge(reconcile(𝕋, 𝔸ⁿ⁻¹), 𝔸ⁿ⁻¹)
Here we use a new function, reconcile
. reconcile
takes a set of assertions and creates a new set of type assignments given those truths.
Given the parameter type for $a
, we have:
𝔸⁰ = { ($a: ?string), ($b: ?string) }
The if
statement conditional contains the group of possible truths:
𝕋 = ((!$a) ∨ ($b === "hello"))
and inside the if
we have a new set of assertions 𝔸¹:
𝔸¹ = merge(
reconcile(
((!$a) ∨ ($b === "hello")),
{ ($a: ?string), ($b: ?string) }
),
{ ($a: ?string), ($b: ?string) }
)
= merge(
{ ($a: ?string), ($b: string) },
{ ($a: ?string), ($b: ?string) },
)
= { ($a: ?string), ($b: string) }
And then we encounter the expression $a = "goodbye";
. When we see this expression, we can trivially calculate a new set of assertions as above:
𝔸¹ = { ($a: string), ($b: string) }
But we also need to update the group of truths within the scope of the if
block because, by asserting that $a
is now a string, the truth ((!$a) ∨ ($b === "hello"))
is no longer applicable.
𝕋ⁿ = removeTruths(𝕋ⁿ⁻¹, getChangedVariables(𝔸ⁿ, 𝔸ⁿ⁻¹))
𝕋² = removeTruths(
((!$a) ∨ ($b === "hello")),
getChangedVariables(
{ ($a: string), ($b: string) },
{ ($a: ?string), ($b: string) }
)
)
= removeTruths(
((!$a) ∨ ($b === "hello")),
{ $a }
)
= {}
After the if block
After this specific if
block we must update our set of assertions again. Specifically we must update the set of assertions 𝔸 with the knowledge we gained from the if
block. Given:
- 𝕋⁰, the set of assertions in the
if
block’s conditional - 𝔸⁰ being the set of assignments before the
if
block - 𝔸¹ being the set of assignments inside the
if
block
We can calculate 𝔸² as
𝔸² = merge(findContradictions(𝕋⁰, 𝔸¹), 𝔸⁰)
where the function findContradictions
finds all assertions that contradict the current truths.
𝔸² = merge(
findContradictions(
((!$a) ∨ ($b === “hello”)),
{ ($a: string), ($b: string) }
),
{ ($a: ?string), ($b: ?string) }
)
= merge(
{ ($a: string) },
{ ($a: ?string), ($b: ?string) },
)
= { ($a: string), ($b: ?string) }
Now we can look at the statement return $a;
and ascertain that it does, indeed, return a string.
Handling exit statements
Instead of asserting a non-null value, what if we instead return a value:
function foo(?string $a, ?string $b) : string {
if (!$a || $b === “hello”) {
return “goodbye”;
}
return $a;
}
That first return
statement means that, after the if
, our assignments now depend on the negation of 𝕋, ¬𝕋, and the initial assignments 𝔸⁰ before the if
:
𝔸² = merge(reconcile(¬𝕋⁰, 𝔸⁰), 𝔸⁰)
And substituting in those values:
𝔸² = merge(
reconcile(
(($a) ∧ ($b !== "hello")),
{ ($a: ?string), ($b: ?string) }
),
{ ($a: ?string), ($b: ?string) }
)
= merge(
{ ($a: string) },
{ ($a: ?string), ($b: ?string) }
)
= { ($a: string), ($b: ?string) }
Which allows us, in a different context, to again assert that return $a;
returns a string.
Wrapping up
We saw how you can analyse simple if
statements if you keep track of variable type assignments and assertions.
More complicated rules are needed to analyse complex if
statements (for example where variables are first assigned to within if
blocks), and loops (for
, foreach
& while
) present many other challenges.
If you’re interested, feel free to browse Psalm’s source code, and let us know if you have questions or further ideas: no if
s, and
s, or string
s are off limits!