Impossibility of Correctness Checking for Generic Numeric Programs
(This post is translated from my Chinese blog https://zhuanlan.zhihu.com/p/30938921562)
Background: I recently returned to the Julia community because I am brainstorming several interface designs for Julia (one such attempt is presented at https://zhuanlan.zhihu.com/p/30356390469). To avoid disrupting the flow of the main article, I've placed my motivations at the end [1], as non-Julia readers may not be interested in them.
After researching several posts on Discourse, I have finally clarified an issue for myself: the correctness of generic numeric programs cannot be universally verified. Therefore, the only feasible way to avoid errors is to avoid running generic numeric programs on untested inputs. (By a generic program, I mean a function f that can accept different input types.)
Specifically, I want to assert this proposition: if f is a generic numeric program, then there is no systematic constraint or proof (e.g., based on types, interfaces, or any check that does not require examining f's specific implementation) that can establish f's correctness for every input type.
Roughly speaking, this is because
a) the correctness of a numeric program can be viewed as the gap between the numerical approximation and the theoretical definition, which depends on too many factors to be captured by interfaces alone.
b) a generic program's correctness specification is also generic, which requires a nontrivial *proof*. That is, knowing that your program is correct for some input type S does not automatically imply its correctness for another type S'.
This impossibility is neither a problem of program design (such as a lack of interfaces) nor a coding issue, but a purely semantic and logical impossibility. It is similar to how formal proofs cannot verify the correctness of a specification itself. For example, you might accidentally write a wrong specification: you want to implement an addition function, but your specification declares subtraction, so the "addition" you then implement and verify is still wrong. Of course this sounds unlikely, but when doing formal verification it is easy to get confused: does a seemingly complex abstract declaration actually declare what we want?
Returning to numerical computation, let me illustrate this impossibility with an actual example from the Julia community:
Mean of integers overflows — bug or expected behaviour? discourse.julialang.org/t/mean-of-integers-overflows-bug-or-expected-behaviour/34767
This example concerns the mean function. You might say, isn’t mean just calculating the average of an array? What’s there to discuss? And it’s easy to think of an algorithm: mean(x) = sum(x) / length(x) — first sum, then divide by length. For floating-point numbers, this might be numerically imprecise, but not a major problem. But let’s think about it — is this approach correct for an integer array?
Let me be more specific: if integers have only a finite representation (unlike Python's arbitrary-precision integers), summation might overflow; is that acceptable behavior? For example, the average of [INT_MAX, INT_MAX] is mathematically INT_MAX, but INT_MAX + INT_MAX overflows, leading to a strange result.
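To make this concrete, here is a minimal Julia sketch (my own, not code from the linked thread) of the naive definition and how it misbehaves on 64-bit machine integers:

    naive_mean(x) = sum(x) / length(x)     # the "mathematically obvious" definition

    naive_mean([1.0, 2.0, 3.0])            # 2.0, as expected
    v = [typemax(Int), typemax(Int)]
    naive_mean(v)                          # the sum wraps around to -2, so this returns -1.0
                                           # instead of the true mean 9223372036854775807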
Obviously, there are many different perspectives on this issue:
1. Still use sum(x)/length(x), but compute with arbitrary-precision integers, like Python
- Benefits: Integers never overflow
- Drawbacks: Arbitrary-precision arithmetic is very slow; for large integer arrays, this implementation is practically unusable
2. Still use sum(x)/length(x), and let integers silently overflow (treat the operations as modular arithmetic on a finite ring) or treat overflow as UB
- Benefits: Implementation is fast
- Drawbacks: Users might silently get incorrect results
3. Still use sum(x)/length(x), and make integer overflow an error
- Benefits: Users won't silently get incorrect results; the implementation is simple with overflow checking
- Drawbacks: Slower than the unchecked version, though still much faster than arbitrary precision
4. Convert the elements of x to floating point before summing, then compute the mean as for a floating-point array (sketched after this list)
- Benefits: Results are correct
- Drawbacks: The extra conversion has a performance cost, and the program is no longer generic
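For illustration, here is a sketch of option 4 (my own code, not the implementation Julia's Statistics library actually ships):

    # Option 4: promote each element to floating point before accumulating.
    # No overflow for machine integers, but there is a conversion cost and the
    # result type is silently fixed to Float64.
    float_mean(x) = sum(float, x) / length(x)

    float_mean([typemax(Int), typemax(Int)])   # ≈ 9.223372036854776e18, no wraparound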
None of these solutions dominates the others, but that's not my main point. What I want to emphasize is how deceptive the absolutely correct mathematical definition mean(x) = sum(x) / length(x) is: this elegant definition is almost useless in actual numerical computation because it cannot produce meaningful results; it is either wrong due to overflow or numerically unstable.
This is well known in the deep learning community: neural networks using different precisions (FP32, FP16, BF16, FP8) behave completely differently. FP16/BF16 require mixed precision to avoid divergence (normalization layers use FP32 for accumulation), FP8 kernels must be written by hand to ensure performance and prevent overflow, and you need various tricks such as small-value initialization and gradient clipping. The root cause is that for 99% of numeric programs, we are not running them over the real numbers, but over an approximation of the algorithm under finite precision. Different input types can be seen as different levels of approximation of the same program, and you can even mix precisions, creating interpolations between these levels.
Downstream library users want reasonably correct results. If they want performance, they can define their own sum(x) / length(x). But this requires analyzing the program's properties. For example, a function f(x) might have two equivalent algebraic expressions a and b, where a is more accurate in one part of the domain and b is more accurate in another. You can check Herbie (https://herbie.uwplse.org/), which can automatically improve the precision of floating-point programs. But the problem is, if you don't know the specific properties of the generic input type, you can't draw definite conclusions.
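A classic example of such a pair of equivalent expressions (a textbook illustration, not one taken from Herbie's gallery): (1 - cos(x)) / x^2 approaches 0.5 as x approaches 0, but the direct formula suffers catastrophic cancellation near 0, while an algebraically identical rewrite stays accurate:

    naive(x)     = (1 - cos(x)) / x^2        # cancellation: 1 - cos(x) loses all digits for tiny x
    rewritten(x) = 2 * (sin(x / 2) / x)^2    # algebraically identical, numerically stable near 0

    naive(1e-8)        # 0.0   (completely wrong)
    rewritten(1e-8)    # ≈ 0.5 (close to the true value)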
To summarize: Ideally, generic program = mathematical expression = correct result, so it can be universally checked (without knowing specific properties of the input, only needing a declaration). But in practice, due to numerical precision issues, the same implementation might lead to numerical instability -> therefore not the implementation users want -> implementation is “incorrect”. This incorrectness is a mismatch with expectations, which cannot be captured by tools.
Therefore, I reject the view that traditional software-engineering interfaces/typeclasses are an effective means of constraining numeric programs. This doesn't work because traditional software engineering studies objects that can be abstracted into a series of precise, verifiable descriptions. For example, no matter how complex Amazon's S3 storage implementation is or how many machines it is deployed on, we only need to care abstractly about the eventual consistency S3 provides and code against that interface. But for general numeric programs, because the implementation is strongly coupled to the properties of the input, the results are fragile: merely changing the input from float to int (with the same values) can lead to vastly different results. And you can't even define what is correct: if the computed result differs from the theoretical value by 5%, is that right or wrong?
Even if we ignore floating-point imprecision, it is difficult to state such properties for some numeric programs. For example, consider a program property P: the program works for arrays indexed from 0 as well as arrays indexed from 1 (Julia uses 1-based indexing and C uses 0-based). For C, P does not hold. For C++, programs that only use vector iterators satisfy P. But whether a program conforms to property P is difficult to verify, especially if the program was ported from C to Julia.
To see an example, suppose that we want to implement a quicksort program that also has the aforementioned property P. We start from a specification that itself has property P: the sorting algorithm is correct if and only if, for every pair of indices i and j in the array's index domain with i < j, the output satisfies x[i] ≤ x[j]. Note how we must avoid referring to concrete indices in the specification. Then we try to prove the correctness of the program. In this process, you notice many things you had ignored: to access the first element you cannot use x[0] or x[1], but x[firstindex(x)], and so on. You need to reason about your generic program generically to reach a generic conclusion. But honestly, this kind of proof is just too difficult, even for CS students.
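For concreteness, here is roughly what the index-agnostic part looks like in Julia (my own sketch; is_sorted_spec is a hypothetical name):

    # The specification with property P: never mention a concrete index.
    # An output x satisfies it iff x[i] <= x[j] for every pair i < j in x's own index domain.
    is_sorted_spec(x::AbstractVector) =
        all(x[i] <= x[j] for i in eachindex(x) for j in eachindex(x) if i < j)

    # Likewise, an implementation with property P must write
    #     lo, hi = firstindex(x), lastindex(x)
    # instead of hard-coding 0 or 1 as the first index.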
Therefore, my conclusion is that in traditional software engineering, interfaces are effective not because PL researchers' work is effective, but because the problem itself entails such structure. This structure is poor in numeric programs (e.g., the property P is difficult to prove or reason about) — you might accidentally compute the correct result, or you might not — so checking generic numeric programs is meaningless (checking problems with specific input types is meaningful), and the results of such checks don't prove anything. Even just checking the behavior of specific floating-point numbers already requires developing specialized tools like Herbie.
However, this does not mean that Julia is necessarily a language riddled with holes, leaving users helpless! Here I advocate for a test-based approach. Suppose we have a test file, where obviously we can only test specific types:
@test mean([1,2,3]) == 2
@test mean([1.0, 2.0, 3.0]) == 2.0

We run this test file, and the compiler can record which types were used to call each function and store them in a database. We can actually reuse Julia's precompile caches. At runtime, programmers can enable a check (similar to a deprecation warning) that issues a warning if a function is called with a type not covered by the tests.
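A very rough sketch of the runtime half of this idea (all names here are hypothetical; a real mechanism would read the recorded signatures from the precompile cache rather than from a hand-written set):

    # Hypothetical: the set of argument types that appeared while running the tests.
    const TESTED_TYPES = Set([Vector{Int}, Vector{Float64}])

    function checked_mean(x)
        # Warn (once) if this call uses a type the test suite never exercised.
        typeof(x) in TESTED_TYPES ||
            @warn "mean was never tested with argument type $(typeof(x))" maxlog = 1
        return sum(x) / length(x)
    end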
Of course, some generic programs like traditional data structures and algorithms (Dict/Vector) are better suited for interface-based checks, or even formal verification, rather than checks based on concrete types. We can use some annotations to achieve this.
In summary, I want to express this point: ensuring the correctness of numerical programs requires a variety of approaches. Static typing can indeed make programming easier for developers and speed up iteration, but this doesn’t mean it’s a universal solution.
Footnote [1]: after playing with LLMs and diffusion models for a year, I decided to pick up my PL hobby again and read some compiler papers. My main motivation originates from reinforcement learning, specifically from PufferLib (https://github.com/PufferAI/PufferLib/tree/2.0):
Is there a generic and safe way to extend/compile the Python language with a fast JIT-compiled subset? For example, I have a complicated Python environment that plays poker, and I want to run the PPO algorithm on this environment. Performance is important, but since I am unfamiliar with the code, I want to minimize manual modification of it.
The general idea is to use gradual typing. Adding type hints in Python does not change Python's behavior. If a program could have been statically typed (but the lazy programmer didn't annotate it), then by refining the Any type to concrete types we recover a static program. I am quite satisfied with Python's type hints. Then I noticed that if I could create a similar stratified type system in Julia, type-stability bugs could be localized and error reports would be easier to read.
So why should I care about type stability? It is well known that there are several criticisms on the internet about "many correctness bugs in Julia." Initially, I (incorrectly) believed that these correctness bugs existed because the toolchain, such as linters, was underdeveloped. My reasoning was like this:
Julia has no OOP or interfaces -> Linters can't easily provide hints -> Programmers easily make silly mistakes -> Code quality is lower -> Correctness bugs
So fixing correctness bugs should require solving type stability first. Therefore, my first draft gave interfaces their own kind: they exist only at static compile time, have OOP syntax (where the OOP call would be rewritten as a global method call), are bound to variables, and are not real types. So they are more like type hints in Python. My biggest dissatisfaction was that this solution led to two sets of syntax: if x is static you write x.f(), otherwise you write f(x), even though x.f() is just syntactic sugar for f(x).
Following this line of reasoning, I found several possibilities after considering "what should interfaces be associated with":
- Interfaces belong to types: Not feasible. In Rust, interfaces are constraints on type variables. In OOP languages, interfaces are parent classes. Julia has neither constraints on type variables nor parent classes at runtime.
- Interfaces belong to methods: Preconditions/postconditions; each method expresses a series of constraints (on input values) that can be checked at runtime or at compile time.
- Interfaces belong to variables: They exist only at compile time, reflecting how we use a variable. Similar to Python's type hints.
A simple observation is that "interfaces belong to methods" entails "interfaces belong to variables": I can declare what properties a variable has in the precondition. At first glance, "interfaces belong to methods" seems the most general. But interface checks can only be performed when the method is actually called, which means a linter basically cannot provide useful hints.
So I wondered why, apart from linting being difficult to implement, this precondition/postcondition approach hasn't been more widely used — and I noticed that it is essentially a variant of Hoare logic. But Hoare logic is not an automatic proof system. For example, if you implement a sorting function, you don't actually know whether it can accept both 0-indexed and 1-indexed inputs. In other words, pre/post conditions are just declarations, not correctness guarantees. And Julia's biggest problem is that people often incorrectly assume their programs satisfy these pre/post conditions. The type system, by contrast, has its correctness defined structurally and recursively: if the input is of type A, I can prove through type inference that the output must be of a certain type, without a manual proof.
In other words, I found that another root of the correctness problem is that the Julia community wants something like an automated formal verification tool, not a linter (so the real issue is not type stability):
- Julia has many generic programs
- I know my program is correct for a specific type S
- I want a tool that proves my program is correct for any type T, or that automatically reveals what assumptions I have made
In other words, this is a program synthesis problem, not a linting problem. But this is too difficult.
First, we don't even have specifications. Julia programmers haven't actually written down any formal declaration expressing the expected behavior of a program. In fact, they only have some rough ideas in the form of algebraic identities. For example, a Julia programmer might come up with this kind of correctness statement about array indexing:
Define a 0-indexed CanonicalizedArray; all abstract arrays implement toCanonicalizedArray and fromCanonicalizedArray methods.
f is index-invariant iff f(x) == fromCanonicalizedArray(f(toCanonicalizedArray(x))).
For example, sort(x) == offsetarray(sort(x.parent)): sorting an OffsetArray is equivalent to first sorting the underlying array and then reconstructing the OffsetArray. But writing these declarations down unambiguously is not simple, and they cannot be captured by pre/post conditions.
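As a sketch of what that identity means in running code (assuming the OffsetArrays.jl package; none of this is from the original post):

    using OffsetArrays

    x = OffsetArray([3, 1, 2], -1)                # a 0-indexed vector: indices 0, 1, 2
    canonical = OffsetArrays.no_offset_view(x)    # "toCanonicalizedArray": drop the offset
    expected  = OffsetArray(sort(canonical), -1)  # "fromCanonicalizedArray": sort, then re-offset
    # Index-invariance would demand that any generic sort applied to x agree with `expected`,
    # i.e. produce [1, 2, 3] at indices 0, 1, 2.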
Second, programmers know the program is correct, but the compiler doesn’t know the correctness proof, so the compiler needs to synthesize the proof from scratch. This is possible for non-recursive programs, just using symbolic execution and constant folding. But for loops, more complex analysis methods are needed.
I won't say this is all impossible (I already have a rough implementation idea that could be realized through e-graphs); the problem is that the Julia community is too small, and all these extremely complex compiler proofs might themselves be full of bugs, especially in a dynamic language. If there is already a correct implementation for Vector, why wouldn't users just feed the underlying array of the OffsetArray into it rather than produce a new proof? These methods should be a last resort, used to maintain legacy code.
Considering all this, together with the inherent imprecision of numerical programs, I believe defensive programming that only considers specific types is the most practical solution for a general audience (if they want to run untested implementations for fun, that's another matter). To summarize briefly, there are several distinctly different objectives in Julia that must be covered by different methods.
