Walter Bright, who designed D, wrote an article called “So You Want to Write Your Own Language”. I love D and I respect Mr. Bright, but I disagree with nearly everything he says in it, mostly because D itself and Digital Mars’s work in general are (and should be) extremely atypical of the kind of programming language design that occurs today.
Digital Mars has been writing cross-platform compilers for decades (and when I was first getting into C on Windows in the early naughts, Digital Mars’s C compiler was the one that gave me the least trouble), and D was positioned as an alternative to C++ as a general purpose language. The thing is, most languages being developed today aren’t going to be C-like performance-conscious compiled languages that need to build binaries on multiple platforms and live in environments with no development toolchains (all major platforms either ship with a GNU or GNU-like UNIX dev toolchain or have a system like cygwin or homebrew that makes installing one straightforward), and trying to compete with C++ is not only technically difficult but also foolish for wholly non-technical reasons.
The second piece of advice Mr. Bright gives is to stick relatively close to familiar syntax, in order to maximize audience. This makes sense if you are trying to compete with C++, as D does. But D is a great illustration of the problems with this approach: D does everything C++ does better than C++ does it, and yet it has failed to displace C++. The reason is that C++ is a general purpose language, like Java, C#, and Go (in other words, a language that performs every task roughly equally poorly), and the appeal of general purpose languages is that they let one skip the time it would take to learn a language that’s a better fit for the problem, at the price of a harder implementation.
In other words, most things that are implemented in C++ are done that way not because C++ is the best tool for the job, but instead because C++ is marginally better suited than the three other languages that everybody on the dev team has known inside-out for twenty years. D can’t compete because the whole point of a general purpose language is to appeal to devs who failed the marshmallow test & want to minimize the initial steepness of the learning curve.
The domain of general purpose languages is crowded, and each of them casts a wide net: being technically better across the board than all of them is difficult, and without really heavy institutional support (like that given to C++ and Java by university accreditation boards & given to C# and Go by large corporations) being technically better will only give you a small community of dedicated fans and an isolated ecosystem. Writing a general purpose language is a bit like forming a 2-person garage startup positioned against General Mills.
Instead, it makes more sense to target some underserved niche: general purpose languages are necessarily bad at serving most niches, and the syntax changes necessary to serve such a niche better are generally all too obvious to anybody actually developing in such a niche. Iteratively identifying and relieving pain points in real development work is sufficient to vastly improve a conventionally-structured language. Cutting the gordian knot by introducing a wildly unconventional syntax that’s better suited for certain kinds of work can be even easier, since such a syntax is allowed to be poorly suited for work outside its domain, and since the syntax of conventional general-purpose languages is made complicated & hard to reliably implement by decades of pressure to be suitable for all kinds of very dissimilar problems.
Syntax is the difference between a tea spoon and an ice pick. Rather than aspiring to being vice grips or duct tape, build a language that is exactly the tool you need, and allow it to be a poor fit for everything else.
When beginning a language project, it makes sense to try to get it minimally functional as quickly as possible, so that you can dogfood it & start identifying where your model of suitability has flaws. So (in direct contradiction to Mr. Bright’s list of false & true gods) it makes sense to start off with the easiest suitable syntax to parse and as few keywords as you know how to proceed with. As you write code in your language, you will identify useful syntactic sugar, useful changes to the behavior of corner cases, and keywords you would like to add, and none of these are likely to be the ones that were at the bottom of your bucket list initially, unless you’re sticking quite close to some known conventional language.
To mimic Mr. Bright’s format, here are some false gods of syntax:
- Tried and true / universally-familiar syntactic constructs. A language should not sacrifice the ease with which it caters to its niche in order to cater to people who don’t want to really learn it, but instead should have a syntax that minimizes the effort necessary to solve problems within its domain and that corresponds predictably to its operation. If a language is syntactically similar to some other language, this similarity should reflect a semantic similarity (i.e., a stack language is justified in looking like forth and a declarative logic language is justified in looking like prolog). Avoid making a functional language look like an imperative one or using constructions that are misleadingly/shallowly similar to ones from another popular language, and instead choose metaphors for best fit.
- Over-design / biting off more than you can chew. Start off with a minimally viable language and determine what needs to be added from experience. Over time the gaps will be filled in and your language will look less like an esolang and more like a general purpose language, unavoidably; making a sturdy foundation is easier if you start off with only a few elements and make them quite solid. (I’m not encouraging the kind of masturbatory minimization of keyword lists you sometimes see in forth implementations, where only one or two keywords are hard-coded and everything else is written in the language; however, starting off with functions, conditions, i/o, and simple mathematics is generally sufficient to start writing real code, and it’s generally possible to get a simple language from zero to capable of hosting a Fibonacci sequence function in an afternoon.)
- “Readability”. Reading any language is a learned skill; while you should know how to parse the language you’re designing (don’t make it too hard for yourself since you’re going to spend a lot of time puzzling over the proper behavior of code you yourself wrote), caring too much about lowering the initial learning curve of new developers will put more work on your plate and limit your flexibility. I know people who consider C++ easy to read, and even a very straightforward language like forth or brainfuck looks like line noise to someone who doesn’t know the trick to reading it. If your language is well-suited to some problem, you need to trust that developers will figure out how to read it, and having a simple syntax that doesn’t cater too much to beginners will help more seasoned developers understand complicated code later.
Here are some true gods of syntax:
- Suitability. Be an ice pick, not duct tape.
- Simplicity. Parsing is hard enough without ambiguity and corner cases (for both humans and machines); create a simple foundational syntax with as little ambiguity as possible, and make sure every behavior you add is compatible with that foundation.
- Iterative development. Making a really minimal language is straightforward, but making a language that is useful for complex problems is hard; it’s much easier to add features slowly based on need than it is to determine all of the useful behaviors before starting development, and it’s much easier to test a small number of base features before building on top of them than to test a large complex language all at once.
Regex is a very powerful tool, if you know how to use it. Lexing involves identifying character classes, which is exactly what regex is best at. I wouldn’t recommend writing large chunks of your language as regular expressions, or using nonstandard extensions like lookahead & brace matching (which break some of the useful performance guarantees of vanilla regex anyway), but regex is invaluable for lexing, and with care it can be invaluable for parsing as well.
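As a rough illustration of regex-driven lexing, here is a minimal sketch in Python. The token classes (NUMBER, STRING, IDENT, OP) and the `lex` function are made up for this example and don't come from any particular language; the only technique on show is combining one vanilla regular expression per token class into a single alternation.

```python
import re

# Hypothetical token classes for a toy language. Order matters: the first
# alternative that matches at a given position wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("STRING", r'"[^"\n]*"'),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def lex(source):
    """Return a list of (kind, text) pairs, raising on unknown characters."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup  # name of the token class that matched
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not one itself
        if kind == "ERROR":
            raise SyntaxError(
                f"unexpected character {match.group()!r} at offset {match.start()}"
            )
        tokens.append((kind, match.group()))
    return tokens
```

Calling `lex('x = 3.5 + y')` yields an IDENT, an OP, a NUMBER, an OP, and an IDENT, in that order. Adding a token class means adding one line to the table, which fits the "start small and grow" approach.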
If you are writing your language implementation in C, using lex (or GNU flex) can be a great time saver. I wouldn’t recommend using Bison / YACC, personally — it’s possible to thread arbitrary C code into lex rules, and implementing parsing with this feature and a set of flags is much easier for simpler syntaxes than using a compiler-compiler. With regard to portability concerns, any major platform will have a development toolchain containing lex these days.
Of course, if you go the forth-lite route and have nearly completely consistent tokenization along a small set of special characters, this is much easier. Forth-lite languages can be split by whitespace and occasionally re-joined when grouping symbols like double quote are found; lisp-lite languages with only a few paired grouping symbols can easily be parsed into their own ASTs.
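To make that concrete, here is a small Python sketch of both approaches. The function names `forth_tokens` and `parse_sexpr` are hypothetical, and both functions are deliberately naive: the forth-style one collapses runs of whitespace inside strings, and the lisp-style one has no string or quote handling at all.

```python
def forth_tokens(source):
    """Split on whitespace, then re-join the pieces of double-quoted strings."""
    tokens, pending = [], None
    for word in source.split():
        if pending is not None:
            # We are inside a string: keep collecting until the closing quote.
            pending.append(word)
            if word.endswith('"'):
                tokens.append(" ".join(pending))
                pending = None
        elif word.startswith('"') and not (len(word) > 1 and word.endswith('"')):
            pending = [word]  # a string that continues past this word
        else:
            tokens.append(word)
    if pending is not None:
        raise SyntaxError("unterminated string literal")
    return tokens


def parse_sexpr(source):
    """Parse parenthesized expressions into nested lists (the 'AST')."""
    words = source.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]  # stack[0] collects the top-level forms
    for word in words:
        if word == "(":
            stack.append([])
        elif word == ")":
            if len(stack) == 1:
                raise SyntaxError("unexpected ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(word)
    if len(stack) != 1:
        raise SyntaxError("missing ')'")
    return stack[0]
```

For instance, `forth_tokens('2 3 + "hello world" print')` keeps `"hello world"` together as one token, and `parse_sexpr("(+ 1 (* 2 3))")` returns one top-level form whose third element is itself a list.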
It is possible to design a series of regex match expressions and corresponding handler functions, and iteratively split chunks of code from the outside in. I did this in Mycroft, and I recommend it only if your syntax requires very little context: the two different meanings of comma and period in Mycroft (as well as handling strings) made parsing this way complicated in some cases, but this kind of parsing is very well-suited for languages like lisp where all grouping symbols are paired and very few characters are special-cased. This is a style I find quite natural, but I understand that many people prefer thinking of parsing in terms of a left-to-right iteration over tokens; one benefit of the left-to-right approach is that grouping symbol mismatch will be reported in the middle of a block rather than at the very end (i.e., closer to where the actual mismatch probably is located, for large blocks with many levels of grouping).
In all of these cases, you should make sure that information about both position in the file & the tokenization & parsing context are available to every function, for error reporting purposes. If your syntax is sufficiently simple & consistent, this will be enough to ensure that you can produce useful error messages. Never expect any general purpose parsing or lexing tool to understand enough about your language to emit useful error messages; instead, expect to thread your own context through your implementation.
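One possible way to thread that context through, sketched in Python: every token remembers where it started, and the parser keeps a stack describing what it is currently in the middle of. The `Token`, `ParseError`, `tokenize`, and `check_blocks` names, and the `begin`/`end` block syntax, are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    line: int    # 1-based line number
    column: int  # 1-based column of the token's first character


class ParseError(Exception):
    """An error message that always carries position and parsing context."""

    def __init__(self, message, token, context):
        where = f"line {token.line}, column {token.column}"
        chain = " > ".join(context)
        super().__init__(f"{where}: {message} (while parsing {chain})")


def tokenize(source):
    """Split on whitespace, remembering where each token started."""
    tokens = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        column = 0
        for word in line.split():
            column = line.index(word, column)  # find this word's start
            tokens.append(Token(word, line_number, column + 1))
            column += len(word)
    return tokens


def check_blocks(tokens):
    """Verify that 'begin' and 'end' pair up, tracking a context stack."""
    context = ["program"]
    for token in tokens:
        if token.text == "begin":
            context.append(f"block opened at line {token.line}")
        elif token.text == "end":
            if len(context) == 1:
                raise ParseError("'end' without matching 'begin'", token, context)
            context.pop()
    if len(context) > 1:
        raise ParseError("missing 'end'", tokens[-1], context)
```

Because the context stack records where each block was opened, a "missing 'end'" message can point back at the opener instead of only at the end of the file.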
I agree with Mr. Bright about the proper way to handle errors: rather than trying to make the compiler or interpreter smarter than the programmer, it makes more sense to exit and print a sensible error message. However, he doesn’t touch upon the difference between compile-time and run-time errors here, and in an interpreted language or one that is compiled piecemeal at runtime (which will be typical of the kind of language somebody writes today) the distinction is a little more fuzzy. Rather than, as Mr. Bright suggests, exiting upon encountering any error, I recommend keeping information about current errors (and the full call stack information related to them) and going back up the stack until an error handler is reached; if we completely exhaust user code and get to the top level, we should probably print the whole call stack with our message (or some large section of it). This is the style used in Java, Python, and sometimes Lua. It’s straightforward to implement (Mycroft does it this way, with ~10 lines of handling in the main interpreter loop and less than 40 lines in a dedicated error module), and it can provide good error messages and the framework for flexible error handling in user code (which itself can be implemented easily — see Mycroft’s implementation for the throw and catch builtins).
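Here is a minimal Python sketch of that unwinding scheme. It is not Mycroft's code: the `Interpreter` class, its `call`/`throw`/`catch`/`run` methods, and the use of plain Python callables to stand in for user-level functions are all assumptions made to keep the example short. It leans on the host language's exceptions to do the actual unwinding.

```python
class LangError(Exception):
    """An error raised by user code, carrying the interpreted call stack."""

    def __init__(self, message, call_stack):
        super().__init__(message)
        self.message = message
        self.call_stack = list(call_stack)  # snapshot at the point of failure


class Interpreter:
    def __init__(self):
        self.call_stack = []  # names of the user-level frames currently active

    def call(self, name, body, *args):
        """Run a user-level function (here a Python callable) in a named frame."""
        self.call_stack.append(name)
        try:
            return body(self, *args)
        finally:
            self.call_stack.pop()  # frames are popped while unwinding, too

    def throw(self, message):
        """The 'throw' builtin: record the stack and start unwinding."""
        raise LangError(message, self.call_stack)

    def catch(self, body, handler):
        """The 'catch' builtin: stop unwinding here and run the handler."""
        try:
            return body(self)
        except LangError as error:
            return handler(self, error)

    def run(self, name, body):
        """Top level: report uncaught errors with the full call stack."""
        try:
            return self.call(name, body)
        except LangError as error:
            trace = " -> ".join(error.call_stack)
            return f"error: {error.message} (call stack: {trace})"
```

If a function three frames deep calls `throw` and nothing catches it, `run` reports the message along with every frame that was active at the time; wrapping the call in `catch` hands the same error object to user code instead.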
With regard to performance: it’s not that performance doesn’t matter, but instead that performance matters only sometimes. Machine time is much cheaper than engineer time, and so a language that makes solving certain kinds of problems straightforward is valuable even if it does so slowly. A general purpose language has to run fast and compile fast in order to compete, but notoriously slow languages have become popular because they served particular niches, even in eras when processors were much slower and speed mattered much more (consider perl in the text-processing space, which, through accidents of history, took a position better served by icon, or prolog in the expert system space; prolog was so well-suited to the expert system space that expert systems were prototyped in prolog and then hand-translated from prolog to C++ in order to avoid the performance overhead).
Unless you make huge dumb performance mistakes (and as the success of Prolog, PHP, Python, and Ruby makes clear, sometimes even big dumb performance mistakes aren’t fatal), if you scratch an itch the performance won’t be a deal-breaker, and optimization can come later, after both behavior and idiom are appropriately crystallized. Using a profiler is overkill in early iterations of the language, and since there’s a general trend in hardware toward many lower-power cores, you may get significantly more performance benefit from making parallelism easy to use (as Go does) than from tuning the critical path.
Nearly all of Mr. Bright’s points about runtime libraries only really apply to languages competing with C++. He focuses on low-level stuff necessary for cycle-heavy processing on single cores, but this is almost entirely irrelevant to most of the code that gets written these days.
Here are my suggestions:
- String I/O should be unicode-aware & support utf-8. Binary I/O should exist. Console I/O is nice, and you should support it if only for the sake of having a REPL with readline-like features. Basically all of this can be done by making your built-in functions wrappers around the appropriate safe I/O functions from whatever language you’re building on top of (even C, although I wouldn’t recommend it).
- It’s no longer acceptable to expect strings to be zero-terminated rather than length-prefixed. It’s no longer acceptable to have strings default to ascii encoding instead of unicode. In addition to supporting unicode strings, you should also probably support byte strings, something like a list or array (preferably with nesting), and dictionaries/associative arrays. It’s okay to make your list type do double-duty as your stack and queue types and to make dictionaries act as classes and objects. Good support for ranges/spans on lists and strings is very useful. If you expect your language to do string processing, built-in regex is important.
- If you provide support for parallelism that’s easier to manage than mutexes, your developers will thank you. While implicit parallelism can be hard to implement in imperative languages (much easier in functional or pure-OO languages), even providing support for thread pools, a parallel map/apply function, or piping data between independent threads (like in goroutines or the unix shell) would help lower the bar for parallelism support.
- Make sure you have good support for importing third party packages/modules, both in your language and in some other language. Compiled languages should make it easy to write extensions in C (and you’ll probably be writing most of your built-ins this way anyway). If you’re writing your interpreted language in another interpreted language (as I did with Mycroft) then make sure you expose some facility to add built-in functions in that language.
- For any interpreted language, a REPL with a good built-in online help system is a must. Users who can’t even try out your language without a lot of effort will be resistant to using it at all, whereas a simple built-in help system can turn exploration of a new language into an adventure. Any documentation you have written for core or built-in features (including documentation on internal behavior) should be accessible from the REPL. This is easy to implement (see Mycroft’s implementation of online help) and is at least as useful for the harried language developer as for the new user.
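The last two suggestions fit together naturally: if built-ins are registered in a table along with their documentation, the same table can drive the help system. The sketch below is one way this might look in Python; the `builtin` decorator, the `add` and `upper` built-ins, and the one-command-per-line syntax are all invented for illustration and are not how Mycroft does it.

```python
BUILTINS = {}  # name -> (host-language function, help text)


def builtin(name, doc):
    """Register a host-language function as a built-in, with its help text."""
    def register(function):
        BUILTINS[name] = (function, doc)
        return function
    return register


@builtin("add", "add A B: return the sum of two numbers")
def _add(a, b):
    return a + b


@builtin("upper", "upper S: return S converted to upper case")
def _upper(s):
    return s.upper()


def help_text(topic=None):
    """Look up documentation from the same table the interpreter uses."""
    if topic is None:
        return "built-ins: " + ", ".join(sorted(BUILTINS))
    if topic not in BUILTINS:
        return f"no help available for {topic!r}"
    return BUILTINS[topic][1]


def parse_arg(word):
    """Treat arguments as integers when possible, otherwise as strings."""
    try:
        return int(word)
    except ValueError:
        return word


def eval_line(line):
    """Evaluate one REPL line of the form 'name arg arg ...'."""
    words = line.split()
    if not words:
        return ""
    name, args = words[0], words[1:]
    if name == "help":
        return help_text(args[0] if args else None)
    if name not in BUILTINS:
        return f"unknown word {name!r}; try 'help'"
    function, _doc = BUILTINS[name]
    return str(function(*[parse_arg(arg) for arg in args]))
```

A bare-bones REPL is then just a loop that prints `eval_line(input("> "))`. Third-party extensions written in the host language only need to import the `builtin` decorator, and anything they register shows up in `help` automatically.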
After the prototype
If your language is useful, you will use it. Maybe some other people will as well. Writing a language is a great learning opportunity even if nobody uses it, since it improves your understanding of the internals of even other unrelated languages.
Unless you are selling some proprietary compiler or interpreter (something that even Apple & Microsoft can’t get away with anymore) adoption rates don’t actually matter, except for stroking your ego — which is great, because there are a lot of languages out there and relatively few people are interested in learning obscure new ones.
If your language gets any use, it will grow and mature into something harder to predict as behaviors change to be more in line with the desires of users (even if the only user is yourself). When languages grow too fast or are built on top of a shaky foundation they can become messy, with new behaviors contradicting intuition; some languages have taken a tool-box approach and embraced this (notably Perl and to a lesser extent TCL), and other languages merely suffer from it (notably PHP and C++). It makes sense to try to make sure that new features behave in ways that align as closely as possible to idiom, and to develop a small set of strong simple idioms that guide all features; that way, pressure to become general purpose won’t make the learning curve for new users too rocky.