How to Design Your Own Programming Language | Sea #0
Welcome! This is the first article in a series I plan to write involving my programming language — Sea. In these articles, I plan on documenting my process for writing the language to serve a handful of main goals:
- To teach you the process of creating a programming language
- To help me learn the process better myself and develop Sea
- To create documentation for the design choices made for Sea
- To take the time to properly design Sea
Note: My experience with writing programming languages comes from personal experience while working on Sea and CodePulse’s Programming Language Tutorial. There is a lot of really great material in that tutorial series, however the code is atrocious; there are a handful of files hundreds of lines long. So, I still believe my series of articles will be useful. I will be saving them to my Programming Language List.
This series is also an extension of my article A Basic History of Programming Languages and how to Create Your Own. In that article, I go through the historical development of programming languages and explain the general process of creating a programming language conceptually. In this series, I will go through each step in practical detail. You can copy what I do exactly, or use it to create your own language. Note that I will be adding a fourth step to my previously described three steps: preprocessing. This is because C has a pre-processor, and it offers a handful of benefits. For now, though, I will be ignoring the pre-processor until an article at a later date.
Purpose and Design Philosophy
The idea for Sea is a simple one. I’ve spent a lot of time programming in numerous languages, but my personal favourite language is Python. I spent a while almost elitist thinking that “Python is a beginners’ language. Brackets aren’t scary; I’m fine with Java and C++”. However, once I started really getting into Python, I fell in love. Python is simply a pleasure to write in due to the simple syntax. It is much easier to debug, and it is overall easy to read the code. I think pushing computer languages towards conversational language is ideal.
Python is a great language, but it’s a high-level programming language. You can use lower level features with Cython, but that’s more of a hack than a feature, as far as I’m concerned. There simply aren’t convenient and obvious ways to use something like pointers in Python as efficiently as in C. I’m also heavily aware that not every language should be used for everything. A good developer does not simply choose their favourite language and use it for web development, game development, and app development; a good web developer uses the tool that is best for the job at hand. So wouldn’t it be nice if there were a language that were able to do everything; it would be a modular language that can function very lightly for low-level applications such as operating systems, it would have high-level features that allow it to be used for web development and general app development, and lastly, it would need to have simple and readable syntax.
That is my dream language. However, I thought that would be too large a task to tackle as my first language. So, I began to think smaller and came up with Sea — it’s C, just written differently. The goal is to create a version of C with syntax similar to Python, without adding a significant runtime cost. I prefer indent-syntax over bracket-syntax in most scenarios. Good code should be indented and spaced anyway, so making it required is not a horrible idea. The primary focus should be on readability. The way I will achieve this with Sea is by pythonifying the general syntax of C and then creating syntax that is sugar; syntax that is solely a nicer way of doing something you can already do.
The purpose of Sea is to be a beginner-friendly version of C. It should be as powerful as C, without a significant runtime cost. As such, there should be syntactic sugar used to mimic certain higher level functionality. Doing things this way will add a compilation cost to Sea, but not a significant runtime cost. Sea will need to strike a balance between Python and C. With these considerations in mind, we can begin designing the language.
Syntax and General Grammar
Creating the grammar for a language is, in my opinion, the hardest part of the process. There are simply so many parts involved in a language that it can hard to identify them all. I think our best option here is to start with a language you know, and modify it as you go along. Since Sea really is just a redesign of C, the grammar can be mostly the same. The Syntax will mostly be derived from Python, however. So, I am going to reference Microsoft’s C Language Syntax Summary. Within this, there is an in-depth grammatical model for C. I will be mimicking their Definitions and Conventions to an extent. I recommend reading the page (it’s really short). Since Medium doesn’t have subscripts, I will be using the regex “?” for optional components.
Please note that as I write all of this, I haven’t written Sea completely yet. Meaning, I might not Pythonify all of the C grammar. So, I will likely have to come back and modify this article as I develop Sea.
Tokens
Tokens are the basis of our language. Every syntax element will be expressed as a token. Our tokens will be one of the following types: keyword
, identifier
, constant
, string-literal
, and punctuator
.
keyword
: All the reserved keywords in Sea such as “if”, “else”, “struct”, etc.
identifier
: All user-entered names for things such as variables, functions, and structs. These are made up of underscores, English letters, and digits. However, an identifier must not begin with a digit.
constant
: These are all constant data that cannot be modified and are known at compile time. They are further divided into integer-constant
, floating-constant
, enumeration-constant
, character-constant
.
integer-constant
: For our purposes, all integers will be decimal and composed of decimal digits. I will implement different number bases at a later date.floating-constant
: These will be decimal for the time being and composed of decimal digits and, at most, one decimal place. I will implement different number bases as well as scientific notation at a later date.enumeration-constant
: These will be identifiers that represent enums in Sea.character-constant
: Any character in Sea. They will be a single character within single quotes, or the character may be prefixed with a backslash.
string-literal
: All strings or implicit arrays of characters. I will add different encoding types and fancy string types such as Python-style f-strings.
punctuator
: All the general syntax symbols. These include the operators and general “punctuation”. In Sea, indents and newlines will be punctuators.
These are our tokens for Sea! Your language may have different tokens or may end up representing these tokens in different ways, but that’s alright. Now, let’s see how these tokens will be used to construct the grammar of our language.
Grammar
Recall that the grammar of a programming language defines how tokens should be used together to create order or an understanding. An important thing to note is that, since Sea uses Python-style indentation, we will not be ignoring all whitespace like in many languages; however, we will be ignoring individual whitespace characters, if they are not newlines, after the first non-whitespace character. We will also ignore empty newlines that serve only to make the code more readable. Our grammar will be composed of expressions, declarations, statements, and external definitions.
Expressions
Expressions are lines of code that resolve to a value. Prepare yourself because the following grammar might be overwhelming. These expressions will often be recursive as well. The order of operations is from top to bottom between types of expressions. However, each element of an expression has the same priority as the others of that same element. This is just how in PEMDAS (or BODMAS), parentheses have the highest order followed by exponents followed by multiplication and division. Many people get this wrong, but multiplication and division have the same priority and must be performed from left to right. Lastly, addition and subtraction are done with the least priority, but they have the same priority as one another. This has to be done cautiously in a programming language, however. With recursive definitions of grammar, it is important to search for the more complex definition first. Otherwise, it may only ever detect the simplest definition.
primary-expression
: The absolute basis for all expressions. Can take the following forms:
identifier
constant
string-literal
(
expression
)
generic-selection
generic-selection
: A mapping for a function that allows for generic types.
generic
(
assignment-expression
,
generic-assoc-list
)
generic-assoc-list
: A list of associations to define a generic function.
generic-association
generic-assoc-list
,
generic-asociation
generic-association
: An association between type and code for a function.
type-name
:
assignment-expression
default
:
assignment-expression
postfix-expression
: An expression involving a postfix operator.
primary-expression
cast-expression
postfix-expression
[
expression
]
postfix-expression
(
argument-expression-list
?)
postfix-expression
.
identifier
postfix-expression
->
identifier
postfix-expression
++
postfix-expression
--
(
type-name
)
{
initializer-list
}
(
type-name
)
{
initializer-list
,
}
cast-expression
: An expression that converts from one type to another. Note that my casting is in the style of Python’s function-call casting. However, it will behave more like C’s casting. Nonetheless, I am keeping it at the same priority as a function call.
postfix-expression
type-name
(
cast-expression
)
argument-expression-list
: A list of arguments.
assignment-expression
argument-expression-list
,
assignment-expression
I’d like to pause for a moment to let everyone catch their breath and to address a few things. For the most part, all of this is straight from C. Since I haven’t written Sea yet, it’s likely some of this grammar is too C-like and I will have to modify it to make it more Python-like. When I do this, I will try to come back and update this article. Keep in mind once again that your language might not need all of this. I think starting from scratch, which is what I did, isn’t that ideal of a strategy when it comes to language grammar. I recommend you continue to follow this Sea grammar, figure out what each part means, and determine whether your language needs it or not. Shrink down the list and change symbols to fit your language, then attempt to add new elements when required. Also, don’t feel bad if you need to read this over many times to understand it. As I’m writing this, even I have to do that. Stick with it and you will endure.
exponential-expression
: An expression that performs exponentiation.
postfix-expression
exponential-expression
**
postfix-expression
unary-expression
: An expression containing a unary operator.
exponential-expression
++
unary-expression
--
unary-expression
unary-operator
exponential-expression
size
of
unary-expression
size
of
(
type-name
)
align
of
(
type-name
)
unary-operator
: An operator with only one operand. One of &
, *
, +
, -
, or ~
.
multiplicative-expression
: An expression that performs multiplicative operations.
unary-expression
multiplicative-expression
*
unary-expression
multiplicative-expression
/
unary-expression
multiplicative-expression
%
unary-expression
additive-expression
: An expression that performs additive operations.
multiplicative-expression
additive-expression
+
multiplicative-expression
additive-expression
-
multiplicative-expression
shift-expression
: An expression that performs bitwise shift operations.
additive-expression
shift-expression
<<
additive-expression
shift-expression
>>
additive-expression
bitwise-and-expression
: An expression that performs bitwise and.
shift-expression
bitwise-and-expression
&
shift-expression
bitwise-xor-expression
: An expression that performs bitwise xor.
bitwise-and-expression
bitwise-xor-expression
^
bitwise-and-expression
bitwise-or-expression
: An expression that performs bitwise or.
bitwise-xor-expression
bitwise-or-expression
|
bitwise-xor-expression
relational-expression
: An expression that performs a relational comparison. Note that these are Python-style relational comparisons.
bitwise-or-expression
relational-expression
<
bitwise-or-expression
relational-expression
>
bitwise-or-expression
relational-expression
<=
bitwise-or-expression
relational-expression
>=
bitwise-or-expression
relational-expression
==
bitwise-or-expression
relational-expression
!=
bitwise-or-expression
logical-not-expression
: An expression that performs logical not.
relational-expression
not
logical-not-expression
logical-and-expression
: An expression that performs logical and.
logical-not-expression
logical-and-expression
and
logical-not-expression
logical-or-expression
: An expression that performs logical or.
logical-and-expression
logical-or-expression
or
logical-and-expression
conditional-expression
: An expression that performs the condition operation.
logical-or-expression
expression
if
logical-or-expression
else
conditional-expression
assignment-expression
: An expression that performs an assignment operation.
conditional-expression
unary-expression
assignment-operator
assignment-expression
assignment-operator
: An operator that assigns a value. One of =
, **=
, *=
, /=
, %=
, +=
, -=
, <<=
, >>=
, &=
, ^=
, or |=
.
expression
: Code that resolves to a value.
assignment-expression
expression
,
assignment-expression
constant-expression
: An expression that has a value calculable at compile time.
conditional-expression
Alright! Congratulations on getting this far! We’ve managed to lay out all of our expressions. You may notice that I am primarily following C’s grammar, but making modifications that follow Python’s grammar. I think the second half of grammar is much easier to follow, since they are all of the familiar operators. Don’t panic, but we’re only one third of the way through Sea’s grammar.
Declarations
A declaration is code that specifies the nature of an identifier. These allow for creation of variables, structs, enums, etc. Note that I’m finding it hard to label all of these in a way that adds to understanding, so I will only label when I have something to add.
declaration
:
declaration-specifiers
init-declarator-list
?\n
static_assert-declaration
declaration-specifiers
:
storage-class-specifier
declaration-specifiers
?using
declaration-specifiers
as
declaration-specifiers
type-specifier
declaration-specifiers
?type-qualifier
declaration-specifiers
?function-specifier
declaration-specifiers
?alignment-specifier
declaration-specifiers
?align
declaration-specifiers
as
declaration-specifiers
init-declarator-list
:
init-declarator
init-declarator-list
,
init-declarator
init-declarator
:
declarator
declarator
=
initializer
storage-class-specifier
:
auto
external
register
static
type-specifier
:
void
bool
char
short
int
long
float
double
signed
unsigned
complex
atomic-type-specifier
struct-or-union-specifier
enum-specifier
typedef-name
struct-or-union-specifier
:
struct-or-union
identifier
?:
struct-declaration-block
\n
struct-or-union
identifier
struct-or-union
:
struct
union
struct-declaration-block
:
struct-declaration
struct-declaration-block
?\n
\t
indent-level
struct-declaration
struct-declaration
:
specifier-qualifier-list
struct-declarator-list
?static_assert-declaration
indent-level
: A collection of tabs that represents how deeply nested the current code is.
struct-qualifier-list
:
type-specifier
specifier-qualifier-list
?type-qualifier
specifier-qualifier-list
?alignment-specifier
specifier-qualifier-list
?
struct-declarator-list
:
struct-declarator
struct-declarator-list
,
struct-declarator
struct-declarator
:
declarator
declarator
?:
constant-expression
enum-specifier
:
enum
identifier
?:
enumerator-block
\n
enum
identifier
enumerator-block
:
enumerator
enumerator-block
?\n
\t
indent-level
enumerator
enumerator
:
enumeration-constant
enumeration-constant
=
constant-expression
atomic-type-specifier
:
atomic
(
type-name
)
type-qualifier
:
const
restrict
volatile
atomic
function-specifier
:
inline
returnless
declarator
:
pointer
?direct-declarator
direct-declarator
:
identifier
(
declarator
)
direct-declarator
[
type-qualifier-list
?assignment-expression
?]
direct-declarator
[
static
type-qualifier-list
?assignment-expression
]
direct-declarator
[
type-qualifier-list
static
assignment-expression
]
direct-declarator
[
type-qualifier-list
?*
]
direct-declarator
(
parameter-type-list
)
pointer
:
*
type-qualifier-list
?*
type-qualifier-list
?pointer
type-qualifier-list
:
type-qualifier
type-qualifier-list
type-qualifier
parameter-type-list
:
parameter-list
parameter-list
,
...
parameter-list
:
parameter-declaration
parameter-list
,
parameter-declaration
parameter-declaration
:
declaration-specifiers
declarator
declaration-specifiers
abstract-declarator
?
type-name
:
specifier-qualifier-list
abstract-declarator
?
abstract-declarator
:
pointer
pointer
?direct-abstract-declarator
direct-abstract-declarator
:
(
abstract-declarator
)
direct-abstract-declarator
[
type-qualifier-list
?assignment-expression
?]
direct-abstract-declarator
[
static
type-qualifier-list
?assignment-expression
]
direct-abstract-declarator
[
type-qualifier-list
static
assignment-expression
]
direct-abstract-declarator
[
type-qualifier-list
?*
]
direct-abstract-declarator
(
parameter-type-list
?)
typedef-name
:
identifier
initializer
:
assignment-expression
{
initializer-list
}
{
initializer-list
,
}
initializer-list
:
designation
?initializer
initializer
,
designation
?initializer
designation
:
designator-list
=
designator-list
:
designator
designator-list
designator
designator
:
[
constant-expression
]
.
identifier
static_assert-declaration
:
static
assert
(
constant-expression
,
string-literal
)
\n
Phew we’re through that! Let’s take a moment to enjoy this beautiful image.
Alright. I must say, that was much harder for me than the expressions. I found it to be much less familiar and much longer. I’m certain I made spelling mistakes, (if you find those please point them out for me) and I’m sure I failed to properly Pythonify a lot of C’s grammar. I will do my best to continue to come back and update it as I find mistakes. Now, all we have left are statements and external definitions. Luckily, these are the shortest sections.
Statements
A statement is the unit for code performing work. Each line of code is often a single statement.
statement
:
indent-level
labeled-statement
indent-level
compound-statement
indent-level
expression-statement
indent-level
selection-statement
indent-level
iteration-statement
indent-level
jump-statement
indent-level
pass
jump-statement
:
goto-statement
continue-or-break-statement
return-statement
goto-statement
:
go
to
identifier
\n
go
to
identifier
if
expression
\n
continue-or-break-statement
:
continue-or-break
\n
continue-or-break
if
expression
\n
continue-or-break
:
continue
break
return-statement
:
return
expression
\n
return
expression
if
expression
\n
compound-statement
:
\n
?declaration-list
?statement-list
?
declaration-list
:
declaration
declaration-list
declaration
statement-list
:
statement
statement-list
statement
expression-statement
:
expression
\n
iteration-statement
:
while
expression
:
statement
else-clause
?do
:
statement
while
expression
\n
?else-clause
?for
expression
?;
expression
?;
expression
?:
statement
else-clause
?for
each
declaration
in
expression
:
statement
else-clause
?
else-clause
:
else
statement
selection-statement
:
if
expression
:
statement
else-clause
?falling
?switch
expression
:
statement
labeled-statement
:
identifier
:
statement
case
constant-expression
:
statement
default
:
statement
External Definitions
translation-unit
:
external-declaration
translation-unit
external-declaration
external-declartion
: Allowed only at external (file) scope
function-definition
declaration
function-definition
: Declarator here is the function declarator
declaration-specifiers
?declarator
declaration-list
?compount-statement
I told you there wasn’t much left! Let me just quickly say, I’ve been working on this article for a few days so please forgive me if it isn’t that easy to follow. I will gladly take your suggestions for improving it. I especially would love if you point out errors I’ve made. The good news is, now that I’ve created the grammar for Sea, I can get to work writing the next article where we begin to program! As I program more and more of Sea, I can find out where I made mistakes in the grammar. In particular, I wasn’t too rigorous with the indentation or newline requirements. Anyways, this has been long enough so let me finish up here.
Finishing Up
If you are creating your own programming language, starting with the grammar can be a daunting task. It is one of the hardest parts of creating a language; however, that’s the reason I suggest starting with it. Look at languages you know and search for their grammar. Here is Python’s official grammar for example. You aren’t creating the world’s first programming language; you are creating a variation based on all of the existing programming languages. So, use those languages! People have put in work over decades to optimize these languages, so start from there and make the modifications you want. For example, other than general Pythonifying of C’s syntax, I’ve added a few things for Sea:
- For-each loops. These will allow for easy iteration through a list without using any more resources than if you had done it manually with a traditional for loop.
- The
falling
keyword for switch statements. This modifier will make switch statements behave as they traditionally do; if you do not break, the execution will fall through the switch statement. I find that most of the time, fall-through behavior is not that useful. So by default, switch statements will behave as if every case has a break. However, you can usecontinue
to fall through. - Conditional goto, break, continue, return. Often, you want to jump only if a condition is true. So, I’m providing a readable one-line solution to this. I personally don’t like the look of one-line if-statements in Python, so I’m creating this syntax. It won’t be confused with the conditional statement because this will never have an else clause while the conditional always will.
I’ll also point you at the else clause for loops from Python (the else statement is performed if you did not break from the loop) as well as the do-while loop from C. I would prefer that Sea have redundant notation if it means creating more readability in different situations. I’ve also cleaned up some of C’s strange underscore-heavy keywords, as well as split up multiword keywords into two keywords. This makes Sea read slightly more like a natural language and, thus, is optimal even if it means there are slightly fewer possible identifiers.
Look forward to the next article where I will begin to program Sea in Python and we can start to see how all of this grammar works. If you don’t fully understand the grammatical structure now, it’s alright; once we start writing code, it’ll become clearer. Check out my Programming Language List for the previous and following articles involved!