Designing a programming language from scratch

Thread Starter

xox

Joined Sep 8, 2017
838
So I have been working on side project over the past couple of weeks, designing a programming language. At this point everything is still somewhat crude. I do have the tokenizer and parser complete, although the latter doesn't actually build an AST (parse tree) just yet. Before doing that I really wanted to nail down the opcode format. So I have a simple interpreter set up now which reads binary files and executes them. The code is currently being generated "by hand" in a C program (basically just writing the opcodes and data to an array which then gets written to file). I can already write simple programs with loops and such. Here is the output of one of them:

Value of `Index`: 0
Value of `Limit`: 3
Entering loop...
Value of `Index`: 0
Value of `Index`: 1
Value of `Index`: 2
Done!
Pretty cool, but there is still SO much to do and I have many concerns over the bytecode format being abused or in any way allow the interpreter to enter into any kind of invalid state. For example currently badly crafted bytecode could easily jump OVER the initialization of a variable. The only problem with detecting that is that the bytecode is not so easy to scan in the FORWARD direction at that particular point of execution. To do that I would basically have to run a separate instance of the interpreter in some special mode just to be able to look ahead and detect potential invalid jumps.

Another related worry is that I don't want the code to be able to make arbitrary jumps into a raw data section (which is of course stored within the bytecode itself) and continue execution there. I would also like to avoid the possibility of anyone abusing the bytecode format to store unreachable code or data, as that what would most likely only aide hackers/malware-makers I think so I would really like to explicitly disallow such constructs. But again, that of course means a much more complex loading process.

I realize that these are really quite abstract problems and perhaps I should just ignore them for now and move on with the project. I guess it just feel like if I don't address these issues NOW then it will only hinder things later on down the road and may even force an entire rewrite of the bytecode format (and I REALLY don't want to have to do that).

Anyone here have any thoughts on how I might proceed from here?
 

ApacheKid

Joined Jan 12, 2015
1,609
So I have been working on side project over the past couple of weeks, designing a programming language. At this point everything is still somewhat crude. I do have the tokenizer and parser complete, although the latter doesn't actually build an AST (parse tree) just yet. Before doing that I really wanted to nail down the opcode format. So I have a simple interpreter set up now which reads binary files and executes them. The code is currently being generated "by hand" in a C program (basically just writing the opcodes and data to an array which then gets written to file). I can already write simple programs with loops and such. Here is the output of one of them:



Pretty cool, but there is still SO much to do and I have many concerns over the bytecode format being abused or in any way allow the interpreter to enter into any kind of invalid state. For example currently badly crafted bytecode could easily jump OVER the initialization of a variable. The only problem with detecting that is that the bytecode is not so easy to scan in the FORWARD direction at that particular point of execution. To do that I would basically have to run a separate instance of the interpreter in some special mode just to be able to look ahead and detect potential invalid jumps.

Another related worry is that I don't want the code to be able to make arbitrary jumps into a raw data section (which is of course stored within the bytecode itself) and continue execution there. I would also like to avoid the possibility of anyone abusing the bytecode format to store unreachable code or data, as that what would most likely only aide hackers/malware-makers I think so I would really like to explicitly disallow such constructs. But again, that of course means a much more complex loading process.

I realize that these are really quite abstract problems and perhaps I should just ignore them for now and move on with the project. I guess it just feel like if I don't address these issues NOW then it will only hinder things later on down the road and may even force an entire rewrite of the bytecode format (and I REALLY don't want to have to do that).

Anyone here have any thoughts on how I might proceed from here?
I've built full compilers from scratch.

Here's one I recently made public, it's a compiler written in C for the PL/I language. It creates 32-bit Windows COFF object modules that can be linked and executed.

It is real code with real code generator and optimizer, it builds OK in Visual Studio 2022. It was built mainly around 18 years ago and sold a few copies as a tool for validating PL/I code without having to rely on a host mainframe or mini, the languages has faded now and commercial value has all but gone.

PL/I is noteworthy as it has no reserved words, that language feature is impressive and means new keywords can be added over time without ever breaking existing source code, this brings me to the next project too:

Much more recently I've become increasingly frustrated with the growing grammar of C#, it is getting IMHO rather messy and idiosyncratic so as an exercise I started a project to create a new grammar that can/could do whatever C# can but without the tired old reliance on the old C and Java grammars and needing reserved words.

This project would - initially anyway - consume code in the new grammar and output C#, as a proof of concept.

This is something I dabble in as the mood takes me!
 
Last edited:

Thread Starter

xox

Joined Sep 8, 2017
838
I've built full compilers from scratch.


Here's one I recently made public, it's a compiler written in C for the PL/I language. It creates 32-bit Windows COFF object modules that can be linked and executed.


It is real code with real code generator and optimizer, it builds OK in Visual Studio 2022. It was built mainly around 18 years ago and sold a few copies as a tool for validating PL/I code without having to rely on a host mainframe or mini, the languages has faded now and commercial value has all but gone.


PL/I is noteworthy as it has no reserved words, that language feature is impressive and means new keywords can be added over time without ever breaking existing source code, this brings me to the next project too:


Much more recently I've become increasingly frustrated with the growing grammar of C#, it is getting IMHO rather messy and idiosyncratic so as an exercise I started a project to create a new grammar that can/could do whatever C# can but without the tired old reliance on the old C and Java grammars and needing reserved words.


This project would - initially anyway - consume code in the new grammar and output C#, as a proof of concept.


This is something I dabble in as the mood takes me!

Never heard of PL/I before actually! It's got a very interesting backstory too. I can't exactly say it suits my tastes, although I certainly understand just a bit more about its historical importance now.

As for my current conundrum I've decided to simply ignore the problem for now. At some point I will probably need to implement some sort of system for checking the validity of a given chunk of bytecode, but for now it should suffice to do only the most basic checks (bounds-checking and such) to ensure at the very least that the interpreter itself never falls into an invalid state. Currently I am using setjump/longjump to propagate those types of "exceptions" upward (so to speak) and that seems to be working very well at the moment.
 

ApacheKid

Joined Jan 12, 2015
1,609
Never heard of PL/I before actually! It's got a very interesting backstory too. I can't exactly say it suits my tastes, although I certainly understand just a bit more about its historical importance now.

As for my current conundrum I've decided to simply ignore the problem for now. At some point I will probably need to implement some sort of system for checking the validity of a given chunk of bytecode, but for now it should suffice to do only the most basic checks (bounds-checking and such) to ensure at the very least that the interpreter itself never falls into an invalid state. Currently I am using setjump/longjump to propagate those types of "exceptions" upward (so to speak) and that seems to be working very well at the moment.
PL/I was in fact the first high level language used to write an operating system and Multics was that operating system. PL/I was the first language (other than assembler of course) to support the concept of pointers (hence its appeal to OS development).

You can see traces of PL/I's influence on C slightly, comments for example, the syntax of label statements, data structures, semicolons etc.

PL/I was a better language, a more solid engineering oriented language than C (for example it supports a bit data type directly) but was a challenge at the time to implement a compiler for. It almost became the de facto programming language for the IBM PC due to the efforts of Gary Kildall, but Microsoft's growing dominance of that market put an end to that.

Anyway this is all now part of history!

Now as for bounds checking there are two kinds of bounds checking and each has a different runtime cost.

The first is "is the accessed element even within the array's storage" and the other is "is every subscript within its defined bounds".

The latter is more costly but the former is enough to catch invalid storage access.

So if we had a 10 x 10 array of doubles, then we have 100 elements, test 1 would trap [11][12] say but only test 2 would trap [5][20] the 5x20 is the same elements as 10x10 so lies within the array so would not fail test 1.

There are some great forums for compiler/interpreter writers, checkout Discord for example.
 
Top