This post is a developer log. Its goals are to share practical experiences and thoughts, to offer insights that might spark some ideas of your own, to document my compiler internals and, hopefully, to be mildly entertaining.
This post is about my recent journey to create a web-based frontend for my language. First, some context on where the project was before that journey started.
In the long term, I have always envisioned that this project will incorporate a full development environment (my take on an IDE). Just providing a compiler toolchain is a dated idea. I think a newly developed language should support some kind of IDE integration (Language Server Protocol) at a minimum.
My roadmap had the following milestones:
- Create a command line compiler.
- Create a gedit plugin; I’ve had experience of this before. I was most likely going to keep gedit doing the syntax highlighting, but add more highlighting once parsing was complete: underlined errors, tooltips, etc., along with a panel listing errors and feedback.
- Implement a basic text editor with the parser at the heart.
The first milestone has been done for at least 10 months now. I have a parser (C++, LL(2) with operator precedence and error nodes) that creates a Concrete Syntax Tree (CST). Because the Abstract Syntax Tree (AST) is an induced subgraph of the CST, I can access either graph quite freely by indexing the nodes of interest and providing some methods to access them.
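To illustrate the idea of an AST living inside the CST, here is a minimal sketch. The `Node` and `Tree` types, the `in_ast` flag and the `ast_children` method are all hypothetical names made up for this example, not the actual data structures of the parser. It assumes nodes live in one vector and that the nodes of interest are simply marked and indexed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node layout: every token and rule lives in the CST.
struct Node {
    std::string kind;               // e.g. "var_decl", "whitespace"
    std::string text;               // source text covered by this node
    std::vector<std::size_t> kids;  // indices of child nodes in Tree::nodes
    bool in_ast;                    // true if the node also belongs to the AST
};

struct Tree {
    std::vector<Node> nodes;             // the full CST
    std::vector<std::size_t> ast_index;  // indices of the nodes of interest

    // Append a node and, if it is of interest, record it in the AST index.
    std::size_t add(std::string kind, std::string text, bool in_ast) {
        nodes.push_back(Node{std::move(kind), std::move(text), {}, in_ast});
        const std::size_t id = nodes.size() - 1;
        if (in_ast) ast_index.push_back(id);
        return id;
    }

    // Induced-subgraph view: keep only the edges whose target is an AST node.
    std::vector<std::size_t> ast_children(std::size_t id) const {
        assert(id < nodes.size());
        std::vector<std::size_t> out;
        for (std::size_t kid : nodes[id].kids) {
            if (nodes[kid].in_ast) out.push_back(kid);
        }
        return out;
    }
};
```

With this layout, walking `nodes` gives the CST (keywords, whitespace and all), while `ast_index` and `ast_children` give the AST view without building a second tree.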
I have already been generating output from the CST: Graphviz to visualise the nodes of the graph, and an HTML file that shows the original code, highlighted, with semantic error information.
Having a server run the compiler and respond with the HTML doesn’t seem that interactive, and is not going to benefit the project in the long run. I thought: let’s give WASM a try, how hard could it be? The compiler already generates HTML and is stable. Or so I thought.
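As a rough idea of what the Graphviz output involves, the sketch below turns a small tree into DOT text. `DotNode`, `escape_label` and `to_dot` are hypothetical names for this example only, and the real generator may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified node: a label plus indices of its children.
struct DotNode {
    std::string label;
    std::vector<std::size_t> kids;
};

// Escape the characters that would break a double-quoted DOT label.
std::string escape_label(const std::string& in) {
    std::string out;
    for (char c : in) {
        if (c == '"' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// Emit one DOT statement per node and one per parent-to-child edge.
std::string to_dot(const std::vector<DotNode>& nodes) {
    std::string out = "digraph cst {\n";
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        out += "  n" + std::to_string(i) + " [label=\"" +
               escape_label(nodes[i].label) + "\"];\n";
        for (std::size_t kid : nodes[i].kids) {
            assert(kid < nodes.size());  // child indices must refer to real nodes
            out += "  n" + std::to_string(i) + " -> n" + std::to_string(kid) + ";\n";
        }
    }
    out += "}\n";
    return out;
}
```

The resulting string can be saved to a file and rendered with the `dot` tool, or handed to a JavaScript Graphviz renderer in the browser.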
Being new to WASM, I wanted to see the development status of all the options. This took a number of weeks. The parser is quite simple structurally speaking, so I was open to the idea of using another language to make the web frontend a reality. I did try a number of them, but the truth is that some didn’t make the cut purely because they lacked another feature I valued, or because I was too unfamiliar and uncomfortable writing a full compiler in those languages.
I settled on two choices, D and C++. D’s introspection and metaprogramming features really impressed me. They are incredibly powerful, and the code for the parser when written in D was very consistent and clear.
But from my experiences and tests, I can only conclude that D’s WASM support is not production ready yet. Some of the fundamentals (classes, arrays) require the GC, and therefore the druntime. This isn’t available in the WASM output by default; you have to make your own. This is not a problem I wished to face.
C++, Emscripten, and embind
The project is already in C++, clearly a pro. My experience and tests with Emscripten have been positive overall. Installation and usage are straightforward. Embind is a feature of Emscripten that helps with exporting C++ functionality to JavaScript. There are some things worth noting if you are going to use Emscripten:
Exceptions are off by default
The default is to not catch an exception when one is thrown; instead the WASM will halt. The compiler flag -s DISABLE_EXCEPTION_CATCHING=x is what you need to look into to re-enable this (setting it to 0 turns catching back on).
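One way to make this visible is to catch everything at the boundary and hand the error back as a string. The sketch below is only an illustration: `parse_or_throw` and `parse_safely` are hypothetical names, and the stand-in parser does no real parsing. Under Emscripten, the catch blocks only run when exception catching has been enabled at build time, for example with -s DISABLE_EXCEPTION_CATCHING=0.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a parser entry point that may throw.
std::string parse_or_throw(const std::string& source) {
    if (source.empty()) throw std::runtime_error("empty input");
    return "ok: " + std::to_string(source.size()) + " bytes";
}

// Boundary wrapper: no exception escapes, errors come back as text.
// With exception catching disabled in an Emscripten build, the throw
// above would halt the module instead of reaching these handlers.
std::string parse_safely(const std::string& source) {
    try {
        return parse_or_throw(source);
    } catch (const std::exception& e) {
        return std::string("error: ") + e.what();
    } catch (...) {
        return "error: unknown exception";
    }
}
```

Calling `parse_safely("")` returns an error string on a native build; it is a quick way to check whether a given WASM build actually has catching enabled.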
Compiling is sloow
My parser implementation is very small, 24 cpp files with a total of 1108 lines of code. I don’t have the newest of machines (Intel i7–4785T @ 2.20GHz, 16GB RAM). Compiling a WASM release was taking 2–4 minutes to complete. I dread to think what it would take for larger projects.
The way I was invoking it, Emscripten was not keeping intermediate object files, so it had to recompile every file on each build. I do use the standard library, constexpr, template features, etc., and I have a small DSL and helper functions to build the CST.
I couldn’t deal with this, so I’ve opted for a technique called a Unity Build.
Disclaimer: A Unity Build does not always have the same semantics as a standard build. It can cause bugs and break things in unexpected ways.
However, in the normal case (how most people write C++ code), a Unity Build is fine. Please read up on the pros and cons of this technique.
Effectively, a Unity Build is achieved by #include’ing all required .cpp files into one .cpp file, which is then the only .cpp file you need to compile. This has brought compile times down to 9–10 seconds.
It was wrong of me to give the compiler so much complexity to deal with. My advice is to keep it as simple as possible. My wasm_module.cpp Unity Build file has a basic layout that exposes four functions:
- void check_for_errors() checks the CST for semantic errors.
- std::string toTree() produces a Graphviz representation.
- std::string token_list() produces the lexer token list.
- std::string asHTML() produces the HTML representation.
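A rough sketch of how such a Unity Build file could be laid out is shown below. It is not the actual wasm_module.cpp. The four function names come from the list above, but the file names in the comments, the `set_source` helper, the globals and all function bodies are placeholders invented for this example. The embind block is guarded so the same file also compiles with a native compiler.

```cpp
// wasm_module.cpp: the single translation unit handed to the compiler.
// A real Unity Build would pull in the other sources here, for example
// (hypothetical file names):
//   #include "lexer.cpp"
//   #include "parser.cpp"
// Stub bodies stand in for them so this sketch is self-contained.
#include <string>

namespace {
std::string g_source;       // hypothetical: last source text handed over
bool g_has_errors = false;  // hypothetical: result of the last check
}  // namespace

// Hypothetical helper (not in the list above) to pass in the editor text.
void set_source(const std::string& text) { g_source = text; }

// Placeholder bodies; the real versions would walk the CST.
void check_for_errors() { g_has_errors = g_source.empty(); }
std::string toTree() { return "digraph cst {}"; }
std::string token_list() { return g_source; }
std::string asHTML() { return "<pre>" + g_source + "</pre>"; }  // no escaping in this stub

#ifdef __EMSCRIPTEN__
#include <emscripten/bind.h>

// Export the functions to JavaScript through embind.
EMSCRIPTEN_BINDINGS(wasm_module) {
    emscripten::function("set_source", &set_source);
    emscripten::function("check_for_errors", &check_for_errors);
    emscripten::function("toTree", &toTree);
    emscripten::function("token_list", &token_list);
    emscripten::function("asHTML", &asHTML);
}
#endif
```

Keeping the exported surface down to a handful of free functions that take and return strings means the JavaScript side never has to deal with C++ object lifetimes.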
Bugs, bugs everywhere
Now we’re actually able to get some WASM output. The real pain begins.
There were a number of bugs right from the start. Some of these reminded me of older days: the experience of changing to a new toolchain and/or targeting a new architecture, and the issues that brings to light.
Parse as you type bugs
A whole set of bugs appeared simply because of the way I’m now interacting with the parser. Passing in a file and reading in the data is very different from handling key presses.
Take a production rule of: VAR := “var” <identifier>
I have tests for cases like “var 123”, which is invalid. I did not have a test for “var ”, a completely empty identifier. This case never happened while reading in files, but it does happen while you’re typing into the editor! I needed to add a load more error nodes and error recovery.
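To make the missing case concrete, here is a minimal hand-written sketch of that one rule. It is not the real parser: `Kind`, `ParseResult` and `parse_var` are hypothetical, and a plain result struct stands in for a proper error node in the CST. The point is that “var ” gets its own error path instead of falling through.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

enum class Kind { VarDecl, Error };

// Hypothetical result type standing in for a CST node or error node.
struct ParseResult {
    Kind kind;
    std::string detail;  // identifier name, or an error message
};

// VAR := "var" <identifier>   (anything after the identifier is ignored here)
ParseResult parse_var(std::string_view src) {
    constexpr std::string_view keyword = "var";
    if (src.substr(0, keyword.size()) != keyword)
        return {Kind::Error, "expected 'var'"};

    std::size_t pos = keyword.size();
    while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos])))
        ++pos;

    // The as-you-type case: input ends before any identifier appears.
    if (pos == src.size())
        return {Kind::Error, "missing identifier"};
    if (pos == keyword.size())
        return {Kind::Error, "expected whitespace after 'var'"};

    auto is_start = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
    };
    auto is_part = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    if (!is_start(src[pos]))
        return {Kind::Error, "invalid identifier"};

    const std::size_t start = pos;
    while (pos < src.size() && is_part(src[pos])) ++pos;
    return {Kind::VarDecl, std::string(src.substr(start, pos - start))};
}
```

Feeding the parser every prefix of a valid program (“v”, “va”, “var”, “var ”, and so on) is a cheap way to generate exactly the kind of inputs an editor produces.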
Memory management blunders
I am using moves, string_view, and holding tokens and tree nodes separately. There were some stupid typos and blunders that had stayed hidden for months. I can only assume they surfaced now because of differences in the target, e.g. WASM being stack based, which made it much less forgiving.
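As an illustration of the kind of blunder that moves and string_view invite (not necessarily the ones hit here), a string_view into a std::string can dangle once that string is moved or destroyed, because short strings may keep their characters inside the string object itself. The sketch below shows one defensive layout, with hypothetical names `TokenBuffer` and `tokenize`: the source text lives behind a unique_ptr, so moving the owner never moves the characters the tokens point at.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical container: owns the source text and the views into it.
struct TokenBuffer {
    // Heap-allocated and never moved itself, so the views below stay
    // valid even when the TokenBuffer is moved around.
    std::unique_ptr<const std::string> source;
    std::vector<std::string_view> tokens;  // views into *source
};

// Split on single spaces; just enough of a lexer for the illustration.
TokenBuffer tokenize(std::string text) {
    TokenBuffer out;
    out.source = std::make_unique<const std::string>(std::move(text));
    const std::string_view view = *out.source;

    std::size_t pos = 0;
    while (pos < view.size()) {
        while (pos < view.size() && view[pos] == ' ') ++pos;  // skip spaces
        const std::size_t start = pos;
        while (pos < view.size() && view[pos] != ' ') ++pos;  // scan a token
        if (pos > start) out.tokens.push_back(view.substr(start, pos - start));
    }
    return out;
}
```

Had `source` been a plain std::string member, moving a TokenBuffer holding a short input could leave every token pointing at the old object.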
The HTML Side
Contenteditable is broken
Contenteditable=”true”, job done. Unfortunately, this way is fundamentally broken.
It appears to be very difficult to update the underlying data and control the caret position. I never recommend accepting a dependency into a project lightly, but I 100% suggest going with a library that handles this for you.
I have tried a few and gone for CodeJar. It is lightweight and does everything I need. The HTML side of things fits in 45 lines.
I’ve covered the bulk of what’s to be said on this, but I still have a lot of fixing to do. I will publish a live demo link at some point. Here’s a GIF of the editor so far!