We’re Doing It Wrong – ANTLR

This is part of the WDIW series, where we reflect on our misuse of a specific technology, which results in all kinds of weird edge cases. In the end we always ask ourselves: is all software just buggy, or are we doing something wrong?

Use of ANTLR in DSL Platform

The grammar for DSL Platform is currently defined in ANTLR, or rather in an unmanageable 7k+ lines of ANTLR. And that doesn’t even include other grammars such as C#, Java or SQL, which it eventually should. Of course, we don’t actually work with 7k+ lines of code; instead we work with small snippets which are aggregated into that grammar behemoth.

So you could say that we have a small working set of ANTLR grammar – 100 lines or so – for defining a DSL ANTLR grammar. In it we build a snippet for each concept used in the DSL, and as a result get a fully defined grammar which we send to ANTLR.

In practice this looks something like:

keyword mixin;

rule mixin_rule [IToken<ModuleConcept> Module]
  scope [IToken<MixinConcept> current]
  <# mixin n=ident { $mixin_rule::current = Parse<MixinConcept>($Module, n.Text); } #>
  extends [module_rule [current]]

… where we define the “keyword” mixin, the rule mixin_rule with its arguments, its scope variables, the actual grammar and the conversion to our tokens. A keyword is not really a keyword in the sense that it’s reserved. A valid DSL Platform input is:

module module {
  mixin mixin;
}

From this you get a module named “module” and a mixin named “mixin”. We’ve even gone to such extremes that you can define a domain such as:

module public {
  entity class {
    int int;
    long[] for;
  }
}

But whether this will actually compile is up to the target languages (C# and Scala support this).
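
For reference, the standard ANTLR trick that makes such non-reserved keywords possible is to fold the keyword tokens back into the identifier rule. A sketch in ANTLR3 syntax (not our actual grammar):

MODULE : 'module';
MIXIN : 'mixin';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;

// an "identifier" is a plain ID or any keyword token,
// which is why "module module { mixin mixin; }" still parses
ident : ID | MODULE | MIXIN;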

Many, many years ago

Before the age of DSL Platform, while doing research and looking at language workbenches such as MPS, prototyping ended with a simple parser and grammar which works OK and is still used in protoduction today, but we have since evolved past it. The premise was that it must be super easy to add new ASTs. The grammar ended up being bound to the AST, without an explicit grammar definition. So while modeling, you were writing the AST directly, but without MPS-style strong type checking. In code, that prototype AST looked like:

public class MixinConcept : IConcept
{
  // [Key] properties define the concept's identity
  [Key]
  public ModuleConcept Module { get; set; }
  [Key]
  public string Name { get; set; }
}

where the parser can pick up the rule name from the type definition and the identity from the [Key] attributes. By having a ModuleConcept property you’re implicitly extending its grammar. You can also use an interface to extend multiple concepts at once. By putting something other than a text editor on top of it, you could work directly on the AST, just like MPS does. Of course, you also have external constraints, such as how each rule starts, ends or is extended.
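
As an illustration of that interface trick, here is a hypothetical sketch (IEntityLike, EntityConcept and HistoryConcept are invented for the example, not actual DSL Platform concepts):

public interface IEntityLike : IConcept { }

public class EntityConcept : IConcept, IEntityLike
{
  [Key]
  public ModuleConcept Module { get; set; }
  [Key]
  public string Name { get; set; }
}

// a [Key] property typed to the interface extends every
// concept that implements IEntityLike at once
public class HistoryConcept : IConcept
{
  [Key]
  public IEntityLike Source { get; set; }
}
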
By moving away from that approach, we got a lot more flexibility in the grammar, which still maps very closely to the AST, but it’s not a 1-1 mapping anymore. Unfortunately, we now have to actually define the grammar instead of having it implicitly defined by the AST.

One could argue that the previous approach is superior extensibility-wise (and extensibility is very high on our list of priorities), since you can just plug in a new AST type and the parser picks it up automatically (as long as no ambiguities or similar issues are created). But in practice you don’t change the grammar that often, and even when you do, you usually just add a new snippet which is automatically included into the aggregated grammar and passed through ANTLR. I guess dynamism at the grammar definition level didn’t pan out to be that important, since all you need is a recompile. Often you can have a mix of both, with dynamism in a few important places. For example, in DSL Platform we can add a new simple type (such as int/float/decimal) without grammar changes, as long as a few rules are satisfied. This makes the DSL easily extensible where it matters.
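
As a rough illustration only (ISimpleType and its members are invented for this sketch, not the actual DSL Platform API), adding such a simple type without touching the grammar might look like:

// hypothetical plug-in point: the generic simple-type rule already
// matches "name of a registered type", so registering a new type
// requires no grammar rebuild, just a recompile
public class MoneyType : ISimpleType
{
  public string Name { get { return "money"; } }
  public Type ClrType { get { return typeof(decimal); } }
}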

ANTLR issues

Funny how our newfound love for ANTLR turned out: today we constantly feel the need to remove ANTLR from our system, since it doesn’t cope well with our grammar. We are using ANTLR3 with infinite lookahead (not that we need it, but otherwise ANTLR produces an incorrect parser), since we get all the benefits of grammar validation, ambiguity warnings and a rather fast parser built on pre-built DFAs.
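
For reference, this is roughly what the top of such a grammar looks like in ANTLR3 (a sketch; the grammar name and target language are assumptions):

grammar DslPlatform;

options {
  language = CSharp3;
  // no fixed k: ANTLR3 defaults to LL(*), i.e. unbounded lookahead
  // implemented with cyclic DFAs, which is also what gives us the
  // grammar validation and ambiguity warnings mentioned above
}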

During the initial phases of grammar definition we would end up with a lot of strange errors, for example ANTLR missing arguments to a rule because it moved execution of the rule somewhere else. But there are workarounds for that, since there are a couple of ways to use context arguments in rules. What we can’t work around with vanilla ANTLR is its DFA explosion (at least not without extensive changes to ANTLR’s code). Some rules create such a big DFA that processing a single rule takes 60+ seconds. Even that is not a big deal, since we don’t rebuild the grammar all the time. The real problem is that you can’t compile the result for targets such as Java: a single method in Java is limited to 64KB of bytecode, and the generated code breaks that limit. Some of it can be fixed by moving variable initializations out of the class into other classes, but methods with giant switch statements and dozens of if statements cannot be fixed that easily.
What’s worse, the latest version of ANTLR3 doesn’t even manage to build our grammar. Strangely enough, the latest .NET ANTLR3 port works fine.

So when we started integrating tooling support into various IDEs, we expected it to be a breeze, since everyone tells you: oh, you have an ANTLR grammar, just plug it in and it works. And don’t get me started on keywords and context sensitivity. Since it’s much easier to have reserved keywords in your grammar, every ANTLR tutorial highlights how easy it is to get those keywords highlighted. But of course, only if those keywords are not context-sensitive. If they are context-sensitive, they are not really keywords, right? 😉

One could say that, again, WDIW, since you will almost always hear advice such as: don’t build context-sensitive grammars, they don’t play well with existing tooling. But if you want something which can be easily read by both a programmer and a non-technical person, you don’t really have a choice.

Well, of course you do: you can build a visual editor and let the programmer work with that 😀

Everything new is better

At one point we even considered moving to ANTLR4, since it seemed it should cope better with such a grammar. Thanks to our abstract factory factory around the ANTLR grammar, it wasn’t really hard to translate it to ANTLR4, but I guess we didn’t really like ANTLR4’s dynamic nature. With no more ambiguity warnings, and without really using many ANTLR features, it didn’t make much sense.

So, considering that we don’t use ANTLR for anything besides parsing into our token representation in a single pass, it seems like technical debt to use ANTLR at all.
ANTLR4 also prefers that you use listener/visitor hooks instead of injecting code snippets, since code snippets can’t easily be translated to other languages (except when you know how to translate them and have an abstract abstract factory which enables you to do just that).
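
For comparison, a minimal sketch of the ANTLR4 style in C# (DslParser, DslBaseListener and Mixin_ruleContext are the names ANTLR4 would generate from a hypothetical Dsl grammar):

// instead of a { ... } action embedded in the grammar, the token
// conversion lives in a listener invoked while walking the parse tree
public class MixinListener : DslBaseListener
{
  public override void ExitMixin_rule(DslParser.Mixin_ruleContext context)
  {
    var name = context.ident().GetText();
    // build the MixinConcept token from "name" here
  }
}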

I guess ANTLR4 was trying to solve some other problems, and it doesn’t fit our requirements nicely. What’s worse, when you look more deeply into its relationship with other parsers/communities, you will find that parser writers almost always suggest writing your own parser to take full control of the process. So it looks to me like ANTLR gave up on trying to serve those needs.

So we’ll be staying with an older version of ANTLR3, at least until we decide it’s time to drastically improve IDE support.

One thought on “We’re Doing It Wrong – ANTLR”

  1. megan adams

    Yeah what’s up with antlr4. I felt it was a pull back from serious grammars to toy / prototype / instructional / or small ad hoc domain specific grammars. Luckily antlr3 works well for us though I am uneasy about the lack of support going forward. There is a bug I’ve run into several times that I think we’ll just have to continue to carefully avoid. antlr4 was a disappointment. Is this a failure of the open source model? If antlr3 had paying customers I doubt it would have been scrapped.
