Lexical Analysis (Flex/ANTLR) in CUDA

I’m currently in the process of trying to write a processor software stack (uArch simulator, compiler, binary tools, etc) in CUDA,
and while most things are pretty straight forward, the compiler frontend is turning out to be an area where I’m finding a lack
of related work and existing implementations, specifically the Lexer/Parser/AST.

So I’m hoping that some people here will find this interesting enough to walk through a potential design with me.

I’m much more interested in an implementation that is simple, easy to understand, and scalable on future processors than one
that trades implementation complexity for performance. Let’s start with one problem at a time (Lexical Analysis), and begin with
requirements:

Requirements:

  1. Consume an arbitrary long sequence of characters, produce the corresponding sequence of tokens (possibly with attached metadata) defined by a set of rules (regular expressions).
  2. Get good occupancy for large inputs on a 2015-2018 era GPU (extrapolate current architectures out a few die shrinks).
  3. Have linear or near-linear complexity for simple languages (don’t try to beat a sequential implementation based on FAs by brute forcing it).

Problems:

  1. Settle on a simple high-level algorithm. My current favorite is bottom-up DFA merging.
  2. Handling payload data, e.g. “some_variable_name” → TOKEN_IDENTIFIER (then token can be encoded as an int, how do you store the value “some_variable_name” semi-efficiently? How does this generalize to complex payload data?)
  3. How to automatically synthesize a CUDA engine from a language specification?

Algorithm Overview (bottom-up DFA merging):

  1. Sweep over a window of the input stream, feed each character as the input to an uninitialized state machine.
  2. Recursively merge neighboring state machines together, possibly generating tokens and a single state machine in a new state.
  3. Gather the generated tokens together into a sequence.

Any thoughts?
Alternative algorithms?
Data structures?
Related work?

Here’s how I think that bottom-up merging would work for a simple example requiring lookahead:

rule1: A B C |
A D C;

In the figure, the circles represent states, and the squares represent rules for merging states together.

This implementation would converge in at most two reduction stages, and be almost completely data-parallel. The amount of parallelism
will be determined by the input grammar.


state-transitions-bottom-up.pdf (8.82 KB)

[font=“Arial”]If you haven’t already, you should check out Leo Meyerovich’s research here starting with this paper Parallelizing the Web Browser.

Also, this looks relevant and is available on CiteSeer:[/font]

[font=“Arial”]

[/font] [indent][font=“Arial”]M. P. van Lohuizen. Survey of parallel context-free parsing techniques. Parallel and Distributed Systems Reports Series PDS-1997-003, Delft University of Technology, 1997.

[/font][/indent][font=“Arial”]

There are a pile of other parallel parsing citations at the CiteSeer link.[/font]

[font=“Arial”]Hope this helps and isn’t redundant![/font]

Also, this ParLab slide deck has a nice lexing hack/observation on pg. 15.

Thanks a lot for the links. I’m still reading through “Survey of parallel context-free parsing techniques” and the related
references. There are definitely some similarities with what I am proposing, possibly with what the paper calls connectionist
algorithms.

Some of the other links had some interesting observations, for example, that starting sequential parsers from random
locations, would typically converge quickly. However, I’m generally not in favor of speculative methods because I think
that they would fall apart when parsing worst-case grammars. For example, when a dependence in the DFA spans multiple
partitions. In such cases, I’m not sure if the result would even be correct.

Also, I don’t think that you can easily evolve a partitioning based algorithm into a conservative algorithm because,
unlike other partitioning based parallel algorithms like intersect or merge, the input cannot be easily sorted. So
you may have to linearly scan backwards after an initial partitioning stage to find the beginning of the first token in
the partition. I can imagine cases where the partition occurs in a very large comment, string, array, etc, and the
runtime is dominated by the linear scan.

I’m much more in favor of conservative algorithms where performance possibly degrades as the grammar becomes more complicated,
but the partitioning should be done with a divide and conquer algorithm to place a bound on the worst-case runtime.

There also seems to be a lot of recent interest in CYK parsing on GPUs.