https://www.alternetsoft.com.au/blog/code-parsing-explained

CODE PARSING EXPLAINED

Explains various approaches to syntax and semantic analysis for C#, Visual Basic, JavaScript, TypeScript, Python and other programming languages.

At AlterNET Software, we have developed several syntax parsers that power Code Editor's code-writing features, such as syntax highlighting, auto-completion, code formatting, and code outlining. Below is a brief explanation of the ways we use syntax parsing.

Generic Parsers

These parsers provide the most basic syntax parsing method; they can only perform syntax highlighting of the text in the editor. Generic parsers use finite-state automaton rules driven by regular expressions that match the parsed text. There is typically one rule per lexical type, e.g., identifiers, numbers, strings, and comments. In plain language, the rule for identifiers might read: an identifier is a word that starts with a character in the a-z or A-Z range and can include digits.
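As a rough illustration of the one-rule-per-lexical-type idea, here is a minimal Python sketch. It is not the Code Editor engine or its scheme format: the rule table, the `//` comment style, and the names `RULES` and `tokenize` are assumptions made for this example only.

```python
import re

# Hypothetical rule table: one regular expression per lexical type.
# Order matters: earlier rules win when several could match at a position.
RULES = [
    ("comment",    r"//[^\n]*"),
    ("string",     r'"(?:[^"\\\n]|\\.)*"'),
    ("number",     r"\d+(?:\.\d+)?"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),  # starts with a letter, may include digits
    ("whitespace", r"\s+"),
    ("other",      r"."),
]

# Combine the rules into a single pattern with one named group per rule.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def tokenize(text):
    """Return (lexical_type, lexeme) pairs for the text, skipping whitespace."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup  # name of the rule that matched
        if kind != "whitespace":
            tokens.append((kind, match.group()))
    return tokens
```

A highlighter built this way only needs to map each lexical type to a color; it knows nothing about the structure of the code beyond individual tokens.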

Please read more information about generic parsers with the example of creating your own syntax scheme in the Code Editor user guide on our documentation page.

In the future, we will upgrade our Generic engine to support the TextMate language grammars used in Visual Studio Code. This will allow additional features on top of syntax highlighting, such as automatic brace matching, and make all syntax schemes developed for Visual Studio Code available to Code Editor.

Advanced Parsers

These parsers use very similar finite-state automaton logic to perform lexical analysis. To improve performance, this logic is implemented via hard-coded routines instead of a regular-expression-based engine.
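To make the contrast with a regular-expression engine concrete, the following Python sketch recognizes a few lexical classes with a hand-written character loop. It is only an illustration of the technique, not our implementation: the function name `scan`, the token names, and the `//` comment style are assumptions.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def scan(text):
    """Hand-written scanner: each branch is one state of the automaton."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        start, ch = i, text[i]
        if ch.isspace():
            i += 1
            continue
        if ch in LETTERS:
            # Identifier: a letter followed by letters or digits.
            while i < n and (text[i] in LETTERS or text[i] in DIGITS):
                i += 1
            kind = "identifier"
        elif ch in DIGITS:
            while i < n and text[i] in DIGITS:
                i += 1
            kind = "number"
        elif ch == '"':
            i += 1
            # Run to the closing quote; a backslash skips the next character.
            while i < n and text[i] not in '"\n':
                i += 2 if text[i] == "\\" else 1
            if i < n and text[i] == '"':
                i += 1
            kind = "string"
        elif text.startswith("//", i):
            while i < n and text[i] != "\n":
                i += 1
            kind = "comment"
        else:
            i += 1
            kind = "other"
        tokens.append((kind, text[start:i]))
    return tokens
```

Each character is inspected once and no pattern backtracking takes place, which is the usual reason hand-coded scanners outperform a general regular-expression engine.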

We have implemented advanced parsers for several programming languages, including C#, Visual Basic, Python, Java, JavaScript, SQL, XML, and HTML. These parsers perform syntax analysis of the text to build an Abstract Syntax Tree (AST) representation of it and report syntax errors found during parsing. Code Editor then uses the AST for code outlining, syntax guidelines, smart formatting, and visual feedback on incorrect syntax in the code.
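The way an AST feeds code outlining and error reporting can be sketched with Python's standard `ast` module standing in for a purpose-built parser. The function name `outline` and the shape of its result are assumptions for this example. Unlike an editor-grade parser, `ast.parse` stops at the first syntax error instead of recovering and continuing.

```python
import ast


def outline(source):
    """Parse source and return (outline_entries, syntax_errors).

    Each outline entry is (name, first_line, last_line), which is enough
    for an editor to draw a collapsible region.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        # No tree is available, so report the error and an empty outline.
        return [], [(err.lineno, err.msg)]
    entries = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append((node.name, node.lineno, node.end_lineno))
    entries.sort(key=lambda entry: entry[1])  # order by starting line
    return entries, []
```

For a class `A` spanning lines 1 to 3 with a method `f` on lines 2 to 3, the outline contains one entry for each, and the line ranges give the editor its foldable sections.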

Features like IntelliSense (code completion), finding declarations and references, and the like require additional semantic information about symbols in the text.

For example, if the code contains a variable declaration such as var myString = "text", semantic analysis determines that myString is a variable of type string and links it to the string symbol, which contains all methods declared for the string class. This information is used later for tasks like code completion, such as when the user types myString. in the editor.
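The link between a declaration and the members offered for completion can be sketched in Python, where `my_string = "text"` plays the role of the declaration above. This is a deliberately tiny model that handles only assignments of literal constants; the function names `infer_variable_types` and `complete_members` are assumptions for the example.

```python
import ast


def infer_variable_types(source):
    """Map variable names to types, for assignments of literal constants only."""
    symbols = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    # Link the variable to the type of the literal assigned to it.
                    symbols[target.id] = type(node.value.value)
    return symbols


def complete_members(source, name):
    """Return the public members of the type inferred for the variable `name`."""
    inferred = infer_variable_types(source).get(name)
    if inferred is None:
        return []  # unknown symbol: nothing to offer
    return [member for member in dir(inferred) if not member.startswith("_")]
```

Calling `complete_members('my_string = "text"', "my_string")` yields string members such as `upper` and `split`, which is the list an editor would show after the user types `my_string.`.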

Some advanced parsers, like the ones for C#/Visual Basic, support code completion by resolving semantic information for a partial scope, such as a statement block or expression, on the fly when the user types special characters (such as "." after an identifier or "(" after the name of a method).

We have developed our proprietary implementation of the semantic analysis for Python and IronPython languages, which builds a semantic model of the whole text displayed in the editor (and also processes included files). This approach was inspired by studying Microsoft Code Analysis (“Roslyn”) API implementation, which we will explain below.

Code parsing with industrial-grade APIs for C#/Visual Basic and TypeScript/JavaScript

No matter how hard we try to support the full specification of a particular language, there is nothing better than being able to use the same methods of code analysis that native tools like Visual Studio or Visual Studio Code rely upon.

Microsoft has published the open-source .NET Compiler Platform ("Roslyn") project, which provides C# and Visual Basic compilers with a rich code analysis API. We use this API in our next-generation C# and Visual Basic parsers.

The latest versions of Visual Studio use this API internally, which is as good as it gets when parsing C# and Visual Basic code. The API covers syntax highlighting, error diagnostics, building the AST, code completion, finding declarations and references, and much more.

Some features, however, such as signature help for method parameters, code fixes, and code refactoring, are implemented internally, so we cannot access these APIs directly. We have observed workarounds in programs like RoslynPad, which pre-processes Microsoft assemblies during the build process and makes all internal classes public. We are reluctant to use such an approach in a commercial library; instead, we use Reflection to access these internals. We have already implemented the signature help tooltip for method parameters this way, and we are looking at implementing advanced functionality like code fixes and refactoring with the same approach.

As with the Roslyn-based parsers, our TypeScript/JavaScript parsers use the Microsoft TypeScript API, which is very similar to the Roslyn API. Most of the APIs we need for advanced code editing features, such as code completion, smart formatting, code fixes, and refactoring, are publicly available and already used in Code Editor.

Code parsing with LangServer.org protocol

The Language Server Protocol is used between a tool (the client) and a language smartness provider (the server) to integrate features like auto-complete, go to definition, find all references, and the like into the tool.
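On the wire, the client and server exchange JSON-RPC messages, each preceded by a Content-Length header. The Python sketch below frames a completion request that a client could send to a server. The helper names `encode_message`, `decode_message`, and `completion_request`, as well as the sample URI, are assumptions for the example and are not part of Code Editor; starting the server process and handling its responses are left out.

```python
import json


def encode_message(payload):
    """Frame a JSON-RPC payload with the Content-Length header used by the protocol."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_message(data):
    """Extract the JSON payload from one framed message (inverse of encode_message)."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip().decode("ascii"))
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


def completion_request(request_id, uri, line, character):
    """Build a textDocument/completion request; positions are zero-based."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/completion",
        "params": {
            "textDocument": {"uri": uri},
            "position": {"line": line, "character": character},
        },
    }


# Example: ask for completions at line 0, character 10 of a hypothetical file.
framed = encode_message(completion_request(1, "file:///demo.py", 0, 10))
```

Because every language server speaks this same framing and message vocabulary, one client implementation in the editor can serve many languages.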

Most tools implement only a subset of the Language Server Protocol specification, but with these parsers, Code Editor provides a code-writing experience comparable to that of the native tools.

The parser requires a language server to be installed on the target machine. Therefore, we provide two flavors of each parser: one that relies on a server already installed, and another that embeds all language-server-related files, which are copied the first time the code parser is initialized.

We currently have C/C++, Python, Lua, and PowerShell parsers ready to be used and are looking at supporting a few more languages using the LangServer protocol, including Java and XML.

Dmitry Medvedev