
@tef
Last active August 29, 2019 23:38

    # Trip Report: Writing a CommonMark Parser

    ## Resources

    - https://spec.commonmark.org/0.29/

    CommonMark Spec 0.29

- https://github.com/tef/toyparser2019/tree/master/toyparser/commonmark

My toy parser implementation

    ## Breakdown/Costs

- 2500 lines, CommonMark-specific code
  - 1300 lines, base CommonMark grammar
  - 600 lines, HTML tag handling (under CommonMark rules)
  - 200 lines, tree rewriting
    - Filling in backreferences/link definitions
    - Deactivating links that have nested links inside
    - Stripping images/links inside image text
    - Tight/loose list detection
    - Emphasis handling
  - 350 lines, HTML output
- 2000 lines, support code
  - 1000 lines, grammar definition library
  - 1000 lines, parser generator

    ## Pain points

Everything? Is that an answer?

    ### Tabs

    In truth, this only matters because of indented code blocks. Tabs make them a little worse. You have to track column values to expand tabs into their correct space values, and be able to capture a tab that spans grammar rules.

If the tab stop is 8, a single tab may have to satisfy two separate rules, matching as `whitespace(4), whitespace(4)`.
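A minimal sketch (Python, names mine, not the spec's algorithm) of the bookkeeping this forces: a tab's width depends on the current column, and a tab that spans a rule boundary has to report its unconsumed columns so the next rule can claim them. CommonMark's tab stop is 4; the example above uses 8.

```python
def tab_width(column, tab_stop=4):
    """Columns a tab occupies when it starts at `column`."""
    return tab_stop - (column % tab_stop)

def match_whitespace(line, pos, column, want, tab_stop=4):
    """Consume `want` columns of whitespace starting at (pos, column).

    Returns (new_pos, new_column, leftover), where `leftover` is how
    many columns of a boundary-spanning tab remain unclaimed, or
    None if there isn't enough whitespace.
    """
    need = want
    while need > 0 and pos < len(line):
        ch = line[pos]
        if ch == ' ':
            pos += 1; column += 1; need -= 1
        elif ch == '\t':
            w = tab_width(column, tab_stop)
            pos += 1; column += w
            if w > need:
                # the tab spans a rule boundary: hand the excess on
                return pos, column, w - need
            need -= w
        else:
            return None
    return (pos, column, 0) if need == 0 else None
```

With a tab stop of 8, asking for 4 columns against a leading tab consumes the whole tab and reports 4 leftover columns for the next rule.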

    ### Indented blocks

Indented code blocks make everything terrible. They hinge on a four-space indent, which means certain rules have to read ahead n characters, then back up and read forwards n-4 characters, before moving on.

    ### Blank Lines

    Every block type treats blank lines differently. Lists become loose or tight, amongst other examples.

    ### Contextual Newlines

Knowing when and where you can read a newline (that is, continue on to the next line) depends on which block construct the current paragraph is nested inside.

    ### Setext Headers

The `===` needs a `>` if the paragraph is inside a blockquote, but the lines in between the first and the last need not have a `>`.
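For example (my reading of the spec; worth checking against the reference implementation):

```
> foo
bar
> ===
```

should parse as a setext heading inside the blockquote, while dropping the `>` from the `===` line leaves plain paragraph text.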

    ### Blockquotes

A `>` can be missing on continuation lines, which is convenient, but leads to weird things:

    ```
    >> a
    > b
    ```

    is one paragraph inside two blockquotes.

    ### HTML

Seven cases, each with its own slightly different rules.

    ### Images

Can nest, can contain links inside alt text. These get formatted differently.

    ### Links

    Can't nest. Have to be parsed as if they nest. Can turn into text if not defined elsewhere. `[foo][bar][baz]` can be parsed in two different ways, `[foo [foo]]` isn't a link with a link inside. `[foo [foo]()][foo]` is.

Life would be easier if the outermost link were preserved, but alas, no.

    ### Emphasis

Can't be parsed beyond tokenization. Highly context sensitive. Matching depends on whether a link is active or not, and on whether the left and right flanks have a certain length or sum of lengths. Christ on a bike.
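The flanking test can be sketched roughly like this (Python, a simplified reading of the spec's left/right-flanking rules; the length and sum-of-lengths conditions are separate and not shown):

```python
import unicodedata

def is_punct(ch):
    """Roughly the spec's punctuation: Unicode P* plus some ASCII symbols."""
    return ch is not None and (unicodedata.category(ch).startswith('P')
                               or ch in '$+<=>^`|~')

def is_space(ch):
    """Whitespace; start/end of line counts as whitespace (ch is None)."""
    return ch is None or ch.isspace()

def flanking(text, start, end):
    """Classify the delimiter run text[start:end] as (left, right)
    flanking, per a simplified reading of the CommonMark rules."""
    before = text[start - 1] if start > 0 else None
    after = text[end] if end < len(text) else None
    left = (not is_space(after)) and (
        not is_punct(after) or is_space(before) or is_punct(before))
    right = (not is_space(before)) and (
        not is_punct(before) or is_space(after) or is_punct(after))
    return left, right
```

The opening `*` in `*foo*` is left-flanking only, the closing one right-flanking only, and an intraword `*` in `foo*bar` is both, which is why context has to break the tie.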

    ## Parsing with PEGs

    The usual commonmark strategy is this:

- tokenization, or regular expression matching
- a recursive descent parser that tracks things like columns and partial tabs
- a block parser, handling things like setext headers, mostly over tokens
- line-at-a-time parsing, choosing which DOM node to add to before parsing the line; tight/loose list recognition is handled here
- extracting backreferences
- resolving links (which ones can nest), handling alt text in images
- handling emphasis matching
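The line-at-a-time step can be sketched like this (toy Python, invented names, blockquotes only):

```python
def parse_line(line, open_blocks):
    """One step of a line-at-a-time block parser (toy sketch).

    `open_blocks` is the spine of currently open containers, e.g.
    ['document', 'blockquote', 'paragraph']. Each container tries to
    match its continuation marker against the line; returns
    (how_many_matched, rest_of_line).
    """
    rest = line
    matched = 0
    for block in open_blocks:
        if block == 'blockquote':
            stripped = rest.lstrip(' ')
            if not stripped.startswith('>'):
                break  # marker missing: possibly a lazy continuation
            rest = stripped[1:].removeprefix(' ')  # '>' eats one space
        matched += 1
    return matched, rest
```

On `>> a` with two open blockquotes everything matches and `a` is attached to the innermost block; on a bare `b` the blockquote marker fails, and whether `b` lazily continues a paragraph is decided by what follows.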

    Instead I have:

- A PEG grammar that produces a node tree
  - Links do not allow links defined inside
  - Links and images reparse the matched text as a paragraph, to handle missing link definitions
  - Indentation and continuation lines are handled with an `indent()` grammar rule
  - Emphasis markers are matched and counted
  - ATX headers are counted

    Then postprocessing over the parse tree:

    - Finding backreferences
    - Populating Links, Flattening inactive links
    - Handling Emphasis
- Handling loose/tight lists
    - Converting to HTML

    ### Indentation

    `indented { rules .... indent .... rules}`

A parser rule that takes some nested rules, along with an indent rule and a dedent rule.

Calling `indent` matches the indent rules in order, if any.

    Calling `partial_indent` matches the indent rules in order, and if an indent rule fails, but the dedent rule doesn't match, it continues as usual. This is how I handle continuation lines in the PEG rules.

It's very much the same approach as a line-at-a-time parser, but backtracking instead of precisely matching which block can continue. Amenable to memoization.
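A Python sketch of the idea, with invented names (my grammar library's real API differs); the dedent check in `partial_indent` is elided:

```python
class State:
    """Minimal parser state: input text, a cursor, and a stack of
    active indent rules."""
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.indents = []

def literal(s):
    """Rule matching the exact string `s` at the cursor."""
    def rule(state):
        if state.text.startswith(s, state.pos):
            state.pos += len(s)
            return True
        return False
    return rule

def indented(indent_rule, body_rule):
    """`indented { ... }`: run the body with one more indent rule on
    the stack, so nested lines must re-match it."""
    def rule(state):
        state.indents.append(indent_rule)
        try:
            return body_rule(state)
        finally:
            state.indents.pop()
    return rule

def match_indent(state, partial=False):
    """`indent`: match every stacked indent rule in order.
    `partial=True` is the partial_indent variant: a missing marker is
    tolerated, which is how continuation lines survive."""
    for rule in state.indents:
        if not rule(state):
            return partial
    return True
```

`partial_indent` is what lets a bare `bar` continue a paragraph opened under `> foo` without carrying its own `>`.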

    ### Links

Parallel parsing: if one rule matches, reparse its span with another rule, and store the result of both. I parse the link, and the link as if it were paragraph text, and resolve the ambiguity later.
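Sketched in Python with toy rules (not the real grammar):

```python
import re

def try_parse_link(text):
    """Toy inline-link rule: [label](destination). None on failure."""
    m = re.fullmatch(r'\[([^\]]*)\]\(([^)]*)\)', text)
    return ('link', m.group(1), m.group(2)) if m else None

def parse_inline(text):
    """Toy fallback: the same span kept as plain text."""
    return ('text', text)

def parse_maybe_link(text):
    """Parallel parse: keep both readings; postprocessing picks one
    once link definitions are known and nesting is resolved."""
    return {'link': try_parse_link(text), 'text': parse_inline(text)}
```

Both readings travel through the tree, so deactivating a link later is just dropping the `link` reading and keeping the `text` one.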

    ### Emphasis

    Same as everyone else, made simpler by not having to handle links at the same time. Still frustrating.

    ### PEG shenanigans

I put in variables and named captures, and a lot of the complexity became a bit more explicit, but manageable: backreferences, and counting operators that extract text or numbers, which can then be passed down into later rules.

I added relative offsets to lookaheads, too.

    ## Wrapup

Things that would make markdown significantly easier to parse have been covered before. To echo others, indented code blocks make everything else worse. Links are another case, as `[foo][bar][baz]` gets disambiguated after parse time.