tef created this gist Aug 29, 2019.
# Trip Report: Writing a CommonMark Parser

## Resources

- https://spec.commonmark.org/0.29/ CommonMark Spec 0.29
- https://github.com/tef/toyparser2019/tree/master/commonmark My toy parser implementation

## Breakdown/Costs

- 2500 lines, CommonMark specific code
  - 1300 lines, Base CommonMark grammar
  - 600 lines, HTML tag handling (under CommonMark rules)
  - 200 lines, Tree rewriting
    - Filling in backreferences/link definitions
    - Deactivating links that have nested links inside
    - Stripping images/links inside image text
    - Tight/loose list detection
    - Emphasis handling
  - 350 lines, HTML output
- 2000 lines, Support code
  - 1000 lines, Grammar definition library
  - 1000 lines, Parser-generator

## Pain points

Everything? Is that an answer?

### Tabs

In truth, tabs only matter because of indented code blocks; they make them a little worse. You have to track column values to expand tabs into the correct number of spaces, and be able to capture a tab that spans grammar rules. If a tab stop is 8, you may have to match a single tab as `whitespace(4), whitespace(4)`.

### Indented blocks

Indented code blocks make everything terrible. They require a four-space indent, which means certain rules have to read ahead and then re-read the remaining n-4 characters before moving on.

### Blank Lines

Every block type treats blank lines differently. Lists become loose or tight, amongst other examples.
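The column-tracking tab expansion described under Tabs can be sketched roughly like this (a minimal illustration, not the toy parser's actual code):

```python
def expand_tabs(line, tab_stop=4):
    """Expand tabs based on column position.

    A tab advances to the next multiple of the tab stop, so the
    number of spaces it produces depends on the current column,
    not a fixed width.
    """
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            pad = tab_stop - (col % tab_stop)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

With a tab stop of 8, a tab at column 0 spans 8 columns, so a four-space indent rule consumes only half of it; that is the partial-tab capture problem mentioned above.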
### Contextual Newlines

Knowing when and where you can read a newline and continue on to the next line depends on which block construct the current paragraph is nested inside.

### Setext Headers

The `===` needs a `>` if the paragraph is inside a blockquote, but the lines in between the first and last need not have a `>`.

### Blockquotes

A `>` can be missing on continuation lines, which is convenient, but leads to weird things:

```
>> a
> b
```

is one paragraph inside two blockquotes.

### HTML

Seven cases, each with somewhat unique rules.

### Images

Can nest, and can contain links inside alt text. These get formatted differently.

### Links

Can't nest, but have to be parsed as if they nest. Can turn into text if not defined elsewhere. `[foo][bar][baz]` can be parsed in two different ways. `[foo [foo]]` isn't a link with a link inside; `[foo [foo]()][foo]` is. Life would be easier if the outermost link were preserved, but alas, no.

### Emphasis

Can't be parsed beyond tokenization. Highly context sensitive. Matching depends on whether a link is active or not, and on whether the left and right flanks have a certain length or sum of lengths. Christ on a bike.

## Parsing with PEGs

The usual CommonMark strategy is this:

- tokenization / regular expression matching
- a recursive descent parser that tracks things like columns and partial tabs
- a block parser, handling things like setext headers, etc., into mostly tokens
- line-at-a-time parsing, choosing which DOM node to add to before parsing the line; tight/loose list recognition is handled here
- extracting backreferences
- resolving links (which ones can nest), handling alt text in images
- handling emphasis matching

Instead I have:

- A PEG grammar that produces a node tree.
- Links do not allow links defined inside.
- Links and images reparse the matched text as a paragraph, to handle missing linkdefs.
- Indentation and continuation lines are handled with an `indent()` grammar rule.
- Emphasis markers are matched and counted.
- ATX headers are counted.

Then postprocessing over the parse tree:

- Finding backreferences
- Populating links, flattening inactive links
- Handling emphasis
- Handling loose/tight lists
- Converting to HTML

### Indentation

`indented { rules .... indent .... rules }`

A parser rule that takes some nested rules, along with an indent and a dedent rule. Calling `indent` matches the indent rules in order, if any. Calling `partial_indent` matches the indent rules in order, and if an indent rule fails but the dedent rule doesn't match, it continues as usual. This is how I handle continuation lines in the PEG rules.

It's very much the same approach as a line-at-a-time parser, but backtracking instead of precisely matching which block can continue. Amenable to memoization.

### Links

Parallel parsing: if one rule matches, reparse it with another rule, and store the result of both. I parse the link, and the link as if it were paragraph text, and resolve the ambiguity later.

### Emphasis

Same as everyone else, made simpler by not having to handle links at the same time. Still frustrating.

### PEG shenanigans

I put in variables and named captures, and a lot of the complexity became a bit more explicit, but manageable. Backreferences, or counting operators to extract text or numbers, that can be passed down into later rules. I added relative offsets to lookaheads, too.

## Wrapup

Things that would make Markdown significantly easier to parse have been covered before. To echo others, indented code blocks make everything else worse. Links are another case, as `[foo][bar][baz]` gets disambiguated after parse time.
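To illustrate why `[foo][bar][baz]` needs post-parse disambiguation, here is a deliberately naive reference-resolution pass (a hypothetical helper, not code from the toy parser; it ignores nesting, escapes, and shortcut references):

```python
import re

def resolve_refs(text, defs):
    """Resolve '[text][label]' against a link-definition table.

    If the label is defined, emit a link; otherwise leave the
    brackets as literal text. An empty second pair, '[text][]',
    uses the text itself as the label.
    """
    def repl(m):
        label = (m.group(2) or m.group(1)).lower()
        if label in defs:
            return '<a href="{}">{}</a>'.format(defs[label], m.group(1))
        return m.group(0)  # undefined: keep the literal brackets
    return re.sub(r"\[([^\[\]]*)\]\[([^\[\]]*)\]", repl, text)
```

With `bar` defined, `resolve_refs("[foo][bar][baz]", {"bar": "/b"})` yields `<a href="/b">foo</a>[baz]`. But if only `baz` is defined, this left-to-right pass still commits to `[foo][bar]` first and never considers the `[foo]` + `[bar][baz]` reading, which is exactly why the disambiguation can't happen during a single parse.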