tef created this gist Aug 29, 2019.
# Trip Report: Writing a CommonMark Parser

## Resources

- https://spec.commonmark.org/0.29/ CommonMark Spec 0.29
- https://github.com/tef/toyparser2019/tree/master/commonmark My toy parser implementation

## Breakdown/Costs

- 2500 lines, CommonMark specific code
  - 1300 lines, Base CommonMark grammar
  - 600 lines, HTML tag handling (under CommonMark rules)
  - 200 lines, Tree rewriting
    - Filling in backreferences/link definitions
    - Deactivating links that have nested links inside
    - Stripping images/links inside image text
    - Tight/loose list detection
    - Emphasis handling
  - 350 lines, HTML output
- 2000 lines, Support code
  - 1000 lines, Grammar definition library
  - 1000 lines, Parser-generator

## Pain points

Everything? Is that an answer?

### Tabs

In truth, tabs only matter because of indented code blocks; they make them a little worse. You have to track column values to expand tabs into the correct number of spaces, and be able to capture a tab that spans grammar rules. If a tab stop is 8, you may have to match a single tab as `whitespace(4), whitespace(4)`.

### Indented blocks

Indented code blocks make everything terrible. They require a four-space indent, which means certain rules have to read ahead and then re-read the remaining n-4 characters before moving on.

### Blank Lines

Every block type treats blank lines differently. Lists become loose or tight, amongst other examples.
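The column-tracking tab expansion described under Tabs can be sketched roughly like this (a minimal illustration, not the toy parser's actual code):

```python
def expand_tabs(line, tab_stop=4):
    """Expand tabs based on column position.

    A tab advances to the next multiple of the tab stop, so the
    number of spaces it produces depends on the current column,
    not a fixed width.
    """
    out = []
    col = 0
    for ch in line:
        if ch == "\t":
            pad = tab_stop - (col % tab_stop)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

With a tab stop of 8, a tab at column 0 spans 8 columns, so a four-space indent rule consumes only half of it; that is the partial-tab capture problem mentioned above.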
### Contextual Newlines

Knowing when and where you can read a newline and continue on to the next line depends on which block construct the current paragraph is nested inside.

### Setext Headers

The `===` needs a `>` if the paragraph is inside a blockquote, but the lines in between the first and last need not have a `>`.

### Blockquotes

A `>` can be missing on continuation lines, which is convenient, but leads to weird things:

```
>> a
> b
```

is one paragraph inside two blockquotes.

### HTML

Seven cases, each with somewhat unique rules.

### Images

Can nest, and can contain links inside alt text. These get formatted differently.

### Links

Can't nest, but have to be parsed as if they nest. Can turn into text if not defined elsewhere. `[foo][bar][baz]` can be parsed in two different ways. `[foo [foo]]` isn't a link with a link inside; `[foo [foo]()][foo]` is. Life would be easier if the outermost link were preserved, but alas, no.

### Emphasis

Can't be parsed beyond tokenization. Highly context sensitive. Matching depends on whether a link is active or not, and on whether the left and right flanks have a certain length or sum of lengths. Christ on a bike.

## Parsing with PEGs

The usual CommonMark strategy is this:

- tokenization / regular expression matching
- a recursive descent parser that tracks things like columns and partial tabs
- a block parser, handling things like setext headers, etc., into mostly tokens
- line-at-a-time parsing, choosing which DOM node to add to before parsing the line; tight/loose list recognition is handled here
- extracting backreferences
- resolving links (which ones can nest), handling alt text in images
- handling emphasis matching

Instead I have:

- A PEG grammar that produces a node tree.
- Links do not allow links defined inside.
- Links and images reparse the matched text as a paragraph, to handle missing linkdefs.
- Indentation and continuation lines are handled with an `indent()` grammar rule.
- Emphasis markers are matched and counted.
- ATX headers are counted.

Then postprocessing over the parse tree:

- Finding backreferences
- Populating links, flattening inactive links
- Handling emphasis
- Handling loose/tight lists
- Converting to HTML

### Indentation

`indented { rules .... indent .... rules }`

A parser rule that takes some nested rules, along with an indent and a dedent rule. Calling `indent` matches the indent rules in order, if any. Calling `partial_indent` matches the indent rules in order, and if an indent rule fails but the dedent rule doesn't match, it continues as usual. This is how I handle continuation lines in the PEG rules.

It's very much the same approach as a line-at-a-time parser, but backtracking instead of precisely matching which block can continue. Amenable to memoization.

### Links

Parallel parsing: if one rule matches, reparse it with another rule, and store the result of both. I parse the link, and the link as if it were paragraph text, and resolve the ambiguity later.

### Emphasis

Same as everyone else, made simpler by not having to handle links at the same time. Still frustrating.

### PEG shenanigans

I put in variables and named captures, and a lot of the complexity became a bit more explicit, but manageable. Backreferences, or counting operators to extract text or numbers, that can be passed down into later rules. I added relative offsets to lookaheads, too.

## Wrapup

Things that would make Markdown significantly easier to parse have been covered before. To echo others, indented code blocks make everything else worse. Links are another case, as `[foo][bar][baz]` gets disambiguated after parse time.
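To illustrate why `[foo][bar][baz]` needs post-parse disambiguation, here is a deliberately naive reference-resolution pass (a hypothetical helper, not code from the toy parser; it ignores nesting, escapes, and shortcut references):

```python
import re

def resolve_refs(text, defs):
    """Resolve '[text][label]' against a link-definition table.

    If the label is defined, emit a link; otherwise leave the
    brackets as literal text. An empty second pair, '[text][]',
    uses the text itself as the label.
    """
    def repl(m):
        label = (m.group(2) or m.group(1)).lower()
        if label in defs:
            return '<a href="{}">{}</a>'.format(defs[label], m.group(1))
        return m.group(0)  # undefined: keep the literal brackets
    return re.sub(r"\[([^\[\]]*)\]\[([^\[\]]*)\]", repl, text)
```

With `bar` defined, `resolve_refs("[foo][bar][baz]", {"bar": "/b"})` yields `<a href="/b">foo</a>[baz]`. But if only `baz` is defined, this left-to-right pass still commits to `[foo][bar]` first and never considers the `[foo]` + `[bar][baz]` reading, which is exactly why the disambiguation can't happen during a single parse.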