This is not an answer in the specific sense, but a general response to your question that I hope will be helpful.
[As ever, I defer to @raiph's attention to detail and actual code fix]
I am very impressed with the progress you have made and would encourage you to keep going ... I have built several real-world Raku grammars and they are always quite intricate, since that is the nature of parsing/regex at the character level. I am sure you know to take ChatGPT with a pinch of salt.
At first, I wanted to say "don't use Raku grammars to solve this problem; the quickest way to extract data from your source file is more likely to be a set of regexes". Why? Well, the source file is quite odd - there is a predilection for newlines and repeated info. A regex-type approach would try to pick out anchors (eg section, subsection, subsubsection) and then key off these to capture the variable data. In contrast, a grammar like yours is trying to pick up all the text, which is more work and more prone to small errors.
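For instance, here is a minimal sketch of that regex-first idea (the filename is hypothetical, and I am guessing the header syntax from your own grammar tokens - adjust to taste):

    # Pull out each @section/@subsection/@subsubsection anchor and
    # whatever follows it on the same line, ignoring everything else.
    my $source = 'your-file.txt'.IO.slurp;   # hypothetical filename
    for $source ~~ m:g/ '@' $<level>=[ 'subsubsection' | 'subsection' | 'section' ] $<rest>=[ \N* ] / -> $m {
        say "$m<level> => $m<rest>";
    }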
Then I saw you wrote that you want to check the correctness/completeness of the source. [This goal seems a bit nutty to me, but I am sure you have your reasons]
In this case, I think you have made a good (comprehensive) start, but your grammar is brittle - would you really care if version 0.001 became version 0.002?
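One way to loosen that up is to match the shape of a version number rather than the literal text. A toy sketch (the grammar name and the 'version' label are my inventions, not from your file):

    grammar Versioned {
        token TOP     { 'version' <.ws> <version> }
        token version { \d+ '.' \d+ }              # 0.001, 0.002, 1.10 all match
    }
    say Versioned.parse('version 0.002')<version>;   # OUTPUT: 「0.002」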
So, based on how I would do this myself, my current view is that your grammar's token structure needs a good impedance match with the language you are parsing. This is another way of saying: take a top-down look and try to extract the patterns you want in a hierarchical way.
What do I mean by that? What would I change?
- Many of the features are 3-line stanzas - so I would try to make a general token to match these paragraphs
- Many of these have repeated text - so I would try to check for, and then eliminate, the duplications
- They have a consistent syntax built from components - so have a token for each component
Something like:
    # ... what you have already around TOP ...
    token stanza   { <header> <tagged> <untagged> }
    token header   { '@' [ 'section' | 'subsection' | 'subsubsection' ] }
    token tagged   { '[#:tag "' <factor> <subject> <yyyymm> '"]' }   # look up ~ and % in the docs
    token untagged { '{' <factor> <subject> <yyyymm> '}' }
    token factor   { 'Factor' <.ws> \d+ ':' <.ws> }
    token subject  { [ '@italic{' ]? [ 'Quaffing' | 'Quenching' ] [ '}' ]? ',' <.ws> }
    token yyyymm   { \d ** 6 }   # or keep whatever you already have here
This is just a rough idea ... but hopefully it gives you a feel for the right level of granularity / reusability of tokens.
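If you want to play with it, here are the same tokens wrapped into a self-contained grammar, fed a single made-up stanza (the sample line is my invention, built to be consistent with the tokens above - your real file will differ):

    grammar Stanzas {
        token TOP      { [ <stanza> \n* ]+ }
        token stanza   { <header> <tagged> <untagged> }
        token header   { '@' [ 'section' | 'subsection' | 'subsubsection' ] }
        token tagged   { '[#:tag "' <factor> <subject> <yyyymm> '"]' }
        token untagged { '{' <factor> <subject> <yyyymm> '}' }
        token factor   { 'Factor' <.ws> \d+ ':' <.ws> }
        token subject  { [ '@italic{' ]? [ 'Quaffing' | 'Quenching' ] [ '}' ]? ',' <.ws> }
        token yyyymm   { \d ** 6 }
    }

    my $sample = q:to/END/;
    @section[#:tag "Factor 1: Quaffing, 202401"]{Factor 1: Quaffing, 202401}
    END

    say Stanzas.parse($sample) ?? 'parsed OK' !! 'no match';

Getting one stanza to parse first, then widening TOP to cover the whole file, is usually much less painful than debugging the whole grammar at once.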