79395318

Date: 2025-01-28 22:34:48
Score: 0.5
Natty:

This is not an answer in the specific sense, but a general response to your question that I hope will be helpful.

[As ever, I defer to @raiph's attention to detail and actual code fix]

I am very impressed with the progress that you have made and would encourage you to keep going ... I have built several real-world raku grammars, and they are always quite intricate, since that is the nature of parsing/regex at the character level. I am sure you know to take ChatGPT with a pinch of salt.

At first, I wanted to say "don't use raku grammars to solve this problem; the quickest way to extract data from your source file is more likely to be a set of regexes". Why? Well, the source file is quite odd - there is a predilection for newlines and repeated info. A regex-type approach would try to pick out anchors (e.g. section, subsection, subsubsection) and then key off these to capture the variable data. In contrast, a grammar like yours tries to pick up all the text, which is more work and more prone to small errors.

Then I saw you wrote that you want to check the correctness/completeness of the source. [This goal seems a bit nutty to me, but I am sure you have your reasons]

In this case, I think you have made a good (comprehensive) start, but your Grammar is brittle - would you really care if version 0.001 became version 0.002?

So my current view, based on how I would do this myself, is that your grammar token structure needs a good impedance fit with the language that you are parsing. This is another way of saying: take a top-down look and try to extract the patterns you want in a hierarchical way.

What do I mean by that? What would I change...

  1. Many of the features are 3-line stanzas - so I would try to make a general token to match these paragraphs

  2. Many of these have repeated text - so I would try to check for and then eliminate the duplications

  3. They have a consistent syntax built from components, so have a token for each component

Something like:

... what you have already around TOP ...

token stanza   { <header> <tagged> <untagged> }

token header   { '@' [ 'section' | 'subsection' | <subsub> ] }

token tagged   { '[#:tag "' <factor> <subject> <yyyymm> '"]' }   # look up ~ and % in the docs
token untagged { '{' <factor> <subject> <yyyymm> '}' }
token factor   { 'Factor' <.ws> \d+ ':' <.ws> }
token subject  { [ '@italic{' ]? [ 'Quaffing' | 'Quenching' ] '}'? ',' <.ws> }
token yyyymm   { ... }   # like you have it already

This is just a rough idea ... but hopefully you get the feeling for the level of granularity / reusability of tokens.
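If it helps to see the shape end-to-end, here is a minimal self-contained sketch along those lines. Note that the grammar name, the sample input, and the token bodies are my illustrative assumptions, not your actual source format - swap in your own literals and your existing yyyymm token:

```raku
# A minimal runnable sketch -- the sample text and token details are
# illustrative assumptions, not the real source format.
grammar StanzaSketch {
    token TOP      { <header> \n <tagged> \n <untagged> }
    token header   { '@' [ 'section' | 'subsection' | 'subsubsection' ] }
    token tagged   { '[#:tag "' <factor> <subject> <yyyymm> '"]' }
    token untagged { '{' <factor> <subject> <yyyymm> '}' }
    token factor   { 'Factor' <.ws> \d+ ':' <.ws> }
    token subject  { [ '@italic{' ]? [ 'Quaffing' | 'Quenching' ] '}'? ',' <.ws> }
    token yyyymm   { \d ** 6 }          # placeholder, e.g. 202501
}

# An assumed 3-line stanza in the shape described above
my $sample = q:to/END/.chomp;
@subsection
[#:tag "Factor 1: Quaffing, 202501"]
{Factor 1: Quaffing, 202501}
END

say StanzaSketch.parse($sample) ?? 'parsed OK' !! 'no match';
```

The point of the one-token-per-component layout is that when the source format wobbles (say, `@italic{...}` appears or disappears), you adjust one small token instead of rewriting a long regex.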

Reasons:
  • Blacklisted phrase (1): not an answer
  • Blacklisted phrase (0.5): Why?
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • User mentioned (1): @raiph's
  • High reputation (-1):
Posted by: librasteve