79729486

Date: 2025-08-08 08:15:35
Score: 1
Natty:
Report link

I'm going to go into some technical details of the explanation of

[why] lookbehind assertion[s] must be fixed length in [PCRE] regex[es],

using the specific case that brought me here to the question. (I came understanding the lack of support for variable lookbehind assertions, just looking for some details of how to write fixed-length assertions.) Note that I'm basically following the answer of @Alan-Moore but giving a specific example. Other answers and comments were all very helpful.

I'll make a toy version of my problem and tweak it a bit to match it to the problem with variable length lookbehind assertions.

I have a group of classifications that are used to mark linguistic and other characteristics of speeches made by politicians, celebrities, scientific researchers, etc. These need to be inserted into the text at the point where they occur. e.g. If the speaker gave a date with the phrase,

This hasn't been done since 1762 and shan't be done ever again if I can help it!

and dates were to be marked with bcde_ (we'll talk about the markers, below), the annotated version would be changed as

This hasn't been done since 1762 bcde_ and shan't be done ever again if I can help it!

To make these different from the actual words in the speeches, these are groups of four letters, either repeated or in-a-row, followed by an underscore. So, a few examples would be

aaaa_, abcd_, bbbb_, bcde_, ... , hhhh_, ... , mnop_, ... wxyz_, ... zzzz_.

Note that some combinations without the trailing underscore, such as eeee , mmmm, oooo , zzzz , are found in some linguistic corpora. That's one reason for the trailing underscore. FYI (and maybe TMI), we don't actually use all the combinations, but I'll pretend that we do for the example. However, the regex engine would need to check these even if it had an intelligent implementation. Note also that these are inserted using a nicely UX-designed GUI and the linking between classification string and meaning is done by some nice software which includes PCRE-compliant regex matching.

I won't go into details why, but whenever there is a zzzz_ or yyyy_ or xxxx_ anywhere in the speech (and there can be only one of those three per speech), it works out that one and only one of the four-contiguous-letters-plus-underscore classifications needs to be part of the annotation of the same speech. If one of zzzz_ or yyyy_ or xxxx_ is there, there can't be zero members of the set, {abcd_, bcde_, cdef_, defg_, ..., vwxy_, wxyz_} in the annotated speech, and there can't be 2 or 3 or 4 or more members of the set. One-and-only-one if there is a /([x-z])\1\1\1_/. There can be any number of /([a-w])\1\1\1_/, i.e. aaaa_, bbbb_, cccc_, ..., if we have one of /([x-z])\1\1\1_/.

So, to check if this rule is being followed, I write a couple of grep commands, the second of whose regex has a negative lookahead and a negative lookbehind. I do it naïvely, first, without concern for the grep: length of lookbehind assertion is not limited error. Note that I won't list all of the (26 - 3 =) twenty-three possible in-a-row classifications, I'll use (abcd_|bcde_| ... |wxyz_). This will compile and run, but it's not the code that will give the desired results. The real code uses all 23 possibilities.

And no, I don't type all 23 possibilities each time.

$ grep -P " ([x-z])\1\1\1_" file_with_all_annotated_speeches_one_per_line \
   > to_check_speeches_test_05
$ #  Down here, on the 2nd line    vv     is the problem
$ grep -P \
    '(?<! (abcd_|bcde_| ... |wxyz_).*)'\
'( (abcd_|bcde_| ... |wxyz_))'\
'(?!.* (abcd_|bcde_| ... |wxyz_))' \
          to_check_speeches_test_05 \
   > all_acceptable_speeches_as_per_05
$ #  `comm -23' will give us those lines (speeches) that are only in the 
$ #+ first argument (file), but not those in the second one nor those 
$ #+ which are in both file 
$ comm -23 <(sort to_check_speeches_test_05) \
           <(sort all_accceptable_speeches_as_per_05) \
 > not_acceptable_speeches_as_per_05 

Okay, let's look at some speeches. Let's say that the Gettysburg Address is now a wrong version, so we need to change it to the right version. I'll just give relevant parts (any parts that have a classifier string).

Wrong Version:

Fourscore and rstu_ seven years jjjj_ ago
...

The world will little note, nor long wwww_ remember zzzz_ what
...
by the people, for the people, shall not perish mnop_ from 
the earth. _-lincoln-getty-1863-_ 

Note that, in the file, this would all be on one line. If not, the grep would become a lot harder.

We can't have both rstu_ and mnop_ in the same line (speech) as zzzz_.

Right version:

Fourscore and seven years jjjj_ ago
...

The world will little note, nor long wwww_ remember zzzz_ what
...
by the people, for the people, shall not perish mnop_ from 
the earth. _-lincoln-getty-1863-_ 

I'm not sure how the order of processing the negative lookbehind and the negative lookahead work (perhaps, once the negative lookahead fails, i.e. finds the matching string, the regex exits; perhaps the negative lookbehind starts first; idk), but I'm going to pretend that we'll get a negative lookahead from the rstu_ and a negative lookbehind from the mnop_. (If the speech has zzzz_ and three of / (abcd_|bcde_| ... |wxyz_)/, I'm pretty sure both the lookahead and lookbehind would be run.)

The wrong version has a negative lookahead. (I think that's the case; if not, let's pretend.) That means the lookahead regex starts at rstu_ and runs one regex pass on up to 1491 characters. It finds mnop_ after going through 1452 characters. That means it fails, and I don't know if checks would continue. Still, I'm going to make the next assumption. Anyone is welcome to comment about whether I'm "running the regex engine" right. I think I'll keep this version with assumptions, anyway, but I'd like to know (and probably note) what a PCRE-compliant engine actually does.

Now let's assume that we also get a negative lookbehind from mnop_. (There might be a smarter algorithm, but lets assume that) the engine first moves one character back to the space (' ') then goes through a minimum of 5 characters (looking for 5 characters matching any of the letters-in-a-row_plus_underscore strings) or a maximum of 45 characters (to the end of the line) looking for any of (abcd_|bcde_| ... |wxyz_). Then it goes back to the 'h' in perish and looks through 5|46 characters for a letters-in-a-row string. Then 's' with 5|47, 'i' with 5|48, 'r' with 5|49, 'e' with minimum 5 or maybe 6—to the start of mnop_—or maybe up to 50—to the end of the line, 'p' with 5|7|51, ... all the way back through 1440 more characters to rstu_'s '_' which runs through 5|1448|1491, 'u' which runs through 5|1449|1492, 't' with 5|1449|1493, 's' with 5|1450|1494, then finally 'r', where I think it would only go through 5 characters until the rstu_ matched. Using the minimums, that's 1452 * 5 = 7262 steps, five times as many as the lookahead. (The exact multiple isn't a coincidence.) These thousands of steps are for the Gettysburg Address, which is famous for how short it is! (1487 characters, if you use the transcript of Cornell University's copy (linked above).

I won't do more details (sleep time), but imagine finding a mistake where one xxxx_ has three different (abcd_|bcde_| ... |wxyz_) instances in John F. Kennedy's We choose to go to the moon speech. Or imagine checking, whether there are errors or not, through the famous-for-its-length 1960 speech of Fidel Castro at the United Nations with its ~200k characters. If there are errors that make you use negative lookbehinds, that could be a lot of computing cycles, especially since we're likely dealing with recursion in the regex engine's details.


So, the way I see it, there are two things I could do.

First Way (Painful)

The first, painful way is to use the oft-repeated answer given here to do as @Dávid-Horváth suggested and

try to branch your lookbehind part to fixed length patterns if possible.

That would involve splitting up the ORs (|) in the first (?<!(abcd_|bcde_| ... |wxyz_).*), a process that would begin something like

'/(?:(?<! abcd_.*)|(?<! bcde_.*)| ... |(?<! wxyz_.*))'\
'( (abcd_|bcde_| ... | wxyz))(?!.* (abcd_|bcde_| ... |wxyz_))'

without metaprogramming (archived Wikipedia site, as I see it), I don't think so.

Second Way

If the lengths work out for you

My workaround is to make a copy of the file-with-one-speech-per-line, then take all the classification strings to the beginning of each line. Because other combinations will doubtless need regexes, I've figured the longest possible string (I think), with just

(alphabet)   (only one of x, y, z)
  ( 26     -          3 )             +

(alphabet of in-a-row-letters)
            26                        =

49 possible classification strings

(I didn't note that no classification string may be repeated, but that's the case.)

I think that's the max we'd need, with something like

/(?<! (aaaa_|abcd_|bbbb_|bcde_| ... |wwww_|wxyz_).{0,50})/

for the negative lookbehind part.

I've experimented on my system1 and found that I can go up to .{0,251} before I get a complaint about grep: branch too long in variable-length lookbehind assertion. If I went really high (past the 50000 range into the into the .{0,70000}, I got a complaint about grep: number too big in {} quantifier.

I'm not really sure why the number, 252 is considered too big for the lookbehind case. Maybe it has to do with the amount of memory needed to carry out such a lookbehind.

I checked and found that my length would not still be okay if I needed to count characters, figuring that a max would be

49 classification strings  *  5 letters  + 1 space + 49 underscores  =  295

which is greater than 251. However, I figure I could rewrite the pattern as

/(?<! (aaaa|abcd|bbbb|bcde| ... |wwww|wxyz)_.{0,251})/

giving me

49 * 4 + 1 + 1 = 198

, easily within the allowed length. I figured I might as well use the complete 251, just in case.


Feel free to comment about my misunderstandings of PCRE regexes. I love to learn, and I'd love to make this answer as accurate as possible.


Notes:

[1]

My system:

$ uname -a
CYGWIN_NT-10.0-19045 MY-MACHINE 3.6.3-1.x86_64 2025-06-05 11:45 UTC x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 5.2.21(1)-release (x86_64-pc-cygwin)
Reasons:
  • Blacklisted phrase (0.5): I need
  • Blacklisted phrase (1): to comment
  • Long answer (-1):
  • Has code block (-0.5):
  • User mentioned (1): @Alan-Moore
  • User mentioned (0): @Dávid-Horváth
Posted by: bballdave025