I see that your second alternation, (?:<span[^>]*?color: black.*?>[\S\s\n]*?<code), is open-ended. With [\S\s\n]*?, the regex engine keeps scanning for the next <code until 'the end of time'. That may be a factor in the error:
"Invalid Regular expression": "The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent eternal matches that take an indefinite period time to locate."
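To illustrate what "open-ended" means here, the following is a minimal sketch that runs that alternation with Python's built-in re module. The sample HTML is hypothetical and not taken from the question. It does not reproduce the complexity error, which depends on the engine. It only shows that the match can run past the end of the black span and across unrelated markup until it reaches some later <code.

```python
import re

# The second alternation from the question, on its own.
open_ended = re.compile(r'<span[^>]*?color: black.*?>[\S\s\n]*?<code')

# Hypothetical input: the black span contains no <code> at all.
html = (
    '<span style="color: black;">no code here</span>\n'
    '<p>unrelated paragraph</p>\n'
    '<code style="background-color: transparent;">wanted</code>'
)

m = open_ended.search(html)
# The match starts at the span and keeps going until the first <code it can find,
# even though that <code is outside the span.
print(repr(m.group(0)))
```

Nothing in the pattern ties the search to the closing </span>, so on a long document the engine may have to scan a lot of text for every black span it encounters.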
I noticed that the <code> element we want to skip is not only preceded by <p> or <span> elements with style="color: black", it is also nested inside those elements.
So the pattern below skips the entire <p> or <span> element whenever that element's style contains color: black. It works with the test string you provided.
I am curious to see if this pattern/approach solves the issue. Please let me know.
REGEX PATTERN (PCRE2 Flavor; Flags:gms)
(?s)(?:<(p|span)[^>]*?color: black[^>]*>.*?<\/\1)(*SKIP)(*F)|<code\s*style="background-color:\s*transparent;">
Regex Demo: https://regex101.com/r/ebwZLJ/8
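For anyone who wants to try the same idea outside a PCRE2 engine, here is a minimal sketch in Python. Python's built-in re module does not support (*SKIP)(*F), so this swaps in the capture-group variant of the trick: the unwanted element is matched by the left branch, the wanted tag is captured in group 2 on the right branch, and only matches where group 2 is set are kept. The function name find_code_tags and the sample string are assumptions made for illustration.

```python
import re

# Left branch: a whole <p> or <span> element whose opening tag contains "color: black".
# Right branch: the <code ...> opening tag we actually want, captured in group 2.
PATTERN = re.compile(
    r'<(p|span)[^>]*?color: black[^>]*>.*?</\1'
    r'|(<code\s*style="background-color:\s*transparent;">)',
    re.DOTALL,  # same effect as (?s): dot also matches newline
)

def find_code_tags(html):
    """Return (start, end) spans of <code> tags outside black-styled <p>/<span> elements."""
    return [m.span(2) for m in PATTERN.finditer(html) if m.group(2) is not None]

# Hypothetical sample: the first <code> sits inside a black <p>, the second stands alone.
sample = (
    '<p style="color: black;"><code style="background-color: transparent;">skip</code></p>\n'
    '<code style="background-color: transparent;">keep</code>'
)

spans = find_code_tags(sample)
for start, end in spans:
    print(start, end, sample[start:end])
```

Like the PCRE2 pattern, this stops at the first closing tag of the same name, so a black <span> nested inside another <span> would end the skipped region early.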
REGEX NOTES:
(?s)
  Single-line flag: the dot (.) matches every character, including newline.
(?:
  Begin non-capturing group (?:...), which makes the alternation before (*SKIP)(*F) explicit.
<
  Match literal <.
(p|span)
  Capture group 1, referred to with \1 later in the pattern. Alternation (|): match literal p or span.
[^>]*?
  Negated class [^...]: match any character that is not >, 0 or more times (*), lazily (?).
color: black
  Match literal "color: black".
[^>]*
  Negated class [^...]: match any character that is not >, 0 or more times (*).
>
  Match literal >.
.*?
  Match any character (including newline, due to (?s)) 0 or more times, lazily (*?): match only as few characters as needed to make a match.
<\/
  Match literal </.
\1
  Match the character(s) captured in group 1, i.e. we are matching the closing tag of the <p> or <span> element.
)
  Close the non-capturing group.
(*SKIP)(*F)
  Consider the characters matched so far consumed, but do not return or acknowledge the match. Continue matching from here.
|
  Or
<code
  Match literal <code.
\s*
  Match any whitespace character 0 or more (*) times.
style="background-color:
  Match literal style="background-color:.
\s*
  Match any whitespace character 0 or more (*) times.
transparent;">
  Match literal transparent;">.

For reference, here is the regex pattern from the Question: https://regex101.com/r/kIm0bl/1