79414890

Date: 2025-02-05 13:23:15
Score: 1.5

3 + 1 Things to Consider when Optimizing Search with Regex:

I ran some analysis on the pattern that @dawg created to see what happens. I searched a 1,000-line text sample on regex101.com using six different permutations of the (email|date|phone) pattern (see the links below). For comparison, I ran each of the six permutations in both the Python and PCRE2 flavors. PCRE2 is the Perl-compatible regex library used by PHP.

Here's what I discovered. I was especially surprised by the impact of the inline flag.

Here is one of the six (email|phone|date) permutations, shown with each flag style:

# 1:inline-flag:    
pattern = r'''(?x)
(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)|
(?P<phone>\b\d{3}-\d{3}-\d{4}\b)|
(?P<date>\b\d{4}-\d{2}-\d{2}\b)
''' 

# 1:re.X:   
pattern = r'''
(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)|
(?P<phone>\b\d{3}-\d{3}-\d{4}\b)|
(?P<date>\b\d{4}-\d{2}-\d{2}\b)
''' 
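For reference, here is a minimal sketch of how the two variants would be compiled with Python's `re` module. The sample sentence and the helper name `tagged_matches` are made up for illustration; this is not the 1,000-line sample used for the step counts. Both variants should return the same matches, since only the place where verbose mode is switched on differs.

```python
import re

# Variant with the inline flag: (?x) sits at the very start of the pattern.
inline_pattern = r'''(?x)
(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)|
(?P<phone>\b\d{3}-\d{3}-\d{4}\b)|
(?P<date>\b\d{4}-\d{2}-\d{2}\b)
'''

# Variant without the inline flag: verbose mode is passed as re.X instead.
flag_pattern = r'''
(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)|
(?P<phone>\b\d{3}-\d{3}-\d{4}\b)|
(?P<date>\b\d{4}-\d{2}-\d{2}\b)
'''

inline_re = re.compile(inline_pattern)
flag_re = re.compile(flag_pattern, re.X)

# Hypothetical sample text (an assumption, not the original test data).
sample = "Mail bob@example.com or call 555-123-4567 before 2025-02-05."


def tagged_matches(compiled, text):
    """Return (group name, matched text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in compiled.finditer(text)]


inline_hits = tagged_matches(inline_re, sample)
flag_hits = tagged_matches(flag_re, sample)
print(inline_hits)
print(inline_hits == flag_hits)
```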

Outcome:

Python regex flavor:

Inline flag average steps: 298,866 {min 297,777, max 299,955}

Regular flag average steps: 265,101 {min 264,012, max 266,190}

Difference: 33,765 (exactly the same with every pattern)

PCRE2 regex flavor (Perl-compatible, used by PHP):

Inline flag average steps: 214,344 {min 213,255, max 215,433}

Regular flag average steps: 189,495 {min 188,406, max 190,584}

Difference: 24,849 (exactly the same with every pattern)

All permutations completed in < 150ms.


What I discovered:

1) Flavor Matters: PCRE2 regex flavor vs. Python regex flavor

With the inline flag ((?x)), the Python flavor took exactly 84,522 more steps than the PCRE2 flavor for each permutation; with the regular flag (re.X), it took exactly 75,606 more. Measured against Python's own totals, those differences are 28.28% and 28.52% of the steps on average.

Put another way, PCRE2 needed about 28% fewer steps than Python (roughly 76K to 85K fewer), or equivalently Python needed about 40% more steps than PCRE2. Processing time was roughly cut in half with the PCRE2 flavor compared with Python.

For large data sizes, the regex flavor can make a big difference.
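The percentage depends on which flavor is taken as the baseline. Here is a small sketch of the arithmetic, using the average step counts from the Outcome section; the helper name `compare` is just for illustration.

```python
# Average step counts reported in the Outcome section (regex101).
steps = {
    "inline (?x)": {"python": 298_866, "pcre2": 214_344},
    "re.X": {"python": 265_101, "pcre2": 189_495},
}


def compare(python_steps, pcre2_steps):
    """Return the step difference and both ways of expressing it as a percentage."""
    diff = python_steps - pcre2_steps
    pct_more = 100 * diff / pcre2_steps    # Python's extra steps, PCRE2 as baseline
    pct_fewer = 100 * diff / python_steps  # PCRE2's savings, Python as baseline
    return diff, pct_more, pct_fewer


results = {flag: compare(v["python"], v["pcre2"]) for flag, v in steps.items()}

for flag, (diff, pct_more, pct_fewer) in results.items():
    print(f"{flag}: diff={diff:,}  Python +{pct_more:.2f}%  PCRE2 -{pct_fewer:.2f}%")
```

The same difference in steps reads as roughly 40% "more" or 28% "fewer" depending on the denominator, so it helps to state the baseline explicitly.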

2) Flag Type Matters: Inline flag (?x) vs. regular flag re.X

For Python, regex with inline flag had 12.74% more steps than regular flag, exactly 33,765 each.

For PCRE2, regex with inline flag had 13.11% more steps than regular flag, exactly 24,849 each.

This means you will have 11.4% fewer steps on average if you remove the inline flag and use the regular flag instead. So, to optimize, it makes sense to replace the inline flag with the regular flag re.X.
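The 11.4% figure can be reproduced from the average step counts in the Outcome section. A quick sketch, where the savings are measured against the inline-flag totals and then averaged over the two flavors (the names `averages` and `pct_saved` are mine):

```python
# (inline-flag average, regular-flag average) per flavor, from the Outcome section.
averages = {
    "python": (298_866, 265_101),
    "pcre2": (214_344, 189_495),
}


def pct_saved(inline_steps, flag_steps):
    """Percentage of steps saved by dropping the inline flag."""
    return 100 * (inline_steps - flag_steps) / inline_steps


savings = {flavor: pct_saved(*pair) for flavor, pair in averages.items()}
mean_saving = sum(savings.values()) / len(savings)

for flavor, pct in savings.items():
    print(f"{flavor}: {pct:.2f}% fewer steps without the inline flag")
print(f"average: {mean_saving:.1f}%")
```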

It was interesting to see that the difference in steps between the inline flag and the regular flag was exactly the same for every permutation! The inline flag is definitely busy doing something.

3) Pattern order matters:

The difference between the most and the fewest steps across permutations was roughly 1% for the PCRE2 flavor and 0.77% for the Python flavor.

(email|phone|date) had the fewest steps and (date|phone|email) had the most steps, regardless of regex flavor or type of flag (inline or regular).

So depending on the size of the data, it may or may not make a real difference.
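One way to check whether ordering matters for your own data is to time every permutation locally. This is only a sketch: the repeated sample line stands in for the real 1,000-line text, the repeat count of 5 is arbitrary, and wall-clock timings will vary from machine to machine, so they are not directly comparable to the step counts above.

```python
import itertools
import re
import timeit

# The three sub-patterns from the alternation, keyed by group name.
PARTS = {
    "email": r"(?P<email>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)",
    "phone": r"(?P<phone>\b\d{3}-\d{3}-\d{4}\b)",
    "date": r"(?P<date>\b\d{4}-\d{2}-\d{2}\b)",
}

# Hypothetical stand-in for the 1,000-line sample (an assumption).
text = "Mail bob@example.com or call 555-123-4567 before 2025-02-05.\n" * 1000


def build(order):
    """Compile the alternation with the sub-patterns in the given order."""
    return re.compile("|".join(PARTS[name] for name in order))


timings = {}
match_counts = {}
for order in itertools.permutations(PARTS):
    compiled = build(order)
    match_counts[order] = sum(1 for _ in compiled.finditer(text))
    # Bind the compiled pattern as a default argument so each lambda keeps its own.
    timings[order] = timeit.timeit(lambda c=compiled: c.findall(text), number=5)

# Fastest permutation first.
for order, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print("|".join(order), f"{seconds:.4f}s", match_counts[order], "matches")
```

If the spread between the fastest and slowest ordering is within run-to-run noise, the ordering is probably not worth tuning for that data set.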

4) Pattern matters:

I created this regex to capture simple emails (where extra dots are allowed), phone numbers (xxx-xxx-xxxx), and dates (xxxx-xx-xx). It does not have named capture groups.

For Python, this pattern resulted in 91,566 fewer steps, or 32.5% fewer on average, than the permutations of the alternation (|) pattern.

Use the re.X flag:

# 7:re.X:
pattern =  r'''\b(
(\d\d\d[\d-] [\d-] \d\d-\d\d(?:\d\d)?\b)|(?=\b\w+(?:\.\w+)*@)(\b\w+(?:\.\w+)*@\w+(?:\.\w+))
)\b
''' 
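A short sketch of how pattern 7 could be used from Python with re.X. The sample sentence and the "phone/date" and "email" labels are my own additions for illustration; since the pattern has no named groups, the label is derived from which numbered group took part in the match.

```python
import re

pattern = r'''\b(
(\d\d\d[\d-] [\d-] \d\d-\d\d(?:\d\d)?\b)|(?=\b\w+(?:\.\w+)*@)(\b\w+(?:\.\w+)*@\w+(?:\.\w+))
)\b
'''

# re.X makes the spaces and newlines inside the pattern insignificant.
combined_re = re.compile(pattern, re.X)

# Hypothetical sample text (an assumption, not the original test data).
sample = "Mail bob.smith@example.com or call 555-123-4567 before 2025-02-05."

found = []
for m in combined_re.finditer(sample):
    # Group 2 is the phone/date branch, group 3 is the email branch.
    kind = "phone/date" if m.group(2) is not None else "email"
    found.append((kind, m.group(1)))

print(found)
```

Note that the first branch does not tell phone numbers and dates apart; if that distinction is needed, it has to be made after matching.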

Links to permutations:

NUM | FLAG | PERMUTATION | URL:

1 | re.X | (email|phone|date) | https://regex101.com/r/LCaSTy/2

2 | re.X | (email|date|phone) | https://regex101.com/r/Qwno6l/2

3 | re.X | (phone|date|email) | https://regex101.com/r/vtnLQv/2

4 | re.X | (phone|email|date) | https://regex101.com/r/gWFYzB/2

5 | re.X | (date|phone|email) | https://regex101.com/r/0z9cLt/2

6 | re.X | (date|email|phone) | https://regex101.com/r/lTfvqX/2

7 | re.X | ((phone/date)|email) | https://regex101.com/r/mP8v4Z/4

Reasons:
  • RegEx Blacklisted phrase (1): see the links
  • Long answer (-1):
  • Has code block (-0.5):
  • User mentioned (1): @dawg
  • Low reputation (1):
Posted by: rich neadle