79385650

Date: 2025-01-24 21:20:24
Score: 0.5
Natty:
Report link

I had to do this for a membership application for an Irish organisation which also took subscriptions from non-Irish people.

I quickly stopped being amazed by all of the characters (and other stuff) that people could copy'n'paste into fields for data entry. A good rule of thumb is to sanitise as much as possible before and/or at data entry - but, you will never be able to stop people from doing stupid things!

Also, one can have totally unexpected (for me anyway) legitimate symbols, e.g. Dutch names can have an apostrophe as a surname's first character and the first character doesn't have to be capitalised (it's even wrong - see 1st example in link - same in Irish). Some African languages use the ! (exclamation mark) as a click sound.

Now, the Irish language (and Hiberno-English) has the potential for all sorts of traps - for example the surname Ua hUigín is valid - note lower case h directly in front of upper case U - which is correct! As well as that, we obviously have apostrophes all over the place - the O' in O'Brien being a case in point - O' is a modern version of Ua meaning "descendant of". We also have Mc & Mac meaning "son of". Plus, we have accents/diacritics - see here.

To add to all this fun and games is that fact that some people put a space between O' and Brien (uncommon) and others do not - the same goes for Mc and Mac in front of their suffixes (more common) - as well as Fitz from our Norman friends (French "fils" = son), very common in the south west - space or no space?

So, we devised a scheme whereby we used triggers to populate a field called surname_search - these were the days before `computed/calculated/virtual/derived/generated fields, but we can use them today - see below.

For surname_search we:

You may think that this is overkill, but it's not - if people are allowed to put in rubbish, they will!

Your best bet is possibly to do something like the following (all of the code below is available on the fiddle here):

Your original table might look something like this:

CREATE TABLE t  -- could be user or person or...
(

  id SMALLINT NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,

  -- title TEXT NULL,
  -- fname TEXT NOT NULL

  surname  TEXT NOT NULL,

  -- ...
  -- ... other fields
  -- ...

  symbol TEXT NOT NULL  -- Not part of original table - helps with testing, see below and fiddles   

);

The first thing we could do is this.

Install the CITEXT (case insensitive) and UNACCENT (remove accents) extensions. These two are part of the contrib modules - plus, both of these are "trusted":

Trusted extensions can be run by any user who has CREATE privilege on the current database. Extensions that are trusted are identified as such in the sections that follow. Generally, trusted extensions are ones that cannot provide access to outside-the-database functionality.

--
-- Can workaround this using the TRANSLATE() function, but this is messy!
--
-- see https://stackoverflow.com/q/13596638/470530
--

CREATE EXTENSION "unaccent";

and

--
-- Can workaround this using the UPPER() function, not too bad!
--

CREATE EXTENSION "citext";

If you cannot install extensions, see the workarounds for other ideas.

Caveats for UNACCENT:

This (see snippet in fiddle):

--
-- Works for the last three variants of Æ, but not the first two!
--

SELECT
  UNACCENT('Ǣ/ǣ, Ǽ/ǽ, Æ̀/æ̀, Æ̂/æ̂ and Æ̃/æ̃'),  -- failing chars not found in unaccent.rules
  UNACCENT('ḃ, ċ, ḋ, ḟ, ġ, ṁ, ṗ, ṡ, ṫ');   -- all found in unaccent.rules

result:

unaccent                            unaccent
Ǣ/ǣ, Ǽ/ǽ, AE/ae, AE/ae and AE/ae    b, c, d, f, g, m, p, s, t

only partially works - it also partially fails on my laptop Windows 2019 server.

This is probably due to the dictionary - see here.

A more complete example, which is directly useful for most European languages, can be found in unaccent.rules, which is installed in $SHAREDIR/tsearch_data/ when the unaccent module is installed. This rules file translates characters with accents to the same characters without accents, and it also expands ligatures into the equivalent series of simple characters (for example, Æ to AE).

The interested reader is invited to investigate further but it's beyond the scope of this answer. Personally, I'd never even seen the failing characters before tackling this answer - the ligature yes, but accented, no!

We now change the surname's column type to CITEXT:

ALTER TABLE t
  ALTER COLUMN surname
    SET DATA TYPE CITEXT;

We now ALTER the table again so as not to allow any digits (I've yet to see a surname with digits) in the surname field and also disallow any non-valid punctuation characters - see here and here for the use of $$ to escape quotes of whatever sort and increase legibility.

I used a regex (regular expression - see here and here) - if you don't know regexes, go here - quick start here (helpful).

--
-- First, make sure that puncuation symbols that are not valid for names 
-- and digits aren't allowed at all - so, in this case we disallow certain characters. For example, hyphen, space and quotation marks are valid!
--

-- 
-- Note the `\` (backslash) before characters full stop (`.`), open and close square
-- bracket (`[]`), before open and close parentheses (`()`) and backslash itself (`\`)
-- but not `$`, `^` or `?` which are also regex meta-characters - this involved lots of
-- testing!
--

ALTER TABLE t 
  ADD CONSTRAINT no_punct_no_digit_ck
    CHECK (NOT REGEXP_LIKE(surname, $$[_#=0-9@#$.\[\]?<>^/(){}~|\\]$$));

Regex explained:

On running this SQL, you may be told that you already have some names which breach the constraint - you'll have to edit these by hand!

Did a lot of testing (see here) - quite why backslash (\) is required for some regex meta-characters and not others is a mystery to me!

But, as you'll see from the testing snippets, this allows surnames with just one single exclamation mark as a valid surname (same for just a single letter), so we add another criterion to our CHECK (after dropping the original CONSTRAINT - see fiddle).

I know of no surnames that are one character long, but there are Asian surnames (e.g. Li, Ng) which have two characters, so we do the following by modifying our original constraint:

--
-- Now ensure least two letters
--

ALTER TABLE t 
  ADD CONSTRAINT no_punct_no_digit_ck
    CHECK ((NOT REGEXP_LIKE  (sname, $$[_#=0-9@#$.\[\]?<>^/(){}~|\\]$$))   AND
          ((    REGEXP_COUNT (surname, $$[[:alpha:]]$$, 1) >= 2        )));

Regex explained:

Now, we see about our O'Brien string - and variants.

INSERT INTO t (surname, symbol) VALUES  -- see https://typographyforlawyers.com/straight-and-curly-quotes.html

-- Also, see https://stackoverflow.com/questions/12316953/insert-text-with-single-quotes-in-postgresql/12320729#12320729
-- for dollar quoted strings

($$O'Brien$$,            'Straight single quote'), 
($$O"Brien$$,            'Straight double quote'),
($$O‘Brien$$,            'Opening single quote'),
($$O’Brien$$,            'Closing single quote'),
($token$O“Brien$token$,  'Opening  double quote'),  -- note - $token$
($$O”Brien$$,            'Closing  double quote'),
($$O”Br-ie-n$$,          'Hyphens'),  
($$O”Br   i e  n$$,      'Arbitrary different spaces'),
($$!O!Br!i!en!!$$,       'Exclamation mark possible!'),

--
-- Sample accents - impossible to test exhaustively, see above
--
  
($$O'Bríen$$,           'Acute accent on i'         ),
($$O'Brièn$$,           'Grave accent on i'         ),
($$O'Briuç$$,           'Cedilla on c'              );

all INSERTs work smoothly - we check (SELECT * FROM t;) - see fiddle.

Now, if we want to search for anybody called any form of ... O'Brien..., a naïve first take might be something like this:

--
-- Tilde (~) is the case sensitive regex match operator - matches mixed-case
-- here because the surname field is now CITEXT!
--

SELECT *
FROM t
WHERE surname ~ '^.*BrIe.*$';

Regex explained:

But! - result:

id  sname          symbol
2   O'Brien   Straight single quote
3   O"Brien   Straight double quote
4   O‘Brien   Opening single quote
5   O’Brien   Closing single quote
6   O“Brien   Opening  double quote
7   O”Brien   Closing  double quote

So what we can do is to create a search field using PostgreSQL's GENERATED columns as follows:

ALTER TABLE t
ADD COLUMN
  surname_search CITEXT   -- also CITEXT!
    GENERATED ALWAYS AS 
    (
      REGEXP_REPLACE(UNACCENT(surname), '[^[:alpha:]]', '', 'g')  -- 'g' for global!
    ) STORED;  -- has to be STORED - no VIRTUAL, i.e. calculated on the fly, columns yet!

Problem! the result of running this query:

ERROR:  generation expression is not immutable

Again, this is beyond the scope of my efforts at an answer here, (but you can look here and here if you're interested), however (check the fiddle), there are functions which can circumvent this problem.

After running the function definitions, rerun the ALTER TABLE ADD COLUMN... command from above - works this time!

Regex explained:

So now, your app/interface/agents only have to search against a string like "%BRiEn%" or "%bri%" - i.e. any combination of upper and lower case letters that match a substring of the user's surname. The % (percentage) character is one of the LIKE operator's wildcards - it matches 0 or many characters. The other wildcard is _ (underscore) which matches one and only one character.

p.s. +1 for a question that a) made me think and b) brought back some memories!

Reasons:
  • Blacklisted phrase (1): stackoverflow
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (0.5):
Posted by: Vérace