I had to do this for a membership application for an Irish organisation which also took subscriptions from non-Irish people.
I quickly stopped being amazed by all of the characters (and other stuff) that people could copy'n'paste into fields for data entry. A good rule of thumb is to sanitise as much as possible before and/or at data entry - but, you will never be able to stop people from doing stupid things!
Also, one can have totally unexpected (for me anyway) legitimate symbols, e.g. Dutch names can have an apostrophe as a surname's first character and the first character doesn't have to be capitalised (it's even wrong - see 1st example in link - same in Irish). Some African languages use the !
(exclamation mark) as a click sound.
Now, the Irish language (and Hiberno-English) has the potential for all sorts of traps - for example the surname Ua hUigín
is valid - note lower case h
directly in front of upper case U
- which is correct! As well as that, we obviously have apostrophes all over the place - the O'
in O'Brien
being a case in point - O'
is a modern version of Ua
meaning "descendant of". We also have Mc
& Mac
meaning "son of". Plus, we have accents/diacritics - see here.
To add to all this fun and games is that fact that some people put a space between O'
and Brien
(uncommon) and others do not - the same goes for Mc
and Mac
in front of their suffixes (more common) - as well as Fitz
from our Norman friends (French "fils" = son), very common in the south west - space or no space?
So, we devised a scheme whereby we used triggers to populate a field called surname_search
- these were the days before `computed/calculated/virtual/derived/generated fields, but we can use them today - see below.
For surname_search
we:
You may think that this is overkill, but it's not - if people are allowed to put in rubbish, they will!
Your best bet is possibly to do something like the following (all of the code below is available on the fiddle here):
Your original table might look something like this:
CREATE TABLE t -- could be user or person or...
(
id SMALLINT NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
-- title TEXT NULL,
-- fname TEXT NOT NULL
surname TEXT NOT NULL,
-- ...
-- ... other fields
-- ...
symbol TEXT NOT NULL -- Not part of original table - helps with testing, see below and fiddles
);
The first thing we could do is this.
Install the CITEXT
(case insensitive) and UNACCENT
(remove accents) extensions. These two are part of the contrib
modules - plus, both of these are "trusted":
Trusted extensions can be run by any user who has CREATE privilege on the current database. Extensions that are trusted are identified as such in the sections that follow. Generally, trusted extensions are ones that cannot provide access to outside-the-database functionality.
--
-- Can workaround this using the TRANSLATE() function, but this is messy!
--
-- see https://stackoverflow.com/q/13596638/470530
--
CREATE EXTENSION "unaccent";
and
--
-- Can workaround this using the UPPER() function, not too bad!
--
CREATE EXTENSION "citext";
If you cannot install extensions, see the workarounds for other ideas.
Caveats for UNACCENT
:
This (see snippet in fiddle):
--
-- Works for the last three variants of Æ, but not the first two!
--
SELECT
UNACCENT('Ǣ/ǣ, Ǽ/ǽ, Æ̀/æ̀, Æ̂/æ̂ and Æ̃/æ̃'), -- failing chars not found in unaccent.rules
UNACCENT('ḃ, ċ, ḋ, ḟ, ġ, ṁ, ṗ, ṡ, ṫ'); -- all found in unaccent.rules
result:
unaccent unaccent
Ǣ/ǣ, Ǽ/ǽ, AE/ae, AE/ae and AE/ae b, c, d, f, g, m, p, s, t
only partially works - it also partially fails on my laptop Windows 2019 server.
This is probably due to the dictionary - see here.
A more complete example, which is directly useful for most European languages, can be found in unaccent.rules, which is installed in $SHAREDIR/tsearch_data/ when the unaccent module is installed. This rules file translates characters with accents to the same characters without accents, and it also expands ligatures into the equivalent series of simple characters (for example,
Æ
toAE
).
The interested reader is invited to investigate further but it's beyond the scope of this answer. Personally, I'd never even seen the failing characters before tackling this answer - the ligature yes, but accented, no!
We now change the surname
's column type to CITEXT
:
ALTER TABLE t
ALTER COLUMN surname
SET DATA TYPE CITEXT;
We now ALTER
the table again so as not to allow any digits (I've yet to see a surname with digits) in the surname
field and also disallow any non-valid punctuation characters - see here and here for the use of $$
to escape quotes of whatever sort and increase legibility.
I used a regex (regular expression - see here and here) - if you don't know regexes, go here - quick start here (helpful).
--
-- First, make sure that puncuation symbols that are not valid for names
-- and digits aren't allowed at all - so, in this case we disallow certain characters. For example, hyphen, space and quotation marks are valid!
--
--
-- Note the `\` (backslash) before characters full stop (`.`), open and close square
-- bracket (`[]`), before open and close parentheses (`()`) and backslash itself (`\`)
-- but not `$`, `^` or `?` which are also regex meta-characters - this involved lots of
-- testing!
--
ALTER TABLE t
ADD CONSTRAINT no_punct_no_digit_ck
CHECK (NOT REGEXP_LIKE(surname, $$[_#=0-9@#$.\[\]?<>^/(){}~|\\]$$));
Regex explained:
starting and ending square brackets ([
, ]
) - these are "meta-characters, i.e. they have a function within a regex - they specify a "character class" - a group of characters that we wish to identify.
the characters in between the starting and ending square brackets - i.e. _#=0-9@#$.\[\]?<>^/(){}~|\\
- these are the characters that we wish to single out as not being acceptable in a surname
. Note that some of these are themselves regex meta-characters and hence have to be escaped using the backslash (\
) character.
Notice that there are no quote characters included in the character class.
On running this SQL, you may be told that you already have some names which breach the constraint - you'll have to edit these by hand!
Did a lot of testing (see here) - quite why backslash (\
) is required for some regex meta-characters and not others is a mystery to me!
But, as you'll see from the testing snippets, this allows surname
s with just one single exclamation mark as a valid surname
(same for just a single letter), so we add another criterion to our CHECK
(after dropping the original CONSTRAINT
- see fiddle).
I know of no surnames that are one character long, but there are Asian surnames (e.g. Li
, Ng
) which have two characters, so we do the following by modifying our original constraint:
--
-- Now ensure least two letters
--
ALTER TABLE t
ADD CONSTRAINT no_punct_no_digit_ck
CHECK ((NOT REGEXP_LIKE (sname, $$[_#=0-9@#$.\[\]?<>^/(){}~|\\]$$)) AND
(( REGEXP_COUNT (surname, $$[[:alpha:]]$$, 1) >= 2 )));
Regex explained:
[:alpha:]
- a character class which matches any alphabetical characters. So, if the count of alphabetical characters isn't at least 2, then the CHECK
becomes FALSE
and any INSERT
fails - see fiddle for brief test.Now, we see about our O'Brien
string - and variants.
INSERT INTO t (surname, symbol) VALUES -- see https://typographyforlawyers.com/straight-and-curly-quotes.html
-- Also, see https://stackoverflow.com/questions/12316953/insert-text-with-single-quotes-in-postgresql/12320729#12320729
-- for dollar quoted strings
($$O'Brien$$, 'Straight single quote'),
($$O"Brien$$, 'Straight double quote'),
($$O‘Brien$$, 'Opening single quote'),
($$O’Brien$$, 'Closing single quote'),
($token$O“Brien$token$, 'Opening double quote'), -- note - $token$
($$O”Brien$$, 'Closing double quote'),
($$O”Br-ie-n$$, 'Hyphens'),
($$O”Br i e n$$, 'Arbitrary different spaces'),
($$!O!Br!i!en!!$$, 'Exclamation mark possible!'),
--
-- Sample accents - impossible to test exhaustively, see above
--
($$O'Bríen$$, 'Acute accent on i' ),
($$O'Brièn$$, 'Grave accent on i' ),
($$O'Briuç$$, 'Cedilla on c' );
all INSERT
s work smoothly - we check (SELECT * FROM t;
) - see fiddle.
Now, if we want to search for anybody called any form of ... O'Brien..., a naïve first take might be something like this:
--
-- Tilde (~) is the case sensitive regex match operator - matches mixed-case
-- here because the surname field is now CITEXT!
--
SELECT *
FROM t
WHERE surname ~ '^.*BrIe.*$';
Regex explained:
^
) symbol is a marker for the beginning of the search string..
) can mean any character*
(asterisk) - means the previous thing - a dot, any character - 0 or n times - the .*
could be .{0}
, see regex cheat-sheet and fiddleBRIE
is the string literal 'BRIE' - just normal characters to search for.*
(anything)$
) sign is a marker for the end of the search stringBut! - result:
id sname symbol
2 O'Brien Straight single quote
3 O"Brien Straight double quote
4 O‘Brien Opening single quote
5 O’Brien Closing single quote
6 O“Brien Opening double quote
7 O”Brien Closing double quote
We're missing names that contain spaces - which are common enough, and names missing hyphens - again not unheard of, and also any names containing exclamation remarks - rarer, but still valid as discussed above.
We're also missing names that contain the various accents which are common enough in Irish names, not to mention in French, German and many other languages.
There is a further disadvantage to this approach in that your customer-facing agents will have to be expert in regexes - not exactly a common skill amongst non-IT savvy office staff.
So what we can do is to create a search field using PostgreSQL's GENERATED
columns as follows:
ALTER TABLE t
ADD COLUMN
surname_search CITEXT -- also CITEXT!
GENERATED ALWAYS AS
(
REGEXP_REPLACE(UNACCENT(surname), '[^[:alpha:]]', '', 'g') -- 'g' for global!
) STORED; -- has to be STORED - no VIRTUAL, i.e. calculated on the fly, columns yet!
Problem! the result of running this query:
ERROR: generation expression is not immutable
Again, this is beyond the scope of my efforts at an answer here, (but you can look here and here if you're interested), however (check the fiddle), there are functions which can circumvent this problem.
After running the function definitions, rerun the ALTER TABLE ADD COLUMN...
command from above - works this time!
Regex explained:
[]
- first set of square brackets - character class as before^
- Here, the caret character is not the start of the line, rather it is the negation operator - i.e. anything not in the character class for the outer square brackets.[:alpha:]
- anything alphabetic - so this regex replaces anything that's not alphabetic with the empty string! Since this is done only once on data-entry, it's not a big overhead! Works for all alphabets, not just Latin!So now, your app/interface/agents only have to search against a string like "%BRiEn%"
or "%bri%" - i.e. any combination of upper and lower case letters that match a substring of the user's surname
. The %
(percentage) character is one of the LIKE
operator's wildcards - it matches 0 or many characters. The other wildcard is _
(underscore) which matches one and only one character.
p.s. +1 for a question that a) made me think and b) brought back some memories!