79682710

Date: 2025-06-28 04:35:44
Score: 1

This answer builds largely on @nick-bull's answer and https://stackoverflow.com/a/48058708/3809427, but the additional details are extensive enough that I posted them as a new answer.

First, special characters need to be normalized, e.g. "𝓑𝓘𝓖 𝓛𝓞𝓥𝓔 ㌔" -> "BIG LOVE キロ".

text = Normalizer.normalize(text, Normalizer.Form.NFKC);
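
As a minimal sketch of what NFKC does to the example above (the class name NfkcDemo and the helper nfkc are hypothetical, introduced only for illustration), the following checks a few compatibility mappings using java.text.Normalizer:

```java
import java.text.Normalizer;

public class NfkcDemo {
    // Hypothetical helper: applies NFKC compatibility normalization.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Mathematical bold script letters (written as surrogate pairs),
        // a space, then U+3314 SQUARE KIRO.
        String styled = "\uD835\uDCD1\uD835\uDCD8\uD835\uDCD6 \u3314";
        // Script letters fold to ASCII; U+3314 expands to KATAKANA KI + RO.
        System.out.println(nfkc(styled).equals("BIG \u30AD\u30ED"));
        // U+FF01 FULLWIDTH EXCLAMATION MARK folds to ASCII '!'.
        System.out.println(nfkc("\uFF01").equals("!"));
        // U+3000 IDEOGRAPHIC SPACE folds to an ordinary space.
        System.out.println(nfkc("\u3000").equals(" "));
    }
}
```

Plain ASCII text passes through NFKC unchanged, so the call is safe to apply to mixed input.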

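Next, variation selectors need to be removed. The character class below also covers the keycap base characters (#, *, 0-9) and U+20E3 COMBINING ENCLOSING KEYCAP, so it strips those characters everywhere in the text, not only inside keycap sequences.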

text = text.replaceAll("[\uFE00-\uFE0F\\x{E0100}-\\x{E01EF}\u0023\u002A\u0030-\u0039\u20e3]+", "");
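
If ordinary digits, # and * should survive, one possible narrower variant is sketched below. It is not the approach used in this answer: it assumes keycap sequences always have the shape base + optional U+FE0F + U+20E3, and the names SelectorStripDemo, KEYCAP, LEFTOVERS and strip are hypothetical.

```java
import java.util.regex.Pattern;

public class SelectorStripDemo {
    // A whole keycap sequence: base (# * 0-9), optional U+FE0F, then U+20E3.
    static final Pattern KEYCAP =
            Pattern.compile("[#*0-9]\\x{FE0F}?\\x{20E3}");
    // Remaining variation selectors (both blocks) and any stray U+20E3.
    static final Pattern LEFTOVERS =
            Pattern.compile("[\\x{FE00}-\\x{FE0F}\\x{E0100}-\\x{E01EF}\\x{20E3}]+");

    // Hypothetical helper: removes keycap sequences as a unit,
    // then drops leftover selectors, leaving plain digits intact.
    static String strip(String s) {
        String noKeycaps = KEYCAP.matcher(s).replaceAll("");
        return LEFTOVERS.matcher(noKeycaps).replaceAll("");
    }

    public static void main(String[] args) {
        // "42" is kept; the keycap "1" + U+FE0F + U+20E3 and the trailing U+FE0F are removed.
        System.out.println(strip("room 42 1\uFE0F\u20E3 ok!\uFE0F"));
    }
}
```

The trade-off is that a digit followed directly by a stray U+20E3 is treated as a keycap and removed together with it.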

Putting it all together:

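import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.text.Normalizer;

//Needed so that `System.out.println()` prints Unicode correctly. Not directly related to the answer, but required to show the result.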
try {
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}


final String VARIATION_SELECTORS = "["
        //Variation Selectors https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)
        +"\uFE00-\uFE0F"
        //Variation Selectors Supplement https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
        +"\\x{E0100}-\\x{E01EF}"
        //https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)#Variants
        //Basic Latin variants
        +"\u0023\u002A\u0030-\u0039"
        // COMBINING ENCLOSING KEYCAP
        +"\u20e3"
        +"]+";

String example ="\uD835\uDCD1\uD835\uDCD8\uD835\uDCD6 \uD835\uDCDB\uD835\uDCDE\uD835\uDCE5\uD835\uDCD4 ㌔ hello world _# 皆さん、こんにちは!　私はジョンと申します。🔥 !!\uFE0F!!\uFE0F!!\uFE0Fa⃣";
System.out.println(example);

//Main
var text = Normalizer.normalize(example, Normalizer.Form.NFKC)
        //This pattern originates from Nick Bull's answer
        .replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]+", " ")
        .replaceAll(VARIATION_SELECTORS, " ")
         //reduce consecutive spaces to a single space and trim 
        .replaceAll(" {2,}", " ").trim();
System.out.println(text);
// Output:
//   "𝓑𝓘𝓖 𝓛𝓞𝓥𝓔 ㌔ hello world _# 皆さん、こんにちは!　私はジョンと申します。🔥 !!️!!️!!️a⃣"
//   "BIG LOVE キロ hello world _ 皆さん、こんにちは! 私はジョンと申します。 !! !! !! a"
Reasons:
  • Blacklisted phrase (1): stackoverflow
  • Long answer (-1):
  • Has code block (-0.5):
  • User mentioned (1): @nick-bull
  • Low reputation (0.5):
Posted by: Lamron