You can detect and correct duplicate translations in two steps:

1. Find the duplicates with an aggregation.
2. Review and correct them.

Here is a demonstration with dummy data.
---
POST translations_test/_bulk
{ "index": {} }
{ "raw_body_text": "¿Hola, cómo estás?", "translated_body_text": "Hello, how are you?" }
{ "index": {} }
{ "raw_body_text": "Muy bien, ¡gracias!", "translated_body_text": "Hello, how are you?" }
{ "index": {} }
{ "raw_body_text": "¿Cómo te va?", "translated_body_text": "Hello, how are you?" }
{ "index": {} }
{ "raw_body_text": "Estoy bien.", "translated_body_text": "I am fine." }
Then run an aggregation that finds translated texts shared by more than one distinct source text:

GET translations_test/_search
{
  "size": 0,
  "aggs": {
    "translations": {
      "terms": {
        "field": "translated_body_text.keyword",
        "min_doc_count": 2,
        "size": 10000
      },
      "aggs": {
        "unique_sources": {
          "terms": {
            "field": "raw_body_text.keyword",
            "size": 10000
          }
        },
        "having_multiple_sources": {
          "bucket_selector": {
            "buckets_path": {
              "uniqueSourceCount": "unique_sources._bucket_count"
            },
            "script": "params.uniqueSourceCount > 1"
          }
        }
      }
    }
  }
}
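With the sample documents above, the response should contain a single offending bucket: "Hello, how are you?" appears three times with three different source texts, while "I am fine." is filtered out by `min_doc_count: 2`. Abridged, the response looks something like:

```
{
  ...
  "aggregations": {
    "translations": {
      "buckets": [
        {
          "key": "Hello, how are you?",
          "doc_count": 3,
          "unique_sources": {
            "buckets": [
              { "key": "Muy bien, ¡gracias!", "doc_count": 1 },
              { "key": "¿Cómo te va?", "doc_count": 1 },
              { "key": "¿Hola, cómo estás?", "doc_count": 1 }
            ]
          }
        }
      ]
    }
  }
}
```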
Tips:
If the .keyword subfields don't exist in your mapping, you'll first need a mapping update and a reindex.
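Dynamic mapping adds a `.keyword` subfield to text fields by default, but if your index was mapped explicitly without them, a minimal sketch of the fix is to create a new index with the subfields and reindex into it (the `_v2` index name is just a placeholder):

```
PUT translations_test_v2
{
  "mappings": {
    "properties": {
      "raw_body_text": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "translated_body_text": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "translations_test" },
  "dest": { "index": "translations_test_v2" }
}
```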
If you have more than 10k distinct translated_body_text values, use a composite aggregation and paginate with the after parameter, feeding it the after_key returned by each response.
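A sketch of the composite variant. Note that composite sources don't support min_doc_count, and a bucket_selector can't prune composite buckets the same way, so here a cardinality sub-aggregation counts distinct sources per translation and you filter for buckets with a count above 1 on the client side while paginating:

```
GET translations_test/_search
{
  "size": 0,
  "aggs": {
    "translations": {
      "composite": {
        "size": 1000,
        "sources": [
          { "translation": { "terms": { "field": "translated_body_text.keyword" } } }
        ]
      },
      "aggs": {
        "unique_sources": {
          "cardinality": { "field": "raw_body_text.keyword" }
        }
      }
    }
  }
}
```

Each response includes an after_key; copy it into an "after" field inside the composite body of the next request, and repeat until a response returns no buckets.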
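For step 2 (review and correct), once a bad bucket is identified you can pull the offending documents by their shared translation and patch each one individually. A sketch using the sample data, where `<doc_id>` is a placeholder for the _id returned by the search:

```
GET translations_test/_search
{
  "query": {
    "term": { "translated_body_text.keyword": "Hello, how are you?" }
  }
}

POST translations_test/_update/<doc_id>
{
  "doc": { "translated_body_text": "Very well, thank you!" }
}
```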