You can detect and correct duplicate translations in two steps:

1. Find the duplicates with an aggregation.
2. Review and correct them.

Here is a demonstration with dummy data.
---
POST translations_test/_bulk
{ "index": {} }
{ "raw_body_text": "¿Hola, cómo estás?", "translated_body_text": "Hello, how are you?" }
{ "index": {} }
{ "raw_body_text": "Muy bien, ¡gracias!", "translated_body_text": "Hello, how are you?" }
{ "index": {} }
{ "raw_body_text": "¿Cómo te va?", "translated_body_text": "Hello, how are you?" }
{ "index": {} }
{ "raw_body_text": "Estoy bien.", "translated_body_text": "I am fine." }
Then run an aggregation that finds translated texts shared by more than one distinct source text:

GET translations_test/_search
{
  "size": 0,
  "aggs": {
    "translations": {
      "terms": {
        "field": "translated_body_text.keyword",
        "min_doc_count": 2,
        "size": 10000
      },
      "aggs": {
        "unique_sources": {
          "terms": {
            "field": "raw_body_text.keyword",
            "size": 10000
          }
        },
        "having_multiple_sources": {
          "bucket_selector": {
            "buckets_path": {
              "uniqueSourceCount": "unique_sources._bucket_count"
            },
            "script": "params.uniqueSourceCount > 1"
          }
        }
      }
    }
  }
}
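With the sample documents above, the response should contain a single offending bucket: "Hello, how are you?" appears three times with three different source texts, while "I am fine." is filtered out by `min_doc_count: 2`. Abridged, the response looks something like:

```
{
  ...
  "aggregations": {
    "translations": {
      "buckets": [
        {
          "key": "Hello, how are you?",
          "doc_count": 3,
          "unique_sources": {
            "buckets": [
              { "key": "Muy bien, ¡gracias!", "doc_count": 1 },
              { "key": "¿Cómo te va?", "doc_count": 1 },
              { "key": "¿Hola, cómo estás?", "doc_count": 1 }
            ]
          }
        }
      ]
    }
  }
}
```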
Tips:
If the .keyword subfields don't exist in your mapping, you'll first need a mapping update and a reindex.
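Dynamic mapping adds a `.keyword` subfield to text fields by default, but if your index was mapped explicitly without them, a minimal sketch of the fix is to create a new index with the subfields and reindex into it (the `_v2` index name is just a placeholder):

```
PUT translations_test_v2
{
  "mappings": {
    "properties": {
      "raw_body_text": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "translated_body_text": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "translations_test" },
  "dest": { "index": "translations_test_v2" }
}
```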
If you have more than 10k distinct translated_body_text values, use a composite aggregation and paginate with the after parameter, feeding it the after_key returned by each response.
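A sketch of the composite variant. Note that composite sources don't support min_doc_count, and a bucket_selector can't prune composite buckets the same way, so here a cardinality sub-aggregation counts distinct sources per translation and you filter for buckets with a count above 1 on the client side while paginating:

```
GET translations_test/_search
{
  "size": 0,
  "aggs": {
    "translations": {
      "composite": {
        "size": 1000,
        "sources": [
          { "translation": { "terms": { "field": "translated_body_text.keyword" } } }
        ]
      },
      "aggs": {
        "unique_sources": {
          "cardinality": { "field": "raw_body_text.keyword" }
        }
      }
    }
  }
}
```

Each response includes an after_key; copy it into an "after" field inside the composite body of the next request, and repeat until a response returns no buckets.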
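For step 2 (review and correct), once a bad bucket is identified you can pull the offending documents by their shared translation and patch each one individually. A sketch using the sample data, where `<doc_id>` is a placeholder for the _id returned by the search:

```
GET translations_test/_search
{
  "query": {
    "term": { "translated_body_text.keyword": "Hello, how are you?" }
  }
}

POST translations_test/_update/<doc_id>
{
  "doc": { "translated_body_text": "Very well, thank you!" }
}
```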