79541684

Date: 2025-03-28 14:29:29
Score: 0.5

Do the duplicated genes have to be strictly consecutive (i.e. adjacent)? If not:

You can get the duplicated genes, then, for each of them, select all the rows that match it and loop over those rows to add a numbered suffix:


import pandas as pd

df_genes_data = {"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]}
df_genes = pd.DataFrame.from_dict(df_genes_data)
print(df_genes.to_string())

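# Each duplicated gene once; without .unique(), a gene occurring three times is listed twice
duplicated_genes = df_genes.loc[df_genes["gene_id"].duplicated(), "gene_id"].unique()
for gene in duplicated_genes:
    # All rows that carry this gene_id, in their original order
    df_gene = df_genes[df_genes["gene_id"] == gene]
    # Number the matching rows from 1 and write the suffixed name back
    for i, idx in enumerate(df_gene.index, start=1):
        df_genes.loc[idx, "gene_id"] = f"{gene}_TE{i}"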

print(df_genes)

Output:

  gene_id
0      g0
1  g1_TE1
2  g1_TE2
3      g2
4      g3
5  g4_TE1
6  g4_TE2
7  g4_TE3
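
For larger frames, the same result can probably be had without an explicit loop. This is only a sketch of one alternative, not a replacement for the approach above: it marks every copy of a repeated gene with duplicated(keep=False) and numbers the rows inside each gene with groupby(...).cumcount(). The names is_dup and rank are introduced here for illustration.

```python
import pandas as pd

df_genes = pd.DataFrame({"gene_id": ["g0", "g1", "g1", "g2", "g3", "g4", "g4", "g4"]})

# True for every row whose gene_id occurs more than once (keep=False marks all copies)
is_dup = df_genes["gene_id"].duplicated(keep=False)

# 1-based position of each row within its gene_id group
rank = df_genes.groupby("gene_id").cumcount() + 1

# Append the suffix only to the rows of repeated genes
df_genes.loc[is_dup, "gene_id"] = (
    df_genes.loc[is_dup, "gene_id"] + "_TE" + rank[is_dup].astype(str)
)
print(df_genes)
```

Like the loop, this treats all rows with the same gene_id as one group, wherever they sit in the frame.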

If they have to be strictly adjacent, the answer would change.
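
A rough sketch of how the adjacent case might look, assuming "strictly adjacent" means that only runs of identical gene_id values in consecutive rows get a suffix and that each run is numbered on its own. The input below is made up to show the difference (the lone "g1" in row 3 is left alone), and run_id, run_len and rank are illustrative names.

```python
import pandas as pd

# Hypothetical input: "g1" shows up again in row 3, but not next to the first two
df_genes = pd.DataFrame({"gene_id": ["g1", "g1", "g2", "g1", "g4", "g4", "g4"]})
ids = df_genes["gene_id"]

# A new run starts wherever the value differs from the previous row
run_id = (ids != ids.shift()).cumsum()

runs = ids.groupby(run_id)
run_len = runs.transform("size")  # length of the run each row belongs to
rank = runs.cumcount() + 1        # 1-based position inside that run

# Only rows that are part of a run of two or more get a suffix
in_run = run_len > 1
df_genes.loc[in_run, "gene_id"] = ids[in_run] + "_TE" + rank[in_run].astype(str)
print(df_genes)
```

Whether a gene that forms several separate runs should restart its numbering in each run, as it does here, depends on what the question actually needs.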

Reasons:
  • Long answer (-0.5):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (1):
Posted by: rn998