The issue in your code arises because `df1$gene` contains concatenated gene names (e.g., `"TMEM201;PIK3CD"`), while `df2$gene` has individual gene names. The `%in%` operator checks for exact matches, so `"PIK3CD" %in% "TMEM201;PIK3CD"` returns `FALSE`.
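
As a quick illustration of the difference, here is a minimal sketch in base R. The variable names `exact_match` and `substring_match` are made up for this example and are not part of your code:

```r
# Exact comparison of whole strings: "PIK3CD" is not equal to "TMEM201;PIK3CD".
exact_match <- "PIK3CD" %in% "TMEM201;PIK3CD"

# Literal substring search (fixed = TRUE disables regex interpretation).
substring_match <- grepl("PIK3CD", "TMEM201;PIK3CD", fixed = TRUE)

print(exact_match)      # FALSE
print(substring_match)  # TRUE
```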
To fix this, you need to check for partial matches using `stringr::str_detect()`. Here's a solution using `sapply()` and `str_detect()` from the `stringr` package:
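**Solution**

    library(stringr)

    df1 <- data.frame(gene = c('TMEM201;PIK3CD','BRCA1','MECP2','TMEM201', 'HDAC4','TMEM201'))
    df2 <- data.frame(gene = c('PIK3CD','GRIN2B','BRCA2'))

    # Keep rows of df1 whose gene string contains at least one gene from df2.
    # drop = FALSE keeps the result a data frame even though df1 has a single column.
    df1_common_df2 <- df1[sapply(df1$gene, function(x) any(str_detect(x, df2$gene))), , drop = FALSE]
    print(df1_common_df2)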
--------------
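- `str_detect(x, df2$gene)`: returns one logical value per gene in `df2$gene`, which is `TRUE` where that gene occurs as a substring of `x` (a single entry of `df1$gene`).
- `sapply(..., any(...))`: applies this check to every entry of `df1$gene`, so a row is included in `df1_common_df2` if at least one gene matches.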
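
One caveat: substring matching can produce false positives when one gene name is contained in another (a pattern `TMEM20` would also match `TMEM201`), and `str_detect()` interprets its patterns as regular expressions by default. If that matters for your data, an alternative is to split each entry on `;` and compare the pieces exactly with `%in%`. The following is only a sketch in base R, assuming `;` is always the separator and there is no surrounding whitespace; the names `parts`, `keep`, and `df1_exact` are made up for this example:

```r
df1 <- data.frame(gene = c("TMEM201;PIK3CD", "BRCA1", "MECP2", "TMEM201", "HDAC4", "TMEM201"))
df2 <- data.frame(gene = c("PIK3CD", "GRIN2B", "BRCA2"))

# Split each entry of df1$gene on ";" into its individual gene names.
# as.character() guards against factor columns (the default before R 4.0).
parts <- strsplit(as.character(df1$gene), ";", fixed = TRUE)

# Keep a row if at least one of its gene names exactly matches a gene in df2.
keep <- vapply(parts, function(p) any(p %in% as.character(df2$gene)), logical(1))

# drop = FALSE keeps the result a data frame even with a single column.
df1_exact <- df1[keep, , drop = FALSE]
print(df1_exact)
```

With the example data this keeps only the first row, `TMEM201;PIK3CD`, because `PIK3CD` is the only gene the two data frames share.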