Extracting reliable data from an image, or even just the text in it, is difficult, as @Wimpel already said. On top of that, how should the code know which kind of chart the figure represents? There are digitization tools for scatter or point-based charts, such as digitize, but in general it's better to mine the underlying data directly.
Still, I built this code for your specific example.
library(rvest)      # read_html(), html_elements(), html_attr()
library(dplyr)      # provides the %>% pipe
library(magick)     # image_read(), image_resize(), image_convert(), image_write()
library(tesseract)  # ocr()
# Read the webpage and list the infographic image URLs it contains
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")
image_url <- html_url %>% html_elements("img") %>% html_attr("src")
graphics <- image_url[grepl("Infographic", image_url)]  # for reference only; the download below uses the direct CDN URL
# Download the image (the file is a JPEG, so keep that extension)
download.file("https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg", destfile = "chart_image.jpeg", mode = "wb")
# Load and preprocess the image: resize and convert to grayscale
img <- image_read("chart_image.jpeg") %>%
  image_resize("800x800") %>%
  image_convert(colorspace = "gray")
# Save the processed image and apply OCR
image_write(img, "processed_image.png")
text <- tesseract::ocr("processed_image.png")
text_to_asylum_df <- function(text) {
  # Split text into lines
  lines <- strsplit(text, "\n")[[1]]
  # Keep only lines that contain at least one digit
  data_lines <- lines[grepl("[0-9]", lines)]
  # Extract country and number using regex
  asylum_data <- lapply(data_lines, function(line) {
    # Extract country (letters and spaces at start of line)
    country <- gsub("^([A-Za-z ]+).*$", "\\1", line)
    country <- trimws(country)
    # Extract number (digits, possibly with comma or period)
    number <- gsub("[^0-9,.]", "", line)
    number <- gsub(",", "", number)
    number <- gsub("\\.", "", number)
    number <- as.numeric(number)
    return(c(country = country, granted = number))
  })
  # Convert to dataframe
  df <- as.data.frame(do.call(rbind, asylum_data))
  # Convert granted column to numeric
  df$granted <- as.numeric(as.character(df$granted))
  # Add year as attribute
  attr(df, "year") <- 2022
  return(df)
}
# Create the dataframe
asylum_df <- text_to_asylum_df(text)
# View the result
print(asylum_df)
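One caveat about the number extraction in `text_to_asylum_df`: it glues together every digit found on a line, so a line with two numbers yields one wrong value. A minimal sketch of a stricter variant that takes only the last numeric token is shown here. The helper name `extract_last_number` and the sample lines are my own illustration, not real OCR output, and like the original code it assumes commas and periods are thousands separators.

```r
# Hypothetical helper (not part of the code above): return the last number on a line.
extract_last_number <- function(line) {
  # Find every run that starts with a digit and continues with digits, commas, or periods
  tokens <- regmatches(line, gregexpr("[0-9][0-9,.]*", line))[[1]]
  if (length(tokens) == 0) return(NA_real_)
  last <- tokens[length(tokens)]
  # Assumption: "," and "." are thousands separators, as in text_to_asylum_df()
  as.numeric(gsub("[,.]", "", last))
}

# Illustrative inputs only
extract_last_number("El Salvador 2,639")
extract_last_number("Top 5 origins 1,493")
extract_last_number("no digits here")
```

This does not repair digits that tesseract misreads; it only avoids merging unrelated numbers from the same line.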
In the output below, China and Venezuela are missing entirely: tesseract does not recognize them at all.
Output:
> print(asylum_df)
           country granted
1  asylum in the U    2022
2 El Salvador S TS    2639
3        Guatemala    2329
4            india   22203
5         Honduras    1829
6      Afghanistan    1493
7           turkey    1228
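The result still needs manual cleanup: the first row is the chart title, "El Salvador" carries stray characters, and the capitalization is inconsistent. A rough sketch of one way to tidy it is to match each OCR label against a list of expected country names. The list `known_countries` and the helper `match_country` are assumptions of mine; here the list is simply taken from the names mentioned in this answer.

```r
# OCR result as printed above, rebuilt by hand so the sketch is self-contained
ocr_df <- data.frame(
  country = c("asylum in the U", "El Salvador S TS", "Guatemala", "india",
              "Honduras", "Afghanistan", "turkey"),
  granted = c(2022, 2639, 2329, 22203, 1829, 1493, 1228),
  stringsAsFactors = FALSE
)

# Assumption: the set of countries we expect to find in the chart
known_countries <- c("China", "Venezuela", "El Salvador", "Guatemala",
                     "India", "Honduras", "Afghanistan", "Turkey")

# Hypothetical helper: return the known name contained in the label
# (case-insensitive), or NA if there is no unique match
match_country <- function(label, known) {
  found <- vapply(known, function(k) grepl(tolower(k), tolower(label), fixed = TRUE),
                  logical(1))
  hits <- known[found]
  if (length(hits) == 1) hits else NA_character_
}

ocr_df$country_clean <- vapply(ocr_df$country, match_country, character(1),
                               known = known_countries, USE.NAMES = FALSE)

# Drop rows that match no known country (here: the title row)
clean_df <- ocr_df[!is.na(ocr_df$country_clean), c("country_clean", "granted")]
print(clean_df)
```

This only fixes the labels. The numbers still have to be checked against the chart by eye; the value for India, for instance, looks suspiciously large next to its neighbours, and the countries tesseract missed cannot be recovered this way.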