79300008

Date: 2024-12-21 19:52:06
Score: 0.5

As @Wimpel already said, extracting reliable data or text from images is very difficult. In addition, how should the code know which kind of chart the figure represents? Sure, there are digitization tools for scatter or point-based charts, such as digitize. But in general, it's better to mine the underlying data directly. Still, I built this code for your specific example.

library(tesseract)  # OCR engine
library(rvest)      # HTML scraping
library(dplyr)      # provides the %>% pipe
library(magick)     # image preprocessing
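# Read the webpage
html_url <- read_html("https://www.statista.com/chart/25619/asylum-grants-in-the-us-by-nationality/")

# Collect the src attribute of every <img> element on the page
image_url <- html_url %>% html_elements("img") %>% html_attr("src")

# Keep only the URLs that point to the infographic (useful for locating the chart image)
graphics <- image_url[grepl("Infographic", image_url)]

# Download the image; the URL is hardcoded here rather than taken from `graphics`
download.file("https://cdn.statcdn.com/Infographic/images/normal/25619.jpeg", destfile = "chart_image.png", mode = "wb")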

# Load and preprocess image
img <- image_read("chart_image.png") %>%
  image_resize("800x800") %>%
  image_convert(colorspace = "gray")

# Save processed image and apply OCR
image_write(img, "processed_image.png")
text <- tesseract::ocr("processed_image.png")

text_to_asylum_df <- function(text) {
  # Split text into lines
  lines <- strsplit(text, "\n")[[1]]
  
  # Keep only lines that contain a digit (drops empty lines; a title line containing the year still gets through)
  data_lines <- lines[grepl("[0-9]", lines)]
  
  # Extract country and number using regex
  asylum_data <- lapply(data_lines, function(line) {
    # Extract country (leading run of letters and spaces)
    country <- gsub("^([A-Za-z ]+).*$", "\\1", line)
    country <- trimws(country)
    
    # Extract number (digits, possibly with comma or period)
    number <- gsub("[^0-9,.]", "", line)
    number <- gsub(",", "", number)
    number <- gsub("\\.", "", number)
    number <- as.numeric(number)
    
    return(c(country = country, granted = number))
  })
  
  # Convert to dataframe
  df <- as.data.frame(do.call(rbind, asylum_data))
  
  # Convert granted column to numeric
  df$granted <- as.numeric(as.character(df$granted))
  
  # Add year as attribute
  attr(df, "year") <- 2022
  
  return(df)
}

# Create the dataframe
asylum_df <- text_to_asylum_df(text)

# View the result
print(asylum_df)

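As you can see in the output, China and Venezuela are not even recognized by tesseract, and the first row is just the chart title, which slips through because it contains the year.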

Output:

> print(asylum_df)
           country granted
1  asylum in the U    2022
2 El Salvador S TS    2639
3        Guatemala    2329
4            india   22203
5         Honduras    1829
6      Afghanistan    1493
7           turkey    1228
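
If you know in advance which countries the chart should contain, you can clean the labels by matching them against that list. The following is only a minimal sketch: `known_countries` and `match_country()` are names I made up for illustration, the list itself is an assumption, and the data frame is just the output above typed in by hand so the snippet runs on its own.

```r
# Rows copied by hand from the output above
asylum_df <- data.frame(
  country = c("asylum in the U", "El Salvador S TS", "Guatemala", "india",
              "Honduras", "Afghanistan", "turkey"),
  granted = c(2022, 2639, 2329, 22203, 1829, 1493, 1228),
  stringsAsFactors = FALSE
)

# Assumption: the expected country names are known beforehand
known_countries <- c("China", "Venezuela", "El Salvador", "Guatemala",
                     "India", "Honduras", "Afghanistan", "Turkey")

# Hypothetical helper: return the first known country contained in an
# OCR label (case-insensitive), or NA if none is found
match_country <- function(label, candidates) {
  found <- vapply(candidates, function(cand) {
    grepl(cand, label, ignore.case = TRUE)
  }, logical(1))
  hit <- candidates[found]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Replace each noisy label with the matching known country name
asylum_df$country <- vapply(asylum_df$country, match_country, character(1),
                            candidates = known_countries, USE.NAMES = FALSE)

# Drop rows that matched no country (here, the chart title line)
clean_df <- asylum_df[!is.na(asylum_df$country), ]
rownames(clean_df) <- NULL

print(clean_df)
```

This only repairs the labels and removes the title row. It cannot bring back the countries tesseract missed, and it does not check whether the numbers themselves were read correctly.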
Reasons:
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • User mentioned (1): @Wimpel
  • Low reputation (0.5):
Posted by: G-Man