A few things I ended up doing:

  1. I removed my custom Spark config. Out of ignorance I wasn't sure whether any of my settings were counterproductive, so I took the advice from @suvayu and decided to attack the other side of the problem.

  2. The real fix was making the data smaller. I knew that strings are larger than ints, but I had not realized how much recoding them would cut the size of the dataset.

This is an example of the transformation I used on one table to shrink it significantly.

from pyspark.sql.functions import col, explode, lit, when

demographics_df = (
    spark.table("demographics")
    .withColumn("race", explode(col("races")))  # Explode races array
    .withColumn("ethnicity", explode(col("ethnicities")))  # Explode ethnicities array
    .withColumn("race_primary_display", col("race.standard.primaryDisplay"))  # Extract race
    .withColumn("ethnicity_primary_display", col("ethnicity.standard.primaryDisplay"))  # Extract ethnicity
    .withColumn(
        "gender_recoded",
        when(col("gender.standard.primaryDisplay") == "Male", lit(1))
        .when(col("gender.standard.primaryDisplay") == "Female", lit(2))
        .when(col("gender.standard.primaryDisplay") == "Transgender identity", lit(3))
        .otherwise(lit(None)) 
    )
    .withColumn(
        "race_recoded",
        when(col("race_primary_display").isin(["African", "African American", "Liberian"]), lit(1))
        .when(col("race_primary_display").rlike("(?i)(Cherokee|Mohawk|Algonquian|American Indian)"), lit(2))
        .when(col("race_primary_display").rlike("(?i)(Chinese|Vietnamese|Thai|Japanese|Taiwanese|Filipino)"), lit(3))
        .when(col("race_primary_display").rlike("(?i)(Chamorro|Fijian|Kiribati|Marshallese|Palauan|Samoan|Tongan)"), lit(4))
        .when(col("race_primary_display").isin(["Caucasian", "European", "White Irish", "Polish", "Scottish"]), lit(5))
        .when(col("race_primary_display").rlike("(?i)(Arab|Middle Eastern|Iraqi|Afghanistani)"), lit(6))
        .otherwise(lit(None)) 
    )
    .withColumn(
        "ethnicity_recoded",
        when(col("ethnicity_primary_display").rlike("(?i)(Hispanic|Mexican|Puerto Rican|Cuban|Dominican)"), lit(1))
        .when(col("ethnicity_primary_display").isin(["Not Hispanic or Latino"]), lit(2))
        .otherwise(lit(None)) 
    )
)
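One thing worth making explicit: the recoding only pays off if you drop the original string columns afterwards, and casting the codes down to a byte shrinks things further. A minimal sketch of that step (person_id is a hypothetical key column; keep whatever identifiers your table actually has):

from pyspark.sql.functions import col

demographics_small = demographics_df.select(
    col("person_id"),                     # hypothetical key column
    col("gender_recoded").cast("byte"),   # small codes fit in one byte
    col("race_recoded").cast("byte"),
    col("ethnicity_recoded").cast("byte"),
)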

The last tip is to repartition so that each partition is < 500 MB. It took some guesswork, but 200 partitions was correct for me:

df_indexed = df_indexed.repartition(200)
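To take some of the guesswork out, you can sanity-check how the rows land across partitions; row counts per partition are a rough proxy for partition size:

rows_per_partition = (
    df_indexed.rdd
    .mapPartitions(lambda rows: [sum(1 for _ in rows)])  # count rows in each partition
    .collect()
)
print(f"{len(rows_per_partition)} partitions, "
      f"largest holds {max(rows_per_partition)} rows")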

So while this may not be the solution people are looking for, at least with this approach you cannot exceed the memory you physically have on your machine. I guess the next question is:

Is there a package that allows you to do statistics on data that is larger than your system memory? Not by chunking the dataset and averaging the results, but by iteratively calculating statistics such as the variance, carrying forward only the necessary running values instead of requiring the whole dataset.
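For concreteness, the kind of calculation I mean is what Welford's online algorithm does for variance: a single pass that carries forward only the count, the running mean, and a running sum of squared deviations. A minimal sketch in plain Python:

def welford(values):
    # Carry forward only count, mean, and M2 (sum of squared deviations).
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Sample variance; undefined for fewer than two values.
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return mean, variance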

Posted by: jcruzer