I tried to install the check_excel_errors package the same way you did:
%pip install check_excel_errors
%pip show check_excel_errors
But I am getting the following error:
ERROR: Could not find a version that satisfies the requirement check_excel_errors (from versions: none)
ERROR: No matching distribution found for check_excel_errors
I also tried listing check_excel_errors in a requirements.txt file to upload the package to the Spark pool, but that did not work either.
For the Python module, you mentioned an __init__.py file in which ValidationResult, get_numeric_validation_query, get_missing_values_query, get_duplicates_query, and PRODUCT_CODE_VALIDATION are the elements that can be imported from the package.
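One way to confirm what the notebook's interpreter can actually see is sketched below. It is only a minimal check, assuming the importable module name matches the package name (check_excel_errors); the helper name missing_exports is hypothetical and not part of any library. It uses importlib from the standard library to test whether the module is installed and, if it is, which of the names listed above it does not expose.

```python
import importlib
import importlib.util

# Names the module's __init__.py is said to export (taken from the post above).
EXPECTED_NAMES = [
    "ValidationResult",
    "get_numeric_validation_query",
    "get_missing_values_query",
    "get_duplicates_query",
    "PRODUCT_CODE_VALIDATION",
]


def missing_exports(module_name, names):
    """Return None if the module is not installed, else the names it lacks."""
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        return None
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


result = missing_exports("check_excel_errors", EXPECTED_NAMES)
if result is None:
    print("check_excel_errors is not installed in this environment")
elif result:
    print("installed, but missing:", result)
else:
    print("all expected names are importable")
```

If this reports that the module is not installed, the problem is with how the package reaches the pool rather than with the import statements.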
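As a workaround, I tried using PySpark instead:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# df is assumed to be an existing DataFrame with columns id, price, product_code, quantity
# Numeric validation: prices must be > 0, so flag rows where price <= 0
numeric_validation = df.filter(F.col("price") <= 0)
# Missing values: rows where price or quantity is null
missing_values = df.filter(F.col("price").isNull() | F.col("quantity").isNull())
# Duplicates: product codes that appear more than once
duplicates = df.groupBy("product_code").count().filter(F.col("count") > 1)
# Product code validation: codes that do not match "P" followed by three digits
product_code_validation = df.filter(~F.col("product_code").rlike("^P\\d{3}$"))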
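To sanity-check the expected results without a Spark session, the same four rules can be mirrored in plain Python. This is only a sketch: the sample rows are hypothetical, reconstructed to be consistent with the results shown in this post (the values for id 1 and id 5 never appear in the output and are assumed), and None stands in for Spark's null. As in Spark, a comparison against a null value does not match, so null prices are not flagged by the numeric check.

```python
import re
from collections import Counter

# Hypothetical sample rows: ids 1 and 5 are assumed, the rest follow the output tables.
rows = [
    {"id": 1, "price": 10.0, "product_code": "P001", "quantity": 5},
    {"id": 2, "price": None, "product_code": "P002", "quantity": 3},
    {"id": 3, "price": 15.0, "product_code": "P003", "quantity": None},
    {"id": 4, "price": -10.0, "product_code": "INVALID", "quantity": 7},
    {"id": 5, "price": 20.0, "product_code": "P001", "quantity": 2},
]

# Same pattern as the rlike() call: "P" followed by exactly three digits.
CODE_PATTERN = re.compile(r"^P\d{3}$")

# Numeric validation: non-null prices that are <= 0.
numeric_validation = [
    r for r in rows if r["price"] is not None and r["price"] <= 0
]

# Missing values: price or quantity is None.
missing_values = [
    r for r in rows if r["price"] is None or r["quantity"] is None
]

# Duplicates: product codes that occur more than once.
code_counts = Counter(r["product_code"] for r in rows)
duplicates = {code: n for code, n in code_counts.items() if n > 1}

# Product code validation: non-null codes that do not match the pattern.
product_code_validation = [
    r for r in rows
    if r["product_code"] is not None
    and not CODE_PATTERN.search(r["product_code"])
]

print("invalid prices:", [r["id"] for r in numeric_validation])
print("missing values:", [r["id"] for r in missing_values])
print("duplicates:", duplicates)
print("invalid codes:", [r["id"] for r in product_code_validation])
```

With these assumed rows, the flagged ids and the duplicate count line up with the tables in the results.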
Results:
Numeric Validation (Invalid Prices):
+---+-----+------------+--------+
| id|price|product_code|quantity|
+---+-----+------------+--------+
|  4|-10.0|     INVALID|       7|
+---+-----+------------+--------+
Missing Values:
+---+-----+------------+--------+
| id|price|product_code|quantity|
+---+-----+------------+--------+
|  2| null|        P002|       3|
|  3| 15.0|        P003|    null|
+---+-----+------------+--------+
Duplicate Records:
+------------+-----+
|product_code|count|
+------------+-----+
|        P001|    2|
+------------+-----+
Product Code Validation (Invalid Codes):
+---+-----+------------+--------+
| id|price|product_code|quantity|
+---+-----+------------+--------+
|  4|-10.0|     INVALID|       7|
+---+-----+------------+--------+