79672140

Date: 2025-06-19 13:39:07
Score: 1.5
  1. The Master in a standalone cluster is a coordinator process, so I don't think that makes much sense. What goal do you want to achieve?

    How do you submit your apps to Spark from Airflow? With SparkSubmitOperator?

    This attempt gives me an error saying that I need to have hadoop aws jdk. I assume this means that Airflow is acting as the driver

    Yes, you're correct: when you submit from Airflow, it launches the driver process on that machine, and you'll see the driver logs in the "Logs" tab of Airflow. In any case, you need at least the Spark binaries/jars on the Airflow machine (they are installed automatically with pip install pyspark==3.5.4).

As for the error about hadoop aws jdk: since MinIO (S3) is a Hadoop-compatible file system, Spark will use this API to connect to S3.

So do something like this:

pip install pyspark=={version}
pip install apache-airflow-providers-apache-spark=={version}
pip install apache-airflow[s3]=={version}
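
If you then still hit missing S3A classes at submit time, one more option (just a sketch; the hadoop-aws version is an assumption and should match the Hadoop version bundled with your pyspark) is to let Spark pull the connector from Maven through spark.jars.packages, merged into the spark_config shown further below:

extra_s3a_config = {
    # Resolved from Maven Central at submit time; pick the hadoop-aws version
    # that matches your Hadoop/pyspark build (3.3.4 is an assumption here).
    "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.4",
    # Usually required for MinIO, which serves buckets under the endpoint path.
    "spark.hadoop.fs.s3a.path.style.access": "true",
}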

When I change the deploy mode to cluster, I get an error saying "Cluster deploy mode is currently not supported for python applications on standalone clusters"

That's also expected: a standalone cluster only supports client mode for .py apps.

DAG example with SparkSubmitOperator:


from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from datetime import datetime, timedelta
from airflow import DAG

s3_log_path = "s3a://test1/sparkhistory"
spark_config = {
    "spark.sql.shuffle.partitions": 8,
    "spark.executor.memory": "4G",
    "spark.driver.memory": "4G",
    "spark.submit.deployMode": "client",  # default
    "spark.hadoop.fs.s3a.endpoint": "http://1.1.1.1:8083",
    "spark.hadoop.fs.s3a.access.key": "",
    "spark.hadoop.fs.s3a.secret.key": "",
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": s3_log_path,
    "spark.driver.extraJavaOptions": "-Dspark.hadoop.fs.s3a.path.style.access=true",  # example for driver opts
}

with DAG(
    'App',
    default_args={
        'depends_on_past': False,
        'retries': 1,
        'retry_delay': timedelta(minutes=5),
    },
    description='Some desc',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=['example'],
) as dag:
    t1 = SparkSubmitOperator(
        application="s3a://bucket/artifacts/app.py",
        conf=spark_config,
        py_files=None,  # optional: comma-separated .py/.zip/.egg dependencies, if any
        conn_id="spark_default",
        task_id="submit_job",
    )

P.S.: If you want to get rid of the driver process on your Airflow machine, you'll need something like what "Spark on Kubernetes" does:
When you submit to k8s with spark-submit, it creates a driver pod. From this pod it makes another submit in client mode, so the driver pod is effectively the driver.
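
For illustration only, here is the rough shape of the conf for such a Kubernetes submit (the API server address, image and namespace are placeholders, not values from your setup; with SparkSubmitOperator the master normally comes from the Airflow connection rather than from conf):

k8s_spark_config = {
    "spark.master": "k8s://https://<k8s-api-server>:6443",            # placeholder API server
    "spark.submit.deployMode": "cluster",                             # driver runs as a pod
    "spark.kubernetes.container.image": "<registry>/spark-py:3.5.4",  # placeholder image
    "spark.kubernetes.namespace": "spark",                            # placeholder namespace
}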

Reasons:
  • Blacklisted phrase (0.5): I need
  • Blacklisted phrase (1): How do you
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (1):
Posted by: Yan