There's no configuration in airflow.cfg that makes this behave like Apache Kafka's retention settings. Until Airflow adds a config for a log retention policy, I try not to make things in Airflow complicated to maintain, so I take the easy route below; that way the systems team or the data analysts on the team can maintain it without needing senior-developer experience.

Usually you delete logs or obsolete DAG runs either to save space or to make Airflow load DAGs faster, and while you make these unorthodox changes you need to be sure Airflow's integrity is not harmed (learned the hard way).

Logs in Airflow can live in three places: the backend DB, the log folder (DAG logs, scheduler logs, etc.), and a remote location (not needed 99% of the time).

Delete old DAG runs first, in the database. Mine is Postgres, and the SQL below has one purpose: keep the latest 10 runs per DAG and delete the rest.

Step 1: Delete data in the backend database (to make Airflow load faster)

    -- keep only the latest 10 finished runs per DAG; delete the rest
    WITH RankedDags AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY dag_id ORDER BY execution_date DESC) AS rn
        FROM public.dag_run
        WHERE state IN ('success', 'failed')
    )
    DELETE FROM public.dag_run
    WHERE id IN (
        SELECT id
        FROM RankedDags
        WHERE rn > 10
    );
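
Before running that DELETE, you can preview how many rows it would remove; a minimal sketch reusing the same CTE against the same Postgres backend:

    WITH RankedDags AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY dag_id ORDER BY execution_date DESC) AS rn
        FROM public.dag_run
        WHERE state IN ('success', 'failed')
    )
    -- count instead of delete: rows outside the latest 10 per DAG
    SELECT count(*) AS rows_to_delete
    FROM RankedDags
    WHERE rn > 10;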

You can also pick a cutoff date instead and use the result of a select query like the one below to delete only the old runs. I usually don't do this, because I have DAGs that run yearly or monthly and I want to keep how those looked in their first run:

    -- finished runs older than 15 days
    SELECT *
    FROM public.dag_run f
    WHERE f.state IN ('success', 'failed')
      AND DATE(f.execution_date) <= CURRENT_DATE - INTERVAL '15 days';
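
If you do prefer the date-based route, the same predicate can drive the DELETE directly; a sketch under the same assumptions (Postgres backend, 15-day cutoff):

    -- delete finished runs older than 15 days
    DELETE FROM public.dag_run
    WHERE state IN ('success', 'failed')
      AND DATE(execution_date) <= CURRENT_DATE - INTERVAL '15 days';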

Step 2: Remove the scheduler logs (these are the logs that waste the most space, and don't worry, nothing bad will happen). Just don't delete the directory that the 'latest' symlink points to:

    root@airflow-server:~/airflow/logs/scheduler# ll
    drwxr-xr-x 3 root root 4096 Sep 24 20:00 2024-09-25
    drwxr-xr-x 3 root root 4096 Sep 25 20:00 2024-09-26
    drwxr-xr-x 3 root root 4096 Sep 26 20:00 2024-09-27
    drwxr-xr-x 3 root root 4096 Sep 30 10:57 2024-09-30
    drwxr-xr-x 7 root root 4096 Oct 31 20:00 2024-11-01
    lrwxrwxrwx 1 root root   10 Oct 31 20:00 latest -> 2024-11-01

    # rm -rf 2024-09-*

Now at least 80% of your logs are deleted, which should be satisfying enough; but if you want to go further, you can write a bash script that traverses /root/airflow/logs/dag_id* and finds folders or files with old modification dates (a sketch follows). Even past Steps 1 and 2, deleting the directories mentioned only loses you the detailed logs inside each task instance's log files.
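
A minimal sketch of such a script, assuming the default layout under /root/airflow/logs and a 30-day cutoff (both are assumptions to adjust):

    # delete log files not modified in the last 30 days, then prune
    # directories left empty; the 'latest' symlink is safe because
    # -type f matches only regular files
    find /root/airflow/logs -type f -name '*.log' -mtime +30 -delete
    find /root/airflow/logs -mindepth 1 -type d -empty -delete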

You can also take measures like raising the log level to 'ERROR' in airflow.cfg to lighten the app.
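
As a sketch, in Airflow 2.x this setting lives under the [logging] section of airflow.cfg (older 1.10 releases kept it under [core]):

    [logging]
    # only emit errors; INFO-level task and scheduler chatter is the bulk of the volume
    logging_level = ERROR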

You can always turn the steps above into an ETL that runs automatically, but since disk is cheap and a 30 GB disk can easily store more than 10,000 complex dag_runs with heavy Spark logs, you really just need to spend 30 minutes every other month cleaning the scheduler logs.
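
If you do automate it, a cron entry like the one below would work; the script path is hypothetical and stands in for whatever wraps the SQL and find commands above:

    # run the cleanup at 03:00 on the 1st of every other month
    # (the script path is hypothetical; point it at your own wrapper)
    0 3 1 */2 * /root/airflow/scripts/cleanup_logs.sh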
