July 22, 2024


Hosting, orchestrating, and managing data pipelines is a complex process for any business.  Google Cloud offers Cloud Composer – a fully managed workflow orchestration service – enabling businesses to create, schedule, monitor, and manage workflows that span across clouds and on-premises data centers. Cloud Composer is built on the popular Apache Airflow open source project and operates using the Python programming language.  Apache Airflow allows users to create directed acyclic graphs (DAGs) of tasks, which can be scheduled to run at specific intervals or triggered by external events.

This guide contains a generalized checklist of activities when authoring Apache Airflow DAGs.  These items follow best practices determined by Google Cloud and the open source community.  A collection of performant DAGs will enable Cloud Composer to work optimally, and standardized authoring will help developers manage hundreds or even thousands of DAGs.  Each item will benefit your Cloud Composer environment and your development process.

Get Started

1. Standardize file names. Help other developers browse your collection of DAG files.
a. ex) team_project_workflow_version.py

2. DAGs should be deterministic.
a. A given input will always produce the same output.

3. DAGs should be idempotent.
a. Triggering the DAG multiple times has the same effect/outcome.

4. Tasks should be atomic and idempotent.
a. Each task should be responsible for one operation that can be re-run independently of the others. In an atomized task, a success in part of the task means a success of the entire task.

5. Simplify DAGs as much as possible.
a. Simpler DAGs with fewer dependencies between tasks tend to have better scheduling performance because they have less overhead. A linear structure (e.g. A -> B -> C) is generally more efficient than a deeply nested tree structure with many dependencies.

Standardize DAG Creation

6. Add an owner to your default_args.
a. Determine whether you'd prefer the email address / id of a developer, or a distribution list / team name.

7. Use with DAG() as dag: instead of dag = DAG().
a. Prevent the need to pass the dag object to every operator or task group.

8. Set a version in the DAG ID.
a. Update the version after any code change in the DAG.
b. This prevents deleted Task logs from vanishing from the UI, no-status tasks generated for old dag runs, and general confusion of when DAGs have changed.
c. Airflow open-source has plans to implement versioning in the future.

9. Add tags to your DAGs.
a. Help developers navigate the Airflow UI via tag filtering.
b. Group DAGs by organization, team, project, application, etc.

10. Add a DAG description.
a. Help other developers understand your DAG.

11. Pause your DAGs on creation.
a. This will help avoid accidental DAG runs that add load to the Cloud Composer environment.

12. Set catchup=False to avoid automatic catch-ups overloading your Cloud Composer environment.

13. Set a dagrun_timeout to avoid dags not finishing, and holding Cloud Composer environment resources or introducing collisions on retries.

14. Set SLAs at the DAG level to receive alerts for long-running DAGs.
a. Airflow SLAs are always defined relative to the start time of the DAG, not to individual tasks.
b. Ensure that sla_miss_timeout is less than the dagrun_timeout.
c. Example: If your DAG usually takes 5 minutes to successfully finish, set the sla_miss_timeout to 7 minutes and the dagrun_timeout to 10 minutes.  Determine these thresholds based on the priority of your DAGs.

15. Ensure all tasks have the same start_date by default by passing the arg to the DAG during instantiation.

16. Use a static start_date with your DAGs.
a. A dynamic start_date is misleading, and can cause failures when clearing out failed task instances and missing DAG runs.

17. Set retries as a default_arg applied at the DAG level and get more granular for specific tasks only where necessary.
a. A good range is 1–4 retries. Too many retries will add unnecessary load to the Cloud Composer environment.

Example putting all of the above together:

