
Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code | by Alvaro Leandro Cavalcante Carneiro | Jan, 2025

January 30, 2025
Apache Airflow is one of the most popular orchestration tools in the data space, powering workflows for companies worldwide. However, anyone who has worked with Airflow in a production environment, especially a complex one, knows that it can occasionally present problems and weird bugs.

Among the many aspects you need to manage in an Airflow environment, one critical metric often flies under the radar: DAG parse time. Monitoring and optimizing parse time is essential to avoid performance bottlenecks and ensure the correct functioning of your orchestrations, as we'll explore in this article.

That said, this tutorial aims to introduce airflow-parse-bench, an open-source tool I developed to help data engineers monitor and optimize their Airflow environments, providing insights to reduce code complexity and parse time.

When it comes to Airflow, DAG parse time is often an overlooked metric. Parsing occurs every time Airflow processes your Python files to build the DAGs dynamically.

By default, all your DAGs are parsed every 30 seconds — a frequency controlled by the configuration variable min_file_process_interval. This means that every 30 seconds, all the Python code in your dags folder is read, imported, and processed to generate DAG objects containing the tasks to be scheduled. Successfully processed files are then added to the DAG Bag.

Two key Airflow components handle this process:

  • DagFileProcessorManager: the process that determines which Python files need to be parsed.
  • DagFileProcessorProcess: the process that actually parses an individual file and turns it into one or more DAG objects.

Together, both components (commonly known as the dag processor) are executed by the Airflow Scheduler, ensuring that your DAG objects are updated before being triggered. However, for scalability and security reasons, it is also possible to run your dag processor as a separate component in your cluster.
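In recent Airflow versions (2.3+), assuming your deployment enables the standalone mode, this separate component is started with its own CLI command:

airflow dag-processor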

If your environment only has a few dozen DAGs, it's unlikely that the parsing process will cause any kind of problem. However, it's common to find production environments with hundreds or even thousands of DAGs. In this case, if your parse time is too high, it can lead to:

  • Delayed DAG scheduling.
  • Increased resource utilization.
  • Environment heartbeat issues.
  • Scheduler failures.
  • Excessive CPU and memory usage, wasting resources.

Now, imagine having an environment with hundreds of DAGs containing unnecessarily complex parsing logic. Small inefficiencies can quickly turn into significant problems, affecting the stability and performance of your entire Airflow setup.

When writing Airflow DAGs, there are some important best practices to keep in mind to create optimized code. Although you can find plenty of tutorials on how to improve your DAGs, I'll summarize some of the key principles that can significantly enhance your DAG performance.

Limit Top-Level Code

One of the most common causes of high DAG parsing times is inefficient or complex top-level code. Top-level code in an Airflow DAG file is executed every time the Scheduler parses the file. If this code includes resource-intensive operations, such as database queries, API calls, or dynamic task generation, it can significantly impact parsing performance.

The following code shows an example of a non-optimized DAG:
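(In this sketch, the endpoint URL, column name, and task logic are hypothetical placeholders, not the article's original snippet.)

import requests
import pandas as pd
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Top-level code: this runs on EVERY parse cycle, not just at task runtime.
response = requests.get("https://api.example.com/data")  # hypothetical endpoint
df = pd.DataFrame(response.json())
item_ids = df["id"].tolist()

def process_item(item_id):
    print(f"Processing {item_id}")

with DAG("non_optimized_dag", start_date=datetime(2025, 1, 1), schedule_interval=None) as dag:
    # Dynamic task generation driven by the API response fetched above.
    for item_id in item_ids:
        PythonOperator(
            task_id=f"process_{item_id}",
            python_callable=process_item,
            op_args=[item_id],
        )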

In this case, every time the file is parsed by the Scheduler, the top-level code is executed, making an API request and processing the DataFrame, which can significantly impact the parse time.

Another important factor contributing to slow parsing is top-level imports. Every library imported at the top level is loaded into memory during parsing, which can be time-consuming. To avoid this, you can move imports into functions or task definitions.

The following code shows a better version of the same DAG:
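(Again a sketch under the same hypothetical names; the point is that the heavy imports and the API call now live inside the task callable, so the Scheduler only parses a lightweight file.)

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_and_process():
    # Heavy imports and the API call now happen only at task runtime,
    # not on every parse cycle.
    import requests
    import pandas as pd

    response = requests.get("https://api.example.com/data")  # hypothetical endpoint
    df = pd.DataFrame(response.json())
    for item_id in df["id"].tolist():
        print(f"Processing {item_id}")

with DAG("optimized_dag", start_date=datetime(2025, 1, 1), schedule_interval=None) as dag:
    PythonOperator(task_id="fetch_and_process", python_callable=fetch_and_process)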

Avoid Xcoms and Variables in Top-Level Code

Still on the same topic, it's particularly interesting to avoid using Xcoms and Variables in your top-level code. As stated in Google's documentation:

If you are using Variable.get() in top level code, every time the .py file is parsed, Airflow executes a Variable.get() which opens a session to the DB. This can dramatically slow down parse times.

To address this, consider using a JSON dictionary to retrieve multiple variables in a single database query, rather than making several Variable.get() calls. Alternatively, use Jinja templates, as variables retrieved this way are only processed during task execution, not during DAG parsing.
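As an illustration, a sketch of both patterns (the variable and key names are hypothetical):

from airflow.models import Variable

# One database query for several values: store a JSON blob in a single
# Variable (here a hypothetical variable named "my_dag_config").
config = Variable.get("my_dag_config", deserialize_json=True)
bucket = config["bucket"]
table = config["table"]

# Or defer retrieval entirely with a Jinja template: this string is only
# resolved at task runtime, never during DAG parsing.
templated_bucket = "{{ var.value.bucket_name }}"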

Remove Unnecessary DAGs

Although it seems obvious, it's always important to remember to periodically clean up unnecessary DAGs and files from your environment:

  • Remove unused DAGs: Check your dags folder and delete any files that are no longer needed.
  • Use .airflowignore: Specify the files Airflow should deliberately ignore, skipping parsing (a sketch follows this list).
  • Review paused DAGs: Paused DAGs are still parsed by the Scheduler, consuming resources. If they are no longer required, consider removing or archiving them.
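As referenced above, a minimal .airflowignore sits in the root of the dags folder and takes one regular-expression pattern per line; the patterns below are hypothetical examples:

legacy_dags/.*
.*_backup\.py
scratch_.*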

Change Airflow Configurations

Lastly, you can change some Airflow configurations to reduce the Scheduler's resource usage:

  • min_file_process_interval: This setting controls how often (in seconds) Airflow parses your DAG files. Increasing it from the default 30 seconds can reduce the Scheduler's load at the cost of slower DAG updates.
  • dag_dir_list_interval: This determines how often (in seconds) Airflow scans the dags directory for new DAGs. If you deploy new DAGs infrequently, consider increasing this interval to reduce CPU usage (see the sketch below for how to set both options).
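Both options belong to the [scheduler] section of airflow.cfg, so they can also be set with Airflow's standard AIRFLOW__SECTION__KEY environment-variable convention; the values below are illustrative, not recommendations:

export AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=120
export AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=600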

We've discussed a lot about the importance of creating optimized DAGs to maintain a healthy Airflow environment. But how do you actually measure the parse time of your DAGs? Fortunately, there are several ways to do this, depending on your Airflow deployment or operating system.

For example, if you have a Cloud Composer deployment, you can easily retrieve a DAG parse report by executing the following command in the Google CLI:

gcloud composer environments run $ENVIRONMENT_NAME \
    --location $LOCATION \
    dags report

While retrieving parse metrics is straightforward, measuring the effectiveness of your code optimizations can be less so. Every time you modify your code, you need to redeploy the updated Python file to your cloud provider, wait for the DAG to be parsed, and then extract a new report — a slow and time-consuming process.

Another possible approach, if you're on Linux or Mac, is to run this command to measure the parse time locally on your machine:

time python airflow/example_dags/example.py

However, while simple, this approach is not practical for systematically measuring and comparing the parse times of multiple DAGs.

To address these challenges, I created airflow-parse-bench, a Python library that simplifies measuring and comparing the parse times of your DAGs using Airflow's native parse method.

The airflow-parse-bench tool makes it easy to store parse times, compare results, and standardize comparisons across your DAGs.

Installing the Library

Before installation, it's advisable to use a virtualenv to avoid library conflicts. Once that's set up, you can install the package by running the following command:

pip install airflow-parse-bench

Note: This command only installs the essential dependencies (related to Airflow and Airflow providers). You must manually install any additional libraries your DAGs depend on.

For example, if a DAG uses boto3 to interact with AWS, make sure boto3 is installed in your environment. Otherwise, you might encounter parse errors.

After that, it's necessary to initialize your Airflow database. This can be done by executing the following command:

airflow db init

In addition, if your DAGs use Airflow Variables, you must define them locally as well. However, it's not necessary to put real values in your variables, as the actual values aren't required for parsing purposes:

airflow variables set MY_VARIABLE 'ANY TEST VALUE'

Without this, you'll encounter an error like:

error: 'Variable MY_VARIABLE does not exist'

Using the Tool

After installing the library, you can begin measuring parse times. For example, suppose you have a DAG file named dag_test.py containing the non-optimized DAG code used in the example above.

To measure its parse time, simply run:

airflow-parse-bench --path dag_test.py

This execution produces the following output:

Execution result. Image by author.

As observed, our DAG presented a parse time of 0.61 seconds. If I run the command again, I'll see some small variations, as parse times can fluctuate slightly across runs due to system and environmental factors:

Result of another execution of the same DAG. Image by author.

In order to present a more concise number, it's possible to aggregate multiple executions by specifying the number of iterations:

airflow-parse-bench --path dag_test.py --num-iterations 5

Although it takes a bit longer to finish, this calculates the average parse time across 5 executions.

Now, to evaluate the impact of the aforementioned optimizations, I replaced the code in my dag_test.py with the optimized version shared earlier. After executing the same command, I got the following result:

Parse result of the optimized code. Image by author.

As seen, just applying some good practices was capable of reducing the DAG parse time by almost 0.5 seconds, highlighting the importance of the changes we made!

There are other interesting features that I think are relevant to share.

As a reminder, if you have any doubts or problems using the tool, you can access the complete documentation on GitHub.

Besides that, to view all the parameters supported by the library, simply run:

airflow-parse-bench --help

Testing Multiple DAGs

In most cases, you likely have dozens of DAGs whose parse times you want to test. To address this use case, I created a folder named dags and put four Python files inside it.

To measure the parse times for all the DAGs in a folder, it's just necessary to specify the folder path in the --path parameter:

airflow-parse-bench --path my_path/dags

Running this command produces a table summarizing the parse times for all the DAGs in the folder:

Testing the parse time of multiple DAGs. Image by author.

By default, the table is sorted from the fastest to the slowest DAG. However, you can reverse the order by using the --order parameter:

airflow-parse-bench --path my_path/dags --order desc

Inverted sorting order. Image by author.

Skipping Unchanged DAGs

The --skip-unchanged parameter can be especially useful during development. As the name suggests, this option skips the parse execution for DAGs that haven't been modified since the last execution:

airflow-parse-bench --path my_path/dags --skip-unchanged

As shown below, when the DAGs remain unchanged, the output reflects no difference in parse times:

Output with no difference for unchanged files. Image by author.

Resetting the Database

All DAG information, including metrics and history, is stored in a local SQLite database. If you want to clear all stored data and start fresh, use the --reset-db flag:

airflow-parse-bench --path my_path/dags --reset-db

This command resets the database and processes the DAGs as if it were the first execution.

Parse time is an important metric for maintaining scalable and efficient Airflow environments, especially as your orchestration requirements become increasingly complex.

For this reason, the airflow-parse-bench library can be an important tool for helping data engineers create better DAGs. By testing your DAGs' parse time locally, you can easily and quickly find your code bottlenecks, making your DAGs faster and more performant.

Since the code is executed locally, the resulting parse time won't be identical to the one observed in your Airflow cluster. However, if you can reduce the parse time on your local machine, the same improvement is likely to carry over to your cloud environment.

Finally, this project is open for collaboration! If you have suggestions, ideas, or improvements, feel free to contribute on GitHub.
