Getting the most out of Apache Airflow 2.0+
Sr. Team Lead
October 14, 2021
Leverage the experts to take full advantage of Apache Airflow’s new features.
Your data pipelines are crucial for managing the processes connecting your data sources and analytical tools. They allow your team to code a series of tasks in order to extract, transform, and load your business data on a regular schedule.
But managing the pipelines themselves can be difficult. You may have thousands of data pipelines incorporating hundreds of parallel tasks. Hand-coding them can quickly overwhelm any data engineering team.
That’s where Apache Airflow comes in. Airflow’s open-source technology allows you to describe and manage the tasks that make up your data pipelines with a clear user interface. By taking advantage of the right framework, you can automatically generate the complex processes you need and free your teams from being mired in hand-coding data pipelines.
Say hello to the next generation of Airflow
Over the past year, Airflow has been upgraded three times: first to version 2.0, then to 2.1, and, just this month, to 2.2. The upgrades include hundreds of new features, and the experts at Astronomer have highlighted some of the most significant. At the center of the Airflow community and ecosystem, Astronomer is a steward of the Apache Airflow project and a driving force behind releases and the Airflow roadmap.
Create a trusted, central source of truth
In too many organizations, people spend as much time trying to figure out which key data figures are correct as they do strategizing about how to use the data to improve their business. Solve that problem by storing a centralized, standardized set of data within a strong data platform that’s modern and scalable. When you have a trusted, transparent source of data where metrics are clearly defined, the discussion will shift from whether the numbers are right to why the numbers are what they are.
- Refactored Airflow Scheduler for enhanced performance and high availability. The scheduler is much faster compared to previous Airflow versions. Astronomer’s performance benchmark showed that the new versions’ scheduler is up to 17 times faster.
- Full REST API that enables more opportunities for automation. Airflow previously had only an experimental API with limited functionality. The new versions offer a full REST API to create, read, update, or delete DAG runs, variables, connections, and pools.
- Deferrable Operators free up worker resources when waiting for external systems and events. With this feature, operators or sensors can defer themselves until a lightweight async check succeeds, and one triggerer process can run hundreds of deferred async tasks concurrently. As a result, tasks like monitoring a job on an external system or watching for an event become far less expensive.
- Independent Providers for improved usability and a more agile release cadence. Contributed modules from the Airflow community have been restructured around the external systems that can be used with Airflow. That means you can install only the provider packages for the systems you actually use. The change allows for a separation of concerns, faster release cycles for specific components, and a much cleaner organizational structure, so it is easier to find the code related to a specific external system.
- User interface and user experience improvements. Airflow’s new UI improves on the previous versions, and adds an Auto-refresh feature. The status of your workflow’s progress refreshes automatically unless you choose to deactivate this feature.
- TaskFlow API for a simpler way to transfer information between tasks. Airflow 2.0’s TaskFlow API makes directed acyclic graphs (DAGs) significantly easier to write by abstracting the task and dependency management layer from users. It also gives developers a streamlined way to pass information between tasks, because return values are handed over through XComs automatically instead of through hand-written XCom calls.
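To see why the Deferrable Operators item above makes waiting so much cheaper, here is a minimal sketch in plain Python `asyncio`. It is not Airflow's triggerer code and uses no Airflow API. The names `wait_for_external_job`, `run_all`, and `timed_run`, along with the simulated delay, are hypothetical and chosen only for illustration. The sketch shows the underlying idea: a single process can wait on hundreds of external conditions at once, because each wait yields control instead of occupying a worker.

```python
from __future__ import annotations

import asyncio
import time


async def wait_for_external_job(job_id: int, delay: float) -> str:
    """Stand-in for a lightweight async check against an external system."""
    # asyncio.sleep yields to the event loop, so other waits proceed meanwhile.
    await asyncio.sleep(delay)
    return f"job-{job_id}: done"


async def run_all(n_jobs: int, delay: float) -> list[str]:
    """Wait on n_jobs simulated external jobs concurrently in one process."""
    waits = [wait_for_external_job(i, delay) for i in range(n_jobs)]
    # gather returns results in the same order the awaitables were passed in.
    return await asyncio.gather(*waits)


def timed_run(n_jobs: int = 300, delay: float = 0.1) -> tuple[list[str], float]:
    """Run the waits and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(run_all(n_jobs, delay))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run()
    # Waiting on the jobs one after another would take about n_jobs * delay
    # seconds; waiting concurrently takes roughly one delay.
    print(f"{len(results)} jobs finished in {elapsed:.2f}s")
```

In a worker-per-task model, each of those 300 waits would hold a worker slot for its full duration. Here they share one event loop, which is the same trade-off the triggerer makes.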
Using DAS42’s framework extends the reach of Airflow
While Airflow 2.0+ offers greater flexibility and efficiency in managing data pipelines than ever before, your team still has to create them first.
Realistically, if you’re implementing a modernization project at a large, enterprise-level organization, your data could require thousands of pipelines. Each pipeline could incorporate hundreds of tasks that require detailed Python files to describe. Even if your pipelines are very similar, your team faces a long, tedious road of coding.
At DAS42, we’ve created an abstraction layer on top of Airflow that allows you to automatically generate the DAGs your system needs. Rather than writing thousands of Python files, you can write a set of configuration files outlining the details of your data stack. Our framework then automatically generates the thousands of DAGs that make up your data pipelines.
Consequently, your team no longer has to learn the ins and outs of Airflow, which requires its own programming expertise. Our framework allows you to separate the details of your data transformations from Airflow in a way more of your data team will understand.
Deploy data pipelines faster by closing the language gap
With best practices honed over many Airflow implementations, we’ve essentially democratized the data pipeline process by eliminating language barriers. Instead of using Python, our framework allows your team to configure data pipelines in YAML files.
Designed to be a readable markup language, YAML allows your team to describe your system’s data in straightforward terms. Once you’ve supplied the metadata for a given data pipeline in a YAML file, the framework then programmatically creates a DAG that can be used as a data pipeline in Airflow.
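As a rough illustration of the config-to-DAG idea, the sketch below takes a dictionary shaped like a parsed YAML file and derives a valid task order from it. This is not DAS42's framework, whose internals are not shown in this article, and it does not call Airflow. The config keys (`dag_id`, `schedule`, `tasks`, `depends_on`) and the function `build_task_order` are assumptions made for the example. A real generator would parse the YAML with a library and emit Airflow DAG objects rather than a list of task names.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline metadata, shaped as it might look after parsing YAML.
PIPELINE_CONFIG = {
    "dag_id": "daily_sales_report",
    "schedule": "@daily",
    "tasks": {
        "extract_orders": {"depends_on": []},
        "extract_customers": {"depends_on": []},
        "join_sales": {"depends_on": ["extract_orders", "extract_customers"]},
        "publish_report": {"depends_on": ["join_sales"]},
    },
}


def build_task_order(config: dict) -> list:
    """Validate the task graph in a config and return one valid run order."""
    tasks = config["tasks"]
    # Reject dependencies on tasks that the config never defines.
    for name, spec in tasks.items():
        for upstream in spec.get("depends_on", []):
            if upstream not in tasks:
                raise ValueError(f"{name} depends on unknown task {upstream}")
    # Map each task to the set of tasks that must finish before it.
    graph = {name: set(spec.get("depends_on", [])) for name, spec in tasks.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # A DAG must be acyclic, so a cycle makes the config invalid.
        raise ValueError(f"cycle in {config['dag_id']}: {exc.args[1]}") from exc


if __name__ == "__main__":
    print(build_task_order(PIPELINE_CONFIG))
```

Validating the graph at generation time means a typo in a config file fails fast, before a broken DAG ever reaches the scheduler.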
Depending on the size of your organization, your data team could have varying specialties. Your data engineers may know Python and other programming languages. But when you approach a business analyst about transforming one of their daily SQL queries into an Airflow DAG, they’ll be at a loss.
With DAS42’s framework in place, your analysts can write their SQL queries and create a YAML file that’s associated with each. The framework will then use Python code to convert the SQL query into a DAG automatically.
In the end, your analyst teams are empowered to create scheduled reports for Airflow without needing to consult your data engineers. As a result, your analysts are freed from generating repetitive queries with the assurance that the data pipeline will remain robust and secure.
Our Airflow best practices streamline maintenance and data quality
An effective Airflow framework also provides benefits if your organization uses multiple environments. If you use different resources to support your development, QA, and production environments, our framework places all these details into a YAML file.
With thousands of different DAGs pulling data from multiple systems, any change to your data sources would otherwise have to be updated manually in every affected DAG. Instead, our framework ensures all of your DAGs gather their information from a YAML file. Whenever you need to change your configuration, you only need to make one update.
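The single-source-of-configuration idea can be sketched as follows. The layout is an assumption for illustration, not the framework's actual file format: a `defaults` section holds shared settings, an `environments` section holds per-environment overrides, and a hypothetical `resolve_settings` helper merges the two. The dictionary stands in for a parsed YAML file, and all connection names and values are made up.

```python
# Hypothetical contents of one shared config file, after parsing.
SHARED_CONFIG = {
    "defaults": {"warehouse_conn": "warehouse_dev", "schema": "analytics", "retries": 1},
    "environments": {
        "dev": {},
        "qa": {"warehouse_conn": "warehouse_qa"},
        "prod": {"warehouse_conn": "warehouse_prod", "retries": 3},
    },
}


def resolve_settings(config: dict, environment: str) -> dict:
    """Return the defaults overlaid with one environment's overrides."""
    if environment not in config["environments"]:
        raise KeyError(f"unknown environment: {environment}")
    # Later keys win, so overrides replace defaults without mutating either.
    return {**config["defaults"], **config["environments"][environment]}


if __name__ == "__main__":
    for env in ("dev", "qa", "prod"):
        print(env, resolve_settings(SHARED_CONFIG, env))
```

If every generated DAG reads its settings through a helper like this rather than hard-coding a connection, changing the `prod` entry once changes it for all of them.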
Get expert help to get the most out of Airflow
Airflow’s capabilities introduce new possibilities for your analysts to generate scheduled reports that incorporate fresh, trustworthy data. But setting up a data pipeline requires some digging to ensure your connections are in place. And setting up Airflow architecture can be difficult and resource-intensive if you do it yourself.
Maintaining Airflow and all its components requires full-time resources just to manage the platform and its underlying infrastructure. It makes sense – and saves time and money – to get help from the experts. Astronomer offers a managed service with extra functionality built on top of Airflow’s open-source version, enterprise support, training, and consulting, and has long been involved in contributing to the open-source project.
They also offer some advice for upgrading to take advantage of all of Airflow’s latest features: Airflow 2.1 was a strong follow-up release to Airflow 2.0, so if you’re not running on Airflow 2.0+ yet, they recommend upgrading to Airflow 2.1 directly.