Installing Apache Airflow on Ubuntu 24.04: A Step-by-Step Guide
Apache Airflow is an open-source platform for managing workflows, particularly useful for automating data pipelines and processes like Extract, Transform, Load (ETL). It leverages Python-based Directed Acyclic Graphs (DAGs) to schedule tasks, handling dependencies to ensure seamless execution. This guide details how to install and configure Apache Airflow on Ubuntu 24.04, secure your setup, and verify functionality using a sample DAG.
Prerequisites
- Access to an Ubuntu 24.04 server with a minimum of 4 GB RAM
- A DNS A record for your domain pointing to the server’s IP address
Installing Apache Airflow on Ubuntu 24.04
Airflow is distributed as a Python package and can be installed with pip. Follow these steps to set up Python, create a virtual environment, and install Airflow.
Step 1: Update Package Index
$ sudo apt update
Step 2: Verify Python Installation
$ python3 --version
Expected output:
Python 3.12.3
Step 3: Install Python (If Needed)
$ sudo apt install python3
Step 4: Install Virtual Environment and Dependencies
$ sudo apt install python3-venv libpq-dev -y
Step 5: Create and Activate Virtual Environment
$ python3 -m venv ~/airflow_env
$ source ~/airflow_env/bin/activate
The shell prompt should now indicate that the virtual environment is active:
(airflow_env) linuxuser@example:~$
Step 6: Install Apache Airflow with PostgreSQL Support
$ pip install "apache-airflow[postgres]" psycopg2
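Unquoted square brackets are misinterpreted by some shells (such as Zsh), hence the quotes. The Airflow project also recommends installing against a constraints file so that dependency versions stay compatible; a hedged sketch, assuming Airflow 2.10.4 on Python 3.12 (adjust both versions to your environment):
$ AIRFLOW_VERSION=2.10.4
$ PYTHON_VERSION="$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
$ pip install "apache-airflow[postgres]==${AIRFLOW_VERSION}" psycopg2 \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"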
Step 7: Install PostgreSQL
$ sudo apt install postgresql postgresql-contrib
$ sudo systemctl start postgresql
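Optionally, enable PostgreSQL to start at boot and confirm the service is running:
$ sudo systemctl enable postgresql
$ systemctl status postgresql --no-pager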
Step 8: Configure PostgreSQL for Airflow
$ sudo -u postgres psql
Example output:
psql (16.6 (Ubuntu 16.6-0ubuntu0.24.04.1))
Type "help" for help.
postgres=#
postgres=# CREATE USER airflow PASSWORD 'YourStrongPassword';
postgres=# CREATE DATABASE airflowdb;
postgres=# ALTER DATABASE airflowdb OWNER TO airflow;
postgres=# \c airflowdb
airflowdb=# GRANT ALL ON SCHEMA public TO airflow;
airflowdb=# exit
Note that the GRANT is issued after connecting to airflowdb with \c; run against the default postgres database, it would apply to the wrong schema.
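Before continuing, you can verify that the new role can log in to the database over TCP (you will be prompted for the password set above):
$ psql -h localhost -U airflow -d airflowdb -c '\conninfo'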
Step 9: Configure Airflow
$ nano ~/airflow/airflow.cfg
If the ~/airflow directory doesn’t exist yet, run the following command once to generate it together with a default airflow.cfg:
$ airflow db init
In airflow.cfg, update these values (executor lives in the [core] section; on Airflow 2.3 and later, sql_alchemy_conn lives in the [database] section):
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:YourStrongPassword@localhost/airflowdb
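To confirm the new settings are being picked up, you can read them back through the Airflow CLI:
$ airflow config get-value core executor
$ airflow config get-value database sql_alchemy_conn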
Step 10: Initialize the Airflow Database
$ airflow db init
You should see output similar to:
DB: postgresql+psycopg2://airflow:***@localhost/airflowdb
[2025-01-05T23:58:36.808+0000] {migration.py:207} INFO - Context impl PostgresqlImpl.
[2025-01-05T23:58:36.809+0000] {migration.py:210} INFO - Will assume transactional DDL.
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running stamp_revision -> 5f2621c13b39
WARNI [airflow.models.crypto] empty cryptography key - values will not be stored encrypted.
Initialization done
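The WARNI line indicates that no Fernet key is configured, so credentials Airflow stores (for example, connection passwords) are kept in plain text. To fix this, you can generate a key with the cryptography package (installed as an Airflow dependency) and set it as fernet_key under [core] in airflow.cfg:
$ python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"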
Step 11: Create Admin User
$ airflow users create \
--username admin \
--password yourSuperSecretPassword \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com
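You can confirm the account was created with:
$ airflow users list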
Step 12: Start Airflow Services
Launch the web server on port 8080 and direct output to webserver.log:
$ nohup airflow webserver -p 8080 > webserver.log 2>&1 &
Start the scheduler and log output to scheduler.log:
$ nohup airflow scheduler > scheduler.log 2>&1 &
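Both services now run in the background. A quick way to confirm they started and to follow their logs:
$ pgrep -af airflow
$ tail -f webserver.log scheduler.log
For a production deployment, consider running both commands as systemd services instead, so they restart automatically after a reboot.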
Set Up Nginx as a Reverse Proxy for Apache Airflow
Apache Airflow listens on port 8080 by default. To expose it on your domain over standard HTTP (and later HTTPS), set up Nginx as a reverse proxy using the following steps.
Step 1: Install Nginx
$ sudo apt install -y nginx
Step 2: Create Nginx Configuration for Airflow
$ sudo nano /etc/nginx/sites-available/airflow
Add this content to the configuration file (replace airflow.example.com with your actual domain):
server {
    listen 80;
    server_name airflow.example.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
This configuration routes incoming requests on port 80 to Airflow’s default port 8080 via the specified domain.
Step 3: Enable the Virtual Host
$ sudo ln -s /etc/nginx/sites-available/airflow /etc/nginx/sites-enabled/
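Ubuntu’s default Nginx site also listens on port 80 and can take precedence over the new virtual host; if you don’t need it, disable it:
$ sudo rm /etc/nginx/sites-enabled/default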
Step 4: Test and Reload Nginx
$ sudo nginx -t
Expected output:
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
$ sudo systemctl reload nginx
Step 5: Open Port 80 via Firewall
$ sudo ufw allow 80/tcp
$ sudo ufw reload
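Port 443 will be needed once HTTPS is enabled in the next section, so it is convenient to open it now; if UFW is not yet active, make sure SSH stays reachable before enabling it:
$ sudo ufw allow 443/tcp
$ sudo ufw allow OpenSSH
$ sudo ufw status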
You can now open http://airflow.example.com in your browser to see the Airflow login page.
Use SSL to Secure Apache Airflow with HTTPS
To protect connections with SSL encryption, generate Let’s Encrypt certificates using Certbot and configure Nginx to use them.
Step 1: Install Certbot
$ sudo snap install --classic certbot
If Snap is missing, install it with the following command:
$ sudo apt install snapd -y
Step 2: Enable Certbot Globally
$ sudo ln -s /snap/bin/certbot /usr/bin/certbot
Step 3: Request an SSL Certificate
Replace airflow.example.com and admin@example.com with your domain and email.
$ sudo certbot --nginx --redirect -d airflow.example.com -m admin@example.com --agree-tos
Expected result:
…
Account registered.
Requesting a certificate for airflow.example.com
Successfully received certificate.
Certificate is saved at: /etc/letsencrypt/live/airflow.example.com/fullchain.pem
Key is saved at: /etc/letsencrypt/live/airflow.example.com/privkey.pem
This certificate expires on 2025-04-21.
These files will be updated when the certificate renews.
Certbot has set up a scheduled task to automatically renew this certificate.
Deploying certificate
Successfully deployed certificate for airflow.example.com to /etc/nginx/sites-enabled/airflow
Congratulations! You have successfully enabled HTTPS on https://airflow.example.com
Step 4: Test Automatic Renewal
$ sudo certbot renew --dry-run
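The certbot snap registers a systemd timer that performs the actual renewals; you can check that it is scheduled with:
$ systemctl list-timers snap.certbot.renew.timer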
Step 5: Restart Nginx
$ sudo systemctl restart nginx
Accessing the Apache Airflow Dashboard
To open the Airflow interface and begin working with DAGs on your server, follow the instructions below.
Launch your browser and go to https://airflow.example.com (substitute your own domain).
Use the admin credentials you created earlier to sign in:
- Username: admin
- Password: yourSuperSecretPassword
Create and Execute a DAG in Apache Airflow
Use the following steps to build a simple Directed Acyclic Graph (DAG) and run it using the Apache Airflow interface.
Step 1: Set Up the DAG Directory
$ mkdir -p ~/airflow/dags
Step 2: Create a Sample Python DAG File
$ nano ~/airflow/dags/my_first_dag.py
Step 3: Add Code to Define the DAG
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

with DAG(
    'my_first_dag',
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(days=1),  # run once per day
    catchup=False  # don't backfill runs between start_date and today
) as dag:

    def print_hello():
        print('Greetings from centron')

    # A single task that calls the function above
    hello_task = PythonOperator(
        task_id='hello_task',
        python_callable=print_hello
    )
The code above defines a simple DAG named my_first_dag that executes once per day and outputs a greeting message. Note that PythonOperator is imported from airflow.operators.python, the current module path in Airflow 2; the older airflow.operators.python_operator path is deprecated.
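Before switching to the web interface, you can confirm the scheduler has parsed the file and execute the task once in isolation (airflow tasks test runs a single task without recording state in the database):
$ airflow dags list
$ airflow tasks test my_first_dag hello_task 2024-01-01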
Step 4: Activate the DAG via the Web Interface
Open the Apache Airflow dashboard, go to the DAGs section, locate my_first_dag, enable it, and then manually trigger its execution.
Step 5: Monitor DAG Execution
Use the Graph View and Event Log tools in the Airflow UI to track task execution and diagnose issues.
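The same information is available from the CLI; for example, to list recent runs of the sample DAG:
$ airflow dags list-runs -d my_first_dag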
Conclusion
You’ve successfully deployed Apache Airflow on Ubuntu 24.04, secured it using Nginx as a reverse proxy, and created a sample DAG to get started. Apache Airflow offers powerful workflow orchestration and can be customized to manage various automation needs in your environment.
For deeper insights and additional features, refer to the official Airflow documentation.