How to Configure Celery Workers for Asynchronous Query Execution in Apache Superset
To configure Celery workers for asynchronous query execution in Apache Superset, define the broker URL in superset/config.py, start workers with celery -A superset.tasks.celery_app.celery_app worker, and ensure the execute_sql_task in superset/sql/execution/celery_task.py processes queries from the message queue.
Apache Superset relies on Celery to offload long-running SQL queries and background tasks from the web application, ensuring the UI remains responsive during heavy workloads. This distributed task queue architecture requires proper configuration of message brokers, worker processes, and result backends to function correctly in production environments. Understanding how to configure Celery workers for asynchronous query execution allows administrators to scale query processing horizontally across multiple worker nodes.
Architecture of Asynchronous Query Execution
The implementation spans three core components that interact to process queries outside the request-response cycle.
Celery App Initialization
The global Celery application is defined in superset/tasks/celery_app.py and registered during Flask-AppBuilder initialization within superset/initialization/__init__.py. This module reads configuration values from superset/config.py (specifically lines 1359–1419), establishing the connection to the message broker and result backend before workers begin consuming tasks.
SQL Execution Task Logic
The actual query processing logic resides in superset/sql/execution/celery_task.py, specifically within the execute_sql_task function decorated with @celery_app.task(name="query_execution.execute_sql"). When invoked via SQLExecutor.execute_async(), this task manages the complete query lifecycle: transitioning status to running, executing statements via _execute_sql_statements, finalizing successful queries through _finalize_successful_query, and optionally storing results via _store_results_in_backend.
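The lifecycle described above can be pictured as a small state machine. The following is a minimal sketch under stated assumptions, not Superset's implementation: `QueryStatus`, `FakeQuery`, and `run_query_lifecycle` are hypothetical names, the `execute` callable stands in for `_execute_sql_statements`, and `store_results` stands in for `_store_results_in_backend`.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class QueryStatus(str, Enum):
    # Hypothetical status values, used only for this illustration.
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class FakeQuery:
    # Hypothetical stand-in for a persisted query record.
    sql: str
    status: QueryStatus = QueryStatus.PENDING
    rows: List[tuple] = field(default_factory=list)
    error: Optional[str] = None
    history: List[QueryStatus] = field(default_factory=list)

    def set_status(self, status: QueryStatus) -> None:
        self.status = status
        self.history.append(status)


def run_query_lifecycle(
    query: FakeQuery,
    execute: Callable[[str], List[tuple]],
    store_results: Optional[Callable[[FakeQuery], None]] = None,
) -> FakeQuery:
    # 1. Mark the query as running before any statement executes.
    query.set_status(QueryStatus.RUNNING)
    try:
        # 2. Execute the SQL (stand-in for the real statement runner).
        query.rows = execute(query.sql)
    except Exception as exc:
        # 3a. Record the failure instead of crashing the worker.
        query.error = str(exc)
        query.set_status(QueryStatus.FAILED)
        return query
    # 3b. Finalize the successful query.
    query.set_status(QueryStatus.SUCCESS)
    # 4. Optionally persist results to a backend.
    if store_results is not None:
        store_results(query)
    return query
```

Calling `run_query_lifecycle(FakeQuery("SELECT 1"), lambda sql: [(1,)])` walks the record through the running and success states, while an `execute` callable that raises leaves it in the failed state with the error message attached.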
Result Backend Integration
After successful execution, query payloads are serialized and written to the configured results backend, commonly Redis, S3, or another supported storage system. The RESULTS_BACKEND_USE_MSGPACK setting controls whether results are serialized with msgpack and Apache Arrow rather than JSON, which reduces transfer overhead for large result sets.
Configuration Steps
Setting up Celery requires configuring the message broker, defining worker parameters, and launching processes with appropriate queue isolation.
Configure the Message Broker
Superset defaults to Redis for both the broker and result backend. Set these variables in your superset/config.py or via environment variables:
REDIS_HOST = "redis"
REDIS_PORT = 6379
REDIS_CELERY_DB = 2
CELERY_BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
CELERY_RESULT_BACKEND = CELERY_BROKER_URL
These values construct the connection strings used by the Celery app during initialization.
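Since the values can also come from environment variables, one way to wire that up is sketched below. This is an illustrative pattern rather than what Superset ships: `build_redis_url` is a hypothetical helper, and the fallback values simply mirror the defaults shown above.

```python
import os
from typing import Mapping, Optional


def build_redis_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a Redis URL from environment-style settings.

    Hypothetical helper: falls back to the defaults used in this article
    when a variable is not set.
    """
    env = os.environ if env is None else env
    host = env.get("REDIS_HOST", "redis")
    # int() rejects malformed values early, before Celery tries to connect.
    port = int(env.get("REDIS_PORT", "6379"))
    db = int(env.get("REDIS_CELERY_DB", "2"))
    return f"redis://{host}:{port}/{db}"


CELERY_BROKER_URL = build_redis_url()
CELERY_RESULT_BACKEND = CELERY_BROKER_URL
```

Passing a plain dictionary instead of relying on `os.environ` makes the helper easy to check in isolation, for example `build_redis_url({"REDIS_HOST": "cache"})`.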
Enable Celery Configuration
Ensure your custom configuration properly references the Celery configuration class:
from superset.config import CeleryConfig
CELERY_CONFIG = CeleryConfig
For advanced setups, subclass CeleryConfig to override broker URLs or serialization settings while maintaining the base configuration structure.
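A subclass along those lines might look like the sketch below. To keep the example runnable on its own, a stand-in `CeleryConfig` is declared locally; in a real deployment the base class would be imported from `superset.config` as shown above. The attribute names are standard lowercase Celery settings, while the host names and `CustomCeleryConfig` are assumptions made for illustration.

```python
class CeleryConfig:
    # Stand-in for superset.config.CeleryConfig so this sketch is
    # self-contained; the real class may define different attributes.
    broker_url = "redis://redis:6379/2"
    result_backend = "redis://redis:6379/2"
    worker_prefetch_multiplier = 1
    task_acks_late = False


class CustomCeleryConfig(CeleryConfig):
    # Hypothetical hosts: separate Redis instances for broker and results.
    broker_url = "redis://redis-broker:6379/0"
    result_backend = "redis://redis-results:6379/1"
    # Acknowledge a message only after the task finishes, so a crashed
    # worker does not silently drop an in-flight query.
    task_acks_late = True


CELERY_CONFIG = CustomCeleryConfig
```

Settings that are not overridden, such as `worker_prefetch_multiplier` here, are inherited from the base class unchanged.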
Start Celery Workers
Launch worker processes on each host designated for query processing:
celery -A superset.tasks.celery_app.celery_app worker \
--loglevel=INFO \
--concurrency=4 \
--queues=queries
The -A flag points to the Superset Celery app instance. The --queues=queries flag isolates workers to process only SQL Lab asynchronous queries, preventing resource contention with other background tasks like email reports or cache warming.
Optional: Run the Celery Beat Scheduler
For periodic tasks such as scheduled reports or cache refreshes, run the beat scheduler:
celery -A superset.tasks.celery_app.celery_app beat \
--loglevel=INFO
This process, defined in superset/tasks/scheduler.py, dispatches scheduled jobs to available workers at configured intervals.
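Beat reads its jobs from Celery's `beat_schedule` setting, which can live on the same configuration class. The fragment below is a sketch only: the local `CeleryConfig` is a stand-in for the class imported from `superset.config`, and the entry name and task name are hypothetical placeholders rather than tasks Superset is known to register.

```python
from datetime import timedelta


class CeleryConfig:
    # Stand-in for superset.config.CeleryConfig so this sketch is
    # self-contained.
    beat_schedule = {}


class ScheduledCeleryConfig(CeleryConfig):
    beat_schedule = {
        # Entry name and task name are hypothetical examples.
        "example-hourly-cache-warmup": {
            "task": "example.cache_warmup",
            # Celery accepts a timedelta (or a crontab) as the schedule.
            "schedule": timedelta(hours=1),
            "kwargs": {"top_n_dashboards": 10},
        },
    }


CELERY_CONFIG = ScheduledCeleryConfig
```

Only one beat process should run per deployment; running several would dispatch each scheduled job more than once.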
Critical Configuration Parameters
Understanding these settings ensures stable operation under varying workload conditions:
- CELERY_BROKER_URL: The connection string for the message broker (Redis, RabbitMQ, etc.). Defaults are constructed from REDIS_HOST, REDIS_PORT, and REDIS_CELERY_DB in superset/config.py.
- SQLLAB_ASYNC_TIME_LIMIT_SEC: Soft time limit for query execution. Tasks exceeding this duration raise SoftTimeLimitExceeded, allowing graceful termination without killing the worker process.
- SQLLAB_PAYLOAD_MAX_MB: Maximum serialized payload size permitted for storage in the results backend. Increase this for large result sets, but monitor storage consumption.
- CELERY_ALWAYS_EAGER: When set to True in superset/tasks/celery_app.py, tasks execute synchronously within the web process. This is useful for testing but should be disabled in production to enable true asynchronous processing.
- RESULTS_BACKEND_USE_MSGPACK: Enables msgpack or Apache Arrow IPC serialization for efficient result transfer between workers and the web application.
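The soft time limit behavior tied to SQLLAB_ASYNC_TIME_LIMIT_SEC can be sketched as follows. This is an assumed pattern for catching the exception, not the code in `execute_sql_task`: `run_with_soft_limit` and its callbacks are hypothetical, and the fallback exception class exists only so the example runs where Celery is not installed.

```python
try:
    from celery.exceptions import SoftTimeLimitExceeded
except ImportError:
    # Fallback stand-in so the sketch runs without Celery installed.
    class SoftTimeLimitExceeded(Exception):
        """Raised inside a task when its soft time limit elapses."""


def run_with_soft_limit(run_query, on_timeout):
    """Run a query callable and clean up if the soft limit fires.

    Hypothetical helper: Celery raises SoftTimeLimitExceeded *inside* the
    task, so the task can catch it, release resources, and return normally
    while the worker process stays alive.
    """
    try:
        return {"status": "success", "data": run_query()}
    except SoftTimeLimitExceeded:
        # For example: cancel the database cursor and mark the query timed out.
        on_timeout()
        return {"status": "timed_out", "data": None}
```

Because the exception is handled inside the task body, the worker is free to pick up the next message as soon as the cleanup callback returns.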
Production Tuning Recommendations
Optimize worker performance based on infrastructure constraints and query patterns.
Concurrency and Resource Allocation
Set --concurrency to match available CPU cores while respecting database connection pool limits. Each concurrent worker process maintains database connections; exceeding pool capacity causes connection failures.
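One rough way to reason about that trade-off is to cap concurrency by whichever is smaller: the CPU core count or the number of worker processes the database pool can serve. The helper below is a back-of-the-envelope sketch; `suggest_concurrency` and its parameters are hypothetical, and it assumes each worker process holds a fixed number of connections.

```python
import os
from typing import Optional


def suggest_concurrency(
    pool_limit: int,
    connections_per_process: int = 1,
    cpu_count: Optional[int] = None,
) -> int:
    """Suggest a --concurrency value bounded by cores and pool capacity.

    Hypothetical heuristic: never exceed the CPU core count, and never
    start more processes than the connection pool can serve.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    by_pool = pool_limit // max(connections_per_process, 1)
    # Always run at least one worker process.
    return max(1, min(cores, by_pool))
```

With 8 cores and a pool limit of 3 connections, the suggestion is 3, because the pool rather than the CPU is the binding constraint.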
Payload Size Management
Monitor SQLLAB_PAYLOAD_MAX_MB when enabling RESULTS_BACKEND_USE_MSGPACK. While msgpack reduces serialization overhead in superset/sql/execution/celery_task.py, extremely large payloads may still overwhelm Redis memory or S3 transfer limits.
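A size check of this kind can be approximated before a payload is written to the backend. The sketch below uses JSON from the standard library purely as a stand-in serializer, so the measured size will differ from what msgpack produces; `payload_size_mb` and `fits_payload_limit` are hypothetical helpers, not Superset functions.

```python
import json
from typing import Any


def payload_size_mb(payload: Any) -> float:
    # JSON is a stand-in here; the real serialized size depends on the
    # serializer configured for the results backend.
    return len(json.dumps(payload).encode("utf-8")) / (1024 * 1024)


def fits_payload_limit(payload: Any, max_mb: float) -> bool:
    """Return True when the serialized payload is within the limit."""
    return payload_size_mb(payload) <= max_mb
```

Logging the measured size alongside the configured limit makes it easier to decide whether to raise the limit or reduce the rows a query may return.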
Summary
- Configure CELERY_BROKER_URL in superset/config.py using Redis or RabbitMQ to enable message passing between the web application and workers.
- Launch workers with celery -A superset.tasks.celery_app.celery_app worker, optionally specifying --queues=queries for dedicated SQL Lab processing.
- The execute_sql_task in superset/sql/execution/celery_task.py handles query lifecycle management, impersonating users within a Flask request context for security compliance.
- Tune SQLLAB_ASYNC_TIME_LIMIT_SEC and SQLLAB_PAYLOAD_MAX_MB to prevent resource exhaustion from long-running queries or oversized result sets.
- Use RESULTS_BACKEND_USE_MSGPACK for efficient serialization of large query results.
Frequently Asked Questions
What message brokers does Superset support for Celery?
Superset supports any broker compatible with the Celery framework, including Redis, RabbitMQ, and Amazon SQS. Redis is the default and most common choice, configured via REDIS_HOST, REDIS_PORT, and REDIS_CELERY_DB variables in superset/config.py.
How does Superset handle security context in asynchronous tasks?
The execute_sql_task in superset/sql/execution/celery_task.py wraps execution in a Flask test request context and uses override_user to impersonate the original querier. This ensures all security checks, row-level security filters, and datasource permissions apply exactly as they would in a synchronous web request.
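The general idea of temporarily overriding the acting user can be illustrated with a small context manager. This is a conceptual sketch built on the standard library only; it is not Superset's `override_user`, and `override_user_sketch`, `current_user`, and the context variable are hypothetical names.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator, Optional

# Hypothetical holder for "who is acting right now" inside the worker.
_current_user: ContextVar[Optional[str]] = ContextVar("current_user", default=None)


def current_user() -> Optional[str]:
    return _current_user.get()


@contextmanager
def override_user_sketch(user: str) -> Iterator[None]:
    """Act as `user` inside the block, then restore the previous value."""
    token = _current_user.set(user)
    try:
        yield
    finally:
        # Restore even if the query raises, so identities never leak
        # from one task to the next on the same worker.
        _current_user.reset(token)
```

Inside `with override_user_sketch("alice"):` any permission check that consults `current_user()` sees `"alice"`, and the previous value is restored on exit.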
What happens if a query exceeds the async time limit?
When execution duration surpasses SQLLAB_ASYNC_TIME_LIMIT_SEC, Celery raises a SoftTimeLimitExceeded exception inside the task. The task then attempts graceful cleanup through _finalize_successful_query or its error handlers, so the query status ends up as success or failed depending on how far execution got. The worker process itself is not terminated.
Can I run different workers for SQL Lab queries and scheduled reports?
Yes. Use the --queues parameter when starting workers to isolate responsibilities. For example, launch one worker with --queues=queries for SQL Lab async execution and another with --queues=celery (or specific report queues) for email reports and alerts defined in superset/tasks/scheduler.py.
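Workers that listen on a dedicated queue only receive tasks that are routed to it, which Celery controls through its `task_routes` setting. The fragment below is a sketch: the task name `query_execution.execute_sql` comes from this article, while the `reports.*` pattern, the `reports` queue, and the `queue_for` helper are assumptions added for illustration. The helper only mimics glob matching so the mapping can be checked without a running broker.

```python
from fnmatch import fnmatch

# Celery-style routing table: task name (or glob pattern) -> queue.
task_routes = {
    "query_execution.execute_sql": {"queue": "queries"},
    # Hypothetical pattern and queue name for report tasks.
    "reports.*": {"queue": "reports"},
}

# "celery" is the name of Celery's default queue.
DEFAULT_QUEUE = "celery"


def queue_for(task_name: str) -> str:
    """Illustrative lookup that mimics glob-based route matching."""
    for pattern, route in task_routes.items():
        if fnmatch(task_name, pattern):
            return route["queue"]
    return DEFAULT_QUEUE
```

With routes like these, a worker started with `--queues=queries` would only pick up SQL execution tasks, and a second worker on `--queues=reports,celery` would handle everything else.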