How to Work with the Semantic Layer: Calculated Columns, Custom Metrics, and Virtual Datasets in Apache Superset

Apache Superset's semantic layer uses Dataset objects to abstract raw database tables into reusable entities. Datasets support calculated columns (SQL expressions), custom metrics (reusable aggregations), and virtual datasets (SQL queries). All three are stored as SQLAlchemy models and processed through Jinja templating before query execution.

Apache Superset's semantic layer bridges the gap between raw database tables and end-user visualizations through Dataset objects that encapsulate business logic. This layer enables data teams to define calculated columns, reusable metrics, and virtual datasets directly within the platform without modifying source databases. Understanding how these components are stored, rendered, and resolved in the source code is essential for building scalable analytics workflows.

Understanding the Semantic Layer Architecture

In Superset, the semantic layer is implemented through the SqlaTable class in superset/connectors/sqla/models.py. This class serves as the central abstraction between physical databases and charts, supporting three primary extension points:

  • Calculated columns: Virtual columns defined by SQL expressions that do not exist in the source table
  • Custom metrics: Reusable aggregation definitions (e.g., SUM(revenue)) attached to the Dataset
  • Virtual datasets: Datasets defined entirely by a SQL query rather than a physical table reference

Each element persists within the Dataset's SQLAlchemy relationships and participates in the query building pipeline.
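As a rough mental model, the three extension points can be pictured as plain data structures. The sketch below is not Superset's code: ColumnSketch, MetricSketch, and DatasetSketch are hypothetical stand-ins for TableColumn, SqlMetric, and SqlaTable, and the real SQLAlchemy models carry many more fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ColumnSketch:
    """Hypothetical stand-in for TableColumn."""
    column_name: str
    expression: Optional[str] = None  # populated only for calculated columns


@dataclass
class MetricSketch:
    """Hypothetical stand-in for SqlMetric."""
    metric_name: str
    expression: str  # an aggregation such as SUM(revenue)


@dataclass
class DatasetSketch:
    """Hypothetical stand-in for SqlaTable."""
    table_name: str
    sql: Optional[str] = None  # populated only for virtual datasets
    columns: list[ColumnSketch] = field(default_factory=list)
    metrics: list[MetricSketch] = field(default_factory=list)

    def calculated_columns(self) -> list[ColumnSketch]:
        # A column counts as calculated when it carries a SQL expression.
        return [c for c in self.columns if c.expression]


orders = DatasetSketch(
    table_name="raw_orders",
    columns=[
        ColumnSketch("order_date"),
        ColumnSketch("order_year", expression="EXTRACT(YEAR FROM order_date)"),
    ],
    metrics=[MetricSketch("total_revenue", "SUM(revenue)")],
)
calculated_names = [c.column_name for c in orders.calculated_columns()]
```

Here only order_year is reported as calculated, because order_date has no expression and therefore maps to a physical column.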

Calculated Columns: Adding Virtual Columns to Physical Tables

Calculated columns allow analysts to define SQL expressions that transform or derive data without altering the underlying database schema. According to the Apache Superset source code, these are stored as TableColumn objects with a populated expression attribute.

When a Dataset refreshes its metadata, SqlaTable.fetch_metadata() preserves calculated columns by re-adding every existing column that has an expression:


# From superset/connectors/sqla/models.py lines 1829-1831

columns.extend([col for col in old_columns if col.expression])

This ensures that physical columns from the database schema merge with user-defined calculated columns. The expression field contains raw SQL that Superset injects into queries when the column is referenced.
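The merge can be tried in isolation. The following is a minimal sketch, assuming a simplified column type: Col and merge_columns are hypothetical names, and only the extend line mirrors the quoted source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Col:
    """Hypothetical stand-in for TableColumn."""
    column_name: str
    expression: Optional[str] = None


def merge_columns(new_physical: list[Col], old_columns: list[Col]) -> list[Col]:
    """Start from the freshly read physical columns, then add back calculated ones."""
    columns = list(new_physical)
    columns.extend([col for col in old_columns if col.expression])
    return columns


old = [
    Col("order_date"),
    Col("legacy_flag"),  # physical column that has since been dropped
    Col("order_month", "DATE_TRUNC('month', order_date)"),  # calculated
]
fresh = [Col("order_date"), Col("region")]  # region is newly added upstream

merged_names = [c.column_name for c in merge_columns(fresh, old)]
```

The dropped physical column disappears, the new one appears, and the calculated column is carried over. Nothing in this step checks that the carried-over expression still refers to columns that exist.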

Custom Metrics: Creating Reusable Aggregations

Custom metrics (SqlMetric objects) provide consistent aggregation logic across multiple charts. Stored on the Dataset via the self.metrics relationship, metrics survive table refreshes through the add_missing_metrics() helper:


# From superset/connectors/sqla/models.py lines 86-92

def add_missing_metrics(self, metrics: list[SqlMetric]) -> None:
    """Merge missing metrics into the dataset."""
    existing = {m.metric_name for m in self.metrics}
    for metric in metrics:
        if metric.metric_name not in existing:
            self.metrics.append(metric)

This persistence mechanism ensures that business-critical KPIs remain attached to Datasets even after schema updates.
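One consequence of the name check is easy to miss: a metric that already exists under the same name is left untouched, so a customized expression is not overwritten by a default. A minimal sketch of that behavior, using hypothetical Metric and Dataset classes in place of the SQLAlchemy models:

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    """Hypothetical stand-in for SqlMetric."""
    metric_name: str
    expression: str


@dataclass
class Dataset:
    """Hypothetical stand-in for the dataset model."""
    metrics: list[Metric] = field(default_factory=list)

    def add_missing_metrics(self, metrics: list[Metric]) -> None:
        # Same logic as the snippet above: append only unseen metric names.
        existing = {m.metric_name for m in self.metrics}
        for metric in metrics:
            if metric.metric_name not in existing:
                self.metrics.append(metric)


ds = Dataset(metrics=[Metric("count", "COUNT(DISTINCT order_id)")])
ds.add_missing_metrics(
    [Metric("count", "COUNT(*)"), Metric("total_sales", "SUM(sales)")]
)
result = {m.metric_name: m.expression for m in ds.metrics}
```

After the call, count keeps its customized COUNT(DISTINCT order_id) definition and total_sales is added as a new metric.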

Virtual Datasets: Query-Based Data Sources

Virtual datasets bypass physical table constraints entirely, deriving data from a SQL query stored in the self.sql attribute. The is_virtual property identifies these Datasets:


# From superset/connectors/sqla/models.py lines 307-309

@property
def is_virtual(self) -> bool:
    return self.kind == DatasourceKind.VIRTUAL

Because virtual datasets execute arbitrary SQL, they need extra care when cache keys are computed. The get_extra_cache_keys() method incorporates row-level security (RLS) predicates into cache keys for virtual datasets:


# From superset/connectors/sqla/models.py lines 87-97

if self.is_virtual and self.sql:
    rls_predicates = collect_rls_predicates_for_sql(...)
    extra_cache_keys.extend(rls_predicates)

This ensures that users with different RLS rules never share cached results for the same virtual dataset query.
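To see why extra key material separates cached results, consider a simplified key function. This is a sketch of the general idea only: cache_key is a hypothetical helper, and Superset's real cache key is built from far more inputs than the SQL text.

```python
import hashlib
import json


def cache_key(sql: str, extra_cache_keys: list[str]) -> str:
    """Hash the query text together with any extra key material."""
    payload = json.dumps(
        {"sql": sql, "extra": sorted(extra_cache_keys)}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


sql = "SELECT region, SUM(sales) FROM raw_orders GROUP BY 1"

# Same SQL, different RLS predicates (the predicates are invented examples).
key_emea = cache_key(sql, ["region = 'EMEA'"])
key_apac = cache_key(sql, ["region = 'APAC'"])
key_emea_again = cache_key(sql, ["region = 'EMEA'"])
```

The two regions produce different keys even though the SQL is identical, while repeating the same predicate reproduces the same key, so users who share a rule can still share a cache entry.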

Rendering Jinja-Templated Expressions

All three semantic layer elements support Jinja templating for dynamic SQL generation. The REST API processes these templates in render_dataset_fields(), defined in superset/datasets/api.py:


# From superset/datasets/api.py lines 1397-1400

items = [
    ("query", "sql", "rendered_sql", processor.process_template),
    ("metric", "metrics", "metrics", render_item_list),
    ("calculated column", "columns", "columns", render_item_list),
]

The TemplateProcessor evaluates Jinja syntax in SQL expressions, metrics, and calculated columns before query execution. If rendering fails, the system raises a SupersetTemplateException to surface errors in the UI.
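The table-driven structure of that items list can be illustrated without Superset installed. In the sketch below, toy_process_template is a deliberately tiny stand-in for a Jinja processor (it only substitutes {{ name }} placeholders), render_fields is a hypothetical function, and the rendered_expression key is an assumption about the output shape.

```python
import re
from typing import Any, Callable


def toy_process_template(sql: str, context: dict[str, str]) -> str:
    """Toy stand-in for a Jinja processor: replaces {{ name }} with context[name]."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in context:
            raise ValueError(f"undefined template variable: {name}")
        return context[name]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, sql)


def render_fields(data: dict[str, Any], process: Callable[[str], str]) -> dict[str, Any]:
    """Render the dataset SQL plus every metric and column expression."""
    def render_item_list(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Items without an expression (physical columns) pass through unchanged.
        return [
            {**item, "rendered_expression": process(item["expression"])}
            if item.get("expression") else item
            for item in items
        ]

    plan = [
        ("sql", "rendered_sql", process),
        ("metrics", "metrics", render_item_list),
        ("columns", "columns", render_item_list),
    ]
    rendered = dict(data)
    for source_key, target_key, func in plan:
        if data.get(source_key):
            rendered[target_key] = func(data[source_key])
    return rendered


context = {"min_date": "'2024-01-01'"}
dataset = {
    "sql": "SELECT * FROM raw_orders WHERE order_date >= {{ min_date }}",
    "metrics": [{"metric_name": "total_sales", "expression": "SUM(sales)"}],
    "columns": [
        {"column_name": "order_date"},
        {"column_name": "is_recent", "expression": "order_date >= {{ min_date }}"},
    ],
}
rendered = render_fields(dataset, lambda s: toy_process_template(s, context))
```

Driving the work from one list of (source field, target field, renderer) entries keeps the handling of queries, metrics, and columns uniform; an undefined variable raises an error here, much as a failed render surfaces an exception in Superset.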

Query Resolution and Execution

Before a query runs, Superset must tell physical columns, ad hoc columns, and calculated columns apart. The has_extra_cache_key_calls() method in superset/connectors/sqla/models.py shows this lookup: it builds a dictionary of calculated expressions and collects every statement that may contain Jinja:

calculated_columns = {
    c.column_name: c.expression for c in self.columns if c.expression
}
for column_ in columns:
    if utils.is_adhoc_column(column_):
        templatable_statements.append(column_["sqlExpression"])
    elif isinstance(column_, str) and column_ in calculated_columns:
        templatable_statements.append(calculated_columns[column_])

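Ad hoc columns contribute their sqlExpression directly, while a plain column name contributes an expression only if it matches a calculated column; physical column names contribute nothing. As the method name indicates, the collected statements can then be inspected for template calls that affect the cache key.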
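The collection step can be sketched together with a check on the collected statements. This is an illustration under stated assumptions: is_adhoc_column is a simplified stand-in for the utils helper, CACHE_KEY_CALLS is a hypothetical pattern covering only a few Jinja calls, and has_cache_key_calls is not Superset's implementation.

```python
import re
from typing import Any

# Hypothetical, simplified pattern for Jinja calls whose result varies per
# user or per request and should therefore influence the cache key.
CACHE_KEY_CALLS = re.compile(
    r"\{\{.*(current_user_id|current_username|url_param)\(.*\).*\}\}"
)


def is_adhoc_column(column: Any) -> bool:
    """Simplified stand-in: an ad hoc column is a dict carrying its own SQL."""
    return isinstance(column, dict) and "sqlExpression" in column


def has_cache_key_calls(columns: list[Any], calculated_columns: dict[str, str]) -> bool:
    templatable_statements: list[str] = []
    for column_ in columns:
        if is_adhoc_column(column_):
            templatable_statements.append(column_["sqlExpression"])
        elif isinstance(column_, str) and column_ in calculated_columns:
            templatable_statements.append(calculated_columns[column_])
    # Physical column names never reach this point as statements.
    return any(CACHE_KEY_CALLS.search(s) for s in templatable_statements)


calculated = {
    "order_month": "DATE_TRUNC('month', order_date)",
    "is_mine": "CASE WHEN owner_id = {{ current_user_id() }} THEN 1 ELSE 0 END",
}
plain = has_cache_key_calls(["order_month", "region"], calculated)
per_user = has_cache_key_calls(["is_mine"], calculated)
adhoc = has_cache_key_calls(
    [{"label": "country", "sqlExpression": "{{ url_param('country') }}"}], {}
)
```

Only the second and third calls report a cache-relevant template call: the first query references a calculated column whose expression is static SQL.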

Practical Implementation Examples

Creating a Calculated Column via REST API

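import requests

# DATE_FORMAT is MySQL syntax; use your database's equivalent
# (for example TO_CHAR(order_date, 'YYYY-MM') on PostgreSQL).
payload = {
    "column_name": "order_month",
    "type": "STRING",
    "expression": "DATE_FORMAT(order_date, '%Y-%m')",
}
# The dataset ID (42) and endpoint path are illustrative; confirm the route
# against the API documentation for your Superset version.
response = requests.post(
    "http://localhost:8088/api/v1/dataset/42/columns/",
    headers={"Authorization": "Bearer <TOKEN>"},
    json=payload,  # serializes the body and sets Content-Type: application/json
    timeout=30,
)
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())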

Defining a Custom Metric

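import requests

metric = {
    "metric_name": "total_sales_by_month",
    "expression": "SUM(sales)",  # already an aggregate; charts reference it by name
    "verbose_name": "Total Sales (by month)",
}
# Dataset ID (42) and route are illustrative; verify them for your Superset version.
response = requests.post(
    "http://localhost:8088/api/v1/dataset/42/metrics/",
    headers={"Authorization": "Bearer <TOKEN>"},
    json=metric,  # serializes the body and sets Content-Type: application/json
    timeout=30,
)
response.raise_for_status()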

Creating a Virtual Dataset

virtual_sql = """
SELECT
    DATE_FORMAT(order_date, '%Y-%m') AS order_month,
    SUM(sales) AS total_sales
FROM raw_orders
GROUP BY 1
"""

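# POST to /api/v1/dataset/ with "sql": virtual_sql, plus the target database and a dataset name.
# Superset treats a dataset that carries SQL as virtual; confirm the request schema
# for your version before sending an explicit "is_virtual" flag.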

Using Semantic Layer Elements in Queries

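-- Schematic only: each placeholder stands for the expression Superset
-- substitutes when a chart references the column or metric by name.
SELECT
    {{ order_month }} AS month,            -- DATE_FORMAT(order_date, '%Y-%m')
    {{ total_sales_by_month }} AS revenue  -- SUM(sales); already aggregated, so no outer SUM
FROM "virtual_dataset"
GROUP BY 1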
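The substitution idea can be made concrete with a small string-based sketch. Superset itself assembles queries through SQLAlchemy, so build_select below is a hypothetical helper meant only to show how named references map to stored expressions; it does no quoting or escaping and is not suitable for untrusted input.

```python
def build_select(
    table: str,
    dimensions: list[str],
    metric_names: list[str],
    calculated_columns: dict[str, str],
    metrics: dict[str, str],
) -> str:
    """Expand column and metric names into their stored SQL expressions."""
    select_parts = []
    for name in dimensions:
        # Calculated columns expand to their expression; physical ones stay as-is.
        expr = calculated_columns.get(name, name)
        select_parts.append(f"{expr} AS {name}")
    for name in metric_names:
        # A metric is already an aggregate, so it is inserted without wrapping.
        select_parts.append(f"{metrics[name]} AS {name}")

    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if dimensions:
        positions = ", ".join(str(i) for i in range(1, len(dimensions) + 1))
        sql += f" GROUP BY {positions}"
    return sql


query = build_select(
    table="raw_orders",
    dimensions=["order_month"],
    metric_names=["total_sales_by_month"],
    calculated_columns={"order_month": "DATE_FORMAT(order_date, '%Y-%m')"},
    metrics={"total_sales_by_month": "SUM(sales)"},
)
```

The resulting query selects DATE_FORMAT(order_date, '%Y-%m') AS order_month and SUM(sales) AS total_sales_by_month from raw_orders, grouped by the first select position.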

Summary

  • Calculated columns persist as TableColumn objects with a populated expression attribute and survive metadata refreshes because SqlaTable.fetch_metadata() in superset/connectors/sqla/models.py adds them back after reloading the physical columns.
  • Custom metrics are stored as SqlMetric objects and merged via add_missing_metrics(), providing reusable aggregation logic across charts.
  • Virtual datasets are identified by is_virtual (kind == DatasourceKind.VIRTUAL) and get RLS-aware cache keys through get_extra_cache_keys().
  • Jinja templating is applied uniformly to all three element types by render_dataset_fields() in superset/datasets/api.py.
  • has_extra_cache_key_calls() distinguishes ad hoc columns from named calculated columns and collects their expressions as templatable statements.

Frequently Asked Questions

How do calculated columns differ from custom metrics in Superset?

Calculated columns define SQL expressions that create new column values (like DATE_FORMAT(order_date, '%Y-%m')), while custom metrics define aggregations applied to columns (like SUM(sales)). Calculated columns appear as selectable dimensions in the Explore view, whereas metrics appear as measurable values. Under the hood, calculated columns use the TableColumn model with an expression field, while metrics use the SqlMetric model.

Do virtual datasets support row-level security (RLS)?

Yes. According to the source code in superset/connectors/sqla/models.py, virtual datasets automatically incorporate RLS predicates into their cache keys via get_extra_cache_keys(). When self.is_virtual is true and self.sql is defined, the system calls collect_rls_predicates_for_sql() to append security filters to the cache key calculation, ensuring users with different permissions never share cached results.

Can I use Jinja templating in calculated columns and metrics?

Yes. Superset supports Jinja templating across all semantic layer elements. The render_dataset_fields() method in superset/datasets/api.py processes templates for queries, metrics, and calculated columns using the TemplateProcessor. This allows your SQL expressions to reference filters, user attributes, and macro functions dynamically.

What happens to my calculated columns when the underlying table schema changes?

Calculated columns persist through schema changes. When you refresh a Dataset's metadata (manually or automatically), SqlaTable.fetch_metadata() extends the new physical column list with the existing columns that have an expression attribute set. This merge logic at lines 1829-1831 in superset/connectors/sqla/models.py ensures your calculated columns survive schema updates. The merge does not validate them, so verify that each expression still references columns that exist in the updated physical structure.
