How the Terraform Data Block Works: From Parsing to Provider Execution
The terraform data block is a read-only configuration construct that retrieves external information via the provider's ReadDataSource gRPC call, caching results for interpolation across your configuration.
The HashiCorp Terraform data block allows configurations to query existing infrastructure without managing lifecycle operations. Unlike managed resources, data sources provide read-only access to external state through a specific internal pipeline involving HCL parsing, address resolution, and provider RPC communication. This article traces the exact path from configuration syntax to provider execution based on the source code implementation.
Parsing and Address Resolution
When Terraform loads configuration, the HCL parser identifies blocks with the type identifier data and constructs a unique address in the format data.<TYPE>.<NAME>.
In internal/configs/parser_config.go, the parser handles the "data" case within the parseBlock switch statement (lines 215–224). This extracts the block's arguments and validates the structure before creating a data source configuration object.
The address string undergoes further resolution in internal/addrs/parse_ref.go. Here, the parser converts the raw data.<TYPE>.<NAME> reference into a structured addrs.DataResource object (see the case "data": handling around lines 216–224). This typed object carries the data source type name and local identifier, serving as the canonical reference throughout planning and apply phases.
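As a concrete illustration (a generic sketch; the `aws_vpc` type is used only as an example), the block below is parsed into an `addrs.DataResource` whose canonical address is `data.aws_vpc.main`:

```hcl
# Parsed into an addrs.DataResource with Type = "aws_vpc" and
# Name = "main"; referenced elsewhere in the configuration as
# data.aws_vpc.main.
data "aws_vpc" "main" {
  default = true
}
```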
Schema Discovery and Validation
Before executing any read operations, Terraform must understand the data source's expected arguments and exported attributes. During provider initialization, the core requests the provider schema via the GetProviderSchema RPC.
The response structure, defined in internal/tfplugin6/tfplugin6.pb.go, contains a DataSourceSchemas map (lines 3900–3915) keyed by type name. This schema dictates which arguments Terraform can pass to the data source and which attributes the provider will return.
For early error detection, Terraform optionally invokes the ValidateDataSourceConfig RPC if the provider implements it (defined in internal/tfplugin5/tfplugin5_grpc.pb.go, lines 44–60). This validation occurs during the plan phase, catching configuration errors before any network calls to the cloud API.
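As an illustration (a hypothetical misconfiguration, not taken from the source), an argument that the provider schema does not declare is rejected during validation, before any network call is made:

```hcl
# This block fails schema validation: the AWS provider's aws_ami
# schema declares no "color" argument, so Terraform reports an
# "Unsupported argument" diagnostic before ReadDataSource is called.
data "aws_ami" "bad" {
  most_recent = true
  color       = "blue" # not part of the aws_ami schema
}
```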
The ReadDataSource RPC Execution
During planning or refresh operations, Terraform constructs a ReadDataSource_Request containing the data source type name, evaluated argument values, and provider metadata. The core then calls the provider's ReadDataSource gRPC method, defined in internal/tfplugin6/tfplugin6_grpc.pb.go (lines 292–301).
The provider returns a ReadDataSource_Response struct (lines 5910–5940 in internal/tfplugin6/tfplugin6.pb.go), which includes:
- A State object (DynamicValue) holding the exported attributes
- Diagnostics for any errors encountered
- A Deferred flag indicating whether the read should be retried later
Unlike resource blocks, data sources implement only the Read lifecycle operation. There are no Create, Update, or Delete RPCs associated with a terraform data block.
Caching and Evaluation
Terraform stores the returned state in an internal data source cache, making attributes available for use in expressions via data.<TYPE>.<NAME>.<ATTR> (or the legacy ${...} interpolation syntax inside strings). The evaluation logic resides in internal/terraform/expr/data_source.go within the EvalDataSource function (lines 30–57), which resolves attribute references against cached state.
The data source result remains cached for the duration of the run. Subsequent references to the same data source reuse the cached value, preventing duplicate API calls. Terraform persists these results in the state file (handled in internal/states/statefile/version4.go) to maintain consistency across plan and apply phases.
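A sketch of this behavior (illustrative resource names, not from the source): two resources referencing the same data source trigger only a single provider read per run.

```hcl
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]
}

# Both references below resolve against the same cached
# ReadDataSource result; the provider is queried once per run,
# not once per reference.
resource "aws_instance" "a" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}

resource "aws_instance" "b" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}
```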
Practical Configuration Examples
Querying Existing Infrastructure
The following configuration retrieves the latest Ubuntu AMI from AWS and uses it in a managed resource:
```hcl
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }
}

resource "aws_instance" "example" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}
```
During execution, Terraform parses this block in parser_config.go, validates it against the AWS provider schema, and calls ReadDataSource to populate data.aws_ami.ubuntu.id.
Iterating Over Data Sources
Data sources work with meta-arguments like for_each to create multiple resources based on queried data:
```hcl
data "aws_subnet_ids" "selected" {
  vpc_id = var.vpc_id

  filter {
    name   = "tag:Env"
    values = ["prod"]
  }
}

resource "aws_instance" "per_subnet" {
  for_each = toset(data.aws_subnet_ids.selected.ids)

  subnet_id     = each.value
  ami           = var.ami_id
  instance_type = "t3.micro"
}
```
The data block yields a collection attribute (ids) that for_each consumes, creating an independent resource instance for each subnet.
Encapsulating Data Queries in Modules
Child modules can encapsulate data source logic and expose derived values through outputs:
```hcl
# Root module
module "vpc_info" {
  source = "./modules/vpc-info"
  vpc_id = aws_vpc.main.id
}

# modules/vpc-info/main.tf
data "aws_vpc" "selected" {
  id = var.vpc_id
}

output "cidr_block" {
  value = data.aws_vpc.selected.cidr_block
}
```
The parent module accesses the queried data via module.vpc_info.cidr_block without exposing the internal data source implementation.
Summary
- Pipeline: The terraform data block follows a strict pipeline: parsing in internal/configs/parser_config.go, address resolution via addrs.DataResource, and execution through the ReadDataSource RPC.
- Read-only architecture: Data sources implement only the Read lifecycle operation, with no Create, Update, or Delete equivalents.
- Provider communication: Schema discovery uses GetProviderSchema, optional validation uses ValidateDataSourceConfig, and data retrieval uses ReadDataSource (all defined in the tfplugin6 gRPC interfaces).
- Caching mechanism: Results are cached during the run and persisted to state files via internal/states/statefile/version4.go to prevent redundant API calls.
- Evaluation: Attribute references resolve through EvalDataSource in internal/terraform/expr/data_source.go, making external data available to managed resources.
Frequently Asked Questions
What is the difference between a terraform data block and a resource block?
A terraform data block performs read-only queries against existing infrastructure through the ReadDataSource RPC, while a resource block manages full lifecycle operations including Create, Update, and Delete. Data sources cannot modify external state; they only retrieve and cache information for use in expressions.
When does Terraform refresh data sources during the plan phase?
Terraform evaluates data sources during the planning phase unless the read is deferred because of unknown argument values or provider-side limitations. When deferred, the ReadDataSource call executes during the apply phase instead. The engine caches the result after the first successful read, so subsequent references to data.<TYPE>.<NAME> reuse the cached value rather than re-invoking the provider.
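A common trigger for deferral (an illustrative sketch, not from the source): when a data source's arguments depend on a managed resource that does not exist yet, the value is unknown at plan time and the read waits until apply.

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# On the first run, aws_vpc.main.id is unknown during planning, so
# the ReadDataSource call for this block is deferred until apply.
data "aws_vpc" "lookup" {
  id = aws_vpc.main.id
}
```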
Can terraform data blocks use count or for_each meta-arguments?
Yes, data sources support the count and for_each meta-arguments, available since Terraform 0.13. When using these, Terraform creates multiple data source instances indexed by the iteration key, each generating an independent ReadDataSource RPC call. The resulting instances are referenced using index syntax; conversion functions such as toset() adapt lists for use with for_each.
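A minimal count example (hypothetical AMI names, for illustration only), showing how each instance maps to its own provider read:

```hcl
variable "ami_names" {
  type    = list(string)
  default = ["app-v1", "app-v2"]
}

# One ReadDataSource call per element; the instances are addressed
# as data.aws_ami.app[0] and data.aws_ami.app[1].
data "aws_ami" "app" {
  count       = length(var.ami_names)
  most_recent = true

  filter {
    name   = "name"
    values = [var.ami_names[count.index]]
  }
}
```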
How does Terraform handle data source failures during planning?
If the provider returns an error in the ReadDataSource_Response struct (defined in internal/tfplugin6/tfplugin6.pb.go), Terraform displays the diagnostic and stops the plan execution before any infrastructure changes occur. For providers implementing ValidateDataSourceConfig, syntax errors are caught even earlier during the validation phase, preventing unnecessary network calls to the cloud API.