# Implementing Shadow AI Discovery to Find Unregistered Agents in the Microsoft Agent Governance Toolkit

> Implement Shadow AI Discovery with Agent Governance Toolkit to find unregistered agents. Analyze processes, files, and repos using regex discovery rules and confidence scoring.

- Repository: [Microsoft/agent-governance-toolkit](https://github.com/microsoft/agent-governance-toolkit)
- Tags: how-to-guide
- Published: 2026-05-29

---

**The ShadowDiscoveryScanner provides an SDK-level engine that locates unregistered AI agents by analyzing process commands, filesystem artifacts, inline text, and GitHub repositories using regex-based discovery rules and confidence scoring.**

The Microsoft Agent Governance Toolkit includes a lightweight discovery engine designed to identify shadow AI: agents running in your environment that lack proper registration or governance. The `ShadowDiscoveryScanner`, implemented in Go, offers four scanning surfaces for detecting unregistered agents, from running processes to buried configuration files.

## How Shadow AI Discovery Works

The scanner operates as a stateless, rule-based engine that aggregates evidence across multiple data sources. According to the Microsoft Agent Governance Toolkit source code, the `ShadowDiscoveryScanner` in [`agent-governance-golang/packages/agentmesh/discovery.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/agent-governance-golang/packages/agentmesh/discovery.go) exposes methods to scan text, processes, directories, and remote repositories.

Each scanning method returns **DiscoveredAgent** objects that aggregate multiple **DiscoveryEvidence** signals. The scanner distinguishes four **DetectionBasis** categories (`DetectionProcess`, `DetectionConfigFile`, `DetectionGitHubRepo`, and inline text) and tracks **AgentStatus** as `registered`, `unregistered`, `shadow`, or `unknown`.

### Scanning Inline Text and Process Commands

The `ScanText` method walks line-by-line through arbitrary strings, matching against built-in `discoveryRules` regular expressions to identify API keys, framework imports, or agent signatures. For live process discovery, `ScanProcessCommands` feeds command-line strings into the same text scanner, while `ScanProcesses` handles deduplication using process IDs.

In [`discovery.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/agent-governance-golang/packages/agentmesh/discovery.go), the process scanner builds fingerprints from PID and command-line combinations so that the same running agent isn't counted twice. The two text-oriented entry points are called as follows:

```go
// Create a scanner with the built-in discovery rules
scanner := agentmesh.NewShadowDiscoveryScanner()

// Scan arbitrary text for agent signatures
textFindings := scanner.ScanText("sample.env",
    "OPENAI_API_KEY=sk-test-not-real\nimport langchain\n")

// Scan process command lines
cmds := []string{
    "/usr/bin/python -m crewai.run",
    "node /opt/agent/mcp-server.js",
}
procFindings := scanner.ScanProcessCommands(cmds)
```
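
To make the line-by-line matching concrete, here is a minimal, self-contained sketch of a rule-based text scanner. The `rule` and `finding` types, the three patterns, and their confidence weights are illustrative assumptions; they are not the toolkit's actual `discoveryRules`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rule pairs a regular expression with a label and a confidence weight.
// Hypothetical: these are not the toolkit's built-in discoveryRules.
type rule struct {
	name       string
	pattern    *regexp.Regexp
	confidence float64
}

// finding records which rule matched on which line of which source.
type finding struct {
	source     string
	line       int
	rule       string
	confidence float64
}

var rules = []rule{
	{"openai-api-key", regexp.MustCompile(`OPENAI_API_KEY\s*=`), 0.6},
	{"langchain-import", regexp.MustCompile(`\b(import|from)\s+langchain\b`), 0.7},
	{"crewai-module", regexp.MustCompile(`\bcrewai\b`), 0.7},
}

// scanText walks the text line by line and applies every rule to each line.
func scanText(source, text string) []finding {
	var out []finding
	for i, line := range strings.Split(text, "\n") {
		for _, r := range rules {
			if r.pattern.MatchString(line) {
				out = append(out, finding{source, i + 1, r.name, r.confidence})
			}
		}
	}
	return out
}

func main() {
	text := "OPENAI_API_KEY=sk-test-not-real\nimport langchain\n"
	for _, f := range scanText("sample.env", text) {
		fmt.Printf("%s:%d matched %s (confidence %.1f)\n", f.source, f.line, f.rule, f.confidence)
	}
}
```

Because a process command line is just another string, the same function could be applied to each entry of a command list, which is the relationship the text describes between `ScanProcessCommands` and `ScanText`.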

### Scanning Filesystem Configuration

The `ScanConfigPaths` method walks directory trees while skipping known large or irrelevant directories, hunting for agent-specific configuration files, dependency manifests, and content matching discovery rules. Each match generates a **DiscoveredAgent** with attached evidence pointing to the specific file path and line number.

The implementation resides in [`discovery.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/agent-governance-golang/packages/agentmesh/discovery.go) lines 96-135, where the scanner recursively traverses supplied paths and applies both filename patterns (`configPatterns`) and content signatures (`processSignatures`).
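
The traversal pattern can be sketched with the standard library alone. In the example below, the `skipDirs` and `configNames` sets, the depth semantics, and the helper names are all assumptions made for illustration; the toolkit's own lists and limits may differ.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// skipDirs and configNames are illustrative assumptions, not the toolkit's lists.
var skipDirs = map[string]bool{".git": true, "node_modules": true, "vendor": true}
var configNames = map[string]bool{"agentmesh.yaml": true, "mcp.json": true}

// findConfigFiles walks root, prunes skipped directories and directories nested
// more than maxDepth levels below root, and returns the relative paths of files
// whose base name is a known config file.
func findConfigFiles(root string, maxDepth int) ([]string, error) {
	var hits []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		depth := 0
		if rel != "." {
			depth = strings.Count(rel, string(filepath.Separator)) + 1
		}
		if d.IsDir() {
			if rel != "." && (skipDirs[d.Name()] || depth > maxDepth) {
				return filepath.SkipDir // do not descend into this directory
			}
			return nil
		}
		if configNames[d.Name()] {
			hits = append(hits, rel)
		}
		return nil
	})
	return hits, err
}

// writeFixture creates a small temporary tree and returns its root and a cleanup func.
func writeFixture() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "walk-sketch-*")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.RemoveAll(root) }
	for _, rel := range []string{"agentmesh.yaml", "svc/mcp.json", "node_modules/pkg/mcp.json"} {
		p := filepath.Join(root, filepath.FromSlash(rel))
		if err = os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			cleanup()
			return "", nil, err
		}
		if err = os.WriteFile(p, []byte("{}\n"), 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := writeFixture()
	if err != nil {
		panic(err)
	}
	defer cleanup()
	hits, err := findConfigFiles(root, 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(hits) // the copy under node_modules is skipped
}
```

Returning `filepath.SkipDir` from the callback prunes a whole subtree, which is what keeps a scan from descending into large dependency directories.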

### Scanning GitHub Repositories

For shadow AI discovery in source control, `ScanGitHubRepositories` utilizes a thin `GitHubDiscoveryClient` wrapper around the GitHub Contents API. The scanner fetches raw files from specified repositories, then applies the same config-file and dependency-file logic used in local filesystem scans.

This enables organizations to identify unregistered agents defined in infrastructure-as-code repositories or microservice definitions before they deploy to production:

```go
// Create a client for the GitHub Contents API using an access token
client := agentmesh.NewGitHubDiscoveryClient("<TOKEN>")
client.BaseURL = "https://api.github.com"

// Scan the listed repositories (owner/name form)
scanner := agentmesh.NewShadowDiscoveryScanner()
result := scanner.ScanGitHubRepositories(client, []string{"octo/demo", "myorg/agent-repo"})
```
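
For readers unfamiliar with the GitHub Contents API, the following sketch shows the shape of the underlying HTTP request for a single file. The `newContentsRequest` helper and its signature are hypothetical and are not part of `GitHubDiscoveryClient`; only the endpoint layout and headers come from GitHub's public REST API.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newContentsRequest builds a GET request for one file through the GitHub REST
// endpoint /repos/{owner}/{repo}/contents/{path}.
// Hypothetical helper: not the toolkit's client implementation.
func newContentsRequest(baseURL, repo, filePath, token string) (*http.Request, error) {
	owner, name, ok := strings.Cut(repo, "/")
	if !ok || owner == "" || name == "" {
		return nil, fmt.Errorf("repository %q is not in owner/name form", repo)
	}
	// Escape each path segment separately so "/" keeps separating directories.
	segs := strings.Split(filePath, "/")
	for i, s := range segs {
		segs[i] = url.PathEscape(s)
	}
	endpoint := fmt.Sprintf("%s/repos/%s/%s/contents/%s",
		strings.TrimRight(baseURL, "/"),
		url.PathEscape(owner), url.PathEscape(name), strings.Join(segs, "/"))

	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	// Ask for the raw file body instead of the default base64-encoded JSON.
	req.Header.Set("Accept", "application/vnd.github.raw+json")
	req.Header.Set("X-GitHub-Api-Version", "2022-11-28")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	req, err := newContentsRequest("https://api.github.com", "octo/demo", "agentmesh.yaml", "<TOKEN>")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending the request with an `http.Client` and passing the response body to a text scanner would complete the loop; that part is omitted here to keep the sketch free of network calls.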

## Core Types and Confidence Aggregation

The discovery engine employs a probabilistic model to combine weak signals into high-confidence findings. Four core types drive this aggregation:

- **DetectionBasis** – Enum describing how evidence was gathered (process, config file, GitHub repo, or inline text)
- **AgentStatus** – Classification: `registered`, `unregistered`, `shadow`, or `unknown`
- **DiscoveryEvidence** – Individual signal containing scanner name, basis, confidence score, timestamp, and raw data
- **DiscoveredAgent** – Aggregated view with fingerprint, status, and computed confidence
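
As a rough mental model, the four types could be laid out as below. This is a sketch under stated assumptions: the three `Detection*` constant names and the `Name`, `AgentType`, `Confidence`, and `Evidence` fields appear elsewhere in this article, while the string values, the `Status*` constant names, and the remaining field names are guesses made for illustration. No constant is shown for the inline-text basis because its name is not given here.

```go
package main

import (
	"fmt"
	"time"
)

// DetectionBasis describes how a piece of evidence was gathered.
type DetectionBasis string

// AgentStatus classifies a discovered agent.
type AgentStatus string

const (
	// Constant names come from the article; the string values are assumed.
	DetectionProcess    DetectionBasis = "process"
	DetectionConfigFile DetectionBasis = "config_file"
	DetectionGitHubRepo DetectionBasis = "github_repo"

	// Hypothetical constant names for the four documented statuses.
	StatusRegistered   AgentStatus = "registered"
	StatusUnregistered AgentStatus = "unregistered"
	StatusShadow       AgentStatus = "shadow"
	StatusUnknown      AgentStatus = "unknown"
)

// DiscoveryEvidence is one individual signal (field names partly assumed).
type DiscoveryEvidence struct {
	Scanner    string
	Basis      DetectionBasis
	Confidence float64
	Timestamp  time.Time
	RawData    map[string]string
}

// DiscoveredAgent is the aggregated view over all evidence for one agent.
type DiscoveredAgent struct {
	Fingerprint string
	Name        string
	AgentType   string
	Status      AgentStatus
	Confidence  float64
	Evidence    []DiscoveryEvidence
}

// describe renders a one-line summary of an agent.
func describe(a DiscoveredAgent) string {
	return fmt.Sprintf("%s status=%s confidence=%.2f evidence=%d",
		a.Name, a.Status, a.Confidence, len(a.Evidence))
}

func main() {
	agent := DiscoveredAgent{
		Name:       "demo-mcp",
		AgentType:  "mcp-server",
		Status:     StatusShadow,
		Confidence: 0.64,
		Evidence: []DiscoveryEvidence{{
			Scanner:    "config",
			Basis:      DetectionConfigFile,
			Confidence: 0.64,
			Timestamp:  time.Now(),
			RawData:    map[string]string{"path": "mcp.json"},
		}},
	}
	fmt.Println(describe(agent))
}
```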

### The Noisy-OR Confidence Model

The `AddEvidence` method in [`discovery.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/agent-governance-golang/packages/agentmesh/discovery.go) (lines 82-92) implements a noisy-OR aggregation formula that allows multiple low-confidence signals to combine into high-confidence findings. Each incoming confidence value is clamped to **[0, 1]** and combined with existing evidence using:

```
combined = 1 - (1 - prior) * (1 - new)
```

Under this model a single weak match stays weak on its own, while corroborating signals push the combined score toward 1, so cumulative evidence can surface genuine shadow agents. The combined score never decreases as evidence is added. Unit tests in [`discovery_test.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/discovery_test.go) verify this behavior across edge cases.
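
A short worked example makes the arithmetic easier to follow. The sketch below reimplements only the formula and the clamping described above; the function names `clamp`, `combine`, and `aggregate` are illustrative and are not the toolkit's `AddEvidence` method.

```go
package main

import "fmt"

// clamp restricts v to the closed interval [0, 1].
func clamp(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// combine applies one noisy-OR step: 1 - (1 - prior) * (1 - next).
func combine(prior, next float64) float64 {
	return 1 - (1-clamp(prior))*(1-clamp(next))
}

// aggregate folds the signals left to right, starting from zero confidence.
func aggregate(signals ...float64) float64 {
	total := 0.0
	for _, s := range signals {
		total = combine(total, s)
	}
	return total
}

func main() {
	fmt.Printf("%.3f\n", aggregate(0.4))           // 0.400
	fmt.Printf("%.3f\n", aggregate(0.4, 0.4))      // 0.640
	fmt.Printf("%.3f\n", aggregate(0.4, 0.4, 0.4)) // 0.784
	fmt.Printf("%.3f\n", aggregate(0.9, 1.7))      // 1.000 (1.7 is clamped to 1)
}
```

Each additional 0.4 signal multiplies the remaining doubt by 0.6, which is why the score climbs from 0.4 to 0.64 to 0.784 without ever exceeding 1.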

## Fingerprinting and Deduplication

To prevent duplicate entries for the same logical agent, the scanner generates stable hashes from **merge keys**—combinations of PID plus truncated command lines for processes, or normalized file paths for configuration files. This fingerprinting logic ensures that an agent detected via process scanning and again via filesystem scanning merges into a single **DiscoveredAgent** record with combined evidence from both sources.
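
The idea of hashing a merge key into a stable fingerprint can be sketched as follows. The key formats, the 64-byte truncation length, the choice of SHA-256, and the helper names are all assumptions for illustration; the toolkit's actual key layout and hash may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
	"strings"
)

// maxCmdLen is a hypothetical truncation length for command lines.
const maxCmdLen = 64

// fingerprint hashes a merge key into a short, stable hex identifier.
func fingerprint(mergeKey string) string {
	sum := sha256.Sum256([]byte(mergeKey))
	return hex.EncodeToString(sum[:8])
}

// processKey combines a PID with a trimmed, truncated command line.
func processKey(pid int, cmd string) string {
	cmd = strings.TrimSpace(cmd)
	if len(cmd) > maxCmdLen {
		cmd = cmd[:maxCmdLen]
	}
	return fmt.Sprintf("process:%d:%s", pid, cmd)
}

// fileKey normalizes a path so equivalent spellings map to one key.
func fileKey(path string) string {
	return "file:" + filepath.ToSlash(filepath.Clean(path))
}

func main() {
	seen := map[string]int{} // fingerprint -> number of sightings
	keys := []string{
		processKey(4242, "/usr/bin/python -m crewai.run"),
		processKey(4242, "/usr/bin/python -m crewai.run  "), // same process seen again
		fileKey("./svc/../svc/mcp.json"),
		fileKey("svc/mcp.json"), // same file, different spelling
	}
	for _, k := range keys {
		seen[fingerprint(k)]++
	}
	fmt.Println("distinct agents:", len(seen)) // distinct agents: 2
}
```

Keying a map by fingerprint means a repeated sighting updates an existing record instead of creating a new one, which is where additional evidence would be attached.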

## Practical Implementation

Below is a complete, runnable example demonstrating the full shadow AI discovery workflow. This mirrors the reference implementation in [`examples/shadow-discovery/main.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/examples/shadow-discovery/main.go):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	agentmesh "github.com/microsoft/agent-governance-toolkit/agent-governance-golang/packages/agentmesh"
)

func main() {
	// Create the scanner with built-in heuristics
	scanner := agentmesh.NewShadowDiscoveryScanner()

	// 1. Scan arbitrary text (environment files or code snippets)
	textFindings := scanner.ScanText("sample.env",
		"OPENAI_API_KEY=sk-test-not-real\nimport langchain\n")
	fmt.Printf("Text findings: %d\n", len(textFindings))

	// 2. Scan process command lines
	cmds := []string{
		"/usr/bin/python -m crewai.run",
		"node /opt/agent/mcp-server.js",
	}
	procFindings := scanner.ScanProcessCommands(cmds)
	fmt.Printf("Process findings: %d\n", len(procFindings))

	// 3. Scan directory tree
	root := buildFixture()
	defer os.RemoveAll(root)

	cfgResult := scanner.ScanConfigPaths([]string{root}, 5)
	fmt.Printf("Config scan: %d agents discovered, %d errors\n",
		len(cfgResult.Agents), len(cfgResult.Errors))

	for _, a := range cfgResult.Agents {
		fmt.Printf("  %-30s type=%-12s confidence=%.2f evid=%d\n",
			a.Name, a.AgentType, a.Confidence, len(a.Evidence))
	}
}

// buildFixture writes a small temporary tree containing agent artifacts
// and returns its root directory. It exits the program on any I/O error.
func buildFixture() string {
	dir, err := os.MkdirTemp("", "shadow-discovery-*")
	if err != nil {
		log.Fatalf("create temp dir: %v", err)
	}
	files := map[string]string{
		"agentmesh.yaml": "agent_id: did:agentmesh:demo\n",
		"src/handler.py": "from langchain import agents\n",
		"mcp.json":       `{"server":"demo-mcp"}`,
	}
	for rel, body := range files {
		path := filepath.Join(dir, rel)
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			log.Fatalf("create directory for %s: %v", path, err)
		}
		if err := os.WriteFile(path, []byte(body), 0o644); err != nil {
			log.Fatalf("write %s: %v", path, err)
		}
	}
	return dir
}
```

Running this program prints the number of text and process findings, then one line per aggregated **DiscoveredAgent** with its name, type, evidence count, and the confidence score derived from the noisy-OR formula.

## Summary

- **ShadowDiscoveryScanner** provides four scanning surfaces: text, process commands, filesystem paths, and GitHub repositories.
- **Fingerprinting** using merge keys (PID + command line or file paths) prevents duplicate agent entries across data sources.
- **Noisy-OR confidence aggregation** (`1 - (1-prior)*(1-new)`) combines weak signals into actionable high-confidence findings.
- The scanner is **stateless and extensible**—add custom regex rules to `discoveryRules` or extend `configPatterns` for new frameworks.
- All core logic resides in [`agent-governance-golang/packages/agentmesh/discovery.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/agent-governance-golang/packages/agentmesh/discovery.go) with comprehensive tests in [`discovery_test.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/discovery_test.go).

## Frequently Asked Questions

### How does Shadow AI Discovery differentiate between registered and unregistered agents?

The scanner assigns **AgentStatus** based on evidence context. Registered agents typically appear in governance registries with stable identifiers, while shadow agents lack these markers. The `ScanConfigPaths` method specifically looks for governance metadata files (like [`agentmesh.yaml`](https://github.com/microsoft/agent-governance-toolkit/blob/main/agentmesh.yaml)) to mark agents as `registered`; their absence combined with process signatures or API key patterns indicates `unregistered` or `shadow` status.

### Can I extend the discovery rules to detect custom agent frameworks?

Yes. The scanner draws on the `discoveryRules`, `configPatterns`, and `processSignatures` rule sets. These names are lowercase, which in Go makes them unexported, so code outside the `agentmesh` package cannot modify them directly. To recognize proprietary frameworks or internal agent patterns, extend the rule sets in [`discovery.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/agent-governance-golang/packages/agentmesh/discovery.go) itself before the scanner is instantiated. Because the scanner is stateless, you can also instantiate separate scanners for different environments.

### What confidence threshold should I use for production alerts?

The noisy-OR formula in `AddEvidence` allows tuning based on your risk tolerance. In practice, a confidence threshold of **0.7 or higher** indicates strong evidence of a shadow agent, while 0.4-0.6 suggests investigation-worthy leads requiring manual verification. The unit tests in [`discovery_test.go`](https://github.com/microsoft/agent-governance-toolkit/blob/main/discovery_test.go) demonstrate how three low-confidence signals (0.4 each) combine to 0.784, illustrating why thresholds below 0.8 may be appropriate for discovery scenarios.
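
The thresholds above translate into a simple triage step. The sketch below is one possible mapping: the `triage` function and its labels are hypothetical, and treating everything from 0.4 up to (but not including) 0.7 as an investigation lead is an assumption that closes the gap between the two ranges mentioned above.

```go
package main

import "fmt"

// triage maps a combined confidence score to a suggested action.
// Hypothetical helper: the cut-offs follow the guidance in this FAQ,
// with the 0.6-0.7 gap assigned to "investigate" by assumption.
func triage(confidence float64) string {
	switch {
	case confidence >= 0.7:
		return "alert"
	case confidence >= 0.4:
		return "investigate"
	default:
		return "ignore"
	}
}

func main() {
	for _, c := range []float64{0.784, 0.64, 0.2} {
		fmt.Printf("confidence=%.3f -> %s\n", c, triage(c))
	}
}
```

With these cut-offs, two 0.4 signals (0.64 combined) produce a lead for manual review, and a third pushes the agent over the alert threshold.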

### Does the scanner require elevated privileges to detect running agents?

Reading the command lines of other users' processes requires appropriate OS-level permissions, so process scanning via `ScanCurrentHostProcessList`, or collecting the input for `ScanProcessCommands`, may need elevation. Filesystem scanning (`ScanConfigPaths`) and text scanning (`ScanText`) operate without elevation as long as the files are readable, making them suitable for CI/CD pipelines or developer workstations where admin rights aren't available.