# HugeGraph-LLM Module Architecture: Directory Structure and Key Components Explained

> Explore the hugegraph-llm module architecture. Understand its directory structure and key components forming a pipeline for natural language to graph query and RAG outputs.

- Repository: [The Apache Software Foundation/incubator-hugegraph-ai](https://github.com/apache/incubator-hugegraph-ai)
- Tags: architecture
- Published: 2026-02-24

---

**The hugegraph-llm module organizes its source code into eleven purpose-driven directories (`utils/`, `state/`, `operators/`, `nodes/`, `indices/`, `flows/`, `config/`, `api/`, `document/`, `enums/`, and `tests/`), forming a layered pipeline that transforms natural language into graph queries and retrieval-augmented generation (RAG) outputs.**

The `hugegraph-llm` module serves as the core intelligence layer within the Apache HugeGraph-AI project, residing under `hugegraph-llm/src/hugegraph_llm/`. Its architecture follows a clean separation of concerns, enabling developers to extend LLM capabilities, swap vector backends, or modify graph schemas with minimal friction. Understanding this **hugegraph-llm module architecture** is essential for customizing pipelines or debugging the text-to-Gremlin and RAG workflows.

## The Core Directory Layout

The module's root package contains distinct functional groups that handle everything from low-level logging to high-level REST API exposure.

### Utility and Configuration Layers

The foundational layers provide cross-cutting services and runtime settings.

- **`utils/`** – Houses reusable helpers for logging, embedding calculations, graph-client wrappers, and decorators. Key files include [`hugegraph-llm/src/hugegraph_llm/utils/log.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/utils/log.py) for centralized logging and [`hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py) for vector operations.
- **`config/`** – Contains Pydantic-based configuration models for LLM providers, prompt templates, and HugeGraph connection parameters. Files like [`hugegraph-llm/src/hugegraph_llm/config/llm_config.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/config/llm_config.py) and [`hugegraph-llm/src/hugegraph_llm/config/prompt_config.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/config/prompt_config.py) define startup configurations and CLI utilities for generating default configs.
- **`enums/`** – Defines type-safe enumerations for property data types, cardinalities, and ID strategies, ensuring consistency across the codebase.
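
To illustrate why enumerations help with consistency, here is a minimal sketch using Python's standard `enum` module. The class names, members, and the `parse_cardinality` helper are assumptions made for this example and are not taken from the project's `enums/` package.

```python
from enum import Enum


# Hypothetical names for illustration; the real definitions live in enums/.
class PropertyDataType(Enum):
    TEXT = "TEXT"
    INT = "INT"
    DOUBLE = "DOUBLE"


class Cardinality(Enum):
    SINGLE = "SINGLE"
    LIST = "LIST"
    SET = "SET"


def parse_cardinality(raw: str) -> Cardinality:
    """Convert loosely formatted input into a validated enum member."""
    try:
        return Cardinality(raw.strip().upper())
    except ValueError as exc:
        allowed = ", ".join(member.value for member in Cardinality)
        raise ValueError(
            f"Unknown cardinality {raw!r}; expected one of: {allowed}"
        ) from exc


# Callers compare enum members instead of free-form strings.
cardinality = parse_cardinality(" list ")
print(cardinality is Cardinality.LIST)
```

Rejecting unknown values at the boundary means a typo surfaces as an immediate error rather than as a malformed schema later in the pipeline.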

### State and Index Management

These directories manage runtime context and similarity search capabilities.

- **`state/`** – Maintains runtime objects that track LLM execution state, including request-level caches and intermediate results. The primary implementation resides in [`hugegraph-llm/src/hugegraph_llm/state/ai_state.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/state/ai_state.py).
- **`indices/`** – Implements concrete index backends for vector and graph search. This includes FAISS, Milvus, and Qdrant vector stores under `indices/vector_index/` (e.g., [`hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py)) and Gremlin example indexes in [`hugegraph-llm/src/hugegraph_llm/indices/graph_index.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/indices/graph_index.py).
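
The idea of a pluggable vector backend can be sketched with a tiny in-memory store that ranks entries by cosine similarity. `InMemoryVectorStore` and its `add`/`search` methods are hypothetical names chosen for this illustration; they do not mirror the interface of the FAISS, Milvus, or Qdrant stores in the repository.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector backend; not the project's class."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, key: str, vector: Sequence[float]) -> None:
        self._items.append((key, list(vector)))

    def search(self, query: Sequence[float], top_k: int = 1) -> List[Tuple[str, float]]:
        # Score every stored vector, then keep the top_k most similar.
        scored = [(key, _cosine(query, vec)) for key, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add("graph-db", [1.0, 0.0])
store.add("llm", [0.0, 1.0])
best = store.search([0.9, 0.1], top_k=1)
print(best[0][0])
```

Any backend that offers the same two operations could be swapped in without touching the code that calls `search`, which is the property the `indices/` layer is described as providing.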

### The Operator Layer

The **`operators/`** directory contains the core LLM-driven logic, subdivided by function:

- **`llm_op/`** – LLM-centric operations including keyword extraction ([`keyword_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py)), schema building, property-graph extraction, and Gremlin generation ([`gremlin_generate.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py)).
- **`index_op/`** – Index-building operators such as [`hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py) for creating semantic search indexes.
- **`document_op/`** – Document processing operations including chunking, word extraction, and TextRank implementations.
- **`common_op/`** – Shared utilities for schema checks, result formatting, and NLTK helpers.
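
One way to picture operators as discrete, chainable units is a set of small classes that each read from and write to a shared context dictionary. The sketch below assumes that shape; the class names, the `run(context)` signature, and the stopword list are invented for illustration, and no LLM is called.

```python
from typing import Any, Dict, List

Context = Dict[str, Any]


class SplitWords:
    """Toy document operator: tokenizes the query on whitespace."""

    def run(self, context: Context) -> Context:
        context["words"] = context["query"].lower().split()
        return context


class PickKeywords:
    """Toy stand-in for keyword extraction: drops stopwords, no LLM involved."""

    STOPWORDS = {"the", "a", "about", "me", "tell"}

    def run(self, context: Context) -> Context:
        context["keywords"] = [
            word for word in context["words"] if word not in self.STOPWORDS
        ]
        return context


def run_chain(operators: List[Any], context: Context) -> Context:
    """Pass the context through each operator in order."""
    for operator in operators:
        context = operator.run(context)
    return context


result = run_chain(
    [SplitWords(), PickKeywords()],
    {"query": "Tell me about the movie Inception"},
)
print(result["keywords"])
```

Because each step only depends on keys in the context, operators can be reordered or replaced independently.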

### Node Abstractions

The **`nodes/`** directory wraps operators into graph-compatible execution units used by the scheduler. Each node exposes a uniform `run()` interface and manages input/output conversion. Examples include [`hugegraph-llm/src/hugegraph_llm/nodes/llm_node/text2gremlin.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/llm_node/text2gremlin.py) for Gremlin generation workflows and [`hugegraph-llm/src/hugegraph_llm/nodes/index_node/vector_query_node.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/index_node/vector_query_node.py) for vector similarity queries.
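
The wrapping described above can be sketched as a small adapter that pulls an operator's input from shared state and writes its output back. `OperatorNode`, its constructor arguments, and the `count_words` operator are hypothetical; only the uniform `run()` idea comes from the description above.

```python
from typing import Any, Callable, Dict


def count_words(text: str) -> int:
    """Toy operator: a plain function with its own signature."""
    return len(text.split())


class OperatorNode:
    """Hypothetical adapter that maps shared state to operator input and back."""

    def __init__(
        self,
        operator: Callable[[Any], Any],
        input_key: str,
        output_key: str,
    ) -> None:
        self._operator = operator
        self._input_key = input_key
        self._output_key = output_key

    def run(self, state: Dict[str, Any]) -> Dict[str, Any]:
        # Input conversion: pick the one value the operator needs.
        value = self._operator(state[self._input_key])
        # Output conversion: return a new state with the result attached.
        return {**state, self._output_key: value}


node = OperatorNode(count_words, input_key="query", output_key="word_count")
state = node.run({"query": "find all people who studied at MIT"})
print(state["word_count"])
```

The operator stays unaware of the state layout, so the same function can be reused in nodes that read from different keys.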

### Pipeline Orchestration

The **`flows/`** directory defines high-level pipeline compositions that stitch nodes into end-to-end services. These scheduler-driven flows include:
- [`hugegraph-llm/src/hugegraph_llm/flows/rag_flow_graph_only.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/rag_flow_graph_only.py) – RAG pipeline using graph-only retrieval
- [`hugegraph-llm/src/hugegraph_llm/flows/text2gremlin.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/text2gremlin.py) – Natural language to Gremlin conversion pipeline
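
As a rough mental model of name-based flow dispatch, the sketch below registers ordered lists of steps under a flow name and runs them in sequence. `ToyScheduler`, `register`, the keyword-argument form of `schedule_flow`, and the two step functions are assumptions for illustration only and should not be read as the project's scheduler implementation.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Step = Callable[[State], State]


class ToyScheduler:
    """Hypothetical name-to-pipeline registry; not the project's scheduler."""

    def __init__(self) -> None:
        self._flows: Dict[str, List[Step]] = {}

    def register(self, name: str, steps: List[Step]) -> None:
        self._flows[name] = steps

    def schedule_flow(self, name: str, **inputs: Any) -> State:
        if name not in self._flows:
            raise KeyError(f"Unknown flow: {name}")
        state: State = dict(inputs)
        for step in self._flows[name]:
            state = step(state)
        return state


def extract_keywords(state: State) -> State:
    # Toy heuristic: treat capitalized words as keywords.
    keywords = [word for word in state["query"].split() if word.istitle()]
    return {**state, "keywords": keywords}


def build_answer(state: State) -> State:
    return {**state, "answer": "Matched: " + ", ".join(state["keywords"])}


scheduler = ToyScheduler()
scheduler.register("toy_rag", [extract_keywords, build_answer])
result = scheduler.schedule_flow("toy_rag", query="who directed Inception")
print(result["answer"])
```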

### Interface and Validation

- **`api/`** – FastAPI-style REST endpoints exposing RAG and other services to external callers (e.g., Gradio UI). The entry point [`hugegraph-llm/src/hugegraph_llm/api/rag_api.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/api/rag_api.py) forwards requests to scheduler flows.
- **`document/`** – Utilities for loading and splitting raw documents (text, PDFs) before indexing, such as [`hugegraph-llm/src/hugegraph_llm/document/chunk_split.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/document/chunk_split.py).
- **`tests/`** – Unit and integration suites validating operators, nodes, and flows, providing canonical usage patterns like [`hugegraph-llm/src/tests/operators/llm_op/test_keyword_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/tests/operators/llm_op/test_keyword_extract.py).
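
To make the document-splitting step concrete, here is a minimal sketch of fixed-size chunking with overlap, a common approach before indexing. The function name `split_into_chunks` and its parameters are hypothetical, and the strategy shown is an assumption rather than the behavior of `chunk_split.py`.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 20, overlap: int = 5) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_into_chunks("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

Overlap keeps a sentence that straddles a boundary at least partially present in both neighboring chunks, at the cost of some duplicated text in the index.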

## Architectural Layers Explained

The **hugegraph-llm module architecture** follows a six-layer stack:

1. **Foundation** – `utils/`, `enums/`, and `config/` provide shared services and type safety.
2. **Context** – `state/` and `indices/` manage execution state and vector/graph search capabilities.
3. **Logic** – `operators/` implements the actual LLM processing steps.
4. **Adaptation** – `nodes/` adapts operators to the scheduler's graph execution model.
5. **Orchestration** – `flows/` composes nodes into complete pipelines (RAG, text2gremlin).
6. **Surface** – `api/` and `tests/` expose and verify the service interface.

## Practical Code Examples

### Executing a Graph-Only RAG Flow

The scheduler singleton orchestrates flows defined in the `flows/` directory:

```python
from hugegraph_llm.flows.scheduler import SchedulerSingleton

scheduler = SchedulerSingleton.get_instance()

result = scheduler.schedule_flow(
    "rag_graph_only",
    query="Tell me about the movie Inception.",
    graph_only_answer=True,
    vector_only_answer=False,
)

print("Graph-only answer:", result.get("graph_only_answer"))
```

This flow uses [`flows/rag_flow_graph_only.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/flows/rag_flow_graph_only.py), together with `nodes/`, `operators/`, and `indices/`, to execute the query.

### Building a Semantic Vector Index

Index construction flows leverage operators and index implementations:

```python
from hugegraph_llm.flows.scheduler import SchedulerSingleton

documents = [
    {"id": "doc1", "content": "Apache HugeGraph is a graph database."},
    {"id": "doc2", "content": "Large Language Models can reason over graphs."},
]

scheduler = SchedulerSingleton.get_instance()
index_res = scheduler.schedule_flow("build_semantic_index", documents)

print("Index built:", index_res)
```

This calls [`operators/index_op/build_semantic_index.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py) and persists vectors via [`indices/vector_index/faiss_vector_store.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py).

### Converting Natural Language to Gremlin

The text2gremlin flow demonstrates how the architecture handles complex multi-step LLM operations:

```python
from hugegraph_llm.flows.scheduler import SchedulerSingleton

scheduler = SchedulerSingleton.get_instance()
gremlin_res = scheduler.schedule_flow(
    "text2gremlin",
    "find all people who studied at MIT",
    2,                     # number of examples
    "hugegraph",           # schema name
    None,                  # custom prompt
    ["template_gremlin", "raw_gremlin"],
)

print("Gremlin template:", gremlin_res.get("template_gremlin"))
```

This flow is backed by [`nodes/llm_node/text2gremlin.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/nodes/llm_node/text2gremlin.py) and [`operators/llm_op/gremlin_generate.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py).

## Key Source Files by Function

| Function | File Path | Purpose |
|----------|-----------|---------|
| **Logging** | [`hugegraph-llm/src/hugegraph_llm/utils/log.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/utils/log.py) | Centralized logging utilities |
| **Embeddings** | [`hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py) | Vector calculation helpers |
| **Runtime State** | [`hugegraph-llm/src/hugegraph_llm/state/ai_state.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/state/ai_state.py) | Execution context management |
| **Keyword Extraction** | [`hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py) | NLP keyword extraction |
| **Gremlin Generation** | [`hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py) | LLM-based query generation |
| **Vector Storage** | [`hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py) | FAISS backend implementation |
| **Graph Index** | [`hugegraph-llm/src/hugegraph_llm/indices/graph_index.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/indices/graph_index.py) | Gremlin example storage |
| **RAG API** | [`hugegraph-llm/src/hugegraph_llm/api/rag_api.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/api/rag_api.py) | FastAPI endpoint definitions |
| **Configuration** | [`hugegraph-llm/src/hugegraph_llm/config/llm_config.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/config/llm_config.py) | Provider and model settings |
| **Document Chunking** | [`hugegraph-llm/src/hugegraph_llm/document/chunk_split.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/document/chunk_split.py) | Text segmentation utilities |

## Summary

- The **hugegraph-llm module architecture** separates concerns across eleven directories, from low-level utilities to high-level API endpoints.
- **`operators/`** contains the core LLM logic, subdivided into `llm_op/`, `index_op/`, `document_op/`, and `common_op/`.
- **`nodes/`** wraps operators for the scheduler, while **`flows/`** orchestrates them into complete pipelines.
- **`indices/`** abstracts vector (FAISS, Milvus, Qdrant) and graph indexes, enabling pluggable similarity search.
- All configurations are Pydantic-based in **`config/`**, and the **`api/`** layer exposes FastAPI endpoints for external integration.

## Frequently Asked Questions

### What is the role of the operators directory in hugegraph-llm?

The `operators/` directory implements the actual LLM-driven processing steps, including keyword extraction, schema building, property-graph extraction, and Gremlin query generation. It is subdivided into `llm_op/` for LLM-centric tasks, `index_op/` for index construction, `document_op/` for text processing, and `common_op/` for shared utilities. Each operator is a discrete unit of work that can be chained together via the scheduler.

### How does the nodes directory differ from the operators directory?

While `operators/` contains the raw business logic for LLM interactions, `nodes/` wraps these operators into graph-compatible execution units that conform to the scheduler's interface. Nodes manage input/output conversion and expose a uniform `run()` method, allowing the scheduler in `flows/` to treat diverse operations as interchangeable vertices in an execution graph.

### Which directory contains the REST API endpoints?

The **`api/`** directory houses FastAPI-style REST endpoints that expose RAG and text2gremlin services to external callers. The primary entry point is [`hugegraph-llm/src/hugegraph_llm/api/rag_api.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/api/rag_api.py), which forwards HTTP requests to the appropriate scheduler flows and returns JSON responses suitable for frontend consumption (e.g., Gradio UIs).

### Where are vector indexes implemented in the hugegraph-llm module?

Vector indexes reside in **`indices/vector_index/`**, with concrete implementations for FAISS ([`faiss_vector_store.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py)), Milvus, and Qdrant. These classes provide the storage and retrieval mechanisms used by [`operators/index_op/build_semantic_index.py`](https://github.com/apache/incubator-hugegraph-ai/blob/main/hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py) and the vector query nodes, abstracting the underlying vector database specifics from the rest of the pipeline.