How to Use Open-Source LLMs from Hugging Face in Your Applications

You can integrate open-source LLMs from Hugging Face into your applications by configuring the HUGGING_FACE_API_KEY environment variable, using the transformers library's pipeline or AutoModel APIs, and swapping the model client in existing lesson code while keeping the same prompt logic.

The Generative AI for Beginners curriculum by Microsoft provides a production-ready workflow for leveraging open-source large language models (LLMs) from Hugging Face. By following the repository's architecture, you can securely authenticate, load models like Llama 2 or Mistral, and integrate them into existing applications with minimal code changes.

Configuring Hugging Face Authentication

Before calling any model, you must authenticate with the Hugging Face Hub. The repository separates credentials from code using environment variables.

Environment Setup

Create a .env file in your project root based on the provided template .env.copy. Add your Hugging Face access token:

HUGGING_FACE_API_KEY=hf_your_token_here

Generate this token from your Hugging Face account settings. The 00-course-setup/03-providers.md file documents this requirement alongside other supported providers like OpenAI and Azure OpenAI.

Secure Token Loading

Never hard-code secrets in your application logic. Instead, use the shared utility located at shared/python/env_utils.py. This module provides the get_env() function to safely load variables from your .env file:

from shared.python.env_utils import get_env

hf_token = get_env("HUGGING_FACE_API_KEY")

This pattern ensures your credentials remain secure while remaining accessible to the model client.

Installing and Loading Model Dependencies

The repository includes all necessary dependencies in requirements.txt. Ensure you have transformers and huggingface_hub installed:

pip install -r requirements.txt

These libraries provide the core interfaces for downloading and running open-source models from the Hugging Face Hub.

Generating Text with Hugging Face Models

The 16-open-source-models lesson demonstrates two primary methods for interacting with LLMs: the high-level pipeline API and the lower-level AutoModel classes.

Using the Pipeline API

The pipeline abstraction handles tokenization and generation in a single call. This is the fastest way to get started:

from transformers import pipeline
from shared.python.env_utils import get_env

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    token=get_env("HUGGING_FACE_API_KEY"),
    device=0,  # Use -1 for CPU

    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)

result = generator("Explain quantum computing in simple terms.", return_full_text=False)
print(result[0]["generated_text"])

Using AutoModel Classes

For fine-grained control over the generation process, instantiate the model and tokenizer separately:

from transformers import AutoModelForCausalLM, AutoTokenizer
from shared.python.env_utils import get_env

model_name = "meta-llama/Llama-2-7b-chat-hf"
token = get_env("HUGGING_FACE_API_KEY")

tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
model = AutoModelForCausalLM.from_pretrained(model_name, token=token)

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Integrating into Existing Applications

One strength of the repository's architecture is the clean separation between the model client and application logic. You can swap the backend provider without rewriting your prompt engineering or UI code.

Swapping the OpenAI Client

The lesson 06-text-generation-apps/python/oai-app.py demonstrates a text generation application using the OpenAI client. To use a Hugging Face model instead, replace the client initialization while preserving the rest of the application structure:


# Remove: from openai import OpenAI

# Remove: client = OpenAI(api_key=get_env("OPENAI_API_KEY"))

# Hugging Face replacement

from transformers import pipeline
from shared.python.env_utils import get_env

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    token=get_env("HUGGING_FACE_API_KEY"),
    device=0,
    max_new_tokens=256,
)

def generate_response(prompt: str) -> str:
    result = generator(prompt, return_full_text=False)
    return result[0]["generated_text"]

The generate_response function now serves as a drop-in replacement for the OpenAI client's chat.completions.create method, allowing the rest of the application to function unchanged.

Deploying as a REST API

For production deployments, you can wrap the Hugging Face model in a simple Flask server. This pattern mirrors the Azure Functions examples in the repository but uses your local or remote open-source model.


# flask_hf_api.py

from flask import Flask, request, jsonify
from transformers import pipeline
from shared.python.env_utils import get_env

app = Flask(__name__)

# Initialize model once at startup

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    token=get_env("HUGGING_FACE_API_KEY"),
    device=0,
    max_new_tokens=256,
    temperature=0.7,
)

@app.post("/generate")
def generate():
    data = request.get_json()
    prompt = data.get("prompt", "")
    result = generator(prompt, return_full_text=False)
    return jsonify({"response": result[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

Run this server with python flask_hf_api.py and send POST requests to http://localhost:8000/generate with a JSON body containing your prompt.

Key Files in the Repository

Understanding the repository structure helps you navigate the implementation details:

Summary

  • Configure authentication by adding HUGGING_FACE_API_KEY to your .env file and loading it via shared/python/env_utils.py.
  • Install dependencies from requirements.txt to get transformers and huggingface_hub.
  • Choose your interface: use the pipeline API for simplicity or AutoModelForCausalLM for fine-grained control.
  • Swap providers in existing applications by replacing the OpenAI client initialization with a Hugging Face pipeline while preserving prompt logic.
  • Deploy as an API using Flask to create endpoints compatible with the repository's serverless patterns.

Frequently Asked Questions

How do I get a Hugging Face access token?

Visit your Hugging Face account settings and navigate to the "Access Tokens" section. Generate a new token with read permissions and copy it into your .env file as HUGGING_FACE_API_KEY. The 00-course-setup/03-providers.md file documents this process alongside other provider configurations.

Can I use CPU instead of GPU for inference?

Yes. When creating the pipeline or loading AutoModelForCausalLM, set the device parameter to -1 to force CPU usage, or omit the parameter entirely. The examples in 16-open-source-models demonstrate both CPU and GPU configurations depending on your hardware availability.

Which open-source models does the repository recommend?

The 16-open-source-models lesson specifically mentions Llama 2, Mistral, Falcon, and OLMo as recommended open-source LLMs available on Hugging Face. These models vary in size and licensing, allowing you to choose based on your specific performance and commercial use requirements.

How do I migrate an existing OpenAI-based app to Hugging Face?

Replace the OpenAI client initialization in files like 06-text-generation-apps/python/oai-app.py with a Hugging Face pipeline instance. Keep the prompt engineering logic and response handling code identical—only the model client changes. The generate_response function serves as a drop-in replacement for chat.completions.create, returning text that the rest of your application can process unchanged.

Have a question about this repo?

These articles cover the highlights, but your codebase questions are specific. Give your agent direct access to the source. Share this with your agent to get started:

Share the following with your agent to get started:
curl -s "https://instagit.com/install.md"

Works with
Claude Codex Cursor VS Code OpenClaw Any MCP Client

Maintain an open-source project? Get it listed too →