How to Self-Host CosyVoice for Chinese TTS with Voice Cloning Capabilities

Self-hosting CosyVoice requires deploying a Docker container or Python environment with the Chinese model checkpoints, then using the HTTP API or CLI to synthesize speech and clone voices from 5–10 second audio samples.

CosyVoice is an open-source speech synthesis framework developed by Alibaba Group's speech team and published under the FunAudioLLM organization. It provides high-quality Chinese text-to-speech (TTS) and zero-shot voice cloning. This guide follows the TTS section of the cyfyifanchen/one-person-company repository and gives concrete deployment instructions for running the complete inference pipeline locally.

Architecture Overview

CosyVoice combines a FastSpeech2 acoustic model with a HiFi-GAN neural vocoder to convert text into natural speech. The system architecture consists of five core components that orchestrate the transformation from normalized text to waveform audio.

  • Text Front-end: The text_normalizer module handles Chinese character normalization, punctuation processing, and optional SSML tag support.
  • Acoustic Model: A Transformer-based FastSpeech2 architecture predicts mel-spectrograms from normalized text inputs.
  • Vocoder: HiFi-GAN performs GPU-accelerated conversion of mel-spectrograms into raw audio waveforms.
  • Speaker Encoder: A pre-trained speaker-verification network extracts fixed-dimensional embeddings from reference audio for voice cloning.
  • Inference Engine: The cosyvoice.infer Python module coordinates batch and streaming inference across all components.

The complete pipeline requires a single GPU with at least 8 GB VRAM for real-time generation, though CPU inference is possible at significantly slower speeds.
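
The 8 GB guideline can be checked before downloading anything. The sketch below is one way to do that, assuming an NVIDIA driver is installed and `nvidia-smi` is on the PATH. The helper names are illustrative and not part of CosyVoice, and the threshold is set a little under 8192 MiB because nominal 8 GB cards often report slightly less.

```python
import shutil
import subprocess

# Assumption: allow some slack, since "8 GB" cards often report under 8192 MiB
MIN_VRAM_MIB = 7500


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per line, skipping non-numeric lines."""
    values = (line.strip() for line in smi_output.splitlines())
    return [int(v) for v in values if v.isdigit()]


def gpus_meeting_requirement(smi_output: str, min_mib: int = MIN_VRAM_MIB) -> list[int]:
    """Return the indices of GPUs whose total memory is at least min_mib."""
    return [i for i, mib in enumerate(parse_vram_mib(smi_output)) if mib >= min_mib]


def query_nvidia_smi() -> str:
    """Ask nvidia-smi for total memory per GPU; return '' if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout if result.returncode == 0 else ""


if __name__ == "__main__":
    usable = gpus_meeting_requirement(query_nvidia_smi())
    print("GPUs with enough VRAM:", usable or "none found (CPU inference only)")
```

An empty result does not rule out running the model; it only suggests falling back to the slower CPU path described above.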

Deployment Methods

The official Docker image bundles all CUDA dependencies, PyTorch libraries, and pre-compiled kernels, eliminating manual environment configuration.

  1. Clone the upstream repository:
git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
  2. Download Chinese model checkpoints (approximately 1.4 GB):
mkdir -p models/zh
wget https://huggingface.co/FunAudioLLM/CosyVoice-zh/resolve/main/model.ckpt -O models/zh/model.ckpt
wget https://huggingface.co/FunAudioLLM/CosyVoice-zh/resolve/main/vocoder.ckpt -O models/zh/vocoder.ckpt
  3. Launch the inference server:
docker run --gpus all -p 8000:8000 \
    -v "$(pwd)/models/zh":/CosyVoice/models/zh \
    funaudiollm/cosyvoice:latest \
    python -m cosyvoice.server --lang zh
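
Because the container reads the checkpoints from the mounted directory, it can help to confirm both files are present before launching the server. This is a minimal sketch that assumes the two file names used in the download commands above; `missing_checkpoints` is a hypothetical helper, not something shipped with the repository.

```python
from pathlib import Path

# File names taken from the download commands above
EXPECTED_CHECKPOINTS = ("model.ckpt", "vocoder.ckpt")


def missing_checkpoints(model_dir: str) -> list[str]:
    """Return expected checkpoint names that are absent or empty in model_dir."""
    root = Path(model_dir)
    missing = []
    for name in EXPECTED_CHECKPOINTS:
        path = root / name
        # A zero-byte file usually means an interrupted or failed download
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_checkpoints("models/zh")
    print("missing:", problems if problems else "none")
```

The check only looks at presence and size; it does not verify that a file is a valid checkpoint.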

The service exposes an HTTP endpoint at http://localhost:8000 and accepts JSON POST requests for synthesis.
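
Loading checkpoints can take a while, so a client may need to wait for the port to open before sending requests. The sketch below polls the TCP port using only the standard library. It assumes the host and port from the docker command above and does not rely on a health endpoint, since none is documented here; `wait_for_port` is an illustrative name.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port shows only that the process is listening; the first synthesis request may still be slow if the models are loaded lazily.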

Python Environment Setup

For non-Docker deployments, install dependencies via the provided requirements file:

pip install -r requirements.txt
python -m cosyvoice.server --lang zh --model_dir models/zh

Generating Chinese Speech

Send a POST request to the /synthesize endpoint with Chinese text and language specification:

POST http://localhost:8000/synthesize
{
  "text": "今天天气很好,我想听一首轻音乐。",
  "lang": "zh",
  "speaker_id": "default"
}

The API returns a base64-encoded WAV file. Below is a complete Python client implementation:

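import base64

import requests

def synthesize(text, speaker_path=None, out_path="output.wav"):
    """POST text to the local /synthesize endpoint and save the returned WAV."""
    payload = {
        "text": text,
        "lang": "zh",
        # Path to a speaker-embedding .npy file, or "default" for the built-in voice.
        # The .npy file is binary, so it is referenced by path, not decoded as text.
        "speaker_id": speaker_path or "default",
    }

    r = requests.post("http://localhost:8000/synthesize", json=payload, timeout=60)
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body

    # The response carries the WAV as a base64 string under "audio"
    wav = base64.b64decode(r.json()["audio"])
    with open(out_path, "wb") as f:
        f.write(wav)

# Standard TTS
synthesize("你好,欢迎使用CosyVoice!")

# Voice cloning (written to a separate file so the first result is not overwritten)
synthesize("这是我自己的声音。", speaker_path="my_speaker.npy", out_path="cloned.wav")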
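
A quick sanity check on the decoded bytes can catch an empty or truncated response. The sketch below reads the header with the standard-library `wave` module. It assumes the server returns uncompressed PCM WAV, which is the only encoding `wave` can read; `wav_summary` is an illustrative helper.

```python
import io
import wave


def wav_summary(wav_bytes: bytes) -> dict:
    """Return sample rate, channel count, and duration for PCM WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        rate = w.getframerate()
        frames = w.getnframes()
        return {
            "sample_rate": rate,
            "channels": w.getnchannels(),
            "duration_s": frames / rate if rate else 0.0,
        }
```

For example, `wav_summary(open("output.wav", "rb").read())` reports the format of the file written by the client above; a duration near zero suggests the request did not produce real audio.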

Voice Cloning Implementation

CosyVoice enables zero-shot voice cloning by conditioning the acoustic model on speaker embeddings extracted from short reference audio.

Creating Speaker Embeddings

Prepare a clean reference recording 5–10 seconds long, saved as 16 kHz mono for best results. Generate the embedding using the provided utility script:

python tools/create_speaker_embedding.py \
    --audio path/to/reference.wav \
    --out speaker_id.npy

Alternatively, use the CLI cloning interface:

python -m cosyvoice.clone \
    --ref-audio samples/my_voice.wav \
    --output my_voice.npy \
    --lang zh
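
Either command works best when the reference clip already meets the 5–10 second, 16 kHz mono recommendation. The sketch below checks a file against those values with the standard-library `wave` module, assuming the clip is an uncompressed PCM WAV; `reference_audio_problems` is a hypothetical helper, not part of CosyVoice.

```python
import wave


def reference_audio_problems(path: str, min_s: float = 5.0, max_s: float = 10.0,
                             rate_hz: int = 16000) -> list[str]:
    """List the ways a PCM WAV file departs from the recommended reference format."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / rate if rate else 0.0

    problems = []
    if rate != rate_hz:
        problems.append(f"sample rate is {rate} Hz, expected {rate_hz} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if not min_s <= duration <= max_s:
        problems.append(f"duration is {duration:.1f} s, expected {min_s:g}-{max_s:g} s")
    return problems
```

An empty list means the clip matches the recommendation. A non-empty list is a prompt to resample, downmix, or trim the recording before extracting the embedding.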

Using Custom Voices in Synthesis

Reference the generated .npy file in the speaker_id field instead of using "default":

{
  "text": "欢迎使用我的新声音!",
  "lang": "zh",
  "speaker_id": "speaker_id.npy"
}

The cosyvoice/infer.py module loads the embedding and conditions the FastSpeech2 generator to match the reference speaker's acoustic characteristics.

Key Source Files

Understanding these critical files enables advanced customization and debugging:

  • Dockerfile: Defines the container environment with CUDA support and dependency installation.
  • cosyvoice/server.py: Implements the FastAPI HTTP service handling /synthesize endpoints.
  • cosyvoice/infer.py: Contains core inference logic coordinating text normalization, mel-spectrogram generation, and vocoding.
  • models/zh/: Directory containing Chinese-specific acoustic and vocoder checkpoints.
  • tools/create_speaker_embedding.py: Utility for extracting speaker vectors from reference audio using the pre-trained encoder.
  • requirements.txt: Lists PyTorch, librosa, and other Python dependencies for manual installation.

Summary

  • CosyVoice combines FastSpeech2 and HiFi-GAN for high-fidelity Chinese TTS with voice cloning capabilities.
  • Docker deployment is the fastest setup method, requiring only the Chinese model checkpoints (~1.4 GB) mounted at models/zh/.
  • Voice cloning requires 5–10 seconds of 16 kHz mono reference audio processed through tools/create_speaker_embedding.py.
  • Hardware requirements include a GPU with 8 GB+ VRAM for real-time inference, though CPU execution is supported.
  • The HTTP API at localhost:8000/synthesize accepts JSON payloads with Chinese text and speaker embeddings.

Frequently Asked Questions

What hardware is required to self-host CosyVoice?

CosyVoice needs an NVIDIA GPU with at least 8 GB of VRAM for real-time speech synthesis. It can run on CPU-only machines, but inference is significantly slower, which makes CPU execution suitable only for batch processing or testing rather than interactive applications.

How many seconds of audio are needed for voice cloning?

CosyVoice achieves zero-shot voice cloning with 5–10 seconds of clear reference audio. The tools/create_speaker_embedding.py script processes this sample to generate a fixed-dimensional speaker embedding that conditions the acoustic model during synthesis.

Can CosyVoice run without Docker?

Yes, CosyVoice supports native Python deployment by installing dependencies listed in requirements.txt and executing python -m cosyvoice.server. However, Docker is strongly recommended because it handles CUDA kernel compilation, PyTorch version alignment, and system library dependencies automatically.

Where are the Chinese model checkpoints stored?

The Chinese-specific checkpoints must be placed in the models/zh/ directory relative to the repository root. This directory should contain model.ckpt (FastSpeech2 acoustic model) and vocoder.ckpt (HiFi-GAN weights), which together total approximately 1.4 GB in size.
