# How to Self-Host CosyVoice for Chinese TTS with Voice Cloning Capabilities

> Self-host CosyVoice for Chinese TTS and voice cloning. Deploy using Docker or Python and use its API for fast speech synthesis from short audio samples.

- Repository: [Elliot Chen/one-person-company](https://github.com/cyfyifanchen/one-person-company)
- Tags: how-to-guide
- Published: 2026-02-28

---

**Self-hosting CosyVoice requires deploying a Docker container or Python environment with the Chinese model checkpoints, then using the HTTP API or CLI to synthesize speech and clone voices from 5–10 second audio samples.**

CosyVoice is an open-source speech synthesis framework developed by Alibaba's Tongyi speech team and published under the FunAudioLLM organization. It supports high-quality Chinese text-to-speech (TTS) and zero-shot voice cloning. Following the TTS section of the `cyfyifanchen/one-person-company` repository, this guide gives concrete deployment instructions for running the complete inference pipeline locally.

## Architecture Overview

CosyVoice combines a **FastSpeech2** acoustic model with a **HiFi-GAN** neural vocoder to convert text into natural speech. The system architecture consists of five core components that orchestrate the transformation from normalized text to waveform audio.

- **Text Front-end**: The `text_normalizer` module handles Chinese character normalization, punctuation processing, and optional SSML tag support.
- **Acoustic Model**: A Transformer-based FastSpeech2 architecture predicts mel-spectrograms from normalized text inputs.
- **Vocoder**: HiFi-GAN performs GPU-accelerated conversion of mel-spectrograms into raw audio waveforms.
- **Speaker Encoder**: A pre-trained speaker-verification network extracts fixed-dimensional embeddings from reference audio for voice cloning.
- **Inference Engine**: The `cosyvoice.infer` Python module coordinates batch and streaming inference across all components.
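The data flow between these components can be sketched in a few lines. This is an illustration only: `TtsPipeline` and the toy stage functions are hypothetical stand-ins rather than CosyVoice APIs, and the 80 mel bins and 256-sample hop are assumed values chosen to make the shapes concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# All names below are hypothetical stand-ins, not CosyVoice APIs.
Mel = List[List[float]]       # frames x mel bins
Waveform = List[float]        # raw audio samples
Embedding = Sequence[float]   # fixed-dimensional speaker vector


@dataclass
class TtsPipeline:
    normalize: Callable[[str], str]
    acoustic_model: Callable[[str, Optional[Embedding]], Mel]
    vocoder: Callable[[Mel], Waveform]

    def synthesize(self, text: str, speaker: Optional[Embedding] = None) -> Waveform:
        clean = self.normalize(text)                # text front-end
        mel = self.acoustic_model(clean, speaker)   # text -> mel-spectrogram
        return self.vocoder(mel)                    # mel-spectrogram -> waveform


# Toy stages so the sketch runs without any model weights.
def toy_normalize(text: str) -> str:
    return text.strip()


def toy_acoustic(text: str, speaker: Optional[Embedding]) -> Mel:
    bias = sum(speaker) if speaker else 0.0
    return [[bias] * 80 for _ in text]   # one assumed 80-bin frame per character


def toy_vocoder(mel: Mel) -> Waveform:
    return [0.0] * (len(mel) * 256)      # assumed hop size of 256 samples per frame


pipeline = TtsPipeline(toy_normalize, toy_acoustic, toy_vocoder)
audio = pipeline.synthesize(" 你好 ", speaker=[0.1, 0.2])
print(len(audio))
```

The speaker embedding enters only at the acoustic-model stage, which is why a new voice can be swapped in without touching the front-end or the vocoder.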

The complete pipeline requires a single GPU with at least **8 GB VRAM** for real-time generation, though CPU inference is possible at significantly slower speeds.
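A quick way to see whether a machine meets that guideline is to ask PyTorch what it can see. The helper below is a minimal sketch; `gpu_report` is a hypothetical name, and it assumes the first CUDA device (index 0) is the one the server would use.

```python
def gpu_report(min_vram_gb: float = 8.0) -> str:
    """Summarize whether a CUDA GPU with enough memory is visible to PyTorch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed; cannot check for a GPU."
    if not torch.cuda.is_available():
        return "No CUDA GPU detected; expect slower CPU inference."
    props = torch.cuda.get_device_properties(0)  # assumes device 0 is the target
    vram_gb = props.total_memory / 1024**3
    verdict = "meets" if vram_gb >= min_vram_gb else "is below"
    return f"{props.name}: {vram_gb:.1f} GB VRAM, {verdict} the {min_vram_gb:g} GB guideline."


print(gpu_report())
```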

## Deployment Methods

### Docker Deployment (Recommended)

The official Docker image bundles all CUDA dependencies, PyTorch libraries, and pre-compiled kernels, eliminating manual environment configuration.

1. **Clone the upstream repository**:

```bash
git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
```

2. **Download Chinese model checkpoints** (approximately 1.4 GB):

```bash
mkdir -p models/zh
wget https://huggingface.co/FunAudioLLM/CosyVoice-zh/resolve/main/model.ckpt -O models/zh/model.ckpt
wget https://huggingface.co/FunAudioLLM/CosyVoice-zh/resolve/main/vocoder.ckpt -O models/zh/vocoder.ckpt
```

3. **Launch the inference server**:

```bash
docker run --gpus all -p 8000:8000 \
    -v "$(pwd)/models/zh:/CosyVoice/models/zh" \
    funaudiollm/cosyvoice:latest \
    python -m cosyvoice.server --lang zh
```

The service exposes an HTTP endpoint at `http://localhost:8000` and accepts JSON POST requests for synthesis.
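Model loading can take a while after the container starts, so a client script may want to wait until the port accepts connections before sending requests. The following is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and it checks only that the TCP port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(interval)  # server not listening yet; retry shortly
    return False


if wait_for_port(timeout=2.0):
    print("Port 8000 is accepting connections.")
else:
    print("Nothing is listening on port 8000 yet.")
```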

### Python Environment Setup

For non-Docker deployments, install dependencies via the provided requirements file:

```bash
pip install -r requirements.txt
python -m cosyvoice.server --lang zh --model_dir models/zh
```
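Both deployment methods expect the two checkpoint files under `models/zh`, and an interrupted download tends to show up later as a confusing load error. A small pre-flight check, sketched here with a hypothetical `check_checkpoints` helper, can catch a missing or empty file before the server starts. It does not verify checksums.

```python
from pathlib import Path

# File names taken from the wget commands in this guide.
EXPECTED_FILES = ("model.ckpt", "vocoder.ckpt")


def check_checkpoints(model_dir: str = "models/zh") -> dict:
    """Return {filename: size in bytes}; raise if a file is missing or empty."""
    sizes = {}
    for name in EXPECTED_FILES:
        path = Path(model_dir) / name
        if not path.is_file():
            raise FileNotFoundError(f"missing checkpoint: {path}")
        size = path.stat().st_size
        if size == 0:
            raise ValueError(f"empty checkpoint (failed download?): {path}")
        sizes[name] = size
    return sizes


try:
    for name, size in check_checkpoints().items():
        print(f"{name}: {size / 1e6:.1f} MB")
except (FileNotFoundError, ValueError) as err:
    print(f"checkpoint check failed: {err}")
```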

## Generating Chinese Speech

Send a POST request to `http://localhost:8000/synthesize` with a JSON body containing the Chinese text, the language code, and the speaker to use:

```json
{
  "text": "今天天气很好，我想听一首轻音乐。",
  "lang": "zh",
  "speaker_id": "default"
}
```

The API returns a base64-encoded WAV file. Below is a complete Python client implementation:

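```python
import base64

import requests

API_URL = "http://localhost:8000/synthesize"


def synthesize(text, speaker_path=None, out_path="output.wav"):
    """Request speech for `text` and write the returned WAV to `out_path`."""
    payload = {
        "text": text,
        "lang": "zh",
        # Either the built-in voice or the path to a speaker embedding (.npy),
        # matching the `speaker_id` usage shown under "Using Custom Voices in Synthesis".
        "speaker_id": speaker_path or "default",
    }

    r = requests.post(API_URL, json=payload, timeout=120)
    r.raise_for_status()  # surface HTTP errors instead of failing on r.json()

    # The response carries the audio as a base64-encoded WAV in the "audio" field.
    wav = base64.b64decode(r.json()["audio"])
    with open(out_path, "wb") as f:
        f.write(wav)


# Standard TTS
synthesize("你好，欢迎使用CosyVoice！")

# Voice cloning (written to a separate file so the first result is not overwritten)
synthesize("这是我自己的声音。", speaker_path="my_speaker.npy", out_path="cloned.wav")
```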
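Before saving or playing a response, it can be useful to confirm that the decoded bytes really are a WAV file and to see its sample rate and duration. The sketch below uses only the standard library; `describe_wav` is a hypothetical helper, and it assumes the server returns PCM-encoded WAV, since Python's `wave` module cannot read other encodings. The demo builds its own one-second clip, so it runs without a server.

```python
import base64
import io
import wave


def describe_wav(wav_b64: str) -> dict:
    """Decode a base64 WAV payload and report its basic audio properties."""
    wav_bytes = base64.b64decode(wav_b64)
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


# Self-contained demo: one second of 16-bit mono silence at 16 kHz.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)
demo_b64 = base64.b64encode(buf.getvalue()).decode("ascii")
print(describe_wav(demo_b64))
```

In a real client, the value of `r.json()["audio"]` would be passed in place of `demo_b64`.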

## Voice Cloning Implementation

CosyVoice enables zero-shot voice cloning by conditioning the acoustic model on speaker embeddings extracted from short reference audio.

### Creating Speaker Embeddings

Prepare a clean reference recording 5–10 seconds long, saved as **16 kHz mono** audio for best results. Generate the embedding using the provided utility script:

```bash
python tools/create_speaker_embedding.py \
    --audio path/to/reference.wav \
    --out speaker_id.npy
```

Alternatively, use the CLI cloning interface:

```bash
python -m cosyvoice.clone \
    --ref-audio samples/my_voice.wav \
    --output my_voice.npy \
    --lang zh
```
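Before running either command, it can help to confirm that the reference clip matches the recommended format. The check below is a minimal sketch using the standard-library `wave` module; `check_reference` is a hypothetical helper, and it handles PCM WAV input only.

```python
import wave


def check_reference(path: str, min_s: float = 5.0, max_s: float = 10.0) -> list:
    """Return a list of warnings for a reference WAV; an empty list means it looks fine."""
    problems = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != 16000:
        problems.append(f"expected 16000 Hz, found {rate} Hz")
    if not min_s <= duration <= max_s:
        problems.append(f"expected {min_s:g}-{max_s:g} s, found {duration:.1f} s")
    return problems


# Usage: print any warnings for a clip before extracting its embedding.
# for warning in check_reference("samples/my_voice.wav"):
#     print("warning:", warning)
```

A clip that fails these checks can usually be resampled and downmixed with an audio tool before the embedding is extracted.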

### Using Custom Voices in Synthesis

Reference the generated `.npy` file in the `speaker_id` field instead of using "default":

```json
{
  "text": "欢迎使用我的新声音！",
  "lang": "zh",
  "speaker_id": "speaker_id.npy"
}
```

The [`cosyvoice/infer.py`](https://github.com/cyfyifanchen/one-person-company/blob/main/cosyvoice/infer.py) module loads the embedding and conditions the FastSpeech2 generator to match the reference speaker's acoustic characteristics.
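A client can guard against typos in the embedding path by validating the request before sending it. The builder below is a sketch under one notable assumption: the client and server see the same filesystem, so a path that exists locally also exists for the server. With the Docker setup shown earlier, only `models/zh` is mounted, so the embedding file would need to be made visible inside the container as well. `build_payload` is a hypothetical helper.

```python
import json
from pathlib import Path


def build_payload(text: str, speaker: str = "default", lang: str = "zh") -> dict:
    """Build a /synthesize request body, rejecting empty text and missing embeddings."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # Assumption: any value other than "default" is a path to a .npy embedding.
    if speaker != "default" and not Path(speaker).is_file():
        raise FileNotFoundError(f"speaker embedding not found: {speaker}")
    return {"text": text, "lang": lang, "speaker_id": speaker}


print(json.dumps(build_payload("欢迎使用我的新声音！"), ensure_ascii=False))
```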

## Key Source Files

Understanding these critical files enables advanced customization and debugging:

- **`Dockerfile`**: Defines the container environment with CUDA support and dependency installation.
- **[`cosyvoice/server.py`](https://github.com/cyfyifanchen/one-person-company/blob/main/cosyvoice/server.py)**: Implements the FastAPI HTTP service handling `/synthesize` endpoints.
- **[`cosyvoice/infer.py`](https://github.com/cyfyifanchen/one-person-company/blob/main/cosyvoice/infer.py)**: Contains core inference logic coordinating text normalization, mel-spectrogram generation, and vocoding.
- **`models/zh/`**: Directory containing Chinese-specific acoustic and vocoder checkpoints.
- **[`tools/create_speaker_embedding.py`](https://github.com/cyfyifanchen/one-person-company/blob/main/tools/create_speaker_embedding.py)**: Utility for extracting speaker vectors from reference audio using the pre-trained encoder.
- **[`requirements.txt`](https://github.com/cyfyifanchen/one-person-company/blob/main/requirements.txt)**: Lists PyTorch, librosa, and other Python dependencies for manual installation.

## Summary

- **CosyVoice** combines FastSpeech2 and HiFi-GAN for high-fidelity Chinese TTS with voice cloning capabilities.
- **Docker deployment** is the fastest setup method, requiring only the Chinese model checkpoints (~1.4 GB) mounted at `models/zh/`.
- **Voice cloning** requires 5–10 seconds of 16 kHz mono reference audio processed through [`tools/create_speaker_embedding.py`](https://github.com/cyfyifanchen/one-person-company/blob/main/tools/create_speaker_embedding.py).
- **Hardware requirements** include a GPU with 8 GB+ VRAM for real-time inference, though CPU execution is supported.
- The **HTTP API** at `localhost:8000/synthesize` accepts JSON payloads with Chinese text and speaker embeddings.

## Frequently Asked Questions

### What hardware is required to self-host CosyVoice?

CosyVoice requires an NVIDIA GPU with at least 8 GB of VRAM to achieve real-time speech synthesis. It can run on CPU-only machines, but inference is significantly slower, which makes CPU mode suitable for batch processing or testing rather than interactive applications.

### How many seconds of audio are needed for voice cloning?

CosyVoice achieves zero-shot voice cloning with **5–10 seconds** of clear reference audio. The [`tools/create_speaker_embedding.py`](https://github.com/cyfyifanchen/one-person-company/blob/main/tools/create_speaker_embedding.py) script processes this sample to generate a fixed-dimensional speaker embedding that conditions the acoustic model during synthesis.

### Can CosyVoice run without Docker?

Yes, CosyVoice supports native Python deployment by installing dependencies listed in [`requirements.txt`](https://github.com/cyfyifanchen/one-person-company/blob/main/requirements.txt) and executing `python -m cosyvoice.server`. However, Docker is strongly recommended because it handles CUDA kernel compilation, PyTorch version alignment, and system library dependencies automatically.

### Where are the Chinese model checkpoints stored?

The Chinese-specific checkpoints must be placed in the `models/zh/` directory relative to the repository root. This directory should contain `model.ckpt` (FastSpeech2 acoustic model) and `vocoder.ckpt` (HiFi-GAN weights), which together total approximately 1.4 GB in size.