# s2-pro

**Repository Path**: hf-models/s2-pro

## Basic Information

- **Project Name**: s2-pro
- **Description**: Mirror of https://huggingface.co/fishaudio/s2-pro
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-11
- **Last Updated**: 2026-03-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

---
tags:
- text-to-speech
license: other
license_name: fish-audio-research-license
license_link: LICENSE.md
language:
- zh
- en
- ja
- ko
- es
- pt
- ar
- ru
- fr
- de
- sv
- it
- tr
- "no"
- nl
- cy
- eu
- ca
- da
- gl
- ta
- hu
- fi
- pl
- et
- hi
- la
- ur
- th
- vi
- jw
- bn
- yo
- sl
- cs
- sw
- nn
- he
- ms
- uk
- id
- kk
- bg
- lv
- my
- tl
- sk
- ne
- fa
- af
- el
- bo
- hr
- ro
- sn
- mi
- yi
- am
- be
- km
- is
- az
- sd
- br
- sq
- ps
- mn
- ht
- ml
- sr
- sa
- te
- ka
- bs
- pa
- lt
- kn
- si
- hy
- mr
- as
- gu
- fo
pipeline_tag: text-to-speech
inference: false
extra_gated_prompt: >-
  You agree to not use the model to generate contents that violate DMCA or local laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Fish Audio S2 Pro

*Figure: Fish Audio S2 Pro overview (fine-grained control, multi-speaker multi-turn generation, low-latency streaming, and long-context inference).*

**Fish Audio S2 Pro** is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio data across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

## Architecture

S2 Pro builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate) using a **Dual-Autoregressive (Dual-AR)** architecture:

- **Slow AR** (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
- **Fast AR** (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it inherits all LLM-native serving optimizations from SGLang, including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.

## Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using `[tag]` syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts **free-form textual descriptions**, such as `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`, allowing open-ended expression control at the word level.

**Common tags (15,000+ unique tags supported):**

`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`

## Supported Languages

S2 Pro supports 80+ languages.

**Tier 1:** Japanese (ja), English (en), Chinese (zh)

**Tier 2:** Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

**Other supported languages:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

## Production Streaming Performance

On a single NVIDIA H200 GPU:

- **Real-Time Factor (RTF):** 0.195
- **Time-to-first-audio:** ~100 ms
- **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5

## Links

- [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
- [Fish Audio Playground](https://fish.audio)
- [Blog & Tech Report](https://fish.audio/blog/fish-audio-open-sources-s2/)

## License

This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio; contact business@fish.audio.
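
## Example: Inspecting Inline Control Tags

The `[tag]` syntax described under Fine-Grained Inline Control can be illustrated with a small client-side sketch. The helper below, `split_inline_tags`, is a hypothetical name and is not part of Fish Speech or any Fish Audio API; it only separates bracketed instructions from the spoken text with a regular expression so a prompt can be inspected before it is sent for synthesis. It assumes tags are never nested and contain no square brackets themselves, which this README does not state explicitly.

```python
import re
from typing import List, Tuple

# Matches one inline instruction such as "[whisper]" or "[pitch up]".
# Assumption: tags are not nested and contain no square brackets themselves.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_tags(text: str) -> List[Tuple[str, str]]:
    """Split a prompt into ordered ("tag", ...) and ("text", ...) segments."""
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        # Plain text between the previous tag (or the start) and this tag.
        plain = text[cursor:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1).strip()))
        cursor = match.end()
    # Any plain text left after the final tag.
    tail = text[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    prompt = "[whisper] I have a secret. [short pause] Don't tell anyone! [laughing]"
    for kind, value in split_inline_tags(prompt):
        print(f"{kind:>4}: {value}")
```

Because the tags are free-form descriptions rather than a fixed vocabulary, the sketch does not validate tag names; it only shows where each instruction sits relative to the surrounding words.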