Model architecture
SlopTTS is an experimental neural TTS system with an eSpeak-based phonemization frontend, a contextual text encoder, and an adversarial flow-matching acoustic predictor that operates in a VAE-style latent space.
The predictor estimates phoneme durations and acoustic latents, and a neural vocoder decodes the latents into waveform audio. Text is processed one sentence at a time, with each sentence conditioned on its neighboring sentences for smoother prosody across sentence boundaries.
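The sentence-by-sentence processing with neighboring context can be pictured roughly as follows. This is only a minimal sketch, not SlopTTS's actual code: the names `SentenceWindow`, `split_sentences`, and `context_windows` are hypothetical, the regex splitter is a naive stand-in for whatever segmentation the model uses, and the window of one sentence on each side is an assumption.

```python
import re
from typing import List, NamedTuple, Optional


class SentenceWindow(NamedTuple):
    """One sentence plus its neighbors (hypothetical structure, for illustration)."""
    previous: Optional[str]
    current: str
    following: Optional[str]


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def context_windows(text: str) -> List[SentenceWindow]:
    # Pair every sentence with the sentence before and after it, so a
    # synthesizer could condition on that context at sentence boundaries.
    sentences = split_sentences(text)
    windows = []
    for i, sentence in enumerate(sentences):
        previous = sentences[i - 1] if i > 0 else None
        following = sentences[i + 1] if i + 1 < len(sentences) else None
        windows.append(SentenceWindow(previous, sentence, following))
    return windows


if __name__ == "__main__":
    for window in context_windows("Hello there. How are you? Fine!"):
        print(window)
```

In a setup like this, only `current` would be synthesized on each step, while `previous` and `following` would serve purely as conditioning input.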
The model generalizes poorly because it was trained on a small amount of data with limited compute. The training data consisted of assorted datasets found online.
Note: The model is not yet optimized for fast inference.