Introducing synthetictext
Published:
This weekend I published synthetictext, an LLM-powered Python package for generating synthetic training data for text classification.
The goal is simple: make it easier to create usable synthetic training data for classification workflows without having to wire together a one-off pipeline every time.
Quarto version
This post is also published as a Quarto article, so it can host more interactive content as the project grows.
- Open the Quarto version: synthetictext on Quarto
What it does
- Task-agnostic generation for binary and multi-class text classification tasks
- Five generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
- Multi-stage quality filtering, including deduplication, label leakage detection, embedding-based near-duplicate removal, LLM-as-judge checks, and keyword marker checks
- Multilingual support, including cross-lingual transfer for lower-resource settings
- Provider-agnostic design, with built-in OpenAI support and extension points for custom LLM and translation providers
- Both a Python API and a CLI
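To give a feel for what the simplest filtering stages involve, here is a minimal standard-library sketch of exact deduplication and label leakage detection. It is not synthetictext's API: the function names (`normalize`, `leaks_label`, `filter_examples`) and the sample reviews are hypothetical, and the leakage rule (dropping any text that contains its own label name as a whole word) is an assumption about one reasonable way to define leakage.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def leaks_label(text: str, label: str) -> bool:
    """Return True if the label name appears as a whole word in the text."""
    pattern = rf"\b{re.escape(label.lower())}\b"
    return re.search(pattern, text.lower()) is not None


def filter_examples(examples: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep (text, label) pairs that are neither exact duplicates nor leaky."""
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for text, label in examples:
        key = normalize(text)
        if key in seen or leaks_label(text, label):
            continue
        seen.add(key)
        kept.append((text, label))
    return kept


# Hypothetical generated examples for a sentiment task.
raw = [
    ("The battery died after two days.", "negative"),
    ("the battery  died after two days.", "negative"),  # duplicate once normalized
    ("This is a negative review of the phone.", "negative"),  # names its own label
    ("Setup took five minutes and everything worked.", "positive"),
]
filtered = filter_examples(raw)
```

Cheap checks like these run first in this sketch because they need no model calls. The embedding-based and LLM-as-judge stages listed above would then only see the examples that survive.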
Why I built it
I kept reusing the same basic synthetic-data workflow across recent classification projects, including multilingual and low-resource ones, and wanted a reusable package that I could point at a new task spec instead of rebuilding the pipeline from scratch.
Links
If you try it, I’d love to hear what works well and what should be added next.
