Introducing synthetictext

This weekend I published synthetictext, an LLM-powered Python package that generates synthetic training data for text classification tasks.

The goal is simple: make it easier to create usable synthetic training data for classification workflows without having to wire together a one-off pipeline every time.

What it does

Task-agnostic generation for binary and multi-class text classification tasks
Five generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
Multi-stage quality filtering, including deduplication, label leakage detection, embedding-based near-duplicate removal, LLM-as-judge checks, and keyword marker checks
Multilingual support, including cross-lingual transfer for lower-resource settings
Provider-agnostic design, with built-in OpenAI support and extension points for custom LLM and translation providers
Both a Python API and a CLI
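
To make the filtering idea concrete, here is a minimal, standard-library-only sketch of three of the stages listed above: exact deduplication, label leakage detection, and near-duplicate removal. It is not synthetictext's actual API. The function names, the threshold, and the sample data are all hypothetical, and token-overlap (Jaccard) similarity stands in for the embedding-based comparison so the example runs without a model. The LLM-as-judge and keyword marker stages are left out.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def tokens(text: str) -> set[str]:
    """Split text into a set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity; a stand-in for embedding cosine similarity."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_examples(
    examples: list[tuple[str, str]],
    label_names: list[str],
    near_dup_threshold: float = 0.8,  # hypothetical default, tune per task
) -> list[tuple[str, str]]:
    """Keep (text, label) pairs that pass all three illustrative filters."""
    seen_exact: set[str] = set()
    kept: list[tuple[str, str]] = []
    kept_tokens: list[set[str]] = []

    for text, label in examples:
        key = normalize(text)

        # Stage 1: exact deduplication on the normalized text.
        if key in seen_exact:
            continue
        seen_exact.add(key)

        # Stage 2: label leakage, i.e. the text names one of the labels.
        if any(
            re.search(rf"\b{re.escape(name.lower())}\b", key)
            for name in label_names
        ):
            continue

        # Stage 3: near-duplicate removal against everything kept so far.
        toks = tokens(text)
        if any(jaccard(toks, prev) >= near_dup_threshold for prev in kept_tokens):
            continue

        kept.append((text, label))
        kept_tokens.append(toks)

    return kept


# Made-up sample data for illustration only.
examples = [
    ("The battery died after two days.", "negative"),
    ("the battery died   after two days.", "negative"),      # exact duplicate
    ("This review is negative: it broke.", "negative"),      # leaks the label
    ("The battery died after just two days.", "negative"),   # near duplicate
    ("Fast shipping and great build quality.", "positive"),
]
kept = filter_examples(examples, label_names=["positive", "negative"])
for text, label in kept:
    print(label, "|", text)
```

Running the stages from cheapest to most expensive means the costlier checks only see examples that survived the earlier ones, which matters more once a real embedding model or an LLM judge is in the loop.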

Why I built it

I used the same basic synthetic-data workflow in recent classification projects, including multilingual and low-resource settings, and wanted a reusable package that I could point at a new task spec instead of rebuilding the pipeline from scratch.
