Introducing synthetictext

This weekend I published synthetictext, an LLM-powered Python package for generating synthetic text data for text classification tasks.

The goal is simple: make it easier to create usable synthetic training data for classification workflows without having to wire together a one-off pipeline every time.

Quarto version

This post is also published as a Quarto article, so it can host more interactive content as the project grows.

What it does

  • Task-agnostic generation for binary and multi-class text classification tasks
  • Five generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
  • Multi-stage quality filtering, including deduplication, label leakage detection, embedding-based near-duplicate removal, LLM-as-judge checks, and keyword marker checks
  • Multilingual support, including cross-lingual transfer for lower-resource settings
  • Provider-agnostic design, with built-in OpenAI support and extension points for custom LLM and translation providers
  • Both a Python API and a CLI
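To make the filtering idea more concrete, here is a minimal standalone sketch of two of the simpler stages mentioned above: exact deduplication and label leakage detection. This is not synthetictext's actual API. The function names (`normalize`, `leaks_label`, `filter_examples`) and the sample data are hypothetical, and the leakage check assumes the simplest possible rule, namely that the label name appears verbatim in the text.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def leaks_label(text: str, label: str) -> bool:
    """Return True if the label name appears in the text as a whole word or phrase."""
    pattern = r"\b" + re.escape(label.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def filter_examples(examples):
    """Keep the first copy of each normalized text and drop examples that leak their label."""
    seen = set()
    kept = []
    for text, label in examples:
        key = normalize(text)
        if key in seen or leaks_label(text, label):
            continue
        seen.add(key)
        kept.append((text, label))
    return kept


# Hypothetical generated examples for a sentiment task.
examples = [
    ("The battery died after two days.", "negative"),
    ("the battery  died after two days.", "negative"),        # duplicate after normalization
    ("This is a negative review of the phone.", "negative"),  # leaks the label name
    ("Works great, would buy again.", "positive"),
]

filtered = filter_examples(examples)
```

In this sketch only the first and last examples survive. Near-duplicate removal with embeddings and LLM-as-judge checks would sit after cheap filters like these, since they cost more per example.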

Why I built it

I had been using the same basic synthetic-data workflow across recent classification projects, including multilingual and low-resource ones. I wanted a reusable package that I could point at a new task spec instead of rebuilding the pipeline from scratch each time.

If you try it, I’d love to hear what works well and what needs to be added next.