Synthetic Text

Python package for generating and filtering synthetic text data for NLP tasks

Building synthetictext, a Python package for generating synthetic training and evaluation data for text classification, style transfer, and RAG evaluation.

Current focus:

  • Task-agnostic generation for binary and multi-class classification
  • Multiple generation strategies: direct generation, paraphrasing, contrastive pairs, backtranslation, and pivot translation
  • Quality filters for deduplication, label leakage, embedding similarity, marker artifacts, and LLM-as-judge scoring
  • Multilingual and low-resource workflows, including cross-lingual transfer, with provider-agnostic model support
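
As an illustration of the quality-filter idea in the list above, here is a minimal sketch of two of the simpler filters: exact-match deduplication after normalization, and a label-leakage check that drops examples whose text contains the label name. It is not the synthetictext API. The names Example, normalize, deduplicate, and drop_label_leakage are hypothetical and exist only for this example, and the leakage rule (label appears as a whole word or phrase in the text) is an assumed heuristic.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical container for one synthetic example (not the package's type)."""
    text: str
    label: str


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def deduplicate(examples):
    """Keep the first example seen for each normalized text."""
    seen = set()
    kept = []
    for ex in examples:
        key = normalize(ex.text)
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def drop_label_leakage(examples):
    """Drop examples whose text contains the label as a whole word or phrase.

    Assumed heuristic: a generated text that spells out its own label
    lets a classifier shortcut the task, so it is filtered out.
    """
    kept = []
    for ex in examples:
        label = normalize(ex.label)
        text = normalize(ex.text)
        # Pad with spaces so that only whole-word matches count.
        leaked = bool(label) and f" {label} " in f" {text} "
        if not leaked:
            kept.append(ex)
    return kept


examples = [
    Example("The battery died after two days.", "negative"),
    Example("the battery died after two days", "negative"),   # near-duplicate
    Example("This is a positive review: great screen!", "positive"),  # leaks label
    Example("Great screen and fast shipping.", "positive"),
]

filtered = drop_label_leakage(deduplicate(examples))
for ex in filtered:
    print(ex.label, "|", ex.text)
```

Running deduplication first keeps the leakage check from repeating work on copies. Embedding-similarity and LLM-as-judge filters would follow the same list-in, list-out shape, but they depend on external models and are left out of this sketch.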

Tech: Python, OpenAI API, embeddings, Google Translate, packaging, CLI design