Session Outline

We describe how we improved the sentiment analysis for a specific domain by using generative models (GPT-3.5/4) to generate synthetic examples that were then used for fine-tuning an existing transformer-based model.

In initial experiments, our method scaled 450 domain-specific, severely skewed texts into a balanced, labeled corpus of 500,000+ texts, allowing us to circumvent privacy issues in the original data and to improve the predictive power of the final model.

The final fine-tuned model improved its F1 score by 8 to 10 points when evaluated on a held-out, non-synthetic dataset.

The talk will also address the two major challenges of generating labeled synthetic training data: label noise, and ensuring that the generated data is “similar enough” to the original data to be useful as training data.

Key Takeaways

  • Generative LLMs can be used to generate synthetic, labeled training data for NLP tasks.
  • Two key challenges with synthetic data: Can we trust the labels the LLM outputs? Can we control the distribution of the generated data so that it stays similar enough to the original data?
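As a minimal sketch of the idea behind these takeaways: one common pattern is to prompt an LLM with a few in-domain seed texts and a target label, then parse and filter its output before using it as training data. The prompt template, label set, tab-separated output format, and function names below are illustrative assumptions, not the speaker's actual pipeline.

```python
# Hypothetical sketch: generating labeled synthetic sentiment examples
# with an LLM, plus a crude label-noise filter. The label set and the
# "text<TAB>label" output convention are assumptions for illustration.

VALID_LABELS = {"positive", "negative", "neutral"}

def build_prompt(seed_examples, target_label, n=5):
    """Condition the LLM on a few real in-domain texts (to keep the
    generated data close to the original distribution) and request n
    new texts with a fixed target label (to control class balance)."""
    seeds = "\n".join(f"- {t}" for t in seed_examples)
    return (
        f"Here are example customer comments from our domain:\n{seeds}\n\n"
        f"Write {n} new, similar comments with {target_label} sentiment.\n"
        "Output one comment per line as: <text>\\t<label>"
    )

def parse_and_filter(raw_output, expected_label):
    """Parse 'text<TAB>label' lines from the LLM response; drop
    malformed lines and lines whose label disagrees with the
    requested one (a simple guard against label noise)."""
    examples = []
    for line in raw_output.strip().splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue
        text, label = parts[0].strip(), parts[1].strip().lower()
        if label in VALID_LABELS and label == expected_label:
            examples.append((text, label))
    return examples
```

In practice the response to `build_prompt` would come from an LLM API call; the filtering step is where the two challenges above surface, since both disagreeing labels and off-distribution texts must be caught before fine-tuning.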


Speaker Bio

Fredrik Olsson – Head of Data Science & Product Strategist | Gavagai

Fredrik Olsson (PhD) is the Head of Data Science and part of the management team at Gavagai, a company providing multilingual and scalable text analytics in the Customer Experience domain.

For the past 25 years, Fredrik has held positions in research, start-ups, and as an advisor in matters regarding data, Machine Learning, and NLP.

October 26 @ 15:50
15:50 — 16:20 (30′)
