Session Outline

We describe how we improved sentiment analysis for a specific domain by using generative models (GPT-3.5/4) to produce synthetic examples, which were then used to fine-tune an existing transformer-based model.

In initial experiments, the method we devised allowed us to scale up 450 domain-specific, severely skewed texts to a corpus of 500,000+ balanced, labeled texts. This let us circumvent privacy issues in the original data and improve the predictive power of the final model.
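As a rough illustration of the generation step, the sketch below builds a prompt asking an LLM for labeled examples in a machine-readable format and parses the reply into (text, label) pairs. The prompt wording, the `parse_examples` helper, and the canned reply are assumptions for illustration; the actual model call (e.g. to GPT-3.5/4) is left as a stub.

```python
import json

# Hypothetical prompt template (assumption, not the speaker's actual prompt):
# ask the LLM for labeled examples so labels come bundled with the text.
PROMPT_TEMPLATE = (
    "You are generating training data for a sentiment classifier in the "
    "customer-experience domain.\n"
    "Write {n} short customer comments about {topic}.\n"
    "Label each comment as positive, negative, or neutral.\n"
    'Return a JSON list of objects: [{{"text": ..., "label": ...}}].'
)

VALID_LABELS = {"positive", "negative", "neutral"}

def build_prompt(n: int, topic: str) -> str:
    return PROMPT_TEMPLATE.format(n=n, topic=topic)

def parse_examples(llm_output: str) -> list:
    """Parse the LLM's JSON reply, dropping malformed or mislabeled items."""
    try:
        items = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    return [
        (item["text"], item["label"])
        for item in items
        if isinstance(item, dict)
        and item.get("label") in VALID_LABELS
        and item.get("text")
    ]

# A canned reply standing in for the real LLM call; note the third item
# has an invalid label and is filtered out during parsing.
reply = (
    '[{"text": "Great support, quick reply!", "label": "positive"},'
    ' {"text": "App keeps crashing.", "label": "negative"},'
    ' {"text": "Service was fine", "label": "maybe"}]'
)
examples = parse_examples(reply)
```

Validating labels at parse time is one cheap first defense against the label-noise problem discussed below: anything the model emits outside the expected label set is discarded rather than trained on.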

The final fine-tuned model improved its F1 score by 8 to 10 points when evaluated on a held-out, non-synthetic dataset.

The talk will also address the two major challenges in generating labeled synthetic training data: label noise, and ensuring that the generated data is “similar enough” to the original data to be useful as training data.

Key Takeaways

  • Generative LLMs can be used to generate synthetic, labeled training data for NLP tasks.
  • Two challenges with synthetic data: Can we trust the labels that the LLM outputs? Can we control the distribution of the generated data to ensure that it is similar enough to the original data?
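The second challenge, checking that synthetic data resembles the original, can be approximated very cheaply before reaching for embeddings. The sketch below (an assumption, not the speaker's method) compares the top-k vocabularies of the two corpora with a Jaccard overlap; a very low score suggests the LLM drifted away from the source domain.

```python
from collections import Counter

def vocab(texts):
    """Whitespace-tokenized word counts over a list of texts."""
    return Counter(w for t in texts for w in t.lower().split())

def jaccard_overlap(original, synthetic, top_k=1000):
    """Crude distribution check: Jaccard overlap of the top-k word sets."""
    a = {w for w, _ in vocab(original).most_common(top_k)}
    b = {w for w, _ in vocab(synthetic).most_common(top_k)}
    return len(a & b) / len(a | b) if a | b else 0.0

# Tiny example: the corpora share "good"/"bad" but differ on the nouns.
score = jaccard_overlap(
    ["good service", "bad service"],
    ["good support", "bad support"],
)  # → 0.5
```

In practice one would use stronger signals (n-grams, embedding distances, or a classifier trained to distinguish real from synthetic texts), but a vocabulary overlap is a useful smoke test.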

————————————————————————————————————————————————————

Speaker Bio

Fredrik Olsson – Head of Data Science & Product Strategist | Gavagai

Fredrik Olsson (PhD) is the Head of Data Science and part of the management team at Gavagai, a company providing multilingual and scalable text analytics in the Customer Experience domain.

For the past 25 years, Fredrik has held positions in research, start-ups, and as an advisor in matters regarding data, Machine Learning, and NLP.

October 26 @ 15:50
15:50 — 16:20 (30′)

Day 2 | 26 Oct 2023 | INFRASTRUCTURE + DATA ENGINEERING STAGE
