Session Outline
We describe how we improved sentiment analysis for a specific domain by using generative models (GPT-3.5/4) to generate synthetic examples, which were then used to fine-tune an existing transformer-based model.
In initial experiments, the method we devised allowed us to scale 450 domain-specific, severely skewed texts up to a corpus of 500,000+ balanced and labeled texts, letting us circumvent privacy issues in the original data and improve the predictive power of the final model.
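The generation pipeline itself is Gavagai's own and is not spelled out in the abstract; the sketch below (Python, using the OpenAI client) only illustrates the general pattern of seeding a generative model with a real example and asking for new texts carrying a known label. The prompt wording, label names, and sampling parameters are assumptions, not the talk's actual setup:

    # Sketch: generating synthetic labeled examples with a generative LLM.
    # Prompt wording, labels, and parameters are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_examples(seed_text: str, label: str, n: int = 5) -> list[str]:
        """Ask the model for n new texts in the domain and style of
        seed_text, written to carry the given sentiment label."""
        prompt = (
            f"Here is a customer feedback text from our domain:\n\n{seed_text}\n\n"
            f"Write {n} new, clearly {label} texts in the same domain and style. "
            "Do not copy the original. Return one text per line."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.9,  # high temperature encourages diverse outputs
        )
        text = response.choices[0].message.content
        return [line.strip() for line in text.splitlines() if line.strip()]

Generating each label in equal amounts from the skewed seed set is what allows the resulting corpus to be balanced even though the original texts were not.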
The final fine-tuned model improved its F1 score by 8 to 10 points when evaluated on a held-out, non-synthetic dataset.
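As a rough illustration of that evaluation setup, the sketch below fine-tunes a transformer on the synthetic corpus with Hugging Face's Trainer and reports macro-F1 on real, held-out data. The base model, corpora, and hyperparameters are placeholders, not the configuration used in the talk:

    # Sketch: fine-tune on synthetic data, evaluate F1 on real held-out data.
    import numpy as np
    from datasets import Dataset
    from sklearn.metrics import f1_score
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Placeholder corpora: train on synthetic texts, evaluate on real ones.
    synthetic_corpus = {"text": ["Great service!", "Terrible delay."], "label": [0, 1]}
    held_out_corpus = {"text": ["Okay experience, nothing special."], "label": [2]}

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=3)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length",
                         max_length=128)

    train_ds = Dataset.from_dict(synthetic_corpus).map(tokenize, batched=True)
    eval_ds = Dataset.from_dict(held_out_corpus).map(tokenize, batched=True)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        return {"f1": f1_score(labels, preds, average="macro")}

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=32),
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    print(trainer.evaluate())  # macro-F1 on the non-synthetic held-out set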
The talk will also address the two major challenges of generating labeled synthetic training data: label noise, and ensuring that the generated data is “similar enough” to the original data to be useful as training data.
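For the second challenge, one common tactic (not necessarily the one presented in the talk) is to embed both corpora and discard synthetic texts that drift too far from the real data. In the sketch below, the encoder model and the similarity threshold are assumptions:

    # Sketch: filter synthetic texts by similarity to the real-data centroid.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

    real_texts = ["Great service!", "Terrible delay."]   # placeholders
    synthetic_texts = ["Fast and friendly staff.",
                       "Buy cheap watches online!"]      # placeholders

    real_emb = encoder.encode(real_texts, normalize_embeddings=True)
    syn_emb = encoder.encode(synthetic_texts, normalize_embeddings=True)

    centroid = real_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    # Cosine similarity; embeddings are unit-normalized, so a dot product suffices.
    similarity = syn_emb @ centroid
    keep = [t for t, s in zip(synthetic_texts, similarity)
            if s >= 0.3]  # threshold is a guess, tune on your own data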
Key Takeaways
- Generative LLMs can be used to generate synthetic, labeled training data for NLP tasks.
- Two challenges with synthetic data: Can we trust the labels that the LLM outputs? Can we control the distribution of the generated data so that it is similar enough to the original data?
Speaker Bio
Fredrik Olsson – Head of Data Science & Product Strategist | Gavagai
Fredrik Olsson (PhD) is the Head of Data Science and part of the management team at Gavagai, a company providing multilingual and scalable text analytics in the Customer Experience domain.
For the past 25 years, Fredrik has held positions in research and start-ups, and has served as an advisor on data, Machine Learning, and NLP.
Day 2 | 26 Oct 2023 | MACHINE LEARNING + MLOPS