Last Thursday, we sat down with a couple of great data scientists from Oracle to learn how they use people-powered sentiment analysis. Our CEO Lukas Biewald was joined by Randall Sparks (Principal Member of Technical Staff at Oracle Data Cloud) and Pallika Kanani (Senior Research Staff Member at Oracle Labs) for the session. The folks at Oracle showed us how they create training sets, how they iterate on their algorithms, and how they handle sentiment across multiple languages. We had a lot of questions in the Q&A we couldn’t get to, so we’ll be answering those below. To start, here’s a recording of our chat if you weren’t able to join us:
(You can also peruse the slides here)
Alright. On to your questions:
1. What languages did you run jobs in?
Oracle ran jobs in Spanish and French, among others. CrowdFlower has fluent contributor bases in the following languages:
- Bahasa (Indonesian)
If you want to get sentiment in a language that’s not listed above, we recommend that you write your instructions in the language of your choosing and just create test questions. If people don’t understand what you’re asking, they won’t hop into your job and, even if they do, they won’t pass if they can’t read what you need done.
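To make the test-question idea concrete, here is a minimal sketch in Python of how answers on test questions can screen out contributors who can't read the task. This is an illustration only, not CrowdFlower's actual implementation: the function name `passing_contributors`, the data layout, and the 0.7 accuracy threshold are all assumptions made for the example.

```python
from collections import defaultdict


def passing_contributors(answers, gold, threshold=0.7):
    """Return contributor ids whose accuracy on test questions meets the threshold.

    answers: iterable of (contributor_id, question_id, label) tuples
    gold: dict mapping test-question id -> correct label
    threshold: hypothetical minimum accuracy, chosen for illustration
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for contributor, question, label in answers:
        if question not in gold:
            continue  # a regular row, not a test question
        seen[contributor] += 1
        if label == gold[question]:
            correct[contributor] += 1
    return {c for c in seen if correct[c] / seen[c] >= threshold}


# Two test questions with known answers, plus one regular row ("r1").
gold = {"t1": "positive", "t2": "negative"}
answers = [
    ("ana", "t1", "positive"), ("ana", "t2", "negative"),
    ("bob", "t1", "negative"), ("bob", "t2", "negative"),
    ("ana", "r1", "neutral"),
]
trusted = passing_contributors(answers, gold)
```

In this toy data, "ana" answers both test questions correctly and "bob" only one of two, so only "ana" clears the assumed threshold.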
2. How many rows do you need to train a sentiment model?
This was actually asked and answered in the webinar above, but we wanted to call it out here. While there’s no silver bullet, most data scientists get at least 10,000 rows to train a sentiment algorithm. After that, as Randall mentioned in the webinar, things flatten out a bit. You can read about the importance of having a lot of training data in a recent post we wrote and even download some open data in our Data for Everyone library if you want to get a head start.
3. Can you walk us through the process of setting up a similar job?
Of course we can. While we could go on for ages talking about how to set up a great sentiment analysis job on CrowdFlower, we’ll give you a few tips and a sort of high-level view of best practices. Of course, if you’d like to run a job of your own, we have templates for this sort of thing and you can check those out if you take the platform for a spin by signing up for a trial. You can also contact us and we’ll walk you through whatever you need.
4. Is this Oracle’s user interface that has layered CrowdFlower data or are these now new jobs available in CrowdFlower?
The slide in question was actually Oracle’s interface, not ours. We thought it looked nice too. That said, we released a new graphical editor a week or two ago that we think looks pretty slick in its own right. If you want to see a demo, we’ll be showing it off (and our new reports functionality) in another webinar on September 10th. Register here.
5. Do you train models per topic, per language or both?
The more specific a model is (and the more data it’s learned from), the better. We covered the benefits of both in our post about MonkeyLearn, who used CrowdFlower to create training sets for industry-specific algorithms. Splitting languages is usually best practice as well.
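As a rough illustration of splitting by language or topic, the sketch below groups labeled rows into separate training sets before any model is trained. The field names (`language`, `topic`, `text`, `sentiment`) and the helper `split_training_rows` are hypothetical, and the sample rows are made up for the example.

```python
from collections import defaultdict


def split_training_rows(rows, key_fields=("language",)):
    """Group labeled rows into one training set per combination of key_fields.

    rows: iterable of dicts with "text", "sentiment", and the key fields
    key_fields: which fields define a separate model, e.g. ("language", "topic")
    """
    buckets = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        buckets[key].append((row["text"], row["sentiment"]))
    return dict(buckets)


# Made-up rows for illustration only.
rows = [
    {"language": "es", "topic": "retail", "text": "Me encanta este producto", "sentiment": "positive"},
    {"language": "fr", "topic": "film", "text": "Ce film est terrible", "sentiment": "negative"},
    {"language": "es", "topic": "film", "text": "La película fue aburrida", "sentiment": "negative"},
]

per_language = split_training_rows(rows)
per_language_and_topic = split_training_rows(rows, key_fields=("language", "topic"))
```

Each resulting bucket would then feed its own model. Keep in mind that the finer the split, the fewer rows each model sees, so it has to be weighed against the training-set sizes discussed in question 2.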
That’s it for now. If you have any additional questions that aren’t answered in the webinar or the text above, feel free to ask them in the comments. We’ll answer you there or amend the post above. Thanks for watching.