It’s hard to avoid spoilers on the internet these days. Even if you’re careful, a random tweet or recommended news item could lay waste to your plan to watch that season finale a day late or catch a movie after the crowds have subsided. But soon an AI agent may do the spoiler-spotting for you, and flag spoilerific reviews and content before you even have a chance to look.
SpoilerNet is the creation of a team at UC San Diego, composed perhaps of people who tried waiting a week to see Infinity War and got snapped for their troubles. Never again!
They assembled a database of more than a million reviews from Amazon-owned reading community Goodreads, where it is the convention to note spoilers in any reviews, essentially line by line. As a user of the site I’m thankful for this capability, and the researchers were too — because nowhere else is there a corpus of written reviews in which whatever constitutes a “spoiler” has been meticulously labeled by a conscientious community.
(Well, sort of conscientious. As the researchers note: “we observe that in reality only a few users utilize this feature.”)
At any rate, such labeled data is these days basically food for what are generally referred to as AI systems: neural networks of various types that “learn” the qualities that define a specific image, object or, in this case, spoiler. The team fed the 1.3 million Goodreads reviews into the system, letting it observe and record the differences between ordinary sentences and ones with spoilers in them.
Perhaps writers of reviews tend to begin sentences with plot details in a certain way — “Later it is revealed…” — or maybe spoilery sentences tend to lack evaluative words like “great” or “complex.” Who knows? Only the network.
Once its training was complete, the agent was set loose on a separate set of sentences (from both Goodreads and mind-boggling timesink TV Tropes), which it was able to label as “spoiler” or “non-spoiler” with up to 92 percent accuracy. Earlier attempts to computationally predict whether a sentence has spoilers in it haven’t fared so well; one paper by Chiang et al. last year broke new ground, but is limited by its dataset and approach, which allow it to consider only the sentence in front of it.
“We also model the dependency and coherence among sentences within the same review document, so that the high-level semantics can be incorporated,” lead author of the SpoilerNet paper, Mengting Wan, told TechCrunch in an email. This allows for a more complete understanding of a paragraph or review, though of course it is also necessarily a more complex problem.
But the more complex model is a natural result of richer data, he wrote:
Such a model design indeed benefits from the new large-scale review dataset we collected for this work, which includes complete review documents, sentence-level spoiler tags, and other meta-data. To our knowledge, the public dataset (released in 2013) before this work only involves a few thousand single-sentence comments rather than complete review documents. For research communities, such a dataset also facilitates the possibility of analyzing real-world review spoilers in details as well as developing modern ‘data-hungry’ deep learning models in this domain.
This approach is still new, and the added complexity has its drawbacks. For instance, the model occasionally flags a sentence as having spoilers when other spoiler-ish sentences are adjacent, and its understanding of individual sentences is not quite good enough to tell when certain words really indicate spoilers. You and I know that “this kills Darth Vader” is a spoiler, while “this kills the suspense” isn’t, but a computer model may have trouble telling the difference.
Wan told me that the system should be able to run in real time on a user’s computer, though of course training it would be a much bigger job. That opens up the possibility of a browser plugin or app that reads reviews ahead of you and hides anything it deems risky. Though Amazon is indirectly associated with the research (co-author Rishabh Misra works there), Wan said there was no plan as yet to commercialize or otherwise apply the tech.
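No such plugin exists yet, but the hiding step is easy to picture. The Python sketch below splits a review into sentences, asks a scoring function how spoilery each one looks, and masks anything at or above a cutoff. Everything here is hypothetical: `hide_spoilers`, the threshold and the mask text are made up, and `toy_score` is a keyword check standing in for a trained model.

```python
import re


def hide_spoilers(review, score_sentence, threshold=0.5, mask="[spoiler hidden]"):
    """Replace every sentence whose score meets the threshold with a mask.

    `score_sentence` is any callable that maps a sentence to a number;
    in a real tool it would wrap a trained model (assumed, not shown here).
    """
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    return " ".join(
        mask if score_sentence(sentence) >= threshold else sentence
        for sentence in sentences
    )


def toy_score(sentence):
    # Stand-in for a real model: treat "revealed" as a giveaway word.
    return 1.0 if "revealed" in sentence.lower() else 0.0


review = "A great, complex book. Later it is revealed that the butler did it."
print(hide_spoilers(review, toy_score))
# prints: A great, complex book. [spoiler hidden]
```

Keeping the scorer as a plug-in function means the same masking logic would work whether the scores come from a keyword list or a neural network.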
No doubt it would be a useful tool for Amazon and its subsidiaries and sub-businesses to be able to automatically mark spoilers in reviews and other content. But until the new model is implemented (and really until it is a bit better) we’ll have to stick to the old-fashioned method of avoiding all contact with the world until we’ve seen the movie or show in question.
The team from UCSD will be presenting their work at the Association for Computational Linguistics conference in Italy later this month; you can read the full paper here — but beware of spoilers. Seriously.