Anansi - TV game show crawler

Anansi is a Python-based crawler that combines computer vision (cv2 and FFmpeg) with OCR (EasyOCR and Tesseract) to find and extract questions and correct answers from video files of popular TV game shows in the Balkan region.

Idea & Motivation

There are two main reasons for doing this project.

Reason #1

In the last couple of years, the pub quiz scene in Serbia has seen a rise in popularity. After doing a bit of research and competing in some of the quizzes myself, I saw that a lot of questions are recycled, and that it would be nice to have some kind of "knowledge database" where you can test yourself and perhaps prepare for the quizzes.

To my surprise, I found only a couple of popular mobile/browser games that mimic the games from the TV shows, and they are very limited when it comes to the number of distinct questions in their databases.

I found that a lot of the questions used in pub quizzes come from TV game shows, so, naturally, I started watching episodes of Slagalica (Serbian), Potera (Serbian), and Potjera (Croatian). However, that took too long: episodes of Slagalica are 22 minutes long, and episodes of Potera (eng. The Chase) run for more than 40 minutes. The most fun games in both shows are the ones where you can directly test your knowledge. In Slagalica that game is called "Ko zna zna"; in Potera I don't think it has an official name, it is always referred to as the "second game" :).

So, after giving it some thought, I decided to create a program that goes through the episodes, finds the games, extracts the questions, finds the answers, and puts all of that in a spreadsheet-friendly format such as ".csv".

After a couple of days, this is the project that I came up with.

Reason #2

Learn something new. I hadn't done any computer vision work, not since college anyway, and I didn't have any experience with OCR either. So it was a really nice opportunity to venture into the unknown.

Results - extracted questions & answers

Before we start with the algorithm explanation, how-to-use guides, and requirements, here are all the questions & answers extracted from the episodes of Slagalica and Pot(j)era that are available on YouTube.

Slagalica (more than 24k questions and answers)

https://tinyurl.com/anansi-slagalica

Pot(j)era (3k questions and answers)

https://tinyurl.com/anansi-pot(j)era

Note: Slagalica has a lot more extracted questions (24k vs. 3k) simply because the Slagalica TV show has an official YouTube channel where all recent episodes can be found. Pot(j)era does not have an official channel, and videos are often removed due to copyright infringement, which leaves a very limited number of episodes circulating on the internet.

Algorithm

The following is the section explaining how everything works under the hood.

General idea

In both Slagalica and Pot(j)era, the main idea is the same:

Open the video file
Find the beginning of the desired game
Recognize the frame where both question and answer are visible
Extract sections of the frame with the texts and preprocess them for OCR
OCR the sections to get questions & answers
Move to the next question
Finish processing the video file if the game has ended

An easy, straightforward, almost bulletproof idea, right? What could possibly go wrong?

Slagalica

Slagalica pseudo algorithm

Here is the basic idea of the Slagalica crawler algorithm.

Open the video file
Skip the first half of the video
If game start is not found
1. Go through frames until the template for the game start is found
If game start is found
1. In the seek question area look for the blue mask and the blue rectangle
2. If the question rectangle is found
  1. Monitor for changes in the answer rectangle
  2. Keep track of the changes
  3. If a change occurred in the answer rectangle
    1. Preprocess the question & answer
    2. OCR
    3. Sanitization
If the number of found questions is 10 or game end is found or the video file has no more frames
1. Finish processing
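
The pseudo algorithm above can be sketched as a small loop. This is only an illustration under stated assumptions, not the project's actual code: the function name `crawl_episode` and all the detector callbacks (`is_game_start`, `is_game_end`, `has_question`, `answer_appeared`, `read_qa`) are hypothetical placeholders for the template matching, mask, and OCR steps, and frames are treated as opaque objects.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def crawl_episode(
    frames: Sequence[Any],
    is_game_start: Callable[[Any], bool],
    is_game_end: Callable[[Any], bool],
    has_question: Callable[[Any], bool],
    answer_appeared: Callable[[Any, Any], bool],
    read_qa: Callable[[Any], Tuple[str, str]],
    max_questions: int = 10,
) -> List[Tuple[str, str]]:
    """Walk the second half of an episode and collect (question, answer) pairs.

    All callbacks are hypothetical stand-ins for the real detectors.
    """
    results: List[Tuple[str, str]] = []
    game_started = False
    previous: Optional[Any] = None  # last frame that showed a question rectangle

    # Skip the first half of the video.
    for frame in frames[len(frames) // 2:]:
        if not game_started:
            # Keep scanning until the game start is found.
            game_started = is_game_start(frame)
            continue
        if is_game_end(frame):
            break
        if not has_question(frame):
            previous = None
            continue
        # A question is visible: watch the answer rectangle for a change.
        if previous is not None and answer_appeared(previous, frame):
            # Preprocessing, OCR, and sanitization are hidden behind read_qa.
            results.append(read_qa(frame))
            if len(results) >= max_questions:
                break
        previous = frame
    return results
```

Passing the detectors in as callbacks keeps the control flow separate from the image processing, so the loop can be tried on fake frames before any video is involved.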

In the next sections, I will go through every step to explain the reasoning behind it and discuss the current implementation.

Slagalica TV game show

https://sr.m.wikipedia.org/sr-ec/TV_slagalica

Slagalica algorithm

Rules of the game

In the TV game show "Slagalica", there is a game near the end called "Ko zna zna", in which players answer 10 general knowledge questions. This is, I think, by far the most liked game in the show.

Finding the beginning and the end of the game

The "Ko zna zna" game usually begins in the last third of the show, so we can immediately skip the first half of the video. Then we need to figure out how to find the game start.

After 106th season

Starting with the 106th season (from 4.5.2018), the intro for "Ko zna zna" is played full screen just before the game.

Game intro:

The easy thing to do was to create a template (a smaller image cropped from the full frame) to match:

Using OpenCV, you can try to find a template in an image, and OpenCV will return a confidence level, i.e. how similar the two images are. With a threshold (e.g. treating a similarity above 0.5 as a match), finding the game start becomes trivial.

To save processing time, the intro of the next game (the game after "Ko zna zna") should also be found and used as the game end. Sure, you can add a condition to stop immediately once 10 questions are found, but sometimes not all 10 questions will be found: sometimes the TV show editor cuts to the next game before the last question is shown (e.g. the episode from 14.11.2018).

Using the same logic as for the game intro, you can find the game outro (game end). And as you know, the end of one game is the beginning of another - :O mindblown.gif. :)

So, by using this reference game intro (the game after the main one):

and with the template:

You can find the next game intro, which is surely the previous game's end.

Before 106th season

Before the 106th season (before 4.5.2018), the game intro was played on the big screen behind the TV show hosts, so straightforward template matching does not work as well as it does for episodes from the 106th season onward.

However we can re-use the same logic that is being used for Potera (instead of template matching, use pink mask + contour area matchin
