WebBot

A browser extension for Firefox and Chrome that simulates a user browsing (at least) the top 50 main, news, image, and video search results of up to 8 different search engines.

Install on Firefox or Chrome
Select search engines and keywords
Start crawling
Save accessed pages into downloads
Deactivate the extension

Cite us

Ulloa, R., Makhortykh, M., & Urman, A. (2022). Scaling up search engine audits: Practical insights for algorithm auditing. Journal of Information Science. https://doi.org/10.1177/01655515221093029

Demo

WebBot demo

Installation

Note that as soon as the extension is installed and activated, it will interfere with your normal web browser usage. Deactivate the extension whenever you don't need it.

Firefox

Clone or download this repository
Open Firefox
Navigate to about:debugging
Click on This Firefox to the left
Click on Load Temporary Add-on...
Navigate to where you downloaded (or cloned) the extension
Open the file build/manifest.json

Chrome

Clone or download this repository
Open Chrome
Navigate to chrome://extensions
Activate the Developer Mode switch (top-right corner)
Click on Load unpacked extension...
Navigate to where you downloaded (or cloned) the extension, and select the directory build
Open the build directory

Usage

🔧 Adjusting the Settings

Settings can be accessed by clicking on the extension's icon in the browser's toolbar. Settings are applied after the Update Settings button is pressed and are stored in the browser's local storage. If settings are changed after crawling has already started, it is recommended to reload the extension. The following settings are available:

| Option | Default | Behavior |
|-----------------------|--------------------------|-----------------------------------|
| Clear Browser Data | No | WARNING Activating this option will delete all your browser data. |
| Close Inactive Tabs | No | WARNING Activating this option will close all your browser tabs upon landing on a search engine. |
| Save Pages | No | Automatically save the complete result pages as downloads for further analysis. |
| Save In | webbot | If Save Pages is activated, this is the subdirectory of your downloads folder that the webpages are saved into. |
| Configuration | Local | Switch between selecting engines and keywords locally or providing them through a server. For the latter, see the Advanced guide down below. |
| Server | - | If Server configuration is selected, the full URL of the server. |
| Search Engines | Google, DuckDuckGo, Bing | If Local configuration is selected, determine the search engines to query – see the table below. |
| Result Types | Text, News | If Local configuration is selected, select which results should be gathered from each engine. |
| Query Terms | - | If Local configuration is selected, provide a comma-separated list of terms to query. Each term can be composed of multiple words and symbols such as -"+, only commas are reserved. Each term is queried once by a selected search engine. To query the same term by multiple search engines, repeat the term for each engine. |

Example: Assume the goal is to query both Google and Baidu for the terms "climate" and "kyoto protocol +band" using the Local configuration. Select Google and Baidu from the list of search engines and untick all other engines. In the Query Terms field, enter the following: "climate, climate, kyoto protocol +band, kyoto protocol +band". The terms have to be repeated so that both engines are queried with the same terms. Otherwise, the crawled pages would only include "climate" results from Google and "kyoto protocol +band" results from Baidu.
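For longer term lists, the repetition described in the example can be generated rather than typed by hand. The following is a minimal sketch, not part of the extension; the function name `build_query_terms` is hypothetical, and it assumes that repeating each term once per selected engine, as in the example above, is all that is needed.

```python
def build_query_terms(terms, n_engines):
    """Build the text for the Query Terms field.

    Each term is repeated once per selected engine, mirroring the
    "climate, climate, ..." example above. Hypothetical helper, not
    part of WebBot itself.
    """
    for term in terms:
        # Commas are the only reserved character in the Query Terms field.
        if "," in term:
            raise ValueError(f"commas are reserved as separators: {term!r}")
    repeated = [term for term in terms for _ in range(n_engines)]
    return ", ".join(repeated)


if __name__ == "__main__":
    # Two engines selected (e.g. Google and Baidu), two terms.
    print(build_query_terms(["climate", "kyoto protocol +band"], 2))
```

The printed string can be pasted into the Query Terms field as-is.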

🕷️ Start Crawling

Navigate to the URL of the engine that you would like to start crawling with. You do not need to accept any cookies or similar prompts; this is all handled by the extension.
Wait up to a minute. The automatic search starts at the next full minute, e.g. 14:37:00.
Let the extension handle navigation between search results (text, news, images, videos) and between the engines you selected. The engines will be accessed in the same order as in the table below. Each engine has 6 minutes to provide all results. If the request times out or navigation is interrupted, e.g. by a captcha, the next engine will automatically be accessed after 6 minutes have passed. If not all result types are selected, the time is reduced by 1 min per unselected result type.
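The timing rules above can be written down as a small calculation when planning a crawl. This is a rough sketch based only on the description in this section (6 minutes per engine, minus 1 minute per unselected result type, start at the next full minute); the names `engine_budget_minutes` and `next_full_minute` are illustrative and do not exist in the extension.

```python
from datetime import datetime, timedelta

# The four result types named in this README; identifiers are illustrative.
ALL_RESULT_TYPES = ("text", "news", "images", "videos")
FULL_BUDGET_MINUTES = 6


def engine_budget_minutes(selected_types):
    """Minutes one engine gets: 6, minus 1 per unselected result type."""
    selected = set(selected_types)
    unknown = selected - set(ALL_RESULT_TYPES)
    if unknown:
        raise ValueError(f"unknown result types: {sorted(unknown)}")
    unselected = len(ALL_RESULT_TYPES) - len(selected)
    return FULL_BUDGET_MINUTES - unselected


def next_full_minute(now):
    """Return the next full minute after `now`, e.g. 14:36:25 -> 14:37:00."""
    return now.replace(second=0, microsecond=0) + timedelta(minutes=1)


if __name__ == "__main__":
    print(engine_budget_minutes(["text", "news"]))
    print(next_full_minute(datetime.now()))
```

With the default result types (Text and News), this gives a budget of 4 minutes per engine.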

Be aware that some search engines might display weird behavior if developer tools are opened. Make sure to close the inspector/console/etc. unless you are debugging the extension.

🔍 Supported Engines

| Engine | URL | Notes |
|------------|-----|-------|
| Google | google.com | |
| DuckDuckGo | duckduckgo.com | |
| Bing | bing.com | |
| Yandex | yandex.com | Yandex is very strict with captchas and might thus require some manual intervention. News are currently not supported. Not yet implemented are ya.ru and yandex.ru (which now redirects to dzen.ru). |
| Yahoo! | search.yahoo.com | Note that Yahoo! handles localization primarily through subdomains, so we use the 'neutral' search subdomain for now. |
| Baidu | baidu.com | Baidu provides information rather than news results. |
| So | so.com | So also provides information rather than news results. |
| Sogou | sogou.com | Sogou does not provide news results. |

💾 Saving Search Results

We integrated the wonderful SingleFile into this extension to automatically save search result pages. This feature can be turned on or off in the settings. Pages will be stored as full archives containing all necessary scripts, fonts, pictures, etc. in-line.

If search results are presented as multiple pages, each page is saved individually. If more search results are automatically loaded after scrolling to the bottom, the page is saved only once, after scrolling the designated amount. Pages are saved in the format <engine url>_<keyword>_<result type>_<date>_<time>.html. It is also possible to designate a specific subdirectory to download the pages into. This might come in handy if multiple browsers are used to crawl and save into the same downloads directory.
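When processing many saved pages, the file name pattern above can be split back into its parts. The sketch below is an illustration under assumptions: it assumes the engine URL, result type, date, and time contain no underscores (the keyword may), and the date and time shown in the example file name are made up, since the exact formats are not specified here. `parse_saved_name` is a hypothetical helper.

```python
from pathlib import Path


def parse_saved_name(filename):
    """Split '<engine url>_<keyword>_<result type>_<date>_<time>.html'.

    Assumption: only the keyword may contain underscores, so the engine
    is taken from the left and the last three fields from the right.
    """
    stem = Path(filename).stem  # drops the trailing ".html"
    engine, rest = stem.split("_", 1)
    keyword, result_type, date, time = rest.rsplit("_", 3)
    return {
        "engine": engine,
        "keyword": keyword,
        "result_type": result_type,
        "date": date,
        "time": time,
    }


if __name__ == "__main__":
    # Hypothetical file name; the real date/time formats may differ.
    print(parse_saved_name("google.com_climate_news_2022-05-01_14-37-00.html"))
```

Splitting from both ends keeps keywords with underscores intact, at the cost of relying on the other fields being underscore-free.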

⚙️ Processing Saved Search Results

We provide scripts for parsing the search results in Python and R. Have a look!

In general, saved result pages can be parsed in Python with Beautiful Soup. As images are stored inline, they can be extracted from the result pages for further processing without reloading the original images.
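To illustrate how inline images could be pulled out of a saved page, here is a small sketch. It deliberately uses Python's standard-library `html.parser` instead of Beautiful Soup so that it runs without extra packages, and it is not the parsing script shipped with this repository. It assumes that inline images appear as base64 `data:` URIs in the `src` attribute of `<img>` tags; `extract_inline_images` is a hypothetical name.

```python
import base64
from html.parser import HTMLParser


class InlineImageCollector(HTMLParser):
    """Collect (mime type, raw bytes) for every base64 data-URI <img>."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if not src.startswith("data:image/"):
            return  # not an inline image
        header, _, payload = src.partition(",")
        if not header.endswith(";base64"):
            return  # skip other encodings, e.g. URL-encoded SVG
        mime = header[len("data:"):-len(";base64")]
        self.images.append((mime, base64.b64decode(payload)))


def extract_inline_images(html):
    """Return a list of (mime type, bytes) for inline images in `html`."""
    collector = InlineImageCollector()
    collector.feed(html)
    return collector.images


if __name__ == "__main__":
    sample = '<p><img src="data:image/png;base64,aGVsbG8="></p>'
    for mime, data in extract_inline_images(sample):
        print(mime, len(data), "bytes")
```

The same selection could be done with Beautiful Soup by iterating over `find_all("img")` and applying the same `src` check.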

🚧 Reload, Deactivate, or Remove

On Firefox, navigate to about:addons to deactivate or remove the extension. Reload the extension by deactivating and re-activating the extension.

On Chrome, navigate to chrome://extensions to reload, deactivate, or remove the extension.

Advanced

For some experimental setups, crawling search engines in parallel on multiple browsers or machines could be desired. For example, to investigate how Google search results differ between Germany and Brazil, one could rent virtual servers in both countries and then start crawling. In these scenarios it makes sense not to define the lists of search engines and query terms within the extension but to provide it through a central server. WebBot supports this by allowing for a Server configuration in the Settings.

🚲 Installing the Microserver

To test out server deployment, this repository includes a microserver that can be started on the same machine where the browser is running (localhost). It requires Python and the simplejson package to be installed. The lists of engines, result types, and query terms are served from engines.txt, resulttypes.txt, and queryterms.txt, each with one entry per line. Note that engines.txt has to contain the full URL of each engine, such as https://search.yahoo.com.

Open a terminal
Navigate to where you downloaded/cloned this repository
Navigate to the microserver folder
Install simplejson: pip install simplejson or conda install simplejson
Run the server: python server.py 8000
In the extension's settings, select the Server configuration and enter http://localhost:8000/ for the server URL.
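Because engines.txt needs full URLs, a quick check before starting the microserver can catch entries such as a bare `google.com`. This is a minimal sketch and not part of the microserver; `check_engine_lines` is a hypothetical helper, and it only verifies that each non-empty line has an http(s) scheme and a host.

```python
from urllib.parse import urlparse


def check_engine_lines(lines):
    """Return (line number, text) for entries that are not full URLs."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        parsed = urlparse(line)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((number, line))
    return problems


if __name__ == "__main__":
    # Example content; in practice, read the lines from engines.txt.
    sample = ["https://search.yahoo.com", "google.com"]
    for number, line in check_engine_lines(sample):
        print(f"line {number}: not a full URL: {line}")
```

An empty result means every entry at least looks like a full URL; it does not confirm that the engine is one of the supported ones.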

🚀 Setting up a Production Server

The microserver provided is meant to be used on the same machine only. This is suboptimal if you want to control several machines: to change the query terms, for example, you would have to change the file on every microserver.

Therefore, it is better to set up a server on an external machine that is accessible to all the crawling machines, so that the lists can be changed for all machines at once. Any server will do (e.g. Flask, Apache, klein, node); you just have to make sure that the following requests are available:

- POST: bot/getengines
- POST: bot
