OpenWPM
<!-- omit in toc -->
OpenWPM is a web privacy measurement framework which makes it easy to collect data for privacy studies on a scale of thousands to millions of websites. OpenWPM is built on top of Firefox, with automation provided by Selenium. It includes several hooks for data collection. Check out the instrumentation section below for more details.
Table of Contents <!-- omit in toc -->
- Installation
- Quick Start
- Troubleshooting
- Documentation
- Advice for Measurement Researchers
- Developer instructions
- Instrumentation and Configuration
- Storage
- Docker Deployment for OpenWPM
- Citation
- License
Installation
OpenWPM is tested on Ubuntu 24.04 via GitHub Actions and is commonly used via the Docker container that this repo builds, which is based on Ubuntu 22.04. Although we don't officially support other platforms, conda is a cross-platform utility and the install script can be expected to work on OSX and other Linux distributions.
OpenWPM does not support Windows: https://github.com/openwpm/OpenWPM/issues/503
Pre-requisites
The main pre-requisite for OpenWPM is conda, a fast cross-platform package management tool.
Conda is open-source and can be installed from https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html.
Install
An installation script, install.sh, is included to install the conda environment,
install unbranded Firefox, and build the instrumentation extension.
All installation is confined to your conda environment and should not affect your machine. The installation script will, however, override any existing conda environment named openwpm.
To run the install script, run:
./install.sh
After running the install script, activate your conda environment by running:
conda activate openwpm
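A script can also confirm that the environment is active before it starts a crawl. The following is a minimal sketch, assuming conda exports the CONDA_DEFAULT_ENV variable on activation; the helper names active_conda_env and require_env are hypothetical and not part of OpenWPM. If the environment was activated by path rather than by name, the variable may hold a path instead.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if unset."""
    env = os.environ if environ is None else environ
    # Assumption: conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return env.get("CONDA_DEFAULT_ENV") or None


def require_env(expected="openwpm", environ=None):
    """Raise RuntimeError unless the expected conda environment is active."""
    active = active_conda_env(environ)
    if active != expected:
        raise RuntimeError(
            f"Expected conda environment {expected!r} but found {active!r}. "
            f"Run: conda activate {expected}"
        )
    return active


if __name__ == "__main__":
    print("Active conda environment:", active_conda_env())
```

Calling require_env() at the top of a crawl script would fail fast with a readable message instead of a later import error.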
Mac OSX
You may need to install make / gcc in order to build the extension.
The necessary packages are part of xcode: xcode-select --install
We do not run CI tests for Mac, so new issues may arise. We welcome PRs to fix these issues and add full CI testing for Mac.
Running Firefox with xvfb on OSX is untested and will require the user to install an X11 server. We suggest XQuartz. We welcome feedback as to whether this works.
Quick Start
Once installed, it is very easy to run a quick test of OpenWPM. Check out
demo.py for an example. This will use the default settings specified in
openwpm/config.py::ManagerParams and
openwpm/config.py::BrowserParams, with the exception of the changes
specified in demo.py.
The demo script also includes a sample of how to use the
Tranco top sites list via the optional command line
flag demo.py --tranco. Note that since this is a real top sites list it will
include NSFW websites, some of which will be highly ranked.
More information on the instrumentation and configuration parameters is given below.
The docs provide a more in-depth tutorial, and a description of the methods of data collection available.
Troubleshooting
- WebDriverException: Message: The browser appears to have exited before we could connect...

  This error indicates that Firefox exited during startup (or was prevented from starting). There are many possible causes of this error:
  - If you are seeing this error for all browser spawn attempts, check that:
    - Both selenium and Firefox are the appropriate versions. Run the following commands and check that the versions output match the required versions in install.sh and environment.yaml. If not, re-run the install script.
      cd firefox-bin/
      firefox --version
      and
      conda list selenium
    - If you are running in a headless environment (e.g. a remote server), ensure that all browsers have display_mode set to "headless" before launching.
  - If you are seeing this error randomly during crawls, it can be caused by an overtaxed system, either memory or CPU usage. Try lowering the number of concurrent browsers.
- In older versions of Firefox (pre 74) the setting to enable extensions was called extensions.legacy.enabled. If you need to work with earlier Firefox, update the setting name extensions.experiments.enabled in openwpm/deploy_browsers/configure_firefox.py.
- Make sure your conda environment is activated (conda activate openwpm). You can see your environments and the active one by running conda env list; the active environment will have a * by it.
- make / gcc may need to be installed in order to build the web extension. On Ubuntu, this is achieved with apt-get install make. On OSX the necessary packages are part of xcode: xcode-select --install.
- On a very sparse operating system additional dependencies may need to be installed. See the Dockerfile for more inspiration, or open an issue if you are still having problems.
- If you see errors related to incompatible or non-existing python packages, try re-running the file with the environment variable PYTHONNOUSERSITE set, e.g., PYTHONNOUSERSITE=True python demo.py. If that fixes your issues, you are experiencing issue 689, which can be fixed by clearing your python user site packages directory, by prepending PYTHONNOUSERSITE=True to a specific command, or by setting the environment variable for the session (e.g., export PYTHONNOUSERSITE=True in bash). Please also add a comment to that issue to let us know you ran into this problem.
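To check whether user site packages are visible to the interpreter, which is what the PYTHONNOUSERSITE advice is meant to rule out, a small diagnostic can help. This is a sketch using only the standard library site module; the function name user_site_report is hypothetical.

```python
import os
import site
import sys


def user_site_report():
    """Summarize whether the per-user site-packages directory is in use."""
    user_site = site.getusersitepackages()
    return {
        # Directory where per-user packages would be installed.
        "user_site": user_site,
        # False or None means the user site directory is disabled.
        "enabled": bool(site.ENABLE_USER_SITE),
        # True if that directory is actually on the import path.
        "on_sys_path": user_site in sys.path,
        # Value of the environment variable, or None if it is unset.
        "PYTHONNOUSERSITE": os.environ.get("PYTHONNOUSERSITE"),
    }


if __name__ == "__main__":
    for key, value in user_site_report().items():
        print(f"{key}: {value}")
```

If on_sys_path is True while the openwpm environment is active, packages from the user site directory can be picked up ahead of the versions pinned in the conda environment.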
Documentation
Further information is available at OpenWPM's Documentation Page.
Advice for Measurement Researchers
OpenWPM is often used for web measurement research. We recommend the following for researchers using the tool:
Use a versioned release. We aim to follow Firefox's release cadence, which is roughly once every four weeks. If we happen to fall behind on checking in new releases, please file an issue. Versions more than a few months out of date will use unsupported versions of Firefox, which are likely to have known security vulnerabilities. Versions older than v0.10.0 are from a previous architecture and should not be used.
Include the OpenWPM version number in your publication. As of v0.10.0 OpenWPM pins all python, npm, and system dependencies. Including this information alongside your work will allow other researchers to contextualize the results, and can be helpful if future versions of OpenWPM have instrumentation bugs that impact results.
Developer instructions
If you want to contribute to OpenWPM, have a look at our CONTRIBUTING.md.
Instrumentation and Configuration
OpenWPM provides a breadth of configuration options, which can be found in Configuration.md. More detail on the output is available below.
Storage
OpenWPM distinguishes between two types of data: structured and unstructured. Structured data is all data captured by the instrumentation or emitted by the platform. Generally speaking, all data you download is unstructured data.
For each of the data classes we offer a variety of storage providers, and you are encouraged to implement your own, should the provided backends not be enough for you.
We have an outstanding issue to enable saving content generated by commands, such as
screenshots and page dumps, to unstructured storage (see #232).
For now, they get saved to manager_params.data_directory.
Local Storage
For storing structured data locally we offer two StorageProviders:
- The SQLiteStorageProvider, which writes all data into a SQLite database
  - This is the recommended approach for getting started as the data is easily explorable
- The LocalArrowProvider, which stores the data into Parquet files
  - This method integrates well with NumPy/Pandas
  - It might be harder to ad-hoc process
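As an illustration of how easily the SQLite output can be explored, the sketch below lists every table in a database file together with its row count, using only the standard library sqlite3 module. It makes no assumption about the OpenWPM schema. The function name table_row_counts and the path datadir/crawl-data.sqlite are assumptions for illustration; substitute the path your crawl was configured to write to.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return a dict mapping each table name in the database to its row count."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    with closing(sqlite3.connect(db_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database itself, so quoting them is enough.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


if __name__ == "__main__":
    # Hypothetical path; adjust to match your own crawl configuration.
    for table, count in table_row_counts("datadir/crawl-data.sqlite").items():
        print(f"{table}: {count} rows")
```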
For storing unstructured data locally we also offer two solutions:
- The LevelDBProvider, which stores all data into a LevelDB
  - This is the recommended approach
- The LocalGzipProvider, which gzips and stores the files individually on disk
  - Please note that file systems usually don't like thousands of files in one folder
  - Use with care or for single site visits
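Files that were gzipped individually can be read back with the standard library gzip module. The sketch below is not OpenWPM's own API: the function name read_gzipped_files is hypothetical, and it assumes the stored files carry a .gz suffix and sit directly inside one directory, which may not match the layout the provider actually produces.

```python
import gzip
from pathlib import Path


def read_gzipped_files(directory, pattern="*.gz"):
    """Yield (file name, decompressed bytes) for each gzip file in a directory."""
    # Assumption: files match the given glob pattern; adjust it to your layout.
    for path in sorted(Path(directory).glob(pattern)):
        with gzip.open(path, "rb") as handle:
            yield path.name, handle.read()


if __name__ == "__main__":
    # Hypothetical directory; replace with wherever your files were written.
    for name, content in read_gzipped_files("datadir/content"):
        print(f"{name}: {len(content)} bytes")
```

Because the function is a generator, it decompresses one file at a time rather than loading a whole directory into memory.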
Remote storage
When running in the cloud, saving records to disk
