ProVe (Provenance Verification for Wikidata claims)
Overview
ProVe is a system designed to automatically verify claims and references in Wikidata. It extracts claims from Wikidata entities, fetches the referenced URLs, processes the HTML content, and uses NLP models to determine whether the claims are supported by the referenced content.
System Architecture
The ProVe system consists of several key components:
- Data Collection and Processing:
  - WikidataParser: extracts claims and reference URLs from Wikidata by QID (item identifier)
  - HTMLFetcher: collects HTML content from reference URLs
  - HTMLSentenceProcessor: converts HTML into sentences for analysis
- Evidence Selection and Verification:
  - EvidenceSelector: selects relevant sentences as evidence
  - ClaimEntailmentChecker: verifies the entailment relationship between claims and evidence
- NLP Models:
  - TextualEntailmentModule: checks textual entailment relationships
  - SentenceRetrievalModule: retrieves relevant sentences
  - VerbModule: handles claim verbalization
- Data Storage:
  - MongoDB: stores HTML content, entailment results, parser statistics, and status information
  - SQLite: stores verification results for API access
- Service Structure:
  - ProVe_main_service.py: main service logic
  - ProVe_main_process.py: entity processing logic
  - background_processing.py: background processing tasks
Setup Instructions
1. Install Dependencies
pip install -r requirements.txt
2. Download NLP Models
The 'base' folder contains the essential NLP models for ProVe, including pre-trained and fine-tuned BERT and T5 models along with related parsers.
Download from:
https://emckclac-my.sharepoint.com/:f:/r/personal/k2369089_kcl_ac_uk/Documents/base?csf=1&web=1&e=TBo3nE
Place the downloaded 'base' folder in the project root directory.
3. Configure the System
Review and modify the config.yaml file to adjust database settings, HTML fetching parameters, and evidence selection thresholds.
Usage
Processing a Single Entity
from ProVe_main_process import initialize_models, process_entity
# Initialize models
models = initialize_models()
# Process entity by QID
qid = 'Q44'  # Example: beer
html_df, entailment_results, parser_stats = process_entity(qid, models)
Running the Service
The main service can be started by running:
python ProVe_main_service.py
This will start the MongoDB handler and schedule background processing tasks.
Background Processing
The system can automatically process:
- Top viewed Wikidata items
- Items from a pagepile list
- Random QIDs
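A background worker that mixes these three sources might look roughly like the following. The three sources (top-viewed items, pagepile lists, random QIDs) come from this README, but the function names and the interleaving policy below are illustrative, not ProVe's actual implementation.

```python
import random

# Hypothetical QID-selection helpers for a background worker; the names
# and the "curated first, random padding" policy are illustrative.

def random_qid(max_id=10_000_000):
    # Wikidata item identifiers have the form 'Q<number>'.
    return f"Q{random.randint(1, max_id)}"

def next_batch(top_viewed, pagepile, batch_size=5):
    # Prefer curated QIDs first, then pad the batch with random ones.
    batch = (list(top_viewed) + list(pagepile))[:batch_size]
    while len(batch) < batch_size:
        batch.append(random_qid())
    return batch
```

In the real service, ProVe_main_service.py schedules tasks like this to run periodically.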
Configuration
The config.yaml file contains important settings:
- Database configurations
- Algorithm version
- HTML fetching parameters (batch size, delay, timeout)
- Text processing settings
- Evidence selection parameters
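As a sketch only, a config.yaml covering these setting groups might look like the following; every key name and value here is a guess, so consult the config.yaml shipped with the repository for the real schema.

```yaml
# Hypothetical config.yaml layout; actual keys and defaults may differ.
database:
  mongodb_uri: mongodb://localhost:27017
  sqlite_path: ./prove_results.db
algorithm_version: "1.0"
html_fetching:
  batch_size: 10
  delay: 1.0        # seconds between requests
  timeout: 30       # per-request timeout in seconds
text_processing:
  max_sentence_length: 512
evidence_selection:
  top_k: 5
  score_threshold: 0.5
```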
Data Flow
1. A Wikidata QID is provided to the system
2. The system extracts claims and reference URLs from the entity
3. HTML content is fetched from the reference URLs
4. The HTML is processed into sentences
5. Relevant sentences are selected as evidence
6. NLP models verify if the evidence supports the claims
7. Results are stored in the database
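The steps above can be sketched end to end in plain Python. Everything below (the function names, the stubbed fetcher, the toy word-overlap evidence scorer) is an illustrative stand-in, not ProVe's actual code, which delegates evidence selection and verification to trained sentence-retrieval and entailment models.

```python
import re

# Toy end-to-end sketch of the data flow; all names and logic here are
# illustrative stand-ins for ProVe's real components.

def extract_claims(qid):
    # Step 2: pull claims and their reference URLs from the entity (stubbed).
    return [{"claim": "X instance of Y", "url": "https://example.org/ref"}]

def fetch_html(url):
    # Step 3: download the referenced page (stubbed).
    return "<html><p>X is a Y. Unrelated text.</p></html>"

def html_to_sentences(html):
    # Step 4: strip markup and split the text into sentences (very rough).
    text = re.sub(r"<[^>]+>", " ", html)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def select_evidence(claim, sentences):
    # Step 5: keep sentences sharing vocabulary with the claim (toy scorer;
    # ProVe uses a sentence-retrieval model instead).
    claim_words = set(claim.lower().split())
    return [s for s in sentences if claim_words & set(s.lower().split())]

def check_entailment(claim, evidence):
    # Step 6: a textual-entailment model would run here.
    return "SUPPORTS" if evidence else "NOT ENOUGH INFO"

def process(qid):
    # Steps 1-7: run the pipeline and collect results for storage.
    results = []
    for item in extract_claims(qid):
        sentences = html_to_sentences(fetch_html(item["url"]))
        evidence = select_evidence(item["claim"], sentences)
        results.append({"claim": item["claim"],
                        "verdict": check_entailment(item["claim"], evidence)})
    return results
```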
