SimpDOM
Simplified DOM Trees for Transferable Attribute Extraction from the Web
Attribute Extraction from Web Documents.
Title: "Simplified DOM Trees for Transferable Attribute Extraction from the Web"
Original paper
The conference paper is available here.
Keywords
structured data extraction, web information extraction, Simplified DOM
Implementation details
The implementation is in PyTorch.
Make sure that all notebooks use a Python 3 kernel.
Pre-trained model
Weights trained on the SWDE dataset (auto vertical) are available here.
To execute the test.ipynb notebook, download the file and unzip it into the ./data folder.
To run the pre-trained model, modify and execute the test.ipynb notebook:
- Set the test_websites and attributes, based on the content of SWDE_Dataset/webpages/ and the description found in SWDE_Dataset/readme.txt
- Execute the notebook
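To see which website names are available before filling in test_websites, a small helper can list the sub-folders of an extracted vertical. This is a minimal sketch and not part of the repository: the function name list_websites is hypothetical, and it assumes the vertical has been extracted to ./data/<vertical> with one sub-folder per website.

```python
from pathlib import Path


def list_websites(data_dir, vertical):
    """Return the sorted names of the website folders under data_dir/vertical.

    Hypothetical helper: assumes one sub-folder per website.
    """
    vertical_dir = Path(data_dir) / vertical
    if not vertical_dir.is_dir():
        raise FileNotFoundError(f"Vertical folder not found: {vertical_dir}")
    # Keep directories only; stray files in the vertical folder are ignored.
    return sorted(p.name for p in vertical_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Example: print the website folders of the auto vertical, if extracted.
    try:
        print(list_websites("./data", "auto"))
    except FileNotFoundError as err:
        print(err)
```

The returned names can then be copied into test_websites in the notebook; the attribute names still have to be taken from SWDE_Dataset/readme.txt.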
To re-train the model
Re-training requires the SWDE dataset:
- Download the SWDE dataset via the Torrent file found on the references webpage
- Choose the vertical you want to train the model on from the list of verticals in SWDE_Dataset/webpages/
- Extract the vertical folder into the ./data subfolder
  - For instance: SWDE_Dataset/webpages/movies.7z into ./data/movies
- Extract the ground truth folder into the ./data subfolder
  - For example: SWDE_Dataset/groundtruth.7z into ./data/groundtruth
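After extracting the archives, a quick check of the folder layout can catch a misplaced folder before any notebook is run. The sketch below is an illustration only: missing_dataset_folders is a hypothetical helper, and it assumes the layout described above, that is ./data/<vertical> and ./data/groundtruth.

```python
from pathlib import Path


def missing_dataset_folders(data_dir, vertical):
    """Return the expected dataset folders that are absent under data_dir.

    Hypothetical helper: expects data_dir/<vertical> and data_dir/groundtruth.
    """
    expected = [Path(data_dir) / vertical, Path(data_dir) / "groundtruth"]
    return [str(p) for p in expected if not p.is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_folders("./data", "movies")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```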
Then follow these steps:
- Remove the following files and folders, if present:
  - ./data/English_charDict.pkl
  - ./data/HTMLTagDict.pkl
  - ./data/nodesDetails/
  - ./data/last.ckpt
  - ./data/weights.ckpt
- Modify and execute the generate.ipynb notebook
  - Set the vertical and attributes, based on the content of SWDE_Dataset/webpages/ and the description found in SWDE_Dataset/readme.txt
  - Set the number of friends to be used; the suggested value is num_friends = 10
  - Execute the notebook
- Modify and execute the train.ipynb notebook
  - Set the vertical and attributes, based on the content of SWDE_Dataset/webpages/ and the description found in SWDE_Dataset/readme.txt
  - Set the training, validation and testing website lists: train_websites, val_websites, and test_websites
  - Execute the notebook
    - The execution may fail during model training with:
      torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
    - This indicates that the process ran out of memory.
    - The solution is to reduce the training set by considering fewer websites.
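The cleanup step at the start of the list above can be scripted so that stale artifacts from an earlier run are not left behind by accident. This is a minimal sketch rather than code from the repository: remove_stale_artifacts is a hypothetical helper, and the file and folder names are copied from the list above.

```python
import shutil
from pathlib import Path

# Names taken from the cleanup list above, relative to the data folder.
STALE_ARTIFACTS = [
    "English_charDict.pkl",
    "HTMLTagDict.pkl",
    "nodesDetails",
    "last.ckpt",
    "weights.ckpt",
]


def remove_stale_artifacts(data_dir="./data"):
    """Delete the generated files and folders listed above, if present.

    Returns the paths that were actually removed.
    """
    removed = []
    for name in STALE_ARTIFACTS:
        path = Path(data_dir) / name
        if path.is_dir():
            shutil.rmtree(path)  # folder such as nodesDetails/
            removed.append(str(path))
        elif path.exists():
            path.unlink()  # single file such as last.ckpt
            removed.append(str(path))
    return removed


if __name__ == "__main__":
    for path in remove_stale_artifacts("./data"):
        print("Removed", path)
```

Anything else in the data folder, including the extracted vertical and the ground truth, is left untouched.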
Getting more GloVe features
Pre-trained GloVe features and the GloVe implementation are available here.
