Iww
AI based web-wrapper for web-content-extraction
IWW-IntelliWebWrapper
An AI-based web-mining library for web content extraction using machine learning algorithms.
The library currently offers the following functionality and algorithms:
- DOM extraction, mapping, reduction and flattening.
- DoC (degree of coherence), a Euclidean-distance-based similarity measure.
- LD, a list detection algorithm.
- MCD, a main content detection algorithm.
- A method for integrating the results of other algorithms into MCD.
- The CETD algorithm.
- A DOM tags detector script that highlights the chosen nodes.
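The README does not define DoC beyond calling it a Euclidean-distance-based similarity, so the following is only a minimal sketch of that general idea, not the library's implementation. The feature vectors, the function names and the `1 / (1 + d)` mapping from distance to similarity are all assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1].

    Assumed mapping (not taken from iww): identical vectors score 1.0,
    and the score shrinks towards 0 as the distance grows.
    """
    return 1.0 / (1.0 + euclidean_distance(a, b))


# Hypothetical feature vectors for three DOM nodes (e.g. per-tag counts).
node_a = [3, 1, 0]
node_b = [3, 1, 0]
node_c = [0, 5, 2]

print(similarity(node_a, node_b))  # 1.0, the vectors are identical
print(similarity(node_a, node_c))  # strictly between 0 and 1
```

Under this reading, sibling nodes with near-identical structure would score close to 1, which is the kind of signal a threshold such as `coherence_threshold` could filter on.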
P.S. :
- The documentation isn't available yet.
- The LD & MCD algorithms will be published in a research article in the near future.
- The pip package of iww will be available online as soon as possible.
USE CASE EXAMPLE :
1- extraction :
from iww.extractor import extractor
from iww.detector import detector
from iww.features_extraction.lists_detector import Lists_Detector as LD
from iww.features_extraction.main_content_detector import MCD

url = "https://www.theiconic.com.au/catalog/?q=kids%20sunglasses"
json_file = "./iconic.json"

# Extract the page at `url` and write the result to `json_file`.
extractor.extract(
    url = url,
    destination = json_file
)
2- data exploratory analysis :
from iww.utils.dom_mapper import DOM_Mapper as DM
dm = DM()
dm.retrieve_DOM_tree("./iconic.json")
print("total number of nodes : {}".format(dm.DOM['CETD']['tagsCount']))
total number of nodes : 2098
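To make the idea of a node count concrete, here is a small sketch that counts the nodes of a nested tree recursively. It does not use iww: the layout of the extracted JSON is not documented, so the `tagName` and `children` keys and the `count_nodes` helper are assumptions for illustration only.

```python
def count_nodes(node):
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.get("children", []))


# Toy tree; the "tagName" and "children" keys are assumed, not iww's schema.
toy_dom = {
    "tagName": "body",
    "children": [
        {"tagName": "div", "children": [{"tagName": "p", "children": []}]},
        {"tagName": "ul", "children": [
            {"tagName": "li", "children": []},
            {"tagName": "li", "children": []},
        ]},
    ],
}

print("total number of nodes : {}".format(count_nodes(toy_dom)))
# prints: total number of nodes : 6
```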
3- LD algorithm :
ld = LD()
ld.retrieve_DOM_tree(file_path = "./iconic.json")

# Run the lists detector on the whole tree.
ld.apply(
    node = ld.DOM,
    coherence_threshold = (0.75, 1),
    sub_tags_threshold = 2
)
ld.update_DOM_tree()

# Highlight the nodes whose LISTS.mark equals "1".
detector.detect(
    input_file = "./iconic.json",
    output_file = "./iconic_ld.png",
    mark_path = "LISTS.mark",
    mark_value = "1"
)
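The `mark_path` / `mark_value` pair suggests that each node carries nested fields addressed by a dotted path. The sketch below shows one way such a lookup could work on a plain nested dictionary. It is an illustration under that assumption, not the detector's actual code: `get_by_path`, `find_marked` and the `children` key are hypothetical.

```python
def get_by_path(node, path, default=None):
    """Resolve a dotted path such as "LISTS.mark" inside nested dicts."""
    current = node
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def find_marked(node, mark_path, mark_value):
    """Collect every node in the tree whose field at mark_path equals mark_value."""
    matches = []
    if get_by_path(node, mark_path) == mark_value:
        matches.append(node)
    for child in node.get("children", []):
        matches.extend(find_marked(child, mark_path, mark_value))
    return matches


# Toy tree; the key names are assumed for illustration.
toy_dom = {
    "tagName": "body",
    "LISTS": {"mark": "0"},
    "children": [
        {"tagName": "ul", "LISTS": {"mark": "1"}, "children": []},
        {"tagName": "p", "children": []},
    ],
}

marked = find_marked(toy_dom, "LISTS.mark", "1")
print([n["tagName"] for n in marked])  # ['ul']
```

Nodes that lack the field altogether (the `p` node here) are simply skipped rather than raising an error.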

4- MCD algorithm :
mcd = MCD()
mcd.retrieve_DOM_tree("./iconic.json")

# Run the main content detector on the whole tree.
mcd.apply(
    node = mcd.DOM,
    min_ratio_threshold = 0.0,
    nbr_nodes_threshold = 1
)
mcd.update_DOM_tree()

# Highlight the nodes whose MCD.mark equals "1".
detector.detect(
    input_file = "./iconic.json",
    output_file = "./iconic_mcd.png",
    mark_path = "MCD.mark",
    mark_value = "1"
)

5- LD/MCD integration (main list detection) :
# Combine the MCD results with the nodes marked by LD (LISTS.mark == "1").
mcd.integrate_other_algorithms_results(
    node = mcd.DOM,
    nbr_nodes = 1,
    mode = "ancestry",
    condition_features = [("LISTS.mark", "1")]
)
mcd.update_DOM_tree()

# Highlight the nodes whose MCD.main_node equals "1".
detector.detect(
    input_file = "./iconic.json",
    output_file = "./iconic_main_list.png",
    mark_path = "MCD.main_node",
    mark_value = "1"
)

License
MOHAMED-HMINI 2019
