Wit
Algorithms for "schema matching"
wit
Algorithms for string classification and string embeddings using 'weak' supervision, with eventual application to 'schema alignment'.
NB: This package is in the middle of an API redefinition and simplification. The master branch is functional, but keep an eye out for changes. Ongoing work is being done on the api-v3 branch.
Method Overview
For schema alignment, the basic idea is to:
- learn an embedding of strings into dense N-dimensional vector representations such that instances of the same variable are closer together than instances of other variables (using recurrent neural networks)
- align variables whose embedded distributions are "close" (by solving an assignment problem)
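The alignment step can be sketched on toy data. This is a minimal illustration, not the package's implementation: the function names `centroid` and `align`, the 2-d "embeddings", and the use of centroid distance as the notion of "close" are all assumptions, and the brute-force search stands in for a proper assignment solver.

```python
from itertools import permutations
from math import dist  # Python 3.8+


def centroid(points):
    """Mean of a list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def align(schema_a, schema_b):
    """Match each variable in schema_a to a distinct variable in schema_b.

    Both arguments map variable name -> list of embedded vectors.
    Minimizes the total distance between matched centroids by trying
    every permutation, so it is only sensible for a handful of variables.
    Assumes schema_a has no more variables than schema_b.
    """
    names_a, names_b = list(schema_a), list(schema_b)
    cents_a = [centroid(schema_a[k]) for k in names_a]
    cents_b = [centroid(schema_b[k]) for k in names_b]
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(names_b)), len(names_a)):
        cost = sum(dist(cents_a[i], cents_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {names_a[i]: names_b[j] for i, j in enumerate(best_perm)}


# Hypothetical 2-d embeddings for two forums with differently named columns.
forum_a = {"username": [(0.0, 0.1), (0.1, 0.0)], "post_id": [(5.0, 5.1), (4.9, 5.0)]}
forum_b = {"id": [(5.1, 4.9), (5.0, 5.0)], "handle": [(0.1, 0.1), (0.0, 0.0)]}

mapping = align(forum_a, forum_b)
print(mapping)  # {'username': 'handle', 'post_id': 'id'}
```

Comparing centroids ignores the shape of each embedded distribution; a richer distance between sets of points could be swapped in without changing the assignment step.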
Notes
Here are two ways that we could think about similarity of strings:
- syntactic: strings are similar because they have similar structure
  - usernames: ben46 is close to frank123
  - subject_line: 'Re: good morning' is close to 'Re: circling back'
- semantic: strings are similar because of extrinsic information about the world
  - date: '2016-01-01' is close to 'Jan 1st 2016'
  - country: 'AR' is close to 'Argentina'
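To make the syntactic notion concrete, here is a hand-written heuristic that reduces a string to a coarse character-class "shape". It is only an illustration: the `shape` function is hypothetical, and the package learns an embedding rather than using a rule like this.

```python
import re


def shape(s):
    """Collapse runs of letters to 'a', runs of digits to '9', and runs of whitespace to one space."""
    s = re.sub(r"[A-Za-z]+", "a", s)
    s = re.sub(r"[0-9]+", "9", s)
    s = re.sub(r"\s+", " ", s)
    return s


print(shape("ben46"), shape("frank123"))                    # a9 a9
print(shape("Re: good morning"), shape("Re: circling back"))  # a: a a a: a a
print(shape("2016-01-01"), shape("Jan 1st 2016"))            # 9-9-9 a 9a 9
```

The two usernames and the two subject lines share a shape, while the two date strings do not. Recognizing that the dates are the same requires knowledge about the world, which is what the semantic notion refers to.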
And here are two ways we could think about similarity of sets of strings:
- distributional: sets have similar distributions
  - forum post_id: (near?) unique key
  - forum username: may follow similar distributions across domains
- relational: sets have similar relationships to other sets of strings
  - relationship (eg mutual information) between post_id and username may be similar across domains
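Both set-level notions can be computed directly on small columns. The sketch below is an assumption-laden illustration, not code from this repo: `uniqueness` is a hypothetical distributional statistic (the fraction of distinct values), and `mutual_information` is the standard empirical estimate in bits.

```python
from collections import Counter
from math import log2


def uniqueness(values):
    """Fraction of distinct values; 1.0 means the column behaves like a unique key."""
    return len(set(values)) / len(values)


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two aligned columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y)))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


# Hypothetical forum table: four posts written by two users.
post_ids = [1, 2, 3, 4]
usernames = ["ben46", "ben46", "frank123", "frank123"]

print(uniqueness(post_ids))    # 1.0
print(uniqueness(usernames))   # 0.5
mi = mutual_information(post_ids, usernames)
print(mi)                      # 1.0
```

Here post_id fully determines username, so the mutual information equals the entropy of the username column (one bit for two equally frequent users).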
Software
Prototype code for calculating syntactic and semantic similarity is included in this repo.
Scripts
- wit/examples/string-example.py shows how to build a string classifier (ie semantic)
- wit/examples/simple-embedding-example.py shows how to use the triplet loss function to learn a string embedding (ie syntactic)
- wit/examples/simple-alignment-example.py -- splitting and re-aligning a simple dataset
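For readers unfamiliar with the triplet loss mentioned above, one common form is sketched here in plain Python. This is not taken from the example scripts: the function name, the margin value, and the use of unsquared Euclidean distance are assumptions, and the scripts may define the loss differently.

```python
from math import dist  # Python 3.8+


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on already-embedded vectors.

    Zero once the anchor is closer to the positive (same variable) than to
    the negative (different variable) by at least `margin`.
    """
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


# Negative is far away: the constraint is satisfied, so the loss is zero.
easy = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Negative is as close as the positive: the loss equals the margin.
hard = triplet_loss((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
print(easy, hard)  # 0.0 1.0
```

Minimizing this quantity over many (anchor, positive, negative) triples pulls instances of the same variable together and pushes instances of different variables apart, which is the property the embedding is meant to have.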
Notebooks
- wit/notebooks/address-matching.ipynb -- trying to learn a good metric for addresses
- wit/notebooks/simple-forum-notebook.py -- aligning schemas of multiple forums at once
More
See https://github.com/gophronesis/census-schema-alignment for some more concrete examples, developed during the January 2016 XDATA census hackathon.
