Models
Machine learning models for MLonCode trained using the source{d} stack
source{d} MLonCode models
bot-detection
Model that distinguishes bots from humans among developer identities.
Example:
from sklearn.preprocessing import LabelEncoder
from sourced.ml.models import BotDetection
from xgboost import XGBClassifier
bot_detection = BotDetection().load("bot-detection")
xgb_cls = XGBClassifier()
xgb_cls._Booster = bot_detection.booster
xgb_cls._le = LabelEncoder().fit([False, True])
print("model configuration:", xgb_cls)
print("BPE model vocabulary size:", len(bot_detection.bpe_model.vocab()))
1 model:
- <default> 94806d1f-1995-4c72-89c9-07681fa9d97d
bow
Weighted bag-of-words, that is, every bag is a feature extracted from source code and associated with a TF-IDF weight.
Example:
from sourced.ml.models import BOW
bow = BOW().load("bow")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))
4 models:
- 1e0deee4-7dc1-400f-acb6-74c0f4aec471
- <default> 1e3da42a-28b6-4b33-94a2-a5671f4102f4
- 694c20a0-9b96-4444-80ae-f2fa5bd1395b
- da8c5dee-b285-4d55-8913-a5209f716564
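The TF-IDF weighting behind these bags can be sketched in plain Python. This is a conceptual illustration with made-up tokens, not the sourced.ml implementation:

```python
import math
from collections import Counter

# Toy corpus: each "document" is a list of tokens extracted from source code.
docs = [
    ["read", "file", "path"],
    ["write", "file", "buffer"],
    ["parse", "path", "file"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token appears.
df = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Weight each token of one document by term frequency * inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

weights = tfidf(docs[0])
# "file" appears in every document, so its IDF (and hence its weight) is 0.
```

Tokens occurring in every document carry no discriminative weight, which is why the bags are weighted rather than raw counts.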
docfreq
Document frequencies of features extracted from source code, that is, how many documents (repositories, files or functions) contain each tokenized feature.
Example:
from sourced.ml.models import DocumentFrequencies
df = DocumentFrequencies().load("docfreq")
print("Number of tokens:", len(df))
2 models:
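How such document frequencies arise can be illustrated with a toy counter. The tokens below are invented; the real model is computed at scale over repositories, files, or functions:

```python
from collections import Counter

# Each document is the set of tokenized features found in one file.
documents = [
    {"open", "read", "close"},
    {"open", "write"},
    {"open", "read", "seek"},
]

# Document frequency: in how many documents each feature occurs.
docfreq = Counter(tok for doc in documents for tok in doc)

# A common use is pruning features seen in fewer than N documents.
frequent = {tok: n for tok, n in docfreq.items() if n >= 2}
```

Pruning rare features this way keeps downstream models (such as the bag-of-words above) compact.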
id2vec
Source code identifier embeddings, that is, every identifier is represented by a dense vector.
Example:
from sourced.ml.models import Id2Vec
id2vec = Id2Vec().load("id2vec")
print("Number of tokens:", len(id2vec))
2 models:
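A typical use of dense identifier vectors is measuring semantic similarity. The embeddings and dimensionality below are made up for illustration; only the cosine computation is the standard technique:

```python
import math

# Toy 3-dimensional embeddings (invented values; real vectors are learned).
embeddings = {
    "file_name": [0.9, 0.1, 0.0],
    "filename": [0.8, 0.2, 0.1],
    "counter": [0.0, 0.1, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically close identifiers score higher than unrelated ones.
sim_close = cosine(embeddings["file_name"], embeddings["filename"])
sim_far = cosine(embeddings["file_name"], embeddings["counter"])
```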
id_splitter_bilstm
Model that contains source code identifier splitter BiLSTM weights.
Example:
from sourced.ml.models.id_splitter import IdentifierSplitterBiLSTM
id_splitter = IdentifierSplitterBiLSTM().load("id_splitter_bilstm")
identifiers = ["getFileName"]  # sample identifiers to split
print(id_splitter.split(identifiers))
1 model:
- <default> 522bdd11-d1fa-49dd-9e51-87c529283418
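For intuition, identifier splitting can be approximated with a regex heuristic over underscores and camelCase boundaries. The shipped model instead uses a character-level BiLSTM, which also handles identifiers with no explicit boundary (e.g. "filename"):

```python
import re

def split_identifier(identifier):
    """Heuristic split on underscores and camelCase/acronym boundaries."""
    tokens = []
    for part in identifier.split("_"):
        # Capitalized words, all-caps acronyms, or digit runs.
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens if t]

split_identifier("getFileName")   # -> ["get", "file", "name"]
split_identifier("HTTP_request")  # -> ["http", "request"]
```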
topics
Topic modeling of Git repositories. All tokens are identifiers extracted from repositories and treated as topic indicators; they are used to infer the topic(s) of a repository.
Example:
from sourced.ml.models import Topics
topics = Topics().load("topics")
print("Number of topics:", len(topics))
print("Number of tokens:", len(topics.tokens))
1 model:
- <default> c70a7514-9257-4b33-b468-27a8588d4dfa
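Inspecting a topic model usually means ranking each topic's highest-weighted tokens. A sketch with an invented topic-token matrix (the tokens and weights are hypothetical, not from the model):

```python
# Rows are topics, columns are identifier tokens (illustrative values only).
tokens = ["render", "shader", "socket", "packet"]
topic_token = [
    [0.70, 0.60, 0.05, 0.02],  # a "graphics"-like topic
    [0.03, 0.01, 0.80, 0.65],  # a "networking"-like topic
]

def top_tokens(topic_row, k=2):
    """Return the k highest-weighted tokens characterizing one topic."""
    ranked = sorted(zip(tokens, topic_row), key=lambda p: p[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

top_tokens(topic_token[0])  # -> ["render", "shader"]
```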
typos_correction
Model that suggests corrections for typos.
Example:
from lookout.style.typos.corrector import TyposCorrector
corrector = TyposCorrector().load("typos_correction")
print("Corrector configuration:\n", corrector.dump())
3 models:
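The core idea of typo correction, ranking vocabulary words by edit distance to the misspelled token, can be sketched in a few lines. The vocabulary here is made up, and the real corrector combines several signals beyond raw distance:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

vocabulary = ["value", "file", "filter", "result"]

def suggest(token):
    """Pick the closest known word as the correction candidate."""
    return min(vocabulary, key=lambda w: edit_distance(token, w))

suggest("fiel")  # -> "file"
```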
