Models
Machine learning models for MLonCode trained using the source{d} stack
source{d} MLonCode models
bot-detection
Model that distinguishes bots from humans among developer identities.
Example:
from sklearn.preprocessing import LabelEncoder
from sourced.ml.models import BotDetection
from xgboost import XGBClassifier
bot_detection = BotDetection().load("bot-detection")
xgb_cls = XGBClassifier()
xgb_cls._Booster = bot_detection.booster
xgb_cls._le = LabelEncoder().fit([False, True])
print("model configuration:", xgb_cls)
print("BPE model vocabulary size:", len(bot_detection.bpe_model.vocab()))
1 model:
- <default> 94806d1f-1995-4c72-89c9-07681fa9d97d
bow
Weighted bag-of-words, that is, every bag is a feature extracted from source code and associated with a TF-IDF weight.
Example:
from sourced.ml.models import BOW
bow = BOW().load("bow")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))
4 models:
- 1e0deee4-7dc1-400f-acb6-74c0f4aec471
- <default> 1e3da42a-28b6-4b33-94a2-a5671f4102f4
- 694c20a0-9b96-4444-80ae-f2fa5bd1395b
- da8c5dee-b285-4d55-8913-a5209f716564
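The TF-IDF weighting behind these bags can be sketched in plain Python. This is a conceptual illustration with made-up tokens, not the sourced.ml implementation:

```python
import math
from collections import Counter

# Toy corpus: each "document" is a list of tokens extracted from source code.
docs = [
    ["read", "file", "path"],
    ["write", "file", "buffer"],
    ["parse", "path", "file"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token appears.
df = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Weight each token of one document by term frequency * inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

weights = tfidf(docs[0])
# "file" appears in every document, so its IDF (and hence its weight) is 0.
```

Tokens occurring in every document carry no discriminative weight, which is why the bags are weighted rather than raw counts.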
docfreq
Document frequencies of features extracted from source code, that is, how many documents (repositories, files or functions) contain each tokenized feature.
Example:
from sourced.ml.models import DocumentFrequencies
df = DocumentFrequencies().load("docfreq")
print("Number of tokens:", len(df))
2 models:
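How such document frequencies arise can be illustrated with a toy counter. The tokens below are invented; the real model is computed at scale over repositories, files, or functions:

```python
from collections import Counter

# Each document is the set of tokenized features found in one file.
documents = [
    {"open", "read", "close"},
    {"open", "write"},
    {"open", "read", "seek"},
]

# Document frequency: in how many documents each feature occurs.
docfreq = Counter(tok for doc in documents for tok in doc)

# A common use is pruning features seen in fewer than N documents.
frequent = {tok: n for tok, n in docfreq.items() if n >= 2}
```

Pruning rare features this way keeps downstream models (such as the bag-of-words above) compact.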
id2vec
Source code identifier embeddings, that is, every identifier is represented by a dense vector.
Example:
from sourced.ml.models import Id2Vec
id2vec = Id2Vec().load("id2vec")
print("Number of tokens:", len(id2vec))
2 models:
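A typical use of dense identifier vectors is measuring semantic similarity. The embeddings and dimensionality below are made up for illustration; only the cosine computation is the standard technique:

```python
import math

# Toy 3-dimensional embeddings (invented values; real vectors are learned).
embeddings = {
    "file_name": [0.9, 0.1, 0.0],
    "filename": [0.8, 0.2, 0.1],
    "counter": [0.0, 0.1, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically close identifiers score higher than unrelated ones.
sim_close = cosine(embeddings["file_name"], embeddings["filename"])
sim_far = cosine(embeddings["file_name"], embeddings["counter"])
```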
id_splitter_bilstm
Model that contains source code identifier splitter BiLSTM weights.
Example:
from sourced.ml.models.id_splitter import IdentifierSplitterBiLSTM
id_splitter = IdentifierSplitterBiLSTM().load("id_splitter_bilstm")
identifiers = ["getFileName"]  # sample identifiers to split
print(id_splitter.split(identifiers))
1 model:
- <default> 522bdd11-d1fa-49dd-9e51-87c529283418
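For intuition, identifier splitting can be approximated with a regex heuristic over underscores and camelCase boundaries. The shipped model instead uses a character-level BiLSTM, which also handles identifiers with no explicit boundary (e.g. "filename"):

```python
import re

def split_identifier(identifier):
    """Heuristic split on underscores and camelCase/acronym boundaries."""
    tokens = []
    for part in identifier.split("_"):
        # Capitalized words, all-caps acronyms, or digit runs.
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens if t]

split_identifier("getFileName")   # -> ["get", "file", "name"]
split_identifier("HTTP_request")  # -> ["http", "request"]
```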
topics
Topic modeling of Git repositories. All tokens are identifiers extracted from repositories and treated as topic indicators; they are used to infer the topic(s) of a repository.
Example:
from sourced.ml.models import Topics
topics = Topics().load("topics")
print("Number of topics:", len(topics))
print("Number of tokens:", len(topics.tokens))
1 model:
- <default> c70a7514-9257-4b33-b468-27a8588d4dfa
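Inspecting a topic model usually means ranking each topic's highest-weighted tokens. A sketch with an invented topic-token matrix (the tokens and weights are hypothetical, not from the model):

```python
# Rows are topics, columns are identifier tokens (illustrative values only).
tokens = ["render", "shader", "socket", "packet"]
topic_token = [
    [0.70, 0.60, 0.05, 0.02],  # a "graphics"-like topic
    [0.03, 0.01, 0.80, 0.65],  # a "networking"-like topic
]

def top_tokens(topic_row, k=2):
    """Return the k highest-weighted tokens characterizing one topic."""
    ranked = sorted(zip(tokens, topic_row), key=lambda p: p[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

top_tokens(topic_token[0])  # -> ["render", "shader"]
```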
typos_correction
Model that suggests corrections for typos.
Example:
from lookout.style.typos.corrector import TyposCorrector
corrector = TyposCorrector().load("typos_correction")
print("Corrector configuration:\n", corrector.dump())
3 models:
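The core idea of typo correction, ranking vocabulary words by edit distance to the misspelled token, can be sketched in a few lines. The vocabulary here is made up, and the real corrector combines several signals beyond raw distance:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

vocabulary = ["value", "file", "filter", "result"]

def suggest(token):
    """Pick the closest known word as the correction candidate."""
    return min(vocabulary, key=lambda w: edit_distance(token, w))

suggest("fiel")  # -> "file"
```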
