Jemma

JEMMA: An Extensible Java dataset for Many ML4Code Applications

<!-- README -->

This is the official documentation for the JEMMA project.

JEMMA is an Extensible Java dataset for Many ML4Code Applications. It is primarily a dataset of Java code entities at multiple granularities, their properties, and representations. To help users interact and work with the data seamlessly, we have added Workbench capabilities to it as well.

This repository hosts the Workbench part of JEMMA. The raw data is hosted on Zenodo and can be downloaded at any time while using the Workbench. The following sections provide more details.


<a id="setup-instructions"></a>

Setup Instructions

<!-- > Getting started with jemma -->

First steps: Install jemma locally

1. $ git clone https://github.com/giganticode/jemma.git 
2. $ cd jemma/ 
3. $ pip install -r requirements.txt
4. $ pip install -e .

Next steps: Download all the datasets <br> Sign up at Zenodo.org and generate an API token [IMPORTANT!]

5. $ cd jemma/download/ 
6. $ nano config.ini (and replace the dummy `access_token` with your API token)
7. $ python3 download.py 
8. $ python3 sanity_checks.py
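
Before running `download.py`, it can help to confirm that the placeholder token was actually replaced. The following is a minimal sketch, not part of JEMMA: it assumes `config.ini` is a standard INI file that stores the token under the key `access_token`. The section name (`zenodo`) and the placeholder value (`dummy`) used in the example are assumptions, so the helper scans every section and takes the placeholder values as a parameter.

```python
import configparser


def find_access_token(config_text, placeholders=("", "dummy")):
    """Return the first non-placeholder access_token, or None.

    Assumption: the token lives under a key named `access_token`.
    Every section (including DEFAULT) is scanned because the real
    section name is not documented here.
    """
    # Interpolation is disabled so a token containing '%' is read as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    for section in [parser.default_section, *parser.sections()]:
        token = parser.get(section, "access_token", fallback=None)
        if token is not None and token.strip() not in placeholders:
            return token.strip()
    return None


# Hypothetical file contents; the real config.ini may look different.
UNEDITED = "[zenodo]\naccess_token = dummy\n"
EDITED = "[zenodo]\naccess_token = abc123\n"

print(find_access_token(UNEDITED))  # None
print(find_access_token(EDITED))    # abc123
```

To check a file on disk, pass its contents, for example `find_access_token(open("config.ini").read())`.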

Getting to know JEMMA Datasets

JEMMA Metadata

| Link to metadata | columns |
|:---------|:---------- |
| projects | project_id |
| | project_path |
| | project_name |
| | |
| packages | project_id |
| | package_id |
| | package_path |
| | package_name |
| | |
| classes | project_id |
| | package_id |
| | class_id |
| | class_path |
| | class_name |
| | |
| methods | project_id |
| | package_id |
| | class_id |
| | method_id |
| | method_name |
| | start_line |
| | end_line |
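
The shared id columns make it possible to walk from a method up to its class, package, and project. The sketch below illustrates that with tiny in-memory stand-ins for the four metadata tables. The rows, paths, and short ids (`p1`, `k1`, ...) are made up for illustration, and the helper names are hypothetical rather than part of the JEMMA Workbench.

```python
# Made-up rows that follow the column layout of the metadata tables.
projects = [
    {"project_id": "p1", "project_path": "/data/demo", "project_name": "demo"},
]
packages = [
    {"project_id": "p1", "package_id": "k1",
     "package_path": "/data/demo/src/util", "package_name": "util"},
]
classes = [
    {"project_id": "p1", "package_id": "k1", "class_id": "c1",
     "class_path": "/data/demo/src/util/Strings.java", "class_name": "Strings"},
]
methods = [
    {"project_id": "p1", "package_id": "k1", "class_id": "c1",
     "method_id": "m1", "method_name": "trim", "start_line": 10, "end_line": 14},
]


def index_by(rows, key):
    """Build a dict from the given id column to its row."""
    return {row[key]: row for row in rows}


PROJECTS = index_by(projects, "project_id")
PACKAGES = index_by(packages, "package_id")
CLASSES = index_by(classes, "class_id")
METHODS = index_by(methods, "method_id")


def describe_method(method_id):
    """Resolve a method's project, package, and class names via the id columns."""
    method = METHODS.get(method_id)
    if method is None:
        return None
    project = PROJECTS[method["project_id"]]
    package = PACKAGES[method["package_id"]]
    cls = CLASSES[method["class_id"]]
    return "{}.{}.{}.{} [{}-{}]".format(
        project["project_name"], package["package_name"], cls["class_name"],
        method["method_name"], method["start_line"], method["end_line"],
    )


print(describe_method("m1"))  # demo.util.Strings.trim [10-14]
```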


JEMMA Representations

| Representation Code | Representation Name | Link to dataset |
|:------------------------------|:----------------------|:------------------|
| TEXT | raw_source_code | https://doi.org/10.5281/zenodo.5813705 |
| TKNA | code_tokens (spaced) | https://doi.org/10.5281/zenodo.5813717 |
| TKNB | code_tokens (comma) | https://doi.org/10.5281/zenodo.5813730 |
| C2VC | code2vec* | https://doi.org/10.5281/zenodo.5813993 |
| C2SQ | code2seq* | https://doi.org/10.5281/zenodo.5814059 |
| FTGR | feature_graph* | https://doi.org/10.5281/zenodo.5813933 |


JEMMA Properties

| Property Code | Property Name | Link to dataset |
|:------------------------------|:---------------------------------|:------------------|
| RSLK | resource_leak | https://doi.org/10.5281/zenodo.1096082 |
| NLDF | null_dereference | https://doi.org/10.5281/zenodo.1096080 |
| NMLC | num_local_calls | https://doi.org/10.5281/zenodo.7020084 |
| NMNC | num_non_local_calls | https://doi.org/10.5281/zenodo.7019960 |
| NUCC | num_unique_callees | https://doi.org/10.5281/zenodo.7019176 |
| NUPC | num_unique_callers | https://doi.org/10.5281/zenodo.7019128 |
| CMPX | cyclomatic_complexity | https://doi.org/10.5281/zenodo.5813084 |
| MXIN | max_indent | https://doi.org/10.5281/zenodo.5813081 |
| NAME | method_name | https://doi.org/10.5281/zenodo.5813308 |
| NMLT | num_literals | https://doi.org/10.5281/zenodo.5813054 |
| NMOP | num_operators | https://doi.org/10.5281/zenodo.5813055 |
| NMPR | num_parameters | https://doi.org/10.5281/zenodo.5813053 |
| NMRT | num_returns | https://doi.org/10.5281/zenodo.5813034 |
| NMTK | num_tokens | https://doi.org/10.5281/zenodo.5813032 |
| NTID | num_identifiers | https://doi.org/10.5281/zenodo.5813029 |
| NUID | num_unique_identifiers | https://doi.org/10.5281/zenodo.5813028 |
| SLOC | source_lines_of_code | https://doi.org/10.5281/zenodo.5813094 |
| TLOC | total_lines_of_code | https://doi.org/10.5281/zenodo.5813102 |
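
A property becomes useful once it is keyed back to the methods it describes. The sketch below shows one way to load a property file and select methods by value. The file layout is an assumption: it supposes a CSV with a `method_id` column and one value column named after the property, which may not match the files published on Zenodo. The helper names and the sample rows are hypothetical.

```python
import csv
import io

# Hypothetical CSV contents; the real column layout may differ.
CMPX_CSV = """method_id,cyclomatic_complexity
m1,1
m2,7
m3,12
"""


def load_property(csv_text, value_column):
    """Map each method_id to its integer property value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["method_id"]: int(row[value_column]) for row in reader}


def methods_above(values, threshold):
    """Return the ids of methods whose value exceeds the threshold, sorted."""
    return sorted(mid for mid, value in values.items() if value > threshold)


complexity = load_property(CMPX_CSV, "cyclomatic_complexity")
print(methods_above(complexity, 5))  # ['m2', 'm3']
```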


JEMMA Callgraphs

| Link to callgraphs data | columns |
|:----------------------- |:------- |
| Callgraphs | caller_project_id |
| | caller_class_id |
| | caller_method_id |
| | call_direction |
| | callee_project_id |
| | callee_class_id |
| | callee_method_id |
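
Each callgraph row can be read as an edge from a caller method to a callee method. The sketch below groups such edges into per-method sets of callees and callers. It uses made-up rows with only the two method id columns, and it ignores `call_direction` because its values are not described here. It is an illustration of working with the edge list, not the procedure used to compute the `num_unique_callees` or `num_unique_callers` properties.

```python
from collections import defaultdict

# Made-up edges; real rows also carry project ids, class ids, and call_direction.
edges = [
    {"caller_method_id": "m1", "callee_method_id": "m2"},
    {"caller_method_id": "m1", "callee_method_id": "m2"},  # repeated call
    {"caller_method_id": "m1", "callee_method_id": "m3"},
    {"caller_method_id": "m2", "callee_method_id": "m3"},
]


def unique_neighbours(rows):
    """Return (callees, callers): dicts mapping a method id to a set of ids."""
    callees = defaultdict(set)
    callers = defaultdict(set)
    for row in rows:
        callees[row["caller_method_id"]].add(row["callee_method_id"])
        callers[row["callee_method_id"]].add(row["caller_method_id"])
    return callees, callers


callees, callers = unique_neighbours(edges)
print(sorted(callees["m1"]))  # ['m2', 'm3']
print(sorted(callers["m3"]))  # ['m1', 'm2']
```

Using sets means repeated calls between the same pair of methods are counted once.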

<!-- | | *call_type* | -->

Working with JEMMA Workbench

List of API calls


projects

  • get_project_id

    Returns the project_id of the project (queried by project name).

    Parameters:

    • project_name: (str) - name of the project

    Returns:

    • Returns a str uuid of the corresponding project (project_id)
    • Returns None if no project with that name was found
    • Returns None if multiple projects were found with the same name

  • get_project_id_by_path

    Returns the project id of the project (queried with project path).

    Parameters:

    • project_path: (str) - path of the project defined in jemma

    Returns:

    • Returns a str uuid of the corresponding project (project_id)
    • Returns None if no such project_path was found
    • Returns None if multiple projects were found with the same path

  • get_project_id_class_id

    Returns the project id of the project (queried with class id).

    Parameters:

    • class_id: (str) - any class_id defined within jemma

    Returns:

    • Returns a str uuid of the corresponding project (project_id)
    • Returns None if no such project_id was found

  • get_project_id_by_method_id

    Returns the project id of the project (queried with method id).

    Parameters:

    • method_id: (str) - any method_id defined within jemma

    Returns:

    • Returns a str uuid of the corresponding project (project_id)
    • Returns None if no such project_id was found

  • get_project_name

    Returns the project name of the project.

    Parameters:

    • project_id: (str) - any project_id defined within jemma

    Returns:

    • Returns a str of the corresponding project name
    • Returns None if no such project_id is defined in jemma

  • get_project_path

    Returns the project path of the project.

    Parameters:

    • project_id: (str) - any project_id defined within jemma

    Returns:

    • Returns a str of the corresponding project path
    • Returns None if no such project_id is defined in jemma

  • get_project_size_by_classes

    Returns the size of a project, by the number of classes.

    Parameters:

    • project_id: (str) - any project_id defined within jemma

    Returns:

    • Returns a str of the corresponding project size, by the number of classes
    • Returns None if no such project_id is defined in jemma

  • get_project_size_by_methods

    Returns the size of a project, by the number of methods.

    Parameters:

    • project_id: (str) - any project_id defined within jemma

    Returns:

    • Returns a str of the corresponding project size, by the number of methods
    • Returns None if no such project_id is defined in jemma

  • get_project_class_ids

    Returns all class ids defined within the project.

    Parameters:

    • project_id: (str) - any project_id defined within jemma
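
The lookup behaviour described in the list above (a single match returns an id, while no match or an ambiguous match returns None) can be sketched over in-memory rows. This is an illustration only and not the Workbench implementation: the function names are deliberately different from the API calls, the rows are made up, and the size helper returns an int, whereas the list above documents the size as a str.

```python
# Made-up metadata rows using the documented id columns.
projects = [
    {"project_id": "p1", "project_name": "demo"},
    {"project_id": "p2", "project_name": "shared-name"},
    {"project_id": "p3", "project_name": "shared-name"},
]
methods = [
    {"project_id": "p1", "class_id": "c1", "method_id": "m1"},
    {"project_id": "p1", "class_id": "c1", "method_id": "m2"},
    {"project_id": "p1", "class_id": "c2", "method_id": "m3"},
]


def lookup_project_id(project_name):
    """Return the project_id for a name; None if absent or ambiguous."""
    matches = [p["project_id"] for p in projects
               if p["project_name"] == project_name]
    return matches[0] if len(matches) == 1 else None


def lookup_project_id_by_method_id(method_id):
    """Return the project_id that owns the method, or None if unknown."""
    for row in methods:
        if row["method_id"] == method_id:
            return row["project_id"]
    return None


def count_methods(project_id):
    """Return the number of methods in a project, or None if it is unknown."""
    if not any(p["project_id"] == project_id for p in projects):
        return None
    return sum(1 for row in methods if row["project_id"] == project_id)


print(lookup_project_id("demo"))             # p1
print(lookup_project_id("shared-name"))      # None (two projects share the name)
print(lookup_project_id_by_method_id("m3"))  # p1
print(count_methods("p1"))                   # 3
```

Callers of such lookups need to handle None explicitly, since it covers both the "not found" and the "ambiguous name" cases.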