Jemma
JEMMA: An Extensible Java dataset for Many ML4Code Applications

This is the official documentation for the JEMMA project.
JEMMA is an Extensible Java dataset for Many ML4Code Applications. It is primarily a dataset of Java code entities at multiple granularities, their properties, and representations. To help users interact and work with the data seamlessly, we have added Workbench capabilities to it as well.
This repository hosts the Workbench part of JEMMA; the raw data is hosted on Zenodo and can be downloaded at any time while using the Workbench. The following sections provide more details.
<a id="setup-instructions"></a>
Setup Instructions
<!-- > Getting started with jemma -->First steps: Install jemma locally
1. $ git clone https://github.com/giganticode/jemma.git
2. $ cd jemma/
3. $ pip install -r requirements.txt
4. $ pip install -e .
Next steps: Download all the datasets <br> Sign up at Zenodo.org and generate an API access token [IMPORTANT!]
5. $ cd jemma/download/
6. $ nano config.ini (& replace the dummy `access_token` with your API key)
7. $ python3 download.py
8. $ python3 sanity_checks.py
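Step 6 is easy to skip by accident, so it can help to confirm that the dummy token was actually replaced before running the download. The following is a minimal sketch using Python's standard `configparser`; the `[zenodo]` section name and the `YOUR-TOKEN-HERE` placeholder are assumptions for illustration, since the actual layout of `config.ini` is not shown here.

```python
import configparser

# Hypothetical placeholder value; the dummy value in the real config.ini may differ.
PLACEHOLDER = "YOUR-TOKEN-HERE"

# Hypothetical file contents; the real config.ini may use another section name.
SAMPLE_CONFIG = """
[zenodo]
access_token = YOUR-TOKEN-HERE
"""


def find_access_token(config_text):
    """Return the first `access_token` value found in any section, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    for section in parser.sections():
        if parser.has_option(section, "access_token"):
            return parser.get(section, "access_token").strip()
    return None


def token_is_set(config_text):
    """True only if a non-empty token other than the placeholder is present."""
    token = find_access_token(config_text)
    return bool(token) and token != PLACEHOLDER


if __name__ == "__main__":
    if not token_is_set(SAMPLE_CONFIG):
        print("access_token still looks like the dummy value")
```

To check a real file, read its text with `open("config.ini").read()` and pass it to `token_is_set`.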
Getting to know JEMMA Datasets
JEMMA Metadata
| Link to metadata | columns |
|:---------|:----------|
| projects | project_id |
| | project_path |
| | project_name |
| | |
| packages | project_id |
| | package_id |
| | package_path |
| | package_name |
| | |
| classes | project_id |
| | package_id |
| | class_id |
| | class_path |
| | class_name |
| | |
| methods | project_id |
| | package_id |
| | class_id |
| | method_id |
| | method_name |
| | start_line |
| | end_line |
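The metadata tables are linked through their shared id columns: each method row carries the ids of its class, package, and project. The sketch below illustrates that relationship with plain dictionaries. The sample rows, the short ids, the helper names, and the assumption that `start_line` and `end_line` are both inclusive are all made up for illustration; they are not taken from the dataset or from the Workbench code.

```python
# Made-up sample rows that use the documented column names.
# Real JEMMA ids are uuids; short ids keep the sketch readable.
projects = [{"project_id": "p1", "project_path": "org/demo", "project_name": "demo"}]
packages = [{"project_id": "p1", "package_id": "k1",
             "package_path": "org/demo/src/com/demo", "package_name": "com.demo"}]
classes = [{"project_id": "p1", "package_id": "k1", "class_id": "c1",
            "class_path": "org/demo/src/com/demo/Main.java", "class_name": "Main"}]
methods = [{"project_id": "p1", "package_id": "k1", "class_id": "c1",
            "method_id": "m1", "method_name": "run",
            "start_line": 10, "end_line": 24}]


def index_by(rows, key):
    """Build a dict from the given id column to its row."""
    return {row[key]: row for row in rows}


project_by_id = index_by(projects, "project_id")
package_by_id = index_by(packages, "package_id")
class_by_id = index_by(classes, "class_id")
method_by_id = index_by(methods, "method_id")


def describe_method(method_id):
    """Resolve a method's parent entities through the shared id columns."""
    method = method_by_id[method_id]
    return {
        "project": project_by_id[method["project_id"]]["project_name"],
        "package": package_by_id[method["package_id"]]["package_name"],
        "class": class_by_id[method["class_id"]]["class_name"],
        "method": method["method_name"],
        # Assumes both line numbers are inclusive.
        "num_lines": method["end_line"] - method["start_line"] + 1,
    }


if __name__ == "__main__":
    print(describe_method("m1"))
```

With the real CSV files, the same joins could be done after loading each table with `csv.DictReader` or a dataframe library.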
JEMMA Representations
| Representation Code | Representation Name | Link to dataset |
|:------------------------------|:----------------------|:------------------|
| TEXT | raw_source_code | https://doi.org/10.5281/zenodo.5813705 |
| TKNA | code_tokens (spaced) | https://doi.org/10.5281/zenodo.5813717 |
| TKNB | code_tokens (comma) | https://doi.org/10.5281/zenodo.5813730 |
| C2VC | code2vec* | https://doi.org/10.5281/zenodo.5813993 |
| C2SQ | code2seq* | https://doi.org/10.5281/zenodo.5814059 |
| FTGR | feature_graph* | https://doi.org/10.5281/zenodo.5813933 |
JEMMA Properties
| Property Code $~~~~~~~~~$ | Property Name $~~~~~~~~~$ | Link to dataset |
|:------------------------------|:---------------------------------|:------------------|
| RSLK | resource_leak | https://doi.org/10.5281/zenodo.1096082 |
| NLDF | null_dereference | https://doi.org/10.5281/zenodo.1096080 |
| NMLC | num_local_calls | https://doi.org/10.5281/zenodo.7020084 |
| NMNC | num_non_local_calls | https://doi.org/10.5281/zenodo.7019960 |
| NUCC | num_unique_callees | https://doi.org/10.5281/zenodo.7019176 |
| NUPC | num_unique_callers | https://doi.org/10.5281/zenodo.7019128 |
| CMPX | cyclomatic_complexity | https://doi.org/10.5281/zenodo.5813084 |
| MXIN | max_indent | https://doi.org/10.5281/zenodo.5813081 |
| NAME | method_name | https://doi.org/10.5281/zenodo.5813308 |
| NMLT | num_literals | https://doi.org/10.5281/zenodo.5813054 |
| NMOP | num_operators | https://doi.org/10.5281/zenodo.5813055 |
| NMPR | num_parameters | https://doi.org/10.5281/zenodo.5813053 |
| NMRT | num_returns | https://doi.org/10.5281/zenodo.5813034 |
| NMTK | num_tokens | https://doi.org/10.5281/zenodo.5813032 |
| NTID | num_identifiers | https://doi.org/10.5281/zenodo.5813029 |
| NUID | num_unique_identifiers | https://doi.org/10.5281/zenodo.5813028 |
| SLOC | source_lines_of_code | https://doi.org/10.5281/zenodo.5813094 |
| TLOC | total_lines_of_code | https://doi.org/10.5281/zenodo.5813102 |
JEMMA Callgraphs
| Link to callgraphs data | columns |
|:----------------------- |:------- |
| Callgraphs | caller_project_id |
| | caller_class_id |
| | caller_method_id |
| | call_direction |
| | callee_project_id |
| | callee_class_id |
| | callee_method_id |
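Each callgraph row describes one caller/callee pair, so the rows can be folded into adjacency maps. The following is a minimal sketch under stated assumptions: the sample edges, the short ids, and the helper names `edge`, `build_callees`, and `build_callers` are hypothetical, and `call_direction` is left unset because its possible values are not described here.

```python
from collections import defaultdict


def edge(caller_class, caller_method, callee_class, callee_method):
    """Build one made-up row using the documented callgraph columns."""
    return {
        "caller_project_id": "p1",
        "caller_class_id": caller_class,
        "caller_method_id": caller_method,
        "call_direction": None,  # values not documented here; not interpreted
        "callee_project_id": "p1",
        "callee_class_id": callee_class,
        "callee_method_id": callee_method,
    }


# Hypothetical edges; real ids are uuids.
edges = [
    edge("c1", "m1", "c1", "m2"),
    edge("c1", "m1", "c2", "m3"),
    edge("c1", "m1", "c1", "m2"),  # repeated call to the same callee
    edge("c1", "m2", "c2", "m3"),
]


def build_callees(rows):
    """Map each caller method id to the set of method ids it calls."""
    callees = defaultdict(set)
    for row in rows:
        callees[row["caller_method_id"]].add(row["callee_method_id"])
    return dict(callees)


def build_callers(rows):
    """Map each callee method id to the set of method ids that call it."""
    callers = defaultdict(set)
    for row in rows:
        callers[row["callee_method_id"]].add(row["caller_method_id"])
    return dict(callers)


if __name__ == "__main__":
    print(build_callees(edges))
    print(build_callers(edges))
```

Because the values are sets, repeated calls between the same pair of methods are counted once, which is the kind of count that properties such as num_unique_callees and num_unique_callers suggest.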
<!-- | | *call_type* | -->Working with JEMMA Workbench
List of API calls
projects
- `get_project_id`

  Returns the project_id of the project (queried by project name).

  Parameters:
  - project_name: (str) - name of the project

  Returns:
  - Returns a str uuid of the corresponding project (project_id)
  - Returns None if no such project_id was found
  - Returns None if multiple projects were found with the same name
- `get_project_id_by_path`

  Returns the project id of the project (queried with project path).

  Parameters:
  - project_path: (str) - path of the project defined in jemma

  Returns:
  - Returns a str uuid of the corresponding project (project_id)
  - Returns None if no such project_path was found
  - Returns None if multiple projects were found with the same path
- `get_project_id_class_id`

  Returns the project id of the project (queried with class id).

  Parameters:
  - class_id: (str) - any class_id defined within jemma

  Returns:
  - Returns a str uuid of the corresponding project (project_id)
  - Returns None if no such project_id was found
- `get_project_id_by_method_id`

  Returns the project id of the project (queried with method id).

  Parameters:
  - method_id: (str) - any method_id defined within jemma

  Returns:
  - Returns a str uuid of the corresponding project (project_id)
  - Returns None if no such project_id was found
- `get_project_name`

  Returns the project name of the project.

  Parameters:
  - project_id: (str) - any project_id defined within jemma

  Returns:
  - Returns a str of the corresponding project name
  - Returns None if no such project_id is defined in jemma
- `get_project_path`

  Returns the project path of the project.

  Parameters:
  - project_id: (str) - any project_id defined within jemma

  Returns:
  - Returns a str of the corresponding project path
  - Returns None if no such project_id is defined in jemma
- `get_project_size_by_classes`

  Returns the size of a project, by the number of classes.

  Parameters:
  - project_id: (str) - any project_id defined within jemma

  Returns:
  - Returns a str of the corresponding project size, by the number of classes
  - Returns None if no such project_id is defined in jemma
- `get_project_size_by_methods`

  Returns the size of a project, by the number of methods.

  Parameters:
  - project_id: (str) - any project_id defined within jemma

  Returns:
  - Returns a str of the corresponding project size, by the number of methods
  - Returns None if no such project_id is defined in jemma
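The project calls above share one convention: a value is returned on a unique match, and None is returned when nothing matches or, for name and path queries, when the match is ambiguous. The sketch below imitates that convention over a made-up in-memory table. It is not the Workbench implementation: `lookup_project_id`, `lookup_project_field`, and the sample rows are hypothetical stand-ins, and the sketch does not call any jemma function.

```python
from typing import Optional

# Made-up rows with the documented `projects` columns; real ids are uuids.
PROJECTS = [
    {"project_id": "p1", "project_name": "demo", "project_path": "org-a/demo"},
    {"project_id": "p2", "project_name": "demo", "project_path": "org-b/demo"},
    {"project_id": "p3", "project_name": "tools", "project_path": "org-a/tools"},
]


def lookup_project_id(rows, column, value) -> Optional[str]:
    """Return the project_id of the single row where `column` equals `value`.

    Mirrors the documented behaviour: None when there is no match
    or when several projects share the same value.
    """
    matches = [row["project_id"] for row in rows if row[column] == value]
    return matches[0] if len(matches) == 1 else None


def lookup_project_field(rows, project_id, column) -> Optional[str]:
    """Return one field of the project with this id, or None if it is undefined."""
    for row in rows:
        if row["project_id"] == project_id:
            return row[column]
    return None


if __name__ == "__main__":
    print(lookup_project_id(PROJECTS, "project_name", "tools"))
    print(lookup_project_id(PROJECTS, "project_name", "demo"))
    print(lookup_project_field(PROJECTS, "p2", "project_path"))
```

Since a None result can mean either "not found" or "ambiguous", callers that query by name may prefer to query by path when names are known to repeat across projects.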
- `get_project_class_ids`

  Returns all class ids defined within the project.

  Parameters:
  - project_id: (str) - any project_id defined within jemma
