Spiral<img width="130px" align="right" src=".graphics/spiral.svg">

Spiral is a Python module that provides several different functions for splitting identifiers found in source code files. The name Spiral is a loose acronym based on "SPlitters for IdentifieRs: A Library".

Authors: Michael Hucka<br> Repository: https://github.com/casics/spiral<br> License: Unless otherwise noted, this content is licensed under the GPLv3 license.

🏁 Recent news and activities

April 2018: Version 1.1.0 fixes a bug that prevented importing Spiral, and another bug that caused setup.py to fail to install dependencies automatically. Additional enhancements include improved command-line help and internal code refactoring.

Introduction
Please cite the paper
Installation instructions
Basic operation
Performance of Ronin
Other splitters in Spiral
Limitations
More information
Getting help and support
Contributing — info for developers
Acknowledgments

☀ Introduction

Spiral is a Python 3 package that implements numerous identifier splitting algorithms. Identifier splitting (also known as identifier name tokenization) is the task of breaking apart program identifier strings such as getInt or readUTF8stream into component tokens: [get, int] and [read, utf8, stream]. The need for splitting identifiers arises in a variety of contexts, including natural language processing (NLP) methods applied to source code analysis and program comprehension. Spiral provides some basic naive splitting algorithms, such as a straightforward camel-case splitter, as well as more elaborate heuristic splitters, such as a new algorithm called Ronin.
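To make the idea of a naive splitter concrete, here is a minimal sketch of a regex-based camel-case splitter. It is an illustration only and is not how Spiral's own splitters are implemented; the function name `naive_camel_split` and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical helper for illustration; not part of Spiral's API.
_TOKEN = re.compile(
    r'[A-Z]+(?![a-z])'    # a run of capitals not followed by lowercase (e.g. UTF)
    r'|[A-Z]?[a-z]+'      # an optionally capitalized lowercase word (e.g. Int, get)
    r'|[0-9]+'            # a run of digits
)

def naive_camel_split(identifier):
    """Split an identifier at simple case and digit boundaries.

    Characters that match none of the patterns, such as underscores,
    act as separators and are dropped.
    """
    return _TOKEN.findall(identifier)

print(naive_camel_split('getInt'))          # ['get', 'Int']
print(naive_camel_split('readUTF8stream'))  # ['read', 'UTF', '8', 'stream']
```

Note that this naive approach separates `UTF` from `8` in `readUTF8stream`, whereas the desired split keeps `utf8` together. Cases like this are what motivate the more elaborate heuristic splitters.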

♥️ Please cite the Spiral paper and the version you use

Article citations are critical for academic developers. If you use Spiral and you publish papers about your software, we ask that you please cite the Spiral paper:

<dl> <dd> Hucka, M. (2018). Spiral: splitters for identifiers in source code files. <i>Journal of Open Source Software</i>, 3(24), 653, <a href="https://doi.org/10.21105/joss.00653">https://doi.org/10.21105/joss.00653</a> </dd> </dl>

Please also indicate the specific version of Spiral you use, to improve other people's ability to reproduce your results. You can use the Zenodo DOIs we provide for this purpose:

Spiral release 1.0.1 ⇒ 10.5281/zenodo.1211835

✺ Installation instructions

For basic usage, the following is probably the simplest and most direct way to install Spiral on your computer:

sudo pip3 install git+https://github.com/casics/spiral.git

Alternatively, you can clone this GitHub repository and then run setup.py:

git clone https://github.com/casics/spiral.git
cd spiral
sudo python3 -m pip install .

The above should be all you need to run Spiral out of the box. If you plan on experimenting with alternative parameter values or alternative dictionaries, you will additionally need to install the following:

Platypus-Opt, an optimization library (if you want to optimize parameters). Important: This Platypus is not the same as the other package called "Platypus" in PyPI. Make sure to get Platypus-Opt or install from the Platypus repo.
Data modules from NLTK, particularly nltk_words and nltk_wordnet from the nltk.corpus module and the nltk.stem module. These cannot be installed automatically by setup.py; please refer to the NLTK installation instructions to learn how to install them.

▶︎ Basic operation

Spiral is extremely easy to use. To use a Spiral splitter in a Python program, simply import a splitter and then call it.

from spiral.simple_splitters import pure_camelcase_split
print(pure_camelcase_split('TestString'))

Some splitters take optional parameters, and the more complex splitters have an init() function that you can optionally call to set additional parameters or load data. Currently, only Ronin and Samurai have initialization functions, and calling init() is optional—if you do not call it, Ronin and Samurai will call their init() functions automatically.

Here are some examples of using the Ronin splitter algorithm. The following input,

from spiral import ronin
for s in [ 'mStartCData', 'nonnegativedecimaltype', 'getUtf8Octets', 'GPSmodule', 'savefileas', 'nbrOfbugs']:
    print(ronin.split(s))

produces the following output:

['m', 'Start', 'C', 'Data']
['nonnegative', 'decimal', 'type']
['get', 'Utf8', 'Octets']
['GPS', 'module']
['save', 'file', 'as']
['nbr', 'Of', 'bugs']

Spiral also includes a command-line program named spiral in the bin subdirectory; it will split strings provided on the command line or in a file. (Note: Ronin and Samurai load internal data files when they start up. In normal use, called from an application program, the resulting startup delay will only happen once. In the command-line program, the data is reloaded at every invocation, causing a startup delay every time. The delay is not typical for normal Spiral usage.)

By the way, if your goal is to split identifiers obtained during source code mining and you need to filter out gibberish strings of characters before attempting to split them, you may find Nostril (the Nonsense String Evaluator) useful.

🎯 Performance of Ronin

Splitting identifiers is deceptively difficult. It is a research-grade problem for which no perfect solution exists. Even in cases where the input consists of identifiers that strictly follow conventions such as camel case, ambiguities can arise. For example, correctly splitting J2SEProjectTypeProfiler into J2SE, Project, Type, Profiler requires the reader to recognize J2SE as a unit. The task of splitting identifiers is made more difficult when there are no case transitions or other obvious boundaries in an identifier. And then there is the small problem that humans are often simply inconsistent!
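The ambiguity that arises without case transitions can be sketched in a few lines. The example below enumerates every way to segment a lowercase identifier using a toy vocabulary. Both the vocabulary and the function name `segmentations` are assumptions made for illustration, and this is not the algorithm used by Ronin.

```python
def segmentations(s, vocabulary):
    """Return every way to write s as a concatenation of vocabulary words."""
    if not s:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in vocabulary:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(s[end:], vocabulary):
                results.append([head] + rest)
    return results

# Toy vocabulary, assumed for this example only.
VOCAB = {'non', 'negative', 'nonnegative', 'decimal', 'type'}

for candidate in segmentations('nonnegativedecimaltype', VOCAB):
    print(candidate)
# ['non', 'negative', 'decimal', 'type']
# ['nonnegative', 'decimal', 'type']
```

Both candidates are built from valid words, so a splitter needs additional evidence to prefer one of them over the other.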

Ronin is an advanced splitter that uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin includes a default table of term frequencies derived from an analysis of over 46,000 randomly selected software projects on GitHub that contained at least one Python source code file. The tokens were extracted using software from Spiral's parent project, CASICS (specifically, the extractor package), and the frequency table was constructed using a procedure implemented in the small program create_frequency_file included with Spiral. Ronin also has a number of parameters that need to be tuned; this can be done using the optimization program optimize-ronin in the dev/optimization subdirectory. The default parameter values were derived by optimizing performance against two data sets available from other research groups:

The Loyola University of Delaware Identifier Splitting Oracle (Ludiso)
The INTT data set, extracted from the zip archive of INTT

Spiral includes copies of these data sets in the tests/data subdirectory. The parameters were derived primarily by running against the INTT database of 18,772 identifiers and their splits. The following table summarizes the results:

| Data set | Number of splits matched by Ronin | Total in data set | Accuracy |
|----------|----------------------------------:|------------------:|---------:|
| INTT     | 17,287                            | 18,772            | 92.09%   |
| Ludiso   | 2,248                             | 2,663             | 84.42%   |
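The accuracy figures in the table can be reproduced with simple arithmetic. This short sketch assumes that accuracy means the percentage of identifiers whose split matches the reference split; the helper name `accuracy` is hypothetical.

```python
def accuracy(matched, total):
    """Percentage of matched splits, rounded to two decimal places."""
    return round(100.0 * matched / total, 2)

# (matched, total) pairs taken from the table above.
results = {'INTT': (17287, 18772), 'Ludiso': (2248, 2663)}

for name, (matched, total) in results.items():
    print(f'{name}: {accuracy(matched, total):.2f}%')
# INTT: 92.09%
# Ludiso: 84.42%
```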

Many of the "failures" against these sets of identifiers are actually not failures, but cases where Ronin gets a more correct answer or where there is a legitimate difference in interpretation. Here are some examples:

| Identifier | Ludiso result | Ronin split |
|------------|---------------|-------------|
| a.ecart | a ecart | a e cart |
| ConvertToAUTF8String | Convert To AUTF 8 String | Convert To A UTF8 `S
