Siamese: Code Clone Search Engine

Siamese (Scalable, incremental, and multi-representation) is a code clone search system powered by Elasticsearch, with code clone detection techniques (code normalisation, n-grams, and query reduction) built on top. It scales to searching a large corpus of Java source code for clones of Type-1 to Type-3/Type-4 within seconds.
Build from Source:
1. Download elasticsearch-2.2.0 and extract it to disk.
mkdir ~/siamese
cd ~/siamese
wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.2.0/elasticsearch-2.2.0.tar.gz
tar -xvf elasticsearch-2.2.0.tar.gz
rm elasticsearch-2.2.0.tar.gz
2. Modify the configuration file config/elasticsearch.yml.
cd elasticsearch-2.2.0
vim config/elasticsearch.yml
Add the following lines at the end of the file. Save and quit.
cluster.name: stackoverflow
index.query.bool.max_clause_count: 4096
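If you would rather script this step than edit the file by hand, the following is a minimal sketch. `append_setting` is a hypothetical helper (not part of Elasticsearch or Siamese) that adds a line only when it is not already present, so running it twice does not duplicate the settings.

```shell
# Sketch: append a setting line to a config file only if it is not already there.
# append_setting is a hypothetical helper name used for illustration.
append_setting() {
  local file="$1" line="$2"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  if grep -qxF -- "$line" "$file" 2>/dev/null; then
    return 0
  fi
  # If the file is non-empty and does not end with a newline, add one first.
  if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
    printf '\n' >> "$file"
  fi
  printf '%s\n' "$line" >> "$file"
}

# Example usage (run from ~/siamese/elasticsearch-2.2.0):
# append_setting config/elasticsearch.yml 'cluster.name: stackoverflow'
# append_setting config/elasticsearch.yml 'index.query.bool.max_clause_count: 4096'
```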
3. Clone the project from GitHub.
cd ~/siamese
git clone https://github.com/UCL-CREST/Siamese.git
4. Install JDK and Maven
sudo apt-get install default-jdk
sudo apt-get install maven
Then check that you can call javac.
javac
If javac does not produce any output, your JAVA_HOME is probably not set. Set JAVA_HOME by opening the file /etc/environment (editing it usually requires root privileges)
sudo vim /etc/environment
and pasting the location of JAVA_HOME at the end of the file. You can locate JAVA_HOME with
whereis javac
ls -l <the path>
... keep following the path until you find the real path (not a symlink) to javac.
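Following the symlinks by hand can be shortened with `readlink -f` from GNU coreutils, which resolves the whole chain in one call. The sketch below assumes the usual JDK layout, where javac lives in `<JAVA_HOME>/bin/javac`; `java_home_from` is a hypothetical helper name.

```shell
# Sketch: derive a JAVA_HOME candidate from a (possibly symlinked) javac path.
# Assumes the real javac sits at <JAVA_HOME>/bin/javac.
java_home_from() {
  local real
  real="$(readlink -f -- "$1")" || return 1   # follow every symlink to the real file
  dirname -- "$(dirname -- "$real")"          # strip the trailing "/bin/javac"
}

# Example usage:
# java_home_from "$(command -v javac)"
```

The printed directory is the value to put in /etc/environment, in the form `JAVA_HOME="<printed path>"`.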
5. Modify the location of elasticsearch in config.properties.
cd Siamese
vim config.properties
Set elasticsearchLoc to the directory where you extracted elasticsearch, for example:
elasticsearchLoc=/my/dir/elasticsearch-2.2.0
Save and quit.
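The same change can be made without an editor. This is a rough sketch, assuming the property is written as `key=value` on a single line as in the example above; `set_property` is a hypothetical helper, and it does not handle values that contain `|` or `&`.

```shell
# Sketch: set key=value in a properties file, replacing an existing line or appending a new one.
# set_property is a hypothetical helper name used for illustration.
set_property() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file" 2>/dev/null; then
    # Write to a temporary file and move it back, to avoid relying on sed -i.
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example usage (run from the Siamese directory):
# set_property config.properties elasticsearchLoc /my/dir/elasticsearch-2.2.0
```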
6. Try starting the elasticsearch service from the directory where you extracted it.
cd ~/siamese
./elasticsearch-2.2.0/bin/elasticsearch
You should see an elasticsearch execution log like this.
[2018-10-02 03:50:35,305][INFO ][node ] [Warlock] version[2.2.0], pid[27101], build[8ff36d1/2016-01-27T13:32:39Z]
[2018-10-02 03:50:35,305][INFO ][node ] [Warlock] initializing ...
[2018-10-02 03:50:35,658][INFO ][plugins ] [Warlock] modules [lang-expression, lang-groovy], plugins [], sites []
[2018-10-02 03:50:35,674][INFO ][env ] [Warlock] using [1] data paths, mounts [[/ (/dev/sda2)]], net usable_space [107.8gb], net total_space [202.6gb], spins? [no], types [ext4]
[2018-10-02 03:50:35,674][INFO ][env ] [Warlock] heap size [989.8mb], compressed ordinary object pointers [true]
[2018-10-02 03:50:36,919][INFO ][node ] [Warlock] initialized
[2018-10-02 03:50:36,919][INFO ][node ] [Warlock] starting ...
[2018-10-02 03:50:36,982][INFO ][transport ] [Warlock] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-10-02 03:50:36,989][INFO ][discovery ] [Warlock] stackoverflow/VPfoqhukSoiP7RtKKgvYmg
[2018-10-02 03:50:40,037][INFO ][cluster.service ] [Warlock] new_master {Warlock}{VPfoqhukSoiP7RtKKgvYmg}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2018-10-02 03:50:40,063][INFO ][http ] [Warlock] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-10-02 03:50:40,064][INFO ][node ] [Warlock] started
[2018-10-02 03:50:40,101][INFO ][gateway ] [Warlock] recovered [0] indices into cluster_state
Then, kill the process (Ctrl+C) and start the elasticsearch engine as a background service (with the -d flag).
./elasticsearch-2.2.0/bin/elasticsearch -d
You can also test that elasticsearch is running in the background by issuing the command below.
curl -XGET 'localhost:9200/_cat/indices?v&pretty'
You should see output like this. Only the header row is printed, which means there is no index in elasticsearch yet.
health status index pri rep docs.count docs.deleted store.size pri.store.size
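When elasticsearch is started with -d it may take a few seconds before it answers requests, so a script that continues straight away can fail. A small polling loop like the one below can wait for it. This is only a sketch: `wait_for_es` is a hypothetical helper, and the default URL assumes the HTTP address `localhost:9200` shown in the log above.

```shell
# Sketch: poll an HTTP endpoint until it answers or the number of tries runs out.
# wait_for_es is a hypothetical helper name; arguments are URL, tries, delay in seconds.
wait_for_es() {
  local url="${1:-http://localhost:9200}" tries="${2:-30}" delay="${3:-1}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # -s: silent, -f: fail on HTTP errors, -o /dev/null: discard the response body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage:
# ./elasticsearch-2.2.0/bin/elasticsearch -d
# wait_for_es && curl -XGET 'localhost:9200/_cat/indices?v&pretty'
```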
7. Create an executable jar and copy it to the Siamese home directory
cd Siamese
mvn compile package
cp -i target/siamese-0.0.*.jar .
8. Try to execute Siamese.
java -jar siamese-0.0.6-SNAPSHOT.jar
9. You will see how to execute Siamese printed on the screen.
$ java -jar siamese-0.0.6-SNAPSHOT.jar
usage: (v 0.6) $java -jar siamese.jar -cf <config file> [-i input] [-o
       output] [-c command] [-h help]
Example: java -jar siamese.jar -cf config.properties
Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o
/my/output/dir -c index
 -c,--command <arg>        [optional] command to execute [index, search].
                           This will override the configuration file.
 -cf,--configFile <arg>    [* requried *] a configuration file
 -h,--help <optional>      print help
 -i,--inputFolder <arg>    [optional] location of the input files (for
                           index or query). This will override the
                           configuration file.
 -o,--outputFolder <arg>   [optional] location of the search result file.
                           This will override the configuration file.
10. An example of running Siamese to index a project "foo".
java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
11. Then, tell Siamese to search for clones of "bar" in the index of "foo".
java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
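The index and search commands above can be chained so that the search only runs if indexing succeeded. The wrapper below is a sketch, not part of Siamese: `index_then_search` and the `SIAMESE_CMD` variable are hypothetical names, and the default command assumes the jar name used in this README.

```shell
# Sketch: run "index" on a corpus, then "search" with a query project.
# SIAMESE_CMD can be overridden (for example with "echo") to preview the commands.
index_then_search() {
  local corpus="$1" query="$2" out="$3" conf="${4:-config.properties}"
  local cmd="${SIAMESE_CMD:-java -jar siamese-0.0.6-SNAPSHOT.jar}"
  # $cmd is intentionally unquoted so it splits into the program and its arguments.
  $cmd -c index -i "$corpus" -cf "$conf" || return 1
  $cmd -c search -i "$query" -o "$out" -cf "$conf"
}

# Example usage:
# index_then_search /my/dir/foo /my/dir/bar /my/output/dir
```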
12. After Siamese finishes its execution, the output file (clone classes) will be located at /my/output/dir.
The file name follows the pattern data_qr_<timestamp>.xml.
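Because each run writes a new timestamped file, it can be handy to pick up the most recent one automatically. The sketch below selects by file modification time rather than by parsing the timestamp in the name, since the exact timestamp format is not described here; `latest_result` is a hypothetical helper.

```shell
# Sketch: print the most recently modified data_qr_*.xml file in a directory.
# latest_result is a hypothetical helper name; it prints nothing if no file matches.
latest_result() {
  local dir="$1"
  # ls -t sorts by modification time, newest first; errors are silenced when nothing matches.
  ls -t "$dir"/data_qr_*.xml 2>/dev/null | head -n 1
}

# Example usage:
# latest_result /my/output/dir
```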
13. If you want to enforce a similarity threshold on the search results,
modify the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity.
Choose any similarity thresholds you like for the four code representations (r0, r1, r2, r3) respectively.
computeSimilarity : tokenratio
simThreshold : 50%,50%,50%,50%
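A malformed threshold list is easy to introduce when editing by hand. The check below is a sketch that assumes the value must look exactly like the example above: four comma-separated whole percentages between 0% and 100%, one per representation (r0, r1, r2, r3). Whether Siamese accepts other forms is not covered here, and `check_thresholds` is a hypothetical helper.

```shell
# Sketch: succeed only if the argument is four comma-separated whole percentages (0% to 100%).
# check_thresholds is a hypothetical helper name; the accepted format is an assumption.
check_thresholds() {
  local value="$1" rest part count=0
  rest="$value,"
  while [ -n "$rest" ]; do
    part="${rest%%,*}"      # text before the first comma
    rest="${rest#*,}"       # text after the first comma
    case "$part" in
      *%) part="${part%?}" ;;   # drop the trailing percent sign
      *) return 1 ;;
    esac
    case "$part" in
      ''|*[!0-9]*) return 1 ;;  # must be digits only
    esac
    [ "$part" -le 100 ] || return 1
    count=$((count + 1))
  done
  [ "$count" -eq 4 ]
}

# Example usage:
# check_thresholds '50%,50%,50%,50%' && echo "simThreshold looks well-formed"
```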
Downloads
- Executable Tool (JAR file):
  - Siamese: The Siamese executable can be downloaded here: Siamese v. 0.6. Please make sure you have Java 8 installed on your machine.
    1. To execute Siamese, unzip the file and follow the steps below:
       $ cd siamese
       $ ./elasticsearch-2.2.0/bin/elasticsearch -d
       $ java -jar siamese-0.0.5-SNAPSHOT.jar
       Then you'll see the usage and an example of how to use Siamese.
       usage: (v 0.5) $java -jar siamese.jar -cf <config file> [-i input] [-o output] [-c command] [-h help]
       Example: java -jar siamese.jar -cf config.properties
       Example: java -jar siamese.jar -cf config.properties -i /my/input/dir -o /my/output/dir -c index
        -c,--command <arg>        [optional] command to execute [index, search].
                                  This will override the configuration file.
        -cf,--configFile <arg>    [* requried *] a configuration file
        -h,--help <optional>      print help
        -i,--inputFolder <arg>    [optional] location of the input files (for index or query).
                                  This will override the configuration file.
        -o,--outputFolder <arg>   [optional] location of the search result file.
                                  This will override the configuration file.
    2. An example of running Siamese to index a project "foo".
       java -jar siamese-0.0.6-SNAPSHOT.jar -c index -i /my/dir/foo -cf config.properties
    3. Then, tell Siamese to search for clones of "bar" in "foo".
       java -jar siamese-0.0.6-SNAPSHOT.jar -c search -i /my/dir/bar -o /my/output/dir -cf config.properties
    4. After Siamese finishes its execution, the output file (clone classes) will be located at /my/output/dir. The file name follows the pattern data_qr_<timestamp>.xml.
    5. If you want to enforce a similarity threshold on the search results, modify the config.properties file to enable fuzzywuzzy or tokenratio (recommended) similarity. Choose any similarity thresholds you like for the four code representations (r0, r1, r2, r3) respectively.
       computeSimilarity : tokenratio
       simThreshold : 50%,50%,50%,50%
  - BigCloneEval: BigCloneEval is a tool for automated recall evaluation based on the BigCloneBench data set. It can be downloaded from: BigCloneBench
- Data sets: the data sets that we used to evaluate Siamese are listed below:
  - OCD (Obfuscation/Compilation/Decompilation) data set. The OCD data set is from a study by Ragkhitwetsagul et al. and can be found here: OCD data set.
  - **SOCO (SOurce COde
