Cpg
A library to extract Code Property Graphs from C/C++, Java, Go, Python, Ruby and every other language through LLVM-IR.
Code Property Graph
A simple library to extract a code property graph out of source code. It supports multiple passes that can extend the analysis after the graph is constructed. It currently supports C/C++ (C17) and Java (Java 13), and has experimental support for Golang, Python and TypeScript. Furthermore, it supports LLVM IR and thus, in theory, all languages that compile using LLVM.
What is this?
A code property graph (CPG) is a representation of source code in the form of a labelled directed multi-graph. Think of it as a directed graph where each node and edge is assigned a (possibly empty) set of key-value pairs (properties). This representation is supported by a range of graph databases such as Neptune, Cosmos, Neo4j, Titan, and Apache Tinkergraph and can be used to store the source code of a program in a searchable data structure. Thus, the code property graph makes it possible to use existing graph query languages such as Cypher, NQL, SQL, or Gremlin to either manually navigate through interesting parts of the source code or to automatically find "interesting" patterns.
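To make the definition above concrete, here is a minimal sketch of a labelled directed multi-graph with properties on nodes and edges. It is illustrative only: the classes `Node`, `Edge` and `PropertyGraph`, the labels `CallExpression` and `Reference`, and the edge types `AST` and `DFG` are names chosen for this example and are not the data model of this library.

```kotlin
// Illustrative property-graph sketch; these are NOT the library's own classes.

// A node carries a set of labels and a (possibly empty) map of properties.
data class Node(
    val id: Int,
    val labels: Set<String>,
    val properties: Map<String, Any> = emptyMap(),
)

// An edge is directed and typed, and may carry properties as well.
data class Edge(
    val from: Node,
    val to: Node,
    val type: String,
    val properties: Map<String, Any> = emptyMap(),
)

class PropertyGraph {
    val nodes = mutableListOf<Node>()
    val edges = mutableListOf<Edge>()

    fun addNode(labels: Set<String>, properties: Map<String, Any> = emptyMap()): Node {
        val node = Node(nodes.size, labels, properties)
        nodes += node
        return node
    }

    // Multi-graph: any number of edges between the same pair of nodes is allowed.
    fun addEdge(from: Node, to: Node, type: String, properties: Map<String, Any> = emptyMap()): Edge {
        val edge = Edge(from, to, type, properties)
        edges += edge
        return edge
    }

    // Follow the outgoing edges of one type, similar to a one-hop graph query.
    fun successors(node: Node, type: String): List<Node> =
        edges.filter { it.from == node && it.type == type }.map { it.to }
}

fun main() {
    val graph = PropertyGraph()
    // Roughly models the call `foo(x)`: a call node with one argument node.
    val call = graph.addNode(setOf("CallExpression"), mapOf("name" to "foo"))
    val arg = graph.addNode(setOf("Reference"), mapOf("name" to "x"))
    graph.addEdge(call, arg, "AST", mapOf("index" to 0)) // syntactic child
    graph.addEdge(arg, call, "DFG")                      // data flows into the call

    println(graph.successors(call, "AST").map { it.properties["name"] })
}
```

A graph database stores essentially the same structure, which is why a query language such as Cypher can express the `successors` lookup as a pattern match over nodes and relationships.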
This library uses Eclipse CDT for parsing C/C++ source code and JavaParser for parsing Java. In contrast to compiler AST generators, both are "forgiving" parsers that can cope with incomplete or even semantically incorrect source code. That makes it possible to analyze source code even without being able to compile it (due to missing dependencies or minor syntax errors). Furthermore, it uses LLVM through the javacpp project to parse LLVM IR. Note that the LLVM IR parser is not forgiving, i.e., the LLVM IR code must at least be considered valid by LLVM. The necessary native libraries are shipped by the javacpp project for most platforms.
Specifications
In order to improve some formal aspects of our library, we created several specifications of our core concepts. Currently, the following specifications exist:
We aim to provide more specifications over time.
Usage
To build the project from source, you have to generate a gradle.properties file locally.
This file also enables and disables the supported programming languages.
We provide a sample file here - simply copy it to gradle.properties in the directory of the cpg-project.
Instead of manually generating or editing the gradle.properties file, you can also use the configure_frontends.sh script, which edits the properties setting the supported programming languages for you.
For Visualization Purposes
To get familiar with the graph itself, you can use the subproject cpg-neo4j. It uses this library to generate the CPG for a set of user-provided code files. The graph is then persisted to a Neo4j graph database. The advantage for the user is that Neo4j's visualization software, Neo4j Browser, can be used to look at the CPG nodes and edges graphically, instead of at their Java representations.
Please make sure that the APOC plugin is enabled on your Neo4j server. It is used for mass-creating nodes and relationships.
For example, using Docker:
docker run -p 7474:7474 -p 7687:7687 -d -e NEO4J_AUTH=neo4j/password -e NEO4JLABS_PLUGINS='["apoc"]' neo4j:5
As Library
The most recent version is being published to Maven central and can be used as a simple dependency, either using Maven or Gradle.
dependencies {
    val cpgVersion = "9.0.2"
    // use the 'cpg-core' module
    implementation("de.fraunhofer.aisec", "cpg-core", cpgVersion)
    // and then add the needed extra modules, such as Go and Python
    implementation("de.fraunhofer.aisec", "cpg-language-go", cpgVersion)
    implementation("de.fraunhofer.aisec", "cpg-language-python", cpgVersion)
}
Some extra steps are necessary for the cpg-language-cxx module. Since Eclipse CDT is not published on Maven Central, you need to add a repository with a custom layout to find the released CDT files. For example, using Gradle's Kotlin syntax:
repositories {
    // This is only needed for the C++ language frontend
    ivy {
        setUrl("https://download.eclipse.org/tools/cdt/releases/")
        metadataSources {
            artifact()
        }
        patternLayout {
            artifact("[organisation].[module]_[revision].[ext]")
        }
    }
}
Beware that the cpg module includes all optional features and might potentially be HUGE (especially because of the LLVM support). If you do not need LLVM, we suggest just using the cpg-core module with the needed extra modules like cpg-language-go. We are working on extracting more optional features into separate modules.
Development Builds
For all builds on the main branch, an artefact is published in the GitHub Packages under the version main-SNAPSHOT. Additionally, selected PRs that have the publish-to-github-packages label will also be published there. This is useful if an important feature is not yet in main, but you want to test it. The version refers to the PR number, e.g. 1954-SNAPSHOT.
To use the GitHub Gradle Registry, please refer to https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-gradle-registry#using-a-published-package
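As a rough sketch of what the linked GitHub documentation describes, a Gradle Kotlin build script could declare the registry as shown below. The repository URL is an assumption derived from the project name, and the property names `gpr.user` and `gpr.key` as well as the environment variables `GITHUB_ACTOR` and `GITHUB_TOKEN` are placeholders for however you choose to supply your GitHub credentials.

```kotlin
repositories {
    maven {
        // Assumed registry URL for this project; verify it against the GitHub Packages page
        url = uri("https://maven.pkg.github.com/Fraunhofer-AISEC/cpg")
        credentials {
            // GitHub Packages requires authentication, even for public packages
            username = (project.findProperty("gpr.user") as String?) ?: System.getenv("GITHUB_ACTOR")
            password = (project.findProperty("gpr.key") as String?) ?: System.getenv("GITHUB_TOKEN")
        }
    }
}

dependencies {
    // Development build of the main branch; use e.g. "1954-SNAPSHOT" for a published PR
    implementation("de.fraunhofer.aisec", "cpg-core", "main-SNAPSHOT")
}
```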
Configuration
The behavior of the library can be configured in several ways. Most of this is done through the TranslationConfiguration
and the InferenceConfiguration.
TranslationConfiguration
The TranslationConfiguration configures various aspects of the translation. E.g., it determines which languages/language
frontends and passes will be used, which information should be inferred, which files will be included, among others. The
configuration is set through a builder pattern.
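The following is a minimal sketch of how such a builder chain might look when analyzing a directory of Java sources. The source path is hypothetical, and the builder methods shown (`sourceLocations`, `optionalLanguage`, `defaultPasses`) as well as the `TranslationManager` calls are written from memory of the library's API; please check them against the version you depend on.

```kotlin
import de.fraunhofer.aisec.cpg.TranslationConfiguration
import de.fraunhofer.aisec.cpg.TranslationManager
import java.io.File

fun main() {
    val config = TranslationConfiguration
        .builder()
        // Hypothetical location of the code under analysis
        .sourceLocations(File("src/main/java"))
        // Registers the Java frontend if cpg-language-java is on the classpath (assumed class name)
        .optionalLanguage("de.fraunhofer.aisec.cpg.frontends.java.JavaLanguage")
        // Enables the standard set of passes
        .defaultPasses()
        .build()

    // Runs the translation and blocks until the graph has been built
    val result = TranslationManager
        .builder()
        .config(config)
        .build()
        .analyze()
        .get()

    println("Translation finished: $result")
}
```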
InferenceConfiguration
The class InferenceConfiguration can be used to affect the behavior of the passes if they identify missing nodes.
Several flags can be enabled. The most important ones are:
- inferRecords enables the inference of missing record declarations (i.e., classes and structs).
- inferDfgForUnresolvedCalls adds DFG edges to method calls that represent all potential data flows if the called function is not present in the source code under analysis.
Only inferDfgForUnresolvedCalls is turned on by default.
The configuration can be made through a builder pattern and is set in the TranslationConfiguration as follows:
val inferenceConfig = InferenceConfiguration
    .builder()
    .inferRecords(true)
    .inferDfgForUnresolvedCalls(true)
    .build()
val translationConfig = TranslationConfiguration
    .builder()
    .inferenceConfiguration(inferenceConfig)
    .build()
Development
This section describes languages, how well they are supported, and how to use and develop them yourself.
Language Support
Languages are maintained to different degrees, and are noted in the table below with:
- maintained: if they are mostly feature complete and bugs have priority of being fixed.
- incubating: if the language is currently being worked on to reach a state of feature completeness.
- experimental: if a first working prototype was implemented, e.g., to support research topics, and its future development is unclear.
- discontinued: if the language is no longer actively developed or maintained but is kept for everyone to fork and adapt.
The current state of languages is:
| Language | Module | Branch | State |
|--------------------------|---------------------------------------|-------------------------------------------------------------------------|----------------|
| Java (Source) | cpg-language-java | main | maintained |
| C++ | cpg-language-cxx | main | maintained |
| Python | cpg-language-python | main | maintained |
| Go | cpg-language-go | main | maintained |
| INI | cpg-language-ini | main | maintained |
| JVM (Bytecode) | cpg-language-jvm | main | incubating |
| LLVM | cpg-language-llvm | main | incubating |
| TypeScript/JavaScript | cpg-language-typescript | main | `experimen
