1.实现功能

1.收集全局信息，包括project下能收集到的函数定义、类型别名、结构体定义信息。
2.继续收集全局信息，搜寻global scope下的函数引用
3.遍历每个函数，获取每个函数中函数引用处
4.针对存在indirect-call的函数，收集该函数下的局部变量定义，对于icall，尝试基于参数类型匹配潜在的callee。

传统静态分析部分参考code-analyzer

如果要使用openai的API，请先运行pip install openai
如果要使用google gemini，请运行 pip install google-generativeai。
如果调用ChatGLM的API，请先运行 pip install zhipuai。
如果调用通义千问的API，请运行 pip install dashscope。

使用：

纯类型分析，不用LLM:

python evaluation_analyzer.py --only_count_scope --disable_llm_for_uncertain --llm_strategy=none --base_analyzer=kelp --evaluate_uncertain --root_path=$PATH_TO_PROJECT --scope_strategy=base --num_worker=1 --projects=$PROJECT_NAME

用LLM分析:

python evaluation_analyzer.py --only_count_scope --log_llm_output --log_res_to_file --disable_llm_for_uncertain  --llm_strategy=sea --base_analyzer=flta --evaluate_uncertain --root_path=$PATH_TO_PROJECT --projects=$PROJECT --num_worker=12 --temperature=$TEMPERATURE --running_epoch=$EPOCH openai_local --model_type=$SELECTED_MODEL --address=$LOCAL_ADDRESS

2.LLM的部署

2.1.server

该项目目前支持调用openai, 智谱, google gemini, 阿里通义系列的API。本地部署的模型尝试过用3种方式部署：

目前以上部署方式都支持openai的api访问server。不过使用时发现了一些问题

vllm部署时通过openai api访问时不需要添加 max_tokens 参数，但是sglang部署时需要手动指定这些 max 参数，容易降低效率。
vllm和sglang部署只需要传递context长度参数 (vllm的 --max-model-len 以及sglang的 --context-length)，但是text-generation-inference需要指定 --max-total-tokens、--max-input-length，感觉不是很灵活。
vllm单gpu部署时效率感觉很高，但是多gpu部署时容易出现同步错误，这个错误貌似到0.4.0还没解决。

这里建议大家通过vllm或者sglang部署，如果用vllm，用 openai_local 调用本地模型时可以不传入 max_tokens 参数，但是sglang得传入，可以传个大点的比如 3072。

chat模板加载方式：

sglang
- 对于用 launch_server 的方式如果没有手动指定chat_template，则会用tokenizer默认的chat_template。
- 如果用python代码调用API，chat_template加载方式为硬编码在py文件中，参考chat_template.py，sglang会在把modelpath lower后比对qwen等关键词查找对应模板。如果手动指定模版参数，其处理过程参考load_chat_template_for_openai_api，模版文件必须为json格式。
vllm的模板加载相对灵活，会去model的tokenizer文件中找chat template，比如qwen1.5-14B-Chat的tokenizer_config.json中有 chat_template 字段定义了该模型的chat template。
swift的模型-模板对应表参考model.py，定义的全部模版参考template.py，同义硬编码。不过相比sglang，硬编码的是真多，需要在参数用 template_type 手动指定使用的模板。

在我们tool下，当用sglang部署model时，请添加 max_tokens 参数，否则sglang会用默认最大生成token数。用swift部署时，记得添加 server_type 参数，将 model_name 做一次映射。

2.2.models

llama3存在一个eos token问题，参考llama3 end token，需要user手动设置eos token。不过TGI貌似2.0.2版本后修复了这个问题，不需要手动设置eos token。

3.others

插桩用到的pass：LLVM Instrumentation Pass

其它LLM模型：rebuttal的时候怕了两组code特定LLM作为baseline，当时（24年6月）算比较SOTA的code LLM，不过效果看起来并不是很好。一个可能原因是code LLM的自然语言推理能力不如general LLM。

| model | model_type | precision | recall | F1 | | ---- | ---- | ---- | ---- | ---- | | deepseek-coder-instruct | code-LLM | 27.6 | 92.1 | 36.3 | | CodeQwen-1.5-Chat |code-LLM| 27.3 | 35.4 | 26.9 | | Qwen1.5-72B-Chat | general-LLM | 49.1 | 97.3 | 59.4 |

为了减少LLM的query次数，我们起初试图用CodeBert计算callee的declaration和caller的文本相似度进行些简单的filter操作。这部分理论上不应该引入false negative。下表展示了分别用余弦和欧氏相似度筛选top k%的caller-callee pair时的recall，可以看到CodeBert会不可避免的引入false negative，top-80%的recall只有74.1%。因此CodeBert不适合用来进行pre-filter。

| similarity | top 20 | 40 | 60 | 80 | 100 | | ---- | ---- | ---- | ---- | ---- | ---- | | cosine-similarity | 12.8 | 36.9 | 55.7 | 74.1 | 97.9 | | Euclidean-similarity | 12.3 | 36.5 | 51.3 | 71.1 | 97.9 |

4.Citation

@inproceedings{10.1145/3691620.3695016,
  author = {Cheng, Baijun and Zhang, Cen and Wang, Kailong and Shi, Ling and Liu, Yang and Wang, Haoyu and Guo, Yao and Li, Ding and Chen, Xiangqun},
  title = {Semantic-Enhanced Indirect Call Analysis with Large Language Models},
  year = {2024},
  isbn = {9798400712487},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3691620.3695016},
  doi = {10.1145/3691620.3695016},
  booktitle = {Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering},
  pages = {430–442},
  numpages = {13},
  keywords = {indirect-call analysis, semantic analysis, LLM},
  location = {Sacramento, CA, USA},
  series = {ASE '24}
}

CodeAnalyzer

Install / Use

README