NexusRaven: Surpassing the state-of-the-art in open-source function calling LLMs

Code License Data License Python 3.10+

<p align="center"> <a href="https://huggingface.co/Nexusflow" target="_blank">Nexusflow HF</a> - <a href="http://nexusflow.ai/blog" target="_blank">NexusRaven blog post</a> - <a href="https://huggingface.co/Nexusflow/NexusRaven-13B" target="_blank">NexusRaven-13B</a> - <a href="https://huggingface.co/datasets/Nexusflow/NexusRaven_API_evaluation" target="_blank">NexusRaven API evaluation dataset</a> - <a href="https://www.linkedin.com/feed/update/urn:li:activity:7113263527909330945/" target="_blank">NexusRaven LinkedIn Post</a> - <a href="https://x.com/NexusflowX/status/1707470614012035561?s=20" target="_blank">NexusRaven Twitter Thread</a> </p> <p align="center" width="100%"> <a><img src="docs/NexusRaven.png" alt="NexusRaven" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a> </p>

Welcome to the NexusRaven API repository! The primary purpose of this repository is to serve as an evaluation framework for the NexusRaven workflow and to enable accessible reproduction of our results. We hope that the contents of this repository are of value to you and your team!

Introducing NexusRaven-13B

NexusRaven is an open-source, commercially viable LLM that surpasses the state of the art in function calling capabilities.

📊 Performance Highlights: With our demonstration retrieval system, NexusRaven-13B achieves a 95% success rate in using cybersecurity tools such as CVE/CPE Search and VirusTotal, while prompted GPT-4 achieves 64%. NexusRaven-13B also offers significantly lower cost and faster inference than GPT-4.

🔧 Generalization to the Unseen: NexusRaven-13B generalizes to tools never seen during model training, achieving a success rate comparable to GPT-3.5 in a zero-shot setting and significantly outperforming all other open-source LLMs of similar size.

🔥 Commercially Permissive: The training of NexusRaven-13B does not involve any data generated by proprietary LLMs such as GPT-4. You have full control of the model when deployed in commercial applications.

<p align="center" width="100%"> <a><img src="docs/Single-Attempt_Function_Calling.png" alt="NexusRaven" style="width: 80%; min-width: 300px; display: block; margin: auto;"></a> <a><img src="docs/Zero-shot_Evaluation.png" alt="NexusRaven" style="width: 80%; min-width: 300px; display: block; margin: auto;"></a> </p>

Setup

```shell
git clone https://github.com/nexusflowai/NexusRaven
cd NexusRaven
pip install -e .
```

NexusRaven model usage

NexusRaven accepts a list of Python functions. These functions can do anything (including sending GET/POST requests to external APIs!). The only two requirements are the Python function signature and an appropriate docstring, which are used to generate the function call.
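Since only the signature and docstring are needed, an `OPTION` block for the prompt can be assembled from any Python function with the standard library's `inspect` module. A minimal sketch (the `build_option` helper is illustrative; the `<func_start>`/`<docstring_start>` delimiters follow the prompt format shown in the Quickstart):

```python
import inspect
import textwrap


def hello_world(n: int):
    """
    Prints hello world to the user.

    Args:
    n (int) : Number of times to print hello world.
    """
    for _ in range(n):
        print("hello world")


def build_option(func) -> str:
    """Render one OPTION block from a function's signature and docstring."""
    signature = f"def {func.__name__}{inspect.signature(func)}"
    docstring = textwrap.dedent(func.__doc__ or "").strip()
    return (
        "OPTION:\n"
        f"<func_start>{signature}<func_end>\n"
        "<docstring_start>\n"
        '"""\n'
        f"{docstring}\n"
        '"""\n'
        "<docstring_end>\n"
    )


block = build_option(hello_world)
print(block)
```

Concatenating one such block per candidate function yields the option list used in the prompt template below.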

NexusRaven is highly compatible with LangChain; see scripts/langchain_example.py. An example without LangChain can be found in scripts/non_langchain_example.py.

Please note that the model will sometimes reflect on its answer. We highly recommend stopping generation with the stopping criterion ["\nReflection:"] to avoid spending unnecessary tokens during inference, although the reflection may help in rare cases. Our LangChain example does this.
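If your serving stack cannot apply stop sequences at generation time, the same effect can be approximated in post-processing by truncating at the reflection marker. A minimal sketch (the helper name is illustrative):

```python
def truncate_at_reflection(generated: str, stop: str = "\nReflection:") -> str:
    """Drop everything from the reflection marker onward, if present."""
    idx = generated.find(stop)
    return generated if idx == -1 else generated[:idx]


sample = "Initial Answer: hello_world(10)\nReflection: The user asked..."
print(truncate_at_reflection(sample))  # → Initial Answer: hello_world(10)
```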

The "Initial Answer" can be executed to run the function.

Quickstart

You can run the model on a GPU using the following code.

```python
# Please `pip install transformers accelerate`
from transformers import pipeline


pipe = pipeline(
    "text-generation",
    model="Nexusflow/NexusRaven-13B",
    torch_dtype="auto",
    device_map="auto",
)

prompt_template = """
<human>:
OPTION:
<func_start>def hello_world(n : int)<func_end>
<docstring_start>
\"\"\"
Prints hello world to the user.

Args:
n (int) : Number of times to print hello world.
\"\"\"
<docstring_end>

OPTION:
<func_start>def hello_universe(n : int)<func_end>
<docstring_start>
\"\"\"
Prints hello universe to the user.

Args:
n (int) : Number of times to print hello universe.
\"\"\"
<docstring_end>

User Query: Question: {question}

Please pick a function from the above options that best answers the user query and fill in the appropriate arguments.<human_end>
"""
prompt = prompt_template.format(question="Please print hello world 10 times.")

result = pipe(prompt, max_new_tokens=100, return_full_text=False, do_sample=False)[0]["generated_text"]

# Extract the "Initial Answer" only
start_str = "Initial Answer: "
end_str = "\nReflection: "
start_idx = result.find(start_str) + len(start_str)
end_idx = result.find(end_str)
# Fall back to the end of the string if no reflection was generated
function_call = result[start_idx:] if end_idx == -1 else result[start_idx:end_idx]

print(f"Generated Call: {function_call}")
```

This will output:

```
Generated Call: hello_world(10)
```

Which can be executed.
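Executing the generated call amounts to evaluating the string against a namespace containing the candidate functions. A minimal sketch (the `run_call` helper is illustrative; restricting `eval` to the known tools is one simple safeguard, not a complete sandbox):

```python
def hello_world(n: int):
    """Prints hello world to the user, n times."""
    for _ in range(n):
        print("hello world")


def run_call(function_call: str, tools: dict):
    """Evaluate the model's generated call with only the candidate tools in scope."""
    return eval(function_call, {"__builtins__": {}}, tools)


run_call("hello_world(10)", {"hello_world": hello_world})  # prints "hello world" ten times
```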

Evaluation dataset curation

The instructions below can be used to reproduce our evaluation set, available as the NexusRaven_API_evaluation dataset (see the dataset schema).

API evaluation dataset standardization

```shell
# Process raw ToolAlpaca data
python raven/data/process_toolalpaca_evaluation_data.py

# Upload raw queries
python raven/data/upload_raw_queries.py \
    --hf_path {your hf path} \
    --subset raw_queries

# Upload standardized api list
python raven/data/upload_standardized_api_list.py \
    --hf_path {your hf path} \
    --subset standardized_api_list

# Upload standardized queries
python raven/data/upload_standardized_queries.py \
    --hf_path {your hf path} \
    --standardized_queries_subset standardized_queries \
    --raw_queries_subset raw_queries \
    --standardized_api_list_subset standardized_api_list
```

Running ToolLLM model evaluation

Getting generation responses from the ToolLLM model using this code requires access to a single GPU. We ran ours on a 40GB A100 GPU.

```shell
# Upload queries and api list in ToolLLM format
python raven/data/upload_queries_in_toolllm_format.py \
    --hf_path {your hf path} \
    --toolllm_queries_subset queries_in_toolllm_format \
    --toolllm_api_list_subset api_list_in_toolllm_format \
    --standardized_queries_subset standardized_queries \
    --standardized_api_list_subset standardized_api_list

# Run ToolLLM evaluations
python raven/eval/run_toolllm.py \
    --hf_path {your hf path} \
    --toolllm_queries_subset queries_in_toolllm_format \
    --toolllm_api_list_subset api_list_in_toolllm_format \
    --toolllm_outputs_subset outputs_in_toolllm_format
```

Evaluation

Results can be reproduced using the following instructions. The NexusRaven and CodeLlama 13B Instruct evaluations require access to a Huggingface Inference Endpoint; the GPT-3.5, GPT-3.5 Instruct, and GPT-4 evaluations require an OpenAI API key.

NexusRaven is especially capable at single-turn, zero-shot function calling, so we evaluate all models under this paradigm. Some of the models below, such as ToolLLM and ToolAlpaca, leverage multi-turn ReAct-style interactions to generate better function calls. In practice, the ReAct approach is expensive in both time and money, especially in production environments where low latency is critical.

We provide evaluation data and infrastructure for 5 datasets:

  • cve_cpe
  • emailrep
  • virustotal
  • toolalpaca
  • toolllm*

*The ToolLLM evaluation dataset originally contains no ground truths. We have done our best to curate, filter, and post-process it into a higher-quality set. Unfortunately, the resulting dataset consists of only 21 samples. When benchmarking, we have seen accuracy differences of up to 1 sample across runs due to the non-deterministic nature of how these models are served, which translates to a swing of roughly 5% in accuracy. In the future, we are looking to improve the quality of this dataset or use a higher-quality generic-domain function calling evaluation dataset!
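The quoted swing follows directly from the sample count: one sample out of 21 is

```python
samples = 21
swing = 1 / samples * 100  # one-sample accuracy difference, in percentage points
print(f"{swing:.1f}%")  # → 4.8%
```

which rounds to the "around 5%" figure above.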

NexusRaven

  1. Create a HF inference endpoint using https://huggingface.co/Nexusflow/NexusRaven-13B
    1. We ran ours on a GPU xlarge node consisting of 1x A100 40GB GPU
    2. Under Advanced configuration:
      1. We use no quantization in the TGI endpoint (only using the HF default torch.bfloat16)
      2. Max Input Length is set to 8192, Max Number of Tokens to 8193, and Max Batch Prefill Tokens to 8192
  2. Copy the inference endpoint url and use it here
  3. Make the inference endpoint public (LangChain currently does not support private endpoints)
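Once the endpoint is running, a TGI text-generation request is a JSON body with `inputs` and `parameters`. A minimal sketch of assembling one (the URL is a placeholder; the parameter names follow the TGI generate API as we understand it, so verify them against your endpoint):

```python
import json

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder, not a real endpoint


def build_request(prompt: str) -> dict:
    """Assemble a TGI-style generation request body for the NexusRaven prompt."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": 100,
            "do_sample": False,
            "stop": ["\nReflection:"],  # halt before the optional reflection
            "return_full_text": False,
        },
    }


body = json.dumps(build_request("User Query: Question: Please print hello world 10 times."))
```

The resulting `body` can then be POSTed to `ENDPOINT_URL` with your HTTP client of choice.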
