SkillAgentSearch skills...

Nanoowl

A project that optimizes OWL-ViT for real-time inference with NVIDIA TensorRT.

Install / Use

/learn @NVIDIA-AI-IOT/Nanoowl

README

<h1 align="center">NanoOWL</h1> <p align="center"><a href="#usage"/>👍 Usage</a> - <a href="#performance"/>⏱️ Performance</a> - <a href="#setup">🛠️ Setup</a> - <a href="#examples">🤸 Examples</a> <br> - <a href="#acknowledgement">👏 Acknowledgment</a> - <a href="#see-also">🔗 See also</a></p>

NanoOWL is a project that optimizes OWL-ViT to run 🔥 real-time 🔥 on NVIDIA Jetson Orin Platforms with NVIDIA TensorRT. NanoOWL also introduces a new "tree detection" pipeline that combines OWL-ViT and CLIP to enable nested detection and classification of anything, at any level, simply by providing text.

<p align="center"> <img src="assets/jetson_person_2x.gif" height="50%" width="50%"/></p>

Interested in detecting object masks as well? Try combining NanoOWL with NanoSAM for zero-shot open-vocabulary instance segmentation.

<a id="usage"></a>

👍 Usage

You can use NanoOWL in Python like this

from nanoowl.owl_predictor import OwlPredictor

predictor = OwlPredictor(
    "google/owlvit-base-patch32",
    image_encoder_engine="data/owlvit-base-patch32-image-encoder.engine"
)

image = PIL.Image.open("assets/owl_glove_small.jpg")

output = predictor.predict(image=image, text=["an owl", "a glove"], threshold=0.1)

print(output)

Or better yet, to use OWL-ViT in conjunction with CLIP to detect and classify anything, at any level, check out the tree predictor example below!

See Setup for instructions on how to build the image encoder engine.

<a id="performance"></a>

⏱️ Performance

NanoOWL runs real-time on Jetson Orin Nano.

<table style="border-top: solid 1px; border-left: solid 1px; border-right: solid 1px; border-bottom: solid 1px"> <thead> <tr> <th rowspan=1 style="text-align: center; border-right: solid 1px">Model †</th> <th colspan=1 style="text-align: center; border-right: solid 1px">Image Size</th> <th colspan=1 style="text-align: center; border-right: solid 1px">Patch Size</th> <th colspan=1 style="text-align: center; border-right: solid 1px">⏱️ Jetson Orin Nano (FPS)</th> <th colspan=1 style="text-align: center; border-right: solid 1px">⏱️ Jetson AGX Orin (FPS)</th> <th colspan=1 style="text-align: center; border-right: solid 1px">🎯 Accuracy (mAP)</th> </tr> </thead> <tbody> <tr> <td style="text-align: center; border-right: solid 1px">OWL-ViT (ViT-B/32)</td> <td style="text-align: center; border-right: solid 1px">768</td> <td style="text-align: center; border-right: solid 1px">32</td> <td style="text-align: center; border-right: solid 1px">TBD</td> <td style="text-align: center; border-right: solid 1px">95</td> <td style="text-align: center; border-right: solid 1px">28</td> </tr> <tr> <td style="text-align: center; border-right: solid 1px">OWL-ViT (ViT-B/16)</td> <td style="text-align: center; border-right: solid 1px">768</td> <td style="text-align: center; border-right: solid 1px">16</td> <td style="text-align: center; border-right: solid 1px">TBD</td> <td style="text-align: center; border-right: solid 1px">25</td> <td style="text-align: center; border-right: solid 1px">31.7</td> </tr> </tbody> </table>

<a id="setup"></a>

🛠️ Setup

  1. Install the dependencies

    1. Install PyTorch

    2. Install torch2trt

    3. Install NVIDIA TensorRT

    4. Install the Transformers library

      python3 -m pip install transformers
      
    5. (optional) Install NanoSAM (for the instance segmentation example)

  2. Install the NanoOWL package.

    git clone https://github.com/NVIDIA-AI-IOT/nanoowl
    cd nanoowl
    python3 setup.py develop --user
    
  3. Build the TensorRT engine for the OWL-ViT vision encoder

    mkdir -p data
    python3 -m nanoowl.build_image_encoder_engine \
        data/owl_image_encoder_patch32.engine
    
  4. Run an example prediction to ensure everything is working

    cd examples
    python3 owl_predict.py \
        --prompt="[an owl, a glove]" \
        --threshold=0.1 \
        --image_encoder_engine=../data/owl_image_encoder_patch32.engine
    

That's it! If everything is working properly, you should see a visualization saved to data/owl_predict_out.jpg.

<a id="examples"></a>

🤸 Examples

Example 1 - Basic prediction

<img src="assets/owl_predict_out.jpg" height="256px"/>

This example demonstrates how to use the TensorRT optimized OWL-ViT model to detect objects by providing text descriptions of the object labels.

To run the example, first navigate to the examples folder

cd examples

Then run the example

python3 owl_predict.py \
    --prompt="[an owl, a glove]" \
    --threshold=0.1 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

By default the output will be saved to data/owl_predict_out.jpg.

You can also use this example to profile inference. Simply set the flag --profile.

Example 2 - Tree prediction

<img src="assets/tree_predict_out.jpg" height="256px"/>

This example demonstrates how to use the tree predictor class to detect and classify objects at any level.

To run the example, first navigate to the examples folder

cd examples

To detect all owls, and the detect all wings and eyes in each detect owl region of interest, type

python3 tree_predict.py \
    --prompt="[an owl [a wing, an eye]]" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

By default the output will be saved to data/tree_predict_out.jpg.

To classify the image as indoors or outdoors, type

python3 tree_predict.py \
    --prompt="(indoors, outdoors)" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

To classify the image as indoors or outdoors, and if it's outdoors then detect all owls, type

python3 tree_predict.py \
    --prompt="(indoors, outdoors [an owl])" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

Example 3 - Tree prediction (Live Camera)

<img src="assets/jetson_person_2x.gif" height="50%" width="50%"/>

This example demonstrates the tree predictor running on a live camera feed with live-edited text prompts. To run the example

  1. Ensure you have a camera device connected

  2. Launch the demo

    cd examples/tree_demo
    python3 tree_demo.py ../../data/owl_image_encoder_patch32.engine
    
  3. Second, open your browser to http://<ip address>:7860

  4. Type whatever prompt you like to see what works! Here are some examples

    • Example: [a face [a nose, an eye, a mouth]]
    • Example: [a face (interested, yawning / bored)]
    • Example: (indoors, outdoors)

<a id="acknowledgement"></a>

👏 Acknowledgement

Thanks to the authors of OWL-ViT for the great open-vocabluary detection work.

<a id="see-also"></a>

🔗 See also

View on GitHub
GitHub Stars416
CategoryDevelopment
Updated3d ago
Forks73

Languages

Python

Security Score

100/100

Audited on Apr 2, 2026

No findings