
A reinforcement learning object detector leveraging saliency ranking, offering a self-explainable system with a fully observable action log. | B.Sc. IT (Hons) Artificial Intelligence Dissertation | University of Malta Dean's List Awards 2024


SaRLVision

<p align="right" style="text-align: right;"> <strong>"A reinforcement learning object detector which leverages saliency ranking."</strong> </p> <br> <p align="left" style="text-align: left;"> <strong>"A self-explainable detector that provides a fully observable action log."</strong> </p> <p align='center'> <table align="center"> <tr> <td align="center"> <img src="Diagrams/GIFs/bottle/bottle_GIF_5.gif" alt="GIF1" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/horse/horse_GIF_1.gif" alt="GIF2" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/car/car_GIF_1.gif" alt="GIF3" width="100%" height="auto" /> </td> </tr> <tr> <td align="center"> <img src="Diagrams/GIFs/diningtable/diningtable_GIF_6.gif" alt="GIF4" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/sheep/sheep_GIF_1.gif" alt="GIF5" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/pottedplant/pottedplant_GIF_9.gif" alt="GIF6" width="100%" height="auto" /> </td> </tr> <tr> <td align="center"> <img src="Diagrams/GIFs/train/train_GIF_3.gif" alt="GIF7" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/person/person_GIF_3.gif" alt="GIF8" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/dog/dog_GIF_6.gif" alt="GIF9" width="100%" height="auto" /> </td> </tr> <tr> <td align="center"> <img src="Diagrams/GIFs/bird/bird_GIF_5.gif" alt="GIF10" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/cat/cat_GIF_6.gif" alt="GIF11" width="100%" height="auto" /> </td> <td align="center"> <img src="Diagrams/GIFs/aeroplane/aeroplane_GIF_6.gif" alt="GIF12" width="100%" height="auto" /> </td> </tr> </table> </p> <p align="justify">

Abstract

<i>In an era where sustainability and transparency are paramount, the importance of effective object detection algorithms, pivotal for enhancing efficiency, safety, and automation across various domains, cannot be overstated. While algorithms such as YOLO and Faster R-CNN are notably fast, they unfortunately lack transparency in their decision-making process. This study explores a series of object detection experiments that combine reinforcement learning-based visual attention methods with saliency ranking techniques, in an effort to investigate transparent and sustainable solutions. By employing saliency ranking techniques that emulate human visual perception, the reinforcement learning agent is provided with an initial bounding box prediction. The agent then iteratively refines this bounding box by selecting from a finite set of actions over multiple time steps, ultimately achieving accurate object detection. This research also investigates various image feature extraction methods and explores diverse Deep Q-Network (DQN) architectural variations for training deep reinforcement learning-based localisation agents. Additionally, it focuses on optimising the pipeline at every juncture by prioritising lightweight and faster models. Another feature of the proposed system is the classification of detected objects, a capability absent in previous reinforcement learning approaches. After evaluating the performance of these agents on the Pascal VOC 2007 dataset, faster and more optimised models were developed. Notably, the best mean Average Precision (mAP) achieved in this study was 51.4, surpassing benchmarks from RL-based single object detectors in the literature. The designed system provides a distinct edge over previous methods by allowing multiple configurable real-time visualisations.
These visualisations offer users a clear view of the current bounding box coordinates and the actions being performed, enabling a more intuitive understanding of algorithmic decisions. Ultimately, this fosters trust and transparency in object detection systems, aiding the deployment of artificial intelligence techniques in high-risk areas while continuously advancing research in the field of AI.</i> </p>

System Overview

<p align="justify">

The system first generates a saliency ranking heatmap from the input image, emphasising regions of interest. It then uses the highest-ranked regions to form an initial bounding box prediction, a key stage in object localisation. This prediction is fed to the RL environment, where an agent navigates a series of time steps, repeatedly performing actions to refine the bounding box and precisely pinpoint the object within the image, while also predicting the object's class label.

</p> <p align='center'> <img src="Diagrams/Architecture.png" alt="Architecture" width="80%" height="auto"> </p>
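The saliency-seeded refinement loop described above can be sketched as follows. This is an illustrative stand-in, not the repository's actual API: `refine`, `detect`, the step size, and the toy policy are all assumptions for demonstration.

```python
# Illustrative sketch of the overall SaRLVision loop with a stub policy.
# Function names, signatures, and the step size are assumptions.

def refine(box, action, step=8):
    """Translate the box by a fixed step in the chosen direction."""
    x1, y1, x2, y2 = box
    moves = {"left": (-step, 0), "right": (step, 0),
             "up": (0, -step), "down": (0, step)}
    dx, dy = moves[action]
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def detect(initial_box, policy, max_steps=200):
    """Iteratively refine a saliency-seeded box until 'trigger'."""
    box, log = initial_box, []
    for t in range(max_steps):
        action = policy(box, t)
        log.append((t, action, box))          # fully observable action log
        if action == "trigger":               # terminate the search
            break
        box = refine(box, action)
    return box, log

# Toy policy: move right twice, then terminate the search.
box, log = detect((10, 10, 50, 50),
                  lambda b, t: "right" if t < 2 else "trigger")
```

Every step is appended to `log`, which is what makes the action sequence fully observable after the fact.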

Saliency Ranking

<p align='justify'>

The first stage in the development of the system uses saliency ranking to derive an initial bounding box estimate. Alternatively, users may choose not to employ this technique, in which case the initial bounding box covers the entire input image, a practice commonly observed in the existing literature. After the saliency ranking heatmap is obtained from SaRa, a bounding box delineating the pertinent image segments is extracted. This technique considers a proportion of the highest-ranked areas, with a fixed threshold of 30% and the number of iterations set to 1. Generating these initial bounding boxes is critical because it separates and delineates prominent regions of the image for further refinement using RL techniques.

</p> <p align='center'> <img src="Diagrams/SaRa -3D plot.png" alt="SaRa" width="70%" height="auto"> </p>
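A minimal sketch of this extraction step, assuming the 30% threshold from the text: keep the top-ranked fraction of heatmap cells and take the smallest rectangle enclosing them. This simplifies the SaRa-based procedure; the function name and grid layout are illustrative, not the repository's implementation.

```python
import numpy as np

def initial_box(heatmap, top_fraction=0.30):
    """Smallest box enclosing the top-ranked fraction of heatmap cells.

    Returns (x1, y1, x2, y2) with an exclusive bottom-right corner.
    """
    flat = np.sort(heatmap, axis=None)[::-1]            # ranks, descending
    k = max(1, int(round(top_fraction * flat.size)))    # cells to keep
    cutoff = flat[k - 1]
    ys, xs = np.nonzero(heatmap >= cutoff)              # top-ranked cells
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

# Toy 8x8 rank grid: higher value = higher saliency rank.
ranks = np.arange(64).reshape(8, 8)
box = initial_box(ranks)
```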

Reinforcement Learning

<p align="justify">

In the subsequent phase of the devised pipeline, reinforcement learning is harnessed to accomplish object localisation within the images. To this end, the system was built on the Gymnasium API, which facilitated formulating the problem as a Markov Decision Process (MDP), inspired by the existing literature. Deep Reinforcement Learning (DRL) techniques were then applied to approximate a solution to the object detection problem.

</p>

Action Space

<p align="justify">

Similar to methodologies commonly employed in object localisation tasks, the action set $A$ consists of eight transformations that can be applied to the bounding box, along with one action designated to terminate the search. These transformations are grouped into four subsets: horizontal and vertical box movement, scale adjustment, and aspect ratio modification. Consequently, the agent has four degrees of freedom to adjust the bounding box $[x_1, y_1, x_2, y_2]$ during interactions with the environment. Additionally, a trigger action indicates successful object localisation by the current box, concluding the ongoing search sequence and drawing an inhibition-of-return (IoR) marker on the detected object.

</p> <p align='center'> <img src="Diagrams/Actions_white.png" alt="Actions" width="100%" height="auto"> </p>
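The eight box transformations can be sketched as below: four movement actions, two scale actions, and two aspect-ratio actions, each scaled by a fraction of the current box size. The action names and the fraction `alpha` are illustrative assumptions; the repository may use different values.

```python
# Hypothetical implementation of the eight box transformations; the
# ninth action ("trigger") leaves the box unchanged and ends the search.

def apply_action(box, action, alpha=0.2):
    """Apply one transformation to an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx, dy = alpha * w, alpha * h
    if action == "right":     x1, x2 = x1 + dx, x2 + dx   # move
    elif action == "left":    x1, x2 = x1 - dx, x2 - dx
    elif action == "down":    y1, y2 = y1 + dy, y2 + dy
    elif action == "up":      y1, y2 = y1 - dy, y2 - dy
    elif action == "bigger":  x1, y1, x2, y2 = x1 - dx, y1 - dy, x2 + dx, y2 + dy
    elif action == "smaller": x1, y1, x2, y2 = x1 + dx, y1 + dy, x2 - dx, y2 - dy
    elif action == "fatter":  y1, y2 = y1 + dy, y2 - dy   # shrink height
    elif action == "taller":  x1, x2 = x1 + dx, x2 - dx   # shrink width
    return (x1, y1, x2, y2)
```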

Deep Q-Network Architecture

<p align="justify">

The DQN architecture introduced in the presented system is responsible for decision-making in object localisation. To this end, the designed architecture draws inspiration from methodologies in the prevalent literature. Our proposed approach introduces four DQN variants:

  1. Vanilla DQN (DQN)
  2. Double DQN (DDQN)
  3. Dueling DQN
  4. Double Dueling DQN (D3QN)

Our approach advocates a deeper DQN network to bolster decision-making capabilities and capture greater learning complexity. To mitigate overfitting, dropout layers are integrated into the network architecture. Additionally, this work develops a Dueling DQN agent to improve learning efficiency by decoupling the state-value and advantage functions. The dueling design divides the $Q$-value function into two streams, allowing the agent to better assess the value of taking specific actions in different situations. The proposed approach also evaluates DDQN and D3QN techniques, which have not been previously examined in this context, in pursuit of better results.

</p> <p align='center'> <img src="Diagrams/dqn_architectures.png" alt="DQN Architecture" width="70%" height="auto"> </p>
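Two of the ingredients above can be illustrated numerically: the dueling aggregation that recombines the state-value and advantage streams into $Q$-values, and the Double DQN target, where the online network selects the next action and the target network evaluates it. This is a numerical sketch only; the repository's networks are full deep models, and these function names are assumptions.

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))."""
    adv = np.asarray(advantages, dtype=float)
    return value + (adv - adv.mean())

def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN: the online net picks a*, the target net evaluates it."""
    if done:
        return reward
    a_star = int(np.argmax(q_online_next))     # action selection (online)
    return reward + gamma * q_target_next[a_star]  # evaluation (target)

# One state value, advantages for three actions.
q = dueling_q(2.0, [1.0, 3.0, -1.0])
# Online net prefers action 1; target net supplies its value estimate.
target = double_dqn_target(1.0, 0.9,
                           np.array([0.2, 0.8]), np.array([0.5, 0.3]), False)
```

Subtracting the mean advantage keeps the decomposition identifiable: adding a constant to all advantages and subtracting it from the value would otherwise leave $Q$ unchanged.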

Self-Explainability

<p align="justify"> The study proposes a system that creates a log and displays the current environment in several rendering modes to illustrate explainability, as demonstrated below: <p align='center'> <img src="Diagrams/Visualisations.png" alt="Visualisations" width="100%" height="auto"> </p>

These visualisations give users insight into the current action being performed, the current IoU, the current Recall, the environment step counter, the current reward, and a clear view of the current bounding box and ground-truth bounding box locations in the original image. Furthermore, unlike the object detectors and methodologies previously discussed, this methodology permits observing decision-making during the training phase, albeit with a slight time overhead for creating the visualisations. Nonetheless, the system provides a clear log outlining the framework's decision-making process for the current detection, offering insight into the object detector's training and evaluation, as observed below:

<p align='center'> <img src="Diagrams/Self-Explainability.png" alt="Self-Explainability" width="100%" height="auto"> </p> </p>
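One observable log record per time step might be formatted as follows. The field names follow the quantities listed above; the exact format string is a hypothetical illustration, not the repository's actual log layout.

```python
# Hypothetical per-step log record for the observable action log.
# Field names follow the quantities the text lists; format is illustrative.

def log_line(step, action, iou, recall, reward, box):
    return (f"step={step:03d} action={action:<8} IoU={iou:.3f} "
            f"Recall={recall:.3f} reward={reward:+.1f} box={box}")

line = log_line(4, "right", 0.612, 0.700, 1.0, (18, 10, 58, 50))
```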

SaRLVision Window

