<center>

RobotVQA: A Scene-Graph- and Deep-Learning-based Visual Question Answering System for Robot Manipulation

</center>

Introduction

EASE is an interdisciplinary research center at the University of Bremen that aims at enabling robots to competently perform human-scale daily manipulation activities such as cooking in an ordinary kitchen. However, to complete such tasks, robots should not only exhibit standard visual perception capabilities such as captioning, detection, localization, and recognition, but also demonstrate cognitive vision, in which all of these capabilities, together with considerable reasoning, are integrated so that a semantically deep understanding of the unstructured and dynamic scene can be reached.

In this thesis, we formulate the underlying perception problem of scene understanding as three subproblems:

  • Object description: we design and train a deep convolutional neural network to provide an end-to-end dense description of the objects in the scene.
  • Relationship description: we design and train a relational neural network for computing relationships among objects in the scene. Relationships essentially encode scale-invariant relative positioning (on-the, in-the, left-of, under-the, ...) as well as composition (has-a). The network takes as inputs the code of the input image (referential) and the codes (mask with local information + pose) of two objects in the image, then outputs the most likely relationship between the two objects. Note that this can be regarded as a soft, more semantic variant of the hard object pose estimation performed by the object describer mentioned above.
  • Visual Question Answering: allowing the robot control program to access only the information it needs eases programming, improves real-time behavior, and makes the control program more independent of the perception system and therefore more generic. Moreover, answering structured queries based on structured and complete information (a scene graph) is more robust and efficient than brute-force approaches that take unstructured queries as input and produce unstructured answers.

To achieve good reasoning about scenes, the first two tasks, including the subtasks of the object description task, are integrated into a single multi-task deep convolutional neural network in which training and inference take place end-to-end.

As output, the system returns a scene graph. A scene graph is a directed graph whose nodes and edges respectively encode the descriptions of objects and the relationships among objects in the scene. This scene graph is then used as the informational basis to solve the third task. The figure below illustrates the concept.

<p align="center"> <img src="images/Concept.png"/> </p>
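As a concrete illustration, the scene-graph output and the structured queries it supports can be sketched in a few lines of Python. The class, field, and relation names here are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of a scene graph: nodes carry object descriptions,
# edges carry relationships. All names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    obj_id: int
    category: str     # e.g. "cup"
    color: str        # e.g. "red"
    pose: tuple       # (x, y, z) in the camera frame

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # obj_id -> ObjectNode
    edges: list = field(default_factory=list)   # (subject_id, relation, object_id)

    def add_object(self, node):
        self.nodes[node.obj_id] = node

    def relate(self, subj_id, relation, obj_id):
        self.edges.append((subj_id, relation, obj_id))

    def query(self, relation, obj_id):
        """Structured query: which objects stand in `relation` to `obj_id`?"""
        return [s for (s, r, o) in self.edges if r == relation and o == obj_id]

g = SceneGraph()
g.add_object(ObjectNode(1, "cup", "red", (0.2, 0.1, 0.9)))
g.add_object(ObjectNode(2, "table", "brown", (0.0, 0.0, 1.2)))
g.relate(1, "on-the", 2)
print(g.query("on-the", 2))   # -> [1]
```

Answering a control program's query then reduces to graph lookups rather than re-running perception on unstructured input.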

1. Contribution

The contribution of this thesis is manifold:

  • Design and implementation of a kitchen-activity-related dataset synthesizer: it defines a scene representation, termed a scene graph, for autonomous manipulation robots, then generates robot scene images and annotates them with respect to this representation.

  • Synthesis of a large set of kitchen-activity-related RGBD images, annotated with corresponding scene graphs and augmented with real images.

  • Design and implementation of a unified deep learning model, coined RobotVQA, that takes a scene RGB(D) image as input and outputs the corresponding scene graph. RobotVQA stands for Visual Question Answering for Robots. We demonstrate the transferability of RobotVQA's knowledge from virtual to real worlds and its suitability for robot control programs.

The following figure graphically illustrates and summarizes our contributions.

contribution

2. Typical Scene

The following figure briefly illustrates the concept of a scene graph. The system operates in two modes: either with depth information or without.

Objects and Relationships description

3. Multi-Task Deep Convolutional Neural Network

Our model, from a broader view, works as follows:

Objects and Relationships description

5. Frameworks

We make use of the following Frameworks:

  • Python libraries, LabelMe, Unreal Engine and UnrealCV: to build the dataset
  • Python libraries, TensorFlow and Caffe: to build the deep convolutional and relational networks, train them, and run inference

6. Dataset

We provide a large, rich dataset of around 75,000 synthetic and 1,650 real annotated images collected while simulating kitchen activities. The dataset can be found at RobotVQA dataset, and the structure of the visual dataset at dataset's structure. This file specifies in depth the structure of objects in the scene as well as the image and information types needed to learn that structure.

To build such a huge dataset, we model the environment of the target robot in Unreal Engine 4.16, a photorealistic game engine that thus allows efficient transfer learning. We then wrote software for end-to-end construction of the dataset, from automatic camera navigation in the virtual environment to the saving of scene images and annotations. To augment the annotations of the collected images with relationships, we built a flexible annotation tool; its use is compulsory when dealing with some external images. Our annotation tools can be found at automatic dataset generator and manual dataset generator. An example of annotation can be downloaded from example of annotation (zip file).
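To make the idea of a scene-graph annotation concrete, a single annotated frame might look roughly like the following. Every field name here is a made-up illustration, not the actual RobotVQA dataset schema (see the dataset's structure file for that).

```python
# Hypothetical shape of one annotated frame; field names are illustrative
# and do NOT reflect the real RobotVQA dataset schema.
import json

annotation = {
    "image": "scene_00042_rgb.png",
    "depth": "scene_00042_depth.png",
    "objects": [
        {"id": 1, "category": "mug", "color": "blue",
         "pose": {"position": [0.31, -0.05, 0.78],
                  "orientation": [0.0, 0.0, 0.0, 1.0]}},
        {"id": 2, "category": "tray", "color": "gray",
         "pose": {"position": [0.30, 0.00, 0.80],
                  "orientation": [0.0, 0.0, 0.0, 1.0]}},
    ],
    "relationships": [[1, "on-the", 2]],
}

# Round-trip through JSON, as such annotations are typically stored on disk.
restored = json.loads(json.dumps(annotation))
print(restored["relationships"][0])   # -> [1, "on-the", 2]
```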

The picture below shows how realistic the Unreal scene can be:

<p align="center"> <img src="images/sceneRealisticness1.png"/> </p>

The following images depict the annotation of a typical scene from objects to relationships description:

Typical scene

Objects in the scene

The following image gives an overview of the collected dataset.

<p align="center"> <img src="images/dataset5.png"/> </p>

7. Object Color Estimation: Learning vs Engineering

In this thesis, we semantically select as many color classes as needed, namely the basic/standard colors defined by the ISCC–NBS system, to colorfully label the robot environment. These colors are red, blue, green, yellow, ...

Engineering approach: we assume that pixel colors are normally distributed, as usually seems to be the case for natural images (Gaussian mixtures). We find the most frequent pixel color (RGB) and compute the nearest color class to it; this class is then taken as the color of the object. By modeling colors as 3D vectors/points in the Lab color space, we define a consistent distance between two colors that captures the human visual sense of color similarity and dissimilarity.

  • Quantitatively more precise than human vision, but fails to capture quality: too few qualitative classes (only 12) and very low-level features (Gaussian mean of pixel values)
  • Very sensitive to noise: shadows, very strong mixtures of colors
  • Very simple and fast
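The engineering approach above can be sketched as follows: convert the dominant RGB pixel color to Lab and pick the class whose anchor is nearest by Euclidean (CIE76) distance. The anchor values below are rough RGB choices for illustration, not the exact ISCC–NBS class definitions.

```python
# Sketch of the engineering approach: map an RGB color to the nearest
# basic color class via Euclidean (CIE76) distance in Lab space.
import math

def srgb_to_lab(rgb):
    # sRGB (0-255) -> linear RGB -> XYZ (D65 white point) -> CIE Lab
    lin = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
           for c in (v / 255.0 for v in rgb)]
    x = 0.4124 * lin[0] + 0.3576 * lin[1] + 0.1805 * lin[2]
    y = 0.2126 * lin[0] + 0.7152 * lin[1] + 0.0722 * lin[2]
    z = 0.0193 * lin[0] + 0.1192 * lin[1] + 0.9505 * lin[2]
    def f(t):
        return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16 / 116
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

# Illustrative anchors only; the thesis uses the 12 ISCC-NBS basic colors.
CLASSES = {"red": (255, 0, 0), "green": (0, 128, 0), "blue": (0, 0, 255),
           "yellow": (255, 255, 0), "white": (255, 255, 255), "black": (0, 0, 0)}

def nearest_color_class(rgb):
    lab = srgb_to_lab(rgb)
    return min(CLASSES, key=lambda name: math.dist(lab, srgb_to_lab(CLASSES[name])))

print(nearest_color_class((200, 30, 40)))   # -> red
```

In practice the input would be the most frequent pixel color inside the object's mask, which is exactly where the approach's sensitivity to shadows and color mixtures comes from.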

Learning approach: by building the dataset for learning, we objectively assign color classes to objects; after training, the network is able to qualitatively assess the color of objects in the scene.

  • Very powerful: end-to-end rational color estimation based only on an objective description of the data
  • Qualitatively matches human vision (color constancy)
  • More complicated and costlier: objective description of the data, network design, training and estimation

8. Object Pose Estimation: Position, Orientation and Cuboid

To the best of our knowledge, we are the first to design an end-to-end object pose estimator based on deep learning. From Unreal Engine, we sample images annotated with object and camera poses within the virtual world's coordinate system. We then refine the annotation by mapping the objects' poses from the world coordinate system into the camera coordinate system. The end-to-end object pose estimator takes an image as input and outputs the poses of the objects in the image. Furthermore, the estimator also outputs the minimal cuboid enclosing each object.
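The world-to-camera refinement described above amounts to composing rigid transforms: T_cam_obj = inverse(T_world_cam) · T_world_obj. A minimal pure-Python sketch follows, with made-up poses and identity rotations for brevity.

```python
# Re-express an object pose given in world coordinates in the camera frame:
#   T_cam_obj = inverse(T_world_cam) @ T_world_obj
# Transforms are 4x4 homogeneous matrices as nested row-major lists.

def rigid_inverse(T):
    # Inverse of a rigid transform: R -> R^T, t -> -R^T t
    R = [[T[i][j] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[0] + [new_t[0]], Rt[1] + [new_t[1]], Rt[2] + [new_t[2]],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Camera at (1, 0, 0) in the world, axes aligned with the world's (made up).
T_world_cam = [[1, 0, 0, 1.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1]]
# Object at (1, 0, 2) in the world (made up).
T_world_obj = [[1, 0, 0, 1.0], [0, 1, 0, 0.0], [0, 0, 1, 2.0], [0, 0, 0, 1]]

T_cam_obj = matmul(rigid_inverse(T_world_cam), T_world_obj)
print([row[3] for row in T_cam_obj[:3]])   # object position in the camera frame
```

With a nontrivial camera rotation, the same two functions apply unchanged, which is why the refinement can run automatically over the whole synthetic dataset.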

A brief comparison of our learning approach to other traditional approaches on pose estimation follows:

Traditional approach: the state of the art, such as PCL, requires a lot of engineering:

  • A 3D model for each object and explicit feature extraction, done almost manually
  • Camera calibration
  • Clustering-based scene segmentation, which imposes restrictions on the scene configuration (colorfulness, sparseness, ...)

Learning approach:

  • Very natural: based only on data
  • No engineering effort: end to end
  • Almost no restrictions on the scene configuration (colorfulness, sparseness)
  • Huge dataset required: not a problem, since we generate it almost automatically

The following figure illustrates how we compute the objects' poses for annotation/learning:

Pose estimation

Given that the intrinsic matrix of the dataset-sampling camera might differ from that of the robot camera, we evaluate the intrinsic matrices R1 and R2 of the sampling camera and the robot camera respectively once, and map a position (X,Y,Z) output by the network as inverse(R2)·R1·(X,Y,Z). The following figure depicts a demonstration of the camera projection.

Camera Projection
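The intrinsics correction above, inverse(R2)·R1·(X,Y,Z), can be sketched directly; the focal lengths and principal points below are invented values for illustration, and an upper-triangular pinhole intrinsic matrix is assumed so its inverse has a closed form.

```python
# Map a position predicted under the sampling camera's intrinsics R1 into
# the robot camera's frame as inverse(R2) @ R1 @ (X, Y, Z).
# All matrix values are made up for illustration.

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def pinhole_inverse(K):
    # Closed-form inverse of an intrinsic matrix [[fx,0,cx],[0,fy,cy],[0,0,1]]
    fx, cx, fy, cy = K[0][0], K[0][2], K[1][1], K[1][2]
    return [[1 / fx, 0.0, -cx / fx], [0.0, 1 / fy, -cy / fy], [0.0, 0.0, 1.0]]

R1 = [[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]]  # sampling camera
R2 = [[500.0, 0.0, 310.0], [0.0, 500.0, 230.0], [0.0, 0.0, 1.0]]  # robot camera

def remap_position(p):
    return mat_vec(pinhole_inverse(R2), mat_vec(R1, p))

print(remap_position([0.1, 0.2, 1.0]))
```

Since both matrices are fixed per camera, the product inverse(R2)·R1 only needs to be computed once and can then be applied to every network output.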

9. Object Shape Recognition

As far as 3D-shape recognition in Computer Vision is concerned, our main contribution in this thesis is the setup of a pure learning
