
VSDataset


Install / Use

/learn @RauldeQueirozMendes/VSDataset

Visual Servoing dataset

E. G. Ribeiro, R. Q. Mendes and V. Grassi Jr

If you find this dataset useful in your research please consider citing our work:

Real-Time Deep Learning Approach to Visual Servo Control and Grasp Detection for Autonomous Robotic Manipulation

Elsevier's Robotics and Autonomous Systems (2021)

DOI: 10.1016/j.robot.2021.103757

:pencil2: Description

Summary

Dataset for visual servoing (VS) and camera pose estimation. 
The images were captured by a manipulator robot with an eye-in-hand camera in different poses, and each label is the corresponding camera pose. 
The absolute camera pose can be used without any pre-processing of the dataset, and the relative pose between images can be obtained through matrix transformations. 
The dataset can also provide the camera's 6-DoF velocities, so that visual servo control between two images can be performed. 
These velocities are already computed with the classical PBVS control law and made available in the VSLabels.txt file.

Each folder contains images of one object/scene in different poses. The Labels.txt file within each folder maps each image to its camera pose: (I, [x, y, z, α, β, γ]). These files make the dataset usable for camera pose estimation problems, whether relative or absolute.

The VSLabels.txt file in the root concerns the tuples (I<sub>d</sub>, I<sub>c</sub>, v<sub>c</sub>) that enable end-to-end training of supervised visual servoing models.

:gear: Construction of the dataset

We tested the visual servoing dataset with Python 3.6.8, Keras 2.2.4 and TensorFlow 1.13.1 on Ubuntu 16.04, using an Intel Core i7-7700 processor (4 cores / 8 threads at 3.6 GHz) and an Nvidia GeForce GTX 1050 Ti GPU.

To train a model that can perform visual servoing on different target objects, without the need to design features, it is necessary to have a dataset that efficiently captures the attributes of the environment in which the robot operates, is representative of the VS task, and is diverse enough to ensure generalization. To this end, the data are collected by a Kinova Gen3 robot in a way that approximates a self-supervised approach. Human intervention is limited to assembling the workspace and setting up the robot, which involves determining a reference pose from which the robot captures the images and labels them.

The robot is programmed to assume different poses drawn from a Gaussian distribution centered on the reference pose, with different standard deviations. This approach is inspired by the work of Bateux et al. (1), who do the same, but using virtual cameras and homographies instead of a real environment. The reference pose (mean of the distribution) and the sets of Standard Deviations (SD) used by the robot are presented in the following table:

<table>
  <caption>Gaussian distribution parameters used to build the VS dataset</caption>
  <thead>
    <tr>
      <th colspan="2" rowspan="2">Dimension</th>
      <th rowspan="2">Mean</th>
      <th colspan="3">Standard Deviation</th>
    </tr>
    <tr>
      <td>High</td>
      <td>Mid</td>
      <td>Low</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan=3>Position [m]</td>
      <td>x</td> <td>0.228</td> <td>0.080</td> <td>0.030</td> <td>0.010</td>
    </tr>
    <tr>
      <td>y</td> <td>0.344</td> <td>0.080</td> <td>0.030</td> <td>0.010</td>
    </tr>
    <tr>
      <td>z</td> <td>0.532</td> <td>0.080</td> <td>0.030</td> <td>0.005</td>
    </tr>
    <tr>
      <td rowspan=3>Orientation [<sup>o</sup>]</td>
      <td>α</td> <td>175.8</td> <td>5.0</td> <td>2.0</td> <td>1.0</td>
    </tr>
    <tr>
      <td>β</td> <td>-5.5</td> <td>5.0</td> <td>2.0</td> <td>1.0</td>
    </tr>
    <tr>
      <td>γ</td> <td>90.0</td> <td>5.0</td> <td>2.0</td> <td>1.0</td>
    </tr>
  </tbody>
</table>
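The sampling procedure can be sketched with NumPy as follows; the grouping below treats the largest deviations in each dimension as the "high" set, the middle values as "mid", and the smallest as "low":

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference pose (mean of the distribution):
# [x, y, z] in metres, [alpha, beta, gamma] in degrees
mean = np.array([0.228, 0.344, 0.532, 175.8, -5.5, 90.0])

# Per-dimension standard deviations from the table, grouped by magnitude
sd = {
    "high": np.array([0.080, 0.080, 0.080, 5.0, 5.0, 5.0]),
    "mid":  np.array([0.030, 0.030, 0.030, 2.0, 2.0, 2.0]),
    "low":  np.array([0.010, 0.010, 0.005, 1.0, 1.0, 1.0]),
}

def sample_poses(level, n):
    """Draw n camera poses around the reference pose for the given SD set."""
    return rng.normal(loc=mean, scale=sd[level], size=(n, 6))

poses = sample_poses("high", 100)  # array of shape (100, 6)
```

Each sampled row is one target pose the robot moves to before capturing and labeling an image.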

The SD choices take into account the expected displacement values that the robot must perform during the VS. The images obtained with a high SD help the network understand the changes in image space produced by a large robot displacement. The instances obtained with a low SD allow the error between the reference image and the current one to be reduced when they are very close, giving good steady-state precision. The mid SD values help the network during most of the VS execution. Two dataset instance examples and their respective labels are illustrated in the following figure:

Instance examples from the VS dataset, generated from a Gaussian distribution with mean at the reference pose [x, y, z, α, β, γ] = (0.228m, 0.344m, 0.532m, 175.8<sup>o</sup>, -5.5<sup>o</sup>, 90.0<sup>o</sup>) - Source: Author.

(a) Image taken from the camera in pose (0.326m, 0.356m, 0.503m, 178.0<sup>o</sup>, 1.1<sup>o</sup>, 91.5<sup>o</sup>)

(b) Image taken from the camera in pose (0.258m, 0.207m, 0.402m, -175.8<sup>o</sup>, -23.0<sup>o</sup>, 87.2<sup>o</sup>)

After obtaining the data, the dataset is structured in the form (I, [x, y, z, α, β, γ]), in which I is the image and [x, y, z, α, β, γ] is the camera pose at which it was captured. To use this dataset to train a Position-Based VS neural network, two images and the relative pose between them must be considered. Each instance of the processed dataset then takes the form (I<sub>d</sub>, I<sub>c</sub>, <sup>d</sup>H<sub>c</sub>), in which I<sub>d</sub> is a random instance chosen as the desired image, I<sub>c</sub> is another instance chosen as the current image, and <sup>d</sup>H<sub>c</sub> is the transformation that relates the current camera frame to the desired one. This is done by expressing each pose, represented by a translation and Euler angles, as homogeneous transformation matrices <sup>0</sup>H<sub>d</sub> and <sup>0</sup>H<sub>c</sub>, and then obtaining <sup>d</sup>H<sub>c</sub> = <sup>0</sup>H<sub>d</sub><sup>-1</sup> <sup>0</sup>H<sub>c</sub>.
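This construction can be sketched with NumPy. The Euler convention used below (R = R<sub>z</sub>(γ)R<sub>y</sub>(β)R<sub>x</sub>(α)) is an assumption; the dataset's actual convention should be checked.

```python
import numpy as np

def pose_to_matrix(pose):
    """Build the 4x4 homogeneous transform 0H from [x, y, z, alpha, beta, gamma].

    Assumes R = Rz(gamma) @ Ry(beta) @ Rx(alpha) with angles in degrees;
    verify the Euler convention against the dataset documentation.
    """
    x, y, z, a, b, g = pose
    a, b, g = np.radians([a, b, g])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a),  np.cos(a)]])
    Ry = np.array([[ np.cos(b), 0, np.sin(b)],
                   [0, 1, 0],
                   [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(g), -np.sin(g), 0],
                   [np.sin(g),  np.cos(g), 0],
                   [0, 0, 1]])
    H = np.eye(4)
    H[:3, :3] = Rz @ Ry @ Rx
    H[:3, 3] = [x, y, z]
    return H

def relative_transform(pose_d, pose_c):
    """dHc = (0Hd)^-1 @ 0Hc: the current frame expressed in the desired frame."""
    return np.linalg.inv(pose_to_matrix(pose_d)) @ pose_to_matrix(pose_c)
```

Pairing any desired/current image from the same scene through `relative_transform` yields the (I<sub>d</sub>, I<sub>c</sub>, <sup>d</sup>H<sub>c</sub>) instances described above.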

Finally, for the network to act, in fact, as a controller, its prediction should directly be the velocity signal of the camera, i.e. the control signal. So the data structured as (I<sub>d</sub>, I<sub>c</sub>, <sup>d</sup>H<sub>c</sub>) are transformed into (I<sub>d</sub>, I<sub>c</sub>, v<sub>c</sub>), in which v<sub>c</sub> is the proportional camera velocity. The term proportional is used because the λ gain is not considered when determining the labeled velocity and must be tuned a posteriori, during control execution. The velocity v<sub>c</sub> is obtained from <sup>d</sup>H<sub>c</sub> using equations 7, 8 and 13 from the paper (not considering λ).
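Assuming the paper's equations correspond to the classical PBVS law in its standard form (features s = (<sup>d</sup>t<sub>c</sub>, θu), with λ omitted), the velocity label can be sketched as:

```python
import numpy as np

def pbvs_velocity(dHc):
    """Proportional camera velocity from dHc (lambda gain omitted).

    Standard PBVS with s = (d_t_c, theta*u):
      linear part:  v = -R.T @ t   (R, t taken from dHc)
      angular part: w = -theta*u   (axis-angle of R)
    This assumes the paper's equations 7, 8 and 13 match this formulation.
    """
    R = dHc[:3, :3]
    t = dHc[:3, 3]
    # Axis-angle representation theta*u of the rotation R
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        theta_u = np.zeros(3)
    else:
        theta_u = theta / (2.0 * np.sin(theta)) * np.array([
            R[2, 1] - R[1, 2],
            R[0, 2] - R[2, 0],
            R[1, 0] - R[0, 1],
        ])
    return np.concatenate([-R.T @ t, -theta_u])
```

At control time the predicted 6-vector is scaled by the tuned λ gain before being sent to the robot.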

The final number of instances generated for network training and validation is given by the following equations:

<a href="https://www.codecogs.com/eqnedit.php?latex=N_{ins}&space;=&space;\sum&space;_{i=1}^{20}&space;h_il_i&space;&plus;&space;m_il_i&space;&plus;&space;C_{l_{i},2}&space;&plus;&space;C_{m_{i},2}" target="_blank"><img src="https://latex.codecogs.com/svg.latex?N_{ins}&space;=&space;\sum&space;_{i=1}^{20}&space;h_il_i&space;&plus;&space;m_il_i&space;&plus;&space;C_{l_{i},2}&space;&plus;&space;C_{m_{i},2}" title="N_{ins} = \sum _{i=1}^{20} h_il_i + m_il_i + C_{l_{i},2} + C_{m_{i},2}" /></a>

In this equation, N<sub>ins</sub> is the total number of generated instances in the form (I<sub>d</sub>, I<sub>c</sub>, v<sub>c</sub>), i is the considered scene (since I<sub>d</sub> and I<sub>c</sub> must be from the same scene), and h<sub>i</sub>, m<sub>i</sub> and l<sub>i</sub> are, respectively, the numbers of images obtained with high, mid and low standard deviations. C<sub>l<sub>i</sub>,2</sub> and C<sub>m<sub>i</sub>,2</sub> are the numbers of two-image combinations among the images obtained with low and mid standard deviations, respectively, as given by the following equation:

<a href="https://www.codecogs.com/eqnedit.php?latex=C_{n,p}=\frac{n!}{p!(n-p)!}" target="_blank"><img src="https://latex.codecogs.com/svg.latex?C_{n,p}=\frac{n!}{p!(n-p)!}" title="C_{n,p}=\frac{n!}{p!(n-p)!}" /></a>
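The per-scene count and the total over the 20 scenes can be computed directly from these two equations:

```python
from math import comb  # comb(n, p) = n! / (p! * (n - p)!)

def scene_instances(h, m, l):
    """Instances contributed by one scene: h*l + m*l + C(l,2) + C(m,2)."""
    return h * l + m * l + comb(l, 2) + comb(m, 2)

def total_instances(scenes):
    """Sum over all scenes, each given as a (h_i, m_i, l_i) tuple."""
    return sum(scene_instances(h, m, l) for h, m, l in scenes)
```

For example, a scene with 10 high-SD, 5 mid-SD and 4 low-SD images contributes 40 + 20 + 6 + 10 = 76 instances.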
