DSeg
Invariant Superpixel Features for Object Detection & Localization
by Sam Johnson
Introducing DSeg
DSeg is a novel superpixel-based feature designed to be useful in semantic segmentation, object detection, and pose-estimation/localization tasks. DSeg extracts the shape and color of object-positive and object-negative superpixels, and encodes this information in a compact feature vector.
While this project is only in its infancy, DSeg has already shown promising results on superpixel-wise object detection (is superpixel A part of an instance of object class B?). Below we cover the implementation details of DSeg, present preliminary experimental results, and discuss future work and additional enhancements.
Motivation
Many who work or dabble in computer vision have noticed the intrinsic, deep coupling between image segmentation and higher-level tasks, such as detection, classification, localization, or more generally, whole-image understanding. The interdependence between these crucial computer vision tasks and segmentation is undeniable, and often frustrating. A chicken-and-egg scenario quickly arises:
How can you properly find the outline of an object without knowing what it is?
On the other hand, how can you begin to understand the context and identity of an object, before determining its extents?
Superpixel Segmentation
One "neat trick" that partially alleviates this issue is superpixel segmentation. Superpixels arise from a deliberate over-segmentation of the original image. Once the superpixel algorithm has been run, virtually all major and most minor edges and boundaries in the image will fall on some boundary between two superpixels. This exposes an intriguing trade-off: after running the superpixel algorithm, we can rest assured that virtually all edges and boundaries of interest have been captured somewhere in our superpixels; on the other hand, we also have to deal with a large number of minor or spurious edges that the over-segmentation introduces.
Properties of Superpixels
Superpixels are extremely powerful because they often fall on important boundaries within the image, and tend to take on abnormal or unique shapes when they contain salient object features. For example, in the superpixel segmentation of the below aircraft carrier, notice that the most salient object features (the radar array, anchor, and railing of the carrier) are almost perfectly segmented by highly unique superpixels.

Fig. 1: A superpixel segmented aircraft carrier. Salient features are revealed by the superpixels.
From this example, several intuitions arise:
- In general, superpixels tend to hug salient features and edges in the image
- "Weird" superpixels that differ greatly from the standard hexagonal (or rectangular) shape are more likely to segment points of interest
- "Boring" superpixels are part of something larger than themselves, and can often be combined with adjacent superpixels
Advantages of Synthetic Data
When writing object detectors, it helps to have images that fully describe the broad range of possible poses of the object in question. With real-world images, this is seldom possible, however the advent of readily available, high quality 3D models, and the maturity of modern 3D rendering pipelines has created ample opportunity to use synthetic renderings of 3D models to generate training data that:
- Covers the full range of poses of the object in question, with the ability to render arbitrary poses on the fly
- Can be tuned to vary scale, rotation, translation, lighting settings, and even occlusion
- Is already segmented (at every possible pose) without the need for human labeling
- Is easy to generate, no humans required once a 3D model is made

Fig. 2: 3D rendering of a Hamina-class missile boat, one of the models used to train DSeg
DSeg Feature Extraction Pipeline
In this section we outline the details of the DSeg feature extraction pipeline and comment on the major design decisions and possible future work and/or alterations.
Image Data Sources
DSeg requires a synthetic data source (i.e. a 3D model of the object that is to be learned) to generate class-positive training samples, and a "real-world" image source, preferably scene images, to generate class-negative training samples. That is, we use synthetic data to generate positive examples of our object class's superpixels, and we use real-world data to generate negative examples of our object class's superpixels.
For this project, we used the high quality ships 3D model dataset from [1], and SUN2012 [2] for real-world scene images. In particular, we randomly sampled 2000 images from [2], and for each 3D model in [1], we rendered 27,000 auto-scaled, auto-cropped poses uniformly distributed across the space of possible poses at 128 x 128, at a fixed camera distance, and with slightly variable lighting. The rendering of these images is covered in greater depth in [1].
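The exact pose-sampling scheme is detailed in [1]. As an illustration only (this is an assumption, not necessarily the scheme used), one way to obtain exactly 27,000 uniformly distributed poses is a 30 × 30 × 30 grid of Euler angles:

```python
import itertools

def pose_grid(steps_per_axis=30):
    """Enumerate Euler-angle triples (yaw, pitch, roll) on a uniform grid.

    steps_per_axis=30 gives 30**3 = 27,000 poses, matching the number of
    renderings per model used to train DSeg.
    """
    step = 360.0 / steps_per_axis
    return [
        (i * step, j * step, k * step)
        for i, j, k in itertools.product(range(steps_per_axis), repeat=3)
    ]

poses = pose_grid()
```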
DSeg Step 1: Segmentation
SLIC [3] is the state-of-the-art method for performing superpixel segmentation, and gSLICr [4] is a CUDA implementation of SLIC that can perform superpixel segmentation at over 250 Hz. Ideally we would use gSLICr in our pipeline; however, significant modifications would be needed to extract the pixel boundaries of the generated superpixels from gSLICr, so we settled on the slower VLFeat [5] implementation of SLIC.
We run SLIC on each image in our training set with a constant regularization parameter of 4000 (so as to avoid over-sensitivity to illumination). We do this over a range of 25 different grid sizes to ensure we capture all of the possible superpixel shapes that could occur within our object (or couldn't occur, in the case of negative examples). For positive examples, it is necessary to ignore superpixels that are not part of the object, however this is easy since we rendered translucent PNGs and merely need to ignore superpixels with transparent pixels.
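Rejecting non-object superpixels from a rendered RGBA image can be sketched as follows (the label map would come from SLIC; the function and variable names here are illustrative, not from the DSeg codebase):

```python
def object_positive_labels(labels, alpha):
    """Return the set of superpixel labels lying fully inside the object.

    labels: 2D list of superpixel ids (one per pixel), e.g. from SLIC.
    alpha:  2D list of alpha values; 0 means a fully transparent pixel.
    Any superpixel containing at least one transparent pixel is discarded,
    since transparency marks background in the rendered PNGs.
    """
    all_ids, rejected = set(), set()
    for row_labels, row_alpha in zip(labels, alpha):
        for label, a in zip(row_labels, row_alpha):
            all_ids.add(label)
            if a == 0:
                rejected.add(label)
    return all_ids - rejected
```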

Fig. 3: Superpixel segmentation is performed at multiple scales
For each valid superpixel found in the current image, we will eventually create a feature vector. Before passing these superpixel features on to the next pipeline stage, we calculate the closest-to-average RGB color for each superpixel. That is, for each superpixel we find the RGB color from its color palette that is closest to the mean RGB color across the entire superpixel. We measure color distance simply using 3D Euclidean distance. Future work might explore LAB or other color spaces.
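The closest-to-average color computation is straightforward; a minimal sketch (function name illustrative):

```python
def closest_to_average_rgb(pixels):
    """Pick the RGB color from a superpixel's palette nearest its mean color.

    pixels: list of (r, g, b) tuples belonging to one superpixel.
    Distance is plain 3D Euclidean distance in RGB space (squared distance
    gives the same minimizer, so we skip the square root).
    """
    n = len(pixels)
    mean = tuple(sum(p[c] for p in pixels) / n for c in range(3))
    return min(pixels, key=lambda p: sum((p[c] - mean[c]) ** 2 for c in range(3)))
```

Unlike the raw mean, this always returns a color that actually occurs in the superpixel.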
In practice, this stage yields approximately 700 features per positive input image, depending on the model and viewing angle.
DSeg Step 2: Scale Invariance via ResizeContour
To obtain scale invariance for our features, we must normalize all of our superpixel features to the same square grid size. For this project, we used a 16 x 16 pixel grid.
A naive approach to this normalization phase would be to use an off-the-shelf image re-sizing routine, such as bicubic or bilinear interpolation. In practice, we found that these algorithms, along with nearest neighbor and other alternatives, all result in either excessive blurring or pixelation when up-scaling very small features (which are quite common in DSeg, as most features are approximately 6 x 6 pixels in size). This blurring or pixelation means that a neural network will be able to tell that the feature in question was up-scaled. The purpose of scale invariance is to hide all evidence that any up-scaling or down-scaling has taken place, so this is undesirable.

Fig. 4: Comparison of image re-sizing algorithms up-scaling from 16 x 16
ResizeContour
ResizeContour is a simple, novel algorithm for up-scaling single color threshold images with neither the pixelation introduced by nearest neighbor, nor the blurring introduced by bilinear interpolation. ResizeContour is how we up-scale superpixels (especially very small ones) in DSeg. The algorithm itself is relatively simple:
- Project each pixel from the original image to its new location in the larger image
- Find all the black (filled in) pixels among the 8 pixels immediately surrounding our pixel in the original image
- Project the coordinates of these dark pixels to the new larger image
- Perform a polygon fill operation using this list of projected points
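The steps above can be sketched as follows. This is a simplified reconstruction, not the original implementation: in particular, the polygon fill here fills the convex hull of each pixel's projected neighborhood, and degenerate polygons (fewer than three hull vertices) are skipped, which reproduces the erasure of one-pixel lines discussed below.

```python
def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in consistent order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def _inside(hull, p):
    return all(_cross(hull[i], hull[(i + 1) % len(hull)], p) >= 0
               for i in range(len(hull)))

def resize_contour(img, new_h, new_w):
    """Up-scale a binary (thresholded) image without blur or pixelation.

    For every filled pixel: project it and its filled 8-neighbors into the
    target image, then fill the polygon spanned by the projected points.
    """
    h, w = len(img), len(img[0])
    sy, sx = (new_h - 1) / (h - 1), (new_w - 1) / (w - 1)
    out = [[0] * new_w for _ in range(new_h)]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            # Projected coordinates of this pixel and its filled 8-neighbors.
            pts = [(round((y + dy) * sy), round((x + dx) * sx))
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if 0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]]
            hull = _convex_hull(pts)
            if len(hull) < 3:  # isolated pixels / one-pixel lines vanish
                continue
            for py in range(min(p[0] for p in hull), max(p[0] for p in hull) + 1):
                for px in range(min(p[1] for p in hull), max(p[1] for p in hull) + 1):
                    if _inside(hull, (py, px)):
                        out[py][px] = 1
    return out
```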

Fig. 5: ResizeContour in action. The + signs represent pixels from the original superpixel.
Because we fill in overlapping polygons over each pixel in the image, this approach only works if the entire image is of one uniform color. A side effect of this algorithm is that one-pixel lines are usually erased from the resulting image, but this is actually desirable, as it makes our features more block-like and more resistant to artifacts in the pre-segmented image.
Once the superpixel is re-sized to our decided canonical form, our feature vector is ready. We store the thresholded values of the pixel grid, concatenated with three floats representing the RGB value of the closest-to-average color. For a 16 x 16 standard superpixel size, this results in a 16 * 16 + 3 = 259 dimensional vector.
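Assembling the final vector is then a simple concatenation (function name illustrative):

```python
def dseg_feature_vector(grid, rgb):
    """Concatenate a 16 x 16 thresholded shape grid with the
    closest-to-average RGB color: 16 * 16 + 3 = 259 dimensions.

    grid: 16x16 nested list of 0/1 values (output of ResizeContour).
    rgb:  (r, g, b) floats of the superpixel's closest-to-average color.
    """
    assert len(grid) == 16 and all(len(row) == 16 for row in grid)
    return [float(v) for row in grid for v in row] + [float(c) for c in rgb]
```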
Evaluation
To evaluate the efficacy of DSeg as a general purpose feature, we constructed a simple experiment that trains (using RPROP) a standard feed-forward Multi-Layer Perceptron to take a single DSeg feature as input and output a prediction of whether that superpixel belongs to the target object class.
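A minimal stand-in for this setup, using plain gradient descent instead of RPROP and random vectors in place of real DSeg features (both are assumptions of this sketch, not the original experiment), could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 259-dimensional "features" with a learnable signal.
X = rng.normal(size=(64, 259))
y = (X[:, 0] > 0).astype(float).reshape(-1, 1)

# One hidden layer, as in a standard feed-forward MLP.
W1 = rng.normal(scale=0.1, size=(259, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1));   b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output
    return h, p

def bce(p, y):
    p = np.clip(p, 1e-7, 1 - 1e-7)  # guard the logs
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

lr = 0.5
_, p0 = forward(X)
initial_loss = bce(p0, y)
for _ in range(200):
    h, p = forward(X)
    d_logits = (p - y) / len(X)           # dLoss/d(pre-sigmoid logits)
    dW2 = h.T @ d_logits; db2 = d_logits.sum(0)
    dh = d_logits @ W2.T * (1 - h ** 2)   # back-propagate through tanh
    dW1 = X.T @ dh;       db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
_, p = forward(X)
final_loss = bce(p, y)
```

RPROP adapts a per-weight step size from gradient sign changes rather than using a fixed learning rate, but the network structure and objective are the same.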
