RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization
<p align="center"> <img src="figures/latency.png" width=70%> <br> The top-1 accuracy is tested on ImageNet-1K, and the latency is measured on an iPhone 12 with iOS 16 across 20 experimental sets. </p>
Mingshu Zhao, Yi Luo, and Yong Ouyang
[arXiv]

Abstract
We introduce RepNeXt, a novel model series that integrates multi-scale feature representations and incorporates both serial and parallel structural reparameterization (SRP) to enhance network depth and width without compromising inference speed. Extensive experiments demonstrate RepNeXt's superiority over current leading lightweight CNNs and ViTs, providing advantageous latency across various vision benchmarks. RepNeXt-M4 matches RepViT-M1.5's 82.3% accuracy on ImageNet within 1.5ms on an iPhone 12, outperforms its AP$^{box}$ by 1.3 on MS-COCO, and reduces parameters by 0.7M.

For example, large-kernel depthwise convolutions can be accelerated by the implicit GEMM algorithm (see DepthWiseConv2dImplicitGEMM of RepLKNet).
Many token mixers can be generalized as a distribute-transform-aggregate process:
| Token Mixer | Distribution | Transforms | Aggregation |
|:-----------:|:------------:|:--------------:|:-----------:|
| ChunkConv | Split | Conv, Identity | Cat |
| CopyConv | Clone | Conv | Cat |
| MixConv | Split | Conv | Cat |
| MHSA | Split | Attn | Cat |
| RepBlock | Clone | Conv | Add |
ChunkConv and CopyConv can be viewed as grouped depthwise convolutions.
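The distribute-transform-aggregate pattern from the table can be sketched in a few lines. The module below is a minimal MixConv-style illustration (the class name and kernel sizes are chosen for the example, not taken from the paper's configuration): split the channels, apply a depthwise convolution with a different kernel size to each chunk, and concatenate.

```python
import torch
import torch.nn as nn


class MixConvSketch(nn.Module):
    """Illustrative distribute (Split) -> transform (Conv) -> aggregate (Cat) mixer."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        chunk = channels // len(kernel_sizes)
        # One depthwise conv per chunk, each with its own receptive field.
        self.convs = nn.ModuleList(
            nn.Conv2d(chunk, chunk, k, padding=k // 2, groups=chunk)
            for k in kernel_sizes
        )

    def forward(self, x):
        chunks = torch.chunk(x, len(self.convs), dim=1)  # distribute
        outs = [conv(c) for conv, c in zip(self.convs, chunks)]  # transform
        return torch.cat(outs, dim=1)  # aggregate


x = torch.randn(1, 16, 8, 8)
y = MixConvSketch(16)(x)  # same shape as the input
```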
- Chunk Conv

```python
import torch
import torch.nn as nn


class ChunkConv(nn.Module):
    def __init__(self, in_channels, bias=True):
        super().__init__()
        self.bias = bias
        in_channels = in_channels // 4
        kwargs = {"in_channels": in_channels, "out_channels": in_channels, "groups": in_channels, "bias": bias}
        self.conv_i = nn.Identity()
        self.conv_s = nn.Conv2d(kernel_size=3, padding=1, **kwargs)
        self.conv_m = nn.Conv2d(kernel_size=7, padding=3, **kwargs)
        self.conv_l = nn.Conv2d(kernel_size=11, padding=5, **kwargs)

    def forward(self, x):
        i, s, m, l = torch.chunk(x, chunks=4, dim=1)
        return torch.cat((self.conv_i(i), self.conv_s(s), self.conv_m(m), self.conv_l(l)), dim=1)

    @torch.no_grad()
    def fuse(self):
        conv_s_w, conv_s_b = self.conv_s.weight, self.conv_s.bias
        conv_m_w, conv_m_b = self.conv_m.weight, self.conv_m.bias
        conv_l_w, conv_l_b = self.conv_l.weight, self.conv_l.bias
        # The identity branch becomes an 11x11 kernel with a single 1 at the center.
        conv_i_w = nn.functional.pad(torch.ones(conv_l_w.shape[0], conv_l_w.shape[1], 1, 1), [5, 5, 5, 5])
        # Zero-pad the 3x3 and 7x7 kernels to 11x11 so all branches share one kernel size.
        conv_s_w = nn.functional.pad(conv_s_w, [4, 4, 4, 4])
        conv_m_w = nn.functional.pad(conv_m_w, [2, 2, 2, 2])
        in_channels = self.conv_l.in_channels * 4
        conv = nn.Conv2d(in_channels, in_channels, kernel_size=11, padding=5, bias=self.bias, groups=in_channels)
        conv.weight.data.copy_(torch.cat((conv_i_w, conv_s_w, conv_m_w, conv_l_w), dim=0))
        if self.bias:
            conv_i_b = torch.zeros_like(conv_s_b)
            conv.bias.data.copy_(torch.cat((conv_i_b, conv_s_b, conv_m_b, conv_l_b), dim=0))
        return conv
```
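The fusion relies on the fact that zero-padding a small depthwise kernel to a larger size, while enlarging the input padding to match, leaves the output unchanged. A minimal self-contained check of that property (the channel count and input size here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
c = 8
# A 3x3 depthwise conv and its 11x11 zero-padded counterpart.
small = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False)
large = nn.Conv2d(c, c, kernel_size=11, padding=5, groups=c, bias=False)
with torch.no_grad():
    # Pad the 3x3 kernel with 4 zeros on each side: 3 + 4 + 4 = 11.
    large.weight.copy_(nn.functional.pad(small.weight, [4, 4, 4, 4]))

x = torch.randn(2, c, 16, 16)
# The padded zeros contribute nothing, so both convs give the same result.
assert torch.allclose(small(x), large(x), atol=1e-5)
```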
- Copy Conv

```python
import torch
import torch.nn as nn


class CopyConv(nn.Module):
    def __init__(self, in_channels, bias=True):
        super().__init__()
        self.bias = bias
        kwargs = {"in_channels": in_channels, "out_channels": in_channels, "groups": in_channels, "bias": bias, "stride": 2}
        self.conv_s = nn.Conv2d(kernel_size=3, padding=1, **kwargs)
        self.conv_l = nn.Conv2d(kernel_size=7, padding=3, **kwargs)

    def forward(self, x):
        B, C, H, W = x.shape
        s, l = self.conv_s(x), self.conv_l(x)
        # Interleave the two branch outputs channel-wise: s0, l0, s1, l1, ...
        return torch.stack((s, l), dim=2).reshape(B, C * 2, H // 2, W // 2)

    @torch.no_grad()
    def fuse(self):
        conv_s_w, conv_s_b = self.conv_s.weight, self.conv_s.bias
        conv_l_w, conv_l_b = self.conv_l.weight, self.conv_l.bias
        # Zero-pad the 3x3 kernel to 7x7 so both branches share one kernel size.
        conv_s_w = nn.functional.pad(conv_s_w, [2, 2, 2, 2])
        in_channels = self.conv_l.in_channels
        conv = nn.Conv2d(in_channels, in_channels * 2, kernel_size=7, padding=3, bias=self.bias, stride=self.conv_l.stride, groups=in_channels)
        # Stack the kernels in the same interleaved order as the forward pass.
        conv.weight.data.copy_(torch.stack((conv_s_w, conv_l_w), dim=1).reshape(conv.weight.shape))
        if self.bias:
            conv.bias.data.copy_(torch.stack((conv_s_b, conv_l_b), dim=1).reshape(conv.bias.shape))
        return conv
```
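The stack-then-reshape trick in CopyConv's forward pass interleaves the two branch outputs channel-wise, which is exactly the output ordering of a grouped convolution with two output channels per group. A small self-contained demonstration of the interleaving (the constant tensors stand in for the branch outputs):

```python
import torch

B, C, H, W = 1, 3, 4, 4
s = torch.zeros(B, C, H, W)  # stand-in for the 3x3 branch output
l = torch.ones(B, C, H, W)   # stand-in for the 7x7 branch output

# stack(dim=2) gives (B, C, 2, H, W); reshape flattens it to (B, 2C, H, W)
# with channels ordered s0, l0, s1, l1, ...
out = torch.stack((s, l), dim=2).reshape(B, C * 2, H, W)

assert torch.equal(out[:, 0::2], s)  # even channels come from the small branch
assert torch.equal(out[:, 1::2], l)  # odd channels come from the large branch
```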
In summary, if we focus solely on the simplicity of the model's overall architecture and disregard its efficiency and parameter count, the model can ultimately be consolidated into the single-branch structure shown in the figure below:

UPDATES 🔥
- 2024/10/13: Added ImageNet-1K results for the single-branch equivalent forms of M0-M2 using StarNet's training recipe.
- 2024/09/19: Added M0-M2 ImageNet-1K results using StarNet's training recipe (distilled), hitting 80.6% top-1 accuracy within 1ms on an iPhone 12.
- 2024/09/08: Added the RepNext-M0 ImageNet-1K result using StarNet's training recipe, achieving 73.8% top-1 accuracy without distillation.
- 2024/08/26: RepNext-M0 (distilled) has been released, achieving 74.2% top-1 accuracy within 0.6ms on an iPhone 12.
- 2024/08/23: Finished compact model (M0) ImageNet-1K experiments.
- 2024/07/23: Updated the readme with the further simplified model structure.
- 2024/06/25: Uploaded checkpoints and training logs of RepNext-M1 - M5.
Classification on ImageNet-1K
Models under the RepViT training strategy
We report the top-1 accuracy on ImageNet-1K with and without distillation using the same training strategy as RepViT.
| Model | Top-1 (distill) / Top-1 | #params | MACs | Latency | Ckpt | Core ML | Log |
|:------|:-----------------------:|:-------:|:----:|:-------:|:----:|:-------:|:---:|
| M0 | 74.2 / 72.6 | 2.3M | 0.4G | 0.59ms | fused 300e / 300e | 300e | distill 300e / 300e |
| M1 | 78.8 / 77.5 | 4.8M | 0.8G | 0.86ms | fused 300e / 300e | 300e | distill 300e / 300e |
| M2 | 80.1 / 78.9 | 6.5M | 1.1G | 1.00ms | fused 300e / 300e | 300e | distill 300e / 300e |
| M3 | 80.7 \