MobileNet-V4 (now in timm)
History
Five years ago, the MobileNet-V3 (https://arxiv.org/abs/1905.02244) and EfficientNet (https://arxiv.org/abs/1905.11946) vision models were introduced by Google researchers. At a very early stage in timm's development, I set out to reproduce these model architectures and port the originally released TensorFlow model weights into PyTorch.
Both of these model architectures were based on the Inverted Residual Block (also called Inverted Bottleneck) that was introduced in the earlier MobileNet-V2 model. The IR consists of a 1x1 pointwise (PW) expansion convolution, followed by a depthwise (DW) convolution (3x3 or 5x5), and finally a 1x1 pointwise linear (PWL, no activation) convolution in the residual path. Unlike most other residual blocks at the time, the wide part is in the middle of the block at the depthwise conv, instead of at the start/end, and the output of the block has no activation (linear output), hence 'Inverted'.
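To make the "wide in the middle" point concrete, here is a small back-of-the-envelope sketch that counts convolution weights for one inverted residual block. The helper names, channel width (64), and expansion ratio (4) are illustrative assumptions, not values taken from any of the papers or from timm, and normalization layers and biases are ignored.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a bias-free 2D convolution."""
    return (c_in // groups) * c_out * kernel * kernel


def inverted_residual_params(c_in, c_out, expand_ratio=4, dw_kernel=3):
    """Per-layer weight counts for 1x1 PW expansion -> kxk DW -> 1x1 PW linear."""
    c_mid = c_in * expand_ratio  # the wide part sits in the middle of the block
    return {
        "pw_expand": conv_params(c_in, c_mid, 1),
        "dw": conv_params(c_mid, c_mid, dw_kernel, groups=c_mid),
        "pw_linear": conv_params(c_mid, c_out, 1),
    }


ir = inverted_residual_params(c_in=64, c_out=64, expand_ratio=4, dw_kernel=3)
ir_total = sum(ir.values())

# For comparison: a dense (non-depthwise) 3x3 conv at the same expanded width.
dense_3x3_at_width = conv_params(256, 256, 3)

print(ir, ir_total, dense_3x3_at_width)
```

The depthwise conv operates on the widest tensor in the block yet contributes the fewest weights, which is why the block can afford to expand in the middle.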
To this day, timm remains the most comprehensive collection of known EfficientNet and MobileNet-V3 architectures. It covers all of the officially released TensorFlow weights from various model papers (EfficientNet, EfficientNet-EdgeTPU, EfficientNet-V2, MobileNet-V2, MobileNet-V3), training techniques (RandAug/AutoAug, AdvProp, Noisy Student), and numerous other closely related architectures and weights such as MNasNet, FBNet v1/v2/v3, LCNet, TinyNet, and MixNet. There are also many weights trained in timm, in pure PyTorch, with PyTorch-friendly convolution padding (TF weight ports use a 'SAME' padding emulation), that aren't in other collections.
New Models
Now, it's finally time for MobileNet-V4 (https://arxiv.org/abs/2404.10518). Reading through the paper, it's apparent the goal was to come up with a new set of NAS-searched models that are runtime optimal on today's mobile/edge hardware, from small DSP/CPU devices to modest edge accelerators (e.g. EdgeTPU) in current mobile phones.
This goal was achieved by introducing two new block types to the previous mix:
- Universal Inverted Bottleneck (UIB)
- Multi Query Attention (MQA)
Universal Inverted Bottleneck
A superset of the original Inverted Residual / Inverted Bottleneck block, the UIB adds more flexibility to the search space by adding 2 extra depthwise convolution positions, at the start and end of the block, and making the middle depthwise convolution optional. The extra final convolution isn't used (yet), but the new blocks in use now include:
- 'ExtraDW' with a 3x3 or 5x5 DW convolution to start the block in front of existing 1x1 PW + kxk DW + 1x1 PWL pattern
- 'FFN' with no DW convs enabled and just the 1x1 PW expansion + linear convs
- 'ConvNeXt' with 3x3 or 5x5 DW convolution to start the block, no middle DW convolution, so a 1x1 + 1x1 FFN to end
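The variants above differ only in which of the two optional depthwise positions is switched on. The sketch below spells that out; the function names and the "0 means disabled" convention are assumptions made for illustration and are not the timm API.

```python
def uib_variant(start_dw_kernel, mid_dw_kernel):
    """Name the block by which optional depthwise convs are enabled (0 = off)."""
    if start_dw_kernel and mid_dw_kernel:
        return "ExtraDW"
    if start_dw_kernel:
        return "ConvNeXt"
    if mid_dw_kernel:
        return "IR"  # the original Inverted Residual pattern
    return "FFN"


def uib_layers(start_dw_kernel, mid_dw_kernel):
    """List the convolutions in order for a given UIB configuration."""
    layers = []
    if start_dw_kernel:
        layers.append(f"{start_dw_kernel}x{start_dw_kernel} DW")
    layers.append("1x1 PW expand")
    if mid_dw_kernel:
        layers.append(f"{mid_dw_kernel}x{mid_dw_kernel} DW")
    layers.append("1x1 PWL")
    return layers


for start, mid in [(0, 3), (3, 5), (0, 0), (5, 0)]:
    print(uib_variant(start, mid), uib_layers(start, mid))
```

The unused final depthwise position mentioned above is left out of this sketch.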
Mobile MQA
Also added, for the 'Hybrid' variants of MobileNet-V4, is attention via a mobile-optimized Multi Query Attention module. Neither the key nor the value is split into heads; a single key and value are shared across the 4 or 8 query heads. There is optional 2D spatial downsampling for the key-value and/or query.
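As a rough illustration of what sharing the key and value buys, the sketch below counts projection weights for standard multi-head attention versus multi-query attention. The dimensions (256 channels, 4 heads of size 64) are illustrative assumptions rather than values from the paper, biases are ignored, and the optional spatial downsampling convs are not modeled.

```python
def attention_proj_params(dim, num_heads, head_dim, multi_query):
    """Weight counts of the q/k/v/output projections (bias-free)."""
    kv_heads = 1 if multi_query else num_heads  # MQA shares one key/value head
    return {
        "q": dim * num_heads * head_dim,
        "k": dim * kv_heads * head_dim,
        "v": dim * kv_heads * head_dim,
        "out": num_heads * head_dim * dim,
    }


mha = attention_proj_params(dim=256, num_heads=4, head_dim=64, multi_query=False)
mqa = attention_proj_params(dim=256, num_heads=4, head_dim=64, multi_query=True)

print("MHA:", mha, sum(mha.values()))
print("MQA:", mqa, sum(mqa.values()))
```

Beyond the weight count, the key and value tensors themselves shrink by the same factor, which reduces memory traffic, often the limiting resource on mobile accelerators.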
PyTorch Implementation
I've recently implemented these models in timm in a bid to keep timm the best place to go for efficient image encoders. The implementation builds on top of the previous MobileNet-V3 implementation. I have trained a number of initial weights and am working on covering all of the models mentioned in the paper: https://huggingface.co/collections/timm/mobilenetv4-pretrained-weights-6669c22cda4db4244def9637
And in case you look at that PR and wonder, WTH is EfficientNet-X / EfficientNet-H? They are little known variants of those models w/ Space2Depth, tweaked for TPU or GPU use. That's there too but not the focus.
There is an official TensorFlow implementation of these models in the TensorFlow Model Garden (https://github.com/tensorflow/models), but no sign of weights yet.
A comparison of the paper's ImageNet-1k training results vs timm is in the tables below. Note that the parameter counts in the paper assume the normalization params are folded into the convs, whereas the timm values are for the models in training state.
timm:
model | top1 | top1_err | top5 | top5_err | param_count (M) | img_size |
---|---|---|---|---|---|---|
mobilenetv4_hybrid_large.e600_r384_in1k | 84.266 | 15.734 | 96.936 | 3.064 | 37.76 | 448 |
mobilenetv4_hybrid_large.e600_r384_in1k | 83.800 | 16.200 | 96.770 | 3.230 | 37.76 | 384 |
mobilenetv4_conv_large.e600_r384_in1k | 83.392 | 16.608 | 96.622 | 3.378 | 32.59 | 448 |
mobilenetv4_conv_large.e600_r384_in1k | 82.952 | 17.048 | 96.266 | 3.734 | 32.59 | 384 |
mobilenetv4_conv_large.e500_r256_in1k | 82.674 | 17.326 | 96.31 | 3.69 | 32.59 | 320 |
mobilenetv4_conv_large.e500_r256_in1k | 81.862 | 18.138 | 95.69 | 4.31 | 32.59 | 256 |
mobilenetv4_hybrid_medium.e500_r224_in1k | 81.276 | 18.724 | 95.742 | 4.258 | 11.07 | 256 |
mobilenetv4_conv_medium.e500_r256_in1k | 80.858 | 19.142 | 95.768 | 4.232 | 9.72 | 320 |
mobilenetv4_hybrid_medium.e500_r224_in1k | 80.442 | 19.558 | 95.38 | 4.62 | 11.07 | 224 |
mobilenetv4_conv_blur_medium.e500_r224_in1k | 80.142 | 19.858 | 95.298 | 4.702 | 9.72 | 256 |
mobilenetv4_conv_medium.e500_r256_in1k | 79.928 | 20.072 | 95.184 | 4.816 | 9.72 | 256 |
mobilenetv4_conv_medium.e500_r224_in1k | 79.808 | 20.192 | 95.186 | 4.814 | 9.72 | 256 |
mobilenetv4_conv_blur_medium.e500_r224_in1k | 79.438 | 20.562 | 94.932 | 5.068 | 9.72 | 224 |
mobilenetv4_conv_medium.e500_r224_in1k | 79.094 | 20.906 | 94.77 | 5.23 | 9.72 | 224 |
mobilenetv4_conv_small.e2400_r224_in1k | 74.616 | 25.384 | 92.072 | 7.928 | 3.77 | 256 |
mobilenetv4_conv_small.e1200_r224_in1k | 74.292 | 25.708 | 92.116 | 7.884 | 3.77 | 256 |
mobilenetv4_conv_small.e2400_r224_in1k | 73.756 | 26.244 | 91.422 | 8.578 | 3.77 | 224 |
mobilenetv4_conv_small.e1200_r224_in1k | 73.454 | 26.546 | 91.34 | 8.66 | 3.77 | 224 |
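On the parameter-count note above the table: folding a batch norm into the preceding convolution replaces the norm's learnable scale and shift with a single conv bias, so each conv + norm pair loses one learnable parameter per output channel. The sketch below shows the fold for a single channel of a 1x1 conv using plain scalars. It is a minimal illustration under my reading of "folding", with made-up numbers and hypothetical helper names, not the procedure used in the paper or in timm.

```python
import math


def fold_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics and affine params into a conv weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, beta - mean * scale


def conv_bn_params(c_in, c_out, kernel, folded):
    """Learnable params of a bias-free conv + BN, before or after folding."""
    weights = c_in * c_out * kernel * kernel
    # Unfolded: BN gamma and beta per channel. Folded: one conv bias per channel.
    return weights + (c_out if folded else 2 * c_out)


# Single-channel check that the folded conv matches conv followed by BN.
w, gamma, beta, mean, var, eps = 0.5, 1.2, -0.3, 0.1, 0.8, 1e-5
x = 2.0
unfolded_out = (w * x - mean) / math.sqrt(var + eps) * gamma + beta
w_f, b_f = fold_bn(w, gamma, beta, mean, var, eps)
folded_out = w_f * x + b_f

train_state = conv_bn_params(64, 256, 1, folded=False)
folded_state = conv_bn_params(64, 256, 1, folded=True)
print(unfolded_out, folded_out, train_state, folded_state)
```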