So guess what, YOLOv4 was released just a few days ago, and I must say I am really excited about it. Why? Well, YOLOv3 was popular, robust and quick, and YOLOv4 in comparison feels like a significant upgrade in terms of speed and performance. So in this article I am going to dissect the paper YOLOv4: Optimal Speed and Accuracy of Object Detection by Alexey Bochkovskiy, Chien-Yao Wang and Hong-Yuan Mark Liao.
Wait – hold on, what happened to the original creators of YOLO v1-3, Joseph Redmon and Ali Farhadi? Well, Joseph (or Joe) tweeted in February 2020 that he would stop computer vision research because of how the technology was being used for military applications, and because the privacy concerns were having a societal impact.
Ali Farhadi went on to found a company called xnor.ai, which specialized in edge-centric AI. According to Forbes, they have since been acquired by Apple, surprise surprise :P.
Okay, so back to YOLO. I am not going to cover YOLO v1-3 in this article because I already cover it in another video of mine, which you can check out on my channel.
I’ll be dissecting the YOLOv4 paper to help you understand this great technology without too much technical jargon, and to uncover:
How it works
How it was developed
What approaches they used
Why they used particular methods.
As well as how it performs in comparison to competing object detection models,
and finally, why it’s so awesome!
Remember that all references that I mention will be linked down in the comments.
Okay so if you are ready to get started with AI, Computer vision and YOLOv4! Click the link down below to get started. 😉
Goal of YOLOv4
So, the goal of YOLOv4, according to the authors, was to design a fast-operating object detector for production systems that is also optimized for parallel computation. It had to be better in a lot of ways if it was to be the purple cow – something extraordinary. It had to be super fast, high quality in terms of accuracy, and output convincing object detection results. How would the authors go about achieving these milestones?
Object Detector Architectures
So, let’s first back up and get a bit technical. Object detectors are typically composed of several components, which are the:
Input – This is where you feed in your image. Next we have the
Backbone – which refers to the network that takes the image as input and extracts the feature map – this may be a VGG16, ResNet-50, Darknet53 or ResNeXt50 variant. Then there are the
Neck and head, which sit on top of the backbone. The neck serves to enhance feature discriminability and robustness using the likes of FPN, PAN, RFB etc., and the
Head, which handles the prediction. This can be either a one-stage detector for dense prediction, like YOLO or SSD, or a two-stage detector, also known as sparse prediction, like Faster R-CNN and Mask R-CNN.
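To make the input → backbone → neck → head flow concrete, here is a minimal Python sketch. The stage functions are hypothetical stand-ins (simple array operations), not the real CNN layers the paper uses:

```python
import numpy as np

# Hypothetical stand-ins for the three stages; a real detector would use
# deep CNN layers here, but the data flow is the same.

def backbone(image):
    """Extract feature maps at strides 8, 16 and 32 (toy version:
    downsample the image and average its channels)."""
    return [image[::s, ::s].mean(axis=-1, keepdims=True) for s in (8, 16, 32)]

def neck(features):
    """Fuse features across levels (stand-in for FPN/PAN: identity)."""
    return features

def head(features):
    """Emit one (x, y, w, h, confidence) vector per feature level."""
    return [np.zeros(5) for _ in features]

image = np.random.rand(256, 256, 3)
predictions = head(neck(backbone(image)))
print(len(predictions))  # 3 - one prediction set per scale
```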
Selection of Architecture
Now, there are many combinations of these components that you can conjure up in order to yield an optimal object detector. Looking at the architecture for YOLOv4:
There was a choice between CSPResNeXt50, CSPDarknet53 and EfficientNet-B3 – based on theoretical justification and several experiments, the CSPDarknet53 neural network was shown to be the most optimal model.
They added an SPP (spatial pyramid pooling) block over the CSPDarknet53, since it significantly increases the receptive field, separates out the most significant context features, and causes almost no reduction in network operation speed.
They use PANet (Path Aggregation Network) as the method of parameter aggregation from different backbone levels for different detector levels, instead of the FPN (Feature Pyramid Network) used in YOLOv3.
Finally, they chose the YOLOv3 head for YOLOv4.
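As a rough illustration of what the SPP block does, here is a NumPy sketch. The kernel sizes 5, 9 and 13 match the usual YOLO SPP configuration; the feature map here is random, not real network output:

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling with 'same' padding on an (H, W, C) map."""
    p = k // 2
    padded = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].max(axis=(0, 1))
    return out

def spp_block(x, kernels=(5, 9, 13)):
    """Concatenate the input with max-pooled copies at several kernel
    sizes, widening the receptive field without changing H x W."""
    return np.concatenate([x] + [max_pool_same(x, k) for k in kernels], axis=-1)

feat = np.random.rand(13, 13, 4)
print(spp_block(feat).shape)  # (13, 13, 16): spatial size kept, channels x4
```

Note how the spatial resolution is preserved while each output position now summarises context from windows up to 13x13 wide – that is the receptive-field boost the authors describe.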
Now, there’s a lot more to YOLOv4 than just its architecture for inference. You can apply certain optimizations to the training method to yield better accuracy without increasing the inference cost. The authors label this training strategy the bag of freebies. [Bag of whaaaaaat?]
Bag of Freebies
Let me reiterate in case you missed it – the authors define the set of methods that only change the training strategy or only increase the training cost as the “bag of freebies.” So bag refers to a set of methods or strategies, and freebies means that your inference accuracy goes up without any cost to your hardware. Essentially, you are getting the additional performance for FREE.
Now there are a lot, and I mean a lot, of strategies and methods that can be used to optimize training within the bag-of-freebies arsenal. No really... there’s just too much to cover in depth.
For the backbone, the authors use data augmentation, which is meant to increase the variability of the input images so that the object detection model is more robust to images obtained from different environments. I explain how to implement data augmentation in my AI courses on AugmentedStartups.com. The freebies adopted for the backbone are:
CutMix and Mosaic data augmentations, and
Class label smoothing.
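Class label smoothing is easy to sketch in NumPy. The epsilon value of 0.1 below is a common default, not a value taken from this article:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Class label smoothing: soften hard 0/1 targets so the model is
    less over-confident. Each target becomes one_hot * (1 - eps) + eps / K."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

target = np.array([0.0, 0.0, 1.0, 0.0])   # hard label for class 2
print(smooth_labels(target))               # [0.025 0.025 0.925 0.025]
```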
For the detector, the authors use:
Mosaic data augmentation,
Self-Adversarial Training (SAT),
Eliminate grid sensitivity,
Using multiple anchors for a single ground truth,
Cosine annealing scheduler
Optimal hyper-parameters, and random training shapes.
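As a quick illustration of the cosine annealing scheduler listed above, here is a minimal sketch; the lr_max and total_steps values are arbitrary examples, not the paper’s settings:

```python
import numpy as np

def cosine_annealing_lr(step, total_steps, lr_max=0.01, lr_min=0.0):
    """Cosine annealing: the learning rate follows half a cosine wave,
    decaying smoothly from lr_max down to lr_min over the run."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * step / total_steps))

print(cosine_annealing_lr(0, 1000))    # 0.01  (start of training)
print(cosine_annealing_lr(500, 1000))  # 0.005 (halfway)
print(cosine_annealing_lr(1000, 1000)) # 0.0   (end of training)
```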
Bag of Specials
We spoke about the bag of freebies; however, the authors also speak about a bag of specials. Specials refers to getting something of value at a discount, or for cheap. Analogously, the set of modules that increase the inference cost only by a small amount but significantly improve the accuracy of object detection is called the “bag of specials.” The authors used the following in YOLOv4.
Bag of Specials (BoS) for the Backbone:
Cross-stage partial connections (CSP),
Multi-input weighted residual connections (MiWRC), and
Mish activation.
Bag of Specials (BoS) for the Detector:
The PAN path-aggregation block, the SPP block, the SAM (spatial attention module) block, Mish activation, and DIoU-NMS.
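One bag-of-specials module worth sketching is the Mish activation, which the paper uses in the CSPDarknet53 backbone. A minimal NumPy version (numerically naive for very large inputs):

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)). A smooth, non-monotonic
    alternative to ReLU; softplus(x) = ln(1 + e^x)."""
    return x * np.tanh(np.log1p(np.exp(x)))

print(mish(np.array([-2.0, 0.0, 2.0])))
```

Unlike ReLU, Mish lets small negative values through, which is part of why it can improve gradient flow in deep backbones.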
In order to make the designed detector more suitable for training on a single GPU, the authors made additional design choices and improvements as follows:
They introduce a new method of data augmentation called Mosaic, as well as Self-Adversarial Training (SAT),
They select optimal hyper-parameters using genetic algorithms, and
They modify some existing methods to make their design suitable for efficient training and detection.
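Here is a simplified sketch of the Mosaic idea: tile four training images into one sample, so a single batch entry mixes objects and contexts from four scenes. The real implementation also randomizes the crossover point and remaps the bounding boxes, which this toy version omits:

```python
import numpy as np

def mosaic(images, out_size=416):
    """Tile four (H, W, 3) images into one out_size x out_size canvas,
    resizing each to a quadrant with naive nearest-neighbour sampling."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        ys = np.arange(half) * img.shape[0] // half  # row indices to sample
        xs = np.arange(half) * img.shape[1] // half  # column indices to sample
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas

imgs = [np.random.rand(100, 120, 3) for _ in range(4)]
print(mosaic(imgs).shape)  # (416, 416, 3)
```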
Okay, so looking at the experiments, Bochkovskiy et al. performed experiments on both the ImageNet and MS COCO datasets using a single GPU, such as an NVIDIA GTX 1080 Ti or RTX 2080 Ti. All their bag-of-specials experiments used the same hyper-parameters as the default settings, whereas the bag-of-freebies experiments had 50% additional training steps.
For the ImageNet experiments, the hyper-parameters were as follows:
Training steps of 8 million
Batch and mini-batch sizes of 128 and 32 respectively
Learning rate of 0.1 using a polynomial decay learning rate scheduling strategy.
Warm-up steps of 1,000, and
Momentum and weight decay of 0.9 and 0.005 respectively
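The polynomial decay schedule named above can be sketched in a few lines. Note that the exponent (power) below is an assumption for illustration, since the article only names the schedule type:

```python
def polynomial_decay_lr(step, total_steps=8_000_000, lr0=0.1, power=4.0):
    """Polynomial decay sketch: the rate falls from lr0 to 0 as
    (1 - step/total_steps)**power. The power value is illustrative."""
    return lr0 * (1.0 - step / total_steps) ** power

print(polynomial_decay_lr(0))           # 0.1 at the start
print(polynomial_decay_lr(8_000_000))   # 0.0 at the end
```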
For the MS COCO experiments, the hyper-parameters were as follows:
Training steps of 500,500
Learning rate of 0.01 using a step decay learning rate scheduling strategy.
The learning rate was multiplied by a factor of 0.1 at the 400k-step and 450k-step marks respectively.
Momentum and weight decay of 0.9 and 0.0005 respectively.
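The step decay schedule with the 400k/450k drops described above can be sketched as:

```python
def step_decay_lr(step, lr0=0.01, milestones=(400_000, 450_000), factor=0.1):
    """Step decay: multiply the learning rate by `factor` at each
    milestone step that has already been passed."""
    lr = lr0
    for m in milestones:
        if step >= m:
            lr *= factor
    return lr

print(round(step_decay_lr(100_000), 6))  # 0.01   (before any drop)
print(round(step_decay_lr(420_000), 6))  # 0.001  (after the 400k drop)
print(round(step_decay_lr(480_000), 6))  # 0.0001 (after both drops)
```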
Now, regarding the experiments themselves, there’s a lot of information, mostly technical in nature, that would be better digested by reading the YOLOv4 paper itself. But essentially, the authors’ goal was to perform experiments to test the influence of the following:
Different features on Classifier training
Different features on detector training
Different backbones and pre-trained weightings on detector training
Different mini-batch size on Detector training
Because there are a ton of features to test, especially in the bags of freebies and specials, the strategy they implemented was to test each feature using a process called an ablation study. [Abla... what? Wait, what is that?] An ablation study is where you systematically remove parts of the input, or components of the method, to see which parts are relevant to the network’s output. It is normally presented as a table with the results on the right-hand side.
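A toy version of such an ablation loop might look like this; the feature names and the scoring function are entirely made up for illustration:

```python
# Hypothetical ablation-study loop: evaluate the model with each feature
# removed in turn to see how much that feature contributes.
FEATURES = ["mosaic", "cutmix", "label_smoothing", "cosine_lr"]

def evaluate(enabled):
    """Stand-in for a full train-and-validate run; returns a fake AP
    score that rewards each enabled feature by a fixed amount."""
    gains = {"mosaic": 1.5, "cutmix": 0.5, "label_smoothing": 0.3, "cosine_lr": 0.2}
    return 40.0 + sum(gains[f] for f in enabled)

baseline = evaluate(FEATURES)
for feature in FEATURES:
    ablated = [f for f in FEATURES if f != feature]
    drop = baseline - evaluate(ablated)
    print(f"without {feature:16s} AP drops by {drop:.1f}")
```

Each row of the printed output corresponds to one row of the ablation table in the paper: the bigger the drop, the more that feature matters.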
Speaking of results, if we look at how YOLOv4 compares to others, you will be quite impressed. But to ensure that we are comparing apples with apples, the authors had to do the comparison on commonly adopted GPUs, since competing models used different GPU architectures for inference time verification. The GPU architectures they compared on were Maxwell, Pascal and Volta. From these graphs...
... you can see that YOLOv4 is superior to the fastest and most accurate competing detectors in terms of both speed and accuracy. WOW, I think that’s pretty amazing and quite an achievement.
So in summary, the authors offer a state-of-the-art detector which is faster in terms of frames per second and more accurate on MS COCO AP50...95 and AP50 than all available alternative detectors. What is nice is that YOLOv4 can be trained and used on a conventional GPU with 8-16 GB of VRAM, which is broadly available.
I must say a big well done to the authors for having verified a large number of features and selected the best of them for improving the accuracy of both the classifier and the detector. These features can serve as best practice for future studies and developments.
If you would like to read their paper YOLOv4: Optimal Speed and Accuracy of Object Detection - Click Here
If you are interested in Enrolling in my upcoming course on YOLOv4 then sign up over here when it gets released - Click Here
A. Bochkovskiy, C.-Y. Wang and H.-Y. M. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv, 2020.
Forbes, "Apple Acquires Xnor.ai To Bolster AI At The Edge," Jan 2020. [Online]. Available: https://www.forbes.com/sites/janakirammsv/2020/01/19/apple-acquires-xnorai-to-bolster-ai-at-the-edge/#4a73f5053975.