Techniques of Crowd Counting using CNN: A Review

Neethu Mariya
15 min read · Jan 4, 2021

I have been reading through some papers published on the topic of crowd counting lately. I thought of writing a short review of each paper and a comparative study of the techniques introduced in each. Initially it took me quite some time to get the hang of it. Here, I’m giving the gist of the key techniques that gave a breakthrough in the domain of object detection. I hope it will be helpful to many like me who are just entering the field of Computer Vision.

Let’s dive in…

Source: FUJITSU BLOG

Crowd counting has been an active research domain for decades. Since the advent of deep learning, there has been a drastic improvement in this area. Crowd counting is the technique of estimating the number of people in an image (or video).

Consider an image of a densely crowded scene. It is a tedious task for humans to predict the head count from such an image, but a machine can work through this problem precisely. This has become possible in the recent past with the introduction of deep neural networks into the field of Computer Vision. In particular, Convolutional Neural Network based estimation of density maps over the image has been a breakthrough in this domain.

This problem statement has attracted a lot of Computer Vision enthusiasts due to its wide range of applications. Its use cases include video surveillance, urban planning, health risk monitoring, managing public places, etc. It has gained even more importance during the recent pandemic, to ensure public safety by controlling the crowd at public places.

Crowd counting can be broadly classified into sparse and dense crowd counting. Here we discuss a few techniques from both categories. The former techniques discussed here are mostly applied to sparse crowds, and the latter ones to denser crowds.

Survey

Overfeat
The Overfeat paper[1] explores three tasks, i.e., classification, localization, and detection, each as a sub-task of the next. All three tasks are addressed using shared feature learning and an integrated framework.

Classification
The authors prepare two different architectures: a fast one and an accurate one. The basic architecture is that of AlexNet (5 convolutional layers) with some fine-tuning. The differences are: 1) no contrast normalization, 2) pooling regions are non-overlapping, and 3) smaller strides in the first two layers, giving larger feature maps. Training and inference of the model differ: the classifier is trained on a single scale, 221 x 221 crops of the input image, i.e., a non-spatial architecture (the output map is 1 x 1). During inference, the spatial size of the output map varies with the scale of the input image; 6 different scales are considered.
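To make the non-spatial training versus spatial inference idea concrete, here is a minimal PyTorch sketch (layer sizes are toy values, not the actual OverFeat configuration) in which the "fully connected" layers are expressed as convolutions, so the same weights yield a 1 x 1 output at the training scale and a larger spatial map at a bigger test scale:

```python
import torch
import torch.nn as nn

# Illustrative fully convolutional classifier: the "fully connected" layers
# are expressed as convolutions, so the network accepts any input size.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=7, stride=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=3),        # non-overlapping pooling
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(),  # plays the role of an FC layer
    nn.Conv2d(64, 10, kernel_size=1),             # per-location class scores
)

print(net(torch.randn(1, 3, 89, 89)).shape)    # training scale -> (1, 10, 1, 1)
print(net(torch.randn(1, 3, 137, 137)).shape)  # larger scale  -> (1, 10, 5, 5)
```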

Multi-scale classification
To boost classification accuracy, the authors use 6 different scales of the input, which yield feature maps of different sizes at the end of the 5 layers of the AlexNet-style architecture. Max pooling is then applied at 9 different pixel offsets (the 3 x 3 combinations). This amounts to exploring the entire image by densely running the network at each location, which makes the model more robust: the network performs best when its window is aligned with the object, and evaluating at 6 scales and many offsets increases the chance of such an alignment. The final classification is done by taking the spatial maximum for each class at each of the different scales and flips.
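As a sketch of how the per-scale scores can be combined (the spatial max per scale and the averaging across scales follow the paper's description; everything else here is illustrative):

```python
import torch

def classify_multiscale(net, image_pyramid):
    """Combine class scores across scales: take the spatial maximum per
    class at each scale, then average the resulting score vectors.
    A simplified sketch of the OverFeat aggregation step."""
    per_scale = []
    for img in image_pyramid:                      # e.g. 6 rescaled inputs
        scores = net(img)                          # (1, C, H, W) spatial map
        per_scale.append(scores.amax(dim=(2, 3)))  # spatial max per class
    return torch.stack(per_scale).mean(dim=0)      # (1, C) final scores

# Reusing the toy `net` from the previous sketch:
pyramid = [torch.randn(1, 3, s, s) for s in (89, 113, 137)]
print(classify_multiscale(net, pyramid).shape)     # (1, 10)
```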

Faster R-CNN
State-of-the-art object detection networks rely on region proposal algorithms to generate bounding boxes, i.e., locations of possible objects. Earlier methods built on region proposals, like R-CNN and Fast R-CNN, used CPU-based algorithms to generate the proposals. This was a bottleneck in these state-of-the-art detection systems.

Intuitively, the features extracted by the CNN to classify objects should also be usable to propose regions of the image; the observation is that these features contain the information needed to detect instances in the image. Leveraging this idea, the Faster R-CNN paper[2] proposes an algorithmic change: computing proposals with a CNN, which makes proposals nearly cost-free given the detection network. This significantly decreases the marginal cost of computing proposals: about 10 ms per image, as opposed to roughly 2 s per image with Selective Search and 0.2 s per image with EdgeBoxes, the two region proposal algorithms used before.

Thus, the authors replace the region proposal algorithm with a Region Proposal Network (RPN) that estimates the regions of the instances. The RPN produces region proposals from the extracted CNN feature map. These proposals are then used by the detector network, Fast R-CNN, to give the final bounding boxes of the instances. Since both the RPN and the detector network are built on CNN feature extractors, the authors unify the structure to share a single CNN feature extractor, as shown in figure 1a.

Figure 1a: Faster R-CNN, Source: Faster RCNN paper by Shaoqing Ren

Anchor Boxes
Anchor boxes are reference boxes placed at various positions of the feature map. Here, 9 anchor boxes, covering 3 aspect ratios and 3 sizes, are generated for each position of the feature map. Instead of a raw regressor output, the network predicts offsets relative to the anchor boxes. The resulting redundancy in the predicted bounding boxes is later removed by Non-Maximum Suppression (NMS).
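A minimal NumPy sketch of the anchor generation (a common formulation; the scales chosen here correspond to boxes of area 128², 256² and 512² pixels, as in the paper):

```python
import numpy as np

def make_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate the 9 reference boxes (3 aspect ratios x 3 scales) centred
    at the origin, as (x1, y1, x2, y2). ratio is interpreted as h / w."""
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2   # 128^2, 256^2, 512^2 pixels
            w = np.sqrt(area / ratio)
            h = w * ratio                     # preserves the box area
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# The same 9 anchors are replicated at every feature-map position, shifted
# by the feature stride (e.g. 16 pixels for a VGG-16 conv5 feature map).
print(make_anchors().shape)  # (9, 4)
```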

RPN
The Region Proposal Network (RPN) takes feature maps as input and outputs a set of proposals with corresponding objectness scores, as shown in figure 1b. The objectness score indicates whether a region contains an object or background. Faster R-CNN uses 9 anchor boxes (3 aspect ratios x 3 scales) for each position in the feature map. The architecture uses a 3 x 3 convolution followed by two sibling 1 x 1 convolutions (playing the role of per-position fully connected layers): one for classification (the objectness score) and one for regression of the proposals.
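Here is a minimal PyTorch sketch of that head (channel counts follow the VGG-16 variant; the module is illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3 x 3 conv over the shared feature map,
    then two sibling 1 x 1 convs for objectness and box regression
    (k = 9 anchors per feature-map position)."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # object vs background
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # offsets per anchor

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)

head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 38, 50))  # e.g. a VGG-16 conv5 map
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) (1, 36, 38, 50)
```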

The RPN performs dense sampling by sliding its window over all positions in the feature map, whereas RoI pooling performs sparse sampling because it works only on the region proposals generated by the RPN.

Figure 1b: RPN, Source: Faster RCNN paper by Shaoqing Ren

Training
As a unified network is used here, the RPN and the Fast R-CNN detector cannot simply be trained separately. Instead, the authors propose a 4-step training algorithm:
1) The RPN is trained first, initialized with ImageNet pre-trained weights.
2) Fast R-CNN is trained using the proposals generated by the step-1 RPN. (The two networks do not share convolutional layers at this point.)
3) The RPN is re-trained using the convolutional layers from step 2; only the layers unique to the RPN are updated (the shared convolutional weights are frozen).
4) Keeping these shared convolutional layers fixed, the layers unique to Fast R-CNN are fine-tuned.
During training, cross-boundary anchors are removed, and non-maximum suppression with an IoU threshold of 0.7 is applied to remove redundant predictions. The top N ranked proposal regions that remain are used for detection.
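For reference, a minimal NumPy sketch of the greedy NMS step with the 0.7 IoU threshold mentioned above (a standard formulation, not code from the paper):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) array."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep
```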

End to end people detection in crowded scenes
The proposed model[3] accepts an image as input and decodes it into a set of object bounding boxes as output. With this method, commonly used post-processing steps like non-maximum suppression are not required, as the predictions are generated jointly. The architecture uses a recurrent LSTM layer to control the sequence generation. Multiple detections of the same object are avoided because the LSTM units remember the previously generated outputs.

This architecture (shown in figure 2) follows a decoding process: it converts an intermediate representation of the input image into a set of distinct detection hypotheses. It is a fully trainable, end-to-end, generic architecture that requires no pre-defined components, in contrast to existing architectures that handle the classification or prediction of each bounding box independently. Here, the predictions for all objects in an image are made jointly, which gives an advantage over previous methods for partially occluded instances; the model therefore performs better on crowded scenes. The model combines a deep convolutional architecture with an RNN-based decoder built from LSTM units. It leverages the ability of the LSTM to remember previously generated predictions, so that it can avoid repeated predictions of the same instance, and the ability of the decoder to generate coherent sets of predictions.

The model encodes the input into high-level descriptors using a CNN and then decodes them into a set of predictions (bounding boxes). The input image is represented by 1024-dimensional feature descriptors, which contain information about the positions of objects; the LSTM acts as a controller in decoding this source of information. At each step, the LSTM unit outputs a new bounding box and a corresponding confidence that a previously undetected instance will be found at that location. Boxes are generated in descending order of confidence, until the confidence falls below a pre-specified threshold. This is a feed-forward pipeline and is therefore fast (about 6 fps).
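A minimal sketch of that decoding loop (module names, sizes, and the stopping-rule details are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

# An LSTM cell consumes the 1024-d image encoding at every step and emits
# one box plus a confidence, stopping once confidence drops below a threshold.
feat_dim, hidden = 1024, 256
cell = nn.LSTMCell(feat_dim, hidden)
to_box = nn.Linear(hidden, 4)    # (x, y, w, h) of the next detection
to_conf = nn.Linear(hidden, 1)   # confidence that a new person is found

def decode(image_feat, max_steps=50, threshold=0.5):
    h = torch.zeros(1, hidden)
    c = torch.zeros(1, hidden)
    boxes = []
    for _ in range(max_steps):
        h, c = cell(image_feat, (h, c))  # state remembers past outputs
        conf = torch.sigmoid(to_conf(h))
        if conf.item() < threshold:      # no more undetected people
            break
        boxes.append(to_box(h).detach())
    return boxes

print(len(decode(torch.randn(1, feat_dim))))
```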

Figure 2: Source: End to End People Detection in Crowded Scenes 2016 paper by Russell Stewart

Crowdnet
Dense crowd scenes are often captured by an aerial camera or from varying viewpoints, which introduces problems of perspective and change of scale. People near the camera are captured in full, but those far from the camera appear as blobs. This paper[4] proposes a combination of a deep and a shallow fully convolutional network to estimate the crowd density. The combination is highly effective in capturing both high-level semantic information (face/body detectors) and low-level features (blobs), which is necessary to handle images of dense crowds. The authors also perform multi-scale data augmentation to mitigate the limited number of training images in the dataset.

Architecture
The deep network used to capture high-level semantics is similar to VGG-16 and returns the density map of the input image. To achieve this, the authors modify the VGG network by removing the fully connected layers and making it fully convolutional, so as to get pixel-level predictions. The shallow network, used to recognize low-level features like blobs, is built from three convolutional layers. The architecture is shown in figure 3 below.
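A rough PyTorch sketch of this two-column idea (channel counts and layer details are illustrative stand-ins; the real deep column is a truncated VGG-16):

```python
import torch
import torch.nn as nn

class CrowdNetSketch(nn.Module):
    """Deep column for semantics, shallow 3-layer column for blobs,
    concatenated and fused by a 1x1 conv that predicts the density map."""
    def __init__(self):
        super().__init__()
        self.deep = nn.Sequential(            # stand-in for truncated VGG-16
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
        )
        self.shallow = nn.Sequential(         # three conv layers for blobs
            nn.Conv2d(3, 24, 5, padding=2), nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(24, 24, 5, padding=2), nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(24, 24, 5, padding=2), nn.ReLU(),
        )
        self.fuse = nn.Conv2d(256 + 24, 1, 1)  # 1x1 conv -> density map

    def forward(self, x):
        d, s = self.deep(x), self.shallow(x)
        return self.fuse(torch.cat([d, s], dim=1))

net = CrowdNetSketch()
density = net(torch.randn(1, 3, 224, 224))
print(density.shape, density.sum().item())    # sum approximates the count
```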

In this proposal, the ground truth is prepared by blurring the head annotations with a Gaussian kernel normalized to sum to one. Thus, the sum of the density map equals the crowd count, which is easier for the CNN to learn. To get scale-invariant representations, the authors crop patches of size 225 x 225 with 50% overlap from a multi-scale pyramidal representation of the input images, with scales from 0.5 to 1.2 in increments of 0.1. To make the CNN more robust, multiple samples are taken from high-density patches.
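A minimal sketch of this ground-truth construction using SciPy (the sigma value is an assumption; the key property is that the map sums to the head count):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, shape, sigma=4.0):
    """Build the ground-truth density map: place a unit impulse at each
    annotated head and blur with a Gaussian that integrates to one, so
    the map sums to the crowd count."""
    dmap = np.zeros(shape, dtype=np.float32)
    for x, y in head_points:
        dmap[int(y), int(x)] += 1.0
    return gaussian_filter(dmap, sigma)

heads = [(30, 40), (120, 80), (200, 150)]
dmap = density_map(heads, shape=(240, 320))
print(dmap.sum())  # ~3.0, i.e. the number of annotated people
```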

Figure 3: Source: Crowdnet paper by Lokesh Boominathan

Switching Convolutional Neural Network
The authors of this paper[5] propose the Switching Convolutional Neural Network (Switch-CNN), which exploits the variation of crowd density within a scene to improve the accuracy of the predicted crowd count. Here the switch is a classifier, and the CNNs are regressor networks that map an image patch to its density. The CNN regressors are trained independently on patches sampled from a grid over the input image. These regressors have different receptive fields and fields of view, so that together they adapt to the variations within the image. The switch classifier is trained to correctly relay each patch to the best-suited regressor. Switch-CNN thus jointly learns to capture the complex variations of the scene, minimizing the count error and improving density localization.

Architecture
The proposed architecture consists of three regressors with different shallow CNN architectures. Each has 4 convolutional layers with pooling, with filter sizes of 9 x 9, 7 x 7 and 5 x 5; each is designed to capture different features of the image, like high-level body parts or low-level blobs. The switch consists of a switch classifier and a switch layer: the classifier infers the label of the regressor to which a patch must be relayed, and the layer relays it to that regressor. The classifier uses a VGG-16 architecture as its backbone to perform the 3-way classification, with a global average pool (GAP) on conv5 to aggregate the features. The networks are trained in a coupled fashion, alternating between classifier and regressors so that they co-adapt and the effect of switch inaccuracy is mitigated. The proposed model outperforms the Crowdnet model on the UCF_CC_50 dataset.
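A minimal sketch of Switch-CNN inference under these assumptions (toy modules; the real regressors are deeper and the switch is VGG-16 based):

```python
import torch
import torch.nn as nn

# A classifier picks, per patch, which of the three regressors
# (filter sizes 9x9, 7x7, 5x5) should predict its density.
def regressor(k):
    return nn.Sequential(
        nn.Conv2d(3, 16, k, padding=k // 2), nn.ReLU(),
        nn.Conv2d(16, 1, 1),                  # per-pixel density
    )

regressors = nn.ModuleList([regressor(9), regressor(7), regressor(5)])
switch = nn.Sequential(                       # stand-in for the VGG-16 classifier
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 3),
)

def count_patch(patch):
    idx = switch(patch).argmax(dim=1).item()  # relay patch to one regressor
    return regressors[idx](patch).sum().item()

print(count_patch(torch.randn(1, 3, 112, 112)))
```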

Analysis
Both the Overfeat paper[1] and the Faster R-CNN paper[2] propose detection methods based on sliding windows over convolutional feature maps. In the Overfeat paper, the detection method uses regressors and classifiers on a sliding window of one aspect ratio over a scale pyramid (6 scales). This method can therefore fail to detect instances in an input image that differ largely in aspect ratio, say, a bicycle and a monitor, because the proposals cover only one aspect ratio. It is a single-stage, class-specific pipeline.

Faster R-CNN, in contrast, uses a two-stage cascaded pipeline: the first stage generates region proposals using a 3 x 3 sliding window and predicts proposals relative to reference boxes of different scales and aspect ratios (the 9 anchor boxes); the second stage then attends to the proposals that cover the most probable regions. This improves the quality of the region proposals and thereby the overall detection accuracy. The two-stage cascaded Faster R-CNN model is also faster than the single-stage Overfeat model, because the RPN shares its convolutional layers with the detection network; this brings proposal computation down to about 10 ms per image (5-17 fps overall) and improves the overall feature representation. By sharing the convolutional network between the RPN and Fast R-CNN, region proposals are nearly cost-free given the detector network.

Until now, people detector models made use of the sliding-window concept[1, 2] or classified a set of proposals generated by the RPN of the Faster R-CNN paper[2]. In the end-to-end people detection paper[3], the authors introduce a new model based on a recurrent LSTM layer for sequence generation. The image representation used in the model is the same as in the Overfeat model[1], in terms of the filter dimensions and the number of filters, making the models directly comparable. In the Overfeat model[1], hypothesis generation corresponds to bounding box regression from each cell, followed by NMS (Non-Maximum Suppression); in the model of [3], the component responsible for generating distinct hypotheses is the decoder unit with the LSTM layer, which generates variable-length output. The LSTM model outperforms the Overfeat model on overlapping instances in the input image.

Similarly, comparing the Faster R-CNN[2] and the LSTM model[3], the latter performs better: Faster R-CNN performs poorly on highly occluded images, while both approaches work well on fully visible people. In Faster R-CNN, it is important to choose the optimal NMS threshold. The key limitation of NMS is that it has no information about the image itself; it makes its inference based only on the distances and overlap scores of the bounding boxes. Thus it works well on isolated instances but often fails when instances partially or largely overlap in the input image. In contrast to the existing methods, where the classification or prediction of each bounding box is handled independently, the model proposed in [3] predicts the objects in the image jointly, so the predictions do not require post-processing steps like NMS. That is, this model leverages the ability of the LSTM layer to remember previously generated predictions, and thus avoids predicting the same instance multiple times. However, a key problem with the LSTM framework used to regress bounding boxes for heads is that it fails on images with very high inter-occlusion between people. The dense crowd counting approaches that follow are designed to tackle this issue of highly occluded images.

As discussed above, the Crowdnet approach to dense crowd counting not only gives the count of instances in the image, but also indicates which areas of the image contribute most to the count, and by how much. In addition, the extensive multi-scale data augmentation performed on the dataset makes the model robust to large scale changes. The multi-scale image representation thus addresses the problems of perspective and of severe occlusion, which are often a challenge for SIFT & HOG based approaches.

We see that in the Crowdnet model, the authors fuse the features from the shallow and deep CNN columns using a 1 x 1 convolutional layer, i.e., a weighted average, to predict the density. However, this weighted-averaging technique is global in nature and thus ignores intra-scene variations. The Switch-CNN architecture addresses this by introducing patch-based switching to exploit the variation in crowd density within a scene.

State-of-the-art
In the case of dense crowd counting, the state of the art is some form of Convolutional Neural Network performing density map estimation: the network predicts a density map over the input image, and summing the map gives the crowd count. The challenges faced by the authors of the above papers are severe occlusion, the similarity between the head blobs of people far from the camera and some objects in the background, and the problem of perspective. Current state-of-the-art approaches, like multi-scale CNN architectures, recurrent networks, and late fusion of features from multi-column CNNs with different receptive fields, are used to handle these issues.

Open Problems
On highly occluded images, these methods still fail to predict the precise number of instances, especially in the high-density areas of an input image. In addition, the lack of training data is one reason for the low accuracy in high-density areas. Even for state-of-the-art methods, the limited training data and the large variations in crowd density limit the ability of the classifier to learn the multichotomy of the space of crowd scene patches.

Conclusion
Through this review, I give a comprehensive overview of the existing machine learning techniques for the problem of crowd monitoring/counting. We also see how the current state of the art evolved over time. The state-of-the-art approach uses density map estimation with a multi-column Convolutional Neural Network. As mentioned in the introduction, Convolutional Neural Networks have led to a breakthrough in the field of computer vision across a wide range of problems, particularly object detection. The extensive experiments with these methods on different sparse and dense datasets show a very significant improvement in the evaluation metrics over the previous models.

To conclude, this post reviews a few effective methods for crowd counting with a CNN at the heart of each architecture. The approaches discussed above describe the state of the art for this problem statement, and the analysis section depicts the advantages of each model over the previous ones. However, various open problems in the field are yet to be resolved, such as highly dense images and limited training examples. The future development of deep CNNs for crowd counting, detection and localization therefore offers many challenging opportunities to explore.

References
[1] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.

[3] Russell Stewart, Mykhaylo Andriluka and Andrew Y. Ng. End-to-end people detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[4] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 2016 ACM on Multimedia Conference, pages 640–644, 2016.

[5] Deepak Babu Sam, Shiv Surya, R. Venkatesh Babu. Switching Convolutional Neural Network for Crowd Counting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.


[7] Abhinav Sagar. Bayesian multi scale neural network for crowd counting. arXiv preprint arXiv:2007.14245, 2020.
