Face detection: from VJ to deep learning (part two)
Lei Feng's Network note: the author, Wu Shuzhe, is a PhD student in the VIPL research group at the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. Research direction: object detection, with particular attention to object detection methods based on deep learning.

| The changes deep learning brought to object detection
Face detection is a specific type of object detection task. On the one hand it has its own distinctive characteristics, so the particularities of the face as a target have to be considered. On the other hand it has a lot in common with other object detection tasks and can draw directly on research experience from generic object detection methods.
Object detection is essentially a classification problem, so it benefits not only from advances in computer vision but also from progress in machine learning. In fact, the rise of deep learning from 2006 onward gave object detection a strong push, and both generic object detection and detection tasks for specific object types advanced by leaps and bounds.
From neural networks to deep learning
Deep learning is not a new technology. The neural network at its core was first studied as early as the middle of the last century, and it went through a boom of its own during that century.
As the name "deep learning" suggests, the key to the revival of neural networks lies in the word "deep". A neural network is a layered model whose design is inspired by the structure of the brain. It consists of a set of nodes connected according to certain rules, forming a hierarchy. The simplest neural network has 3 layers: an input layer, a hidden layer (not directly connected to the external input or output), and an output layer. Nodes in two adjacent layers are connected by directed edges, and each edge carries a weight.
To illustrate the function a neural network represents, consider an even simpler structure with only an input layer and an output layer. The input layer has d nodes, and the output layer has a single node that is connected to every node of the input layer. The input nodes receive the external input $x = (x_1, x_2, \dots, x_d)$, the weights on the connections to the output node are $w = (w_1, w_2, \dots, w_d)$, and the output node applies a transformation $g$ to its input to produce the output $y$:

$$y = g\left(\sum_{i=1}^{d} w_i x_i\right) = g(w^\top x)$$

The transformation $g$ is usually called the activation function of the node. It is a nonlinear function, such as the sigmoid

$$g(z) = \frac{1}{1 + e^{-z}}$$

We can also add a bias $b$ to the sum, that is,

$$y = g(w^\top x + b)$$

Similarly, we can write the function of a 3-layer neural network:

$$y = g\left(W_3\, g(W_2 x + b_2) + b_3\right)$$

Here $W_2$ and $W_3$ are matrices holding the connection weights between the input nodes and the hidden nodes, and between the hidden nodes and the output nodes, while $b_2$ and $b_3$ are the corresponding bias vectors. Continuing in the same way, this generalizes to an n-layer neural network. As you can see, a defining feature of neural networks is the combination of nonlinear activation functions with layer-by-layer nesting. This lets them represent functions that are highly nonlinear in the input and gives them stronger modeling power for complex data patterns.
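The nesting described above can be made concrete with a small forward pass. This is only a minimal sketch: it assumes the sigmoid as the activation function g, and the function names (`neuron`, `layer`, `three_layer`) and all numbers are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation, used here as an assumed choice for g."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b=0.0, g=sigmoid):
    """Single output node: y = g(w . x + b)."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)

def layer(x, W, b, g=sigmoid):
    """One layer of nodes; each row of W holds the weights of one node."""
    return [neuron(x, row, bi, g) for row, bi in zip(W, b)]

def three_layer(x, W2, b2, W3, b3, g=sigmoid):
    """3-layer network: y = g(W3 g(W2 x + b2) + b3)."""
    hidden = layer(x, W2, b2, g)
    return layer(hidden, W3, b3, g)

# Illustrative input and weights: 2 inputs, 3 hidden nodes, 1 output node.
x = [1.0, -2.0]
W2 = [[0.5, 0.25], [-1.0, 1.0], [0.0, 0.0]]
b2 = [0.0, 0.5, 0.0]
W3 = [[1.0, -1.0, 2.0]]
b3 = [0.0]

y = three_layer(x, W2, b2, W3, b3)
print(y)
```

Adding more layers only means calling `layer` more times, which is why depth by itself does not change the form of the model, only how hard it is to train.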
Early neural networks generally had few layers (a 3-layer network is a shallow network), because deep networks with many layers were very hard to train and struggled to reach satisfactory performance on practical tasks. This situation was not broken until 2006. That year Professor Geoffrey E. Hinton, a leading scholar in machine learning, published a paper in the journal Science entitled "Reducing the Dimensionality of Data with Neural Networks". This work offered an effective way to train deep networks: pre-train the network layer by layer without supervision. It opened the door to learning deep networks. Over the next few years enthusiasm for deep networks soared, and the problems of designing and training them were solved one by one. From initialization to optimization methods, from activation functions to network structure, researchers studied every aspect, so that deep networks can now be trained faster and better. Neural networks themselves are outside the scope of this article, so we do not go further here. Readers only need to keep in mind that a neural network is a model with strong nonlinear modeling power.
There are many kinds of neural networks. The one described above is the simplest: the feed-forward neural network, often just called a neural network, since that name is frequently used to refer to this kind alone. In computer vision, the most successful kind of neural network is the convolutional neural network (CNN). Its defining feature is the convolution operation, which replaces the full connections of a feed-forward network with local connections and shares weights across connections. When a convolution kernel is applied to an image, it acts like the observation window used in detection: it slides step by step from the top-left corner of the image to the bottom-right corner, and each position corresponds to one output node. That output node is connected only to the input nodes inside the window (each pixel of the image corresponds to one input node), and the connection weights are the same for every output node.
CNNs have a unique advantage on image problems because convolution naturally preserves the spatial information of the image. Each operation acts only on a local region, so the spatial structure of the output nodes corresponds to that of the input nodes. A feed-forward network does not behave this way: its input nodes can even be arranged in any order, and as long as the connection weights are permuted accordingly, the output stays the same. The CNN was devised in the last century by the well-known machine learning researcher Yann LeCun and applied to handwritten digit recognition, but its large-scale adoption in computer vision began with the CNN's great success on general image classification in 2012.
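The sliding kernel with shared weights can be sketched in a few lines. This is an illustrative toy, not any particular library's implementation: it assumes a single-channel image, a stride of 1 and no padding, and, as in most CNN code, it does not flip the kernel. The function name `conv2d` and the example values are made up.

```python
def conv2d(image, kernel):
    """Slide one kernel over a 2D image (stride 1, no padding).

    Every output node uses the same kernel weights and only sees the
    input nodes inside its own window.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for top in range(ih - kh + 1):          # window moves top to bottom
        row = []
        for left in range(iw - kw + 1):     # and left to right
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        output.append(row)
    return output

# A tiny image with a vertical edge between column 1 and column 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A kernel that responds to dark-to-bright transitions from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The strongest responses appear in the output column that sits over the edge, which shows how the spatial layout of the output follows the layout of the input.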
R-CNN series
At the end of 2013, deep learning lit a fire under object detection. That fire was R-CNN, where the R stands for "Region": the CNN takes image regions as input. It eventually grew into a whole series and inspired a large amount of follow-up work, setting the computer vision field ablaze.
R-CNN changed many long-standing practices in object detection, and at the same time brought unprecedented accuracy gains on standard benchmarks. The first change is that it abandons the sliding-window paradigm and replaces it with a new candidate-window generation stage. For a given image, instead of scanning it with a sliding window and enumerating every possible window, it "nominates" a set of candidate windows in some way. Provided the recall of the targets to be detected stays acceptable, the number of candidate windows can be kept to a few thousand or even a few hundred. In a sense, the cascade of classifiers in the VJ face detector also nominates candidate windows, each stage for the next. But there is an important difference from the way R-CNN generates candidates: in the cascade every window is in fact still checked once and windows are merely being ruled out along the way, which is a subtractive scheme. R-CNN's candidate generation, in contrast, uses characteristics of the image to decide where a target might be and how large it is, which is an additive scheme built from scratch. Selective Search is a typical candidate-window generation method that borrows the idea of image segmentation. Put simply, it first divides the image into many small blocks based on features such as color, then merges the blocks from the bottom up. Every block before and after merging corresponds to a candidate window, and finally the windows most likely to contain a target are picked as candidates.
Apart from introducing candidate windows, the second major change is in feature extraction: hand-designed features are replaced by features learned by a CNN. Feature extraction is the process of transforming the raw input image (a matrix of pixel color values) into a feature vector. Earlier features such as Haar features were designed by researchers from their own experience and their understanding of the object under study. In other words, a transform was defined by hand. The new practice only requires that the transform be representable by a CNN (which can represent a very wide range of sufficiently complex transforms), without designing the details of feature extraction, and lets training data take over the role of the human designer. Learning features automatically is a very distinctive trait of deep learning. Its benefit is similar to that of letting a classifier learn its own parameters: it avoids manual intervention and frees up human effort, and it helps learn features that better fit the actual data and objective, so that feature extraction and classification reinforce each other. There are drawbacks too. Automatically learned features are often less interpretable, so it is hard to understand intuitively why they work better, and they depend to some degree on the training set.
Also worth mentioning is that R-CNN introduced a new stage into the detection pipeline: bounding-box regression (note: in Chinese, the character for "box" here is read with the fourth tone, not the neutral tone!). Detection is no longer just a classification problem, it is also a regression problem. The difference between regression and classification is that a regression model outputs continuous real values rather than discrete class labels. Bounding-box regression predicts the true position and size of the target from a given window. That is, once a candidate window has been judged to be a face window, it is further adjusted to obtain a more accurate position and size that fits the target more tightly. Bounding-box regression offers a new perspective on how to define the detection task, and it also plays a significant role in improving the accuracy of detection results.
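To make the idea of adjusting a window with continuous outputs concrete, here is a minimal sketch of applying predicted offsets to a candidate window. It assumes the center-shift and log-scale parameterization commonly used with R-CNN-style detectors; the function name `apply_box_deltas` and the numbers are illustrative, and in a real detector the offsets would come from a trained regressor.

```python
import math

def apply_box_deltas(box, deltas):
    """Refine a window (x1, y1, x2, y2) with regression outputs.

    deltas = (dx, dy, dw, dh): dx and dy shift the center by a fraction
    of the window size, dw and dh rescale width and height on a log scale.
    This parameterization is an assumption made for this sketch.
    """
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center, then rescale the size.
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * math.exp(dw), h * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

candidate = (10.0, 10.0, 30.0, 30.0)
# Hypothetical regressor output: move right by 10% of the width, double the height.
refined = apply_box_deltas(candidate, (0.1, 0.0, 0.0, math.log(2.0)))
print(refined)
```

All-zero offsets leave the window unchanged, so the regressor only has to learn corrections relative to the candidate window rather than absolute coordinates.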
The R-CNN detection pipeline is as follows: generate candidate windows with a method such as Selective Search, extract the features of each candidate window with a trained CNN, classify the candidate windows with a classifier trained on the extracted features, and finally refine the windows classified as faces with bounding-box regression.
Although R-CNN greatly improved detection accuracy, both candidate-window generation and the deep network are computationally expensive, so it is very slow. To address R-CNN's speed problem, Fast R-CNN and Faster R-CNN followed. As the names suggest, they are faster. The first acceleration uses a strategy similar to the integral image in the VJ face detector. The integral image is computed once for the whole input image and works like a lookup table: to extract the features of a single window, the required data is fetched by table lookup and then combined with a simple calculation. In R-CNN every candidate window goes through the CNN separately to extract features, so when two windows overlap, the overlapping part is computed twice. Fast R-CNN instead takes the whole image as input and first computes the convolutional feature map of the whole image. Then, for each candidate window, it takes the region corresponding to the window directly from that feature map, which avoids repeated computation. After that, a so-called RoIPooling layer scales all regions to the same size. This strategy provides a speedup of tens or even hundreds of times. For the second acceleration, Fast R-CNN uses a matrix decomposition technique called SVD. It (approximately) decomposes a large matrix into the product of three smaller matrices whose total number of elements is far smaller than that of the original matrix, which reduces the cost of matrix multiplication. Applying SVD to the weight matrices of the fully connected layers reduces the time needed to process an image by 30%.
For the third acceleration, Faster R-CNN turns to the candidate-window generation stage. It uses a CNN to generate candidate windows, and this CNN shares its convolutional layers with the CNN used for classification and bounding-box regression. Both steps can therefore use the same convolutional feature map, which greatly reduces the amount of computation.
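The step of cutting a window out of a shared feature map and squeezing it to a fixed size can be sketched as follows. This is a simplified toy under stated assumptions: a single-channel feature map, window coordinates already expressed in feature-map cells, and max pooling within each bin. The name `roi_max_pool` and the data are made up, and real implementations differ in details such as coordinate rounding.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Pool one region of a 2D feature map to a fixed out_h x out_w grid.

    roi = (top, left, bottom, right) in feature-map cells, with bottom and
    right exclusive. The region is split into out_h x out_w bins and the
    maximum of each bin is kept.
    """
    top, left, bottom, right = roi
    h, w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Rows covered by bin i: floor for the start, ceil for the end.
        r0 = top + (i * h) // out_h
        r1 = max(r0 + 1, top + ((i + 1) * h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = max(c0 + 1, left + ((j + 1) * w + out_w - 1) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# The feature map is computed once for the whole image...
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
# ...and every candidate window only reads its own region from it.
window_a = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)
window_b = roi_max_pool(feature_map, (1, 2, 4, 6), 2, 2)
print(window_a)
print(window_b)
```

The two windows have different sizes (4x4 and 3x4 cells) and overlap, yet both come out as 2x2 grids without the shared cells being recomputed.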
Besides these acceleration strategies, the detection framework and the network structure also kept changing from R-CNN to Faster R-CNN. In its overall framework R-CNN is not essentially different from traditional detection methods, with each stage handled by a separate module: one module generates candidate windows (Selective Search), one extracts features (CNN), one classifies windows (SVM), and an added module performs bounding-box regression. In Fast R-CNN the latter three modules are merged into one, all done by a CNN, so the whole system has only two modules: one generates candidate windows and the other directly classifies and refines them. In Faster R-CNN all modules are integrated into a single CNN, forming an end-to-end framework: the final result is obtained directly from the input image by one model. Learning several tasks jointly in the same model makes effective use of the dependencies between the tasks, so that they reinforce each other. The progression from R-CNN to Faster R-CNN is a process of consolidation. Its success is owed on the one hand to the CNN's powerful nonlinear modeling capacity, which can learn features suited to each of the different subtasks, and on the other hand to the continuing shift in how people understand and think about the detection problem: breaking out of the old sliding-window framework, treating detection as a regression problem, and coupling different tasks. Although Faster R-CNN still cannot match the speed of detectors that do not use deep learning, hardware computing power keeps rising and new CNN acceleration strategies keep appearing, so the speed problem is expected to be resolved in the near future.
Fully convolutional networks and DenseBox
Convolutional layers are the essential feature that distinguishes CNNs from other kinds of neural networks, but a CNN usually contains more than convolutional layers: it also contains fully connected layers. The drawback of fully connected layers is that they destroy the spatial structure of the image, so people began to "replace" them with convolutional layers, typically using 1×1 convolution kernels. A CNN that contains no fully connected layers is called a fully convolutional network (FCN). FCNs were first used for image segmentation and were later applied to all kinds of problems in computer vision. In fact, the CNN that Faster R-CNN uses to generate candidate windows is an FCN.
The defining property of an FCN is that both its input and its output are two-dimensional images, and the output has a spatial structure that corresponds to the input. We can therefore view the output of an FCN as a heat map, in which heat indicates the position and extent of the target to be detected: regions where the target lies show high heat and background regions show low heat. This can also be seen as classifying every pixel of the image: is this point located on a target to be detected? DenseBox is a typical object detector based on a fully convolutional network. It obtains the heat map of the target through an FCN and then derives the position and size of the target from the heat map, which provides a new strategy for solving the object detection problem. (The figure accompanying this passage actually comes from another paper and is used here only to help readers understand what a face heat map looks like.)
Also worth mentioning is that, alongside classification, DenseBox predicts the positions of facial landmarks. Like JointCascade, mentioned in the previous part, DenseBox integrates the two tasks of detection and landmark localization into the same network, and it likewise uses heat maps to determine the position of each landmark.
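As a rough illustration of reading a target off a heat map, the sketch below thresholds the map and takes the rectangle that encloses all "hot" pixels. This is a deliberate simplification that assumes a single target and a hand-picked threshold; it is not how DenseBox itself produces boxes, and the name `box_from_heatmap` and the values are made up.

```python
def box_from_heatmap(heat, threshold=0.5):
    """Return (top, left, bottom, right) enclosing all pixels whose heat
    is at least `threshold`, or None if no pixel is hot enough.

    Assumes at most one target in the map (an assumption of this sketch).
    """
    hot = [(r, c)
           for r, row in enumerate(heat)
           for c, value in enumerate(row)
           if value >= threshold]
    if not hot:
        return None
    rows = [r for r, _ in hot]
    cols = [c for _, c in hot]
    return (min(rows), min(cols), max(rows), max(cols))

# Per-pixel "is this point on a face?" scores from a hypothetical FCN.
heat = [
    [0.0, 0.1, 0.1, 0.0, 0.0],
    [0.1, 0.7, 0.9, 0.6, 0.0],
    [0.0, 0.8, 1.0, 0.7, 0.1],
    [0.0, 0.1, 0.2, 0.1, 0.0],
]

print(box_from_heatmap(heat))
print(box_from_heatmap(heat, threshold=2.0))
```

With several faces in one image, the hot pixels would first have to be grouped into connected regions, and each region would then get its own box.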
CNN-based face detection
The detectors mentioned above are all generic object detectors. They can be trained on face images to obtain face detectors, and although they do not take the particularities of the face into account, they still achieve very good accuracy. This shows that detecting different types of objects is fundamentally similar, and that a common set of mechanisms exists for handling object detection. There is also work designed specifically for face detection. Some of it takes the characteristics of the face into account, while some is in fact fairly generic and can be transferred naturally to other object detection tasks.
FacenessNet is a detector designed specifically for faces. It considers five facial parts: hair, eyes, nose, mouth, and beard. Put simply, for a candidate window FacenessNet analyzes whether these five parts are present and then judges whether the window is a face.
On the one hand, this method uses both global and local information and describes the image content from different angles, so faces and non-faces are better distinguished. On the other hand, it improves robustness to occlusion: occluding part of a face affects the features of the face as a whole but does not affect all local regions, which increases the detector's tolerance of occlusion.
A leap in accuracy
As more and more detectors adopted deep networks, face detection accuracy improved dramatically. In 2014 the best accuracy in academia on FDDB was a detection rate of 84% at 100 false positives, achieved by the JointCascade face detector. In 2015 this record was broken by FacenessNet, whose detection rate at 100 false positives was close to 88%, an improvement of almost 4 percentage points. Moreover, the best record in industry has reached a detection rate of 92.5% at 100 false positives, more than one company has passed 90%, and these results were all obtained with face detectors based on deep networks.
While sharply raising face detection accuracy, deep learning has also lowered the barrier to entry for detection technologies, face detection included, almost to the point where simply using a deep network is enough to get good accuracy. In terms of accuracy, detectors based on deep learning start a notch above detectors based on other methods. In terms of speed, however, detectors based on deep learning still struggle to meet practical needs and may not run in real time (25 fps) even on a GPU. Conversely, once the speed problem is solved, deep learning will find broader and deeper application in object detection.
| Combining traditional face detection techniques with CNNs
Since its introduction, the VJ face detector has inspired and influenced a great deal of subsequent work. The integral image, AdaBoost, and the cascade structure it introduced are still used in various forms in all kinds of detectors. The advantage of traditional face detection techniques is speed, while their accuracy is slightly behind that of methods based on deep networks. A natural question is therefore: can traditional face detection techniques be combined with deep networks (such as CNNs) to raise accuracy further while keeping the speed?
Cascade CNN can be regarded as a representative of combining traditional techniques with deep networks. Like the VJ face detector, it contains multiple classifiers organized in a cascade. The difference is that Cascade CNN uses a CNN as the classifier at each stage, rather than a strong classifier assembled from weak classifiers by AdaBoost, and there is no longer a separate feature extraction step: feature extraction and classification are both done by the CNN. During detection Cascade CNN still uses the traditional sliding-window paradigm. To avoid excessive computational cost, the first-stage CNN contains only one convolutional layer and one fully connected layer, the input image size is limited to 12*12, and the sliding-window stride is set to 4 pixels. This reduces the number of candidate windows per image, since the number of windows falls with the square of the stride, and it keeps the cost of feature extraction and classification for each window strictly under control. After the first-stage CNN, the windows that pass are harder to separate into faces and non-faces, so the second-stage CNN increases the input size to 24*24 to use more information and raises the network complexity: it still contains only one convolutional layer and one fully connected layer, but the convolutional layer has more kernels and the fully connected layer has more nodes. The third-stage CNN follows the same idea, increasing the input size and the network complexity together, with two convolutional layers and one fully connected layer.
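The effect of the stride on the number of windows is easy to check with a little arithmetic. The sketch below counts 12*12 windows at a single scale on a 640*480 image, which is an image size picked here for illustration; the helper name `count_windows` is made up.

```python
def count_windows(img_w, img_h, win, stride):
    """Number of win x win sliding windows at one scale for a given stride."""
    if img_w < win or img_h < win:
        return 0
    across = (img_w - win) // stride + 1
    down = (img_h - win) // stride + 1
    return across * down

for stride in (1, 2, 4):
    print(stride, count_windows(640, 480, 12, stride))

# Going from stride 1 to stride 4 cuts the count by roughly 4 * 4 = 16.
ratio = count_windows(640, 480, 12, 1) / count_windows(640, 480, 12, 4)
print(round(ratio, 1))
```

A full detector repeats this over an image pyramid, so the saving applies at every scale.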
With the introduction of CNNs, the traditional cascade structure shines anew. On FDDB, Cascade CNN reaches a detection rate of 85% at 100 false positives. In terms of speed, for images of size 640*480 with the minimum detectable face size limited to 80*80, Cascade CNN runs at close to 10 fps on a CPU. Cascade CNN also uses several other techniques to secure detection accuracy and speed, such as multi-scale fusion, bounding-box calibration, and non-maximum suppression. For reasons of space they are not covered here.
Taking the essence of traditional face detection techniques, drawing on the latest results of deep learning research, thinking deeply about the problem, and exploring the best way to put new wine in old bottles: this is a road well worth pursuing.
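Of the techniques listed above, non-maximum suppression is simple enough to sketch. The version below is the common greedy variant, given as an illustration rather than as Cascade CNN's exact procedure; the function names, the 0.5 overlap threshold and the example boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    detections is a list of (box, score). Boxes are visited from the
    highest score down, and a box is kept only if it does not overlap an
    already kept box by more than iou_threshold.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept

detections = [
    ((12, 12, 52, 52), 0.8),      # near-duplicate of the first face
    ((10, 10, 50, 50), 0.9),
    ((100, 100, 140, 140), 0.7),  # a second, separate face
]
for box, score in nms(detections):
    print(box, score)
```

A sliding-window detector usually fires on several neighboring windows around the same face, so a step like this is needed to report each face once.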
| Some reflections on the present and the future
After decades of research and development, face detection methods are increasingly mature and are widely used in real-world scenarios. But the face detection problem has not been completely solved. Complex and diverse pose changes, odd occlusions, unpredictable lighting conditions, different resolutions, varying sharpness, subtle differences in skin color: the joint effect of all these internal and external factors makes the variation of faces extremely rich, and no current detector is robust enough to all of these variations at once.
Face detectors can now achieve good performance on FDDB. Many detectors exceed a detection rate of 80% at 100 false positives, which means that on average they produce one false positive for every 40 or more faces detected. So far, the false positive counts and recall rates mentioned in this article correspond to the discrete-score ROC curve on FDDB. "Discrete" means that whether each face is detected is represented by 1 or 0. There is also a continuous-score ROC curve, where "continuous" means that whether a face is detected is represented by the intersection over union between the detection box and the annotation box. In a sense, the continuous score tries to judge the accuracy of the detection box, that is, how close the position and size of the detection box are to the actual position and size of the face. For two different detectors, the relative relationship on the two curves is not always the same: two detectors whose discrete-score ROC curves are close may show a clear gap on the continuous-score ROC curve. Most directly, this shows that some detectors do detect the faces but with relatively inaccurate boxes. In fact, though, another important cause of this inconsistency is the difference between detection boxes and annotation boxes. In FDDB faces are annotated with ellipses, which in most cases cover almost the whole head. Detectors, by comparison, output rectangular face boxes that usually contain only the facial region, especially detectors that use the sliding-window paradigm. This easily makes the intersection over union between the detection box and the annotation ellipse too small, possibly even below 0.5.
For different detectors, the box size that best separates face windows from non-face windows differs, so the detection boxes output by different detectors also differ. Some methods enlarge the detection box or regress an ellipse in order to reduce the effect of the inconsistency between annotation boxes and detection boxes and keep the evaluation fair.
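The gap between the two kinds of score can be seen with one pair of boxes. In this sketch the annotation ellipse is approximated by its bounding rectangle and a 0.5 overlap threshold is used for the discrete score; both are simplifying assumptions for illustration, and the box coordinates are made up.

```python
def intersection_over_union(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def discrete_score(detection, annotation, threshold=0.5):
    """1 if the face counts as detected, 0 otherwise."""
    return 1 if intersection_over_union(detection, annotation) > threshold else 0

def continuous_score(detection, annotation):
    """The overlap itself, so a tighter fit scores higher."""
    return intersection_over_union(detection, annotation)

# Annotation covering the whole head (rectangle standing in for the ellipse).
head = (0, 0, 100, 140)
# A detector that returns only the facial region, centered inside the head.
face_only = (15, 40, 85, 110)
# The same detection after enlarging the box toward the head.
enlarged = (5, 10, 95, 130)

for name, box in (("face_only", face_only), ("enlarged", enlarged)):
    print(name, discrete_score(box, head), round(continuous_score(box, head), 3))
```

The face-only box is correctly centered on the face, yet its overlap with the head-sized annotation is only 0.35, so it would not even count as a detection under the discrete score. Enlarging the box, as some methods do, raises the overlap well above the threshold.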
Beyond the annotation issue, to view evaluation results on FDDB more objectively we need to consider another point: the difference between the face images in the FDDB test set and those in real scenarios. In other words, we need to ask: can the accuracy a face detector achieves on FDDB truly reflect its performance in a real scenario? The faces in the FDDB test images cover variations of every kind, from expression to pose and from lighting to occlusion, so it is a relatively general dataset. In practical applications, however, faces in different scenarios often show quite distinctive characteristics. In video surveillance, for example, cameras are mounted high and have limited resolution, and noise is introduced during storage and transmission, so the faces in the images often have large pitch angles and low sharpness. In that case a detector that performs well on FDDB may not reach satisfactory accuracy. In FDDB about 10% of the faces are smaller than 40*40, while for some tasks such as face recognition, faces that small are not useful. So if a detector performs only moderately on FDDB because it does poorly on small faces, yet on larger faces is not much different from better-performing detectors, then using it for face recognition is perfectly fine, and it may even bring a speed advantage thanks to a simpler model. In short, when facing a specific application scenario, on the one hand we need to analyze the specific problem and cannot blindly draw conclusions from a detector's accuracy on FDDB or other face detection datasets. On the other hand, we need to adapt the existing face detector to the distribution of the data that actually has to be processed, so that the detector reaches better accuracy in that scenario.
Besides FDDB, commonly used face detection benchmarks include AFW, as well as MALF, IJB-A, and Wider Face, which were released in recent years. AFW contains relatively few images, 205 test images with 468 annotated faces in total, but because it covers many face variations and is fairly challenging, it is still commonly used. The other three benchmarks are larger in image count. MALF and Wider Face do not release the face annotations or the evaluation program, so detection results must be submitted to the publishers for evaluation. To some extent this prevents unfair comparisons caused by inconsistent evaluation and prevents overfitting to the test set. These two datasets also split the test set into several subsets by attribute (such as resolution, pose, and difficulty), and evaluation is run on the full set and on the subsets, which reflects the ability of detectors in different scenarios more comprehensively. IJB-A contains not only still face images but also frames extracted from videos. Of all the benchmarks mentioned above, only Wider Face provides dedicated training and validation sets. The others contain only a test set, which creates a problem when comparing methods: it is hard to tell whether differences in detector accuracy come from the training data or from the algorithm and model, or which of the two matters more. Wider Face is probably the hardest benchmark. The annotated faces span a very wide range of poses and occlusions, and faces with a resolution below 50*50 account for 50% (over 80% in the training and validation sets). In some applications (such as face recognition), however, there is no need to focus on small faces.
Although detectors based on deep networks can now reach very high detection accuracy and are very general, their computational cost is very high, so the breakthrough for this type of detector lies in simplifying and accelerating deep networks. In addition, if only faces are to be detected, the classification problem is relatively simple, and it is possible that a small network is already good enough for the task. Detectors that do not use deep learning have considerably lower baseline accuracy but a clear advantage in speed, so the key for them is to make reasonable improvements and adaptations to the specific scenario in order to obtain better accuracy.
To provide more convenient human-computer interaction, to create effective means of visual understanding, and to give machines warmth so that they can observe and feel, researchers continue to explore face detection and generic object detection. One day, when we and a machine look each other in the eye, we will be able to smile at each other: with technology, life gets better!
Postscript
Starting from the face detection task itself, this article described the general pipeline of face detection and then introduced three kinds of face detection methods: traditional methods represented by the VJ face detector, modern methods based on deep networks, and methods that combine traditional techniques with deep networks. In the decades of research on face detection, however, there are many other methods that do not fall into these three categories. The more important ones include methods based on part models and model-based methods. Although this article does not introduce these two classes of methods, they still hold an important place in face detection, and interested readers can read the relevant papers to learn more.
Lei Feng's Network note: this article was originally published by the author in the Deep Learning Forum. For reprints, please contact us for authorization, keep the source and author, and do not delete any content.
Originally published at michaelkors18.wordpress.com on August 11, 2016.