## An article to understand exactly how SSD detects objects

Current deep learning-based general object detection frameworks fall mainly into two categories.

The first is deep learning object detection based on candidate region selection. This kind of method works in two steps: the first step generates regions where targets may be present; the second step feeds each region into a classifier, removes the candidate regions with lower confidence, and refines the border position of the regions with higher confidence. The advantage of this type of method is relatively high accuracy; the disadvantage is that the network effectively has to run twice (once to propose regions, once to classify them), which makes it slower. Faster-RCNN is an example.

The other is deep learning object detection based on regression, which works in one step: given an image and a suitably designed output vector, regression directly outputs the border and category of each target. The advantage of this kind of algorithm is speed, but it has difficulty detecting dense small targets. Examples include YOLO and SSD.

YOLO and SSD are both representatives of the “one-step” approach. Their main difference is that YOLO uses only the information of the final feature map, while SSD uses the information of the last several feature maps, so in theory SSD should be more accurate than YOLO (at least YOLOv1).

A convolutional neural network is like a human eye looking at a picture through a small hole. In the shallow layers, the picture is pressed right up against the hole, so only its details and texture are visible, like glimpsing a leopard through a tube and seeing only one spot. As the network gets deeper, it is as if the picture were moved back some distance, so the overall content of the picture can be perceived.

This has been demonstrated in several papers. For example, *Visualizing and Understanding Convolutional Networks* shows that the first few convolutional layers mainly extract detailed information from the image, while the later layers lean toward abstract information.

In short, the feature information of each layer differs mainly in the following ways:

1. Low-level convolutions capture more detailed information, and high-level convolutions capture more abstract information.

2. Low-level features are more concerned with “where” but classify poorly, while high-level features are more concerned with “what” but lose the location information of the object.

SSD uses feature maps of different scales to detect target objects of different sizes and categories in the picture, and obtains good results.

The SSD network structure is shown in the following figure. The front end uses the VGG16 network, and 5 convolutional layers are added after VGG16 to obtain more feature maps for detection.

The feature maps used for detection and their sizes are shown in the following table:

For each feature map, SSD introduces the concept of the initial box (called the default box in the paper, and also known as a prior box). At the center of each feature map cell, a series of initial boxes with different scales and aspect ratios is placed, and each initial box maps back to a position in the original image. If an initial box overlaps heavily with a real target box, that initial box is made responsible for predicting the category of the target, and its shape is fine-tuned through the loss function to fit the annotated real target box.

The initial box has two main parameters: scale $s$ and aspect ratio $a$.

Across feature maps, the general design principle in SSD is that as the network gets deeper (and the feature map becomes smaller), the scale of the initial box increases linearly. The smallest initial box scale is 0.2 and the largest is 0.9.

The initial box scale on each feature map is as follows:
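As a minimal sketch of the linear rule described above: the SSD paper gives the scale of the $k$-th of $m$ feature maps as $s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1)$ for $k \in [1, m]$. The function name `box_scales` and the choice of six feature maps are assumptions made for illustration, and the official source code is reported to deviate from this rule.

```python
def box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced initial-box scales, one per feature map (paper formula)."""
    if num_maps < 2:
        raise ValueError("need at least two feature maps")
    step = (s_max - s_min) / (num_maps - 1)
    # Index k here runs from 0, so it corresponds to (k - 1) in the formula.
    return [s_min + step * k for k in range(num_maps)]


scales = box_scales(6)  # six feature maps is an assumption for this example
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```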

[Note: This is the method described in the paper. In the official source code, SSD seems to set the initial box scale of the first feature map to 0.1 separately and to use the formula above for the rest, and the scale on each feature map is slightly fine-tuned. There are various interpretations on the Internet; if you are interested, see the source code.]

The aspect ratio takes five values (the paper uses 1, 2, 3, 1/2 and 1/3), so the width and height of the initial boxes on each feature map can be derived from the following formula:

For the aspect ratio of 1, the author adds one more square initial box:
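The formulas themselves are missing here. In the SSD paper the width and height are $w = s_k\sqrt{a}$ and $h = s_k/\sqrt{a}$, and the extra square box uses the scale $s'_k = \sqrt{s_k s_{k+1}}$. The sketch below applies those formulas; `box_shapes` and `ASPECT_RATIOS` are hypothetical names, and the two scales in the usage line are simply the first two values of the linear rule.

```python
import math

# Five aspect ratios, as listed in the SSD paper (assumed here).
ASPECT_RATIOS = (1.0, 2.0, 3.0, 1 / 2, 1 / 3)


def box_shapes(scale, next_scale, ratios=ASPECT_RATIOS):
    """Return (width, height) pairs, relative to the image size, for one feature map."""
    shapes = [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in ratios]
    # Extra square box for aspect ratio 1, with scale sqrt(s_k * s_{k+1}).
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes


for w, h in box_shapes(0.2, 0.34):
    print(f"w={w:.3f}  h={h:.3f}")
```

Every box built from a ratio has the same area `scale ** 2`; only its shape changes, which is why the extra square box needs a different scale to be distinct from the ratio-1 box.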

Initial boxes are generated at the center of each feature map cell according to the rules above. This step does not require the input training image data.
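The generation step can be sketched as follows. The cell-center formula, $((i + 0.5)/f, (j + 0.5)/f)$ for a feature map of size $f$, is taken from the SSD paper; the function name `boxes_for_feature_map` and the two example shapes are assumptions for illustration. Note that nothing here depends on image data, only on the feature map size.

```python
def boxes_for_feature_map(fmap_size, shapes):
    """Place every (w, h) shape at the center of every cell.

    Returns boxes as (cx, cy, w, h), all relative to the image size.
    """
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for w, h in shapes:
                boxes.append((cx, cy, w, h))
    return boxes


# Hypothetical example: a 3 x 3 feature map with two box shapes per cell.
example = boxes_for_feature_map(3, [(0.2, 0.2), (0.28, 0.14)])
print(len(example))  # 18
```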

During training, the first thing to determine is which initial box each real target box (ground truth) in the training picture matches; the bounding box predicted from the matching initial box is then responsible for predicting that target. If the IoU between a real box and an initial box is greater than a certain threshold (usually 0.5), the initial box matches the real box. Initial boxes that match a real target box are positive samples, and the others are negative samples.
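A minimal sketch of the threshold-based matching described above, assuming boxes are given as corner coordinates `(xmin, ymin, xmax, ymax)`. The helper names `iou` and `match_positives` and the example boxes are hypothetical and are not taken from the SSD source code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_positives(initial_boxes, real_boxes, threshold=0.5):
    """Map each positive initial box index to the real box it overlaps most."""
    positives = {}
    for i, box in enumerate(initial_boxes):
        best_j, best_iou = None, threshold
        for j, real in enumerate(real_boxes):
            overlap = iou(box, real)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            positives[i] = best_j
    return positives


initial = [(0.0, 0.0, 0.5, 0.5), (0.1, 0.1, 0.6, 0.6), (0.5, 0.5, 1.0, 1.0)]
real = [(0.05, 0.05, 0.55, 0.55)]
print(match_positives(initial, real))  # {0: 0, 1: 0}
```

Initial boxes that do not appear in the returned mapping are the negative samples.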

The loss is then calculated from the category of the real target box and the offset of the initial box from the target box.

The loss function is as follows, where $\alpha$ is the weight that balances the two loss terms:

where $L_{conf}$ is the classification loss, with the following formula:

As you can see, it is essentially a softmax loss. Note that to predict N categories, the network needs N+1 outputs; the extra category is the background.

$L_{loc}$ is the localization loss, expressed as follows:

$x_{ij}^{p}$ is an indicator variable: if the $i$-th initial box matches the $j$-th real target box of category $p$, its value is 1, otherwise 0.
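The formulas are missing from this section, so the sketch below follows the SSD paper: $L = \frac{1}{N}(L_{conf} + \alpha L_{loc})$, where $N$ is the number of positive samples, $L_{conf}$ is a softmax cross-entropy over N+1 classes, and $L_{loc}$ is a Smooth L1 loss on the box offsets of positive samples. The function names, the argument layout, and the convention that label 0 means background are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5


def ssd_loss(class_logits, labels, pred_offsets, target_offsets, alpha=1.0):
    """labels[i] is 0 for background (negative) or 1..N for a matched class."""
    num_pos = sum(1 for y in labels if y > 0)
    if num_pos == 0:
        return 0.0  # the paper sets the loss to 0 when nothing matches
    conf = 0.0
    loc = 0.0
    for logits, y, pred, target in zip(class_logits, labels, pred_offsets, target_offsets):
        conf += -math.log(softmax(logits)[y])
        if y > 0:  # localization loss only counts positive samples
            loc += sum(smooth_l1(p - t) for p, t in zip(pred, target))
    return (conf + alpha * loc) / num_pos


# One positive box (class 1) and one background box, 2 classes + background.
loss = ssd_loss(
    class_logits=[[0.0, 2.0, 0.0], [3.0, 0.0, 0.0]],
    labels=[1, 0],
    pred_offsets=[[0.1, 0.0, 0.2, 0.0], [0.0, 0.0, 0.0, 0.0]],
    target_offsets=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
)
print(round(loss, 4))
```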

The number of targets in a picture is generally small, yet the process above generates more than 8000 initial boxes. Very few initial boxes can therefore match a real target box, so positive samples are far outnumbered by negative samples, which makes training hard to converge.
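To see where a figure above 8000 comes from, the count can be reproduced with a few lines. The feature map sizes and the number of boxes per cell below are the SSD300 configuration reported in the SSD paper; they are assumed here because the table of feature maps is missing from this article.

```python
# Assumed SSD300 configuration: (feature map size, initial boxes per cell).
FEATURE_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total_boxes = sum(size * size * per_cell for size, per_cell in FEATURE_MAPS)
print(total_boxes)  # 8732
```

With only a handful of real target boxes per picture, almost all of these initial boxes end up as negative samples.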

Two measures address this imbalance:

(1) Increase the number of positive samples. For each remaining unmatched initial box, if its IoU with a real box is greater than a certain threshold, the initial box also matches that real target box. That is, one real target box may match multiple initial boxes, which increases the number of positive samples.

(2) Reduce the number of negative samples. Although (1) increases the number of positive samples as much as possible, real target boxes are still few compared with initial boxes, so negative samples still far outnumber positive ones. To keep the positive and negative samples as balanced as possible, part of the negative samples is discarded. Specifically, the initial boxes that remain unmatched after (1) are sorted by their confidence loss on the background class, only the boxes with the highest loss are selected to calculate the loss function, and the remaining negative initial boxes are discarded, so that the ratio of negative to positive samples is at most 3:1.
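Step (2) is usually called hard negative mining. The sketch below assumes the sort key is the confidence loss of each box, as in the SSD paper, so the negatives the model gets most wrong are kept. The function name `hard_negative_mining` and the example values are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the indices of the boxes that contribute to the loss.

    All positives are kept; negatives are limited to neg_pos_ratio per positive,
    choosing those with the largest confidence loss first.
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    keep = neg_pos_ratio * len(positives)
    return sorted(positives + negatives[:keep])


conf_losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.9, 0.2]
is_positive = [True, False, False, False, False, False, False]
print(hard_negative_mining(conf_losses, is_positive))  # [0, 1, 4, 5]
```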

The prediction process is relatively simple. First, the picture to be predicted is fed into the network, and the initial boxes are adjusted according to the output offsets to obtain the prediction boxes. The category of each prediction box is determined from its category confidences, and prediction boxes belonging to the background are filtered out. Prediction boxes whose confidence falls below a threshold (for example, 0.5) are then filtered out as well. Finally, the NMS (non-maximum suppression) algorithm is run on the remaining prediction boxes to remove those that overlap heavily. The prediction boxes that remain are the detection result.
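The filtering steps can be sketched as follows, assuming the boxes have already been decoded to corner coordinates and that class 0 is the background. The function names, the per-class greedy NMS, and the overlap threshold of 0.45 are assumptions made for illustration, not a description of the official implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def postprocess(detections, conf_threshold=0.5, nms_threshold=0.45):
    """detections: list of (score, class_id, box). Returns the kept detections."""
    # Drop background predictions and low-confidence boxes.
    candidates = [d for d in detections if d[1] != 0 and d[0] >= conf_threshold]
    candidates.sort(key=lambda d: d[0], reverse=True)
    kept = []
    for score, cls, box in candidates:
        # Greedy NMS: keep a box unless it overlaps a kept box of the same class.
        if all(cls != k_cls or iou(box, k_box) <= nms_threshold
               for _, k_cls, k_box in kept):
            kept.append((score, cls, box))
    return kept


detections = [
    (0.90, 1, (0.0, 0.0, 1.0, 1.0)),
    (0.80, 1, (0.05, 0.05, 1.0, 1.0)),  # overlaps the first box, same class
    (0.70, 2, (0.0, 0.0, 1.0, 1.0)),    # same place, different class
    (0.30, 1, (2.0, 2.0, 3.0, 3.0)),    # below the confidence threshold
    (0.95, 0, (0.0, 0.0, 1.0, 1.0)),    # background
]
for score, cls, box in postprocess(detections):
    print(score, cls, box)
```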

Because deep learning is largely a process of empirical parameter tuning (often jokingly called alchemy), some of the structures the author designed may have come from intuition, but it is still worth demonstrating that the design has advantages, so the author first ran several sets of verification experiments.

The author’s verification method is to gradually remove the initial boxes with aspect ratios 1:3 and 1:2 and observe the performance of the model. After removing the 1:3 initial boxes, the accuracy of the model dropped by about 0.6 percentage points, and after further removing the 1:2 initial boxes, it dropped by another 2.1 percentage points, which shows that using initial boxes of different aspect ratios has a large impact on the accuracy of the model.

The author also tests the effect of different feature maps on model accuracy by gradually removing feature maps. For fairness, whenever a feature map is removed, the author adds some initial boxes to the other feature maps so that the total number of initial boxes stays almost the same across all models.

As the feature maps used for detection are gradually reduced, the model accuracy drops from 74.3% to 62.4%, which indicates that detecting with multi-scale feature map information does improve accuracy. Note that accuracy actually increased after removing the conv11_2 feature map. The explanation the author gives is that the initial boxes on this feature map are about 270 * 270 in the original image, and the training set contains few targets that large, so the layer introduces some error. (This further illustrates that the size setting of the initial boxes has a large impact on model accuracy.)

The following is a comparison of recognition accuracy with Fast-RCNN and Faster-RCNN on different target objects.

SSD and Faster-RCNN are almost identical in accuracy. In actual testing, Faster-RCNN is better at detecting small target objects.

The following is a comparison of speed. SSD has an obvious speed advantage and, in this comparison, is the only real-time detection algorithm with an mAP above 70%.

The SSD paper and the actual source code differ in some respects, especially in how initial boxes are chosen. This article interprets the algorithm mainly from the paper. In practice-oriented fields such as computer science, some ideas are hard to put into writing, so if the source code is available, it is worth working through it to understand the algorithm better. Where the two disagree, the author’s source code prevails. So far I have only run SSD-like algorithms, and I hope to explore SSD through its source code in the future.

Finally, I sincerely admire the author’s novel ideas. Since 2016, however, the original author has published no further improvements, while YOLO has been updated to v3 and the RCNN series has reached its third generation. I hope the SSD authors keep it up.

1. [Visualizing and Understanding Convolutional Networks Translation Summary]

2. [SSD: Single Shot MultiBox Detector Translation]