Traditional CNNs perform feature learning in the spatial domain of the image. Due to GPU memory limitations, the input image of a CNN cannot be too large; the most common size is 224x224. The preprocessing and downsampling commonly used in CNNs are fairly crude and lose image information. [1] proposed a model based on the DCT transform, which aims to retain more of the original image information and to reduce the communication bandwidth between CPU and GPU. The experiments also demonstrate the effectiveness of the model.

[1] shows that frequency-domain input features can be applied to all existing CNN models developed in the spatial domain with minimal modification. In ResNet-50, for example, it suffices to remove the input convolution layer and keep the remaining residual blocks: the first residual layer becomes the input layer, with its number of input channels modified to fit the size of the DCT-coefficient input. The modified model thus keeps a parameter count and computational complexity similar to the original model. …
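As a rough illustration of what a frequency-domain input looks like, here is a minimal numpy sketch (not the authors' code; the 8x8 block size, channel layout, and orthonormal normalization are assumptions) that converts a grayscale image into per-frequency channels with a block-wise DCT, producing a 64-channel, spatially downsampled tensor of the kind that could feed the first residual layer:

```python
import numpy as np

def dct2_block(block):
    """2D DCT-II of a square block, computed from the orthonormal cosine basis."""
    n = block.shape[0]
    k = np.arange(n)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)      # DC row has a different normalization
    return basis @ block @ basis.T

def blockwise_dct(image, block=8):
    """Split a grayscale image into block x block tiles and stack the DCT
    coefficients of each frequency as separate channels (JPEG-style layout)."""
    h, w = image.shape
    gh, gw = h // block, w // block
    coeffs = np.zeros((block * block, gh, gw))
    for i in range(gh):
        for j in range(gw):
            tile = image[i * block:(i + 1) * block, j * block:(j + 1) * block]
            coeffs[:, i, j] = dct2_block(tile).ravel()
    return coeffs

img = np.random.rand(224, 224)
freq = blockwise_dct(img)   # shape (64, 28, 28): 64 frequency channels
print(freq.shape)
```

The spatial resolution drops by the block size while the channel count grows, which is why only the input channel count of the first residual layer needs to change.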

Multi-scale features are important for a large number of visual tasks, and many existing network-architecture improvements incorporate multi-scale components. [1] proposes a plug-and-play "Hierarchical-Split Block" (HSB) to improve the performance of existing CNNs. HSB contains multiple split and concat operations, which together constitute the multi-scale feature extraction of the block; at the same time, HSB offers good flexibility and efficiency.

Accuracy-latency comparison of different improved versions of ResNet50; the circle area reflects the parameter count of each model (latency tested on a T4 with FP32 and batch size 1). Orange circles represent different strategies for improving the ResNet model, green circles represent different attention strategies, and red circles are the proposed HS-ResNet. For reference, ResNet101 and ResNet152 are shown as blue circles. Source: [1]

ResNet models built on HSB achieve large performance improvements on multiple tasks. For example, on the ImageNet dataset, HS-ResNet50 reaches 81.28% Top-1 accuracy, exceeding the earlier ResNeSt proposed by Amazon. …
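To make the split-and-concat wiring concrete, here is a minimal numpy sketch of the hierarchical routing (an illustration only: `transform` stands in for the per-group convolution, and the exact split sizes in the paper may differ):

```python
import numpy as np

def hsb_forward(x, transform, splits=4):
    """Wiring sketch of a Hierarchical-Split Block on a (C, H, W) tensor.

    Channels are split into `splits` groups. The first group passes through
    untouched; each later group is transformed, half of the result is kept
    for the block output, and the other half is carried into the next group
    (the hierarchical part)."""
    groups = np.split(x, splits, axis=0)
    outputs = [groups[0]]                 # first group is passed through
    carry = None
    for g in groups[1:]:
        inp = g if carry is None else np.concatenate([g, carry], axis=0)
        out = transform(inp)              # stand-in for a 3x3 convolution
        half = out.shape[0] // 2
        outputs.append(out[:half])        # half goes to the block output
        carry = out[half:]                # half is carried to the next group
    outputs.append(carry)                 # last carry joins the output too
    return np.concatenate(outputs, axis=0)

x = np.random.rand(32, 8, 8)              # 32 channels, 4 groups of 8
y = hsb_forward(x, transform=lambda t: t * 1.0, splits=4)
print(y.shape)                            # channel count is preserved: (32, 8, 8)
```

Because later groups see the carried features of earlier groups, their effective receptive field grows, which is where the multi-scale behavior comes from.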

Semantic segmentation is a basic computer vision task whose purpose is to predict pixel-level classification results for images. Thanks to the vigorous development of deep learning research in recent years, the performance of semantic segmentation models has made considerable progress. However, compared with other tasks (such as classification and detection), semantic segmentation requires pixel-level class labels that are time-consuming and expensive to collect. In recent years, many researchers have devoted themselves to weakly-supervised semantic segmentation (WSSS), using weaker annotations such as image-level classification labels, scribbles, and bounding boxes, trying to achieve segmentation performance comparable to fully-supervised methods. …

A voice signal is a wave that travels through the air and is captured by a microphone. The microphone converts the sound pressure of the wave into an electrical signal, and a discrete-time waveform file is obtained by sampling that signal. The sampling rate for music is usually 44,100 Hz (44,100 sampling points per second). According to the Nyquist theorem, frequencies below 22,050 Hz can be recovered from the samples. The frequency range of speech is usually narrower (below 8,000 Hz), so a sampling rate of 16,000 Hz is typically used. …
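A quick numpy check of that claim: a tone below the Nyquist frequency survives sampling and can be read back exactly from the spectrum of the samples (the 440 Hz tone is just an example frequency):

```python
import numpy as np

sr = 16_000                            # usual speech sampling rate
t = np.arange(sr) / sr                 # 1 second of sample times
wave = np.sin(2 * np.pi * 440 * t)     # a 440 Hz tone, well below 8,000 Hz

spectrum = np.abs(np.fft.rfft(wave))
freqs = np.fft.rfftfreq(len(wave), d=1 / sr)
peak = freqs[np.argmax(spectrum)]
print(peak)                            # recovers 440.0 Hz
```

A tone above 8,000 Hz, by contrast, would alias onto a lower frequency and be indistinguishable from it in the samples.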

In many practical problems, the dependent variable Y has many candidate independent variables. You may then be unsure which independent variables to choose for modeling, and which model is best.

ANOVA table. Image by Author

Suppose there is a linear relationship between the variable Y and the variables X1, X2, …, Xp, which can be modeled as follows:

Y = β0 + β1·X1 + β2·X2 + … + βp·Xp + ε

The linear regression model is the solution obtained by minimizing the residual sum of squares (SSE, Sum of Squared Errors):

SSE = Σᵢ (yᵢ − ŷᵢ)²

Intuitively, for a good linear regression model the predicted values should be close to the true values, so the SSE should be small. However, if we use a small SSE as the only criterion to select the final model, the chosen model will always be the one with all variables added. …
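A small numpy illustration of why SSE alone always favors the full model (the data here are synthetic, with only the first two predictors actually informative): fitting nested subsets of predictors by least squares, the SSE can only decrease as variables are added, even useless ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 5))            # 5 candidate predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def sse(X_sub, y):
    """Fit OLS on a column subset (plus intercept) and return the SSE."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

# SSE is non-increasing as we add predictors 1, 2, ..., 5 in turn.
sses = [sse(X[:, :k], y) for k in range(1, 6)]
print(sses)
```

This is why model-selection criteria penalize the number of variables instead of looking at SSE alone.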

Deep neural networks trained on large-scale image classification datasets (e.g., ImageNet) are usually adopted as backbones to extract strong representative features for downstream tasks such as object detection, segmentation, and human keypoint detection. A good classification network often has strong feature transformation capability and therefore provides powerful representations that benefit downstream tasks. Hence, it is highly desirable to enhance the feature transformation capability of convolutional networks.

Rather than designing a complex network architecture to enhance feature representation, [1] introduces self-calibrated convolutions as an efficient way to help convolutional networks learn discriminative representations by augmenting the basic convolution transformation in each layer. …

Delaunay triangulation maximizes the smallest angle among all possible triangulations of a given input and hence is a powerful discretization tool. However, a Delaunay triangulation can still have arbitrarily small angles, depending on the input configuration.

The Delaunay refinement algorithm is the main idea behind most current constrained Delaunay triangulation algorithms, and it targets the problem that arises when the input constraints contain sharp corners with small angles.

In this kind of method, an initial mesh is built by simply computing the Delaunay triangulation of a set of points. This so-called "coarse" mesh is then iteratively refined by adding one vertex at a time.
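The quality problem that refinement addresses is easy to observe: even though Delaunay maximizes the minimum angle over all triangulations of a fixed point set, an unlucky point set still yields thin triangles. A small scipy-based check (the random point set is just an example; refinement would insert vertices until the minimum angle clears a bound):

```python
import numpy as np
from scipy.spatial import Delaunay

def min_angle_deg(points):
    """Smallest interior angle (degrees) over all triangles of the
    Delaunay triangulation of `points`."""
    tri = Delaunay(points)
    smallest = 180.0
    for simplex in tri.simplices:
        p = points[simplex]
        for i in range(3):
            a, b, c = p[i], p[(i + 1) % 3], p[(i + 2) % 3]
            u, v = b - a, c - a
            cosang = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
            ang = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
            smallest = min(smallest, ang)
    return smallest

rng = np.random.default_rng(1)
pts = rng.random((30, 2))
print(min_angle_deg(pts))   # typically well below the 30-degree quality bound
```

A refinement algorithm would keep inserting circumcenters of bad triangles until every angle exceeds the target bound.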

The first Delaunay refinement algorithm with theoretical guarantees is due to Paul Chew, in 1989, in a research report that was never formally published. Chew's algorithm meshes a polygon in two dimensions with a Delaunay triangulation such that all angles of all triangles inside the polygon are between 30° and 120°. The input polygon, however, cannot have internal angles smaller than 90°. …

Object detection based on LiDAR or RGB-D is widely used in applications such as autonomous driving and machine vision. 3D convolutional networks based on voxelization have been around for some time and retain the information in point cloud LiDAR data more completely. However, problems remain, including slow inference and poor orientation estimation. [1] therefore studies an improved sparse convolution method for this type of network, which significantly improves the speed of training and inference. [1] also introduces a new form of angle-loss regression to improve orientation estimation, and a new data-augmentation method that improves convergence speed and performance. …
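A minimal numpy sketch of the sine-based angle loss idea (the smooth-L1 wrapper and the beta value here are assumptions, not the paper's exact configuration): taking sin of the angle difference gives a smooth, bounded error that treats a box rotated by π as equivalent, avoiding the discontinuity of a raw angle difference at the ±π boundary.

```python
import numpy as np

def sine_angle_error(theta_pred, theta_gt):
    """Sine-based orientation error: smooth, bounded, and identical for a
    box and the same box flipped by pi."""
    return np.sin(theta_pred - theta_gt)

def smooth_l1(x, beta=1.0):
    """Standard smooth-L1, applied here to the sine error."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

# A prediction off by a full pi (a flipped box) incurs essentially zero
# angle loss; the flip itself is resolved by a separate direction classifier.
print(smooth_l1(sine_angle_error(np.pi, 0.0)))     # ~0
print(smooth_l1(sine_angle_error(np.pi / 2, 0.0))) # a 90-degree error is penalized
```

Since the sine error cannot distinguish a box from its 180° flip, the loss is paired with a direction classifier that recovers the heading.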

Efficient point cloud 3D object detection on embedded systems is very important for many robotic applications, including autonomous driving. Most previous work uses anchor-based detection methods, which have two shortcomings:

  • Post-processing is relatively complicated and computationally intensive.
  • Tuning anchor parameters is very tricky.

[1] proposes an anchor-free and NMS-free one-stage end-to-end point cloud 3D object detector (AFDet) with simple post-processing to address these shortcomings.

The framework of the anchor-free one-stage 3D detection (AFDet) system and the detailed structure of the anchor-free detector. The whole pipeline consists of the point cloud encoder, the backbone and necks, and the anchor-free detector. The numbers in square brackets indicate the output channels of the last convolution layer; C is the number of categories used in detection. Best viewed in color and zoomed in for details. Source: [1]

The most important design here is the anchor-free detection head, which consists of five sub-heads that predict object centers in the BEV plane and regress the different attributes of the 3D bounding boxes; the outputs of the five sub-heads are combined to generate the detection results. …
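To see why no NMS is needed with a center head, here is a numpy sketch of NMS-free center selection from a BEV heatmap (a hypothetical stand-in for the max-pooling trick used by keypoint-style detectors; AFDet's exact decoding may differ): a cell is a detection center simply if it is the maximum in its 3x3 neighborhood and above a score threshold.

```python
import numpy as np

def local_peaks(heatmap, threshold=0.5):
    """Return (row, col) indices of cells that are the 3x3-neighborhood
    maximum and exceed the score threshold."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    # 3x3 neighborhood maximum for every cell, via 9 shifted views
    neigh = np.stack([padded[dy:dy + h, dx:dx + w]
                      for dy in range(3) for dx in range(3)]).max(axis=0)
    return np.argwhere((heatmap == neigh) & (heatmap > threshold))

bev = np.zeros((8, 8))
bev[2, 3] = 0.9       # two synthetic object centers
bev[6, 6] = 0.8
print(local_peaks(bev))
```

Each surviving peak is then paired with the regressed offsets, size, and orientation from the other sub-heads to produce a final 3D box, with no per-box overlap suppression.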

Three-dimensional surface reconstruction is a technique for recovering the true three-dimensional surface shape of an object. It is widely used in computer graphics, computer animation, reverse engineering, virtual reality, and other fields. How to construct a 3D surface model quickly, fully automatically, and at low cost is a current hot topic in 3D surface reconstruction research.

Model reconstruction is a common and difficult problem in modern computer graphics and computational geometry. Many applications need a piecewise-linear approximation of real objects to complete the reconstruction task. A three-dimensional digital model of an object is generally obtained with a 3D camera rig that digitizes the real object by taking pictures or videos of a static scene from multiple angles to produce point cloud data, which is then used for reconstruction. …



I am currently a student in Data Science at École Polytechnique, the leading French institution combining top-level research, academics, and innovation.
