TV Stations Logo Recognition
Implementing a logo detection application based on Deep Learning deployed on Flutter.
AI is revolutionizing systems and driving a global paradigm shift.
In countries like Brazil, with its powerful media environment, this matters because citizens have no visibility into how their public measurement systems work, e.g.:
- Audience measurement (a monopoly held by an unreliable institute)
- Election polls (run by manipulated institutes)
I was part of the Beep App, which proposed giving the population the ability to interact with audience data while delivering trustworthy, real-time information to advertisers and media agents.
What to expect from machines in object recognition?
Machines are fantastic: nowadays they are better than humans at recognizing patterns, shapes, colors, and human faces. What's more, they surpass our hearing and can clone a person's voice in a matter of seconds.
Deep learning models have outperformed human-level detection in object recognition on the ImageNet competition since 2015. This kind of technology has wide applications: surveillance, automatic identification, mapping resources with drones, and so on.
How to automate an audience measurement system?
To achieve this goal, I trained 2 deep learning models on the Darknet neural network framework: the first captures the TV station logo from a close distance for higher detection accuracy, and the second works at a greater distance for more convenient smartphone handling.
The logo classes to detect:
Here is a demonstration of the final trained convolutional neural network:
I decided to use YOLOv2 (You Only Look Once) because of its excellent accuracy combined with a shorter training time. I couldn't use YOLOv3 because some of its operations aren't supported in the conversion to TensorFlow Lite (FlatBuffer model). I also had to build a lightweight version of the model using the "Tiny" configuration of YOLO so it would work properly on mobile.
PROJECT GITHUB:
This project is related to my GitHub repository: (Link)
Part 1: Getting Data
For the close distance model I got the data from Google Images using the Python library google_images_download (Link); I demonstrate its use in my GitHub web-scraping script, where I collected all 483 Brazilian TV station logos (Link).
For the greater distance model, I got the data from recorded videos of TV programs on YouTube, with the logos in the corner of the screen. I then extracted at least 2,000 full-screen images from each TV station using FFmpeg for video manipulation.
ffmpeg -i RECORD1.mp4 -vf "fps=1" -q:v 2 record_%03d.jpeg
With this command I extracted 2,000 frames from the video (RECORD1.mp4), one per second, generating images record_xxx.jpeg to feed the model.
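The same extraction can be scripted for a batch of recordings; a minimal sketch (the helper name and defaults are mine, not part of the project):

```python
import shlex

def ffmpeg_extract_cmd(video, out_prefix, fps=1, quality=2):
    # Builds the same command as above: one frame per second,
    # JPEG quality 2, files named <prefix>_001.jpeg, <prefix>_002.jpeg, ...
    return (
        f"ffmpeg -i {shlex.quote(video)} "
        f"-vf fps={fps} -q:v {quality} {out_prefix}_%03d.jpeg"
    )

print(ffmpeg_extract_cmd("RECORD1.mp4", "record"))
# → ffmpeg -i RECORD1.mp4 -vf fps=1 -q:v 2 record_%03d.jpeg
```

Running the returned string per video (e.g. via subprocess) reproduces the frame dump for every recording.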
Part 2: Data Augmentation
For this purpose, I used the vision module of the fastai deep learning library on Google Colab. The process is explained in the notebook:
This notebook applies all kinds of transformations to the TV logo images: illumination changes, resizing, cropping, viewing angle, blur, and so on. Data augmentation is a way to artificially expand your dataset and improve the final accuracy of the model. Here, data augmentation of the Globo television logo:
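The actual transforms live in the notebook; as a rough stdlib-only illustration of the idea, one can sample a random perturbation configuration per image (all ranges below are my own illustrative assumptions, not the notebook's values):

```python
import random

def sample_augmentation(seed=None):
    # Sample one random transform configuration, mirroring the kinds of
    # perturbations applied in the notebook (illumination, size, crop,
    # viewing angle, blur). Ranges are illustrative assumptions.
    rng = random.Random(seed)
    return {
        "brightness": rng.uniform(0.6, 1.4),   # illumination change
        "scale": rng.uniform(0.8, 1.2),        # size
        "rotate_deg": rng.uniform(-15, 15),    # angle of view
        "crop_frac": rng.uniform(0.0, 0.1),    # crop margin
        "blur_radius": rng.uniform(0.0, 2.0),  # blur
    }

# Each call yields a fresh configuration to apply to one logo image:
params = sample_augmentation(seed=0)
```

Applying a few such configurations per source image multiplies the dataset without collecting new photos.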
Part 3: Annotation
So I had at least 2,000 augmented logo images for each TV station. Ready to train? Not yet: training requires a step called annotation, which records the exact shape and position of each logo on screen.
Yolo Mark was chosen because it easily produces annotations in the Darknet standard. For each image, the software generates a .txt file containing the class and the coordinates of the bounding boxes.
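Each annotation line follows the Darknet convention: a class id followed by the box center, width, and height, all normalized by the image dimensions. A small sketch of the conversion (the helper name is mine):

```python
def to_darknet(cls_id, box, img_w, img_h):
    # Convert a pixel-space box (x_min, y_min, x_max, y_max) to the
    # Darknet annotation line: "class x_center y_center width height",
    # with all coordinates normalized to [0, 1].
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A 100x50 logo in the top-left corner of a 1280x720 frame:
print(to_darknet(0, (0, 0, 100, 50), 1280, 720))
```

This is exactly the format Yolo Mark writes to the per-image .txt files.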
Part 4: Training on Darknet
The CNN (Convolutional Neural Network) processes images through many layers, extracting patterns of shapes and colors with filters (kernels); the result then goes to a classification stage that decides whether the object belongs to a specific class. All of these parameters can be configured for training.
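As a toy illustration of what a single filter does, here is a minimal "valid" 2-D convolution in plain Python (in practice the framework runs this on the GPU across many filters and layers):

```python
def conv2d(image, kernel):
    # Minimal 'valid' 2-D convolution (cross-correlation), the core
    # operation a CNN layer applies with each of its filters (kernels).
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge kernel responding strongly to a brightness step:
img = [[0, 0, 1, 1]] * 4
kernel = [[-1, 0, 1]] * 3
print(conv2d(img, kernel))  # → [[3, 3], [3, 3]]
```

The high responses mark where the dark-to-bright edge sits; stacking many such filters is how the network learns shape and color patterns.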
I used the AlexeyAB fork, a popular version of Darknet with more options that is faster and more accurate than the original Darknet program.
In order to start training, you basically need 3 files:
1- (.data file): gives Darknet the paths to the image files and .txt annotations, plus the classes.
2- (darknet19_448.conv.23): pre-trained weights used to transfer learning into the new model instead of starting from scratch.
3- (.cfg file): the configuration file. I used the YOLOv2 Tiny file and configured parameters such as filters, batch size, and subdivisions.
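One of those .cfg edits is the filter count of the convolutional layer just before the region layer; in YOLOv2 it follows a fixed formula (shown here assuming the default 5 anchors):

```python
def yolov2_region_filters(num_classes, num_anchors=5):
    # filters = anchors * (classes + 4 box coords + 1 objectness score)
    return num_anchors * (num_classes + 5)

# e.g. a single-class logo detector:
print(yolov2_region_filters(1))  # → 30
```

If this number doesn't match the class count in the .data file, Darknet fails at startup, so it's worth checking before launching a long training run.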
Then I started to train:
./darknet detector train logopaca.data cfg/logopaca_yolov2-tiny.cfg darknet19_448.conv.23 -map
I added the "-map" option to the command in order to visualize the history of the loss function, a penalty score for false positives and false negatives, alongside the mAP indicator, which calculates the mean average precision of the model at regular intervals during training. It's really useful for picking the best checkpoint.
I decided to use the 14,000-iteration weights instead of later iterations because:
1- The variation in the loss value was almost null at 14,000, which indicates a reduced chance of overfitting compared to later iterations.
2- The mAP was a good 98.74%, which indicates excellent detection precision.
3- The F1-score was 0.96; it considers both precision and recall, so it shows how the model performs in detecting the specific TV station logos while also accounting for false negatives.
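For reference, the F1-score is just the harmonic mean of precision and recall (the numbers below are illustrative, not the model's exact per-class values):

```python
def f1_score(precision, recall):
    # Harmonic mean: punishes a model that trades recall for precision.
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.97, 0.95), 2))  # → 0.96
```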
Here is a summary of the 14,000-iteration weights file, with the precision for each TV station:
As a result, with the .weights file I started translating the Darknet model to the TensorFlow format using Darkflow. This command converts it to the ProtoBuf (.pb) format:
sudo flow --model logopaca_yolov2-tiny.cfg --load logopaca_yolov2-tiny_14000.weights --savepb
From Darkflow: "When saving the .pb file, a .meta file will also be generated alongside it. This .meta file is a JSON dump of everything in the meta dictionary that contains information necessary for post-processing such as anchors and labels. The created .pb file can be used to migrate the graph to mobile devices (JAVA / C++ / Objective-C++)."
With the .pb and .meta files, I converted the graph to the TensorFlow Lite (.tflite) format using "tflite_convert" from the TensorFlow library on Ubuntu:
tflite_convert --graph_def_file=built_graph/logopaca_yolov2-tiny.pb --output_file=built_graph/logopaca_yolov2_tiny_far.lite --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE --input_shape=1,416,416,3 --input_array=input --output_array=output --inference_type=FLOAT
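The input_shape=1,416,416,3 pins the input tensor; on the output side, YOLOv2-tiny downsamples by a factor of 32, so the grid and channel depth can be sanity-checked (the 5 anchors are YOLOv2's default, and num_classes=10 below is a placeholder, not the project's actual class count):

```python
def yolov2_output_shape(input_size=416, stride=32, num_anchors=5, num_classes=10):
    # Each grid cell predicts num_anchors boxes, each carrying
    # 4 coordinates + 1 objectness score + num_classes class scores.
    grid = input_size // stride
    return (grid, grid, num_anchors * (5 + num_classes))

print(yolov2_output_shape())  # → (13, 13, 75)
```

Knowing this shape is what lets the post-processing step (using the anchors and labels from the .meta file) decode the raw output tensor back into bounding boxes.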
Part 5: Deploy on Flutter
Flutter added support for image streaming in the camera plugin. This makes it possible to capture individual frames from the camera preview, so I was able to connect this stream to a tflite (TensorFlow Lite) plugin and run object detection on the frames in real time.
Camera Plugin: https://pub.dev/packages/camera
First, I called startImageStream on the camera controller to receive each CameraImage img with its image format, height, width, and planes (the raw bytes of the image).
controller.startImageStream((CameraImage img) {
  if (!isDetecting) {
    isDetecting = true;
    int startTime = new DateTime.now().millisecondsSinceEpoch;
The tflite plugin wraps the TensorFlow Lite API and supports object detection (SSD and YOLO), Pix2Pix, Deeplab, and PoseNet on both iOS and Android.
The plugin's detectObjectOnFrame function receives these parameters and returns the recognitions:
Tflite.detectObjectOnFrame(
  bytesList: img.planes.map((plane) {
    return plane.bytes;
  }).toList(),
  model: widget.model == yolo ? "YOLO" : "SSDMobileNet",
  imageHeight: img.height,
  imageWidth: img.width,
  imageMean: widget.model == yolo ? 0 : 127.5,
  imageStd: widget.model == yolo ? 255.0 : 127.5,
  numResultsPerClass: 1,
  threshold: widget.model == yolo ? 0.2 : 0.4,
).then((recognitions) {
  int endTime = new DateTime.now().millisecondsSinceEpoch;
  print("Detection took ${endTime - startTime}");
  widget.setRecognitions(recognitions, img.height, img.width);
  isDetecting = false;
});
To output the recognitions, I chose to put the preview inside RotatedBox and OverflowBox widgets. The first rotates the image according to the device orientation (landscape or portrait); the second fits the preview exactly to the screen size.
Widget build(BuildContext context) {
  if (controller == null || !controller.value.isInitialized) {
    return Container();
  }
  var tmp = MediaQuery.of(context).size;
  var screenH = math.max(tmp.height, tmp.width);
  var screenW = math.min(tmp.height, tmp.width);
  tmp = controller.value.previewSize;
  var previewH = math.max(tmp.height, tmp.width);
  var previewW = math.min(tmp.height, tmp.width);
  var screenRatio = screenH / screenW;
  var previewRatio = previewH / previewW;

  return RotatedBox(
    quarterTurns: MediaQuery.of(context).orientation == Orientation.landscape ? 3 : 0,
    child: OverflowBox(
      maxHeight: MediaQuery.of(context).orientation == Orientation.landscape
          ? screenW / previewW * previewH
          : screenH,
      maxWidth: MediaQuery.of(context).orientation == Orientation.landscape
          ? screenW
          : screenH / previewH * previewW,
      child: CameraPreview(controller),
    ),
  );
}
}
For the bounding boxes on detections, I took the base code from the ticketdamoa project and made some modifications to improve device compatibility and to show the class name on each box.
Widget build(BuildContext context) {
  List<Widget> _renderBoxes() {
    return results.map((re) {
      var _x = re["rect"]["x"];
      var _w = re["rect"]["w"];
      var _y = re["rect"]["y"];
      var _h = re["rect"]["h"];
      var scaleW, scaleH, x, y, w, h;

      scaleH = screenW / previewW * previewH;
      scaleW = screenW;
      var difH = (scaleH - screenH) / scaleH;
      x = _x * scaleW;
      w = _w * scaleW;
      y = (_y - difH / 2) * scaleH;
      h = _h * scaleH;
      if (_y < difH / 2) h -= (difH / 2 - _y) * scaleH;

      return Positioned(
        left: math.max(0, x),
        top: math.max(0, y),
        width: w,
        height: h,
        child: Container(
          padding: EdgeInsets.only(top: 5.0, left: 5.0),
          decoration: BoxDecoration(
            border: Border.all(
              color: Color.fromRGBO(37, 213, 253, 1.0),
              width: 3.0,
            ),
          ),
          child: Text(
            "${re["detectedClass"]} ${(re["confidenceInClass"] * 100).toStringAsFixed(0)}%",
            style: TextStyle(
              color: Color.fromRGBO(37, 213, 253, 1.0),
              fontSize: 14.0,
              fontWeight: FontWeight.bold,
            ),
          ),
        ),
      );
    }).toList();
  }

  // Assumed closing step: compose the boxes into a Stack over the preview.
  return Stack(children: _renderBoxes());
}
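The scaling arithmetic in _renderBoxes can be checked in isolation; here is the same mapping sketched in plain Python (the screen and preview sizes in the example are illustrative):

```python
def map_box(box, screen_w, screen_h, preview_w, preview_h):
    # Map a normalized YOLO box (x, y, w, h in [0, 1]) from the camera
    # preview's aspect ratio onto a screen that the preview overflows
    # vertically (same arithmetic as the Dart _renderBoxes above).
    _x, _y, _w, _h = box
    scale_w = screen_w
    scale_h = screen_w / preview_w * preview_h   # preview stretched to screen width
    dif_h = (scale_h - screen_h) / scale_h       # fraction cropped vertically
    x = _x * scale_w
    w = _w * scale_w
    y = (_y - dif_h / 2) * scale_h
    h = _h * scale_h
    if _y < dif_h / 2:                           # clip boxes entering the cropped band
        h -= (dif_h / 2 - _y) * scale_h
    return x, y, w, h

# A centered box on a 1080x1920 screen with a 720x1280 preview:
print(map_box((0.5, 0.5, 0.2, 0.2), 1080, 1920, 720, 1280))  # ≈ (540, 960, 216, 384)
```

When screen and preview share the same aspect ratio (as here), dif_h is zero and the mapping reduces to a plain rescale; the crop correction only activates on mismatched ratios.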
The Final Result:
Close Distance:
Far Distance: