Interview with Dr. Zhou Xi: From Voice to Image, Hoping Technology Is Really “Useful”

Published in SyncedReview, Apr 13, 2017

How can an image recognition company founded less than two years ago win over so many bank clients?

Since CloudWalk’s establishment in April 2015, Haitong Securities, Xi’an Bank, China Construction Bank, and many other financial institutions have adopted the company’s facial recognition system. In September 2016, the Agricultural Bank of China (ABC) initially rolled out the technology to 37 branches, becoming the first of the Big Four commercial banks to use face recognition technology. (The Big Four are China’s four main banks: Industrial and Commercial Bank of China, Agricultural Bank of China, Bank of China, and China Construction Bank.)

Zhou Xi, who holds a Ph.D. in computer vision, is the founder of CloudWalk. In 2011, Dr. Zhou Xi entered the Chinese Academy of Sciences “Hundred Talents Program”. Working jointly with UIUC (University of Illinois at Urbana-Champaign) and the National University of Singapore, he established the Chongqing Institute of the Chinese Academy of Sciences. He led a team that developed several intelligent image products, such as an intelligent image detector, a face recognition intelligent personnel management system, and a large-scale dynamic population characteristics detection system. He is also the only face recognition representative of the Chinese Academy of Sciences to take part in the Class A strategic pilot special project “Xinjiang security control”.

Today, CloudWalk’s competitiveness rests on years of disciplined research and technology development, as well as accumulated experience. The CloudWalk team intends to help more people by refining technologies such as portable large-scale data acquisition arrays and double-layered heterogeneous deep neural networks.

Synced had the chance to interview Dr. Zhou Xi about his personal experience, CloudWalk’s development, and future trends. Here are the details of the interview.

From Voice to Image, Hoping Technology Is Really “Useful”

Synced: Hello, Dr. Zhou! Why did you choose image recognition as your career?

Zhou Xi: At first I studied voice at the University of Science and Technology of China. Later I went to Beijing and spent quite a lot of time in the speech recognition group at Microsoft Research Asia. I made a wrong judgment at the time, thinking that voice had no future, but somehow it turned out to be the right decision. Compared with the voice field, recognition of images and video is a much larger area that can solve more problems. Also, from an information analysis perspective, voice is a one-dimensional signal, an image is a two-dimensional signal, and video is a three-dimensional signal. Images can deliver more information than voice. In voice, the main task is to distinguish speech (people’s voices) from other sound (background noise).

Images and video are completely different. Take face recognition: we need to identify a person’s face in the image, as well as their emotions, age, and gender. However, image and video recognition is much more than that. In medicine, early-stage cancer and other diseases can be identified through image processing. With image recognition and big data analysis, most of the suspicious cases can be found, and then professionals can double-check them. Don’t you think that is a more efficient way to save people’s lives? It can also be applied in industrial vision to detect defective samples: Are there any cracks? Is the surface flat? Autonomous driving will also benefit, because image recognition can show the road conditions.

As for images themselves, it is meaningful to identify everything in the universe, not just people’s faces. Around that time I saw a news story about people who had installed an underwater camera in a swimming pool that could automatically identify whether swimmers were drowning. I found that quite interesting.

Synced: Could you talk about your professor, Thomas Huang? What is he like?

Zhou Xi: He is a master who provides a relaxed environment, and he gave us a great platform and high-level guidance. He is definitely one of the top professors in this area, and he always shows his students a big vision.

Professor Thomas S. Huang has made foundational contributions to image processing, pattern recognition, and computer vision.

Synced: How did you come to start CloudWalk after that?

Zhou Xi: I feel I was lucky. Voice and image are both branches of artificial intelligence, and more specifically of machine learning. The disciplines overlap, and a lot of things can be reused. At that time voice recognition was more advanced than image recognition because it had already reached a systematic stage. When I was in the United States, image recognition had not reached that stage. Most researchers in the image field were still focused on a single task or an individual server, whereas in the voice field cluster server arrays were already in common use. I built a cluster server for image processing right after I arrived at UIUC, which put me far ahead of everyone else. There are a lot of good algorithms and ideas in the voice field, and I also applied them to images. That is why I won a lot of championships from 2006 to 2010. After that, I started thinking about how to make this technology “meaningful” and helpful to others in a variety of settings. We tried face recognition first because we believe the face is very important among the many kinds of objects. At the beginning of 2014, I went ahead and trialed “Smile to Pay” on a mobile platform. However, it was useless because no one really used it. Indeed, which financial institution would take it seriously? That was when I started to realize the importance of a business model. So I started CloudWalk.

Years of Accumulation Lead to the Clients’ Choice

Synced: How did CloudWalk win so many bank customers in such a short time?

Zhou Xi: I have been trying to make the technology practical for many years. From an academic demo to a commercial product, there is a long way to go. In fact, some of the systems we designed have been in use in Xinjiang and other regions since 2011. The product was mature but had not yet been commercialized. I think focus matters. We still concentrate our research on face recognition, and on the application side we focus only on finance and security.

Synced: What are the specific characteristics of the banking sector?

Zhou Xi: It requires not only stability but also a very fast response. There is a strict principle in the financial system called “2 hours, 4 hours, and 8 hours”. The bank’s director is made aware of the issue if the system is down for 2 hours; an error report is required if the issue lasts 4 hours; and it becomes a serious accident if it reaches 8 hours, which will certainly damage the bank’s reputation. For IT suppliers like us, how can we guarantee that we can fix any issue within 2 hours? That is a big challenge. Even though CloudWalk was established only a few years ago, we take every piece of business seriously. We now have service centers in the top 10 major cities and our own sales staff in each province. We promise that we will be there whenever an issue occurs.

Banks place great value on the sales and service system, while Internet-model companies do not.

Synced: What is the super-large-scale mobile data acquisition array? How is it used?

Zhou Xi: It was actually inspired by what I learned from medicine. When you take a CT slice for medical use, a shot has to be captured at every degree across the full range, because what is being measured is light. These shots are combined into a table, and the table can then be inverted to solve the problem, so that there is no wrong surgery and no misjudgment. I learned a lot from studying medicine. Now think about data analysis. Mining data is easy, but structuring data is not. That was the first thing I needed to figure out when I came back from the U.S. Even if you download 10 million faces from the Internet, or install cameras in the street to record faces, the data are unstructured. For each individual face, we need to ask ourselves: What is the lighting? Where does the light come from? Is anything covering the face? What is the expression? Is there any blur? It is difficult to go back and answer these one by one.

Coming back to the super-large-scale mobile data acquisition array, it took us a while to build it. We installed a camera every 5 degrees, horizontally from -30 degrees to 30 degrees and vertically from 0 degrees to 30 degrees. All the cameras are high-speed cameras imported from Canada; with 7 layers and 13 columns, a total of 91 cameras form the array. The array structure is detachable. We built a synchronization unit to ensure millisecond-level synchronized triggering of acquisition. Storage also has to keep up, because the video data are very large.
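The camera counts quoted above follow directly from the stated angular spacing. As a minimal sketch (the function and variable names are hypothetical, and the actual rig layout is not described beyond the angles), the grid can be enumerated like this:

```python
# Illustrative sketch only: enumerate the mounting angles implied by the
# interview (one camera every 5 degrees, horizontal -30..30, vertical 0..30).
# Names such as camera_angles are hypothetical, not CloudWalk's.

STEP_DEG = 5


def camera_angles(h_min=-30, h_max=30, v_min=0, v_max=30, step=STEP_DEG):
    """Return one (vertical, horizontal) angle pair per camera position."""
    horizontal = list(range(h_min, h_max + 1, step))  # columns
    vertical = list(range(v_min, v_max + 1, step))    # layers
    return [(v, h) for v in vertical for h in horizontal]


angles = camera_angles()
n_layers = len({v for v, _ in angles})
n_columns = len({h for _, h in angles})

print(n_layers, n_columns, len(angles))  # 7 13 91
```

The result matches the figures given in the interview: 7 layers, 13 columns, 91 cameras.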

The capture space is calibrated and the subject’s face is in a fixed position, so with the light arrays applied you can obtain the lighting and angle attributes. We also designed scripts to set the expression and occlusions involving wigs, hats, the eyes, and so on. However, that is still not enough. The laboratory environment is limited, so the array has to be movable if we want to repeat the test outside the lab. Most banking business happens in the lobby, so we need to collect data in lobbies; public security sometimes monitors passageways, so we need to collect data in passageways to see what the specific conditions are.

Why do we still need to work hard to produce structured data nowadays? We often say that society is the best school, but we still set up primary schools, middle schools, and universities. Children learn a structured system of knowledge at school, so that they have core values and basic judgment when they are confronted with a wide range of data and information. In the same way, the structured data system always comes first, and then we need tons of unstructured data to refine the model.

Synced: Could you explain the double-layered heterogeneous deep neural network to Synced’s readers? How is this technology combined with face recognition?

Zhou Xi: Let me give you an example. ID photos are used on most formal occasions because they are more accurate. A photo taken on site, however, needs a complex network to handle, because it is affected by variable factors such as lighting, expression, and angle. Photos of the same person may look totally different across the years, so we cannot simply put two photos side by side. The appropriate approach is to form a distribution on each of the two layers and connect them. On the other hand, big data can learn a feature as long as there is enough data to learn from. People already know about the impact of lighting, occlusion, and expression, so we can have the network learn these explicitly, which saves a lot of time in advance.

Heterogeneous refers to a difference in structure. For example, if you want to teach a child about “apple”, three characteristics may be enough, such as “round”, “has a stem on top”, and “how it feels to the touch”. The next time you show him an apple and ask “What is this?”, he may or may not know. If he does not know, you can tell him, “This is the apple I told you about last time.” He may ask, “Why is the color different?” Then you can answer, “Last time it was a green apple; this time it is a red apple.” Eventually he knows that apples come in different colors. Now think about deep learning in the same way. How much data do you need to train a machine to recognize “apple”? Usually 1,000 or 10,000 apple samples for training, and the recognition rate after training should be about 90%. That is to say, out of 10,000 apples, 1,000 will be misidentified. If we ask the computer why those 1,000 apples were not recognized, it will not tell you that it was because of the color difference. It will only show you that the computed score, based on partial derivatives or integrals, is 0.4, whereas the threshold for the label “apple” is 0.5 or more. How can we improve it? By finding more data to train on until the recognition rate reaches 98%.
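The numbers in this example can be worked through in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the only values taken from the interview are the 0.5 threshold, the 0.4 score, and the 90% and 98% recognition rates.

```python
# Illustrative sketch only: a thresholded classifier score and the error
# counts implied by a given recognition rate. Function names are hypothetical.

APPLE_THRESHOLD = 0.5  # a score of 0.5 or more counts as "apple" (from the interview)


def is_apple(score, threshold=APPLE_THRESHOLD):
    """The model only reports a score; it gives no reason such as 'wrong color'."""
    return score >= threshold


def expected_misses(n_samples, recognition_rate):
    """Number of samples expected to be misidentified at a given rate."""
    return round(n_samples * (1 - recognition_rate))


print(is_apple(0.4))                  # False
print(expected_misses(10_000, 0.90))  # 1000
print(expected_misses(10_000, 0.98))  # 200
```

Raising the recognition rate from 90% to 98% reduces the misses on 10,000 apples from 1,000 to 200, but the output for each miss is still just a number below the threshold.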

In my opinion, this is not artificial intelligence. The way we communicate with the child is artificial intelligence, because he understands my abstractions, the concepts. Through color, shape, material, and other abstract concepts, he arrives at the definition of a new thing. If he makes a mistake, he will use the same concepts to explain his answer, and then we can correct his knowledge. We can interact at a very high level, far beyond what partial derivatives and integrals can offer. Thus, in addition to the basic, low-level pixel information, higher-level concepts (concept information) and attributes (attribute information) are a must in order to achieve a higher level of interaction.

Synced: Image recognition involves a lot of computation. How do you improve the response speed?

Zhou Xi: This involves engineering problems. Face recognition itself has dozens of modules, from detection, tracking, segmentation, key points, and frontalization to quality analysis, light compensation, angle compensation, and so on. Each module needs to be adapted differently for each scene. If the application runs on mobile phones, the vendor will ask for a model size below 1 MB, but the full face recognition model on the server side has more than 100 million parameters. At the same time, we are also required to respond quickly. For example, we need to locate a number of key points within 1 ms for many people in a video. Why 1 ms? Because there are a lot of modules to run, and the running times of all of them together have to add up to “real time” (within 30 ms). What’s more, accuracy is another critical requirement. In some beauty apps, people will feel something is off even with a difference of only 1 pixel.
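The constraints described here are easy to check with back-of-the-envelope arithmetic. The sketch below is only an illustration: the 30 ms budget, the 1 ms figure, and the parameter count come from the interview, while the module count of 30, the uniform 1 ms timings, and the 4-bytes-per-parameter (float32) storage are assumptions made for the example.

```python
# Illustrative sketch only: latency budget and model-size arithmetic.
# Assumptions (not from the interview): 30 modules, 4 bytes per parameter.

REAL_TIME_BUDGET_MS = 30.0  # "real time" threshold quoted in the interview

# Module names as listed in the interview.
MODULES = [
    "detection", "tracking", "segmentation", "key_points", "frontalization",
    "quality_analysis", "light_compensation", "angle_compensation",
]


def per_module_budget_ms(n_modules, total_ms=REAL_TIME_BUDGET_MS):
    """Average time each module may take if all must fit in the budget."""
    return total_ms / n_modules


def fits_real_time(module_times_ms, total_ms=REAL_TIME_BUDGET_MS):
    """True if the summed module times stay within the real-time budget."""
    return sum(module_times_ms.values()) <= total_ms


def model_size_mb(n_params, bytes_per_param=4):
    """Uncompressed model size in megabytes (10**6 bytes)."""
    return n_params * bytes_per_param / 1_000_000


print(per_module_budget_ms(30))                         # 1.0
print(fits_real_time({name: 1.0 for name in MODULES}))  # True
print(model_size_mb(100_000_000))                       # 400.0
```

Under these assumptions, a pipeline of 30 modules leaves about 1 ms per module, and a 100-million-parameter model stored as float32 would be roughly 400 times larger than a 1 MB mobile target, which is why each platform needs its own adaptation.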

We always complain that “countless efforts go into adaptation” because of the dozens of modules, the different scenes, and the varied hardware (different phone models, servers, embedded devices). It is definitely time-consuming to adapt a new algorithm all over again for every platform, such as Android, iOS, Linux, and so on. That is why we need more than 200 people for a single face recognition R&D team.

Synced: I remember that in one speech you mentioned that CloudWalk can answer the question of where a person has come from. Are we able to track an individual’s movements nowadays?

Zhou Xi: This cannot rely on us alone. First of all, all the surveillance video needs to be structured, and the face data need to be extracted and stored. In the future, if you want to get information about a person quickly, you will be able to send a request from the system to all the servers, and the information will be aggregated, including what he did before and who he has met throughout his life. I would say the technical approach is feasible, but the data connections cannot be guaranteed.

Synced: What do you think are the landmark events in the development of image recognition?

Zhou Xi: Image recognition used to be very popular. By the end of the 20th century, it had gone into decline. Then, in 2001, Paul Viola and Michael Jones attracted wide attention by publishing “Robust Real-Time Face Detection”. When a camera was pointed at the audience, everyone’s face was marked out immediately. This event changed the meaning of image recognition technology, because people now wanted to put it to practical use. The U.S. government took the lead in applying it to surveillance cameras and strengthening intelligent monitoring. With the emergence of deep learning in 2005, it started to be applied to image recognition from 2009. Even now, the field is still in its boom stage.

Synced: What is China’s role in the development of image recognition?

Zhou Xi: In terms of image recognition, we are among the international leaders, or at least not behind the United States. China certainly offers a big market, in terms of both investment and data resources.

Synced: Some people say that much of today’s image recognition work focuses only on winning competitions and is far from solving real problems in daily life. What do you think?

Zhou Xi: Winning a competition is meaningless. Our initial goal is to make image recognition really helpful and to make a great contribution in banking, public security, airports, and so on. We do not need many winners who can only solve hypothetical questions, because there are tons of everyday problems waiting for us. Technology is not for showing off, but for helping out.

Author: Miaomiao Yu | Localized by Synced Global Team: Meghan Han, Rita Chen