With the goal of harnessing the untapped potential of Iranian-Americans, and to build the capacity of the Iranian diaspora in effecting positive change in the U.S. and around the world, the Iranian Americans’ Contributions Project (IACP) has launched a series of interviews that explore the personal and professional backgrounds of prominent Iranian-Americans who have made seminal contributions to their fields of endeavor. We examine lives and journeys that have led to significant achievements in the worlds of science, technology, finance, medicine, law, the arts and numerous other areas. Our latest interviewee is Homayoon Beigi.
Homayoon Beigi earned his BS, MS, and PhD from Columbia University in 1984, 1985 and 1990 respectively. The author of the first and only comprehensive textbook on Speaker Recognition, he has been involved for three decades in research and development in Biometrics, Pattern Recognition and Internet-Commerce. He has developed the award-winning RecoMadeEasy® Speaker Recognition and the multiple-award-winning CommerceMadeEasy® software. He has also been an Adjunct Professor since 1995, teaching graduate-level Speaker, Speech, Handwriting and Applied Signal Recognition, as well as Control Systems courses at the CS, ME, and EE departments of Columbia University. He was a Research Staff Member at the IBM T.J. Watson Research Center from 1991 to 2001, working on Speaker Recognition, ASR, LM, Search, Handwriting Recognition, Control, and NN Learning. He developed the SAFE Audio ANSI standard and was an active liaison in the US delegation of the ISO/SC37-JTC1-WG3 and the VoiceXML Forum on Speaker Biometrics. His other research includes Structural Health Prognosis, Image Compression, Kinematics, Financial Optimization, and Zero-Gravity Fluid Dynamics. His “Fundamentals of Speaker Recognition” has been downloaded over 51,000 times. He is the recipient of two IEEE best paper awards and has 13 issued patents and over 70 peer-reviewed publications. He is an Associate Editor of the AutoSoft Journal, Editor of the BISC Handwriting Recognition, a Senior Member of IEEE, a member of the Advisory Board of IEEE Spectrum, and a reviewer for the Pattern Recognition Journal, IEEE PAMI, IEEE Transactions on NN, IET Signal Processing Journal, the International Journal of Control, AIAA Journal, ICASSP, Interspeech, etc.
Tell our readers where you grew up and walk us through your background. How did your family and surroundings influence you in your formative years?
I was born in Tehran in 1964 and lived there until the age of 15. I attended Armaghan-e Tarbiat school in Abbas Abad for K-5, and Alborz High School for 6th through 9th grades. My father is a chemist and an environmental engineer, and my mother was an elementary school teacher for her whole adult life. They were distinct sources of inspiration and learning throughout my life. In addition, since my family had to move to Yazd and later to Semnan for my father’s work, I stayed with my grandparents (on my mother’s side) in Tehran in order to attend Alborz during the 6th and 7th grades. Needless to say, I spent a lot of time with my grandparents during that time. My grandfather was a retired police colonel who had incidentally retired the day I was born. He spent most of his time at home, fixing anything that wouldn’t work. This could be a transistor radio, plumbing, or anything else for that matter. He never had any technical (engineering) background, but loved to understand how things worked and had a knack for making things operational. I used to sit beside him and help him out in these endeavors. I, myself, was very keen on understanding the world around me. For the first 8 years of my life I did not have any siblings, so I spent a lot of my time with adults such as my parents, my grandparents, aunts, and uncles. They also treated me more like an adult than a child, so from a very young age I would have discussions about anything and everything with people such as my great grandfather. He used to read the nightly news at the age of 7 to his illiterate father, who was one of three colonels in Iran at the time of Mohammad Shah Qajar (circa 1845 AD). I would sit, mesmerized, and listen to his accounts of real history due to his exposure from such an early age. I recall listening to classical Persian music with my grandfather and great grandfather, such as the music of Ghamar-ol-molook Vaziri and the music of tar.
My grandfather used to play the Tar and owned a Yahya tar in his youth. Although I never saw him play, he told me a great deal about the dastgah system, including the names of the dastgahs (modes), the Gushehs (equivalent of Ragas), and how to recognize them. I always liked to know how everything worked and I engaged myself in those activities, from knitting a sweater for myself from scratch, at the age of 9, to fixing a radio, an automobile, or my 80cc Yamaha motorcycle. As I still believe, one should try to learn about every part of his/her surroundings. I recall being asked by a doctor stitching the skin on my forehead what I wanted to be when I grew up and my response was, “an engineer.”
The summer before starting the 9th grade at Alborz, social unrest started in Iran, which later led to the revolution. I had always planned to come to the U.S. for my higher education, but when I saw the turmoil in the capital and after consulting with my parents, I decided to apply to a boarding school and leave earlier for the U.S. My parents were not so keen to send me away, but they entertained the idea of applying for admittance and took on a wait-and-see attitude. Around January of 1979, right at the onset of the revolution, I received my admittance from the Knox School on Long Island, NY, about 50 miles from one of my aunts’ home. Knox was a college preparatory boarding school with 6th through 12th grades, nestled on about 40 acres of land by the Long Island Sound and only housing 170 students across the board. My parents, who had witnessed the uproar during the revolution, agreed to send me to the United States. Now I had to apply for a separate passport. I recall standing in line immediately behind Vigen, the late famous Iranian pop singer. When I started to apply for my U.S. visa, there was a very long line of applicants and the embassy was temporarily closed. The applicants had organized groups of 50 and had assigned group leaders who would maintain ordered lists of people waiting to apply. I was in the 30th group (over 1500 people before me) and there seemed to be a very slim chance that I would be able to get a visa. For 18 days and nights I stayed on the street next to the consulate. My dad would go and get us food; we would sleep in the car, parked next to the group, at night. Our groups were forced to disperse several times by thugs accusing us of being western-minded. However, the groups held, thanks to the organization of their leaders. Finally, the consulate was opened for three days.
When it was my turn to enter on the third day, the doorman told me I had no chance of getting a student visa when he noted that I did not possess a high school diploma. I followed through anyway. I spoke perfect English since I had studied the language beginning at the age of 2. I also had the highest grades among the 14 classes of the 9th grade at Alborz. The consul who interviewed me gave me a 4-year multiple-entry visa within a 2-minute conversation. I must have been one of the last people to have obtained a student visa from Tehran. To make the story short, after quickly getting ready for a seemingly one-way journey, I had to say goodbye to my parents and my 6-year-old sister, who was starting the first grade the day I was leaving for the United States. This was the first day of Mehr of 1358 (September 23rd of 1979). The next time I would see my father was 5 years later and it would be 7 years until I saw my mother and sister. It is needless to say how hard the first few months were, especially since the hostage crisis and later the war would make matters worse.
At Knox, I skipped the 10th grade and went on to complete the 11th and 12th grades. During this time, I became my class’s valedictorian, and received awards relating to science, the English language, and other topics. Knox was a well-rounded school and they used to take us to plays, symphonies, and more every weekend. Also, a ratio of about 5 students to a teacher meant that we were given as much educational nourishment as we could take. For example, I learned how to recite the prologue to Chaucer’s Canterbury Tales in both Middle English and the dialect of the time. This was while being the only person in my 12th grade math class, because they had run out of curriculum that was appropriate for my level of knowledge: post-calculus. I sat together with my teacher one-on-one, and we solved all the problems of the Schaum’s outline on Calculus in the 12th grade.
I applied to several universities at the onset of my senior year, and decided to attend the School of Engineering and Applied Sciences at Columbia University, where I would go on to study mechanical engineering. At this time, the war between Iran and Iraq was still continuing and I still had not seen my sister and parents. It was extremely difficult to convert Iranian currency into dollars and nearly impossible to send money from Iran, and at this time, Columbia had the highest tuition in the nation. I decided to finish my bachelor’s degree in three years instead of the normal 4-year course. Although this did not really save much on tuition, it definitely reduced the living expenses by one year. In order to do that, I had to take about 23 credits every semester where the maximum load was set at 18. To achieve this, I attended classes until the end of the semester and took all the exams and showed that my grades were good enough, forcing the dean to agree to add the extra courses to my transcript. Normally, the dean would not sign any schedule with more than 19 credits. I also took about 12 credits every summer. While studying at Columbia, I tried to take electives that involved Iranian languages, such as two semesters of Middle Persian (Pahlavi) and one in Advanced Modern Persian. I was the only 17-year-old in the class, with all others in a PhD program related to Iranian Studies. I even ended up as the de facto teaching assistant for Pahlavi when a new professor started teaching the course. The study of Middle Persian, in addition to its apparent benefits, taught me a great deal about linguistics and etymology, which came in handy in my future endeavors as a speech scientist. Eventually, I finished my bachelor’s degree and my father managed to get a tourist visa to come and attend my graduation.
I continued with the expedited education and finished my master’s in one year at the age of 20, where most of my colleagues would have taken three semesters or even four to complete the same curriculum. I continued further in the same department, mechanical engineering, at Columbia and earned my doctoral degree.
When I was around 18, I attended a Kamancheh performance by Maestro Andranik Aroustamian at the Metropolitan Museum of Art. I fell in love with the Kamancheh and the mastery of this great artist. I started studying the Kamancheh with him, although I did not even own an instrument at the time. I used to play the organ and almost anything I could get my hands on in my youth in Iran. However, until that time, I had never received a formal education in music. That was about to change. Music was always very important to me and I ended up continuing the study of Kamancheh with the maestro until his passing in the early 1990s. More than music, however, we used to speak about his past in Iran, Armenia, Azerbaijan, Russia, and the U.S. I continued to study the Tar with Mr. Reza Derakhshani and then continued to learn the theory with Mr. Hossein Aslani, with whom I collaborated on a few occasions, and later on my own, which I am still pursuing to this day.
What has been your personal key to success? What were the biggest inspirations for your career?
Loving anything automated, I fell in love with computer programming at the age of 17 when I took Fortran 77 at Columbia. I was among the first group of students using a non-punch-card terminal-based programming apparatus in 1981. We were using Digital DEC-20 mainframes with so-called dumb terminals. I loved programming so much that I started tutoring Fortran right after taking the freshman course in the subject. Having been extremely keen on research, I worked on several projects throughout my bachelor’s and master’s. These included zero gravity fluid dynamics research in conjunction with a Spacelab project and kinematics research with two very famous professors, one of whom was known as the father of Kinematics. For the zero gravity fluid research, I had to find a way of quantifying the spread of fluids on different surfaces in the Spacelab orbiting the earth, at zero gravity. I wrote a C program, at the infancy of the language, to communicate with a sonic digitizer in order to digitize the perimeter of the circle being formed by the spreading fluid in the frames of 24-frame-per-second film recorded for us in the Spacelab. Also, during my doctorate I wrote thousands of lines of C programs for networking, inventory, and warehouse administration on the Xenix (a Unix) operating system.
For my doctorate, I was able to get a research assistant position and was finally relieved from the hefty Columbia tuition. I did my doctorate in the same department (mechanical engineering) under two wonderful advisors. I was blessed with having a theoretically-minded advisor as well as one with a practical streak, whereas most PhD students only have a single advisor. In the second half of the 1980s, I was working on my doctorate in the field of Learning Control (invented by one of my advisors and completed by me and one of his other students), and on nonlinear optimization of neural networks with many layers. The first revival of Neural Networks had just happened in 1986 and I published my first paper on the subject in the following year. Being a mechanical engineer, I published most of my papers in ASME publications. Most electrical engineers and computer scientists were unaware of the advancements I had published. I had already formulated the general optimization of deep neural networks with solutions in the nonlinear optimization domain. Electrical engineering and computer science researchers only started looking into such architectures around 2005. In addition to these two distinct problems in my doctoral thesis research, I worked on the statistical health monitoring of mechanical machinery in naval ships, such as bearings, gear meshes, and other vibrating parts. This portion of my studies was focused on this specific topic since my advisor received funding for it. Nevertheless, this was in line with my continuous fervor for learning anything under the sun and propensity to relate different topics to each other. Having these two advisors from the theoretical and practical aspects of engineering was essential in building my career. Since I did all my work at Columbia, I had run out of courses to take in the mechanical engineering department, so I started taking all the mathematics courses in the applied mathematics department as well.
In addition, since passing the doctoral qualifying exams was particularly challenging in our department, I sat through most of my courses a second time. This time I made an effort not to take any notes and just tried to get all the nuances of these courses. This was another very important mechanism in shaping my foundation.
After obtaining my doctorate, while I was looking for employment, I worked at the Center for Telecommunications Research at the electrical engineering department. There, I wrote low-level drivers in C for doing lossless compression of images, their transmission across noisy channels, and their rendering with low-level graphics. This was in 1990. Much of what is now available in terms of device interfaces did not exist at that time. This experience helped me expand the breadth of my knowledge in those directions and paved the way for my position at the IBM T.J. Watson Research Center, where I worked for a decade on online handwriting recognition, music notation recognition, speech recognition, and speaker recognition. Most of these activities served as pioneering work during the infancy of these subjects. At this point I had worked on many different aspects of engineering, which enabled me to connect the dots and see the similarities in these fields. Starting in 1995 and continuing today, I have been teaching courses in speech and handwriting recognition, first at the electrical engineering department of Columbia and later at the computer science department. This was in addition to teaching digital control systems and applied signal recognition in the mechanical engineering department. I have had and still have PhD students in the civil engineering department working on vibration analysis of bridges who use the same techniques that I use in speaker recognition to solve these problems, but from a new angle.
You have received many awards. What are your most significant accomplishments that brought you to this point?
My greatest accomplishments have stemmed from the capacity to see a broad view of different fields. This viewpoint began quite early in life when I was exposed to different topics. I do not recall any topic which bored me or felt uninteresting. I have always tried to learn everything. I do not subscribe to the English adage, “Jack of all trades, master of none.” In my opinion, in order to be a master in anything, you need to try to master everything. I believe we start life by learning many different things, and then at some stage of our lives we start concentrating on a specialty. It is very easy to be pulled into such inflection points in life, and then continue by discarding all other subjects as irrelevant. However, as evidenced by the technique of annealing used in forging steel and in optimization theory, we will be able to achieve a more global viewpoint and better optimality if we experience a few of these specialties and connect them together toward a better perspective. Once we near this stage and of course continue in its pursuit, we open ourselves to much quicker adaptation in the face of new problems, arming us with the tools needed to solve any problem that we face.
What is Recognition Technologies, Inc.? How is the company different from others in the same sector?
My wife, Pargol, and I started an Internet hosting company, Internet Server Connections, Inc., in 1996, while I was at IBM. Pargol, herself an accomplished Columbia engineer, has been very supportive and a great partner in both life and business. We worked on specialized hosting of websites, work that continues to this day. In 1998, we started converting all the articles of Encyclopaedia Iranica to digital form and made them available online. This work continued until 2012. In 2000, we were approached by the research division of Merrill Lynch to develop an algorithm that could optimize portfolios from a set of 35,000 securities in realtime, an extremely formidable task. This project caused me to request to leave IBM. I was told by my manager to take a leave of absence instead, which I did for a year, and indeed never returned to IBM. We were quite successful with the algorithms, but because of the September 11, 2001 attacks, our contacts were dispersed and the research team moved out of the area. Having left IBM, I decided to start working on speaker recognition for my own company, which was formed later under the name Recognition Technologies, Inc. Both our companies have been completely self-funded with no external funding of any kind over the past two decades.
Having written the first version of IBM’s speaker recognition engine from scratch (both code and algorithms), I wanted to work in the same field for our own company. In order to do that I needed lots of data. Serendipitously, I started writing an interactive voice response system for a new client of ours who conducted thousands of oral language proficiency tests in over 100 languages. Under an agreement, we started collecting a large amount of data which we could use to develop our own speaker recognition engine. In another sidetrack, I developed an automated system to provide better accuracy and granularity for rating the proficiency of oral language. Our client used it for a decade, which in turn generated even more data for our speaker recognition engine. The speaker recognition engine was also used for over a decade to ensure that candidates who took the test were the authorized candidates by verifying their voices. Recognition Technologies was officially born in 2003 and continues to flourish. I continued efforts on writing a face recognition engine as well as a speech recognition engine, and worked on combining these engines to act with a single interface. To make a long story short, I have written over 2,000,000 lines of C++ code for these three engines, utilizing new algorithms in these subjects. We have also expanded into covering many different languages including English, Mandarin Chinese, Arabic, German, etc. In fact, we are currently working on supporting 39 languages. We are capable of running our engines at orders of magnitude higher speeds than our competitors, while supporting a much larger domain (population for biometrics and unique words for speech recognition). We are in fact capable of running all our engines faster than realtime on a 25-dollar Raspberry Pi processor with similar performance when compared to running on a large cloud computer.
You have been involved in research and development in Biometrics, Pattern Recognition and Internet-Commerce. Could you highlight your works in these areas?
As I mentioned, I started with pattern recognition during my doctoral work on health monitoring of machines, and continued with it throughout my IBM career, where I did handwriting, speaker, and speech recognition. I continue this work today at Recognition Technologies, Inc. in addition to music recognition, emotion detection, language proficiency rating, bridge and structure health monitoring, etc. I also wrote the first cookieless Internet Commerce package in the late 1990s at Internet Server Connections, Inc., the company run by Pargol and me.
You developed the award-winning RecoMadeEasy® Speaker Recognition and the multiple-award-winning CommerceMadeEasy® software. Could you tell us more about them?
Our RecoMadeEasy® Speaker Recognition engine was one of the first products out of Recognition Technologies. In 2011, I was approached by Frost & Sullivan, which is a prestigious market research company. They wanted to compare our speaker recognition engine to our competitors’. I provided them with the information they needed. A few months later they contacted me and told me that we were the recipients of their “New Product Innovation Award” in our field. They had compared us to our competitors using their own metrics and had come to the conclusion that we scored the highest in speaker recognition, which is the biometric aspect of speech. The award was presented to us in person in January of 2012 in San Diego. More information about this award is available on our website.
I have always been an advocate of utmost privacy for clients and I wrote the CommerceMadeEasy® software for Internet Server Connections, Inc. with this value in mind. This was the first cookieless Internet Commerce package, which allowed greater privacy by not keeping information on the history of browsing when purchasing an item. This way of thinking completely contradicts today’s ideas of commerce, which are based on gathering as much information as possible from a client while they shop. We debuted this package at the Internet World Show at the Javits Center in New York in 2002 and won the best of show award in Internet Commerce, after three judges reviewed our product against other vendors dealing with Internet Commerce. We also won an award from the Linux Journal for this product in the same year. Unfortunately, since this kind of package is more supportive of the client than the vendor, it did not make it as a mainstream package.
You are the author of the first textbook on Speaker Recognition, entitled, “Fundamentals of Speaker Recognition”. What motivated you to write it?
Having pioneered a lot of the speaker recognition work at IBM and later at Recognition Technologies, I decided to gather my notes and lectures into a comprehensive book on speaker recognition, for which no textbook existed. I believe in sharing knowledge and not allowing it to perish with us when we leave this world. I began with the serious writing of this book in January of 2008 and averaged about 12–14 hours of research and writing for 4 years. The book is 1000 pages long with 26 chapters, and was published by Springer-Verlag publishers in December of 2011. It is a highly technical book and took a lot of time just to write the thousands of equations. This must be the biggest sustained challenge of my life. The second edition is due to be published by the end of this year. Since its publication, the electronic version of the book has been downloaded over 51,000 times. It is still the only textbook on speaker recognition. I use this book as my textbook, as do many other professors in the field.
What is the biggest obstacle you’ve faced along your career path? How did you overcome this?
Although I have been quite lucky in my career path, in mostly working freely toward my research goals, there have been impediments along the way. Even though I pioneered a lot of the learning techniques in neural networks, there was great resistance by management at IBM to using these techniques in our research. I worked very hard to be able to use such techniques in handwriting and speaker recognition at IBM, to no avail. Once I left IBM, I was able to freely apply these techniques to advancing our engines at Recognition Technologies. Much later in the process, IBM, as well as others, understood that neural techniques were quite appealing, but our company had already progressed beyond these latecomers, giving us the lead we needed.
What project are you the proudest of?
I am proud of most of the work I have done, but the most important is the line of RecoMadeEasy® engines, which includes speaker, speech, and face recognition, along with emotion detection and many more functionalities being added recently. This engine is capable of running on an embedded processor 20% to 40% faster than realtime and more than 17 times faster than realtime on a regular laptop. It is capable of transcribing speech in multiple languages, even in a bilingual capacity, with a vocabulary of 340,000 words. This is all while segmenting the audio stream into speaker turns, identifying the speaker names for each turn, identifying their gender, and returning confidence levels and scores to the calling system with timestamps of these points in the media. And on top of it all, it is capable of recognizing the faces in the media (say video) and tagging those faces with their IDs. It also combines these results into a single fused result with confidences and scores. To put this into perspective, Carnegie Mellon University, which has been working on speech recognition for more than 4 decades, states in its frequently asked questions, regarding whether large vocabulary speech recognition is possible on an embedded device, that they believe it is not possible. Our product not only shows that this is possible, but it has redefined large vocabulary from the nominal 60,000 words most everyone uses to 340,000 unique words. This performance is the same for all supported languages. Our standalone running capabilities mean that our engines can be utilized in mobile computing with no connectivity, such as in an automobile, while maintaining a high level of privacy by not using cloud services.
In your opinion, what are the main challenges currently facing your field?
From a political standpoint, most companies with a presence in my field are, by definition, advertising companies. They look to running their speech, speaker, and face recognition engines on cloud computers so that they can obtain and accumulate data about their users, in order to sell the information to advertisers and provide the customers with targeted advertising. They do not condone localized computing and push for cloud services. We are doing the opposite and hope to prevail by increasing privacy levels. Alas, these conglomerates provide most of the funding for university research. There seems to be an active force against maintaining privacy. Government organizations, which also provide a large part of such funding, by nature do not care for privacy-centric research.
Aside from these political aspects, the three greatest technical issues are as follows: 1. As the population grows, biometric identification becomes harder. Take speaker recognition for example. The intra-speaker variations may become larger than the inter-speaker variations as the population grows. This is true for face and other modalities of biometrics as well. 2. Degraded channels are another great source of errors in our field. This may include cross-talk (when two or more people speak at the same time), noisy environments (talking on a cell phone while walking on a street in Manhattan), degraded lines due to errors and packet loss, etc. The analog of this in face recognition is the occlusion of faces, bad lighting, etc. 3. There is a misconception among the public about these research areas. In this regard, a lot of well-meaning colleagues set the wrong expectations and create fallacies about the field which feed on public misunderstanding. Of course, I guess this last issue is common to many advanced research problems.
You hold many U.S. patents. Can you tell us about the three most important patents that you have had so far?
I have 13 patents registered with the USPTO. I believe the last three patents are the most important. These are:
1. US10,042,993, “Access Control Through Multifactor Authentication with Multimodal Biometrics,” 2018,
2. US9,495,646, “Monitoring Health of Dynamic System using Speaker Recognition Techniques,” 2016, and
3. US9,064,257, “Mobile Device Transaction using Multi-Factor Authentication,” 2015.
Patents 1 and 3, as their names suggest, provide complete infrastructures for access control and mobile transactions. They cover all aspects of such transactions which do not require direct connectivity to a cloud and which can be trusted by the vendors or access targets. Both are major patents with over 30 claims each. Patent 2 provides a full strategy for using inexpensive sensors on bridges, buildings, and machinery in order to forecast the health of these structures or machines by using techniques borrowed from speaker recognition. This patent speaks a lot to the theme of my career, which is the perspective that allows us to see connections across different fields and apply solutions from one field to another.