Jan Bot, or a new step in the demystification of feature extraction technology

Nov 28, 2018

by Brecht Declerq

Jan Bot claims to be the world’s first robotized film museum curator: a robot that analyzes an existing collection of film footage, describes it, divides it into small pieces, and then composes new clips inspired by the content of news articles. A bit like a human curator who, inspired by a contemporary subject, browses through the film collection and, by making his or her selection, identifies connections and contrasts between the present and the past.

For long-time followers of the technologies applied here, this project, hovering somewhere between a courageous proof of concept and a somewhat artistic craft work for high-tech enthusiasts, seems like a genuine leap forward.

Still image from Jan Bot, the self-proclaimed world’s first robotized film museum curator. Contrary to popular belief, the technology that powers this computer program isn’t new: it has existed since the early 1990s.

Automated image analysis is nothing new

Jan Bot’s description regularly refers to artificial intelligence. That is a somewhat unjustified use of a particularly vague but ubiquitous buzzword, since the work of Jan Bot can in fact be summarized as automatic content analysis of motion pictures, automatic keyword extraction from news articles, and a simple combination of the two sets of results.
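To make that summary concrete, here is a minimal sketch of such a pipeline in Python. It is emphatically not Jan Bot’s actual code: the shot-tagging and keyword functions below are invented placeholders for whatever computer vision and language tools the real system uses.

```
# Hypothetical stand-ins for the two analysis steps described above.

def tag_shot(shot_id):
    # Placeholder: a real system would run computer vision on the footage.
    fake_tags = {1: {"crowd", "street"}, 2: {"harbor", "ship"}, 3: {"street", "car"}}
    return fake_tags.get(shot_id, set())

def extract_keywords(article_text):
    # Placeholder: a real system would use proper keyword extraction.
    stopwords = {"the", "a", "of", "in", "and", "on", "as"}
    return {w.strip(".,").lower() for w in article_text.split()} - stopwords

def select_shots(article_text, shot_ids):
    # The "simple combination": keep shots whose tags overlap the keywords.
    keywords = extract_keywords(article_text)
    return [s for s in shot_ids if tag_shot(s) & keywords]

print(select_shots("A crowd marches on the street", [1, 2, 3]))  # -> [1, 3]
```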

Moreover, to those who think that a new chapter in the curation of motion picture collections starts here: you are largely mistaken. To put it briefly: nothing could be less new. Experiments with the automatic analysis of audiovisual archive documents and keyword extraction will soon celebrate their 25th birthday. And yet the big breakthrough of automated analysis in audiovisual archives has still not fully arrived. In this article, I would like to outline briefly how that came about.

Early experiments

In 1993, at Carnegie Mellon University in Pittsburgh, three researchers named Howard Wactlar, Michael Christel and Scott Stevens launched a research project called Informedia, whose main research topic was the automatic recognition and description of audiovisual documents through algorithms. By far the most spectacular outcome of the project came about a year later: the researchers reported that their algorithm had succeeded in generating a keyword description of a video without any manual intervention. As far as can be verified, this was the first time that automated analysis and description of image and sound was proposed as a solution for making large audiovisual collections searchable.

User interface of the Informedia digital library, created by Howard Wactlar, Michael Christel and Scott Stevens

It didn’t take long before the work of Wactlar, Christel and Stevens drew the attention of audiovisual archivists, professionals who had been active in this domain for almost a century. They acknowledged the enormous potential of Informedia’s results. Indeed, the first reaction was one of particularly great interest. The content description of audiovisual fragments, by far the most labor-intensive activity of audiovisual archives, could possibly be reduced to a few mouse clicks. Even before the end of the 1990s, the fire had spread throughout large parts of the information retrieval field. As a consequence, more research groups were launched, more results were published, and competitions were even organized to compare the different algorithms and research groups active in the field. Olympics for speech and image recognition algorithms, so to speak.

During this period, research groups and avant-la-lettre start-ups grew very rapidly. Too rapidly, as the notorious example of Lernout & Hauspie proved. This speech technology company, the jewel in the crown of the Belgian innovation economy, eventually mistook its own dreams for reality, until the bubble burst and thousands of investors saw their money go up in smoke. It was a trauma that would continue to feed skepticism about feature extraction in Belgium for at least ten years.

Brazen optimism

From the early 2000s, the interest of audiovisual archivists in the work of feature extraction scientists became mutual. Audiovisual archives possessed large and very varied digital datasets, which the researchers could use to train their algorithms. With little or no objection, the archives handed over their audio and video files, and from this lending joint test projects followed. Test projects led to proofs of concept, and proofs of concept were followed by final tests and go-to-market studies. And at each new stage, the results were once again presented triumphantly: not only in scientific forums, but at audiovisual archiving conferences as well. In this period, every self-respecting audiovisual archive of some size had its own large test project in feature extraction.

At each presentation, the results provided by feature extraction algorithms were deemed so promising that presenters would simply forget to mention that those results were based only on tests, under laboratory conditions. Grumblers in the audience who tried to raise the slightest objection were cut off with the verdict that the use of feature extraction in the daily practice of audiovisual archives was imminent. And that word ‘imminent’ was used in the strangest ways: maybe next month, if not within half a year, but certainly within the next year. Unless something got in the way, but even then, within two years at most.

Or did we say three? The stories about what could be possible with feature extraction grew and grew. Here and there, some even claimed that they could fly you to Mars.

From craze to myth. From myth to suspicion.

In 2008 the mailing list of the International Association of Sound and Audiovisual Archives (IASA) featured a post by a television archivist from a remote island group. It read: “Can anyone please tell me if it’s true, that soon some television station will no longer need archivists to work? Someone said there is a system whereby it does everything: archive, viewing, classification and computing. Is it true? Maybe you out there know something about it?”

At this point, feature extraction acquired the allure of teenage sex: everyone was talking about it because, of course, it was so exciting. Expectations were immense, and some claimed without batting an eye that they had already done it. These were all anxious lies, told out of fear of falling behind, substantiated at times by nothing more than clumsy experiments and almost never by the predicted fantastic results. The myth was only amplified, while the truth was that feature extraction was not yet part of anyone’s daily audiovisual archiving practice. Only in a few high-tech research projects, such as PrestoSpace and Tosca-MP, were researchers still able to see the wood for the trees.

In the early 2010s, however, cracks began to appear in the bubble. Skepticism increasingly sprouted: if all those tests were so successful, why was feature extraction software offered by so few companies? And where it was offered, why didn’t anybody have good references? And why hadn’t anyone thought about how feature extraction would fit into the architecture and workflows of the Media Asset Management (MAM) systems that were gaining ground? Was it perhaps because it didn’t work as well as everyone kept saying? And didn’t that bring to mind what had happened to Lernout & Hauspie a decade earlier?

Slowly but surely a great divide grew: those who, against better judgment, kept believing in the automatic archivist and continued to see it as a do-it-all servant, versus those who preferred to see first and believe later, to await the future before beginning to prepare for it. Soon enough there was no third way left: you were either with it or against it. Against what exactly, that question had long since stopped being asked.

From feature extraction to artificial intelligence

In the demystification of feature extraction for audiovisual archives, the Media Management Commission of FIAT/IFTA, the world federation of television archives, played a decisive role. In Switzerland, where there is no shortage of money or of level-headedness, a new MAM system was installed at the Italian-speaking public broadcaster RSI in the summer of 2011. Without any fanfare or publicity, this system also included the ability to transcribe full radio or television broadcasts through an Italian speech recognition algorithm. The RSI archivists used it on a daily basis and to great satisfaction. On their return from the FIAT/IFTA conference in Turin, members of the Media Management Commission recognized the groundbreaking nature of this practice and invited RSI to the Media Management Seminar 2013 in Hilversum.

From that moment on, new, verifiable testimonials about the use of feature extraction tools in daily archive practice popped up, and not only speech recognizers. The hesitant attitude of the major MAM vendors toward providing feature extraction in their monolithic systems was ruthlessly punished: modularity became the new normal, and APIs (connectors for external online software services) became a standard feature of archival software architectures.
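As a rough illustration of what such modularity means in practice, the sketch below shows an archive workflow handing a media file to an interchangeable external transcription service over HTTP. The endpoint and field names are invented for the example and do not correspond to any specific vendor’s API.

```
import json
import urllib.request

# Hypothetical endpoint of an external speech-to-text service; any vendor
# exposing a comparable HTTP API could be swapped in without touching the MAM.
TRANSCRIBE_URL = "https://speech-service.example.com/v1/transcribe"

def transcribe(media_url, language="it"):
    """Ask the external service to transcribe the given media file."""
    payload = json.dumps({"media_url": media_url, "language": language}).encode()
    request = urllib.request.Request(
        TRANSCRIBE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["transcript"]

# The returned transcript can then be attached to the catalog record,
# keeping the feature extraction step outside the archive system itself.
```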

New kids appeared on the block, and although they were late, they were certainly not the least: IBM started a propaganda offensive around its reanimated supercomputer Watson, Google came up with Cloud Video Intelligence, and Apple, too, suddenly woke up and acquired the very promising startup Emotient. It appeared that what archivists had always called feature extraction had suddenly changed its name: artificial intelligence became the talk of the town, and the believers had already lovingly shortened it to “AI”.

Demo version of IBM Watson Visual Recognition. At the FIAT/IFTA World Conference 2017, an experienced archivist asked one of the company’s executives during a Q&A whether discounts were provided if Watson could be caught making mistakes.

Debunking the myth

Is this a breakthrough or a new bubble? According to a FIAT/IFTA study conducted in April 2017, no more than one in ten of the 52 responding audiovisual archives used feature extraction algorithms in daily practice. You may find that a lot or a little, but the audiovisual archive community certainly remains a critical prospective customer. During the FIAT/IFTA World Conference in Mexico in mid-October 2017, IBM presented some pricing models for using Watson and pitched it as the improved version of a human cataloger. From an experienced colleague from the Far East, this slogan evoked the phlegmatic question of whether discounts were provided if Watson could be caught making mistakes.

And that skepticism is appropriate, because even if the algorithms are solid, the business models might not be. At the end of October 2017, PopUpArchive, the company behind the popular online speech-to-text service audiosear.ch, announced out of the blue that it would close shop at the end of the following month, in spite of a particularly rich client portfolio of podcast makers and audiovisual archives great and small, especially from the Anglo-Saxon world. Seasoned audiovisual archivists have fortunately kept their critical sense.

In 2017 the company PopUpArchive closed, in spite of its large number of clients, which included big and small audiovisual archives. “Even if the algorithms are solid, the business models might not be.”

That critical sense must also be maintained when looking at Jan Bot. Not to stop an unstoppable future, but simply to prevent the bubble from bursting as it did in the early 2000s. We must dare to ask critical questions, such as: is Jan Bot a robot with a unique view, incomprehensible to humans because they are necessarily stuck in frameworks of interpretation and language concepts? In my opinion the answer is no, at least for the time being. For now, the technologies applied here simply don’t seem good enough to speak of real added value.

On the other hand, the ‘Bits and Pieces’ collection of EYE is also a mishmash: small clips without much coherence, provenance information or guidance. All of this probably shows that the context, the story that lies behind the images but not literally in them, is essential. Perhaps the whole Jan Bot experiment paradoxically proves the opposite of what was originally intended: no archive can (at least for now?) do its job without humans.
