Summer School Rocks!

Rim Baccour
TotalEnergies Digital Factory
7 min read · Jul 21, 2022

By Deirdree POLAK & Rim BACCOUR

Yes, this is a professional article, not an ad for your young family members!

From the 4th to the 7th of July, we had the opportunity to attend the AI summer school organized by Hi! Paris, a research center co-founded by Institut Polytechnique de Paris and HEC Paris. It was held at Télécom Paris (Saclay plateau) and consisted of two main tracks: an applied track covering concrete ML and AI applications in industry, and a second track that dived deeper into the essence and theory of many AI subdomains. As TotalEnergies is one of the four main donors of the event, we were able to take part in the courses.

In this article, we chose to talk about the four sessions that struck us the most, and hopefully to motivate you to attend next year’s summer school.

Operationalizing AI Regulation (July 5th, morning session)

This workshop was co-led by Dr. David RESTREPO AMARILES (HEC Paris) and Dr. Winston MAXWELL (Telecom Paris) and raised the issue of ethics in the AI domain. Due to a lack of safeguards and auditing, unfair models (biased by gender or race, for example) are unfortunately used in both the public and private sectors. Case studies were used to trigger group reflection and discussion, such as:

  1. The Dutch childcare benefits scandal (the Rutte government resigned after the system was found to be biased against dual-nationality holders).
  2. Amazon’s solution to automate hiring in human resources (a gender-biased solution).

This is, in fact, the reason behind the recent initiatives of many regulatory institutions trying to set a framework for trustworthy AI, in order to face the risks and repercussions of black-box solutions. From our group discussions we highlighted that:

  • Even though unfair models do exist, they are usually less biased than humans. By pointing out the weaknesses of these algorithms, we can therefore help both humans and AI solutions become less discriminatory.
  • A data scientist should be aware of the pillars of Trustworthy AI. The risks and social consequences of an ML/DL model should be considered at the very beginning of any solution design.
  • It is also crucial to pay attention to models’ explainability and decision justification, especially in essential services such as education, credit, healthcare…

Data in Finance: FinTech Lending & Decision-making Under Uncertainty (July 4th & 7th, morning sessions)

This workshop was led by Dr. Johan HOMBERT from HEC Paris and gave us an insight into the world of finance.

We were given simple datasets and two questions: who should get a loan, and what interest rate do we want to offer the borrower? The catch: we were in competition with other teams who could make offers to the same customers. Therein lay the difficult balance of making offers and rates appealing enough to attract customers, but strict enough that we would make a profit even if some of them defaulted.

Good news, bad news: Our team got 60% of the market share but lost $30m to clients who never paid back.
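
To make that pricing trade-off concrete, here is a back-of-the-envelope sketch (our own illustration with a hypothetical function, not the workshop’s actual model): if a customer’s estimated default probability is p, the interest rate must at least cover the expected loss for the lender to break even.

```python
# Back-of-the-envelope loan pricing (illustrative sketch, not the workshop's model):
# to break even on a loan with estimated default probability p, the interest
# rate r must satisfy (1 - p) * (1 + r) >= 1, i.e. r >= p / (1 - p).
def break_even_rate(default_probability: float) -> float:
    return default_probability / (1 - default_probability)

print(f"{break_even_rate(0.05):.1%}")  # ~5.3% for a 5% default risk
# Price too far above this and competitors win the customer; price too close
# to it and a few extra defaults wipe out the margin.
```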

This workshop resonated with another one led by Dr. Julien GRAND-CLEMENT, Hi! PARIS Chair at HEC Paris: Decision-making Under Uncertainty. When every prediction comes with an uncertainty, how can we ensure our decisions take this risk into account? Assuming every prediction has an upper and lower bound, you can take different approaches:

  • The so-called nominal approach → just take the estimate as is.
  • The conservative approach → be a pessimist and take the lower bound of the prediction (in the context of lending).
  • The robust approach → find the balance. Normalize the predictions and their associated uncertainties so that the error can be expressed as a standard deviation. You can then decide on a ‘budget of deviation’, in other words the actual loss you’re okay with, expressed as a function of the standard deviation:

budget of deviation => sum (prediction ± standard deviation)

By solving this equation for the standard deviation in the context of the loan seekers, you can 1. decide which customers to make an offer to and 2. obtain a more robust estimate of your overall return on investment.
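
Here is a minimal sketch of the three approaches, in the spirit of budget-of-uncertainty robust optimization (our own toy example with hypothetical numbers, not the workshop’s exact formulation):

```python
import numpy as np

# Hypothetical per-customer profit predictions (k$) and their uncertainties.
predicted_profit = np.array([12.0, 8.0, 5.0, 20.0, 3.0])
std_dev = np.array([4.0, 9.0, 2.0, 15.0, 1.0])

# Nominal approach: trust the point estimates as they are.
nominal = predicted_profit

# Conservative approach: assume the worst case (lower bound) for every customer.
conservative = predicted_profit - std_dev

# Robust approach: only "spend" a limited budget of deviation, i.e. charge the
# worst case to at most `budget` customers, starting with the most uncertain ones.
budget = 2
worst_idx = np.argsort(-std_dev)[:budget]
robust = predicted_profit.copy()
robust[worst_idx] -= std_dev[worst_idx]

# Make an offer only to customers whose robust profit estimate stays positive.
offers = robust > 0
print("Offer to customers:", np.where(offers)[0])
print("Robust estimate of total return:", robust[offers].sum())
```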

Learning for Audio Signals (July 7th, morning session)

Mr. Geoffroy PEETERS (Telecom Paris) started out by introducing the emerging applications of the audio signal domain. These applications were divided into three categories (description, transformation & generation) and are used not only on speech and music data but also to understand environmental sounds.

Some examples of speech applications were:

  • Description: speech-to-text, Automatic Speech Recognition (ASR), speaker recognition, speaker diarization
  • Transformation: speech separation
  • Generation: text-to-speech

Music & Environment Sound applications:

  • Description: music tagging (instruments, singers, music style identification), recommendation of audio content, content description (pitch, chord, tempo…), lyrics alignment & recognition, acoustic scene classification (metro station, urban park, public square…), sound event localization from a multichannel audio input (detecting the various events occurring in an audio signal, like a barking dog, car sounds or discussions…)
  • Transformation: music style transfer (Sinatra singing reggae), music source separation
  • Generation: sound generation

The speaker insisted on the fact that state-of-the-art signal processing approaches already provide a thorough understanding and modeling of audio signals. Here we refer, for example, to sinusoidal modeling, in which the signal is represented as a sum of sinusoids plus noise, or to the Fourier transform, which gives the frequency-domain representation of a signal (its frequencies and their magnitudes), among many other transformations.

Then came the era of Machine Learning, and then Deep Learning, when researchers started to transfer the architectures that had had huge success on images to audio signals. CNNs were applied to the signal’s spectrogram (the variation of a signal’s frequency spectrum over time) and RNNs were applied to the raw sound signals.
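
As a quick illustration of the spectrogram input such a CNN would consume, here is a minimal sketch using a synthetic signal and SciPy (our own example, not from the session):

```python
import numpy as np
from scipy import signal

# Hypothetical example: a 3-second signal mixing two tones, sampled at 16 kHz.
fs = 16_000
t = np.arange(0, 3.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# Short-Time Fourier Transform: a (frequencies x time frames) matrix.
freqs, times, stft = signal.stft(x, fs=fs, nperseg=1024, noverlap=512)
spectrogram = np.abs(stft)  # magnitude spectrogram, shape (n_freqs, n_frames)

# A CNN would then treat this 2-D array much like a single-channel image,
# e.g. an input tensor of shape (batch, 1, n_freqs, n_frames).
print(spectrogram.shape)
```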

These models gave better results across the whole range of audio signal applications but presented a critical flaw: you might have guessed it… a HUGE TRAINING COST. These deep and complex architectures can have millions of parameters to learn. Consequently, training on large datasets can take long months or even years.

The session concluded with an introduction to the latest research on generative signal modeling, called Differentiable Digital Signal Processing (DDSP). This approach combines classic signal processing techniques with deep learning architectures to generate audio signals. It extracts valuable information such as the loudness and the fundamental frequency, which is then fed to an encoder-decoder architecture to re-synthesize a realistic replica of the input signal. The advantage here is that all the parts of the mathematical modeling of the signal are differentiable: they can be fed to a neural architecture where derivatives and back-propagation can be computed. This approach drastically reduces the number of parameters to learn, and thus the training cost and duration.
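
To give a flavour of what “differentiable signal processing” means in practice, here is a minimal harmonic synthesizer sketch in PyTorch (our own simplified illustration, not the official DDSP implementation): every operation is differentiable, so a network predicting the fundamental frequency and harmonic amplitudes can be trained end-to-end through it.

```python
import torch

def harmonic_synth(f0, amplitudes, sample_rate=16_000, hop_size=160):
    """Toy differentiable harmonic synthesizer, in the spirit of DDSP.

    f0:          (frames,) fundamental frequency in Hz per frame
    amplitudes:  (frames, n_harmonics) per-harmonic amplitudes per frame
    """
    n_frames, n_harmonics = amplitudes.shape
    n_samples = n_frames * hop_size

    # Upsample frame-rate controls to sample rate (linear interpolation).
    f0_up = torch.nn.functional.interpolate(
        f0.view(1, 1, -1), size=n_samples, mode="linear", align_corners=True
    ).view(-1)
    amp_up = torch.nn.functional.interpolate(
        amplitudes.T.unsqueeze(0), size=n_samples, mode="linear", align_corners=True
    ).squeeze(0).T  # (n_samples, n_harmonics)

    # Integrate instantaneous frequency to get phase, then sum the harmonics.
    harmonic_freqs = f0_up.unsqueeze(-1) * torch.arange(1, n_harmonics + 1)
    phase = 2 * torch.pi * torch.cumsum(harmonic_freqs / sample_rate, dim=0)
    return (amp_up * torch.sin(phase)).sum(dim=-1)  # (n_samples,)

# Hypothetical usage: 100 frames of a 220 Hz tone with 4 decaying harmonics.
f0 = torch.full((100,), 220.0)
amps = torch.tensor([1.0, 0.5, 0.25, 0.125]).repeat(100, 1)
audio = harmonic_synth(f0, amps)
```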

Intelligent Risk Management: Graph-Based Anomaly Detection Using the MDL Principle (July 7th, afternoon session)

This workshop was presented by Dr. Aluna WANG (HEC Paris), who walked us through a new approach for anomaly & fraud detection applied to companies’ financial data (bookkeeping data). This problem cannot be solved by ‘simple’ classification algorithms because of class imbalance (fraudulent transactions are scarce and represent less than 1% of the overall dataset) and because data labeling is a high-cost operation.

The proposed approach consisted of applying information theory (specifically, compression rules) to encode bookkeeping metadata and transaction graph motifs separately. Indeed, compression is well suited to detecting anomalies and rare patterns. Its first rule dictates that frequent data should be encoded with a short sequence of bits to save storage or transmission bandwidth, whereas scarce data carries new and unknown information and must therefore be encoded with more bits. In this use case, the Minimum Description Length (MDL) principle was used to compute a coding table for the data collection. It uses patterns’ degree of novelty “to separate structure from noise; regularity from anomaly; meaningful information from accidental information; and at the technical level, the compressable versus the uncompressable portion in the observed data” (from the article linked below).

Detecting anomalies then amounted to finding metadata & graph motifs encoded with long bit sequences. These two encodings reduced the size of the initial dataset and resulted in a set of “suspicious” patterns that were then studied by financial experts in order to confirm or reject them as fraudulent, since we are still dealing with an unsupervised problem. The speaker emphasized that the two sets of results had little overlap but were complementary, as they detected different types of fraud. For more details, we recommend you take a look at this article: Pattern Recognition and Anomaly Detection in Bookkeeping Data.
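
To illustrate the intuition with a toy sketch of our own (not Dr. Wang’s actual encoders): frequent bookkeeping patterns get short codes, rare ones get long codes, and a long code length acts as an anomaly score.

```python
import math
from collections import Counter

# Hypothetical bookkeeping patterns (debit account, credit account).
transactions = [
    ("cash", "sales"), ("cash", "sales"), ("cash", "sales"),
    ("receivables", "sales"), ("receivables", "sales"),
    ("cash", "executive bonus"),  # rare pattern
]

counts = Counter(transactions)
total = sum(counts.values())

def code_length(pattern):
    """Optimal code length in bits: -log2(frequency of the pattern)."""
    return -math.log2(counts[pattern] / total)

# Rank patterns by code length: the rare pattern gets the longest code,
# i.e. the highest "suspiciousness" score.
for pattern, bits in sorted(((p, code_length(p)) for p in counts),
                            key=lambda kv: -kv[1]):
    print(f"{pattern}: {bits:.2f} bits")
```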

Final Thoughts

“It is crucial for data scientists nowadays to balance spending time on exploiting and exploring”: that is how Mr. Balaji PADMANABHAN (University of South Florida) concluded one of the round tables, and it is this type of event that helps us catch up with recent AI breakthroughs and the latest advances.

A big thanks to Hi! Paris for organizing such an insightful and valuable event, and to all the session leaders and keynote speakers from whom we learned so much. You can find more information on the Hi! Paris website here: https://www.summerschool.hi-paris.fr/

Also, a big thanks to TotalEnergies and particularly the R&D team for providing us with tickets for the event and welcoming us to their offices.

See you there next year!
