Brand Buzz Unveiled: Unleashing the Power of Transformers to Uncover the Hottest Brand-Industry Links in Written Media

Maysaa Khalil
Launchmetrics Tech & Product Blog
Feb 9, 2024 · 11 min read

Welcome to this tech blog article, where we’ll dive into the specifics of our latest technology, “Transformer over Transformer,” crafted right here at Launchmetrics. This cutting-edge approach is built upon NLP transformers, representing a significant advancement in AI for natural language understanding. Our focus is to improve brand detection in written media articles and uncover the industries those brands relate to. Let’s explore the intricacies of our advanced brand-industry link recognition algorithm, providing deeper insights and empowering our clients to make data-driven decisions in the FLB (Fashion, Luxury, and Beauty) industry.

Why design a brand_industry link model?

We aim to achieve precise brand-industry detection from articles, revolutionizing the FLB market. Instead of relying on separate models, our unified approach seamlessly integrates both aspects, offering distinct advantages for a more sophisticated market understanding. For instance, envision a scenario where an influencer’s wedding attire becomes a topic of discussion: she wears a dress from Chanel and adorns herself with Dior jewelry. Separate models would identify Chanel and Dior as brands, and fashion and jewelry as industries, but our model goes beyond that: it associates Chanel with fashion and Dior with jewelry, empowering companies with deeper insights into specific market segments. This nuanced understanding facilitates informed decisions, targeted strategies, and capitalization on industry-specific opportunities. With our brand-industry detection model, brands gain a competitive edge, optimize market positioning, and navigate the dynamic landscape with precision.

Are transformers the right option?

Transformers have demonstrated their remarkable potential in NLP, making them a viable solution for brand_industry detection in written media articles when fine-tuned appropriately. Fine-tuning [1] involves training the transformer model on a domain-specific dataset that is carefully curated for brand_industry detection. This process enables the model to learn the specific linguistic patterns, contextual cues, and associations between brands and industries. By fine-tuning, the transformer can effectively recognize brand mentions, identify relevant industry contexts, and establish the connections between them within articles.

One key advantage of transformers [2] is their ability to capture long-range dependencies and contextual information, providing a comprehensive understanding of the text. This is particularly valuable in brand_industry detection, as it allows the model to consider the entire article and leverage the broader context for accurate identification. Transformers excel at encoding and retaining contextual information, enabling them to grasp the nuanced relationships between brands and industries.

Moreover, fine-tuning transformers with domain-specific knowledge and data enhances their performance in brand_industry detection. Incorporating industry-specific terminology, industry-related entities, and domain-specific linguistic patterns during the fine-tuning process ensures that the model captures the nuances unique to each industry. This empowers the model to make precise classifications and associations between brands and their corresponding industries, enabling companies to gain a deeper understanding of the market landscape.

While transformers like BERT [3], RoBERTa [4] and XLM-RoBERTa [5] have made significant advancements in NLP, relying solely on them falls short when tackling the complex problem of brand_industry detection in long articles. One prominent limitation of transformers is their inherent token limit. These models operate on fixed-length input sequences, often restricted to a few hundred tokens. This size limit is primarily due to the quadratic time and space complexity of the self-attention mechanism: self-attention compares each token to every other token, resulting in O(n²) complexity, where n is the number of tokens. Computational and memory requirements therefore grow rapidly with sequence length. As a consequence, longer articles may require truncation, and vital contextual information is lost. This can have a detrimental impact on brand_industry detection, as crucial mentions or associations may be excluded from the analysis. Hence, our contribution to the community is a Transformer over Transformer architecture that takes all of the text into account for classification tasks.
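
To see the limitation concretely, the snippet below tokenizes a long article with a typical 512-token window; everything beyond the window is simply dropped, taking any late brand mentions with it. The backbone name is an illustrative assumption.

```python
from transformers import AutoTokenizer

# Assumed backbone for illustration; any BERT/RoBERTa-style model behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# A long (synthetic) article: many sentences mentioning a brand.
article = " ".join(["Chanel unveiled its new haute couture collection in Paris."] * 400)

full = tokenizer(article)["input_ids"]
clipped = tokenizer(article, truncation=True, max_length=512)["input_ids"]

print(len(full), len(clipped))  # several thousand tokens vs. only 512
# Any brand mentioned after the 512th token never reaches the model.
```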

Transformer over Transformer

Architecture

The Transformer over Transformer (ToT) architecture is a two-stage approach designed to handle long texts efficiently. This architecture was inspired by the work published in [6]. Here’s a more in-depth explanation of each step; you may also refer to the following figure for a visual illustration, and to the code sketch right after the list:

  1. Chunking: In the first stage, the long text is split into smaller chunks. This division helps manage the computational complexity of processing lengthy inputs. Chunks can have overlapping or non-overlapping segments, depending on the specific model setting. The goal is to break down the text into manageable portions for subsequent processing.
  2. Tokenization: Each chunk is then further split into tokens using a tokenizer, which is a key component in any transformer-based architecture. Tokenization converts the text into a sequence of smaller units, such as words or subwords, that serve as inputs to the transformer model. This step allows the text to be processed in a structured manner.
  3. Tokens Embedding: The tokens from each chunk, along with their positional encoding, are fed into the base transformer model. Token embeddings capture the semantic meaning of individual tokens, while positional encodings convey the relative position of tokens within the sequence. The base transformer model processes the token embeddings and positional encodings to extract contextualized representations of the input tokens.
  4. Second Transformer: In the ToT architecture, the output vectors from the base transformer model serve as the input embeddings for a second transformer. This second transformer operates on the embeddings, considering additional contextual information or chunk positional encoding, depending on the specific implementation. The purpose of the second transformer is to refine and integrate the contextual information across different chunks, capturing the dependencies and relationships between them.
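
To make steps 1–3 concrete, here is a minimal sketch using the Hugging Face transformers library. The backbone name, chunk length, and overlap are illustrative assumptions rather than our production settings.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative choices -- not necessarily the production settings.
BASE_MODEL = "xlm-roberta-base"  # assumed multilingual backbone
MAX_LEN = 512                    # chunk length, including special tokens
STRIDE = 128                     # overlap between consecutive chunks

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
backbone = AutoModel.from_pretrained(BASE_MODEL)

def encode_long_text(text: str) -> torch.Tensor:
    """Steps 1-3: chunk, tokenize, and embed a long article.

    Returns one contextual first-token ([CLS]-style) vector per chunk,
    shaped (num_chunks, hidden_size).
    """
    # Steps 1-2: tokenize once, letting the tokenizer slice the token
    # sequence into overlapping, fixed-length chunks.
    enc = tokenizer(
        text,
        max_length=MAX_LEN,
        stride=STRIDE,
        truncation=True,
        return_overflowing_tokens=True,  # one row of input_ids per chunk
        padding="max_length",
        return_tensors="pt",
    )
    # Step 3: run every chunk through the base transformer.
    with torch.no_grad():
        out = backbone(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
        )
    # Keep each chunk's first-token embedding as its summary vector.
    return out.last_hidden_state[:, 0, :]
```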

In addition to the steps mentioned earlier, the ToT architecture typically incorporates a special token called the “[CLS]” token for the second transformer stage. This token is inserted at the beginning of each chunk or sequence and carries an important role in the architecture. The “[CLS]” token serves as a classification token, allowing the model to capture high-level information and summarize the representations learned from the tokens that follow it in the chunk. It acts as a compact representation of the chunk’s content and informs the second transformer about the classification or prediction task at hand. The [CLS] embedding from each chunk is used to feed the second-stage transformer, and the same [CLS] mechanism is employed again to feed the Feed-Forward Neural Network (FFNN) in the second stage. By including the “[CLS]” token, the model can distill the information from the chunk and encode it into a single vector representation. This vector, along with the representations of the other tokens in the chunk, is then passed to the second transformer for further refinement and integration with the contextual information from other chunks.
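
To make this concrete, here is a minimal PyTorch sketch of the second stage, assuming the per-chunk [CLS] vectors from the first stage are already stacked into a tensor. The learned document-level [CLS] vector, the chunk positional embeddings, and the classifier head are illustrative assumptions that mirror the description above (with the 2-layer, 4-head configuration detailed in the next section as defaults), not the exact Launchmetrics implementation.

```python
import torch
import torch.nn as nn

class SecondStage(nn.Module):
    """Second transformer of the ToT: refines the chunk-level [CLS] vectors."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 138,
                 max_chunks: int = 64, num_layers: int = 2, nhead: int = 4):
        super().__init__()
        # Learned document-level [CLS] vector and chunk positional embeddings
        # (the exact positional scheme is an assumption).
        self.doc_cls = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.chunk_pos = nn.Embedding(max_chunks + 1, hidden_size)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # FFNN head producing one logit per brand_industry link.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, num_labels))

    def forward(self, chunk_vectors: torch.Tensor) -> torch.Tensor:
        # chunk_vectors: (batch, num_chunks, hidden_size) from the first stage.
        batch, n_chunks, _ = chunk_vectors.shape
        cls = self.doc_cls.expand(batch, -1, -1)
        x = torch.cat([cls, chunk_vectors], dim=1)          # prepend [CLS]
        positions = torch.arange(n_chunks + 1, device=x.device)
        x = x + self.chunk_pos(positions)                   # add chunk positions
        x = self.encoder(x)
        # The document-level [CLS] output feeds the FFNN classification head.
        return self.classifier(x[:, 0, :])                  # (batch, num_labels)
```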

Tradeoff between performance and memory constraints

In the ToT architecture, the model incorporates automatic management of the micro-batch size for both the first and second stages to handle long texts effectively. In the first stage, the model dynamically adjusts the micro-batch size depending on the memory available on the GPU being used. This adaptive approach allows the model to process as many chunks as possible within the memory constraints, maximizing the utilization of computational resources.
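
One simple way to picture this adaptive behaviour is a loop that encodes chunks in micro-batches and shrinks the micro-batch whenever the GPU runs out of memory. This is only a sketch of the idea under that assumption; the production logic may instead size micro-batches up front from the reported free GPU memory.

```python
import torch

def encode_chunks_adaptive(backbone, input_ids, attention_mask,
                           micro_batch: int = 32) -> torch.Tensor:
    """Encode chunks in micro-batches, halving the batch size on GPU OOM.

    Simplified illustration only; not the exact production strategy.
    """
    outputs = []
    i = 0
    while i < input_ids.size(0):
        try:
            with torch.no_grad():
                out = backbone(
                    input_ids=input_ids[i:i + micro_batch],
                    attention_mask=attention_mask[i:i + micro_batch],
                )
            # Keep each chunk's first-token embedding, as in the first stage.
            outputs.append(out.last_hidden_state[:, 0, :])
            i += micro_batch
        except torch.cuda.OutOfMemoryError:
            if micro_batch == 1:
                raise  # cannot shrink the micro-batch any further
            torch.cuda.empty_cache()
            micro_batch //= 2  # retry the same slice with a smaller batch
    return torch.cat(outputs, dim=0)
```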

The ToT architecture consists of a pre-trained backbone model in the first stage, while the second-stage transformer starts without any pre-training and is trained from scratch. The backbone model, typically a pre-trained transformer, serves as the foundation for capturing contextualized token embeddings in the first stage. Its pre-training allows it to learn general language representations, which are then utilized in subsequent stages for brand-industry detection.

In the second stage, a 2-layer transformer with a 4-head attention mechanism is employed. This transformer architecture is specifically designed to refine and integrate the embeddings generated in the first stage. The 2-layer structure enables the model to capture higher-order dependencies and interactions between the tokens, while the 4-head attention mechanism facilitates capturing different aspects of contextual information and relationships.

In scenarios where extremely long texts need to be managed, the size of the transformer in the second stage can be adapted. This adaptability allows for adjusting the architecture’s parameters, such as the number of layers, attention heads, or hidden dimensions, to accommodate the specific requirements of very long texts. This ensures that the model can effectively handle and process extended text inputs without sacrificing performance or encountering memory constraints.
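
Reusing the SecondStage sketch from the architecture section, adapting the second stage for extremely long texts could be as simple as changing its constructor arguments; the larger values below are hypothetical.

```python
# Default configuration, matching the 2-layer / 4-head setup described above.
second_stage = SecondStage(hidden_size=768, num_labels=138)

# A hypothetical larger configuration for extremely long documents:
# room for more chunks, more layers, and more attention heads.
second_stage_xl = SecondStage(hidden_size=768, num_labels=138,
                              max_chunks=256, num_layers=4, nhead=8)
```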

Summary

Hence, by using the ToT architecture, the model can effectively handle long texts by breaking them into manageable chunks, processing them through a base transformer model, and then refining the representations through a second transformer. This approach enables the model to capture both local and global context, incorporating information from individual tokens as well as the relationships between chunks, resulting in a comprehensive understanding of the entire text. The ToT architecture addresses the challenges of processing long texts efficiently and effectively in transformer-based models.

A real example of how we trained a ToT at Launchmetrics for brand_industry detection

Here are some results of our multi-label classification problem, focusing on the detection of brand_industry links within large texts.

Data analysis

Our dataset consists of 209,883 documents sourced from various online platforms, including Online Magazines, Online Newspapers, Portals, General sources, Trade publications, and Blogs. The task involves predicting 138 brand_industry links, which serve as the classification classes. To address this problem, we have designed a custom ToT model discussed in the previous section and developed by Launchmetrics.

Our model is trained on a diverse range of brand classes, with 64 distinct brands represented. Each brand class is well represented, containing more than 1,000 samples. Additionally, we included some ambiguous brand data, such as Guess, Chloé, and Celine, to enhance the model’s ability to handle cases with unclear brand associations. The training set also includes documents without any labels, enabling the model to handle unclassified instances effectively.
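
Because a single article can mention several brand_industry links at once, the task is multi-label: the classification head outputs 138 independent logits trained with a binary cross-entropy loss, and an all-zero target row naturally represents an unlabelled document. A minimal sketch follows; the label indices and the 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_LABELS = 138  # one output per brand_industry link

criterion = nn.BCEWithLogitsLoss()

# Stand-in for the logits produced by the ToT classification head.
logits = torch.randn(2, NUM_LABELS)

# Document 0 carries two hypothetical links (indices 12 and 57 are made up);
# document 1 has no brand_industry link, so its target row stays all zeros.
targets = torch.zeros(2, NUM_LABELS)
targets[0, [12, 57]] = 1.0

loss = criterion(logits, targets)

# At inference, independent sigmoids plus a threshold yield the predicted links.
predicted = torch.sigmoid(logits) > 0.5
```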

Trained brands/industries

The trained brands include notable names such as Alberta Ferretti, Alexander Mc Queen, Azzedine Alaia, Balenciaga, Bottega Veneta, Bulgari, Burberry, Calvin Klein, Carolina Herrera, Cartier, Celine, Chanel, Chloe, Christian Louboutin, Diesel, Dior, Dolce & Gabbana, Elie Saab, Emporio Armani, Ermenegildo Zegna, Etro, Fendi, Giorgio Armani, Givenchy, Gucci, Guerlain, Guess, Hermes, Issey Miyake, Jean Paul Gaultier, Jil Sander, Jimmy Choo, Kenzo, Lanvin, Levi’s, Loewe, Louis Vuitton, Maison Margiela, Maje, Marc Jacobs, Marni, Michael Kors, Missoni, Miu Miu, MM6, Moschino, Nike, Paco Rabanne, Prada, Proenza Schouler, Puma, Ralph Lauren, Salvatore Ferragamo, Sandro, The Attico, Timberland, Tom Ford, Tommy Hilfiger, Tory Burch, Trussardi, Valentino, Versace, Viktor & Rolf, Yves Saint Laurent. These brands span multiple industries, including fashion, watches/jewels, beauty, and eyewear.

The dataset encompasses documents in various languages, with English being the most dominant (86,401 documents). Language distribution can be visualized in the following figure.

Evaluation

Regarding data distribution, our training set consists of 193,076 documents, while the test and validation sets contain 8,395 documents each. We rigorously evaluated the performance of our model on the test set to assess its generalization and predictive capabilities.

The evaluation of the model’s performance on the brand-industry link detection task yielded promising results. The weighted average precision, recall, and f1-score were calculated to provide a comprehensive assessment of the model’s overall effectiveness.

The precision score, which measures the proportion of correctly predicted brand-industry links among the total predicted links, achieved a value of 0.83. This indicates that the model exhibits a high level of accuracy in identifying the correct brand-industry links within the given texts. The recall score, also known as sensitivity, represents the proportion of correctly predicted brand-industry links among the actual links present in the dataset. The model achieved a recall score of 0.92, indicating that it successfully captures a significant portion of the brand-industry links present in the texts.

The f1-score, a balanced measure that combines precision and recall, provides an overall assessment of the model’s performance. With an f1-score of 0.87, the model demonstrates a good balance between precision and recall, achieving a high level of accuracy while effectively capturing the brand-industry links.
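
For reference, these weighted scores correspond to the standard multi-label metrics as computed, for example, with scikit-learn on binary indicator matrices; the tiny y_true and y_pred below are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Binary indicator matrices: rows are documents, columns are brand_industry links.
# A tiny made-up example with only 3 links, for illustration.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```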

These results highlight the effectiveness of the transformer over transformer model in accurately detecting brand-industry links within large texts. The high precision and recall scores indicate that the model successfully identifies relevant brand-industry associations, contributing to its overall performance in the multi-label classification task.

More about the company

Launchmetrics is the market’s first AI-powered Brand Performance Cloud, providing more than 1,700 clients with the software and data they need to connect strategy with execution. With over a decade of expertise, its Brand Performance Cloud helps executives launch campaigns, amplify reach, measure ROI, and benchmark brand performance.

Our AI-driven and proprietary Media Impact Value™ algorithm is the answer to modern measurement in a global world, making impact measurable. Launchmetrics brings a sharp focus to profitability, accountability, and efficiency while enabling the type of quick decision-making required for agility. With tools for sample management, event organization, PR monitoring, brand performance, and Voice analytics, the Launchmetrics Brand Performance Cloud enables brands to build a successful marketing strategy, all in one place.

Founded in New York and with operating headquarters in Paris, Launchmetrics has 450+ employees in twelve markets worldwide and offers support in five languages. Launchmetrics has been the trusted software and data provider to brands worldwide such as Tiffany’s, Vogue, NET A PORTER, KCD, Shiseido, The North Face, and Levi’s as well as partners like IMG, the Council of Fashion Designers of America, the Camera Nazionale Della Moda Italiana and the Fédération de la Haute Couture et de la Mode.

To learn more about Launchmetrics, please visit launchmetrics.com/newsroom and follow @launchmetrics.

References

[1] Wolf, Thomas, et al. “Huggingface’s transformers: State-of-the-art natural language processing.” arXiv preprint arXiv:1910.03771 (2019).

[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).

[3] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[4] Liu, Yinhan, et al. “RoBERTa: A robustly optimized BERT pretraining approach.” arXiv preprint arXiv:1907.11692 (2019).

[5] Conneau, Alexis, et al. “Unsupervised cross-lingual representation learning at scale.” arXiv preprint arXiv:1911.02116 (2019).

[6] Pappagari, Raghavendra, et al. “Hierarchical transformers for long document classification.” 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2019, pp. 838–844.

Acknowledgements

Special thanks go to Anna Bosch Rue (VP Data Intelligence), David Pecchioli (Lead Data Scientist), Arnaud Nicklaus (VP Software Development), Katherine Knight (AVP, Brand & Communications) and Juliette Le Guennec (PeopleOps Manager Tech Data & Product) for their review and feedback.

Maysaa Khalil is on LinkedIn
