Journey to BERT — After Bert Era

Goknur Ercan
Sahibinden Technology
6 min read · Jan 29, 2024

This article is a follow-up to Journey to BERT — Before Bert Era; you may want to read that one before starting here.

Bert Methodology

Unlike earlier models, Bert uses a bidirectional encoder: it looks at the context both to the left and to the right of a word at the same time, instead of reading the input strictly left-to-right or right-to-left. Bert is pretrained on two tasks, Masked Language Modelling and Next Sentence Prediction. The core idea of Masked Language Modelling is to replace a word in a sentence with a [MASK] token and let the model recover it from the context: Bert assigns a probability to every candidate word for the masked position. An example is shown below:

Probabilistic output of Bert for the masked word

Descartes' famous sentence is used in the example, and Bert calculates a probability for each candidate word in the masked position.
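The same experiment can be reproduced with the Hugging Face transformers library. Below is a minimal sketch, assuming the public bert-base-uncased checkpoint; the masked sentence is only an example.

# Minimal fill-mask sketch with Hugging Face transformers (assumed: the public
# bert-base-uncased checkpoint); Bert scores every vocabulary word for [MASK].
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("I think, therefore I [MASK]."):
    # each prediction carries the candidate word and its probability score
    print(prediction["token_str"], round(prediction["score"], 4))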

During pretraining, 15% of the tokens in each input sentence are selected for masking. Of these, 80% are replaced with the [MASK] token, 10% with a random word, and 10% are left unchanged. The model learns to recover the original words through the self-attention layers of its Transformer encoder.
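In practice this masking is usually handled by a data collator. The sketch below is illustrative and assumes the transformers library with its default MLM settings; the sample sentence is arbitrary.

# Illustrative MLM masking sketch: 15% of tokens are selected, following the
# original Bert recipe (80% become [MASK], 10% a random token, 10% unchanged).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("I think, therefore I am.")])
print(batch["input_ids"])  # some token ids replaced by [MASK] or random ids
print(batch["labels"])     # original ids at masked positions, -100 elsewhere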

Although Bert produces good results on masked examples, biases present in the training set also show up in its predictions.

Bert Gender Bias Example

In the example above, the model predicts that Jim (a male name) worked as a barber, policeman or fisherman, while Jane (a female name) worked as a nurse, secretary or waitress. Such sexist or racist outputs are hard to avoid, since the model simply reflects the sentences that occur in its training set.

You can check https://pair.withgoogle.com/explorables/fill-in-the-blank/ for more examples.

Fine-Tuning Bert

Fine-tuning in deep learning means continuing to train a pretrained model on new data to improve its performance on a specific task. For example, if we want to work on medical reports, we can fine-tune the Bert model on such data to get better results. While pretraining requires millions of examples, fine-tuning can be done with only a few thousand.

Google claims that fine-tuning a Bert model takes about 30 minutes on a TPU and a few hours on a GPU. Fine-tuning examples can be found in the Hugging Face documentation; huggingface.co is also a good place to follow NLP tasks and the latest improvements.
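For a rough idea of what fine-tuning looks like in code, here is a minimal sketch with the transformers Trainer API; the dataset (imdb), label count, sample size and hyperparameters are placeholders rather than a recommended recipe.

# Rough fine-tuning sketch with the Hugging Face Trainer; the dataset, number
# of labels and hyperparameters below are placeholders chosen for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

dataset = load_dataset("imdb")  # any labelled text dataset works the same way
encoded = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()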

Bert Model Variants

● Bert-base: The base Bert model, with about 110 million parameters. The model can be found here.

● Bert-large: The large version of the base model, with about 340 million parameters. The model can be found here.

● DistilBERT: A faster variant obtained by distilling the base Bert model, which is where the name comes from. It runs about 60% faster while retaining about 97% of the base model's performance. The model can be found here.

● RoBERTa: A robustly optimised version of Bert trained only with Masked Language Modelling; its pretraining recipe differs from the base version (more data, longer training, no Next Sentence Prediction). The model can be found here.

● sBERT: A Bert variant for creating sentence embeddings. Instead of word embeddings, it produces sentence vectors by combining Bert with Siamese networks (a short usage sketch follows this list). The model can be found here.
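As a quick illustration of sBERT, the sketch below uses the sentence-transformers library; the all-MiniLM-L6-v2 checkpoint is just one publicly available sentence-embedding model.

# Minimal sentence-embedding sketch with sentence-transformers; the checkpoint
# name is one public example model, not the only option.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["Bert produces word embeddings.",
             "Sentence-Bert produces sentence embeddings."]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the pair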

Bert in Google Search and SEO

The biggest impact of Bert on our lives is probably its effect on Google Search. A Google blog post from 2019 describes the integration of Bert into the search engine and its success in answering search queries.

Google Search Result Before and After Bert Integration

The example shows how Google Search improved with the Bert model. The query asks about Brazilian travellers going to the USA, not US citizens travelling to Brazil. The previous algorithm ignored the word “to” and the order of the sentence, and returned content aimed at US citizens travelling to Brazil. After the integration of Bert, the engine understands word order and common words (such as “to” in this example) much better.

This may not seem related to SEO; however, instead of counting frequent terms, Google now understands the content of pages and queries and matches them accordingly. Google's crawlers can now ignore repetitive content (which can be seen on our news sites), so the actual content becomes more important from their perspective.

Possible Future Methods

The most likely direction is the use of large language models (LLMs). LLMs have far more parameters, which lets them model the data better than previous models. For example, Bert-large has 340 million parameters, while GPT-3 already has 175 billion, and GPT-4 is believed to be even larger (its exact size has not been disclosed). The remarkable success of LLMs and their capabilities suggest they could be the future of search engines and SEO.

GPT

The most famous large language model as of October 2023. The model provides chat functionality and is trained on large amounts of text from the Web, up to its training cutoff. Its accuracy and success are well ahead of its counterparts, to the point that its answers are sometimes treated as ground truth.

PaLM

PaLM is another LLM, developed by Google. It can be considered a rival to GPT, since their chatbot versions compete directly. PaLM is a multipurpose LLM that performs well in advanced reasoning, coding, multilingual translation and natural language generation. PaLM is presented as an AI family with multiple variants, such as Med-PaLM (medical), Sec-PaLM (security) and Gecko (a mobile-sized version).

ChatGPT and Bard

Both ChatGPT and Bard are AI-powered chatbots: ChatGPT is developed by OpenAI and Bard by Google. Just as ChatGPT is built on GPT-3.5 (and GPT-4 for the premium tier), Bard is built on PaLM. Bard has an edge in freshness, since it can pull current data from Google Search, while ChatGPT is limited to the knowledge in its training data.

Llama

Another large language model, developed by Meta. Its sizes range from 7 billion to 70 billion parameters. Although it does not outperform the GPT variants, the model is open source and can be used freely by the community. Fine-tuning Llama on a specific task might give better results than GPT in some cases.

Mistral

A brand-new large language model that is also open source and a strong competitor to Llama. Its main advantage is being lightweight: it can run even on ordinary PCs.
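As a rough sketch of what running such an open-weight model locally can look like with the transformers library (the mistralai/Mistral-7B-v0.1 checkpoint is Mistral's public 7B model; a GPU or plenty of RAM, plus the accelerate package, is assumed):

# Rough local-inference sketch; assumes the public mistralai/Mistral-7B-v0.1
# checkpoint and enough hardware (GPU or RAM) to hold a 7B-parameter model.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-v0.1",
                     device_map="auto")
output = generator("Bert changed search engines because", max_new_tokens=50)
print(output[0]["generated_text"])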

References

1- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, May 24). Bert: Pre-training of deep bidirectional Transformers for language understanding. arXiv.org. https://arxiv.org/abs/1810.04805

2- Hugging Face — The AI community building the future. (n.d.). https://huggingface.co/

3- Nayak, P. (2019, October 25). Understanding searches better than ever before. Google. https://blog.google/products/search/search-language-understanding-bert/

4- Reimers, N., & Gurevych, I. (2019, August 27). Sentence-bert: Sentence embeddings using Siamese Bert-Networks. arXiv.org. https://arxiv.org/abs/1908.10084

5- OpenAI. (n.d.). https://openai.com/

6- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023, February 27). Llama: Open and efficient foundation language models. arXiv.org. https://arxiv.org/abs/2302.13971

7- Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. de las, Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., & Sayed, W. E. (2023, October 10). Mistral 7B. arXiv.org. https://arxiv.org/abs/2310.06825

8- Google. (n.d.). Bard. https://bard.google.com/

9- Google AI. (n.d.). PaLM 2. https://ai.google/discover/palm2/
