Landscape of Polish LLMs

Aleksander Obuchowski
TheLion.AI
Dec 3, 2023 · 12 min read

Introduction

This article takes you into the innovative world of Polish Large Language Models (LLMs), highlighting the significant progress and challenges in the field. It focuses on the unique aspects of Polish language processing, showcasing the contributions of local researchers and companies.

Selection criteria

Firstly, in this report we have considered only generative models based on the decoder transformer architecture, which is why we don’t mention models such as HerBERT or plT5 made by Allegro (although if you are interested in obtaining good sentence and word representations rather than text generation, they are worth checking out).

Secondly, we don’t include multilingual models. There are two reasons behind this decision:

  • This report aims to promote Polish LLMs and Polish researchers and companies that work on them.
  • While many models claim to support Polish together with dozens or even hundreds of languages, Polish typically amounts to a fraction of a percent of the training data, while the vast majority is in English (see, e.g., the Llama 2 section). This typically leads to poor performance in Polish.

Finally, we focus on published open-source LLMs that satisfy the following criteria:

  • are published on Hugging Face
  • have at least ten downloads
  • have some form of README or other indication of how the model was trained

That’s why we haven’t included, e.g., the mysterious Polish ChatGPT by SentiOne.

However, if you know a model that is not on this list but should be, or you would like to clarify information about the model you have created, feel free to message us.

Models based on LLama 2 architecture

Llama 2 is an updated collection of pre-trained and fine-tuned large language models introduced by Meta researchers.

Building upon its predecessor, LLaMA, Llama 2 brings several enhancements. The pretraining corpus has been expanded by 40%, allowing the model to learn from a more extensive and diverse set of publicly available data. Additionally, the context length has been doubled, enabling the model to consider more context when generating responses, which leads to improved output quality and accuracy.

Llama 2 is open-source, which sparked significant interest and collaboration in the AI research community. This openness has facilitated widespread experimentation and innovation, as researchers and developers globally can access, modify, and improve the model. The availability of Llama 2’s weights and code has also democratized access to cutting-edge language model technology, enabling smaller organizations and independent researchers to contribute to and benefit from advancements in AI.

Trurl 2

Author: Voicelab.AI

Voicelab.AI is a Polish technology company involved in processing and understanding speech in multiple languages, primarily Polish and English, as well as German, Ukrainian, Russian, Romanian, and others. The company also conducts research and development work, including creating new algorithms based on artificial intelligence. VoiceLab.AI is the first company in Poland with proprietary automatic speech recognition technology.

Model size: The model comes in 2 variants: 7B and 13B

Type: Finetuned

Training dataset: 970k conversational Polish and English samples. The training data includes Q&A pairs from various sources: Alpaca comparison data with GPT, Falcon comparison data, Dolly 15k, OASST1, PKU-SafeRLHF, ShareGPT version 2023.05.08v0 (filtered and cleaned), Voicelab private datasets for JSON data extraction, modification, and analysis, the CURLICAT dataset containing journal entries, a dataset of Q&A pairs from the Polish Wikipedia grouped into conversations, MMLU data in textual format, and a Voicelab private dataset with sales conversations, arguments and objections, paraphrases, contact reason detection, and corrected dialogues.

Optimized versions: The model has several optimized versions, including 8-bit quantization
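
As a rough illustration of how such quantized checkpoints are typically consumed, below is a minimal sketch that loads a causal language model in 8-bit using the transformers and bitsandbytes libraries. The repository id is an assumption on our part, so check Voicelab’s Hugging Face profile for the exact name.

```python
# Minimal sketch: loading a causal LM with 8-bit quantization via
# transformers + bitsandbytes (also requires accelerate for device_map).
# The repository id below is assumed -- check the author's Hugging Face
# profile for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Voicelab/trurl-2-7b"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # places layers on the available GPU(s)
)

prompt = "Jakie są największe miasta w Polsce?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```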

APT3

Author: Azurro

Azurro, a dynamic software development company established in 2010, excels in creating top-quality, flexible solutions tailored to client needs. Using state-of-the-art technologies and automated test suites, Azurro assures the highest quality in its services. The company’s strength lies in its highly motivated team of software engineers who are committed to elevating clients’ businesses by overcoming technical challenges with unmatched skills and creativity.

Model size: The model comes in 2 variants: 275M and 500M; there is also an older version with 1B parameters

Type: Trained from scratch

Training dataset: 21 billion tokens: ebooks 8%, Polish Wikipedia 4%, web crawl data 88%

Optimized versions: The model has several optimized versions, including 8-bit quantization

It is worth noting that the model was trained on a consumer-grade RTX 4090 GPU, and the authors shared extensive details about the training process on the model’s page.

Polpaca

Author: Marcin Mosiolek

Marcin has accumulated 12+ years of experience building successful AI-powered products for startups and large enterprises. He has created and deployed machine learning solutions, including large-scale document processing, pharmaceutical document review, and multiple search-related products. Marcin also led initiatives related to autonomous driving and augmented reality. Currently, he holds the role of AI Architect at Sii Poland, where he coordinates the work of AI Engineers in automated document reading projects.

Model size: 7B

Type: Finetuned

Dataset: The original Alpaca dataset machine-translated into Polish

The author has mentioned the limitations of the model in his Medium Article. The model, however, still achieves good results given its simple fine-tuning process, and we commend the author for sharing its limitations and negative results.
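
For readers curious how such a dataset can be prepared, here is a rough sketch of machine-translating the Alpaca instructions into Polish. This is not the author’s exact pipeline: the dataset id is assumed, and translate_en_to_pl is a hypothetical stand-in for whatever MT model or API you plug in.

```python
# Rough sketch: machine-translating the Alpaca instruction dataset into Polish.
# The dataset id is assumed and translate_en_to_pl is a hypothetical placeholder,
# not a real translation API.
from datasets import load_dataset

def translate_en_to_pl(text: str) -> str:
    """Placeholder: swap in a real English -> Polish MT model or API call."""
    return text  # identity stand-in so the sketch runs end to end

alpaca = load_dataset("tatsu-lab/alpaca", split="train")  # assumed dataset id

def translate_example(example):
    return {
        "instruction": translate_en_to_pl(example["instruction"]),
        "input": translate_en_to_pl(example["input"]) if example["input"] else "",
        "output": translate_en_to_pl(example["output"]),
    }

polish_alpaca = alpaca.map(translate_example)
polish_alpaca.to_json("alpaca_pl.jsonl")
```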

Llama 2

Author: Lajonbot

Although we tried to find information about Lajonbot, we couldn’t find anything beyond their Hugging Face profile. It is worth mentioning, however, that they have published numerous models on Hugging Face, as well as their quantized versions. We at TheLion.AI also commend their fascination with lions 🙂

Model size: The model has two versions: 7B and 13B

Type: Finetuned

Dataset: Although not explicitly stated, the dataset seems to be a Polish version of the Databricks Dolly 15k Dataset

Vicuna

Vicuna (base model) is trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. A preliminary evaluation using GPT-4 as a judge shows that Vicuna-13B achieves more than 90% of the quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90% of cases.

Author: Lajonbot

Model size: The model has two versions: 7B and 13B

Type: Finetuned

Dataset: Although not explicitly stated, the dataset seems to be a Polish version of the Databricks Dolly 15k Dataset

Optimized versions: The model has several optimized versions, including GGML and GPTQ

WizardLM

WizardLM (base model) focuses on creating large amounts of instruction data with varying levels of complexity using an LLM instead of humans. Starting with an initial set of instructions, the authors use their proposed Evol-Instruct method to rewrite them step by step into more complex instructions. They then mix all the generated instruction data to fine-tune LLaMA. According to the authors, the strongest models in this family surpass GPT-4 (2023/03/15), ChatGPT-3.5, and Claude 2 on the HumanEval benchmark. A toy sketch of the evolution loop is shown after this entry’s details.

Author: Lajonbot

Model size: The model has two versions: 7B and 13B

Type: Finetuned

Dataset: Although not explicitly stated, the dataset seems to be a Polish version of the Databricks Dolly 15k Dataset
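
To make the Evol-Instruct idea concrete, here is a toy sketch of the evolution loop: seed instructions are repeatedly rewritten into more complex variants by an LLM and pooled for fine-tuning. The prompts and the llm callable are our own illustrative stand-ins, not the prompts used by the WizardLM authors.

```python
# Toy sketch of the Evol-Instruct idea: an LLM iteratively rewrites seed
# instructions into more complex ones, and the evolved pool is later used
# for fine-tuning. `llm` is a hypothetical text-completion callable and the
# prompts are illustrative, not the original WizardLM prompts.
import random
from typing import Callable, List

EVOLVE_PROMPTS = [
    "Rewrite this instruction so that it requires an extra reasoning step:\n{instruction}",
    "Rewrite this instruction by adding a concrete constraint or input format:\n{instruction}",
    "Rewrite this instruction so that it covers a rarer, more specific case:\n{instruction}",
]

def evol_instruct(seeds: List[str], llm: Callable[[str], str], rounds: int = 3) -> List[str]:
    """Return the seed instructions plus progressively more complex rewrites."""
    pool = list(seeds)
    current = list(seeds)
    for _ in range(rounds):
        evolved = [llm(random.choice(EVOLVE_PROMPTS).format(instruction=i)) for i in current]
        pool.extend(evolved)  # keep every generation in the training pool
        current = evolved     # evolve the newest generation further
    return pool
```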

StableBeluga

StableBeluga (base model) is a Llama 2 model fine-tuned on an Orca-style dataset. This rich collection of augmented FLAN data aligns, as closely as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for NLP researchers and developers.

Author: Lajonbot

Model size: The model has two versions: 7B and 13B

Type: Finetuned

Dataset: Although not explicitly stated, the dataset seems to be a Polish version of the Databricks Dolly 15k Dataset

Models based on Mistral

Mistral 7B is a 7.3B parameter model that:

  • Outperforms Llama 2 13B on all benchmarks
  • Outperforms Llama 1 34B on many benchmarks
  • Approaches CodeLlama 7B performance on code while remaining good at English tasks
  • Uses Grouped-query attention (GQA) for faster inference
  • Uses Sliding Window Attention (SWA) to handle longer sequences at a smaller cost

Mistral 7B uses a sliding window attention (SWA) mechanism (Child et al., Beltagy et al.), in which each layer attends to the previous 4,096 hidden states. The main improvement, and the reason this was initially investigated, is a linear compute cost of O(sliding_window · seq_len). In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for a sequence length of 16k with a window of 4k. A small sketch of the resulting attention mask is shown below.
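
To visualize what the sliding window does, here is a minimal PyTorch sketch of the attention mask it implies; this is our own illustration of the banded pattern, not Mistral's actual implementation.

```python
# Minimal sketch of a sliding-window (banded causal) attention mask:
# each position may attend only to itself and the previous `window - 1`
# positions, so per-layer cost grows as O(window * seq_len) instead of
# O(seq_len^2). Illustrative only, not Mistral's implementation.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks allowed (query, key) attention pairs."""
    positions = torch.arange(seq_len)
    query = positions.unsqueeze(1)  # shape (seq_len, 1)
    key = positions.unsqueeze(0)    # shape (1, seq_len)
    causal = key <= query                 # no attending to future tokens
    in_window = (query - key) < window    # only the last `window` tokens
    return causal & in_window

print(sliding_window_mask(seq_len=8, window=4).int())  # banded lower-triangular pattern
```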

Krakowiak

Author: Szymon Ruciński

Szymon Ruciński describes himself as an AI wizard and marathon runner. As a Junior Machine Learning Engineer based in Switzerland, Szymon is an expert in computer vision, audio, and image processing. He also has a knack for troubleshooting and technical support. He is currently working as a Machine Learning Engineer at Apostroph Switzerland — Language Intelligence.

Model size: 7B

Type: Finetuned

Dataset: A custom-built corpus of 100K Polish instructions for text generation

Zephyr

Author: Nodzu

Model size: 7B

Dataset: The dataset is composed of 2 sources: a Polish prose dataset and a Polish version of the Databricks Dolly 15k Dataset

Type: Finetuned

Optimized versions: The model has several optimized versions, including exl2, GGUF, AWQ, and GPTQ

Models based on GPT2

polish-gpt2

Author: Sławomir Dadas, PhD

Sławomir is a software engineer with over 10 years of experience in multiple technologies: backend (Java), frontend (JavaScript), and machine learning (Python). He is currently engaged in research and development with deep learning and NLP. His achievements include:

  • author of multiple scientific publications
  • winner of the 2021 GovTech challenge on detecting abusive clauses in B2C contracts
  • 3rd place in the SemEval 2023 Visual Word Sense Disambiguation task
  • winner of 5 out of 10 language subtasks in the SemEval 2023 Multilingual Tweet Intimacy Analysis challenge
  • author of other state-of-the-art Polish NLP models, such as Polish RoBERTa

Model size: The model comes in various sizes: 112M, 345M, 774M, and 1.5B

Dataset: Unfortunately, we couldn’t find any information about the dataset

polish-gpt2

Author: RadLab

The RadLab team specializes in machine learning and natural language processing, integrating these passions into their daily software development activities. They are committed to promoting knowledge and free resources, including developing word vector models and semantic similarity lists for the Polish language. Additionally, they redistribute Polish language text resources and share insights from their internal machine learning and NLP experiments on their blog.

Model size: 117M

Type: Trained from scratch

Dataset: 10GB of data, mostly from CLARIN

PapugaGPT

Author: Flax Community

Flax/JAX community week was an event organized by Hugging Face. The goal of this event was to make compute-intensive NLP and CV projects (like pre-training BERT, GPT-2, CLIP, or ViT) practicable for a wider audience of engineers and researchers.

Model size: 117M

Type: Trained from scratch

Dataset: The Polish subset of the multilingual OSCAR corpus
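
These GPT-2 based models are small enough to try locally; a minimal sketch of generating Polish text with the transformers pipeline is shown below. The repository id is assumed, so check the authors' Hugging Face profiles for the exact names and sizes.

```python
# Minimal sketch: generating Polish text with one of the GPT-2 based models
# via the transformers pipeline. The repository id is assumed -- check the
# authors' Hugging Face profiles for the exact name.
from transformers import pipeline

generator = pipeline("text-generation", model="flax-community/papuGaPT2")  # assumed id
result = generator("Najpiękniejsze polskie miasto to", max_new_tokens=30, do_sample=True)
print(result[0]["generated_text"])
```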

Conclusions

The landscape of Polish Large Language Models (LLMs) presents a dynamic and flourishing field of artificial intelligence characterized by innovative approaches and a focus on catering to the unique linguistic features of the Polish language. This report has highlighted several key trends and insights:

Focus on Polish Language Specifics: The decision to concentrate on Polish-only models rather than multilingual ones has underscored the importance of developing LLMs that are finely tuned to the nuances of the Polish language. This approach promises more accurate and contextually relevant models, vital for applications ranging from natural language processing to conversational AI in the Polish context.

Diverse Model Applications: The various models discussed showcase a wide spectrum of applications. These models are not just academic exercises but practical tools addressing real-world needs like document processing, speech recognition, and augmented reality.

Collaboration and Open-Source Ethics: A significant number of these models are available on platforms like Hugging Face, indicating a strong open-source ethos within the Polish AI community. This accessibility encourages collaboration, further development, and democratization of AI technology.

Emerging Leaders in AI: The report spotlights several key individuals and organizations, like Voicelab.AI, who are at the forefront of AI research and development in Poland. Their work not only contributes to the global AI landscape but also positions Poland as a hub for AI innovation.

Limitations

Lack of dedicated datasets

A lot of the models were trained on translated versions of English datasets. While such datasets can be obtained easily, this severely limits the models’ understanding of Polish culture and can propagate errors from machine translation models.

To truly advance the capabilities of Polish LLMs, we need to create unified pre-training datasets that accurately represent the Polish language.

A notable approach towards obtaining a large-scale Polish corpus is SpeakLeash.org, which has so far collected more than 800GB of data that can be used to fine-tune LLMs.

We also need a large-scale instruction fine-tuning dataset to move from LLMs to chat models. Currently, the Polish ML Community is working on creating a Polish version of the OpenAssistant dataset.

Lack of unified benchmarks

It is hard to compare the models presented above. Although some are based on newer architectures and better base models, there is no way to tell whether they actually perform better than the others.

That’s why we also call for the creation of a unified Polish benchmark for testing LLMs (similar to the Hugging Face Open LLM Leaderboard). Previously, Allegro showed initiative by presenting the KLEJ benchmark, which is focused on fine-tuning encoder-based models.

Lack of computing power

The largest fine-tuned model in this report has 13B parameters, and the largest one trained from scratch has 1.5B parameters, which is only a fraction of the largest open-source LLMs with up to 180B parameters.

It is hard to compete with companies like OpenAI that spend millions of VC dollars on computing power and with global open-source LLM leaders such as AI at Meta.

As such resources are beyond the capabilities of even the largest Polish companies willing to focus on open-source LLMs, it is necessary to develop computational grants allowing researchers to utilize infrastructures such as PLGrid.

Did you like the article?

Follow us on LinkedIn: https://pl.linkedin.com/company/thelionai
