AI Pioneers Gather at BAAI 2024: Unveiling Innovations in Large-Scale AI Models for Language, Multimodal, Embodied, and Bio-Computing Applications, and FlagOpen 2.0

Jun 17, 2024

The 6th annual BAAI Conference, hosted by the Beijing Academy of Artificial Intelligence (BAAI), commenced today in Beijing, marking a significant gathering for global AI insiders.

This premier event, themed “Global Vision, Ideas in Collision, Leading Cutting-Edge Innovations,” brings together top researchers and industry leaders from around the world to share their latest findings and discuss the future of artificial intelligence.

This year’s BAAI 2024 conference showcases an esteemed roster of speakers, featuring Turing Award laureate Andrew Chi-Chih Yao, together with eminent figures from leading international research institutions such as OpenAI, Meta, DeepMind, Stanford University, and UC Berkeley, alongside CEOs and CTOs from prominent Chinese AI technology firms including Baidu, 01.AI, Baichuan AI, Zhipu AI, and ModelBest.

Over two busy days, more than 200 distinguished AI scholars and industry experts convene to engage in insightful discussions on the critical trajectories and application scenarios of cutting-edge artificial intelligence technologies.

The opening ceremony was presided over by Tiejun Huang, Chairman of BAAI.

During the ceremony, Zhongyuan Wang, President of BAAI, delivered the 2024 BAAI annual progress report, detailing the institute's pioneering research advancements in large-scale models across language, multimodality, embodied intelligence, and bio-computing. He also outlined significant upgrades to, and the strategic layout of, BAAI's comprehensive full-stack open-source technology foundation.

Wang emphasized the current achievements of large language models, noting their core capabilities in understanding and reasoning, which are pivotal for general AI. He discussed the technological trajectory that integrates these language models to align and map other modalities, thereby enhancing multimodal understanding and generation capabilities.

However, Wang pointed out that this is not the ultimate technological trajectory for enabling artificial intelligence to perceive and understand the physical world. Instead, a unified-model paradigm should be adopted to handle multimodal inputs and outputs, endowing the model with native multimodal expansion capabilities and advancing toward a world model.

Looking ahead, Wang envisioned large models merging with smart hardware in the form of digital agents, transitioning from the digital realm to the physical world as embodied intelligence.

Furthermore, he suggested that large models could introduce new paradigms for knowledge representation in scientific research, accelerating discoveries of the laws governing the microscopic physical world and advancing towards the Holy Grail of General Artificial Intelligence.

BAAI’s Large-Scale Models for Language: The Tele-FLM Series & the BGE Series

Addressing the high computational costs associated with large-scale model training, Beijing Academy of Artificial Intelligence (BAAI) and China Telecom AI Research Institute (TeleAI) have jointly developed and launched the world’s first low-carbon trillion-parameter dense language model, Tele-FLM-1T. This model, along with the 52 billion-parameter and 102 billion-parameter versions, constitutes the Tele-FLM series of models.

The Tele-FLM series has achieved low-carbon growth, completing the training of three models totaling 2.3 trillion tokens using only 9% of the industry-standard computational resources. This feat was accomplished in four months with 112 A800 servers. The training process was conducted with zero adjustments and retries, demonstrating high computational efficiency, model convergence, and stability. Currently, the 52B version of the Tele-FLM series has been fully open-sourced, including core technologies (growth techniques, optimal hyperparameter prediction) and training details (loss curves, optimized hyperparameters, data ratios, and gradient norms), with the aim of benefiting the large model community. The Tele-FLM-1T version is set to be open-sourced soon, providing the community with excellent initial weights for training trillion-parameter dense models and addressing convergence challenges.

According to BPB (bits-per-byte) evaluations, Tele-FLM-52B is comparable to Llama3-70B and superior to Llama2-70B and Llama3-8B on English tasks. On Chinese tasks, Tele-FLM-52B outperforms Llama3-70B and Qwen1.5-72B, ranking as the strongest open-source model. For dialogue tasks, AlignBench assessments indicate that Tele-FLM-Chat (52B) has achieved 96% of GPT-4's capabilities on Chinese language tasks and 80% of GPT-4's overall capabilities.
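BPB normalizes a model's cross-entropy loss by the number of raw text bytes rather than by tokens, so models with different tokenizers can be compared directly. A minimal sketch of the computation (the helper name and example numbers are illustrative, not from the Tele-FLM codebase):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a corpus
    into bits per byte, making differently-tokenized models comparable."""
    return total_nll_nats / (total_bytes * math.log(2))

# Example: 1,000 tokens with a mean loss of 2.0 nats/token over 4,000 bytes of text.
total_nll = 1000 * 2.0            # summed NLL in nats
bpb = bits_per_byte(total_nll, 4000)
print(round(bpb, 4))              # ~0.7213
```

Because the denominator counts bytes of the original text, a model with a more aggressive tokenizer gains no artificial advantage from emitting fewer tokens.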

To address issues such as large model hallucinations, the Beijing Academy of Artificial Intelligence (BAAI) has independently developed the BGE (BAAI General Embedding) series of general semantic vector models. These models are applied to retrieval-augmented generation (RAG) technology, enabling precise semantic matching and supporting large models in accessing external knowledge.
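In a RAG pipeline, an embedding model such as BGE maps both the user query and candidate documents into vectors, and the documents closest to the query are handed to the large model as external knowledge. A minimal sketch of the retrieval step with toy, pre-computed vectors (a real pipeline would obtain the embeddings from a model like BGE):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query
    by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity of each doc with the query
    return np.argsort(-scores)[:k]

# Toy 3-d "embeddings"; in practice these come from an embedding model such as BGE.
docs = np.array([[0.9, 0.1, 0.0],     # doc 0
                 [0.0, 1.0, 0.1],     # doc 1
                 [0.8, 0.2, 0.1]])    # doc 2
query = np.array([1.0, 0.0, 0.0])

print(top_k(query, docs))             # docs 0 and 2 are closest to the query
```

The retrieved passages are then prepended to the prompt, which is how precise semantic matching lets the generator ground its answer in external knowledge instead of hallucinating.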

Since August 2023, the BGE series has undergone three iterations, excelling in tasks such as Chinese-English retrieval, multilingual retrieval, and fine-grained retrieval. These models have consistently outperformed embedding models from OpenAI, Google, Microsoft, Cohere, and other leading institutions, demonstrating significantly superior capabilities.

Currently, the BGE series ranks first among Chinese AI models in total downloads. It has been integrated into major AI development frameworks such as Hugging Face, LangChain, and LlamaIndex. Additionally, major cloud service providers including Tencent, Huawei, Alibaba, ByteDance, Microsoft, and Amazon have incorporated the BGE series into their platforms, offering commercial services to external clients.

This widespread adoption underscores the BGE series’ robustness and versatility in enhancing semantic understanding and retrieval across diverse applications.

Tele-FLM-52B open-source repository:
Tele-FLM-Chat demo (one-shot chat model):

The Emu 3 Native Multimodal World Model & the Bunny Series for Edge Devices

Most existing multimodal models are specialized for specific tasks, such as Stable Diffusion for text-to-image generation, Sora for text-to-video generation, and GPT-4V for image-to-text generation. Each type of model has its own architecture and methods; for example, the DiT architecture is commonly used for video generation in models like Sora. As a result, these models often possess isolated capabilities rather than integrated, native multimodal capability. For instance, Sora lacks the ability to understand both images and videos simultaneously.

To achieve the next generation of unified, end-to-end multimodal models, the Beijing Academy of Artificial Intelligence (BAAI) has launched Emu 3, a native multimodal world model. Emu 3 employs BAAI’s proprietary multimodal autoregressive approach, jointly trained on images, videos, and text. This approach endows the model with native multimodal capabilities, enabling unified input and output across images, videos, and text.
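The core idea of a multimodal autoregressive model is that images, video frames, and text are all discretized into tokens and concatenated into a single sequence, which one transformer predicts token by token. A toy illustration of such a unified stream (the tag names and token values are invented for illustration and are not Emu 3's actual tokenizer):

```python
# Each modality is discretized into integer tokens; special tags mark boundaries.
VOCAB = {"<text>": 0, "</text>": 1, "<image>": 2, "</image>": 3}

def pack(segments):
    """Interleave (modality, tokens) pairs into one flat autoregressive stream."""
    stream = []
    for modality, tokens in segments:
        stream.append(VOCAB[f"<{modality}>"])
        stream.extend(tokens)
        stream.append(VOCAB[f"</{modality}>"])
    return stream

# A text prompt, followed by quantized image tokens, followed by more text.
seq = pack([("text",  [17, 42, 9]),       # prompt tokens
            ("image", [101, 102, 103]),   # e.g., VQ codes for image patches
            ("text",  [11, 12])])         # continuation tokens
print(seq)
```

Because every modality lives in the same token stream, the same next-token objective yields both generation (emit image tokens after text) and understanding (emit text tokens after an image), which is what "unified input and output" refers to.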

Emu 3 is designed from the ground up for unified multimodal generation and understanding, and is currently capable of high-quality image and video generation, video completion, and understanding of the physical world.

In essence, Emu 3 unifies video, image, and text as well as generation and understanding. Notably, Emu 3 will be gradually open-sourced, following safety evaluations and continuous training.

To cater to edge devices, BAAI has introduced the Bunny series of lightweight multimodal models (Bunny-3B/4B/8B). This model series features a flexible architecture that supports various vision encoders and language-based foundation models. Comprehensive results from multiple benchmarks indicate that the multimodal capabilities of Bunny-8B achieve 87% of GPT-4o’s performance. Currently, the model weights, training code, and training data for the Bunny models have all been open-sourced.

Bunny Series Open-source repository:

BAAI on Embodied Intelligence: Grasping, VLA and More…

Over the past year, the Beijing Academy of Artificial Intelligence (BAAI) has achieved multiple world-class breakthroughs in the field of embodied intelligent large models. These advancements encompass general-purpose grasping technology, embodied operation VLA (Vision-Language-Action) large models, embodied navigation VLA large models, and self-developed robotic hardware.

In the realm of general grasping capabilities for embodied intelligence, the Beijing Academy of Artificial Intelligence (BAAI) has pioneered advancements to reach world-leading commercial-grade levels.

Addressing the challenge of generalization across various shapes and materials, BAAI has achieved a breakthrough with a success rate exceeding 95% in real-world experiments. Leveraging this technology, BAAI’s robots can accurately perceive the shapes and postures of transparent and highly reflective objects, even under complex lighting conditions involving transmission and reflection, and predict high-success-rate grasping poses.

In addition to grasping, BAAI has enhanced the cognitive abilities of robots by developing two specialized large model systems, each with a distinct role.

One of these systems is SAGE, a large model system for articulated object manipulation that can reflect and adapt to changing situations. This system effectively integrates the precise spatial geometric perception capabilities of small 3D vision models with the general object manipulation knowledge of large multimodal models, enabling robots to replan their operational processes after task failures.

The other system, Open6DOR, is the world’s first open-instruction six-degree-of-freedom pick-and-place large model system. This system allows robots to consider the position and orientation of objects during grasping, facilitating practical applications. Unlike Google’s RT series, which places objects in specified positions based on natural language instructions, Open6DOR further refines the control of the object’s posture. This technology significantly enhances the commercial applicability and value of embodied manipulation models.

To enable robots to navigate autonomously, BAAI has also developed NaVid, the world’s first end-to-end video-based multimodal embodied navigation model. NaVid directly takes video from the robot’s perspective and natural language instructions from users as inputs, outputting the robot’s movement control signals end-to-end. Unlike traditional robot navigation technologies, NaVid does not require mapping, depth information, or odometry data, relying solely on single-view RGB video streams from the robot’s camera. Trained using synthetic navigation data, NaVid achieves zero-shot real-world generalization in both indoor and outdoor environments through Sim2Real transfer, marking a bold and successful exploration in advanced technology.

Moreover, the research achievements in embodied large models have been applied in the medical field. In collaboration with Lingshi Zhiyuan, BAAI has developed the world’s first intelligent cardiac ultrasound robot, achieving the first autonomous cardiac ultrasound scan on a human. This innovation addresses the shortage of cardiac ultrasound doctors, low diagnosis accuracy, lack of standardization, and inefficiency. The intelligent cardiac ultrasound robot rapidly calculates and extracts cardiac features in dynamic environments, achieving L2 and L3 levels of automation comparable to autonomous driving. Clinical trials indicate that the robot matches experienced human doctors in accuracy, surpasses them in stability, and offers greater comfort by keeping contact force below 4 newtons, with efficiency on par with human doctors.

To enable general computer control, BAAI has introduced Cradle, a framework that allows AI agents to perform all tasks on a computer as humans do, using a mouse and keyboard.

Cradle comprises six modules: information collection, self-reflection, task inference, skill management, action planning, and memory. It offers powerful decision-making and reasoning capabilities, allowing the agent to “reflect on the past, summarize the present, and plan for the future.”

Unlike the industry’s common approach, Cradle achieves generality without relying on any internal APIs. BAAI has validated Cradle in collaboration with Kunlun Tech Research on popular games and productivity software. The agent can autonomously learn to play games and creatively edit images and videos based on prompts.

In the future, leveraging the technical advantages of large-scale multimodal models, BAAI will collaborate with universities and institutions like Peking University, Tsinghua University, and the Chinese Academy of Sciences, as well as industry partners like Galbot and Booster Robotics, to build an embodied intelligence innovation platform. This platform will focus on data, model, and scenario validation, fostering an innovation ecosystem for embodied intelligence.

Revolutionizing Biocomputing with Large-Scale Models

As large-scale models advance, AI is demonstrating significant value in various scientific fields. Biocomputing scientists aim to use large models to achieve breakthroughs in the microscopic world. In drug development, it usually takes over 10 years and $1 billion to bring a new drug to market, with 30% to 40% spent on drug design. AI can expedite tasks like compound screening and macromolecular structure modeling and prediction.
Can large models help us better understand and generate biological molecules?

At this conference, the Beijing Academy of Artificial Intelligence (BAAI) unveiled OpenComplex 2, a comprehensive all-atom biomolecular model capable of predicting proteins, RNA, DNA, carbohydrates, and small-molecule complexes. Based on all-atom modeling, OpenComplex 2 serves as a foundational model for life molecules: it not only predicts the stable structures of macromolecules but also shows preliminary capability to predict molecular conformations, polymorphisms, and folding processes.

In CAMEO (Continuous Automated Model Evaluation), the international competition for biomolecular structure prediction, OpenComplex has maintained the top position for two consecutive years, and it won the RNA automated track at CASP15 (Critical Assessment of protein Structure Prediction). OpenComplex outperforms similar models such as AlphaFold in accuracy and macroscopic structure, delivering comparable results without noise.

The OpenComplex platform has established an end-to-end deep learning framework for the unified prediction of three-dimensional structures of biomacromolecules, integrating “protein structure prediction,” “RNA structure prediction,” and “protein-RNA complex structure prediction” tasks. These tasks are inferred and trained within a unified “encoder-decoder” framework, supporting both multiple sequence alignment (MSA) and language model (LM) encoding strategies.

Leveraging these capabilities, life scientists can further explore the biological functions of proteins. Currently, BAAI has partnered with researchers to conduct studies on several significant diseases, providing insights into druggability and molecular mechanisms. In the future, the capabilities of OpenComplex may herald a new era in life sciences research, offering new possibilities for understanding complex mechanisms, such as those of the HIV virus and neurons.

Additionally, BAAI has developed the world’s first real-time digital twin cardiac computational model, achieving a bio-time/simulation-time ratio of less than 1 with high precision and positioning it at the forefront of international research. The real-time cardiac computational model marks the beginning of virtual cardiology research. Building on this model, BAAI will adopt an approach driven jointly by physics and data, integrating first principles with artificial intelligence methods.

This will enable the simulation of a “transparent heart” at subcellular, cellular, organ, and body levels. Furthermore, it can construct a digital twin heart reflecting a patient’s personalized physiological and pathological conditions based on clinical data, facilitating drug screening, treatment optimization, and preoperative planning in clinical applications. BAAI has also partnered with Peking University First Hospital, Anzhen Hospital, Changzheng Hospital, and Chaoyang Hospital to apply these technologies in clinical practice.

OpenComplex open-source repository:

BAAI Launches FlagOpen 2.0

The Beijing Academy of Artificial Intelligence (BAAI), as an innovative research institution, continues to lead the vanguard in advancing artificial intelligence technologies. Leveraging its status as a neutral, non-profit organization, BAAI is committed to constructing public infrastructures that address contemporary industry challenges.

Last year, to facilitate global developers in seamlessly initiating large model development and research, the Beijing Academy of Artificial Intelligence (BAAI) introduced FlagOpen 1.0. This open-source, full-stack platform supports heterogeneous chips and multiple frameworks, providing a robust and comprehensive solution for large model innovations.

Building on the success of version 1.0, BAAI proudly presents FlagOpen 2.0. The enhanced iteration meticulously refines five critical components: models, data, algorithms, evaluation, and system architecture. BAAI aims to establish FlagOpen as the “Linux of the large model era,” setting a new benchmark for the development, deployment, and advancement of large-scale AI models.

FlagOpen 2.0 offers comprehensive support for a diverse array of chips and deep learning frameworks. To date, the global downloads of its open-source models have surpassed 47.55 million. Additionally, its 57 open-source datasets have been downloaded nearly 90,000 times, and its open-sourced code has been downloaded over 510,000 times.

FlagOpen open-source repository:

On the data front, the Beijing Academy of Artificial Intelligence (BAAI) released InfinityInstruct, the first high-quality open-source instruction-tuning dataset project with tens of millions of entries. The initial release includes 3 million validated Chinese and English instruction pairs, soon to expand to tens of millions.

BAAI has analyzed existing open-source data to ensure a reasonable distribution of types, conducted quality screening to retain high-value data, augmented data in fields and tasks where open-source data is lacking, and controlled data quality through manual annotation to avoid distribution biases in synthetic data.
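The quality-screening step of such a curation pipeline can be pictured as a set of heuristic filters applied per example. The rules and field names below are hypothetical illustrations, not BAAI's actual pipeline, which also uses model-based scoring and manual annotation:

```python
def passes_screen(example, min_len=8, max_len=4096, banned=("lorem ipsum",)):
    """Toy quality filter: keep instruction/response pairs of reasonable
    length that contain no obvious junk markers."""
    text = example["instruction"] + " " + example["response"]
    if not (min_len <= len(text) <= max_len):
        return False
    return not any(b in text.lower() for b in banned)

data = [
    {"instruction": "Explain photosynthesis.",
     "response": "Plants convert light into chemical energy..."},
    {"instruction": "Hi", "response": ""},                      # too short
    {"instruction": "Fill", "response": "lorem ipsum dolor"},   # junk marker
]
kept = [ex for ex in data if passes_screen(ex)]
print(len(kept))  # 1 of 3 examples survives screening
```

In practice, many such filters are combined, and the surviving pool is then re-balanced across task types to avoid the distribution biases mentioned above.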

The current dataset outperforms the SFT data used by models such as Mistral and OpenHermes. When scaled to tens of millions of entries, a foundational model trained on this dataset is expected to achieve GPT-4-level conversational abilities.

BAAI has also built and open-sourced the IndustryCorpus, a multi-industry dataset in both Chinese and English, totaling 3.4TB (1TB in Chinese and 2.4TB in English), covering 18 industries with an 80% classification accuracy, with plans to expand to 30 industries.

To validate the performance of the industry dataset, BAAI trained a demonstration model in the medical field. This model exhibited a 20% improvement in overall objective performance compared to its pre-training iteration. Fine-tuning with BAAI’s specialized medical SFT and DPO datasets resulted in an 82% subjective win rate against reference answers. Furthermore, the model’s few-shot dialogue capability achieved a score of 4.45 out of 5 in the CMTMedQA evaluation.

The IndustryCorpus Dataset:
Demonstration Medical Model:
Demonstration Medical SFT Dataset:
Demonstration Medical DPO Dataset:

In terms of evaluation, since its release in 2023, the FlagEval large model evaluation has expanded from primarily language models to include video, audio, and multimodal models, achieving comprehensive coverage across multiple domains. It combines subjective and objective assessments and integrates open and closed-book examinations. For the first time, it has collaborated with authoritative education departments to conduct large model K12 subject tests and partnered with the Communication University of China to co-construct a video generation model subjective evaluation system.

The Beijing Academy of Artificial Intelligence (BAAI) has collaborated with over 10 universities and institutions across China to develop advanced evaluation methods and tools. This includes exploring AI-assisted evaluation models such as FlagJudge and creating rigorous evaluation sets for emerging large model capabilities.

BAAI’s notable efforts include the HalluDial hallucination evaluation set co-developed with Peking University, the CMMU multimodal evaluation set co-developed with Beijing Normal University, the MG18 multilingual cross-modal evaluation set, the TACO complex code evaluation set, and the MLVU long video understanding evaluation set. Among these, HalluDial stands out as the world’s largest hallucination evaluation dataset for dialogue scenarios, comprising over 18,000 dialogue rounds and 140,000 responses.

Furthermore, BAAI has spearheaded the establishment of the IEEE Large Model Evaluation Standard Group P3419. It has also partnered with the Hugging Face community to release multiple leaderboards and collaborated with Singapore’s IMDA to contribute advanced evaluation data and models to the AI Verify Foundation. These initiatives are fostering global collaboration in the development of robust evaluation methods and tools for large models.

On the system front, the conference announced several important advancements: FlagOS, FlagScale and the Triton Operator Libraries.

To meet the growing demands of large model training and inference computation, and to address the technical challenges of heterogeneous computing, high-speed interconnection, and elastic stability within and between large-scale AI systems and platforms, the Beijing Academy of Artificial Intelligence (BAAI) has launched FlagOS.

This intelligent computing cluster software stack is designed for large models and supports various heterogeneous computing resources. FlagOS integrates key technologies that BAAI has developed over the years, including the Jiuding intelligent scheduling management platform for heterogeneous computing, the FlagScale parallel training and inference framework, the high-performance operator libraries FlagAttention and FlagGems, the FlagDiagnose cluster diagnostic tool, and the FlagPerf AI chip evaluation tool.

FlagOS functions like an “operating system,” integrating heterogeneous computing management, automated compute migration, parallel training optimization, and high-performance operators. It supports major tasks such as large model training, inference, and evaluation, while managing underlying heterogeneous computing resources, high-speed networks, and distributed storage.

FlagOS has already supported over 50 teams in their large model development endeavors, utilizing 8 different types of chips and managing more than 4,600 AI accelerator cards. It has operated stably for 20 months with a Service Level Agreement (SLA) exceeding 99.5%, enabling users to achieve efficient and stable cluster management, resource optimization, and large model development. The launch of FlagOS is poised to significantly enhance the capabilities of next-generation intelligent computing centers in China and accelerate the growth of the large model industry.

FlagScale, a parallel training framework supporting heterogeneous AI computing power, has been integrated into FlagOS. It achieved the first efficient hybrid training on heterogeneous clusters by utilizing cross-node RDMA direct connections and multiple parallel strategies from different vendors. This makes FlagScale the industry’s first training framework to support both vertical and horizontal expansion modes on diverse heterogeneous AI chips.

FlagScale supports both dense and sparse training for language and multimodal models, enabling large-scale, stable training and inference for sequences up to 1 million tokens in length. It has enabled the stable training of an 8x16B MoE language model on 1,024 cards for over 40 days using domestic computing resources, achieving end-to-end training, fine-tuning, and inference deployment.

FlagScale supports pooled training across various chips with different architectures, attaining over 85% of the upper-bound performance in hybrid training, comparable to the training effects of homogeneous chips. It adapts to eight different chips and can perform large-scale training verification on different clusters, ensuring strict alignment with loss and convergence curves.
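"Strict alignment with loss and convergence curves" can be verified by replaying the same training recipe on each cluster and bounding the per-step divergence of the logged losses. A toy check of that idea (the tolerance and values are illustrative, not FlagScale's actual verification procedure):

```python
def curves_aligned(loss_a, loss_b, tol=1e-2):
    """Return True if two per-step loss curves never diverge by more than tol."""
    assert len(loss_a) == len(loss_b), "curves must cover the same steps"
    return max(abs(a - b) for a, b in zip(loss_a, loss_b)) <= tol

cluster_a = [4.00, 3.10, 2.55, 2.20]   # loss per logging step on chip type A
cluster_b = [4.00, 3.11, 2.55, 2.21]   # same recipe replayed on chip type B
print(curves_aligned(cluster_a, cluster_b))   # True: within tolerance at every step
```

Checking the whole curve rather than only the final loss catches chips whose numerics drift mid-training even if they happen to converge to a similar endpoint.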

To better support the unified ecosystem development of diverse AI chips, BAAI has launched open-source Triton operator libraries for large models, including the general operator library FlagGems and the large model-specific operator library FlagAttention. These libraries significantly enhance operator development efficiency using a unified open-source programming language, while also serving as shared operator libraries across diverse chips.

The FlagGems general operator library currently covers 66 out of the 127 operators required by mainstream language and multimodal models, with full coverage expected by the end of 2024. The FlagAttention library, dedicated to large models, includes six frequently used and cutting-edge attention operators, providing programming examples and customizable operators.

By utilizing automatic code generation technology specifically designed for pointwise operators, users can generate efficient Triton code with simple computational logic descriptions. This technology has been applied to 31 pointwise operators, which account for 47% of the entire operator library. Furthermore, runtime optimization techniques have enhanced operator execution speed by 70%, ensuring high performance.
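Automatic code generation for pointwise operators is possible because every such operator shares the same load-compute-store skeleton; only the scalar expression changes. A toy template-based generator sketching the idea (the emitted text mimics Triton's style but is simplified, and this is not FlagGems' actual generator):

```python
# Skeleton shared by all pointwise kernels; only {name} and {expr} vary.
KERNEL_TEMPLATE = '''\
@triton.jit
def {name}_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = {expr}
    tl.store(out_ptr + offs, y, mask=mask)
'''

def generate_pointwise(name: str, expr: str) -> str:
    """Emit a Triton-style pointwise kernel from a one-line scalar expression."""
    return KERNEL_TEMPLATE.format(name=name, expr=expr)

# GELU as a scalar expression over the loaded element x.
src = generate_pointwise("gelu", "0.5 * x * (1 + tl.erf(x * 0.7071067811865476))")
print(src)
```

Since the load/store boilerplate and masking logic are generated rather than hand-written, each new pointwise operator costs only its one-line computational description, which is why this class of operators lends itself to near-complete automation.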

AI Pioneers Gather to Explore the Path to AGI

At the opening ceremony of the 2024 Beijing Academy of Artificial Intelligence (BAAI) Conference, Aditya Ramesh, head of OpenAI’s Sora and DALL·E teams, and Assistant Professor Saining Xie of New York University engaged in a stimulating dialogue about the technological trajectory and future evolution of multimodal models.

During the fireside chat moderated by Tiejun Huang, Chairman of BAAI, Kai-Fu Lee, CEO of 01.AI, and Ya-Qin Zhang, academician of the Chinese Academy of Engineering and Dean of the Institute for AI Industry Research (AIR) at Tsinghua University, shared their insights on the development trends of general artificial intelligence technology.

In his report titled “Large Models Heralding the Dawn of General AI,” Haifeng Wang, CTO of Baidu, elaborated on the transformative potential of large models in the quest for AGI.

In the Summit Dialogue focused on the path to AGI, Zhongyuan Wang of BAAI, Xiaochuan Wang, CEO of Baichuan AI, Peng Zhang, CEO of Zhipu AI, Zhilin Yang, CEO of Moonshot AI, and Dahai Li, CEO of ModelBest, engaged in an in-depth discussion on critical topics such as the technological trajectory of large-scale models, the dynamics between open ecosystems and closed research, and the exploration of business models.

Looking ahead, BAAI is committed to continuing its pursuit of original technological innovation, exploring cutting-edge directions, forging extensive academic collaborations, and empowering industrial development.

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.




AI Technology & Industry Review | Newsletter: Synced Global AI Weekly | Share My Research | Twitter: @Synced_Global