Weekly AI and NLP News — May 21st 2024

OpenAI releases GPT-4o, Google I/O 2024, and Ilya Sutskever leaves OpenAI

Fabio Chiusano
NLPlanet
5 min read · May 21, 2024


Image from DALL·E 3

Here are your weekly articles, guides, and news about NLP and AI chosen for you by NLPlanet!

😎 News From The Web

  • OpenAI releases GPT-4o. OpenAI released GPT-4o, a new flagship model that accepts and generates combinations of text, audio, and images. It responds to audio with latency comparable to human conversation, handles non-English languages better, and is faster and cheaper in the API, while matching GPT-4 Turbo’s performance on English text and code (a minimal API sketch follows this list).
  • 100 things Google announced at I/O 2024. Google’s I/O 2024 announcements included Gemini 1.5 model updates, the Trillium TPU, and expanded AI in Google Search. Key introductions include Imagen 3 for image creation, Veo for video generation, and upgraded features in the Gemini app for premium users, alongside new generative media tools.
  • Ilya Sutskever to leave OpenAI, Jakub Pachocki announced as Chief Scientist. Ilya Sutskever, co-founder of OpenAI, is stepping down from his role. Jakub Pachocki, with the company since 2017, will take over as Chief Scientist.
  • Hugging Face is sharing $10 million worth of compute to help beat the big AI companies. Hugging Face is dedicating $10M in free GPU resources to support AI developers, startups, and academics. Their ZeroGPU initiative, part of Hugging Face Spaces, offers communal GPU access, aiming to reduce computational access barriers and improve cost-efficiency.
  • IBM’s Granite code model family is going open source. IBM has released its Granite code models as open source. These models, trained on code from 116 programming languages and ranging up to 34 billion parameters, handle code generation, bug fixing, and code explanation tasks, and are available on GitHub and Hugging Face under the Apache 2.0 license.
  • iOS 18: Apple finalizing deal to bring ChatGPT to iPhone. Apple is nearing an agreement with OpenAI to incorporate ChatGPT functionalities into iOS 18, focusing on on-device AI for enhanced privacy and performance. The tech giant intends to announce this integration at the WWDC event on June 10, amidst ongoing discussions with Google regarding their Gemini chatbot.
  • Meta’s AI system ‘Cicero’ learning how to lie, deceive humans: study. MIT researchers have found that Meta’s AI, Cicero, demonstrates advanced deceptive capabilities in the game Diplomacy, ranking in the top 10% of human players through strategic betrayal. This reflects a growing trend among AI systems such as Google’s AlphaStar and OpenAI’s GPT-4 to employ deceit against human opponents, raising concerns over the potential risks of AI deception and the need for preventive strategies.
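
Since GPT-4o is served through the same Chat Completions endpoint as earlier OpenAI models, trying it mostly amounts to changing the model name. Below is a minimal sketch using the OpenAI Python SDK; the image URL is a placeholder, and the snippet assumes an OPENAI_API_KEY environment variable is set.

```python
# Minimal sketch: sending a mixed text + image request to GPT-4o
# via the OpenAI Python SDK (v1.x). The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```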

📚 Guides From The Web

  • What is going on with AlphaFold3? Google DeepMind and Isomorphic Labs introduced AlphaFold3 on May 8, 2024, improving protein structure prediction with a diffusion-based architecture. Despite the gains in accuracy, the tool faces issues such as chirality prediction errors and debate around its proprietary status.
  • How do AI supercomputers train large Gen AI models? Simply Explained. AI supercomputers combine HPC with GPU and TPU parallel processing to train large models such as GPT-3 and GPT-4, directing their computational power at tuning algorithms and parameters for higher accuracy. Key challenges such as power management, heat dissipation, and system failures are addressed with tools like DeepSpeed and Project Forge, improving the efficiency and scalability of the training and inference pipelines behind applications like ChatGPT and Bing Chat.
  • Crafting QA Tool with Reading Abilities Using RAG and Text-to-Speech. This article walks through building an AI-driven Question-Answering (QA) system that combines Retrieval-Augmented Generation (RAG) with Text-to-Speech (TTS). It covers deploying a Weaviate vector database, embedding data with Hugging Face models, and building a Streamlit user interface, leveraging Docker, LangChain, ElevenLabs, and various AI models to turn text queries into spoken answers (a condensed sketch of the pipeline follows this list).
  • The AI Arms Race in Big Tech: An Overview of Emerging Enterprise Solutions. Big Tech, including Microsoft, Google, Amazon, and OpenAI, is increasingly pivoting towards enterprise AI. Their solutions — Copilot, Gemini, Q Business, and ChatGPT Enterprise, respectively — aim to boost productivity by automating tasks, analyzing data, and generating content within their ecosystems.
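
To make the RAG-plus-TTS guide above more concrete, here is a condensed, framework-agnostic sketch of the same flow. It swaps Weaviate for an in-memory cosine search to stay self-contained, uses a Hugging Face sentence-transformers model for embeddings, and leaves the LLM and TTS calls as commented placeholders; the synthesize_speech helper named in the comment is hypothetical.

```python
# Condensed RAG sketch: embed documents, retrieve the best match for a
# question, and build a grounded prompt. TTS is left as a placeholder.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Weaviate is an open-source vector database.",
    "Retrieval-Augmented Generation grounds answers in retrieved context.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the question (cosine similarity)."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(-scores)[:k]]

question = "What is RAG?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# An LLM call on `prompt` would go here; the answer would then be passed
# to a TTS engine, e.g.: audio = synthesize_speech(answer)  # hypothetical ElevenLabs wrapper
```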

🔬 Interesting Papers and Repositories

  • RLHF Workflow: From Reward Modeling to Online RLHF. The technical report discusses advancements in Online Iterative Reinforcement Learning from Human Feedback (RLHF), which has been shown to be more effective than offline methods at improving the performance of LLMs. It also proposes using preference models derived from open datasets as an alternative to direct human feedback, particularly beneficial for open-source initiatives.
  • What matters when building vision-language models? Researchers have highlighted a lack of justification for critical design decisions in vision-language models (VLMs), which hinders progress by obscuring what actually improves performance. To tackle this, they conducted comprehensive experiments on pre-trained models, architecture, data, and training methods, leading to the creation of Idefics2, an 8-billion-parameter VLM.
  • LoRA Learns Less and Forgets Less. LoRA (Low-Rank Adaptation) finetunes large language models (LLMs) by training low-rank updates to selected weight matrices, saving memory compared with full finetuning. While it underperforms full finetuning on target domains like programming and math, LoRA better preserves the base model’s general capabilities and encourages more diverse generations (a minimal sketch of the idea follows this list).
  • McGill-NLP/webllama: Llama-3 agents that can browse the web by following instructions and talking to you. Llama-3-8B-Web is an advanced web browsing agent developed from Llama 3, finetuned with over 24,000 data points, aiming to create efficient, user-focused AI tools for web navigation.
  • Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model. Xmodel-VLM is an efficient 1B-scale multimodal vision language model optimized for GPU servers. It’s fine-tuned for modality alignment using LLaVA and exhibits competitive results on standard benchmarks, outperforming larger models in speed.
  • Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory. The paper discusses the observed limitations of scaling Transformer models for language tasks, noting that larger models don’t necessarily perform better and that memorization of training data can impact generalization. A new theoretical framework is introduced to better understand how Transformer models memorize and perform.
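
As a concrete illustration of the LoRA idea discussed above, here is a minimal PyTorch sketch: the pretrained weight matrix stays frozen, and only a low-rank update B·A, scaled by alpha/r, is trained. The hyperparameters shown (r=8, alpha=16) are illustrative, not taken from the paper.

```python
# Minimal LoRA sketch: freeze a Linear layer's weights and learn a
# low-rank additive update, so the layer computes base(x) + (alpha/r) * x A^T B^T.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the full pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))  # only A and B receive gradients during finetuning
```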

Thank you for reading! If you want to learn more about NLP, remember to follow NLPlanet. You can find us on LinkedIn, Twitter, Medium, and our Discord server!
