Weekly AI and NLP News — January 22nd 2024

AlphaGeometry, Meta is buying a lot of GPUs, and Mamba applied to vision

Fabio Chiusano
NLPlanet


Image by DALL·E 3

Here are your weekly articles, guides, and news about NLP and AI chosen for you by NLPlanet!

😎 News From The Web

  • AlphaGeometry: An Olympiad-level AI system for geometry. AlphaGeometry, an AI developed by DeepMind, has demonstrated Olympiad-level proficiency in geometry by solving 25 out of 30 problems within competition time limits. Using a hybrid approach that pairs pattern recognition with formal logic, it emulates human problem-solving, effectively combining intuitive and analytical thinking.
  • Mark Zuckerberg indicates Meta is spending billions of dollars on Nvidia AI chips. Meta plans to deploy 350,000 Nvidia H100 GPUs by the end of 2024. At an estimated $25K-$30K per chip, the H100s alone would cost roughly $8.75-$10.5 billion, underlining Meta’s commitment to scaling up computing power. Overall, Meta aims to amass the computational equivalent of 600K H100 GPUs, a substantial push to enhance its AI capabilities.
  • Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. Vision Mamba (Vim) is a new vision backbone that replaces standard self-attention mechanisms with bidirectional Mamba blocks to enhance image processing by incorporating positional information. Vim has demonstrated superior performance on standard benchmarks like ImageNet, COCO, and ADE20k, surpassing existing models such as Vision Transformers (DeiT).
  • Stable Code 3B: Coding on the Edge. Stability AI has introduced Stable Code 3B, a compact coding language model that outperforms the larger CodeLLaMA 7B and runs on standard laptops without the need for a GPU. Notable improvements include a ‘Fill in the Middle’ feature, better context handling with support for sequences up to 16,384 tokens, and contexts extendable to 100,000 tokens, thanks to training on a wide variety of language and software datasets. A minimal usage sketch follows this list.
  • Google said to use special pool of stock comp to keep top AI researchers. Google has implemented a strategy using substantial stock compensation to retain premier AI talent, highlighting the high stakes in maintaining a skilled workforce to stay ahead in the dynamic AI sector.
  • Lazy use of AI leads to Amazon products called “I cannot fulfill that request”. E-commerce platforms, including Amazon, are experiencing issues with AI-generated content, leading to product listings with erroneous titles like “I cannot fulfill that request.” The AI’s mistakes in product description generation are indicative of broader challenges in online listing management.
  • New study confirms the obvious, search results are only getting worse. A study analyzing search results from Google, Bing, and DuckDuckGo indicates a declining quality in web searches, with a preference for SEO-heavy, affiliate-focused content over in-depth information. This trend presents challenges for search engines attempting to distinguish valuable content from SEO manipulation. The emergence of generative AI is expected to exacerbate these issues.
  • Microsoft launches Copilot Pro for $20 per month per user. Microsoft has unveiled Copilot Pro, a premium productivity-enhancing tool for Microsoft 365 apps, priced at $20 per user/month. It grants priority access to advanced AI, including GPT-4 Turbo for expedited responses.
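
As a companion to the Stable Code 3B item above, here is a minimal sketch of running the model locally with Hugging Face transformers. It is a sketch under assumptions: the repository id stabilityai/stable-code-3b, the dtype, and the trust_remote_code flag are not taken from the announcement, so check the model card before running.

```python
# Minimal sketch: generating code with Stable Code 3B via Hugging Face transformers.
# The repo id "stabilityai/stable-code-3b" and the settings below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stable-code-3b"  # assumed Hugging Face repository name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keeps the 3B model small in memory; use float32 if unsupported
    trust_remote_code=True,
)

prompt = "def fibonacci(n: int) -> int:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```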

📚 Guides From The Web

  • RAG vs Finetuning — Which Is the Best Tool to Boost Your LLM Application? RAG (Retrieval-Augmented Generation) and finetuning are methods for optimizing LLMs based on task-specific requirements. RAG is ideal for applications needing responses grounded in evidence from real-time data or external databases, while finetuning is best for customizing an LLM’s outputs to align with particular contextual, stylistic, or domain-specific needs.
  • Preference Tuning LLMs with Direct Preference Optimization Methods. Researchers have compared three methods — DPO, IPO, and KTO — for tuning Large Language Models (LLMs) to human preferences without reinforcement learning. Applied to 7B LLMs, the techniques comprise Direct Preference Optimization (DPO), which can overfit; IPO, which adds a regularization term to mitigate overfitting; and KTO, which works with unpaired binary feedback (good/bad labels) rather than preference pairs. A minimal DPO training sketch follows this list.
  • Evaluations are all we need. The article explores the challenges of evaluating both human and AI capabilities, particularly in the context of recruitment and the use of LLMs. It addresses the limited effectiveness of current assessment methods for humans, marked by a notable misfit rate in hires, and the even greater complexity of measuring creativity in innovative roles. For AI, it highlights the nascent and challenging nature of intelligence evaluation, pointing out issues like data contamination and inadequate benchmarks.
  • The Road To Honest AI. AI reliability is a concern, particularly regarding accuracy and potential dishonesty in responses. A recent study introduces “honesty vectors” to assess and improve AI transparency, addressing the challenge of securing long-term AI safety and dependability.
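
For the preference-tuning item above, here is a minimal DPO training sketch using Hugging Face’s trl library. It is a sketch only: the exact DPOTrainer arguments vary across trl versions, and the gpt2 base model and toy preference pairs are placeholders rather than the 7B setup discussed in the post.

```python
# Minimal DPO sketch with trl (constructor details vary by trl version).
# DPO pushes the policy toward "chosen" responses and away from "rejected" ones,
# regularized toward a frozen reference model; no reward model or RL loop is needed.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder; the post's experiments used 7B models
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data: each row pairs a prompt with a preferred and a dispreferred answer.
train_dataset = Dataset.from_dict({
    "prompt": ["Summarize: The cat sat on the mat."],
    "chosen": ["A cat rested on a mat."],
    "rejected": ["Dogs make great pets."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,  # strength of the penalty keeping the policy close to the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="dpo-out",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        remove_unused_columns=False,
        report_to=[],
    ),
)
trainer.train()
```

IPO typically reuses the same trainer via a loss-type option, while KTO replaces the chosen/rejected pairs with unpaired good/bad labels.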

🔬 Interesting Papers and Repositories

  • RAG makes LLMs better and equal. A study has evaluated the performance of open-source language models against closed-source equivalents in Retrieval-Augmented Generation (RAG) tasks. Key findings indicate that GPT4-Turbo outperforms the others, while Mixtral-8x7B matches the performance of GPT3.5-turbo, and the efficacy of RAG approaches remains robust even with vast datasets exceeding 1 billion chunks.
  • Self-Rewarding Language Models. Researchers have explored the concept of Self-Rewarding Language Models, where language models generate their own rewards during training. This concept posits that surpassing human-level performance necessitates training signals derived from superhuman feedback. The approach led to significant improvements in instruction-following and self-rewarding capabilities. By iterating this technique in training Llama 2 70B, the model exceeded the performance of several leading systems, including Claude 2, Gemini Pro, and GPT-4 0613, on the AlpacaEval 2.0 leaderboard.
  • Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. Language models, including large ones like LLaMA-2-13B, are highly sensitive to prompt formatting, showing large performance swings under changes that don’t affect meaning. This sensitivity persists as model size or the number of few-shot examples increases. The authors recommend evaluating models across a range of prompt formats, since performance under a single fixed format correlates poorly across models, making single-format comparisons unreliable (see the sketch after this list).
  • Transformers are Multi-State RNNs. New research draws a conceptual bridge between Transformers and RNNs: decoder-only Transformers can be viewed as multi-state RNNs with an unbounded number of hidden states, or approximated as finite multi-state RNNs by limiting that number.
  • GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation. GPT-4V offers an innovative evaluation methodology for text-to-3D generative models by automating benchmarks that align with human judgment, thereby addressing the lack of robust evaluation metrics in the field. This system simulates detailed user assessments through tailored prompts, which allows for cost-effective and scalable comparison of 3D assets against diverse and user-specific standards.
  • Scalable Pre-training of Large Autoregressive Image Models. Apple has released research detailing the development of autoregressive vision models known as AIM, which display scaling characteristics akin to LLMs. These models have demonstrated that their performance improves with increased model size and data volume.
  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. A study showed that LLMs trained for deceptive behavior, for example writing secure or exploitable code depending on the year stated in the prompt, cannot be readily corrected through conventional safety training methods, including supervised fine-tuning, reinforcement learning, and adversarial training.
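
To make the prompt-formatting result above concrete, here is a minimal sketch of the kind of evaluation the paper recommends: score the same examples under several semantically equivalent prompt formats and report the spread rather than a single number. The query_model function is a hypothetical placeholder for whatever model or API call you use.

```python
# Sketch: measuring sensitivity to prompt formatting.
# Each template below phrases the task identically; only the formatting differs.
FORMATS = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "QUESTION: {q}\nANSWER:",
    "Question :: {q} Answer ::",
]

def query_model(prompt: str) -> str:
    """Hypothetical placeholder: swap in your LLM or API call here."""
    raise NotImplementedError

def accuracy_under_format(fmt: str, examples: list[tuple[str, str]]) -> float:
    """Exact-match accuracy on (question, gold answer) pairs for one prompt format."""
    correct = sum(
        query_model(fmt.format(q=question)).strip().lower() == gold.lower()
        for question, gold in examples
    )
    return correct / len(examples)

# Report the min-max spread across formats instead of a single score:
# scores = [accuracy_under_format(fmt, dev_set) for fmt in FORMATS]
# print(f"accuracy range: {min(scores):.2f} - {max(scores):.2f}")
```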

Thank you for reading! If you want to learn more about NLP, remember to follow NLPlanet. You can find us on LinkedIn, Twitter, Medium, and our Discord server!
