Newsletter #9

Eren Gölge
Published in Machine Learns
Nov 22, 2023

All about đŸ€–AI — News, Research, and Open-Source

Hello all! I hope you’re enjoying the OpenAI saga. It appears we must first solve human alignment on AGI before we can properly align AGI itself.

After two weeks, here is another issue. Let’s begin!


(Please note that these are personal notes transformed into a newsletter by my hobby project đŸ€–ME_AI, so excuse its bits and bytes for any mistakes.)

Bookmarks

Stability AI introduces Stable Video Diffusion 🔗Link. This diffusion model generates a video from a textual description or an input image. It is open-source with an MIT license. Users report better results than commercial services like Runway.

EmuVideo and EmuEdit from Meta 🔗Link. Meta also introduces a video generation model — EmuVideo — and an image editing model — EmuEdit. EmuVideo can generate 16-frame videos, and EmuEdit can edit images based on user prompts like “Remove the text” or “Replace the box”. I don’t see anything open-source yet except for the paper.

DeepMind released Lyria 🔗Link. Lyria is a music model that stands out for its capacity to produce high-quality music, complete with instrumentals and vocals. They plan to deploy Lyria on YouTube to generate new music for Shorts.

How much does it cost to use an LLM? 🔗Link.
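The cost question always comes down to the same back-of-the-envelope arithmetic: tokens in, tokens out, price per thousand. A minimal sketch, with placeholder prices that are purely hypothetical (check the linked article and your vendor’s pricing page for real numbers):

```python
# Back-of-the-envelope API cost estimate: cost = tokens / 1000 * price_per_1k.
# The prices used below are HYPOTHETICAL placeholders, not current vendor pricing.

def llm_cost(prompt_tokens: int, completion_tokens: int,
             price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Return the dollar cost of a single request."""
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k

# Example: 1,500 prompt tokens and 500 completion tokens at
# $0.01 in / $0.03 out per 1k tokens (placeholder numbers).
cost = llm_cost(1500, 500, 0.01, 0.03)
print(f"${cost:.4f}")  # → $0.0300
```

Input and output tokens are usually priced differently, which is why the two rates are separate parameters.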

The Art of Debugging 🔗Link. This is a collection of different debugging methodologies for ML and software engineering. It’s a must-read.

Why are Western apps more minimalistic than Asian apps? 🔗Link. The article explores cultural differences between individualistic and collectivist societies and how these differences impact digital product design, including online communication, privacy, UI complexity, and customization preferences. It emphasizes that while individualistic cultures value explicit communication and privacy, collectivist cultures prioritize community, indirect communication, and multitasking in digital interactions.

Adobe researchers create 3D models from 2D images ‘within 5 seconds’ in new AI breakthrough 🔗Link. Adobe Research and Australian National University researchers have developed an AI model, LRM (Large Reconstruction Model), to convert a 2D image into a 3D model within 5 seconds, potentially transforming industries like gaming, animation, and AR/VR. LRM, trained on massive datasets with a transformer-based neural network, can produce high-quality 3D reconstructions from various inputs, including real-world images and AI-generated images. However, it faces limitations like blurry textures in occluded regions.

OpenAI’s roadmap for fundraising 🔗Link. I don’t know how much of this still holds after the “nerd-drama,” but we’ll see.

A 1,000-qubit quantum computer 🔗Link. Atom Computing, a California-based start-up, has developed a quantum computer with 1,180 qubits, surpassing IBM’s 433-qubit Osprey machine. This new quantum computer utilizes neutral atoms trapped in a 2D laser grid, offering easier scalability and the potential for quicker advancement toward fault-tolerant, error-free quantum computing.

Evaluating LLMs: A Comprehensive Survey 🔗Link. This 111-page paper gives a detailed overview of LLM evaluation across different settings such as alignment, safety, and specialization.

Papers

Lookahead Decoding for Faster Exact LLM Inference

Post
Code

An exact, parallel decoding algorithm designed to speed up LLM inference. Lookahead decoding breaks the sequential dependency of autoregressive decoding by generating and verifying n-grams in parallel within a single model call, using the Jacobi iteration approach. It operates standalone, requiring neither a draft model nor a data store. Lookahead decoding reduces the number of decoding steps in proportion to log(FLOPs) spent per decoding step.

Experiments indicate that lookahead decoding is more effective for smaller models: larger models require more FLOPs to compute guess tokens, which offsets the savings in decoding steps.
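The guess-and-verify idea can be shown with a toy sketch. This is an illustration of the core principle only, not the paper’s algorithm: the real method generates candidate n-grams via Jacobi iteration, while this stand-in caches previously seen continuations, and `greedy_next` is a toy deterministic rule in place of an LLM:

```python
# Toy sketch of guess-and-verify decoding (the idea behind lookahead decoding):
# propose an n-gram continuation, verify its tokens against the exact model,
# and accept the longest correct prefix. Output matches greedy decoding exactly,
# but in fewer decoding steps.

def greedy_next(seq):
    """Stand-in 'LLM': a deterministic next-token rule (toy, not a model)."""
    return (seq[-1] * 2 + 1) % 7

def decode(prompt, n_steps, lookahead=3):
    seq = list(prompt)
    ngram_pool = {}                  # last token -> guessed continuation
    steps = 0
    while n_steps > 0:
        guess = ngram_pool.get(seq[-1], [])[:lookahead]
        accepted, ctx = [], list(seq)
        for g in guess:              # verify guesses (parallel in the real thing)
            if greedy_next(ctx) == g:
                accepted.append(g); ctx.append(g)
            else:
                break
        if not accepted:             # fall back to one exact greedy step
            accepted = [greedy_next(seq)]
        # Refresh the pool with the verified continuation plus one new guess.
        ngram_pool[seq[-1]] = accepted + [greedy_next(seq + accepted)]
        seq += accepted[:n_steps]
        n_steps -= len(accepted)
        steps += 1
    return seq, steps

out, steps = decode([1], n_steps=10)  # fewer than 10 decoding steps, same output
```

Because every accepted token is verified against the exact model, the output is identical to plain greedy decoding; the win is that a single step can accept several tokens.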

Attention Sorting — boost your language model’s long context skills

Paper

Attention sorting is an inference-time technique that reorders a language model’s context by attention, enhancing the recall of critical information and improving relevance. It significantly boosts accuracy in long context QA, enabling smaller 7B models to perform comparably to larger 100B+ models, and is effective even for models not specialized in long contexts. This method is easy to integrate into existing pipelines without additional training or fine-tuning, offering iterative performance improvements in tasks like QA, search, and summarization.

It works by initially performing a step of decoding, then sorting the documents based on the attention they receive, with documents receiving the most attention going last. This re-sorting process can be repeated, with each iteration typically moving the most relevant documents towards the end of the context where they are more likely to be used by the model. This method leverages the observation that relevant documents, even if not used in the response, tend to receive more attention than irrelevant ones at the same position.
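A minimal sketch of that loop, with a hypothetical word-overlap score standing in for a real model’s attention readout (the actual method reads attention weights from a decoding step):

```python
# Minimal sketch of attention sorting: after one decoding step, reorder the
# context documents by the attention mass each received, most-attended last,
# then repeat. `attention_for` is a toy proxy, not a real attention readout.

def attention_for(doc: str, query: str) -> float:
    """Toy proxy: attention ~ word overlap with the query."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / (len(q) or 1)

def attention_sort(docs, query, iterations=2):
    for _ in range(iterations):
        # Ascending sort: highest-attention documents move to the end of the
        # context, where the model is most likely to use them.
        docs = sorted(docs, key=lambda d: attention_for(d, query))
    return docs

docs = ["the capital of France is Paris",
        "bananas are yellow",
        "Paris hosts the Olympics"]
ordered = attention_sort(docs, "what is the capital of France")
```

After sorting, the most relevant document sits last in the context, which is where long-context models attend most reliably.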

Rethinking Attention

Paper

The work presents a study on using standard shallow feed-forward networks to replicate the behavior of attention mechanisms in Transformer models.

The researchers replaced key elements of the attention mechanism with simple feed-forward networks trained through knowledge distillation.

Their findings indicate that these “attentionless Transformers” can match the performance of the original architecture, offering insights into the adaptability of shallow feed-forward networks in mimicking attention mechanisms and the potential to simplify complex sequence-to-sequence tasks.
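The shape of the replacement module is easy to sketch. Below, a shallow feed-forward network maps a fixed window of token embeddings to an attention-shaped output; the weights here are random for illustration, whereas in the paper they are trained by distilling a full attention layer’s outputs:

```python
import numpy as np

# Sketch of the "attentionless" idea: replace the attention block with a
# shallow feed-forward network over the flattened token window. Dimensions
# and weights are illustrative; real weights come from knowledge distillation.

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 8, 16, 64

# One hidden layer over the flattened window (seq_len * d_model inputs).
W1 = rng.normal(size=(seq_len * d_model, d_hidden)) * 0.02
W2 = rng.normal(size=(d_hidden, seq_len * d_model)) * 0.02

def ffn_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model) -> (seq_len, d_model), same shape attention returns."""
    h = np.maximum(x.reshape(-1) @ W1, 0.0)        # ReLU hidden layer
    return (h @ W2).reshape(seq_len, d_model)

x = rng.normal(size=(seq_len, d_model))
y = ffn_attention(x)
```

Note the trade-off this makes explicit: the FFN is tied to a fixed `seq_len`, which is one reason attention’s length-agnostic weighting is hard to replace outright.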

I’ve gathered all the well-known Transformer alternatives in a separate post.

YaRN: Efficient context window extension for LLMs.

Code
Paper

The paper introduces YaRN (Yet another RoPE extensioN method) to extend the context window of models trained with Rotary Position Embeddings (RoPE) like LLaMA, GPT-NeoX, and PaLM. YaRN achieves state-of-the-art performance in context window extensions, requiring significantly less training data and steps than previous methods.

It combines Dynamic Scaling, an inference-time technique, to extend context windows without fine-tuning.

The paper details various interpolation methods, experiments, and evaluations demonstrating YaRN’s effectiveness in long-sequence language modeling, passkey retrieval, and standardized benchmarks.

The technique is compatible with existing libraries and shows minimal performance degradation while enabling large context sizes.
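The basic mechanism behind RoPE-based context extension can be sketched with plain position interpolation: divide positions by the extension factor so rotary angles stay in the range seen during pre-training. YaRN refines this with a per-frequency “NTK-by-parts” scheme plus an attention temperature, both omitted here, so treat this as an illustration rather than the paper’s exact method:

```python
import numpy as np

# Sketch of RoPE context extension by position scaling. With scale=4,
# position 8192 is rotated like position 2048 was at train time.

def rope_angles(pos: int, d: int, scale: float = 1.0) -> np.ndarray:
    """Rotary angles for one position; scale > 1 interpolates positions."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    return (pos / scale) * inv_freq

def apply_rope(x: np.ndarray, pos: int, scale: float = 1.0) -> np.ndarray:
    """Rotate adjacent dimension pairs of x (shape (d,)) by position angles."""
    theta = rope_angles(pos, x.shape[0], scale)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

x = np.ones(8)
# apply_rope(x, 8192, scale=4.0) equals apply_rope(x, 2048, scale=1.0):
# the extended position falls back inside the trained angle range.
```

Naive interpolation like this compresses all frequencies equally, which hurts high-frequency (local) position information; YaRN’s contribution is scaling each frequency band differently to avoid exactly that.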

Open-Source

Ollama — running LLMs locally

đŸ‘©â€đŸ’» Github

Ollama runs open LLMs locally and provides a REST API. It has good community support and different types of UIs and plug-ins.
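Calling the REST API is a one-liner once the server is running. A small sketch, assuming the default local endpoint `http://localhost:11434/api/generate` and that a model such as `llama2` has already been pulled:

```python
import json

# Build the request body for Ollama's /api/generate endpoint.
# stream=False asks for a single JSON response instead of a token stream.

def build_generate_request(model: str, prompt: str) -> dict:
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_generate_request("llama2", "Why is the sky blue?")

# With a running server, this would be:
#   import requests
#   r = requests.post("http://localhost:11434/api/generate", json=payload)
#   print(r.json()["response"])
print(json.dumps(payload))
```

The same API is what the community UIs and plug-ins build on top of.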

MyHeyGen

Github
Demo Video

MyHeyGen is an open-source alternative to talking head generation that combines several other open-source tools, including our XTTS 👍. With speech synthesis and lip-syncing, you can basically make anyone say anything.

LoRAX — a framework to serve 100s of fine-tuned LLMs

Github

LoRAX enables packing hundreds of models into a single GPU, drastically cutting costs while maintaining performance. Key features include Dynamic Adapter Loading, Tiered Weight Caching, and Continuous Multi-Adapter Batching, which enhance throughput and reduce memory issues. It’s commercially viable under the Apache 2.0 license, offering pre-built Docker images and Helm charts for seamless deployment in production environments.
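The dynamic-loading and tiered-caching idea can be illustrated with a small LRU cache that keeps hot adapters in a bounded “GPU” tier and evicts cold ones back to slower storage. This is a conceptual sketch only, not LoRAX’s actual implementation or API:

```python
from collections import OrderedDict

# Conceptual sketch of dynamic adapter loading with a tiered cache:
# a bounded set of hot adapters stays in (simulated) GPU memory; the
# least recently used adapter is evicted back to the slower tier.

class AdapterCache:
    def __init__(self, gpu_slots: int, load_fn):
        self.gpu = OrderedDict()      # adapter_id -> weights, in LRU order
        self.gpu_slots = gpu_slots
        self.load_fn = load_fn        # fetch from disk/CPU tier on a miss

    def get(self, adapter_id: str):
        if adapter_id in self.gpu:                # hit: refresh LRU order
            self.gpu.move_to_end(adapter_id)
        else:                                     # miss: load, maybe evict
            if len(self.gpu) >= self.gpu_slots:
                self.gpu.popitem(last=False)      # drop least recently used
            self.gpu[adapter_id] = self.load_fn(adapter_id)
        return self.gpu[adapter_id]

cache = AdapterCache(gpu_slots=2, load_fn=lambda name: f"weights:{name}")
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")  # "b" is evicted
```

Because adapters are tiny compared to the base model, hundreds of them fit in the cache while the base weights are loaded once and shared.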

S-LoRA — serving 1000s of LoRA adapters.

Github

S-LoRA is an advanced system designed for efficiently serving thousands of concurrent Low-Rank Adaptation (LoRA) adapters for large language models. It leverages a ‘pretrain-then-finetune’ paradigm, storing adapters in the main memory and fetching them to the GPU as needed. Key features include Unified Paging for memory efficiency, custom CUDA kernels for heterogeneous batching, and a novel tensor parallelism strategy for effective multi-GPU parallelization. S-LoRA significantly improves throughput and can handle a much larger number of adapters compared to other libraries. The project builds on LightLLM and benefits from punica, PEFT, and vLLM technologies. Plans include releasing tensor parallelism implementation, enhancing API/frontend user-friendliness, and expanding model support.
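The math behind heterogeneous multi-adapter batching is compact: every request shares the base weight, and each adds its own low-rank update. A numpy sketch of the computation (S-LoRA fuses this into custom CUDA kernels; the explicit loop here is just to show the structure):

```python
import numpy as np

# Per-request LoRA in a shared batch: y_i = x_i @ W + (x_i @ A_i) @ B_i.
# W is the shared base weight; (A_i, B_i) is request i's low-rank adapter.

rng = np.random.default_rng(0)
d_in, d_out, r, batch = 16, 16, 4, 3

W = rng.normal(size=(d_in, d_out)) * 0.02            # shared base weight
adapters = [(rng.normal(size=(d_in, r)) * 0.02,      # A_i
             np.zeros((r, d_out)))                   # B_i (zero-init, as in LoRA)
            for _ in range(batch)]

def lora_batch(xs, adapter_ids):
    """Apply the shared base plus a per-request adapter to each row of xs."""
    out = xs @ W                                     # one shared GEMM for all
    for i, aid in enumerate(adapter_ids):            # per-request low-rank part
        A, B = adapters[aid]
        out[i] += (xs[i] @ A) @ B
    return out

xs = rng.normal(size=(batch, d_in))
ys = lora_batch(xs, [0, 1, 2])
```

The base GEMM dominates the cost and is shared across all requests, which is why thousands of adapters can be served at near single-model throughput.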

Insanely Fast Whisper

Github

IFW provides an improved interface to OpenAI’s Whisper speech recognition model. With fp16, batching, and Flash Attention 2, it can transcribe 150 minutes of audio in just 98 seconds.

FauxPilot

Github

A Copilot clone that runs locally. It uses Salesforce CodeGen with the Triton Inference Server and a FasterTransformer backend.
