Newsletter #9
All about 🤖AI - News, Research, and Open-Source
Hello all! I hope you're enjoying the OpenAI saga. It appears that we must first address human alignment on AGI in order to properly align AGI itself.
After two weeks, here is another issue. Let's begin…
(Please note that these are personal notes transformed into a newsletter by my hobby project 🤖ME_AI, so excuse its bits and bytes for any mistakes.)
Bookmarks
Stability AI introduces Stable Video Diffusion 🔗Link. This diffusion model can generate a video from a text description or an input image. It is open-source with an MIT license, and users report better results than commercial services like Runway.
EmuVideo and EmuEdit from Meta 🔗Link. Meta also introduces a video generation model, EmuVideo, and an image editing model, EmuEdit. EmuVideo can generate 16-frame videos, and EmuEdit can edit images based on user prompts like "Remove the text" or "Replace the box". I don't see anything open-source yet except for the paper.
DeepMind released Lyria 🔗Link. Lyria is a music model that stands out for its capacity to produce high-quality music, complete with instrumentals and vocals. They plan to bring Lyria to YouTube to generate new music for Shorts.
How much does it cost to use an LLM? 🔗Link.
The Art of Debugging 🔗Link. This is a collection of different debugging methodologies for ML and software engineering. It's a must-read.
Why are Western apps more minimalistic than Asian apps? 🔗Link. The article explores cultural differences between individualistic and collectivist societies and how they shape digital product design, including online communication, privacy, UI complexity, and customization preferences. While individualistic cultures value explicit communication and privacy, collectivist cultures prioritize community, indirect communication, and multitasking in digital interactions.
Adobe researchers create 3D models from 2D images "within 5 seconds" in new AI breakthrough 🔗Link. Adobe Research and Australian National University researchers have developed LRM (Large Reconstruction Model), an AI model that converts a 2D image into a 3D model within 5 seconds, with the potential to transform industries like gaming, animation, and AR/VR. Trained on massive datasets with a transformer-based neural network, LRM produces high-quality 3D reconstructions from various inputs, including real-world and AI-generated images. However, it faces limitations like blurry textures in occluded regions.
OpenAI's roadmap for fundraising 🔗Link. I don't know how much of this still holds after the "nerd-drama," but we'll see.
A 1,000-qubit quantum computer 🔗Link. Atom Computing, a California-based start-up, has developed a quantum computer with 1,180 qubits, surpassing IBM's Osprey machine, which has 433 qubits. This new quantum computer uses neutral atoms trapped in a 2D laser grid, offering easier scalability and potentially quicker progress toward fault-tolerant, error-corrected quantum computing.
Evaluating LLMs: A Comprehensive Survey 🔗Link. This 111-page paper gives a detailed overview of LLM evaluation across settings such as alignment, safety, and specialization.
Papers
Lookahead Decoding for Faster Exact LLM Inference
An exact, parallel decoding algorithm designed to speed up inference for Large Language Models (LLMs). Lookahead decoding breaks the sequential dependency of autoregressive decoding by generating and verifying n-grams in parallel with direct access to the LLM, using the Jacobi iteration method. It operates standalone, with no need for a draft model or a data store. Lookahead decoding reduces the number of decoding steps in proportion to the log(FLOPs) spent per decoding step.
Experiments indicate that lookahead decoding is more effective for smaller models: large models require more FLOPs to compute guess tokens, and this overhead offsets the gains from fewer decoding steps.
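To make the verification step concrete, here is a minimal Python sketch (not the authors' code) of how guessed n-grams can be checked so the output stays exact; `model_argmax` is a hypothetical helper standing in for one parallel forward pass of the LLM.

```python
# Sketch of lookahead decoding's n-gram verification (illustrative only).
# `model_argmax(seq)` is a hypothetical helper: a single forward pass that
# returns the greedy next-token prediction at every position of `seq`.

def verify_ngram(tokens, ngram, model_argmax):
    """Accept the longest prefix of a guessed n-gram that matches the
    model's own greedy continuations, keeping decoding exact."""
    candidate = tokens + ngram
    predictions = model_argmax(candidate)  # scores all guesses in parallel
    accepted = []
    for i, guess in enumerate(ngram):
        # guess sits at candidate[len(tokens) + i]; it is valid iff the
        # model predicts it right after the prefix preceding it.
        if predictions[len(tokens) + i - 1] == guess:
            accepted.append(guess)
        else:
            break
    return accepted

# Toy stand-in model that always predicts token + 1:
toy = lambda seq: [t + 1 for t in seq]
print(verify_ngram([1, 2, 3], [4, 5, 9], toy))  # -> [4, 5]
```

Accepting multiple verified tokens per forward pass is what cuts the step count without changing the output distribution.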
Attention Sorting: boost your language model's long-context skills
Attention sorting is an inference-time technique that reorders a language modelâs context by attention, enhancing the recall of critical information and improving relevance. It significantly boosts accuracy in long context QA, enabling smaller 7B models to perform comparably to larger 100B+ models, and is effective even for models not specialized in long contexts. This method is easy to integrate into existing pipelines without additional training or fine-tuning, offering iterative performance improvements in tasks like QA, search, and summarization.
It works by initially performing a step of decoding, then sorting the documents based on the attention they receive, with documents receiving the most attention going last. This re-sorting process can be repeated, with each iteration typically moving the most relevant documents towards the end of the context where they are more likely to be used by the model. This method leverages the observation that relevant documents, even if not used in the response, tend to receive more attention than irrelevant ones at the same position.
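Here is a rough sketch of this loop, under the assumption that you can extract the attention mass each document receives; the `attention_mass` helper below is hypothetical, not the paper's implementation.

```python
# Conceptual sketch of attention sorting (illustrative only).
# `attention_mass(docs, query)` is a hypothetical helper that runs one
# decoding step and returns the total attention weight each document got.

def attention_sort(docs, query, attention_mass, iterations=2):
    """Reorder documents so the most-attended ones land at the end of the
    context, where the model is most likely to use them."""
    for _ in range(iterations):
        scores = attention_mass(docs, query)
        # Ascending sort: highest-attention documents go last.
        docs = [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0])]
    return docs

# Toy stand-in: score documents by keyword overlap with the query.
toy = lambda docs, q: [len(set(q.split()) & set(d.split())) for d in docs]
print(attention_sort(["cats sleep", "paris is in france"],
                     "where is paris", toy))
```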
Rethinking Attention
The work presents a study on using standard shallow feed-forward networks to replicate the behavior of attention mechanisms in Transformer models.
The researchers replaced key elements of the attention mechanism with simple feed-forward networks trained through knowledge distillation.
Their findings indicate that these "attentionless Transformers" can match the performance of the original architecture, offering insights into the adaptability of shallow feed-forward networks in mimicking attention mechanisms and the potential to simplify complex sequence-to-sequence tasks.
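As a hedged illustration of the idea (sizes and training details below are invented, not the paper's setup), a shallow feed-forward student can be distilled against a frozen attention block roughly like this:

```python
import torch
import torch.nn as nn

# Illustrative distillation: a shallow MLP learns to reproduce the outputs
# of a frozen self-attention block on fixed-length sequences.
seq_len, d_model, hidden = 32, 64, 1024
teacher = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
student = nn.Sequential(                    # "attentionless" replacement
    nn.Flatten(),                           # (B, L, D) -> (B, L*D)
    nn.Linear(seq_len * d_model, hidden),
    nn.ReLU(),
    nn.Linear(hidden, seq_len * d_model),
    nn.Unflatten(1, (seq_len, d_model)),    # back to (B, L, D)
)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(100):                     # toy distillation loop
    x = torch.randn(8, seq_len, d_model)
    with torch.no_grad():                   # the teacher stays frozen
        target, _ = teacher(x, x, x)        # self-attention output
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```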
I've gathered all the well-known Transformer alternatives in a separate post.
YaRN: Efficient context window extension for LLMs
The paper introduces YaRN (Yet another RoPE extensioN method) to extend the context window of models trained with Rotary Position Embeddings (RoPE) like LLaMA, GPT-NeoX, and PaLM. YaRN achieves state-of-the-art performance in context window extensions, requiring significantly less training data and steps than previous methods.
It can also be combined with Dynamic Scaling, an inference-time technique, to extend context windows without fine-tuning.
The paper details various interpolation methods, experiments, and evaluations demonstrating YaRNâs effectiveness in long-sequence language modeling, passkey retrieval, and standardized benchmarks.
The technique is compatible with existing libraries and shows minimal performance degradation while enabling large context sizes.
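Very roughly, YaRN's "NTK-by-parts" interpolation rescales only the RoPE frequencies whose wavelengths exceed the original context, ramping smoothly in between. The sketch below is a simplified illustration with illustrative thresholds, not the paper's exact formulation:

```python
import math

def yarn_frequencies(dim, base=10000.0, orig_ctx=4096, scale=4.0,
                     low=1.0, high=32.0):
    """Rescale RoPE frequencies: interpolate long wavelengths by `scale`,
    keep short ones, and ramp linearly in between (thresholds illustrative)."""
    freqs = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)                  # standard RoPE frequency
        ratio = orig_ctx / (2 * math.pi / freq)    # context / wavelength
        if ratio < low:                            # long wavelength: interpolate
            freq /= scale
        elif ratio <= high:                        # middle band: linear ramp
            t = (ratio - low) / (high - low)
            freq *= (1 - t) / scale + t
        # else: short wavelength, leave the frequency untouched
        freqs.append(freq)
    return freqs

# YaRN additionally scales attention logits by roughly 0.1 * ln(scale) + 1.
attn_temperature = 0.1 * math.log(4.0) + 1
```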
Open-Source
Ollama: running LLMs locally
👩‍💻 Github
Ollama runs open LLMs locally and exposes a REST API. It has strong community support and a wide range of UIs and plug-ins built around it.
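For example, a minimal call to the local REST API, assuming the Ollama server is running on its default port and the llama2 model has already been pulled:

```python
import requests

# Minimal call to Ollama's local REST API (default port 11434); assumes
# the model was fetched beforehand with `ollama pull llama2`.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```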
MyHeyGen
MyHeyGen is an open-source alternative for talking-head generation that combines several other open-source tools, including our XTTS. With speech synthesis and lip-syncing, you can basically make anyone say anything.
LoRAX: a framework to serve hundreds of fine-tuned LLMs
LoRAX packs hundreds of fine-tuned models onto a single GPU, drastically cutting costs while maintaining performance. Key features include Dynamic Adapter Loading, Tiered Weight Caching, and Continuous Multi-Adapter Batching, which improve throughput and reduce memory pressure. It's commercially usable under the Apache 2.0 license, with pre-built Docker images and Helm charts for deployment in production environments.
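Requests select an adapter per call; here is a minimal sketch against a locally running LoRAX server (the adapter name below is illustrative):

```python
import requests

# Sketch of a LoRAX request: `adapter_id` picks which fine-tuned LoRA
# adapter to apply for this call (the adapter name is illustrative).
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize: LoRAX packs many adapters onto one GPU.",
        "parameters": {"max_new_tokens": 64, "adapter_id": "my-org/my-adapter"},
    },
)
print(resp.json()["generated_text"])
```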
S-LoRA: serving thousands of LoRA adapters
S-LoRA is a system designed to efficiently serve thousands of concurrent Low-Rank Adaptation (LoRA) adapters for large language models. It targets the "pretrain-then-finetune" paradigm, storing adapters in main memory and fetching them to the GPU as needed. Key features include Unified Paging for memory efficiency, custom CUDA kernels for heterogeneous batching, and a novel tensor parallelism strategy for effective multi-GPU serving. S-LoRA significantly improves throughput and can handle far more adapters than other libraries. The project builds on LightLLM and benefits from punica, PEFT, and vLLM. Planned work includes releasing the tensor parallelism implementation, improving API/frontend usability, and expanding model support.
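Conceptually, the serving pattern resembles an LRU cache that pages adapter weights from host RAM onto the GPU on demand; the sketch below is a simplified illustration, not S-LoRA's actual code:

```python
from collections import OrderedDict

# Conceptual sketch (not S-LoRA's code): keep every LoRA adapter in host
# RAM and page only the ones needed by in-flight requests onto the GPU,
# evicting the least-recently-used adapter under memory pressure.

class AdapterCache:
    def __init__(self, host_adapters, gpu_slots=4):
        self.host = host_adapters        # adapter_id -> weights in host RAM
        self.gpu = OrderedDict()         # adapter_id -> weights "on GPU"
        self.slots = gpu_slots

    def fetch(self, adapter_id):
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)   # mark as recently used
            return self.gpu[adapter_id]
        if len(self.gpu) >= self.slots:
            self.gpu.popitem(last=False)       # evict the LRU adapter
        weights = self.host[adapter_id]        # host -> GPU copy goes here
        self.gpu[adapter_id] = weights
        return weights

cache = AdapterCache({f"adapter-{i}": object() for i in range(100)})
w = cache.fetch("adapter-42")            # pages the adapter in on demand
```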
Insanely Fast Whisper
IFW provides an improved interface to OpenAI's speech recognition model, Whisper. With fp16, batching, and Flash Attention 2, it can transcribe 150 minutes of audio in just 98 seconds.
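Under the hood this is roughly the Hugging Face Transformers ASR pipeline with those optimizations enabled; a sketch, assuming a CUDA GPU, the flash-attn package, and a placeholder file name:

```python
import torch
from transformers import pipeline

# Roughly the recipe behind Insanely Fast Whisper: the Transformers ASR
# pipeline with fp16, chunked batching, and Flash Attention 2.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)
result = asr("audio.mp3", chunk_length_s=30, batch_size=24)
print(result["text"])
```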
FauxPilot
A GitHub Copilot clone that runs locally. It uses Salesforce CodeGen models with NVIDIA's Triton Inference Server and the FasterTransformer backend.