Promising Sector Preview: The Decentralized Computing Power Market (Part I)
Author: Zeke, YBB Capital
Foreword
Since the birth of GPT-3, generative AI has reached an explosive tipping point in artificial intelligence thanks to its astonishing performance and wide range of application scenarios, prompting tech giants to pile into the AI race. This surge, however, brings its own challenges. Training and inference for large language models (LLMs) require substantial computing power, and with each model iteration and upgrade, the demand for and cost of compute rise exponentially. Taking GPT-2 and GPT-3 as an example, their parameter counts differ by a factor of about 1,166 (GPT-2 has 150 million parameters, GPT-3 has 175 billion), and a single GPT-3 training run could cost up to $12 million based on public GPU-cloud pricing at the time, roughly 200 times the cost of GPT-2. In practical use, every user query requires inference computation; based on the roughly 13 million unique users at the beginning of this year, the corresponding chip demand would exceed 30,000 A100 GPUs, implying an initial investment of about $800 million and an estimated daily inference cost of around $700,000.
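For readers who want to sanity-check these figures, a minimal sketch of the arithmetic follows; every input is a number quoted above, and the per-GPU cost on the last line is merely implied by those numbers rather than stated anywhere.

```python
# Rough arithmetic behind the figures quoted above; all inputs are the
# article's own numbers, so this only checks internal consistency.
gpt2_params = 150e6          # 150 million parameters (as quoted)
gpt3_params = 175e9          # 175 billion parameters
print(gpt3_params / gpt2_params)            # ~1166x parameter gap

gpt3_train_cost = 12e6       # ~$12M per training run at public-cloud prices
print(gpt3_train_cost / 200)                # implied GPT-2 training cost: ~$60k

a100_count = 30_000          # chips implied by ~13M early users
initial_capex = 800e6        # quoted initial investment
print(initial_capex / a100_count)           # implied ~$26.7k per A100 (servers included)
```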
The shortage of computing power and high costs have become significant challenges for the entire AI industry. Similarly, the blockchain industry seems to be facing the same issues. On one hand, the fourth Bitcoin halving and the approval of ETFs are imminent, and as future prices climb, miners’ demand for computational hardware is bound to increase significantly. On the other hand, Zero-Knowledge Proof (ZKP) technology is flourishing, and Vitalik has emphasized multiple times that the impact of ZK on the blockchain sector in the next decade will be as significant as blockchain itself. Although this technology is highly anticipated for its future in the blockchain industry, ZK also requires a substantial amount of computing power and time for the complex calculations involved in generating proofs, similar to AI.
In the foreseeable future, a shortage of computing power seems inevitable. So, could the decentralized computing power market be a promising business venture?
Decentralized Computing Power Market Definition
The decentralized computing power market is essentially equivalent to the decentralized cloud computing track, but I personally believe that this term is more apt to describe the new projects that will be discussed later. The decentralized computing power market should be considered a subset of DePIN (Decentralized Physical Infrastructure Networks), aiming to create an open market for computing power where anyone with idle computing resources can offer their resources incentivized by tokens, primarily serving B2B clients and developer communities. To illustrate with more familiar projects, networks like Render Network, which is based on decentralized GPU rendering solutions, and Akash Network, a distributed peer-to-peer marketplace for cloud computing, are part of this track.
The following text will start with the basic concepts and then discuss three emerging markets within this track: the AGI computing power market, the Bitcoin computing power market, and the ZK hardware acceleration market. The latter two will be discussed in “Promising Sector Preview: The Decentralized Computing Power Market (Part II)”.
Computing Power Overview
The concept of computing power originated with the invention of the computer, where the initial computations were performed by mechanical devices, and computing power referred to the computational ability of these mechanical devices. With the evolution of computer technology, the concept of computing power has also evolved. Today’s computing power typically refers to the ability of computer hardware (CPUs, GPUs, FPGAs, etc.) and software (operating systems, compilers, applications, etc.) to work together.
Definition
Computing Power refers to the amount of data a computer or other computational device can process within a certain time frame or the number of computational tasks it can complete. It is commonly used to describe the performance of a computer or other computational devices and is an important metric for measuring a device’s processing capabilities.
Measurement Standards
Computing power can be measured in various ways, such as computational speed, energy consumption, accuracy, and parallelism. In the field of computing, common metrics for measuring computing power include FLOPS (Floating Point Operations Per Second), IPS (Instructions Per Second), and TPS (Transactions Per Second).
FLOPS measures the ability of a computer to process floating-point operations (mathematical operations with decimal points, which require consideration of precision issues and rounding errors). It gauges how many floating-point operations a computer can complete per second. FLOPS is an indicator of a computer’s high-performance computing capabilities and is often used to measure the computational power of supercomputers, high-performance computing servers, and Graphics Processing Units (GPUs). For example, if a computer system has 1 TFLOPS (1 trillion floating-point operations per second), it means it can perform 1 trillion floating-point operations every second.
IPS measures the speed at which a computer executes instructions: how many instructions it can complete per second. It is a metric for the instruction-level performance of a computer and is typically applied to Central Processing Units (CPUs). For instance, a CPU running at 3 GHz that completes roughly one instruction per cycle would execute on the order of 3 billion instructions per second.
TPS measures the ability of a computer to process transactions. It gauges how many transactions a computer can complete per second and is often used to measure the performance of database servers. For example, if a database server has a TPS of 1000, it means it can handle 1000 database transactions every second.
Additionally, there are computing power metrics specific to certain application scenarios, such as inference speed, image processing speed, and voice recognition accuracy.
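As a concrete illustration of the FLOPS metric, the following sketch estimates achieved (not peak) floating-point throughput by timing a dense matrix multiplication with NumPy; the 2·n³ operation count is the standard estimate for an n×n matrix product, and the result will of course vary with the machine it runs on.

```python
import time
import numpy as np

def measure_flops(n: int = 2048) -> float:
    """Estimate achieved FLOPS by timing an n x n matrix multiplication."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    start = time.perf_counter()
    _ = a @ b
    elapsed = time.perf_counter() - start

    ops = 2 * n ** 3  # roughly 2*n^3 multiply-adds in a dense matmul
    return ops / elapsed

if __name__ == "__main__":
    flops = measure_flops()
    print(f"Achieved throughput: {flops / 1e9:.1f} GFLOPS")
```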
Types of Computing Power
GPU Computing Power refers to the computational ability of Graphics Processing Units (GPUs). Unlike Central Processing Units (CPUs), GPUs are hardware designed specifically to process graphical data such as images and video. They contain a large number of processing units with efficient parallel computing capabilities, allowing them to perform vast numbers of floating-point operations simultaneously. Originally built for gaming graphics, GPUs typically have far more cores and much higher memory bandwidth than CPUs (though usually lower clock speeds), which is what lets them support complex graphical computations.
Differences between CPU and GPU
Architecture: CPUs and GPUs have different computing architectures. CPUs typically have one or more cores, each a general-purpose processor capable of performing a variety of different operations. GPUs, on the other hand, have a large number of Stream Processors and Shaders dedicated to executing computations related to image processing.
Parallel Computing: GPUs offer far greater parallelism. A CPU has a limited number of cores, each of which processes instructions largely sequentially, whereas a GPU can have thousands of stream processors executing many operations simultaneously. GPUs are therefore better suited to parallel workloads such as machine learning and deep learning, which require extensive parallel computation.
Programming Design: Programming for GPUs is relatively more complex compared to CPUs. It requires the use of specific programming languages (such as CUDA or OpenCL) and particular programming techniques to leverage the GPU’s parallel computing capabilities. In contrast, CPU programming is simpler and can utilize general-purpose programming languages and tools.
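The parallelism gap described above can be observed directly with a small PyTorch experiment, assuming a machine with the torch package installed and a CUDA-capable GPU; the exact speedup depends entirely on the hardware, so the numbers are illustrative only.

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, repeats: int = 10) -> float:
    """Average seconds per n x n matrix multiplication on the given device."""
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)
    _ = a @ b                        # warm-up: allocation, kernel selection
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()     # wait for all queued GPU kernels to finish
    return (time.perf_counter() - start) / repeats

if __name__ == "__main__":
    print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
    if torch.cuda.is_available():
        print(f"GPU: {time_matmul('cuda'):.4f} s per matmul")
```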
The Importance of Computing Power
In the industrial era, oil was the lifeblood of the world, permeating every industry. In the coming AI era, computing power will be the world's "digital oil." From major corporations' frantic pursuit of AI chips and Nvidia's market capitalization crossing the trillion-dollar mark, to the recent detailed U.S. restrictions on high-end chips for China, covering compute capacity, chip size, and even planned curbs on GPU cloud access, the importance of computing power is self-evident. It will be the commodity of the next era.
Overview of Artificial General Intelligence
Artificial Intelligence (AI) is a new technical science that studies, develops, and applies theories, methods, and technologies for simulating, extending, and expanding human intelligence. Originating in the 1950s and 1960s, AI has evolved over more than half a century, experiencing intertwined developments through three waves: symbolism, connectionism, and agent-based approaches. Today, as an emerging general-purpose technology, AI is driving profound changes in social life and across all industries. The current stage of generative AI is more specifically defined as Artificial General Intelligence (AGI), a type of AI system with broad understanding capabilities that can perform tasks and operate in various domains with intelligence similar to or surpassing human levels. AGI fundamentally requires three elements: deep learning (DL), big data, and substantial computing power.
Deep Learning
Deep learning is a subfield of machine learning (ML). Deep learning algorithms are neural networks modeled after the human brain. For example, the human brain contains millions of interconnected neurons that work together to learn and process information. Similarly, deep learning neural networks (or artificial neural networks) consist of multiple layers of artificial neurons that work together within a computer. These artificial neurons, known as nodes, use mathematical computations to process data. Artificial neural networks are deep learning algorithms that use these nodes to solve complex problems.
Neural networks can be divided into layers: the input layer, hidden layers, and the output layer. The connections between these different layers are made up of parameters.
· Input Layer: The input layer is the first layer of the neural network and is responsible for receiving external input data. Each neuron in the input layer corresponds to a feature of the input data. For example, in image processing, each neuron might correspond to the value of a pixel in the image.
· Hidden Layers: The input layer processes data and passes it on to deeper layers within the network. These hidden layers process information at different levels and adjust their behavior when receiving new information. Deep learning networks can have hundreds of hidden layers, which allow them to analyze problems from multiple different perspectives. For instance, if you have an image of an unknown animal that needs to be classified, you might compare it to animals you are already familiar with by looking at the shape of the ears, the number of legs, or the size of the pupils. Hidden layers in deep neural networks work in a similar way. If a deep learning algorithm is trying to classify images of animals, each hidden layer would process different features of the animals and attempt to classify them accurately.
· Output Layer: The output layer is the last layer of the neural network and is responsible for producing the network’s output. Each neuron in the output layer represents a possible output category or value. For example, in a classification problem, each neuron in the output layer might correspond to a category, while in a regression problem, there might be only one neuron in the output layer, and its value represents the predicted result.
· Parameters: In neural networks, the connections between different layers are represented by weights and biases. These parameters are optimized during the training process so that the network can accurately identify patterns in the data and make predictions. Increasing the number of parameters can enhance the neural network’s model capacity, which is the ability of the model to learn and represent complex patterns in the data. However, an increase in parameters also leads to higher demands for computing power.
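To make the layer-and-parameter picture above concrete, here is a minimal PyTorch sketch of a network with an input layer, two hidden layers, and an output layer; the layer sizes are arbitrary illustrative choices, and the "parameters" counted at the end are exactly the weights and biases on the connections between layers.

```python
import torch
from torch import nn

# Input layer: 784 features (e.g. a 28x28 image flattened into pixel values),
# two hidden layers, and a 10-way output layer for classification.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # input -> hidden layer 1
    nn.Linear(256, 128), nn.ReLU(),   # hidden layer 1 -> hidden layer 2
    nn.Linear(128, 10),               # hidden layer 2 -> output (10 classes)
)

# The parameters are the weights and biases on the connections between layers.
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {total_params:,}")   # ~235k for this toy network
```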
Big Data
For effective training, neural networks typically require large, diverse, high-quality, and multi-source data. This data is the foundation for training and validating machine learning models. By analyzing big data, machine learning models can learn patterns and relationships within the data, which allows them to make predictions or classifications.
Massive Computing Power
The demand for substantial computing power arises from several aspects of neural networks:
- Complex multi-layered structures
- A large number of parameters
- The need to process vast amounts of data
- Iterative training methods (during the training phase, the model must iterate repeatedly, performing forward and backward propagation calculations for each layer, including computations for activation functions, loss functions, gradients, and weight updates)
- The need for high-precision calculations
- Parallel processing capabilities
- Optimization and regularization techniques
- Model evaluation and validation processes
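The iterative-training point in the list above (repeated forward and backward propagation with loss, gradient, and weight-update computations) corresponds to a loop like the minimal PyTorch sketch below; the data is random and purely illustrative.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Random stand-in data: 64 samples per batch, 100 iterations.
for step in range(100):
    x = torch.randn(64, 784)
    y = torch.randint(0, 10, (64,))

    logits = model(x)              # forward propagation (activations)
    loss = loss_fn(logits, y)      # loss function

    optimizer.zero_grad()
    loss.backward()                # backward propagation (gradients)
    optimizer.step()               # weight update

    if step % 20 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```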
As deep learning advances, the computing power required for AGI is growing by roughly a factor of ten each year. The latest model, GPT-4, is reported to contain 1.8 trillion parameters, with a single training run costing more than $60 million and requiring about 2.15e25 FLOPs (roughly 21.5 septillion floating-point operations in total). Demand for compute to train future models keeps expanding, and new models are appearing at an accelerating pace.
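A widely used rule of thumb (not taken from this article) approximates total training compute for dense transformers as roughly 6 × parameters × training tokens; for mixture-of-experts models, the active parameters per token are what matter. The sketch below simply evaluates that approximation with hypothetical inputs in the range of widely circulated, unconfirmed estimates, landing on the same order of magnitude as the 2.15e25 FLOPs quoted above.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Rough total training compute using the common ~6*N*D approximation."""
    return 6 * active_params * tokens

# Hypothetical illustrative inputs (assumptions, not figures from this article):
n = 280e9      # active parameters per token
d = 13e12      # training tokens
print(f"{training_flops(n, d):.2e} FLOPs")   # ~2.2e25, same order as quoted above
```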
AI Computing Power Economics
Future Market Size
According to the "2022–2023 Global Computing Power Index Assessment Report," compiled jointly by International Data Corporation (IDC), Inspur Information, and Tsinghua University's Global Industry Research Institute, the global AI computing market is expected to grow from $19.5 billion in 2022 to $34.66 billion in 2026. Within this, the generative AI computing market is projected to grow from $820 million in 2022 to $10.99 billion in 2026, with generative AI's share of overall AI computing rising from 4.2% to 31.7%.
Monopoly in Computing Power Economy
The market for AI GPUs is effectively monopolized by NVIDIA, and the chips are extremely expensive (the latest H100 sells for around $40,000 per unit). As soon as GPUs are released, they are snapped up by Silicon Valley tech giants: some are used to train their own new models, and the rest are rented out to AI developers through cloud platforms. Google, Amazon, and Microsoft control vast computing resources, including servers, GPUs, and TPUs, and computing power has become a new resource monopolized by these giants. Many AI developers cannot even buy a dedicated GPU without paying a markup; to use the latest hardware, they are forced to rent cloud servers from AWS or Microsoft. Financial reports show this is a highly profitable business: AWS's cloud services carry a gross margin of 61%, while Microsoft's is even higher at 72%.
Do we have to accept this centralized authority and control, and pay a 72% profit margin for computing resources? Will the giants who monopolized Web2 also dominate the next era?
Challenges of Decentralized AGI Computing Power
When it comes to antitrust, decentralization is often seen as the best answer. Looking at existing projects, can the massive computing power required for AI be achieved by combining DePIN storage projects with protocols like RNDR that harness idle GPUs? The answer is no; the path to slaying the dragon is not that simple. Early projects were not designed specifically for AGI computing power and are not feasible for it. Bringing computing power onto the blockchain faces at least the following five challenges:
1. Work Verification: To build a truly trustless computing network that provides economic incentives to participants, the network must have a way to verify that deep learning computations were actually performed. The core issue here is the state dependency of deep learning models; in these models, the input for each layer depends on the output from the previous layer. This means that you cannot just verify a single layer in isolation without considering all preceding layers. The computation for each layer is based on the results of all the layers that came before it. Therefore, to verify the work completed at a specific point (such as a specific layer), all the work from the beginning of the model to that point must be executed.
2. Market: The AI computing power market, as an emerging market, is subject to supply and demand dilemmas, such as the cold start problem. Supply and demand liquidity need to be roughly matched from the start for the market to grow successfully. To capture potential computing power supply, clear incentives must be provided to participants in exchange for their computing resources. The market needs a mechanism to track completed computations and pay providers in a timely manner. In traditional markets, intermediaries handle tasks like management and onboarding, while reducing operational costs by setting minimum payment thresholds. However, this approach is costly when scaling the market size. Only a small portion of the supply can be economically captured, leading to a threshold equilibrium state, where the market can only capture and maintain a limited supply and cannot grow further.
3. Halting Problem: The halting problem is a fundamental issue in computational theory: determining whether a given computational task will finish in a finite amount of time or run forever. The problem is undecidable, meaning there is no universal algorithm that can predict whether an arbitrary computation will halt. Smart contract execution on Ethereum faces a similar issue: it is impossible to determine in advance how much computational resource an execution will require or whether it will finish in a reasonable time, which is why Ethereum bounds execution with gas rather than trying to predict it.
(In the context of deep learning, this problem becomes even more complex as models and frameworks shift from static graph construction to dynamic building and execution.)
4. Privacy: The design and development with privacy consciousness is a must for project teams. Although a lot of machine learning research can be conducted on public datasets, to enhance model performance and adapt to specific applications, models often need to be fine-tuned on proprietary user data. This fine-tuning process may involve the processing of personal data, thus privacy protection requirements need to be considered.
5. Parallelization: This is a key reason current projects are not feasible. Deep learning models are typically trained in parallel on large hardware clusters with proprietary architectures and extremely low latency, whereas GPUs in a distributed computing network incur latency from frequent data exchanges and are limited by the performance of the slowest GPU. When compute sources are untrusted and unreliable, achieving heterogeneous parallelization is a problem that must be solved. A currently viable approach is to use transformer models with highly parallelizable structure, such as Switch Transformers, whose sparsely activated experts lend themselves to independent, parallel execution (see the sketch below).
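To see why architectures like Switch Transformers lend themselves to parallel execution, the toy sketch below implements top-1 expert routing: each token is dispatched to exactly one expert, so different experts process disjoint token subsets independently and, in principle, could sit on different machines. This illustrates the general technique only; it is not code from Gensyn, Together, or the Switch Transformer paper.

```python
import torch
from torch import nn

class Top1MoE(nn.Module):
    """Minimal Switch-style mixture-of-experts layer with top-1 routing."""

    def __init__(self, d_model: int = 64, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its highest-scoring expert.
        gates = torch.softmax(self.router(x), dim=-1)    # (tokens, n_experts)
        weight, expert_idx = gates.max(dim=-1)           # top-1 choice per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Each expert sees only its own tokens: these per-expert
                # computations are independent and could run on separate devices.
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)      # 16 tokens, model width 64
print(Top1MoE()(tokens).shape)    # torch.Size([16, 64])
```

Real Switch Transformers add expert capacity limits and load-balancing losses, which are omitted here for brevity.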
Solutions: Although attempts at a decentralized AGI computing power market are still in their infancy, two projects have made preliminary progress on the consensus design of decentralized networks and on implementing decentralized compute networks for model training and inference. The following uses Gensyn and Together as examples to analyze the design approaches and problems of the decentralized AGI computing power market.
Gensyn
Gensyn is an AGI computing power market still under construction that aims to address the various challenges of decentralized deep learning computation and to reduce the current costs of deep learning. In essence, Gensyn is a layer-1 proof-of-stake protocol based on Polkadot that rewards solvers (those who perform computational tasks) through smart contracts in exchange for putting their idle GPU devices to work on machine learning tasks.
Returning to the previous question, the core of building a truly trustless computational network lies in verifying the completed machine learning work. This is a highly complex issue that requires finding a balance between complexity theory, game theory, cryptography, and optimization.
Gensyn proposes a straightforward starting point: solvers submit the results of the machine learning tasks they have completed, and an independent verifier re-executes the same task to check them. This can be called single replication, since only one verifier re-executes the work, so there is only one additional piece of work needed to verify the accuracy of the original. But if the verifier is not the original requester, the trust problem merely moves: the verifier might themselves be dishonest, so their work also needs verification, which requires yet another verifier, and so on indefinitely, forming an infinite chain of replication. To escape this, Gensyn introduces three interwoven concepts and a participant system with four roles:
Probabilistic Proof of Learning: Constructs certificates of completed work using metadata from the gradient-based optimization process. By replicating certain stages, these certificates can be quickly verified, ensuring that the work has been completed as scheduled.
Graph-based Precise Positioning Protocol: Utilizes a multi-granularity, graph-based precise positioning protocol, along with the consistency execution of cross-evaluators. This allows for the re-running and comparison of verification work to ensure consistency, which is ultimately confirmed by the blockchain itself.
Truebit-style Incentive Game: Constructs an incentive game using stakes and slashing to ensure that every economically rational participant acts honestly and performs their expected tasks.
The participant system consists of four roles: submitters, solvers, verifiers, and whistleblowers.
· Submitters: the end-users of the system, who provide tasks to be computed and pay for completed units of work;
· Solvers: the primary workers of the system, who perform model training and generate proofs to be checked by verifiers;
· Verifiers: the link between the non-deterministic training process and deterministic linear computation, who replicate parts of the solver's proof and compare distances against expected thresholds;
· Whistleblowers: the last line of defense, who check verifiers' work and raise challenges in the hope of receiving substantial rewards.
System Operation
The game system designed by the protocol operates through eight stages, encompassing the four main participant roles, to complete the entire process from task submission to final verification.
1. Task Submission: Tasks consist of three specific pieces of information:
· Metadata describing the task and hyperparameters;
· A model binary file (or basic architecture);
· Publicly accessible, pre-processed training data.
To submit a task, submitters specify its details in a machine-readable format and submit it to the chain together with the model binary file (or machine-readable architecture) and the publicly accessible location of the pre-processed training data. The public data can be stored in simple object storage such as AWS S3 or in decentralized storage such as IPFS, Arweave, or Subspace.
2. Profiling: The profiling process establishes a baseline distance threshold for proof-of-learning verification. Verifiers periodically fetch profiling tasks and generate mutation thresholds for comparing learning proofs. To generate a threshold, verifiers deterministically run and rerun parts of the training with different random seeds, generating and checking their own proofs. Through this process, verifiers establish an overall expected distance threshold for the non-deterministic work of solvers that can later be used for verification.
3. Training: After profiling, tasks enter the public task pool (similar to Ethereum's mempool). A solver is selected to execute the task, which is then removed from the pool. Solvers perform the task based on the metadata submitted by the submitter and the provided model and training data. While executing the training task, solvers also generate a proof of learning by periodically checkpointing and storing metadata (including parameters) during training, so that verifiers can later replicate the optimization steps as accurately as possible.
4. Proof Generation: Solvers periodically store model weights or updates along with the corresponding indices of the training dataset to identify the samples used to generate the weight updates. The frequency of checkpoints can be adjusted to provide stronger assurances or to save storage space. Proofs can be “stacked,” meaning they can start from a random distribution used to initialize weights or from pre-trained weights generated using their own proofs. This allows the protocol to establish a set of proven, pre-trained base models that can be fine-tuned for more specific tasks.
5. Verification of Proof: After completing a task, solvers register the completion on the chain and publish their proof of learning at a publicly accessible location for verifiers to access. Verifiers pull verification tasks from the public task pool and perform computational work to rerun part of the proof and execute distance calculations. The chain, together with the threshold calculated during the profiling stage, then uses the resulting distance to determine whether the verification matches the proof. (A toy sketch of this checkpoint-and-rerun check appears after the stage list.)
6. Graph-based Pinpoint Challenge: After proofs of learning are verified, whistleblowers can replicate the verifiers' work to check whether the verification itself was performed correctly. If a whistleblower believes a verification was executed incorrectly (whether maliciously or not), they can challenge it before contract arbitration for a reward. This reward can come from the deposits of solvers and verifiers (in the case of a true positive) or from a lottery-pool bonus (in the case of a false positive), with the chain itself performing the arbitration. Whistleblowers (acting here in a verifier-like role) will only verify and subsequently challenge work when they expect appropriate compensation. In practice, this means whistleblowers are expected to join and leave the network based on the number of other active whistleblowers (i.e., those with live deposits and challenges). The expected default strategy for any whistleblower is therefore to join the network when few other whistleblowers are active, post a deposit, randomly select an active task, and begin verification; after one task, they grab another random active task and repeat, until the number of whistleblowers exceeds their payout threshold, at which point they leave the network (or, more likely, switch to another role in the network, verifier or solver, depending on their hardware) until the situation reverses.
7. Contract Arbitration: When verifiers are challenged by whistleblowers, they enter a process with the chain to pinpoint the location of the disputed operation or input; the chain ultimately performs that final basic operation and determines whether the challenge is justified. To keep whistleblowers honest and overcome the verifier's dilemma, periodic forced errors and jackpot payouts are introduced.
8. Settlement: During settlement, participants are paid based on the conclusions of the probabilistic and deterministic checks. Different payment scenarios arise depending on the results of prior verifications and challenges. If the work is deemed to have been performed correctly and all checks have passed, both solvers and verifiers are rewarded according to the operations performed.
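As flagged in stage 5, here is a toy sketch of the checkpoint-and-rerun idea behind proof verification: the solver stores periodic checkpoints and the data indices used between them, and a verifier replays one randomly chosen segment and compares the result against the next checkpoint using a distance threshold. Function names, the distance metric, and the threshold are all hypothetical; this illustrates the principle rather than Gensyn's actual protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_segment(w: np.ndarray, data: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One deterministic training segment: a few SGD steps on a toy quadratic loss."""
    for x in data:
        grad = 2 * (w - x)          # gradient of ||w - x||^2
        w = w - lr * grad
    return w

# --- Solver side: train and store checkpoints plus the data indices used ---
dataset = rng.normal(size=(40, 8))
w = np.zeros(8)
checkpoints, segments = [w.copy()], []
for seg in range(4):
    idx = np.arange(seg * 10, (seg + 1) * 10)   # samples used in this segment
    w = train_segment(w, dataset[idx])
    checkpoints.append(w.copy())
    segments.append(idx)

# --- Verifier side: replay one randomly chosen segment and check the distance ---
k = rng.integers(len(segments))
replayed = train_segment(checkpoints[k].copy(), dataset[segments[k]])
distance = np.linalg.norm(replayed - checkpoints[k + 1])
THRESHOLD = 1e-6                                 # would be set during profiling
print(f"segment {k}: distance {distance:.2e} ->",
      "accepted" if distance <= THRESHOLD else "challenged")
```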
Project Brief Review
Gensyn has designed an intricate game-theoretic system at the verification and incentive layers that can quickly identify and correct errors by pinpointing divergences within the network. However, many details are still missing from the current design. For instance, how should parameters be set so that rewards and penalties are reasonable without making the barrier to entry too high? Do the game-theoretic aspects account for extreme scenarios and the differing computational power of solvers? The current whitepaper also offers no detailed explanation of heterogeneous parallel execution, suggesting that Gensyn's implementation still has a long way to go.
Together.ai
Together.ai is a company focused on open-source, decentralized AI computational solutions for large models, with the goal of making AI accessible to anyone, anywhere. Strictly speaking, Together is not a blockchain project, but it has preliminarily solved the latency issues within decentralized AGI computational networks. Therefore, the following will only analyze Together’s solution without evaluating the project itself.
In a decentralized network that is 100 times slower than data centers, how can training and inference of large models be achieved?
Imagine the distribution of GPUs participating in a decentralized network: devices spread across different continents and cities, connected to one another with varying latencies and bandwidths. In a simulated scenario, devices might be located in North America, Europe, and Asia, with differing bandwidths and latencies between them. What needs to be done to link them together effectively?
Distributed Training Computational Modeling: Training a base model across multiple devices involves three types of communication: forward activations, backward gradients, and lateral communication.
Combining communication bandwidth and latency, two forms of parallelism need to be considered: pipeline parallelism and data parallelism, corresponding to the three types of communication in a multi-device scenario:
1. Pipeline Parallelism: In pipeline parallelism, all layers of the model are divided into several stages, with each device processing one stage — a sequence of consecutive layers, such as multiple Transformer blocks. During forward propagation, activations are passed to the next stage, and during backward propagation, gradients of the activations are passed back to the previous stage.
2. Data Parallelism: In data parallelism, devices independently compute gradients for different micro-batches but need to synchronize these gradients through communication.
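The following single-process sketch (not real distributed code) tallies what has to cross the network under the two schemes: activations and activation gradients flow across pipeline-stage boundaries once per micro-batch, while data-parallel replicas of each stage exchange roughly twice their parameter count per step in a ring all-reduce. All sizes are illustrative assumptions.

```python
# Single-process sketch of network traffic under the two forms of parallelism.
# All sizes are illustrative, counted in number of float32 values.

layers = [("embed", 50e6), ("block_1", 300e6), ("block_2", 300e6), ("head", 50e6)]
n_stages = 2            # pipeline parallelism: split the layers into 2 stages
n_replicas = 4          # data parallelism: 4 replicas of each stage
micro_batches = 8
activation_size = 4e6   # values crossing a stage boundary per micro-batch

per_stage = len(layers) // n_stages
stages = [layers[i * per_stage:(i + 1) * per_stage] for i in range(n_stages)]

for s, stage in enumerate(stages):
    params = sum(p for _, p in stage)
    # Pipeline communication: activations forward plus activation gradients
    # backward, once per micro-batch, across each stage boundary.
    pipeline_comm = 2 * activation_size * micro_batches if s < n_stages - 1 else 0
    # Data-parallel ("lateral") communication: gradient all-reduce, roughly
    # 2 * params * (r - 1) / r values per replica per step for a ring all-reduce.
    lateral_comm = 2 * params * (n_replicas - 1) / n_replicas
    print(f"stage {s}: params={params:.0e}, "
          f"pipeline traffic/step={pipeline_comm:.0e}, "
          f"lateral traffic/step/replica={lateral_comm:.0e}")
```

Because the lateral gradient exchange dominates, a scheduler like the one described next tries to keep those high-traffic edges on the fastest links.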
Scheduling Optimization:
In a decentralized environment, the training process is often constrained by communication. Scheduling algorithms typically assign tasks that require extensive communication to devices with faster connections. Considering the dependencies between tasks and the heterogeneity of the network, it is first necessary to model the cost of specific scheduling strategies. To capture the complex communication costs of training base models, Together proposes a novel formulation and decomposes the cost model into two levels using graph theory:
- Graph Theory: A branch of mathematics that studies the properties and structures of graphs (networks), which consist of vertices (nodes) and edges (lines connecting nodes). The main purpose in graph theory is to study various properties of graphs, such as connectivity, coloring, and the nature of paths and cycles within graphs.
- First Level: This is a balanced graph partitioning problem (dividing the vertex set of a graph into several subsets of equal or nearly equal size while minimizing the number of edges between subsets). In this partitioning, each subset represents a partition, and communication costs are reduced by minimizing the edges between partitions, corresponding to the communication costs of data parallelism.
- Second Level: This involves a joint graph matching and traveling salesman problem (a combinatorial optimization problem that combines elements of graph matching and the traveling salesman problem). The graph matching problem involves finding a match in the graph that minimizes or maximizes some cost. The traveling salesman problem seeks the shortest path that visits all nodes in the graph, corresponding to the communication costs of pipeline parallelism.
Process Overview: Since the actual implementation involves complex calculations, the process described here is simplified for easier understanding; for implementation details, refer to the documentation on Together's official website.
Assume a set of N devices, D, with uncertain communication delays (matrix A) and bandwidths (matrix B). Based on the device set D, we first generate a balanced graph partition. Each partition, or group of devices, contains approximately the same number of devices, and all devices in a group handle the same pipeline stage, which ensures that each device group performs a similar amount of work during data parallelism. Based on the communication delays and bandwidths, a formula computes the "cost" of transferring data between device groups. The balanced groups are then merged to create a fully connected coarse graph, in which each node represents a pipeline stage and each edge represents the communication cost between two stages. To minimize communication costs, a matching algorithm determines which device groups should work together.
To further optimize, this problem is modeled as an open-loop traveling salesman problem, finding an optimal path for data transmission across all devices. Finally, Together uses an innovative scheduling algorithm to find the best allocation strategy for the given cost model, thereby minimizing communication costs and maximizing training throughput. According to tests, under this scheduling optimization, even if the network is 100 times slower, the end-to-end training throughput is only about 1.7 to 2.3 times slower.
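A much-simplified sketch of this two-level idea is shown below, with random latency and bandwidth matrices standing in for real measurements: devices are split into equal-size groups (the balanced partition), the average pairwise cost between groups plays the role of the coarse-graph edge weight, and a greedy ordering stands in for the matching and open-loop traveling-salesman steps. It is a stand-in for the approach described above, not Together's scheduler.

```python
import numpy as np

rng = np.random.default_rng(1)
n_devices, n_stages = 8, 4

# Hypothetical network model: latency matrix A (seconds), bandwidth matrix B (GB/s).
A = rng.uniform(0.01, 0.2, (n_devices, n_devices))
B = rng.uniform(0.1, 5.0, (n_devices, n_devices))

def pair_cost(i: int, j: int, message_gb: float = 1.0) -> float:
    """Cost of moving one message between devices i and j: latency + size/bandwidth."""
    return A[i, j] + message_gb / B[i, j]

# Level 1 (balanced partition, simplified): fixed equal-size device groups.
groups = np.array_split(np.arange(n_devices), n_stages)

def group_cost(g1, g2) -> float:
    """Average pairwise cost between two device groups (coarse-graph edge weight)."""
    return float(np.mean([pair_cost(i, j) for i in g1 for j in g2]))

# Level 2 (simplified stand-in for matching + open-loop TSP): greedily order the
# groups into a pipeline so that consecutive stages are cheap to connect.
order, remaining = [0], set(range(1, n_stages))
while remaining:
    nxt = min(remaining, key=lambda g: group_cost(groups[order[-1]], groups[g]))
    order.append(nxt)
    remaining.remove(nxt)

total = sum(group_cost(groups[a], groups[b]) for a, b in zip(order, order[1:]))
print("pipeline order of device groups:", order, f"| total edge cost: {total:.3f}")
```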
Communication Compression Optimization:
For communication compression, Together introduces the AQ-SGD algorithm (for the detailed derivation, see the paper "Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees"). AQ-SGD is a novel activation compression technique designed to address the communication efficiency problems of pipeline-parallel training over slow networks. Unlike earlier methods that compress activation values directly, AQ-SGD compresses the change in the activation values of the same training sample across iterations. Because activations change less and less as training stabilizes, the quantity being compressed shrinks over time, giving the algorithm an interesting "self-improving" dynamic: its effective accuracy increases as training converges.

AQ-SGD has been rigorously analyzed and shown to converge at a good rate under certain technical conditions and with bounded-error quantization functions. It can be implemented without adding end-to-end runtime overhead, although it requires extra memory and SSD capacity to store activation values. Extensive experiments on sequence classification and language modeling datasets show that AQ-SGD can compress activations to 2–4 bits without sacrificing convergence performance.

Furthermore, AQ-SGD can be combined with state-of-the-art gradient compression algorithms to achieve "end-to-end communication compression": all data exchanged between machines, including model gradients, forward activations, and backward gradients, is compressed to low precision, significantly improving the communication efficiency of distributed training. Compared with uncompressed end-to-end training in a centralized network (e.g., 10 Gbps), it is currently only about 31% slower. Combined with the scheduling optimization results, there is still a gap relative to centralized computing networks, but considerable hope of closing it in the future.
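The core trick of AQ-SGD (quantizing the change in a sample's activations between visits rather than the activations themselves) can be illustrated with the toy sketch below; the drift schedule and 4-bit uniform quantizer are assumptions for illustration, not details from the paper, but they show how the reconstruction error shrinks as training stabilizes.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Uniform quantization over x's own range; returns the dequantized values,
    simulating what the receiver reconstructs after low-bit transmission."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return lo + q / levels * (hi - lo)

rng = np.random.default_rng(0)
true_activation = rng.normal(size=1000)   # activations of one sample at epoch 0
cached = np.zeros_like(true_activation)   # receiver's running reconstruction

for epoch in range(5):
    # Assumed drift schedule: activations change less as training stabilizes.
    true_activation += rng.normal(scale=1.0 / (epoch + 1), size=1000)
    delta = true_activation - cached          # what AQ-SGD compresses
    cached += fake_quantize(delta, bits=4)    # receiver updates incrementally
    err = np.abs(cached - true_activation).mean()
    print(f"epoch {epoch}: mean reconstruction error {err:.4f}")
```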
Conclusion
In the boom brought by the AI wave, the AGI computing power market is undoubtedly the one with the greatest potential and demand among the various computing power markets, but it also has the highest development difficulty, hardware requirements, and capital needs. Judging from the two projects discussed above, we are still some distance from a working AGI computing power market; a truly decentralized network is far more complex than the ideal scenario and is clearly not yet able to compete with the cloud giants.
Many projects still in their infancy (the slide-deck stage) and small in scale are shifting their focus to the less demanding inference side of AGI rather than training. In the long run, however, the significance of decentralization and permissionless systems is profound: the right to access and train AGI computing power should not be concentrated in the hands of a few centralized giants. Humanity does not need a new "theocracy" or a new "pope," nor should it have to pay expensive membership fees.
About YBB
YBB is a web3 fund dedicated to identifying Web3-defining projects, with a vision of creating a better online habitat for all internet residents. Founded by a group of blockchain believers who have been active in the industry since 2013, YBB is always willing to help early-stage projects evolve from 0 to 1. We value innovation, self-driven passion, and user-oriented products, while recognizing the potential of crypto and blockchain applications.
References:
1. Gensyn Litepaper: https://docs.gensyn.ai/litepaper/
2. NeurIPS 2022: Overcoming Communication Bottlenecks for Decentralized Training: https://together.ai/blog/neurips-2022-overcoming-communication-bottlenecks-for-decentralized-training-12
3. Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees: https://arxiv.org/abs/2206.01299
4. The Machine Learning Compute Protocol and our future: https://mirror.xyz/gensyn.eth/_K2v2uuFZdNnsHxVL3Bjrs4GORu3COCMJZJi7_MxByo
5. Microsoft: Earnings Release FY23 Q2: https://www.microsoft.com/en-us/Investor/earnings/FY-2023-Q2/performance
6. Fighting for an AI Entry Ticket: BAT, ByteDance, and Meituan Compete for GPUs (争夺AI入场券：BAT、字节美团们竞逐GPU): https://m.huxiu.com/article/1676290.html
7. IDC: 2022–2023 Global Computing Power Index Assessment Report (2022–2023全球计算力指数评估报告): https://www.tsinghua.edu.cn/info/1175/105480.htm
8. Guosheng Securities: Large-Model Training Cost Estimates (国盛证券大模型训练估算): https://www.fxbaogao.com/detail/3565665
9. Wings of Information: What Is the Relationship Between Computing Power and AI? (信息之翼：算力与AI是什么关系?): https://zhuanlan.zhihu.com/p/627645270