Read the Decentralized AI article series of FirstBatch with me, Part 1: Data Collection: Quality, Copyrights & Ownership
FirstBatch is building Dria now. Dria is an open-source collective knowledge Hub storing its information on Arweave, which aims at building a knowledge & open connectivity interface between humans and machines. FirstBatch introduced Dria as The Wikipedia of AIs. Recently FirstBatch published the first part of its new research series, towards Decentralized AI, focusing on the intersection between data collection problems and decentralization. In this article, the author will read the first research Data Collection: Quality, Copyrights & Ownership with you, focusing on how decentralization can solve data collection problems and risks and challenges in the decentralization of data collection.
How Decentralization Can Solve Data Collection Problems
The current challenges that AI teams/developers face during the data collection process are tiny data volume, poor data quality, storage concerns, privacy control, and intellectual property rights issues. Let’s take a look at how decentralization can help solve these problems.
The Chief AI scientist at Meta pointed out that the amount of data used for training these AI models is still tiny compared to the amount of input a 4-year-old child receives despite the major advancements in LLMs. Today, the format and source of data are limited to text-based types and certain sectors/verticals. FirstBatch imagined attracting teams and individuals to curate and rate data through social and economic incentives can speed up the process by bringing in new types and sources of data.
Nowadays, AI developers face a major challenge in finding high-quality data to use and checking the quality of the data they collect since the poor data quality stems from repeated and outdated data sources. Plus, relying on automated methods can lead to inaccuracies and poor data quality. Learning the successful experience of improving data quality from open data Hubs like Hugging Face, Kaggle, and Wikipedia, FirstBatch was innovated to state that, ‘by creating decentralized open data hubs, a large number of people join the process of curate, review, and rate the data. This can not only shift the burden of ensuring the quality of datasets from relatively small teams but also protect the data from any single party’s manipulation or intervention. With the right incentive structures, these decentralized data hubs and data curation/review activities can ensure the quality of data coming in with high velocity and volume.’ Dria of FirstBatch is creating a universal knowledge hub.
When it comes to storage, two key issues AI projects are faced with are cost and maintenance. With the ever-growing data amount and the iterating monthly subscription cost, users tried getting a discount by buying a larger storage service in advance. Unfortunately, it turned out to be a waste both economically and technically. FirstBatch stores the data on Arweave to mitigate the risk of losing data. It’s also useful to create a shared data lake that different projects can use without going through the process of collecting and storing the same data in different places. By doing this, waste of space and cost can be alleviated.
Exposing private data with personally identifiable information on the open knowledge hub for hundreds of thousands of people to review would violate certain policies. FirstBatch made a point that zero-knowledge and DID technologies can be used to encode the data before they run into the open knowledge hubs for curation. Future online activities should be in the protection of data privacy.
Numerous online platforms and media outlets raise concerns regarding AI companies utilizing copyrighted materials. They argue that the training and utilization of these AI models may encroach upon the rights of the original content owners. NFTs, thanks to their on-chain actions’ transparency and immutability, provide a clear and transparent means of establishing ownership and usage rights for creative and intellectual material. These tokens serve as a tool for verifying and identifying materials subject to specific procedures, simplifying both the data-cleaning process and the management of legal challenges.
Risks and Challenges in Decentralization of Data Collection
While decentralized solutions have their merits, the persistent issue remains the risks associated with user anonymity. For instance, when it comes to regulatory matters concerning copyright or harmful content, anonymous misconduct can potentially lead to more significant problems, putting the platform at risk. Storing data permanently on decentralized networks can result in the inclusion of harmful content in uploaded data, even with public data scrutiny, leaving room for oversight.
One significant challenge is how to allocate the weighting of data quantity and quality incentives. Regardless of the platform’s structure, there will always be individuals uploading either lower-quality data or fewer high-quality data, posing a perpetual challenge.
Conclusion
As decentralized AI data collection platforms continue to evolve, there will be more opportunities to empower better coordination paradigms toward smoother data collection pipelines. We also look forward to Dria bringing more good news of enhancing both the quantity and quality of data.
Link to the original article: https://www.firstbatch.xyz/blog/towards-decentralized-ai-part-1-data-collection
🔗 More about PermaDAO :Website | Twitter | Telegram | Discord | Medium | Youtube
💡 Initiated by everVision and sponsored by Forward Research (Arweave Official), PermaDAO is a “Cobuilding Community” focus on the theme of Arweave consensus storage. All contributions from PermaDAO contributors form the bedrock of data consensus. Let’s embark on a journey starting with data consensus and delve into a novel paradigm for decentralized collaboration — Decentralized Autonomous Organizations (DAOs)!