“Quality Datasets for AI” (transcription)

Gabrielle Ponce González
Published in Effect Network
Jul 17, 2023
  • Disclaimer: This article has been edited to make it readable and shorter while keeping the most important discussion that we had in the live session. To listen to the full, unedited version, please go here: https://twitter.com/i/spaces/1mrGmkempqkxy?s=20

Intros

Roberto: My background is actually mainly in game design, and I always wanted to build and create awesome games, but I didn’t really want to be reliant on anyone. I’ve worked in the gaming industry for 15 years. I used to be a designer who also had dev knowledge, and my main focus has always been building the systems that make games fun. In terms of what I’m doing right now, I’m working at Ocean, where I helped get their grants program started; the goal was to bootstrap a lot of the projects building around the protocol. We helped distribute something like $2.5 million to over 150 projects. It’s been really cool to see all of these communities of web3 engineers and developers building so many data-focused protocols, data sets, and AI models, trying to create companies around all of this and build the value flows and everything that web3 is so good at.

Now my focus is less on things like how to build your models and more on how to get not just good-quality data for AI, but data that’s actually valuable and can actually be used to build products that people care about.

Jesse: I’ve been following Ocean Protocol since the beginning, and it’s an amazing project. And I think what you guys are doing is so important for the space, especially now that the AI boom has started.

A bit about myself: I have a background in Artificial Intelligence. At university in Amsterdam, I studied AI and did some research in AI for a little while after my masters, mainly in deep learning, and I was developing a deep learning pipeline for creating deepfakes just before they got really popular. So I was in that space back then, and then it grew really quickly. I’ve also done work on the data engineering side and on developing algorithms, and shortly after that I started Effect Network, or Effect.AI. When we started it, what we were trying to solve was how to make data, like a specialized data set you might need for your project, more accessible for everyone, because it was clear back then that large data sets were too hard to access and expensive to create. And they were, especially back then, extremely valuable for developing algorithms. With Effect Network, we were trying to give everyone access to a workforce that can create great, high-quality data sets through human annotation. Now, six years later, I think this is more relevant than ever.

Q1: What is a quality dataset, and why is it essential in AI?

Roberto: To me, a quality data set is one that gives you the context you need so that you can then understand the terrain or the problem space a little bit and then ask some good questions.

Jesse: In my experience, a quality data set is one that has diversity and covers the information that you want to grasp because, when you use the data set to train an algorithm, you want to sort of extract from it information that you can use to make predictions, right? That’s what AI is about.

Quality is important. If you have a lot of raw data, or it’s in the wrong format, you can spend tons of time cleaning it up, getting it into the right format, and taking out the outliers and the things that just don’t really represent what you care about. So a quality data set needs to have that as well: diversity and cleanliness are really important.
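
As a concrete sketch of the cleanup described here, assuming pandas and a made-up column, dropping missing values and filtering outliers with the common 1.5 * IQR rule might look like this:

```python
import pandas as pd

# Hypothetical raw data with a missing value and an obvious outlier.
df = pd.DataFrame({"age": [23, 31, None, 28, 35, 420]})

# Drop rows with missing values.
df = df.dropna(subset=["age"])

# Filter outliers with the common 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)  # the 420 row is gone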

Roberto: Of course, you get biased algorithms, but if you know they’re biased, sometimes they can actually be good. If you understand the model’s limitations, sometimes that’s sufficient to solve some of the immediate problems in the business, and that makes the technology accessible.

Jesse: Yeah, that’s a very good point. I think bias isn’t even a bad thing per se.

As long as we can show the bias and be aware of it. For example, women are way underrepresented in a lot of the data that is available, and that’s a deeply cultural problem that’s hard to solve. But we should be able to indicate that, and since different data sets have different biases, transparency in that regard is essential.

Q2: What are some common issues that can affect the quality of a data set?

Roberto: So Jesse is a data scientist, right? Say he, as a user, gets a file that has a bunch of missing data in it, or there are not enough women represented in the sample; all these things can happen to the data, so even if you create a great model, you’re still going to have people putting fences around it. For example, if you are missing data because there are not enough women, maybe you just copy some of the women and resample, and that’s called oversampling. That helps to balance the training, but you don’t have more opinions; it’s the same opinion, though for training purposes the model doesn’t care. And that’s one of the weird things about what’s being built now: it’s like the magic of AI, only it isn’t quite magic. It’s really incredible how AI has evolved into something very different from what it was five or ten years ago.
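
The naive oversampling Roberto describes can be sketched in a few lines of pandas; the column names here are purely illustrative. Note that the duplicated rows add no new information, exactly the "same opinion" caveat he raises:

```python
import pandas as pd

# Hypothetical toy data: a label column plus a "gender" column in which
# women are underrepresented (all names here are illustrative).
df = pd.DataFrame({
    "gender": ["m"] * 8 + ["f"] * 2,
    "label":  [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Naive random oversampling: duplicate minority-group rows (sampling
# with replacement) until both groups are the same size.
counts = df["gender"].value_counts()
minority = counts.idxmin()
extra = df[df["gender"] == minority].sample(
    n=counts.max() - counts.min(), replace=True, random_state=0
)
balanced = pd.concat([df, extra], ignore_index=True)

print(balanced["gender"].value_counts())  # m: 8, f: 8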

Jesse: Since the days when I was actively training AI, building algorithms, and working as a data scientist, the world has come to look very different. Back then there were big problems like missing values and data imbalance, mainly because the technology hadn’t yet found the infrastructure it could use.

ChatGPT could be trained on the entire internet, and it would cover all cultures that are already online. And most books that exist have been scanned digitally and can be used for training algorithms. So I think now it covers a broader cultural and data spectrum than it did when I was working in the space. It was more like the data sets available would be English books or images that were very biased and narrow. So I think that has improved, but it can still improve a lot more.

I think we’re getting into a place where human data is becoming scarcer and a lot more AI-generated content is out there, which will produce a lot of problems such as reusing the same content without any new ideas.

Roberto: That’s true. I think that there’s still a lot of value in the human-in-the-loop part. So you know, even though I try to use an agent to build stuff for myself, it’s just not that high of a quality. So I still have to tweak various things and do a bunch of massaging to get the results that I want. That points to just how important community and networks are.

Q3: How would you preprocess and clean the data to ensure its quality and suitability for AI applications?

Jesse: A data scientist spends a lot of time cleaning and processing data to make sure it fits into their pipeline for training algorithms, and there are many ways to go about it. It’s not the fun part, and hardly anyone really enjoys having to process massive amounts of data into a format that fits their needs. So in this day and age, we’re looking more towards accessing already-clean data. That’s the more important part: you want some assurance of high quality from wherever you get your data. Of course, Ocean Protocol is a data marketplace, and if you have algorithms or people to clean up a data set and enrich it a bit more, that would make it really practical. Also, I would use Python to do all the preprocessing, just because it’s convenient and there are so many tools available to load data frames, clean them, and work with different formats. So use the tools available and definitely look for clean sources of data.
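
A minimal sketch of the kind of Python preprocessing Jesse mentions, using pandas; the file name and columns are hypothetical:

```python
import pandas as pd

# Load a hypothetical CSV; the path and column names are illustrative.
df = pd.read_csv("raw_data.csv")

# Normalize formats: parse dates, coerce numeric strings, tidy text.
# errors="coerce" turns unparseable values into NaN instead of raising,
# so they can all be dropped in one place further down.
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["category"] = df["category"].str.strip().str.lower()

# Drop exact duplicates and any rows that failed to parse.
df = df.drop_duplicates().dropna(subset=["created_at", "price"])

# Hand off a clean, consistently typed dataset.
df.to_csv("clean_data.csv", index=False)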

Roberto: I agree, there are really great places to go and get data from, and at Ocean, we do have various data sets available. I think there are people working towards providing frameworks for platforms to clean and enrich or improve the data quality themselves, Effect Network included.

For me, it’s more of a business problem than an AI thing. One of the things we’re working on at Ocean is called data farming, and the idea is that quality data sets will drive revenue. That traction is the signal we put in front of our DAO participants, so they can vote on the data sets with the most traction as a way to curate quality. This is the objective function through which the protocol aligns everyone toward building quality and value.

Jesse: It's really nice that you guys are doing that. I think this is where Web3, AI, and data come together because, with these DeFi systems and being able to stake tokens and vote for the quality of data sets, we’re making it really transparent what the quality is like.

Roberto: Ideally, in the future, DAOs will take control of their data, and we can push ownership and control of the data to the individual. Eventually, data quality and everything else will move on-chain. The quality of it will grow exponentially, and it will outcompete whatever we have in private systems these days.

Q4: What are your opinions on online repositories and resources that provide quality datasets, such as Ocean Protocol and Google Data Set Search?

Jesse: There are many sources for data, and this is sort of an extension of what we just talked about, but for me, the future of quality data sets has to be publicly accessible and somehow traceable. What we need to know about these data sets are things like which elements are machine-generated or human-generated, what the biases are, and so on. I think social networks are currently one of the most important sources of data, because an AI tries to understand concepts of human behavior, and so much of our lives has moved online that social networks are major repositories where AIs can learn these things. Social media platforms such as Facebook, Reddit, and Twitter are going to realize that that’s their value, and they’re going to try to monetize it. So I really believe open access to those platforms will decline a lot, but decentralized alternatives are going to climb, as will infrastructure providers for managing and creating datasets.

This is a moment for us to shine and provide that infrastructure, with way better data sources than the ones you just mentioned. We can really provide data sets that are proven to have been generated by specific people who also got rewarded for them. With Effect Network, we’re creating a platform where you can prove the origin of data, whether it was generated by humans or by algorithms.

So I think the scene is going to change dramatically in the coming years, and Web3 is going to be the major thing, the pillar that all of this will be built around.

Roberto: I think that in the industry, we don’t have the right incentive mechanisms or the right ways to take the user who created the data set and empower them to develop, to completely immerse themselves in it in a way that creates new revenue streams, and to build on top of these protocols in new ways. That part is still missing, and perhaps it is something that could be done in the future. The one thing I wanted to double down on is that there are some really cool things we can build that solve problems people will really care about and that are not complex. There are some great building blocks: permanent storage, access, and licensing. We can build some really interesting things in crypto right now.
