Data Tagging, Data Cleansing Costs, and How Bottos Will Help
Skyrocketing Data Costs and Time Losses
Data is the main training source for Artificial Intelligence algorithms. Large quantities of data can determine the success of purpose-built AI algorithms and push their progress forward. Yet the main barriers to developing and deploying AI algorithms are the quality of the available data and the huge cost of tagging and cleaning datasets, which can consume as much as half of the available budget.
Additional costs arise because extra effort is needed to collect large quantities of data, which most of the time is owned by a handful of companies, making the price of large, high-quality datasets skyrocket.
As time goes by, however, more and more data will be produced, especially with the spread of the IoT, and all of it will have to be cleaned and tagged, a significant challenge for small and medium enterprises, as well as for research institutions and individuals.
Unfortunately, data cleansing is a time-consuming endeavor. A survey conducted by CrowdFlower reported that data scientists spend 60% of their time cleaning and organizing data and another 19% collecting data sets. That adds up to almost 80% of their time devoted to preparing and managing data for analysis, greatly impacting overall costs and budgeting.
Another study by IBM estimated that low-quality data costs the United States around $3.1 trillion every year, and that one in three business leaders do not trust the information they use to make decisions, creating severe problems of data veracity.
The Data Cleansing Process
Data cleansing is a complex, multi-stage process. Common practice is to start with a detailed data analysis to detect which kinds of errors and inconsistencies must be removed. In addition to a manual inspection of the data or data samples, algorithms are often needed to extract metadata about the data's properties and spot data quality problems.
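As an illustration of that first analysis step, the minimal Python sketch below profiles a raw dataset with pandas, counting missing values, duplicates, and implausible readings. The file name and column names are hypothetical and only serve to show the kind of metadata typically extracted.

```python
import pandas as pd

# Hypothetical raw data file and column names, used purely for illustration.
df = pd.read_csv("sensor_readings.csv")

# Structure, types, and summary statistics.
df.info()
print(df.describe(include="all"))

# Metadata about data quality: missing values and exact duplicates.
print("Missing values per column:\n", df.isna().sum())
print("Duplicate rows:", df.duplicated().sum())

# Spot implausible values, e.g. temperature readings outside a physical range.
out_of_range = df[(df["temperature_c"] < -60) | (df["temperature_c"] > 60)]
print("Out-of-range temperature readings:", len(out_of_range))
```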
Software that employs machine learning can help, but because data can come from disparate sources, the data cleansing process also requires getting data into a consistent format for easier usability and to ensure it all has the same shape and schema. Depending on the number of data sources, their degree of heterogeneity, and the quality of the data, a transformation step may also be required. The effectiveness of the transformation workflow and the transformation definitions must then be tested and evaluated, and multiple iterations of the analysis, design, and verification steps are usually needed to further polish the data. Once errors are removed, the cleansed data must replace the previous data in the original sources.
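The sketch below shows, under the same hypothetical column names, how such a transformation workflow might normalize heterogeneous sources into one schema and run simple verification checks before the cleansed data replaces the original files. The clean and validate helpers are illustrative assumptions, not part of any Bottos tooling.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize one source into a consistent shape and schema."""
    df = df.rename(columns=str.lower)                         # uniform column naming
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    df = df.drop_duplicates()
    return df.dropna(subset=["timestamp", "value"])           # drop rows we cannot repair

def validate(df: pd.DataFrame) -> None:
    """Evaluate the transformation before the cleansed data replaces the source."""
    assert not df.duplicated().any(), "duplicate rows remain"
    assert df["value"].notna().all(), "unresolved missing values"
    assert df["timestamp"].notna().all(), "unparseable timestamps"

# Merge heterogeneous sources into one consistent dataset, verify it, then persist it.
sources = ["source_a.csv", "source_b.csv"]                    # hypothetical inputs
cleaned = pd.concat([clean(pd.read_csv(path)) for path in sources], ignore_index=True)
validate(cleaned)
cleaned.to_csv("cleaned_dataset.csv", index=False)            # replaces the previous data
```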
There are other time- and budget-consuming challenges as well. For instance, the available information on the errors is often insufficient to determine how to correct them, leaving data deletion as the only viable option; deleting the data, however, means losing information. Then there is the fact that data cleansing must be repeated every time data is accessed or values change.
How Bottos Can Push Down Data Costs
To face and solve these problems, which affect the whole industry, Bottos is focusing on exactly this point and is actively creating industrial value through the multi-role participation of community nodes, a kind of “data crowdsourcing platform” that may disrupt the AI industry.
Furthermore, it will strengthen these efforts by enacting policies and systems designed to assure and incentivize the high quality of the data exchanged, and by relying on its users' support for the data cleansing and tagging effort.
The aims of the Bottos infrastructure include furthering the progress of AI and “democratizing” the development of AI models, algorithms, and applications by making data and resources available to as many people and places as possible.
Join Our Community and Stay Updated!