HOW TO GET YOUR BUSINESS DATA READY FOR AI

Marianne Michaelis · CONTACT Research
Oct 5, 2023

As leading managers never tire of emphasizing, the future of many companies lies in data management. “Data is the new gold!”, “Unearth your data treasures!” — Data analytics firms are relentless in showing companies the value of their accumulated data and persuading them to engage in professional data science projects. However, as soon as data analysts are actually brought in, disillusionment usually sets in: The experts find that the supposedly available company data is hardly accessible, incomplete, or so flawed that the analyses become heavily skewed. Instead of receiving interesting results, the commissioning companies often have to content themselves with the data analyses revealing shortcomings in their data management — a highly unsatisfactory outcome for both parties!

So how can such trouble be avoided? What can companies do to benefit from current developments in the AI sector and ensure that data analytics projects in their organization are successful? The key factor here is data management. All algorithms in the data science sector, from clustering methods in data mining to neural networks for artificial intelligence, depend on good data input. Only when data of sufficiently high quality is fed into an algorithm can it calculate accurate results and thereby generate useful insights for the company. If, on the other hand, the underlying database systems are in disarray, even the best algorithm cannot deliver useful insights and may even produce misleading results.

Where does data chaos come from?

Let’s look at a practical example of how data chaos arises and how it could be avoided: In a company’s central data management system, employees enter their data via traditional input forms. These were developed, usually several decades ago, for the use cases that existed within the company at that time. Over time, new user groups were added to the system, for example when new departments were created, and workflows changed in existing departments. This also changed the information needs: certain new metrics had to be recorded, or additional notes had to be made. From an IT perspective, this means new tables have to be created in the database system, or new columns have to be added to existing tables. What sounds simple to the end user, who may have their Excel spreadsheet in mind (where they can easily add a column or worksheet), is a massive effort for large business solutions. The altered data structure is only rolled out from a certain software version onwards, comprehensive tests are required for the release, links between systems must be cleanly defined while considering many special cases, and in the worst case different versions are used in parallel and have to be laboriously migrated and merged later. All this generates significant development costs in IT, which are usually charged to the commissioning departments and thus affect their budget. Therefore, individual departments come up with the idea of bypassing this effort. Typically, they have two options for doing so:

Either they decide to maintain the additional data outside of the core system. This creates shadow data storage: away from central IT (e.g., on department servers), central work results or other important information are kept in simple Excel lists, without access and rights controls, backup copies, or versioning. It is obvious that this approach is technically risky for the company and may also have legal consequences. Data analysis is made more difficult as well: if only access to the central database was provided at the start of the project, but this database does not contain crucial information, the data analysts must first laboriously find out where each piece of data is stored and how to gain access to it, leaving less working time for the actual analyses.

Alternatively, departments can store the new data in the central system anyway, without involving IT. Instead of laboriously submitting change requests, coordinating the new solution with IT in numerous meetings, waiting for implementation, and finally paying for it, they take the seemingly easy and painless route: the department agrees internally that certain data, which used to be collected as standard, is no longer needed in current practice. The corresponding field in the input form is then repurposed; for example, all clerks receive the instruction to enter the name of the responsible accountant in the field for the location name. Technically this is not a problem: as long as the data type matches, this method produces no error messages. However, chaos now begins in the underlying database: the associated table column is filled with location names up to a certain entry date, after which it is filled by one department with people’s names, while the remaining departments continue to enter location names. And since the other departments usually behave in the same way, repurposing obsolete data fields with new content, an enormous data jumble builds up over the years in the originally well-designed database systems. Now if a data analyst runs their analyses on this data set (for example, trains a neural network that is supposed to predict the value of a certain field), what will come out? Exactly: the network will predict that Department A enters a location name at this point, and Department B enters a person’s name. That is correct, and it may even be a new insight for the client, but it doesn’t advance the business.
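
Such a silently repurposed field can often be spotted before any modeling starts by profiling the column against its master data. The following is a minimal sketch, assuming hypothetical column names (department, entry_date, location_name) and a hard-coded set of valid locations that would in practice come from a reference table: it computes, per department and year, the share of entries that still match the expected values.

```python
import pandas as pd

# Hypothetical extract of the central table: department, entry date, and the
# "location_name" field that one department silently repurposed.
df = pd.DataFrame({
    "department":    ["A", "A", "B", "B", "B"],
    "entry_date":    ["2019-03-01", "2023-06-15", "2019-04-02", "2023-05-20", "2023-07-01"],
    "location_name": ["Hamburg", "Munich", "Bremen", "J. Miller", "A. Schmidt"],
})
df["entry_date"] = pd.to_datetime(df["entry_date"])

# Known master data for the field; in a real project this would come from a
# maintained reference table rather than a hard-coded set.
valid_locations = {"Hamburg", "Munich", "Bremen"}

# Share of entries per department and year that still match the master data.
df["is_valid"] = df["location_name"].isin(valid_locations)
report = (
    df.groupby(["department", df["entry_date"].dt.year])["is_valid"]
      .mean()
      .rename("share_valid")
)
print(report)
```

Run on this toy data, the report shows Department B dropping from a 100% match in 2019 to 0% in 2023: exactly the repurposing pattern described above, and a much cheaper finding at this stage than after a model has been trained on the jumbled column.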

How to avoid data chaos?

But what can companies specifically do to ensure that their data treasures actually remain valuable treasures and don’t turn into complete garbage? Two aspects are important for this:

First of all: Management must accept that good data quality costs money. If you want to have clean data sets over many years, you have to invest in their maintenance. This starts first and foremost with regularly determining which information needs currently exist and to what extent they can be represented in the systems. Where this is no longer possible, database structures must be adapted. The crucial realization here is that the databases must be oriented towards work processes, not vice versa! If all clerks face extra work because they have to struggle with unsuitable systems, this costs a lot of money, even if that effort is less visible at first glance than the cost of IT adjustments. In addition, errors spread through further processing, so poorly maintained base systems can cause additional costs in downstream systems. The legal consequences for the company can also be devastating: data in shadow storage, for example, is not only inaccessible to the rest of the company and lacks quality assurance, it also exposes the company to legal risk due to the lack of access controls and backups.

Only clean, complete data is a sensible basis for data-driven decisions; otherwise, the old rule applies:

GIGO — garbage in, garbage out!

It also helps to prioritize which data are most critical to the company’s success. The general rule is: be proactive rather than reactive; it is much cheaper to fix data errors before problems arise from them. If convincing arguments for the introduction of data maintenance measures are still needed, it helps to document the consequences of earlier poor data quality and to quantify them financially.

The goal of these efforts must be good data: accurate, complete, consistent, and easily accessible to users. A certain critical amount of data is always needed for AI projects, but here, too, quality beats quantity: slightly smaller but cleaner data sets deliver better results. It is also important to bring the company’s entire staff along on the topic of data management, to train them in the business importance of good data, and to keep them informed about the current approach. Good cooperation with existing BI colleagues helps as well, as they often have the best knowledge of both the existing data storage and the current information needs in the company. Special training courses to develop BI colleagues further in the direction of data science are also worthwhile.

Some practical guidelines

Once you have accepted the basic costs of good data management, you can move on to practical implementation. Two cases must be distinguished here: improving already existing data stocks on the one hand, and setting up new data stores in an optimized way on the other.

In dealing with existing data, the initial situation naturally varies greatly from company to company; still, some general points can serve as a guideline:

  1. Define goals: What should the data be used for in the future, and what requirements must it meet to achieve these goals?
  2. Identify problems: What does a comparison of the current data quality with the defined requirements reveal? At which points are the goals already being met, and where does the data stock need to be reworked?
  3. Combine Data Analytics and Data Cleansing: To make these measures particularly efficient, it can be helpful to explore the data stock with classical data analysis and to correct any errors found immediately. It is important to log the corrected errors so that the sources of the errors can also be addressed afterwards (a minimal sketch of this follows after the list).
  4. Learn from mistakes: What have been the sources of errors in previous entries? Where do the existing data fields not meet user needs? This is not just about optimizing the database structure; the insights into actual needs can also make the future use of existing data easier (for example, through changed interfaces).
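
Point 3 can be kept lightweight. The following is a minimal sketch, assuming a hypothetical customer extract with inconsistent location spellings and a hard-coded mapping of known variants: every correction is applied to the data and simultaneously appended to a log file, so that the sources of the errors can be analyzed afterwards.

```python
import pandas as pd

# Hypothetical customer extract with a known problem: inconsistent spellings
# of the same location.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "location":    ["Munich", "Muenchen", "MUNICH "],
})

corrections = []  # log of every change, kept for the later root-cause analysis
canonical = {"muenchen": "Munich", "munich": "Munich"}

for idx, raw in df["location"].items():
    cleaned = canonical.get(raw.strip().lower(), raw.strip())
    if cleaned != raw:
        corrections.append(
            {"customer_id": df.at[idx, "customer_id"],
             "field": "location", "old": raw, "new": cleaned}
        )
        df.at[idx, "location"] = cleaned

# Persist the correction log so the *sources* of the errors can be addressed,
# not just the symptoms.
pd.DataFrame(corrections).to_csv("correction_log.csv", index=False)
print(df)
```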

When setting up a new data system, especially for larger systems, the use of specialized data management software (e.g. Data Governance Suites or MDM platforms) can be helpful. However, even without specific software products, attention should be paid to the following aspects:

  1. The storage of data must occur in a reliable system with regular, automated backups and be secured against unwanted third-party access.
  2. The data should be distributed across as few systems as possible; with each duplication of individual fields, the risk of inconsistencies increases. Errors and duplicates can be found much faster when all related data is in the same system.
  3. Individual errors are almost inevitable in large systems. In this regard, it is necessary to determine beforehand how to deal with this issue: What is the minimum quality that must be ensured in any case? Which parts of the system can tolerate what levels of error rates? Typically, a prioritization of the data most relevant to the company occurs here.
  4. Every database needs good, up-to-date documentation: What kind of data is it, where does it come from and how is it collected, and what is it used for later? Maintaining an overview here avoids many errors when users later wish to make changes.
  5. Roles, rights, and responsibilities must be defined for all parts of the dataset: Who has read or write access, and which data must be specially protected?
  6. Every part of the database requires a responsible person in the company (usually called a Data Steward) who knows all the information about the data set, can plan changes, and handles regular analyses for quality control. If errors are found, their sources must be sought and corrected.
  7. All data fields must be named clearly and understandably to avoid accidental erroneous entries. The Data Steward should regularly check whether the contents of the database still conform to the plans or have been repurposed due to changed needs.
  8. Whenever possible, data should be collected automatically. Data fields like date, time, user, or department should be pre-filled by the system. This both saves working time and improves data quality.
  9. Data that is expected to be important for later evaluations must be mandatory fields.
  10. Data fields which are not likely to be evaluated should be optional. Too many mandatory fields cause users to fill them in sloppily, which can spoil later analyses.
  11. For every field in the data set, a precise data format should be specified wherever possible. Free-text fields are to be avoided! Precisely here, work is often careless when new databases are set up, and almost every string field is left as free text. Instead, pre-defined selection lists (e.g., of persons, places, or billing units) should be used to keep the data set free of typos and content duplicates (e.g., through abbreviations); see the first sketch after this list.
  12. In systems where employees can modify data retrospectively, it is recommended to include change tracking from the start. On the one hand, this improves auditability and knowledge about editing processes; on the other hand, it can also help, for example, to identify that employees have problems with the proper use of certain fields.
  13. If the data system is closely linked with other software components (for example, systems that feed in data automatically), it must be tested before every new release whether the delivered data quality is still sufficient (see the second sketch after this list).
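
Several of these points can be enforced directly in the schema. The following is a minimal sketch of points 8 to 12, using SQLite and hypothetical table names (locations, orders, orders_audit); a production system would of course use the mechanisms of its own database, ERP, or MDM platform instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the selection list below

conn.executescript("""
    -- Point 11: a maintained selection list instead of a free-text field.
    CREATE TABLE locations (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );

    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        -- Point 9: fields needed for later evaluations are mandatory.
        customer    TEXT NOT NULL,
        -- Point 11: content is restricted to the selection list.
        location_id INTEGER NOT NULL REFERENCES locations(id),
        -- Point 8: automatically collected fields are pre-filled by the system.
        created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        -- Point 10: fields that are unlikely to be evaluated stay optional.
        remark      TEXT
    );

    -- Point 12: simple change tracking for retrospective edits.
    CREATE TABLE orders_audit (
        order_id        INTEGER NOT NULL,
        changed_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        old_location_id INTEGER,
        new_location_id INTEGER
    );

    CREATE TRIGGER orders_track_location_change
    AFTER UPDATE OF location_id ON orders
    BEGIN
        INSERT INTO orders_audit (order_id, old_location_id, new_location_id)
        VALUES (OLD.id, OLD.location_id, NEW.location_id);
    END;
""")
conn.commit()
```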
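
For point 13, the release test can be a small, automated check that runs against each incoming delivery. The following is a minimal sketch, assuming a hypothetical delivery with customer and location_id columns; the concrete rules and thresholds would be defined per field in a real project.

```python
import pandas as pd

def check_delivery(df: pd.DataFrame, valid_location_ids: set) -> list:
    """Return a list of data quality problems found in an incoming delivery."""
    problems = []
    # Mandatory field must not be empty.
    if df["customer"].isna().any():
        problems.append("missing customer names")
    # Values must come from the maintained selection list.
    unknown = set(df["location_id"]) - valid_location_ids
    if unknown:
        problems.append(f"unknown location ids: {sorted(unknown)}")
    return problems

# Hypothetical incoming delivery from a connected system.
delivery = pd.DataFrame({
    "customer":    ["ACME Corp", None],
    "location_id": [1, 99],
})

issues = check_delivery(delivery, valid_location_ids={1, 2, 3})
if issues:
    print("Release blocked, delivery failed quality checks:", issues)
else:
    print("Delivery accepted.")
```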

Through these actions, data records are kept clean from the beginning and thus provide a good basis for successful AI projects!

About CONTACT Research. CONTACT Research is a dynamic research group dedicated to collaborating with innovative minds from the fields of science and industry. Our primary mission is to develop cutting-edge solutions for the engineering and manufacturing challenges of the future. We undertake projects that encompass applied research, as well as technology and method innovation. An independent corporate unit within the CONTACT Software Group, we foster an environment where innovation thrives.


Marianne Michaelis
CONTACT Research

Researcher @CONTACT Software, curious about Data Science, AI and Philosophy