Why Nonprofits Can’t Have Machine Learning Without Standardization
The buzzwords “machine learning” and “AI” often send organizations spinning into a state of excitement, brainstorming potential applications and forecasting compelling use cases. However, at times this excitement overshadows the most essential prerequisite — good (clean) data.
In my opinion, standardization, or the lack thereof, is a prominent issue in the humanitarian sector because there are often several ways of referencing a single entity. For instance, international organizations frequently have their names translated into multiple languages, which complicates naming conventions. Médecins Sans Frontières, also known as Doctors Without Borders, Médicos Sin Fronteras and many other translations of its official name, is just one example. In addition to the linguistic variations, organizations are often referred to by an abbreviated form, as is again the case with Médecins Sans Frontières (“MSF”). Platforms such as the Humanitarian Data Exchange (HDX) and the International Aid Transparency Initiative (IATI) have basic guidelines in place for data sharing, but these need to be more specific: after extracting data from such sites, I still find myself spending 90% of my time cleaning it (translating it into English, fixing capitalization, removing punctuation, standardizing names and so on).
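To make the naming problem concrete, here is a minimal Python sketch of how the variants above could be mapped to a single name. The alias table, the function names and the choice of "Doctors Without Borders" as the canonical form are all assumptions made for illustration; a real table would be built from your own data.

```python
import string
import unicodedata

# Hypothetical alias table: keys are cleaned variants, values are the
# canonical name chosen for this illustration.
ALIASES = {
    "medecins sans frontieres": "Doctors Without Borders",
    "medicos sin fronteras": "Doctors Without Borders",
    "doctors without borders": "Doctors Without Borders",
    "msf": "Doctors Without Borders",
}


def clean_key(name: str) -> str:
    """Lowercase the name, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = no_accents.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def canonical_name(name: str) -> str:
    """Return the canonical name for a known variant, else the trimmed input."""
    return ALIASES.get(clean_key(name), name.strip())
```

With this sketch, `canonical_name("Médecins Sans Frontières")`, `canonical_name("M.S.F.")` and `canonical_name("medicos sin fronteras")` all resolve to the same entry, while a name that is not in the table is returned unchanged so that it can be reviewed by hand.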
Therefore, if you have not already done so, it is essential that you establish a fixed set of rules for how your organization records and stores its data. This will save you plenty of time in the long run, and will enable you to maximize your resources. Your list should cover formatting, as well as stylistic and linguistic guidelines.
Below is an example of our data guidelines at Fields Data. You will notice that we tackle the issues previously mentioned by consistently formatting names as follows: the organization’s full name in English (if available), with all words capitalized except articles and conjunctions (the, and, or, etc.). We also outline which variables we are collecting, and include a brief description of each one for transparency purposes.
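A capitalization rule like this one is easy to automate. The sketch below is one possible way to do it in Python, not our actual tooling: the function name and the exact list of lowercase words are assumptions, and I have also assumed that the first word of a name is always capitalized.

```python
# Assumed list of words to keep lowercase; extend it to match your own guidelines.
LOWERCASE_WORDS = {"a", "an", "the", "and", "or"}


def format_org_name(name: str) -> str:
    """Capitalize every word except the listed ones (the first word is always capitalized)."""
    formatted = []
    for i, word in enumerate(name.split()):
        lower = word.lower()
        if i > 0 and lower in LOWERCASE_WORDS:
            formatted.append(lower)
        else:
            formatted.append(lower[:1].upper() + lower[1:])
    return " ".join(formatted)
```

For example, `format_org_name("save THE children")` gives "Save the Children". Note that this simple version would also turn an acronym such as "UNICEF" into "Unicef", so abbreviations need their own rule or an exceptions list.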
Feel free to use our guidelines as a starting point, and simply adapt them to your needs. I encourage you to make yours specific, in accordance with your data type, but not so detailed that they are unrealistic to maintain.
Lastly, it is important to note that computers read text very literally. For instance, “World Bank” and “world bank” would be treated as two different entries. When data is not uniform, it can significantly reduce the accuracy of a machine learning algorithm by producing misleading calculations, such as incorrect counts of entries. In the worst-case scenario, this could result in your organization basing decisions on incorrect results. It is therefore critical that your organization follows through with standardizing its data in order to obtain more accurate insights and visualizations in the future.
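The counting problem is easy to reproduce. In this small Python sketch the list of entries is made up for illustration; it counts the same four records twice, once as they were typed and once after converting them to a common case.

```python
from collections import Counter

# Made-up sample: four records that all refer to the same organization.
entries = ["World Bank", "world bank", "World Bank", "WORLD BANK"]

# Counting the raw strings treats each spelling as a separate organization.
raw_counts = Counter(entries)

# Counting after case-folding groups all four records together.
normalized_counts = Counter(entry.casefold() for entry in entries)

print(len(raw_counts), "distinct names before standardizing")
print(len(normalized_counts), "distinct name after standardizing")
```

Before standardizing, the data appears to contain three organizations, the most frequent with only two records. After standardizing, it correctly shows a single organization with four.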
After all, an algorithm is only as good as its data.