AI Meets a 96 Years Old NGO: Improving Case Management for Cross-Border Child Protection
The social service sector is increasingly showing an interest in turning to data-driven practices, which, until now, were predominantly utilized by the commercial counterpart.
Some of the key reforms that social organizations expect from leveraging data include:
- Harness the potential of the underlying gold mine of expert knowledge
- Relieve the limited staff of repetitive administrative and operational tasks
- Address missions on a shorter timeline with enhanced efficiency
Yet, according to IBM’s 2017 study, most of the sector seems to be in the early stages of the data journey, as shown in the visual below. Constrained budget, access to technology, and talent are cited as the major hurdles in utilizing analytical services.
We at Omdena had an unparalleled opportunity to work on one of such nascent stage projects for International Social Service (ISS), a 96-year old NGO that massively contributes to resolving child protection cases.
Why This Project?
ISS has a global network of expertise in providing essential social services to children across borders — mainly in the domains of child protection and migration. However, with over 70,000 open cases per year, ISS caseworkers were facing challenges in two aspects: managing time, and managing data. The challenges were often exacerbated by administrative backlogs and a high turnover rate not uncommon in the nonprofit sector.
If we could find a way to significantly reduce the percentage of time lost on repetitive administrative work, we could focus on the more direct, high-impact tasks, helping more children and families access better quality services. ISS saw an urgent need for a technological transformation in the way they managed cases — and this is where Omdena came into the picture.
The overarching goal of this project was to improve the quality of case management and avoid unnecessary delays in service. Our main question remained, how can we help save caseworkers’ time and leverage their data in a meaningful way?
It was time to break down the problem into more specific targets. We identified factors that hinder ISS caseworkers from focusing on the client-facing activities, seen in the following picture.
We saw that these subproblems could each be solved with different tools, which would help organizations like ISS understand the various ways that machine learning can be integrated with their existing system.
We also concluded that if we could manage data in a more streamlined manner, we could manage time more efficiently. Therefore, we decided that introducing a sample database system would prove to be beneficial.
The biggest challenge we faced as a team was the data shortage. ISS had a strict confidentiality agreement with their clients, which meant they couldn’t simply give us raw case files and expose private information.
Initially, ISS gave us five completed case files with names either masked or altered. As manually editing cases would have taken up caseworkers’ client-facing time, our team at Omdena had to find another way to augment data.
Our team collectively tackled the main problem from various angles, as follows:
As we had only five data points to work with and were not authorized to access ISS’s data pool, we clarified that our final product would be a proof-of-concept rather than a production-ready system.
Additionally, keeping in mind that our product was going to be used by caseworkers who may not have a technical background, we consolidated the final deliverables into a web application with a simple user interface.
Due to limited cases available at the start of the project, the first task in hand was to collect more children related cases from various sources. We majorly concentrated on child abuse and migration cases. We gathered the case files and success stories that were publicly available on ISS partner websites, Malawi’s Child Protection Training Manual, Bolt Burdon Kemp, and Act For Kids. We also collected a catalog of child welfare court cases from the European Court of Human Rights (ECHR), WorldCourt, US Case Law, and LawCite. In the end, we had managed to collate a dataset of about 230 cases and were ready to utilize these in our project pipeline.
We relied on a supervised learning approach for our risk score prediction model. For this, we manually labeled risk scores for each of our cases. A risk score, a float value ranging from 0 to 1, intends to highlight priority cases (i.e. cases which involve a threat to a child’s wellbeing or have tight time constraints) and require the immediate attention of the caseworkers.
The scores were given by taking into consideration various factors such as the presence and a history of abuse within the child’s network, access to education, caretaker’s willingness to care for the child, and so on. To reduce bias, three collaborators provided their risk score input for each case, and the average of the three was considered as the final risk score for that case.
Finally, we demarcated the risk scores into three categories, using the following threshold.
Additionally, we augmented our data — which originally only contained case text — by adding extra information such as case open and close dates, type of service requested, and country where the service is requested. Using this, we created a seed file to populate our sample database. These parameters would later help caseworkers see how applying a simple database search and filter system can enable dynamic data retrieval.
Next, we moved to data preprocessing which is crucial in any data project pipeline. To generate clean, formatted data, we implemented the following steps:
- Text Cleaning: Since the case texts were pulled from different sources, we had different sets of noises to remove, including special characters, unnecessary numbers, and section titles.
- Lowercasing: We converted the text to lower case to avoid multiple copies of the same words.
- Tokenization: Case text was further converted into tokens of sentences and words to access them individually.
- Stop word Removal: As stop words did not contribute to certain solutions that we worked on, we considered it wise to remove them.
- Lemmatization: For certain tasks like keyword and risk factor extraction, it was necessary to reduce the word to its lemmatized form (eg. “crying” to “cry,” “abused” to “abuse”), so that the words with the same root are not addressed multiple times.
We had to convert the case texts into compact numerical representations of fixed lengths to make them machine-readable. We considered four different types of embedding methods — Term Frequency Inverse Document Frequency (TFIDF), Doc2Vec, Universal Sentence Encoder (USE), and Bidirectional Encoder Representations (BERT).
To choose the one that works best for our case, we embedded all cases using each embedding method. Next, we reduced the embedding vector size to 100 dimensions using Principal Component Analysis (PCA). Then, we used a hierarchical clustering method to group similar cases as clusters. To find the optimal number of clusters for our problem, we referred to the dendrogram plot. We finally evaluated the quality of the clusters using Silhouette scores.
After performing these steps for all four algorithms, we observed the highest Silhouette score for USE embeddings, which was then selected as our embedding model.
Models & Algorithms
Multiple pre-trained extractive summarizers were tried, including BART, XLNet, BERT-SUM, and GPT-2 which were made available thanks to the HuggingFace Transformers library. As evaluation metrics such as ROUGE-N and BLEU required a lot more reference summaries than what we had, we opted for relative performance comparison and checked for the quality and noise level of each model’s outcomes. Then, inference speed played a major role in determining the final model for our use case, which was XLNet.
Keyword & Entity Relation Extraction
Keywords were obtained from each case file using RAKE, a keyword extraction algorithm that determines high-importance phrases based on their frequencies in relation to other words in the text.
For entity relations, several techniques using OpenIE and AllenNLP were tried, but they each had their own set of drawbacks, such as producing instances of repetitive information. So we implemented our own custom relation extractor utilizing spaCy, which better-identified subject and object nodes as well as their relationships based on root dependency.
The pairwise similarity was computed between a given case and the rest of the data based on USE embeddings. Among Euclidean distance, Manhattan distance, and cosine similarity, we chose cosine similarity as our distance metric for two reasons.
First, it works well with unnormalized data. Second, it takes into account the orientation (i.e. angle between the embedding vectors) rather than the magnitude of the distance between the vectors. This was favorable for our task as we had cases of various lengths, and needed to avoid missing out on cases with diluted yet similar embeddings.
After getting similarity scores for all cases in our database, we fetched top five cases that had the highest similarity values to the input case.
Risk Score Prediction
A number of regression models were trained using document embeddings as input and manually labeled risk scores as output. Tensorflow’s AutoKeras, Keras, and XGBoost were some of the libraries used. The best performing model — our custom Keras neural network sequential model — was selected based on root mean square error (RMSE).
Abuse Type & Risk Factor Extraction
We created more domain-specific tools to generate another source of insight via algorithms to find primary abuse types and risk factors.
For the abuse type extraction, we defined eight abuse-related verb categories such as “beat,” “molest,” and “neglect.” spaCy’s pre-trained English model en_core_web_lg and part-of-speech (POS) tagging were used to extract verbs and transform them into word vectors. Using cosine similarity, we compared each verb against the eight categories to find which abuse types most accurately capture the topic of the case.
Risk factor extraction works in a similar way, in that text was also tokenized and preprocessed using spaCy. This algorithm, however, further extended the previous abuse verbs by including additional risk words, such as “trauma,” “sick,” “war,” and “lack.” Instead of only looking at verbs, we compared each word in the case text (excluding custom stop words) against the risk words. Words that had a similarity score of over 0.65 with any of the risk factors were presented in their original form. This addition aimed to provide more transparency over what words may have affected the risk score.
To put these models altogether in a way that ISS caseworkers could easily understand and use, a simple user interface was developed using Flask, a lightweight Python web application framework. We also created forms via WTForms and graphs via Plotly, and let Bootstrap handle the overall stylization of the UI.
For the database, we used PostgreSQL, a relational database management system (RDBMS), along with SQLAlchemy, an object-relational mapper (ORM) that allows us to communicate with our database in a programmatic way. Our dataset, excluding the five confidential case files initially provided by ISS, was seeded into the database, which was then hosted on Amazon RDS.
A public Tableau dashboard to visualize the case files was also added, should caseworkers wish to refer to external resources and gain further insight on case outcomes.
The aim of this project was to assist ISS in offering services in a timely manner, bearing in mind the organizational history of knowledge available. Within eight weeks, we achieved this goal by providing an application prototype that would help caseworkers understand some of the various ways to leverage the power of data.
This tool, upon continued development, will be the first step toward ISS’s AI journey. And with enhanced capabilities, both experienced and less experienced caseworkers will be able to make better-informed decisions.
Our models do come with some limitations, mainly stemming from limited data due to privacy reasons and time constraints.
As our dataset only accounted for two types of services (child abuse and migration), and came from a small number of geographical sources, the risk score prediction model may contain biases. Bias could also have been induced by the manual labeling of risk scores, which was done based on our assumptions.
The solutions provided are not meant to replace human involvement. As 100% accuracy of a machine learning model can be difficult to achieve, the tool works best in combination with the judgment of a caseworker.
To bring this tool closer to a production level, a few improvements can be made.
Furthermore, risk scores can be validated by the ISS caseworkers. They can provide even more risk scores as they use our prediction model, to enable continuous learning.
The database fields can be more granular and include additional attributes as caseworkers see fit. For example, the current field “case_text” can be divided into “background_info,” “outcome,” and so on. This will create more flexibility in document search and flagging missing information.
Finally, once the app is productionized and deployed on a platform like AWS, ISS caseworkers across the globe will have access to these tools, as well as access to the entire network’s pool of resources — truly empowering caseworkers to do the work that matters.