Illustration Scoring: Teaching Technology to Be More Human

Timothy Carroll
Published in The Startup
6 min read · Jan 7, 2021

The applications of machine learning range widely, from spam filtering and predictive analysis to recommendation engines and fraud detection. Machines, using neural networks, are often assigned computational tasks because they can perform them far faster and more accurately than humans.
A question worth considering is how they fare when performing an innately human task, such as grading children’s illustrations. This is exactly what our team set out to discover while working on the web application Story Squad.

The App

Story Squad, founded by former teacher Graig Peterson, was developed to encourage children to take a break from their daily screen time and practice creative writing and drawing. The game works like this:

  • Each week, the children are provided with a section of a story to read.
  • Next, they’re prompted to take time off-screen to write a one-page creative story and draw an illustration based on the section they just completed.
  • Photos of the story and illustration are then uploaded to the app, transcribed, and scored based on an analysis of the writing.
  • Finally, children are matched into squads of four by a clustering algorithm. They then vote on each other’s work to determine the winner of the matchup.

The Task

I was brought onto Story Squad for a four-week project cycle as part of a cross-functional team of three web developers and two data scientists. The task set out for the data science team was to build a system to score hand-drawn illustrations. This may sound simple on the surface; in reality, it is anything but.
Consider this: how do you measure art? Is it form, structure, mass and void? Complexity, accuracy, abstract beauty, or coloring? These questions alone can be argued and interpreted endlessly, so what about getting a machine learning model to measure any of these metrics?

The team began researching prior work on the subject and examples of similar systems in practice. Surprisingly, across the wide world of the internet, there wasn’t much. There are many examples of image classification, segmentation, salient object detection (SOD), and the like, but far less on image “grading”. Our stakeholder had recommended the use of a SOD model, but after testing it out, we found it difficult to pull any valuable metrics from its output.

SOD model output on pictures of various illustrations

Heading back to the drawing board, we began to explore models for label classification, such as taking a movie poster and guessing the film’s likely genre(s). There were various examples of this in practice, and it proved reliably effective. This got us thinking: could we manually grade a solid number of illustrations, then train a model to do it for us? We began to pursue this avenue, but we still had a lingering problem: a lack of data.

Data Collection and Model Building

Enter data scraping, a powerful technique for collecting the data needed to train and power our models. Our particular approach used Selenium to automate Google Image searches and save a specified number of images from a set of queries.

code used for image scraping
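The scraping code itself was shared as a screenshot in the original post. A minimal sketch of the approach, assuming Chrome as the browser and a `downloads/` output folder (both my assumptions, not details from the original code), might look like this:

```python
# Sketch of Selenium-based Google Image scraping; browser choice,
# folder name, and queries below are illustrative assumptions.
import os
import time
from urllib.parse import quote_plus

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By


def scrape_images(query, n_images, out_dir="downloads"):
    """Search Google Images for `query` and save up to `n_images` results."""
    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Chrome()
    driver.get(f"https://www.google.com/search?q={quote_plus(query)}&tbm=isch")
    time.sleep(2)  # give the thumbnails time to load

    saved = 0
    for thumb in driver.find_elements(By.CSS_SELECTOR, "img"):
        src = thumb.get_attribute("src")
        if not src or not src.startswith("http"):
            continue  # skip empty tags and base64 placeholders
        try:
            image_bytes = requests.get(src, timeout=5).content
        except requests.RequestException:
            continue
        filename = f"{query.replace(' ', '_')}_{saved}.jpg"
        with open(os.path.join(out_dir, filename), "wb") as f:
            f.write(image_bytes)
        saved += 1
        if saved >= n_images:
            break
    driver.quit()


for q in ["children drawing", "kids illustration"]:
    scrape_images(q, n_images=100)
```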

After scraping around 5,000 images, we filtered them manually: irrelevant or duplicate images were deleted, and some were cropped to focus on the relevant content. After this process, roughly 1,300 images remained. They were then graded on a scale of 1–3 and placed into corresponding folders.

(images graded “1” on the left and “3” on the right)

The reason for this simple grading scale was that we wanted to make the difference between each grade as easily discernible as possible for our model. Given more data, the scale could have been more complex. A score of “1” denoted an average pre-adolescent skill level, “2” an average middle school skill level, and “3” a high school or above skill level.
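With the images sorted into folders named by grade, Keras can infer the labels directly from the directory structure. A brief sketch, assuming a parent folder named `graded_images/` with subfolders `1/`, `2/`, and `3/` (the exact names are my assumption):

```python
# Load the hand-graded images; labels come from the subfolder names.
import tensorflow as tf

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "graded_images",
    validation_split=0.2,
    subset="training",
    seed=42,
    image_size=(224, 224),  # ResNet50's expected input size
    batch_size=32,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "graded_images",
    validation_split=0.2,
    subset="validation",
    seed=42,
    image_size=(224, 224),
    batch_size=32,
)
```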

With the data collected, it was time to start building the model. The avenue we decided to pursue was transfer learning: taking the features learned by a previously trained model and leveraging them on a new problem. This is a powerful tactic when you lack the large-scale data needed to train a neural network from the ground up.

The pretrained model we used was ResNet50, which was trained on millions of images to identify 1,000 classes. In that process it learned features crucial for discerning one class from another, and we repurposed those features for our labeling task.

In creating our new model, we first load the base model (ResNet50) and freeze it so we don’t waste computation retraining its layers. We then add new layers on top, since the model will be trained on a totally different dataset than the one it was originally built for. Layers are also added for regularization to avoid over-fitting.

(code for model on left, summary on right)
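The model code was also shared as a screenshot. A minimal sketch of the transfer-learning setup described above, where the head layer sizes and dropout rate are my assumptions rather than the team’s actual values, could look like this:

```python
# Transfer learning on a frozen ResNet50 base; dense layer width and
# dropout rate below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

# Load ResNet50 pretrained on ImageNet, without its 1000-class head.
base_model = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # freeze: don't retrain the pretrained layers

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = base_model(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)  # regularization against over-fitting
outputs = layers.Dense(3, activation="softmax")(x)  # one output per grade (1-3)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```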

Results

After training, our model achieved an accuracy of ~80%! This is a decent result for the task at hand, though I think it can be improved in future development cycles, both by feeding the model additional data (possibly using a data generator) and by further tuning the model’s layers and parameters.
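To illustrate the data-generator idea, one option (a sketch, not something the write-up says the team implemented) is Keras’s ImageDataGenerator, which synthesizes varied copies of the existing images at training time; the augmentation parameters here are illustrative, not tuned values:

```python
# Augment the graded images on the fly to stretch a small dataset.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,       # small random rotations
    width_shift_range=0.1,   # horizontal jitter
    height_shift_range=0.1,  # vertical jitter
    zoom_range=0.1,
    horizontal_flip=True,
    validation_split=0.2,
)

train_gen = datagen.flow_from_directory(
    "graded_images",         # same assumed folder layout as above
    target_size=(224, 224),
    batch_size=32,
    class_mode="sparse",
    subset="training",
)
```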

The current state of Story Squad’s Data Focused features is as follows:

  • API endpoints are in place for image upload, and each image is screened for inappropriate material via Google Cloud Vision’s SafeSearch method.
  • The text images are additionally screened for explicit words, then passed to the Google Cloud Vision API in order to be transcribed.
  • The data is passed into a formula that returns a “Squad Score” for each transcribed story, based on the following metrics: grade level, story length, average word length, number of quotation marks, number of unique words, number of adjectives, and the percentage of complex words (a sketch of such a formula appears after this list).
  • An image scoring model has been developed to rate each illustration on a 1–3 scale. This system is essentially a rudimentary recreation of how we would rate an illustration ourselves.
  • A clustering algorithm is in place that groups similarly rated stories into groups of four, where the children then distribute points to each story and illustration. Whoever has the highest cumulative score is that week’s victor!
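The post lists the metrics behind the Squad Score but not the formula itself, so the structure and weights in this sketch are placeholders, purely illustrative:

```python
# Hypothetical weighted-sum version of the Squad Score; the real
# formula and weights were not published, so these are placeholders.
METRIC_WEIGHTS = {
    "grade_level": 1.0,
    "story_length": 0.5,
    "avg_word_length": 0.8,
    "quotation_marks": 0.3,
    "unique_words": 0.7,
    "adjectives": 0.4,
    "pct_complex_words": 0.9,
}


def squad_score(metrics: dict) -> float:
    """Combine per-story writing metrics into a single score."""
    return sum(METRIC_WEIGHTS[name] * value
               for name, value in metrics.items()
               if name in METRIC_WEIGHTS)


example = {
    "grade_level": 5.2, "story_length": 320, "avg_word_length": 4.1,
    "quotation_marks": 6, "unique_words": 140, "adjectives": 18,
    "pct_complex_words": 0.12,
}
print(squad_score(example))
```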

Here is a demonstration of the product from a previous development cycle:

The Future of Story Squad, and Lessons Learned

In future development cycles, one of the most important features to complete will be incorporating the illustration score into the clustering algorithm. A challenge in this process will be determining how heavily the illustration score should be weighted, especially since the score isn’t a continuous gradient but whole numbers on a scale. Refactoring the scoring system to use a wider range may aid in implementing this feature.

This project presented the team with many challenges, and building a greenfield feature on an existing codebase has been a wonderful learning experience. Personally, it taught me the importance of establishing a plan that takes into account both the stakeholder’s goals and actionable tasks for everyone on the team. Constant communication was vital, and I look forward to building on the experience gained in this product cycle!
