How can we tell a person’s age from their photos?

Combining the Microsoft Face API and regression models to predict the actual ages of the top 100 actresses on IMDb


Background

With advances in image recognition, we can now quantify facial features for interesting analyses. This short blog post demonstrates how to use R to call the Face API from Microsoft and return facial attributes for age prediction, based on photos of the top 100 actresses scraped from the IMDb website. From a marketing perspective, the technique could also be applied to social data mining when ages are not shown in public profiles, so brands can better understand the age range of their active followers and prospective customers.

Objective

The objective is to obtain the facial attributes of the actresses, including the predicted age, and to evaluate the predicted ages against their actual ages.

Approach

1. Scrape the image links from IMDb

As introduced in the previous post, we can scrape the image links of the top 100 actresses using the rvest package in R. The URL we will be scraping from is shown below.

http://www.imdb.com/list/ls050128191/

2. Send the image links to Face API

First we need to register here for a free trial version of the Face API. The free version allows up to 30,000 calls per month, and 20 images per minute. As a result, we will break down the 100 actresses into 5 batches.

In addition, some images may not be recognized by the Face API, so we may need to remove them before sending the image links in batches.
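The batching described above can be sketched in base R. This is a minimal illustration rather than the post's exact code: `links` is a placeholder vector standing in for the scraped image URLs, and `failed` reuses the six positions that the R code at the end of this post removes.

```r
# Placeholder URLs standing in for the 100 scraped image links (hypothetical values)
links <- sprintf("https://example.com/photo_%03d.jpg", 1:100)

# Positions of photos the API could not process (taken from the R code at the end of this post)
failed <- c(16, 47, 48, 54, 86, 96)
links <- links[-failed]

# Split the remaining links into batches of at most 20, matching the per-minute limit
batch_id <- ceiling(seq_along(links) / 20)
batches <- split(links, batch_id)

# Number of links in each batch
batch_sizes <- lengths(batches)
print(batch_sizes)

# In a real run, each batch would be sent and then followed by a pause, e.g.:
# for (b in batches) { ...send each URL in b...; Sys.sleep(60) }
```

Splitting by a batch index keeps the rate limit in one place, so changing the batch size only means changing the divisor.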

We will be able to retrieve the data as a dataframe, with 65 features.
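To see where the wide table comes from, here is a small sketch of how a nested response can flatten into dotted column names. The `face` list is a hand-written, heavily shortened stand-in for one parsed API response (the values are made up and most fields are omitted); only the flattening behaviour of `as.data.frame()` is the point.

```r
# Hand-written stand-in for one parsed response (made-up values, most fields omitted)
face <- list(
  faceId = "hypothetical-id",
  faceRectangle = list(top = 10, left = 20, width = 100, height = 100),
  faceAttributes = list(
    age = 30.5,
    gender = "female",
    smile = 0.9,
    facialHair = list(moustache = 0, beard = 0, sideburns = 0)
  )
)

# as.data.frame() on a nested list yields one row, with nested names joined by dots
row <- as.data.frame(face)
print(names(row))
```

Each landmark coordinate and attribute becomes its own column in the same way, which is how a single face can yield dozens of features.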

Screenshot of the facial attributes returned from Face API

We will also need to add the names of the actresses to the dataframe.

3. Visualize the facial attributes

Here we can do a simple visualization by transforming all the facial attributes into a 2D plot to quickly evaluate the result, via the t-SNE algorithm.

Actresses who appear close together on the plot have similar overall facial features. As for the color labels, the 1st tier contains the actresses ranked 1–33, the 2nd tier those ranked 34–66, and the 3rd tier those ranked 67–100. However, no clear pattern was observed, as the points are scattered randomly.
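The tier labels can be derived from the list position. The sketch below is one possible way to do it with `cut()`; the variable names are illustrative, and it is an assumption that the tiers used for the plot were built like this.

```r
# List positions 1..100 on the IMDb ranking
list_rank <- 1:100

# Breaks at 0, 33, 66 and 100 give the intervals (0,33], (33,66] and (66,100]
tier <- cut(list_rank,
            breaks = c(0, 33, 66, 100),
            labels = c("1st tier", "2nd tier", "3rd tier"))

# Count how many actresses fall in each tier
tier_counts <- table(tier)
print(tier_counts)
```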

Screenshot of the visualization on facial features of top 100 actresses

4. Scrape the actual age information

Next we will scrape the actual ages of the actresses from the FamousFix website, since the information is not listed on IMDb. The base URL is

http://www.famousfix.com/topic/

We will also need to set a browser user-agent string in order to scrape the content.

Screenshot of the names and actual ages added

5. Evaluate residuals between predicted age and actual age

Screenshot of the residuals plot

We can observe that the predicted ages are younger than the actual ages of the actresses, resulting in positive residuals (actual age minus predicted age) ranging between 0 and 20 years. Knowing that the age predictions are off, we will create different models and stack them together to adjust the predictions.
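The residual calculation itself is simple and can be sketched as follows. The numbers are simulated purely for illustration (predictions that run about ten years young); they are not the results from this analysis.

```r
set.seed(42)

# Simulated actual ages (made-up data, not the actresses' ages)
actual <- round(runif(94, 25, 60))

# Simulated predictions that run young by roughly 10 years
predicted <- actual - 10 + rnorm(94, sd = 4)

# Residual = actual age minus predicted age, so under-prediction shows up as positive
residual <- actual - predicted

mean_bias <- mean(residual)            # average under-prediction in years
rmse <- sqrt(mean(residual^2))         # typical size of the error
share_positive <- mean(residual > 0)   # fraction of cases predicted too young

print(c(mean_bias = mean_bias, rmse = rmse, share_positive = share_positive))
```

A mean residual far from zero signals a systematic bias, which is the kind of error a second-stage model can correct.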

Models

First we will perform PCA to remove collinearity among the features. It can be observed that the first 3 principal components explain 97% of the variance in the data.
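As a sketch of how the proportion of variance explained is read off a PCA, the example below uses simulated data in which three hidden factors drive ten collinear columns. The data and the 97% threshold are assumptions chosen to mirror the description above, not the actual facial attributes.

```r
set.seed(1)
n <- 94

# Three hidden factors drive ten observed columns, so the columns are strongly collinear
latent <- matrix(rnorm(n * 3), n, 3)
loadings <- matrix(rnorm(3 * 10), 3, 10)
X <- latent %*% loadings + matrix(rnorm(n * 10, sd = 0.1), n, 10)

# PCA on standardized columns
pr <- prcomp(X, scale. = TRUE)

# Proportion of variance explained by each component, and the running total
pve <- pr$sdev^2 / sum(pr$sdev^2)
cum_pve <- cumsum(pve)

# Smallest number of components whose cumulative share reaches 97%
n_keep <- which(cum_pve >= 0.97)[1]
scores <- pr$x[, 1:n_keep, drop = FALSE]

# Component scores are uncorrelated with each other, which is what removes collinearity
max_offdiag <- max(abs(cor(pr$x)[upper.tri(diag(ncol(X)))]))

print(round(cum_pve, 3))
print(n_keep)
```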

Screenshot of the PCA result

Linear Regression

Both the R-squared and the adjusted R-squared are low, suggesting that the relationship between age and facial features may be non-linear.

Screenshot of the results from linear regression

Polynomial Regression

With polynomial regression, we are able to obtain a better result.
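To illustrate why a polynomial fit can outperform a straight line, the sketch below compares the two on simulated data with a deliberately curved relationship. The data are made up; only the comparison technique (R-squared, adjusted R-squared, and a nested-model F-test) carries over.

```r
set.seed(7)

# Made-up predictor and a curved response with noise
x <- runif(94, -2, 2)
y <- 40 + 3 * x - 4 * x^2 + rnorm(94, sd = 2)
d <- data.frame(x = x, y = y)

# Straight-line fit versus a degree-5 orthogonal polynomial fit
fit_lin <- lm(y ~ x, data = d)
fit_poly <- lm(y ~ poly(x, 5), data = d)

r2 <- c(linear = summary(fit_lin)$r.squared,
        poly = summary(fit_poly)$r.squared)
adj_r2 <- c(linear = summary(fit_lin)$adj.r.squared,
            poly = summary(fit_poly)$adj.r.squared)
print(round(rbind(r2, adj_r2), 3))

# F-test for nested models: does the extra flexibility fit significantly better?
print(anova(fit_lin, fit_poly))
```

The adjusted R-squared is the fairer comparison here because it penalizes the extra polynomial terms.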

Screenshot of the polynomial regression performance

We will try improving the performance by removing a few outliers. However, by removing data points, we could also lose valuable information.

Screenshot of the polynomial regression performance after removing outliers

Model Stacking

Here we will treat the returned ages from Face API as predictions from model 1 and the polynomial regression as model 2. We will use predictions from the models as input to fit a new model.

Though the R-squared is still low at 0.47, the residuals between predicted age and actual age have shrunk, and the positive and negative values are now more evenly distributed, as shown below.
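The stacking idea can be sketched on simulated data. Here `m1` stands in for the age returned by the API and `m2` for the polynomial regression predictions; both are made-up numbers, and the meta-model uses plain linear terms rather than the higher-degree polynomials in the script at the end of this post.

```r
set.seed(123)
n <- 94

# Simulated actual ages (made-up data)
age <- round(runif(n, 25, 60))

# Model 1: stand-in for the API's estimate, biased young
m1 <- age - 8 + rnorm(n, sd = 5)

# Model 2: stand-in for the polynomial regression, unbiased but noisier
m2 <- age + rnorm(n, sd = 7)

d <- data.frame(age, m1, m2)

# Meta-model: regress the actual age on the two base predictions
stack <- lm(age ~ m1 + m2, data = d)
d$stacked <- predict(stack, d)

# Root mean squared error of each set of predictions
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
scores <- c(m1 = rmse(d$age, d$m1),
            m2 = rmse(d$age, d$m2),
            stacked = rmse(d$age, d$stacked))
print(round(scores, 2))
```

These errors are measured on the same data used to fit the meta-model, so they are optimistic; evaluating on held-out data or with cross-validation would give a more honest estimate.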

Screenshot of the residuals plot after stacking

That wraps up this blog post. The R-squared is still too low to explain all the variance in our data, which could result in poor individual predictions, but we have successfully reduced the overall residuals of the predictions.

As a result, from a marketing perspective, we could apply the technique to learn more about the average age range of followers and customers of particular brands on social platforms.

R Code
library(httr)
library(rvest)
library(XML)
library(ggplot2)
library(ggrepel)
library(Rtsne)
library(tidyr)
library(readr)
setwd("C:/Users/jamesjy.chen/Desktop/face0111")
#Scrape names and photos of the top 100 actresses
top100actresses = 'http://www.imdb.com/list/ls050128191/'
output = read_html(top100actresses)
images = html_nodes(output, '.zero-z-index')
imglinks = html_nodes(output, xpath = "//img[@class='zero-z-index']/@src") %>% html_text()
imgalts = html_nodes(output, xpath = "//img[@class='zero-z-index']/@alt") %>% html_text()
#Remove 16, 47, 48, 54, 86, 96 (Face API failed to recognize these photos)
imglinks <- imglinks[-c(16,47,48,54,86,96)]
imgalts <- imgalts[-c(16,47,48,54,86,96)]
#Call Face API
faceURL = "https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=true&returnFaceAttributes=age,gender,smile,facialHair"
faceKEY = '#enter_your_key_here'
mybody = list(url = imglinks[1])

face = POST(
url = faceURL,
content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
body = mybody,
encode = 'json'
)

face1 = httr::content(face)[[1]]
face1$name = imgalts[1]
f <- as.data.frame(face1)
#Loop over the remaining photos, pausing one minute before each new batch of 20
#to stay within the free-tier limit of 20 images per minute
for (i in 2:94){
  if (i %% 20 == 1) Sys.sleep(60)
  mybody = list(url = imglinks[i])

  face = POST(
    url = faceURL,
    content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
    body = mybody,
    encode = 'json')

  face1 = httr::content(face)[[1]]
  face1$name = imgalts[i]
  f1 <- as.data.frame(face1)
  f <- rbind(f,f1)
}
f$name <- gsub("Image of ","",f$name)
write.csv(f,"f.csv")
f <- read.csv("f.csv")
f$X <- NULL
mydata <- f
mydata$faceId <- NULL
mydata$name <- NULL
mydata$faceAttributes.gender <- NULL
set.seed(1)
#Visualize the facial attributes
tsne <- Rtsne(mydata, dims = 2, verbose=TRUE, max_iter = 500)
t = as.data.frame(tsne$Y)
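a <- merge(f,t,by=0)
#a.csv is assumed to be the merged table above saved to disk with a Rating (tier) column added; that step is not shown in this script
a <- read.csv("a.csv",stringsAsFactors = FALSE)
a$X <- NULL
p <- ggplot(a,aes(V1, V2,label=name)) + geom_point() + geom_label_repel(aes(fill = Rating), color = 'white',size = 3)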
p
#Scrape actual ages of actresses
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
baseurl = "http://www.famousfix.com/topic/"
#Adjust names with special characters
a[25,67] <- "A J Cook"
a[33,67] <- "Beyonce Knowles"
for (i in 1:94){
searchurl=gsub(" ","-",a$name[i])
url = paste(baseurl,searchurl,sep="")
session <- html_session(url, user_agent(uastring))
a$age[i] <- session %>%
read_html() %>%
html_nodes(".pb6:nth-child(1) .blue") %>%
html_text() %>%
as.numeric()
}
write.csv(a,"b.csv")
b <- read.csv("b.csv")
b$X <- NULL
#Create residuals vs. fitted plot
plot(b$faceAttributes.age,b$age-b$faceAttributes.age,ylab="Residuals",xlab="Predicted Age")
abline(h=0,v=0,col="red")
#Conduct PCA on facial attributes
PCA <- b
PCA$Row.names <- NULL
PCA$faceId <- NULL
PCA$faceAttributes.gender <- NULL
PCA$faceAttributes.age <- NULL
PCA$name <- NULL
PCA$V1 <- NULL
PCA$V2 <- NULL
PCA$age <- NULL
PCA$faceAttributes.facialHair.moustache <- NULL
PCA$faceAttributes.facialHair.beard <- NULL
PCA$faceAttributes.facialHair.sideburns <- NULL
pr.out=prcomp(PCA, scale=TRUE)
pr.out$sdev
pr.var=pr.out$sdev ^2
pve=pr.var/sum(pr.var)
pve
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained ", ylim=c(0,1),type="b")
plot(cumsum(pve), xlab="Principal Component ", ylab=" Cumulative Proportion of Variance Explained ", ylim=c(0,1), type="b")
PCA2 <- scale(PCA, pr.out$center, pr.out$scale) %*% pr.out$rotation
pcadata <- merge(PCA2,b,by=0)
pcadata[,5:60] <- NULL
#Fit linear regression model
fit <- lm(age~PC1+PC2+PC3+V1+V2,data=pcadata)
summary(fit)
plot(fit)
#Fit polynomial regression model with outliers removed
pcadata2 <- pcadata[-c(43,87,93,21,7,47,61,4,48,8,40,10,53,78,38,51,85,6,16,52,17,1,67,19),]
fit1 <- lm(age~poly(PC1,5)+poly(PC2,5)+poly(PC3,5)+poly(V1,5)+poly(V2,5),data=pcadata2)
summary(fit1)
plot(fit1)
pcadata$M1 <- predict(fit1,pcadata)
#Model stacking
fit2 <- lm(age~poly(faceAttributes.age,10)+poly(M1,10),data=pcadata)
pcadata$M2 <- predict(fit2,pcadata)
plot(pcadata$M2,pcadata$age-pcadata$M2,ylab="Residuals",xlab="Predicted Age")
abline(h=0,v=0,col="red")
summary(fit2)
plot(fit2)

Questions, comments, or concerns?
jchen6912@gmail.com


Written by James Chen. Edited by Christian Poutge.

