# If you see these words, bid high

Christie’s Auction House is a landmark institution in New York City and in influential capital cities around the world. Every few weeks, in the run-up to one of its famous auctions, you can waltz right in and see museum-quality artwork enjoying a brief public interlude before it disappears back into private hands. Only a few weeks ago, I was in the midst of Lichtensteins, Chagalls, and Basquiats on their two-week vacation in the public eye. In these halls, listen carefully and you can hear the whispers of brokers advising their wealthy clients, advice that more often than not contains an evaluation of the direction of the art market. But how good are these predictions? Can we do better with data?

One thing we love to do at Gradient is to push our skills with novel, underutilized datasets. A few weeks ago, we got the bright idea of applying statistical techniques to an unlikely subject for quantitative analysis: fine art. The proof-of-concept project outlined below presented an opportunity to exercise a few core competencies of intense interest to our clients: assembling bespoke datasets through web scraping and building statistical models with unstructured or semi-structured data. From this proof of concept, we also uncovered a useful case study in the value of regularization.

**Assembling the Dataset**

Christie’s conveniently hosts an online database of the results of their auctions. A typical page looks something like this:

With some inspection of the page’s source code, we can see how the data is organized:

With tools like R’s rvest, we can automate the scraping of data from auctions and specific lots (sales) and begin to assemble a massive dataset through an automated procedure. Gone are the days of copy-and-paste and manual data entry. For each sale, we collected the following data:

- The artist
- The title of the work
- The realized price
- The estimated pre-sale price range
- An essay describing the work
- And details on the work’s provenance
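The scraping itself was done with R’s rvest; as a rough illustration of the same idea, here is a minimal Python sketch using only the standard library. The HTML snippet and its class names (`lot-title`, `price-realised`, `estimate`) are invented for this example and will not match the real Christie’s pages, so treat it as a pattern rather than a working scraper.

```python
from html.parser import HTMLParser

# Hypothetical lot-page markup: the class names are invented for illustration.
SAMPLE_HTML = """
<div class="lot">
  <span class="lot-title">A SMALL GREYISH-GREEN JADE 'BUFFALO'</span>
  <span class="price-realised">USD 15,000</span>
  <span class="estimate">USD 4,000 - USD 6,000</span>
</div>
"""

# Map (hypothetical) CSS classes to the field names used in our dataset.
FIELDS = {"lot-title": "title", "price-realised": "price_realised", "estimate": "estimate"}


class LotParser(HTMLParser):
    """Collects the text of every element whose class is listed in FIELDS."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        self.current = FIELDS.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self.current and data.strip():
            self.record[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


def to_int(text):
    """Keep only the digits, e.g. 'USD 15,000' becomes 15000."""
    return int("".join(ch for ch in text if ch.isdigit()))


def parse_lot(html):
    parser = LotParser()
    parser.feed(html)
    rec = parser.record
    low, high = rec["estimate"].split(" - ")
    return {
        "title": rec["title"],
        "price_realised": to_int(rec["price_realised"]),
        "low_est": to_int(low),
        "high_est": to_int(high),
    }


lot = parse_lot(SAMPLE_HTML)
```

In a real run, the HTML would come from fetching each lot page in a loop, and fields that are missing on a page would need explicit handling.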

As is typical for projects like these, the data is extraordinarily messy — with many fields missing for many entries. A typical entry looks something like this:

$ lot : chr "802"

$ title : chr "A SMALL GREYISH-GREEN JADE 'BUFFALO’"

$ subtitle : chr "LATE SHANG-EARLY WESTERN ZHOU DYNASTY, 12TH-10TH CENTURY BC"

$ price_realised: int 15000

$ low_est : int 4000

$ high_est : int 6000

$ description : chr "A SMALL GREYISH-GREEN JADE 'BUFFALO’\r\nLATE SHANG-EARLY WESTERN ZHOU DYNASTY, 12TH-10TH CENTURY BC\r\nPossibly a necklace clos"| __truncated__

$ essay : chr "Compare the similar jade water buffalo carved in flat relief and dated to the Shang dynasty in the Mrs. Edward Sonnenschein Col"| __truncated__

$ details : chr "Provenance\r\n The Erwin Harris Collection, Miami, Florida, by 1995."

$ saleid : chr "12176"

We downloaded every lot sale from 2017 — a set of 11,577 observations.

**Setting up the model**

To start with, we needed to simplify the dataset into a target vector and a set of predictors. We are interested in predicting the actual price, but what price exactly? Since Christie’s supplies an estimated range, we decided we needed to “back out” the information already contained in the estimate. To control for the effect of scale, we used the ratio of the actual price to the upper bound of the estimated range. Since the ratio was not normally distributed, we applied a Box-Cox transformation to this vector to bring it closer to a normal distribution.
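To make the target concrete, here is a small Python sketch of the ratio and the Box-Cox transform, using the jade buffalo lot from the example entry. The value of `lam` is a placeholder chosen for illustration; the post does not report the lambda that was actually used, and in practice it would be estimated from the data.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)          # limiting case as lambda approaches 0
    return (y ** lam - 1) / lam


# Values taken from the example lot shown earlier in the post.
price_realised = 15000
high_est = 6000

# The lot sold for 2.5 times the top of its estimated range.
ratio = price_realised / high_est

# lam=0.5 is an assumed, illustrative value, not the one used in the project.
transformed = box_cox(ratio, lam=0.5)
```

A ratio above 1 means the lot beat the high estimate; the transform only reshapes the distribution of those ratios and preserves their ordering.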

For predictors, we decided to use the abundance of text contained in the dataset. To add some structure to the text, we tokenized the “bag of words” for each sale item, and included only words that were used between 50 and 200 times. This type of dataset is standard in text mining approaches and is called a document-term matrix, where each “document” is a row, and each possible “term” — typically, a stemmed word — is a column, with the number of times that term appears in a given document in the respective cell.
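As an illustration of the structure described above, here is a minimal Python sketch that builds a document-term matrix with a frequency filter. It makes a few simplifying assumptions: tokens are lowercase runs of letters, there is no stemming, the thresholds apply to total counts across all documents, and the toy thresholds (2 to 3) stand in for the 50 to 200 used on the real data.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs, min_count, max_count):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    tokenized = [tokenize(d) for d in docs]
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Keep only terms whose total count falls inside the allowed band.
    vocab = sorted(t for t, c in totals.items() if min_count <= c <= max_count)
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


# Toy descriptions, invented for illustration.
docs = [
    "A small jade buffalo",
    "A jade vase with a small chip",
    "Watercolour of a buffalo",
]
vocab, dtm = document_term_matrix(docs, min_count=2, max_count=3)
```

Here the very common word “a” (four uses) and the one-off words are both dropped, which is the point of filtering on both ends: terms that are too rare cannot be estimated, and terms that are too common carry little information.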

**A naïve approach**

Our first model was a simple linear regression with the document-term matrix as the set of predictors and our Box-Cox-transformed price-to-estimate ratio as our target vector. What did we get? A data scientist’s dream come true!

Residual standard error: 0.09126 on 8755 degrees of freedom

Multiple R-squared: 0.967, Adjusted R-squared: 0.9613

F-statistic: 168.4 on 1525 and 8755 DF, p-value: < 2.2e-16

You see that R-squared? **0.96!!**

Any analyst worth their salt will raise an eyebrow at this result, and an inspection of the diagnostic plots exposes more cracks in this model:

*The model underestimates low ratios:*

*And residuals do not follow a normal distribution:*

And let’s come back to that 0.96 R-squared! This is obviously a case of overfitting — in real life we should not expect to be able to predict the actual price of a sale with that kind of accuracy. If we tested this model on data that we did not use to train the model, would we really expect to get it *this* right?
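A quick arithmetic check on the regression summary shows how little room this model had to generalize. The sketch below uses only the numbers printed in the output and assumes the fit included an intercept, so the number of rows used is the residual degrees of freedom plus the number of predictors plus one.

```python
# Numbers copied from the regression summary above.
r_squared = 0.967
n_predictors = 1525      # numerator degrees of freedom of the F-statistic
resid_df = 8755          # residual degrees of freedom

# Assumes an intercept was fitted: n = residual DF + predictors + 1.
n_obs = resid_df + n_predictors + 1

# Standard adjusted R-squared and overall F-statistic formulas.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / resid_df
f_stat = (r_squared / n_predictors) / ((1 - r_squared) / resid_df)

# How many observations support each estimated coefficient.
obs_per_predictor = n_obs / n_predictors
```

These formulas reproduce the reported adjusted R-squared and come close to the reported F-statistic (the small gap comes from the rounded R-squared). They also show that the fit used 10,281 rows, fewer than 7 observations per predictor. The adjusted R-squared barely moves, which is a reminder that it is not a substitute for testing on held-out data.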

In addition, this kind of naïve model gives us almost no insight into what words are significant predictors of actual sale price. With 1,526 predictors, we’d have a lot of data to sort through even **after** we’ve run the analysis.

What’s the solution to all of these issues? Regularization!

**A more sophisticated model — L1-regularized regression with cross validation**

We **love** the L1-norm penalty (the LASSO) for regularizing our regression models. In addition to helping the model generalize to data it has not yet seen, it also helps make sense of very “wide” datasets, like those with over 1,500 predictors, by shining a light on only those predictors that carry real signal. By imposing an extra penalty on the absolute size of the coefficients, it zeros out coefficients that contribute little and shrinks those that remain non-zero.
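For reference, one common textbook way to write the LASSO objective is shown below. The post does not say which software or scaling convention was used, so the $1/(2n)$ factor is an assumption. Here $y_i$ is the transformed price-to-estimate ratio for lot $i$, $x_{ij}$ is the count of term $j$ in that lot's text, and $\lambda \ge 0$ is the penalty strength.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

With $\lambda = 0$ this is ordinary least squares; as $\lambda$ grows, the absolute-value penalty pushes more coefficients to exactly zero, which is what makes the method useful for variable selection.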

Using cross-validation, which repeatedly trains a model on a sample of the dataset and tests it on the held-out portion, we can tune the penalty so that the regression keeps the combination of predictors that performs best on data it was not trained on.
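To show the mechanics, here is a self-contained Python sketch of an L1-penalized regression fitted by coordinate descent, with k-fold cross-validation over a small grid of penalties. It is not the code used for the project: the data are synthetic, there is no intercept (the toy data are roughly centered), and choosing the penalty with the lowest mean held-out squared error is an assumption about the selection rule.

```python
import random


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1, no intercept."""
    n = len(X)
    cols = list(zip(*X))                              # column-wise view of X
    z = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * len(cols)
    resid = list(y)                                   # y - X b, with b = 0
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            if z[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores beta[j].
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new_b = soft_threshold(rho, lam) / z[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for v, r in zip(col, resid)]
                beta[j] = new_b
    return beta


def mse(X, y, beta):
    preds = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def cv_lasso(X, y, lambdas, k=5):
    """Return the penalty with the lowest mean held-out MSE, plus all scores."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    scores = {}
    for lam in lambdas:
        errs = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(n) if i not in held]
            beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
            errs.append(mse([X[i] for i in held_out], [y[i] for i in held_out], beta))
        scores[lam] = sum(errs) / k
    return min(scores, key=scores.get), scores


# Synthetic data: only the first 2 of 10 predictors actually matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(100)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

best_lam, scores = cv_lasso(X, y, lambdas=[0.01, 0.05, 0.1, 0.5, 1.0])
beta = lasso_fit(X, y, best_lam)
```

Refitting with a much larger penalty, for example `lasso_fit(X, y, 10.0)`, drives every coefficient to exactly zero, which is the same shrinkage behavior a coefficient-path plot displays.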

Although this is a busy graph (we did have roughly 1,500 predictors, after all), it shows that as we increase the penalty (lambda), more and more coefficients shrink and ultimately become zero. At the value of lambda that we selected through cross-validation, there are roughly 40 words that actually have some predictive power. Some predict a realized price that is higher relative to the estimate, and some predict a lower one.

*Positive indicators. If you see these words in the description, bid high.*

*Negative indicators*

Oh, and what was our R-squared? Nothing close to 0.96! This model had a pseudo R-squared of around 0.14. Smaller? Yes, but certainly more reflective of the predictive power of the words in the objects’ descriptions.

So, is 0.14 good or bad? Depends on your perspective! In financial markets, an R-squared even slightly above zero has huge value, as any kind of edge can make you millions. This kind of model probably could not be used to reliably inform purchase decisions at auctions, but it certainly raises good questions for the astute art observer: why does the word “warehouse” in the description predict a higher-than-estimated value? Are watercolours disrespected by the experts but then preferred by buyers? Any discerning buyer should be asking these questions.

**One step further — image tagging with computer vision**

In addition to price data and a text description of the item, we wanted to see how helpful the photos of each item could be in predicting sales price. We have been wanting to try the Microsoft Azure Computer Vision API, so we sent every image file to the service and it returned a set of tags for each photo. The most popular tags are shown below:

We then built a number of binary classifiers to predict whether or not the object exceeded the high end of its predicted price range. There were 365 tags that appeared for at least two images — these tags were used as the predictors in building a classifier.

We divided items into 2 groups:

- “1”: received **higher** than the estimated price
- “0”: received **lower** than the estimated price
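The labeling and the tag filter can be sketched in a few lines of Python. The records below are hypothetical: the prices for lot 802 come from the example entry earlier in the post, but the other lots and all of the tags are invented. The rule used here (label 1 only when the realized price is strictly above the high estimate) is one reading of the grouping above, not a confirmed detail.

```python
from collections import Counter

# Hypothetical records; the tags are invented for illustration.
lots = [
    {"lot": "802", "price_realised": 15000, "high_est": 6000, "tags": ["indoor", "small"]},
    {"lot": "A", "price_realised": 3000, "high_est": 5000, "tags": ["indoor", "vase"]},
    {"lot": "B", "price_realised": 9000, "high_est": 8000, "tags": ["small", "painting"]},
]


def label(lot):
    """1 if the lot sold above its high estimate, otherwise 0 (assumed rule)."""
    return 1 if lot["price_realised"] > lot["high_est"] else 0


def tag_features(lots, min_images=2):
    """Binary tag indicators, keeping tags seen on at least min_images lots."""
    image_counts = Counter(tag for lot in lots for tag in set(lot["tags"]))
    kept = sorted(t for t, c in image_counts.items() if c >= min_images)
    rows = [[1 if t in lot["tags"] else 0 for t in kept] for lot in lots]
    return kept, rows


labels = [label(lot) for lot in lots]
kept_tags, features = tag_features(lots)
```

The resulting `features` and `labels` are the kind of inputs a binary classifier would be trained on.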

We plotted the most popular tags across groups. We were surprised that the counts for the group that underperformed were so low, and that some fairly common words, like “vase”, “small”, and “group”, appeared only in group 1. (A puzzle, for sure!)

We built three types of modern classifiers on the full set of tags: a Random Forest, a Neural Network, and a Gradient Boosted Tree. None of them performed spectacularly (the max AUC was 0.5744), but we were surprised that there was any signal in this data at all! We would have thought that all of this data would have been completely incorporated by the specialists at Christie’s in their estimate. Here are the three respective ROC curves detailing the performance of the classifiers:
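For readers less familiar with AUC: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 0.5 is coin-flipping and 1.0 is perfect ranking. The Python sketch below computes it directly from that definition; the labels and scores are toy values, not output from our classifiers.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise version is quadratic
    in the number of examples, which is fine for a small illustration.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 lots that beat the estimate (1) and 3 that did not (0).
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
```

Read this way, an AUC of 0.5744 means the best classifier ranked an over-performing lot above an under-performing one about 57% of the time: a weak signal, but not nothing.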

**Conclusion**

So much more could be done to improve these models. Certainly we could go much further than the works sold in 2017. We could look for lots sold in more than one auction and build a model to account for changes in price over time. We could build a regression to isolate the performance of certain auction locations — like New York, London, and Hong Kong, or test performance measures between live and online auctions.

As a short internal project, this was really fun, and shows what the Gradient team can do with a novel source of data in a few days. Like what you see? Get in touch.