Big Dogs and GEO Distance Queries

Let’s talk about geographical distance and how we can order search results by how close they are to the “intended” location, rather than a “wrong” location that merely happens to be the nearest.

The scenario

The scenario we’ll use is my large dog. He needs a groomer that can handle him, and not all dog groomers are the same.

So when I search, I might find the dog groomer closest to me. However, they might not have the facilities to manage such a large animal.

I’ll show you how we can use GEO Radial decay curves to find me the dog groomer that is closest to me that can handle my dog. (It might not be around the corner, but it will be the closest one for large pooches.)

We need to make sure he is in the right hands.

Lots of dog groomers are close, but are they suitable?

Questions:

  • Do we want to find the best groomer for large dogs? Yes.
  • Do we have the data to support this so that we can give the correct guidance? Let’s check.
  • Do we have a range of queries to support this? Yes.
  • Are we concerned with matching dog groomers to the closest metre? Probably not.
  • How is all this going to work with weightings associated with activity? (machine learning)
  • How is all this going to work with weightings associated with matching terms if entered?

Wow, lots of questions.

First we need some building blocks.

Organic scoring

We have indexed content and we split the documents into fields. Then we match on the fields with different terms and signals and we return the content in the order we desire (either with static, dynamic, or machine learned values).

Performance scoring

We introduce the date decay of the content/fields and we run other function scoring based on things like click-throughs or other pieces of activity such as time viewing.

Algorithmic decay scoring

Finally, we introduce decay signals for geographic regions, geographic ranges, centroids (lat/long coordinates) and a whole range of earthly coordinates/rules (as the crow flies versus arcs etc.).

Total Score

Now, do we sum all this together? Do we average all this out? Do we omit certain signals based on the intent of the query? Do we use TF-IDF, BM25, or other algorithms for our signals?

In this article, let’s focus on one component: the GEO Radial decay (which uses the Algorithmic decay scoring above).

Decay algorithms can focus on a central point or an offset. From there they weaken the importance of a location the further it sits from that point.

We call this a decay function curve.

I do not profess to be a mathematician at all, so apologies if we are at 2,000 feet here. However, the technique I’m presenting has an origin and can include an offset, say 5km, so anything within that range is treated as if it were at the origin.

Outside this range, the score starts to decay.

The scale determines the rate of decay.

We have three different decay curves to choose from:

  • Linear — straight line
  • Exp — decays rapidly then slows down
  • Gauss — bell-shaped — decays slowly, then rapidly, then slows down again

The curve you choose determines how quickly the score falls away as we move further from the origin.
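
To make the three curves concrete, here is a minimal sketch in Python. It assumes the formulas used by Elasticsearch-style decay functions, where a decay parameter (0.5 by default) is the score a document gets at exactly offset + scale from the origin. The function names and the kilometre units are illustrative assumptions, not part of any engine's API.

```python
import math


def _beyond_offset(distance_km: float, offset_km: float) -> float:
    """Distance that still counts once the offset has been used up."""
    return max(0.0, abs(distance_km) - offset_km)


def linear_decay(distance_km, scale_km, offset_km=0.0, decay=0.5):
    # Straight line: reaches `decay` at offset + scale, then carries on down to 0.
    s = scale_km / (1.0 - decay)
    return max(0.0, (s - _beyond_offset(distance_km, offset_km)) / s)


def exp_decay(distance_km, scale_km, offset_km=0.0, decay=0.5):
    # Drops quickly at first, then flattens out.
    lam = math.log(decay) / scale_km
    return math.exp(lam * _beyond_offset(distance_km, offset_km))


def gauss_decay(distance_km, scale_km, offset_km=0.0, decay=0.5):
    # Bell shape: slow at first, then fast, then slow again.
    sigma_sq = -(scale_km ** 2) / (2.0 * math.log(decay))
    return math.exp(-(_beyond_offset(distance_km, offset_km) ** 2) / (2.0 * sigma_sq))


# Offset of 5km and scale of 10km: compare the curves at a few distances.
for d in (0, 5, 10, 15, 25, 40):
    print(d, round(linear_decay(d, 10, 5), 3),
          round(exp_decay(d, 10, 5), 3),
          round(gauss_decay(d, 10, 5), 3))
```

Inside the offset all three curves return 1.0, and at offset + scale (15km here) all three return 0.5. The difference is in how they get there and what happens afterwards.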

Data: the big challenge

So here’s the thing: we must have accurate geocoordinates for all of this to work.

This can be alleviated somewhat with…

GEO Hashing

GEO Hashing or GEO Hash Grids and associated aggregations represent collections of geo coordinates in a group or a bucket. These can be indexed and included with our decay curves as well.

(I’ll cover this in future stories, as the GEO Hashing technique deserves some conversation all on its own.)
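
In the meantime, here is a rough sketch of how a single coordinate becomes a geohash, using the standard public geohash scheme (interleaved longitude/latitude bits, base-32 encoded). The function name and the default precision are my own choices for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a lat/long pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, 6))  # u4pruy
```

Points that share a longer prefix sit in the same, smaller cell, which is what makes geohashes handy as buckets.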

Distance Query

Most search engines and their associated query functions can provide something called a distance query. This query allows you to match fields that have been designated as GEO fields, and are populated with either GEO lat/long coordinates (centroids) or other ranges such as GEOHash coordinates.
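
Under the hood, a distance query boils down to "keep every document within X of this point". A minimal sketch in plain Python, assuming a haversine (arc) distance and a made-up handful of groomers, might look like the following. The document shape and the names are hypothetical, and a real engine would use its own index structures rather than a loop.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle ("arc") distance between two points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_distance(docs, origin, max_m):
    """Keep only the documents whose geolocation is within max_m of origin."""
    lat0, lon0 = origin
    return [
        d for d in docs
        if haversine_m(lat0, lon0, d["geolocation"]["lat"], d["geolocation"]["lon"]) <= max_m
    ]


# Hypothetical sample data, for illustration only.
groomers = [
    {"name": "Groomer A", "geolocation": {"lat": -37.91, "lon": 145.00}},
    {"name": "Groomer B", "geolocation": {"lat": -37.8136, "lon": 144.9631}},
    {"name": "Groomer C", "geolocation": {"lat": -37.93, "lon": 145.01}},
]

nearby = within_distance(groomers, (-37.908031, 144.994354), 5000.0)
print([g["name"] for g in nearby])
```

Note this is retrieval only: a groomer is either in or out. The decay curves are what turn distance into a score.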

At this point we will index all of the geo coordinates for dog groomers in the South East.

Once we have this we can start executing our decay curve function scores in any combination we like.

Note the queries are referencing a field called geolocation that has been previously populated with all the coordinates that we need.

For example:

A distance query @distance:

{"geo_distance":{"distance":"5000.0m","distance_type":"plane","geolocation":{"lat":-37.908031463623047,"lon":144.99435424804688}}}

The above example takes the lat/long coordinate and retrieves everything within 5km of it.

A gauss distance query:

{
  "query": {
    "function_score": {
      "functions": [
        {
          "gauss": {
            "geolocation": {
              "origin": "-37.90803146362305,144.99435424804688",
              "scale": "5km"
            }
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  }
}

The above example looks at the lat/long coordinates within a geo field and applies a gauss curve with a scale of 5km, so that content at the centre of the curve gets the biggest score boost, with the boost decaying as we move outwards.

The advantage of this technique over an overriding boost at query time is that we can keep track of all our scores and manipulate them in detail.

For example: if we want the geo distance of the content to be more important than the organic score from matching keywords or triggers, we can tweak the scale in our gauss curve so that our scores become greater for this component.
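
To see how that tweaking plays out, here is a small sketch that adds an organic score to a gauss geo score, the way a "sum" boost mode would. The organic scores, the distances and the geo_weight multiplier are all made-up assumptions for illustration, and the gauss formula assumes a decay of 0.5 at the scale distance.

```python
import math


def gauss_decay(distance_km, scale_km, decay=0.5):
    sigma_sq = -(scale_km ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance_km ** 2) / (2.0 * sigma_sq))


def total_score(organic, distance_km, scale_km, geo_weight=1.0):
    """Sum mode: organic keyword score plus the (weighted) geo score."""
    return organic + geo_weight * gauss_decay(distance_km, scale_km)


# Hypothetical candidates: one matches the keywords well but is further away.
candidates = [
    {"name": "Strong keyword match, 12km away", "organic": 1.6, "distance_km": 12.0},
    {"name": "Weak keyword match, 1km away", "organic": 1.0, "distance_km": 1.0},
]


def rank(scale_km, geo_weight=1.0):
    return sorted(
        candidates,
        key=lambda c: total_score(c["organic"], c["distance_km"], scale_km, geo_weight),
        reverse=True,
    )


print(rank(scale_km=5)[0]["name"])                  # tight curve: distance dominates
print(rank(scale_km=20)[0]["name"])                 # relaxed curve: keywords win
print(rank(scale_km=20, geo_weight=4)[0]["name"])   # heavier geo component: distance wins again
```

Because every component is a number we can inspect, it is easy to see which signal decided the ordering and to adjust it.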

So back to my dog

Tailoring the search with Radial Decay Curves

We live near Caulfield in South East Melbourne, and the query is “Dog Groomer Large Dogs, Caulfield”.

We also have many dog groomers very close by that pamper little pooches. However, we don’t want to serve them up.

Because of our beloved large pooch, we are far more flexible when it comes to distance. So let’s relax the gauss curve so that being right next to Caulfield counts for less: widen the scale to 20km and add keyword matches, so that suitable dog groomers across the South East come into play.
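
As a sketch of what that relaxed query could look like, the snippet below builds an Elasticsearch-style function_score body in Python. The description field and the keyword text are hypothetical (the article only names the geolocation field), and the origin simply reuses the coordinate from the earlier examples.

```python
import json


def build_groomer_query(origin, scale="20km", keywords="dog groomer large dogs"):
    """Keyword match plus a relaxed gauss curve on the geolocation field."""
    return {
        "query": {
            "function_score": {
                # Hypothetical text field; swap in whatever your index uses.
                "query": {"match": {"description": keywords}},
                "functions": [
                    {"gauss": {"geolocation": {"origin": origin, "scale": scale}}}
                ],
                "score_mode": "sum",
                "boost_mode": "sum",
            }
        }
    }


query = build_groomer_query("-37.90803146362305,144.99435424804688")
print(json.dumps(query, indent=2))
```

The keyword match keeps the little-pooch parlours from crowding the top, and the wider scale stops a groomer 15km away from being scored as if it were on the moon.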

Other Factors to consider

Size of your corpus (number of documents in the index).

Smaller number of documents = greater strength in your curve.

Larger numbers = weaker strength in your curve.

Caveat — this all works really well if you’re matching any fields.

Things to look out for overall when conducting GEO Radial Decay Curve queries

Caveat: if you are searching over a sparse regional area, the number of suburb/town matches might be minimal, so increase your retrieval distance first and then apply the gauss curve scoring.
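
One way to picture this is a loop that keeps doubling the retrieval radius until there are enough candidates, and only then ranks them with the gauss curve. This is a sketch under assumptions: the starting radius, the cap, the minimum hit count and the sample towns are all invented for illustration.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two {"lat", "lon"} dicts."""
    lat1, lon1 = math.radians(a["lat"]), math.radians(a["lon"])
    lat2, lon2 = math.radians(b["lat"]), math.radians(b["lon"])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def gauss_decay(distance_km, scale_km, decay=0.5):
    sigma_sq = -(scale_km ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance_km ** 2) / (2.0 * sigma_sq))


def search_with_widening(docs, origin, start_km=5.0, max_km=80.0, min_hits=3, scale_km=20.0):
    """Double the retrieval radius until min_hits is reached, then rank by gauss score."""
    distances = [(doc, haversine_km(origin, doc["geolocation"])) for doc in docs]
    radius_km = start_km
    hits = [(doc, d) for doc, d in distances if d <= radius_km]
    while len(hits) < min_hits and radius_km < max_km:
        radius_km = min(radius_km * 2.0, max_km)
        hits = [(doc, d) for doc, d in distances if d <= radius_km]
    ranked = sorted(hits, key=lambda hit: gauss_decay(hit[1], scale_km), reverse=True)
    return radius_km, [doc["name"] for doc, _ in ranked]


# Hypothetical sparse region: towns strung out to the south of the origin.
towns = [
    {"name": "Town A", "geolocation": {"lat": -36.02, "lon": 145.0}},
    {"name": "Town B", "geolocation": {"lat": -36.10, "lon": 145.0}},
    {"name": "Town C", "geolocation": {"lat": -36.30, "lon": 145.0}},
    {"name": "Town D", "geolocation": {"lat": -36.90, "lon": 145.0}},
]

radius, names = search_with_widening(towns, {"lat": -36.0, "lon": 145.0})
print(radius, names)
```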

Caveat: don’t assume your centroids are correct. Always check your sources.
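
A few cheap sanity checks catch the most common centroid problems before they reach the index. The sketch below is only an illustration: the bounding box is a rough, hypothetical box around greater Melbourne, and the checks (missing values, out-of-range values, a 0,0 default, swapped lat/long) are a starting point rather than a complete validation.

```python
# Rough, hypothetical box: (min_lat, max_lat, min_lon, max_lon).
MELBOURNE_BBOX = (-38.6, -37.4, 144.3, 145.9)


def _in_bbox(lat, lon, bbox):
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def centroid_problems(doc, bbox=MELBOURNE_BBOX):
    """Return a list of human-readable problems with a document's centroid."""
    loc = doc.get("geolocation") or {}
    lat, lon = loc.get("lat"), loc.get("lon")
    if lat is None or lon is None:
        return ["missing coordinates"]
    problems = []
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        problems.append("coordinates out of range")
    if lat == 0 and lon == 0:
        problems.append("suspicious (0, 0) default")
    if not _in_bbox(lat, lon, bbox):
        if _in_bbox(lon, lat, bbox):
            problems.append("latitude/longitude look swapped")
        else:
            problems.append("outside the expected region")
    return problems


print(centroid_problems({"geolocation": {"lat": -37.908, "lon": 144.994}}))
print(centroid_problems({"geolocation": {"lat": 144.994, "lon": -37.908}}))
```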

The next time you have a look at GEO, have some fun with Radial curves and take the guesswork out.

And we really did save some time as my dog is looking and smelling a million dollars with the right groomer.

One very happy pooch

If you want more details about sorting with GEO, leave me a comment below.
