Practical Usage of ML for a Developer

Alexey Gaynulin
6 min read · Apr 7, 2020

There is a lot of talk and material about Machine Learning (ML), but many developers don’t use it because of a lack of specific examples of how it can serve their purposes. ML is a “deep ocean” of powerful technologies, algorithms and tools; the question is how not to get lost in it and find the practical parts for your project. With this article I’ll try to help a developer get acquainted with ML through one particular use case.

Today I want to share some of my experience with practical ML usage while solving a particular task: clustering cities (Italian cities, in this example).
Who can find it useful: any developer who faces the problem of mapping geo places.

Let’s imagine we have a project with geo data (coordinates) that works with a wide range of suppliers. Suppliers provide their data to the project for the same cities; in my project I have real estate offers from different suppliers. For every offer a supplier provides coordinates and the city of the offer. But sometimes different suppliers provide different coordinates (lat, lng) and even different names for the same city. So I have the problem of mapping such cities (for example, the place Rome from Supplier 1 is the same city as the place District Rome from Supplier 2).
Visually and in our SQL table we have something like this:

Mapping of the same cities from different suppliers.

As you can see, we have two offers (the two green icons on the map). One supplier provides an offer in the city of Pescara, the other in Montesilvano. But as we see, they are effectively in one city, and we want to set a “more accurate” city for these offers: Montesilvano.

Let’s cook it:

1st Step — Export Data from SQL to CSV:

  • We need to make a SQL query with only the fields that we need.
  • We get a lot of data (in this example, Italian cities) and export it to the CSV file init_offers_table.csv (see the sketch below).
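
For illustration, here is a minimal sketch of this step in Python, assuming a MySQL database and an `offers` table with id, city_id, region_id, country_id, lat and lng columns (the table name, column names and connection string are placeholders; adjust them to your schema, or export with any SQL client you prefer):

```python
# Sketch of step 1: select only the fields we need and dump them to CSV.
# Table name, column names and the connection string are assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

query = """
    SELECT id, city_id, region_id, country_id, lat, lng
    FROM offers
"""
offers = pd.read_sql(query, engine)
offers.to_csv('init_offers_table.csv', index=False)
```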

2nd Step — ML Clustering. Prepare Data:

I’ll use a Jupyter Notebook, Python and its ML libraries for the task (Pandas for working with data; Matplotlib for inline graphics; Seaborn for visualizations; scikit-learn for the appropriate ML algorithm).

First of all, we need to prepare the data (by the way, in my opinion proper preparation of the data is the most important part of working with ML algorithms; we need to keep only relevant and significant data). In our Jupyter Notebook we do:

Preparing the data for the ML algorithm.

As you can see, we:

  • import libraries (`%matplotlib inline` is used for inline visualization in a Jupyter Notebook)
  • import MeanShift from sklearn.cluster (it’s a powerful clustering tool; it fits our task because we want to cluster points, in our case cities, that are close to each other)
  • with transFunc and a for loop we simply prepare/clean the data imported from SQL/CSV (a sketch of this cell follows below).
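
The exact preparation cell is shown in the screenshot above; since the screenshot doesn’t transfer to text, here is a rough sketch of what such a cell can look like (transFunc and the column names are my assumptions about the cleaning logic, not the exact code from the notebook):

```python
# Notebook cell sketch: imports, loading the CSV and cleaning the coordinates.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import MeanShift

df = pd.read_csv('init_offers_table.csv')

# transFunc: a hypothetical cleaning helper that coerces a coordinate
# value to float and marks broken values as missing.
def transFunc(value):
    try:
        return float(str(value).replace(',', '.'))
    except ValueError:
        return None

for col in ['lat', 'lng']:
    df[col] = df[col].apply(transFunc)

df = df.dropna(subset=['lat', 'lng'])
df.head()
```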

3rd Step — ML Clustering. Apply Algorithm:

As I mentioned above, we use the MeanShift algorithm for clustering our points.

Applying the MeanShift algorithm for clustering cities.

As you can see:

  • we drop the city_id, region_id, country_id and id columns from our table, as they don’t participate in clustering (we only need the coordinates: lat, lng)
  • we decide to cluster points/cities within a distance of 1–2 km. Of course, some cities are much bigger; there won’t be any problem to unite such “small cities” in our programming language after the ML clustering is done, and we can get the city id/hash and so on from the Google geo API (I’ll write about it below)
  • the last row prepares a Seaborn visualization of our result. Visually I get the following picture (as you can see, it repeats the contour of Italy on the map):
  • the clustered cities will be united by the target field
  • and that is it; we just need to do some preparation to export it back to our SQL database (in my case, via the ml_cities.csv file):
Preparing the clustered data for export back to SQL.
  • Once again, the pictures above are just about preparing the data for SQL export. A code sketch of these steps follows below:
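
Here is a sketch of those cells, continuing from the prepared DataFrame `df` above. The bandwidth value is my assumption: roughly 0.02 degrees of latitude corresponds to about 2 km, which matches the 1–2 km clustering distance mentioned earlier; tune it for your data.

```python
# Keep only the coordinates for clustering.
coords = df.drop(columns=['id', 'city_id', 'region_id', 'country_id'])

# MeanShift groups points that are close to each other; the bandwidth
# roughly controls the cluster radius (in degrees here).
ms = MeanShift(bandwidth=0.02)
ms.fit(coords)

# Every offer gets a cluster label; clustered cities will be united by it.
df['target'] = ms.labels_

# Visual check: the scatter of points should repeat the contour of Italy.
sns.scatterplot(x='lng', y='lat', hue='target', data=df, legend=False)

# Prepare the clustered data for export back to SQL.
df.to_csv('ml_cities.csv', index=False)
```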

4th Step — Export Clustered Data back to SQL:

We export the data we got after the 3rd step to the ml_cities table (one possible way to do it is sketched below).
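
A sketch of the import with pandas and SQLAlchemy; the connection string is a placeholder, and any import tool (LOAD DATA INFILE, a framework seeder, etc.) works just as well:

```python
# Sketch of step 4: load ml_cities.csv back into the database.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

ml_cities = pd.read_csv('ml_cities.csv')
ml_cities.to_sql('ml_cities', engine, if_exists='replace', index=False)
```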

  • After the export to SQL we can check our offers (from Pescara and Montesilvano) to make sure they were clustered into one city (technically, they should have the same target):
Initially separate cities from different suppliers are united into one by target after applying the MeanShift ML algorithm.
  • As you can see, they initially had (and still have) different city_id values, but now they both have the same target, by which we can unite them into one city. That will happen in the next step. A quick check is sketched below:
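
A sanity-check sketch; the connection string and the two city_id values are hypothetical placeholders:

```python
# Sketch: verify that the Pescara and Montesilvano offers share one target.
from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@localhost/mydb')

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT id, city_id, lat, lng, target "
             "FROM ml_cities WHERE city_id IN (:a, :b)"),
        {"a": 111, "b": 222},  # put the real Pescara / Montesilvano city ids here
    )
    for row in rows:
        print(row)
# Both offers should show the same `target` value.
```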

5th Step — Uniting cities:

So, we have cities united by target, and what’s next?

  • First of all, we can set new cities for our offers: create new cities based on the ML results and assign their ids to the relevant offers. Something like this (here I use a Laravel command, but depending on your programming language you can use any script you need, of course):
Update offers with clustered cities (example).
  • Now, for the clustered cities we’ve got (the target values), we can easily find the city they belong to. In my case we can use the Google Geocoding API (since the cities are already united, we don’t need to make a request for each offer to get its city by lat, lng; we just get the city once per cluster that the offers belong to). For example, get the city for Montesilvano (sketched below):
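
A sketch of such a request against the Google Geocoding API; the coordinates below are an approximate cluster centre for Montesilvano, and YOUR_API_KEY is a placeholder:

```python
# Reverse-geocode one cluster centre to get its city name.
import requests

lat, lng = 42.51, 14.14  # approximate centre of the Montesilvano cluster
resp = requests.get(
    'https://maps.googleapis.com/maps/api/geocode/json',
    params={'latlng': f'{lat},{lng}', 'key': 'YOUR_API_KEY'},
)
data = resp.json()

# The 'locality' address component holds the city name.
components = data['results'][0]['address_components'] if data.get('results') else []
city = next((c['long_name'] for c in components if 'locality' in c['types']), None)
print(city)  # e.g. "Montesilvano"
```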

As you can understand, we may get the same city for different targets (clustered cities), and it’s up to you and your programming language how to unite them into one city.

In this article, I just wanted to share my experience of using one of the ML algorithms to solve a specific development / business task. Clustering is not complicated, and of course ML can be used for more complex cases, with more complex algorithms and tools (maybe I will share the rest of my experience with this later), for example extracting tags from a description text. This article just shows how we can simplify some of our routine / complex tasks with a new approach (without implementing a lot of code).

Thanks for your attention.
Happy coding!
