Geo Exploration with Elasticsearch and Kibana

Li-Ting Liao
Published in Dev Diaries
Aug 9, 2020 · 7 min read

Taiwan River Pollution Status Data Visualization with Elasticsearch and Kibana

Goal

I’m always fascinated by data analytics tools like the Elastic Stack, Splunk, etc. This time I will use two components of the well-known Elastic Stack, Elasticsearch and Kibana, to showcase how we can use Elasticsearch’s fast storage and search features and then use Kibana to give shape to our data.

I will create a Kibana dashboard to present river pollution status in Taiwan. The dashboard will cover:

  1. Overall data: 117,883 rows, covering 68 rivers and over 300 monitoring sites, from Jan 2018 to Jun 2020.
  2. Selected indices for each river. The most important index here is the RPI (River Pollution Index, 河川污染指數), which is composed of several sub-indices: DO (dissolved oxygen, 水中溶氧量), BOD5 (biochemical oxygen demand, 生化需氧量), SS (suspended solids, 懸浮固體), and NH3-N (ammonia nitrogen, 氨氮). We can see how a river’s RPI changes over time, and can also look at each related sub-index in the line chart.
  3. Monitoring sites along a river, shown on a map.
demo

The dataset I’m going to use comes from the open-data website of Taiwan’s EPA. I’m using the river pollution monitoring datasets for this demo. They can be found here:

https://data.epa.gov.tw/en/dataset/wqx_p_01

Column names of the dataset:

  • SiteId 測站代碼
  • SiteName 測站名稱
  • SiteEngName 測站英文名稱
  • County 縣市
  • Township 鄉鎮市區
  • Basin 流域名稱
  • River 河川名稱
  • TWD97Lon TWD97經度
  • TWD97Lat TWD97緯度
  • TWD97TM2X TWD97二度分帶X
  • TWD97TM2Y TWD97二度分帶Y
  • SampleDate 採樣日期
  • ItemName 測項名稱:酸鹼值、水溫、河川污染分類指標、大腸桿菌群、溶氧飽和度、氯鹽、氨氮、化學需氧量、生化需氧量、溶氧(電極法)、懸浮固體、氣溫、導電度
  • ItemEngName 測項英文名稱:pH、Water Temperature、River Pollution Index、Coliform Group、Dissolved Oxygen Saturation、Chloride、NH3-N、Chemical Oxygen Demand、Biochemical Oxygen Demand、Dissolved Oxygen、Suspended Solid、Temperature、Conductivity
  • ItemEngAbbreviation 測項英文簡稱:pH、WT、RPI、Coliform、DOS、Cl、NH3-N、COD、BOD5、DO、SS、TMP、EC
  • ItemValue 監測值
  • ItemUnit 測項單位

So, let’s get started.

Process

  1. Keep the required columns in the dataset, add new columns, and transform the data types of several existing columns with Python 3.8

Read the CSV data into Jupyter Notebook as data frame:
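The notebook cell here was a screenshot, so as a minimal sketch of this step (the real file name isn’t shown in the article, and a small inline sample stands in for the actual CSV):

```python
import io
import pandas as pd

# In the real notebook this would be e.g. pd.read_csv("<downloaded file>.csv").
# An inline sample with a few of the dataset's columns stands in here.
csv_sample = io.StringIO(
    "SiteName,River,SampleDate,ItemName,ItemValue\n"
    "大直橋,基隆河,2019-03-10,河川污染分類指標,4.5\n"
    "中正橋,淡水河,20180115,河川污染分類指標,2.25\n"
)
# Read dates and readings as strings; both get cleaned up in later steps.
data = pd.read_csv(csv_sample, dtype=str)
print(data.shape)  # (2, 5)
```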

In the column “SampleDate”, some of the date formats are inconsistent, so I used the code below to take care of that:

data['SampleDate_New'] = [row.replace('-', '') for row in data['SampleDate']]

Now there’s a new column called “SampleDate_New”; next, sort the dataset by it:
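The sorting cell was also a screenshot; a minimal sketch of it, using a toy frame with the same column names:

```python
import pandas as pd

data = pd.DataFrame({"SampleDate": ["2019-03-10", "2018-01-15", "2020-06-01"]})
# Normalize as in the previous step: strip dashes so every date reads "yyyyMMdd".
data["SampleDate_New"] = [row.replace("-", "") for row in data["SampleDate"]]
# "yyyyMMdd" strings sort chronologically, so a plain string sort is enough.
data = data.sort_values(by="SampleDate_New").reset_index(drop=True)
print(data["SampleDate_New"].tolist())  # ['20180115', '20190310', '20200601']
```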

Since this dataset originally contained data going back to 2012 and totaled around 300K rows, I decided to only include data from Jan 2018 onwards, which is around 100K rows in total:

data_20182020 = data.loc[(data['SampleDate_New'] >= "20180000") & (data['SampleDate_New'] <= "20210000")]

I also dropped several unused columns:

data_20182020 = data_20182020.drop(columns=['TWD97TM2X', 'TWD97TM2Y', 'SampleDate'])

Next, I noticed that the column “ItemValue”, which contains the actual numeric readings of the chemical indices (e.g. RPI), is not in numeric format, which would cause problems when calculating in Elasticsearch. So I decided to transform the data type here: any “-” in “ItemValue” becomes “NaN”.

Then I added another column, “RPI_level”, which tags each RPI (河川污染分類指標) reading with a category level according to its “ItemValue”, following the Taiwan EPA’s website:

  • Not(Slightly) polluted 未(稍)受污染: RPI ≦2.0
  • Low level polluted 輕度污染:2.0<RPI≦3.0
  • Medium level polluted 中度污染:3.0<RPI≦6.0
  • High level polluted 嚴重污染:RPI>6.0

Code of the above two steps:
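The original code was a screenshot, so here is a sketch of what these two steps could look like (the helper name `rpi_level` is mine; the column names follow the article):

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset.
data_20182020 = pd.DataFrame({
    "ItemEngAbbreviation": ["RPI", "RPI", "RPI", "pH"],
    "ItemValue": ["1.5", "4.5", "-", "7.8"],
})

# Step 1: coerce "ItemValue" to numbers; "-" (non-numeric) becomes NaN.
data_20182020["ItemValue_float"] = pd.to_numeric(
    data_20182020["ItemValue"], errors="coerce")

# Step 2: tag each RPI reading with the EPA category level.
def rpi_level(row):
    # Only RPI rows get a level; everything else stays NaN.
    if row["ItemEngAbbreviation"] != "RPI" or pd.isna(row["ItemValue_float"]):
        return np.nan
    rpi = row["ItemValue_float"]
    if rpi <= 2.0:
        return "Not(Slightly) polluted"
    if rpi <= 3.0:
        return "Low level polluted"
    if rpi <= 6.0:
        return "Medium level polluted"
    return "High level polluted"

data_20182020["RPI_level"] = data_20182020.apply(rpi_level, axis=1)
print(data_20182020["RPI_level"].tolist())
```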

Lastly, export this updated dataset into a new CSV file “data_20182020_v3” for later use:

data_20182020.to_csv('data_20182020_v3.csv', index=False)

2. Upload the Jan 2018 to Jun 2020 data with specific mappings on the index in Elasticsearch, also adding geolocation (latitude and longitude) as a new field.

First, connect to my Elasticsearch engine:

http://[myip]:9200/

Then, start the Kibana and Metricbeat agents. Use the URL below to open the home page of the Elastic Stack:

http://[myip]:5601/

Upload my data_20182020_v3.csv here:

Right before uploading the dataset, Elasticsearch shows the first 1,000 rows of our data by default, so we can check whether all columns are parsed correctly so far. Since my time field is now the column “SampleDate_New” with the format “yyyyMMdd”, I made this change in “Override settings”:

Click “import” and come to the index mapping settings page. Several index settings need to be changed:

  • Create Index Pattern once I import the data
  • Type of column “SampleDate_New”: date
  • Type of column “ItemValue_float”: float

Here comes a tricky part: Kibana lets us plot geo points of our data on a map, but we need a few settings so the engine can recognize those points.

First, in Mappings, change the type of the columns “TWD97Lat” and “TWD97Lon” to “float”. Then, in both Mappings and the Ingest pipeline, add one more field called “GeoLocation” (this name can be changed):

# Mappings
"GeoLocation": {
  "type": "geo_point"
}

# Ingest pipeline
{
  "append": {
    "field": "GeoLocation",
    "value": ["{{TWD97Lat}},{{TWD97Lon}}"]
  }
}
“TWD97Lat” and “TWD97Lon” are mapped as float, and “GeoLocation” is added.
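The article applies these settings through Kibana’s import UI, but the same configuration can be written up front as plain request bodies. This is a sketch on my part (the index and pipeline names are made up); the commented lines show how the elasticsearch-py client could apply them:

```python
# Request bodies mirroring the import settings above (names are hypothetical).
mappings = {
    "properties": {
        "SampleDate_New": {"type": "date", "format": "yyyyMMdd"},
        "ItemValue_float": {"type": "float"},
        "TWD97Lat": {"type": "float"},
        "TWD97Lon": {"type": "float"},
        "GeoLocation": {"type": "geo_point"},
    }
}
pipeline = {
    "processors": [
        # Append "lat,lon" so Elasticsearch parses the value as a geo_point.
        {"append": {"field": "GeoLocation",
                    "value": ["{{TWD97Lat}},{{TWD97Lon}}"]}}
    ]
}

# With the elasticsearch-py client this could be applied roughly as:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://<myip>:9200")
#   es.ingest.put_pipeline(id="geolocation-pipeline", body=pipeline)
#   es.indices.create(index="river-data", body={"mappings": mappings})
print(mappings["properties"]["GeoLocation"])  # {'type': 'geo_point'}
```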

Once done with all the settings, click “Import”. If everything is configured correctly, the following success message appears:

Next, click “View index in Discover”. Now all raw data looks as we expected:

Looks fine so far!

3. Create visualizations on the dashboard with Kibana.

Here comes the fun part — creating a customized dashboard. It will include:

  • Selectors: to choose which chemical index and which river the user wants
  • Metrics: to show how many data points are in the dataset, from the overall count down to the number of rivers, counties crossed, monitoring sites, and chemical indices
  • Pie chart: to see the percentage of each RPI status across all rivers (or a specific river) over time
  • Line chart: to show how each chemical index changes over time, especially those related to the RPI 河川污染指標, e.g. NH3-N 氨氮, BOD5 生化需氧量, DO 溶氧(電極法), SS 懸浮固體
  • Map: to show the location of each monitoring site. Currently, the tooltips only show the latest sample date’s data
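Kibana builds these visuals through the UI, but as a rough query-level equivalent, the pie chart’s RPI breakdown could also be fetched with a plain terms aggregation. This body is my own sketch (the “.keyword” suffix assumes the default dynamic string mapping):

```python
# Count documents per RPI_level, restricted to RPI readings -- roughly what
# the pie chart visualizes. Field names follow the dataset prepared earlier.
query = {
    "size": 0,  # we only want the aggregation buckets, not the hits
    "query": {"match": {"ItemEngAbbreviation": "RPI"}},
    "aggs": {
        "rpi_breakdown": {"terms": {"field": "RPI_level.keyword"}}
    },
}
# e.g. es.search(index="river-data", body=query) with elasticsearch-py
print(list(query["aggs"]))  # ['rpi_breakdown']
```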

Let’s start from the top left.

Settings for the Selectors visual:

Both selectors are “option list”.

Settings for the Metrics visual:

Settings for the pie chart: here I used “Lens”, a new Kibana feature beyond the standard visuals. It suggests several recommended charts for your chosen data:

Settings for the line chart visual: here I used the timestamp, which is the same as my SampleDate_New, as the x-axis, and made sure there is one line for each of the chemical indices:

Finally, here to set up the map. First, choose “cluster map”:

I chose the “clusters” map here

Then click on the newly added map layer and start to edit the details:

There are 117,883 data points hovering over “Taiwan”!
I gave the map a name and specified tooltip fields.
The data here on the map will show RPI only.
Then I gave each dot’s color based on their RPI_level values.

Conclusion

My river pollution dashboard is finished! If I want to look at a specific river, I can run the following checkup. For example, I live in Taipei, and one of its well-known rivers is the Keelung River 基隆河. Let’s take a look.

There are 390 RPI data points for the Keelung River over our time range, Jan 2018 to Jun 2020. It has 13 monitoring sites and runs across 3 counties: New Taipei, Taipei, and Keelung:

Its RPI level is mostly split between Not(Slightly) polluted and Medium level polluted, which seems a bit odd:

Then we can see on the map that its upstream RPI is quite good, but the downstream RPI is in bad shape:

That’s all! Thanks for following along and hope you find this helpful :)

demo for Keelung river


Li-Ting Liao
Dev Diaries

Software developer by day, amateur writer by night. Passionate about both code and creativity, and always seeking new ways to learn and grow.