KubeCon Paris 2024: Observability Day

PumsDev

Published in

odds.team

3 min read, Mar 19, 2024

#ODDS #KubeConParis2024

“Remember we are all learning (keep on going)”

Today I'll walk through the sessions of Observability Day, one of the smaller session groups at KubeCon + CloudNativeCon Europe 2024 and the one dedicated specifically to observability. This blog post covers only the parts I found interesting, shared here for everyone to read. Let's get started.

Category: Telemetry

How telemetry deals with error records

The first thing to understand is that different programming languages handle errors in different ways. Another thing to keep clear is that errors and exceptions are not the same thing.

In OpenTelemetry, we can handle errors by recording them on a span, or alternatively in a log, so that they are available when tracing.

A span is an individual unit of work in our system

Spans follow a standard and are broken down into finer-grained parts that make tracing more effective, for example

Span kind => e.g. client, server, internal …
Span Status => code and message
Span Event => information of span
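
To make these three parts concrete, here is a minimal sketch in plain Python. It is a toy model of a span, not the OpenTelemetry API: the `Span` and `SpanEvent` classes, the `record_error` method, and the `traced_divide` function are hypothetical names invented for illustration. It only shows how a kind, a status (code and message), and events can work together to describe an error.

```python
from dataclasses import dataclass, field
from enum import Enum
import time


class SpanKind(Enum):
    CLIENT = "client"
    SERVER = "server"
    INTERNAL = "internal"


class StatusCode(Enum):
    UNSET = "unset"
    OK = "ok"
    ERROR = "error"


@dataclass
class SpanEvent:
    """A timestamped piece of information attached to a span."""
    name: str
    attributes: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """Toy span: a unit of work with a kind, a status, and events."""
    name: str
    kind: SpanKind = SpanKind.INTERNAL
    status_code: StatusCode = StatusCode.UNSET
    status_message: str = ""
    events: list = field(default_factory=list)

    def record_error(self, exc: Exception) -> None:
        # Attach the exception as an event and mark the span as failed.
        self.events.append(SpanEvent(
            name="exception",
            attributes={"type": type(exc).__name__, "message": str(exc)},
        ))
        self.status_code = StatusCode.ERROR
        self.status_message = str(exc)


def traced_divide(a: float, b: float) -> Span:
    """Run one unit of work and return the span that describes it."""
    span = Span(name="divide", kind=SpanKind.INTERNAL)
    try:
        a / b
        span.status_code = StatusCode.OK
    except ZeroDivisionError as exc:
        span.record_error(exc)
    return span
```

Calling `traced_divide(1, 0)` returns a span whose status code is `ERROR` and whose single event carries the exception type and message, while `traced_divide(4, 2)` returns a span with status `OK` and no events.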

The example from the session uses OpenTelemetry with Python to simulate collecting telemetry. You can clone the source code and try it yourself from

https://github.com/avillela/otel-errors-talk

You can also read more about spans here

https://opentelemetry.io/docs/concepts/signals/traces/#spans

Lazy robot Android: Telemetry buffering on Android

Suppose we build an app for tracking mountain hikes, but partway up the mountain there is no signal, so the tracking is interrupted. How do we solve this? The approach in this session is to record the tracking as telemetry data and store it on the device first. Once the data has been exported to the server, it is deleted from the device.

You can read more about this in the repository below
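
The store-first, delete-after-export idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not the disk-buffering library from the session: the `DiskBuffer` class, its methods, and the `export` callback are hypothetical names, and a real implementation would also handle file size limits, ordering, and corrupted files.

```python
import json
import tempfile
import uuid
from pathlib import Path


class DiskBuffer:
    """Hypothetical on-device buffer: one JSON file per telemetry record."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def store(self, record: dict) -> Path:
        # Persist first, so the record survives loss of signal or a restart.
        path = self.directory / f"{uuid.uuid4().hex}.json"
        path.write_text(json.dumps(record))
        return path

    def pending(self) -> list:
        # Files still waiting to be exported (order is not preserved here).
        return sorted(self.directory.glob("*.json"))

    def flush(self, export) -> int:
        """Try to export every buffered record; delete only on success."""
        sent = 0
        for path in self.pending():
            record = json.loads(path.read_text())
            if export(record):
                path.unlink()  # safe to drop: the server has it
                sent += 1
        return sent
```

With `buffer = DiskBuffer(Path(tempfile.mkdtemp()))`, a call such as `buffer.flush(lambda record: False)` models a period without signal and leaves every file in place, while a later `buffer.flush(lambda record: True)` exports the records and removes them from disk.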

https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/disk-buffering

How do you think about Instrumentation Overhead

Overhead is the term used to describe additional computing resource usage caused by instrumentation

The instrumentation we add to trace our API is something many people overlook when measuring the API's performance. Because the latency and the CPU or memory consumed by the instrumentation are not separated from the API's actual work, the measured values drift away from reality, or the goal of the performance test no longer matches what the results actually cover.

Ways to reduce instrumentation overhead

Do nothing: sometimes no improvement is needed, as long as you understand that the results include the instrumentation
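
One way to see the overhead is to time the same work with and without instrumentation. The sketch below is an assumption-heavy illustration rather than a proper benchmark: `handle_request` stands in for the API's real work, `instrumented` is a toy wrapper that records one entry per call, and all names are hypothetical. It measures latency only, not CPU or memory.

```python
import time


def handle_request(n: int) -> int:
    """Stand-in for the API's real work."""
    return sum(i * i for i in range(n))


def instrumented(func, records: list):
    """Wrap func with toy instrumentation that records one entry per call."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        records.append({
            "name": func.__name__,
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper


def mean_latency(func, arg: int, runs: int) -> float:
    """Average wall-clock seconds per call over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        func(arg)
    return (time.perf_counter() - start) / runs


def measure_overhead(arg: int = 1_000, runs: int = 200) -> dict:
    records = []
    baseline = mean_latency(handle_request, arg, runs)
    with_instr = mean_latency(instrumented(handle_request, records), arg, runs)
    return {
        "baseline_s": baseline,
        "instrumented_s": with_instr,
        # Difference attributable to instrumentation; noisy on a busy machine.
        "overhead_s": with_instr - baseline,
        "records": len(records),
    }
```

Reporting `baseline_s` and `overhead_s` separately keeps the instrumentation cost from being silently folded into the API's own numbers.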
Remove unhelpful instrument
Upgrade an instrument
Leverage sampling
Turn off unnecessary instrumentation
Use care with manual instrumentation

More information on this topic is available at https://community.splunk.com/t5/Product-News-Announcements/Observability-How-to-Think-About-Instrumentation-Overhead-White/ba-p/670727

Fluent Bit vs OpenTelemetry

This session compared observability tools. The speaker described what to consider when comparing them, broken down into the following points

Design
Logging
Metrics
Traces
Performance

For the details, follow the speaker's YouTube channel: https://youtube.com/@isitobservable?si=7dnSJEpbRu3ofHtq

Category: Sampling Telemetry Data

Sampling telemetry data becomes necessary when there is a very large volume of it, because the cost of observability rises as logs, metrics, and traces keep growing. If we don't keep only the data that is genuinely useful, our data bloats and we pay for things that don't help with tracing. So we sample the data to cut down the amount that isn't useful. There are two ways to sample traces:

Head sampling
Tail sampling
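
The difference between the two can be sketched in Python. Head sampling decides when a trace starts, using only the trace ID, while tail sampling decides after the whole trace has been collected and can therefore look at errors and latency. The functions below are hypothetical illustrations: the hashing scheme, the span dictionaries with `status` and `duration_ms` keys, and the "keep errors and slow traces" policy are assumptions, not the strategy any speaker described.

```python
import hashlib


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head sampling: decide at the start of the trace, from the ID alone.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID reaches the same answer.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < ratio


def tail_sample(spans: list, slow_ms: float = 500.0) -> bool:
    """Tail sampling: decide once all spans of the trace are available.

    Assumed policy: keep the trace if any span failed, or if its
    longest span (usually the root) took at least `slow_ms`.
    """
    has_error = any(span["status"] == "error" for span in spans)
    longest = max((span["duration_ms"] for span in spans), default=0.0)
    return has_error or longest >= slow_ms
```

Head sampling is cheap but blind: with a 10% ratio it drops roughly 90% of failing traces too. Tail sampling keeps the interesting traces, at the cost of buffering every span until the trace completes.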

At the event, the company Pismo said it uses tail sampling. The details of its strategy are in the sessions listed at the end of this section.

A caveat when sampling

Data can be lost if the sampling strategy doesn't suit the data we have

Key takeaways: observability mindset

Context over intuition
Right tool for the right job
Observability as a cross functional discipline

The information above comes from these 2 sessions

Real-World Sampling — Lessons Learned after reducing 80% of our O11y Costs

Shift into an observability mindset and OpenTelemetry

Category: AI

Do we still need to “Observe”? The future of AI & Observability

Will AI replace observing? The answer is not yet. Rather than replacing it, AI will make observability work easier, for example by monitoring, alerting, analyzing data, building dashboards, and sorting and storing data. This session covered the capabilities that AI + O11y should offer in the future, grouped as follows

Periodic dashboard assessment

Schedule AI to check your dashboard and gather insight
Analyze more data with advanced functionalities
Shorten the review time while covering more data
Maintain ownership

Alerting

Create alerts using LLMs
Get smarter thresholds
Ask AI to generate the code to create the telemetry
Get complexities and dependencies OOTB
Save time configuring and maintaining alerts

Investigation

Investigate using LLMs
Don’t create dashboards at all; let AI build them when a problem occurs
Only query if and when you need it
Quickly extract logs
Get a snapshot of the environment at any given time
Query what you want, whenever you want, without having to learn query language

Prediction

With the magic of AI + O11y
Make models that can be proactive and predict useful things
Detection & alerting based on anomalies
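
The session did not describe a specific technique for anomaly-based detection. As a rough illustration of the idea, here is a simple rolling z-score check in Python, a plain statistical stand-in rather than an AI model. The function name, window size, and threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(values: list, window: int = 10, threshold: float = 3.0) -> list:
    """Return the indexes of values that deviate from recent history.

    A value is flagged when it lies more than `threshold` standard
    deviations from the mean of the preceding `window` values.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

Unlike a fixed threshold, the limit here moves with the recent behaviour of the metric, which is the basic idea behind alerting on anomalies instead of hand-tuned values.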

Post Mortem

Quick collection of all the data
Automatic analysis
Action items

Thanks to all the speakers! You gave me a lot of things to learn next.

KubeCon + CloudNativeCon Europe 2024: Observability Day Hosted by CNCF - Full...

View more about this event at KubeCon + CloudNativeCon Europe 2024

kccnceu2024.sched.com