Stories by Muanai Khalifah Revindo on Medium

Muanai Khalifah Revindo — Fri, 08 May 2026 12:55:12 GMT

Taming Ambiguity in Indonesian Text Summarization: Building an IndoBART-v2 Pipeline from Research to Engineering

In the rapidly evolving landscape of Indonesian Fintech and E-commerce, the primary challenge is not merely the volume of data. Instead, the real hurdle is the inherent ambiguity. Summarizing Indonesian news or financial reports is a complex task because semantic coherence often clashes with grammatical nuances. While high-performance models are abundant in the English NLP ecosystem, the Indonesian landscape is still maturing. This situation often requires engineers to work with specific, localized architectures that carry their own set of legacy baggage.

This project focuses on the implementation of an abstractive summarization pipeline using IndoBART-v2 fine-tuned on the IndoSum dataset.

However, moving this from a simple research script to a robust pipeline revealed a fundamental truth in Machine Learning Engineering: the model is often the easiest part.

The real engineering happens in the friction. It involves navigating the “dependency hell” of outdated libraries, making calculated hardware tradeoffs between different GPU architectures, and designing a modular system that can scale beyond a single Jupyter Notebook. This article is not a simple report on achieving a high ROUGE score. It is a technical narrative of the engineering decisions, debugging struggles, and architectural choices required to build a production-ready NLP system in a localized context.

Model Selection: Why IndoBART-v2?

Choosing the right architecture for abstractive summarization is a critical decision that dictates the trajectory of the entire project. In the current era of Large Language Models (LLMs), the default inclination is often to reach for the largest decoder-only models available. However, an engineer must prioritize the alignment between the model objective and the specific task at hand.

The Architecture Choice: Encoder-Decoder vs. Decoder-Only

I chose IndoBART-v2 specifically because of its encoder-decoder architecture. While decoder-only models like GPT excel at open-ended generation, they are not inherently optimized for sequence-to-sequence tasks like summarization.

The BART architecture utilizes a denoising autoencoder objective. It is trained by corrupting text and then forcing the model to reconstruct the original document. This pre-training objective is highly congruent with summarization. The encoder builds a robust representation of the full source context, while the decoder focuses on reconstructive synthesis. This results in summaries that are typically more grounded in the source text compared to the purely “predictive” nature of decoder-only architectures.
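To make the denoising objective concrete, here is a toy sketch of span corruption in plain Python. It is an illustration under stated assumptions, not the pre-training code of BART or IndoBART: the function name, the mask string, and the uniform span lengths are all hypothetical simplifications (the BART paper draws span lengths from a Poisson distribution).

```python
import random

MASK = "<mask>"  # hypothetical mask string, for illustration only


def text_infilling(tokens, mask_ratio=0.3, max_span=3, seed=0):
    """Replace random token spans with a single mask token each (toy version)."""
    rng = random.Random(seed)
    budget = max(1, int(len(tokens) * mask_ratio))  # number of tokens we may hide
    corrupted, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(rng.randint(1, max_span), budget, len(tokens) - i)
            corrupted.append(MASK)  # one mask stands in for the whole span
            i += span
            budget -= span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted


source = "pemerintah mengumumkan anggaran baru untuk program olahraga nasional".split()
noisy = text_infilling(source)
# Training pair: the encoder reads `noisy`, the decoder must reproduce `source`.
```

Because one mask can hide several tokens, the decoder has to decide how much text is missing, which is loosely similar to deciding how much of a source document to keep in a summary.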

Localized Pre-training vs. Multilingual Models

Another major consideration was the trade-off between localized and multilingual pre-training. Multilingual models like mBART or mT5 are trained on dozens of languages, and in mT5’s case around a hundred. While impressive, they often suffer from “the curse of multilinguality,” where the model’s capacity is spread thin across diverse linguistic structures.

Figure 1: Architectural Tradeoffs between BART and T5. Unlike T5 which focuses on predicting missing spans, BART’s objective of reconstructing the entire original sentence makes its encoder-decoder flow more robust for abstractive summarization, ensuring better groundedness in the resulting Indonesian text.

IndoBART-v2 is specifically pre-trained on a massive corpus of Indonesian text. This ensures that the model’s vocabulary and internal representations are finely tuned to the nuances of Indonesian formal and informal registers. For a domain like Fintech, where specific legal and financial terminology is common, a localized vocabulary is more efficient than a generalized multilingual one.

Engineering Constraints: Compute and Latency

Finally, engineering is the art of working within constraints. Fine-tuning a massive LLM requires a level of compute resources that is often not justifiable for specific niche tasks. IndoBART-v2 provides a “sweet spot” in terms of parameter count. It is large enough to capture complex semantic relationships but small enough to be fine-tuned on a single consumer-grade or mid-tier enterprise GPU.

This efficiency translates directly to production benefits. A smaller, specialized model offers lower inference latency and reduced operational costs. In a high-traffic E-commerce environment, being able to serve summaries in milliseconds is far more valuable than having a slightly more “creative” but prohibitively slow model.

The Debugging Chronicles: Solving Legacy Hell

In the idealized world of tutorials, libraries work together in perfect harmony. In professional machine learning engineering, however, you often find yourself caught in the crossfire of version mismatches and abandoned repositories. This project was no exception. The conflict centered on the indobenchmark toolkit, a vital but aging library, and the modern transformers ecosystem.

The Conflict: A Tale of Two Versions

The indobenchmark library is a snapshot of the Indonesian NLP state-of-the-art from circa 2021. Since then, the Hugging Face transformers library has undergone significant structural changes. When I attempted to initialize the IndoNLGTokenizer, the system immediately collapsed with a series of ImportError and AttributeError messages.

The root cause was technical debt in the dependency chain. The legacy tokenizer imported private internal members of the transformers utility modules, specifically helpers like _is_jax and _is_torch in transformers.utils.generic. In newer versions of transformers, these members had been moved, renamed, or deleted. This created a complete blockade: I needed the specific logic of the Indonesian tokenizer, but the environment it required no longer existed.

The Solution: Monkey Patching and Runtime Surgery

Faced with a broken third-party library, an engineer has two choices: downgrade the entire environment to a vulnerable, outdated state, or perform “runtime surgery.” I chose the latter through a technique known as Monkey Patching.

Monkey patching allows a developer to modify or extend the behavior of a library at runtime without altering the actual source code on disk. This was achieved in two strategic phases:

Namespace Restoration: I manually injected the missing names back into transformers.utils.generic before the tokenizer was imported. By faking the existence of _is_jax, _is_torch, and their siblings, I satisfied the legacy library's import requirements without actually needing those frameworks present.
Method Wrapping: The modern Trainer class passes several new arguments to the tokenizer's pad and decode methods that the legacy IndoNLGTokenizer does not recognize. To solve this, I wrapped the original methods in custom functions that "swallow" unknown arguments. This ensured that when the training loop called for padding, the outdated tokenizer would only receive the parameters it knew how to handle.

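import types

import transformers.utils.generic

# Phase 1: namespace restoration. This must run before the legacy tokenizer is imported.
# The placeholders only need to exist so that the legacy import statements succeed.
missing_flags = ["_is_jax", "_is_numpy", "_is_tensorflow", "_is_torch", "_is_torch_device"]
for flag in missing_flags:
    if not hasattr(transformers.utils.generic, flag):
        setattr(transformers.utils.generic, flag, False)

def apply_tokenizer_patches(tokenizer):
    """Phase 2: wrap methods on this tokenizer instance so newer arguments are ignored."""
    _original_pad = tokenizer.pad

    def custom_pad(*args, **kwargs):
        # Drop the keyword the legacy pad() does not accept.
        kwargs.pop('padding_side', None)
        return _original_pad(*args, **kwargs)

    tokenizer.pad = custom_pad

    def custom_save_vocabulary(self, save_directory, filename_prefix=None):
        # No-op: report that no vocabulary files were written.
        return ()

    tokenizer.save_vocabulary = types.MethodType(custom_save_vocabulary, tokenizer)

    # Already bound to the tokenizer, so it is called without passing self again.
    _original_decode = tokenizer.decode

    def custom_decode(self, token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=None, **kwargs):
        # clean_up_tokenization_spaces is accepted here but deliberately not forwarded.
        return _original_decode(token_ids, skip_special_tokens=skip_special_tokens, **kwargs)

    tokenizer.decode = types.MethodType(custom_decode, tokenizer)

    return tokenizer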
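The method-wrapping idea can also be written generically. The following is a minimal sketch, assuming the goal is to drop any keyword argument the wrapped method does not declare; drop_unknown_kwargs and LegacyTokenizer are hypothetical names used only for illustration and are not part of transformers or indobenchmark.

```python
import functools
import inspect


def drop_unknown_kwargs(func):
    """Wrap func so keyword arguments it does not declare are silently dropped."""
    params = inspect.signature(func).parameters
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if not accepts_any:
            kwargs = {k: v for k, v in kwargs.items() if k in params}
        return func(*args, **kwargs)

    return wrapper


class LegacyTokenizer:  # hypothetical stand-in for an outdated tokenizer
    def pad(self, batch, max_length=None):
        width = max_length or max(len(seq) for seq in batch)
        return [seq + [0] * (width - len(seq)) for seq in batch]


tok = LegacyTokenizer()
tok.pad = drop_unknown_kwargs(tok.pad)  # patch this instance, not the class
padded = tok.pad([[1, 2], [3]], padding_side="right")  # unknown keyword is ignored
```

Silently discarding arguments is convenient but can hide real incompatibilities, so an explicit list of dropped names, as in the patch above, is arguably the safer default.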

A Note on Technical Debt: The Temporary Nature of Patching

It is important to acknowledge that while monkey patching rescued the pipeline, it is not a permanent architectural fix. Runtime patches are inherently brittle. If the transformers library undergoes another major update or if indobenchmark is eventually modernized, this patch could become redundant or even cause new conflicts. In a long-term production environment, the strategic move would be to fork the legacy library and update its source code to maintain compatibility. However, in the context of rapid prototyping and research, this tactical intervention allowed us to move forward without being paralyzed by a fractured ecosystem.

Key Takeaway: Engineering Maturity over Tool Dependency

Solving this “Legacy Hell” was perhaps the most informative part of the project. It shifted the focus from being a mere user of tools to being an investigator of those tools.

This level of debugging demonstrates a crucial aspect of software engineering maturity. In a production environment, you cannot always wait for a maintainer to update a library that has been stale for years.

The ability to trace a stack trace back to a specific line of a third-party dependency, understand the breaking change in the upstream library, and implement a non-destructive fix at runtime is a vital skill. It is the difference between a project that stalls at the first error and one that reaches deployment despite a fractured ecosystem.

Hardware Realism: T4 vs. P100 Tradeoffs

In the world of cloud computing and Kaggle environments, VRAM capacity is often used as the primary metric for GPU selection. A common assumption is that a card with high memory capacity is objectively superior for training large models.

However, this project served as a reminder that raw memory is a hollow metric if the underlying architecture cannot process data efficiently.

The Choice: Architecture over Capacity

When selecting a GPU for fine-tuning IndoBART, I was faced with a choice between the NVIDIA P100 (Pascal architecture) and the NVIDIA T4 (Turing architecture). On paper, the P100 is a powerhouse: its HBM2 memory delivers more than twice the bandwidth of the T4’s GDDR6. Yet, for modern transformer workloads, the T4 is often the more pragmatic choice.

The decision was not about the quantity of VRAM. It was about the generational leap in specialized instruction sets. While the P100 is an excellent general-purpose accelerator, it lacks the specialized hardware features that modern deep learning libraries are optimized to exploit.

Deep Dive: Tensor Cores and Mixed Precision

The defining factor in this hardware tradeoff was the presence of Tensor Cores in the T4. Unlike the older Pascal architecture, Turing was designed with these specialized cores to accelerate matrix multiplication, which is the fundamental operation in transformer layers.

By utilizing Tensor Cores, I could implement Mixed Precision Training (FP16). This technique performs most calculations in 16-bit half-precision while keeping a 32-bit master copy of the weights to preserve accuracy. On the T4, FP16 is not just a memory-saving trick; it is a hardware-accelerated feature that provides a massive boost in computational speed. The P100, while capable of 16-bit operations, does not possess the dedicated Tensor Cores to turn that precision reduction into a comparable performance gain.

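from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./indosum-bart-summarization",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16 per device
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=20,
    predict_with_generate=True,  # run generate() during evaluation so ROUGE can be computed
    fp16=True, # Crucial for T4 Tensor Core utilization
    evaluation_strategy="epoch",  # recent transformers releases name this eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True  # requires matching evaluation and save strategies
)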

The Impact: Throughput and Training Stability

This architectural choice had a direct impact on the training lifecycle. By opting for the T4 and enabling FP16, I achieved significantly higher throughput, measured in samples processed per second.

Furthermore, mixed precision training stayed stable on the T4. Modern training loops use dynamic loss scaling to keep small gradients from underflowing to zero in FP16, and they lower the scale again when gradients overflow. With that safeguard in place, the training curve was both faster and less prone to sudden gradient explosions.
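The underflow that loss scaling protects against can be reproduced with the standard library alone. This is a small sketch with a hypothetical gradient value and a fixed scale factor; real training loops adjust the scale dynamically and apply it to the loss before the backward pass.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8        # hypothetical gradient, below FP16's smallest subnormal
loss_scale = 2.0 ** 16  # hypothetical fixed scale factor

unscaled = to_fp16(tiny_grad)                               # underflows to 0.0
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale    # survives the round trip
```

Without scaling the gradient vanishes entirely. With scaling it comes back to roughly its original value once the scale is divided out in higher precision.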

Ultimately, this tradeoff proves a vital engineering lesson:

Always align your hardware choice with your software’s optimization capabilities. For fine-tuning transformers, the efficiency of specialized cores often outweighs the brute force of raw memory bandwidth.

Architecture: Designing for Modularity

A common pitfall in machine learning projects is the “monolithic notebook” approach. While packing every line of code into a single, massive script is convenient for a quick experiment, it is a nightmare for long-term scalability and maintenance. For this project, I chose to dismantle the monolithic structure in favor of a modular architecture built on the principle of separation of concerns.

The Monolith vs. The Modular Approach

In a production-ready environment, the logic for cleaning data should not live in the same space as the logic for updating model weights. By decoupling these processes, we ensure that a bug in the data pipeline does not silently corrupt the training loop. This modularity also allows different team members to work on separate parts of the system simultaneously without causing merge conflicts or logic overlaps.

The Structural Components

I reorganized the codebase into three distinct functional managers, each with a clear and isolated responsibility:

IndoSumManager: This is the “Source of Truth” for the data. It handles the raw file orchestration, flattening nested JSON structures, and executing the heavy lifting of text cleaning. By isolating these tasks, I can swap out the dataset or modify cleaning rules without ever touching the model configuration.
DataPreprocessor: This component manages the bridge between raw text and tensors. It handles the deterministic splitting of data using a fixed random_state=42 to ensure reproducibility. It also encapsulates the tokenization logic, ensuring that max length constraints and padding strategies remain consistent across training and evaluation.
SummarizationTrainer: This is the engine room. It wraps the Hugging Face Trainer API, the DataCollator, and the EarlyStopping callbacks. Its only job is to take processed datasets and an initialized model, then execute the training lifecycle according to the specified hyperparameters.
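A rough sketch of how the first two components might be separated is shown below. Only the class names come from the project; every method name, field name, and default here is an assumption made for illustration, and the training wrapper is left out because it depends on the Hugging Face stack.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Example:
    article: str
    summary: str


class IndoSumManager:
    """Loads and cleans raw records (hypothetical interface)."""

    def clean(self, records):
        def norm(text):
            return re.sub(r"\s+", " ", text).strip()

        return [Example(norm(r["article"]), norm(r["summary"])) for r in records]


class DataPreprocessor:
    """Deterministic train/validation split (hypothetical interface)."""

    def __init__(self, seed=42, val_fraction=0.2):
        self.seed, self.val_fraction = seed, val_fraction

    def split(self, examples):
        shuffled = list(examples)
        random.Random(self.seed).shuffle(shuffled)  # same seed, same split
        n_val = max(1, int(len(shuffled) * self.val_fraction))
        return shuffled[n_val:], shuffled[:n_val]


records = [{"article": f"berita  nomor {i}\n", "summary": f"ringkas {i}"} for i in range(10)]
examples = IndoSumManager().clean(records)
train, val = DataPreprocessor().split(examples)
```

Because the split depends only on the seed and the input order, rerunning the pipeline yields the same partitions, which is the property the fixed random_state is meant to provide.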

Benefits: Scalability and Maintainability

The primary benefit of this design is Scalability. If I decide to move from news summarization to summarizing customer reviews in an E-commerce context, I only need to implement a new data manager. The preprocessing and training modules remain largely untouched.

Furthermore, this structure enhances Maintainability. If a specific error occurs during tokenization, I know exactly which class to investigate. There is no need to sift through a thousand lines of code to find the offending loop. This level of organization reflects a professional software engineering mindset: code is written not just for the machine to execute, but for other engineers to read, debug, and improve.

Evaluation: A Skeptical Perspective

In machine learning, it is easy to become intoxicated by rising metrics. However, a professional engineer must maintain a healthy skepticism toward the results. While the ROUGE scores provide a mathematical baseline for performance, they often mask the underlying pathologies of the model.

The Quantitative Reality: Observing the Overfitting Trap

Figure 2: Visualizing the Overfitting Divergence. While the Training Loss decays sharply, the Validation Loss begins to climb steadily after Epoch 3. This divergence is a classic signal that the model is shifting from semantic learning to pattern memorization, marking the optimal point for early stopping.

The training logs reveal a classic divergence in the optimization process. By the twentieth epoch, the training loss dropped significantly from 0.53 to 0.04. However, the validation loss began to creep upward as early as the third epoch, eventually reaching 0.66.

This is a clear signal of overfitting. The model shifted from learning generalized semantic structures to memorizing the specific patterns of the training data. While the ROUGE-1 scores plateaued at approximately 0.35, the rising validation loss indicates that the model was losing its ability to generalize to unseen data. In a production environment, this suggests that training should have been halted much earlier, likely around epoch three, to preserve the model’s robustness.
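The stopping rule implied here can be sketched in a few lines. The loss values below are hypothetical and only shaped like the curve described above (improving until epoch 3, then rising); they are not the real training logs. In a Hugging Face setup the equivalent behavior is usually obtained with an early stopping callback and a patience setting.

```python
def early_stop_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) using 1-based epoch numbers."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop here, keep the best checkpoint
    return best_epoch, len(val_losses)


# Hypothetical validation losses, for illustration only.
illustrative_val_losses = [0.50, 0.46, 0.45, 0.47, 0.50, 0.55, 0.60, 0.66]
best, stopped = early_stop_epoch(illustrative_val_losses, patience=2)
```

With a patience of two epochs, this toy run would stop at epoch 5 and restore the epoch 3 checkpoint instead of training for all twenty epochs.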

The Qualitative Gap: Hallucinations and Generation Noise

Looking at the sample evaluation provides even deeper insights than the loss curves. In the provided example, the model successfully captures the core entities, such as “Satlak Prima” and the “Rp 500 miliar” budget. However, as the generation continues, we see the emergence of “noise.”

The AI summary ends with a series of repetitive dots and fragmented sentences regarding budget proposals and DPR decisions. This is a common failure mode in sequence-to-sequence models: the decoder fails to emit the end-of-sequence (EOS) token at the right moment and degenerates into repetitive looping. Even though the early part of the summary is highly accurate, the messy termination would make the output unsuitable for a direct user-facing interface without further post-processing or stricter decoding constraints, such as a repetition penalty or n-gram blocking, during inference.
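As one possible form of post-processing, the sketch below collapses runs of dots and drops an unfinished trailing fragment. The function name and the sample string are hypothetical; the sample is not the model's actual output.

```python
import re


def tidy_summary(text: str) -> str:
    """Collapse runs of dots and drop any unfinished trailing fragment."""
    text = re.sub(r"(\s*\.){2,}", ".", text)  # ". . . ." or "...." becomes "."
    text = re.sub(r"\s+", " ", text).strip()
    last_stop = max(text.rfind(c) for c in ".!?")
    return text[: last_stop + 1] if last_stop != -1 else text


raw = "Satlak Prima mengajukan anggaran Rp 500 miliar . . . . DPR akan"  # hypothetical
clean = tidy_summary(raw)
```

A cleanup step like this treats the symptom only. Decoding-time settings such as no_repeat_ngram_size or repetition_penalty in the Hugging Face generate API address the looping closer to its source.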

Factual Consistency vs. Token Matching

The most critical observation from the sample is the model’s reliance on extraction. The AI summary mimics the source text closely, which explains the decent ROUGE scores. Yet, it also introduced specific details about a “90 miliar” additional fund. While this might be factually present in the full source text, the model struggled to synthesize this information into a coherent conclusion, instead trailing off into repetitive punctuation.

This highlights the limitation of ROUGE. The metric rewards the model for matching tokens, but it does not penalize it for losing structural integrity at the end of a paragraph.
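A simplified ROUGE-1 makes the blind spot visible. This sketch uses whitespace tokens and made-up sentences; official ROUGE implementations add their own tokenization and optional stemming, so the numbers are illustrative only.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 on whitespace tokens (a simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, for illustration only.
reference = "dpr menyetujui anggaran rp 500 miliar untuk satlak prima"
clean = "dpr menyetujui anggaran rp 500 miliar untuk satlak prima"
degraded = clean + " . . . dan dpr"  # same content with a broken ending

score_clean = rouge1_f1(clean, reference)
score_degraded = rouge1_f1(degraded, reference)
```

The broken ending costs some precision, yet the degraded text still scores about 0.78, well above the plateau reported earlier, even though no reader would accept it as a finished summary.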

As engineers, we must realize that a high ROUGE score does not necessarily equate to a high-quality summary. True success is defined by a model that knows when to stop talking and how to maintain a consistent logical flow until the very last token.

Conclusion: Moving Toward Production

Building an Indonesian text summarization pipeline is a masterclass in managing technical tradeoffs. This project moved beyond the theoretical allure of model training and into the gritty reality of systems engineering. From performing runtime surgery on legacy libraries to making calculated hardware choices, the journey emphasized that a successful machine learning project is built on the foundation of robust software principles.

Lessons from the Training Trenches

The most immediate takeaway from this project is the necessity of rigorous monitoring. The divergence between training and validation loss serves as a textbook reminder that “more training” does not equal “better performance.” In future iterations, implementing more aggressive early stopping criteria would be a priority. By halting the training as soon as the validation loss begins its upward trend, we can preserve the model’s ability to generalize and prevent it from becoming a mere memorization engine.

The Road to Deployment: Optimization and Serving

To transition this model from a Kaggle notebook to a production environment, several optimization steps are required. The current raw model, while performant, has a footprint that could be improved for real-time serving.

Quantization: Moving the model from FP16 to INT8 or even 4-bit quantization would significantly reduce the memory footprint and increase inference speed. This is essential for deploying the model in cost-sensitive environments or on edge devices.
Inference Serving: Implementing the model through a high-performance framework like FastAPI or NVIDIA Triton would allow us to handle concurrent requests and manage the lifecycle of the model more efficiently than a standard Python script.
Advanced Fine-Tuning: For future versions, exploring Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA (Low-Rank Adaptation) could provide a more efficient path to specialization. This would allow us to adapt the model to specific sub-domains, such as legal contracts or customer support chats, without the need for full parameter updates.
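The memory argument for quantization comes down to simple arithmetic: parameter count times bytes per parameter. The sketch below uses a placeholder parameter count, not a measured figure for IndoBART-v2, and it ignores activations, quantization metadata, and runtime overhead.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_mb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


ASSUMED_PARAMS = 130_000_000  # placeholder; check the real count for your checkpoint
sizes = {dtype: weight_footprint_mb(ASSUMED_PARAMS, dtype) for dtype in BYTES_PER_PARAM}
```

Under this assumption the weights shrink from 520 MB in FP32 to 130 MB in INT8. Whether summary quality survives that reduction has to be checked empirically.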

Final Thoughts: The Engineering Mindset

Ultimately, this project confirms that the role of an ML Engineer is to be a bridge between research and reality. It is not enough to have a model that produces a high ROUGE score; the model must be maintainable, the code must be modular, and the deployment path must be clear.

As we continue to push the boundaries of localized NLP for the Indonesian market, our focus must remain on building systems that are not just “intelligent,” but also predictable and production-ready. The goal is to move from experimental prototypes to real-world systems that can reliably distill information in the high-stakes environments of Fintech and E-commerce.

The GitHub Repository

You can find the complete source code, the modular managers, and the training notebooks in my GitHub repository. I encourage you to fork the project, experiment with different hyperparameters, or suggest improvements to the data pipeline.

GitHub - Muanai/indobart-indosum-summarizer

From Street Images to Geo-Spatial Insights: Detecting Telecom Infrastructure with YOLOv8

Muanai Khalifah Revindo — Mon, 16 Mar 2026 09:44:23 GMT

Telecom infrastructure is the silent backbone of our digital lives, woven into the chaotic fabric of our city streets. Yet for business analysts, a simple question remains surprisingly difficult to answer: who owns which pole?

Fixed broadband providers often mark their infrastructure using distinct colored bands attached to utility poles. In theory, these markers should make asset identification straightforward. In practice, mapping them at scale still relies heavily on manual field surveys — a process that is slow, costly, and difficult to scale.

During my internship at Telkomsel’s Business Growth & Analytics division, I explored whether computer vision could automate this process. The goal was to transform raw street-level imagery into structured insights about infrastructure ownership.

This project evolved beyond simply training an object detection model. Instead, I engineered an end-to-end pipeline combining YOLOv8-based detection with a custom geo-spatial workflow that maps detected assets directly to administrative boundaries.

In this article, I will walk through the technical journey from a baseline model to a functional MVP, including:

The Data-Centric Pivot: Why resolution and stratified dataset design mattered more than model tuning.
Handling Real-World Constraints: Using physics-aware augmentation and synthetic data to handle noisy street environments.
From Pixels to Maps: Integrating EXIF GPS extraction and reverse geocoding to generate regional infrastructure reports.

The Problem: Beyond the Clean Benchmark

In a controlled environment, object detection is a solved problem. But the streets of Indonesia are anything but controlled. This project was not about detecting utility poles themselves, but about the small, often weathered color bands wrapped around them that signify ownership. Moving from a sandbox to a pilot-ready prototype meant facing a “Python Trap” of sorts: the logic was sound, but the environment was hostile.

Figure 1: Raw field capture showcasing typical urban constraints: complex backgrounds and weathered infrastructure markers that define the provider’s visual identity.

We identified three primary bottlenecks that made standard “off-the-shelf” training architecturally non-scalable:

The Resolution Trap (Thin Features): These color bands are thin and often placed high on utility poles. At standard 640px resolutions, the spatial features of these bands — especially when surrounded by thin fiber optic cables — suffer from severe aliasing, effectively disappearing into a mess of pixels.
Environmental Noise & Occlusion: Indonesian street scenes are dense. Our target markers are frequently masked by tangled “spaghetti” cables, tropical foliage, or roadside banners. The model doesn’t just need to see the band; it needs to distinguish it from a sea of visual clutter.
The Long-Tail Distribution: Infrastructure isn’t distributed equally. Some providers have a massive presence, while others, like Lintasarta, are so rare in the field that they became “ghost classes” during initial training, leading to a catastrophic 0.086 mAP50 before intervention.

We realized that achieving reliability required more than just more training time; it required a data-centric overhaul specifically tuned to these small-scale markers.

Dataset & Detection: Decoding the Provider “DNA”

The dataset was not just collected; it was engineered for visual identity classification. To ensure the highest data integrity for a pilot-ready prototype, I personally captured 399 raw images of street-level infrastructure.

However, quantity does not always equal quality in computer vision. I manually filtered this raw pool down to a curated set of high-quality images, selecting only those where the provider-specific color bands were clearly identifiable despite real-world obstructions.

These bands act as the visual “DNA” for operators like CBN, Indosat, MyRepublic, and Lintasarta. However, our “ground truth” was plagued by real-world inconsistency:

Weathering & Chromatic Shift: A red band on a brand-new pole is vivid. A red band after years of tropical sun and rain is faded or rusted. We had to implement aggressive saturation variance (hsv_s=0.7) to ensure the model learned the identity of the color, not just its perfect hex code.
Strategic Data Curation: From a raw pool of 399 field captures, only 157 high-quality images were selected, so that every image the model learned from was a valid representation of real infrastructure.

By shifting our focus from the pole to the color band as a high-frequency feature, we moved toward a 1248px resolution strategy — a first-class hyperparameter decision that finally allowed the model to “see” the markers clearly.

Model & Training Strategy: Engineering Reliable Learning

Building the model wasn’t just about picking an architecture; it was about ensuring the architecture could actually learn from the noise. We chose YOLOv8 for its balance of speed and accuracy, but achieving a pilot-ready status required several strategic interventions beyond the default configuration:

Stratified Splitting: Random splits are a luxury of balanced datasets. To prevent the scenario where all rare provider instances (like Lintasarta) ended up in the validation set — leaving the model with nothing to learn — we implemented a stratified split strategy to maintain class distribution across both sets.
Physics-Aware Augmentation: Modern augmentation can be too aggressive. We locked our transformations to real-world logic — specifically disabling vertical flips (flipud=0.0) because infrastructure objects never defy gravity. We also utilized occlusion simulation (erasing=0.4) to mimic the chaotic "spaghetti" of Indonesian street scenes.
Synthetic Data Injection: To address representation collapse in minority classes, we injected 12 AI-generated synthetic samples exclusively into the training set. Crucially, the validation set remained 100% real-world data, ensuring that our mAP scores reflected genuine generalization rather than “synthetic-assisted” evaluation.
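The stratified split described above can be sketched without any ML library. This version assumes each image is assigned a single class for stratification purposes (for example, the rarest class it contains); that rule, the function name, and the class counts are hypothetical and not taken from the project.

```python
import random
from collections import defaultdict


def stratified_split(image_labels, val_fraction=0.2, seed=42):
    """Split image ids so each class with 2+ images appears on both sides."""
    by_class = defaultdict(list)
    for image_id, cls in image_labels.items():
        by_class[cls].append(image_id)

    rng = random.Random(seed)
    train, val = [], []
    for cls in sorted(by_class):
        ids = sorted(by_class[cls])
        rng.shuffle(ids)
        if len(ids) < 2:
            train.extend(ids)  # a single image can only go to training
            continue
        n_val = min(len(ids) - 1, max(1, round(len(ids) * val_fraction)))
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Hypothetical class counts, not the project's real distribution.
labels = {f"indosat_{i}": "Indosat" for i in range(20)}
labels.update({f"lintasarta_{i}": "Lintasarta" for i in range(3)})
train_ids, val_ids = stratified_split(labels)
```

Synthetic samples would be appended to the training list only after this split, which keeps the validation side purely real-world.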

The Small Object Problem: Solving Feature Collapse

In infrastructure detection, scale is the ultimate enemy. At the standard 640px resolution, thin assets — specifically provider color bands — often collapse into a few ambiguous pixels, making them invisible to the model.

To counter this, we treated resolution as a first-class hyperparameter:

The 1248px Shift: We nearly doubled the training resolution, from 640px to 1248px, to recover fine-grained spatial features lost during downsampling.
The Impact: This was the primary driver for our “redemption arc.” For the Lintasarta class, this high-resolution strategy helped push mAP50 from a near-zero 0.086 to 0.497.

By prioritizing data density over model complexity, we moved from mere object detection to feature recovery.
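A back-of-the-envelope calculation shows why resolution matters for thin features. The photo size and band height below are hypothetical, and the stride of 8 refers to the finest detection scale commonly used by YOLOv8-style heads.

```python
def resized_extent(pixels_in_original: float, original_long_side: int, imgsz: int) -> float:
    """Extent of a feature after the image's long side is resized to imgsz."""
    return pixels_in_original * imgsz / original_long_side


BAND_PX, PHOTO_PX = 30, 4000  # hypothetical: a 30 px band in a 4000 px photo

at_640 = resized_extent(BAND_PX, PHOTO_PX, 640)    # about 4.8 px
at_1248 = resized_extent(BAND_PX, PHOTO_PX, 1248)  # about 9.4 px

cells_640 = at_640 / 8    # extent in cells of a stride-8 feature map
cells_1248 = at_1248 / 8
```

Under these assumptions the band covers less than one feature-map cell at 640px and slightly more than one at 1248px, which gives the detector noticeably more signal to localize it.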

Experiment Results: The Data-Centric Redemption

Figure 2: The iterative evolution of model performance. Each jump in mAP50 represents a specific bottleneck resolution: from the initial standardization in v1.0 to the critical v2.1, driven by high-resolution feature recovery and synthetic data injection.

After multiple iterations of dataset engineering and surgical model tuning, the final candidate (v2.1) achieved an overall mAP50 of 0.763 on the validation set. While the metric itself is solid for a chaotic street-level task, the real story lies in the “redemption arc” of our minority classes.

By implementing stratified training and high-resolution feature recovery, we saw a dramatic shift in model awareness:

Indosat & CBN: Maintained “God Tier” performance with mAP50 scores consistently above 0.90, proving the model’s absolute grasp of their visual markers.
The Minority Success: Our previously “invisible” class, Lintasarta, climbed from a catastrophic 0.086 to a functional 0.497 mAP50.
Robustness: The model demonstrated increased resilience against visual noise such as trees and banners, a direct result of our physics-aware augmentation and stratified validation strategy.

These results validate a core lesson of the project:

In real-world applications, an unreliable validation set is more damaging than an underperforming model. By keeping our validation 100% real-world, we ensured these metrics translate to actual field reliability.

Geo-Spatial Integration: From Pixels to Regional Intelligence

The true “Unique Selling Point” of this pipeline is its ability to move beyond simple bounding boxes. A detection on a screen has little value to a business analyst unless it is grounded in spatial context.

Figure 3: The standalone inference engine in action. The tool extracts GPS metadata and performs offline reverse-geocoding to map detections directly to administrative sub-districts (Kecamatan).

To bridge this gap, the pipeline integrates a sophisticated geo-spatial module that transforms raw detections into structured business intelligence:

Automated Metadata Extraction: The engine automatically parses EXIF metadata from the source imagery to extract precise GPS coordinates (Latitude/Longitude) at the moment of capture.
Offline Reverse Geocoding: Using BPS (Statistics Indonesia) Shapefiles, the system performs offline reverse geocoding to map these coordinates to specific administrative boundaries, such as Kecamatan (Sub-districts).
Structured Reporting: Instead of just annotated images, the final output is a consolidated CSV report. This allows infrastructure presence to be aggregated and analyzed at a regional level — enabling the Business Growth team to identify competitor expansion trends with a single click.
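The first two steps can be sketched with plain Python. EXIF stores GPS positions as degrees, minutes, and seconds plus a hemisphere reference, so they must be converted to signed decimal degrees before any lookup. The rectangles below are hypothetical stand-ins for real kecamatan polygons, and the ray-casting test is a generic technique, not necessarily what the project's geo-spatial module uses.

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus N/S/E/W to signed decimal degrees."""
    degrees, minutes, seconds = dms
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > lat) != (y2 > lat):  # edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def reverse_geocode(lon, lat, boundaries):
    """Return the first boundary name containing the point, or None."""
    for name, polygon in boundaries.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical boundaries; real polygons would be read from the shapefiles.
boundaries = {
    "Kecamatan A": [(106.80, -6.25), (106.85, -6.25), (106.85, -6.20), (106.80, -6.20)],
    "Kecamatan B": [(106.85, -6.25), (106.90, -6.25), (106.90, -6.20), (106.85, -6.20)],
}
lat = dms_to_decimal((6, 13, 30.0), "S")    # about -6.225
lon = dms_to_decimal((106, 49, 12.0), "E")  # about 106.82
district = reverse_geocode(lon, lat, boundaries)
```

A linear scan over polygons is fine for a handful of districts. For province-scale shapefiles, a spatial index would normally come first.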

By integrating geo-analytics directly into the inference engine, we transformed a computer vision experiment into a functional Geospatial Business Intelligence tool.
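The reporting step could look roughly like the sketch below, which counts detections per district and provider and writes them as CSV text. The column names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def build_report(detections):
    """Aggregate (kecamatan, provider) detections into CSV text."""
    counts = Counter((d["kecamatan"], d["provider"]) for d in detections)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["kecamatan", "provider", "detections"])
    for (kecamatan, provider), n in sorted(counts.items()):
        writer.writerow([kecamatan, provider, n])
    return buffer.getvalue()


detections = [  # hypothetical rows
    {"kecamatan": "Kecamatan A", "provider": "Indosat"},
    {"kecamatan": "Kecamatan A", "provider": "Indosat"},
    {"kecamatan": "Kecamatan B", "provider": "CBN"},
]
report = build_report(detections)
```

Sorting the keys keeps the output stable between runs, which makes successive reports easy to compare.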

Lessons Learned: Engineering Over Intuition

The journey from a baseline model to a pilot-ready prototype provided several critical insights into the behavior of applied Machine Learning in uncontrolled environments. Beyond the metrics, these are the engineering principles that defined the project’s success:

Data Dominates Architecture: Early iterations proved that modern detection architectures (YOLOv8) can handle dominant classes with ease, but consistently fail on minority ones. Meaningful improvement only occurred after intervening at the dataset level (stratification and synthetic injection), confirming that model capacity cannot compensate for poor data balance.
Augmentation Without Realism is Noise: Unconstrained transformations, such as vertical flips, risked violating real-world physics. Restricting our strategy to physically plausible transformations resulted in more stable learning and reduced spurious detections.
Resolution is a First-Class Hyperparameter: For high-frequency features like fiber cables and small markers, resolution choice has a larger impact than most optimizer-level tweaks. Increasing image size to 1248px was the only way to recover features that were otherwise lost to downsampling.
Synthetic Data as a Signal Amplifier: Injecting synthetic samples enabled the model to learn representations for underrepresented classes. However, we maintained a strict “zero-contamination” policy — synthetic data is a training tool, but validation must always remain 100% real-world to ensure statistical integrity.
Metrics Require Context: High mAP scores can be deceptive if the sample size is small. We learned to interpret performance in the context of sample size rather than as standalone indicators, especially for classes with fragile statistics.
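
To make the last point concrete, here is a small sketch of how sample size changes the uncertainty around a rate such as per-class recall. It uses the Wilson score interval, which is my choice for illustration and not something taken from the project; the counts are hypothetical. mAP itself is not a simple proportion, so treat this only as a picture of the sample-size effect.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Same observed rate (90%), very different sample sizes (hypothetical counts).
small = wilson_interval(9, 10)
large = wilson_interval(900, 1000)
small_width = small[1] - small[0]
large_width = large[1] - large[0]
```

The interval for 10 samples is far wider than the one for 1,000, even though both point estimates are 90%. A minority class with a handful of validation instances deserves the same skepticism.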

Conclusion: Transforming Pixels into Intelligence

This project demonstrates that the value of Computer Vision in a business context lies far beyond simple bounding boxes. By engineering a pipeline that respects real-world constraints — from the chaos of Indonesian streets to the nuances of class imbalance — we moved from a technical experiment to a functional Geospatial Business Intelligence tool.

By combining high-resolution detection with automated spatial data integration, we proved that raw street imagery can be transformed into actionable infrastructure insights at the regional level. While this prototype is just the beginning, it serves as a pragmatic blueprint for how AI can support competitive landscape analysis and infrastructure expansion in the telecommunications industry.

The full technical implementation of this project — including the modular training scripts, the stratified dataset engineering logic, and the geo-spatial inference engine — is available on GitHub.

I have documented every failure and breakthrough in the Experiment Log, providing a transparent look at how we moved from a baseline of 0.522 to a functional 0.763 mAP50.

I invite you to explore the repository, fork the code, and test the limits of the 1248px resolution strategy. Whether it’s optimizing the inference speed of the geo-spatial module or refining the synthetic data injection for minority classes, there is always room to squeeze out more performance from a real-world pipeline.

Check out the repo here:

GitHub - Muanai/telecom-infrastructure-object-detection: End-to-end object detection project for telecom infrastructure using real-world field data with challenging conditions such as rust, occlusion, and adverse weather

Credit Risk Feature Engineering with Python & Numba

Muanai Khalifah Revindo — Fri, 19 Dec 2025 08:50:15 GMT

50,000x Faster: Engineering “Stateful” Credit Risk Features with Python & Numba

Photo by Jonas Leupe on Unsplash

Processing behavioral data for 1 million users shouldn’t bring your pipeline to a halt. Yet, when I tried to capture complex delinquency streaks using standard Python loops, the estimated runtime was abysmal. This article documents how I rewrote the core engine using Numba to achieve a 50,000x speedup and a +4.2% AUC uplift — proving that high-performance engineering is essential for modern credit risk modeling.

The Problem: The “Static Feature” Blindspot

Most credit risk models share a fatal flaw: they rely too heavily on static aggregates. We traditionally condense months of transactional history into simple metrics like mean_bill_amount, total_payments, or max_utilization. While these features are easy to compute using standard SQL or Pandas group-by operations, they flatten the dimension of time, effectively blinding the model to behavioral trends.

Consider two hypothetical users, both with an average credit utilization of 50%:

User A (The Disciplined Repayer): Started with high debt (85%) but has consistently paid it down every single month, ending at a healthy 15%. (Risk: Decreasing)
User B (The Volatile Spender): Started low (15%), but exhibited erratic behavior — spiking to 65%, briefly recovering, then relapsing to a dangerous 85%. (Risk: High & Unstable)

Figure 1: Two distinct user profiles. User A [85, 70, 60, 45, 25, 15], while User B [15, 25, 65, 40, 70, 85].

To a model fed only static aggregates, these two users look identical. To catch the difference, we need stateful, temporal features — metrics that measure velocity, acceleration, and consecutive streaks (e.g., “How many months in a row has this user paid late?”).
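
A quick check with the Figure 1 values shows the blind spot. This sketch uses NumPy's np.polyfit to fit a straight line; the variable names are mine.

```python
import numpy as np

# Monthly utilization (%) for the two users in Figure 1.
user_a = np.array([85, 70, 60, 45, 25, 15], dtype=float)
user_b = np.array([15, 25, 65, 40, 70, 85], dtype=float)
months = np.arange(user_a.size)

# Static aggregate: identical for both users.
mean_a, mean_b = user_a.mean(), user_b.mean()

# Temporal feature: slope of a degree-1 fit (polyfit returns the slope first).
slope_a = np.polyfit(months, user_a, 1)[0]
slope_b = np.polyfit(months, user_b, 1)[0]
```

Both means come out to 50, while the slopes have opposite signs: User A trends downward and User B trends upward. The mean alone cannot tell them apart.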

The Engineering Bottleneck

Here lies the friction: calculating these stateful features is computationally expensive.

Standard Python Loops: Iterating through millions of rows to check conditions for every user is agonizingly slow (as shown in the benchmarks below).
Vectorized Pandas: While fast for simple math, Pandas struggles with complex conditional logic that depends on the previous row’s state (like resetting a counter when a payment is made).

We are often forced to choose: build a “dumb” model with fast simple features, or build a “smart” model with a pipeline that takes hours to run. This project aims to break that trade-off.

The Solution: Numba Powered Feature Engine

To break the trade-off between model complexity and pipeline latency, I turned to Numba, a Just-In-Time (JIT) compiler for Python.

Unlike standard Python, which interprets code line by line (adding significant overhead to loops), Numba translates a subset of Python and NumPy code into fast machine code using the LLVM compiler library. This allows us to write complex, stateful logic in pure Python syntax (specifically, for loops that track user history) while executing it at C-like speeds.

The Strategy: User-Time Matrix

The architecture is straightforward. First, I pivot the raw transaction logs into a dense User × Time matrix using NumPy. This data structure is cache-friendly and ideal for Numba to iterate over. Once structured, we apply our custom feature logic decorated with @njit, which compiles the loops to machine code and bypasses Python interpreter overhead for the heavy lifting. (Note that @njit on its own does not release the Global Interpreter Lock; that requires nogil=True.)
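
The pivot step might look like the sketch below. The function name and the long-form column layout (user id, zero-based month index, value) are assumptions for illustration; the actual repository may structure its input differently.

```python
import numpy as np


def pivot_to_matrix(user_ids, month_idx, values, n_months):
    """Pivot long-form rows (user, month, value) into a dense User x Time matrix."""
    user_ids = np.asarray(user_ids)
    month_idx = np.asarray(month_idx)
    values = np.asarray(values, dtype=np.float64)
    # Map each distinct user id to a row number 0..n_users-1.
    users, row_idx = np.unique(user_ids, return_inverse=True)
    row_idx = row_idx.ravel()
    # NaN marks months with no record for that user.
    matrix = np.full((users.size, n_months), np.nan)
    matrix[row_idx, month_idx] = values
    return users, matrix


users, matrix = pivot_to_matrix(
    user_ids=[10, 10, 20, 20, 20],
    month_idx=[0, 2, 0, 1, 2],
    values=[0.9, 0.4, 1.0, 1.0, 0.7],
    n_months=3,
)
```

The result is a C-contiguous float64 array, so each user's history sits in one contiguous row. If the same (user, month) pair appears twice, the later row silently wins, so deduplicate upstream.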

Benchmarking: The 50,000x Speedup

I tested three implementations to measure the execution time for calculating rolling window statistics (e.g., 3-month moving average). Even for this standard operation, moving from Python loops to Numba resulted in a massive speedup. The results were not just an improvement; they were a paradigm shift.

Figure 2: Visualizing the bottleneck removal: From 55 seconds to 0.0011 seconds using Numba JIT.
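
For readers who want to reproduce this kind of comparison, here is a minimal timing harness. It is a sketch under my own assumptions (function names, matrix size, a trailing 3-month mean), not the benchmark script behind Figure 2, and it falls back to plain Python if Numba is not installed. The warm-up call matters: the first invocation of a JIT-compiled function includes compilation time.

```python
import time
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


def rolling_mean_loop(matrix, window):
    """Trailing moving average per user; the first window-1 columns stay NaN."""
    n_rows, n_cols = matrix.shape
    out = np.full((n_rows, n_cols), np.nan)
    for i in range(n_rows):
        for j in range(window - 1, n_cols):
            total = 0.0
            for k in range(j - window + 1, j + 1):
                total += matrix[i, k]
            out[i, j] = total / window
    return out


# Same source, compiled version (identity if Numba is missing).
rolling_mean_jit = njit(cache=False)(rolling_mean_loop)


def time_call(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


rng = np.random.default_rng(0)
data = rng.random((2000, 12))

rolling_mean_jit(data[:10], 3)  # warm-up: triggers JIT compilation
ref, t_loop = time_call(rolling_mean_loop, data, 3)
fast, t_jit = time_call(rolling_mean_jit, data, 3)
```

Comparing t_loop with t_jit on your own hardware gives the speedup ratio. The exact figure depends on matrix size, window length, and machine, so it will not necessarily match the numbers reported here.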

Why This Matters

A runtime of 58 seconds might seem manageable for a small dataset, but in a production environment processing millions of users, that latency is unacceptable. Bringing the compute time down to roughly 1 millisecond means we are no longer constrained by cost or time. We can now afford to experiment with dozens of complex, multi-window behavioral features without stalling the entire data pipeline.

Engineering Behavioral Signals

With the computational constraints removed, we can move beyond simple sums and averages. I designed three flagship features specifically to capture the “story” of a user’s financial health over time. These features are stateful, meaning the value at month t depends on what happened at month t-1, which is exactly where Numba shines.

1. Consecutive Late Payment Streak

The Question: Is the user forgetting to pay, or are they insolvent?

A standard aggregation might tell you a user was late 3 times in a year. But there is a massive risk difference between a user who misses a payment every four months (random forgetfulness) and a user who misses payments for three months in a row (financial distress).

The Logic: I implemented a loop that increments a counter for every consecutive month a payment is missed (days_late > 0) and resets it to zero immediately upon a valid payment.
The Signal: This feature isolates systematic delinquency from stochastic errors. A high streak is one of the strongest predictors of an impending default.

2. Payment-to-Bill Velocity (Trend Slope)

The Question: Is the user’s ability to pay deteriorating?

Most models look at the current Payment-to-Bill ratio. I wanted to see the trajectory.

The Logic: This feature calculates the linear slope (gradient) of the user’s Payment-to-Bill ratio over a 6-month window.
A negative slope means the gap between what they owe and what they pay is widening (Danger).
A positive slope indicates recovery (Safe).
The Signal: This acts as an early warning system. A user might still be paying above the minimum threshold today, but a sharp negative velocity indicates they will likely breach it soon. Standard SQL struggles to calculate per-user regression slopes efficiently; Numba handles it effortlessly.

3. Critical Balance Utilization Count

The Question: How often is the user living on the edge?

High utilization (e.g., using 95% of a credit limit) isn’t always bad if it happens once. But repeated behavior suggests reliance on debt for liquidity.

The Logic: This counts the number of months where the user’s balance exceeded 90% of their credit limit.
The Signal: This captures financial stress. Users who consistently max out their cards are statistically more sensitive to external economic shocks and less likely to recover from a missed payment.
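
Since this third feature has no running counter, a vectorized NumPy sketch is enough to express it. The function name, argument layout, and sample values below are hypothetical; the 90% threshold comes from the description above.

```python
import numpy as np


def count_critical_utilization(balance, limit, threshold=0.9):
    """Count months per user where balance / limit exceeds the threshold."""
    balance = np.asarray(balance, dtype=np.float64)          # shape (users, months)
    limit = np.asarray(limit, dtype=np.float64).reshape(-1, 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        utilization = balance / limit
    # Zero or missing limits yield inf/NaN; exclude them instead of counting them.
    critical = np.isfinite(utilization) & (utilization > threshold)
    return critical.sum(axis=1).astype(np.int32)


balance = [[950.0, 400.0, 920.0],
           [100.0, 200.0, 300.0],
           [50.0, 60.0, 70.0]]
limit = [1000.0, 1000.0, 0.0]   # third user has a zero limit (dirty data)
counts = count_critical_utilization(balance, limit)
```

Whether a zero-limit account should count as critical is a business decision. This sketch treats it as not countable, which is an assumption you may want to reverse.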

Here is a glimpse into the engine. Notice how we implement the logic in pure Python syntax, yet Numba compiles this down to optimized machine code. We handle stateful streaks and even manual linear regression slope calculations without the overhead of external libraries.

import numpy as np
from numba import njit

@njit(cache=True)
def get_max_consecutive_late(pay_matrix):
    """
    Calculates the longest streak of late payments per user.
    Logic: Resets counter to 0 immediately upon on-time payment.
    """
    n_rows, n_cols = pay_matrix.shape
    result = np.zeros(n_rows, dtype=np.int32)

    for i in range(n_rows):
        max_streak = 0
        current_streak = 0
        for j in range(n_cols):
            if pay_matrix[i, j] > 0: # 1 = Late
                current_streak += 1
            else:
                if current_streak > max_streak:
                    max_streak = current_streak
                current_streak = 0 # Reset state
        
        # Final check for streak at the end of window
        if current_streak > max_streak:
            max_streak = current_streak
        result[i] = max_streak
    return result

@njit(cache=True)
def get_velocity_slope(ratio_matrix):
    """
    Computes linear trend (slope) of payment ratios manually.
    Closed-form least squares over x = 0..n_cols-1 avoids SciPy overhead.
    """
    n_rows, n_cols = ratio_matrix.shape
    slopes = np.zeros(n_rows, dtype=np.float64)
    if n_cols < 2:
        return slopes  # Slope is undefined for a single point
    # Pre-calculated terms for x = [0, 1, ..., n-1]
    n = float(n_cols)
    sum_x = n * (n - 1.0) / 2.0
    sum_x2 = (n - 1.0) * n * (2.0 * n - 1.0) / 6.0
    denom = n * sum_x2 - sum_x * sum_x  # Equals 6.0 when n = 3

    for i in range(n_rows):
        sum_y = 0.0
        sum_xy = 0.0
        for x in range(n_cols):
            y = ratio_matrix[i, x]
            if np.isnan(y) or np.isinf(y):
                y = 0.0  # Handle dirty data
            sum_y += y
            sum_xy += x * y

        # Manual slope formula: (n*sum(xy) - sum(x)*sum(y)) / denom
        num = (n * sum_xy) - (sum_x * sum_y)
        slopes[i] = num / denom
    return slopes
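
As a sanity check on the manual slope formula, the sketch below compares the closed-form least-squares slope against np.polyfit. The helper name and the sample ratios are hypothetical. For a three-point window with x = [0, 1, 2], the denominator n*sum(x^2) - sum(x)^2 works out to 3*5 - 9 = 6.

```python
import numpy as np


def closed_form_slope(y):
    """Least-squares slope of y against x = 0, 1, ..., n-1."""
    y = np.asarray(y, dtype=np.float64)
    n = y.size
    x = np.arange(n, dtype=np.float64)
    denom = n * np.sum(x * x) - np.sum(x) ** 2
    return (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom


ratios = np.array([1.0, 0.8, 0.5])   # hypothetical payment-to-bill ratios
manual = closed_form_slope(ratios)
reference = np.polyfit(np.arange(ratios.size), ratios, 1)[0]
```

The two values agree to floating-point precision. The same comparison works for any window length, which is a cheap way to catch a hardcoded constant that no longer matches the matrix width.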

Impact Study: Does it Actually Move the Needle?

Engineering cool features is satisfying, but in credit risk, if it doesn’t improve the AUC (Area Under the Receiver Operating Characteristic Curve), it’s just technical debt. To measure the real-world value of these Numba-engineered signals, I conducted a controlled experiment using LightGBM (for non-linear interactions) and Logistic Regression (for linear interpretability).

I evaluated three distinct feature sets on the same hold-out test set:

Scenario A (Baseline): Standard magnitude aggregates only (e.g., Mean Bill, Total Payment).
Scenario B (Behavioral Only): Only the Numba-engineered temporal features (e.g., Streaks, Velocity).
Scenario C (Hybrid): The combination of both.

The Results

The performance metrics painted a clear picture:

========================================
FINAL LEADERBOARD
========================================
1. Hybrid AUC   : 0.6619 (+4.2%)
2. Baseline AUC : 0.6355
3. Behavior AUC : 0.6211 (-2.2%)
========================================

Analysis: The Power of Interaction

At first glance, the Behavioral-only model (Scenario B) underperformed the Baseline. This is expected: knowing how a user’s payments are trending (velocity) isn’t useful if you don’t know how much they owe (magnitude).

However, the magic happens in Scenario C (Hybrid). The combination lifted AUC from 0.6355 to 0.6619, an absolute gain of about 0.026 (+4.2% relative to the baseline). This suggests that behavioral features act as powerful orthogonal signals. They provide context that static aggregates lack.
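
Absolute and relative uplift are easy to conflate, so here is the arithmetic spelled out using the rounded leaderboard values. The helper name is mine. Because the inputs are rounded to four decimals, the last digit may differ slightly from figures computed on unrounded scores.

```python
baseline_auc = 0.6355
hybrid_auc = 0.6619
behavior_auc = 0.6211


def uplift(new, base):
    """Return (absolute difference, relative difference in percent)."""
    absolute = new - base
    relative_pct = 100.0 * absolute / base
    return absolute, relative_pct


hybrid_abs, hybrid_rel = uplift(hybrid_auc, baseline_auc)
behavior_abs, behavior_rel = uplift(behavior_auc, baseline_auc)
```

The percentages on the leaderboard are relative changes against the baseline AUC, not percentage points of AUC.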

High Debt (Static) is bad.
High Debt + Negative Payment Velocity (Behavioral) is catastrophic.

The model learned these interactions, and the SHAP analysis supports our hypothesis: feat_pay_velocity and feat_max_late_streak didn’t just make the list; they ranked immediately after the primary financial aggregates. This suggests the model isn’t just “using” these features; it relies on them to capture the behavioral volatility that baseline averages completely miss.

Figure 3: Standard aggregates tell us capacity, but our engineered features tell us character. The high ranking of feat_max_late_streak proves that the model relies heavily on consistency patterns to detect default risk.

Production Readiness: QA & Stability

In a regulated industry like Fintech, a broken pipeline is often worse than a non-existent one. High accuracy means nothing if the feature engine crashes on a generic edge case or if the model silently degrades due to data drift. To bridge the gap between “notebook prototype” and “industrial solution,” I implemented two layers of defense.

1. Unit Testing the Numba Engine (pytest)

Since Numba JIT-compiles Python code into machine code, debugging runtime errors can be opaque. To guarantee logical correctness, I built a rigorous test suite using pytest.

Edge Case Coverage: The tests specifically target data anomalies common in financial logs: users with zero bills (preventing division-by-zero errors in velocity calculations), NaN values (missing data), and infinite values.
Logic Verification: I validated that the “Streak” logic correctly resets after a payment and handles rolling windows accurately across thousands of randomized synthetic test cases.
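
A few pytest-style checks along these lines might look like the following. This is a sketch, not the repository's test suite: it tests a pure-Python reference implementation (the name max_consecutive_late is hypothetical) that can also serve as an oracle for the compiled function on randomized inputs.

```python
import numpy as np


def max_consecutive_late(pay_matrix):
    """Pure-Python reference for the streak logic (no Numba), usable as an oracle."""
    pay_matrix = np.asarray(pay_matrix)
    result = np.zeros(pay_matrix.shape[0], dtype=np.int32)
    for i, row in enumerate(pay_matrix):
        best = current = 0
        for value in row:
            current = current + 1 if value > 0 else 0
            best = max(best, current)
        result[i] = best
    return result


def test_streak_resets_after_payment():
    assert max_consecutive_late([[1, 1, 0, 1]])[0] == 2


def test_streak_at_end_of_window_is_counted():
    assert max_consecutive_late([[0, 1, 1, 1]])[0] == 3


def test_never_late_user_scores_zero():
    assert max_consecutive_late([[0, 0, 0, 0]])[0] == 0


def test_nan_is_not_counted_as_late():
    # NaN > 0 evaluates to False, so a missing month breaks the streak.
    assert max_consecutive_late([[1.0, np.nan, 1.0]])[0] == 1
```

Running pytest on a file containing these functions collects every test_* function automatically. The NaN case is worth pinning down explicitly, because whether a missing month should break a streak is a modeling decision, not a given.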

2. Drift Monitoring with PSI (Population Stability Index)

A credit model is only as good as the data feeding it. If the distribution of “Late Streaks” changes drastically next month (e.g., due to a macroeconomic shift), the model needs to raise an alarm. I integrated a Population Stability Index (PSI) check to compare the distribution of engineered features between the training set and the test set.

The Threshold: A common industry rule of thumb treats a PSI < 0.1 as stable, 0.1 to 0.25 as a moderate shift worth investigating, and > 0.25 as a significant shift.
Current Status: The pipeline is currently Green (Stable) with a PSI of 0.0022.
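
For reference, PSI sums (actual% - expected%) * ln(actual% / expected%) over bins. The sketch below is one common way to compute it, with bin edges taken from quantiles of the training distribution. The function name, the bin count, and the epsilon floor are my assumptions, not necessarily what the repository uses.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference (training) sample and a new sample."""
    expected = np.asarray(expected, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    # Interior bin edges from quantiles of the reference distribution.
    quantiles = np.linspace(0.0, 1.0, bins + 1)[1:-1]
    interior = np.unique(np.quantile(expected, quantiles))

    def bin_share(x):
        # searchsorted assigns out-of-range values to the first or last bin.
        idx = np.searchsorted(interior, x, side="right")
        counts = np.bincount(idx, minlength=interior.size + 1)
        return np.clip(counts / x.size, eps, None)  # floor avoids log(0)

    exp_pct = bin_share(expected)
    act_pct = bin_share(actual)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5000)
psi_same = population_stability_index(train, train)
psi_shifted = population_stability_index(train, train + 2.0)
```

Identical samples give a PSI of zero, while a two-standard-deviation shift lands far above the 0.25 mark. For discrete features such as streak counts, many quantile edges collapse onto the same value, so binning by distinct value may be a better fit.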

This monitoring layer ensures that the system is not just performant, but also robust enough to be deployed in a real-world, automated decision engine.

Final Verdict

There is often a misconception in data science that “complex engineering” is unnecessary overhead. This project proves the opposite: performance is a prerequisite for model intelligence.

By optimizing the core feature engine with Numba, I didn’t just save time; I unlocked a new class of features. The 50,000x speedup meant I wasn’t forced to abandon stateful, temporal logic just because Pandas couldn’t handle it.

The impact is clear:

Speed: From ~58 seconds to ~1 millisecond per batch.
Quality: From static aggregates to rich, behavioral signals.
Value: A +4.2% AUC uplift, proving that the model was starving for this context.

My takeaway is simple: If your current tools (SQL or Pandas) are too slow to calculate the features you believe matter, do not “dumb down” the model. Optimize the engine. In the high-stakes world of credit risk, a 4.2% improvement isn’t just a metric — it’s a competitive advantage that directly impacts the bottom line.

Explore the Code: The full source code, including the Numba optimization patterns and the validation pipeline, is available on GitHub. I invite you to fork the repo, break the benchmarks, and see if you can squeeze out even more performance.

GitHub - Muanai/credit-risk-feature-engine: Credit risk feature engine using Numba for stateful temporal logic, benchmarked for speed and validated by model impact.

When Frogs Sing: Discovering Hidden Patterns with Unsupervised Learning

Muanai Khalifah Revindo — Fri, 04 Jul 2025 13:06:58 GMT

How I used K-Means and t-SNE to uncover acoustic structure in frog species — without a single label.

Ribbit Meets AI

What do frogs and data science have in common? Turns out — quite a lot. In this project, I explored a real-world dataset of frog vocalizations, each broken down into 22 Mel-Frequency Cepstral Coefficients (MFCCs), and tried to uncover natural groupings of frog species using unsupervised clustering.

No labels. No supervision. Just raw frog croaks and some machine learning magic.

The Dataset: Bioacoustics in the Wild

Frog calls aren’t just cute background noise in the rainforest — they’re complex acoustic fingerprints. In this project, I explore how machine learning can uncover hidden clusters in their calls using only sound-based features.

This dataset contains 7,195 syllables extracted from 10 frog species across 4 families. Each instance is a frame of audio (or a “syllable”) with 22 MFCCs extracted from the waveform.

The original audio was recorded in situ, meaning the frogs weren’t in sterile labs — they were in the jungle, surrounded by real-world noise. This makes the problem more challenging (and exciting).

Why MFCC?

MFCCs are widely used in speech and audio processing as they capture the perceptual and structural aspects of sound. In the case of frogs, these coefficients encode crucial patterns in their calls — such as pitch, resonance, and rhythm — that are biologically and behaviorally distinctive.

Goal: Can We Cluster These Croaks?

Without using any label information, I wanted to:

Cluster the 7,195 MFCC vectors
Evaluate how well those clusters align with the true species labels
Visualize the results with PCA and t-SNE

Preprocessing: A Little Help from PCA

Before clustering, I used PCA to reduce dimensionality for visualization (and denoising):

from sklearn.decomposition import PCA

# X_scaled: the standardized MFCC feature matrix (n_samples x 22)
pca = PCA(n_components=2)  # two components for 2D plotting
X_pca = pca.fit_transform(X_scaled)

K-Means Clustering: Let the Grouping Begin

I tried k from 2 to 10, and evaluated using Silhouette Score.

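from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sweep k and record the mean silhouette score of each clustering
k_values = list(range(2, 11))
silhouette_scores = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

# Pick the k with the highest silhouette score
best_k = k_values[silhouette_scores.index(max(silhouette_scores))]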

Best k: 5, with silhouette score = 0.3543
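
Here is a self-contained version of the same selection loop that runs without the frog dataset. The synthetic 22-dimensional blobs stand in for the MFCC vectors and are purely an assumption for illustration, as are the explicit n_init setting and the narrower k range.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MFCC data: 4 well-separated groups in 22 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 5.0, size=(4, 22))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(100, 22)) for c in centers])

# Standardize so no single coefficient dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
```

On clean synthetic data the silhouette peak sits at the true number of groups. On real field recordings the curve is much flatter, which is one reason a best k of 5 can coexist with 10 labelled species.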

Evaluation: Do Clusters Match Species?

Using true species labels, I evaluated clustering quality with:

ARI (Adjusted Rand Index): 0.7956
NMI (Normalized Mutual Information): 0.6660
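
Both metrics are available in scikit-learn and ignore how cluster ids are numbered. The toy labels below are hypothetical and only show how the scores behave.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_species = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
perfect = [2, 2, 2, 0, 0, 0, 1, 1, 1]   # same grouping, different id numbering
merged = [0, 0, 0, 0, 0, 0, 1, 1, 1]    # species A and B fall into one cluster

ari_perfect = adjusted_rand_score(true_species, perfect)
nmi_perfect = normalized_mutual_info_score(true_species, perfect)
ari_merged = adjusted_rand_score(true_species, merged)
```

A relabelled but otherwise perfect clustering still scores 1.0 on both metrics, while merging two species into one cluster drops the ARI to 0.5 in this toy case.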

That’s surprisingly high, considering we used no label info!

Visualizing with PCA

Nice separation between clusters, but PCA struggles with non-linear structures. So…

Enter t-SNE: Visualizing Frogs in 2D

Now that’s more like it. t-SNE revealed beautifully separated clusters, especially for species like:

AdenomeraHylaedactylus (dominated Cluster 0, ~97%)
AdenomeraAndre (88% of Cluster 2)
HypsiboasCordobae (52% of Cluster 4)
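
For anyone who wants to reproduce this kind of projection, a minimal t-SNE call with scikit-learn looks like the sketch below. The synthetic two-group data and the parameter choices (perplexity 30, PCA initialization) are assumptions for illustration; the notebook's actual settings may differ.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in: two groups of 22-dimensional vectors.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 22)),
    rng.normal(6.0, 1.0, size=(60, 22)),
])

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=42)
X_tsne = tsne.fit_transform(X_demo)
```

Keep in mind that t-SNE preserves local neighborhoods rather than global distances, so the gaps between islands in the plot should not be read as a measure of how different two species are.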

But what does this mean?

Interpreting the Frogs’ Hidden Dialects

Looking deeper into the cluster composition:

Cluster 0 is practically a perfect island for AdenomeraHylaedactylus. Its acoustic profile must be extremely unique — potentially due to distinct syllable length or harmonic patterns.
Cluster 2 catches AdenomeraAndre almost exclusively. It seems this species has a separate acoustic signature, distinguishable even from its sibling species in the same genus, AdenomeraHylaedactylus.
Cluster 4 is more mixed, but still reflects strong presence of HypsiboasCordobae. That species might share features with others but still forms a meaningful core.
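
Composition figures like these come from cross-tabulating cluster ids against species labels. A small standard-library sketch is shown below; the function name and the toy assignments are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, species):
    """Map each cluster id to (dominant species, share of that species)."""
    by_cluster = defaultdict(Counter)
    for cid, name in zip(cluster_ids, species):
        by_cluster[cid][name] += 1
    summary = {}
    for cid, counts in by_cluster.items():
        top_species, top_count = counts.most_common(1)[0]
        summary[cid] = (top_species, top_count / sum(counts.values()))
    return summary


# Toy assignments, not the real clustering output.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
species = ["AdenomeraHylaedactylus"] * 4 + [
    "HypsiboasCordobae", "HypsiboasCordobae", "ScinaxRuber", "HylaMinuta",
]
summary = cluster_composition(cluster_ids, species)
```

A share near 1.0 marks a "pure" cluster. A share near 0.5 marks a mixed one whose dominant species still forms a recognizable core.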

The lower-purity clusters (like Cluster 3) suggest:

Acoustic overlap between different genera (possibly due to shared environments or mimicry)
Limitations of MFCCs in distinguishing species with similar call structures

This result supports the bioacoustic hypothesis: that frog calls are evolutionarily shaped to stand out within ecosystems, but may still overlap where habitats or behaviors converge.

Technical Takeaways

High ARI (0.7956) means the K-Means partition agrees closely with the true species labels, even though 5 clusters cannot map one-to-one onto 10 species.
t-SNE captured the local structure of MFCC space, making acoustic boundaries visible.
K-Means + PCA + MFCCs forms a surprisingly strong baseline pipeline for acoustic clustering.

Final Thoughts

Nature sings, and machine learning listens.

This project showed how unsupervised learning can uncover natural structure in animal behavior — even without labels. Whether you’re into frogs, sound, or unsupervised learning, there’s a lot more to explore.

Curious about how frogs form clusters?

👉 Run the interactive notebook on Google Colab

Thanks for reading!