Introduction to MineLogX AI, an Open-Source IoT Pipeline Framework
Scope: This specification defines a cloud-agnostic, secure, and batch-mode data pipeline for collecting IoT sensor data from mining operations across geographically distributed regions. It supports deployment on AWS, Azure, GCP, Oracle Cloud, Huawei Cloud, Tencent Cloud, and IBM Cloud, with strong network isolation and horizontal scalability.
I. On-Prem Data Collection Agent
The on-premises agent is responsible for collecting IoT sensor data from industrial mining equipment through configurable interfaces to plant data historians such as AspenTech IP.21 or OSIsoft PI.
Data is captured at scheduled intervals and exported in CSV or XML format, with each record including timestamp, region_id, site_id, sensor_id, and quality_flag.
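As a concrete illustration of this record layout, a minimal export routine could look like the Python sketch below. Only timestamp, region_id, site_id, sensor_id, and quality_flag are fixed by this specification; the value column, the identifier strings, and the helper name write_batch are illustrative assumptions.

# Minimal sketch of the CSV export record layout described above.
# Only the listed field names come from the specification; the "value" column,
# identifier strings, and sample reading are illustrative assumptions.
import csv
from datetime import datetime, timezone

FIELDS = ["timestamp", "region_id", "site_id", "sensor_id", "value", "quality_flag"]

def write_batch(path, readings):
    """Write one batch of readings to a CSV file with the required columns."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(readings)

sample = [{
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "region_id": "REG-01",       # placeholder region identifier
    "site_id": "SITE-007",       # placeholder site identifier
    "sensor_id": "TEMP-1042",    # placeholder sensor identifier
    "value": 73.4,               # illustrative measurement
    "quality_flag": "GOOD",      # illustrative quality code
}]
write_batch("batch.csv", sample)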
The agent buffers data locally during outages and resumes upload upon network restoration.
Communication is strictly outbound-only, and the agent must not allow inbound connections or expose remote management interfaces. Filenames include both region_id and site_id for validation and traceability.
Configuration is externalized, and local logging enables runtime monitoring and error tracking. This architecture supports air-gapped, network-isolated deployments.
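One possible shape for the agent's buffer-and-retry cycle is sketched below in Python. The configuration keys, log and buffer paths, filename pattern, and the injected upload_file callable are assumptions made for illustration; only the outbound-only upload, local buffering during outages, and the region_id/site_id filename convention come from this specification.

# Sketch of the agent's externalized configuration, filename convention,
# and buffer-and-retry cycle. All names and paths here are illustrative.
import json
import logging
from datetime import datetime, timezone
from pathlib import Path

logging.basicConfig(filename="agent.log", level=logging.INFO)

def load_config(path="agent_config.json"):
    """Externalized configuration: identifiers, intervals, buffer directory."""
    with open(path) as f:
        return json.load(f)

def batch_filename(region_id, site_id):
    """Filenames embed region_id and site_id for validation and traceability."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{region_id}_{site_id}_{stamp}.csv"

def flush_buffer(cfg, upload_file):
    """Retry every buffered batch; a file is deleted only after a successful
    outbound-only upload, so data survives network outages."""
    for path in sorted(Path(cfg["buffer_dir"]).glob("*.csv")):
        try:
            upload_file(path)      # outbound HTTPS call supplied by the caller
            path.unlink()
            logging.info("uploaded %s", path.name)
        except Exception as exc:   # network still down: keep the file buffered
            logging.warning("upload failed for %s: %s", path.name, exc)
            break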
II. Cloud Data Lake Storage
The cloud data lake is the centralized raw data repository. Data is organized hierarchically by region_id, site_id, and date, using the following structure:
/iot-data/
└── region_id=<REGION>/
    └── site_id=<SITE>/
        └── date=<YYYYMMDD>/
This structure enables scalable ingestion, region-based partitioning, and efficient analytics. Authenticated agents upload data using scoped service accounts with write-only access at the region/site level.
Object storage must support encryption at rest, data integrity via checksums, and lifecycle management for archival and retention policies. The design uses standardized cloud services to remain compatible across cloud platforms.
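For concreteness, the sketch below shows an AWS-flavored upload into this layout using boto3; equivalent SDK calls exist on the other supported clouds. The bucket name, credential handling, and the checksum-as-metadata convention are assumptions for illustration.

# AWS-flavored sketch of an authenticated upload into the partitioned layout above.
# boto3 is only one concrete example of the cloud-agnostic pattern; the bucket
# name and checksum-as-metadata convention are illustrative assumptions.
import hashlib
from pathlib import Path

import boto3

def object_key(region_id, site_id, date_yyyymmdd, filename):
    """Build the key following /iot-data/region_id=.../site_id=.../date=.../"""
    return (f"iot-data/region_id={region_id}/site_id={site_id}/"
            f"date={date_yyyymmdd}/{filename}")

def upload_batch(local_path, bucket, region_id, site_id, date_yyyymmdd):
    local_path = Path(local_path)
    md5 = hashlib.md5(local_path.read_bytes()).hexdigest()   # integrity checksum
    key = object_key(region_id, site_id, date_yyyymmdd, local_path.name)
    # Credentials come from the scoped, write-only service account for this region/site.
    s3 = boto3.client("s3")
    s3.upload_file(str(local_path), bucket, key, ExtraArgs={"Metadata": {"md5": md5}})
    return key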
III. Cloud-Based Data Processing Services
Apache Spark (or equivalent) jobs are used to process incoming data in batch mode.
Processing jobs scan /iot-data/region_id=<REGION>/site_id=<SITE>/date=<YYYYMMDD>/ for new files, optionally using event-driven triggers. Unprocessed files are tracked through manifests and checkpoints.
Jobs validate, cleanse, and transform data while retaining region_id and site_id in the output schema. Aggregated metrics and summaries are computed by region and site. The system must support both full and incremental processing patterns, and scale to handle multi-terabyte daily volumes. Failed batches should be quarantined and logged, with monitoring and alerting integrated into native observability tools (e.g., CloudWatch, Azure Monitor).
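A minimal PySpark sketch of this validation-and-aggregation step is shown below. The lake and quarantine paths, the batch date, the value column, and the specific validity rules are assumptions; the specification only requires that region_id and site_id are retained, that metrics are aggregated by region and site, and that failed data is quarantined and logged.

# PySpark sketch of one incremental batch: read a date partition, validate,
# quarantine bad rows, and aggregate by region and site.
# Paths, the batch date, and the "value" column are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("minelogx-batch").getOrCreate()

batch_date = "20250101"   # illustrative batch date
raw = (spark.read
       .option("header", True)
       .csv(f"s3://minelogx-lake/iot-data/region_id=*/site_id=*/date={batch_date}/"))

# Basic validation: required identifiers present and an acceptable quality_flag.
is_valid = (F.col("region_id").isNotNull()
            & F.col("site_id").isNotNull()
            & F.col("sensor_id").isNotNull()
            & (F.col("quality_flag") == "GOOD"))
valid = raw.filter(is_valid)
quarantined = raw.exceptAll(valid)   # everything else is quarantined for review
quarantined.write.mode("append").parquet("s3://minelogx-lake/quarantine/")  # assumed path

# Aggregated metrics by region and site, retaining both identifiers in the output.
summary = (valid.groupBy("region_id", "site_id")
           .agg(F.count("*").alias("record_count"),
                F.avg(F.col("value").cast("double")).alias("avg_value")))
summary.write.mode("overwrite").partitionBy("region_id", "site_id").parquet(
    "s3://minelogx-lake/curated/daily_summary/")   # assumed output location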
IV. Cloud-Based Data Warehouse
The final stage loads cleaned and structured data into a cloud data warehouse for analytics. Tables are partitioned by region_id, site_id, and date to optimize performance for time-series and location-based queries.
The warehouse schema follows dimensional modeling principles and supports integration with BI tools via SQL, APIs, or dashboards. Sub-second query performance is expected for common filters. Data loading processes must ensure referential integrity and consistency. Access is enforced via IAM roles and MFA, with audit logging for compliance.
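One way to express this partitioned load is sketched below, using Spark's table writer as a stand-in for the platform-specific warehouse loader (e.g., Redshift COPY, BigQuery load jobs, or Snowflake COPY INTO). The database and table names, the source path, and the literal batch date are assumptions.

# Sketch of loading curated output into a warehouse table partitioned by
# region_id, site_id, and date. Table, path, and date values are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("minelogx-load")
         .enableHiveSupport()
         .getOrCreate())

curated = spark.read.parquet("s3://minelogx-lake/curated/daily_summary/")  # assumed path

(curated
 .withColumn("date", F.lit("20250101"))           # illustrative batch date
 .write
 .mode("append")
 .partitionBy("region_id", "site_id", "date")
 .saveAsTable("analytics.sensor_daily_summary"))  # assumed warehouse table name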
V. Optional Enhancements
1. Service-Level Objectives (SLOs)
   • Data availability in warehouse: ≤ 6 hours from collection
   • Upload success rate: ≥ 99.9%
   • Query latency (95th percentile): ≤ 2 seconds
2. Data Contracts (a validation sketch follows this list)
   • Schema validation using Avro or JSON Schema
   • Enumerations for fields such as sensor_type, unit, quality_flag
   • Per-file size thresholds (e.g., ≤ 100 MB per batch)
3. Compliance & Retention
   • Region-specific retention policies (e.g., 1 year raw, 5 years curated)
   • GDPR/CCPA-compliant metadata logging
   • Secure wipe or transition to cold storage for aged data
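To make the data-contract item concrete, the sketch below validates records against a JSON Schema and enforces a per-file size threshold in Python using the jsonschema package. The enumeration values and helper names are illustrative assumptions; only the required field names, the enumerated columns, and the 100 MB example threshold come from the list above.

# Sketch of a JSON Schema data contract with enumerated fields and a per-file
# size check. Enum values and helper names are illustrative assumptions.
import os
from jsonschema import ValidationError, validate

RECORD_SCHEMA = {
    "type": "object",
    "required": ["timestamp", "region_id", "site_id", "sensor_id", "quality_flag"],
    "properties": {
        "timestamp": {"type": "string", "format": "date-time"},
        "region_id": {"type": "string"},
        "site_id": {"type": "string"},
        "sensor_id": {"type": "string"},
        "sensor_type": {"enum": ["temperature", "pressure", "vibration"]},  # assumed values
        "unit": {"enum": ["C", "kPa", "mm_s"]},                             # assumed values
        "quality_flag": {"enum": ["GOOD", "SUSPECT", "BAD"]},               # assumed values
    },
}

MAX_BATCH_BYTES = 100 * 1024 * 1024   # ≤ 100 MB per batch

def check_batch(path, records):
    """Reject a batch that exceeds the size threshold or violates the schema."""
    if os.path.getsize(path) > MAX_BATCH_BYTES:
        raise ValueError(f"{path} exceeds the per-file size threshold")
    for record in records:
        try:
            validate(instance=record, schema=RECORD_SCHEMA)
        except ValidationError as exc:
            raise ValueError(f"contract violation in {path}: {exc.message}") from exc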