
Technical Challenges
- Legacy systems and infrastructure challenges: The existing infrastructure relied on outdated ETL tools and lacked support for modern streaming data frameworks.
- Technical debt or architectural limitations: A fragmented architecture with siloed systems introduced inconsistencies and hindered data unification.
- Integration requirements: Integration was needed across multiple AWS services, partner data feeds, IoT sources, and legacy RDS instances.
- Scalability, reliability, or performance issues: The previous solution was not scalable for global operations and suffered from latency during peak usage.
- Data challenges: Inconsistent schemas, lack of data versioning, and minimal observability made data debugging and auditing difficult.
- Security and compliance requirements: The platform had to implement role-based access, encryption for sensitive data, and adherence to regional compliance regulations.
Partner Solution
Solution Overview
Xenonstack designed a two-layer modular architecture on Amazon EKS:
Data Ingestion & Processing Layer
This layer supports both real-time and batch ingestion from diverse sources across the customer supply chain:
- AWS DMS streams transactional data (orders, inventory, shipments) from Amazon RDS into Apache Kafka topics.
- IoT and partner data are also streamed into Kafka for event-driven processing.
- Apache Spark on EKS handles batch transformations and structured stream processing, writing output directly to Apache Iceberg tables on Amazon S3 (a sketch of this streaming path follows this list).
- Apache Airflow orchestrates data pipelines and dependency-based workflows using container-native DAGs.
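As a rough illustration of the streaming path described above, the following PySpark sketch reads CDC events from a Kafka topic and appends them to an Iceberg table on S3. The topic, broker address, bucket, catalog, and schema are hypothetical placeholders rather than the customer's actual values, and the production jobs would include the real transformation logic.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Iceberg-enabled Spark session backed by S3. Requires the spark-sql-kafka and
# iceberg-spark-runtime packages on the classpath; names below are placeholders.
spark = (
    SparkSession.builder.appName("orders-stream-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3a://example-lakehouse/warehouse")
    .getOrCreate()
)

# Assumed shape of the order events that DMS publishes to the Kafka topic.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("sku", StringType()),
    StructField("quantity", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read the CDC stream from Kafka and parse the JSON payload.
orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "orders-cdc")
    .load()
    .select(from_json(col("value").cast("string"), order_schema).alias("o"))
    .select("o.*")
)

# Append the parsed events to a partitioned Iceberg table on S3.
query = (
    orders.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://example-lakehouse/checkpoints/orders")
    .toTable("lakehouse.supply_chain.orders")
)
query.awaitTermination()
```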
Lakehouse Storage & Analytics Layer
This layer provides governed, queryable storage and interactive BI dashboards:
- Transformed data is stored in Apache Iceberg tables (partitioned and ACID-compliant) on Amazon S3, supporting schema evolution and time travel.
- Trino, deployed on EKS, serves as the SQL query engine over the Iceberg datasets, enabling fast, federated analytics (see the query sketch after this list).
- Apache Superset dashboards give business users real-time visibility into supply chain KPIs without engineering dependency.
- All services are monitored with Prometheus and Grafana and are integrated into the Kubernetes control plane.
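To illustrate how dashboards and analysts can query the lakehouse, here is a minimal sketch using the Trino Python client against the Iceberg catalog. The coordinator hostname, schema, and table used in the query are assumptions for illustration, not details of the actual deployment.

```python
import trino

# Connect to the Trino coordinator service running on EKS (placeholder hostname).
conn = trino.dbapi.connect(
    host="trino.analytics.svc.cluster.local",
    port=8080,
    user="analytics",
    catalog="iceberg",
    schema="supply_chain",
)
cur = conn.cursor()

# Example KPI query: late shipments per region, read directly from Iceberg on S3.
cur.execute("""
    SELECT region,
           count(*) AS late_shipments
    FROM shipments
    WHERE status = 'LATE'
      AND ship_date >= date '2024-01-01'
    GROUP BY region
    ORDER BY late_shipments DESC
""")
for region, late in cur.fetchall():
    print(region, late)
```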
AWS Services Used
- Amazon RDS: Source for transactional data
- AWS DMS: Data migration and replication from RDS into Kafka (an illustrative endpoint and task configuration follows this list)
- AWS IAM: Role-based access control (RBAC) for scoped access
- Amazon CloudWatch, Prometheus, Grafana: Monitoring and observability
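As a hedged sketch of how the DMS piece can be wired up with boto3, the snippet below creates a Kafka target endpoint and an ongoing replication task from RDS. All identifiers, ARNs, broker addresses, and table mappings are placeholders, not the values used in this engagement.

```python
import json
import boto3

dms = boto3.client("dms", region_name="eu-west-1")  # placeholder region

# Target endpoint pointing DMS at the Kafka cluster; broker and topic are placeholders.
target = dms.create_endpoint(
    EndpointIdentifier="supply-chain-kafka-target",
    EndpointType="target",
    EngineName="kafka",
    KafkaSettings={
        "Broker": "kafka-broker:9092",
        "Topic": "orders-cdc",
    },
)

# Ongoing replication (full load + CDC) from the existing RDS source into Kafka.
dms.create_replication_task(
    ReplicationTaskIdentifier="rds-to-kafka-cdc",
    SourceEndpointArn="arn:aws:dms:placeholder:endpoint:rds-source",       # placeholder ARN
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:placeholder:rep:replication-inst",  # placeholder ARN
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-supply-chain-tables",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```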
Architecture Diagram
Implementation Details
Xenonstack implemented the solution over an 11-month period using Agile methodology and DevOps automation. The team began with stakeholder workshops to identify key operational KPIs, which helped guide the architecture and prioritisation.
- How the solution was implemented: RDS transactional data was streamed into Kafka using AWS DMS. Apache Spark jobs, configured for both batch and streaming, ran on Amazon EKS. Apache Airflow orchestrated the data pipelines (an illustrative DAG follows this list), while Iceberg tables on S3 stored the transformed data. Trino on EKS enabled federated queries, and Superset provided visual dashboards for business users.
- Methodology used: Agile sprints guided iterative development and testing. DevOps best practices were followed, and GitOps pipelines were used for deployment automation.
- Migration approach: The legacy ETL system was replaced in phases, starting with the ingestion pipelines, then the processing logic, and finally the dashboarding tools. This ensured continuity of business operations.
- Integration with existing systems: The platform was integrated with Amazon RDS, AWS IAM, and external data providers. Kafka connected IoT and partner data sources.
- Security and compliance considerations: IAM-based RBAC was enforced at the Kubernetes and data-access levels. Encryption and compliance policies were aligned with GDPR and regional supply chain regulations.
- Deployment and testing strategy: Components were containerised and deployed via Helm on EKS. Integration and load testing were performed using automated test suites. Grafana and Prometheus were configured for observability.
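The following Airflow DAG is a simplified sketch of the container-native orchestration pattern described above, with each task launching a Spark job image on EKS. Image names, namespaces, the schedule, and the operator import path (which varies by Airflow and provider version) are illustrative assumptions rather than the actual pipeline definitions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Container-native DAG: each task runs a Spark job container on the EKS cluster.
with DAG(
    dag_id="supply_chain_daily_batch",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",          # nightly batch window (placeholder)
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:

    transform_orders = KubernetesPodOperator(
        task_id="transform_orders",
        namespace="data-pipelines",
        image="registry.example.com/spark-jobs:latest",   # placeholder image
        cmds=["spark-submit"],
        arguments=["local:///opt/jobs/transform_orders.py"],
    )

    refresh_forecast = KubernetesPodOperator(
        task_id="refresh_forecast",
        namespace="data-pipelines",
        image="registry.example.com/spark-jobs:latest",
        cmds=["spark-submit"],
        arguments=["local:///opt/jobs/refresh_forecast.py"],
    )

    # The daily forecast refresh depends on the transformed order data being ready.
    transform_orders >> refresh_forecast
```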
Timeline and major milestones:
- Months 1–2: Requirements gathering, DMS-Kafka integration
- Months 3–4: Spark pipeline implementation, Iceberg table design
- Month 5: Trino setup, dashboard prototyping in Superset
- Months 6–7: Performance tuning, role-based access, production deployment
Innovation and Best Practices
The solution adopted several AWS best practices, including modular design, containerization, and CI/CD deployment on EKS.
- How the solution leveraged AWS best practices: Each service (Kafka, Spark, Trino) was containerised and deployed using Helm on EKS with auto-scaling. Logging and monitoring were built in via CloudWatch, Prometheus, and Grafana.
- Innovative approaches or unique aspects of the implementation: Apache Iceberg for ACID-compliant data lakes enabled schema evolution and time travel, simplifying the onboarding of new data sources (see the sketch after these points). A KPI-first approach minimised overengineering and kept efforts aligned with business outcomes.
- Use of AWS Well-Architected Framework principles: Operational Excellence was achieved through GitOps pipelines and observability. Reliability and Performance Efficiency were addressed with resource-tuned deployments and stream processing. Cost optimisation was accomplished via open-source tools and S3-based object storage.
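A minimal sketch of the Iceberg capabilities called out above, assuming the Iceberg-enabled Spark session and catalog shown earlier. The table, column, and timestamp are illustrative placeholders, and the time-travel syntax shown requires a recent Spark release.

```python
from pyspark.sql import SparkSession

# Reuses the Iceberg-enabled session/catalog from the ingestion sketch.
spark = SparkSession.builder.getOrCreate()

# Schema evolution: add a column for a newly onboarded partner feed without
# rewriting existing data files (column name is hypothetical).
spark.sql("""
    ALTER TABLE lakehouse.supply_chain.shipments
    ADD COLUMNS (carrier_scac STRING)
""")

# Time travel: audit what a dashboard would have shown at an earlier point
# by querying a previous table snapshot.
spark.sql("""
    SELECT count(*) AS shipment_count_at_snapshot
    FROM lakehouse.supply_chain.shipments
    TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show()
```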
DevOps, CI/CD, or other modern practices implemented:
- GitOps workflows using Argo CD
- CI/CD pipelines for infrastructure and Spark jobs
- Helm charts for consistent multi-environment deployment
- Prometheus and Grafana dashboards for real-time monitoring
- Started with KPI-first design, avoiding over-architecture
- Used Kubernetes-native services for modular scaling
- Applied Iceberg for schema evolution and unified storage
- Enabled Superset for business user self-service
- Applied AWS Well-Architected principles and container best practices
Results and Benefits
Business Outcomes and Success Metrics
Cost savings
- Achieved a 40% reduction in total cost of ownership by shifting to an EKS-based, open-source lakehouse architecture.
- Reduced infrastructure and license expenditures through pay-as-you-go models and container orchestration.
Revenue increases or new revenue streams
- Enabled faster demand forecasting and order planning, contributing to improved supplier negotiations and reduced inventory carrying costs.
Time-to-market improvements
- Reporting latency reduced from 24–36 hours to under 1 hour, accelerating business decision-making.
- Forecast model refresh cycle improved from weekly (manual) to daily (automated), enhancing agility.
Operational efficiencies
- Fully automated dashboards eliminated the need for manual report generation (saving 8–10 hours/week).
- Onboarding new data sources now takes less than 3 days, compared to the previous 2–3 weeks.
Competitive advantages gained
- Real-time dashboards and unified analytics gave customers visibility into global supply chain metrics, enhancing responsiveness and reducing stock-outs.
ROI and payback period
- A significant reduction in infrastructure overheads and operational delays led to a rapid ROI, with the payback period achieved within the first year of deployment.
Technical Benefits
- Performance improvements: The shift to real-time pipelines and distributed processing significantly improved performance. Reporting latency was reduced from 24–36 hours to under 1 hour, and automated model refresh cycles accelerated from weekly to daily.
- Scalability enhancements: The platform achieved modular scalability by containerising services and deploying them on Amazon EKS. Kafka, Spark, and Trino components could scale independently based on workload demands, ensuring optimal performance during peak hours.
- Reliability and availability improvements: The Kubernetes-native design introduced auto-healing, load balancing, and high availability across services. Data pipelines became resilient to node failures, and real-time ingestion pipelines ensured data continuity.
- Strengthened security posture: IAM-based RBAC and service account policies provided fine-grained access controls. Data stored in Iceberg tables on S3 was encrypted in transit and at rest, satisfying regional compliance requirements.
- Reduced technical debt: Replacing legacy batch ETL and monolithic reporting tools with modern open-source frameworks reduced code complexity, enhanced maintainability, and lowered long-term technical overhead.
- Improved development velocity: GitOps automation and CI/CD pipelines for infrastructure and jobs enabled faster iteration cycles. Teams could test, deploy, and monitor new pipelines or features with minimal manual intervention.
Lessons Learned
Challenges Overcome
During implementation, the team encountered several significant challenges:
- Tuning Apache Spark on Kubernetes for optimal performance was difficult due to large shuffle operations.
- Kafka experienced high latency during peak usage, which impacted data streaming.
- Designing role-based access for multiple departments while ensuring compliance was initially complicated.
- Initial Superset dashboards were too technical for non-technical users.
How these challenges were addressed:
- Spark issues were resolved through executor memory tuning and custom pod resource configurations (see the configuration sketch after this list).
- Adjusting partition strategies, increasing the broker count, and offloading archival workloads to batch pipelines improved Kafka performance.
- RBAC was implemented using IAM and Kubernetes service accounts to enforce scoped access.
- Dashboard usability was improved after feedback sessions with business stakeholders, leading to simpler, more targeted dashboards.
- Additional focus was placed on UI/UX for the BI dashboards.
- Performance benchmarking became an ongoing process to optimise workloads on EKS.
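To make the Spark tuning concrete, the sketch below shows the kind of executor sizing and Kubernetes resource settings that this tuning typically involves. The specific values and the pod template path are illustrative starting points, not the final production configuration.

```python
from pyspark.sql import SparkSession

# Executor sizing and shuffle settings of the kind adjusted during tuning;
# all numbers below are illustrative, not the production values.
spark = (
    SparkSession.builder.appName("orders-batch-transform")
    # Right-size executor pods so large shuffles do not exhaust memory.
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    # Match shuffle parallelism to the data volume to avoid tiny or oversized tasks.
    .config("spark.sql.shuffle.partitions", "400")
    # Align executor pod requests/limits with the EKS node group capacity.
    .config("spark.kubernetes.executor.request.cores", "4")
    .config("spark.kubernetes.executor.limit.cores", "4")
    # Custom pod template for tolerations, node selectors, etc. (placeholder path).
    .config("spark.kubernetes.executor.podTemplateFile",
            "/opt/spark/conf/executor-pod-template.yaml")
    .getOrCreate()
)
```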
Best Practices Identified
Key learnings from the implementation:
- Starting with clear business KPIs helped align the architecture with tangible goals.
- Kubernetes-native deployment offered flexibility in resource allocation and service modularity.
Practices that contributed to success:
- Adoption of open-source, cloud-native tools avoided vendor lock-in.
- GitOps and CI/CD pipelines ensured rapid, consistent, and observable deployments.
- Iceberg's ACID compliance and schema evolution made it easier to onboard new data sources.
Approaches that could benefit other implementations:
- A KPI-first planning methodology aligns IT architecture with measurable business outcomes.
- Empowering business users through self-service BI reduces dependency on data engineering.
- Early investment in observability tools such as Prometheus and Grafana improves operational confidence and reduces mean time to resolution (MTTR).
- KPI-first solution design drove relevance.
- Open source on Kubernetes ensured cost-effectiveness.
- Superset accelerated BI adoption.
- RBAC ensured compliance across teams.