Financial Data Hub

Data Architecture

Enterprise financial data consolidation platform achieving 20% storage cost reduction and 30% compute optimization

Type
Data Architecture
Role
Azure Data Engineer
Service
Data Engineering / Cloud Architecture / Performance Optimization
Year
2024
Project Overview

Built a comprehensive Financial Data Hub at SMBC that consolidated data from multiple legacy source systems into a single source of truth. This enterprise-scale data warehouse serves as the backbone for downstream applications across the organization, enabling near real-time financial reporting and analytics.

The platform was engineered on the Azure cloud stack with Databricks as the core processing engine, applying parallelism and Spark optimization techniques that delivered significant cost savings (a 20% reduction in storage costs and a 30% reduction in compute costs) while markedly improving data accessibility and reliability.

Key Features

  • Unified Data Platform: Consolidated disparate financial data sources into a single, authoritative data repository
  • Advanced Optimization: Implemented parallelism techniques and Spark optimizations for 30% performance improvement
  • Cost Efficiency: Achieved 20% storage cost reduction through intelligent data partitioning and compression
  • Scalable Architecture: Designed for horizontal scaling to handle growing data volumes
  • Real-time Processing: Enabled near real-time data ingestion and transformation pipelines
  • Data Governance: Implemented Unity Catalog for comprehensive data cataloging and access control

Technologies Used

  • Databricks: Core data processing platform with notebook-based development
  • Azure Data Factory (ADF): Orchestration and workflow management
  • PySpark: Large-scale data processing and transformations
  • Spark SQL: Complex data queries and aggregations
  • Delta Lake: ACID transactions and versioned data storage
  • Azure Data Lake Gen2: Scalable data storage layer
  • Unity Catalog: Data governance and cataloging

Technical Implementation

The solution architecture followed the medallion approach, with bronze (raw ingested data), silver (cleansed and conformed data), and gold (curated, consumption-ready data) layers. Data ingestion was orchestrated through Azure Data Factory pipelines, with Databricks handling all transformation logic.
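As an illustration, the layer promotions can be sketched in plain Python. This is a Spark-free sketch so it stays self-contained; in the real pipeline each step runs as a Databricks job over Delta tables, and the `trades` records, keys, and columns below are hypothetical, not the production schema.

```python
# Spark-free sketch of medallion-style layer promotions. The records,
# column names, and dedup/aggregation rules here are illustrative only.

def to_silver(bronze_rows):
    """Bronze -> silver: deduplicate by key, cast types, drop bad records."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row["trade_id"] in seen or row["amount"] is None:
            continue  # skip duplicate feeds and records failing quality checks
        seen.add(row["trade_id"])
        silver.append({**row, "amount": float(row["amount"])})
    return silver

def to_gold(silver_rows):
    """Silver -> gold: aggregate to the reporting grain (total per desk)."""
    totals = {}
    for row in silver_rows:
        totals[row["desk"]] = totals.get(row["desk"], 0.0) + row["amount"]
    return totals

bronze = [
    {"trade_id": "T1", "desk": "FX", "amount": "100.5"},
    {"trade_id": "T1", "desk": "FX", "amount": "100.5"},  # duplicate feed record
    {"trade_id": "T2", "desk": "FX", "amount": None},     # fails quality check
    {"trade_id": "T3", "desk": "Rates", "amount": "40.0"},
]

gold = to_gold(to_silver(bronze))
```

In the actual implementation the same dedup, cast, and aggregate steps are expressed as Spark DataFrame operations so they scale across the cluster.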

Spark DataFrames and Spark SQL were used extensively for complex table-to-table transformations, while PySpark scripting enabled custom business logic. Delta Live Tables enforced data quality rules and tracked lineage throughout the pipeline.
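A hedged sketch of what such custom business logic typically looks like: a plain Python rule that would be wrapped as a PySpark UDF. The function name, conversion rates, and columns below are illustrative assumptions, not the production logic.

```python
# Illustrative business rule of the kind implemented in PySpark scripting;
# the rates and names are hypothetical, not production values.
RATES_TO_USD = {"USD": 1.0, "JPY": 0.0067, "EUR": 1.08}

def amount_in_usd(amount, currency):
    """Normalize a trade amount to USD; unknown currencies return None
    so the record can be routed to a quarantine table downstream."""
    rate = RATES_TO_USD.get(currency)
    if amount is None or rate is None:
        return None
    return round(amount * rate, 2)

# On Databricks this would be registered for use from Spark SQL, e.g.:
#   from pyspark.sql.types import DoubleType
#   spark.udf.register("amount_in_usd", amount_in_usd, DoubleType())
```

Keeping the rule as a plain function makes it unit-testable outside the cluster before it is registered as a UDF.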

Challenges & Solutions

  • Legacy System Integration: Designed robust connectors for multiple legacy systems with varying data formats and update frequencies
  • Performance Bottlenecks: Implemented dynamic partition pruning and optimized join strategies to improve query performance by 30%
  • Cost Optimization: Analyzed and optimized cluster configurations, implemented auto-scaling, and optimized data storage formats
  • Data Quality: Created comprehensive audit, balance, and control (ABC) frameworks using SQL database audit tables
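The balance check at the heart of an ABC framework can be sketched as a simple reconciliation between source and target control totals. In the real framework these figures come from the SQL audit tables mentioned above; the dictionaries and field names below are stand-ins.

```python
# Minimal sketch of an ABC "balance" check: compare row counts and control
# totals captured at ingestion against the loaded target. The audit
# records here are stubs for rows read from SQL audit tables.

def balance_check(source_audit, target_audit, tolerance=0.01):
    """Return a list of discrepancies; an empty list means the load balances."""
    issues = []
    if source_audit["row_count"] != target_audit["row_count"]:
        issues.append("row count mismatch: "
                      f"{source_audit['row_count']} vs {target_audit['row_count']}")
    if abs(source_audit["amount_total"] - target_audit["amount_total"]) > tolerance:
        issues.append("control total mismatch")
    return issues

source = {"row_count": 1000, "amount_total": 52340.75}
target = {"row_count": 998, "amount_total": 52340.75}
issues = balance_check(source, target)
```

A non-empty result would typically fail the pipeline run and alert the on-call engineer rather than silently publishing an unbalanced load.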

Business Impact

The Financial Data Hub transformed how SMBC manages and analyzes financial data. By creating a single source of truth, the organization gained:

  • Faster decision-making through improved data accessibility
  • Significant cost savings through optimized infrastructure
  • Enhanced data quality and consistency
  • Improved regulatory compliance and audit capabilities
  • Foundation for advanced analytics and machine learning initiatives

My Role

As the lead Azure Data Engineer, I was responsible for the complete solution architecture, implementation of ETL pipelines, Spark optimization, and performance tuning. I worked closely with business stakeholders to translate requirements into technical solutions and ensured successful deployment and knowledge transfer.