
🚄 IRCTC Real-Time Data Pipeline using Google Cloud (GCP)

📒 Overview

The IRCTC Real-Time Data Pipeline is a cloud-based data processing system designed to ingest, transform, and store real-time streaming data from IRCTC (Indian Railway Catering and Tourism Corporation). It is built on Google Cloud Platform (GCP) services: Pub/Sub for ingestion, Dataflow (Apache Beam) for processing and transformation, BigQuery for storage and analysis, and Cloud Storage for supporting artifacts such as the UDF code.


📊 Project Flowchart

📝 Architecture Overview

🔹 Data Flow Pipeline

  1. Data Ingestion: Simulated IRCTC Mock Data is published to Google Pub/Sub.
  2. Data Processing: A Dataflow pipeline (Apache Beam) reads messages from Pub/Sub and applies Python UDFs for transformation, with fault-tolerant error handling (a minimal pipeline sketch follows this list).
  3. Data Storage: The transformed data is stored in Google BigQuery for analytics.
  4. UDF Registration: User-defined functions (transform_UDF.py) are registered from Google Cloud Storage to BigQuery.
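
To make the flow above concrete, here is a minimal sketch of what the streaming Dataflow job could look like: read JSON messages from a Pub/Sub subscription, apply a Python transformation function, and append the results to BigQuery. The project ID, subscription, dataset/table names, and the enrichment rule below are illustrative placeholders, not the repository's exact code.

```python
# Minimal Apache Beam streaming sketch: Pub/Sub -> Python UDF -> BigQuery.
# All resource names (<PROJECT_ID>, subscription, dataset/table) are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def transform_record(record: dict) -> dict:
    """Illustrative UDF-style enrichment; the real transform_UDF.py logic may differ."""
    record["loyalty_status"] = "GOLD" if record.get("loyalty_points", 0) >= 500 else "STANDARD"
    return record


def run():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # run the job in streaming mode

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/<PROJECT_ID>/subscriptions/irctc-data-sub")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "ApplyUDF" >> beam.Map(transform_record)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="<PROJECT_ID>:irctc_dwh.irctc_stream_tb",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                # table assumed to exist already (see the schema section below)
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```

A job like this would typically be launched on Dataflow with the usual Beam flags (--runner=DataflowRunner, --project, --region, --temp_location); the fault-tolerance aspect is sketched separately under Features.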

⚙️ Tech Stack

  • Google Cloud Pub/Sub → Real-time message streaming (see the publisher sketch after this list)
  • Google Dataflow (Apache Beam) → Data processing and transformation
  • Google BigQuery → Data warehouse for analytics
  • Google Cloud Storage → Stores UDF files
  • Python → Apache Beam pipeline & UDF implementation
  • SQL → Data transformation & querying in BigQuery
  • Terraform (Optional) → Infrastructure as Code (IaC) for GCP setup
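
As a companion to the data-ingestion step, a mock-data publisher could look roughly like the sketch below, using the google-cloud-pubsub client. The project ID, the irctc-data topic name, and the generated field values are assumptions; the field names mirror the BigQuery schema further down.

```python
# Hypothetical mock-data publisher; <PROJECT_ID> and the topic name are placeholders.
import json
import random
import time
import uuid
from datetime import datetime, timezone

from google.cloud import pubsub_v1

PROJECT_ID = "<PROJECT_ID>"
TOPIC_ID = "irctc-data"  # assumed topic name

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def make_mock_record() -> dict:
    """Build one mock IRCTC passenger record (field names mirror the schema below)."""
    return {
        "row_key": str(uuid.uuid4()),
        "name": random.choice(["Asha", "Ravi", "Meera", "Sujit"]),
        "age": random.randint(18, 75),
        "email": "user@example.com",
        "join_date": "2023-01-15",
        "last_login": datetime.now(timezone.utc).isoformat(),
        "loyalty_points": random.randint(0, 1000),
        "account_balance": round(random.uniform(0.0, 5000.0), 2),
        "is_active": random.choice([True, False]),
    }


while True:
    payload = json.dumps(make_mock_record()).encode("utf-8")
    message_id = publisher.publish(topic_path, data=payload).result()
    print(f"Published message {message_id}")
    time.sleep(1)  # roughly one record per second
```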

🚀 Features

✔️ Real-time data ingestion using Pub/Sub
✔️ Serverless & scalable processing via Dataflow
✔️ Custom transformations using Python UDFs
✔️ Fault tolerance & error handling (see the dead-letter sketch below)
✔️ Data warehousing for analytics using BigQuery
✔️ Optimized SQL queries for analysis and reporting
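
A common Beam pattern for the fault tolerance listed above is to tag records that fail parsing or transformation as a dead-letter output instead of letting one bad message fail the job. The sketch below is a generic version of that pattern under assumed names; the repository's actual error handling may differ.

```python
# Generic dead-letter pattern for fault tolerance; names are illustrative.
import json

import apache_beam as beam
from apache_beam import pvalue


class SafeTransform(beam.DoFn):
    """Apply the UDF; route failures to a 'dead_letter' output instead of raising."""

    DEAD_LETTER = "dead_letter"

    def process(self, raw_message: bytes):
        try:
            record = json.loads(raw_message.decode("utf-8"))
            record["loyalty_status"] = (
                "GOLD" if record.get("loyalty_points", 0) >= 500 else "STANDARD"
            )
            yield record
        except Exception as err:  # malformed JSON, missing fields, bad types, ...
            yield pvalue.TaggedOutput(
                self.DEAD_LETTER,
                {"raw": raw_message.decode("utf-8", errors="replace"), "error": str(err)},
            )


# Usage inside the pipeline: split into valid and dead-letter collections.
# results = messages | "SafeTransform" >> beam.ParDo(SafeTransform()).with_outputs(
#     SafeTransform.DEAD_LETTER, main="valid")
# results.valid        -> WriteToBigQuery (main table)
# results.dead_letter  -> write to an error table or Cloud Storage for inspection
```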


🗄️ BigQuery Schema

| Column Name | Data Type | Description |
| --- | --- | --- |
| row_key | STRING | Unique identifier for each record |
| name | STRING | Passenger's name |
| age | INT64 | Passenger's age |
| email | STRING | Passenger's email address |
| join_date | DATE | Date when the passenger joined |
| last_login | TIMESTAMP | Last login timestamp |
| loyalty_points | INT64 | Loyalty points earned |
| account_balance | FLOAT64 | Account balance in INR |
| is_active | BOOL | Indicates if the account is active |
| inserted_at | TIMESTAMP | Timestamp when the record was inserted |
| updated_at | TIMESTAMP | Last updated timestamp |
| loyalty_status | STRING | Loyalty membership status |
| account_age_days | INT64 | Total days since account creation |
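
The same schema, expressed with the BigQuery Python client (for example, to create the table up front), could look like the following; the project, dataset, and table names are placeholders.

```python
# BigQuery schema matching the table above; <PROJECT_ID>, dataset, and table are placeholders.
from google.cloud import bigquery

IRCTC_SCHEMA = [
    bigquery.SchemaField("row_key", "STRING"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("age", "INT64"),
    bigquery.SchemaField("email", "STRING"),
    bigquery.SchemaField("join_date", "DATE"),
    bigquery.SchemaField("last_login", "TIMESTAMP"),
    bigquery.SchemaField("loyalty_points", "INT64"),
    bigquery.SchemaField("account_balance", "FLOAT64"),
    bigquery.SchemaField("is_active", "BOOL"),
    bigquery.SchemaField("inserted_at", "TIMESTAMP"),
    bigquery.SchemaField("updated_at", "TIMESTAMP"),
    bigquery.SchemaField("loyalty_status", "STRING"),
    bigquery.SchemaField("account_age_days", "INT64"),
]

client = bigquery.Client(project="<PROJECT_ID>")
table = bigquery.Table("<PROJECT_ID>.irctc_dwh.irctc_stream_tb", schema=IRCTC_SCHEMA)
client.create_table(table, exists_ok=True)  # idempotent: no error if the table exists
```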

🎯 Use Cases

  • 📊 Passenger Behavior Analysis: Using real-time & historical data to understand customer trends.
  • 🎁 Loyalty Program Management: Enhancing customer engagement through data-driven rewards.
  • 🔍 Operational Monitoring: Identifying active/inactive users for improved service efficiency.
  • 📈 Trend Analysis: Leveraging BigQuery for actionable business insights (see the example query below).
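
For example, a trend-analysis query run through the BigQuery Python client could break active passengers down by loyalty status; the fully-qualified table name below is a placeholder.

```python
# Example analytics query; the table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  loyalty_status,
  COUNT(*)              AS passengers,
  AVG(loyalty_points)   AS avg_points,
  AVG(account_age_days) AS avg_account_age_days
FROM `<PROJECT_ID>.irctc_dwh.irctc_stream_tb`
WHERE is_active = TRUE
GROUP BY loyalty_status
ORDER BY passengers DESC
"""

for row in client.query(QUERY).result():
    print(row.loyalty_status, row.passengers, row.avg_points, row.avg_account_age_days)
```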

πŸ“ Future Enhancements

✅ Integrate Cloud Functions for event-driven triggers.
✅ Implement Dataflow Streaming Mode for real-time analytics.
✅ Optimize BigQuery Queries to enhance cost efficiency and performance.

👨‍💻 Author

Sujit Mahapatra
📧 Email | 🔗 LinkedIn

⭐ Contribute

Contributions are welcome! If you’d like to improve the project, feel free to fork the repository and submit a pull request.
