(This is a hypothetical and synthetic requirement formulated for the zoomcamp project).
The City of Boston wishes to review how its residents request city services and how departments respond to those requests. The city also aims to identify areas of improvement in its newly deployed 311 system and to enhance the efficiency of the city departments that deliver basic services.
To meet this need, the data engineering / analytics team needs to build a pipeline that converts the raw data collected into actionable dashboards the management team can analyze to make informed decisions.
KPIs
- Simplify and shorten the time between a resident reporting a problem and its resolution.
- Enhance the 311 system to become an effective management tool, generating real-time reports that help departments manage staff, track trends, highlight and monitor performance, target efficiency needs and maximize its resources.
- Determine the impact of the 311 system on the 911 emergency system (311 should make 911 more efficient by diverting non-emergency calls that could impede the City's emergency response).
This project produces a pipeline which:
- Uses Terraform to manage the infrastructure
- Extracts the raw data into GCS on Google Cloud
- Transforms the raw data into standard tables using Apache Spark
- Loads the transformed data into BigQuery
- Repartitions and writes the raw data to archival storage in Parquet format
- Produces dashboard tiles in Google Data Studio
- Orchestrates the pipeline using Airflow
The BashOperator is used to download the file from the API, and the data is ingested into GCS with the PythonOperator. A SparkSubmitOperator reads the data from GCS, transforms it, and loads the transformed data into BigQuery. Data Studio connects to the transformed data, and the dashboard is built from it. The pipeline refreshes the data monthly with the 311 request information received: the raw data is first extracted into the data lake for storage, then transformed and loaded into the data warehouse (BigQuery) for easier dashboard construction and analytics.
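A minimal sketch of how this DAG could be wired together is shown below; the DAG id, download URL, bucket name, file paths, and Spark script path are illustrative assumptions, not the repo's actual values.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from google.cloud import storage

BUCKET = "boston-311-datalake"                      # assumed bucket name
LOCAL_FILE = "/opt/airflow/data/311_{{ ds_nodash }}.csv"
GCS_OBJECT = "raw/311_{{ ds_nodash }}.csv"
DATA_URL = "https://data.boston.gov/..."            # placeholder 311 export URL


def upload_to_gcs(bucket_name: str, object_name: str, local_file: str) -> None:
    """Push the downloaded file into the GCS data lake."""
    client = storage.Client()
    client.bucket(bucket_name).blob(object_name).upload_from_filename(local_file)


with DAG(
    dag_id="boston_311_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@monthly",   # one run per month, as described above
    catchup=False,
) as dag:

    # Download the raw monthly extract from the API.
    download = BashOperator(
        task_id="download_raw_file",
        bash_command=f"curl -sSL '{DATA_URL}' -o {LOCAL_FILE}",
    )

    # Ingest the raw file into the GCS data lake.
    to_gcs = PythonOperator(
        task_id="upload_raw_to_gcs",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket_name": BUCKET,
            "object_name": GCS_OBJECT,
            "local_file": LOCAL_FILE,
        },
    )

    # Transform the raw data with Spark and load it into BigQuery.
    transform_and_load = SparkSubmitOperator(
        task_id="spark_transform_load",
        application="/opt/airflow/spark/transform_311.py",  # assumed script path
        conn_id="spark_default",
    )

    download >> to_gcs >> transform_and_load
```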
- Cloud: GCP
- Datalake: GCS Bucket
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Airflow
- Data Warehouse: BigQuery
- Transformations: PySpark
- Create a gcp account
- Configure the Google Cloud SDK, create a service account, and download your secret key (JSON)
- Clone the repo
- Provision your Google infrastructure (GCS, BigQuery) with Terraform; a quick verification sketch follows this list
- Update the location of your secret key in the Dockerfile
- Run the docker-compose.yaml file, which contains Airflow, Spark, and PostgreSQL (docker-compose up -d)
- Confirm all services are started.
- Connect to Airflow and turn on the DAG.
- Trigger the DAG or wait for its scheduled run (once a month)
- Connect to BigQuery to explore the loaded data and connect Data Studio to build dashboards.
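A minimal sketch for confirming the provisioned infrastructure is reachable with the service-account key; the bucket and dataset names below are hypothetical, so replace them with whatever your Terraform variables create.

```python
from google.cloud import bigquery, storage

# Assumed names -- substitute the bucket / dataset provisioned by Terraform.
BUCKET_NAME = "boston-311-datalake"
DATASET_ID = "boston_311"

# Verify the data lake bucket exists and the key has access to it.
storage_client = storage.Client()
bucket = storage_client.lookup_bucket(BUCKET_NAME)
print("GCS bucket found:", bucket is not None)

# Verify the BigQuery dataset was provisioned.
bq_client = bigquery.Client()
datasets = [d.dataset_id for d in bq_client.list_datasets()]
print("BigQuery dataset found:", DATASET_ID in datasets)
```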
- Data ingestion - Use Airflow to download the dataset and place it in Google Cloud Storage (GCS)
- Data warehouse - BigQuery will be used to host the tables
- Transformations - Use PySpark to transform the data from the GCS bucket and load it into BigQuery (partitioned and clustered); see the sketch after this list
- Dashboard - Use Google Data Studio to build the dashboards
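A sketch of what the PySpark job could look like, assuming the spark-bigquery and GCS connectors are available to the Spark cluster; the bucket, dataset, and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("boston-311-transform").getOrCreate()

# Read the raw monthly extract from the data lake.
raw = spark.read.csv("gs://boston-311-datalake/raw/*.csv", header=True, inferSchema=True)

# Standardise timestamp types and derive the resolution time in days.
cases = (
    raw.withColumn("open_dt", F.to_timestamp("open_dt"))
       .withColumn("closed_dt", F.to_timestamp("closed_dt"))
       .withColumn(
           "resolution_days",
           (F.unix_timestamp("closed_dt") - F.unix_timestamp("open_dt")) / 86400,
       )
)

# Archive the raw data as Parquet, repartitioned into fewer, larger files.
raw.repartition(4).write.mode("append").parquet("gs://boston-311-archive/parquet/")

# Load the standardised table into BigQuery; partitioning and clustering are
# passed as options to the spark-bigquery connector.
(
    cases.write.format("bigquery")
         .option("table", "boston_311.service_requests")
         .option("temporaryGcsBucket", "boston-311-tmp")
         .option("partitionField", "open_dt")
         .option("clusteredFields", "type")
         .mode("append")
         .save()
)
```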
- Most used channel for requesting service
- Percentage of tickets resolved on time (see the query sketch after this list)
- Maximum number of days to resolve a ticket in each category
- Most frequent request
- Average resolution time of a case type
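As an example, the on-time resolution metric can be computed directly in BigQuery from Python; the dataset/table name and the closed_dt / target_dt columns are assumptions about the transformed schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Share of closed tickets that were resolved by their target date.
query = """
    SELECT
      COUNTIF(closed_dt <= target_dt) / COUNT(*) * 100 AS pct_resolved_on_time
    FROM `boston_311.service_requests`
    WHERE closed_dt IS NOT NULL
"""

for row in client.query(query).result():
    print(f"Tickets resolved on time: {row.pct_resolved_on_time:.1f}%")
```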