- Design and develop a data pipeline for batch processing the City of Chicago traffic crash data (a minimal ingestion sketch follows the dataset links below)
- Develop analytical views and a dashboard from the extracted data
- Perform data transformation with Pandas and PySpark
- Store the traffic data in a data lake and a data warehouse
- Develop data models with fact and dimension tables using dbt and SQL
- Dataset link 1: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Vehicles/68nd-jvt3/about_data
- Dataset link 2: https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if/about_data
- Dataset link 3: https://data.cityofchicago.org/Transportation/Traffic-Crashes-People/u6pd-qa9d/about_data
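The datasets above are published on the Chicago Data Portal, which exposes each one through a Socrata SODA endpoint. The sketch below is a minimal, illustrative way to pull the raw CSVs before staging them in the data lake; the dataset IDs come from the URLs above, but the function name, output paths, and row limit are assumptions rather than the project's actual ingestion code.

```python
import requests

# Socrata dataset IDs taken from the portal URLs above
DATASETS = {
    "vehicles": "68nd-jvt3",
    "crashes": "85ca-t3if",
    "people": "u6pd-qa9d",
}

BASE_URL = "https://data.cityofchicago.org/resource/{id}.csv"


def download_dataset(name: str, dataset_id: str, limit: int = 500_000) -> str:
    """Download one dataset as CSV via the SODA API and stage it locally."""
    url = BASE_URL.format(id=dataset_id)
    out_path = f"{name}.csv"
    # $limit caps the rows returned per request; very large pulls would
    # normally be paged with $offset as well.
    response = requests.get(url, params={"$limit": limit}, timeout=300)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path


if __name__ == "__main__":
    for name, dataset_id in DATASETS.items():
        print(f"Downloaded {download_dataset(name, dataset_id)}")
```

In the pipeline, the staged CSVs would then be uploaded to the GCS bucket for downstream Spark processing.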
- Workflow Orchestration: Apache Airflow (example DAG sketched after this list)
- Data Warehouse: BigQuery
- Data Lake: Google Cloud Storage
- Data Visualization: Looker Studio
- Data Modeling: dbt
- Containerization: Docker
- Batch Processing: Spark
- Google Cloud Services: Dataproc
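Airflow ties these services together. The DAG below is a minimal, hedged sketch of how the batch run could be orchestrated with the Google provider operators (a Dataproc Spark job followed by a BigQuery load); the cluster name, region, dataset/table names, and job script path are placeholders, not the project's actual configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Placeholder values -- substitute the real project, bucket, and cluster names.
PROJECT_ID = "dataengineerproject-448203"
BUCKET = "dataengineerproject-448203-bucket1"
CLUSTER_NAME = "traffic-crashes-cluster"
REGION = "us-central1"

with DAG(
    dag_id="chicago_traffic_crashes_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the PySpark transformation (schema enforcement + Parquet write) on Dataproc.
    spark_transform = DataprocSubmitJobOperator(
        task_id="spark_transform_crashes",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {
                "main_python_file_uri": f"gs://{BUCKET}/jobs/transform_crashes.py"
            },
        },
    )

    # Load the transformed Parquet files from the data lake into BigQuery.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_crashes_to_bigquery",
        bucket=BUCKET,
        source_objects=["crashes/transformed_crashes/*.parquet"],
        destination_project_dataset_table=f"{PROJECT_ID}.traffic.crashes",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
    )

    spark_transform >> load_to_bq
```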
- Leverage a Spark DataFrame to apply an explicit schema to the data prepared with Pandas:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    IntegerType,
    FloatType,
    TimestampType,
)

spark = SparkSession.builder \
    .appName("Change CSV Schema") \
    .getOrCreate()

# Explicit schema for the crashes dataset; crash_date is parsed as a timestamp.
custom_schema = StructType([
    StructField("crash_record_id", StringType(), True),
    StructField("crash_date", TimestampType(), True),
    StructField("weather_condition", StringType(), True),
    StructField("lighting_condition", StringType(), True),
    StructField("road_defect", StringType(), True),
    StructField("injuries_total", IntegerType(), True),
    StructField("injuries_fatal", IntegerType(), True),
    StructField("latitude", FloatType(), True),
    StructField("longitude", FloatType(), True),
])

# Read the raw CSV from the GCS data lake with the enforced schema.
df = spark.read \
    .option("header", "true") \
    .schema(custom_schema) \
    .csv("gs://dataengineerproject-448203-bucket1/crashes/crashes.csv")

df.printSchema()

# Write the transformed data back to the data lake as Parquet.
df.write \
    .mode("overwrite") \
    .parquet("gs://dataengineerproject-448203-bucket1/crashes/transformed_crashes")
```
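Once the transformed Parquet is loaded into BigQuery, dbt builds the data models and fact and dimension tables listed in the objectives. The model below is only an illustrative sketch: the `staging` source, the `fct_crashes` name, and the column selection are assumptions rather than the project's actual dbt models, and it presumes a matching source is declared in a schema.yml.

```sql
-- models/core/fct_crashes.sql (illustrative sketch; names are assumptions)
{{ config(materialized='table') }}

with crashes as (
    select * from {{ source('staging', 'crashes') }}
)

select
    crash_record_id,
    crash_date,
    weather_condition,
    lighting_condition,
    road_defect,
    injuries_total,
    injuries_fatal,
    latitude,
    longitude
from crashes
where crash_record_id is not null
```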