@kamranahmedse
Created November 7, 2024 17:03

Data Engineer Learning Roadmap

Introduction to Data Engineering

  • What is Data Engineering?
  • Data Engineering vs Data Science

Computer Science Fundamentals

  • Basic Terminal Usage
  • How Does a Computer Work?
  • Git
  • Programming Foundations
    • Data Structures
      • Arrays
      • Lists
      • Trees
      • HashMaps
    • Algorithms
      • Sorting
      • Searching
    • Complexity Analysis
  • Operating Systems Basics
    • File Systems
    • Process Management
    • Memory Management
  • Networking Fundamentals
    • TCP/IP Model
    • HTTP/HTTPS
    • DNS Basics
  • Basics of Distributed Systems
    • Concepts of Distribution
    • Fault Tolerance and Redundancy
  • Linux (link to roadmap)

Learn a Programming Language

  • Python
    • Numpy
    • Pandas
  • Java
  • Scala
  • Go

Testing

  • Integration Testing
  • Unit Testing
  • End-to-end Testing
  • Functional Testing
  • A/B Testing
  • Load Testing
  • Smoke Testing

Databases

  • Database Fundamentals
    • SQL
    • Normalization
    • CAP Theorem
    • OLTP vs OLAP
    • Horizontal vs Vertical Scaling
  • Relational Databases
    • Table Design
    • Indexing
    • Transactions
    • MySQL
    • PostgreSQL
    • MariaDB
    • Amazon Aurora
  • NoSQL Databases
    • Document Stores
    • Key-Value Stores
    • Column Stores
    • Graph Databases

Data Warehousing Solutions

  • What is a Data Warehouse?
  • Traditional Data Warehouses
    • Amazon Redshift
    • Google BigQuery
    • Snowflake
  • Modern Data Warehouses
    • Serverless Options
    • Data Mesh
    • Databricks SQL (DBSQL)

Object Storage

  • AWS S3
  • Azure Blob Storage
  • Google Cloud Storage
  • Apache Ozone

Cluster Computing Fundamentals

  • Overview of Cluster Computing
  • Distributed File Systems
    • HDFS
  • Job Scheduling and Resource Management
  • Cluster Management Tools
    • Kubernetes
    • Apache Hadoop YARN

Data Processing

  • Batch
  • Hybrid
  • Streaming
  • Realtime

Messaging Systems

  • Definition and Importance
  • Asynchronous vs Synchronous Communication
  • Use Cases
  • Common Messaging Patterns
    • Pub/Sub Model
    • Point-to-Point
    • Event-Driven Architecture
  • Messaging Tools
    • Apache Kafka
      • Kafka Basics
      • Kafka Streams
      • Kafka Connect
    • RabbitMQ
      • Queues, Exchanges, Routing Keys
      • Use Cases
    • AWS
      • SQS
      • SNS
      • Use Cases
    • Google Cloud Pub/Sub
  • Real-Time Processing with Messaging
  • Best Practices for Messaging Systems
  • Message Ordering
  • Delivery Guarantees
  • Fault Tolerance
  • Exactly-Once Processing
  • Message Retention
  • Back Pressure Handling

Data Modeling

  • Data Normalization
  • Schema Design
  • Data Warehousing
  • Star Schema vs Snowflake Schema

Data Pipelines

  • Introduction to Pipelines
  • ETL Process
    • Extract Data
    • Transform Data
    • Load Data
  • ELT Process
    • dbt (Data Build Tool)
  • Data Pipeline Tools
    • Apache Airflow
    • Luigi
    • Prefect

Monitoring

  • Prometheus
  • Datadog
  • Sentry
  • New Relic

Infrastructure as Code (IaC)

  • Declarative vs Imperative
  • Idempotency
  • Reusability
  • Environment Management
  • Tools
    • Terraform
    • OpenTofu
    • Pulumi
    • AWS CDK
    • Google Deployment Manager

CI/CD

  • What is it?
  • CircleCI
  • GitLab
  • GitHub Actions
  • Argo CD

Containerization & Orchestration

  • Key Concepts
    • Containerization Basics
    • Orchestration
  • Containerization Tools
    • Docker (Roadmap)
  • Orchestration Tools
    • Kubernetes (Roadmap)
    • Docker Swarm
    • Google Kubernetes Engine (GKE)
    • Amazon Elastic Kubernetes Service (EKS)

Big Data Technologies

  • Hadoop Ecosystem
    • HDFS
    • MapReduce
    • YARN
  • Apache Spark
    • Spark Basics
    • Spark SQL
    • Spark Streaming
  • Data Lakes
    • Definition
    • Use Cases
    • Tabular
    • Microsoft
    • Databricks
    • Onehouse
    • Delta Lake

Data Integration

  • Data Integration Basics
  • API Integration
  • Data Streaming
    • Kafka
    • AWS Kinesis
  • Data Formats
    • JSON
    • Avro
    • Parquet
    • Apache ORC

Data Governance

  • Data Quality
  • Data Lineage
  • Metadata Management
  • Data Security
    • CIA Triad
      • Confidentiality
      • Integrity
      • Availability
    • Encryption
      • Data at Rest
      • Data in Transit
      • End-to-End Encryption
  • Legal Compliance
  • Encryption
  • Key Management
  • Tools
    • Unity Catalog
    • Polaris
    • Alation
    • Collibra

Data Analytics

  • Data Visualization
    • Best Practices
    • Tableau
    • Looker
    • Grafana
    • Jupyter Notebook
    • Microsoft Power BI
    • Streamlit
  • Business Intelligence
    • BI Tools
    • Reporting
  • Exploratory Data Analysis

Cloud Data Engineering

  • Cloud Platforms Overview
    • AWS
    • Google Cloud
    • Azure
  • Cloud Data Solutions
    • Data Warehousing
    • Data Lakes
  • Serverless Architectures
    • AWS Lambda
    • Google Cloud Functions