- What is Data Engineering?
- Data Engineering vs Data Science
- Basic Terminal Usage
- How Does a Computer Work?
- Git
- Programming Foundations
- Data Structures
- Arrays
- Lists
- Trees
- HashMaps
- Algorithms
- Sorting
- Searching
- Complexity Analysis
- Operating Systems Basics
- File Systems
- Process Management
- Memory Management
- Networking Fundamentals
- TCP/IP Model
- HTTP/HTTPS
- DNS Basics
- Basics of Distributed Systems
- Concepts of Distribution
- Fault Tolerance and Redundancy
- Linux (Roadmap)
- Python
- NumPy
- Pandas
- Java
- Scala
- Go
- Integration Testing
- Unit Testing
- End-to-end Testing
- Functional Testing
- A/B Testing
- Load Testing
- Smoke Testing
- Database Fundamentals
- SQL
- Normalization
- CAP Theorem
- OLTP vs OLAP
- Horizontal vs Vertical Scaling
- Relational Databases
- Table Design
- Indexing
- Transactions
- MySQL
- PostgreSQL
- MariaDB
- Amazon Aurora
- NoSQL Databases
- Document Stores
- Key-Value Stores
- Column Stores
- Graph Databases
- What is a Data Warehouse?
- Traditional Data Warehouses
- Amazon Redshift
- Google BigQuery
- Snowflake
- Modern Data Warehouses
- Serverless Options
- Data Mesh
- Databricks SQL (DBSQL)
- AWS S3
- Azure Blob Storage
- Google Cloud Storage
- Apache Ozone
- Overview of Cluster Computing
- Distributed File Systems
- HDFS
- Job Scheduling and Resource Management
- Cluster Management Tools
- Kubernetes
- Apache Hadoop YARN
- Batch
- Hybrid
- Streaming
- Realtime
- Definition and Importance
- Asynchronous vs Synchronous Communication
- Use Cases
- Common Messaging Patterns
- Pub/Sub Model
- Point-to-Point
- Event-Driven Architecture
- Messaging Tools
- Apache Kafka
- Kafka Basics
- Kafka Streams
- Kafka Connect
- RabbitMQ
- Queues, Exchanges, Routing Keys
- Use Cases
- AWS
- SQS
- SNS
- Use Cases
- Google Cloud Pub/Sub
- Real-Time Processing with Messaging
- Best Practices for Messaging Systems
- Message Ordering
- Delivery Guarantees
- Fault Tolerance
- Exactly-Once Processing
- Message Retention
- Back Pressure Handling
- Data Normalization
- Schema Design
- Data Warehousing
- Star Schema vs Snowflake Schema
- Introduction to Pipelines
- ETL Process
- Extract Data
- Transform Data
- Load Data
- ELT Process
- dbt (Data Build Tool)
- Data Pipeline Tools
- Apache Airflow
- Luigi
- Prefect
- Prometheus
- Datadog
- Sentry
- New Relic
- Declarative vs Imperative
- Idempotency
- Reusability
- Environment Management
- Tools
- Terraform
- OpenTofu
- Pulumi
- AWS CDK
- Google Deployment Manager
- What is CI/CD?
- CircleCI
- GitLab
- GitHub Actions
- Argo CD
- Key Concepts
- Containerization Basics
- Orchestration
- Containerization Tools
- Docker (Roadmap)
- Orchestration Tools
- Kubernetes (Roadmap)
- Docker Swarm
- Google Kubernetes Engine (GKE)
- Amazon Elastic Kubernetes Service (EKS)
- Hadoop Ecosystem
- HDFS
- MapReduce
- YARN
- Apache Spark
- Spark Basics
- Spark SQL
- Spark Streaming
- Data Lakes
- Definition
- Use Cases
- Tabular
- Microsoft
- Databricks
- Onehouse
- Delta Lake
- Data Integration Basics
- API Integration
- Data Streaming
- Kafka
- AWS Kinesis
- Data Formats
- JSON
- Avro
- Parquet
- Apache ORC
- Data Quality
- Data Lineage
- Metadata Management
- Data Security
- CIA Triad
- Confidentiality
- Integrity
- Availability
- Encryption
- Data at Rest
- Data in Transit
- End-to-End Encryption
- Legal Compliance
- Key Management
- Tools
- Unity Catalog
- Polaris
- Alation
- Collibra
- Data Visualization
- Best Practices
- Tableau
- Looker
- Grafana
- Jupyter Notebook
- Microsoft Power BI
- Streamlit
- Business Intelligence
- BI Tools
- Reporting
- Exploratory Data Analysis
- Cloud Platforms Overview
- AWS
- Google Cloud
- Azure
- Cloud Data Solutions
- Data Warehousing
- Data Lakes
- Serverless Architectures
- AWS Lambda
- Google Cloud Functions