@RajaShyam
Created October 5, 2018 16:02
A developer's view into the Spark memory model
Notes taken from Spark Summit Europe 2018 (talk by Wenchen Fan, Databricks)
Executor:
=========
1. Each executor contains a memory manager and a thread pool
2. The 5 key areas in the memory model of an executor are:
   1. Data sources - such as JSON, CSV, Parquet, etc.
   2. Internal format - data represented in a binary format
   3. Operators - such as filter, join, substr, regexp, etc.
   4. Memory manager - tracks and allocates the memory shared between execution and storage
   5. Cache manager - holds cached data (e.g. persisted DataFrames/tables) for reuse
3. Internal format: with plain JVM objects, a row's data is scattered across the heap.
   For example, Row(123, "data", "bricks") needs at least 5 memory locations:
   - the Row object itself - 1 memory location
   - the array holding the 3 field values - 1 memory location
   - 123 - Integer - 1 memory location
   - "data" - String - 1 memory location
   - "bricks" - String - 1 memory location
   The binary internal format instead packs the whole row into one contiguous buffer.
1. Sort and hash are 2 important algorithms used in big data
2. Naive sort: each comparison needs to access 2 different memory locations, which makes it hard for the CPU cache to pre-fetch data - poor cache locality
3. Cache-aware sort: go through the key prefixes in a linear fashion - good cache locality
4. Naive hash map: each lookup needs many pointer dereferences, plus key comparisons when a hash collision happens, and jumps between 2 memory regions - bad cache locality
5. Cache-aware hash map: store the key's hash code next to the record pointer in one contiguous array, so most lookups and comparisons stay within a single memory region - good cache locality
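The cache-aware sort above can be sketched in plain Python. This is an illustrative toy, not Spark's implementation: `records`, `prefix`, and the 8-byte prefix length are all assumptions for the example.

```python
import struct

# Hypothetical records: (key bytes, payload)
records = [(b"bricks", 3), (b"data", 1), (b"alpha", 2)]

def prefix(key: bytes) -> int:
    # First 8 bytes of the key, padded, as a big-endian integer so that
    # integer order matches byte order of the key prefix.
    return int.from_bytes(key[:8].ljust(8, b"\x00"), "big")

# Build a compact (prefix, record index) array. Sorting scans this one
# contiguous region linearly - good cache locality - instead of
# dereferencing two scattered records per comparison.
pointer_array = [(prefix(k), i) for i, (k, _) in enumerate(records)]
pointer_array.sort()

# Only on prefix ties would the full keys need to be dereferenced.
sorted_records = [records[i] for _, i in pointer_array]
```

The design point is that comparisons touch only the small, densely packed prefix array; the full records are fetched once, at the end.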