A developer's view into Spark's memory model
Notes taken from Spark Summit 2018 Europe (by Wenchen Fan, Databricks)
Executor:
=========
1. Each executor contains a memory manager and a thread pool
2. The 5 key areas in the memory model of an executor are:
   1. Data source - such as JSON, CSV, Parquet, etc.
   2. Internal format - data represented in a binary format
   3. Operators - such as filter, join, substr, regexp, etc.
   4. Memory manager - tracks and allocates the executor's execution and storage memory (on-heap and off-heap)
   5. Cache manager - manages data persisted via cache()/persist() in executor memory
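   A minimal pipeline sketch tying these five areas together (the file path, column names
   and local master below are made-up assumptions for illustration, not from the talk):

      import org.apache.spark.sql.SparkSession

      object MemoryModelDemo {
        def main(args: Array[String]): Unit = {
          // Hypothetical local session; on a cluster the executors do this work.
          val spark = SparkSession.builder()
            .appName("memory-model-demo")
            .master("local[*]")
            .getOrCreate()

          // Data source: JSON is parsed into Spark's internal binary row format.
          val users = spark.read.json("/tmp/users.json")

          // Operators: filter and substr run against the internal format.
          val adults = users.filter("age >= 18")
            .selectExpr("substr(name, 1, 3) AS prefix")

          // Cache manager: keeps the materialized result in executor memory,
          // subject to the limits enforced by the memory manager.
          adults.cache()
          adults.count()

          spark.stop()
        }
      }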
3. Internal format: Spark represents data in a compact binary format rather than as JVM objects,
   because the object representation wastes memory (a binary-layout sketch at the end of these notes makes this concrete).
   For example, Row(123, "data", "bricks") stored as JVM objects needs at least 5 memory locations:
   the Row object itself - 1 memory location
   the array holding the 3 field values - 1 memory location
   123 (Integer) - 1 memory location
   "data" (String) - 1 memory location
   "bricks" (String) - 1 memory location
Sort and hash:
==============
1. Sort and hash are 2 of the most important algorithms in big data processing
2. Native sort: each comparison needs to access 2 different memory locations, which makes it hard for the CPU cache to pre-fetch data - poor cache locality
3. Cache-aware sort: go through the key prefixes in a linear fashion - good cache locality (see the sort sketch at the end of these notes)
4. Native hash map: each lookup needs many pointer dereferences, plus a key comparison when a hash collision happens, and it jumps between 2 memory regions - bad cache locality
5. Cache-aware hash map: keep the hash code (or key prefix) next to each slot in a dense array, so a lookup mostly scans one memory region and only follows a pointer to the key bytes when the stored hash matches - good cache locality (see the hash map sketch at the end of these notes)
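
Below are three small sketches of the ideas above. None of them are Spark's actual code
(Spark's real implementations are classes like UnsafeRow, UnsafeExternalSorter and
BytesToBytesMap); they only illustrate the layout principles, and every name in them is made up.

First, the internal-format point: packing Row(123, "data", "bricks") into one contiguous
byte buffer, i.e. a single allocation instead of the 5 memory locations needed by the JVM
object representation.

    import java.nio.ByteBuffer
    import java.nio.charset.StandardCharsets

    // Sketch of a binary row layout: all field data lives in one contiguous
    // buffer instead of five separate JVM objects.
    object BinaryRowSketch {
      // Layout: [int id][int lenA][bytes of a][int lenB][bytes of b]
      def pack(id: Int, a: String, b: String): Array[Byte] = {
        val aBytes = a.getBytes(StandardCharsets.UTF_8)
        val bBytes = b.getBytes(StandardCharsets.UTF_8)
        val buf = ByteBuffer.allocate(4 + 4 + aBytes.length + 4 + bBytes.length)
        buf.putInt(id)
        buf.putInt(aBytes.length).put(aBytes)
        buf.putInt(bBytes.length).put(bBytes)
        buf.array()
      }

      def main(args: Array[String]): Unit = {
        val row = pack(123, "data", "bricks")
        println(s"packed into ${row.length} bytes in a single allocation")
      }
    }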
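Second, the cache-aware sort idea: keep each record's key prefix next to its record index
in one flat array, so most comparisons read memory linearly instead of chasing pointers to
the full records.

    import java.nio.charset.StandardCharsets

    // Simplified cache-aware sort sketch: sort (key prefix, record index)
    // pairs that sit together in memory, and only go back to the full
    // records once at the very end.
    object CacheAwareSortSketch {
      final case class Entry(prefix: Long, recordIndex: Int)

      def sortByPrefix(records: Array[String]): Array[String] = {
        // Key prefix: the first 8 bytes of each string packed into a Long.
        val entries = records.zipWithIndex.map { case (s, i) =>
          val bytes = s.getBytes(StandardCharsets.UTF_8).padTo(8, 0.toByte).take(8)
          val prefix = bytes.foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
          Entry(prefix, i)
        }
        // Most comparisons only touch the prefixes, which are contiguous.
        // (A real sorter would fall back to comparing full keys when prefixes tie.)
        entries.sortBy(_.prefix).map(e => records(e.recordIndex))
      }

      def main(args: Array[String]): Unit = {
        println(sortByPrefix(Array("bricks", "data", "apache")).mkString(", "))
      }
    }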
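Third, the cache-aware hash map idea: store the hash codes in a dense array next to the
slots, so a probe usually compares ints within one memory region and only dereferences the
key when the stored hash already matches.

    import scala.collection.mutable.ArrayBuffer

    // Toy open-addressing map illustrating the cache-aware layout. No resizing
    // for brevity; it assumes the number of entries stays below capacity.
    final class CacheAwareMapSketch[V](capacity: Int = 64) {
      private val hashes = new Array[Int](capacity)  // 0 means "empty slot"
      private val slots  = new Array[Int](capacity)  // index into keys/values
      private val keys   = ArrayBuffer.empty[String]
      private val values = ArrayBuffer.empty[V]

      private def hashOf(k: String): Int = {
        val h = k.hashCode
        if (h == 0) 1 else h                         // reserve 0 for "empty"
      }

      def put(key: String, value: V): Unit = {
        val h = hashOf(key)
        var i = (h & 0x7fffffff) % capacity
        // Linear probing: the scan stays inside the hashes array.
        while (hashes(i) != 0 && !(hashes(i) == h && keys(slots(i)) == key)) {
          i = (i + 1) % capacity
        }
        if (hashes(i) == 0) {                        // new entry
          hashes(i) = h
          slots(i) = keys.length
          keys += key
          values += value
        } else {
          values(slots(i)) = value                   // overwrite existing key
        }
      }

      def get(key: String): Option[V] = {
        val h = hashOf(key)
        var i = (h & 0x7fffffff) % capacity
        while (hashes(i) != 0) {
          // Only a matching stored hash forces a dereference of the key itself.
          if (hashes(i) == h && keys(slots(i)) == key) return Some(values(slots(i)))
          i = (i + 1) % capacity
        }
        None
      }
    }

The design point: a miss on the stored hash never leaves the hashes array, so collisions
stay cheap until an actual match forces one dereference of the key.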