Project Concept
- Iterate on cluster sizing to optimize performance and match actual load patterns
Hardware
- Clusters with more nodes recover faster
- The higher the storage per node, the longer the recovery time
- Use commodity hardware:
  - Use large, slow SATA disks (3-6TB) without RAID
  - Use as much RAM as is cost-effective (96-192GB)
  - Use mainstream CPUs with as many cores as possible (8-12 cores)
- Invest in reliable hardware for the NameNodes
- NameNode RAM should be 2GB + 1GB for every 100TB raw disk space
- Networking cost should be 20% of hardware budget
- 40 nodes is the critical mass to achieve best performance/cost ratio
- Plan for net (usable) storage capacity of about 25% of raw capacity: with 3-way replication each stored TB consumes 3TB of raw disk, and the remaining 25% of raw capacity is left as spare headroom (see the worked sizing sketch after this list)
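As a rough illustration of the sizing rules above, here is a back-of-envelope sketch for a hypothetical cluster; the node count, disks per node, and disk size are assumed figures, not recommendations from this list.

```
# Back-of-envelope sizing for a hypothetical cluster:
# 40 nodes x 12 disks x 4TB SATA (assumed layout, adjust to your own)
NODES=40 DISKS_PER_NODE=12 DISK_TB=4

RAW_TB=$(( NODES * DISKS_PER_NODE * DISK_TB ))   # 1920 TB raw
NET_TB=$(( RAW_TB * 25 / 100 ))                  # ~480 TB usable: 3 replicas + 25% headroom
NN_HEAP_GB=$(( 2 + RAW_TB / 100 ))               # 2GB + 1GB per 100TB raw = 21GB

echo "raw=${RAW_TB}TB usable=${NET_TB}TB namenode_heap=${NN_HEAP_GB}GB"
```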
Operating System and JVM
- Both the operating system and the JVM must be 64-bit
- Set file descriptor limit to 64K (ulimit)
- Enable time synchronization using NTP
- Speed up reads by mounting data disks with the noatime option
- Disable transparent huge pages (THP); see the example commands after this list
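The commands below sketch these settings on a RHEL/CentOS-style host; the service name (chronyd), the /grid/0 mount point, and the sysfs paths are assumptions that vary by distribution, so treat this as an outline rather than a drop-in script.

```
# Raise the file descriptor limit for the current shell
# (persist it for Hadoop users in /etc/security/limits.conf)
ulimit -n 65536

# Keep clocks in sync across the cluster (chronyd assumed; ntpd on older systems)
systemctl enable --now chronyd

# Mount data disks with noatime to avoid an inode write on every read
# (permanent change belongs in /etc/fstab; /grid/0 is a placeholder mount point)
mount -o remount,noatime /grid/0

# Disable transparent huge pages (sysfs path varies by kernel/distribution)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```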
System
- Enable monitoring using Ambari
- Monitor NameNode checkpoints to verify that they occur on schedule; an up-to-date checkpoint is what allows you to recover the cluster when needed
- Avoid reaching 90% cluster disk utilization
- Balance the cluster periodically using the HDFS balancer (see the example commands after this list)
- Edit metadata files using Hadoop utilities only, to avoid corruption
- Keep replication >= 3
- Place quotas on user and project directories, and limits on tasks, to avoid cluster starvation
- Clean /tmp regularly – it tends to fill up with junk files
- Optimize the number of reducers to avoid system starvation
- Verify that the file system you selected is supported by your Hadoop vendor
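Several of these tasks map to standard HDFS commands; the sketch below shows typical invocations, with the balancer threshold, quota values, directory paths, and metadata path chosen purely for illustration.

```
# Rebalance data across DataNodes until every node is within 10% of average utilization
hdfs balancer -threshold 10

# Cap raw space (replicas counted) and the number of files/directories on a project directory
# (/projects/example and the limits are placeholders)
hdfs dfsadmin -setSpaceQuota 30t /projects/example
hdfs dfsadmin -setQuota 1000000 /projects/example

# Check overall disk utilization to stay below the 90% mark
hdfs dfsadmin -report | head -n 20

# Sanity-check that checkpoints are recent: the newest fsimage in the NameNode
# metadata directory should be no older than the checkpoint period
# (the dfs.namenode.name.dir path below is a placeholder)
ls -lt /hadoop/hdfs/namenode/current/ | grep fsimage | head -n 2
```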
Data and System Recovery
- Disk failure is not an issue: HDFS detects the failed volume and re-replicates the blocks that were stored on it
- DataNode failure is not a major issue: the blocks it held have replicas on other nodes and are re-replicated automatically
- NameNode failure is an issue even in a clustered (high-availability) environment
- Make regular backups of the NameNode metadata (fsimage and edit logs)
- Enable NameNode high availability (automatic failover) using ZooKeeper
- Provide sufficient disk space for NameNode logging
- Enable trash in core-site.xml to protect against accidental permanent deletion (e.g., rm -r); see the sketch after this list
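As an illustration, the sketch below pairs the trash setting with a simple metadata backup; the 1440-minute retention and the backup path are assumed values to adapt to your environment.

```
# In core-site.xml, keep deleted files in .Trash for 24 hours (1440 minutes, an assumed value):
#   <property>
#     <name>fs.trash.interval</name>
#     <value>1440</value>
#   </property>

# Periodically pull the latest fsimage from the active NameNode to a backup location
# (/backup/hdfs/namenode is a placeholder path; run from a host with HDFS client access)
BACKUP_DIR=/backup/hdfs/namenode/$(date +%F)
mkdir -p "$BACKUP_DIR"
hdfs dfsadmin -fetchImage "$BACKUP_DIR"
```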