Hadoop Tips and Tricks

Project Concept

  • Iterate on cluster sizing to optimize performance and match actual load patterns

Hardware

  • Clusters with more nodes recover faster
  • The higher the storage per node, the longer the recovery time
  • Use commodity hardware:
    • Use large, slow SATA disks (3-6TB) configured as JBOD, without RAID
    • Use as much RAM as is cost-effective (96-192GB RAM)
    • Use mainstream CPU with as many cores as possible (8-12 cores)
  • Invest in reliable hardware for the NameNodes
  • NameNode RAM should be 2GB + 1GB for every 100TB raw disk space
  • Networking cost should be 20% of hardware budget
  • 40 nodes is the critical mass to achieve best performance/cost ratio
  • Plan for net storage capacity of about 25% of raw capacity: keeping roughly 25% of raw capacity spare leaves recovery headroom, and 3x replication divides the remainder by three (see the worked example after this list)
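
A quick back-of-envelope check of these sizing rules, assuming a hypothetical 40-node cluster with twelve 4TB disks per node (all numbers are illustrative, not from the list above):

```bash
# Hypothetical cluster: 40 nodes x 12 disks x 4TB each
NODES=40; DISKS_PER_NODE=12; TB_PER_DISK=4
RAW=$((NODES * DISKS_PER_NODE * TB_PER_DISK))   # 1920 TB raw
USABLE=$((RAW * 75 / 100))                      # keep 25% spare  -> 1440 TB
NET=$((USABLE / 3))                             # 3 replicas      -> 480 TB, i.e. ~25% of raw
NN_RAM=$((2 + RAW / 100))                       # 2GB + 1GB per 100TB raw -> ~21 GB
echo "raw=${RAW}TB usable=${USABLE}TB net=${NET}TB namenode_ram=${NN_RAM}GB"
```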

Operating System and JVM

  • Must be 64-bit
  • Set file descriptor limit to 64K (ulimit)
  • Enable time synchronization using NTP
  • Speed up reads by mounting disks with NOATIME
  • Disable transparent hugepages (THP); a sketch of these settings follows this list
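
A minimal sketch of these OS-level settings on a Linux node (exact paths, mount points, and service names vary by distribution; check your Hadoop vendor's documentation):

```bash
# Raise the file descriptor limit to 64K for the hdfs user
# (persist it in /etc/security/limits.conf; 'hdfs' is an assumed service account)
echo "hdfs - nofile 65536" >> /etc/security/limits.conf

# Keep clocks synchronized with NTP (or chronyd on newer distributions)
systemctl enable --now ntpd

# Mount data disks with noatime so reads do not trigger access-time writes;
# example /etc/fstab entry for a hypothetical data disk:
#   /dev/sdb1  /grid/0  ext4  defaults,noatime  0 0

# Disable transparent hugepages (the sysfs path differs on some distributions)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
```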

System

  • Enable monitoring using Ambari
  • Monitor NameNode checkpoints to verify that they occur at the correct times; up-to-date checkpoints are what allow you to recover your cluster when needed
  • Avoid reaching 90% cluster disk utilization
  • Balance the cluster periodically using the HDFS balancer (see the commands after this list)
  • Edit metadata files using Hadoop utilities only, to avoid corruption
  • Keep replication >= 3
  • Place quotas and limits on users and project directories, as well as on tasks to avoid cluster starvation
  • Clean /tmp regularly; it tends to fill up with junk files
  • Optimize the number of reducers to avoid system starvation
  • Verify that the file system you selected is supported by your Hadoop vendor
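
A few of these routine tasks as HDFS commands (a sketch; the threshold, quota values, and paths are placeholders):

```bash
# Report overall capacity and per-DataNode utilization (watch for nodes nearing 90% full)
hdfs dfsadmin -report

# Rebalance until no DataNode deviates more than 10% from the cluster-average utilization
hdfs balancer -threshold 10

# Quota a project directory: at most one million names and 10TB of replicated space
hdfs dfsadmin -setQuota 1000000 /projects/example
hdfs dfsadmin -setSpaceQuota 10t /projects/example
```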

Data and System Recovery

  • Disk failure is not an issue; HDFS re-replicates lost blocks automatically
  • DataNode failure is not a major issue, for the same reason
  • NameNode failure is an issue even in a clustered (high-availability) environment
  • Make regular backups of NameNode metadata (see the sketch after this list)
  • Enable NameNode high availability with automatic failover via ZooKeeper
  • Provide sufficient disk space for NameNode logging
  • Enable trash in core-site.xml so an accidental hadoop fs -rm -r can be undone (see the configuration snippet after this list)
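
A sketch of the metadata backup above (the backup path is a placeholder):

```bash
# Download the most recent fsimage from the active NameNode to a backup location
hdfs dfsadmin -fetchImage /backup/namenode/$(date +%F)
```

To enable trash, set fs.trash.interval in core-site.xml; deleted files then move to a .Trash directory and are only purged after the interval expires (the 1440-minute value below, i.e. 24 hours, is illustrative):

```xml
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```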