@pragnesh
Forked from tmcgrath/Spark aggregateByKey
Last active August 22, 2016 05:55
Spark aggregateByKey example
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.1.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
2014-12-02 08:40:25.812 java[2479:1607] Unable to load realm mapping info from SCDynamicStore
14/12/02 08:40:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.
scala> val babyNamesCSV = sc.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)))
babyNamesCSV: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> babyNamesCSV.reduceByKey((v1, v2) => v1 + v2).collect
res0: Array[(String, Int)] = Array((Abby,9), (David,11))
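As a point of comparison, `reduceByKey` merges the values of each key with a single `(V, V) => V` function, so the result type must match the value type. The same per-key fold can be sketched on plain Scala collections, with no Spark required (the object and variable names here are illustrative, not from the gist):

```scala
object ReduceByKeyDemo extends App {
  // Same data as the babyNamesCSV RDD in the transcript above.
  val babyNames = List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5))

  // Emulate reduceByKey: group values by key, then reduce each group
  // with (v1, v2) => v1 + v2. Input values and results are both Int.
  val totals = babyNames
    .groupBy(_._1)
    .map { case (key, pairs) => key -> pairs.map(_._2).reduce(_ + _) }

  assert(totals == Map("Abby" -> 9, "David" -> 11))
  println(totals)
}
```

This matches `res0` above: Abby's counts 4 + 5 = 9 and David's 6 + 5 = 11.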
scala> babyNamesCSV.aggregateByKey(0)((accum, v) => accum + v, (v1, v2) => v1 + v2).collect
res1: Array[(String, Int)] = Array((Abby,9), (David,11))
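Here `aggregateByKey(0)` produces the same sums, but its real advantage over `reduceByKey` is that the accumulator type can differ from the value type: the zero value fixes the accumulator's type, the first function (seqOp) folds a value into an accumulator within a partition, and the second (combOp) merges accumulators across partitions. A sketch of those semantics on plain Scala collections, using a hypothetical `(sum, count)` accumulator to compute per-key averages (names are illustrative, not from the gist):

```scala
object AggregateByKeyDemo extends App {
  val babyNames = List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5))

  // zero value: the starting accumulator, a (sum, count) pair.
  val zero = (0, 0)
  // seqOp: fold one Int value into a (sum, count) accumulator.
  def seqOp(acc: (Int, Int), v: Int): (Int, Int) = (acc._1 + v, acc._2 + 1)
  // combOp: merge two accumulators (as Spark would across partitions).
  def combOp(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    (a._1 + b._1, a._2 + b._2)

  // Emulate aggregateByKey(zero)(seqOp, combOp) on a local collection.
  val sumCounts = babyNames
    .groupBy(_._1)
    .map { case (key, pairs) => key -> pairs.map(_._2).foldLeft(zero)(seqOp) }

  // The (sum, count) accumulator lets us derive an average, something
  // reduceByKey's (V, V) => V signature cannot express directly.
  val averages = sumCounts.map { case (k, (sum, n)) => k -> sum.toDouble / n }

  assert(averages("David") == 5.5 && averages("Abby") == 4.5)
  println(averages)
}
```

With `0` as the zero value and `+` for both functions, this reduces to the sums shown in `res1`.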