@jmacias
Forked from itaysk/spark-avro-json-sample.py
Created February 1, 2018 20:52
How to process Event Hub Archive's files using Spark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("spark-avro-json-sample") \
    .config('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false') \
    .getOrCreate()
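# Placeholder: set in_path to the location of the Event Hub Archive Avro files.
in_path = "wasbs://<container>@<account>.blob.core.windows.net/<path>"
# storage -> Avro (on Spark 2.4+, the built-in "avro" format from the spark-avro module also works)
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
# Avro -> JSON: Body is binary, so cast it to a string and keep one JSON document per row
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)  # in real workloads, pass an explicit schema instead of relying on inference
# do whatever you want with `data`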
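
To make the Avro -> JSON step easier to follow, here is a minimal sketch of what it does to each record, written in plain Python without Spark. It assumes the `Body` column holds UTF-8 encoded JSON, which is what the cast to string followed by `spark.read.json` relies on. The `sample_rows` data and the `body_to_dict` helper are hypothetical and only serve as illustration.

```python
import json

# Hypothetical records standing in for rows of the Avro DataFrame.
# Each archived event is assumed to carry its payload as bytes in a "Body" field.
sample_rows = [
    {"Body": b'{"deviceId": "sensor-1", "temperature": 21.5}'},
    {"Body": b'{"deviceId": "sensor-2", "temperature": 19.0}'},
]


def body_to_dict(row):
    """Decode one row's binary Body as UTF-8 text and parse it as JSON."""
    return json.loads(row["Body"].decode("utf-8"))


# Equivalent in spirit to: cast Body to string, then parse every string as JSON.
events = [body_to_dict(row) for row in sample_rows]

for event in events:
    print(event["deviceId"], event["temperature"])
```

In the Spark version the same two steps run in parallel across the cluster, and the field names are either inferred from the data or taken from a schema you supply.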