Lecture 1: Introduction to Research - [📝Lecture Notebooks]
Lecture 2: Introduction to Python - [📝Lecture Notebooks]
Lecture 3: Introduction to NumPy - [📝Lecture Notebooks]
Lecture 4: Introduction to pandas - [📝Lecture Notebooks]
Lecture 5: Plotting Data - [📝Lecture Notebooks]
- hackmd version: https://hackmd.io/1eeNAS1oQuSvMA0q6y_QuA?view
- gist version: https://gist.github.com/bluet/23e7697b86144561c4a3d804903d059d
[TOC]
- Extract: select the data you need, remove noise, standardize the data, parsing...
- Transform: aggregation, mapping, combining, changing data types
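The two steps listed above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a definitive pipeline: the sample data, the column names (`city`, `temp_c`), and the helper names `extract` and `transform` are all made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input (made-up rows), including one noisy record.
RAW = """city,temp_c
 Taipei ,30.5
taipei,29.5
Tainan,n/a
Tainan,32.0
"""


def extract(raw_text):
    """Extract: parse the CSV text, drop noisy rows, standardize the city names."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        value = rec["temp_c"].strip()
        if value.lower() in {"", "n/a"}:
            continue  # noise removal: skip rows with no usable reading
        # standardization: trim whitespace and unify capitalization
        rows.append({"city": rec["city"].strip().title(), "temp_c": value})
    return rows


def transform(rows):
    """Transform: change data types (str -> float), then aggregate per city."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["city"]].append(float(row["temp_c"]))
    # aggregation: mean temperature for each city
    return {city: sum(values) / len(values) for city, values in grouped.items()}


result = transform(extract(RAW))
print(result)
```

Keeping the two stages as separate functions makes it easy to inspect the cleaned rows before any aggregation is applied.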
# A simple cheat sheet of Spark Dataframe syntax
# Current for Spark 1.6.1
# import statements
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
# creating dataframes
df = sqlContext.createDataFrame([(1, 4), (2, 5), (3, 6)], ["A", "B"]) # from manual data
"""Parse Salesforce report data in Python
details in my answer https://stackoverflow.com/a/45645135/448474
"""
from collections import OrderedDict
from simple_salesforce import Salesforce
import pandas as pd
import json
{
  "red": {
    "50": "#ffebee",
    "100": "#ffcdd2",
    "200": "#ef9a9a",
    "300": "#e57373",
    "400": "#ef5350",
    "500": "#f44336",
    "600": "#e53935",
    "700": "#d32f2f",
The Guardian offers an API as deep and robust as the New York Times Article API when it comes to content analysis.
The Guardian's API offers more than "1.7 million pieces of content", with published items as far back as 1999. You can register as a developer here, which gets you 5,000 API hits a day and an API key that looks something like this:
zzzyyyyy-9a9z-999z-z999-9e8a83922516
The Guardian also has a handy interactive explorer for tweaking the query parameters.
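As a rough sketch of what a query might look like, the snippet below only builds a search URL and makes no network call. The endpoint `https://content.guardianapis.com/search` and the parameter names `q`, `from-date`, `page-size`, and `api-key` are assumptions not stated in the text above, so check them against the explorer. The helper `build_search_url` is hypothetical, and the key is the placeholder shown earlier, not a working credential.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the Guardian content API (not given in the text above).
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def build_search_url(api_key, query, from_date=None, page_size=10):
    """Build a search URL; parameter names are assumptions to verify in the explorer."""
    params = {"q": query, "page-size": page_size, "api-key": api_key}
    if from_date is not None:
        params["from-date"] = from_date  # ISO date string, e.g. "1999-01-01"
    # urlencode percent-escapes values and encodes spaces as "+"
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Placeholder key copied from the example above; replace it with your own.
url = build_search_url(
    "zzzyyyyy-9a9z-999z-z999-9e8a83922516",
    "climate change",
    from_date="1999-01-01",
)
print(url)
```

The resulting URL can be pasted into a browser or passed to any HTTP client. With the quota of 5,000 hits a day mentioned above, it is worth caching responses while experimenting.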
Tested with Apache Spark 2.1.0, Python 2.7.13 and Java 1.8.0_112
For older versions of Spark and IPython, please see the previous version of this text.