@JoesDataDiner
Created June 10, 2013 22:32
Extracting Microsoft Office metadata using R
library(XML)
#a .docx file is a zip archive; the core metadata lives in docProps/core.xml
#unzip() extracts just that file (into a temporary directory here) and returns its path
doc = xmlInternalTreeParse(unzip('test.docx', files='docProps/core.xml', exdir=tempdir()))
#define the namespace
ns=c('dc'= 'http://purl.org/dc/elements/1.1/')
#extract the author with an XPath query (dc:creator is a child of the root element)
author = xmlValue(getNodeSet(doc, '/*/dc:creator', namespaces=ns)[[1]])
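
The same XPath approach should extend to other fields in docProps/core.xml. The sketch below is an illustration, not part of the original gist: it parses a small hypothetical core.xml string inline so it runs without a .docx file, and it assumes the standard `cp` and `dcterms` namespaces used by Office Open XML core properties. The `get_field` helper and the sample values are made up for this example. The helper returns NA when a field is absent, whereas indexing with `[[1]]` directly would raise an error on an empty node set.

```r
library(XML)

# Hypothetical core.xml content, inlined so the sketch runs without a .docx file
core_xml <- '<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Quarterly report</dc:title>
  <dc:creator>Jane Doe</dc:creator>
  <cp:lastModifiedBy>John Roe</cp:lastModifiedBy>
  <dcterms:created xsi:type="dcterms:W3CDTF">2012-01-15T09:30:00Z</dcterms:created>
</cp:coreProperties>'

# Parse from a string instead of a file path
doc <- xmlInternalTreeParse(core_xml, asText = TRUE)

# Namespace prefixes used in the XPath queries below
ns <- c(
  dc      = 'http://purl.org/dc/elements/1.1/',
  cp      = 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
  dcterms = 'http://purl.org/dc/terms/'
)

# Hypothetical helper: value of the first matching node, or NA if none matches
get_field <- function(doc, xpath, ns) {
  nodes <- getNodeSet(doc, xpath, namespaces = ns)
  if (length(nodes) == 0) NA_character_ else xmlValue(nodes[[1]])
}

meta <- c(
  title            = get_field(doc, '/*/dc:title', ns),
  author           = get_field(doc, '/*/dc:creator', ns),
  last_modified_by = get_field(doc, '/*/cp:lastModifiedBy', ns),
  created          = get_field(doc, '/*/dcterms:created', ns),
  subject          = get_field(doc, '/*/dc:subject', ns)  # absent in the sample
)
print(meta)
```

To run this against a real document, replace the inline string with the path returned by `unzip()` and drop `asText = TRUE`.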