Skip to content

Instantly share code, notes, and snippets.

View kjhealy's full-sized avatar

Kieran Healy kjhealy

View GitHub Profile
@kjhealy
kjhealy / README.openai-structured-output-demo.md
Created November 19, 2024 20:58 — forked from dannguyen/README.openai-structured-output-demo.md
A basic test of OpenAI's Structured Output feature against financial disclosure reports and a newspaper's police blotter. Code examples use the Python SDK and pydantic for the schema definition.

Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output

tl;dr this demo shows how to call OpenAI's gpt-4o-mini model, provide it with URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data — and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large scale data gathering projects involving public documents.

OpenAI announced Structured Outputs for its API, a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.

For example, given a Congressional financial disclosure report, with assets defined in a table like this:

snps <-
list(r = "~/.config/rstudio/snippets/r.snippets") %>%
purrr::map(readLines, warn = FALSE) %>%
purrr::map(paste, collapse = "\n") %>%
purrr::map(trimws) %>%
purrr::map(strsplit, split = "(^|\n)snippet ") %>%
purrr::map_depth(2, ~ .x[.x != ""]) %>%
purrr::map_depth(2, ~ {
nm <- gsub("^([^\n\t ]+).*", "\\1", .x)
names(.x) <- nm
### Messing around with alluvial plots
library(tidyverse)
library(here)
### -------------- Note on Fonts --------------------
## For these fonts to work you will need to have
## the full Myriad Pro font installed along with
## my {myriad} package. Myriad Pro is available
@kjhealy
kjhealy / gist:85f23c3ba158770ffa3ae09de2ef946a
Created February 27, 2024 22:39
one-percent-stream-sample.sh
## Take an approximately 0.1 percent sample of lines from this gzipped
## csv file. We do this by having gzip stream to STDOUT and then the
## Perl one-liner does the sampling. On an 80GB file, the output will
## be ~80MB. Strictly speaking this is only roughly a 1% sample.
## Also, there are a few possible edge cases with the rough-and-ready
## sampling method, but they're not that likely to worry us given what
## we want to do.
gzip -cd giantfile.csv.gz | perl -ne 'print if (rand() < .001)' > sample.csv
## Don't forget to put back the column names
@kjhealy
kjhealy / co2.r
Last active March 15, 2023 11:38
library(tidyverse)
url <- "https://caphector.com/co2.csv"
df <- read_csv(url) |>
mutate(Time = paste0("2023/", Time),
Time = lubridate::as_datetime(Time))
df |>
pivot_longer(`CO2 (SCD40)`:`Altitude (BME)`) |>
ggplot(mapping = aes(x = Time,
## Irish birth data
## Kieran Healy / @mastodon.social@kjhealy
# After Stata version by Brendan Halpin:
# https://mastodon.social/@bthalpin/109919093889324229
library(tidyverse) # dplyr 1.00 or higher
@kjhealy
kjhealy / macos-tmux-256color.md
Created January 26, 2023 16:44 — forked from bbqtd/macos-tmux-256color.md
Installing tmux-256color for macOS

Installing tmux-256color for macOS

  • macOS 10.15.5
  • tmux 3.1b

macOS has ncurses version 5.7 which does not ship the terminfo description for tmux. There're two ways that can help you to solve this problem.

The Fast Blazing Solution

Instead of tmux-256color, use screen-256color which comes with system. Place this command into ~/.tmux.conf or ~/.config/tmux/tmux.conf(for version 3.1 and later):

@kjhealy
kjhealy / list2env.r
Last active January 22, 2023 18:42
## list2env() example
##
## An occasionally useful fn to turn list elements into objects. You usually
## want to keep your dfs etc as some kind of list, because it's much neater,
## but sometimes you want to spit out a bunch of named objects.
## Base R version
## We need to take care not to mislabel the list elements, e.g. by mistakenly
## assuming things about the correspondence between elements and names.
## To see why this is bad try e.g. unique(mtcars$cyl), which will be in
/* This is in /Applications/RStudio.app/Contents/Resources/app/resources/R.css
/*
* R.css
*
* Copyright (C) 2022 by Posit Software, PBC
*
* Unless you have received this program directly from Posit Software pursuant
* to the terms of a commercial license agreement with Posit Software, then
* this program is licensed to you under the terms of version 3 of the
* GNU Affero General Public License. This program is distributed WITHOUT
``` r
## Multinomial Logit Example
## Libraries
library(tidyverse) # Graphing and other tools
library(palmerpenguins) # The data
library(nnet) # Fitting the model
library(marginaleffects) # Extracting the effects (v0.8 or higher)
### Note: