Peter pszemraj

@pszemraj
pszemraj / test_gemma3n.py
Created June 29, 2025 21:29
test inference with gemma-3n-e2b-it
# -*- coding: utf-8 -*-
"""gemma-3n-test
pip install -U -q git+https://github.com/huggingface/transformers.git
pip install -U -q git+https://github.com/huggingface/pytorch-image-models.git
"""
from transformers import pipeline
import torch
@pszemraj
pszemraj / slice_image.py
Created June 28, 2025 19:53
Slice a tall image into chunks.
#!/usr/bin/env python3
"""
Slice a (possibly very tall) image into fixed-height chunks.
Creates a sibling directory called <image stem>_slices/
and writes slice_000.png, slice_001.png, … inside it.
"""
import argparse
from pathlib import Path
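
The preview above stops at the imports, but the docstring is enough to sketch the bookkeeping: where each chunk starts and ends, and where each slice file would be written. The following is a minimal sketch of that part only, not the gist's actual code. The helper names `slice_bounds` and `slice_paths` and the chunk height used in the example are assumptions for illustration, and the image cropping itself is left out.

```python
from pathlib import Path
from typing import List, Tuple


def slice_bounds(image_height: int, chunk_height: int) -> List[Tuple[int, int]]:
    """Return (top, bottom) pixel rows per chunk; the last chunk may be shorter."""
    if image_height <= 0 or chunk_height <= 0:
        raise ValueError("image_height and chunk_height must be positive")
    return [
        (top, min(top + chunk_height, image_height))
        for top in range(0, image_height, chunk_height)
    ]


def slice_paths(image_path: Path, n_slices: int) -> List[Path]:
    """Build output paths inside a sibling '<image stem>_slices/' directory."""
    out_dir = image_path.parent / f"{image_path.stem}_slices"
    # Zero-padded index matches the slice_000.png, slice_001.png, ... naming
    return [out_dir / f"slice_{i:03d}.png" for i in range(n_slices)]
```

For a 2500-pixel-tall image and a hypothetical chunk height of 1000, `slice_bounds(2500, 1000)` yields three chunks, the last one covering rows 2000 to 2500.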
@pszemraj
pszemraj / push_dataset_from_text.py
Last active June 27, 2025 02:56
aggregate and push an hf dataset from text files
"""
Create & save an hf dataset with train/test/val splits from dir w/ text files
Ideal structure:
root / section_name_1 / file 1
root / section_name_1 / file 2
root / section_name_1 / file YYY
root / section_name_2 / file 1
root / section_name_2 / file ZZZ
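
The layout above (one folder per section, text files inside) can be turned into train/test/val splits roughly as follows. This is a standard-library sketch under stated assumptions, not the gist's implementation: the function names, the record fields, the 10%/10% split fractions, and the seed are all hypothetical, and the step that builds and pushes the actual hf dataset is omitted.

```python
import random
from pathlib import Path
from typing import Dict, List


def collect_records(root: Path) -> List[Dict[str, str]]:
    """Read every file one level below root, tagged with its section folder name."""
    records = []
    for section_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for file_path in sorted(p for p in section_dir.iterdir() if p.is_file()):
            records.append(
                {
                    "section": section_dir.name,
                    "filename": file_path.name,
                    "text": file_path.read_text(encoding="utf-8"),
                }
            )
    return records


def split_records(
    records: List[Dict[str, str]],
    test_frac: float = 0.1,  # assumed fraction, not from the gist
    val_frac: float = 0.1,  # assumed fraction, not from the gist
    seed: int = 42,
) -> Dict[str, List[Dict[str, str]]]:
    """Shuffle deterministically, then carve off test and validation slices."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test : n_test + n_val],
        "train": shuffled[n_test + n_val :],
    }
```

Each resulting list of dicts could then be handed to a dataset library of choice; a fixed seed keeps the split reproducible across runs.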
@pszemraj
pszemraj / run_ocr_nanonets.py
Last active June 18, 2025 01:52
Standalone Asynchronous Nanonets-OCR-s Inference Script using vLLM and PyMuPDF.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Standalone Asynchronous Nanonets-OCR-s Inference Script using vLLM and PyMuPDF.
This script processes PDF files from an input directory using the
nanonets/Nanonets-OCR-s model served locally by vLLM via its OpenAI-compatible API.
It renders each page, sends API requests concurrently for OCR, extracts the
structured markdown/HTML text, and saves the combined text for each PDF into a
corresponding .txt file in the specified output directory.
@pszemraj
pszemraj / model_summary.py
Last active June 19, 2025 18:22
Prints an accurate summary of a pytorch model
from dataclasses import dataclass
from typing import List, Optional, Tuple
import torch
import torch.nn as nn
@dataclass
class _LayerSummary:
"""A dataclass to hold summary information for a single layer."""
@pszemraj
pszemraj / modeling_wavenetwork.py
Last active May 8, 2025 04:01
pytorch impl for pretraining-free (directly finetune) wavenet, tiny transformer for classification
"""
Wave Network: An Ultra-Small Language Model (PyTorch Implementation)
Based on the paper: https://arxiv.org/abs/2411.02674
Hugging Face Transformers compatible implementation.
"""
import math
from typing import Dict, Optional, Tuple, Union
import torch

The Enshittification of Closed-Weight Frontier Models

Ruminations on Theory and Motivations

  1. The Concept of Enshittification: Coined by Cory Doctorow, the term describes a pattern in which platforms first offer great value to attract users, then lock them in, and finally degrade the service for users in order to extract more value for business customers (advertisers, etc.) or, in this case, for the platform owner itself through reduced costs.

  2. Applying it to Frontier AI Chatbots:

    • Phase 1: Attract Users: Release a groundbreaking model (e.g., initial GPT-4, Claude 3 Opus). Offer free access or affordable subscriptions. Generate massive hype and positive press. Users are amazed by the capabilities (complex reasoning, creativity, coding).
    • Phase 2: Lock-in Users: Users integrate the tool into their daily workflows, studies, or creative processes. They become accustomed to its abilities and interface. Subscription models create a direct financial lock-in
@pszemraj
pszemraj / load_zyda2.py
Last active April 27, 2025 21:25
load zyda 2 with streaming
from typing import Dict, List, Optional
import datasets
# Optional: Keep the version print outside the function if desired
# print(f"Using datasets library version: {datasets.__version__}")
def create_interleaved_streaming_dataset(
dataset_path: str = "Zyphra/Zyda-2",
@pszemraj
pszemraj / create_unified_mcqa.py
Created April 18, 2025 19:20
multiple‑choice dataset aggregator
#!/usr/bin/env python
"""
create_unified_mcqa.py – “batteries‑included” multiple‑choice aggregator
✅ Handles all datasets listed in the conversation
✅ Survives missing/renamed columns
✅ Converts every `label` to pure int64 to avoid ClassLabel clashes
✅ Explicitly casts features to ensure concatenation compatibility
✅ Improved error handling and skipping for malformed examples
✅ Limits warning/info messages per dataset
✅ Fixes column mismatch error during cast
@pszemraj
pszemraj / async_pipeline.py
Last active May 22, 2025 23:33
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
This script processes PDF files from an input directory using the
reducto/RolmOCR model served locally by vLLM via its OpenAI-compatible API.
It renders each page, sends API requests concurrently for OCR, extracts plain
text, and saves the combined text for each PDF into a corresponding .txt file
in the specified output directory.
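
The concurrency pattern described above (one request per rendered page, sent concurrently, results combined per PDF) can be sketched with `asyncio` alone. This is an illustrative sketch, not the script's code: `ocr_page_stub` is a hypothetical stand-in for the real API request, and the `max_concurrent` limit and the blank-line page separator are assumptions.

```python
import asyncio
from typing import List


async def ocr_page_stub(page_index: int) -> str:
    """Hypothetical stand-in for one OCR API request; returns placeholder text."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[page {page_index} text]"


async def ocr_document(n_pages: int, max_concurrent: int = 4) -> str:
    """OCR all pages concurrently (bounded), then join results in page order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page_index: int) -> str:
        # The semaphore caps how many requests are in flight at once
        async with semaphore:
            return await ocr_page_stub(page_index)

    # gather() returns results in argument order, so pages stay in sequence
    pages: List[str] = await asyncio.gather(*(bounded(i) for i in range(n_pages)))
    return "\n\n".join(pages)
```

Calling `asyncio.run(ocr_document(3))` returns the three placeholder page texts joined in page order. In a real pipeline the stub would be replaced by a request to the locally served model, and the bound keeps the server from being flooded on long PDFs.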