Peter pszemraj

pszemraj / modeling_wavenetwork.py
Last active May 8, 2025 04:01
PyTorch implementation of Wave Network: a pretraining-free (directly fine-tuned) tiny transformer for classification
"""
Wave Network: An Ultra-Small Language Model (PyTorch Implementation)
Based on the paper: https://arxiv.org/abs/2411.02674
Hugging Face Transformers compatible implementation.
"""
import math
from typing import Dict, Optional, Tuple, Union
import torch
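The preview cuts off at the imports. As a minimal sketch of the paper's core idea, the snippet below shows a wave-style token representation: each token gets a magnitude embedding (global semantics) and a phase embedding (local semantics), combined as a complex vector. The class name and interface are illustrative assumptions, not the gist's actual API.

import torch
import torch.nn as nn

class WaveTokenEmbedding(nn.Module):
    """Token -> complex vector G * e^{i*phi}, returned as concatenated
    real/imaginary parts. Illustrative sketch of the Wave Network idea."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.magnitude = nn.Embedding(vocab_size, dim)  # global semantics
        self.phase = nn.Embedding(vocab_size, dim)      # local semantics

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        g, phi = self.magnitude(input_ids), self.phase(input_ids)
        return torch.cat([g * torch.cos(phi), g * torch.sin(phi)], dim=-1)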

The Enshittification of Closed-Weight Frontier Models

Ruminations on Theory and Motivations

  1. The Concept of Enshittification: Coined by Cory Doctorow, it describes the pattern where a platform first offers great value to attract users, then locks them in, and finally extracts value by degrading the service for users, shifting the surplus to business customers (advertisers, etc.) or, in this case, to the platform owner itself through cost-cutting.

  2. Applying it to Frontier AI Chatbots:

  • Phase 1: Attract Users: Release a groundbreaking model (e.g., initial GPT-4, Claude 3 Opus). Offer free access or affordable subscriptions. Generate massive hype and positive press. Users are amazed by the capabilities (complex reasoning, creativity, coding).
  • Phase 2: Lock-in Users: Users integrate the tool into their daily workflows, studies, or creative processes. They become accustomed to its abilities and interface. Subscription models create a direct financial lock-in.
pszemraj / load_zyda2.py
Last active April 27, 2025 21:25
load zyda 2 with streaming
from typing import Dict, List, Optional
import datasets
# Optional: Keep the version print outside the function if desired
# print(f"Using datasets library version: {datasets.__version__}")
def create_interleaved_streaming_dataset(
dataset_path: str = "Zyphra/Zyda-2",
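The preview cuts off mid-signature. A hedged completion is sketched below; the subset names, default probabilities, and seed are assumptions (check the Zyda-2 dataset card for the real config names):

def create_interleaved_streaming_dataset(
    dataset_path: str = "Zyphra/Zyda-2",
    subsets: Optional[List[str]] = None,
    probabilities: Optional[List[float]] = None,
    seed: int = 42,
) -> datasets.IterableDataset:
    """Stream each subset and interleave them into one IterableDataset."""
    subsets = subsets or ["dclm_crossdeduped", "fwe3"]  # assumed config names
    streams = [
        datasets.load_dataset(dataset_path, name=s, split="train", streaming=True)
        for s in subsets
    ]
    # probabilities=None falls back to uniform sampling across subsets
    return datasets.interleave_datasets(streams, probabilities=probabilities, seed=seed)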
pszemraj / create_unified_mcqa.py
Created April 18, 2025 19:20
multiple‑choice dataset aggregator
#!/usr/bin/env python
"""
create_unified_mcqa.py – “batteries‑included” multiple‑choice aggregator
✅ Handles all datasets listed in the conversation
✅ Survives missing/renamed columns
✅ Converts every `label` to pure int64 to avoid ClassLabel clashes
✅ Explicitly casts features to ensure concatenation compatibility
✅ Improved error handling and skipping for malformed examples
✅ Limits warning/info messages per dataset
✅ Fixes column mismatch error during cast
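The label-normalization step the checklist mentions can be illustrated with a short sketch: cast any ClassLabel `label` column to a plain int64 Value so datasets from different sources share identical features and concatenate cleanly (the helper name is mine, not from the gist):

import datasets

def normalize_labels(ds: datasets.Dataset) -> datasets.Dataset:
    # ClassLabel and int64 features clash on concatenation; cast to int64.
    if isinstance(ds.features["label"], datasets.ClassLabel):
        ds = ds.cast_column("label", datasets.Value("int64"))
    return ds

# unified = datasets.concatenate_datasets([normalize_labels(d) for d in parts])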
pszemraj / async_pipeline.py
Last active April 30, 2025 17:46
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
This script processes PDF files from an input directory using the
reducto/RolmOCR model served locally by vLLM via its OpenAI-compatible API.
It renders each page, sends API requests concurrently for OCR, extracts plain
text, and saves the combined text for each PDF into a corresponding .txt file
in the specified output directory.
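As a rough sketch of the flow the docstring describes: render each page with PyMuPDF, then send concurrent chat-completion requests to the local vLLM OpenAI-compatible endpoint. The endpoint URL, dpi, and prompt are assumptions, not taken from the gist.

import asyncio
import base64

import fitz  # PyMuPDF
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM

async def ocr_page(page: fitz.Page) -> str:
    png = page.get_pixmap(dpi=150).tobytes("png")
    data_url = "data:image/png;base64," + base64.b64encode(png).decode()
    resp = await client.chat.completions.create(
        model="reducto/RolmOCR",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Return the plain text of this page."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    return resp.choices[0].message.content or ""

async def ocr_pdf(path: str) -> str:
    doc = fitz.open(path)
    pages = await asyncio.gather(*(ocr_page(p) for p in doc))
    return "\n\n".join(pages)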
pszemraj / alternate_attn_report.md
Created April 4, 2025 14:39
deep research report by gpt-4.5

Alternate Attention Mechanisms for Sequence Modeling (2023–2025)

Transformer-style self-attention has been central to recent advances in language modeling, but its $\mathcal{O}(L^2)$ complexity (for sequence length $L$) motivates research into more efficient alternate attention mechanisms. This report surveys state-of-the-art methods from 2023–2025 that replace or augment standard self-attention in language sequence models. We organize methods by broad families – from linear approximations and sparsity-based variants to convolutional, state-space, and recurrent mechanisms – outlining each method’s motivation, technical formulation, empirical performance on language tasks, and efficiency characteristics.
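Since the families surveyed mostly trade exactness for sub-quadratic cost, a generic kernelized linear-attention sketch (not any specific paper's code) makes the trick concrete: replace softmax(QKᵀ)V with a positive feature map φ so the L×L attention matrix never materializes.

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Non-causal kernelized attention in O(L·d²) instead of O(L²·d).
    Shapes: q, k are (B, L, D); v is (B, L, E)."""
    phi = lambda x: F.elu(x) + 1             # positive feature map
    q, k = phi(q), phi(k)
    kv = torch.einsum("bld,ble->bde", k, v)  # Σ_l φ(k_l) v_lᵀ
    z = 1.0 / (torch.einsum("bld,bd->bl", q, k.sum(dim=1)) + eps)
    return torch.einsum("bld,bde,bl->ble", q, kv, z)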

Contents:

pszemraj / fix_extensions.py
Created March 31, 2025 22:50
File Extension Fixer using Magika
#!/usr/bin/env python3
"""
File Extension Fixer using Magika
This script analyzes files using Google's Magika deep learning model to identify
their actual content types and fix incorrect file extensions.
pip install -U joblib magika tqdm
"""
# Install system dependencies (fonts, poppler) and olmocr with GPU extras
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
git clone https://github.com/allenai/olmocr.git --depth 1
cd olmocr
pip install -q ninja
pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
# clean up
pip cache purge && sudo apt-get autoremove -y
pszemraj / layernorm_scaling.py
Last active March 26, 2025 03:08
LayerNorm Scaling implementation to mitigate the Curse of Depth in LLMs.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class LayerNormScaling(nn.Module):
"""
LayerNorm Scaling implementation to mitigate the Curse of Depth in LLMs.
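The preview ends inside the docstring. A hedged completion sketch, assuming the paper's rule of scaling the LayerNorm output by 1/sqrt(layer index) so deeper layers contribute less output variance (constructor arguments are illustrative):

import math

import torch
import torch.nn as nn

class LayerNormScaling(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_idx); sketch only."""

    def __init__(self, hidden_size: int, layer_idx: int, eps: float = 1e-5):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, eps=eps)
        self.scale = 1.0 / math.sqrt(max(1, layer_idx))  # layers 1-indexed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.scale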
| Model | Average CR⬆️ | AGIEval Mean (Min, Max) | AGIEval CR | MMLU-Pro Mean (Min, Max) | MMLU-Pro CR | Math Mean (Min, Max) | Math CR | #Params (B) |
|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-70B-Instruct | 72.39 | 72.43, (65.34, 74.66) | 81.79 | 66.63, (55.16, 70.68) | 73.19 | 65.88, (64.58, 67.86) | 62.18 | 0 |
| mistralai/Mistral-Large-Instruct-2407 | 71.93 | 68.78, (61.41, 74.49) | 75.77 | 65.1, (50.28, 69.23) | 72.31 | 71.04, (69.66, 72.72) | 67.71 | 0 |
| meta-llama/Meta-Llama-3-70B-Instruct | 69.11 | 69.71, (60.77, 71.2) | 83.13 | 58.75, (49.3, 63.16) | 75.24 | 51.29, (49.66, 54.2) | 48.96 | 0 |
| 01-ai/Yi-1.5-34B-Chat | 58.43 | 63.89 | | | | | | |