Ricardo Decal (crypdick)

View GitHub Profile
@crypdick
crypdick / flaky_ray_data_llm.log
Created June 17, 2025 23:29
Stack trace for a flaky Ray Data LLM workload
<deleted 50,000 lines of logs repeating the same thing>
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/output_processor.py", line 51, in get
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) raise output
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 317, in generate_async
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) output = await self._generate_async(request)
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=3858, ip=10.0.54.100) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/llm/_internal/batch/stages/vllm_engine_stage.py", line 399, in generate_as
2025-06-17 16:26:19,406 DEBUG streaming_executor.py:546 -- 9: - MapBatches(vLLMEngineStageUDF): Tasks: 24; Actors: 3; Queued blocks: 13; Resources: 0.0 CPU, 3.0 GPU, 768.0MB object store; [8/24 objects local], Blocks Outputted: 0/None
2025-06-17 16:26:19,406 DEBUG streaming_executor.py:546 -- 10: - MapBatches(DetokenizeUDF): Tasks: 0; Actors: 1; Queued blocks: 0; Resources: 1.0 CPU, 0.0B object store; [all objects local], Blocks Outputted: 0/None
2025-06-17 16:26:19,406 DEBUG streaming_executor.py:546 -- 11: - Map(_postprocess)->Filter(NoneType)->Write: Tasks: 0; Actors: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store, Blocks Outputted: 0/None
2025-06-17 16:26:27,977 ERROR streaming_executor_state.py:519 -- An exception was raised from a task of operator "MapBatches(vLLMEngineStageUDF)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_inter
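The abort message above points at `DataContext.max_errored_blocks`. A minimal sketch of that knob, assuming a Ray Data pipeline like the one in the log (the threshold of 5 is an arbitrary illustration, not a recommendation):

```python
import ray

# Allow a few failed blocks instead of aborting the whole dataset run;
# -1 would ignore all errored blocks. 5 is an arbitrary example value.
ctx = ray.data.DataContext.get_current()
ctx.max_errored_blocks = 5
```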
@crypdick
crypdick / anyscale_job_submit_err.log
Created June 16, 2025 18:40
Job completes successfully, but the cell throws a CalledProcessError
{
"name": "CalledProcessError",
"message": "Command 'b'\
# Production batch job -- note that this is a bash cell\
! anyscale job submit --name=train-xboost-breast-cancer-model \\\\\
--containerfile=\"${WORKING_DIR}/containerfile\" \\\\\
--working-dir=\"${WORKING_DIR}\" \\\\\
--exclude=\"\" \\\\\
--wait \\\\\
--max-retries=0 \\\\\
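One way to see why the cell raises even though the job finishes: a `!` shell cell raises `CalledProcessError` on any nonzero exit code. A sketch that runs the command via `subprocess` so the exit code can be inspected instead of raising (the `echo` command is a placeholder, not the real `anyscale` CLI invocation):

```python
import subprocess

# Placeholder argv; substitute the real `anyscale job submit ...` command.
result = subprocess.run(
    ["echo", "job submitted"],
    capture_output=True,
    text=True,
    check=False,  # don't raise CalledProcessError; inspect returncode instead
)
if result.returncode != 0:
    print("submit failed with code", result.returncode, result.stderr)
else:
    print(result.stdout.strip())
```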
---------------------------------------------------------------------------
SystemException Traceback (most recent call last)
SystemException:
The above exception was the direct cause of the following exception:
RayTaskError(TypeError) Traceback (most recent call last)
/home/ray/default/e2e-audio/e2e_audio/curation.ipynb Cell 16 line 1
----> 1 print(ds.take(1))
from time import sleep
import ray
from ray import tune
from ray.tune.tuner import Tuner
import time
def expensive_setup():
print("EXPENSIVE SETUP")
sleep(1)
@crypdick
crypdick / tune-pytorch-lightning-ipynb-logs
Created February 7, 2025 19:39
Error logs from running tune-pytorch-lightning.ipynb `tune_mnist_asha(num_samples=num_samples)`
(RayTrainWorker pid=43596) Setting up process group for: env:// [rank=0, world_size=3]
(RayTrainWorker pid=43591) [W207 11:34:59.682154000 ProcessGroupGloo.cpp:757] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
(TorchTrainer pid=43579) Started distributed worker processes:
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43592) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43591) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=43579) - (node_id=bd46294119818be88b4f409ae42e495f2d7b624c90eb7b896ad91660, ip=127.0.0.1, pid=43593) world_rank=2, local_rank=2, node_rank=0
(RayTrainWorker pid=43595) Setting up process group for: env://
2025-02-04 13:43:35,467 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadTorch->Map(extract_and_process_image)] -> LimitOperator[limit=1]
Running Dataset. Active & requested resources: 1/12 CPU, 256.0MB/1.0GB object store: : 0.00 row [00:01, ? row/s]2025-02-04 13:43:36,763 ERROR streaming_executor_state.py:485 -- An exception was raised from a task of operator "ReadTorch->Map(extract_and_process_image)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
⚠️ Dataset execution failed: : 0.00 row [00:01, ? row/s]
- ReadTorch->Map(extract_and_process_image): Tasks: 1; Queued blocks: 0; Resources: 1.0 CPU, 256.0MB object store: : 0.00 row [00:01, ? row/s]
- limit=1: Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: : 0.00 row [00:01, ? row/s]
2025-02-04 13:43:36,781 ERROR exceptions.py:73
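The Gloo warning above suggests pinning the network interface via `GLOO_SOCKET_IFNAME`. A sketch, assuming a single-machine run where the loopback interface is acceptable (`lo` is an assumption; on a real cluster you would name the actual NIC), set before the process group is initialized:

```python
import os

# "lo" is an assumed interface name for a local run; use your actual NIC
# (e.g. "eth0") on a multi-node cluster. Must be set before workers start.
os.environ["GLOO_SOCKET_IFNAME"] = "lo"
```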
@crypdick
crypdick / torchvision_mean_stddev.py
Created January 28, 2025 20:22
Code used to compute the mean and standard deviation of all PyTorch datasets. Output available in my blog post.
import inspect
import csv
import os
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import v2
dataset_names = [
"Caltech101", "Caltech256", "CelebA", "CIFAR10", "CIFAR100", "Country211", "DTD",
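For context, a common way such a script computes per-channel statistics is a running-sum pass over a `DataLoader` using E[x²] − E[x]²; a self-contained sketch (the constant synthetic dataset below is only for illustration, not the gist's actual torchvision datasets):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def channel_mean_std(loader):
    """Running per-channel mean/std over batches of NCHW image tensors."""
    total, sq_total, count = 0.0, 0.0, 0
    for images, _ in loader:
        # Flatten spatial dims so each channel's pixels form one long vector.
        b, c = images.shape[0], images.shape[1]
        pixels = images.reshape(b, c, -1)
        total = total + pixels.sum(dim=(0, 2))
        sq_total = sq_total + (pixels ** 2).sum(dim=(0, 2))
        count += b * pixels.shape[2]
    mean = total / count
    std = (sq_total / count - mean ** 2).sqrt()  # E[x^2] - E[x]^2
    return mean, std

# Tiny synthetic "dataset": 8 constant-valued 3-channel 4x4 images.
data = TensorDataset(torch.full((8, 3, 4, 4), 0.5), torch.zeros(8))
mean, std = channel_mean_std(DataLoader(data, batch_size=4))
```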
@crypdick
crypdick / gist:e3e4204a9dafb02f63e1cca033e2ce75
Created April 18, 2024 17:07
AI coach prompt with more questions
First message: Hey buddy! What's on your mind?
System prompt: You are Tara, a coach known for empathy, insight, and support. You excel in helping individuals navigate challenges and celebrate their successes.
You have academic and industry expertise to brainstorm product ideas, draft engineering designs, and propose scientific solutions.
You help people feel better by asking questions to reflect on and evoke feelings of positivity, gratitude, joy, and love.
You show radical candor and tough love.
@crypdick
crypdick / ray_error.log
Created February 26, 2024 22:03
Ray 2.9.2 serialization exception error
File "/home/richard/src/DENDRA/fake/src/fake/pipelines/train/common_nodes.py", line 262, in run_experiment
result_grid = tuner.fit()
File "/home/richard/miniconda3/envs/fake/lib/python3.8/site-packages/ray/tune/tuner.py", line 381, in fit
return self._local_tuner.fit()
File "/home/richard/miniconda3/envs/fake/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 509, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/richard/miniconda3/envs/fake/lib/python3.8/site-packages/ray/tune/impl/tuner_internal.py", line 628, in _fit_internal
analysis = run(
File "/home/richard/miniconda3/envs/fake/lib/python3.8/site-packages/ray/tune/tune.py", line 1002, in run
runner.step()