relyt0925 · August 18, 2024 21:34
diff --git a/gistfile1.txt b/gistfile1.txt
 [root@tyler-a100-newimage-val instructlab]# nohup /root/bin/ilab.sh train  --strategy lab-multiphase --phased-phase1-data /var/mnt/inststg1/instructlab/generated/knowledge_train_msgs_2024-08-18T15_57_14.jsonl  --phased-phase2-data /var/mnt/inststg1/instructlab/generated/skills_train_msgs_2024-08-18T15_57_14.jsonl --phased-base-dir /var/mnt/inststg1/instructlab/phasedbasedir --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 --phased-mt-bench-judge /var/mnt/inststg1/instructlab/models/prometheus-eval/prometheus-8x7b-v2.0/ --max-batch-len 10000 --max-seq-len 4096 --phased-phase1-effective-batch-size 128 --phased-phase2-effective-batch-size 3840 --enable-serving-output --gpus 8  --skip-user-confirm --model-path /var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/ &
 [root@tyler-a100-newimage-val instructlab]# cat nohup.out 
 time="2024-08-18T20:04:24Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
 You are using an aliased command, this will be deprecated in a future release. Please consider using `ilab model train` instead
 Training Phase 1/2...
 TrainingArgs for current phase: TrainingArgs(model_path='/var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/var/mnt/inststg1/instructlab/generated/knowledge_train_msgs_2024-08-18T15_57_14.jsonl', ckpt_output_dir='/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints', data_output_dir='/var/mnt/inststg1/instructlab/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=10000, num_epochs=2, effective_batch_size=128, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=False, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=None)
 [2024-08-18 20:04:33,199] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 df: /var/mnt/inststg1/instructlab/.triton/autotune: No such file or directory
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 INFO 2024-08-18 20:04:40,050 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
 INFO 2024-08-18 20:04:40,051 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
 INFO 2024-08-18 20:04:40,051 numexpr.utils:161: NumExpr defaulting to 16 threads.
 INFO 2024-08-18 20:04:40,206 datasets:58: PyTorch version 2.3.1 available.
 INFO 2024-08-18 20:04:40,465 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
 Generating train split: 267 examples [00:00, 11032.42 examples/s]
 tokenizing the dataset with /var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/ tokenizer...
 Map (num_proc=16): 100% 267/267 [00:00<00:00, 398.01 examples/s]
 ten largest length percentiles:
 Map (num_proc=16): 100% 267/267 [00:00<00:00, 1626.59 examples/s]
 quantile 90th: 1116.4
 quantile 91th: 1138.42
 quantile 92th: 1165.36
 quantile 93th: 1181.0400000000002
 quantile 94th: 1210.08
 quantile 95th: 1226.2999999999997
 quantile 96th: 1278.36
 quantile 97th: 1652.6199999999994
 quantile 98th: 1689.72
 quantile 99th: 1712.7599999999998
 quantile 100th: 1734.0

 at 4096 max sequence length, the number of samples to be dropped is 0
 (0.00% of total)
 quantile 0th: 255.0
 quantile 1th: 284.66
 quantile 2th: 288.32
 quantile 3th: 295.88
 quantile 4th: 301.0
 quantile 5th: 303.0
 quantile 6th: 318.84
 quantile 7th: 320.62
 quantile 8th: 322.28
 quantile 9th: 324.94
 quantile 10th: 327.6
 at 20 min sequence length, the number of samples to be dropped is 0
 checking the validity of the samples...
 Filter (num_proc=16): 100% 267/267 [00:00<00:00, 435.71 examples/s]
 INFO 2024-08-18 20:04:48,018 root:611: number of dropped samples: 0 -- out of 267
 Categorizing training data type...
 Data type sorting: 100% 267/267 [00:00<00:00, 468764.83it/s]
 unmasking the appropriate message content...
 Map (num_proc=16): 100% 267/267 [00:00<00:00, 1418.85 examples/s]
 The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.

 Pretraining ex sample 186: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> 
 The Social Security number (SSN) is a nine-digit identifier in the format "AAA-GG-SSSS," consisting of an area number, group number, and serial number. Prior to June 25, 2011, area numbers were assigned based on geographical region, with numbers issued from the northeast to the southwest. However, the SSN assignment process was randomized in 2011, eliminating the geographical significance of the first three digits and the significance of the highest group number assigned for each area number. Unassigned area numbers, excluding 000, 666, and 900-999, were introduced for assignment. The middle two digits, the group number, range from 01 to 99 and were not assigned consecutively in an area. The last four digits are the serial number. Individual Taxpayer Identification Numbers (ITINs) are not affected by this SSA change as they are issued by the IRS.

 What are the three parts of a Social Security number?
 <mask> 
 A Social Security number consists of an area number, group number, and serial number.
 <|endoftext|>
 Original Input: <|system|> 
 I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
 <|user|> 
 The Social Security number (SSN) is a nine-digit identifier in the format "AAA-GG-SSSS," consisting of an area number, group number, and serial number. Prior to June 25, 2011, area numbers were assigned based on geographical region, with numbers issued from the northeast to the southwest. However, the SSN assignment process was randomized in 2011, eliminating the geographical significance of the first three digits and the significance of the highest group number assigned for each area number. Unassigned area numbers, excluding 000, 666, and 900-999, were introduced for assignment. The middle two digits, the group number, range from 01 to 99 and were not assigned consecutively in an area. The last four digits are the serial number. Individual Taxpayer Identification Numbers (ITINs) are not affected by this SSA change as they are issued by the IRS.

 What are the three parts of a Social Security number?
 <|assistant|> 
 A Social Security number consists of an area number, group number, and serial number.
 <|endoftext|>

 Pretraining ex sample 75: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> 
 Personally Identifiable Information (PII) refers to any information that can be used to identify a specific individual, such as their social security number, full name, email address, or phone number. With the increasing reliance on information technology, the amount of PII shared with organizations has grown, making it a target for cybercriminals. Hackers steal PII to commit identity theft, sell it on the black market, or hold it captive via ransomware, leading to significant costs for individuals and organizations.

 PII can be categorized into direct and indirect identifiers. Direct identifiers, such as passport or driver's license numbers, are unique to a person and sufficient to determine their identity. Indirect identifiers, like race and place of birth, are not unique but can identify a person when combined, such as gender, ZIP code, and date of birth.

 PII can also be classified as sensitive or non-sensitive. Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.

 Data privacy laws typically require organizations to safeguard sensitive PII with encryption, access control, or other cybersecurity measures, while non-sensitive PII may or may not be protected depending on the regulations and the organization's policies. The classification of PII as sensitive or non-sensitive depends on the context, such as the specific use case or potential harm resulting from a breach.

 What is the difference between sensitive and non-sensitive PII?
 <mask> 
 Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.
 <|endoftext|>
 Original Input: <|system|> 
 I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
 <|user|> 
 Personally Identifiable Information (PII) refers to any information that can be used to identify a specific individual, such as their social security number, full name, email address, or phone number. With the increasing reliance on information technology, the amount of PII shared with organizations has grown, making it a target for cybercriminals. Hackers steal PII to commit identity theft, sell it on the black market, or hold it captive via ransomware, leading to significant costs for individuals and organizations.

 PII can be categorized into direct and indirect identifiers. Direct identifiers, such as passport or driver's license numbers, are unique to a person and sufficient to determine their identity. Indirect identifiers, like race and place of birth, are not unique but can identify a person when combined, such as gender, ZIP code, and date of birth.

 PII can also be classified as sensitive or non-sensitive. Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.

 Data privacy laws typically require organizations to safeguard sensitive PII with encryption, access control, or other cybersecurity measures, while non-sensitive PII may or may not be protected depending on the regulations and the organization's policies. The classification of PII as sensitive or non-sensitive depends on the context, such as the specific use case or potential harm resulting from a breach.

 What is the difference between sensitive and non-sensitive PII?
 <|assistant|> 
 Sensitive PII, such as social security numbers, unique identification numbers, biometric data, financial information, and medical records, directly identifies an individual and could cause significant harm if leaked or stolen. Non-sensitive PII, like a person's full name, mother's maiden name, telephone number, IP address, place of birth, date of birth, geographical details, employment information, email address or mailing address, race or ethnicity, and religion, may or may not be unique to a person but would not cause significant harm if leaked or stolen in isolation. However, when combined, they can still pose risks.
 <|endoftext|>

 Creating json from Arrow format: 100% 1/1 [00:00<00:00, 23.07ba/s]
 Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/ --data_path=/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints --num_epochs=2 --effective_batch_size=128 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=0 --log_level=INFO --max_batch_len=10000 --seed=42 --chat-tmpl-path=/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py --checkpoint_at_epoch
 W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757] 
 W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757] *****************************************
 W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
 W0818 20:04:49.993000 140562764190144 torch/distributed/run.py:757] *****************************************
 [2024-08-18 20:04:52,891] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:04:53,058] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:04:53,222] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:04:53,242] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:04:53,264] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:04:53,304] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:04:53,305] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:04:53,335] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 model_name_or_path: /var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/
 data_path: /var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl
 output_dir: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints
 num_epochs: 2
 last_step: 0
 effective_batch_size: 128
 learning_rate: 2.0e-05
 lr_scheduler: cosine
 num_warmup_steps: 25
 save_samples: 0
 save_samples_ds: null
 save_last: false
 checkpoint_at_epoch: true
 log_level: INFO
 seed: 42
 mock_data: false
 mock_len: 2600
 sharding_strategy: FULL_SHARD
 is_granite: false
 lora_r: 0
 lora_alpha: 32
 lora_dropout: 0.1
 lora_quant_bits: null
 lora_target_modules: null
 max_batch_len: 10000
 cpu_offload_optimizer: false
 cpu_offload_optimizer_pin_memory: false
 cpu_offload_optimizer_ratio: 1.0
 NEFTune_alpha: null
 chat_tmpl_path: /opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
 disable_flash_attn: false

 {
    "script_params": {
        "model_name_or_path": "/var/mnt/inststg1/instructlab/models/granite-7b-starter1.1/",
        "data_path": "/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl",
        "output_dir": "/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints",
        "num_epochs": 2,
        "last_step": 0,
        "effective_batch_size": 128,
        "learning_rate": 2e-05,
        "lr_scheduler": "cosine",
        "num_warmup_steps": 25,
        "save_samples": 0,
        "save_samples_ds": null,
        "save_last": false,
        "checkpoint_at_epoch": true,
        "log_level": "INFO",
        "seed": 42,
        "mock_data": false,
        "mock_len": 2600,
        "sharding_strategy": "FULL_SHARD",
        "is_granite": false,
        "lora_r": 0,
        "lora_alpha": 32,
        "lora_dropout": 0.1,
        "lora_quant_bits": null,
        "lora_target_modules": null,
        "max_batch_len": 10000,
        "cpu_offload_optimizer": false,
        "cpu_offload_optimizer_pin_memory": false,
        "cpu_offload_optimizer_ratio": 1.0,
        "NEFTune_alpha": null,
        "chat_tmpl_path": "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
        "disable_flash_attn": false
    },
    "timestamp": "2024-08-18T20:04:56.779187"
 }
 [2024-08-18 20:04:56,857] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:04:56,857] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
 [2024-08-18 20:04:57,155] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:04:57,392] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:04:57,719] [INFO] [comm.py:637:init_distributed] cdb=None
 tyler-a100-newimage-val:570:570 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:570:570 [0] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:570:570 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:572:572 [2] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:572:572 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:572:572 [2] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:573:573 [3] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:573:573 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:573:573 [3] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:574:574 [4] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:574:574 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:574:574 [4] NCCL INFO NCCL version 2.22.3+cuda12.5
 [2024-08-18 20:04:57,854] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:04:57,858] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:04:57,865] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:04:57,877] [INFO] [comm.py:637:init_distributed] cdb=None
 tyler-a100-newimage-val:576:576 [6] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:576:576 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:576:576 [6] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:571:571 [1] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:571:571 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:571:571 [1] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:577:577 [7] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:577:577 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:577:577 [7] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:575:575 [5] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:575:575 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:575:575 [5] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Using network Socket
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO Using network Socket
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO Using network Socket
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO Using network Socket
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO Using network Socket
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO Using network Socket
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO Using network Socket
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO Using network Socket
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO ncclCommInitRank comm 0x5636163750f0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO ncclCommInitRank comm 0x55e6aed5c790 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO ncclCommInitRank comm 0x55f7ae069780 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO ncclCommInitRank comm 0x558d760ca530 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO ncclCommInitRank comm 0x5640cf2b7db0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO ncclCommInitRank comm 0x55e7b0065170 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO ncclCommInitRank comm 0x55d34ff865c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO ncclCommInitRank comm 0x560ada6ca910 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xa1bb5af6fed5ca65 - Init START
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO NVLS multicast support is not available on dev 4
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO NVLS multicast support is not available on dev 3
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO NVLS multicast support is not available on dev 0
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO NVLS multicast support is not available on dev 5
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO NVLS multicast support is not available on dev 7
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO NVLS multicast support is not available on dev 6
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO NVLS multicast support is not available on dev 2
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO NVLS multicast support is not available on dev 1
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO comm 0x558d760ca530 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO comm 0x5640cf2b7db0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO comm 0x560ada6ca910 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO comm 0x55d34ff865c0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO comm 0x55e7b0065170 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO comm 0x5636163750f0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO comm 0x55e6aed5c790 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO comm 0x55f7ae069780 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO ncclCommInitRank comm 0x55e6aed5c790 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:572:1301 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.82 (kernels 0.13, bootstrap 0.36, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO ncclCommInitRank comm 0x558d760ca530 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO ncclCommInitRank comm 0x55d34ff865c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:577:1314 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.75 (kernels 0.25, bootstrap 0.17, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:570:1300 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.84 (kernels 0.15, bootstrap 0.36, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO ncclCommInitRank comm 0x560ada6ca910 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO ncclCommInitRank comm 0x5636163750f0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO ncclCommInitRank comm 0x5640cf2b7db0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:574:1303 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.82 (kernels 0.15, bootstrap 0.34, allgathers 0.01, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:575:1315 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.75 (kernels 0.25, bootstrap 0.17, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:576:1312 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.75 (kernels 0.24, bootstrap 0.18, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO ncclCommInitRank comm 0x55e7b0065170 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:573:1302 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.82 (kernels 0.15, bootstrap 0.34, allgathers 0.01, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO ncclCommInitRank comm 0x55f7ae069780 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0xa1bb5af6fed5ca65 - Init COMPLETE
 tyler-a100-newimage-val:571:1313 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.76 (kernels 0.37, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1335 [3] NCCL INFO Connected all rings
 tyler-a100-newimage-val:572:1338 [2] NCCL INFO Connected all rings
 tyler-a100-newimage-val:571:1339 [1] NCCL INFO Connected all rings
 tyler-a100-newimage-val:570:1336 [0] NCCL INFO Connected all rings
 tyler-a100-newimage-val:574:1333 [4] NCCL INFO Connected all rings
 tyler-a100-newimage-val:577:1334 [7] NCCL INFO Connected all rings
 tyler-a100-newimage-val:575:1337 [5] NCCL INFO Connected all rings
 tyler-a100-newimage-val:576:1332 [6] NCCL INFO Connected all rings
 Generating train split: 267 examples [00:00, 6232.18 examples/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1973.19it/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1948.77it/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1801.53it/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1713.33it/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1638.71it/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1757.64it/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1982.74it/s]
 Data length calculation: 100%|██████████| 267/267 [00:00<00:00, 1824.15it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.69it/s]
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Creating extension directory /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
 /opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 {
    "num_gpus": 8,
    "avg_sample_len": 622.9588014981273,
    "effective_batch_size": 128,
    "max_batch_len_per_gpu": 10000,
    "packing_max_batch_len": 9079,
    "grad_accum": 2,
    "num_batches": 2,
    "avg_samples_per_batch": 133.5,
    "samples_per_gpu": 8,
    "timestamp": "2024-08-18T20:05:10.241425"
 }
 Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Loading checkpoint shards:  33%|███▎      | 1/3 [00:00<00:00,  2.84it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Loading checkpoint shards:  67%|██████▋   | 2/3 [00:00<00:00,  3.30it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.49it/s]
 Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.48it/s]
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.39it/s]
 Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.40it/s]
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.49it/s]
 Loading checkpoint shards:  33%|███▎      | 1/3 [00:00<00:00,  2.70it/s]
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.31it/s]
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00,  2.79it/s]
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
 [2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
 [3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/opt/app-root/lib64/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
 Loading extension module fused_adam...
 Time to load fused_adam op: 34.718958377838135 seconds
 Loading extension module fused_adam...
 Time to load fused_adam op: 30.327874183654785 seconds
 Loading extension module fused_adam...
 Time to load fused_adam op: 30.326067447662354 seconds
 Loading extension module fused_adam...
 Loading extension module fused_adam...
 Time to load fused_adam op: 30.226067066192627 seconds
 Time to load fused_adam op: 29.229661464691162 seconds
 [2024-08-18 20:05:41,506] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
 [2024-08-18 20:05:41,506] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
 Loading extension module fused_adam...
 Time to load fused_adam op: 30.326636791229248 seconds
 Loading extension module fused_adam...
 Time to load fused_adam op: 30.327282190322876 seconds
 Loading extension module fused_adam...
 Time to load fused_adam op: 29.730340242385864 seconds
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO Using network Socket
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO Using network Socket
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO Using network Socket
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Using network Socket
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO Using network Socket
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO Using network Socket
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO Using network Socket
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO Using network Socket
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO bootstrapSplit: comm 0x55f7afbd0a60 parent 0x55f7ae069780 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO bootstrapSplit: comm 0x55e7b1bda750 parent 0x55e7b0065170 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO bootstrapSplit: comm 0x563617fd12e0 parent 0x5636163750f0 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO bootstrapSplit: comm 0x558d77c33a30 parent 0x558d760ca530 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO bootstrapSplit: comm 0x55e6b08cfc90 parent 0x55e6aed5c790 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO bootstrapSplit: comm 0x560adc23dc90 parent 0x560ada6ca910 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO ncclCommSplit comm 0x55e7b1bda750 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55e7b0065170 color -934961569 key 3 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO bootstrapSplit: comm 0x55d351b0d620 parent 0x55d34ff865c0 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO ncclCommSplit comm 0x558d77c33a30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558d760ca530 color -934961569 key 7 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO ncclCommSplit comm 0x55e6b08cfc90 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e6aed5c790 color -934961569 key 2 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO bootstrapSplit: comm 0x5640d0e2b2b0 parent 0x5640cf2b7db0 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO ncclCommSplit comm 0x563617fd12e0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x5636163750f0 color -934961569 key 5 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO ncclCommSplit comm 0x560adc23dc90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x560ada6ca910 color -934961569 key 4 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO ncclCommSplit comm 0x55f7afbd0a60 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55f7ae069780 color -934961569 key 1 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO ncclCommSplit comm 0x55d351b0d620 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55d34ff865c0 color -934961569 key 0 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO ncclCommSplit comm 0x5640d0e2b2b0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5640cf2b7db0 color -934961569 key 6 commId 0xed88d95c67fb6a92 - Init START
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO NVLS multicast support is not available on dev 1
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO NVLS multicast support is not available on dev 5
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO NVLS multicast support is not available on dev 2
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO NVLS multicast support is not available on dev 4
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO NVLS multicast support is not available on dev 0
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO NVLS multicast support is not available on dev 6
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO NVLS multicast support is not available on dev 3
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO NVLS multicast support is not available on dev 7
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO comm 0x558d77c33a30 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO comm 0x5640d0e2b2b0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO comm 0x55e6b08cfc90 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO comm 0x563617fd12e0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO comm 0x55d351b0d620 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO comm 0x55f7afbd0a60 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO comm 0x560adc23dc90 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO comm 0x55e7b1bda750 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO ncclCommSplit comm 0x5640d0e2b2b0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x5640cf2b7db0 color -934961569 key 6 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO ncclCommSplit comm 0x55d351b0d620 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55d34ff865c0 color -934961569 key 0 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:576:1419 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO ncclCommSplit comm 0x55e6b08cfc90 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e6aed5c790 color -934961569 key 2 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO ncclCommSplit comm 0x563617fd12e0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x5636163750f0 color -934961569 key 5 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:570:1421 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:572:1420 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO ncclCommSplit comm 0x55f7afbd0a60 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x55f7ae069780 color -934961569 key 1 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:575:1423 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:571:1426 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.33 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO ncclCommSplit comm 0x560adc23dc90 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x560ada6ca910 color -934961569 key 4 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO ncclCommSplit comm 0x558d77c33a30 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558d760ca530 color -934961569 key 7 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:574:1418 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
 tyler-a100-newimage-val:577:1422 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO ncclCommSplit comm 0x55e7b1bda750 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x55e7b0065170 color -934961569 key 3 commId 0xed88d95c67fb6a92 - Init COMPLETE
 tyler-a100-newimage-val:573:1417 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.36 (kernels 0.00, bootstrap 0.03, allgathers 0.00, topo 0.25, graphs 0.00, connections 0.06, rest 0.02)
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:572:1444 [2] NCCL INFO Connected all rings
 tyler-a100-newimage-val:573:1446 [3] NCCL INFO Connected all rings
 tyler-a100-newimage-val:574:1450 [4] NCCL INFO Connected all rings
 tyler-a100-newimage-val:571:1449 [1] NCCL INFO Connected all rings
 tyler-a100-newimage-val:570:1445 [0] NCCL INFO Connected all rings
 tyler-a100-newimage-val:577:1448 [7] NCCL INFO Connected all rings
 tyler-a100-newimage-val:575:1447 [5] NCCL INFO Connected all rings
 tyler-a100-newimage-val:576:1443 [6] NCCL INFO Connected all rings
 [2024-08-18 20:05:47,116] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
 [2024-08-18 20:05:47,117] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
 [2024-08-18 20:05:47,117] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
 [2024-08-18 20:05:47,130] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
 [2024-08-18 20:05:47,130] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
 [2024-08-18 20:05:47,130] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
 [2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
 [2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
 [2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
 [2024-08-18 20:05:47,130] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
 [2024-08-18 20:05:59,693] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:06:00,706] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:06:00,871] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:06:01,012] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:06:01,166] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:06:01,497] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:06:01,620] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:06:01,858] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
 [2024-08-18 20:06:01,859] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 17.26 GB         CA 17.26 GB         Max_CA 17 GB 
 [2024-08-18 20:06:01,860] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 30.63 GB, percent = 2.4%
 [2024-08-18 20:06:02,079] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
 [2024-08-18 20:06:02,080] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 18.83 GB         CA 20.4 GB         Max_CA 20 GB 
 [2024-08-18 20:06:02,080] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 30.64 GB, percent = 2.4%
 [2024-08-18 20:06:02,080] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
 [2024-08-18 20:06:02,301] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
 [2024-08-18 20:06:02,302] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 15.69 GB         CA 20.4 GB         Max_CA 20 GB 
 [2024-08-18 20:06:02,302] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 30.64 GB, percent = 2.4%
 [2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
 [2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
 [2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7eff204ab310>
 [2024-08-18 20:06:02,304] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
 [2024-08-18 20:06:02,305] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
 [2024-08-18 20:06:02,305] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
 }
 [2024-08-18 20:06:02,305] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
 [2024-08-18 20:06:02,305] [INFO] [config.py:1001:print]   amp_enabled .................. False
 [2024-08-18 20:06:02,305] [INFO] [config.py:1001:print]   amp_params ................... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
 }
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   bfloat16_enabled ............. True
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0270163fd0>
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   communication_data_type ...... None
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   disable_allgather ............ False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   dump_state ................... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
 }
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   fp16_auto_cast ............... None
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   fp16_enabled ................. False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   global_rank .................. 0
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 2
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   graph_harvesting ............. False
 [2024-08-18 20:06:02,306] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 1
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   loss_scale ................... 1.0
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   memory_breakdown ............. False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
 }
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   optimizer_name ............... None
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   optimizer_params ............. None
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   pld_enabled .................. False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   pld_params ................... False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   prescale_gradients ........... False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   scheduler_name ............... None
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   scheduler_params ............. None
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   sparse_attention ............. None
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   steps_per_print .............. 1
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   train_batch_size ............. 128
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  8
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   weight_quantization_config ... None
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   world_size ................... 8
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  False
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   zero_enabled ................. True
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
 [2024-08-18 20:06:02,307] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 2
 [2024-08-18 20:06:02,307] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 128, 
    "gradient_accumulation_steps": 2, 
    "train_micro_batch_size_per_gpu": 8, 
    "steps_per_print": 1, 
    "zero_optimization": {
        "stage": 2, 
        "offload_param": {
            "device": "none"
        }, 
        "offload_optimizer": {
            "device": "none"
        }
    }, 
    "bf16": {
        "enabled": true
    }, 
    "gradient_clipping": 1.0, 
    "prescale_gradients": false, 
    "wall_clock_breakdown": false
 }
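 Note on the batch-size values printed above: DeepSpeed requires train_batch_size to equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. The short check below only re-does that arithmetic with the numbers copied from this log; the variable names are illustrative and nothing here is taken from the instructlab or DeepSpeed source.

```python
# Values copied from the DeepSpeedEngine configuration printed in this log.
train_batch_size = 128               # "train_batch_size"
micro_batch_size_per_gpu = 8         # "train_micro_batch_size_per_gpu"
gradient_accumulation_steps = 2      # "gradient_accumulation_steps"
world_size = 8                       # "world_size" (one rank per GPU, --gpus 8)

# DeepSpeed's documented consistency rule for the three batch settings.
derived_batch_size = (
    micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
)

print(f"derived train_batch_size = {derived_batch_size}")
if derived_batch_size != train_batch_size:
    raise ValueError("batch settings in the config are inconsistent")
```

 With these values the product is 8 * 2 * 8 = 128, which matches --phased-phase1-effective-batch-size 128 from the command line.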
 [2024-08-18 20:06:02,308] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 Epoch 0:   0%|          | 0/2 [00:00<?, ?it/s]
 total tokens: 8715 num samples: 15 num padding tokens: 392 - rank: 5 max len: 581 min len: 525 avg len: 554.8666666666667 num_loss_counted_tokens: 7438
 total tokens: 8565 num samples: 15 num padding tokens: 401 - rank: 5 max len: 571 min len: 519 avg len: 544.2666666666667 num_loss_counted_tokens: 7279
 total tokens: 8477 num samples: 7 num padding tokens: 255 - rank: 1 max len: 1211 min len: 1136 avg len: 1174.5714285714287 num_loss_counted_tokens: 7809
 total tokens: 8666 num samples: 7 num padding tokens: 540 - rank: 1 max len: 1238 min len: 1097 avg len: 1160.857142857143 num_loss_counted_tokens: 7713
 total tokens: 8896 num samples: 8 num padding tokens: 933 - rank: 2 max len: 1112 min len: 883 avg len: 995.375 num_loss_counted_tokens: 7491
 total tokens: 8200 num samples: 8 num padding tokens: 824 - rank: 2 max len: 1025 min len: 886 avg len: 922.0 num_loss_counted_tokens: 6904
 total tokens: 8492 num samples: 22 num padding tokens: 1012 - rank: 7 max len: 386 min len: 288 avg len: 340.0 num_loss_counted_tokens: 6182
 total tokens: 8579 num samples: 23 num padding tokens: 1024 - rank: 7 max len: 373 min len: 282 avg len: 328.4782608695652 num_loss_counted_tokens: 6198
 total tokens: 8567 num samples: 13 num padding tokens: 536 - rank: 4 max len: 659 min len: 577 avg len: 617.7692307692307 num_loss_counted_tokens: 7264
 total tokens: 8723 num samples: 13 num padding tokens: 546 - rank: 4 max len: 671 min len: 581 avg len: 629.0 num_loss_counted_tokens: 7410
 total tokens: 8600 num samples: 5 num padding tokens: 654 - rank: 0 max len: 1720 min len: 1230 avg len: 1589.2 num_loss_counted_tokens: 7651
 total tokens: 8908 num samples: 17 num padding tokens: 1767 - rank: 6 max len: 524 min len: 384 avg len: 420.05882352941177 num_loss_counted_tokens: 6138
 total tokens: 8704 num samples: 17 num padding tokens: 1162 - rank: 6 max len: 512 min len: 388 avg len: 443.6470588235294 num_loss_counted_tokens: 6539
 total tokens: 8660 num samples: 5 num padding tokens: 130 - rank: 0 max len: 1732 min len: 1681 avg len: 1706.0 num_loss_counted_tokens: 8235
 total tokens: 8688 num samples: 12 num padding tokens: 317 - rank: 3 max len: 724 min len: 662 avg len: 697.5833333333334 num_loss_counted_tokens: 7663
 total tokens: 8712 num samples: 12 num padding tokens: 241 - rank: 3 max len: 726 min len: 673 avg len: 705.9166666666666 num_loss_counted_tokens: 7763
 Per-token loss scaled by world size: 0.00018896172696258873
 Per-token loss scaled by world size: 0.00022490561241284013
 Per-token loss scaled by world size: 0.00020114783546887338
 Per-token loss scaled by world size: 0.00021589698735624552
 Per-token loss scaled by world size: 0.0002045775472652167
 Per-token loss scaled by world size: 0.0001989303418667987
 Per-token loss scaled by world size: 0.00020551522902678698
 Epoch: 0, Step: 1, Rank: 3, loss = 1.6271358728408813
 Epoch: 0, Step: 1, Rank: 6, loss = 1.3670908212661743
 Epoch: 0, Step: 1, Rank: 5, loss = 1.4800673723220825
 Epoch: 0, Step: 1, Rank: 2, loss = 1.455254316329956
 Epoch: 0, Step: 1, Rank: 1, loss = 1.5619606971740723
 Epoch: 0, Step: 1, Rank: 7, loss = 1.4392112493515015
 Epoch: 0, Step: 1, Rank: 4, loss = 1.4868513345718384
 Per-token loss scaled by world size: 0.0001951669983100146
 Epoch: 0, Step: 1, Rank: 0, loss = 1.4119844436645508
 Epoch 0:  50%|█████     | 1/2 [00:03<00:03,  3.83s/it]{
    "epoch": 0,
    "step": 1,
    "rank": 0,
    "loss": 1.4119844436645508,
    "overall_throughput": 18.825078649542128,
    "lr": 0.0,
    "cuda_mem_allocated": 18.31652021408081,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 57878,
    "batch_size": 99,
    "total_loss": 1.4786945581436157,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:06:06.148365"
 }
 Per-token loss scaled by world size: 0.00022342400916386396
 Per-token loss scaled by world size: 0.00018432749493513256
 Per-token loss scaled by world size: 0.00020338631293270737
 Per-token loss scaled by world size: 0.00018564658239483833
 Per-token loss scaled by world size: 0.00020801158098038286
 Per-token loss scaled by world size: 0.00018980413733515888
 Per-token loss scaled by world size: 0.00019465763762127608
 Epoch: 0, Step: 2, Rank: 1, loss = 1.3317431211471558
 Epoch: 0, Step: 2, Rank: 5, loss = 1.4694406986236572
 Epoch: 0, Step: 2, Rank: 3, loss = 1.6142104864120483
 Epoch: 0, Step: 2, Rank: 6, loss = 1.341273307800293
 Epoch: 0, Step: 2, Rank: 2, loss = 1.5028576850891113
 Epoch: 0, Step: 2, Rank: 4, loss = 1.3713111877441406
 Epoch: 0, Step: 2, Rank: 7, loss = 1.4063770771026611
 Per-token loss scaled by world size: 0.00021443456353154033
 Epoch: 0, Step: 2, Rank: 0, loss = 1.5492628812789917
 [2024-08-18 20:06:08,978] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
 Epoch 0: 100%|██████████| 2/2 [00:06<00:00,  3.29s/it]{
    "epoch": 0,
    "step": 2,
    "rank": 0,
    "loss": 1.5492628812789917,
    "overall_throughput": 22.634999728904948,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 23.017526626586914,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 57799,
    "batch_size": 100,
    "total_loss": 1.4483095407485962,
    "gradnorm": 3.2187001705169678,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:06:09.050231"
 }
 Saving model in huggingface format at samples_seen: 192
 Model saved in /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192
 [20:06:27] INFO     saving took 18.58629083633423 seconds           utils.py:611
 Epoch 0: 100%|██████████| 2/2 [00:25<00:00, 12.69s/it]
 total tokens: 8908 num samples: 17 num padding tokens: 1042 - rank: 6 max len: 524 min len: 389 avg len: 462.70588235294116 num_loss_counted_tokens: 6863
 total tokens: 8904 num samples: 21 num padding tokens: 847 - rank: 6 max len: 424 min len: 347 avg len: 383.6666666666667 num_loss_counted_tokens: 6818
 total tokens: 8996 num samples: 13 num padding tokens: 686 - rank: 4 max len: 692 min len: 600 avg len: 639.2307692307693 num_loss_counted_tokens: 7543
 total tokens: 8610 num samples: 14 num padding tokens: 511 - rank: 4 max len: 615 min len: 546 avg len: 578.5 num_loss_counted_tokens: 7273
 total tokens: 8495 num samples: 5 num padding tokens: 1475 - rank: 0 max len: 1699 min len: 1186 avg len: 1404.0 num_loss_counted_tokens: 6725
 total tokens: 8660 num samples: 5 num padding tokens: 88 - rank: 0 max len: 1732 min len: 1685 avg len: 1714.4 num_loss_counted_tokens: 8277
 total tokens: 8536 num samples: 22 num padding tokens: 1108 - rank: 7 max len: 388 min len: 283 avg len: 337.6363636363636 num_loss_counted_tokens: 6130
 total tokens: 8835 num samples: 15 num padding tokens: 400 - rank: 5 max len: 589 min len: 524 avg len: 562.3333333333334 num_loss_counted_tokens: 7550
 total tokens: 8688 num samples: 16 num padding tokens: 651 - rank: 5 max len: 543 min len: 428 avg len: 502.3125 num_loss_counted_tokens: 7093
 total tokens: 8970 num samples: 26 num padding tokens: 730 - rank: 7 max len: 345 min len: 253 avg len: 316.9230769230769 num_loss_counted_tokens: 6706
 total tokens: 8397 num samples: 9 num padding tokens: 1203 - rank: 2 max len: 933 min len: 709 avg len: 799.3333333333334 num_loss_counted_tokens: 6663
 total tokens: 8288 num samples: 7 num padding tokens: 408 - rank: 1 max len: 1184 min len: 1017 avg len: 1125.7142857142858 num_loss_counted_tokens: 7467
 total tokens: 8470 num samples: 7 num padding tokens: 755 - rank: 2 max len: 1210 min len: 931 avg len: 1102.142857142857 num_loss_counted_tokens: 7302
 total tokens: 8950 num samples: 10 num padding tokens: 1495 - rank: 3 max len: 895 min len: 694 avg len: 745.5 num_loss_counted_tokens: 6865
 total tokens: 8405 num samples: 5 num padding tokens: 928 - rank: 1 max len: 1681 min len: 1230 avg len: 1495.4 num_loss_counted_tokens: 7182
 total tokens: 8508 num samples: 12 num padding tokens: 397 - rank: 3 max len: 709 min len: 628 avg len: 675.9166666666666 num_loss_counted_tokens: 7403
 Per-token loss scaled by world size: 0.00018219766207039356
 Per-token loss scaled by world size: 0.00019190594321116805
 Per-token loss scaled by world size: 0.00019805562624242157
 Per-token loss scaled by world size: 0.0001988127187360078
 Per-token loss scaled by world size: 0.00019518414046615362
 Per-token loss scaled by world size: 0.00021522259339690208
 Per-token loss scaled by world size: 0.00020449883595574647
 Epoch: 1, Step: 3, Rank: 6, loss = 1.3844094276428223
 Epoch: 1, Step: 3, Rank: 1, loss = 1.3143739700317383
 Epoch: 1, Step: 3, Rank: 4, loss = 1.428773283958435
 Epoch: 1, Step: 3, Rank: 7, loss = 1.4080584049224854
 Epoch: 1, Step: 3, Rank: 3, loss = 1.4342349767684937
 Epoch: 1, Step: 3, Rank: 5, loss = 1.552615761756897
 Epoch: 1, Step: 3, Rank: 2, loss = 1.4752546548843384
 Per-token loss scaled by world size: 0.00021497253328561783
 Epoch: 1, Step: 3, Rank: 0, loss = 1.5508118867874146
 Epoch 1:  50%|█████     | 1/2 [00:03<00:03,  3.17s/it]{
    "epoch": 1,
    "step": 3,
    "rank": 0,
    "loss": 1.5508118867874146,
    "overall_throughput": 23.69786494018801,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.60161828994751,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 57712,
    "batch_size": 94,
    "total_loss": 1.4435664415359497,
    "gradnorm": 3.2187001705169678,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:06:30.868391"
 }
 Per-token loss scaled by world size: 0.00020728030358441174
 Per-token loss scaled by world size: 0.00018060434376820922
 Per-token loss scaled by world size: 0.0002226187934866175
 Per-token loss scaled by world size: 0.00021563439804594964
 Per-token loss scaled by world size: 0.00020198585116304457
 Per-token loss scaled by world size: 0.0002103917795466259
 Per-token loss scaled by world size: 0.0002255949075333774
 Epoch: 1, Step: 4, Rank: 3, loss = 1.5624500513076782
 Epoch: 1, Step: 4, Rank: 5, loss = 1.5134299993515015
 Epoch: 1, Step: 4, Rank: 0, loss = 1.2675715684890747
 Epoch: 1, Step: 4, Rank: 4, loss = 1.4547967910766602
 Epoch: 1, Step: 4, Rank: 1, loss = 1.4176377058029175
 Epoch: 1, Step: 4, Rank: 2, loss = 1.4766347408294678
 Epoch: 1, Step: 4, Rank: 7, loss = 1.5833379030227661
 Per-token loss scaled by world size: 0.00022538744087796658
 Epoch: 1, Step: 4, Rank: 6, loss = 1.5818817615509033
 [2024-08-18 20:06:33,631] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
 Epoch 1: 100%|██████████| 2/2 [00:06<00:00,  2.98s/it]{
    "epoch": 1,
    "step": 4,
    "rank": 0,
    "loss": 1.2675715684890747,
    "overall_throughput": 23.434918025287512,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 22.997975826263428,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 56148,
    "batch_size": 110,
    "total_loss": 1.4822176694869995,
    "gradnorm": 3.2527425289154053,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:06:33.767906"
 }
 Saving model in huggingface format at samples_seen: 320
 Model saved in /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320
 [20:06:52] INFO     saving took 18.69700264930725 seconds           utils.py:611
 Epoch 1: 100%|██████████| 2/2 [00:24<00:00, 12.41s/it]
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:550 -> 3
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:573 -> 3
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:621 -> 3
 tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:428 -> 3
 tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:564 -> 3
 tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:428 -> 3
 tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:668 -> 3
 tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:564 -> 3
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:58 -> 3
 tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:668 -> 3
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:58 -> 3
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO misc/socket.cc:775 -> 3
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO misc/socket.cc:775 -> 3
 tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:58 -> 3
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO misc/socket.cc:775 -> 3

 tyler-a100-newimage-val:574:1319 [4] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:58 -> 3
 tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:58 -> 3
 tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:574:1319 [4] NCCL INFO misc/socket.cc:826 -> 3
 tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO misc/socket.cc:775 -> 3
 tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:428 -> 3
 tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:428 -> 3
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:564 -> 3
 tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:428 -> 3
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:58 -> 3

 tyler-a100-newimage-val:574:1319 [4] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 4, res=3, closed=0
 tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:564 -> 3
 tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:668 -> 3
 tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:564 -> 3
 tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO misc/socket.cc:775 -> 3

 tyler-a100-newimage-val:576:1316 [6] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:668 -> 3
 tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:428 -> 3
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:58 -> 3

 tyler-a100-newimage-val:574:1319 [4] proxy.cc:1521 NCCL WARN [Proxy Service 4] Failed to execute operation Close from rank 4, retcode 3
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO misc/socket.cc:775 -> 3
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO misc/socket.cc:775 -> 3
 tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:564 -> 3
 tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:668 -> 3
 tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:47 -> 3
 tyler-a100-newimage-val:576:1316 [6] NCCL INFO misc/socket.cc:826 -> 3

 tyler-a100-newimage-val:571:1330 [1] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:668 -> 3

 tyler-a100-newimage-val:570:1321 [0] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:47 -> 3

 tyler-a100-newimage-val:576:1316 [6] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0
 tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:752 -> 3
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:58 -> 3
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO misc/socket.cc:775 -> 3

 tyler-a100-newimage-val:577:1318 [7] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:428 -> 3

 tyler-a100-newimage-val:576:1316 [6] proxy.cc:1521 NCCL WARN [Proxy Service 6] Failed to execute operation Close from rank 6, retcode 3

 tyler-a100-newimage-val:572:1327 [2] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:564 -> 3
 tyler-a100-newimage-val:570:1321 [0] NCCL INFO misc/socket.cc:826 -> 3
 tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:428 -> 3
 tyler-a100-newimage-val:571:1330 [1] NCCL INFO misc/socket.cc:826 -> 3
 tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:668 -> 3

 tyler-a100-newimage-val:570:1321 [0] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0
 tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:564 -> 3

 tyler-a100-newimage-val:570:1321 [0] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Close from rank 0, retcode 3

 tyler-a100-newimage-val:571:1330 [1] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 1, res=3, closed=0
 tyler-a100-newimage-val:577:1318 [7] NCCL INFO misc/socket.cc:826 -> 3
 tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:668 -> 3

 tyler-a100-newimage-val:577:1318 [7] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0
 tyler-a100-newimage-val:572:1327 [2] NCCL INFO misc/socket.cc:826 -> 3

 tyler-a100-newimage-val:577:1318 [7] proxy.cc:1521 NCCL WARN [Proxy Service 7] Failed to execute operation Close from rank 7, retcode 3

 tyler-a100-newimage-val:575:1325 [5] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable

 tyler-a100-newimage-val:572:1327 [2] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 2, res=3, closed=0

 tyler-a100-newimage-val:572:1327 [2] proxy.cc:1521 NCCL WARN [Proxy Service 2] Failed to execute operation Close from rank 2, retcode 3

 tyler-a100-newimage-val:571:1330 [1] proxy.cc:1521 NCCL WARN [Proxy Service 1] Failed to execute operation Close from rank 1, retcode 3

 tyler-a100-newimage-val:573:1324 [3] proxy.cc:1458 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
 tyler-a100-newimage-val:575:1325 [5] NCCL INFO misc/socket.cc:826 -> 3

 tyler-a100-newimage-val:575:1325 [5] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 5, res=3, closed=0
 tyler-a100-newimage-val:573:1324 [3] NCCL INFO misc/socket.cc:826 -> 3

 tyler-a100-newimage-val:573:1324 [3] proxy.cc:1497 NCCL WARN [Service thread] Could not receive type from localRank 3, res=3, closed=0

 tyler-a100-newimage-val:575:1325 [5] proxy.cc:1521 NCCL WARN [Proxy Service 5] Failed to execute operation Close from rank 5, retcode 3

 tyler-a100-newimage-val:573:1324 [3] proxy.cc:1521 NCCL WARN [Proxy Service 3] Failed to execute operation Close from rank 3, retcode 3
 tyler-a100-newimage-val:570:9772 [0] NCCL INFO comm 0x55d34ff865c0 rank 0 nranks 8 cudaDev 0 busId 8010 - Abort COMPLETE
 tyler-a100-newimage-val:576:9777 [6] NCCL INFO comm 0x5640cf2b7db0 rank 6 nranks 8 cudaDev 6 busId e070 - Abort COMPLETE
 tyler-a100-newimage-val:574:9775 [4] NCCL INFO comm 0x560ada6ca910 rank 4 nranks 8 cudaDev 4 busId c050 - Abort COMPLETE
 tyler-a100-newimage-val:575:9771 [5] NCCL INFO comm 0x5636163750f0 rank 5 nranks 8 cudaDev 5 busId c060 - Abort COMPLETE
 tyler-a100-newimage-val:572:9778 [2] NCCL INFO comm 0x55e6aed5c790 rank 2 nranks 8 cudaDev 2 busId a030 - Abort COMPLETE
 tyler-a100-newimage-val:577:9774 [7] NCCL INFO comm 0x558d760ca530 rank 7 nranks 8 cudaDev 7 busId e080 - Abort COMPLETE
 tyler-a100-newimage-val:573:9776 [3] NCCL INFO comm 0x55e7b0065170 rank 3 nranks 8 cudaDev 3 busId a040 - Abort COMPLETE
 tyler-a100-newimage-val:571:9773 [1] NCCL INFO comm 0x55f7ae069780 rank 1 nranks 8 cudaDev 1 busId 8020 - Abort COMPLETE
 Operation completed successfully! 🎉
 MMLU evaluation for Phase 1...
 INFO 2024-08-18 20:07:09,101 lm-eval:152: Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
 INFO 2024-08-18 20:07:09,102 lm-eval:189: Initializing hf model, with arguments: {'pretrained': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192', 'dtype': 'bfloat16'}
 INFO 2024-08-18 20:07:09,231 lm-eval:170: Using device 'cuda'
 Downloading builder script: 100% 5.86k/5.86k [00:00<00:00, 23.9MB/s]
 Downloading readme: 100% 1.11k/1.11k [00:00<00:00, 10.3MB/s]
 Downloading data: 100% 166M/166M [00:01<00:00, 102MB/s]  
 Generating test split: 100 examples [00:00, 1153.22 examples/s]
 Generating validation split: 11 examples [00:00, 3373.60 examples/s]
 Generating dev split: 5 examples [00:00, 59.65 examples/s]
 Generating test split: 135 examples [00:00, 1635.55 examples/s]
 Generating validation split: 14 examples [00:00, 5139.63 examples/s]
 Generating dev split: 5 examples [00:00, 61.55 examples/s]
 Generating test split: 152 examples [00:00, 1850.10 examples/s]
 Generating validation split: 16 examples [00:00, 5953.06 examples/s]
 Generating dev split: 5 examples [00:00, 60.12 examples/s]
 Generating test split: 100 examples [00:00, 1177.68 examples/s]
 Generating validation split: 11 examples [00:00, 4201.56 examples/s]
 Generating dev split: 5 examples [00:00, 60.37 examples/s]
 Generating test split: 265 examples [00:00, 2954.66 examples/s]
 Generating validation split: 29 examples [00:00, 7941.16 examples/s]
 Generating dev split: 5 examples [00:00, 61.76 examples/s]
 Generating test split: 144 examples [00:00, 1774.03 examples/s]
 Generating validation split: 16 examples [00:00, 3479.13 examples/s]
 Generating dev split: 5 examples [00:00, 61.13 examples/s]
 Generating test split: 100 examples [00:00, 1222.35 examples/s]
 Generating validation split: 8 examples [00:00, 1853.63 examples/s]
 Generating dev split: 5 examples [00:00, 60.05 examples/s]
 Generating test split: 100 examples [00:00, 1183.27 examples/s]
 Generating validation split: 11 examples [00:00, 2736.66 examples/s]
 Generating dev split: 5 examples [00:00, 60.60 examples/s]
 Generating test split: 100 examples [00:00, 1185.71 examples/s]
 Generating validation split: 11 examples [00:00, 3130.71 examples/s]
 Generating dev split: 5 examples [00:00, 61.68 examples/s]
 Generating test split: 173 examples [00:00, 2042.05 examples/s]
 Generating validation split: 22 examples [00:00, 6416.88 examples/s]
 Generating dev split: 5 examples [00:00, 62.23 examples/s]
 Generating test split: 102 examples [00:00, 1194.64 examples/s]
 Generating validation split: 11 examples [00:00, 4022.44 examples/s]
 Generating dev split: 5 examples [00:00, 61.46 examples/s]
 Generating test split: 100 examples [00:00, 1207.67 examples/s]
 Generating validation split: 11 examples [00:00, 2994.57 examples/s]
 Generating dev split: 5 examples [00:00, 60.33 examples/s]
 Generating test split: 235 examples [00:00, 2704.80 examples/s]
 Generating validation split: 26 examples [00:00, 8991.75 examples/s]
 Generating dev split: 5 examples [00:00, 60.17 examples/s]
 Generating test split: 114 examples [00:00, 1390.51 examples/s]
 Generating validation split: 12 examples [00:00, 3749.38 examples/s]
 Generating dev split: 5 examples [00:00, 60.87 examples/s]
 Generating test split: 145 examples [00:00, 1763.73 examples/s]
 Generating validation split: 16 examples [00:00, 5013.74 examples/s]
 Generating dev split: 5 examples [00:00, 60.14 examples/s]
 Generating test split: 378 examples [00:00, 4030.96 examples/s]
 Generating validation split: 41 examples [00:00, 7887.28 examples/s]
 Generating dev split: 5 examples [00:00, 60.97 examples/s]
 Generating test split: 126 examples [00:00, 1470.10 examples/s]
 Generating validation split: 14 examples [00:00, 4072.99 examples/s]
 Generating dev split: 5 examples [00:00, 59.27 examples/s]
 Generating test split: 100 examples [00:00, 1255.89 examples/s]
 Generating validation split: 10 examples [00:00, 3940.16 examples/s]
 Generating dev split: 5 examples [00:00, 61.03 examples/s]
 Generating test split: 310 examples [00:00, 3516.83 examples/s]
 Generating validation split: 32 examples [00:00, 8804.05 examples/s]
 Generating dev split: 5 examples [00:00, 61.07 examples/s]
 Generating test split: 203 examples [00:00, 2410.97 examples/s]
 Generating validation split: 22 examples [00:00, 4926.05 examples/s]
 Generating dev split: 5 examples [00:00, 62.35 examples/s]
 Generating test split: 100 examples [00:00, 1268.03 examples/s]
 Generating validation split: 9 examples [00:00, 3895.64 examples/s]
 Generating dev split: 5 examples [00:00, 62.00 examples/s]
 Generating test split: 165 examples [00:00, 1938.49 examples/s]
 Generating validation split: 18 examples [00:00, 3426.10 examples/s]
 Generating dev split: 5 examples [00:00, 62.50 examples/s]
 Generating test split: 198 examples [00:00, 2282.64 examples/s]
 Generating validation split: 22 examples [00:00, 7912.42 examples/s]
 Generating dev split: 5 examples [00:00, 61.81 examples/s]
 Generating test split: 193 examples [00:00, 2366.57 examples/s]
 Generating validation split: 21 examples [00:00, 7132.01 examples/s]
 Generating dev split: 5 examples [00:00, 61.24 examples/s]
 Generating test split: 390 examples [00:00, 4338.53 examples/s]
 Generating validation split: 43 examples [00:00, 9807.77 examples/s]
 Generating dev split: 5 examples [00:00, 62.74 examples/s]
 Generating test split: 270 examples [00:00, 3156.55 examples/s]
 Generating validation split: 29 examples [00:00, 8374.17 examples/s]
 Generating dev split: 5 examples [00:00, 61.80 examples/s]
 Generating test split: 238 examples [00:00, 2714.10 examples/s]
 Generating validation split: 26 examples [00:00, 5558.48 examples/s]
 Generating dev split: 5 examples [00:00, 60.55 examples/s]
 Generating test split: 151 examples [00:00, 1801.49 examples/s]
 Generating validation split: 17 examples [00:00, 4671.64 examples/s]
 Generating dev split: 5 examples [00:00, 61.15 examples/s]
 Generating test split: 545 examples [00:00, 5738.37 examples/s]
 Generating validation split: 60 examples [00:00, 9898.84 examples/s]
 Generating dev split: 5 examples [00:00, 61.26 examples/s]
 Generating test split: 216 examples [00:00, 2474.35 examples/s]
 Generating validation split: 23 examples [00:00, 6018.78 examples/s]
 Generating dev split: 5 examples [00:00, 61.26 examples/s]
 Generating test split: 204 examples [00:00, 2282.53 examples/s]
 Generating validation split: 22 examples [00:00, 4064.43 examples/s]
 Generating dev split: 5 examples [00:00, 61.72 examples/s]
 Generating test split: 237 examples [00:00, 2575.45 examples/s]
 Generating validation split: 26 examples [00:00, 4640.90 examples/s]
 Generating dev split: 5 examples [00:00, 61.22 examples/s]
 Generating test split: 223 examples [00:00, 2635.67 examples/s]
 Generating validation split: 23 examples [00:00, 5733.33 examples/s]
 Generating dev split: 5 examples [00:00, 62.89 examples/s]
 Generating test split: 131 examples [00:00, 1591.09 examples/s]
 Generating validation split: 12 examples [00:00, 3250.77 examples/s]
 Generating dev split: 5 examples [00:00, 61.29 examples/s]
 Generating test split: 121 examples [00:00, 1415.60 examples/s]
 Generating validation split: 13 examples [00:00, 3564.02 examples/s]
 Generating dev split: 5 examples [00:00, 61.46 examples/s]
 Generating test split: 108 examples [00:00, 1342.97 examples/s]
 Generating validation split: 11 examples [00:00, 2489.47 examples/s]
 Generating dev split: 5 examples [00:00, 58.93 examples/s]
 Generating test split: 163 examples [00:00, 2010.64 examples/s]
 Generating validation split: 18 examples [00:00, 4509.73 examples/s]
 Generating dev split: 5 examples [00:00, 61.46 examples/s]
 Generating test split: 112 examples [00:00, 1324.02 examples/s]
 Generating validation split: 11 examples [00:00, 3809.85 examples/s]
 Generating dev split: 5 examples [00:00, 61.62 examples/s]
 Generating test split: 103 examples [00:00, 1277.65 examples/s]
 Generating validation split: 11 examples [00:00, 3080.55 examples/s]
 Generating dev split: 5 examples [00:00, 59.10 examples/s]
 Generating test split: 234 examples [00:00, 2697.17 examples/s]
 Generating validation split: 25 examples [00:00, 5543.62 examples/s]
 Generating dev split: 5 examples [00:00, 60.37 examples/s]
 Generating test split: 100 examples [00:00, 1202.59 examples/s]
 Generating validation split: 11 examples [00:00, 4425.22 examples/s]
 Generating dev split: 5 examples [00:00, 62.03 examples/s]
 Generating test split: 783 examples [00:00, 7376.63 examples/s]
 Generating validation split: 86 examples [00:00, 16538.75 examples/s]
 Generating dev split: 5 examples [00:00, 59.82 examples/s]
 Generating test split: 346 examples [00:00, 3763.46 examples/s]
 Generating validation split: 38 examples [00:00, 6601.65 examples/s]
 Generating dev split: 5 examples [00:00, 61.74 examples/s]
 Generating test split: 895 examples [00:00, 7717.62 examples/s]
 Generating validation split: 100 examples [00:00, 14271.19 examples/s]
 Generating dev split: 5 examples [00:00, 61.34 examples/s]
 Generating test split: 306 examples [00:00, 3463.47 examples/s]
 Generating validation split: 33 examples [00:00, 7868.34 examples/s]
 Generating dev split: 5 examples [00:00, 62.63 examples/s]
 Generating test split: 311 examples [00:00, 3373.99 examples/s]
 Generating validation split: 34 examples [00:00, 9395.59 examples/s]
 Generating dev split: 5 examples [00:00, 61.80 examples/s]
 Generating test split: 324 examples [00:00, 3525.14 examples/s]
 Generating validation split: 35 examples [00:00, 7428.05 examples/s]
 Generating dev split: 5 examples [00:00, 61.91 examples/s]
 Generating test split: 282 examples [00:00, 3107.75 examples/s]
 Generating validation split: 31 examples [00:00, 5669.96 examples/s]
 Generating dev split: 5 examples [00:00, 62.70 examples/s]
 Generating test split: 1534 examples [00:00, 10061.95 examples/s]
 Generating validation split: 170 examples [00:00, 14781.84 examples/s]
 Generating dev split: 5 examples [00:00, 59.81 examples/s]
 Generating test split: 272 examples [00:00, 2957.00 examples/s]
 Generating validation split: 31 examples [00:00, 6405.41 examples/s]
 Generating dev split: 5 examples [00:00, 61.40 examples/s]
 Generating test split: 612 examples [00:00, 6144.16 examples/s]
 Generating validation split: 69 examples [00:00, 10691.85 examples/s]
 Generating dev split: 5 examples [00:00, 59.34 examples/s]
 Generating test split: 110 examples [00:00, 1355.61 examples/s]
 Generating validation split: 12 examples [00:00, 3074.06 examples/s]
 Generating dev split: 5 examples [00:00, 61.81 examples/s]
 Generating test split: 245 examples [00:00, 2837.26 examples/s]
 Generating validation split: 27 examples [00:00, 4828.44 examples/s]
 Generating dev split: 5 examples [00:00, 61.01 examples/s]
 Generating test split: 201 examples [00:00, 2289.38 examples/s]
 Generating validation split: 22 examples [00:00, 4368.45 examples/s]
 Generating dev split: 5 examples [00:00, 61.55 examples/s]
 Generating test split: 100 examples [00:00, 1184.69 examples/s]
 Generating validation split: 11 examples [00:00, 3479.70 examples/s]
 Generating dev split: 5 examples [00:00, 61.74 examples/s]
 Generating test split: 166 examples [00:00, 2008.24 examples/s]
 Generating validation split: 18 examples [00:00, 4386.07 examples/s]
 Generating dev split: 5 examples [00:00, 59.23 examples/s]
 Generating test split: 171 examples [00:00, 2049.86 examples/s]
 Generating validation split: 19 examples [00:00, 5742.72 examples/s]
 Generating dev split: 5 examples [00:00, 61.61 examples/s]
 WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_world_religions from None to 5
 INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_virology from None to 5
 INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_us_foreign_policy from None to 5
 INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_sociology from None to 5
 INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_security_studies from None to 5
 INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,836 lm-eval:251: Overwriting default num_fewshot of mmlu_public_relations from None to 5
 INFO 2024-08-18 20:08:01,836 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_psychology from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_medicine from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_law from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_accounting from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_prehistory from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_philosophy from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_nutrition from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_scenarios from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_disputes from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_miscellaneous from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_medical_genetics from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_marketing from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_management from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_machine_learning from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_logical_fallacies from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_jurisprudence from None to 5
 INFO 2024-08-18 20:08:01,837 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,837 lm-eval:251: Overwriting default num_fewshot of mmlu_international_law from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_human_sexuality from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_human_aging from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_world_history from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_us_history from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_statistics from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_psychology from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_physics from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_microeconomics from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_mathematics from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_macroeconomics from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_government_and_politics from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_geography from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_european_history from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_computer_science from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_chemistry from None to 5
 INFO 2024-08-18 20:08:01,838 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,838 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_biology from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_global_facts from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_formal_logic from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_elementary_mathematics from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_electrical_engineering from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_econometrics from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_conceptual_physics from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_computer_security from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_physics from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_medicine from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_mathematics from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_computer_science from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_chemistry from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_college_biology from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_clinical_knowledge from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_business_ethics from None to 5
 INFO 2024-08-18 20:08:01,839 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,839 lm-eval:251: Overwriting default num_fewshot of mmlu_astronomy from None to 5
 INFO 2024-08-18 20:08:01,840 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,840 lm-eval:251: Overwriting default num_fewshot of mmlu_anatomy from None to 5
 INFO 2024-08-18 20:08:01,840 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:08:01,840 lm-eval:251: Overwriting default num_fewshot of mmlu_abstract_algebra from None to 5
 INFO 2024-08-18 20:08:01,840 lm-eval:261: Setting fewshot random generator seed to 1234
 INFO 2024-08-18 20:08:01,845 lm-eval:411: Building contexts for mmlu_world_religions on rank 0...
 100% 171/171 [00:01<00:00, 136.70it/s]
 INFO 2024-08-18 20:08:03,104 lm-eval:411: Building contexts for mmlu_virology on rank 0...
 100% 166/166 [00:01<00:00, 137.65it/s]
 INFO 2024-08-18 20:08:04,318 lm-eval:411: Building contexts for mmlu_us_foreign_policy on rank 0...
 100% 100/100 [00:00<00:00, 137.17it/s]
 INFO 2024-08-18 20:08:05,053 lm-eval:411: Building contexts for mmlu_sociology on rank 0...
 100% 201/201 [00:01<00:00, 137.20it/s]
 INFO 2024-08-18 20:08:06,528 lm-eval:411: Building contexts for mmlu_security_studies on rank 0...
 100% 245/245 [00:01<00:00, 137.06it/s]
 INFO 2024-08-18 20:08:08,328 lm-eval:411: Building contexts for mmlu_public_relations on rank 0...
 100% 110/110 [00:00<00:00, 137.91it/s]
 INFO 2024-08-18 20:08:09,132 lm-eval:411: Building contexts for mmlu_professional_psychology on rank 0...
 100% 612/612 [00:04<00:00, 137.30it/s]
 INFO 2024-08-18 20:08:13,618 lm-eval:411: Building contexts for mmlu_professional_medicine on rank 0...
 100% 272/272 [00:01<00:00, 137.57it/s]
 INFO 2024-08-18 20:08:15,610 lm-eval:411: Building contexts for mmlu_professional_law on rank 0...
 100% 1534/1534 [00:11<00:00, 137.50it/s]
 INFO 2024-08-18 20:08:26,840 lm-eval:411: Building contexts for mmlu_professional_accounting on rank 0...
 100% 282/282 [00:02<00:00, 137.72it/s]
 INFO 2024-08-18 20:08:28,902 lm-eval:411: Building contexts for mmlu_prehistory on rank 0...
 100% 324/324 [00:02<00:00, 137.90it/s]
 INFO 2024-08-18 20:08:31,267 lm-eval:411: Building contexts for mmlu_philosophy on rank 0...
 100% 311/311 [00:02<00:00, 137.34it/s]
 INFO 2024-08-18 20:08:33,547 lm-eval:411: Building contexts for mmlu_nutrition on rank 0...
 100% 306/306 [00:02<00:00, 137.31it/s]
 INFO 2024-08-18 20:08:35,790 lm-eval:411: Building contexts for mmlu_moral_scenarios on rank 0...
 100% 895/895 [00:06<00:00, 137.60it/s]
 INFO 2024-08-18 20:08:42,336 lm-eval:411: Building contexts for mmlu_moral_disputes on rank 0...
 100% 346/346 [00:02<00:00, 137.95it/s]
 INFO 2024-08-18 20:08:44,861 lm-eval:411: Building contexts for mmlu_miscellaneous on rank 0...
 100% 783/783 [00:05<00:00, 138.18it/s]
 INFO 2024-08-18 20:08:50,564 lm-eval:411: Building contexts for mmlu_medical_genetics on rank 0...
 100% 100/100 [00:00<00:00, 137.18it/s]
 INFO 2024-08-18 20:08:51,298 lm-eval:411: Building contexts for mmlu_marketing on rank 0...
 100% 234/234 [00:01<00:00, 137.53it/s]
 INFO 2024-08-18 20:08:53,011 lm-eval:411: Building contexts for mmlu_management on rank 0...
 100% 103/103 [00:00<00:00, 137.86it/s]
 INFO 2024-08-18 20:08:53,764 lm-eval:411: Building contexts for mmlu_machine_learning on rank 0...
 100% 112/112 [00:00<00:00, 137.95it/s]
 INFO 2024-08-18 20:08:54,582 lm-eval:411: Building contexts for mmlu_logical_fallacies on rank 0...
 100% 163/163 [00:01<00:00, 137.78it/s]
 INFO 2024-08-18 20:08:55,773 lm-eval:411: Building contexts for mmlu_jurisprudence on rank 0...
 100% 108/108 [00:00<00:00, 138.26it/s]
 INFO 2024-08-18 20:08:56,559 lm-eval:411: Building contexts for mmlu_international_law on rank 0...
 100% 121/121 [00:00<00:00, 137.86it/s]
 INFO 2024-08-18 20:08:57,444 lm-eval:411: Building contexts for mmlu_human_sexuality on rank 0...
 100% 131/131 [00:00<00:00, 137.92it/s]
 INFO 2024-08-18 20:08:58,400 lm-eval:411: Building contexts for mmlu_human_aging on rank 0...
 100% 223/223 [00:01<00:00, 138.55it/s]
 INFO 2024-08-18 20:09:00,021 lm-eval:411: Building contexts for mmlu_high_school_world_history on rank 0...
 100% 237/237 [00:01<00:00, 137.41it/s]
 INFO 2024-08-18 20:09:01,757 lm-eval:411: Building contexts for mmlu_high_school_us_history on rank 0...
 100% 204/204 [00:01<00:00, 137.87it/s]
 INFO 2024-08-18 20:09:03,248 lm-eval:411: Building contexts for mmlu_high_school_statistics on rank 0...
 100% 216/216 [00:01<00:00, 138.70it/s]
 INFO 2024-08-18 20:09:04,816 lm-eval:411: Building contexts for mmlu_high_school_psychology on rank 0...
 100% 545/545 [00:03<00:00, 138.10it/s]
 INFO 2024-08-18 20:09:08,787 lm-eval:411: Building contexts for mmlu_high_school_physics on rank 0...
 100% 151/151 [00:01<00:00, 137.64it/s]
 INFO 2024-08-18 20:09:09,892 lm-eval:411: Building contexts for mmlu_high_school_microeconomics on rank 0...
 100% 238/238 [00:01<00:00, 138.03it/s]
 INFO 2024-08-18 20:09:11,628 lm-eval:411: Building contexts for mmlu_high_school_mathematics on rank 0...
 100% 270/270 [00:01<00:00, 137.88it/s]
 INFO 2024-08-18 20:09:13,599 lm-eval:411: Building contexts for mmlu_high_school_macroeconomics on rank 0...
 100% 390/390 [00:02<00:00, 138.12it/s]
 INFO 2024-08-18 20:09:16,441 lm-eval:411: Building contexts for mmlu_high_school_government_and_politics on rank 0...
 100% 193/193 [00:01<00:00, 137.78it/s]
 INFO 2024-08-18 20:09:17,851 lm-eval:411: Building contexts for mmlu_high_school_geography on rank 0...
 100% 198/198 [00:01<00:00, 138.13it/s]
 INFO 2024-08-18 20:09:19,294 lm-eval:411: Building contexts for mmlu_high_school_european_history on rank 0...
 100% 165/165 [00:01<00:00, 136.67it/s]
 INFO 2024-08-18 20:09:20,511 lm-eval:411: Building contexts for mmlu_high_school_computer_science on rank 0...
 100% 100/100 [00:00<00:00, 137.40it/s]
 INFO 2024-08-18 20:09:21,244 lm-eval:411: Building contexts for mmlu_high_school_chemistry on rank 0...
 100% 203/203 [00:01<00:00, 107.90it/s]
 INFO 2024-08-18 20:09:23,135 lm-eval:411: Building contexts for mmlu_high_school_biology on rank 0...
 100% 310/310 [00:02<00:00, 137.74it/s]
 INFO 2024-08-18 20:09:25,401 lm-eval:411: Building contexts for mmlu_global_facts on rank 0...
 100% 100/100 [00:00<00:00, 138.25it/s]
 INFO 2024-08-18 20:09:26,129 lm-eval:411: Building contexts for mmlu_formal_logic on rank 0...
 100% 126/126 [00:00<00:00, 137.97it/s]
 INFO 2024-08-18 20:09:27,049 lm-eval:411: Building contexts for mmlu_elementary_mathematics on rank 0...
 100% 378/378 [00:02<00:00, 138.66it/s]
 INFO 2024-08-18 20:09:29,792 lm-eval:411: Building contexts for mmlu_electrical_engineering on rank 0...
 100% 145/145 [00:01<00:00, 138.30it/s]
 INFO 2024-08-18 20:09:30,848 lm-eval:411: Building contexts for mmlu_econometrics on rank 0...
 100% 114/114 [00:00<00:00, 138.62it/s]
 INFO 2024-08-18 20:09:31,676 lm-eval:411: Building contexts for mmlu_conceptual_physics on rank 0...
 100% 235/235 [00:01<00:00, 138.75it/s]
 INFO 2024-08-18 20:09:33,382 lm-eval:411: Building contexts for mmlu_computer_security on rank 0...
 100% 100/100 [00:00<00:00, 138.84it/s]
 INFO 2024-08-18 20:09:34,107 lm-eval:411: Building contexts for mmlu_college_physics on rank 0...
 100% 102/102 [00:00<00:00, 138.92it/s]
 INFO 2024-08-18 20:09:34,846 lm-eval:411: Building contexts for mmlu_college_medicine on rank 0...
 100% 173/173 [00:01<00:00, 138.30it/s]
 INFO 2024-08-18 20:09:36,106 lm-eval:411: Building contexts for mmlu_college_mathematics on rank 0...
 100% 100/100 [00:00<00:00, 138.92it/s]
 INFO 2024-08-18 20:09:36,831 lm-eval:411: Building contexts for mmlu_college_computer_science on rank 0...
 100% 100/100 [00:00<00:00, 138.70it/s]
 INFO 2024-08-18 20:09:37,557 lm-eval:411: Building contexts for mmlu_college_chemistry on rank 0...
 100% 100/100 [00:00<00:00, 138.24it/s]
 INFO 2024-08-18 20:09:38,286 lm-eval:411: Building contexts for mmlu_college_biology on rank 0...
 100% 144/144 [00:01<00:00, 139.08it/s]
 INFO 2024-08-18 20:09:39,329 lm-eval:411: Building contexts for mmlu_clinical_knowledge on rank 0...
 100% 265/265 [00:01<00:00, 138.97it/s]
 INFO 2024-08-18 20:09:41,248 lm-eval:411: Building contexts for mmlu_business_ethics on rank 0...
 100% 100/100 [00:00<00:00, 138.40it/s]
 INFO 2024-08-18 20:09:41,976 lm-eval:411: Building contexts for mmlu_astronomy on rank 0...
 100% 152/152 [00:01<00:00, 139.01it/s]
 INFO 2024-08-18 20:09:43,077 lm-eval:411: Building contexts for mmlu_anatomy on rank 0...
 100% 135/135 [00:00<00:00, 138.23it/s]
 INFO 2024-08-18 20:09:44,061 lm-eval:411: Building contexts for mmlu_abstract_algebra on rank 0...
 100% 100/100 [00:00<00:00, 139.49it/s]
 INFO 2024-08-18 20:09:44,783 lm-eval:438: Running loglikelihood requests
 Running loglikelihood requests:   0% 0/56168 [00:00<?, ?it/s]
 Passed argument batch_size = auto:1. Detecting largest batch size
 We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
 Determined largest batch size: 16
 Running loglikelihood requests: 100% 56168/56168 [13:58<00:00, 66.96it/s] 
 WARNING 2024-08-18 20:26:55,272 lm-eval:1315: Failed to get model SHA for /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192 at revision main. Error: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192'. Use `repo_type` argument if needed.
 fatal: not a git repository (or any of the parent directories): .git
 CHECKPOINT EVALUATION: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_192 SCORED 0.5271100893470168
 INFO 2024-08-18 20:27:00,200 lm-eval:152: Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
 INFO 2024-08-18 20:27:00,200 lm-eval:189: Initializing hf model, with arguments: {'pretrained': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320', 'dtype': 'bfloat16'}
 INFO 2024-08-18 20:27:00,202 lm-eval:170: Using device 'cuda'
 WARNING 2024-08-18 20:27:41,442 lm-eval:251: Overwriting default num_fewshot of mmlu_world_religions from None to 5
 INFO 2024-08-18 20:27:41,442 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,442 lm-eval:251: Overwriting default num_fewshot of mmlu_virology from None to 5
 INFO 2024-08-18 20:27:41,442 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,442 lm-eval:251: Overwriting default num_fewshot of mmlu_us_foreign_policy from None to 5
 INFO 2024-08-18 20:27:41,442 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_sociology from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_security_studies from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_public_relations from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_psychology from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_medicine from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_law from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_professional_accounting from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_prehistory from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_philosophy from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_nutrition from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_scenarios from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_moral_disputes from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_miscellaneous from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_medical_genetics from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_marketing from None to 5
 INFO 2024-08-18 20:27:41,443 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,443 lm-eval:251: Overwriting default num_fewshot of mmlu_management from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_machine_learning from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_logical_fallacies from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_jurisprudence from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_international_law from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_human_sexuality from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_human_aging from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_world_history from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_us_history from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_statistics from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_psychology from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_physics from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_microeconomics from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_mathematics from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_macroeconomics from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_government_and_politics from None to 5
 INFO 2024-08-18 20:27:41,444 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,444 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_geography from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_european_history from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_computer_science from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_chemistry from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_high_school_biology from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_global_facts from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_formal_logic from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_elementary_mathematics from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_electrical_engineering from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_econometrics from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_conceptual_physics from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_computer_security from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_physics from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_medicine from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_mathematics from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,445 lm-eval:251: Overwriting default num_fewshot of mmlu_college_computer_science from None to 5
 INFO 2024-08-18 20:27:41,445 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_college_chemistry from None to 5
 INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_college_biology from None to 5
 INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_clinical_knowledge from None to 5
 INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_business_ethics from None to 5
 INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_astronomy from None to 5
 INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_anatomy from None to 5
 INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
 WARNING 2024-08-18 20:27:41,446 lm-eval:251: Overwriting default num_fewshot of mmlu_abstract_algebra from None to 5
 INFO 2024-08-18 20:27:41,446 lm-eval:261: Setting fewshot random generator seed to 1234
 INFO 2024-08-18 20:27:41,451 lm-eval:411: Building contexts for mmlu_world_religions on rank 0...
 100% 171/171 [00:01<00:00, 136.86it/s]
 INFO 2024-08-18 20:27:42,709 lm-eval:411: Building contexts for mmlu_virology on rank 0...
 100% 166/166 [00:01<00:00, 137.15it/s]
 INFO 2024-08-18 20:27:43,928 lm-eval:411: Building contexts for mmlu_us_foreign_policy on rank 0...
 100% 100/100 [00:00<00:00, 136.60it/s]
 INFO 2024-08-18 20:27:44,666 lm-eval:411: Building contexts for mmlu_sociology on rank 0...
 100% 201/201 [00:01<00:00, 136.57it/s]
 INFO 2024-08-18 20:27:46,148 lm-eval:411: Building contexts for mmlu_security_studies on rank 0...
 100% 245/245 [00:01<00:00, 136.39it/s]
 INFO 2024-08-18 20:27:47,957 lm-eval:411: Building contexts for mmlu_public_relations on rank 0...
 100% 110/110 [00:00<00:00, 137.57it/s]
 INFO 2024-08-18 20:27:48,763 lm-eval:411: Building contexts for mmlu_professional_psychology on rank 0...
 100% 612/612 [00:04<00:00, 136.85it/s]
 INFO 2024-08-18 20:27:53,266 lm-eval:411: Building contexts for mmlu_professional_medicine on rank 0...
 100% 272/272 [00:01<00:00, 136.23it/s]
 INFO 2024-08-18 20:27:55,277 lm-eval:411: Building contexts for mmlu_professional_law on rank 0...
 100% 1534/1534 [00:11<00:00, 137.38it/s]
 INFO 2024-08-18 20:28:06,522 lm-eval:411: Building contexts for mmlu_professional_accounting on rank 0...
 100% 282/282 [00:02<00:00, 137.60it/s]
 INFO 2024-08-18 20:28:08,586 lm-eval:411: Building contexts for mmlu_prehistory on rank 0...
 100% 324/324 [00:02<00:00, 137.90it/s]
 INFO 2024-08-18 20:28:10,951 lm-eval:411: Building contexts for mmlu_philosophy on rank 0...
 100% 311/311 [00:02<00:00, 138.05it/s]
 INFO 2024-08-18 20:28:13,219 lm-eval:411: Building contexts for mmlu_nutrition on rank 0...
 100% 306/306 [00:02<00:00, 138.17it/s]
 INFO 2024-08-18 20:28:15,448 lm-eval:411: Building contexts for mmlu_moral_scenarios on rank 0...
 100% 895/895 [00:06<00:00, 138.76it/s]
 INFO 2024-08-18 20:28:21,941 lm-eval:411: Building contexts for mmlu_moral_disputes on rank 0...
 100% 346/346 [00:02<00:00, 138.65it/s]
 INFO 2024-08-18 20:28:24,454 lm-eval:411: Building contexts for mmlu_miscellaneous on rank 0...
 100% 783/783 [00:05<00:00, 138.51it/s]
 INFO 2024-08-18 20:28:30,144 lm-eval:411: Building contexts for mmlu_medical_genetics on rank 0...
 100% 100/100 [00:00<00:00, 138.01it/s]
 INFO 2024-08-18 20:28:30,874 lm-eval:411: Building contexts for mmlu_marketing on rank 0...
 100% 234/234 [00:01<00:00, 138.10it/s]
 INFO 2024-08-18 20:28:32,580 lm-eval:411: Building contexts for mmlu_management on rank 0...
 100% 103/103 [00:00<00:00, 138.77it/s]
 INFO 2024-08-18 20:28:33,327 lm-eval:411: Building contexts for mmlu_machine_learning on rank 0...
 100% 112/112 [00:00<00:00, 138.61it/s]
 INFO 2024-08-18 20:28:34,141 lm-eval:411: Building contexts for mmlu_logical_fallacies on rank 0...
 100% 163/163 [00:01<00:00, 139.13it/s]
 INFO 2024-08-18 20:28:35,321 lm-eval:411: Building contexts for mmlu_jurisprudence on rank 0...
 100% 108/108 [00:00<00:00, 138.78it/s]
 INFO 2024-08-18 20:28:36,105 lm-eval:411: Building contexts for mmlu_international_law on rank 0...
 100% 121/121 [00:00<00:00, 138.84it/s]
 INFO 2024-08-18 20:28:36,982 lm-eval:411: Building contexts for mmlu_human_sexuality on rank 0...
 100% 131/131 [00:00<00:00, 138.95it/s]
 INFO 2024-08-18 20:28:37,932 lm-eval:411: Building contexts for mmlu_human_aging on rank 0...
 100% 223/223 [00:01<00:00, 139.25it/s]
 INFO 2024-08-18 20:28:39,544 lm-eval:411: Building contexts for mmlu_high_school_world_history on rank 0...
 100% 237/237 [00:01<00:00, 138.81it/s]
 INFO 2024-08-18 20:28:41,264 lm-eval:411: Building contexts for mmlu_high_school_us_history on rank 0...
 100% 204/204 [00:01<00:00, 136.33it/s]
 INFO 2024-08-18 20:28:42,771 lm-eval:411: Building contexts for mmlu_high_school_statistics on rank 0...
 100% 216/216 [00:01<00:00, 138.37it/s]
 INFO 2024-08-18 20:28:44,343 lm-eval:411: Building contexts for mmlu_high_school_psychology on rank 0...
 100% 545/545 [00:03<00:00, 137.23it/s]
 INFO 2024-08-18 20:28:48,339 lm-eval:411: Building contexts for mmlu_high_school_physics on rank 0...
 100% 151/151 [00:01<00:00, 138.28it/s]
 INFO 2024-08-18 20:28:49,439 lm-eval:411: Building contexts for mmlu_high_school_microeconomics on rank 0...
 100% 238/238 [00:01<00:00, 138.44it/s]
 INFO 2024-08-18 20:28:51,170 lm-eval:411: Building contexts for mmlu_high_school_mathematics on rank 0...
 100% 270/270 [00:01<00:00, 137.69it/s]
 INFO 2024-08-18 20:28:53,144 lm-eval:411: Building contexts for mmlu_high_school_macroeconomics on rank 0...
 100% 390/390 [00:02<00:00, 138.61it/s]
 INFO 2024-08-18 20:28:55,975 lm-eval:411: Building contexts for mmlu_high_school_government_and_politics on rank 0...
 100% 193/193 [00:01<00:00, 138.85it/s]
 INFO 2024-08-18 20:28:57,375 lm-eval:411: Building contexts for mmlu_high_school_geography on rank 0...
 100% 198/198 [00:01<00:00, 138.88it/s]
 INFO 2024-08-18 20:28:58,810 lm-eval:411: Building contexts for mmlu_high_school_european_history on rank 0...
 100% 165/165 [00:01<00:00, 135.99it/s]
 INFO 2024-08-18 20:29:00,033 lm-eval:411: Building contexts for mmlu_high_school_computer_science on rank 0...
 100% 100/100 [00:00<00:00, 138.28it/s]
 INFO 2024-08-18 20:29:00,761 lm-eval:411: Building contexts for mmlu_high_school_chemistry on rank 0...
 100% 203/203 [00:01<00:00, 135.51it/s]
 INFO 2024-08-18 20:29:02,269 lm-eval:411: Building contexts for mmlu_high_school_biology on rank 0...
 100% 310/310 [00:02<00:00, 115.72it/s]
 INFO 2024-08-18 20:29:04,963 lm-eval:411: Building contexts for mmlu_global_facts on rank 0...
 100% 100/100 [00:00<00:00, 135.44it/s]
 INFO 2024-08-18 20:29:05,706 lm-eval:411: Building contexts for mmlu_formal_logic on rank 0...
 100% 126/126 [00:00<00:00, 135.56it/s]
 INFO 2024-08-18 20:29:06,642 lm-eval:411: Building contexts for mmlu_elementary_mathematics on rank 0...
 100% 378/378 [00:02<00:00, 135.60it/s]
 INFO 2024-08-18 20:29:09,447 lm-eval:411: Building contexts for mmlu_electrical_engineering on rank 0...
 100% 145/145 [00:01<00:00, 136.37it/s]
 INFO 2024-08-18 20:29:10,518 lm-eval:411: Building contexts for mmlu_econometrics on rank 0...
 100% 114/114 [00:00<00:00, 135.71it/s]
 INFO 2024-08-18 20:29:11,363 lm-eval:411: Building contexts for mmlu_conceptual_physics on rank 0...
 100% 235/235 [00:01<00:00, 136.95it/s]
 INFO 2024-08-18 20:29:13,091 lm-eval:411: Building contexts for mmlu_computer_security on rank 0...
 100% 100/100 [00:00<00:00, 136.87it/s]
 INFO 2024-08-18 20:29:13,827 lm-eval:411: Building contexts for mmlu_college_physics on rank 0...
 100% 102/102 [00:00<00:00, 136.91it/s]
 INFO 2024-08-18 20:29:14,577 lm-eval:411: Building contexts for mmlu_college_medicine on rank 0...
 100% 173/173 [00:01<00:00, 136.72it/s]
 INFO 2024-08-18 20:29:15,851 lm-eval:411: Building contexts for mmlu_college_mathematics on rank 0...
 100% 100/100 [00:00<00:00, 136.59it/s]
 INFO 2024-08-18 20:29:16,589 lm-eval:411: Building contexts for mmlu_college_computer_science on rank 0...
 100% 100/100 [00:00<00:00, 136.43it/s]
 INFO 2024-08-18 20:29:17,327 lm-eval:411: Building contexts for mmlu_college_chemistry on rank 0...
 100% 100/100 [00:00<00:00, 136.50it/s]
 INFO 2024-08-18 20:29:18,065 lm-eval:411: Building contexts for mmlu_college_biology on rank 0...
 100% 144/144 [00:01<00:00, 136.85it/s]
 INFO 2024-08-18 20:29:19,125 lm-eval:411: Building contexts for mmlu_clinical_knowledge on rank 0...
 100% 265/265 [00:01<00:00, 137.03it/s]
 INFO 2024-08-18 20:29:21,071 lm-eval:411: Building contexts for mmlu_business_ethics on rank 0...
 100% 100/100 [00:00<00:00, 136.58it/s]
 INFO 2024-08-18 20:29:21,808 lm-eval:411: Building contexts for mmlu_astronomy on rank 0...
 100% 152/152 [00:01<00:00, 136.77it/s]
 INFO 2024-08-18 20:29:22,927 lm-eval:411: Building contexts for mmlu_anatomy on rank 0...
 100% 135/135 [00:00<00:00, 137.49it/s]
 INFO 2024-08-18 20:29:23,916 lm-eval:411: Building contexts for mmlu_abstract_algebra on rank 0...
 100% 100/100 [00:00<00:00, 138.03it/s]
 INFO 2024-08-18 20:29:24,646 lm-eval:438: Running loglikelihood requests
 Running loglikelihood requests:   0% 0/56168 [00:00<?, ?it/s]Passed argument batch_size = auto:1. Detecting largest batch size
 Determined largest batch size: 16
 Running loglikelihood requests: 100% 56168/56168 [13:58<00:00, 66.98it/s] 
 WARNING 2024-08-18 20:46:42,688 lm-eval:1315: Failed to get model SHA for /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 at revision main. Error: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320'. Use `repo_type` argument if needed.
 fatal: not a git repository (or any of the parent directories): .git
 CHECKPOINT EVALUATION: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 SCORED 0.5283330937684491
 Training Phase 2/2...
 TrainingArgs for current phase: TrainingArgs(model_path='/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320', chat_tmpl_path='/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py', data_path='/var/mnt/inststg1/instructlab/generated/skills_train_msgs_2024-08-18T15_57_14.jsonl', ckpt_output_dir='/var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints', data_output_dir='/var/mnt/inststg1/instructlab/.local/share/instructlab/internal', max_seq_len=4096, max_batch_len=10000, num_epochs=2, effective_batch_size=3840, save_samples=0, learning_rate=2e-05, warmup_steps=25, is_padding_free=False, random_seed=42, checkpoint_at_epoch=True, mock_data=False, mock_data_len=0, deepspeed_options=DeepSpeedOptions(cpu_offload_optimizer=False, cpu_offload_optimizer_ratio=1.0, cpu_offload_optimizer_pin_memory=False, save_samples=None), disable_flash_attn=False, lora=None)
 INFO 2024-08-18 20:46:47,145 root:611: eos: 32001, pad: 32002, system: 32003, user: 32004, assistant: 32005
 Generating train split: 10000 examples [00:00, 100155.07 examples/s]
 tokenizing the dataset with /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 tokenizer...
 Setting TOKENIZERS_PARALLELISM=false for forked processes.
 WARNING 2024-08-18 20:46:47,316 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
 Map (num_proc=16): 100% 10000/10000 [00:02<00:00, 3765.08 examples/s]
 ten largest length percentiles:
 Setting TOKENIZERS_PARALLELISM=false for forked processes.
 WARNING 2024-08-18 20:46:50,962 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
 Map (num_proc=16): 100% 10000/10000 [00:00<00:00, 16336.33 examples/s]
 quantile 90th: 1283.0
 quantile 91th: 1367.0
 quantile 92th: 1453.0
 quantile 93th: 1579.0
 quantile 94th: 1704.0599999999995
 quantile 95th: 1843.1499999999978
 quantile 96th: 2046.1199999999972
 quantile 97th: 2356.179999999993
 quantile 98th: 2724.100000000002
 quantile 99th: 3213.0200000000004
 quantile 100th: 5765.0

 at 4096 max sequence length, the number of samples to be dropped is 19
 (0.19% of total)
 quantile 0th: 70.0
 quantile 1th: 81.0
 quantile 2th: 85.0
 quantile 3th: 87.0
 quantile 4th: 91.0
 quantile 5th: 94.0
 quantile 6th: 97.93999999999994
 quantile 7th: 102.0
 quantile 8th: 108.0
 quantile 9th: 113.0
 quantile 10th: 118.0
 at 20 min sequence length, the number of samples to be dropped is 0
 checking the validity of the samples...
 Setting TOKENIZERS_PARALLELISM=false for forked processes.
 WARNING 2024-08-18 20:46:52,663 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
 Filter (num_proc=16): 100% 10000/10000 [00:01<00:00, 8244.08 examples/s] 
 INFO 2024-08-18 20:46:54,896 root:611: number of dropped samples: 19 -- out of 10000
 Categorizing training data type...
 Data type sorting: 100% 9981/9981 [00:00<00:00, 112525.30it/s]
 unmasking the appropriate message content...
 Setting TOKENIZERS_PARALLELISM=false for forked processes.
 WARNING 2024-08-18 20:46:57,636 datasets.arrow_dataset:3211: Setting TOKENIZERS_PARALLELISM=false for forked processes.
 Map (num_proc=16): 100% 9981/9981 [00:01<00:00, 9644.98 examples/s] 
 The following are some examples of the processed data, with masked tokens (not to be learned) represented with <mask>. The unmasked tokens are the ones the model will learn to predict. Please review these samples to ensure the model is learning to predict expected tokens.

 Instruction ex sample 6312: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> 
 The TVs in "2001: A Space Odyssey" might initially seem random, but they are actually thoughtfully integrated into the film's narrative and symbolism. Here are some instances where TVs play a significant role:

 1. **Planetary Alignment:** In the opening scene, a TV screen displays the alignment of the planets, which is crucial to the plot and the film's exploration of extraterrestrial intelligence.
 2. **News Broadcasts:** Later in the movie, TVs are used to show news broadcasts, providing context and information about the ongoing space exploration and human presence beyond Earth.
 3. **Interviews:** TVs are also used to present interviews with characters, offering insights into their thoughts and motivations.
 4. **Monolith's Influence:** The iconic scene where the monolith appears to the apes is accompanied by a flashing TV screen. This juxtaposition highlights the monolith's influence over technology and human progress, as if the TV screen is a modern-day monolith guiding human development.

 In essence, the TVs in "2001: A Space Odyssey" are carefully crafted elements that contribute to the film's visual language and themes, such as technology, human evolution, and the search for extraterrestrial intelligence.<|endoftext|>
 Original Input: <|system|> 
 I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
 <|user|> 
 What are the random TVs in 2001?
 <|assistant|> 
 The TVs in "2001: A Space Odyssey" might initially seem random, but they are actually thoughtfully integrated into the film's narrative and symbolism. Here are some instances where TVs play a significant role:

 1. **Planetary Alignment:** In the opening scene, a TV screen displays the alignment of the planets, which is crucial to the plot and the film's exploration of extraterrestrial intelligence.
 2. **News Broadcasts:** Later in the movie, TVs are used to show news broadcasts, providing context and information about the ongoing space exploration and human presence beyond Earth.
 3. **Interviews:** TVs are also used to present interviews with characters, offering insights into their thoughts and motivations.
 4. **Monolith's Influence:** The iconic scene where the monolith appears to the apes is accompanied by a flashing TV screen. This juxtaposition highlights the monolith's influence over technology and human progress, as if the TV screen is a modern-day monolith guiding human development.

 In essence, the TVs in "2001: A Space Odyssey" are carefully crafted elements that contribute to the film's visual language and themes, such as technology, human evolution, and the search for extraterrestrial intelligence.<|endoftext|>

 Instruction ex sample 6891: <mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask><mask> 
 Mars, the fourth planet from the Sun, has a fascinating and complex weather system. Here are some intriguing facts about the weather on Mars:

 - **Thin Atmosphere:** Mars has a very thin atmosphere, which is only about 1% as dense as Earth's. It is primarily composed of carbon dioxide (CO2), with minor amounts of nitrogen and argon.

 - **Extreme Temperature Swings:** Due to the thin atmosphere and Mars' distance from the Sun, the planet experiences extreme temperature fluctuations. Daytime temperatures can reach up to 70°F (20°C), while nighttime temperatures can plummet to -100°F (-73°C).

 - **Dust Storms:** Mars is known for its massive dust storms that can cover the entire planet and last for months. These storms can reach speeds of up to 70 mph (113 km/h) and are so intense that they can alter the planet's albedo (reflectivity) and even be detected by telescopes on Earth.

 - **Winds:** Mars has a surprisingly active wind system, with average wind speeds of about 22 mph (35 km/h). These winds are primarily caused by the planet's rotation and temperature differences between the equator and the poles.

 - **Frozen Carbon Dioxide:** During the Martian winter, temperatures at the poles drop low enough for carbon dioxide to freeze, forming a layer of dry ice. This process contributes to the formation of the polar ice caps, which are primarily composed of water ice and dust.

 - **Seasonal Changes:** Mars experiences seasonal changes due to its elliptical orbit around the Sun. The planet is closer to the Sun during its summer, leading to more extreme temperature variations and more frequent dust storms.

 - **Absence of Liquid Water:** Due to Mars' low atmospheric pressure and temperature, liquid water cannot exist on the surface for an extended period. However, evidence suggests that water may have flowed on the planet's surface in the past, and there is the possibility of subsurface water reservoirs.

 These fascinating weather characteristics make Mars an intriguing subject for scientists and researchers studying planetary atmospheres and climates.<|endoftext|>
 Original Input: <|system|> 
 I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant.
 <|user|> 
 What are some interesting facts about the weather on Mars? Please use bullet points.
 <|assistant|> 
 Mars, the fourth planet from the Sun, has a fascinating and complex weather system. Here are some intriguing facts about the weather on Mars:

 - **Thin Atmosphere:** Mars has a very thin atmosphere, which is only about 1% as dense as Earth's. It is primarily composed of carbon dioxide (CO2), with minor amounts of nitrogen and argon.

 - **Extreme Temperature Swings:** Due to the thin atmosphere and Mars' distance from the Sun, the planet experiences extreme temperature fluctuations. Daytime temperatures can reach up to 70°F (20°C), while nighttime temperatures can plummet to -100°F (-73°C).

 - **Dust Storms:** Mars is known for its massive dust storms that can cover the entire planet and last for months. These storms can reach speeds of up to 70 mph (113 km/h) and are so intense that they can alter the planet's albedo (reflectivity) and even be detected by telescopes on Earth.

 - **Winds:** Mars has a surprisingly active wind system, with average wind speeds of about 22 mph (35 km/h). These winds are primarily caused by the planet's rotation and temperature differences between the equator and the poles.

 - **Frozen Carbon Dioxide:** During the Martian winter, temperatures at the poles drop low enough for carbon dioxide to freeze, forming a layer of dry ice. This process contributes to the formation of the polar ice caps, which are primarily composed of water ice and dust.

 - **Seasonal Changes:** Mars experiences seasonal changes due to its elliptical orbit around the Sun. The planet is closer to the Sun during its summer, leading to more extreme temperature variations and more frequent dust storms.

 - **Absence of Liquid Water:** Due to Mars' low atmospheric pressure and temperature, liquid water cannot exist on the surface for an extended period. However, evidence suggests that water may have flowed on the planet's surface in the past, and there is the possibility of subsurface water reservoirs.

 These fascinating weather characteristics make Mars an intriguing subject for scientists and researchers studying planetary atmospheres and climates.<|endoftext|>

 Creating json from Arrow format: 100% 10/10 [00:01<00:00,  7.02ba/s]
 Running command: torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --rdzv_id=123 --rdzv_endpoint=127.0.0.1:12222 /opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py --model_name_or_path=/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320 --data_path=/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl --output_dir=/var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints --num_epochs=2 --effective_batch_size=3840 --learning_rate=2e-05 --num_warmup_steps=25 --save_samples=0 --log_level=INFO --max_batch_len=10000 --seed=42 --chat-tmpl-path=/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py --checkpoint_at_epoch
 W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757] 
 W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757] *****************************************
 W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
 W0818 20:47:08.033000 139686984577472 torch/distributed/run.py:757] *****************************************
 [2024-08-18 20:47:10,896] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:47:11,042] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:47:11,152] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:47:11,205] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:47:11,218] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:47:11,232] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:47:11,260] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-08-18 20:47:11,287] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 model_name_or_path: /var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320
 data_path: /var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl
 output_dir: /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints
 num_epochs: 2
 last_step: 0
 effective_batch_size: 3840
 learning_rate: 2.0e-05
 lr_scheduler: cosine
 num_warmup_steps: 25
 save_samples: 0
 save_samples_ds: null
 save_last: false
 checkpoint_at_epoch: true
 log_level: INFO
 seed: 42
 mock_data: false
 mock_len: 2600
 sharding_strategy: FULL_SHARD
 is_granite: false
 lora_r: 0
 lora_alpha: 32
 lora_dropout: 0.1
 lora_quant_bits: null
 lora_target_modules: null
 max_batch_len: 10000
 cpu_offload_optimizer: false
 cpu_offload_optimizer_pin_memory: false
 cpu_offload_optimizer_ratio: 1.0
 NEFTune_alpha: null
 chat_tmpl_path: /opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py
 disable_flash_attn: false

 {
    "script_params": {
        "model_name_or_path": "/var/mnt/inststg1/instructlab/phasedbasedir/phase1/checkpoints/hf_format/samples_320",
        "data_path": "/var/mnt/inststg1/instructlab/.local/share/instructlab/internal/data.jsonl",
        "output_dir": "/var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints",
        "num_epochs": 2,
        "last_step": 0,
        "effective_batch_size": 3840,
        "learning_rate": 2e-05,
        "lr_scheduler": "cosine",
        "num_warmup_steps": 25,
        "save_samples": 0,
        "save_samples_ds": null,
        "save_last": false,
        "checkpoint_at_epoch": true,
        "log_level": "INFO",
        "seed": 42,
        "mock_data": false,
        "mock_len": 2600,
        "sharding_strategy": "FULL_SHARD",
        "is_granite": false,
        "lora_r": 0,
        "lora_alpha": 32,
        "lora_dropout": 0.1,
        "lora_quant_bits": null,
        "lora_target_modules": null,
        "max_batch_len": 10000,
        "cpu_offload_optimizer": false,
        "cpu_offload_optimizer_pin_memory": false,
        "cpu_offload_optimizer_ratio": 1.0,
        "NEFTune_alpha": null,
        "chat_tmpl_path": "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py",
        "disable_flash_attn": false
    },
    "timestamp": "2024-08-18T20:47:14.720513"
 }
 [2024-08-18 20:47:14,794] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:47:14,794] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
 tyler-a100-newimage-val:10546:10546 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10546:10546 [0] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10546:10546 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
 [2024-08-18 20:47:15,959] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:47:15,969] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:47:15,974] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:47:15,976] [INFO] [comm.py:637:init_distributed] cdb=None
 tyler-a100-newimage-val:10548:10548 [2] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10548:10548 [2] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10548:10548 [2] NCCL INFO NCCL version 2.22.3+cuda12.5
 [2024-08-18 20:47:15,987] [INFO] [comm.py:637:init_distributed] cdb=None
 [2024-08-18 20:47:15,995] [INFO] [comm.py:637:init_distributed] cdb=None
 tyler-a100-newimage-val:10550:10550 [4] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10550:10550 [4] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10550:10550 [4] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:10552:10552 [6] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10552:10552 [6] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10552:10552 [6] NCCL INFO NCCL version 2.22.3+cuda12.5
 [2024-08-18 20:47:16,031] [INFO] [comm.py:637:init_distributed] cdb=None
 tyler-a100-newimage-val:10551:10551 [5] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10551:10551 [5] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10551:10551 [5] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:10549:10549 [3] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10549:10549 [3] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10549:10549 [3] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:10547:10547 [1] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10547:10547 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10547:10547 [1] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:10553:10553 [7] NCCL INFO cudaDriverVersion 12040
 tyler-a100-newimage-val:10553:10553 [7] NCCL INFO Bootstrap : Using enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10553:10553 [7] NCCL INFO NCCL version 2.22.3+cuda12.5
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NET/IB : No device found.
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.48.4<0>
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO ncclCommInitRank comm 0x56556127bf10 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO ncclCommInitRank comm 0x562f40810030 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO ncclCommInitRank comm 0x55812d458df0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO ncclCommInitRank comm 0x55ed47dad9a0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO ncclCommInitRank comm 0x55e00ec7dc40 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO ncclCommInitRank comm 0x558cdebf44c0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO ncclCommInitRank comm 0x55e4c1493930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO ncclCommInitRank comm 0x55919a812fd0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x8ecf20a94c156f4c - Init START
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO NVLS multicast support is not available on dev 3
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO NVLS multicast support is not available on dev 2
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO NVLS multicast support is not available on dev 1
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO NVLS multicast support is not available on dev 4
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO NVLS multicast support is not available on dev 5
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO NVLS multicast support is not available on dev 0
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO NVLS multicast support is not available on dev 6
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO NVLS multicast support is not available on dev 7
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO comm 0x55e00ec7dc40 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO comm 0x56556127bf10 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO comm 0x55812d458df0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO comm 0x55ed47dad9a0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO comm 0x558cdebf44c0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO comm 0x55919a812fd0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO comm 0x55e4c1493930 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO comm 0x562f40810030 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO ncclCommInitRank comm 0x562f40810030 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10547:11290 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.76 (kernels 0.14, bootstrap 0.27, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO ncclCommInitRank comm 0x55e4c1493930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10548:11283 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.79 (kernels 0.16, bootstrap 0.29, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO ncclCommInitRank comm 0x55812d458df0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO ncclCommInitRank comm 0x55e00ec7dc40 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10551:11288 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.77 (kernels 0.14, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10552:11287 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.77 (kernels 0.14, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO ncclCommInitRank comm 0x56556127bf10 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10549:11289 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.76 (kernels 0.14, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO ncclCommInitRank comm 0x55ed47dad9a0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10550:11286 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.77 (kernels 0.15, bootstrap 0.28, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO ncclCommInitRank comm 0x55919a812fd0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
 tyler-a100-newimage-val:10546:11270 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.91 (kernels 0.18, bootstrap 0.39, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO ncclCommInitRank comm 0x558cdebf44c0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 commId 0x8ecf20a94c156f4c - Init COMPLETE
 tyler-a100-newimage-val:10553:11291 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.75 (kernels 0.16, bootstrap 0.25, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11313 [3] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10548:11311 [2] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10547:11312 [1] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10546:11315 [0] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10553:11314 [7] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10552:11310 [6] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10551:11308 [5] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10550:11309 [4] NCCL INFO Connected all rings
 Generating train split: 9981 examples [00:01, 7916.84 examples/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1972.79it/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1957.24it/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1893.76it/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1968.18it/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1937.07it/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1968.84it/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1891.13it/s]
 Data length calculation: 100%|██████████| 9981/9981 [00:05<00:00, 1843.73it/s]
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
 /opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.1128382682800293 seconds
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
 /opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.11071896553039551 seconds
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
 /opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.11228275299072266 seconds
 {
    "num_gpus": 8,
    "avg_sample_len": 608.8641418695521,
    "effective_batch_size": 3840,
    "max_batch_len_per_gpu": 10000,
    "packing_max_batch_len": 8118,
    "grad_accum": 36,
    "num_batches": 121,
    "avg_samples_per_batch": 82.48760330578513,
    "samples_per_gpu": 13,
    "timestamp": "2024-08-18T20:47:39.867974"
 }
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
 /opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.1144556999206543 seconds
 [2024-08-18 20:47:40,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
 [2024-08-18 20:47:40,239] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.10187482833862305 seconds
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 Detected CUDA files, patching ldflags
 Emitting ninja build file /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
 /opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
 If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
 Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 Using /var/mnt/inststg1/instructlab/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
 ninja: no work to do.
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.11427879333496094 seconds
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.10196876525878906 seconds
 Loading extension module fused_adam...
 Time to load fused_adam op: 0.10352158546447754 seconds
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Using network Socket
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO bootstrapSplit: comm 0x565562e66530 parent 0x56556127bf10 rank 3 nranks 8 color -934961569 key 3 prev 2 next 4 - DONE
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO bootstrapSplit: comm 0x55e4c30806d0 parent 0x55e4c1493930 rank 2 nranks 8 color -934961569 key 2 prev 1 next 3 - DONE
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO bootstrapSplit: comm 0x55ed499a84f0 parent 0x55ed47dad9a0 rank 4 nranks 8 color -934961569 key 4 prev 3 next 5 - DONE
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO bootstrapSplit: comm 0x562f423f9400 parent 0x562f40810030 rank 1 nranks 8 color -934961569 key 1 prev 0 next 2 - DONE
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO bootstrapSplit: comm 0x558ce0813360 parent 0x558cdebf44c0 rank 7 nranks 8 color -934961569 key 7 prev 6 next 0 - DONE
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO bootstrapSplit: comm 0x55919c3fd580 parent 0x55919a812fd0 rank 0 nranks 8 color -934961569 key 0 prev 7 next 1 - DONE
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO bootstrapSplit: comm 0x55812f05a9a0 parent 0x55812d458df0 rank 5 nranks 8 color -934961569 key 5 prev 4 next 6 - DONE
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO ncclCommSplit comm 0x55e4c30806d0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e4c1493930 color -934961569 key 2 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO ncclCommSplit comm 0x55ed499a84f0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x55ed47dad9a0 color -934961569 key 4 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO bootstrapSplit: comm 0x55e0108836a0 parent 0x55e00ec7dc40 rank 6 nranks 8 color -934961569 key 6 prev 5 next 7 - DONE
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO ncclCommSplit comm 0x558ce0813360 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558cdebf44c0 color -934961569 key 7 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO ncclCommSplit comm 0x55919c3fd580 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55919a812fd0 color -934961569 key 0 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO ncclCommSplit comm 0x562f423f9400 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x562f40810030 color -934961569 key 1 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO ncclCommSplit comm 0x55812f05a9a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x55812d458df0 color -934961569 key 5 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO ncclCommSplit comm 0x565562e66530 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x56556127bf10 color -934961569 key 3 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO ncclCommSplit comm 0x55e0108836a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x55e00ec7dc40 color -934961569 key 6 commId 0xc6ecd14a22a5889f - Init START
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffffffff
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO NVLS multicast support is not available on dev 3
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffffffff
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO NVLS multicast support is not available on dev 2
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO NVLS multicast support is not available on dev 1
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO NVLS multicast support is not available on dev 0
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Setting affinity for GPU 4 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO NVLS multicast support is not available on dev 4
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Setting affinity for GPU 6 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO NVLS multicast support is not available on dev 6
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Setting affinity for GPU 7 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO NVLS multicast support is not available on dev 7
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Setting affinity for GPU 5 to ffff,ffffff00,00000000
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO NVLS multicast support is not available on dev 5
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO comm 0x55e4c30806d0 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO comm 0x562f423f9400 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO comm 0x558ce0813360 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO comm 0x55919c3fd580 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO comm 0x55e0108836a0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO comm 0x55812f05a9a0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 00/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 01/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO comm 0x55ed499a84f0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO comm 0x565562e66530 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 02/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 03/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 04/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 05/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 06/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 07/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 08/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 09/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 10/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 11/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 12/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 13/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 14/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 15/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 16/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 17/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 18/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 19/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 20/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 21/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 22/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Channel 23/24 :    0   1   2   3   4   5   6   7
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO P2P Chunksize set to 524288
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO ncclCommSplit comm 0x558ce0813360 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId e080 parent 0x558cdebf44c0 color -934961569 key 7 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO ncclCommSplit comm 0x565562e66530 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId a040 parent 0x56556127bf10 color -934961569 key 3 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO ncclCommSplit comm 0x55812f05a9a0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId c060 parent 0x55812d458df0 color -934961569 key 5 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO ncclCommSplit comm 0x562f423f9400 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 8020 parent 0x562f40810030 color -934961569 key 1 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10553:11409 [7] NCCL INFO Init timings: rank 7 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.03)
 tyler-a100-newimage-val:10551:11410 [5] NCCL INFO Init timings: rank 5 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10547:11408 [1] NCCL INFO Init timings: rank 1 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.03)
 tyler-a100-newimage-val:10549:11415 [3] NCCL INFO Init timings: rank 3 nranks 8 total 0.34 (kernels 0.00, bootstrap 0.00, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO ncclCommSplit comm 0x55919c3fd580 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 8010 parent 0x55919a812fd0 color -934961569 key 0 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10546:11406 [0] NCCL INFO Init timings: rank 0 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO ncclCommSplit comm 0x55e0108836a0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId e070 parent 0x55e00ec7dc40 color -934961569 key 6 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO ncclCommSplit comm 0x55e4c30806d0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId a030 parent 0x55e4c1493930 color -934961569 key 2 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10552:11412 [6] NCCL INFO Init timings: rank 6 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10548:11407 [2] NCCL INFO Init timings: rank 2 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.02)
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO ncclCommSplit comm 0x55ed499a84f0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId c050 parent 0x55ed47dad9a0 color -934961569 key 4 commId 0xc6ecd14a22a5889f - Init COMPLETE
 tyler-a100-newimage-val:10550:11411 [4] NCCL INFO Init timings: rank 4 nranks 8 total 0.39 (kernels 0.00, bootstrap 0.05, allgathers 0.00, topo 0.26, graphs 0.00, connections 0.05, rest 0.03)
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 16/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 17/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 16/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 16/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 16/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 18/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 17/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 17/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 17/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 17/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 19/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 18/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 18/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 18/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 20/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 19/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 19/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 19/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 21/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 20/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 20/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 22/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 20/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 21/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 21/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Channel 23/0 : 5[5] -> 6[6] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 21/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 22/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 22/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[4] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 22/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Channel 23/0 : 7[7] -> 0[0] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/CUMEM/read
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Channel 23/0 : 6[6] -> 7[7] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Channel 23/0 : 4[4] -> 5[5] via P2P/CUMEM/read
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
 tyler-a100-newimage-val:10548:11433 [2] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10549:11438 [3] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10550:11439 [4] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10551:11432 [5] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10552:11436 [6] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10553:11435 [7] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10546:11437 [0] NCCL INFO Connected all rings
 tyler-a100-newimage-val:10547:11434 [1] NCCL INFO Connected all rings
 [2024-08-18 20:47:46,090] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
 [2024-08-18 20:47:46,091] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
 [2024-08-18 20:47:46,091] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
 [2024-08-18 20:47:46,104] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
 [2024-08-18 20:47:46,104] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
 [2024-08-18 20:47:46,104] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
 [2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
 [2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
 [2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
 [2024-08-18 20:47:46,104] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
 [2024-08-18 20:47:59,000] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:47:59,024] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:48:00,036] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:48:00,385] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:48:00,831] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:48:00,924] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:48:01,063] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 [2024-08-18 20:48:01,367] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
 [2024-08-18 20:48:01,367] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 17.26 GB         CA 17.26 GB         Max_CA 17 GB 
 [2024-08-18 20:48:01,368] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 33.57 GB, percent = 2.7%
 [2024-08-18 20:48:01,588] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
 [2024-08-18 20:48:01,589] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 18.83 GB         CA 20.4 GB         Max_CA 20 GB 
 [2024-08-18 20:48:01,589] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 33.58 GB, percent = 2.7%
 [2024-08-18 20:48:01,590] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
 [2024-08-18 20:48:01,807] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
 [2024-08-18 20:48:01,808] [INFO] [utils.py:782:see_memory_usage] MA 15.69 GB         Max_MA 15.69 GB         CA 20.4 GB         Max_CA 20 GB 
 [2024-08-18 20:48:01,808] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 33.58 GB, percent = 2.7%
 [2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
 [2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
 [2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f171cc77e10>
 [2024-08-18 20:48:01,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)]
 [2024-08-18 20:48:01,811] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
 }
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   amp_enabled .................. False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   amp_params ................... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
 }
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   bfloat16_enabled ............. True
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f171cc59a90>
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   communication_data_type ...... None
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   disable_allgather ............ False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   dump_state ................... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... None
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false, 
    "recompute_fwd_factor": 0.0, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
 }
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   fp16_auto_cast ............... None
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   fp16_enabled ................. False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   global_rank .................. 0
 [2024-08-18 20:48:01,812] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 36
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   graph_harvesting ............. False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 1
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   loss_scale ................... 1.0
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   memory_breakdown ............. False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
 }
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   optimizer_name ............... None
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   optimizer_params ............. None
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   pld_enabled .................. False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   pld_params ................... False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   prescale_gradients ........... False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   scheduler_name ............... None
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   scheduler_params ............. None
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   sparse_attention ............. None
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   steps_per_print .............. 1
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   train_batch_size ............. 3744
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  13
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   weight_quantization_config ... None
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   world_size ................... 8
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  False
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   zero_enabled ................. True
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
 [2024-08-18 20:48:01,813] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 2
 [2024-08-18 20:48:01,814] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 3.744000e+03, 
    "gradient_accumulation_steps": 36, 
    "train_micro_batch_size_per_gpu": 13, 
    "steps_per_print": 1, 
    "zero_optimization": {
        "stage": 2, 
        "offload_param": {
            "device": "none"
        }, 
        "offload_optimizer": {
            "device": "none"
        }
    }, 
    "bf16": {
        "enabled": true
    }, 
    "gradient_clipping": 1.0, 
    "prescale_gradients": false, 
    "wall_clock_breakdown": false
 }
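The three batch-size values in the config dump above are tied together by DeepSpeed's usual identity: `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. A minimal check against the numbers printed in this log (the helper name `effective_batch_size` is made up for illustration):

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all ranks."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values copied from the DeepSpeedEngine configuration printed in the log.
logged_train_batch_size = 3744
micro_batch_per_gpu = 13
grad_accum_steps = 36
world_size = 8

computed = effective_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size)
assert computed == logged_train_batch_size
print(computed)

# The command line asked for an effective batch size of 3840 in phase 2.
requested = 3840
samples_per_accum_step = micro_batch_per_gpu * world_size  # 104
print(requested // samples_per_accum_step)                 # whole accumulation steps that fit
```

The command requested `--phased-phase2-effective-batch-size 3840`, while the engine reports 3744. That is the largest multiple of 13 * 8 = 104 that does not exceed 3840, which would be consistent with the accumulation step count being rounded down, although the log itself does not show how the value was derived.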
 [2024-08-18 20:48:01,814] [WARNING] [engine.py:2749:load_checkpoint] Unable to find latest file at /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/ds_native/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.
 Epoch 0:   0%|          | 0/121 [00:00<?, ?it/s]
 total tokens: 7992 num samples: 18 num padding tokens: 2022 - rank: 6 max len: 444 min len: 240 avg len: 331.6666666666667 num_loss_counted_tokens: 3649
 total tokens: 7752 num samples: 17 num padding tokens: 1950 - rank: 6 max len: 456 min len: 267 avg len: 341.29411764705884 num_loss_counted_tokens: 3614
 total tokens: 7784 num samples: 14 num padding tokens: 1814 - rank: 6 max len: 556 min len: 342 avg len: 426.42857142857144 num_loss_counted_tokens: 3061
 total tokens: 8094 num samples: 19 num padding tokens: 2216 - rank: 6 max len: 426 min len: 215 avg len: 309.36842105263156 num_loss_counted_tokens: 3253
 total tokens: 7524 num samples: 12 num padding tokens: 2256 - rank: 6 max len: 627 min len: 286 avg len: 439.0 num_loss_counted_tokens: 3465
 total tokens: 8056 num samples: 19 num padding tokens: 1799 - rank: 6 max len: 424 min len: 237 avg len: 329.3157894736842 num_loss_counted_tokens: 3512
 total tokens: 7803 num samples: 17 num padding tokens: 1146 - rank: 6 max len: 459 min len: 281 avg len: 391.5882352941176 num_loss_counted_tokens: 4201
 total tokens: 6561 num samples: 3 num padding tokens: 1099 - rank: 1 max len: 2187 min len: 1316 avg len: 1820.6666666666667 num_loss_counted_tokens: 687
 total tokens: 8118 num samples: 3 num padding tokens: 403 - rank: 1 max len: 2706 min len: 2307 avg len: 2571.6666666666665 num_loss_counted_tokens: 336
 total tokens: 7851 num samples: 3 num padding tokens: 789 - rank: 1 max len: 2617 min len: 1940 avg len: 2354.0 num_loss_counted_tokens: 768
 total tokens: 8096 num samples: 16 num padding tokens: 1551 - rank: 6 max len: 506 min len: 312 avg len: 409.0625 num_loss_counted_tokens: 3732
 total tokens: 7089 num samples: 3 num padding tokens: 249 - rank: 1 max len: 2363 min len: 2182 avg len: 2280.0 num_loss_counted_tokens: 237
 total tokens: 7629 num samples: 3 num padding tokens: 370 - rank: 1 max len: 2543 min len: 2326 avg len: 2419.6666666666665 num_loss_counted_tokens: 668
 total tokens: 7752 num samples: 24 num padding tokens: 3285 - rank: 7 max len: 323 min len: 83 avg len: 186.125 num_loss_counted_tokens: 2064
 total tokens: 7776 num samples: 16 num padding tokens: 2005 - rank: 6 max len: 486 min len: 266 avg len: 360.6875 num_loss_counted_tokens: 3202
 total tokens: 7794 num samples: 9 num padding tokens: 1201 - rank: 4 max len: 866 min len: 628 avg len: 732.5555555555555 num_loss_counted_tokens: 3797
 total tokens: 8075 num samples: 19 num padding tokens: 2226 - rank: 6 max len: 425 min len: 232 avg len: 307.8421052631579 num_loss_counted_tokens: 3319
 total tokens: 7872 num samples: 16 num padding tokens: 1566 - rank: 6 max len: 492 min len: 271 avg len: 394.125 num_loss_counted_tokens: 3233
 total tokens: 5934 num samples: 23 num padding tokens: 2656 - rank: 7 max len: 258 min len: 72 avg len: 142.52173913043478 num_loss_counted_tokens: 1182
 total tokens: 8024 num samples: 8 num padding tokens: 985 - rank: 4 max len: 1003 min len: 741 avg len: 879.875 num_loss_counted_tokens: 5064
 total tokens: 7840 num samples: 8 num padding tokens: 691 - rank: 4 max len: 980 min len: 778 avg len: 893.625 num_loss_counted_tokens: 4008
 total tokens: 8016 num samples: 3 num padding tokens: 730 - rank: 1 max len: 2672 min len: 2280 avg len: 2428.6666666666665 num_loss_counted_tokens: 336
 total tokens: 8032 num samples: 8 num padding tokens: 791 - rank: 4 max len: 1004 min len: 846 avg len: 905.125 num_loss_counted_tokens: 6538
 total tokens: 8094 num samples: 19 num padding tokens: 1536 - rank: 6 max len: 426 min len: 259 avg len: 345.1578947368421 num_loss_counted_tokens: 3814
 total tokens: 7476 num samples: 4 num padding tokens: 1909 - rank: 1 max len: 1869 min len: 1081 avg len: 1391.75 num_loss_counted_tokens: 2210
 total tokens: 6380 num samples: 29 num padding tokens: 2359 - rank: 7 max len: 220 min len: 77 avg len: 138.6551724137931 num_loss_counted_tokens: 1514
 total tokens: 7328 num samples: 32 num padding tokens: 2417 - rank: 7 max len: 229 min len: 83 avg len: 153.46875 num_loss_counted_tokens: 2050
 total tokens: 6348 num samples: 23 num padding tokens: 1924 - rank: 7 max len: 276 min len: 79 avg len: 192.34782608695653 num_loss_counted_tokens: 1895
 total tokens: 7460 num samples: 5 num padding tokens: 807 - rank: 1 max len: 1492 min len: 1178 avg len: 1330.6 num_loss_counted_tokens: 2708
 total tokens: 7890 num samples: 30 num padding tokens: 2915 - rank: 7 max len: 263 min len: 77 avg len: 165.83333333333334 num_loss_counted_tokens: 1979
 total tokens: 7704 num samples: 9 num padding tokens: 736 - rank: 4 max len: 856 min len: 719 avg len: 774.2222222222222 num_loss_counted_tokens: 2859
 total tokens: 7920 num samples: 10 num padding tokens: 665 - rank: 4 max len: 792 min len: 622 avg len: 725.5 num_loss_counted_tokens: 4725
 total tokens: 7812 num samples: 9 num padding tokens: 769 - rank: 4 max len: 868 min len: 744 avg len: 782.5555555555555 num_loss_counted_tokens: 5045
 total tokens: 6171 num samples: 3 num padding tokens: 752 - rank: 1 max len: 2057 min len: 1416 avg len: 1806.3333333333333 num_loss_counted_tokens: 750
 total tokens: 7432 num samples: 4 num padding tokens: 296 - rank: 1 max len: 1858 min len: 1710 avg len: 1784.0 num_loss_counted_tokens: 829
 total tokens: 5684 num samples: 2 num padding tokens: 688 - rank: 1 max len: 2842 min len: 2154 avg len: 2498.0 num_loss_counted_tokens: 182
 total tokens: 7992 num samples: 18 num padding tokens: 1629 - rank: 6 max len: 444 min len: 270 avg len: 353.5 num_loss_counted_tokens: 3206
 total tokens: 7625 num samples: 25 num padding tokens: 3089 - rank: 7 max len: 305 min len: 83 avg len: 181.44 num_loss_counted_tokens: 2298
 total tokens: 7812 num samples: 31 num padding tokens: 2673 - rank: 7 max len: 252 min len: 81 avg len: 165.7741935483871 num_loss_counted_tokens: 2047
 total tokens: 7871 num samples: 17 num padding tokens: 2098 - rank: 6 max len: 463 min len: 248 avg len: 339.5882352941176 num_loss_counted_tokens: 3582
 total tokens: 8060 num samples: 20 num padding tokens: 1971 - rank: 6 max len: 403 min len: 249 avg len: 304.45 num_loss_counted_tokens: 3294
 total tokens: 7288 num samples: 4 num padding tokens: 745 - rank: 1 max len: 1822 min len: 1504 avg len: 1635.75 num_loss_counted_tokens: 956
 total tokens: 6479 num samples: 31 num padding tokens: 2352 - rank: 7 max len: 209 min len: 79 avg len: 133.1290322580645 num_loss_counted_tokens: 1439
 total tokens: 6892 num samples: 4 num padding tokens: 402 - rank: 1 max len: 1723 min len: 1368 avg len: 1622.5 num_loss_counted_tokens: 752
 total tokens: 6423 num samples: 3 num padding tokens: 305 - rank: 1 max len: 2141 min len: 1976 avg len: 2039.3333333333333 num_loss_counted_tokens: 448
 total tokens: 7634 num samples: 11 num padding tokens: 386 - rank: 4 max len: 694 min len: 621 avg len: 658.9090909090909 num_loss_counted_tokens: 5313
 total tokens: 8070 num samples: 10 num padding tokens: 726 - rank: 4 max len: 807 min len: 627 avg len: 734.4 num_loss_counted_tokens: 5736
 total tokens: 8096 num samples: 11 num padding tokens: 777 - rank: 4 max len: 736 min len: 594 avg len: 665.3636363636364 num_loss_counted_tokens: 4205
 total tokens: 7560 num samples: 10 num padding tokens: 629 - rank: 4 max len: 756 min len: 626 avg len: 693.1 num_loss_counted_tokens: 4492
 total tokens: 7998 num samples: 31 num padding tokens: 3034 - rank: 7 max len: 258 min len: 74 avg len: 160.1290322580645 num_loss_counted_tokens: 2091
 total tokens: 8090 num samples: 10 num padding tokens: 409 - rank: 4 max len: 809 min len: 692 avg len: 768.1 num_loss_counted_tokens: 5031
 total tokens: 7740 num samples: 18 num padding tokens: 1725 - rank: 6 max len: 430 min len: 265 avg len: 334.1666666666667 num_loss_counted_tokens: 3515
 total tokens: 8080 num samples: 5 num padding tokens: 765 - rank: 1 max len: 1616 min len: 1290 avg len: 1463.0 num_loss_counted_tokens: 2912
 total tokens: 7904 num samples: 32 num padding tokens: 2758 - rank: 7 max len: 247 min len: 84 avg len: 160.8125 num_loss_counted_tokens: 2044
 total tokens: 7461 num samples: 9 num padding tokens: 766 - rank: 4 max len: 829 min len: 675 avg len: 743.8888888888889 num_loss_counted_tokens: 4179
 total tokens: 6168 num samples: 2 num padding tokens: 447 - rank: 1 max len: 3084 min len: 2637 avg len: 2860.5 num_loss_counted_tokens: 177
 total tokens: 7395 num samples: 29 num padding tokens: 2459 - rank: 7 max len: 255 min len: 81 avg len: 170.20689655172413 num_loss_counted_tokens: 1981
 total tokens: 6830 num samples: 5 num padding tokens: 395 - rank: 1 max len: 1366 min len: 1223 avg len: 1287.0 num_loss_counted_tokens: 2516
 total tokens: 7786 num samples: 17 num padding tokens: 1200 - rank: 6 max len: 458 min len: 290 avg len: 387.4117647058824 num_loss_counted_tokens: 3600
 total tokens: 6888 num samples: 24 num padding tokens: 2293 - rank: 7 max len: 287 min len: 81 avg len: 191.45833333333334 num_loss_counted_tokens: 2153
 total tokens: 7627 num samples: 29 num padding tokens: 2769 - rank: 7 max len: 263 min len: 78 avg len: 167.51724137931035 num_loss_counted_tokens: 2159
 total tokens: 5475 num samples: 25 num padding tokens: 1896 - rank: 7 max len: 219 min len: 81 avg len: 143.16 num_loss_counted_tokens: 1372
 total tokens: 6916 num samples: 28 num padding tokens: 2558 - rank: 7 max len: 247 min len: 77 avg len: 155.64285714285714 num_loss_counted_tokens: 1696
 total tokens: 7304 num samples: 8 num padding tokens: 673 - rank: 4 max len: 913 min len: 718 avg len: 828.875 num_loss_counted_tokens: 4366
 total tokens: 7950 num samples: 10 num padding tokens: 340 - rank: 4 max len: 795 min len: 724 avg len: 761.0 num_loss_counted_tokens: 5963
 total tokens: 7964 num samples: 11 num padding tokens: 504 - rank: 4 max len: 724 min len: 630 avg len: 678.1818181818181 num_loss_counted_tokens: 4558
 total tokens: 7410 num samples: 30 num padding tokens: 2942 - rank: 7 max len: 247 min len: 75 avg len: 148.93333333333334 num_loss_counted_tokens: 1669
 total tokens: 7630 num samples: 7 num padding tokens: 606 - rank: 4 max len: 1090 min len: 831 avg len: 1003.4285714285714 num_loss_counted_tokens: 3943
 total tokens: 7368 num samples: 4 num padding tokens: 761 - rank: 2 max len: 1842 min len: 1539 avg len: 1651.75 num_loss_counted_tokens: 2748
 total tokens: 7596 num samples: 6 num padding tokens: 1064 - rank: 2 max len: 1266 min len: 985 avg len: 1088.6666666666667 num_loss_counted_tokens: 3149
 total tokens: 7623 num samples: 11 num padding tokens: 1032 - rank: 5 max len: 693 min len: 517 avg len: 599.1818181818181 num_loss_counted_tokens: 4483
 total tokens: 7410 num samples: 10 num padding tokens: 803 - rank: 5 max len: 741 min len: 563 avg len: 660.7 num_loss_counted_tokens: 5017
 total tokens: 7969 num samples: 13 num padding tokens: 1180 - rank: 5 max len: 613 min len: 448 avg len: 522.2307692307693 num_loss_counted_tokens: 4157
 total tokens: 7596 num samples: 9 num padding tokens: 893 - rank: 5 max len: 844 min len: 632 avg len: 744.7777777777778 num_loss_counted_tokens: 4339
 total tokens: 7044 num samples: 4 num padding tokens: 776 - rank: 2 max len: 1761 min len: 1377 avg len: 1567.0 num_loss_counted_tokens: 1196
 total tokens: 7678 num samples: 11 num padding tokens: 1417 - rank: 5 max len: 698 min len: 476 avg len: 569.1818181818181 num_loss_counted_tokens: 4381
 total tokens: 7576 num samples: 4 num padding tokens: 2237 - rank: 2 max len: 1894 min len: 1097 avg len: 1334.75 num_loss_counted_tokens: 2139
 total tokens: 7656 num samples: 11 num padding tokens: 1011 - rank: 5 max len: 696 min len: 496 avg len: 604.0909090909091 num_loss_counted_tokens: 4175
 total tokens: 7204 num samples: 4 num padding tokens: 443 - rank: 2 max len: 1801 min len: 1576 avg len: 1690.25 num_loss_counted_tokens: 2660
 total tokens: 8073 num samples: 13 num padding tokens: 925 - rank: 5 max len: 621 min len: 461 avg len: 549.8461538461538 num_loss_counted_tokens: 5127
 total tokens: 8076 num samples: 4 num padding tokens: 1351 - rank: 2 max len: 2019 min len: 1214 avg len: 1681.25 num_loss_counted_tokens: 936
 total tokens: 7032 num samples: 6 num padding tokens: 478 - rank: 2 max len: 1172 min len: 1016 avg len: 1092.3333333333333 num_loss_counted_tokens: 3008
 total tokens: 7917 num samples: 13 num padding tokens: 1158 - rank: 5 max len: 609 min len: 429 avg len: 519.9230769230769 num_loss_counted_tokens: 4603
 total tokens: 7329 num samples: 7 num padding tokens: 546 - rank: 2 max len: 1047 min len: 911 avg len: 969.0 num_loss_counted_tokens: 3492
 total tokens: 7020 num samples: 5 num padding tokens: 506 - rank: 2 max len: 1404 min len: 1134 avg len: 1302.8 num_loss_counted_tokens: 3834
 total tokens: 7692 num samples: 6 num padding tokens: 974 - rank: 2 max len: 1282 min len: 958 avg len: 1119.6666666666667 num_loss_counted_tokens: 3913
 total tokens: 7667 num samples: 11 num padding tokens: 908 - rank: 5 max len: 697 min len: 509 avg len: 614.4545454545455 num_loss_counted_tokens: 3445
 total tokens: 8099 num samples: 13 num padding tokens: 1144 - rank: 5 max len: 623 min len: 461 avg len: 535.0 num_loss_counted_tokens: 4390
 total tokens: 5862 num samples: 2 num padding tokens: 136 - rank: 0 max len: 2931 min len: 2795 avg len: 2863.0 num_loss_counted_tokens: 203
 total tokens: 5944 num samples: 2 num padding tokens: 101 - rank: 0 max len: 2972 min len: 2871 avg len: 2921.5 num_loss_counted_tokens: 163
 total tokens: 6814 num samples: 2 num padding tokens: 689 - rank: 0 max len: 3407 min len: 2718 avg len: 3062.5 num_loss_counted_tokens: 1104
 total tokens: 7872 num samples: 12 num padding tokens: 1453 - rank: 5 max len: 656 min len: 432 avg len: 534.9166666666666 num_loss_counted_tokens: 5009
 total tokens: 5966 num samples: 2 num padding tokens: 300 - rank: 0 max len: 2983 min len: 2683 avg len: 2833.0 num_loss_counted_tokens: 223
 total tokens: 7100 num samples: 5 num padding tokens: 1259 - rank: 3 max len: 1420 min len: 1008 avg len: 1168.2 num_loss_counted_tokens: 3506
 total tokens: 7858 num samples: 2 num padding tokens: 1075 - rank: 0 max len: 3929 min len: 2854 avg len: 3391.5 num_loss_counted_tokens: 419
 total tokens: 6586 num samples: 2 num padding tokens: 534 - rank: 0 max len: 3293 min len: 2759 avg len: 3026.0 num_loss_counted_tokens: 208
 total tokens: 6802 num samples: 2 num padding tokens: 582 - rank: 0 max len: 3401 min len: 2819 avg len: 3110.0 num_loss_counted_tokens: 197
 total tokens: 8076 num samples: 12 num padding tokens: 941 - rank: 5 max len: 673 min len: 519 avg len: 594.5833333333334 num_loss_counted_tokens: 4341
 total tokens: 8021 num samples: 13 num padding tokens: 958 - rank: 5 max len: 617 min len: 492 avg len: 543.3076923076923 num_loss_counted_tokens: 5546
 total tokens: 7404 num samples: 6 num padding tokens: 691 - rank: 2 max len: 1234 min len: 1037 avg len: 1118.8333333333333 num_loss_counted_tokens: 4452
 total tokens: 6990 num samples: 6 num padding tokens: 455 - rank: 3 max len: 1165 min len: 1010 avg len: 1089.1666666666667 num_loss_counted_tokens: 2326
 total tokens: 6835 num samples: 5 num padding tokens: 883 - rank: 2 max len: 1367 min len: 1028 avg len: 1190.4 num_loss_counted_tokens: 1769
 total tokens: 6480 num samples: 3 num padding tokens: 1330 - rank: 2 max len: 2160 min len: 1455 avg len: 1716.6666666666667 num_loss_counted_tokens: 3402
 total tokens: 7852 num samples: 13 num padding tokens: 1008 - rank: 5 max len: 604 min len: 462 avg len: 526.4615384615385 num_loss_counted_tokens: 4343
 total tokens: 6516 num samples: 4 num padding tokens: 643 - rank: 2 max len: 1629 min len: 1320 avg len: 1468.25 num_loss_counted_tokens: 3036
 total tokens: 7765 num samples: 5 num padding tokens: 937 - rank: 2 max len: 1553 min len: 1107 avg len: 1365.6 num_loss_counted_tokens: 2840
 total tokens: 7014 num samples: 6 num padding tokens: 643 - rank: 2 max len: 1169 min len: 980 avg len: 1061.8333333333333 num_loss_counted_tokens: 3641
 total tokens: 8106 num samples: 7 num padding tokens: 448 - rank: 2 max len: 1158 min len: 1032 avg len: 1094.0 num_loss_counted_tokens: 3634
 total tokens: 7014 num samples: 3 num padding tokens: 944 - rank: 0 max len: 2338 min len: 1749 avg len: 2023.3333333333333 num_loss_counted_tokens: 1782
 total tokens: 7696 num samples: 8 num padding tokens: 623 - rank: 3 max len: 962 min len: 803 avg len: 884.125 num_loss_counted_tokens: 4778
 total tokens: 7618 num samples: 13 num padding tokens: 886 - rank: 5 max len: 586 min len: 437 avg len: 517.8461538461538 num_loss_counted_tokens: 3929
 total tokens: 7155 num samples: 5 num padding tokens: 513 - rank: 3 max len: 1431 min len: 1168 avg len: 1328.4 num_loss_counted_tokens: 3100
 total tokens: 7350 num samples: 7 num padding tokens: 751 - rank: 3 max len: 1050 min len: 873 avg len: 942.7142857142857 num_loss_counted_tokens: 4426
 total tokens: 7744 num samples: 8 num padding tokens: 283 - rank: 3 max len: 968 min len: 869 avg len: 932.625 num_loss_counted_tokens: 4872
 total tokens: 5448 num samples: 2 num padding tokens: 825 - rank: 0 max len: 2724 min len: 1899 avg len: 2311.5 num_loss_counted_tokens: 314
 total tokens: 7836 num samples: 6 num padding tokens: 1317 - rank: 3 max len: 1306 min len: 968 avg len: 1086.5 num_loss_counted_tokens: 4937
 total tokens: 7854 num samples: 11 num padding tokens: 871 - rank: 5 max len: 714 min len: 532 avg len: 634.8181818181819 num_loss_counted_tokens: 4452
 total tokens: 7788 num samples: 3 num padding tokens: 965 - rank: 0 max len: 2596 min len: 1888 avg len: 2274.3333333333335 num_loss_counted_tokens: 304
 total tokens: 5614 num samples: 2 num padding tokens: 364 - rank: 0 max len: 2807 min len: 2443 avg len: 2625.0 num_loss_counted_tokens: 241
 total tokens: 7086 num samples: 3 num padding tokens: 1107 - rank: 0 max len: 2362 min len: 1776 avg len: 1993.0 num_loss_counted_tokens: 301
 total tokens: 8037 num samples: 9 num padding tokens: 856 - rank: 3 max len: 893 min len: 698 avg len: 797.8888888888889 num_loss_counted_tokens: 6093
 total tokens: 5792 num samples: 2 num padding tokens: 18 - rank: 0 max len: 2896 min len: 2878 avg len: 2887.0 num_loss_counted_tokens: 176
 total tokens: 6306 num samples: 2 num padding tokens: 290 - rank: 0 max len: 3153 min len: 2863 avg len: 3008.0 num_loss_counted_tokens: 181
 total tokens: 7942 num samples: 11 num padding tokens: 1739 - rank: 5 max len: 722 min len: 430 avg len: 563.9090909090909 num_loss_counted_tokens: 2898
 total tokens: 7796 num samples: 4 num padding tokens: 715 - rank: 0 max len: 1949 min len: 1412 avg len: 1770.25 num_loss_counted_tokens: 1543
 total tokens: 7984 num samples: 8 num padding tokens: 514 - rank: 3 max len: 998 min len: 886 avg len: 933.75 num_loss_counted_tokens: 3889
 total tokens: 7248 num samples: 8 num padding tokens: 747 - rank: 3 max len: 906 min len: 734 avg len: 812.625 num_loss_counted_tokens: 5216
 total tokens: 6489 num samples: 3 num padding tokens: 267 - rank: 0 max len: 2163 min len: 1989 avg len: 2074.0 num_loss_counted_tokens: 774
 total tokens: 8000 num samples: 8 num padding tokens: 939 - rank: 3 max len: 1000 min len: 809 avg len: 882.625 num_loss_counted_tokens: 4397
 total tokens: 7146 num samples: 6 num padding tokens: 436 - rank: 3 max len: 1191 min len: 1062 avg len: 1118.3333333333333 num_loss_counted_tokens: 3632
 total tokens: 7693 num samples: 7 num padding tokens: 1196 - rank: 3 max len: 1099 min len: 833 avg len: 928.1428571428571 num_loss_counted_tokens: 4428
 total tokens: 7592 num samples: 8 num padding tokens: 548 - rank: 3 max len: 949 min len: 799 avg len: 880.5 num_loss_counted_tokens: 5791
 total tokens: 8064 num samples: 8 num padding tokens: 1193 - rank: 3 max len: 1008 min len: 741 avg len: 858.875 num_loss_counted_tokens: 5624
 total tokens: 7960 num samples: 8 num padding tokens: 519 - rank: 3 max len: 995 min len: 797 avg len: 930.125 num_loss_counted_tokens: 4667
 total tokens: 7140 num samples: 5 num padding tokens: 805 - rank: 3 max len: 1428 min len: 1157 avg len: 1267.0 num_loss_counted_tokens: 2481
 total tokens: 7376 num samples: 2 num padding tokens: 122 - rank: 0 max len: 3688 min len: 3566 avg len: 3627.0 num_loss_counted_tokens: 334
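The per-micro-batch statistics lines above follow a fixed layout, and in this log `total tokens` equals `num samples * max len` (each sample padded to the longest in the batch), with `num padding tokens` making up the difference from the real lengths. A minimal parsing sketch, assuming that layout holds throughout; `STATS_RE` and `parse_stats` are illustrative names, not part of the training library:

```python
import re

# Pattern for one statistics record; finditer also copes with two records on one line.
STATS_RE = re.compile(
    r"total tokens: (?P<total>\d+) num samples: (?P<samples>\d+) "
    r"num padding tokens: (?P<padding>\d+) - rank: (?P<rank>\d+) "
    r"max len: (?P<max_len>\d+) min len: (?P<min_len>\d+) "
    r"avg len: (?P<avg_len>[\d.]+) num_loss_counted_tokens: (?P<loss_tokens>\d+)"
)


def parse_stats(text: str) -> list[dict]:
    """Return one dict per statistics record found in the text."""
    rows = []
    for match in STATS_RE.finditer(text):
        row = {
            key: float(value) if key == "avg_len" else int(value)
            for key, value in match.groupdict().items()
        }
        rows.append(row)
    return rows


# Two records copied from the log, joined on a single line.
sample = (
    "total tokens: 7784 num samples: 14 num padding tokens: 1814 - rank: 6 "
    "max len: 556 min len: 342 avg len: 426.42857142857144 num_loss_counted_tokens: 3061 "
    "total tokens: 8094 num samples: 19 num padding tokens: 2216 - rank: 6 "
    "max len: 426 min len: 215 avg len: 309.36842105263156 num_loss_counted_tokens: 3253"
)

rows = parse_stats(sample)
for row in rows:
    real_tokens = round(row["samples"] * row["avg_len"])
    # Padded size is samples * longest sample; padding is the remainder.
    assert row["total"] == row["samples"] * row["max_len"]
    assert row["padding"] == row["total"] - real_tokens
print(len(rows))
```

Summing `padding` and `total` per rank from the parsed rows gives a rough padding ratio, which may help when tuning `--max-batch-len`.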
 Per-token loss scaled by world size: 2.2650606297247577e-06
 Per-token loss scaled by world size: 0.0005290773115120828
 Per-token loss scaled by world size: 0.00031778833363205194
 Per-token loss scaled by world size: 0.0002596491831354797
 Per-token loss scaled by world size: 0.00032042598468251526
 Per-token loss scaled by world size: 0.00037021367461420596
 Per-token loss scaled by world size: 3.6662072488979902e-06
 Epoch: 0, Step: 1, Rank: 3, loss = 0.8319301605224609
 Epoch: 0, Step: 1, Rank: 2, loss = 0.6797291040420532Epoch: 0, Step: 1, Rank: 5, loss = 1.3850582838058472

 Epoch: 0, Step: 1, Rank: 7, loss = 0.8388351798057556Epoch: 0, Step: 1, Rank: 1, loss = 0.005929645616561174


 Epoch: 0, Step: 1, Rank: 4, loss = 0.9691731333732605
 Epoch: 0, Step: 1, Rank: 0, loss = 0.009597672149538994
 Per-token loss scaled by world size: 0.0004498241178225726
 Epoch: 0, Step: 1, Rank: 6, loss = 1.1775833368301392
 Epoch 0:   1%|          | 1/121 [00:03<06:45,  3.38s/it]{
    "epoch": 0,
    "step": 1,
    "rank": 0,
    "loss": 0.009597672149538994,
    "overall_throughput": 35.709783908823596,
    "lr": 0.0,
    "cuda_mem_allocated": 17.990560054779053,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20943,
    "batch_size": 70,
    "total_loss": 0.737229585647583,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:05.199538"
 }
 total tokens: 6234 num samples: 3 num padding tokens: 200 - rank: 2 max len: 2078 min len: 1890 avg len: 2011.3333333333333 num_loss_counted_tokens: 1341
 total tokens: 5714 num samples: 2 num padding tokens: 5 - rank: 0 max len: 2857 min len: 2852 avg len: 2854.5 num_loss_counted_tokens: 145
 total tokens: 7392 num samples: 8 num padding tokens: 1326 - rank: 5 max len: 924 min len: 561 avg len: 758.25 num_loss_counted_tokens: 3044
 total tokens: 7340 num samples: 4 num padding tokens: 699 - rank: 3 max len: 1835 min len: 1455 avg len: 1660.25 num_loss_counted_tokens: 2623
 total tokens: 7627 num samples: 29 num padding tokens: 3037 - rank: 7 max len: 263 min len: 77 avg len: 158.27586206896552 num_loss_counted_tokens: 1957
 total tokens: 7242 num samples: 3 num padding tokens: 171 - rank: 1 max len: 2414 min len: 2311 avg len: 2357.0 num_loss_counted_tokens: 254
 total tokens: 7125 num samples: 5 num padding tokens: 1031 - rank: 4 max len: 1425 min len: 945 avg len: 1218.8 num_loss_counted_tokens: 3947
 total tokens: 8025 num samples: 15 num padding tokens: 2508 - rank: 6 max len: 535 min len: 266 avg len: 367.8 num_loss_counted_tokens: 3113
 Per-token loss scaled by world size: 0.00031914791907183826
 Per-token loss scaled by world size: 0.0003141801571473479
 Per-token loss scaled by world size: 0.0003882426244672388
 Per-token loss scaled by world size: 0.00020227984350640327
 Per-token loss scaled by world size: 5.1077040552627295e-05
 Per-token loss scaled by world size: 5.200964369578287e-05
 Per-token loss scaled by world size: 0.0002763153170235455
 Epoch: 0, Step: 2, Rank: 4, loss = 0.963906466960907
 Epoch: 0, Step: 2, Rank: 3, loss = 0.9489026069641113
 Epoch: 0, Step: 2, Rank: 2, loss = 0.6109356880187988
 Epoch: 0, Step: 2, Rank: 5, loss = 1.1725897789001465
 Epoch: 0, Step: 2, Rank: 0, loss = 0.15426543354988098
 Epoch: 0, Step: 2, Rank: 7, loss = 0.8345413208007812
 Per-token loss scaled by world size: 0.0004248657787684351
 Epoch: 0, Step: 2, Rank: 1, loss = 0.15708212554454803
 Epoch: 0, Step: 2, Rank: 6, loss = 1.2832008600234985
 Epoch 0:   2%|▏         | 2/121 [00:05<05:38,  2.85s/it] total tokens: 7986 num samples: 11 num padding tokens: 734 - rank: 4 max len: 726 min len: 605 avg len: 659.2727272727273 num_loss_counted_tokens: 4132
 {
    "epoch": 0,
    "step": 2,
    "rank": 0,
    "loss": 0.15426543354988098,
    "overall_throughput": 43.2007637745835,
    "lr": 0.0,
    "cuda_mem_allocated": 18.104323863983154,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24162,
    "batch_size": 93,
    "total_loss": 0.7656780481338501,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:07.698724"
 }
 total tokens: 7735 num samples: 17 num padding tokens: 1736 - rank: 6 max len: 455 min len: 245 avg len: 352.88235294117646 num_loss_counted_tokens: 3396
 total tokens: 5966 num samples: 2 num padding tokens: 555 - rank: 0 max len: 2983 min len: 2428 avg len: 2705.5 num_loss_counted_tokens: 181
 total tokens: 7852 num samples: 13 num padding tokens: 1205 - rank: 5 max len: 604 min len: 464 avg len: 511.3076923076923 num_loss_counted_tokens: 4550
 total tokens: 7928 num samples: 4 num padding tokens: 535 - rank: 1 max len: 1982 min len: 1721 avg len: 1848.25 num_loss_counted_tokens: 2525
 total tokens: 7836 num samples: 6 num padding tokens: 1664 - rank: 2 max len: 1306 min len: 926 avg len: 1028.6666666666667 num_loss_counted_tokens: 3620
 total tokens: 7821 num samples: 9 num padding tokens: 747 - rank: 3 max len: 869 min len: 729 avg len: 786.0 num_loss_counted_tokens: 5513
 total tokens: 4598 num samples: 19 num padding tokens: 1719 - rank: 7 max len: 242 min len: 75 avg len: 151.52631578947367 num_loss_counted_tokens: 1068
 Per-token loss scaled by world size: 0.00018360439571551979
 Per-token loss scaled by world size: 0.0003279669035691768
 Per-token loss scaled by world size: 2.2890385480422992e-06
 Per-token loss scaled by world size: 6.500220479210839e-05
 Per-token loss scaled by world size: 0.00032116335933096707
 Per-token loss scaled by world size: 0.00036416525836102664
 Per-token loss scaled by world size: 0.0005080102127976716
 Epoch: 0, Step: 3, Rank: 5, loss = 0.84148108959198
 Epoch: 0, Step: 3, Rank: 3, loss = 0.47108298540115356
 Epoch: 0, Step: 3, Rank: 1, loss = 0.16677941381931305
 Epoch: 0, Step: 3, Rank: 4, loss = 0.8240249156951904
 Epoch: 0, Step: 3, Rank: 0, loss = 0.005873100366443396
 Epoch: 0, Step: 3, Rank: 6, loss = 1.3034272193908691
 Per-token loss scaled by world size: 7.88167308201082e-05
 Epoch: 0, Step: 3, Rank: 7, loss = 0.9343570470809937
 Epoch: 0, Step: 3, Rank: 2, loss = 0.2022240310907364
 Epoch 0:   2%|▏         | 3/121 [00:08<05:19,  2.70s/it]{
    "epoch": 0,
    "step": 3,
    "rank": 0,
    "loss": 0.005873100366443396,
    "overall_throughput": 42.42993987932287,
    "lr": 0.0,
    "cuda_mem_allocated": 18.00035810470581,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20526,
    "batch_size": 75,
    "total_loss": 0.5936562418937683,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:10.210211"
 }
 total tokens: 6324 num samples: 2 num padding tokens: 135 - rank: 1 max len: 3162 min len: 3027 avg len: 3094.5 num_loss_counted_tokens: 175
 total tokens: 7644 num samples: 7 num padding tokens: 892 - rank: 4 max len: 1092 min len: 825 avg len: 964.5714285714286 num_loss_counted_tokens: 4646
 total tokens: 7192 num samples: 2 num padding tokens: 318 - rank: 0 max len: 3596 min len: 3278 avg len: 3437.0 num_loss_counted_tokens: 213
 total tokens: 7620 num samples: 10 num padding tokens: 1177 - rank: 5 max len: 762 min len: 501 avg len: 644.3 num_loss_counted_tokens: 4818
 total tokens: 7776 num samples: 16 num padding tokens: 1778 - rank: 6 max len: 486 min len: 269 avg len: 374.875 num_loss_counted_tokens: 3462
 total tokens: 8060 num samples: 31 num padding tokens: 2946 - rank: 7 max len: 260 min len: 79 avg len: 164.96774193548387 num_loss_counted_tokens: 2135
 total tokens: 8095 num samples: 5 num padding tokens: 1374 - rank: 3 max len: 1619 min len: 1120 avg len: 1344.2 num_loss_counted_tokens: 2765
 total tokens: 7320 num samples: 3 num padding tokens: 432 - rank: 2 max len: 2440 min len: 2027 avg len: 2296.0 num_loss_counted_tokens: 857
 Per-token loss scaled by world size: 0.0004511360311880708
 Per-token loss scaled by world size: 0.0004869260301347822
 Per-token loss scaled by world size: 4.640718543669209e-05
 Per-token loss scaled by world size: 8.355799946002662e-05
 Per-token loss scaled by world size: 6.561249392689206e-06
 Per-token loss scaled by world size: 0.00017396389739587903
 Epoch: 0, Step: 4, Rank: 1, loss = 0.12441766262054443
 Epoch: 0, Step: 4, Rank: 6, loss = 1.2094956636428833
 Epoch: 0, Step: 4, Rank: 5, loss = 1.3054486513137817
 Epoch: 0, Step: 4, Rank: 2, loss = 0.22401900589466095
 Epoch: 0, Step: 4, Rank: 0, loss = 0.017590709030628204
 Per-token loss scaled by world size: 0.00043431558879092336
 Epoch: 0, Step: 4, Rank: 7, loss = 0.4663971960544586
 Per-token loss scaled by world size: 0.00029926959541626275
 Epoch: 0, Step: 4, Rank: 4, loss = 1.1644001007080078
 Epoch: 0, Step: 4, Rank: 3, loss = 0.8023418188095093
 Epoch 0:   3%|▎         | 4/121 [00:10<05:08,  2.63s/it] total tokens: 7940 num samples: 10 num padding tokens: 987 - rank: 4 max len: 794 min len: 627 avg len: 695.3 num_loss_counted_tokens: 4306
 {
    "epoch": 0,
    "step": 4,
    "rank": 0,
    "loss": 0.017590709030628204,
    "overall_throughput": 42.48427743949919,
    "lr": 0.0,
    "cuda_mem_allocated": 18.00298833847046,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21448,
    "batch_size": 75,
    "total_loss": 0.664263904094696,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:12.736014"
 }
 total tokens: 7260 num samples: 4 num padding tokens: 876 - rank: 1 max len: 1815 min len: 1442 avg len: 1596.0 num_loss_counted_tokens: 1808
 total tokens: 8097 num samples: 3 num padding tokens: 887 - rank: 0 max len: 2699 min len: 2231 avg len: 2403.3333333333335 num_loss_counted_tokens: 260
 total tokens: 7776 num samples: 32 num padding tokens: 2768 - rank: 7 max len: 243 min len: 77 avg len: 156.5 num_loss_counted_tokens: 1979
 total tokens: 6895 num samples: 5 num padding tokens: 531 - rank: 2 max len: 1379 min len: 1160 avg len: 1272.8 num_loss_counted_tokens: 3055
 total tokens: 7923 num samples: 19 num padding tokens: 1608 - rank: 6 max len: 417 min len: 252 avg len: 332.36842105263156 num_loss_counted_tokens: 3042
 total tokens: 7512 num samples: 8 num padding tokens: 560 - rank: 3 max len: 939 min len: 795 avg len: 869.0 num_loss_counted_tokens: 4478
 total tokens: 8047 num samples: 13 num padding tokens: 1167 - rank: 5 max len: 619 min len: 439 avg len: 529.2307692307693 num_loss_counted_tokens: 4239
 Per-token loss scaled by world size: 0.00024933897657319903
 Per-token loss scaled by world size: 0.000386894796974957
 Per-token loss scaled by world size: 0.00021959797595627606
 Per-token loss scaled by world size: 3.401555431992165e-06
 Per-token loss scaled by world size: 5.781253548775567e-06
 Per-token loss scaled by world size: 0.00047684554010629654
 Per-token loss scaled by world size: 0.0002837673237081617
 Epoch: 0, Step: 5, Rank: 4, loss = 1.0060231685638428
 Epoch: 0, Step: 5, Rank: 2, loss = 0.6483436822891235
 Epoch: 0, Step: 5, Rank: 0, loss = 0.008844894357025623
 Epoch: 0, Step: 5, Rank: 3, loss = 0.571009635925293
 Epoch: 0, Step: 5, Rank: 1, loss = 0.015032704919576645
 Epoch: 0, Step: 5, Rank: 5, loss = 1.2399176359176636
 Epoch: 0, Step: 5, Rank: 7, loss = 0.7378659844398499
 Per-token loss scaled by world size: 0.00046695370110683143
 Epoch: 0, Step: 5, Rank: 6, loss = 1.2141963243484497
 Epoch 0:   4%|▍         | 5/121 [00:13<04:59,  2.58s/it] total tokens: 7651 num samples: 7 num padding tokens: 746 - rank: 4 max len: 1093 min len: 866 avg len: 986.4285714285714 num_loss_counted_tokens: 5678
 {
    "epoch": 0,
    "step": 5,
    "rank": 0,
    "loss": 0.008844894357025623,
    "overall_throughput": 43.14041038651036,
    "lr": 0.0,
    "cuda_mem_allocated": 18.102890491485596,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20802,
    "batch_size": 80,
    "total_loss": 0.6801542043685913,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:15.218258"
 }
 total tokens: 7038 num samples: 2 num padding tokens: 1091 - rank: 1 max len: 3519 min len: 2428 avg len: 2973.5 num_loss_counted_tokens: 159
 total tokens: 8028 num samples: 2 num padding tokens: 123 - rank: 0 max len: 4014 min len: 3891 avg len: 3952.5 num_loss_counted_tokens: 168
 total tokens: 7596 num samples: 12 num padding tokens: 1879 - rank: 6 max len: 633 min len: 336 avg len: 476.4166666666667 num_loss_counted_tokens: 3821
 total tokens: 8064 num samples: 24 num padding tokens: 3562 - rank: 7 max len: 336 min len: 89 avg len: 187.58333333333334 num_loss_counted_tokens: 1860
 total tokens: 7970 num samples: 5 num padding tokens: 726 - rank: 3 max len: 1594 min len: 1187 avg len: 1448.8 num_loss_counted_tokens: 2543
 total tokens: 7890 num samples: 10 num padding tokens: 666 - rank: 5 max len: 789 min len: 634 avg len: 722.4 num_loss_counted_tokens: 3801
 total tokens: 7000 num samples: 4 num padding tokens: 270 - rank: 2 max len: 1750 min len: 1627 avg len: 1682.5 num_loss_counted_tokens: 759
 Per-token loss scaled by world size: 0.0001642795541556552
 Per-token loss scaled by world size: 0.00021280848886817694
 Per-token loss scaled by world size: 0.00032824286608956754
 Per-token loss scaled by world size: 8.065341717156116e-06
 Per-token loss scaled by world size: 3.2945732527878135e-05
 Per-token loss scaled by world size: 0.00023678457364439964
 Per-token loss scaled by world size: 0.00048681392217986286
 Epoch: 0, Step: 6, Rank: 2, loss = 0.5886548757553101
 Epoch: 0, Step: 6, Rank: 4, loss = 0.907960832118988
 Epoch: 0, Step: 6, Rank: 3, loss = 0.45441779494285583
 Epoch: 0, Step: 6, Rank: 0, loss = 0.022309742867946625
 Epoch: 0, Step: 6, Rank: 1, loss = 0.0911320149898529
 Epoch: 0, Step: 6, Rank: 6, loss = 1.346588134765625
 Epoch: 0, Step: 6, Rank: 7, loss = 0.6549757122993469
 Per-token loss scaled by world size: 0.0005791043513454497
 Epoch: 0, Step: 6, Rank: 5, loss = 1.6018750667572021
 Epoch 0:   5%|▍         | 6/121 [00:15<04:56,  2.58s/it] total tokens: 7911 num samples: 9 num padding tokens: 476 - rank: 4 max len: 879 min len: 707 avg len: 826.1111111111111 num_loss_counted_tokens: 6136
 total tokens: 6741 num samples: 3 num padding tokens: 433 - rank: 1 max len: 2247 min len: 1894 avg len: 2102.6666666666665 num_loss_counted_tokens: 1931
 total tokens: 7502 num samples: 11 num padding tokens: 1303 - rank: 5 max len: 682 min len: 424 avg len: 563.5454545454545 num_loss_counted_tokens: 3560
 {
    "epoch": 0,
    "step": 6,
    "rank": 0,
    "loss": 0.022309742867946625,
    "overall_throughput": 41.65718757774177,
    "lr": 0.0,
    "cuda_mem_allocated": 18.077077388763428,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22129,
    "batch_size": 78,
    "total_loss": 0.7084892988204956,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:17.804745"
 }
 total tokens: 8020 num samples: 20 num padding tokens: 1280 - rank: 6 max len: 401 min len: 273 avg len: 337.0 num_loss_counted_tokens: 3403
 total tokens: 7714 num samples: 29 num padding tokens: 2377 - rank: 7 max len: 266 min len: 79 avg len: 184.0344827586207 num_loss_counted_tokens: 2540
 total tokens: 6716 num samples: 4 num padding tokens: 1107 - rank: 2 max len: 1679 min len: 1232 avg len: 1402.25 num_loss_counted_tokens: 2407
 total tokens: 6960 num samples: 6 num padding tokens: 657 - rank: 3 max len: 1160 min len: 960 avg len: 1050.5 num_loss_counted_tokens: 5018
 total tokens: 6996 num samples: 2 num padding tokens: 327 - rank: 0 max len: 3498 min len: 3171 avg len: 3334.5 num_loss_counted_tokens: 153
 Per-token loss scaled by world size: 0.0002843443362507969
 Per-token loss scaled by world size: 0.00017875904450193048
 Per-token loss scaled by world size: 0.00013562251115217805
 Per-token loss scaled by world size: 0.0001140675667556934
 Per-token loss scaled by world size: 0.00023188847990240902
 Per-token loss scaled by world size: 0.00043197604827582836
 Per-token loss scaled by world size: 0.00019801303278654814
 Epoch: 0, Step: 7, Rank: 6, loss = 0.8738256692886353
 Epoch: 0, Step: 7, Rank: 4, loss = 0.5493488907814026
 Epoch: 0, Step: 7, Rank: 0, loss = 0.4167849123477936
 Epoch: 0, Step: 7, Rank: 7, loss = 0.7126222848892212
 Epoch: 0, Step: 7, Rank: 1, loss = 0.35054388642311096
 Epoch: 0, Step: 7, Rank: 2, loss = 0.6085187792778015
 Epoch: 0, Step: 7, Rank: 5, loss = 1.3275164365768433
 Per-token loss scaled by world size: 0.00021533554536290467
 Epoch: 0, Step: 7, Rank: 3, loss = 0.6617530584335327
 Epoch 0:   6%|▌         | 7/121 [00:18<04:50,  2.55s/it] total tokens: 7806 num samples: 3 num padding tokens: 966 - rank: 1 max len: 2602 min len: 1917 avg len: 2280.0 num_loss_counted_tokens: 944
 total tokens: 7488 num samples: 9 num padding tokens: 542 - rank: 4 max len: 832 min len: 711 avg len: 771.7777777777778 num_loss_counted_tokens: 3897
 {
    "epoch": 0,
    "step": 7,
    "rank": 0,
    "loss": 0.4167849123477936,
    "overall_throughput": 43.4992160904002,
    "lr": 0.0,
    "cuda_mem_allocated": 18.127665996551514,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24585,
    "batch_size": 88,
    "total_loss": 0.6876142621040344,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:20.296605"
 }
 total tokens: 7400 num samples: 4 num padding tokens: 570 - rank: 2 max len: 1850 min len: 1579 avg len: 1707.5 num_loss_counted_tokens: 1604
 total tokens: 7920 num samples: 16 num padding tokens: 2028 - rank: 6 max len: 495 min len: 286 avg len: 368.25 num_loss_counted_tokens: 3190
 total tokens: 7140 num samples: 6 num padding tokens: 1155 - rank: 3 max len: 1190 min len: 845 avg len: 997.5 num_loss_counted_tokens: 3914
 total tokens: 7424 num samples: 2 num padding tokens: 302 - rank: 0 max len: 3712 min len: 3410 avg len: 3561.0 num_loss_counted_tokens: 261
 total tokens: 7744 num samples: 11 num padding tokens: 907 - rank: 5 max len: 704 min len: 540 avg len: 621.5454545454545 num_loss_counted_tokens: 4091
 total tokens: 7672 num samples: 28 num padding tokens: 2305 - rank: 7 max len: 274 min len: 83 avg len: 191.67857142857142 num_loss_counted_tokens: 2631
 Per-token loss scaled by world size: 0.00014226992789190263
 Per-token loss scaled by world size: 0.00031214559567160904
 Per-token loss scaled by world size: 0.00031845251214690506
 Per-token loss scaled by world size: 0.0002571563527453691
 Per-token loss scaled by world size: 0.00039170257514342666
 Per-token loss scaled by world size: 0.00012162854545749724
 Per-token loss scaled by world size: 0.00013610723544843495
 Epoch: 0, Step: 8, Rank: 6, loss = 1.0465071201324463
 Epoch: 0, Step: 8, Rank: 4, loss = 1.0676518678665161
 Epoch: 0, Step: 8, Rank: 2, loss = 0.47697773575782776
 Epoch: 0, Step: 8, Rank: 3, loss = 0.8621488213539124
 Epoch: 0, Step: 8, Rank: 1, loss = 0.4077748954296112
 Epoch: 0, Step: 8, Rank: 5, loss = 1.3132318258285522
 Epoch: 0, Step: 8, Rank: 7, loss = 0.4563165009021759
 Per-token loss scaled by world size: 2.5762397854123265e-05
 Epoch: 0, Step: 8, Rank: 0, loss = 0.08637166023254395
 Epoch 0:   7%|▋         | 8/121 [00:21<04:49,  2.56s/it] total tokens: 6510 num samples: 3 num padding tokens: 1131 - rank: 1 max len: 2170 min len: 1481 avg len: 1793.0 num_loss_counted_tokens: 969
 total tokens: 8001 num samples: 9 num padding tokens: 764 - rank: 4 max len: 889 min len: 750 avg len: 804.1111111111111 num_loss_counted_tokens: 5953
 total tokens: 6042 num samples: 19 num padding tokens: 2404 - rank: 7 max len: 318 min len: 88 avg len: 191.47368421052633 num_loss_counted_tokens: 1475
 {
    "epoch": 0,
    "step": 8,
    "rank": 0,
    "loss": 0.08637166023254395,
    "overall_throughput": 41.99042053701942,
    "lr": 0.0,
    "cuda_mem_allocated": 18.229600429534912,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26821,
    "batch_size": 90,
    "total_loss": 0.7146224975585938,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:22.870728"
 }
 total tokens: 7525 num samples: 7 num padding tokens: 620 - rank: 3 max len: 1075 min len: 938 avg len: 986.4285714285714 num_loss_counted_tokens: 5075
 total tokens: 7710 num samples: 6 num padding tokens: 714 - rank: 2 max len: 1285 min len: 1097 avg len: 1166.0 num_loss_counted_tokens: 4799
 total tokens: 7742 num samples: 14 num padding tokens: 1362 - rank: 6 max len: 553 min len: 343 avg len: 455.7142857142857 num_loss_counted_tokens: 3929
 total tokens: 6622 num samples: 2 num padding tokens: 836 - rank: 0 max len: 3311 min len: 2475 avg len: 2893.0 num_loss_counted_tokens: 693
 total tokens: 8096 num samples: 11 num padding tokens: 746 - rank: 5 max len: 736 min len: 587 avg len: 668.1818181818181 num_loss_counted_tokens: 5289
 Per-token loss scaled by world size: 0.00032484609982930124
 Per-token loss scaled by world size: 0.000325270724715665
 Per-token loss scaled by world size: 0.000487986282678321
 Per-token loss scaled by world size: 0.00039277857285924256
 Per-token loss scaled by world size: 0.00037055223947390914
 Per-token loss scaled by world size: 1.2821310519939288e-06
 Per-token loss scaled by world size: 2.2653903215541504e-05
 Epoch: 0, Step: 9, Rank: 4, loss = 0.9635331630706787
 Epoch: 0, Step: 9, Rank: 6, loss = 1.1635082960128784
 Epoch: 0, Step: 9, Rank: 5, loss = 1.4455373287200928
 Epoch: 0, Step: 9, Rank: 0, loss = 0.0037979925982654095
 Epoch: 0, Step: 9, Rank: 7, loss = 1.0976684093475342
 Epoch: 0, Step: 9, Rank: 2, loss = 0.9622753262519836
 Epoch: 0, Step: 9, Rank: 1, loss = 0.06710652261972427
 Per-token loss scaled by world size: 0.0003276054630987346
 Epoch: 0, Step: 9, Rank: 3, loss = 0.9704492688179016
 Epoch 0:   7%|▋         | 9/121 [00:23<04:45,  2.55s/it] total tokens: 7932 num samples: 3 num padding tokens: 1331 - rank: 1 max len: 2644 min len: 1965 avg len: 2200.3333333333335 num_loss_counted_tokens: 863
 total tokens: 7389 num samples: 9 num padding tokens: 630 - rank: 4 max len: 821 min len: 683 avg len: 751.0 num_loss_counted_tokens: 4429
 {
    "epoch": 0,
    "step": 9,
    "rank": 0,
    "loss": 0.0037979925982654095,
    "overall_throughput": 42.802430491911245,
    "lr": 0.0,
    "cuda_mem_allocated": 17.982967853546143,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23698,
    "batch_size": 89,
    "total_loss": 0.8342345356941223,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:25.401772"
 }
 total tokens: 8088 num samples: 12 num padding tokens: 982 - rank: 5 max len: 674 min len: 500 avg len: 592.1666666666666 num_loss_counted_tokens: 4684
 total tokens: 6798 num samples: 2 num padding tokens: 340 - rank: 0 max len: 3399 min len: 3059 avg len: 3229.0 num_loss_counted_tokens: 156
 total tokens: 6642 num samples: 27 num padding tokens: 2019 - rank: 7 max len: 246 min len: 76 avg len: 171.22222222222223 num_loss_counted_tokens: 1990
 total tokens: 6668 num samples: 4 num padding tokens: 1622 - rank: 2 max len: 1667 min len: 1049 avg len: 1261.5 num_loss_counted_tokens: 1065
 total tokens: 7939 num samples: 17 num padding tokens: 2169 - rank: 6 max len: 467 min len: 248 avg len: 339.4117647058824 num_loss_counted_tokens: 3544
 total tokens: 7528 num samples: 8 num padding tokens: 554 - rank: 3 max len: 941 min len: 832 avg len: 871.75 num_loss_counted_tokens: 4720
 Per-token loss scaled by world size: 0.00029833969892933965
 Per-token loss scaled by world size: 0.0003632043662946671
 Per-token loss scaled by world size: 0.00018327771977055818
 Per-token loss scaled by world size: 0.00017725562793202698
 Per-token loss scaled by world size: 1.4638754691986833e-05
 Per-token loss scaled by world size: 0.0002214470150647685
 Per-token loss scaled by world size: 5.87268550589215e-06
 Epoch: 0, Step: 10, Rank: 2, loss = 0.5432663559913635
 Epoch: 0, Step: 10, Rank: 6, loss = 0.9143738746643066
 Epoch: 0, Step: 10, Rank: 4, loss = 1.1131759881973267
 Epoch: 0, Step: 10, Rank: 3, loss = 0.5617232918739319
 Epoch: 0, Step: 10, Rank: 0, loss = 0.01799904741346836
 Epoch: 0, Step: 10, Rank: 1, loss = 0.04486595466732979
 Epoch: 0, Step: 10, Rank: 7, loss = 0.6787074208259583
 Per-token loss scaled by world size: 0.0003915868583135307
 Epoch: 0, Step: 10, Rank: 5, loss = 1.200164794921875
 Epoch 0:   8%|▊         | 10/121 [00:26<04:41,  2.54s/it] total tokens: 7056 num samples: 6 num padding tokens: 1539 - rank: 4 max len: 1176 min len: 760 avg len: 919.5 num_loss_counted_tokens: 3752
 total tokens: 6318 num samples: 2 num padding tokens: 744 - rank: 1 max len: 3159 min len: 2415 avg len: 2787.0 num_loss_counted_tokens: 641
 {
    "epoch": 0,
    "step": 10,
    "rank": 0,
    "loss": 0.01799904741346836,
    "overall_throughput": 43.20698130216357,
    "lr": 0.0,
    "cuda_mem_allocated": 17.961437225341797,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24519,
    "batch_size": 84,
    "total_loss": 0.6342846155166626,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:27.904244"
 }
 total tokens: 7876 num samples: 11 num padding tokens: 1081 - rank: 5 max len: 716 min len: 528 avg len: 617.7272727272727 num_loss_counted_tokens: 3741
 total tokens: 6586 num samples: 2 num padding tokens: 33 - rank: 0 max len: 3293 min len: 3260 avg len: 3276.5 num_loss_counted_tokens: 401
 total tokens: 7084 num samples: 4 num padding tokens: 1038 - rank: 3 max len: 1771 min len: 1326 avg len: 1511.5 num_loss_counted_tokens: 2816
 total tokens: 7956 num samples: 17 num padding tokens: 1245 - rank: 6 max len: 468 min len: 316 avg len: 394.7647058823529 num_loss_counted_tokens: 3928
 total tokens: 6378 num samples: 3 num padding tokens: 279 - rank: 2 max len: 2126 min len: 1873 avg len: 2033.0 num_loss_counted_tokens: 1061
 total tokens: 7852 num samples: 26 num padding tokens: 3174 - rank: 7 max len: 302 min len: 84 avg len: 179.92307692307693 num_loss_counted_tokens: 2103
 Per-token loss scaled by world size: 0.00014277359878178686
 Per-token loss scaled by world size: 0.00017172133084386587
 Per-token loss scaled by world size: 0.00032692356035113335
 Per-token loss scaled by world size: 0.00021291511075105518
 Per-token loss scaled by world size: 0.0002838084474205971
 Per-token loss scaled by world size: 8.463160156679805e-06
 Per-token loss scaled by world size: 0.0002184695185860619
 Epoch: 0, Step: 11, Rank: 5, loss = 1.096379041671753
 Epoch: 0, Step: 11, Rank: 0, loss = 0.02838226407766342
 Epoch: 0, Step: 11, Rank: 2, loss = 0.5758889317512512
 Epoch: 0, Step: 11, Rank: 1, loss = 0.47880908846855164
 Epoch: 0, Step: 11, Rank: 6, loss = 0.7140374183654785
 Epoch: 0, Step: 11, Rank: 4, loss = 0.9517871141433716
 Epoch: 0, Step: 11, Rank: 7, loss = 0.7326648235321045
 Per-token loss scaled by world size: 0.00020073003543075174
 Epoch: 0, Step: 11, Rank: 3, loss = 0.6731732487678528
 Epoch 0:   9%|▉         | 11/121 [00:28<04:39,  2.54s/it] total tokens: 7000 num samples: 4 num padding tokens: 1372 - rank: 1 max len: 1750 min len: 1261 avg len: 1407.0 num_loss_counted_tokens: 902
 total tokens: 7730 num samples: 10 num padding tokens: 533 - rank: 4 max len: 773 min len: 666 avg len: 719.7 num_loss_counted_tokens: 4342
 total tokens: 7980 num samples: 12 num padding tokens: 984 - rank: 5 max len: 665 min len: 484 avg len: 583.0 num_loss_counted_tokens: 4069
 {
    "epoch": 0,
    "step": 11,
    "rank": 0,
    "loss": 0.02838226407766342,
    "overall_throughput": 42.203161621451606,
    "lr": 0.0,
    "cuda_mem_allocated": 18.220097064971924,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26829,
    "batch_size": 89,
    "total_loss": 0.6563901901245117,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:30.459939"
 }
 total tokens: 7854 num samples: 17 num padding tokens: 1998 - rank: 6 max len: 462 min len: 265 avg len: 344.47058823529414 num_loss_counted_tokens: 3040
 total tokens: 7953 num samples: 3 num padding tokens: 883 - rank: 0 max len: 2651 min len: 1841 avg len: 2356.6666666666665 num_loss_counted_tokens: 501
 total tokens: 7992 num samples: 8 num padding tokens: 643 - rank: 3 max len: 999 min len: 846 avg len: 918.625 num_loss_counted_tokens: 5458
 total tokens: 8060 num samples: 31 num padding tokens: 2438 - rank: 7 max len: 260 min len: 78 avg len: 181.3548387096774 num_loss_counted_tokens: 2341
 total tokens: 7314 num samples: 6 num padding tokens: 570 - rank: 2 max len: 1219 min len: 1012 avg len: 1124.0 num_loss_counted_tokens: 2551
 Per-token loss scaled by world size: 0.0002459367678966373
 Per-token loss scaled by world size: 0.00032957716030068696
 Per-token loss scaled by world size: 2.192043893955997e-06
 Per-token loss scaled by world size: 3.339715112815611e-05
 Per-token loss scaled by world size: 0.0003242892271373421
 Per-token loss scaled by world size: 0.00027953533572144806
 Per-token loss scaled by world size: 0.0003221108636353165
 Epoch: 0, Step: 12, Rank: 0, loss = 0.005932766944169998
 Epoch: 0, Step: 12, Rank: 2, loss = 0.8920005559921265
 Epoch: 0, Step: 12, Rank: 1, loss = 0.09038939327001572
 Epoch: 0, Step: 12, Rank: 3, loss = 0.6656278967857361
 Epoch: 0, Step: 12, Rank: 5, loss = 0.8776887655258179
 Epoch: 0, Step: 12, Rank: 4, loss = 0.7565624117851257
 Epoch: 0, Step: 12, Rank: 7, loss = 0.8717930316925049
 Per-token loss scaled by world size: 0.00047857032041065395
 Epoch: 0, Step: 12, Rank: 6, loss = 1.2952505350112915
 Epoch 0:  10%|▉         | 12/121 [00:31<04:35,  2.53s/it] total tokens: 7648 num samples: 8 num padding tokens: 826 - rank: 4 max len: 956 min len: 722 avg len: 852.75 num_loss_counted_tokens: 4969
 total tokens: 7680 num samples: 3 num padding tokens: 714 - rank: 1 max len: 2560 min len: 2201 avg len: 2322.0 num_loss_counted_tokens: 995
 {
    "epoch": 0,
    "step": 12,
    "rank": 0,
    "loss": 0.005932766944169998,
    "overall_throughput": 43.402175224914764,
    "lr": 0.0,
    "cuda_mem_allocated": 17.94186305999756,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21652,
    "batch_size": 76,
    "total_loss": 0.6819056272506714,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:32.991146"
 }
 total tokens: 7664 num samples: 16 num padding tokens: 1534 - rank: 6 max len: 479 min len: 276 avg len: 383.125 num_loss_counted_tokens: 3860
 total tokens: 8106 num samples: 7 num padding tokens: 750 - rank: 3 max len: 1158 min len: 961 avg len: 1050.857142857143 num_loss_counted_tokens: 5265
 total tokens: 7888 num samples: 29 num padding tokens: 2471 - rank: 7 max len: 272 min len: 82 avg len: 186.79310344827587 num_loss_counted_tokens: 2429
 total tokens: 7306 num samples: 2 num padding tokens: 864 - rank: 0 max len: 3653 min len: 2789 avg len: 3221.0 num_loss_counted_tokens: 191
 total tokens: 6890 num samples: 5 num padding tokens: 518 - rank: 2 max len: 1378 min len: 1170 avg len: 1274.4 num_loss_counted_tokens: 2274
 total tokens: 7689 num samples: 11 num padding tokens: 1127 - rank: 5 max len: 699 min len: 483 avg len: 596.5454545454545 num_loss_counted_tokens: 4497
 Per-token loss scaled by world size: 0.00013505632523447275
 Per-token loss scaled by world size: 0.00034081621561199427
 Per-token loss scaled by world size: 0.0004060634528286755
 Per-token loss scaled by world size: 0.0003271996683906764
 Per-token loss scaled by world size: 2.6962425181409344e-06
 Per-token loss scaled by world size: 0.00019702856661751866
 Per-token loss scaled by world size: 2.0444547317310935e-06
 Epoch: 0, Step: 13, Rank: 2, loss = 0.39850056171417236
 Epoch: 0, Step: 13, Rank: 6, loss = 1.0056208372116089
 Epoch: 0, Step: 13, Rank: 4, loss = 1.1981409788131714
 Epoch: 0, Step: 13, Rank: 3, loss = 0.9654435515403748
 Epoch: 0, Step: 13, Rank: 0, loss = 0.007955600507557392
 Epoch: 0, Step: 13, Rank: 7, loss = 0.5813574194908142
 Epoch: 0, Step: 13, Rank: 1, loss = 0.006032418925315142
 Per-token loss scaled by world size: 0.000633390387520194
 Epoch: 0, Step: 13, Rank: 5, loss = 1.868897557258606
 Epoch 0:  11%|█         | 13/121 [00:33<04:32,  2.53s/it] total tokens: 8085 num samples: 11 num padding tokens: 649 - rank: 4 max len: 735 min len: 614 avg len: 676.0 num_loss_counted_tokens: 3645
 total tokens: 6609 num samples: 3 num padding tokens: 940 - rank: 1 max len: 2203 min len: 1630 avg len: 1889.6666666666667 num_loss_counted_tokens: 239
 total tokens: 7854 num samples: 17 num padding tokens: 1445 - rank: 6 max len: 462 min len: 277 avg len: 377.0 num_loss_counted_tokens: 3417
 {
    "epoch": 0,
    "step": 13,
    "rank": 0,
    "loss": 0.007955600507557392,
    "overall_throughput": 42.85566372115263,
    "lr": 0.0,
    "cuda_mem_allocated": 18.043649196624756,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23605,
    "batch_size": 89,
    "total_loss": 0.7539936304092407,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:35.510670"
 }
 total tokens: 7722 num samples: 13 num padding tokens: 1086 - rank: 5 max len: 594 min len: 465 avg len: 510.46153846153845 num_loss_counted_tokens: 3752
 total tokens: 4062 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4062 min len: 4062 avg len: 4062.0 num_loss_counted_tokens: 85
 total tokens: 7917 num samples: 29 num padding tokens: 2683 - rank: 7 max len: 273 min len: 76 avg len: 180.48275862068965 num_loss_counted_tokens: 2192
 total tokens: 8091 num samples: 9 num padding tokens: 685 - rank: 3 max len: 899 min len: 760 avg len: 822.8888888888889 num_loss_counted_tokens: 6014
 total tokens: 7815 num samples: 5 num padding tokens: 2132 - rank: 2 max len: 1563 min len: 901 avg len: 1136.6 num_loss_counted_tokens: 3256
 Per-token loss scaled by world size: 0.00030266563408076763
 Per-token loss scaled by world size: 0.0001793744886526838
 Per-token loss scaled by world size: 0.00013343075988814235
 Per-token loss scaled by world size: 8.247572259278968e-05
 Per-token loss scaled by world size: 0.00023257164866663516
 Per-token loss scaled by world size: 0.00025159039068967104
 Epoch: 0, Step: 14, Rank: 3, loss = 0.6006578803062439
 Epoch: 0, Step: 14, Rank: 5, loss = 1.0135136842727661
 Epoch: 0, Step: 14, Rank: 0, loss = 0.2761802673339844
 Epoch: 0, Step: 14, Rank: 1, loss = 0.4468095600605011
 Per-token loss scaled by world size: 0.0003185720997862518
 Epoch: 0, Step: 14, Rank: 4, loss = 0.8424819111824036
 Epoch: 0, Step: 14, Rank: 7, loss = 0.7787952423095703
 Per-token loss scaled by world size: 9.590814443072304e-05
 Epoch: 0, Step: 14, Rank: 6, loss = 1.066778540611267
 Epoch: 0, Step: 14, Rank: 2, loss = 0.3211604058742523
 Epoch 0:  12%|█▏        | 14/121 [00:36<04:31,  2.54s/it] total tokens: 6432 num samples: 3 num padding tokens: 621 - rank: 1 max len: 2144 min len: 1779 avg len: 1937.0 num_loss_counted_tokens: 328
 total tokens: 7650 num samples: 9 num padding tokens: 968 - rank: 4 max len: 850 min len: 696 avg len: 742.4444444444445 num_loss_counted_tokens: 4090
 total tokens: 7776 num samples: 16 num padding tokens: 1817 - rank: 6 max len: 486 min len: 267 avg len: 372.4375 num_loss_counted_tokens: 2945
 {
    "epoch": 0,
    "step": 14,
    "rank": 0,
    "loss": 0.2761802673339844,
    "overall_throughput": 41.914163559432374,
    "lr": 0.0,
    "cuda_mem_allocated": 18.220842361450195,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26789,
    "batch_size": 100,
    "total_loss": 0.6682971715927124,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:38.048693"
 }
 total tokens: 7546 num samples: 11 num padding tokens: 994 - rank: 5 max len: 686 min len: 507 avg len: 595.6363636363636 num_loss_counted_tokens: 4306
 total tokens: 7953 num samples: 33 num padding tokens: 2532 - rank: 7 max len: 241 min len: 83 avg len: 164.27272727272728 num_loss_counted_tokens: 2377
 total tokens: 7696 num samples: 8 num padding tokens: 273 - rank: 3 max len: 962 min len: 903 avg len: 927.875 num_loss_counted_tokens: 3069
 total tokens: 7708 num samples: 2 num padding tokens: 1229 - rank: 0 max len: 3854 min len: 2625 avg len: 3239.5 num_loss_counted_tokens: 176
 total tokens: 6624 num samples: 4 num padding tokens: 1021 - rank: 2 max len: 1656 min len: 1051 avg len: 1400.75 num_loss_counted_tokens: 1421
 Per-token loss scaled by world size: 0.0004944643005728722
 Per-token loss scaled by world size: 0.0002606770722195506
 Per-token loss scaled by world size: 0.0004838722525164485
 Per-token loss scaled by world size: 5.0423706852598116e-05
 Per-token loss scaled by world size: 0.0004016592283733189
 Per-token loss scaled by world size: 6.617276085307822e-05
 Epoch: 0, Step: 15, Rank: 6, loss = 0.7229878306388855
 Epoch: 0, Step: 15, Rank: 5, loss = 1.3713966608047485
 Epoch: 0, Step: 15, Rank: 4, loss = 1.3420196771621704
 Epoch: 0, Step: 15, Rank: 0, loss = 0.13985015451908112
 Epoch: 0, Step: 15, Rank: 1, loss = 0.18353015184402466
 Epoch: 0, Step: 15, Rank: 7, loss = 1.1140018701553345
 Per-token loss scaled by world size: 8.276257722172886e-05
 Per-token loss scaled by world size: 0.000270542484940961
 Epoch: 0, Step: 15, Rank: 2, loss = 0.22954201698303223
 Epoch: 0, Step: 15, Rank: 3, loss = 0.7503495812416077
 Epoch 0:  12%|█▏        | 15/121 [00:38<04:29,  2.54s/it] total tokens: 7595 num samples: 7 num padding tokens: 570 - rank: 4 max len: 1085 min len: 923 avg len: 1003.5714285714286 num_loss_counted_tokens: 4160
 total tokens: 6930 num samples: 2 num padding tokens: 569 - rank: 1 max len: 3465 min len: 2896 avg len: 3180.5 num_loss_counted_tokens: 174
 total tokens: 7542 num samples: 9 num padding tokens: 1278 - rank: 5 max len: 838 min len: 611 avg len: 696.0 num_loss_counted_tokens: 3843
 total tokens: 8061 num samples: 3 num padding tokens: 917 - rank: 2 max len: 2687 min len: 1952 avg len: 2381.3333333333335 num_loss_counted_tokens: 302
 {
    "epoch": 0,
    "step": 15,
    "rank": 0,
    "loss": 0.13985015451908112,
    "overall_throughput": 42.649491925532274,
    "lr": 0.0,
    "cuda_mem_allocated": 18.06497097015381,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22188,
    "batch_size": 90,
    "total_loss": 0.7317097187042236,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:40.589466"
 }
 total tokens: 4070 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4070 min len: 4070 avg len: 4070.0 num_loss_counted_tokens: 1038
 total tokens: 7565 num samples: 5 num padding tokens: 974 - rank: 3 max len: 1513 min len: 1101 avg len: 1318.2 num_loss_counted_tokens: 2867
 total tokens: 7852 num samples: 13 num padding tokens: 2212 - rank: 6 max len: 604 min len: 287 avg len: 433.84615384615387 num_loss_counted_tokens: 3284
 total tokens: 5720 num samples: 20 num padding tokens: 1667 - rank: 7 max len: 286 min len: 83 avg len: 202.65 num_loss_counted_tokens: 1807
 Per-token loss scaled by world size: 0.00036834663478657603
 Per-token loss scaled by world size: 0.00037199081270955503
 Per-token loss scaled by world size: 0.00025284758885391057
 Per-token loss scaled by world size: 0.00017388923151884228
 Per-token loss scaled by world size: 4.525140013811324e-07
 Per-token loss scaled by world size: 0.00022274823277257383
 Per-token loss scaled by world size: 9.103088814299554e-05
 Epoch: 0, Step: 16, Rank: 4, loss = 1.2040791511535645
 Epoch: 0, Step: 16, Rank: 5, loss = 1.215991497039795
 Epoch: 0, Step: 16, Rank: 3, loss = 0.8265271782875061
 Epoch: 0, Step: 16, Rank: 7, loss = 0.5684221386909485
 Epoch: 0, Step: 16, Rank: 0, loss = 0.0014792117290198803
 Epoch: 0, Step: 16, Rank: 2, loss = 0.7281361222267151
 Epoch: 0, Step: 16, Rank: 1, loss = 0.29756858944892883
 Per-token loss scaled by world size: 0.00036798955989070237
 Epoch: 0, Step: 16, Rank: 6, loss = 1.2029118537902832
 Epoch 0:  13%|█▎        | 16/121 [00:41<04:25,  2.52s/it] total tokens: 7950 num samples: 10 num padding tokens: 556 - rank: 4 max len: 795 min len: 706 avg len: 739.4 num_loss_counted_tokens: 4706
 total tokens: 7167 num samples: 3 num padding tokens: 1839 - rank: 1 max len: 2389 min len: 1434 avg len: 1776.0 num_loss_counted_tokens: 2175
 {
    "epoch": 0,
    "step": 16,
    "rank": 0,
    "loss": 0.0014792117290198803,
    "overall_throughput": 43.42571067791906,
    "lr": 0.0,
    "cuda_mem_allocated": 18.137038707733154,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26151,
    "batch_size": 89,
    "total_loss": 0.7556394934654236,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:43.079559"
 }
 total tokens: 7904 num samples: 16 num padding tokens: 2012 - rank: 6 max len: 494 min len: 282 avg len: 368.25 num_loss_counted_tokens: 3677
 total tokens: 7689 num samples: 11 num padding tokens: 1030 - rank: 5 max len: 699 min len: 501 avg len: 605.3636363636364 num_loss_counted_tokens: 4515
 total tokens: 6075 num samples: 25 num padding tokens: 2510 - rank: 7 max len: 243 min len: 84 avg len: 142.6 num_loss_counted_tokens: 1404
 total tokens: 7752 num samples: 8 num padding tokens: 648 - rank: 3 max len: 969 min len: 812 avg len: 888.0 num_loss_counted_tokens: 4511
 total tokens: 7658 num samples: 7 num padding tokens: 214 - rank: 2 max len: 1094 min len: 982 avg len: 1063.4285714285713 num_loss_counted_tokens: 4703
 total tokens: 6666 num samples: 2 num padding tokens: 735 - rank: 0 max len: 3333 min len: 2598 avg len: 2965.5 num_loss_counted_tokens: 156
 Per-token loss scaled by world size: 0.0003832130169030279
 Per-token loss scaled by world size: 0.00025972415460273623
 Per-token loss scaled by world size: 0.0007209046743810177
 Per-token loss scaled by world size: 3.690614175866358e-05
 Per-token loss scaled by world size: 0.0004276617255527526
 Per-token loss scaled by world size: 7.648386599612422e-06
 Per-token loss scaled by world size: 0.00031741950078867376
 Epoch: 0, Step: 17, Rank: 0, loss = 0.0850963369011879
 Epoch: 0, Step: 17, Rank: 4, loss = 0.8835934400558472
 Epoch: 0, Step: 17, Rank: 6, loss = 1.6622259616851807
 Epoch: 0, Step: 17, Rank: 3, loss = 0.5988589525222778
 Epoch: 0, Step: 17, Rank: 2, loss = 0.9860810041427612
 Epoch: 0, Step: 17, Rank: 1, loss = 0.017635267227888107
 Epoch: 0, Step: 17, Rank: 7, loss = 0.7318900227546692
 Per-token loss scaled by world size: 0.00037659640656784177
 Epoch: 0, Step: 17, Rank: 5, loss = 0.8683371543884277
 Epoch 0:  14%|█▍        | 17/121 [00:43<04:21,  2.51s/it] total tokens: 8082 num samples: 9 num padding tokens: 734 - rank: 4 max len: 898 min len: 742 avg len: 816.4444444444445 num_loss_counted_tokens: 4032
 total tokens: 8112 num samples: 4 num padding tokens: 808 - rank: 1 max len: 2028 min len: 1729 avg len: 1826.0 num_loss_counted_tokens: 1009
 total tokens: 7680 num samples: 12 num padding tokens: 950 - rank: 5 max len: 640 min len: 496 avg len: 560.8333333333334 num_loss_counted_tokens: 4500
 {
    "epoch": 0,
    "step": 17,
    "rank": 0,
    "loss": 0.0850963369011879,
    "overall_throughput": 43.64044780600529,
    "lr": 0.0,
    "cuda_mem_allocated": 18.1714825630188,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18446,
    "batch_size": 76,
    "total_loss": 0.7292147874832153,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:45.559713"
 }
 total tokens: 7664 num samples: 16 num padding tokens: 2265 - rank: 6 max len: 479 min len: 253 avg len: 337.4375 num_loss_counted_tokens: 2968
 total tokens: 7756 num samples: 7 num padding tokens: 792 - rank: 3 max len: 1108 min len: 923 avg len: 994.8571428571429 num_loss_counted_tokens: 4875
 total tokens: 7194 num samples: 2 num padding tokens: 1367 - rank: 0 max len: 3597 min len: 2230 avg len: 2913.5 num_loss_counted_tokens: 192
 total tokens: 6448 num samples: 26 num padding tokens: 1927 - rank: 7 max len: 248 min len: 98 avg len: 173.8846153846154 num_loss_counted_tokens: 2050
 total tokens: 7075 num samples: 5 num padding tokens: 857 - rank: 2 max len: 1415 min len: 1116 avg len: 1243.6 num_loss_counted_tokens: 3663
 Per-token loss scaled by world size: 0.00015676565817557275
 Per-token loss scaled by world size: 0.0004898877814412117
 Per-token loss scaled by world size: 0.00045096693793311715
 Per-token loss scaled by world size: 5.226111625233898e-06
 Per-token loss scaled by world size: 8.227287253248505e-06
 Per-token loss scaled by world size: 0.0005342444637790322
 Per-token loss scaled by world size: 0.0005171209922991693
 Epoch: 0, Step: 18, Rank: 3, loss = 1.0057395696640015
 Epoch: 0, Step: 18, Rank: 2, loss = 0.3218398988246918
 Epoch: 0, Step: 18, Rank: 5, loss = 0.925835132598877
 Epoch: 0, Step: 18, Rank: 1, loss = 0.01072920672595501
 Epoch: 0, Step: 18, Rank: 0, loss = 0.016890620812773705
 Epoch: 0, Step: 18, Rank: 7, loss = 1.096803903579712
 Epoch: 0, Step: 18, Rank: 4, loss = 1.0616494417190552
 Per-token loss scaled by world size: 0.0007897767936810851
 Epoch: 0, Step: 18, Rank: 6, loss = 1.6214118003845215
 Epoch 0:  15%|█▍        | 18/121 [00:46<04:17,  2.50s/it] total tokens: 7335 num samples: 3 num padding tokens: 432 - rank: 1 max len: 2445 min len: 2093 avg len: 2301.0 num_loss_counted_tokens: 1145
 total tokens: 7455 num samples: 7 num padding tokens: 746 - rank: 4 max len: 1065 min len: 855 avg len: 958.4285714285714 num_loss_counted_tokens: 3764
 {
    "epoch": 0,
    "step": 18,
    "rank": 0,
    "loss": 0.016890620812773705,
    "overall_throughput": 43.99619876252009,
    "lr": 0.0,
    "cuda_mem_allocated": 17.973182678222656,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16424,
    "batch_size": 69,
    "total_loss": 0.7576124668121338,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:48.023775"
 }
 total tokens: 7188 num samples: 4 num padding tokens: 887 - rank: 2 max len: 1797 min len: 1322 avg len: 1575.25 num_loss_counted_tokens: 1620
 total tokens: 8016 num samples: 16 num padding tokens: 1990 - rank: 6 max len: 501 min len: 289 avg len: 376.625 num_loss_counted_tokens: 3443
 total tokens: 7200 num samples: 25 num padding tokens: 2743 - rank: 7 max len: 288 min len: 77 avg len: 178.28 num_loss_counted_tokens: 2061
 total tokens: 7980 num samples: 10 num padding tokens: 1229 - rank: 5 max len: 798 min len: 551 avg len: 675.1 num_loss_counted_tokens: 3751
 total tokens: 6852 num samples: 2 num padding tokens: 182 - rank: 0 max len: 3426 min len: 3244 avg len: 3335.0 num_loss_counted_tokens: 203
 total tokens: 7290 num samples: 6 num padding tokens: 282 - rank: 3 max len: 1215 min len: 1115 avg len: 1168.0 num_loss_counted_tokens: 4808
 Per-token loss scaled by world size: 0.0002617795253172517
 Per-token loss scaled by world size: 0.0003632722655311227
 Per-token loss scaled by world size: 0.00038097533979453146
 Per-token loss scaled by world size: 0.00018842382996808738
 Per-token loss scaled by world size: 0.00016033223073463887
 Per-token loss scaled by world size: 2.0858458356087795e-06
 Per-token loss scaled by world size: 0.00016425059584435076
 Epoch: 0, Step: 19, Rank: 2, loss = 0.5884711742401123
 Epoch: 0, Step: 19, Rank: 4, loss = 1.1345447301864624
 Epoch: 0, Step: 19, Rank: 1, loss = 0.5007376074790955
 Epoch: 0, Step: 19, Rank: 6, loss = 0.8175702095031738
 Epoch: 0, Step: 19, Rank: 5, loss = 1.189833641052246
 Epoch: 0, Step: 19, Rank: 0, loss = 0.006514357402920723
 Epoch: 0, Step: 19, Rank: 7, loss = 0.5129751563072205
 Per-token loss scaled by world size: 0.00031291748746298254
 Epoch: 0, Step: 19, Rank: 3, loss = 0.9772804379463196
 Epoch 0:  16%|█▌        | 19/121 [00:48<04:16,  2.51s/it] total tokens: 7540 num samples: 10 num padding tokens: 800 - rank: 4 max len: 754 min len: 591 avg len: 674.0 num_loss_counted_tokens: 3840
 total tokens: 6924 num samples: 4 num padding tokens: 344 - rank: 1 max len: 1731 min len: 1592 avg len: 1645.0 num_loss_counted_tokens: 954
 total tokens: 7018 num samples: 29 num padding tokens: 2429 - rank: 7 max len: 242 min len: 75 avg len: 158.24137931034483 num_loss_counted_tokens: 1886
 {
    "epoch": 0,
    "step": 19,
    "rank": 0,
    "loss": 0.006514357402920723,
    "overall_throughput": 42.48141431618144,
    "lr": 0.0,
    "cuda_mem_allocated": 18.00298833847046,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24985,
    "batch_size": 81,
    "total_loss": 0.7159909009933472,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:50.578884"
 }
 total tokens: 7980 num samples: 14 num padding tokens: 1456 - rank: 5 max len: 570 min len: 372 avg len: 466.0 num_loss_counted_tokens: 4358
 total tokens: 7658 num samples: 7 num padding tokens: 1001 - rank: 3 max len: 1094 min len: 779 avg len: 951.0 num_loss_counted_tokens: 3845
 total tokens: 6948 num samples: 3 num padding tokens: 823 - rank: 0 max len: 2316 min len: 1860 avg len: 2041.6666666666667 num_loss_counted_tokens: 2568
 total tokens: 8030 num samples: 22 num padding tokens: 1466 - rank: 6 max len: 365 min len: 248 avg len: 298.3636363636364 num_loss_counted_tokens: 3426
 total tokens: 7350 num samples: 5 num padding tokens: 547 - rank: 2 max len: 1470 min len: 1198 avg len: 1360.6 num_loss_counted_tokens: 936
 Per-token loss scaled by world size: 4.6467721404042095e-06
 Per-token loss scaled by world size: 1.0636807928676717e-05
 Per-token loss scaled by world size: 7.807435031281784e-05
 Per-token loss scaled by world size: 0.000532266276422888
 Per-token loss scaled by world size: 0.0005103853181935847
 Per-token loss scaled by world size: 0.0004919093335047364
 Per-token loss scaled by world size: 0.0005770818097516894
 Epoch: 0, Step: 20, Rank: 0, loss = 0.025356819853186607
 Epoch: 0, Step: 20, Rank: 2, loss = 0.18611949682235718
 Epoch: 0, Step: 20, Rank: 1, loss = 0.01107732392847538
 Epoch: 0, Step: 20, Rank: 6, loss = 1.2688562870025635
 Epoch: 0, Step: 20, Rank: 4, loss = 1.1726503372192383
 Epoch: 0, Step: 20, Rank: 7, loss = 1.2166948318481445
 Epoch: 0, Step: 20, Rank: 5, loss = 1.3756909370422363
 Per-token loss scaled by world size: 0.00012562941992655396
 Epoch: 0, Step: 20, Rank: 3, loss = 0.29948481917381287
 Epoch 0:  17%|█▋        | 20/121 [00:51<04:14,  2.52s/it] total tokens: 5636 num samples: 2 num padding tokens: 753 - rank: 1 max len: 2818 min len: 2065 avg len: 2441.5 num_loss_counted_tokens: 151
 total tokens: 8030 num samples: 11 num padding tokens: 694 - rank: 4 max len: 730 min len: 602 avg len: 666.9090909090909 num_loss_counted_tokens: 4998
 {
    "epoch": 0,
    "step": 20,
    "rank": 0,
    "loss": 0.025356819853186607,
    "overall_throughput": 42.41401913459215,
    "lr": 0.0,
    "cuda_mem_allocated": 18.149494647979736,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19071,
    "batch_size": 76,
    "total_loss": 0.6944913268089294,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:53.125315"
 }
 total tokens: 7472 num samples: 8 num padding tokens: 683 - rank: 3 max len: 934 min len: 765 avg len: 848.625 num_loss_counted_tokens: 4018
 total tokens: 7984 num samples: 4 num padding tokens: 1049 - rank: 2 max len: 1996 min len: 1356 avg len: 1733.75 num_loss_counted_tokens: 1408
 total tokens: 7440 num samples: 24 num padding tokens: 2407 - rank: 7 max len: 310 min len: 86 avg len: 209.70833333333334 num_loss_counted_tokens: 2199
 total tokens: 7635 num samples: 15 num padding tokens: 1392 - rank: 6 max len: 509 min len: 313 avg len: 416.2 num_loss_counted_tokens: 3986
 total tokens: 5858 num samples: 2 num padding tokens: 28 - rank: 0 max len: 2929 min len: 2901 avg len: 2915.0 num_loss_counted_tokens: 167
 total tokens: 7813 num samples: 13 num padding tokens: 588 - rank: 5 max len: 601 min len: 517 avg len: 555.7692307692307 num_loss_counted_tokens: 4116
 Per-token loss scaled by world size: 0.0003165990929119289
 Per-token loss scaled by world size: 0.00013289590424392372
 Per-token loss scaled by world size: 0.00024944794131442904
 Per-token loss scaled by world size: 0.00046029582154005766
 Per-token loss scaled by world size: 0.00031609757570549846
 Per-token loss scaled by world size: 0.000298293714877218
 Per-token loss scaled by world size: 0.0003350040642544627
 Epoch: 0, Step: 21, Rank: 1, loss = 0.3848499357700348
 Epoch: 0, Step: 21, Rank: 4, loss = 1.3329591751098633
 Epoch: 0, Step: 21, Rank: 2, loss = 0.7223700284957886
 Epoch: 0, Step: 21, Rank: 3, loss = 0.9168314337730408
 Epoch: 0, Step: 21, Rank: 6, loss = 0.9153790473937988
 Epoch: 0, Step: 21, Rank: 7, loss = 0.8638213276863098
 Epoch: 0, Step: 21, Rank: 5, loss = 0.9701299071311951
 Per-token loss scaled by world size: 2.031196345342323e-06
 Epoch: 0, Step: 21, Rank: 0, loss = 0.005882090888917446
 Epoch 0:  17%|█▋        | 21/121 [00:53<04:14,  2.55s/it] total tokens: 7450 num samples: 10 num padding tokens: 522 - rank: 4 max len: 745 min len: 656 avg len: 692.8 num_loss_counted_tokens: 3966
 total tokens: 6090 num samples: 3 num padding tokens: 775 - rank: 1 max len: 2030 min len: 1546 avg len: 1771.6666666666667 num_loss_counted_tokens: 2594
 {
    "epoch": 0,
    "step": 21,
    "rank": 0,
    "loss": 0.005882090888917446,
    "overall_throughput": 41.659196665794404,
    "lr": 0.0,
    "cuda_mem_allocated": 18.256999015808105,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23167,
    "batch_size": 94,
    "total_loss": 0.7640278339385986,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:55.723471"
 }
 total tokens: 7656 num samples: 12 num padding tokens: 886 - rank: 5 max len: 638 min len: 429 avg len: 564.1666666666666 num_loss_counted_tokens: 4895
 total tokens: 7392 num samples: 8 num padding tokens: 529 - rank: 3 max len: 924 min len: 753 avg len: 857.875 num_loss_counted_tokens: 5456
 total tokens: 7110 num samples: 3 num padding tokens: 189 - rank: 0 max len: 2370 min len: 2243 avg len: 2307.0 num_loss_counted_tokens: 433
 total tokens: 7885 num samples: 19 num padding tokens: 1785 - rank: 6 max len: 415 min len: 258 avg len: 321.05263157894734 num_loss_counted_tokens: 3103
 total tokens: 7453 num samples: 29 num padding tokens: 2759 - rank: 7 max len: 257 min len: 79 avg len: 161.86206896551724 num_loss_counted_tokens: 1834
 total tokens: 8082 num samples: 6 num padding tokens: 1602 - rank: 2 max len: 1347 min len: 925 avg len: 1080.0 num_loss_counted_tokens: 3092
 Per-token loss scaled by world size: 0.00046551000559702516
 Per-token loss scaled by world size: 0.00031762762228026986
 Per-token loss scaled by world size: 0.0006262522656470537
 Per-token loss scaled by world size: 0.00032212832593359053
 Per-token loss scaled by world size: 0.0005319734336808324
 Per-token loss scaled by world size: 7.452299178112298e-05
 Per-token loss scaled by world size: 8.590232027927414e-06
 Epoch: 0, Step: 22, Rank: 4, loss = 1.093308448791504
 Epoch: 0, Step: 22, Rank: 7, loss = 0.7459881901741028
 Epoch: 0, Step: 22, Rank: 6, loss = 1.4708317518234253
 Epoch: 0, Step: 22, Rank: 3, loss = 0.7565586566925049
 Epoch: 0, Step: 22, Rank: 5, loss = 1.249406099319458
 Epoch: 0, Step: 22, Rank: 2, loss = 0.1750265657901764
 Epoch: 0, Step: 22, Rank: 1, loss = 0.020175233483314514
 Per-token loss scaled by world size: 8.447452273685485e-06
 Epoch: 0, Step: 22, Rank: 0, loss = 0.019839897751808167
 Epoch 0:  18%|█▊        | 22/121 [00:56<04:16,  2.59s/it] total tokens: 7168 num samples: 7 num padding tokens: 1126 - rank: 4 max len: 1024 min len: 712 avg len: 863.1428571428571 num_loss_counted_tokens: 4103
 total tokens: 6628 num samples: 2 num padding tokens: 99 - rank: 1 max len: 3314 min len: 3215 avg len: 3264.5 num_loss_counted_tokens: 217
 {
    "epoch": 0,
    "step": 22,
    "rank": 0,
    "loss": 0.019839897751808167,
    "overall_throughput": 40.13505236925628,
    "lr": 0.0,
    "cuda_mem_allocated": 18.249396324157715,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18789,
    "batch_size": 66,
    "total_loss": 0.6913918256759644,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:48:58.416941"
 }
 total tokens: 7821 num samples: 11 num padding tokens: 1199 - rank: 5 max len: 711 min len: 507 avg len: 602.0 num_loss_counted_tokens: 4500
 total tokens: 7330 num samples: 5 num padding tokens: 1004 - rank: 3 max len: 1466 min len: 1087 avg len: 1265.2 num_loss_counted_tokens: 1284
 total tokens: 7936 num samples: 16 num padding tokens: 1776 - rank: 6 max len: 496 min len: 286 avg len: 385.0 num_loss_counted_tokens: 4076
 total tokens: 8070 num samples: 30 num padding tokens: 2921 - rank: 7 max len: 269 min len: 82 avg len: 171.63333333333333 num_loss_counted_tokens: 2107
 total tokens: 7552 num samples: 2 num padding tokens: 409 - rank: 0 max len: 3776 min len: 3367 avg len: 3571.5 num_loss_counted_tokens: 182
 total tokens: 8049 num samples: 3 num padding tokens: 2360 - rank: 2 max len: 2683 min len: 1483 avg len: 1896.3333333333333 num_loss_counted_tokens: 515
 Per-token loss scaled by world size: 0.0002995161630678922
 Per-token loss scaled by world size: 0.0001995390048250556
 Per-token loss scaled by world size: 4.880544565821765e-06
 Per-token loss scaled by world size: 0.00033291871659457684
 Per-token loss scaled by world size: 0.00010466719686519355
 Per-token loss scaled by world size: 0.00026131211780011654
 Per-token loss scaled by world size: 0.00020597832917701453
 Epoch: 0, Step: 23, Rank: 0, loss = 0.015341991558670998
 Epoch: 0, Step: 23, Rank: 2, loss = 0.6272508502006531
 Epoch: 0, Step: 23, Rank: 3, loss = 0.9415290355682373
 Epoch: 0, Step: 23, Rank: 5, loss = 1.04653000831604
 Epoch: 0, Step: 23, Rank: 7, loss = 0.8214346170425415
 Epoch: 0, Step: 23, Rank: 1, loss = 0.3290213346481323
 Epoch: 0, Step: 23, Rank: 4, loss = 0.6474928855895996
 Per-token loss scaled by world size: 0.00029968167655169964
 Epoch: 0, Step: 23, Rank: 6, loss = 0.9420493841171265
 Epoch 0:  19%|█▉        | 23/121 [00:59<04:11,  2.57s/it] total tokens: 7217 num samples: 7 num padding tokens: 812 - rank: 4 max len: 1031 min len: 810 avg len: 915.0 num_loss_counted_tokens: 5570
 total tokens: 7335 num samples: 3 num padding tokens: 893 - rank: 1 max len: 2445 min len: 1991 avg len: 2147.3333333333335 num_loss_counted_tokens: 303
 {
    "epoch": 0,
    "step": 23,
    "rank": 0,
    "loss": 0.015341991558670998,
    "overall_throughput": 43.085386891030744,
    "lr": 0.0,
    "cuda_mem_allocated": 18.126072883605957,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25148,
    "batch_size": 84,
    "total_loss": 0.6713312864303589,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:00.926547"
 }
 total tokens: 8025 num samples: 15 num padding tokens: 1566 - rank: 6 max len: 535 min len: 337 avg len: 430.6 num_loss_counted_tokens: 4270
 total tokens: 7872 num samples: 24 num padding tokens: 3099 - rank: 7 max len: 328 min len: 81 avg len: 198.875 num_loss_counted_tokens: 2380
 total tokens: 7830 num samples: 10 num padding tokens: 881 - rank: 5 max len: 783 min len: 590 avg len: 694.9 num_loss_counted_tokens: 5375
 total tokens: 7664 num samples: 4 num padding tokens: 1037 - rank: 2 max len: 1916 min len: 1514 avg len: 1656.75 num_loss_counted_tokens: 1726
 total tokens: 7794 num samples: 6 num padding tokens: 783 - rank: 3 max len: 1299 min len: 1032 avg len: 1168.5 num_loss_counted_tokens: 3507
 total tokens: 7062 num samples: 2 num padding tokens: 743 - rank: 0 max len: 3531 min len: 2788 avg len: 3159.5 num_loss_counted_tokens: 192
 Per-token loss scaled by world size: 0.00014580012066289783
 Per-token loss scaled by world size: 0.0005001741228625178
 Per-token loss scaled by world size: 0.0002923521969933063
 Per-token loss scaled by world size: 0.00035334189306013286
 Per-token loss scaled by world size: 0.00044065553811378777
 Per-token loss scaled by world size: 6.498985749203712e-05
 Epoch: 0, Step: 24, Rank: 2, loss = 0.37419599294662476
 Epoch: 0, Step: 24, Rank: 4, loss = 1.2836968898773193
 Epoch: 0, Step: 24, Rank: 6, loss = 0.9068519473075867
 Epoch: 0, Step: 24, Rank: 1, loss = 0.1667964607477188
 Epoch: 0, Step: 24, Rank: 3, loss = 0.7503219246864319
 Epoch: 0, Step: 24, Rank: 7, loss = 1.130942463874817
 Per-token loss scaled by world size: 0.0006109050591476262
 Epoch: 0, Step: 24, Rank: 5, loss = 1.567887783050537
 Per-token loss scaled by world size: 1.688349584583193e-05
 Epoch: 0, Step: 24, Rank: 0, loss = 0.04333149269223213
 Epoch 0:  20%|█▉        | 24/121 [01:01<04:07,  2.55s/it] total tokens: 7492 num samples: 4 num padding tokens: 997 - rank: 1 max len: 1873 min len: 1355 avg len: 1623.75 num_loss_counted_tokens: 1310
 total tokens: 8000 num samples: 10 num padding tokens: 696 - rank: 4 max len: 800 min len: 666 avg len: 730.4 num_loss_counted_tokens: 5339
 {
    "epoch": 0,
    "step": 24,
    "rank": 0,
    "loss": 0.04333149269223213,
    "overall_throughput": 42.861407474478405,
    "lr": 0.0,
    "cuda_mem_allocated": 18.177217483520508,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20532,
    "batch_size": 79,
    "total_loss": 0.7780030965805054,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:03.453834"
 }
 total tokens: 7904 num samples: 19 num padding tokens: 1180 - rank: 6 max len: 416 min len: 291 avg len: 353.89473684210526 num_loss_counted_tokens: 3756
 total tokens: 7920 num samples: 12 num padding tokens: 1406 - rank: 5 max len: 660 min len: 441 avg len: 542.8333333333334 num_loss_counted_tokens: 4093
 total tokens: 7830 num samples: 27 num padding tokens: 2551 - rank: 7 max len: 290 min len: 78 avg len: 195.5185185185185 num_loss_counted_tokens: 2643
 total tokens: 5734 num samples: 2 num padding tokens: 736 - rank: 0 max len: 2867 min len: 2131 avg len: 2499.0 num_loss_counted_tokens: 169
 total tokens: 8088 num samples: 6 num padding tokens: 1050 - rank: 2 max len: 1348 min len: 1049 avg len: 1173.0 num_loss_counted_tokens: 3708
 total tokens: 8040 num samples: 8 num padding tokens: 661 - rank: 3 max len: 1005 min len: 825 avg len: 922.375 num_loss_counted_tokens: 5105
 Per-token loss scaled by world size: 0.00023669454094488174
 Per-token loss scaled by world size: 0.00023404715466313064
 Per-token loss scaled by world size: 0.00022105168318375945
 Per-token loss scaled by world size: 0.00032421553623862565
 Per-token loss scaled by world size: 1.7478114386904053e-05
 Per-token loss scaled by world size: 0.00014570211351383477
 Per-token loss scaled by world size: 5.8525503845885396e-05
 Epoch: 0, Step: 25, Rank: 2, loss = 0.8244895935058594
 Epoch: 0, Step: 25, Rank: 4, loss = 0.7787098288536072
 Epoch: 0, Step: 25, Rank: 6, loss = 1.1421302556991577
 Epoch: 0, Step: 25, Rank: 3, loss = 0.8338156938552856
 Epoch: 0, Step: 25, Rank: 0, loss = 0.06157102435827255
 Epoch: 0, Step: 25, Rank: 1, loss = 0.2061707228422165
 Epoch: 0, Step: 25, Rank: 7, loss = 0.5132721066474915
 Per-token loss scaled by world size: 0.00032130838371813297
 Epoch: 0, Step: 25, Rank: 5, loss = 1.1318891048431396
 Epoch 0:  21%|██        | 25/121 [01:04<04:04,  2.55s/it] total tokens: 7385 num samples: 7 num padding tokens: 1010 - rank: 4 max len: 1055 min len: 819 avg len: 910.7142857142857 num_loss_counted_tokens: 2829
 total tokens: 6350 num samples: 2 num padding tokens: 809 - rank: 1 max len: 3175 min len: 2366 avg len: 2770.5 num_loss_counted_tokens: 1136
 total tokens: 7100 num samples: 25 num padding tokens: 2255 - rank: 7 max len: 284 min len: 85 avg len: 193.8 num_loss_counted_tokens: 2161
 {
    "epoch": 0,
    "step": 25,
    "rank": 0,
    "loss": 0.06157102435827255,
    "overall_throughput": 42.60322198576968,
    "lr": 0.0,
    "cuda_mem_allocated": 18.081379890441895,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 28182,
    "batch_size": 71,
    "total_loss": 0.6865060329437256,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:05.995018"
 }
 total tokens: 7800 num samples: 15 num padding tokens: 2044 - rank: 6 max len: 520 min len: 291 avg len: 383.73333333333335 num_loss_counted_tokens: 4084
 total tokens: 7464 num samples: 4 num padding tokens: 487 - rank: 2 max len: 1866 min len: 1672 avg len: 1744.25 num_loss_counted_tokens: 862
 total tokens: 7970 num samples: 10 num padding tokens: 1552 - rank: 5 max len: 797 min len: 528 avg len: 641.8 num_loss_counted_tokens: 3059
 total tokens: 7472 num samples: 2 num padding tokens: 315 - rank: 0 max len: 3736 min len: 3421 avg len: 3578.5 num_loss_counted_tokens: 178
 total tokens: 7635 num samples: 5 num padding tokens: 1225 - rank: 3 max len: 1527 min len: 1071 avg len: 1282.0 num_loss_counted_tokens: 2498
 Per-token loss scaled by world size: 0.0005163732566870749
 Per-token loss scaled by world size: 0.00043581612408161163
 Per-token loss scaled by world size: 0.0003451558295637369
 Per-token loss scaled by world size: 0.00011102599819423631
 Per-token loss scaled by world size: 7.73636857047677e-05
 Per-token loss scaled by world size: 1.7906730818140204e-06
 Per-token loss scaled by world size: 0.00031262030825018883
 Epoch: 0, Step: 26, Rank: 4, loss = 1.1685864925384521
 Epoch: 0, Step: 26, Rank: 2, loss = 0.2977023422718048
 Epoch: 0, Step: 26, Rank: 3, loss = 0.9254922270774841
 Epoch: 0, Step: 26, Rank: 0, loss = 0.004801466129720211
 Epoch: 0, Step: 26, Rank: 6, loss = 1.3845903873443604
 Epoch: 0, Step: 26, Rank: 1, loss = 0.207441046833992
 Epoch: 0, Step: 26, Rank: 7, loss = 0.8382523059844971
 Per-token loss scaled by world size: 0.0004404305072966963
 Epoch: 0, Step: 26, Rank: 5, loss = 1.1809593439102173
 Epoch 0:  21%|██▏       | 26/121 [01:06<04:02,  2.55s/it]
 total tokens: 8012 num samples: 4 num padding tokens: 1086 - rank: 1 max len: 2003 min len: 1474 avg len: 1731.5 num_loss_counted_tokens: 2801
 total tokens: 7784 num samples: 8 num padding tokens: 708 - rank: 4 max len: 973 min len: 760 avg len: 884.5 num_loss_counted_tokens: 4202
 {
    "epoch": 0,
    "step": 26,
    "rank": 0,
    "loss": 0.004801466129720211,
    "overall_throughput": 42.41860146698506,
    "lr": 0.0,
    "cuda_mem_allocated": 18.102412223815918,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21451,
    "batch_size": 82,
    "total_loss": 0.7509781718254089,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:08.547049"
 }
 total tokens: 6995 num samples: 5 num padding tokens: 461 - rank: 2 max len: 1399 min len: 1155 avg len: 1306.8 num_loss_counted_tokens: 2776
 total tokens: 7920 num samples: 11 num padding tokens: 1235 - rank: 5 max len: 720 min len: 497 avg len: 607.7272727272727 num_loss_counted_tokens: 4165
 total tokens: 7553 num samples: 7 num padding tokens: 401 - rank: 3 max len: 1079 min len: 985 avg len: 1021.7142857142857 num_loss_counted_tokens: 5788
 total tokens: 7540 num samples: 26 num padding tokens: 2466 - rank: 7 max len: 290 min len: 86 avg len: 195.15384615384616 num_loss_counted_tokens: 2386
 total tokens: 7776 num samples: 16 num padding tokens: 1715 - rank: 6 max len: 486 min len: 293 avg len: 378.8125 num_loss_counted_tokens: 3442
 total tokens: 8067 num samples: 3 num padding tokens: 761 - rank: 0 max len: 2689 min len: 2058 avg len: 2435.3333333333335 num_loss_counted_tokens: 864
 Per-token loss scaled by world size: 0.0004792925319634378
 Per-token loss scaled by world size: 0.0002461661642882973
 Per-token loss scaled by world size: 0.0004854958679061383
 Per-token loss scaled by world size: 3.6859477404505014e-05
 Per-token loss scaled by world size: 0.00043515616562217474
 Per-token loss scaled by world size: 3.782783096539788e-05
 Per-token loss scaled by world size: 1.9859728126903065e-05
 Epoch: 0, Step: 27, Rank: 3, loss = 0.5675053000450134
 Epoch: 0, Step: 27, Rank: 5, loss = 1.1192500591278076
 Epoch: 0, Step: 27, Rank: 0, loss = 0.08497491478919983
 Epoch: 0, Step: 27, Rank: 7, loss = 1.1049489974975586
 Epoch: 0, Step: 27, Rank: 4, loss = 1.0031981468200684
 Epoch: 0, Step: 27, Rank: 2, loss = 0.0457841195166111
 Epoch: 0, Step: 27, Rank: 1, loss = 0.08720733970403671
 Per-token loss scaled by world size: 0.0007065049139782786
 Epoch: 0, Step: 27, Rank: 6, loss = 1.6287587881088257
 Epoch 0:  22%|██▏       | 27/121 [01:09<03:58,  2.54s/it]
 total tokens: 7665 num samples: 3 num padding tokens: 261 - rank: 1 max len: 2555 min len: 2313 avg len: 2468.0 num_loss_counted_tokens: 2129
 total tokens: 7147 num samples: 7 num padding tokens: 782 - rank: 4 max len: 1021 min len: 860 avg len: 909.2857142857143 num_loss_counted_tokens: 4214
 total tokens: 7680 num samples: 10 num padding tokens: 867 - rank: 5 max len: 768 min len: 585 avg len: 681.3 num_loss_counted_tokens: 4557
 {
    "epoch": 0,
    "step": 27,
    "rank": 0,
    "loss": 0.08497491478919983,
    "overall_throughput": 43.23576441221499,
    "lr": 0.0,
    "cuda_mem_allocated": 18.077077388763428,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18443,
    "batch_size": 71,
    "total_loss": 0.7052034735679626,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:11.052914"
 }
 total tokens: 5712 num samples: 2 num padding tokens: 241 - rank: 0 max len: 2856 min len: 2615 avg len: 2735.5 num_loss_counted_tokens: 234
 total tokens: 6724 num samples: 4 num padding tokens: 898 - rank: 3 max len: 1681 min len: 1052 avg len: 1456.5 num_loss_counted_tokens: 2574
 total tokens: 8092 num samples: 14 num padding tokens: 1753 - rank: 6 max len: 578 min len: 324 avg len: 452.7857142857143 num_loss_counted_tokens: 2859
 total tokens: 8060 num samples: 4 num padding tokens: 681 - rank: 2 max len: 2015 min len: 1739 avg len: 1844.75 num_loss_counted_tokens: 813
 total tokens: 8092 num samples: 28 num padding tokens: 3097 - rank: 7 max len: 289 min len: 81 avg len: 178.39285714285714 num_loss_counted_tokens: 2310
 Per-token loss scaled by world size: 0.0003594549198169261
 Per-token loss scaled by world size: 0.0002452041080687195
 Per-token loss scaled by world size: 0.0001836109149735421
 Per-token loss scaled by world size: 0.00030733394669368863
 Per-token loss scaled by world size: 0.0003232009767089039
 Per-token loss scaled by world size: 0.000357407407136634
 Per-token loss scaled by world size: 3.657307388493791e-05
 Epoch: 0, Step: 28, Rank: 3, loss = 0.7112144827842712
 Epoch: 0, Step: 28, Rank: 2, loss = 0.5325634479522705
 Epoch: 0, Step: 28, Rank: 6, loss = 1.0425989627838135
 Epoch: 0, Step: 28, Rank: 4, loss = 0.8914220929145813
 Epoch: 0, Step: 28, Rank: 5, loss = 1.0366601943969727
 Epoch: 0, Step: 28, Rank: 7, loss = 0.9374444484710693
 Epoch: 0, Step: 28, Rank: 1, loss = 0.10608020424842834
 Per-token loss scaled by world size: 4.0188886487158015e-05
 Epoch: 0, Step: 28, Rank: 0, loss = 0.11656786501407623
 Epoch 0:  23%|██▎       | 28/121 [01:11<03:56,  2.55s/it]
 total tokens: 7986 num samples: 11 num padding tokens: 839 - rank: 4 max len: 726 min len: 536 avg len: 649.7272727272727 num_loss_counted_tokens: 4414
 total tokens: 7616 num samples: 4 num padding tokens: 1782 - rank: 1 max len: 1904 min len: 1228 avg len: 1458.5 num_loss_counted_tokens: 1665
 {
    "epoch": 0,
    "step": 28,
    "rank": 0,
    "loss": 0.11656786501407623,
    "overall_throughput": 42.00576591775989,
    "lr": 0.0,
    "cuda_mem_allocated": 18.240712642669678,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23204,
    "batch_size": 91,
    "total_loss": 0.6718189716339111,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:13.626212"
 }
 total tokens: 8040 num samples: 20 num padding tokens: 1891 - rank: 6 max len: 402 min len: 208 avg len: 307.45 num_loss_counted_tokens: 3493
 total tokens: 4944 num samples: 24 num padding tokens: 1763 - rank: 7 max len: 206 min len: 77 avg len: 132.54166666666666 num_loss_counted_tokens: 1195
 total tokens: 6352 num samples: 2 num padding tokens: 1168 - rank: 0 max len: 3176 min len: 2008 avg len: 2592.0 num_loss_counted_tokens: 1140
 total tokens: 7960 num samples: 10 num padding tokens: 318 - rank: 3 max len: 796 min len: 730 avg len: 764.2 num_loss_counted_tokens: 6637
 total tokens: 7294 num samples: 7 num padding tokens: 669 - rank: 2 max len: 1042 min len: 832 avg len: 946.4285714285714 num_loss_counted_tokens: 3958
 total tokens: 7905 num samples: 15 num padding tokens: 563 - rank: 5 max len: 527 min len: 412 avg len: 489.46666666666664 num_loss_counted_tokens: 4169
 Per-token loss scaled by world size: 0.0004329077200964093
 Per-token loss scaled by world size: 5.7912915508495644e-05
 Per-token loss scaled by world size: 0.0001508180284872651
 Per-token loss scaled by world size: 3.933108018827625e-06
 Per-token loss scaled by world size: 0.00037304253783077
 Per-token loss scaled by world size: 0.0003069050144404173
 Per-token loss scaled by world size: 0.00022216846991796046
 Epoch: 0, Step: 29, Rank: 0, loss = 0.0120353102684021
 Epoch: 0, Step: 29, Rank: 5, loss = 1.3246976137161255
 Epoch: 0, Step: 29, Rank: 1, loss = 0.17721351981163025
 Epoch: 0, Step: 29, Rank: 2, loss = 0.46150317788124084
 Epoch: 0, Step: 29, Rank: 4, loss = 0.6798354983329773
 Epoch: 0, Step: 29, Rank: 7, loss = 0.9391293525695801
 Epoch: 0, Step: 29, Rank: 6, loss = 1.1415101289749146
 Per-token loss scaled by world size: 0.00031356202089227736
 Epoch: 0, Step: 29, Rank: 3, loss = 0.9594997763633728
 Epoch 0:  24%|██▍       | 29/121 [01:14<03:55,  2.55s/it]
 total tokens: 8085 num samples: 11 num padding tokens: 512 - rank: 4 max len: 735 min len: 626 avg len: 688.4545454545455 num_loss_counted_tokens: 5373
 total tokens: 6820 num samples: 4 num padding tokens: 921 - rank: 1 max len: 1705 min len: 1306 avg len: 1474.75 num_loss_counted_tokens: 1453
 {
    "epoch": 0,
    "step": 29,
    "rank": 0,
    "loss": 0.0120353102684021,
    "overall_throughput": 42.0762720880897,
    "lr": 0.0,
    "cuda_mem_allocated": 18.163118362426758,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24480,
    "batch_size": 81,
    "total_loss": 0.711928129196167,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:16.196587"
 }
 total tokens: 8056 num samples: 8 num padding tokens: 879 - rank: 3 max len: 1007 min len: 755 avg len: 897.125 num_loss_counted_tokens: 5796
 total tokens: 8021 num samples: 13 num padding tokens: 880 - rank: 5 max len: 617 min len: 473 avg len: 549.3076923076923 num_loss_counted_tokens: 4711
 total tokens: 7992 num samples: 2 num padding tokens: 2114 - rank: 0 max len: 3996 min len: 1882 avg len: 2939.0 num_loss_counted_tokens: 854
 total tokens: 7803 num samples: 17 num padding tokens: 1971 - rank: 6 max len: 459 min len: 257 avg len: 343.05882352941177 num_loss_counted_tokens: 3525
 total tokens: 8064 num samples: 32 num padding tokens: 3179 - rank: 7 max len: 252 min len: 75 avg len: 152.65625 num_loss_counted_tokens: 2045
 total tokens: 7854 num samples: 7 num padding tokens: 337 - rank: 2 max len: 1122 min len: 1010 avg len: 1073.857142857143 num_loss_counted_tokens: 4505
 Per-token loss scaled by world size: 0.00014981771528255194
 Per-token loss scaled by world size: 0.00036861959961242974
 Per-token loss scaled by world size: 0.00042424429557286203
 Per-token loss scaled by world size: 6.0521342675201595e-06
 Per-token loss scaled by world size: 0.00027959863655269146
 Per-token loss scaled by world size: 0.00027874435181729496
 Per-token loss scaled by world size: 2.435126134514576e-06
 Epoch: 0, Step: 30, Rank: 6, loss = 1.0413503646850586
 Epoch: 0, Step: 30, Rank: 5, loss = 1.1984901428222656
 Epoch: 0, Step: 30, Rank: 2, loss = 0.4232350289821625
 Epoch: 0, Step: 30, Rank: 0, loss = 0.01709727942943573
 Epoch: 0, Step: 30, Rank: 4, loss = 0.7898661494255066
 Epoch: 0, Step: 30, Rank: 7, loss = 0.7874528169631958
 Epoch: 0, Step: 30, Rank: 1, loss = 0.0068792314268648624
 Per-token loss scaled by world size: 0.0004282180452719331
 Epoch: 0, Step: 30, Rank: 3, loss = 1.2097159624099731
 Epoch 0:  25%|██▍       | 30/121 [01:16<03:52,  2.56s/it]
 total tokens: 7893 num samples: 9 num padding tokens: 1033 - rank: 4 max len: 877 min len: 680 avg len: 762.2222222222222 num_loss_counted_tokens: 5017
 {
    "epoch": 0,
    "step": 30,
    "rank": 0,
    "loss": 0.01709727942943573,
    "overall_throughput": 42.42562330633586,
    "lr": 0.0,
    "cuda_mem_allocated": 17.776453971862793,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22600,
    "batch_size": 88,
    "total_loss": 0.684260904788971,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:18.756982"
 }
 total tokens: 6591 num samples: 3 num padding tokens: 424 - rank: 1 max len: 2197 min len: 1974 avg len: 2055.6666666666665 num_loss_counted_tokens: 2362
 total tokens: 7974 num samples: 2 num padding tokens: 1251 - rank: 0 max len: 3987 min len: 2736 avg len: 3361.5 num_loss_counted_tokens: 364
 total tokens: 7440 num samples: 6 num padding tokens: 982 - rank: 3 max len: 1240 min len: 953 avg len: 1076.3333333333333 num_loss_counted_tokens: 2782
 total tokens: 7192 num samples: 31 num padding tokens: 2694 - rank: 7 max len: 232 min len: 81 avg len: 145.09677419354838 num_loss_counted_tokens: 1761
 total tokens: 8056 num samples: 19 num padding tokens: 2186 - rank: 6 max len: 424 min len: 234 avg len: 308.94736842105266 num_loss_counted_tokens: 3218
 total tokens: 7920 num samples: 12 num padding tokens: 1246 - rank: 5 max len: 660 min len: 465 avg len: 556.1666666666666 num_loss_counted_tokens: 4600
 total tokens: 7096 num samples: 4 num padding tokens: 1017 - rank: 2 max len: 1774 min len: 1287 avg len: 1519.75 num_loss_counted_tokens: 2154
 Per-token loss scaled by world size: 0.00010154087067348883
 Per-token loss scaled by world size: 8.184825674106833e-06
 Per-token loss scaled by world size: 1.3264020708447788e-06
 Per-token loss scaled by world size: 0.0004786914505530149
 Per-token loss scaled by world size: 0.0006691211019642651
 Per-token loss scaled by world size: 0.0006968624657019973
 Per-token loss scaled by world size: 0.0005032268818467855
 Epoch: 0, Step: 31, Rank: 0, loss = 0.01914430782198906
 Epoch: 0, Step: 31, Rank: 6, loss = 1.1196593046188354
 Epoch: 0, Step: 31, Rank: 1, loss = 0.0031024543568491936
 Epoch: 0, Step: 31, Rank: 2, loss = 0.23750409483909607
 Epoch: 0, Step: 31, Rank: 4, loss = 1.6299612522125244
 Epoch: 0, Step: 31, Rank: 5, loss = 1.5650743246078491
 Epoch: 0, Step: 31, Rank: 7, loss = 1.1770477294921875
 Per-token loss scaled by world size: 0.00016472380957566202
 Epoch: 0, Step: 31, Rank: 3, loss = 0.3852889835834503
 Epoch 0:  26%|██▌       | 31/121 [01:19<03:49,  2.55s/it]
 total tokens: 5876 num samples: 2 num padding tokens: 1112 - rank: 1 max len: 2938 min len: 1826 avg len: 2382.0 num_loss_counted_tokens: 148
 total tokens: 7476 num samples: 7 num padding tokens: 1132 - rank: 4 max len: 1068 min len: 776 avg len: 906.2857142857143 num_loss_counted_tokens: 5713
 {
    "epoch": 0,
    "step": 31,
    "rank": 0,
    "loss": 0.01914430782198906,
    "overall_throughput": 42.41999987357855,
    "lr": 0.0,
    "cuda_mem_allocated": 18.21198320388794,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18712,
    "batch_size": 86,
    "total_loss": 0.7670978307723999,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:21.283912"
 }
 total tokens: 7170 num samples: 6 num padding tokens: 395 - rank: 3 max len: 1195 min len: 1074 avg len: 1129.1666666666667 num_loss_counted_tokens: 4609
 total tokens: 7808 num samples: 16 num padding tokens: 1337 - rank: 6 max len: 488 min len: 315 avg len: 404.4375 num_loss_counted_tokens: 4322
 total tokens: 6692 num samples: 4 num padding tokens: 769 - rank: 2 max len: 1673 min len: 1240 avg len: 1480.75 num_loss_counted_tokens: 983
 total tokens: 8036 num samples: 28 num padding tokens: 2902 - rank: 7 max len: 287 min len: 83 avg len: 183.35714285714286 num_loss_counted_tokens: 2351
 total tokens: 6152 num samples: 2 num padding tokens: 42 - rank: 0 max len: 3076 min len: 3034 avg len: 3055.0 num_loss_counted_tokens: 181
 total tokens: 8030 num samples: 11 num padding tokens: 1077 - rank: 5 max len: 730 min len: 530 avg len: 632.0909090909091 num_loss_counted_tokens: 4192
 Per-token loss scaled by world size: 0.0005925196455791593
 Per-token loss scaled by world size: 0.0004467185935936868
 Per-token loss scaled by world size: 0.0003141453198622912
 Per-token loss scaled by world size: 0.0005347821279428899
 Per-token loss scaled by world size: 7.6175206231710035e-06
 Per-token loss scaled by world size: 0.0005325234378688037
 Per-token loss scaled by world size: 7.270013156812638e-05
 Epoch: 0, Step: 32, Rank: 3, loss = 0.6862111687660217
 Epoch: 0, Step: 32, Rank: 6, loss = 1.2942850589752197
 Epoch: 0, Step: 32, Rank: 4, loss = 0.9758009314537048
 Epoch: 0, Step: 32, Rank: 1, loss = 0.016639521345496178
 Epoch: 0, Step: 32, Rank: 5, loss = 1.1681647300720215
 Epoch: 0, Step: 32, Rank: 0, loss = 0.15880435705184937
 Epoch: 0, Step: 32, Rank: 7, loss = 1.1632308959960938
 Per-token loss scaled by world size: 2.2659537535218988e-06
 Epoch: 0, Step: 32, Rank: 2, loss = 0.004949692636728287
 Epoch 0:  26%|██▋       | 32/121 [01:22<03:48,  2.57s/it]
 {
    "epoch": 0,
    "step": 32,
    "rank": 0,
    "loss": 0.15880435705184937,
    "overall_throughput": 41.6872406777795,
    "lr": 0.0,
    "cuda_mem_allocated": 17.77738618850708,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17475,
    "batch_size": 60,
    "total_loss": 0.6835108399391174,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:23.890605"
 }
 total tokens: 8016 num samples: 4 num padding tokens: 1255 - rank: 1 max len: 2004 min len: 1389 avg len: 1690.25 num_loss_counted_tokens: 3118
 total tokens: 7668 num samples: 9 num padding tokens: 475 - rank: 4 max len: 852 min len: 747 avg len: 799.2222222222222 num_loss_counted_tokens: 4806
 total tokens: 6662 num samples: 2 num padding tokens: 558 - rank: 0 max len: 3331 min len: 2773 avg len: 3052.0 num_loss_counted_tokens: 194
 total tokens: 6006 num samples: 22 num padding tokens: 1923 - rank: 7 max len: 273 min len: 87 avg len: 185.5909090909091 num_loss_counted_tokens: 1883
 total tokens: 8000 num samples: 16 num padding tokens: 1855 - rank: 6 max len: 500 min len: 279 avg len: 384.0625 num_loss_counted_tokens: 3217
 total tokens: 8043 num samples: 7 num padding tokens: 1088 - rank: 3 max len: 1149 min len: 859 avg len: 993.5714285714286 num_loss_counted_tokens: 2802
 total tokens: 7986 num samples: 6 num padding tokens: 566 - rank: 2 max len: 1331 min len: 1187 avg len: 1236.6666666666667 num_loss_counted_tokens: 1663
 total tokens: 7920 num samples: 11 num padding tokens: 899 - rank: 5 max len: 720 min len: 534 avg len: 638.2727272727273 num_loss_counted_tokens: 5060
 Per-token loss scaled by world size: 0.00021643155196215957
 Per-token loss scaled by world size: 0.00016316254914272577
 Per-token loss scaled by world size: 0.0003613443404901773
 Per-token loss scaled by world size: 0.0002246944495709613
 Per-token loss scaled by world size: 0.00030351741588674486
 Per-token loss scaled by world size: 1.8829136934073176e-06
 Per-token loss scaled by world size: 0.0001424902438884601
 Epoch: 0, Step: 33, Rank: 6, loss = 0.7259596586227417
 Epoch: 0, Step: 33, Rank: 2, loss = 0.6992632746696472
 Epoch: 0, Step: 33, Rank: 1, loss = 0.5271577835083008
 Epoch: 0, Step: 33, Rank: 5, loss = 1.167458415031433
 Epoch: 0, Step: 33, Rank: 0, loss = 0.006083458662033081
 Epoch: 0, Step: 33, Rank: 4, loss = 0.9806268215179443
 Epoch: 0, Step: 33, Rank: 7, loss = 0.46036818623542786
 Per-token loss scaled by world size: 0.0003278390795458108
 Epoch: 0, Step: 33, Rank: 3, loss = 1.0592070817947388
 Epoch 0:  27%|██▋       | 33/121 [01:24<03:44,  2.55s/it]
 total tokens: 7733 num samples: 11 num padding tokens: 698 - rank: 4 max len: 703 min len: 598 avg len: 639.5454545454545 num_loss_counted_tokens: 4778
 total tokens: 6987 num samples: 3 num padding tokens: 912 - rank: 1 max len: 2329 min len: 1634 avg len: 2025.0 num_loss_counted_tokens: 1273
 {
    "epoch": 0,
    "step": 33,
    "rank": 0,
    "loss": 0.006083458662033081,
    "overall_throughput": 42.6693002092833,
    "lr": 0.0,
    "cuda_mem_allocated": 18.08671236038208,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25847,
    "batch_size": 82,
    "total_loss": 0.7032655477523804,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:26.403211"
 }
 total tokens: 7062 num samples: 6 num padding tokens: 1127 - rank: 2 max len: 1177 min len: 929 avg len: 989.1666666666666 num_loss_counted_tokens: 3984
 total tokens: 7810 num samples: 22 num padding tokens: 1750 - rank: 6 max len: 355 min len: 236 avg len: 275.45454545454544 num_loss_counted_tokens: 2894
 total tokens: 7312 num samples: 8 num padding tokens: 597 - rank: 3 max len: 914 min len: 774 avg len: 839.375 num_loss_counted_tokens: 4040
 total tokens: 7990 num samples: 34 num padding tokens: 2478 - rank: 7 max len: 235 min len: 87 avg len: 162.11764705882354 num_loss_counted_tokens: 1905
 total tokens: 7683 num samples: 13 num padding tokens: 1541 - rank: 5 max len: 591 min len: 394 avg len: 472.46153846153845 num_loss_counted_tokens: 3832
 total tokens: 6662 num samples: 2 num padding tokens: 451 - rank: 0 max len: 3331 min len: 2880 avg len: 3105.5 num_loss_counted_tokens: 160
 Per-token loss scaled by world size: 0.0002979582059197128
 Per-token loss scaled by world size: 5.8393885410623625e-05
 Per-token loss scaled by world size: 8.860254183673533e-07
 Per-token loss scaled by world size: 0.00030089422944001853
 Per-token loss scaled by world size: 0.0003473999386187643
 Per-token loss scaled by world size: 0.00043760568951256573
 Per-token loss scaled by world size: 0.00025744541198946536
 Epoch: 0, Step: 34, Rank: 0, loss = 0.002579330699518323
 Epoch: 0, Step: 34, Rank: 1, loss = 0.1699918955564499
 Epoch: 0, Step: 34, Rank: 2, loss = 0.8673935532569885
 Epoch: 0, Step: 34, Rank: 4, loss = 0.87594074010849
 Epoch: 0, Step: 34, Rank: 5, loss = 1.2739248275756836
 Epoch: 0, Step: 34, Rank: 6, loss = 1.0113246440887451
 Epoch: 0, Step: 34, Rank: 7, loss = 0.7494558095932007
 Per-token loss scaled by world size: 0.00023933911870699376
 Epoch: 0, Step: 34, Rank: 3, loss = 0.6967461109161377
 Epoch 0:  28%|██▊       | 34/121 [01:27<03:42,  2.56s/it]
 total tokens: 7905 num samples: 3 num padding tokens: 745 - rank: 1 max len: 2635 min len: 2036 avg len: 2386.6666666666665 num_loss_counted_tokens: 966
 total tokens: 7504 num samples: 7 num padding tokens: 972 - rank: 4 max len: 1072 min len: 783 avg len: 933.1428571428571 num_loss_counted_tokens: 3983
 {
    "epoch": 0,
    "step": 34,
    "rank": 0,
    "loss": 0.002579330699518323,
    "overall_throughput": 41.96938385940714,
    "lr": 0.0,
    "cuda_mem_allocated": 18.149734020233154,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23289,
    "batch_size": 81,
    "total_loss": 0.7059195637702942,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:28.982671"
 }
 total tokens: 7917 num samples: 13 num padding tokens: 2041 - rank: 6 max len: 609 min len: 308 avg len: 452.0 num_loss_counted_tokens: 4345
 total tokens: 7310 num samples: 5 num padding tokens: 1026 - rank: 3 max len: 1462 min len: 1079 avg len: 1256.8 num_loss_counted_tokens: 3091
 total tokens: 6994 num samples: 26 num padding tokens: 2708 - rank: 7 max len: 269 min len: 78 avg len: 164.84615384615384 num_loss_counted_tokens: 1792
 total tokens: 7628 num samples: 4 num padding tokens: 1180 - rank: 2 max len: 1907 min len: 1481 avg len: 1612.0 num_loss_counted_tokens: 3200
 total tokens: 6582 num samples: 2 num padding tokens: 537 - rank: 0 max len: 3291 min len: 2754 avg len: 3022.5 num_loss_counted_tokens: 226
 total tokens: 7740 num samples: 10 num padding tokens: 701 - rank: 5 max len: 774 min len: 622 avg len: 703.9 num_loss_counted_tokens: 5163
 Per-token loss scaled by world size: 0.00038499291986227036
 Per-token loss scaled by world size: 0.00033336589694954455
 Per-token loss scaled by world size: 6.31858165434096e-06
 Per-token loss scaled by world size: 0.0004092359740752727
 Per-token loss scaled by world size: 0.000349278881913051
 Per-token loss scaled by world size: 9.329826571047306e-05
 Per-token loss scaled by world size: 0.000329840142512694
 Epoch: 0, Step: 35, Rank: 1, loss = 0.8665429949760437
 Epoch: 0, Step: 35, Rank: 6, loss = 0.9079067707061768
 Epoch: 0, Step: 35, Rank: 3, loss = 1.0007410049438477
 Epoch: 0, Step: 35, Rank: 0, loss = 0.01642436347901821
 Epoch: 0, Step: 35, Rank: 4, loss = 1.0637577772140503
 Epoch: 0, Step: 35, Rank: 2, loss = 0.24251717329025269
 Epoch: 0, Step: 35, Rank: 7, loss = 0.8573781847953796
 Per-token loss scaled by world size: 0.0002692708803806454
 Epoch: 0, Step: 35, Rank: 5, loss = 0.6999359726905823
 Epoch 0:  29%|██▉       | 35/121 [01:29<03:38,  2.55s/it]
 total tokens: 7188 num samples: 4 num padding tokens: 317 - rank: 1 max len: 1797 min len: 1593 avg len: 1717.75 num_loss_counted_tokens: 2604
 total tokens: 7240 num samples: 8 num padding tokens: 858 - rank: 4 max len: 905 min len: 728 avg len: 797.75 num_loss_counted_tokens: 5344
 {
    "epoch": 0,
    "step": 35,
    "rank": 0,
    "loss": 0.01642436347901821,
    "overall_throughput": 43.07030586294916,
    "lr": 0.0,
    "cuda_mem_allocated": 18.10886526107788,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20795,
    "batch_size": 73,
    "total_loss": 0.7069005370140076,
    "gradnorm": null,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:31.496950"
 }
 total tokens: 7574 num samples: 7 num padding tokens: 648 - rank: 3 max len: 1082 min len: 913 avg len: 989.4285714285714 num_loss_counted_tokens: 4812
 total tokens: 7353 num samples: 3 num padding tokens: 1218 - rank: 0 max len: 2451 min len: 1808 avg len: 2045.0 num_loss_counted_tokens: 890
 total tokens: 7225 num samples: 5 num padding tokens: 790 - rank: 2 max len: 1445 min len: 1127 avg len: 1287.0 num_loss_counted_tokens: 2573
 total tokens: 7777 num samples: 11 num padding tokens: 857 - rank: 5 max len: 707 min len: 516 avg len: 629.0909090909091 num_loss_counted_tokens: 3335
 total tokens: 7672 num samples: 28 num padding tokens: 2867 - rank: 7 max len: 274 min len: 75 avg len: 171.60714285714286 num_loss_counted_tokens: 2189
 total tokens: 7650 num samples: 15 num padding tokens: 1519 - rank: 6 max len: 510 min len: 298 avg len: 408.73333333333335 num_loss_counted_tokens: 3925
 Per-token loss scaled by world size: 0.00017819351342041045
 Per-token loss scaled by world size: 0.00010071766882902011
 Per-token loss scaled by world size: 0.00047568074660375714
 Per-token loss scaled by world size: 0.00022655159409623593
 Per-token loss scaled by world size: 0.00010502615623408929
 Per-token loss scaled by world size: 0.0003213490708731115
 Per-token loss scaled by world size: 0.00031152847805060446
 Epoch: 0, Step: 36, Rank: 3, loss = 0.6177212595939636
 Epoch: 0, Step: 36, Rank: 2, loss = 0.28636693954467773
 Epoch: 0, Step: 36, Rank: 5, loss = 1.2970030307769775
 Epoch: 0, Step: 36, Rank: 4, loss = 0.876198410987854
 Epoch: 0, Step: 36, Rank: 0, loss = 0.485866904258728
 Epoch: 0, Step: 36, Rank: 1, loss = 0.27461931109428406
 Epoch: 0, Step: 36, Rank: 7, loss = 0.8494213223457336
 Per-token loss scaled by world size: 0.0004315480182413012
 Epoch: 0, Step: 36, Rank: 6, loss = 1.1766695976257324
 [2024-08-18 20:49:34,071] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[8.000000000000001e-07], mom=[(0.9, 0.95)]
 Epoch 0:  30%|██▉       | 36/121 [01:32<03:38,  2.57s/it]
 total tokens: 8019 num samples: 11 num padding tokens: 830 - rank: 4 max len: 729 min len: 589 avg len: 653.5454545454545 num_loss_counted_tokens: 4361
 total tokens: 7575 num samples: 5 num padding tokens: 525 - rank: 1 max len: 1515 min len: 1255 avg len: 1410.0 num_loss_counted_tokens: 2654
 {
    "epoch": 0,
    "step": 36,
    "rank": 0,
    "loss": 0.485866904258728,
    "overall_throughput": 41.05076993788474,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 22.813036918640137,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21813,
    "batch_size": 94,
    "total_loss": 0.7329833507537842,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:34.202497"
 }
 total tokens: 8078 num samples: 14 num padding tokens: 1294 - rank: 5 max len: 577 min len: 385 avg len: 484.57142857142856 num_loss_counted_tokens: 4682
 total tokens: 8085 num samples: 21 num padding tokens: 1544 - rank: 6 max len: 385 min len: 243 avg len: 311.4761904761905 num_loss_counted_tokens: 3795
 total tokens: 7680 num samples: 32 num padding tokens: 2281 - rank: 7 max len: 240 min len: 78 avg len: 168.71875 num_loss_counted_tokens: 2460
 total tokens: 7004 num samples: 2 num padding tokens: 1514 - rank: 0 max len: 3502 min len: 1988 avg len: 2745.0 num_loss_counted_tokens: 188
 total tokens: 7840 num samples: 7 num padding tokens: 725 - rank: 2 max len: 1120 min len: 917 avg len: 1016.4285714285714 num_loss_counted_tokens: 4215
 total tokens: 7911 num samples: 9 num padding tokens: 701 - rank: 3 max len: 879 min len: 758 avg len: 801.1111111111111 num_loss_counted_tokens: 5865
 Per-token loss scaled by world size: 0.0003182947402819991
 Per-token loss scaled by world size: 0.00048430776223540306
 Per-token loss scaled by world size: 0.00047009342233650386
 Per-token loss scaled by world size: 0.00042154916445724666
 Per-token loss scaled by world size: 5.139104814588791e-06
 Per-token loss scaled by world size: 0.0004850963596254587
 Epoch: 0, Step: 37, Rank: 4, loss = 1.2365219593048096
 Epoch: 0, Step: 37, Rank: 5, loss = 1.2739109992980957
 Epoch: 0, Step: 37, Rank: 3, loss = 0.8372344970703125
 Epoch: 0, Step: 37, Rank: 7, loss = 1.1088323593139648
 Epoch: 0, Step: 37, Rank: 0, loss = 0.013517772778868675
 Per-token loss scaled by world size: 2.1868495423404966e-06
 Epoch: 0, Step: 37, Rank: 6, loss = 1.2759853601455688
 Per-token loss scaled by world size: 0.00011173654638696462
 Epoch: 0, Step: 37, Rank: 1, loss = 0.005752234254032373
 Epoch: 0, Step: 37, Rank: 2, loss = 0.2939090132713318
 Epoch 0:  31%|███       | 37/121 [01:34<03:36,  2.58s/it]
 total tokens: 8060 num samples: 10 num padding tokens: 654 - rank: 4 max len: 806 min len: 663 avg len: 740.6 num_loss_counted_tokens: 4953
 total tokens: 7920 num samples: 4 num padding tokens: 1463 - rank: 1 max len: 1980 min len: 1324 avg len: 1614.25 num_loss_counted_tokens: 2571
 {
    "epoch": 0,
    "step": 37,
    "rank": 0,
    "loss": 0.013517772778868675,
    "overall_throughput": 42.263056677916275,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.268142223358154,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21043,
    "batch_size": 79,
    "total_loss": 0.7557079792022705,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:36.725200"
 }
 total tokens: 7872 num samples: 12 num padding tokens: 1903 - rank: 5 max len: 656 min len: 391 avg len: 497.4166666666667 num_loss_counted_tokens: 3985
 total tokens: 7497 num samples: 7 num padding tokens: 1399 - rank: 3 max len: 1071 min len: 808 avg len: 871.1428571428571 num_loss_counted_tokens: 4584
 total tokens: 7800 num samples: 20 num padding tokens: 2069 - rank: 6 max len: 390 min len: 221 avg len: 286.55 num_loss_counted_tokens: 3264
 total tokens: 7914 num samples: 6 num padding tokens: 519 - rank: 2 max len: 1319 min len: 1143 avg len: 1232.5 num_loss_counted_tokens: 2296
 total tokens: 6765 num samples: 3 num padding tokens: 359 - rank: 0 max len: 2255 min len: 2010 avg len: 2135.3333333333335 num_loss_counted_tokens: 396
 total tokens: 4796 num samples: 22 num padding tokens: 1785 - rank: 7 max len: 218 min len: 77 avg len: 136.86363636363637 num_loss_counted_tokens: 1137
 Per-token loss scaled by world size: 0.00023003398382570595Per-token loss scaled by world size: 0.00023557165695820004Per-token loss scaled by world size: 0.0003318150993436575Per-token loss scaled by world size: 3.583161378628574e-05Per-token loss scaled by world size: 0.00029603790608234704
 Per-token loss scaled by world size: 0.00038251461228355765


 Per-token loss scaled by world size: 0.00028330745408311486


 Epoch: 0, Step: 38, Rank: 6, loss = 0.7295815348625183
 Epoch: 0, Step: 38, Rank: 0, loss = 0.11364444345235825
 Epoch: 0, Step: 38, Rank: 7, loss = 0.7471449375152588
 Epoch: 0, Step: 38, Rank: 1, loss = 1.0523930788040161
 Epoch: 0, Step: 38, Rank: 4, loss = 0.8985450267791748
 Epoch: 0, Step: 38, Rank: 3, loss = 0.9389212131500244
 Epoch: 0, Step: 38, Rank: 5, loss = 1.2131929397583008
 Per-token loss scaled by world size: 0.00017416744958609343
 Epoch: 0, Step: 38, Rank: 2, loss = 0.5523938536643982
 Epoch 0:  31%|███▏      | 38/121 [01:37<03:32,  2.56s/it] total tokens: 8055 num samples: 9 num padding tokens: 294 - rank: 4 max len: 895 min len: 809 avg len: 862.3333333333334 num_loss_counted_tokens: 5164
 total tokens: 8007 num samples: 3 num padding tokens: 615 - rank: 1 max len: 2669 min len: 2320 avg len: 2464.0 num_loss_counted_tokens: 299
 total tokens: 7556 num samples: 4 num padding tokens: 692 - rank: 2 max len: 1889 min len: 1486 avg len: 1716.0 num_loss_counted_tokens: 3448
 total tokens: 7990 num samples: 10 num padding tokens: 1519 - rank: 5 max len: 799 min len: 505 avg len: 647.1 num_loss_counted_tokens: 3664
 {
    "epoch": 0,
    "step": 38,
    "rank": 0,
    "loss": 0.11364444345235825,
    "overall_throughput": 42.686060661097876,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.416475296020508,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25373,
    "batch_size": 90,
    "total_loss": 0.7807271480560303,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:39.251905"
 }
 total tokens: 8064 num samples: 16 num padding tokens: 1612 - rank: 6 max len: 504 min len: 282 avg len: 403.25 num_loss_counted_tokens: 3518
 total tokens: 7266 num samples: 6 num padding tokens: 1143 - rank: 3 max len: 1211 min len: 903 avg len: 1020.5 num_loss_counted_tokens: 5513
 total tokens: 6204 num samples: 22 num padding tokens: 2569 - rank: 7 max len: 282 min len: 76 avg len: 165.22727272727272 num_loss_counted_tokens: 1531
 total tokens: 7148 num samples: 2 num padding tokens: 587 - rank: 0 max len: 3574 min len: 2987 avg len: 3280.5 num_loss_counted_tokens: 224
 Per-token loss scaled by world size: 0.000557654129806906
 Per-token loss scaled by world size: 0.0007022880017757416
 Per-token loss scaled by world size: 0.0008294832659885287
 Per-token loss scaled by world size: 0.00019126593542750925
 Per-token loss scaled by world size: 0.00040766337770037353
 Per-token loss scaled by world size: 6.8775539148191456e-06
 Per-token loss scaled by world size: 7.391309281956637e-06
 Epoch: 0, Step: 39, Rank: 6, loss = 1.4909573793411255
 Epoch: 0, Step: 39, Rank: 0, loss = 0.014601047150790691
 Epoch: 0, Step: 39, Rank: 5, loss = 1.7609930038452148
 Epoch: 0, Step: 39, Rank: 7, loss = 1.1838997602462769
 Epoch: 0, Step: 39, Rank: 3, loss = 0.40605756640434265
 Epoch: 0, Step: 39, Rank: 4, loss = 0.8654693365097046
 Epoch: 0, Step: 39, Rank: 1, loss = 0.01569174975156784
 Per-token loss scaled by world size: 8.316225284943357e-05
 Epoch: 0, Step: 39, Rank: 2, loss = 0.17655345797538757
 Epoch 0:  32%|███▏      | 39/121 [01:39<03:29,  2.55s/it] total tokens: 7136 num samples: 4 num padding tokens: 383 - rank: 1 max len: 1784 min len: 1652 avg len: 1688.25 num_loss_counted_tokens: 2347
 total tokens: 7360 num samples: 8 num padding tokens: 773 - rank: 4 max len: 920 min len: 750 avg len: 823.375 num_loss_counted_tokens: 5270
 {
    "epoch": 0,
    "step": 39,
    "rank": 0,
    "loss": 0.014601047150790691,
    "overall_throughput": 42.970882858906755,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.469857692718506,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16984,
    "batch_size": 76,
    "total_loss": 0.7392778992652893,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:41.778226"
 }
 total tokens: 7925 num samples: 5 num padding tokens: 458 - rank: 2 max len: 1585 min len: 1370 avg len: 1493.4 num_loss_counted_tokens: 2405
 total tokens: 8022 num samples: 14 num padding tokens: 2297 - rank: 6 max len: 573 min len: 288 avg len: 408.92857142857144 num_loss_counted_tokens: 2953
 total tokens: 7733 num samples: 11 num padding tokens: 641 - rank: 5 max len: 703 min len: 579 avg len: 644.7272727272727 num_loss_counted_tokens: 5059
 total tokens: 7326 num samples: 6 num padding tokens: 857 - rank: 3 max len: 1221 min len: 972 avg len: 1078.1666666666667 num_loss_counted_tokens: 4306
 total tokens: 5446 num samples: 2 num padding tokens: 263 - rank: 0 max len: 2723 min len: 2460 avg len: 2591.5 num_loss_counted_tokens: 241
 total tokens: 8100 num samples: 30 num padding tokens: 2620 - rank: 7 max len: 270 min len: 87 avg len: 182.66666666666666 num_loss_counted_tokens: 2660
 Per-token loss scaled by world size: 0.0004144566773902625
 Per-token loss scaled by world size: 0.0004883252549916506
 Per-token loss scaled by world size: 0.00025119862402789295
 Per-token loss scaled by world size: 0.0002532459329813719
 Per-token loss scaled by world size: 0.0001632209459785372
 Per-token loss scaled by world size: 6.004169335938059e-06
 Per-token loss scaled by world size: 2.448785608066828e-06
 Epoch: 0, Step: 40, Rank: 0, loss = 0.017504405230283737
 Epoch: 0, Step: 40, Rank: 5, loss = 1.4236512184143066
 Epoch: 0, Step: 40, Rank: 7, loss = 0.7323381900787354
 Epoch: 0, Step: 40, Rank: 2, loss = 0.47585028409957886
 Epoch: 0, Step: 40, Rank: 4, loss = 0.7383068799972534
 Epoch: 0, Step: 40, Rank: 6, loss = 1.2082966566085815
 Epoch: 0, Step: 40, Rank: 1, loss = 0.007139128167182207
 Per-token loss scaled by world size: 0.00020900421077385545
 Epoch: 0, Step: 40, Rank: 3, loss = 0.609325647354126
 Epoch 0:  33%|███▎      | 40/121 [01:42<03:25,  2.54s/it] total tokens: 5720 num samples: 2 num padding tokens: 100 - rank: 1 max len: 2860 min len: 2760 avg len: 2810.0 num_loss_counted_tokens: 172
 total tokens: 7550 num samples: 10 num padding tokens: 676 - rank: 4 max len: 755 min len: 637 avg len: 687.4 num_loss_counted_tokens: 3236
 {
    "epoch": 0,
    "step": 40,
    "rank": 0,
    "loss": 0.017504405230283737,
    "overall_throughput": 43.000670432034845,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.411304473876953,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23323,
    "batch_size": 71,
    "total_loss": 0.6515514850616455,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:44.295336"
 }
 total tokens: 7617 num samples: 3 num padding tokens: 1777 - rank: 2 max len: 2539 min len: 1345 avg len: 1946.6666666666667 num_loss_counted_tokens: 824
 total tokens: 7627 num samples: 29 num padding tokens: 2077 - rank: 7 max len: 263 min len: 91 avg len: 191.3793103448276 num_loss_counted_tokens: 2397
 total tokens: 8024 num samples: 8 num padding tokens: 1091 - rank: 3 max len: 1003 min len: 759 avg len: 866.625 num_loss_counted_tokens: 3193
 total tokens: 7800 num samples: 20 num padding tokens: 1486 - rank: 6 max len: 390 min len: 264 avg len: 315.7 num_loss_counted_tokens: 3339
 total tokens: 6426 num samples: 2 num padding tokens: 109 - rank: 0 max len: 3213 min len: 3104 avg len: 3158.5 num_loss_counted_tokens: 160
 total tokens: 8099 num samples: 13 num padding tokens: 1516 - rank: 5 max len: 623 min len: 406 avg len: 506.38461538461536 num_loss_counted_tokens: 4165
 Per-token loss scaled by world size: 0.00030310056172311306
 Per-token loss scaled by world size: 0.0002651209069881588
 Per-token loss scaled by world size: 0.00026846988475881517
 Per-token loss scaled by world size: 0.00022448382514994591
 Per-token loss scaled by world size: 2.106957708747359e-06
 Per-token loss scaled by world size: 0.0002726210805121809
 Per-token loss scaled by world size: 9.102401236305013e-05
 Epoch: 0, Step: 41, Rank: 2, loss = 0.8657191395759583
 Epoch: 0, Step: 41, Rank: 6, loss = 0.876654863357544
 Epoch: 0, Step: 41, Rank: 5, loss = 0.9897370338439941
 Epoch: 0, Step: 41, Rank: 4, loss = 0.7330238819122314
 Epoch: 0, Step: 41, Rank: 7, loss = 0.8902100324630737
 Epoch: 0, Step: 41, Rank: 0, loss = 0.006880007218569517
 Epoch: 0, Step: 41, Rank: 1, loss = 0.29722753167152405
 Per-token loss scaled by world size: 0.00029196811374276876
 Epoch: 0, Step: 41, Rank: 3, loss = 0.9533854126930237
 Epoch 0:  34%|███▍      | 41/121 [01:45<03:23,  2.55s/it] total tokens: 7158 num samples: 3 num padding tokens: 546 - rank: 1 max len: 2386 min len: 2074 avg len: 2204.0 num_loss_counted_tokens: 274
 total tokens: 8024 num samples: 8 num padding tokens: 1137 - rank: 4 max len: 1003 min len: 719 avg len: 860.875 num_loss_counted_tokens: 6152
 {
    "epoch": 0,
    "step": 41,
    "rank": 0,
    "loss": 0.006880007218569517,
    "overall_throughput": 42.297091054292345,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.25260829925537,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26123,
    "batch_size": 88,
    "total_loss": 0.7016047239303589,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:46.855893"
 }
 total tokens: 6468 num samples: 22 num padding tokens: 2174 - rank: 7 max len: 294 min len: 86 avg len: 195.1818181818182 num_loss_counted_tokens: 2218
 total tokens: 8112 num samples: 4 num padding tokens: 1504 - rank: 2 max len: 2028 min len: 1392 avg len: 1652.0 num_loss_counted_tokens: 1907
 total tokens: 7936 num samples: 16 num padding tokens: 1181 - rank: 6 max len: 496 min len: 317 avg len: 422.1875 num_loss_counted_tokens: 3813
 total tokens: 5644 num samples: 2 num padding tokens: 18 - rank: 0 max len: 2822 min len: 2804 avg len: 2813.0 num_loss_counted_tokens: 165
 total tokens: 7667 num samples: 11 num padding tokens: 1375 - rank: 5 max len: 697 min len: 508 avg len: 572.0 num_loss_counted_tokens: 4263
 total tokens: 6950 num samples: 5 num padding tokens: 1094 - rank: 3 max len: 1390 min len: 1006 avg len: 1171.2 num_loss_counted_tokens: 2509
 Per-token loss scaled by world size: 0.0001050201608450152
 Per-token loss scaled by world size: 0.00023053436598274857
 Per-token loss scaled by world size: 0.0009069386287592351
 Per-token loss scaled by world size: 0.0004071406729053706
 Per-token loss scaled by world size: 0.0005334314191713929
 Per-token loss scaled by world size: 4.823124982067384e-06
 Per-token loss scaled by world size: 7.580199599033222e-05
 Epoch: 0, Step: 42, Rank: 3, loss = 0.4843238890171051
 Epoch: 0, Step: 42, Rank: 6, loss = 1.9053646326065063
 Epoch: 0, Step: 42, Rank: 0, loss = 0.010132783092558384
 Epoch: 0, Step: 42, Rank: 4, loss = 0.8553516864776611
 Epoch: 0, Step: 42, Rank: 2, loss = 0.22063423693180084
 Epoch: 0, Step: 42, Rank: 7, loss = 1.1206727027893066
 Epoch: 0, Step: 42, Rank: 1, loss = 0.15925051271915436
 Per-token loss scaled by world size: 0.0006252930616028607
 Epoch: 0, Step: 42, Rank: 5, loss = 1.3136625289916992
 Epoch 0:  35%|███▍      | 42/121 [01:47<03:20,  2.53s/it] total tokens: 8063 num samples: 11 num padding tokens: 996 - rank: 4 max len: 733 min len: 561 avg len: 642.4545454545455 num_loss_counted_tokens: 4498
 total tokens: 6616 num samples: 4 num padding tokens: 1100 - rank: 1 max len: 1654 min len: 1144 avg len: 1379.0 num_loss_counted_tokens: 966
 {
    "epoch": 0,
    "step": 42,
    "rank": 0,
    "loss": 0.010132783092558384,
    "overall_throughput": 43.158935099078796,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.46029806137085,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16807,
    "batch_size": 70,
    "total_loss": 0.758674144744873,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:49.361713"
 }
 total tokens: 7756 num samples: 14 num padding tokens: 706 - rank: 5 max len: 554 min len: 470 avg len: 503.57142857142856 num_loss_counted_tokens: 4826
 total tokens: 7803 num samples: 17 num padding tokens: 1773 - rank: 6 max len: 459 min len: 256 avg len: 354.70588235294116 num_loss_counted_tokens: 2701
 total tokens: 7983 num samples: 9 num padding tokens: 500 - rank: 3 max len: 887 min len: 781 avg len: 831.4444444444445 num_loss_counted_tokens: 4578
 total tokens: 7525 num samples: 7 num padding tokens: 786 - rank: 2 max len: 1075 min len: 894 avg len: 962.7142857142857 num_loss_counted_tokens: 3223
 total tokens: 6855 num samples: 3 num padding tokens: 919 - rank: 0 max len: 2285 min len: 1795 avg len: 1978.6666666666667 num_loss_counted_tokens: 310
 total tokens: 7808 num samples: 32 num padding tokens: 2971 - rank: 7 max len: 244 min len: 79 avg len: 151.15625 num_loss_counted_tokens: 1950
 Per-token loss scaled by world size: 0.00014639626897405833
 Per-token loss scaled by world size: 0.00042091766954399645
 Per-token loss scaled by world size: 0.00011771616118494421
 Per-token loss scaled by world size: 0.00020373229926917702
 Per-token loss scaled by world size: 0.00029096510843373835
 Per-token loss scaled by world size: 0.0002795852196868509
 Per-token loss scaled by world size: 0.0003187089751008898
 Epoch: 0, Step: 43, Rank: 5, loss = 1.3902910947799683
 Epoch: 0, Step: 43, Rank: 4, loss = 0.6729277968406677
 Epoch: 0, Step: 43, Rank: 2, loss = 0.3888164758682251
 Epoch: 0, Step: 43, Rank: 1, loss = 0.4835468530654907
 Epoch: 0, Step: 43, Rank: 3, loss = 1.0526957511901855
 Epoch: 0, Step: 43, Rank: 7, loss = 0.9234700202941895
 Epoch: 0, Step: 43, Rank: 6, loss = 0.9610577821731567
 Per-token loss scaled by world size: 1.8369590179645456e-05
 Epoch: 0, Step: 43, Rank: 0, loss = 0.0606747567653656
 Epoch 0:  36%|███▌      | 43/121 [01:50<03:19,  2.56s/it] total tokens: 7484 num samples: 4 num padding tokens: 327 - rank: 1 max len: 1871 min len: 1693 avg len: 1789.25 num_loss_counted_tokens: 3106
 total tokens: 8050 num samples: 10 num padding tokens: 630 - rank: 4 max len: 805 min len: 673 avg len: 742.0 num_loss_counted_tokens: 6379
 total tokens: 7696 num samples: 8 num padding tokens: 569 - rank: 3 max len: 962 min len: 816 avg len: 890.875 num_loss_counted_tokens: 5177
 {
    "epoch": 0,
    "step": 43,
    "rank": 0,
    "loss": 0.0606747567653656,
    "overall_throughput": 41.53024928853794,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.530761241912842,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26424,
    "batch_size": 80,
    "total_loss": 0.7416850328445435,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:51.968138"
 }
 total tokens: 7056 num samples: 28 num padding tokens: 3094 - rank: 7 max len: 252 min len: 75 avg len: 141.5 num_loss_counted_tokens: 1402
 total tokens: 8086 num samples: 13 num padding tokens: 1584 - rank: 5 max len: 622 min len: 401 avg len: 500.15384615384613 num_loss_counted_tokens: 4091
 total tokens: 5748 num samples: 2 num padding tokens: 405 - rank: 0 max len: 2874 min len: 2469 avg len: 2671.5 num_loss_counted_tokens: 176
 total tokens: 7158 num samples: 6 num padding tokens: 886 - rank: 2 max len: 1193 min len: 985 avg len: 1045.3333333333333 num_loss_counted_tokens: 2751
 total tokens: 7780 num samples: 20 num padding tokens: 1751 - rank: 6 max len: 389 min len: 257 avg len: 301.45 num_loss_counted_tokens: 3387
 Per-token loss scaled by world size: 0.0004709336790256202
 Per-token loss scaled by world size: 0.00041861337376758456
 Per-token loss scaled by world size: 0.00045743229566141963
 Per-token loss scaled by world size: 0.00041176279773935676
 Per-token loss scaled by world size: 1.3287355614011176e-05
 Per-token loss scaled by world size: 0.00043119132169522345
 Per-token loss scaled by world size: 0.0003093344275839627
 Epoch: 0, Step: 44, Rank: 5, loss = 1.1258552074432373
 Epoch: 0, Step: 44, Rank: 1, loss = 1.030312180519104
 Epoch: 0, Step: 44, Rank: 7, loss = 1.1590855121612549
 Epoch: 0, Step: 44, Rank: 0, loss = 0.03270350396633148
 Epoch: 0, Step: 44, Rank: 4, loss = 1.0134512186050415
 Epoch: 0, Step: 44, Rank: 6, loss = 1.0612696409225464
 Epoch: 0, Step: 44, Rank: 3, loss = 0.7613493800163269
 Per-token loss scaled by world size: 6.520144233945757e-05
 Epoch: 0, Step: 44, Rank: 2, loss = 0.16047704219818115
 Epoch 0:  36%|███▋      | 44/121 [01:52<03:17,  2.57s/it] total tokens: 5546 num samples: 2 num padding tokens: 467 - rank: 1 max len: 2773 min len: 2306 avg len: 2539.5 num_loss_counted_tokens: 136
 total tokens: 7947 num samples: 9 num padding tokens: 1026 - rank: 4 max len: 883 min len: 694 avg len: 769.0 num_loss_counted_tokens: 4965
 {
    "epoch": 0,
    "step": 44,
    "rank": 0,
    "loss": 0.03270350396633148,
    "overall_throughput": 41.61664148118311,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.250525951385498,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19690,
    "batch_size": 72,
    "total_loss": 0.7930629253387451,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:54.572590"
 }
 total tokens: 7632 num samples: 6 num padding tokens: 867 - rank: 3 max len: 1272 min len: 946 avg len: 1127.5 num_loss_counted_tokens: 1800
 total tokens: 6420 num samples: 30 num padding tokens: 2314 - rank: 7 max len: 214 min len: 73 avg len: 136.86666666666667 num_loss_counted_tokens: 1458
 total tokens: 6984 num samples: 4 num padding tokens: 806 - rank: 2 max len: 1746 min len: 1334 avg len: 1544.5 num_loss_counted_tokens: 1942
 total tokens: 6094 num samples: 2 num padding tokens: 150 - rank: 0 max len: 3047 min len: 2897 avg len: 2972.0 num_loss_counted_tokens: 493
 total tokens: 7900 num samples: 20 num padding tokens: 1997 - rank: 6 max len: 395 min len: 231 avg len: 295.15 num_loss_counted_tokens: 3387
 total tokens: 8052 num samples: 12 num padding tokens: 956 - rank: 5 max len: 671 min len: 482 avg len: 591.3333333333334 num_loss_counted_tokens: 5097
 Per-token loss scaled by world size: 0.0003629309358075261
 Per-token loss scaled by world size: 0.00019072710711043328
 Per-token loss scaled by world size: 0.0003041441086679697
 Per-token loss scaled by world size: 0.00040468695806339383
 Per-token loss scaled by world size: 5.418559885583818e-05
 Per-token loss scaled by world size: 7.64827273087576e-05
 Per-token loss scaled by world size: 0.00011607163469307125
 Epoch: 0, Step: 45, Rank: 6, loss = 1.2099664211273193
 Epoch: 0, Step: 45, Rank: 2, loss = 0.6358603239059448
 Epoch: 0, Step: 45, Rank: 0, loss = 0.2549838423728943
 Epoch: 0, Step: 45, Rank: 5, loss = 1.3491756916046143
 Epoch: 0, Step: 45, Rank: 4, loss = 1.0139784812927246
 Epoch: 0, Step: 45, Rank: 1, loss = 0.18064801394939423
 Per-token loss scaled by world size: 0.00039161398308351636
 Epoch: 0, Step: 45, Rank: 7, loss = 0.38696831464767456
 Epoch: 0, Step: 45, Rank: 3, loss = 1.3055920600891113
 Epoch 0:  37%|███▋      | 45/121 [01:55<03:14,  2.57s/it] total tokens: 7215 num samples: 3 num padding tokens: 642 - rank: 1 max len: 2405 min len: 2050 avg len: 2191.0 num_loss_counted_tokens: 335
 total tokens: 8046 num samples: 9 num padding tokens: 1094 - rank: 4 max len: 894 min len: 726 avg len: 772.4444444444445 num_loss_counted_tokens: 5000
 total tokens: 4725 num samples: 25 num padding tokens: 1257 - rank: 7 max len: 189 min len: 75 avg len: 138.72 num_loss_counted_tokens: 1367
 {
    "epoch": 0,
    "step": 45,
    "rank": 0,
    "loss": 0.2549838423728943,
    "overall_throughput": 42.42738944609394,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.32686471939087,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26671,
    "batch_size": 93,
    "total_loss": 0.792146623134613,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:57.124489"
 }
 total tokens: 7820 num samples: 17 num padding tokens: 2391 - rank: 6 max len: 460 min len: 196 avg len: 319.3529411764706 num_loss_counted_tokens: 2893
 total tokens: 7320 num samples: 6 num padding tokens: 832 - rank: 3 max len: 1220 min len: 937 avg len: 1081.3333333333333 num_loss_counted_tokens: 4277
 total tokens: 7832 num samples: 11 num padding tokens: 1541 - rank: 5 max len: 712 min len: 471 avg len: 571.9090909090909 num_loss_counted_tokens: 4862
 total tokens: 6974 num samples: 2 num padding tokens: 808 - rank: 0 max len: 3487 min len: 2679 avg len: 3083.0 num_loss_counted_tokens: 194
 total tokens: 6796 num samples: 4 num padding tokens: 645 - rank: 2 max len: 1699 min len: 1367 avg len: 1537.75 num_loss_counted_tokens: 2448
 Per-token loss scaled by world size: 0.00021682196529582143
 Per-token loss scaled by world size: 0.00032285196357406676
 Per-token loss scaled by world size: 0.00028426622156985104
 Per-token loss scaled by world size: 0.000325443601468578
 Per-token loss scaled by world size: 0.00019496992172207683
 Per-token loss scaled by world size: 0.0003603589429985732
 Per-token loss scaled by world size: 9.507144568488002e-05
 Epoch: 0, Step: 46, Rank: 5, loss = 1.2730580568313599
 Epoch: 0, Step: 46, Rank: 6, loss = 1.0042414665222168
 Epoch: 0, Step: 46, Rank: 3, loss = 0.7659777998924255
 Epoch: 0, Step: 46, Rank: 4, loss = 1.1497108936309814
 Epoch: 0, Step: 46, Rank: 2, loss = 1.1405552625656128
 Epoch: 0, Step: 46, Rank: 7, loss = 0.6887800097465515
 Epoch: 0, Step: 46, Rank: 1, loss = 0.3358636498451233
 Per-token loss scaled by world size: 4.201751289656386e-05
 Epoch: 0, Step: 46, Rank: 0, loss = 0.14843736588954926
 Epoch 0:  38%|███▊      | 46/121 [01:57<03:12,  2.57s/it] total tokens: 6948 num samples: 4 num padding tokens: 662 - rank: 1 max len: 1737 min len: 1416 avg len: 1571.5 num_loss_counted_tokens: 3855
 total tokens: 7308 num samples: 9 num padding tokens: 773 - rank: 4 max len: 812 min len: 625 avg len: 726.1111111111111 num_loss_counted_tokens: 4386
 {
    "epoch": 0,
    "step": 46,
    "rank": 0,
    "loss": 0.14843736588954926,
    "overall_throughput": 42.16837318958055,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.52260398864746,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 28262,
    "batch_size": 94,
    "total_loss": 0.8133281469345093,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:49:59.691426"
 }
 total tokens: 8021 num samples: 13 num padding tokens: 1587 - rank: 5 max len: 617 min len: 431 avg len: 494.9230769230769 num_loss_counted_tokens: 4488
 total tokens: 7343 num samples: 7 num padding tokens: 562 - rank: 3 max len: 1049 min len: 905 avg len: 968.7142857142857 num_loss_counted_tokens: 4549
 total tokens: 7704 num samples: 18 num padding tokens: 1519 - rank: 6 max len: 428 min len: 278 avg len: 343.6111111111111 num_loss_counted_tokens: 3313
 total tokens: 8004 num samples: 29 num padding tokens: 3202 - rank: 7 max len: 276 min len: 86 avg len: 165.58620689655172 num_loss_counted_tokens: 1941
 total tokens: 8016 num samples: 6 num padding tokens: 940 - rank: 2 max len: 1336 min len: 1075 avg len: 1179.3333333333333 num_loss_counted_tokens: 3672
 total tokens: 7050 num samples: 3 num padding tokens: 713 - rank: 0 max len: 2350 min len: 1747 avg len: 2112.3333333333335 num_loss_counted_tokens: 1822
 Per-token loss scaled by world size: 0.00035897750058211386
 Per-token loss scaled by world size: 0.00042433346970938146
 Per-token loss scaled by world size: 0.00017432670574635267
 Per-token loss scaled by world size: 0.0005883869016543031
 Per-token loss scaled by world size: 0.00017508945893496275
 Per-token loss scaled by world size: 0.00023119074467103928
 Per-token loss scaled by world size: 0.0002113927184836939
 Epoch: 0, Step: 47, Rank: 5, loss = 1.6370394229888916
 Epoch: 0, Step: 47, Rank: 6, loss = 1.1806018352508545
 Epoch: 0, Step: 47, Rank: 1, loss = 0.4850204885005951
 Epoch: 0, Step: 47, Rank: 4, loss = 0.9987651109695435
 Epoch: 0, Step: 47, Rank: 7, loss = 0.6432304382324219
 Epoch: 0, Step: 47, Rank: 2, loss = 0.4871426522731781
 Epoch: 0, Step: 47, Rank: 3, loss = 0.5881474018096924
 Per-token loss scaled by world size: 3.264834595029242e-05
 Epoch: 0, Step: 47, Rank: 0, loss = 0.09083586186170578
 Epoch 0:  39%|███▉      | 47/121 [02:00<03:10,  2.58s/it] total tokens: 5918 num samples: 2 num padding tokens: 291 - rank: 1 max len: 2959 min len: 2668 avg len: 2813.5 num_loss_counted_tokens: 895
 total tokens: 5478 num samples: 22 num padding tokens: 2168 - rank: 7 max len: 249 min len: 85 avg len: 150.45454545454547 num_loss_counted_tokens: 1254
 total tokens: 7456 num samples: 8 num padding tokens: 650 - rank: 4 max len: 932 min len: 776 avg len: 850.75 num_loss_counted_tokens: 5212
 total tokens: 7815 num samples: 15 num padding tokens: 1566 - rank: 6 max len: 521 min len: 301 avg len: 416.6 num_loss_counted_tokens: 3998
 {
    "epoch": 0,
    "step": 47,
    "rank": 0,
    "loss": 0.09083586186170578,
    "overall_throughput": 41.45521968839562,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.520647048950195,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22258,
    "batch_size": 86,
    "total_loss": 0.7638478875160217,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:02.300320"
 }
 total tokens: 7480 num samples: 4 num padding tokens: 1532 - rank: 2 max len: 1870 min len: 1277 avg len: 1487.0 num_loss_counted_tokens: 1969
 total tokens: 6978 num samples: 6 num padding tokens: 608 - rank: 3 max len: 1163 min len: 946 avg len: 1061.6666666666667 num_loss_counted_tokens: 4439
 total tokens: 7890 num samples: 2 num padding tokens: 381 - rank: 0 max len: 3945 min len: 3564 avg len: 3754.5 num_loss_counted_tokens: 441
 total tokens: 8019 num samples: 11 num padding tokens: 950 - rank: 5 max len: 729 min len: 528 avg len: 642.6363636363636 num_loss_counted_tokens: 5004
 Per-token loss scaled by world size: 0.0004503819509409368
 Per-token loss scaled by world size: 0.0003560640325304121
 Per-token loss scaled by world size: 0.0002961498685181141
 Per-token loss scaled by world size: 0.0003928189689759165
 Per-token loss scaled by world size: 6.809273327235132e-05
 Per-token loss scaled by world size: 3.832683887594612e-06
 Per-token loss scaled by world size: 5.2919685913366266e-06
 Epoch: 0, Step: 48, Rank: 7, loss = 1.0013855695724487
 Epoch: 0, Step: 48, Rank: 3, loss = 0.8328844904899597
 Epoch: 0, Step: 48, Rank: 6, loss = 1.2666429281234741
 Epoch: 0, Step: 48, Rank: 4, loss = 1.1047542095184326
 Epoch: 0, Step: 48, Rank: 0, loss = 0.010778944939374924
 Epoch: 0, Step: 48, Rank: 2, loss = 0.19150230288505554
 Epoch: 0, Step: 48, Rank: 1, loss = 0.014883000403642654
 Per-token loss scaled by world size: 0.00040955503936856985
 Epoch: 0, Step: 48, Rank: 5, loss = 1.1518223285675049
 Epoch 0:  40%|███▉      | 48/121 [02:03<03:07,  2.56s/it] total tokens: 7580 num samples: 10 num padding tokens: 754 - rank: 4 max len: 758 min len: 613 avg len: 682.6 num_loss_counted_tokens: 3916
 total tokens: 7806 num samples: 3 num padding tokens: 1441 - rank: 1 max len: 2602 min len: 1696 avg len: 2121.6666666666665 num_loss_counted_tokens: 1824
 {
    "epoch": 0,
    "step": 48,
    "rank": 0,
    "loss": 0.010778944939374924,
    "overall_throughput": 42.77144474456274,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.303375244140625,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22499,
    "batch_size": 76,
    "total_loss": 0.6968317627906799,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:04.867177"
 }
 total tokens: 7440 num samples: 30 num padding tokens: 2674 - rank: 7 max len: 248 min len: 85 avg len: 158.86666666666667 num_loss_counted_tokens: 1971
 total tokens: 7644 num samples: 7 num padding tokens: 772 - rank: 3 max len: 1092 min len: 894 avg len: 981.7142857142857 num_loss_counted_tokens: 5193
 total tokens: 7580 num samples: 2 num padding tokens: 877 - rank: 0 max len: 3790 min len: 2913 avg len: 3351.5 num_loss_counted_tokens: 999
 total tokens: 8025 num samples: 5 num padding tokens: 1188 - rank: 2 max len: 1605 min len: 1101 avg len: 1367.4 num_loss_counted_tokens: 2038
 total tokens: 8100 num samples: 18 num padding tokens: 2074 - rank: 6 max len: 450 min len: 258 avg len: 334.77777777777777 num_loss_counted_tokens: 3301
 total tokens: 7709 num samples: 13 num padding tokens: 980 - rank: 5 max len: 593 min len: 457 avg len: 517.6153846153846 num_loss_counted_tokens: 4236
 Per-token loss scaled by world size: 0.0005574136739596725
 Per-token loss scaled by world size: 0.0002091079077217728
 Per-token loss scaled by world size: 0.0003604785306379199
 Per-token loss scaled by world size: 0.00025925517547875643
 Per-token loss scaled by world size: 0.00025941740022972226
 Per-token loss scaled by world size: 0.0002179427247028798
 Per-token loss scaled by world size: 5.799124210170703e-06
 Epoch: 0, Step: 49, Rank: 6, loss = 1.024795413017273
 Epoch: 0, Step: 49, Rank: 3, loss = 0.5944676399230957
 Epoch: 0, Step: 49, Rank: 5, loss = 1.5846574306488037
 Epoch: 0, Step: 49, Rank: 1, loss = 0.6195839047431946
 Epoch: 0, Step: 49, Rank: 4, loss = 0.7370300889015198
 Epoch: 0, Step: 49, Rank: 0, loss = 0.016486184671521187
 Epoch: 0, Step: 49, Rank: 7, loss = 0.737491250038147
 Per-token loss scaled by world size: 0.00016608217265456915
 Epoch: 0, Step: 49, Rank: 2, loss = 0.47215086221694946
 Epoch 0:  40%|████      | 49/121 [02:05<03:04,  2.57s/it] total tokens: 7560 num samples: 6 num padding tokens: 946 - rank: 1 max len: 1260 min len: 979 avg len: 1102.3333333333333 num_loss_counted_tokens: 3644
 total tokens: 7788 num samples: 12 num padding tokens: 734 - rank: 4 max len: 649 min len: 534 avg len: 587.8333333333334 num_loss_counted_tokens: 4810
 {
    "epoch": 0,
    "step": 49,
    "rank": 0,
    "loss": 0.016486184671521187,
    "overall_throughput": 41.99615706250001,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.364055633544922,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22743,
    "batch_size": 77,
    "total_loss": 0.7233328223228455,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:07.406672"
 }
 total tokens: 7995 num samples: 15 num padding tokens: 1460 - rank: 5 max len: 533 min len: 364 avg len: 435.6666666666667 num_loss_counted_tokens: 4262
 total tokens: 3168 num samples: 18 num padding tokens: 844 - rank: 7 max len: 176 min len: 76 avg len: 129.11111111111111 num_loss_counted_tokens: 813
 total tokens: 7920 num samples: 22 num padding tokens: 2223 - rank: 6 max len: 360 min len: 188 avg len: 258.95454545454544 num_loss_counted_tokens: 2786
 total tokens: 7416 num samples: 8 num padding tokens: 720 - rank: 2 max len: 927 min len: 750 avg len: 837.0 num_loss_counted_tokens: 4900
 total tokens: 7248 num samples: 3 num padding tokens: 1314 - rank: 0 max len: 2416 min len: 1340 avg len: 1978.0 num_loss_counted_tokens: 2507
 total tokens: 7500 num samples: 10 num padding tokens: 393 - rank: 3 max len: 750 min len: 674 avg len: 710.7 num_loss_counted_tokens: 4132
 Per-token loss scaled by world size: 0.0003894062538165599
 Per-token loss scaled by world size: 0.0003162138455081731
 Per-token loss scaled by world size: 0.00023851577134337276
 Per-token loss scaled by world size: 0.00032866382389329374
 Per-token loss scaled by world size: 0.00045591729576699436
 Per-token loss scaled by world size: 4.146520223002881e-05
 Per-token loss scaled by world size: 5.4783604355179705e-06
 Epoch: 0, Step: 50, Rank: 7, loss = 0.9038181900978088
 Epoch: 0, Step: 50, Rank: 5, loss = 1.113020420074463
 Epoch: 0, Step: 50, Rank: 4, loss = 1.3031256198883057
 Epoch: 0, Step: 50, Rank: 3, loss = 0.6817377209663391
 Epoch: 0, Step: 50, Rank: 2, loss = 0.9394033551216125
 Epoch: 0, Step: 50, Rank: 0, loss = 0.01565852388739586
 Epoch: 0, Step: 50, Rank: 1, loss = 0.1185179129242897
 Per-token loss scaled by world size: 0.00044163045822642744
 Epoch: 0, Step: 50, Rank: 6, loss = 1.2622902393341064
 Epoch 0:  41%|████▏     | 50/121 [02:08<03:00,  2.54s/it] total tokens: 7206 num samples: 3 num padding tokens: 1221 - rank: 1 max len: 2402 min len: 1648 avg len: 1995.0 num_loss_counted_tokens: 854
 total tokens: 7592 num samples: 8 num padding tokens: 1031 - rank: 4 max len: 949 min len: 737 avg len: 820.125 num_loss_counted_tokens: 4482
 {
    "epoch": 0,
    "step": 50,
    "rank": 0,
    "loss": 0.01565852388739586,
    "overall_throughput": 43.999185393816596,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.364055633544922,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22866,
    "batch_size": 99,
    "total_loss": 0.79219651222229,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:09.869463"
 }
 total tokens: 7938 num samples: 14 num padding tokens: 2583 - rank: 6 max len: 567 min len: 269 avg len: 382.5 num_loss_counted_tokens: 3524
 total tokens: 7014 num samples: 6 num padding tokens: 736 - rank: 3 max len: 1169 min len: 949 avg len: 1046.3333333333333 num_loss_counted_tokens: 3459
 total tokens: 7525 num samples: 5 num padding tokens: 656 - rank: 2 max len: 1505 min len: 1205 avg len: 1373.8 num_loss_counted_tokens: 3152
 total tokens: 8070 num samples: 30 num padding tokens: 3026 - rank: 7 max len: 269 min len: 86 avg len: 168.13333333333333 num_loss_counted_tokens: 2116
 total tokens: 7821 num samples: 11 num padding tokens: 546 - rank: 5 max len: 711 min len: 583 avg len: 661.3636363636364 num_loss_counted_tokens: 4062
 total tokens: 6454 num samples: 2 num padding tokens: 77 - rank: 0 max len: 3227 min len: 3150 avg len: 3188.5 num_loss_counted_tokens: 196
 Per-token loss scaled by world size: 0.00025689046015031636
 Per-token loss scaled by world size: 0.0004693289229180664
 Per-token loss scaled by world size: 0.0002972199581563473
 Per-token loss scaled by world size: 0.00019820936722680926
 Per-token loss scaled by world size: 0.0002842875546775758
 Per-token loss scaled by world size: 4.825916403206065e-05
 Per-token loss scaled by world size: 2.7838473215524573e-06
 Epoch: 0, Step: 51, Rank: 6, loss = 1.3355927467346191
 Epoch: 0, Step: 51, Rank: 7, loss = 0.8458136916160583
 Epoch: 0, Step: 51, Rank: 2, loss = 0.7310460209846497
 Epoch: 0, Step: 51, Rank: 1, loss = 0.13733351230621338
 Epoch: 0, Step: 51, Rank: 4, loss = 0.5640543103218079
 Epoch: 0, Step: 51, Rank: 3, loss = 0.8090112805366516
 Epoch: 0, Step: 51, Rank: 0, loss = 0.007922133430838585
 Per-token loss scaled by world size: 0.0003865555045194924
 Epoch: 0, Step: 51, Rank: 5, loss = 1.100040316581726
 Epoch 0:  42%|████▏     | 51/121 [02:10<02:58,  2.55s/it]
 total tokens: 7845 num samples: 5 num padding tokens: 525 - rank: 1 max len: 1569 min len: 1326 avg len: 1464.0 num_loss_counted_tokens: 3941
 total tokens: 7520 num samples: 10 num padding tokens: 686 - rank: 4 max len: 752 min len: 619 avg len: 683.4 num_loss_counted_tokens: 4307
 {
    "epoch": 0,
    "step": 51,
    "rank": 0,
    "loss": 0.007922133430838585,
    "overall_throughput": 42.29003789361672,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.354268074035645,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22766,
    "batch_size": 70,
    "total_loss": 0.6913517117500305,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:12.436230"
 }
 total tokens: 7780 num samples: 20 num padding tokens: 1768 - rank: 6 max len: 389 min len: 232 avg len: 300.6 num_loss_counted_tokens: 3299
 total tokens: 5290 num samples: 23 num padding tokens: 1933 - rank: 7 max len: 230 min len: 81 avg len: 145.95652173913044 num_loss_counted_tokens: 1376
 total tokens: 7761 num samples: 13 num padding tokens: 933 - rank: 5 max len: 597 min len: 418 avg len: 525.2307692307693 num_loss_counted_tokens: 4968
 total tokens: 7912 num samples: 8 num padding tokens: 555 - rank: 3 max len: 989 min len: 821 avg len: 919.625 num_loss_counted_tokens: 5323
 total tokens: 7374 num samples: 3 num padding tokens: 1194 - rank: 0 max len: 2458 min len: 1773 avg len: 2060.0 num_loss_counted_tokens: 336
 total tokens: 7728 num samples: 6 num padding tokens: 549 - rank: 2 max len: 1288 min len: 1035 avg len: 1196.5 num_loss_counted_tokens: 3900
 Per-token loss scaled by world size: 0.00019784610776696354
 Per-token loss scaled by world size: 0.000183841708349064
 Per-token loss scaled by world size: 9.399676491739228e-05
 Per-token loss scaled by world size: 0.0001539696240797639
 Per-token loss scaled by world size: 0.0002607290807645768
 Per-token loss scaled by world size: 0.000336777011398226
 Per-token loss scaled by world size: 0.0004033475706819445
 Epoch: 0, Step: 52, Rank: 0, loss = 0.30163562297821045
 Epoch: 0, Step: 52, Rank: 3, loss = 0.5899480581283569
 Epoch: 0, Step: 52, Rank: 2, loss = 0.6348881721496582
 Epoch: 0, Step: 52, Rank: 1, loss = 0.4940885007381439
 Epoch: 0, Step: 52, Rank: 6, loss = 1.0807174444198608
 Epoch: 0, Step: 52, Rank: 7, loss = 0.8366796374320984
 Epoch: 0, Step: 52, Rank: 4, loss = 1.2943423986434937
 Per-token loss scaled by world size: 0.00024687970289960504
 Epoch: 0, Step: 52, Rank: 5, loss = 0.7922369241714478
 Epoch 0:  43%|████▎     | 52/121 [02:13<02:54,  2.53s/it]
 total tokens: 5434 num samples: 2 num padding tokens: 361 - rank: 1 max len: 2717 min len: 2356 avg len: 2536.5 num_loss_counted_tokens: 214
 total tokens: 8091 num samples: 9 num padding tokens: 724 - rank: 4 max len: 899 min len: 723 avg len: 818.5555555555555 num_loss_counted_tokens: 4597
 {
    "epoch": 0,
    "step": 52,
    "rank": 0,
    "loss": 0.30163562297821045,
    "overall_throughput": 43.565651051419955,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.44568157196045,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25672,
    "batch_size": 81,
    "total_loss": 0.7530670762062073,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:14.922017"
 }
 total tokens: 7384 num samples: 4 num padding tokens: 597 - rank: 2 max len: 1846 min len: 1598 avg len: 1696.75 num_loss_counted_tokens: 648
 total tokens: 5825 num samples: 25 num padding tokens: 2240 - rank: 7 max len: 233 min len: 87 avg len: 143.4 num_loss_counted_tokens: 1267
 total tokens: 8040 num samples: 15 num padding tokens: 2522 - rank: 6 max len: 536 min len: 238 avg len: 367.8666666666667 num_loss_counted_tokens: 3089
 total tokens: 7434 num samples: 6 num padding tokens: 1318 - rank: 3 max len: 1239 min len: 914 avg len: 1019.3333333333334 num_loss_counted_tokens: 4080
 total tokens: 7854 num samples: 11 num padding tokens: 785 - rank: 5 max len: 714 min len: 537 avg len: 642.6363636363636 num_loss_counted_tokens: 3685
 total tokens: 7772 num samples: 2 num padding tokens: 602 - rank: 0 max len: 3886 min len: 3284 avg len: 3585.0 num_loss_counted_tokens: 194
 Per-token loss scaled by world size: 0.00032663694582879543
 Per-token loss scaled by world size: 0.0003048842481803149
 Per-token loss scaled by world size: 0.000158169845235534
 Per-token loss scaled by world size: 0.00026047308347187936
 Per-token loss scaled by world size: 0.00023948316811583936
 Per-token loss scaled by world size: 0.00011947691382374614
 Per-token loss scaled by world size: 5.034709374740487e-06
 Epoch: 0, Step: 53, Rank: 6, loss = 1.0754791498184204
 Epoch: 0, Step: 53, Rank: 5, loss = 1.1522117853164673
 Epoch: 0, Step: 53, Rank: 1, loss = 0.557944118976593
 Epoch: 0, Step: 53, Rank: 4, loss = 0.9188187718391418
 Epoch: 0, Step: 53, Rank: 2, loss = 0.4214548170566559
 Epoch: 0, Step: 53, Rank: 7, loss = 0.8447768688201904
 Epoch: 0, Step: 53, Rank: 0, loss = 0.017759937793016434
 Per-token loss scaled by world size: 0.00029550379258580506
 Epoch: 0, Step: 53, Rank: 3, loss = 1.0423896312713623
 Epoch 0:  44%|████▍     | 53/121 [02:15<02:51,  2.53s/it]
 total tokens: 6792 num samples: 3 num padding tokens: 382 - rank: 1 max len: 2264 min len: 2050 avg len: 2136.6666666666665 num_loss_counted_tokens: 2284
 total tokens: 7866 num samples: 9 num padding tokens: 794 - rank: 4 max len: 874 min len: 678 avg len: 785.7777777777778 num_loss_counted_tokens: 5164
 {
    "epoch": 0,
    "step": 53,
    "rank": 0,
    "loss": 0.017759937793016434,
    "overall_throughput": 42.74150473533702,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.40515947341919,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 28220,
    "batch_size": 101,
    "total_loss": 0.7538543939590454,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:17.454732"
 }
 total tokens: 7777 num samples: 7 num padding tokens: 867 - rank: 3 max len: 1111 min len: 895 avg len: 987.1428571428571 num_loss_counted_tokens: 4712
 total tokens: 5152 num samples: 23 num padding tokens: 1800 - rank: 7 max len: 224 min len: 71 avg len: 145.7391304347826 num_loss_counted_tokens: 1286
 total tokens: 7932 num samples: 12 num padding tokens: 1042 - rank: 5 max len: 661 min len: 452 avg len: 574.1666666666666 num_loss_counted_tokens: 4158
 total tokens: 6598 num samples: 2 num padding tokens: 710 - rank: 0 max len: 3299 min len: 2589 avg len: 2944.0 num_loss_counted_tokens: 177
 total tokens: 8100 num samples: 18 num padding tokens: 1853 - rank: 6 max len: 450 min len: 235 avg len: 347.05555555555554 num_loss_counted_tokens: 3720
 total tokens: 8076 num samples: 4 num padding tokens: 2220 - rank: 2 max len: 2019 min len: 1211 avg len: 1464.0 num_loss_counted_tokens: 1401
 Per-token loss scaled by world size: 0.00026205729227513075
 Per-token loss scaled by world size: 0.00021129030210431665
 Per-token loss scaled by world size: 0.00041250750655308366
 Per-token loss scaled by world size: 0.0004137573123443872
 Per-token loss scaled by world size: 0.00038468287675641477
 Per-token loss scaled by world size: 6.601931090699509e-06
 Per-token loss scaled by world size: 0.0001686068280832842
 Epoch: 0, Step: 54, Rank: 5, loss = 1.1955498456954956
 Epoch: 0, Step: 54, Rank: 1, loss = 0.612372100353241
 Epoch: 0, Step: 54, Rank: 3, loss = 0.7595075368881226
 Epoch: 0, Step: 54, Rank: 6, loss = 1.1991721391677856
 Epoch: 0, Step: 54, Rank: 4, loss = 1.114907145500183
 Epoch: 0, Step: 54, Rank: 0, loss = 0.019134046509861946
 Epoch: 0, Step: 54, Rank: 7, loss = 0.48866474628448486
 Per-token loss scaled by world size: 0.00018809801258612424
 Epoch: 0, Step: 54, Rank: 2, loss = 0.5451550483703613
 Epoch 0:  45%|████▍     | 54/121 [02:18<02:49,  2.53s/it]
 total tokens: 6536 num samples: 4 num padding tokens: 440 - rank: 1 max len: 1634 min len: 1437 avg len: 1524.0 num_loss_counted_tokens: 1512
 total tokens: 7530 num samples: 10 num padding tokens: 291 - rank: 4 max len: 753 min len: 677 avg len: 723.9 num_loss_counted_tokens: 3325
 total tokens: 6312 num samples: 24 num padding tokens: 2423 - rank: 7 max len: 263 min len: 87 avg len: 162.04166666666666 num_loss_counted_tokens: 1548
 {
    "epoch": 0,
    "step": 54,
    "rank": 0,
    "loss": 0.019134046509861946,
    "overall_throughput": 42.550455941342655,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.375274658203125,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23186,
    "batch_size": 84,
    "total_loss": 0.7418078184127808,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:20.000394"
 }
 total tokens: 7968 num samples: 12 num padding tokens: 974 - rank: 5 max len: 664 min len: 458 avg len: 582.8333333333334 num_loss_counted_tokens: 4692
 total tokens: 7020 num samples: 5 num padding tokens: 1164 - rank: 2 max len: 1404 min len: 1078 avg len: 1171.2 num_loss_counted_tokens: 2148
 total tokens: 7786 num samples: 17 num padding tokens: 1313 - rank: 6 max len: 458 min len: 268 avg len: 380.7647058823529 num_loss_counted_tokens: 3857
 total tokens: 6240 num samples: 3 num padding tokens: 692 - rank: 0 max len: 2080 min len: 1719 avg len: 1849.3333333333333 num_loss_counted_tokens: 339
 total tokens: 8032 num samples: 8 num padding tokens: 1051 - rank: 3 max len: 1004 min len: 774 avg len: 872.625 num_loss_counted_tokens: 4797
 Per-token loss scaled by world size: 0.00031934864819049835
 Per-token loss scaled by world size: 0.00028757311520166695
 Per-token loss scaled by world size: 0.0002729191619437188
 Per-token loss scaled by world size: 5.039776624471415e-06
 Per-token loss scaled by world size: 5.635480647470104e-06
 Per-token loss scaled by world size: 0.00025633463519625366
 Per-token loss scaled by world size: 0.00021784953423775733
 Epoch: 0, Step: 55, Rank: 1, loss = 0.014716777950525284
 Epoch: 0, Step: 55, Rank: 4, loss = 0.7969580292701721
 Epoch: 0, Step: 55, Rank: 2, loss = 0.8397494554519653
 Epoch: 0, Step: 55, Rank: 5, loss = 0.9325379729270935
 Epoch: 0, Step: 55, Rank: 0, loss = 0.016456307843327522
 Epoch: 0, Step: 55, Rank: 3, loss = 0.7485291957855225
 Epoch: 0, Step: 55, Rank: 7, loss = 0.6361478567123413
 Per-token loss scaled by world size: 0.0004028878756798804
 Epoch: 0, Step: 55, Rank: 6, loss = 1.176482915878296
 Epoch 0:  45%|████▌     | 55/121 [02:20<02:48,  2.55s/it]
 total tokens: 6669 num samples: 3 num padding tokens: 1016 - rank: 1 max len: 2223 min len: 1683 avg len: 1884.3333333333333 num_loss_counted_tokens: 2050
 total tokens: 7551 num samples: 9 num padding tokens: 500 - rank: 4 max len: 839 min len: 731 avg len: 783.4444444444445 num_loss_counted_tokens: 5307
 total tokens: 7815 num samples: 5 num padding tokens: 1302 - rank: 2 max len: 1563 min len: 1114 avg len: 1302.6 num_loss_counted_tokens: 3153
 total tokens: 6000 num samples: 25 num padding tokens: 1935 - rank: 7 max len: 240 min len: 71 avg len: 162.6 num_loss_counted_tokens: 1696
 {
    "epoch": 0,
    "step": 55,
    "rank": 0,
    "loss": 0.016456307843327522,
    "overall_throughput": 41.61745941199812,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.42158031463623,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23361,
    "batch_size": 72,
    "total_loss": 0.6451972723007202,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:22.600693"
 }
 total tokens: 8008 num samples: 8 num padding tokens: 692 - rank: 3 max len: 1001 min len: 855 avg len: 914.5 num_loss_counted_tokens: 2914
 total tokens: 7755 num samples: 11 num padding tokens: 872 - rank: 5 max len: 705 min len: 568 avg len: 625.7272727272727 num_loss_counted_tokens: 3890
 total tokens: 5474 num samples: 2 num padding tokens: 71 - rank: 0 max len: 2737 min len: 2666 avg len: 2701.5 num_loss_counted_tokens: 203
 total tokens: 7952 num samples: 16 num padding tokens: 1838 - rank: 6 max len: 497 min len: 242 avg len: 382.125 num_loss_counted_tokens: 3616
 Per-token loss scaled by world size: 0.00027937223785556853
 Per-token loss scaled by world size: 0.0003551024419721216
 Per-token loss scaled by world size: 0.0003812481591012329
 Per-token loss scaled by world size: 0.00039887617458589375
 Per-token loss scaled by world size: 0.00017063321138266474
 Per-token loss scaled by world size: 0.0001592883054399863
 Per-token loss scaled by world size: 5.90530635236064e-06
 Epoch: 0, Step: 56, Rank: 5, loss = 1.202885627746582
 Epoch: 0, Step: 56, Rank: 6, loss = 1.1203925609588623
 Epoch: 0, Step: 56, Rank: 7, loss = 0.8814542889595032
 Epoch: 0, Step: 56, Rank: 4, loss = 1.2585041522979736
 Epoch: 0, Step: 56, Rank: 3, loss = 0.5025745034217834
 Epoch: 0, Step: 56, Rank: 1, loss = 0.5383691191673279
 Epoch: 0, Step: 56, Rank: 0, loss = 0.018631979823112488
 Per-token loss scaled by world size: 0.0001534399198135361
 Epoch: 0, Step: 56, Rank: 2, loss = 0.4841221272945404
 Epoch 0:  46%|████▋     | 56/121 [02:23<02:46,  2.55s/it]
 total tokens: 8090 num samples: 10 num padding tokens: 807 - rank: 4 max len: 809 min len: 667 avg len: 728.3 num_loss_counted_tokens: 3324
 total tokens: 7916 num samples: 4 num padding tokens: 1432 - rank: 1 max len: 1979 min len: 1204 avg len: 1621.0 num_loss_counted_tokens: 3189
 {
    "epoch": 0,
    "step": 56,
    "rank": 0,
    "loss": 0.018631979823112488,
    "overall_throughput": 42.399025208803295,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.21819305419922,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25241,
    "batch_size": 80,
    "total_loss": 0.7508668899536133,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:25.156511"
 }
 total tokens: 7794 num samples: 18 num padding tokens: 1747 - rank: 6 max len: 433 min len: 260 avg len: 335.94444444444446 num_loss_counted_tokens: 3466
 total tokens: 7560 num samples: 3 num padding tokens: 565 - rank: 0 max len: 2520 min len: 2123 avg len: 2331.6666666666665 num_loss_counted_tokens: 291
 total tokens: 7956 num samples: 12 num padding tokens: 1282 - rank: 5 max len: 663 min len: 471 avg len: 556.1666666666666 num_loss_counted_tokens: 3682
 total tokens: 7158 num samples: 6 num padding tokens: 910 - rank: 2 max len: 1193 min len: 973 avg len: 1041.3333333333333 num_loss_counted_tokens: 3894
 total tokens: 7020 num samples: 27 num padding tokens: 2005 - rank: 7 max len: 260 min len: 75 avg len: 185.74074074074073 num_loss_counted_tokens: 2158
 total tokens: 7528 num samples: 8 num padding tokens: 511 - rank: 3 max len: 941 min len: 825 avg len: 877.125 num_loss_counted_tokens: 5733
 Per-token loss scaled by world size: 0.0006988957757130265
 Per-token loss scaled by world size: 0.0001346730423392728
 Per-token loss scaled by world size: 0.0006961524486541748
 Per-token loss scaled by world size: 0.0004565907292999327
 Per-token loss scaled by world size: 0.0009174557635560632
 Per-token loss scaled by world size: 6.04268007009523e-06
 Per-token loss scaled by world size: 2.950769612652948e-06
 Epoch: 0, Step: 57, Rank: 2, loss = 0.29436159133911133
 Epoch: 0, Step: 57, Rank: 7, loss = 1.5216152667999268
 Epoch: 0, Step: 57, Rank: 4, loss = 0.9979931712150574
 Epoch: 0, Step: 57, Rank: 6, loss = 1.527611494064331
 Epoch: 0, Step: 57, Rank: 5, loss = 2.005328893661499
 Epoch: 0, Step: 57, Rank: 0, loss = 0.013207787647843361
 Epoch: 0, Step: 57, Rank: 1, loss = 0.006449644919484854
 Per-token loss scaled by world size: 0.0004939697682857513
 Epoch: 0, Step: 57, Rank: 3, loss = 1.079694390296936
 Epoch 0:  47%|████▋     | 57/121 [02:25<02:43,  2.55s/it]
 total tokens: 7420 num samples: 4 num padding tokens: 675 - rank: 1 max len: 1855 min len: 1520 avg len: 1686.25 num_loss_counted_tokens: 2384
 total tokens: 7770 num samples: 10 num padding tokens: 828 - rank: 4 max len: 777 min len: 623 avg len: 694.2 num_loss_counted_tokens: 2018
 {
    "epoch": 0,
    "step": 57,
    "rank": 0,
    "loss": 0.013207787647843361,
    "overall_throughput": 42.49635893812786,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.335302352905273,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17486,
    "batch_size": 87,
    "total_loss": 0.9307827949523926,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:27.695911"
 }
 total tokens: 8062 num samples: 29 num padding tokens: 2902 - rank: 7 max len: 278 min len: 80 avg len: 177.93103448275863 num_loss_counted_tokens: 2355
 total tokens: 7464 num samples: 6 num padding tokens: 1141 - rank: 2 max len: 1244 min len: 904 avg len: 1053.8333333333333 num_loss_counted_tokens: 4123
 total tokens: 7808 num samples: 16 num padding tokens: 1561 - rank: 6 max len: 488 min len: 309 avg len: 390.4375 num_loss_counted_tokens: 4152
 total tokens: 7865 num samples: 13 num padding tokens: 912 - rank: 5 max len: 605 min len: 489 avg len: 534.8461538461538 num_loss_counted_tokens: 4080
 total tokens: 7839 num samples: 9 num padding tokens: 355 - rank: 3 max len: 871 min len: 779 avg len: 831.5555555555555 num_loss_counted_tokens: 4751
 total tokens: 6874 num samples: 2 num padding tokens: 130 - rank: 0 max len: 3437 min len: 3307 avg len: 3372.0 num_loss_counted_tokens: 164
 Per-token loss scaled by world size: 0.0002828103897627443
 Per-token loss scaled by world size: 0.0004572872712742537
 Per-token loss scaled by world size: 1.0020711442848551e-06
 Per-token loss scaled by world size: 0.0006151991547085345
 Per-token loss scaled by world size: 0.00043195782927796245
 Per-token loss scaled by world size: 7.227377864182927e-06
 Per-token loss scaled by world size: 0.00031230703461915255
 Epoch: 0, Step: 58, Rank: 6, loss = 1.217584490776062
 Epoch: 0, Step: 58, Rank: 0, loss = 0.0026681397575885057
 Epoch: 0, Step: 58, Rank: 3, loss = 0.7530180215835571
 Epoch: 0, Step: 58, Rank: 4, loss = 1.150141716003418
 Epoch: 0, Step: 58, Rank: 5, loss = 1.6380445957183838
 Epoch: 0, Step: 58, Rank: 1, loss = 0.019243797287344933
 Epoch: 0, Step: 58, Rank: 7, loss = 0.831556499004364
 Per-token loss scaled by world size: 7.78582543716766e-05
 Epoch: 0, Step: 58, Rank: 2, loss = 0.2073073387145996
 Epoch 0:  48%|████▊     | 58/121 [02:28<02:40,  2.56s/it]
 total tokens: 6570 num samples: 3 num padding tokens: 500 - rank: 1 max len: 2190 min len: 1919 avg len: 2023.3333333333333 num_loss_counted_tokens: 450
 total tokens: 7944 num samples: 8 num padding tokens: 930 - rank: 4 max len: 993 min len: 794 avg len: 876.75 num_loss_counted_tokens: 6270
 {
    "epoch": 0,
    "step": 58,
    "rank": 0,
    "loss": 0.0026681397575885057,
    "overall_throughput": 42.239983019087354,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.242695808410645,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21301,
    "batch_size": 71,
    "total_loss": 0.7274456024169922,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:30.263573"
 }
 total tokens: 5496 num samples: 24 num padding tokens: 2346 - rank: 7 max len: 229 min len: 77 avg len: 131.25 num_loss_counted_tokens: 1044
 total tokens: 7190 num samples: 2 num padding tokens: 692 - rank: 0 max len: 3595 min len: 2903 avg len: 3249.0 num_loss_counted_tokens: 177
 total tokens: 8113 num samples: 19 num padding tokens: 2176 - rank: 6 max len: 427 min len: 230 avg len: 312.4736842105263 num_loss_counted_tokens: 3174
 total tokens: 7020 num samples: 4 num padding tokens: 898 - rank: 2 max len: 1755 min len: 1310 avg len: 1530.5 num_loss_counted_tokens: 778
 total tokens: 7494 num samples: 6 num padding tokens: 623 - rank: 3 max len: 1249 min len: 1027 avg len: 1145.1666666666667 num_loss_counted_tokens: 2420
 total tokens: 8107 num samples: 11 num padding tokens: 1465 - rank: 5 max len: 737 min len: 430 avg len: 603.8181818181819 num_loss_counted_tokens: 4656
 Per-token loss scaled by world size: 0.00026943007833324373
 Per-token loss scaled by world size: 0.00027157709700986743
 Per-token loss scaled by world size: 0.00050318957073614
 Per-token loss scaled by world size: 0.0003404158051125705
 Per-token loss scaled by world size: 0.0005186275229789317
 Per-token loss scaled by world size: 5.279351626086282e-06
 Per-token loss scaled by world size: 6.721797399222851e-05
 Epoch: 0, Step: 59, Rank: 4, loss = 1.4499406814575195
 Epoch: 0, Step: 59, Rank: 7, loss = 0.7825493812561035
 Epoch: 0, Step: 59, Rank: 6, loss = 0.7763627767562866
 Epoch: 0, Step: 59, Rank: 5, loss = 1.4944251775741577
 Epoch: 0, Step: 59, Rank: 2, loss = 0.9809081554412842
 Epoch: 0, Step: 59, Rank: 0, loss = 0.015212451107800007
 Epoch: 0, Step: 59, Rank: 1, loss = 0.19368860125541687
 Per-token loss scaled by world size: 0.0002404269325779751
 Epoch: 0, Step: 59, Rank: 3, loss = 0.6927902102470398
 Epoch 0:  49%|████▉     | 59/121 [02:31<02:38,  2.56s/it]
 total tokens: 7623 num samples: 11 num padding tokens: 1253 - rank: 4 max len: 693 min len: 515 avg len: 579.0909090909091 num_loss_counted_tokens: 4023
 total tokens: 7490 num samples: 5 num padding tokens: 1411 - rank: 1 max len: 1498 min len: 1040 avg len: 1215.8 num_loss_counted_tokens: 1224
 {
    "epoch": 0,
    "step": 59,
    "rank": 0,
    "loss": 0.015212451107800007,
    "overall_throughput": 42.27999530101241,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.386998653411865,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23052,
    "batch_size": 97,
    "total_loss": 0.7982346415519714,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:32.865411"
 }
 total tokens: 7981 num samples: 23 num padding tokens: 1404 - rank: 6 max len: 347 min len: 241 avg len: 285.95652173913044 num_loss_counted_tokens: 2899
 total tokens: 7140 num samples: 30 num padding tokens: 2458 - rank: 7 max len: 238 min len: 77 avg len: 156.06666666666666 num_loss_counted_tokens: 1738
 total tokens: 7938 num samples: 9 num padding tokens: 784 - rank: 3 max len: 882 min len: 719 avg len: 794.8888888888889 num_loss_counted_tokens: 5307
 total tokens: 7189 num samples: 7 num padding tokens: 565 - rank: 2 max len: 1027 min len: 894 avg len: 946.2857142857143 num_loss_counted_tokens: 3237
 total tokens: 8096 num samples: 16 num padding tokens: 1168 - rank: 5 max len: 506 min len: 348 avg len: 433.0 num_loss_counted_tokens: 3888
 total tokens: 7446 num samples: 3 num padding tokens: 1212 - rank: 0 max len: 2482 min len: 1681 avg len: 2078.0 num_loss_counted_tokens: 1401
 Per-token loss scaled by world size: 0.0002737885224632919
 Per-token loss scaled by world size: 0.00014497540541924536
 Per-token loss scaled by world size: 0.00019299837003927678
 Per-token loss scaled by world size: 0.0003046545316465199
 Per-token loss scaled by world size: 0.0004120226949453354
 Per-token loss scaled by world size: 0.00017287737864535302
 Per-token loss scaled by world size: 9.607095989849768e-07
 Epoch: 0, Step: 60, Rank: 2, loss = 0.6385592222213745
 Epoch: 0, Step: 60, Rank: 1, loss = 0.4796692430973053
 Epoch: 0, Step: 60, Rank: 6, loss = 1.00798761844635
 Epoch: 0, Step: 60, Rank: 4, loss = 1.3632285594940186
 Epoch: 0, Step: 60, Rank: 3, loss = 0.9058635830879211
 Epoch: 0, Step: 60, Rank: 7, loss = 0.5719864368438721
 Epoch: 0, Step: 60, Rank: 0, loss = 0.0031786279287189245
 Per-token loss scaled by world size: 0.0002788409183267504
 Epoch: 0, Step: 60, Rank: 5, loss = 0.9225800633430481
 Epoch 0:  50%|████▉     | 60/121 [02:33<02:35,  2.55s/it]
 total tokens: 6675 num samples: 3 num padding tokens: 898 - rank: 1 max len: 2225 min len: 1369 avg len: 1925.6666666666667 num_loss_counted_tokens: 1482
 total tokens: 7610 num samples: 10 num padding tokens: 836 - rank: 4 max len: 761 min len: 583 avg len: 677.4 num_loss_counted_tokens: 5671
 {
    "epoch": 0,
    "step": 60,
    "rank": 0,
    "loss": 0.0031786279287189245,
    "overall_throughput": 42.89061324541665,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.254440784454346,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26469,
    "batch_size": 91,
    "total_loss": 0.7366316914558411,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:35.353516"
 }
 total tokens: 7735 num samples: 17 num padding tokens: 2066 - rank: 6 max len: 455 min len: 253 avg len: 333.47058823529414 num_loss_counted_tokens: 2881
 total tokens: 7116 num samples: 6 num padding tokens: 703 - rank: 2 max len: 1186 min len: 991 avg len: 1068.8333333333333 num_loss_counted_tokens: 4466
 total tokens: 5920 num samples: 2 num padding tokens: 120 - rank: 0 max len: 2960 min len: 2840 avg len: 2900.0 num_loss_counted_tokens: 179
 total tokens: 7696 num samples: 8 num padding tokens: 616 - rank: 3 max len: 962 min len: 775 avg len: 885.0 num_loss_counted_tokens: 4386
 total tokens: 8096 num samples: 32 num padding tokens: 2221 - rank: 7 max len: 253 min len: 72 avg len: 183.59375 num_loss_counted_tokens: 2722
 total tokens: 7566 num samples: 13 num padding tokens: 544 - rank: 5 max len: 582 min len: 459 avg len: 540.1538461538462 num_loss_counted_tokens: 4576
 Per-token loss scaled by world size: 0.0001878267794381827
 Per-token loss scaled by world size: 0.0002473706554155797
 Per-token loss scaled by world size: 0.0005609646323136985
 Per-token loss scaled by world size: 2.916532139352057e-05
 Per-token loss scaled by world size: 0.0005521869170479476
 Per-token loss scaled by world size: 2.548310840211343e-05
 Per-token loss scaled by world size: 0.0002686498628463596
 Epoch: 0, Step: 61, Rank: 3, loss = 0.4526155889034271
 Epoch: 0, Step: 61, Rank: 0, loss = 0.07028113305568695
 Epoch: 0, Step: 61, Rank: 2, loss = 0.5961014032363892
 Epoch: 0, Step: 61, Rank: 6, loss = 1.3306324481964111
 Epoch: 0, Step: 61, Rank: 4, loss = 1.3517844676971436
 Epoch: 0, Step: 61, Rank: 1, loss = 0.061407919973134995
 Epoch: 0, Step: 61, Rank: 7, loss = 0.6473789811134338
 Per-token loss scaled by world size: 0.000720518350135535
 Epoch: 0, Step: 61, Rank: 5, loss = 1.7362691164016724
 Epoch 0:  50%|█████     | 61/121 [02:36<02:32,  2.55s/it]
 total tokens: 6764 num samples: 4 num padding tokens: 645 - rank: 1 max len: 1691 min len: 1416 avg len: 1529.75 num_loss_counted_tokens: 1674
 total tokens: 7416 num samples: 9 num padding tokens: 608 - rank: 4 max len: 824 min len: 677 avg len: 756.4444444444445 num_loss_counted_tokens: 5244
 total tokens: 8088 num samples: 12 num padding tokens: 844 - rank: 5 max len: 674 min len: 545 avg len: 603.6666666666666 num_loss_counted_tokens: 4651
 {
    "epoch": 0,
    "step": 61,
    "rank": 0,
    "loss": 0.07028113305568695,
    "overall_throughput": 42.67067345316129,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.295628547668457,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19278,
    "batch_size": 85,
    "total_loss": 0.7808088660240173,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:37.891642"
 }
 total tokens: 7800 num samples: 24 num padding tokens: 3935 - rank: 7 max len: 325 min len: 82 avg len: 161.04166666666666 num_loss_counted_tokens: 1576
 total tokens: 7965 num samples: 15 num padding tokens: 1199 - rank: 6 max len: 531 min len: 325 avg len: 451.06666666666666 num_loss_counted_tokens: 4050
 total tokens: 7488 num samples: 8 num padding tokens: 552 - rank: 3 max len: 936 min len: 830 avg len: 867.0 num_loss_counted_tokens: 2535
 total tokens: 6438 num samples: 2 num padding tokens: 1075 - rank: 0 max len: 3219 min len: 2144 avg len: 2681.5 num_loss_counted_tokens: 201
 total tokens: 7952 num samples: 7 num padding tokens: 583 - rank: 2 max len: 1136 min len: 993 avg len: 1052.7142857142858 num_loss_counted_tokens: 5968
 Per-token loss scaled by world size: 0.0004150475433561951
 Per-token loss scaled by world size: 0.0003149463445879519
 Per-token loss scaled by world size: 0.000596669502556324
 Per-token loss scaled by world size: 6.351516731228912e-06
 Per-token loss scaled by world size: 8.575078709327499e-07
 Per-token loss scaled by world size: 0.00024007105093915015
 Per-token loss scaled by world size: 0.00015118405281100422
 Epoch: 0, Step: 62, Rank: 5, loss = 1.5943009853363037
 Epoch: 0, Step: 62, Rank: 3, loss = 0.8415366411209106
 Epoch: 0, Step: 62, Rank: 0, loss = 0.016971252858638763
 Epoch: 0, Step: 62, Rank: 4, loss = 1.1090070009231567
 Epoch: 0, Step: 62, Rank: 2, loss = 0.6414698362350464
 Epoch: 0, Step: 62, Rank: 1, loss = 0.0022912609856575727
 Epoch: 0, Step: 62, Rank: 7, loss = 0.40396377444267273
 Per-token loss scaled by world size: 0.0003845185856334865
 Epoch: 0, Step: 62, Rank: 6, loss = 1.0274336338043213
 Epoch 0:  51%|█████     | 62/121 [02:38<02:29,  2.53s/it]
 total tokens: 7392 num samples: 4 num padding tokens: 338 - rank: 1 max len: 1848 min len: 1671 avg len: 1763.5 num_loss_counted_tokens: 2320
 total tokens: 8112 num samples: 8 num padding tokens: 1667 - rank: 4 max len: 1014 min len: 705 avg len: 805.625 num_loss_counted_tokens: 4978
 {
    "epoch": 0,
    "step": 62,
    "rank": 0,
    "loss": 0.016971252858638763,
    "overall_throughput": 43.357387151746714,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.4012451171875,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21376,
    "batch_size": 77,
    "total_loss": 0.7046218514442444,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:40.388446"
 }
 total tokens: 7860 num samples: 30 num padding tokens: 2842 - rank: 7 max len: 262 min len: 79 avg len: 167.26666666666668 num_loss_counted_tokens: 2118
 total tokens: 7711 num samples: 11 num padding tokens: 1331 - rank: 5 max len: 701 min len: 445 avg len: 580.0 num_loss_counted_tokens: 5182
 total tokens: 7974 num samples: 18 num padding tokens: 2210 - rank: 6 max len: 443 min len: 270 avg len: 320.22222222222223 num_loss_counted_tokens: 2977
 total tokens: 6676 num samples: 4 num padding tokens: 617 - rank: 2 max len: 1669 min len: 1336 avg len: 1514.75 num_loss_counted_tokens: 2305
 total tokens: 7974 num samples: 6 num padding tokens: 795 - rank: 3 max len: 1329 min len: 1046 avg len: 1196.5 num_loss_counted_tokens: 3823
 total tokens: 6596 num samples: 2 num padding tokens: 195 - rank: 0 max len: 3298 min len: 3103 avg len: 3200.5 num_loss_counted_tokens: 198
 Per-token loss scaled by world size: 0.00032304422347806394
 Per-token loss scaled by world size: 0.0002737718168646097
 Per-token loss scaled by world size: 0.0002592895762063563
 Per-token loss scaled by world size: 0.0002177765272790566
 Per-token loss scaled by world size: 0.00020476435020100325
 Per-token loss scaled by world size: 0.00020230024529155344
 Per-token loss scaled by world size: 5.3382074838737026e-05
 Epoch: 0, Step: 63, Rank: 0, loss = 0.1870107501745224
 Epoch: 0, Step: 63, Rank: 6, loss = 0.908356249332428
 Epoch: 0, Step: 63, Rank: 4, loss = 0.7173407077789307
 Epoch: 0, Step: 63, Rank: 5, loss = 1.1317046880722046
 Epoch: 0, Step: 63, Rank: 1, loss = 0.959091067314148
 Epoch: 0, Step: 63, Rank: 3, loss = 0.7087083458900452
 Epoch: 0, Step: 63, Rank: 7, loss = 0.7629256248474121
 Per-token loss scaled by world size: 0.0001610093895578757
 Epoch: 0, Step: 63, Rank: 2, loss = 0.5640561580657959
 Epoch 0:  52%|█████▏    | 63/121 [02:41<02:27,  2.54s/it] total tokens: 7680 num samples: 8 num padding tokens: 903 - rank: 4 max len: 960 min len: 763 avg len: 847.125 num_loss_counted_tokens: 3503
 total tokens: 7896 num samples: 3 num padding tokens: 1173 - rank: 1 max len: 2632 min len: 1954 avg len: 2241.0 num_loss_counted_tokens: 499
 {
    "epoch": 0,
    "step": 63,
    "rank": 0,
    "loss": 0.1870107501745224,
    "overall_throughput": 42.41853134282842,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.40930986404419,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 28026,
    "batch_size": 89,
    "total_loss": 0.7423991560935974,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:42.939707"
 }
 total tokens: 7400 num samples: 10 num padding tokens: 1048 - rank: 5 max len: 740 min len: 555 avg len: 635.2 num_loss_counted_tokens: 4022
 total tokens: 7010 num samples: 5 num padding tokens: 576 - rank: 2 max len: 1402 min len: 1179 avg len: 1286.8 num_loss_counted_tokens: 3377
 total tokens: 7182 num samples: 27 num padding tokens: 2131 - rank: 7 max len: 266 min len: 74 avg len: 187.07407407407408 num_loss_counted_tokens: 2558
 total tokens: 7935 num samples: 15 num padding tokens: 1793 - rank: 6 max len: 529 min len: 277 avg len: 409.46666666666664 num_loss_counted_tokens: 3824
 total tokens: 7548 num samples: 2 num padding tokens: 861 - rank: 0 max len: 3774 min len: 2913 avg len: 3343.5 num_loss_counted_tokens: 218
 total tokens: 6996 num samples: 6 num padding tokens: 573 - rank: 3 max len: 1166 min len: 965 avg len: 1070.5 num_loss_counted_tokens: 3633
 Per-token loss scaled by world size: 0.0005391178419813514
 Per-token loss scaled by world size: 0.00022075393644627184
 Per-token loss scaled by world size: 8.797919872449711e-05
 Per-token loss scaled by world size: 0.00035083515103906393
 Per-token loss scaled by world size: 0.0003944748896174133
 Per-token loss scaled by world size: 3.479456063359976e-05
 Per-token loss scaled by world size: 0.0002142135490430519
 Epoch: 0, Step: 64, Rank: 3, loss = 0.64051753282547
 Epoch: 0, Step: 64, Rank: 2, loss = 0.25527164340019226
 Epoch: 0, Step: 64, Rank: 5, loss = 1.5642504692077637
 Epoch: 0, Step: 64, Rank: 4, loss = 1.0179481506347656
 Epoch: 0, Step: 64, Rank: 6, loss = 1.144568920135498
 Epoch: 0, Step: 64, Rank: 1, loss = 0.10095641762018204
 Epoch: 0, Step: 64, Rank: 7, loss = 0.6215406060218811
 Per-token loss scaled by world size: 2.743201912380755e-05
 Epoch: 0, Step: 64, Rank: 0, loss = 0.07959400117397308
 Epoch 0:  53%|█████▎    | 64/121 [02:43<02:26,  2.57s/it] total tokens: 7266 num samples: 2 num padding tokens: 794 - rank: 1 max len: 3633 min len: 2839 avg len: 3236.0 num_loss_counted_tokens: 186
 total tokens: 7720 num samples: 8 num padding tokens: 848 - rank: 4 max len: 965 min len: 791 avg len: 859.0 num_loss_counted_tokens: 4634
 {
    "epoch": 0,
    "step": 64,
    "rank": 0,
    "loss": 0.07959400117397308,
    "overall_throughput": 41.08188082412845,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.510859966278076,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23212,
    "batch_size": 70,
    "total_loss": 0.6780809760093689,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:45.576193"
 }
 total tokens: 7564 num samples: 31 num padding tokens: 2701 - rank: 7 max len: 244 min len: 82 avg len: 156.8709677419355 num_loss_counted_tokens: 1913
 total tokens: 4065 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4065 min len: 4065 avg len: 4065.0 num_loss_counted_tokens: 82
 total tokens: 8030 num samples: 5 num padding tokens: 1753 - rank: 3 max len: 1606 min len: 989 avg len: 1255.4 num_loss_counted_tokens: 3735
 total tokens: 7720 num samples: 10 num padding tokens: 1003 - rank: 5 max len: 772 min len: 589 avg len: 671.7 num_loss_counted_tokens: 4179
 total tokens: 7860 num samples: 4 num padding tokens: 620 - rank: 2 max len: 1965 min len: 1643 avg len: 1810.0 num_loss_counted_tokens: 1863
 total tokens: 7800 num samples: 15 num padding tokens: 1926 - rank: 6 max len: 520 min len: 284 avg len: 391.6 num_loss_counted_tokens: 3775
 Per-token loss scaled by world size: 0.000316505174851045
 Per-token loss scaled by world size: 0.00018296584312338382
 Per-token loss scaled by world size: 0.00035575314541347325
 Per-token loss scaled by world size: 0.00033105004695244133
 Per-token loss scaled by world size: 0.00041151116602122784
 Per-token loss scaled by world size: 4.141435056226328e-05
 Per-token loss scaled by world size: 0.0004561956156976521
 Epoch: 0, Step: 65, Rank: 6, loss = 0.928863525390625
 Epoch: 0, Step: 65, Rank: 0, loss = 0.12154076248407364
 Epoch: 0, Step: 65, Rank: 1, loss = 0.5369589924812317
 Epoch: 0, Step: 65, Rank: 3, loss = 1.0440465211868286
 Epoch: 0, Step: 65, Rank: 7, loss = 0.9715490937232971
 Epoch: 0, Step: 65, Rank: 5, loss = 1.3388200998306274
 Epoch: 0, Step: 65, Rank: 4, loss = 1.2076823711395264
 Per-token loss scaled by world size: 0.00015677251212764531
 Epoch: 0, Step: 65, Rank: 2, loss = 0.4600881338119507
 Epoch 0:  54%|█████▎    | 65/121 [02:46<02:23,  2.57s/it] total tokens: 6615 num samples: 3 num padding tokens: 1003 - rank: 1 max len: 2205 min len: 1609 avg len: 1870.6666666666667 num_loss_counted_tokens: 351
 total tokens: 7942 num samples: 11 num padding tokens: 995 - rank: 4 max len: 722 min len: 567 avg len: 631.5454545454545 num_loss_counted_tokens: 4015
 total tokens: 7820 num samples: 20 num padding tokens: 1923 - rank: 6 max len: 391 min len: 224 avg len: 294.85 num_loss_counted_tokens: 3347
 {
    "epoch": 0,
    "step": 65,
    "rank": 0,
    "loss": 0.12154076248407364,
    "overall_throughput": 42.203863937195706,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.473669052124023,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23478,
    "batch_size": 88,
    "total_loss": 0.8261936902999878,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:48.138254"
 }
 total tokens: 4515 num samples: 21 num padding tokens: 1409 - rank: 7 max len: 215 min len: 76 avg len: 147.9047619047619 num_loss_counted_tokens: 1105
 total tokens: 7320 num samples: 8 num padding tokens: 903 - rank: 3 max len: 915 min len: 724 avg len: 802.125 num_loss_counted_tokens: 4454
 total tokens: 7735 num samples: 5 num padding tokens: 1812 - rank: 2 max len: 1547 min len: 1021 avg len: 1184.6 num_loss_counted_tokens: 1072
 total tokens: 7896 num samples: 14 num padding tokens: 1195 - rank: 5 max len: 564 min len: 397 avg len: 478.64285714285717 num_loss_counted_tokens: 4241
 total tokens: 6444 num samples: 2 num padding tokens: 493 - rank: 0 max len: 3222 min len: 2729 avg len: 2975.5 num_loss_counted_tokens: 461
 Per-token loss scaled by world size: 0.00020512452465482056
 Per-token loss scaled by world size: 0.00037579398485831916
 Per-token loss scaled by world size: 0.00022079057816881686
 Per-token loss scaled by world size: 0.0002576705301180482
 Per-token loss scaled by world size: 5.168040661374107e-05
 Per-token loss scaled by world size: 0.00019716547103598714
 Per-token loss scaled by world size: 7.161292160162702e-05
 Epoch: 0, Step: 66, Rank: 6, loss = 0.8971443176269531
 Epoch: 0, Step: 66, Rank: 4, loss = 1.3084206581115723
 Epoch: 0, Step: 66, Rank: 0, loss = 0.17993825674057007
 Epoch: 0, Step: 66, Rank: 3, loss = 0.7687376141548157
 Epoch: 0, Step: 66, Rank: 2, loss = 0.7141923308372498
 Epoch: 0, Step: 66, Rank: 1, loss = 0.6864808797836304
 Epoch: 0, Step: 66, Rank: 7, loss = 0.249338299036026
 Per-token loss scaled by world size: 0.0002357129706069827
 Epoch: 0, Step: 66, Rank: 5, loss = 0.8206936120986938
 Epoch 0:  55%|█████▍    | 66/121 [02:48<02:20,  2.55s/it] total tokens: 6852 num samples: 3 num padding tokens: 921 - rank: 1 max len: 2284 min len: 1726 avg len: 1977.0 num_loss_counted_tokens: 3152
 total tokens: 7542 num samples: 9 num padding tokens: 1092 - rank: 4 max len: 838 min len: 652 avg len: 716.6666666666666 num_loss_counted_tokens: 3407
 total tokens: 7788 num samples: 33 num padding tokens: 2754 - rank: 7 max len: 236 min len: 71 avg len: 152.54545454545453 num_loss_counted_tokens: 2032
 {
    "epoch": 0,
    "step": 66,
    "rank": 0,
    "loss": 0.17993825674057007,
    "overall_throughput": 43.1739543464684,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.432954788208008,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 27854,
    "batch_size": 94,
    "total_loss": 0.7031182050704956,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:50.647621"
 }
 total tokens: 7980 num samples: 19 num padding tokens: 1539 - rank: 6 max len: 420 min len: 240 avg len: 339.0 num_loss_counted_tokens: 3367
 total tokens: 7175 num samples: 7 num padding tokens: 309 - rank: 3 max len: 1025 min len: 903 avg len: 980.8571428571429 num_loss_counted_tokens: 5453
 total tokens: 7130 num samples: 5 num padding tokens: 1200 - rank: 2 max len: 1426 min len: 1084 avg len: 1186.0 num_loss_counted_tokens: 3905
 total tokens: 7388 num samples: 2 num padding tokens: 641 - rank: 0 max len: 3694 min len: 3053 avg len: 3373.5 num_loss_counted_tokens: 603
 total tokens: 7668 num samples: 12 num padding tokens: 1327 - rank: 5 max len: 639 min len: 423 avg len: 528.4166666666666 num_loss_counted_tokens: 3896
 Per-token loss scaled by world size: 0.00023726793006062508
 Per-token loss scaled by world size: 0.00031606658012606204
 Per-token loss scaled by world size: 0.000504097668454051
 Per-token loss scaled by world size: 0.00039712167927064
 Per-token loss scaled by world size: 0.0004929095157422125
 Per-token loss scaled by world size: 6.066870355425635e-06
 Per-token loss scaled by world size: 2.3820351998438127e-05
 Epoch: 0, Step: 67, Rank: 2, loss = 0.6478897333145142
 Epoch: 0, Step: 67, Rank: 3, loss = 0.8630593419075012
 Epoch: 0, Step: 67, Rank: 6, loss = 1.3765016794204712
 Epoch: 0, Step: 67, Rank: 0, loss = 0.01656634733080864
 Epoch: 0, Step: 67, Rank: 7, loss = 1.08439040184021
 Epoch: 0, Step: 67, Rank: 4, loss = 1.3459510803222656
 Epoch: 0, Step: 67, Rank: 1, loss = 0.06504444777965546
 Per-token loss scaled by world size: 0.00038537452928721905
 Epoch: 0, Step: 67, Rank: 5, loss = 1.0523133277893066
 Epoch 0:  55%|█████▌    | 67/121 [02:51<02:16,  2.53s/it] total tokens: 6858 num samples: 3 num padding tokens: 879 - rank: 1 max len: 2286 min len: 1606 avg len: 1993.0 num_loss_counted_tokens: 2226
 total tokens: 8001 num samples: 9 num padding tokens: 744 - rank: 4 max len: 889 min len: 750 avg len: 806.3333333333334 num_loss_counted_tokens: 3999
 {
    "epoch": 0,
    "step": 67,
    "rank": 0,
    "loss": 0.01656634733080864,
    "overall_throughput": 43.254684230948214,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.338647842407227,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21845,
    "batch_size": 79,
    "total_loss": 0.8064644932746887,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:53.149207"
 }
 total tokens: 8085 num samples: 11 num padding tokens: 2256 - rank: 5 max len: 735 min len: 410 avg len: 529.9090909090909 num_loss_counted_tokens: 4137
 total tokens: 7679 num samples: 7 num padding tokens: 788 - rank: 3 max len: 1097 min len: 910 avg len: 984.4285714285714 num_loss_counted_tokens: 4646
 total tokens: 6030 num samples: 2 num padding tokens: 245 - rank: 0 max len: 3015 min len: 2770 avg len: 2892.5 num_loss_counted_tokens: 180
 total tokens: 8040 num samples: 20 num padding tokens: 1710 - rank: 6 max len: 402 min len: 246 avg len: 316.5 num_loss_counted_tokens: 2948
 total tokens: 7520 num samples: 32 num padding tokens: 2159 - rank: 7 max len: 235 min len: 85 avg len: 167.53125 num_loss_counted_tokens: 2187
 total tokens: 7895 num samples: 5 num padding tokens: 1603 - rank: 2 max len: 1579 min len: 1158 avg len: 1258.4 num_loss_counted_tokens: 2148
 Per-token loss scaled by world size: 0.00019273992802482098
 Per-token loss scaled by world size: 0.00030933329253457487
 Per-token loss scaled by world size: 0.00017293139535468072
 Per-token loss scaled by world size: 0.00035478913923725486
 Per-token loss scaled by world size: 0.0003922785690519959
 Per-token loss scaled by world size: 4.257708951627137e-06
 Per-token loss scaled by world size: 0.0001415474253008142
 Epoch: 0, Step: 68, Rank: 6, loss = 1.0613998174667358
 Epoch: 0, Step: 68, Rank: 2, loss = 0.6613388657569885
 Epoch: 0, Step: 68, Rank: 5, loss = 1.3460057973861694
 Epoch: 0, Step: 68, Rank: 0, loss = 0.014609264209866524
 Epoch: 0, Step: 68, Rank: 4, loss = 1.2173702716827393
 Epoch: 0, Step: 68, Rank: 1, loss = 0.5933708548545837
 Epoch: 0, Step: 68, Rank: 7, loss = 0.4856846034526825
 Per-token loss scaled by world size: 0.00021395196381490678
 Epoch: 0, Step: 68, Rank: 3, loss = 0.7341226935386658
 Epoch 0:  56%|█████▌    | 68/121 [02:53<02:14,  2.54s/it] total tokens: 8082 num samples: 9 num padding tokens: 769 - rank: 4 max len: 898 min len: 707 avg len: 812.5555555555555 num_loss_counted_tokens: 4363
 total tokens: 7876 num samples: 4 num padding tokens: 706 - rank: 1 max len: 1969 min len: 1606 avg len: 1792.5 num_loss_counted_tokens: 565
 {
    "epoch": 0,
    "step": 68,
    "rank": 0,
    "loss": 0.014609264209866524,
    "overall_throughput": 42.56272046948144,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.448001861572266,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 27450,
    "batch_size": 88,
    "total_loss": 0.7642378211021423,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:55.692310"
 }
 total tokens: 7612 num samples: 11 num padding tokens: 1187 - rank: 5 max len: 692 min len: 458 avg len: 584.0909090909091 num_loss_counted_tokens: 4392
 total tokens: 5586 num samples: 2 num padding tokens: 652 - rank: 0 max len: 2793 min len: 2141 avg len: 2467.0 num_loss_counted_tokens: 151
 total tokens: 8060 num samples: 31 num padding tokens: 3056 - rank: 7 max len: 260 min len: 75 avg len: 161.41935483870967 num_loss_counted_tokens: 2101
 total tokens: 7220 num samples: 5 num padding tokens: 823 - rank: 2 max len: 1444 min len: 1143 avg len: 1279.4 num_loss_counted_tokens: 1187
 total tokens: 7902 num samples: 18 num padding tokens: 1530 - rank: 6 max len: 439 min len: 282 avg len: 354.0 num_loss_counted_tokens: 3796
 total tokens: 7882 num samples: 7 num padding tokens: 581 - rank: 3 max len: 1126 min len: 922 avg len: 1043.0 num_loss_counted_tokens: 4213
 Per-token loss scaled by world size: 0.0005160618457011878
 Per-token loss scaled by world size: 0.00044540074304677546
 Per-token loss scaled by world size: 4.21712247771211e-05
 Per-token loss scaled by world size: 0.0002244754577986896
 Per-token loss scaled by world size: 0.0007427233504131436
 Per-token loss scaled by world size: 9.168142241833266e-06
 Per-token loss scaled by world size: 0.0003082228358834982
 Epoch: 0, Step: 69, Rank: 2, loss = 0.09369391947984695
 Epoch: 0, Step: 69, Rank: 5, loss = 0.9895691275596619
 Epoch: 0, Step: 69, Rank: 6, loss = 1.1465604305267334
 Epoch: 0, Step: 69, Rank: 3, loss = 0.49872833490371704
 Epoch: 0, Step: 69, Rank: 4, loss = 1.6501456499099731
 Epoch: 0, Step: 69, Rank: 1, loss = 0.02036931924521923
 Epoch: 0, Step: 69, Rank: 7, loss = 0.6847940683364868
 Per-token loss scaled by world size: 4.934850949211977e-06
 Epoch: 0, Step: 69, Rank: 0, loss = 0.010964005254209042
 Epoch 0:  57%|█████▋    | 69/121 [02:56<02:12,  2.55s/it] total tokens: 7194 num samples: 6 num padding tokens: 688 - rank: 1 max len: 1199 min len: 1008 avg len: 1084.3333333333333 num_loss_counted_tokens: 3201
 total tokens: 7668 num samples: 12 num padding tokens: 514 - rank: 4 max len: 639 min len: 540 avg len: 596.1666666666666 num_loss_counted_tokens: 3626
 {
    "epoch": 0,
    "step": 69,
    "rank": 0,
    "loss": 0.010964005254209042,
    "overall_throughput": 41.668886771410214,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.496148586273193,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17774,
    "batch_size": 74,
    "total_loss": 0.6368531584739685,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:50:58.281411"
 }
 total tokens: 3168 num samples: 24 num padding tokens: 653 - rank: 7 max len: 132 min len: 86 avg len: 104.79166666666667 num_loss_counted_tokens: 693
 total tokens: 5566 num samples: 2 num padding tokens: 709 - rank: 0 max len: 2783 min len: 2074 avg len: 2428.5 num_loss_counted_tokens: 267
 total tokens: 8040 num samples: 15 num padding tokens: 1436 - rank: 5 max len: 536 min len: 357 avg len: 440.26666666666665 num_loss_counted_tokens: 4304
 total tokens: 7872 num samples: 24 num padding tokens: 2389 - rank: 6 max len: 328 min len: 136 avg len: 228.45833333333334 num_loss_counted_tokens: 2377
 total tokens: 7784 num samples: 8 num padding tokens: 564 - rank: 2 max len: 973 min len: 847 avg len: 902.5 num_loss_counted_tokens: 5185
 total tokens: 7800 num samples: 10 num padding tokens: 617 - rank: 3 max len: 780 min len: 641 avg len: 718.3 num_loss_counted_tokens: 5599
 Per-token loss scaled by world size: 0.00038898465572856367
 Per-token loss scaled by world size: 0.0003429916687309742
 Per-token loss scaled by world size: 0.0005293539143167436
 Per-token loss scaled by world size: 2.475303517712746e-06
 Per-token loss scaled by world size: 0.00041178142419084907
 Per-token loss scaled by world size: 0.00016001032781787217
 Per-token loss scaled by world size: 3.3767562854336575e-05
 Epoch: 0, Step: 70, Rank: 6, loss = 0.9818993806838989
 Epoch: 0, Step: 70, Rank: 4, loss = 1.515407919883728
 Epoch: 0, Step: 70, Rank: 5, loss = 1.1135658025741577
 Epoch: 0, Step: 70, Rank: 0, loss = 0.00708617502823472
 Epoch: 0, Step: 70, Rank: 3, loss = 1.1788272857666016
 Epoch: 0, Step: 70, Rank: 7, loss = 0.4580695629119873
 Epoch: 0, Step: 70, Rank: 1, loss = 0.09666808694601059
 Per-token loss scaled by world size: 0.00012752025213558227
 Epoch: 0, Step: 70, Rank: 2, loss = 0.3650586009025574
 Epoch 0:  58%|█████▊    | 70/121 [02:58<02:09,  2.54s/it] total tokens: 7424 num samples: 8 num padding tokens: 763 - rank: 4 max len: 928 min len: 726 avg len: 832.625 num_loss_counted_tokens: 3620
 total tokens: 7068 num samples: 4 num padding tokens: 755 - rank: 1 max len: 1767 min len: 1403 avg len: 1578.25 num_loss_counted_tokens: 1097
 {
    "epoch": 0,
    "step": 70,
    "rank": 0,
    "loss": 0.00708617502823472,
    "overall_throughput": 42.86151276291768,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.356226444244385,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22902,
    "batch_size": 78,
    "total_loss": 0.7145729064941406,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:00.845490"
 }
 total tokens: 6700 num samples: 25 num padding tokens: 2499 - rank: 7 max len: 268 min len: 77 avg len: 168.04 num_loss_counted_tokens: 1712
 total tokens: 7984 num samples: 16 num padding tokens: 1302 - rank: 6 max len: 499 min len: 278 avg len: 417.625 num_loss_counted_tokens: 3478
 total tokens: 7644 num samples: 7 num padding tokens: 654 - rank: 3 max len: 1092 min len: 950 avg len: 998.5714285714286 num_loss_counted_tokens: 5657
 total tokens: 6950 num samples: 5 num padding tokens: 699 - rank: 2 max len: 1390 min len: 1102 avg len: 1250.2 num_loss_counted_tokens: 2977
 total tokens: 7590 num samples: 11 num padding tokens: 1060 - rank: 5 max len: 690 min len: 509 avg len: 593.6363636363636 num_loss_counted_tokens: 4071
 total tokens: 7488 num samples: 3 num padding tokens: 1132 - rank: 0 max len: 2496 min len: 1787 avg len: 2118.6666666666665 num_loss_counted_tokens: 1930
 Per-token loss scaled by world size: 0.00010220974945696071
 Per-token loss scaled by world size: 0.0004755923873744905
 Per-token loss scaled by world size: 0.00013658934039995074
 Per-token loss scaled by world size: 0.0005745669477619231
 Per-token loss scaled by world size: 0.00038079574005678296
 Per-token loss scaled by world size: 1.1699220294758561e-06
 Per-token loss scaled by world size: 0.0002442343102302402
 Epoch: 0, Step: 71, Rank: 6, loss = 1.3208389282226562
 Epoch: 0, Step: 71, Rank: 5, loss = 1.595716118812561
 Epoch: 0, Step: 71, Rank: 0, loss = 0.0032491658348590136
 Epoch: 0, Step: 71, Rank: 1, loss = 0.28386202454566956
 Epoch: 0, Step: 71, Rank: 2, loss = 0.3793427646160126
 Epoch: 0, Step: 71, Rank: 4, loss = 1.0575649738311768
 Epoch: 0, Step: 71, Rank: 7, loss = 0.6782997250556946
 Per-token loss scaled by world size: 0.0002879296080209315
 Epoch: 0, Step: 71, Rank: 3, loss = 0.7996525168418884
 Epoch 0:  59%|█████▊    | 71/121 [03:01<02:07,  2.54s/it] total tokens: 7900 num samples: 10 num padding tokens: 433 - rank: 4 max len: 790 min len: 700 avg len: 746.7 num_loss_counted_tokens: 4392
 total tokens: 7940 num samples: 5 num padding tokens: 1033 - rank: 1 max len: 1588 min len: 1214 avg len: 1381.4 num_loss_counted_tokens: 4515
 {
    "epoch": 0,
    "step": 71,
    "rank": 0,
    "loss": 0.0032491658348590136,
    "overall_throughput": 42.58421912234305,
    "lr": 8.000000000000001e-07,
    "cuda_mem_allocated": 24.31266736984253,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22218,
    "batch_size": 83,
    "total_loss": 0.7648157477378845,
    "gradnorm": 0.9589425325393677,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:03.350519"
 }
 total tokens: 7645 num samples: 11 num padding tokens: 1259 - rank: 5 max len: 695 min len: 477 avg len: 580.5454545454545 num_loss_counted_tokens: 4944
 total tokens: 8092 num samples: 17 num padding tokens: 1854 - rank: 6 max len: 476 min len: 261 avg len: 366.94117647058823 num_loss_counted_tokens: 3900
 total tokens: 7931 num samples: 7 num padding tokens: 505 - rank: 2 max len: 1133 min len: 979 avg len: 1060.857142857143 num_loss_counted_tokens: 4912
 total tokens: 7776 num samples: 8 num padding tokens: 515 - rank: 3 max len: 972 min len: 822 avg len: 907.625 num_loss_counted_tokens: 5257
 total tokens: 7904 num samples: 32 num padding tokens: 2745 - rank: 7 max len: 247 min len: 71 avg len: 161.21875 num_loss_counted_tokens: 2232
 total tokens: 8004 num samples: 4 num padding tokens: 730 - rank: 0 max len: 2001 min len: 1656 avg len: 1818.5 num_loss_counted_tokens: 2493
 Per-token loss scaled by world size: 0.0004883022629655898
 Per-token loss scaled by world size: 0.00043922686018049717
 Per-token loss scaled by world size: 0.0004386535147204995
 Per-token loss scaled by world size: 0.00020862463861703873
 Per-token loss scaled by world size: 0.0003196638426743448
 Per-token loss scaled by world size: 4.3209151954215486e-06
 Per-token loss scaled by world size: 0.0001734672114253044
 Epoch: 0, Step: 72, Rank: 6, loss = 1.25338876247406
 Epoch: 0, Step: 72, Rank: 5, loss = 1.2517526149749756
 Epoch: 0, Step: 72, Rank: 4, loss = 1.393431544303894
 Epoch: 0, Step: 72, Rank: 2, loss = 0.5953364968299866
 Epoch: 0, Step: 72, Rank: 7, loss = 0.9122007489204407
 Epoch: 0, Step: 72, Rank: 0, loss = 0.012330272234976292
 Epoch: 0, Step: 72, Rank: 1, loss = 0.4950103759765625
 Per-token loss scaled by world size: 0.00020103121642023325
 Epoch: 0, Step: 72, Rank: 3, loss = 0.5736677050590515
 [2024-08-18 20:51:05,925] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[1.6000000000000001e-06], mom=[(0.9, 0.95)]
 Epoch 0:  60%|█████▉    | 72/121 [03:04<02:06,  2.58s/it] total tokens: 7434 num samples: 7 num padding tokens: 801 - rank: 4 max len: 1062 min len: 831 avg len: 947.5714285714286 num_loss_counted_tokens: 3620
 total tokens: 7284 num samples: 3 num padding tokens: 180 - rank: 1 max len: 2428 min len: 2292 avg len: 2368.0 num_loss_counted_tokens: 273
 {
    "epoch": 0,
    "step": 72,
    "rank": 0,
    "loss": 0.012330272234976292,
    "overall_throughput": 41.0709419873187,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 22.637446880340576,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22829,
    "batch_size": 79,
    "total_loss": 0.8108897805213928,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:06.059291"
 }
 total tokens: 6880 num samples: 5 num padding tokens: 854 - rank: 3 max len: 1376 min len: 1145 avg len: 1205.2 num_loss_counted_tokens: 2855
 total tokens: 6178 num samples: 2 num padding tokens: 622 - rank: 0 max len: 3089 min len: 2467 avg len: 2778.0 num_loss_counted_tokens: 309
 total tokens: 7306 num samples: 26 num padding tokens: 2608 - rank: 7 max len: 281 min len: 72 avg len: 180.69230769230768 num_loss_counted_tokens: 1912
 total tokens: 7960 num samples: 4 num padding tokens: 988 - rank: 2 max len: 1990 min len: 1586 avg len: 1743.0 num_loss_counted_tokens: 1479
 total tokens: 7548 num samples: 12 num padding tokens: 2286 - rank: 6 max len: 629 min len: 305 avg len: 438.5 num_loss_counted_tokens: 3206
 total tokens: 8040 num samples: 10 num padding tokens: 894 - rank: 5 max len: 804 min len: 634 avg len: 714.6 num_loss_counted_tokens: 4636
 Per-token loss scaled by world size: 0.00029286538483574986
 Per-token loss scaled by world size: 0.0002602968306746334
 Per-token loss scaled by world size: 0.00021679738711100072
 Per-token loss scaled by world size: 0.00021336728241294622
 Per-token loss scaled by world size: 0.0002807514392770827
 Per-token loss scaled by world size: 2.977332087539253e-06
 Per-token loss scaled by world size: 0.00019802094902843237
 Epoch: 0, Step: 73, Rank: 1, loss = 0.8374074697494507
 Epoch: 0, Step: 73, Rank: 6, loss = 0.9421845078468323
 Epoch: 0, Step: 73, Rank: 4, loss = 0.6974642872810364
 Epoch: 0, Step: 73, Rank: 0, loss = 0.00957844965159893
 Epoch: 0, Step: 73, Rank: 3, loss = 0.6864292025566101
 Epoch: 0, Step: 73, Rank: 2, loss = 0.9032124280929565
 Epoch: 0, Step: 73, Rank: 7, loss = 0.6370581388473511
 Per-token loss scaled by world size: 0.0002398234064457938
 Epoch: 0, Step: 73, Rank: 5, loss = 0.7715418934822083
 Epoch 0:  60%|██████    | 73/121 [03:06<02:03,  2.57s/it] total tokens: 7832 num samples: 8 num padding tokens: 827 - rank: 4 max len: 979 min len: 822 avg len: 875.625 num_loss_counted_tokens: 5623
 total tokens: 7455 num samples: 3 num padding tokens: 291 - rank: 1 max len: 2485 min len: 2232 avg len: 2388.0 num_loss_counted_tokens: 600
 {
    "epoch": 0,
    "step": 73,
    "rank": 0,
    "loss": 0.00957844965159893,
    "overall_throughput": 41.76512858398771,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.471110343933105,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25737,
    "batch_size": 88,
    "total_loss": 0.6856094598770142,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:08.555309"
 }
 total tokens: 8020 num samples: 10 num padding tokens: 1418 - rank: 5 max len: 802 min len: 571 avg len: 660.2 num_loss_counted_tokens: 4480
 total tokens: 5714 num samples: 2 num padding tokens: 293 - rank: 0 max len: 2857 min len: 2564 avg len: 2710.5 num_loss_counted_tokens: 246
 total tokens: 7260 num samples: 6 num padding tokens: 413 - rank: 3 max len: 1210 min len: 1056 avg len: 1141.1666666666667 num_loss_counted_tokens: 3104
 total tokens: 6768 num samples: 4 num padding tokens: 992 - rank: 2 max len: 1692 min len: 1291 avg len: 1444.0 num_loss_counted_tokens: 3075
 total tokens: 7602 num samples: 14 num padding tokens: 1899 - rank: 6 max len: 543 min len: 306 avg len: 407.35714285714283 num_loss_counted_tokens: 3724
 total tokens: 7930 num samples: 26 num padding tokens: 2791 - rank: 7 max len: 305 min len: 79 avg len: 197.65384615384616 num_loss_counted_tokens: 2193
 Per-token loss scaled by world size: 0.0004040475469082594
 Per-token loss scaled by world size: 0.00014303348143585026
 Per-token loss scaled by world size: 0.00015468306082766503
 Per-token loss scaled by world size: 0.00038016383768990636
 Per-token loss scaled by world size: 0.0002839408116415143
 Per-token loss scaled by world size: 0.00020860570657532662
 Per-token loss scaled by world size: 3.69467556993186e-06
 Epoch: 0, Step: 74, Rank: 5, loss = 1.1417745351791382
 Epoch: 0, Step: 74, Rank: 1, loss = 0.4645712375640869
 Epoch: 0, Step: 74, Rank: 4, loss = 0.4295831620693207
 Epoch: 0, Step: 74, Rank: 6, loss = 1.2135063409805298
 Epoch: 0, Step: 74, Rank: 7, loss = 0.8527806997299194
 Epoch: 0, Step: 74, Rank: 0, loss = 0.011096496134996414
 Epoch: 0, Step: 74, Rank: 2, loss = 0.6265211701393127
 Per-token loss scaled by world size: 0.00020632839004974812
 Epoch: 0, Step: 74, Rank: 3, loss = 0.6196815371513367
 Epoch 0:  61%|██████    | 74/121 [03:09<02:00,  2.55s/it] total tokens: 7696 num samples: 8 num padding tokens: 1763 - rank: 4 max len: 962 min len: 651 avg len: 741.625 num_loss_counted_tokens: 5122
 total tokens: 7854 num samples: 3 num padding tokens: 565 - rank: 1 max len: 2618 min len: 2101 avg len: 2429.6666666666665 num_loss_counted_tokens: 298
 {
    "epoch": 0,
    "step": 74,
    "rank": 0,
    "loss": 0.011096496134996414,
    "overall_throughput": 41.849181139415684,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.389501094818115,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24027,
    "batch_size": 89,
    "total_loss": 0.6699394583702087,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:11.072098"
 }
 total tokens: 7530 num samples: 5 num padding tokens: 1325 - rank: 3 max len: 1506 min len: 1053 avg len: 1241.0 num_loss_counted_tokens: 2113
 total tokens: 7860 num samples: 30 num padding tokens: 3145 - rank: 7 max len: 262 min len: 80 avg len: 157.16666666666666 num_loss_counted_tokens: 1940
 total tokens: 7596 num samples: 12 num padding tokens: 1105 - rank: 5 max len: 633 min len: 472 avg len: 540.9166666666666 num_loss_counted_tokens: 4291
 total tokens: 4061 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4061 min len: 4061 avg len: 4061.0 num_loss_counted_tokens: 393
 total tokens: 6255 num samples: 3 num padding tokens: 930 - rank: 2 max len: 2085 min len: 1507 avg len: 1775.0 num_loss_counted_tokens: 1265
 total tokens: 7905 num samples: 17 num padding tokens: 1989 - rank: 6 max len: 465 min len: 271 avg len: 348.0 num_loss_counted_tokens: 3447
 Per-token loss scaled by world size: 0.0008345923852175474
 Per-token loss scaled by world size: 0.0001891565480036661
 Per-token loss scaled by world size: 0.0006257555796764791
 Per-token loss scaled by world size: 5.515092198038474e-06
 Per-token loss scaled by world size: 0.00020594018860720098
 Per-token loss scaled by world size: 4.789793456438929e-05
 Per-token loss scaled by world size: 9.402850264450535e-05
 Epoch: 0, Step: 75, Rank: 5, loss = 1.483744740486145
 Epoch: 0, Step: 75, Rank: 3, loss = 0.44851380586624146
 Epoch: 0, Step: 75, Rank: 0, loss = 0.013076973147690296
 Epoch: 0, Step: 75, Rank: 4, loss = 1.9789228439331055
 Epoch: 0, Step: 75, Rank: 2, loss = 0.22295333445072174
 Epoch: 0, Step: 75, Rank: 1, loss = 0.11357199400663376
 Epoch: 0, Step: 75, Rank: 7, loss = 0.48830991983413696
 Per-token loss scaled by world size: 0.0005321354838088155
 Epoch: 0, Step: 75, Rank: 6, loss = 1.2617597579956055
 Epoch 0:  62%|██████▏   | 75/121 [03:11<01:57,  2.54s/it]
 total tokens: 7851 num samples: 3 num padding tokens: 1455 - rank: 1 max len: 2617 min len: 1665 avg len: 2132.0 num_loss_counted_tokens: 271
 total tokens: 7112 num samples: 7 num padding tokens: 1148 - rank: 4 max len: 1016 min len: 665 avg len: 852.0 num_loss_counted_tokens: 3987
 total tokens: 8086 num samples: 13 num padding tokens: 1573 - rank: 5 max len: 622 min len: 389 avg len: 501.0 num_loss_counted_tokens: 4062
 {
    "epoch": 0,
    "step": 75,
    "rank": 0,
    "loss": 0.013076973147690296,
    "overall_throughput": 41.95248737744596,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.426692962646484,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18969,
    "batch_size": 77,
    "total_loss": 0.7513566613197327,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:13.595856"
 }
 total tokens: 7182 num samples: 6 num padding tokens: 591 - rank: 3 max len: 1197 min len: 1043 avg len: 1098.5 num_loss_counted_tokens: 5313
 total tokens: 6628 num samples: 4 num padding tokens: 547 - rank: 2 max len: 1657 min len: 1381 avg len: 1520.25 num_loss_counted_tokens: 947
 total tokens: 3496 num samples: 19 num padding tokens: 1312 - rank: 7 max len: 184 min len: 78 avg len: 114.94736842105263 num_loss_counted_tokens: 656
 total tokens: 7938 num samples: 21 num padding tokens: 2607 - rank: 6 max len: 378 min len: 187 avg len: 253.85714285714286 num_loss_counted_tokens: 2571
 total tokens: 7226 num samples: 2 num padding tokens: 50 - rank: 0 max len: 3613 min len: 3563 avg len: 3588.0 num_loss_counted_tokens: 179
 Per-token loss scaled by world size: 0.00010904129885602742
 Per-token loss scaled by world size: 0.00042939232662320137
 Per-token loss scaled by world size: 0.0003037904389202595
 Per-token loss scaled by world size: 0.00046344727161340415
 Per-token loss scaled by world size: 6.435919203795493e-05
 Per-token loss scaled by world size: 0.0002804531832225621
 Per-token loss scaled by world size: 0.00045534392120316625
 Epoch: 0, Step: 76, Rank: 5, loss = 1.2729872465133667
 Epoch: 0, Step: 76, Rank: 6, loss = 1.3739473819732666
 Epoch: 0, Step: 76, Rank: 2, loss = 0.900624692440033
 Epoch: 0, Step: 76, Rank: 0, loss = 0.19080086052417755
 Epoch: 0, Step: 76, Rank: 1, loss = 0.32326656579971313
 Epoch: 0, Step: 76, Rank: 7, loss = 0.8314384818077087
 Epoch: 0, Step: 76, Rank: 4, loss = 1.3499239683151245
 Per-token loss scaled by world size: 0.000354817311745137
 Epoch: 0, Step: 76, Rank: 3, loss = 1.0519002676010132
 Epoch 0:  63%|██████▎   | 76/121 [03:14<01:54,  2.54s/it]
 total tokens: 7095 num samples: 5 num padding tokens: 648 - rank: 4 max len: 1419 min len: 1179 avg len: 1289.4 num_loss_counted_tokens: 3185
 total tokens: 6052 num samples: 2 num padding tokens: 305 - rank: 1 max len: 3026 min len: 2721 avg len: 2873.5 num_loss_counted_tokens: 349
 {
    "epoch": 0,
    "step": 76,
    "rank": 0,
    "loss": 0.19080086052417755,
    "overall_throughput": 41.77716057032778,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.457417488098145,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23717,
    "batch_size": 104,
    "total_loss": 0.9118610620498657,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:16.129445"
 }
 total tokens: 7774 num samples: 13 num padding tokens: 2031 - rank: 6 max len: 598 min len: 311 avg len: 441.7692307692308 num_loss_counted_tokens: 3287
 total tokens: 8034 num samples: 3 num padding tokens: 672 - rank: 2 max len: 2678 min len: 2124 avg len: 2454.0 num_loss_counted_tokens: 1267
 total tokens: 7544 num samples: 4 num padding tokens: 572 - rank: 3 max len: 1886 min len: 1658 avg len: 1743.0 num_loss_counted_tokens: 711
 total tokens: 7700 num samples: 7 num padding tokens: 2236 - rank: 5 max len: 1100 min len: 637 avg len: 780.5714285714286 num_loss_counted_tokens: 3735
 total tokens: 6648 num samples: 24 num padding tokens: 2581 - rank: 7 max len: 277 min len: 90 avg len: 169.45833333333334 num_loss_counted_tokens: 1750
 total tokens: 6410 num samples: 2 num padding tokens: 53 - rank: 0 max len: 3205 min len: 3152 avg len: 3178.5 num_loss_counted_tokens: 196
 Per-token loss scaled by world size: 0.00034956797026097775
 Per-token loss scaled by world size: 0.00019042924395762384
 Per-token loss scaled by world size: 0.00021594665304291993
 Per-token loss scaled by world size: 0.000333549891365692
 Per-token loss scaled by world size: 0.00039773472235538065
 Per-token loss scaled by world size: 1.5378537909782608e-06
 Per-token loss scaled by world size: 1.5691426597186364e-05
 Epoch: 0, Step: 77, Rank: 7, loss = 1.0991719961166382
 Epoch: 0, Step: 77, Rank: 6, loss = 0.7116252183914185
 Epoch: 0, Step: 77, Rank: 4, loss = 1.3106850385665894
 Epoch: 0, Step: 77, Rank: 5, loss = 1.1519575119018555
 Epoch: 0, Step: 77, Rank: 2, loss = 0.6275357604026794
 Epoch: 0, Step: 77, Rank: 1, loss = 0.0517091378569603
 Epoch: 0, Step: 77, Rank: 0, loss = 0.005067804828286171
 Per-token loss scaled by world size: 0.0002473424538038671
 Epoch: 0, Step: 77, Rank: 3, loss = 0.8150861859321594
 Epoch 0:  64%|██████▎   | 77/121 [03:16<01:51,  2.53s/it]
 total tokens: 7920 num samples: 9 num padding tokens: 741 - rank: 4 max len: 880 min len: 742 avg len: 797.6666666666666 num_loss_counted_tokens: 3310
 total tokens: 5446 num samples: 2 num padding tokens: 50 - rank: 1 max len: 2723 min len: 2673 avg len: 2698.0 num_loss_counted_tokens: 175
 total tokens: 7689 num samples: 11 num padding tokens: 1344 - rank: 5 max len: 699 min len: 481 avg len: 576.8181818181819 num_loss_counted_tokens: 3004
 total tokens: 8041 num samples: 17 num padding tokens: 1807 - rank: 6 max len: 473 min len: 265 avg len: 366.70588235294116 num_loss_counted_tokens: 4129
 {
    "epoch": 0,
    "step": 77,
    "rank": 0,
    "loss": 0.005067804828286171,
    "overall_throughput": 42.455461086288004,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.274834632873535,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26363,
    "batch_size": 91,
    "total_loss": 0.7216048836708069,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:18.625029"
 }
 total tokens: 6488 num samples: 2 num padding tokens: 319 - rank: 0 max len: 3244 min len: 2925 avg len: 3084.5 num_loss_counted_tokens: 219
 total tokens: 7337 num samples: 29 num padding tokens: 2640 - rank: 7 max len: 253 min len: 79 avg len: 161.9655172413793 num_loss_counted_tokens: 1852
 total tokens: 7398 num samples: 6 num padding tokens: 1043 - rank: 3 max len: 1233 min len: 937 avg len: 1059.1666666666667 num_loss_counted_tokens: 4139
 total tokens: 7473 num samples: 3 num padding tokens: 1919 - rank: 2 max len: 2491 min len: 1521 avg len: 1851.3333333333333 num_loss_counted_tokens: 1621
 Per-token loss scaled by world size: 0.00012608377437572926
 Per-token loss scaled by world size: 0.00035810453118756413
 Per-token loss scaled by world size: 0.00015491498925257474
 Per-token loss scaled by world size: 0.00043326299055479467
 Per-token loss scaled by world size: 0.00016809521184768528
 Per-token loss scaled by world size: 3.594179133870057e-06
 Per-token loss scaled by world size: 0.0003268007712904364
 Epoch: 0, Step: 78, Rank: 6, loss = 1.1593186855316162
 Epoch: 0, Step: 78, Rank: 5, loss = 1.4026347398757935
 Epoch: 0, Step: 78, Rank: 7, loss = 0.5015178918838501
 Epoch: 0, Step: 78, Rank: 3, loss = 0.40818047523498535
 Epoch: 0, Step: 78, Rank: 0, loss = 0.011635705828666687
 Epoch: 0, Step: 78, Rank: 1, loss = 0.5441872477531433
 Epoch: 0, Step: 78, Rank: 4, loss = 1.0579766035079956
 Per-token loss scaled by world size: 0.000327433692291379
 Epoch: 0, Step: 78, Rank: 2, loss = 1.060025691986084
 Epoch 0:  64%|██████▍   | 78/121 [03:19<01:49,  2.54s/it]
 total tokens: 7496 num samples: 8 num padding tokens: 719 - rank: 4 max len: 937 min len: 761 avg len: 847.125 num_loss_counted_tokens: 4720
 total tokens: 8100 num samples: 4 num padding tokens: 633 - rank: 1 max len: 2025 min len: 1692 avg len: 1866.75 num_loss_counted_tokens: 1316
 {
    "epoch": 0,
    "step": 78,
    "rank": 0,
    "loss": 0.011635705828666687,
    "overall_throughput": 41.25017196728111,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.33673620223999,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25899,
    "batch_size": 81,
    "total_loss": 0.7681846618652344,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:21.186251"
 }
 total tokens: 8107 num samples: 11 num padding tokens: 1051 - rank: 5 max len: 737 min len: 542 avg len: 641.4545454545455 num_loss_counted_tokens: 4630
 total tokens: 7305 num samples: 5 num padding tokens: 524 - rank: 2 max len: 1461 min len: 1128 avg len: 1356.2 num_loss_counted_tokens: 3964
 total tokens: 7209 num samples: 27 num padding tokens: 2040 - rank: 7 max len: 267 min len: 82 avg len: 191.44444444444446 num_loss_counted_tokens: 2311
 total tokens: 7672 num samples: 7 num padding tokens: 538 - rank: 3 max len: 1096 min len: 942 avg len: 1019.1428571428571 num_loss_counted_tokens: 4249
 total tokens: 7226 num samples: 2 num padding tokens: 698 - rank: 0 max len: 3613 min len: 2915 avg len: 3264.0 num_loss_counted_tokens: 204
 total tokens: 8070 num samples: 15 num padding tokens: 1557 - rank: 6 max len: 538 min len: 297 avg len: 434.2 num_loss_counted_tokens: 3942
 Per-token loss scaled by world size: 0.0003785255830734968
 Per-token loss scaled by world size: 0.00020959046378266066
 Per-token loss scaled by world size: 0.0004416834854055196
 Per-token loss scaled by world size: 0.0002668427478056401
 Per-token loss scaled by world size: 0.00010036973981186748
 Per-token loss scaled by world size: 0.00018033267406281084
 Per-token loss scaled by world size: 4.8121955842361785e-06
 Epoch: 0, Step: 79, Rank: 6, loss = 0.6261777281761169
 Epoch: 0, Step: 79, Rank: 5, loss = 1.319584608078003
 Epoch: 0, Step: 79, Rank: 4, loss = 1.1308925151824951
 Epoch: 0, Step: 79, Rank: 7, loss = 0.797226071357727
 Epoch: 0, Step: 79, Rank: 0, loss = 0.014377035200595856
 Epoch: 0, Step: 79, Rank: 1, loss = 0.5387663841247559
 Epoch: 0, Step: 79, Rank: 2, loss = 0.2998671531677246
 Per-token loss scaled by world size: 0.00022867463121656328
 Epoch: 0, Step: 79, Rank: 3, loss = 0.6831940412521362
 Epoch 0:  65%|██████▌   | 79/121 [03:21<01:46,  2.55s/it]
 total tokens: 7794 num samples: 9 num padding tokens: 491 - rank: 4 max len: 866 min len: 750 avg len: 811.4444444444445 num_loss_counted_tokens: 5401
 total tokens: 7551 num samples: 3 num padding tokens: 552 - rank: 1 max len: 2517 min len: 2018 avg len: 2333.0 num_loss_counted_tokens: 491
 {
    "epoch": 0,
    "step": 79,
    "rank": 0,
    "loss": 0.014377035200595856,
    "overall_throughput": 41.31761838872816,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.35622549057007,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23901,
    "batch_size": 83,
    "total_loss": 0.6762607097625732,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:23.748903"
 }
 total tokens: 7966 num samples: 14 num padding tokens: 2590 - rank: 6 max len: 569 min len: 271 avg len: 384.0 num_loss_counted_tokens: 3308
 total tokens: 7416 num samples: 4 num padding tokens: 852 - rank: 2 max len: 1854 min len: 1322 avg len: 1641.0 num_loss_counted_tokens: 506
 total tokens: 7931 num samples: 11 num padding tokens: 889 - rank: 5 max len: 721 min len: 597 avg len: 640.1818181818181 num_loss_counted_tokens: 5323
 total tokens: 7806 num samples: 6 num padding tokens: 1777 - rank: 3 max len: 1301 min len: 876 avg len: 1004.8333333333334 num_loss_counted_tokens: 4853
 total tokens: 5476 num samples: 2 num padding tokens: 96 - rank: 0 max len: 2738 min len: 2642 avg len: 2690.0 num_loss_counted_tokens: 179
 total tokens: 8100 num samples: 30 num padding tokens: 2273 - rank: 7 max len: 270 min len: 83 avg len: 194.23333333333332 num_loss_counted_tokens: 2747
 Per-token loss scaled by world size: 0.00042449356988072395
 Per-token loss scaled by world size: 0.00028748821932822466
 Per-token loss scaled by world size: 0.0002529154298827052
 Per-token loss scaled by world size: 0.0005231253453530371
 Per-token loss scaled by world size: 0.0002102917933370918
 Per-token loss scaled by world size: 5.35248773303465e-06
 Per-token loss scaled by world size: 0.00035050552105531096
 Epoch: 0, Step: 80, Rank: 6, loss = 1.4146617650985718
 Epoch: 0, Step: 80, Rank: 0, loss = 0.01447446458041668
 Epoch: 0, Step: 80, Rank: 4, loss = 0.5686815977096558
 Epoch: 0, Step: 80, Rank: 3, loss = 0.7774400115013123
 Epoch: 0, Step: 80, Rank: 5, loss = 1.1479367017745972
 Epoch: 0, Step: 80, Rank: 2, loss = 0.6839465498924255
 Epoch: 0, Step: 80, Rank: 7, loss = 0.9478545188903809
 Per-token loss scaled by world size: 1.0431926966703031e-06
 Epoch: 0, Step: 80, Rank: 1, loss = 0.002821053843945265
 Epoch 0:  66%|██████▌   | 80/121 [03:24<01:44,  2.54s/it]
 total tokens: 7600 num samples: 10 num padding tokens: 502 - rank: 4 max len: 760 min len: 672 avg len: 709.8 num_loss_counted_tokens: 4962
 total tokens: 7035 num samples: 5 num padding tokens: 571 - rank: 1 max len: 1407 min len: 1097 avg len: 1292.8 num_loss_counted_tokens: 3599
 {
    "epoch": 0,
    "step": 80,
    "rank": 0,
    "loss": 0.01447446458041668,
    "overall_throughput": 41.61237764414521,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.469753742218018,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21634,
    "batch_size": 76,
    "total_loss": 0.6947270631790161,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:26.290599"
 }
 total tokens: 7760 num samples: 16 num padding tokens: 1837 - rank: 6 max len: 485 min len: 278 avg len: 370.1875 num_loss_counted_tokens: 3781
 total tokens: 7595 num samples: 7 num padding tokens: 592 - rank: 2 max len: 1085 min len: 934 avg len: 1000.4285714285714 num_loss_counted_tokens: 4208
 total tokens: 7444 num samples: 4 num padding tokens: 741 - rank: 0 max len: 1861 min len: 1498 avg len: 1675.75 num_loss_counted_tokens: 2607
 total tokens: 7440 num samples: 8 num padding tokens: 519 - rank: 3 max len: 930 min len: 764 avg len: 865.125 num_loss_counted_tokens: 6064
 total tokens: 8100 num samples: 30 num padding tokens: 2932 - rank: 7 max len: 270 min len: 75 avg len: 172.26666666666668 num_loss_counted_tokens: 2182
 total tokens: 8016 num samples: 12 num padding tokens: 980 - rank: 5 max len: 668 min len: 496 avg len: 586.3333333333334 num_loss_counted_tokens: 5943
 Per-token loss scaled by world size: 0.0002643285261001438
 Per-token loss scaled by world size: 0.000505154428537935
 Per-token loss scaled by world size: 0.0003831658395938575
 Per-token loss scaled by world size: 0.0005561576108448207
 Per-token loss scaled by world size: 4.442329100129427e-06
 Per-token loss scaled by world size: 0.000311601092107594
 Per-token loss scaled by world size: 3.0491105462715495e-06
 Epoch: 0, Step: 81, Rank: 5, loss = 1.2860599756240845
 Epoch: 0, Step: 81, Rank: 3, loss = 0.6729474067687988
 Epoch: 0, Step: 81, Rank: 6, loss = 1.4159077405929565
 Epoch: 0, Step: 81, Rank: 4, loss = 0.9754922986030579
 Epoch: 0, Step: 81, Rank: 1, loss = 0.011309614405035973
 Epoch: 0, Step: 81, Rank: 7, loss = 0.7932974100112915
 Epoch: 0, Step: 81, Rank: 0, loss = 0.007762654218822718
 Per-token loss scaled by world size: 0.00019164555124007165
 Epoch: 0, Step: 81, Rank: 2, loss = 0.4879056215286255
 Epoch 0:  67%|██████▋   | 81/121 [03:27<01:41,  2.55s/it]
 {
    "epoch": 0,
    "step": 81,
    "rank": 0,
    "loss": 0.007762654218822718,
    "overall_throughput": 41.32577987310498,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.05413246154785,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20367,
    "batch_size": 76,
    "total_loss": 0.7063353061676025,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:28.868839"
 }
 total tokens: 5876 num samples: 2 num padding tokens: 1017 - rank: 1 max len: 2938 min len: 1921 avg len: 2429.5 num_loss_counted_tokens: 591
 total tokens: 7936 num samples: 8 num padding tokens: 1689 - rank: 4 max len: 992 min len: 598 avg len: 780.875 num_loss_counted_tokens: 4729
 total tokens: 6752 num samples: 2 num padding tokens: 73 - rank: 0 max len: 3376 min len: 3303 avg len: 3339.5 num_loss_counted_tokens: 449
 total tokens: 8018 num samples: 19 num padding tokens: 2548 - rank: 6 max len: 422 min len: 234 avg len: 287.89473684210526 num_loss_counted_tokens: 3302
 total tokens: 7156 num samples: 4 num padding tokens: 826 - rank: 2 max len: 1789 min len: 1335 avg len: 1582.5 num_loss_counted_tokens: 3840
 total tokens: 6552 num samples: 28 num padding tokens: 2284 - rank: 7 max len: 234 min len: 79 avg len: 152.42857142857142 num_loss_counted_tokens: 1601
 total tokens: 7540 num samples: 13 num padding tokens: 837 - rank: 5 max len: 580 min len: 438 avg len: 515.6153846153846 num_loss_counted_tokens: 4083
 total tokens: 7512 num samples: 6 num padding tokens: 558 - rank: 3 max len: 1252 min len: 1059 avg len: 1159.0 num_loss_counted_tokens: 3038
 Per-token loss scaled by world size: 0.000629897927865386
 Per-token loss scaled by world size: 0.0006152652204036713
 Per-token loss scaled by world size: 0.00011580222053453326
 Per-token loss scaled by world size: 0.0004951037117280066
 Per-token loss scaled by world size: 0.000213472536415793
 Per-token loss scaled by world size: 1.2029913705191575e-05
 Per-token loss scaled by world size: 5.17758380738087e-05
 Epoch: 0, Step: 82, Rank: 4, loss = 1.4647926092147827
 Epoch: 0, Step: 82, Rank: 6, loss = 1.4996294975280762
 Epoch: 0, Step: 82, Rank: 2, loss = 0.27569612860679626
 Epoch: 0, Step: 82, Rank: 3, loss = 1.1787182092666626
 Epoch: 0, Step: 82, Rank: 7, loss = 0.5082247257232666
 Epoch: 0, Step: 82, Rank: 1, loss = 0.02864021621644497
 Epoch: 0, Step: 82, Rank: 0, loss = 0.1232653260231018
 Per-token loss scaled by world size: 0.0006569805555045605
 Epoch: 0, Step: 82, Rank: 5, loss = 1.5641064643859863
 Epoch 0:  68%|██████▊   | 82/121 [03:29<01:38,  2.53s/it]
 total tokens: 7308 num samples: 9 num padding tokens: 762 - rank: 4 max len: 812 min len: 669 avg len: 727.3333333333334 num_loss_counted_tokens: 4589
 total tokens: 7874 num samples: 31 num padding tokens: 2845 - rank: 7 max len: 254 min len: 81 avg len: 162.2258064516129 num_loss_counted_tokens: 2174
 total tokens: 7372 num samples: 4 num padding tokens: 311 - rank: 1 max len: 1843 min len: 1686 avg len: 1765.25 num_loss_counted_tokens: 1401
 {
    "epoch": 0,
    "step": 82,
    "rank": 0,
    "loss": 0.1232653260231018,
    "overall_throughput": 42.003063993949816,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.33745241165161,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19046,
    "batch_size": 84,
    "total_loss": 0.8303841352462769,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:31.350016"
 }
 total tokens: 7866 num samples: 19 num padding tokens: 1707 - rank: 6 max len: 414 min len: 255 avg len: 324.1578947368421 num_loss_counted_tokens: 3285
 total tokens: 6580 num samples: 4 num padding tokens: 1811 - rank: 2 max len: 1645 min len: 1002 avg len: 1192.25 num_loss_counted_tokens: 2171
 total tokens: 8094 num samples: 3 num padding tokens: 948 - rank: 0 max len: 2698 min len: 2046 avg len: 2382.0 num_loss_counted_tokens: 2180
 total tokens: 7992 num samples: 8 num padding tokens: 875 - rank: 3 max len: 999 min len: 819 avg len: 889.625 num_loss_counted_tokens: 5913
 total tokens: 7982 num samples: 13 num padding tokens: 1310 - rank: 5 max len: 614 min len: 424 avg len: 513.2307692307693 num_loss_counted_tokens: 3981
 Per-token loss scaled by world size: 0.00023041688837110996
 Per-token loss scaled by world size: 0.0001501823280705139
 Per-token loss scaled by world size: 0.000244573806412518
 Per-token loss scaled by world size: 0.0003567738749552518
 Per-token loss scaled by world size: 0.00027550142840482295
 Per-token loss scaled by world size: 3.432213998166844e-05
 Per-token loss scaled by world size: 0.00022764307504985482
 Epoch: 0, Step: 83, Rank: 5, loss = 1.1512646675109863
 Epoch: 0, Step: 83, Rank: 4, loss = 0.4846196174621582
 Epoch: 0, Step: 83, Rank: 7, loss = 0.7435265183448792
 Epoch: 0, Step: 83, Rank: 0, loss = 0.11075326055288315
 Epoch: 0, Step: 83, Rank: 1, loss = 0.7892091274261475
 Epoch: 0, Step: 83, Rank: 3, loss = 0.8890087008476257
 Epoch: 0, Step: 83, Rank: 2, loss = 0.7345757484436035
 Per-token loss scaled by world size: 0.0002607592905405909
 Epoch: 0, Step: 83, Rank: 6, loss = 0.8414376378059387
 Epoch 0:  69%|██████▊   | 83/121 [03:32<01:35,  2.52s/it]
 total tokens: 7168 num samples: 7 num padding tokens: 519 - rank: 4 max len: 1024 min len: 877 avg len: 949.8571428571429 num_loss_counted_tokens: 3980
 total tokens: 6612 num samples: 3 num padding tokens: 963 - rank: 1 max len: 2204 min len: 1680 avg len: 1883.0 num_loss_counted_tokens: 659
 {
    "epoch": 0,
    "step": 83,
    "rank": 0,
    "loss": 0.11075326055288315,
    "overall_throughput": 42.37488901314667,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.450260639190674,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25815,
    "batch_size": 90,
    "total_loss": 0.7180494070053101,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:33.841762"
 }
 total tokens: 6544 num samples: 4 num padding tokens: 763 - rank: 2 max len: 1636 min len: 1316 avg len: 1445.25 num_loss_counted_tokens: 481
 total tokens: 7930 num samples: 10 num padding tokens: 919 - rank: 5 max len: 793 min len: 605 avg len: 701.1 num_loss_counted_tokens: 4841
 total tokens: 7994 num samples: 14 num padding tokens: 1260 - rank: 6 max len: 571 min len: 361 avg len: 481.0 num_loss_counted_tokens: 5292
 total tokens: 8096 num samples: 23 num padding tokens: 2810 - rank: 7 max len: 352 min len: 91 avg len: 229.82608695652175 num_loss_counted_tokens: 2516
 total tokens: 7590 num samples: 6 num padding tokens: 565 - rank: 3 max len: 1265 min len: 1087 avg len: 1170.8333333333333 num_loss_counted_tokens: 4039
 total tokens: 6258 num samples: 2 num padding tokens: 564 - rank: 0 max len: 3129 min len: 2565 avg len: 2847.0 num_loss_counted_tokens: 179
 Per-token loss scaled by world size: 0.00019247813906986266
 Per-token loss scaled by world size: 0.00029174372320994735
 Per-token loss scaled by world size: 0.0004131880996283144
 Per-token loss scaled by world size: 0.0003363724099472165
 Per-token loss scaled by world size: 0.0005087562603875995
 Per-token loss scaled by world size: 0.0001915783795993775
 Per-token loss scaled by world size: 3.281491217421717e-06
 Epoch: 0, Step: 84, Rank: 3, loss = 0.8194716572761536
 Epoch: 0, Step: 84, Rank: 2, loss = 0.540647029876709
 Epoch: 0, Step: 84, Rank: 7, loss = 0.9448280930519104
 Epoch: 0, Step: 84, Rank: 4, loss = 1.1605937480926514
 Epoch: 0, Step: 84, Rank: 0, loss = 0.009217298589646816
 Epoch: 0, Step: 84, Rank: 5, loss = 1.429032802581787
 Epoch: 0, Step: 84, Rank: 1, loss = 0.5381197333335876
 Per-token loss scaled by world size: 0.0003772681811824441
 Epoch: 0, Step: 84, Rank: 6, loss = 1.0596991777420044
 Epoch 0:  69%|██████▉   | 84/121 [03:34<01:33,  2.52s/it]
 total tokens: 6480 num samples: 3 num padding tokens: 343 - rank: 1 max len: 2160 min len: 1943 avg len: 2045.6666666666667 num_loss_counted_tokens: 909
 total tokens: 7448 num samples: 8 num padding tokens: 1052 - rank: 4 max len: 931 min len: 740 avg len: 799.5 num_loss_counted_tokens: 3154
 {
    "epoch": 0,
    "step": 84,
    "rank": 0,
    "loss": 0.009217298589646816,
    "overall_throughput": 41.98783375148776,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.287980556488037,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22471,
    "batch_size": 89,
    "total_loss": 0.8127012252807617,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:36.362094"
 }
 total tokens: 7896 num samples: 28 num padding tokens: 3153 - rank: 7 max len: 282 min len: 74 avg len: 169.39285714285714 num_loss_counted_tokens: 2024
 total tokens: 8115 num samples: 15 num padding tokens: 2233 - rank: 6 max len: 541 min len: 292 avg len: 392.1333333333333 num_loss_counted_tokens: 3345
 total tokens: 5744 num samples: 2 num padding tokens: 367 - rank: 0 max len: 2872 min len: 2505 avg len: 2688.5 num_loss_counted_tokens: 161
 total tokens: 7436 num samples: 4 num padding tokens: 1471 - rank: 2 max len: 1859 min len: 1182 avg len: 1491.25 num_loss_counted_tokens: 773
 total tokens: 8071 num samples: 7 num padding tokens: 817 - rank: 3 max len: 1153 min len: 974 avg len: 1036.2857142857142 num_loss_counted_tokens: 4123
 total tokens: 7788 num samples: 11 num padding tokens: 766 - rank: 5 max len: 708 min len: 547 avg len: 638.3636363636364 num_loss_counted_tokens: 3782
 Per-token loss scaled by world size: 0.0005787216359749436
 Per-token loss scaled by world size: 0.0005308112595230341
 Per-token loss scaled by world size: 0.00033112603705376387
 Per-token loss scaled by world size: 0.00014354031009133905
 Per-token loss scaled by world size: 0.00046847882913425565
 Per-token loss scaled by world size: 3.301608558103908e-06
 Per-token loss scaled by world size: 3.5768789530266076e-06
 Epoch: 0, Step: 85, Rank: 6, loss = 1.3779860734939575
 Epoch: 0, Step: 85, Rank: 5, loss = 1.2161710262298584
 Epoch: 0, Step: 85, Rank: 7, loss = 0.8596031665802002
 Epoch: 0, Step: 85, Rank: 1, loss = 0.008570975624024868
 Epoch: 0, Step: 85, Rank: 4, loss = 1.5023614168167114
 Epoch: 0, Step: 85, Rank: 2, loss = 0.37263065576553345
 Epoch: 0, Step: 85, Rank: 0, loss = 0.009285577572882175
 Per-token loss scaled by world size: 0.00042734169983305037
 Epoch: 0, Step: 85, Rank: 3, loss = 1.1093790531158447
 Epoch 0:  70%|███████   | 85/121 [03:37<01:31,  2.53s/it]
 total tokens: 7898 num samples: 11 num padding tokens: 499 - rank: 4 max len: 718 min len: 607 avg len: 672.6363636363636 num_loss_counted_tokens: 3857
 total tokens: 8060 num samples: 5 num padding tokens: 394 - rank: 1 max len: 1612 min len: 1437 avg len: 1533.2 num_loss_counted_tokens: 1786
 {
    "epoch": 0,
    "step": 85,
    "rank": 0,
    "loss": 0.009285577572882175,
    "overall_throughput": 41.311858360949856,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.234922885894775,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20768,
    "batch_size": 87,
    "total_loss": 0.8069984912872314,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:38.922540"
 }
 total tokens: 7898 num samples: 22 num padding tokens: 1234 - rank: 6 max len: 359 min len: 264 avg len: 302.90909090909093 num_loss_counted_tokens: 3332
 total tokens: 6834 num samples: 3 num padding tokens: 886 - rank: 0 max len: 2278 min len: 1616 avg len: 1982.6666666666667 num_loss_counted_tokens: 909
 total tokens: 7680 num samples: 6 num padding tokens: 1041 - rank: 2 max len: 1280 min len: 938 avg len: 1106.5 num_loss_counted_tokens: 3266
 total tokens: 7683 num samples: 13 num padding tokens: 1257 - rank: 5 max len: 591 min len: 368 avg len: 494.3076923076923 num_loss_counted_tokens: 4067
 total tokens: 7395 num samples: 29 num padding tokens: 2483 - rank: 7 max len: 255 min len: 89 avg len: 169.3793103448276 num_loss_counted_tokens: 2026
 total tokens: 7408 num samples: 8 num padding tokens: 818 - rank: 3 max len: 926 min len: 725 avg len: 823.75 num_loss_counted_tokens: 4017
 Per-token loss scaled by world size: 0.00045185594353824854
 Per-token loss scaled by world size: 0.0003287219151388854
 Per-token loss scaled by world size: 0.00010264909360557795
 Per-token loss scaled by world size: 0.0003051054081879556
 Per-token loss scaled by world size: 0.00028172050951980054
 Per-token loss scaled by world size: 1.640593291085679e-05
 Per-token loss scaled by world size: 8.82493841345422e-05
 Epoch: 0, Step: 86, Rank: 2, loss = 1.0376107692718506
 Epoch: 0, Step: 86, Rank: 6, loss = 0.9630652666091919
 Epoch: 0, Step: 86, Rank: 1, loss = 0.3240118622779846
 Epoch: 0, Step: 86, Rank: 3, loss = 1.4262832403182983
 Epoch: 0, Step: 86, Rank: 0, loss = 0.05178532749414444
 Epoch: 0, Step: 86, Rank: 4, loss = 0.8892507553100586
 Epoch: 0, Step: 86, Rank: 7, loss = 0.2785591781139374
 Per-token loss scaled by world size: 0.000460325536550954
 Epoch: 0, Step: 86, Rank: 5, loss = 1.4530175924301147
 Epoch 0:  71%|███████   | 86/121 [03:39<01:28,  2.53s/it]
 total tokens: 3744 num samples: 18 num padding tokens: 1271 - rank: 7 max len: 208 min len: 81 avg len: 137.38888888888889 num_loss_counted_tokens: 914
 total tokens: 7208 num samples: 4 num padding tokens: 870 - rank: 1 max len: 1802 min len: 1447 avg len: 1584.5 num_loss_counted_tokens: 2098
 total tokens: 7308 num samples: 9 num padding tokens: 993 - rank: 4 max len: 812 min len: 635 avg len: 701.6666666666666 num_loss_counted_tokens: 3474
 {
    "epoch": 0,
    "step": 86,
    "rank": 0,
    "loss": 0.05178532749414444,
    "overall_throughput": 41.951886200619036,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.232909202575684,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25252,
    "batch_size": 101,
    "total_loss": 0.802947998046875,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:41.442803"
 }
 total tokens: 7896 num samples: 21 num padding tokens: 2014 - rank: 6 max len: 376 min len: 211 avg len: 280.0952380952381 num_loss_counted_tokens: 3459
 total tokens: 7364 num samples: 7 num padding tokens: 790 - rank: 3 max len: 1052 min len: 812 avg len: 939.1428571428571 num_loss_counted_tokens: 3186
 total tokens: 7872 num samples: 6 num padding tokens: 665 - rank: 2 max len: 1312 min len: 1060 avg len: 1201.1666666666667 num_loss_counted_tokens: 5354
 total tokens: 7982 num samples: 13 num padding tokens: 1356 - rank: 5 max len: 614 min len: 393 avg len: 509.6923076923077 num_loss_counted_tokens: 4015
 total tokens: 6650 num samples: 2 num padding tokens: 959 - rank: 0 max len: 3325 min len: 2366 avg len: 2845.5 num_loss_counted_tokens: 183
 Per-token loss scaled by world size: 0.0001364344934700057
 Per-token loss scaled by world size: 0.00033442748826928437
 Per-token loss scaled by world size: 0.00019135570619255304
 Per-token loss scaled by world size: 0.00039014805224724114
 Per-token loss scaled by world size: 7.375221321126446e-05
 Per-token loss scaled by world size: 3.955068677896634e-05
 Per-token loss scaled by world size: 0.00019416131544858217
 Epoch: 0, Step: 87, Rank: 4, loss = 0.4185469150543213
 Epoch: 0, Step: 87, Rank: 2, loss = 0.5870314836502075
 Epoch: 0, Step: 87, Rank: 5, loss = 1.1968766450881958
 Epoch: 0, Step: 87, Rank: 0, loss = 0.12133162468671799
 Epoch: 0, Step: 87, Rank: 3, loss = 1.02593994140625
 Epoch: 0, Step: 87, Rank: 1, loss = 0.22625336050987244
 Epoch: 0, Step: 87, Rank: 7, loss = 0.5956383943557739
 Per-token loss scaled by world size: 0.00031522451899945736
 Epoch: 0, Step: 87, Rank: 6, loss = 0.9670300483703613
 Epoch 0:  72%|███████▏  | 87/121 [03:42<01:25,  2.52s/it] total tokens: 7015 num samples: 5 num padding tokens: 1264 - rank: 4 max len: 1403 min len: 1017 avg len: 1150.2 num_loss_counted_tokens: 3249
 total tokens: 5650 num samples: 2 num padding tokens: 51 - rank: 1 max len: 2825 min len: 2774 avg len: 2799.5 num_loss_counted_tokens: 191
 total tokens: 5482 num samples: 2 num padding tokens: 83 - rank: 2 max len: 2741 min len: 2658 avg len: 2699.5 num_loss_counted_tokens: 171
 {
    "epoch": 0,
    "step": 87,
    "rank": 0,
    "loss": 0.12133162468671799,
    "overall_throughput": 42.28776662058589,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.46161460876465,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24542,
    "batch_size": 79,
    "total_loss": 0.642331063747406,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:43.944539"
 }
 total tokens: 7560 num samples: 4 num padding tokens: 1086 - rank: 3 max len: 1890 min len: 1404 avg len: 1618.5 num_loss_counted_tokens: 1603
 total tokens: 7830 num samples: 9 num padding tokens: 1197 - rank: 5 max len: 870 min len: 619 avg len: 737.0 num_loss_counted_tokens: 3609
 total tokens: 7956 num samples: 13 num padding tokens: 2029 - rank: 6 max len: 612 min len: 285 avg len: 455.9230769230769 num_loss_counted_tokens: 3596
 total tokens: 5358 num samples: 19 num padding tokens: 1354 - rank: 7 max len: 282 min len: 86 avg len: 210.73684210526315 num_loss_counted_tokens: 1995
 total tokens: 7698 num samples: 2 num padding tokens: 950 - rank: 0 max len: 3849 min len: 2899 avg len: 3374.0 num_loss_counted_tokens: 1217
 Per-token loss scaled by world size: 0.00014240843302104622
 Per-token loss scaled by world size: 0.000148817416629754
 Per-token loss scaled by world size: 0.0001530916924821213
 Per-token loss scaled by world size: 0.00020883062097709626
 Per-token loss scaled by world size: 0.00023989545297808945
 Per-token loss scaled by world size: 0.00017666697385720909
 Per-token loss scaled by world size: 0.0001427593524567783
 Epoch: 0, Step: 88, Rank: 5, loss = 0.8521594405174255
 Epoch: 0, Step: 88, Rank: 6, loss = 0.9789233803749084
 Epoch: 0, Step: 88, Rank: 2, loss = 0.6247097849845886
 Epoch: 0, Step: 88, Rank: 3, loss = 0.6072680950164795
 Epoch: 0, Step: 88, Rank: 4, loss = 0.5811154246330261
 Epoch: 0, Step: 88, Rank: 1, loss = 0.7209116816520691
 Epoch: 0, Step: 88, Rank: 7, loss = 0.5825473666191101
 Per-token loss scaled by world size: 0.00011780338536482304
 Epoch: 0, Step: 88, Rank: 0, loss = 0.480711430311203
 Epoch 0:  73%|███████▎  | 88/121 [03:44<01:23,  2.54s/it] total tokens: 7821 num samples: 11 num padding tokens: 369 - rank: 4 max len: 711 min len: 644 avg len: 677.4545454545455 num_loss_counted_tokens: 4895
 total tokens: 8108 num samples: 4 num padding tokens: 1270 - rank: 1 max len: 2027 min len: 1314 avg len: 1709.5 num_loss_counted_tokens: 973
 total tokens: 7920 num samples: 8 num padding tokens: 1147 - rank: 3 max len: 990 min len: 714 avg len: 846.625 num_loss_counted_tokens: 2895
 {
    "epoch": 0,
    "step": 88,
    "rank": 0,
    "loss": 0.480711430311203,
    "overall_throughput": 41.16723941483895,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.52360773086548,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 32645,
    "batch_size": 94,
    "total_loss": 0.6785432696342468,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:46.514923"
 }
 total tokens: 7888 num samples: 17 num padding tokens: 1450 - rank: 6 max len: 464 min len: 277 avg len: 378.70588235294116 num_loss_counted_tokens: 4530
 total tokens: 6446 num samples: 2 num padding tokens: 340 - rank: 0 max len: 3223 min len: 2883 avg len: 3053.0 num_loss_counted_tokens: 686
 total tokens: 7860 num samples: 30 num padding tokens: 2767 - rank: 7 max len: 262 min len: 77 avg len: 169.76666666666668 num_loss_counted_tokens: 2196
 total tokens: 7398 num samples: 6 num padding tokens: 659 - rank: 2 max len: 1233 min len: 1028 avg len: 1123.1666666666667 num_loss_counted_tokens: 3542
 total tokens: 7536 num samples: 12 num padding tokens: 714 - rank: 5 max len: 628 min len: 502 avg len: 568.5 num_loss_counted_tokens: 5605
 Per-token loss scaled by world size: 0.0003699270309880376
 Per-token loss scaled by world size: 0.0005684850038960576
 Per-token loss scaled by world size: 5.893620254937559e-06
 Per-token loss scaled by world size: 3.489888695185073e-05
 Per-token loss scaled by world size: 0.0005643228068947792
 Per-token loss scaled by world size: 0.0003445304755587131
 Per-token loss scaled by world size: 0.00016975219477899373
 Epoch: 0, Step: 89, Rank: 5, loss = 1.299698829650879
 Epoch: 0, Step: 89, Rank: 1, loss = 0.013474289327859879
 Epoch: 0, Step: 89, Rank: 3, loss = 0.8457456827163696
 Epoch: 0, Step: 89, Rank: 0, loss = 0.07978758215904236
 Epoch: 0, Step: 89, Rank: 4, loss = 0.3880959451198578
 Epoch: 0, Step: 89, Rank: 6, loss = 1.2901830673217773
 Epoch: 0, Step: 89, Rank: 7, loss = 0.7876827716827393
 Per-token loss scaled by world size: 0.00024518067948520184
 Epoch: 0, Step: 89, Rank: 2, loss = 0.5605443120002747
 Epoch 0:  74%|███████▎  | 89/121 [03:47<01:21,  2.55s/it] total tokens: 5448 num samples: 2 num padding tokens: 812 - rank: 1 max len: 2724 min len: 1912 avg len: 2318.0 num_loss_counted_tokens: 205
 total tokens: 7690 num samples: 10 num padding tokens: 487 - rank: 4 max len: 769 min len: 675 avg len: 720.3 num_loss_counted_tokens: 4105
 {
    "epoch": 0,
    "step": 89,
    "rank": 0,
    "loss": 0.07978758215904236,
    "overall_throughput": 41.16520368540104,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.30566644668579,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18290,
    "batch_size": 69,
    "total_loss": 0.6581515669822693,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:49.084401"
 }
 total tokens: 7360 num samples: 4 num padding tokens: 675 - rank: 2 max len: 1840 min len: 1482 avg len: 1671.25 num_loss_counted_tokens: 836
 total tokens: 7812 num samples: 12 num padding tokens: 1280 - rank: 5 max len: 651 min len: 447 avg len: 544.3333333333334 num_loss_counted_tokens: 2926
 total tokens: 7095 num samples: 5 num padding tokens: 1469 - rank: 3 max len: 1419 min len: 943 avg len: 1125.2 num_loss_counted_tokens: 1900
 total tokens: 7740 num samples: 18 num padding tokens: 1493 - rank: 6 max len: 430 min len: 282 avg len: 347.05555555555554 num_loss_counted_tokens: 3796
 total tokens: 7248 num samples: 2 num padding tokens: 215 - rank: 0 max len: 3624 min len: 3409 avg len: 3516.5 num_loss_counted_tokens: 197
 total tokens: 7772 num samples: 29 num padding tokens: 2475 - rank: 7 max len: 268 min len: 82 avg len: 182.6551724137931 num_loss_counted_tokens: 2574
 Per-token loss scaled by world size: 0.0002169163926737383
 Per-token loss scaled by world size: 0.00031902806949801743
 Per-token loss scaled by world size: 0.000320168532198295
 Per-token loss scaled by world size: 0.00028694834327325225
 Per-token loss scaled by world size: 2.5503815777483396e-05
 Per-token loss scaled by world size: 1.9868204617523588e-05
 Per-token loss scaled by world size: 0.00033540837466716766
 Epoch: 0, Step: 90, Rank: 4, loss = 0.9190002083778381
 Epoch: 0, Step: 90, Rank: 6, loss = 0.9222854375839233
 Epoch: 0, Step: 90, Rank: 3, loss = 0.8265905380249023
 Epoch: 0, Step: 90, Rank: 1, loss = 0.07346692681312561
 Epoch: 0, Step: 90, Rank: 0, loss = 0.057232845574617386
 Epoch: 0, Step: 90, Rank: 2, loss = 0.6248548030853271
 Epoch: 0, Step: 90, Rank: 7, loss = 0.9661857485771179
 Per-token loss scaled by world size: 0.0004802969633601606
 Epoch: 0, Step: 90, Rank: 5, loss = 1.3835554122924805
 Epoch 0:  74%|███████▍  | 90/121 [03:49<01:18,  2.54s/it] total tokens: 7656 num samples: 11 num padding tokens: 851 - rank: 4 max len: 696 min len: 552 avg len: 618.6363636363636 num_loss_counted_tokens: 4472
 total tokens: 7875 num samples: 5 num padding tokens: 1795 - rank: 1 max len: 1575 min len: 1090 avg len: 1216.0 num_loss_counted_tokens: 4251
 {
    "epoch": 0,
    "step": 90,
    "rank": 0,
    "loss": 0.057232845574617386,
    "overall_throughput": 42.04731734702061,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.250526905059814,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23045,
    "batch_size": 73,
    "total_loss": 0.7216464877128601,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:51.641363"
 }
 total tokens: 8100 num samples: 15 num padding tokens: 1058 - rank: 5 max len: 540 min len: 390 avg len: 469.46666666666664 num_loss_counted_tokens: 4269
 total tokens: 7154 num samples: 7 num padding tokens: 458 - rank: 2 max len: 1022 min len: 870 avg len: 956.5714285714286 num_loss_counted_tokens: 2467
 total tokens: 7200 num samples: 3 num padding tokens: 1334 - rank: 0 max len: 2400 min len: 1618 avg len: 1955.3333333333333 num_loss_counted_tokens: 241
 total tokens: 7389 num samples: 9 num padding tokens: 493 - rank: 3 max len: 821 min len: 703 avg len: 766.2222222222222 num_loss_counted_tokens: 4922
 total tokens: 7945 num samples: 35 num padding tokens: 2409 - rank: 7 max len: 227 min len: 85 avg len: 158.17142857142858 num_loss_counted_tokens: 2310
 total tokens: 7986 num samples: 22 num padding tokens: 1512 - rank: 6 max len: 363 min len: 227 avg len: 294.27272727272725 num_loss_counted_tokens: 3717
 Per-token loss scaled by world size: 0.0006446933257393539
 Per-token loss scaled by world size: 0.00028916815062984824
 Per-token loss scaled by world size: 0.00012860735296271741
 Per-token loss scaled by world size: 0.00038613073411397636
 Per-token loss scaled by world size: 3.499255626593367e-06
 Per-token loss scaled by world size: 0.0005611648084595799
 Per-token loss scaled by world size: 2.536307329137344e-05
 Epoch: 0, Step: 91, Rank: 3, loss = 0.6820392608642578
 Epoch: 0, Step: 91, Rank: 1, loss = 0.008253431878983974
 Epoch: 0, Step: 91, Rank: 5, loss = 1.520589828491211
 Epoch: 0, Step: 91, Rank: 7, loss = 0.9107375741004944
 Epoch: 0, Step: 91, Rank: 2, loss = 0.303336501121521
 Epoch: 0, Step: 91, Rank: 4, loss = 1.3235772848129272
 Epoch: 0, Step: 91, Rank: 0, loss = 0.05982197821140289
 Per-token loss scaled by world size: 0.0006536963628605008
 Epoch: 0, Step: 91, Rank: 6, loss = 1.5418245792388916
 Epoch 0:  75%|███████▌  | 91/121 [03:52<01:16,  2.55s/it] total tokens: 7800 num samples: 8 num padding tokens: 997 - rank: 4 max len: 975 min len: 702 avg len: 850.375 num_loss_counted_tokens: 5549
 total tokens: 7290 num samples: 3 num padding tokens: 1298 - rank: 1 max len: 2430 min len: 1736 avg len: 1997.3333333333333 num_loss_counted_tokens: 1662
 {
    "epoch": 0,
    "step": 91,
    "rank": 0,
    "loss": 0.05982197821140289,
    "overall_throughput": 41.22796863364372,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.05379819869995,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18869,
    "batch_size": 79,
    "total_loss": 0.7937725186347961,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:54.169866"
 }
 total tokens: 5968 num samples: 2 num padding tokens: 197 - rank: 0 max len: 2984 min len: 2787 avg len: 2885.5 num_loss_counted_tokens: 209
 total tokens: 7667 num samples: 11 num padding tokens: 919 - rank: 5 max len: 697 min len: 558 avg len: 613.4545454545455 num_loss_counted_tokens: 4489
 total tokens: 8062 num samples: 29 num padding tokens: 2764 - rank: 7 max len: 278 min len: 78 avg len: 182.68965517241378 num_loss_counted_tokens: 2067
 total tokens: 6764 num samples: 4 num padding tokens: 631 - rank: 2 max len: 1691 min len: 1371 avg len: 1533.25 num_loss_counted_tokens: 1920
 total tokens: 7920 num samples: 15 num padding tokens: 1922 - rank: 6 max len: 528 min len: 310 avg len: 399.8666666666667 num_loss_counted_tokens: 3579
 total tokens: 7693 num samples: 7 num padding tokens: 417 - rank: 3 max len: 1099 min len: 981 avg len: 1039.4285714285713 num_loss_counted_tokens: 6131
 Per-token loss scaled by world size: 0.0005888827727176249
 Per-token loss scaled by world size: 0.0006443694583140314
 Per-token loss scaled by world size: 8.987231012724806e-06
 Per-token loss scaled by world size: 0.00010111679148394614
 Per-token loss scaled by world size: 1.0885350093303714e-05
 Per-token loss scaled by world size: 0.0005767960683442652
 Per-token loss scaled by world size: 0.00016675007645972073
 Epoch: 0, Step: 92, Rank: 6, loss = 1.448703646659851
 Epoch: 0, Step: 92, Rank: 1, loss = 0.02447298914194107
 Epoch: 0, Step: 92, Rank: 0, loss = 0.0202055424451828
 Epoch: 0, Step: 92, Rank: 3, loss = 1.3239556550979614
 Epoch: 0, Step: 92, Rank: 4, loss = 1.2967817783355713
 Epoch: 0, Step: 92, Rank: 2, loss = 0.2273358255624771
 Epoch: 0, Step: 92, Rank: 7, loss = 0.3748958706855774
 Per-token loss scaled by world size: 0.0007324381731450558
 Epoch: 0, Step: 92, Rank: 5, loss = 1.646704077720642
 Epoch 0:  76%|███████▌  | 92/121 [03:54<01:13,  2.54s/it] total tokens: 5964 num samples: 2 num padding tokens: 23 - rank: 1 max len: 2982 min len: 2959 avg len: 2970.5 num_loss_counted_tokens: 702
 total tokens: 4664 num samples: 22 num padding tokens: 1595 - rank: 7 max len: 212 min len: 82 avg len: 139.5 num_loss_counted_tokens: 1228
 total tokens: 8070 num samples: 10 num padding tokens: 724 - rank: 4 max len: 807 min len: 638 avg len: 734.6 num_loss_counted_tokens: 4997
 {
    "epoch": 0,
    "step": 92,
    "rank": 0,
    "loss": 0.0202055424451828,
    "overall_throughput": 41.45973903821548,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.430901527404785,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17986,
    "batch_size": 75,
    "total_loss": 0.7953818440437317,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:56.703392"
 }
 total tokens: 6888 num samples: 3 num padding tokens: 1202 - rank: 2 max len: 2296 min len: 1438 avg len: 1895.3333333333333 num_loss_counted_tokens: 633
 total tokens: 8040 num samples: 20 num padding tokens: 2357 - rank: 6 max len: 402 min len: 223 avg len: 284.15 num_loss_counted_tokens: 2838
 total tokens: 7665 num samples: 7 num padding tokens: 534 - rank: 3 max len: 1095 min len: 894 avg len: 1018.7142857142857 num_loss_counted_tokens: 4973
 total tokens: 7764 num samples: 2 num padding tokens: 79 - rank: 0 max len: 3882 min len: 3803 avg len: 3842.5 num_loss_counted_tokens: 230
 total tokens: 7596 num samples: 12 num padding tokens: 1505 - rank: 5 max len: 633 min len: 403 avg len: 507.5833333333333 num_loss_counted_tokens: 3739
 Per-token loss scaled by world size: 0.0010100876679643989
 Per-token loss scaled by world size: 0.0010313765378668904
 Per-token loss scaled by world size: 0.0004698181292042136
 Per-token loss scaled by world size: 0.00015131689724512398
 Per-token loss scaled by world size: 1.1349918167979922e-05
 Per-token loss scaled by world size: 0.0004358472360763699
 Per-token loss scaled by world size: 5.984314611851005e-06
 Epoch: 0, Step: 93, Rank: 5, loss = 1.8667914867401123
 Epoch: 0, Step: 93, Rank: 3, loss = 0.273883581161499
 Epoch: 0, Step: 93, Rank: 4, loss = 0.8503708243370056
 Epoch: 0, Step: 93, Rank: 6, loss = 1.828258752822876
 Epoch: 0, Step: 93, Rank: 0, loss = 0.020543351769447327
 Epoch: 0, Step: 93, Rank: 1, loss = 0.01083160936832428
 Epoch: 0, Step: 93, Rank: 7, loss = 0.7888835072517395
 Per-token loss scaled by world size: 9.832592331804335e-05
 Epoch: 0, Step: 93, Rank: 2, loss = 0.17796991765499115
 Epoch 0:  77%|███████▋  | 93/121 [03:57<01:11,  2.57s/it] total tokens: 7368 num samples: 8 num padding tokens: 711 - rank: 4 max len: 921 min len: 712 avg len: 832.125 num_loss_counted_tokens: 4742
 total tokens: 6549 num samples: 3 num padding tokens: 1202 - rank: 1 max len: 2183 min len: 1533 avg len: 1782.3333333333333 num_loss_counted_tokens: 405
 {
    "epoch": 0,
    "step": 93,
    "rank": 0,
    "loss": 0.020543351769447327,
    "overall_throughput": 40.324540881871776,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.333390712738037,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 14480,
    "batch_size": 60,
    "total_loss": 0.7271916270256042,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:51:59.324128"
 }
 total tokens: 8115 num samples: 15 num padding tokens: 1727 - rank: 6 max len: 541 min len: 302 avg len: 425.8666666666667 num_loss_counted_tokens: 4209
 total tokens: 7450 num samples: 5 num padding tokens: 482 - rank: 2 max len: 1490 min len: 1299 avg len: 1393.6 num_loss_counted_tokens: 4364
 total tokens: 7832 num samples: 11 num padding tokens: 625 - rank: 5 max len: 712 min len: 561 avg len: 655.1818181818181 num_loss_counted_tokens: 4908
 total tokens: 6486 num samples: 2 num padding tokens: 374 - rank: 0 max len: 3243 min len: 2869 avg len: 3056.0 num_loss_counted_tokens: 161
 total tokens: 7693 num samples: 7 num padding tokens: 643 - rank: 3 max len: 1099 min len: 949 avg len: 1007.1428571428571 num_loss_counted_tokens: 4763
 total tokens: 8073 num samples: 27 num padding tokens: 3166 - rank: 7 max len: 299 min len: 83 avg len: 181.74074074074073 num_loss_counted_tokens: 2216
 Per-token loss scaled by world size: 0.00047986634308472276
 Per-token loss scaled by world size: 0.0005184172769077122
 Per-token loss scaled by world size: 0.00042661072802729905
 Per-token loss scaled by world size: 5.770879943156615e-05
 Per-token loss scaled by world size: 0.0003191411087755114
 Per-token loss scaled by world size: 1.1559887752810027e-05
 Per-token loss scaled by world size: 1.194144033433986e-06
 Epoch: 0, Step: 94, Rank: 3, loss = 1.1955350637435913
 Epoch: 0, Step: 94, Rank: 5, loss = 0.9838176965713501
 Epoch: 0, Step: 94, Rank: 2, loss = 0.13308370113372803
 Epoch: 0, Step: 94, Rank: 7, loss = 1.1066317558288574
 Epoch: 0, Step: 94, Rank: 4, loss = 0.7359793186187744
 Epoch: 0, Step: 94, Rank: 0, loss = 0.026658546179533005
 Epoch: 0, Step: 94, Rank: 1, loss = 0.002753845416009426
 Per-token loss scaled by world size: 0.0007509095594286919
 Epoch: 0, Step: 94, Rank: 6, loss = 1.7316913604736328
 Epoch 0:  78%|███████▊  | 94/121 [04:00<01:08,  2.55s/it] total tokens: 7608 num samples: 8 num padding tokens: 1057 - rank: 4 max len: 951 min len: 740 avg len: 818.875 num_loss_counted_tokens: 3486
 total tokens: 5880 num samples: 2 num padding tokens: 154 - rank: 1 max len: 2940 min len: 2786 avg len: 2863.0 num_loss_counted_tokens: 481
 total tokens: 6471 num samples: 3 num padding tokens: 972 - rank: 2 max len: 2157 min len: 1602 avg len: 1833.0 num_loss_counted_tokens: 1746
 {
    "epoch": 0,
    "step": 94,
    "rank": 0,
    "loss": 0.026658546179533005,
    "overall_throughput": 42.25155762364806,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.342710971832275,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18449,
    "batch_size": 79,
    "total_loss": 0.7395188808441162,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:01.831915"
 }
 total tokens: 7725 num samples: 5 num padding tokens: 1821 - rank: 3 max len: 1545 min len: 1004 avg len: 1180.8 num_loss_counted_tokens: 4073
 total tokens: 7725 num samples: 25 num padding tokens: 2825 - rank: 7 max len: 309 min len: 84 avg len: 196.0 num_loss_counted_tokens: 2238
 total tokens: 7540 num samples: 13 num padding tokens: 2357 - rank: 6 max len: 580 min len: 317 avg len: 398.6923076923077 num_loss_counted_tokens: 3252
 total tokens: 7920 num samples: 11 num padding tokens: 539 - rank: 5 max len: 720 min len: 583 avg len: 671.0 num_loss_counted_tokens: 4509
 total tokens: 7732 num samples: 2 num padding tokens: 720 - rank: 0 max len: 3866 min len: 3146 avg len: 3506.0 num_loss_counted_tokens: 197
 Per-token loss scaled by world size: 0.00016949654673226178
 Per-token loss scaled by world size: 0.00019448986859060824
 Per-token loss scaled by world size: 0.0003709697921294719
 Per-token loss scaled by world size: 0.00028232726617716253
 Per-token loss scaled by world size: 0.00031966116512194276
 Per-token loss scaled by world size: 4.294802783988416e-05
 Per-token loss scaled by world size: 4.4173757487442344e-06
 Epoch: 0, Step: 95, Rank: 6, loss = 1.1748613119125366
 Epoch: 0, Step: 95, Rank: 7, loss = 0.8941304087638855
 Epoch: 0, Step: 95, Rank: 1, loss = 0.13601639866828918
 Epoch: 0, Step: 95, Rank: 3, loss = 0.5367955565452576
 Epoch: 0, Step: 95, Rank: 0, loss = 0.013989828526973724
 Epoch: 0, Step: 95, Rank: 2, loss = 0.6159493923187256
 Epoch: 0, Step: 95, Rank: 4, loss = 1.0123668909072876
 Per-token loss scaled by world size: 0.00031159218633547425
 Epoch: 0, Step: 95, Rank: 5, loss = 0.9868124723434448
 Epoch 0:  79%|███████▊  | 95/121 [04:02<01:06,  2.56s/it] total tokens: 7953 num samples: 11 num padding tokens: 802 - rank: 4 max len: 723 min len: 574 avg len: 650.0909090909091 num_loss_counted_tokens: 3865
 total tokens: 7895 num samples: 5 num padding tokens: 903 - rank: 1 max len: 1579 min len: 1252 avg len: 1398.4 num_loss_counted_tokens: 4598
 {
    "epoch": 0,
    "step": 95,
    "rank": 0,
    "loss": 0.013989828526973724,
    "overall_throughput": 40.78303709583472,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.430901527404785,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25336,
    "batch_size": 79,
    "total_loss": 0.6713653802871704,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:04.432513"
 }
 total tokens: 7994 num samples: 14 num padding tokens: 1263 - rank: 5 max len: 571 min len: 393 avg len: 480.7857142857143 num_loss_counted_tokens: 4425
 total tokens: 5280 num samples: 24 num padding tokens: 2248 - rank: 7 max len: 220 min len: 77 avg len: 126.33333333333333 num_loss_counted_tokens: 1108
 total tokens: 7212 num samples: 6 num padding tokens: 1461 - rank: 2 max len: 1202 min len: 820 avg len: 958.5 num_loss_counted_tokens: 3497
 total tokens: 7820 num samples: 20 num padding tokens: 2085 - rank: 6 max len: 391 min len: 220 avg len: 286.75 num_loss_counted_tokens: 3565
 total tokens: 7326 num samples: 9 num padding tokens: 374 - rank: 3 max len: 814 min len: 726 avg len: 772.4444444444445 num_loss_counted_tokens: 2645
 total tokens: 7032 num samples: 3 num padding tokens: 98 - rank: 0 max len: 2344 min len: 2282 avg len: 2311.3333333333335 num_loss_counted_tokens: 312
 Per-token loss scaled by world size: 0.0003659721987787634
 Per-token loss scaled by world size: 1.1935087059100624e-05
 Per-token loss scaled by world size: 0.00034196022897958755
 Per-token loss scaled by world size: 4.9577370191400405e-06
 Per-token loss scaled by world size: 0.000384376646252349
 Per-token loss scaled by world size: 3.742313765542349e-06
 Per-token loss scaled by world size: 0.0004321872256696224
 Epoch: 0, Step: 96, Rank: 3, loss = 1.043386697769165
 Epoch: 0, Step: 96, Rank: 2, loss = 0.03402693197131157
 Epoch: 0, Step: 96, Rank: 6, loss = 0.974928617477417
 Epoch: 0, Step: 96, Rank: 1, loss = 0.01413450762629509
 Epoch: 0, Step: 96, Rank: 7, loss = 1.095857858657837
 Epoch: 0, Step: 96, Rank: 0, loss = 0.010669336654245853
 Epoch: 0, Step: 96, Rank: 4, loss = 1.232165813446045
 Per-token loss scaled by world size: 0.00038899367791600525
 Epoch: 0, Step: 96, Rank: 5, loss = 1.1090209484100342
 Epoch 0:  79%|███████▉  | 96/121 [04:05<01:03,  2.55s/it] total tokens: 6210 num samples: 3 num padding tokens: 1280 - rank: 1 max len: 2070 min len: 1379 avg len: 1643.3333333333333 num_loss_counted_tokens: 704
 total tokens: 8100 num samples: 10 num padding tokens: 581 - rank: 4 max len: 810 min len: 687 avg len: 751.9 num_loss_counted_tokens: 4814
 {
    "epoch": 0,
    "step": 96,
    "rank": 0,
    "loss": 0.010669336654245853,
    "overall_throughput": 42.34869571304124,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.221776962280273,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22808,
    "batch_size": 79,
    "total_loss": 0.6892738342285156,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:06.964617"
 }
 total tokens: 5532 num samples: 2 num padding tokens: 327 - rank: 0 max len: 2766 min len: 2439 avg len: 2602.5 num_loss_counted_tokens: 211
 total tokens: 7075 num samples: 25 num padding tokens: 2496 - rank: 7 max len: 283 min len: 84 avg len: 183.16 num_loss_counted_tokens: 1819
 total tokens: 7546 num samples: 11 num padding tokens: 1067 - rank: 5 max len: 686 min len: 503 avg len: 589.0 num_loss_counted_tokens: 4558
 total tokens: 7595 num samples: 7 num padding tokens: 688 - rank: 3 max len: 1085 min len: 846 avg len: 986.7142857142857 num_loss_counted_tokens: 4050
 total tokens: 8048 num samples: 16 num padding tokens: 1729 - rank: 6 max len: 503 min len: 314 avg len: 394.9375 num_loss_counted_tokens: 3712
 total tokens: 8070 num samples: 6 num padding tokens: 955 - rank: 2 max len: 1345 min len: 1089 avg len: 1185.8333333333333 num_loss_counted_tokens: 3253
 Per-token loss scaled by world size: 0.00020825346291530877
 Per-token loss scaled by world size: 0.0002562287845648825
 Per-token loss scaled by world size: 8.292648271890357e-05
 Per-token loss scaled by world size: 9.641618089517578e-05
 Per-token loss scaled by world size: 0.00017162703443318605
 Per-token loss scaled by world size: 9.9565637356136e-05
 Per-token loss scaled by world size: 8.746929961489514e-05
 Epoch: 0, Step: 97, Rank: 2, loss = 0.41501447558403015
 Epoch: 0, Step: 97, Rank: 0, loss = 0.3645939230918884
 Epoch: 0, Step: 97, Rank: 3, loss = 0.3456583023071289
 Epoch: 0, Step: 97, Rank: 6, loss = 0.8680524826049805
 Epoch: 0, Step: 97, Rank: 1, loss = 0.4018867313861847
 Epoch: 0, Step: 97, Rank: 7, loss = 0.7153843641281128
 Epoch: 0, Step: 97, Rank: 4, loss = 1.0680255889892578
 Per-token loss scaled by world size: 0.00028643777477554977
 Epoch: 0, Step: 97, Rank: 5, loss = 1.1939442157745361
 Epoch 0:  80%|████████  | 97/121 [04:07<01:00,  2.54s/it] total tokens: 8016 num samples: 8 num padding tokens: 1080 - rank: 4 max len: 1002 min len: 797 avg len: 867.0 num_loss_counted_tokens: 5620
 total tokens: 7887 num samples: 3 num padding tokens: 742 - rank: 1 max len: 2629 min len: 1930 avg len: 2381.6666666666665 num_loss_counted_tokens: 1049
 total tokens: 7076 num samples: 29 num padding tokens: 2680 - rank: 7 max len: 244 min len: 78 avg len: 151.58620689655172 num_loss_counted_tokens: 1550
 {
    "epoch": 0,
    "step": 97,
    "rank": 0,
    "loss": 0.3645939230918884,
    "overall_throughput": 41.762877356480914,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.456066131591797,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 33346,
    "batch_size": 92,
    "total_loss": 0.6715700030326843,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:09.460519"
 }
 total tokens: 8048 num samples: 16 num padding tokens: 2637 - rank: 6 max len: 503 min len: 255 avg len: 338.1875 num_loss_counted_tokens: 3126
 total tokens: 7820 num samples: 10 num padding tokens: 1247 - rank: 5 max len: 782 min len: 563 avg len: 657.3 num_loss_counted_tokens: 4844
 total tokens: 7668 num samples: 4 num padding tokens: 419 - rank: 2 max len: 1917 min len: 1666 avg len: 1812.25 num_loss_counted_tokens: 734
 total tokens: 7176 num samples: 6 num padding tokens: 542 - rank: 3 max len: 1196 min len: 1025 avg len: 1105.6666666666667 num_loss_counted_tokens: 3353
 total tokens: 7128 num samples: 2 num padding tokens: 2 - rank: 0 max len: 3564 min len: 3562 avg len: 3563.0 num_loss_counted_tokens: 172
 Per-token loss scaled by world size: 0.0005401856615208089
 Per-token loss scaled by world size: 0.000174855042132549
 Per-token loss scaled by world size: 0.0003811018541455269
 Per-token loss scaled by world size: 2.7653879442368634e-05
 Per-token loss scaled by world size: 0.00025230227038264275
 Per-token loss scaled by world size: 6.325829599518329e-05
 Per-token loss scaled by world size: 0.0003237307828385383
 Epoch: 0, Step: 98, Rank: 3, loss = 0.4728299081325531
 Epoch: 0, Step: 98, Rank: 2, loss = 1.030547022819519
 Epoch: 0, Step: 98, Rank: 0, loss = 0.07477954775094986
 Epoch: 0, Step: 98, Rank: 5, loss = 1.4607295989990234
 Epoch: 0, Step: 98, Rank: 4, loss = 0.6822568774223328
 Epoch: 0, Step: 98, Rank: 1, loss = 0.17105834186077118
 Epoch: 0, Step: 98, Rank: 7, loss = 0.8754084706306458
 Per-token loss scaled by world size: 0.00048297818284481764
 Epoch: 0, Step: 98, Rank: 6, loss = 1.3060333728790283
 Epoch 0:  81%|████████  | 98/121 [04:10<00:57,  2.52s/it] total tokens: 7452 num samples: 9 num padding tokens: 759 - rank: 4 max len: 828 min len: 690 avg len: 743.6666666666666 num_loss_counted_tokens: 4766
 total tokens: 8040 num samples: 6 num padding tokens: 578 - rank: 1 max len: 1340 min len: 1184 avg len: 1243.6666666666667 num_loss_counted_tokens: 2162
 {
    "epoch": 0,
    "step": 98,
    "rank": 0,
    "loss": 0.07477954775094986,
    "overall_throughput": 42.80100675786857,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.374258518218994,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21633,
    "batch_size": 82,
    "total_loss": 0.7592054009437561,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:11.932654"
 }
 total tokens: 7994 num samples: 7 num padding tokens: 237 - rank: 2 max len: 1142 min len: 1003 avg len: 1108.142857142857 num_loss_counted_tokens: 2239
 total tokens: 8008 num samples: 8 num padding tokens: 518 - rank: 3 max len: 1001 min len: 839 avg len: 936.25 num_loss_counted_tokens: 4873
 total tokens: 7579 num samples: 11 num padding tokens: 1520 - rank: 5 max len: 689 min len: 432 avg len: 550.8181818181819 num_loss_counted_tokens: 3863
 total tokens: 7808 num samples: 32 num padding tokens: 2556 - rank: 7 max len: 244 min len: 86 avg len: 164.125 num_loss_counted_tokens: 1928
 total tokens: 8080 num samples: 4 num padding tokens: 1956 - rank: 0 max len: 2020 min len: 1348 avg len: 1531.0 num_loss_counted_tokens: 2972
 total tokens: 7740 num samples: 18 num padding tokens: 1567 - rank: 6 max len: 430 min len: 249 avg len: 342.94444444444446 num_loss_counted_tokens: 3382
 Per-token loss scaled by world size: 0.00014469273446593434
 Per-token loss scaled by world size: 0.0002714892034418881
 Per-token loss scaled by world size: 0.00024302249948959798
 Per-token loss scaled by world size: 0.0003524755884427577
 Per-token loss scaled by world size: 6.74632319714874e-05
 Per-token loss scaled by world size: 0.0003295539354439825
 Per-token loss scaled by world size: 0.0001440553314751014
 Epoch: 0, Step: 99, Rank: 3, loss = 0.4647168815135956
 Epoch: 0, Step: 99, Rank: 5, loss = 0.8719554543495178
 Epoch: 0, Step: 99, Rank: 1, loss = 0.2166750431060791
 Epoch: 0, Step: 99, Rank: 7, loss = 0.7805275321006775
 Epoch: 0, Step: 99, Rank: 6, loss = 1.1320635080337524
 Epoch: 0, Step: 99, Rank: 4, loss = 1.058444857597351
 Per-token loss scaled by world size: 0.00019673565111588687
 Epoch: 0, Step: 99, Rank: 2, loss = 0.4626697301864624
 Epoch: 0, Step: 99, Rank: 0, loss = 0.6318657398223877
 Epoch 0:  82%|████████▏ | 99/121 [04:12<00:55,  2.54s/it] total tokens: 7893 num samples: 9 num padding tokens: 665 - rank: 4 max len: 877 min len: 700 avg len: 803.1111111111111 num_loss_counted_tokens: 3967
 total tokens: 2590 num samples: 14 num padding tokens: 751 - rank: 7 max len: 185 min len: 86 avg len: 131.35714285714286 num_loss_counted_tokens: 581
 total tokens: 7648 num samples: 8 num padding tokens: 131 - rank: 3 max len: 956 min len: 911 avg len: 939.625 num_loss_counted_tokens: 5962
 total tokens: 7645 num samples: 5 num padding tokens: 395 - rank: 1 max len: 1529 min len: 1349 avg len: 1450.0 num_loss_counted_tokens: 2441
 {
    "epoch": 0,
    "step": 99,
    "rank": 0,
    "loss": 0.6318657398223877,
    "overall_throughput": 40.78082568043985,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.533984184265137,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25694,
    "batch_size": 91,
    "total_loss": 0.7023648619651794,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:14.529016"
 }
 total tokens: 7752 num samples: 6 num padding tokens: 918 - rank: 2 max len: 1292 min len: 958 avg len: 1139.0 num_loss_counted_tokens: 2321
 total tokens: 7260 num samples: 3 num padding tokens: 1692 - rank: 0 max len: 2420 min len: 1564 avg len: 1856.0 num_loss_counted_tokens: 2978
 total tokens: 7740 num samples: 20 num padding tokens: 1925 - rank: 6 max len: 387 min len: 196 avg len: 290.75 num_loss_counted_tokens: 3585
 total tokens: 8064 num samples: 12 num padding tokens: 1582 - rank: 5 max len: 672 min len: 401 avg len: 540.1666666666666 num_loss_counted_tokens: 3589
 Per-token loss scaled by world size: 0.0002912842610385269
 Per-token loss scaled by world size: 0.00046773377107456326
 Per-token loss scaled by world size: 0.00034302467247471213
 Per-token loss scaled by world size: 0.0002211699465988204
 Per-token loss scaled by world size: 6.557774031534791e-05
 Per-token loss scaled by world size: 2.8064672733307816e-05
 Per-token loss scaled by world size: 2.8628314794332255e-06
 Epoch: 0, Step: 100, Rank: 5, loss = 1.2855077981948853
 Epoch: 0, Step: 100, Rank: 7, loss = 0.9427604675292969
 Epoch: 0, Step: 100, Rank: 4, loss = 0.6078579425811768
 Epoch: 0, Step: 100, Rank: 3, loss = 0.8005583882331848
 Epoch: 0, Step: 100, Rank: 2, loss = 0.1802322268486023
 Epoch: 0, Step: 100, Rank: 1, loss = 0.07713224738836288
 Epoch: 0, Step: 100, Rank: 0, loss = 0.007868134416639805
 Per-token loss scaled by world size: 0.0006381099228747189
 Epoch: 0, Step: 100, Rank: 6, loss = 1.753765344619751
 Epoch 0:  83%|████████▎ | 100/121 [04:15<00:53,  2.54s/it]
 total tokens: 5672 num samples: 2 num padding tokens: 1177 - rank: 1 max len: 2836 min len: 1659 avg len: 2247.5 num_loss_counted_tokens: 502
 total tokens: 7504 num samples: 8 num padding tokens: 1233 - rank: 4 max len: 938 min len: 657 avg len: 783.875 num_loss_counted_tokens: 4542
 {
    "epoch": 0,
    "step": 100,
    "rank": 0,
    "loss": 0.007868134416639805,
    "overall_throughput": 41.97207337460162,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.315226078033447,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21987,
    "batch_size": 69,
    "total_loss": 0.7069603800773621,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:17.051456"
 }
 total tokens: 7908 num samples: 6 num padding tokens: 1459 - rank: 3 max len: 1318 min len: 961 avg len: 1074.8333333333333 num_loss_counted_tokens: 4507
 total tokens: 7644 num samples: 12 num padding tokens: 705 - rank: 5 max len: 637 min len: 515 avg len: 578.25 num_loss_counted_tokens: 4604
 total tokens: 8032 num samples: 16 num padding tokens: 1718 - rank: 6 max len: 502 min len: 297 avg len: 394.625 num_loss_counted_tokens: 3774
 total tokens: 7306 num samples: 26 num padding tokens: 2547 - rank: 7 max len: 281 min len: 87 avg len: 183.03846153846155 num_loss_counted_tokens: 1982
 total tokens: 6576 num samples: 4 num padding tokens: 507 - rank: 2 max len: 1644 min len: 1411 avg len: 1517.25 num_loss_counted_tokens: 3636
 total tokens: 6544 num samples: 2 num padding tokens: 236 - rank: 0 max len: 3272 min len: 3036 avg len: 3154.0 num_loss_counted_tokens: 209
 Per-token loss scaled by world size: 0.00010836837464012206
 Per-token loss scaled by world size: 0.0003503480111248791
 Per-token loss scaled by world size: 0.0005262716440483928
 Per-token loss scaled by world size: 0.0004925990360789001
 Per-token loss scaled by world size: 0.0006540374597534537
 Per-token loss scaled by world size: 8.3443388575688e-05
 Per-token loss scaled by world size: 2.478029045960284e-06
 Epoch: 0, Step: 101, Rank: 6, loss = 1.2019386291503906
 Epoch: 0, Step: 101, Rank: 4, loss = 0.8001510500907898
 Epoch: 0, Step: 101, Rank: 5, loss = 1.4937398433685303
 Epoch: 0, Step: 101, Rank: 2, loss = 0.24749982357025146
 Epoch: 0, Step: 101, Rank: 7, loss = 1.1250346899032593
 Epoch: 0, Step: 101, Rank: 1, loss = 0.1905742734670639
 Epoch: 0, Step: 101, Rank: 0, loss = 0.005659508518874645
 Per-token loss scaled by world size: 0.00031609582947567105
 Epoch: 0, Step: 101, Rank: 3, loss = 0.7219233512878418
 Epoch 0:  83%|████████▎ | 101/121 [04:17<00:50,  2.54s/it]
 total tokens: 7903 num samples: 7 num padding tokens: 827 - rank: 4 max len: 1129 min len: 858 avg len: 1010.8571428571429 num_loss_counted_tokens: 3333
 total tokens: 5778 num samples: 2 num padding tokens: 561 - rank: 1 max len: 2889 min len: 2328 avg len: 2608.5 num_loss_counted_tokens: 499
 {
    "epoch": 0,
    "step": 101,
    "rank": 0,
    "loss": 0.005659508518874645,
    "overall_throughput": 41.27268820447733,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.25443983078003,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18271,
    "batch_size": 78,
    "total_loss": 0.7233151197433472,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:19.615502"
 }
 total tokens: 7981 num samples: 23 num padding tokens: 4218 - rank: 7 max len: 347 min len: 81 avg len: 163.6086956521739 num_loss_counted_tokens: 1503
 total tokens: 7659 num samples: 9 num padding tokens: 737 - rank: 5 max len: 851 min len: 617 avg len: 769.1111111111111 num_loss_counted_tokens: 5452
 total tokens: 7125 num samples: 5 num padding tokens: 617 - rank: 3 max len: 1425 min len: 1151 avg len: 1301.6 num_loss_counted_tokens: 3466
 total tokens: 6171 num samples: 3 num padding tokens: 616 - rank: 2 max len: 2057 min len: 1742 avg len: 1851.6666666666667 num_loss_counted_tokens: 894
 total tokens: 6344 num samples: 2 num padding tokens: 281 - rank: 0 max len: 3172 min len: 2891 avg len: 3031.5 num_loss_counted_tokens: 198
 total tokens: 7878 num samples: 13 num padding tokens: 1215 - rank: 6 max len: 606 min len: 373 avg len: 512.5384615384615 num_loss_counted_tokens: 4756
 Per-token loss scaled by world size: 0.0004239458357915282
 Per-token loss scaled by world size: 0.00031359592685475945
 Per-token loss scaled by world size: 0.00021933596872258931
 Per-token loss scaled by world size: 0.00028875406133010983
 Per-token loss scaled by world size: 0.000378787808585912
 Per-token loss scaled by world size: 6.144649523776025e-05
 Per-token loss scaled by world size: 0.0003507360816001892
 Epoch: 0, Step: 102, Rank: 2, loss = 0.9117801189422607
 Epoch: 0, Step: 102, Rank: 5, loss = 1.232622504234314
 Epoch: 0, Step: 102, Rank: 4, loss = 0.8395524621009827
 Epoch: 0, Step: 102, Rank: 6, loss = 1.101325511932373
 Epoch: 0, Step: 102, Rank: 3, loss = 0.6377193331718445
 Epoch: 0, Step: 102, Rank: 0, loss = 0.1786556839942932
 Epoch: 0, Step: 102, Rank: 7, loss = 1.0197651386260986
 Per-token loss scaled by world size: 0.00012210274871904403
 Epoch: 0, Step: 102, Rank: 1, loss = 0.35501372814178467
 Epoch 0:  84%|████████▍ | 102/121 [04:20<00:48,  2.56s/it]
 total tokens: 7592 num samples: 8 num padding tokens: 994 - rank: 4 max len: 949 min len: 717 avg len: 824.75 num_loss_counted_tokens: 4547
 total tokens: 6717 num samples: 3 num padding tokens: 1106 - rank: 1 max len: 2239 min len: 1619 avg len: 1870.3333333333333 num_loss_counted_tokens: 2357
 {
    "epoch": 0,
    "step": 102,
    "rank": 0,
    "loss": 0.1786556839942932,
    "overall_throughput": 40.7946280716196,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.383514404296875,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23260,
    "batch_size": 97,
    "total_loss": 0.7845543026924133,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:22.207291"
 }
 total tokens: 8022 num samples: 14 num padding tokens: 1912 - rank: 6 max len: 573 min len: 313 avg len: 436.42857142857144 num_loss_counted_tokens: 3612
 total tokens: 7575 num samples: 5 num padding tokens: 437 - rank: 2 max len: 1515 min len: 1354 avg len: 1427.6 num_loss_counted_tokens: 3239
 total tokens: 7656 num samples: 11 num padding tokens: 467 - rank: 5 max len: 696 min len: 601 avg len: 653.5454545454545 num_loss_counted_tokens: 3901
 total tokens: 7392 num samples: 24 num padding tokens: 2967 - rank: 7 max len: 308 min len: 80 avg len: 184.375 num_loss_counted_tokens: 1964
 total tokens: 7478 num samples: 2 num padding tokens: 770 - rank: 0 max len: 3739 min len: 2969 avg len: 3354.0 num_loss_counted_tokens: 161
 total tokens: 7944 num samples: 6 num padding tokens: 1452 - rank: 3 max len: 1324 min len: 951 avg len: 1082.0 num_loss_counted_tokens: 5314
 Per-token loss scaled by world size: 0.000377753924112767
 Per-token loss scaled by world size: 0.0003784565778914839
 Per-token loss scaled by world size: 0.00017589255003258586
 Per-token loss scaled by world size: 0.00025099579943343997
 Per-token loss scaled by world size: 0.0002923366264440119
 Per-token loss scaled by world size: 1.306815647694748e-06
 Per-token loss scaled by world size: 8.565741882193834e-05
 Epoch: 0, Step: 103, Rank: 5, loss = 1.0730663537979126
 Epoch: 0, Step: 103, Rank: 3, loss = 0.7116672396659851
 Epoch: 0, Step: 103, Rank: 1, loss = 0.49872133135795593
 Epoch: 0, Step: 103, Rank: 0, loss = 0.0037053122650831938
 Epoch: 0, Step: 103, Rank: 6, loss = 1.0710740089416504
 Epoch: 0, Step: 103, Rank: 4, loss = 0.828883945941925
 Epoch: 0, Step: 103, Rank: 7, loss = 0.24287091195583344
 Per-token loss scaled by world size: 0.00028907370870001614
 Epoch: 0, Step: 103, Rank: 2, loss = 0.819632351398468
 Epoch 0:  85%|████████▌ | 103/121 [04:22<00:45,  2.55s/it]
 total tokens: 7472 num samples: 8 num padding tokens: 859 - rank: 4 max len: 934 min len: 761 avg len: 826.625 num_loss_counted_tokens: 4441
 total tokens: 5632 num samples: 2 num padding tokens: 254 - rank: 1 max len: 2816 min len: 2562 avg len: 2689.0 num_loss_counted_tokens: 324
 total tokens: 7860 num samples: 30 num padding tokens: 3085 - rank: 7 max len: 262 min len: 70 avg len: 159.16666666666666 num_loss_counted_tokens: 2067
 total tokens: 7635 num samples: 15 num padding tokens: 2223 - rank: 6 max len: 509 min len: 271 avg len: 360.8 num_loss_counted_tokens: 2976
 {
    "epoch": 0,
    "step": 103,
    "rank": 0,
    "loss": 0.0037053122650831938,
    "overall_throughput": 41.79496526057776,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.362098217010498,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22683,
    "batch_size": 80,
    "total_loss": 0.6562026739120483,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:24.741846"
 }
 total tokens: 7140 num samples: 5 num padding tokens: 839 - rank: 3 max len: 1428 min len: 1015 avg len: 1260.2 num_loss_counted_tokens: 4097
 total tokens: 7520 num samples: 10 num padding tokens: 1082 - rank: 5 max len: 752 min len: 531 avg len: 643.8 num_loss_counted_tokens: 3194
 total tokens: 6831 num samples: 3 num padding tokens: 752 - rank: 2 max len: 2277 min len: 1793 avg len: 2026.3333333333333 num_loss_counted_tokens: 2435
 total tokens: 6642 num samples: 2 num padding tokens: 16 - rank: 0 max len: 3321 min len: 3305 avg len: 3313.0 num_loss_counted_tokens: 167
 Per-token loss scaled by world size: 0.00023188922205008566
 Per-token loss scaled by world size: 0.0005923461285419762
 Per-token loss scaled by world size: 0.0007710273494012654
 Per-token loss scaled by world size: 0.0006771996268071234
 Per-token loss scaled by world size: 5.4260908655123785e-06
 Per-token loss scaled by world size: 7.5567550084088e-06
 Per-token loss scaled by world size: 0.00048243210767395794
 Epoch: 0, Step: 104, Rank: 5, loss = 1.1573703289031982
 Epoch: 0, Step: 104, Rank: 6, loss = 1.5064910650253296
 Epoch: 0, Step: 104, Rank: 3, loss = 0.4530825614929199
 Epoch: 0, Step: 104, Rank: 4, loss = 1.323163390159607
 Epoch: 0, Step: 104, Rank: 2, loss = 0.010601903311908245
 Epoch: 0, Step: 104, Rank: 1, loss = 0.014764954335987568
 Epoch: 0, Step: 104, Rank: 7, loss = 0.9426120519638062
 Per-token loss scaled by world size: 8.817338675726205e-05
 Epoch: 0, Step: 104, Rank: 0, loss = 0.17227977514266968
 Epoch 0:  86%|████████▌ | 104/121 [04:25<00:43,  2.55s/it]
 total tokens: 7770 num samples: 10 num padding tokens: 458 - rank: 4 max len: 777 min len: 676 avg len: 731.2 num_loss_counted_tokens: 4387
 total tokens: 7380 num samples: 5 num padding tokens: 709 - rank: 1 max len: 1476 min len: 1222 avg len: 1334.2 num_loss_counted_tokens: 2606
 total tokens: 7931 num samples: 7 num padding tokens: 471 - rank: 2 max len: 1133 min len: 997 avg len: 1065.7142857142858 num_loss_counted_tokens: 2877
 {
    "epoch": 0,
    "step": 104,
    "rank": 0,
    "loss": 0.17227977514266968,
    "overall_throughput": 41.371928375269654,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.487372398376465,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 15631,
    "batch_size": 56,
    "total_loss": 0.6975457668304443,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:27.296912"
 }
 total tokens: 8100 num samples: 12 num padding tokens: 2394 - rank: 5 max len: 675 min len: 353 avg len: 475.5 num_loss_counted_tokens: 3145
 total tokens: 7684 num samples: 34 num padding tokens: 2510 - rank: 7 max len: 226 min len: 80 avg len: 152.1764705882353 num_loss_counted_tokens: 1885
 total tokens: 7496 num samples: 8 num padding tokens: 516 - rank: 3 max len: 937 min len: 786 avg len: 872.5 num_loss_counted_tokens: 4095
 total tokens: 8073 num samples: 23 num padding tokens: 1383 - rank: 6 max len: 351 min len: 231 avg len: 290.8695652173913 num_loss_counted_tokens: 3443
 total tokens: 6147 num samples: 3 num padding tokens: 667 - rank: 0 max len: 2049 min len: 1588 avg len: 1826.6666666666667 num_loss_counted_tokens: 458
 Per-token loss scaled by world size: 0.00020804539963137358
 Per-token loss scaled by world size: 0.00024777904036454856
 Per-token loss scaled by world size: 0.00040929196984507143
 Per-token loss scaled by world size: 0.00047369435196742415
 Per-token loss scaled by world size: 0.00023292600235436112
 Per-token loss scaled by world size: 0.0002660582831595093
 Per-token loss scaled by world size: 2.4003027647268027e-05
 Epoch: 0, Step: 105, Rank: 6, loss = 1.2955113649368286
 Epoch: 0, Step: 105, Rank: 5, loss = 1.4993610382080078
 Epoch: 0, Step: 105, Rank: 7, loss = 0.7842825651168823
 Epoch: 0, Step: 105, Rank: 3, loss = 0.7372690439224243
 Epoch: 0, Step: 105, Rank: 2, loss = 0.6585156917572021
 Epoch: 0, Step: 105, Rank: 0, loss = 0.07597558200359344
 Epoch: 0, Step: 105, Rank: 4, loss = 0.8421409726142883
 Per-token loss scaled by world size: 0.0001044606979121454
 Epoch: 0, Step: 105, Rank: 1, loss = 0.3306442201137543
 Epoch 0:  87%|████████▋ | 105/121 [04:28<00:40,  2.56s/it]
 {
    "epoch": 0,
    "step": 105,
    "rank": 0,
    "loss": 0.07597558200359344,
    "overall_throughput": 41.19873338301903,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.338607788085938,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25322,
    "batch_size": 90,
    "total_loss": 0.7779626250267029,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:29.864769"
 }
 Per-token loss scaled by world size: 0.0007665135781280696
 Per-token loss scaled by world size: 0.0007400316535495222
 Per-token loss scaled by world size: 0.0002667310182005167
 Per-token loss scaled by world size: 0.000634319381788373
 Per-token loss scaled by world size: 0.00010545395343797281
 Per-token loss scaled by world size: 4.777937192557147e-06
 Per-token loss scaled by world size: 1.3243148941910476e-06
 Epoch: 0, Step: 106, Rank: 2, loss = 0.21801286935806274
 Epoch: 0, Step: 106, Rank: 4, loss = 1.5846710205078125
 Epoch: 0, Step: 106, Rank: 3, loss = 0.5514330267906189
 Epoch: 0, Step: 106, Rank: 6, loss = 1.3113759756088257
 Epoch: 0, Step: 106, Rank: 0, loss = 0.00987778790295124
 Epoch: 0, Step: 106, Rank: 7, loss = 1.5299229621887207
 Epoch: 0, Step: 106, Rank: 1, loss = 0.002737855538725853
 Per-token loss scaled by world size: 0.0005778921768069267
 Epoch: 0, Step: 106, Rank: 5, loss = 1.1947197914123535
 Epoch 0:  88%|████████▊ | 106/121 [04:30<00:38,  2.53s/it]
 {
    "epoch": 0,
    "step": 106,
    "rank": 0,
    "loss": 0.00987778790295124,
    "overall_throughput": 42.66049518366333,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.433530807495117,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16539,
    "batch_size": 82,
    "total_loss": 0.800343930721283,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:32.345371"
 }
 Per-token loss scaled by world size: 0.00032297801226377487
 Per-token loss scaled by world size: 0.00039764257962815464
 Per-token loss scaled by world size: 0.00018069567158818245
 Per-token loss scaled by world size: 0.00018984438793268055
 Per-token loss scaled by world size: 0.00037407863419502974
 Per-token loss scaled by world size: 2.2991487185208825e-06
 Per-token loss scaled by world size: 0.00024643141659907997
 Epoch: 0, Step: 107, Rank: 1, loss = 0.6019198894500732
 Epoch: 0, Step: 107, Rank: 4, loss = 1.3245971202850342
 Epoch: 0, Step: 107, Rank: 2, loss = 0.6323953866958618
 Epoch: 0, Step: 107, Rank: 6, loss = 1.0758801698684692
 Epoch: 0, Step: 107, Rank: 0, loss = 0.007658752147108316
 Epoch: 0, Step: 107, Rank: 3, loss = 1.2461026906967163
 Epoch: 0, Step: 107, Rank: 7, loss = 0.8208938241004944
 Per-token loss scaled by world size: 0.00040240780799649656
 Epoch: 0, Step: 107, Rank: 5, loss = 1.3404706716537476
 Epoch 0:  88%|████████▊ | 107/121 [04:33<00:35,  2.54s/it]
 {
    "epoch": 0,
    "step": 107,
    "rank": 0,
    "loss": 0.007658752147108316,
    "overall_throughput": 41.73801803946499,
    "lr": 1.6000000000000001e-06,
    "cuda_mem_allocated": 24.428075790405273,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26649,
    "batch_size": 107,
    "total_loss": 0.8812397718429565,
    "gradnorm": 1.0122549533843994,
    "weight_norm": 433.0432434082031,
    "timestamp": "2024-08-18T20:52:34.882920"
 }
 Per-token loss scaled by world size: 0.0003084656782448292
 Per-token loss scaled by world size: 0.0003857612609863281
 Per-token loss scaled by world size: 7.505448593292385e-05
 Per-token loss scaled by world size: 3.2563605145696783e-06
 Per-token loss scaled by world size: 6.232602754607797e-05
 Per-token loss scaled by world size: 0.00027665658853948116
 Per-token loss scaled by world size: 0.00029414540040306747
 Epoch: 0, Step: 108, Rank: 5, loss = 1.2347253561019897
 Epoch: 0, Step: 108, Rank: 1, loss = 0.24023064970970154
 Epoch: 0, Step: 108, Rank: 0, loss = 0.01042279601097107
 Epoch: 0, Step: 108, Rank: 2, loss = 0.199490025639534
 Epoch: 0, Step: 108, Rank: 4, loss = 0.9873215556144714
 Epoch: 0, Step: 108, Rank: 6, loss = 0.9414858818054199
 Epoch: 0, Step: 108, Rank: 7, loss = 0.8855085372924805
 Per-token loss scaled by world size: 0.0003316248476039618
 Epoch: 0, Step: 108, Rank: 3, loss = 1.0614482164382935
 [2024-08-18 20:52:37,412] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[2.4000000000000003e-06], mom=[(0.9, 0.95)]
 [2024-08-18 20:52:37,489] [INFO] [timer.py:258:stop] epoch=0/micro_step=108/global_step=3, RunningAvgSamplesPerSec=41.632043299114166, CurrSamplesPerSec=41.632043299114166, MemAllocated=22.7GB, MaxMemAllocated=30.58GB
 Epoch 0:  89%|████████▉ | 108/121 [04:35<00:33,  2.56s/it]
 {
    "epoch": 0,
    "step": 108,
    "rank": 0,
    "loss": 0.01042279601097107,
    "overall_throughput": 40.57748284835933,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 22.696479320526123,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25606,
    "batch_size": 79,
    "total_loss": 0.6950791478157043,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:37.550745"
 }
 Per-token loss scaled by world size: 0.0006956385332159698
 Per-token loss scaled by world size: 0.0003228207933716476
 Per-token loss scaled by world size: 3.4351643989793956e-05
 Per-token loss scaled by world size: 7.628021558048204e-05
 Per-token loss scaled by world size: 0.0002656075230333954
 Per-token loss scaled by world size: 0.0005566730978898704
 Epoch: 0, Step: 109, Rank: 3, loss = 0.7804192900657654
 Epoch: 0, Step: 109, Rank: 5, loss = 1.681706190109253
 Epoch: 0, Step: 109, Rank: 1, loss = 0.0830451026558876
 Epoch: 0, Step: 109, Rank: 2, loss = 0.18440742790699005
 Per-token loss scaled by world size: 0.00023115344811230898
 Epoch: 0, Step: 109, Rank: 4, loss = 0.6421061754226685
 Epoch: 0, Step: 109, Rank: 6, loss = 1.345757246017456
 Per-token loss scaled by world size: 2.6328027161071077e-05
 Epoch: 0, Step: 109, Rank: 7, loss = 0.5588134527206421
 Epoch: 0, Step: 109, Rank: 0, loss = 0.06364800781011581
 Epoch 0:  90%|█████████ | 109/121 [04:38<00:31,  2.59s/it]
 {
    "epoch": 0,
    "step": 109,
    "rank": 0,
    "loss": 0.06364800781011581,
    "overall_throughput": 40.20414871551739,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.495201587677002,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19340,
    "batch_size": 78,
    "total_loss": 0.6674879193305969,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:40.142481"
 }
 Per-token loss scaled by world size: 0.00015839148545637727
 Per-token loss scaled by world size: 1.4700900692332652e-06
 Per-token loss scaled by world size: 0.00025009570526890457
 Per-token loss scaled by world size: 0.00011483808339107782
 Per-token loss scaled by world size: 1.5624003935954534e-05
 Per-token loss scaled by world size: 0.00039461886626668274
 Per-token loss scaled by world size: 0.0002000233216676861
 Epoch: 0, Step: 110, Rank: 2, loss = 0.8055582642555237
 Epoch: 0, Step: 110, Rank: 5, loss = 1.2710673809051514
 Epoch: 0, Step: 110, Rank: 0, loss = 0.004735160153359175
 Epoch: 0, Step: 110, Rank: 4, loss = 0.36989346146583557
 Epoch: 0, Step: 110, Rank: 1, loss = 0.05032491683959961
 Epoch: 0, Step: 110, Rank: 3, loss = 0.5101789832115173
 Epoch: 0, Step: 110, Rank: 7, loss = 0.6442751288414001
 Per-token loss scaled by world size: 0.00032995041692629457
 Epoch: 0, Step: 110, Rank: 6, loss = 1.0627702474594116
 Epoch 0:  91%|█████████ | 110/121 [04:40<00:28,  2.56s/it]
 {
    "epoch": 0,
    "step": 110,
    "rank": 0,
    "loss": 0.004735160153359175,
    "overall_throughput": 41.98747405186681,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.342525005340576,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25768,
    "batch_size": 78,
    "total_loss": 0.5898504853248596,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:42.652816"
 }
 Per-token loss scaled by world size: 0.0003511311369948089
 Per-token loss scaled by world size: 0.0003018661809619516
 Per-token loss scaled by world size: 0.00043145185918547213
 Per-token loss scaled by world size: 0.0006916387937963009
 Per-token loss scaled by world size: 0.00025981958606280386
 Per-token loss scaled by world size: 0.000197814850253053
 Per-token loss scaled by world size: 5.381280425353907e-05
 Epoch: 0, Step: 111, Rank: 6, loss = 0.877037763595581
 Epoch: 0, Step: 111, Rank: 5, loss = 1.7275407314300537
 Epoch: 0, Step: 111, Rank: 4, loss = 0.7539862394332886
 Epoch: 0, Step: 111, Rank: 7, loss = 1.0776588916778564
 Epoch: 0, Step: 111, Rank: 3, loss = 0.6489643454551697
 Epoch: 0, Step: 111, Rank: 2, loss = 0.49409204721450806
 Epoch: 0, Step: 111, Rank: 1, loss = 0.13441093266010284
 Per-token loss scaled by world size: 6.49542971586925e-06
 Epoch: 0, Step: 111, Rank: 0, loss = 0.016223959624767303
 Epoch 0:  92%|█████████▏| 111/121 [04:43<00:25,  2.57s/it]
 {
    "epoch": 0,
    "step": 111,
    "rank": 0,
    "loss": 0.016223959624767303,
    "overall_throughput": 41.001441887063585,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.491368293762207,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19982,
    "batch_size": 69,
    "total_loss": 0.716239333152771,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:45.269018"
 }
 Per-token loss scaled by world size: 0.00018836244998965412
 Per-token loss scaled by world size: 0.00022297118266578764
 Per-token loss scaled by world size: 0.0004412824346218258
 Per-token loss scaled by world size: 0.0002451048349030316
 Per-token loss scaled by world size: 2.037934336840408e-06
 Per-token loss scaled by world size: 0.00028445024508982897
 Per-token loss scaled by world size: 0.00019181481911800802
 Epoch: 0, Step: 112, Rank: 4, loss = 0.735774040222168
 Epoch: 0, Step: 112, Rank: 6, loss = 1.3246747255325317
 Epoch: 0, Step: 112, Rank: 1, loss = 0.5654405355453491
 Epoch: 0, Step: 112, Rank: 0, loss = 0.00611762423068285
 Epoch: 0, Step: 112, Rank: 3, loss = 0.6693316102027893
 Epoch: 0, Step: 112, Rank: 2, loss = 0.8538841009140015
 Epoch: 0, Step: 112, Rank: 7, loss = 0.5758041143417358
 Per-token loss scaled by world size: 0.00042108085472136736
 Epoch: 0, Step: 112, Rank: 5, loss = 1.2640321254730225
 Epoch 0:  93%|█████████▎| 112/121 [04:45<00:23,  2.56s/it]
 {
    "epoch": 0,
    "step": 112,
    "rank": 0,
    "loss": 0.00611762423068285,
    "overall_throughput": 41.80928245712371,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.407159328460693,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24015,
    "batch_size": 92,
    "total_loss": 0.7493823766708374,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:47.762170"
 }
 Per-token loss scaled by world size: 0.0004581383545883
 Per-token loss scaled by world size: 0.0004093858879059553
 Per-token loss scaled by world size: 0.0004562860412988812
 Per-token loss scaled by world size: 0.0003013689420185983
 Per-token loss scaled by world size: 8.010071906028315e-05
 Per-token loss scaled by world size: 9.55424093262991e-06
 Per-token loss scaled by world size: 0.0002337862824788317
 Epoch: 0, Step: 113, Rank: 6, loss = 1.3187236785888672
 Epoch: 0, Step: 113, Rank: 5, loss = 1.1831763982772827
 Epoch: 0, Step: 113, Rank: 4, loss = 1.3240771293640137
 Epoch: 0, Step: 113, Rank: 1, loss = 0.23150108754634857
 Epoch: 0, Step: 113, Rank: 3, loss = 0.8709939122200012
 Epoch: 0, Step: 113, Rank: 0, loss = 0.027612950652837753
 Epoch: 0, Step: 113, Rank: 7, loss = 0.6756715774536133
 Per-token loss scaled by world size: 0.0002764550154097378
 Epoch: 0, Step: 113, Rank: 2, loss = 0.7989895343780518
 Epoch 0:  93%|█████████▎| 113/121 [04:48<00:20,  2.56s/it]
 {
    "epoch": 0,
    "step": 113,
    "rank": 0,
    "loss": 0.027612950652837753,
    "overall_throughput": 41.14323140579119,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.228994369506836,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23121,
    "batch_size": 80,
    "total_loss": 0.8038432598114014,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:50.330248"
 }
 Per-token loss scaled by world size: 0.0005671991966664791
 Per-token loss scaled by world size: 0.00022458804596681148
 Per-token loss scaled by world size: 0.00021035455574747175
 Per-token loss scaled by world size: 4.355869896244258e-05
 Per-token loss scaled by world size: 7.875097253418062e-06
 Per-token loss scaled by world size: 0.0004406924417708069
 Per-token loss scaled by world size: 0.0002821373345796019
 Epoch: 0, Step: 114, Rank: 6, loss = 1.126409888267517
 Epoch: 0, Step: 114, Rank: 0, loss = 0.020128747448325157
 Epoch: 0, Step: 114, Rank: 5, loss = 1.449761152267456
 Epoch: 0, Step: 114, Rank: 4, loss = 0.5376662611961365
 Epoch: 0, Step: 114, Rank: 3, loss = 0.5740470290184021
 Epoch: 0, Step: 114, Rank: 2, loss = 0.11133603751659393
 Epoch: 0, Step: 114, Rank: 7, loss = 0.7211430072784424
 Per-token loss scaled by world size: 2.8121936338720843e-05
 Epoch: 0, Step: 114, Rank: 1, loss = 0.07187967002391815
 Epoch 0:  94%|█████████▍| 114/121 [04:51<00:17,  2.56s/it]
 {
    "epoch": 0,
    "step": 114,
    "rank": 0,
    "loss": 0.020128747448325157,
    "overall_throughput": 41.23395474647752,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.419190883636475,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20448,
    "batch_size": 78,
    "total_loss": 0.5765464305877686,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:52.896686"
 }
 Per-token loss scaled by world size: 0.00020671900711022317
 Per-token loss scaled by world size: 0.00019181812240276486
 Per-token loss scaled by world size: 0.00029149872716516256
 Per-token loss scaled by world size: 0.00032473698956891894
 Per-token loss scaled by world size: 0.00032112447661347687
 Per-token loss scaled by world size: 0.0003388051700312644
 Per-token loss scaled by world size: 0.00025795208057388663
 Epoch: 0, Step: 115, Rank: 3, loss = 0.9541117548942566
 Epoch: 0, Step: 115, Rank: 1, loss = 0.6278446912765503
 Epoch: 0, Step: 115, Rank: 0, loss = 0.6766171455383301
 Epoch: 0, Step: 115, Rank: 6, loss = 1.062904715538025
 Epoch: 0, Step: 115, Rank: 5, loss = 1.051080584526062
 Epoch: 0, Step: 115, Rank: 7, loss = 0.8443094491958618
 Epoch: 0, Step: 115, Rank: 4, loss = 1.1089516878128052
 Per-token loss scaled by world size: 0.00011388205894036219
 Epoch: 0, Step: 115, Rank: 2, loss = 0.3727502226829529
 Epoch 0:  95%|█████████▌| 115/121 [04:53<00:15,  2.57s/it]
 {
    "epoch": 0,
    "step": 115,
    "rank": 0,
    "loss": 0.6766171455383301,
    "overall_throughput": 40.978208090508964,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.531991481781006,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26185,
    "batch_size": 95,
    "total_loss": 0.8373212814331055,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:55.477863"
 }
 Per-token loss scaled by world size: 0.00025488759274594486
 Per-token loss scaled by world size: 0.00010854459105757996
 Per-token loss scaled by world size: 0.00016505405073985457
 Per-token loss scaled by world size: 0.00032808436662890017
 Per-token loss scaled by world size: 0.00027797490474767983
 Per-token loss scaled by world size: 0.00035925835254602134
 Epoch: 0, Step: 116, Rank: 5, loss = 1.0426521301269531
 Epoch: 0, Step: 116, Rank: 0, loss = 0.5245417952537537
 Epoch: 0, Step: 116, Rank: 3, loss = 0.8100327849388123
 Epoch: 0, Step: 116, Rank: 1, loss = 0.3449546992778778
 Epoch: 0, Step: 116, Rank: 6, loss = 1.1417230367660522
 Epoch: 0, Step: 116, Rank: 4, loss = 0.8834042549133301
 Per-token loss scaled by world size: 6.462022429332137e-05
 Per-token loss scaled by world size: 4.7339886805275455e-05
 Epoch: 0, Step: 116, Rank: 2, loss = 0.15044616162776947
 Epoch: 0, Step: 116, Rank: 7, loss = 0.20536306500434875
 Epoch 0:  96%|█████████▌| 116/121 [04:56<00:12,  2.55s/it]
 {
    "epoch": 0,
    "step": 116,
    "rank": 0,
    "loss": 0.5245417952537537,
    "overall_throughput": 42.05718076221283,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.434387683868408,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25424,
    "batch_size": 77,
    "total_loss": 0.6378897428512573,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:52:57.991057"
 }
 Per-token loss scaled by world size: 0.00043317166273482144
 Per-token loss scaled by world size: 0.0003517000295687467
 Per-token loss scaled by world size: 0.0003901177551597357
 Per-token loss scaled by world size: 0.00024814700009301305
 Per-token loss scaled by world size: 0.0001685179740888998
 Per-token loss scaled by world size: 2.624317403387977e-06
 Per-token loss scaled by world size: 2.476824556651991e-05
 Epoch: 0, Step: 117, Rank: 3, loss = 1.0443732738494873
 Epoch: 0, Step: 117, Rank: 5, loss = 1.2863032817840576
 Epoch: 0, Step: 117, Rank: 7, loss = 0.7368724942207336
 Epoch: 0, Step: 117, Rank: 4, loss = 1.1584546566009521
 Epoch: 0, Step: 117, Rank: 0, loss = 0.007792910560965538
 Epoch: 0, Step: 117, Rank: 2, loss = 0.5004141330718994
 Epoch: 0, Step: 117, Rank: 1, loss = 0.0735493078827858
 Per-token loss scaled by world size: 0.00034316719393245876
 Epoch: 0, Step: 117, Rank: 6, loss = 1.0190349817276
 Epoch 0:  97%|█████████▋| 117/121 [04:58<00:10,  2.54s/it]{
    "epoch": 0,
    "step": 117,
    "rank": 0,
    "loss": 0.007792910560965538,
    "overall_throughput": 42.14977654824287,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.35035228729248,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23756,
    "batch_size": 76,
    "total_loss": 0.7283493876457214,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:00.498739"
 }
 Per-token loss scaled by world size: 0.00027391428011469543
 Per-token loss scaled by world size: 0.00038771191611886024
 Per-token loss scaled by world size: 0.0005671838880516589
 Per-token loss scaled by world size: 6.881056378915673e-06
 Per-token loss scaled by world size: 4.1250186768593267e-05
 Per-token loss scaled by world size: 7.28774830349721e-05
 Per-token loss scaled by world size: 0.00023806751414667815
 Epoch: 0, Step: 118, Rank: 5, loss = 1.42512047290802
 Epoch: 0, Step: 118, Rank: 0, loss = 0.01728951372206211
 Epoch: 0, Step: 118, Rank: 3, loss = 0.9741746783256531
 Epoch: 0, Step: 118, Rank: 4, loss = 0.6882438659667969
 Epoch: 0, Step: 118, Rank: 2, loss = 0.18311378359794617
 Epoch: 0, Step: 118, Rank: 1, loss = 0.10364624857902527
 Epoch: 0, Step: 118, Rank: 7, loss = 0.5981743931770325
 Per-token loss scaled by world size: 0.0005670187529176474
 Epoch: 0, Step: 118, Rank: 6, loss = 1.4247055053710938
 Epoch 0:  98%|█████████▊| 118/121 [05:01<00:07,  2.53s/it]{
    "epoch": 0,
    "step": 118,
    "rank": 0,
    "loss": 0.01728951372206211,
    "overall_throughput": 42.35613860527613,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.3255033493042,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20101,
    "batch_size": 64,
    "total_loss": 0.6768085956573486,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:03.034819"
 }
 Per-token loss scaled by world size: 0.0003004461759701371
 Per-token loss scaled by world size: 0.0002987224142998457
 Per-token loss scaled by world size: 0.0002618007711134851
 Per-token loss scaled by world size: 0.0003349074686411768
 Per-token loss scaled by world size: 3.861719960696064e-06
 Per-token loss scaled by world size: 0.0002586914342828095
 Per-token loss scaled by world size: 0.00010813030530698597
 Epoch: 0, Step: 119, Rank: 0, loss = 0.012113732285797596
 Epoch: 0, Step: 119, Rank: 2, loss = 0.821236252784729
 Epoch: 0, Step: 119, Rank: 6, loss = 0.9424620866775513
 Epoch: 0, Step: 119, Rank: 4, loss = 0.9370548725128174
 Epoch: 0, Step: 119, Rank: 5, loss = 1.050562858581543
 Epoch: 0, Step: 119, Rank: 7, loss = 0.8114826679229736
 Epoch: 0, Step: 119, Rank: 1, loss = 0.3391912579536438
 Per-token loss scaled by world size: 0.00019692791101988405
 Epoch: 0, Step: 119, Rank: 3, loss = 0.6177382469177246
 Epoch 0:  98%|█████████▊| 119/121 [05:03<00:05,  2.52s/it]{
    "epoch": 0,
    "step": 119,
    "rank": 0,
    "loss": 0.012113732285797596,
    "overall_throughput": 41.99486327647021,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.461923599243164,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25095,
    "batch_size": 73,
    "total_loss": 0.691480278968811,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:05.516499"
 }
 Per-token loss scaled by world size: 0.00044983928091824055
 Per-token loss scaled by world size: 0.00034963246434926987
 Per-token loss scaled by world size: 0.0002892126503866166
 Per-token loss scaled by world size: 0.0003644507669378072
 Per-token loss scaled by world size: 0.0004460133204702288
 Per-token loss scaled by world size: 3.170213403791422e-06
 Per-token loss scaled by world size: 2.0736099486384774e-06
 Epoch: 0, Step: 120, Rank: 5, loss = 0.8975055813789368
 Epoch: 0, Step: 120, Rank: 6, loss = 1.0983635187149048
 Epoch: 0, Step: 120, Rank: 0, loss = 0.007807046640664339
 Epoch: 0, Step: 120, Rank: 4, loss = 0.861013650894165
 Epoch: 0, Step: 120, Rank: 2, loss = 0.7122223377227783
 Epoch: 0, Step: 120, Rank: 3, loss = 1.1077854633331299
 Epoch: 0, Step: 120, Rank: 1, loss = 0.005106523633003235
 Per-token loss scaled by world size: 0.0004087797424290329
 Epoch: 0, Step: 120, Rank: 7, loss = 1.0066711902618408
 Epoch 0:  99%|█████████▉| 120/121 [05:06<00:02,  2.48s/it]{
    "epoch": 0,
    "step": 120,
    "rank": 0,
    "loss": 0.007807046640664339,
    "overall_throughput": 44.47716496365875,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.361114025115967,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19701,
    "batch_size": 75,
    "total_loss": 0.7120593786239624,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:07.898377"
 }
 Per-token loss scaled by world size: 0.00019174267072230577
 Per-token loss scaled by world size: 0.0003345832519698888
 Per-token loss scaled by world size: 0.0003169576812069863
 Per-token loss scaled by world size: 0.0004366403736639768
 Per-token loss scaled by world size: 0.00023642051382921636
 Per-token loss scaled by world size: 1.6517016774741933e-05
 Epoch: 0, Step: 121, Rank: 5, loss = 0.9575772881507874
 Epoch: 0, Step: 121, Rank: 1, loss = 0.5487675070762634
 Epoch: 0, Step: 121, Rank: 4, loss = 1.2496647834777832
 Epoch: 0, Step: 121, Rank: 3, loss = 0.9071329236030579
 Epoch: 0, Step: 121, Rank: 7, loss = 0.6766355037689209
 Per-token loss scaled by world size: 0.00013164509437046945
 Epoch: 0, Step: 121, Rank: 0, loss = 0.04727170243859291
 Epoch: 0, Step: 121, Rank: 2, loss = 0.3767682611942291
 Per-token loss scaled by world size: 0.0003866745682898909
 Epoch: 0, Step: 121, Rank: 6, loss = 1.106662631034851
 Epoch 0: 100%|██████████| 121/121 [05:08<00:00,  2.50s/it]{
    "epoch": 0,
    "step": 121,
    "rank": 0,
    "loss": 0.04727170243859291,
    "overall_throughput": 41.60628512913175,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.30147409439087,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22896,
    "batch_size": 102,
    "total_loss": 0.7338100075721741,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:10.441395"
 }
 Saving model in huggingface format at samples_seen: 12688
 Model saved in /var/mnt/inststg1/instructlab/phasedbasedir/phase2/checkpoints/hf_format/samples_12688
 [20:53:29] INFO     saving took 18.56430721282959 seconds           utils.py:611
 Epoch 0: 100%|██████████| 121/121 [05:27<00:00,  2.70s/it]
 total tokens: 6858 num samples: 3 num padding tokens: 1697 - rank: 1 max len: 2286 min len: 1282 avg len: 1720.3333333333333 num_loss_counted_tokens: 1935
 total tokens: 7868 num samples: 7 num padding tokens: 1154 - rank: 3 max len: 1124 min len: 848 avg len: 959.1428571428571 num_loss_counted_tokens: 3674
 total tokens: 6932 num samples: 4 num padding tokens: 207 - rank: 1 max len: 1733 min len: 1622 avg len: 1681.25 num_loss_counted_tokens: 1829
 total tokens: 7551 num samples: 3 num padding tokens: 1038 - rank: 1 max len: 2517 min len: 1773 avg len: 2171.0 num_loss_counted_tokens: 344
 total tokens: 7515 num samples: 3 num padding tokens: 578 - rank: 1 max len: 2505 min len: 2066 avg len: 2312.3333333333335 num_loss_counted_tokens: 990
 total tokens: 7928 num samples: 4 num padding tokens: 1119 - rank: 1 max len: 1982 min len: 1503 avg len: 1702.25 num_loss_counted_tokens: 604
 total tokens: 6279 num samples: 3 num padding tokens: 554 - rank: 1 max len: 2093 min len: 1575 avg len: 1908.3333333333333 num_loss_counted_tokens: 1272
 total tokens: 6693 num samples: 3 num padding tokens: 401 - rank: 1 max len: 2231 min len: 1907 avg len: 2097.3333333333335 num_loss_counted_tokens: 402
 total tokens: 6963 num samples: 3 num padding tokens: 692 - rank: 1 max len: 2321 min len: 1787 avg len: 2090.3333333333335 num_loss_counted_tokens: 386
 total tokens: 6988 num samples: 4 num padding tokens: 1322 - rank: 1 max len: 1747 min len: 1233 avg len: 1416.5 num_loss_counted_tokens: 2279
 total tokens: 6990 num samples: 5 num padding tokens: 1059 - rank: 2 max len: 1398 min len: 1070 avg len: 1186.2 num_loss_counted_tokens: 4472
 total tokens: 7260 num samples: 6 num padding tokens: 870 - rank: 2 max len: 1210 min len: 983 avg len: 1065.0 num_loss_counted_tokens: 3195
 total tokens: 7796 num samples: 4 num padding tokens: 966 - rank: 1 max len: 1949 min len: 1423 avg len: 1707.5 num_loss_counted_tokens: 4297
 total tokens: 7130 num samples: 5 num padding tokens: 914 - rank: 2 max len: 1426 min len: 1115 avg len: 1243.2 num_loss_counted_tokens: 4551
 total tokens: 6874 num samples: 2 num padding tokens: 795 - rank: 1 max len: 3437 min len: 2642 avg len: 3039.5 num_loss_counted_tokens: 158
 total tokens: 8064 num samples: 12 num padding tokens: 1509 - rank: 5 max len: 672 min len: 468 avg len: 546.25 num_loss_counted_tokens: 3627
 total tokens: 7714 num samples: 7 num padding tokens: 1215 - rank: 3 max len: 1102 min len: 813 avg len: 928.4285714285714 num_loss_counted_tokens: 4827
 total tokens: 7722 num samples: 13 num padding tokens: 1136 - rank: 5 max len: 594 min len: 399 avg len: 506.61538461538464 num_loss_counted_tokens: 4746
 total tokens: 7575 num samples: 5 num padding tokens: 945 - rank: 2 max len: 1515 min len: 1189 avg len: 1326.0 num_loss_counted_tokens: 2489
 total tokens: 7609 num samples: 7 num padding tokens: 1227 - rank: 3 max len: 1087 min len: 823 avg len: 911.7142857142857 num_loss_counted_tokens: 1419
 total tokens: 7942 num samples: 11 num padding tokens: 1142 - rank: 5 max len: 722 min len: 517 avg len: 618.1818181818181 num_loss_counted_tokens: 4497
 total tokens: 8108 num samples: 4 num padding tokens: 1070 - rank: 2 max len: 2027 min len: 1386 avg len: 1759.5 num_loss_counted_tokens: 2087
 total tokens: 6940 num samples: 5 num padding tokens: 950 - rank: 2 max len: 1388 min len: 1065 avg len: 1198.0 num_loss_counted_tokens: 3642
 total tokens: 7680 num samples: 5 num padding tokens: 844 - rank: 2 max len: 1536 min len: 1208 avg len: 1367.2 num_loss_counted_tokens: 3999
 total tokens: 7700 num samples: 14 num padding tokens: 1915 - rank: 5 max len: 550 min len: 316 avg len: 413.2142857142857 num_loss_counted_tokens: 3693
 total tokens: 7966 num samples: 7 num padding tokens: 799 - rank: 3 max len: 1138 min len: 946 avg len: 1023.8571428571429 num_loss_counted_tokens: 5133
 total tokens: 7566 num samples: 13 num padding tokens: 1066 - rank: 5 max len: 582 min len: 447 avg len: 500.0 num_loss_counted_tokens: 4009
 total tokens: 7344 num samples: 4 num padding tokens: 439 - rank: 2 max len: 1836 min len: 1616 avg len: 1726.25 num_loss_counted_tokens: 888
 total tokens: 7696 num samples: 8 num padding tokens: 1050 - rank: 3 max len: 962 min len: 715 avg len: 830.75 num_loss_counted_tokens: 5818
 total tokens: 8076 num samples: 12 num padding tokens: 901 - rank: 5 max len: 673 min len: 477 avg len: 597.9166666666666 num_loss_counted_tokens: 3764
 total tokens: 7953 num samples: 11 num padding tokens: 496 - rank: 5 max len: 723 min len: 589 avg len: 677.9090909090909 num_loss_counted_tokens: 4581
 total tokens: 7320 num samples: 4 num padding tokens: 796 - rank: 1 max len: 1830 min len: 1451 avg len: 1631.0 num_loss_counted_tokens: 2250
 total tokens: 7616 num samples: 8 num padding tokens: 348 - rank: 3 max len: 952 min len: 870 avg len: 908.5 num_loss_counted_tokens: 5221
 total tokens: 7987 num samples: 7 num padding tokens: 817 - rank: 3 max len: 1141 min len: 951 avg len: 1024.2857142857142 num_loss_counted_tokens: 3832
 total tokens: 7826 num samples: 7 num padding tokens: 953 - rank: 3 max len: 1118 min len: 839 avg len: 981.8571428571429 num_loss_counted_tokens: 4081
 total tokens: 7764 num samples: 12 num padding tokens: 1322 - rank: 5 max len: 647 min len: 464 avg len: 536.8333333333334 num_loss_counted_tokens: 3693
 total tokens: 7224 num samples: 8 num padding tokens: 547 - rank: 3 max len: 903 min len: 776 avg len: 834.625 num_loss_counted_tokens: 3728
 total tokens: 7896 num samples: 3 num padding tokens: 906 - rank: 1 max len: 2632 min len: 1930 avg len: 2330.0 num_loss_counted_tokens: 1083
 total tokens: 7769 num samples: 17 num padding tokens: 1525 - rank: 6 max len: 457 min len: 273 avg len: 367.29411764705884 num_loss_counted_tokens: 3637
 total tokens: 7816 num samples: 4 num padding tokens: 1243 - rank: 1 max len: 1954 min len: 1404 avg len: 1643.25 num_loss_counted_tokens: 2194
 total tokens: 7806 num samples: 6 num padding tokens: 497 - rank: 2 max len: 1301 min len: 1097 avg len: 1218.1666666666667 num_loss_counted_tokens: 3744
 total tokens: 6950 num samples: 5 num padding tokens: 1216 - rank: 3 max len: 1390 min len: 974 avg len: 1146.8 num_loss_counted_tokens: 3368
 total tokens: 7455 num samples: 3 num padding tokens: 849 - rank: 2 max len: 2485 min len: 1841 avg len: 2202.0 num_loss_counted_tokens: 231
 total tokens: 7536 num samples: 6 num padding tokens: 917 - rank: 2 max len: 1256 min len: 968 avg len: 1103.1666666666667 num_loss_counted_tokens: 3262
 total tokens: 7806 num samples: 6 num padding tokens: 1329 - rank: 2 max len: 1301 min len: 978 avg len: 1079.5 num_loss_counted_tokens: 3537
 total tokens: 7709 num samples: 13 num padding tokens: 1295 - rank: 5 max len: 593 min len: 373 avg len: 493.38461538461536 num_loss_counted_tokens: 4385
 total tokens: 7220 num samples: 5 num padding tokens: 746 - rank: 1 max len: 1444 min len: 1177 avg len: 1294.8 num_loss_counted_tokens: 3357
 total tokens: 6820 num samples: 4 num padding tokens: 1422 - rank: 3 max len: 1705 min len: 1072 avg len: 1349.5 num_loss_counted_tokens: 1716
 total tokens: 8019 num samples: 9 num padding tokens: 450 - rank: 3 max len: 891 min len: 790 avg len: 841.0 num_loss_counted_tokens: 4323
 total tokens: 7668 num samples: 4 num padding tokens: 969 - rank: 2 max len: 1917 min len: 1447 avg len: 1674.75 num_loss_counted_tokens: 1838
 total tokens: 7693 num samples: 7 num padding tokens: 403 - rank: 2 max len: 1099 min len: 978 avg len: 1041.4285714285713 num_loss_counted_tokens: 4254
 total tokens: 7525 num samples: 7 num padding tokens: 708 - rank: 3 max len: 1075 min len: 867 avg len: 973.8571428571429 num_loss_counted_tokens: 4837
 total tokens: 7740 num samples: 5 num padding tokens: 1159 - rank: 1 max len: 1548 min len: 1219 avg len: 1316.2 num_loss_counted_tokens: 2345
 total tokens: 7843 num samples: 11 num padding tokens: 990 - rank: 5 max len: 713 min len: 541 avg len: 623.0 num_loss_counted_tokens: 5313
 total tokens: 5614 num samples: 2 num padding tokens: 423 - rank: 0 max len: 2807 min len: 2384 avg len: 2595.5 num_loss_counted_tokens: 193
 total tokens: 8037 num samples: 3 num padding tokens: 1697 - rank: 1 max len: 2679 min len: 1665 avg len: 2113.3333333333335 num_loss_counted_tokens: 1007
 total tokens: 8022 num samples: 14 num padding tokens: 1252 - rank: 5 max len: 573 min len: 409 avg len: 483.57142857142856 num_loss_counted_tokens: 4146
 total tokens: 7945 num samples: 7 num padding tokens: 727 - rank: 2 max len: 1135 min len: 918 avg len: 1031.142857142857 num_loss_counted_tokens: 4527
 total tokens: 7891 num samples: 13 num padding tokens: 1039 - rank: 5 max len: 607 min len: 439 avg len: 527.0769230769231 num_loss_counted_tokens: 4697
 total tokens: 7472 num samples: 2 num padding tokens: 1064 - rank: 0 max len: 3736 min len: 2672 avg len: 3204.0 num_loss_counted_tokens: 186
 total tokens: 6628 num samples: 2 num padding tokens: 541 - rank: 0 max len: 3314 min len: 2773 avg len: 3043.5 num_loss_counted_tokens: 178
 total tokens: 7560 num samples: 10 num padding tokens: 554 - rank: 5 max len: 756 min len: 656 avg len: 700.6 num_loss_counted_tokens: 4201
 total tokens: 8021 num samples: 13 num padding tokens: 974 - rank: 5 max len: 617 min len: 468 avg len: 542.0769230769231 num_loss_counted_tokens: 4392
 total tokens: 4062 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4062 min len: 4062 avg len: 4062.0 num_loss_counted_tokens: 85
 total tokens: 7650 num samples: 9 num padding tokens: 478 - rank: 3 max len: 850 min len: 751 avg len: 796.8888888888889 num_loss_counted_tokens: 4488
 total tokens: 8041 num samples: 11 num padding tokens: 905 - rank: 5 max len: 731 min len: 567 avg len: 648.7272727272727 num_loss_counted_tokens: 4931
 total tokens: 6344 num samples: 2 num padding tokens: 190 - rank: 0 max len: 3172 min len: 2982 avg len: 3077.0 num_loss_counted_tokens: 709
 total tokens: 6178 num samples: 2 num padding tokens: 302 - rank: 0 max len: 3089 min len: 2787 avg len: 2938.0 num_loss_counted_tokens: 161
 total tokens: 7370 num samples: 5 num padding tokens: 663 - rank: 2 max len: 1474 min len: 1191 avg len: 1341.4 num_loss_counted_tokens: 6093
 total tokens: 7680 num samples: 8 num padding tokens: 596 - rank: 3 max len: 960 min len: 833 avg len: 885.5 num_loss_counted_tokens: 5258
 total tokens: 7776 num samples: 12 num padding tokens: 1190 - rank: 5 max len: 648 min len: 476 avg len: 548.8333333333334 num_loss_counted_tokens: 4105
 total tokens: 7803 num samples: 17 num padding tokens: 1519 - rank: 6 max len: 459 min len: 281 avg len: 369.6470588235294 num_loss_counted_tokens: 4044
 total tokens: 7980 num samples: 19 num padding tokens: 2139 - rank: 6 max len: 420 min len: 236 avg len: 307.42105263157896 num_loss_counted_tokens: 3018
 total tokens: 8016 num samples: 16 num padding tokens: 1816 - rank: 6 max len: 501 min len: 300 avg len: 387.5 num_loss_counted_tokens: 3555
 total tokens: 8064 num samples: 18 num padding tokens: 1914 - rank: 6 max len: 448 min len: 254 avg len: 341.6666666666667 num_loss_counted_tokens: 3451
 total tokens: 6752 num samples: 2 num padding tokens: 43 - rank: 0 max len: 3376 min len: 3333 avg len: 3354.5 num_loss_counted_tokens: 441
 total tokens: 8010 num samples: 15 num padding tokens: 2334 - rank: 6 max len: 534 min len: 273 avg len: 378.4 num_loss_counted_tokens: 3218
 total tokens: 7800 num samples: 20 num padding tokens: 1809 - rank: 6 max len: 390 min len: 229 avg len: 299.55 num_loss_counted_tokens: 3534
 total tokens: 8112 num samples: 26 num padding tokens: 2416 - rank: 6 max len: 312 min len: 150 avg len: 219.07692307692307 num_loss_counted_tokens: 2759
 total tokens: 7189 num samples: 7 num padding tokens: 616 - rank: 3 max len: 1027 min len: 850 avg len: 939.0 num_loss_counted_tokens: 4467
 total tokens: 5632 num samples: 2 num padding tokens: 674 - rank: 0 max len: 2816 min len: 2142 avg len: 2479.0 num_loss_counted_tokens: 167
 total tokens: 7192 num samples: 2 num padding tokens: 1145 - rank: 0 max len: 3596 min len: 2451 avg len: 3023.5 num_loss_counted_tokens: 187
 total tokens: 5518 num samples: 2 num padding tokens: 53 - rank: 0 max len: 2759 min len: 2706 avg len: 2732.5 num_loss_counted_tokens: 276
 total tokens: 6324 num samples: 2 num padding tokens: 282 - rank: 0 max len: 3162 min len: 2880 avg len: 3021.0 num_loss_counted_tokens: 179
 total tokens: 8109 num samples: 17 num padding tokens: 1877 - rank: 6 max len: 477 min len: 301 avg len: 366.5882352941176 num_loss_counted_tokens: 3402
 total tokens: 7284 num samples: 3 num padding tokens: 841 - rank: 0 max len: 2428 min len: 1996 avg len: 2147.6666666666665 num_loss_counted_tokens: 2016
 total tokens: 6622 num samples: 2 num padding tokens: 51 - rank: 0 max len: 3311 min len: 3260 avg len: 3285.5 num_loss_counted_tokens: 220
 total tokens: 7606 num samples: 2 num padding tokens: 284 - rank: 0 max len: 3803 min len: 3519 avg len: 3661.0 num_loss_counted_tokens: 239
 total tokens: 8021 num samples: 13 num padding tokens: 1426 - rank: 5 max len: 617 min len: 421 avg len: 507.3076923076923 num_loss_counted_tokens: 4445
 total tokens: 7548 num samples: 12 num padding tokens: 1418 - rank: 6 max len: 629 min len: 356 avg len: 510.8333333333333 num_loss_counted_tokens: 4627
 total tokens: 7623 num samples: 9 num padding tokens: 977 - rank: 4 max len: 847 min len: 676 avg len: 738.4444444444445 num_loss_counted_tokens: 4362
 total tokens: 920 num samples: 8 num padding tokens: 143 - rank: 7 max len: 115 min len: 80 avg len: 97.125 num_loss_counted_tokens: 174
 total tokens: 7865 num samples: 11 num padding tokens: 678 - rank: 4 max len: 715 min len: 597 avg len: 653.3636363636364 num_loss_counted_tokens: 4917
 total tokens: 7890 num samples: 30 num padding tokens: 2636 - rank: 7 max len: 263 min len: 81 avg len: 175.13333333333333 num_loss_counted_tokens: 2257
 total tokens: 8060 num samples: 20 num padding tokens: 1821 - rank: 6 max len: 403 min len: 229 avg len: 311.95 num_loss_counted_tokens: 3851
 total tokens: 7644 num samples: 14 num padding tokens: 1953 - rank: 6 max len: 546 min len: 313 avg len: 406.5 num_loss_counted_tokens: 3666
 total tokens: 7568 num samples: 8 num padding tokens: 610 - rank: 3 max len: 946 min len: 809 avg len: 869.75 num_loss_counted_tokens: 4321
 total tokens: 7923 num samples: 19 num padding tokens: 1491 - rank: 6 max len: 417 min len: 251 avg len: 338.5263157894737 num_loss_counted_tokens: 3342
 total tokens: 7767 num samples: 9 num padding tokens: 568 - rank: 4 max len: 863 min len: 734 avg len: 799.8888888888889 num_loss_counted_tokens: 4720
 total tokens: 6913 num samples: 31 num padding tokens: 1814 - rank: 7 max len: 223 min len: 81 avg len: 164.48387096774192 num_loss_counted_tokens: 2029
 total tokens: 8041 num samples: 17 num padding tokens: 2270 - rank: 6 max len: 473 min len: 266 avg len: 339.47058823529414 num_loss_counted_tokens: 3062
 total tokens: 7710 num samples: 3 num padding tokens: 1745 - rank: 0 max len: 2570 min len: 1545 avg len: 1988.3333333333333 num_loss_counted_tokens: 932
 total tokens: 8109 num samples: 9 num padding tokens: 1178 - rank: 4 max len: 901 min len: 683 avg len: 770.1111111111111 num_loss_counted_tokens: 4382
 total tokens: 7461 num samples: 9 num padding tokens: 1078 - rank: 4 max len: 829 min len: 583 avg len: 709.2222222222222 num_loss_counted_tokens: 4008
 total tokens: 8028 num samples: 18 num padding tokens: 1647 - rank: 6 max len: 446 min len: 269 avg len: 354.5 num_loss_counted_tokens: 3480
 total tokens: 8010 num samples: 10 num padding tokens: 847 - rank: 4 max len: 801 min len: 646 avg len: 716.3 num_loss_counted_tokens: 5137
 total tokens: 7812 num samples: 28 num padding tokens: 3453 - rank: 7 max len: 279 min len: 77 avg len: 155.67857142857142 num_loss_counted_tokens: 1761
 total tokens: 7791 num samples: 21 num padding tokens: 1808 - rank: 6 max len: 371 min len: 232 avg len: 284.9047619047619 num_loss_counted_tokens: 2895
 total tokens: 6410 num samples: 2 num padding tokens: 76 - rank: 0 max len: 3205 min len: 3129 avg len: 3167.0 num_loss_counted_tokens: 169
 total tokens: 7540 num samples: 26 num padding tokens: 3128 - rank: 7 max len: 290 min len: 81 avg len: 169.69230769230768 num_loss_counted_tokens: 1955
 total tokens: 7980 num samples: 35 num padding tokens: 2701 - rank: 7 max len: 228 min len: 76 avg len: 150.82857142857142 num_loss_counted_tokens: 1955
 total tokens: 7116 num samples: 6 num padding tokens: 825 - rank: 2 max len: 1186 min len: 937 avg len: 1048.5 num_loss_counted_tokens: 3033
 total tokens: 7714 num samples: 19 num padding tokens: 1776 - rank: 6 max len: 406 min len: 218 avg len: 312.5263157894737 num_loss_counted_tokens: 3452
 total tokens: 6944 num samples: 28 num padding tokens: 2075 - rank: 7 max len: 248 min len: 85 avg len: 173.89285714285714 num_loss_counted_tokens: 2194
 total tokens: 6546 num samples: 3 num padding tokens: 491 - rank: 0 max len: 2182 min len: 1721 avg len: 2018.3333333333333 num_loss_counted_tokens: 1774
 total tokens: 8090 num samples: 10 num padding tokens: 775 - rank: 4 max len: 809 min len: 668 avg len: 731.5 num_loss_counted_tokens: 4241
 total tokens: 6475 num samples: 25 num padding tokens: 2566 - rank: 7 max len: 259 min len: 80 avg len: 156.36 num_loss_counted_tokens: 1562
 total tokens: 7461 num samples: 9 num padding tokens: 615 - rank: 4 max len: 829 min len: 730 avg len: 760.6666666666666 num_loss_counted_tokens: 5504
 total tokens: 6972 num samples: 28 num padding tokens: 2290 - rank: 7 max len: 249 min len: 78 avg len: 167.21428571428572 num_loss_counted_tokens: 2098
 total tokens: 7434 num samples: 9 num padding tokens: 997 - rank: 4 max len: 826 min len: 634 avg len: 715.2222222222222 num_loss_counted_tokens: 4684
 total tokens: 7830 num samples: 9 num padding tokens: 687 - rank: 4 max len: 870 min len: 732 avg len: 793.6666666666666 num_loss_counted_tokens: 4598
 total tokens: 7868 num samples: 28 num padding tokens: 3286 - rank: 7 max len: 281 min len: 71 avg len: 163.64285714285714 num_loss_counted_tokens: 1693
 total tokens: 7700 num samples: 10 num padding tokens: 670 - rank: 4 max len: 770 min len: 672 avg len: 703.0 num_loss_counted_tokens: 3844
 total tokens: 8019 num samples: 11 num padding tokens: 675 - rank: 4 max len: 729 min len: 607 avg len: 667.6363636363636 num_loss_counted_tokens: 5829
 total tokens: 7824 num samples: 8 num padding tokens: 1092 - rank: 4 max len: 978 min len: 762 avg len: 841.5 num_loss_counted_tokens: 3700
 total tokens: 7890 num samples: 30 num padding tokens: 2695 - rank: 7 max len: 263 min len: 77 avg len: 173.16666666666666 num_loss_counted_tokens: 2423
 total tokens: 7668 num samples: 9 num padding tokens: 739 - rank: 4 max len: 852 min len: 673 avg len: 769.8888888888889 num_loss_counted_tokens: 3730
 total tokens: 7514 num samples: 26 num padding tokens: 3137 - rank: 7 max len: 289 min len: 81 avg len: 168.34615384615384 num_loss_counted_tokens: 1545
 total tokens: 7336 num samples: 28 num padding tokens: 3103 - rank: 7 max len: 262 min len: 76 avg len: 151.17857142857142 num_loss_counted_tokens: 1734
 total tokens: 7774 num samples: 23 num padding tokens: 3246 - rank: 7 max len: 338 min len: 79 avg len: 196.8695652173913 num_loss_counted_tokens: 2197
 total tokens: 8050 num samples: 35 num padding tokens: 2271 - rank: 7 max len: 230 min len: 71 avg len: 165.11428571428573 num_loss_counted_tokens: 2692
 total tokens: 7960 num samples: 10 num padding tokens: 1262 - rank: 4 max len: 796 min len: 576 avg len: 669.8 num_loss_counted_tokens: 5376
 total tokens: 7576 num samples: 8 num padding tokens: 1064 - rank: 4 max len: 947 min len: 736 avg len: 814.0 num_loss_counted_tokens: 2965
 total tokens: 4393 num samples: 23 num padding tokens: 1496 - rank: 7 max len: 191 min len: 75 avg len: 125.95652173913044 num_loss_counted_tokens: 1010
 total tokens: 7945 num samples: 35 num padding tokens: 2723 - rank: 7 max len: 227 min len: 78 avg len: 149.2 num_loss_counted_tokens: 2193
 total tokens: 7760 num samples: 10 num padding tokens: 746 - rank: 4 max len: 776 min len: 627 avg len: 701.4 num_loss_counted_tokens: 4720
 Per-token loss scaled by world size: 0.0004431476700119674
 Per-token loss scaled by world size: 0.0004245223826728761
 Per-token loss scaled by world size: 0.0004812271217815578
 Per-token loss scaled by world size: 0.0004481318756006658
 Per-token loss scaled by world size: 5.5284708651015535e-06
 Per-token loss scaled by world size: 0.0003835844690911472
 Epoch: 1, Step: 122, Rank: 0, loss = 0.014387845993041992
 Per-token loss scaled by world size: 3.5970988392364234e-05
 Epoch: 1, Step: 122, Rank: 5, loss = 1.1048195362091064
 Epoch: 1, Step: 122, Rank: 6, loss = 1.2523936033248901
 Epoch: 1, Step: 122, Rank: 3, loss = 0.9982785582542419
 Epoch: 1, Step: 122, Rank: 4, loss = 1.1532918214797974
 Epoch: 1, Step: 122, Rank: 7, loss = 1.166263222694397
 Epoch: 1, Step: 122, Rank: 1, loss = 0.09361449629068375
 Per-token loss scaled by world size: 0.00016121947555802763
 Epoch: 1, Step: 122, Rank: 2, loss = 0.41957369446754456
 total tokens: 7389 num samples: 9 num padding tokens: 463 - rank: 4 max len: 821 min len: 731 avg len: 769.5555555555555 num_loss_counted_tokens: 4484
 total tokens: 7920 num samples: 4 num padding tokens: 1041 - rank: 1 max len: 1980 min len: 1501 avg len: 1719.75 num_loss_counted_tokens: 3030
 total tokens: 7557 num samples: 11 num padding tokens: 1175 - rank: 5 max len: 687 min len: 471 avg len: 580.1818181818181 num_loss_counted_tokens: 3537
 {
    "epoch": 1,
    "step": 122,
    "rank": 0,
    "loss": 0.014387845993041992,
    "overall_throughput": 41.10590635951548,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.46029806137085,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20820,
    "batch_size": 84,
    "total_loss": 0.7753278613090515,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:32.058777"
 }
 total tokens: 6208 num samples: 2 num padding tokens: 1070 - rank: 0 max len: 3104 min len: 2034 avg len: 2569.0 num_loss_counted_tokens: 154
 total tokens: 7758 num samples: 6 num padding tokens: 429 - rank: 2 max len: 1293 min len: 1127 avg len: 1221.5 num_loss_counted_tokens: 4122
 total tokens: 7641 num samples: 27 num padding tokens: 2718 - rank: 7 max len: 283 min len: 90 avg len: 182.33333333333334 num_loss_counted_tokens: 2178
 total tokens: 7786 num samples: 17 num padding tokens: 1303 - rank: 6 max len: 458 min len: 306 avg len: 381.3529411764706 num_loss_counted_tokens: 3550
 total tokens: 7714 num samples: 7 num padding tokens: 1069 - rank: 3 max len: 1102 min len: 863 avg len: 949.2857142857143 num_loss_counted_tokens: 3857
 Per-token loss scaled by world size: 0.00028338973061181605
 Per-token loss scaled by world size: 0.00028131416183896363
 Per-token loss scaled by world size: 0.00031643020338378847
 Per-token loss scaled by world size: 2.5795485271373764e-05
 Per-token loss scaled by world size: 0.0003353776701260358
 Per-token loss scaled by world size: 3.061828238060116e-06
 Epoch: 1, Step: 123, Rank: 2, loss = 0.9252774715423584
 Epoch: 1, Step: 123, Rank: 6, loss = 0.9321042895317078
 Epoch: 1, Step: 123, Rank: 4, loss = 1.0407785177230835
 Per-token loss scaled by world size: 0.00020596390822902322
 Epoch: 1, Step: 123, Rank: 1, loss = 0.08484457433223724
 Epoch: 1, Step: 123, Rank: 0, loss = 0.010070735588669777
 Epoch: 1, Step: 123, Rank: 5, loss = 1.1030991077423096
 Per-token loss scaled by world size: 0.00040186592377722263
 Epoch: 1, Step: 123, Rank: 3, loss = 1.3217872381210327
 Epoch: 1, Step: 123, Rank: 7, loss = 0.6774410605430603
 total tokens: 7158 num samples: 3 num padding tokens: 1822 - rank: 1 max len: 2386 min len: 1371 avg len: 1778.6666666666667 num_loss_counted_tokens: 859
 total tokens: 8090 num samples: 10 num padding tokens: 658 - rank: 4 max len: 809 min len: 681 avg len: 743.2 num_loss_counted_tokens: 4692
 {
    "epoch": 1,
    "step": 123,
    "rank": 0,
    "loss": 0.010070735588669777,
    "overall_throughput": 41.52344953508732,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.238781452178955,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26313,
    "batch_size": 94,
    "total_loss": 0.7619253396987915,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:34.606466"
 }
 total tokens: 7752 num samples: 8 num padding tokens: 561 - rank: 3 max len: 969 min len: 810 avg len: 898.875 num_loss_counted_tokens: 4165
 total tokens: 8109 num samples: 17 num padding tokens: 1626 - rank: 6 max len: 477 min len: 291 avg len: 381.3529411764706 num_loss_counted_tokens: 4491
 total tokens: 7248 num samples: 2 num padding tokens: 588 - rank: 0 max len: 3624 min len: 3036 avg len: 3330.0 num_loss_counted_tokens: 185
 total tokens: 7836 num samples: 6 num padding tokens: 921 - rank: 2 max len: 1306 min len: 997 avg len: 1152.5 num_loss_counted_tokens: 3095
 total tokens: 7480 num samples: 11 num padding tokens: 811 - rank: 5 max len: 680 min len: 536 avg len: 606.2727272727273 num_loss_counted_tokens: 4943
 total tokens: 7614 num samples: 27 num padding tokens: 2676 - rank: 7 max len: 282 min len: 88 avg len: 182.88888888888889 num_loss_counted_tokens: 2334
 Per-token loss scaled by world size: 0.00031184396357275546
 Per-token loss scaled by world size: 0.0001448883704142645
 Per-token loss scaled by world size: 0.0002452973276376724
 Per-token loss scaled by world size: 0.00019676386727951467
 Per-token loss scaled by world size: 0.00030535017140209675
 Per-token loss scaled by world size: 1.8834512047760654e-06
 Per-token loss scaled by world size: 0.00019069209520239383
 Epoch: 1, Step: 124, Rank: 6, loss = 0.9844914078712463
 Epoch: 1, Step: 124, Rank: 1, loss = 0.45741257071495056
 Epoch: 1, Step: 124, Rank: 3, loss = 0.9639905095100403
 Epoch: 1, Step: 124, Rank: 4, loss = 0.7744036912918091
 Epoch: 1, Step: 124, Rank: 7, loss = 0.6211835145950317
 Epoch: 1, Step: 124, Rank: 0, loss = 0.005946055520325899
 Epoch: 1, Step: 124, Rank: 2, loss = 0.60201495885849
 Per-token loss scaled by world size: 0.00031234745983965695
 Epoch: 1, Step: 124, Rank: 5, loss = 0.9860809445381165
 total tokens: 7335 num samples: 3 num padding tokens: 518 - rank: 1 max len: 2445 min len: 1952 avg len: 2272.3333333333335 num_loss_counted_tokens: 357
 total tokens: 7632 num samples: 8 num padding tokens: 1110 - rank: 4 max len: 954 min len: 772 avg len: 815.25 num_loss_counted_tokens: 4352
 total tokens: 7540 num samples: 10 num padding tokens: 877 - rank: 5 max len: 754 min len: 579 avg len: 666.3 num_loss_counted_tokens: 4493
 {
    "epoch": 1,
    "step": 124,
    "rank": 0,
    "loss": 0.005946055520325899,
    "overall_throughput": 42.28005677176176,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.3601393699646,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25256,
    "batch_size": 81,
    "total_loss": 0.6744404435157776,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:37.110671"
 }
 total tokens: 7146 num samples: 6 num padding tokens: 707 - rank: 3 max len: 1191 min len: 954 avg len: 1073.1666666666667 num_loss_counted_tokens: 3282
 total tokens: 6650 num samples: 2 num padding tokens: 396 - rank: 0 max len: 3325 min len: 2929 avg len: 3127.0 num_loss_counted_tokens: 164
 total tokens: 7188 num samples: 4 num padding tokens: 897 - rank: 2 max len: 1797 min len: 1207 avg len: 1572.75 num_loss_counted_tokens: 3275
 total tokens: 7749 num samples: 27 num padding tokens: 3523 - rank: 7 max len: 287 min len: 78 avg len: 156.5185185185185 num_loss_counted_tokens: 1794
 total tokens: 7924 num samples: 14 num padding tokens: 2249 - rank: 6 max len: 566 min len: 291 avg len: 405.35714285714283 num_loss_counted_tokens: 3760
 Per-token loss scaled by world size: 0.0005536731332540512
 Per-token loss scaled by world size: 0.0003451017546467483
 Per-token loss scaled by world size: 8.309840632136911e-05
 Per-token loss scaled by world size: 0.00047730450751259923
 Per-token loss scaled by world size: 0.0006338073872029781
 Per-token loss scaled by world size: 3.222640589228831e-05
 Per-token loss scaled by world size: 5.77377568333759e-06
 Epoch: 1, Step: 125, Rank: 3, loss = 0.21387451887130737
 Epoch: 1, Step: 125, Rank: 7, loss = 0.8882056474685669
 Epoch: 1, Step: 125, Rank: 4, loss = 1.6312617063522339
 Epoch: 1, Step: 125, Rank: 2, loss = 1.425016164779663
 Epoch: 1, Step: 125, Rank: 1, loss = 0.014860255643725395
 Epoch: 1, Step: 125, Rank: 5, loss = 1.2284624576568604
 Epoch: 1, Step: 125, Rank: 0, loss = 0.08294271677732468
 Per-token loss scaled by world size: 0.0003464070614427328
 Epoch: 1, Step: 125, Rank: 6, loss = 0.8915651440620422
 total tokens: 7860 num samples: 4 num padding tokens: 1157 - rank: 1 max len: 1965 min len: 1442 avg len: 1675.75 num_loss_counted_tokens: 559
 total tokens: 7632 num samples: 9 num padding tokens: 954 - rank: 4 max len: 848 min len: 605 avg len: 742.0 num_loss_counted_tokens: 4296
 {
    "epoch": 1,
    "step": 125,
    "rank": 0,
    "loss": 0.08294271677732468,
    "overall_throughput": 42.30731411051347,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.3255033493042,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20590,
    "batch_size": 94,
    "total_loss": 0.7970236539840698,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:39.613939"
 }
 total tokens: 7974 num samples: 18 num padding tokens: 1582 - rank: 6 max len: 443 min len: 282 avg len: 355.1111111111111 num_loss_counted_tokens: 3319
 total tokens: 8033 num samples: 29 num padding tokens: 2797 - rank: 7 max len: 277 min len: 75 avg len: 180.55172413793105 num_loss_counted_tokens: 2279
 total tokens: 7904 num samples: 8 num padding tokens: 432 - rank: 3 max len: 988 min len: 857 avg len: 934.0 num_loss_counted_tokens: 3900
 total tokens: 6880 num samples: 5 num padding tokens: 746 - rank: 2 max len: 1376 min len: 1087 avg len: 1226.8 num_loss_counted_tokens: 1611
 total tokens: 7629 num samples: 3 num padding tokens: 752 - rank: 0 max len: 2543 min len: 1988 avg len: 2292.3333333333335 num_loss_counted_tokens: 470
 total tokens: 7813 num samples: 13 num padding tokens: 1168 - rank: 5 max len: 601 min len: 444 avg len: 511.15384615384613 num_loss_counted_tokens: 3836
 Per-token loss scaled by world size: 0.00016968029376585037
 Per-token loss scaled by world size: 0.00043697707587853074
 Per-token loss scaled by world size: 0.00022829265799373388
 Per-token loss scaled by world size: 0.00044452777365222573
 Per-token loss scaled by world size: 0.00034727680031210184
 Per-token loss scaled by world size: 1.7178894040625892e-06
 Per-token loss scaled by world size: 0.0002958408440463245
 Epoch: 1, Step: 126, Rank: 5, loss = 1.2159979343414307
 Epoch: 1, Step: 126, Rank: 2, loss = 0.6352813839912415
 Epoch: 1, Step: 126, Rank: 6, loss = 1.2370096445083618
 Epoch: 1, Step: 126, Rank: 1, loss = 0.4721778333187103
 Epoch: 1, Step: 126, Rank: 4, loss = 0.9663845300674438
 Epoch: 1, Step: 126, Rank: 0, loss = 0.004780456889420748
 Epoch: 1, Step: 126, Rank: 7, loss = 0.8232511281967163
 Per-token loss scaled by world size: 0.0001971422607311979
 Epoch: 1, Step: 126, Rank: 3, loss = 0.5485976338386536
 total tokens: 5944 num samples: 2 num padding tokens: 412 - rank: 1 max len: 2972 min len: 2560 avg len: 2766.0 num_loss_counted_tokens: 816
 total tokens: 7911 num samples: 9 num padding tokens: 941 - rank: 4 max len: 879 min len: 688 avg len: 774.4444444444445 num_loss_counted_tokens: 5131
 {
    "epoch": 1,
    "step": 126,
    "rank": 0,
    "loss": 0.004780456889420748,
    "overall_throughput": 41.60576526421823,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.30566644668579,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22262,
    "batch_size": 84,
    "total_loss": 0.7379351258277893,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:42.158108"
 }
 total tokens: 8048 num samples: 16 num padding tokens: 2431 - rank: 6 max len: 503 min len: 247 avg len: 351.0625 num_loss_counted_tokens: 3335
 total tokens: 6345 num samples: 27 num padding tokens: 1990 - rank: 7 max len: 235 min len: 78 avg len: 161.2962962962963 num_loss_counted_tokens: 1823
 total tokens: 7467 num samples: 3 num padding tokens: 1752 - rank: 2 max len: 2489 min len: 1428 avg len: 1905.0 num_loss_counted_tokens: 538
 total tokens: 7710 num samples: 6 num padding tokens: 1580 - rank: 3 max len: 1285 min len: 891 avg len: 1021.6666666666666 num_loss_counted_tokens: 3990
 total tokens: 7513 num samples: 11 num padding tokens: 964 - rank: 5 max len: 683 min len: 534 avg len: 595.3636363636364 num_loss_counted_tokens: 4084
 total tokens: 5974 num samples: 2 num padding tokens: 4 - rank: 0 max len: 2987 min len: 2983 avg len: 2985.0 num_loss_counted_tokens: 160
 Per-token loss scaled by world size: 0.00019664443971123546
 Per-token loss scaled by world size: 0.00033893511863425374
 Per-token loss scaled by world size: 0.0002353027812205255
 Per-token loss scaled by world size: 0.00019304313173051924
 Per-token loss scaled by world size: 0.00022410067322198302
 Per-token loss scaled by world size: 0.00028643259429372847
 Per-token loss scaled by world size: 1.9502303985063918e-05
 Epoch: 1, Step: 127, Rank: 5, loss = 1.0451911687850952
 Epoch: 1, Step: 127, Rank: 3, loss = 0.6064022779464722
 Epoch: 1, Step: 127, Rank: 4, loss = 0.5952967405319214
 Epoch: 1, Step: 127, Rank: 1, loss = 0.7256149649620056
 Epoch: 1, Step: 127, Rank: 2, loss = 0.6910704374313354
 Epoch: 1, Step: 127, Rank: 0, loss = 0.060140229761600494
 Epoch: 1, Step: 127, Rank: 7, loss = 0.8832864761352539
 Per-token loss scaled by world size: 0.00036767972051166
 Epoch: 1, Step: 127, Rank: 6, loss = 1.133832335472107
 total tokens: 7504 num samples: 8 num padding tokens: 911 - rank: 4 max len: 938 min len: 741 avg len: 824.125 num_loss_counted_tokens: 5191
 total tokens: 7206 num samples: 3 num padding tokens: 904 - rank: 1 max len: 2402 min len: 1894 avg len: 2100.6666666666665 num_loss_counted_tokens: 932
 total tokens: 7887 num samples: 11 num padding tokens: 1542 - rank: 5 max len: 717 min len: 395 avg len: 576.8181818181819 num_loss_counted_tokens: 4690
 {
    "epoch": 1,
    "step": 127,
    "rank": 0,
    "loss": 0.060140229761600494,
    "overall_throughput": 42.34427644874209,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.374258518218994,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24670,
    "batch_size": 85,
    "total_loss": 0.7176043391227722,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:44.674459"
 }
 total tokens: 7096 num samples: 4 num padding tokens: 561 - rank: 2 max len: 1774 min len: 1529 avg len: 1633.75 num_loss_counted_tokens: 1104
 total tokens: 7780 num samples: 20 num padding tokens: 1756 - rank: 6 max len: 389 min len: 236 avg len: 301.2 num_loss_counted_tokens: 2929
 total tokens: 6770 num samples: 5 num padding tokens: 936 - rank: 3 max len: 1354 min len: 956 avg len: 1166.8 num_loss_counted_tokens: 3115
 total tokens: 6110 num samples: 26 num padding tokens: 1710 - rank: 7 max len: 235 min len: 78 avg len: 169.23076923076923 num_loss_counted_tokens: 1981
 total tokens: 5920 num samples: 2 num padding tokens: 555 - rank: 0 max len: 2960 min len: 2405 avg len: 2682.5 num_loss_counted_tokens: 172
 Per-token loss scaled by world size: 0.0008036900544539094
 Per-token loss scaled by world size: 0.0005513833602890372
 Per-token loss scaled by world size: 0.0007896597380749881
 Per-token loss scaled by world size: 9.589117689756677e-05
 Per-token loss scaled by world size: 5.089726073492784e-06
 Per-token loss scaled by world size: 1.1957185051869601e-05
 Epoch: 1, Step: 128, Rank: 6, loss = 1.6164215803146362
 Epoch: 1, Step: 128, Rank: 2, loss = 0.19286112487316132
 Epoch: 1, Step: 128, Rank: 5, loss = 1.5882031917572021
 Epoch: 1, Step: 128, Rank: 4, loss = 1.108969807624817
 Per-token loss scaled by world size: 5.8463097957428545e-05
 Epoch: 1, Step: 128, Rank: 1, loss = 0.010236711241304874
 Epoch: 1, Step: 128, Rank: 0, loss = 0.02404888905584812
 Per-token loss scaled by world size: 0.00048682422493584454
 Epoch: 1, Step: 128, Rank: 3, loss = 0.9791252017021179
 Epoch: 1, Step: 128, Rank: 7, loss = 0.11758390814065933
 total tokens: 6630 num samples: 26 num padding tokens: 2393 - rank: 7 max len: 255 min len: 78 avg len: 162.96153846153845 num_loss_counted_tokens: 1758
 total tokens: 7601 num samples: 11 num padding tokens: 831 - rank: 4 max len: 691 min len: 561 avg len: 615.4545454545455 num_loss_counted_tokens: 4864
 total tokens: 7125 num samples: 5 num padding tokens: 1532 - rank: 1 max len: 1425 min len: 985 avg len: 1118.6 num_loss_counted_tokens: 3514
 {
    "epoch": 1,
    "step": 128,
    "rank": 0,
    "loss": 0.02404888905584812,
    "overall_throughput": 42.28494221885827,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.05379819869995,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16090,
    "batch_size": 72,
    "total_loss": 0.7046812772750854,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:47.176000"
 }
 total tokens: 5476 num samples: 2 num padding tokens: 764 - rank: 0 max len: 2738 min len: 1974 avg len: 2356.0 num_loss_counted_tokens: 216
 total tokens: 7714 num samples: 14 num padding tokens: 571 - rank: 5 max len: 551 min len: 429 avg len: 510.2142857142857 num_loss_counted_tokens: 5114
 total tokens: 7731 num samples: 9 num padding tokens: 842 - rank: 3 max len: 859 min len: 715 avg len: 765.4444444444445 num_loss_counted_tokens: 4877
 total tokens: 7923 num samples: 19 num padding tokens: 1796 - rank: 6 max len: 417 min len: 256 avg len: 322.4736842105263 num_loss_counted_tokens: 3232
 total tokens: 7840 num samples: 8 num padding tokens: 324 - rank: 2 max len: 980 min len: 894 avg len: 939.5 num_loss_counted_tokens: 5169
 Per-token loss scaled by world size: 0.0001966664713108912
 Per-token loss scaled by world size: 0.0001095464758691378
 Per-token loss scaled by world size: 0.00030617474112659693
 Per-token loss scaled by world size: 0.0003008927742484957
 Per-token loss scaled by world size: 0.0001922248484333977
 Per-token loss scaled by world size: 6.467673188126355e-07
 Per-token loss scaled by world size: 0.00025096320314332843
 Epoch: 1, Step: 129, Rank: 6, loss = 1.0479342937469482
 Epoch: 1, Step: 129, Rank: 4, loss = 1.066330075263977
 Epoch: 1, Step: 129, Rank: 2, loss = 0.3815229833126068
 Epoch: 1, Step: 129, Rank: 1, loss = 0.6849401593208313
 Epoch: 1, Step: 129, Rank: 0, loss = 0.0022525289095938206
 Epoch: 1, Step: 129, Rank: 7, loss = 0.6694710850715637
 Per-token loss scaled by world size: 0.00032839240157045424
 Epoch: 1, Step: 129, Rank: 5, loss = 1.14370858669281
 Epoch: 1, Step: 129, Rank: 3, loss = 0.8740420937538147
 total tokens: 6094 num samples: 2 num padding tokens: 254 - rank: 1 max len: 3047 min len: 2793 avg len: 2920.0 num_loss_counted_tokens: 165
 total tokens: 7068 num samples: 6 num padding tokens: 383 - rank: 4 max len: 1178 min len: 1029 avg len: 1114.1666666666667 num_loss_counted_tokens: 3758
 {
    "epoch": 1,
    "step": 129,
    "rank": 0,
    "loss": 0.0022525289095938206,
    "overall_throughput": 41.47733724327907,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.227038383483887,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 27862,
    "batch_size": 83,
    "total_loss": 0.7337751984596252,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:49.705725"
 }
 total tokens: 8016 num samples: 16 num padding tokens: 1518 - rank: 6 max len: 501 min len: 265 avg len: 406.125 num_loss_counted_tokens: 3598
 total tokens: 7964 num samples: 4 num padding tokens: 776 - rank: 3 max len: 1991 min len: 1616 avg len: 1797.0 num_loss_counted_tokens: 415
 total tokens: 6264 num samples: 24 num padding tokens: 2144 - rank: 7 max len: 261 min len: 85 avg len: 171.66666666666666 num_loss_counted_tokens: 1744
 total tokens: 7974 num samples: 2 num padding tokens: 703 - rank: 0 max len: 3987 min len: 3284 avg len: 3635.5 num_loss_counted_tokens: 349
 total tokens: 8073 num samples: 9 num padding tokens: 2207 - rank: 5 max len: 897 min len: 515 avg len: 651.7777777777778 num_loss_counted_tokens: 4298
 total tokens: 7056 num samples: 3 num padding tokens: 377 - rank: 2 max len: 2352 min len: 2050 avg len: 2226.3333333333335 num_loss_counted_tokens: 2384
 Per-token loss scaled by world size: 0.0002643604821059853
 Per-token loss scaled by world size: 0.000382772006560117
 Per-token loss scaled by world size: 0.000271481869276613
 Per-token loss scaled by world size: 3.5156226658727974e-06
 Per-token loss scaled by world size: 6.239629328774754e-07
 Per-token loss scaled by world size: 0.00018353613268118352
 Epoch: 1, Step: 130, Rank: 5, loss = 1.2674537897109985
 Epoch: 1, Step: 130, Rank: 6, loss = 0.8989443778991699
 Epoch: 1, Step: 130, Rank: 0, loss = 0.011641105636954308
 Epoch: 1, Step: 130, Rank: 2, loss = 0.8753636479377747
 Per-token loss scaled by world size: 0.0003246103588026017
 Epoch: 1, Step: 130, Rank: 1, loss = 0.0020660972222685814
 Epoch: 1, Step: 130, Rank: 7, loss = 0.6077340245246887
 Per-token loss scaled by world size: 0.00032646721228957176
 Epoch: 1, Step: 130, Rank: 4, loss = 1.0748660564422607
 Epoch: 1, Step: 130, Rank: 3, loss = 1.0810145139694214
 total tokens: 7672 num samples: 7 num padding tokens: 936 - rank: 4 max len: 1096 min len: 867 avg len: 962.2857142857143 num_loss_counted_tokens: 3200
 total tokens: 6610 num samples: 2 num padding tokens: 448 - rank: 1 max len: 3305 min len: 2857 avg len: 3081.0 num_loss_counted_tokens: 155
 total tokens: 7690 num samples: 10 num padding tokens: 781 - rank: 5 max len: 769 min len: 604 avg len: 690.9 num_loss_counted_tokens: 5078
 {
    "epoch": 1,
    "step": 130,
    "rank": 0,
    "loss": 0.011641105636954308,
    "overall_throughput": 41.55595823632429,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.426838874816895,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26490,
    "batch_size": 77,
    "total_loss": 0.7273854613304138,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:52.249145"
 }
 total tokens: 5650 num samples: 2 num padding tokens: 223 - rank: 2 max len: 2825 min len: 2602 avg len: 2713.5 num_loss_counted_tokens: 210
 total tokens: 4070 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4070 min len: 4070 avg len: 4070.0 num_loss_counted_tokens: 1038
 total tokens: 8030 num samples: 5 num padding tokens: 1189 - rank: 3 max len: 1606 min len: 1172 avg len: 1368.2 num_loss_counted_tokens: 2403
 total tokens: 5180 num samples: 20 num padding tokens: 2159 - rank: 7 max len: 259 min len: 76 avg len: 151.05 num_loss_counted_tokens: 1160
 total tokens: 7800 num samples: 13 num padding tokens: 2266 - rank: 6 max len: 600 min len: 271 avg len: 425.6923076923077 num_loss_counted_tokens: 3635
 Per-token loss scaled by world size: 0.0001472200092393905
 Per-token loss scaled by world size: 0.00025629153242334723
 Per-token loss scaled by world size: 0.0002756573085207492
 Per-token loss scaled by world size: 0.00023288748343475163
 Per-token loss scaled by world size: 0.0004771172534674406
 Per-token loss scaled by world size: 1.5136585034269956e-06
 Epoch: 1, Step: 131, Rank: 2, loss = 0.8319223523139954
 Epoch: 1, Step: 131, Rank: 6, loss = 0.894783616065979
 Epoch: 1, Step: 131, Rank: 3, loss = 0.755952775478363
 Epoch: 1, Step: 131, Rank: 4, loss = 1.5487226247787476
 Epoch: 1, Step: 131, Rank: 1, loss = 0.4778761565685272
 Per-token loss scaled by world size: 0.0003685416013468057
 Epoch: 1, Step: 131, Rank: 0, loss = 0.004913335666060448
 Epoch: 1, Step: 131, Rank: 5, loss = 1.1962860822677612
 Per-token loss scaled by world size: 0.0002723717479966581
 Epoch: 1, Step: 131, Rank: 7, loss = 0.8841187357902527
 total tokens: 7064 num samples: 4 num padding tokens: 383 - rank: 1 max len: 1766 min len: 1494 avg len: 1670.25 num_loss_counted_tokens: 2269
 total tokens: 7308 num samples: 9 num padding tokens: 754 - rank: 4 max len: 812 min len: 666 avg len: 728.2222222222222 num_loss_counted_tokens: 4526
 {
    "epoch": 1,
    "step": 131,
    "rank": 0,
    "loss": 0.004913335666060448,
    "overall_throughput": 42.206020029222465,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.24073839187622,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25968,
    "batch_size": 101,
    "total_loss": 0.8243219256401062,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:54.758478"
 }
 total tokens: 8040 num samples: 8 num padding tokens: 747 - rank: 3 max len: 1005 min len: 816 avg len: 911.625 num_loss_counted_tokens: 5982
 total tokens: 7415 num samples: 5 num padding tokens: 1415 - rank: 2 max len: 1483 min len: 1015 avg len: 1200.0 num_loss_counted_tokens: 2673
 total tokens: 8106 num samples: 3 num padding tokens: 591 - rank: 0 max len: 2702 min len: 2308 avg len: 2505.0 num_loss_counted_tokens: 285
 total tokens: 7936 num samples: 32 num padding tokens: 2881 - rank: 7 max len: 248 min len: 70 avg len: 157.96875 num_loss_counted_tokens: 2168
 total tokens: 7824 num samples: 12 num padding tokens: 940 - rank: 5 max len: 652 min len: 506 avg len: 573.6666666666666 num_loss_counted_tokens: 5458
 total tokens: 7664 num samples: 16 num padding tokens: 1917 - rank: 6 max len: 479 min len: 262 avg len: 359.1875 num_loss_counted_tokens: 3606
 Per-token loss scaled by world size: 0.0003236977499909699
 Per-token loss scaled by world size: 0.0009724997216835618
 Per-token loss scaled by world size: 0.00038814375875517726
 Per-token loss scaled by world size: 0.0005891940090805292
 Per-token loss scaled by world size: 0.0001362602924928069
 Per-token loss scaled by world size: 5.815729309688322e-06
 Per-token loss scaled by world size: 8.939716281020083e-06
 Epoch: 1, Step: 132, Rank: 5, loss = 1.257119059562683
 Epoch: 1, Step: 132, Rank: 6, loss = 2.0749497413635254
 Epoch: 1, Step: 132, Rank: 4, loss = 0.6906496286392212
 Epoch: 1, Step: 132, Rank: 7, loss = 0.8281532526016235
 Epoch: 1, Step: 132, Rank: 2, loss = 0.012408585287630558
 Epoch: 1, Step: 132, Rank: 3, loss = 0.290728360414505
 Epoch: 1, Step: 132, Rank: 1, loss = 0.019074002280831337
 Per-token loss scaled by world size: 3.719959931913763e-05
 Epoch: 1, Step: 132, Rank: 0, loss = 0.07936999201774597
 total tokens: 7930 num samples: 10 num padding tokens: 877 - rank: 4 max len: 793 min len: 643 avg len: 705.3 num_loss_counted_tokens: 3435
 total tokens: 7288 num samples: 4 num padding tokens: 780 - rank: 1 max len: 1822 min len: 1515 avg len: 1627.0 num_loss_counted_tokens: 3100
 total tokens: 7105 num samples: 7 num padding tokens: 499 - rank: 3 max len: 1015 min len: 827 avg len: 943.7142857142857 num_loss_counted_tokens: 4654
 {
    "epoch": 1,
    "step": 132,
    "rank": 0,
    "loss": 0.07936999201774597,
    "overall_throughput": 42.231542345095384,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.476311683654785,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17069,
    "batch_size": 64,
    "total_loss": 0.6565565466880798,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:57.301535"
 }
 total tokens: 7560 num samples: 30 num padding tokens: 2281 - rank: 7 max len: 252 min len: 77 avg len: 175.96666666666667 num_loss_counted_tokens: 2486
 total tokens: 7176 num samples: 6 num padding tokens: 666 - rank: 2 max len: 1196 min len: 1024 avg len: 1085.0 num_loss_counted_tokens: 3309
 total tokens: 5546 num samples: 2 num padding tokens: 144 - rank: 0 max len: 2773 min len: 2629 avg len: 2701.0 num_loss_counted_tokens: 194
 total tokens: 8046 num samples: 18 num padding tokens: 1959 - rank: 6 max len: 447 min len: 255 avg len: 338.1666666666667 num_loss_counted_tokens: 3390
 total tokens: 7656 num samples: 12 num padding tokens: 892 - rank: 5 max len: 638 min len: 458 avg len: 563.6666666666666 num_loss_counted_tokens: 4307
 Per-token loss scaled by world size: 0.00027533259708434343
 Per-token loss scaled by world size: 0.00012797772069461644
 Per-token loss scaled by world size: 0.00020723696798086166
 Per-token loss scaled by world size: 0.00024243281222879887
 Per-token loss scaled by world size: 0.00023833484738133848
 Per-token loss scaled by world size: 0.00026018035714514554
 Per-token loss scaled by world size: 0.00010569631558610126
 Epoch: 1, Step: 133, Rank: 2, loss = 0.7923144102096558
 Epoch: 1, Step: 133, Rank: 6, loss = 0.9153087735176086
 Epoch: 1, Step: 133, Rank: 1, loss = 0.42544591426849365
 Epoch: 1, Step: 133, Rank: 4, loss = 0.8059375882148743
 Epoch: 1, Step: 133, Rank: 3, loss = 0.8649370670318604
 Epoch: 1, Step: 133, Rank: 7, loss = 0.6889333724975586
 Epoch: 1, Step: 133, Rank: 0, loss = 0.35137417912483215
 Per-token loss scaled by world size: 0.0003980571636930108
 Epoch: 1, Step: 133, Rank: 5, loss = 1.323291301727295
 total tokens: 7461 num samples: 9 num padding tokens: 647 - rank: 4 max len: 829 min len: 685 avg len: 757.1111111111111 num_loss_counted_tokens: 3551
 total tokens: 7808 num samples: 4 num padding tokens: 946 - rank: 1 max len: 1952 min len: 1419 avg len: 1715.5 num_loss_counted_tokens: 1967
 total tokens: 8010 num samples: 30 num padding tokens: 2819 - rank: 7 max len: 267 min len: 85 avg len: 173.03333333333333 num_loss_counted_tokens: 2333
 {
    "epoch": 1,
    "step": 133,
    "rank": 0,
    "loss": 0.35137417912483215,
    "overall_throughput": 41.395595194588644,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.437856197357178,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26595,
    "batch_size": 92,
    "total_loss": 0.7709429264068604,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:53:59.817004"
 }
 total tokens: 6995 num samples: 5 num padding tokens: 1102 - rank: 2 max len: 1399 min len: 998 avg len: 1178.6 num_loss_counted_tokens: 2712
 total tokens: 7808 num samples: 8 num padding tokens: 442 - rank: 3 max len: 976 min len: 844 avg len: 920.75 num_loss_counted_tokens: 4240
 total tokens: 7686 num samples: 3 num padding tokens: 937 - rank: 0 max len: 2562 min len: 2050 avg len: 2249.6666666666665 num_loss_counted_tokens: 556
 total tokens: 7992 num samples: 18 num padding tokens: 1651 - rank: 6 max len: 444 min len: 270 avg len: 352.27777777777777 num_loss_counted_tokens: 3377
 total tokens: 7656 num samples: 12 num padding tokens: 1025 - rank: 5 max len: 638 min len: 445 avg len: 552.5833333333334 num_loss_counted_tokens: 4330
 Per-token loss scaled by world size: 0.0003683593822643161
 Per-token loss scaled by world size: 0.0001245876046596095
 Per-token loss scaled by world size: 0.0003223164821974933
 Per-token loss scaled by world size: 0.0003114262653980404
 Per-token loss scaled by world size: 0.0002396363124717027
 Per-token loss scaled by world size: 0.0002762637159321457
 Per-token loss scaled by world size: 7.41567782824859e-06
 Epoch: 1, Step: 134, Rank: 1, loss = 0.3786684572696686
 Epoch: 1, Step: 134, Rank: 6, loss = 0.9796406626701355
 Epoch: 1, Step: 134, Rank: 5, loss = 1.1195822954177856
 Epoch: 1, Step: 134, Rank: 7, loss = 0.9465411901473999
 Epoch: 1, Step: 134, Rank: 4, loss = 0.7283446192741394
 Epoch: 1, Step: 134, Rank: 3, loss = 0.8396689891815186
 Epoch: 1, Step: 134, Rank: 0, loss = 0.02253902517259121
 Per-token loss scaled by world size: 0.00010645172005752102
 Epoch: 1, Step: 134, Rank: 2, loss = 0.32354670763015747
 total tokens: 6520 num samples: 4 num padding tokens: 733 - rank: 1 max len: 1630 min len: 1198 avg len: 1446.75 num_loss_counted_tokens: 1720
 total tokens: 7600 num samples: 10 num padding tokens: 457 - rank: 4 max len: 760 min len: 692 avg len: 714.3 num_loss_counted_tokens: 4057
 {
    "epoch": 1,
    "step": 134,
    "rank": 0,
    "loss": 0.02253902517259121,
    "overall_throughput": 41.963973566813905,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.358724117279053,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24315,
    "batch_size": 87,
    "total_loss": 0.6673164963722229,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:02.337863"
 }
 total tokens: 7712 num samples: 16 num padding tokens: 2123 - rank: 6 max len: 482 min len: 243 avg len: 349.3125 num_loss_counted_tokens: 3585
 total tokens: 8064 num samples: 7 num padding tokens: 647 - rank: 2 max len: 1152 min len: 995 avg len: 1059.5714285714287 num_loss_counted_tokens: 4100
 total tokens: 6102 num samples: 27 num padding tokens: 2162 - rank: 7 max len: 226 min len: 78 avg len: 145.92592592592592 num_loss_counted_tokens: 1630
 total tokens: 7956 num samples: 12 num padding tokens: 1032 - rank: 5 max len: 663 min len: 497 avg len: 577.0 num_loss_counted_tokens: 5176
 total tokens: 7920 num samples: 8 num padding tokens: 902 - rank: 3 max len: 990 min len: 770 avg len: 877.25 num_loss_counted_tokens: 3876
 total tokens: 6987 num samples: 3 num padding tokens: 1078 - rank: 0 max len: 2329 min len: 1691 avg len: 1969.6666666666667 num_loss_counted_tokens: 437
 Per-token loss scaled by world size: 0.0005423504626378417
 Per-token loss scaled by world size: 0.00036555714905261993
 Per-token loss scaled by world size: 0.00022980774519965053
 Per-token loss scaled by world size: 2.886021502490621e-05
 Per-token loss scaled by world size: 2.577510167611763e-05
 Per-token loss scaled by world size: 0.00026930312742479146
 Per-token loss scaled by world size: 4.34833509643795e-06
 Epoch: 1, Step: 135, Rank: 3, loss = 0.5620235800743103
 Epoch: 1, Step: 135, Rank: 4, loss = 0.8940157294273376
 Epoch: 1, Step: 135, Rank: 2, loss = 0.07058126479387283
 Epoch: 1, Step: 135, Rank: 1, loss = 0.0630362331867218
 Epoch: 1, Step: 135, Rank: 6, loss = 1.3263858556747437
 Epoch: 1, Step: 135, Rank: 0, loss = 0.010634397156536579
 Epoch: 1, Step: 135, Rank: 7, loss = 0.658614456653595
 Per-token loss scaled by world size: 0.0009107645018957555
 Epoch: 1, Step: 135, Rank: 5, loss = 2.227388381958008
 total tokens: 6480 num samples: 3 num padding tokens: 750 - rank: 1 max len: 2160 min len: 1742 avg len: 1910.0 num_loss_counted_tokens: 614
 total tokens: 7314 num samples: 6 num padding tokens: 901 - rank: 4 max len: 1219 min len: 926 avg len: 1068.8333333333333 num_loss_counted_tokens: 4111
 {
    "epoch": 1,
    "step": 135,
    "rank": 0,
    "loss": 0.010634397156536579,
    "overall_throughput": 41.220052175245854,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.333390712738037,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19565,
    "batch_size": 73,
    "total_loss": 0.7265850305557251,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:04.904683"
 }
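 Note on reading the loss lines: the numbers above suggest that each rank's reported `loss` equals its "Per-token loss scaled by world size" value multiplied by the global `num_loss_counted_tokens` and divided by the world size. This is inferred from the logged values only, not from the training code, and pairing a given per-token line with rank 0 is an assumption, since the ranks' output is interleaved. A minimal sketch of the check for step 135 (`unscale_loss` is a hypothetical helper name):

```python
import math

WORLD_SIZE = 8  # from `--gpus 8` on the command line

# Values copied from the log for step 135.
per_token_loss_scaled = 4.34833509643795e-06  # assumed to be rank 0's line
num_loss_counted_tokens = 19565               # from the JSON block
reported_rank0_loss = 0.010634397156536579    # "loss" in the JSON block


def unscale_loss(scaled: float, total_tokens: int, world_size: int) -> float:
    """Hypothetical helper: undo the per-token / world-size scaling."""
    return scaled * total_tokens / world_size


reconstructed = unscale_loss(per_token_loss_scaled, num_loss_counted_tokens, WORLD_SIZE)
print(f"reconstructed={reconstructed:.9f} reported={reported_rank0_loss:.9f}")
print("match:", math.isclose(reconstructed, reported_rank0_loss, rel_tol=1e-5))
```

 The same relation appears to hold for step 136 (6.490522537205834e-06 with 24802 counted tokens against a rank 0 loss of 0.020122243091464043).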
 total tokens: 7488 num samples: 9 num padding tokens: 392 - rank: 5 max len: 832 min len: 756 avg len: 788.4444444444445 num_loss_counted_tokens: 5496
 total tokens: 7700 num samples: 11 num padding tokens: 1846 - rank: 6 max len: 700 min len: 394 avg len: 532.1818181818181 num_loss_counted_tokens: 3763
 total tokens: 7185 num samples: 5 num padding tokens: 672 - rank: 3 max len: 1437 min len: 1230 avg len: 1302.6 num_loss_counted_tokens: 1982
 total tokens: 6648 num samples: 4 num padding tokens: 299 - rank: 2 max len: 1662 min len: 1515 avg len: 1587.25 num_loss_counted_tokens: 916
 total tokens: 7875 num samples: 21 num padding tokens: 3380 - rank: 7 max len: 375 min len: 88 avg len: 214.04761904761904 num_loss_counted_tokens: 2091
 total tokens: 5508 num samples: 2 num padding tokens: 33 - rank: 0 max len: 2754 min len: 2721 avg len: 2737.5 num_loss_counted_tokens: 386
 Per-token loss scaled by world size: 0.0004460816562641412Per-token loss scaled by world size: 0.0001522299717180431Per-token loss scaled by world size: 0.0003226569388061762Per-token loss scaled by world size: 0.0003637947083916515Per-token loss scaled by world size: 0.00024503390886820853
 Per-token loss scaled by world size: 5.745379894506186e-05


 Per-token loss scaled by world size: 6.490522537205834e-06


 Epoch: 1, Step: 136, Rank: 6, loss = 1.0003172159194946
 Epoch: 1, Step: 136, Rank: 5, loss = 1.3829646110534668Epoch: 1, Step: 136, Rank: 3, loss = 0.4719509482383728

 Epoch: 1, Step: 136, Rank: 1, loss = 0.17812113463878632Epoch: 1, Step: 136, Rank: 4, loss = 1.127854585647583

 Epoch: 1, Step: 136, Rank: 7, loss = 0.759666383266449
 Epoch: 1, Step: 136, Rank: 0, loss = 0.020122243091464043
 Per-token loss scaled by world size: 0.00021433050278574228
 Epoch: 1, Step: 136, Rank: 2, loss = 0.6644781231880188
                                                          total tokens: 6660 num samples: 4 num padding tokens: 420 - rank: 1 max len: 1665 min len: 1335 avg len: 1560.0 num_loss_counted_tokens: 1802
 total tokens: 7997 num samples: 11 num padding tokens: 948 - rank: 4 max len: 727 min len: 535 avg len: 640.8181818181819 num_loss_counted_tokens: 5687
 {
    "epoch": 1,
    "step": 136,
    "rank": 0,
    "loss": 0.020122243091464043,
    "overall_throughput": 41.5055278365255,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.32311248779297,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24802,
    "batch_size": 88,
    "total_loss": 0.7006844282150269,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:07.453575"
 }
 total tokens: 8010 num samples: 15 num padding tokens: 963 - rank: 5 max len: 534 min len: 399 avg len: 469.8 num_loss_counted_tokens: 4616
 total tokens: 7308 num samples: 6 num padding tokens: 791 - rank: 2 max len: 1218 min len: 1008 avg len: 1086.1666666666667 num_loss_counted_tokens: 4945
 total tokens: 6356 num samples: 28 num padding tokens: 2338 - rank: 7 max len: 227 min len: 81 avg len: 143.5 num_loss_counted_tokens: 1107
 total tokens: 7782 num samples: 2 num padding tokens: 2057 - rank: 0 max len: 3891 min len: 1834 avg len: 2862.5 num_loss_counted_tokens: 230
 total tokens: 7752 num samples: 8 num padding tokens: 1153 - rank: 3 max len: 969 min len: 740 avg len: 824.875 num_loss_counted_tokens: 4517
 total tokens: 8085 num samples: 21 num padding tokens: 1827 - rank: 6 max len: 385 min len: 241 avg len: 298.0 num_loss_counted_tokens: 3422
 Per-token loss scaled by world size: 0.00032276863930746913Per-token loss scaled by world size: 0.0002541161666158587Per-token loss scaled by world size: 4.045515743200667e-05Per-token loss scaled by world size: 0.00033090231590904295


 Per-token loss scaled by world size: 0.0003559431352186948

 Per-token loss scaled by world size: 0.00022709915356244892
 Epoch: 1, Step: 137, Rank: 0, loss = 0.13576750457286835
 Epoch: 1, Step: 137, Rank: 6, loss = 1.1105082035064697
 Epoch: 1, Step: 137, Rank: 3, loss = 0.8528138995170593
 Per-token loss scaled by world size: 0.0001023018267005682

 Epoch: 1, Step: 137, Rank: 5, loss = 1.0832115411758423

 Epoch: 1, Step: 137, Rank: 4, loss = 1.1945451498031616
 Epoch: 1, Step: 137, Rank: 1, loss = 0.7621447443962097
 Per-token loss scaled by world size: 0.0002989015483763069
 Epoch: 1, Step: 137, Rank: 7, loss = 0.3433249294757843
 Epoch: 1, Step: 137, Rank: 2, loss = 1.0031136274337769
                                                          total tokens: 6924 num samples: 4 num padding tokens: 1102 - rank: 1 max len: 1731 min len: 1172 avg len: 1455.5 num_loss_counted_tokens: 1807
 total tokens: 7890 num samples: 10 num padding tokens: 472 - rank: 4 max len: 789 min len: 675 avg len: 741.8 num_loss_counted_tokens: 3224
 total tokens: 5649 num samples: 21 num padding tokens: 2159 - rank: 7 max len: 269 min len: 72 avg len: 166.1904761904762 num_loss_counted_tokens: 1470
 {
    "epoch": 1,
    "step": 137,
    "rank": 0,
    "loss": 0.13576750457286835,
    "overall_throughput": 42.12198553508762,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.488715171813965,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26848,
    "batch_size": 89,
    "total_loss": 0.8106787204742432,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:09.964344"
 }
 total tokens: 7440 num samples: 8 num padding tokens: 452 - rank: 3 max len: 930 min len: 829 avg len: 873.5 num_loss_counted_tokens: 4943
 total tokens: 7932 num samples: 12 num padding tokens: 1107 - rank: 5 max len: 661 min len: 488 avg len: 568.75 num_loss_counted_tokens: 4833
 total tokens: 7648 num samples: 16 num padding tokens: 1451 - rank: 6 max len: 478 min len: 284 avg len: 387.3125 num_loss_counted_tokens: 3233
 total tokens: 5446 num samples: 2 num padding tokens: 973 - rank: 0 max len: 2723 min len: 1750 avg len: 2236.5 num_loss_counted_tokens: 235
 total tokens: 7763 num samples: 7 num padding tokens: 774 - rank: 2 max len: 1109 min len: 938 avg len: 998.4285714285714 num_loss_counted_tokens: 4812
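 Note on the per-rank batch lines: in the lines above, `total tokens` matches `num samples` times `max len`, and `num padding tokens` matches that total minus the sum of the sample lengths (`avg len` times `num samples`). This suggests each micro-batch is padded to its longest sample; that reading is an inference from the logged numbers, not something the log states. A small sketch that reproduces three of the lines (`padding_stats` is a hypothetical helper name):

```python
def padding_stats(num_samples: int, max_len: int, avg_len: float) -> tuple[int, int]:
    """Hypothetical helper: padded token total and padding count for one rank,
    assuming every sample is padded to the longest sample in the micro-batch."""
    total_tokens = num_samples * max_len
    real_tokens = round(avg_len * num_samples)  # sum of the unpadded lengths
    return total_tokens, total_tokens - real_tokens


# (num samples, max len, avg len) copied from the rank 3, rank 0 and rank 2 lines.
for args in [(8, 930, 873.5), (2, 2723, 2236.5), (7, 1109, 998.4285714285714)]:
    print(args, "->", padding_stats(*args))
```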
 Per-token loss scaled by world size: 0.00015472256927751005Per-token loss scaled by world size: 0.00039065544842742383Per-token loss scaled by world size: 0.0003819867270067334Per-token loss scaled by world size: 0.00013601673708762974Per-token loss scaled by world size: 0.0002345130778849125

 Per-token loss scaled by world size: 9.365750884171575e-05
 Per-token loss scaled by world size: 0.00026829339913092554



 Epoch: 1, Step: 138, Rank: 5, loss = 1.27397620677948
 Epoch: 1, Step: 138, Rank: 1, loss = 0.5045696496963501Epoch: 1, Step: 138, Rank: 2, loss = 0.4435676038265228
 Epoch: 1, Step: 138, Rank: 4, loss = 1.2457064390182495
 Epoch: 1, Step: 138, Rank: 0, loss = 0.3054288327693939

 Epoch: 1, Step: 138, Rank: 7, loss = 0.8749383091926575Epoch: 1, Step: 138, Rank: 3, loss = 0.7647764682769775

 Per-token loss scaled by world size: 0.0003792343777604401
 Epoch: 1, Step: 138, Rank: 6, loss = 1.236730694770813
                                                          total tokens: 7994 num samples: 7 num padding tokens: 1969 - rank: 4 max len: 1142 min len: 760 avg len: 860.7142857142857 num_loss_counted_tokens: 3971
 total tokens: 6921 num samples: 3 num padding tokens: 423 - rank: 1 max len: 2307 min len: 1952 avg len: 2166.0 num_loss_counted_tokens: 2301
 {
    "epoch": 1,
    "step": 138,
    "rank": 0,
    "loss": 0.3054288327693939,
    "overall_throughput": 42.2029043837429,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.34983253479004,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26089,
    "batch_size": 100,
    "total_loss": 0.8312118053436279,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:12.475862"
 }
 total tokens: 7876 num samples: 11 num padding tokens: 1537 - rank: 5 max len: 716 min len: 452 avg len: 576.2727272727273 num_loss_counted_tokens: 4021
 total tokens: 7676 num samples: 4 num padding tokens: 763 - rank: 2 max len: 1919 min len: 1594 avg len: 1728.25 num_loss_counted_tokens: 1719
 total tokens: 7115 num samples: 5 num padding tokens: 406 - rank: 3 max len: 1423 min len: 1187 avg len: 1341.8 num_loss_counted_tokens: 3469
 total tokens: 7202 num samples: 26 num padding tokens: 2407 - rank: 7 max len: 277 min len: 93 avg len: 184.42307692307693 num_loss_counted_tokens: 2108
 total tokens: 6352 num samples: 2 num padding tokens: 277 - rank: 0 max len: 3176 min len: 2899 avg len: 3037.5 num_loss_counted_tokens: 710
 total tokens: 8046 num samples: 18 num padding tokens: 1591 - rank: 6 max len: 447 min len: 283 avg len: 358.6111111111111 num_loss_counted_tokens: 3535
 Per-token loss scaled by world size: 0.0002807814453262836Per-token loss scaled by world size: 0.0002877341175917536Per-token loss scaled by world size: 0.00017392370500601828
 Per-token loss scaled by world size: 0.00026719356537796557Per-token loss scaled by world size: 0.00031913904240354896
 Per-token loss scaled by world size: 0.000377663760446012


 Per-token loss scaled by world size: 3.4834424695873167e-06

 Epoch: 1, Step: 139, Rank: 5, loss = 0.8960040807723999Epoch: 1, Step: 139, Rank: 3, loss = 0.5415984392166138

 Epoch: 1, Step: 139, Rank: 1, loss = 0.8743534088134766Epoch: 1, Step: 139, Rank: 6, loss = 0.9937989711761475

 Epoch: 1, Step: 139, Rank: 7, loss = 0.8320407271385193
 Epoch: 1, Step: 139, Rank: 0, loss = 0.010847439989447594
 Epoch: 1, Step: 139, Rank: 4, loss = 1.1760449409484863
 Per-token loss scaled by world size: 0.0001241332065546885
 Epoch: 1, Step: 139, Rank: 2, loss = 0.38655081391334534
                                                          total tokens: 7483 num samples: 7 num padding tokens: 916 - rank: 4 max len: 1069 min len: 850 avg len: 938.1428571428571 num_loss_counted_tokens: 5047
 total tokens: 8067 num samples: 3 num padding tokens: 632 - rank: 1 max len: 2689 min len: 2157 avg len: 2478.3333333333335 num_loss_counted_tokens: 281
 {
    "epoch": 1,
    "step": 139,
    "rank": 0,
    "loss": 0.010847439989447594,
    "overall_throughput": 41.72034163026323,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.309249877929688,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24912,
    "batch_size": 83,
    "total_loss": 0.7139047980308533,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:15.011121"
 }
 total tokens: 7392 num samples: 4 num padding tokens: 517 - rank: 2 max len: 1848 min len: 1598 avg len: 1718.75 num_loss_counted_tokens: 1592
 total tokens: 3971 num samples: 19 num padding tokens: 1215 - rank: 7 max len: 209 min len: 77 avg len: 145.05263157894737 num_loss_counted_tokens: 1089
 total tokens: 7575 num samples: 5 num padding tokens: 836 - rank: 3 max len: 1515 min len: 1078 avg len: 1347.8 num_loss_counted_tokens: 3059
 total tokens: 7148 num samples: 2 num padding tokens: 850 - rank: 0 max len: 3574 min len: 2724 avg len: 3149.0 num_loss_counted_tokens: 226
 total tokens: 7308 num samples: 9 num padding tokens: 1380 - rank: 5 max len: 812 min len: 530 avg len: 658.6666666666666 num_loss_counted_tokens: 3339
 total tokens: 8048 num samples: 16 num padding tokens: 2308 - rank: 6 max len: 503 min len: 219 avg len: 358.75 num_loss_counted_tokens: 3793
 Per-token loss scaled by world size: 0.0002717878087423742Per-token loss scaled by world size: 0.00017612801457289606

 Per-token loss scaled by world size: 0.00045838873484171927Per-token loss scaled by world size: 0.00027124761254526675
 Per-token loss scaled by world size: 0.00033216923475265503Per-token loss scaled by world size: 2.2576082301384304e-06


 Per-token loss scaled by world size: 5.144028546055779e-05
 Epoch: 1, Step: 140, Rank: 2, loss = 0.5452042818069458
 Epoch: 1, Step: 140, Rank: 3, loss = 0.8413191437721252
 Epoch: 1, Step: 140, Rank: 5, loss = 1.4189423322677612
 Epoch: 1, Step: 140, Rank: 4, loss = 0.8396469950675964
 Epoch: 1, Step: 140, Rank: 0, loss = 0.0069884262047708035Epoch: 1, Step: 140, Rank: 7, loss = 1.028229832649231

 Epoch: 1, Step: 140, Rank: 1, loss = 0.15923340618610382
 Per-token loss scaled by world size: 0.0004096345801372081
 Epoch: 1, Step: 140, Rank: 6, loss = 1.2680238485336304
                                                          total tokens: 7968 num samples: 8 num padding tokens: 1069 - rank: 4 max len: 996 min len: 754 avg len: 862.375 num_loss_counted_tokens: 4593
 total tokens: 7478 num samples: 2 num padding tokens: 921 - rank: 1 max len: 3739 min len: 2818 avg len: 3278.5 num_loss_counted_tokens: 165
 {
    "epoch": 1,
    "step": 140,
    "rank": 0,
    "loss": 0.0069884262047708035,
    "overall_throughput": 41.71879346229527,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.433530807495117,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24764,
    "batch_size": 84,
    "total_loss": 0.7634485363960266,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:17.546839"
 }
 total tokens: 7473 num samples: 3 num padding tokens: 2287 - rank: 2 max len: 2491 min len: 1276 avg len: 1728.6666666666667 num_loss_counted_tokens: 634
 total tokens: 7476 num samples: 28 num padding tokens: 2615 - rank: 7 max len: 267 min len: 72 avg len: 173.60714285714286 num_loss_counted_tokens: 2127
 total tokens: 7326 num samples: 6 num padding tokens: 603 - rank: 3 max len: 1221 min len: 1065 avg len: 1120.5 num_loss_counted_tokens: 2131
 total tokens: 7580 num samples: 2 num padding tokens: 14 - rank: 0 max len: 3790 min len: 3776 avg len: 3783.0 num_loss_counted_tokens: 616
 total tokens: 7390 num samples: 10 num padding tokens: 1181 - rank: 5 max len: 739 min len: 522 avg len: 620.9 num_loss_counted_tokens: 3575
 total tokens: 7740 num samples: 15 num padding tokens: 1604 - rank: 6 max len: 516 min len: 269 avg len: 409.06666666666666 num_loss_counted_tokens: 3156
 Per-token loss scaled by world size: 0.0002609801304060966Per-token loss scaled by world size: 0.0002962287690024823Per-token loss scaled by world size: 0.0003096856235060841Per-token loss scaled by world size: 0.00027970768860541284


 Per-token loss scaled by world size: 2.0839811440964695e-06
 Per-token loss scaled by world size: 0.0006479129078797996

 Per-token loss scaled by world size: 5.338866685633548e-06
 Epoch: 1, Step: 141, Rank: 4, loss = 0.7952631711959839
 Epoch: 1, Step: 141, Rank: 3, loss = 0.8313897848129272Epoch: 1, Step: 141, Rank: 7, loss = 0.7006337642669678

 Epoch: 1, Step: 141, Rank: 2, loss = 0.750910222530365
 Epoch: 1, Step: 141, Rank: 5, loss = 1.7394031286239624Epoch: 1, Step: 141, Rank: 0, loss = 0.005594708025455475

 Epoch: 1, Step: 141, Rank: 1, loss = 0.014332855120301247
 Per-token loss scaled by world size: 0.0004338203580118716
 Epoch: 1, Step: 141, Rank: 6, loss = 1.1646449565887451
                                                          total tokens: 6904 num samples: 4 num padding tokens: 1250 - rank: 1 max len: 1726 min len: 1224 avg len: 1413.5 num_loss_counted_tokens: 3292
 total tokens: 7900 num samples: 10 num padding tokens: 629 - rank: 4 max len: 790 min len: 654 avg len: 727.1 num_loss_counted_tokens: 4621
 {
    "epoch": 1,
    "step": 141,
    "rank": 0,
    "loss": 0.005594708025455475,
    "overall_throughput": 43.02419669348744,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.362098217010498,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21477,
    "batch_size": 74,
    "total_loss": 0.7502715587615967,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:20.007363"
 }
 total tokens: 7999 num samples: 19 num padding tokens: 1858 - rank: 6 max len: 421 min len: 242 avg len: 323.2105263157895 num_loss_counted_tokens: 3598
 total tokens: 7648 num samples: 8 num padding tokens: 656 - rank: 3 max len: 956 min len: 812 avg len: 874.0 num_loss_counted_tokens: 5090
 total tokens: 6966 num samples: 6 num padding tokens: 548 - rank: 2 max len: 1161 min len: 993 avg len: 1069.6666666666667 num_loss_counted_tokens: 3065
 total tokens: 7440 num samples: 31 num padding tokens: 2495 - rank: 7 max len: 240 min len: 77 avg len: 159.51612903225808 num_loss_counted_tokens: 2088
 total tokens: 6690 num samples: 3 num padding tokens: 492 - rank: 0 max len: 2230 min len: 1959 avg len: 2066.0 num_loss_counted_tokens: 280
 total tokens: 7728 num samples: 12 num padding tokens: 1190 - rank: 5 max len: 644 min len: 425 avg len: 544.8333333333334 num_loss_counted_tokens: 3724
 Per-token loss scaled by world size: 0.0005912419874221087Per-token loss scaled by world size: 2.4495158868376166e-05Per-token loss scaled by world size: 0.0005119486595503986Per-token loss scaled by world size: 4.400004763738252e-05Per-token loss scaled by world size: 0.00011134906526422128Per-token loss scaled by world size: 0.0005244921194389462
 Per-token loss scaled by world size: 0.0003919812443200499





 Epoch: 1, Step: 142, Rank: 1, loss = 0.06206461042165756Epoch: 1, Step: 142, Rank: 6, loss = 1.297149896621704Epoch: 1, Step: 142, Rank: 5, loss = 1.4980593919754028

 Epoch: 1, Step: 142, Rank: 0, loss = 0.11148512363433838

 Epoch: 1, Step: 142, Rank: 2, loss = 0.2821306884288788
 Epoch: 1, Step: 142, Rank: 7, loss = 0.9931824803352356
 Epoch: 1, Step: 142, Rank: 4, loss = 1.3289319276809692
 Per-token loss scaled by world size: 0.00033672110293991864
 Epoch: 1, Step: 142, Rank: 3, loss = 0.8531671166419983
                                                          total tokens: 6288 num samples: 3 num padding tokens: 423 - rank: 1 max len: 2096 min len: 1765 avg len: 1955.0 num_loss_counted_tokens: 4016
 total tokens: 7448 num samples: 8 num padding tokens: 727 - rank: 4 max len: 931 min len: 688 avg len: 840.125 num_loss_counted_tokens: 5430
 {
    "epoch": 1,
    "step": 142,
    "rank": 0,
    "loss": 0.11148512363433838,
    "overall_throughput": 41.71373478675397,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.478935718536377,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20270,
    "batch_size": 89,
    "total_loss": 0.8032714128494263,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:22.542197"
 }
 total tokens: 7664 num samples: 16 num padding tokens: 2101 - rank: 6 max len: 479 min len: 261 avg len: 347.6875 num_loss_counted_tokens: 2998
 total tokens: 7707 num samples: 7 num padding tokens: 488 - rank: 3 max len: 1101 min len: 961 avg len: 1031.2857142857142 num_loss_counted_tokens: 3858
 total tokens: 7130 num samples: 5 num padding tokens: 559 - rank: 2 max len: 1426 min len: 1132 avg len: 1314.2 num_loss_counted_tokens: 3828
 total tokens: 7540 num samples: 29 num padding tokens: 3170 - rank: 7 max len: 260 min len: 78 avg len: 150.68965517241378 num_loss_counted_tokens: 1742
 total tokens: 6342 num samples: 2 num padding tokens: 886 - rank: 0 max len: 3171 min len: 2285 avg len: 2728.0 num_loss_counted_tokens: 169
 total tokens: 7872 num samples: 12 num padding tokens: 877 - rank: 5 max len: 656 min len: 510 avg len: 582.9166666666666 num_loss_counted_tokens: 4836
 Per-token loss scaled by world size: 0.0003404757590033114Per-token loss scaled by world size: 0.0005307358223944902Per-token loss scaled by world size: 6.328061135718599e-05Per-token loss scaled by world size: 0.0002552252262830734

 Per-token loss scaled by world size: 2.1218120309640653e-06


 Per-token loss scaled by world size: 1.8243759768665768e-05
 Per-token loss scaled by world size: 0.0003829057968687266
 Epoch: 1, Step: 143, Rank: 5, loss = 1.3186794519424438

 Epoch: 1, Step: 143, Rank: 2, loss = 0.15722858905792236Epoch: 1, Step: 143, Rank: 3, loss = 0.6341390013694763

 Epoch: 1, Step: 143, Rank: 0, loss = 0.005271907430142164
 Epoch: 1, Step: 143, Rank: 4, loss = 0.8459545969963074
 Epoch: 1, Step: 143, Rank: 1, loss = 0.04532890021800995
 Epoch: 1, Step: 143, Rank: 7, loss = 0.9513773322105408
 Per-token loss scaled by world size: 0.0005416726926341653
 Epoch: 1, Step: 143, Rank: 6, loss = 1.3458534479141235
                                                          total tokens: 7796 num samples: 4 num padding tokens: 508 - rank: 1 max len: 1949 min len: 1564 avg len: 1822.0 num_loss_counted_tokens: 2068
 total tokens: 7893 num samples: 9 num padding tokens: 666 - rank: 4 max len: 877 min len: 712 avg len: 803.0 num_loss_counted_tokens: 5521
 total tokens: 7679 num samples: 7 num padding tokens: 655 - rank: 3 max len: 1097 min len: 888 avg len: 1003.4285714285714 num_loss_counted_tokens: 5783
 {
    "epoch": 1,
    "step": 143,
    "rank": 0,
    "loss": 0.005271907430142164,
    "overall_throughput": 42.46446277162131,
    "lr": 2.4000000000000003e-06,
    "cuda_mem_allocated": 24.28184461593628,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19877,
    "batch_size": 76,
    "total_loss": 0.6629791259765625,
    "gradnorm": 1.007487177848816,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:25.071211"
 }
 total tokens: 2587 num samples: 13 num padding tokens: 827 - rank: 7 max len: 199 min len: 78 avg len: 135.3846153846154 num_loss_counted_tokens: 674
 total tokens: 7755 num samples: 11 num padding tokens: 897 - rank: 5 max len: 705 min len: 483 avg len: 623.4545454545455 num_loss_counted_tokens: 3835
 total tokens: 8028 num samples: 18 num padding tokens: 2679 - rank: 6 max len: 446 min len: 203 avg len: 297.1666666666667 num_loss_counted_tokens: 3221
 total tokens: 7772 num samples: 2 num padding tokens: 1840 - rank: 0 max len: 3886 min len: 2046 avg len: 2966.0 num_loss_counted_tokens: 2038
 total tokens: 7080 num samples: 5 num padding tokens: 517 - rank: 2 max len: 1416 min len: 1149 avg len: 1312.6 num_loss_counted_tokens: 2382
 Per-token loss scaled by world size: 0.00031199524528346956Per-token loss scaled by world size: 8.699101454112679e-05Per-token loss scaled by world size: 1.0524022400204558e-06Per-token loss scaled by world size: 0.00040260597597807646Per-token loss scaled by world size: 0.0002991097862832248Per-token loss scaled by world size: 9.484303154749796e-05





 Epoch: 1, Step: 144, Rank: 1, loss = 0.21871715784072876Epoch: 1, Step: 144, Rank: 0, loss = 0.0026460024528205395
 Epoch: 1, Step: 144, Rank: 2, loss = 0.23845909535884857
 Epoch: 1, Step: 144, Rank: 4, loss = 1.0122520923614502
 Epoch: 1, Step: 144, Rank: 7, loss = 0.7844340801239014

 Epoch: 1, Step: 144, Rank: 3, loss = 0.7520367503166199
 Per-token loss scaled by world size: 0.0005959446425549686
 Per-token loss scaled by world size: 0.00046541052870452404
 Epoch: 1, Step: 144, Rank: 5, loss = 1.4983538389205933

 Epoch: 1, Step: 144, Rank: 6, loss = 1.1701583862304688
 [2024-08-18 20:54:27,526] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[3.2000000000000003e-06], mom=[(0.9, 0.95)]
 [2024-08-18 20:54:27,603] [INFO] [timer.py:258:stop] epoch=0/micro_step=144/global_step=4, RunningAvgSamplesPerSec=41.73388834459918, CurrSamplesPerSec=41.83623290194428, MemAllocated=22.69GB, MaxMemAllocated=30.58GB
                                                          total tokens: 7693 num samples: 7 num padding tokens: 967 - rank: 4 max len: 1099 min len: 892 avg len: 960.8571428571429 num_loss_counted_tokens: 3764
 total tokens: 8019 num samples: 3 num padding tokens: 876 - rank: 1 max len: 2673 min len: 2126 avg len: 2381.0 num_loss_counted_tokens: 283
 total tokens: 8005 num samples: 5 num padding tokens: 1366 - rank: 3 max len: 1601 min len: 1136 avg len: 1327.8 num_loss_counted_tokens: 1813
 {
    "epoch": 1,
    "step": 144,
    "rank": 0,
    "loss": 0.0026460024528205395,
    "overall_throughput": 41.11764673168018,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 22.69074296951294,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20114,
    "batch_size": 79,
    "total_loss": 0.709632158279419,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:27.606216"
 }
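 Note on the `lr` change: the jump from 2.4000000000000003e-06 to 3.2000000000000003e-06 at DeepSpeed `global_step=4` is consistent with a linear warmup using the `learning_rate=2e-05` and `warmup_steps=25` shown in the TrainingArgs at the top of the log. The linear schedule, and the assumption that the earlier value belongs to global step 3, are inferences from the logged values rather than something confirmed by the training code. A minimal sketch (`warmup_lr` is a hypothetical helper name):

```python
import math

BASE_LR = 2e-05    # learning_rate from the TrainingArgs line
WARMUP_STEPS = 25  # warmup_steps from the TrainingArgs line


def warmup_lr(global_step: int, base_lr: float = BASE_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Hypothetical helper: linear warmup from 0 up to base_lr."""
    return base_lr * min(global_step, warmup_steps) / warmup_steps


for step in (3, 4):
    print(step, warmup_lr(step))

# Compare against the two lr values that appear in the log.
print(math.isclose(warmup_lr(3), 2.4000000000000003e-06))
print(math.isclose(warmup_lr(4), 3.2000000000000003e-06))
```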
 total tokens: 8100 num samples: 15 num padding tokens: 1493 - rank: 6 max len: 540 min len: 336 avg len: 440.46666666666664 num_loss_counted_tokens: 3689
 total tokens: 8000 num samples: 25 num padding tokens: 2965 - rank: 7 max len: 320 min len: 78 avg len: 201.4 num_loss_counted_tokens: 2308
 total tokens: 7860 num samples: 4 num padding tokens: 801 - rank: 2 max len: 1965 min len: 1608 avg len: 1764.75 num_loss_counted_tokens: 4235
 total tokens: 7858 num samples: 2 num padding tokens: 1014 - rank: 0 max len: 3929 min len: 2915 avg len: 3422.0 num_loss_counted_tokens: 226
 total tokens: 7686 num samples: 9 num padding tokens: 1386 - rank: 5 max len: 854 min len: 588 avg len: 700.0 num_loss_counted_tokens: 3557
 Per-token loss scaled by world size: 0.00021755551279056817Per-token loss scaled by world size: 0.00011090948828496039Per-token loss scaled by world size: 0.0003043776086997241Per-token loss scaled by world size: 0.00020149040210526437
 Per-token loss scaled by world size: 1.0487364079381223e-06
 Per-token loss scaled by world size: 0.000273953570285812Per-token loss scaled by world size: 0.00017797687905840576




 Epoch: 1, Step: 145, Rank: 5, loss = 1.0936287641525269
 Epoch: 1, Step: 145, Rank: 6, loss = 0.7239550352096558
 Epoch: 1, Step: 145, Rank: 3, loss = 0.7816769480705261Epoch: 1, Step: 145, Rank: 0, loss = 0.0037681099493056536

 Epoch: 1, Step: 145, Rank: 1, loss = 0.3984977900981903
 Epoch: 1, Step: 145, Rank: 7, loss = 0.6394709348678589
 Epoch: 1, Step: 145, Rank: 4, loss = 0.9843152165412903
 Per-token loss scaled by world size: 0.00025657241349108517
 Epoch: 1, Step: 145, Rank: 2, loss = 0.9218646883964539
                                                          total tokens: 5826 num samples: 2 num padding tokens: 6 - rank: 1 max len: 2913 min len: 2907 avg len: 2910.0 num_loss_counted_tokens: 567
 total tokens: 7812 num samples: 7 num padding tokens: 1611 - rank: 4 max len: 1116 min len: 752 avg len: 885.8571428571429 num_loss_counted_tokens: 3680
 {
    "epoch": 1,
    "step": 145,
    "rank": 0,
    "loss": 0.0037681099493056536,
    "overall_throughput": 41.9281555047958,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.221776962280273,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 28744,
    "batch_size": 94,
    "total_loss": 0.6933972239494324,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:30.179291"
 }
 total tokens: 5792 num samples: 2 num padding tokens: 692 - rank: 2 max len: 2896 min len: 2204 avg len: 2550.0 num_loss_counted_tokens: 520
 total tokens: 7612 num samples: 11 num padding tokens: 980 - rank: 5 max len: 692 min len: 509 avg len: 602.9090909090909 num_loss_counted_tokens: 4697
 total tokens: 7952 num samples: 16 num padding tokens: 1525 - rank: 6 max len: 497 min len: 266 avg len: 401.6875 num_loss_counted_tokens: 4516
 total tokens: 6225 num samples: 25 num padding tokens: 2041 - rank: 7 max len: 249 min len: 75 avg len: 167.36 num_loss_counted_tokens: 1838
 total tokens: 6171 num samples: 3 num padding tokens: 1311 - rank: 3 max len: 2057 min len: 1365 avg len: 1620.0 num_loss_counted_tokens: 217
 total tokens: 7190 num samples: 2 num padding tokens: 292 - rank: 0 max len: 3595 min len: 3303 avg len: 3449.0 num_loss_counted_tokens: 179
 Per-token loss scaled by world size: 0.00040030613308772445Per-token loss scaled by world size: 2.0256973130017286e-06Per-token loss scaled by world size: 4.8830220293893944e-06Per-token loss scaled by world size: 0.0004052049189340323Per-token loss scaled by world size: 0.0007852399721741676
 Per-token loss scaled by world size: 0.00055212079314515Per-token loss scaled by world size: 0.0006642856751568615





 Epoch: 1, Step: 146, Rank: 3, loss = 0.004231428261846304Epoch: 1, Step: 146, Rank: 6, loss = 1.640268087387085
 Epoch: 1, Step: 146, Rank: 2, loss = 0.8361894488334656

 Epoch: 1, Step: 146, Rank: 5, loss = 1.3876097202301025
 Epoch: 1, Step: 146, Rank: 7, loss = 0.8464224338531494
 Epoch: 1, Step: 146, Rank: 4, loss = 1.1533113718032837
 Epoch: 1, Step: 146, Rank: 1, loss = 0.010200022719800472
 Per-token loss scaled by world size: 5.620273441309109e-05
 Epoch: 1, Step: 146, Rank: 0, loss = 0.11740048974752426
                                                          total tokens: 5644 num samples: 2 num padding tokens: 996 - rank: 1 max len: 2822 min len: 1826 avg len: 2324.0 num_loss_counted_tokens: 207
 total tokens: 7605 num samples: 9 num padding tokens: 737 - rank: 4 max len: 845 min len: 701 avg len: 763.1111111111111 num_loss_counted_tokens: 4932
 {
    "epoch": 1,
    "step": 146,
    "rank": 0,
    "loss": 0.11740048974752426,
    "overall_throughput": 40.05446798320092,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.520647048950195,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16711,
    "batch_size": 66,
    "total_loss": 0.749454140663147,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:32.780987"
 }
 total tokens: 6615 num samples: 27 num padding tokens: 2358 - rank: 7 max len: 245 min len: 88 avg len: 157.66666666666666 num_loss_counted_tokens: 1644
 total tokens: 6744 num samples: 4 num padding tokens: 435 - rank: 2 max len: 1686 min len: 1368 avg len: 1577.25 num_loss_counted_tokens: 992
 total tokens: 6300 num samples: 2 num padding tokens: 261 - rank: 0 max len: 3150 min len: 2889 avg len: 3019.5 num_loss_counted_tokens: 190
 total tokens: 6790 num samples: 5 num padding tokens: 1037 - rank: 3 max len: 1358 min len: 1051 avg len: 1150.6 num_loss_counted_tokens: 2865
 total tokens: 7872 num samples: 12 num padding tokens: 1248 - rank: 5 max len: 656 min len: 441 avg len: 552.0 num_loss_counted_tokens: 3289
 total tokens: 7776 num samples: 18 num padding tokens: 1655 - rank: 6 max len: 432 min len: 260 avg len: 340.05555555555554 num_loss_counted_tokens: 3056
 Per-token loss scaled by world size: 0.0007255689124576747Per-token loss scaled by world size: 0.0006010388606227934Per-token loss scaled by world size: 0.0007319062133319676Per-token loss scaled by world size: 7.5493703661777545e-06Per-token loss scaled by world size: 0.00025839314912445843Per-token loss scaled by world size: 5.159122338227462e-06Per-token loss scaled by world size: 7.537942292401567e-05






 Epoch: 1, Step: 147, Rank: 5, loss = 1.5308597087860107Epoch: 1, Step: 147, Rank: 6, loss = 1.544230580329895

 Epoch: 1, Step: 147, Rank: 4, loss = 1.26811683177948Epoch: 1, Step: 147, Rank: 2, loss = 0.010885103605687618

 Epoch: 1, Step: 147, Rank: 1, loss = 0.015928227454423904
 Epoch: 1, Step: 147, Rank: 0, loss = 0.159041166305542Epoch: 1, Step: 147, Rank: 7, loss = 0.5451772212982178

 Per-token loss scaled by world size: 0.000249014439759776
 Epoch: 1, Step: 147, Rank: 3, loss = 0.5253893136978149
                                                          total tokens: 5520 num samples: 2 num padding tokens: 43 - rank: 1 max len: 2760 min len: 2717 avg len: 2738.5 num_loss_counted_tokens: 221
 Epoch 1:  21%|██▏       | 26/122 [01:06<04:05,  2.56s/it]
 {
    "epoch": 1,
    "step": 147,
    "rank": 0,
    "loss": 0.159041166305542,
    "overall_throughput": 41.27657023996106,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.05473041534424,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16879,
    "batch_size": 60,
    "total_loss": 0.699953556060791,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:35.343018"
 }
 total tokens: 7721 num samples: 7 num padding tokens: 1814 - rank: 4 max len: 1103 min len: 719 avg len: 843.8571428571429 num_loss_counted_tokens: 3306
 total tokens: 6106 num samples: 2 num padding tokens: 283 - rank: 0 max len: 3053 min len: 2770 avg len: 2911.5 num_loss_counted_tokens: 189
 total tokens: 8064 num samples: 32 num padding tokens: 2634 - rank: 7 max len: 252 min len: 83 avg len: 169.6875 num_loss_counted_tokens: 2206
 total tokens: 7017 num samples: 3 num padding tokens: 592 - rank: 2 max len: 2339 min len: 2001 avg len: 2141.6666666666665 num_loss_counted_tokens: 257
 total tokens: 7960 num samples: 5 num padding tokens: 1198 - rank: 3 max len: 1592 min len: 1124 avg len: 1352.4 num_loss_counted_tokens: 3699
 total tokens: 7840 num samples: 16 num padding tokens: 2136 - rank: 6 max len: 490 min len: 253 avg len: 356.5 num_loss_counted_tokens: 2793
 total tokens: 7744 num samples: 11 num padding tokens: 773 - rank: 5 max len: 704 min len: 499 avg len: 633.7272727272727 num_loss_counted_tokens: 4777
 Per-token loss scaled by world size: 0.00016060298366937786
 Per-token loss scaled by world size: 0.0004763362812809646
 Per-token loss scaled by world size: 1.0486909332030336e-06
 Per-token loss scaled by world size: 0.00025925057707354426
 Per-token loss scaled by world size: 0.00013205081631895155
 Per-token loss scaled by world size: 0.0002572258817963302
 Per-token loss scaled by world size: 0.0002500153495930135
 Epoch: 1, Step: 148, Rank: 5, loss = 1.6056700944900513
 Epoch: 1, Step: 148, Rank: 0, loss = 0.003535006195306778
 Epoch: 1, Step: 148, Rank: 2, loss = 0.5413725972175598
 Epoch: 1, Step: 148, Rank: 6, loss = 0.8739013075828552
 Epoch: 1, Step: 148, Rank: 4, loss = 0.8670763373374939
 Epoch: 1, Step: 148, Rank: 7, loss = 0.8427704572677612
 Epoch: 1, Step: 148, Rank: 1, loss = 0.44512680172920227
 Per-token loss scaled by world size: 0.00031931744888424873
 Epoch: 1, Step: 148, Rank: 3, loss = 1.0763791799545288
                                                          total tokens: 7304 num samples: 8 num padding tokens: 608 - rank: 4 max len: 913 min len: 793 avg len: 837.0 num_loss_counted_tokens: 4081
 total tokens: 7014 num samples: 3 num padding tokens: 554 - rank: 1 max len: 2338 min len: 1840 avg len: 2153.3333333333335 num_loss_counted_tokens: 349
 {
    "epoch": 1,
    "step": 148,
    "rank": 0,
    "loss": 0.003535006195306778,
    "overall_throughput": 40.32522306984037,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.535661697387695,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26967,
    "batch_size": 89,
    "total_loss": 0.781978964805603,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:37.948159"
 }
 total tokens: 7966 num samples: 14 num padding tokens: 3196 - rank: 6 max len: 569 min len: 266 avg len: 340.7142857142857 num_loss_counted_tokens: 2847
 total tokens: 6400 num samples: 25 num padding tokens: 2104 - rank: 7 max len: 256 min len: 75 avg len: 171.84 num_loss_counted_tokens: 1881
 total tokens: 7720 num samples: 10 num padding tokens: 795 - rank: 5 max len: 772 min len: 616 avg len: 692.5 num_loss_counted_tokens: 5407
 total tokens: 5704 num samples: 2 num padding tokens: 490 - rank: 0 max len: 2852 min len: 2362 avg len: 2607.0 num_loss_counted_tokens: 206
 total tokens: 7476 num samples: 6 num padding tokens: 582 - rank: 2 max len: 1246 min len: 1054 avg len: 1149.0 num_loss_counted_tokens: 3350
 total tokens: 7350 num samples: 7 num padding tokens: 419 - rank: 3 max len: 1050 min len: 921 avg len: 990.1428571428571 num_loss_counted_tokens: 4766
 Per-token loss scaled by world size: 0.00031092012068256736
 Per-token loss scaled by world size: 0.00025245780125260353
 Per-token loss scaled by world size: 0.0003971010446548462
 Per-token loss scaled by world size: 0.00022266971063800156
 Per-token loss scaled by world size: 0.0001838229363784194
 Per-token loss scaled by world size: 0.0004638316167984158
 Per-token loss scaled by world size: 4.563625225273427e-06
 Epoch: 1, Step: 149, Rank: 4, loss = 0.7849859595298767
 Epoch: 1, Step: 149, Rank: 5, loss = 1.2347360849380493
 Epoch: 1, Step: 149, Rank: 1, loss = 0.6923636198043823
 Epoch: 1, Step: 149, Rank: 7, loss = 0.9667672514915466
 Epoch: 1, Step: 149, Rank: 2, loss = 0.5715744495391846
 Epoch: 1, Step: 149, Rank: 3, loss = 1.4422264099121094
 Epoch: 1, Step: 149, Rank: 0, loss = 0.014190022833645344
 Per-token loss scaled by world size: 0.00031763844890519977
 Epoch: 1, Step: 149, Rank: 6, loss = 0.9876570105552673
                                                          total tokens: 7170 num samples: 3 num padding tokens: 1425 - rank: 1 max len: 2390 min len: 1672 avg len: 1915.0 num_loss_counted_tokens: 1941
 total tokens: 7462 num samples: 7 num padding tokens: 447 - rank: 4 max len: 1066 min len: 923 avg len: 1002.1428571428571 num_loss_counted_tokens: 3761
 {
    "epoch": 1,
    "step": 149,
    "rank": 0,
    "loss": 0.014190022833645344,
    "overall_throughput": 42.217359471248905,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.2309513092041,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24875,
    "batch_size": 89,
    "total_loss": 0.8368127346038818,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:40.455161"
 }
 total tokens: 7720 num samples: 5 num padding tokens: 320 - rank: 2 max len: 1544 min len: 1403 avg len: 1480.0 num_loss_counted_tokens: 2899
 total tokens: 7384 num samples: 8 num padding tokens: 1322 - rank: 5 max len: 923 min len: 593 avg len: 757.75 num_loss_counted_tokens: 4939
 total tokens: 7968 num samples: 24 num padding tokens: 3134 - rank: 7 max len: 332 min len: 91 avg len: 201.41666666666666 num_loss_counted_tokens: 2053
 total tokens: 6318 num samples: 2 num padding tokens: 100 - rank: 0 max len: 3159 min len: 3059 avg len: 3109.0 num_loss_counted_tokens: 160
 total tokens: 6785 num samples: 5 num padding tokens: 341 - rank: 3 max len: 1357 min len: 1186 avg len: 1288.8 num_loss_counted_tokens: 5137
 total tokens: 7602 num samples: 14 num padding tokens: 1273 - rank: 6 max len: 543 min len: 371 avg len: 452.07142857142856 num_loss_counted_tokens: 4011
 Per-token loss scaled by world size: 0.00026751268887892365
 Per-token loss scaled by world size: 0.00014292483683675528
 Per-token loss scaled by world size: 0.0001920466311275959
 Per-token loss scaled by world size: 0.0004723104939330369
 Per-token loss scaled by world size: 2.162046712328447e-06
 Per-token loss scaled by world size: 0.0003416259423829615
 Per-token loss scaled by world size: 0.0003082101175095886
 Epoch: 1, Step: 150, Rank: 3, loss = 0.7713059782981873
 Epoch: 1, Step: 150, Rank: 0, loss = 0.0062337215058505535
 Epoch: 1, Step: 150, Rank: 2, loss = 0.5537184476852417
 Epoch: 1, Step: 150, Rank: 5, loss = 1.3617892265319824
 Epoch: 1, Step: 150, Rank: 1, loss = 0.4120880365371704
 Epoch: 1, Step: 150, Rank: 7, loss = 0.9849929809570312
 Epoch: 1, Step: 150, Rank: 4, loss = 0.8886467814445496
 Per-token loss scaled by world size: 0.00027262946241535246
 Epoch: 1, Step: 150, Rank: 6, loss = 0.7860589027404785
                                                          total tokens: 7392 num samples: 3 num padding tokens: 958 - rank: 1 max len: 2464 min len: 1969 avg len: 2144.6666666666665 num_loss_counted_tokens: 582
 total tokens: 7800 num samples: 10 num padding tokens: 762 - rank: 4 max len: 780 min len: 667 avg len: 703.8 num_loss_counted_tokens: 4946
 {
    "epoch": 1,
    "step": 150,
    "rank": 0,
    "loss": 0.0062337215058505535,
    "overall_throughput": 41.782778937325965,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.4852614402771,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23066,
    "batch_size": 89,
    "total_loss": 0.7206042408943176,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:42.989502"
 }
 total tokens: 7866 num samples: 19 num padding tokens: 2154 - rank: 6 max len: 414 min len: 221 avg len: 300.63157894736844 num_loss_counted_tokens: 2912
 total tokens: 7504 num samples: 4 num padding tokens: 1167 - rank: 2 max len: 1876 min len: 1283 avg len: 1584.25 num_loss_counted_tokens: 741
 total tokens: 8016 num samples: 8 num padding tokens: 721 - rank: 3 max len: 1002 min len: 795 avg len: 911.875 num_loss_counted_tokens: 6005
 total tokens: 5805 num samples: 27 num padding tokens: 2202 - rank: 7 max len: 215 min len: 71 avg len: 133.44444444444446 num_loss_counted_tokens: 1404
 total tokens: 6802 num samples: 2 num padding tokens: 541 - rank: 0 max len: 3401 min len: 2860 avg len: 3130.5 num_loss_counted_tokens: 166
 total tokens: 7920 num samples: 12 num padding tokens: 1266 - rank: 5 max len: 660 min len: 445 avg len: 554.5 num_loss_counted_tokens: 4048
 Per-token loss scaled by world size: 0.0004141188692301512
 Per-token loss scaled by world size: 0.0002827317512128502
 Per-token loss scaled by world size: 0.00034681695979088545
 Per-token loss scaled by world size: 0.0004108196299057454
 Per-token loss scaled by world size: 1.1767973546739086e-06
 Per-token loss scaled by world size: 9.391092316946015e-05
 Per-token loss scaled by world size: 0.00017690712411422282
 Epoch: 1, Step: 151, Rank: 3, loss = 1.0656384229660034
 Epoch: 1, Step: 151, Rank: 5, loss = 1.2724319696426392
 Epoch: 1, Step: 151, Rank: 4, loss = 0.8687286376953125
 Epoch: 1, Step: 151, Rank: 0, loss = 0.003615856869146228
 Epoch: 1, Step: 151, Rank: 6, loss = 1.2622946500778198
 Epoch: 1, Step: 151, Rank: 7, loss = 0.5435692667961121
 Epoch: 1, Step: 151, Rank: 1, loss = 0.28855305910110474
 Per-token loss scaled by world size: 0.0003057793073821813
 Epoch: 1, Step: 151, Rank: 2, loss = 0.9395451545715332
                                                          total tokens: 8016 num samples: 8 num padding tokens: 1103 - rank: 4 max len: 1002 min len: 696 avg len: 864.125 num_loss_counted_tokens: 4285
 total tokens: 7425 num samples: 3 num padding tokens: 1417 - rank: 1 max len: 2475 min len: 1754 avg len: 2002.6666666666667 num_loss_counted_tokens: 868
 {
    "epoch": 1,
    "step": 151,
    "rank": 0,
    "loss": 0.003615856869146228,
    "overall_throughput": 41.2486467947231,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.402647495269775,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24581,
    "batch_size": 87,
    "total_loss": 0.780547022819519,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:45.555476"
 }
 total tokens: 8046 num samples: 6 num padding tokens: 1241 - rank: 3 max len: 1341 min len: 1041 avg len: 1134.1666666666667 num_loss_counted_tokens: 2869
 total tokens: 7560 num samples: 30 num padding tokens: 2626 - rank: 7 max len: 252 min len: 77 avg len: 164.46666666666667 num_loss_counted_tokens: 2002
 total tokens: 7755 num samples: 15 num padding tokens: 2376 - rank: 6 max len: 517 min len: 262 avg len: 358.6 num_loss_counted_tokens: 3077
 total tokens: 6984 num samples: 4 num padding tokens: 492 - rank: 2 max len: 1746 min len: 1432 avg len: 1623.0 num_loss_counted_tokens: 2214
 total tokens: 7491 num samples: 11 num padding tokens: 868 - rank: 5 max len: 681 min len: 536 avg len: 602.0909090909091 num_loss_counted_tokens: 3815
 total tokens: 7306 num samples: 2 num padding tokens: 786 - rank: 0 max len: 3653 min len: 2867 avg len: 3260.0 num_loss_counted_tokens: 160
 Per-token loss scaled by world size: 0.00020859052892774343
 Per-token loss scaled by world size: 0.0006468938081525266
 Per-token loss scaled by world size: 0.00038840470369905233
 Per-token loss scaled by world size: 8.114238880807534e-05
 Per-token loss scaled by world size: 4.119947334402241e-05
 Per-token loss scaled by world size: 0.0005277044838294387
 Per-token loss scaled by world size: 3.4815836897905683e-06
 Epoch: 1, Step: 152, Rank: 5, loss = 1.5654021501541138
 Epoch: 1, Step: 152, Rank: 3, loss = 0.5047630071640015
 Epoch: 1, Step: 152, Rank: 2, loss = 0.1963544338941574
 Epoch: 1, Step: 152, Rank: 7, loss = 0.9398908615112305
 Epoch: 1, Step: 152, Rank: 4, loss = 1.276978850364685
 Epoch: 1, Step: 152, Rank: 0, loss = 0.008424997329711914
 Epoch: 1, Step: 152, Rank: 1, loss = 0.09969757497310638
 Per-token loss scaled by world size: 0.0005147996125742793
 Epoch: 1, Step: 152, Rank: 6, loss = 1.2457506656646729
                                                          total tokens: 5714 num samples: 2 num padding tokens: 255 - rank: 1 max len: 2857 min len: 2602 avg len: 2729.5 num_loss_counted_tokens: 173
 total tokens: 7479 num samples: 9 num padding tokens: 752 - rank: 4 max len: 831 min len: 644 avg len: 747.4444444444445 num_loss_counted_tokens: 4908
 {
    "epoch": 1,
    "step": 152,
    "rank": 0,
    "loss": 0.008424997329711914,
    "overall_throughput": 43.280048077519346,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.22560167312622,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19359,
    "batch_size": 61,
    "total_loss": 0.7296578288078308,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:48.003629"
 }
 total tokens: 7932 num samples: 6 num padding tokens: 1038 - rank: 2 max len: 1322 min len: 990 avg len: 1149.0 num_loss_counted_tokens: 3464
 total tokens: 6662 num samples: 2 num padding tokens: 247 - rank: 0 max len: 3331 min len: 3084 avg len: 3207.5 num_loss_counted_tokens: 180
 total tokens: 3819 num samples: 19 num padding tokens: 1446 - rank: 7 max len: 201 min len: 76 avg len: 124.89473684210526 num_loss_counted_tokens: 686
 total tokens: 8052 num samples: 22 num padding tokens: 2171 - rank: 6 max len: 366 min len: 201 avg len: 267.3181818181818 num_loss_counted_tokens: 3024
 total tokens: 7912 num samples: 8 num padding tokens: 615 - rank: 3 max len: 989 min len: 855 avg len: 912.125 num_loss_counted_tokens: 4585
 total tokens: 8047 num samples: 13 num padding tokens: 1226 - rank: 5 max len: 619 min len: 375 avg len: 524.6923076923077 num_loss_counted_tokens: 5490
 Per-token loss scaled by world size: 0.0002795422915369272
 Per-token loss scaled by world size: 0.00013167920405976474
 Per-token loss scaled by world size: 0.0003253524482715875
 Per-token loss scaled by world size: 0.00030560040613636374
 Per-token loss scaled by world size: 0.00012384731962811202
 Per-token loss scaled by world size: 1.194436777041119e-06
 Epoch: 1, Step: 153, Rank: 3, loss = 1.0706535577774048
 Per-token loss scaled by world size: 0.0002909142931457609
 Epoch: 1, Step: 153, Rank: 1, loss = 0.407550573348999
 Epoch: 1, Step: 153, Rank: 7, loss = 0.43332335352897644
 Epoch: 1, Step: 153, Rank: 4, loss = 1.0056545734405518
 Epoch: 1, Step: 153, Rank: 6, loss = 0.9199037551879883
 Epoch: 1, Step: 153, Rank: 0, loss = 0.003930592909455299
 Per-token loss scaled by world size: 0.0003188060945831239
 Epoch: 1, Step: 153, Rank: 2, loss = 0.9573261737823486
 Epoch: 1, Step: 153, Rank: 5, loss = 1.0491111278533936
                                                          total tokens: 7839 num samples: 9 num padding tokens: 793 - rank: 4 max len: 871 min len: 691 avg len: 782.8888888888889 num_loss_counted_tokens: 5172
 total tokens: 6369 num samples: 3 num padding tokens: 846 - rank: 1 max len: 2123 min len: 1506 avg len: 1841.0 num_loss_counted_tokens: 2162
 {
    "epoch": 1,
    "step": 153,
    "rank": 0,
    "loss": 0.003930592909455299,
    "overall_throughput": 41.68838808478364,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.49734401702881,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26326,
    "batch_size": 95,
    "total_loss": 0.7309317588806152,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:50.537829"
 }
 total tokens: 7215 num samples: 5 num padding tokens: 508 - rank: 2 max len: 1443 min len: 1149 avg len: 1341.4 num_loss_counted_tokens: 3549
 total tokens: 7714 num samples: 29 num padding tokens: 2672 - rank: 7 max len: 266 min len: 79 avg len: 173.86206896551724 num_loss_counted_tokens: 2096
 total tokens: 8004 num samples: 3 num padding tokens: 275 - rank: 0 max len: 2668 min len: 2403 avg len: 2576.3333333333335 num_loss_counted_tokens: 964
 total tokens: 7700 num samples: 7 num padding tokens: 838 - rank: 3 max len: 1100 min len: 894 avg len: 980.2857142857143 num_loss_counted_tokens: 4738
 total tokens: 8076 num samples: 12 num padding tokens: 1148 - rank: 5 max len: 673 min len: 516 avg len: 577.3333333333334 num_loss_counted_tokens: 4820
 total tokens: 8032 num samples: 16 num padding tokens: 2150 - rank: 6 max len: 502 min len: 274 avg len: 367.625 num_loss_counted_tokens: 3962
 Per-token loss scaled by world size: 0.00019867185619659722
 Per-token loss scaled by world size: 0.0002461467229295522
 Per-token loss scaled by world size: 0.00024380745890084654
 Per-token loss scaled by world size: 0.00019991688895970583
 Per-token loss scaled by world size: 6.630049756495282e-05
 Per-token loss scaled by world size: 1.6863944551914756e-07
 Epoch: 1, Step: 154, Rank: 2, loss = 0.7483974695205688
 Epoch: 1, Step: 154, Rank: 3, loss = 0.6136698722839355
 Epoch: 1, Step: 154, Rank: 4, loss = 0.7555781602859497
 Epoch: 1, Step: 154, Rank: 6, loss = 0.6098480820655823
 Per-token loss scaled by world size: 0.00021128085791133344
 Epoch: 1, Step: 154, Rank: 1, loss = 0.20351766049861908
 Epoch: 1, Step: 154, Rank: 0, loss = 0.0005176598788239062
 Per-token loss scaled by world size: 0.0005612249951809645
 Epoch: 1, Step: 154, Rank: 7, loss = 0.6485530138015747
 Epoch: 1, Step: 154, Rank: 5, loss = 1.7227503061294556
                                                          total tokens: 7904 num samples: 4 num padding tokens: 814 - rank: 1 max len: 1976 min len: 1533 avg len: 1772.5 num_loss_counted_tokens: 2873
 total tokens: 7744 num samples: 11 num padding tokens: 596 - rank: 4 max len: 704 min len: 594 avg len: 649.8181818181819 num_loss_counted_tokens: 5417
 {
    "epoch": 1,
    "step": 154,
    "rank": 0,
    "loss": 0.0005176598788239062,
    "overall_throughput": 42.108806987198356,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.21819305419922,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24557,
    "batch_size": 80,
    "total_loss": 0.662854015827179,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:53.052708"
 }
 total tokens: 7440 num samples: 31 num padding tokens: 3180 - rank: 7 max len: 240 min len: 74 avg len: 137.41935483870967 num_loss_counted_tokens: 1408
 total tokens: 7020 num samples: 6 num padding tokens: 1084 - rank: 2 max len: 1170 min len: 846 avg len: 989.3333333333334 num_loss_counted_tokens: 1599
 total tokens: 8097 num samples: 3 num padding tokens: 808 - rank: 0 max len: 2699 min len: 2025 avg len: 2429.6666666666665 num_loss_counted_tokens: 947
 total tokens: 7999 num samples: 19 num padding tokens: 1990 - rank: 6 max len: 421 min len: 240 avg len: 316.2631578947368 num_loss_counted_tokens: 3596
 total tokens: 7657 num samples: 13 num padding tokens: 1095 - rank: 5 max len: 589 min len: 434 avg len: 504.7692307692308 num_loss_counted_tokens: 4851
 total tokens: 7560 num samples: 9 num padding tokens: 485 - rank: 3 max len: 840 min len: 716 avg len: 786.1111111111111 num_loss_counted_tokens: 5656
 Per-token loss scaled by world size: 0.0003278250514995307
 Per-token loss scaled by world size: 0.0003366835881024599
 Per-token loss scaled by world size: 0.0003885742917191237
 Per-token loss scaled by world size: 0.00019582045206334442
 Per-token loss scaled by world size: 0.00032262562308460474
 Per-token loss scaled by world size: 0.00013199940440244973
 Per-token loss scaled by world size: 3.822985672741197e-05
 Epoch: 1, Step: 155, Rank: 5, loss = 1.060516357421875
 Epoch: 1, Step: 155, Rank: 4, loss = 0.8947165012359619
 Epoch: 1, Step: 155, Rank: 7, loss = 0.9188936948776245
 Epoch: 1, Step: 155, Rank: 2, loss = 0.5344429612159729
 Epoch: 1, Step: 155, Rank: 3, loss = 0.8805260062217712
 Epoch: 1, Step: 155, Rank: 1, loss = 0.36025938391685486
 Epoch: 1, Step: 155, Rank: 0, loss = 0.10433883965015411
 Per-token loss scaled by world size: 0.00041954353218898177
 Epoch: 1, Step: 155, Rank: 6, loss = 1.1450392007827759
                                                          total tokens: 8064 num samples: 12 num padding tokens: 820 - rank: 4 max len: 672 min len: 523 avg len: 603.6666666666666 num_loss_counted_tokens: 5303
 total tokens: 6845 num samples: 5 num padding tokens: 859 - rank: 1 max len: 1369 min len: 1106 avg len: 1197.2 num_loss_counted_tokens: 4981
 {
    "epoch": 1,
    "step": 155,
    "rank": 0,
    "loss": 0.10433883965015411,
    "overall_throughput": 41.90741869411001,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.32686471939087,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21834,
    "batch_size": 76,
    "total_loss": 0.7373416423797607,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:55.565723"
 }
 total tokens: 7539 num samples: 7 num padding tokens: 396 - rank: 2 max len: 1077 min len: 913 avg len: 1020.4285714285714 num_loss_counted_tokens: 3046
 total tokens: 8118 num samples: 9 num padding tokens: 1166 - rank: 3 max len: 902 min len: 687 avg len: 772.4444444444445 num_loss_counted_tokens: 4403
 total tokens: 7740 num samples: 15 num padding tokens: 1204 - rank: 5 max len: 516 min len: 339 avg len: 435.73333333333335 num_loss_counted_tokens: 4055
 total tokens: 7732 num samples: 2 num padding tokens: 2186 - rank: 0 max len: 3866 min len: 1680 avg len: 2773.0 num_loss_counted_tokens: 1642
 total tokens: 6720 num samples: 30 num padding tokens: 2251 - rank: 7 max len: 224 min len: 83 avg len: 148.96666666666667 num_loss_counted_tokens: 1606
 total tokens: 7872 num samples: 24 num padding tokens: 1261 - rank: 6 max len: 328 min len: 226 avg len: 275.4583333333333 num_loss_counted_tokens: 3708
 Per-token loss scaled by world size: 0.00035203597508370876
 Per-token loss scaled by world size: 0.0004669471236411482
 Per-token loss scaled by world size: 0.00028399238362908363
 Per-token loss scaled by world size: 0.0006751486216671765
 Per-token loss scaled by world size: 8.408135727222543e-06
 Per-token loss scaled by world size: 0.0003578344185370952
 Epoch: 1, Step: 156, Rank: 2, loss = 0.6541054248809814
 Epoch: 1, Step: 156, Rank: 3, loss = 1.075495958328247
 Epoch: 1, Step: 156, Rank: 6, loss = 1.5550360679626465
 Epoch: 1, Step: 156, Rank: 5, loss = 0.81082683801651
 Epoch: 1, Step: 156, Rank: 0, loss = 0.019366038963198662
 Per-token loss scaled by world size: 0.0002267559466417879
 Epoch: 1, Step: 156, Rank: 4, loss = 0.8241821527481079
 Per-token loss scaled by world size: 2.405712393738213e-06
 Epoch: 1, Step: 156, Rank: 7, loss = 0.5222756266593933
 Epoch: 1, Step: 156, Rank: 1, loss = 0.00554095720872283
                                                          total tokens: 6510 num samples: 3 num padding tokens: 687 - rank: 1 max len: 2170 min len: 1746 avg len: 1941.0 num_loss_counted_tokens: 859
 total tokens: 7452 num samples: 9 num padding tokens: 620 - rank: 4 max len: 828 min len: 705 avg len: 759.1111111111111 num_loss_counted_tokens: 5560
 total tokens: 8004 num samples: 29 num padding tokens: 2622 - rank: 7 max len: 276 min len: 75 avg len: 185.58620689655172 num_loss_counted_tokens: 2559
 {
    "epoch": 1,
    "step": 156,
    "rank": 0,
    "loss": 0.019366038963198662,
    "overall_throughput": 40.42005133127166,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.42158031463623,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18426,
    "batch_size": 65,
    "total_loss": 0.6833536028862,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:54:58.183018"
 }
 total tokens: 7995 num samples: 15 num padding tokens: 1805 - rank: 6 max len: 533 min len: 277 avg len: 412.6666666666667 num_loss_counted_tokens: 4022
 total tokens: 7678 num samples: 11 num padding tokens: 680 - rank: 5 max len: 698 min len: 541 avg len: 636.1818181818181 num_loss_counted_tokens: 5943
 total tokens: 6816 num samples: 4 num padding tokens: 1254 - rank: 2 max len: 1704 min len: 1064 avg len: 1390.5 num_loss_counted_tokens: 844
 total tokens: 7217 num samples: 7 num padding tokens: 458 - rank: 3 max len: 1031 min len: 875 avg len: 965.5714285714286 num_loss_counted_tokens: 3971
 total tokens: 5472 num samples: 2 num padding tokens: 101 - rank: 0 max len: 2736 min len: 2635 avg len: 2685.5 num_loss_counted_tokens: 183
 Per-token loss scaled by world size: 0.000452109903562814
 Per-token loss scaled by world size: 0.0004523490206338465
 Per-token loss scaled by world size: 7.593091595481383e-06
 Per-token loss scaled by world size: 0.0006878247950226068
 Per-token loss scaled by world size: 9.245219553122297e-05
 Per-token loss scaled by world size: 0.0003072120016440749
 Per-token loss scaled by world size: 0.0005076072411611676
 Epoch: 1, Step: 157, Rank: 5, loss = 0.9610720276832581
 Epoch: 1, Step: 157, Rank: 6, loss = 1.4613697528839111
 Epoch: 1, Step: 157, Rank: 3, loss = 0.6527103185653687
 Epoch: 1, Step: 157, Rank: 2, loss = 0.19642624258995056
 Epoch: 1, Step: 157, Rank: 4, loss = 0.9605640172958374
 Epoch: 1, Step: 157, Rank: 1, loss = 0.016132472082972527
 Epoch: 1, Step: 157, Rank: 7, loss = 1.078474998474121
 Per-token loss scaled by world size: 3.8059803046053275e-05
 Epoch: 1, Step: 157, Rank: 0, loss = 0.08086281269788742
                                                          total tokens: 7684 num samples: 4 num padding tokens: 608 - rank: 1 max len: 1921 min len: 1493 avg len: 1769.0 num_loss_counted_tokens: 2916
 total tokens: 8118 num samples: 9 num padding tokens: 805 - rank: 4 max len: 902 min len: 707 avg len: 812.5555555555555 num_loss_counted_tokens: 5091
 {
    "epoch": 1,
    "step": 157,
    "rank": 0,
    "loss": 0.08086281269788742,
    "overall_throughput": 41.84706536508769,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.473669052124023,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16997,
    "batch_size": 74,
    "total_loss": 0.6759515404701233,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:00.708630"
 }
 total tokens: 7973 num samples: 17 num padding tokens: 2030 - rank: 6 max len: 469 min len: 276 avg len: 349.5882352941176 num_loss_counted_tokens: 3185
 total tokens: 7330 num samples: 5 num padding tokens: 579 - rank: 2 max len: 1466 min len: 1164 avg len: 1350.2 num_loss_counted_tokens: 1678
 total tokens: 7656 num samples: 11 num padding tokens: 986 - rank: 5 max len: 696 min len: 545 avg len: 606.3636363636364 num_loss_counted_tokens: 3391
 total tokens: 7714 num samples: 29 num padding tokens: 2374 - rank: 7 max len: 266 min len: 78 avg len: 184.13793103448276 num_loss_counted_tokens: 2574
 total tokens: 7791 num samples: 7 num padding tokens: 885 - rank: 3 max len: 1113 min len: 906 avg len: 986.5714285714286 num_loss_counted_tokens: 4155
 total tokens: 6052 num samples: 2 num padding tokens: 661 - rank: 0 max len: 3026 min len: 2365 avg len: 2695.5 num_loss_counted_tokens: 163
 Per-token loss scaled by world size: 0.00020404128008522093
 Per-token loss scaled by world size: 0.0002751105057541281
 Per-token loss scaled by world size: 0.0002363547682762146
 Per-token loss scaled by world size: 0.0002702484780456871
 Per-token loss scaled by world size: 0.0001953808678081259
 Per-token loss scaled by world size: 0.00024261375074274838
 Per-token loss scaled by world size: 1.6901136632441194e-06
 Epoch: 1, Step: 158, Rank: 5, loss = 0.8857870101928711
 Epoch: 1, Step: 158, Rank: 4, loss = 0.870132565498352
 Epoch: 1, Step: 158, Rank: 3, loss = 0.6569619178771973
 Epoch: 1, Step: 158, Rank: 7, loss = 0.6290775537490845
 Epoch: 1, Step: 158, Rank: 2, loss = 0.7610032558441162
 Epoch: 1, Step: 158, Rank: 0, loss = 0.005441743414849043
 Epoch: 1, Step: 158, Rank: 1, loss = 0.7811556458473206
 Per-token loss scaled by world size: 0.0003746829752344638
 Epoch: 1, Step: 158, Rank: 6, loss = 1.2063854932785034
                                                          total tokens: 7932 num samples: 3 num padding tokens: 1113 - rank: 1 max len: 2644 min len: 1935 avg len: 2273.0 num_loss_counted_tokens: 430
 total tokens: 7551 num samples: 9 num padding tokens: 529 - rank: 4 max len: 839 min len: 726 avg len: 780.2222222222222 num_loss_counted_tokens: 3403
 {
    "epoch": 1,
    "step": 158,
    "rank": 0,
    "loss": 0.005441743414849043,
    "overall_throughput": 42.29815333288017,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.366318225860596,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25758,
    "batch_size": 93,
    "total_loss": 0.724493145942688,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:03.213265"
 }
 total tokens: 7185 num samples: 5 num padding tokens: 924 - rank: 2 max len: 1437 min len: 1091 avg len: 1252.2 num_loss_counted_tokens: 3371
 total tokens: 6068 num samples: 2 num padding tokens: 230 - rank: 0 max len: 3034 min len: 2804 avg len: 2919.0 num_loss_counted_tokens: 189
 total tokens: 7942 num samples: 11 num padding tokens: 1172 - rank: 5 max len: 722 min len: 546 avg len: 615.4545454545455 num_loss_counted_tokens: 3713
 total tokens: 6292 num samples: 22 num padding tokens: 2266 - rank: 7 max len: 286 min len: 80 avg len: 183.0 num_loss_counted_tokens: 1812
 total tokens: 7856 num samples: 16 num padding tokens: 1583 - rank: 6 max len: 491 min len: 299 avg len: 392.0625 num_loss_counted_tokens: 3733
 total tokens: 7308 num samples: 7 num padding tokens: 647 - rank: 3 max len: 1044 min len: 844 avg len: 951.5714285714286 num_loss_counted_tokens: 4260
 Per-token loss scaled by world size: 0.0002086303138639778
 Per-token loss scaled by world size: 0.00018750393064692616
 Per-token loss scaled by world size: 0.00023788934049662203
 Per-token loss scaled by world size: 0.00018921871378552169
 Per-token loss scaled by world size: 0.00015611379058100283
 Per-token loss scaled by world size: 0.00034976963070221245
 Per-token loss scaled by world size: 2.962535518236109e-06
 Epoch: 1, Step: 159, Rank: 6, loss = 0.6299428939819336
 Epoch: 1, Step: 159, Rank: 2, loss = 0.7992189526557922
 Epoch: 1, Step: 159, Rank: 5, loss = 1.1750948429107666
 Epoch: 1, Step: 159, Rank: 1, loss = 0.5244837999343872
 Epoch: 1, Step: 159, Rank: 7, loss = 0.6357039213180542
 Epoch: 1, Step: 159, Rank: 4, loss = 0.7009196281433105
 Epoch: 1, Step: 159, Rank: 0, loss = 0.009953008033335209

 Per-token loss scaled by world size: 0.0001541711390018463
 Epoch: 1, Step: 159, Rank: 3, loss = 0.5179572105407715
                                                          total tokens: 7248 num samples: 3 num padding tokens: 1157 - rank: 1 max len: 2416 min len: 1450 avg len: 2030.3333333333333 num_loss_counted_tokens: 491
 total tokens: 7848 num samples: 9 num padding tokens: 544 - rank: 4 max len: 872 min len: 752 avg len: 811.5555555555555 num_loss_counted_tokens: 5085
 {
    "epoch": 1,
    "step": 159,
    "rank": 0,
    "loss": 0.009953008033335209,
    "overall_throughput": 41.96663008841316,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.325264930725098,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26877,
    "batch_size": 82,
    "total_loss": 0.6241592168807983,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:05.729512"
 }
 total tokens: 7830 num samples: 18 num padding tokens: 1959 - rank: 6 max len: 435 min len: 237 avg len: 326.1666666666667 num_loss_counted_tokens: 3267
 total tokens: 8118 num samples: 11 num padding tokens: 1074 - rank: 5 max len: 738 min len: 541 avg len: 640.3636363636364 num_loss_counted_tokens: 4228
 total tokens: 7938 num samples: 7 num padding tokens: 978 - rank: 3 max len: 1134 min len: 898 avg len: 994.2857142857143 num_loss_counted_tokens: 3454
 total tokens: 7010 num samples: 5 num padding tokens: 659 - rank: 2 max len: 1402 min len: 1181 avg len: 1270.2 num_loss_counted_tokens: 5158
 total tokens: 6944 num samples: 31 num padding tokens: 2362 - rank: 7 max len: 224 min len: 74 avg len: 147.80645161290323 num_loss_counted_tokens: 1787
 total tokens: 6818 num samples: 2 num padding tokens: 131 - rank: 0 max len: 3409 min len: 3278 avg len: 3343.5 num_loss_counted_tokens: 203
 Per-token loss scaled by world size: 0.00015841875574551523
 Per-token loss scaled by world size: 0.00011544318113010377
 Per-token loss scaled by world size: 0.00030993111431598663
 Per-token loss scaled by world size: 0.00033960427390411496
 Per-token loss scaled by world size: 0.0002961684949696064
 Per-token loss scaled by world size: 0.0003685772535391152
 Epoch: 1, Step: 160, Rank: 5, loss = 0.9887577295303345
 Epoch: 1, Step: 160, Rank: 3, loss = 1.0834225416183472
 Epoch: 1, Step: 160, Rank: 6, loss = 0.9448515772819519
 Epoch: 1, Step: 160, Rank: 4, loss = 1.1758536100387573
 Epoch: 1, Step: 160, Rank: 1, loss = 0.36829259991645813
 Epoch: 1, Step: 160, Rank: 2, loss = 0.5053954124450684
 Per-token loss scaled by world size: 5.767856782767922e-05
 Epoch: 1, Step: 160, Rank: 7, loss = 0.18400904536247253
 Per-token loss scaled by world size: 0.00019863103807438165
 Epoch: 1, Step: 160, Rank: 0, loss = 0.6336826682090759
 total tokens: 7389 num samples: 9 num padding tokens: 610 - rank: 4 max len: 821 min len: 661 avg len: 753.2222222222222 num_loss_counted_tokens: 4659
 total tokens: 7668 num samples: 4 num padding tokens: 927 - rank: 1 max len: 1917 min len: 1314 avg len: 1685.25 num_loss_counted_tokens: 682
 total tokens: 7945 num samples: 35 num padding tokens: 3133 - rank: 7 max len: 227 min len: 79 avg len: 137.4857142857143 num_loss_counted_tokens: 1541
 {
    "epoch": 1,
    "step": 160,
    "rank": 0,
    "loss": 0.6336826682090759,
    "overall_throughput": 41.55102604571352,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.496148586273193,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25522,
    "batch_size": 69,
    "total_loss": 0.7355331778526306,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:08.279444"
 }
 total tokens: 7800 num samples: 12 num padding tokens: 1127 - rank: 5 max len: 650 min len: 421 avg len: 556.0833333333334 num_loss_counted_tokens: 4039
 total tokens: 7800 num samples: 6 num padding tokens: 1207 - rank: 2 max len: 1300 min len: 984 avg len: 1098.8333333333333 num_loss_counted_tokens: 3151
 total tokens: 8020 num samples: 20 num padding tokens: 2104 - rank: 6 max len: 401 min len: 238 avg len: 295.8 num_loss_counted_tokens: 3264
 total tokens: 7640 num samples: 8 num padding tokens: 396 - rank: 3 max len: 955 min len: 852 avg len: 905.5 num_loss_counted_tokens: 6041
 total tokens: 5680 num samples: 2 num padding tokens: 585 - rank: 0 max len: 2840 min len: 2255 avg len: 2547.5 num_loss_counted_tokens: 205
 Per-token loss scaled by world size: 5.620659976557363e-06
 Per-token loss scaled by world size: 0.00039823996485210955
 Per-token loss scaled by world size: 0.0004752624372486025
 Per-token loss scaled by world size: 0.00014614466635975987
 Per-token loss scaled by world size: 0.0004995565977878869
 Per-token loss scaled by world size: 0.00032942448160611093
 Per-token loss scaled by world size: 0.0004525336844380945
 Epoch: 1, Step: 161, Rank: 5, loss = 1.1807301044464111
 Epoch: 1, Step: 161, Rank: 2, loss = 0.9893774390220642
 Epoch: 1, Step: 161, Rank: 6, loss = 1.2410858869552612
 Epoch: 1, Step: 161, Rank: 3, loss = 0.36307814717292786
 Epoch: 1, Step: 161, Rank: 1, loss = 0.013963826932013035
 Epoch: 1, Step: 161, Rank: 4, loss = 0.8184139132499695
 Epoch: 1, Step: 161, Rank: 7, loss = 1.1242634057998657
 Per-token loss scaled by world size: 9.448503078601789e-06
 Epoch: 1, Step: 161, Rank: 0, loss = 0.023473624140024185
 total tokens: 7380 num samples: 3 num padding tokens: 869 - rank: 1 max len: 2460 min len: 1846 avg len: 2170.3333333333335 num_loss_counted_tokens: 394
 total tokens: 7407 num samples: 9 num padding tokens: 1144 - rank: 4 max len: 823 min len: 580 avg len: 695.8888888888889 num_loss_counted_tokens: 4039
 {
    "epoch": 1,
    "step": 161,
    "rank": 0,
    "loss": 0.023473624140024185,
    "overall_throughput": 40.61008048978514,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.50694465637207,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19875,
    "batch_size": 70,
    "total_loss": 0.7192983031272888,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:10.880154"
 }
 total tokens: 8008 num samples: 14 num padding tokens: 808 - rank: 5 max len: 572 min len: 460 avg len: 514.2857142857143 num_loss_counted_tokens: 4103
 total tokens: 7992 num samples: 8 num padding tokens: 557 - rank: 3 max len: 999 min len: 868 avg len: 929.375 num_loss_counted_tokens: 6112
 total tokens: 7704 num samples: 18 num padding tokens: 1248 - rank: 6 max len: 428 min len: 273 avg len: 358.6666666666667 num_loss_counted_tokens: 3541
 total tokens: 8028 num samples: 2 num padding tokens: 1416 - rank: 0 max len: 4014 min len: 2598 avg len: 3306.0 num_loss_counted_tokens: 164
 total tokens: 8076 num samples: 6 num padding tokens: 1073 - rank: 2 max len: 1346 min len: 1029 avg len: 1167.1666666666667 num_loss_counted_tokens: 4924
 total tokens: 6233 num samples: 23 num padding tokens: 2022 - rank: 7 max len: 271 min len: 80 avg len: 183.08695652173913 num_loss_counted_tokens: 1936
 Per-token loss scaled by world size: 0.0008293814607895911
 Per-token loss scaled by world size: 0.0006533037521876395
 Per-token loss scaled by world size: 0.0005620094598270953
 Per-token loss scaled by world size: 1.2225326827319805e-05
 Per-token loss scaled by world size: 5.4908236052142456e-05
 Per-token loss scaled by world size: 5.549823254114017e-05
 Per-token loss scaled by world size: 2.421064209556789e-06
 Epoch: 1, Step: 162, Rank: 5, loss = 1.6809488534927368
 Epoch: 1, Step: 162, Rank: 0, loss = 0.02477768063545227
 Epoch: 1, Step: 162, Rank: 4, loss = 1.1390526294708252
 Epoch: 1, Step: 162, Rank: 7, loss = 1.3240833282470703
 Epoch: 1, Step: 162, Rank: 2, loss = 0.11248104274272919
 Epoch: 1, Step: 162, Rank: 1, loss = 0.1112852692604065
 Epoch: 1, Step: 162, Rank: 3, loss = 0.004906891845166683
 Per-token loss scaled by world size: 0.0008481117547489703
 Epoch: 1, Step: 162, Rank: 6, loss = 1.7189104557037354
 total tokens: 7875 num samples: 3 num padding tokens: 1486 - rank: 1 max len: 2625 min len: 1737 avg len: 2129.6666666666665 num_loss_counted_tokens: 747
 total tokens: 7448 num samples: 8 num padding tokens: 1221 - rank: 4 max len: 931 min len: 707 avg len: 778.375 num_loss_counted_tokens: 2823
 total tokens: 6572 num samples: 4 num padding tokens: 264 - rank: 2 max len: 1643 min len: 1485 avg len: 1577.0 num_loss_counted_tokens: 3122
 {
    "epoch": 1,
    "step": 162,
    "rank": 0,
    "loss": 0.02477768063545227,
    "overall_throughput": 42.2840773469095,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.426692962646484,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 16214,
    "batch_size": 68,
    "total_loss": 0.7645557522773743,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:13.378530"
 }
 total tokens: 7760 num samples: 16 num padding tokens: 1908 - rank: 6 max len: 485 min len: 279 avg len: 365.75 num_loss_counted_tokens: 3363
 total tokens: 7944 num samples: 6 num padding tokens: 1666 - rank: 3 max len: 1324 min len: 934 avg len: 1046.3333333333333 num_loss_counted_tokens: 2604
 total tokens: 7590 num samples: 11 num padding tokens: 1066 - rank: 5 max len: 690 min len: 492 avg len: 593.0909090909091 num_loss_counted_tokens: 3430
 total tokens: 7830 num samples: 29 num padding tokens: 2293 - rank: 7 max len: 270 min len: 77 avg len: 190.93103448275863 num_loss_counted_tokens: 2270
 total tokens: 6446 num samples: 2 num padding tokens: 435 - rank: 0 max len: 3223 min len: 2788 avg len: 3005.5 num_loss_counted_tokens: 179
 Per-token loss scaled by world size: 0.00018602880300022662
 Per-token loss scaled by world size: 0.000572515360545367
 Per-token loss scaled by world size: 0.0006904263282194734
 Per-token loss scaled by world size: 0.0004460025520529598
 Per-token loss scaled by world size: 0.0004231746424920857
 Per-token loss scaled by world size: 5.620245701720705e-06
 Per-token loss scaled by world size: 8.813981935418269e-07
 Epoch: 1, Step: 163, Rank: 6, loss = 1.2291189432144165
 Epoch: 1, Step: 163, Rank: 4, loss = 1.4822590351104736
 Epoch: 1, Step: 163, Rank: 2, loss = 0.3993805944919586
 Epoch: 1, Step: 163, Rank: 0, loss = 0.012065964750945568
 Epoch: 1, Step: 163, Rank: 3, loss = 0.9575117230415344
 Epoch: 1, Step: 163, Rank: 7, loss = 0.9085030555725098
 Epoch: 1, Step: 163, Rank: 1, loss = 0.0018922517774626613
 Per-token loss scaled by world size: 0.0005087603931315243
 Epoch: 1, Step: 163, Rank: 5, loss = 1.0922449827194214
 total tokens: 7695 num samples: 9 num padding tokens: 710 - rank: 4 max len: 855 min len: 712 avg len: 776.1111111111111 num_loss_counted_tokens: 5130
 total tokens: 7552 num samples: 4 num padding tokens: 1186 - rank: 1 max len: 1888 min len: 1444 avg len: 1591.5 num_loss_counted_tokens: 754
 {
    "epoch": 1,
    "step": 163,
    "rank": 0,
    "loss": 0.012065964750945568,
    "overall_throughput": 42.679599605421174,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.32099151611328,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17175,
    "batch_size": 79,
    "total_loss": 0.7603721022605896,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:15.854736"
 }
 total tokens: 6895 num samples: 5 num padding tokens: 862 - rank: 2 max len: 1379 min len: 1055 avg len: 1206.6 num_loss_counted_tokens: 1706
 total tokens: 7287 num samples: 7 num padding tokens: 458 - rank: 3 max len: 1041 min len: 894 avg len: 975.5714285714286 num_loss_counted_tokens: 4839
 total tokens: 6182 num samples: 22 num padding tokens: 2681 - rank: 7 max len: 281 min len: 75 avg len: 159.13636363636363 num_loss_counted_tokens: 1402
 total tokens: 6306 num samples: 2 num padding tokens: 367 - rank: 0 max len: 3153 min len: 2786 avg len: 2969.5 num_loss_counted_tokens: 481
 total tokens: 8016 num samples: 16 num padding tokens: 1294 - rank: 6 max len: 501 min len: 281 avg len: 420.125 num_loss_counted_tokens: 3734
 total tokens: 7799 num samples: 11 num padding tokens: 922 - rank: 5 max len: 709 min len: 521 avg len: 625.1818181818181 num_loss_counted_tokens: 3657
 Per-token loss scaled by world size: 0.000572259072214365
 Per-token loss scaled by world size: 0.0008030128665268421
 Per-token loss scaled by world size: 0.0005713719874620438
 Per-token loss scaled by world size: 0.0008417390054091811
 Per-token loss scaled by world size: 4.514108695730101e-06
 Per-token loss scaled by world size: 1.0517849659663625e-05
 Per-token loss scaled by world size: 8.813677595753688e-06
 Epoch: 1, Step: 164, Rank: 5, loss = 1.8358327150344849
 Epoch: 1, Step: 164, Rank: 6, loss = 1.7513710260391235
 Epoch: 1, Step: 164, Rank: 7, loss = 1.2461622953414917
 Epoch: 1, Step: 164, Rank: 2, loss = 0.009845270775258541
 Epoch: 1, Step: 164, Rank: 4, loss = 1.2480970621109009
 Epoch: 1, Step: 164, Rank: 0, loss = 0.02293943054974079
 Epoch: 1, Step: 164, Rank: 1, loss = 0.019222630187869072
 Per-token loss scaled by world size: 0.0004583366389852017
 Epoch: 1, Step: 164, Rank: 3, loss = 0.9996321797370911
 total tokens: 7084 num samples: 4 num padding tokens: 676 - rank: 1 max len: 1771 min len: 1482 avg len: 1602.0 num_loss_counted_tokens: 552
 total tokens: 7983 num samples: 9 num padding tokens: 1125 - rank: 4 max len: 887 min len: 678 avg len: 762.0 num_loss_counted_tokens: 4997
 {
    "epoch": 1,
    "step": 164,
    "rank": 0,
    "loss": 0.02293943054974079,
    "overall_throughput": 41.52812214957308,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.29750394821167,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17448,
    "batch_size": 78,
    "total_loss": 0.8916378021240234,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:18.404418"
 }
 total tokens: 7848 num samples: 12 num padding tokens: 795 - rank: 5 max len: 654 min len: 498 avg len: 587.75 num_loss_counted_tokens: 5391
 total tokens: 7010 num samples: 5 num padding tokens: 860 - rank: 2 max len: 1402 min len: 1134 avg len: 1230.0 num_loss_counted_tokens: 3080
 total tokens: 7448 num samples: 7 num padding tokens: 586 - rank: 3 max len: 1064 min len: 904 avg len: 980.2857142857143 num_loss_counted_tokens: 3922
 total tokens: 7560 num samples: 28 num padding tokens: 2574 - rank: 7 max len: 270 min len: 79 avg len: 178.07142857142858 num_loss_counted_tokens: 2339
 total tokens: 7263 num samples: 3 num padding tokens: 675 - rank: 0 max len: 2421 min len: 1797 avg len: 2196.0 num_loss_counted_tokens: 329
 total tokens: 8109 num samples: 17 num padding tokens: 1719 - rank: 6 max len: 477 min len: 271 avg len: 375.88235294117646 num_loss_counted_tokens: 3620
 Per-token loss scaled by world size: 0.0004442204663064331
 Per-token loss scaled by world size: 0.00018070742953568697
 Per-token loss scaled by world size: 0.0003504411142785102
 Per-token loss scaled by world size: 4.3835102587763686e-06
 Per-token loss scaled by world size: 0.00022767498739995062
 Per-token loss scaled by world size: 7.03313219219126e-07
 Per-token loss scaled by world size: 0.0002697974268812686
 Epoch: 1, Step: 165, Rank: 2, loss = 0.5169813632965088
 Epoch: 1, Step: 165, Rank: 5, loss = 1.2708592414855957
 Epoch: 1, Step: 165, Rank: 3, loss = 1.002568244934082
 Epoch: 1, Step: 165, Rank: 4, loss = 0.651349663734436
 Epoch: 1, Step: 165, Rank: 7, loss = 0.7718567252159119
 Epoch: 1, Step: 165, Rank: 1, loss = 0.012540674768388271
 Epoch: 1, Step: 165, Rank: 0, loss = 0.0020120912231504917
 Per-token loss scaled by world size: 0.00034711475018411875
 Epoch: 1, Step: 165, Rank: 6, loss = 0.9930519461631775
 total tokens: 8085 num samples: 11 num padding tokens: 750 - rank: 4 max len: 735 min len: 600 avg len: 666.8181818181819 num_loss_counted_tokens: 3756
 total tokens: 7432 num samples: 4 num padding tokens: 599 - rank: 1 max len: 1858 min len: 1560 avg len: 1708.25 num_loss_counted_tokens: 1691
 {
    "epoch": 1,
    "step": 165,
    "rank": 0,
    "loss": 0.0020120912231504917,
    "overall_throughput": 42.759868613235994,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.2490234375,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22887,
    "batch_size": 75,
    "total_loss": 0.6526525020599365,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:20.878553"
 }
 total tokens: 2249 num samples: 13 num padding tokens: 686 - rank: 7 max len: 173 min len: 79 avg len: 120.23076923076923 num_loss_counted_tokens: 568
 total tokens: 7657 num samples: 13 num padding tokens: 1363 - rank: 5 max len: 589 min len: 381 avg len: 484.15384615384613 num_loss_counted_tokens: 3837
 total tokens: 7615 num samples: 5 num padding tokens: 1314 - rank: 2 max len: 1523 min len: 1058 avg len: 1260.2 num_loss_counted_tokens: 3900
 total tokens: 7980 num samples: 21 num padding tokens: 2329 - rank: 6 max len: 380 min len: 178 avg len: 269.0952380952381 num_loss_counted_tokens: 2730
 total tokens: 8040 num samples: 8 num padding tokens: 983 - rank: 3 max len: 1005 min len: 762 avg len: 882.125 num_loss_counted_tokens: 5456
 total tokens: 8061 num samples: 3 num padding tokens: 594 - rank: 0 max len: 2687 min len: 2311 avg len: 2489.0 num_loss_counted_tokens: 259
 Per-token loss scaled by world size: 0.0003275613998994231
 Per-token loss scaled by world size: 0.00014095827646087855
 Per-token loss scaled by world size: 0.0002651048998814076
 Per-token loss scaled by world size: 0.00036449063918553293
 Per-token loss scaled by world size: 0.00021203258074820042
 Per-token loss scaled by world size: 0.0002020968240685761
 Per-token loss scaled by world size: 1.8056784938380588e-06
 Epoch: 1, Step: 166, Rank: 1, loss = 0.4387502372264862
 Epoch: 1, Step: 166, Rank: 5, loss = 0.659977912902832
 Epoch: 1, Step: 166, Rank: 6, loss = 1.019575834274292
 Epoch: 1, Step: 166, Rank: 0, loss = 0.005620399955660105
 Epoch: 1, Step: 166, Rank: 7, loss = 0.8251721262931824
 Epoch: 1, Step: 166, Rank: 3, loss = 1.1345226764678955
 Epoch: 1, Step: 166, Rank: 4, loss = 0.6290516257286072
 Per-token loss scaled by world size: 0.0001444466906832531
 Epoch: 1, Step: 166, Rank: 2, loss = 0.44960838556289673
 total tokens: 5966 num samples: 2 num padding tokens: 617 - rank: 1 max len: 2983 min len: 2366 avg len: 2674.5 num_loss_counted_tokens: 241
 total tokens: 7798 num samples: 7 num padding tokens: 1377 - rank: 4 max len: 1114 min len: 740 avg len: 917.2857142857143 num_loss_counted_tokens: 4598
 {
    "epoch": 1,
    "step": 166,
    "rank": 0,
    "loss": 0.005620399955660105,
    "overall_throughput": 41.7691118119109,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.322949409484863,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24901,
    "batch_size": 68,
    "total_loss": 0.64528489112854,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:23.412572"
 }
 total tokens: 8100 num samples: 12 num padding tokens: 1017 - rank: 5 max len: 675 min len: 510 avg len: 590.25 num_loss_counted_tokens: 5149
 total tokens: 7824 num samples: 4 num padding tokens: 1097 - rank: 3 max len: 1956 min len: 1341 avg len: 1681.75 num_loss_counted_tokens: 3450
 total tokens: 8000 num samples: 16 num padding tokens: 2121 - rank: 6 max len: 500 min len: 290 avg len: 367.4375 num_loss_counted_tokens: 3595
 total tokens: 6925 num samples: 25 num padding tokens: 2774 - rank: 7 max len: 277 min len: 77 avg len: 166.04 num_loss_counted_tokens: 1782
 total tokens: 6918 num samples: 3 num padding tokens: 285 - rank: 2 max len: 2306 min len: 2144 avg len: 2211.0 num_loss_counted_tokens: 208
 total tokens: 6292 num samples: 2 num padding tokens: 131 - rank: 0 max len: 3146 min len: 3015 avg len: 3080.5 num_loss_counted_tokens: 174
 Per-token loss scaled by world size: 0.0004856240702793002
 Per-token loss scaled by world size: 0.0003434315149206668
 Per-token loss scaled by world size: 3.41317463607993e-05
 Per-token loss scaled by world size: 3.1640320230508223e-06
 Per-token loss scaled by world size: 1.4528293377225054e-06
 Per-token loss scaled by world size: 0.0005926437443122268
 Per-token loss scaled by world size: 0.0002455389767419547
 Epoch: 1, Step: 167, Rank: 5, loss = 1.262865424156189
 Epoch: 1, Step: 167, Rank: 2, loss = 0.08875960856676102
 Epoch: 1, Step: 167, Rank: 6, loss = 0.8930936455726624
 Epoch: 1, Step: 167, Rank: 0, loss = 0.0037780827842652798
 Epoch: 1, Step: 167, Rank: 4, loss = 1.5411700010299683
 Epoch: 1, Step: 167, Rank: 1, loss = 0.008228065446019173
 Epoch: 1, Step: 167, Rank: 7, loss = 0.6385241150856018
 Per-token loss scaled by world size: 0.0002826468553394079
 Epoch: 1, Step: 167, Rank: 3, loss = 0.7350231409072876
 total tokens: 7464 num samples: 4 num padding tokens: 501 - rank: 1 max len: 1866 min len: 1478 avg len: 1740.75 num_loss_counted_tokens: 1361
 total tokens: 7820 num samples: 10 num padding tokens: 879 - rank: 4 max len: 782 min len: 657 avg len: 694.1 num_loss_counted_tokens: 4357
 {
    "epoch": 1,
    "step": 167,
    "rank": 0,
    "loss": 0.0037780827842652798,
    "overall_throughput": 41.38714694400821,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.380234718322754,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20804,
    "batch_size": 85,
    "total_loss": 0.6464303731918335,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:25.970591"
 }
 total tokens: 6966 num samples: 27 num padding tokens: 2143 - rank: 7 max len: 258 min len: 77 avg len: 178.62962962962962 num_loss_counted_tokens: 1881
 total tokens: 8032 num samples: 8 num padding tokens: 946 - rank: 3 max len: 1004 min len: 790 avg len: 885.75 num_loss_counted_tokens: 3467
 total tokens: 7410 num samples: 6 num padding tokens: 411 - rank: 2 max len: 1235 min len: 1054 avg len: 1166.5 num_loss_counted_tokens: 3794
 total tokens: 7668 num samples: 12 num padding tokens: 1179 - rank: 5 max len: 639 min len: 424 avg len: 540.75 num_loss_counted_tokens: 2919
 total tokens: 8037 num samples: 19 num padding tokens: 1731 - rank: 6 max len: 423 min len: 269 avg len: 331.89473684210526 num_loss_counted_tokens: 3562
 total tokens: 6488 num samples: 2 num padding tokens: 894 - rank: 0 max len: 3244 min len: 2350 avg len: 2797.0 num_loss_counted_tokens: 206
 Per-token loss scaled by world size: 0.00039238386671058834
 Per-token loss scaled by world size: 0.0004944170941598713
 Per-token loss scaled by world size: 5.2391669669304974e-06
 Per-token loss scaled by world size: 0.00014728681708220392
 Per-token loss scaled by world size: 0.0004893930163234472
 Per-token loss scaled by world size: 3.5653782106237486e-05
 Per-token loss scaled by world size: 0.0004791621759068221
 Epoch: 1, Step: 168, Rank: 0, loss = 0.012632940895855427
 Epoch: 1, Step: 168, Rank: 6, loss = 1.180048942565918
 Epoch: 1, Step: 168, Rank: 5, loss = 1.1921632289886475
 Epoch: 1, Step: 168, Rank: 2, loss = 0.35514533519744873
 Epoch: 1, Step: 168, Rank: 4, loss = 0.9461355805397034
 Epoch: 1, Step: 168, Rank: 1, loss = 0.08597017824649811
 Epoch: 1, Step: 168, Rank: 7, loss = 1.1553797721862793
 Per-token loss scaled by world size: 0.00023344735382124782
 Epoch: 1, Step: 168, Rank: 3, loss = 0.5628999471664429
 total tokens: 7931 num samples: 11 num padding tokens: 374 - rank: 4 max len: 721 min len: 645 avg len: 687.0 num_loss_counted_tokens: 3685
 total tokens: 6885 num samples: 3 num padding tokens: 1645 - rank: 1 max len: 2295 min len: 1266 avg len: 1746.6666666666667 num_loss_counted_tokens: 678
 {
    "epoch": 1,
    "step": 168,
    "rank": 0,
    "loss": 0.012632940895855427,
    "overall_throughput": 41.535125133568044,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.440462589263916,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19290,
    "batch_size": 79,
    "total_loss": 0.6862969994544983,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:28.521157"
 }
 total tokens: 8112 num samples: 16 num padding tokens: 1692 - rank: 6 max len: 507 min len: 284 avg len: 401.25 num_loss_counted_tokens: 3636
 total tokens: 7821 num samples: 9 num padding tokens: 753 - rank: 3 max len: 869 min len: 722 avg len: 785.3333333333334 num_loss_counted_tokens: 5472
 total tokens: 7440 num samples: 6 num padding tokens: 983 - rank: 2 max len: 1240 min len: 874 avg len: 1076.1666666666667 num_loss_counted_tokens: 3013
 total tokens: 7728 num samples: 12 num padding tokens: 813 - rank: 5 max len: 644 min len: 530 avg len: 576.25 num_loss_counted_tokens: 4677
 total tokens: 7306 num samples: 26 num padding tokens: 2436 - rank: 7 max len: 281 min len: 81 avg len: 187.30769230769232 num_loss_counted_tokens: 2351
 total tokens: 7062 num samples: 2 num padding tokens: 653 - rank: 0 max len: 3531 min len: 2878 avg len: 3204.5 num_loss_counted_tokens: 193
 Per-token loss scaled by world size: 0.00023193543893285096
 Per-token loss scaled by world size: 0.00030478413100354373
 Per-token loss scaled by world size: 0.00034480085014365613
 Per-token loss scaled by world size: 4.721171990240691e-06
 Per-token loss scaled by world size: 4.196311692794552e-06
 Per-token loss scaled by world size: 0.00038670990034006536
 Epoch: 1, Step: 169, Rank: 3, loss = 0.8575863242149353
 Per-token loss scaled by world size: 8.75471014296636e-05
 Epoch: 1, Step: 169, Rank: 6, loss = 0.9701833724975586
 Epoch: 1, Step: 169, Rank: 0, loss = 0.013284197077155113
 Epoch: 1, Step: 169, Rank: 2, loss = 0.652608335018158
 Epoch: 1, Step: 169, Rank: 1, loss = 0.011807371862232685
 Epoch: 1, Step: 169, Rank: 4, loss = 1.0881049633026123
 Per-token loss scaled by world size: 0.0006382779683917761
 Epoch: 1, Step: 169, Rank: 7, loss = 0.24633565545082092
 Epoch: 1, Step: 169, Rank: 5, loss = 1.7959545850753784
 total tokens: 6174 num samples: 3 num padding tokens: 245 - rank: 1 max len: 2058 min len: 1856 avg len: 1976.3333333333333 num_loss_counted_tokens: 674
 total tokens: 7389 num samples: 9 num padding tokens: 742 - rank: 4 max len: 821 min len: 683 avg len: 738.5555555555555 num_loss_counted_tokens: 5047
 total tokens: 7320 num samples: 30 num padding tokens: 2107 - rank: 7 max len: 244 min len: 91 avg len: 173.76666666666668 num_loss_counted_tokens: 2225
 {
    "epoch": 1,
    "step": 169,
    "rank": 0,
    "loss": 0.013284197077155113,
    "overall_throughput": 41.77523610688796,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.364055633544922,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22510,
    "batch_size": 81,
    "total_loss": 0.7044830918312073,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:31.054810"
 }
 total tokens: 7602 num samples: 7 num padding tokens: 811 - rank: 3 max len: 1086 min len: 880 avg len: 970.1428571428571 num_loss_counted_tokens: 5560
 total tokens: 8100 num samples: 12 num padding tokens: 1477 - rank: 5 max len: 675 min len: 438 avg len: 551.9166666666666 num_loss_counted_tokens: 3835
 total tokens: 7270 num samples: 5 num padding tokens: 1169 - rank: 2 max len: 1454 min len: 1115 avg len: 1220.2 num_loss_counted_tokens: 3611
 total tokens: 5712 num samples: 2 num padding tokens: 500 - rank: 0 max len: 2856 min len: 2356 avg len: 2606.0 num_loss_counted_tokens: 161
 total tokens: 7848 num samples: 18 num padding tokens: 1235 - rank: 6 max len: 436 min len: 265 avg len: 367.3888888888889 num_loss_counted_tokens: 4095
 Per-token loss scaled by world size: 0.00010392792319180444
 Per-token loss scaled by world size: 0.00014247662329580635
 Per-token loss scaled by world size: 0.00030479932320304215
 Per-token loss scaled by world size: 0.0002555457758717239
 Per-token loss scaled by world size: 0.00023876398336142302
 Per-token loss scaled by world size: 0.0003587114915717393
 Per-token loss scaled by world size: 0.00022289040498435497
 Epoch: 1, Step: 170, Rank: 6, loss = 0.8772567510604858
 Epoch: 1, Step: 170, Rank: 4, loss = 1.0463379621505737
 Epoch: 1, Step: 170, Rank: 2, loss = 0.4891044497489929
 Epoch: 1, Step: 170, Rank: 7, loss = 0.8196468949317932
 Epoch: 1, Step: 170, Rank: 3, loss = 0.7651548981666565
 Epoch: 1, Step: 170, Rank: 5, loss = 1.2314116954803467
 Epoch: 1, Step: 170, Rank: 1, loss = 0.3567715585231781
 Per-token loss scaled by world size: 2.3532686100224964e-05
 Epoch: 1, Step: 170, Rank: 0, loss = 0.08078476786613464
 total tokens: 7016 num samples: 4 num padding tokens: 1240 - rank: 1 max len: 1754 min len: 1179 avg len: 1444.0 num_loss_counted_tokens: 1903
 total tokens: 8085 num samples: 11 num padding tokens: 649 - rank: 4 max len: 735 min len: 609 avg len: 676.0 num_loss_counted_tokens: 4383
 {
    "epoch": 1,
    "step": 170,
    "rank": 0,
    "loss": 0.08078476786613464,
    "overall_throughput": 40.4277908497992,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.523925304412842,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 27463,
    "batch_size": 84,
    "total_loss": 0.7083086371421814,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:33.666058"
 }
 total tokens: 7812 num samples: 18 num padding tokens: 1829 - rank: 6 max len: 434 min len: 242 avg len: 332.3888888888889 num_loss_counted_tokens: 3261
 total tokens: 7917 num samples: 13 num padding tokens: 1050 - rank: 5 max len: 609 min len: 441 avg len: 528.2307692307693 num_loss_counted_tokens: 3856
 total tokens: 7644 num samples: 7 num padding tokens: 494 - rank: 2 max len: 1092 min len: 972 avg len: 1021.4285714285714 num_loss_counted_tokens: 4662
 total tokens: 7260 num samples: 33 num padding tokens: 2169 - rank: 7 max len: 220 min len: 78 avg len: 154.27272727272728 num_loss_counted_tokens: 1876
 total tokens: 7704 num samples: 8 num padding tokens: 1070 - rank: 3 max len: 963 min len: 747 avg len: 829.25 num_loss_counted_tokens: 5234
 total tokens: 6336 num samples: 3 num padding tokens: 247 - rank: 0 max len: 2112 min len: 1890 avg len: 2029.6666666666667 num_loss_counted_tokens: 2181
 Per-token loss scaled by world size: 0.0003628956328611821
 Per-token loss scaled by world size: 0.0002919238177128136
 Per-token loss scaled by world size: 0.0003382969880476594
 Per-token loss scaled by world size: 7.853787246858701e-05
 Per-token loss scaled by world size: 0.00014888570876792073
 Per-token loss scaled by world size: 0.00044355227146297693
 Per-token loss scaled by world size: 0.00015364577120635659
 Epoch: 1, Step: 171, Rank: 4, loss = 0.961414635181427
 Epoch: 1, Step: 171, Rank: 3, loss = 1.1141388416290283
 Epoch: 1, Step: 171, Rank: 1, loss = 0.49033647775650024
 Epoch: 1, Step: 171, Rank: 2, loss = 0.2586546540260315
 Epoch: 1, Step: 171, Rank: 5, loss = 1.4607839584350586
 Epoch: 1, Step: 171, Rank: 6, loss = 1.195151448249817
 Epoch: 1, Step: 171, Rank: 7, loss = 0.5060131549835205
 Per-token loss scaled by world size: 3.983392525697127e-05
 Epoch: 1, Step: 171, Rank: 0, loss = 0.1311880499124527
 total tokens: 7515 num samples: 9 num padding tokens: 869 - rank: 4 max len: 835 min len: 663 avg len: 738.4444444444445 num_loss_counted_tokens: 5088
 total tokens: 8040 num samples: 4 num padding tokens: 1032 - rank: 1 max len: 2010 min len: 1576 avg len: 1752.0 num_loss_counted_tokens: 2036
 {
    "epoch": 1,
    "step": 171,
    "rank": 0,
    "loss": 0.1311880499124527,
    "overall_throughput": 40.56471721668778,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.534343242645264,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26347,
    "batch_size": 96,
    "total_loss": 0.7647101283073425,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:36.272989"
 }
 total tokens: 7752 num samples: 17 num padding tokens: 1950 - rank: 6 max len: 456 min len: 279 avg len: 341.29411764705884 num_loss_counted_tokens: 3147
 total tokens: 7596 num samples: 12 num padding tokens: 817 - rank: 5 max len: 633 min len: 491 avg len: 564.9166666666666 num_loss_counted_tokens: 5545
 total tokens: 7479 num samples: 27 num padding tokens: 3306 - rank: 7 max len: 277 min len: 84 avg len: 154.55555555555554 num_loss_counted_tokens: 1626
 total tokens: 8016 num samples: 8 num padding tokens: 490 - rank: 3 max len: 1002 min len: 848 avg len: 940.75 num_loss_counted_tokens: 5067
 total tokens: 5446 num samples: 2 num padding tokens: 669 - rank: 0 max len: 2723 min len: 2054 avg len: 2388.5 num_loss_counted_tokens: 841
 total tokens: 6835 num samples: 5 num padding tokens: 1245 - rank: 2 max len: 1367 min len: 1005 avg len: 1118.0 num_loss_counted_tokens: 616
 Per-token loss scaled by world size: 0.00026996861561201513
 Per-token loss scaled by world size: 8.522550342604518e-05
 Per-token loss scaled by world size: 0.00032177582033909857
 Per-token loss scaled by world size: 9.002388833323494e-05
 Per-token loss scaled by world size: 9.29309317143634e-05
 Per-token loss scaled by world size: 0.00024938900605775416
 Per-token loss scaled by world size: 0.00014789693523198366
 Epoch: 1, Step: 172, Rank: 0, loss = 0.3234558403491974
 Epoch: 1, Step: 172, Rank: 4, loss = 1.1561405658721924
 Epoch: 1, Step: 172, Rank: 6, loss = 0.9699972867965698
 Epoch: 1, Step: 172, Rank: 2, loss = 0.30621522665023804
 Epoch: 1, Step: 172, Rank: 1, loss = 0.3339008390903473
 Epoch: 1, Step: 172, Rank: 5, loss = 0.896054744720459
 Epoch: 1, Step: 172, Rank: 7, loss = 0.5313937067985535
 Per-token loss scaled by world size: 0.00018880210700444877
 Epoch: 1, Step: 172, Rank: 3, loss = 0.67836594581604
 total tokens: 6222 num samples: 3 num padding tokens: 850 - rank: 1 max len: 2074 min len: 1514 avg len: 1790.6666666666667 num_loss_counted_tokens: 599
 total tokens: 7579 num samples: 11 num padding tokens: 750 - rank: 4 max len: 689 min len: 580 avg len: 620.8181818181819 num_loss_counted_tokens: 3571
 {
    "epoch": 1,
    "step": 172,
    "rank": 0,
    "loss": 0.3234558403491974,
    "overall_throughput": 41.46099612058457,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.491368293762207,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 28744,
    "batch_size": 104,
    "total_loss": 0.6494404673576355,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:38.822555"
 }
 total tokens: 7328 num samples: 32 num padding tokens: 2250 - rank: 7 max len: 229 min len: 81 avg len: 158.6875 num_loss_counted_tokens: 2126
 total tokens: 7920 num samples: 10 num padding tokens: 439 - rank: 3 max len: 792 min len: 702 avg len: 748.1 num_loss_counted_tokens: 6006
 total tokens: 7617 num samples: 3 num padding tokens: 480 - rank: 0 max len: 2539 min len: 2078 avg len: 2379.0 num_loss_counted_tokens: 282
 total tokens: 7494 num samples: 6 num padding tokens: 1464 - rank: 2 max len: 1249 min len: 825 avg len: 1005.0 num_loss_counted_tokens: 3339
 total tokens: 7938 num samples: 14 num padding tokens: 1254 - rank: 5 max len: 567 min len: 424 avg len: 477.42857142857144 num_loss_counted_tokens: 4110
 total tokens: 8018 num samples: 19 num padding tokens: 1923 - rank: 6 max len: 422 min len: 231 avg len: 320.7894736842105 num_loss_counted_tokens: 3186
 Per-token loss scaled by world size: 0.0002713052381295711
 Per-token loss scaled by world size: 0.00035623108851723373
 Per-token loss scaled by world size: 0.00047955545596778393
 Per-token loss scaled by world size: 0.0002560637367423624
 Per-token loss scaled by world size: 3.385763557162136e-05
 Per-token loss scaled by world size: 5.4830157750984654e-05
 Per-token loss scaled by world size: 2.089197550958488e-06
 Epoch: 1, Step: 173, Rank: 3, loss = 0.766302764415741
 Epoch: 1, Step: 173, Rank: 2, loss = 0.101323202252388
 Epoch: 1, Step: 173, Rank: 5, loss = 1.4351296424865723
 Epoch: 1, Step: 173, Rank: 4, loss = 1.066066026687622
 Epoch: 1, Step: 173, Rank: 1, loss = 0.16408610343933105
 Epoch: 1, Step: 173, Rank: 7, loss = 0.8119148015975952
 Epoch: 1, Step: 173, Rank: 0, loss = 0.006252184975892305
 Per-token loss scaled by world size: 0.00036022928543388844
 Epoch: 1, Step: 173, Rank: 6, loss = 1.0780311822891235
 total tokens: 7947 num samples: 9 num padding tokens: 497 - rank: 4 max len: 883 min len: 762 avg len: 827.7777777777778 num_loss_counted_tokens: 6018
 total tokens: 6960 num samples: 3 num padding tokens: 675 - rank: 1 max len: 2320 min len: 1916 avg len: 2095.0 num_loss_counted_tokens: 1117
 {
    "epoch": 1,
    "step": 173,
    "rank": 0,
    "loss": 0.006252184975892305,
    "overall_throughput": 42.234314636277595,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.221298694610596,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23941,
    "batch_size": 80,
    "total_loss": 0.678638219833374,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:41.355552"
 }
 total tokens: 6350 num samples: 2 num padding tokens: 380 - rank: 0 max len: 3175 min len: 2795 avg len: 2985.0 num_loss_counted_tokens: 1089
 total tokens: 6624 num samples: 4 num padding tokens: 786 - rank: 2 max len: 1656 min len: 1233 avg len: 1459.5 num_loss_counted_tokens: 2379
 total tokens: 8041 num samples: 11 num padding tokens: 1409 - rank: 5 max len: 731 min len: 520 avg len: 602.9090909090909 num_loss_counted_tokens: 3830
 total tokens: 8064 num samples: 16 num padding tokens: 2063 - rank: 6 max len: 504 min len: 251 avg len: 375.0625 num_loss_counted_tokens: 3722
 total tokens: 7936 num samples: 32 num padding tokens: 2626 - rank: 7 max len: 248 min len: 74 avg len: 165.9375 num_loss_counted_tokens: 2201
 total tokens: 7314 num samples: 6 num padding tokens: 1159 - rank: 3 max len: 1219 min len: 923 avg len: 1025.8333333333333 num_loss_counted_tokens: 4532
 Per-token loss scaled by world size: 0.0002848200674634427
 Per-token loss scaled by world size: 0.0003533354902174324
 Per-token loss scaled by world size: 0.00015056866686791182
 Per-token loss scaled by world size: 0.00036068688496015966
 Per-token loss scaled by world size: 0.0002968825865536928
 Per-token loss scaled by world size: 0.000274753401754424
 Per-token loss scaled by world size: 2.8490408112702426e-06
 Epoch: 1, Step: 174, Rank: 6, loss = 1.022597074508667
 Epoch: 1, Step: 174, Rank: 2, loss = 0.4357645511627197
 Epoch: 1, Step: 174, Rank: 4, loss = 0.8243048787117004
 Epoch: 1, Step: 174, Rank: 5, loss = 1.0438729524612427
 Epoch: 1, Step: 174, Rank: 7, loss = 0.8592153191566467
 Epoch: 1, Step: 174, Rank: 1, loss = 0.7951706647872925
 Epoch: 1, Step: 174, Rank: 0, loss = 0.008245480246841908
 Per-token loss scaled by world size: 0.0002512831415515393
 Epoch: 1, Step: 174, Rank: 3, loss = 0.7272448539733887
 total tokens: 7217 num samples: 7 num padding tokens: 815 - rank: 4 max len: 1031 min len: 839 avg len: 914.5714285714286 num_loss_counted_tokens: 4377
 total tokens: 5458 num samples: 2 num padding tokens: 51 - rank: 1 max len: 2729 min len: 2678 avg len: 2703.5 num_loss_counted_tokens: 801
 {
    "epoch": 1,
    "step": 174,
    "rank": 0,
    "loss": 0.008245480246841908,
    "overall_throughput": 41.476173818456004,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.290608882904053,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23153,
    "batch_size": 84,
    "total_loss": 0.7145519852638245,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:43.868184"
 }
 total tokens: 8085 num samples: 15 num padding tokens: 1660 - rank: 6 max len: 539 min len: 289 avg len: 428.3333333333333 num_loss_counted_tokens: 4071
 total tokens: 6890 num samples: 5 num padding tokens: 1024 - rank: 3 max len: 1378 min len: 1085 avg len: 1173.2 num_loss_counted_tokens: 3068
 total tokens: 7680 num samples: 3 num padding tokens: 1620 - rank: 2 max len: 2560 min len: 1464 avg len: 2020.0 num_loss_counted_tokens: 1227
 total tokens: 7850 num samples: 10 num padding tokens: 1055 - rank: 5 max len: 785 min len: 541 avg len: 679.5 num_loss_counted_tokens: 4251
 total tokens: 7830 num samples: 29 num padding tokens: 3032 - rank: 7 max len: 270 min len: 79 avg len: 165.44827586206895 num_loss_counted_tokens: 2136
 total tokens: 6586 num samples: 2 num padding tokens: 49 - rank: 0 max len: 3293 min len: 3244 avg len: 3268.5 num_loss_counted_tokens: 217
 Per-token loss scaled by world size: 0.0003542072663549334
 Per-token loss scaled by world size: 0.00031207496067509055
 Per-token loss scaled by world size: 0.000552273471839726
 Per-token loss scaled by world size: 0.0003514452837407589
 Per-token loss scaled by world size: 8.925243264457094e-07
 Per-token loss scaled by world size: 2.603805114631541e-06
 Per-token loss scaled by world size: 0.00034913059789687395
 Epoch: 1, Step: 175, Rank: 4, loss = 0.9186340570449829
 Epoch: 1, Step: 175, Rank: 6, loss = 1.4435738325119019
 Epoch: 1, Step: 175, Rank: 1, loss = 0.0023329469840973616
 Epoch: 1, Step: 175, Rank: 2, loss = 0.9258535504341125
 Epoch: 1, Step: 175, Rank: 3, loss = 0.8157249093055725
 Epoch: 1, Step: 175, Rank: 0, loss = 0.006806021090596914
 Epoch: 1, Step: 175, Rank: 7, loss = 0.9125837683677673
 Per-token loss scaled by world size: 0.00037907989462837577
 Epoch: 1, Step: 175, Rank: 5, loss = 0.9908674359321594
 total tokens: 7552 num samples: 8 num padding tokens: 1076 - rank: 4 max len: 944 min len: 723 avg len: 809.5 num_loss_counted_tokens: 4001
 total tokens: 6219 num samples: 3 num padding tokens: 803 - rank: 1 max len: 2073 min len: 1455 avg len: 1805.3333333333333 num_loss_counted_tokens: 1982
 {
    "epoch": 1,
    "step": 175,
    "rank": 0,
    "loss": 0.006806021090596914,
    "overall_throughput": 41.26828372064641,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.29252052307129,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20911,
    "batch_size": 75,
    "total_loss": 0.752047061920166,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:46.430726"
 }
 total tokens: 5842 num samples: 23 num padding tokens: 1859 - rank: 7 max len: 254 min len: 86 avg len: 173.17391304347825 num_loss_counted_tokens: 1764
 total tokens: 7518 num samples: 7 num padding tokens: 452 - rank: 3 max len: 1074 min len: 955 avg len: 1009.4285714285714 num_loss_counted_tokens: 5885
 total tokens: 7803 num samples: 17 num padding tokens: 1386 - rank: 6 max len: 459 min len: 302 avg len: 377.47058823529414 num_loss_counted_tokens: 3507
 total tokens: 7755 num samples: 11 num padding tokens: 586 - rank: 5 max len: 705 min len: 578 avg len: 651.7272727272727 num_loss_counted_tokens: 3489
 total tokens: 7440 num samples: 6 num padding tokens: 402 - rank: 2 max len: 1240 min len: 1106 avg len: 1173.0 num_loss_counted_tokens: 4030
 total tokens: 7226 num samples: 2 num padding tokens: 1117 - rank: 0 max len: 3613 min len: 2496 avg len: 3054.5 num_loss_counted_tokens: 257
 Per-token loss scaled by world size: 0.00023268039512913674
 Per-token loss scaled by world size: 0.00022860463650431484
 Per-token loss scaled by world size: 0.00033770385198295116
 Per-token loss scaled by world size: 4.125645318708848e-06
 Per-token loss scaled by world size: 0.0003991488483734429
 Per-token loss scaled by world size: 1.1589580026338808e-05
 Per-token loss scaled by world size: 0.0002852912584785372
 Epoch: 1, Step: 176, Rank: 3, loss = 0.6885303854942322
 Epoch: 1, Step: 176, Rank: 0, loss = 0.012208300642669201
 Epoch: 1, Step: 176, Rank: 6, loss = 0.9993079304695129
 Epoch: 1, Step: 176, Rank: 4, loss = 1.181131362915039
 Epoch: 1, Step: 176, Rank: 2, loss = 0.6764696836471558
 Epoch: 1, Step: 176, Rank: 1, loss = 0.034295015037059784
 Epoch: 1, Step: 176, Rank: 7, loss = 0.844212532043457
 Per-token loss scaled by world size: 0.0003952819970436394
 Epoch: 1, Step: 176, Rank: 5, loss = 1.1696888208389282
 total tokens: 7189 num samples: 7 num padding tokens: 797 - rank: 4 max len: 1027 min len: 807 avg len: 913.1428571428571 num_loss_counted_tokens: 4546
 total tokens: 7528 num samples: 4 num padding tokens: 414 - rank: 1 max len: 1882 min len: 1584 avg len: 1778.5 num_loss_counted_tokens: 2995
 {
    "epoch": 1,
    "step": 176,
    "rank": 0,
    "loss": 0.012208300642669201,
    "overall_throughput": 41.56666979221221,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.38214635848999,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23673,
    "batch_size": 86,
    "total_loss": 0.7007305026054382,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:49.012721"
 }
 total tokens: 8025 num samples: 15 num padding tokens: 1836 - rank: 6 max len: 535 min len: 285 avg len: 412.6 num_loss_counted_tokens: 4071
 total tokens: 7525 num samples: 5 num padding tokens: 253 - rank: 2 max len: 1505 min len: 1379 avg len: 1454.4 num_loss_counted_tokens: 1812
 total tokens: 7860 num samples: 10 num padding tokens: 1300 - rank: 5 max len: 786 min len: 557 avg len: 656.0 num_loss_counted_tokens: 4726
 total tokens: 7980 num samples: 6 num padding tokens: 677 - rank: 3 max len: 1330 min len: 1078 avg len: 1217.1666666666667 num_loss_counted_tokens: 3981
 total tokens: 8091 num samples: 29 num padding tokens: 3107 - rank: 7 max len: 279 min len: 72 avg len: 171.86206896551724 num_loss_counted_tokens: 2093
 total tokens: 5826 num samples: 2 num padding tokens: 735 - rank: 0 max len: 2913 min len: 2178 avg len: 2545.5 num_loss_counted_tokens: 2214
 Per-token loss scaled by world size: 0.0002194504631916061
 Per-token loss scaled by world size: 0.0005286230007186532
 Per-token loss scaled by world size: 0.00018121296307072043
 Per-token loss scaled by world size: 0.00039247411768883467
 Per-token loss scaled by world size: 3.4347518521826714e-05
 Per-token loss scaled by world size: 4.241336228005821e-06
 Per-token loss scaled by world size: 0.00030080656870268285
 Epoch: 1, Step: 177, Rank: 2, loss = 0.5341705083847046
 Epoch: 1, Step: 177, Rank: 5, loss = 1.156915545463562
 Epoch: 1, Step: 177, Rank: 3, loss = 1.558248519897461
 Epoch: 1, Step: 177, Rank: 7, loss = 0.646885097026825
 Epoch: 1, Step: 177, Rank: 1, loss = 0.10124789923429489
 Epoch: 1, Step: 177, Rank: 0, loss = 0.012502399273216724
 Epoch: 1, Step: 177, Rank: 4, loss = 0.8867025375366211
 Per-token loss scaled by world size: 0.00036306059337221086
 Epoch: 1, Step: 177, Rank: 6, loss = 1.0702118873596191
 total tokens: 7792 num samples: 8 num padding tokens: 837 - rank: 4 max len: 974 min len: 724 avg len: 869.375 num_loss_counted_tokens: 4288
 total tokens: 7242 num samples: 3 num padding tokens: 572 - rank: 1 max len: 2414 min len: 1992 avg len: 2223.3333333333335 num_loss_counted_tokens: 799
 {
    "epoch": 1,
    "step": 177,
    "rank": 0,
    "loss": 0.012502399273216724,
    "overall_throughput": 42.43462885426557,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.246610641479492,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23582,
    "batch_size": 96,
    "total_loss": 0.7458605170249939,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:51.468642"
 }
 total tokens: 7854 num samples: 11 num padding tokens: 757 - rank: 5 max len: 714 min len: 600 avg len: 645.1818181818181 num_loss_counted_tokens: 5101
 total tokens: 7854 num samples: 14 num padding tokens: 1807 - rank: 6 max len: 561 min len: 335 avg len: 431.92857142857144 num_loss_counted_tokens: 3657
 total tokens: 7752 num samples: 24 num padding tokens: 3111 - rank: 7 max len: 323 min len: 76 avg len: 193.375 num_loss_counted_tokens: 2099
 total tokens: 5566 num samples: 2 num padding tokens: 100 - rank: 0 max len: 2783 min len: 2683 avg len: 2733.0 num_loss_counted_tokens: 274
 total tokens: 7026 num samples: 6 num padding tokens: 376 - rank: 3 max len: 1171 min len: 1022 avg len: 1108.3333333333333 num_loss_counted_tokens: 3075
 total tokens: 7916 num samples: 4 num padding tokens: 1590 - rank: 2 max len: 1979 min len: 1340 avg len: 1581.5 num_loss_counted_tokens: 4584
 Per-token loss scaled by world size: 0.0001956072374014184
 Per-token loss scaled by world size: 0.00021871054195798934
 Per-token loss scaled by world size: 0.0003932247345801443
 Per-token loss scaled by world size: 1.0211075277766213e-05
 Per-token loss scaled by world size: 0.00026568045723252
 Per-token loss scaled by world size: 0.00030937412520870566
 Epoch: 1, Step: 178, Rank: 2, loss = 0.689293622970581
 Per-token loss scaled by world size: 0.000248032680246979
 Epoch: 1, Step: 178, Rank: 3, loss = 0.6164806485176086
 Epoch: 1, Step: 178, Rank: 5, loss = 1.2392969131469727
 Epoch: 1, Step: 178, Rank: 1, loss = 0.032181479036808014
 Epoch: 1, Step: 178, Rank: 6, loss = 0.8373252153396606
 Epoch: 1, Step: 178, Rank: 4, loss = 0.9750311970710754
 Per-token loss scaled by world size: 3.7826398511242587e-06
 Epoch: 1, Step: 178, Rank: 7, loss = 0.7817060351371765
 Epoch: 1, Step: 178, Rank: 0, loss = 0.01192146260291338
 total tokens: 8037 num samples: 9 num padding tokens: 489 - rank: 4 max len: 893 min len: 768 avg len: 838.6666666666666 num_loss_counted_tokens: 5348
 total tokens: 7436 num samples: 4 num padding tokens: 550 - rank: 1 max len: 1859 min len: 1539 avg len: 1721.5 num_loss_counted_tokens: 2082
 total tokens: 7380 num samples: 5 num padding tokens: 481 - rank: 2 max len: 1476 min len: 1220 avg len: 1379.8 num_loss_counted_tokens: 1223
 {
    "epoch": 1,
    "step": 178,
    "rank": 0,
    "loss": 0.01192146260291338,
    "overall_throughput": 40.62829660225024,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.526740550994873,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25213,
    "batch_size": 83,
    "total_loss": 0.647904634475708,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:54.071173"
 }
 total tokens: 8000 num samples: 16 num padding tokens: 2833 - rank: 6 max len: 500 min len: 256 avg len: 322.9375 num_loss_counted_tokens: 2783
 total tokens: 7994 num samples: 7 num padding tokens: 840 - rank: 3 max len: 1142 min len: 895 avg len: 1022.0 num_loss_counted_tokens: 3757
 total tokens: 7520 num samples: 10 num padding tokens: 909 - rank: 5 max len: 752 min len: 566 avg len: 661.1 num_loss_counted_tokens: 3632
 total tokens: 7560 num samples: 30 num padding tokens: 2931 - rank: 7 max len: 252 min len: 77 avg len: 154.3 num_loss_counted_tokens: 1778
 total tokens: 6586 num samples: 2 num padding tokens: 527 - rank: 0 max len: 3293 min len: 2766 avg len: 3029.5 num_loss_counted_tokens: 357
 Per-token loss scaled by world size: 0.0004986776039004326
 Per-token loss scaled by world size: 0.0002882078697439283
 Per-token loss scaled by world size: 0.0005514522199518979
 Per-token loss scaled by world size: 0.0002959422126878053
 Per-token loss scaled by world size: 0.0005512300995178521
 Per-token loss scaled by world size: 4.8149313442991115e-06
 Per-token loss scaled by world size: 8.981861901702359e-05
 Epoch: 1, Step: 179, Rank: 5, loss = 1.2778526544570923
 Epoch: 1, Step: 179, Rank: 6, loss = 1.1555607318878174
 Epoch: 1, Step: 179, Rank: 4, loss = 0.6678496599197388
 Epoch: 1, Step: 179, Rank: 7, loss = 1.277337908744812
 Epoch: 1, Step: 179, Rank: 0, loss = 0.011157399974763393
 Epoch: 1, Step: 179, Rank: 1, loss = 0.20813219249248505
 Epoch: 1, Step: 179, Rank: 2, loss = 0.6857721209526062
 Per-token loss scaled by world size: 0.00024368343292735517
 Epoch: 1, Step: 179, Rank: 3, loss = 0.5646754503250122
 total tokens: 5876 num samples: 2 num padding tokens: 220 - rank: 1 max len: 2938 min len: 2718 avg len: 2828.0 num_loss_counted_tokens: 745
 total tokens: 7839 num samples: 9 num padding tokens: 725 - rank: 4 max len: 871 min len: 630 avg len: 790.4444444444445 num_loss_counted_tokens: 4180
 total tokens: 7287 num samples: 7 num padding tokens: 860 - rank: 3 max len: 1041 min len: 872 avg len: 918.1428571428571 num_loss_counted_tokens: 4584
 {
    "epoch": 1,
    "step": 179,
    "rank": 0,
    "loss": 0.011157399974763393,
    "overall_throughput": 41.67005705220922,
    "lr": 3.2000000000000003e-06,
    "cuda_mem_allocated": 24.338607788085938,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18538,
    "batch_size": 79,
    "total_loss": 0.7310422658920288,
    "gradnorm": 0.9710609316825867,
    "weight_norm": 433.04327392578125,
    "timestamp": "2024-08-18T20:55:56.610997"
 }
 total tokens: 8112 num samples: 13 num padding tokens: 833 - rank: 5 max len: 624 min len: 485 avg len: 559.9230769230769 num_loss_counted_tokens: 4530
 total tokens: 7329 num samples: 3 num padding tokens: 1943 - rank: 2 max len: 2443 min len: 1182 avg len: 1795.3333333333333 num_loss_counted_tokens: 502
 total tokens: 7812 num samples: 28 num padding tokens: 2498 - rank: 7 max len: 279 min len: 87 avg len: 189.78571428571428 num_loss_counted_tokens: 2490
 total tokens: 6798 num samples: 2 num padding tokens: 177 - rank: 0 max len: 3399 min len: 3222 avg len: 3310.5 num_loss_counted_tokens: 146
 total tokens: 8024 num samples: 17 num padding tokens: 1349 - rank: 6 max len: 472 min len: 304 avg len: 392.6470588235294 num_loss_counted_tokens: 4448
 Per-token loss scaled by world size: 0.0003445638285484165
 Per-token loss scaled by world size: 0.00042620761087164283
 Per-token loss scaled by world size: 7.993769395397976e-05
 Per-token loss scaled by world size: 4.4396303565008566e-05
 Per-token loss scaled by world size: 0.00013256767124403268
 Per-token loss scaled by world size: 0.0002054571668850258
 Epoch: 1, Step: 180, Rank: 1, loss = 0.21686097979545593
 Epoch: 1, Step: 180, Rank: 5, loss = 1.1562479734420776
 Per-token loss scaled by world size: 0.0001180191757157445
 Epoch: 1, Step: 180, Rank: 0, loss = 0.12044162303209305
 Epoch: 1, Step: 180, Rank: 4, loss = 0.9347586035728455
 Epoch: 1, Step: 180, Rank: 3, loss = 0.3596395254135132
 Epoch: 1, Step: 180, Rank: 7, loss = 0.5573796033859253
 Per-token loss scaled by world size: 0.00041190627962350845
 Epoch: 1, Step: 180, Rank: 2, loss = 0.3201712667942047
 Epoch: 1, Step: 180, Rank: 6, loss = 1.11745023727417
 [2024-08-18 20:55:59,149] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[4.000000000000001e-06], mom=[(0.9, 0.95)]
 [2024-08-18 20:55:59,226] [INFO] [timer.py:258:stop] epoch=0/micro_step=180/global_step=5, RunningAvgSamplesPerSec=41.67583330946055, CurrSamplesPerSec=41.56020644882237, MemAllocated=22.74GB, MaxMemAllocated=30.61GB
 total tokens: 7120 num samples: 4 num padding tokens: 1624 - rank: 1 max len: 1780 min len: 1186 avg len: 1374.0 num_loss_counted_tokens: 1016
 total tokens: 7790 num samples: 10 num padding tokens: 1050 - rank: 4 max len: 779 min len: 583 avg len: 674.0 num_loss_counted_tokens: 5078
 {
    "epoch": 1,
    "step": 180,
    "rank": 0,
    "loss": 0.12044162303209305,
    "overall_throughput": 40.381532723652576,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 22.73690176010132,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21703,
    "batch_size": 76,
    "total_loss": 0.5978687405586243,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:55:59.229405"
 }
 total tokens: 6960 num samples: 29 num padding tokens: 2603 - rank: 7 max len: 240 min len: 85 avg len: 150.24137931034483 num_loss_counted_tokens: 1642
 total tokens: 7408 num samples: 8 num padding tokens: 579 - rank: 3 max len: 926 min len: 792 avg len: 853.625 num_loss_counted_tokens: 5923
 total tokens: 8020 num samples: 20 num padding tokens: 1553 - rank: 6 max len: 401 min len: 267 avg len: 323.35 num_loss_counted_tokens: 3882
 total tokens: 5448 num samples: 2 num padding tokens: 854 - rank: 0 max len: 2724 min len: 1870 avg len: 2297.0 num_loss_counted_tokens: 185
 total tokens: 7812 num samples: 14 num padding tokens: 1049 - rank: 5 max len: 558 min len: 414 avg len: 483.07142857142856 num_loss_counted_tokens: 3625
 total tokens: 7854 num samples: 7 num padding tokens: 726 - rank: 2 max len: 1122 min len: 943 avg len: 1018.2857142857143 num_loss_counted_tokens: 3158
 Per-token loss scaled by world size: 0.00021158010349608958
 Per-token loss scaled by world size: 9.194504673359916e-05
 Per-token loss scaled by world size: 2.7597000098467106e-06
 Per-token loss scaled by world size: 0.00044244344462640584
 Per-token loss scaled by world size: 0.0002959812409244478
 Per-token loss scaled by world size: 1.918704765557777e-05
 Per-token loss scaled by world size: 0.0002461440162733197
 Epoch: 1, Step: 181, Rank: 4, loss = 0.640823245048523
 Epoch: 1, Step: 181, Rank: 5, loss = 1.3400505781173706
 Epoch: 1, Step: 181, Rank: 2, loss = 0.27847856283187866
 Epoch: 1, Step: 181, Rank: 0, loss = 0.008358441293239594
 Epoch: 1, Step: 181, Rank: 1, loss = 0.058112770318984985
 Epoch: 1, Step: 181, Rank: 7, loss = 0.8964532017707825
 Epoch: 1, Step: 181, Rank: 3, loss = 0.7455086708068848
 Per-token loss scaled by world size: 0.00037991307908669114
 Epoch: 1, Step: 181, Rank: 6, loss = 1.1506617069244385
 total tokens: 7336 num samples: 8 num padding tokens: 1044 - rank: 4 max len: 917 min len: 694 avg len: 786.5 num_loss_counted_tokens: 4411
 total tokens: 5638 num samples: 2 num padding tokens: 201 - rank: 1 max len: 2819 min len: 2618 avg len: 2718.5 num_loss_counted_tokens: 197
 {
    "epoch": 1,
    "step": 181,
    "rank": 0,
    "loss": 0.008358441293239594,
    "overall_throughput": 41.99766118896557,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.434746265411377,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24230,
    "batch_size": 85,
    "total_loss": 0.6398059129714966,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:01.770174"
 }
 total tokens: 8088 num samples: 12 num padding tokens: 1035 - rank: 5 max len: 674 min len: 501 avg len: 587.75 num_loss_counted_tokens: 4793
 total tokens: 7696 num samples: 16 num padding tokens: 1539 - rank: 6 max len: 481 min len: 270 avg len: 384.8125 num_loss_counted_tokens: 3508
 total tokens: 7340 num samples: 4 num padding tokens: 1220 - rank: 2 max len: 1835 min len: 1287 avg len: 1530.0 num_loss_counted_tokens: 2015
 total tokens: 7047 num samples: 27 num padding tokens: 2280 - rank: 7 max len: 261 min len: 83 avg len: 176.55555555555554 num_loss_counted_tokens: 1980
 total tokens: 6842 num samples: 2 num padding tokens: 194 - rank: 0 max len: 3421 min len: 3227 avg len: 3324.0 num_loss_counted_tokens: 203
 total tokens: 7308 num samples: 6 num padding tokens: 899 - rank: 3 max len: 1218 min len: 951 avg len: 1068.1666666666667 num_loss_counted_tokens: 2980
 Per-token loss scaled by world size: 0.00044985805288888514
 Per-token loss scaled by world size: 0.0003697045613080263
 Per-token loss scaled by world size: 0.00020446558482944965
 Per-token loss scaled by world size: 0.0002235985011793673
 Per-token loss scaled by world size: 0.0003323642595205456
 Per-token loss scaled by world size: 8.518856338923797e-05
 Epoch: 1, Step: 182, Rank: 3, loss = 0.6204019784927368
 Per-token loss scaled by world size: 0.00010683093569241464
 Epoch: 1, Step: 182, Rank: 5, loss = 1.0257915258407593
 Epoch: 1, Step: 182, Rank: 6, loss = 0.9221861958503723
 Epoch: 1, Step: 182, Rank: 2, loss = 0.5673153400421143
 Epoch: 1, Step: 182, Rank: 4, loss = 1.2481874227523804
 Epoch: 1, Step: 182, Rank: 1, loss = 0.23636631667613983
 Epoch: 1, Step: 182, Rank: 7, loss = 0.296415776014328
 Per-token loss scaled by world size: 2.3087804947863333e-06
 Epoch: 1, Step: 182, Rank: 0, loss = 0.006406000349670649
 total tokens: 7911 num samples: 3 num padding tokens: 1870 - rank: 1 max len: 2637 min len: 1680 avg len: 2013.6666666666667 num_loss_counted_tokens: 268
 total tokens: 3888 num samples: 18 num padding tokens: 1364 - rank: 7 max len: 216 min len: 80 avg len: 140.22222222222223 num_loss_counted_tokens: 1085
 total tokens: 7695 num samples: 9 num padding tokens: 456 - rank: 4 max len: 855 min len: 750 avg len: 804.3333333333334 num_loss_counted_tokens: 5317
 {
    "epoch": 1,
    "step": 182,
    "rank": 0,
    "loss": 0.006406000349670649,
    "overall_throughput": 40.52672826708525,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.530043125152588,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22197,
    "batch_size": 78,
    "total_loss": 0.6153838038444519,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:04.374573"
 }
 total tokens: 8024 num samples: 17 num padding tokens: 2930 - rank: 6 max len: 472 min len: 227 avg len: 299.6470588235294 num_loss_counted_tokens: 3168
 total tokens: 6560 num samples: 4 num padding tokens: 820 - rank: 2 max len: 1640 min len: 1225 avg len: 1435.0 num_loss_counted_tokens: 3801
 total tokens: 7062 num samples: 6 num padding tokens: 912 - rank: 3 max len: 1177 min len: 874 avg len: 1025.0 num_loss_counted_tokens: 4465
 total tokens: 7854 num samples: 11 num padding tokens: 1369 - rank: 5 max len: 714 min len: 483 avg len: 589.5454545454545 num_loss_counted_tokens: 3058
 total tokens: 6820 num samples: 2 num padding tokens: 89 - rank: 0 max len: 3410 min len: 3321 avg len: 3365.5 num_loss_counted_tokens: 208
 Per-token loss scaled by world size: 0.00046004995238035917
 Per-token loss scaled by world size: 0.0006659817881882191
 Per-token loss scaled by world size: 0.0003906514320988208
 Per-token loss scaled by world size: 2.6746805815491825e-05
 Per-token loss scaled by world size: 0.000365366053301841
 Per-token loss scaled by world size: 6.110809408710338e-06
 Per-token loss scaled by world size: 2.263839405713952e-06
 Epoch: 1, Step: 183, Rank: 6, loss = 1.5981065034866333
 Epoch: 1, Step: 183, Rank: 3, loss = 0.9374169707298279
 Epoch: 1, Step: 183, Rank: 4, loss = 1.103947401046753
 Epoch: 1, Step: 183, Rank: 2, loss = 0.06418230384588242
 Epoch: 1, Step: 183, Rank: 7, loss = 0.8767415285110474
 Epoch: 1, Step: 183, Rank: 0, loss = 0.005432365462183952
 Epoch: 1, Step: 183, Rank: 1, loss = 0.014663650654256344
 Per-token loss scaled by world size: 0.0007157879881560802
 Epoch: 1, Step: 183, Rank: 5, loss = 1.7176227569580078
 total tokens: 7851 num samples: 3 num padding tokens: 445 - rank: 1 max len: 2617 min len: 2389 avg len: 2468.6666666666665 num_loss_counted_tokens: 473
 total tokens: 7480 num samples: 11 num padding tokens: 807 - rank: 4 max len: 680 min len: 536 avg len: 606.6363636363636 num_loss_counted_tokens: 4807
 {
    "epoch": 1,
    "step": 183,
    "rank": 0,
    "loss": 0.005432365462183952,
    "overall_throughput": 41.57406220293697,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.319289207458496,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19197,
    "batch_size": 71,
    "total_loss": 0.7897641658782959,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:06.915463"
 }
 total tokens: 976 num samples: 8 num padding tokens: 185 - rank: 7 max len: 122 min len: 80 avg len: 98.875 num_loss_counted_tokens: 195
 total tokens: 7280 num samples: 7 num padding tokens: 746 - rank: 3 max len: 1040 min len: 851 avg len: 933.4285714285714 num_loss_counted_tokens: 5111
 total tokens: 6990 num samples: 6 num padding tokens: 218 - rank: 2 max len: 1165 min len: 1063 avg len: 1128.6666666666667 num_loss_counted_tokens: 3805
 total tokens: 7917 num samples: 21 num padding tokens: 2826 - rank: 6 max len: 377 min len: 129 avg len: 242.42857142857142 num_loss_counted_tokens: 2506
 total tokens: 7920 num samples: 15 num padding tokens: 1424 - rank: 5 max len: 528 min len: 378 avg len: 433.06666666666666 num_loss_counted_tokens: 4285
 total tokens: 5748 num samples: 2 num padding tokens: 35 - rank: 0 max len: 2874 min len: 2839 avg len: 2856.5 num_loss_counted_tokens: 168
 Per-token loss scaled by world size: 0.000344208674505353
 Per-token loss scaled by world size: 0.00029513309709727764
 Per-token loss scaled by world size: 0.0003547095402609557
 Per-token loss scaled by world size: 0.0004679278936237097
 Per-token loss scaled by world size: 4.7765744966454804e-05
 Per-token loss scaled by world size: 0.0003639253554865718
 Per-token loss scaled by world size: 4.720650849776575e-06
 Epoch: 1, Step: 184, Rank: 6, loss = 0.9553658366203308
 Epoch: 1, Step: 184, Rank: 1, loss = 0.12865106761455536
 Epoch: 1, Step: 184, Rank: 2, loss = 0.9270830750465393
 Epoch: 1, Step: 184, Rank: 4, loss = 1.2603052854537964
 Epoch: 1, Step: 184, Rank: 0, loss = 0.012714482843875885
 Epoch: 1, Step: 184, Rank: 7, loss = 0.7949041128158569
 Epoch: 1, Step: 184, Rank: 5, loss = 0.9801874756813049
 Per-token loss scaled by world size: 0.00017599221609998494
 Epoch: 1, Step: 184, Rank: 3, loss = 0.4740130305290222
                                                          total tokens: 6996 num samples: 2 num padding tokens: 1478 - rank: 1 max len: 3498 min len: 2020 avg len: 2759.0 num_loss_counted_tokens: 177
 total tokens: 7335 num samples: 9 num padding tokens: 438 - rank: 4 max len: 815 min len: 722 avg len: 766.3333333333334 num_loss_counted_tokens: 4746
 total tokens: 7648 num samples: 16 num padding tokens: 1473 - rank: 6 max len: 478 min len: 288 avg len: 385.9375 num_loss_counted_tokens: 3125
 total tokens: 7722 num samples: 11 num padding tokens: 1234 - rank: 5 max len: 702 min len: 483 avg len: 589.8181818181819 num_loss_counted_tokens: 4245
 {
    "epoch": 1,
    "step": 184,
    "rank": 0,
    "loss": 0.012714482843875885,
    "overall_throughput": 41.53668738608439,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.342710971832275,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21547,
    "batch_size": 88,
    "total_loss": 0.6916530132293701,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:09.462656"
 }
 total tokens: 7588 num samples: 28 num padding tokens: 2405 - rank: 7 max len: 271 min len: 82 avg len: 185.10714285714286 num_loss_counted_tokens: 2309
 total tokens: 7024 num samples: 4 num padding tokens: 1375 - rank: 2 max len: 1756 min len: 1193 avg len: 1412.25 num_loss_counted_tokens: 2078
 total tokens: 7448 num samples: 7 num padding tokens: 379 - rank: 3 max len: 1064 min len: 961 avg len: 1009.8571428571429 num_loss_counted_tokens: 5273
 total tokens: 4061 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4061 min len: 4061 avg len: 4061.0 num_loss_counted_tokens: 393
 Per-token loss scaled by world size: 0.00041325463098473847
 Per-token loss scaled by world size: 0.00040759582770988345
 Per-token loss scaled by world size: 0.0002913470088969916
 Per-token loss scaled by world size: 5.2812706599070225e-06
 Per-token loss scaled by world size: 9.605777449905872e-05
 Per-token loss scaled by world size: 4.5470653276424855e-05
 Per-token loss scaled by world size: 0.000289949937723577
 Epoch: 1, Step: 185, Rank: 5, loss = 1.2077573537826538
 Epoch: 1, Step: 185, Rank: 6, loss = 1.2245250940322876
 Epoch: 1, Step: 185, Rank: 0, loss = 0.015649065375328064
 Epoch: 1, Step: 185, Rank: 4, loss = 0.8632975816726685
 Epoch: 1, Step: 185, Rank: 2, loss = 0.2846311926841736
 Epoch: 1, Step: 185, Rank: 1, loss = 0.13473522663116455
 Epoch: 1, Step: 185, Rank: 7, loss = 0.859157919883728
 Per-token loss scaled by world size: 0.00039052340434864163
 Epoch: 1, Step: 185, Rank: 3, loss = 1.1571696996688843
                                                          total tokens: 7909 num samples: 11 num padding tokens: 619 - rank: 4 max len: 719 min len: 567 avg len: 662.7272727272727 num_loss_counted_tokens: 3030
 total tokens: 7710 num samples: 5 num padding tokens: 1395 - rank: 1 max len: 1542 min len: 1135 avg len: 1263.0 num_loss_counted_tokens: 2357
 total tokens: 7533 num samples: 9 num padding tokens: 442 - rank: 3 max len: 837 min len: 720 avg len: 787.8888888888889 num_loss_counted_tokens: 4365
 {
    "epoch": 1,
    "step": 185,
    "rank": 0,
    "loss": 0.015649065375328064,
    "overall_throughput": 42.01932932464499,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.411304473876953,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23705,
    "batch_size": 85,
    "total_loss": 0.7183653712272644,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:11.984726"
 }
 total tokens: 8030 num samples: 22 num padding tokens: 1301 - rank: 6 max len: 365 min len: 237 avg len: 305.8636363636364 num_loss_counted_tokens: 3940
 total tokens: 7868 num samples: 14 num padding tokens: 1280 - rank: 5 max len: 562 min len: 411 avg len: 470.57142857142856 num_loss_counted_tokens: 4766
 total tokens: 6728 num samples: 29 num padding tokens: 2588 - rank: 7 max len: 232 min len: 75 avg len: 142.75862068965517 num_loss_counted_tokens: 1514
 total tokens: 6430 num samples: 2 num padding tokens: 1586 - rank: 0 max len: 3215 min len: 1629 avg len: 2422.0 num_loss_counted_tokens: 1620
 total tokens: 7819 num samples: 7 num padding tokens: 799 - rank: 2 max len: 1117 min len: 938 avg len: 1002.8571428571429 num_loss_counted_tokens: 3592
 Per-token loss scaled by world size: 0.00034539305488578975
 Per-token loss scaled by world size: 0.0004060663341078907
 Per-token loss scaled by world size: 0.0002557812840677798
 Per-token loss scaled by world size: 0.00037697027437388897
 Per-token loss scaled by world size: 0.00021574345009867102
 Per-token loss scaled by world size: 1.922340288729174e-06
 Per-token loss scaled by world size: 1.7185264368890785e-05
 Epoch: 1, Step: 186, Rank: 6, loss = 1.187833309173584
 Epoch: 1, Step: 186, Rank: 2, loss = 0.8059667944908142
 Epoch: 1, Step: 186, Rank: 0, loss = 0.006057294551283121
 Epoch: 1, Step: 186, Rank: 3, loss = 1.0883334875106812
 Epoch: 1, Step: 186, Rank: 4, loss = 1.279515027999878
 Epoch: 1, Step: 186, Rank: 7, loss = 0.6798076033592224
 Epoch: 1, Step: 186, Rank: 1, loss = 0.054150767624378204
 Per-token loss scaled by world size: 0.00038644636515527964
 Epoch: 1, Step: 186, Rank: 5, loss = 1.217692494392395
                                                          total tokens: 8019 num samples: 9 num padding tokens: 612 - rank: 4 max len: 891 min len: 729 avg len: 823.0 num_loss_counted_tokens: 5812
 total tokens: 7845 num samples: 3 num padding tokens: 1345 - rank: 1 max len: 2615 min len: 1815 avg len: 2166.6666666666665 num_loss_counted_tokens: 455
 {
    "epoch": 1,
    "step": 186,
    "rank": 0,
    "loss": 0.006057294551283121,
    "overall_throughput": 41.92610024295953,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.250525951385498,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25208,
    "batch_size": 86,
    "total_loss": 0.7899196147918701,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:14.513363"
 }
 total tokens: 8030 num samples: 5 num padding tokens: 2103 - rank: 3 max len: 1606 min len: 905 avg len: 1185.4 num_loss_counted_tokens: 1115
 total tokens: 7990 num samples: 17 num padding tokens: 1874 - rank: 6 max len: 470 min len: 282 avg len: 359.7647058823529 num_loss_counted_tokens: 3470
 total tokens: 7040 num samples: 4 num padding tokens: 176 - rank: 2 max len: 1760 min len: 1656 avg len: 1716.0 num_loss_counted_tokens: 1048
 total tokens: 8091 num samples: 29 num padding tokens: 2852 - rank: 7 max len: 279 min len: 85 avg len: 180.6551724137931 num_loss_counted_tokens: 2267
 total tokens: 6206 num samples: 2 num padding tokens: 163 - rank: 0 max len: 3103 min len: 2940 avg len: 3021.5 num_loss_counted_tokens: 193
 total tokens: 7722 num samples: 11 num padding tokens: 1158 - rank: 5 max len: 702 min len: 471 avg len: 596.7272727272727 num_loss_counted_tokens: 4117
 Per-token loss scaled by world size: 0.0002549219934735447
 Per-token loss scaled by world size: 9.078537550522014e-05
 Per-token loss scaled by world size: 0.0001285703619942069
 Per-token loss scaled by world size: 0.00026954273926094174
 Per-token loss scaled by world size: 0.00027302553644403815
 Per-token loss scaled by world size: 0.00012735271593555808
 Per-token loss scaled by world size: 0.00019962496298830956
 Epoch: 1, Step: 187, Rank: 3, loss = 0.3104405999183655
 Epoch: 1, Step: 187, Rank: 2, loss = 0.8717057108879089
 Epoch: 1, Step: 187, Rank: 6, loss = 0.9217013716697693
 Epoch: 1, Step: 187, Rank: 1, loss = 0.4396463632583618
 Epoch: 1, Step: 187, Rank: 4, loss = 0.9336108565330505
 Epoch: 1, Step: 187, Rank: 0, loss = 0.43548262119293213
 Per-token loss scaled by world size: 0.00028984216623939574
 Epoch: 1, Step: 187, Rank: 7, loss = 0.6826175451278687
 Epoch: 1, Step: 187, Rank: 5, loss = 0.9911152720451355
                                                          total tokens: 6432 num samples: 3 num padding tokens: 1172 - rank: 1 max len: 2144 min len: 1513 avg len: 1753.3333333333333 num_loss_counted_tokens: 778
 total tokens: 7992 num samples: 8 num padding tokens: 895 - rank: 4 max len: 999 min len: 761 avg len: 887.125 num_loss_counted_tokens: 3688
 total tokens: 7580 num samples: 10 num padding tokens: 664 - rank: 5 max len: 758 min len: 591 avg len: 691.6 num_loss_counted_tokens: 5001
 {
    "epoch": 1,
    "step": 187,
    "rank": 0,
    "loss": 0.43548262119293213,
    "overall_throughput": 41.80966716175401,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.324402809143066,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 27356,
    "batch_size": 97,
    "total_loss": 0.6982901096343994,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:17.044420"
 }
 total tokens: 7756 num samples: 14 num padding tokens: 2347 - rank: 6 max len: 554 min len: 267 avg len: 386.35714285714283 num_loss_counted_tokens: 3784
 total tokens: 5145 num samples: 21 num padding tokens: 1727 - rank: 7 max len: 245 min len: 84 avg len: 162.76190476190476 num_loss_counted_tokens: 1417
 total tokens: 7055 num samples: 5 num padding tokens: 476 - rank: 2 max len: 1411 min len: 1211 avg len: 1315.8 num_loss_counted_tokens: 2646
 total tokens: 7146 num samples: 6 num padding tokens: 724 - rank: 3 max len: 1191 min len: 1017 avg len: 1070.3333333333333 num_loss_counted_tokens: 3869
 total tokens: 7376 num samples: 2 num padding tokens: 819 - rank: 0 max len: 3688 min len: 2869 avg len: 3278.5 num_loss_counted_tokens: 160
 Per-token loss scaled by world size: 0.00011057691881433129
 Per-token loss scaled by world size: 0.00035715868580155075
 Per-token loss scaled by world size: 0.0003879719879478216
 Per-token loss scaled by world size: 0.00021651088900398463
 Per-token loss scaled by world size: 6.0199621657375246e-05
 Per-token loss scaled by world size: 5.681112452293746e-05
 Per-token loss scaled by world size: 0.0005215432029217482
 Epoch: 1, Step: 188, Rank: 6, loss = 1.0699580907821655
 Epoch: 1, Step: 188, Rank: 4, loss = 1.1622670888900757
 Epoch: 1, Step: 188, Rank: 2, loss = 0.18034301698207855
 Epoch: 1, Step: 188, Rank: 1, loss = 0.3312608003616333
 Epoch: 1, Step: 188, Rank: 7, loss = 0.6486124992370605
 Epoch: 1, Step: 188, Rank: 0, loss = 0.1701919287443161
 Epoch: 1, Step: 188, Rank: 5, loss = 1.562412977218628
 Per-token loss scaled by world size: 0.0001381830807076767
 Epoch: 1, Step: 188, Rank: 3, loss = 0.4139619469642639
                                                          total tokens: 5862 num samples: 2 num padding tokens: 913 - rank: 1 max len: 2931 min len: 2018 avg len: 2474.5 num_loss_counted_tokens: 226
 total tokens: 7819 num samples: 7 num padding tokens: 1165 - rank: 4 max len: 1117 min len: 742 avg len: 950.5714285714286 num_loss_counted_tokens: 4359
 {
    "epoch": 1,
    "step": 188,
    "rank": 0,
    "loss": 0.1701919287443161,
    "overall_throughput": 40.95455841532473,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.21819305419922,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23966,
    "batch_size": 84,
    "total_loss": 0.6923760771751404,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:19.630875"
 }
 total tokens: 8085 num samples: 11 num padding tokens: 1172 - rank: 5 max len: 735 min len: 538 avg len: 628.4545454545455 num_loss_counted_tokens: 4424
 total tokens: 7680 num samples: 4 num padding tokens: 1258 - rank: 2 max len: 1920 min len: 1352 avg len: 1605.5 num_loss_counted_tokens: 1060
 total tokens: 7875 num samples: 15 num padding tokens: 1673 - rank: 6 max len: 525 min len: 296 avg len: 413.46666666666664 num_loss_counted_tokens: 4419
 total tokens: 6734 num samples: 2 num padding tokens: 36 - rank: 0 max len: 3367 min len: 3331 avg len: 3349.0 num_loss_counted_tokens: 164
 total tokens: 7920 num samples: 6 num padding tokens: 486 - rank: 3 max len: 1320 min len: 1118 avg len: 1239.0 num_loss_counted_tokens: 1697
 total tokens: 8064 num samples: 28 num padding tokens: 2898 - rank: 7 max len: 288 min len: 79 avg len: 184.5 num_loss_counted_tokens: 2302
 Per-token loss scaled by world size: 0.0002291280252393335
 Per-token loss scaled by world size: 0.00042932084761559963
 Per-token loss scaled by world size: 0.0003715125494636595
 Per-token loss scaled by world size: 0.0003486127534415573
 Per-token loss scaled by world size: 0.00029553903732448816
 Per-token loss scaled by world size: 2.541187996030203e-06
 Per-token loss scaled by world size: 4.7232209908543155e-05
 Epoch: 1, Step: 189, Rank: 5, loss = 1.2460501194000244
 Epoch: 1, Step: 189, Rank: 6, loss = 1.0782687664031982
 Epoch: 1, Step: 189, Rank: 0, loss = 0.007375480607151985
 Epoch: 1, Step: 189, Rank: 2, loss = 0.665015459060669
 Epoch: 1, Step: 189, Rank: 7, loss = 0.8577651381492615
 Epoch: 1, Step: 189, Rank: 4, loss = 1.0118049383163452
 Epoch: 1, Step: 189, Rank: 1, loss = 0.13708558678627014
 Per-token loss scaled by world size: 0.00045083268196322024
 Epoch: 1, Step: 189, Rank: 3, loss = 1.308485507965088
                                                          total tokens: 8012 num samples: 4 num padding tokens: 1085 - rank: 1 max len: 2003 min len: 1547 avg len: 1731.75 num_loss_counted_tokens: 501
 total tokens: 7704 num samples: 9 num padding tokens: 539 - rank: 4 max len: 856 min len: 750 avg len: 796.1111111111111 num_loss_counted_tokens: 5817
 {
    "epoch": 1,
    "step": 189,
    "rank": 0,
    "loss": 0.007375480607151985,
    "overall_throughput": 41.729872575916524,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.477022171020508,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23219,
    "batch_size": 98,
    "total_loss": 0.7889814376831055,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:22.162410"
 }
 total tokens: 6978 num samples: 6 num padding tokens: 878 - rank: 3 max len: 1163 min len: 953 avg len: 1016.6666666666666 num_loss_counted_tokens: 3406
 total tokens: 5496 num samples: 24 num padding tokens: 2060 - rank: 7 max len: 229 min len: 84 avg len: 143.16666666666666 num_loss_counted_tokens: 1259
 total tokens: 8074 num samples: 11 num padding tokens: 921 - rank: 5 max len: 734 min len: 585 avg len: 650.2727272727273 num_loss_counted_tokens: 3399
 total tokens: 8100 num samples: 15 num padding tokens: 2428 - rank: 6 max len: 540 min len: 251 avg len: 378.1333333333333 num_loss_counted_tokens: 2866
 total tokens: 7430 num samples: 5 num padding tokens: 440 - rank: 2 max len: 1486 min len: 1279 avg len: 1398.0 num_loss_counted_tokens: 2676
 total tokens: 7548 num samples: 2 num padding tokens: 467 - rank: 0 max len: 3774 min len: 3307 avg len: 3540.5 num_loss_counted_tokens: 178
 Per-token loss scaled by world size: 0.00038093223702162504
 Per-token loss scaled by world size: 0.0003549058164935559
 Per-token loss scaled by world size: 0.0002593057288322598
 Per-token loss scaled by world size: 0.0002407356078037992
 Per-token loss scaled by world size: 0.0001624725991860032
 Per-token loss scaled by world size: 9.274062176700681e-05
 Per-token loss scaled by world size: 3.2670384825905785e-05
 Epoch: 1, Step: 190, Rank: 6, loss = 1.1850801706314087
 Epoch: 1, Step: 190, Rank: 7, loss = 0.8067001104354858
 Epoch: 1, Step: 190, Rank: 4, loss = 1.1041120290756226
 Epoch: 1, Step: 190, Rank: 3, loss = 0.7489284873008728
 Epoch: 1, Step: 190, Rank: 0, loss = 0.10163756459951401
 Epoch: 1, Step: 190, Rank: 1, loss = 0.2885160744190216
 Epoch: 1, Step: 190, Rank: 2, loss = 0.5054522752761841
 Per-token loss scaled by world size: 0.00030219164909794927
 Epoch: 1, Step: 190, Rank: 5, loss = 0.9401181936264038
                                                          total tokens: 6304 num samples: 2 num padding tokens: 712 - rank: 1 max len: 3152 min len: 2440 avg len: 2796.0 num_loss_counted_tokens: 243
 total tokens: 7343 num samples: 7 num padding tokens: 1218 - rank: 4 max len: 1049 min len: 690 avg len: 875.0 num_loss_counted_tokens: 4702
 {
    "epoch": 1,
    "step": 190,
    "rank": 0,
    "loss": 0.10163756459951401,
    "overall_throughput": 41.867834901558076,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.32686471939087,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24888,
    "batch_size": 83,
    "total_loss": 0.7100681662559509,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:24.689299"
 }
 total tokens: 8112 num samples: 12 num padding tokens: 910 - rank: 5 max len: 676 min len: 517 avg len: 600.1666666666666 num_loss_counted_tokens: 4936
 total tokens: 4446 num samples: 19 num padding tokens: 1491 - rank: 7 max len: 234 min len: 76 avg len: 155.52631578947367 num_loss_counted_tokens: 1146
 total tokens: 7266 num samples: 2 num padding tokens: 342 - rank: 0 max len: 3633 min len: 3291 avg len: 3462.0 num_loss_counted_tokens: 203
 total tokens: 7984 num samples: 16 num padding tokens: 1646 - rank: 6 max len: 499 min len: 253 avg len: 396.125 num_loss_counted_tokens: 3681
 total tokens: 7490 num samples: 5 num padding tokens: 639 - rank: 2 max len: 1498 min len: 1299 avg len: 1370.2 num_loss_counted_tokens: 3496
 total tokens: 7590 num samples: 6 num padding tokens: 448 - rank: 3 max len: 1265 min len: 1117 avg len: 1190.3333333333333 num_loss_counted_tokens: 2752
 Per-token loss scaled by world size: 0.0003764858120121062
 Per-token loss scaled by world size: 0.0006537719164043665
 Per-token loss scaled by world size: 0.00021446413302328438
 Per-token loss scaled by world size: 9.457199485041201e-05
 Per-token loss scaled by world size: 0.00034635854535736144
 Per-token loss scaled by world size: 1.4150586139294319e-05
 Per-token loss scaled by world size: 7.414004357997328e-05
 Epoch: 1, Step: 191, Rank: 5, loss = 1.6465245485305786
 Epoch: 1, Step: 191, Rank: 2, loss = 0.23817956447601318
 Epoch: 1, Step: 191, Rank: 3, loss = 0.8723039627075195
 Epoch: 1, Step: 191, Rank: 7, loss = 0.9481794834136963
 Epoch: 1, Step: 191, Rank: 4, loss = 0.5401279330253601
 Epoch: 1, Step: 191, Rank: 0, loss = 0.03563825041055679
 Epoch: 1, Step: 191, Rank: 1, loss = 0.18672169744968414
 Per-token loss scaled by world size: 0.0005529047921299934
 Epoch: 1, Step: 191, Rank: 6, loss = 1.3924907445907593
                                                          total tokens: 7998 num samples: 3 num padding tokens: 968 - rank: 1 max len: 2666 min len: 1768 avg len: 2343.3333333333335 num_loss_counted_tokens: 631
 total tokens: 7758 num samples: 9 num padding tokens: 1421 - rank: 4 max len: 862 min len: 603 avg len: 704.1111111111111 num_loss_counted_tokens: 3783
 {
    "epoch": 1,
    "step": 191,
    "rank": 0,
    "loss": 0.03563825041055679,
    "overall_throughput": 41.91096199447346,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.354421615600586,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20148,
    "batch_size": 73,
    "total_loss": 0.7325208187103271,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:27.214193"
 }
 total tokens: 7946 num samples: 29 num padding tokens: 2882 - rank: 7 max len: 274 min len: 81 avg len: 174.6206896551724 num_loss_counted_tokens: 2096
 total tokens: 8094 num samples: 19 num padding tokens: 1335 - rank: 6 max len: 426 min len: 280 avg len: 355.7368421052632 num_loss_counted_tokens: 4052
 total tokens: 8118 num samples: 6 num padding tokens: 1664 - rank: 3 max len: 1353 min len: 917 avg len: 1075.6666666666667 num_loss_counted_tokens: 3173
 total tokens: 7826 num samples: 13 num padding tokens: 1193 - rank: 5 max len: 602 min len: 428 avg len: 510.2307692307692 num_loss_counted_tokens: 2941
 total tokens: 7920 num samples: 5 num padding tokens: 502 - rank: 2 max len: 1584 min len: 1373 avg len: 1483.6 num_loss_counted_tokens: 2395
 total tokens: 7424 num samples: 2 num padding tokens: 728 - rank: 0 max len: 3712 min len: 2984 avg len: 3348.0 num_loss_counted_tokens: 259
 Per-token loss scaled by world size: 0.00023188829072751105
 Per-token loss scaled by world size: 0.00031627173302695155
 Per-token loss scaled by world size: 0.0003686068521346897
 Per-token loss scaled by world size: 0.0001379517198074609
 Per-token loss scaled by world size: 0.00017684763588476926
 Per-token loss scaled by world size: 3.2488858323631575e-06
 Per-token loss scaled by world size: 3.9887581806397066e-05
 Epoch: 1, Step: 192, Rank: 6, loss = 1.147979974746704
 Epoch: 1, Step: 192, Rank: 4, loss = 0.42963337898254395
 Epoch: 1, Step: 192, Rank: 0, loss = 0.010118248872458935
 Epoch: 1, Step: 192, Rank: 2, loss = 0.9849887490272522
 Epoch: 1, Step: 192, Rank: 3, loss = 0.7221871018409729
 Epoch: 1, Step: 192, Rank: 7, loss = 0.5507698655128479
 Epoch: 1, Step: 192, Rank: 1, loss = 0.12422488629817963
 Per-token loss scaled by world size: 0.00024023951846174896
 Epoch: 1, Step: 192, Rank: 5, loss = 0.7481959462165833
                                                          total tokens: 6840 num samples: 3 num padding tokens: 728 - rank: 1 max len: 2280 min len: 1767 avg len: 2037.3333333333333 num_loss_counted_tokens: 324
 total tokens: 8030 num samples: 10 num padding tokens: 837 - rank: 4 max len: 803 min len: 661 avg len: 719.3 num_loss_counted_tokens: 4290
 {
    "epoch": 1,
    "step": 192,
    "rank": 0,
    "loss": 0.010118248872458935,
    "overall_throughput": 42.505572644428966,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.430901527404785,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24915,
    "batch_size": 77,
    "total_loss": 0.589762270450592,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:29.701973"
 }
 total tokens: 7060 num samples: 5 num padding tokens: 1273 - rank: 2 max len: 1412 min len: 992 avg len: 1157.4 num_loss_counted_tokens: 3150
 total tokens: 7864 num samples: 8 num padding tokens: 738 - rank: 3 max len: 983 min len: 809 avg len: 890.75 num_loss_counted_tokens: 4284
 total tokens: 7627 num samples: 29 num padding tokens: 3124 - rank: 7 max len: 263 min len: 85 avg len: 155.27586206896552 num_loss_counted_tokens: 1771
 total tokens: 7680 num samples: 12 num padding tokens: 957 - rank: 5 max len: 640 min len: 482 avg len: 560.25 num_loss_counted_tokens: 4379
 total tokens: 7837 num samples: 17 num padding tokens: 1825 - rank: 6 max len: 461 min len: 273 avg len: 353.6470588235294 num_loss_counted_tokens: 2956
 total tokens: 7128 num samples: 2 num padding tokens: 1238 - rank: 0 max len: 3564 min len: 2326 avg len: 2945.0 num_loss_counted_tokens: 463
 Per-token loss scaled by world size: 0.00031890295213088393
 Per-token loss scaled by world size: 0.0001828969834605232
 Per-token loss scaled by world size: 0.0001293038367293775
 Per-token loss scaled by world size: 0.00029604701558128
 Per-token loss scaled by world size: 0.00024164760543499142
 Per-token loss scaled by world size: 0.00021057862613815814
 Per-token loss scaled by world size: 3.072937033721246e-05
 Epoch: 1, Step: 193, Rank: 2, loss = 0.4273168742656708
 Epoch: 1, Step: 193, Rank: 6, loss = 0.9783613681793213
 Epoch: 1, Step: 193, Rank: 1, loss = 0.6044288277626038
 Epoch: 1, Step: 193, Rank: 5, loss = 1.0538945198059082
 Epoch: 1, Step: 193, Rank: 7, loss = 0.7985849380493164
 Epoch: 1, Step: 193, Rank: 0, loss = 0.10155288130044937
 Epoch: 1, Step: 193, Rank: 4, loss = 0.6959097385406494
 Per-token loss scaled by world size: 0.00022084206284489483
 Epoch: 1, Step: 193, Rank: 3, loss = 0.7298278212547302
                                                          total tokens: 6780 num samples: 5 num padding tokens: 682 - rank: 4 max len: 1356 min len: 1000 avg len: 1219.6 num_loss_counted_tokens: 3729
 total tokens: 7341 num samples: 3 num padding tokens: 810 - rank: 1 max len: 2447 min len: 1943 avg len: 2177.0 num_loss_counted_tokens: 1272
 {
    "epoch": 1,
    "step": 193,
    "rank": 0,
    "loss": 0.10155288130044937,
    "overall_throughput": 41.24658740563778,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.264228343963623,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26438,
    "batch_size": 78,
    "total_loss": 0.6737346053123474,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:32.267001"
 }
 total tokens: 7935 num samples: 15 num padding tokens: 1362 - rank: 6 max len: 529 min len: 333 avg len: 438.2 num_loss_counted_tokens: 4416
 total tokens: 7840 num samples: 8 num padding tokens: 1773 - rank: 5 max len: 980 min len: 597 avg len: 758.375 num_loss_counted_tokens: 3956
 total tokens: 5744 num samples: 2 num padding tokens: 30 - rank: 0 max len: 2872 min len: 2842 avg len: 2857.0 num_loss_counted_tokens: 152
 total tokens: 7596 num samples: 4 num padding tokens: 617 - rank: 2 max len: 1899 min len: 1609 avg len: 1744.75 num_loss_counted_tokens: 1330
 total tokens: 7644 num samples: 26 num padding tokens: 3101 - rank: 7 max len: 294 min len: 79 avg len: 174.73076923076923 num_loss_counted_tokens: 1861
 total tokens: 7405 num samples: 5 num padding tokens: 157 - rank: 3 max len: 1481 min len: 1402 avg len: 1449.6 num_loss_counted_tokens: 2757
 Per-token loss scaled by world size: 0.00030297457124106586
 Per-token loss scaled by world size: 0.000216660147998482
 Per-token loss scaled by world size: 0.00037809842615388334
 Per-token loss scaled by world size: 0.00046558064059354365
 Per-token loss scaled by world size: 0.00023001583758741617
 Per-token loss scaled by world size: 6.362871499732137e-05
 Per-token loss scaled by world size: 1.6167678040801547e-05
 Epoch: 1, Step: 194, Rank: 5, loss = 1.3895835876464844
 Epoch: 1, Step: 194, Rank: 6, loss = 0.9042654633522034
 Epoch: 1, Step: 194, Rank: 4, loss = 1.1284819841384888
 Epoch: 1, Step: 194, Rank: 1, loss = 0.18990786373615265
 Epoch: 1, Step: 194, Rank: 7, loss = 0.6466493010520935
 Epoch: 1, Step: 194, Rank: 3, loss = 0.6865110397338867
 Epoch: 1, Step: 194, Rank: 0, loss = 0.048254456371068954
 Per-token loss scaled by world size: 0.0002928127069026232
 Epoch: 1, Step: 194, Rank: 2, loss = 0.873936116695404
                                                          total tokens: 7677 num samples: 3 num padding tokens: 483 - rank: 1 max len: 2559 min len: 2205 avg len: 2398.0 num_loss_counted_tokens: 1070
 total tokens: 7784 num samples: 8 num padding tokens: 1598 - rank: 4 max len: 973 min len: 692 avg len: 773.25 num_loss_counted_tokens: 3677
 {
    "epoch": 1,
    "step": 194,
    "rank": 0,
    "loss": 0.048254456371068954,
    "overall_throughput": 41.96848339305704,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.232909202575684,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23877,
    "batch_size": 72,
    "total_loss": 0.7334486842155457,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:34.786179"
 }
 total tokens: 6501 num samples: 3 num padding tokens: 1290 - rank: 2 max len: 2167 min len: 1410 avg len: 1737.0 num_loss_counted_tokens: 241
 total tokens: 7221 num samples: 29 num padding tokens: 2731 - rank: 7 max len: 249 min len: 78 avg len: 154.82758620689654 num_loss_counted_tokens: 1846
 total tokens: 8080 num samples: 20 num padding tokens: 1685 - rank: 6 max len: 404 min len: 251 avg len: 319.75 num_loss_counted_tokens: 3499
 total tokens: 6885 num samples: 5 num padding tokens: 603 - rank: 3 max len: 1377 min len: 1090 avg len: 1256.4 num_loss_counted_tokens: 2165
 total tokens: 7980 num samples: 12 num padding tokens: 1568 - rank: 5 max len: 665 min len: 429 avg len: 534.3333333333334 num_loss_counted_tokens: 4731
 total tokens: 6814 num samples: 2 num padding tokens: 843 - rank: 0 max len: 3407 min len: 2564 avg len: 2985.5 num_loss_counted_tokens: 553
 Per-token loss scaled by world size: 0.00011609335342654958
 Per-token loss scaled by world size: 0.00037096577580086887
 Per-token loss scaled by world size: 0.000276279344689101
 Per-token loss scaled by world size: 0.0002262169582536444
 Per-token loss scaled by world size: 0.0004427096282597631
 Per-token loss scaled by world size: 0.00033836739021353424
 Per-token loss scaled by world size: 1.8575705325929448e-05
 Epoch: 1, Step: 195, Rank: 4, loss = 0.9719303250312805
 Epoch: 1, Step: 195, Rank: 7, loss = 0.7238518595695496
 Epoch: 1, Step: 195, Rank: 2, loss = 0.3041645884513855
 Epoch: 1, Step: 195, Rank: 0, loss = 0.04866834729909897
 Epoch: 1, Step: 195, Rank: 1, loss = 0.5926884412765503
 Epoch: 1, Step: 195, Rank: 5, loss = 0.8865225911140442
 Epoch: 1, Step: 195, Rank: 6, loss = 1.1598992347717285
 Per-token loss scaled by world size: 0.00017340479826088995
 Epoch: 1, Step: 195, Rank: 3, loss = 0.4543205797672272
                                                          total tokens: 6090 num samples: 3 num padding tokens: 726 - rank: 1 max len: 2030 min len: 1643 avg len: 1788.0 num_loss_counted_tokens: 1593
 total tokens: 7690 num samples: 10 num padding tokens: 460 - rank: 4 max len: 769 min len: 680 avg len: 723.0 num_loss_counted_tokens: 4881
 {
    "epoch": 1,
    "step": 195,
    "rank": 0,
    "loss": 0.04866834729909897,
    "overall_throughput": 41.3547409747129,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.354421615600586,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20960,
    "batch_size": 83,
    "total_loss": 0.6427558064460754,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:37.382964"
 }
 total tokens: 7967 num samples: 31 num padding tokens: 2761 - rank: 7 max len: 257 min len: 83 avg len: 167.93548387096774 num_loss_counted_tokens: 2276
 total tokens: 6960 num samples: 5 num padding tokens: 813 - rank: 2 max len: 1392 min len: 1089 avg len: 1229.4 num_loss_counted_tokens: 1530
 total tokens: 7856 num samples: 16 num padding tokens: 1984 - rank: 6 max len: 491 min len: 260 avg len: 367.0 num_loss_counted_tokens: 3780
 total tokens: 7602 num samples: 7 num padding tokens: 1100 - rank: 3 max len: 1086 min len: 795 avg len: 928.8571428571429 num_loss_counted_tokens: 5015
 total tokens: 7469 num samples: 11 num padding tokens: 1062 - rank: 5 max len: 679 min len: 515 avg len: 582.4545454545455 num_loss_counted_tokens: 4102
 total tokens: 6152 num samples: 2 num padding tokens: 117 - rank: 0 max len: 3076 min len: 2959 avg len: 3017.5 num_loss_counted_tokens: 200
 Per-token loss scaled by world size: 0.00046574202133342624
 Per-token loss scaled by world size: 2.0194725038891193e-06
 Per-token loss scaled by world size: 0.0005153888487257063
 Per-token loss scaled by world size: 0.0002995604299940169
 Per-token loss scaled by world size: 0.0004275553219486028
 Per-token loss scaled by world size: 3.1239229429047555e-05
 Per-token loss scaled by world size: 3.525143984006718e-05
 Epoch: 1, Step: 196, Rank: 6, loss = 1.3931604623794556
 Epoch: 1, Step: 196, Rank: 3, loss = 1.2589589357376099Epoch: 1, Step: 196, Rank: 0, loss = 0.005458886735141277

 Epoch: 1, Step: 196, Rank: 4, loss = 0.8097493052482605Epoch: 1, Step: 196, Rank: 2, loss = 0.08444354683160782Epoch: 1, Step: 196, Rank: 1, loss = 0.09528905153274536Epoch: 1, Step: 196, Rank: 7, loss = 1.1557354927062988



 Per-token loss scaled by world size: 0.0005320304771885276
 Epoch: 1, Step: 196, Rank: 5, loss = 1.4381449222564697
                                                          total tokens: 6570 num samples: 3 num padding tokens: 733 - rank: 1 max len: 2190 min len: 1806 avg len: 1945.6666666666667 num_loss_counted_tokens: 897
 total tokens: 8064 num samples: 9 num padding tokens: 1305 - rank: 4 max len: 896 min len: 651 avg len: 751.0 num_loss_counted_tokens: 4732
 {
    "epoch": 1,
    "step": 196,
    "rank": 0,
    "loss": 0.005458886735141277,
    "overall_throughput": 41.48583814053106,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.379756450653076,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21625,
    "batch_size": 81,
    "total_loss": 0.7801175117492676,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:39.894850"
 }
 total tokens: 7520 num samples: 5 num padding tokens: 1864 - rank: 3 max len: 1504 min len: 905 avg len: 1131.2 num_loss_counted_tokens: 2489
 total tokens: 7856 num samples: 16 num padding tokens: 1953 - rank: 6 max len: 491 min len: 301 avg len: 368.9375 num_loss_counted_tokens: 3951
 total tokens: 7020 num samples: 4 num padding tokens: 441 - rank: 2 max len: 1755 min len: 1533 avg len: 1644.75 num_loss_counted_tokens: 2927
 total tokens: 7965 num samples: 27 num padding tokens: 3021 - rank: 7 max len: 295 min len: 88 avg len: 183.11111111111111 num_loss_counted_tokens: 2199
 total tokens: 7226 num samples: 2 num padding tokens: 1285 - rank: 0 max len: 3613 min len: 2328 avg len: 2970.5 num_loss_counted_tokens: 471
 total tokens: 7764 num samples: 12 num padding tokens: 807 - rank: 5 max len: 647 min len: 495 avg len: 579.75 num_loss_counted_tokens: 4070
 Per-token loss scaled by world size: 0.0001413007703376934
 Per-token loss scaled by world size: 0.0003590704873204231
 Per-token loss scaled by world size: 0.000253127800533548
 Per-token loss scaled by world size: 0.00044859678018838167
 Per-token loss scaled by world size: 7.146921416278929e-05
 Per-token loss scaled by world size: 0.0002180417359340936
 Per-token loss scaled by world size: 9.443299404665595e-07
 Epoch: 1, Step: 197, Rank: 5, loss = 1.1000573635101318
 Epoch: 1, Step: 197, Rank: 2, loss = 0.4328925609588623
 Epoch: 1, Step: 197, Rank: 3, loss = 0.7754886150360107
 Epoch: 1, Step: 197, Rank: 1, loss = 0.21895486116409302
 Epoch: 1, Step: 197, Rank: 4, loss = 1.374332308769226
 Epoch: 1, Step: 197, Rank: 7, loss = 0.6679981350898743
 Epoch: 1, Step: 197, Rank: 0, loss = 0.002893072785809636
 Per-token loss scaled by world size: 0.00030740915099158883
 Epoch: 1, Step: 197, Rank: 6, loss = 0.9417863488197327
                                                          total tokens: 6920 num samples: 4 num padding tokens: 227 - rank: 1 max len: 1730 min len: 1622 avg len: 1673.25 num_loss_counted_tokens: 2502
 total tokens: 7749 num samples: 9 num padding tokens: 674 - rank: 4 max len: 861 min len: 729 avg len: 786.1111111111111 num_loss_counted_tokens: 3445
 {
    "epoch": 1,
    "step": 197,
    "rank": 0,
    "loss": 0.002893072785809636,
    "overall_throughput": 41.993274454566276,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.219207286834717,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 24509,
    "batch_size": 94,
    "total_loss": 0.6893004179000854,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:42.416522"
 }
 total tokens: 7620 num samples: 15 num padding tokens: 2321 - rank: 6 max len: 508 min len: 250 avg len: 353.26666666666665 num_loss_counted_tokens: 3281
 total tokens: 7766 num samples: 11 num padding tokens: 899 - rank: 5 max len: 706 min len: 532 avg len: 624.2727272727273 num_loss_counted_tokens: 4764
 total tokens: 5782 num samples: 2 num padding tokens: 1020 - rank: 0 max len: 2891 min len: 1871 avg len: 2381.0 num_loss_counted_tokens: 1847
 total tokens: 8015 num samples: 5 num padding tokens: 1364 - rank: 2 max len: 1603 min len: 1206 avg len: 1330.2 num_loss_counted_tokens: 1802
 total tokens: 7968 num samples: 32 num padding tokens: 2753 - rank: 7 max len: 249 min len: 81 avg len: 162.96875 num_loss_counted_tokens: 2041
 total tokens: 8008 num samples: 7 num padding tokens: 830 - rank: 3 max len: 1144 min len: 885 avg len: 1025.4285714285713 num_loss_counted_tokens: 4315
 Per-token loss scaled by world size: 0.00019664198043756187
 Per-token loss scaled by world size: 0.0002436544600641355
 Per-token loss scaled by world size: 7.737488886050414e-06
 Per-token loss scaled by world size: 0.0004965663538314402
 Per-token loss scaled by world size: 5.262534159555798e-06
 Per-token loss scaled by world size: 0.000477502413559705
 Per-token loss scaled by world size: 0.00032058294164016843
 Epoch: 1, Step: 198, Rank: 3, loss = 0.611785888671875
 Epoch: 1, Step: 198, Rank: 0, loss = 0.019427867606282234
 Epoch: 1, Step: 198, Rank: 6, loss = 1.2468160390853882
 Epoch: 1, Step: 198, Rank: 2, loss = 0.4937434196472168
 Epoch: 1, Step: 198, Rank: 1, loss = 0.013213565573096275
 Epoch: 1, Step: 198, Rank: 4, loss = 1.198948860168457
 Epoch: 1, Step: 198, Rank: 7, loss = 0.8049436807632446
 Per-token loss scaled by world size: 0.0005729582044295967
 Epoch: 1, Step: 198, Rank: 5, loss = 1.4386264085769653
                                                          total tokens: 7452 num samples: 9 num padding tokens: 706 - rank: 4 max len: 828 min len: 701 avg len: 749.5555555555555 num_loss_counted_tokens: 4285
 total tokens: 7940 num samples: 5 num padding tokens: 529 - rank: 1 max len: 1588 min len: 1397 avg len: 1482.2 num_loss_counted_tokens: 3826
 total tokens: 7634 num samples: 11 num padding tokens: 983 - rank: 5 max len: 694 min len: 488 avg len: 604.6363636363636 num_loss_counted_tokens: 4895
 {
    "epoch": 1,
    "step": 198,
    "rank": 0,
    "loss": 0.019427867606282234,
    "overall_throughput": 41.69130867994642,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.38558578491211,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20087,
    "batch_size": 77,
    "total_loss": 0.7284382581710815,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:44.992595"
 }
 total tokens: 7712 num samples: 16 num padding tokens: 2400 - rank: 6 max len: 482 min len: 260 avg len: 332.0 num_loss_counted_tokens: 2891
 total tokens: 7206 num samples: 3 num padding tokens: 834 - rank: 0 max len: 2402 min len: 1692 avg len: 2124.0 num_loss_counted_tokens: 304
 total tokens: 7511 num samples: 29 num padding tokens: 2662 - rank: 7 max len: 259 min len: 84 avg len: 167.20689655172413 num_loss_counted_tokens: 1997
 total tokens: 7902 num samples: 6 num padding tokens: 937 - rank: 2 max len: 1317 min len: 1043 avg len: 1160.8333333333333 num_loss_counted_tokens: 5167
 total tokens: 7928 num samples: 8 num padding tokens: 746 - rank: 3 max len: 991 min len: 833 avg len: 897.75 num_loss_counted_tokens: 4239
 Per-token loss scaled by world size: 0.00035409454721957445
 Per-token loss scaled by world size: 0.00035435750032775104
 Per-token loss scaled by world size: 0.00030246065580286086
 Per-token loss scaled by world size: 9.443990165891591e-06
 Per-token loss scaled by world size: 1.890566181828035e-06
 Per-token loss scaled by world size: 0.00016794257680885494
 Per-token loss scaled by world size: 0.00030859385151416063
 Epoch: 1, Step: 199, Rank: 3, loss = 0.9465774893760681
 Epoch: 1, Step: 199, Rank: 1, loss = 0.0050501748919487
 Epoch: 1, Step: 199, Rank: 0, loss = 0.0252272579818964
 Epoch: 1, Step: 199, Rank: 2, loss = 0.8079480528831482
 Epoch: 1, Step: 199, Rank: 4, loss = 0.9458750486373901
 Epoch: 1, Step: 199, Rank: 5, loss = 0.8243313431739807
 Epoch: 1, Step: 199, Rank: 7, loss = 0.4486165940761566
 Per-token loss scaled by world size: 0.00041021130164153874
 Epoch: 1, Step: 199, Rank: 6, loss = 1.095776915550232
                                                          total tokens: 7592 num samples: 13 num padding tokens: 445 - rank: 4 max len: 584 min len: 509 avg len: 549.7692307692307 num_loss_counted_tokens: 5215
 total tokens: 6915 num samples: 5 num padding tokens: 522 - rank: 1 max len: 1383 min len: 1143 avg len: 1278.6 num_loss_counted_tokens: 1111
 total tokens: 8064 num samples: 36 num padding tokens: 2321 - rank: 7 max len: 224 min len: 71 avg len: 159.52777777777777 num_loss_counted_tokens: 2309
 {
    "epoch": 1,
    "step": 199,
    "rank": 0,
    "loss": 0.0252272579818964,
    "overall_throughput": 42.03544117506762,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.382384777069092,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21370,
    "batch_size": 70,
    "total_loss": 0.6374253630638123,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:47.471975"
 }
 total tokens: 7920 num samples: 16 num padding tokens: 1517 - rank: 5 max len: 495 min len: 348 avg len: 400.1875 num_loss_counted_tokens: 4230
 total tokens: 7693 num samples: 7 num padding tokens: 1201 - rank: 2 max len: 1099 min len: 763 avg len: 927.4285714285714 num_loss_counted_tokens: 3819
 total tokens: 7610 num samples: 10 num padding tokens: 880 - rank: 3 max len: 761 min len: 590 avg len: 673.0 num_loss_counted_tokens: 4994
 total tokens: 8004 num samples: 23 num padding tokens: 1596 - rank: 6 max len: 348 min len: 229 avg len: 278.60869565217394 num_loss_counted_tokens: 3462
 total tokens: 7608 num samples: 3 num padding tokens: 1669 - rank: 0 max len: 2536 min len: 1608 avg len: 1979.6666666666667 num_loss_counted_tokens: 938
 Per-token loss scaled by world size: 0.0003396017709746957
 Per-token loss scaled by world size: 0.0003830210189335048
 Per-token loss scaled by world size: 0.000320710358209908
 Per-token loss scaled by world size: 0.0005103153525851667
 Per-token loss scaled by world size: 0.0006997347227297723
 Per-token loss scaled by world size: 1.8154447616325342e-06
 Epoch: 1, Step: 200, Rank: 6, loss = 0.8558957576751709
 Epoch: 1, Step: 200, Rank: 3, loss = 0.9063122272491455
 Epoch: 1, Step: 200, Rank: 2, loss = 1.022187352180481
 Epoch: 1, Step: 200, Rank: 5, loss = 1.3619040250778198
 Epoch: 1, Step: 200, Rank: 4, loss = 1.8674170970916748
 Per-token loss scaled by world size: 3.414201637497172e-05
 Epoch: 1, Step: 200, Rank: 0, loss = 0.0048449682071805
 Per-token loss scaled by world size: 5.252029586699791e-05
 Epoch: 1, Step: 200, Rank: 7, loss = 0.09111650288105011
 Epoch: 1, Step: 200, Rank: 1, loss = 0.14016354084014893
                                                          total tokens: 7851 num samples: 3 num padding tokens: 1199 - rank: 1 max len: 2617 min len: 1961 avg len: 2217.3333333333335 num_loss_counted_tokens: 624
 total tokens: 8019 num samples: 27 num padding tokens: 2910 - rank: 7 max len: 297 min len: 79 avg len: 189.22222222222223 num_loss_counted_tokens: 2220
 total tokens: 7539 num samples: 7 num padding tokens: 949 - rank: 4 max len: 1077 min len: 826 avg len: 941.4285714285714 num_loss_counted_tokens: 4098
 {
    "epoch": 1,
    "step": 200,
    "rank": 0,
    "loss": 0.0048449682071805,
    "overall_throughput": 41.06569124957365,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.254440784454346,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21350,
    "batch_size": 73,
    "total_loss": 0.7812302112579346,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:50.045786"
 }
 total tokens: 7458 num samples: 11 num padding tokens: 2562 - rank: 6 max len: 678 min len: 317 avg len: 445.09090909090907 num_loss_counted_tokens: 2861
 total tokens: 7362 num samples: 9 num padding tokens: 766 - rank: 5 max len: 818 min len: 687 avg len: 732.8888888888889 num_loss_counted_tokens: 4108
 total tokens: 7554 num samples: 6 num padding tokens: 667 - rank: 3 max len: 1259 min len: 1083 avg len: 1147.8333333333333 num_loss_counted_tokens: 931
 total tokens: 5794 num samples: 2 num padding tokens: 43 - rank: 0 max len: 2897 min len: 2854 avg len: 2875.5 num_loss_counted_tokens: 721
 total tokens: 7444 num samples: 4 num padding tokens: 878 - rank: 2 max len: 1861 min len: 1304 avg len: 1641.5 num_loss_counted_tokens: 1667
 Per-token loss scaled by world size: 0.0003313705965410918
 Per-token loss scaled by world size: 0.00037777406396344304
 Per-token loss scaled by world size: 0.00013144082913640887
 Per-token loss scaled by world size: 0.00039307758561335504
 Per-token loss scaled by world size: 0.00033154338598251343
 Per-token loss scaled by world size: 1.7019568986142986e-05
 Per-token loss scaled by world size: 5.37650066689821e-06
 Epoch: 1, Step: 201, Rank: 6, loss = 0.9256009459495544
 Epoch: 1, Step: 201, Rank: 2, loss = 0.3671470880508423
 Epoch: 1, Step: 201, Rank: 0, loss = 0.04753991216421127
 Epoch: 1, Step: 201, Rank: 3, loss = 1.0552173852920532
 Epoch: 1, Step: 201, Rank: 4, loss = 1.0979639291763306
 Epoch: 1, Step: 201, Rank: 7, loss = 0.9260835647583008
 Epoch: 1, Step: 201, Rank: 1, loss = 0.015017910860478878
 Per-token loss scaled by world size: 0.0004141141544096172
 Epoch: 1, Step: 201, Rank: 5, loss = 1.1567243337631226
                                                          total tokens: 7734 num samples: 3 num padding tokens: 874 - rank: 1 max len: 2578 min len: 2141 avg len: 2286.6666666666665 num_loss_counted_tokens: 217
 total tokens: 7511 num samples: 7 num padding tokens: 887 - rank: 4 max len: 1073 min len: 833 avg len: 946.2857142857143 num_loss_counted_tokens: 5956
 {
    "epoch": 1,
    "step": 201,
    "rank": 0,
    "loss": 0.04753991216421127,
    "overall_throughput": 42.99976331813581,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.05379819869995,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 22346,
    "batch_size": 78,
    "total_loss": 0.6989118456840515,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:52.499632"
 }
 total tokens: 6544 num samples: 2 num padding tokens: 303 - rank: 0 max len: 3272 min len: 2969 avg len: 3120.5 num_loss_counted_tokens: 203
 total tokens: 8040 num samples: 15 num padding tokens: 2803 - rank: 6 max len: 536 min len: 243 avg len: 349.1333333333333 num_loss_counted_tokens: 2637
 total tokens: 6354 num samples: 3 num padding tokens: 420 - rank: 2 max len: 2118 min len: 1749 avg len: 1978.0 num_loss_counted_tokens: 450
 total tokens: 7452 num samples: 9 num padding tokens: 1025 - rank: 5 max len: 828 min len: 555 avg len: 714.1111111111111 num_loss_counted_tokens: 3110
 total tokens: 7005 num samples: 5 num padding tokens: 460 - rank: 3 max len: 1401 min len: 1187 avg len: 1309.0 num_loss_counted_tokens: 3212
 total tokens: 6902 num samples: 29 num padding tokens: 3026 - rank: 7 max len: 238 min len: 78 avg len: 133.6551724137931 num_loss_counted_tokens: 1459
 Per-token loss scaled by world size: 0.00022836425341665745
 Per-token loss scaled by world size: 0.0001815208961488679
 Per-token loss scaled by world size: 8.564612653572112e-05
 Per-token loss scaled by world size: 0.0004483639495447278
 Per-token loss scaled by world size: 0.00014287869271356612
 Per-token loss scaled by world size: 0.0001819162571337074
 Per-token loss scaled by world size: 0.0001619806425878778
 Epoch: 1, Step: 202, Rank: 2, loss = 0.5714277625083923
 Epoch: 1, Step: 202, Rank: 5, loss = 1.411449670791626
 Epoch: 1, Step: 202, Rank: 1, loss = 0.26961401104927063
 Epoch: 1, Step: 202, Rank: 3, loss = 0.7188906669616699
 Epoch: 1, Step: 202, Rank: 0, loss = 0.4497821033000946
 Epoch: 1, Step: 202, Rank: 4, loss = 0.5726723670959473
 Epoch: 1, Step: 202, Rank: 7, loss = 0.5099150538444519
 Per-token loss scaled by world size: 0.0003626368416007608
 Epoch: 1, Step: 202, Rank: 6, loss = 1.1415808200836182
                                                          total tokens: 7960 num samples: 4 num padding tokens: 506 - rank: 1 max len: 1990 min len: 1681 avg len: 1863.5 num_loss_counted_tokens: 885
 total tokens: 8028 num samples: 9 num padding tokens: 771 - rank: 4 max len: 892 min len: 741 avg len: 806.3333333333334 num_loss_counted_tokens: 4521
 {
    "epoch": 1,
    "step": 202,
    "rank": 0,
    "loss": 0.4497821033000946,
    "overall_throughput": 41.605158110193564,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.336650848388672,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 25184,
    "batch_size": 99,
    "total_loss": 0.7056666016578674,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:55.026480"
 }
 total tokens: 8040 num samples: 30 num padding tokens: 2761 - rank: 7 max len: 268 min len: 77 avg len: 175.96666666666667 num_loss_counted_tokens: 2391
 total tokens: 8090 num samples: 5 num padding tokens: 1606 - rank: 2 max len: 1618 min len: 1055 avg len: 1296.8 num_loss_counted_tokens: 2154
 total tokens: 7315 num samples: 7 num padding tokens: 483 - rank: 3 max len: 1045 min len: 932 avg len: 976.0 num_loss_counted_tokens: 4783
 total tokens: 7920 num samples: 16 num padding tokens: 1848 - rank: 6 max len: 495 min len: 269 avg len: 379.5 num_loss_counted_tokens: 3451
 total tokens: 7579 num samples: 11 num padding tokens: 1108 - rank: 5 max len: 689 min len: 497 avg len: 588.2727272727273 num_loss_counted_tokens: 4657
 total tokens: 7665 num samples: 3 num padding tokens: 442 - rank: 0 max len: 2555 min len: 2201 avg len: 2407.6666666666665 num_loss_counted_tokens: 2954
 Per-token loss scaled by world size: 0.0001610679319128394
 Per-token loss scaled by world size: 0.000670548586640507
 Per-token loss scaled by world size: 5.642953510687221e-06
 Per-token loss scaled by world size: 0.0005202414467930794
 Per-token loss scaled by world size: 0.00042746320832520723
 Per-token loss scaled by world size: 1.2397128557495307e-05
 Per-token loss scaled by world size: 0.0004940929939039052
 Epoch: 1, Step: 203, Rank: 5, loss = 1.5487158298492432
 Epoch: 1, Step: 203, Rank: 2, loss = 0.37200650572776794
 Epoch: 1, Step: 203, Rank: 0, loss = 0.013033106923103333
 Epoch: 1, Step: 203, Rank: 6, loss = 1.2015626430511475
 Epoch: 1, Step: 203, Rank: 1, loss = 0.028632719069719315
 Epoch: 1, Step: 203, Rank: 4, loss = 1.141169548034668
 Epoch: 1, Step: 203, Rank: 7, loss = 0.9872797131538391
 Per-token loss scaled by world size: 0.00014795419701840729
 Epoch: 1, Step: 203, Rank: 3, loss = 0.3417187035083771
                                                          total tokens: 5726 num samples: 2 num padding tokens: 666 - rank: 1 max len: 2863 min len: 2197 avg len: 2530.0 num_loss_counted_tokens: 483
 total tokens: 7280 num samples: 8 num padding tokens: 395 - rank: 4 max len: 910 min len: 794 avg len: 860.625 num_loss_counted_tokens: 4783
 {
    "epoch": 1,
    "step": 203,
    "rank": 0,
    "loss": 0.013033106923103333,
    "overall_throughput": 41.573804652467835,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.309247970581055,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18477,
    "batch_size": 80,
    "total_loss": 0.7042648792266846,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:56:57.572853"
 }
 total tokens: 7865 num samples: 13 num padding tokens: 1975 - rank: 6 max len: 605 min len: 321 avg len: 453.0769230769231 num_loss_counted_tokens: 4477
 total tokens: 7068 num samples: 4 num padding tokens: 740 - rank: 2 max len: 1767 min len: 1462 avg len: 1582.0 num_loss_counted_tokens: 1481
 total tokens: 6864 num samples: 22 num padding tokens: 2832 - rank: 7 max len: 312 min len: 80 avg len: 183.27272727272728 num_loss_counted_tokens: 1790
 total tokens: 7750 num samples: 10 num padding tokens: 711 - rank: 5 max len: 775 min len: 654 avg len: 703.9 num_loss_counted_tokens: 2490
 total tokens: 7506 num samples: 6 num padding tokens: 1303 - rank: 3 max len: 1251 min len: 942 avg len: 1033.8333333333333 num_loss_counted_tokens: 3894
 total tokens: 7992 num samples: 2 num padding tokens: 51 - rank: 0 max len: 3996 min len: 3945 avg len: 3970.5 num_loss_counted_tokens: 176
 Per-token loss scaled by world size: 0.0005467137671075761
 Per-token loss scaled by world size: 0.0001344903139397502
 Per-token loss scaled by world size: 2.761472160273115e-06
 Per-token loss scaled by world size: 0.0002821572998072952
 Per-token loss scaled by world size: 0.00027734090690501034
 Per-token loss scaled by world size: 4.0628190618008375e-05
 Epoch: 1, Step: 204, Rank: 2, loss = 0.3588033616542816
 Epoch: 1, Step: 204, Rank: 5, loss = 1.4585639238357544
 Epoch: 1, Step: 204, Rank: 0, loss = 0.007367262616753578
 Epoch: 1, Step: 204, Rank: 4, loss = 0.7527604103088379
 Per-token loss scaled by world size: 0.00022612858447246253
 Epoch: 1, Step: 204, Rank: 1, loss = 0.1083909347653389
 Epoch: 1, Step: 204, Rank: 3, loss = 0.7399108409881592
 Per-token loss scaled by world size: 0.0005020878161303699
 Epoch: 1, Step: 204, Rank: 7, loss = 0.6032828092575073
 Epoch: 1, Step: 204, Rank: 6, loss = 1.3395075798034668
                                                          total tokens: 6300 num samples: 3 num padding tokens: 1167 - rank: 1 max len: 2100 min len: 1431 avg len: 1711.0 num_loss_counted_tokens: 1709
 total tokens: 7578 num samples: 9 num padding tokens: 661 - rank: 4 max len: 842 min len: 721 avg len: 768.5555555555555 num_loss_counted_tokens: 5406
 {
    "epoch": 1,
    "step": 204,
    "rank": 0,
    "loss": 0.007367262616753578,
    "overall_throughput": 42.3404581250765,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.448826789855957,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21343,
    "batch_size": 69,
    "total_loss": 0.6710734367370605,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:00.072190"
 }
 total tokens: 7812 num samples: 28 num padding tokens: 2702 - rank: 7 max len: 279 min len: 93 avg len: 182.5 num_loss_counted_tokens: 2270
 total tokens: 8109 num samples: 17 num padding tokens: 1674 - rank: 6 max len: 477 min len: 282 avg len: 378.52941176470586 num_loss_counted_tokens: 3720
 total tokens: 7854 num samples: 11 num padding tokens: 1060 - rank: 5 max len: 714 min len: 484 avg len: 617.6363636363636 num_loss_counted_tokens: 5323
 total tokens: 5918 num samples: 2 num padding tokens: 56 - rank: 0 max len: 2959 min len: 2903 avg len: 2931.0 num_loss_counted_tokens: 173
 total tokens: 7658 num samples: 7 num padding tokens: 462 - rank: 3 max len: 1094 min len: 870 avg len: 1028.0 num_loss_counted_tokens: 4939
 total tokens: 8088 num samples: 6 num padding tokens: 933 - rank: 2 max len: 1348 min len: 1098 avg len: 1192.5 num_loss_counted_tokens: 6453
 Per-token loss scaled by world size: 0.0004229408223181963
 Per-token loss scaled by world size: 6.994641353230691e-06
 Per-token loss scaled by world size: 2.8473236852732953e-06
 Per-token loss scaled by world size: 0.0008069836185313761
 Per-token loss scaled by world size: 0.00010431646660435945
 Per-token loss scaled by world size: 0.00042939934064634144
 Per-token loss scaled by world size: 0.0006417424301616848
 Epoch: 1, Step: 205, Rank: 2, loss = 0.24320079386234283
 Epoch: 1, Step: 205, Rank: 6, loss = 1.8813813924789429
 Epoch: 1, Step: 205, Rank: 4, loss = 1.0010908842086792
 Epoch: 1, Step: 205, Rank: 7, loss = 0.9860336780548096
 Epoch: 1, Step: 205, Rank: 0, loss = 0.016307132318615913
 Epoch: 1, Step: 205, Rank: 5, loss = 1.4961422681808472
 Epoch: 1, Step: 205, Rank: 1, loss = 0.006638179067522287
 Per-token loss scaled by world size: 0.00024272690643556416
 Epoch: 1, Step: 205, Rank: 3, loss = 0.565887451171875
                                                          total tokens: 7544 num samples: 4 num padding tokens: 665 - rank: 1 max len: 1886 min len: 1535 avg len: 1719.75 num_loss_counted_tokens: 918
 total tokens: 7810 num samples: 10 num padding tokens: 835 - rank: 4 max len: 781 min len: 634 avg len: 697.5 num_loss_counted_tokens: 3430
 {
    "epoch": 1,
    "step": 205,
    "rank": 0,
    "loss": 0.016307132318615913,
    "overall_throughput": 41.43704192183203,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.372108459472656,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 18651,
    "batch_size": 75,
    "total_loss": 0.7745852470397949,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:02.626787"
 }
 total tokens: 7680 num samples: 6 num padding tokens: 744 - rank: 2 max len: 1280 min len: 1064 avg len: 1156.0 num_loss_counted_tokens: 6296
 total tokens: 6860 num samples: 28 num padding tokens: 2293 - rank: 7 max len: 245 min len: 78 avg len: 163.10714285714286 num_loss_counted_tokens: 1793
 total tokens: 8112 num samples: 13 num padding tokens: 1290 - rank: 5 max len: 624 min len: 454 avg len: 524.7692307692307 num_loss_counted_tokens: 5007
 total tokens: 8046 num samples: 18 num padding tokens: 1903 - rank: 6 max len: 447 min len: 248 avg len: 341.27777777777777 num_loss_counted_tokens: 3492
 total tokens: 5766 num samples: 2 num padding tokens: 993 - rank: 0 max len: 2883 min len: 1890 avg len: 2386.5 num_loss_counted_tokens: 1080
 total tokens: 7856 num samples: 8 num padding tokens: 604 - rank: 3 max len: 982 min len: 795 avg len: 906.5 num_loss_counted_tokens: 3800
 Per-token loss scaled by world size: 0.0005271086702123284
 Per-token loss scaled by world size: 0.00045938679249957204
 Per-token loss scaled by world size: 3.3147989597637206e-06
 Per-token loss scaled by world size: 2.3308498384722043e-06
 Per-token loss scaled by world size: 0.0002847542054951191
 Per-token loss scaled by world size: 0.0002816052583511919
 Per-token loss scaled by world size: 0.00022615509806200862
 Epoch: 1, Step: 206, Rank: 1, loss = 0.005856843199580908
 Epoch: 1, Step: 206, Rank: 6, loss = 1.324492335319519
 Epoch: 1, Step: 206, Rank: 0, loss = 0.008329261094331741
 Epoch: 1, Step: 206, Rank: 2, loss = 0.7076036334037781
 Epoch: 1, Step: 206, Rank: 3, loss = 0.7155161499977112
 Epoch: 1, Step: 206, Rank: 4, loss = 1.1543241739273071
 Epoch: 1, Step: 206, Rank: 7, loss = 0.5682712197303772
 Per-token loss scaled by world size: 0.0005236774450168014
 Epoch: 1, Step: 206, Rank: 5, loss = 1.3158705234527588
                                                          total tokens: 7600 num samples: 5 num padding tokens: 1200 - rank: 4 max len: 1520 min len: 945 avg len: 1280.0 num_loss_counted_tokens: 2473
 total tokens: 6054 num samples: 2 num padding tokens: 522 - rank: 1 max len: 3027 min len: 2505 avg len: 2766.0 num_loss_counted_tokens: 167
 total tokens: 6924 num samples: 3 num padding tokens: 422 - rank: 2 max len: 2308 min len: 2080 avg len: 2167.3333333333335 num_loss_counted_tokens: 427
 {
    "epoch": 1,
    "step": 206,
    "rank": 0,
    "loss": 0.008329261094331741,
    "overall_throughput": 41.21002850546722,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.469753742218018,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 20102,
    "batch_size": 76,
    "total_loss": 0.7250330448150635,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:05.194525"
 }
 total tokens: 7872 num samples: 24 num padding tokens: 2881 - rank: 7 max len: 328 min len: 83 avg len: 207.95833333333334 num_loss_counted_tokens: 2171
 total tokens: 7536 num samples: 8 num padding tokens: 1132 - rank: 5 max len: 942 min len: 619 avg len: 800.5 num_loss_counted_tokens: 4424
 total tokens: 7956 num samples: 13 num padding tokens: 1551 - rank: 6 max len: 612 min len: 344 avg len: 492.6923076923077 num_loss_counted_tokens: 3611
 total tokens: 8112 num samples: 4 num padding tokens: 1080 - rank: 3 max len: 2028 min len: 1605 avg len: 1758.0 num_loss_counted_tokens: 520
 total tokens: 7128 num samples: 2 num padding tokens: 321 - rank: 0 max len: 3564 min len: 3243 avg len: 3403.5 num_loss_counted_tokens: 173
 Per-token loss scaled by world size: 0.0001850179978646338
 Per-token loss scaled by world size: 3.4617110031831544e-06
 Per-token loss scaled by world size: 0.0002683971542865038
 Per-token loss scaled by world size: 0.0003899486910086125
 Per-token loss scaled by world size: 0.00042667845264077187
 Per-token loss scaled by world size: 8.096924830169883e-06
 Per-token loss scaled by world size: 0.00022364444157574326
 Epoch: 1, Step: 207, Rank: 2, loss = 0.7098768949508667
 Epoch: 1, Step: 207, Rank: 0, loss = 0.009155793115496635
 Epoch: 1, Step: 207, Rank: 6, loss = 1.0313655138015747
 Epoch: 1, Step: 207, Rank: 3, loss = 0.48934948444366455
 Epoch: 1, Step: 207, Rank: 1, loss = 0.021415354683995247
 Epoch: 1, Step: 207, Rank: 4, loss = 1.1285111904144287
 Epoch: 1, Step: 207, Rank: 7, loss = 0.591511607170105
 Per-token loss scaled by world size: 0.00048212718684226274
 Epoch: 1, Step: 207, Rank: 5, loss = 1.2751661539077759
                                                          total tokens: 7845 num samples: 5 num padding tokens: 421 - rank: 1 max len: 1569 min len: 1404 avg len: 1484.8 num_loss_counted_tokens: 4961
 total tokens: 7887 num samples: 11 num padding tokens: 927 - rank: 4 max len: 717 min len: 576 avg len: 632.7272727272727 num_loss_counted_tokens: 3254
 total tokens: 6467 num samples: 29 num padding tokens: 1852 - rank: 7 max len: 223 min len: 79 avg len: 159.13793103448276 num_loss_counted_tokens: 1984
 {
    "epoch": 1,
    "step": 207,
    "rank": 0,
    "loss": 0.009155793115496635,
    "overall_throughput": 41.547768908524304,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.43647813796997,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21159,
    "batch_size": 69,
    "total_loss": 0.6570440530776978,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:07.743363"
 }
 total tokens: 7980 num samples: 20 num padding tokens: 2050 - rank: 6 max len: 399 min len: 224 avg len: 296.5 num_loss_counted_tokens: 3262
 total tokens: 8024 num samples: 8 num padding tokens: 757 - rank: 3 max len: 1003 min len: 749 avg len: 908.375 num_loss_counted_tokens: 5777
 total tokens: 8022 num samples: 14 num padding tokens: 1040 - rank: 5 max len: 573 min len: 420 avg len: 498.7142857142857 num_loss_counted_tokens: 4255
 total tokens: 7974 num samples: 6 num padding tokens: 1244 - rank: 2 max len: 1329 min len: 1027 avg len: 1121.6666666666667 num_loss_counted_tokens: 4559
 total tokens: 5482 num samples: 2 num padding tokens: 958 - rank: 0 max len: 2741 min len: 1783 avg len: 2262.0 num_loss_counted_tokens: 672
 Per-token loss scaled by world size: 0.0004824527713935822
 Per-token loss scaled by world size: 0.000384941027732566
 Per-token loss scaled by world size: 0.0005819547805003822
 Per-token loss scaled by world size: 0.0003608883998822421
 Per-token loss scaled by world size: 0.00046005993499420583
 Per-token loss scaled by world size: 5.933908323640935e-05
 Per-token loss scaled by world size: 1.4446718523686286e-05
 Epoch: 1, Step: 208, Rank: 6, loss = 1.4061481952667236
 Epoch: 1, Step: 208, Rank: 3, loss = 1.1657265424728394
 Epoch: 1, Step: 208, Rank: 5, loss = 0.9301137924194336
 Epoch: 1, Step: 208, Rank: 1, loss = 0.14337806403636932
 Epoch: 1, Step: 208, Rank: 7, loss = 0.8719965815544128
 Epoch: 1, Step: 208, Rank: 0, loss = 0.03490688279271126
 Epoch: 1, Step: 208, Rank: 4, loss = 1.1116198301315308
 Per-token loss scaled by world size: 0.0002515815431252122
 Epoch: 1, Step: 208, Rank: 2, loss = 0.607883870601654
 total tokens: 6438 num samples: 2 num padding tokens: 348 - rank: 1 max len: 3219 min len: 2871 avg len: 3045.0 num_loss_counted_tokens: 205
 total tokens: 7539 num samples: 7 num padding tokens: 636 - rank: 4 max len: 1077 min len: 894 avg len: 986.1428571428571 num_loss_counted_tokens: 4629
 {
    "epoch": 1,
    "step": 208,
    "rank": 0,
    "loss": 0.03490688279271126,
    "overall_throughput": 40.69133919277144,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.454561710357666,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19330,
    "batch_size": 86,
    "total_loss": 0.7839717268943787,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:10.341855"
 }
 total tokens: 7893 num samples: 9 num padding tokens: 1279 - rank: 5 max len: 877 min len: 621 avg len: 734.8888888888889 num_loss_counted_tokens: 3674
 total tokens: 7657 num samples: 13 num padding tokens: 2094 - rank: 6 max len: 589 min len: 288 avg len: 427.9230769230769 num_loss_counted_tokens: 3574
 total tokens: 5474 num samples: 2 num padding tokens: 292 - rank: 2 max len: 2737 min len: 2445 avg len: 2591.0 num_loss_counted_tokens: 182
 total tokens: 7275 num samples: 5 num padding tokens: 1073 - rank: 3 max len: 1455 min len: 1087 avg len: 1240.4 num_loss_counted_tokens: 3573
 total tokens: 6975 num samples: 25 num padding tokens: 2126 - rank: 7 max len: 279 min len: 79 avg len: 193.96 num_loss_counted_tokens: 2228
 total tokens: 7194 num samples: 2 num padding tokens: 95 - rank: 0 max len: 3597 min len: 3502 avg len: 3549.5 num_loss_counted_tokens: 205
 Per-token loss scaled by world size: 0.0006150374538265169
 Per-token loss scaled by world size: 0.00033614260610193014
 Per-token loss scaled by world size: 0.0003573091235011816
 Per-token loss scaled by world size: 0.0002815852640196681
 Per-token loss scaled by world size: 0.00023183257144410163
 Per-token loss scaled by world size: 2.870956450351514e-05
 Per-token loss scaled by world size: 2.8179058062960394e-05
 Epoch: 1, Step: 209, Rank: 6, loss = 0.96549391746521
 Epoch: 1, Step: 209, Rank: 4, loss = 0.9082993268966675
 Epoch: 1, Step: 209, Rank: 5, loss = 1.6619080305099487
 Epoch: 1, Step: 209, Rank: 2, loss = 0.7608785629272461
 Epoch: 1, Step: 209, Rank: 7, loss = 0.6264405846595764
 Epoch: 1, Step: 209, Rank: 0, loss = 0.07757683098316193
 Epoch: 1, Step: 209, Rank: 1, loss = 0.07614333927631378
 Per-token loss scaled by world size: 0.00017559101979713887
 Epoch: 1, Step: 209, Rank: 3, loss = 0.4744688868522644
 total tokens: 7210 num samples: 5 num padding tokens: 1177 - rank: 1 max len: 1442 min len: 1085 avg len: 1206.6 num_loss_counted_tokens: 3033
 total tokens: 7776 num samples: 9 num padding tokens: 1371 - rank: 4 max len: 864 min len: 614 avg len: 711.6666666666666 num_loss_counted_tokens: 2696
 {
    "epoch": 1,
    "step": 209,
    "rank": 0,
    "loss": 0.07757683098316193,
    "overall_throughput": 41.93509249352839,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.419190883636475,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21617,
    "batch_size": 86,
    "total_loss": 0.6939011812210083,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:12.865160"
 }
 total tokens: 7826 num samples: 13 num padding tokens: 1029 - rank: 5 max len: 602 min len: 416 avg len: 522.8461538461538 num_loss_counted_tokens: 4387
 total tokens: 7904 num samples: 19 num padding tokens: 1674 - rank: 6 max len: 416 min len: 259 avg len: 327.89473684210526 num_loss_counted_tokens: 3581
 total tokens: 7758 num samples: 3 num padding tokens: 1159 - rank: 0 max len: 2586 min len: 1912 avg len: 2199.6666666666665 num_loss_counted_tokens: 346
 total tokens: 7568 num samples: 8 num padding tokens: 373 - rank: 3 max len: 946 min len: 874 avg len: 899.375 num_loss_counted_tokens: 4817
 total tokens: 7441 num samples: 7 num padding tokens: 404 - rank: 2 max len: 1063 min len: 956 avg len: 1005.2857142857143 num_loss_counted_tokens: 3350
 total tokens: 7967 num samples: 31 num padding tokens: 2428 - rank: 7 max len: 257 min len: 80 avg len: 178.67741935483872 num_loss_counted_tokens: 2210
 Per-token loss scaled by world size: 0.0002890804025810212
 Per-token loss scaled by world size: 0.0005653423140756786
 Per-token loss scaled by world size: 0.000670413370244205
 Per-token loss scaled by world size: 9.497793507762253e-05
 Per-token loss scaled by world size: 2.160387111871387e-06
 Per-token loss scaled by world size: 9.657991176936775e-05
 Per-token loss scaled by world size: 0.00031631108140572906
 Epoch: 1, Step: 210, Rank: 5, loss = 1.631869912147522
 Epoch: 1, Step: 210, Rank: 2, loss = 0.23118816316127777
 Epoch: 1, Step: 210, Rank: 1, loss = 0.2350875735282898
 Epoch: 1, Step: 210, Rank: 4, loss = 1.3761138916015625
 Epoch: 1, Step: 210, Rank: 3, loss = 0.703657865524292
 Epoch: 1, Step: 210, Rank: 0, loss = 0.005258652381598949
 Epoch: 1, Step: 210, Rank: 7, loss = 0.7699407339096069
 Per-token loss scaled by world size: 0.000578251841943711
 Epoch: 1, Step: 210, Rank: 6, loss = 1.4075372219085693
 total tokens: 7904 num samples: 8 num padding tokens: 750 - rank: 4 max len: 988 min len: 839 avg len: 894.25 num_loss_counted_tokens: 5760
 total tokens: 7317 num samples: 3 num padding tokens: 636 - rank: 1 max len: 2439 min len: 1946 avg len: 2227.0 num_loss_counted_tokens: 845
 {
    "epoch": 1,
    "step": 210,
    "rank": 0,
    "loss": 0.005258652381598949,
    "overall_throughput": 42.45218456081296,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.25443983078003,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 19473,
    "batch_size": 68,
    "total_loss": 0.7950817346572876,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:15.357201"
 }
 total tokens: 7389 num samples: 9 num padding tokens: 1138 - rank: 5 max len: 821 min len: 562 avg len: 694.5555555555555 num_loss_counted_tokens: 4077
 total tokens: 7860 num samples: 15 num padding tokens: 2296 - rank: 6 max len: 524 min len: 263 avg len: 370.93333333333334 num_loss_counted_tokens: 3347
 total tokens: 6396 num samples: 26 num padding tokens: 1827 - rank: 7 max len: 246 min len: 83 avg len: 175.73076923076923 num_loss_counted_tokens: 2093
 total tokens: 7170 num samples: 6 num padding tokens: 733 - rank: 3 max len: 1195 min len: 1001 avg len: 1072.8333333333333 num_loss_counted_tokens: 2366
 total tokens: 7220 num samples: 4 num padding tokens: 329 - rank: 2 max len: 1805 min len: 1615 avg len: 1722.75 num_loss_counted_tokens: 1729
 total tokens: 7124 num samples: 2 num padding tokens: 264 - rank: 0 max len: 3562 min len: 3298 avg len: 3430.0 num_loss_counted_tokens: 186
 Per-token loss scaled by world size: 0.0008867266005836427
 Per-token loss scaled by world size: 0.00016864115605130792
 Per-token loss scaled by world size: 3.4453678381396458e-06
 Per-token loss scaled by world size: 7.723766611889005e-05
 Per-token loss scaled by world size: 7.638386159669608e-05
 Per-token loss scaled by world size: 0.0004412019916344434
 Per-token loss scaled by world size: 0.00035067120916210115
 Epoch: 1, Step: 211, Rank: 5, loss = 1.9709715843200684
 Epoch: 1, Step: 211, Rank: 3, loss = 0.3748471140861511
 Epoch: 1, Step: 211, Rank: 1, loss = 0.16978223621845245
 Epoch: 1, Step: 211, Rank: 0, loss = 0.1716800183057785
 Epoch: 1, Step: 211, Rank: 2, loss = 0.007658191490918398
 Epoch: 1, Step: 211, Rank: 4, loss = 0.9806817173957825
 Epoch: 1, Step: 211, Rank: 7, loss = 0.7794544100761414
 Per-token loss scaled by world size: 0.0005940343835391104
 Epoch: 1, Step: 211, Rank: 6, loss = 1.320389986038208
 total tokens: 6598 num samples: 2 num padding tokens: 510 - rank: 1 max len: 3299 min len: 2789 avg len: 3044.0 num_loss_counted_tokens: 202
 total tokens: 7126 num samples: 7 num padding tokens: 665 - rank: 4 max len: 1018 min len: 822 avg len: 923.0 num_loss_counted_tokens: 4429
 {
    "epoch": 1,
    "step": 211,
    "rank": 0,
    "loss": 0.1716800183057785,
    "overall_throughput": 42.027410025626445,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.381672859191895,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17782,
    "batch_size": 82,
    "total_loss": 0.721933126449585,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:17.873318"
 }
 total tokens: 8049 num samples: 3 num padding tokens: 1897 - rank: 2 max len: 2683 min len: 1693 avg len: 2050.6666666666665 num_loss_counted_tokens: 2086
 total tokens: 6725 num samples: 25 num padding tokens: 2254 - rank: 7 max len: 269 min len: 87 avg len: 178.84 num_loss_counted_tokens: 2062
 total tokens: 7750 num samples: 10 num padding tokens: 610 - rank: 5 max len: 775 min len: 681 avg len: 714.0 num_loss_counted_tokens: 2811
 total tokens: 8060 num samples: 13 num padding tokens: 2361 - rank: 6 max len: 620 min len: 304 avg len: 438.38461538461536 num_loss_counted_tokens: 3659
 total tokens: 7035 num samples: 5 num padding tokens: 813 - rank: 3 max len: 1407 min len: 1064 avg len: 1244.4 num_loss_counted_tokens: 3760
 total tokens: 7708 num samples: 2 num padding tokens: 291 - rank: 0 max len: 3854 min len: 3563 avg len: 3708.5 num_loss_counted_tokens: 181
 Per-token loss scaled by world size: 0.00016636037616990507
 Per-token loss scaled by world size: 0.00016077600594144315
 Per-token loss scaled by world size: 0.00041039849747903645
 Per-token loss scaled by world size: 0.00037365706521086395
 Per-token loss scaled by world size: 0.00032838378683663905
 Per-token loss scaled by world size: 0.0003097376029472798
 Per-token loss scaled by world size: 3.766906047530938e-06
 Epoch: 1, Step: 212, Rank: 5, loss = 1.1992356777191162
 Epoch: 1, Step: 212, Rank: 3, loss = 0.46980756521224976
 Epoch: 1, Step: 212, Rank: 1, loss = 1.0918726921081543
 Epoch: 1, Step: 212, Rank: 2, loss = 0.48612579703330994
 Epoch: 1, Step: 212, Rank: 0, loss = 0.011007370427250862
 Epoch: 1, Step: 212, Rank: 4, loss = 0.9595785140991211
 Epoch: 1, Step: 212, Rank: 7, loss = 0.9050920009613037
 Per-token loss scaled by world size: 0.00043126812670379877
 Epoch: 1, Step: 212, Rank: 6, loss = 1.2602193355560303
 total tokens: 7704 num samples: 8 num padding tokens: 1465 - rank: 4 max len: 963 min len: 685 avg len: 779.875 num_loss_counted_tokens: 4503
 total tokens: 8007 num samples: 3 num padding tokens: 648 - rank: 1 max len: 2669 min len: 2232 avg len: 2453.0 num_loss_counted_tokens: 287
 {
    "epoch": 1,
    "step": 212,
    "rank": 0,
    "loss": 0.011007370427250862,
    "overall_throughput": 42.88947039606486,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.303375244140625,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23377,
    "batch_size": 85,
    "total_loss": 0.7978672981262207,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:20.342573"
 }
 total tokens: 7920 num samples: 12 num padding tokens: 1442 - rank: 5 max len: 660 min len: 470 avg len: 539.8333333333334 num_loss_counted_tokens: 4323
 total tokens: 6372 num samples: 3 num padding tokens: 828 - rank: 2 max len: 2124 min len: 1644 avg len: 1848.0 num_loss_counted_tokens: 677
 total tokens: 7533 num samples: 27 num padding tokens: 2411 - rank: 7 max len: 279 min len: 75 avg len: 189.7037037037037 num_loss_counted_tokens: 2282
 total tokens: 7255 num samples: 5 num padding tokens: 1237 - rank: 3 max len: 1451 min len: 980 avg len: 1203.6 num_loss_counted_tokens: 3251
 total tokens: 7388 num samples: 2 num padding tokens: 920 - rank: 0 max len: 3694 min len: 2774 avg len: 3234.0 num_loss_counted_tokens: 589
 total tokens: 7973 num samples: 17 num padding tokens: 1660 - rank: 6 max len: 469 min len: 291 avg len: 371.3529411764706 num_loss_counted_tokens: 3245
 Per-token loss scaled by world size: 0.0003884605539496988
 Per-token loss scaled by world size: 0.0004556115891318768
 Per-token loss scaled by world size: 0.0003792895295191556
 Per-token loss scaled by world size: 0.00035762478364631534
 Per-token loss scaled by world size: 0.00014603856834582984
 Per-token loss scaled by world size: 3.432124140090309e-05
 Per-token loss scaled by world size: 8.249920938396826e-05
 Epoch: 1, Step: 213, Rank: 6, loss = 1.0305296182632446
 Epoch: 1, Step: 213, Rank: 4, loss = 1.2378966808319092
 Epoch: 1, Step: 213, Rank: 7, loss = 0.9716665744781494
 Epoch: 1, Step: 213, Rank: 3, loss = 0.39678677916526794
 Epoch: 1, Step: 213, Rank: 0, loss = 0.0932508111000061
 Epoch: 1, Step: 213, Rank: 2, loss = 1.0554473400115967
 Epoch: 1, Step: 213, Rank: 1, loss = 0.22415034472942352
 Per-token loss scaled by world size: 0.00048252404667437077
 Epoch: 1, Step: 213, Rank: 5, loss = 1.3110178709030151
 total tokens: 7245 num samples: 3 num padding tokens: 751 - rank: 1 max len: 2415 min len: 2008 avg len: 2164.6666666666665 num_loss_counted_tokens: 1753
 total tokens: 8024 num samples: 8 num padding tokens: 872 - rank: 4 max len: 1003 min len: 763 avg len: 894.0 num_loss_counted_tokens: 3911
 {
    "epoch": 1,
    "step": 213,
    "rank": 0,
    "loss": 0.0932508111000061,
    "overall_throughput": 42.19836853236888,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.430901527404785,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 21736,
    "batch_size": 78,
    "total_loss": 0.7900933623313904,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:22.852352"
 }
 total tokens: 7104 num samples: 24 num padding tokens: 3028 - rank: 7 max len: 296 min len: 82 avg len: 169.83333333333334 num_loss_counted_tokens: 1782
 total tokens: 7856 num samples: 16 num padding tokens: 1303 - rank: 6 max len: 491 min len: 317 avg len: 409.5625 num_loss_counted_tokens: 3722
 total tokens: 8107 num samples: 11 num padding tokens: 1365 - rank: 5 max len: 737 min len: 516 avg len: 612.9090909090909 num_loss_counted_tokens: 5379
 total tokens: 7816 num samples: 4 num padding tokens: 627 - rank: 2 max len: 1954 min len: 1579 avg len: 1797.25 num_loss_counted_tokens: 1262
 total tokens: 7015 num samples: 5 num padding tokens: 1303 - rank: 3 max len: 1403 min len: 1004 avg len: 1142.4 num_loss_counted_tokens: 2463
 total tokens: 7132 num samples: 2 num padding tokens: 1084 - rank: 0 max len: 3566 min len: 2482 avg len: 3024.0 num_loss_counted_tokens: 887
 Per-token loss scaled by world size: 0.0001603560958756134
 Per-token loss scaled by world size: 0.00011656123388092965
 Per-token loss scaled by world size: 0.00013124111865181476
 Per-token loss scaled by world size: 0.0004040842177346349
 Per-token loss scaled by world size: 0.0002459329552948475
 Per-token loss scaled by world size: 0.00028513988945633173
 Per-token loss scaled by world size: 0.0004030088894069195
 Epoch: 1, Step: 214, Rank: 2, loss = 0.4810081720352173
 Epoch: 1, Step: 214, Rank: 0, loss = 0.3936741352081299
 Epoch: 1, Step: 214, Rank: 1, loss = 0.34963998198509216
 Epoch: 1, Step: 214, Rank: 6, loss = 1.2121011018753052
 Epoch: 1, Step: 214, Rank: 7, loss = 0.7377066612243652
 Epoch: 1, Step: 214, Rank: 4, loss = 0.855312705039978
 Epoch: 1, Step: 214, Rank: 5, loss = 1.2088755369186401
 Per-token loss scaled by world size: 0.00028479599859565496
 Epoch: 1, Step: 214, Rank: 3, loss = 0.8542811870574951
 total tokens: 7895 num samples: 5 num padding tokens: 1253 - rank: 1 max len: 1579 min len: 1089 avg len: 1328.4 num_loss_counted_tokens: 1673
 total tokens: 7480 num samples: 11 num padding tokens: 884 - rank: 4 max len: 680 min len: 516 avg len: 599.6363636363636 num_loss_counted_tokens: 4908
 {
    "epoch": 1,
    "step": 214,
    "rank": 0,
    "loss": 0.3936741352081299,
    "overall_throughput": 41.376296140594235,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.258357048034668,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 23997,
    "batch_size": 85,
    "total_loss": 0.761574923992157,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:25.410853"
 }
 total tokens: 8085 num samples: 35 num padding tokens: 2995 - rank: 7 max len: 231 min len: 82 avg len: 145.42857142857142 num_loss_counted_tokens: 1964
 total tokens: 6426 num samples: 2 num padding tokens: 1555 - rank: 0 max len: 3213 min len: 1658 avg len: 2435.5 num_loss_counted_tokens: 537
 total tokens: 7791 num samples: 21 num padding tokens: 1378 - rank: 6 max len: 371 min len: 236 avg len: 305.3809523809524 num_loss_counted_tokens: 3598
 total tokens: 7740 num samples: 15 num padding tokens: 793 - rank: 5 max len: 516 min len: 375 avg len: 463.1333333333333 num_loss_counted_tokens: 3953
 total tokens: 8104 num samples: 8 num padding tokens: 972 - rank: 2 max len: 1013 min len: 810 avg len: 891.5 num_loss_counted_tokens: 5296
 total tokens: 8100 num samples: 10 num padding tokens: 449 - rank: 3 max len: 810 min len: 698 avg len: 765.1 num_loss_counted_tokens: 3926
 Per-token loss scaled by world size: 0.00016885650984477252
 Per-token loss scaled by world size: 0.00022565454128198326
 Per-token loss scaled by world size: 0.00031425835913978517
 Per-token loss scaled by world size: 7.733783036201203e-07
 Per-token loss scaled by world size: 0.00022409454686567187
 Per-token loss scaled by world size: 0.00017783122893888503
 Per-token loss scaled by world size: 0.00018609287508297712
 Epoch: 1, Step: 215, Rank: 2, loss = 0.7786210179328918
 Epoch: 1, Step: 215, Rank: 5, loss = 1.084348440170288
 Epoch: 1, Step: 215, Rank: 1, loss = 0.5826393961906433
 Epoch: 1, Step: 215, Rank: 0, loss = 0.002668541856110096
 Epoch: 1, Step: 215, Rank: 6, loss = 0.7732382416725159
 Epoch: 1, Step: 215, Rank: 4, loss = 0.6136066317558289
 Epoch: 1, Step: 215, Rank: 7, loss = 0.642113447189331
 Per-token loss scaled by world size: 0.00024193401623051614
 Epoch: 1, Step: 215, Rank: 3, loss = 0.8347933292388916
 total tokens: 7668 num samples: 9 num padding tokens: 1000 - rank: 4 max len: 852 min len: 647 avg len: 740.8888888888889 num_loss_counted_tokens: 4699
 total tokens: 8060 num samples: 5 num padding tokens: 515 - rank: 1 max len: 1612 min len: 1420 avg len: 1509.0 num_loss_counted_tokens: 2649
 {
    "epoch": 1,
    "step": 215,
    "rank": 0,
    "loss": 0.002668541856110096,
    "overall_throughput": 41.217687956745806,
    "lr": 4.000000000000001e-06,
    "cuda_mem_allocated": 24.42807674407959,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 27604,
    "batch_size": 87,
    "total_loss": 0.6640036106109619,
    "gradnorm": 0.8729047775268555,
    "weight_norm": 433.0433044433594,
    "timestamp": "2024-08-18T20:57:27.975308"
 }
 total tokens: 7025 num samples: 5 num padding tokens: 744 - rank: 2 max len: 1405 min len: 1116 avg len: 1256.2 num_loss_counted_tokens: 4200
 total tokens: 7672 num samples: 7 num padding tokens: 693 - rank: 3 max len: 1096 min len: 866 avg len: 997.0 num_loss_counted_tokens: 5684
 total tokens: 8112 num samples: 13 num padding tokens: 563 - rank: 5 max len: 624 min len: 528 avg len: 580.6923076923077 num_loss_counted_tokens: 5251
 total tokens: 7856 num samples: 16 num padding tokens: 1465 - rank: 6 max len: 491 min len: 299 avg len: 399.4375 num_loss_counted_tokens: 3069
 total tokens: 7812 num samples: 28 num padding tokens: 2476 - rank: 7 max len: 279 min len: 86 avg len: 190.57142857142858 num_loss_counted_tokens: 2263
 total tokens: 6210 num samples: 3 num padding tokens: 683 - rank: 0 max len: 2070 min len: 1654 avg len: 1842.3333333333333 num_loss_counted_tokens: 1033
 Per-token loss scaled by world size: 0.00016547582345083356
 Per-token loss scaled by world size: 0.0002693594142328948
 Per-token loss scaled by world size: 0.00031613183091394603
 Per-token loss scaled by world size: 5.018114825361408e-05
 Per-token loss scaled by world size: 0.0004250952915754169
 Per-token loss scaled by world size: 0.00035303577897138894
 Per-token loss scaled by world size: 5.689787940355018e-05
 Epoch: 1, Step: 216, Rank: 5, loss = 1.0305107831954956
 Epoch: 1, Step: 216, Rank: 2, loss = 0.5394098162651062
 Epoch: 1, Step: 216, Rank: 7, loss = 0.8780443072319031
 Epoch: 1, Step: 216, Rank: 1, loss = 0.16357800364494324
 Epoch: 1, Step: 216, Rank: 0, loss = 0.18547286093235016
 Epoch: 1, Step: 216, Rank: 4, loss = 1.3857043981552124
 Epoch: 1, Step: 216, Rank: 3, loss = 1.150808334350586
 Per-token loss scaled by world size: 0.00032035927870310843
 Epoch: 1, Step: 216, Rank: 6, loss = 1.0442911386489868
 [2024-08-18 20:57:30,497] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[4.800000000000001e-06], mom=[(0.9, 0.95)]
 [2024-08-18 20:57:30,575] [INFO] [timer.py:258:stop] epoch=0/micro_step=216/global_step=6, RunningAvgSamplesPerSec=41.682838913878484, CurrSamplesPerSec=41.70386986575945, MemAllocated=22.89GB, MaxMemAllocated=30.61GB
 total tokens: 8096 num samples: 11 num padding tokens: 403 - rank: 4 max len: 736 min len: 676 avg len: 699.3636363636364 num_loss_counted_tokens: 3366
 total tokens: 7895 num samples: 5 num padding tokens: 1102 - rank: 1 max len: 1579 min len: 1157 avg len: 1358.6 num_loss_counted_tokens: 2579
 {
    "epoch": 1,
    "step": 216,
    "rank": 0,
    "loss": 0.18547286093235016,
    "overall_throughput": 40.61028464966421,
    "lr": 4.800000000000001e-06,
    "cuda_mem_allocated": 22.89185380935669,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 26078,
    "batch_size": 113,
    "total_loss": 0.7972275614738464,
    "gradnorm": 0.7308346033096313,
    "weight_norm": 433.0433349609375,
    "timestamp": "2024-08-18T20:57:30.638455"
 }
 total tokens: 7917 num samples: 7 num padding tokens: 347 - rank: 2 max len: 1131 min len: 1017 avg len: 1081.4285714285713 num_loss_counted_tokens: 5762
 total tokens: 7920 num samples: 12 num padding tokens: 1053 - rank: 5 max len: 660 min len: 459 avg len: 572.25 num_loss_counted_tokens: 3495
 total tokens: 7784 num samples: 8 num padding tokens: 989 - rank: 3 max len: 973 min len: 781 avg len: 849.375 num_loss_counted_tokens: 3166
 total tokens: 7786 num samples: 17 num padding tokens: 2184 - rank: 6 max len: 458 min len: 242 avg len: 329.52941176470586 num_loss_counted_tokens: 3198
 total tokens: 7887 num samples: 33 num padding tokens: 2326 - rank: 7 max len: 239 min len: 82 avg len: 168.5151515151515 num_loss_counted_tokens: 2289
 total tokens: 7284 num samples: 3 num padding tokens: 783 - rank: 0 max len: 2428 min len: 1739 avg len: 2167.0 num_loss_counted_tokens: 406
 Per-token loss scaled by world size: 0.0005763740628026426
 Per-token loss scaled by world size: 0.0009438088163733482
 Per-token loss scaled by world size: 6.679360376438126e-05
 Per-token loss scaled by world size: 0.0001800585159799084
 Per-token loss scaled by world size: 0.0003732589539140463
 Per-token loss scaled by world size: 7.849858957342803e-05
 Per-token loss scaled by world size: 0.00011912822810700163
 Epoch: 1, Step: 217, Rank: 1, loss = 0.1438567191362381
 Epoch: 1, Step: 217, Rank: 5, loss = 2.0327281951904297
 Epoch: 1, Step: 217, Rank: 3, loss = 0.38780102133750916
 Epoch: 1, Step: 217, Rank: 6, loss = 1.241365671157837
 Epoch: 1, Step: 217, Rank: 0, loss = 0.16906633973121643
 Epoch: 1, Step: 217, Rank: 4, loss = 0.8039065003395081
 Epoch: 1, Step: 217, Rank: 2, loss = 0.256572425365448
 Per-token loss scaled by world size: 0.0005111052305437624
 Epoch: 1, Step: 217, Rank: 7, loss = 1.1007928848266602
 total tokens: 6974 num samples: 2 num padding tokens: 562 - rank: 1 max len: 3487 min len: 2925 avg len: 3206.0 num_loss_counted_tokens: 216
 total tokens: 7728 num samples: 7 num padding tokens: 633 - rank: 4 max len: 1104 min len: 930 avg len: 1013.5714285714286 num_loss_counted_tokens: 5349
 {
    "epoch": 1,
    "step": 217,
    "rank": 0,
    "loss": 0.16906633973121643,
    "overall_throughput": 41.92863106467066,
    "lr": 4.800000000000001e-06,
    "cuda_mem_allocated": 24.260313034057617,
    "cuda_malloc_retries": 0,
    "num_loss_counted_tokens": 17230,
    "batch_size": 69,
    "total_loss": 0.767011284828186,
    "gradnorm": 0.7308346033096313,
    "weight_norm": 433.0433349609375,
    "timestamp": "2024-08-18T20:57:33.123706"
 }
 total tokens: 4065 num samples: 1 num padding tokens: 0 - rank: 0 max len: 4065 min len: 4065 avg len: 4065.0 num_loss_counted_tokens: 82
 total tokens: 6741 num samples: 3 num padding tokens: 1158 - rank: 2 max len: 2247 min len: 1528 avg len: 1861.0 num_loss_counted_tokens: 930
 total tokens: 7992 num samples: 27 num padding tokens: 3238 - rank: 7 max len: 296 min len: 81 avg len: 176.07407407407408 num_loss_counted_tokens: 2221
 total tokens: 7632 num samples: 12 num padding tokens: 2411 - rank: 6 max len: 636 min len: 301 avg len: 435.0833333333333 num_loss_counted_tokens: 3582
 total tokens: 7080 num samples: 5 num padding tokens: 814 - rank: 3 max len: 1416 min len: 1134 avg len: 1253.2 num_loss_counted_tokens: 2449
 total tokens: 8091 num samples: 9 num padding tokens: 1045 - rank: 5 max len: 899 min len: 643 avg len: 782.8888888888889 num_loss_counted_tokens: 5684
 Per-token loss scaled by world size: 0.0003471940290182829
 Per-token loss scaled by world size: 0.0005411332240328193
 Per-token loss scaled by world size: 6.277004104049411e-06
 Per-token loss scaled by world size: 0.0005339714116416872
 Per-token loss scaled by world size: 4.73601221528952e-06
 Per-token loss scaled by world size: 0.00031345669412985444
 Epoch: 1, Step: 218, Rank: 5, loss = 1.1664127111434937
 Per-token loss scaled by world size: 5.614342057924659e-07
 Epoch: 1, Step: 218, Rank: 1, loss = 0.010208474472165108
 Epoch: 1, Step: 218, Rank: 3, loss = 0.748376727104187
 Epoch: 1, Step: 218, Rank: 0, loss = 0.013530082069337368
 Epoch: 1, Step: 218, Rank: 4, loss = 1.1509753465652466
 Epoch: 1, Step: 218, Rank: 7, loss = 0.6756559014320374
 Epoch: 1, Step: 218, Rank: 2, loss = 0.0012101713800802827
 Per-token loss scaled by world size: 0.00039466869202442467
 Epoch: 1, Step: 218, Rank: 6, loss = 0.8507083654403687
 total tokens: 6480 num samples: 3 num padding tokens: 206 - rank: 1 max len: 2160 min len: 1983 avg len: 2091.3333333333335 num_loss_counted_tokens: 1674
 total tokens: 7760 num samples: 10 num padding tokens: 447 - rank: 4 max len: 776 min len: 688 avg len: 7