@danikhan632
Created April 13, 2026 18:35
CSL Language Reference

CSL (Cerebras Software Language) - In-Depth Language Reference

1. Overview

CSL (Cerebras Software Language) is a domain-specific language for programming Cerebras Wafer-Scale Engine (WSE) processors. Its syntax is derived from Zig, but its purpose and compiler implementation are entirely different. CSL targets a massively parallel architecture of hundreds of thousands of Processing Elements (PEs) arranged in a 2D grid, each with its own compute engine (CE), router, and local memory.

A complete CSL program consists of:

  1. Layout code (layout.csl) -- defines the spatial arrangement of PEs and their routing/communication configuration.
  2. PE programs (pe_program.csl, etc.) -- define the computation running on individual PEs.
  3. Host code (run.py) -- a Python script that compiles the CSL code, transfers data to/from the device, and launches device functions via remote procedure calls (RPCs).

Programs target either WSE-2 or WSE-3 architectures via the --arch compiler flag.


2. Type System

2.1 Primitive Types

| Type | Description |
| --- | --- |
| bool | Boolean values |
| i16 | 16-bit signed integer |
| i32 | 32-bit signed integer |
| u16 | 16-bit unsigned integer |
| u32 | 32-bit unsigned integer |
| f16 | 16-bit IEEE-754 floating point |
| f32 | 32-bit IEEE-754 floating point |
| void | No return value |

2.2 Array Types

Arrays are declared with a size and element type:

var A: [24]f32;                 // 1D array of 24 f32 elements
var B: [4, 6]f32;               // 2D array (4 rows, 6 columns)
const C = [3]i16{ 1, 2, 3 };   // Array literal with initializer

Array sizes must be known at compile time. Any compile-time expression can be used as a dimension:

const M: i16 = 4;
const N: i16 = 6;
var A: [M * N]f32;

2.3 Pointer Types

Pointers are used to pass arrays and scalars to runtime functions (arrays themselves cannot be passed to runtime functions directly):

var y: [4]f32;
var value: i16 = 0;
var y_ptr: [*]f32 = &y;        // Pointer to array (many-item pointer)
var scalar_ptr: *i16 = &value;  // Pointer to single value

Pointer dereferencing uses the .* syntax:

fn increment_and_sum(data_ptr: *[3]i16, result_ptr: *i16) void {
    (data_ptr.*)[0] += 1;      // Dereference then index
    result_ptr.* = 42;          // Dereference scalar pointer
}

2.4 Special Types

| Type | Description |
| --- | --- |
| color | Identifies a communication channel / wavelet color |
| comptime_struct | Compile-time structure passed as parameter |
| local_task_id | Identifies a local task |
| data_task_id | Identifies a wavelet-triggered (data) task |
| control_task_id | Identifies a control-triggered task |
| input_queue | Hardware input queue identifier (WSE-3) |
| output_queue | Hardware output queue identifier |
| mem1d_dsd | 1D memory Data Structure Descriptor type |
| fabin_dsd | Fabric input DSD type |
| fabout_dsd | Fabric output DSD type |

3. Variables and Constants

Every variable must be declared as either const (immutable) or var (mutable):

const M: i16 = 4;              // Immutable, explicit type
var counter: u16 = 0;           // Mutable, explicit type
var x = @zeros([6]f32);         // Type inferred from builtin
var y = @constants([4]f32, 2.0); // All elements initialized to 2.0

3.1 Initialization Builtins

var a = @zeros([M]f32);            // All elements zero
var b = @constants([N]f32, 1.0);   // All elements set to a constant

3.2 Exported Variables

Variables visible to the host program must be exported via comptime blocks:

var y_ptr: [*]f32 = &y;
comptime {
    @export_symbol(y_ptr, "y");     // Export with custom name
    @export_symbol(compute);        // Export function (name inferred)
}

The export keyword can also be used directly on variable declarations for SDK Layout programs:

export var A = @zeros([height, width]f32);

4. Functions

Functions are declared with fn, specifying parameter types and return type:

fn gemv() void {
    // ...
}

fn add(a: f32, b: f32) f32 {
    return a + b;
}

fn transformation(value: f32, coeff1: f32, coeff2: f32, weight: f32) f32 {
    return value * (coeff1 + weight) + value * (coeff2 + weight);
}

Functions used at compile time can accept and return arrays directly. Runtime functions must use pointers instead.


5. Control Flow

5.1 If Statements

if (pe_id == 0) {
    send_right();
} else {
    recv_left();
}

5.2 While Loops

CSL supports while loops with an optional continuation expression:

var i: i16 = 0;
while (i < M) : (i += 1) {
    b[i] = 2.0;
}

5.3 For Loops with @range

The @range builtin generates a compile-time range for iteration:

// Loop from 0 to M*N - 1
for (@range(i16, M * N)) |idx| {
    A[idx] = @as(f32, idx);
}

// Loop from 1 to 3 with step 1
for (@range(u16, 1, 4, 1)) |pe_id| {
    // pe_id takes values 1, 2, 3
}
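
For readers coming from the Python host side, the two @range forms above behave much like Python's built-in range. The sketch below is only an analogy: it ignores the CSL element-type argument (i16, u16), and M and N are the illustrative values used earlier in this reference.

```python
# Python analogy for the two @range forms shown above (element type ignored).
M, N = 4, 6

# for (@range(i16, M * N)) |idx| { ... }   ->  idx = 0 .. M*N - 1
one_arg = list(range(M * N))

# for (@range(u16, 1, 4, 1)) |pe_id| { ... }  ->  start 1, stop 4 (exclusive), step 1
three_arg = list(range(1, 4, 1))

print(one_arg[0], one_arg[-1])
print(three_arg)
```

As in the CSL comments, the stop value is exclusive, so the second loop visits 1, 2, 3.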

5.4 Switch Statements

var state: u16 = 0;
switch (state) {
    0 => {
        // handle case 0
    },
    1 => {
        // handle case 1
    },
    else => {
        // default case
    }
}

6. Compile-Time Parameters (param)

Parameters are compile-time constants injected by the layout file or the SDK Layout API. They are a core mechanism for configuring PE programs:

param memcpy_params: comptime_struct;  // Structured compile-time parameter
param M: i16;                          // Scalar parameter
param N_per_PE: i16;
param pe_id: i16;
param send_color: color;               // Color parameter
param main_task_id: local_task_id;     // Task ID parameter

Parameters are set in the layout file via @set_tile_code:

@set_tile_code(0, 0, "pe_program.csl", .{
    .memcpy_params = memcpy.get_params(0),
    .M = M,
    .N_per_PE = N / 2,
    .pe_id = 0,
    .send_color = send_color
});

Since parameters are compile-time constants, they can be used to:

  • Size arrays: var A: [M * N_per_PE]f32;
  • Control compilation: if (@is_arch("wse3")) { ... }
  • Define DSD extents and all other compile-time constructs

7. Module System

7.1 Importing Modules

Modules are imported using @import_module. System libraries use angle brackets:

// System libraries
const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);
const random = @import_module("<random>");
const tsc = @import_module("<time>");
const math = @import_module("<math>");
const ctrl = @import_module("<control>");
const layout_mod = @import_module("<layout>");
const debug = @import_module("<debug>");
const simprint = @import_module("<simprint>");

// Collective communication library
const mpi_x = @import_module("<collectives_2d/pe>", .{
    .dim_params = c2d_params.x,
    .queues = [2]u16{2, 4},
    .dest_dsr_ids = [1]u16{1},
    .src0_dsr_ids = [1]u16{1},
    .src1_dsr_ids = [1]u16{1}
});

7.2 Standard Libraries

| Library | Purpose |
| --- | --- |
| <memcpy/memcpy> | Host-device data transfer and command stream |
| <memcpy/get_params> | Generate memcpy configuration parameters |
| <random> | Random number generation (e.g., random_f32(-1.0, 1.0)) |
| <time> | Timestamp counter operations |
| <math> | Mathematical functions (e.g., sqrt_f32) |
| <control> | Control wavelet encoding helpers |
| <layout> | Runtime PE coordinate queries |
| <debug> | Trace logging for post-execution analysis |
| <simprint> | Print to simulator log (sim.log) during simulation |
| <collectives_2d/pe> | MPI-style collective operations (broadcast, reduce, scatter, gather) |

7.3 Layout Module

The <layout> module provides runtime access to PE coordinates, which avoids the need for compile-time pe_id parameters:

const layout_mod = @import_module("<layout>");

fn is_top_row() bool {
    return (layout_mod.get_y_coord() == 0);
}
fn is_left_col() bool {
    return (layout_mod.get_x_coord() == 0);
}

8. Data Structure Descriptors (DSDs)

DSDs are a central concept in CSL. They describe how to iterate over data in memory or on the communication fabric, and they allow hardware-accelerated bulk operations without explicit loops.

8.1 Memory DSDs (mem1d_dsd)

A memory DSD describes a pattern of memory accesses:

// Method 1: base_address + extent (contiguous access)
var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });

// Method 2: tensor_access expression (flexible, supports striding)
var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });

// Strided access (every Nth element)
var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i * N] });

// Diagonal access of a 2D array
const dsdA = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{size} -> A[i, i] });

// With wavelet_index_offset (uses wavelet index to offset access)
const dsd_A = @get_dsd(mem1d_dsd, .{
    .tensor_access = |i|{height} -> A[i, 0],
    .wavelet_index_offset = true
});

The tensor_access syntax |i|{BOUND} -> array[expr(i)] specifies:

  • i -- the induction variable
  • {BOUND} -- the number of iterations (loop bound)
  • array[expr(i)] -- an affine expression generating the address
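
One way to read a tensor_access expression is as a generator of element offsets. The Python sketch below lists the offsets visited by three of the DSDs above. The helper tensor_access is hypothetical (it is not part of any Cerebras API), and the diagonal case assumes A is a [size, size] array stored row-major.

```python
def tensor_access(bound, index_fn):
    """Return the flat element offsets visited for i = 0 .. bound-1."""
    return [index_fn(i) for i in range(bound)]

M, N = 4, 6   # illustrative sizes, as in the earlier snippets
size = 4      # side length of a square 2D array for the diagonal case

contiguous = tensor_access(M, lambda i: i)       # |i|{M} -> b[i]
strided = tensor_access(M, lambda i: i * N)      # |i|{M} -> A[i * N]
# |i|{size} -> A[i, i], assuming row-major storage of a [size, size] array
diagonal = tensor_access(size, lambda i: i * size + i)

print(contiguous, strided, diagonal)
```

With A viewed as an M x N row-major matrix, the strided DSD walks down one column, which is why the GEMV examples in this reference pair it with a per-column offset increment.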

8.2 Fabric DSDs

Fabric DSDs describe data flowing through the communication network:

// Fabric output DSD -- sends wavelets along a color
const out_dsd = @get_dsd(fabout_dsd, .{
    .fabric_color = send_color,
    .extent = M,
    .output_queue = send_color_oq
});

// Fabric input DSD -- receives wavelets from a color
const in_dsd = @get_dsd(fabin_dsd, .{
    .fabric_color = send_color,
    .extent = M,
    .input_queue = send_color_iq
});

// Control wavelet output DSD
const tx_ctrl_dsd = @get_dsd(fabout_dsd, .{
    .extent = 1,
    .fabric_color = tx_color,
    .control = true,          // marks as control wavelet
    .output_queue = tx_oq
});

8.3 DSD Manipulation

// Shift a DSD's base offset by a number of elements (here: 1 element of type f32)
A_dsd = @increment_dsd_offset(A_dsd, 1, f32);

9. DSD Operations (Builtins)

DSD operations are hardware-accelerated bulk operations that iterate over DSDs:

9.1 Arithmetic Operations

| Builtin | Description |
| --- | --- |
| @fadds(dest, src1, src2) | Element-wise f32 addition: dest = src1 + src2 |
| @fmacs(dest, src1, src2, scalar) | Multiply-accumulate: dest = src1 + src2 * scalar |
| @fmovs(dest, src) | Copy/move elements |
| @fnegs(dest, src) | Element-wise negation |
| @add16(dest, src1, value) | 16-bit integer addition |
| @mov32(dest, value) | Move a 32-bit value |
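
The element-wise meaning of these builtins can be written out in a few lines of Python. This is a reference model for checking results on the host, not how the hardware executes them: Python lists stand in for DSDs, and Python floats are 64-bit, so f32 rounding is not reproduced.

```python
def fadds(dest, src1, src2):
    """dest[i] = src1[i] + src2[i]"""
    dest[:] = [a + b for a, b in zip(src1, src2)]

def fmacs(dest, src1, src2, scalar):
    """dest[i] = src1[i] + src2[i] * scalar"""
    dest[:] = [a + b * scalar for a, b in zip(src1, src2)]

def fnegs(dest, src):
    """dest[i] = -src[i]"""
    dest[:] = [-a for a in src]

y = [0.0, 0.0, 0.0]
col = [1.0, 2.0, 3.0]
fmacs(y, y, col, 2.0)            # y = y + col * 2.0
after_fmacs = list(y)
fadds(y, y, [1.0, 1.0, 1.0])     # y = y + 1
print(after_fmacs, y)
```

Passing the same list as both dest and src1 mirrors the accumulate-in-place pattern @fmacs(y_dsd, y_dsd, A_dsd, x_val) used elsewhere in this reference.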

9.2 Async Options

DSD operations support an options struct for asynchronous execution:

@fmovs(out_dsd, y_dsd, .{
    .async = true,              // Run asynchronously
    .activate = exit_task_id,   // Activate task when done
    .ut_id = send_color_ut,     // Explicit microthread ID (WSE-3)
    .unblock = process_task     // Unblock a blocked task when done
});

9.3 The @map Builtin

@map provides customizable DSD operations using user-defined callback functions:

// Apply sqrt to each element of a DSD
@map(math.sqrt_f32, input_dsd, output_dsd);   // math = @import_module("<math>")

// Custom transformation with mixed scalar and DSD arguments
@map(transformation, dsdA, 2.0, 6.0, dsd_weight, dsdA);

// Reduction (sum all elements)
fn reduction(value: i32, sum: *i32) i32 {
    return sum.* + value;
}
@map(reduction, dsdB, &sum[0], &sum[0]);

@map eliminates explicit loops by leveraging DSD descriptions as implicit loop structures, enabling hardware-optimized iteration.
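
The implicit loop that @map performs can be modeled in Python as follows. This is a sketch under stated assumptions: lists stand in for DSDs, the hypothetical Ref class stands in for a pointer to a scalar such as &sum[0], and csl_map is an illustrative helper, not a Cerebras API. The last argument is treated as the destination, as in the CSL calls above.

```python
class Ref:
    """Stands in for a CSL pointer to a scalar (e.g. &sum[0])."""
    def __init__(self, value):
        self.value = value

def csl_map(fn, *args):
    """Model of @map(fn, arg0, ..., dest): iterate once per DSD element."""
    *inputs, dest = args
    n = max(len(a) for a in args if isinstance(a, list))
    for i in range(n):
        # DSD arguments contribute element i; scalars and pointers pass through
        call_args = [a[i] if isinstance(a, list) else a for a in inputs]
        result = fn(*call_args)
        if isinstance(dest, Ref):
            dest.value = result
        else:
            dest[i] = result

def transformation(value, coeff1, coeff2, weight):
    return value * (coeff1 + weight) + value * (coeff2 + weight)

def reduction(value, total):
    return total.value + value

A = [1.0, 2.0]
weight = [0.5, 1.0]
csl_map(transformation, A, 2.0, 6.0, weight, A)   # in-place update of A

total = Ref(0)
csl_map(reduction, [1, 2, 3, 4], total, total)    # sum into the pointed-to scalar
print(A, total.value)
```

The reduction works because the same pointer is both an input and the destination, so each call sees the running sum from the previous element.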


10. Tasks and the Execution Model

CSL uses a task-based execution model. The CE on each PE executes tasks, which are triggered by various events.

10.1 Local Tasks

Local tasks are explicitly activated by ID, used for control flow and sequencing:

const exit_task_id: local_task_id = @get_local_task_id(9);

task exit_task() void {
    sys_mod.unblock_cmd_stream();
}

comptime {
    @bind_local_task(exit_task, exit_task_id);
}

// Activate programmatically
@activate(exit_task_id);

10.2 Wavelet-Triggered Tasks (Data Tasks)

Data tasks are activated automatically when a wavelet of the associated color arrives. The wavelet payload is passed as the task argument:

// On WSE-2, data tasks bind to colors; on WSE-3, to input queues
const recv_x_task_id: data_task_id =
    if      (@is_arch("wse2")) @get_data_task_id(x_color)
    else if (@is_arch("wse3")) @get_data_task_id(x_color_iq);

// Single f32 payload
task recv_x(x_val: f32) void {
    @fmacs(y_dsd, y_dsd, A_dsd, x_val);
}

// Sparse tensor: upper 16 bits = index, lower 16 bits = data
task main_task(data: f16, idx: u16) void {
    result[idx] = data;
}

comptime {
    @bind_data_task(recv_x, recv_x_task_id);
}
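
On the host, a wavelet in the sparse-tensor layout described above (index in the upper 16 bits, f16 data in the lower 16 bits) can be packed and unpacked with the standard struct module. The helper names are hypothetical; only the bit layout comes from this reference.

```python
import struct

def pack_wavelet(data_f16, idx_u16):
    """Put an f16 value in the low 16 bits and an index in the high 16 bits."""
    (low,) = struct.unpack("<H", struct.pack("<e", data_f16))
    return ((idx_u16 & 0xFFFF) << 16) | low

def unpack_wavelet(word):
    """Split a 32-bit wavelet back into (data, index)."""
    low = word & 0xFFFF
    idx = (word >> 16) & 0xFFFF
    (data,) = struct.unpack("<e", struct.pack("<H", low))
    return data, idx

word = pack_wavelet(1.5, 7)
print(hex(word), unpack_wavelet(word))
```

The "e" format code is the IEEE-754 half-precision type, which matches the f16 type in section 2.1.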

10.3 Control Tasks (Sentinels)

Control tasks are activated by control wavelets. They use non-routable task IDs and are often used as sentinels to signal the end of a data stream:

param sentinel: u16;

const send_result_task_id: control_task_id = @get_control_task_id(sentinel);

task send_result() void {
    @fmovs(out_dsd, result_dsd, .{ .async = true });
}

comptime {
    @bind_control_task(send_result, send_result_task_id);
}

10.4 Task Blocking

Tasks can be blocked and unblocked to prevent re-entrant execution during async operations:

task process_task(element: f32) void {
    @block(process_task_id);            // Prevent re-entry
    elem[0] = element * element * element;
    @fmovs(out_dsd, elem_dsd, .{
        .async = true,
        .unblock = process_task         // Unblock when done
    });
}

11. Communication: Colors, Routes, and the Fabric

11.1 Colors

A color is a communication channel identifier. Wavelets (data packets) travel through the 2D fabric tagged with a color. Each PE's router inspects the color to determine forwarding behavior.

const send_color: color = @get_color(0);   // Allocate color with explicit ID
param send_color: color;                    // Or receive as parameter

11.2 Routing Configuration

Routes are configured in the layout file using @set_color_config:

// Receive from the compute element (RAMP), send EAST
@set_color_config(0, 0, send_color, .{
    .routes = .{ .rx = .{ RAMP }, .tx = .{ EAST } }
});

// Receive from WEST, send to RAMP (and optionally EAST)
@set_color_config(1, 0, send_color, .{
    .routes = .{ .rx = .{ WEST }, .tx = .{ RAMP, EAST } }
});

Route directions: NORTH, SOUTH, EAST, WEST, RAMP (to/from the CE).
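
To see what the two configurations above do together, the following Python sketch follows a wavelet along a single row. It is a simplified model with several assumptions: only eastward travel is handled, the routes dictionary and trace function are illustrative names, and a wavelet that leaves the last configured PE is simply dropped.

```python
# Route table for one color on a 2x1 row, mirroring the two configs above.
routes = {
    0: {"rx": {"RAMP"}, "tx": {"EAST"}},
    1: {"rx": {"WEST"}, "tx": {"RAMP", "EAST"}},
}

def trace(routes, start_pe):
    """Follow a wavelet sent by start_pe's CE; return the PEs whose CE receives it."""
    delivered = []
    pe, arriving_from = start_pe, "RAMP"
    while pe in routes and arriving_from in routes[pe]["rx"]:
        tx = routes[pe]["tx"]
        if "RAMP" in tx and arriving_from != "RAMP":
            delivered.append(pe)          # router forwards the wavelet up to the CE
        if "EAST" not in tx:
            break
        pe, arriving_from = pe + 1, "WEST"  # leaving EAST means arriving from WEST
    return delivered

print(trace(routes, 0))
```

A wavelet that arrives from a direction not listed in rx is not forwarded in this model, which is why the loop stops as soon as the rx check fails.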

11.3 Fabric Switches

Switches enable limited runtime control of routes, allowing a single color to be reused for multiple routing configurations:

const sender_switches = .{
    .pos1 = .{ .tx = WEST },
    .pos2 = .{ .tx = EAST },
    .pos3 = .{ .tx = SOUTH },
    .current_switch_pos = 1,
    .ring_mode = true,           // Wraps from pos3 back to pos0
};

@set_color_config(1, 1, channel, .{
    .routes = sender_routes,
    .switches = sender_switches
});

Switch positions advance when a control wavelet is received:

const ctrl = @import_module("<control>");
const switch_adv_pld = ctrl.encode_single_payload(ctrl.opcode.SWITCH_ADV, true, {}, 0);
@mov32(tx_ctrl_dsd, switch_adv_pld);   // Advance switch position
@mov32(tx_data_dsd, payload);           // Send data on new route
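
The switch behavior can be pictured as a small state machine. The Python sketch below models the sender_switches configuration above; the "DEFAULT" entry is a placeholder for the tx direction of sender_routes, which this reference does not show, and the behavior without ring mode (staying at the last position) is an assumption.

```python
class Switch:
    """Toy model of a color's switch positions on one router."""

    def __init__(self, positions, current=0, ring_mode=False):
        self.positions = positions   # index 0 = default route, 1..3 = pos1..pos3
        self.current = current
        self.ring_mode = ring_mode

    def advance(self):
        """Model of receiving a SWITCH_ADV control wavelet."""
        if self.current + 1 < len(self.positions):
            self.current += 1
        elif self.ring_mode:
            self.current = 0         # wrap from the last position back to pos0
        # Assumption: without ring mode the switch stays at the last position.

    @property
    def tx(self):
        return self.positions[self.current]

sw = Switch(["DEFAULT", "WEST", "EAST", "SOUTH"], current=1, ring_mode=True)
seen = [sw.tx]
for _ in range(4):
    sw.advance()
    seen.append(sw.tx)
print(seen)
```

Each advance corresponds to one control wavelet, so data sent between two advances travels in the direction of the position that is current at that time.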

11.4 Fabric Filters

Filters allow PEs to selectively accept wavelets based on the upper 16 bits of the wavelet payload:

const filter = .{
    .kind = .{ .range = true },
    .min_idx = pe_id * 3,
    .max_idx = pe_id * 3 + 2,
};

@set_color_config(pe_id, 0, data_color, .{
    .routes = .{ .rx = .{ WEST }, .tx = .{ RAMP, EAST } },
    .filter = filter
});
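
The effect of the range filter above can be checked on the host with a short Python model. It assumes that min_idx and max_idx are both inclusive, which is consistent with each PE owning three consecutive indices in the configuration above; the function names are illustrative.

```python
def range_filter_accepts(word, min_idx, max_idx):
    """Accept a 32-bit wavelet if its upper 16 bits fall in [min_idx, max_idx]."""
    idx = (word >> 16) & 0xFFFF
    return min_idx <= idx <= max_idx

def accepting_pe(word, num_pes):
    """Return the pe_id whose filter (pe_id*3 .. pe_id*3+2) accepts the wavelet."""
    for pe_id in range(num_pes):
        if range_filter_accepts(word, pe_id * 3, pe_id * 3 + 2):
            return pe_id
    return None

wavelet = (4 << 16) | 0x1234      # index 4 in the upper half, payload below
print(accepting_pe(wavelet, num_pes=4))
```

Because every PE also forwards EAST in that configuration, all PEs see every wavelet on the router; the filter only decides which ones reach the local CE.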

11.5 Color Swap

On WSE-2, when swap_color_x is enabled, a wavelet's color bit is flipped as it passes through a router. This lets two colors alternate between PEs:

@set_color_config(pe_id, 0, red, .{
    .routes = .{ .rx = .{ WEST }, .tx = .{ RAMP, EAST } },
    .swap_color_x = true
});

11.6 Checkerboard Routing

For multi-hop accumulation patterns (e.g., reducing partial results from WEST to EAST), a checkerboard pattern using two alternating colors avoids conflicts:

// Even columns: receive on ax_color_1, send on ax_color_2
// Odd columns: receive on ax_color_2, send on ax_color_1
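
The parity rule in the comments above can be expressed as a small Python helper, for example when generating per-column parameters in a host-side script. The function name is hypothetical; the color names are the ones used above.

```python
def checkerboard_colors(column, color_1="ax_color_1", color_2="ax_color_2"):
    """Return (recv_color, send_color) for a PE column."""
    if column % 2 == 0:
        return color_1, color_2   # even columns: receive on 1, send on 2
    return color_2, color_1       # odd columns: receive on 2, send on 1

assignment = [checkerboard_colors(col) for col in range(4)]
for col, (recv, send) in enumerate(assignment):
    print(col, recv, send)
```

The key property is that the send color of column c is always the receive color of column c + 1, so neighbors agree on the channel while no PE sends and receives on the same color.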

12. Queues and Microthreads

12.1 Queues

Hardware I/O queues buffer wavelets entering and leaving the PE. On WSE-3, queues must be explicitly initialized:

const send_color_oq: output_queue = @get_output_queue(2);
const send_color_iq: input_queue = @get_input_queue(2);

comptime {
    if (@is_arch("wse3")) {
        @initialize_queue(send_color_oq, .{ .color = send_color });
        @initialize_queue(send_color_iq, .{ .color = send_color });
    }
}

12.2 Microthreads (WSE-3)

On WSE-3, microthread IDs can be decoupled from queue IDs for flexible resource management:

const send_color_ut = @get_ut_id(4);   // Explicit microthread ID

@fmovs(out_dsd, y_dsd, .{
    .async = true,
    .ut_id = send_color_ut,             // Use specific microthread
    .activate = exit_task_id
});

This allows sharing microthreads between output queues and conserving resources.


13. FIFOs

FIFOs buffer data between asynchronous operations, extending the small hardware queues:

var fifo_buffer = @zeros([1024]f32);
const fifo = @allocate_fifo(fifo_buffer);

// Push from fabric into FIFO
@fadds(fifo, in_dsd, ten_dsd, .{ .async = true });

// Pop from FIFO to fabric
@fnegs(loopback_dsd, fifo, .{ .async = true });

FIFOs require two microthreads (one for push, one for pop) and a scratch buffer.
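
The data flow of the snippet above can be modeled with a bounded queue in Python. This sketch is sequential, whereas the real push and pop run concurrently on separate microthreads. It also assumes that ten_dsd holds the constant 10.0, which this reference suggests by name only.

```python
from collections import deque

class Fifo:
    """Bounded FIFO; push returns False when full so the producer can stall."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def push(self, value):
        if len(self.items) >= self.capacity:
            return False
        self.items.append(value)
        return True

    def pop(self):
        return self.items.popleft()

fifo = Fifo(capacity=1024)
incoming = [1.0, 2.0, 3.0]

for value in incoming:
    fifo.push(value + 10.0)       # like @fadds(fifo, in_dsd, ten_dsd): push in + 10

outgoing = []
while fifo.items:
    outgoing.append(-fifo.pop())  # like @fnegs(loopback_dsd, fifo): pop and negate
print(outgoing)
```

The capacity check is the reason a FIFO helps: the producer stalls only when the whole buffer is full, not when the much smaller hardware queue is.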


14. The comptime Block

The comptime block contains code that executes at compile time. It is used for:

  • Binding tasks to task IDs
  • Initializing queues (WSE-3)
  • Exporting symbols to the host
  • Activating initial tasks
  • Conditional compilation based on architecture

comptime {
    // Task binding
    @bind_local_task(exit_task, exit_task_id);
    @bind_data_task(recv_x, recv_x_task_id);
    @bind_control_task(send_result, send_result_task_id);

    // Initial task activation
    @activate(main_task_id);

    // Architecture-conditional compilation
    if (@is_arch("wse3")) {
        @initialize_queue(send_color_oq, .{ .color = send_color });
    }

    // Export symbols for host access
    @export_symbol(y_ptr, "y");
    @export_symbol(compute);
}

15. Layout Programs

Layout files define the spatial arrangement and communication topology of the program.

15.1 Layout Block

layout {
    // Define a width x height rectangle of PEs
    @set_rectangle(2, 1);

    // Assign code to specific PEs with parameters
    @set_tile_code(0, 0, "pe_program.csl", .{
        .memcpy_params = memcpy.get_params(0),
        .M = M,
        .pe_id = 0,
        .send_color = send_color
    });

    // Configure color routing for specific PEs
    @set_color_config(0, 0, send_color, .{
        .routes = .{ .rx = .{ RAMP }, .tx = .{ EAST } }
    });

    // Export symbol names visible to host
    @export_name("y", [*]f32, true);       // true = writable
    @export_name("compute", fn()void);
}

15.2 Key Layout Builtins

| Builtin | Description |
| --- | --- |
| @set_rectangle(width, height) | Define the PE grid dimensions |
| @set_tile_code(x, y, file, params) | Assign CSL code and params to a PE |
| @set_color_config(x, y, color, config) | Configure routing/switches/filters for a color on a PE |
| @export_name(name, type, writable) | Make a symbol accessible from the host |
| @get_rectangle() | Get the rectangle dimensions at compile time |

15.3 Loop Constructs in Layout

Layout blocks support for loops over @range for configuring multiple PEs:

layout {
    @set_rectangle(4, 1);
    for (@range(u16, 1, 4, 1)) |pe_id| {
        @set_tile_code(pe_id, 0, "recv.csl", .{ .pe_id = pe_id });
    }
}

16. SdkLayout API (High-Level Layout)

The SdkLayout API is a Python-based alternative to writing layout files in CSL. It provides:

  • Code regions: rectangular groups of PEs running the same code
  • Symbolic colors: automatic physical color allocation
  • Ports and connections: automatic route finding between code regions
  • Host I/O streams: connections to/from the host

from cerebras.sdk.sdk_layout import SdkLayout, Color, Route, Edge

layout = SdkLayout()

# Create a code region
gemv = layout.create_code_region("/path/to/gemv.csl", "gemv", width, height)

# Create symbolic colors (y_color is used for the output port below)
x_color = Color()
y_color = Color()
layout.paint(gemv, {"x_in": x_color})

# Define input/output ports
x_port = layout.create_input_port(gemv, x_color)
y_port = layout.create_output_port(gemv, y_color)

# Connect to host
layout.create_input_stream("x_stream", x_port)
layout.create_output_stream("y_stream", y_port)

17. Compiler Builtins Reference

17.1 Type Conversion and Introspection

| Builtin | Description |
| --- | --- |
| @as(type, value) | Cast a value to the specified type |
| @ptrcast([*]u32, &data) | Cast a pointer to a different pointer type |
| @is_arch("wse2") / @is_arch("wse3") | Check target architecture at compile time |
| @get_rectangle() | Get the PE rectangle dimensions |

17.2 DSD and Memory Builtins

| Builtin | Description |
| --- | --- |
| @get_dsd(type, config) | Create a DSD of the given type |
| @increment_dsd_offset(dsd, offset, type) | Shift a DSD's base offset |
| @zeros([size]type) | Create a zero-initialized array |
| @constants([size]type, value) | Create an array initialized to a constant |
| @allocate_fifo(buffer) | Create a FIFO backed by a buffer |

17.3 Color and Queue Builtins

| Builtin | Description |
| --- | --- |
| @get_color(id) | Get a color with explicit ID |
| @get_output_queue(id) | Get an output queue by ID |
| @get_input_queue(id) | Get an input queue by ID |
| @get_ut_id(id) | Get a microthread ID (WSE-3) |
| @initialize_queue(queue, config) | Initialize a queue (required on WSE-3) |

17.4 Task Builtins

| Builtin | Description |
| --- | --- |
| @get_local_task_id(id) | Get a local task ID |
| @get_data_task_id(color_or_queue) | Get a data task ID |
| @get_control_task_id(id) | Get a control task ID |
| @bind_local_task(fn, id) | Bind a function to a local task ID |
| @bind_data_task(fn, id) | Bind a function to a data task ID |
| @bind_control_task(fn, id) | Bind a function to a control task ID |
| @activate(task_id) | Activate a local task |
| @block(task_id) | Block a task from executing |

17.5 Symbol Export Builtins

| Builtin | Description |
| --- | --- |
| @export_symbol(symbol, name?) | Export a symbol for host access |
| @export_name(name, type, writable) | Export a name in the layout file |
| @import_module(path, params?) | Import a library or module |

17.6 DSD Operation Builtins

| Builtin | Signature | Description |
| --- | --- | --- |
| @fadds | (dest, src1, src2, opts?) | f32 element-wise add |
| @fmacs | (dest, src1, src2, scalar, opts?) | f32 multiply-accumulate |
| @fmovs | (dest, src, opts?) | f32 move/copy |
| @fnegs | (dest, src, opts?) | f32 negate |
| @add16 | (dest, src1, value, opts?) | i16/u16 add |
| @mov32 | (dest, value) | Move 32-bit value |
| @map | (fn, args..., dest) | Custom DSD operation |

18. Host-Side Programming (Python SDK)

18.1 Compilation

# Compile CSL to device binary
cslc layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 \
    --params=M:4,N:6 --memcpy --channels=1 -o out --arch=wse2

18.2 SdkRuntime API

from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

# Initialize and start
runner = SdkRuntime("out", cmaddr=args.cmaddr)
runner.load()
runner.run()

# Get device symbol handle
y_symbol = runner.get_id("y")

# Host-to-device copy
runner.memcpy_h2d(y_symbol, data, px, py, width, height, num_elems,
                  streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                  order=MemcpyOrder.ROW_MAJOR, nonblock=False)

# Launch device function (RPC)
runner.launch("compute", nonblock=False)

# Device-to-host copy
runner.memcpy_d2h(result_buf, y_symbol, px, py, width, height, num_elems,
                  streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                  order=MemcpyOrder.ROW_MAJOR, nonblock=False)

# Cleanup
runner.stop()
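
When a copy spans several PEs, the host buffer is a single flat array of width * height * num_elems values. The Python sketch below shows how one PE's slice can be located in that buffer. It assumes that MemcpyOrder.ROW_MAJOR means a (height, width, num_elems) layout with num_elems varying fastest, which should be checked against the SDK documentation; pe_chunk is a hypothetical helper, and x and y are coordinates relative to the region origin (px, py).

```python
def pe_chunk(flat, x, y, width, num_elems):
    """Return the slice of a row-major host buffer that belongs to PE (x, y)."""
    start = (y * width + x) * num_elems
    return flat[start:start + num_elems]

width, height, num_elems = 2, 2, 3
flat = list(range(width * height * num_elems))   # stand-in for host data

print(pe_chunk(flat, 0, 0, width, num_elems))
print(pe_chunk(flat, 1, 0, width, num_elems))
print(pe_chunk(flat, 0, 1, width, num_elems))
```

With NumPy, the same layout corresponds to an array of shape (height, width, num_elems) passed to the copy after ravel().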

18.3 Streaming Mode

Streaming mode uses special memcpy colors instead of named symbols:

# Stream data to device (uses color ID, not symbol)
runner.memcpy_h2d(MEMCPYH2D_DATA_1, x_data, 0, 0, width, 1, size,
                  streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT)

# Stream data from device
runner.memcpy_d2h(result, MEMCPYD2H_DATA_1, 0, 0, 1, height, size,
                  streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT)

18.4 SdkLayout Compilation

from cerebras.sdk.sdk_layout import SdkLayout
from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime

layout = SdkLayout()
# ... define code regions, ports, connections ...
compile_artifacts = layout.compile(fabric_dims=(8, 3))

runtime = SdkRuntime(compile_artifacts, platform="sim")
runtime.run()
runtime.send("input_stream", data, nonblock=True)
runtime.receive("output_stream", buffer, size, nonblock=True)
runtime.stop()

19. Architecture Differences: WSE-2 vs WSE-3

| Feature | WSE-2 | WSE-3 |
| --- | --- | --- |
| Data task binding | Bound to colors (IDs 0-23) | Bound to input queues (IDs 0-7) |
| Queue initialization | Implicit | Explicit (@initialize_queue required) |
| Microthread IDs | Coupled to queue IDs | Decoupled (@get_ut_id) |
| Color swap | Supported | Not yet supported |
| Local task IDs | 0-30 | 8-30 |
| Data task IDs | 0-23 (from colors) | 0-7 (from input queues) |

Architecture checks are done at compile time:

if (@is_arch("wse3")) {
    @initialize_queue(oq, .{ .color = my_color });
}
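
A host-side script that generates parameters for both architectures can validate IDs before compiling. The Python sketch below encodes the local and data task ID ranges listed in the table above; the dictionary and function names are illustrative, not part of the SDK.

```python
# ID ranges per architecture, taken from the table above (upper bounds inclusive there).
ID_RANGES = {
    "wse2": {"local_task": range(0, 31), "data_task": range(0, 24)},
    "wse3": {"local_task": range(8, 31), "data_task": range(0, 8)},
}

def is_valid_id(arch, kind, value):
    """Return True if value is a legal ID of the given kind on the given arch."""
    return value in ID_RANGES[arch][kind]

# Local task ID 9 (used for exit_task earlier) is portable; ID 3 is not.
print(is_valid_id("wse2", "local_task", 9), is_valid_id("wse3", "local_task", 9))
print(is_valid_id("wse2", "local_task", 3), is_valid_id("wse3", "local_task", 3))
```

Choosing local task IDs from 8 to 30 keeps a program portable across both targets.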

20. Struct Literals and Anonymous Structs

CSL uses Zig-style anonymous struct literals extensively for configuration:

// DSD configuration
.{ .base_address = &y, .extent = M }

// Route configuration
.{ .routes = .{ .rx = .{ RAMP }, .tx = .{ EAST } } }

// Async operation options
.{ .async = true, .activate = exit_task_id }

// Filter configuration
.{ .kind = .{ .range = true }, .min_idx = 3, .max_idx = 5 }

// Module parameters
.{ .width = 4, .height = 3 }

21. Debugging

21.1 Debug Library

The <debug> library records traces for post-execution analysis:

const trace = @import_module("<debug>", .{ ... });
trace.trace_string("task started");
trace.trace_i16(global);
trace.trace_timestamp();

Host code reads traces after execution:

from cerebras.sdk.debug.debug_util import read_trace
read_trace(runner, x, y, width, height, "trace")

21.2 Simprint Library

The <simprint> library prints directly to simulator logs during execution:

const simprint = @import_module("<simprint>");
simprint.print_string("recv_task: in_data = ");
simprint.print_f32(in_data);
simprint.print_string("\n");  // newline flushes output

Output appears in sim.log with cycle-accurate timestamps:

@968 PE(0,0): sender beginning main_fn
@1156 PE(1,0): recv_task: in_data = 0, global = 0

22. Collective Communication

The <collectives_2d> library provides MPI-style operations across PE rows and columns:

const mpi_x = @import_module("<collectives_2d/pe>", .{
    .dim_params = c2d_params.x,
    .queues = [2]u16{2, 4},
    .dest_dsr_ids = [1]u16{1},
    .src0_dsr_ids = [1]u16{1},
    .src1_dsr_ids = [1]u16{1}
});

// Broadcast from PE 0 to all PEs in the row
mpi_x.broadcast(0, send_buf, num_elems, callback_task_id);

// Reduce with f32 addition back to PE 0
mpi_x.reduce_fadds(0, send_buf, recv_buf, num_elems, callback_task_id);

// mpi_y is assumed to be the same module imported with .dim_params = c2d_params.y (import not shown)
// Scatter: distribute chunks from PE 0
mpi_y.scatter(0, send_buf, recv_buf, chunk_size, callback_task_id);

// Gather: collect chunks at PE 0
mpi_y.gather(0, send_buf, recv_buf, chunk_size, callback_task_id);

All collective operations are asynchronous and invoke a callback task upon completion, enabling chaining of multiple collective operations via a state machine pattern.
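
For checking device results on the host, the data movement of the four operations can be written as plain Python over a list of per-PE buffers. This is a reference model only: it ignores asynchrony and callbacks, the function names merely echo the library calls, and the assignment of chunks to PEs in index order is an assumption.

```python
def broadcast(root, bufs):
    """Every PE ends up with a copy of the root PE's buffer."""
    return [list(bufs[root]) for _ in bufs]

def reduce_fadds(root, bufs):
    """Element-wise sum of all PE buffers, delivered to the root PE only."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [total if pe == root else None for pe in range(len(bufs))]

def scatter(root_buf, num_pes, chunk_size):
    """PE p receives the p-th chunk of the root buffer (ordering assumed)."""
    return [root_buf[p * chunk_size:(p + 1) * chunk_size] for p in range(num_pes)]

def gather(bufs):
    """The root PE receives all chunks concatenated in PE order (ordering assumed)."""
    return [v for buf in bufs for v in buf]

row = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # one buffer per PE in a row
chunks = scatter([0, 1, 2, 3, 4, 5], num_pes=3, chunk_size=2)
print(broadcast(0, row), reduce_fadds(0, row), chunks, gather(chunks))
```

Scatter followed by gather returns the original buffer, which makes a convenient round-trip test for a new layout.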


23. Complete Program Example

A minimal but complete CSL program that computes y = Ax + b on a single PE:

layout.csl:

param M: i16;
param N: i16;

const memcpy = @import_module("<memcpy/get_params>", .{
    .width = 1, .height = 1,
});

layout {
    @set_rectangle(1, 1);
    @set_tile_code(0, 0, "pe_program.csl", .{
        .memcpy_params = memcpy.get_params(0),
        .M = M, .N = N,
    });
    @export_name("y", [*]f32, false);
    @export_name("init_and_compute", fn()void);
}

pe_program.csl:

param memcpy_params: comptime_struct;
param M: i16;
param N: i16;

const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);

var A: [M * N]f32;
var x = @constants([N]f32, 1.0);
var b = @constants([M]f32, 2.0);
var y = @zeros([M]f32);

var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i * N] });
var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });

const y_ptr: [*]f32 = &y;

fn init_and_compute() void {
    for (@range(i16, M * N)) |idx| {
        A[idx] = @as(f32, idx);
    }
    for (@range(u16, N)) |i| {
        @fmacs(y_dsd, y_dsd, A_dsd, x[i]);
        A_dsd = @increment_dsd_offset(A_dsd, 1, f32);
    }
    @fadds(y_dsd, y_dsd, b_dsd);
    sys_mod.unblock_cmd_stream();
}

comptime {
    @export_symbol(y_ptr, "y");
    @export_symbol(init_and_compute);
}

run.py:

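import argparse
import numpy as np
from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

# --cmaddr: IP:port of the CS system
parser = argparse.ArgumentParser()
parser.add_argument("--cmaddr", help="IP:port of the CS system")
args = parser.parse_args()

M = 4  # must match the value of M passed to cslc via --params

# "out" is the cslc output directory
runner = SdkRuntime("out", cmaddr=args.cmaddr)
runner.load()
runner.run()

# Launch the device function and wait for it to finish
runner.launch("init_and_compute", nonblock=False)

# Copy the M elements of y back from PE (0, 0)
y_symbol = runner.get_id("y")
result = np.zeros(M, dtype=np.float32)
runner.memcpy_d2h(result, y_symbol, 0, 0, 1, 1, M,
                  streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                  order=MemcpyOrder.ROW_MAJOR, nonblock=False)
runner.stop()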
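
To verify the device result, the host can compute the same y = Ax + b in plain Python. The sketch below mirrors the structure of init_and_compute (one accumulate per column of A, then the bias add) and assumes M = 4 and N = 6, the values passed to cslc in section 18.1.

```python
M, N = 4, 6   # assumed compile-time parameters (--params=M:4,N:6)

A = [float(idx) for idx in range(M * N)]   # A[idx] = idx, row-major M x N
x = [1.0] * N                              # @constants([N]f32, 1.0)
b = [2.0] * M                              # @constants([M]f32, 2.0)
y = [0.0] * M                              # @zeros([M]f32)

for j in range(N):                         # one @fmacs per column of A
    for i in range(M):
        y[i] += A[i * N + j] * x[j]        # A_dsd visits A[i * N + j]

expected = [yi + bi for yi, bi in zip(y, b)]   # @fadds(y_dsd, y_dsd, b_dsd)
print(expected)
```

In run.py, the array copied back by memcpy_d2h can then be compared against expected, for example with numpy.testing.assert_allclose.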

24. Key Concepts Summary

| Concept | Description |
| --- | --- |
| PE | Processing Element -- one unit in the 2D grid with its own CE, router, and memory |
| CE | Compute Engine -- executes tasks on a PE |
| Wavelet | A 32-bit data packet that travels through the fabric, tagged with a color |
| Color | A communication channel identifier used for routing wavelets |
| DSD | Data Structure Descriptor -- describes memory access or fabric I/O patterns |
| Route | Directional path (NORTH/SOUTH/EAST/WEST/RAMP) for wavelets at each PE |
| RAMP | The connection between the router and the CE |
| Task | A function triggered by events (local activation, wavelet arrival, control signal) |
| Switch | Runtime-modifiable route configuration, advanced by control wavelets |
| Filter | Hardware mechanism for selectively accepting wavelets |
| FIFO | Software buffer for decoupling producer/consumer in asynchronous operations |
| Memcpy | SDK infrastructure for host-device data transfer |
| Sentinel | Control wavelet used to signal end-of-stream or trigger control tasks |
| Microthread | Hardware execution context for async operations (explicitly managed on WSE-3) |