CSL (Cerebras Streaming Language) is a domain-specific programming language designed for programming Cerebras Wafer-Scale Engine (WSE) processors. Its syntax is derived from Zig, but its purpose and compiler implementation are entirely different. CSL targets a massively parallel architecture consisting of hundreds of thousands of Processing Elements (PEs) arranged in a 2D grid, each with its own compute engine (CE), router, and local memory.
A complete CSL program consists of:
- Layout code (layout.csl) -- defines the spatial arrangement of PEs and their routing/communication configuration.
- PE programs (pe_program.csl, etc.) -- define the computation running on individual PEs.
- Host code (run.py) -- a Python script that compiles the CSL code, transfers data to/from the device, and launches device functions via remote procedure calls (RPCs).
Programs target either WSE-2 or WSE-3 architectures via the --arch compiler flag.
| Type | Description |
|---|---|
| bool | Boolean values |
| i16 | 16-bit signed integer |
| i32 | 32-bit signed integer |
| u16 | 16-bit unsigned integer |
| u32 | 32-bit unsigned integer |
| f16 | 16-bit IEEE-754 floating point |
| f32 | 32-bit IEEE-754 floating point |
| void | No return value |
Arrays are declared with a size and element type:
var A: [24]f32; // 1D array of 24 f32 elements
var B: [4, 6]f32; // 2D array (4 rows, 6 columns)
const C = [3]i16{ 1, 2, 3 }; // Array literal with initializer

Array sizes must be compile-time constants. Arrays can use compile-time-known expressions for dimensions:
const M: i16 = 4;
const N: i16 = 6;
var A: [M * N]f32;

Pointers are used to pass arrays and scalars to runtime functions (arrays themselves cannot be passed to runtime functions directly):
var y: [4]f32;
var y_ptr: [*]f32 = &y; // Pointer to array (many-item pointer)
var scalar_ptr: *i16 = &value; // Pointer to single value

Pointer dereferencing uses the .* syntax:
fn increment_and_sum(data_ptr: *[3]i16, result_ptr: *i16) void {
(data_ptr.*)[0] += 1; // Dereference then index
result_ptr.* = 42; // Dereference scalar pointer
}

| Type | Description |
|---|---|
| color | Identifies a communication channel / wavelet color |
| comptime_struct | Compile-time structure passed as parameter |
| local_task_id | Identifies a local task |
| data_task_id | Identifies a wavelet-triggered (data) task |
| control_task_id | Identifies a control-triggered task |
| input_queue | Hardware input queue identifier (WSE-3) |
| output_queue | Hardware output queue identifier |
| mem1d_dsd | 1D memory Data Structure Descriptor type |
| fabin_dsd | Fabric input DSD type |
| fabout_dsd | Fabric output DSD type |

Every variable must be declared as either const (immutable) or var (mutable):
const M: i16 = 4; // Immutable, explicit type
var counter: u16 = 0; // Mutable, explicit type
var x = @zeros([6]f32); // Type inferred from builtin
var y = @constants([4]f32, 2.0); // All elements initialized to 2.0

var a = @zeros([M]f32); // All elements zero
var b = @constants([N]f32, 1.0); // All elements set to a constant

Variables visible to the host program must be exported via comptime blocks:
var y_ptr: [*]f32 = &y;
comptime {
@export_symbol(y_ptr, "y"); // Export with custom name
@export_symbol(compute); // Export function (name inferred)
}

The export keyword can also be used directly on variable declarations for SDK Layout programs:
export var A = @zeros([height, width]f32);

Functions are declared with fn, specifying parameter types and return type:
fn gemv() void {
// ...
}
fn add(a: f32, b: f32) f32 {
return a + b;
}
fn transformation(value: f32, coeff1: f32, coeff2: f32, weight: f32) f32 {
return value * (coeff1 + weight) + value * (coeff2 + weight);
}

Functions used at compile time can accept and return arrays directly. Runtime functions must use pointers instead.

if (pe_id == 0) {
send_right();
} else {
recv_left();
}

CSL supports while loops with an optional continuation expression:
var i: i16 = 0;
while (i < M) : (i += 1) {
b[i] = 2.0;
}

The @range builtin generates a compile-time range for iteration:
// Loop from 0 to M*N - 1
for (@range(i16, M * N)) |idx| {
A[idx] = @as(f32, idx);
}
// Loop from 1 to 3 with step 1
for (@range(u16, 1, 4, 1)) |pe_id| {
// pe_id takes values 1, 2, 3
}

var state: u16 = 0;
switch (state) {
0 => {
// handle case 0
},
1 => {
// handle case 1
},
else => {
// default case
}
}

Parameters are compile-time constants injected by the layout file or the SDK Layout API. They are a core mechanism for configuring PE programs:
param memcpy_params: comptime_struct; // Structured compile-time parameter
param M: i16; // Scalar parameter
param N_per_PE: i16;
param pe_id: i16;
param send_color: color; // Color parameter
param main_task_id: local_task_id; // Task ID parameter

Parameters are set in the layout file via @set_tile_code:
@set_tile_code(0, 0, "pe_program.csl", .{
.memcpy_params = memcpy.get_params(0),
.M = M,
.N_per_PE = N / 2,
.pe_id = 0,
.send_color = send_color
});

Since parameters are compile-time constants, they can be used to:
- Size arrays: var A: [M * N_per_PE]f32;
- Control compilation: if (@is_arch("wse3")) { ... }
- Define DSD extents and all other compile-time constructs

Modules are imported using @import_module. System libraries use angle brackets:
// System libraries
const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);
const random = @import_module("<random>");
const tsc = @import_module("<time>");
const math = @import_module("<math>");
const ctrl = @import_module("<control>");
const layout_mod = @import_module("<layout>");
const debug = @import_module("<debug>");
const simprint = @import_module("<simprint>");
// Collective communication library
const mpi_x = @import_module("<collectives_2d/pe>", .{
.dim_params = c2d_params.x,
.queues = [2]u16{2, 4},
.dest_dsr_ids = [1]u16{1},
.src0_dsr_ids = [1]u16{1},
.src1_dsr_ids = [1]u16{1}
});

| Library | Purpose |
|---|---|
| <memcpy/memcpy> | Host-device data transfer and command stream |
| <memcpy/get_params> | Generate memcpy configuration parameters |
| <random> | Random number generation (e.g., random_f32(-1.0, 1.0)) |
| <time> | Timestamp counter operations |
| <math> | Mathematical functions (e.g., sqrt_f32) |
| <control> | Control wavelet encoding helpers |
| <layout> | Runtime PE coordinate queries |
| <debug> | Trace logging for post-execution analysis |
| <simprint> | Print to simulator log (sim.log) during simulation |
| <collectives_2d/pe> | MPI-style collective operations (broadcast, reduce, scatter, gather) |

The <layout> module provides runtime access to PE coordinates, which avoids the need for compile-time pe_id parameters:
const layout_mod = @import_module("<layout>");
fn is_top_row() bool {
return (layout_mod.get_y_coord() == 0);
}
fn is_left_col() bool {
return (layout_mod.get_x_coord() == 0);
}

DSDs (Data Structure Descriptors) are a central concept in CSL. They describe how to iterate over data in memory or on the communication fabric, and they allow hardware-accelerated bulk operations without explicit loops.
A memory DSD describes a pattern of memory accesses:
// Method 1: base_address + extent (contiguous access)
var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
// Method 2: tensor_access expression (flexible, supports striding)
var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
// Strided access (every Nth element)
var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i * N] });
// Diagonal access of a 2D array
const dsdA = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{size} -> A[i, i] });
// With wavelet_index_offset (uses wavelet index to offset access)
const dsd_A = @get_dsd(mem1d_dsd, .{
.tensor_access = |i|{height} -> A[i, 0],
.wavelet_index_offset = true
});

The tensor_access syntax |i|{BOUND} -> array[expr(i)] specifies:
- i -- the induction variable
- {BOUND} -- the number of iterations (loop bound)
- array[expr(i)] -- an affine expression generating the address

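To make the access pattern concrete, the following Python sketch lists the flat element indices that a 1D memory DSD would visit for the examples above. It is only a host-side model, not SDK code: the helper name `tensor_access_indices` and its `scale * i + offset` form are assumptions made for illustration, and 2D arrays are assumed to be stored row-major.

```python
def tensor_access_indices(bound, scale=1, offset=0):
    """Flat indices visited by a hypothetical |i|{bound} -> arr[scale * i + offset]."""
    return [scale * i + offset for i in range(bound)]


M, N = 4, 6

# |i|{M} -> b[i]: contiguous access
contiguous = tensor_access_indices(M)

# |i|{M} -> A[i * N]: every Nth element
strided = tensor_access_indices(M, scale=N)

# |i|{size} -> A[i, i] on a size x size row-major array: flat index i * size + i
size = 3
diagonal = tensor_access_indices(size, scale=size + 1)

print(contiguous)  # [0, 1, 2, 3]
print(strided)     # [0, 6, 12, 18]
print(diagonal)    # [0, 4, 8]
```

Under the row-major assumption, the strided pattern selects the first column of an M x N matrix, which is how the matrix-vector examples in this document walk A column by column.
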
Fabric DSDs describe data flowing through the communication network:
// Fabric output DSD -- sends wavelets along a color
const out_dsd = @get_dsd(fabout_dsd, .{
.fabric_color = send_color,
.extent = M,
.output_queue = send_color_oq
});
// Fabric input DSD -- receives wavelets from a color
const in_dsd = @get_dsd(fabin_dsd, .{
.fabric_color = send_color,
.extent = M,
.input_queue = send_color_iq
});
// Control wavelet output DSD
const tx_ctrl_dsd = @get_dsd(fabout_dsd, .{
.extent = 1,
.fabric_color = tx_color,
.control = true, // marks as control wavelet
.output_queue = tx_oq
});

// Shift a DSD's base offset by one f32 element
A_dsd = @increment_dsd_offset(A_dsd, 1, f32);

DSD operations are hardware-accelerated bulk operations that iterate over DSDs:
| Builtin | Description |
|---|---|
| @fadds(dest, src1, src2) | Element-wise f32 addition: dest = src1 + src2 |
| @fmacs(dest, src1, src2, scalar) | Multiply-accumulate: dest = src1 + src2 * scalar |
| @fmovs(dest, src) | Copy/move elements |
| @fnegs(dest, src) | Element-wise negation |
| @add16(dest, src1, value) | 16-bit integer addition |
| @mov32(dest, value) | Move a 32-bit value |

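The semantics in the table can be checked on the host with a plain Python model. The sketch below is not SDK code: `fmacs` and `fadds` are hypothetical list-based stand-ins for the builtins, and the matrix is assumed to be stored row-major. It computes y = Ax + b column by column, shifting the strided view of A by one element per step in the same way `@increment_dsd_offset` is used in this document.

```python
def fmacs(src1, src2, scalar):
    """Model of @fmacs: src1[i] + src2[i] * scalar for each element."""
    return [a + b * scalar for a, b in zip(src1, src2)]


def fadds(src1, src2):
    """Model of @fadds: element-wise sum."""
    return [a + b for a, b in zip(src1, src2)]


M, N = 4, 6
A = [float(idx) for idx in range(M * N)]  # A[idx] = idx, M x N row-major
x = [1.0] * N
b = [2.0] * M
y = [0.0] * M

for j in range(N):
    # Strided view |i|{M} -> A[i * N], shifted by j elements: column j of A
    column = [A[i * N + j] for i in range(M)]
    y = fmacs(y, column, x[j])

y = fadds(y, b)
print(y)  # [17.0, 53.0, 89.0, 125.0]
```

A model like this is mainly useful for producing reference values to compare against the array copied back from the device.
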
DSD operations support an options struct for asynchronous execution:
@fmovs(out_dsd, y_dsd, .{
.async = true, // Run asynchronously
.activate = exit_task_id, // Activate task when done
.ut_id = send_color_ut, // Explicit microthread ID (WSE-3)
.unblock = process_task // Unblock a blocked task when done
});

@map provides customizable DSD operations using user-defined callback functions:
// Apply sqrt to each element of a DSD
@map(math_lib.sqrt_f32, input_dsd, output_dsd);
// Custom transformation with mixed scalar and DSD arguments
@map(transformation, dsdA, 2.0, 6.0, dsd_weight, dsdA);
// Reduction (sum all elements)
fn reduction(value: i32, sum: *i32) i32 {
return sum.* + value;
}
@map(reduction, dsdB, &sum[0], &sum[0]);

@map eliminates explicit loops by leveraging DSD descriptions as implicit loop structures, enabling hardware-optimized iteration.
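As a rough mental model, @map behaves like an element-wise apply in which DSD arguments are iterated and scalar arguments are passed unchanged on every step. The Python sketch below illustrates that idea only; `map_dsd` is a hypothetical helper, lists stand in for DSDs, and the real builtin writes into a destination DSD rather than returning a list.

```python
import math


def map_dsd(fn, *args):
    """Apply fn element-wise; list arguments are iterated, scalars are broadcast."""
    lengths = {len(a) for a in args if isinstance(a, list)}
    assert len(lengths) == 1, "all DSD-like arguments must have the same extent"
    extent = lengths.pop()
    return [
        fn(*[a[i] if isinstance(a, list) else a for a in args])
        for i in range(extent)
    ]


def transformation(value, coeff1, coeff2, weight):
    return value * (coeff1 + weight) + value * (coeff2 + weight)


# Counterpart of applying sqrt to each element
roots = map_dsd(math.sqrt, [1.0, 4.0, 9.0])

# Counterpart of mixing DSD arguments (lists) with scalar arguments
mixed = map_dsd(transformation, [1.0, 2.0], 2.0, 6.0, [0.5, 1.0])

# Counterpart of the reduction: the running sum is both an input and the output
total = 0
for value in [1, 2, 3, 4]:
    total = total + value

print(roots)  # [1.0, 2.0, 3.0]
print(mixed)  # [9.0, 20.0]
print(total)  # 10
```
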
CSL uses a task-based execution model. The CE on each PE executes tasks, which are triggered by various events.
Local tasks are explicitly activated by ID, used for control flow and sequencing:
const exit_task_id: local_task_id = @get_local_task_id(9);
task exit_task() void {
sys_mod.unblock_cmd_stream();
}
comptime {
@bind_local_task(exit_task, exit_task_id);
}
// Activate programmatically
@activate(exit_task_id);

Data tasks are activated automatically when a wavelet of the associated color arrives. The wavelet payload is passed as the task argument:
// On WSE-2, data tasks bind to colors; on WSE-3, to input queues
const recv_x_task_id: data_task_id =
if (@is_arch("wse2")) @get_data_task_id(x_color)
else if (@is_arch("wse3")) @get_data_task_id(x_color_iq);
// Single f32 payload
task recv_x(x_val: f32) void {
@fmacs(y_dsd, y_dsd, A_dsd, x_val);
}
// Sparse tensor: upper 16 bits = index, lower 16 bits = data
task main_task(data: f16, idx: u16) void {
result[idx] = data;
}
comptime {
@bind_data_task(recv_x, recv_x_task_id);
}

Control tasks are activated by control wavelets. They use non-routable task IDs and are often used as sentinels to signal the end of a data stream:
param sentinel: u16;
const send_result_task_id: control_task_id = @get_control_task_id(sentinel);
task send_result() void {
@fmovs(out_dsd, result_dsd, .{ .async = true });
}
comptime {
@bind_control_task(send_result, send_result_task_id);
}

Tasks can be blocked and unblocked to prevent re-entrant execution during async operations:
task process_task(element: f32) void {
@block(process_task_id); // Prevent re-entry
elem[0] = element * element * element;
@fmovs(out_dsd, elem_dsd, .{
.async = true,
.unblock = process_task // Unblock when done
});
}

A color is a communication channel identifier. Wavelets (data packets) travel through the 2D fabric tagged with a color. Each PE's router inspects the color to determine forwarding behavior.
const send_color: color = @get_color(0); // Allocate color with explicit ID
param send_color: color; // Or receive as parameter

Routes are configured in the layout file using @set_color_config:
// Receive from the compute element (RAMP), send EAST
@set_color_config(0, 0, send_color, .{
.routes = .{ .rx = .{ RAMP }, .tx = .{ EAST } }
});
// Receive from WEST, send to RAMP (and optionally EAST)
@set_color_config(1, 0, send_color, .{
.routes = .{ .rx = .{ WEST }, .tx = .{ RAMP, EAST } }
});

Route directions: NORTH, SOUTH, EAST, WEST, RAMP (to/from the CE).
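The effect of a set of rx/tx routes can be traced by hand, or with a small model such as the Python sketch below. It is a simplification written for illustration: `deliver` is a hypothetical helper, only a single row with EAST/WEST/RAMP directions is handled, and a wavelet that arrives from a direction with no matching rx entry is simply ignored, which is not a statement about what the hardware does in that case.

```python
def deliver(routes, start_pe):
    """Follow a wavelet injected from the CE of start_pe along a 1D row.

    routes maps a PE x-coordinate to {"rx": set, "tx": set}.
    Returns the x-coordinates whose CE receives the wavelet (tx contains RAMP).
    """
    step = {"EAST": 1, "WEST": -1}
    opposite = {"EAST": "WEST", "WEST": "EAST"}
    received = []
    frontier = [(start_pe, "RAMP")]  # (pe, direction the wavelet arrives from)
    while frontier:
        pe, arrived_from = frontier.pop()
        cfg = routes.get(pe)
        if cfg is None or arrived_from not in cfg["rx"]:
            continue  # no matching rx route in this model
        for out in cfg["tx"]:
            if out == "RAMP":
                received.append(pe)
            else:
                # Leaving EAST means arriving at the neighbour from its WEST side
                frontier.append((pe + step[out], opposite[out]))
    return sorted(received)


# PE 0 sends from its CE to the EAST; the others forward EAST and also deliver locally
routes = {
    0: {"rx": {"RAMP"}, "tx": {"EAST"}},
    1: {"rx": {"WEST"}, "tx": {"RAMP", "EAST"}},
    2: {"rx": {"WEST"}, "tx": {"RAMP", "EAST"}},
    3: {"rx": {"WEST"}, "tx": {"RAMP"}},
}
print(deliver(routes, 0))  # [1, 2, 3]
```
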
Switches enable limited runtime control of routes, allowing a single color to be reused for multiple routing configurations:
const sender_switches = .{
.pos1 = .{ .tx = WEST },
.pos2 = .{ .tx = EAST },
.pos3 = .{ .tx = SOUTH },
.current_switch_pos = 1,
.ring_mode = true, // Wraps from pos3 back to pos0
};
@set_color_config(1, 1, channel, .{
.routes = sender_routes,
.switches = sender_switches
});

Switch positions advance when a control wavelet is received:
const ctrl = @import_module("<control>");
const switch_adv_pld = ctrl.encode_single_payload(ctrl.opcode.SWITCH_ADV, true, {}, 0);
@mov32(tx_ctrl_dsd, switch_adv_pld); // Advance switch position
@mov32(tx_data_dsd, payload); // Send data on new route

Filters allow PEs to selectively accept wavelets based on the upper 16 bits of the wavelet payload:
const filter = .{
.kind = .{ .range = true },
.min_idx = pe_id * 3,
.max_idx = pe_id * 3 + 2,
};
@set_color_config(pe_id, 0, data_color, .{
.routes = .{ .rx = .{ WEST }, .tx = .{ RAMP, EAST } },
.filter = filter
});

On WSE-2, when swap_color_x is enabled, a wavelet's color bit is flipped as it passes through a router. This lets two colors alternate between PEs:
@set_color_config(pe_id, 0, red, .{
.routes = .{ .rx = .{ WEST }, .tx = .{ RAMP, EAST } },
.swap_color_x = true
});

For multi-hop accumulation patterns (e.g., reducing partial results from WEST to EAST), a checkerboard pattern using two alternating colors avoids conflicts:
// Even columns: receive on ax_color_1, send on ax_color_2
// Odd columns: receive on ax_color_2, send on ax_color_1

Hardware I/O queues buffer wavelets entering and leaving the PE. On WSE-3, queues must be explicitly initialized:
const send_color_oq: output_queue = @get_output_queue(2);
const send_color_iq: input_queue = @get_input_queue(2);
comptime {
if (@is_arch("wse3")) {
@initialize_queue(send_color_oq, .{ .color = send_color });
@initialize_queue(send_color_iq, .{ .color = send_color });
}
}

On WSE-3, microthread IDs can be decoupled from queue IDs for flexible resource management:
const send_color_ut = @get_ut_id(4); // Explicit microthread ID
@fmovs(out_dsd, y_dsd, .{
.async = true,
.ut_id = send_color_ut, // Use specific microthread
.activate = exit_task_id
});

This allows sharing microthreads between output queues and conserving resources.
FIFOs buffer data between asynchronous operations, extending the small hardware queues:
var fifo_buffer = @zeros([1024]f32);
const fifo = @allocate_fifo(fifo_buffer);
// Push from fabric into FIFO
@fadds(fifo, in_dsd, ten_dsd, .{ .async = true });
// Pop from FIFO to fabric
@fnegs(loopback_dsd, fifo, .{ .async = true });

FIFOs require two microthreads (one for push, one for pop) and a scratch buffer.
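The decoupling a FIFO provides can be pictured with a bounded queue: the producer pushes while there is room and the consumer pops whenever data is available. The Python sketch below models that behaviour only; `BoundedFifo` is a hypothetical class, and the add-10 push and negating pop merely echo the @fadds and @fnegs calls above.

```python
from collections import deque


class BoundedFifo:
    """Fixed-capacity FIFO: push fails when full, pop fails when empty."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def try_push(self, value):
        if len(self.items) == self.capacity:
            return False  # full: the producer has to wait
        self.items.append(value)
        return True

    def try_pop(self):
        if not self.items:
            return None  # empty: the consumer has to wait
        return self.items.popleft()


fifo = BoundedFifo(capacity=4)
pending = deque([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
outgoing = []

while pending or fifo.items:
    # Producer side: push (value + 10) while there is room
    while pending and fifo.try_push(pending[0] + 10.0):
        pending.popleft()
    # Consumer side: pop one element and negate it
    value = fifo.try_pop()
    if value is not None:
        outgoing.append(-value)

print(outgoing)  # [-11.0, -12.0, -13.0, -14.0, -15.0, -16.0]
```

Even though six elements pass through a four-slot buffer, ordering is preserved and nothing is lost, because the producer stalls whenever the buffer is full.
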
The comptime block contains code that executes at compile time. It is used for:
- Binding tasks to task IDs
- Initializing queues (WSE-3)
- Exporting symbols to the host
- Activating initial tasks
- Conditional compilation based on architecture
comptime {
// Task binding
@bind_local_task(exit_task, exit_task_id);
@bind_data_task(recv_x, recv_x_task_id);
@bind_control_task(send_result, send_result_task_id);
// Initial task activation
@activate(main_task_id);
// Architecture-conditional compilation
if (@is_arch("wse3")) {
@initialize_queue(send_color_oq, .{ .color = send_color });
}
// Export symbols for host access
@export_symbol(y_ptr, "y");
@export_symbol(compute);
}

Layout files define the spatial arrangement and communication topology of the program.
layout {
// Define a width x height rectangle of PEs
@set_rectangle(2, 1);
// Assign code to specific PEs with parameters
@set_tile_code(0, 0, "pe_program.csl", .{
.memcpy_params = memcpy.get_params(0),
.M = M,
.pe_id = 0,
.send_color = send_color
});
// Configure color routing for specific PEs
@set_color_config(0, 0, send_color, .{
.routes = .{ .rx = .{ RAMP }, .tx = .{ EAST } }
});
// Export symbol names visible to host
@export_name("y", [*]f32, true); // true = writable
@export_name("compute", fn()void);
}

| Builtin | Description |
|---|---|
| @set_rectangle(width, height) | Define the PE grid dimensions |
| @set_tile_code(x, y, file, params) | Assign CSL code and params to a PE |
| @set_color_config(x, y, color, config) | Configure routing/switches/filters for a color on a PE |
| @export_name(name, type, writable) | Make a symbol accessible from the host |
| @get_rectangle() | Get the rectangle dimensions at compile time |

Layout blocks support for loops over @range for configuring multiple PEs:
layout {
@set_rectangle(4, 1);
for (@range(u16, 1, 4, 1)) |pe_id| {
@set_tile_code(pe_id, 0, "recv.csl", .{ .pe_id = pe_id });
}
}

The SdkLayout API is a Python-based alternative to writing layout files in CSL. It provides:
- Code regions: rectangular groups of PEs running the same code
- Symbolic colors: automatic physical color allocation
- Ports and connections: automatic route finding between code regions
- Host I/O streams: connections to/from the host
from cerebras.sdk.sdk_layout import SdkLayout, Color, Route, Edge
layout = SdkLayout()
# Create a code region
gemv = layout.create_code_region("/path/to/gemv.csl", "gemv", width, height)
# Create symbolic colors
x_color = Color()
y_color = Color()
layout.paint(gemv, {"x_in": x_color})
# Define input/output ports
x_port = layout.create_input_port(gemv, x_color)
y_port = layout.create_output_port(gemv, y_color)
# Connect to host
layout.create_input_stream("x_stream", x_port)
layout.create_output_stream("y_stream", y_port)

| Builtin | Description |
|---|---|
| @as(type, value) | Cast a value to the specified type |
| @ptrcast([*]u32, &data) | Cast a pointer to a different pointer type |
| @is_arch("wse2") / @is_arch("wse3") | Check target architecture at compile time |
| @get_rectangle() | Get the PE rectangle dimensions |

| Builtin | Description |
|---|---|
| @get_dsd(type, config) | Create a DSD of the given type |
| @increment_dsd_offset(dsd, offset, type) | Shift a DSD's base offset |
| @zeros([size]type) | Create a zero-initialized array |
| @constants([size]type, value) | Create an array initialized to a constant |
| @allocate_fifo(buffer) | Create a FIFO backed by a buffer |

| Builtin | Description |
|---|---|
| @get_color(id) | Get a color with explicit ID |
| @get_output_queue(id) | Get an output queue by ID |
| @get_input_queue(id) | Get an input queue by ID |
| @get_ut_id(id) | Get a microthread ID (WSE-3) |
| @initialize_queue(queue, config) | Initialize a queue (required on WSE-3) |

| Builtin | Description |
|---|---|
| @get_local_task_id(id) | Get a local task ID |
| @get_data_task_id(color_or_queue) | Get a data task ID |
| @get_control_task_id(id) | Get a control task ID |
| @bind_local_task(fn, id) | Bind a function to a local task ID |
| @bind_data_task(fn, id) | Bind a function to a data task ID |
| @bind_control_task(fn, id) | Bind a function to a control task ID |
| @activate(task_id) | Activate a local task |
| @block(task_id) | Block a task from executing |

| Builtin | Description |
|---|---|
| @export_symbol(symbol, name?) | Export a symbol for host access |
| @export_name(name, type, writable) | Export a name in the layout file |
| @import_module(path, params?) | Import a library or module |

| Builtin | Signature | Description |
|---|---|---|
| @fadds | (dest, src1, src2, opts?) | f32 element-wise add |
| @fmacs | (dest, src1, src2, scalar, opts?) | f32 multiply-accumulate |
| @fmovs | (dest, src, opts?) | f32 move/copy |
| @fnegs | (dest, src, opts?) | f32 negate |
| @add16 | (dest, src1, value, opts?) | i16/u16 add |
| @mov32 | (dest, value) | Move 32-bit value |
| @map | (fn, args..., dest) | Custom DSD operation |

# Compile CSL to device binary
cslc layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 \
--params=M:4,N:6 --memcpy --channels=1 -o out --arch=wse2

from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder
# Initialize and start
runner = SdkRuntime("out", cmaddr=args.cmaddr)
runner.load()
runner.run()
# Get device symbol handle
y_symbol = runner.get_id("y")
# Host-to-device copy
runner.memcpy_h2d(y_symbol, data, px, py, width, height, num_elems,
streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
order=MemcpyOrder.ROW_MAJOR, nonblock=False)
# Launch device function (RPC)
runner.launch("compute", nonblock=False)
# Device-to-host copy
runner.memcpy_d2h(result_buf, y_symbol, px, py, width, height, num_elems,
streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
order=MemcpyOrder.ROW_MAJOR, nonblock=False)
# Cleanup
runner.stop()

Streaming mode uses special memcpy colors instead of named symbols:
# Stream data to device (uses color ID, not symbol)
runner.memcpy_h2d(MEMCPYH2D_DATA_1, x_data, 0, 0, width, 1, size,
streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT)
# Stream data from device
runner.memcpy_d2h(result, MEMCPYD2H_DATA_1, 0, 0, 1, height, size,
streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT)

from cerebras.sdk.sdk_layout import SdkLayout
layout = SdkLayout()
# ... define code regions, ports, connections ...
compile_artifacts = layout.compile(fabric_dims=(8, 3))
runtime = SdkRuntime(compile_artifacts, platform="sim")
runtime.run()
runtime.send("input_stream", data, nonblock=True)
runtime.receive("output_stream", buffer, size, nonblock=True)
runtime.stop()

| Feature | WSE-2 | WSE-3 |
|---|---|---|
| Data task binding | Bound to colors (IDs 0-23) | Bound to input queues (IDs 0-7) |
| Queue initialization | Implicit | Explicit (@initialize_queue required) |
| Microthread IDs | Coupled to queue IDs | Decoupled (@get_ut_id) |
| Color swap | Supported | Not yet supported |
| Local task IDs | 0-30 | 8-30 |
| Data task IDs | 0-23 (from colors) | 0-7 (from input queues) |
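The ID ranges in the table are easy to get wrong when porting a program between architectures. The Python sketch below encodes only the "Local task IDs" and "Data task IDs" rows as a lookup; `TASK_ID_RANGES` and `is_valid_task_id` are hypothetical names, and the check is a host-side convenience rather than anything the compiler or SDK provides.

```python
# Inclusive ranges from the table, expressed as Python ranges (end exclusive)
TASK_ID_RANGES = {
    ("wse2", "local"): range(0, 31),  # 0-30
    ("wse3", "local"): range(8, 31),  # 8-30
    ("wse2", "data"): range(0, 24),   # 0-23, derived from colors
    ("wse3", "data"): range(0, 8),    # 0-7, derived from input queues
}


def is_valid_task_id(arch, kind, task_id):
    """Return True if task_id lies in the documented range for arch and kind."""
    return task_id in TASK_ID_RANGES[(arch, kind)]


print(is_valid_task_id("wse2", "local", 0))  # True
print(is_valid_task_id("wse3", "local", 0))  # False
print(is_valid_task_id("wse3", "local", 9))  # True
```

Local task ID 9, used for exit_task_id in this document, falls inside both local ranges, so that declaration does not need an architecture check.
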
Architecture checks are done at compile time:
if (@is_arch("wse3")) {
@initialize_queue(oq, .{ .color = my_color });
}

CSL uses Zig-style anonymous struct literals extensively for configuration:
// DSD configuration
.{ .base_address = &y, .extent = M }
// Route configuration
.{ .routes = .{ .rx = .{ RAMP }, .tx = .{ EAST } } }
// Async operation options
.{ .async = true, .activate = exit_task_id }
// Filter configuration
.{ .kind = .{ .range = true }, .min_idx = 3, .max_idx = 5 }
// Module parameters
.{ .width = 4, .height = 3 }

The <debug> library records traces for post-execution analysis:
const trace = @import_module("<debug>", .{ ... });
trace.trace_string("task started");
trace.trace_i16(global);
trace.trace_timestamp();

Host code reads traces after execution:
from cerebras.sdk.debug.debug_util import read_trace
read_trace(runner, x, y, width, height, "trace")

The <simprint> library prints directly to simulator logs during execution:
const simprint = @import_module("<simprint>");
simprint.print_string("recv_task: in_data = ");
simprint.print_f32(in_data);
simprint.print_string("\n"); // newline flushes output

Output appears in sim.log with cycle-accurate timestamps:
@968 PE(0,0): sender beginning main_fn
@1156 PE(1,0): recv_task: in_data = 0, global = 0
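When a simulation prints many lines, it can help to parse them into records and filter by PE or cycle. The Python sketch below does that for lines shaped like the two samples above. The regular expression is inferred from those samples alone, so the exact format is an assumption, and `parse_simprint_line` is a hypothetical helper, not part of the SDK.

```python
import re

# "@<cycle> PE(<x>,<y>): <message>", as in the sample output
LINE_RE = re.compile(r"^@(\d+) PE\((\d+),(\d+)\): (.*)$")


def parse_simprint_line(line):
    """Return a dict with cycle, x, y and message, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    cycle, x, y, message = match.groups()
    return {"cycle": int(cycle), "x": int(x), "y": int(y), "message": message}


sample = [
    "@968 PE(0,0): sender beginning main_fn",
    "@1156 PE(1,0): recv_task: in_data = 0, global = 0",
]
records = [parse_simprint_line(line) for line in sample]

# Keep only the messages printed by PE (1, 0)
from_pe_1_0 = [r["message"] for r in records if (r["x"], r["y"]) == (1, 0)]
print(from_pe_1_0)  # ['recv_task: in_data = 0, global = 0']
```
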
The <collectives_2d> library provides MPI-style operations across PE rows and columns:
const mpi_x = @import_module("<collectives_2d/pe>", .{
.dim_params = c2d_params.x,
.queues = [2]u16{2, 4},
.dest_dsr_ids = [1]u16{1},
.src0_dsr_ids = [1]u16{1},
.src1_dsr_ids = [1]u16{1}
});
// Broadcast from PE 0 to all PEs in the row
mpi_x.broadcast(0, send_buf, num_elems, callback_task_id);
// Reduce with f32 addition back to PE 0
mpi_x.reduce_fadds(0, send_buf, recv_buf, num_elems, callback_task_id);
// Scatter: distribute chunks from PE 0
mpi_y.scatter(0, send_buf, recv_buf, chunk_size, callback_task_id);
// Gather: collect chunks at PE 0
mpi_y.gather(0, send_buf, recv_buf, chunk_size, callback_task_id);

All collective operations are asynchronous and invoke a callback task upon completion, enabling chaining of multiple collective operations via a state machine pattern.
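The state machine pattern mentioned above can be sketched independently of the hardware: a single callback inspects a state counter, starts the next operation, and is invoked again when that operation completes. The Python model below illustrates the control flow only. `CollectiveChain` and `fake_collective` are hypothetical, and the stand-in operation completes immediately, whereas a real collective would complete later and activate the callback task.

```python
class CollectiveChain:
    """Runs named steps in order; each step receives a completion callback."""

    def __init__(self, steps):
        self.steps = steps
        self.state = 0
        self.log = []

    def advance(self):
        # Plays the role of the callback task: start the step for the current state
        if self.state == len(self.steps):
            self.log.append("done")
            return
        name, operation = self.steps[self.state]
        self.state += 1
        self.log.append(name)
        operation(self.advance)  # the operation calls back when it completes


def fake_collective(on_done):
    # Stand-in for an asynchronous collective; completes immediately here
    on_done()


chain = CollectiveChain([
    ("broadcast", fake_collective),
    ("reduce", fake_collective),
])
chain.advance()
print(chain.log)  # ['broadcast', 'reduce', 'done']
```
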
A minimal but complete CSL program that computes y = Ax + b on a single PE:
layout.csl:
param M: i16;
param N: i16;
const memcpy = @import_module("<memcpy/get_params>", .{
.width = 1, .height = 1,
});
layout {
@set_rectangle(1, 1);
@set_tile_code(0, 0, "pe_program.csl", .{
.memcpy_params = memcpy.get_params(0),
.M = M, .N = N,
});
@export_name("y", [*]f32, false);
@export_name("init_and_compute", fn()void);
}

pe_program.csl:
param memcpy_params: comptime_struct;
param M: i16;
param N: i16;
const sys_mod = @import_module("<memcpy/memcpy>", memcpy_params);
var A: [M * N]f32;
var x = @constants([N]f32, 1.0);
var b = @constants([M]f32, 2.0);
var y = @zeros([M]f32);
var y_dsd = @get_dsd(mem1d_dsd, .{ .base_address = &y, .extent = M });
var A_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> A[i * N] });
var b_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{M} -> b[i] });
const y_ptr: [*]f32 = &y;
fn init_and_compute() void {
for (@range(i16, M * N)) |idx| {
A[idx] = @as(f32, idx);
}
for (@range(u16, N)) |i| {
@fmacs(y_dsd, y_dsd, A_dsd, x[i]);
A_dsd = @increment_dsd_offset(A_dsd, 1, f32);
}
@fadds(y_dsd, y_dsd, b_dsd);
sys_mod.unblock_cmd_stream();
}
comptime {
@export_symbol(y_ptr, "y");
@export_symbol(init_and_compute);
}

run.py:
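import argparse
import numpy as np
from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

# Read the system address from the command line so that args is defined
parser = argparse.ArgumentParser()
parser.add_argument("--cmaddr", help="IP:port of the CS system")
args = parser.parse_args()

runner = SdkRuntime("out", cmaddr=args.cmaddr)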
runner.load()
runner.run()
runner.launch("init_and_compute", nonblock=False)
y_symbol = runner.get_id("y")
result = np.zeros(4, dtype=np.float32)
runner.memcpy_d2h(result, y_symbol, 0, 0, 1, 1, 4,
streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
order=MemcpyOrder.ROW_MAJOR, nonblock=False)
runner.stop()

| Concept | Description |
|---|---|
| PE | Processing Element -- one unit in the 2D grid with its own CE, router, and memory |
| CE | Compute Engine -- executes tasks on a PE |
| Wavelet | A 32-bit data packet that travels through the fabric, tagged with a color |
| Color | A communication channel identifier used for routing wavelets |
| DSD | Data Structure Descriptor -- describes memory access or fabric I/O patterns |
| Route | Directional path (NORTH/SOUTH/EAST/WEST/RAMP) for wavelets at each PE |
| RAMP | The connection between the router and the CE |
| Task | A function triggered by events (local activation, wavelet arrival, control signal) |
| Switch | Runtime-modifiable route configuration, advanced by control wavelets |
| Filter | Hardware mechanism for selectively accepting wavelets |
| FIFO | Software buffer for decoupling producer/consumer in asynchronous operations |
| Memcpy | SDK infrastructure for host-device data transfer |
| Sentinel | Control wavelet used to signal end-of-stream or trigger control tasks |
| Microthread | Hardware execution context for async operations (explicitly managed on WSE-3) |
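
The wavelet layout used by the sparse data task example (index in the upper 16 bits, f16 data in the lower 16 bits) and the range filter on the upper 16 bits can be modeled on the host. The Python sketch below is only an illustration: the function names are hypothetical, and treating max_idx as inclusive is an assumption based on the filter example, where each PE covers pe_id * 3 through pe_id * 3 + 2.

```python
import struct


def pack_wavelet(index, data):
    """Pack a u16 index (upper 16 bits) and an f16 value (lower 16 bits)."""
    (data_bits,) = struct.unpack("<H", struct.pack("<e", data))
    return ((index & 0xFFFF) << 16) | data_bits


def unpack_wavelet(wavelet):
    """Split a 32-bit wavelet back into (index, f16 value)."""
    index = (wavelet >> 16) & 0xFFFF
    (data,) = struct.unpack("<e", struct.pack("<H", wavelet & 0xFFFF))
    return index, data


def range_filter_accepts(wavelet, min_idx, max_idx):
    """Model of a range filter on the upper 16 bits (bounds assumed inclusive)."""
    index = (wavelet >> 16) & 0xFFFF
    return min_idx <= index <= max_idx


wavelet = pack_wavelet(4, 1.5)
print(hex(wavelet))             # 0x43e00
print(unpack_wavelet(wavelet))  # (4, 1.5)

# Filter bounds for pe_id = 1: indices 3 through 5
pe_id = 1
print(range_filter_accepts(wavelet, pe_id * 3, pe_id * 3 + 2))               # True
print(range_filter_accepts(pack_wavelet(6, 1.5), pe_id * 3, pe_id * 3 + 2))  # False
```
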