Skip to content

Instantly share code, notes, and snippets.

@d3v-null
Created May 19, 2026 07:55
Show Gist options
  • Select an option

  • Save d3v-null/d363ab5bc3eb7ad445a846cad31cccd6 to your computer and use it in GitHub Desktop.

Select an option

Save d3v-null/d363ab5bc3eb7ad445a846cad31cccd6 to your computer and use it in GitHub Desktop.
Rapthor on HTConder / Glidein

Rapthor + HTCondor/GlideinWMS: Architecture, Late-Binding, and Integration Options

Technical analysis for radio astronomers and SKA/SRCNet engineers Date: May 2026


1. Background: The Two Systems

1.1 Rapthor

Rapthor is the LOFAR direction-dependent effects (DDE) calibration and imaging pipeline, developed at ASTRON and adopted as the basis of the SKAO ICAL pipeline. It automates the iterative self-calibration loop:

[Input MSes] → calibrate (DP3/DDECal) → image (WSClean+IDG) → extract sources → loop

Each pipeline operation (calibrate, image, subtract, mosaic, source-find) is implemented as a CWL (Common Workflow Language) workflow. Toil acts as the CWL execution engine and provides the pluggable batch-system layer. The supported backends are:

  • --batchSystem=single_machine — local development/testing
  • --batchSystem=slurm — HPC cluster, Toil leader submits Slurm jobs
  • --batchSystem=kubernetes — cloud/containerised clusters
  • --batchSystem=htcondor — HTCondor pool (community-supported)

Toil owns the full job DAG: it parses CWL, resolves step dependencies, marshals intermediate data, and issues individual compute tasks to the chosen batch backend. Rapthor's Python orchestration code drives the outer self-calibration loop, calling Toil repeatedly per operation.

Compute profile of Rapthor's job types:

Operation step Primary tool Resource profile
Direction-dependent calibration DP3 / DDECal CPU-heavy (many cores), high RAM (tens of GB per facet), I/O-heavy (MS read)
Wide-field imaging WSClean + IDG CPU or GPU (IDG-GPU mode); very high RAM (100+ GB for large fields); I/O-heavy
Beam application / subtraction DP3 CPU-moderate, I/O-heavy
Source finding PyBDSF / similar CPU-light to moderate, memory-moderate
Mosaicking custom I/O-dominated, CPU-light

The mix is heterogeneous: some steps want GPUs (WSClean+IDG), most want many CPU cores and a lot of RAM, and several are gated by I/O bandwidth from measurement sets that can be hundreds of gigabytes per pointing.

1.2 HTCondor and GlideinWMS

HTCondor is the reference implementation of High Throughput Computing (HTC), developed at the University of Wisconsin-Madison since 1988. Unlike traditional HPC batch systems (Slurm, PBS) that assume a static cluster, HTCondor is designed for dynamic, heterogeneous resource pools where nodes come and go.

GlideinWMS (Glidein-based Workflow Management System) is a meta-scheduler built on top of HTCondor, developed by Fermilab and used by CMS, ATLAS, and the Open Science Grid (OSG). It implements the pilot job (Glidein) model for true late-binding across federated resources.

Core HTCondor/GlideinWMS components relevant here:

Component Role
Central Manager Matchmaker daemon; receives job and machine ClassAds, performs matchmaking
Schedd Scheduler daemon; holds job queue, negotiates with central manager for slots
Startd Execution daemon on worker nodes; advertises slot capabilities, runs jobs
GlideinWMS Factory Submits pilot jobs (glideins) to remote batch systems on behalf of users
GlideinWMS Frontend Monitors user job queues, requests glideins from factory based on demand
HTCondor-CE Compute Entrypoint; accepts remote job submissions, translates to local batch system
ClassAds Flexible key-value expression language for describing jobs, machines, and matchmaking

2. HTCondor Late Binding: A Deep Dive

Late binding is a defining feature of HTCondor's glidein model. Understanding it precisely is essential before mapping Rapthor onto it.

2.1 The Problem Late Binding Solves

In early binding (classical batch submission), a job is submitted to a specific queue at a specific site. If that site is congested, the job waits. If the site fails, the job is lost. The job is irrevocably coupled to a slot at submission time.

In HTCondor's late binding via GlideinWMS, the work and the resource are matched only when the resource is actually available — when a glidein starts executing and joins the HTCondor pool.

2.2 The Glidein Mechanism

The canonical GlideinWMS flow is:

1. User submits jobs to a local HTCondor Schedd (user pool)
   — jobs specify requirements via ClassAds but are NOT bound to any site

2. Frontend monitors the user pool: "There are N pending jobs with requirements X"

3. Frontend requests glideins from Factory: "Please submit pilots to sites A, B, C
   that can satisfy requirements X"

4. Factory submits glidein jobs to HTCondor-CEs at multiple sites
   — glideins are submitted to local batch systems (Slurm, PBS, K8s)
   — glideins carry NO payload; they just request a slot

5. When a glidein starts on a worker node, it:
   a. Contacts the user pool's Central Manager
   b. Advertises its capabilities as a ClassAd (CPU, RAM, disk, GPU, etc.)
   c. Requests a job that matches its capabilities

6. Central Manager performs matchmaking: job ClassAd ↔ slot ClassAd

7. Job is dispatched to the glidein; glidein executes payload

8. Glidein can request another job (slot reuse) or exit

The binding between job and compute slot happens at step 6, after the slot is already running — hence "late binding."

2.3 Why This Differs from PanDA

Both GlideinWMS and PanDA use pilot jobs, but their architectures differ:

Aspect GlideinWMS PanDA
Central queue HTCondor Schedd (job ClassAds) PanDA Server (job database)
Matchmaking Condor Central Manager (ClassAd expressions) JEDI brokerage (plugin architecture)
Pilot submission Factory → HTCondor-CE → local batch Harvester → local batch
Job pull Startd negotiates with Central Manager Pilot polls PanDA Server REST API
Data locality Via ClassAd requirements/rank Via Rucio integration in JEDI

The key difference: HTCondor's matchmaking is expression-based (ClassAds), while PanDA's is database-driven with explicit Rucio integration. HTCondor can express data locality via ClassAds, but this requires explicit infrastructure to populate machine ClassAds with data availability information.

2.4 ClassAd Matchmaking for Data-Aware Scheduling

HTCondor's ClassAd mechanism allows flexible job requirements and ranking:

Requirements (hard constraints):

requirements = (TARGET.HasGPU == True) && (TARGET.Memory >= 64000)

Rank (soft preferences):

rank = (TARGET.HasLocalData ? 100 : 0) + (TARGET.Memory / 1000)

For data-aware scheduling, one could imagine:

# Job requires data at a specific site
requirements = member("dataset_1234", TARGET.AvailableDatasets)

# Prefer sites with more data locally
rank = size(filter(lambda ds: member(ds, TARGET.AvailableDatasets), MY.RequiredDatasets))

This requires:

  1. A service that populates worker/site ClassAds with data availability
  2. Jobs that specify their data requirements as ClassAd attributes
  3. Periodic refresh of data availability ClassAds

Pablo's proposal mentions a "Data Locality Awareness Metadata Hydrator" — this is exactly the missing component.

2.5 Why Late Binding Matters for SRCNet

In a federated system like SRCNet — where compute nodes are at different institutions across multiple countries with heterogeneous network links, storage backends, and queue policies — late binding provides:

  • Resilience: if a site becomes unavailable after glidein submission, another site's glideins can pick up the work
  • Opportunism: idle capacity at any site is automatically exploited
  • Load balancing: jobs naturally flow to sites with available slots
  • Heterogeneous resource routing: GPU vs. CPU decisions made at match time based on actual availability
  • Queue absorption: glideins wait in site queues, but jobs remain unbound until execution

3. Existing Toil + HTCondor Integration

3.1 Current State

Toil already includes an HTCondor batch system backend (--batchSystem htcondor). The HTCondorBatchSystem class:

class HTCondorBatchSystem(AbstractGridEngineBatchSystem):
    """
    Submits jobs to a local HTCondor pool via condor_submit.
    """
    
    def issueBatchJob(self, command, jobNode, job_environment=None):
        # Creates HTCondor submit description
        # Submits via Schedd API
        # Returns job ID
    
    def getRunningBatchJobIDs(self):
        # Queries condor_q for job status
    
    def killBatchJobs(self, jobIDs):
        # Calls condor_rm

This is a direct submission model: Toil leader submits jobs directly to a local HTCondor Schedd. It does NOT use GlideinWMS or HTCondor-CE — it assumes access to an existing HTCondor pool.

3.2 Limitations of Current Integration

The existing Toil HTCondor backend has several limitations:

  1. Local pool only: Assumes direct Schedd access; no federation
  2. No late binding: Jobs are submitted directly, not via glideins
  3. Community-supported: Not actively maintained by Toil core team
  4. Known issues: Environment variable handling bugs (issue #4363), Schedd busy handling
  5. No data locality: No mechanism to express or match data requirements

For SRCNet's federated model, this is insufficient. We need either:

  • A GlideinWMS-aware batch system that submits to a glidein-backed pool, or
  • A new backend that submits via HTCondor-CE to remote sites

4. Mapping Rapthor's Job Graph onto HTCondor Concepts

4.1 Conceptual Correspondence

Rapthor / Toil concept HTCondor / GlideinWMS concept
CWL Workflow DAGMan workflow (or external orchestrator)
CWL CommandLineTool step HTCondor job cluster
Individual Toil BatchJob (one scatter element) HTCondor job (one proc in a cluster)
Toil batch backend HTCondorBatchSystem (to Schedd)
Toil job store (file-based checkpoint) No equivalent — would use shared filesystem or HTCondor file transfer
Toil leader process (DAG orchestration) DAGMan or external (Toil stays)
Toil --batchSystem plugin HTCondor submit + ClassAd configuration
CWL ResourceRequirement hints ClassAd request_cpus, request_memory, request_disk, request_gpus
Toil scatter (parallel over facets) HTCondor job cluster with N procs

4.2 The Self-Calibration Loop

Rapthor's outer loop (calibrate → image → extract → re-calibrate) is driven by Python code in Rapthor's Pipeline class. HTCondor's DAGMan supports conditional branching via SCRIPT PRE/POST and RETRY, but expressing convergence-based loops requires external logic.

The most natural mapping keeps Toil (or Rapthor's Python) as the outer orchestrator, submitting individual CWL operations to HTCondor. This is similar to Option A in the PanDA analysis.

4.3 The Scatter Pattern

Rapthor processes each facet or calibration direction independently. In CWL this is a scatter over direction items. In HTCondor, this maps to a job cluster:

# Submit 100 parallel jobs for 100 directions
queue 100

Each job in the cluster gets a unique $(Process) ID (0-99). GlideinWMS glideins at multiple sites can execute these jobs in parallel, with late binding ensuring jobs run where slots become available.


5. Integration Architectures: Three Options

Option A — Enhanced Toil-HTCondor Plugin (Toil Stays; Local HTCondor Pool)

Architecture:

Rapthor (Python orchestrator)
  └── Toil CWL runner (leader process, DAG management)
        └── HTCondorBatchSystem (existing, enhanced)
              └── Local HTCondor Schedd
                    └── GlideinWMS glideins from multiple SRC sites

How it works:

  1. Deploy a central HTCondor pool (Schedd + Central Manager) accessible from SRCNet
  2. Deploy GlideinWMS Factory + Frontend that provisions glideins at SRC sites via HTCondor-CE
  3. Enhance Toil's existing HTCondorBatchSystem to:
  • Set data-locality ClassAd attributes on jobs
  • Configure resource requirements from CWL hints
  1. Glideins from multiple sites join the central pool
  2. Jobs are matched to glideins based on ClassAd expressions

Data staging: Toil's job store must be accessible from all SRC sites. Options:

  • Globally-mounted object store (Ceph/S3)
  • HTCondor file transfer (for small files)
  • External data management (Rucio or DMAPI) with ClassAd-based locality matching

Pros:

  • Minimal changes to Rapthor codebase
  • Uses existing Toil HTCondor backend (with enhancements)
  • Preserves Toil's CWL DAG management and checkpoint/restart
  • True late binding via GlideinWMS
  • Leverages HTCondor's mature matchmaking
  • Single central job queue simplifies monitoring

Cons:

  • Requires deploying GlideinWMS infrastructure (Factory, Frontend, per-site HTCondor-CEs)
  • Central HTCondor pool is a single point of failure
  • Data locality requires custom "hydrator" service to populate machine ClassAds
  • Toil leader must remain alive for pipeline duration
  • HTCondor backend is "community supported" with known issues

Verdict: This is the most direct path if SRCNet decides to adopt HTCondor. The infrastructure investment is significant but well-understood (OSG has run this at scale for 15+ years).


Option B — Toil + HTCondor-CE Direct Submission (Per-Site Late Binding)

Architecture:

Rapthor (Python orchestrator)
  └── Toil CWL runner
        └── [NEW] HTCondorCEBatchSystem
              └── Multiple HTCondor-CE endpoints (per SRC site)
                    └── Local batch systems (Slurm, K8s, etc.)

How it works:

  1. Each SRC site deploys HTCondor-CE as a gateway to its local batch system
  2. New Toil batch system plugin submits jobs directly to HTCondor-CE endpoints
  3. Site selection logic (brokerage) lives in the plugin or a separate service
  4. HTCondor-CE translates job to local batch system (Slurm, K8s, etc.)

This is not true GlideinWMS late binding — it's more like the current TIGER Broker/Gateway model but using HTCondor-CE as the gateway protocol.

Pros:

  • Simpler than full GlideinWMS deployment
  • HTCondor-CE is already used by 170+ sites in WLCG
  • SciTokens authentication supported
  • No central HTCondor pool required

Cons:

  • Not true late binding — job is bound to site at submission time
  • Requires implementing brokerage/site-selection logic (the same problem as TIGER)
  • Each site needs HTCondor-CE deployment
  • Less flexible than GlideinWMS for opportunistic resource use

Verdict: This provides a standardized gateway protocol but doesn't solve the late-binding problem. It's an alternative to TIGER's custom Job Gateway, not a fundamentally different scheduling model.


Option C — Replace Broker/Gateway with GlideinWMS (Full HTCondor Stack)

Architecture:

User / Client
  └── HTCondor Schedd (central job queue)
        └── Central Manager (matchmaking)
              └── GlideinWMS Factory
                    └── HTCondor-CE at each SRC site
                          └── Local batch (Slurm, K8s)
                                └── Glideins (workers)
                                      └── Execute Rapthor CWL steps

How it works:

  1. Replace TIGER Broker entirely with HTCondor Schedd + Central Manager
  2. Replace Job Gateways with HTCondor-CE + glidein infrastructure
  3. Submit Rapthor CWL operations as HTCondor DAGMan workflows or via Toil
  4. GlideinWMS provisions workers across all SRC sites
  5. ClassAd matchmaking handles resource and data locality

This is the "full HTCondor" approach — using HTCondor as the unified job management layer.

Pros:

  • True slot-level late binding across federation
  • Battle-tested at scale (CMS processes millions of jobs/day)
  • Rich matchmaking via ClassAds
  • Large community, good documentation, annual conference
  • No custom brokerage code to maintain
  • Handles heterogeneous resources (CPU, GPU) natively

Cons:

  • Significant infrastructure investment (full GlideinWMS stack)
  • Requires HTCondor-CE at every SRC site
  • Data locality requires custom ClassAd hydration (no native Rucio integration)
  • Learning curve for operators unfamiliar with HTCondor
  • Doesn't leverage existing SRCNet investment in PanDA/Harvester

Data locality implementation:

Unlike PanDA (which integrates with Rucio), HTCondor has no native data management. Options:

  1. ClassAd hydrator service: Periodically query DMAPI/Rucio, update machine ClassAds with available datasets
  2. Job-side data staging: Jobs use DMAPI to stage data before execution (early binding of data, late binding of slot)
  3. Storage-coupled workers: Glideins only run at sites with local data access (constrain factory submissions)

Verdict: This is a major architectural change that replaces rather than enhances the current SRCNet approach. It would work well technically but represents a strategic pivot away from PanDA.


6. Late Binding: Specific Advantages for Rapthor's Workload

6.1 Heterogeneous Resource Routing (CPU vs. GPU)

Rapthor's WSClean imaging step can run in CPU-only, idg (CPU), or idg-gpu (GPU) modes. With HTCondor:

# Job ClassAd
+WantsGPU = True
request_gpus = 1
rank = TARGET.HasGPU ? 1000 : 0

# Falls back to CPU if no GPU slots available
requirements = (TARGET.HasGPU == True) || (TARGET.Cpus >= 16 && TARGET.Memory >= 64000)

The matchmaker routes to GPU slots when available, falls back to CPU otherwise — without user intervention.

6.2 Data Locality Across SRCNet Nodes

Assuming a ClassAd hydrator populates machine ads with dataset availability:

# Machine ClassAd (populated by hydrator)
AvailableDatasets = {"obs_1234", "obs_5678", "skymodel_v3"}
SiteLocation = "SRCNET_UK"

# Job ClassAd
+RequiredDatasets = {"obs_1234", "skymodel_v3"}
requirements = subset(MY.RequiredDatasets, TARGET.AvailableDatasets)
rank = size(intersection(MY.RequiredDatasets, TARGET.AvailableDatasets))

Jobs preferentially run at sites with local data, but can run elsewhere if needed (triggering data staging).

6.3 HPC Queue Latency Absorption

GlideinWMS glideins wait in site batch queues. When they start, they immediately join the HTCondor pool and request work. This absorbs queue latency:

  1. Factory submits 100 glideins across 5 sites
  2. Glideins wait in local queues (minutes to hours depending on site)
  3. As glideins start, they pull jobs from the central queue
  4. Sites with faster queues naturally process more jobs
  5. No job is "trapped" in a slow queue — work stays in the central pool until a glidein is ready

6.4 Multi-Site Parallelism

A Rapthor scatter over 100 facets creates 100 HTCondor jobs. With glideins at 5 sites:

  • 20 glideins run at SRCNET_UK (fast queue)
  • 15 at SRCNET_NL (medium queue)
  • 10 at SRCNET_AU (slow queue)
  • Jobs flow to available slots as glideins start
  • Natural load balancing across sites

6.5 Resilience and Retry

If a glidein crashes or a site goes offline:

  • Running jobs on that glidein are marked failed
  • HTCondor automatically requeues them (configurable retry policy)
  • Another glidein (possibly at a different site) picks them up

No manual intervention required for transient failures.


7. Comparison: HTCondor/GlideinWMS vs. PanDA vs. TIGER

Feature TIGER (current) PanDA HTCondor/GlideinWMS
Late binding level Site (Gateway claims job) Slot (pilot pulls job) Slot (glidein pulls job)
Central queue RabbitMQ PanDA Server DB HTCondor Schedd
Matchmaking Simple (spread across sites) JEDI plugins (data-aware) ClassAd expressions
Data locality Future (DMAPI integration) Native (Rucio) Custom (ClassAd hydrator)
Site gateway Job Gateway Harvester HTCondor-CE
Pilot/worker Toil batch runner PanDA Pilot Glidein (startd)
Workflow DAG Toil iDDS DAGMan or external
GPU routing Manual Native Native (ClassAds)
Community size Small (in-house) Medium (HEP) Large (OSG, HEP, HTC)
Complexity Low High Medium
SRCNet investment Current focus Already planned Would be new

8. Gotchas and Open Problems

8.1 Toil Job Store Accessibility

Toil uses a job store for intermediate files between CWL steps. With HTCondor distributing jobs across sites:

  1. Shared filesystem: All sites mount the same GPFS/Lustre/CephFS — simplest but requires infrastructure
  2. Object store: Toil's S3 job store, accessible from all sites — adds latency for small files
  3. HTCondor file transfer: Transfer input/output files with each job — works for small files, problematic for large MS data

For Rapthor's large intermediate products (GB-scale measurement sets), option 1 or 2 is necessary.

8.2 Container Image Distribution

Rapthor requires DP3, WSClean, EveryBeam, IDG — a complex ~10GB image. Options:

  • CVMFS: Distribute via CernVM-FS (used by LHC experiments)
  • Container registry: Per-site Harbor/Docker registry with pull-through caching
  • Singularity/Apptainer: Convert to SIF, stage to shared storage

HTCondor supports container_image in job ClassAds, with configurable execution via Singularity/Apptainer.

8.3 No Native Data Management Integration

Unlike PanDA (which integrates with Rucio), HTCondor has no built-in data management. The "ClassAd hydrator" for data locality is a custom component that must:

  • Query DMAPI/Rucio for dataset locations
  • Update machine ClassAds periodically
  • Handle replication-in-progress states

This is significant engineering — roughly equivalent to the "data-aware scheduling" goal in Michele's proposal, but implemented differently.

8.4 GlideinWMS Operational Complexity

Deploying and operating GlideinWMS requires:

  • Central HTCondor pool (Schedd + Central Manager + Collector)
  • GlideinWMS Factory (submits glideins, manages credentials)
  • GlideinWMS Frontend (monitors demand, requests glideins)
  • HTCondor-CE at each site (translates to local batch)
  • SciTokens/IAM integration for authentication

This is well-documented (OSG provides detailed guides), but it's not trivial. The OSG/CMS operations teams have decades of experience; SRCNet would be starting fresh.

8.5 HTCondor Community Support in Toil

Toil's HTCondor backend is "community supported" — the core Toil team at UCSC doesn't actively maintain it. Known issues include:

  • Environment variable handling bugs
  • Schedd busy handling
  • Limited testing in CI

Any SRCNet deployment would likely need to enhance and maintain this code.


9. Recommended Path Forward

For an SRCNet deployment, given the context of existing TIGER work and planned PanDA investment:

If HTCondor is seriously considered (Pablo's proposal):

Stage 1 (3-6 months): Proof of Concept

  • Deploy minimal HTCondor pool (single Schedd + Central Manager) in SRCNet dev environment
  • Use existing Toil HTCondor backend to run Rapthor calibration operation
  • No GlideinWMS yet — just validate HTCondor job execution
  • Document gaps in Toil backend, estimate enhancement effort

Stage 2 (6-12 months): Single-Site Late Binding

  • Deploy HTCondor-CE at one SRC site
  • Deploy GlideinWMS Factory + Frontend
  • Glideins join central pool from one site
  • Validate late-binding behavior under queue pressure

Stage 3 (12-18 months): Multi-Site + Data Locality

  • Add HTCondor-CE at additional sites
  • Implement ClassAd hydrator for data locality (DMAPI integration)
  • Test data-aware job placement
  • Compare performance/complexity with PanDA approach

If PanDA remains the primary path:

HTCondor evaluation becomes a parallel spike to validate Pablo's hypothesis:

  • Small team (0.2 FTE) explores HTCondor proof of concept
  • Results inform tool assessment decision
  • Avoid committing significant resources until comparative data exists

10. Summary

HTCondor/GlideinWMS is a legitimate alternative to PanDA for SRCNet federated job execution. Its strengths:

  • True slot-level late binding via glidein model
  • Mature, battle-tested (25+ years, millions of jobs/day at CMS)
  • Large community with active development and support
  • Flexible matchmaking via ClassAd expressions
  • Existing Toil integration (though community-supported)

Its weaknesses for SRCNet:

  • No native data management — requires custom ClassAd hydrator
  • Significant infrastructure investment — full GlideinWMS stack
  • Diverges from existing SRCNet plans — PanDA/Harvester already documented
  • Operational learning curve — SRCNet team would need to build expertise

The most practical near-term integration path is enhancing Toil's HTCondor backend to work with a GlideinWMS-backed pool, while implementing a data-locality ClassAd hydrator that integrates with DMAPI. This preserves Rapthor's CWL/Toil architecture while gaining HTCondor's scheduling capabilities.

Whether this is preferable to the Toil-PanDA plugin approach depends on:

  1. SRCNet's strategic commitment to PanDA vs. openness to alternatives
  2. Relative complexity of ClassAd hydrator vs. PanDA/Rucio integration
  3. Availability of HTCondor operational expertise in the SRCNet community

Pablo's proposal to "do due diligence" by exploring HTCondor is reasonable — a controlled proof-of-concept would provide empirical data for the tool assessment decision.


References and Sources

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment