Technical analysis for radio astronomers and SKA/SRCNet engineers Date: May 2026
Rapthor is the LOFAR direction-dependent effects (DDE) calibration and imaging pipeline, developed at ASTRON and adopted as the basis of the SKAO ICAL pipeline. It automates the iterative self-calibration loop:
[Input MSes] → calibrate (DP3/DDECal) → image (WSClean+IDG) → extract sources → loop
Each pipeline operation (calibrate, image, subtract, mosaic, source-find) is implemented as a CWL (Common Workflow Language) workflow. Toil acts as the CWL execution engine and provides the pluggable batch-system layer. The supported backends are:
--batchSystem=single_machine— local development/testing--batchSystem=slurm— HPC cluster, Toil leader submits Slurm jobs--batchSystem=kubernetes— cloud/containerised clusters--batchSystem=htcondor— HTCondor pool (community-supported)
Toil owns the full job DAG: it parses CWL, resolves step dependencies, marshals intermediate data, and issues individual compute tasks to the chosen batch backend. Rapthor's Python orchestration code drives the outer self-calibration loop, calling Toil repeatedly per operation.
Compute profile of Rapthor's job types:
| Operation step | Primary tool | Resource profile |
|---|---|---|
| Direction-dependent calibration | DP3 / DDECal | CPU-heavy (many cores), high RAM (tens of GB per facet), I/O-heavy (MS read) |
| Wide-field imaging | WSClean + IDG | CPU or GPU (IDG-GPU mode); very high RAM (100+ GB for large fields); I/O-heavy |
| Beam application / subtraction | DP3 | CPU-moderate, I/O-heavy |
| Source finding | PyBDSF / similar | CPU-light to moderate, memory-moderate |
| Mosaicking | custom | I/O-dominated, CPU-light |
The mix is heterogeneous: some steps want GPUs (WSClean+IDG), most want many CPU cores and a lot of RAM, and several are gated by I/O bandwidth from measurement sets that can be hundreds of gigabytes per pointing.
HTCondor is the reference implementation of High Throughput Computing (HTC), developed at the University of Wisconsin-Madison since 1988. Unlike traditional HPC batch systems (Slurm, PBS) that assume a static cluster, HTCondor is designed for dynamic, heterogeneous resource pools where nodes come and go.
GlideinWMS (Glidein-based Workflow Management System) is a meta-scheduler built on top of HTCondor, developed by Fermilab and used by CMS, ATLAS, and the Open Science Grid (OSG). It implements the pilot job (Glidein) model for true late-binding across federated resources.
Core HTCondor/GlideinWMS components relevant here:
| Component | Role |
|---|---|
| Central Manager | Matchmaker daemon; receives job and machine ClassAds, performs matchmaking |
| Schedd | Scheduler daemon; holds job queue, negotiates with central manager for slots |
| Startd | Execution daemon on worker nodes; advertises slot capabilities, runs jobs |
| GlideinWMS Factory | Submits pilot jobs (glideins) to remote batch systems on behalf of users |
| GlideinWMS Frontend | Monitors user job queues, requests glideins from factory based on demand |
| HTCondor-CE | Compute Entrypoint; accepts remote job submissions, translates to local batch system |
| ClassAds | Flexible key-value expression language for describing jobs, machines, and matchmaking |
Late binding is a defining feature of HTCondor's glidein model. Understanding it precisely is essential before mapping Rapthor onto it.
In early binding (classical batch submission), a job is submitted to a specific queue at a specific site. If that site is congested, the job waits. If the site fails, the job is lost. The job is irrevocably coupled to a slot at submission time.
In HTCondor's late binding via GlideinWMS, the work and the resource are matched only when the resource is actually available — when a glidein starts executing and joins the HTCondor pool.
The canonical GlideinWMS flow is:
1. User submits jobs to a local HTCondor Schedd (user pool)
— jobs specify requirements via ClassAds but are NOT bound to any site
2. Frontend monitors the user pool: "There are N pending jobs with requirements X"
3. Frontend requests glideins from Factory: "Please submit pilots to sites A, B, C
that can satisfy requirements X"
4. Factory submits glidein jobs to HTCondor-CEs at multiple sites
— glideins are submitted to local batch systems (Slurm, PBS, K8s)
— glideins carry NO payload; they just request a slot
5. When a glidein starts on a worker node, it:
a. Contacts the user pool's Central Manager
b. Advertises its capabilities as a ClassAd (CPU, RAM, disk, GPU, etc.)
c. Requests a job that matches its capabilities
6. Central Manager performs matchmaking: job ClassAd ↔ slot ClassAd
7. Job is dispatched to the glidein; glidein executes payload
8. Glidein can request another job (slot reuse) or exit
The binding between job and compute slot happens at step 6, after the slot is already running — hence "late binding."
Both GlideinWMS and PanDA use pilot jobs, but their architectures differ:
| Aspect | GlideinWMS | PanDA |
|---|---|---|
| Central queue | HTCondor Schedd (job ClassAds) | PanDA Server (job database) |
| Matchmaking | Condor Central Manager (ClassAd expressions) | JEDI brokerage (plugin architecture) |
| Pilot submission | Factory → HTCondor-CE → local batch | Harvester → local batch |
| Job pull | Startd negotiates with Central Manager | Pilot polls PanDA Server REST API |
| Data locality | Via ClassAd requirements/rank | Via Rucio integration in JEDI |
The key difference: HTCondor's matchmaking is expression-based (ClassAds), while PanDA's is database-driven with explicit Rucio integration. HTCondor can express data locality via ClassAds, but this requires explicit infrastructure to populate machine ClassAds with data availability information.
HTCondor's ClassAd mechanism allows flexible job requirements and ranking:
Requirements (hard constraints):
requirements = (TARGET.HasGPU == True) && (TARGET.Memory >= 64000)
Rank (soft preferences):
rank = (TARGET.HasLocalData ? 100 : 0) + (TARGET.Memory / 1000)
For data-aware scheduling, one could imagine:
# Job requires data at a specific site
requirements = member("dataset_1234", TARGET.AvailableDatasets)
# Prefer sites with more data locally
rank = size(filter(lambda ds: member(ds, TARGET.AvailableDatasets), MY.RequiredDatasets))
This requires:
- A service that populates worker/site ClassAds with data availability
- Jobs that specify their data requirements as ClassAd attributes
- Periodic refresh of data availability ClassAds
Pablo's proposal mentions a "Data Locality Awareness Metadata Hydrator" — this is exactly the missing component.
In a federated system like SRCNet — where compute nodes are at different institutions across multiple countries with heterogeneous network links, storage backends, and queue policies — late binding provides:
- Resilience: if a site becomes unavailable after glidein submission, another site's glideins can pick up the work
- Opportunism: idle capacity at any site is automatically exploited
- Load balancing: jobs naturally flow to sites with available slots
- Heterogeneous resource routing: GPU vs. CPU decisions made at match time based on actual availability
- Queue absorption: glideins wait in site queues, but jobs remain unbound until execution
Toil already includes an HTCondor batch system backend (--batchSystem htcondor). The HTCondorBatchSystem class:
class HTCondorBatchSystem(AbstractGridEngineBatchSystem):
"""
Submits jobs to a local HTCondor pool via condor_submit.
"""
def issueBatchJob(self, command, jobNode, job_environment=None):
# Creates HTCondor submit description
# Submits via Schedd API
# Returns job ID
def getRunningBatchJobIDs(self):
# Queries condor_q for job status
def killBatchJobs(self, jobIDs):
# Calls condor_rmThis is a direct submission model: Toil leader submits jobs directly to a local HTCondor Schedd. It does NOT use GlideinWMS or HTCondor-CE — it assumes access to an existing HTCondor pool.
The existing Toil HTCondor backend has several limitations:
- Local pool only: Assumes direct Schedd access; no federation
- No late binding: Jobs are submitted directly, not via glideins
- Community-supported: Not actively maintained by Toil core team
- Known issues: Environment variable handling bugs (issue #4363), Schedd busy handling
- No data locality: No mechanism to express or match data requirements
For SRCNet's federated model, this is insufficient. We need either:
- A GlideinWMS-aware batch system that submits to a glidein-backed pool, or
- A new backend that submits via HTCondor-CE to remote sites
| Rapthor / Toil concept | HTCondor / GlideinWMS concept |
|---|---|
CWL Workflow |
DAGMan workflow (or external orchestrator) |
CWL CommandLineTool step |
HTCondor job cluster |
Individual Toil BatchJob (one scatter element) |
HTCondor job (one proc in a cluster) |
| Toil batch backend | HTCondorBatchSystem (to Schedd) |
| Toil job store (file-based checkpoint) | No equivalent — would use shared filesystem or HTCondor file transfer |
| Toil leader process (DAG orchestration) | DAGMan or external (Toil stays) |
Toil --batchSystem plugin |
HTCondor submit + ClassAd configuration |
CWL ResourceRequirement hints |
ClassAd request_cpus, request_memory, request_disk, request_gpus |
| Toil scatter (parallel over facets) | HTCondor job cluster with N procs |
Rapthor's outer loop (calibrate → image → extract → re-calibrate) is driven by Python code in Rapthor's Pipeline class. HTCondor's DAGMan supports conditional branching via SCRIPT PRE/POST and RETRY, but expressing convergence-based loops requires external logic.
The most natural mapping keeps Toil (or Rapthor's Python) as the outer orchestrator, submitting individual CWL operations to HTCondor. This is similar to Option A in the PanDA analysis.
Rapthor processes each facet or calibration direction independently. In CWL this is a scatter over direction items. In HTCondor, this maps to a job cluster:
# Submit 100 parallel jobs for 100 directions
queue 100
Each job in the cluster gets a unique $(Process) ID (0-99). GlideinWMS glideins at multiple sites can execute these jobs in parallel, with late binding ensuring jobs run where slots become available.
Architecture:
Rapthor (Python orchestrator)
└── Toil CWL runner (leader process, DAG management)
└── HTCondorBatchSystem (existing, enhanced)
└── Local HTCondor Schedd
└── GlideinWMS glideins from multiple SRC sites
How it works:
- Deploy a central HTCondor pool (Schedd + Central Manager) accessible from SRCNet
- Deploy GlideinWMS Factory + Frontend that provisions glideins at SRC sites via HTCondor-CE
- Enhance Toil's existing HTCondorBatchSystem to:
- Set data-locality ClassAd attributes on jobs
- Configure resource requirements from CWL hints
- Glideins from multiple sites join the central pool
- Jobs are matched to glideins based on ClassAd expressions
Data staging: Toil's job store must be accessible from all SRC sites. Options:
- Globally-mounted object store (Ceph/S3)
- HTCondor file transfer (for small files)
- External data management (Rucio or DMAPI) with ClassAd-based locality matching
Pros:
- Minimal changes to Rapthor codebase
- Uses existing Toil HTCondor backend (with enhancements)
- Preserves Toil's CWL DAG management and checkpoint/restart
- True late binding via GlideinWMS
- Leverages HTCondor's mature matchmaking
- Single central job queue simplifies monitoring
Cons:
- Requires deploying GlideinWMS infrastructure (Factory, Frontend, per-site HTCondor-CEs)
- Central HTCondor pool is a single point of failure
- Data locality requires custom "hydrator" service to populate machine ClassAds
- Toil leader must remain alive for pipeline duration
- HTCondor backend is "community supported" with known issues
Verdict: This is the most direct path if SRCNet decides to adopt HTCondor. The infrastructure investment is significant but well-understood (OSG has run this at scale for 15+ years).
Architecture:
Rapthor (Python orchestrator)
└── Toil CWL runner
└── [NEW] HTCondorCEBatchSystem
└── Multiple HTCondor-CE endpoints (per SRC site)
└── Local batch systems (Slurm, K8s, etc.)
How it works:
- Each SRC site deploys HTCondor-CE as a gateway to its local batch system
- New Toil batch system plugin submits jobs directly to HTCondor-CE endpoints
- Site selection logic (brokerage) lives in the plugin or a separate service
- HTCondor-CE translates job to local batch system (Slurm, K8s, etc.)
This is not true GlideinWMS late binding — it's more like the current TIGER Broker/Gateway model but using HTCondor-CE as the gateway protocol.
Pros:
- Simpler than full GlideinWMS deployment
- HTCondor-CE is already used by 170+ sites in WLCG
- SciTokens authentication supported
- No central HTCondor pool required
Cons:
- Not true late binding — job is bound to site at submission time
- Requires implementing brokerage/site-selection logic (the same problem as TIGER)
- Each site needs HTCondor-CE deployment
- Less flexible than GlideinWMS for opportunistic resource use
Verdict: This provides a standardized gateway protocol but doesn't solve the late-binding problem. It's an alternative to TIGER's custom Job Gateway, not a fundamentally different scheduling model.
Architecture:
User / Client
└── HTCondor Schedd (central job queue)
└── Central Manager (matchmaking)
└── GlideinWMS Factory
└── HTCondor-CE at each SRC site
└── Local batch (Slurm, K8s)
└── Glideins (workers)
└── Execute Rapthor CWL steps
How it works:
- Replace TIGER Broker entirely with HTCondor Schedd + Central Manager
- Replace Job Gateways with HTCondor-CE + glidein infrastructure
- Submit Rapthor CWL operations as HTCondor DAGMan workflows or via Toil
- GlideinWMS provisions workers across all SRC sites
- ClassAd matchmaking handles resource and data locality
This is the "full HTCondor" approach — using HTCondor as the unified job management layer.
Pros:
- True slot-level late binding across federation
- Battle-tested at scale (CMS processes millions of jobs/day)
- Rich matchmaking via ClassAds
- Large community, good documentation, annual conference
- No custom brokerage code to maintain
- Handles heterogeneous resources (CPU, GPU) natively
Cons:
- Significant infrastructure investment (full GlideinWMS stack)
- Requires HTCondor-CE at every SRC site
- Data locality requires custom ClassAd hydration (no native Rucio integration)
- Learning curve for operators unfamiliar with HTCondor
- Doesn't leverage existing SRCNet investment in PanDA/Harvester
Data locality implementation:
Unlike PanDA (which integrates with Rucio), HTCondor has no native data management. Options:
- ClassAd hydrator service: Periodically query DMAPI/Rucio, update machine ClassAds with available datasets
- Job-side data staging: Jobs use DMAPI to stage data before execution (early binding of data, late binding of slot)
- Storage-coupled workers: Glideins only run at sites with local data access (constrain factory submissions)
Verdict: This is a major architectural change that replaces rather than enhances the current SRCNet approach. It would work well technically but represents a strategic pivot away from PanDA.
Rapthor's WSClean imaging step can run in CPU-only, idg (CPU), or idg-gpu (GPU) modes. With HTCondor:
# Job ClassAd
+WantsGPU = True
request_gpus = 1
rank = TARGET.HasGPU ? 1000 : 0
# Falls back to CPU if no GPU slots available
requirements = (TARGET.HasGPU == True) || (TARGET.Cpus >= 16 && TARGET.Memory >= 64000)
The matchmaker routes to GPU slots when available, falls back to CPU otherwise — without user intervention.
Assuming a ClassAd hydrator populates machine ads with dataset availability:
# Machine ClassAd (populated by hydrator)
AvailableDatasets = {"obs_1234", "obs_5678", "skymodel_v3"}
SiteLocation = "SRCNET_UK"
# Job ClassAd
+RequiredDatasets = {"obs_1234", "skymodel_v3"}
requirements = subset(MY.RequiredDatasets, TARGET.AvailableDatasets)
rank = size(intersection(MY.RequiredDatasets, TARGET.AvailableDatasets))
Jobs preferentially run at sites with local data, but can run elsewhere if needed (triggering data staging).
GlideinWMS glideins wait in site batch queues. When they start, they immediately join the HTCondor pool and request work. This absorbs queue latency:
- Factory submits 100 glideins across 5 sites
- Glideins wait in local queues (minutes to hours depending on site)
- As glideins start, they pull jobs from the central queue
- Sites with faster queues naturally process more jobs
- No job is "trapped" in a slow queue — work stays in the central pool until a glidein is ready
A Rapthor scatter over 100 facets creates 100 HTCondor jobs. With glideins at 5 sites:
- 20 glideins run at SRCNET_UK (fast queue)
- 15 at SRCNET_NL (medium queue)
- 10 at SRCNET_AU (slow queue)
- Jobs flow to available slots as glideins start
- Natural load balancing across sites
If a glidein crashes or a site goes offline:
- Running jobs on that glidein are marked failed
- HTCondor automatically requeues them (configurable retry policy)
- Another glidein (possibly at a different site) picks them up
No manual intervention required for transient failures.
| Feature | TIGER (current) | PanDA | HTCondor/GlideinWMS |
|---|---|---|---|
| Late binding level | Site (Gateway claims job) | Slot (pilot pulls job) | Slot (glidein pulls job) |
| Central queue | RabbitMQ | PanDA Server DB | HTCondor Schedd |
| Matchmaking | Simple (spread across sites) | JEDI plugins (data-aware) | ClassAd expressions |
| Data locality | Future (DMAPI integration) | Native (Rucio) | Custom (ClassAd hydrator) |
| Site gateway | Job Gateway | Harvester | HTCondor-CE |
| Pilot/worker | Toil batch runner | PanDA Pilot | Glidein (startd) |
| Workflow DAG | Toil | iDDS | DAGMan or external |
| GPU routing | Manual | Native | Native (ClassAds) |
| Community size | Small (in-house) | Medium (HEP) | Large (OSG, HEP, HTC) |
| Complexity | Low | High | Medium |
| SRCNet investment | Current focus | Already planned | Would be new |
Toil uses a job store for intermediate files between CWL steps. With HTCondor distributing jobs across sites:
- Shared filesystem: All sites mount the same GPFS/Lustre/CephFS — simplest but requires infrastructure
- Object store: Toil's S3 job store, accessible from all sites — adds latency for small files
- HTCondor file transfer: Transfer input/output files with each job — works for small files, problematic for large MS data
For Rapthor's large intermediate products (GB-scale measurement sets), option 1 or 2 is necessary.
Rapthor requires DP3, WSClean, EveryBeam, IDG — a complex ~10GB image. Options:
- CVMFS: Distribute via CernVM-FS (used by LHC experiments)
- Container registry: Per-site Harbor/Docker registry with pull-through caching
- Singularity/Apptainer: Convert to SIF, stage to shared storage
HTCondor supports container_image in job ClassAds, with configurable execution via Singularity/Apptainer.
Unlike PanDA (which integrates with Rucio), HTCondor has no built-in data management. The "ClassAd hydrator" for data locality is a custom component that must:
- Query DMAPI/Rucio for dataset locations
- Update machine ClassAds periodically
- Handle replication-in-progress states
This is significant engineering — roughly equivalent to the "data-aware scheduling" goal in Michele's proposal, but implemented differently.
Deploying and operating GlideinWMS requires:
- Central HTCondor pool (Schedd + Central Manager + Collector)
- GlideinWMS Factory (submits glideins, manages credentials)
- GlideinWMS Frontend (monitors demand, requests glideins)
- HTCondor-CE at each site (translates to local batch)
- SciTokens/IAM integration for authentication
This is well-documented (OSG provides detailed guides), but it's not trivial. The OSG/CMS operations teams have decades of experience; SRCNet would be starting fresh.
Toil's HTCondor backend is "community supported" — the core Toil team at UCSC doesn't actively maintain it. Known issues include:
- Environment variable handling bugs
- Schedd busy handling
- Limited testing in CI
Any SRCNet deployment would likely need to enhance and maintain this code.
For an SRCNet deployment, given the context of existing TIGER work and planned PanDA investment:
Stage 1 (3-6 months): Proof of Concept
- Deploy minimal HTCondor pool (single Schedd + Central Manager) in SRCNet dev environment
- Use existing Toil HTCondor backend to run Rapthor calibration operation
- No GlideinWMS yet — just validate HTCondor job execution
- Document gaps in Toil backend, estimate enhancement effort
Stage 2 (6-12 months): Single-Site Late Binding
- Deploy HTCondor-CE at one SRC site
- Deploy GlideinWMS Factory + Frontend
- Glideins join central pool from one site
- Validate late-binding behavior under queue pressure
Stage 3 (12-18 months): Multi-Site + Data Locality
- Add HTCondor-CE at additional sites
- Implement ClassAd hydrator for data locality (DMAPI integration)
- Test data-aware job placement
- Compare performance/complexity with PanDA approach
HTCondor evaluation becomes a parallel spike to validate Pablo's hypothesis:
- Small team (0.2 FTE) explores HTCondor proof of concept
- Results inform tool assessment decision
- Avoid committing significant resources until comparative data exists
HTCondor/GlideinWMS is a legitimate alternative to PanDA for SRCNet federated job execution. Its strengths:
- True slot-level late binding via glidein model
- Mature, battle-tested (25+ years, millions of jobs/day at CMS)
- Large community with active development and support
- Flexible matchmaking via ClassAd expressions
- Existing Toil integration (though community-supported)
Its weaknesses for SRCNet:
- No native data management — requires custom ClassAd hydrator
- Significant infrastructure investment — full GlideinWMS stack
- Diverges from existing SRCNet plans — PanDA/Harvester already documented
- Operational learning curve — SRCNet team would need to build expertise
The most practical near-term integration path is enhancing Toil's HTCondor backend to work with a GlideinWMS-backed pool, while implementing a data-locality ClassAd hydrator that integrates with DMAPI. This preserves Rapthor's CWL/Toil architecture while gaining HTCondor's scheduling capabilities.
Whether this is preferable to the Toil-PanDA plugin approach depends on:
- SRCNet's strategic commitment to PanDA vs. openness to alternatives
- Relative complexity of ClassAd hydrator vs. PanDA/Rucio integration
- Availability of HTCondor operational expertise in the SRCNet community
Pablo's proposal to "do due diligence" by exploring HTCondor is reasonable — a controlled proof-of-concept would provide empirical data for the tool assessment decision.
- HTCondor Manual
- HTCondor ClassAd Mechanism
- GlideinWMS Documentation
- GlideinWMS Factory
- HTCondor-CE Overview
- HTCondor-CE SciTokens Authentication
- OSG GlideinWMS Factory SLA
- Toil HTCondor Batch System
- Toil HPC Environments
- CMS and HTCondor (CHEP 2025)
- LOFAR AGLOW Distributed Processing (arXiv:1808.10735)
- Rapthor GitHub
- Rapthor Documentation