Building Enterprise HPC Platforms: Slurm Workload Manager Best Practices
Production-tested strategies for configuring Slurm on AWS ParallelCluster to support 200+ researchers with fair-share scheduling and Weka storage integration
Introduction
Slurm (originally the Simple Linux Utility for Resource Management) has become the de facto standard for HPC workload orchestration, powering some of the world's largest supercomputing facilities. However, deploying Slurm in cloud environments—particularly AWS—requires specialized expertise to balance elasticity, cost optimization, and performance.
At DCLOUD9, we've architected Slurm-based HPC platforms on AWS ParallelCluster supporting 200+ data scientists and computational biologists at organizations like Genentech, IAVI, and Imperial College London. This article shares our battle-tested configuration patterns, troubleshooting strategies, and integration approaches with high-performance storage systems like Weka.
Why Slurm for Cloud HPC?
Slurm provides essential capabilities for multi-tenant HPC environments:
- Resource Management: Fair allocation of compute, memory, and GPU resources across users and projects
- Job Scheduling: Intelligent queueing with backfill, preemption, and priority-based scheduling
- Accounting: Detailed tracking of resource consumption for chargeback and reporting
- Elastic Scaling: Dynamic cluster expansion/contraction based on workload demand
- Multi-Queue Support: Separate partitions for CPU, GPU, high-memory, and spot instance workloads
When integrated with AWS ParallelCluster, Slurm automatically provisions EC2 instances based on job requirements, significantly reducing costs compared to static on-premises clusters.
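For example, a batch job only needs to declare the resources it requires; when it is queued, ParallelCluster launches matching EC2 instances and terminates them once they go idle. A minimal sketch, where the partition name matches the configuration shown later in this article and the workload command is a placeholder:
#!/bin/bash
#SBATCH --job-name=example-batch      # placeholder job name
#SBATCH --partition=cpu-standard
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128G
#SBATCH --time=04:00:00
srun ./my_workload.sh   # placeholder for the actual application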
AWS ParallelCluster + Slurm Architecture
Our reference architecture separates control plane, compute, and storage for optimal performance and cost:
Control Plane
- Head Node: Runs slurmctld (controller daemon), database, and cluster management services
- Database Backend: MySQL/MariaDB for accounting data (slurmdbd)
- Login Nodes: User-facing SSH access, JupyterHub, RStudio Server
Compute Plane
- CPU Partitions: General compute (c6i, c7i instances) with auto-scaling
- GPU Partitions: NVIDIA B200, H100, A100 instances for AI/ML workloads
- High-Memory Partitions: r6i, r7i instances for genomics assembly and large datasets
- Spot Partitions: Cost-optimized instances for fault-tolerant batch jobs
Storage Architecture
- Shared Home: EFS or FSx for Lustre for user home directories
- Scratch Storage: Weka Data Platform or FSx for high-throughput workloads
- Archive: S3 with automated lifecycle policies (example rule below)
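A minimal sketch of automating the archive tier with the AWS CLI; the bucket name, prefix, and transition age are illustrative:
# Transition cold data to Glacier after 90 days (bucket and prefix are placeholders)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-hpc-archive \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-cold-data",
      "Status": "Enabled",
      "Filter": {"Prefix": "results/"},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}]
    }]
  }'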
Slurm Configuration Best Practices
1. Multi-Queue Architecture
Separate queues (partitions) optimize cost and performance. Our production configuration:
# slurm.conf - Partition Configuration
# General CPU compute
PartitionName=cpu-standard Nodes=compute-cpu-[1-100] Default=YES MaxTime=7-00:00:00 State=UP
PartitionName=cpu-highmem Nodes=compute-highmem-[1-50] MaxTime=7-00:00:00 State=UP
# GPU partitions
PartitionName=gpu-a100 Nodes=compute-gpu-a100-[1-20] MaxTime=2-00:00:00 State=UP
PartitionName=gpu-b200 Nodes=compute-gpu-b200-[1-10] MaxTime=2-00:00:00 State=UP
# Spot instances (cost-optimized)
PartitionName=cpu-spot Nodes=compute-spot-[1-200] MaxTime=12:00:00 State=UP
# Priority configuration
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightQOS=10000
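With the multifactor plugin enabled, sprio shows how each weight contributes to a pending job's priority, which is useful when tuning these values:
# Show the configured weights and the per-job priority breakdown
sprio -w        # print the weight assigned to each factor
sprio -l        # long format: age, fair-share, job size, and QOS components per pending job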
2. Fair-Share Scheduling
For multi-tenant environments, fair-share ensures equitable resource distribution:
# Create accounts and associations
sacctmgr add account genomics Description="Genomics Research"
sacctmgr add account drug_discovery Description="Drug Discovery"
# Add users with shares
sacctmgr add user alice Account=genomics Fairshare=100
sacctmgr add user bob Account=drug_discovery Fairshare=100
# View fair-share status
sshare -a
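Fair-share behavior also depends on how quickly historical usage decays. A minimal slurm.conf sketch; the half-life value is illustrative and should match your planning or billing cycle:
# slurm.conf - fair-share usage decay (illustrative values)
PriorityDecayHalfLife=7-0        # past usage loses half its weight every 7 days
PriorityUsageResetPeriod=NONE    # rely on decay rather than hard resets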
3. GPU Resource Management
Proper GPU configuration prevents resource conflicts and enables fine-grained allocation:
# gres.conf - GPU resource configuration
NodeName=compute-gpu-b200-[1-10] Name=gpu Type=b200 File=/dev/nvidia[0-7]
NodeName=compute-gpu-a100-[1-20] Name=gpu Type=a100 File=/dev/nvidia[0-7]
# slurm.conf - Node definitions with GPUs
NodeName=compute-gpu-b200-[1-10] Gres=gpu:b200:8 CPUs=96 RealMemory=2048000 State=CLOUD
NodeName=compute-gpu-a100-[1-20] Gres=gpu:a100:8 CPUs=96 RealMemory=1024000 State=CLOUD
# Example job requesting specific GPU type
#!/bin/bash
#SBATCH --partition=gpu-b200
#SBATCH --gres=gpu:b200:4
#SBATCH --cpus-per-gpu=12
#SBATCH --mem-per-gpu=256G
srun ./train_model.sh   # placeholder for the actual GPU workload
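To keep jobs from touching GPUs they were not allocated, device isolation through cgroups is worth enabling. A minimal sketch, assuming Slurm was built with cgroup support:
# cgroup.conf - restrict each job to its allocated devices, cores, and memory
ConstrainDevices=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# slurm.conf - enable the cgroup plugins
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity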
4. Job Accounting and Reporting
Comprehensive accounting enables chargeback and capacity planning:
# slurm.conf - Accounting configuration
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd-host
# Generate monthly usage report
sreport cluster utilization start=2024-12-01 end=2024-12-31
sreport user top start=2024-12-01 end=2024-12-31
# Cost analysis query
sacct -S 2024-12-01 -E 2024-12-31 --format=User,Account,JobID,Partition,AllocCPUs,Elapsed,State
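For chargeback, the raw sacct output can be rolled up into per-account core-hours. A small pipeline sketch; the date range mirrors the report above:
# Sum CPU-hours per account for December 2024
sacct -a -X -S 2024-12-01 -E 2024-12-31 --noheader --parsable2 \
      --format=Account,CPUTimeRAW |
  awk -F'|' '{h[$1] += $2/3600} END {for (a in h) printf "%-20s %12.1f CPU-hours\n", a, h[a]}'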
Integration with Weka Parallel File System
High-performance storage is critical for HPC workloads. Weka provides multi-GB/s throughput with sub-millisecond latency:
Architecture
- Weka Cluster: Deployed on i4i.4xlarge instances with local NVMe SSDs
- Client Integration: Weka client installed on compute nodes via ParallelCluster custom AMI
- Mount Points: /weka/scratch for high-performance temporary storage
ParallelCluster Configuration
# ParallelCluster config (excerpt) - install and mount the Weka client on compute nodes
Scheduling:
  SlurmQueues:
    - Name: cpu-standard
      CustomActions:
        OnNodeConfigured:
          Script: s3://my-bucket/scripts/install-weka-client.sh

# ParallelCluster has no native Weka storage type, so Weka is mounted by the script
# above. As an alternative, FSx for Lustre can be declared natively via SharedStorage:
SharedStorage:
  - MountDir: /weka/scratch
    Name: weka-scratch
    StorageType: FsxLustre
    # FsxLustreSettings (StorageCapacity or an existing FileSystemId) is also required
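For reference, a simplified sketch of what an install-weka-client.sh bootstrap script might contain. The backend address, filesystem name, and installer endpoint are assumptions; consult the Weka documentation for your cluster version:
#!/bin/bash
# Illustrative only - adapt the backend host, filesystem name, and installer URL to your Weka deployment
set -euo pipefail
WEKA_BACKEND="10.0.1.10"    # placeholder Weka backend host
WEKA_FS="scratch"           # placeholder Weka filesystem name
MOUNT_POINT="/weka/scratch"
# Install the Weka client agent from a backend host
curl -sk "https://${WEKA_BACKEND}:14000/dist/v1/install" | sh
# Mount the filesystem with the wekafs driver
mkdir -p "${MOUNT_POINT}"
mount -t wekafs "${WEKA_BACKEND}/${WEKA_FS}" "${MOUNT_POINT}"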
Performance Results
In production environments supporting genomics pipelines:
- 45 GB/s aggregate throughput for parallel BAM file processing
- Sub-millisecond latency for metadata operations
- Linear scaling to 200+ concurrent compute nodes
- S3 tiering for automated archival of cold data
Elastic Scaling Strategies
Cloud HPC's superpower is elastic scaling. Our configuration automatically scales compute nodes:
# ParallelCluster YAML - Scaling configuration
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
  SlurmQueues:
    - Name: cpu-standard
      ComputeResources:
        - Name: c6i-32xlarge
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 100
# Slurm will scale from 0 to 100 nodes based on queue depth
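Once a queue is defined this way, scaling behavior can be observed directly from a login node:
# Watch pending jobs and the reasons they are waiting
squeue --states=PENDING --format="%.12i %.14P %.8u %.20r"
# Node states: idle~ means powered down, alloc/mix means running jobs
sinfo -o "%P %a %D %t"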
Cost Optimization Techniques
- Spot Instances: Up to 70% cost savings for fault-tolerant workloads (see the queue sketch after this list)
- Rapid Scale-Down: 5-minute idle time before termination
- Mixed Instance Types: Flexible instance selection within a partition
- Reserved Capacity: Savings Plans for baseline compute demand
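A sketch of a spot-priced queue combining the first three techniques; instance types and counts are illustrative, and the flexible Instances list requires ParallelCluster 3.3 or later:
# ParallelCluster YAML - spot queue with mixed instance types (illustrative)
SlurmQueues:
  - Name: cpu-spot
    CapacityType: SPOT
    ComputeResources:
      - Name: spot-mixed
        Instances:
          - InstanceType: c6i.16xlarge
          - InstanceType: c5.18xlarge
        MinCount: 0
        MaxCount: 200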
Monitoring and Troubleshooting
Production HPC platforms require comprehensive observability:
Key Metrics
- Queue Depth: Jobs waiting for resources (squeue)
- Node State: Active, idle, down, drain (sinfo)
- Cluster Utilization: CPU, GPU, memory usage (Grafana dashboards)
- Job Success Rate: Failed vs. completed jobs
- Scaling Latency: Time from job submission to job start
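Most of these metrics can be scraped with standard Slurm commands and fed into CloudWatch or Prometheus; an illustrative snapshot:
# Queue depth (pending jobs)
squeue -h --states=PENDING | wc -l
# Node count per state (idle, alloc, down, drain; idle~ for powered-off cloud nodes)
sinfo -h -o "%D %t"
# Jobs that failed today
sacct -a -X -S today -s FAILED -n | wc -l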
Common Issues and Solutions
Issue: Nodes Stuck in "Powering Up"
Cause: EC2 capacity constraints or IAM permission issues
Solution: Check the ParallelCluster clustermgtd and slurm_resume logs in CloudWatch, verify EC2 service quotas for the instance families in use, and diversify instance types
Issue: Jobs Not Starting Despite Available Nodes
Cause: Resource constraints (CPUs, memory, GPUs) or fair-share limits
Solution: Run scontrol show job <jobid> and check the Reason field, which states exactly why the scheduler is holding the job (e.g., Resources, Priority, or an association limit)
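A few commands we reach for first when triaging these situations (the job ID and node name are placeholders):
# Why is a job still pending? Check the Reason field
scontrol show job 123456 | grep -E "JobState|Reason|Priority"
# Why is a node unavailable? List down/drained nodes with their reasons
sinfo -R
scontrol show node compute-gpu-a100-3 | grep -E "State|Reason"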
Real-World Case Study: Supporting 200+ Researchers
For Genentech's genomics research platform, we deployed a production Slurm cluster supporting:
Platform Specifications:
- 200+ active users across genomics, computational biology, and drug discovery teams
- 5 partitions: CPU, high-memory, GPU (A100/B200), spot, and interactive
- Auto-scaling: 0-500 compute nodes based on queue depth
- Weka storage: 2 PB capacity, 45 GB/s throughput
- 99.9% uptime with automated health monitoring and failover
- 3x cost reduction vs. previous on-premises HPC cluster
Conclusion
Slurm on AWS ParallelCluster provides the ideal foundation for enterprise HPC workloads. With proper configuration of multi-queue architectures, fair-share scheduling, GPU resource management, and integration with high-performance storage like Weka, organizations can build scalable, cost-effective platforms supporting hundreds of researchers.
At DCLOUD9, we specialize in designing and operating these mission-critical HPC platforms for biotech, genomics, and computational research organizations. Our team brings decades of combined experience in Slurm administration, AWS architecture, and performance optimization.
Need Expert Help with Your HPC Platform?
Let DCLOUD9 design and deploy your Slurm-based HPC infrastructure
Request Consultation