Building Enterprise HPC Platforms: Slurm Workload Manager Best Practices
Production-tested strategies for configuring Slurm on AWS ParallelCluster to support 200+ researchers with fair-share scheduling and Weka storage integration
Introduction
Slurm (originally the Simple Linux Utility for Resource Management) has become the de facto standard for HPC workload orchestration, powering some of the world's largest supercomputing facilities. However, deploying Slurm in cloud environments—particularly AWS—requires specialized expertise to balance elasticity, cost optimization, and performance.
At DCLOUD9, we've architected Slurm-based HPC platforms on AWS ParallelCluster supporting 200+ data scientists and computational biologists at organizations like Genentech, IAVI, and Imperial College London. This article shares our battle-tested configuration patterns, troubleshooting strategies, and integration approaches with high-performance storage systems like Weka.
Why Slurm for Cloud HPC?
Slurm provides essential capabilities for multi-tenant HPC environments:
- Resource Management: Fair allocation of compute, memory, and GPU resources across users and projects
- Job Scheduling: Intelligent queueing with backfill, preemption, and priority-based scheduling
- Accounting: Detailed tracking of resource consumption for chargeback and reporting
- Elastic Scaling: Dynamic cluster expansion/contraction based on workload demand
- Multi-Queue Support: Separate partitions for CPU, GPU, high-memory, and spot instance workloads
When integrated with AWS ParallelCluster, Slurm automatically provisions EC2 instances based on job requirements, significantly reducing costs compared to static on-premises clusters.
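For example, a batch job only needs to declare the resources it requires; when it is queued, ParallelCluster launches matching EC2 instances and terminates them once they go idle. A minimal sketch, where the partition name matches the configuration shown later in this article and the workload command is a placeholder:
#!/bin/bash
#SBATCH --job-name=example-batch      # placeholder job name
#SBATCH --partition=cpu-standard
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128G
#SBATCH --time=04:00:00
srun ./my_workload.sh   # placeholder for the actual application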
AWS ParallelCluster + Slurm Architecture
Our reference architecture separates control plane, compute, and storage for optimal performance and cost:
Control Plane
- Head Node: Runs slurmctld (controller daemon), database, and cluster management services
- Database Backend: MySQL/MariaDB for accounting data (slurmdbd)
- Login Nodes: User-facing SSH access, JupyterHub, RStudio Server
Compute Plane
- CPU Partitions: General compute (c6i, c7i instances) with auto-scaling
- GPU Partitions: NVIDIA B200, H100, A100 instances for AI/ML workloads
- High-Memory Partitions: r6i, r7i instances for genomics assembly and large datasets
- Spot Partitions: Cost-optimized instances for fault-tolerant batch jobs
Storage Architecture
- Shared Home: EFS or FSx for Lustre for user home directories
- Scratch Storage: Weka Data Platform or FSx for high-throughput workloads
- Archive: S3 with automated lifecycle policies (example rule below)
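A minimal sketch of automating the archive tier with the AWS CLI; the bucket name, prefix, and transition age are illustrative:
# Transition cold data to Glacier after 90 days (bucket and prefix are placeholders)
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-hpc-archive \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-cold-data",
      "Status": "Enabled",
      "Filter": {"Prefix": "results/"},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}]
    }]
  }'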
Slurm Configuration Best Practices
1. Multi-Queue Architecture
Separate queues (partitions) optimize cost and performance. Our production configuration:
# slurm.conf - Partition Configuration
# General CPU compute
PartitionName=cpu-standard Nodes=compute-cpu-[1-100] Default=YES MaxTime=7-00:00:00 State=UP
PartitionName=cpu-highmem Nodes=compute-highmem-[1-50] MaxTime=7-00:00:00 State=UP
# GPU partitions
PartitionName=gpu-a100 Nodes=compute-gpu-a100-[1-20] MaxTime=2-00:00:00 State=UP
PartitionName=gpu-b200 Nodes=compute-gpu-b200-[1-10] MaxTime=2-00:00:00 State=UP
# Spot instances (cost-optimized)
PartitionName=cpu-spot Nodes=compute-spot-[1-200] MaxTime=12:00:00 State=UP
# Priority configuration
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightQOS=10000
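With the multifactor plugin enabled, sprio shows how each weight contributes to a pending job's priority, which is useful when tuning these values:
# Show the configured weights and the per-job priority breakdown
sprio -w        # print the weight assigned to each factor
sprio -l        # long format: age, fair-share, job size, and QOS components per pending job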
2. Fair-Share Scheduling
For multi-tenant environments, fair-share ensures equitable resource distribution:
# Create accounts and associations
sacctmgr add account genomics Description="Genomics Research"
sacctmgr add account drug_discovery Description="Drug Discovery"
# Add users with shares
sacctmgr add user alice Account=genomics Fairshare=100
sacctmgr add user bob Account=drug_discovery Fairshare=100
# View fair-share status
sshare -a
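Fair-share behavior also depends on how quickly historical usage decays. A minimal slurm.conf sketch; the half-life value is illustrative and should match your planning or billing cycle:
# slurm.conf - fair-share usage decay (illustrative values)
PriorityDecayHalfLife=7-0        # past usage loses half its weight every 7 days
PriorityUsageResetPeriod=NONE    # rely on decay rather than hard resets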
3. GPU Resource Management
Proper GPU configuration prevents resource conflicts and enables fine-grained allocation:
# gres.conf - GPU resource configuration
NodeName=compute-gpu-b200-[1-10] Name=gpu Type=b200 File=/dev/nvidia[0-7]
NodeName=compute-gpu-a100-[1-20] Name=gpu Type=a100 File=/dev/nvidia[0-7]
# slurm.conf - Node definitions with GPUs
NodeName=compute-gpu-b200-[1-10] Gres=gpu:b200:8 CPUs=96 RealMemory=2048000 State=CLOUD
NodeName=compute-gpu-a100-[1-20] Gres=gpu:a100:8 CPUs=96 RealMemory=1024000 State=CLOUD
# Example job requesting specific GPU type
#!/bin/bash
#SBATCH --partition=gpu-b200
#SBATCH --gres=gpu:b200:4
#SBATCH --cpus-per-gpu=12
#SBATCH --mem-per-gpu=256G
srun ./train_model.sh   # placeholder for the actual GPU workload
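To keep jobs from touching GPUs they were not allocated, device isolation through cgroups is worth enabling. A minimal sketch, assuming Slurm was built with cgroup support:
# cgroup.conf - restrict each job to its allocated devices, cores, and memory
ConstrainDevices=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
# slurm.conf - enable the cgroup plugins
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity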
4. Job Accounting and Reporting
Comprehensive accounting enables chargeback and capacity planning:
# slurm.conf - Accounting configuration
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd-host
# Generate monthly usage report
sreport cluster utilization start=2024-12-01 end=2024-12-31
sreport user top start=2024-12-01 end=2024-12-31
# Cost analysis query
sacct -S 2024-12-01 -E 2024-12-31 --format=User,Account,JobID,Partition,AllocCPUs,Elapsed,State
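For chargeback, the raw sacct output can be rolled up into per-account core-hours. A small pipeline sketch; the date range mirrors the report above:
# Sum CPU-hours per account for December 2024
sacct -a -X -S 2024-12-01 -E 2024-12-31 --noheader --parsable2 \
      --format=Account,CPUTimeRAW |
  awk -F'|' '{h[$1] += $2/3600} END {for (a in h) printf "%-20s %12.1f CPU-hours\n", a, h[a]}'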
Integration with Weka Parallel File System
High-performance storage is critical for HPC workloads. Weka provides multi-GB/s throughput with sub-millisecond latency:
Architecture
- Weka Cluster: Deployed on i4i.4xlarge instances with local NVMe SSDs
- Client Integration: Weka client installed on compute nodes via ParallelCluster custom AMI
- Mount Points: /weka/scratch for high-performance temporary storage
ParallelCluster Configuration
# ParallelCluster config (excerpt) - install and mount the Weka client on compute nodes
Scheduling:
  SlurmQueues:
    - Name: cpu-standard
      CustomActions:
        OnNodeConfigured:
          Script: s3://my-bucket/scripts/install-weka-client.sh

# ParallelCluster has no native Weka storage type, so Weka is mounted by the script
# above. As an alternative, FSx for Lustre can be declared natively via SharedStorage:
SharedStorage:
  - MountDir: /weka/scratch
    Name: weka-scratch
    StorageType: FsxLustre
    # FsxLustreSettings (StorageCapacity or an existing FileSystemId) is also required
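For reference, a simplified sketch of what an install-weka-client.sh bootstrap script might contain. The backend address, filesystem name, and installer endpoint are assumptions; consult the Weka documentation for your cluster version:
#!/bin/bash
# Illustrative only - adapt the backend host, filesystem name, and installer URL to your Weka deployment
set -euo pipefail
WEKA_BACKEND="10.0.1.10"    # placeholder Weka backend host
WEKA_FS="scratch"           # placeholder Weka filesystem name
MOUNT_POINT="/weka/scratch"
# Install the Weka client agent from a backend host
curl -sk "https://${WEKA_BACKEND}:14000/dist/v1/install" | sh
# Mount the filesystem with the wekafs driver
mkdir -p "${MOUNT_POINT}"
mount -t wekafs "${WEKA_BACKEND}/${WEKA_FS}" "${MOUNT_POINT}"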
Performance Results
In production environments supporting genomics pipelines:
- 45 GB/s aggregate throughput for parallel BAM file processing
- Sub-millisecond latency for metadata operations
- Linear scaling to 200+ concurrent compute nodes
- S3 tiering for automated archival of cold data
Elastic Scaling Strategies
Cloud HPC's superpower is elastic scaling. Our configuration automatically scales compute nodes:
# ParallelCluster YAML - Scaling configuration
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
  SlurmQueues:
    - Name: cpu-standard
      ComputeResources:
        - Name: c6i-32xlarge
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 100
# Slurm will scale from 0 to 100 nodes based on queue depth
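Once a queue is defined this way, scaling behavior can be observed directly from a login node:
# Watch pending jobs and the reasons they are waiting
squeue --states=PENDING --format="%.12i %.14P %.8u %.20r"
# Node states: idle~ means powered down, alloc/mix means running jobs
sinfo -o "%P %a %D %t"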
Cost Optimization Techniques
- Spot Instances: Up to 70% cost savings for fault-tolerant workloads (see the queue sketch after this list)
- Rapid Scale-Down: 5-minute idle time before termination
- Mixed Instance Types: Flexible instance selection within a partition
- Reserved Capacity: Savings Plans for baseline compute demand
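A sketch of a spot-priced queue combining the first three techniques; instance types and counts are illustrative, and the flexible Instances list requires ParallelCluster 3.3 or later:
# ParallelCluster YAML - spot queue with mixed instance types (illustrative)
SlurmQueues:
  - Name: cpu-spot
    CapacityType: SPOT
    ComputeResources:
      - Name: spot-mixed
        Instances:
          - InstanceType: c6i.16xlarge
          - InstanceType: c5.18xlarge
        MinCount: 0
        MaxCount: 200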
Monitoring and Troubleshooting
Production HPC platforms require comprehensive observability:
Key Metrics
- Queue Depth: Jobs waiting for resources (squeue)
- Node State: Active, idle, down, drain (sinfo)
- Cluster Utilization: CPU, GPU, memory usage (Grafana dashboards)
- Job Success Rate: Failed vs. completed jobs
- Scaling Latency: Time from job submission to job start
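Most of these metrics can be scraped with standard Slurm commands and fed into CloudWatch or Prometheus; an illustrative snapshot:
# Queue depth (pending jobs)
squeue -h --states=PENDING | wc -l
# Node count per state (idle, alloc, down, drain; idle~ for powered-off cloud nodes)
sinfo -h -o "%D %t"
# Jobs that failed today
sacct -a -X -S today -s FAILED -n | wc -l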
Common Issues and Solutions
Issue: Nodes Stuck in "Powering Up"
Cause: EC2 capacity constraints or IAM permission issues
Solution: Check the ParallelCluster clustermgtd and slurm_resume logs in CloudWatch, verify EC2 service quotas for the instance families in use, and diversify instance types
Issue: Jobs Not Starting Despite Available Nodes
Cause: Resource constraints (CPUs, memory, GPUs) or fair-share limits
Solution: Run scontrol show job <jobid> and check the Reason field, which states exactly why the scheduler is holding the job (e.g., Resources, Priority, or an association limit)
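A few commands we reach for first when triaging these situations (the job ID and node name are placeholders):
# Why is a job still pending? Check the Reason field
scontrol show job 123456 | grep -E "JobState|Reason|Priority"
# Why is a node unavailable? List down/drained nodes with their reasons
sinfo -R
scontrol show node compute-gpu-a100-3 | grep -E "State|Reason"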
Real-World Case Study: Supporting 200+ Researchers
For Genentech's genomics research platform, we deployed a production Slurm cluster supporting:
Platform Specifications:
- 200+ active users across genomics, computational biology, and drug discovery teams
- 5 partitions: CPU, high-memory, GPU (A100/B200), spot, and interactive
- Auto-scaling: 0-500 compute nodes based on queue depth
- Weka storage: 2 PB capacity, 45 GB/s throughput
- 99.9% uptime with automated health monitoring and failover
- 3x cost reduction vs. previous on-premises HPC cluster
Conclusion
Slurm on AWS ParallelCluster provides the ideal foundation for enterprise HPC workloads. With proper configuration of multi-queue architectures, fair-share scheduling, GPU resource management, and integration with high-performance storage like Weka, organizations can build scalable, cost-effective platforms supporting hundreds of researchers.
At DCLOUD9, we specialize in designing and operating these mission-critical HPC platforms for biotech, genomics, and computational research organizations. Our team brings decades of combined experience in Slurm administration, AWS architecture, and performance optimization.
Need Expert Help with Your HPC Platform?
Let DCLOUD9 design and deploy your Slurm-based HPC infrastructure
Request Consultation