HPC Architecture • December 2024 • 15 min read

Building Enterprise HPC Platforms: Slurm Workload Manager Best Practices

Production-tested strategies for configuring Slurm on AWS ParallelCluster to support 200+ researchers with fair-share scheduling and Weka storage integration

Introduction

Slurm (originally the Simple Linux Utility for Resource Management) has become the de facto standard for HPC workload orchestration, powering some of the world's largest supercomputing facilities. However, deploying Slurm in cloud environments—particularly AWS—requires specialized expertise to balance elasticity, cost optimization, and performance.

At DCLOUD9, we've architected Slurm-based HPC platforms on AWS ParallelCluster supporting 200+ data scientists and computational biologists at organizations like Genentech, IAVI, and Imperial College London. This article shares our battle-tested configuration patterns, troubleshooting strategies, and integration approaches with high-performance storage systems like Weka.

Why Slurm for Cloud HPC?

Slurm provides essential capabilities for multi-tenant HPC environments:

  • Resource Management: Fair allocation of compute, memory, and GPU resources across users and projects
  • Job Scheduling: Intelligent queueing with backfill, preemption, and priority-based scheduling
  • Accounting: Detailed tracking of resource consumption for chargeback and reporting
  • Elastic Scaling: Dynamic cluster expansion/contraction based on workload demand
  • Multi-Queue Support: Separate partitions for CPU, GPU, high-memory, and spot instance workloads

When integrated with AWS ParallelCluster, Slurm automatically provisions EC2 instances based on job requirements, significantly reducing costs compared to static on-premises clusters.
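
For example, a plain batch script like the sketch below (job name, partition, and pipeline script are illustrative) is all it takes to trigger provisioning: if no idle node can satisfy the request, ParallelCluster's Slurm power-saving integration launches matching EC2 instances and powers them down again once they sit idle:

#!/bin/bash
#SBATCH --job-name=variant-calling    # illustrative job name
#SBATCH --partition=cpu-standard      # partition defined later in this article
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128G
#SBATCH --time=04:00:00

# Slurm resumes (provisions) CLOUD-state nodes to satisfy this allocation
srun ./run_pipeline.sh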

AWS ParallelCluster + Slurm Architecture

Our reference architecture separates control plane, compute, and storage for optimal performance and cost:

Control Plane

  • Head Node: Runs slurmctld (controller daemon), database, and cluster management services
  • Database Backend: MySQL/MariaDB for accounting data (slurmdbd)
  • Login Nodes: User-facing SSH access, JupyterHub, RStudio Server
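
A minimal sketch of the corresponding control-plane settings in slurm.conf (hostnames and paths are illustrative; ParallelCluster generates most of this for you):

# slurm.conf - control plane (illustrative hostnames/paths)
ClusterName=hpc-prod
SlurmctldHost=head-node
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/state
SlurmctldTimeout=300
ReturnToService=2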

Compute Plane

  • CPU Partitions: General compute (c6i, c7i instances) with auto-scaling
  • GPU Partitions: NVIDIA B200, H100, A100 instances for AI/ML workloads
  • High-Memory Partitions: r6i, r7i instances for genomics assembly and large datasets
  • Spot Partitions: Cost-optimized instances for fault-tolerant batch jobs

Storage Architecture

  • Shared Home: EFS or FSx for Lustre for user home directories
  • Scratch Storage: Weka Data Platform or FSx for high-throughput workloads
  • Archive: S3 with automated lifecycle policies
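
As a sketch, the shared home tier maps onto a ParallelCluster SharedStorage entry like the following; the EFS filesystem ID and mount path are assumptions for illustration:

# ParallelCluster YAML - shared home directories on an existing EFS filesystem
SharedStorage:
  - MountDir: /shared/home
    Name: shared-home
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-0123456789abcdef0   # existing EFS filesystem (assumption)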

Slurm Configuration Best Practices

1. Multi-Queue Architecture

Separate queues (partitions) optimize cost and performance. Our production configuration:

# slurm.conf - Partition Configuration

# General CPU compute
PartitionName=cpu-standard Nodes=compute-cpu-[1-100] Default=YES MaxTime=7-00:00:00 State=UP
PartitionName=cpu-highmem Nodes=compute-highmem-[1-50] MaxTime=7-00:00:00 State=UP

# GPU partitions
PartitionName=gpu-a100 Nodes=compute-gpu-a100-[1-20] MaxTime=2-00:00:00 State=UP
PartitionName=gpu-b200 Nodes=compute-gpu-b200-[1-10] MaxTime=2-00:00:00 State=UP

# Spot instances (cost-optimized)
PartitionName=cpu-spot Nodes=compute-spot-[1-200] MaxTime=12:00:00 State=UP

# Priority configuration
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightQOS=10000
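
Because PriorityWeightQOS is non-zero, the referenced QOS levels must also exist in the accounting database; a minimal sketch with illustrative names and limits:

# Create QOS levels referenced by the priority weights above (illustrative)
sacctmgr add qos normal Priority=10
sacctmgr add qos high Priority=100 MaxTRESPerUser=cpu=512
sacctmgr add qos interactive Priority=50 MaxWall=08:00:00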

2. Fair-Share Scheduling

For multi-tenant environments, fair-share ensures equitable resource distribution:

# Create accounts and associations
sacctmgr add account genomics Description="Genomics Research"
sacctmgr add account drug_discovery Description="Drug Discovery"

# Add users with shares
sacctmgr add user alice Account=genomics Fairshare=100
sacctmgr add user bob Account=drug_discovery Fairshare=100

# View fair-share status
sshare -a
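
Fair-share behaviour is also shaped by decay settings in slurm.conf; the values below are a reasonable starting point rather than a universal recommendation:

# slurm.conf - fair-share decay (tune per site)
PriorityDecayHalfLife=7-0        # historical usage halves every 7 days
PriorityMaxAge=7-0
PriorityUsageResetPeriod=NONE
PriorityFavorSmall=NO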

3. GPU Resource Management

Proper GPU configuration prevents resource conflicts and enables fine-grained allocation:

# gres.conf - GPU resource configuration
NodeName=compute-gpu-b200-[1-10] Name=gpu Type=b200 File=/dev/nvidia[0-7]
NodeName=compute-gpu-a100-[1-20] Name=gpu Type=a100 File=/dev/nvidia[0-7]

# slurm.conf - enable GPU GRES and define GPU nodes
GresTypes=gpu
NodeName=compute-gpu-b200-[1-10] Gres=gpu:b200:8 CPUs=96 RealMemory=2048000 State=CLOUD
NodeName=compute-gpu-a100-[1-20] Gres=gpu:a100:8 CPUs=96 RealMemory=1024000 State=CLOUD

# Example job requesting specific GPU type
#!/bin/bash
#SBATCH --partition=gpu-b200
#SBATCH --gres=gpu:b200:4
#SBATCH --cpus-per-gpu=12
#SBATCH --mem-per-gpu=256G

# Placeholder workload - replace with the actual training or analysis command
srun python train.py
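
To make these GRES allocations binding, so a job only sees the GPUs it was granted, pair them with cgroup enforcement; a typical combination, which you should verify against your Slurm version's cgroup plugin:

# cgroup.conf - confine jobs to their allocated devices, cores, and memory
ConstrainDevices=yes
ConstrainCores=yes
ConstrainRAMSpace=yes

# slurm.conf - cgroup-based task and process tracking
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity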

4. Job Accounting and Reporting

Comprehensive accounting enables chargeback and capacity planning:

# slurm.conf - Accounting configuration
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd-host

# Generate monthly usage report
sreport cluster utilization start=2024-12-01 end=2024-12-31
sreport user topusage start=2024-12-01 end=2024-12-31

# Cost analysis query
sacct -S 2024-12-01 -E 2024-12-31 --format=User,Account,JobID,Partition,AllocCPUs,Elapsed,State
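
For chargeback, GPU usage can also be tracked and billed as a TRES; the weights below are illustrative and should reflect your own instance pricing:

# slurm.conf - bill GPUs alongside CPU and memory (illustrative weights)
AccountingStorageTRES=gres/gpu
# Extend the existing partition definition with billing weights
PartitionName=gpu-a100 Nodes=compute-gpu-a100-[1-20] MaxTime=2-00:00:00 State=UP TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=8.0"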

Integration with Weka Parallel File System

High-performance storage is critical for HPC workloads. Weka provides multi-GB/s throughput with sub-millisecond latency:

Architecture

  • Weka Cluster: Deployed on i4i.4xlarge instances with local NVMe SSDs
  • Client Integration: Weka client installed on compute nodes via ParallelCluster custom AMI
  • Mount Points: /weka/scratch for high-performance temporary storage

ParallelCluster Configuration

CustomActions:
  OnNodeConfigured:
    Script: s3://my-bucket/scripts/install-weka-client.sh

# Weka is not a native ParallelCluster SharedStorage type, so the client install
# and the /weka/scratch mount are handled by the OnNodeConfigured script above.
# A managed FSx for Lustre scratch filesystem can be declared natively instead:
SharedStorage:
  - MountDir: /fsx/scratch
    Name: fsx-scratch
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
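
The bootstrap script itself is sketched below; the backend hostnames and filesystem name are assumptions, and the exact client install steps should be taken from Weka's documentation for your cluster version:

#!/bin/bash
# install-weka-client.sh - illustrative sketch, not a drop-in script
set -euo pipefail

WEKA_BACKENDS="weka-backend-1,weka-backend-2,weka-backend-3"   # assumption
WEKA_FILESYSTEM="scratch"                                      # assumption

# Install the Weka client agent from one of the backends (per Weka docs)
curl -fsSL http://weka-backend-1:14000/dist/v1/install | sh

# Mount the filesystem on the compute node
mkdir -p /weka/scratch
mount -t wekafs "${WEKA_BACKENDS}/${WEKA_FILESYSTEM}" /weka/scratch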

Performance Results

In production environments supporting genomics pipelines:

  • 45 GB/s aggregate throughput for parallel BAM file processing
  • Sub-millisecond latency for metadata operations
  • Linear scaling to 200+ concurrent compute nodes
  • S3 tiering for automated archival of cold data

Elastic Scaling Strategies

Cloud HPC's superpower is elastic scaling. Our configuration automatically scales compute nodes:

# ParallelCluster YAML - Scaling configuration
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
  SlurmQueues:
    - Name: cpu-standard
      ComputeResources:
        - Name: c6i-32xlarge
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 100
      # Slurm will scale from 0 to 100 nodes based on queue depth
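
Scale-up and scale-down activity can be watched directly from the head node with standard Slurm commands:

# Watch elastic scaling from the head node
sinfo -N -l                          # node states: idle~ = powered down, mix/alloc = running
squeue -o "%.12i %.12P %.10T %.20r"  # the Reason column explains why a job is still waiting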

Cost Optimization Techniques

  1. Spot Instances: Up to roughly 70% cost savings for fault-tolerant workloads
  2. Rapid Scale-Down: 5-minute idle time before node termination (the ScaledownIdletime setting above)
  3. Mixed Instance Types: Flexible instance selection within a partition (see the spot queue sketch below)
  4. Reserved Capacity: Savings Plans for baseline compute demand
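
Combining spot capacity with multiple instance types (points 1 and 3) looks roughly like this in the cluster YAML; queue and instance names are illustrative:

# ParallelCluster YAML - spot queue with multiple instance types (illustrative)
SlurmQueues:
  - Name: cpu-spot
    CapacityType: SPOT
    ComputeResources:
      - Name: c6i-16xlarge
        InstanceType: c6i.16xlarge
        MinCount: 0
        MaxCount: 100
      - Name: c7i-16xlarge
        InstanceType: c7i.16xlarge
        MinCount: 0
        MaxCount: 100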

Monitoring and Troubleshooting

Production HPC platforms require comprehensive observability:

Key Metrics

  • Queue Depth: Jobs waiting for resources (squeue)
  • Node State: Active, idle, down, drain (sinfo)
  • Cluster Utilization: CPU, GPU, memory usage (Grafana dashboards)
  • Job Success Rate: Failed vs. completed jobs
  • Scaling Latency: Time from job submission to job start

Common Issues and Solutions

Issue: Nodes Stuck in "Powering Up"

Cause: EC2 capacity constraints or IAM permission issues

Solution: Check CloudWatch logs, verify service quotas, diversify instance types

Issue: Jobs Not Starting Despite Available Nodes

Cause: Resource constraints (CPUs, memory, GPUs) or fair-share limits

Solution: Use scontrol show job JOBID to see the exact pending reason, then compare it against partition limits and fair-share usage (see the diagnostic commands below)
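
A short diagnostic pass usually pins down the blocker; the job ID and node name below are placeholders:

# Why is the job pending?
scontrol show job 123456 | grep -E "JobState|Reason"
squeue -j 123456 -o "%.12i %.10T %.30r"

# Are nodes actually usable?
sinfo -R                                  # reasons for down/drained nodes
scontrol show node compute-gpu-a100-1 | grep -E "State|Reason|Gres"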

Real-World Case Study: Supporting 200+ Researchers

For Genentech's genomics research platform, we deployed a production Slurm cluster supporting:

Platform Specifications:

  • 200+ active users across genomics, computational biology, and drug discovery teams
  • 5 partitions: CPU, high-memory, GPU (A100/B200), spot, and interactive
  • Auto-scaling: 0-500 compute nodes based on queue depth
  • Weka storage: 2 PB capacity, 45 GB/s throughput
  • 99.9% uptime with automated health monitoring and failover
  • 3x cost reduction vs. previous on-premises HPC cluster

Conclusion

Slurm on AWS ParallelCluster provides the ideal foundation for enterprise HPC workloads. With proper configuration of multi-queue architectures, fair-share scheduling, GPU resource management, and integration with high-performance storage like Weka, organizations can build scalable, cost-effective platforms supporting hundreds of researchers.

At DCLOUD9, we specialize in designing and operating these mission-critical HPC platforms for biotech, genomics, and computational research organizations. Our team brings decades of combined experience in Slurm administration, AWS architecture, and performance optimization.

Need Expert Help with Your HPC Platform?

Let DCLOUD9 design and deploy your Slurm-based HPC infrastructure

Request Consultation