Cloud Architecture • October 2024 • 16 min read

AWS ParallelCluster 3.0: Building Modern HPC Platforms with Infrastructure-as-Code

Production-tested Terraform patterns for deploying multi-region HPC platforms with Slurm scheduler, NVIDIA B200 GPU nodes, and Weka storage integration

Introduction

AWS ParallelCluster has revolutionized cloud-based HPC, enabling organizations to deploy elastically scalable compute clusters with just a YAML configuration file. However, production deployments require more than just ParallelCluster—they need comprehensive infrastructure-as-code (IaC) that manages networking, IAM, storage, monitoring, and security in a repeatable, version-controlled manner.

At DCLOUD9, we've deployed dozens of production HPC platforms using AWS ParallelCluster 3.0 orchestrated with Terraform. Our clients include Genentech, IAVI, and Imperial College London, running workloads from genomics research to AI model training on NVIDIA B200 GPUs with Weka parallel file systems.

This article shares our battle-tested Terraform patterns, security best practices, and deployment strategies for building enterprise-grade HPC platforms on AWS ParallelCluster 3.0.

Why Infrastructure-as-Code for HPC?

Manual cluster deployments don't scale for enterprise environments:

  • Repeatability: Deploy identical clusters across dev, staging, production environments
  • Version Control: Track changes to cluster configuration over time via Git
  • Disaster Recovery: Recreate infrastructure from code in minutes
  • Multi-Region: Deploy clusters across regions with consistent configuration
  • Compliance: Enforce security policies and audit trails
  • Team Collaboration: Code reviews, pull requests, automated testing

AWS ParallelCluster 3.0: Key Improvements

ParallelCluster 3 brings significant architectural improvements over the 2.x line (note that some of the features below, such as managed login node pools, arrived in later 3.x releases):

Multi-User Support

  • Login nodes separate from head node for better security
  • Integration with Active Directory or AWS Managed Microsoft AD
  • Per-user home directories on shared storage

Enhanced Networking

  • Multiple security groups and subnets per queue
  • Proxy support for restricted network environments
  • IPv6 support for future-proofing

Custom Actions and Extensibility

  • OnNodeStart, OnNodeConfigured, OnNodeUpdated hooks
  • Custom AMI support with streamlined workflows
  • S3-based script distribution for configuration

API-Driven Management

  • REST API for programmatic cluster operations (see the Terraform sketch after this list)
  • CloudFormation StackSets for multi-region deployment
  • Enhanced logging to CloudWatch Logs
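Because both the CLI and the REST API return JSON, cluster state can be read back into Terraform. A minimal sketch using the external data source (the jq filter and output shape are illustrative assumptions; it requires the pcluster CLI and jq on the machine running Terraform):

# Read live cluster status back into Terraform via the pcluster CLI
data "external" "cluster_status" {
  program = [
    "bash", "-c",
    "pcluster describe-cluster --cluster-name ${var.cluster_name} --region ${var.region} | jq '{status: .clusterStatus}'"
  ]
}

output "cluster_status" {
  value = data.external.cluster_status.result.status
}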

Terraform Architecture for ParallelCluster

Our reference architecture uses Terraform to orchestrate all infrastructure components:

Module Structure

hpc-platform-terraform/
├── main.tf                    # Root module
├── variables.tf               # Input variables
├── outputs.tf                 # Output values
├── modules/
│   ├── network/               # VPC, subnets, NAT gateways
│   ├── iam/                   # IAM roles and policies
│   ├── storage/               # FSx, EFS, S3 buckets
│   ├── weka/                  # Weka cluster deployment
│   ├── parallelcluster/       # ParallelCluster configuration
│   └── monitoring/            # CloudWatch, Grafana dashboards
├── environments/
│   ├── dev/                   # Development cluster
│   ├── staging/               # Staging cluster
│   └── prod/                  # Production cluster
└── cluster-config/
    └── cluster.yaml.tpl       # ParallelCluster YAML template
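At the root, main.tf wires these modules together so that outputs from one (subnets, IAM roles, script locations) become inputs to the next. A condensed sketch of that wiring (the inputs shown are illustrative, not the modules' full interfaces):

# main.tf (root module) - condensed wiring of the child modules
module "network" {
  source             = "./modules/network"
  vpc_cidr           = var.vpc_cidr
  availability_zones = var.availability_zones
}

module "iam" {
  source      = "./modules/iam"
  environment = var.environment
}

module "storage" {
  source     = "./modules/storage"
  subnet_ids = module.network.private_subnet_ids
}

module "weka" {
  source             = "./modules/weka"
  weka_subnet_ids    = module.network.private_subnet_ids
  weka_backend_count = var.weka_backend_count
}

module "parallelcluster" {
  source            = "./modules/parallelcluster"
  cluster_name      = var.cluster_name
  environment       = var.environment
  region            = var.region
  compute_subnet_id = module.network.private_subnet_ids[0]
}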

Key Terraform Resources

# modules/parallelcluster/main.tf

# S3 bucket for cluster configuration and scripts
resource "aws_s3_bucket" "cluster_config" {
  bucket = "hpc-cluster-config-${var.environment}"

  versioning {
    enabled = true
  }

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}

# Upload cluster configuration
resource "aws_s3_object" "cluster_config" {
  bucket = aws_s3_bucket.cluster_config.id
  key    = "cluster.yaml"
  content = templatefile("${path.module}/../../cluster-config/cluster.yaml.tpl", {
    vpc_id                = var.vpc_id
    subnet_id             = var.subnet_id
    key_name              = var.key_name
    instance_types        = var.instance_types
    max_count             = var.max_count
    weka_mount_script     = aws_s3_object.weka_mount.s3_uri
    monitoring_script     = aws_s3_object.monitoring.s3_uri
  })

  etag = filemd5("${path.module}/../../cluster-config/cluster.yaml.tpl")
}

# ParallelCluster deployment via null_resource
resource "null_resource" "parallelcluster" {
  triggers = {
    config_version = aws_s3_object.cluster_config.version_id
  }

  provisioner "local-exec" {
    command = <<-EOT
      pcluster create-cluster \
        --cluster-name ${var.cluster_name} \
        --cluster-configuration s3://${aws_s3_bucket.cluster_config.id}/cluster.yaml \
        --region ${var.region}
    EOT
  }

  depends_on = [
    aws_s3_object.cluster_config,
    module.iam,
    module.storage
  ]
}

ParallelCluster Configuration Template

Our production cluster template integrates multiple queues, Weka storage, and custom actions:

# cluster.yaml.tpl
Region: ${region}
Image:
  Os: alinux2

HeadNode:
  InstanceType: c6i.4xlarge
  Networking:
    SubnetId: ${head_subnet_id}
    SecurityGroups:
      - ${head_security_group}
  Ssh:
    KeyName: ${key_name}
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  CustomActions:
    OnNodeConfigured:
      Script: ${monitoring_script}

LoginNodes:
  Pools:
    - Name: login-pool
      Count: 2
      InstanceType: c6i.2xlarge
      Networking:
        SubnetIds:
          - ${login_subnet_id}
        SecurityGroups:
          - ${login_security_group}

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 5
    Database:
      Uri: ${database_uri}
      UserName: ${database_user}
      PasswordSecretArn: ${database_password_secret}

  SlurmQueues:
    # CPU compute queue
    - Name: cpu-standard
      ComputeResources:
        - Name: c6i-nodes
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: ${cpu_max_count}
          DisableSimultaneousMultithreading: true
      Networking:
        SubnetIds:
          - ${compute_subnet_id}
        PlacementGroup:
          Enabled: true
      CustomActions:
        OnNodeConfigured:
          Script: ${weka_mount_script}

    # GPU queue (p5.48xlarge shown here; substitute the B200-class
    # instance type offered in your region)
    - Name: gpu-b200
      ComputeResources:
        - Name: b200-nodes
          InstanceType: p5.48xlarge
          MinCount: 0
          MaxCount: ${gpu_max_count}
          Efa:
            Enabled: true
            GdrSupport: true
      Networking:
        SubnetIds:
          - ${compute_subnet_id}
        PlacementGroup:
          Enabled: true
      CustomActions:
        OnNodeConfigured:
          Script: ${weka_mount_script}

    # Spot instance queue for cost optimization
    - Name: cpu-spot
      CapacityType: SPOT
      ComputeResources:
        - Name: spot-nodes
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 200
      Networking:
        SubnetIds:
          - ${compute_subnet_id}

SharedStorage:
  - MountDir: /shared
    Name: shared-home
    StorageType: Efs
    EfsSettings:
      EncryptedFileSystem: true
      PerformanceMode: generalPurpose
      ThroughputMode: elastic

Monitoring:
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 30
  Dashboards:
    CloudWatch:
      Enabled: true

Integrating Weka with Terraform

Weka deployment is fully automated through Terraform:

# modules/weka/main.tf

# Weka backend instances
resource "aws_instance" "weka_backend" {
  count                  = var.weka_backend_count
  ami                    = var.weka_ami_id
  instance_type          = "i4i.8xlarge"
  subnet_id              = element(var.weka_subnet_ids, count.index)
  vpc_security_group_ids = [aws_security_group.weka.id]

  iam_instance_profile   = aws_iam_instance_profile.weka.name

  user_data = templatefile("${path.module}/weka-backend-init.sh", {
    cluster_name     = var.weka_cluster_name
    backend_index    = count.index
    s3_bucket        = var.weka_s3_bucket
    license_key      = var.weka_license_key
  })

  tags = {
    Name = "weka-backend-${count.index}"
    Role = "weka-backend"
  }
}

# Network Load Balancer for Weka clients
resource "aws_lb" "weka_nlb" {
  name               = "weka-nlb"
  internal           = true
  load_balancer_type = "network"
  subnets            = var.weka_subnet_ids

  enable_cross_zone_load_balancing = true
}

# S3 script for mounting Weka on compute nodes
resource "aws_s3_object" "weka_mount_script" {
  bucket = var.cluster_config_bucket
  key    = "scripts/mount-weka.sh"
  content = templatefile("${path.module}/mount-weka.sh.tpl", {
    weka_nlb_dns = aws_lb.weka_nlb.dns_name
  })
}
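The NLB above still needs a target group and listener pointing at the backend instances before clients can mount through it. A sketch of that plumbing (port 14000 and var.vpc_id are assumptions; use the frontend port and VPC input your Weka deployment actually exposes):

# Target group and listener for the Weka backends
# (port 14000 is assumed; match your Weka frontend port)
resource "aws_lb_target_group" "weka" {
  name     = "weka-backends"
  port     = 14000
  protocol = "TCP"
  vpc_id   = var.vpc_id
}

resource "aws_lb_target_group_attachment" "weka" {
  count            = var.weka_backend_count
  target_group_arn = aws_lb_target_group.weka.arn
  target_id        = aws_instance.weka_backend[count.index].id
  port             = 14000
}

resource "aws_lb_listener" "weka" {
  load_balancer_arn = aws_lb.weka_nlb.arn
  port              = 14000
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.weka.arn
  }
}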

Security Best Practices

1. Network Isolation

# modules/network/main.tf

# Private subnets for compute nodes - no inbound internet exposure
# (outbound traffic goes through the NAT gateways in this module)
resource "aws_subnet" "compute_private" {
  count             = length(var.availability_zones)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index)
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name = "compute-private-${var.availability_zones[count.index]}"
  }
}

# Public subnets for login nodes with restrictive security groups
resource "aws_subnet" "login_public" {
  count                   = length(var.availability_zones)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 4, count.index + 10)
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "login-public-${var.availability_zones[count.index]}"
  }
}
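The "restrictive security groups" on those public login subnets come down to allowing SSH only from known corporate ranges. A minimal sketch (var.allowed_ssh_cidrs is an assumed input):

# Security group for login nodes: SSH only from approved CIDR ranges
resource "aws_security_group" "login" {
  name_prefix = "hpc-login-"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "SSH from corporate networks only"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = var.allowed_ssh_cidrs
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}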

2. IAM Least Privilege

# modules/iam/compute-node-policy.tf

data "aws_iam_policy_document" "compute_node" {
  # S3 access for cluster configuration (read-only)
  statement {
    effect = "Allow"
    actions = [
      "s3:GetObject",
      "s3:ListBucket"
    ]
    resources = [
      "arn:aws:s3:::${var.cluster_config_bucket}",
      "arn:aws:s3:::${var.cluster_config_bucket}/*"
    ]
  }

  # CloudWatch Logs for monitoring
  statement {
    effect = "Allow"
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = ["arn:aws:logs:*:*:log-group:/aws/parallelcluster/*"]
  }

  # Secrets Manager for database passwords (read-only)
  statement {
    effect = "Allow"
    actions = [
      "secretsmanager:GetSecretValue"
    ]
    resources = [var.database_password_secret_arn]
  }
}
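One way to consume this document is to publish it as a managed policy and hand its ARN to ParallelCluster through AdditionalIamPolicies on the head node and queues; a minimal sketch (the policy name and output are assumptions):

# Managed policy built from the document above; its ARN is passed to
# the cluster template under AdditionalIamPolicies
resource "aws_iam_policy" "compute_node" {
  name   = "hpc-compute-node-${var.environment}"
  policy = data.aws_iam_policy_document.compute_node.json
}

output "compute_node_policy_arn" {
  value = aws_iam_policy.compute_node.arn
}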

3. Encryption Everywhere

  • EBS Volumes: Encrypted with KMS customer-managed keys (see the sketch after this list)
  • S3 Buckets: Server-side encryption with versioning
  • EFS/FSx: Encryption at rest and in transit
  • Secrets: AWS Secrets Manager for sensitive data
  • Network: TLS for all external communications
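A customer-managed key underpins the EBS portion of this list. One common pattern, sketched below, is to define the key in Terraform and enable account-level default EBS encryption with it, so every volume the cluster creates is encrypted without per-resource settings (the key configuration shown is an assumption, not the only option):

# Customer-managed key plus account-level default EBS encryption;
# note these EBS defaults apply account- and region-wide
resource "aws_kms_key" "hpc" {
  description             = "HPC platform encryption key"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}

resource "aws_ebs_default_kms_key" "this" {
  key_arn = aws_kms_key.hpc.arn
}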

Multi-Environment Strategy

Terraform workspaces and environment-specific variables enable dev/staging/prod workflows:

# environments/prod/terraform.tfvars

environment          = "prod"
region              = "us-west-2"
cluster_name        = "genomics-prod"
vpc_cidr            = "10.0.0.0/16"

# CPU queue configuration
cpu_instance_type   = "c6i.32xlarge"
cpu_max_count       = 100

# GPU queue configuration
gpu_instance_type   = "p5.48xlarge"
gpu_max_count       = 20

# Weka storage
weka_backend_count  = 8
weka_instance_type  = "i4i.8xlarge"

# High availability
multi_az            = true
backup_retention    = 30

# Cost optimization
enable_spot_queue   = true
scaledown_idletime  = 5
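On the consuming side these values are declared as typed variables in the root module, so a prod plan differs from dev only by its tfvars file. An abbreviated sketch of those declarations (descriptions and defaults are illustrative):

# variables.tf (root module) - abbreviated
variable "environment" {
  type = string
}

variable "cpu_max_count" {
  description = "Maximum nodes in the cpu-standard queue"
  type        = number
  default     = 10
}

variable "enable_spot_queue" {
  type    = bool
  default = true
}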

Automated Cluster Lifecycle Management

Our Terraform modules include automated operations:

1. Cluster Updates

# Update cluster configuration
resource "null_resource" "parallelcluster_update" {
  triggers = {
    config_version = aws_s3_object.cluster_config.version_id
  }

  provisioner "local-exec" {
    command = <<-EOT
      pcluster update-cluster \
        --cluster-name ${var.cluster_name} \
        --cluster-configuration s3://${aws_s3_bucket.cluster_config.id}/cluster.yaml \
        --region ${var.region}
    EOT
  }
}

2. Backup and Disaster Recovery

# AWS Backup for EFS
resource "aws_backup_plan" "cluster_backup" {
  name = "hpc-cluster-backup"

  rule {
    rule_name         = "daily_backup"
    target_vault_name = aws_backup_vault.cluster.name
    schedule          = "cron(0 2 * * ? *)"

    lifecycle {
      delete_after = 30
    }
  }
}

# Backup selection
resource "aws_backup_selection" "cluster_efs" {
  name         = "cluster-efs"
  plan_id      = aws_backup_plan.cluster_backup.id
  iam_role_arn = aws_iam_role.backup.arn

  resources = [
    module.storage.efs_arn
  ]
}
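The plan and selection above reference a vault and a backup role defined alongside them; a minimal sketch of both (names are illustrative, and the vault's KMS key is assumed to be the platform key):

# Backup vault referenced by the plan above
resource "aws_backup_vault" "cluster" {
  name        = "hpc-cluster-vault"
  kms_key_arn = aws_kms_key.hpc.arn # assumed platform key
}

# Service role AWS Backup assumes when backing up EFS
data "aws_iam_policy_document" "backup_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["backup.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "backup" {
  name               = "hpc-backup-role"
  assume_role_policy = data.aws_iam_policy_document.backup_assume.json
}

resource "aws_iam_role_policy_attachment" "backup" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}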

Monitoring and Observability

Comprehensive monitoring through Terraform-managed CloudWatch and Grafana:

# modules/monitoring/cloudwatch.tf

# CloudWatch dashboard for cluster metrics
resource "aws_cloudwatch_dashboard" "cluster" {
  dashboard_name = "${var.cluster_name}-dashboard"

  dashboard_body = jsonencode({
    widgets = [
      {
        type = "metric"
        properties = {
          metrics = [
            ["AWS/EC2", "CPUUtilization", {stat = "Average"}],
            [".", "NetworkIn", {stat = "Sum"}],
            [".", "NetworkOut", {stat = "Sum"}]
          ]
          period = 300
          stat   = "Average"
          region = var.region
          title  = "Compute Node Metrics"
        }
      }
    ]
  })
}

# CloudWatch alarms (QueueDepth is assumed to be a custom metric published
# by the cluster's monitoring scripts; it is not a built-in AWS metric)
resource "aws_cloudwatch_metric_alarm" "high_queue_depth" {
  alarm_name          = "${var.cluster_name}-high-queue-depth"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "QueueDepth"
  namespace           = "ParallelCluster"
  period              = "300"
  statistic           = "Average"
  threshold           = "100"
  alarm_description   = "Alert when job queue depth exceeds threshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
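The alarm publishes to an SNS topic managed in the same module; a minimal sketch with an e-mail subscription (var.alert_email is an assumed input):

# SNS topic referenced by the alarm's alarm_actions
resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

resource "aws_sns_topic_subscription" "ops_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email # assumed ops distribution list
}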

Cost Optimization Strategies

  1. Spot Instances: Dedicated spot queue for fault-tolerant workloads
  2. Aggressive Scale-Down: 5-minute idle timeout for compute nodes
  3. Instance Rightsizing: Match instance types to workload profiles
  4. Storage Tiering: Weka S3 tiering for cold data
  5. Savings Plans: Terraform-managed commitment for baseline capacity

CI/CD Pipeline for Cluster Deployments

GitLab CI pipeline for automated testing and deployment:

# .gitlab-ci.yml

stages:
  - validate
  - plan
  - deploy

terraform-validate:
  stage: validate
  script:
    - terraform init
    - terraform validate
    - terraform fmt -check

terraform-plan:
  stage: plan
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan

terraform-apply:
  stage: deploy
  script:
    - terraform init
    - terraform apply tfplan
  when: manual
  only:
    - main

Production Case Study: Multi-Region Deployment

For a global genomics research consortium, we deployed identical HPC clusters across three AWS regions:

Deployment Specifications:

  • Regions: us-west-2, eu-west-1, ap-southeast-1
  • Terraform State: S3 backend with DynamoDB locking (backend sketch below)
  • Deployment Time: 25 minutes per cluster (fully automated)
  • Configuration Drift: Zero—enforced through Terraform
  • DR Recovery: Tested quarterly, < 30 minutes to full operation
  • Cost Savings: 40% vs. manual deployment (reduced errors, faster scaling)
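The shared-state setup behind that Terraform State bullet is a standard S3 backend with a DynamoDB lock table and one state key per environment and region. A sketch (bucket, key, and table names are illustrative):

# backend.tf - remote state shared across the team
terraform {
  backend "s3" {
    bucket         = "hpc-platform-terraform-state"
    key            = "prod/us-west-2/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "hpc-terraform-locks"
    encrypt        = true
  }
}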

Conclusion

AWS ParallelCluster 3.0 combined with Terraform infrastructure-as-code provides a powerful foundation for enterprise HPC platforms. Our production-tested patterns enable:

  • Repeatable, version-controlled cluster deployments
  • Multi-environment workflows (dev/staging/prod)
  • Comprehensive security and compliance
  • Integration with Weka storage and NVIDIA B200 GPUs
  • Automated lifecycle management and monitoring

At DCLOUD9, we specialize in designing and deploying these next-generation HPC platforms for biotech, genomics, and AI research organizations. Our DevSecOps expertise ensures production-ready infrastructure that scales, performs, and remains secure.

Ready to Modernize Your HPC Infrastructure?

Let DCLOUD9 build your infrastructure-as-code HPC platform
