AWS ParallelCluster 3.0: Building Modern HPC Platforms with Infrastructure-as-Code
Production-tested Terraform patterns for deploying multi-region HPC platforms with Slurm scheduler, NVIDIA B200 GPU nodes, and Weka storage integration
Introduction
AWS ParallelCluster has revolutionized cloud-based HPC, enabling organizations to deploy elastically scalable compute clusters from a single YAML configuration file. However, production deployments require more than ParallelCluster alone: they need comprehensive infrastructure-as-code (IaC) that manages networking, IAM, storage, monitoring, and security in a repeatable, version-controlled manner.
At DCLOUD9, we've deployed dozens of production HPC platforms using AWS ParallelCluster 3.0 orchestrated with Terraform. Our clients include Genentech, IAVI, and Imperial College London, running workloads from genomics research to AI model training on NVIDIA B200 GPUs with Weka parallel file systems.
This article shares our battle-tested Terraform patterns, security best practices, and deployment strategies for building enterprise-grade HPC platforms on AWS ParallelCluster 3.0.
Why Infrastructure-as-Code for HPC?
Manual cluster deployments don't scale for enterprise environments:
- Repeatability: Deploy identical clusters across dev, staging, production environments
- Version Control: Track changes to cluster configuration over time via Git
- Disaster Recovery: Recreate infrastructure from code in minutes
- Multi-Region: Deploy clusters across regions with consistent configuration
- Compliance: Enforce security policies and audit trails
- Team Collaboration: Code reviews, pull requests, automated testing
AWS ParallelCluster 3.0: Key Improvements
ParallelCluster 3 brings significant architectural improvements over the 2.x series (a few of the features below, such as login node pools, arrived in later 3.x releases):
Multi-User Support
- Login nodes separate from head node for better security
- Integration with Active Directory or AWS Managed Microsoft AD
- Per-user home directories on shared storage
Enhanced Networking
- Multiple security groups and subnets per queue
- Proxy support for restricted network environments
- IPv6 support for future-proofing
Custom Actions and Extensibility
- OnNodeStart, OnNodeConfigured, OnNodeUpdated hooks
- Custom AMI support with streamlined workflows
- S3-based script distribution for configuration
API-Driven Management
- REST API for programmatic cluster operations
- CloudFormation StackSets for multi-region deployment
- Enhanced logging to CloudWatch Logs
Terraform Architecture for ParallelCluster
Our reference architecture uses Terraform to orchestrate all infrastructure components:
Module Structure
hpc-platform-terraform/
├── main.tf # Root module
├── variables.tf # Input variables
├── outputs.tf # Output values
├── modules/
│ ├── network/ # VPC, subnets, NAT gateways
│ ├── iam/ # IAM roles and policies
│ ├── storage/ # FSx, EFS, S3 buckets
│ ├── weka/ # Weka cluster deployment
│ ├── parallelcluster/ # ParallelCluster configuration
│ └── monitoring/ # CloudWatch, Grafana dashboards
├── environments/
│ ├── dev/ # Development cluster
│ ├── staging/ # Staging cluster
│ └── prod/ # Production cluster
└── cluster-config/
└── cluster.yaml.tpl # ParallelCluster YAML template
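A simplified root main.tf shows how these modules compose. The module output names (for example module.network.vpc_id) are illustrative and depend on what each module actually exports:
# main.tf (root module): abridged wiring, output and variable names are illustrative
module "network" {
  source             = "./modules/network"
  vpc_cidr           = var.vpc_cidr
  availability_zones = var.availability_zones
}

module "storage" {
  source     = "./modules/storage"
  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.compute_private_subnet_ids
}

module "weka" {
  source             = "./modules/weka"
  weka_subnet_ids    = module.network.compute_private_subnet_ids
  weka_backend_count = var.weka_backend_count
}

module "parallelcluster" {
  source       = "./modules/parallelcluster"
  environment  = var.environment
  cluster_name = var.cluster_name
  region       = var.region
  key_name     = var.key_name
  vpc_id       = module.network.vpc_id
  subnet_id    = module.network.compute_private_subnet_ids[0]
}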
Key Terraform Resources
# modules/parallelcluster/main.tf
# S3 bucket for cluster configuration and scripts
resource "aws_s3_bucket" "cluster_config" {
bucket = "hpc-cluster-config-${var.environment}"
versioning {
enabled = true
}
server_side_encryption_configuration {
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
}
# Upload cluster configuration
resource "aws_s3_object" "cluster_config" {
bucket = aws_s3_bucket.cluster_config.id
key = "cluster.yaml"
content = templatefile("${path.module}/../../cluster-config/cluster.yaml.tpl", {
vpc_id = var.vpc_id
subnet_id = var.subnet_id
key_name = var.key_name
instance_types = var.instance_types
max_count = var.max_count
weka_mount_script = aws_s3_object.weka_mount.s3_uri
monitoring_script = aws_s3_object.monitoring.s3_uri
})
etag = filemd5("${path.module}/../../cluster-config/cluster.yaml.tpl")
}
# ParallelCluster deployment via null_resource
resource "null_resource" "parallelcluster" {
triggers = {
config_version = aws_s3_object.cluster_config.version_id
}
provisioner "local-exec" {
command = <<-EOT
pcluster create-cluster \
--cluster-name ${var.cluster_name} \
--cluster-configuration s3://${aws_s3_bucket.cluster_config.id}/cluster.yaml \
--region ${var.region}
EOT
}
depends_on = [
aws_s3_object.cluster_config,
module.iam,
module.storage
]
}
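Note that pcluster create-cluster returns as soon as the CloudFormation stack starts building. If later Terraform steps need a running cluster, a wait step can poll describe-cluster until the build completes; the sketch below assumes jq is available on the machine running Terraform:
# Wait for the cluster build to finish before anything downstream uses it
# (assumes jq is installed where Terraform runs)
resource "null_resource" "wait_for_cluster" {
  provisioner "local-exec" {
    command = <<-EOT
      for i in $(seq 1 60); do
        status=$(pcluster describe-cluster \
          --cluster-name ${var.cluster_name} \
          --region ${var.region} | jq -r '.clusterStatus')
        echo "Cluster status: $status"
        [ "$status" = "CREATE_COMPLETE" ] && exit 0
        [ "$status" = "CREATE_FAILED" ] && exit 1
        sleep 60
      done
      exit 1
    EOT
  }
  depends_on = [null_resource.parallelcluster]
}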
ParallelCluster Configuration Template
Our production cluster template integrates multiple queues, Weka storage, and custom actions:
# cluster.yaml.tpl
Region: ${region}
Image:
Os: alinux2
HeadNode:
InstanceType: c6i.4xlarge
Networking:
SubnetId: ${head_subnet_id}
SecurityGroups:
- ${head_security_group}
Ssh:
KeyName: ${key_name}
Iam:
AdditionalIamPolicies:
- Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
CustomActions:
OnNodeConfigured:
Script: ${monitoring_script}
LoginNodes:
Pools:
- Name: login-pool
Count: 2
InstanceType: c6i.2xlarge
Networking:
SubnetIds:
- ${login_subnet_id}
SecurityGroups:
- ${login_security_group}
Scheduling:
Scheduler: slurm
SlurmSettings:
ScaledownIdletime: 5
Database:
Uri: ${database_uri}
UserName: ${database_user}
PasswordSecretArn: ${database_password_secret}
SlurmQueues:
# CPU compute queue
- Name: cpu-standard
ComputeResources:
- Name: c6i-nodes
InstanceType: c6i.32xlarge
MinCount: 0
MaxCount: ${cpu_max_count}
DisableSimultaneousMultithreading: true
Networking:
SubnetIds:
- ${compute_subnet_id}
PlacementGroup:
Enabled: true
CustomActions:
OnNodeConfigured:
Script: ${weka_mount_script}
# GPU queue with B200 instances
- Name: gpu-b200
ComputeResources:
- Name: b200-nodes
InstanceType: p6-b200.48xlarge
MinCount: 0
MaxCount: ${gpu_max_count}
Efa:
Enabled: true
GdrSupport: true
Networking:
SubnetIds:
- ${compute_subnet_id}
PlacementGroup:
Enabled: true
CustomActions:
OnNodeConfigured:
Script: ${weka_mount_script}
# Spot instance queue for cost optimization
- Name: cpu-spot
CapacityType: SPOT
ComputeResources:
- Name: spot-nodes
InstanceType: c6i.32xlarge
MinCount: 0
MaxCount: 200
Networking:
SubnetIds:
- ${compute_subnet_id}
SharedStorage:
- MountDir: /shared
Name: shared-home
StorageType: Efs
EfsSettings:
EncryptedFileSystem: true
PerformanceMode: generalPurpose
ThroughputMode: elastic
Monitoring:
Logs:
CloudWatch:
Enabled: true
RetentionInDays: 30
Dashboards:
CloudWatch:
Enabled: true
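The Database section above points Slurm accounting at an external MySQL-compatible database and expects its password in AWS Secrets Manager. A minimal sketch of provisioning that secret (variable names are ours; in practice the password should come from a secure pipeline variable rather than plaintext tfvars):
# Secret consumed by the PasswordSecretArn setting in the cluster template
resource "aws_secretsmanager_secret" "slurm_db_password" {
  name       = "hpc/${var.environment}/slurm-db-password"
  kms_key_id = var.kms_key_arn
}

resource "aws_secretsmanager_secret_version" "slurm_db_password" {
  secret_id     = aws_secretsmanager_secret.slurm_db_password.id
  secret_string = var.slurm_db_password
}

output "database_password_secret_arn" {
  value = aws_secretsmanager_secret.slurm_db_password.arn
}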
Integrating Weka with Terraform
Weka deployment is fully automated through Terraform:
# modules/weka/main.tf
# Weka backend instances
resource "aws_instance" "weka_backend" {
count = var.weka_backend_count
ami = var.weka_ami_id
instance_type = "i4i.8xlarge"
subnet_id = element(var.weka_subnet_ids, count.index)
vpc_security_group_ids = [aws_security_group.weka.id]
iam_instance_profile = aws_iam_instance_profile.weka.name
user_data = templatefile("${path.module}/weka-backend-init.sh", {
cluster_name = var.weka_cluster_name
backend_index = count.index
s3_bucket = var.weka_s3_bucket
license_key = var.weka_license_key
})
tags = {
Name = "weka-backend-${count.index}"
Role = "weka-backend"
}
}
# Network Load Balancer for Weka clients
resource "aws_lb" "weka_nlb" {
name = "weka-nlb"
internal = true
load_balancer_type = "network"
subnets = var.weka_subnet_ids
enable_cross_zone_load_balancing = true
}
# S3 script for mounting Weka on compute nodes
resource "aws_s3_object" "weka_mount_script" {
bucket = var.cluster_config_bucket
key = "scripts/mount-weka.sh"
content = templatefile("${path.module}/mount-weka.sh.tpl", {
weka_nlb_dns = aws_lb.weka_nlb.dns_name
})
}
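The NLB still needs a target group, listener, and backend registrations before clients can reach Weka through it. The sketch below assumes the backends serve clients on port 14000; verify the port and protocol against your Weka release:
# Target group, listener, and backend registrations for the Weka NLB
# (port 14000 is an assumption; confirm against your Weka deployment)
resource "aws_lb_target_group" "weka_backend" {
  name     = "weka-backend"
  port     = 14000
  protocol = "TCP"
  vpc_id   = var.vpc_id
}

resource "aws_lb_listener" "weka" {
  load_balancer_arn = aws_lb.weka_nlb.arn
  port              = 14000
  protocol          = "TCP"
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.weka_backend.arn
  }
}

resource "aws_lb_target_group_attachment" "weka_backend" {
  count            = var.weka_backend_count
  target_group_arn = aws_lb_target_group.weka_backend.arn
  target_id        = aws_instance.weka_backend[count.index].id
  port             = 14000
}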
Security Best Practices
1. Network Isolation
# modules/network/main.tf
# Private subnets for compute nodes - no internet access
resource "aws_subnet" "compute_private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index)
availability_zone = var.availability_zones[count.index]
tags = {
Name = "compute-private-${var.availability_zones[count.index]}"
}
}
# Public subnets for login nodes with restrictive security groups
resource "aws_subnet" "login_public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 4, count.index + 10)
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = {
Name = "login-public-${var.availability_zones[count.index]}"
}
}
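The restrictive security group attached to the login nodes can be as simple as SSH-only ingress from an allow-listed set of CIDR ranges (the variable name is illustrative):
# Security group for login nodes: SSH only, from approved networks
resource "aws_security_group" "login" {
  name        = "hpc-login-nodes"
  description = "SSH access to login nodes from approved networks only"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "SSH from corporate VPN / office ranges"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = var.allowed_ssh_cidrs
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}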
2. IAM Least Privilege
# modules/iam/compute-node-policy.tf
data "aws_iam_policy_document" "compute_node" {
# S3 access for cluster configuration (read-only)
statement {
effect = "Allow"
actions = [
"s3:GetObject",
"s3:ListBucket"
]
resources = [
"arn:aws:s3:::${var.cluster_config_bucket}",
"arn:aws:s3:::${var.cluster_config_bucket}/*"
]
}
# CloudWatch Logs for monitoring
statement {
effect = "Allow"
actions = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
resources = ["arn:aws:logs:*:*:log-group:/aws/parallelcluster/*"]
}
# Secrets Manager for database passwords (read-only)
statement {
effect = "Allow"
actions = [
"secretsmanager:GetSecretValue"
]
resources = [var.database_password_secret_arn]
}
}
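The document is then published as a managed policy whose ARN is fed into the cluster template, for example through a queue's Iam/AdditionalIamPolicies setting, so ParallelCluster attaches it to the compute-node role it creates:
# Publish the document as a managed policy and expose its ARN to the cluster template
resource "aws_iam_policy" "compute_node" {
  name   = "hpc-compute-node-${var.environment}"
  policy = data.aws_iam_policy_document.compute_node.json
}

output "compute_node_policy_arn" {
  value = aws_iam_policy.compute_node.arn
}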
3. Encryption Everywhere
- EBS Volumes: Encrypted with KMS customer-managed keys (see the key sketch after this list)
- S3 Buckets: Server-side encryption with versioning
- EFS/FSx: Encryption at rest and in transit
- Secrets: AWS Secrets Manager for sensitive data
- Network: TLS for all external communications
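For the first item, a customer-managed key and the account-level EBS encryption default can live in the same codebase; a minimal sketch, with illustrative resource names:
# Customer-managed key for EBS volumes, set as the account default
resource "aws_kms_key" "ebs" {
  description             = "CMK for HPC cluster EBS volumes"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}

resource "aws_ebs_default_kms_key" "this" {
  key_arn = aws_kms_key.ebs.arn
}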
Multi-Environment Strategy
Terraform workspaces and environment-specific variables enable dev/staging/prod workflows:
# environments/prod/terraform.tfvars
environment = "prod"
region = "us-west-2"
cluster_name = "genomics-prod"
vpc_cidr = "10.0.0.0/16"
# CPU queue configuration
cpu_instance_type = "c6i.32xlarge"
cpu_max_count = 100
# GPU queue configuration
gpu_instance_type = "p6-b200.48xlarge"
gpu_max_count = 20
# Weka storage
weka_backend_count = 8
weka_instance_type = "i4i.8xlarge"
# High availability
multi_az = true
backup_retention = 30
# Cost optimization
enable_spot_queue = true
scaledown_idletime = 5
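Each environment also pins its state to the shared S3 backend with DynamoDB locking described in the case study below; the bucket and table names here are placeholders:
# environments/prod/backend.tf (bucket and table names are placeholders)
terraform {
  backend "s3" {
    bucket         = "hpc-terraform-state-prod"
    key            = "hpc-platform/prod/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "hpc-terraform-locks"
    encrypt        = true
  }
}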
Automated Cluster Lifecycle Management
Our Terraform modules include automated operations:
1. Cluster Updates
# Update cluster configuration
resource "null_resource" "parallelcluster_update" {
triggers = {
config_version = aws_s3_object.cluster_config.version_id
}
provisioner "local-exec" {
command = <<-EOT
pcluster update-cluster \
--cluster-name ${var.cluster_name} \
--cluster-configuration s3://${aws_s3_bucket.cluster_config.id}/cluster.yaml \
--region ${var.region}
EOT
}
}
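The same pattern covers teardown: a destroy-time provisioner calls pcluster delete-cluster so that terraform destroy removes the cluster stack before its supporting infrastructure. The sketch below copies the name and region into triggers because destroy provisioners can only reference self:
# Delete the cluster when the surrounding infrastructure is destroyed
resource "null_resource" "parallelcluster_delete" {
  triggers = {
    cluster_name = var.cluster_name
    region       = var.region
  }
  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      pcluster delete-cluster \
        --cluster-name ${self.triggers.cluster_name} \
        --region ${self.triggers.region}
    EOT
  }
}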
2. Backup and Disaster Recovery
# AWS Backup for EFS
resource "aws_backup_plan" "cluster_backup" {
name = "hpc-cluster-backup"
rule {
rule_name = "daily_backup"
target_vault_name = aws_backup_vault.cluster.name
schedule = "cron(0 2 * * ? *)"
lifecycle {
delete_after = 30
}
}
}
# Backup selection
resource "aws_backup_selection" "cluster_efs" {
name = "cluster-efs"
plan_id = aws_backup_plan.cluster_backup.id
iam_role_arn = aws_iam_role.backup.arn
resources = [
module.storage.efs_arn
]
}
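The plan and selection reference a backup vault and an IAM role; a minimal version of both looks like this:
# Backup vault and service role referenced above
resource "aws_backup_vault" "cluster" {
  name = "hpc-cluster-backup-vault"
}

resource "aws_iam_role" "backup" {
  name = "hpc-cluster-backup-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "backup.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "backup" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}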
Monitoring and Observability
Comprehensive monitoring through Terraform-managed CloudWatch and Grafana:
# modules/monitoring/cloudwatch.tf
# CloudWatch dashboard for cluster metrics
resource "aws_cloudwatch_dashboard" "cluster" {
dashboard_name = "${var.cluster_name}-dashboard"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
properties = {
metrics = [
["AWS/EC2", "CPUUtilization", {stat = "Average"}],
[".", "NetworkIn", {stat = "Sum"}],
[".", "NetworkOut", {stat = "Sum"}]
]
period = 300
stat = "Average"
region = var.region
title = "Compute Node Metrics"
}
}
]
})
}
# CloudWatch alarm on Slurm job queue depth. QueueDepth is a custom metric
# published by our monitoring script on the head node; ParallelCluster does not
# emit it out of the box.
resource "aws_cloudwatch_metric_alarm" "high_queue_depth" {
alarm_name = "${var.cluster_name}-high-queue-depth"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "2"
metric_name = "QueueDepth"
namespace = "ParallelCluster"
period = "300"
statistic = "Average"
threshold = "100"
alarm_description = "Alert when job queue depth exceeds threshold"
alarm_actions = [aws_sns_topic.alerts.arn]
}
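The alarm publishes to an SNS topic managed in the same module. An email subscription is the simplest starting point (the address variable is illustrative, and the endpoint must be confirmed manually):
# Alert topic referenced by the alarm above
resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email # e.g. hpc-oncall@example.com
}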
Cost Optimization Strategies
- Spot Instances: Dedicated spot queue for fault-tolerant workloads
- Aggressive Scale-Down: 5-minute idle timeout for compute nodes
- Instance Rightsizing: Match instance types to workload profiles
- Storage Tiering: Weka S3 tiering for cold data
- Savings Plans: Terraform-managed commitment for baseline capacity
CI/CD Pipeline for Cluster Deployments
GitLab CI pipeline for automated testing and deployment:
# .gitlab-ci.yml
stages:
- validate
- plan
- deploy
terraform-validate:
stage: validate
script:
- terraform init
- terraform validate
- terraform fmt -check
terraform-plan:
stage: plan
script:
- terraform init
- terraform plan -out=tfplan
artifacts:
paths:
- tfplan
terraform-apply:
stage: deploy
script:
- terraform init
- terraform apply tfplan
when: manual
only:
- main
Production Case Study: Multi-Region Deployment
For a global genomics research consortium, we deployed identical HPC clusters across three AWS regions:
Deployment Specifications:
- Regions: us-west-2, eu-west-1, ap-southeast-1
- Terraform State: S3 backend with DynamoDB locking
- Deployment Time: 25 minutes per cluster (fully automated)
- Configuration Drift: Zero—enforced through Terraform
- DR Recovery: Tested quarterly, < 30 minutes to full operation
- Cost Savings: 40% vs. manual deployment (reduced errors, faster scaling)
Conclusion
AWS ParallelCluster 3.0 combined with Terraform infrastructure-as-code provides a powerful foundation for enterprise HPC platforms. Our production-tested patterns enable:
- Repeatable, version-controlled cluster deployments
- Multi-environment workflows (dev/staging/prod)
- Comprehensive security and compliance
- Integration with Weka storage and NVIDIA B200 GPUs
- Automated lifecycle management and monitoring
At DCLOUD9, we specialize in designing and deploying these next-generation HPC platforms for biotech, genomics, and AI research organizations. Our DevSecOps expertise ensures production-ready infrastructure that scales, performs, and remains secure.
Ready to Modernize Your HPC Infrastructure?
Let DCLOUD9 build your infrastructure-as-code HPC platform
Request Consultation