Storage Solutions • November 2024 • 14 min read

Weka Data Platform: High-Performance Storage for AI/HPC Workloads

How we leverage Weka's parallel file system with AWS ParallelCluster to deliver multi-GB/s throughput and sub-millisecond latency for genomics and AI workloads

Introduction

Storage is often the unsung hero—and frequent bottleneck—of modern HPC and AI platforms. While organizations invest in powerful NVIDIA B200 GPUs and high-core-count CPU instances, inadequate storage architecture can cripple performance, leaving expensive compute resources idle while waiting for data.

At DCLOUD9, we've deployed Weka Data Platform for several biotech and genomics clients running on AWS ParallelCluster. The results speak for themselves: 45 GB/s aggregate throughput, sub-millisecond latency, and seamless scaling to support 200+ concurrent researchers processing petabytes of genomics data.

This article explores why traditional storage fails for modern AI/HPC workloads, how Weka's architecture solves these challenges, and our production-tested integration patterns with AWS ParallelCluster and Slurm.

The Storage Challenge in AI/HPC

Modern computational workloads have fundamentally different I/O patterns than traditional enterprise applications:

Genomics Data Pipelines

  • Massive sequential reads: Processing TB-scale BAM/FASTQ files
  • Parallel access: 100+ compute nodes reading simultaneously
  • Small-file operations: Millions of VCF, BED, and annotation files
  • Metadata-intensive: Directory listings, stats, file opens/closes

AI/ML Training

  • Random small reads: Training data shuffling across datasets
  • Checkpoint writes: Multi-GB model states written periodically
  • GPU-speed requirements: Storage must not bottleneck GPU compute
  • Multi-node coordination: Distributed training across 8-32+ GPUs

Why Traditional Storage Falls Short

Common AWS storage solutions and their limitations:

Amazon EFS: Excellent for general-purpose file sharing, but throughput scales with storage size in Bursting mode, aggregate bandwidth tops out around 10 GB/s, and metadata operations carry high latency.

Amazon FSx for Lustre: Better raw performance (roughly 1 GB/s per TiB provisioned), but it requires careful striping and tuning and remains suboptimal for small-file, metadata-heavy workloads.

EBS Volumes: High per-volume performance, but volumes cannot be shared across compute nodes; building a shared layer on top means running and scaling your own NFS server architecture.

For organizations running 50+ concurrent GPU jobs or processing hundreds of terabytes daily, these limitations become critical blockers.

Weka Architecture: Built for Modern Workloads

Weka Data Platform is a software-defined parallel file system designed from the ground up for modern HPC and AI workloads. Key architectural advantages:

1. Distributed, Scale-Out Design

  • No metadata bottleneck: Metadata distributed across all backend nodes
  • Linear scaling: Add nodes to increase both capacity and performance
  • NVMe-optimized: Leverages instance storage SSDs for maximum IOPS

2. Cloud-Native Integration

  • S3 tiering: Automatically migrate cold data to object storage
  • Snapshots to S3: Cost-effective backup and disaster recovery (see the sketch after this list)
  • Multi-AZ deployment: High availability across availability zones
  • Elastic scaling: Add/remove backend nodes without downtime
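
To make the snapshot-to-S3 path concrete, the sketch below shows the general shape of the workflow with the Weka CLI. Treat the subcommand names, argument order, and the filesystem/snapshot names as assumptions to verify against weka fs snapshot --help for your Weka release.

# Hypothetical snapshot-to-object-store sketch -- verify exact syntax for your Weka version
weka fs snapshot create default nightly-2024-11-01    # point-in-time snapshot of the 'default' filesystem
weka fs snapshot upload default nightly-2024-11-01    # push the snapshot to the attached S3 object store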

3. Protocol Flexibility

  • POSIX: Standard file system semantics; works with existing tools (see the multi-protocol sketch after this list)
  • S3: Native S3 API support for cloud-native applications
  • NFS: Legacy application compatibility
  • GPUDirect Storage: Direct GPU-to-storage transfers that bypass the CPU bounce buffer
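
To make the multi-protocol point concrete, the sketch below shows a POSIX mount on a compute node and the same namespace addressed over the S3 protocol. The backend hostname, filesystem name, bucket name, and S3 endpoint port are placeholders; check your deployment for the actual Weka S3 service endpoint.

# POSIX: mount the 'default' filesystem from a backend host (placeholder hostname)
sudo mount -t wekafs weka-backend-nlb.example.com/default /weka/scratch

# S3: list the same data through Weka's S3 service (endpoint URL and port are assumptions)
aws s3 ls s3://genomics-bucket/ --endpoint-url https://weka-backend-nlb.example.com:9000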

Integration with AWS ParallelCluster

Our reference architecture deploys Weka alongside AWS ParallelCluster for optimal performance:

Architecture Overview

┌─────────────────────────────────────────────────┐
│           AWS ParallelCluster                   │
│                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │  Head    │  │  Login   │  │ Compute  │     │
│  │  Node    │  │  Nodes   │  │  Nodes   │     │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘     │
│       │             │             │            │
│       └─────────────┴─────────────┘            │
│                     │                          │
│                     ▼                          │
│         ┌─────────────────────┐               │
│         │   Weka Clients      │               │
│         │   (POSIX mounts)    │               │
│         └──────────┬──────────┘               │
└────────────────────┼──────────────────────────┘
                     │
          ┌──────────┴──────────┐
          │   Weka Backend      │
          │   i4i.8xlarge x 6   │
          │   (NVMe SSDs)       │
          └──────────┬──────────┘
                     │
                     ▼
          ┌─────────────────────┐
          │   Amazon S3         │
          │   (Tiered Storage)  │
          └─────────────────────┘

Weka Backend Cluster

We typically deploy Weka backends on storage-optimized i4i instances with local NVMe SSDs (a quick health-check sketch follows the list):

  • Instance Type: i4i.4xlarge or i4i.8xlarge (one and two 3,750 GB NVMe SSDs per node, respectively)
  • Cluster Size: 6-12 backend nodes (production deployments)
  • Networking: 25-100 Gbps network bandwidth
  • Capacity: 50-200 TB hot storage, unlimited with S3 tiering
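
Once the backends are up, a quick sanity check from any node with the Weka CLI confirms cluster health, backend membership, and filesystem layout. This is a minimal sketch against a Weka 4.x CLI; output formats vary by release.

# Overall cluster health, capacity, and data-protection status
weka status

# Backend containers and their roles
weka cluster container

# Filesystems and their SSD / object-store capacity split
weka fs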

Client Integration via ParallelCluster

Compute nodes mount Weka using the POSIX client. Our automated deployment runs the following OnNodeConfigured script (a quick verification sketch follows it):

#!/bin/bash
# install-weka-client.sh - ParallelCluster OnNodeConfigured script
set -euo pipefail

# Install the Weka client package (fail fast if the download breaks)
curl -fsSL -o /tmp/weka-client.rpm https://my-bucket.s3.amazonaws.com/weka-client-4.2.rpm
yum install -y /tmp/weka-client.rpm

# Join the Weka cluster via the backend load balancer
weka cluster container join weka-backend-nlb.example.com

# Mount the 'default' filesystem from the backend cluster
mkdir -p /weka/scratch
mount -t wekafs weka-backend-nlb.example.com/default /weka/scratch

# Persist the mount across reboots
echo "weka-backend-nlb.example.com/default /weka/scratch wekafs defaults,_netdev 0 0" >> /etc/fstab

# Tune writeback behavior for large sequential writes
echo "vm.dirty_ratio = 10" >> /etc/sysctl.conf
echo "vm.dirty_background_ratio = 5" >> /etc/sysctl.conf
sysctl -p
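
After the script runs, a quick check confirms the mount is live. The commands below are a sketch; weka local status assumes the client package installed above.

# Confirm the wekafs mount and its reported capacity
mount | grep wekafs
df -h /weka/scratch

# Client-side agent/container status (requires the Weka client package)
weka local status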

ParallelCluster YAML Configuration

HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://my-cluster-config/scripts/install-weka-client.sh

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      CustomActions:
        OnNodeConfigured:
          Script: s3://my-cluster-config/scripts/install-weka-client.sh
      ComputeResources:
        - Name: c6i-nodes
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 100
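
The excerpt above omits required settings such as Region, Image, HeadNode networking, and SSH configuration. With those filled in, the cluster is created with the standard ParallelCluster v3 CLI; the cluster name, file name, and region below are placeholders:

# Validate and create the cluster from the YAML configuration above
pcluster create-cluster \
  --cluster-name genomics-hpc \
  --cluster-configuration cluster-config.yaml \
  --region us-west-2

# Track creation progress
pcluster describe-cluster --cluster-name genomics-hpc --region us-west-2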

Performance Benchmarks

Real-world performance results from production genomics workloads:

Sequential Read/Write Performance

  • Single-stream read: 3.2 GB/s (limited by client network)
  • Aggregate read (100 clients): 45 GB/s
  • Single-stream write: 2.8 GB/s
  • Aggregate write (50 clients): 32 GB/s

Metadata Performance

  • File creates: 1.2M ops/sec (aggregate)
  • Stats (metadata reads): 2.5M ops/sec
  • Directory listing (10K files): < 50ms
  • File open latency: 0.3ms (average)
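
For reference, sequential figures like those above can be approximated with a standard fio run from one or more clients against the Weka mount. This is a generic sketch rather than our exact benchmark harness; the directory, job size, and parallelism are illustrative.

# Large sequential reads: 8 parallel streams of 1 MiB direct I/O against the mount
fio --name=seqread --directory=/weka/scratch/bench \
    --rw=read --bs=1M --size=10G --numjobs=8 \
    --ioengine=libaio --iodepth=16 --direct=1 --group_reporting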

Genomics Pipeline: Real-World Example

Processing 500 whole-genome sequences (30x coverage, ~100 GB each):

  • Without Weka (EFS): 18 hours total runtime, frequent I/O wait states
  • With Weka: 6.5 hours total runtime, CPU/GPU bound throughout
  • Performance gain: 2.8x faster, 100% compute utilization

Cost Optimization with S3 Tiering

Weka's intelligent tiering automatically moves cold data to S3, dramatically reducing costs:

Tiering Configuration

# Configure S3 tiering with lifecycle policies
weka fs tier s3 add default-tier \
  --bucket my-genomics-data \
  --hostname s3.us-west-2.amazonaws.com \
  --region us-west-2

# Set tiering policy: move data not accessed in 7 days
weka fs tier policy set --release-after-seconds 604800

# View tiering status
weka fs tier status

Cost Comparison

For a 2 PB genomics dataset with 10% active working set:

Scenario A: All data on Weka NVMe

  • 2 PB on i4i instances: ~$140,000/month

Scenario B: Hot/cold tiering with Weka + S3

  • 200 TB hot on Weka: ~$14,000/month
  • 1.8 PB cold on S3 Standard: ~$41,400/month
  • Total: ~$55,400/month (60% savings)
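
The cold-tier line item is straight list-price arithmetic, assuming roughly $0.023 per GB-month for S3 Standard (tiered pricing at this volume actually comes in slightly lower):

# Back-of-envelope check for the cold tier: 1.8 PB ~ 1,800,000 GB
awk 'BEGIN { printf "%d USD/month\n", 1800000 * 0.023 }'    # prints 41400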

Integration with Slurm Workload Manager

Weka provides transparent POSIX access, seamlessly working with Slurm job scheduling:

#!/bin/bash
#SBATCH --job-name=genomics-pipeline
#SBATCH --partition=compute
#SBATCH --nodes=50
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00

# All compute nodes have /weka/scratch mounted
INPUT_DIR=/weka/scratch/raw-data
OUTPUT_DIR=/weka/scratch/results

# Parallel processing of FASTQ files (process_genome.sh is the site pipeline wrapper)
srun --ntasks=50 --cpus-per-task=16 \
  process_genome.sh \
    --input "${INPUT_DIR}" \
    --output "${OUTPUT_DIR}" \
    --threads 16

# Weka sustains the ~45 GB/s aggregate read throughput measured above across these 50 nodes
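
Submission and monitoring use standard Slurm tooling; nothing Weka-specific is needed on the scheduler side (the script filename below is illustrative):

sbatch genomics-pipeline.sbatch                          # submit the job script above
squeue -u "$USER"                                        # watch queue and node allocation
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State     # post-run accounting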

Operational Excellence

Monitoring and Alerting

Weka provides comprehensive observability:

  • Prometheus metrics export: Integrate with existing monitoring stack
  • Grafana dashboards: Pre-built dashboards covering cluster operations and I/O performance (throughput, IOPS, latency)
  • CloudWatch integration: AWS-native monitoring and alerting
  • Per-client statistics: Identify noisy neighbor issues

High Availability

  • Multi-AZ deployment: Backend nodes spread across availability zones
  • Automatic failover: N+2 redundancy, transparent to clients
  • Rolling upgrades: Update Weka software with zero downtime
  • Snapshot and restore: Point-in-time recovery from S3

Production Case Study: Genomics Research Platform

For a major biotech client (Genentech), we deployed Weka as the primary storage for genomics workloads:

Deployment Specifications:

  • Weka Cluster: 8x i4i.8xlarge instances (120 TB SSD capacity)
  • S3 Tiering: 2 PB archived data, 120 TB hot working set
  • Client Nodes: 200+ ParallelCluster compute nodes
  • Performance: 45 GB/s read, 32 GB/s write (aggregate)
  • Latency: < 1ms average for metadata operations
  • Uptime: 99.95% availability over 18 months
  • Cost Savings: 65% vs. all-flash architecture

When to Choose Weka

Weka is ideal for:

  • High-throughput workloads: Genomics pipelines, video rendering, CFD simulations
  • GPU-bound AI/ML: Prevent storage bottlenecks in multi-GPU training
  • Metadata-intensive applications: Millions of small files, frequent directory operations
  • Mixed workloads: Simultaneous sequential and random I/O patterns
  • Multi-tenant environments: Isolate and track per-user/project storage performance

Not ideal for: Simple file sharing, infrequent access patterns, cost-constrained projects with low performance requirements.

Conclusion

Storage performance is critical for modern AI and HPC platforms. Weka Data Platform eliminates traditional bottlenecks, delivering the multi-GB/s throughput, sub-millisecond latency, and seamless scalability required for genomics research, AI model training, and computational science.

When integrated with AWS ParallelCluster and Slurm, Weka enables organizations to maximize compute utilization, reduce time-to-results, and optimize costs through intelligent S3 tiering.

At DCLOUD9, we specialize in designing and deploying high-performance storage architectures for demanding HPC and AI workloads. Our team has deep expertise in Weka, AWS ParallelCluster, and storage optimization for scientific computing.

Ready to Eliminate Storage Bottlenecks?

Let DCLOUD9 design your high-performance storage architecture
