HPC on Azure: Benefits and Best Practices

High performance computing (HPC) enables you to leverage supercomputing resources for computationally intensive workloads. In the past, supercomputers were accessible mainly to organizations with large budgets. Today, affordable HPC services are offered by the top cloud vendors, including Azure. This article explains basic HPC concepts and examines best practices for Azure HPC deployments.

What Is HPC?

High performance computing (HPC) systems are networks of aggregated computing resources that you can use to perform complex processes. These systems enable you to distribute compute processes across many central processing units (CPUs) or graphics processing units (GPUs). This distribution enables significantly faster computing and supports workloads, such as machine learning and predictive analytics, that were previously impractical.
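The core idea of fanning work out across many processors can be illustrated with a minimal sketch using Python's standard multiprocessing module. The workload here is a hypothetical stand-in, not an Azure-specific job:

```python
from multiprocessing import Pool

def simulate(task_id: int) -> int:
    """Stand-in for one compute-heavy step (e.g., one cell of a simulation)."""
    return sum(i * i for i in range(task_id * 1000))

if __name__ == "__main__":
    # Four workers stand in for four cores or nodes; eight tasks are
    # distributed across them, and results return to the "head" process.
    with Pool(processes=4) as pool:
        results = pool.map(simulate, range(8))
    print(len(results))  # all 8 partial results gathered
```

On a real HPC cluster, the same scatter/gather pattern spans many machines rather than processes on one host, coordinated by a scheduler instead of `Pool`.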

HPC systems, also sometimes called supercomputers, were initially developed to run custom applications and processes. This structure made HPC expensive and inaccessible to many organizations. However, you can design modern HPC deployments from a wider variety of resources, and overall costs have decreased.

In particular, HPC systems have been adapted to operate on cloud resources. These advancements have made HPC computing power significantly more accessible.

The Move to Cloud: HPC for the Masses

According to studies by Hyperion Research, the share of HPC sites running workloads in the cloud reached 74%, up from just 13% in 2011. This increase is largely due to the advancement of HPC technologies in the cloud, which has made resources more accessible to organizations and increased cloud HPC's capacity.

In these same studies, Hyperion projected that by 2023 the HPC cloud market could reach $6 billion, or around 14% of the total HPC industry. These predictions are tied to evidence that the percentage of workloads each organization runs in the cloud is growing, albeit slowly.

A specific factor leading the growth of HPC in the cloud is the containerization of on-premises HPC. Currently, containerization is being used to better support increasingly diverse workloads. With containers, organizations are better able to segment their resources, spinning up appropriate environments dynamically. This adoption makes on-premises implementations compatible with cloud-based HPC, increasing the ease of hybridization with or migration to the cloud.

Another factor is the increasing popularity of using cloud resources for HPC bursting. Organizations can use cloud resources to supplement on-premises ones for time-sensitive or highly complex operations with minimal costs.
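The bursting decision can be sketched simply: keep work on-premises until the local queue exceeds capacity, then overflow the remainder to temporary cloud nodes. The function and thresholds below are purely illustrative:

```python
def plan_burst(queued_jobs: int, onprem_slots: int) -> dict:
    """Split a job queue between on-premises slots and burst-to-cloud capacity."""
    local = min(queued_jobs, onprem_slots)
    cloud = queued_jobs - local  # overflow runs on temporary cloud nodes
    return {"on_prem": local, "cloud_burst": cloud}

print(plan_burst(120, 100))  # → {'on_prem': 100, 'cloud_burst': 20}
```

In practice, the scheduler would also weigh data-transfer time and per-hour cloud pricing before bursting, but the split-on-overflow principle is the same.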

HPC Deployment on Azure

Microsoft Azure is a cloud platform that offers HPC deployments on a pay-per-use basis. HPC in Azure can be used to extend on-premises infrastructure or as a standalone deployment. With these resources, organizations of all sizes can run high-performance workloads without purchasing or maintaining the expensive hardware typically required for HPC.

Typical components of HPC in Azure include:

  • HPC head node—a virtual machine (VM) working as a central managing server for HPC clusters. This node enables you to schedule workloads and jobs to worker nodes.
  • Virtual Machine Scale Sets—a service you can use to create and scale sets of VMs. It enables you to run a variety of large-scale workloads, including Mesos, Cloudera, MongoDB, Hadoop, and Cassandra. Scale Sets also includes features for use in multiple availability zones, autoscaling, and load balancing.
  • Virtual Network—a network service that you use to connect your storage, compute, and head node. Connections are made via IP and you have granular control over subnet traffic. With Virtual Network, you can define custom DNS servers and IPs. You can also create secure connections based on ExpressRoute or IPsec VPN.
  • Storage—a variety of options are available, including blob, disk, file, Data Lake, or hybrid storage. You can also integrate Avere vFXT, which enables you to use intelligent caching to run workloads with low latency.
  • Azure Resource Manager—a utility that enables you to use script files or templates to manage your HPC resource deployment.
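To make the head-node/worker relationship above concrete, here is a minimal round-robin scheduling sketch in pure Python. The job and node names are hypothetical; in a real deployment, this role is filled by services such as Azure Batch or HPC Pack:

```python
from collections import deque

def schedule(jobs, workers):
    """Round-robin jobs from a head-node queue onto worker nodes."""
    queue = deque(jobs)
    assignments = {w: [] for w in workers}
    while queue:
        for w in workers:
            if not queue:
                break
            assignments[w].append(queue.popleft())  # head node hands out jobs
    return assignments

print(schedule(["job1", "job2", "job3"], ["worker-a", "worker-b"]))
# → {'worker-a': ['job1', 'job3'], 'worker-b': ['job2']}
```

Production schedulers add priorities, resource matching, and retries on node failure, but the dispatch loop is the essential shape.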

Managing HPC Deployments in Azure

In addition to the default components, Azure offers several services that you can use to manage your deployments more easily. The following services are native to Azure and integrate smoothly with your existing resources.

Azure Batch

Azure Batch is a managed service you can use to provision, assign, monitor, and run workloads. Batch enables you to configure a VM pool, define job scheduling policies, and autoscale your deployment.
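Batch can autoscale a pool based on metrics such as the pending-task backlog. The following Python function sketches that kind of policy; the tasks-per-node ratio and node cap are illustrative assumptions, not Batch defaults:

```python
def target_nodes(pending_tasks: int, tasks_per_node: int = 4,
                 max_nodes: int = 100) -> int:
    """Compute a target pool size from the pending-task backlog, capped."""
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return min(needed, max_nodes)

print(target_nodes(10))  # 10 tasks at 4 per node → 3 nodes
```

In Azure Batch itself, you would express comparable logic as an autoscale formula attached to the pool rather than in application code.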

Microsoft HPC Pack

HPC Pack is a service that creates an HPC cluster. You can use it to manage clusters, monitor workloads, and schedule jobs. This service requires you to provision and manage both your network infrastructure and your cluster.

In exchange, it enables you to move existing HPC workloads from on-premises resources or private clouds. You can also use HPC Pack to create hybrid deployments that combine your existing resources with Azure ones.

Azure CycleCloud

Azure CycleCloud is another cluster management service that enables you to use your choice of third-party scheduler. With these schedulers, you can orchestrate and manage workloads, customize clusters with built-in governance features, and define access controls in Active Directory. You can use CycleCloud with HPC Pack, Symphony, Grid Engine, or Slurm.

Best Practices for Using HPC on Azure

When deploying HPC in Azure, several best practices can help you get the best performance for your investment.

Spread out your deployments 

When setting up your deployment, try to limit the size of each individual deployment. Although there is no fixed limit, Azure recommends no more than 500 VMs or 1,000 cores per deployment. If you exceed this number, you risk losing live instances, hitting deployment timeouts, or running into IP address swapping issues.

Additionally, smaller deployments make it easier to:

  • Flexibly start or stop nodes
  • Locate available nodes in your cluster
  • Ensure disaster recovery through the use of multiple data centers
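A quick validation helper for the limits above (500 VMs, 1,000 cores per deployment) might look like this; the function name and interface are illustrative:

```python
def within_deployment_limits(vm_count: int, cores_per_vm: int,
                             max_vms: int = 500, max_cores: int = 1000) -> bool:
    """Check a planned deployment against the recommended per-deployment limits."""
    return vm_count <= max_vms and vm_count * cores_per_vm <= max_cores

print(within_deployment_limits(250, 4))  # 250 VMs, 1,000 cores → True
print(within_deployment_limits(600, 1))  # too many VMs → False
```

A workload that fails this check is a candidate for splitting into multiple smaller deployments, which also yields the operational benefits listed above.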

Manage your number of proxy nodes

A proxy node is a worker node instance used to enable communication between Azure nodes and on-premises head nodes. These proxies are typically automatically added by your HPC clusters.

When deploying your nodes, you should ensure that the number of proxies scales accordingly. The specific number you need depends on what sort of jobs you have running on each node and how many nodes you have deployed.
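Since there is no fixed proxy-to-node ratio, one hedged way to express the scaling rule is a helper that adds one proxy per N worker nodes while keeping a minimum for redundancy. The ratio and minimum here are purely hypothetical and should be tuned to your workload:

```python
import math

def proxy_count(worker_nodes: int, nodes_per_proxy: int = 100,
                minimum: int = 2) -> int:
    """Scale proxy nodes with cluster size; keep a minimum for redundancy."""
    return max(minimum, math.ceil(worker_nodes / nodes_per_proxy))

print(proxy_count(450))  # 450 workers at 1 proxy per 100 → 5 proxies
```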

Use HPC Client Utilities for remote connections

If you have users connecting to your head node from remote desktops, your performance may suffer. This is particularly true if your head node is already under a heavy load.

To avoid this, you should have users connect through HPC Pack Client Utilities. These tools can be installed on each of your workstations to provide remote access instead of using Remote Desktop Services (RDS).

Weigh the benefits of monitoring vs performance

Using Azure Monitor can help you track your deployments’ performance, but can also put a strain on your SQL Server and HPC Management Service. This is especially true for larger deployments. Rather than continuously monitoring, you may want to periodically audit your performance and otherwise disable the collection of performance counters.

Conclusion

With HPC now available in the cloud, many teams can leverage high processing capabilities for their own development and research. In Azure, an HPC deployment typically includes an HPC head node, Virtual Machine Scale Sets, a virtual network, storage, and templates offered via Azure Resource Manager.

For Azure HPC management, you can use Azure Batch, Microsoft HPC Pack, and Azure CycleCloud. To avoid deployment timeouts, you should spread out your deployments to no more than 500 VMs or 1,000 cores per deployment. You should also manage the number of proxy nodes to keep communication between on-premises and Azure nodes efficient.

——————–

Author Bio: Ilai Bavati

I’m a technology writer and editor based in Tel Aviv. I cover topics ranging from machine learning and cybersecurity to cloud computing and the Internet of Things. I’m interested in the real-world application of emerging technologies, and I see our increasingly connected reality as both disruptive and potentially life-saving.

LinkedIn: https://www.linkedin.com/in/ilai-bavati-0b1a1418a/
