This post was contributed by Cristian Măgherușan-Stanciu, Sr. Specialist Solution Architect, EC2 Spot, with contributions from Cristian Kniep, Sr. Developer Advocate for HPC and AWS Batch at AWS, Carlos Manzanedo Rueda, Principal Solutions Architect, EC2 Spot at AWS, Ludvig Nordstrom, Principal Solutions Architect at AWS, Vytautas Gapsys, Project Group Leader at the Max Planck Institute for Biophysical Chemistry, and Carsten Kutzner, Research Associate at the Max Planck Institute for Biophysical Chemistry.
This post is part of a blog series. It describes how we worked with a team of researchers at the Max Planck Institute for Biophysical Chemistry and helped them use the cloud for drug discovery applications in the pharmaceutical industry.
In this post, we focus on how Max Planck’s team obtained thousands of EC2 Spot Instances spread across multiple AWS Regions to run their compute-intensive simulations in a cost-effective manner, and how their solution can be taken further with the new Spot Placement Score API.
Computational drug design in the cloud
The drug research and development process usually starts with a very large number of potentially promising compounds. From this seemingly infinite chemical space, researchers aim to identify potent molecules that could be life-saving. These compounds are then gradually filtered through a multi-step selection process until a small subset of them is synthesized and thoroughly tested before further use is allowed.
After identifying a potential drug candidate (the “lead compound”), the goal is to further optimize that lead compound into a truly active molecule. Computational methods based on molecular dynamics simulations help by efficiently reducing the search space to just a few hundred candidates. In later steps, these can then be processed and tested in ever more complex (and expensive) ways.
Computer-aided drug design (CADD) is increasingly used in the early phase of drug discovery, and thanks to advances in technology, highly accurate and computationally intensive methods can be used to select the best possible candidates. This includes a class of methods using molecular dynamics, where we simulate the protein-ligand interaction at the atomic level.
These early drug discovery simulations are typically performed on premises, using large supercomputers shared by multiple research and development facilities. Such infrastructure takes years to build and, once built, is expensive to maintain, has limited capacity, and is shared with many other users, so it can take a long time to see results.
AWS can offer very large amounts of capacity that is provisioned and charged only for the duration of a simulation. In addition to lower costs and less time to provision capacity, it also offers more flexibility through multiple instance types, instance families, and purchasing options. This flexibility means researchers can experiment with many of the available options and find, for each simulation, the best trade-off between time to result and cost.
Running GROMACS at scale on EC2 Spot Instances
EC2 Spot Instances allow AWS customers to request unused EC2 capacity at deep discounts of up to 90% compared to On-Demand pricing. They are ideal for many stateless, fault-tolerant, or flexible workloads, and are particularly useful for loosely coupled compute-intensive applications running across hundreds or thousands of instances. In these cases, Spot savings can add up to significant amounts of money and can decide whether a given workload is feasible at all.
Spot uses capacity pools, which are sets of unused EC2 instances with the same instance type and operating system in the same Availability Zone. If EC2 needs that capacity back for another customer, the instances are reclaimed with a two-minute warning.
To be successful with Spot, it helps to be flexible, especially when it comes to your preferred instance types. Diversification across multiple Spot capacity pools means that in the event of Spot interruptions in a given pool, EC2 can provision new instances from other capacity pools. Your workload can then resume and continue on the new instances, often with no visible impact.
For most workloads, Spot diversification is achieved by using multiple instance types and tapping into all Availability Zones within a Region. The more Availability Zones and instance types you use, the better the chance of getting the Spot capacity you want and the lower the frequency of Spot interruptions.
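As a rough illustration of why this helps, the sketch below counts capacity pools as pairs of instance type and Availability Zone (for a single operating system). The instance type and Availability Zone names are placeholders chosen for the example, not a recommendation.

```python
from itertools import product


def capacity_pools(instance_types, availability_zones):
    """Each (instance type, Availability Zone) pair is one Spot capacity pool."""
    return list(product(instance_types, availability_zones))


# One instance type in two zones: very little room to move after an interruption.
narrow = capacity_pools(["g4dn.xlarge"], ["us-east-1a", "us-east-1b"])

# Five instance types in three zones: many more pools EC2 can draw from.
broad = capacity_pools(
    ["g4dn.xlarge", "g4dn.2xlarge", "g4dn.4xlarge", "c5.2xlarge", "c5.4xlarge"],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)

print(len(narrow), "pools versus", len(broad), "pools")
```

The number of pools grows multiplicatively, so adding even a few instance types or zones widens the set of places where replacement capacity can come from.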
The Max Planck research team was interested in using EC2 Spot to provision thousands of instances to run their computationally intensive simulations. Their GROMACS workload has a few characteristics that make it a great fit for Spot:
- It’s loosely coupled and flexible in terms of instance type – it runs well on both CPUs and GPUs.
- It is regionally flexible – there is relatively little input data and output data to move from one location to the next.
- The acceptable time to get the final results is flexible – depending on the simulation it can be measured in hours, days or even more than a week. Time-flexible workloads like these often involve tradeoffs between cost and time to achieve results.
- It can implement checkpointing – a job can be quickly resumed after a Spot interruption. This matters for compute-intensive workloads like molecular dynamics, where a single task can take hours or days to compute.
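The checkpointing pattern in the last point can be sketched in a few lines of Python. This is a generic illustration, not the mechanism GROMACS or the team's setup actually uses: the JSON file format, the function names, and the `stop_after` parameter (which simulates an interruption) are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a half-written file never replaces a good one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_with_checkpoints(path, total_steps, checkpoint_every, stop_after=None):
    """Resume from the last checkpoint and work until done (or 'interrupted')."""
    state = load_checkpoint(path)
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return state  # simulated interruption: unsaved progress is lost
        state["step"] += 1
        state["total"] += state["step"]  # stand-in for one unit of real work
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If a run of 100 steps with a checkpoint every 10 steps is interrupted after step 37, the next run picks up from step 30 and only repeats 7 steps. The checkpoint interval is the knob that trades checkpoint overhead against the amount of work lost per interruption.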
In our previous blogs in this series, the Max Planck research team showed benchmark results for several instance types and found that the most cost-effective instance types for them are g4dn.xlarge, g4dn.2xlarge, and g4dn.4xlarge. They knew they could also use other instance types, although with varying cost efficiency. They summarized their results, which we have shown in Table 1.
Given the regional flexibility of the workload and its large capacity requirements, we helped the team run it in parallel across multiple AWS Regions using a tool called HyperBatch. This is a solution designed by an AWS solution architect to run AWS Batch across multiple AWS Regions to secure the required capacity by leveraging a large number of capacity pools.
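One way to decide which Regions to target is the Spot Placement Score API mentioned at the start of this post. The sketch below is not how HyperBatch works internally. It only shows the general idea: ask EC2 for placement scores, then keep the Regions that score well. It assumes an EC2 client created with boto3 (for example `boto3.client("ec2")`) is passed in. The helper names and the `min_score` threshold are hypothetical, and pagination of the response is ignored for brevity.

```python
def fetch_placement_scores(ec2_client, instance_types, target_capacity, regions):
    """Ask EC2 how likely a Spot request of this size is to succeed in each Region."""
    response = ec2_client.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        TargetCapacityUnitType="units",  # count capacity in whole instances
        RegionNames=regions,
    )
    return response["SpotPlacementScores"]


def rank_regions(scores, min_score=1):
    """Return Region names with a score of at least min_score, best first."""
    eligible = [s for s in scores if s["Score"] >= min_score]
    ranked = sorted(eligible, key=lambda s: s["Score"], reverse=True)
    return [s["Region"] for s in ranked]


# Example usage (requires boto3 and AWS credentials):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# scores = fetch_placement_scores(
#     ec2, ["g4dn.xlarge", "g4dn.2xlarge"], 500, ["us-east-1", "eu-west-1"]
# )
# print(rank_regions(scores, min_score=5))
```

Scores range from 1 to 10, with higher values indicating a better chance of the request being fulfilled. A score is a point-in-time estimate rather than a guarantee, so it makes sense to query it shortly before launching.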
Depending on the trade-off between cost and time-to-result that the Max Planck research team wants to achieve for a particular simulation, they have a few options for obtaining the Spot capacity their workload needs:
- For the lowest possible cost, they could run only on the preferred G4dn GPU instance types – this doesn’t offer much diversification. Because G4dn instances are popular and used for many workloads, including HPC, deep learning, and graphics rendering, spare capacity in their Spot capacity pools is often limited. This can increase the interruption rate, which can extend the simulation time to several days, which is not always practical.
- For a shorter time to results, the team could use a highly diversified mix of instance types, including a variety of compute-optimized EC2 instances. Greater diversification means more computing capacity overall is available to meet their needs, so the simulations can complete sooner.
For this simulation run, the team optimized for a shorter time to results and used a mix of C5 and G4dn instance types of different sizes, following Spot diversification best practices.
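The trade-off between these two options can be sketched as a simple selection rule: compute the cost per simulated nanosecond for each instance type, then accept every type within some tolerance of the cheapest one. The instance names, prices, and throughput figures below are made-up placeholders, not benchmark results from this series, and the `tolerance` parameter is an assumption introduced for the example.

```python
def cost_per_ns(price_per_hour, ns_per_day):
    """Dollars spent per simulated nanosecond on one instance."""
    return price_per_hour * 24 / ns_per_day


def select_instance_types(benchmarks, tolerance):
    """Keep every type whose cost per ns is within `tolerance` of the cheapest.

    tolerance=0.0 keeps only the cheapest type; tolerance=1.0 accepts
    types that cost up to twice as much per nanosecond.
    """
    costs = {
        name: cost_per_ns(price, throughput)
        for name, (price, throughput) in benchmarks.items()
    }
    cheapest = min(costs.values())
    return sorted(name for name, c in costs.items() if c <= cheapest * (1 + tolerance))


# PLACEHOLDER values: (Spot price in $/hour, throughput in ns/day)
benchmarks = {
    "gpu-small": (0.25, 60.0),  # 0.10 $/ns
    "gpu-large": (0.50, 80.0),  # 0.15 $/ns
    "cpu-large": (0.30, 30.0),  # 0.24 $/ns
}

lowest_cost_mix = select_instance_types(benchmarks, tolerance=0.0)
diversified_mix = select_instance_types(benchmarks, tolerance=2.0)
```

A tolerance of zero corresponds to the lowest-cost option with little diversification, while a larger tolerance admits more instance types and therefore more capacity pools, at a higher cost per nanosecond.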
To see the results of the run, please read the full blog here.
Reminder: You can learn a lot from AWS HPC engineers by subscribing to the HPC Tech Shorts YouTube channel and following the AWS HPC Blog channel.