How to consistently run temporary workloads on AWS and save money

Networking, Cloud & Automation
7 min read · Jul 12, 2021

Do you want to create, and later destroy, a cloud environment ready to run your application with a single click?

In this series of posts, we explore how to accomplish this for a particular use case, which hopefully motivates you to create your own version to suit your needs.

The overall goal is to get several cloud resources automatically provisioned and configured to deliver a reproducible Linux environment that supports a specific use case: creating virtual network topologies for lab testing. This is a four-part blog post series that covers:

  1. Use-Case Description
  2. Automating Infrastructure Provisioning
  3. Delivering an Application Deployment Environment
  4. Releasing Cloud Resources
(Cover image: original artwork by Anshu A)

Introduction

Running short-lived workloads on dedicated IT resources might not be the most cost-effective option you have. Some of these workloads run for only a couple of weeks at a time: think of a university that needs to stand up a registration application every semester, or a retailer that hosts raffles during the month between Thanksgiving and Christmas.

Other workloads only need to run a few hours a day, like some of our day-to-day job activities. For example, Software Engineers need resources to run test cases on, and Sales Engineers might need them to work on product demos. More often than not, we can’t afford to dedicate resources just to these tasks and let them sit idle for the rest of the day when we are not using them.

Given this context, you can potentially benefit from the flexibility the Cloud offers to scale resources up, and down to zero, as needed. I say potentially because, to truly see cost savings, you want to make sure you only pay for Cloud resources while you are using them, which most times means destroying them when you are not, especially if your applications are resource-intensive. Keep in mind that users are responsible for cloud cost management, and things can rapidly get out of control; there is even a market for cloud cost optimization. ⚠️

For example, an instance with 4 vCPUs and 16 GB of RAM in AWS (m4.xlarge) costs approximately $0.20 per hour, which translates to ~$1,750 per year if it runs around the clock. If you only run this instance 4 hours a day, the annual cost drops to ~$292. That’s over 80% in savings! 💲
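Here is the back-of-the-envelope math, assuming the $0.20/hour on-demand rate (actual prices vary by region and over time):

Always on:   $0.20/hour × 24 hours × 365 days ≈ $1,752/year
4 hours/day: $0.20/hour × 4 hours × 365 days  ≈ $292/year
Savings:     1 − (292 / 1,752) ≈ 83%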

If we destroy the resources, how can we quickly and repeatably spin up a platform that is ready to run our use case?

Automation

If the resources aren’t always on, then the role of automation is to make them readily available when needed and to destroy them when they are not.

Infrastructure as a Service (IaaS) from a Cloud Provider can get you a virtual machine (VM) running an operating system in minutes. In this scenario, it is up to you to install any additional applications, runtimes, databases, or libraries your use case requires. Fortunately, you can automate this process as well, and that is part of what we aim to show through this series of posts.

Perhaps spinning up a VM, creating any additional resources by clicking through a Cloud Provider console (GUI), and then logging into the VM to download packages, configure your applications, and run sanity checks wouldn’t take you more than an hour. However, this repetitive process is time-consuming and error-prone for humans, so it is a natural fit for automation.
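To give you a taste of what this looks like, here is a minimal, hypothetical Ansible task that installs a couple of packages on the VM. This is only a sketch, not taken from the Playbooks used in this series, and the package names are illustrative:

- name: Install base packages
  ansible.builtin.package:
    name:
      - git
      - docker    # package name varies by distribution (e.g., moby-engine on Fedora)
    state: present
  become: true    # package installation requires elevated privileges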

https://xkcd.com/1205/

Does automating something for the first time take longer than doing that thing once by hand? Most likely. However, over time, the low incremental cost of doing the work automatically versus manually is what drives the savings.

In this case, the goal is to create a setup that is ready to run multi-vendor network topologies for testing. This is a good example of a setup that normally requires more horsepower than a user's laptop can provide, with very specific software dependencies, which forces you to set aside a server with large specs just for this purpose. That server not only sits idle most of the time, but it also becomes a blocker if someone makes a breaking change.

Network testing challenges

Network engineers often face a dilemma: they want to test network features in a safe environment, but they don’t have access to the infrastructure resources needed to create a faithful representation of their network (or a section of it), which leaves them with few options short of trying things in production.

Fact: One configuration change — even one line — can bring down a network or part of the Internet.

Some companies build expensive networking labs with dedicated hardware that can normally run a single use-case at a time. However, the hardware life-cycle usually renders these labs obsolete in just a couple of years. If you are testing mostly control plane features, this is definitely overkill.

Having the different networking vendors provide virtual images of their Network Operating System (NOS) was a step in the right direction, allowing customers to run more robust and scalable testing environments. However, these virtual images are resource-hungry, so the network topologies you can create are constrained by the resources you can allocate for this purpose.

Containers are the natural evolution. However, a good portion of commercial networking software has strong dependencies on custom Linux kernels, or simply runs on top of non-Linux operating systems (QNX, BSD, etc.). Fortunately, trends like Network Disaggregation and Cloud-Native Networking are making vendors step up and start releasing containerized versions of their NOS.

Containerlab is a new project that takes advantage of these container images, by orchestrating and managing container-based networking labs, helping bring the networking industry into the twenty-first century.

However, to run Containerlab you need a Linux environment, which is not always easy to get access to in a typical enterprise setting. Hence, we’ll show you how to build one in the cloud in minutes, with a process that is repeatable and reliable.

Target setup

AWS is the Cloud Provider of choice for this example. If you don’t have an account already, you can get one for free to run this example. In case you have an expired Free Tier account, you can use “email+nonce” addressing to sign up for the Free Tier again with the same mailbox; with providers that support it, jane+aws2@example.com still delivers to jane@example.com (credits for this tip go to Michael Kashin).

We use Ansible to automate all the steps in this scenario, both to minimize the prerequisites needed to run it and to describe the steps in a way anyone can understand. We could have used Terraform to provision the cloud resources and then Ansible to configure them, but we kept it simple and use only one tool for now.
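To set expectations, provisioning the VM conceptually boils down to an Ansible task like the one sketched below. It uses the amazon.aws collection, and every value shown (tag name, instance type, AMI ID, key pair) is a placeholder for illustration, not what the actual Playbooks use:

- name: Provision the test VM
  amazon.aws.ec2_instance:
    name: containerlab-testbed        # placeholder Name tag
    instance_type: m4.xlarge          # CPU/Memory defined by the user
    image_id: ami-0123456789abcdef0   # placeholder AMI ID
    key_name: my-ssh-key              # placeholder SSH key pair
    region: us-east-1
    state: running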

“If we can get it simpler, and then complicate it where it needs complications.” (Paul McCartney)

The diagram below is a representation of the end goal: it shows the AWS resources that need to be present or otherwise created, and it also hints at the software that is installed.

In the following posts, we dive deeper into the list of Ansible tasks to:

  1. Provision a VM in AWS, with CPU/Memory defined by the user, along with all the resources required to support it.
  2. Install and configure a Linux environment with Containerlab, Docker, test topologies, and a couple of programming languages to interact with the network devices.
  3. Tear down all the resources previously provisioned (see the sketch right after this list).
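As an example of that last step, releasing the VM is conceptually the same provisioning task with a different target state. Again, a hypothetical sketch rather than the repo's actual code:

- name: Terminate the test VM
  amazon.aws.ec2_instance:
    name: containerlab-testbed   # same placeholder Name tag used at creation
    region: us-east-1
    state: absent                # terminates the matching instance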

The Ansible Playbooks are hosted on GitHub, to enable peer code review and version control. They come with sane defaults, so you should only need an AWS account (credentials) to run them.

Launching the setup

Make sure you have Ansible installed on your computer and follow these steps to provision the testbed:

1. Clone the repository:

git clone https://github.com/nleiva/aws-testbed.git

2. Make your AWS account credentials (AWS_ACCESS_KEY and AWS_SECRET_KEY) available as environment variables (export).

export AWS_ACCESS_KEY='…'
export AWS_SECRET_KEY='…'

3. Run the Playbook with the ansible-playbook command, and wait for a couple of minutes while a VM is provisioned in AWS and all the required software is installed.

ansible-playbook create-EC2-testbed.yml -v

Login instructions for the VM are provided in the logs. You can now start running network topologies for lab testing. The VM comes with several sample network topologies you can spin up; for example, a back-to-back two-router topology with routing protocols pre-configured between the routers.
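To illustrate what a topology definition looks like, below is a minimal, hypothetical Containerlab file describing two nodes wired back-to-back. The lab name, node names, kind, and image are examples only (this is not one of the bundled samples), and the routing-protocol configuration would be supplied separately. You would bring it up with containerlab deploy -t <file>:

name: b2b
topology:
  nodes:
    srl1:
      kind: srl                     # node type known to Containerlab
      image: ghcr.io/nokia/srlinux  # containerized NOS image
    srl2:
      kind: srl
      image: ghcr.io/nokia/srlinux
  links:
    - endpoints: ["srl1:e1-1", "srl2:e1-1"]  # back-to-back link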

Conclusions

Automation plays a key role in running temporary workloads in the Cloud. Not only does it save time and money, it also improves consistency by enforcing standardized processes.

It can’t be stressed enough: you don’t need to be an expert to start writing automation Playbooks that make your job and life easier. Take this example and make it your own. That’s the beauty of open source; you don’t need to reinvent the wheel every time you need to do something.

In the next post, we go through the details of how we automatically provision all the cloud resources necessary to run this example. Stay tuned!
