Ending the Build Infrastructure Nightmares

The mindset of “if it’s not broken, don’t fix it” never works in software development. Neglected things age poorly over time, which is exactly what happened to a homegrown build and test infrastructure that was designed and built over a decade ago. Maintaining it meant someone spent anywhere from four hours to a whole day diagnosing and fixing infrastructure issues.

The Scenario

Both the build and test machines ran on physical hardware, with task schedulers (or cron jobs on Linux/Mac) set to run at a particular time every night. The test machines started an hour later to give the build machines time to finish. The build machines came in pairs for every operating system, but only one of each pair uploaded its nightly product build to a shared network server. The test machines then downloaded the freshly built nightly package from the shared network server and ran the nightly set of tests.
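To make the failure modes concrete, here is a minimal sketch of what the test machine’s nightly job amounts to. The article does not show the original scripts, so the share path, package name, and test runner below are assumptions, not the real system.

    # Hedged sketch of the nightly test job (the original scripts are not shown
    # in the article); SHARE, PACKAGE, and run_tests.py are hypothetical names.
    # A task scheduler or cron entry runs this an hour after the build starts.
    import shutil
    import subprocess
    from pathlib import Path

    SHARE = Path("//fileserver/nightly")   # hypothetical shared network server
    PACKAGE = "product-nightly.zip"        # hypothetical nightly package name
    WORKDIR = Path("nightly-test")

    def run_nightly_tests():
        WORKDIR.mkdir(exist_ok=True)
        # Blindly copies whatever is sitting on the share: if the build ran long,
        # crashed, or never uploaded, this silently picks up a stale package.
        shutil.copy(SHARE / PACKAGE, WORKDIR / PACKAGE)
        subprocess.run(["python", "run_tests.py", str(WORKDIR / PACKAGE)], check=True)

    if __name__ == "__main__":
        run_nightly_tests()

That blind copy, with no coordination between the two sides, is at the heart of most of the problems below.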

The first step in solving any problem is identifying it. Take a moment to brainstorm all the failure modes that can occur in the scenario above before reading on.

The Problems

The best way to collect data is to talk to the engineers who use the build system every day; they are the best source of information about what is hurting them. The following problems occurred frequently:

Scheduled Task

Once in a while, the build machine took longer than an hour to build the nightly package. The test machines would start up and download the older nightly package, so a night of testing was wasted on stale code.

Another occasional problem was human error, where a task was set up to start at the wrong time because of time zone differences (10 pm EST is very different from 10 pm PST).

Single Build Machine

If the single build machine crashed or hung, no build was produced that night, and the test machines fell back to the previous night’s build. The secondary build machine was useless in many of these scenarios because it had no logic to upload its build when the primary failed.

Product Build Errors

Because of how the infrastructure was architected, the nightly package was built only once per day. A recurring issue was compiler errors breaking the build, which cost both the nightly build and all testing for that platform.

Network Issues

Another common problem was a network issue causing the upload or download of the build to fail. An upload failure meant the loss of a night’s worth of testing; a download failure meant the loss of all testing on the affected test machine.

Shared Server Storage

If the shared storage ran out of space or the upload to the server failed, a night’s worth of testing was lost.

Individual Machine Maintenance

Updates, such as a newer Python version or operating system patches, had to be applied to each machine individually. Doing this for 30+ machines was time-consuming.

Individual Machine Setup

Each machine was set up by hand, which caused scalability issues. First, the machines drifted apart because each one was configured slightly differently. Second, if the hardware died, it took someone a few hours to set up a replacement. Adding additional capacity had the same problem.

Conclusion

From the data collected, the most significant issue was that the build system had no redundancy. Single points of failure resulted in the loss of testing or of a product build. Engineers also wasted time on recurring problems with the physical hardware.

Architecting the New Build System

An analysis of the current infrastructure showed that it was cheaper, in both time and money, to adopt new technologies than to update and maintain the existing system. A valuable lesson: homegrown solutions are cheaper to implement than paid ones, but they are costly to maintain, which makes open-source and even paid solutions far less expensive in the long run.

The new build system adopted technologies and development practices that had emerged around Continuous Integration and Continuous Delivery. Other people had already built the solutions needed to solve these problems. The solutions identified were:

Continuous Integration

Continuous Integration is a development practice that is quickly becoming the industry standard. In a nutshell, a new product build starts whenever a change is checked into the code repository, and if the build or the tests fail, the system alerts the engineers about the problem.
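As a conceptual illustration only (Jenkins does this far more robustly), the core CI idea reduces to: watch the repository, and on every new commit build, test, and alert on failure. The make targets and notification hook below are placeholders, not the product’s real commands.

    # Conceptual sketch of the CI idea, not of Jenkins internals; the build and
    # test commands and the alerting hook are placeholders.
    import subprocess
    import time

    def head_commit(repo_dir):
        """Return the hash of the commit currently at HEAD."""
        return subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout.strip()

    def step_ok(cmd, repo_dir):
        """Run one build or test step and report whether it succeeded."""
        return subprocess.run(cmd, cwd=repo_dir).returncode == 0

    def notify_engineers(commit):
        # Placeholder alert; in practice this would be email or chat.
        print(f"CI failure at commit {commit}")

    def ci_loop(repo_dir, poll_seconds=60):
        last_seen = None
        while True:
            subprocess.run(["git", "pull", "--ff-only"], cwd=repo_dir, check=True)
            commit = head_commit(repo_dir)
            if commit != last_seen:
                if not (step_ok(["make", "build"], repo_dir)
                        and step_ok(["make", "test"], repo_dir)):
                    notify_engineers(commit)
                last_seen = commit
            time.sleep(poll_seconds)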

Jenkins

Jenkins is an automation server that handles both Continuous Integration and Continuous Delivery. Jenkins has a master server and a pool of nodes it can use. Nodes are labeled either as build nodes, meaning they can build the product, or as test nodes, meaning they can run tests. Labels can be more specific, marking a node as a Windows build node or a Mac build node, for example. Jenkins and CI eliminate the single build machine and the upload/download failures, remove the need for scheduled tasks, and reduce the impact of a product build failure.

Virtual Machines and Puppet

Puppet is a software configuration management tool that can orchestrate machine setup automatically. Combined with VM images, it created a single point of maintenance that saved a lot of time: if the VM image was updated, all the VMs built from it picked up the update. An added benefit of Puppet is that a developer machine can be set up completely headless.

Puppet and VMs together solved the physical machine maintenance nightmare. They also left room to scale up the pool of resources, since a VM image can be cloned as many times as needed.

jFrog for a storage server

jFrog provides a robust archival system that can report how many times a file was downloaded and raise alerts when storage space runs low.

The Implementation

The real feat was migrating the existing infrastructure over to the new system, which is the equivalent of doing a heart transplant.

DevOps

Upper management formed a DevOps team to tackle the CI creation and migration. The DevOps team worked separately from the development teams, first building out Jenkins and AWS. Then they moved on to code migration, which is where they fell short.

For its first code migration project, the DevOps team focused on the component libraries that the product used. Inadvertently, they created a production workflow that crippled the development workflow. A typical development task is fixing a bug in a library component: the developer checks out the library repo, makes code changes, and tests them before checking in. Code that used to take 15 seconds to build now took around 5 minutes, because the build script changed to have Jenkins build the library, run a battery of tests, package it, and upload the package to a server; the developer then had to redirect the product to download and use this new version. That is insane! A developer may compile code many times, especially when debugging a problem.

In hindsight, the problem would have been easily avoided if the DevOps team had worked with the development teams. It makes sense that DevOps didn’t consider developer workflows: they don’t work on the code, and they never need to debug the library code.

At this point, I was tasked with planning and executing my product’s transition to the CI infrastructure alongside the DevOps team. This time around, the DevOps team had access to the development team.

Puppet Creation

My priority was to have Puppet scripts that could recreate a build machine or test machine as a short-term fix. If physical hardware (including a developer’s machine) failed, a new machine could be spun up quickly. The Puppet scripts were also used to create the VM images that scaled up the infrastructure, so they served two purposes and significantly reduced the effort required for the second one.

Creating a pool of resources

The next step was to move the build and test machines off physical hardware and onto AWS AMIs. One limitation at the time was that AWS did not offer Mac or consumer Windows instances. Windows Server and Linux images solved half the problem, but Mac and Windows 10 were still needed. A few “rent a Mac” cloud services exist, but they charged considerably more per hour than AWS, and the cost was prohibitive because the test bed was quite large. In the end, the solution was to build a farm of machines hosting a pool of VMs for the missing Mac and Windows platforms. Each physical machine could host one to four VMs depending on its hardware spec, and Jenkins would spin up a node on a VM when it needed one.

Switch to a git code repo

The main reason for the switch was that git provides branching, which the previous source control system did not have, and submodules; both were needed to fix the developer workflow problems. A longer-term benefit was that new hires came already skilled with git.

On the other hand, transitioning the product to git and git-flow caused friction with long-time employees who were used to the previous single-branch repository. The new multi-branch paradigm met resistance because it added extra steps to the workflow. Training helped these employees get over the learning curve and understand why git mattered.

Adding Developer Workflows

After all the building blocks were assembled, it was time to revisit the developer workflow issue. The solution was to split the build script into two logic paths, using the git command line to decide between them: if the current branch was not ‘develop’ or ‘master,’ the script built on the local machine; otherwise, it built through Jenkins. In addition, the main repository pulled in the component repositories as submodules, reconnecting the main repository with the component code so it could use locally built libraries instead of downloading prebuilt packages. A sketch of the branch check is shown below.
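This is a minimal sketch of that branch check, assuming a Python build wrapper; the real script, build commands, and Jenkins hand-off are product-specific and not shown here.

    # Sketch of the two-path build script; trigger_jenkins_build() and the
    # local "make build" command are placeholders for the product's real steps.
    import subprocess

    CI_BRANCHES = {"develop", "master"}

    def current_branch():
        """Ask git which branch is currently checked out."""
        return subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    def trigger_jenkins_build(branch):
        # Placeholder: in practice the shared branches are built by Jenkins,
        # typically kicked off by a push hook rather than by this script.
        print(f"Deferring to the Jenkins CI pipeline for branch {branch!r}")

    def build():
        branch = current_branch()
        if branch in CI_BRANCHES:
            # 'develop' and 'master' go through the full pipeline:
            # build, test, package, and upload.
            trigger_jenkins_build(branch)
        else:
            # Feature branches build locally against the submodule checkouts,
            # preserving the fast edit-compile-debug loop.
            subprocess.run(["make", "build"], check=True)

    if __name__ == "__main__":
        build()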

One additional improvement was switching from makefiles and vcxproj files to CMake, creating a single point of maintenance across platforms. Another was switching the scripts from Ant to Perl; Ant scripting uses XML, which doesn’t allow programming concepts such as arrays.

Uploading to jFrog

The transition to uploading builds to the jFrog storage server went fairly smoothly, with only two problems. The first was passwords embedded in the Jenkins script, which was easily remedied by switching to tokens and removing the passwords. The other was that every CI build was kept on the jFrog server, which filled 1.5 TB of space within a month or so; a feature branch’s builds are not needed beyond the life of that branch, so older ones had to be cleaned up regularly.
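A retention job along the following lines keeps the server from filling up. This is a hedged sketch: it assumes feature-branch builds land under a feature/ folder in one repository, and the base URL, repository key, token, and retention window are made up for illustration.

    # Hedged sketch of a feature-branch retention job for an Artifactory (jFrog)
    # server; BASE, REPO, TOKEN, and the feature/ layout are assumptions.
    import datetime
    import requests

    BASE = "https://artifactory.example.com/artifactory"  # hypothetical server
    REPO = "builds-repo"                                   # hypothetical repo key
    TOKEN = "..."                                          # access token, never a password
    KEEP_DAYS = 14

    def old_feature_artifacts():
        """Yield paths of feature-branch files older than the retention window."""
        # Artifactory's file-list endpoint: GET /api/storage/{repo}/{path}?list&deep=1
        resp = requests.get(
            f"{BASE}/api/storage/{REPO}/feature",
            params={"list": "", "deep": "1", "listFolders": "0"},
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        resp.raise_for_status()
        cutoff = (datetime.datetime.now(datetime.timezone.utc)
                  - datetime.timedelta(days=KEEP_DAYS))
        for item in resp.json().get("files", []):
            modified = datetime.datetime.fromisoformat(
                item["lastModified"].replace("Z", "+00:00"))
            if modified < cutoff:
                yield item["uri"]

    def clean_up():
        for uri in old_feature_artifacts():
            # Delete-item endpoint: DELETE /{repo}/{path}
            requests.delete(
                f"{BASE}/{REPO}/feature{uri}",
                headers={"Authorization": f"Bearer {TOKEN}"},
            ).raise_for_status()

    if __name__ == "__main__":
        clean_up()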

Planning for the future

Jenkins was set up to kick off a new build each time code was checked in, but it was still running a heavy load of system tests to verify each package. The value of unit tests became apparent: they run fast, are platform agnostic, and test the code directly. The next project would be to build out unit tests (see the small example below).
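The example is generic, not from the product; version_string() is a hypothetical function, but it shows the kind of small, dependency-free test that runs in moments on any platform.

    # Generic example of a fast, platform-agnostic unit test;
    # version_string() is hypothetical, not part of the actual product.
    import unittest

    def version_string(major, minor, patch):
        """Format a product version as 'major.minor.patch'."""
        return f"{major}.{minor}.{patch}"

    class VersionStringTest(unittest.TestCase):
        def test_formats_all_three_components(self):
            self.assertEqual(version_string(2, 10, 3), "2.10.3")

    if __name__ == "__main__":
        unittest.main()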

The elephant in the room is that the implementation of Continuous Delivery never started. Unfortunately, I left the company before I could execute my CD plans. That’s a journey for another time.

Lessons Learned

  • Homegrown solutions are costly to maintain. Paid solutions are cheaper in the long run.
  • Talking to developers is the best way to find infrastructure problems.
  • Do not have the DevOps team work in isolation from the development teams.
  • Consider development workflows while creating production workflows.
  • Do not forget about training people.
  • Do not store passwords in the Jenkins script or build scripts. Use tokens instead.
  • At the time, AWS did not offer consumer Windows or Mac instances.
  • Ant/XML is very verbose. Don’t use it for scripting.
  • Unit tests are cheap and important.

