
Introduction

Successful development, deployment, and maintenance of research software is central to scientific discovery. In the last decade, the role of Research Software Engineer (RSE) [1] has risen to awareness, and fostered a community of combined researchers and software developers that focus almost exclusively on this task.

While some RSEs work on research software separate from its application, others are embedded in labs and responsible for data processing, analysis, and otherwise running tasks at scale to produce research outputs. These RSEs, whether they be staff at national labs, academic institutions, or private research institutes, historically have used some form of high performance computing (HPC) to achieve this scale [2], [3].

This traditional practice has slowly been changing with the availability of cloud computing [4]. As the technological gap between HPC and cloud computing is closing [5], and the cloud can equally meet the needs of research groups [6], Research Software Engineers are presented with the task of working in both spaces. As they discover best practices and tools, there arises the need to write all of this knowledge down. Synthesizing this knowledge identifies not only what we know, but also what we don’t know and where there are gaps that require attention or work. Arguably, a mature community should have awareness of:

What are functional categories of need for the community?
What are best practices?
What tools are out there and recommended for each use case?

Further, there is a separation between the developers of research software and those who deploy it as a workflow or service. This problem isn’t new, and in fact we can look to cloud computing for inspiration. Although the ideas behind cloud computing go back to the 1960s [7], and the term itself wasn’t coined until 1996 [4], what we are specifically interested in is DevOps, a movement starting around 2007 [8] that sought to bring together the development of software and services ("Dev") with their deployment (operations, or "Ops"). Interestingly, Research Software Engineering is going through the same challenges and would benefit from the same kind of movement.

This white-paper introduces the concept of RSE-ops, or the intersection between Research Software Engineering and operations, which for research can mean running workflows or services. We present a first effort at defining relevant functional categories for the community, best practices, and the current landscape of potential areas of growth. We hope this structure can provide a basis for inspiring community and initiative around collaborative and meaningful work.

[1] “A not-so-brief history of Research Software Engineers.” https://www.software.ac.uk/blog/2016-08-17-not-so-brief-history-research-software-engineers-0.
[2] Wikipedia contributors, “History of supercomputing,” Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=History_of_supercomputing&oldid=1034423047, Jul-2021.
[3] “A Brief History of High-Performance Computing (HPC) - XSEDE Home - XSEDE Wiki.” https://confluence.xsede.org/pages/viewpage.action?pageId=1677620.
[4] Scality, “The history of cloud computing - SOLVED.” https://www.scality.com/solved/the-history-of-cloud-computing/, Mar-2020.
[5] G. Guidi, M. Ellis, A. Buluc, K. Yelick, and D. Culler, “10 Years Later: Cloud Computing is Closing the Performance Gap,” Nov. 2020.
[6] “Challenging the barriers to High Performance Computing in the Cloud - HPCwire.” https://www.hpcwire.com/solution_content/aws/manufacturing-engineering-aws/challenging-the-barriers-to-high-performance-computing-in-the-cloud/, Jan-2020.
[7] K. D. Foote, “A Brief History of Cloud Computing - DATAVERSITY.” https://www.dataversity.net/brief-history-cloud-computing/, Jun-2017.
[8] Atlassian, “History of DevOps.” https://www.atlassian.com/devops/what-is-devops/history-of-devops.

What is DevOps?

A definition of Research Software Engineering Operations (RSE-ops) can best be derived by first explaining the philosophy behind DevOps. DevOps [1] is a term that refers to best practices to bridge development and operations. It was coined in 2008 [2], and has grown out of a web services oriented model. The term has expanded to refer to other subsets of expertise in this area such as cloud security operations, which is called "DevSecOps," adding the "Sec" for "Security." In DevOps, best practices are generally defined around:

continuous integration: automated integration of code changes into a repository
continuous delivery: automated release of software in short cycles
monitoring and logging: tracking errors and system status in an automated way
communication and collaboration: interaction between team members and optimally working together
"infrastructure as code": provisioning resources through simple text files
using micro-services: single function modules that can be assembled into a more complex service

The above best practices serve the goals of speed and efficiency, reliability, scale, and collaboration. It has been shown that teams that adopt these practices can see improvements in productivity, efficiency, and quality across the board [3]. DevOps is also a culture because, along with these best practices, it implies a way of thinking and working. Where there were once barriers between development and operations teams, DevOps brought them down. You can grow a community around these ideas, which is a powerful thing.
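To make the "infrastructure as code" practice listed above more concrete, the following is a minimal sketch in Python. It is not taken from any particular tool: the spec format, the resource names, and the `plan` function are all hypothetical and chosen only for illustration. The idea it shows is that the desired resources live in a plain text file, and a program compares that description with what currently exists to decide what to create, update, or delete.

```python
import json

# Hypothetical declarative spec: in practice this text would live in a file
# under version control. Resource names and fields are illustrative only.
DESIRED_SPEC = """
{
  "resources": {
    "login-node": {"cpus": 4, "memory_gb": 16},
    "worker-1": {"cpus": 32, "memory_gb": 128}
  }
}
"""


def plan(desired, current):
    """Return a sorted list of (action, name) pairs that would bring
    the current resources in line with the desired ones."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("delete", name))
    return sorted(actions)


desired = json.loads(DESIRED_SPEC)["resources"]

# Hypothetical snapshot of what is currently provisioned.
current = {
    "login-node": {"cpus": 2, "memory_gb": 16},
    "old-worker": {"cpus": 8, "memory_gb": 32},
}

for action, name in plan(desired, current):
    print(action, name)
```

Because the spec is just text, changes to infrastructure can be reviewed, versioned, and tested in the same way as changes to code, which is where this practice connects to continuous integration and collaboration.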

DevOps as the Driver of the Cloud

The statistics are strikingly good: teams that practice DevOps outperform their peers in number and speed of deployments, recovery from downtime events, and employee ability to work on new things rather than tedious maintenance [4]. Recognizing these gains and providing structure for collaboration, training, and projects was arguably one of the goals of the Cloud Native Computing Foundation (CNCF), which was founded in 2015 [5]. Specifically, the primary stated reason for founding the CNCF was to foster community and support around container technologies, which often are the core unit of automation and DevOps practices [6]. A new term, "cloud-native," was coined with this title, and it is heavily reliant on DevOps. DevOps practices are considered the fundamental base of taking on a cloud-native approach, and another term, "Cloud Native DevOps" [7], was even coined to refer specifically to the application of DevOps practices to the cloud. Since the two are so inextricably intertwined, for the remainder of this paper we will refer to them interchangeably [8].

[1] “What is Devops.” https://aws.amazon.com/devops/what-is-devops/.
[2] “DevOps.” https://dl.acm.org/doi/10.1109/MS.2016.68.
[3] “DevOps: The Shift That Changed The World Of Development.” https://www.narwalinc.com/blog/devops-the-shift-that-changed-the-world-of-development/, Jul-2020.
[4] P. Webteam, “2016 State of DevOps Report.” https://puppet.com/resources/report/2016-state-devops-report/.
[5] Wikipedia contributors, “Cloud Native Computing Foundation,” Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Cloud_Native_Computing_Foundation&oldid=1034940082, Jul-2021.
[6] nishanil, “Containers as the foundation for DevOps collaboration.” https://docs.microsoft.com/en-us/dotnet/architecture/containerized-lifecycle/docker-application-lifecycle/containers-foundation-for-devops-collaboration.
[7] “Cloud Native Devops.” https://www.oreilly.com/library/view/cloud-native-devops/9781492040750/ch01.html.
[8] E. Choice, “How DevOps is integral to a cloud-native strategy.” https://www.information-age.com/how-devops-integral-cloud-native-strategy-123488706/, Apr-2020.

What is RSE-ops?

The role of Research Software Engineer (RSE) has been emerging in the last decade, and due to the hybrid nature of working environments across cloud and HPC, DevOps practices are logically being adopted. So if DevOps is the intersection of "Developer" and "Operations," then how does this concept map to this new space, where high performance computing, or more generally, Research Software Engineering is at the forefront?

Inspired by DevOps, we can define a similar term for the Research Software Engineering community to also inspire collaboration and champion best practices: RSE-ops. Research Software Engineers (RSEs) [1] are those individuals who write code for scientific software, and more generally support researchers in using codes on high performance computing systems, cloud services, and lab computers. Akin to traditional Software Engineers at major tech companies, they are responsible not just for software development, but also for deployment of analysis pipelines and general services. Note that we are not calling the new term RseDevOps. Dropping "Dev" is intentional, as the term "Research Software Engineering" already encompasses the "Development" portion. RSE-ops, then, appropriately refers to best practices for ensuring the same reliability, scale, collaboration, and software engineering for research codes. We may not always be running a scaled web service, but we might be running scaled jobs on a manager, profiling performance, or testing software in development.

Thus, RSE-ops is the intersection of Research Software Engineering and Operations, and generally refers to best practices for development and operations of scientific software. Arguably, the RSE community has just as much to gain by building community and putting structure around these practices. It’s important to note that while high performance computing (HPC) has traditionally been a large part of scientific computation, researchers have extended their tools to also use cloud services and other non-HPC tools, so HPC is only considered a subset of Research Software Engineering and thus RSE-ops. Many modern applications are web-based and extend beyond HPC, and so it is important to consider this set as part of the larger scientific or research software engineering universe. However, the dual need to run or deploy applications across environments presents greater challenges for the community.

[1] “A not-so-brief history of Research Software Engineers.” https://www.software.ac.uk/blog/2016-08-17-not-so-brief-history-research-software-engineers-0.

Comparison of RSE-ops vs. DevOps

The easiest way to start to map out the space of RSE-ops is to address a series of questions about people, goals, and practices, and make a direct comparison to DevOps. On a high level, RSE-ops has a stronger association with HPC, while DevOps has a stronger association with the cloud, but the lines are blurry. While early efforts of some of these clouds attempted to re-brand HPC [1], progress has been made to the point that the gap between cloud and HPC is narrowing, and HPC centers are able to take advantage of cloud technologies, and vice versa. There are still subtle differences, and ideally there could be convergence to empower researchers to use software across different platforms. For this reason, we think that making comparisons between the two can be helpful to understand which practices are well established for RSE-ops, and which require further development. Since there is a stronger association of HPC with RSE-ops, in the discussion below we will often be comparing HPC with cloud; however, this does not imply that there is always a strong dividing line between the two. We will proceed in the following sections to ask questions of each, speculate on best practices, and then summarize our findings in a table.

What are the goals of each?

Arguably, the goals of DevOps are to provide applications and services, typically web-based. The goals of RSE-ops are sometimes to provide services for scientific applications, but more so to provide infrastructure and means to run, develop, and distribute scientific software. RSE-ops, then, is for research software and services, while DevOps is typically for more widely available, persistent services and corresponding software. This does not mean, however, that RSEs are never involved with DevOps, nor that industry Software Engineers are never working on research software.

Who is involved?

You will typically find individuals practicing RSE-ops at academic institutions, national labs, and some private industry, or anywhere that high performance computing is the primary means of compute. While some companies might also use high performance computing, we more often find that larger companies maintain their own internal container orchestration system (e.g., Google uses Borg [2]), and smaller companies pay to use cloud services that offer a similar set of tooling. This decision likely results from a cost-benefit analysis [3] that determines which option is more cost effective. Whether we look at Google Cloud [4], Microsoft Azure [5], or Amazon Web Services [6], all of these cloud environments have a primary focus on distributed, scaled, and "server-less" technologies. We might call this cloud computing.

When we look closely at the individuals involved, it tends to be the case that institutions with HPC have a combination of Linux Administrators, Support Staff, Research Software Engineers, and Researchers. The Research Software Engineers in particular play an interesting role because they can sit on the administrative side (with Linux Administrators and Support Staff), on the user side (with Researchers), or somewhere in between. For this reason, they are essential staff for communication, ensuring that the needs of the researchers are known by those who run the resources. For tech companies, it’s likely the case that a DevOps team or a team of Site Reliability Engineers (SREs) is tasked with managing software and services for the company. The SREs are primarily concerned with how things should be done, and with developing monitoring and other support tools, while a DevOps team is primarily concerned with doing it [7]. The line gets blurry with respect to titles, because a company can have some flexibility in naming these roles. However, it’s common to see titles like Software Engineer, DevOps Engineer, SRE, or even Cloud Architect.

[1] “Google HPC.” https://cloud.google.com/solutions/hpc.
[2] “Large-scale cluster management at Google with Borg.” https://research.google/pubs/pub43438/.
[3] A. Prabhakaran and L. J., “Cost-Benefit Analysis of Public Clouds for Offloading In-House HPC Jobs,” in 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), 2018, pp. 57–64.
[4] “DevOps.” https://cloud.google.com/devops.
[5] “DevOps.” https://azure.microsoft.com/en-us/services/devops/.
[6] “What is Devops.” https://aws.amazon.com/devops/what-is-devops/.
[7] “Google - Site Reliability Engineering.” https://sre.google/sre-book/introduction/.

Continue reading about the differences for each category in the space.