A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.
Writing clean and maintainable code is crucial for successful software development projects. Clean and maintainable code ensures that the software is easy to read, understand, and modify, which can save time and effort in the long run. This article will discuss some best practices for writing clean and maintainable code. Follow a Consistent Coding Style Consistency is key when it comes to writing clean and maintainable code. Following a consistent coding style makes the code easier to read and understand. It also helps ensure that the code is formatted correctly, which can prevent errors and make debugging easier. Some common coding styles include the Google Style Guide, the Airbnb Style Guide, and the PEP 8 Style Guide for Python. Keep It Simple Simplicity is another important aspect of writing clean and maintainable code. Simple code is easier to understand, modify, and debug. Avoid adding unnecessary complexity to your code; use clear and concise variable and function names. Additionally, use comments to explain complex code or algorithms. Use Meaningful Variable and Function Names Using meaningful variable and function names is essential for writing clean and maintainable code. Descriptive names help make the code more readable and understandable. Use names that accurately describe the purpose of the variable or function. Avoid using single-letter variable names or abbreviations that may be confusing to others who read the code. I recently applied this on a project that helped me retain a client for the long term: a calculator suite consisting of general-purpose calculators such as GST and VAT calculators. Write Modular Code Modular code is divided into smaller, independent components or modules. This approach makes the code easier to read, understand, and modify. Each module should have a clear and specific purpose and should be well-documented. Additionally, modular code can be reused in different parts of the software, which can save time and effort. Write Unit Tests Unit tests verify the functionality of individual units or components of the software. Writing unit tests helps ensure that the code is working correctly and can help prevent bugs from appearing. Unit tests should be written for each module or component of the software and be automated to ensure that they are run regularly. Use Version Control Version control is a system that tracks changes to the code over time. Using version control is essential for writing clean and maintainable code because it allows developers to collaborate on the same codebase without overwriting each other's work. Additionally, version control enables developers to revert to previous versions of the code if necessary. Document Your Code Documentation is an essential part of writing clean and maintainable code. Documentation helps others understand the code and makes future modifications easier. Include comments in your code to explain how it works and what it does. Additionally, write documentation outside of the code, such as README files or user manuals, to explain how to use the software. Refactor Regularly Refactoring is the process of improving the code without changing its functionality. Regularly refactoring your code can help keep it clean and maintainable. Refactoring can help remove unnecessary code, simplify complex code, and improve performance. Additionally, refactoring can help prevent technical debt, which is the cost of maintaining code that is difficult to read, understand, or modify.
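To make a couple of these practices concrete, here is a minimal Python sketch (Python only because the article cites PEP 8). The function name and the values are illustrative assumptions, not code from the calculator project mentioned above; it pairs a descriptive function signature with a small automated unit test.
Python
import unittest


def calculate_gst(amount: float, gst_rate_percent: float) -> float:
    """Return the GST due on a purchase amount.

    Descriptive parameter names make the intent obvious,
    unlike a signature such as f(a, r).
    """
    return round(amount * gst_rate_percent / 100, 2)


class CalculateGstTest(unittest.TestCase):
    def test_standard_rate(self):
        # 10% GST on 150.00 should be 15.00.
        self.assertEqual(calculate_gst(150.00, 10), 15.00)

    def test_zero_amount(self):
        # No tax is due when nothing was purchased.
        self.assertEqual(calculate_gst(0, 10), 0)


if __name__ == "__main__":
    unittest.main()
Tests like these can run automatically on every commit, which ties the unit testing and version control practices above together.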
Conclusion In conclusion, writing clean and maintainable code is crucial for the success of software development projects. Following best practices such as consistent coding styles, simplicity, meaningful variable and function names, modular code, unit tests, version control, documentation, and regular refactoring can help ensure that the code is easy to read, understand, and modify. By following these best practices, developers can save time and effort in the long run and ensure that the software is of high quality.
I attended Dynatrace Perform 2023. This was my sixth “Perform User Conference,” but the first over the last three years. Rick McConnell, CEO of Dynatrace, kicked off the event by sharing his thoughts on the company’s momentum and vision. The company is focused on adding value to the IT ecosystem and the cloud environment. As the world continues to change rapidly, this enables breakout opportunities to occur. Dynatrace strives to enable clients to be well-positioned for the upcoming post-Covid and post-recession recovery. The cloud delivers undeniable benefits for companies and their customers. It enables companies to deliver software and infrastructure much faster. That’s why we continue to see the growth of hyper-scale cloud providers. Companies rely on the cloud to take advantage of category and business growth opportunities. However, with the growth comes complexity. 71% of CIOs say it is increasingly difficult to manage all of the data that is being produced. It is beyond human ability to manage and make sense of all the data. This creates a need and an opportunity for automation and observability. There is an increased focus on cloud optimization on multiple fronts. Key areas of focus are to reduce costs and drive more reliability and availability to ultimately drive more value for customers. Observability is moving from an optional “nice to have” to a mandatory “must-have.” The industry is at an inflection point with an opportunity to drive change right now. Organizations need end-to-end observability. Dynatrace approaches the problem in a radically different way: data types need to be looked at collectively and holistically to be more powerful in the management of the ecosystem. Observability + Security + Business Analytics The right software intelligence platform can provide end-to-end observability to drive transformational change to businesses by delivering answers and intelligent automation from data. End users are no longer willing to accept poor performance from applications. If your application doesn’t work or provides an inferior user experience, your customer will find another provider. As such, it is incumbent on businesses to deliver flawless and secure digital interactions that are performant and great all of the time. New Product Announcements Consistent with the vision of “a world where software works perfectly” and of not having an incident in the first place, Dynatrace announced four new products today: Grail data lakehouse expansion: The Dynatrace Platform’s central data lakehouse technology, which stores, contextualizes, and queries data, expands beyond logs and business events to encompass metrics, distributed traces, and multi-cloud topology and dependencies. This enhances the platform’s ability to store, process, and analyze the tremendous volume and variety of data from modern cloud environments while retaining full data context. Enhanced user experience: New UX features, such as built-in dashboard functionalities and a visual interface, help foster teamwork between technical and business personnel. These new UX features include Dynatrace Notebooks, an interactive document capability that allows IT, development, security, and business users to work together using code, text, and multimedia to construct, analyze, and disseminate insights from exploratory, causal-AI analytics projects, ensuring better coordination and decision making throughout the company.
Dynatrace AutomationEngine: Features an interactive user interface and no-code and low-code tools that empower groups to make use of Dynatrace’s causal-AI analytics for observability and security insights to automate BizDevSecOps procedures over their multi-cloud environments. This automation platform enables IT teams to detect and solve issues proactively or direct them to the right personnel, thus saving time and allowing them to concentrate on complex matters that only humans can handle. Dynatrace AppEngine: Provides IT, development, security, and business teams with the capability of designing tailored, consistent, and knowledge-informed applications with a user-friendly, minimal-code method. Clients and associates can build personalized links to sync the Dynatrace platform with technologies over hybrid and multi-cloud surroundings, unify segregated solutions, and enable more personnel from their businesses with smart apps that rely on observability, security, and business knowledge from their ecosystems. Client Feedback I had the opportunity to speak with Michael Cabrera, Site Reliability Engineering Leader at Vivint. Michael brought SRE to Vivint after bringing SRE to Home Depot and Delta. Vivint realized they were spending more time firefighting than optimizing, and SRE helps solve this problem. Michael evaluated more than a dozen solutions, comparing features, ease of use, and comprehensiveness of the platform. Dynatrace was a clear winner. It enables SRE and provides a view into what customers are experiencing that is not available with any other tool. By seeing what customers feel, Michael and his team can be proactive versus reactive. The SRE team at Vivint has 12 engineers supporting 200 developers who service thousands of employees. Field technicians are in customers’ homes, helping them create and live in smarter homes. Technicians are key stakeholders since they are front-facing to end users. Dynatrace is providing Vivint with a tighter loop between what customers experience and what they can see in the tech stack. It reduces the time spent troubleshooting and firefighting versus optimizing. Development teams can see how their code is performing. Engineers can see how the infrastructure is performing. Michael feels Grail is a game changer. It allows Vivint to combine logs with business analytics to achieve full observability end-to-end into their entire business. Vivint was a beta tester of the new technology. The tighter feedback loops with deployment showed how the company’s engineering policies could further improve. They were able to scale and review the performance of apps and infrastructure and see more interconnected services and how things align with each other. Dynatrace is helping Vivint to manage apps and software through SLOs. They have been able to set up SLOs in a couple of minutes. It’s easy to install with one agent, without enabling plug-ins or buying add-ons. SREs can sit with engineering and product teams and show the experience from the tech stack to the customer. It’s great for engineering teams to have real-time feedback on performance. They can release code and see the performance before, during, and after. The biggest challenge is having so much more information than before. They are providing training to help team members know what to do with the information and how to drill down as needed. Conclusion I hope you have taken away some helpful information from my day one experience at Dynatrace Perform. To read more about my day two experience, read here.
Are you looking to move your workloads from your on-premises environment to the cloud, but don't know where to start? Migrating your business applications and data to a new environment can be a daunting task, but it doesn't have to be. With the right strategy, you can execute a successful lift and shift migration in no time. Whether you're migrating to a cloud environment or just updating your on-premises infrastructure, this comprehensive guide will cover everything from planning and preparation to ongoing maintenance and support. In this article, I have provided the essential steps to execute a smooth lift and shift migration and make the transition to your new environment as seamless as possible. Preparation for Lift and Shift Migration Assess the Workloads for Migration Before starting the lift and shift migration process, it is important to assess the workloads that need to be migrated. This involves identifying the applications, data, and resources that are required for the migration. This assessment will help in determining the migration strategy, resource requirements, and timeline for the migration. Identify Dependencies and Potential Roadblocks This involves understanding the relationships between the workloads and identifying any dependencies that might impact the migration process. Potential roadblocks could include compatibility issues, security and data privacy concerns, and network limitations. By identifying these dependencies and roadblocks, you can plan and ensure a smooth migration process. Planning for Network and Security Changes Lift and shift migration often involves changes to the network and security configurations. It is important to plan for these changes in advance to ensure the integrity and security of the data being migrated. This includes defining the network architecture, creating firewall rules, and configuring security groups to ensure secure data transfer during the migration process. Lift and Shift Migration Lift and shift migration is a method used to transfer applications and data from one infrastructure to another. The goal is to recreate the current environment with minimum changes, making it easier for users and reducing downtime. Migration Strategies There are several strategies to migrate applications and data. A common approach is to use a combination of tools to ensure accurate and efficient data transfer. One strategy is to utilize a data migration tool. These tools automate the process of transferring data from one environment to another, reducing the risk of data loss or corruption. Some popular data migration tools include AWS Database Migration Service, Azure Database Migration Service, and Google Cloud Data Transfer. Another strategy is to use a cloud migration platform. These platforms simplify the process of moving the entire infrastructure, including applications, data, and networks, to the cloud. Popular cloud migration platforms include AWS Migration Hub, Azure Migrate, and Google Cloud Migrate. Testing and Validation Testing and validation play a crucial role in any migration project, including lift and shift migrations. To ensure success, it's essential to test applications and data before, during, and after the migration process. Before migration, test applications and data in the current environment to identify potential issues. During migration, conduct ongoing testing and validation to ensure accurate data transfer. 
After the migration is complete, final testing and validation should be done to confirm everything is functioning as expected. Managing and Monitoring Managing and monitoring the migration process is crucial for success. A project plan should be in place outlining the steps, timeline, budget, and resources needed. Understanding the tools and technologies used to manage and monitor the migration process is important, such as migration tools and platforms, and monitoring tools like AWS CloudTrail, Azure Monitor, and Google Cloud Stackdriver. Post-Migration Considerations Once your lift and shift migration is complete, it's important to turn your attention to the post-migration considerations. These considerations will help you optimize your migrated workloads, handle ongoing maintenance and updates, and address any lingering issues or challenges. Optimizing the Migrated Workloads for Performance One of the key post-migration considerations is optimizing the migrated workloads for performance. This is an important step because it ensures that your migrated applications and data are running smoothly and efficiently in the new environment. After a successful migration, it's crucial to ensure that your applications and data perform optimally in the new environment. To achieve this, you need to evaluate their performance in the new setup. This can be done through various performance monitoring tools like AWS CloudWatch, Azure Monitor, and Google Cloud Stackdriver. Upon examining the performance, you can identify areas that need improvement and make the necessary adjustments. This may include modifying the configuration of your applications and data or adding more resources to guarantee efficient performance. Handling Ongoing Maintenance and Updates Another important post-migration consideration is handling ongoing maintenance and updates. This is important because it ensures that your applications and data continue to run smoothly and efficiently, even after the migration is complete. To handle ongoing maintenance and updates, it's important to have a clear understanding of your infrastructure and the tools and technologies that you are using. You should also have a plan in place for how you will handle any updates or changes that may arise in the future. One of the key things to consider when it comes to maintenance and updates is having a regular schedule for updating your applications and data. This will help you stay on top of any changes that may need to be made, and will ensure that your workloads are running optimally at all times. Addressing Any Lingering Issues or Challenges It's crucial to resolve any unresolved problems or difficulties that occurred during the migration process. This guarantees that your applications and data run smoothly and efficiently and that any issues overlooked during migration are dealt with before they become bigger problems. To resolve lingering issues, it is necessary to have a good understanding of your infrastructure and the tools you use. Having a plan in place for handling future issues is also important. A key aspect of resolving lingering issues is to have a monitoring system in place for your applications and data. This helps to identify any problems and respond promptly. When Should You Consider the Lift and Shift Approach? The lift and shift approach allows you to convert capital expenses into operational ones by moving your applications and data to the cloud with little to no modification. 
This method can be beneficial in several scenarios, such as: When you need a complete cloud migration: The lift and shift method is ideal for transferring your existing applications to a more advanced and flexible cloud platform to manage future risks. When you want to save on costs: The lift and shift approach helps you save money by migrating your workloads to the cloud from on-premises with few modifications, avoiding the need for expensive licenses or hiring professionals. When you have limited expertise in cloud-native solutions: This approach is suitable when you need to move your data to the cloud quickly and with minimal investment and you have limited expertise in cloud-native solutions. When you don’t have proper documentation: The lift and shift method is also useful if you lack proper documentation, as it allows you to move your application to the cloud first and optimize or replace it later. Conclusion Lift and shift migration is a critical step in modernizing legacy applications and taking advantage of the benefits of the cloud. The process can be complex and time-consuming, but careful planning and working with a knowledgeable vendor or using a reliable cloud migration tool can ensure a smooth and successful migration. Organizations can minimize downtime and the risk of data loss while increasing scalability and reliability and reducing costs. Lift and shift migration is a smart choice for organizations looking to upgrade their technology and benefit from cloud computing. By following the best practices outlined in this article, organizations can achieve their goals and execute a successful lift and shift migration.
In November 2022, the Green Software Foundation organized its first hackathon, “Carbon Hack 2022,” with the aim of supporting software projects whose objective is to reduce carbon emissions. I participated in this hackathon with the Carbon Optimised Process Scheduler project along with my colleagues Kamlesh Kshirsagar and Mayur Andulkar, in which we developed an API to optimize job scheduling in order to reduce carbon emissions, and we won the “Most Insightful” project prize. In this article, I will summarize the key concepts of “green software” and explain how software engineers can help reduce carbon emissions. I will also talk about the Green Software Foundation hackathon, Carbon Hack, and its winners. What Is “Green Software”? According to this research by Malmodin and Lundén (2018), the global ICT sector is responsible for 1.4% of carbon emissions and 4% of electricity use. In another article, it is estimated that the ICT sector’s emissions in 2020 were between 1.8% and 2.8% of global greenhouse gas emissions. Even though these estimates carry some uncertainty, they give a reasonable idea of the impact of the ICT sector. The Green Software Foundation defines “green software” as a new field that combines climate science, hardware, software, electricity markets, and data center design to create carbon-efficient software that emits the least amount of carbon possible. Green software focuses on three crucial areas to do this: hardware efficiency, carbon awareness, and energy efficiency. Green software practitioners should be aware of these six key points: Carbon Efficiency: Emit the least amount of carbon Energy Efficiency: Use the least amount of energy Carbon Awareness: Aim to utilize “cleaner” sources of electricity when possible Hardware Efficiency: Use the least amount of embodied carbon Measurement: You can’t get better at something that you don’t measure Climate Commitments: Understand the mechanism of carbon reduction What Can We Do as Software Engineers? Fighting global warming and climate change involves all of us, and since we can do it by changing our code, we might start by reading the advice of Ismael Velasco, an expert in this field. These principles are extracted from his presentation at the Code For All Summit 2022: 1. Green By Default We should move our applications to a greener cloud provider or zone. This article compares the three main cloud providers. Google Cloud has matched 100% of its electricity consumption with renewable energy purchases since 2017 and has recently committed to fully decarbonize its electricity supply by 2030. Azure has been 100% carbon-neutral since 2012, meaning they remove as much carbon each year as they emit, either by removing carbon or reducing carbon emissions. AWS purchases and retires environmental attributes like renewable energy credits and Guarantees of Origin to cover the non-renewable energy used in specific regions. Also, only a handful of their data centers have achieved carbon neutrality through offsets. Make sure the availability zone where your app is hosted is green. This can be checked on the website of the Green Web Foundation. Transfers of data should be optional, minimal, and sent just once. Prevent pointless data transfers. Delete useless information (videos, special fonts, unused JavaScript, and CSS). Optimize media and minify assets.
Reduce page loads and data consumption with focused caching solutions based on service workers. Make use of a content delivery network (CDN); you can handle all requests from servers that are currently using renewable energy thanks to CloudFront. Reduce the number of HTTP requests and data exchanges in your API designs. Track your app’s environmental impact. Start out quickly and simply, then gradually increase complexity. 2. Green Mode Design With the Green Mode design, users have the option to reduce functionality for less energy: sound-only videos, transcript-only audio, cache-only web apps, zero ads/trackers, and optional images (click-to-view, grayscale). Green Mode is a way of designing software that prioritizes the extension of the device life and digital inclusion over graceful degradation. To achieve this, it suggests designing for maximum backward compatibility with operating systems and web APIs, as well as offering minimal versions of CSS. 3. Green Partnerships We should ponder three questions: What knowledge are we lacking? What missing networks are there? What can we provide to partners? What Is the Green Software Foundation? Accenture, GitHub, Microsoft, and ThoughtWorks launched the Green Software Foundation with the Linux Foundation to put software engineering’s focus on sustainability. The Green Software Foundation is a non-profit organization created under the Linux Foundation with the goal of creating a reliable ecosystem of individuals, standards, tools, and “green software best practices”. It focuses on lowering the carbon emissions that software is responsible for and reducing the adverse effects of software on the environment. Moreover, it was established for those who work in the software industry, with the aim of providing them with information on what they can do to reduce the emissions their software is responsible for. Carbon Hack 2022 Carbon Hack 2022 took place for the first time between October 13th and November 10th, 2022, and was supported by the GSF member organizations Accenture, Avanade, Intel, Thoughtworks, Globant, Goldman Sachs, UBS, BCG, and VMware. The aim of Carbon Hack was to create carbon-aware software projects using the GSF Carbon Aware SDK, which has two parts: a hosted API and a client library available for 40 languages. The hackathon had 395 participants and 51 qualified projects from all over the world. Carbon-aware software refers to when an application is executed at different times or in regions where electricity is generated from greener sources — like wind and solar — as this can reduce its carbon footprint. When the electricity is clean, carbon-aware software works harder; when the electricity is dirty, it works less. By including carbon-aware features in an application, we can partially offset our carbon footprint and lower greenhouse gas emissions. Carbon Hack 2022 Winners The total prize pool of $100,000 USD was divided between the first three winners and 4 category winners: First place – Lowcarb Lowcarb is a plugin that enables carbon-aware scheduling of training jobs on geographically distributed clients for the well-known federated learning framework Flower. The results of this plugin displayed 13% lower carbon emissions without any negative impacts. Second place – Carbon-Aware DNN Training with Zeus This energy optimization framework adjusts the power limit of the GPU and can be integrated into any DNN training job.
The use case for Zeus showed a 24% reduction in carbon emissions and only a 3% decrease in learning time. Third place – Circa Circa is a lightweight library – written in C – that can be installed from a release using the usual configure and make install procedure. It chooses the most effective time to run a program within a predetermined window of time and also contains a simple scripting command that waits for the energy with the lowest carbon intensity over a specified period of time. Most Innovative – Sustainable UI A library that provides a set of base primitives for building carbon-aware UIs in any React application; in the future, the developers would like to offer versions for other popular frameworks as well. The developers predicted that Facebook’s monthly emissions would be reduced by 1,800 metric tons of gross CO2 emissions if they were to use SUI Headless, by reducing a tenth of a gram of CO2e every visit while gradually degrading its user interface. This is comparable to the fuel used by 24 tanker trucks or the annual energy consumption of 350 houses. Most Polished – GreenCourier A scheduling plugin implemented for Kubernetes. To deploy carbon-aware scheduling across geographically connected Kubernetes clusters, the authors developed a scheduling policy based on marginal carbon emission statistics obtained from the Carbon-aware SDK. Most Insightful – Carbon Optimised Process Scheduler Disclosure: This was my Carbon Hack team! In order to reduce carbon emissions, we developed an API service with a UI application that optimizes job scheduling. The problem was modeled using mixed-integer linear programming and solved using an open-source solver. If it were possible to optimize hundreds of high-energy industrial processes, carbon emissions could be reduced by up to 2 million tons per year. An example scenario from the IT sector demonstrates how moving work by just three hours can reduce CO2 emissions by almost 18.5%. This results in a savings of roughly 300 thousand tons of CO2 per year when applied to a million IT processes. Most Actionable – HEDGE.earth 83% of the carbon emissions on the web come from API requests. This team developed a reverse proxy — an application that sits in front of back-end applications and forwards client requests to those apps — to maximize the amount of clean energy used to complete API requests (also available on NPM). Take a look at all the projects from Carbon Hack 2022 here. Conclusion The collective effort and cross-disciplinary cooperation across industries and within engineering are really important to achieve global climate goals, and we can start with these two courses that the Linux Foundation and Microsoft offer regarding green software and sustainable software engineering. Also, we can begin debating with our colleagues on how to lower the carbon emissions produced by our applications. In addition, we could follow people who have knowledge on this topic on their social media accounts; I would recommend the articles of Ismael Velasco as a starting point. Finally, if we manage to write our code in a greener way, our software projects will be more robust, reliable, and faster, and our brands more resilient. Sustainable software applications will not only help to reduce our carbon footprint but will also help to sustain our applications with fewer dependencies, better performance, lower resource usage, cost savings, and energy-efficient features.
Tracking Mean Time To Restore (MTTR) is standard industry practice for incident response and analysis, but should it be? Courtney Nash, an Internet Incident Librarian, argues that MTTR is not a reliable metric — and we think she's got a point. We caught up with Courtney at the DevOps Enterprise Summit in Las Vegas, where she was making her case against MTTR in favor of alternative metrics (SLOs and cost of coordination data), practices (Near Miss analysis), and mindsets (humans are the solution, not the problem) to help organizations better learn from their incidents. Episode Highlights (1:54) The end of MTTR? (4:50) Library of incidents (13:20) What is an incident? (19:41) Cost of coordination (22:13) Near misses (24:21) Mental models (28:16) Role of language in shaping public discourse (29:33) Learnings from The Void Episode Excerpt Dan: Hey, everyone; welcome to Dev Interrupted. My name is Dan Lines, and I'm here with Courtney Nash, who has one of the coolest possibly made-up titles, but possibly real: Internet Incident Librarian. Courtney: Yep, that's right, yeah, you got it. Dan: Welcome to the show. Courtney: Thank you for having me on. Dan: I love that title Courtney: Still possibly made up, possibly, possibly... Dan: Still possibly made up. Courtney: We'll just leave that one out there for the listeners to decide. Dan: Let everyone decide what that could possibly mean. We have a, I think, maybe a spicy show, a spicy topic. Courtney: It's a hot topic show. Dan: Hot topic, especially since we're at DevOps Enterprise Summit, where we hear a lot about the DORA metrics, one of them being MTTR. Courtney: Yes. Dan: And you might have a hot take on that. The end of MTTR? Or how would you describe it? Courtney: Yeah, I feel a little like the fox in the henhouse here, but Gene accepted the talk. So you know, there's that. Dan: So it's on him. Courtney: [laughing] It's all Gene's fault! So I have been interested in complex systems for a long time; I used to study the brain. And I got sucked down an internet rabbit hole quite a while ago. And I've had beliefs for a long time that I haven't had data to back up necessarily. And we see these sorts of perverted behaviors, not that kind of perverted, but where we take metrics in the industry, and then, per Goodhart's Law, whichever metric you pick, people incentivize it, and then weird things happen. But I think we spend too little time looking at the humans in the system and a lot of time focusing on the technical aspects and the data that come out of the technical side of systems. So, I started a project about a year ago called The Void. It's the Verica Open Incident Database, actually a real, not made-up name. And it's the largest collection of public incident reports. So, if you all have an outage, you hopefully go and figure out and talk about what happened, and then you write that up and that's out in the world. So I'm not writing these; I'm curating them and collecting. I'm a librarian. So, I have about 10,000 of them now. And a bunch of metadata associated with all these incident reports.
The Southwest Airlines fiasco from December 2022 and the FAA Notam database fiasco from January 2023 had one thing in common: their respective root causes were mired in technical debt. At its most basic, technical debt represents some kind of technology mess that someone has to clean up. In many cases, technical debt results from poorly written code, but more often than not, it is more a result of evolving requirements that older software simply cannot keep up with. Both the Southwest and FAA debacles centered on legacy systems that may have met their respective business needs at the time they were implemented but, over the years, became increasingly fragile in the face of changing requirements. Such fragility is a surefire result of technical debt. The coincidental occurrence of these two high-profile failures mere weeks apart lit a fire under organizations across both the public and private sectors to finally do something about their technical debt. It’s time to modernize, the pundits proclaimed, regardless of the cost. Ironically, at the same time, a different set of pundits, responding to the economic slowdown and prospects of a looming recession, recommended that enterprises delay modernization efforts in order to reduce costs short term. After all, modernization can be expensive and rarely delivers the type of flashy, top-line benefits the public markets favor. How, then, should executives make decisions about cleaning up the technical debt in their organizations? Just how important is such modernization in the context of all the other priorities facing the C-suite? Understanding and Quantifying Technical Debt Risk Some technical debt is worse than others. Just as getting a low-interest mortgage is a much better idea than loan shark money, so too with technical debt. After all, sometimes shortcuts when writing code are a good thing. Quantifying technical debt, however, isn’t a matter of somehow measuring how messy legacy code might be. The real question is one of the risk to the organization. Two separate examples of technical debt might be just as messy and equally worthy of refactoring. But the first example may be working just fine, with a low chance of causing problems in the future. The other one, in contrast, could be a bomb waiting to go off. Measuring the risks inherent in technical debt, therefore, is far more important than any measure of the debt itself — and places this discussion into the broader area of risk measurement or, more broadly, risk scoring. Risk scoring begins with risk profiling, which determines the importance of a system to the mission of the organization. Risk scoring provides a basis for quantitative risk-based analysis that gives stakeholders a relative understanding of the risks from one system to another — or from one area of technical debt to another. The overall risk score is the sum of all of the risk profiles across the system in question — and thus gives stakeholders a way of comparing risks in an objective, quantifiable manner. One particularly useful (and free to use) resource for calculating risk profiles and scores is Cyber Risk Scoring (CRS) from NIST, an agency of the US Department of Commerce. CRS focuses on cybersecurity risk, but the folks at NIST have intentionally structured it to apply to other forms of risk, including technical debt risk. 
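As a purely illustrative sketch of how such a comparison might be computed (the systems, risk areas, and 1-5 likelihood and impact values below are invented and are not part of NIST's CRS):
Python
# Hypothetical risk profiles: each entry scores one risk area of a system on a
# 1-5 scale for likelihood and impact (all names and values are invented).
risk_profiles = {
    "booking-system": [
        {"area": "technical debt", "likelihood": 4, "impact": 5},
        {"area": "cybersecurity", "likelihood": 2, "impact": 5},
    ],
    "internal-wiki": [
        {"area": "technical debt", "likelihood": 3, "impact": 1},
        {"area": "compliance", "likelihood": 1, "impact": 2},
    ],
}


def risk_score(profiles):
    # The overall score is the sum of the individual risk profiles,
    # here weighted as likelihood x impact.
    return sum(p["likelihood"] * p["impact"] for p in profiles)


for system, profiles in risk_profiles.items():
    print(f"{system}: risk score {risk_score(profiles)}")
The absolute numbers matter less than the fact that stakeholders can now compare risks across systems, and across kinds of risk, on a single scale.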
Comparing Risks Across the Enterprise As long as an organization has a quantitative approach to risk profiling and scoring, then it’s possible to compare one type of risk to another — and, furthermore, make decisions about mitigating risks across the board. Among the types of risks that are particularly well-suited to this type of analysis are operational risk (i.e., risk of downtime), which includes network risk; cybersecurity risk (the risk of breaches); compliance risk (the risk of out-of-compliance situations); and technical debt risk (the risk that legacy assets will adversely impact the organization). The primary reason to bring these various sorts of risks onto a level playing field is to give the organization an objective approach to making decisions about how much time and money to spend on mitigating those risks. Instead of having different departments decide how to use their respective budgets to mitigate the risks within their scope of responsibility, organizations require a way to coordinate various risk mitigation efforts that leads to an optimal balance between risk mitigation and the costs for achieving it. Calculating the Threat Budget Once an organization looks at its risks holistically, one uncomfortable fact emerges: it’s impossible to mitigate all risks. There simply isn’t enough money or time to address every possible threat to the organization. Risk mitigation, therefore, isn’t about eliminating risk. It’s about optimizing the amount of risk we can’t mitigate. Optimizing the balance between mitigation and the cost of achieving it across multiple types of risk requires a new approach to managing risk. We can find this approach in the practice of Site Reliability Engineering (SRE). SRE focuses on managing reliability risk, a type of operational risk concerned with reducing system downtime. Given the goal of zero downtime is too expensive and time-consuming to achieve in practice, SRE calls for an error budget. The error budget is a measure of how far short of perfect reliability the organization targets, given the cost considerations of mitigating the threat of downtime. If we generalize the idea of error budgets to other types of risk, we can postulate a threat budget which represents a quantitative measure of how far short of eliminating a particular risk the organization is willing to tolerate. Intellyx calls the quantitative, best practice approach to managing threat budgets across different types of risks threat engineering. Assuming an organization has leveraged the risk scoring approach from NIST (or some alternative approach), it’s now possible to engineer risk mitigation across all types of threats to optimize the organization’s response to such threats. Applying Threat Engineering to Technical Debt Resolving technical debt requires some kind of modernization effort. Sometimes this modernization is a simple matter of refactoring some code. In other cases, it’s a complex, difficult migration process. There are several other approaches to modernization with varying risk/reward profiles as well. Risk scoring provides a quantitative assessment of just how important a particular modernization effort is to the organization, given the threats inherent in the technical debt in question. Threat engineering, in turn, gives an organization a way of placing the costs of mitigating technical debt risks in the context of all the other risks facing the organization — regardless of which department or budget is responsible for mitigating one risk or another. 
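Before returning to technical debt specifically, here is a small sketch of the error-budget arithmetic that the threat budget generalizes; the 99.9% SLO target, the 30-day window, and the 2% residual-risk tolerance are example values, not recommendations.
Python
# Error budget for an example 99.9% availability SLO over a 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                  # 43,200 minutes in the window

error_budget_minutes = (1 - slo_target) * window_minutes
print(f"Downtime allowed this window: {error_budget_minutes:.1f} minutes")   # ~43.2

# A generalized "threat budget" can be expressed the same way: how far short of
# eliminating a given risk the organization is willing to tolerate.
residual_risk_tolerance = 0.02                 # tolerate up to 2% unmitigated risk
mitigation_target = 1 - residual_risk_tolerance
print(f"Target mitigation level: {mitigation_target:.0%}")                    # 98%
The same budgeting logic, applied per risk type, is what lets an organization decide deliberately how much technical debt risk it is willing to carry.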
Applying threat engineering to technical debt risk is especially important because other types of risk, namely cybersecurity and compliance risk, get more attention and, thus, a greater emotional reaction. It’s difficult to be scared of spaghetti code when ransomware is in the headlines. As the Southwest and FAA debacles show, however, technical debt risk is every bit as risky as other, sexier forms of risk. With threat engineering, organizations finally have a way of approaching risk holistically in a dispassionate, best practice-based manner. The Intellyx Take Threat engineering provides a proactive, best practice-based approach to breaking down the organizational silos that naturally form around different types of risks. Breaking down such silos has been a priority for several years now, leading to practices like NetSecOps and DevSecOps that seek to leverage common data and better tooling to break down the divisions between departments. Such efforts have always been a struggle because these different teams have long had different priorities — and everyone ends up fighting for a slice of the budget pie. Threat engineering can align these priorities. Once everybody realizes that their primary mission is to manage and mitigate risk, then real organizational change can occur. Copyright © Intellyx LLC. Intellyx is an industry analysis and advisory firm focused on enterprise digital transformation. Covering every angle of enterprise IT from mainframes to artificial intelligence, our broad focus across technologies allows business executives and IT professionals to connect the dots among disruptive trends. As of the time of writing, none of the organizations mentioned in this article is an Intellyx customer. No AI was used to produce this article.
Software maintenance may require different approaches based on your business goals, the industry you function in, the expertise of your tech team, and the predictive trends of the market. Therefore, along with understanding the different types of software maintenance, you also have to explore the various models of software maintenance. Based on the kind of problem you are trying to solve, your team can choose the right model from the following options: 1. Quick-Fix Model A quick-fix model in software maintenance is a method for addressing bugs or issues in the software by prioritizing a fast resolution over a more comprehensive solution. This approach typically involves making a small, localized change to the software to address the immediate problem rather than fully understanding and addressing the underlying cause. However, organizations adopt this maintenance approach only in emergency situations that call for quick resolutions. Under the quick-fix model, tech teams carry out the following software maintenance activities: Annotate software changes by including change IDs and code comments. Enter each change into a maintenance history detailing why it was made and the techniques used. Note each location and merge them via the change ID if there are multiple points in the code change. 2. Iterative Enhancement Model The iterative model is used for small-scale application modernization and scheduled maintenance. Generally, the business justification for changes is ignored in this approach as it only involves the software development team, not the business stakeholders. So, the software team will not know if more significant changes are required in the future, which is quite risky. The iterative enhancement model treats the application target as a known quantity. It incorporates changes in the software based on the analysis of the existing system. The iterative model best suits changes made to confined application targets, with little cross-impact on other apps or organizations. 3. Reuse-Oriented Model The reuse-oriented model identifies components of the existing system that are suitable to use again in multiple places. In recent years, this model also includes creating components that can be reused in multiple applications of a system. There are three ways to incorporate the reuse-oriented model — object and function, application system, and component. Object and function reuse: This model reuses the software elements that implement a single well-defined object. Application system reuse: Under this model, developers can integrate new components in an application without making changes to the system or reconfiguring it for a specific user. Component reuse: Component reuse refers to using a pre-existing component rather than creating a new one in software development. This can include using pre-built code libraries, frameworks, or entire software applications. 4. Boehm’s Model Introduced in 1978, Boehm’s model focuses on measuring characteristics to get non-tech stakeholders involved with the life cycle of software. The model represents a hierarchical structure of high-level, intermediate, and primitive characteristics of software that define its overall quality. The high-level characteristics of quality software are: Maintainability: It should be easy to understand, evaluate, and modify the processes in a system.
Portability: Software systems should help in ascertaining the most effective way to make environmental changes. As-is utility: It should be easy and effective to use the software as it is. The intermediate level of characteristics represented by the model displays different factors that validate the expected quality of a software system. These characteristics are: Reliability: Software performance is as expected, with zero defects. Portability: The software can run in various environments and on different platforms. Efficiency: The system makes optimum utilization of code, applications, and hardware resources. Testability: The software can be tested easily and the users can trust the results. Understandability: The end-user should be able to understand the functionality of the software easily and thus use it effectively. Usability: Efforts needed to learn, use, and comprehend different software functions should be minimal. The primitive characteristics of quality software include basic features like device independence, accessibility, accuracy, etc. 5. Taute Maintenance Model Developed by B.J. Taute in 1983, the Taute maintenance model helps development teams update the software and perform necessary modifications after it has been put into operation. The Taute model for software maintenance can be carried out in the following phases: Change request phase: In this phase, the client sends the request to make changes to the software in a prescribed format. Estimate phase: Then, developers conduct an impact analysis on the existing system to estimate the time and effort required to make the requested changes. Schedule phase: Here, the team aggregates the change requests for the upcoming scheduled release and creates the planning documents accordingly. Programming phase: In the programming phase, requested changes are implemented in the source code, and all the relevant documents, like design documents and manuals, are updated accordingly. Test phase: During this phase, the software modifications are carefully analyzed. The code is tested using existing and new test cases, along with the implementation of regression testing. Documentation phase: Before the release, system and user documentation are prepared and updated based on regression testing results. Thus, developers can maintain the coherence of documents and code. Release phase: The customer receives the new software product and updated documentation. Then the system’s end users perform acceptance testing. Conclusion Software maintenance is not just a necessary chore, but an essential aspect of any successful software development project. By investing in ongoing maintenance and addressing issues as they arise, organizations can ensure that their software remains reliable, secure, and up-to-date. From bug fixes to performance optimizations, software maintenance is a crucial step in maximizing the value and longevity of your software. So don't overlook this critical aspect of software development — prioritize maintenance and keep your software running smoothly for years to come.
In the cloud-native era, we often hear that "security is job zero," which means it's even more important than any number one priority. Modern infrastructure and methodologies bring us enormous benefits, but, at the same time, since there are more moving parts, there are more things to worry about: How do you control access to your infrastructure? Between services? Who can access what? Etc. There are many questions to be answered, and we answer them with policies: a bunch of security rules, criteria, and conditions. Examples: Who can access this resource? Which subnet egress traffic is allowed from? Which clusters a workload must be deployed to? Which protocols are not allowed for reachable servers from the Internet? Which registry binaries can be downloaded from? Which OS capabilities can a container execute with? Which times of day can the system be accessed? All organizations have policies since they encode important knowledge about how to comply with legal requirements, work within technical constraints, avoid repeating mistakes, etc. Since policies are so important today, let's dive deeper into how to best handle them in the cloud-native era. Why Policy-as-Code? Policies are based on written or unwritten rules that permeate an organization's culture. So, for example, there might be a written rule in our organizations explicitly saying: For servers accessible from the Internet on a public subnet, it's not a good practice to expose a port using the non-secure "HTTP" protocol. How do we enforce it? If we create infrastructure manually, the four-eyes principle may help: always have a second person involved when doing something critical. If we do Infrastructure as Code and create our infrastructure automatically with tools like Terraform, a code review could help. However, the traditional policy enforcement process has a few significant drawbacks: You can't guarantee this policy will never be broken. People can't be aware of all the policies at all times, and it's not practical to manually check against a list of policies. For code reviews, even senior engineers will not likely catch all potential issues every single time. Even if we've got the best teams in the world that can enforce policies with no exceptions, it's difficult, if not impossible, to scale. Modern organizations are more likely to be agile, which means many employees, services, and teams continue to grow. There is no way to physically staff a security team to protect all of those assets using traditional techniques. Policies could be (and will be) breached sooner or later because of human error. It's not a question of "if" but "when." And that's precisely why most organizations (if not all) do regular security checks and compliance reviews before a major release, for example. We violate policies first and then create ex post facto fixes. I know, this doesn't sound right. What's the proper way of managing and enforcing policies, then? You've probably already guessed the answer, and you are right. Read on. What Is Policy-as-Code (PaC)? As business, teams, and maturity progress, we'll want to shift from manual policy definition to something more manageable and repeatable at the enterprise scale. How do we do that? First, we can learn from successful experiments in managing systems at scale: Infrastructure-as-Code (IaC): treat the content that defines your environments and infrastructure as source code.
DevOps: the combination of people, process, and automation to achieve "continuous everything," continuously delivering value to end users. Policy-as-Code (PaC) is born from these ideas. Policy as code uses code to define and manage policies, which are rules and conditions. Policies are defined, updated, shared, and enforced using code and leveraging Source Code Management (SCM) tools. By keeping policy definitions in source code control, whenever a change is made, it can be tested, validated, and then executed. The goal of PaC is not to detect policy violations but to prevent them. This leverages the DevOps automation capabilities instead of relying on manual processes, allowing teams to move more quickly and reducing the potential for mistakes due to human error. Policy-as-Code vs. Infrastructure-as-Code The "as code" movement isn't new anymore; it aims at "continuous everything." The concept of PaC may sound similar to Infrastructure as Code (IaC), but while IaC focuses on infrastructure and provisioning, PaC improves security operations, compliance management, data management, and beyond. PaC can be integrated with IaC to automatically enforce infrastructural policies. Now that we've got the PaC vs. IaC question sorted out, let's look at the tools for implementing PaC. Introduction to Open Policy Agent (OPA) The Open Policy Agent (OPA, pronounced "oh-pa") is a Cloud Native Computing Foundation incubating project. It is an open-source, general-purpose policy engine that aims to provide a common framework for applying policy-as-code to any domain. OPA provides a high-level declarative language (Rego, pronounced "ray-go," purpose-built for policies) that lets you specify policy as code. As a result, you can define, implement, and enforce policies in microservices, Kubernetes, CI/CD pipelines, API gateways, and more. In short, OPA works in a way that decouples decision-making from policy enforcement. When a policy decision needs to be made, you query OPA with structured data (e.g., JSON) as input, then OPA returns the decision:
Policy Decoupling
OK, less talk, more work: show me the code. Simple Demo: Open Policy Agent Example Pre-requisite To get started, download an OPA binary for your platform from GitHub releases. On macOS (64-bit):
curl -L -o opa https://openpolicyagent.org/downloads/v0.46.1/opa_darwin_amd64
chmod 755 ./opa
Tested on an M1 Mac; it works as well. Spec Let's start with a simple example to achieve Attribute-Based Access Control (ABAC) for a fictional Payroll microservice. The rule is simple: you can only access your salary information or your subordinates', not anyone else's. So, if you are bob, and john is your subordinate, then you can access the following:
/getSalary/bob
/getSalary/john
But accessing /getSalary/alice as user bob would not be possible. Input Data and Rego File Let's say we have the structured input data (input.json file):
{
    "user": "bob",
    "method": "GET",
    "path": ["getSalary", "bob"],
    "managers": {
        "bob": ["john"]
    }
}
And let's create a Rego file.
Here we won't bother too much with the syntax of Rego, but the comments would give you a good understanding of what this piece of code does. File example.rego:
package example

default allow = false                        # default: not allow

allow = true {                               # allow if:
    input.method == "GET"                    # method is GET
    input.path = ["getSalary", person]
    input.user == person                     # input user is the person
}

allow = true {                               # allow if:
    input.method == "GET"                    # method is GET
    input.path = ["getSalary", person]
    managers := input.managers[input.user][_]
    contains(managers, person)               # input user is the person's manager
}
Run The following should evaluate to true:
./opa eval -i input.json -d example.rego "data.example"
Changing the path in the input.json file to "path": ["getSalary", "john"], it still evaluates to true, since the second rule allows a manager to check their subordinates' salary. However, if we change the path in the input.json file to "path": ["getSalary", "alice"], it would evaluate to false. Here we go. Now we have a simple working solution of ABAC for microservices! Policy as Code Integrations The example above is very simple and only useful to grasp the basics of how OPA works. But OPA is much more powerful and can be integrated with many of today's mainstream tools and platforms, like: Kubernetes, Envoy, AWS CloudFormation, Docker, Terraform, Kafka, Ceph, and more. To quickly demonstrate OPA's capabilities, here is an example of Terraform code defining an auto-scaling group and a server on AWS: With this Rego code, we can calculate a score based on the Terraform plan and return a decision according to the policy. It's super easy to automate the process:
terraform plan -out tfplan to create the Terraform plan
terraform show -json tfplan | jq > tfplan.json to convert the plan into JSON format
opa exec --decision terraform/analysis/authz --bundle policy/ tfplan.json to get the result.
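The Terraform and Rego listings referenced above are not reproduced in this excerpt, so the following is only a rough sketch of what a scoring policy of that shape might look like, inferred from the opa exec decision path (a terraform.analysis package with an authz rule); the per-action weights and the blast_radius threshold are invented for illustration and are not the article's actual policy.
Rego
package terraform.analysis

import input as tfplan

# Maximum change score we are willing to approve automatically (invented threshold).
blast_radius := 30

# Per-action weights (invented values): deletes are treated as riskier than creates.
weights := {"create": 1, "update": 5, "delete": 10}

default authz = false

# Authorize the plan only when its total score stays under the threshold.
authz {
    score < blast_radius
}

# Sum a weight for every action of every resource change in the Terraform JSON plan.
score := sum([w |
    rc := tfplan.resource_changes[_]
    action := rc.change.actions[_]
    w := weights[action]
])
With a policy shaped like this, the opa exec command above would return the authz decision as true or false depending on how disruptive the generated tfplan.json is.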
GitOps is a software development and operations methodology that uses Git as the source of truth for deployment configurations. It involves keeping the desired state of an application or infrastructure in a Git repository and using Git-based workflows to manage and deploy changes. Two popular open-source tools that help organizations implement GitOps for managing their Kubernetes applications are Flux and Argo CD. In this article, we'll take a closer look at these tools, their pros and cons, and how to set them up.

Common Use Cases for Flux and Argo CD

Flux
Continuous delivery: Flux can be used to automate the deployment pipeline and ensure that changes are automatically deployed as soon as they are pushed to the Git repository.
Configuration management: Flux allows you to store and manage your application's configuration as code, making it easier to version control and track changes.
Immutable infrastructure: Flux helps enforce an immutable infrastructure approach, where changes are made only through the Git repository and not through manual intervention on the cluster.
Blue-green deployments: Flux supports blue-green deployments, where a new version of an application is deployed alongside the existing version and traffic is gradually shifted to the new version.

Argo CD
Continuous deployment: Argo CD can be used to automate the deployment process, ensuring that applications are always up-to-date with the latest changes from the Git repository.
Application promotion: Argo CD supports application promotion, where applications can be promoted from one environment to another, for example, from development to production.
Multi-cluster management: Argo CD can be used to manage applications across multiple clusters, ensuring the desired state of the applications is consistent across all clusters.
Rollback management: Argo CD provides rollback capabilities, making it easier to revert changes in case of failures.

The choice between the two tools depends on the specific requirements of the organization and application, but both provide a GitOps approach that simplifies the deployment process and reduces the risk of manual errors. Each has its own pros and cons, and in this article, we'll look at what they are and how to set the tools up.

What Is Flux?
Flux is a GitOps tool that automates the deployment of applications on Kubernetes. It works by continuously monitoring the state of a Git repository and applying any changes to a cluster. Flux integrates with various Git providers such as GitHub, GitLab, and Bitbucket. When changes are made to the repository, Flux automatically detects them and updates the cluster accordingly.

Pros of Flux
Automated deployments: Flux automates the deployment process, reducing manual errors and freeing up developers to focus on other tasks.
Git-based workflow: Flux leverages Git as the source of truth, which makes it easier to track and revert changes.
Declarative configuration: Flux uses Kubernetes manifests to define the desired state of a cluster, making it easier to manage and track changes.

Cons of Flux
Limited customization: Flux supports only a limited set of customizations, which may not be suitable for all use cases.
Steep learning curve: Flux has a steep learning curve for new users and requires a deep understanding of Kubernetes and Git.

How To Set Up Flux

Prerequisites
A running Kubernetes cluster.
Helm installed on your local machine.
A Git repository for your application's source code and Kubernetes manifests.
The repository URL and an SSH key for the Git repository.

Step 1: Add the Flux Helm Repository
The first step is to add the Flux Helm repository to your local machine. Run the following command to add the repository:
Shell
helm repo add fluxcd https://charts.fluxcd.io

Step 2: Install Flux
Now that the Flux Helm repository is added, you can install Flux on the cluster. Run the following command to install Flux:
Shell
helm upgrade -i flux fluxcd/flux \
  --set git.url=git@github.com:<your-org>/<your-repo>.git \
  --set git.path=<path-to-manifests> \
  --set git.pollInterval=1m \
  --set git.ssh.secretName=flux-git-ssh
In the above command, replace the placeholder values with your own Git repository information. The git.url parameter is the URL of the Git repository, the git.path parameter is the path to the directory containing the Kubernetes manifests, and the git.ssh.secretName parameter is the name of the Kubernetes secret containing the SSH key for the repository.

Step 3: Verify the Installation
After running the above command, you can verify the installation by checking the status of the Flux pods. Run the following command to view the pods:
Shell
kubectl get pods -n <flux-namespace>
If the pods are running, Flux has been installed successfully.

Step 4: Connect Flux to Your Git Repository
The final step is to connect Flux to your Git repository. Run the following commands to generate an SSH key and create a secret from it:
Shell
ssh-keygen -t rsa -b 4096 -f id_rsa
kubectl create secret generic flux-git-ssh \
  --from-file=id_rsa=./id_rsa --namespace=<flux-namespace>
In the above command, replace the <flux-namespace> placeholder with the namespace where Flux is installed. Now, add the generated public key (id_rsa.pub) as a deploy key in your Git repository.

You have successfully set up Flux using Helm. Whenever changes are made to the Git repository, Flux will detect them and update the cluster accordingly. In conclusion, setting up Flux using Helm is quite a simple process. By using Git as the source of truth and continuously monitoring the state of the cluster, Flux helps simplify the deployment process and reduce the risk of manual errors.

What Is Argo CD?
Argo CD is an open-source GitOps tool that automates the deployment of applications on Kubernetes. It allows developers to declaratively manage their applications and keeps the desired state of the applications in sync with the live state. Argo CD integrates with Git repositories and continuously monitors them for changes. Whenever changes are detected, Argo CD applies them to the cluster, ensuring the application is always up-to-date. With Argo CD, organizations can automate their deployment process, reduce the risk of manual errors, and benefit from Git's version control capabilities. Argo CD provides a graphical user interface and a command-line interface, making it easy to use and manage applications at scale.

Pros of Argo CD
Advanced deployment features: Argo CD provides advanced deployment features, such as rolling updates and canary deployments, making it easier to manage complex deployments.
User-friendly interface: Argo CD provides a user-friendly interface that makes it easier to manage deployments, especially for non-technical users.
Customizable: Argo CD allows for greater customization, making it easier to fit the tool to specific use cases.

Cons of Argo CD
Steep learning curve: Argo CD has a steep learning curve for new users and requires a deep understanding of Kubernetes and Git.
Complexity: Argo CD has a more complex architecture than Flux, which can make it more difficult to manage and troubleshoot.

How To Set Up Argo CD
Argo CD can be installed on a Kubernetes cluster using Helm, a package manager for Kubernetes. In this section, we'll go through the steps to set up Argo CD using Helm.

Prerequisites
A running Kubernetes cluster.
Helm installed on your local machine.
A Git repository for your application's source code and Kubernetes manifests.

Step 1: Add the Argo CD Helm Repository
The first step is to add the Argo Helm repository, which hosts the Argo CD chart, to your local machine. Run the following command to add the repository:
Shell
helm repo add argo https://argoproj.github.io/argo-helm

Step 2: Install Argo CD
Now that the Helm repository is added, you can install Argo CD on the cluster. Run the following command to install Argo CD into the argocd namespace:
Shell
helm upgrade -i argocd argo/argo-cd \
  --namespace argocd --create-namespace \
  --set server.route.enabled=true
The server.route.enabled option exposes the Argo CD server through an OpenShift route; on other clusters, you can expose or port-forward the argocd-server service instead.

Step 3: Verify the Installation
After running the above command, you can verify the installation by checking the status of the Argo CD pods. Run the following command to view the pods:
Shell
kubectl get pods -n argocd
If the pods are running, Argo CD has been installed successfully.

Step 4: Connect Argo CD to Your Git Repository
The final step is to connect Argo CD to your Git repository. Argo CD provides a graphical user interface that you can use to create applications and connect to your Git repository. To access the Argo CD interface when using the route above, run the following command to get its URL:
Shell
kubectl get routes -n argocd
Open the URL in a web browser to access the Argo CD interface. Once you're in the interface, you can create a new application by providing the Git repository URL and the path to the Kubernetes manifests. Argo CD will then continuously monitor the repository for changes and apply them to the cluster. The same kind of application can also be created from the command line; a brief argocd CLI example is shown after the conclusion below. You have now successfully set up Argo CD using Helm.

Conclusion
GitOps is a valuable approach for automating the deployment and management of applications on Kubernetes. Flux and Argo CD are two popular GitOps tools that provide a simple and efficient way to automate the deployment process, enforce an immutable infrastructure, and manage applications in a consistent and predictable way. Flux focuses on automating the deployment pipeline and providing configuration management as code, while Argo CD provides a more complete GitOps solution, including features such as multi-cluster management, application promotion, and rollback management. Both tools have their own strengths and weaknesses, and the choice between the two will depend on the specific requirements of the organization and the application. Regardless of the tool chosen, GitOps simplifies the deployment process and reduces the risk of manual errors. By keeping the desired state of the applications in sync with the Git repository, GitOps ensures that changes are made in a consistent and predictable way, resulting in a more reliable and efficient deployment process.
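To complement the UI-based flow described above, here is a minimal sketch of creating an application with the argocd CLI instead. It assumes the CLI is installed and that the API server is reached through a local port-forward; the application name (guestbook), repository URL, and manifest path are placeholders rather than values from this article.

Shell
# Reach the Argo CD API server locally (handy when no route or ingress is exposed).
kubectl port-forward svc/argocd-server -n argocd 8080:443 &

# Retrieve the initial admin password, then log in.
argocd admin initial-password -n argocd
argocd login localhost:8080 --username admin --insecure

# Register an application pointing at your Git repository (placeholder values).
argocd app create guestbook \
  --repo https://github.com/<your-org>/<your-repo>.git \
  --path <path-to-manifests> \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace default

# Trigger a sync and inspect the result.
argocd app sync guestbook
argocd app get guestbook

An application created this way behaves exactly like one created in the UI: Argo CD keeps comparing the manifests in Git with the live state and reports any drift, or corrects it automatically if auto-sync is enabled.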
Modern organizations need complex IT infrastructures functioning properly to provide goods and services at the expected level of performance. Losing critical parts of that infrastructure, or all of it, can therefore push an organization to the edge of disappearance. Disasters remain a constant threat to production processes.

What Is a Disaster?
A disaster is an event that instantly overwhelms the capacity of available human, IT, financial, and other resources and results in significant losses of valuable assets (for example, documents, intellectual property, data, or hardware). In most cases, a disaster is a sudden chain of events causing non-typical threats that are difficult or impossible to stop once the disaster starts. Depending on the type of disaster, an organization needs to react in specific ways. There are three main types of disasters:
Natural disasters
Technological and human-made disasters
Hybrid disasters

A natural disaster is the first thing that probably comes to mind when you hear the word "disaster." Natural disasters include floods, earthquakes, forest fires, abnormal heat, intense snowfalls, heavy rains, hurricanes and tornadoes, and sea and ocean storms.

A technological disaster is the consequence of a tech infrastructure malfunction, human error, or malicious intent. The list can include any issue from a software disruption in a single organization to a power plant problem causing difficulties across a whole city, region, or even country. Examples include global software disruptions, critical hardware malfunctions, power outages and electricity supply problems, malware infiltration (including ransomware attacks), telecommunication issues (including network isolation), military conflicts, terrorism incidents, dam failures, and chemical incidents.

The third category covers hybrid disasters, which combine natural and technological factors. For example, a dam failure can cause a flood resulting in a power outage and communication issues across an entire region or country.

What Is Disaster Recovery?
Disaster recovery (DR) is the set of actions (a methodology) that an organization takes to recover and restore operations after a globally disruptive event. Major disaster recovery activities focus on regaining access to data, hardware, software, network devices, connectivity, and power supply. DR actions can also cover rebuilding logistics and relocating staff members and office equipment in case assets are damaged or destroyed. To create a disaster recovery plan, you need to think through the action sequences for three periods:
Before the disaster: building, maintaining, and testing the DR system and policies.
During the disaster: applying immediate response measures to avoid or mitigate asset losses.
After the disaster: applying the DR system to restore operations, contacting clients, partners, and officials, and analyzing losses and recovery efficiency.

Here are the points to include in your disaster recovery plan.

Business Impact Analysis and Risk Assessment Data
At this step, you study the threats and vulnerabilities that are typical of and most dangerous for your organization. With that knowledge, you can also estimate the probability of a particular disaster occurring, measure its potential impact on production, and implement suitable disaster recovery solutions more easily.
Recovery Objectives: Defined RPO and RTO
RPO is the recovery point objective: the maximum amount of data you can lose without a significant impact on production. RTO is the recovery time objective: the longest downtime your organization can tolerate and, thus, the maximum time you have to complete recovery workflows. For example, an RPO of one hour means backups or replicas must never be more than an hour old, while an RTO of four hours means recovery must be completed within four hours of declaring a disaster.

Distribution of Responsibilities
A team that is aware of every member's duties in case of disaster is a must-have component of an efficient DR plan. Assemble a dedicated DR team, assign specific roles to every member, and train them to fulfill those roles before an actual disaster strikes. This is the way to avoid confusion and missing links when real action is required to save an organization's assets and production.

DR Site Creation
A disaster of any scale or nature can critically damage your main server site and production office, making resuming operations there impossible or extraordinarily time-consuming. In this situation, a prepared DR site with replicas of critical workloads is the best choice to minimize RTO and continue providing services to the organization's clients during and after an emergency.

Failback Preparations
Failback, the process of returning workloads to the main site once the main data center is operational again, is often overlooked when planning disaster recovery. Nevertheless, establishing failback sequences beforehand helps make the entire process smoother and avoid the minor data losses that might otherwise happen. Additionally, keep in mind that a DR site is usually not designed to support your infrastructure for a prolonged period.

Remote Storage for Crucial Documents and Assets
Even small organizations produce and process a lot of crucial data nowadays. Losing hard copies or digital documents can make their recovery time-consuming, expensive, or even impossible. Thus, preparing remote storage (for example, VPS cloud storage for digital documents and protected physical storage for hard-copy assets) is a solid way to ensure the accessibility of important data in case of disaster. An all-in-one solution for VMware disaster recovery can also cover this need if your environment runs on VMware.

Equipment Requirements Noted
This DR plan element requires auditing the nodes that keep your organization's IT infrastructure functioning: computers, physical servers, network routers, hard drives, cloud-based server hosting equipment, and so on. That knowledge lets you see which elements are required to restore the original state of the IT environment after a disaster. What's more, you can see the list of equipment required to support at least mission-critical workloads and ensure production continuity when the main resources are unavailable.

Communication Channels Defined
Ensure a stable and reliable internal communication system for your staff members, management, and DR team. Define the order in which communication channels are used so the team can cope with the unavailability of the main server and internal network right after a disaster.

Response Procedures Outlined
In a DR plan, the first hours are critical. Create step-by-step instructions on how to execute DR activities: monitoring, failover sequences, system recovery verification, and so on. If a disaster hits the production center despite all the prevention measures applied, a concentrated and rapid response to the particular event can help mitigate the damage.
Incident Reporting to Stakeholders
After a disaster strikes and disrupts your production, DR team members are not the only ones who should be informed. You also need to notify key stakeholders, including your marketing team, third-party suppliers, partners, and clients. As part of your disaster recovery plan, create outlines and scripts showing your staff how to inform each critical group about the issues relevant to it. Additionally, a basic press release prepared beforehand can save precious time during an actual incident.

DR Plan Testing and Adjustment
Successful organizations change and expand with time, and their DR plans should be adjusted to match their current needs and recovery objectives. Test your plan right after you finish it, and perform additional testing every time you introduce changes. This way, you can measure the efficiency of your disaster recovery plan and ensure the recoverability of your assets.

Optimal DR Strategy Applied
A DR strategy can be implemented on a DIY (do-it-yourself) basis or delegated to a third-party vendor. The former tends to sacrifice reliability in favor of cost savings, while the latter can be more expensive but more efficient. The choice of a DR strategy depends entirely on your organization's characteristics, including team size, IT infrastructure complexity, budget, risk factors, and desired reliability, among others.

Summary
A disaster is a sudden destructive event that can render an organization inoperable. Natural, human-made, and hybrid disasters have different levels of predictability, but they are barely preventable at the level of a single organization. The only way to ensure the safety of an organization is to create a reliable disaster recovery plan based on the organization's specific needs. The key elements of a DR plan are:
Risk assessment and business impact analysis
Defined RPO and RTO
Distributed DR team responsibilities
DR site creation
Failback preparations
Remote storage
An equipment list
Established communication channels
Immediate response sequences
Incident reporting instructions
Disaster recovery testing and adjustment
An optimal DR strategy choice
Samir Behara
Senior Cloud Infrastructure Architect,
AWS
Shai Almog
OSS Hacker, Developer Advocate and Entrepreneur,
Codename One
JJ Tang
Co-Founder,
Rootly
Sudip Sengupta
Technical Writer,
Javelynn