In every organization that has embraced the ITIL (Information Technology Infrastructure Library) framework, the emergence of the term Service Level Agreement (SLA) is an inevitable milestone, and it often marks the beginning of a challenging journey. The process of crafting a well-defined SLA that is comprehensible and agreeable to all involved parties can be an arduous undertaking, particularly when attempting to articulate it in a language that is not your native tongue.

The implementation of ITIL inherently introduces the concept of SLAs, which serve as contractual agreements outlining the expected service levels, response times, and performance benchmarks. These agreements are essential for establishing clear expectations and responsibilities between service providers and their clients, whether they are internal business units or external customers.

According to ITIL 4, a service level agreement (SLA) is “A documented agreement between a service provider and a customer that identifies both services required and the expected level of service.”

Source: https://www.bmc.com/blogs/sla-best-practices/

The two particularly vexing aspects of this scenario lie in the interpretation of the terms “service” and “expected level”. These elements are intrinsically subjective, which introduces a considerable degree of ambiguity and potential discord. However, the situation takes a more troublesome turn when we consider that the responsibility for negotiating the SLA typically falls on the vendor’s shoulders. Consequently, the vendor has a significant incentive to draft the SLA document in a manner that serves their own interests.

The term “service” is a multifaceted concept, and its definition can vary depending on one’s perspective. What the client envisions as a high-quality service might differ significantly from the vendor’s interpretation. This divergence in understanding can become a breeding ground for misunderstandings and disputes, especially when it comes to assessing whether the service meets the “expected level”.

“Expected level” is another slippery term. It encapsulates the precise standard of performance or quality that the client anticipates from the vendor. Again, this is an area rife with subjectivity, and it is highly susceptible to differing interpretations. What may seem satisfactory to the vendor might fall far short of the client’s expectations, leading to dissatisfaction and disputes.

Compounding these challenges is the fact that vendors, who often have more experience in crafting SLAs, have the upper hand in negotiating these agreements. They are well-versed in the art of constructing SLAs that tilt the balance in their favor, potentially leaving clients with terms and conditions that may not align with their best interests.

If your organization operates with just a single SLA, you might consider yourself fortunate, as you’re less likely to be overwhelmed by the complexities that multiple and overlapping SLAs can bring. However, the reality for many IT departments is far more intricate. The question to ponder is: How many SLAs are currently in effect within your IT department — five, ten, or perhaps even more? And furthermore, do you possess a comprehensive understanding and oversight of these SLAs? If not, it’s time to take proactive measures by revisiting and meticulously examining these agreements.

The management of multiple SLAs can swiftly evolve into a challenging endeavor. Each SLA represents a unique commitment, tailored to specific services, expectations, and stakeholders. Without a clear and organized overview of these agreements, the potential for operational mishaps, unmet obligations, and, ultimately, discontented clients or users significantly escalates.

To navigate this intricate landscape, the prudent approach is to revisit your existing SLAs systematically. Open each one and scrutinize its contents carefully. Ensure that you have a comprehensive grasp of the service levels, response times, escalation procedures, and any associated penalties or incentives stipulated in each agreement. Creating a detailed overview of these SLAs will empower your IT department to proactively manage service commitments, maintain accountability, and ultimately enhance the overall quality of service delivery.
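As a sketch of what such an overview might look like, here is a minimal Python inventory. Every service name, provider, contact, and figure below is hypothetical, chosen only to illustrate the kind of structured record the text recommends:

```python
from dataclasses import dataclass

@dataclass
class SLA:
    """One service level agreement in the IT department's portfolio."""
    service: str              # hypothetical service name
    provider: str             # vendor or internal unit
    response_time_min: int    # agreed response time, in minutes
    escalation_contact: str   # who to notify when the SLA is at risk
    penalty: str              # penalty or incentive clause, free text

# A hypothetical inventory; none of these values come from a real contract.
inventory = [
    SLA("SOC monitoring", "third-party A", 15, "soc-escalation@example.com", "service credits"),
    SLA("Infrastructure", "third-party B", 60, "infra-oncall@example.com", "none"),
    SLA("Workplace services", "third-party C", 240, "workplace@example.com", "none"),
]

# A simple overview: tightest response commitment first.
for sla in sorted(inventory, key=lambda s: s.response_time_min):
    print(f"{sla.service:20} {sla.provider:15} responds within {sla.response_time_min} min")
```

Even a flat list like this makes it obvious at a glance which commitments are tightest and who owns each escalation path, which is exactly the oversight the paragraph above argues for.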

Now, let’s revisit these SLAs with a heightened focus on security considerations. This perspective becomes particularly crucial during a significant incident scenario, such as dealing with a widespread ransomware outbreak. In this hypothetical situation, your SOC is managed by third-party A, infrastructure services by third-party B, and workplace services by third-party C. Several critical questions immediately come to the forefront:

  1. Identification Speed: One paramount concern is how swiftly your organization can identify the security threat, especially during a major incident like a ransomware outbreak. The clock starts ticking as soon as the intrusion occurs.
  2. Containment and Eradication: Equally significant is the agility of your response. Once the threat is detected, the ability to contain and ultimately eradicate it is paramount. Knowing how quickly your security teams, likely in collaboration with party A, can execute these crucial steps is essential to minimize damage and downtime.
  3. Applicable SLAs: In this context, identifying the relevant SLAs is pivotal. These SLAs should address the timeframes and responsibilities associated with threat detection, containment, and eradication during security incidents of various magnitudes.

I invite you to pause for a moment and thoroughly examine the contents of the digital forensic report entitled ‘From Zero to Domain Admin’. Now, let’s delve into the realm of SLAs that pertain to this particular context. Picture a scenario where all the key stakeholders responsible for addressing this security incident manage to respond within 85% of the agreed timeframe.

Source: From Zero to Domain Admin — The DFIR Report

The pivotal question that arises is whether this level of responsiveness is adequate to effectively contain the security incident and mitigate any potential damage, or is the entire system already compromised by the time all parties have mobilized their responses?

It’s essential to bear in mind that a security incident often commences even before the SOC can detect it. To gain visibility into the unfolding event, pertinent data must traverse a series of processes within the SIEM environment. This journey involves the transportation of data, parsing it to make it comprehensible, and then subjecting it to thorough analysis.

The time taken for these initial stages can significantly impact the SOC’s ability to respond effectively. In an ideal scenario, where the security infrastructure is optimized, the transportation and parsing of data can be accomplished within a remarkably swift five-second window after the event’s generation. However, real-world circumstances often introduce variability into this timeframe.

One factor that can influence the speed of data handling is the efficiency of the security architecture. In some cases, the SIEM solution may be designed to pull data actively from various sources, which can extend the process to several minutes. This variation can be attributed to factors such as network latency, the volume of data being processed, and the complexity of the data sources involved.

It’s essential to bear in mind that the mere act of parsing data within a SIEM solution does not automatically entail that this data is rigorously evaluated against the array of defined use cases. This nuance becomes even more critical when we consider the inherent time constraints within SIEM operations.

Depending on the SIEM platform in use, the minimum query interval for a specific use case is typically set at 5 minutes. This implies that by the time an event is evaluated against predefined criteria, the data is already 5 minutes old. In cybersecurity, a lot can happen in just a few minutes.

This gap highlights the inherent challenge in real-time threat detection and response. The SIEM’s 5-minute query interval, while a practical constraint, underscores the importance of complementary security measures like IPS and WAF that offer more immediate, continuous monitoring and threat mitigation. While SIEMs play a vital role in post-event analysis and compliance, they may not be the sole solution for detecting and thwarting rapidly evolving cyber threats.
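The pipeline delays described above can be summed into a simple latency budget. This is only a sketch using the figures from the text (roughly five seconds for transport and parsing in the best case, plus a 5-minute minimum query interval); real numbers vary by platform, collection method, and data volume:

```python
# Hypothetical detection-latency budget for a SIEM pipeline, using the
# figures discussed above: ~5 s for transport and parsing in the best
# case (pull-based collection can stretch this to minutes), plus a
# 5-minute minimum query interval for use-case evaluation.

TRANSPORT_AND_PARSE_S = 5        # best-case transport + parsing, seconds
QUERY_INTERVAL_S = 5 * 60        # minimum use-case evaluation interval, seconds

def worst_case_detection_delay(transport_s: float = TRANSPORT_AND_PARSE_S,
                               query_interval_s: float = QUERY_INTERVAL_S) -> float:
    """Seconds between event generation and the earliest possible alert.

    Worst case: the event arrives just after a scheduled query run, so
    it waits a full interval before it is evaluated at all.
    """
    return transport_s + query_interval_s

print(f"Worst-case detection delay: {worst_case_detection_delay():.0f} s")
```

With the text’s figures this comes to 305 seconds: over five minutes of blast radius before the SOC can even see the event, and before any SLA clock has started.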

Once a specific use case has been triggered, and an alert is generated within the security infrastructure, it’s crucial to understand that the mere generation of an alert doesn’t guarantee immediate action by the SOC analyst. Fortunately, this is precisely the point in the process where the integration of automation can significantly expedite response times. The extent of automation applied can determine whether the alert can be addressed entirely automatically or if it necessitates human intervention.

In cases where a Security Orchestration, Automation, and Response (SOAR) playbook can autonomously handle the alert, it typically results in a relatively minor addition to the overall incident’s “blast radius” timeframe. On average, this additional delay ranges from 5 to 10 seconds. This swift response is a testament to the efficiency and speed of automated processes, as they can swiftly identify, assess, and mitigate security incidents without human involvement.

However, in scenarios where human intervention is deemed necessary, the timeline for resolution becomes notably more variable. Here, the “blast radius” timeframe can see an average delay ranging from 5 minutes to an undetermined period that hinges on the SOC analyst’s expertise, workload, and the complexity of the incident at hand. This variable time window underscores the importance of streamlining and optimizing SOC workflows, including the utilization of automation, to minimize response times and mitigate potential security risks effectively.
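The two triage paths can be sketched as a small timing model. The 5–10 second and 5-minute figures are the averages discussed above; the open-ended upper bound for the human path is represented as `None`, since it depends on analyst expertise, workload, and incident complexity:

```python
# Hypothetical model of the time a triage path adds to the incident's
# "blast radius", using the averages from the text.

SOAR_DELAY_S = (5, 10)            # automated SOAR playbook: 5-10 s on average
HUMAN_DELAY_S = (5 * 60, None)    # analyst-driven: 5 min to an undetermined upper bound

def triage_delay_range(automated: bool) -> tuple:
    """Return (min_seconds, max_seconds) added to the blast radius.

    max_seconds is None when the upper bound is undetermined, as it is
    for human-driven triage.
    """
    return SOAR_DELAY_S if automated else HUMAN_DELAY_S

for automated in (True, False):
    lo, hi = triage_delay_range(automated)
    label = "SOAR playbook" if automated else "human analyst"
    print(f"{label}: {lo} s to {hi if hi is not None else 'undetermined'}")
```

The asymmetry between the two tuples is the whole argument for automation: the automated path has a bounded, near-negligible cost, while the human path has a floor thirty times higher and no ceiling at all.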

The containment of a security incident within a SOC largely depends on the procedures outlined in the SOC playbook. In some cases, the incident may be resolved before the SOC analyst is fully prepared to intervene. However, when a situation necessitates action from a third party, it introduces an additional variable that can significantly extend the window of potential damage, which is often referred to as the “blast radius”. The duration of this extension hinges on the terms specified in the SLA established with the third party.

In essence, the SLA acts as a crucial contractual framework that defines the response time and responsibilities of the third party when a security incident occurs. The specifics of the SLA, such as the maximum allowable response time, the actions to be taken, and the communication protocols, can vary widely based on the nature of the third-party service and the criticality of their role in incident resolution.

For instance, if a security incident involves a cloud service provider, the SLA might stipulate a certain response time, which could range from minutes to hours, depending on the severity of the incident and the terms negotiated. During this period, the blast radius continues to expand, potentially exposing the organization to more substantial risks.

Therefore, the SLA with the third party plays a pivotal role in determining how much additional time is added to the blast radius timeframe when their involvement is required. It underscores the importance of carefully crafting and regularly reviewing SLAs to align with the organization’s security objectives and risk tolerance, as well as ensuring that all stakeholders are well-prepared to respond effectively to security incidents within these agreed-upon timeframes.

The pivotal question is whether the potential impact, often referred to as the “blast radius”, within your operational environment exceeds the time it took for the network to be wholly compromised, as delineated in the DFIR report. If your cumulative response time is longer than the time the attackers needed to breach the network in the first place, your organization may find itself ensnared in what can aptly be labeled “Death by SLA”: a situation where SLAs are inadvertently undermining your cybersecurity posture.
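A back-of-the-envelope check for this condition might look like the following. All response windows and the compromise time here are illustrative placeholders, not figures taken from the DFIR report:

```python
# Sketch of a "Death by SLA" check: sum the agreed response windows
# along the incident chain and compare against the time the attacker
# needed to fully compromise the network. All numbers are illustrative.

def death_by_sla(response_windows_min: list, compromise_time_min: float) -> bool:
    """True when the combined SLA response time exceeds the time the
    attacker needed to compromise the network."""
    return sum(response_windows_min) > compromise_time_min

# Hypothetical chain: SIEM detection, SOC triage, third-party containment.
windows = [5, 15, 60]

print(death_by_sla(windows, compromise_time_min=59))    # chain loses the race
print(death_by_sla(windows, compromise_time_min=120))   # chain wins the race
```

The point of such a check is not precision but the comparison itself: if the summed SLA windows exceed the attacker’s time-to-compromise, the agreements are formally honored while the network is already lost.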

In essence, “Death by SLA” underscores the perilous predicament wherein the commitments and stipulations of SLAs, often established to govern various aspects of service delivery, inadvertently become counterproductive. Such agreements, while vital for ensuring the efficiency and accountability of service providers, should never compromise the paramount importance of security and rapid incident response.

Are you suffering from “Death by SLA”?