This is a challenging question that is not easy to answer because it depends on several things. There are good arguments for having 24×7 eyes-on-screen in a Security Operation Center. But there are also strong arguments for implementing a different model.
Most Security Operation Centers are organized according to a tiered model: Level 1, Level 2, and Level 3, combined with support staff. Level 1 is staffed 24×7, while the other functions are staffed only during the daytime and are on call during the remaining hours. Depending on local laws, you quickly need a minimum of 15 people to run a 24×7 Security Operation Center service. And that headcount comes at a cost.
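To get a feel for why the headcount grows so quickly, here is a rough back-of-the-envelope sketch. The 40-hour contract week, the 80% availability (leave, sickness, training) and the number of seats per shift are my own illustrative assumptions, not figures from any labor law, so plug in your local numbers.

```python
import math


def analysts_needed(seats_per_shift: int,
                    contract_hours_per_week: float = 40.0,
                    availability: float = 0.8) -> int:
    """Estimate the headcount needed to keep seats filled around the clock.

    All defaults are illustrative assumptions, not legal or industry norms.
    """
    # A week has 24 * 7 = 168 hours that must be covered for every seat.
    hours_to_cover = 24 * 7 * seats_per_shift
    # Hours one analyst is actually on shift after leave, sickness, training.
    productive_hours = contract_hours_per_week * availability
    return math.ceil(hours_to_cover / productive_hours)


for seats in (1, 2, 3):
    print(f"{seats} seat(s) per shift: {analysts_needed(seats)} analysts")
```

Even a single Level 1 seat needs several people under these assumptions, and that is before Level 2, Level 3, and support staff are counted.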
And although nothing is wrong with this model, I do wonder whether it really delivers the value it is supposed to bring. Keep in mind that the Security Operation Center only comes into action once a use case has been triggered. If you look at the statistics of a Security Operation Center, a use case is, luckily for most organizations, not triggered every minute. Also keep in mind that Level 1 analysts are typically only allowed to follow the agreed runbook of a use case. If an alert doesn’t meet the requirements of the runbook, they need to pass it on to the Level 2 analyst.
And to make it even more complicated, most Security Operation Centers are not allowed to execute the actions required to isolate and/or mitigate the security incident. That responsibility lies on the shoulders of the resolver teams. But are these teams also available 24×7? Or are they available only during office hours and on pager duty outside those hours?
When you zoom in on the use cases themselves, ask yourself the question ‘If the use case is triggered, how fast should containment be established?’. I purposely skipped the NIST phase ‘Identification of the security incident’, because that phase focuses only on the Security Operation Center itself, and if you focus only on that phase, you are sending the wrong signal.
Let’s use a concrete example. The use case is ‘Possible ransomware activity detected’. Depending on the environment there are a number of ways to detect the activity behind this use case. For example, an antimalware/EDR notification, elevated disk I/O activity, and known ransomware file extensions being logged on the filesystem. Each of these described activities is relatively easy to detect if the relevant security controls are available.
But here comes the first issue: most SIEM solutions are based on scheduled searches. In other words, there will be a delay between the moment the alert is generated by the security control and the moment the use case is triggered. In this example, the scheduled search behind this use case is set to run every 5 minutes. Therefore, the potential ransomware activity can already carry on for up to 5 minutes before the alert is displayed on the screen of the Level 1 analyst. At the same time, you can ponder what the Level 1 analyst will do between alerts. Definitely, they need to have 24×7 eyes-on-screen, and yes, that is relatively easy to organize with 2 monitors. But depending on the SIEM technology being used, the alert screen may need to be refreshed manually, and therefore more time is lost.
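The delay introduced by a scheduled search is easy to put into numbers. The sketch below is only an illustration: the function name and the optional search runtime and screen refresh parameters are my own, and the average assumes that events arrive at a random moment within the search interval.

```python
def detection_delay_seconds(search_interval_s: float,
                            search_runtime_s: float = 0.0,
                            screen_refresh_s: float = 0.0) -> tuple[float, float]:
    """Return (worst_case, average) delay before an alert reaches the analyst.

    Assumes events arrive uniformly at random within the search interval
    and the screen is refreshed at a fixed interval.
    """
    # Worst case: the event lands right after a search has just started.
    worst = search_interval_s + search_runtime_s + screen_refresh_s
    # Average: on average the event waits half an interval and half a refresh.
    average = search_interval_s / 2 + search_runtime_s + screen_refresh_s / 2
    return worst, average


worst, average = detection_delay_seconds(search_interval_s=5 * 60)
print(f"worst case: {worst / 60:.1f} min, average: {average / 60:.1f} min")
```

With the 5-minute schedule from the example and nothing else added, the worst case is the full 5 minutes. Any search runtime or manual refresh comes on top of that.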
The second issue is that the Level 1 analyst needs to read and analyze the alert. And during the analysis, more data may be required. If the data is not automatically fetched when the alert is generated, more time is lost. Depending on the use case, this can easily take between 1 and 15 minutes. Once all the data is gathered and analyzed, the analyst needs to draw a conclusion (true or false positive) and, based on this conclusion, take the required action. For a simple use case, the entire process may take 10 minutes. Complex use cases may require even more time.
Hey, but wait, isn’t there a KPI for Initial Triage? Yes, most Security Operation Centers have implemented various ITIL processes, including performance management. In my view, that results in some really poorly defined KPIs, like a 15-minute SLA for Initial Triage. What most managers forget when they agree to these KPIs is that use cases vary in complexity and that not every security analyst is highly experienced. Remember, most Level 1 analysts have little experience, because the role is advertised as an entry-level job. And therefore, the Level 1 analyst is only allowed to follow what is described in the runbook behind the use case. If the alert doesn’t meet the requirements of the runbook, it must be passed to the Level 2 analyst.
In the earlier mentioned example of possible ransomware activity, you need highly specialized and highly skilled security professionals to recognize what is going on. As the alert most likely doesn’t meet the requirements of the runbook of the use case, the Level 1 analyst needs to pass it to the Level 2 analyst, and therefore more time is lost while the possible ransomware activity continues. Let’s say the Level 1 analyst takes 15 minutes to draw this conclusion. Added to the 5 minutes of search delay, 20 minutes have then already passed in total since the initial activity was detected.
As the Level 2 analyst is only active during business hours, time is lost between the moment the Level 2 analyst is notified and the moment the Level 2 analyst responds. This can be minutes, but it can also be hours. It all depends on how things are organized. Let’s say the Level 2 analyst responds in 10 minutes. The analyst still needs to read, study, and analyze the raised security incident, and therefore more time is lost.
If I also take the resolver team into account, even more time will be lost, because even if they are available 24×7, they also have other activities to complete. And switching between activities takes time.
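Adding up the figures from the example gives a simple timeline. The sketch below uses only the durations mentioned above (5 minutes of search delay, 15 minutes of Level 1 triage, 10 minutes for the Level 2 analyst to respond). The time the Level 2 analyst needs for analysis and the time the resolver team needs are not quantified in the example, so I deliberately leave them out instead of guessing. The `Step` class and function name are just illustrative.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    minutes: float


def cumulative_timeline(steps: list[Step]) -> list[tuple[str, float]]:
    """Return each step with the total minutes elapsed once it completes."""
    elapsed = 0.0
    timeline = []
    for step in steps:
        elapsed += step.minutes
        timeline.append((step.name, elapsed))
    return timeline


steps = [
    Step("Scheduled search delay", 5),
    Step("Level 1 triage and escalation", 15),
    Step("Level 2 on-call response", 10),
    # Level 2 analysis and resolver team action would be appended here.
]

for name, elapsed in cumulative_timeline(steps):
    print(f"{elapsed:>5.0f} min  {name}")
```

Half an hour has passed before the Level 2 analyst has even started reading the incident, and containment has not begun yet.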
You really need to analyze which use cases do and don’t require a 24×7 response. And for those use cases that do require a 24×7 response, you really need to think about how to organize it. Break it up into smaller pieces. There is no single solution that can do everything. You will need to compromise. Sometimes establishing containment is enough outside office hours, but some use cases may also require the eradication and recovery phases to be completed.
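One way to start this analysis is to record, per use case, how fast containment must be established and to derive the response model from that. The sketch below is only meant to illustrate the idea: the thresholds of 15 and 240 minutes, the three response models, and the second use case are hypothetical values I made up, not a recommendation.

```python
from enum import Enum


class ResponseModel(Enum):
    AUTOMATED_CONTAINMENT = "automated containment, 24x7"
    ON_CALL = "on-call analyst outside office hours"
    OFFICE_HOURS = "handled during office hours"


def response_model(max_minutes_to_containment: int) -> ResponseModel:
    """Map a containment deadline to a response model (hypothetical thresholds)."""
    if max_minutes_to_containment <= 15:
        return ResponseModel.AUTOMATED_CONTAINMENT
    if max_minutes_to_containment <= 240:
        return ResponseModel.ON_CALL
    return ResponseModel.OFFICE_HOURS


# Containment deadlines in minutes; both values are illustrative assumptions.
use_cases = {
    "Possible ransomware activity detected": 5,
    "Repeated failed logins on a test system": 1440,
}

for name, minutes in use_cases.items():
    print(f"{name}: {response_model(minutes).value}")
```

The point is not the exact thresholds but that the containment deadline, and not the existence of an alert, decides how the response is organized.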
Security orchestration, automation, and response, or SOAR for short, may be a solution. But then again, it may also be overkill. In the earlier mentioned example, with a perfectly designed and implemented runbook in the SOAR platform, you can reduce the time between the moment the activity is detected and the moment it is contained to seconds. Yes, in this example there will still be some damage caused by the potential ransomware activity. But restoring a few systems will most likely be cheaper than restoring all systems, as you may have to when you follow the traditional approach.
In the above-mentioned example, the SOAR ransomware runbook would have been initiated one second after the event was generated by the security control. The affected systems would have been isolated and moved to a quarantine network zone, and all relevant support teams would have been notified within 1 minute of the initial event. In case data had been leaked to an external location, both the Legal and Public Relations teams would have received the relevant information at the right time. However, the event could also be close to, but not 100% identical to, the one defined in the use case. In that case, the Level 2 analyst would have received a pre-analyzed alert for which all relevant information had already been gathered by the SOAR solution, reducing the overall time between the initial event and the analyzed alert to less than 10 minutes. When the right SOAR technology is used, SOAR is an extremely effective tool for the security department.
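The decision logic of such a runbook can be sketched in a few lines. This is a simplified illustration and not the API of any real SOAR product: the `Event` class, the signal names, and the action strings are all hypothetical, and I assume here that a full match means all three detection signals from the example are present. A real runbook would call the EDR, network, and ticketing integrations instead of returning strings.

```python
from dataclasses import dataclass, field

# Hypothetical signal names, based on the detection methods in the example.
REQUIRED_SIGNALS = {"edr_alert", "high_disk_io", "ransomware_extension"}


@dataclass
class Event:
    host: str
    signals: set[str] = field(default_factory=set)
    data_leaked: bool = False


def run_ransomware_runbook(event: Event) -> list[str]:
    """Return the ordered actions the runbook would take for this event."""
    actions = []
    if REQUIRED_SIGNALS <= event.signals:
        # Full match with the use case: contain first, then inform.
        actions.append(f"isolate:{event.host}")
        actions.append("notify:support-teams")
        if event.data_leaked:
            actions.append("notify:legal")
            actions.append("notify:public-relations")
    else:
        # Partial match: gather context and hand over a pre-analyzed alert.
        actions.append(f"enrich:{event.host}")
        actions.append("escalate:level-2")
    return actions


full = Event("fileserver01", set(REQUIRED_SIGNALS), data_leaked=True)
partial = Event("laptop42", {"high_disk_io"})
print(run_ransomware_runbook(full))
print(run_ransomware_runbook(partial))
```

The full match is contained without waiting for a human, while the partial match still lands with the Level 2 analyst, but with the data gathering already done.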