Being Prepared When Everything Goes Wrong

As an organization running a website that accepts payment via credit card, Drupalize.Me is committed to keeping our data secure and our members safe. Part of this means adhering to PCI-DSS standards. Since we don't handle any of the credit card data directly on our infrastructure (Recurly, who is PCI Level 1, handles that for us), we annually apply for PCI-DSS Self-Assessment Questionnaire A. Having just gone through this process for 2015, I noticed one large and notable change to the PCI-DSS 3.0 SAQ A-EP that was not required in 2014: an incident response plan. My initial reaction was "UGH, this sounds like a lot of work!" But the truth is that the PCI-DSS SAQ is really good at identifying tedious yet important tasks.

So I spent time researching and writing an incident response plan for Drupalize.Me. Of course, we hope we never have to put our plan into action, but we would rather be prepared than not.

And that's really the gist of it. An incident response plan is all about being prepared, so that in the moment, under pressure, when everything that could possibly go wrong is going wrong, you can remain calm, cool, and level-headed. If you've ever had to write a social media message or respond to a support request during an unplanned site outage, you know how easy it can be to misstep, even if your intentions are good.

Not all incident response plans are equal either. The PCI-DSS SAQ requires an incident response plan for security breaches, but even if you're not accepting payment and don't have to worry about PCI-DSS compliance you should still have a response plan for general site outages, server downtime, and what to do when your co-worker accidentally restores the production site from a local copy and then goes on vacation for a week. So here are some of the things I've learned along the way that I hope will help you to be better prepared for whatever it is the future holds.

What to Cover

A good incident response plan should cover the following topics:

Making an initial assessment
Communication
Containing the damage and minimizing risk
Identifying the type and severity of the attack
Protecting evidence
Notifying external agencies
Recovering your system(s)
Reporting
Assessing damage and cost
Reviewing the response and updating documentation

In the heat of the moment, these steps are not purely sequential, but having them documented helps to ensure that they are all covered appropriately.

Pro tip: documentation should start at the very beginning and continue throughout the entire life cycle of the incident.

Making an Initial Assessment

Many activities could indicate a possible attack. Until you understand the type and severity of the compromise in more detail, you will not be able to be truly effective in containing the damage and minimizing the risk. An overzealous response could even cause more damage than the initial attack.

Take steps to determine whether you are dealing with an actual incident or a false positive
Gain a general idea of the type and severity of attack
Record your actions thoroughly (to be used for later documentation)
It is always better to act on a false positive than fail to act on a genuine incident

Your response plan should cover any relevant details specific to assessing your particular system.
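Recording your actions is easier if the habit is cheap. Here is a minimal sketch of a timestamped action log in Python. The class name, field names, and the example entries are all assumptions for illustration, not something we actually run on Drupalize.Me; a shared document or chat channel would serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Append-only, timestamped record of actions taken during an incident."""

    entries: list = field(default_factory=list)

    def record(self, who, action, reason=""):
        # Store UTC timestamps so entries from different people sort consistently.
        entry = {
            "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "who": who,
            "action": action,
            "reason": reason,
        }
        self.entries.append(entry)
        return entry

    def as_text(self):
        # One line per entry, suitable for pasting into the incident report.
        lines = []
        for e in self.entries:
            line = f"{e['time']} [{e['who']}] {e['action']}"
            if e["reason"]:
                line += f" (why: {e['reason']})"
            lines.append(line)
        return "\n".join(lines)


# Hypothetical entries, for illustration only.
log = IncidentLog()
log.record("alice", "Noticed unexpected admin login in server logs")
log.record("alice", "Backed up database before changes", reason="preserve evidence")
print(log.as_text())
```

Capturing the reason alongside the action pays off later, since the incident report needs the reasoning behind each step as well as the step itself.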

Communicate

Once you suspect that there is a security incident, you should quickly communicate the breach to the rest of the team.

Identified incident leads should contact anyone outside of the core team as needed.
Don't overcommunicate. Until an incident is properly contained, only those playing a role in the incident response should be informed. Failure to contain communication could result in additional problems if it ends up on social media or an additional attacker is tipped off about an open attack vector.
Determine who needs to be informed of the incident—both internally and externally.
Use template messages for things like social media in order to minimize poorly written messaging due to panic.
External communication should be coordinated with the incident response team, and in some cases probably with a legal representative as well.

Your response plan should include the roles and responsibilities of everyone on the incident response team as well as up-to-date contact information for those people.
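As a sketch of what a template message could look like in practice, here is one way to do it with Python's standard `string.Template`. The wording, the placeholder names, and the status URL are hypothetical; the real text should be written and vetted ahead of time by whoever owns your communication.

```python
from string import Template

# Hypothetical template text; write and vet your own wording in advance.
OUTAGE_TEMPLATE = Template(
    "We're aware of an issue affecting $service and are investigating. "
    "Next update by $next_update. Status: $status_url"
)


def render_outage_message(service, next_update, status_url):
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so a half-finished message can't be produced by accident.
    return OUTAGE_TEMPLATE.substitute(
        service=service, next_update=next_update, status_url=status_url
    )


message = render_outage_message(
    "video playback", "15:00 UTC", "https://status.example.com"
)
print(message)
```

The point of the template is that the only decisions left to make under pressure are the blanks.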

Containing the Damage and Minimizing the Risks

Having an incident response plan allows you to act quickly and to reduce the potential risk of an attack. This can make the difference between a minor incident and a major one.

Here are some things to prioritize in order to minimize damage and risk:

Protect human life and people's safety. This should, of course, always be your first priority.
Protect sensitive data. Defining ahead of time which data is sensitive will enable you to prioritize your responses in protecting the data.
Protect other valuable data. Other data in your environment, such as logs and mission-critical data, might still be of great value, and you should act to protect the most valuable data first. On Drupalize.Me this might be information like our video playback history, which lets us track our members' recently watched videos and video popularity. This data can't be regenerated.
Protect hardware and software against attack. Damage to systems can result in costly downtime.
Minimize disruption of computing resources. Remember that while uptime is important, keeping systems up during an attack may end up in greater problems. Don't be afraid to pull the plug.
Determine the access point(s) used by the attacker and plug them. Measures might include disabling a modem, adding access control entries to a router or firewall, or increasing physical security measures.

A note about taking systems offline: in the vast majority of cases, escalation of an incident should involve immediately taking the system off the network, effectively isolating it from further attack. However, if you have service level agreements in place that require you to keep your systems available, it's important to understand the implications, from both a cost and a legal perspective, of taking those systems offline. If needed, you can choose to keep a system online, but I suggest limiting it to only the critical network activity.

Your response plan should clearly identify possible areas of risk and ways to mitigate any potential damage, as well as any known "weak" points in the system. As someone looking to contain damage, where should I start and in what order should I proceed?

Identify the Severity of the Attack

In order to be able to recover from an attack, you need to determine how seriously your systems have been compromised. Knowing this will help you contain and minimize risk and inform other decisions along the way, such as how quickly and with whom you communicate, how to recover, and whether you want to seek legal redress.

You should attempt to:

Determine the nature of the attack, and remember that this might be different from the initial assessment. This time around we want to know for sure.
Determine the point of origin.
Attempt to determine the intent of the attacker. Are they trying to acquire specific information, or was it random?
Identify the systems that have been compromised.
Identify the information that has been accessed and determine the sensitivity of those files.

Determining the severity of the incident will likely occur as the incident response plays out. In many cases you might find something that seems small, but as you dig deeper you begin to realize the potential ramifications of what you've just uncovered, and the incident will need to be escalated. A good incident response plan will outline specific procedures to follow as you learn more about the attack.

During an incident, response time is often crucial, and your team should work to prioritize steps in the incident response plan to ensure that you're protecting yourselves and your data first, and minimizing potential damage.

Your response plan should include information about how to judge the relative severity of an incident and, where appropriate, documentation about to whom, and how, to escalate it. Relate this back to the communication section, where you've got roles and responsibilities listed.
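One way to make severity and escalation concrete is to write them down as data. The sketch below is an assumption about how that might look: the three levels, the classification rules, and the role names are all hypothetical, and your own plan should define them based on your systems and team.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical roles; replace with the people listed in your communication section.
ESCALATION = {
    Severity.LOW: ["on-call developer"],
    Severity.MEDIUM: ["on-call developer", "incident lead"],
    Severity.HIGH: ["on-call developer", "incident lead", "company owner", "legal contact"],
}


def classify(sensitive_data_accessed, systems_compromised, attack_ongoing):
    # Illustrative rules only: sensitive data exposure always counts as HIGH.
    if sensitive_data_accessed:
        return Severity.HIGH
    if systems_compromised > 0 or attack_ongoing:
        return Severity.MEDIUM
    return Severity.LOW


def escalate(current, new_assessment):
    # While an incident is open, severity only moves up as you learn more.
    return max(current, new_assessment)


level = classify(sensitive_data_accessed=False, systems_compromised=1, attack_ongoing=False)
level = escalate(level, classify(True, 1, False))
print(level.name, ESCALATION[level])
```

Keeping `escalate` one-directional reflects the pattern described above: something that looks small often turns out to be bigger once you dig in, and rarely the other way around mid-incident.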

Protecting the Evidence

Whether for legal reasons, or simply for educational ones, protecting the evidence you uncover while investigating an incident is always a good idea. Before you begin your investigation make a backup of the system as it is, in its compromised state. Even for less severe incidents this backup can be helpful in figuring out why the site went offline and how to prevent it from happening again. Oftentimes we prioritize getting the system back online in a fixed state, and then come back to do a full analysis of what happened later.

As an example, anytime we experience even a minor hiccup on the Drupalize.Me site (cron failing to run, PHP errors occurring, etc.) the first thing we do is make a backup of the database. Then we fix the problem. The backup allows us to get a broken version of the site running on a local machine and figure out why something was failing, not just what was failing.

For incidents that involve attackers, and not just bad code or corrupt data, also make sure to include things like network logs, server logs, system logs, and any other relevant information in your backups. If you're responsible for a physical location, back up access logs for the room. The more information you can save about the compromised state of the system, the easier it will be to perform an analysis later.

Once you have working, verified backups, you can wipe the compromised systems and rebuild them. This will enable you to begin running your site again. The backups provide the critical, untainted evidence required for future analysis or even prosecution. Of course, a different backup than the forensic backup should be used when restoring data.

One very fast method is to rebuild a fresh system with new hard disks. The existing hard disks can be removed and put in storage as backup, and then you can restore from a known good backup. Make sure that you change any local passwords as part of the rebuild.

Note: If you're dealing with PCI compliance, many payment providers require that you be able to give them access to the compromised system in an unaltered state. As in, unplug the network cable, and then don't touch it.

Your incident response plan should include information about the location of log files, and any other data that should be retained, as well as procedures for backing up those files.
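To illustrate what a procedure for backing up those files might involve, here is a rough Python sketch that copies files into an evidence directory and records a SHA-256 checksum for each, so you can later show a copy hasn't changed. The function names, the directory layout, and the demo log line are assumptions for the example; this is not a substitute for proper forensic tooling or for your payment provider's requirements.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    # Read in chunks so large log files don't need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_evidence(sources, evidence_dir):
    """Copy each file into evidence_dir and return {file name: checksum}."""
    evidence_dir = Path(evidence_dir)
    evidence_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for source in map(Path, sources):
        # Note: files with the same name would overwrite each other here.
        copy = evidence_dir / source.name
        shutil.copy2(source, copy)  # copy2 also tries to preserve timestamps
        manifest[source.name] = sha256_of(copy)
    return manifest


def verify_evidence(manifest, evidence_dir):
    # True only if every preserved file still matches its recorded checksum.
    return all(
        sha256_of(Path(evidence_dir) / name) == digest
        for name, digest in manifest.items()
    )


# Demo with a throwaway directory and a made-up log line.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "access.log"
    log_file.write_text("203.0.113.5 GET /user/login\n")
    manifest = preserve_evidence([log_file], Path(tmp) / "evidence")
    intact = verify_evidence(manifest, Path(tmp) / "evidence")
    print(manifest, intact)
```

Store the manifest somewhere separate from the evidence itself, otherwise anyone who can alter the files can alter the checksums too.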

Notifying External Agencies

Once the incident has been contained and data preserved, consider whether you need to start notifying appropriate external entities. Do you have business partners you need to inform? What about law enforcement?

In some cases you might have legal agreements with your customers, or even with the public, that require you to notify them of incidents of a certain type.

Always make sure that you assess any media or social media responses on a case-by-case basis and that someone within the organization, or an external partner who specializes in disaster response, is vetting your communication. It's so easy to panic under pressure and say something just slightly wrong, and these statements can often escalate an incident far beyond its initial severity.

Your response plan should identify any business partners you might need to contact as well as information about who should make contact and any process for vetting that communication.

Recovering Systems

Can you bring the system back online after patching it up, or do you need to rebuild from scratch? This will depend on the incident and whether or not physical damage has occurred. Also, if you're restoring from a backup, remember that many attacks can go on for long periods of time without detection. It's important that, as part of your incident response, you determine when the attack started so you can be assured that you're using clean data. This also points to the importance of using a backup schedule that includes long-term backup retention.

Your response plan should include documentation about how to rebuild or recover your system from backup.
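The "when did the attack start" question translates directly into which backup you can trust. A small sketch, assuming you have a list of backup dates and an estimated attack start date (both made up here):

```python
from datetime import date


def latest_clean_backup(backup_dates, attack_started):
    # Only backups taken strictly before the attack began can be trusted.
    clean = [d for d in backup_dates if d < attack_started]
    return max(clean) if clean else None


# Hypothetical backup schedule and attack start date.
backups = [date(2015, 1, 1), date(2015, 2, 1), date(2015, 2, 8), date(2015, 2, 15)]
chosen = latest_clean_backup(backups, attack_started=date(2015, 2, 5))
print(chosen)  # 2015-02-01
```

If the function returns `None`, your retention window was shorter than the attack, which is exactly the situation long-term backup retention is meant to prevent.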

Evidence and Reporting

During execution of the incident response plan, all team members should be diligently logging the actions that they take. This information, as well as everyone's findings, can then be compiled into an incident report. Your report should include a description of the incident, actions taken with the reasoning behind them, information about how the system has been fixed, and steps that should be taken to prevent future incidents.

Depending on the level of severity you might want to consider having two people involved with every decision that's made. This helps reduce the likelihood of evidence being tampered with. It's not a fun thing to think about, but the attacker could be an employee, contractor, or someone else that is part of the incident response team.

Assessing Damage and Cost

What are the direct and indirect costs to your organization of this incident?

Loss of sensitive data
Loss of competitive edge via proprietary information
Loss of reputation and customer trust
Legal costs
Labor costs to implement the incident response plan and any follow up work
Costs incurred because of downtime

Among other things, this information is useful in the incident report and might also prove useful when deciding whether or not you want to pursue legal action against an attacker.

Your response plan should include information about anything that you might want to assess for direct or indirect costs.
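For the costs that can be counted directly, the arithmetic is simple enough to sketch. Every number below is invented for illustration, and indirect costs like lost reputation or customer trust don't reduce to a formula like this.

```python
def incident_cost(downtime_hours, revenue_per_hour, labor_hours, hourly_rate, legal_costs=0):
    # Direct costs only; indirect costs need a separate, qualitative assessment.
    downtime = downtime_hours * revenue_per_hour
    labor = labor_hours * hourly_rate
    return {
        "downtime": downtime,
        "labor": labor,
        "legal": legal_costs,
        "total": downtime + labor + legal_costs,
    }


# Made-up figures: 4 hours offline, 30 hours of response and follow-up work.
estimate = incident_cost(
    downtime_hours=4, revenue_per_hour=50, labor_hours=30, hourly_rate=100, legal_costs=1500
)
print(estimate["total"])  # 4*50 + 30*100 + 1500 = 4700
```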

Reviewing Response and Updating Documentation

After an incident has been dealt with, take some time to review your response. Did it work? Could you have done anything better? As you talk through the whole scenario with your team, make sure you update your documentation and incident response plan for next time.

Conclusion

These types of things are never fun to think about. No one wants things to go wrong. But future you will thank past you for not forcing you to try and figure this all out while it's all going wrong.