Leveraging Apono and PagerDuty for Effective Incident Response at Labelbox
Ofir Stein
July 15, 2024
Session Overview
This webinar covers how Labelbox used PagerDuty and Apono to build a new solution for resolving critical incidents faster and more securely.
Introduction of Participants
Sharon Kisluk, Director of Product at Apono
“I’m Sharon, the Director of Product at Apono. Today, we want to talk about incident response and how to tackle that issue. Inevitably, every company will experience downtime for their products, and we want to ensure that incident response is lightning fast to avoid violating SLAs and minimize downtime for users and customers. The challenge lies in balancing the need for open-ended access, which facilitates faster incident response, with the best practice of restricting access to prevent potential harm to production.”
Mandi Walls, DevOps Advocate, PagerDuty
“I’m Mandi from PagerDuty. We started as an incident response platform to help responders get to where they need to be during incidents on their platforms. Managing the security and access to all components of your infrastructure is crucial. We’ve seen many customers struggle with this, and it’s great to see a product like Apono streamlining these processes and providing necessary access during incidents.”
Aaron Bacchi, Sr. DevOps Engineer at Labelbox
“Hey, I’m Aaron from Labelbox. We offer a SaaS platform for training AI models, and the data and databases are crucial. I’m on the security team, focusing on cloud configuration and software security. Initially, our incident response for Cloud SQL databases involved allowing all developers access to static service accounts, which wasn’t secure. We needed a better solution.”
Importance of a Break-Glass Solution
The Challenge
Break-glass situations posed a significant challenge for the Labelbox team, which had to balance security with the need to maintain productivity. The team’s shared service accounts for database access were not secure, so it needed a more robust solution.
Aaron’s Insights
“We knew we needed a break-glass system because we faced a critical incident once where a key responder couldn’t gain access. This incident underscored the importance of having a flexible and reliable break-glass solution.”
Mandi’s Insights
“From PagerDuty’s perspective, customers often face challenges with microservices architectures, figuring out access requirements, and ensuring compliance reporting. Having an auditable trail of access during incidents is essential, and flexibility in access management is crucial for resolving production issues.”
Sharon’s Insights
“Managing access to sensitive resources is about balancing risk and response speed. Customer data is highly sensitive, and even minimal access can be risky. During incident response, you need to ensure quick access while preventing overexposure. Apono’s policies allow for this balance, enabling quick access during incidents and allowing responders to bring in additional help as needed without compromising security.”
Moving Towards a Solution
Incident response depends on a robust and flexible access management system. Apono and PagerDuty together address this need: engineers get the access they need when they need it, while strict security controls stay in place. The sections below describe how the approach was implemented and the key components that make it work.
Integration with Apono and PagerDuty
The integration between Apono and PagerDuty is a critical aspect of the solution. By utilizing these tools, Aaron was able to create a seamless flow that enhances incident response efficiency:
- Apono’s Role: Apono allows for the configuration of flows that integrate with PagerDuty. It enables the automatic approval and revocation of access based on pre-defined criteria and schedules.
- PagerDuty’s Role: PagerDuty manages the incident response process, including shift rotations and incident notifications. It ensures that the right personnel are alerted and can take immediate action.
- Combined Workflow: By integrating these tools, Aaron created a workflow where access requests are managed dynamically. Engineers on the PagerDuty rotation can approve access requests directly through Apono, ensuring that only the necessary permissions are granted when needed.
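The routing idea behind the combined workflow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Apono’s or PagerDuty’s actual API: the `Shift` class, the `on_call_approver` and `route_request` functions, and the engineer names are all hypothetical, and the rotation is a hard-coded list rather than data fetched from PagerDuty.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Shift:
    """One on-call shift. A stand-in for a rotation entry, not a PagerDuty object."""
    engineer: str
    start: datetime
    end: datetime


def on_call_approver(shifts, now):
    """Return the engineer whose shift covers `now`, or None if nobody is on call."""
    for shift in shifts:
        # Shifts are treated as half-open intervals: start inclusive, end exclusive.
        if shift.start <= now < shift.end:
            return shift.engineer
    return None


def route_request(requester, shifts, now):
    """Pair an access request made at `now` with the on-call approver."""
    approver = on_call_approver(shifts, now)
    if approver is None:
        raise LookupError("no on-call approver for this time")
    return {"requester": requester, "approver": approver, "requested_at": now}


if __name__ == "__main__":
    utc = timezone.utc
    rotation = [
        Shift("dba-alice", datetime(2024, 7, 15, 0, tzinfo=utc), datetime(2024, 7, 15, 12, tzinfo=utc)),
        Shift("dba-bob", datetime(2024, 7, 15, 12, tzinfo=utc), datetime(2024, 7, 16, 0, tzinfo=utc)),
    ]
    print(route_request("dev-carol", rotation, datetime(2024, 7, 15, 13, tzinfo=utc)))
```

In a real deployment the rotation lookup and the approval itself would be handled by the two products; the sketch only shows why tying approval to the current rotation keeps the approver list from going stale.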
The “Break-Glass” Google Group
One of the core components of the solution is the “break-glass” Google group. This group serves as a temporary access point for engineers who are not part of the regular database admin team but need immediate access during an incident. Here’s how it works:
- Incident Identification: When an incident occurs, the engineer responsible for addressing the issue can request access to the “break-glass” Google group.
- Approval Process: This request is routed through PagerDuty to the on-call database admin, who has the authority to approve the access. This step ensures that only authorized personnel can grant access.
- Temporary Access: Once approved, the engineer is added to the Google group, which grants the permissions needed to interact with the production database. This access is time-bound, typically limited to two hours, to minimize security risks.
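The time-bound membership described above can be modeled with a small in-memory sketch. It makes no Google or Apono API calls; the `BreakGlassGroup` class and its method names are hypothetical, and the only value taken from the talk is the two-hour window.

```python
from datetime import datetime, timedelta, timezone

# From the talk: access is typically limited to two hours.
ACCESS_WINDOW = timedelta(hours=2)


class BreakGlassGroup:
    """In-memory stand-in for the break-glass group (hypothetical, no real API calls)."""

    def __init__(self):
        self._expires_at = {}  # engineer -> expiry time of the current grant

    def grant(self, engineer, approved_at, window=ACCESS_WINDOW):
        """Add the engineer to the group and return when the grant expires."""
        expires = approved_at + window
        self._expires_at[engineer] = expires
        return expires

    def is_member(self, engineer, now):
        """Membership only counts while the grant has not expired."""
        expires = self._expires_at.get(engineer)
        return expires is not None and now < expires

    def revoke_expired(self, now):
        """Remove every grant whose window has passed and return the affected names."""
        expired = sorted(e for e, t in self._expires_at.items() if t <= now)
        for engineer in expired:
            del self._expires_at[engineer]
        return expired


if __name__ == "__main__":
    approved = datetime(2024, 7, 15, 9, 0, tzinfo=timezone.utc)
    group = BreakGlassGroup()
    print("expires:", group.grant("dev-carol", approved))
    print("member after 1h:", group.is_member("dev-carol", approved + timedelta(hours=1)))
    print("revoked after 3h:", group.revoke_expired(approved + timedelta(hours=3)))
```

Checking the expiry in `is_member` as well as in `revoke_expired` means a late cleanup run cannot extend access beyond the window, which is one reasonable way to keep the time limit strict.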
Employee Training and Implementation
Implementing a new system requires proper training and buy-in from the team. Aaron addressed this by conducting bi-weekly tech talks and creating comprehensive documentation:
- Tech Talks: These sessions provided an opportunity to demonstrate the new system to the entire engineering team, showcasing its benefits and how it works in practice.
- Documentation: A detailed Confluence page was created to document the steps and procedures for using the new system. This resource is invaluable for engineers who need to refresh their knowledge or learn about the system for the first time.
Addressing Compliance and Security Concerns
One of the significant advantages of this system is its ability to address compliance and security concerns effectively:
- Auditing: Apono provides a full audit trail of every access request, including who made the request, who approved it, and the exact times of access and revocation. This detailed logging is crucial for compliance reporting and internal audits.
- Compliance: By implementing this system, Aaron’s team can meet stringent compliance requirements. The granular control and auditing capabilities ensure that access to sensitive data is tightly controlled and documented.
- Security: The time-bound nature of the access and the integration with PagerDuty ensure that permissions are only granted when absolutely necessary and are automatically revoked after the incident is resolved.
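To make the audit fields listed above concrete, here is a minimal sketch of what one audit entry might contain. The `AccessAuditRecord` class and its field names are assumptions for illustration; they are not Apono’s actual log schema, and only the four facts named in the text (requester, approver, grant time, revocation time) are modeled.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AccessAuditRecord:
    """One hypothetical audit entry: who asked, who approved, and when access started and ended."""
    requester: str
    approver: str
    granted_at: datetime
    revoked_at: Optional[datetime] = None  # None while access is still active

    def duration_seconds(self):
        """Length of the access window in seconds, or None if not yet revoked."""
        if self.revoked_at is None:
            return None
        return (self.revoked_at - self.granted_at).total_seconds()

    def to_json(self):
        """Serialize with ISO 8601 timestamps so the entry is easy to export for reports."""
        data = asdict(self)
        data["granted_at"] = self.granted_at.isoformat()
        data["revoked_at"] = self.revoked_at.isoformat() if self.revoked_at else None
        return json.dumps(data, sort_keys=True)


if __name__ == "__main__":
    record = AccessAuditRecord(
        requester="dev-carol",
        approver="dba-bob",
        granted_at=datetime(2024, 7, 15, 9, 0, tzinfo=timezone.utc),
        revoked_at=datetime(2024, 7, 15, 11, 0, tzinfo=timezone.utc),
    )
    print(record.to_json())
    print(record.duration_seconds())
```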
Aaron’s innovative approach to incident response and access management demonstrates how combining the right tools and processes can create a secure, efficient, and compliant system. By leveraging the “break-glass” Google group, integrating Apono with PagerDuty, and ensuring thorough training and documentation, Aaron successfully enhanced his team’s ability to respond to incidents quickly and securely. This solution not only improves operational efficiency but also meets the high standards required for auditing and compliance in today’s complex IT environments.