System Slowdown/Offline

Incident Report for Dripos

Postmortem

Summary

During our release last night, we made a change that behaved differently on our hosting provider than it did in our local development environments. As a result, a few customers saw longer-than-normal boot-up times on their point of sale, and in some cases the tablet stayed offline.

The core of the problem was slow code that fetched pending tickets on point-of-sale boot. Our tablets display as offline until we fully fetch all pending tickets, so the delay led to long wait times until tablets came online. All other Dripos services were confirmed to be working, and most customers were not affected by this incident.

Timeline

This incident primarily spanned from 10:47 AM to 1:01 PM on Tuesday, October 24. All times are in Eastern Daylight Time (EDT).

10:47 AM - A few affected customers restarted their tablets and noticed extended connection times or connection failures.

10:58 AM - Internal alarms and customer reports alerted our engineering team to a potential problem loading ticket data.

11:05 AM - Dripos deployed a fix that prevents the affected code from locking while it fetches tickets.

11:09 AM - Affected customers returned to normal operation.

Analysis

As stated in the summary, the main cause of this incident was the code that fetches tickets. During point-of-sale boot, the system fetches every pending ticket. We made a change to this fetch that behaved differently in production than it did in our testing environments. Large numbers of pending tickets made the slowdown worse, so the effects were most noticeable for high-volume businesses.

To resolve the issue, affected customers had to restart their tablets. Upon restart, the tablets called the fixed code and booted up normally.
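
The boot behavior described in this analysis can be illustrated with a small sketch. It is not the Dripos implementation: the `Tablet` class, the `boot` and `make_backend` functions, and the idea of fetching in pages of 50 are all hypothetical, chosen only to show why a tablet that stays offline until every pending ticket is loaded takes longer to come online as the number of pending tickets grows.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tablet:
    """Hypothetical model of a point-of-sale tablet during boot."""
    status: str = "offline"
    tickets: List[int] = field(default_factory=list)


def boot(tablet: Tablet, fetch_page: Callable[[int], List[int]]) -> int:
    """Fetch pending tickets page by page and return the number of fetch calls.

    The tablet stays "offline" until an empty page signals that every
    pending ticket has been loaded (assumed behavior, for illustration).
    """
    calls = 0
    offset = 0
    while True:
        page = fetch_page(offset)
        calls += 1
        if not page:
            break
        tablet.tickets.extend(page)
        offset += len(page)
    tablet.status = "online"
    return calls


def make_backend(pending: List[int], page_size: int = 50) -> Callable[[int], List[int]]:
    """Build a fake ticket backend that serves `pending` in fixed-size pages."""
    def fetch_page(offset: int) -> List[int]:
        return pending[offset:offset + page_size]
    return fetch_page


if __name__ == "__main__":
    small_shop = Tablet()
    busy_shop = Tablet()
    print(boot(small_shop, make_backend(list(range(20)))))   # few pending tickets
    print(boot(busy_shop, make_backend(list(range(500)))))   # many pending tickets
```

In this sketch, any per-call slowdown is multiplied by the number of calls, which is one way a change that looks harmless with a handful of test tickets could hit a high-volume location much harder.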

Conclusion

We are fortunate to have extensive monitoring infrastructure that catches these issues early so we can resolve them quickly. We were able to identify the issue and push a fix in 18 minutes, with full resolution in 22 minutes.

While the incident didn’t have lasting impacts, we still consider any outage unacceptable. We are working tirelessly to improve our testing to prevent issues in the future. We also continue to develop our escalation processes to improve our customer communication during incidents.

We sympathize deeply with our customers whenever an incident occurs, as we know it causes real problems for family-owned and enterprise businesses. We strive to be a market leader in stability and customer communication. We continue to rigorously develop testing infrastructure and escalation procedures to prevent future incidents.

Posted Oct 25, 2023 - 10:05 EDT

Resolved

We have confirmed that all customers who reached out are operating normally again. The problem only occurred when a point of sale was restarted during the affected window (around 10:30 AM to 11:15 AM EDT), in which case the point of sale would appear offline. We'll follow up later tonight with a post-mortem containing full information about what happened.

We're sorry for any disruption to your workflows. Please reach out to our support if you have any questions!

Posted Oct 24, 2023 - 11:44 EDT

Monitoring

We found and fixed a database slowdown that may have caused some tablets to go offline. If you are having connectivity problems, please reach out to our support by text or call at (781) 583-3699.

Posted Oct 24, 2023 - 11:28 EDT

This incident affected: Core Server and Point of Sale.