Point of Sale Degraded Performance

Incident Report for Dripos

Postmortem

Summary

Yesterday (Sep 2nd, 2023), an incident starting at 12:46 PM EDT caused our database (and subsequently our servers) to overload and stop responding to requests. The incident was caused by an old report, created a few years ago, that a customer happened to run. This report tried to fetch too much data from our primary database, which froze the database for around 14 minutes. During those 14 minutes, Dripos was largely unavailable, although our offline payment system was still able to take payments. Our system was able to restore itself, and customers were back online starting at 1:01 PM EDT.

We understand how much these incidents affect the normal workflow of our customers, and we’re incredibly sorry for the inconvenience to your business, especially on a busy Saturday. Dripos support is always available 24/7 (call or text 781-583-3699, or email support@dripos.com), so if you want to discuss this incident or have feedback or suggestions, please don’t hesitate to reach out.

Timeline

This incident spanned from 12:46 PM to 1:01 PM on Saturday, Sep 2nd. All times are in Eastern Daylight Time (EDT).

12:46 PM - A deprecated product summary report was generated over a large set of data on our primary database.

12:47 PM - Our primary database CPU usage spiked to 100% utilization, which prevented data from being created or retrieved from our database.

12:50 PM - Dripos received its first phone call from a customer reporting that the point of sale was not operating as expected.

12:50 PM - Internal alarms at Dripos started going off, indicating high utilization of our primary database.

12:53 PM - Dripos opened our highest-severity escalation channel and began reaching out to customers regarding the incident.

01:01 PM - Our database began to unfreeze, and data started flowing through again. Customers should have seen operations returning to normal.

01:05 PM - Dripos began reaching back out to customers to let them know systems were back online.

01:09 PM - Our database returned to operating at expected normal levels.

01:22 PM - Dripos pushed a fix to prevent any reports from being run on our primary database.

Analysis

At Dripos, we store our data in things called databases. We have two main types of databases set up:

  1. A primary database to which our data is sent to be created, updated, or deleted. For example, data will be sent to this primary database if an order is created, a payment is validated, or an order is completed.
  2. A secondary database, called a read-only database, which can only read data; it can’t change any of the data stored there. This database is primarily used to run reports and other operations that need to retrieve a lot of data.

Once we send data to our primary database, say the details of an order, the primary database is updated to include that data. Once that data is confirmed to be stored in the primary database, it is then replicated to our read-only database so that it stays up to date.
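For readers curious about what this setup looks like in practice, below is a minimal sketch (not our actual code, and assuming a MySQL-style setup) of how an application can keep two connection pools and route writes to the primary while reports go to the read-only replica. The hostnames, table names, and queries are purely illustrative.

```typescript
import mysql from "mysql2/promise";

// Writes (orders, payments, refunds) go to the primary database.
export const primary = mysql.createPool({
  host: "primary-db.internal.example", // hypothetical hostname
  user: "app",
  password: process.env.DB_PASSWORD,
  database: "dripos",
});

// Heavy reads (reports, exports) go to the read-only replica, so a slow
// report can't slow down or freeze the primary.
export const readReplica = mysql.createPool({
  host: "replica-db.internal.example", // hypothetical hostname
  user: "app_readonly",
  password: process.env.DB_PASSWORD,
  database: "dripos",
});

// Creating an order writes to the primary.
export async function createOrder(ticketId: string, totalCents: number) {
  await primary.query(
    "INSERT INTO orders (ticket_id, total_cents) VALUES (?, ?)",
    [ticketId, totalCents]
  );
}

// A product summary report reads only from the replica.
export async function productSummaryReport(start: Date, end: Date) {
  const [rows] = await readReplica.query(
    `SELECT product_id, SUM(quantity) AS units_sold
       FROM order_items
      WHERE created_at BETWEEN ? AND ?
      GROUP BY product_id`,
    [start, end]
  );
  return rows;
}
```

The key design point is that slow, analytical queries and fast, transactional queries never compete for the same database.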

In this incident, a customer ran a report over a large time period using an outdated reporting method. This report tried to retrieve its data from our primary database, which we rely on to process and maintain orders. These reports are supposed to run on our read-only database, but the outdated, deprecated report was configured to run on the primary instead. The report tried to fetch too much data and took a long time, and that kind of lag can freeze the database for several minutes. Several quarters ago, we released new reports that use our read-only database, because if the read-only database slows down a little, it does not affect other functions of Dripos.

When the primary database ran the report, it locked up and could not create or retrieve other data. This freeze caused payments to fail and the point of sale to go offline. Offline payments could still be taken during this time, but every other part of Dripos was unavailable during the 14 minutes our database was frozen.

Our servers themselves were operating normally, so in some cases customers had devices that showed as online, but because the database couldn’t fetch data, their requests to the server would still fail. Thankfully, our database resolved the problem on its own once the report finished generating. We’re fortunate to have a system that was able to return to normal operation without manual intervention.

As noted above, these reports are supposed to run on our read-only database; the deprecated report was simply misconfigured to use the primary. We’re continuing to audit our server to ensure that no function that requires a lot of data can run on our primary database.
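To give a sense of the kind of safeguard this audit points toward, here is another small sketch, again not our actual code, building on the hypothetical pools from the earlier example. It forces report-style queries onto the replica and puts a hard time cap on SELECT statements that do reach the primary. The MAX_EXECUTION_TIME hint shown is specific to MySQL; other database engines have equivalent settings.

```typescript
import { primary, readReplica } from "./db"; // the hypothetical pools sketched above

type QueryKind = "transactional" | "report";

// A single entry point for database access: report-style work is always
// routed to the read-only replica, and any SELECT that does reach the
// primary gets a hard time limit so one slow query can't freeze it.
export async function runQuery(
  kind: QueryKind,
  sql: string,
  params: unknown[] = []
) {
  if (kind === "report") {
    // Reports are never allowed to touch the primary.
    const [rows] = await readReplica.query(sql, params);
    return rows;
  }

  // MySQL-specific optimizer hint: abort any SELECT on the primary that
  // runs longer than 2 seconds (2000 ms).
  const capped = sql.replace(
    /^\s*SELECT\b/i,
    "SELECT /*+ MAX_EXECUTION_TIME(2000) */"
  );
  const [rows] = await primary.query(capped, params);
  return rows;
}
```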

Conclusion

This was the first critical incident we’ve experienced in the last year. We’re fortunate to have alarms and systems set up that notified us, and in turn allowed us to notify customers, as soon as anything was functioning abnormally. We’re also fortunate that our system was able to automatically resolve the lock within 14 minutes of the incident starting.

While it was the first significant incident in a while, it was an incident nonetheless. The best solution is for these incidents not to happen in the first place. We’ve been working around the clock to find any place in our software that could cause a database lock like this in the future. We’re also doing a deep dive into our escalation channel to continue improving how we communicate within the company during these incidents, and how we communicate what is happening to our customers.

We understand how much these incidents affect the normal workflow of our customers, and we’re incredibly sorry that this incident happened. We strive to have a more stable product than other solutions in our field. We learn a lot from every incident, and we use those learnings to keep growing and improving our company.

Posted Sep 03, 2023 - 10:17 EDT

Resolved

All locations have reported back to routine services. We'll send a post-mortem detailing what went wrong later tonight.
Posted Sep 02, 2023 - 13:33 EDT

Monitoring

All services are back to working normally. We're continuing to make sure everything is good before fully resolving this incident.
Posted Sep 02, 2023 - 13:25 EDT

Identified

We are seeing longer response times from our server, which is causing the point of sale to go offline. We've identified the problem and deployed a fix, which should roll out shortly.
Posted Sep 02, 2023 - 13:11 EDT
This incident affected: Core Server, Point of Sale, and Web Dashboard.