Records not being sent to destination

Incident Report for Census

Postmortem

Incident Summary

From Nov 28, 2023 until Feb 23, 2024 the Census Sync Engine contained a timing bug which could cause syncs to mark records as successfully synced even though they had not been sent over to the destination. The bug impacted 0.012% of sync runs during the incident and was patched Feb 23, 2024. We will be reaching out to impacted customers with steps to remediate impacted syncs. In most cases running a full sync restores correct record tracking.

Incident Details

Background

The Census Sync Engine runs syncs as a workflow of multiple discrete activities: sync preflight, unload, service load, commit, etc. Historically these activities would run to completion on a single host before scheduling the next one. On Nov 28, 2023 our team introduced a change, referred to from here on as asynchronous activities, which would allow an activity to suspend itself after issuing a query, or a set of queries, to the warehouse via our Query Runner Service. Since certain warehouse queries may take many minutes, this allows for much more efficient utilization of our worker fleet - it allows us to pipeline other activities while waiting for warehouse queries to complete. This pattern is heavily utilized in our unload activity.

Initial Report and Discovery

On February 13, 2024 a customer reported seeing records marked as synced on the UI which could not be found in the sync’s destination service. Our initial investigation seemed to suggest that the query we use to unload data from the warehouse was not producing any files in the cloud storage system (this customer was using our Advanced Sync Engine).

After adding additional telemetry to track down the cause of the failed unload, the team discovered that the unload queries were actually never being issued to the warehouse because the entire query set they were a part of was being cancelled by the Query Runner Service.

Root Cause Analysis

The cause of the query cancellation was a timing bug between two modules of the Query Runner Service: one that supports asynchronous activities, and the query garbage collector [1]. The code that added asynchronous queries to the query execution queue would also mark these queries as ineligible for garbage collection. These two calls did not happen atomically or in a protected block, however. This meant that under periods of high load in the Query Runner Service—which makes extensive use of multi-threading—it was possible for the garbage collector to cancel an asynchronous query before it was opted out of garbage collection. This would occur when all of the following were true:

The asynchronous activity thread would be paused by the thread scheduler after adding the query to the execution list but before adding it to the garbage collection exclusion list.
The query took longer than one minute to execute.
The garbage collector was scheduled to run before the asynchronous activity thread was resumed.

Impact

The incidence rate of the bug impacted 0.012% of all runs and 0.026% of runs with row changes, but had selection effects that made it more likely for certain customers to be impacted:

The bug only affected syncs on the advanced sync engine.
Customers with slow or congested warehouses were more likely to be impacted since the longer queries ran the more likely they were to be garbage collected.
Customers who run lots of similar syncs on the exact same schedule were also more likely to be impacted. These syncs were more likely to issue asynchronous queries at the same time, thus increasing the load on the Query Runner Service and increasing the odds of one of them being selected for garbage collection.

Remediation

Our team has rolled out a fix for the timing issue to prevent further occurrences. In addition, we are putting in place additional safety checks throughout the sync pipeline. While this particular bug was subtle, its effects could easily have been detected by a simple invariant: ensuring that the number of records we unloaded was consistent with the count inside the warehouse.

We take our responsibility as stewards of our customers’ data seriously, and while we strive to deliver that data as quickly and efficiently as possible, we value correctness above all else. In this case we’ve failed to deliver on that promise, and we will be reaching out to impacted customers to offer our full support with remediation options. In most cases running a full sync of the data is sufficient, but we’ll work with customers for cases where that’s not possible or desirable.

If you have any questions about any of the above details don’t hesitate to reach out to your Census representative or to support@getcensus.com.

[1] Query Garbage Collection exists on the Query Runner service to facilitate other query modes: synchronous and polled. It ensures that we’re not running queries that are no longer of interest to the requester.

Posted Mar 06, 2024 - 00:59 UTC

Resolved

Some Advanced Sync Engine syncs are showing successfully synced records that aren't visible in the destination.

Posted Feb 24, 2024 - 00:45 UTC