Things To Think About When Building Data Load Jobs

We all know that data is a major success factor for modern businesses. Many of the systems we build for our clients and ourselves involve loading data from various sources, and it’s usually imperative that these loads produce complete and consistent datasets. After all, incorrect data can be worse than no data at all.

In this article I’m going to talk about the key points I discuss with my team when we’re planning any data loading work. In my experience, any data load that doesn’t consider these factors has a high risk of running into issues at some point…

Batch vs Stream

First let’s define these terms:

  • Batch: a batch job runs periodically and processes a batch of data at once.
  • Stream: a stream “listens” to a data source and updates your dataset in (near) real time.

The first question I ask is about the expectations for the end use(s) of this data. Is it expected to be updated in real time, or would a periodic update work best? For example, a bank’s P&L closing calculations would be expected each morning for the previous day’s trading, based on the previous day’s closing prices. The bank’s traders, however, need to see prices in real time; looking at the previous day’s price, or even prices that are stale by minutes, is not going to allow them to make good decisions.

The next thing to consider is what is possible with the data source. Taking the examples above, meaningful P&L calculations have to be based on the marked price and a number of other data sources, so they cannot be calculated in real time. Stock prices, on the other hand, are available in real time from the exchanges.

Completeness

How do we ensure nothing is missed?

For batch processes that pull data from an API, we need some way of knowing what has changed. Fortunately many APIs provide a way of querying with an updatedFrom parameter, so if we store the latest updated timestamp from the previous batch we can query for anything that’s been updated since. You could also do this based on the time the job runs, eg if your job runs daily at midnight you can pull in anything that was updated since the beginning of the previous day.
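Here’s a minimal sketch of that watermark approach in Python. The endpoint, the updatedFrom parameter name and the upsert() helper are hypothetical stand-ins for whatever your source API and storage layer actually provide:

```python
import json
from datetime import datetime, timezone

import requests

STATE_FILE = "last_run.json"                # where the watermark is persisted
API_URL = "https://api.example.com/orders"  # hypothetical source endpoint

def load_watermark():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["updated_from"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00Z"  # first run: pull everything

def save_watermark(timestamp):
    with open(STATE_FILE, "w") as f:
        json.dump({"updated_from": timestamp}, f)

def run_batch():
    updated_from = load_watermark()
    # Capture "now" BEFORE querying, so records updated while this batch
    # runs are picked up by the next batch rather than silently skipped.
    next_watermark = datetime.now(timezone.utc).isoformat()
    response = requests.get(API_URL, params={"updatedFrom": updated_from})
    response.raise_for_status()
    for record in response.json():
        upsert(record)  # hypothetical idempotent write (see Re-runnability)
    save_watermark(next_watermark)
```

Note that the new watermark is taken before the query runs, not after: it’s better to see the odd record twice than to miss one that was updated mid-batch.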

For batches that read data from a file produced by the source system, we are somewhat reliant on the source to provide the correct set of data in each file. However, there may be some “sense checks” we can run, such as checking that the dates of the entries look right. It is our job to ensure that all files are processed; I recommend archiving processed files (eg by moving them to a different folder) or changing the filename (eg by adding .imported to the name) once all data from the file has been successfully loaded. This is also going to make your life easier if something goes wrong!
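For example, a loop along these lines (the folder paths and the load_file() helper are hypothetical) only archives a file once it has been fully loaded:

```python
import shutil
from pathlib import Path

INBOX = Path("/data/inbox")      # hypothetical drop folder for source files
ARCHIVE = Path("/data/archive")  # processed files end up here

def process_pending_files():
    for path in sorted(INBOX.glob("*.csv")):
        load_file(path)  # hypothetical loader; raises on any failure
        # Archive only after a fully successful load, so a crash leaves
        # the file in the inbox to be retried on the next run.
        shutil.move(str(path), str(ARCHIVE / path.name))
```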

For stream jobs I advise using a robust pub/sub queue system, such as MQ, GCP Pub/Sub or Laravel Queues. There are lots of options here; the right one depends largely on the platform you are using and the volume of data you intend to process. The queue will ensure that each message is either processed successfully or handled on failure (eg retried, dead-lettered or flagged so you’re notified).
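To make that concrete, here’s a sketch of a GCP Pub/Sub consumer in Python; the project name, subscription name and process() function are hypothetical. The important detail is that a message is only acknowledged after it has been processed successfully:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "trades-sub")

def callback(message):
    try:
        process(message.data)  # hypothetical processing step
        message.ack()          # ack only after successful processing
    except Exception:
        # nack => the queue redelivers the message (and can dead-letter
        # or alert after repeated failures, depending on configuration).
        message.nack()

future = subscriber.subscribe(subscription_path, callback=callback)
future.result()  # block and keep consuming
```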

Re-runnability

The next thing to consider is that your data load must be re-runnable: if it runs more than once for a particular item, it won’t cause a problem such as a failure or duplicated data. A job can end up re-running for a number of reasons, eg:

  • a batch job fails part way through;
  • some of the source data was found to be incorrect and needs to be reloaded manually;
  • the data processor processed a queue item but failed before notifying the queue of the successful run.

(Top tip: make sure you leave yourself an easy way to run your job manually, including ways to specify constraints such as the date range to process.)

Two broad approaches to this are:

Check first: it may be feasible to check whether an item exists before inserting or updating a record, eg by performing a lookup on the primary key. This is often necessary in transactional systems where duplicates will cause issues, but in, say, a data warehousing situation it may not be efficient to do this (see the upsert sketch after these two approaches).

Deal with it: design your system in such a way that it can accept duplicate records. In the data warehousing example we can do this by adding a load timestamp to each record and building a view that exposes only the most recent version (see the BigQuery window-function sketch below).
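As a sketch of the “check first” style, a database-level upsert folds the existence check into the write itself, making it idempotent. This example uses SQLite purely for illustration; the table and record shape are hypothetical:

```python
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders "
    "(id TEXT PRIMARY KEY, amount REAL, updated_at TEXT)"
)

def upsert(record):
    # ON CONFLICT turns a duplicate insert into an update, so re-running
    # the job over the same records never creates duplicates.
    conn.execute(
        """
        INSERT INTO orders (id, amount, updated_at)
        VALUES (:id, :amount, :updated_at)
        ON CONFLICT(id) DO UPDATE SET
            amount = excluded.amount,
            updated_at = excluded.updated_at
        """,
        record,
    )
    conn.commit()
```

Running upsert({"id": "A1", "amount": 10.0, "updated_at": "2024-01-01T00:00:00Z"}) twice leaves exactly one row.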
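And for the “deal with it” style, here’s a sketch of a BigQuery view that keeps only the latest load of each record via a window function (the dataset, table and column names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# The raw table is allowed to accumulate duplicates; the view exposes
# only the row with the most recent load_timestamp for each id.
client.query(
    """
    CREATE OR REPLACE VIEW analytics.orders_latest AS
    SELECT * EXCEPT(row_num)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (
          PARTITION BY id
          ORDER BY load_timestamp DESC
        ) AS row_num
      FROM analytics.orders_raw
    )
    WHERE row_num = 1
    """
).result()
```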

Logging & Alerting

Be kind to yourself. Do you want to spend your nights and weekends guessing what went wrong with your job? Do you want to be the person who had the whole company looking at incorrect data for 3 months due to a silent failure (I’ve been there 🙁 )?

As you are building your load job, you need to think hard about what information you’d like to have when you need to debug it. Data processing jobs can be notoriously hard to debug because they may be complex and often take significant amounts of time to run. So make sure you log what happened, as much as is feasible. What files were loaded and when? How many records were loaded? What steps were completed? If you can save the source data for a period of time, eg by storing it as an object in a cloud bucket, this will help you too (next top tip: make sure you have a data rotation policy set up, the last thing you want is for jobs to fail due to a lack of disk space or for your cloud costs to keep going up and up forever).
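A minimal sketch of what that can look like with Python’s standard logging module (the load_file() helper is hypothetical):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("order_load")

def run_batch(files):
    log.info("Batch starting: %d file(s) pending", len(files))
    for path in files:
        records = load_file(path)  # hypothetical loader
        log.info("Loaded %d record(s) from %s", len(records), path)
    log.info("Batch complete")
```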

All this logging is great, but what good is it if you are never told when something has gone wrong? You need to have an alerting system set up, which could be as simple as configuring your queue to, say, retry 3 times and then alert you. You may need to build in an email or other integration (for emails you should use a transactional email API like Postmark to ensure they don’t end up in spam). The key here is to identify when a failure occurs: catch exceptions then log and alert, or check the return codes of API calls and log/alert if they are not as expected. You can also set up alerts based on expectations about the data, eg is it ever expected that 0 records are loaded on any run?
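For example, a simple retry-then-alert wrapper might look like the sketch below. The send_alert() function is a placeholder for whatever channel you choose (transactional email, chat webhook, a paging system):

```python
import logging
import time

log = logging.getLogger("order_load")

def send_alert(message):
    # Placeholder: wire this up to your transactional email API,
    # chat webhook or paging system.
    log.critical("ALERT: %s", message)

def run_with_alerting(job, retries=3, delay_seconds=60):
    for attempt in range(1, retries + 1):
        try:
            job()
            return
        except Exception as exc:
            log.exception("Attempt %d/%d failed", attempt, retries)
            if attempt == retries:
                send_alert(f"Job failed after {retries} attempts: {exc}")
                raise
            time.sleep(delay_seconds)
```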

The more you log (without breaking performance or logging sensitive data such as PII or passwords!) and alert, the better. I can’t stress enough how important logging and alerting are. It’s the part I most often see people “leaving to the end” (and then never getting round to), and it’s almost inevitable that something will go wrong at some point. Don’t be that person!

Conclusion

I wrote this post because I found myself having similar conversations with my engineering team every time we build something like this. The few key pointers above should stand you in good stead.

Do comment below or find me on Twitter if you have any thoughts or questions.

Let’s launch something!