Building a Simple Data Pipeline with Google App Engine: Part 1

This is part 1 of a two-part series.

Part 1 focuses on the setup and structure of a PHP app for App Engine, managing authentication and environment variables for local development, validating that endpoint calls come from App Engine Cron and, finally, logging and alerts.

Data is so important to modern business, and ours is no different. For years we’ve worked towards being a data-driven business and making data-driven decisions for ourselves and our clients. But how do we efficiently process all this data? One thing I’ve learnt is that if the data is not readily available, any good intentions of building reports get lost in the day-to-day melee.

For a while now, we have been relying on Google Data Studio to build out Business Intelligence dashboards for ourselves and our customers. We use StitchData to load Jira, QuickBooks and HubSpot data into our BigQuery warehouse, from where we can build custom views and reports in Data Studio. However, we were missing one vital dataset for which Stitch does not provide an integration: Tempo (time logs and plans). As I am taking the Google Cloud Architect certification, I thought this would be a great opportunity to put into practice what I’d learned so far.

The Solution

I decided to use App Engine as it provides a serverless solution and would allow me to build in a familiar technology: PHP. The solution is based on Google’s sample app, using the Slim Framework, and the logic outline is as follows (a code sketch follows the list):

  1. Hit the Tempo API to extract the required data
  2. Convert it to the NDJSON format expected by BigQuery
  3. Save the file to a Google Cloud Storage Bucket
  4. [Repeat steps 1-3 if multiple “pages” of data are returned by the API]
  5. Import each file into BigQuery
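In sketch form the loop looks something like this. Every function here is a placeholder for pieces built below and in part 2, and I’m assuming Tempo’s usual paged responses with a metadata.next link:

$files = [];
$url = $tempoApiUrl; // e.g. the worklogs endpoint (step 1)

while ($url !== null) {
    $page = fetchTempoPage($url);              // placeholder: call the Tempo API
    // Step 2: NDJSON is simply one JSON object per line.
    $ndjson = implode("\n", array_map('json_encode', $page['results']));
    $files[] = saveToBucket($ndjson);          // placeholder: step 3
    $url = $page['metadata']['next'] ?? null;  // step 4: follow the next page, if any
}

foreach ($files as $file) {
    loadIntoBigQuery($file);                   // placeholder: step 5
}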

Let’s step through the solution:

Setup

I’m assuming you already have a Google Cloud project set up; if not, do that first! Once you’ve done that, you’ll need to initialise App Engine using the CLI:

gcloud app create --project=[YOUR_PROJECT_ID]

Structure

The app is structured as follows:

src/
  controllers/
    main.php
    tempo.php
  BigQuery.php
  Bucket.php
  Logger.php
  Secrets.php
.gcloudignore
.gitignore
app.yaml
composer.json
cron.yaml
index.php
README.md

Most of this is standard PHP stuff, but there are a few special files required by App Engine:

  • .gcloudignore tells Google to ignore certain files when deploying to App Engine. By default it ignores your .git* files and the vendor folder (not needed because App Engine runs composer install for you when you deploy); a trimmed version of the generated default is shown after this list.
  • app.yaml contains app-specific information for App Engine. At a minimum you need to specify the runtime (PHP 7.3 in my case). I also use this file to specify environment variables – more on that later.
  • cron.yaml specifies the cron jobs I want App Engine to run. We’ll talk about that later too.
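For reference, the .gcloudignore that gcloud generates by default for a PHP app looks roughly like this (trimmed):

.gcloudignore
.git
.gitignore

# PHP Composer dependencies:
/vendor/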

Authentication & Environment Variables for Local Development (and app.yaml)

In App Engine the app will run as the App Engine service account, which has access to Cloud Storage and BigQuery by default (if these resources are set up in the same project as your app). In a minute we’ll see how to grant access to Secret Manager too.

However, to run the app locally you will need to create a new service account for yourself with the required permissions on the Cloud Storage bucket, BigQuery and Secret Manager. Make sure you download the JSON file with the credentials.
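If you prefer the CLI, creating the account and downloading its key looks like this (the account name local-dev is illustrative, and the IAM roles themselves are granted separately, in the console or via gcloud projects add-iam-policy-binding):

gcloud iam service-accounts create local-dev --project=[YOUR_PROJECT_ID]

gcloud iam service-accounts keys create ~/local-dev-key.json \
    --iam-account=local-dev@[YOUR_PROJECT_ID].iam.gserviceaccount.com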

The Cloud SDK uses a hierarchy to determine the credentials used:

  1. Use the credentials supplied when an object is instantiated.
  2. Look for credentials in the GOOGLE_APPLICATION_CREDENTIALS environment variable.
  3. Use the credentials of the account under which the app is running.
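For reference, option 1 looks something like this with the PHP client libraries (Cloud Storage shown; the key file path is illustrative):

use Google\Cloud\Storage\StorageClient;

// Option 1: pass credentials explicitly when instantiating the client.
$storage = new StorageClient([
    'projectId'   => getenv('GC_PROJECT'),
    'keyFilePath' => '/path/to/service-account.json',
]);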

In production we are using option 3; locally we are going to use option 2. We can use the following format to pass in environment variables when starting up PHP’s built-in server:

GOOGLE_APPLICATION_CREDENTIALS=/Users/adam/Downloads/NolteMetrics-34cc803fcfc5.json \
   GC_PROJECT="nolte-metrics" \
   GC_BUCKET="nolte-metrics.appspot.com" \
   ALERT_EMAIL="afenton@wearenolte.com" \
   php -d variables_order=EGPCS -S localhost:8085 -t . index.php

Just replace the GOOGLE_APPLICATION_CREDENTIALS value with the path to your JSON file.

The other three variables, GC_PROJECT, GC_BUCKET and ALERT_EMAIL, are read from the app.yaml file in App Engine, but for local use we need to define them ourselves. Here’s my app.yaml file for this project; at its simplest it just needs the runtime and the environment variables:
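runtime: php73

# These mirror the values passed on the command line in the local example above.
env_variables:
  GC_PROJECT: "nolte-metrics"
  GC_BUCKET: "nolte-metrics.appspot.com"
  ALERT_EMAIL: "afenton@wearenolte.com"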

Secrets

I am using Google’s Secret Manager to store my Tempo and Postmark authentication keys. These keys allow complete control of our accounts and are not something we want to reveal in our source code (please don’t)!

In Secret Manager you need to create the new secrets, give them each a name (I used TEMPO_AUTH_TOKEN and POSTMARK_AUTH_TOKEN) and enter the value. You will need to grant your local service account and App Engine’s service account, <project-name>@appspot.gserviceaccount.com, the Secret Manager Secret Accessor permission on each secret individually. Once you’ve done this you can access them in your PHP code using the following function:
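Something like the following works, using the google/cloud-secret-manager package (the helper name and reading the project ID from the GC_PROJECT variable are my own choices):

use Google\Cloud\SecretManager\V1\SecretManagerServiceClient;

// Fetch the latest version of a named secret from Secret Manager.
function get_secret(string $name): string
{
    $client = new SecretManagerServiceClient();
    $version = $client->secretVersionName(getenv('GC_PROJECT'), $name, 'latest');

    return $client->accessSecretVersion($version)->getPayload()->getData();
}

At runtime, get_secret('TEMPO_AUTH_TOKEN') then returns the token without it ever appearing in the codebase.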

The Tempo Controller

I defined a GET controller which runs the job. The first thing I did was define an endpoint URL structure which will allow me to add more datasets in the future. The template of this structure is /[dataset]/[table]{/[optional_additional_params]}. Specifically, for Tempo I used:

/tempo/{table:worklogs|plans|accounts}[/{updatedFrom}]

Where:

  • tempo is the dataset name in BigQuery
  • table is either worklogs, plans or accounts, which are the 3 tables I want to load from the Tempo API. The names match those used by Tempo and also the table names in BigQuery.
  • updatedFrom is an optional parameter used when running the app locally (say, to load all data going back a couple of years, which I wanted to do the first time the job ran).
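Wired up in Slim, the route looks something like this (Slim 4 style; the handler body is just a placeholder for the job):

use Psr\Http\Message\ResponseInterface as Response;
use Psr\Http\Message\ServerRequestInterface as Request;

// GET /tempo/{table}[/{updatedFrom}]: the table name is restricted by the route regex.
$app->get('/tempo/{table:worklogs|plans|accounts}[/{updatedFrom}]',
    function (Request $request, Response $response, array $args) {
        $table = $args['table'];
        $updatedFrom = $args['updatedFrom'] ?? null; // optional backfill start date

        // ...validate the caller, then run the extract/convert/load job...

        return $response;
    });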

Validating Endpoint Calls come from App Engine Cron

The very first thing the controller does is validate the requester. If the app is running in App Engine, I only want App Engine Cron to be able to run it. If it is running locally, anyone can run it. To check this:
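Here’s a sketch of the check, using the PSR-7 request Slim hands to the controller. App Engine strips the X-Appengine-Cron header from external requests, so if it’s present the call really did come from Cron, and the GAE_ENV variable tells us whether we’re on App Engine at all:

// Second-generation App Engine standard runtimes set GAE_ENV to "standard".
$onAppEngine = getenv('GAE_ENV') === 'standard';

// This header is stripped from external requests, so only Cron can set it.
$isCron = $request->getHeaderLine('X-Appengine-Cron') === 'true';

if ($onAppEngine && !$isCron) {
    return $response->withStatus(403); // not Cron: reject the request
}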

More info on App Engine Cron header validation is available in Google’s documentation.

Logging and Alerts

My requirements are to have decent logs in Google Cloud Logging in order to debug and, even more importantly, to be alerted on job failure; the last thing we want is days of missing data due to a silent failure.

The Google SDK provides a LoggingClient which we can easily use to write custom logs. However, Google does not provide a simple way of alerting on cron job failure. The best solution I could come up with was to use try/catch blocks to capture failures and send myself an email via Postmark so I can take action. It’s not perfect, but it should work (I hope)! Here’s the custom Logger class I wrote to encapsulate this:
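In sketch form it looks like the class below. The exact shape is illustrative, but the building blocks are real: the PSR-3 logger from google/cloud-logging, the postmark/postmark-php client and the get_secret() helper from earlier.

use Google\Cloud\Logging\LoggingClient;
use Postmark\PostmarkClient;

class Logger
{
    private $logger;

    public function __construct(string $name = 'pipeline')
    {
        $logging = new LoggingClient(['projectId' => getenv('GC_PROJECT')]);
        $this->logger = $logging->psrLogger($name);
    }

    public function info(string $message): void
    {
        $this->logger->info($message);
    }

    // Log the failure and email an alert via Postmark so it is never silent.
    public function alert(string $message): void
    {
        $this->logger->error($message);

        $postmark = new PostmarkClient(get_secret('POSTMARK_AUTH_TOKEN'));
        $postmark->sendEmail(
            getenv('ALERT_EMAIL'),       // from
            getenv('ALERT_EMAIL'),       // to
            'Data pipeline job failed',  // subject
            $message                     // body
        );
    }
}

The controller then wraps the job in a try/catch and calls $logger->alert($e->getMessage()) in the catch block.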

There are certainly a few things in this article which took me some time to figure out, so I hope this is helpful to you and maybe even saves you some time!

Read Part 2 here

In part 2, I explain how I used the PHP SDK to connect to Cloud Storage and BigQuery, how to set up cron jobs so the pipeline runs automatically, and how to deploy your app.

Let’s launch something!