Building a Simple Data Pipeline with Google App Engine: Part 2

This is part 2 of a two-part series (part 1 is here).

Part 2 focuses on how I used the PHP SDK to connect to Cloud Storage and BigQuery, how to set up cron jobs to make the app run automatically, and how to deploy it.

In the first part I explained how to set up and structure a PHP app for App Engine, how to manage authentication and environment variables for local development, how to validate that endpoint calls come from App Engine Cron and, finally, how to handle logging and alerts. Now we’re going to dig into the meat of the app logic…

The Google PHP SDK

Google provides a PHP SDK that abstracts communication with Google Cloud via its REST API. The easiest way to add it to your project is with Composer; this is what my composer.json file looks like:
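A minimal composer.json along these lines pulls in the two client libraries used in this article (the version constraints are illustrative, not the originals):

```json
{
    "require": {
        "google/cloud-storage": "^1.9",
        "google/cloud-bigquery": "^1.4"
    }
}
```

Running `composer install` then generates the autoloader that the app requires at startup.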

Cloud Storage

I wrapped the Cloud Storage SDK in the Bucket object below:
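A sketch of what that wrapper can look like, reconstructed to match the description that follows (method bodies and option details are illustrative, not the original source):

```php
<?php
// Illustrative sketch of the Bucket wrapper described below.
use Google\Cloud\Storage\StorageClient;

class Bucket
{
    private $bucket;

    public function __construct(string $bucketName)
    {
        // StorageClient picks up credentials per the hierarchy from part 1.
        $storage = new StorageClient();
        $this->bucket = $storage->bucket($bucketName);
    }

    public function writeStringToGcs(string $contents, string $objectName)
    {
        // App Engine's tempnam implementation writes to its temporary disk space.
        $tmpFile = tempnam(sys_get_temp_dir(), 'gcs');
        file_put_contents($tmpFile, $contents);

        // Upload the file stream as a new object, using the predefined
        // projectPrivate ACL so access mirrors users' access to the project.
        return $this->bucket->upload(
            fopen($tmpFile, 'r'),
            [
                'name' => $objectName,
                'predefinedAcl' => 'projectPrivate',
            ]
        );
    }
}
```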

The constructor instantiates the StorageClient (remember it will use the credentials per the hierarchy described in part 1) and sets the bucket object.

The writeStringToGcs() method takes a string and writes it to a new object in the bucket. Let’s step through this:

  • First we have to write the string to a temporary file on the App Engine server. App Engine has its own implementation of the PHP tempnam function, which will write the file to a temporary disk space.
  • Then we use fopen to open the file stream for reading.
  • And finally, using the bucket object’s upload function, we upload the file to an object in the bucket. I’m using the predefined projectPrivate ACL so that access to the file reflects users’ access to the GCP project.


BigQuery

In the same way, I wrapped the BigQuery SDK in a BigQuery class:
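A sketch of what this wrapper can look like, following the steps described below (the dataset handling and error checks are illustrative, not the original source):

```php
<?php
// Illustrative sketch of the BigQuery wrapper described below.
use Google\Cloud\BigQuery\BigQueryClient;
use Google\Cloud\Core\ExponentialBackoff;

class BigQuery
{
    private $dataset;

    public function __construct(string $datasetId)
    {
        $bigQuery = new BigQueryClient();
        $this->dataset = $bigQuery->dataset($datasetId);
    }

    public function importFile(string $tableId, string $gcsUri)
    {
        // Get the table object.
        $table = $this->dataset->table($tableId);

        // Load newline-delimited JSON from Cloud Storage, appending
        // to the table without touching existing records.
        $loadConfig = $table->loadFromStorage($gcsUri)
            ->sourceFormat('NEWLINE_DELIMITED_JSON')
            ->writeDisposition('WRITE_APPEND');
        $job = $table->runJob($loadConfig);

        // Poll the job until it completes, throwing if it times out.
        $backoff = new ExponentialBackoff(10);
        $backoff->execute(function () use ($job) {
            $job->reload();
            if (!$job->isComplete()) {
                throw new Exception('Load job has not yet completed', 500);
            }
        });

        // Check the job succeeded, throwing if it did not.
        if (isset($job->info()['status']['errorResult'])) {
            $error = $job->info()['status']['errorResult']['message'];
            throw new Exception("Load job failed: $error");
        }
    }
}
```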

Let’s take a closer look at the importFile method:

  • First we get the table object.
  • Then we create the job using the loadFromStorage function. The source format is NEWLINE_DELIMITED_JSON and we want to WRITE_APPEND to the table (i.e. add records without touching the existing ones).
  • We then poll the job to check that it completed, throwing an exception if it times out.
  • Finally we check it succeeded, again throwing an exception if it did not.

Scheduling the jobs

When using App Engine, Google Cloud offers two alternatives for job scheduling:

  1. Cloud Scheduler is a general-purpose scheduler that can run jobs against a range of targets: any HTTP request, an App Engine HTTP request or a message published to a Pub/Sub queue. Scheduling is configured using the standard UNIX cron format.
  2. App Engine Cron is specific to App Engine and is configured in a YAML file within your project. I chose this option to keep everything together in one place, which should make maintenance easier and help anyone (my future self included) understand what’s going on.

Here is my cron.yaml file:
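A representative version looks like this (the endpoint path and description are hypothetical; the schedule is the one discussed below):

```yaml
cron:
- description: daily data import
  url: /import
  schedule: every day 00:01
```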

You can see it accepts only the path after the domain of the URL, as it will always run against the current app. The schedule is set using an English-like format, e.g. every day 00:01 will run daily at one minute past midnight UTC (you can also specify a timezone if required, but I wanted UTC as the Tempo timestamps are also in UTC). I’m unclear why App Engine Cron uses this English-like format instead of the more common UNIX cron format used by Cloud Scheduler, but there you go.


Deploying the App

Once your app is ready to go, you can deploy it via the command line using the following command:

gcloud --project nolte-metrics app deploy app.yaml cron.yaml

This will deploy the files in the current folder to the app in the nolte-metrics project. You need to specify the app.yaml and cron.yaml files that you want to use. If all went well your dashboard should show something like this:

And the Cron jobs tab will show your jobs:

Putting it All Together

I’ve published the complete version of the app to GitHub; feel free to clone or copy it and adapt it to your needs.
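As a rough illustration of how the pieces described in this series fit together, an endpoint handler might call the two wrappers like this (the bucket, dataset, table and object names here are all hypothetical):

```php
<?php
// Hypothetical glue code tying the Bucket and BigQuery wrappers together.
// One newline-delimited JSON record, as expected by the load job.
$records = json_encode(['issue' => 'ABC-1', 'hours' => 2]) . "\n";

// Write the records to an object in Cloud Storage...
$bucket = new Bucket('my-pipeline-bucket');
$bucket->writeStringToGcs($records, 'worklogs.json');

// ...then append them to the BigQuery table from that object.
$bigQuery = new BigQuery('my_dataset');
$bigQuery->importFile('worklogs', 'gs://my-pipeline-bucket/worklogs.json');
```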

You can reach me by commenting below, on Twitter (@adamf321) or by email. I’d love to hear your feedback, questions or suggestions for future articles.

Let’s launch something!