How Google Cloud Storage processes cloud functions 4x faster than Amazon Web Services

Research Report, July 30, 2019

Abstract

Managed storage is an important part of many application architectures, but especially for those that use serverless functions.

Using distributed tracing, we investigated object retrieval times from regional buckets with Google Cloud Storage and AWS S3 and found that Google Cloud Functions’ reusable connection insertion makes requests more than four times faster both in-region and cross-region.

Updates:

8/13: Clarifications section added

Context

Serverless (and specifically cloud functions) promises to simplify how software gets developed and deployed. By implementing a business process or user interaction as a specific function or set of functions, without the need to manage a full operating system stack, it should be easier to provide value quickly. The functions themselves are considered “stateless” as they can be called at any time and are expected to perform their transformation in the same way.

To understand concerns expressed about this model, it helps to take a general look at what programs are intended to do. Generally, the goal of a program or function is to take data and turn it into other data or actions. When we look at a “stateless function” then we have to ask, “but where does the data come from and where does it go?” In other stateless architectures like 12-factor apps, the answer is a cache, database, or other datastore accessed over the network. The same applies to cloud functions. Either the full data to be processed has to be passed into the function and returned by the function, tightly coupling it to a particular shape of data, or it needs to be read from and stored to another datastore accessed over a network.

Because the performance, and hence the cost, of the function is influenced by network latency, it’s important to understand how interactions with the network will impact the total efficiency, cost, and effectiveness of a cloud function. LightStep Research chose this as an initial research project to answer a specific question: when functions interact with object stores, can they be fast and cheap while remaining simple? The answer depends on which platform provider you choose: in an initial set of tests we found a 4x difference in performance.

Connection Pooling and Cloud Functions

Load balancers, reverse proxies, and RPC libraries have implemented connection pooling to address the well-understood problem of network connection latency, which has been amplified by the general use of TLS. Connections are typically opened to peers or servers at startup and reused throughout the process lifecycle. Lost connections, or demand that the pool cannot meet, lead to additional connection negotiations and more pooled open connections; too many idle connections result in pooled connections being closed and removed.

The goal is to use a long process lifecycle to amortize the cost of network connection latency.
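
In Go, the language used for the function in this report, these pooling behaviors surface as knobs on the standard library’s http.Transport. Below is a minimal sketch of a pooled client; the values are illustrative, not configuration taken from this experiment.

package main

import (
    "net/http"
    "time"
)

// A shared client whose transport pools connections: reused across
// requests, it amortizes DNS, dial, and TLS costs over the process lifecycle.
var client = &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,              // total idle connections kept in the pool
        MaxIdleConnsPerHost: 10,               // idle connections kept per destination
        IdleConnTimeout:     90 * time.Second, // idle connections are closed after this
        TLSHandshakeTimeout: 10 * time.Second,
    },
}

func main() {
    // The second request to the same host reuses the first one's connection.
    for i := 0; i < 2; i++ {
        resp, err := client.Get("https://storage.googleapis.com")
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
    }
}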

To put some specific numbers around this, data accessed from REST APIs secured with TLS generally requires three network round trips to set up the connection (TCP SYN, TLS Client Hello, TLS Change Cipher Spec) and at least one additional round trip for the data to be retrieved (HTTP GET). In the same datacenter, this means total connection times are in the single to double digit milliseconds. If the data is in another datacenter or another cloud provider, a round trip will often take tens to hundreds of milliseconds.
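
As a rough illustration with assumed numbers (not measurements from this report): at a 70 ms cross-region round trip, connection setup alone costs about 3 × 70 = 210 ms, and the HTTP GET adds a fourth round trip for roughly 280 ms in total, while the same request over an already-open connection pays only the final 70 ms.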

With cloud functions, where there isn’t a process lifecycle to manage (by design), these tools become ineffective. Though one can usually have an initialization function, the lifetime of that particular instance of the function is not known nor is the concurrency that it may be called with (if any). Without lifetime and concurrency information connection pooling cannot be used effectively.
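
A common mitigation is to initialize the client at package scope so that invocations landing on a warm instance reuse its pooled connections; whether this helps still depends on instance lifetime and concurrency, which the platform controls. Below is a minimal hypothetical Go sketch of this pattern (bucket, object, and function names are placeholders, not the code used for this report).

package objcheck

import (
    "context"
    "io"
    "net/http"

    "cloud.google.com/go/storage"
)

// client is created once per function instance rather than once per
// invocation, so warm invocations can reuse its pooled TCP/TLS connections.
var client *storage.Client

func init() {
    var err error
    // context.Background() because the client outlives any single request.
    client, err = storage.NewClient(context.Background())
    if err != nil {
        panic(err)
    }
}

// Handle is a hypothetical HTTP-triggered function that reads one object.
func Handle(w http.ResponseWriter, r *http.Request) {
    rc, err := client.Bucket("example-bucket").Object("10_1_1k.obj").NewReader(r.Context())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    defer rc.Close()
    io.Copy(w, rc) // stream the object body back to the caller
}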

When deciding if and how to integrate cloud functions into your architectural portfolio, consider the SLOs you have for a particular function. If the function is expected to complete within a few hundred milliseconds but includes data from a cloud datastore, the default assumption should be that it will not be able to hit that target latency, even with very minimal computational interaction with the data. Most importantly, creating an observable prototype for the particular interaction with a database, datastore, or cache is crucial.

Empirical Results

Our hypothesis was that the performance of GCS and S3 buckets in similar geographic regions would be similar, with GCS performance local to the originating function being slightly faster since it would stay within Google’s network. We expected the initial connection to take longer due to DNS, TCP connection initiation, and TLS negotiation before the HTTP request. Subsequent requests in the same session would reuse the connection and only have the HTTP request latency. Furthermore, we expected cross region latency to be significantly higher based on minimums for speed of light plus general network congestion overhead. None of these hypotheses were found to be true.

As mentioned above, LightStep Research sought to understand the performance characteristics of object stores used from cloud functions. For incidental reasons, it was easier to try Google Cloud Functions as an initial platform for assessment. To show the impact of cross-cloud and cross-region latencies, Google Cloud Storage and AWS S3, both using regional buckets, were chosen for the initial testing. To get detailed information about the performance of HTTP connections in Go, the chosen language, the AWS X-Ray Go SDK was ported to OpenTracing on an ad-hoc basis. The cloud function was written to receive an HTTP request as a trigger and trace 50 requests to fetch an object in either an S3 or GCS bucket in a specific region.
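
The shape of that instrumentation can be sketched with Go’s net/http/httptrace package, which exposes the DNS, dial, TLS, and connection-reuse events that the connect spans in this report are built from. The standalone sketch below simply prints phase durations; the actual function wraps these callbacks in OpenTracing spans instead.

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "net/http/httptrace"
    "time"
)

func main() {
    var dnsStart, connStart, tlsStart time.Time

    // Callbacks fire as the transport walks through connection setup;
    // GotConn reports whether an existing connection was reused.
    trace := &httptrace.ClientTrace{
        DNSStart:          func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
        DNSDone:           func(httptrace.DNSDoneInfo) { fmt.Println("dns:", time.Since(dnsStart)) },
        ConnectStart:      func(_, _ string) { connStart = time.Now() },
        ConnectDone:       func(_, _ string, _ error) { fmt.Println("dial:", time.Since(connStart)) },
        TLSHandshakeStart: func() { tlsStart = time.Now() },
        TLSHandshakeDone:  func(_ tls.ConnectionState, _ error) { fmt.Println("tls:", time.Since(tlsStart)) },
        GotConn:           func(info httptrace.GotConnInfo) { fmt.Println("reused:", info.Reused) },
    }

    req, err := http.NewRequest("GET", "https://storage.googleapis.com", nil)
    if err != nil {
        panic(err)
    }
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    start := time.Now()
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    resp.Body.Close()
    fmt.Println("total:", time.Since(start))
}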

The initial results for S3 looked as expected; see Figure 1. Each initial “connect” span included time for DNS, dial (TCP connection), and TLS negotiation. In later use of the SDK for requests (Figure 2), the connect span was very short, and tags showed that a connection had been reused automatically.

Figure 1

Figure 2

The initial results for GCS were surprising; see Figure 3. Even for the first “connect” span, the tags showed reuse, and there were no DNS, dial, or TLS spans.

Figure 3

We found that API calls could be fast inside functions: the GCS SDK, when running in Google Cloud Functions, was somehow reusing connections established before the function started, outside the context of the function.

         In Region Connect   Cross Region Connect
GCS      0.008 ms            0.008 ms
S3       118 ms              701 ms

Looking at overall statistics, we found that “first connection reuse” varied by region and happened between 82% and 85% of the time. When Google Cloud Functions’ reuse scheme failed to find a reusable connection, the connection profile was still different due to differences between AWS and Google approaches to regionality.

AWS cross-region connections will cross the internet to get to the destination region. Figure 4 shows a snapshot analysis of AWS connections that did not have reuse, by origin region. The three round trips happen at cross-region internet latencies, resulting in connect latencies in the hundreds of milliseconds.

Figure 4

Google requests for regional buckets will still hit a front end in the same region as the cloud function. Figure 5 shows a snapshot analysis of GCS connections that did not have reuse, by origin region. The three round trips all happen at local latencies, resulting in connect latencies in the tens of milliseconds.

Figure 5

Going back to the initial hypothesis, that GCS would be slightly faster than S3 both in-region and cross-region due to networks, we analyzed just the HTTP request/response portion of the process. We found that, in this snapshot, same-region latency to GCS was 4 ms faster. For cross-region latency, S3 was faster by 21 ms.

         In Region Response   Cross Region Response
GCS      34 ms                229 ms
S3       38 ms                208 ms

Bringing together the complete analysis: for functions making a single API request or multiple parallel API requests to an object store without connection reuse, Google Cloud Functions’ reusable connection insertion makes the requests more than 4 times faster both in-region and cross-region.

         In Region Total   Cross Region Total
GCS      37 ms             216 ms
S3       156 ms            909 ms

The goal of LightStep Research is to produce reproducible experiments. To further that goal, the source code for the cloud function that was used to gather this information can be found on GitHub. Detailed information about setting up the Google Cloud Platform buckets, IAM, and cloud functions, AWS bucket and IAM, and LightStep Tracing can be found in the reproduction instructions below. We welcome and will publish or cross publish reproductions and original related research.

Reproduction Instructions

The following instructions will step you through the process of setting up the cloud provider environments and deploying the code necessary to gather the information that led to the above conclusions. In less than an hour you should be able to see the same results, as well as have ongoing telemetry similar to what was used in the Google’s June 2nd Outage blog post. Please share reproduction results or ask questions via research@lightstep.com or @LightStepLabs on Twitter.

Clone GitHub Repo

Clone the source code from LightStep Research’s objcheck repo at https://github.com/lightstep-research/objcheck.

git clone https://github.com/lightstep-research/objcheck.git

Set Basic Environment Variables

The shell scripts below use GCP_PROJECT to specify which Google Cloud Platform project to use for Cloud Storage, Cloud Functions, Cloud Scheduler, and App Engine. They use BUCKET_PREFIX to generate unique bucket and function names. The associated GCP project will need billing enabled.

export GCP_PROJECT="<your project name here>"
export BUCKET_PREFIX="<your bucket prefix here>"

Set Up Object Pool

The Cloud Function code pulls objects randomly from a pool to control for caching effects. In this version all objects will, on average, be pulled several times; later research will show how this can change. Each object is 1 KB of random data, named in the format below.

<pool size>_<object order>_<size>.obj
for ((i = 1; i < 11; i++)); do dd if=/dev/urandom of=10_${i}_1k.obj bs=1k count=1; done

Google Cloud Storage Setup

IAM

Creating a service account allows us to grant the Cloud Function read permissions on the Cloud Storage buckets. The service account is named the same as our bucket prefix.

gcloud iam service-accounts create $BUCKET_PREFIX --project $GCP_PROJECT

Buckets

Create regional buckets in 4 regions, allowing us to understand how performance varies for functions accessing objects in diverse locations. Because bucket names must be globally unique, the name of each bucket is the prefix plus the region.

for bucket_region in "us-central1" "us-east1" "asia-east2" "europe-west2";
do
    gsutil mb -p $GCP_PROJECT -c regional -l $bucket_region -b on gs://$BUCKET_PREFIX-$bucket_region
done

For each of the buckets we need to give the service account object read permissions (“objectViewer”).

for bucket_region in "us-central1" "us-east1" "asia-east2" "europe-west2";
do
    gsutil iam ch serviceAccount:$BUCKET_PREFIX@$GCP_PROJECT.iam.gserviceaccount.com:objectViewer gs://$BUCKET_PREFIX-$bucket_region
done

Objects

We then upload the 10 random 1 KB files to each of the regional buckets.

for bucket_region in "us-central1" "us-east1" "asia-east2" "europe-west2";
do
    gsutil -m cp 10_*_1k.obj gs://$BUCKET_PREFIX-$bucket_region
done

AWS S3 Setup

IAM

We create an IAM user for the Cloud Function to use to access the S3 buckets. The user, the group, and the policy are all named with the bucket prefix. The IAM policy below gives access to all regionally suffixed buckets and their contents. If you’re running this in a production account, make sure the prefix doesn’t overlap with anything it shouldn’t.

aws iam create-user --user-name $BUCKET_PREFIX
aws iam create-access-key --user-name $BUCKET_PREFIX # Save key for function setup

aws iam create-group --group-name $BUCKET_PREFIX
aws iam add-user-to-group --group-name $BUCKET_PREFIX --user-name $BUCKET_PREFIX

aws iam create-policy --policy-name $BUCKET_PREFIX --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::'$BUCKET_PREFIX'-*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::'$BUCKET_PREFIX'-*/*"
        }
    ]
}'

aws iam attach-group-policy --group-name $BUCKET_PREFIX --policy-arn <arn from last step>

Buckets

We next create S3 buckets in 4 regions with prefix-region naming.

for bucket_region in "us-east-1" "us-east-2" "us-west-2" "eu-west-2";
do
    aws s3 mb s3://$BUCKET_PREFIX-$bucket_region --region $bucket_region
done

Objects

Then we upload the 10 random 1 KB data files to all the buckets.

for bucket_region in "us-east-1" "us-east-2" "us-west-2" "eu-west-2";
do
    aws s3 cp . s3://$BUCKET_PREFIX-$bucket_region \
        --recursive \
        --exclude "*" \
        --include "10_*_1k.obj"
done

LightStep Tracing Setup

The directions and analysis use LightStep [x]PM to collect the spans from the Cloud Function and analyze them with the Trace Analysis feature in Explorer. The LightStep tracer usage in the function can be replaced with any OpenTracing tracer and analysis done in other OSS or proprietary systems. We welcome these reproductions of results as well.

You can sign up for a LightStep [x]PM Free Trial here, then follow the link in the email to complete signup.

Retrieve the Project Access Token from the Project Settings page and paste it into the LS_ACCESS_TOKEN environment variable below.

Google Cloud Functions Deployment

Next we deploy the Go application as a Cloud Function on GCP in the same 4 regions as the buckets. This may prompt you to enable Cloud Functions in the project to continue.

export LS_ACCESS_TOKEN="<access token from LightStep>"
export AWS_S3_ACCESS="<access key id from AWS S3 user>"
export AWS_S3_SECRET="<secret access key from AWS S3 user>"

for function_region in "us-central1" "us-east1" "asia-east2" "europe-west2";
do
    gcloud functions deploy --project $GCP_PROJECT ObjCheck \
        --runtime go111 \
        --trigger-http \
        --set-env-vars BUCKET_PREFIX=$BUCKET_PREFIX \
        --set-env-vars LS_ACCESS_TOKEN=$LS_ACCESS_TOKEN \
        --set-env-vars AWS_ACCESS_KEY_ID=$AWS_S3_ACCESS \
        --set-env-vars AWS_SECRET_ACCESS_KEY=$AWS_S3_SECRET \
        --set-env-vars GIT_TAG=$(git rev-parse --short HEAD) \
        --service-account $BUCKET_PREFIX@$GCP_PROJECT.iam.gserviceaccount.com \
        --region $function_region
done

Google Cloud Scheduler Setup

The Google Cloud Function has an HTTP trigger. We create Cloud Scheduler entries for the complete set of function-region and bucket-region combinations, each set to trigger every minute and cause the function to retrieve 50 random objects.

Running the below command may prompt you to enable Cloud Scheduler and App Engine for the project. The hosting region for App Engine should not affect the reproduction.

for function_region in "us-central1" "us-east1" "asia-east2" "europe-west2";
do
    for bucket_region in "us-central1" "us-east1" "asia-east2" "europe-west2" "us-east-1" "us-east-2" "us-west-2" "eu-west-2";
    do
        suffix=$(echo $bucket_region | cut -d'-' -f3)
        if [[ -n $suffix ]]; then service=s3; else service=gcs; fi
        gcloud beta scheduler jobs create http $BUCKET_PREFIX-$function_region-$bucket_region-10-10 \
            --schedule="* * * * *" \
            --uri=https://$function_region-$GCP_PROJECT.cloudfunctions.net/ObjCheck \
            --message-body='{"service": "'$service'", "region": "'$bucket_region'", "pool": 10, "count": 50}' \
            --project $GCP_PROJECT
    done
done

LightStep Stream Setup

To track the performance of the full combination of functions and regional buckets, you need to set up Streams. We’ll do this using the API. This is not necessary to reproduce the results but can be interesting for showing longer-term trends. Under Project Settings, in the Identification box, find the Organization and Project values and paste them into LS_ORG and LS_PROJECT below.

Create a LightStep API key with “Member” privileges using these directions and paste it below into LS_API_KEY.

export LS_ORG="<LightStep Org>"
export LS_PROJECT="<LightStep Project>"
export LS_API_KEY="<API Key (not Access Token)>"

for function_region in "us-central1" "us-east1" "asia-east2" "europe-west2";
do
    for bucket_region in "us-central1" "us-east1" "asia-east2" "europe-west2" "us-east-1" "us-east-2" "us-west-2" "eu-west-2";
    do
        suffix=$(echo $bucket_region | cut -d'-' -f3)
        if [[ -n $suffix ]]; then service=s3; else service=gcs; fi
        curl --request POST \
            -H "Authorization: bearer $LS_API_KEY" \
            --url https://api.lightstep.com/public/v0.1/$LS_ORG/projects/$LS_PROJECT/searches \
            --data '{"data":{"attributes":{"name":"Requests from '$function_region' to '$bucket_region' ('$service')","query":"operation:\"requestObject\" tag:\"region\"=\"'$function_region'\" tag:\"bucket\"=\"'$BUCKET_PREFIX'-'$bucket_region'\""}, "type":"search"}}'
    done
done

Analysis Process

After letting the functions run for about 10 minutes, you should have sufficient data to analyze. Navigate to the Explorer tab on the left side of the interface.

To see first requests for GCS bucket resources, add "operation: requestObject", "tag: seq="0"", and "tag: service="gcs"" to the query bar, then click run to get a snapshot based on that query. You can then filter by region and group by bucket in the Trace Analysis to see p50 latency numbers from that region to the different buckets. Clicking on the different regions will show lists of traces, and clicking on individual traces will show time spent in different parts of the request process.

Changing the query bar to "tag: service="s3"" and rerunning, then doing the same filtering and grouping, shows the difference for first connections to S3. Clicking through to regional traces and then examining individual traces shows the difference in round-trip latency.

Similarly, you can change the sequence tag to "tag: seq="1"" to look at second connections and compare the GCS and S3 services, which become much more similar once connections are reused.
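
For reference, the three Explorer queries described above take roughly this form (the same tag syntax as in the Stream setup script; the exact quoting is an assumption, so adjust as the query bar requires):

operation:"requestObject" tag:"seq"="0" tag:"service"="gcs"
operation:"requestObject" tag:"seq"="0" tag:"service"="s3"
operation:"requestObject" tag:"seq"="1" tag:"service"="gcs"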

Clarifications

Security

While visibility into OAuth setup in the Google SDK is very limited, the behavior of the reused TCP/TLS connections is consistent with correct security scopes in both positive and negative scenarios.

Comparison of Response Times

As mentioned in the body of the report, the amount of time spent retrieving the object is essentially equivalent. The analysis below was done on 8/13/19.

In Figure 6, GCS p50 response time for the same region is 34 ms.

Figure 6

In Figure 7, S3 p50 response time is 31 ms.

Figure 7

As can be seen, S3 response time (as opposed to overall fetch time) is faster in this particular sample. Unfortunately, due to the current organization of tagging, it’s hard to provide an ongoing response comparison for all the different combinations. We’re looking at other ways to enable this by reorganizing tagging.

Cost

At worst, reproducing these results should cost $2 to $3 a day if left running continuously.
