Syncing Data from S3 to BigQuery

Created by Ben Deverman, Modified on Mon, 25 Mar at 6:20 PM by Ben Deverman

Do you have data in S3 that you want to sync to BigQuery so that you can query it alongside all your other data in PAD? CTA can help! With a few steps and some configuration, we can set up a data transfer that syncs data from your AWS S3 bucket into PAD.


Where and How to Deliver Files to CTA

There are two options for giving CTA access to data in an S3 bucket. Both involve giving an IAM User permission to your S3 bucket. CTA can create an IAM User, which you then grant access to the S3 bucket, or you can create and manage your own IAM User for CTA and share that user's credentials with CTA so that we can access the S3 resources.


CTA Providing an IAM User

CTA can create a new IAM User to access data in your S3 bucket. Once that IAM User is created, CTA will share the ARN of the User with you. You then just need to give the CTA-provided IAM User access to the S3 bucket and the resources within it.

Below are the steps required to give the CTA IAM User permission to your S3 bucket:

  • Add the following permissions to the S3 bucket policy:
    • s3:ListBucket - Allows Storage Transfer Service to list objects in the bucket.
    • s3:GetBucketLocation - Allows Storage Transfer Service to get the location of the bucket.
    • s3:GetObject - Allows Storage Transfer Service to read objects in the bucket.
    • s3:GetObjectVersion - Allows Storage Transfer Service to read specific versions of objects in the bucket.
    • Here’s an example policy that you could add to your S3 bucket:
{
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<CTA provided IAM ARN>"
                ]
            },
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::<bucket_name>"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Principal": {
                "AWS": [
                    "arn:aws:iam::<CTA provided IAM ARN>"
                ]
            },
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::<bucket_name>/*"
        }
    ],
    "Version": "2012-10-17"
}
  • Once the CTA IAM user is added to your bucket policy, provide CTA with the S3 Bucket Name and the Region it is located in.
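
If you prefer to generate the bucket policy rather than edit the JSON by hand, the steps above can be sketched in Python. This is a minimal sketch that only builds the policy document shown above. The function name build_bucket_policy and the example bucket name and ARN are made up for illustration and are not part of any CTA or AWS tooling.

```python
import json


def build_bucket_policy(bucket_name: str, principal_arn: str) -> dict:
    """Return a read-only bucket policy for a single IAM principal."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    principal = {"AWS": [principal_arn]}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions apply to every key in the bucket.
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Both values below are placeholders, not real resources.
    policy = build_bucket_policy(
        "example-bucket",
        "arn:aws:iam::123456789012:user/example-cta-user",
    )
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the bucket policy editor in the AWS console. If your bucket already has a policy, merge these two statements into it instead of replacing the existing document.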

Providing CTA with an IAM User to access the S3 bucket

If you would instead like to create an IAM User in your own AWS account and provide it to CTA for accessing your S3 resources, follow these steps.

  • Create an AWS IAM user with the following permissions:
    • s3:ListBucket - Allows Storage Transfer Service to list objects in the bucket.
    • s3:GetBucketLocation - Allows Storage Transfer Service to get the location of the bucket.
    • s3:GetObject - Allows Storage Transfer Service to read objects in the bucket.
    • s3:GetObjectVersion - Allows Storage Transfer Service to read specific versions of objects in the bucket.
  • Here is an example policy with the permissions your AWS IAM User should have:
{
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::<bucket_name>"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::<bucket_name>/*"
        }
    ],
    "Version": "2012-10-17"
}
  • Once the IAM User is created, give that user access to the S3 bucket that holds the data to be synced. To do this, follow the steps in the previous section for adding an IAM user to an S3 bucket policy, but use the ARN of the new user you created instead.
  • Finally, provide CTA with the IAM user credentials and bucket information through 1Password:
    • If you haven’t yet been provided with a 1Password vault, please contact help@techallies.org.
    • Create a 1Password item with your AWS IAM User credentials:
    • Put the IAM Access Key ID in the username field.
    • Put the IAM Secret Access Key in the password field.
    • Put the S3 bucket URI in the website field (with a prefix, if needed, e.g., s3://bucket/prefix).
    • Once the information is in your 1Password Vault, notify our team by emailing help@techallies.org.
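
Before sharing the credentials, it can be worth checking that the policy attached to your IAM User grants exactly the four read permissions listed above, and that the bucket URI you enter in 1Password is well formed. The following is a rough sketch of such a check, not an official CTA tool. The names REQUIRED_ACTIONS, check_policy, and split_s3_uri are hypothetical, and the check compares action names literally, so wildcard actions such as s3:Get* are reported as extra rather than expanded.

```python
from urllib.parse import urlparse

# The four read permissions listed in this article.
REQUIRED_ACTIONS = {
    "s3:ListBucket",
    "s3:GetBucketLocation",
    "s3:GetObject",
    "s3:GetObjectVersion",
}


def allowed_actions(policy: dict) -> set:
    """Collect every action named in the policy's Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a lone statement may not be in a list
        statements = [statements]
    actions = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        action = statement.get("Action", [])
        actions.update([action] if isinstance(action, str) else action)
    return actions


def check_policy(policy: dict) -> tuple:
    """Return (missing, extra) action sets relative to REQUIRED_ACTIONS."""
    granted = allowed_actions(policy)
    return REQUIRED_ACTIONS - granted, granted - REQUIRED_ACTIONS


def split_s3_uri(uri: str) -> tuple:
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, check_policy(policy) returns two empty sets when the policy grants the four actions and nothing else, and split_s3_uri("s3://bucket/prefix") returns the bucket name and prefix separately.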


CTA’s Preferred Data Formats and Directory Structure

To simplify the ingestion process of batch data in S3, CTA has come up with a set of preferences for the format of the data and how to structure it. Please refer to the following article to learn more about those preferences!

CTA Data Formats - Best Practices and Recommendations
