How to Extract ZIP Files in an Amazon S3 Data Lake with AWS Lambda

Hosting data lakes in the cloud is common practice, and keeping storage costs down while keeping data accessible is crucial. Storing data in compressed form saves space, but downstream tools often need the files decompressed. AWS Lambda, combined with S3 event triggers, provides a serverless way to decompress archives automatically as they arrive.

You Should Know:

1. Setting Up S3 Bucket and Lambda Function

First, create an S3 bucket and configure it to trigger a Lambda function upon file upload.

AWS CLI Commands:

# Create an S3 bucket
aws s3 mb s3://your-data-lake-bucket

# Create a Lambda deployment package (Python example)
zip lambda_function.zip lambda_function.py

# Create the Lambda function
aws lambda create-function \
--function-name UnzipFiles \
--runtime python3.12 \
--handler lambda_function.handler \
--role arn:aws:iam::123456789012:role/lambda-s3-role \
--zip-file fileb://lambda_function.zip

# Add S3 trigger to Lambda
aws lambda add-permission \
--function-name UnzipFiles \
--statement-id s3-trigger \
--action "lambda:InvokeFunction" \
--principal s3.amazonaws.com \
--source-arn arn:aws:s3:::your-data-lake-bucket

aws s3api put-bucket-notification-configuration \
--bucket your-data-lake-bucket \
--notification-configuration file://notification.json
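
The last command reads a notification.json file that is not shown here. As a minimal sketch of what it might contain, the Python snippet below builds a notification configuration that invokes the function only for new objects ending in .zip. The rule Id, the region in the ARN, and the suffix filter are assumptions for illustration, not part of the original setup.

```python
import json


def build_notification_config(function_arn, suffix=".zip"):
    """Build an S3 notification configuration that invokes a Lambda
    function for newly created objects whose key ends with `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": "unzip-on-upload",  # hypothetical rule name
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": suffix}]
                    }
                },
            }
        ]
    }


# Assumed ARN: the region is a guess; the account ID and function name
# reuse the placeholders from the commands above.
config = build_notification_config(
    "arn:aws:lambda:us-east-1:123456789012:function:UnzipFiles"
)
notification_json = json.dumps(config, indent=2)
print(notification_json)
```

Saving the printed JSON as notification.json gives the put-bucket-notification-configuration command something to load. The suffix filter also keeps the function from being invoked by the non-ZIP files it writes back to the same bucket.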

2. Lambda Function Code (Python)

Here’s a sample Python script to decompress ZIP files automatically:

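import io
import zipfile
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # An S3 notification can carry more than one record
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 events (spaces become '+')
        key = unquote_plus(record['s3']['object']['key'])

        if not key.endswith('.zip'):
            continue

        # The whole archive is read into memory, so it must fit in the function's memory
        zip_obj = s3.get_object(Bucket=bucket, Key=key)
        buffer = io.BytesIO(zip_obj['Body'].read())

        with zipfile.ZipFile(buffer) as zip_ref:
            for info in zip_ref.infolist():
                if info.is_dir():
                    continue  # skip directory entries
                # Stream each member to the extracted/ prefix
                with zip_ref.open(info) as member:
                    s3.upload_fileobj(member, bucket, f"extracted/{info.filename}")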
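
The core loop of the handler can be exercised locally without any AWS calls. The sketch below is one possible way to do that: plan_uploads and make_sample_archive are hypothetical helpers (not part of the script above) that list the destination keys and uncompressed sizes for an in-memory archive, skipping directory entries.

```python
import io
import zipfile


def plan_uploads(archive_bytes, prefix="extracted/"):
    """Return (destination_key, uncompressed_size) for each file in the archive."""
    plan = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zip_ref:
        for info in zip_ref.infolist():
            if info.is_dir():
                continue  # directory entries have no content to upload
            plan.append((prefix + info.filename, info.file_size))
    return plan


def make_sample_archive():
    """Build a small ZIP archive in memory for testing (hypothetical data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("reports/", "")  # explicit directory entry
        zf.writestr("reports/2023.csv", "id,value\n1,42\n")
        zf.writestr("readme.txt", "hello")
    return buffer.getvalue()


sample_plan = plan_uploads(make_sample_archive())
total_bytes = sum(size for _, size in sample_plan)
print(sample_plan, total_bytes)
```

Summing the uncompressed sizes before extracting is a cheap way to reject archives that would expand beyond what the function's memory or timeout can handle.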

3. Cost Considerations

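  • S3 Costs: ~$0.023 per GB per month (Standard storage; rates vary by region)
  • Lambda Costs: $0.0000166667 per GB-second of compute (x86), plus a per-request charge
  • Tradeoffs: For frequent small files, Lambda costs remain low. For large archives, monitor execution time and memory use, since the function reads each archive fully into memory.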
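
To make the tradeoff concrete, here is a rough back-of-the-envelope estimate using the two rates listed above, treating the S3 figure as a per-GB monthly rate. The workload numbers (100,000 invocations, 2 seconds each, 512 MB, 500 GB stored) are made-up assumptions, and the sketch ignores request charges, data transfer, and any free tier.

```python
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667  # rate quoted in the list above
S3_PRICE_PER_GB_MONTH = 0.023              # rate quoted in the list above


def lambda_compute_cost(invocations, avg_duration_s, memory_mb):
    """Compute cost = invocations x duration x memory (in GB) x price."""
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    return gb_seconds * LAMBDA_PRICE_PER_GB_SECOND


def s3_storage_cost(gb_stored):
    """Monthly Standard storage cost for the given volume."""
    return gb_stored * S3_PRICE_PER_GB_MONTH


# Hypothetical monthly workload
monthly_lambda = lambda_compute_cost(
    invocations=100_000, avg_duration_s=2.0, memory_mb=512
)
monthly_s3 = s3_storage_cost(gb_stored=500)
print(f"Lambda compute: ${monthly_lambda:.2f}")
print(f"S3 storage:     ${monthly_s3:.2f}")
```

Under these assumptions storage dominates the bill, which is why cleaning up extracted copies matters more than shaving Lambda duration.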

4. Automating with EventBridge (Advanced)

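For better orchestration, use Amazon EventBridge to manage workflows. This requires EventBridge notifications to be enabled on the bucket and a resource-based permission that lets events.amazonaws.com invoke the function: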

aws events put-rule \
--name "S3-Zip-Processing" \
--event-pattern '{"source":["aws.s3"],"detail-type":["Object Created"]}'

aws events put-targets \
--rule S3-Zip-Processing \
--targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:UnzipFiles"
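
Events delivered through EventBridge are shaped differently from direct S3 notifications: the bucket and key sit under a detail field instead of a Records list, so a handler that only reads event['Records'] will not find them. The sketch below shows one way to normalize both shapes. extract_s3_objects is a hypothetical helper, and it assumes keys in EventBridge events arrive without URL encoding.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return a list of (bucket, key) pairs from either event shape."""
    # Direct S3 notification: a list of records with URL-encoded keys
    if "Records" in event:
        return [
            (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
            for r in event["Records"]
        ]
    # EventBridge delivery: a single object described under "detail"
    # (assumption: the key is not URL-encoded in this shape)
    if event.get("source") == "aws.s3" and "detail" in event:
        detail = event["detail"]
        return [(detail["bucket"]["name"], detail["object"]["key"])]
    return []


# Hand-written sample events for illustration only
s3_event = {
    "Records": [
        {"s3": {"bucket": {"name": "your-data-lake-bucket"},
                "object": {"key": "raw/my+data.zip"}}}
    ]
}
eventbridge_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "your-data-lake-bucket"},
               "object": {"key": "raw/my data.zip"}},
}
print(extract_s3_objects(s3_event))
print(extract_s3_objects(eventbridge_event))
```

A handler built on this helper can stay the same whichever trigger is attached to the function.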

What Undercode Say

Automating file extraction in S3 using Lambda is efficient but requires monitoring:
– Use Amazon CloudWatch to track Lambda invocations:

aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=UnzipFiles \
--start-time 2023-10-01T00:00:00Z \
--end-time 2023-10-02T00:00:00Z \
--period 3600 \
--statistics Sum

– Optimize Lambda Memory: Lambda allocates CPU in proportion to memory, so raising the memory setting can speed up decompression:

aws lambda update-function-configuration \
--function-name UnzipFiles \
--memory-size 512

– Clean Up Extracted Files: Schedule S3 lifecycle policies:

aws s3api put-bucket-lifecycle-configuration \
--bucket your-data-lake-bucket \
--lifecycle-configuration file://lifecycle.json
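
The lifecycle.json file is referenced but not shown. As a sketch under stated assumptions, the snippet below generates one plausible version: a single rule that expires objects under the extracted/ prefix. The rule ID and the 30-day retention period are hypothetical choices, not values from this article.

```python
import json


def build_lifecycle_config(prefix="extracted/", expire_after_days=30):
    """Build an S3 lifecycle configuration that expires objects under
    `prefix` a fixed number of days after creation."""
    return {
        "Rules": [
            {
                "ID": "expire-extracted-files",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


lifecycle = build_lifecycle_config()
lifecycle_json = json.dumps(lifecycle, indent=2)
print(lifecycle_json)
```

Saving the printed JSON as lifecycle.json lets the command above apply it. Because the rule is scoped to the extracted/ prefix, the original ZIP archives are left untouched and can be re-extracted later if needed.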

For large-scale data lakes, consider AWS Glue or EMR for batch processing.

Expected Output:

  • Decompressed files in `s3://your-data-lake-bucket/extracted/`
  • CloudWatch logs for Lambda executions
  • Cost-optimized storage with lifecycle policies

Reference: How to Extract ZIP Files in an Amazon S3 Data Lake with AWS Lambda

References:

Reported By: Darryl Ruggles – Hackers Feeds