How to create a Serverless Data Lake on AWS and Migrate your On-Premise Data to it?

Most clients have a significant on-prem data presence that they’re planning to migrate to AWS but aren’t sure where to start. My recommendation is to start small, move quickly, and iterate. Hence this short introductory blog post on how to get started without a lot of development resources, while at the same time using the latest and greatest AWS services.

I’ll try to keep the words to a minimum, so let’s jump in by going through each step in the above architecture:

Prep: Create an S3 bucket (e.g. my-data-lake). Create “raw” and “processed” prefixes, plus any additional sub-prefixes (e.g. /customers and /requests); it really comes down to what type of data you have. The goal is first and foremost to move all of your on-prem data into Amazon S3, so make sure your data lake bucket is set up appropriately. If you’ll later be giving your users direct access to their S3 data, you will want to incorporate the Cognito federated identity ID into the object prefix to control which users can access which prefixes. You will also want your objects organized chronologically, for multiple reasons, including being Amazon Athena-friendly (even if the data is raw). Here’s what a prefix might look like:

/raw/customers/user=ASDF44444/year=2021/month=11/day=08/
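
For reference, here’s a minimal boto3 sketch of the prep step. The bucket name, prefix layout, and customer ID come from the example above; the region, file name, and payload are assumptions:

```python
# Sketch: create the data lake bucket and write an object under a
# partitioned prefix. Assumes boto3 with default credentials and us-east-1.
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3", region_name="us-east-1")

bucket = "my-data-lake"  # example bucket name from the text
s3.create_bucket(Bucket=bucket)  # outside us-east-1, add CreateBucketConfiguration

# S3 has no real folders: prefixes exist implicitly once objects use them.
now = datetime.now(timezone.utc)
key = (
    f"raw/customers/user=ASDF44444/"  # example customer ID from the text
    f"year={now:%Y}/month={now:%m}/day={now:%d}/export-001.json"
)
s3.put_object(Bucket=bucket, Key=key, Body=b'{"customer_id": "ASDF44444"}')
```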

  1. For starters, you’ll need credentials to access the AWS resources. There are a couple of ways to go about this, including the one illustrated above. Amazon Cognito gives you an easy way to manage your users AND your temporary credentials, and you can interact with it programmatically, through scripts, or through a web app (or mobile app); see the credentials sketch after this list. For the DB-to-DMS-to-S3 flow, you can review this section on how to lock down that setup. With that out of the way, let’s talk about your databases. You probably have a couple that you want to migrate to AWS or simply “pipe” into the S3-based data lake we’re building here for further processing and aggregation. The AWS Database Migration Service (DMS) can accomplish this and will keep the target (S3) updated on an ongoing basis. As for the automated scripts and user “sources”, they’re quite self-explanatory: once the scripts and/or users obtain temporary credentials by logging in with Cognito, they can upload objects directly to S3 (or through an intermediary step, like a web app).
  2. Your data is now in S3 — but is it clean? Do you need to transform it to fit a specific schema or format? You have multiple options here (EMR being one of them), but AWS Glue is a managed service you can use for that. AWS Glue is a fully managed extract, transform, and load (ETL) service (taken straight from its official description) that makes data transformation easy and scalable. It also discovers your unstructured data and builds a data catalog from it, allowing services like Athena to query the data using SQL (imagine querying CSV and JSON files without loading them into a relational database). A crawler sketch follows this list.
  3. Now that your data is transformed and saved under the “processed” prefix of the data lake bucket, you might want to ingest some of it into a relational database for real-time data access (or a myriad of other reasons). Well, S3 can automatically publish a message to an Amazon Simple Queue Service (SQS) queue whenever a new object is created, sparing you from writing any additional code (see the notification sketch after this list).
  4. Once the message is in SQS, an AWS Lambda function reads it and starts the next phase of data transformation/processing. Since AWS Lambda added Amazon SQS to its supported event sources, the setup is quite easy and can be achieved using the AWS Serverless Application Model as well as a couple of other frameworks (including the Serverless Framework, which I’m a huge fan of). A sketch covering steps 4 through 6 follows this list.
  5. The Lambda function establishes a connection to the Aurora MySQL database and runs a “LOAD DATA FROM S3” statement.
  6. This instructs Aurora MySQL to reach out to S3 and ingest a file stored in the particular location specified by the Lambda function. You might be asking yourself — why are we using Aurora? Why not query the data on S3 using Athena? The simple answer is: storage tiers. You can use Athena or Redshift Spectrum to query your data directly in S3, but you will need to be cognizant of their concurrency limits. Those services are better suited for backend data processing, while an Aurora database can be quickly scaled and optimized for real-time access.
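
To make step 1 concrete, here’s a minimal sketch of how an automated script could log in through a Cognito user pool, exchange the token for temporary credentials via a Cognito identity pool, and upload straight to the “raw” prefix. The pool IDs, app client ID, username, and file name are all placeholders, and the flow assumes USER_PASSWORD_AUTH is enabled on the app client:

```python
# Sketch: script-based upload using Cognito-issued temporary credentials.
import boto3

REGION = "us-east-1"
USER_POOL_ID = "us-east-1_EXAMPLE"       # placeholder
APP_CLIENT_ID = "example-app-client-id"  # placeholder
IDENTITY_POOL_ID = "us-east-1:00000000-0000-0000-0000-000000000000"  # placeholder

idp = boto3.client("cognito-idp", region_name=REGION)
identity = boto3.client("cognito-identity", region_name=REGION)

# 1. Log the user in (USER_PASSWORD_AUTH must be enabled on the app client).
auth = idp.initiate_auth(
    ClientId=APP_CLIENT_ID,
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "data-uploader", "PASSWORD": "..."},  # placeholder user
)
id_token = auth["AuthenticationResult"]["IdToken"]

# 2. Exchange the ID token for temporary AWS credentials.
logins = {f"cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}": id_token}
identity_id = identity.get_id(IdentityPoolId=IDENTITY_POOL_ID, Logins=logins)["IdentityId"]
creds = identity.get_credentials_for_identity(IdentityId=identity_id, Logins=logins)["Credentials"]

# 3. Upload directly to the "raw" prefix using those temporary credentials.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretKey"],
    aws_session_token=creds["SessionToken"],
)
s3.upload_file("customers.csv", "my-data-lake", f"raw/customers/user={identity_id}/customers.csv")
```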
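
For step 2, the lightest-touch way to get started with Glue is to point a crawler at the “raw” prefix so the Data Catalog (and Athena) can see your data. This is only a sketch: the crawler name, database name, and IAM role ARN are placeholders, and a real pipeline would add a Glue ETL job that writes into the “processed” prefix:

```python
# Sketch: crawl the raw prefix into the Glue Data Catalog.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="data-lake-raw-crawler",                           # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role with S3 read access
    DatabaseName="data_lake_raw",                           # catalog database the tables land in
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
)
glue.start_crawler(Name="data-lake-raw-crawler")
```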
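
Step 3 is pure configuration. The sketch below turns on S3 event notifications for the “processed” prefix; the queue name and account ID are placeholders, and the SQS queue policy must already allow S3 to send messages to it:

```python
# Sketch: publish an SQS message whenever a new object lands under processed/.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_notification_configuration(
    Bucket="my-data-lake",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:processed-objects",  # placeholder
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "processed/"}]}
                },
            }
        ]
    },
)
```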
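
And here’s a sketch tying steps 4 through 6 together: a Lambda function triggered by the SQS queue parses the S3 event, connects to Aurora MySQL, and issues LOAD DATA FROM S3. The database host, credentials, table name, and CSV format are assumptions (I’m using the pymysql client here), and the Aurora cluster needs an IAM role attached that lets it read from the bucket:

```python
# Sketch: SQS-triggered Lambda that loads new S3 objects into Aurora MySQL.
import json
import os
import urllib.parse

import pymysql  # bundle with the deployment package or ship as a Lambda layer


def handler(event, context):
    conn = pymysql.connect(
        host=os.environ["DB_HOST"],      # placeholder environment variables
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database="datalake",             # placeholder schema name
        connect_timeout=10,
    )
    try:
        with conn.cursor() as cur:
            # Each SQS record body carries an S3 event notification payload.
            for record in event["Records"]:
                s3_event = json.loads(record["body"])
                for s3_record in s3_event.get("Records", []):
                    bucket = s3_record["s3"]["bucket"]["name"]
                    key = urllib.parse.unquote_plus(s3_record["s3"]["object"]["key"])
                    # Aurora MySQL reaches out to S3 and ingests the file itself.
                    cur.execute(
                        "LOAD DATA FROM S3 %s "
                        "INTO TABLE customers "  # placeholder table, CSV assumed
                        "FIELDS TERMINATED BY ',' "
                        "LINES TERMINATED BY '\\n'",
                        (f"s3://{bucket}/{key}",),
                    )
        conn.commit()
    finally:
        conn.close()
```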

There are no servers to manage here. Even the Aurora MySQL database, although it is backed by EC2 instances, manages the database and underlying resources for you, so you don’t really need to worry about them; just let Aurora do its magic. The web app can be easily hosted on S3 as well.

Now you’re ready for data consumption.

7. As you can see, there are a lot of steps packed into this one. You can create a very simple web app that lets users authenticate with Cognito and then access their data: if you saved the ingested data under a specific userId that corresponds to the Cognito federated ID, you can give your users direct access to their S3 data, or pipe the web requests through a Lambda function and check their authorization that way (one possible sketch follows below). For a very complex and data-rich app, you can use AWS AppSync, which will take care of interacting with the Lambda function (or an HTTP endpoint that you specify) and will streamline data retrieval in the web app.
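
As one possible way to wire up the Lambda path in step 7, here’s a sketch of a function behind API Gateway (using IAM authorization through the Cognito identity pool) that only hands out presigned URLs for objects under the caller’s own prefix. The bucket name, prefix layout, and query parameter are assumptions based on the conventions above:

```python
# Sketch: per-user access check before handing out a presigned S3 URL.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake"  # example bucket name from the text


def handler(event, context):
    # With AWS_IAM authorization, API Gateway passes the caller's Cognito
    # federated identity ID through the request context.
    identity_id = event["requestContext"]["identity"]["cognitoIdentityId"]
    requested_key = event["queryStringParameters"]["key"]  # assumed query parameter

    # AuthZ check: the object must live under this user's prefix.
    allowed_prefix = f"processed/customers/user={identity_id}/"
    if not requested_key.startswith(allowed_prefix):
        return {"statusCode": 403, "body": "Forbidden"}

    # Short-lived, read-only link the web app can fetch directly from S3.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": requested_key},
        ExpiresIn=300,
    )
    return {"statusCode": 200, "body": url}
```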
