AWS DataSync: Discover and move your data between on-premises, AWS, and other cloud storage with end-to-end security, including data encryption and integrity validation. AWS DataSync is a fully managed data transfer service that simplifies, automates, and accelerates transfers between on-premises storage systems and AWS storage services such as Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server. To transfer data from an on-premises data center, first deploy an AWS DataSync agent there. The agent is a virtual machine that you run in your on-premises environment, with network access to the data source. It communicates with the AWS DataSync service to transfer data between the source and target locations.
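Once the agent is activated, a transfer is described as a source location, a destination location, and a task that connects them. The sketch below shows that sequence with the boto3 DataSync client calls `create_location_nfs`, `create_location_s3`, `create_task`, and `start_task_execution`. The function name, the task name `onprem-to-s3`, the NFS source, and every ARN or hostname passed in are assumptions for illustration; the client is passed in as an argument so the flow can be read (and dry-run) without AWS credentials.

```python
def start_onprem_to_s3_transfer(datasync, agent_arn, nfs_host, nfs_path,
                                bucket_arn, bucket_role_arn):
    """Create source and destination locations plus a task, then start one run.

    `datasync` is expected to behave like boto3.client("datasync").
    All argument values are hypothetical and supplied by the caller.
    """
    # Source: an on-premises NFS share, reached through the deployed agent.
    source = datasync.create_location_nfs(
        ServerHostname=nfs_host,
        Subdirectory=nfs_path,
        OnPremConfig={"AgentArns": [agent_arn]},
    )
    # Destination: an S3 bucket, accessed through an IAM role.
    destination = datasync.create_location_s3(
        S3BucketArn=bucket_arn,
        S3Config={"BucketAccessRoleArn": bucket_role_arn},
    )
    # The task ties the two locations together; VerifyMode controls the
    # integrity check that DataSync performs on the transferred data.
    task = datasync.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=destination["LocationArn"],
        Name="onprem-to-s3",  # hypothetical task name
        Options={"VerifyMode": "ONLY_FILES_TRANSFERRED"},
    )
    execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
    return execution["TaskExecutionArn"]
```

With real credentials the first argument would be `boto3.client("datasync")`; the agent ARN comes from activating the agent beforehand, which this sketch does not cover.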
Workflow of AWS Glue
- Define a crawler to populate the Data Catalog
- Create the ETL job
- AWS Glue generates a script for the ETL job, or you can provide or write your own
- Run the job on demand, or define a schedule or trigger
- The job extracts data from the data source (DS), transforms it, and loads it into the data target
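The workflow steps above can be sketched with the boto3 Glue client calls `create_crawler`, `start_crawler`, `create_job`, `start_job_run`, and `create_trigger`. This is a minimal outline rather than a complete pipeline: the function name, the default names `sales_db`, `sales_crawler`, and `sales_etl`, and the S3 paths are assumptions, and the ETL script is assumed to already exist at `script_s3_path`.

```python
def set_up_glue_pipeline(glue, role_arn, source_s3_path, script_s3_path,
                         database="sales_db", crawler="sales_crawler",
                         job="sales_etl", cron=None):
    """Wire up crawler -> job -> run, following the workflow steps.

    `glue` is expected to behave like boto3.client("glue").
    Default names are hypothetical placeholders.
    """
    # Step 1: a crawler scans the source and populates the Data Catalog.
    glue.create_crawler(
        Name=crawler,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": source_s3_path}]},
    )
    # start_crawler returns immediately; a real script would wait for the
    # crawler to finish before running a job that reads the catalog tables.
    glue.start_crawler(Name=crawler)

    # Steps 2-3: register the ETL job and point it at its script.
    glue.create_job(
        Name=job,
        Role=role_arn,
        Command={
            "Name": "glueetl",
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
    )

    # Step 4: run on demand, or attach a scheduled trigger instead.
    if cron is None:
        return glue.start_job_run(JobName=job)["JobRunId"]
    glue.create_trigger(
        Name=f"{job}-schedule",
        Type="SCHEDULED",
        Schedule=cron,  # e.g. "cron(0 2 * * ? *)"
        Actions=[{"JobName": job}],
        StartOnCreation=True,
    )
    return None
```

Passing `cron=None` runs the job once and returns its run ID; passing a cron expression creates a scheduled trigger and returns `None`. The extract, transform, and load logic itself lives in the job script, not in this setup code.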
What makes up the AWS Glue Data Catalog
Databases: A logical group of tables
Tables: Metadata definitions that represent data
Crawlers & Classifiers: Detect and infer schemas and store them in the Data Catalog
Connections: An object that contains the properties to connect to a particular data store
AWS Glue Schema Registry: A registry for managing the schemas of streaming data
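To make the database and table components concrete, the sketch below walks the catalog with the boto3 Glue client calls `get_databases` and `get_tables` and collects each table's column names and types. The function name `describe_catalog` is an assumption, and the sketch reads only the first page of each response; a fuller version would follow `NextToken` or use a paginator.

```python
def describe_catalog(glue):
    """Return {database: {table: [(column, type), ...]}} from the Data Catalog.

    `glue` is expected to behave like boto3.client("glue").
    Only the first page of results is read (no NextToken handling).
    """
    catalog = {}
    for db in glue.get_databases()["DatabaseList"]:
        tables = glue.get_tables(DatabaseName=db["Name"])["TableList"]
        catalog[db["Name"]] = {
            table["Name"]: [
                (col["Name"], col["Type"])
                # Column metadata sits under the table's StorageDescriptor.
                for col in table.get("StorageDescriptor", {}).get("Columns", [])
            ]
            for table in tables
        }
    return catalog
```

The result is a plain nested dictionary, which makes it easy to see that a database is only a grouping and that a table holds metadata (columns and types), not the data itself.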
AWS Glue Job bookmarks are used to track the source data that has already been processed, preventing the reprocessing of old data. Job bookmarks can be used with JDBC data sources and some Amazon Simple Storage Service (Amazon S3) sources. Job bookmarks are tied to jobs. If you delete a job, then its job bookmark is also deleted.
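In Glue, bookmarks are switched on per job with the job parameter `--job-bookmark-option` set to `job-bookmark-enable`. The toy model below is not how Glue stores bookmark state internally; it is a conceptual sketch, with the hypothetical class `BookmarkStore`, of the two behaviours described above: a rerun skips input that was already processed, and deleting a job discards its bookmark.

```python
class BookmarkStore:
    """Toy model of job bookmarks: remembers processed inputs per job."""

    def __init__(self):
        self._processed = {}  # job name -> set of processed source keys

    def run_job(self, job_name, source_keys):
        """Return only the keys this job has not processed before."""
        seen = self._processed.setdefault(job_name, set())
        new_keys = [key for key in source_keys if key not in seen]
        seen.update(new_keys)
        return new_keys

    def delete_job(self, job_name):
        """Deleting a job also deletes its bookmark."""
        self._processed.pop(job_name, None)


store = BookmarkStore()
first = store.run_job("daily_etl", ["a.csv", "b.csv"])
second = store.run_job("daily_etl", ["a.csv", "b.csv", "c.csv"])
store.delete_job("daily_etl")
third = store.run_job("daily_etl", ["a.csv"])
print(first, second, third)
# prints: ['a.csv', 'b.csv'] ['c.csv'] ['a.csv']
```

The second run picks up only `c.csv`. After the job is deleted and recreated under the same name, `a.csv` is processed again because the bookmark went away with the job.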