Preparing for the AWS Data Engineer — Associate Certification Exam
For Starters
I took the DEA-C01 exam in December 2023. This certification is still in beta, so the result can take up to 90 days to arrive (fingers crossed).
The AWS Data Engineer Associate exam is a new AWS certification. It validates skills and knowledge in core data-related AWS services, the ability to implement data pipelines, monitor and troubleshoot issues, and optimize cost and performance in accordance with best practices. So be prepared for these types of questions and answers during the exam.
Since this exam is in beta, it has some differences from the usual associate certifications:
- Exam duration: 170 minutes
- Format: 85 questions, either multiple choice or multiple response
- Cost: 75 USD (down from the usual 150 USD)
- Delivery: in person or online
- Language: English only
- Exam window: November 27, 2023 to January 12, 2024
Several of the AWS services in this exam were ones I had never worked with, and there are not many courses for it yet since it is a beta, so I studied partly with the AWS Data Analytics — Specialty practice tests. My end-to-end study path took around one month (I had just finished studying for and achieving the AWS Solutions Architect — Professional, so I leveraged some of that content for this one): reading FAQs, trying the practice exams, and reviewing the services and features I had to learn.
My study routine was usually at the end of the day (depending on whether I had late meetings or work activities; if so, I tried to study in the morning to compensate), with about 1–2 hours of study per day during the week.
Your daily routine depends only on you: whether you are a morning person or more active at night, how your work agenda looks, gym hours, and so on. Only you know the time that works best or that you are most comfortable with.
Best Courses and Practices Exams
For studying, I used this Hands-On course and the practice tests from Udemy. Do not forget the FAQs!
And for the Practice exams, I used the Data Analytics — Specialty Practice test from Tutorial Dojo.
DO NOT RUSH THE PROCESS! Studying for certifications takes time, especially if you are entering the cloud world now, so it is natural to mature the content over time and gain experience at work.
AWS Certification Roadmap
- AWS Cloud Practitioner
- AWS Solutions Architect — Associate
The focus of this certification is on the design of cost- and performance-optimized solutions, demonstrating a strong understanding of the AWS Well-Architected Framework. This certification can enhance the career profile and earnings of certified individuals and increase your credibility and confidence in stakeholder and customer interactions. (AWS certification documentation)
- AWS Data Engineer — Associate
This exam is designed for candidates with 2–3 years of experience in cloud data-related roles or in on-premises data-related roles, moving to the AWS Cloud. Candidates in cloud roles such as data engineer, data analyst, data architect, or business intelligence engineer can earn this certification and gain credibility and confidence.
Those in adjacent roles, like software engineer, cloud engineer, reporting analyst, data quality analyst, and on-premises data roles, can also prepare for and earn this certification. (AWS blog)
Solutions Architect Roadmap
Here are some of the skills needed to be a Solutions Architect
Don’t be afraid of all of this. You will gain these skills over time while working: facing challenges, dealing with clients and customers, people management, and partnerships. You can always count on your partners in crime at work for advice on technologies and services you don’t have expertise in.
Exam Domains
- Domain 1: Data Ingestion and Transformation (34% of scored content)
- Domain 2: Data Store Management (26% of scored content)
- Domain 3: Data Operations and Support (22% of scored content)
- Domain 4: Data Security and Governance (18% of scored content)
Study Paths
There are some paths for you to study for the certifications:
- Slow: Watch the whole course for the certification, do the demos, understand each service and how it integrates with other AWS services, read the whitepapers and FAQs, do the practice exams, go back to the course or read the papers again, take more practice exams, and finally sit the certification exam. This is the path I recommend: you will learn a lot from it and become a better IT professional with all this knowledge.
- Fast: Start by doing a practice exam covering all domains; at the end it will give you a score and show which domains do not meet expectations. Focus on studying those domains and repeat the process. I only recommend this if you have a short deadline for the certification. BE AWARE that this is only to pass the exam; you will NOT learn much from it.
Exam Topics:
Here are some of the AWS services and their features that were in the exam. Remember that the questions are always about use cases, so you are going to face questions with at least two services and several of their features.
- Athena
Athena notebooks: can be used to query data in S3 directly and leverage Apache Spark for advanced data transformations and analytics; the Jupyter notebook integration with Apache Spark provides a robust platform for direct querying of S3 data.
Bad performance while reading several small objects in S3: Optimize file sizes into larger objects
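The fix is usually compaction: fewer, larger objects mean fewer S3 GET requests and less per-file overhead per query. A minimal sketch of the batching logic in Python (object sizes are hypothetical; no AWS calls are made):

```python
# Group many small S3 objects into compaction batches of roughly
# 128 MB each, so Athena scans a few large files instead of
# thousands of tiny ones. Object sizes here are hypothetical.
TARGET_SIZE = 128 * 1024 * 1024  # ~128 MB per output object

def plan_compaction(object_sizes, target=TARGET_SIZE):
    """Greedily pack object sizes into batches no larger than target."""
    batches, current, current_size = [], [], 0
    for size in object_sizes:
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 10,000 objects of ~1 MB each -> 79 output objects instead of 10,000
sizes = [1024 * 1024] * 10_000
batches = plan_compaction(sizes)
print(len(batches))
```

The same idea is what a CTAS query or a Glue compaction job does for you in practice.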
Aggregate functions to gain a summarized view of data
Query examples
- AWS Lake Formation
Uses a Centralized permission model for Granular access to data
Designed to manage permissions across different AWS analytics services.
Tag-based access control. Tag sensitive data
Data sharing feature simplifies and secures the process of sharing data across different AWS accounts or with external organizations.
- S3
Infrequent Access
S3 Archive
Events
Object Lock
Cross-Account Replication
- Macie
PII data
Integration with S3
Integration with AWS Lake Formation: allows for robust management and governance of access to the PII data
- ElastiCache
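The lazy-loading (cache-aside) pattern that the ElastiCache questions revolve around can be sketched in plain Python; a dict stands in for the cache and a stub function for the database (all names are illustrative):

```python
# Cache-aside (lazy loading): check the cache first; on a miss,
# read from the backing store and populate the cache. Only data
# that is actually requested ever gets cached.
cache = {}          # stands in for ElastiCache (Redis/Memcached)
db_reads = 0        # counts trips to the backing database

def query_database(key):
    """Pretend database lookup (illustrative stub)."""
    global db_reads
    db_reads += 1
    return f"value-for-{key}"

def get(key):
    if key in cache:             # cache hit: no database trip
        return cache[key]
    value = query_database(key)  # cache miss: read through
    cache[key] = value           # lazily populate the cache
    return value

get("user:42")   # miss -> hits the database
get("user:42")   # hit  -> served from cache
print(db_reads)  # 1
```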
Lazy-loading strategy: ideal for read-heavy, infrequently updated data scenarios; since the application’s primary workload involves complex, read-intensive queries, this approach minimizes cache maintenance overhead and ensures only the most requested data is cached.
- Amazon QuickSight
SPICE engine: offers the capability to build interactive dashboards with direct, live connections to various data sources, including Amazon RDS; SPICE’s automatic refresh capability ensures that dashboards display the most up-to-date information from the RDS PostgreSQL database.
- AWS SCT
- KMS
SSE-KMS with customer managed keys
- AWS DMS
AWS SCT (Schema Conversion Tool)
Schema Copy
- AWS CloudTrail
Logs sent to S3
Query trail logs
AWS CloudTrail Lake: Provides an optimized and centralized solution for storing, managing, and analyzing CloudTrail logs; It allows the retention and querying of logs for up to seven years, which aligns well with the company’s need for year-long data analysis
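CloudTrail Lake is queried with SQL, using the event data store ID as the table name. A hedged sketch of the kind of query the exam hints at (the data store ID below is a placeholder):

```python
# SQL for CloudTrail Lake: summarize API activity per service over a
# time window. "<event-data-store-id>" is a placeholder for the real ID.
query = """
SELECT eventSource, COUNT(*) AS calls
FROM <event-data-store-id>
WHERE eventTime > '2023-01-01 00:00:00'
GROUP BY eventSource
ORDER BY calls DESC
"""
```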
- AWS CloudWatch
Logs
Logs Insights
- AWS Glue
AWS Glue Crawler
AWS Glue Jobs
AWS Glue Jobs Bookmark
AWS Glue DataBrew: missing data; inconsistent data; duplicate data
AWS Glue Schema Registry: Is crucial as it stores the schemas of your streaming data and manages different versions; This ensures your data format stays consistent, which is essential for preventing issues due to changes in data structure over time; Such consistency is critical to avoid data processing failures or corruption, ensuring the streaming data’s integrity remains intact.
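The core idea, versioned schemas that records are validated against before being produced, can be illustrated conceptually in Python (this is not the Glue Schema Registry API, just the concept):

```python
# Conceptual sketch of a schema registry: store schema versions and
# validate records against the latest one before producing them.
registry = {}  # schema_name -> list of schema versions (newest last)

def register(name, schema):
    """Add a new schema version; returns the version number."""
    registry.setdefault(name, []).append(schema)
    return len(registry[name])

def validate(name, record):
    """Check the record has exactly the fields of the latest schema."""
    latest = registry[name][-1]
    return set(record) == set(latest["fields"])

register("clicks", {"fields": ["user_id", "url"]})          # v1
register("clicks", {"fields": ["user_id", "url", "ts"]})    # v2

print(validate("clicks", {"user_id": 1, "url": "/", "ts": 0}))  # True
print(validate("clicks", {"user_id": 1, "url": "/"}))           # False (old shape)
```

Rejecting the old record shape up front is what prevents the downstream processing failures the exam scenario describes.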
ETL
Apache Spark on AWS Glue
From and to S3
JDBC/ODBC connections
- Amazon Redshift
Amazon Redshift Advisor: provides automated recommendations to optimize the performance of Redshift clusters, such as distribution style changes, sort key additions, and more.
Amazon Redshift Query Performance
Amazon Redshift Query Performance Insights: To monitor query performance; Provides a comprehensive view of query performance, allowing data engineers to quickly identify long-running or problematic queries. This helps in understanding the performance characteristics of both individual queries and the overall workload.
Redshift Serverless: Optimizes data warehouse capacity, charging solely for the compute resources used, and incurs no charges when idle; Data sharing in Redshift allows the seamless sharing of live data between Redshift clusters and Redshift Serverless endpoints without incurring additional costs; Minimize compute costs
Amazon Redshift Row-Level Security (RLS): controls access to rows of data based on user attributes (such as team roles), giving fine-grained control within shared tables; the database administrator sets up security policies that control access to rows in a table based on attributes like a user’s role or team, making it an ideal choice when data-sharing needs are intricate and closely tied to user identities or roles.
Amazon Redshift Data Sharing: enables sharing of live data across Redshift clusters
Query examples: values starting with “string”
.csv import issues related to the IGNOREHEADER option on the COPY command
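The usual culprit is the header row being loaded as data; adding IGNOREHEADER 1 tells COPY to skip it. A sketch of such a command (table name, S3 path, and IAM role are placeholders):

```python
# Redshift COPY that skips the CSV header row. The table name,
# S3 path, and IAM role below are placeholders.
copy_sql = """
COPY sales
FROM 's3://my-bucket/sales/2023/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1
"""
```

Without IGNOREHEADER 1, the header line is parsed as a data row and typically fails type conversion on the first non-text column.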
VACUUM operation
COPY command
Amazon Redshift Spectrum
Workload Management (WLM) queues in Amazon Redshift
- AWS SAM
- API Gateway
- Data Pipeline
- AWS Lambda
Provisioned concurrency for warm pool instances
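Provisioned concurrency keeps a pool of pre-initialized execution environments so invocations skip cold starts. A sketch of the request parameters for boto3’s put_provisioned_concurrency_config (the function name and numbers are hypothetical, and the call itself is commented out since it needs AWS credentials):

```python
# Parameters for lambda_client.put_provisioned_concurrency_config(...).
# Function name, alias, and pool size are hypothetical; the request is
# only built here, not sent.
params = {
    "FunctionName": "ingest-handler",       # hypothetical function
    "Qualifier": "live",                    # alias or version to warm
    "ProvisionedConcurrentExecutions": 50,  # warm environments to keep
}
# boto3.client("lambda").put_provisioned_concurrency_config(**params)
```

Note that provisioned concurrency must target a version or alias, never $LATEST.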
AWS EFS for additional storage for large-file processing
- AWS Step Functions
- Amazon SageMaker
Amazon SageMaker ML Lineage Tracking
Amazon SageMaker Data Wrangler: its built-in date functions simplify the process of standardizing date formats, and its string functions make cleaning categorical fields efficient, making it the most suitable option.
- EMR
EMR with Apache Spark: can efficiently process and anonymize large datasets, and Amazon Redshift allows for robust analytics capabilities post-anonymization.
- CodeCommit
- CodeBuild
- CodeDeploy
- CodePipeline
- AWS Neptune
For graph data structures
- AWS Kinesis Data Firehose
Near real-time use cases
- AWS Kinesis Data Streams
Real-time use cases
- AWS Kinesis Data Analytics
- Amazon MSK
- DynamoDB
GSI
TTL
Streams
- RDS
RDS Read Replica
RDS Multi-AZ
Supported engines
- Amazon Aurora
Aurora Read Replica
- AWS EKS
HPA (Horizontal Pod Autoscaler)
- AWS Lambda
Integration with Services
Deployment
Versions
Concurrency
- Other Technologies
Apache Spark
Apache Flink: Advanced Stream processing
Hive
Parquet: better performance than JSON
- Use Cases
Increase performance
HIPAA and PII information
Real-time/near-real time
Cost-effective
Stateless and stateful transactions
Statistically significant insights while ensuring minimal computation and storage usage
Some Useful Links
AWS Certified Data Engineer — Associate official page
AWS Certified Data Engineer — Associate Exam Guide
Exam Prep: AWS Certified Data Engineer — Associate
Hands-On Course on Udemy
Practice tests on Udemy
AWS Data Analytics — Specialty Practice Tests on Tutorial Dojo
Considerations
The certification exam questions normally involve 2 to 4 services and their integrations and features; there were real-life scenarios, trick questions, and so on. Always work through the practice exams, because they really help you be more prepared for the certifications.
This exam is very extensive: 85 questions in 170 minutes (+30 minutes of accommodation if English is not your native language), plus 5–10 minutes for the surveys, so try to take it in the morning while you are well-rested.
Feel free to comment in case you got anything different from your certification exam.
And finally, good luck with your next AWS certification; I hope this preview and documentation come in handy!