Learnings from the AWS Data Analytics Speciality
Last year I blogged about how I got my AWS Pro Architect certificate, which you can read about here. Now it was time to get the Data Analytics Speciality. I won’t repeat all the same things that I wrote in the AWS Pro blogpost. That would be a bit redundant. Here I’d like to focus on the things I learned for this Data Analytics Speciality.
All cloud exams are alike
I’ve done quite a few certifications on AWS and Azure, and I am now studying for a GCP certification. There is a pattern. That’s also why, once you are able to pass one, it’s much easier to pass the others, even if they are in other clouds. Clouds these days all offer the same concepts; they just name and structure them differently.
The exams also have some things in common, which is good to know if you ever want to pass:
- Our product GOOD. Other product BAD. Almost always, in any question, regardless of exam or cloud provider, they push their own products. You have to choose between an Oracle DB and AWS Aurora? The answer is almost always AWS Aurora. You have to choose between MongoDB and Azure CosmosDB? The answer will be CosmosDB. Kafka vs Google Pubsub? The answer is Pubsub. It’s really annoying, because it makes the exam look like a marketing training, which is probably not that far from reality. The exception is products that they also offer as a managed service: for example, AWS Elasticsearch Service is very often the right answer.
- More managed GOOD. Less managed BAD. In the Data Analytics speciality, you often had to choose between running something on EC2, running it on EMR, or running it in Glue. Glue was almost always the right answer. Again, marketing training kicking in here. The reality is that I know plenty of projects where Glue really didn’t fit the bill because it was too slow, too constraining or too expensive.
- Read through the nonsense. Some questions can have very long lead-ins. “A Marketing firm in the US wants to set up a direct marketing campaign…” Blah blah blah. After four paragraphs, the question is “To which destinations can Kinesis Firehose write?”. You really only needed that last bit. The same goes for the answers: whenever you are in doubt, eliminate the answers that contain nonsense.
How does it differ from AWS Pro?
Ok, all cloud exams are alike. But there still are differences. Well, obviously, it’s focused a lot more on Data and Analytics, but what else is different?
- It goes deeper into Data Analytics than Pro: While Pro already required you to know quite a few details, this one goes much further. So passing Pro alone is not enough to also get this certificate. You learn more about the Kinesis Producer and Consumer Libraries, more about Redshift performance tuning, more about EMR settings, …
- It is narrower than Pro: This complements the first point. Outside of Data Analytics, you don’t need to know anything. You don’t have to worry about VPCs, EC2 instances, or autoscaling groups. You don’t have to break your head over IAM or dive into SageMaker.
- The questions are easier. In Pro, every question was a book, and every answer had slightly nuanced differences that you could debate for hours. In this one, you could almost always eliminate two answers and worry about the remaining two. Also, some questions were quite brief and straightforward.
So all in all, I found this one less useful than Pro. If you have Pro, you can probably figure out most of Data Analytics on the job, but it doesn’t work the other way around. The Pro exam really did open up a lot of new concepts and ideas that I still use every day.
What did I learn?
When you start studying for the Data Analytics specialisation, you probably already know about Kinesis, Redshift, EMR and Glue. So the learnings are more nuanced.
Redshift now offers RA3 nodes
These are nodes where storage is decoupled from compute. You no longer have to upgrade to a more expensive cluster just because you run out of disk space; storage scales automatically. It is still not the same as Snowflake, though, which fully decouples storage and compute and lets you run multiple compute clusters on the same storage. I reckon that’s coming next for Redshift. Maybe even at the upcoming re:Invent. I still find it amazing that a relatively young company such as Snowflake can steal the lunch of AWS, Microsoft and Google with such ease. But hey, even these giants have a limit to innovation I guess.
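To make this concrete, here is a minimal boto3 sketch of provisioning a Redshift cluster on RA3 nodes. The cluster identifier, credentials and node count are made-up placeholders for illustration, not values from any real setup.

```python
import boto3

redshift = boto3.client("redshift", region_name="eu-west-1")

# Hypothetical example: provision a cluster on RA3 nodes, where compute is
# billed per node and the managed storage grows independently of node count.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",   # placeholder name
    ClusterType="multi-node",
    NodeType="ra3.4xlarge",                  # RA3 node type with managed storage
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="ChangeMe1234",       # placeholder; use Secrets Manager in practice
    DBName="analytics",
)
```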
AWS S3 access logs are not 100% complete
This is a small but important nuance. I read the docs on it for the exam, and it turns out that it really is a “best effort” service. So it is fine for understanding how your data is accessed, but not good enough for an audit trail. The documentation says so explicitly:
“The purpose of server logs is to give you an idea of the nature of traffic against your bucket. It is rare to lose log records, but server logging is not meant to be a complete accounting of all requests.”
How we solved that at one client was to use a separate KMS key for the sensitive data and then monitor who accessed that KMS key.
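A minimal sketch of that monitoring idea, assuming a hypothetical key ARN: CloudTrail records KMS API calls, so you can look up who used the key, which the best-effort S3 access logs cannot guarantee.

```python
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="eu-west-1")

# Hypothetical ARN of the KMS key that protects the sensitive data.
SENSITIVE_KEY_ARN = "arn:aws:kms:eu-west-1:123456789012:key/00000000-0000-0000-0000-000000000000"

# LookupEvents covers roughly the last 90 days of recorded events; for a
# longer audit trail you would query the CloudTrail logs in S3 (e.g. via Athena).
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "kms.amazonaws.com"},
    ]
):
    for event in page["Events"]:
        resources = [r.get("ResourceName", "") for r in event.get("Resources", [])]
        if SENSITIVE_KEY_ARN in resources:
            print(event["EventTime"], event["EventName"], event.get("Username"))
```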
Kinesis Firehose is actually a really convenient service
I used Kinesis before, with a Lambda function that would transform some data, store a copy on S3 and then push it to a database. But Firehose does most of that work for you. It reads from a continuous stream of Kinesis data, buffers it internally with whichever buffer size you prefer, and can then write it to Parquet files in S3, following a schema from the Glue Catalog. The more buffer, the bigger the S3 files, which is good for performance. But of course, the more buffer, the more latency you will have as well, so it’s a trade-off. Once the data is in S3, you can query it in Athena or Redshift Spectrum, or Firehose can even load it into Redshift for you. You can also add a Lambda function to do some small transformations before writing to Parquet. In essence, it’s a really convenient way of ingesting real-time data and making it available for long-term analysis.
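As a rough illustration of how those pieces fit together, here is a boto3 sketch that creates a Firehose delivery stream reading from a Kinesis stream, optionally running a Lambda transformation, and converting JSON records to Parquet in S3 using a Glue Catalog schema. All ARNs, bucket, database and table names below are placeholders.

```python
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

# All ARNs and names are placeholders for illustration.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/clickstream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        # Bigger buffers -> bigger Parquet files (better for Athena/Spectrum),
        # but also more latency before the data lands in S3.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        # Parquet handles its own compression, so leave the stream uncompressed.
        "CompressionFormat": "UNCOMPRESSED",
        # Optional: a Lambda function for small record-level transformations.
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:eu-west-1:123456789012:function:clean-records",
                }],
            }],
        },
        # Convert incoming JSON to Parquet using a schema from the Glue Data Catalog.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "analytics",
                "TableName": "clickstream",
                "Region": "eu-west-1",
            },
        },
    },
)
```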
How did I study?
Everybody studies in their own way. For me, this helped:
- I followed the Big Data course on Udemy, which was originally built for the old specialisation but has since been updated with the Data Analytics details. It’s mostly good, and I learned a lot.
- I took some practice exams on Whizlabs. This time, they were really poor, so I’m not going to link to them. The questions were barely understandable, with a lot of non-native English. And yes, I know, I’m also not a native speaker, but that only made it twice as hard. Google it if you want :-)
- Luckily I found the practice exams from Jon Bonso on TutorialsDojo. They were really helpful, also in explaining what I got wrong and why, and they were very representative of the actual exam. What I also liked was the variation: the material is structured by topic area, as practice exams, as self-paced exams, … Once I started passing those, I realised I was ready for the actual exam. The only negative is that I really don’t like the TutorialsDojo interface; the one from Whizlabs is much better.
Final take-aways
A few tips before wrapping up.
- Book your exam. I kept postponing this and postponing this. Once you book your exam, you don’t have a choice anymore: you need to study, because it’s quite expensive and you don’t want to waste it.
- Look into other clouds. I find it really refreshing that I get the opportunity to work on AWS, Azure and GCP. It broadens your horizon and it also makes you more critical about things that don’t work that well. And most importantly, you start seeing recurring patterns across clouds: things that were so obviously good that everyone implemented them.
- Put it to good use. It’s a massive cliché, but a car has no value in a garage, and your certificate has no value on your wall. Use it to build better systems. See the exam as an opportunity to understand your craft better. It’s an incremental step in a never-ending journey.