Thomas Cardenas in Data Engineer Things | Best Practices for Writing Maintainable and Testable Spark Code in Scala | Enhancing Scalability and Reliability Through Structured Spark Development Practices | Apr 17
Thomas Cardenas in Data Engineer Things | Managing Late-Arriving Data for Accurate Reporting | Data Engineering Excellence: Best Practices for Managing Late-Arriving Data in Metrics Pipelines | Mar 26
Thomas Cardenas in Data Engineer Things | The Art of Efficient Data Lake Organization | Guidelines for Streamlined Data Lake Organization | Oct 24, 2023
Thomas Cardenas in Ancestry Product & Technology | Harnessing Intervals in Apache Airflow for Efficient and Reliable Data Processing | Oct 17, 2023
Thomas Cardenas | Calculating Daily/Monthly Active Users with Spark & Iceberg | Whenever I hear about metrics, I really want to dive into understanding them and come up with a sample pipeline to demonstrate them. One of… | Oct 17, 2023
Thomas Cardenas | Simplifying Complex Data Merging: Combining Data Sources into a Single Table | In the world of data engineering, merging data from different sources into a single table is a common practice. In this article, we will… | Oct 9, 2023
Thomas Cardenas | Streaming data to S3 from SNS using Firehose | Oct 2, 2023
Thomas Cardenas | How to Reduce Full Table Scans during Merges in Apache Iceberg and Save Money | Sep 28, 2023
Thomas Cardenas in Ancestry Product & Technology | Solving the Small File Problem in Iceberg Tables | The Data Platform team at Ancestry has been maintaining a fully-refreshed 100-billion-row Apache Iceberg table for several months. A… | Aug 29, 2023
Thomas Cardenas in Ancestry Product & Technology | Scaling Ancestry.com: How to Optimize Updates for Iceberg Tables with 100 Billion Rows | One of the most interesting datasets at Ancestry is the Hints database. This is used to alert users that potential new information is… | Feb 23, 2023