Unwinding the data engineering days @ DTDL

Internship Experience.

Ritwija Dixit
Deutsche Telekom Digital Labs
8 min read · Dec 5, 2022


10 letters in each of these words, 100s of learnings involved, 210 days of digging through data, being one of 1000 members of the DT Family and making millions of memories for a lifetime. In June 2022, I embarked on my internship journey at Deutsche Telekom Digital Labs, Gurugram, in the Data Engineering team. DTDL is an ideal hub for data enthusiasts owing to its sheer scale: developing products for over 10 European nations, grossing revenue beyond 2 billion euros, serving a colossal customer base of around 94 million, and streaming over 350 million events per day. Whooaaa 😱, right ❕

Speaking of the most riveting part of the experience: the field of data. DATA. As short as this word is, its depth is beyond comprehension. From clicks 🖱 to pics 🖼, from event streams 👩‍💻 to internet memes 📱, everything together and apart is embraced by a single 4-letter word. Adding the seasoning is Engineering, another interesting word, or rather realm, which ranges from early mechanization to modern-day machine learning and its realizations. Now mix the two and, voila, we have the lethal combo of Data Engineering. Just imagine the power and the widespread grasp, from the totality of pipelines to the concoction of files, and gasp 😲!

Making my passion for DATA evident, I had always been keen to explore it in its entirety. While pursuing my bachelor’s degree, I was privileged to be exposed to core web and backend development, owing to the projects we practised, alongside a glimmer of data science, which is just all-inclusive. For our 6-month industrial training, we had to opt for an organisation and a field of interest, and I was pretty sure that I had to go into data, but which branch — science, engineering or analytics ❓❓❓ To unravel the answer, I browsed, asked around and gathered that business decisions are driven by data analysts, while data scientists solve business problems, and data engineers empower data analysts and scientists to do their work. To get the best of both worlds, I plunged into data engineering at DTDL.

Swiftly rolling from the dusk of choosing a domain to the dawn of workdays: I initially joined in virtual mode, and the onboarding process, from delivery of the joining kit to assignment of the team, went smoothly, thanks to the People team at DTDL. Subsequently, I was acquainted with the team, the company, the organisational structure, the methodology and the workflow through tech as well as non-tech overview sessions, collectively planned for all new joiners in the company. The cognisance grew even more comfortable in office — cafeteria 🍕, coffee hub ☕️, customized meeting rooms 💁 and, most important of all, colleagues 👥!

At DT, we have different LOBs (lines of business) pertaining to the products offered, and central teams like Data Engineering serve across all of them; their primary focus is to process data in a way that allows meaningful inputs to be extracted for the business. I found out that the primary responsibility of a Data Engineer is not quite as glamorous as I had imagined. Instead, the job is much more foundational and requires constant dedication — maintaining critical pipelines that not only track information like how many users are active on the application or how much time each user spends on different content, but also aid in the smooth ingestion and processing of data for the different product teams.

Sprints 🔛, story points 🔢, schema registries 📑, SQL servers 💾, SSH ‼️, scheduling ⏲ and sudo access 💻 were some of the jargon I came across initially, along with some fun ones like our standard data engineer tagline: pipeline ‘phat gyi’ 😬, roughly “the pipeline blew up” (PS: it sounds way more fun than it actually is).

Mentored by my buddy, I was set in motion through knowledge transfer sessions, looking into the tech stack, beginning with NiFi and moving on to the AWS Cloud, with titbits about file formats, Spark transformations, advanced SQL, the power of Excel and other Big Data essentials, all to understand the flow of data in the organisation. Here are some pointers which I feel should definitely make it to the checklist of a budding data engineer:

· Software Engineering Concepts — Agile, DevOps, Architecture Design, Git, etc.

· Open-Source Frameworks — Apache Spark, Hadoop, Hive, MapReduce, Kafka, Airflow and others.

· SQL — An indispensable skill in the data community.

· Skilled Programming — Working with Python has become the favored option due to its ocean of extensive libraries. Java, on the other hand, is widely sought as a useful backend integration tool, though it has fallen out of favor compared to Python (a tiny PySpark sketch follows this list).

· Cloud Infrastructure — AWS is probably the most predominant cloud skillset for Data Engineers to acquire. Google Cloud and Microsoft Azure are the other competitors, and knowledge of whichever of these the company prefers is adequate.

· Data Modeling — It is rather vital nowadays, inasmuch as a data engineer needs to know how the tables are going to be structured, which partitioning technique to use, where normalization should be performed, when to denormalize data in the warehouse, how to think about retrieving certain attributes, etc.

· Analytics, mainly for Visualization/Dashboards — Fundamentals of tools like Tableau, Power BI, etc.
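
To make the programming point a little more concrete, here is a tiny PySpark sketch of the kind of transformation these essentials revolve around. It is only an illustration under assumed names: the event fields and the numbers are made up, not DTDL data.

```python
# A tiny, illustrative PySpark aggregation; all data and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("essentials-demo").getOrCreate()

# Fake event stream: (user, event type, seconds spent on the content)
events = spark.createDataFrame(
    [("u1", "click", 12), ("u1", "view", 30), ("u2", "view", 45)],
    ["user_id", "event_type", "seconds_spent"],
)

# A typical "how much time does each user spend" style aggregation
time_per_user = (
    events.groupBy("user_id")
          .agg(F.sum("seconds_spent").alias("total_seconds"),
               F.count("*").alias("event_count"))
)
time_per_user.show()

spark.stop()
```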

All this while, as I read and continued to groom my theoretical expertise, I had the opportunity to get hands-on with it. The first project that I was assigned was to assist my senior teammate in a POC of AWS cost monitoring and optimization. Filled with enthusiasm, we started to explore the AWS services that provide a cost break-up and, as per the different requirements at different levels, we suggested solutions. After a few weeks of brainstorming and getting a hang of the corporate culture, the project was put on hold due to some internal issues and I was onboarded onto a new project, which I was more than thrilled about as it was an end-to-end data pipeline… yayayaaya 😇! In the requirement elicitation phase, I began with documentation and was joined by a new team member — a fellow intern. Following his onboarding and our requisite accesses and approvals, we finally dived into the tech-verse: the development phase. #ReadyToCode ➡️

Revolving around business responses to revamp requests, we kicked off with an analysis of the problem statement and the need for it, scrutiny of the existing pipeline, followed by documentation, deliberations and a breakdown of tasks. To highlight the technical aspect, the objective was to revamp the existing campaign performance dashboards hosted on Redash into more detailed, quickly accessible, near-real-time and interactive dashboards on Tableau, another BI tool already in place. Redesigning the backend pipeline and the corresponding UI would not only help in boosting performance, but also assist in developing quicker and more meaningful insights. All in all, automating the entire process would be a thorough walk-through of being a data engineer, and our proposed workflow was:

1. Elicitation: Analysis of the existing data pipeline to understand the data flow, identify the shortcomings, gain a product perspective and deduce new KPIs

2. Data Modelling: Optimizing the present table, created by a Create Table As Select (CTAS) query on AWS Athena, scheduled by Airflow and stored on AWS S3 (see the orchestration sketch after this list).

3. Data Processing: Calculating the added KPIs and storing the aggregated data for producing trends by deploying AWS Glue jobs scheduled on Airflow

4. Data Warehousing: Moving the table with calculated KPIs to AWS Redshift, for enhanced data quality and analytics

5. Data Dashboarding: Bringing the dashboards to near real time and enabling faster dashboard development by using a live connection on Tableau

6. Backfilling of Data: Retroactively processing the new KPIs from the previously processed aggregated data in the current pipeline

7. Embedded Analytics: Integrating the Tableau dashboards into the company portal by embedding them.

Data Pipeline Architecture
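
To give a flavour of how steps 2–4 could hang together in code, here is a minimal, hypothetical Airflow sketch (assuming the apache-airflow-providers-amazon package is installed). Every name in it (the DAG id, databases, tables, buckets and the Glue job) is invented for illustration and is not the actual DTDL pipeline.

```python
# A minimal, hypothetical Airflow DAG sketching steps 2-4 above; all names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="campaign_kpi_pipeline",
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 2: (re)build the aggregated table on S3 via an Athena CTAS query.
    # (A real scheduled run would drop the table or INSERT INTO partitions first,
    #  since CTAS fails if the table already exists; kept simple here.)
    build_ctas_table = AthenaOperator(
        task_id="build_ctas_table",
        database="campaign_db",
        query="""
            CREATE TABLE campaign_daily_agg
            WITH (format = 'PARQUET',
                  external_location = 's3://example-bucket/campaign_daily_agg/') AS
            SELECT campaign_id,
                   date(event_time) AS event_date,
                   count(*) AS impressions,
                   sum(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks
            FROM campaign_events
            GROUP BY campaign_id, date(event_time)
        """,
        output_location="s3://example-bucket/athena-results/",
    )

    # Step 3: compute the additional KPIs with a Glue (PySpark) job
    compute_kpis = GlueJobOperator(
        task_id="compute_kpis",
        job_name="campaign_kpi_job",
        script_args={"--run_date": "{{ ds }}"},
    )

    # Step 4: load the KPI table from S3 into Redshift for analytics
    load_to_redshift = S3ToRedshiftOperator(
        task_id="load_to_redshift",
        schema="analytics",
        table="campaign_kpis",
        s3_bucket="example-bucket",
        s3_key="campaign_kpis/{{ ds }}/",
        copy_options=["FORMAT AS PARQUET"],
    )

    build_ctas_table >> compute_kpis >> load_to_redshift
```

Step 5 is then just Tableau pointing a live connection at the Redshift table, and step 6 could map onto Airflow's own backfill mechanism (airflow dags backfill) re-running the same DAG over historical dates.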

The next few months found us actively engrossed in accomplishing the above milestones, and I can’t express the elation when we tested our first successful script and the data reconciled 😄 (P.S. It’s a very big deal for us; our bread, butter and brain are all fueled by clean and correct data). All along the way, we went through our ups ⏫: scripts running, errors getting resolved, data being validated on production; and obviously the downs ⏬: being stuck on a bug, not getting the data, delays in testing and many others. The vivacity was present not only in our code editors but also in our office lives. Be it our delectable lunch-table conversations 😋, the eventful evenings in Cyber Hub 🍹, the tantalizing team outings to the bowling arena 🎳 and sky-carting 🚗, not to forget the exhilarating office-themed parties — beachy 🏖 or Bollywood 💃. Grandeur and amusement aside, all these moments helped us build a better connection as a team and as an organisation, and fostered a sense of union 🙌.
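
For the curious, “the data reconciled” usually boils down to checks as simple as comparing row counts and key aggregates between two extracts of the same data. A minimal, hypothetical sketch (the file names and columns below are invented, not our actual tables):

```python
# Minimal reconciliation check between two extracts of the "same" data; names are hypothetical.
import pandas as pd

def reconcile(source: pd.DataFrame, target: pd.DataFrame, key: str, metrics: list) -> dict:
    """Return only the checks that do NOT match between source and target."""
    checks = {
        "row_count": (len(source), len(target)),
        f"distinct_{key}": (source[key].nunique(), target[key].nunique()),
    }
    for col in metrics:
        checks[f"sum_{col}"] = (source[col].sum(), target[col].sum())
    return {name: vals for name, vals in checks.items() if vals[0] != vals[1]}

# Hypothetical usage: one extract from the old pipeline, one from the new warehouse table
# old = pd.read_csv("redash_campaign_daily.csv")
# new = pd.read_csv("redshift_campaign_daily.csv")
# print(reconcile(old, new, "campaign_id", ["impressions", "clicks"]))  # {} means reconciled
```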

Meandering through these 7 months of a holistic voyage, looking back I believe that I have grown and learnt in every single moment. Perseverance is the key word, which I think is the biggest learning of this venture in every phase: logic building, development, testing, analysis or even communication. Nothing would have fallen into place if I hadn’t received the support of my manager, who was a refreshing breeze of motivation, a truly exemplary leader who inspired us to work, day in and day out. The 24x7 ready-to-help atmosphere in the team, the perpetual discussion of approaches, and the freedom to express and experiment with our thoughts, and to propose and even try out the latest tech stacks, are some of the many takeaways DTDL has given me.

Wrapping up this blog is as difficult as saying goodbye to the team: it’s a bursting load of emotions and gratitude. The entire voyage feels like an efficiently deployed pipeline, wherein I came in as raw data, oblivious of corporate work life; then my team helped me transform; similar to warehousing, I had to organize myself; and yepp, finally post-processing: cheers to one more successful venture ✨.

Thanks to Shashidhar Singhal, Swati, Nikunj Pahwa, the data engineering team and the entire DT family for this revamping experience!
