<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Data Detective on Medium]]></title>
        <description><![CDATA[Stories by Data Detective on Medium]]></description>
        <link>https://medium.com/@prashant.tandan528?source=rss-f8427a8c2d28------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*kxEsb1_cRq63ELbWk4KR_Q.png</url>
            <title>Stories by Data Detective on Medium</title>
            <link>https://medium.com/@prashant.tandan528?source=rss-f8427a8c2d28------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Thu, 28 May 2026 17:01:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@prashant.tandan528/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Data Modeling (and how it will save your data job)]]></title>
            <link>https://medium.com/@prashant.tandan528/data-modeling-and-how-to-save-your-data-job-e30bc2604a11?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/e30bc2604a11</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[data-modeling]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Mon, 23 Mar 2026 02:07:36 GMT</pubDate>
            <atom:updated>2026-03-23T02:20:14.218Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Simple explanation of data modeling stages with a real world example.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eFRdAQZsNWZPZwaaIDZ7nw.jpeg" /><figcaption>Photo by <a href="https://www.pexels.com/photo/woman-holding-cartoon-4226206/">Anna Shvets</a></figcaption></figure><p>Congratulations! You just got a new job of a Data X (where X = Engineer or Analyst or Architect).</p><p>Sooner or later, you will be tasked with designing(modeling) a data pipeline or a data storage schema (like a data warehouse) for the company. Now, you had memorized a lot of abstract stuff for the job interview and actually did a pretty good job impressing your interviewers, but deep down, you start questioning if you really know how to model the data practically.</p><p><strong>The data modeling task is:</strong></p><p><strong><em>“Can we analyze our sales by product, store, and date?”</em></strong></p><p><strong>Your goal is simple: build a dataset that helps answer this.</strong></p><p>Fear not! And please don’t start creating the tables yet!</p><h3>Data Modeling Stages</h3><p><strong>First, </strong>let’s revisit those abstract topics you memorized for the interview:</p><ol><li>Conceptual modeling</li><li>Logical modeling</li><li>Physical modeling</li></ol><p>But the knowledge of these topics alone would not be of much help to complete your task. So, let’s look how you’d go on about the actual task following these above stages.</p><h4>1. Conceptual Modeling (Understanding the business)</h4><p>At this stage, you’re trying to understand how the business works.</p><p>You ask questions like:</p><ul><li>What data does the business need?</li><li>What data do we already have?</li><li>Where does this data come from?</li><li>Are these sources accessible?</li><li>What are the high-level relationships?</li><li>Is it worth building this pipeline?</li></ul><p>For this example, you identify:</p><ul><li>Sales data exists in a sales processing system or in an Order Management System (OMS).</li><li>Product data exists in a catalog or an excel file.</li><li>Store data exists in another system Y.</li></ul><p>And at a high level, you understand:</p><ul><li>A sale happens.</li><li>It is tied to a product.</li><li>It happens in a store.</li><li>It happens at a time.</li></ul><p>No schemas yet. Just clarity on the business and data.</p><h4>2. Logical Modeling (Structuring the Data)</h4><p>Now you start thinking about how to represent this data.</p><p>You define:</p><p>a. Facts and Dimensions</p><ul><li><strong>Fact:</strong> Sales (a number denoting an event happening at a point in time)</li><li><strong>Dimensions:</strong> Product, Store, Date (attributes of the fact that make it meaningful when analyzing).</li></ul><p>b. Grain</p><ul><li>You decide: One row = one sale of <strong>one product</strong> in a store at a given time</li></ul><p>c. Relationships</p><ul><li>You define how sales links to Product, Store, and Date.</li><li>For example, there could be multiple products in a single sale (many-to-one). One sale happens in exactly one store and a single date (one-to-one).</li></ul><p>You also decide how you want to structure the relationship between the entities like Product, Store and Date. This is called data modeling <strong>technique</strong> like dimensional model or data vault model (next blog about this?).</p><p>Anyways, at this stage:</p><ul><li>You are looking at the data facts and dimensions, grain and relationships.</li><li>You are designing the schema.</li><li>You are not thinking about databases yet. We are fully tech-stack-agnostic until this step.</li></ul><h4>3. Physical Modeling (Making It Real)</h4><p>Now you move to implementation.</p><p>You take the logical schema and define:</p><ul><li>Tables</li><li>Columns</li><li>Data types</li><li>Storage decisions</li></ul><p>For example:</p><p><strong>Sales table is created as:</strong></p><ul><li>sale_id</li><li>product_id</li><li>store_id</li><li>date</li><li>quantity</li><li>sales_amount</li></ul><p><strong>Product table is created as:</strong></p><ul><li>product_id</li><li>product_name</li><li>category</li></ul><p>You may also think about:</p><ul><li>How data is stored (database, schema)</li><li>Partitioning or clustering</li><li>Compression</li><li>Indexing and plethora of other implementation quirks.</li></ul><p>This is where the model becomes actual data in a database.</p><p>Horray! You have created your first data model! (I’m not sure if you could do it within the deadline but a win is a win nevertheless!)</p><h3>Putting It Together</h3><p>What did you do in each step?</p><ol><li>Conceptual modeling: You understood the business and data sources.</li></ol><p>2. Logical modeling: You structured facts, dimensions, and relationships.</p><p>3. Physical modeling: You implemented the actual tables.</p><p>It would have been so easy just to jump straight into writing SQL and creating tables. But if you have a giant database and nobody actually uses it, why would the company need you at your job?</p><p>But if you spend some time on each of these stages, you will eventually build something that actually matches the business need and will be far easier to work with in the future.</p><p><strong><em>This is the fourth blog of a </em></strong><a href="https://medium.com/@prashant.tandan528/list/data-engineering-a948c0e22c0f"><strong><em>series of blogs</em></strong></a><strong><em> about the different data engineering concepts. Stick around for more nuggets like these.</em></strong></p><p><em>Let’s get connected here or on </em><a href="https://www.linkedin.com/in/prashant-tandan-74aa94163/"><em>LinkedIn</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e30bc2604a11" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Conway’s Law Quietly Shapes Every Data System You Build]]></title>
            <link>https://medium.com/towards-data-engineering/there-might-be-something-else-thats-designing-your-system-architecture-conway-s-law-6f6115431bc4?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/6f6115431bc4</guid>
            <category><![CDATA[design-systems]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-architecture]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Mon, 17 Nov 2025 17:02:07 GMT</pubDate>
            <atom:updated>2025-11-24T19:33:36.442Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KgsNpgtnO3b_3Bg2In2yfg.jpeg" /><figcaption>Photo by <a href="https://www.pexels.com/photo/close-up-photography-of-yellow-green-red-and-brown-plastic-cones-on-white-lined-surface-163064/">Pixabay</a></figcaption></figure><p><em>Part 3 of the series </em><a href="https://medium.com/@prashant.tandan528/list/data-engineering-a948c0e22c0f"><em>Data Engineering Concepts</em></a></p><p>If you’ve ever spent a few minutes of your life away from the hustle and bustle of daily life (or <em>now</em>, away from your screens), you’ve probably been struck hard, at least once, by the realization that we don’t fully control our decisions. There are countless invisible hands shaping our values, goals, visions, and, in the day-to-day scale, each of our choices that we make to move towards them.</p><p>Honestly, I have not been spending much time away these days so I can give you only a short generic list of examples. The place you live, the people around you, your parents, the societal norms and values, your faith, your childhood, your education etc. No matter how much you try to break out of their influence, they will eventually have some say in your choices.</p><p>Wait, wait, wait! Not a journal! Not a journal.</p><p>Let’s switch gears and talk about an invisible hand that might be secretly shaping your decisions if you’re someone working in the field of software/data systems, especially your decisions related to system design and architecture.</p><h3><strong>CONWAY’S LAW</strong></h3><p>Conway’s law says that the design of a system created by an organization will likely reflect how the organization itself is structured. The exact words of Melvin Conway (according to Wikipedia) were:</p><p><em>“Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.”</em></p><p>In simple terms: If your whole team is a single unit, which is rare, you are more likely to create a less modular system even though you follow modularity principles. Similarly, if you have different teams, the system will likely be more modular.</p><p>In terms of communication, the teams that communicate the most will create parts of the system that have more seamless integration with each other. And the teams that don’t like meetings, will create system units that will be harder to integrate with the rest of the system.</p><h3>What does it mean in data systems?</h3><p>This law is starkly visible in data systems and architecture. The organization structure has a huge effect on how the data platforms are designed, built and used.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fCd5CPOLvJt5gRrHeWTu3A.jpeg" /><figcaption>Silos (Photo by <a href="https://www.pexels.com/photo/silos-in-factory-near-river-26443232/">Dario Rawert</a>)</figcaption></figure><p>I think we can all agree that the goal of most data systems in an organization is to provide numbers (sales, revenue, customers, costs, margins etc) that show how the organization is performing. But if the organization is divided into uncommunicative departments and they meet up for a all-hands meeting once a month, they are most likely to be spending the whole meeting fighting with each other, trying to settle on what the REAL sales numbers were for that month.</p><p>Such jokes have almost become a cliché LinkedIn post material for data professionals. The term “data-silos” is often used to refer to this compartmentalization of data and facts. These jokes are first-hand proofs that this is a quite prevalent issue in many real-life data systems.</p><p>This difference in the truth is caused mainly because of two reasons that Conway’s law predicts:</p><p>a. An organization is inherently composed of several departments: sales, marketing, HR, IT etc. and not all of them use the same system for their day-to-day tasks. And when they try to use their own systems for organizing, storing and analyzing data, they might get different numbers as end result.</p><p>b. And even though, there is a centralized Data Platform, the teams that are more communicative and give more input during the design phase will get the final numbers closer to their truth than the teams that don’t like talking.</p><h3>What does it mean for data professionals?</h3><p>Even though simply being aware that your organization’s structure can heavily influence your data architecture may help you avoid designing flawed systems, you can also take proactive steps to prevent this from happening.</p><p>Here are a few measures I can recommend:</p><ol><li><strong>Design data platforms that break organizational boundaries.</strong><br>Break down silos and integrate all relevant data sources so that data flows freely across departments rather than stopping at team borders.</li><li><strong>Involve stakeholders early and often.</strong><br>Gather feedback during both design and implementation stages to ensure everyone agrees on a <em>single source of truth</em> for metrics and definitions.</li><li><strong>Create shared data models and governance frameworks.</strong><br>Standardize definitions for key business metrics and maintain them centrally to avoid conflicting interpretations. This ensures that even when different departments look at different data sources, they know how to communicate with other departments.</li><li><strong>Inverse Conway Maneuver.</strong><br>Organize the data team so that they intermix with different business stakeholders, and create smaller overlapping groups aligned with different business domains. The goal is to create the <em>desired architecture</em> (the data team acting as a central hub for everyone in the org), not the existing hierarchy.</li><li><strong>Continuously re-evaluate data ownership and team interfaces.</strong><br> As the organization evolves, periodically assess whether your data systems still mirror healthy communication patterns rather than legacy silos.</li></ol><h3>CONCLUSION</h3><p>Conway’s law doesn’t necessarily mean that ‘structure’ in an organization is a bad thing. But it pays to be aware of what the structure can do to the systems we design and build, more so, in data related scenarios. It is only after the awareness that it can be decided whether to create systems that mirror the same structure or to put some effort into flowing against the stream.</p><p>It is just like being aware of the invisible hands that drive your decisions and goals in life. You can go with the flow if they feel right to you but you can also re-assess them and change your direction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rARhx9T2Gc4UKvGpJjpJ7g.jpeg" /><figcaption>Photo by <a href="https://www.pexels.com/photo/person-holding-gray-and-black-compas-220147/">Pixabay</a></figcaption></figure><p><strong><em>This is the third blog of a </em></strong><a href="https://medium.com/@prashant.tandan528/list/data-engineering-a948c0e22c0f"><strong><em>series of blogs</em></strong></a><strong><em> about the different data engineering concepts, and is based on my notes from this </em></strong><a href="https://www.coursera.org/learn/intro-to-data-engineering"><strong><em>Coursera</em></strong></a><strong><em> course. Stick around for more nuggets like these.</em></strong></p><p><em>Let’s get connected here or on </em><a href="https://www.linkedin.com/in/prashant-tandan-74aa94163/"><em>LinkedIn</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6f6115431bc4" width="1" height="1" alt=""><hr><p><a href="https://medium.com/towards-data-engineering/there-might-be-something-else-thats-designing-your-system-architecture-conway-s-law-6f6115431bc4">How Conway’s Law Quietly Shapes Every Data System You Build</a> was originally published in <a href="https://medium.com/towards-data-engineering">Towards Data Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Engineering Best Practices]]></title>
            <link>https://medium.com/@prashant.tandan528/data-engineering-best-practices-b140b4503c5e?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/b140b4503c5e</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[principles-of-design]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-analysis]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Sun, 29 Jun 2025 22:31:35 GMT</pubDate>
            <atom:updated>2025-07-03T16:10:31.473Z</atom:updated>
            <content:encoded><![CDATA[<p><em>The undercurrents of Data Engineering</em></p><blockquote><em>“</em>Smooth seas never made skilled sailors.<em>” — Franklin D. Roosevelt</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GKFO2C70XrNlP00FUjjAvQ.jpeg" /><figcaption>Photo by Emma Li (<a href="https://www.pexels.com/photo/close-up-shot-of-ocean-waves-5584079/">Pexels</a>)</figcaption></figure><p>In the last blog, we came to know <a href="https://medium.com/@prashant.tandan528/what-do-data-engineers-really-do-476e744ff91f"><strong><em>what</em></strong> data engineers really do</a>. In this blog, we are going to explore <strong><em>how</em></strong> they do the things they do (or practically speaking, how they should be doing those things).</p><p>I think my last blog was clear about this: data engineering doesn’t have a cool-kid hype like training LLMs or finetuning diffusion models to make Drake wear a sari. What it <em>does</em> have, however, is everything you need to make all of those things work.</p><p>The cool LLMs and those ML algorithms don’t work if there’s no data or if the data is crappy. The work of data engineers is not just to dump the data from disparate sources to a target system so that it can be gobbled up by these cool applications. Yes, I admit that it involves a lot of data dumping but there are a set of principles that have to be followed so that the dumping is done securely and reliably, and so that the dumping system survives a long time.</p><p>Without further ado, let’s talk about those principles:</p><h3>1. Security: The Least You Can Do</h3><p>One of the most common anti-patterns in early data teams is: “Just give them admin access. We’ll clean it up later.” (Spoiler: we never do!)</p><p>In data engineering, <em>security</em> isn’t just a checkbox, it’s a mindset. When pipelines touch sensitive data like credit cards, healthcare info, internal emails, you can’t afford a “YOLO” mentality.</p><p>Here’s the golden rule: <strong>Least privilege, always.</strong><br>Give people <em>only</em> the permissions they need, <em>only</em> for the time they need them. Just like you don’t hand over your Netflix password to someone who’s looking to watch a movie trailer, don’t give root access to people checking the column of a table.</p><p>And for the data engineers themselves:</p><ul><li>Use temporary credentials.</li><li>Hide secrets and env variables.</li><li>Learn how IAM, encryption, and network protocols work because when things go wrong, <em>you</em> become the incident report.</li></ul><h3>2. Data Management</h3><p>“Data management” sounds like one of those overly broad corporate terms, like “strategic synergies” or “robust alignment.” But underneath that umbrella lives half of a data engineer’s job.</p><p>It includes:</p><ul><li>Data modeling (Should this be a dimension or a fact?)</li><li>Data quality (Is NULL a valid value or a sad mistake?)</li><li>Warehousing (Is this a lakehouse now?)</li><li>Governance (Who owns this field called user_name_2_backup_new?)</li></ul><h3>3. Data Architecture: Design Like You’re Going to Rebuild It Anyway</h3><p>Architecture in data engineering is not just “how the tables are joined.” It’s about <strong>designing for today with tomorrow in mind.</strong></p><p>Good architecture is flexible. It makes <strong>reversible decisions</strong>, prepares for <strong>failures</strong>, and chooses components that work well with <em>other teams</em>, not just yours.</p><p>Here are some principles that you may want to tattoo on your Teams channel:</p><ul><li><strong>Plan for failure.</strong> (Things will break. The goal is not “never break” but “recover gracefully.”)</li><li><strong>Loosely couple systems.</strong> (Let them dance, not tangle.)</li><li><strong>Prioritize security.</strong> (Even over speed sometimes.)</li><li><strong>Embrace FinOps.</strong> (Design systems that <em>do the job</em> without draining your AWS credits like a slot machine.)</li></ul><p>And most importantly:</p><ul><li><strong>Lead with ideas, not just tickets.</strong><br>Communicate and collaborate with all of the stakeholders and the data team while designing because architecture is a shared vision, not just a diagram in Lucidchart.</li></ul><h3>4. DataOps</h3><p>Imagine if software engineers didn’t have Git, CI/CD, or Jira (okay, maybe Jira is debatable). That’s what data engineering often looked like five years ago.</p><p>Enter <strong>DataOps, </strong>a set of practices that bring agility, collaboration, and reliability to data products.</p><p>What does it actually mean?</p><p><strong>a. Automation:</strong></p><ul><li>Use orchestration tools like Airflow or Dagster.</li><li>Build CI/CD for data pipelines.</li><li>Automate tests for data quality and schema changes.</li></ul><p><strong>b. Observability:</strong></p><ul><li>Know when your data looks weird.</li><li>Monitor freshness, nulls, duplicates, unexpected spikes.</li><li>Keep logs and create systems to audit the changes.</li></ul><p><strong>c. Incident Response:</strong></p><ul><li>Alerts that matter.</li><li>Create dashboards that explain if anything went wrong.</li><li>A culture that doesn’t shoot the messenger.</li></ul><p>In short, DataOps helps you <em>trust</em> your data even when you’re not staring at it.</p><h3>5. Orchestration: The Invisible Conductor</h3><p>You can have the best scripts and transformations, but if they don’t run in the right order with the right dependencies, you’ll spend your evenings debugging why there were no sales loaded for that day (have you checked if our products data was loaded correctly today?).</p><p>Good orchestration:</p><ul><li>Automates dependencies.</li><li>Fails loudly and clearly.</li><li>Plays well with retries, alerts, and dashboards.</li></ul><p>Bad orchestration?<br>It’s why someone’s Sunday evening gets ruined.</p><h3>6. Software Engineering: Because SQL Alone Won’t Save You</h3><p>There was a time when data engineering was just SQL and cron jobs. Those times are over.</p><p>Modern data engineers write a lot of code: Python, Spark, dbt, APIs, tests. And that code needs to be:</p><ul><li>Readable</li><li>Reusable</li><li>Modular</li><li>Testable</li></ul><p>In other words, <strong>you need to know software engineering.</strong><br>Version control, environment management, testing, code reviews. They’re not optional anymore.</p><h3>Final Thoughts: The Part You Don’t See</h3><p>Most of what makes a great data engineer is not what shows up on the analytics dashboards and reports. It’s the habits, the design decisions, the trade-offs, the duct-taped late-night hotfixes that later get replaced with clean abstractions.</p><p>So next time someone asks what data engineers do, you can say:<br>“We move data from A to B, yes. But we do it securely, reliably, observably, and in a way that won’t haunt our successors in 2027.”</p><p><strong><em>This is the second blog of a </em></strong><a href="https://medium.com/@prashant.tandan528/list/data-engineering-a948c0e22c0f"><strong><em>series of blogs</em></strong></a><strong><em> about the different data engineering concepts and technologies, and is based on my notes from this </em></strong><a href="https://www.coursera.org/learn/intro-to-data-engineering"><strong><em>Coursera</em></strong></a><strong><em> course. If you’ve ever wanted to see what a data engineer does instead of just reading job descriptions full of buzzwords, stick around.</em></strong></p><p><strong><em>Until then, may your data be clean and your DAGs never fail.</em></strong></p><p><em>Let’s get connected here or on </em><a href="https://www.linkedin.com/in/prashant-tandan-74aa94163/"><em>LinkedIn</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b140b4503c5e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Do Data Engineers Really Do?]]></title>
            <link>https://medium.com/@prashant.tandan528/what-do-data-engineers-really-do-476e744ff91f?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/476e744ff91f</guid>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data-pipeline]]></category>
            <category><![CDATA[data-ecosystems]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Tue, 24 Jun 2025 01:11:03 GMT</pubDate>
            <atom:updated>2025-06-29T22:44:01.047Z</atom:updated>
            <content:encoded><![CDATA[<p><em>A Beginner’s Map to the Data Ecosystem</em></p><blockquote><em>“Civilization advances by extending the number of important operations which we can perform without thinking about them.” — Alfred North Whitehead</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5dsmZ7F-DI7zwv9F2IJxjg.jpeg" /><figcaption>Photo by SpaceX (<a href="https://www.pexels.com/photo/steel-pipe-lines-with-pressure-gauge-586019/">pexels</a>)</figcaption></figure><p>If you’re anything like me, the term <strong>data engineering</strong> might’ve sounded boring at first. Like some grey-suited person automating spreadsheets in a dark corner of the internet. I mean, who gets excited by hearing the word ‘pipeline’?</p><p>Turns out, a lot of people do. Because once you look under the hood, <strong>data engineering is the quiet powerhouse</strong> that makes everything from your Spotify Wrapped, to fraud detection, to ChatGPT, actually work. It’s like plumbing. If it breaks, everything else floods.</p><p>So in this blog, I want to map out what data engineers <em>really</em> do, how their work shows up in the apps and decisions around us, and why it’s way more exciting than just loading CSVs into a database (though yes, sometimes we do that too).</p><h3>So… What Even Is Data Engineering?</h3><p>Let’s start simple.</p><p>Data engineering is about <strong>moving</strong>, <strong>cleaning</strong>, and <strong>organizing data</strong> so that others (analysts, ML models, dashboards, decision-makers, your future robot overlords) can use it to do cool stuff.</p><p>But don’t let that fool you. That deceptively short sentence hides an entire universe of work, from streaming billions of records in real time to building bulletproof pipelines that don’t collapse every time someone uploads a weird Excel file.</p><h3>The Data Engineering Lifecycle</h3><p>Here’s the basic lifecycle of a data engineer’s job, a journey that begins with messy chaos and ends with beautifully structured, queryable magic:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/716/1*jxEndapSkXzSyApvtCs4Bw.png" /><figcaption>Data Engineering Lifecycle overview</figcaption></figure><p><strong>1. Ingest</strong><br>Raw data resides in different sources such as APIs, databases, logs, files, sensors, toasters, etc. and we have to pull them all into a centralized repository of data. The pulls could be made in batch (intervals of few to many hours) or streaming (real-time). Some of the tools used to ingest are Airbyte, Kafka or custom Python scripts with way too many ‘try/except’s.</p><p><strong>2. Clean &amp; Transform</strong><br>The data we get may not be in the best format. So, we need to clean it and transform it. The process consists of fixing typos, unifying formats, converting units, handling missing values, merging duplicates and anonymizing stuff. This is where bad data goes to be reborn. Some of the tools used for this process are Python, dbt, Spark and a lot of SQL sorcery.</p><p><strong>3. Store</strong><br>After cleaning data from disparate sources, we send it to a place where it can chill safely, usually a data warehouse like BigQuery or Snowflake, or a data lake like S3/GCS.</p><p><strong>4. Serve</strong><br>The data is finally ready to be mined so we have to make it accessible to other teams or apps who can do that. This could mean passing it to APIs, analytics dashboards, or machine learning models. Because if the data just sits there, did it even exist?</p><h3>Wait, Isn’t That What Data Scientists Do?</h3><p>Let’s clear up the confusion. These are some other data people as well but they have different priorities.</p><p><strong>Data Engineer:</strong> “How do I <em>move</em> and <em>structure</em> this data?”</p><p><strong>Data Scientist: </strong>“What can I <em>predict</em> or <em>explain</em> with this data?”</p><p><strong>Data Analyst: </strong>“What’s <em>happening</em> in the data?”</p><p>Of course, in smaller teams, these roles blur. But trust me, nobody wants their ML model failing because the timestamp column had emojis.</p><h3>Why Should I Care?</h3><p>Still sounds abstract? Let me ground this.</p><p>Data engineers help companies:</p><ul><li><strong>Predict demand</strong>: So stores don’t overstock purple Crocs.</li><li><strong>Catch fraud</strong>: Like when someone makes 10,000 transactions in 3 seconds.</li><li><strong>Improve UX</strong>: By personalizing app content based on behavior.</li><li><strong>Automate reports</strong>: So analysts can go back to analyzing, not copy-pasting SQL results.</li></ul><p>Behind all of this is a reliable data pipeline humming quietly in the background, until it breaks and everyone panics.</p><p><strong><em>This will be the first blog of a </em></strong><a href="https://medium.com/@prashant.tandan528/list/data-engineering-a948c0e22c0f"><strong><em>series of blogs</em></strong></a><strong><em> about the different data engineering concepts and technologies. If you’ve ever wanted to see what a data engineer does instead of just reading job descriptions full of buzzwords, stick around.</em></strong></p><p><strong><em>Until then, may your data be clean and your DAGs never fail.</em></strong></p><p><em>Let’s get connected here or on </em><a href="https://www.linkedin.com/in/prashant-tandan-74aa94163/"><em>LinkedIn</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=476e744ff91f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scores Without Soul: The Metric Obsession from Report Cards to Economy]]></title>
            <link>https://medium.com/@prashant.tandan528/scores-without-soul-the-metric-obsession-from-report-cards-to-economy-527a291a015d?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/527a291a015d</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[rewards]]></category>
            <category><![CDATA[economics]]></category>
            <category><![CDATA[philosophy]]></category>
            <category><![CDATA[metrics]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Thu, 24 Apr 2025 22:21:35 GMT</pubDate>
            <atom:updated>2025-11-24T06:43:26.957Z</atom:updated>
            <content:encoded><![CDATA[<h3>Why your GPA doesn’t matter?</h3><p><em>“Not everything that can be counted counts, and not everything that counts can be counted.”</em> — William Bruce Cameron</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QR8ahPl4OXP54Jm_95VoJg.jpeg" /><figcaption>Photo by Guillaume Falco (<a href="https://www.pexels.com/photo/icebergs-2229887/">Pexels</a>)</figcaption></figure><p>The fact that numbers exist is probably how humans have been able to organize societies so well, have sent Katy Perry to the space and will probably soon create artificial copies of their brains. Metrics help us summarize a lot of information with a number, communicate the results efficiently and effectively, make decisions and take actions.</p><p>This makes it hard for us to not fall in love with our numbers. Calories on the protein bar. SAT scores. Steps on fitness trackers. Views on your YouTube videos (guilty!). Metrics give us a sense of control in a chaotic world, like tiny icebergs that we can walk over to cross an ocean.</p><p>But sometimes we tend to forget that some icebergs might be enormous, their actual body hidden under the surface. Similarly, we tend to forget or disregard what some metrics actually mean. Worse, we try to optimize those metrics even though it doesn’t improve the quality of the thing that we’re trying to improve. We become so focused on the metrics and forget that they are just some proxies that <strong><em>we</em></strong> made to measure quality.</p><p>The issue looks harmless at first but it has been messing up a lot of things if we start looking closely. I want to talk about two of the examples that I find particularly distressing.</p><h3><strong>The Report Card Menace</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7yiiRRSl-NpZxDSW5mOdVw.jpeg" /><figcaption>Photo by Safari Consoler (<a href="https://www.pexels.com/photo/shabby-chalkboard-on-stone-wall-11580454/">Pexels</a>)</figcaption></figure><p>If you are from a third world country like me, it should have been a cliché for you but the issue is worth repeating once again here ; the importance given to exam scores is more than absurd. People have forgotten that those scores were meant to evaluate the quality (and quantity ?) of the knowledge gained.</p><p>It is rare for the students to get tested on problem solving skills, critical thinking or real world application of the materials taught (the only things that matter). If you score high on certain tests, you are automatically considered a genius. And there are a lot of other ways to score high on those tests.</p><p>Students find a way to hack the exams, be it through rote memorization, cheating or learning the patterns of questions asked. The focus is on optimizing the scores (metric) not on grasping of the gist of the subjects (quality).</p><h3>The Mirage of Economic Indicators</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vVBsmFFR75vV4y51WgOqfA.jpeg" /><figcaption>Photo by Hitesh Choudhary (<a href="https://www.pexels.com/photo/man-pouring-water-from-dipper-on-blue-and-grey-house-1739855/">Pexels</a>)</figcaption></figure><p>If you try to keep up with the news, you will find a lot of fancy words being thrown around: GDP, unemployment rates, inflation, market gains etc. The people in their fancy suits try to assess the economic health of the nations using these numbers.</p><p>I admit that these metrics are hard to game (but totally possible if the politicians and corporations collude to do so) but the issue here is that they don’t represent the overall well-being of the population in anyway.</p><p>GDP, for instance, measures the total value of goods and services produced, but it doesn’t account for income inequality, environmental degradation, or labor abuse. A country might boast impressive GDP growth while a significant portion of its population struggles with poverty or lack of access to basic services (you know who).</p><p>The overreliance on these quantitative measures can lead policymakers to prioritize economic growth over human development. As a result, initiatives that enhance quality of life but don’t directly boost economic indicators, such as mental health programs, environmental conservation, or community-building efforts, may be undervalued or overlooked.</p><p>I mean, it won’t really matter how many trillions of GDP the world has if the people don’t live or enjoy living.</p><h3>Other Examples</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BM0HMbnYu5Z7Yh27EemwCw.jpeg" /><figcaption>Photo by Josip Štiburek (<a href="https://www.pexels.com/photo/dog-in-sunglasses-18757666/">Pexels</a>)</figcaption></figure><p>I can find numerous other examples where people have found ways to optimize for the metric rather than the thing that actually mattered:</p><ol><li>There are hilarious <a href="https://blog.mypacer.com/step-count-scandals-hilarious-cheating-stories-in-step-challenges/?utm_source=chatgpt.com">stories</a> of people tying their FitBits to their dogs, fans or even washing machines to get their step counts. Climbing the leaderboard makes them so high that they forget that it is their health that matters not the step count.</li><li>Companies try to evaluate employee performance using metrics like monthly sales, ticket resolution time or <a href="https://factorialhr.com/blog/stack-ranking/#:~:text=Stack%20ranking%20is%20a%20forced,and%20fire%20the%20bottom%2010%25.">stack ranking</a> which incentivize hurried deals, short-term gains, aggressive competition that may actually hurt the profitability or the productivity of the company.</li><li>IT companies trying to evaluate developers’ productivity using the Lines of Code (LoC) written per given time may force them to write lengthy, messier code just to hit the metric.</li></ol><h3>Are metrics that bad?</h3><p>After all of these discouraging examples, you might be thinking that we should get rid of the metrics completely and focus on the quality instead. But like I said, it probably wouldn’t have been possible to send Katy Perry to space if we didn’t have metrics at all (Yes! Sending her to space was that important! ).</p><p>I think there are two important mantras for anyone trying to avoid this pitfall:</p><ol><li>Metrics are useful when they are designed carefully so that they represent the underlying quality accurately.</li><li>And they are useful as long as <strong><em>we remember</em></strong> that they are just metrics and don’t forget what they were meant to be measuring.</li></ol><p>So, next time you find yourself fighting tooth and nail to get a good score (in anything), please ask yourself what is it that you’re trying to achieve.</p><p>[ I know some of you nerds might be searching for the section with ‘Reward Hacking in AI’. The blogpost has gone too long to talk about it. Maybe next time! ]</p><p><em>I’d love to hear your feedbacks and what you have to say on this. If you like to read blogs like these, please follow me here on Medium or on LinkedIn: </em><a href="https://www.linkedin.com/in/prashant-tandan-74aa94163/"><em>https://www.linkedin.com/in/prashant-tandan-74aa94163/</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=527a291a015d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How far till an AI replaces data scientists and engineers?]]></title>
            <link>https://medium.com/@prashant.tandan528/how-far-till-an-ai-replaces-data-scientists-and-engineers-c4efe8c508f7?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/c4efe8c508f7</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[jobs]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Wed, 09 Apr 2025 06:35:36 GMT</pubDate>
            <atom:updated>2025-04-09T17:45:57.387Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S5f2cI5dlsDnLYyiIq6bUg.jpeg" /><figcaption>Photo by cottonbro studio (Pexels)</figcaption></figure><p>With “ALL THE SOFTWARE ENGINEERING JOBS WILL SOON BE REPLACED BY AI!” narrative flooding the internet, it is difficult to not question if the field of your interest is going down along with the whole field.</p><p>I, personally, don’t agree that humans are ever going to go out of any work; they are just going to be working at a more abstract level. But it is always a good thing to keep an eye out for new technologies so that we can integrate them more elegantly with current workflows. <em>(And if you’re looking for those exact jobs (like me), you should know who your competition is!)</em></p><p>To find out where we stand with AI agents automating data science and engineering works, I started looking at the research papers and came across a really interesting paper “<a href="https://proceedings.neurips.cc/paper_files/paper/2024/file/c2f71567cd53464161cab3336e8fc865-Paper-Datasets_and_Benchmarks_Track.pdf#page=8.01">Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?</a> “ which was published on NeurIPS, 2024.</p><p>Let’s see what the paper says.</p><h3><strong>What is ‘Spider2-V’?</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MXPb0lCewbvPw1MYAGIakw.jpeg" /><figcaption>Photo by Lukas (<a href="https://www.pexels.com/photo/person-holding-pen-pointing-at-graph-590020/">Pexels</a>)</figcaption></figure><p>The authors introduced Spider2-V, a benchmark designed to evaluate the capabilities of multimodal AI agents in performing complex data-related tasks that typically require human expertise.</p><p>In simple terms, they have created a collection of tasks focusing on professional data science and engineering workflows which can be used to test the AI agents designed to automate data-related workflows.</p><p>It consists of 494 real-world data-related tasks using 20 enterprise-level professional applications (like dbt, Airflow, Snowflake etc.). These tasks span across various stages of data workflows, including warehousing, ingestion, transformation, visualization, and orchestration. This collection of tasks can be used to test an AI agent’s ability to generate SQL queries, write code, and manage graphical user interfaces (GUIs) across multiple professional data software systems to do a certain task.</p><h3>So, should I start looking for a new job?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HICZLuR_GINIKVqZjZdUDA.jpeg" /><figcaption>Photo by Andrea Piacquadio (<a href="https://www.pexels.com/photo/woman-leaning-on-table-3767411/">Pexels</a>)</figcaption></figure><p>If you’re a data scientist or a data engineer, I know how high you are on adrenaline right now to find out the answer <em>(I was too when I was reading the paper!).</em> And drum roll, pleaseeee!!!!</p><p>You are safe!!! <em>(At least, if you’re reading this in early 2025)</em>. The AI models had a maximum success rate of just 14% on their benchmark. Phew!!!</p><p>They tested the latest LLMs and VLMs on their benchmark. The models comprised of open source models like Mixtral-8x7B and Llama-3–70B, and closed-source ones including Qwen-Max, Gemini-Pro-1.5, Claude-3-Opus and GPT families. GPT-4V was the most accurate one but with just 14% of success rate.</p><p>The tasks related to “data warehousing” and “traditional data processing”<br>were found to be the most challenging, both less than 10% success rates. The authors highlighted that this could be because of user authentications, pop-ups, heavy GUI usage, network waiting times etc.</p><h3>Where are the AI Models better/worse?</h3><p>These were some other interesting findings listed on the paper like:</p><ol><li>Tasks with more action steps (&gt;15 steps) were more difficult.</li><li>Tasks involving authentic user accounts were much more challenging.</li><li>Pure GUI tasks were much easier than CLI tasks.</li><li>Providing a step-by-step guideline in task instructions resulted in remarkable performance gains.</li></ol><h3>Should I just chill, then?</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LOD9EohRz2NXrUs6IjSlkw.jpeg" /><figcaption>Photo by Maurício Mascaro (<a href="https://www.pexels.com/photo/woman-wearing-red-shirt-drinking-948199/">Pexels</a>)</figcaption></figure><p>Umm, first, let me remind you what happens when you dangle a ‘goal’ in front of super-ambitious humans. Remember Alexander the Great? Remember how he saw “the ends of the world” on a map, and by age 30, he’d marched through THREE freaking continents trying to touch them! That’s probably where this whole affair is headed too.</p><p>Spider2-V is a really good benchmark and it will make testing of the AI agents on data workflows much more easier and quantifiable. It is obvious that AI researchers and practitioners are going to try to smash the high scores, trying to frantically climb up the leaderboards. So, you should probably embrace yourself for many more papers coming out JUST ABOUT automating your job!</p><p><em>(I’m too lazy to look up but there probably are already some papers doing that.)</em></p><h3>My two cents</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2AGC1LfxcVcOjO7UcbNotA.jpeg" /><figcaption>Photo by Google DeepMind (<a href="https://www.pexels.com/photo/an-artist-s-illustration-of-artificial-intelligence-ai-this-image-depicts-how-ai-can-help-humans-to-understand-the-complexity-of-biology-it-was-created-by-artist-khyati-trehan-as-part-17484975/">Pexels</a>)</figcaption></figure><p>I am not a big fan of the <a href="https://en.wikipedia.org/wiki/Ostrich_effect">Ostrich Algorithm</a> and fully accept that it is a matter of time before things start getting real. But I think there are things that we can be optimistic about!</p><p>If you’ve ever worked with data pipelines and analysis, you know how tricky sometimes the “I’m gonna automate this shit!” can get. You keep on repeating the same tasks but you cannot automate them fully because there are small nuances to be taken care of in between. I think AI will help us move past that. That will free up time for data scientists and engineers to focus on more strategic and creative aspects of their work (<em>you’d like that, right? RIGHT?</em>).</p><p>Also, do you ever drive around the countryside and think that you would happily be working at one of those farms??? <em>(I’m serious! Why do you think I started working out?)</em></p><p><em>I’d love to hear your feedbacks and what you have to say on this. If you like to read about data engineering, data science and other AI stuffs, please follow me here on Medium or on LinkedIn: </em><a href="https://www.linkedin.com/in/prashant-tandan-74aa94163/"><em>https://www.linkedin.com/in/prashant-tandan-74aa94163/</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c4efe8c508f7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Here’s how I created a satellite tracker for free!]]></title>
            <link>https://medium.com/@prashant.tandan528/heres-how-i-created-a-satellite-tracker-for-free-bea1d12bfb6a?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/bea1d12bfb6a</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data-analysis]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Sat, 28 Dec 2024 23:35:09 GMT</pubDate>
            <atom:updated>2024-12-28T23:35:09.241Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*umTmMTizuYOg_vJ9wFxA2g.png" /><figcaption>Fig: Final Dashboard on Looker</figcaption></figure><p>Try it live here: <a href="https://lookerstudio.google.com/reporting/3ef7ff37-839d-4807-ad43-6461e0131cfd/page/ew9XE">Looker Link</a>. <br>Find the full code here: <a href="https://github.com/Prashant528/satellite_visualization_with_looker">GitHub Link</a>.</p><h3>MOTIVATION</h3><p>I was looking around for some cool APIs to play with and I stumbled upon the <a href="https://www.n2yo.com/">n2yo</a> website which provides information about current satellite locations for free. I instantly got the urge to create some <strong>live</strong> visualizations using the API data but got taken aback when I realized running a data pipeline on a server is going to create some deep holes in my pocket. I could have ran things locally but it doesn’t come close to being as much fun as deploying a pipeline on the cloud. So, I started looking for free options and finalized the following pipeline.</p><h3>THE PIPELINE</h3><p>To create an end-to-end data pipeline that fetches and visualizes the data, I used the following components:</p><ol><li>n2yo API (free)</li><li>Google Cloud Platform ($300 credit for new accounts)</li><li>Google Sheets (free)</li><li>Looker Studio (free)</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rDl-0rofTUFDK_g854i_fw.png" /><figcaption>Fig: Complete data pipeline</figcaption></figure><p>Let’s talk about the components in detail:</p><h4><strong>1. n2yo API</strong></h4><p><a href="https://www.n2yo.com/api/">This</a> is a free API that provides different types of information about the satellites. There are different endpoints that provide interesting set of information such as a satellite’s current co-ordinates (latitude, longitude), time for which a satellite is visible on an observer’s sky, all satellites present above a certain location within some radius etc.</p><p>To use the API, we can create a free account and download the API key provided. Then, we can use the key to access the REST endpoints.</p><p>The code to call the API for visible satellites over an observer looks like this:</p><pre>from locations import locations<br>import pandas as pd<br>import requests<br><br>def api_caller():<br>    base_url = &#39;https://api.n2yo.com/rest/v1/satellite/&#39;<br><br>    with open(&#39;keys/ny2o_key.txt&#39;, &#39;r&#39;) as file:<br>        ny2o_api_key = file.read().strip()<br><br>    meta_info_columns = [&#39;seen_from&#39;, &#39;category&#39;, &#39;transactionscount&#39;, &#39;satcount&#39;]<br>    meta_info_df = pd.DataFrame(columns=meta_info_columns)<br>    sat_info_columns = [&#39;seen_from&#39;, &#39;satid&#39;, &#39;satname&#39;, &#39;launchDate&#39;, &#39;satlat&#39;, &#39;satlng&#39;, &#39;satalt&#39;]<br>    sat_info_df = pd.DataFrame(columns=sat_info_columns)<br><br>    for location in locations:<br>        name = location[&#39;name&#39;]<br>        lat = location[&#39;lat&#39;]<br>        long = location[&#39;long&#39;]<br>        alt=  location[&#39;alt&#39;]<br>        radius = location[&#39;radius&#39;]<br>        sat_cat = location[&#39;sat_cat&#39;]<br><br>        api_info = f&#39;above/{lat}/{long}/{alt}/{radius}/{sat_cat}&#39;<br><br>        full_api_url = base_url + api_info + &#39;&amp;apiKey=&#39; + ny2o_api_key<br><br>        response = requests.get(full_api_url)<br>        if response.status_code == 200:<br>            content = response.json()<br>            loc_info_df = pd.DataFrame([content[&#39;info&#39;]])<br>            loc_sat_df = pd.DataFrame(content[&#39;above&#39;])<br>            loc_info_df[&#39;seen_from&#39;] = name<br>            loc_sat_df[&#39;seen_from&#39;] = name<br>            meta_info_df = pd.concat([meta_info_df,loc_info_df ], ignore_index=True)<br>            sat_info_df = pd.concat([sat_info_df, loc_sat_df], ignore_index=True)<br>            print(f&quot;Fetched data successfully for {name}&quot;)<br>        else:<br>            print(f&quot;Failed to fetch data for {name}. Status code: {response.status_code}&quot;)<br><br>    return meta_info_df, sat_info_df</pre><p>In the code, you can see that the location details (co-ordinates, name etc.) are stored in a file called ‘locations.py’. The API key is also stored in a file inside ‘keys’ folder for security purposes (to add it to gitignore so we don’t push the keys to GitHub). Then, we construct the full API call using the location details and the API key and make a GET request. Finally, we get the response and parse it into a Pandas dataframe.</p><h4><strong>2. GCP Composer (Airflow)</strong></h4><p>Google Cloud Platform provides a managed Airflow environment called Composer where we can create, schedule, run, monitor, and manage tasks. The best part is that we can create Directed Acyclic Graphs (DAGs) to run a bunch of tasks one after another or parallely. This helps us ensure that the dependent tasks are ran after the independent tasks are completed.</p><p>Anyone can try this awesome tool with the free $300 credit given by Google. You have to create a <a href="https://console.cloud.google.com/">Google Console</a> account and then accept to use the free credits. Before creating a Composer environment, let’s enable the Google Sheets API that will be used as our storage.</p><p>Search for ‘Google Sheets API’ in the search bar. When it opens, you can click on ‘Enable’ which will prompt you to create a service account. This account will be used to create the Composer environment and the API key will be used to update our Google Sheet. You can download the API key file and save in the ‘keys’ folder.</p><p>Now, let’s create the composer environment. Search for ‘Composer’ that opens up a window like this:</p><figure><img alt="Composer window that shows how you can create a new environment." src="https://cdn-images-1.medium.com/max/1024/1*G-0rKBAV28aYGKauf6UpCA.png" /></figure><p>Then, you can create a new environment. Give it a name, leave the rest of the settings at default and select the service account that we created for the Google Sheets API.</p><p>Now, we add the code from the GitHub repo in the “DAGs folder”. The directory structure should look like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9rjP9Gg-mLOv_Lpf0YMlRA.png" /></figure><p>The project folder contains rest of our code in the GitHub repository but we have to keep the DAG configuration file outside the project folder i.e. directly inside ‘dags/’ folder.</p><p>The file that we create to specify the DAG is ‘airflow_dag.py.’ It looks like this:</p><pre>from airflow import DAG<br>from airflow.operators.python import PythonOperator<br>from airflow.operators.dummy import DummyOperator<br><br>from datetime import datetime, timedelta<br>import sys<br>import os<br><br># Define default_args<br>default_args = {<br>    &#39;owner&#39;: &#39;airflow&#39;,<br>    &#39;depends_on_past&#39;: False,<br>    &#39;email_on_failure&#39;: False,<br>    &#39;email_on_retry&#39;: False,<br>    &#39;retries&#39;: 1,<br>    &#39;retry_delay&#39;: timedelta(minutes=5),<br>}<br><br># Define the DAG<br>dag = DAG(<br>    &#39;update_google_sheet_with_sat_data&#39;,<br>    default_args=default_args,<br>    description=&#39;Run app.py every 15 minutes&#39;,<br>    schedule_interval=&#39;*/15 * * * *&#39;,<br>    start_date=datetime(2023, 12, 12),<br>    catchup=False,<br>)<br><br># Define the Python function wrapper<br>def run_app():<br>    project_dir = os.path.join(os.path.dirname(__file__), &#39;project&#39;)<br>    sys.path.insert(0, project_dir)<br>    os.chdir(project_dir)<br>    from app import main  # Assuming your main function is called &#39;main&#39;<br>    main()<br><br># Add a PythonOperator<br>run_app_task = PythonOperator(<br>    task_id=&#39;run_app&#39;,<br>    python_callable=run_app,<br>    dag=dag,<br>)<br><br>start_task = DummyOperator(<br>    task_id=&#39;start&#39;,<br>    dag=dag,<br>)<br><br>end_task = DummyOperator(<br>    task_id=&#39;end&#39;,<br>    dag=dag,<br>)<br><br>start_task &gt;&gt; run_app_task &gt;&gt; end_task</pre><p>In the code, you can see that our DAG is triggered every 15 minutes. We have created dummy start and end task to give the DAG a starting and ending point. The main task to run is the ‘run_app_task’ that calls the main() function inside the ‘app.py’ file.</p><p>Now, since our Airflow environment is setup, we can think about how we’re going to store the API data.</p><h4><strong>3. Google Sheets</strong></h4><p>I don’t think Google Sheets needs an introduction. We are going to store our API data here and link it as a source in the Looker environment.</p><p>Create a sheet and name it as ‘Satellite data’. My sheet looks like <a href="https://docs.google.com/spreadsheets/d/1TUBDCIEyPqv8Hfx6pARoGfH_csbCR0Gk0ZIqLa4IpzA/edit?usp=sharing">this</a>.</p><p>Now, we write some code to update this sheet with the API data. In the file, ‘sheets_updater.py’:</p><pre>from google.oauth2 import service_account<br>from googleapiclient.discovery import build<br><br>class GSheet():<br>    def __init__(self, spreadsheet_id) -&gt; None:<br>        self.SPREADSHEET_ID = spreadsheet_id<br>        self.SERVICE_ACCOUNT_FILE = &#39;keys/molten-catalyst-443503-a5-c663cf6164b1.json&#39;<br>        # Authenticate and create the Sheets API service<br>        self.credentials = service_account.Credentials.from_service_account_file(<br>            self.SERVICE_ACCOUNT_FILE, scopes=[&#39;https://www.googleapis.com/auth/spreadsheets&#39;]<br>        )<br>        self.service = build(&#39;sheets&#39;, &#39;v4&#39;, credentials=self.credentials)<br><br>    def update_sheets_with_data(self, sheet_range, data ):<br>        RANGE_NAME = sheet_range<br>        values = [data.columns.values.tolist()]<br>        values.extend(data.values.tolist())<br>        # Prepare the request body<br>        body = {<br>            &#39;values&#39;: values<br>        }<br><br>        # Write data to the sheet<br>        result = self.service.spreadsheets().values().update(<br>            spreadsheetId=self.SPREADSHEET_ID,<br>            range=RANGE_NAME,<br>            valueInputOption=&#39;RAW&#39;,  # or &#39;USER_ENTERED&#39; for formatted input<br>            body=body<br>        ).execute()<br><br>        print(f&quot;{result.get(&#39;updatedCells&#39;)} cells updated.&quot;)<br><br>    def clear_sheet(self, sheet):<br>        resultClear = self.service.spreadsheets( ).values( ).clear(<br>                                        spreadsheetId=self.SPREADSHEET_ID, <br>                                        range=sheet, <br>                                        body={}).execute()</pre><p>In app.py:</p><pre>from ny2o_api_caller import api_caller<br>from sheets_updater import GSheet<br>from parse_api_responses import combine_sat_lat_long<br><br>spreadsheet_id = &#39;1TUBDCIEyPqv8Hfx6pARoGfH_csbCR0Gk0ZIqLa4IpzA&#39;<br><br>def main():<br>    google_sheet = GSheet(spreadsheet_id)<br>    meta_info_df, sat_info_df = api_caller()<br>    meta_sheet_range = &#39;meta_info&#39;<br>    sat_sheet_range = &#39;sat_info&#39;<br>    sat_cordinates_df = combine_sat_lat_long(sat_info_df)<br>    #clear the sheets for new update<br>    google_sheet.clear_sheet(meta_sheet_range)<br>    google_sheet.clear_sheet(sat_sheet_range)<br>    #update sheets with new data<br>    google_sheet.update_sheets_with_data(meta_sheet_range, meta_info_df)<br>    google_sheet.update_sheets_with_data(sat_sheet_range, sat_info_df)<br><br><br>if __name__==&#39;__main__&#39;:<br>    main()</pre><p>The spreadsheet_id can be extracted from the URL of the sheet.</p><p>For example, the spreadsheet ID in the URL https://docs.google.com/spreadsheets/d/1TUBDCIEyPqv8Hfx6pARoGfH_csbCR0Gk0ZIqLa4IpzA is “1TUBDCIEyPqv8Hfx6pARoGfH_csbCR0Gk0ZIqLa4IpzA”.</p><p>You can test the code by triggering the DAG manually and seeing if the sheet gets updated with new data or not.</p><h4>4. Looker Studio</h4><p>Since we have our data in the Google Sheets now, the final step left is to see what Looker can do to show the data nicely. <a href="https://lookerstudio.google.com/">Looker</a> is a visualization and analytics tool provided by Google that can connect to various data sources and create dashboards, charts and reports on top of it.</p><p>Open Looker Studio (should be accessible from your gmail just like other Google Services). You’ll be prompted to choose a source. Select Google Sheets and then select the sheet where our data gets loaded. Also, be sure that you choose the sheet that has the actual data (if you have followed my format, you should choose the ‘sat_info’ sheet).</p><p>Looker will pull up some default charts with the data but we will be creating a map visualization. Delete all the default charts loaded. Then, add a Google Map bubble visualizer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5RYapjX3Q8vdEARug0ZcFg.png" /></figure><p>Then, you can choose the fields to show up in the map. My configuration looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dn8iPha1kFpwHTaEnMyaLg.png" /></figure><p>You can add filters as per your likings too to drill down on the satellites you want to find.</p><p><strong>And viola! You have successfully created an end-to-end data pipeline that pulls data from an API source, transforms and stores it, and creates live visualization.</strong></p><p><em>If you get stuck or have any feedback, please feel free to comment or hit me up directly on </em><a href="https://www.linkedin.com/in/prashant-tandan-74aa94163/"><em>LinkedIn</em></a><em>. Thanks for reading. Happy coding!</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bea1d12bfb6a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Curious Case of the NULL Columns in Snowflake]]></title>
            <link>https://medium.com/@prashant.tandan528/the-curious-case-of-the-null-columns-in-snowflake-7d93260b1aaf?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/7d93260b1aaf</guid>
            <category><![CDATA[data-manipulation]]></category>
            <category><![CDATA[etl]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[snowflake]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Tue, 26 Nov 2024 01:02:30 GMT</pubDate>
            <atom:updated>2024-11-26T05:52:28.153Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Curious Case of the NULL Columns</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nyTPIjEB8eq-g7Ajdk55Cg.jpeg" /><figcaption>(Photo by Muhammed Ensar, <a href="https://www.pexels.com/photo/text-on-computer-monitor-10725897/">Pexels</a>)</figcaption></figure><p>In a bustling data-centric company, Mr. Data Detective (DD) was known for solving the most baffling data issues. One day, his manager approached him with a mission: rescue the data engineering team, who were drowning in frustration.</p><p>“Every day, one or two of our 400 data transformation jobs fail, while inserting to the tables from SELECT statements” the manager explained. “It’s always the same error —</p><pre>NULL result in a non-nullable column</pre><p>We recently migrated to Snowflake from our on-premise database. The migration to Snowflake had only made things worse. Unlike our previous database platforms, Snowflake’s error messages didn’t pinpoint the exact column causing the issue. The team has no idea which column is causing the issue by looking at the error message. ”</p><p>This was no ordinary challenge. The engineers were grappling with tables with over 75 columns. Debugging involved pulling table definitions, identifying NOT NULL constraints, and manually looking through data to locate the culprit column.</p><p>Time-consuming. Tedious. And completely unsustainable. It was eating into their SLA commitments and the clients were furious!</p><p>Mr. DD knew while Snowflake excels in performance and scalability, but unlike other databases, it doesn’t provide detailed feedback on which column causes a NULL result in a non-nullable column error.</p><p>Armed with his smartness and determination, Mr. DD headed to his drawing board.</p><h3><strong>The Master Plan</strong></h3><p>“If Snowflake won’t tell us where the error is, we’ll make it tell us,” thought Mr. DD. His idea? A procedure that would dynamically validate and insert data into a table. If a validation failed, it would flag the column causing the issue, log the offending row, and provide a clear error message. Here’s the plan he devised:</p><ol><li><strong>Dynamic Validation:</strong></li></ol><p>Extract all NOT NULL columns from the target table.</p><pre>SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME<br>=&#39;&#39;&#39; ||target_table_name||&#39;&#39;&#39;AND IS_NULLABLE = &#39;&#39;NO&#39;&#39;;</pre><p><strong>2. Row Inspection:</strong></p><p>Check if any rows in the incoming dataset had NULL values in those columns.</p><pre>LET col_list RESULTSET := (EXECUTE IMMEDIATE column_list_query);<br><br>-- Validate each column<br>FOR col_name IN col_list DO<br>  validation_query := &#39;SELECT COUNT(*) FROM (&#39; || insert_query || &#39;)<br>                    WHERE &#39; || col_name.column_name || &#39; IS NULL;&#39;;</pre><p><strong>3. Error Logging:</strong></p><p>Log problematic rows and columns in a centralized error_log table.</p><pre>IF (validation_result &gt; 0) THEN<br>   error_log_query :=<br>         &#39;INSERT INTO ERROR_LOG (ROW_ID, COLUMN_NAME, ERROR_MESSAGE)<br>          SELECT row_number() OVER (ORDER BY 1), &#39;&#39;&#39; || col_name ||<br>          &#39;&#39;&#39;, &#39;&#39;NULL in NOT NULL column&#39;&#39; FROM (&#39; || insert_query || &#39;) WHERE &#39; ||<br>          col_name || &#39; IS NULL;&#39;;<br><br>EXECUTE IMMEDIATE error_log_query;<br>return &#39;NULL result in NOT NULL column “&#39;|| col_name || &#39;&quot;.Check error_log for details.&#39;;<br>END IF;</pre><p><strong>4. Insert If Clean:</strong></p><p>If no issues were found, proceed with the insertion.</p><p>He coded the procedure and called it validate_and_insert. Here’s a simplified version:</p><h3><strong>The Procedure</strong></h3><pre>CREATE OR REPLACE PROCEDURE validate_and_insert(<br>     target_table_name STRING,<br>     insert_query STRING<br>    )<br>RETURNS STRING<br>LANGUAGE SQL<br>AS<br><br>$$<br>DECLARE<br> col_list RESULTSET;<br> column_list_query STRING;<br> validation_query STRING;<br> validation_result INT;<br> error_log_query STRING;<br><br>BEGIN<br><br>-- Get NOT NULL columns<br>column_list_query := &#39;SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS<br>WHERE TABLE_NAME =&#39;&#39;&#39; ||target_table_name||&#39;&#39;&#39;AND IS_NULLABLE = &#39;&#39;NO&#39;&#39;;&#39;;<br><br>--Get NOT NULL columns<br>LET col_list RESULTSET := (EXECUTE IMMEDIATE column_list_query);<br><br>-- Validate each column<br>FOR col_name IN col_list DO<br>validation_query := &#39;SELECT COUNT(*) FROM (&#39; || insert_query || &#39;)<br>WHERE &#39; || col_name.column_name || &#39; IS NULL;&#39;;<br><br>EXECUTE IMMEDIATE validation_query;<br><br>SELECT $1 INTO validation_result FROM TABLE(RESULT_SCAN(LAST_QUERY<br>_ID()));<br> IF (validation_result &gt; 0) THEN<br> error_log_query :=<br>      &#39;INSERT INTO ERROR_LOG (ROW_ID, COLUMN_NAME, ERROR_MESSAGE)<br>       SELECT row_number() OVER (ORDER BY 1), &#39;&#39;&#39; || col_name ||<br>      &#39;&#39;&#39;, &#39;&#39;NULL in NOT NULL column&#39;&#39; FROM (&#39; || insert_query || &#39;) WHERE &#39; ||<br>      col_name || &#39; IS NULL;&#39;;<br><br>EXECUTE IMMEDIATE error_log_query;<br> return &#39;NULL result in NOT NULL column “&#39;|| col_name || &#39;&quot;.Check error_log for details.&#39;;<br> END IF;<br>END FOR;<br><br>-- Insert if valid<br>EXECUTE IMMEDIATE &#39;INSERT INTO &#39; || target_table_name || &#39; &#39; ||<br>insert_query;<br>return &#39;Data successfully inserted into &#39; || target_table_name || &#39;.&#39;;<br>END;<br>$$<br>;</pre><h3><strong>Testing the Waters</strong></h3><p>With the procedure ready, Mr. DD handed it over to the data engineers to test on a problematic table in their Dev environment. The engineers ran it, holding their breath. The result? A beautifully clear error message:</p><pre>NULL result in NOT NULL column “customer_id”. <br>Check error_log for details.</pre><p>Mission Success!!!</p><p>Not only did the procedure pinpoint the problematic column, but it also logged the offending rows, giving the engineers a clear path to resolution. The team was thrilled. Gone were the days of manual debugging and missed SLAs.</p><p>Mr. DD knew the battle wasn’t over. He still needed to propose this change to his manager, secure sprint time for testing and integration, and present it to the change board for approval. But for now, he allowed himself a moment to savor the victory.</p><h3><strong>The Takeaway</strong></h3><p>Data engineering isn’t just about writing SQL or managing pipelines. It’s about solving real-world problems efficiently. Through this simple yet effective solution, Mr. DD bridged the gap between Snowflake’s limitations and the team’s operational needs.</p><p>The lesson? With creativity and determination, even the trickiest challenges can be turned into scalable solutions.</p><p>And so, the legend of Mr. Data Detective grows…</p><p><em>The author of the blog is </em><a href="https://www.linkedin.com/in/prashanna-adhikari-085839171/"><em>Prashanna Adhikari</em></a><em>, a data enthusiast who loves to solve people’s data issues.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7d93260b1aaf" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hey Mr. DD, is our chatbot on drugs?]]></title>
            <link>https://medium.com/@prashant.tandan528/hey-mr-dd-is-our-chatbot-on-drugs-hallucinations-in-llm-e2719877f181?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/e2719877f181</guid>
            <category><![CDATA[hallucination-in-llm]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[llm-evaluation]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Sun, 10 Nov 2024 07:25:57 GMT</pubDate>
            <atom:updated>2024-11-10T07:58:42.555Z</atom:updated>
            <content:encoded><![CDATA[<h3>Hey Mr. DD, is our chatbot on drugs? (Hallucinations in LLM)</h3><figure><img alt="A robot sitting among mushrooms, hallucinating." src="https://cdn-images-1.medium.com/max/1024/1*l2_CQ6hujdFTGeD079K6CQ.png" /><figcaption>AI on shrooms (generated by AI on shrooms)</figcaption></figure><p>The manager rushed inside Mr. DD’s office, despair brimming on his face.</p><p>Their guitar company had just added a chatbot to their website to help the customers figure out the type of guitar they want. They had chosen a top notch state-of-the-art LLM as the engine for the chatbot. But as his manager was playing with it and digging into the questions, he realized that the chatbot had no clue what it was being asked about. It talked about made up types of guitars and provided wrong details for the actual guitars, but it sounded fully confident. The manager suspected that it may as well have been a drugged person acting as a chatbot behind the scenes.</p><p>Mr. DD listened to his manager’s grievances and reminded him how any LLM they picked could not possibly act as a guitar nerd like they wanted. And, there could be a bunch of <strong>factors for this behavior</strong>, mainly:</p><ol><li><strong>Lack of domain exposure during training: </strong>Their LLM probably had not seen enough data about guitars during its training process so it doesn’t know anything about guitars yet.</li><li><strong>Training for different goals: </strong>The LLM could have been trained to optimize other qualities like generalizability, fluency, confidence, creativity etc. but not factual accuracy.</li><li><strong>Training cut-off time: </strong>There could have been a lot of new updates in the “guitar culture” that happened after the LLM’s training data was prepared and finalized. This caused it to miss out on that information.</li><li><strong>Questioning without enough context: </strong>A simplistic reduction of a LLM’s task is to pick the next word from a list of options given the sequence of previous words. And the words are clustered around certain contexts. So, if the manager’s questions (prompts) didn’t provide enough context about the information he wanted, the LLM would not pick the word from the correct context and thus, the answer seems random.</li></ol><p>The manager seemed quite convinced by Mr. DD’s reasons as he thought he too would have been confused if he had never read anything about guitars or if somebody asked him questions out of the blue without any context. He could sympathize with the chatbot now.</p><p>Mr. DD’s smirk hinted that he had something brewing in his mind. Without his manager insisting much, he started about the solutions that could fix their “drugged chatbot”.</p><ol><li><strong>Asking better questions:</strong></li></ol><p>The first solution would be to create better questions with enough context, instructions and examples to help the LLM pick the right words. This technique is often called “<strong>prompt engineering</strong>”. Mr. DD hates the fact that people have given this simple task such an intimidating name but who can fight with the crowd.</p><p>But this solution would not be enough if the LLM doesn’t know anything about guitars. So, Mr. DD moves on to the next solution.</p><p><strong>2. Further training the LLM:</strong></p><p>Mr. DD could gather up all of the text data related to guitars: from the internet, sites that talk about guitars, forums where the guitar nerds squabble about the best strings, and the company documents about their guitars and further train the LLM on this data. This process is called “<strong>fine tuning</strong>”.</p><p>This would provide the chatbot with enough information about the guitars and combined with the first solution, the answers could get really reliable.</p><p>This solution sounded good to the manager but he had one question. What if there is a huge cultural shift in the “guitar culture” in next three months? Realizing the stupidity of his question, he tried to compensate by asking a more realistic one: what if the company re-classified all the guitars into totally new categories. Will we have to do this fine-tuning again and again if any information changes?</p><p>Mr. DD gave a ‘yes’ nod but he had one more trick under his sleeve. He knew that this one was going to blow his manager’s mind.</p><p><strong>3. Grounding the LLM’s response with actual truth:</strong></p><p>Remember the collection of text data that Mr. DD was proposing to collect for fine-tuning? Wouldn’t it be awesome if the LLM could access that text data in realtime whenever a user asks it a question related to guitars?</p><p>Such a framework for looking up the ground truth to assist in the LLM’s response generation process is called <strong>Retrieval-Augmented Generation (RAG)</strong>. This fancy term is quite simple if broken down into steps:</p><p>a. Using the prompt given by the user, the framework tries to retrieve the most relevant ground truth ( i.e. actual document/s or part/s of document) in the collection of data. Hence, the term “<strong>Retrieval</strong>”.</p><p>b. The relevant ground truth extracted from the collections is then augmented to the user prompt as the context to be used by the LLM for responding. Hence, the term “<strong>Augmented</strong>”.</p><p>c. The LLM then generates the response that is based on the provided ground truth. Hence, the term “<strong>Generation</strong>”.</p><p>Just like Mr. DD had guessed, the manager could not hold back his amazement with this method. This was probably the cure that their drugged chatbot needed.</p><p>The manager could also see how it removes the need to fine tune the LLM again and again if the existing data was changed or new data was added. They could just update the collections of the data and the RAG framework would pick up the ground truth from the updated data.</p><p>He could also see that looking up for the ground truth while a user is actively interacting with the LLM could delay the response time but it was a trade-off he was willing to make in order to have a reliable chatbot.</p><p>Mr. DD was just coming to a dreadful realization but his manager reached there first and asked,” Will you help us implement this RAG thing, Mr. DD?”</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e2719877f181" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hey Mr. DD, we are bringing in new type of data in the old column.]]></title>
            <link>https://medium.com/@prashant.tandan528/dd-we-are-bringing-in-new-type-of-data-in-the-old-column-7fff69cd17b2?source=rss-f8427a8c2d28------2</link>
            <guid isPermaLink="false">https://medium.com/p/7fff69cd17b2</guid>
            <category><![CDATA[database-administration]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-analysis]]></category>
            <dc:creator><![CDATA[Data Detective]]></dc:creator>
            <pubDate>Fri, 27 Sep 2024 02:34:50 GMT</pubDate>
            <atom:updated>2025-04-26T03:35:53.565Z</atom:updated>
            <content:encoded><![CDATA[<p><em>Today, Mr. DD(Data Detective) got this request from his manager, ‘We are bringing in new ITEM_IDs with alpha-numeric characters. Can we make sure we accommodate them properly in our database?’. That’s smart of the business people to finally realize that changing the ITEM_IDs from numbers to strings will make classification and analysis of the items easier.</em></p><p><em>DD’s first reaction? “What news! Didn’t I tell you that before?”. Anyways, the request looked simple but DD knows that this could turn into a real mess real quick. The database has all of the history data from past 3 years. But DD is not someone who dwells in the past too much so he puts on his detective hat and starts doing his magic.</em></p><p><em>DD’s first idea was that he could change the existing ITEM_ID column to string using ALTER statements.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I8_dJ_9LAhOmFnUe5nuRuQ.jpeg" /><figcaption>Photo by Kevin Ku (<a href="https://www.pexels.com/photo/data-codes-through-eyeglasses-577585/">Pexels</a>)</figcaption></figure><h3>Method 1: Changing the Type of the Column In-Place</h3><p>Converting a column’s data type in-place means altering the column directly without creating a new one. This kind of type conversion is supported on most database platforms, though the syntax may vary slightly depending on the platform. For DD’s PostgreSQL database, the process looks like this:</p><h3>PostgreSQL</h3><p>In PostgreSQL, you can use the ALTER TABLE command with the USING clause to convert a column from numeric to string directly in place.</p><p>ALTER TABLE DD_item_table ALTER COLUMN ITEM_ID TYPE VARCHAR(255) USING ITEM_ID::VARCHAR;</p><ul><li>The ALTER COLUMN command changes the data type to VARCHAR(255).</li><li>The USING clause ensures that the numeric data is explicitly cast to a string.</li></ul><p>PostgreSQL requires the USING clause to explicitly handle type conversions when changing between incompatible types.</p><p>For some other databases, the query looks like this:</p><h3>MySQL</h3><p>ALTER TABLE DD_item_table MODIFY COLUMN ITEM_ID VARCHAR(255);</p><h3>SQL Server</h3><p>ALTER TABLE DD_item_table ALTER COLUMN ITEM_ID VARCHAR(255);</p><p><em>“This should be easy!”, thought DD and relaxed for a while. But he knows that you can never be too careful while dealing with critical data so he starts thinking of the edge cases. And nonetheless he found one such case that would have busted his ass if he wasn’t careful enough. What if my new data type is too small to hold the old data and the history data gets truncated while converting. And other some questions came into his head: What if I want to keep the the old data intact in the same table? Or what if my database didn’t support this in-place type conversion in the first place. There should be a more reliable process!</em></p><h3>Method 2: Creating a New Column with the Same Data</h3><p>If your database system doesn’t allow in-place conversion (or you want to be cautious), creating a new column with the desired data type is a good alternative. Here’s a general process that works for almost all databases:</p><h3>Step 1: Add a New Column</h3><p>First, create a new column with the desired STRING (or VARCHAR) data type.</p><p>ALTER TABLE DD_item_table ADD COLUMN ITEM_ID_STR VARCHAR(255);</p><h3>Step 2: Copy Data from the Old Column to the New Column</h3><p>Now, copy the existing NUMERIC data from the old ITEM_ID column to the new ITEM_ID_STR column, casting or converting as necessary. For DD&#39;s PostgreSQL, the query would look like:</p><p>UPDATE DD_item_table SET ITEM_ID_STR = ITEM_ID::VARCHAR;</p><p>For other databases, the query would look like:</p><h4>MySQL:</h4><p>UPDATE DD_item_table SET ITEM_ID_STR = ITEM_ID;</p><h4>SQL Server:</h4><p>UPDATE DD_item_table SET ITEM_ID_STR = CAST(ITEM_ID AS VARCHAR(255));</p><h4>Oracle:</h4><p>UPDATE DD_item_table SET ITEM_ID_STR = TO_CHAR(ITEM_ID);</p><h3>Step 3: Drop the Old Column (Optional)</h3><p>You should verify that the data in the new column(ITEM_ID_STR) is the same as in your old column (ITEM_ID). You can do a comparison of the columns using NOT IN or EXCEPT clauses. After you’re sure that the data has been copied successfully, you can drop the old ITEM_ID column if it’s no longer needed.</p><p>ALTER TABLE DD_item_table DROP COLUMN ITEM_ID;</p><h3>Step 4: Rename the New Column (Optional)</h3><p>If you want the new column to have the original column’s name, rename it.</p><p>ALTER TABLE DD_item_table RENAME COLUMN ITEM_ID_STR TO ITEM_ID;</p><p>Now you have the same column ITEM_ID with a different data type and you know that you haven’t messed up anything while doing that.</p><p><em>“Phew, that took a while!” DD was just starting to feel his dopamine levels rising but another thought hit him like a bus. ITEM_ID is probably a primary key and has been referenced in other tables as a foreign key. If that is the case, he now needs to convert those fields too. He also needs to check the table DDLs to see if the alteration of the column has caused any existing indexes or constraints to break. He still has a lot to do before he sends out the email saying “I am ready. Bring those new ITEM_IDs in! “</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7fff69cd17b2" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>