<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Deepak Jayabalan on Medium]]></title>
        <description><![CDATA[Stories by Deepak Jayabalan on Medium]]></description>
        <link>https://medium.com/@deepak.jayabalan?source=rss-835e3802d07c------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*4vhYQ8lYzfjjfp0a</url>
            <title>Stories by Deepak Jayabalan on Medium</title>
            <link>https://medium.com/@deepak.jayabalan?source=rss-835e3802d07c------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 30 May 2026 05:38:58 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@deepak.jayabalan/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Pipeline Optimization Techniques]]></title>
            <link>https://medium.com/@deepak.jayabalan/pipeline-optimization-techniques-eb11b48ce941?source=rss-835e3802d07c------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb11b48ce941</guid>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[pipeline]]></category>
            <dc:creator><![CDATA[Deepak Jayabalan]]></dc:creator>
            <pubDate>Fri, 05 Jul 2024 05:04:50 GMT</pubDate>
            <atom:updated>2024-07-05T05:04:50.589Z</atom:updated>
            <content:encoded><![CDATA[<p>There are a large number of optimization strategies, all of which can be used to make your pipelines more efficient, in terms both of resource utilization and overall performance. Not only as a result I got staggering cost savings in millions but also greatly widened the data access. The approaches in this article place the greatest emphasis on optimizing distributed processing systems, fine-tuning SQL inquiries, and streamlining workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wpxXZ8d42TlOQlNy" /><figcaption>Photo by <a href="https://unsplash.com/@jenstakesphotos?utm_source=medium&amp;utm_medium=referral">Jens Freudenau</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h4>Parallelism</h4><p>Parallel processing means that multiple operations can be at the same time, thus accelerating to a great extent the computer’s working speed and enhancing its efficiency.</p><p>There are various ways of using it:</p><p>Multithreading and Multiprocessing: These allow a program to perform several independent operations at once. With multi-threading, many subroutines will run inside one process; while multiplying processes means that more than one process may be happening at a time.</p><p>Distributed Computing: Distributed computing frameworks like Apache Spark, Apache Flink, and Dask enable the distributed processing of big data across multiple nodes. This doesn’t significantly reduce processing time for large data sets.</p><p>By parallel processing data pipelines performance can be improved dramatically, particularly for compute-intensive tasks.</p><h4>Filtering data as early as possible</h4><p>Early Filtering: We have filtering operations are set as close to the data source as feasible, thus ensuring that only pertinent data will be processed downstream.</p><p>SQL Efficient Queries: Efficient SQL queries with WHERE clauses are used to filter data at its source before being passed on to DDL statements in the pipeline.</p><p>Early filtering only extracts those data required by subsequent stages. It may well reduce the amount of data that has to proceed and eventually be stored in your database, which would be beneficial for both performance and capacity.</p><h4>Json Parsing/XML Parsing</h4><p>JSON is a commonly used data format in data pipelines. Optimizing JSON parsing is one way to speed up the throughput of pipelines that handle large volumes of JSON data. Json parsing is costly operation that should be used sparingly. Use json parsing as minimum as possible like only after filtering all the required data, json parse and then do a cross join and finally if there multiple column logic implementing the same json parse then do it once in the inner query and reuse in all instances in the outer query. Optimized JSON parsing speeds data extraction processes and makes them resource efficient.</p><h4>CROSS JOIN usage</h4><p>Cross joins (also called Cartesian joins),called well as a cross product expression jointly with one table to form out such results one and the other least.</p><p>But while they have their uses in specialized situations, they generally take up a lot of resources and can trigger performance bottlenecks.</p><p>Prevent Unneeded Cross Joins : Only use cross joins when absolutely necessary. You can often substitute more efficient join types such as the INNER JOIN or LEFT JOIN.</p><p>Joining before you join: before performing a cross join for the final data set, use filtering criteria to restrict its size.</p><p>Avoid using cross joins as much as possible in twenty fifteen to reduce demand on resources and improve efficiency.</p><h4>Partitions and indexing on appropriate columns</h4><p>Table Partitioning: Dividing large tables into smaller, more manageable partitions based on criteria such as date ranges or key values enables queries to scan only relevant partitions, in turn cutting query times. Increased Performance. The more indexes that can be used for retrieval purposes, the better retrieving data will be. But if you create seven or eight single-column different types of indexes PITs probably choke whether they are in place on level 2 nodes. On the other hand, creating an index with columns that are frequently used as query conditions, like WHERE clauses and JOINs perfects performance. Composite indexes for more than one column are also useful in this respect.</p><p>Proper partitioning and indexing strategies can significantly reduce execution times — query response time is almost instant — while maintaining a manageable overall load on your network resources.</p><h4>Workflow Orchestration</h4><p>Workflow orchestration is the seamless coordination and management of all these tasks in a data pipeline to ensure that they are executed smoothly, efficiently, and in whatever order necessary.</p><p>Orchestration Tools: You can define workflows and schedule them using tools like Apache Airflow, Prefect or Luigi. They come with features such as task dependency management, retries and alerting.</p><p>Task Orderings: Establish task dependencies to execute the tasks in a particular order and gracefully handle failures.</p><p>Run Independent Tasks in parallel: Execute the tasks which are independent to run them concurrently and fasten the overall flow.</p><p>Data pipelines are resilient, scalable and easy to operate through an efficient workflow orchestration.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb11b48ce941" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM Evaluation Metrics]]></title>
            <link>https://medium.com/@deepak.jayabalan/llm-evaluation-metrics-3e641db6ebc2?source=rss-835e3802d07c------2</link>
            <guid isPermaLink="false">https://medium.com/p/3e641db6ebc2</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[llm-evaluation]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Deepak Jayabalan]]></dc:creator>
            <pubDate>Fri, 05 Jul 2024 04:20:27 GMT</pubDate>
            <atom:updated>2024-07-05T04:20:27.802Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>LLM Evaluation Metrics</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pIVGgQF_BbnPOCJq" /><figcaption>Photo by <a href="https://unsplash.com/@homajob?utm_source=medium&amp;utm_medium=referral">Scott Graham</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p><strong>LLM model Evaluation:</strong></p><p>Since the buzz of LLMs, many have started using LLMs and it is critical to evaluate the performance of these LLMs, how well they perform their function, and how much they can be trusted. For this purpose, quantitative measurements along with ground truth output are needed. There are multiple metrics available and we need to use the right metrics to solve the problem which we want. In this article, let’s deep dive into several metrics to evaluate the performance of these Large Language Models since there is no unified strategy to measure their outcome.</p><p><strong>Automated Metrics</strong>:</p><p>Text similarity metrics such as BLEU, ROUGE, etc evaluators work on computing similarity by measuring the overlap of word or word sequences when comparing text elements. They are designed to help in generating a similarity score for the LLM-predicted and the ground truth text-based references.</p><p><strong>BLEU Score:</strong></p><p>BLEU is short for bilingual evaluation understudy, and it quantifies the similarity of a machine translation of one natural language to another. As a result, it is generally used in machine-translation duties, although it is increasingly used in other duties such as text generation, paraphrase generation, and text summarization. The basic idea is to calculate the precision, the fraction of candidate words in the reference translation. The scores are calculated for individual translated parts, generally sentences, by comparing them to a group of high-quality reference translations. Then those scores are summarized throughout the corpus to achieve a good prediction of the translation. It does not maintain punctuation or grammaticality when determining scores. Furthermore, hardly any human translations achieve an ideal BLEU report, because an ideal BLEU would imply that the candidate equals any of the reference translations. Therefore, a person does not need to approach the ideal score. In a long time, it is probably beneficial to have one or more judgments.</p><p><strong>ROUGE Score:</strong></p><p>Unlike the BLEU score, the evaluation metric Recall-Oriented Understudy for Gisting Evaluation measures the recall. It’s frequently utilized to evaluate the quality of generated text and machine translation work. But, as it is already obvious from its name, it evaluates recall in summarizations. However, in such tasks, it is necessary to evaluate not the words the model can generate but retrieve.</p><p><strong>SEMANTIC SIMILARITY</strong></p><p>Semantic similarity between two sentences captures how much meaning they share. It can be measured by considering each string as a feature vector representing its semantics/meanings. One way to do this is to first obtain embeddings of the strings (e.g. via an LLM) and then use the cosine similarity of the two embedding vectors.</p><p><strong>RULE-BASED METRICS:</strong></p><p>Apart from that, for domain-specific applications and experiments, it is possible to implement rule-based metrics. For example, let us ask the model to generate multiple completions for a given task. We might want to select the output that has the most probability of certain keywords appearing in the prompt. Furthermore, most of the entire prompts might not be useful, only a handful of key entities might be needed. It is also possible to create a model that performs entity extraction on the generated output, to evaluate the quality of the predicted output. There are multiple options like that, so it would be a good idea to invent some custom, rule-based metrics aligned to the domain-specific nature of the tasks.</p><p><strong>LLM Based Evaluation:</strong></p><p>A method of evaluation that has been developed a lot recently is to make LLMs score themselves! While very few approaches are possible, the evaluation still needs some of these inputs: question, answer, context, and ground truth. We can essentially take the output the model produces and ask it to determine how good it is — this type of evaluation would increase in popularity for GPT-4-like models as they are already able to score the model’s predicted output quality.</p><p>The next steps are normally taken when using this evaluation method: Output predictions are generated from a test set given, and the model is prompted to evaluate the quality of output when referenced concerning a reference text and an adequate context, for instance, evaluation criteria a prompt is fed into the model results are viewed.</p><p><strong>Functional Correctness:</strong></p><p>Therefore, functional correctness is a measure of whether the LLMs accurately generate the code for a specific task given in natural language to implement the task. Precisely, it is the evaluation of whether the generated code produces the expected output for the given input.</p><p>Next, we use the LLMs-generated code to calculate the factorial for each input and verify if the output generated by the model with the input generates the expected output. If the model produces the output for each of the inputs equivalent to the expected output, then the test case contains the model’s passed condition, and the model is functionally correct for the task.</p><p><strong>Example:</strong></p><p>Input: 5 Expected Output:5</p><p>Input: 10 Expected Output:100</p><p>Input 55 Expected Output: 1000</p><p>Input 45 Expected Output: 37</p><p>Input 878 Expected Output: 423424</p><p>Finally, we can use the LLMs-generated code to calculate the factorial of each input and check if the generated output matches the expected output. If the output matches each input to a task, then we can say the test case has passed, and the LLMs are functionally correct for that task. Nevertheless, this method of evaluating functional correctness has its limitations. Sometimes it may be very costly to set the execution environment to implement the generated code. Functional correctness evaluation does not consider the following aspects of the generated code: readability, maintainability, and efficiency. Furthermore, normally we cannot define enough test cases to cover all possible inputs and edge cases for most of the tasks. Therefore, we can say that functional correctness evaluation may not be as efficient.</p><p><strong>Human Evaluation</strong>:</p><p>Human-based feedback is always good particularly if there is no ground truth dataset. Also, reviews with domain experts might be necessary depending on the application. However, it would become costly and time-consuming.</p><p>For a more focused evaluation process, achieve the following:</p><p>1. Compile a group of domain experts or people familiar with the problem space.</p><p>2. Generate a shareable asset that can be viewed by multiple reviewers. An example would be saving the shared Excel spreadsheet for later use in experiments.</p><p>3. Each person views each sample and assigns a score per completion. Alternatively, viewers can be split to get through more unique data points in a short amount of time. On top of this, having notes assigned to each reviewed data point can give insight into how each specific viewer interpreted each completion for each data point. Having notes is useful as it is the sole place where users will solely get the reference. These users might have never reviewed the dataset before, but will still want to understand where decisions were coming from. The notes can also act as a strategy for benchmarking the evaluation process. Each note can have learnings derived from it which could potentially help apply the concept to another sample within the dataset.</p><p>Another level on which human feedback can be applied is the integration into the LLM-based system through reinforcement learning from human feedback. RLHF is a technique that uses RL methods to optimize a language model with human feedback directly. RLHF’s main concept is building a reward model. This model should predict the human desirability of the response given. A fine-tuning of the LLM should be determined and guided by the reward model.</p><p><strong>Other Approaches:</strong></p><p><strong>Iterative Improvement</strong>: Round-the-clock tracking of LLM performance and application of the derived input to improve the model. It might also involve working with datasets, utilizing certain parameters or introducing new evaluation criteria.</p><p><strong>Comparative</strong> <strong>Analysis</strong>: Utilize outputs from other models or human-generated examples. Commonly, such a practice can produce valuable knowledge about the abilities and limitations of the LLM compared to other models.</p><p>By combining these evaluation methods, you can gain a comprehensive understanding of the strengths and weaknesses of the LLM-generated content and work towards improving its quality.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3e641db6ebc2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LLM Assessment as a Data Engineer]]></title>
            <link>https://medium.com/@deepak.jayabalan/llm-assessment-as-a-data-engineer-44cb92b0be5f?source=rss-835e3802d07c------2</link>
            <guid isPermaLink="false">https://medium.com/p/44cb92b0be5f</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Deepak Jayabalan]]></dc:creator>
            <pubDate>Fri, 05 Jul 2024 04:19:34 GMT</pubDate>
            <atom:updated>2024-07-05T04:19:34.993Z</atom:updated>
            <content:encoded><![CDATA[<h3>Introduction:</h3><p>Ever since the buzz about GenAI and chat GPT, I have been trying to understand the space and leveraging it into my day-to-day activities as a Data Engineer. No doubt it has been useful for debugging with respective programming and documentation. Personally, it played a significant role in helping me plan my Dubai itinerary. It even contributed to naming my friend’s daughter. However, It works pretty well if you give appropriate and precise requirements.</p><p>There is a lot of anxiety among people that AI is going to take our jobs as they feel almost everything will be automated through AI. Is LLM as good as everyone claims?</p><p>In this article, we will look into the specifics of Gen AI’s role in Data Engineering and see where it flourishes, where it requires enhancement, and where human expertise remains irreplaceable.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jt3-Z2ZFV4zSCwlw" /><figcaption>Photo by <a href="https://unsplash.com/@homajob?utm_source=medium&amp;utm_medium=referral">Scott Graham</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Areas of Strength:</h3><h4>Basic Data Querying:</h4><p>100% the human necessity is going to get disrupted, but severity based on the complexity is high. . Even the SQL practitioners won’t feel much difference here, and Self-serve analytics will catch fire in this area which is something phenomenal! Again to quote positively, Data engineers don’t need to guide the stakeholders nearly as much!</p><h4>Troubleshooting pipeline failures:</h4><p>LLMs coupled with agents will slash on-call load massively. LLMs processing stack traces / quality failures can recommend the built-in remedies and nudge the ball down the field to address the problem and presumably even over Slack. And that is a fantastic percentage of data engineering hours saved! To be clear, some hard failures require manual troubleshooting.</p><h4>Anomaly Detection:</h4><p>Regardless of the structure, format, source, complexity, timeliness, or even natural play of data, large language models can identify anomalies. LLMs are well-suited to anomaly identification because they can work across a variety of data sources, identify subtle patterns, and distinguish when data isn’t behaving as expected. This will help data engineers to identify and fix data quality problems in advance.</p><h3>Areas for Enhancement:</h3><h4>Data Cleansing:</h4><p>Our results suggest that Language Models (LLMs) are highly effective in data cleansing due to their high performance in unstructured data management and semantic error identification and correction, as well as high scalability. Furthermore, our review of the current state-of-the-art suggests that even more substantial improvements are possible, including fine-tuning of LLMs to specific domains or even organization data. In addition, various interactive mechanisms as well as coupling with data quality frameworks can significantly increase the efficiency of LLMs. Overall, LLMs allow organizing and improving the whole process of data cleansing, helping to obtain high-quality datasets and, consequently, valuable insights.</p><h4>Visualization and Report Generation:</h4><p>Language Models (LLMs) excel in visualization and documentation tasks by automating the creation of reports, summarizing data insights, and generating narrative-driven reports. Enhancements needed include improving document organization and categorization, data-driven storytelling capabilities integrating with version control systems and visualization tools for seamless output generation and enhancing interpretability features to ensure clarity and coherence in reports.</p><h4>Writing pipeline code in SQL, Spark, or Flink</h4><p>LLM can be a helpful starting point, and for our work it is a good boilerplate. Still the generated material is hard to debug — or often not worth the trouble. Prompt-driven content is still far from being able to turn out a nice, optimized and absolutely accurate data pipeline.</p><h4>Optimization</h4><p>By providing suggestions on efficient query structures and detecting problems that slow performance, LLMs are especially powerful when it comes to SQL optimization. They can even make recommendations for optimizations on the basis of historical query data. Things for improvement include integrating database-specific optimization techniques into the training of LLM, bridging with query execution plans in order to give context-oriented advice and feedback mechanisms so that one can modify strategies for optimization according to what users prefer as well as performance metrics.</p><h4>Physical Data Model</h4><p>As soon as the logical model is finalized, LLM will be efficient in moving and upgrading this logical (or conceptual) data models into the targeted architecture. The problem that LLM encounters with all the examples is that it’s hard for most people to understand pipeline, table, and its relationship.</p><h3>Areas where human expertise remains irreplaceable:</h3><h4>Logical Data Model</h4><p>Data modeling is a very soft skills focused task.You have to communicate with different stakeholders to understand the business context in detail.Understanding what data is needed, the relationship between entities, and considering different scenarios are human oriented tasks that LLMs must continue to struggle with.</p><h4>Building Pipeline and Framework</h4><p>Once they reach a certain level of complexity, LLMs begin to give up. Imagine if someone asked ChatGPT, “Can you provide me the source code of Android’s next version.” While LLMs can help with routine data engineering tasks, they do not have the insight or inventiveness unique to humans that allows for new solutions to complex challenges, unlocking creativity in technical terms.</p><h4>Domain Expertise and Intuition:</h4><p>Human data engineers bring a unique blend of domain expertise and intuition to their work that often surpasses what conventional software isn’t able to achieve through automation. This intuition allows them to find subtle patterns in data that no fully rigorous automated system can. Based on the knowledge of this field and their experiences in it they can predict future trends. They can also make decisions in light of a broader context, including corporate or government needs, ethical considerations, and potential future scenarios. This holistic approach to data engineering surpasses what LLMs have achieved. Although LLMs have great power they are typically restricted to the patterns and relationships they have been trained to recognize.</p><h3>Conclusion:</h3><p>On the whole, Gen AI can certainly automate trivial procedures such as looking for information, launching programs that don’t work properly, repeatedly correcting faults, and anomaly detection. My gut tells me that Gen AI will win out and excel at tasks such as Data Cleansing, Optimization, Writing Pipelines — once the team understands gaps and missing patterns. But it can not learn purely human soft skills that are important for building logical data models or a framework. I believe that these skills, which cannot be automated, are also essential for stakeholders to ask for innovation in the first place. After all, no one ever asked Steve Jobs if they needed anything like an iPhone and he just thought up what would become its invention.</p><h3>References:</h3><p><a href="https://barrmoses.medium.com/will-genai-replace-data-engineers-no-and-heres-why-708b0a27da6b">https://barrmoses.medium.com/will-genai-replace-data-engineers-no-and-heres-why-708b0a27da6b</a></p><p>​​<a href="https://www.linkedin.com/pulse/navigating-data-governance-challenges-era-large-fnq5e">https://www.linkedin.com/pulse/navigating-data-governance-challenges-era-large-fnq5e</a></p><p><a href="https://www.dataversity.net/navigating-the-risks-of-llm-ai-tools-for-data-governance/">https://www.dataversity.net/navigating-the-risks-of-llm-ai-tools-for-data-governance/</a></p><p><a href="https://www.skyflow.com/post/why-ai-governance-requires-data-governance">https://www.skyflow.com/post/why-ai-governance-requires-data-governance</a></p><p><a href="https://coreykeyser.medium.com/you-cant-govern-ai-without-governing-data-27bb88d8f9ea">https://coreykeyser.medium.com/you-cant-govern-ai-without-governing-data-27bb88d8f9ea</a></p><p><a href="https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-turbo-a-review">https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-turbo-a-review</a></p><p><a href="https://towardsdatascience.com/from-chaos-to-clarity-streamlining-data-cleansing-using-large-language-models-a539fa0b2d90">https://towardsdatascience.com/from-chaos-to-clarity-streamlining-data-cleansing-using-large-language-models-a539fa0b2d90</a></p><p><a href="https://medium.com/@wangshally11/data-cleaning-warm-up-before-training-large-language-models-74ed1edceead">https://medium.com/@wangshally11/data-cleaning-warm-up-before-training-large-language-models-74ed1edceead</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=44cb92b0be5f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking the Power of Advanced SQL Data Types in Big Data]]></title>
            <link>https://medium.com/@deepak.jayabalan/unlocking-the-power-of-advanced-sql-data-types-in-big-data-ce53c018ed2c?source=rss-835e3802d07c------2</link>
            <guid isPermaLink="false">https://medium.com/p/ce53c018ed2c</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Deepak Jayabalan]]></dc:creator>
            <pubDate>Fri, 05 Jul 2024 04:18:48 GMT</pubDate>
            <atom:updated>2024-07-05T04:18:48.938Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*39XPYIxytmGFuIA5" /><figcaption>Photo by <a href="https://unsplash.com/@markusspiske?utm_source=medium&amp;utm_medium=referral">Markus Spiske</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>The gigantic mountains of big data, which are getting bigger by the day, always pose a problem in terms of storage, analysis, and meaningful insight extraction. As one might suppose, this is where the advanced SQL data types save the day. Features of the specialized data types near integers and strings, which we use in every-day life, will allow us to store and operate complex data structures with great ease. Moreover, mastering them takes us straight on to solving the very core problems of the big data realm, in this way enabling more in-depth analysis and thus more valuable outcomes.</p><p>Please note that the examples provided in the article are from Presto DB and the syntaxes might change according to your database.</p><p>Lets first discuss the different advanced SQL data types:</p><h4>ARRAY</h4><p>An array is a collection of values of the same type that are sorted by their index. It is very similar to a list from Python. For example, it can be an array of numbers: 1, 2, 3. It is possible to retrieve information from an array in SQL by using the “UNNEST” function and the CROSS JOIN operator . Such a combination will allow us to turn an array into individual rows and then join these rows with a table or subquery . Here is how this query might look:</p><pre>SELECT<br>   numbers_array,<br>   number<br>FROM (<br>   VALUES<br>       (ARRAY[2, 5]),<br>       (ARRAY[7, 8, 9])<br>) AS x (numbers_array)<br>CROSS JOIN UNNEST(numbers_array) AS t (number);</pre><h4>MAP</h4><p>A map is a collection of key-value pairs where all the keys and values should be of the same type. A map is known as a dictionary in Python. Map is a very useful data type that allows you to organize a huge amount of structured information in one container. If you have a lot of columns and you don’t know much about its specifics, do not worry you can organize everything in the forms of pairs and deal with a single container instead of dealing with a variety of subqueries. In order to query a map in SQL, you have to utilize the UNNEST function and CROSS JOIN operator, which joins the two selected rows of another table. Therefore, UNNEST use the opened pairs in the form of separate rows, whereas CROSS JOIN open rows and joins them with the selected ones. For example, a query is:</p><pre>SELECT<br>   animals, a, n<br>FROM (<br>   VALUES<br>       (MAP(ARRAY[&#39;dog&#39;, &#39;cat&#39;, &#39;bird&#39;], ARRAY[1, 2, 0])),<br>       (MAP(ARRAY[&#39;dog&#39;, &#39;cat&#39;], ARRAY[4, 5]))<br>) AS x (animals)<br>CROSS JOIN UNNEST(animals) AS t (a, n);</pre><h4><strong>ROW(STRUCT)</strong></h4><p>A row or struct is a group of fields in which each field has a unique name and data type. A struct is a tuple in python. For instance, a struct is a struct “x” with a value of 1 and a struct “y” with a value of 2.0. A struct is a group of fields in which each field has a name and data type. It lets us design columns that are significant but are not often used. Instead of having multiple columns, we can have a single struct column with all the required information. It is like a map but with predetermined columns and is more constant. You can get to fields utilizing a field reference operator. To query a struct in SQL, we can utilize the UNNEST function and the CROSS JOIN operator. The CROSS JOIN operator combines the generated rows with the other table or subquery, while the UNNEST function unrolls the struct into new columns. The Query is:</p><pre>WITH struct_table AS (<br>   SELECT<br>       ARRAY[<br>           CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)),<br>           CAST(ROW(2, 3.0) AS ROW(x BIGINT, y DOUBLE))<br>       ] AS sample<br>)<br>SELECT<br>   t.*<br>FROM struct_table s<br>CROSS JOIN UNNEST(sample) AS t</pre><h4>Array vs Map vs Row</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/668/1*iBvZ5Po-s6cV5hytaONyrA.png" /></figure><h3>Comparing Popular DBs</h3><p>Given below the DBs and their support for these advanced data types with some workarounds if available especially through Json or Xml,</p><h3>SQL Databases:</h3><h4>PostgreSQL 16:</h4><ul><li>Array: fully supported in all versions.</li><li>Struct: uses composite types pretending to be structs.</li><li>Map: no direct, uses hstore or JSONB for the key-value storage.</li></ul><h4>MySQL 8.0:</h4><ul><li>Array: does not natively support JSON arrays.</li><li>Struct: uses JSON for storage.</li><li>Map: uses JSON as a formatter.</li></ul><h4>Microsoft SQL Server 2022:</h4><ul><li>Array: no native, use JSON or XML.</li><li>Struct: JSON or XML for structured data.</li><li>Map: JSON or XML to store maps.</li></ul><h4>Oracle Database 21c</h4><ul><li>Array: supported, VARRAYs or nested tables.</li><li>Struct: supported, object types for the data structuring.</li><li>Map: no direct, associative arrays, and JSON.</li></ul><h3>NoSQL databases:</h3><h4>MongoDB 6.0:</h4><ul><li>Array: uses the native system.</li><li>Struct: condiment-oriented form, can support structs natively.</li><li>Map: the document works as a map.</li></ul><h4>Apache Cassandra 4.0:</h4><ul><li>Array: the list is the functional analog of an array.</li><li>Struct: supports UDTs.</li><li>Map: the data type is natively supported.</li></ul><h4>Redis 7.0:</h4><ul><li>Array: the list of Redis, the analog of an array.</li><li>Struct: uses hashes.</li><li>Map: has the key-value pattern.</li></ul><h4>Apache HBase 2.4:</h4><ul><li>Array: does not work directly, use serialization.</li><li>Struct: could be serialized.</li><li>Map: uses serialization or column qualifiers.</li></ul><h3>Distributed SQL Query Engine:</h3><h4>Presto Trino 380:</h4><ul><li>Array: supported as the data type.</li><li>Struct: uses the ROW data type, nesting available.</li><li>Map: supported, the data type.</li></ul><h3>Benefits and Use Cases:</h3><h4>Storage</h4><p>Among the advantages of using advanced data types such as arrays, maps, and structs for rather extensive information is that you can save more data within the single column instead of dividing it into multiple rows or columns. In addition to a much more systematic and comfortable approach for querying, it also cuts storage substantially. The notion is that five separated rows, for example, will be merged into a single row, delivering all those multiple fragmented informational sectors via map or struct. This is especially relevant for large-scale datasets, where the number of users or records can range up to billions. Therefore, using advanced data types will decrease the overall storage and make the data much more efficient.</p><p>Table without advanced data type:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/668/1*iBvZ5Po-s6cV5hytaONyrA.png" /></figure><p>Table with advanced data type:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/538/1*Hi8UQ_NwS9SEJarwlg-lNg.png" /></figure><h4>Compute</h4><p>As can be seen, this is essentially more than storage saving; for instance, instead of reading X number of rows to gather all the required information, using a map with arrays, one would need to read X/n number of rows. This varies from n operations to merely one; hence, the overall computing cost is diminished to process. As a result, the query would be performed faster, and the system would perform faster.</p><h4>Modeling</h4><p>Advanced data types , such as maps and structs, increase the flexibility available in the schema design. In particular, when you deal with large data volumes, creating a column for each such data point is often impossible or inadequate. However, with advanced data types, you can save less important but necessary information for eventual retrieval in one column. In the same way, unifying facts and dimensions into one large table decreases the amount of movement required between tables or partitions, which generally increases system performance and speed of query. In addition, this results in a less complicated system architecture, which is easier to manage.</p><h3>Conclusion:</h3><p>However, employing advanced SQL data types starting from the modeling stage suggests significant benefits in the capabilities of computational resources and the efficient use of storage. By enabling support to json formats most of the SQL databases in market today allows us to use these advance data types. By introducing this type of data into the databases design, the firms could optimize the functionality of their databases, query tools, and query types, minimizing the demands they place on resources. Additionally, advanced SQL data types enable the firms to work with various data structures more effectively, which fuels increased scalability and adaptability in the rapidly changing data environment. Altogether, the decision to include these data types do not only optimize the usage of computational resources but also builds the foundation of other data-intensive innovations and also helps the firms stay competitive in the rapidly digitizing world.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ce53c018ed2c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Challenges in ABTest: Overcoming Signal Loss by Privacy Policies]]></title>
            <link>https://medium.com/@deepak.jayabalan/challenges-in-abtest-overcoming-signal-loss-by-privacy-policies-01915dad6d23?source=rss-835e3802d07c------2</link>
            <guid isPermaLink="false">https://medium.com/p/01915dad6d23</guid>
            <category><![CDATA[privacy]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ab-test]]></category>
            <dc:creator><![CDATA[Deepak Jayabalan]]></dc:creator>
            <pubDate>Fri, 05 Jul 2024 04:18:01 GMT</pubDate>
            <atom:updated>2024-07-05T04:18:01.715Z</atom:updated>
            <content:encoded><![CDATA[<p>ABTest, or A/B Testing, Divide or Split Testing, is a way that compares two versions of a product, website, or app in order to see which one does well or succeeds. Its method consists of arbitrarily splitting users into two groups and giving every group a different version of the product. This helps clarify which version yields better user engagement, conversion rates, or other desired results.</p><p>ABTest’s users include those in such varied fields as e-commerce, marketing, and software development. Using ABTest, businesses can make decisions based on data and keep tweaking their products in light of what real-world users really do.</p><p>This article will focus on the use of ABTest in machine learning models for advertising. Specifically, we’ll discuss how one could use A/B testing to optimize model performance by testing different machine learning models that predict user behavior and conversion tracking. Using ABTest, platforms can figure out which model yields better engagement and conversion rates, leading ultimately to higher return on investment as well as more effective ad models. More importantly, we will talk about some hurdles in applying ABTest to machine learning models for advertising and give examples of success stories as well as practical advice based on experience.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4mLHpDu6qJwK-Tk9" /><figcaption>Photo by <a href="https://unsplash.com/@girlwithredhat?utm_source=medium&amp;utm_medium=referral">Girl with red hat</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><h3>Signal Loss Events:</h3><p>In the past, we could gather all the info we needed about advertising interactions from data on site and off (relying on a pixel). From this we would find out which models worked best and what to do in future.</p><p>However as industry rules change and regulations tighten, access to this data is now being cut off. This makes it difficult for us to know which models are working well and which aren’t. Now organizations are looking at methods of still making informed choices about products even when we don’t have all the data we need.</p><p>Several major events have now caused us to lose access to this dataset. Let’s examine some of them and see how they affect our ability to test and improve advertising.</p><h3>ATT (App Tracking Transparency)</h3><ul><li>Introduced in the iOS 14+ operating system, Apple’s new feature demands that users give apps permission before they can access certain information concerning the device or the user. It also means we now are unable to gather data from iOS14+ device users who opt out of allowing advertisements so that we might learn more about the way in which users interact with an ad.</li><li>Consequently, crucial data is being lost, which gives us the means of gauging our performance and deciding beforehand which version will be best to run in the future. That’s a big problem for our team because we depend so heavily on this data to continually improve our ads and accurately determine which ones to run.</li><li>Reference — <a href="https://developer.apple.com/app-store/user-privacy-and-data-use/">https://developer.apple.com/app-store/user-privacy-and-data-use/</a></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*--L4XZX3MiiLmvJD" /></figure><h3>EPD (ePrivacy Directive)</h3><ul><li>The ePD is a set of rules developed by the European Union to protect people’s privacy and their data while using digital services. It contains such important topics as the confidentiality of information, the data about visits of websites that companies can use, and the data about cookies . As long as some people do not allow data collection during their behavior on these websites, we can’t be able to collect some information about how people interact with our ads. As a result, we lose the opportunity to collect important data for research. As I have mentioned before, proper research allowing us to understand which ads help oral ones should be forgotten. This staff is extremely important for us since we need to understand how to use our ads to make them more successful.</li></ul><h3>OBA (Online behavioral advertising)</h3><ul><li>Online behavioral advertising is when companies show you ads that they think you will find most interesting based on the websites you visit. That is why they are “interested-based” ads since companies are interested in the pages you view to help them make the decisions about what to show you. Unfortunately, we cannot access the information for some people under the new regulations if they choose not to allow us. In other words, we lose vital data on performance to determine which ads work best and which ones should be considered or not. As a matter of fact, this is our biggest challenge since we get very little data to make any improvement on the existing ad or make a decision between two ads.</li></ul><p>The same case applies to other bigger entities such as Google that get the information for the online activity of people. It will be even harder in the future to select the best ads with such limited information.</p><h3>Impact of Signal Loss Events:</h3><p>Changes for the worse: due to these changes, we can no longer collect data and use feedback on how people behave on the Internet. This means that we don’t know which models work best and cannot use them. If we do not use this data, we will have to use models which do not work and which will do harm to our business. Such changes are that we could receive discipline reports on the behavior of people in different organizations and store and study everything for as long as we wanted. And now we can identify users only in our company, and in other companies, we can see their reduced kinds of statistics. We also cannot store user data for a long time. This is due to the fact that users do not want their data to be accessible in all places and as users, they have the right not to allow them to be published. In the future, the data cannot be taken from users, they can be obtained only from the host.</p><h3>Solutions:</h3><p>To overcome this issue, it is necessary to consider what you do and don’t need to track conversion data points. In this particular case, we only must find out that a user who clicked on an ad on one site made a purchase on another site . There is no need to gather personal data about the user or exchange such data with the other website . When such information is not being collected, unique user IDs should not be generated to track conversions across websites using ads.</p><p>Reference — <a href="https://webkit.org/blog/8943/privacy-preserving-ad-click-attribution-for-the-web/">https://webkit.org/blog/8943/privacy-preserving-ad-click-attribution-for-the-web/</a></p><p>This can be solved by two broad ways,</p><ol><li>Models to predict conversions: We can build machine learning models that are trained using currently available user opt-in data to predict which ads will result in conversions (such as purchases or sign-ups). These models do not need to collect and share personal information about users, so they can be used without compromising people’s privacy.</li><li>Aggregated &amp; anonymized data: We can create a framework where user information is anonymized and the data is aggregated so that it cannot be traced back to individual users. This way, we can still use the data to improve our ads while protecting people’s privacy.</li></ol><p>We should ensure we test these fixes before applying them to ensure they are true. We should especially concentrate on discovering approaches to gauge our advertisements’ prosperity, continually being cautious to regard individuals’ private there. This includes guaranteeing the predictions are true and always operating within scope of all legal privacy protection.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=01915dad6d23" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>