<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Alessandro Tomassini on Medium]]></title>
        <description><![CDATA[Stories by Alessandro Tomassini on Medium]]></description>
        <link>https://medium.com/@le_Tomassini?source=rss-4cb5c2b60ac8------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*9PGn8q6tW94DFEi0lBXp0Q.jpeg</url>
            <title>Stories by Alessandro Tomassini on Medium</title>
            <link>https://medium.com/@le_Tomassini?source=rss-4cb5c2b60ac8------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 31 May 2026 20:42:04 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@le_Tomassini/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Robust Statistics for Data Scientists Part 2: Resilient Measures of Relationships Between Variables]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/robust-statistics-for-data-scientists-part-2-resilient-measures-of-relationships-between-variables-a59b37a6907f?source=rss-4cb5c2b60ac8------2"><img src="https://cdn-images-1.medium.com/max/1024/1*5DKY_TCB7TMGGQ6zDYf_1A.jpeg" width="1024"></a></p><p class="medium-feed-snippet">From basic to advanced techniques for outlier-rich data analysis.</p><p class="medium-feed-link"><a href="https://medium.com/data-science/robust-statistics-for-data-scientists-part-2-resilient-measures-of-relationships-between-variables-a59b37a6907f?source=rss-4cb5c2b60ac8------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/robust-statistics-for-data-scientists-part-2-resilient-measures-of-relationships-between-variables-a59b37a6907f?source=rss-4cb5c2b60ac8------2</link>
            <guid isPermaLink="false">https://medium.com/p/a59b37a6907f</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[getting-started]]></category>
            <category><![CDATA[analytics]]></category>
            <dc:creator><![CDATA[Alessandro Tomassini]]></dc:creator>
            <pubDate>Sat, 09 Mar 2024 14:42:23 GMT</pubDate>
            <atom:updated>2024-03-09T14:42:23.231Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Robust Statistics for Data Scientists Part 1: Resilient Measures of Central Tendency and…]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/robust-statistics-for-data-scientists-part-1-resilient-measures-of-central-tendency-and-67e5a60b8bf1?source=rss-4cb5c2b60ac8------2"><img src="https://cdn-images-1.medium.com/max/1024/1*2msDCpjUy7KXCnnomPfAmw.png" width="1024"></a></p><p class="medium-feed-snippet">Building a foundation: understanding and applying robust measures in data analysis</p><p class="medium-feed-link"><a href="https://medium.com/data-science/robust-statistics-for-data-scientists-part-1-resilient-measures-of-central-tendency-and-67e5a60b8bf1?source=rss-4cb5c2b60ac8------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/robust-statistics-for-data-scientists-part-1-resilient-measures-of-central-tendency-and-67e5a60b8bf1?source=rss-4cb5c2b60ac8------2</link>
            <guid isPermaLink="false">https://medium.com/p/67e5a60b8bf1</guid>
            <category><![CDATA[getting-started]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-analysis]]></category>
            <dc:creator><![CDATA[Alessandro Tomassini]]></dc:creator>
            <pubDate>Tue, 30 Jan 2024 19:04:16 GMT</pubDate>
            <atom:updated>2024-01-30T19:06:36.851Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Enhancing Data Science Workflows: Mastering Version Control for Jupyter Notebooks]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/enhancing-data-science-workflows-mastering-version-control-for-jupyter-notebooks-b03c839e25ec?source=rss-4cb5c2b60ac8------2"><img src="https://cdn-images-1.medium.com/max/1024/1*VtFvLmsp4OU6T9YKyqVtMA.png" width="1024"></a></p><p class="medium-feed-snippet">A hands-on guide to facilitate collaboration and reproducibility with Jupytext, nbstripout, and nbconvert</p><p class="medium-feed-link"><a href="https://medium.com/data-science/enhancing-data-science-workflows-mastering-version-control-for-jupyter-notebooks-b03c839e25ec?source=rss-4cb5c2b60ac8------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/enhancing-data-science-workflows-mastering-version-control-for-jupyter-notebooks-b03c839e25ec?source=rss-4cb5c2b60ac8------2</link>
            <guid isPermaLink="false">https://medium.com/p/b03c839e25ec</guid>
            <category><![CDATA[jupyter-notebook]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[reproducibility]]></category>
            <category><![CDATA[version-control]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Alessandro Tomassini]]></dc:creator>
            <pubDate>Thu, 11 Jan 2024 17:40:22 GMT</pubDate>
            <atom:updated>2024-01-11T17:40:22.880Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Cultivating Data Integrity in Data Science with Pandera]]></title>
            <link>https://medium.com/data-science/cultivating-data-integrity-in-data-science-with-pandera-2289608626cc?source=rss-4cb5c2b60ac8------2</link>
            <guid isPermaLink="false">https://medium.com/p/2289608626cc</guid>
            <category><![CDATA[data-preprocessing]]></category>
            <category><![CDATA[good-practices]]></category>
            <category><![CDATA[data-validation]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-integrity]]></category>
            <dc:creator><![CDATA[Alessandro Tomassini]]></dc:creator>
            <pubDate>Fri, 22 Dec 2023 01:56:52 GMT</pubDate>
            <atom:updated>2023-12-22T01:56:52.714Z</atom:updated>
            <content:encoded><![CDATA[<h4>Advanced validation techniques with Pandera to promote data quality and reliability</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DtWmDR_iaIWs3T7YnnAgEw.png" /><figcaption>Image generated by DALL-E</figcaption></figure><p>Welcome to an exploratory journey into data validation with Pandera, a lesser-known yet powerful tool in the data scientist’s toolkit. This tutorial aims to illuminate the path for those seeking to fortify their data processing pipelines with robust validation techniques.</p><p>Pandera is a Python library that provides flexible and expressive data validation for pandas data structures. It’s designed to bring more rigor and reliability to the data processing steps, ensuring that your data conforms to specified formats, types, and other constraints before you proceed with analysis or modeling.</p><h3><strong>Why Pandera?</strong></h3><p>In the intricate tapestry of data science, where data is the fundamental thread, ensuring its quality and consistency is paramount. Pandera promotes the integrity and quality of data through rigorous validation. It’s not just about checking data types or formats; Pandera extends its vigilance to more sophisticated statistical validations, making it an indispensable ally in your data science endeavours. 
Specifically, Pandera stands out by offering:</p><ol><li><strong>Schema enforcement</strong>: Guarantees that your DataFrame adheres to a predefined schema.</li><li><strong>Customisable validation</strong>: Enables creation of complex, custom validation rules.</li><li><strong>Integration with Pandas</strong>: Seamlessly works with existing pandas workflows.</li></ol><h3>Crafting your first schema</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0noKDs5E95OdVivv" /><figcaption>Photo by <a href="https://unsplash.com/@charlesdeluvio?utm_source=medium&amp;utm_medium=referral">charlesdeluvio</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Let’s start with installing Pandera. This can be done using pip:</p><pre>pip install pandera </pre><p>A schema in Pandera defines the expected structure, data types, and constraints of your DataFrame. We’ll begin by importing the necessary libraries and defining a simple schema.</p><pre>import pandas as pd<br>from pandas import Timestamp<br>import pandera as pa<br>from pandera import Column, DataFrameSchema, Check, Index<br><br>schema = DataFrameSchema({<br>    &quot;name&quot;: Column(str),<br>    &quot;age&quot;: Column(int, checks=pa.Check.ge(0)),  # age should be non-negative<br>    &quot;email&quot;: Column(str, checks=pa.Check.str_matches(r&#39;^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$&#39;))  # email format<br>})</pre><p>This schema specifies that our DataFrame should have three columns: name (string), age (integer, non-negative), and email (string, matching a regular expression for email). 
Now, with our schema in place, let’s validate a DataFrame.</p><pre># Sample DataFrame<br>df = pd.DataFrame({<br>    &quot;name&quot;: [&quot;Alice&quot;, &quot;Bob&quot;, &quot;Charlie&quot;],<br>    &quot;age&quot;: [25, -5, 30],<br>    &quot;email&quot;: [&quot;alice@example.com&quot;, &quot;bob@example&quot;, &quot;charlie@example.com&quot;]<br>})<br><br># Validate<br>validated_df = schema(df)</pre><p>In this example, Pandera will raise a SchemaError because Bob&#39;s age is negative, which violates our schema. Bob&#39;s email is also malformed, but by default validation stops at the first failing check, so only the age violation is reported.</p><pre>SchemaError: &lt;Schema Column(name=age, type=DataType(int64))&gt; failed element-wise validator 0:<br>&lt;Check greater_than_or_equal_to: greater_than_or_equal_to(0)&gt;<br>failure cases:<br>   index  failure_case<br>0      1            -5</pre><p>Pandera can also validate data at function boundaries by means of decorators.</p><pre>@pa.check_input(schema)<br>def process_data(df: pd.DataFrame) -&gt; pd.DataFrame:<br>    # Some code to process the DataFrame<br>    return df<br><br>processed_df = process_data(df)</pre><p>The @pa.check_input decorator ensures that the input DataFrame adheres to the schema before the function processes it. Because df still contains Bob&#39;s negative age, this call raises the same SchemaError before the function body runs.</p><h3>Advanced data validation with custom checks</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ip45xCsNfEMC3Pr_" /><figcaption>Photo by <a href="https://unsplash.com/@sigmund?utm_source=medium&amp;utm_medium=referral">Sigmund</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Now, let’s explore more complex validations that Pandera offers. Building upon the existing schema, we can add columns with various data types and more sophisticated checks. 
We’ll introduce columns for categorical and datetime data, and implement more advanced checks such as value ranges, allowed categories, and a rule that relates two columns.</p><pre># Define the enhanced schema<br>enhanced_schema = DataFrameSchema(<br>    columns={<br>        &quot;name&quot;: Column(str),<br>        &quot;age&quot;: Column(int, checks=[Check.ge(0), Check.lt(100)]),<br>        &quot;email&quot;: Column(str, checks=[Check.str_matches(r&#39;^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$&#39;)]),<br>        &quot;salary&quot;: Column(float, checks=Check.in_range(30000, 150000)),<br>        &quot;department&quot;: Column(str, checks=Check.isin([&quot;HR&quot;, &quot;Tech&quot;, &quot;Marketing&quot;, &quot;Sales&quot;])),<br>        &quot;start_date&quot;: Column(pd.Timestamp, checks=Check(lambda x: x &lt; pd.Timestamp(&quot;today&quot;))),<br>        &quot;performance_score&quot;: Column(float, nullable=True)<br>    },<br>    index=Index(int, name=&quot;employee_id&quot;)<br>)<br><br># Custom check function: every salary-to-age ratio must be below 3000<br>def salary_age_relation_check(df: pd.DataFrame) -&gt; pd.DataFrame:<br>    if not all(df[&quot;salary&quot;] / df[&quot;age&quot;] &lt; 3000):<br>        raise ValueError(&quot;Salary to age ratio check failed&quot;)<br>    return df<br><br># Function to process and validate data<br>def process_data(df: pd.DataFrame) -&gt; pd.DataFrame:<br>    # Apply custom check<br>    df = salary_age_relation_check(df)<br><br>    # Validate DataFrame with Pandera schema<br>    return enhanced_schema.validate(df)</pre><p>In this enhanced schema, we’ve added:</p><ol><li>Categorical data: The department column validates against specific categories.</li><li>Datetime data: The start_date column ensures dates are in the past.</li><li>Nullable column: The performance_score column can have missing values.</li><li>Index validation: An index employee_id of type integer is defined.</li><li>Complex check: A custom function salary_age_relation_check ensures a logical relationship between 
salary and age: every salary-to-age ratio in the DataFrame must stay below 3000.</li><li>Implementation of the custom check in the data processing function: We integrate the salary_age_relation_check logic directly into our data processing function.</li><li>Use of Pandera’s validate method: Instead of using the @pa.check_input decorator, we validate the DataFrame explicitly with the schema’s validate method.</li></ol><p>Now, let’s create an example DataFrame df_example that follows the structure of our enhanced schema and validate it.</p><pre>df_example = pd.DataFrame({<br>    &quot;employee_id&quot;: [1, 2, 3],<br>    &quot;name&quot;: [&quot;Alice&quot;, &quot;Bob&quot;, &quot;Charlie&quot;],<br>    &quot;age&quot;: [25, 35, 45],<br>    &quot;email&quot;: [&quot;alice@example.com&quot;, &quot;bob@example.com&quot;, &quot;charlie@example.com&quot;],<br>    &quot;salary&quot;: [50000, 80000, 120000],<br>    &quot;department&quot;: [&quot;HR&quot;, &quot;Tech&quot;, &quot;Sales&quot;],<br>    &quot;start_date&quot;: [Timestamp(&quot;2022-01-01&quot;), Timestamp(&quot;2021-06-15&quot;), Timestamp(&quot;2020-12-20&quot;)],<br>    &quot;performance_score&quot;: [4.5, 3.8, 4.2]<br>})<br><br># Make sure the employee_id column is the index<br>df_example.set_index(&quot;employee_id&quot;, inplace=True)<br><br># Process and validate data<br>processed_df = process_data(df_example)</pre><p>Here, the custom ratio check passes, but Pandera raises a SchemaError because of a mismatch between the expected data type of the salary column in enhanced_schema (float, which corresponds to float64 in pandas/NumPy) and the actual data type in df_example (int, or int64 in pandas/NumPy).</p><pre>SchemaError: expected series &#39;salary&#39; to have type float64, got int64</pre><p>Writing the salaries as floats (for example 50000.0) or passing coerce=True to the salary Column resolves the mismatch.</p><h3>Advanced data validation with statistical hypothesis testing</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NbICXftKIBE2f_1x" /><figcaption>Photo by <a 
href="https://unsplash.com/@paoalchapar?utm_source=medium&amp;utm_medium=referral">Daniela Paola Alchapar</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Pandera can perform statistical hypothesis tests as part of the validation process. This feature is particularly useful for validating assumptions about your data distributions or relationships between variables.</p><p>Suppose you want to ensure that the average salary in your dataset is around a certain value, say £75,000. We can define a custom check function that performs a one-sample t-test to assess whether the mean of a sample (here, the mean of the salaries in the dataset) differs significantly from a known mean (in our case, £75,000).</p><pre>from scipy.stats import ttest_1samp<br><br># Define the custom check for the salary column<br>def mean_salary_check(series: pd.Series, expected_mean: float = 75000, alpha: float = 0.05) -&gt; bool:<br>    # One-sample t-test of the salaries against expected_mean<br>    _, p_value = ttest_1samp(series.dropna(), expected_mean)<br>    # Pass when the difference is not significant at level alpha<br>    return p_value &gt; alpha<br><br>salary_check = Check(mean_salary_check, element_wise=False, error=&quot;Mean salary check failed&quot;)<br><br># Update the checks for the salary column, specifying the column name<br>enhanced_schema.columns[&quot;salary&quot;] = Column(float, checks=[Check.in_range(30000, 150000), salary_check], name=&quot;salary&quot;)</pre><p>In the code above we have:</p><ol><li>Defined the custom check function mean_salary_check that takes a pandas Series (the salary column in our DataFrame) and performs the t-test against the expected mean. 
The function returns True if the p-value from the t-test is greater than the significance level (alpha = 0.05), meaning that the test finds no significant difference between the mean salary and £75,000.</li><li>Wrapped this function in a Pandera Check, specifying element_wise=False to indicate that the check is applied to the entire series rather than to each element individually.</li><li>Updated the salary column in our Pandera schema to include this new check alongside the existing range check.</li></ol><p>With these steps, our Pandera schema now includes a statistical test on the salary column. We deliberately increase the average salary in df_example to violate the schema’s expectation so that Pandera will raise a SchemaError.</p><pre># Change the salaries to exceed the expected mean of £75,000<br>df_example[&quot;salary&quot;] = [100000.0, 105000.0, 110000.0]<br>validated_df = enhanced_schema(df_example)</pre><pre>SchemaError: &lt;Schema Column(name=salary, type=DataType(float64))&gt; failed series or dataframe validator 1:<br>&lt;Check mean_salary_check: Mean salary check failed&gt;</pre><h3>Conclusion</h3><p>Pandera elevates data validation from a mundane checkpoint to a dynamic process that encompasses even complex statistical validations. 
By integrating Pandera into your data processing pipeline, you can catch inconsistencies and errors early, saving time, preventing headaches down the road, and paving the way for more reliable and insightful data analysis.</p><h3>References and Further Reading</h3><p>For those wishing to deepen their understanding of Pandera and its capabilities, the following resources are good starting points:</p><ol><li>Pandera Documentation: A comprehensive guide to all features and functionalities of Pandera (<a href="https://pandera.readthedocs.io/">Pandera Docs</a>).</li><li>Pandas Documentation: As Pandera extends pandas, familiarity with pandas is crucial (<a href="https://pandas.pydata.org/docs/">Pandas Docs</a>).</li></ol><h3>Disclaimer</h3><p>I am not affiliated with Pandera in any capacity; I am just very enthusiastic about it :)</p><hr><p><a href="https://medium.com/data-science/cultivating-data-integrity-in-data-science-with-pandera-2289608626cc">Cultivating Data Integrity in Data Science with Pandera</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>