<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Zhiqiang Zhong on Medium]]></title>
        <description><![CDATA[Stories by Zhiqiang Zhong on Medium]]></description>
        <link>https://medium.com/@zhiqiangzhong?source=rss-29effe089762------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Z_a8cvklU8IKlvdgEru1sg.png</url>
            <title>Stories by Zhiqiang Zhong on Medium</title>
            <link>https://medium.com/@zhiqiangzhong?source=rss-29effe089762------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 25 May 2026 22:23:47 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@zhiqiangzhong/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Machine Learning process — Feature Engineering]]></title>
            <link>https://medium.com/@zhiqiangzhong/machine-learning-process-feature-engineering-140d178fcb3b?source=rss-29effe089762------2</link>
            <guid isPermaLink="false">https://medium.com/p/140d178fcb3b</guid>
            <category><![CDATA[feature-engineering]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Zhiqiang Zhong]]></dc:creator>
            <pubDate>Sun, 27 Aug 2017 16:45:47 GMT</pubDate>
            <atom:updated>2017-08-27T20:42:25.670Z</atom:updated>
            <content:encoded><![CDATA[<p>(A <strong>note</strong> to begin: I’m a Data Science beginner and only share my personal understanding here. I don’t claim my opinions are all correct, and I would be very glad to hear yours.)</p><h4><strong>Why am I writing this article?</strong></h4><p>Nowadays, many Machine Learning and Deep Learning books tell us what an algorithm is and how it works. These books have many excellent examples, but unfortunately the datasets they use are already prepared, and the same is true at university. In real work, engineers meet noisy, not directly computable, imbalanced datasets, and I was really confused because no course had told me how to handle them. So I hope this blog can help some beginners when you do your first real-life project.</p><blockquote><strong>1 What is Feature Engineering and why it is important</strong></blockquote><p>As an unofficial topic, FE (Feature Engineering) has many definitions. In fact, it’s really hard to give it one exact definition, so I will show you its position in the process of a machine learning project.</p><p>A classic machine learning process:</p><p>1, Subject definition / Data collection</p><p>2, Project model building</p><p>3, Data preprocessing</p><p><strong>4, Feature engineering</strong></p><p>5, Algorithm choice and model training</p><p>6, Project submission / Feedback</p><p>As you can see, FE is the bridge between the data and the machine learning algorithm: it can “create new features from the current features to improve our model’s result”. To some extent, FE decides the upper limit of your machine learning project, and algorithms only try to reach that limit.</p><p>Some people even say of Kaggle, one of the famous data science competition platforms, “It’s a feature engineering game indeed.”</p><blockquote><strong>2 Mind Map of Feature Engineering</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/997/1*_GxRAPRUcinxBVNVJUF_1w.jpeg" 
/></figure><blockquote><strong>3 Explanation of each step</strong></blockquote><p><strong><em>Business First:</em></strong></p><p>For me, as an engineer and Data Scientist, one point I can never forget is that my projects serve the business; their final job is <strong>making the business more valuable</strong>. So the first step is to talk with your clients, consultants or managers and clearly understand the environment in which the project will be applied. “Connected with business” is an eternal theme.</p><p><strong><em>Feature Construction:</em></strong></p><p>Feature Construction is building new features from the original features. This process takes a lot of time: researching the characteristics of the data and thinking about potential problems and data structures. During this process, a <strong>brainstorm</strong> is a good choice. Find as many features as possible without worrying yet about whether they are useful, then design your features after several brainstorming sessions.</p><p>The common ways are: <strong>Combination</strong> and <strong>Segmentation.</strong></p><p>Combination: if you want to predict an apartment’s price, you normally have the area of the apartment and the number of rooms, so you can compute its average room area as S(apartment) / N(number of rooms), because the average room area sometimes gives more detailed information about the apartment. There are many similar ways; explore them together with the business.</p><p>Segmentation: clustering algorithms use only numeric variables, because they need to compute distances. In the same scenario, for the age of the apartment, we could set class 1 = (0, 5), class 2 = (5, 15), class 3… Why do we do this? Because in the client’s mind, an apartment that is 2 years old and one that is 3 years old are similar.</p><p><strong><em>Feature Extraction:</em></strong></p><p>When you cannot find any more useful information through Feature Construction, Feature Extraction is your next step. 
It uses simple geometric and algebraic operations to generate more potential features, with methods such as <strong>PCA</strong> (Principal Component Analysis), <strong>ICA</strong> (Independent Component Analysis) and <strong>LDA</strong> (Linear Discriminant Analysis).</p><p>Besides, Deep Learning algorithms are also usable in this process. For example, you could use the output of the second-to-last hidden layer as the input variables of your own image model.</p><p><strong><em>Feature Selection</em></strong></p><p>After the feature work above, you might have many variables, so Feature Selection is your next step.</p><p>Look at your Dataframe: among its dozen columns, some carry rich information, some carry little, and some are even irrelevant. We could use the correlation (feature importance) between X and Y to detect useful features, or use algorithms that generalize well to calculate the feature importance, and then choose features according to these importances.</p><p>In this part, we can also handle noisy, imbalanced and multicollinear data.</p><blockquote><strong><em>4 Conclusion</em></strong></blockquote><p>Feature engineering is a cyclic process; nobody can promise to get the best features in one pass. So what we need to do is focus on the data, stay connected with the business, and go through every step several times to reach the best result we can.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=140d178fcb3b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Test series]]></title>
            <link>https://medium.com/series/test-series-105b4f1bb5c2?source=rss-29effe089762------2</link>
            <guid isPermaLink="false">https://medium.com/p/105b4f1bb5c2</guid>
            <dc:creator><![CDATA[Zhiqiang Zhong]]></dc:creator>
            <pubDate>Thu, 24 Aug 2017 23:09:52 GMT</pubDate>
            <atom:updated>2017-08-24T23:09:52.735Z</atom:updated>
            <content:encoded><![CDATA[<p>What is a series?</p><p>What does it look like?</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=105b4f1bb5c2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using PySpark Dataframe as in Python]]></title>
            <link>https://medium.com/@zhiqiangzhong/using-pyspark-dataframe-as-python-dataframe-2959c2a085e?source=rss-29effe089762------2</link>
            <guid isPermaLink="false">https://medium.com/p/2959c2a085e</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[pyspark]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Zhiqiang Zhong]]></dc:creator>
            <pubDate>Thu, 24 Aug 2017 23:02:31 GMT</pubDate>
            <atom:updated>2017-08-29T12:12:50.589Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/454/1*HHpjJot4kNFd2I-agiXNdA.png" /></figure><p>(A note to begin: I’m a Data Science beginner and only share my personal understanding here. I don’t claim my opinions are all correct, and I would be very glad to hear yours.)</p><p>In this blog, I will share how to use a PySpark Dataframe the way you would use a Python (pandas) Dataframe.</p><h4><strong><em>Environment:</em></strong></h4><p>I use AWS (Amazon Web Services) as my work platform at the moment.</p><p>S3 for storage.</p><p>EC2 for hardware settings.</p><p>EMR for programming.</p><blockquote><strong><em>Replacing elements that appear fewer times than a threshold.</em></strong></blockquote><p><strong>In Python</strong>:</p><p>We could use the built-in “replace” method of the Python Dataframe.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/651/1*sZE3UBuDd7jfhMGhaUtuSA.png" /></figure><p><strong>In PySpark:</strong></p><p>The simplest way is as follows, but it contains a dangerous operation, “toPandas”, which transforms the Spark Dataframe into a Python Dataframe. It has to collect all the related data on the master before doing the transformation, which leads to memory problems if you have limited hardware. 
In the same way, use “collect” sparingly too, because both are <strong>actions.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/610/1*YYOjAVOgoXPVvOqnx1pYuA.png" /></figure><p>I chose to use the “join” function to solve this problem without any <strong>action </strong>operations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/933/1*J5tC2tspVxdoNN0leTxUeg.png" /></figure><blockquote><strong><em>The differences between joins</em></strong></blockquote><p>For “join”, we need to clearly understand the differences between <strong>join (inner)</strong>, <strong>left join</strong>, <strong>right join</strong>, and <strong>full outer.</strong></p><p>Suppose we have two tables, A and B.</p><p>A:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/70/1*zVSA1GmTldmrTPnxpzx6xg.png" /></figure><p>B:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/65/1*F4xKzyNX51mmpQ3UtdIrtg.png" /></figure><p>A join(inner) B:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/103/1*n44fTHhp7ha9Rm0gD5Ts2w.png" /></figure><p>A left join B:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/93/1*w4L0qRVQrv2PIRZeeRln4g.png" /></figure><p>A right join B:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/98/1*cHvawOjZaOGUY_1vsE4B7w.png" /></figure><p>A full outer join B:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/115/1*0Yi3ys5MfYbZ_X_2y-pYQQ.png" /></figure><blockquote><strong><em>One-hot encoding: “get_dummies”</em></strong></blockquote><p>While doing feature transformation, I usually use “<strong>get_dummies</strong>” of pandas. So, how does “<strong>get_dummies</strong>” work?</p><p>Suppose we have a dataframe like the following:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*XjDsIXz6yIUYT_ZhQZAu2A.png" /></figure><p>There are 2 categorical columns, “Color” and “Size”. Many algorithms can’t work with categorical variables, so we have to convert them to numeric values. 
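</p><p>As a rough illustration of what “get_dummies” does, here is a minimal pandas sketch. The sample values and the extra “Price” column are assumptions made up for this example; only the “Color” and “Size” column names come from the text above.</p><pre>
```python
import pandas as pd

# Hypothetical sample data: the values and the "Price" column are invented
# for illustration; only "Color" and "Size" are named in the article.
df = pd.DataFrame({
    "Color": ["red", "green", "blue", "red"],
    "Size": ["S", "M", "L", "M"],
    "Price": [10.0, 12.5, 9.0, 11.0],
})

# get_dummies replaces each listed categorical column with one 0/1 column
# per distinct value, named "column_value".
encoded = pd.get_dummies(df, columns=["Color", "Size"], dtype=int)

print(list(encoded.columns))
print(encoded)
```
</pre><p>Each categorical column is replaced by one 0/1 column per distinct value, while the numeric column is left untouched, so the result is still a plain table.</p><p>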
In this case, we could use the “One-Hot” method, but this function produces vectors.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/635/1*Q74u34vtSlCyyQMcI24Pxg.png" /></figure><p>Vectors are not easy to work with in later steps such as dimension reduction and feature combination, so personally I prefer “<strong>get_dummies</strong>”, because it gives us a new table like the following (the red columns are new).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/633/1*8iUm0DvWCd77RfzvSIjXYg.png" /></figure><p><strong>In Python:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/377/1*B0p5SiLfp49SqrXL3Iwbzw.png" /></figure><p><strong>In PySpark:</strong></p><p>There is no official function, so I tried to implement an efficient solution.</p><p>At first, I found this method:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/626/1*Ac22vsPIZc7mMw3VicoGlg.png" /></figure><p>But it uses the action “collect”, so I wanted to find a new one that doesn’t use any <strong>action</strong>.</p><p>I use “<strong>pivot</strong>” as a replacement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/797/1*rWup7E4EUbcuPxMYF3bBog.png" /></figure><p>Example:</p><p>The original dataframe:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/98/1*oMscNd8R0evG0R10W8QXng.png" /></figure><p>After this operation, the dataframe looks like:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/168/1*6RmYGwZh95pizIHX3cF8lg.png" /></figure><h3><strong><em>To be updated…</em></strong></h3><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2959c2a085e" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>