<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Hunter Phillips on Medium]]></title>
        <description><![CDATA[Stories by Hunter Phillips on Medium]]></description>
        <link>https://medium.com/@hunter-j-phillips?source=rss-7a7936a6a04------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*7pIFSd-SH0G-p781QlIyzw.jpeg</url>
            <title>Stories by Hunter Phillips on Medium</title>
            <link>https://medium.com/@hunter-j-phillips?source=rss-7a7936a6a04------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 10:12:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@hunter-j-phillips/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How to Convert an Image to a PDF in Python]]></title>
            <link>https://medium.com/@hunter-j-phillips/how-to-convert-an-image-to-a-pdf-in-python-f1f9cee3b996?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/f1f9cee3b996</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[pdf-converter]]></category>
            <category><![CDATA[pdf]]></category>
            <category><![CDATA[image-to-pdf-converter]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Tue, 08 Aug 2023 00:06:58 GMT</pubDate>
            <atom:updated>2023-08-08T00:06:58.873Z</atom:updated>
<content:encoded><![CDATA[<p>Want to convert one or more images to a PDF document? Look no further than the <strong>img2pdf </strong>and <strong>PyPDF2 </strong>packages.</p><h3>Packages</h3><p>To start, all you need is a Python environment, preferably version 3.10 or higher. The code in this tutorial was executed in a Google Colaboratory environment with Python 3.10.12.</p><p>The first step is to ensure the following packages are installed in the Python environment:</p><ul><li>img2pdf</li><li>PyPDF2</li><li>Pillow (PIL)</li></ul><p>Pip can be used to install these packages in Colab:</p><pre>!pip install img2pdf PyPDF2 Pillow</pre><p>The first package, img2pdf, will be used to convert an image to a PDF file. Then, PyPDF2 can be used to merge multiple PDFs into a single PDF file. Pillow is an image processing library; it provides additional functions necessary for the conversion.</p><p>These packages, along with os and google.colab, can now be imported.</p><pre># required libraries<br>import os<br>import img2pdf<br>import PyPDF2<br>from PIL import Image<br>from google.colab import files</pre><h3>Prepare the Images</h3><p>Before writing any more code, it is important to know the file location of each image. To make this as easy as possible, a new folder can be created in the Colab environment:</p><pre>!mkdir images</pre><p>All the images need to be uploaded simultaneously to this location using an uploader provided by google.colab. The files will be ordered by name, so they should be named sequentially, such as page1.png, page2.png, ..., page9.png.</p><pre>os.chdir(&quot;images&quot;)<br>files.upload()</pre><p>With the images stored in a known file location, their names can be stored in a list.</p><pre>imgs = os.listdir()<br>imgs.sort()</pre><p>If there are more than nine images, an alphabetical sort will misorder them (page10.png sorts before page2.png), so the list should be built with the files in the order they need to appear; a numeric sort that handles this is sketched after the conversion step below.</p><h3>Converting the Images to PDFs</h3><p>A for-loop can then be used to iterate over each image, convert it to a PDF, and write it to a new folder called pdfs.</p><pre># create a folder called pdfs<br>os.mkdir(&quot;../pdfs&quot;)<br><br># loop over each image<br>for ind, img in enumerate(imgs):<br>  # open each image<br>  with Image.open(img) as image: <br>    # convert the image to a PDF<br>    pdf = img2pdf.convert(image.filename)<br>    # write the PDF to its final destination<br>    with open(f&quot;../pdfs/pdf{ind+1}.pdf&quot;, &quot;wb&quot;) as file:<br>      file.write(pdf)<br>      print(f&quot;Converted {img} to pdf{ind+1}.pdf&quot;)</pre>
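<p>As promised above, if there are more than nine images, the plain alphabetical sort will not produce the right page order. Before running the conversion loop, the list can instead be built with a numeric sort. The snippet below is a minimal sketch that assumes filenames like page1.png; the numeric_key helper is illustrative and not part of the original tutorial.</p><pre>import re<br><br>def numeric_key(name):<br>  # pull the first run of digits from the filename, e.g. &quot;page10.png&quot; -&gt; 10<br>  match = re.search(r&quot;\d+&quot;, name)<br>  return int(match.group()) if match else 0<br><br># sorts page10.png after page9.png instead of after page1.png<br>imgs = sorted(os.listdir(), key=numeric_key)</pre>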
<h3>Merging the PDFs</h3><p>With the images converted to PDF files, they can either be used independently and downloaded with files.download(&#39;filename.pdf&#39;), or they can be merged together. To merge the files together, extract the list of PDF files and sort them by their page number.</p><pre>os.chdir(&quot;../pdfs&quot;)<br>pdfs = os.listdir()<br>pdfs.sort()</pre><p>Once again, if there are more than 9 images or PDFs, they should be stored in a list in their respective order.</p><p>A PdfMerger object can be used to concatenate each PDF into a single file.</p><pre>pdfMerge = PyPDF2.PdfMerger()<br><br># loop through each pdf page<br>for pdf in pdfs:<br>  # open each pdf<br>  with open(pdf, &#39;rb&#39;) as pdfFile:<br>    # merge each file<br>    pdfMerge.append(PyPDF2.PdfReader(pdfFile))<br><br># write the merged pdf <br>pdfMerge.write(&#39;merged.pdf&#39;)<br><br># download the final pdf<br>files.download(&#39;merged.pdf&#39;)</pre><p>The final merged PDF will contain each image in the order of their respective names.</p><h3>Full Program</h3><p>The entirety of the code can be found below. It is highly customizable to meet most use cases.</p><pre>!pip install img2pdf PyPDF2 Pillow<br>!mkdir images<br># required libraries<br>import os<br>import img2pdf<br>import PyPDF2<br>from PIL import Image<br>from google.colab import files<br><br>os.chdir(&quot;images&quot;)<br>files.upload()<br>imgs = os.listdir()<br>imgs.sort()<br><br># create a folder called pdfs<br>os.mkdir(&quot;../pdfs&quot;)<br><br># loop over each image<br>for ind, img in enumerate(imgs):<br>  # open each image<br>  with Image.open(img) as image: <br>    # convert the image to a PDF<br>    pdf = img2pdf.convert(image.filename)<br>    # write the PDF to its final destination<br>    with open(f&quot;../pdfs/pdf{ind+1}.pdf&quot;, &quot;wb&quot;) as file:<br>      file.write(pdf)<br>      print(f&quot;Converted {img} to pdf{ind+1}.pdf&quot;)<br><br>os.chdir(&quot;../pdfs&quot;)<br>pdfs = os.listdir()<br>pdfs.sort()<br><br>pdfMerge = PyPDF2.PdfMerger()<br><br># loop through each pdf page<br>for pdf in pdfs:<br>  # open each pdf<br>  with open(pdf, &#39;rb&#39;) as pdfFile:<br>    # merge each file<br>    pdfMerge.append(PyPDF2.PdfReader(pdfFile))<br><br># write the merged pdf <br>pdfMerge.write(&#39;merged.pdf&#39;)<br><br># download the final pdf<br>files.download(&#39;merged.pdf&#39;)</pre><h3>References</h3><ol><li><a href="https://www.geeksforgeeks.org/python-convert-image-to-pdf-using-img2pdf-module/#">https://www.geeksforgeeks.org/python-convert-image-to-pdf-using-img2pdf-module/</a></li><li><a href="https://python-bloggers.com/2022/04/merging-pdfs-with-python-2/">https://python-bloggers.com/2022/04/merging-pdfs-with-python-2/</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f1f9cee3b996" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What is a DataFrame in PySpark?]]></title>
            <link>https://medium.com/@hunter-j-phillips/what-is-a-dataframe-in-pyspark-e968240fd1f4?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/e968240fd1f4</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[pyspark]]></category>
            <category><![CDATA[dataframes]]></category>
            <category><![CDATA[spark]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Sat, 10 Jun 2023 04:39:55 GMT</pubDate>
            <atom:updated>2023-06-10T04:39:55.924Z</atom:updated>
<content:encoded><![CDATA[<p>This article covers DataFrames in PySpark and how to manipulate them with DataFrame methods and Spark SQL.</p><h3>DataFrames</h3><p>In PySpark, a DataFrame is a table-like structure that can be manipulated using SQL-like methods. A DataFrame can be thought of as a table with rows and columns. Each column is a field, and each row is a record. For instance, the DataFrame below has two fields: age and name. It has three records: (null, Michael), (30, Andy), and (19, Justin).</p><pre>+----+-------+<br>| age|   name|<br>+----+-------+<br>|null|Michael|<br>|  30|   Andy|<br>|  19| Justin|<br>+----+-------+</pre><p>The following sections will highlight how to use DataFrames and their potential advantages over RDDs.</p><h3>Loading and Previewing a DataFrame</h3><p>To use PySpark, a SparkSession can be created to interact with a cluster. This session can be used to create a DataFrame from an <a href="https://medium.com/@hunter-j-phillips/what-is-an-rdd-in-pyspark-5b5968c0ac9d">RDD</a>, JSON files, CSV files, and more.</p><pre>from pyspark.sql import SparkSession<br><br># build a SparkSession<br>spark = SparkSession.builder.appName(&quot;intro&quot;).getOrCreate()</pre><p>spark.read.&lt;data_type&gt; can be used to create a DataFrame. data_type can be json, csv, text, and more. The example below uses json to read a short JSON file from a GitHub repo. The people.json file can be downloaded <a href="https://raw.githubusercontent.com/apache/spark/master/examples/src/main/resources/people.json">here</a>.</p><pre># create dataframe<br>df = spark.read.json(&quot;people.json&quot;)</pre><p>This is the DataFrame shown in the first section:</p><pre>+----+-------+<br>| age|   name|<br>+----+-------+<br>|null|Michael|<br>|  30|   Andy|<br>|  19| Justin|<br>+----+-------+</pre>
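<p>For reference, reading a CSV file follows the same pattern. The snippet below is a small illustrative sketch rather than part of the original article (the people.csv filename is hypothetical); header and inferSchema tell the reader to treat the first row as column names and to guess each column&#39;s data type:</p><pre># create dataframe from a CSV file<br># header=True uses the first row as column names<br># inferSchema=True infers each column&#39;s data type<br>df_csv = spark.read.csv(&quot;people.csv&quot;, header=True, inferSchema=True)</pre>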
<p>DataFrames can also be created using spark.createDataFrame(&lt;list of lists&gt;, schema=&lt;list of column names&gt;). The code below creates a DataFrame of animals on a small farm, including their number of legs and age.</p><pre>df1 = spark.createDataFrame([[&#39;cow&#39;, 4, 5],<br>                             [&#39;cow&#39;, 4, 3],<br>                             [&#39;cow&#39;, 4, 3],<br>                             [&#39;chicken&#39;, 2, 2],<br>                             [&#39;chicken&#39;, 2, 1],<br>                             [&#39;chicken&#39;, 2, 0],<br>                             [&#39;horse&#39;, 4, 8],<br>                             [&#39;donkey&#39;, 4, 8],<br>                             [&#39;donkey&#39;, 4, 2],<br>                             [&#39;turkey&#39;, 2, 1],<br>                             [&#39;turkey&#39;, 2, 1],<br>                             [&#39;pig&#39;, 4, 5],<br>                             [&#39;dog&#39;, 4, 12],<br>                             [&#39;cat&#39;, 4, 9],<br>                             [&#39;goat&#39;, 4, 3],<br>                             [&#39;goat&#39;, 5, 1]<br>                            ], schema=[&#39;animal&#39;, &#39;legs&#39;, &#39;age&#39;])</pre><p>This DataFrame has the following appearance:</p><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|    cow|   4|  5|<br>|    cow|   4|  3|<br>|    cow|   4|  3|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|chicken|   2|  0|<br>|  horse|   4|  8|<br>| donkey|   4|  8|<br>| donkey|   4|  2|<br>| turkey|   2|  1|<br>| turkey|   2|  1|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>|    cat|   4|  9|<br>|   goat|   4|  3|<br>|   goat|   5|  1|<br>+-------+----+---+</pre><p>These DataFrames can be manipulated using DataFrame methods or by manually programming SQL expressions.</p><h3>DataFrame Operations with Methods</h3><p>This first section will deal with methods directly accessible on DataFrame objects.</p><h4>PrintSchema</h4><p>To see the structure of the DataFrame, df.printSchema() can be used; it shows the column names, data types, and nullability.</p><pre>df.printSchema()<br>df1.printSchema()</pre><pre>root<br> |-- age: long (nullable = true)<br> |-- name: string (nullable = true)<br><br>root<br> |-- animal: string (nullable = true)<br> |-- legs: long (nullable = true)<br> |-- age: long (nullable = true)</pre><h4>Show</h4><p>To see the contents of the DataFrame, df.show() can be used. This easy preview is one benefit of DataFrames over RDDs.</p><pre>df.show()</pre><pre>+----+-------+<br>| age|   name|<br>+----+-------+<br>|null|Michael|<br>|  30|   Andy|<br>|  19| Justin|<br>+----+-------+</pre><h4>Select</h4><p>The DataFrame is also much more interactive than an RDD. Columns can be selected by name with df.select(&quot;&lt;col&gt;&quot;), by attribute with df.select(df.&lt;col&gt;), or by indexing with df.select(df[&lt;col&gt;]):</p><pre>df.select(&quot;age&quot;).show()</pre><pre>+----+<br>| age|<br>+----+<br>|null|<br>|  30|<br>|  19|<br>+----+</pre><p>A constant can also be added to a column with a similar statement:</p><pre>df.select(df[&#39;name&#39;], df[&#39;age&#39;] + 6).show()</pre><pre>+-------+---------+<br>|   name|(age + 6)|<br>+-------+---------+<br>|Michael|     null|<br>|   Andy|       36|<br>| Justin|       25|<br>+-------+---------+</pre>
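<p>The generated column name, (age + 6), is awkward to reference later. As a small aside not in the original article, alias can rename the computed column (output omitted):</p><pre># rename the computed column to something readable<br>df.select(df[&#39;name&#39;], (df[&#39;age&#39;] + 6).alias(&#39;agePlus6&#39;)).show()</pre>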
<h4>WithColumn</h4><p>df.withColumn(&lt;col&gt;, &lt;manipulated col&gt;) can be used to add a new column or overwrite an existing one. The example below uses lower from pyspark.sql.functions to lowercase all the values in the name field:</p><pre>from pyspark.sql.functions import lower<br><br>df.withColumn(&#39;nameLower&#39;, lower(df[&#39;name&#39;])).show()</pre><pre>+----+-------+---------+<br>| age|   name|nameLower|<br>+----+-------+---------+<br>|null|Michael|  michael|<br>|  30|   Andy|     andy|<br>|  19| Justin|   justin|<br>+----+-------+---------+</pre><h4>Filter</h4><p>df.filter(cond) can be used to filter a DataFrame based on the provided condition. The example below filters for people who are 19 years old.</p><pre>df.filter(df[&#39;age&#39;] == 19).show()</pre><pre>+---+------+<br>|age|  name|<br>+---+------+<br>| 19|Justin|<br>+---+------+</pre><h4><strong>Count</strong></h4><p>df.count() returns the total number of records in the DataFrame. Notice the animal DataFrame is being used now.</p><pre>df1.count()</pre><pre>16</pre><h4>GroupBy</h4><p>df.groupBy(&lt;col&gt;).&lt;aggFunc&gt; can be used to group a DataFrame based on a specific field, and then an aggregation can be performed on each group. The first example shows the average number of legs and age for each type of animal:</p><pre>df1.groupBy(df1[&#39;animal&#39;]).avg().show()</pre><pre>+-------+---------+------------------+<br>| animal|avg(legs)|          avg(age)|<br>+-------+---------+------------------+<br>|  horse|      4.0|               8.0|<br>|    cow|      4.0|3.6666666666666665|<br>| donkey|      4.0|               5.0|<br>|chicken|      2.0|               1.0|<br>|    dog|      4.0|              12.0|<br>|    cat|      4.0|               9.0|<br>| turkey|      2.0|               1.0|<br>|    pig|      4.0|               5.0|<br>|   goat|      4.5|               2.0|<br>+-------+---------+------------------+</pre><p>The second example sums the legs and age for each type of animal:</p><pre>df1.groupBy(df1[&#39;animal&#39;]).sum().show()</pre><pre>+-------+---------+--------+<br>| animal|sum(legs)|sum(age)|<br>+-------+---------+--------+<br>|  horse|        4|       8|<br>|    cow|       12|      11|<br>| donkey|        8|      10|<br>|chicken|        6|       3|<br>|    dog|        4|      12|<br>|    cat|        4|       9|<br>| turkey|        4|       2|<br>|    pig|        4|       5|<br>|   goat|        9|       4|<br>+-------+---------+--------+</pre><p>The min() and max() aggregation functions could also be used, as sketched below.</p>
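<p>As a quick sketch (this example is not in the original article; output omitted), min() follows the same pattern as avg() and sum():</p><pre># smallest value of each numeric column per animal<br>df1.groupBy(df1[&#39;animal&#39;]).min().show()</pre>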
<h4>Distinct</h4><p>df.distinct() returns a new DataFrame without any duplicate rows. In the animal DataFrame, there are two cows with the same record: (cow, 4, 3). Both turkeys also have the same record: (turkey, 2, 1). This method will remove them:</p><pre>df1.distinct().show()</pre><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|chicken|   2|  0|<br>|    cow|   4|  5|<br>|  horse|   4|  8|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|    cow|   4|  3|<br>| donkey|   4|  8|<br>|   goat|   4|  3|<br>| turkey|   2|  1|<br>|    cat|   4|  9|<br>|   goat|   5|  1|<br>| donkey|   4|  2|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>+-------+----+---+</pre><h4>DropDuplicates</h4><p>df.dropDuplicates() also removes duplicate rows. Unlike distinct(), it can optionally consider only a subset of columns, such as df1.dropDuplicates([&#39;animal&#39;]).</p><pre>df1.dropDuplicates().show()</pre><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|chicken|   2|  0|<br>|    cow|   4|  5|<br>|  horse|   4|  8|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|    cow|   4|  3|<br>| donkey|   4|  8|<br>|   goat|   4|  3|<br>| turkey|   2|  1|<br>|    cat|   4|  9|<br>|   goat|   5|  1|<br>| donkey|   4|  2|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>+-------+----+---+</pre><h3>DataFrames with SQL</h3><p>DataFrames use the same engine as Spark SQL, so the sql functionality of SparkSession can be used on DataFrames that are registered as a table. This means SQL can be used on a DataFrame like it would be on any other table.</p><h4>CreateOrReplaceTempView</h4><p>df.createOrReplaceTempView(&lt;name&gt;) registers a DataFrame as a table that can be accessed in SQL expressions.</p><pre>df1.createOrReplaceTempView(&quot;animals&quot;)</pre><h4>Spark.SQL</h4><p>With the table registered, spark.sql(&quot;SQL expression&quot;) can be used to query it. For the most part, <a href="https://www.w3schools.com/sql/">SQL </a>expressions work as expected. The query below selects all the rows from the table.</p><pre>spark.sql(&quot;SELECT * FROM animals&quot;).show()</pre><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|    cow|   4|  5|<br>|    cow|   4|  3|<br>|    cow|   4|  3|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|chicken|   2|  0|<br>|  horse|   4|  8|<br>| donkey|   4|  8|<br>| donkey|   4|  2|<br>| turkey|   2|  1|<br>| turkey|   2|  1|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>|    cat|   4|  9|<br>|   goat|   4|  3|<br>|   goat|   5|  1|<br>+-------+----+---+</pre><p>And even more complicated queries can be used:</p><pre>spark.sql(&quot;&quot;&quot;SELECT animal, MIN(legs), AVG(age) <br>                FROM animals <br>                GROUP BY animal<br>                ORDER BY AVG(age) DESC<br>          &quot;&quot;&quot;).show()</pre><pre>+-------+---------+------------------+<br>| animal|min(legs)|          avg(age)|<br>+-------+---------+------------------+<br>|    dog|        4|              12.0|<br>|    cat|        4|               9.0|<br>|  horse|        4|               8.0|<br>| donkey|        4|               5.0|<br>|    pig|        4|               5.0|<br>|    cow|        4|3.6666666666666665|<br>|   goat|        4|               2.0|<br>|chicken|        2|               1.0|<br>| turkey|        2|               1.0|<br>+-------+---------+------------------+</pre><h3>References</h3><ol><li><a href="https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html">https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e968240fd1f4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What is an RDD in PySpark?]]></title>
            <link>https://medium.com/@hunter-j-phillips/what-is-an-rdd-in-pyspark-5b5968c0ac9d?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/5b5968c0ac9d</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[pyspark]]></category>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[big-data]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Sat, 10 Jun 2023 01:52:56 GMT</pubDate>
            <atom:updated>2023-06-10T01:59:06.832Z</atom:updated>
            <content:encoded><![CDATA[<p>This article covers the basic uses of resilient distributed datasets in PySpark. It includes examples of both transformations and actions that can be performed on them.</p><h3>Resilient Distributed Datasets (RDDs)</h3><p>In PySpark, a resilient distributed dataset (RDD) is a collection of elements. Unlike a normal list, they can be operated on in parallel. This basically means that when an operation is performed on a collection, it is split into a number of subcollections. These subcollections are sent to a cluster of computers, and the operation is performed in parallel on each subcollection and returned. RDDs are also fault tolerant, which means operations will be properly performed even if a component of the cluster fails.</p><p>An RDD can be created from an existing collection, or it can be created from an external dataset. To start, a simple list can be loaded and parallelized. Parallelization is controlled by SparkContext; it connects to a cluster and can broadcast the data to it.</p><pre>from pyspark import SparkContext<br><br># initialize SparkContext<br>sc = SparkContext(master=&#39;local&#39;, appName=&#39;test&#39;)</pre><pre>data = [1, 5, 10, 15, 20, 25, 30]<br><br># c = collection to distribute<br># numSlices = partitions of collection<br>distributedData = sc.parallelize(c=data, numSlices=3)<br># preview the partitions<br>distributedData.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30]]</pre><p>When parallelizing the data, the number of partitions, numSlices, represents the number of tasks, or subcollections, to run on the cluster. About 2 to 4 slices per CPU in the cluster is normal. glom() can be used to gather each partition’s data into a list, and collect() can be used to preview the partitions. Now, operations can be performed on the RDD. There are two types of RDD operations: transformations, which yield a new RDD, and actions, which return a value.</p><h3>Transformations</h3><p>Transformations are operations that return a new RDD.</p><h4><strong>Map</strong></h4><p>map(func) passes each element of an RDD through a function, and the appropriate operations are performed on each element. In the example below, each element of the distributed dataset is multiplied by 2.</p><pre># map<br>newRDD = distributedData.map(lambda x: 2*x)<br>newRDD.glom().collect()</pre><pre>[[2, 10], [20, 30], [40, 50, 60]]</pre><h4><strong>Filter</strong></h4><p>filter(func) returns an RDD of elements that meet the requirements of the function. The example below filters for elements with a value greater than 10.</p><pre># filter<br>newRDD = distributedData.filter(lambda x: x &gt; 10)<br>newRDD.glom().collect()</pre><pre>[[], [15], [20, 25, 30]]</pre><h4><strong>FlatMap</strong></h4><p>flatMap(func) is similar to map but each element can be mapped to an output of 0 or more elements (a sequence). In this example, the input element is mapped to a tuple of itself and the output of 5<em>x</em>.</p><pre># flatMap<br>newRDD = distributedData.flatMap(lambda x: [(x, 5*x)])<br>newRDD.glom().collect()</pre><pre>[[(1, 5), (5, 25)], [(10, 50), (15, 75)], [(20, 100), (25, 125), (30, 150)]]</pre><h4><strong>MapPartitions</strong></h4><p>mapPartitions(func) is similar to map but runs on each partition and returns the new partition. 
In this example, each partition’s elements are summed and returned as the partition.</p><pre># mapPartitions<br>def f_mapPart(iterator):<br>  yield sum(iterator)<br><br>newRDD = distributedData.mapPartitions(f_mapPart)<br>newRDD.glom().collect()</pre><pre>[[6], [25], [75]]</pre><h4><strong>MapPartitionsWithIndex</strong></h4><p>mapPartitionsWithIndex(func) is similar to mapPartitions but also includes the partition’s index. The example below yields the index of each partition.</p><pre># mapPartitionsWithIndex<br>def f_mapPartIndex(index, iterator):<br>  yield index<br><br>newRDD = distributedData.mapPartitionsWithIndex(f_mapPartIndex)<br>newRDD.glom().collect()</pre><pre>[[0], [1], [2]]</pre><h4><strong>Union</strong></h4><p>union(RDD) returns a new RDD with the union of the original RDD and provided RDD. The example below shows the distributed dataset unioned with the distributed dataset, creating a new RDD twice as long.</p><pre># union<br>newRDD = distributedData.union(distributedData)<br>newRDD.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30], [1, 5], [10, 15], [20, 25, 30]]</pre><h4><strong>Intersection</strong></h4><p>intersection(RDD) returns a new RDD with the intersection of the original and provided RDDs. The example below combines the original distributed data and a new distributed dataset to generate a new RDD with only the intersections.</p><pre>data2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]<br><br># distributedData2 = [[1, 2], [3, 4], [5, 6], [7, 8, 9, 10]]<br>distributedData2 = sc.parallelize(data2, 4)<br><br># intersection | distributedData = [[1, 5], [10, 15], [20, 25, 30]]<br>newRDD = distributedData.intersection(distributedData2)<br>newRDD.collect()</pre><pre>[1, 10, 5]</pre><h4><strong>Distinct</strong></h4><p>distinct() returns a new RDD with the unique elements from the original.</p><pre>data3 = [1, 1, 2, 2, 3, 3, 4, 5]<br>distributedData3 = sc.parallelize(data3, 4)<br><br># distinct<br>newRDD = distributedData3.distinct()<br>sorted(newRDD.collect())</pre><pre>[1, 2, 3, 4, 5]</pre><h4><strong>GroupByKey and MapValues</strong></h4><p>groupByKey() requires an RDD with elements of (K, V) and returns a new RDD of elements (K, Iterable&lt;V&gt;), where Iterable&lt;V&gt; includes all values paired with K.</p><pre>data4 = [(&quot;red&quot;, 1), (&quot;red&quot;, 2), (&quot;red&quot;, 3), (&quot;blue&quot;, 4), (&quot;blue&quot;, 5)]<br>distributedData4 = sc.parallelize(data4, 4)<br><br># groupByKey()<br>newRDD = distributedData4.groupByKey()<br>newRDD.collect()</pre><pre>[(&#39;red&#39;, &lt;pyspark.resultiterable.ResultIterable at 0x7fea6ec3e860&gt;),<br> (&#39;blue&#39;, &lt;pyspark.resultiterable.ResultIterable at 0x7feaa422ffa0&gt;)]</pre><p>To view the values of the iterables, the RDD’s elements can be mapped to the list function with mapValues(func), which alters each value without altering the keys.</p><pre>newRDD.mapValues(list).collect()</pre><pre>[(&#39;red&#39;, [1, 2, 3]), (&#39;blue&#39;, [4, 5])]</pre><h4><strong>ReduceByKey</strong></h4><p>reduceByKey(func) requires an RDD with elements of (K, V) and returns a new RDD with elements of (K, V), where V is aggregated based on K and reduced by the function. In the example, it is important to note that a and b are required for the function to add each element in the list. As an example, [1, 2, 3] may be reduced like 1 + 2 = 3, then 3 + 3 = 6. 
The result from the previous addition is an input for the current addition.</p><pre># reduceByKey()<br>newRDD = distributedData4.reduceByKey(lambda a,b: a+b)<br>newRDD.collect()</pre><pre>[(&#39;red&#39;, 6), (&#39;blue&#39;, 9)]</pre><h4><strong>SortByKey</strong></h4><p>sortByKey(ascending=True, keyfunc) returns a new RDD sorted in ascending or descending order based on the key function or the default order. The example below sorts each key in ascending order.</p><pre># sortByKey()<br>data5 = [(&quot;zebra&quot;, 1), (&quot;red&quot;, 2), (&quot;apple&quot;, 3), (&quot;blue&quot;, 4), (&quot;horse&quot;, 5)]<br>distributedData5 = sc.parallelize(data5, 4)<br><br>newRDD = distributedData5.sortByKey(ascending=True)<br>newRDD.collect()</pre><pre>[(&#39;apple&#39;, 3), (&#39;blue&#39;, 4), (&#39;horse&#39;, 5), (&#39;red&#39;, 2), (&#39;zebra&#39;, 1)]</pre><p>This next example uses a key function to select the second to last letter of each key and sorts it in descending order:</p><pre># sortByKey()<br>newRDD = distributedData5.sortByKey(ascending=False, keyfunc=lambda k: k[-2])<br>newRDD.collect()</pre><pre>[(&#39;blue&#39;, 4), (&#39;horse&#39;, 5), (&#39;zebra&#39;, 1), (&#39;apple&#39;, 3), (&#39;red&#39;, 2)]</pre><h4><strong>Join, LeftOuterJoin, RightOuterJoin, FullOuterJoin</strong></h4><p>join(RDD) returns a new RDD of (K, (V, W)) if the original and provided datasets are (K, V) and (K, W), respectively. In other words, values from identical keys are grouped together and returned. Keys without corresponding pairs in both datasets are not returned.</p><pre># join<br>leftData = [(&quot;a&quot;, 1), (&quot;b&quot;, 2), (&quot;c&quot;, 3)]<br>rightData = [(&quot;a&quot;, 4), (&quot;c&quot;, 5), (&quot;d&quot;, 6)]<br><br>leftRDD = sc.parallelize(leftData)<br>rightRDD = sc.parallelize(rightData)<br><br>newRDD = leftRDD.join(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;c&#39;, (3, 5)), (&#39;a&#39;, (1, 4))]</pre><p>leftOuterJoin(RDD) returns a new RDD of (K, (V, W)). For each (K, V) in the left dataset, the corresponding (K, W) in the right dataset will be joined. If the key does not exist in the right dataset, None will be returned. This means every K in the left dataset is present in the new RDD.</p><pre># leftOuterJoin<br>newRDD = leftRDD.leftOuterJoin(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;b&#39;, (2, None)), (&#39;c&#39;, (3, 5)), (&#39;a&#39;, (1, 4))]</pre><p>rightOuterJoin(RDD) returns a new RDD of (K, (V, W)). For each (K, W) in the right dataset, the corresponding (K, V) in the left dataset will be joined. If the key does not exist in the left dataset, None will be returned. This means every K in the right dataset is present in the new RDD.</p><pre># rightOuterJoin<br>newRDD = leftRDD.rightOuterJoin(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;c&#39;, (3, 5)), (&#39;d&#39;, (None, 6)), (&#39;a&#39;, (1, 4))]</pre><p>fullOuterJoin(RDD) returns a new RDD of (K, (V, W)). For each (K, V) in the left dataset and (K, W) in the right dataset, the matches will be returned as (K, (V, W)). If a key exists in the left dataset that is not in the right dataset, the result will be (K, (V, None)). Likewise, if a key exists in the right dataset that is not in the left dataset, the result will be (K, (None, W)). 
This is essentially a union of the left and right outer joins.</p><pre># fullOuterJoin<br>newRDD = leftRDD.fullOuterJoin(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;b&#39;, (2, None)), (&#39;c&#39;, (3, 5)), (&#39;d&#39;, (None, 6)), (&#39;a&#39;, (1, 4))]</pre><h4><strong>CoGroup</strong></h4><p>cogroup(RDD) returns an RDD of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) if the original and source are (K, V) and (K, W), respectively.</p><pre># cogroup<br>newRDD = leftRDD.cogroup(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;b&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eca6920&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb33760&gt;)),<br> (&#39;c&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb31870&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb30e20&gt;)),<br> (&#39;d&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb32470&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb32950&gt;)),<br> (&#39;a&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb32230&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb33fa0&gt;))]</pre><p>To view the iterables, the values can be mapped to lists:</p><pre>[(k, tuple(map(list, v))) for k, v in newRDD.collect()]</pre><pre>[(&#39;b&#39;, ([2], [])), (&#39;c&#39;, ([3], [5])), (&#39;d&#39;, ([], [6])), (&#39;a&#39;, ([1], [4]))]</pre><h4><strong>Coalesce</strong></h4><p>coalesce(numPartitions) reduces the number of partitions of an RDD. The example below coalesces from three partitions to two partitions.</p><pre># preview the partitions<br>distributedData.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30]]</pre><pre># coalesce<br>distributedData.coalesce(numPartitions=2).glom().collect()</pre><pre>[[1, 5], [10, 15, 20, 25, 30]]</pre><h4><strong>Repartition</strong></h4><p>repartition(numPartitions) randomly shuffles the data to create more or fewer partitions. The example below repartitions from three partitions to two, but it differs from coalesce since it randomizes the partitions.</p><pre># preview the partitions<br>distributedData.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30]]</pre><pre># repartition<br>distributedData.repartition(numPartitions=2).glom().collect()</pre><pre>[[20, 25, 30], [1, 5, 10, 15]]</pre><h3>Actions</h3><p>Actions are operations that return a value or some values from an RDD rather than creating a new RDD.</p><h4><strong>Collect</strong></h4><p>collect() has been used throughout the previous examples to return the RDD as a list for viewing purposes. The example below confirms that the output is a list.</p><pre># collect<br>type(distributedData.glom().collect())</pre><pre>list</pre><h4><strong>Reduce</strong></h4><p>reduce(func) aggregates the elements of an RDD using the provided function. This function takes two arguments and has a single output. 
The operations should be commutative and associative so the reduction can be computed correctly in parallel.</p><pre># reduce | distributedData = [[1, 5], [10, 15], [20, 25, 30]]<br>distributedData.reduce(lambda a,b: a+b)</pre><pre>106</pre><h4><strong>Count</strong></h4><p>count() returns the number of elements in an RDD.</p><pre># count<br>distributedData.count()</pre><pre>7</pre><h4><strong>First, Take, TakeSample</strong></h4><p>first() returns the first element in the RDD.</p><pre># first<br>distributedData.first()</pre><pre>1</pre><p>take(n) returns the first <em>n</em> elements of the RDD.</p><pre># take<br>distributedData.take(4)</pre><pre>[1, 5, 10, 15]</pre><p>takeSample(withReplacement=True|False, num) returns a sample from the RDD with a size of num, with or without replacement.</p><pre># takeSample<br>distributedData.takeSample(withReplacement=True, num=5)</pre><pre>[5, 20, 5, 15, 5]</pre><h4><strong>CountByKey</strong></h4><p>countByKey() can be used on RDDs with elements of (K, V). The result is a dictionary (hashmap) that maps each key K to the number of values paired with it.</p><pre># distributedData4= [(&quot;red&quot;, 1), (&quot;red&quot;, 2), (&quot;red&quot;, 3), (&quot;blue&quot;, 4), (&quot;blue&quot;, 5)]<br>dict(distributedData4.countByKey())</pre><pre>{&#39;red&#39;: 3, &#39;blue&#39;: 2}</pre><h3>References</h3><ol><li><a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html">https://spark.apache.org/docs/latest/rdd-programming-guide.html</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5b5968c0ac9d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: Polynomial Regression]]></title>
            <link>https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-multiple-linear-regression-b3ddafd18008?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/b3ddafd18008</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[regression]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Mon, 22 May 2023 04:47:43 GMT</pubDate>
            <atom:updated>2023-05-22T17:51:42.240Z</atom:updated>
<content:encoded><![CDATA[<h3>An Introduction to Machine Learning in Python: Polynomial Regression</h3><p>Polynomial regression can identify a nonlinear relationship between an independent variable and a dependent variable.</p><h3>Background</h3><p>This article is the fourth in a series on regression, gradient descent, and MSE. The previous articles cover <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">Simple Linear Regression</a>, <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf">The Normal Equation for Regression</a>, and <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-multiple-linear-regression-in-python-6f2335d0dcbe">Multiple Linear Regression</a>.</p><h3>Polynomial Regression</h3><p>Polynomial regression is used on complex data that would be best fit with curves. It can be treated as a subset of multiple linear regression.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*b0GWgW6ksQ3W-ss6.png" /></figure><p>Note that <strong><em>X₀ </em></strong>is a column of ones for the bias; this allows for the generalized formula discussed in the first article. Using the equation above, each “independent” variable can be considered an exponentiated version of <strong><em>X₁</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*5WWGgQzrXOBoDfUxn6TNsA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*ArO5QnFcrJ0ADzkbCkfpTA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/142/1*oTiFDJaFnPfxS9AnwI7A8A.png" /></figure><p>This allows the same model from multiple linear regression to be used since only the coefficients of each variable need to be identified. A simple, third-degree polynomial model can be created as an example. 
Its equation follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/428/1*hswEP5aYxFEIA3for6cDyw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*5WWGgQzrXOBoDfUxn6TNsA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*ArO5QnFcrJ0ADzkbCkfpTA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*0uBA3psLyDQzwueFfq3a4w.png" /></figure><p>The generalized functions for the model, gradient descent, and the MSE can be used from the previous articles:</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br>      X: array of inputs  | (n samples, num features)<br><br>    Output:<br>      returns the output of X@w | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return torch.matmul(X, w)</pre><pre># mean squared error (MSE)<br>def MSE(Yhat, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      Yhat: array of predictions | (n samples, 1)<br>      Y: array of expected outputs | (n samples, 1)<br>    Output:<br>      returns the loss of the model, which is a scalar<br>  &quot;&quot;&quot;<br>  return torch.mean((Yhat-Y)**2) # mean((error)^2)</pre><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights<br>  &quot;&quot;&quot; <br><br>  n = X.shape[0]<br><br>  return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)</pre><h4>Creating the Data</h4><p>Now, all that is required is some data to train the model with. A “blueprint” function can be used, and randomness can be added. This follows the same approach as the previous articles. The blueprint can be seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/302/1*zWj10rVtf0fW4HP5Vqv-_Q.png" /></figure><p>A train set with a size of (800, 4) and a test set with a size of (200, 4) can be created. 
Note that each feature, except the bias, is an exponentiated version of the first.</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># features<br>X0 = torch.ones((1000,1))<br>X1 = (100*(torch.rand(1000) - 0.5)).reshape(-1,1) # generates 1000 random numbers from -50 to 50<br>X2, X3 = X1**2, X1**3<br>X = torch.hstack((X0,X1,X2,X3))<br><br># normal distribution with a mean of 0 and std of 8<br>normal = torch.distributions.Normal(loc=0, scale=8)<br><br># targets<br>Y = (3*X[:,3] + 2*X[:,2] + 1*X[:,1] + 5 + normal.sample(torch.ones(1000).shape)).reshape(-1,1)<br><br># train, test<br>Xtrain, Xtest = X[:800], X[800:]<br>Ytrain, Ytest = Y[:800], Y[800:]</pre><p>After defining the initial weights, the data can be plotted with the line of best fit.</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(4, 1))<br>w</pre><pre>tensor([[0.83],<br>        [0.13],<br>        [0.91],<br>        [0.82]])</pre><pre>import matplotlib.pyplot as plt<br><br>def plot_lbf():<br>  &quot;&quot;&quot;<br>    Output:<br>      prints the line of best fit in comparison to the train and test data<br>  &quot;&quot;&quot;<br><br>  # plot the train and test sets<br>  plt.scatter(Xtrain[:,1],Ytrain,label=&quot;train&quot;)<br>  plt.scatter(Xtest[:,1],Ytest,label=&quot;test&quot;)<br><br>  # plot the line of best fit<br>  X1_plot = torch.arange(-50, 50.1,.1).reshape(-1,1) <br>  X2_plot, X3_plot = X1_plot**2, X1_plot**3<br>  X0_plot = torch.ones(X1_plot.shape)<br>  X_plot = torch.hstack((X0_plot,X1_plot,X2_plot,X3_plot))<br><br>  plt.plot(X1_plot.flatten(), model(w, X_plot).flatten(), color=&quot;red&quot;, zorder=4)<br><br>  plt.xlim(-50, 50)<br>  plt.xlabel(&quot;$X$&quot;)<br>  plt.ylabel(&quot;$Y$&quot;)<br>  plt.legend()<br>  plt.show()<br><br>plot_lbf()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/609/1*MmQ2EEAiNyWJNJp1zZwBrA.png" /><figcaption>Image by Author</figcaption></figure><h4>Training the Model</h4><p>To partially minimize the cost function, a learning rate of 5e-11 and 500,000 epochs can be used with gradient descent.</p><pre>lr = 5e-11<br>epochs = 500000<br><br># training loop<br>for i in range(0, epochs):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 100,000 epochs<br>  if (i+1) % 100000 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w)<br>    print(&quot;Train MSE:&quot;, MSE(model(w,Xtrain), Ytrain))<br>    print(&quot;Test MSE:&quot;, MSE(model(w,Xtest), Ytest))<br>    print(&quot;=&quot;*10)<br><br>plot_lbf()</pre><pre>epoch: 100000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(163.87)<br>Test MSE: tensor(162.55)<br>==========<br>epoch: 200000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(163.52)<br>Test MSE: tensor(162.22)<br>==========<br>epoch: 300000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(163.19)<br>Test MSE: tensor(161.89)<br>==========<br>epoch: 400000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(162.85)<br>Test MSE: tensor(161.57)<br>==========<br>epoch: 500000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(162.51)<br>Test MSE: tensor(161.24)<br>==========</pre><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/609/1*soBQARwqgtYztsM_jt_I8g.png" /><figcaption>Image by Author</figcaption></figure><p>Even with 500,000 epochs and an extremely small learning rate, the model fails to identify the first two weights. While the current solution is highly accurate with an MSE of 161.24, it would likely require millions of epochs to completely minimize it. This is one of the limitations of gradient descent for polynomial regression.</p><h4>The Normal Equation</h4><p>As an alternative, the Normal Equation from the second article can be used to directly compute the optimized weights:</p><pre>def NormalEquation(X, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      X: array of input values | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      <br>    Output:<br>      returns the optimized weights | (num features, 1)<br>  &quot;&quot;&quot;<br>  <br>  return torch.inverse(X.T @ X) @ X.T @ Y<br><br>w = NormalEquation(Xtrain, Ytrain)<br>w</pre><pre>tensor([[4.57],<br>        [0.98],<br>        [2.00],<br>        [3.00]])</pre><p>The Normal Equation is able to immediately identify the correct values for each weight, and the MSE for each set is about 100 points lower than with gradient descent:</p><pre>MSE(model(w,Xtrain), Ytrain), MSE(model(w,Xtest), Ytest)</pre><pre>(tensor(60.64), tensor(63.84))</pre><h3>Conclusion</h3><p>With simple linear, multiple linear, and polynomial regression implemented, the next two articles will cover Lasso and Ridge regression. These types of regression introduce two important concepts in machine learning: overfitting and regularization.</p><p>Please don’t forget to like and follow! :)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b3ddafd18008" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python]]></title>
            <link>https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/28dc37d524cf</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[regression]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Mon, 22 May 2023 03:15:11 GMT</pubDate>
            <atom:updated>2023-05-22T05:01:21.173Z</atom:updated>
            <content:encoded><![CDATA[<p>The Normal Equation is a closed-form solution for minimizing a cost function and identifying the coefficients for regression.</p><h3>Background</h3><p>In the previous article, <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>, the gradient descent approach was used to minimize the MSE cost function. However, the approach required a large number of epochs and a small learning rate, both of which are difficult to identify in a short amount of time.</p><p>An alternative approach is a closed-form solution that does not require a learning rate or epochs. The closed-form solution for regression is known as the Normal Equation. It can be used to directly determine the weights of a line of best fit. It will be derived in this article and then implemented in Python.</p><h3>Deriving the Normal Equation</h3><p>In <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb">A Simple Introduction to Gradient Descent</a>, the matrix derivative of the MSE was calculated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/378/0*C_-7fZVLU96Z5s_w.png" /></figure><p>This partial derivative can be set equal to 0, which indicates where the cost function is at a minimum for each weight. By solving for <strong><em>w</em></strong>, a direct equation to calculate these values can be identified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/303/1*jfRquikjBBEu4TA_-k6XWw.png" /><figcaption>set equal to 0</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/267/1*3IhU9Drz-eQ_Hczqu-FUBg.png" /><figcaption>multiply by n/2</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/198/1*cE9UV9Q-J9f9o4F34hMT-w.png" /><figcaption>place each term on its own side</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/274/1*Hu_k2xmn4td6VM1_ug3tYw.png" /><figcaption>transpose both sides</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/182/1*PfitX-8UtR23NFxVCuLE_A.png" /><figcaption>simplify</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*SmH0MFGgcnjT3u8L12Du5Q.png" /><figcaption>use the inverse of X^TX to isolate <strong><em>w</em></strong></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/232/1*w4R9FpLuNFmGJPY5xKFvHw.png" /><figcaption>simplify</figcaption></figure><p>To prove this returns new weights as anticipated, the size of each component can be examined:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XEjpMxHlOXdHBho7ec6gBA.png" /></figure><p>The output is a vector with a size of <strong><em>(num features, 1</em></strong>). 
This is the same size as the original weight vector from the previous article, <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>.</p><h3>Implementing the Normal Equation in Python</h3><p>This equation can be implemented in Python, and the same example from the previous article can be used.</p><pre>def NormalEquation(X, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      X: array of input values | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      <br>    Output:<br>      returns the optimized weights | (num features, 1)<br>  &quot;&quot;&quot;<br>  <br>  return torch.inverse(X.T @ X) @ X.T @ Y</pre><p>With the function created, all that is necessary is some input data, which is generated below:</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># (n samples, features)<br>X = torch.randint(low=0, high=11, size=(20, 1))<br><br># normal distribution with a mean of 0 and std of 1<br>normal = torch.distributions.Normal(loc=0, scale=1)<br><br># generate output<br>Y = (1.5*X + 2) + normal.sample(X.shape)<br><br># add bias column<br>X = torch.hstack((torch.ones(X.shape),X))</pre><p>These can be plugged into the Normal Equation to generate the optimized weights:</p><pre>w = NormalEquation(X, Y)<br>w</pre><pre>tensor([[1.97],<br>        [1.52]])</pre><p>These weights are nearly identical to the coefficients of the blueprint function. Instead of 2 and 1.5, the equation produced 1.97 and 1.52. They aren’t perfect due to the randomness added to the output. Furthermore, these values are more accurate than those from the previous article since a learning rate and a specific number of epochs did not have to be selected.</p><h3>When to Use it</h3><p>While this approach seems preferable to gradient descent, both have their use cases. For simple problems with small datasets, the Normal Equation will suffice. As the number of features grows, so does the size of the inverted matrix, which has a size of <strong><em>(num features, num features)</em></strong>. This can be expensive to compute.</p><p>When the number of features is large, gradient descent should be used. Gradient descent can also be used to create a generalized equation that does not overfit to the train data.</p><p>For the next two articles, both approaches will be used. The next article is <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-multiple-linear-regression-in-python-6f2335d0dcbe">An Introduction to Machine Learning in Python: Multiple Linear Regression</a>.</p><p>Please don’t forget to like and follow! :)</p><h3>References</h3><ol><li><a href="https://www.datacamp.com/tutorial/tutorial-normal-equation-for-linear-regression">Normal Equation Overview</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=28dc37d524cf" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: Multiple Linear Regression]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-multiple-linear-regression-in-python-6f2335d0dcbe?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/6f2335d0dcbe</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[introduction]]></category>
            <category><![CDATA[regression]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Fri, 19 May 2023 17:40:25 GMT</pubDate>
            <atom:updated>2023-05-22T04:48:29.640Z</atom:updated>
<content:encoded><![CDATA[<p>Multiple linear regression is used to assess the relationship between many independent variables and one dependent variable.</p><h3>Background</h3><p>This article follows <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>; that article covered simple linear regression, gradient descent, and the MSE. This article will cover multiple linear regression and introduce some new machine learning terminology.</p><h3>Multiple Linear Regression</h3><p>While simple linear regression has an equation of <strong><em>Ŷ = w₁X₁ + w₀X₀</em></strong>, multiple linear regression has a generic formula for <strong><em>k </em></strong>independent variables:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/1*Z81ebPzgAKUUqcZwzbscWg.png" /></figure><p>Note that <strong><em>X₀ </em></strong>is a column of ones for the bias; this allows for the generalized formula discussed in the first article. As the formula demonstrates, multiple linear regression helps identify the relationship between many independent variables (<strong><em>X</em></strong>) and a single dependent variable (<strong><em>Ŷ</em></strong>). It does this by learning the values of each weight (<strong><em>w</em></strong>).</p><p>To demonstrate this in action, multiple linear regression with 3 weights can be used:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/327/1*ROoN1bdtO9Zlv8xV3diBNw.png" /></figure><p>This formula will create a “plane of best fit” for three-dimensional data:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/387/1*feOis6XGgSBKgE4E7Xn0SQ.png" /><figcaption>Image by Author</figcaption></figure><h3>The Implementation</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/407/1*TUeQU5YstsF6ko_qDGRE4w.png" /><figcaption>Image by Author</figcaption></figure><p>Like the previous example, a “blueprint” equation can be used, and randomness can be added. Then, the model can try and learn the weights. When using a model on real data, it is common to split the data into at least two sets: the train set and the test set. The train set is used to train the model and acquire the weights. The test set is used to evaluate the model’s performance on data it has never seen before. If it performs well on both, it is more likely to be useful for real-world applications. It is common to use a split of 80% train, 20% test.</p><p>For this example, 1000 samples can be generated and split into test and train sets. Also, since there are two independent variables and a bias, there will be three columns. Each column will represent an independent variable or bias, and each row will represent a sample of the variables, (<strong><em>X₀, X₁, X₂</em></strong>). The shape of the overall data will be a matrix with a size of (1000, 3); a general shape would be (<strong><em>n samples, num features</em></strong>). Remember, independent variables are also known as features in machine learning. The train set will have a size of (800, 3), and the test set will have a size of (200, 3).</p><p>The data for this article will be based around <strong><em>Y = 6X₂ + 3X₁ + 2</em></strong>. This means <strong><em>w₀ </em></strong>is 2, <strong><em>w₁</em></strong> is 3, and <strong><em>w₂ </em></strong>is 6. 
In the example below, 1000 values between -250 and 250 are generated for <strong><em>X₁ </em></strong>and <strong><em>X₂, </em></strong>and 1000 ones are generated for <strong><em>X₀</em></strong>. They are reshaped into columns and stacked horizontally to create a matrix with a size of (1000, 3). The output is generated using the aforementioned equation, and values from a normal distribution with a standard deviation of 10 are added.</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># create ones for the bias | 1000 ones<br>X0 = torch.ones(1000).reshape(-1,1)<br><br># create values for the first feature | 1000 numbers from -250 to 250<br>X1 = (500*(torch.rand(1000) - 0.5)).reshape(-1,1) <br><br># create values for the second feature | 1000 numbers from -250 to 250<br>X2 = (500*(torch.rand(1000) - 0.5)).reshape(-1,1)<br><br># stack data together, X0 = X[:,0], X1 = X[:,1], X2 = X[:,2]<br>X = torch.hstack((X0, X1,X2))<br><br># normal distribution with a mean of 0 and std of 10<br>normal = torch.distributions.Normal(loc=0, scale=10)<br><br># output<br>Y = ((6*X[:,2] + 3*X[:,1] + 2*X[:,0]) + normal.sample(torch.ones(1000).shape)).reshape(-1,1)</pre><p>This data can be previewed before being split.</p><pre>import plotly.express as px<br><br>fig = px.scatter_3d(x=X[:,1].flatten(),<br>                    y=X[:,2].flatten(),<br>                    z=Y.flatten())<br><br>fig.update_traces(marker_size=3)<br>fig.update_layout(scene = dict(xaxis_title=&#39;X&lt;sub&gt;1&lt;/sub&gt;&#39;, <br>                               yaxis_title=&#39;X&lt;sub&gt;2&lt;/sub&gt;&#39;, <br>                               zaxis_title=&#39;Y&#39;))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/508/1*a_c3iA12mGGi_V-cFxnjdA.png" /><figcaption>Image by Author</figcaption></figure><p>Now, the data can be split into test and train data:</p><pre># split the data<br>Xtrain, Xtest = X[:800], X[800:]<br>Ytrain, Ytest = Y[:800], Y[800:]</pre><p>The train data can then be fit with a plane. To start, the functions for the model, MSE, and gradient descent need to be defined. The same ones from the first article, <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>, can be used. 
The end of the article will use the Normal Equation to verify the answer.</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br>      X: array of inputs  | (n samples, num features)<br><br>    Output:<br>      returns the output of X@w | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return torch.matmul(X, w)</pre><pre># mean squared error (MSE)<br>def MSE(Yhat, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      Yhat: array of predictions | (n samples, 1)<br>      Y: array of expected outputs | (n samples, 1)<br>    Output:<br>      returns the loss of the model, which is a scalar<br>  &quot;&quot;&quot;<br>  return torch.mean((Yhat-Y)**2) # mean((error)^2)</pre><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights<br>  &quot;&quot;&quot; <br><br>  n = X.shape[0]<br><br>  return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)</pre><h4>Training the Model</h4><p>With the functions created, the model can be trained to identify the plane of best fit. To start, three random weights can be generated.</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(3, 1))<br>w</pre><pre>tensor([[0.83],<br>        [0.13],<br>        [0.91]])</pre><p>The current plane of best fit and its MSE can be analyzed below. The plane is in orange, the train set is in red, and the test set is in green.</p><pre>import plotly.graph_objects as go<br><br>def plot_model(x1_range, x2_range):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      x1_range: x1-axis range [low, high]<br>      x2_range: x2-axis range [low, high]<br><br>    Global Variables:<br>      Xtrain: array of inputs | (n train samples, num features)<br>      Ytrain: array of expected outputs | (n train samples, 1)<br>      Xtest:  array of inputs | (n test samples, num features)<br>      Ytest:  array of expected outputs | (n test samples, 1)<br>      <br>    Output:<br>      plots the plane of best fit<br>  &quot;&quot;&quot; <br><br>  # meshgrid of possible combinations of (X1, X2)<br>  X1_plot, X2_plot = torch.meshgrid(torch.arange(x1_range[0], x1_range[1], 5),<br>                                    torch.arange(x2_range[0], x2_range[1], 5))<br>  X0_plot = torch.ones(X1_plot.shape)<br>  <br>  # stack together each point (X1, X2) = (X, Y)<br>  X_plot = torch.hstack((X0_plot.reshape(-1,1),<br>                         X1_plot.reshape(-1,1), <br>                         X2_plot.reshape(-1,1)))<br>  <br>  # all possible model predictions (Yhat = Z)<br>  Yhat = model(w, X_plot)<br><br>  # model&#39;s plane of best fit<br>  fig = go.Figure(data=[go.Mesh3d(x=X_plot[:,1].flatten(), <br>                                  y=X_plot[:,2].flatten(), <br>                                  z=Yhat.flatten(), <br>                                  color=&#39;orange&#39;, <br>                                  opacity=0.50)])<br>  <br>  # training data<br>  fig.add_scatter3d(x=Xtrain[:,1].flatten(),<br>                    y=Xtrain[:,2].flatten(),<br>                    z=Ytrain.flatten(), <br>                    mode=&quot;markers&quot;,<br>                    marker=dict(size=3),<br>                    
name=&quot;train&quot;)<br>  <br>  # test data<br>  fig.add_scatter3d(x=Xtest[:,1].flatten(),<br>                    y=Xtest[:,2].flatten(),<br>                    z=Ytest.flatten(), <br>                    mode=&quot;markers&quot;,<br>                    marker=dict(size=3),<br>                    name=&quot;test&quot;)<br>  <br>  # name axes<br>  fig.update_layout(scene = dict(xaxis_title=&#39;X&lt;sub&gt;1&lt;/sub&gt;&#39;, <br>                                 yaxis_title=&#39;X&lt;sub&gt;2&lt;/sub&gt;&#39;, <br>                                 zaxis_title=&#39;Y&#39;))<br><br>  fig.show()<br><br>plot_model([-250,250], [-250,250])</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*QmAkIJ23TOpzXfyBD49fyw.png" /><figcaption>Image by Author</figcaption></figure><pre>MSE(model(w,Xtrain), Ytrain)</pre><pre>tensor(653812.81)</pre><p>Now, a training loop can be created to minimize the MSE. By using 50,000 epochs and a learning rate of 0.00004, the output becomes extremely accurate. These values were chosen empirically. Both the train and test MSE can be seen as well.</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(3, 1))<br><br>lr = 0.00004<br>epochs = 50000<br><br># update the weights each epoch<br>for i in range(0, epochs):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 10,000 epochs<br>  if (i+1) % 10000 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w)<br>    print(&quot;Train MSE:&quot;, MSE(model(w,Xtrain), Ytrain))<br>    print(&quot;Test MSE:&quot;, MSE(model(w,Xtest), Ytest))<br>    print(&quot;=&quot;*10)<br><br>plot_model([-250,250], [-250,250])</pre><pre>epoch: 10000<br>weights: tensor([[1.51],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.52)<br>Test MSE: tensor(70.03)<br>==========<br>epoch: 20000<br>weights: tensor([[1.82],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.20)<br>Test MSE: tensor(70.04)<br>==========<br>epoch: 30000<br>weights: tensor([[1.96],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.12)<br>Test MSE: tensor(70.11)<br>==========<br>epoch: 40000<br>weights: tensor([[2.02],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.09)<br>Test MSE: tensor(70.16)<br>==========<br>epoch: 50000<br>weights: tensor([[2.05],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.08)<br>Test MSE: tensor(70.18)<br>==========</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/784/1*j4UG-uy9rdxOhVQjm47HCg.png" /><figcaption>Image by Author</figcaption></figure><p>The model predicted the plane of best fit to be <strong><em>Ŷ = 6X₂ + 2.98X₁ + 2.05 </em></strong>instead of <strong><em>Y = 6X₂ + 3X₁ + 2. </em></strong>The train and test MSEs are within 17 points of each other, which indicates the model generalizes to the unseen test data. The main limitation of this approach is the number of epochs required to minimize the loss function. An alternative approach would be to use a closed-form solution, which was covered in the previous article, <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf">An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python</a>. 
A closed-form solution does not require a learning rate or epochs to acquire the weights for minimization.</p><h3>The Normal Equation</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/232/1*w4R9FpLuNFmGJPY5xKFvHw.png" /><figcaption>The Normal Equation</figcaption></figure><pre>def NormalEquation(X, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      X: array of input values | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      <br>    Output:<br>      returns the optimized weights | (num features, 1)<br>  &quot;&quot;&quot;<br>  <br>  return torch.inverse(X.T @ X) @ X.T @ Y</pre><p>The code above is the Python implementation for the Normal Equation, which was derived in the previous article. The training data from this example can be used in the equation to directly calculate the optimized weights.</p><pre>w = NormalEquation(Xtrain, Ytrain)<br><br>w</pre><pre>tensor([[2.19],<br>        [2.98],<br>        [6.00]])</pre><p>The MSE can also be calculated:</p><pre>MSE(model(w, Xtrain), Ytrain), MSE(model(w, Xtest), Ytest)</pre><pre>(tensor(87.08), tensor(70.20))</pre><p>The weights and MSE from the Normal Equation and Gradient Descent approaches are nearly identical. In this case, both are equally valid.</p><h3>Conclusion</h3><p>Multiple linear regression is useful for identifying the relationship between two or more independent variables, or features, and one dependent variable. It is also the basis of polynomial regression, which will be examined in the next article: <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-multiple-linear-regression-b3ddafd18008">An Introduction to Machine Learning in Python: Polynomial Regression</a>.</p><p>Please don’t forget to like and follow! :)</p><h3>References</h3><ol><li><a href="https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html">Sklearn Regression Example</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6f2335d0dcbe" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Simple Introduction to Gradient Descent]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/1f32a08b0deb</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[calculus]]></category>
            <category><![CDATA[gradient-descent]]></category>
            <category><![CDATA[introduction]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Wed, 17 May 2023 19:24:04 GMT</pubDate>
            <atom:updated>2023-06-10T18:44:25.299Z</atom:updated>
<content:encoded><![CDATA[<p>Gradient descent is one of the most common optimization algorithms in machine learning. Understanding its basic implementation is fundamental to understanding all the advanced optimization algorithms built on it.</p><h3>Background</h3><p>This article is supplementary to <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>. That article should be read first or in conjunction with this one. It would also be beneficial to have a basic understanding of partial derivatives in calculus because this article examines the partial derivatives of several variations of the Mean Squared Error (MSE).</p><h3>Optimization Algorithms</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/516/1*JcwO8FHnVeJvvzrm3EE1fg.png" /><figcaption>Image by Author</figcaption></figure><p>In machine learning, optimization is the process of finding the ideal parameters, or weights, to maximize or minimize a cost or loss function. The global maximum is the largest value on the domain of the function, whereas the global minimum is the smallest value. While there is only one global maximum and/or minimum, there can be many local maxima and minima. The global minimum or maximum of a cost function indicates where a model’s parameters generate predictions that are close to the actual targets. The local maxima and minima can cause problems when training a model, so their presence should always be considered. The plot above shows an example of each.</p><p>There are a few major algorithm groups within this category: bracketing, local descent, first-order, and second-order. The focus of this article will be first-order algorithms that use the first derivative for optimization. Within this category, the gradient descent algorithm is the most popular.</p><h3>Gradient Descent in One Dimension</h3><p>Gradient descent is a first-order, iterative optimization algorithm used to minimize a cost function. By using partial derivatives, a direction, and a learning rate, gradient descent decreases the error, or difference, between the predicted and actual values.</p><p>The idea behind gradient descent is that the derivative of each weight will reveal its direction and influence on the cost function. In the image below, the cost function is <strong><em>f(w) = w²</em></strong>, which is a parabola. The minimum is at (0,0), and the current weight is -5.6. The current loss is 31.36, and the line in orange represents the derivative, or current rate of change for the weight, which is -11.2. This indicates the weight needs to move “downhill” — or become more positive — to reach a loss of 0. This is where gradient descent comes in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/589/1*0lLDyjlN_kv01Tmq3sFjrw.png" /><figcaption>Image by Author</figcaption></figure><p>By scaling the gradient with a value known as the learning rate and subtracting the scaled gradient from the weight’s current value, the output will minimize. This can be seen in the sketch and image below.</p>
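<p>As a quick illustration, the update rule can be written in a few lines of plain Python. This is a minimal sketch, assuming the same cost function <strong><em>f(w) = w²</em></strong>, starting weight of -5.6, and learning rate of 0.1 used in this section; the variable names are illustrative:</p><pre># minimal gradient descent on f(w) = w**2, where the derivative is 2w<br>w = -5.6   # starting weight<br>lr = 0.1   # learning rate<br><br>for i in range(10):<br>  grad = 2*w        # derivative of the cost function at the current weight<br>  w = w - lr*grad   # update rule: new weight = old weight - lr * gradient<br>  print(f&quot;iteration {i+1}: w = {w:.2f}, loss = {w**2:.2f}&quot;)</pre><p>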
In ten iterations (<strong><em>w₀ to w₉</em></strong>), a learning rate of 0.1 is used to minimize the cost function.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*nErkxEApBl1F4B5cGf2ufg.png" /><figcaption>Image by Author</figcaption></figure><p>In the steps for the algorithm below, a weight is represented by <strong><em>w</em></strong>, with <strong><em>j </em></strong>representing its current value and <strong><em>j+1 </em></strong>representing its new value. The cost function to measure the error is represented by <strong><em>f</em></strong>, and the partial derivative is the gradient of the cost function with respect to the parameters. The learning rate is represented by <strong><em>α</em></strong>.</p><ul><li>select a learning rate and the number of iterations</li><li>choose random values for the parameters</li><li>update the parameters with the equation below</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/298/1*0-yHjw4yeUXG1HiP7JYUfg.png" /></figure><ul><li>repeat step three until the max number of iterations is reached</li></ul><p>When taking a partial derivative of a function, only one parameter can be assessed at a time, and the other parameters are treated as constants. For the example above, <strong><em>f(w) = w²</em></strong>, there is only one parameter, so the derivative is <strong><em>f′(w) = 2w</em></strong>. The formula for updating the parameter follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/244/1*wNcty4IcWh5odwh8p5eG6A.png" /></figure><p>Using a learning rate of 0.1 and a starting weight of -5.6, the first ten iterations follow:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1020/1*gvbWAlNKaxGkUPC7qMvGfA.png" /><figcaption>Table by Author</figcaption></figure><p>The table demonstrates how each component of the formula helps minimize the loss. By negating the scaled gradient, the new weight becomes more positive, and the slope of the new gradient is less steep. As the slope approaches zero, each iteration yields a smaller update.</p><p>This basic implementation of gradient descent can be applied to almost any cost function, including those with numerous weights. A few variations of the mean squared error can be considered.</p><h3>Gradient Descent with the Mean Squared Error (MSE)</h3><h4>What is the MSE?</h4><p>A popular cost function for machine learning is the Mean Squared Error (MSE).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/0*SydHa5UHkx4Zy2hU.png" /></figure><p>This function finds the difference between the model’s prediction (<strong><em>Ŷ</em></strong>) and the expected output (<strong><em>Y</em></strong>). It then squares the difference to ensure the output is always positive. This means <strong><em>Ŷ </em></strong>or <strong><em>Y </em></strong>can come first when calculating the difference. This is repeated across a set of points with a size of <strong><em>n</em></strong>. By summing the squared difference of all these points and dividing by <strong><em>n</em></strong>, the output is the mean squared difference (error). It is an easy way of assessing the model’s performance on all the points simultaneously. A simple example can be seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/438/1*kkowOm3rlGGoWVTY-Cs4XQ.png" /><figcaption>Table by Author</figcaption></figure><p>In this formula, <strong><em>Ŷ </em></strong>represents a model’s prediction.</p>
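<p>To make the calculation concrete, the MSE can be computed in one line of PyTorch. This is a minimal sketch with made-up predictions and targets, not the values from the table above:</p><pre>import torch<br><br># illustrative predictions and expected outputs<br>Yhat = torch.tensor([2., 4., 6.])<br>Y = torch.tensor([3., 4., 5.])<br><br># mean((Yhat - Y)**2) = mean([1., 0., 1.])<br>print(torch.mean((Yhat - Y)**2))  # tensor(0.6667)</pre><p>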
In regression, the model’s equation may contain one or more weights depending on the requirements of the training data. The table below reflects these situations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5l_3sENNyrNJNcyHBf2EbA.png" /><figcaption>Table by Author</figcaption></figure><p>Now, to perform gradient descent with any of these equations, their gradients must be calculated. The gradient contains the partial derivatives for a function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/231/1*Wt9qfS_MRVCbPoum8A8E-w.png" /></figure><p>Each weight’s partial derivative has to be calculated. A partial derivative is calculated in the same manner as a normal derivative, but every variable that is not being considered must be treated as a constant. The gradients for the MSE variations listed above can be examined below.</p><h4>One Weight</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/848/1*bPwMBers8K7OhVzHFM3chg.png" /></figure><p>When taking the gradient of the MSE with only one weight, the derivative can be calculated with respect to <strong><em>w</em></strong>. <strong><em>X, Y, </em></strong>and <strong><em>n </em></strong>must be treated as constants. With this in mind, the fraction and sum can be moved outside of the derivative:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/278/1*6Cl1I-Mg-vetfOGA0heGPA.png" /></figure><p>From here, the chain rule can be used to calculate the derivative with respect to <strong><em>w:</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*TBLgbxb_JLxTTR7yE4d_-A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/272/1*MA6ZHLFvdKdFJ8YrJl_JxA.png" /></figure><p>Now, this can be simplified:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/248/1*xu16RCsw0d3QrLzDTXzgww.png" /></figure><h4>Two Weights</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OeuOeS1MeIU26uPUkwRS3g.png" /></figure><p>When taking the gradient of the MSE with two weights, the partial derivatives must be taken with respect to both parameters, <strong><em>w₀</em></strong> and <strong><em>w₁</em></strong>. When taking the partial derivative of <strong><em>w₀</em></strong>, the following variables are treated as constants: <strong><em>X, Y, n, </em></strong>and <strong><em>w₁. </em></strong>When taking the partial derivative of <strong><em>w₁</em></strong>, the following variables are treated as constants: <strong><em>X, Y, n, </em></strong>and <strong><em>w₀. </em></strong>The same steps as the previous example can be repeated. 
First, the fraction and sum can be moved outside the derivative.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/368/1*1H4JaoH8Hh40IXcbj60kzA.png" /></figure><p>From here, the chain rule can be used to calculate the derivative with respect to each weight<strong><em>:</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/575/1*gK7gnb-DD4oa4HhUeOaBCQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/353/1*S-R-5lpGi9TvcNXOSII2pw.png" /></figure><p>Finally, they can be simplified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/328/1*Eo1PaDJmUwYt5diOcDrOgA.png" /></figure><p>Notice that the only difference between the equations is <strong><em>X</em>.</strong></p><h4>Three Weights</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v2EwAgJZPaGxS7qinxIDfg.png" /></figure><p>When taking the gradient of the MSE with three weights, the partial derivatives must be taken with respect to each parameter. When taking the partial derivative of one weight, <strong><em>X, Y, n, </em></strong>and the other two weights will be treated as constants. The same steps as the previous example can be repeated. First, the fraction and sum can be moved outside the derivative.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/490/1*yNsolPd5Hu_mxNYR-J4opQ.png" /></figure><p>From here, the chain rule can be used to calculate the derivative with respect to each weight<strong><em>:</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/820/1*Ji-_RICw00yvDFir5FT96w.png" /></figure><p>Finally, they can be simplified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/462/1*gDi_4fBSF5WPnoaw-yHOiA.png" /></figure><p>As mentioned previously, the only difference between each partial derivative is the input feature, <strong><em>X</em></strong>. This can be generalized for <strong><em>k</em></strong> weights in the next example.</p><h4>More Than Three Weights</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FpxvKoDt0uXP6KFmsFNlHw.png" /></figure><p>When taking the gradient of the MSE with <strong><em>k </em></strong>weights, the partial derivatives must be taken with respect to each parameter. When taking the partial derivative of one weight, <strong><em>X, Y, n, </em></strong>and the other <strong><em>k-1 </em></strong>weights will be treated as constants. As seen in the previous example, only the input feature of each partial derivative changes when there are more than two weights.</p><h3>Matrix Derivation</h3><p>The formulas above show how to use gradient descent without explicitly taking advantage of vectors and matrices. However, most of machine learning is best understood by using their operations. For a quick overview, see <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-tensors-c4a8321efffc">A Simple Introduction to Tensors</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/1*TcGBkemiYGef5Cq41r3ZoA.png" /></figure><p>The rest of this article will be dedicated to using matrix calculus to derive the derivative of the MSE. To start, <strong><em>Ŷ </em></strong>and <strong><em>Y </em></strong>should be understood as matrices with sizes of (<strong><em>n samples</em></strong>, 1). 
Both are matrices with 1 column and <strong><em>n</em></strong> rows, or they can be viewed as column vectors, which would change their notation to lowercase:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/285/1*gIiJbn7KjRCCPAt3ZSAbTA.png" /></figure><p>The MSE is element-wise vector subtraction between <strong><em>ŷ </em></strong>and <strong><em>y,</em></strong> followed by the dot product of the difference with itself. Remember, the dot product can only occur if sizes are compatible. Since the goal is to have a scalar output, the first vector must be transposed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/313/1*JHlkogXRY0WM649FjkA_aw.png" /></figure><p>From here, <strong><em>ŷ </em></strong>can be replaced with <strong><em>Xw </em></strong>for regression. <strong><em>X</em></strong> is a matrix with a size of <strong><em>(n samples, num features)</em></strong>, and <strong><em>w </em></strong>is a column vector with a size of <strong><em>(num features, 1)</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/375/1*u6TnroyqJl63NymP7VbUAQ.png" /></figure><p>The next step is to simplify the equation before taking the derivative. Notice that <strong><em>w</em></strong> and <strong><em>X </em></strong>switch positions to ensure their multiplication is still valid: <strong><em>(1, num features) x (num features, n samples) = (1, n samples)</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/411/1*uFErPkr_KhJaWbuKWXv6Hw.png" /></figure><p>These error calculations can then be multiplied together.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/595/1*XSAUAZ-JuJVnCiqmnz6-tw.png" /></figure><p>Notice that the third term can be rewritten by transposing it, following the third property on this <a href="https://en.wikipedia.org/wiki/Transpose#:~:text=respects%20addition.-,(,.,-Note%20that%20the">page</a>. Then, it can be added to the second term.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/635/1*wW0MeM_wLZd85Q-KEEwyNA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/578/1*q4OiXeBko3dx1VDtePqAIg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/481/1*3kIypu9z-b2biH9UQhYgLQ.png" /></figure><p>Now, the partial derivative of the MSE can be taken with respect to the weight.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/539/1*iNHM1K0r5U6NQ0DsqSEDRQ.png" /></figure><p>This is equivalent to taking the derivative of each term:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/678/1*e6SPHcGoHATprzdch80sKg.png" /></figure><p>Each term that is not <strong><em>w </em></strong>can be treated as a constant. The derivative of each component can be computed using these rules:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/587/1*1rQjIGA-ODmcCh-izT4DwQ.png" /></figure><p>The first term in the equation follows the fourth rule and becomes zero. 
The second term follows the first rule, and the third term follows the third rule.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/405/1*-1oX58unY4hRF5tavTrgUA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/378/1*Wp7HLZmtyopiJDiJABdZEw.png" /></figure><p>This equation can be used in gradient descent to simultaneously calculate all the partial derivatives:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/434/1*3hiPZu6oulzQ2YPOQmajIA.png" /></figure><h3>Conclusion</h3><p>The gradients for the variations of the MSE cost function can be easily used in gradient descent by plugging them into the formula. An example of gradient descent can be found in <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>.</p><p>Please don’t forget to like and follow! :)</p><h3>References</h3><ol><li><a href="https://www.symbolab.com/solver/partial-derivative-calculator/">Symbolab Partial Derivative Calculator</a></li><li><a href="https://math.stackexchange.com/questions/4177039/deriving-the-normal-equation-for-linear-regression">Mathematics Stack Exchange on Deriving the Normal Equation</a></li><li><a href="https://en.wikipedia.org/wiki/Matrix_calculus#Vector-by-vector_identities">Matrix Calculus</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1f32a08b0deb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: Simple Linear Regression]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/5e6bd76b0bf8</guid>
            <category><![CDATA[gradient-descent]]></category>
            <category><![CDATA[regression]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Wed, 17 May 2023 19:12:58 GMT</pubDate>
            <atom:updated>2023-05-22T03:17:35.087Z</atom:updated>
<content:encoded><![CDATA[<p>Simple linear regression offers an elegant introduction to machine learning. It can be used to identify the relationship between an independent variable and a dependent variable. Using gradient descent, a basic model can be trained to fit a set of points for future prediction.</p><h3>Background</h3><p>This is the first article of a series covering regression, gradient descent, classification, and other fundamental aspects of machine learning. This article focuses on simple linear regression, which identifies the line of best fit for a set of points, allowing for future predictions to be made.</p><h3>The Line of Best Fit</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TAIhTDqmfODfEKntQ4fvwg.png" /><figcaption>Image by Author</figcaption></figure><p>The line of best fit is the equation that most accurately represents a set of points. For a given input, the equation’s output should be as close to the expected output as possible.</p><p>In the image above, it is clear that the middle line fits the blue points better than the left or right lines. However, is it the line of <em>best </em>fit? Could there be a fourth line that fits these points better? The line could certainly be shifted up or down to ensure an equal number of points fall above and below it. Even so, there could be a dozen lines that fit this exact criterion. What makes any one of them the best?</p><p>Thankfully, there is a way to mathematically identify a line of best fit for a set of points using regression.</p><h3>Regression</h3><p>Regression helps identify the relationship between two or more variables, and it takes many forms, including simple linear, multiple linear, polynomial, and more. To demonstrate the usefulness of this approach, simple linear regression will be used.</p><p>Simple linear regression attempts to find the line of best fit for a set of points. More specifically, it identifies the relationship between an independent variable and a dependent variable. The line of best fit has the form of <strong><em>y = mx + b</em></strong>.</p><ul><li><strong><em>x</em></strong> is the input or independent variable</li><li><strong><em>m</em></strong> is the slope, or steepness, of the line</li><li><strong><em>b</em></strong> is the <strong><em>y</em></strong>-intercept</li><li><strong><em>y</em></strong> is the output or dependent variable</li></ul><p>The goal of simple linear regression is to identify the values of <strong><em>m</em></strong> and <strong><em>b </em></strong>that will generate the most accurate <strong><em>y</em></strong> value when given an <strong><em>x</em></strong>. This equation, also known as the model, can also be expressed in machine learning terms. In the equation, <strong><em>w </em></strong>represents “weight”: <strong><em>Ŷ = Xw₁+ w₀</em></strong></p><ul><li><strong><em>X</em></strong> is the input or feature</li><li><strong><em>w₁</em></strong> is the slope</li><li><strong><em>w₀</em></strong> is the bias, or <strong><em>y</em></strong>-intercept</li><li><strong><em>Ŷ </em></strong>is the prediction, which is pronounced as “y-hat”</li></ul><p>While this equation is useful, it needs to be assessed for its accuracy; if its predictions are poor, the model has little value. To do this, a cost, or loss, function is used.</p><h4>The Cost or Loss Function</h4><p>Regression requires some method of tracking the accuracy of the model’s predictions. Given the inputs, are the outputs of the equation as close as possible to the expected output? 
A cost function, also known as a loss function, is used to identify the accuracy of an equation.</p><p>For instance, if the expected output is 5, and the equation outputs 18, the loss function should represent this difference. A simple loss function could output 13, which is the difference between these values. This indicates the model’s performance is poor. On the other hand, if the expected output is 5 and the model predicts 5, the loss function should output 0, which indicates the model’s performance is excellent.</p><p>A commonly used loss function that does this is the mean squared error (MSE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/0*SydHa5UHkx4Zy2hU.png" /></figure><p>This function finds the difference between the model’s prediction (<strong><em>Ŷ</em></strong>) and the expected output (<strong><em>Y</em></strong>). It then squares the difference to ensure the output is always positive. It does this across a set of points with a size of <strong><em>n</em></strong>. By summing the squared difference of all these points and dividing by <strong><em>n</em></strong>, the output is the mean squared difference (error). It is an easy way of assessing the model’s performance on all the points simultaneously. A simple example can be seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/438/1*kkowOm3rlGGoWVTY-Cs4XQ.png" /><figcaption>Table by Author</figcaption></figure><p>While countless other loss functions are just as applicable to this situation, the MSE is one of the most popular loss functions for regression in machine learning due to its simplicity, especially when it comes to gradient descent, which will be explained later.</p><p>To best understand where gradient descent comes in, an example can be evaluated.</p><h3>Predicting the Line of Best Fit</h3><p>To show simple linear regression in action, data to train the model is required. This comes in the form of an <strong><em>X</em></strong> array and <strong><em>Y</em></strong> array. The data can be manually generated for this example from a “blueprint” function. Randomness can be added to the blueprint, forcing the model to learn the underlying function. PyTorch, a standard machine learning library, is used to implement regression.</p><h4>Generating the Data</h4><p>To start, the code below generates an array of input values using a random integer generator. <strong><em>X</em></strong> currently has a shape of <strong><em>(n samples, num features)</em></strong>. Remember, a feature is an independent variable, and simple linear regression has one. For this example, <strong><em>n</em></strong> will be 20.</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># (n samples, features)<br>X = torch.randint(low=0, high=11, size=(20, 1))</pre><pre>tensor([[ 9],<br>        [10],<br>        [ 0],<br>        [ 3],<br>        [ 8],<br>        [ 8],<br>        [ 0],<br>        [ 4],<br>        [ 1],<br>        [ 0],<br>        [ 7],<br>        [ 9],<br>        [ 3],<br>        [ 7],<br>        [ 9],<br>        [ 7],<br>        [ 3],<br>        [10],<br>        [10],<br>        [ 4]])</pre><p>These values can then be passed through <strong><em>Y = 1.5X + 2</em></strong><em> </em>to generate output values, and some randomness can be added to these values using the normal distribution with a mean of 0 and standard deviation of 1. 
<strong><em>Y</em></strong> will have a shape of (<strong><em>n samples</em></strong>, <strong><em>1</em></strong>).</p><p>The code below shows the random values, which have the same shape.</p><pre>torch.manual_seed(5)<br><br># normal distribution with a mean of 0 and std of 1<br>normal = torch.distributions.Normal(loc=0, scale=1)<br><br>normal.sample(X.shape)</pre><pre>tensor([[ 1.84],<br>        [ 0.52],<br>        [-1.71],<br>        [-1.70],<br>        [-0.13],<br>        [-0.60],<br>        [ 0.14],<br>        [-0.15],<br>        [ 2.61],<br>        [-0.43],<br>        [ 0.35],<br>        [-0.06],<br>        [ 1.48],<br>        [ 0.49],<br>        [ 0.25],<br>        [ 1.75],<br>        [ 0.74],<br>        [ 0.03],<br>        [-1.17],<br>        [-1.51]])</pre><p>Finally, <strong><em>Y</em></strong> can be calculated with the code below.</p><pre>Y = (1.5*X + 2) + normal.sample(X.shape)<br><br>Y</pre><pre>tensor([[15.00],<br>        [15.00],<br>        [-0.36],<br>        [ 6.75],<br>        [13.59],<br>        [15.16],<br>        [ 2.33],<br>        [ 8.72],<br>        [ 2.67],<br>        [ 1.81],<br>        [13.74],<br>        [14.06],<br>        [ 7.15],<br>        [12.81],<br>        [15.91],<br>        [13.15],<br>        [ 6.76],<br>        [18.05],<br>        [18.71],<br>        [ 6.80]])</pre><p>They can also be plotted together with matplotlib for a better understanding of their relationship:</p><pre>import matplotlib.pyplot as plt<br><br>plt.scatter(X,Y)<br>plt.xlim(-1,11)<br>plt.ylim(0,20)<br>plt.xlabel(&quot;$X$&quot;)<br>plt.ylabel(&quot;$Y$&quot;)<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*58SIJWrpFd_NCwDEtNddUQ.png" /><figcaption>Image by Author</figcaption></figure><p>While it may seem counterintuitive to generate data for the example, it is a great way to demonstrate how regression works. The model, which can be seen below, will only be provided <strong><em>X</em></strong> and <strong><em>Y</em></strong>, and it will need to identify <strong><em>w₁ </em></strong>as 1.5<strong><em> </em></strong>and <strong><em>w₀ </em></strong>as 2.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*3PGGaaZ7BfxEMF_bkRojPg.png" /><figcaption>Image by Author</figcaption></figure><p>The weights can be stored in an array, <strong><em>w. </em></strong>This array will have two weights in it: one for the bias and one for the feature. It will have a shape of (<strong><em>num features + 1 bias, 1</em></strong>). For this example, the array will have a shape of (2, 1).</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(2, 1))<br>w </pre><pre>tensor([[0.83],<br>        [0.13]])</pre><p>With these values generated, the model can be created.</p><h4>Creating the Model</h4><p>The first step for the model is to define a function for the line of best fit and another for the MSE.</p><p>As mentioned before, the model has an equation of <strong><em>Ŷ = Xw₁+ w₀. </em></strong>As of now, the bias is added to every sample. This is equivalent to broadcasting the bias to be the same size as <strong><em>X</em></strong> and adding the arrays together. 
The output can be seen below.</p><pre>w[1]*X + w[0]</pre><pre>tensor([[1.97],<br>        [2.09],<br>        [0.83],<br>        [1.21],<br>        [1.84],<br>        [1.84],<br>        [0.83],<br>        [1.33],<br>        [0.96],<br>        [0.83],<br>        [1.71],<br>        [1.97],<br>        [1.21],<br>        [1.71],<br>        [1.97],<br>        [1.71],<br>        [1.21],<br>        [2.09],<br>        [2.09],<br>        [1.33]])</pre><p>The function below computes the output.</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features + 1 bias, 1)<br>      X: array of inputs  | (n samples, num features + 1 bias)<br><br>    Output:<br>      returns the predictions | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return w[1]*X + w[0]</pre><p>The function for the MSE is straightforward:</p><pre># mean squared error (MSE)<br>def MSE(Yhat, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      Yhat: array of predictions | (n samples, 1)<br>      Y: array of expected outputs | (n samples, 1)<br>    Output:<br>      returns the loss of the model, which is a scalar<br>  &quot;&quot;&quot;<br><br>  return torch.mean((Yhat-Y)**2) # mean((error)^2)</pre><h4>Previewing the Line of Best Fit</h4><p>With the functions created, the line of best fit can be previewed with a plot, and a standard function can be created for future use. It will display the line of best fit in red, the predictions for each input in orange, and the expected outputs in blue.</p><pre>def plot_lbf():<br>  &quot;&quot;&quot;<br>    Output:<br>      plots the line of best fit in comparison to the training data<br>  &quot;&quot;&quot;<br><br>  # plot the points<br>  plt.scatter(X,Y)<br><br>  # predictions for the line of best fit<br>  Yhat = model(w, X)<br>  plt.scatter(X, Yhat, zorder=3) # plot the predictions<br><br>  # plot the line of best fit<br>  X_plot = torch.arange(-1,11+0.1,.1) # generate values with a step of .1<br>  plt.plot(X_plot, model(w, X_plot), color=&quot;red&quot;, zorder=0)<br><br>  plt.xlim(-1, 11)<br>  plt.xlabel(&quot;$X$&quot;)<br>  plt.ylabel(&quot;$Y$&quot;)<br>  plt.title(f&quot;MSE: {MSE(Yhat, Y):.2f}&quot;)<br>  plt.show()<br><br>plot_lbf()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*ckiegZBJ1zPpYg3boBfCzg.png" /><figcaption>Image by Author</figcaption></figure><p>The output with the current weights is not ideal since the MSE is 105.29. To get a better MSE, different weights need to be chosen. They could be randomized again, but the chance of acquiring the perfect line would be minimal. This is where the gradient descent algorithm can be used to alter the value of the weights in a defined manner.</p><h4>Gradient Descent</h4><p>An explanation of the gradient descent algorithm can be found here: <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb">A Simple Introduction to Gradient Descent</a>. The article should be read before moving on to avoid confusion.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/262/1*roraxsukJQt1l5z9Ka445A.png" /></figure><p>To summarize the article, gradient descent uses the gradient of a cost function to reveal the direction and influence of each weight on it. 
By scaling the gradient with a learning rate and subtracting it from each weight’s current value, the cost function minimizes, forcing the model’s prediction to be as close to the expected output as possible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/328/1*Eo1PaDJmUwYt5diOcDrOgA.png" /></figure><p>For simple linear regression, <strong><em>f</em></strong> will be the MSE. The Python implementation can be seen below. Remember, each weight has its own partial derivative to be used in the formula, which can be seen above.</p><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features + 1 bias, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features + 1 bias)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights<br>  &quot;&quot;&quot; <br><br>  n = len(X)<br><br>  # update the bias<br>  # note: w[0] is updated in place first, so the weight update below<br>  # uses predictions that already include the refreshed bias<br>  w[0] = w[0] - lr*2/n * torch.sum(model(w,X) - Y)<br>  <br>  # update the weight<br>  w[1] = w[1] - lr*2/n * torch.sum(X*(model(w,X) - Y))<br><br>  return w</pre><p>Now, the function can be used to update the weights. The learning rate is selected empirically, but it is normally a small value. The new line of best fit can also be plotted.</p><pre>lr = 0.01<br><br>print(&quot;weights before:&quot;, w.flatten())<br>print(&quot;MSE before:&quot;, MSE(model(w,X), Y))<br><br># update the weights<br>w = gradient_descent(w)<br><br>print(&quot;weights after:&quot;, w.flatten())<br>print(&quot;MSE after:&quot;, MSE(model(w,X), Y))<br><br>plot_lbf()</pre><pre>weights before: tensor([0.83, 0.13])<br>MSE before: tensor(105.29)<br>weights after: tensor([1.01, 1.46])<br>MSE after: tensor(2.99)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*K3bl1jxgaaYpb6aQgn6Tbg.png" /><figcaption>Image by Author</figcaption></figure><p>The MSE decreased by more than 100 on the first try, but the line still doesn’t fit the points perfectly. Remember, the goal is to get <strong><em>w₀ </em></strong>as 2 and <strong><em>w₁ </em></strong>as 1.5. To speed up the learning process, gradient descent can be performed 500 more times, and the new result can be examined.</p><pre># perform 500 more updates<br>for i in range(0, 500):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 100 epochs<br>  if (i+1) % 100 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w.flatten())<br>    print(&quot;MSE:&quot;, MSE(model(w,X), Y))<br>    print(&quot;=&quot;*10)<br><br>plot_lbf()</pre><pre>epoch: 100<br>weights: tensor([1.44, 1.59])<br>MSE: tensor(1.31)<br>==========<br>epoch: 200<br>weights: tensor([1.67, 1.56])<br>MSE: tensor(1.25)<br>==========<br>epoch: 300<br>weights: tensor([1.80, 1.54])<br>MSE: tensor(1.24)<br>==========<br>epoch: 400<br>weights: tensor([1.87, 1.53])<br>MSE: tensor(1.23)<br>==========<br>epoch: 500<br>weights: tensor([1.91, 1.52])<br>MSE: tensor(1.23)<br>==========</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*2WGsuZAf531EIR8ooXcoMw.png" /><figcaption>Image by Author</figcaption></figure><p>After 500 epochs, the MSE is 1.23. <strong><em>w₀ </em></strong>is 1.91, and <strong><em>w₁ </em></strong>is 1.52. This means the model successfully identified the line of best fit.</p>
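<p>With the weights learned, the model can now be used for its original purpose: prediction. This is a minimal sketch; the input value of 5 is arbitrary and chosen only for illustration:</p><pre># predict the output for a new input using the learned weights<br>X_new = torch.tensor([[5.0]])<br><br># computes w[1]*X_new + w[0], about 1.52*5 + 1.91 here<br>Yhat_new = model(w, X_new)<br>print(Yhat_new)  # about tensor([[9.51]]) with the learned weights</pre><p>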
Additional updates could be performed, but the randomness added to the output values will likely prevent the model from achieving a perfect prediction.</p><p>To build additional intuition about how gradient descent works, the impact of <strong><em>w₀ </em></strong>and <strong><em>w₁ </em></strong>can be examined by plotting them against their output, the MSE. The function for plotting gradient descent can be found in the appendix, and the output can be seen below:</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(2, 1))<br><br>w0s, w1s, losses = list(),list(),list()<br><br># update the weights<br>for i in range(0, 500):<br>  if i == 0 or (i+1) % 10 == 0:<br>    w0s.append(float(w[0]))<br>    w1s.append(float(w[1]))<br>    losses.append(MSE(model(w,X), Y))<br><br>  # update the weights<br>  w = gradient_descent(w)<br><br>plot_GD([-2, 5.2], [-2, 5.2])</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/413/1*FxZQU5jk0zZ1cDtJawrG0w.png" /><figcaption>Image by Author</figcaption></figure><p>Each orange point represents an update to the weights, and the red line represents the change from one iteration to the next. The largest update is from the first to the second iteration. The other orange points are close together since their derivatives are small, making the updates even smaller. The plot shows how the weights update until the optimal MSE is acquired.</p><p>While the approach is useful, it could be simplified in a few ways. First, it does not take advantage of matrix multiplication, which would simplify the equation for the model. Second, gradient descent is not a closed-form solution to regression since the number of epochs and the learning rate vary for every problem, and the solution is an approximation. The last section of this article will address the first problem, and the next article will address the second.</p><h3>An Alternative Approach</h3><p>While this approach is useful, it is not as simple as it could be because it does not take advantage of matrices. As of now, the entire equation, <strong><em>Ŷ = Xw₁+ w₀</em></strong>, is used for the model’s function, and the partial derivative of each weight has to be calculated individually for gradient descent. By using matrix operations and calculus, both functions simplify.</p><p>To start, <strong><em>X </em></strong>has a shape of <strong><em>(n samples, num features), </em></strong>and <strong><em>w </em></strong>has a shape of (<strong><em>num features + 1 bias, 1</em></strong>). By adding an additional column to <strong><em>X</em></strong>, matrix multiplication can be used because it will have a new shape of <strong><em>(n samples, num features + 1 bias)</em></strong>. This can be a column of ones that will be multiplied against the bias, which will scale the vector. This is equivalent to broadcasting the bias, which is how the predictions were previously calculated.</p><pre>X = torch.hstack((torch.ones(X.shape),X))<br>X</pre><pre>tensor([[ 1.,  9.],<br>        [ 1., 10.],<br>        [ 1.,  0.],<br>        [ 1.,  3.],<br>        [ 1.,  8.],<br>        [ 1.,  8.],<br>        [ 1.,  0.],<br>        [ 1.,  4.],<br>        [ 1.,  1.],<br>        [ 1.,  0.],<br>        [ 1.,  7.],<br>        [ 1.,  9.],<br>        [ 1.,  3.],<br>        [ 1.,  7.],<br>        [ 1.,  9.],<br>        [ 1.,  7.],<br>        [ 1.,  3.],<br>        [ 1., 10.],<br>        [ 1., 10.],<br>        [ 1.,  4.]])</pre><p>This changes the equation to <strong><em>Ŷ = X₁w₁+ X₀w₀</em></strong>. 
Moving forward, the bias can be considered a feature, so <strong><em>num features</em></strong> can represent both the independent variable and bias, and the <strong><em>+ 1 bias</em></strong> can be omitted. Therefore, <strong><em>X</em></strong> has a size of (<strong><em>n samples</em></strong>, <strong><em>num features</em></strong>), and <strong><em>w</em></strong> has a size of <strong><em>(num features</em></strong>, <strong><em>1</em></strong>). When they are multiplied by each other, the output is the prediction vector, which has a size of <strong><em>(n samples, 1</em></strong>). The output of the matrix multiplication is the same as w[1]*X + w[0].</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(2, 1))<br><br>torch.matmul(X, w)</pre><pre>tensor([[1.97],<br>        [2.09],<br>        [0.83],<br>        [1.21],<br>        [1.84],<br>        [1.84],<br>        [0.83],<br>        [1.33],<br>        [0.96],<br>        [0.83],<br>        [1.71],<br>        [1.97],<br>        [1.21],<br>        [1.71],<br>        [1.97],<br>        [1.71],<br>        [1.21],<br>        [2.09],<br>        [2.09],<br>        [1.33]])</pre><p>With this in mind, the model’s function can be updated:</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br>      X: array of inputs  | (n samples, num features)<br><br>    Output:<br>      returns the output of X@w | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return torch.matmul(X, w)</pre><p>Since each weight is no longer thought of as an individual component, the gradient descent algorithm can also be updated. Based on <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb">A Simple Introduction to Gradient Descent</a>, the gradient descent algorithm for matrices follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/434/1*3hiPZu6oulzQ2YPOQmajIA.png" /></figure><p>This can easily be implemented with PyTorch. 
Since <strong><em>w</em></strong> is stored as a column vector, the derivative’s output needs to be reshaped for subtraction.</p><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights | (num features, 1)<br>  &quot;&quot;&quot; <br><br>  n = X.shape[0]<br><br>  return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)</pre><p>Using 500 epochs, the same output can be generated as before:</p><pre>lr = 0.01<br><br># train for 500 epochs<br>for i in range(0, 500):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 100 epochs<br>  if (i+1) % 100 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w.flatten())<br>    print(&quot;MSE:&quot;, MSE(model(w,X), Y))<br>    print(&quot;=&quot;*10)</pre><pre>epoch: 100<br>weights: tensor([1.43, 1.59])<br>MSE: tensor(1.31)<br>==========<br>epoch: 200<br>weights: tensor([1.66, 1.56])<br>MSE: tensor(1.25)<br>==========<br>epoch: 300<br>weights: tensor([1.79, 1.54])<br>MSE: tensor(1.24)<br>==========<br>epoch: 400<br>weights: tensor([1.87, 1.53])<br>MSE: tensor(1.23)<br>==========<br>epoch: 500<br>weights: tensor([1.91, 1.53])<br>MSE: tensor(1.23)<br>==========</pre><p>Since these functions do not require additional variables to be manually added for each feature, they can be used for multiple linear regression and polynomial regression.</p><h3>Conclusion</h3><p>The next article will discuss the closed-form solution to regression that does not approximate the weights. Instead, the minimized values will be computed directly, as shown in <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf">An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python</a>.</p><p>Please don’t forget to like and follow! 
:)</p><h3>References</h3><ol><li><a href="https://community.plotly.com/t/3d-scatter-plot-with-surface-plot/27556">Plotly 3D Plots</a></li></ol><h3>Appendix</h3><h4>Plotting Gradient Descent</h4><p>This function utilizes Plotly to display gradient descent in three dimensions.</p><pre>import plotly.graph_objects as go<br>import plotly<br>import plotly.express as px<br><br>def plot_GD(w0_range, w1_range):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w0_range: weight range [w0_low, w0_high]<br>      w1_range: weight range [w1_low, w1_high]<br><br>    Global Variables:<br>      X: array of inputs  | (n samples, num features + 1 bias)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br>      <br>    Output:<br>      prints gradient descent<br>  &quot;&quot;&quot; <br><br>  # generate all the possible weight combinations (w0, w1)<br>  w0_plot, w1_plot = torch.meshgrid(torch.arange(w0_range[0],<br>                                                 w0_range[1],<br>                                                 0.1),<br>                                    torch.arange(w1_range[0],<br>                                                 w1_range[1],<br>                                                 0.1))<br>                                 <br>  # rearrange into coordinate pairs<br>  w_plot = torch.hstack((w0_plot.reshape(-1,1), w1_plot.reshape(-1,1)))<br><br>  # calculate the MSE for each pair<br>  mse_plot = [MSE(model(w, X), Y) for w in w_plot]<br><br>  # plot the data<br>  fig = go.Figure(data=[go.Mesh3d(x=w_plot[:,0], <br>                                  y=w_plot[:,1],<br>                                  z=mse_plot,)])<br><br>  # plot gradient descent on loss function<br>  fig.add_scatter3d(x=w0s, <br>                    y=w1s, <br>                    z=losses, <br>                    marker=dict(size=3,color=&quot;orange&quot;),<br>                    line=dict(color=&quot;red&quot;,width=5))<br>  <br>  # prepare ranges for plotting<br>  xaxis_range = [w0 + 0.01 if w0 &lt; 0 else w0 - 0.01 for w0 in w0_range] <br><br>  yaxis_range = [w1 + 0.01 if w1 &lt; 0 else w1 - 0.01 for w1 in w1_range] <br><br>  fig.update_layout(scene = dict(xaxis_title=&#39;w&lt;sub&gt;0&lt;/sub&gt;&#39;, <br>                                 yaxis_title=&#39;w&lt;sub&gt;1&lt;/sub&gt;&#39;, <br>                                 zaxis_title=&#39;MSE&#39;,<br>                                 xaxis_range=xaxis_range,<br>                                 yaxis_range=yaxis_range))<br>  fig.show()</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5e6bd76b0bf8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Simple Introduction to the Dot Product]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-the-dot-product-f2b09e48c2c8?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/f2b09e48c2c8</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[linear-algebra]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Thu, 11 May 2023 01:26:53 GMT</pubDate>
            <atom:updated>2023-05-14T21:42:05.160Z</atom:updated>
<content:encoded><![CDATA[<p>The dot product is a common operation performed on vectors that returns a scalar as a result. This scalar provides information about the relationship between the vectors.</p><h4>Background</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/269/0*kENW2L8aw-S2KG-4.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/255/0*6A63TS4JYtC8ROfL.png" /></figure><p>For two vectors <strong><em>a</em></strong> and <strong><em>b</em></strong> of length <strong><em>n, </em></strong>the dot product can be used to show the relationship between them. For instance, are they pointing in the same direction? Opposite directions? Are they perpendicular?</p><p>The result is a scalar, so the dot product is sometimes known as the scalar product.</p><p>To build intuition for how this works, it would be best to start with the geometric definition.</p><h4>Geometric Definition</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/1*o9x0yY3346TYGejmHNfr1A.png" /><figcaption>Image by Math is Fun</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/232/0*oC6KewDWiP4On9v0.png" /></figure><p>This formula consists of three components:</p><ul><li><strong><em>||a||</em></strong>: the magnitude of <strong><em>a</em></strong></li><li><strong><em>||b||</em></strong>: the magnitude of <strong><em>b</em></strong></li><li>θ: angle between <strong><em>a </em></strong>and <strong><em>b</em></strong></li></ul><p><strong>Magnitude</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*FBCRPMU8Q16KQIFKQT7c3w.png" /><figcaption>Image by Wumbo</figcaption></figure><p>The magnitude can be calculated by taking the square root of the sum of the squared elements. For a 2-dimensional vector, the magnitude would be <br><strong><em>√(x² + y²)</em></strong>. For three dimensions, the magnitude would be <br><strong><em>√(x² + y² + z²)</em></strong>. For <strong><em>n</em></strong> dimensions, the magnitude would be:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/316/0*s17HMvDM7vG2hkYz.png" /></figure><p><strong>Cosine</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/192/1*k07NSpOz_es00yyG9w-Feg.png" /><figcaption>Image by Math is Fun</figcaption></figure><p><strong><em>cos(θ)</em></strong> is used to “project” <strong><em>a</em></strong> onto <strong><em>b</em></strong>. In the image above, <strong><em>a</em></strong> and <strong><em>b </em></strong>point in different directions, so <strong><em>||a|| cos(θ)</em></strong> projects the portion of <strong><em>a </em></strong>that is adjacent and alongside <strong><em>b</em></strong>.</p><p>This could also be viewed as taking the portion of <strong><em>b</em></strong> that is adjacent and alongside <strong><em>a</em></strong>, which can be seen in the image below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*VGWKgnw8iDE4rsaZx6DljQ.png" /><figcaption>Image by KiKaBeN</figcaption></figure><p><strong>Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/0*JFC5CaoQ2qY9Ad9g.gif" /><figcaption>Image by Author</figcaption></figure><p>The geometric definition is useful when the angle and magnitudes of the vectors are known, like in the example above. 
<p><strong><em>What does the output mean?</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/502/1*_0dipiEY1CF0ziI39BVPOg.png" /><figcaption>Image by Math is Fun</figcaption></figure><p>When two vectors point in the same direction, the angle between them is <strong><em>θ = 0° </em></strong>or <strong><em>0 </em></strong>radians, so the output of <strong><em>cos(θ)</em></strong> is 1. This is when the dot product is at a maximum. On the other hand, when the vectors point in opposite directions, the angle between them is <strong><em>θ = 180°</em></strong> or <strong><em>π </em></strong>radians, and the output of <strong><em>cos(θ)</em></strong> is -1. This is when the dot product is at a minimum. When <strong><em>θ = 90°</em></strong> or <strong><em>π/2 </em></strong>radians, the output of <strong><em>cos(θ)</em></strong> is 0. This occurs when the vectors are perpendicular, or orthogonal, to each other.</p><p>This indicates that the dot product can help identify the relationship between vectors, which is vital in machine learning.</p><p>While the geometric definition is useful, it is more common to have the components of a vector and no known angle. In these situations, it is more convenient to use the equivalent coordinate formula.</p><h4><strong>Coordinate Definition</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/562/1*tofZf5fz9bN4kGH9dK9MFw.png" /></figure><p>The coordinate definition does not require an angle to calculate the dot product. Instead, the corresponding components of each vector are multiplied together and summed. The result is equivalent to that of the geometric definition; the most succinct explanation of their equivalence is on <a href="https://en.wikipedia.org/wiki/Dot_product#:~:text=Equivalence%20of%20the%20definitions%5Bedit%5D">Wikipedia</a>.</p><p><strong>Two-Dimensional Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/0*gqMYbNZi9J-6Qur1.gif" /><figcaption>Image by Math is Fun</figcaption></figure><p>To show that the two definitions are equal, consider the same example as before, now calculated with the coordinate definition.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/683/0*6pUmj_JQ-AM5RpGn.png" /></figure><p>The coordinate definition gives 66, while the geometric solution was 65.9799871849; the difference is negligible and comes from rounding the angle.</p><p><strong>Three-Dimensional Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/317/0*7mYxHMKOhta9hxv1.gif" /><figcaption>Image by Math is Fun</figcaption></figure><p>These same formulas can be used in three or more dimensions. In the image above, the components of each vector are known, but the angle between them is not.</p><p>The coordinate definition can be used to calculate the dot product, and the angle between the vectors can then be found using the result and the geometric definition.</p><p>For this example,<strong><em> a = [4, 8, 10] </em></strong>and <strong><em>b = [9, 2, 7]</em></strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/433/1*gvxThwYW7D3gNUYrzOtN5g.png" /><figcaption>Image by Author</figcaption></figure><p>Now, the angle can be found by setting <strong><em>||a|| ||b|| cos(θ) = a</em></strong>・<strong><em>b</em></strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/1*aLcodMRLcw5QDLM8dd4ewQ.png" /></figure>
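<p>The same three-dimensional calculation can be sketched in Python, again using NumPy; np.arccos inverts the geometric formula to recover the angle:</p><pre># coordinate definition: multiply corresponding components, then sum<br>import numpy as np<br><br>a = np.array([4, 8, 10])<br>b = np.array([9, 2, 7])<br><br>dot = np.sum(a * b) # 4*9 + 8*2 + 10*7 = 122<br><br># solve ||a|| ||b|| cos(theta) = a . b for theta<br>mag_a = np.linalg.norm(a)<br>mag_b = np.linalg.norm(b)<br>theta = np.degrees(np.arccos(dot / (mag_a * mag_b)))<br><br>print(dot)   # 122<br>print(theta) # roughly 38.2 degrees</pre>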
<p>Please don’t forget to like and follow for more! :)</p><h4>References</h4><ol><li><a href="https://www.mathsisfun.com/geometry/unit-circle.html">Math is Fun Unit Circle</a></li><li><a href="https://www.mathsisfun.com/algebra/vectors-dot-product.html">Math is Fun Vectors</a></li><li><a href="https://kikaben.com/transformers-self-attention/">KiKaBeN’s Transformer’s Self-Attention</a></li><li><a href="https://wumbo.net/formulas/magnitude-of-vector/main-600-300.svg">Wumbo</a></li><li><a href="https://en.wikipedia.org/wiki/Dot_product">Wikipedia’s Dot Product</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f2b09e48c2c8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Simple Introduction to Broadcasting]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-broadcasting-db8e581368b3?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/db8e581368b3</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[matrix-multiplication]]></category>
            <category><![CDATA[linear-algebra]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Wed, 10 May 2023 17:27:12 GMT</pubDate>
            <atom:updated>2023-05-14T21:42:32.993Z</atom:updated>
            <content:encoded><![CDATA[<p>Broadcasting occurs when a smaller tensor is “stretched” to have a compatible shape with a larger tensor in order to perform an operation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/514/0*isIQwtBA3Yo0Rq5K.png" /><figcaption>Image from NumPy</figcaption></figure><p>Broadcasting can be an efficient way to perform tensor operations without creating duplicate data.</p><p>According to PyTorch, a tensor is “broadcastable” if:</p><blockquote>Each tensor has at least one dimension</blockquote><blockquote>When iterating over dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist</blockquote><p>The trailing dimension is the rightmost number when comparing shapes.</p><p>The image above shows the generic process:</p><p>1. Determine if the rightmost dimensions are compatible</p><ul><li>Does each tensor have at least one dimension?</li><li>Are the sizes equal? Is one of them 1? Does one not exist?</li></ul><p>2. Stretch the dimension to the appropriate size</p><p>3. Repeat the previous steps for the next dimension</p><p>These steps are sketched in code below, and they can be seen in the examples that follow.</p>
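<p>Here is a minimal sketch of that rule, assuming a recent version of PyTorch where torch.broadcast_shapes is available. The broadcastable helper is hypothetical, written only to mirror the steps above:</p><pre>import torch<br><br># compare two shapes from the trailing (rightmost) dimension backward<br>def broadcastable(shape_a, shape_b):<br>  for a, b in zip(reversed(shape_a), reversed(shape_b)):<br>    # sizes must be equal, or one of them must be 1<br>    if a != b and a != 1 and b != 1:<br>      return False<br>  # any leftover (non-existent) dimensions are always compatible<br>  return True<br><br>print(broadcastable((2, 3, 3), (3, 1))) # True<br>print(broadcastable((2, 3, 3), (2, 3))) # False: 3 and 2 clash<br><br># PyTorch performs the same check and returns the resulting shape<br>print(torch.broadcast_shapes((2, 3, 3), (3, 1)))</pre><pre>True<br>False<br>torch.Size([2, 3, 3])</pre>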
<h4>Element-Wise Operations</h4><p>All element-wise operations require tensors to have the same shape.</p><p><strong>Vector and Scalar Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/520/0*7tfiTJQY_cWUrHpu.png" /><figcaption>Image from NumPy</figcaption></figure><pre>import torch<br>a = torch.tensor([1, 2, 3])<br>b = 2 # becomes ([2, 2, 2])<br><br>a * b</pre><pre>tensor([2, 4, 6])</pre><p>In this example, the scalar has a shape of (1,), and the vector has a shape of (3,). As the image demonstrates, <strong><em>b </em></strong>is broadcast to a shape of (3,), and the Hadamard product is performed as anticipated.</p><p><strong>Matrix and Vector Example 1</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/917/1*lJJP4JH65lzdUNTfM_dZlQ.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A</em></strong> has a shape of (3, 3), and <strong><em>b </em></strong>has a shape of (3,).</p><p>When multiplication occurs, the vector is stretched row-wise to create a matrix, which can be seen in the image above. Now, both <strong>A</strong> and <strong>b</strong> have a shape of (3, 3).</p><p>This can be seen below.</p><pre>A = torch.tensor([[1, 2, 3],<br>                  [4, 5, 6],<br>                  [7, 8, 9]])<br><br>b = torch.tensor([1, 2, 3])<br><br>A * b</pre><pre>tensor([[ 1,  4,  9],<br>        [ 4, 10, 18],<br>        [ 7, 16, 27]])</pre><p><strong>Matrix and Vector Example 2</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/926/1*dX_RazJzygLsk7pZnI8XJA.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A </em></strong>has a shape of (3, 3), and <strong><em>b </em></strong>has a shape of (3, 1).</p><p>When multiplication occurs, the vector is stretched column-wise to create two additional columns, which can be seen in the image above. Now, both <strong>A</strong> and <strong>b</strong> have a shape of (3, 3).</p><pre>A = torch.tensor([[1, 2, 3],<br>                  [4, 5, 6],<br>                  [7, 8, 9]])<br><br>b = torch.tensor([[1], <br>                  [2], <br>                  [3]])<br>A * b</pre><pre>tensor([[ 1,  2,  3],<br>        [ 8, 10, 12],<br>        [21, 24, 27]])</pre><p><strong>Tensor and Vector Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PR-fSPcgBwKIELsLJj0nNA.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A </em></strong>is a tensor with a shape of (2, 3, 3), and <strong><em>b </em></strong>is a column vector with a shape of (3, 1).</p><pre>A = (2, 3, 3)<br>b = ( , 3, 1)</pre><p>Starting from the rightmost dimension, <strong><em>b </em></strong>is stretched column-wise to generate a (3, 3) matrix. The middle dimensions are equal, so at this point, <strong><em>b </em></strong>is just a matrix. The leftmost dimension does not exist, so a dimension must be added. Then, the matrix must be broadcast across it to create a size of (2, 3, 3). There are now two (3, 3) matrices, which can be seen in the image above.</p><p>This allows the Hadamard product to be computed and generates a (2, 3, 3) tensor:</p><pre>A = torch.tensor([[[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]],<br><br>                  [[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]]])<br><br>b = torch.tensor([[1], <br>                  [2], <br>                  [3]])<br><br>A * b</pre><pre>tensor([[[ 1,  2,  3],<br>         [ 8, 10, 12],<br>         [21, 24, 27]],<br><br>        [[ 1,  2,  3],<br>         [ 8, 10, 12],<br>         [21, 24, 27]]])</pre><p><strong>Tensor and Matrix Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HRldA4Zs277cWt6klenswQ.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A </em></strong>is a tensor with a shape of (2, 3, 3), and <strong><em>B </em></strong>is a matrix with a shape of (3, 3).</p><pre>A = (2, 3, 3)<br>B = ( , 3, 3)</pre><p>This example is easier than the previous one because the two rightmost dimensions are identical. This means the matrix only has to be broadcast across the leftmost dimension to create a shape of (2, 3, 3). This just means an additional copy of the matrix is needed.</p><p>When the Hadamard product is calculated, the result is a (2, 3, 3) tensor.</p><pre>A = torch.tensor([[[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]],<br>                   <br>                  [[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]]])<br><br>B = torch.tensor([[1, 2, 3], <br>                  [1, 2, 3], <br>                  [1, 2, 3]])<br><br>A * B</pre><pre>tensor([[[ 1,  4,  9],<br>         [ 4, 10, 18],<br>         [ 7, 16, 27]],<br><br>        [[ 1,  4,  9],<br>         [ 4, 10, 18],<br>         [ 7, 16, 27]]])</pre>
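<p>As a sanity check, the stretching in these examples can be made explicit. The sketch below assumes torch.broadcast_to, available in recent versions of PyTorch; it materializes the broadcast copy of <strong><em>B </em></strong>from the example above and confirms the element-wise result is unchanged:</p><pre># materialize the broadcast of B explicitly and compare<br>B_stretched = torch.broadcast_to(B, (2, 3, 3))<br><br>print(B_stretched.shape)<br>print(torch.equal(A * B, A * B_stretched))</pre><pre>torch.Size([2, 3, 3])<br>True</pre>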
<h4><strong>Matrix and Tensor Multiplication with the Dot Product</strong></h4><p>For all of the previous examples, the goal was to end up with identical shapes to allow element-wise multiplication. The goal of this example is to enable matrix and tensor multiplication via the dot product, which requires the last dimension of the first matrix or tensor to match the second-to-last dimension of the second matrix or tensor.</p><p>For matrix multiplication:</p><ul><li><strong><em>(m, n) x (n, r) = (m, r)</em></strong></li></ul><p>For 3D tensor multiplication:</p><ul><li><strong><em>(c, m, n) x (c, n, r) = (c, m, r)</em></strong></li></ul><p>For 4D tensor multiplication:</p><ul><li><strong><em>(z, c, m, n) x (z, c, n, r) = (z, c, m, r)</em></strong></li></ul><p><strong>Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DYG4KJgmDJL-T1dlYl-2Qw.png" /><figcaption>Image by Author</figcaption></figure><p>For this example, 𝓐 has a shape of (2, 3, 3), and <strong><em>B </em></strong>has a shape of (3, 2). The two rightmost dimensions, (3, 3) x (3, 2), are already eligible for dot-product multiplication. A dimension needs to be added to <em>B</em>, and the (3, 2) matrix needs to be broadcast across this dimension to create a shape of (2, 3, 2).</p><p>The result of this tensor multiplication will be <strong><em>(2, 3, 3) x (2, 3, 2) = (2, 3, 2)</em></strong>.</p><pre>A = torch.tensor([[[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]],<br>                   <br>                  [[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]]])<br><br>B = torch.tensor([[1, 2], <br>                  [1, 2], <br>                  [1, 2]])<br><br>A @ B # A.matmul(B)</pre><pre>tensor([[[ 6, 12],<br>         [15, 30],<br>         [24, 48]],<br><br>        [[ 6, 12],<br>         [15, 30],<br>         [24, 48]]])</pre><p>Additional information on broadcasting can be found at the links below. More information about tensors and their operations can be found <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-tensors-c4a8321efffc">here</a>.</p><p>Please don’t forget to like and follow for more! :)</p><h4>References</h4><ol><li><a href="https://numpy.org/doc/stable/user/basics.broadcasting.html">NumPy Broadcasting</a></li><li><a href="https://pytorch.org/docs/stable/notes/broadcasting.html">PyTorch Broadcasting</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db8e581368b3" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>