<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Hunter Phillips on Medium]]></title>
        <description><![CDATA[Stories by Hunter Phillips on Medium]]></description>
        <link>https://medium.com/@hunter-j-phillips?source=rss-7a7936a6a04------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*7pIFSd-SH0G-p781QlIyzw.jpeg</url>
            <title>Stories by Hunter Phillips on Medium</title>
            <link>https://medium.com/@hunter-j-phillips?source=rss-7a7936a6a04------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 10:12:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@hunter-j-phillips/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[How to Convert an Image to a PDF in Python]]></title>
            <link>https://medium.com/@hunter-j-phillips/how-to-convert-an-image-to-a-pdf-in-python-f1f9cee3b996?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/f1f9cee3b996</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[pdf-converter]]></category>
            <category><![CDATA[pdf]]></category>
            <category><![CDATA[image-to-pdf-converter]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Tue, 08 Aug 2023 00:06:58 GMT</pubDate>
            <atom:updated>2023-08-08T00:06:58.873Z</atom:updated>
<content:encoded><![CDATA[<p>Want to convert one or more images to a PDF document? Look no further than the <strong>img2pdf </strong>and <strong>PyPDF2 </strong>packages.</p><h3>Packages</h3><p>To start, all you need is a Python environment, preferably version 3.10 or higher. The code in this tutorial was executed in a Google Colaboratory environment with Python 3.10.12.</p><p>The first step is to ensure the following packages are installed in the Python environment:</p><ul><li>img2pdf</li><li>PyPDF2</li><li>Pillow (PIL)</li></ul><p>Pip can be used to install these packages in Colab:</p><pre>!pip install img2pdf PyPDF2 Pillow</pre><p>The first package, img2pdf, will be used to convert an image to a PDF file. Then, PyPDF2 can be used to merge multiple PDFs into a single PDF file. Pillow is an image processing library; it provides additional functions necessary for the conversion.</p><p>These packages, along with os and google.colab, can now be imported.</p><pre># required libraries<br>import os<br>import img2pdf<br>import PyPDF2<br>from PIL import Image<br>from google.colab import files</pre><h3>Prepare the Images</h3><p>Before writing any more code, it is important to know the file location of each image. To make this as easy as possible, a new folder can be created in the Colab environment:</p><pre>!mkdir images</pre><p>All the images need to be uploaded simultaneously to this location using an uploader provided by google.colab. The files will be ordered by name, so they should be named sequentially, such as page1.png, page2.png, ..., page9.png.</p><pre>os.chdir(&quot;images&quot;)<br>files.upload()</pre><p>With the images stored in a known file location, their names can be stored in a list.</p><pre>imgs = os.listdir()<br>imgs.sort()</pre><p>If there are more than nine images, an alphabetical sort will misorder them (page10.png sorts before page2.png), so the list should be built with the files in the order they need to appear; a numeric sort that handles this is sketched after the conversion step below.</p><h3>Converting the Images to PDFs</h3><p>A for-loop can then be used to iterate over each image, convert it to a PDF, and write it to a new folder called pdfs.</p><pre># create a folder called pdfs<br>os.mkdir(&quot;../pdfs&quot;)<br><br># loop over each image<br>for ind, img in enumerate(imgs):<br>  # open each image<br>  with Image.open(img) as image: <br>    # convert the image to a PDF<br>    pdf = img2pdf.convert(image.filename)<br>    # write the PDF to its final destination<br>    with open(f&quot;../pdfs/pdf{ind+1}.pdf&quot;, &quot;wb&quot;) as file:<br>      file.write(pdf)<br>      print(f&quot;Converted {img} to pdf{ind+1}.pdf&quot;)</pre>
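<p>As promised above, if there are more than nine images, the plain alphabetical sort will not produce the right page order. Before running the conversion loop, the list can instead be built with a numeric sort. The snippet below is a minimal sketch that assumes filenames like page1.png; the numeric_key helper is illustrative and not part of the original tutorial.</p><pre>import re<br><br>def numeric_key(name):<br>  # pull the first run of digits from the filename, e.g. &quot;page10.png&quot; -&gt; 10<br>  match = re.search(r&quot;\d+&quot;, name)<br>  return int(match.group()) if match else 0<br><br># sorts page10.png after page9.png instead of after page1.png<br>imgs = sorted(os.listdir(), key=numeric_key)</pre>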
<h3>Merging the PDFs</h3><p>With the images converted to PDF files, they can either be used independently and downloaded with files.download(&#39;filename.pdf&#39;), or they can be merged together. To merge the files together, extract the list of PDF files and sort them by their page number.</p><pre>os.chdir(&quot;../pdfs&quot;)<br>pdfs = os.listdir()<br>pdfs.sort()</pre><p>Once again, if there are more than 9 images or PDFs, they should be stored in a list in their respective order.</p><p>A PdfMerger object can be used to concatenate each PDF into a single file.</p><pre>pdfMerge = PyPDF2.PdfMerger()<br><br># loop through each pdf page<br>for pdf in pdfs:<br>  # open each pdf<br>  with open(pdf, &#39;rb&#39;) as pdfFile:<br>    # merge each file<br>    pdfMerge.append(PyPDF2.PdfReader(pdfFile))<br><br># write the merged pdf <br>pdfMerge.write(&#39;merged.pdf&#39;)<br><br># download the final pdf<br>files.download(&#39;merged.pdf&#39;)</pre><p>The final merged PDF will contain each image in the order of their respective names.</p><h3>Full Program</h3><p>The entirety of the code can be found below. It is highly customizable to meet most use cases.</p><pre>!pip install img2pdf PyPDF2 Pillow<br>!mkdir images<br># required libraries<br>import os<br>import img2pdf<br>import PyPDF2<br>from PIL import Image<br>from google.colab import files<br><br>os.chdir(&quot;images&quot;)<br>files.upload()<br>imgs = os.listdir()<br>imgs.sort()<br><br># create a folder called pdfs<br>os.mkdir(&quot;../pdfs&quot;)<br><br># loop over each image<br>for ind, img in enumerate(imgs):<br>  # open each image<br>  with Image.open(img) as image: <br>    # convert the image to a PDF<br>    pdf = img2pdf.convert(image.filename)<br>    # write the PDF to its final destination<br>    with open(f&quot;../pdfs/pdf{ind+1}.pdf&quot;, &quot;wb&quot;) as file:<br>      file.write(pdf)<br>      print(f&quot;Converted {img} to pdf{ind+1}.pdf&quot;)<br><br>os.chdir(&quot;../pdfs&quot;)<br>pdfs = os.listdir()<br>pdfs.sort()<br><br>pdfMerge = PyPDF2.PdfMerger()<br><br># loop through each pdf page<br>for pdf in pdfs:<br>  # open each pdf<br>  with open(pdf, &#39;rb&#39;) as pdfFile:<br>    # merge each file<br>    pdfMerge.append(PyPDF2.PdfReader(pdfFile))<br><br># write the merged pdf <br>pdfMerge.write(&#39;merged.pdf&#39;)<br><br># download the final pdf<br>files.download(&#39;merged.pdf&#39;)</pre><h3>References</h3><ol><li><a href="https://www.geeksforgeeks.org/python-convert-image-to-pdf-using-img2pdf-module/#">https://www.geeksforgeeks.org/python-convert-image-to-pdf-using-img2pdf-module/</a></li><li><a href="https://python-bloggers.com/2022/04/merging-pdfs-with-python-2/">https://python-bloggers.com/2022/04/merging-pdfs-with-python-2/</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f1f9cee3b996" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What is a DataFrame in PySpark?]]></title>
            <link>https://medium.com/@hunter-j-phillips/what-is-a-dataframe-in-pyspark-e968240fd1f4?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/e968240fd1f4</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[big-data]]></category>
            <category><![CDATA[pyspark]]></category>
            <category><![CDATA[dataframes]]></category>
            <category><![CDATA[spark]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Sat, 10 Jun 2023 04:39:55 GMT</pubDate>
            <atom:updated>2023-06-10T04:39:55.924Z</atom:updated>
<content:encoded><![CDATA[<p>This article covers DataFrames in PySpark and how to manipulate them with DataFrame methods and Spark SQL.</p><h3>DataFrames</h3><p>In PySpark, a DataFrame is a table-like structure that can be manipulated using SQL-like methods. A DataFrame can be thought of as a table with rows and columns. Each column is a field, and each row is a record. For instance, the DataFrame below has two fields: age and name. It has three records: (null, Michael), (30, Andy), and (19, Justin).</p><pre>+----+-------+<br>| age|   name|<br>+----+-------+<br>|null|Michael|<br>|  30|   Andy|<br>|  19| Justin|<br>+----+-------+</pre><p>The following sections will highlight how to use DataFrames and their potential advantages over RDDs.</p><h3>Loading and Previewing a DataFrame</h3><p>To use PySpark, a SparkSession can be created to interact with a cluster. This session can be used to create a DataFrame from an <a href="https://medium.com/@hunter-j-phillips/what-is-an-rdd-in-pyspark-5b5968c0ac9d">RDD</a>, JSON files, CSV files, and more.</p><pre>from pyspark.sql import SparkSession<br><br># build a SparkSession<br>spark = SparkSession.builder.appName(&quot;intro&quot;).getOrCreate()</pre><p>spark.read.&lt;data_type&gt; can be used to create a DataFrame. data_type can be json, csv, text, and more. The example below uses json to read a short JSON file from a GitHub repo. The people.json file can be downloaded <a href="https://raw.githubusercontent.com/apache/spark/master/examples/src/main/resources/people.json">here</a>.</p><pre># create dataframe<br>df = spark.read.json(&quot;people.json&quot;)</pre><p>This is the DataFrame shown in the first section:</p><pre>+----+-------+<br>| age|   name|<br>+----+-------+<br>|null|Michael|<br>|  30|   Andy|<br>|  19| Justin|<br>+----+-------+</pre>
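<p>For reference, reading a CSV file follows the same pattern. The snippet below is a small illustrative sketch rather than part of the original article (the people.csv filename is hypothetical); header and inferSchema tell the reader to treat the first row as column names and to guess each column&#39;s data type:</p><pre># create dataframe from a CSV file<br># header=True uses the first row as column names<br># inferSchema=True infers each column&#39;s data type<br>df_csv = spark.read.csv(&quot;people.csv&quot;, header=True, inferSchema=True)</pre>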
<p>DataFrames can also be created using spark.createDataFrame(&lt;list of lists&gt;, schema=&lt;list of column names&gt;). The code below creates a DataFrame of animals on a small farm, including their number of legs and age.</p><pre>df1 = spark.createDataFrame([[&#39;cow&#39;, 4, 5],<br>                             [&#39;cow&#39;, 4, 3],<br>                             [&#39;cow&#39;, 4, 3],<br>                             [&#39;chicken&#39;, 2, 2],<br>                             [&#39;chicken&#39;, 2, 1],<br>                             [&#39;chicken&#39;, 2, 0],<br>                             [&#39;horse&#39;, 4, 8],<br>                             [&#39;donkey&#39;, 4, 8],<br>                             [&#39;donkey&#39;, 4, 2],<br>                             [&#39;turkey&#39;, 2, 1],<br>                             [&#39;turkey&#39;, 2, 1],<br>                             [&#39;pig&#39;, 4, 5],<br>                             [&#39;dog&#39;, 4, 12],<br>                             [&#39;cat&#39;, 4, 9],<br>                             [&#39;goat&#39;, 4, 3],<br>                             [&#39;goat&#39;, 5, 1]<br>                            ], schema=[&#39;animal&#39;, &#39;legs&#39;, &#39;age&#39;])</pre><p>This DataFrame has the following appearance:</p><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|    cow|   4|  5|<br>|    cow|   4|  3|<br>|    cow|   4|  3|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|chicken|   2|  0|<br>|  horse|   4|  8|<br>| donkey|   4|  8|<br>| donkey|   4|  2|<br>| turkey|   2|  1|<br>| turkey|   2|  1|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>|    cat|   4|  9|<br>|   goat|   4|  3|<br>|   goat|   5|  1|<br>+-------+----+---+</pre><p>These DataFrames can be manipulated using DataFrame methods or by manually programming SQL expressions.</p><h3>DataFrame Operations with Methods</h3><p>This first section will deal with methods directly accessible on DataFrame objects.</p><h4>PrintSchema</h4><p>To see the structure of the DataFrame, df.printSchema() can be used; it shows the column names, data types, and nullability.</p><pre>df.printSchema()<br>df1.printSchema()</pre><pre>root<br> |-- age: long (nullable = true)<br> |-- name: string (nullable = true)<br><br>root<br> |-- animal: string (nullable = true)<br> |-- legs: long (nullable = true)<br> |-- age: long (nullable = true)</pre><h4>Show</h4><p>To see the contents of the DataFrame, df.show() can be used. This easy preview is one benefit of DataFrames over RDDs.</p><pre>df.show()</pre><pre>+----+-------+<br>| age|   name|<br>+----+-------+<br>|null|Michael|<br>|  30|   Andy|<br>|  19| Justin|<br>+----+-------+</pre><h4>Select</h4><p>The DataFrame is also much more interactive than an RDD. Columns can be selected by name with df.select(&quot;&lt;col&gt;&quot;), by attribute with df.select(df.&lt;col&gt;), or by indexing with df.select(df[&lt;col&gt;]):</p><pre>df.select(&quot;age&quot;).show()</pre><pre>+----+<br>| age|<br>+----+<br>|null|<br>|  30|<br>|  19|<br>+----+</pre><p>A constant can also be added to a column with a similar statement:</p><pre>df.select(df[&#39;name&#39;], df[&#39;age&#39;] + 6).show()</pre><pre>+-------+---------+<br>|   name|(age + 6)|<br>+-------+---------+<br>|Michael|     null|<br>|   Andy|       36|<br>| Justin|       25|<br>+-------+---------+</pre>
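<p>The generated column name, (age + 6), is awkward to reference later. As a small aside not in the original article, alias can rename the computed column (output omitted):</p><pre># rename the computed column to something readable<br>df.select(df[&#39;name&#39;], (df[&#39;age&#39;] + 6).alias(&#39;agePlus6&#39;)).show()</pre>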
<h4>WithColumn</h4><p>df.withColumn(&lt;col&gt;, &lt;manipulated col&gt;) can be used to add a new column or overwrite an existing one. The example below uses lower from pyspark.sql.functions to lowercase all the values in the name field:</p><pre>from pyspark.sql.functions import lower<br><br>df.withColumn(&#39;nameLower&#39;, lower(df[&#39;name&#39;])).show()</pre><pre>+----+-------+---------+<br>| age|   name|nameLower|<br>+----+-------+---------+<br>|null|Michael|  michael|<br>|  30|   Andy|     andy|<br>|  19| Justin|   justin|<br>+----+-------+---------+</pre><h4>Filter</h4><p>df.filter(cond) can be used to filter a DataFrame based on the provided condition. The example below filters for people who are 19 years old.</p><pre>df.filter(df[&#39;age&#39;] == 19).show()</pre><pre>+---+------+<br>|age|  name|<br>+---+------+<br>| 19|Justin|<br>+---+------+</pre><h4><strong>Count</strong></h4><p>df.count() returns the total number of records in the DataFrame. Notice the animal DataFrame is being used now.</p><pre>df1.count()</pre><pre>16</pre><h4>GroupBy</h4><p>df.groupBy(&lt;col&gt;).&lt;aggFunc&gt; can be used to group a DataFrame based on a specific field, and then an aggregation can be performed on each group. The first example shows the average number of legs and age for each type of animal:</p><pre>df1.groupBy(df1[&#39;animal&#39;]).avg().show()</pre><pre>+-------+---------+------------------+<br>| animal|avg(legs)|          avg(age)|<br>+-------+---------+------------------+<br>|  horse|      4.0|               8.0|<br>|    cow|      4.0|3.6666666666666665|<br>| donkey|      4.0|               5.0|<br>|chicken|      2.0|               1.0|<br>|    dog|      4.0|              12.0|<br>|    cat|      4.0|               9.0|<br>| turkey|      2.0|               1.0|<br>|    pig|      4.0|               5.0|<br>|   goat|      4.5|               2.0|<br>+-------+---------+------------------+</pre><p>The second example sums the legs and age for each type of animal:</p><pre>df1.groupBy(df1[&#39;animal&#39;]).sum().show()</pre><pre>+-------+---------+--------+<br>| animal|sum(legs)|sum(age)|<br>+-------+---------+--------+<br>|  horse|        4|       8|<br>|    cow|       12|      11|<br>| donkey|        8|      10|<br>|chicken|        6|       3|<br>|    dog|        4|      12|<br>|    cat|        4|       9|<br>| turkey|        4|       2|<br>|    pig|        4|       5|<br>|   goat|        9|       4|<br>+-------+---------+--------+</pre><p>The min() and max() aggregation functions could also be used, as sketched below.</p>
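<p>As a quick sketch (this example is not in the original article; output omitted), min() follows the same pattern as avg() and sum():</p><pre># smallest value of each numeric column per animal<br>df1.groupBy(df1[&#39;animal&#39;]).min().show()</pre>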
<h4>Distinct</h4><p>df.distinct() returns a new DataFrame without any duplicate rows. In the animal DataFrame, there are two cows with the same record: (cow, 4, 3). Both turkeys also have the same record: (turkey, 2, 1). This method will remove them:</p><pre>df1.distinct().show()</pre><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|chicken|   2|  0|<br>|    cow|   4|  5|<br>|  horse|   4|  8|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|    cow|   4|  3|<br>| donkey|   4|  8|<br>|   goat|   4|  3|<br>| turkey|   2|  1|<br>|    cat|   4|  9|<br>|   goat|   5|  1|<br>| donkey|   4|  2|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>+-------+----+---+</pre><h4>DropDuplicates</h4><p>df.dropDuplicates() also removes duplicate rows. Unlike distinct(), it can optionally consider only a subset of columns, such as df1.dropDuplicates([&#39;animal&#39;]).</p><pre>df1.dropDuplicates().show()</pre><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|chicken|   2|  0|<br>|    cow|   4|  5|<br>|  horse|   4|  8|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|    cow|   4|  3|<br>| donkey|   4|  8|<br>|   goat|   4|  3|<br>| turkey|   2|  1|<br>|    cat|   4|  9|<br>|   goat|   5|  1|<br>| donkey|   4|  2|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>+-------+----+---+</pre><h3>DataFrames with SQL</h3><p>DataFrames use the same engine as Spark SQL, so the sql functionality of SparkSession can be used on DataFrames that are registered as a table. This means SQL can be used on a DataFrame like it would be on any other table.</p><h4>CreateOrReplaceTempView</h4><p>df.createOrReplaceTempView(&lt;name&gt;) registers a DataFrame as a table that can be accessed in SQL expressions.</p><pre>df1.createOrReplaceTempView(&quot;animals&quot;)</pre><h4>Spark.SQL</h4><p>With the table registered, spark.sql(&quot;SQL expression&quot;) can be used to query it. For the most part, <a href="https://www.w3schools.com/sql/">SQL </a>expressions work as expected. The query below selects all the rows from the table.</p><pre>spark.sql(&quot;SELECT * FROM animals&quot;).show()</pre><pre>+-------+----+---+<br>| animal|legs|age|<br>+-------+----+---+<br>|    cow|   4|  5|<br>|    cow|   4|  3|<br>|    cow|   4|  3|<br>|chicken|   2|  2|<br>|chicken|   2|  1|<br>|chicken|   2|  0|<br>|  horse|   4|  8|<br>| donkey|   4|  8|<br>| donkey|   4|  2|<br>| turkey|   2|  1|<br>| turkey|   2|  1|<br>|    pig|   4|  5|<br>|    dog|   4| 12|<br>|    cat|   4|  9|<br>|   goat|   4|  3|<br>|   goat|   5|  1|<br>+-------+----+---+</pre><p>And even more complicated queries can be used:</p><pre>spark.sql(&quot;&quot;&quot;SELECT animal, MIN(legs), AVG(age) <br>                FROM animals <br>                GROUP BY animal<br>                ORDER BY AVG(age) DESC<br>          &quot;&quot;&quot;).show()</pre><pre>+-------+---------+------------------+<br>| animal|min(legs)|          avg(age)|<br>+-------+---------+------------------+<br>|    dog|        4|              12.0|<br>|    cat|        4|               9.0|<br>|  horse|        4|               8.0|<br>| donkey|        4|               5.0|<br>|    pig|        4|               5.0|<br>|    cow|        4|3.6666666666666665|<br>|   goat|        4|               2.0|<br>|chicken|        2|               1.0|<br>| turkey|        2|               1.0|<br>+-------+---------+------------------+</pre><h3>References</h3><ol><li><a href="https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html">https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e968240fd1f4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What is an RDD in PySpark?]]></title>
            <link>https://medium.com/@hunter-j-phillips/what-is-an-rdd-in-pyspark-5b5968c0ac9d?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/5b5968c0ac9d</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[pyspark]]></category>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[big-data]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Sat, 10 Jun 2023 01:52:56 GMT</pubDate>
            <atom:updated>2023-06-10T01:59:06.832Z</atom:updated>
            <content:encoded><![CDATA[<p>This article covers the basic uses of resilient distributed datasets in PySpark. It includes examples of both transformations and actions that can be performed on them.</p><h3>Resilient Distributed Datasets (RDDs)</h3><p>In PySpark, a resilient distributed dataset (RDD) is a collection of elements. Unlike a normal list, they can be operated on in parallel. This basically means that when an operation is performed on a collection, it is split into a number of subcollections. These subcollections are sent to a cluster of computers, and the operation is performed in parallel on each subcollection and returned. RDDs are also fault tolerant, which means operations will be properly performed even if a component of the cluster fails.</p><p>An RDD can be created from an existing collection, or it can be created from an external dataset. To start, a simple list can be loaded and parallelized. Parallelization is controlled by SparkContext; it connects to a cluster and can broadcast the data to it.</p><pre>from pyspark import SparkContext<br><br># initialize SparkContext<br>sc = SparkContext(master=&#39;local&#39;, appName=&#39;test&#39;)</pre><pre>data = [1, 5, 10, 15, 20, 25, 30]<br><br># c = collection to distribute<br># numSlices = partitions of collection<br>distributedData = sc.parallelize(c=data, numSlices=3)<br># preview the partitions<br>distributedData.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30]]</pre><p>When parallelizing the data, the number of partitions, numSlices, represents the number of tasks, or subcollections, to run on the cluster. About 2 to 4 slices per CPU in the cluster is normal. glom() can be used to gather each partition’s data into a list, and collect() can be used to preview the partitions. Now, operations can be performed on the RDD. There are two types of RDD operations: transformations, which yield a new RDD, and actions, which return a value.</p><h3>Transformations</h3><p>Transformations are operations that return a new RDD.</p><h4><strong>Map</strong></h4><p>map(func) passes each element of an RDD through a function, and the appropriate operations are performed on each element. In the example below, each element of the distributed dataset is multiplied by 2.</p><pre># map<br>newRDD = distributedData.map(lambda x: 2*x)<br>newRDD.glom().collect()</pre><pre>[[2, 10], [20, 30], [40, 50, 60]]</pre><h4><strong>Filter</strong></h4><p>filter(func) returns an RDD of elements that meet the requirements of the function. The example below filters for elements with a value greater than 10.</p><pre># filter<br>newRDD = distributedData.filter(lambda x: x &gt; 10)<br>newRDD.glom().collect()</pre><pre>[[], [15], [20, 25, 30]]</pre><h4><strong>FlatMap</strong></h4><p>flatMap(func) is similar to map but each element can be mapped to an output of 0 or more elements (a sequence). In this example, the input element is mapped to a tuple of itself and the output of 5<em>x</em>.</p><pre># flatMap<br>newRDD = distributedData.flatMap(lambda x: [(x, 5*x)])<br>newRDD.glom().collect()</pre><pre>[[(1, 5), (5, 25)], [(10, 50), (15, 75)], [(20, 100), (25, 125), (30, 150)]]</pre><h4><strong>MapPartitions</strong></h4><p>mapPartitions(func) is similar to map but runs on each partition and returns the new partition. 
In this example, each partition’s elements are summed and returned as the partition.</p><pre># mapPartitions<br>def f_mapPart(iterator):<br>  yield sum(iterator)<br><br>newRDD = distributedData.mapPartitions(f_mapPart)<br>newRDD.glom().collect()</pre><pre>[[6], [25], [75]]</pre><h4><strong>MapPartitionsWithIndex</strong></h4><p>mapPartitionsWithIndex(func) is similar to mapPartitions but also includes the partition’s index. The example below yields the index of each partition.</p><pre># mapPartitionsWithIndex<br>def f_mapPartIndex(index, iterator):<br>  yield index<br><br>newRDD = distributedData.mapPartitionsWithIndex(f_mapPartIndex)<br>newRDD.glom().collect()</pre><pre>[[0], [1], [2]]</pre><h4><strong>Union</strong></h4><p>union(RDD) returns a new RDD with the union of the original RDD and provided RDD. The example below shows the distributed dataset unioned with the distributed dataset, creating a new RDD twice as long.</p><pre># union<br>newRDD = distributedData.union(distributedData)<br>newRDD.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30], [1, 5], [10, 15], [20, 25, 30]]</pre><h4><strong>Intersection</strong></h4><p>intersection(RDD) returns a new RDD with the intersection of the original and provided RDDs. The example below combines the original distributed data and a new distributed dataset to generate a new RDD with only the intersections.</p><pre>data2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]<br><br># distributedData2 = [[1, 2], [3, 4], [5, 6], [7, 8, 9, 10]]<br>distributedData2 = sc.parallelize(data2, 4)<br><br># intersection | distributedData = [[1, 5], [10, 15], [20, 25, 30]]<br>newRDD = distributedData.intersection(distributedData2)<br>newRDD.collect()</pre><pre>[1, 10, 5]</pre><h4><strong>Distinct</strong></h4><p>distinct() returns a new RDD with the unique elements from the original.</p><pre>data3 = [1, 1, 2, 2, 3, 3, 4, 5]<br>distributedData3 = sc.parallelize(data3, 4)<br><br># distinct<br>newRDD = distributedData3.distinct()<br>sorted(newRDD.collect())</pre><pre>[1, 2, 3, 4, 5]</pre><h4><strong>GroupByKey and MapValues</strong></h4><p>groupByKey() requires an RDD with elements of (K, V) and returns a new RDD of elements (K, Iterable&lt;V&gt;), where Iterable&lt;V&gt; includes all values paired with K.</p><pre>data4 = [(&quot;red&quot;, 1), (&quot;red&quot;, 2), (&quot;red&quot;, 3), (&quot;blue&quot;, 4), (&quot;blue&quot;, 5)]<br>distributedData4 = sc.parallelize(data4, 4)<br><br># groupByKey()<br>newRDD = distributedData4.groupByKey()<br>newRDD.collect()</pre><pre>[(&#39;red&#39;, &lt;pyspark.resultiterable.ResultIterable at 0x7fea6ec3e860&gt;),<br> (&#39;blue&#39;, &lt;pyspark.resultiterable.ResultIterable at 0x7feaa422ffa0&gt;)]</pre><p>To view the values of the iterables, the RDD’s elements can be mapped to the list function with mapValues(func), which alters each value without altering the keys.</p><pre>newRDD.mapValues(list).collect()</pre><pre>[(&#39;red&#39;, [1, 2, 3]), (&#39;blue&#39;, [4, 5])]</pre><h4><strong>ReduceByKey</strong></h4><p>reduceByKey(func) requires an RDD with elements of (K, V) and returns a new RDD with elements of (K, V), where V is aggregated based on K and reduced by the function. In the example, it is important to note that a and b are required for the function to add each element in the list. As an example, [1, 2, 3] may be reduced like 1 + 2 = 3, then 3 + 3 = 6. 
The result from the previous addition is an input for the current addition.</p><pre># reduceByKey()<br>newRDD = distributedData4.reduceByKey(lambda a,b: a+b)<br>newRDD.collect()</pre><pre>[(&#39;red&#39;, 6), (&#39;blue&#39;, 9)]</pre><h4><strong>SortByKey</strong></h4><p>sortByKey(ascending=True, keyfunc) returns a new RDD sorted in ascending or descending order based on the key function or the default order. The example below sorts each key in ascending order.</p><pre># sortByKey()<br>data5 = [(&quot;zebra&quot;, 1), (&quot;red&quot;, 2), (&quot;apple&quot;, 3), (&quot;blue&quot;, 4), (&quot;horse&quot;, 5)]<br>distributedData5 = sc.parallelize(data5, 4)<br><br>newRDD = distributedData5.sortByKey(ascending=True)<br>newRDD.collect()</pre><pre>[(&#39;apple&#39;, 3), (&#39;blue&#39;, 4), (&#39;horse&#39;, 5), (&#39;red&#39;, 2), (&#39;zebra&#39;, 1)]</pre><p>This next example uses a key function to select the second to last letter of each key and sorts it in descending order:</p><pre># sortByKey()<br>newRDD = distributedData5.sortByKey(ascending=False, keyfunc=lambda k: k[-2])<br>newRDD.collect()</pre><pre>[(&#39;blue&#39;, 4), (&#39;horse&#39;, 5), (&#39;zebra&#39;, 1), (&#39;apple&#39;, 3), (&#39;red&#39;, 2)]</pre><h4><strong>Join, LeftOuterJoin, RightOuterJoin, FullOuterJoin</strong></h4><p>join(RDD) returns a new RDD of (K, (V, W)) if the original and provided datasets are (K, V) and (K, W), respectively. In other words, values from identical keys are grouped together and returned. Keys without corresponding pairs in both datasets are not returned.</p><pre># join<br>leftData = [(&quot;a&quot;, 1), (&quot;b&quot;, 2), (&quot;c&quot;, 3)]<br>rightData = [(&quot;a&quot;, 4), (&quot;c&quot;, 5), (&quot;d&quot;, 6)]<br><br>leftRDD = sc.parallelize(leftData)<br>rightRDD = sc.parallelize(rightData)<br><br>newRDD = leftRDD.join(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;c&#39;, (3, 5)), (&#39;a&#39;, (1, 4))]</pre><p>leftOuterJoin(RDD) returns a new RDD of (K, (V, W)). For each (K, V) in the left dataset, the corresponding (K, W) in the right dataset will be joined. If the key does not exist in the right dataset, None will be returned. This means every K in the left dataset is present in the new RDD.</p><pre># leftOuterJoin<br>newRDD = leftRDD.leftOuterJoin(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;b&#39;, (2, None)), (&#39;c&#39;, (3, 5)), (&#39;a&#39;, (1, 4))]</pre><p>rightOuterJoin(RDD) returns a new RDD of (K, (V, W)). For each (K, W) in the right dataset, the corresponding (K, V) in the left dataset will be joined. If the key does not exist in the left dataset, None will be returned. This means every K in the right dataset is present in the new RDD.</p><pre># rightOuterJoin<br>newRDD = leftRDD.rightOuterJoin(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;c&#39;, (3, 5)), (&#39;d&#39;, (None, 6)), (&#39;a&#39;, (1, 4))]</pre><p>fullOuterJoin(RDD) returns a new RDD of (K, (V, W)). For each (K, V) in the left dataset and (K, W) in the right dataset, the matches will be returned as (K, (V, W)). If a key exists in the left dataset that is not in the right dataset, the result will be (K, (V, None)). Likewise, if a key exists in the right dataset that is not in the left dataset, the result will be (K, (None, W)). 
This is essentially a union of the left and right outer joins.</p><pre># fullOuterJoin<br>newRDD = leftRDD.fullOuterJoin(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;b&#39;, (2, None)), (&#39;c&#39;, (3, 5)), (&#39;d&#39;, (None, 6)), (&#39;a&#39;, (1, 4))]</pre><h4><strong>CoGroup</strong></h4><p>cogroup(RDD) returns an RDD of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) if the original and source are (K, V) and (K, W), respectively.</p><pre># cogroup<br>newRDD = leftRDD.cogroup(rightRDD)<br>newRDD.collect()</pre><pre>[(&#39;b&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eca6920&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb33760&gt;)),<br> (&#39;c&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb31870&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb30e20&gt;)),<br> (&#39;d&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb32470&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb32950&gt;)),<br> (&#39;a&#39;,<br>  (&lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb32230&gt;,<br>   &lt;pyspark.resultiterable.ResultIterable at 0x7fea6eb33fa0&gt;))]</pre><p>To view the iterables, the values can be mapped to lists:</p><pre>[(k, tuple(map(list, v))) for k, v in newRDD.collect()]</pre><pre>[(&#39;b&#39;, ([2], [])), (&#39;c&#39;, ([3], [5])), (&#39;d&#39;, ([], [6])), (&#39;a&#39;, ([1], [4]))]</pre><h4><strong>Coalesce</strong></h4><p>coalesce(numPartitions) reduces the number of partitions of an RDD. The example below coalesces from three partitions to two partitions.</p><pre># preview the partitions<br>distributedData.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30]]</pre><pre># coalesce<br>distributedData.coalesce(numPartitions=2).glom().collect()</pre><pre>[[1, 5], [10, 15, 20, 25, 30]]</pre><h4><strong>Repartition</strong></h4><p>repartition(numPartitions) randomly shuffles the data to create more or fewer partitions. The example below repartitions from three partitions to two, but it differs from coalesce since it randomizes the partitions.</p><pre># preview the partitions<br>distributedData.glom().collect()</pre><pre>[[1, 5], [10, 15], [20, 25, 30]]</pre><pre># repartition<br>distributedData.repartition(numPartitions=2).glom().collect()</pre><pre>[[20, 25, 30], [1, 5, 10, 15]]</pre><h3>Actions</h3><p>Actions are operations that return a value or some values from an RDD rather than creating a new RDD.</p><h4><strong>Collect</strong></h4><p>collect() has been used throughout the previous examples to return the RDD as a list for viewing purposes. The example below confirms that the output is a list.</p><pre># collect<br>type(distributedData.glom().collect())</pre><pre>list</pre><h4><strong>Reduce</strong></h4><p>reduce(func) aggregates the elements of an RDD using the provided function. This function takes two arguments and has a single output. 
The operations should be commutative and associative so the reduction can be computed correctly in parallel.</p><pre># reduce | distributedData = [[1, 5], [10, 15], [20, 25, 30]]<br>distributedData.reduce(lambda a,b: a+b)</pre><pre>106</pre><h4><strong>Count</strong></h4><p>count() returns the number of elements in an RDD.</p><pre># count<br>distributedData.count()</pre><pre>7</pre><h4><strong>First, Take, TakeSample</strong></h4><p>first() returns the first element in the RDD.</p><pre># first<br>distributedData.first()</pre><pre>1</pre><p>take(n) returns the first <em>n</em> elements of the RDD.</p><pre># take<br>distributedData.take(4)</pre><pre>[1, 5, 10, 15]</pre><p>takeSample(withReplacement=True|False, num) returns a sample from the RDD with a size of num, with or without replacement.</p><pre># takeSample<br>distributedData.takeSample(withReplacement=True, num=5)</pre><pre>[5, 20, 5, 15, 5]</pre><h4><strong>CountByKey</strong></h4><p>countByKey() can be used on RDDs with elements of (K, V). The result is a dictionary (hashmap) that maps each key K to the number of values paired with it.</p><pre># distributedData4= [(&quot;red&quot;, 1), (&quot;red&quot;, 2), (&quot;red&quot;, 3), (&quot;blue&quot;, 4), (&quot;blue&quot;, 5)]<br>dict(distributedData4.countByKey())</pre><pre>{&#39;red&#39;: 3, &#39;blue&#39;: 2}</pre><h3>References</h3><ol><li><a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html">https://spark.apache.org/docs/latest/rdd-programming-guide.html</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5b5968c0ac9d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: Polynomial Regression]]></title>
            <link>https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-multiple-linear-regression-b3ddafd18008?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/b3ddafd18008</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[regression]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Mon, 22 May 2023 04:47:43 GMT</pubDate>
            <atom:updated>2023-05-22T17:51:42.240Z</atom:updated>
<content:encoded><![CDATA[<h3>An Introduction to Machine Learning in Python: Polynomial Regression</h3><p>Polynomial regression can identify a nonlinear relationship between an independent variable and a dependent variable.</p><h3>Background</h3><p>This article is the fourth in a series on regression, gradient descent, and MSE. The previous articles cover <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">Simple Linear Regression</a>, <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf">The Normal Equation for Regression</a>, and <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-multiple-linear-regression-in-python-6f2335d0dcbe">Multiple Linear Regression</a>.</p><h3>Polynomial Regression</h3><p>Polynomial regression is used on complex data that would be best fit with curves. It can be treated as a subset of multiple linear regression.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*b0GWgW6ksQ3W-ss6.png" /></figure><p>Note that <strong><em>X₀ </em></strong>is a column of ones for the bias; this allows for the generalized formula discussed in the first article. Using the equation above, each “independent” variable can be considered an exponentiated version of <strong><em>X₁</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*5WWGgQzrXOBoDfUxn6TNsA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*ArO5QnFcrJ0ADzkbCkfpTA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/142/1*oTiFDJaFnPfxS9AnwI7A8A.png" /></figure><p>This allows the same model from multiple linear regression to be used since only the coefficients of each variable need to be identified. A simple, third-degree polynomial model can be created as an example. 
Its equation follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/428/1*hswEP5aYxFEIA3for6cDyw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*5WWGgQzrXOBoDfUxn6TNsA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*ArO5QnFcrJ0ADzkbCkfpTA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/139/1*0uBA3psLyDQzwueFfq3a4w.png" /></figure><p>The generalized functions for the model, gradient descent, and the MSE can be used from the previous articles:</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br>      X: array of inputs  | (n samples, num features)<br><br>    Output:<br>      returns the output of X@w | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return torch.matmul(X, w)</pre><pre># mean squared error (MSE)<br>def MSE(Yhat, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      Yhat: array of predictions | (n samples, 1)<br>      Y: array of expected outputs | (n samples, 1)<br>    Output:<br>      returns the loss of the model, which is a scalar<br>  &quot;&quot;&quot;<br>  return torch.mean((Yhat-Y)**2) # mean((error)^2)</pre><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights<br>  &quot;&quot;&quot; <br><br>  n = X.shape[0]<br><br>  return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)</pre><h4>Creating the Data</h4><p>Now, all that is required is some data to train the model with. A “blueprint” function can be used, and randomness can be added. This follows the same approach as the previous articles. The blueprint can be seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/302/1*zWj10rVtf0fW4HP5Vqv-_Q.png" /></figure><p>A train set with a size of (800, 4) and a test set with a size of (200, 4) can be created. 
Note that each feature, except the bias, is an exponentiated version of the first.</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># features<br>X0 = torch.ones((1000,1))<br>X1 = (100*(torch.rand(1000) - 0.5)).reshape(-1,1) # generates 1000 random numbers from -50 to 50<br>X2, X3 = X1**2, X1**3<br>X = torch.hstack((X0,X1,X2,X3))<br><br># normal distribution with a mean of 0 and std of 8<br>normal = torch.distributions.Normal(loc=0, scale=8)<br><br># targets<br>Y = (3*X[:,3] + 2*X[:,2] + 1*X[:,1] + 5 + normal.sample(torch.ones(1000).shape)).reshape(-1,1)<br><br># train, test<br>Xtrain, Xtest = X[:800], X[800:]<br>Ytrain, Ytest = Y[:800], Y[800:]</pre><p>After defining the initial weights, the data can be plotted with the line of best fit.</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(4, 1))<br>w</pre><pre>tensor([[0.83],<br>        [0.13],<br>        [0.91],<br>        [0.82]])</pre><pre>import matplotlib.pyplot as plt<br><br>def plot_lbf():<br>  &quot;&quot;&quot;<br>    Output:<br>      prints the line of best fit in comparison to the train and test data<br>  &quot;&quot;&quot;<br><br>  # plot the train and test sets<br>  plt.scatter(Xtrain[:,1],Ytrain,label=&quot;train&quot;)<br>  plt.scatter(Xtest[:,1],Ytest,label=&quot;test&quot;)<br><br>  # plot the line of best fit<br>  X1_plot = torch.arange(-50, 50.1,.1).reshape(-1,1) <br>  X2_plot, X3_plot = X1_plot**2, X1_plot**3<br>  X0_plot = torch.ones(X1_plot.shape)<br>  X_plot = torch.hstack((X0_plot,X1_plot,X2_plot,X3_plot))<br><br>  plt.plot(X1_plot.flatten(), model(w, X_plot).flatten(), color=&quot;red&quot;, zorder=4)<br><br>  plt.xlim(-50, 50)<br>  plt.xlabel(&quot;$X$&quot;)<br>  plt.ylabel(&quot;$Y$&quot;)<br>  plt.legend()<br>  plt.show()<br><br>plot_lbf()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/609/1*MmQ2EEAiNyWJNJp1zZwBrA.png" /><figcaption>Image by Author</figcaption></figure><h4>Training the Model</h4><p>To partially minimize the cost function, a learning rate of 5e-11 and 500,000 epochs can be used with gradient descent.</p><pre>lr = 5e-11<br>epochs = 500000<br><br># training loop<br>for i in range(0, epochs):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 100,000 epochs<br>  if (i+1) % 100000 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w)<br>    print(&quot;Train MSE:&quot;, MSE(model(w,Xtrain), Ytrain))<br>    print(&quot;Test MSE:&quot;, MSE(model(w,Xtest), Ytest))<br>    print(&quot;=&quot;*10)<br><br>plot_lbf()</pre><pre>epoch: 100000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(163.87)<br>Test MSE: tensor(162.55)<br>==========<br>epoch: 200000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(163.52)<br>Test MSE: tensor(162.22)<br>==========<br>epoch: 300000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(163.19)<br>Test MSE: tensor(161.89)<br>==========<br>epoch: 400000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(162.85)<br>Test MSE: tensor(161.57)<br>==========<br>epoch: 500000<br>weights: tensor([[0.83],<br>        [0.13],<br>        [2.00],<br>        [3.00]])<br>Train MSE: tensor(162.51)<br>Test MSE: tensor(161.24)<br>==========</pre><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/609/1*soBQARwqgtYztsM_jt_I8g.png" /><figcaption>Image by Author</figcaption></figure><p>Even with 500,000 epochs and an extremely small learning rate, the model fails to identify the first two weights. While the current solution is highly accurate with an MSE of 161.24, it would likely require millions of epochs to completely minimize it. This is one of the limitations of gradient descent for polynomial regression.</p><h4>The Normal Equation</h4><p>As an alternative, the Normal Equation from the second article can be used to directly compute the optimized weights:</p><pre>def NormalEquation(X, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      X: array of input values | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      <br>    Output:<br>      returns the optimized weights | (num features, 1)<br>  &quot;&quot;&quot;<br>  <br>  return torch.inverse(X.T @ X) @ X.T @ Y<br><br>w = NormalEquation(Xtrain, Ytrain)<br>w</pre><pre>tensor([[4.57],<br>        [0.98],<br>        [2.00],<br>        [3.00]])</pre><p>The Normal Equation is able to immediately identify the correct values for each weight, and the MSE for each set is about 100 points lower than with gradient descent:</p><pre>MSE(model(w,Xtrain), Ytrain), MSE(model(w,Xtest), Ytest)</pre><pre>(tensor(60.64), tensor(63.84))</pre><h3>Conclusion</h3><p>With simple linear, multiple linear, and polynomial regression implemented, the next two articles will cover Lasso and Ridge regression. These types of regression introduce two important concepts in machine learning: overfitting and regularization.</p><p>Please don’t forget to like and follow! :)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b3ddafd18008" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python]]></title>
            <link>https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/28dc37d524cf</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[regression]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Mon, 22 May 2023 03:15:11 GMT</pubDate>
            <atom:updated>2023-05-22T05:01:21.173Z</atom:updated>
            <content:encoded><![CDATA[<p>The Normal Equation is a closed-form solution for minimizing a cost function and identifying the coefficients for regression.</p><h3>Background</h3><p>In the previous article, <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>, the gradient descent approach was used to minimize the MSE cost function. However, the approach required a large number of epochs and a small learning rate, both of which are difficult to identify in a short amount of time.</p><p>An alternative approach is a closed-form solution that does not require a learning rate or epochs. The closed-form solution for regression is known as the Normal Equation. It can be used to directly determine the weights of a line of best fit. It will be derived in this article and then implemented in Python.</p><h3>Deriving the Normal Equation</h3><p>In <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb">A Simple Introduction to Gradient Descent</a>, the matrix derivative of the MSE was calculated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/378/0*C_-7fZVLU96Z5s_w.png" /></figure><p>This partial derivative can be set equal to 0, which indicates where the cost function is at a minimum for each weight. By solving for <strong><em>w</em></strong>, a direct equation to calculate these values can be identified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/303/1*jfRquikjBBEu4TA_-k6XWw.png" /><figcaption>set equal to 0</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/267/1*3IhU9Drz-eQ_Hczqu-FUBg.png" /><figcaption>multiply by n/2</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/198/1*cE9UV9Q-J9f9o4F34hMT-w.png" /><figcaption>place each term on its own side</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/274/1*Hu_k2xmn4td6VM1_ug3tYw.png" /><figcaption>transpose both sides</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/182/1*PfitX-8UtR23NFxVCuLE_A.png" /><figcaption>simplify</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*SmH0MFGgcnjT3u8L12Du5Q.png" /><figcaption>use the inverse of X^TX to isolate <strong><em>w</em></strong></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/232/1*w4R9FpLuNFmGJPY5xKFvHw.png" /><figcaption>simplify</figcaption></figure><p>To prove this returns new weights as anticipated, the size of each component can be examined:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XEjpMxHlOXdHBho7ec6gBA.png" /></figure><p>The output is a vector with a size of <strong><em>(num features, 1</em></strong>). 
This is the same size as the original weight vector from the previous article, <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>.</p><h3>Implementing the Normal Equation in Python</h3><p>This equation can be implemented in Python, and the same example from the previous article can be used.</p><pre>def NormalEquation(X, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      X: array of input values | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      <br>    Output:<br>      returns the optimized weights | (num features, 1)<br>  &quot;&quot;&quot;<br>  <br>  return torch.inverse(X.T @ X) @ X.T @ Y</pre><p>With the function created, all that is necessary is some input data, which is generated below:</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># (n samples, features)<br>X = torch.randint(low=0, high=11, size=(20, 1))<br><br># normal distribution with a mean of 0 and std of 1<br>normal = torch.distributions.Normal(loc=0, scale=1)<br><br># generate output<br>Y = (1.5*X + 2) + normal.sample(X.shape)<br><br># add bias column<br>X = torch.hstack((torch.ones(X.shape),X))</pre><p>These can be plugged into the Normal Equation to generate the optimized weights:</p><pre>w = NormalEquation(X, Y)<br>w</pre><pre>tensor([[1.97],<br>        [1.52]])</pre><p>These weights are nearly identical to the coefficients of the blueprint function. Instead of 2 and 1.5, the equation produced 1.97 and 1.52. They aren’t perfect due to the randomness added to the output. Furthermore, these values are more accurate than those from the previous article since a learning rate and a specific number of epochs did not have to be selected.</p><h3>When to Use it</h3><p>While this approach seems preferable to gradient descent, both have their use cases. For simple problems with small datasets, the Normal Equation will suffice. As the number of features grows, so does the size of the inverted matrix, which has a size of <strong><em>(num features, num features)</em></strong>. This can be expensive to compute.</p><p>When the number of features is large, gradient descent should be used. Gradient descent can also be used to create a generalized equation that does not overfit to the train data.</p><p>For the next two articles, both approaches will be used. The next article is <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-multiple-linear-regression-in-python-6f2335d0dcbe">An Introduction to Machine Learning in Python: Multiple Linear Regression</a>.</p><p>Please don’t forget to like and follow! :)</p><h3>References</h3><ol><li><a href="https://www.datacamp.com/tutorial/tutorial-normal-equation-for-linear-regression">Normal Equation Overview</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=28dc37d524cf" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: Multiple Linear Regression]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-multiple-linear-regression-in-python-6f2335d0dcbe?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/6f2335d0dcbe</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[introduction]]></category>
            <category><![CDATA[regression]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Fri, 19 May 2023 17:40:25 GMT</pubDate>
            <atom:updated>2023-05-22T04:48:29.640Z</atom:updated>
<content:encoded><![CDATA[<p>Multiple linear regression is used to assess the relationship between many independent variables and one dependent variable.</p><h3>Background</h3><p>This article follows <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>; that article covered simple linear regression, gradient descent, and the MSE. This article will cover multiple linear regression and introduce some new machine learning terminology.</p><h3>Multiple Linear Regression</h3><p>While simple linear regression has an equation of <strong><em>Ŷ = w₁X₁ + w₀X₀</em></strong>, multiple linear regression has a generic formula for <strong><em>k </em></strong>independent variables:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/1*Z81ebPzgAKUUqcZwzbscWg.png" /></figure><p>Note that <strong><em>X₀ </em></strong>is a column of ones for the bias; this allows for the generalized formula discussed in the first article. As the formula demonstrates, multiple linear regression helps identify the relationship between many independent variables (<strong><em>X</em></strong>) and a single dependent variable (<strong><em>Ŷ</em></strong>). It does this by learning the values of each weight (<strong><em>w</em></strong>).</p><p>To demonstrate this in action, multiple linear regression with 3 weights can be used:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/327/1*ROoN1bdtO9Zlv8xV3diBNw.png" /></figure><p>This formula will create a “plane of best fit” for three-dimensional data:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/387/1*feOis6XGgSBKgE4E7Xn0SQ.png" /><figcaption>Image by Author</figcaption></figure><h3>The Implementation</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/407/1*TUeQU5YstsF6ko_qDGRE4w.png" /><figcaption>Image by Author</figcaption></figure><p>Like the previous example, a “blueprint” equation can be used, and randomness can be added. Then, the model can try and learn the weights. When using a model on real data, it is common to split the data into at least two sets: the train set and the test set. The train set is used to train the model and acquire the weights. The test set is used to evaluate the model’s performance on data it has never seen before. If it performs well on both, it is more likely to be useful for real-world applications. It is common to use a split of 80% train, 20% test.</p><p>For this example, 1000 samples can be generated and split into test and train sets. Also, since there are two independent variables and a bias, there will be three columns. Each column will represent an independent variable or bias, and each row will represent a sample of the variables, (<strong><em>X₀, X₁, X₂</em></strong>). The shape of the overall data will be a matrix with a size of (1000, 3); a general shape would be (<strong><em>n samples, num features</em></strong>). Remember, independent variables are also known as features in machine learning. The train set will have a size of (800, 3), and the test set will have a size of (200, 3).</p><p>The data for this article will be based around <strong><em>Y = 6X₂ + 3X₁ + 2</em></strong>. This means <strong><em>w₀ </em></strong>is 2, <strong><em>w₁</em></strong> is 3, and <strong><em>w₂ </em></strong>is 6. 
In the example below, 1000 values between -250 and 250 are generated for <strong><em>X₁ </em></strong>and <strong><em>X₂, </em></strong>and 1000 ones are generated for <strong><em>X₀</em></strong>. They are reshaped into columns and stacked horizontally to create a matrix with a size of (1000, 3). The output is generated using the aforementioned equation, and values from a normal distribution with a standard deviation of 10 are added.</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># create ones for the bias | 1000 ones<br>X0 = torch.ones(1000).reshape(-1,1)<br><br># create values for the first feature | 1000 numbers from -250 to 250<br>X1 = (500*(torch.rand(1000) - 0.5)).reshape(-1,1) <br><br># create values for the second feature | 1000 numbers from -250 to 250<br>X2 = (500*(torch.rand(1000) - 0.5)).reshape(-1,1)<br><br># stack data together, X0 = X[:,0], X1 = X[:,1], X2 = X[:,2]<br>X = torch.hstack((X0, X1,X2))<br><br># normal distribution with a mean of 0 and std of 10<br>normal = torch.distributions.Normal(loc=0, scale=10)<br><br># output<br>Y = ((6*X[:,2] + 3*X[:,1] + 2*X[:,0]) + normal.sample(torch.ones(1000).shape)).reshape(-1,1)</pre><p>This data can be previewed before being split.</p><pre>import plotly.express as px<br><br>fig = px.scatter_3d(x=X[:,1].flatten(),<br>                    y=X[:,2].flatten(),<br>                    z=Y.flatten())<br><br>fig.update_traces(marker_size=3)<br>fig.update_layout(scene = dict(xaxis_title=&#39;X&lt;sub&gt;1&lt;/sub&gt;&#39;, <br>                               yaxis_title=&#39;X&lt;sub&gt;2&lt;/sub&gt;&#39;, <br>                               zaxis_title=&#39;Y&#39;))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/508/1*a_c3iA12mGGi_V-cFxnjdA.png" /><figcaption>Image by Author</figcaption></figure><p>Now, the data can be split into test and train data:</p><pre># split the data<br>Xtrain, Xtest = X[:800], X[800:]<br>Ytrain, Ytest = Y[:800], Y[800:]</pre><p>The train data can then be fit with a plane. To start, the functions for the model, MSE, and gradient descent need to be defined. The same ones from the first article, <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>, can be used. 
The end of the article will use the Normal Equation to verify the answer.</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br>      X: array of inputs  | (n samples, num features)<br><br>    Output:<br>      returns the output of X@w | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return torch.matmul(X, w)</pre><pre># mean squared error (MSE)<br>def MSE(Yhat, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      Yhat: array of predictions | (n samples, 1)<br>      Y: array of expected outputs | (n samples, 1)<br>    Output:<br>      returns the loss of the model, which is a scalar<br>  &quot;&quot;&quot;<br>  return torch.mean((Yhat-Y)**2) # mean((error)^2)</pre><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights<br>  &quot;&quot;&quot; <br><br>  n = X.shape[0]<br><br>  return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)</pre><h4>Training the Model</h4><p>With the functions created, the model can be trained to identify the plane of best fit. To start, three random weights can be generated.</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(3, 1))<br>w</pre><pre>tensor([[0.83],<br>        [0.13],<br>        [0.91]])</pre><p>The current plane of best fit and its MSE can be analyzed below. The plane is in orange, the train set is in red, and the test set is in green.</p><pre>import plotly.graph_objects as go<br><br>def plot_model(x1_range, x2_range):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      x1_range: x1-axis range [low, high]<br>      x2_range: x2-axis range [low, high]<br><br>    Global Variables:<br>      Xtrain: array of inputs | (n train samples, num features)<br>      Ytrain: array of expected outputs | (n train samples, 1)<br>      Xtest:  array of inputs | (n test samples, num features)<br>      Ytest:  array of expected outputs | (n test samples, 1)<br>      <br>    Output:<br>      plots the plane of best fit<br>  &quot;&quot;&quot; <br><br>  # meshgrid of possible combinations of (X1, X2)<br>  X1_plot, X2_plot = torch.meshgrid(torch.arange(x1_range[0], x1_range[1], 5),<br>                                    torch.arange(x2_range[0], x2_range[1], 5))<br>  X0_plot = torch.ones(X1_plot.shape)<br>  <br>  # stack together each point (X1, X2) = (X, Y)<br>  X_plot = torch.hstack((X0_plot.reshape(-1,1),<br>                         X1_plot.reshape(-1,1), <br>                         X2_plot.reshape(-1,1)))<br>  <br>  # all possible model predictions (Yhat = Z)<br>  Yhat = model(w, X_plot)<br><br>  # model&#39;s plane of best fit<br>  fig = go.Figure(data=[go.Mesh3d(x=X_plot[:,1].flatten(), <br>                                  y=X_plot[:,2].flatten(), <br>                                  z=Yhat.flatten(), <br>                                  color=&#39;orange&#39;, <br>                                  opacity=0.50)])<br>  <br>  # training data<br>  fig.add_scatter3d(x=Xtrain[:,1].flatten(),<br>                    y=Xtrain[:,2].flatten(),<br>                    z=Ytrain.flatten(), <br>                    mode=&quot;markers&quot;,<br>                    marker=dict(size=3),<br>                    
name=&quot;train&quot;)<br>  <br>  # test data<br>  fig.add_scatter3d(x=Xtest[:,1].flatten(),<br>                    y=Xtest[:,2].flatten(),<br>                    z=Ytest.flatten(), <br>                    mode=&quot;markers&quot;,<br>                    marker=dict(size=3),<br>                    name=&quot;test&quot;)<br>  <br>  # name axes<br>  fig.update_layout(scene = dict(xaxis_title=&#39;X&lt;sub&gt;1&lt;/sub&gt;&#39;, <br>                                 yaxis_title=&#39;X&lt;sub&gt;2&lt;/sub&gt;&#39;, <br>                                 zaxis_title=&#39;Y&#39;))<br><br>  fig.show()<br><br>plot_model([-250,250], [-250,250])</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/807/1*QmAkIJ23TOpzXfyBD49fyw.png" /><figcaption>Image by Author</figcaption></figure><pre>MSE(model(w,Xtrain), Ytrain)</pre><pre>tensor(653812.81)</pre><p>Now, a training loop can be created to minimize the MSE. By using 50,000 epochs and a learning rate of 0.00004, the output becomes extremely accurate. These values were chosen empirically. Both the train and test MSE can be seen as well.</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(3, 1))<br><br>lr = 0.00004<br>epochs = 50000<br><br># update the weights each epoch<br>for i in range(0, epochs):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 10,000 epochs<br>  if (i+1) % 10000 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w)<br>    print(&quot;Train MSE:&quot;, MSE(model(w,Xtrain), Ytrain))<br>    print(&quot;Test MSE:&quot;, MSE(model(w,Xtest), Ytest))<br>    print(&quot;=&quot;*10)<br><br>plot_model([-250,250], [-250,250])</pre><pre>epoch: 10000<br>weights: tensor([[1.51],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.52)<br>Test MSE: tensor(70.03)<br>==========<br>epoch: 20000<br>weights: tensor([[1.82],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.20)<br>Test MSE: tensor(70.04)<br>==========<br>epoch: 30000<br>weights: tensor([[1.96],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.12)<br>Test MSE: tensor(70.11)<br>==========<br>epoch: 40000<br>weights: tensor([[2.02],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.09)<br>Test MSE: tensor(70.16)<br>==========<br>epoch: 50000<br>weights: tensor([[2.05],<br>        [2.98],<br>        [6.00]])<br>Train MSE: tensor(87.08)<br>Test MSE: tensor(70.18)<br>==========</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/784/1*j4UG-uy9rdxOhVQjm47HCg.png" /><figcaption>Image by Author</figcaption></figure><p>The model predicted the plane of best fit to be <strong><em>Ŷ = 6X₂ + 2.98X₁ + 2.05 </em></strong>instead of <strong><em>Y = 6X₂ + 3X₁ + 2. </em></strong>The train and test MSEs are within 17 points of each other, which indicates the model generalizes to the unseen test data. The main limitation of this approach is the number of epochs required to minimize the loss function. An alternative approach would be to use a closed-form solution, which was covered in the previous article, <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf">An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python</a>. 
A closed-form solution does not require a learning rate or epochs to acquire the weights for minimization.</p><h3>The Normal Equation</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/232/1*w4R9FpLuNFmGJPY5xKFvHw.png" /><figcaption>The Normal Equation</figcaption></figure><pre>def NormalEquation(X, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      X: array of input values | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      <br>    Output:<br>      returns the optimized weights | (num features, 1)<br>  &quot;&quot;&quot;<br>  <br>  return torch.inverse(X.T @ X) @ X.T @ Y</pre><p>The code above is the Python implementation for the Normal Equation, which was derived in the previous article. The training data from this example can be used in the equation to directly calculate the optimized weights.</p><pre>w = NormalEquation(Xtrain, Ytrain)<br><br>w</pre><pre>tensor([[2.19],<br>        [2.98],<br>        [6.00]])</pre><p>The MSE can also be calculated:</p><pre>MSE(model(w, Xtrain), Ytrain), MSE(model(w, Xtest), Ytest)</pre><pre>(tensor(87.08), tensor(70.20))</pre><p>The weights and MSE from the Normal Equation and Gradient Descent approaches are nearly identical. In this case, both are equally valid.</p><h3>Conclusion</h3><p>Multiple linear regression is useful for identifying the relationship between two or more independent variables, or features, and one dependent variable. It is also the basis of polynomial regression, which will be examined in the next article: <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-multiple-linear-regression-b3ddafd18008">An Introduction to Machine Learning in Python: Polynomial Regression</a>.</p><p>Please don’t forget to like and follow! :)</p><h3>References</h3><ol><li><a href="https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html">Sklearn Regression Example</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6f2335d0dcbe" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Simple Introduction to Gradient Descent]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/1f32a08b0deb</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[calculus]]></category>
            <category><![CDATA[gradient-descent]]></category>
            <category><![CDATA[introduction]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Wed, 17 May 2023 19:24:04 GMT</pubDate>
            <atom:updated>2023-06-10T18:44:25.299Z</atom:updated>
<content:encoded><![CDATA[<p>Gradient descent is one of the most common optimization algorithms in machine learning. Understanding its basic implementation is fundamental to understanding all the advanced optimization algorithms built on it.</p><h3>Background</h3><p>This article is supplementary to <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>. That article should be read first or in conjunction with this one. It would also be beneficial to have a basic understanding of partial derivatives in calculus because this article examines the partial derivatives of several variations of the Mean Squared Error (MSE).</p><h3>Optimization Algorithms</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/516/1*JcwO8FHnVeJvvzrm3EE1fg.png" /><figcaption>Image by Author</figcaption></figure><p>In machine learning, optimization is the process of finding the ideal parameters, or weights, to maximize or minimize a cost or loss function. The global maximum is the largest value on the domain of the function, whereas the global minimum is the smallest value. While there is only one global maximum and/or minimum, there can be many local maxima and minima. The global minimum or maximum of a cost function indicates where a model’s parameters generate predictions that are close to the actual targets. The local maxima and minima can cause problems when training a model, so their presence should always be considered. The plot above shows an example of each.</p><p>There are a few major algorithm groups within this category: bracketing, local descent, first-order, and second-order. The focus of this article will be first-order algorithms that use the first derivative for optimization. Within this category, the gradient descent algorithm is the most popular.</p><h3>Gradient Descent in One Dimension</h3><p>Gradient descent is a first-order, iterative optimization algorithm used to minimize a cost function. By using partial derivatives, a direction, and a learning rate, gradient descent decreases the error, or difference, between the predicted and actual values.</p><p>The idea behind gradient descent is that the derivative of each weight will reveal its direction and influence on the cost function. In the image below, the cost function is <strong><em>f(w) = w²</em></strong>, which is a parabola. The minimum is at (0,0), and the current weight is -5.6. The current loss is 31.36, and the line in orange represents the derivative, or current rate of change for the weight, which is -11.2. This indicates the weight needs to move “downhill” — or become more positive — to reach a loss of 0. This is where gradient descent comes in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/589/1*0lLDyjlN_kv01Tmq3sFjrw.png" /><figcaption>Image by Author</figcaption></figure><p>By scaling the gradient with a value known as the learning rate and subtracting the scaled gradient from the weight’s current value, the output will minimize. This can be seen in the sketch and image below.</p>
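<p>As a quick illustration, the update rule can be written in a few lines of plain Python. This is a minimal sketch, assuming the same cost function <strong><em>f(w) = w²</em></strong>, starting weight of -5.6, and learning rate of 0.1 used in this section; the variable names are illustrative:</p><pre># minimal gradient descent on f(w) = w**2, where the derivative is 2w<br>w = -5.6   # starting weight<br>lr = 0.1   # learning rate<br><br>for i in range(10):<br>  grad = 2*w        # derivative of the cost function at the current weight<br>  w = w - lr*grad   # update rule: new weight = old weight - lr * gradient<br>  print(f&quot;iteration {i+1}: w = {w:.2f}, loss = {w**2:.2f}&quot;)</pre><p>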
In ten iterations (<strong><em>w₀ to w₉</em></strong>), a learning rate of 0.1 is used to minimize the cost function.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*nErkxEApBl1F4B5cGf2ufg.png" /><figcaption>Image by Author</figcaption></figure><p>In the steps for the algorithm below, a weight is represented by <strong><em>w</em></strong>, with <strong><em>j </em></strong>representing its current value and <strong><em>j+1 </em></strong>representing its new value. The cost function to measure the error is represented by <strong><em>f</em></strong>, and the partial derivative is the gradient of the cost function with respect to the parameters. The learning rate is represented by <strong><em>α</em></strong>.</p><ul><li>select a learning rate and the number of iterations</li><li>choose random values for the parameters</li><li>update the parameters with the equation below</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/298/1*0-yHjw4yeUXG1HiP7JYUfg.png" /></figure><ul><li>repeat step three until the max number of iterations is reached</li></ul><p>When taking a partial derivative of a function, only one parameter can be assessed at a time, and the other parameters are treated as constants. For the example above, <strong><em>f(w) = w²</em></strong>, there is only one parameter, so the derivative is <strong><em>f′(w) = 2w</em></strong>. The formula for updating the parameter follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/244/1*wNcty4IcWh5odwh8p5eG6A.png" /></figure><p>Using a learning rate of 0.1 and a starting weight of -5.6, the first ten iterations follow:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1020/1*gvbWAlNKaxGkUPC7qMvGfA.png" /><figcaption>Table by Author</figcaption></figure><p>The table demonstrates how each component of the formula helps minimize the loss. By negating the scaled gradient, the new weight becomes more positive, and the slope of the new gradient is less steep. As the slope approaches zero, each iteration yields a smaller update.</p><p>This basic implementation of gradient descent can be applied to almost any cost function, including those with numerous weights. A few variations of the mean squared error can be considered.</p><h3>Gradient Descent with the Mean Squared Error (MSE)</h3><h4>What is the MSE?</h4><p>A popular cost function for machine learning is the Mean Squared Error (MSE).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/0*SydHa5UHkx4Zy2hU.png" /></figure><p>This function finds the difference between the model’s prediction (<strong><em>Ŷ</em></strong>) and the expected output (<strong><em>Y</em></strong>). It then squares the difference to ensure the output is always positive. This means <strong><em>Ŷ </em></strong>or <strong><em>Y </em></strong>can come first when calculating the difference. This is repeated across a set of points with a size of <strong><em>n</em></strong>. By summing the squared difference of all these points and dividing by <strong><em>n</em></strong>, the output is the mean squared difference (error). It is an easy way of assessing the model’s performance on all the points simultaneously. A simple example can be seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/438/1*kkowOm3rlGGoWVTY-Cs4XQ.png" /><figcaption>Table by Author</figcaption></figure><p>In this formula, <strong><em>Ŷ </em></strong>represents a model’s prediction.</p>
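<p>To make the calculation concrete, the MSE can be computed in one line of PyTorch. This is a minimal sketch with made-up predictions and targets, not the values from the table above:</p><pre>import torch<br><br># illustrative predictions and expected outputs<br>Yhat = torch.tensor([2., 4., 6.])<br>Y = torch.tensor([3., 4., 5.])<br><br># mean((Yhat - Y)**2) = mean([1., 0., 1.])<br>print(torch.mean((Yhat - Y)**2))  # tensor(0.6667)</pre><p>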
In regression, the model’s equation may contain one or more weights depending on the requirements of the training data. The table below reflects these situations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5l_3sENNyrNJNcyHBf2EbA.png" /><figcaption>Table by Author</figcaption></figure><p>Now, to perform gradient descent with any of these equations, their gradients must be calculated. The gradient contains the partial derivatives for a function:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/231/1*Wt9qfS_MRVCbPoum8A8E-w.png" /></figure><p>Each weight’s partial derivative has to be calculated. A partial derivative is calculated in the same manner as a normal derivative, but every variable that is not being considered must be treated as a constant. The gradients for the MSE variations listed above can be examined below.</p><h4>One Weight</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/848/1*bPwMBers8K7OhVzHFM3chg.png" /></figure><p>When taking the gradient of the MSE with only one weight, the derivative can be calculated with respect to <strong><em>w</em></strong>. <strong><em>X, Y, </em></strong>and <strong><em>n </em></strong>must be treated as constants. With this in mind, the fraction and sum can be moved outside of the derivative:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/278/1*6Cl1I-Mg-vetfOGA0heGPA.png" /></figure><p>From here, the chain rule can be used to calculate the derivative with respect to <strong><em>w:</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*TBLgbxb_JLxTTR7yE4d_-A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/272/1*MA6ZHLFvdKdFJ8YrJl_JxA.png" /></figure><p>Now, this can be simplified:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/248/1*xu16RCsw0d3QrLzDTXzgww.png" /></figure><h4>Two Weights</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OeuOeS1MeIU26uPUkwRS3g.png" /></figure><p>When taking the gradient of the MSE with two weights, the partial derivatives must be taken with respect to both parameters, <strong><em>w₀</em></strong> and <strong><em>w₁</em></strong>. When taking the partial derivative of <strong><em>w₀</em></strong>, the following variables are treated as constants: <strong><em>X, Y, n, </em></strong>and <strong><em>w₁. </em></strong>When taking the partial derivative of <strong><em>w₁</em></strong>, the following variables are treated as constants: <strong><em>X, Y, n, </em></strong>and <strong><em>w₀. </em></strong>The same steps as the previous example can be repeated. 
First, the fraction and sum can be moved outside the derivative.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/368/1*1H4JaoH8Hh40IXcbj60kzA.png" /></figure><p>From here, the chain rule can be used to calculate the derivative with respect to each weight<strong><em>:</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/575/1*gK7gnb-DD4oa4HhUeOaBCQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/353/1*S-R-5lpGi9TvcNXOSII2pw.png" /></figure><p>Finally, they can be simplified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/328/1*Eo1PaDJmUwYt5diOcDrOgA.png" /></figure><p>Notice that the only difference between the equations is <strong><em>X</em>.</strong></p><h4>Three Weights</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v2EwAgJZPaGxS7qinxIDfg.png" /></figure><p>When taking the gradient of the MSE with three weights, the partial derivatives must be taken with respect to each parameter. When taking the partial derivative of one weight, <strong><em>X, Y, n, </em></strong>and the other two weights will be treated as constants. The same steps as the previous example can be repeated. First, the fraction and sum can be moved outside the derivative.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/490/1*yNsolPd5Hu_mxNYR-J4opQ.png" /></figure><p>From here, the chain rule can be used to calculate the derivative with respect to each weight<strong><em>:</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/820/1*Ji-_RICw00yvDFir5FT96w.png" /></figure><p>Finally, they can be simplified.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/462/1*gDi_4fBSF5WPnoaw-yHOiA.png" /></figure><p>As mentioned previously, the only difference between each partial derivative is the input feature, <strong><em>X</em></strong>. This can be generalized for <strong><em>k</em></strong> weights in the next example.</p><h4>More Than Three Weights</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FpxvKoDt0uXP6KFmsFNlHw.png" /></figure><p>When taking the gradient of the MSE with <strong><em>k </em></strong>weights, the partial derivatives must be taken with respect to each parameter. When taking the partial derivative of one weight, <strong><em>X, Y, n, </em></strong>and the other <strong><em>k-1 </em></strong>weights will be treated as constants. As seen in the previous example, only the input feature of each partial derivative changes when there are more than two weights.</p><h3>Matrix Derivation</h3><p>The formulas above show how to use gradient descent without explicitly taking advantage of vectors and matrices. However, most of machine learning is best understood by using their operations. For a quick overview, see <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-tensors-c4a8321efffc">A Simple Introduction to Tensors</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/1*TcGBkemiYGef5Cq41r3ZoA.png" /></figure><p>The rest of this article will be dedicated to using matrix calculus to derive the derivative of the MSE. To start, <strong><em>Ŷ </em></strong>and <strong><em>Y </em></strong>should be understood as matrices with sizes of (<strong><em>n samples</em></strong>, 1). 
Both are matrices with 1 column and <strong><em>n</em></strong> rows, or they can be viewed as column vectors, which would change their notation to lowercase:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/285/1*gIiJbn7KjRCCPAt3ZSAbTA.png" /></figure><p>The MSE is element-wise vector subtraction between <strong><em>ŷ </em></strong>and <strong><em>y,</em></strong> followed by the dot product of the difference with itself. Remember, the dot product can only occur if sizes are compatible. Since the goal is to have a scalar output, the first vector must be transposed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/313/1*JHlkogXRY0WM649FjkA_aw.png" /></figure><p>From here, <strong><em>ŷ </em></strong>can be replaced with <strong><em>Xw </em></strong>for regression. <strong><em>X</em></strong> is a matrix with a size of <strong><em>(n samples, num features)</em></strong>, and <strong><em>w </em></strong>is a column vector with a size of <strong><em>(num features, 1)</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/375/1*u6TnroyqJl63NymP7VbUAQ.png" /></figure><p>The next step is to simplify the equation before taking the derivative. Notice that <strong><em>w</em></strong> and <strong><em>X </em></strong>switch positions to ensure their multiplication is still valid: <strong><em>(1, num features) x (num features, n samples) = (1, n samples)</em></strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/411/1*uFErPkr_KhJaWbuKWXv6Hw.png" /></figure><p>These error calculations can then be multiplied together.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/595/1*XSAUAZ-JuJVnCiqmnz6-tw.png" /></figure><p>Notice that the third term can be rewritten by transposing it, following the third property on this <a href="https://en.wikipedia.org/wiki/Transpose#:~:text=respects%20addition.-,(,.,-Note%20that%20the">page</a>. Then, it can be added to the second term.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/635/1*wW0MeM_wLZd85Q-KEEwyNA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/578/1*q4OiXeBko3dx1VDtePqAIg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/481/1*3kIypu9z-b2biH9UQhYgLQ.png" /></figure><p>Now, the partial derivative of the MSE can be taken with respect to the weight.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/539/1*iNHM1K0r5U6NQ0DsqSEDRQ.png" /></figure><p>This is equivalent to taking the derivative of each term:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/678/1*e6SPHcGoHATprzdch80sKg.png" /></figure><p>Each term that is not <strong><em>w </em></strong>can be treated as a constant. The derivative of each component can be computed using these rules:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/587/1*1rQjIGA-ODmcCh-izT4DwQ.png" /></figure><p>The first term in the equation follows the fourth rule and becomes zero. 
The second term follows the first rule, and the third term follows the third rule.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/405/1*-1oX58unY4hRF5tavTrgUA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/378/1*Wp7HLZmtyopiJDiJABdZEw.png" /></figure><p>This equation can be used in gradient descent to simultaneously calculate all the partial derivatives:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/434/1*3hiPZu6oulzQ2YPOQmajIA.png" /></figure><h3>Conclusion</h3><p>The gradients for the variations of the MSE cost function can be easily used in gradient descent by plugging them into the formula. An example of gradient descent can be found in <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8">An Introduction to Machine Learning in Python: Simple Linear Regression</a>.</p><p>Please don’t forget to like and follow! :)</p><h3>References</h3><ol><li><a href="https://www.symbolab.com/solver/partial-derivative-calculator/">Symbolab Partial Derivative Calculator</a></li><li><a href="https://math.stackexchange.com/questions/4177039/deriving-the-normal-equation-for-linear-regression">Mathematics Stack Exchange on Deriving the Normal Equation</a></li><li><a href="https://en.wikipedia.org/wiki/Matrix_calculus#Vector-by-vector_identities">Matrix Calculus</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1f32a08b0deb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An Introduction to Machine Learning in Python: Simple Linear Regression]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-regression-and-machine-learning-in-python-5e6bd76b0bf8?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/5e6bd76b0bf8</guid>
            <category><![CDATA[gradient-descent]]></category>
            <category><![CDATA[regression]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Wed, 17 May 2023 19:12:58 GMT</pubDate>
            <atom:updated>2023-05-22T03:17:35.087Z</atom:updated>
<content:encoded><![CDATA[<p>Simple linear regression offers an elegant introduction to machine learning. It can be used to identify the relationship between an independent variable and a dependent variable. Using gradient descent, a basic model can be trained to fit a set of points for future prediction.</p><h3>Background</h3><p>This is the first article of a series covering regression, gradient descent, classification, and other fundamental aspects of machine learning. This article focuses on simple linear regression, which identifies the line of best fit for a set of points, allowing for future predictions to be made.</p><h3>The Line of Best Fit</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TAIhTDqmfODfEKntQ4fvwg.png" /><figcaption>Image by Author</figcaption></figure><p>The line of best fit is the equation that most accurately represents a set of points. For a given input, the equation’s output should be as close to the expected output as possible.</p><p>In the image above, it is clear that the middle line fits the blue points better than the left or right lines. However, is it the line of <em>best </em>fit? Could there be a fourth line that fits these points better? The line could certainly be shifted up or down to ensure an equal number of points fall above and below it. Even so, there could be a dozen lines that fit this exact criterion. What makes any one of them the best?</p><p>Thankfully, there is a way to mathematically identify a line of best fit for a set of points using regression.</p><h3>Regression</h3><p>Regression helps identify the relationship between two or more variables, and it takes many forms, including simple linear, multiple linear, polynomial, and more. To demonstrate the usefulness of this approach, simple linear regression will be used.</p><p>Simple linear regression attempts to find the line of best fit for a set of points. More specifically, it identifies the relationship between an independent variable and a dependent variable. The line of best fit has the form of <strong><em>y = mx + b</em></strong>.</p><ul><li><strong><em>x</em></strong> is the input or independent variable</li><li><strong><em>m</em></strong> is the slope, or steepness, of the line</li><li><strong><em>b</em></strong> is the <strong><em>y</em></strong>-intercept</li><li><strong><em>y</em></strong> is the output or dependent variable</li></ul><p>The goal of simple linear regression is to identify the values of <strong><em>m</em></strong> and <strong><em>b </em></strong>that will generate the most accurate <strong><em>y</em></strong> value when given an <strong><em>x</em></strong>. This equation, also known as the model, can also be expressed in machine learning terms. In the equation, <strong><em>w </em></strong>represents “weight”: <strong><em>Ŷ = Xw₁+ w₀</em></strong></p><ul><li><strong><em>X</em></strong> is the input or feature</li><li><strong><em>w₁</em></strong> is the slope</li><li><strong><em>w₀</em></strong> is the bias, or <strong><em>y</em></strong>-intercept</li><li><strong><em>Ŷ </em></strong>is the prediction, which is pronounced as “y-hat”</li></ul><p>While this equation is useful, it needs to be assessed for its accuracy; if its predictions are poor, the model has little value. To do this, a cost, or loss, function is used.</p><h4>The Cost or Loss Function</h4><p>Regression requires some method of tracking the accuracy of the model’s predictions. Given the inputs, are the outputs of the equation as close as possible to the expected output? 
A cost function, also known as a loss function, is used to identify the accuracy of an equation.</p><p>For instance, if the expected output is 5, and the equation outputs 18, the loss function should represent this difference. A simple loss function could output 13, which is the difference between these values. This indicates the model’s performance is poor. On the other hand, if the expected output is 5 and the model predicts 5, the loss function should output 0, which indicates the model’s performance is excellent.</p><p>A commonly used loss function that does this is the mean squared error (MSE):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/0*SydHa5UHkx4Zy2hU.png" /></figure><p>This function finds the difference between the model’s prediction (<strong><em>Ŷ</em></strong>) and the expected output (<strong><em>Y</em></strong>). It then squares the difference to ensure the output is always positive. It does this across a set of points with a size of <strong><em>n</em></strong>. By summing the squared difference of all these points and dividing by <strong><em>n</em></strong>, the output is the mean squared difference (error). It is an easy way of assessing the model’s performance on all the points simultaneously. A simple example can be seen below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/438/1*kkowOm3rlGGoWVTY-Cs4XQ.png" /><figcaption>Table by Author</figcaption></figure><p>While countless other loss functions are just as applicable to this situation, the MSE is one of the most popular loss functions for regression in machine learning due to its simplicity, especially when it comes to gradient descent, which will be explained later.</p><p>To best understand where gradient descent comes in, an example can be evaluated.</p><h3>Predicting the Line of Best Fit</h3><p>To show simple linear regression in action, data to train the model is required. This comes in the form of an <strong><em>X</em></strong> array and <strong><em>Y</em></strong> array. The data can be manually generated for this example from a “blueprint” function. Randomness can be added to the blueprint, forcing the model to learn the underlying function. PyTorch, a standard machine learning library, is used to implement regression.</p><h4>Generating the Data</h4><p>To start, the code below generates an array of input values using a random integer generator. <strong><em>X</em></strong> currently has a shape of <strong><em>(n samples, num features)</em></strong>. Remember, a feature is an independent variable, and simple linear regression has one. For this example, <strong><em>n</em></strong> will be 20.</p><pre>import torch<br><br>torch.manual_seed(5)<br>torch.set_printoptions(precision=2)<br><br># (n samples, features)<br>X = torch.randint(low=0, high=11, size=(20, 1))</pre><pre>tensor([[ 9],<br>        [10],<br>        [ 0],<br>        [ 3],<br>        [ 8],<br>        [ 8],<br>        [ 0],<br>        [ 4],<br>        [ 1],<br>        [ 0],<br>        [ 7],<br>        [ 9],<br>        [ 3],<br>        [ 7],<br>        [ 9],<br>        [ 7],<br>        [ 3],<br>        [10],<br>        [10],<br>        [ 4]])</pre><p>These values can then be passed through <strong><em>Y = 1.5X + 2</em></strong><em> </em>to generate output values, and some randomness can be added to these values using the normal distribution with a mean of 0 and standard deviation of 1. 
<strong><em>Y</em></strong> will have a shape of (<strong><em>n samples</em></strong>, <strong><em>1</em></strong>).</p><p>The code below shows the random values, which have the same shape.</p><pre>torch.manual_seed(5)<br><br># normal distribution with a mean of 0 and std of 1<br>normal = torch.distributions.Normal(loc=0, scale=1)<br><br>normal.sample(X.shape)</pre><pre>tensor([[ 1.84],<br>        [ 0.52],<br>        [-1.71],<br>        [-1.70],<br>        [-0.13],<br>        [-0.60],<br>        [ 0.14],<br>        [-0.15],<br>        [ 2.61],<br>        [-0.43],<br>        [ 0.35],<br>        [-0.06],<br>        [ 1.48],<br>        [ 0.49],<br>        [ 0.25],<br>        [ 1.75],<br>        [ 0.74],<br>        [ 0.03],<br>        [-1.17],<br>        [-1.51]])</pre><p>Finally, <strong><em>Y</em></strong> can be calculated with the code below.</p><pre>Y = (1.5*X + 2) + normal.sample(X.shape)<br><br>Y</pre><pre>tensor([[15.00],<br>        [15.00],<br>        [-0.36],<br>        [ 6.75],<br>        [13.59],<br>        [15.16],<br>        [ 2.33],<br>        [ 8.72],<br>        [ 2.67],<br>        [ 1.81],<br>        [13.74],<br>        [14.06],<br>        [ 7.15],<br>        [12.81],<br>        [15.91],<br>        [13.15],<br>        [ 6.76],<br>        [18.05],<br>        [18.71],<br>        [ 6.80]])</pre><p>They can also be plotted together with matplotlib for a better understanding of their relationship:</p><pre>import matplotlib.pyplot as plt<br><br>plt.scatter(X,Y)<br>plt.xlim(-1,11)<br>plt.ylim(0,20)<br>plt.xlabel(&quot;$X$&quot;)<br>plt.ylabel(&quot;$Y$&quot;)<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*58SIJWrpFd_NCwDEtNddUQ.png" /><figcaption>Image by Author</figcaption></figure><p>While it may seem counterintuitive to generate data for the example, it is a great way to demonstrate how regression works. The model, which can be seen below, will only be provided <strong><em>X</em></strong> and <strong><em>Y</em></strong>, and it will need to identify <strong><em>w₁ </em></strong>as 1.5<strong><em> </em></strong>and <strong><em>w₀ </em></strong>as 2.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/408/1*3PGGaaZ7BfxEMF_bkRojPg.png" /><figcaption>Image by Author</figcaption></figure><p>The weights can be stored in an array, <strong><em>w. </em></strong>This array will have two weights in it: one for the bias and one for the feature. It will have a shape of (<strong><em>num features + 1 bias, 1</em></strong>). For this example, the array will have a shape of (2, 1).</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(2, 1))<br>w </pre><pre>tensor([[0.83],<br>        [0.13]])</pre><p>With these values generated, the model can be created.</p><h4>Creating the Model</h4><p>The first step for the model is to define a function for the line of best fit and another for the MSE.</p><p>As mentioned before, the model has an equation of <strong><em>Ŷ = Xw₁+ w₀. </em></strong>As of now, the bias is added to every sample. This is equivalent to broadcasting the bias to be the same size as <strong><em>X</em></strong> and adding the arrays together. 
The output can be seen below.</p><pre>w[1]*X + w[0]</pre><pre>tensor([[1.97],<br>        [2.09],<br>        [0.83],<br>        [1.21],<br>        [1.84],<br>        [1.84],<br>        [0.83],<br>        [1.33],<br>        [0.96],<br>        [0.83],<br>        [1.71],<br>        [1.97],<br>        [1.21],<br>        [1.71],<br>        [1.97],<br>        [1.71],<br>        [1.21],<br>        [2.09],<br>        [2.09],<br>        [1.33]])</pre><p>The function below computes the output.</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features + 1 bias, 1)<br>      X: array of inputs  | (n samples, num features + 1 bias)<br><br>    Output:<br>      returns the predictions | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return w[1]*X + w[0]</pre><p>The function for the MSE is straightforward:</p><pre># mean squared error (MSE)<br>def MSE(Yhat, Y):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      Yhat: array of predictions | (n samples, 1)<br>      Y: array of expected outputs | (n samples, 1)<br>    Output:<br>      returns the loss of the model, which is a scalar<br>  &quot;&quot;&quot;<br><br>  return torch.mean((Yhat-Y)**2) # mean((error)^2)</pre><h4>Previewing the Line of Best Fit</h4><p>With the functions created, the line of best fit can be previewed with a plot, and a standard function can be created for future use. It will display the line of best fit in red, the predictions for each input in orange, and the expected outputs in blue.</p><pre>def plot_lbf():<br>  &quot;&quot;&quot;<br>    Output:<br>      plots the line of best fit in comparison to the training data<br>  &quot;&quot;&quot;<br><br>  # plot the points<br>  plt.scatter(X,Y)<br><br>  # predictions for the line of best fit<br>  Yhat = model(w, X)<br>  plt.scatter(X, Yhat, zorder=3) # plot the predictions<br><br>  # plot the line of best fit<br>  X_plot = torch.arange(-1,11+0.1,.1) # generate values with a step of .1<br>  plt.plot(X_plot, model(w, X_plot), color=&quot;red&quot;, zorder=0)<br><br>  plt.xlim(-1, 11)<br>  plt.xlabel(&quot;$X$&quot;)<br>  plt.ylabel(&quot;$Y$&quot;)<br>  plt.title(f&quot;MSE: {MSE(Yhat, Y):.2f}&quot;)<br>  plt.show()<br><br>plot_lbf()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*ckiegZBJ1zPpYg3boBfCzg.png" /><figcaption>Image by Author</figcaption></figure><p>The output with the current weights is not ideal since the MSE is 105.29. To get a better MSE, different weights need to be chosen. They could be randomized again, but the chance of acquiring the perfect line would be minimal. This is where the gradient descent algorithm can be used to alter the value of the weights in a defined manner.</p><h4>Gradient Descent</h4><p>An explanation of the gradient descent algorithm can be found here: <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb">A Simple Introduction to Gradient Descent</a>. The article should be read before moving on to avoid confusion.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/262/1*roraxsukJQt1l5z9Ka445A.png" /></figure><p>To summarize the article, gradient descent uses the gradient of a cost function to reveal the direction and influence of each weight on it. 
By scaling the gradient with a learning rate and subtracting it from each weight’s current value, the cost function minimizes, forcing the model’s prediction to be as close to the expected output as possible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/328/1*Eo1PaDJmUwYt5diOcDrOgA.png" /></figure><p>For simple linear regression, <strong><em>f</em></strong> will be the MSE. The Python implementation can be seen below. Remember, each weight has its own partial derivative to be used in the formula, which can be seen above.</p><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features + 1 bias, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features + 1 bias)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights<br>  &quot;&quot;&quot; <br><br>  n = len(X)<br><br>  # update the bias<br>  # note: w[0] is updated in place first, so the weight update below<br>  # uses predictions that already include the refreshed bias<br>  w[0] = w[0] - lr*2/n * torch.sum(model(w,X) - Y)<br>  <br>  # update the weight<br>  w[1] = w[1] - lr*2/n * torch.sum(X*(model(w,X) - Y))<br><br>  return w</pre><p>Now, the function can be used to update the weights. The learning rate is selected empirically, but it is normally a small value. The new line of best fit can also be plotted.</p><pre>lr = 0.01<br><br>print(&quot;weights before:&quot;, w.flatten())<br>print(&quot;MSE before:&quot;, MSE(model(w,X), Y))<br><br># update the weights<br>w = gradient_descent(w)<br><br>print(&quot;weights after:&quot;, w.flatten())<br>print(&quot;MSE after:&quot;, MSE(model(w,X), Y))<br><br>plot_lbf()</pre><pre>weights before: tensor([0.83, 0.13])<br>MSE before: tensor(105.29)<br>weights after: tensor([1.01, 1.46])<br>MSE after: tensor(2.99)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*K3bl1jxgaaYpb6aQgn6Tbg.png" /><figcaption>Image by Author</figcaption></figure><p>The MSE decreased by more than 100 on the first try, but the line still doesn’t fit the points perfectly. Remember, the goal is to get <strong><em>w₀ </em></strong>as 2 and <strong><em>w₁ </em></strong>as 1.5. To speed up the learning process, gradient descent can be performed 500 more times, and the new result can be examined.</p><pre># perform 500 more updates<br>for i in range(0, 500):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 100 epochs<br>  if (i+1) % 100 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w.flatten())<br>    print(&quot;MSE:&quot;, MSE(model(w,X), Y))<br>    print(&quot;=&quot;*10)<br><br>plot_lbf()</pre><pre>epoch: 100<br>weights: tensor([1.44, 1.59])<br>MSE: tensor(1.31)<br>==========<br>epoch: 200<br>weights: tensor([1.67, 1.56])<br>MSE: tensor(1.25)<br>==========<br>epoch: 300<br>weights: tensor([1.80, 1.54])<br>MSE: tensor(1.24)<br>==========<br>epoch: 400<br>weights: tensor([1.87, 1.53])<br>MSE: tensor(1.23)<br>==========<br>epoch: 500<br>weights: tensor([1.91, 1.52])<br>MSE: tensor(1.23)<br>==========</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*2WGsuZAf531EIR8ooXcoMw.png" /><figcaption>Image by Author</figcaption></figure><p>After 500 epochs, the MSE is 1.23. <strong><em>w₀ </em></strong>is 1.91, and <strong><em>w₁ </em></strong>is 1.52. This means the model successfully identified the line of best fit.</p>
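<p>With the weights learned, the model can now be used for its original purpose: prediction. This is a minimal sketch; the input value of 5 is arbitrary and chosen only for illustration:</p><pre># predict the output for a new input using the learned weights<br>X_new = torch.tensor([[5.0]])<br><br># computes w[1]*X_new + w[0], about 1.52*5 + 1.91 here<br>Yhat_new = model(w, X_new)<br>print(Yhat_new)  # about tensor([[9.51]]) with the learned weights</pre><p>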
Additional updates could be performed, but the randomness added to the output values will likely prevent the model from achieving a perfect prediction.</p><p>To build additional intuition about how gradient descent works, the impact of <strong><em>w₀ </em></strong>and <strong><em>w₁ </em></strong>can be examined by plotting them against their output, the MSE. The function for plotting gradient descent can be found in the appendix, and the output can be seen below:</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(2, 1))<br><br>w0s, w1s, losses = list(),list(),list()<br><br># update the weights<br>for i in range(0, 500):<br>  if i == 0 or (i+1) % 10 == 0:<br>    w0s.append(float(w[0]))<br>    w1s.append(float(w[1]))<br>    losses.append(MSE(model(w,X), Y))<br><br>  # update the weights<br>  w = gradient_descent(w)<br><br>plot_GD([-2, 5.2], [-2, 5.2])</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/413/1*FxZQU5jk0zZ1cDtJawrG0w.png" /><figcaption>Image by Author</figcaption></figure><p>Each orange point represents an update to the weights, and the red line represents the change from one iteration to the next. The largest update is from the first to the second iteration. The other orange points are close together since their derivatives are small, making the updates even smaller. The plot shows how the weights update until the optimal MSE is acquired.</p><p>While the approach is useful, it could be simplified in a few ways. First, it does not take advantage of matrix multiplication, which would simplify the equation for the model. Second, gradient descent is not a closed-form solution to regression since the number of epochs and the learning rate vary for every problem, and the solution is an approximation. The last section of this article will address the first problem, and the next article will address the second.</p><h3>An Alternative Approach</h3><p>While this approach is useful, it is not as simple as it could be because it does not take advantage of matrices. As of now, the entire equation, <strong><em>Ŷ = Xw₁+ w₀</em></strong>, is used for the model’s function, and the partial derivative of each weight has to be calculated individually for gradient descent. By using matrix operations and calculus, both functions simplify.</p><p>To start, <strong><em>X </em></strong>has a shape of <strong><em>(n samples, num features), </em></strong>and <strong><em>w </em></strong>has a shape of (<strong><em>num features + 1 bias, 1</em></strong>). By adding an additional column to <strong><em>X</em></strong>, matrix multiplication can be used because it will have a new shape of <strong><em>(n samples, num features + 1 bias)</em></strong>. This can be a column of ones that will be multiplied against the bias, which will scale the vector. This is equivalent to broadcasting the bias, which is how the predictions were previously calculated.</p><pre>X = torch.hstack((torch.ones(X.shape),X))<br>X</pre><pre>tensor([[ 1.,  9.],<br>        [ 1., 10.],<br>        [ 1.,  0.],<br>        [ 1.,  3.],<br>        [ 1.,  8.],<br>        [ 1.,  8.],<br>        [ 1.,  0.],<br>        [ 1.,  4.],<br>        [ 1.,  1.],<br>        [ 1.,  0.],<br>        [ 1.,  7.],<br>        [ 1.,  9.],<br>        [ 1.,  3.],<br>        [ 1.,  7.],<br>        [ 1.,  9.],<br>        [ 1.,  7.],<br>        [ 1.,  3.],<br>        [ 1., 10.],<br>        [ 1., 10.],<br>        [ 1.,  4.]])</pre><p>This changes the equation to <strong><em>Ŷ = X₁w₁+ X₀w₀</em></strong>. 
Moving forward, the bias can be considered a feature, so <strong><em>num features</em></strong> can represent both the independent variable and bias, and the <strong><em>+ 1 bias</em></strong> can be omitted. Therefore, <strong><em>X</em></strong> has a size of (<strong><em>n samples</em></strong>, <strong><em>num features</em></strong>), and <strong><em>w</em></strong> has a size of <strong><em>(num features</em></strong>, <strong><em>1</em></strong>). When they are multiplied by each other, the output is the prediction vector, which has a size of <strong><em>(n samples, 1</em></strong>). The output of the matrix multiplication is the same as w[1]*X + w[0].</p><pre>torch.manual_seed(5)<br>w = torch.rand(size=(2, 1))<br><br>torch.matmul(X, w)</pre><pre>tensor([[1.97],<br>        [2.09],<br>        [0.83],<br>        [1.21],<br>        [1.84],<br>        [1.84],<br>        [0.83],<br>        [1.33],<br>        [0.96],<br>        [0.83],<br>        [1.71],<br>        [1.97],<br>        [1.21],<br>        [1.71],<br>        [1.97],<br>        [1.71],<br>        [1.21],<br>        [2.09],<br>        [2.09],<br>        [1.33]])</pre><p>With this in mind, the model’s function can be updated:</p><pre># line of best fit<br>def model(w, X):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br>      X: array of inputs  | (n samples, num features)<br><br>    Output:<br>      returns the output of X@w | (n samples, 1)<br>  &quot;&quot;&quot;<br><br>  return torch.matmul(X, w)</pre><p>Since each weight is no longer thought of as an individual component, the gradient descent algorithm can also be updated. Based on <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-gradient-descent-1f32a08b0deb">A Simple Introduction to Gradient Descent</a>, the gradient descent algorithm for matrices follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/434/1*3hiPZu6oulzQ2YPOQmajIA.png" /></figure><p>This can easily be implemented with PyTorch. 
Since <strong><em>w</em></strong> is stored as a column vector, the derivative’s output needs to be reshaped for subtraction.</p><pre># optimizer<br>def gradient_descent(w):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w: array of weights | (num features, 1)<br><br>    Global Variables / Constants:<br>      X: array of inputs  | (n samples, num features)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br><br>    Output:<br>      returns the updated weights | (num features, 1)<br>  &quot;&quot;&quot; <br><br>  n = X.shape[0]<br><br>  return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)</pre><p>Using 500 epochs, the same output can be generated as before:</p><pre>lr = 0.01<br><br># train for 500 epochs<br>for i in range(0, 500):<br>  # update the weights<br>  w = gradient_descent(w)<br><br>  # print the new values every 100 epochs<br>  if (i+1) % 100 == 0:<br>    print(&quot;epoch:&quot;, i+1)<br>    print(&quot;weights:&quot;, w.flatten())<br>    print(&quot;MSE:&quot;, MSE(model(w,X), Y))<br>    print(&quot;=&quot;*10)</pre><pre>epoch: 100<br>weights: tensor([1.43, 1.59])<br>MSE: tensor(1.31)<br>==========<br>epoch: 200<br>weights: tensor([1.66, 1.56])<br>MSE: tensor(1.25)<br>==========<br>epoch: 300<br>weights: tensor([1.79, 1.54])<br>MSE: tensor(1.24)<br>==========<br>epoch: 400<br>weights: tensor([1.87, 1.53])<br>MSE: tensor(1.23)<br>==========<br>epoch: 500<br>weights: tensor([1.91, 1.53])<br>MSE: tensor(1.23)<br>==========</pre><p>Since these functions do not require additional variables to be manually added for each feature, they can be used for multiple linear regression and polynomial regression.</p><h3>Conclusion</h3><p>The next article will discuss the closed-form solution to regression that does not approximate the weights. Instead, the minimized values will be computed directly, as shown in <a href="https://medium.com/@hunter-j-phillips/an-introduction-to-machine-learning-in-python-the-normal-equation-for-regression-in-python-28dc37d524cf">An Introduction to Machine Learning in Python: The Normal Equation for Regression in Python</a>.</p><p>Please don’t forget to like and follow! 
:)</p><h3>References</h3><ol><li><a href="https://community.plotly.com/t/3d-scatter-plot-with-surface-plot/27556">Plotly 3D Plots</a></li></ol><h3>Appendix</h3><h4>Plotting Gradient Descent</h4><p>This function utilizes Plotly to display gradient descent in three dimensions.</p><pre>import plotly.graph_objects as go<br>import plotly<br>import plotly.express as px<br><br>def plot_GD(w0_range, w1_range):<br>  &quot;&quot;&quot;<br>    Inputs:<br>      w0_range: weight range [w0_low, w0_high]<br>      w1_range: weight range [w1_low, w1_high]<br><br>    Global Variables:<br>      X: array of inputs  | (n samples, num features + 1 bias)<br>      Y: array of expected outputs | (n samples, 1)<br>      lr: learning rate to scale the gradient<br>      <br>    Output:<br>      prints gradient descent<br>  &quot;&quot;&quot; <br><br>  # generate all the possible weight combinations (w0, w1)<br>  w0_plot, w1_plot = torch.meshgrid(torch.arange(w0_range[0],<br>                                                 w0_range[1],<br>                                                 0.1),<br>                                    torch.arange(w1_range[0],<br>                                                 w1_range[1],<br>                                                 0.1))<br>                                 <br>  # rearrange into coordinate pairs<br>  w_plot = torch.hstack((w0_plot.reshape(-1,1), w1_plot.reshape(-1,1)))<br><br>  # calculate the MSE for each pair<br>  mse_plot = [MSE(model(w, X), Y) for w in w_plot]<br><br>  # plot the data<br>  fig = go.Figure(data=[go.Mesh3d(x=w_plot[:,0], <br>                                  y=w_plot[:,1],<br>                                  z=mse_plot,)])<br><br>  # plot gradient descent on loss function<br>  fig.add_scatter3d(x=w0s, <br>                    y=w1s, <br>                    z=losses, <br>                    marker=dict(size=3,color=&quot;orange&quot;),<br>                    line=dict(color=&quot;red&quot;,width=5))<br>  <br>  # prepare ranges for plotting<br>  xaxis_range = [w0 + 0.01 if w0 &lt; 0 else w0 - 0.01 for w0 in w0_range] <br><br>  yaxis_range = [w1 + 0.01 if w1 &lt; 0 else w1 - 0.01 for w1 in w1_range] <br><br>  fig.update_layout(scene = dict(xaxis_title=&#39;w&lt;sub&gt;0&lt;/sub&gt;&#39;, <br>                                 yaxis_title=&#39;w&lt;sub&gt;1&lt;/sub&gt;&#39;, <br>                                 zaxis_title=&#39;MSE&#39;,<br>                                 xaxis_range=xaxis_range,<br>                                 yaxis_range=yaxis_range))<br>  fig.show()</pre><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5e6bd76b0bf8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Simple Introduction to the Dot Product]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-the-dot-product-f2b09e48c2c8?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/f2b09e48c2c8</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[linear-algebra]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Thu, 11 May 2023 01:26:53 GMT</pubDate>
            <atom:updated>2023-05-14T21:42:05.160Z</atom:updated>
<content:encoded><![CDATA[<p>The dot product is a common operation performed on vectors that returns a scalar as a result. This scalar provides information about the relationship between the vectors.</p><h4>Background</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/269/0*kENW2L8aw-S2KG-4.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/255/0*6A63TS4JYtC8ROfL.png" /></figure><p>For two vectors <strong><em>a</em></strong> and <strong><em>b</em></strong> of length <strong><em>n, </em></strong>the dot product can be used to show the relationship between them. For instance, are they pointing in the same direction? Opposite directions? Are they perpendicular?</p><p>The result is a scalar, so the dot product is sometimes known as the scalar product.</p><p>To build intuition for how this works, it would be best to start with the geometric definition.</p><h4>Geometric Definition</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/1*o9x0yY3346TYGejmHNfr1A.png" /><figcaption>Image by Math is Fun</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/232/0*oC6KewDWiP4On9v0.png" /></figure><p>This formula consists of three components:</p><ul><li><strong><em>||a||</em></strong>: the magnitude of <strong><em>a</em></strong></li><li><strong><em>||b||</em></strong>: the magnitude of <strong><em>b</em></strong></li><li>θ: angle between <strong><em>a </em></strong>and <strong><em>b</em></strong></li></ul><p><strong>Magnitude</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*FBCRPMU8Q16KQIFKQT7c3w.png" /><figcaption>Image by Wumbo</figcaption></figure><p>The magnitude can be calculated by taking the square root of the sum of the squared elements. For a 2-dimensional vector, the magnitude would be <br><strong><em>√(x² + y²)</em></strong>. For three dimensions, the magnitude would be <br><strong><em>√(x² + y² + z²)</em></strong>. For <strong><em>n</em></strong> dimensions, the magnitude would be:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/316/0*s17HMvDM7vG2hkYz.png" /></figure><p><strong>Cosine</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/192/1*k07NSpOz_es00yyG9w-Feg.png" /><figcaption>Image by Math is Fun</figcaption></figure><p><strong><em>cos(θ)</em></strong> is used to “project” <strong><em>a</em></strong> onto <strong><em>b</em></strong>. In the image above, <strong><em>a</em></strong> and <strong><em>b </em></strong>point in different directions, so <strong><em>||a|| cos(θ)</em></strong> projects the portion of <strong><em>a </em></strong>that is adjacent and alongside <strong><em>b</em></strong>.</p><p>This could also be viewed as taking the portion of <strong><em>b</em></strong> that is adjacent and alongside <strong><em>a</em></strong>, which can be seen in the image below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*VGWKgnw8iDE4rsaZx6DljQ.png" /><figcaption>Image by KiKaBeN</figcaption></figure><p><strong>Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/0*JFC5CaoQ2qY9Ad9g.gif" /><figcaption>Image by Author</figcaption></figure><p>The geometric definition is useful when the angle and magnitudes of the vectors are known, like in the example above. 
<p><strong><em>What does the output mean?</em></strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/502/1*_0dipiEY1CF0ziI39BVPOg.png" /><figcaption>Image by Math is Fun</figcaption></figure><p>When two vectors point in the same direction, the angle between them is <strong><em>θ = 0° </em></strong>or <strong><em>0 </em></strong>radians, so the output of <strong><em>cos(θ)</em></strong> is 1. This is when the dot product is at a maximum. On the other hand, when the vectors point in opposite directions, the angle between them is <strong><em>θ = 180°</em></strong> or <strong><em>π </em></strong>radians, and the output of <strong><em>cos(θ)</em></strong> is -1. This is when the dot product is at a minimum. When <strong><em>θ = 90°</em></strong> or <strong><em>π/2 </em></strong>radians, the output of <strong><em>cos(θ)</em></strong> is 0. This occurs when the vectors are perpendicular, or orthogonal, to each other.</p><p>This indicates that the dot product can help identify the relationship between vectors, which is vital in machine learning.</p><p>While the geometric definition is useful, it is more common to have the components of a vector and no known angle. In these situations, it is more convenient to use the equivalent coordinate formula.</p><h4><strong>Coordinate Definition</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/562/1*tofZf5fz9bN4kGH9dK9MFw.png" /></figure><p>The coordinate definition does not require an angle to calculate the dot product. Instead, the corresponding components of each vector are multiplied together and summed. The result is equivalent to that of the geometric definition; the most succinct explanation of their equivalence is on <a href="https://en.wikipedia.org/wiki/Dot_product#:~:text=Equivalence%20of%20the%20definitions%5Bedit%5D">Wikipedia</a>.</p><p><strong>Two-Dimensional Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/0*gqMYbNZi9J-6Qur1.gif" /><figcaption>Image by Math is Fun</figcaption></figure><p>To show that the two definitions are equal, consider the same example as before, now calculated with the coordinate definition.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/683/0*6pUmj_JQ-AM5RpGn.png" /></figure><p>The coordinate definition gives 66, while the geometric solution was 65.9799871849; the difference is negligible and comes from rounding the angle.</p><p><strong>Three-Dimensional Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/317/0*7mYxHMKOhta9hxv1.gif" /><figcaption>Image by Math is Fun</figcaption></figure><p>These same formulas can be used in three or more dimensions. In the image above, the components of each vector are known, but the angle between them is not.</p><p>The coordinate definition can be used to calculate the dot product, and the angle between the vectors can then be found using the result and the geometric definition.</p><p>For this example,<strong><em> a = [4, 8, 10] </em></strong>and <strong><em>b = [9, 2, 7]</em></strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/433/1*gvxThwYW7D3gNUYrzOtN5g.png" /><figcaption>Image by Author</figcaption></figure><p>Now, the angle can be found by setting <strong><em>||a|| ||b|| cos(θ) = a</em></strong>・<strong><em>b</em></strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/1*aLcodMRLcw5QDLM8dd4ewQ.png" /></figure>
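<p>The same three-dimensional calculation can be sketched in Python, again using NumPy; np.arccos inverts the geometric formula to recover the angle:</p><pre># coordinate definition: multiply corresponding components, then sum<br>import numpy as np<br><br>a = np.array([4, 8, 10])<br>b = np.array([9, 2, 7])<br><br>dot = np.sum(a * b) # 4*9 + 8*2 + 10*7 = 122<br><br># solve ||a|| ||b|| cos(theta) = a . b for theta<br>mag_a = np.linalg.norm(a)<br>mag_b = np.linalg.norm(b)<br>theta = np.degrees(np.arccos(dot / (mag_a * mag_b)))<br><br>print(dot)   # 122<br>print(theta) # roughly 38.2 degrees</pre>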
<p>Please don’t forget to like and follow for more! :)</p><h4>References</h4><ol><li><a href="https://www.mathsisfun.com/geometry/unit-circle.html">Math is Fun Unit Circle</a></li><li><a href="https://www.mathsisfun.com/algebra/vectors-dot-product.html">Math is Fun Vectors</a></li><li><a href="https://kikaben.com/transformers-self-attention/">KiKaBeN’s Transformer’s Self-Attention</a></li><li><a href="https://wumbo.net/formulas/magnitude-of-vector/main-600-300.svg">Wumbo</a></li><li><a href="https://en.wikipedia.org/wiki/Dot_product">Wikipedia’s Dot Product</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f2b09e48c2c8" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Simple Introduction to Broadcasting]]></title>
            <link>https://medium.com/@hunter-j-phillips/a-simple-introduction-to-broadcasting-db8e581368b3?source=rss-7a7936a6a04------2</link>
            <guid isPermaLink="false">https://medium.com/p/db8e581368b3</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[matrix-multiplication]]></category>
            <category><![CDATA[linear-algebra]]></category>
            <dc:creator><![CDATA[Hunter Phillips]]></dc:creator>
            <pubDate>Wed, 10 May 2023 17:27:12 GMT</pubDate>
            <atom:updated>2023-05-14T21:42:32.993Z</atom:updated>
            <content:encoded><![CDATA[<p>Broadcasting occurs when a smaller tensor is “stretched” to have a compatible shape with a larger tensor in order to perform an operation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/514/0*isIQwtBA3Yo0Rq5K.png" /><figcaption>Image from NumPy</figcaption></figure><p>Broadcasting can be an efficient way to perform tensor operations without creating duplicate data.</p><p>According to PyTorch, a tensor is “broadcastable” if:</p><blockquote>Each tensor has at least one dimension</blockquote><blockquote>When iterating over dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist</blockquote><p>The trailing dimension is the rightmost number when comparing shapes.</p><p>The image above shows the generic process:</p><p>1. Determine if the rightmost dimensions are compatible</p><ul><li>Does each tensor have at least one dimension?</li><li>Are the sizes equal? Is one of them 1? Does one not exist?</li></ul><p>2. Stretch the dimension to the appropriate size</p><p>3. Repeat the previous steps for the next dimension</p><p>These steps are sketched in code below, and they can be seen in the examples that follow.</p>
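<p>Here is a minimal sketch of that rule, assuming a recent version of PyTorch where torch.broadcast_shapes is available. The broadcastable helper is hypothetical, written only to mirror the steps above:</p><pre>import torch<br><br># compare two shapes from the trailing (rightmost) dimension backward<br>def broadcastable(shape_a, shape_b):<br>  for a, b in zip(reversed(shape_a), reversed(shape_b)):<br>    # sizes must be equal, or one of them must be 1<br>    if a != b and a != 1 and b != 1:<br>      return False<br>  # any leftover (non-existent) dimensions are always compatible<br>  return True<br><br>print(broadcastable((2, 3, 3), (3, 1))) # True<br>print(broadcastable((2, 3, 3), (2, 3))) # False: 3 and 2 clash<br><br># PyTorch performs the same check and returns the resulting shape<br>print(torch.broadcast_shapes((2, 3, 3), (3, 1)))</pre><pre>True<br>False<br>torch.Size([2, 3, 3])</pre>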
<h4>Element-Wise Operations</h4><p>All element-wise operations require tensors to have the same shape.</p><p><strong>Vector and Scalar Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/520/0*7tfiTJQY_cWUrHpu.png" /><figcaption>Image from NumPy</figcaption></figure><pre>import torch<br>a = torch.tensor([1, 2, 3])<br>b = 2 # becomes ([2, 2, 2])<br><br>a * b</pre><pre>tensor([2, 4, 6])</pre><p>In this example, the scalar has a shape of (1,), and the vector has a shape of (3,). As the image demonstrates, <strong><em>b </em></strong>is broadcast to a shape of (3,), and the Hadamard product is performed as anticipated.</p><p><strong>Matrix and Vector Example 1</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/917/1*lJJP4JH65lzdUNTfM_dZlQ.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A</em></strong> has a shape of (3, 3), and <strong><em>b </em></strong>has a shape of (3,).</p><p>When multiplication occurs, the vector is stretched row-wise to create a matrix, which can be seen in the image above. Now, both <strong>A</strong> and <strong>b</strong> have a shape of (3, 3).</p><p>This can be seen below.</p><pre>A = torch.tensor([[1, 2, 3],<br>                  [4, 5, 6],<br>                  [7, 8, 9]])<br><br>b = torch.tensor([1, 2, 3])<br><br>A * b</pre><pre>tensor([[ 1,  4,  9],<br>        [ 4, 10, 18],<br>        [ 7, 16, 27]])</pre><p><strong>Matrix and Vector Example 2</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/926/1*dX_RazJzygLsk7pZnI8XJA.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A </em></strong>has a shape of (3, 3), and <strong><em>b </em></strong>has a shape of (3, 1).</p><p>When multiplication occurs, the vector is stretched column-wise to create two additional columns, which can be seen in the image above. Now, both <strong>A</strong> and <strong>b</strong> have a shape of (3, 3).</p><pre>A = torch.tensor([[1, 2, 3],<br>                  [4, 5, 6],<br>                  [7, 8, 9]])<br><br>b = torch.tensor([[1], <br>                  [2], <br>                  [3]])<br>A * b</pre><pre>tensor([[ 1,  2,  3],<br>        [ 8, 10, 12],<br>        [21, 24, 27]])</pre><p><strong>Tensor and Vector Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PR-fSPcgBwKIELsLJj0nNA.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A </em></strong>is a tensor with a shape of (2, 3, 3), and <strong><em>b </em></strong>is a column vector with a shape of (3, 1).</p><pre>A = (2, 3, 3)<br>b = ( , 3, 1)</pre><p>Starting from the rightmost dimension, <strong><em>b </em></strong>is stretched column-wise to generate a (3, 3) matrix. The middle dimensions are equal, so at this point, <strong><em>b </em></strong>is just a matrix. The leftmost dimension does not exist, so a dimension must be added. Then, the matrix must be broadcast across it to create a size of (2, 3, 3). There are now two (3, 3) matrices, which can be seen in the image above.</p><p>This allows the Hadamard product to be computed and generates a (2, 3, 3) tensor:</p><pre>A = torch.tensor([[[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]],<br><br>                  [[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]]])<br><br>b = torch.tensor([[1], <br>                  [2], <br>                  [3]])<br><br>A * b</pre><pre>tensor([[[ 1,  2,  3],<br>         [ 8, 10, 12],<br>         [21, 24, 27]],<br><br>        [[ 1,  2,  3],<br>         [ 8, 10, 12],<br>         [21, 24, 27]]])</pre><p><strong>Tensor and Matrix Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HRldA4Zs277cWt6klenswQ.png" /><figcaption>Image by Author</figcaption></figure><p>In this example, <strong><em>A </em></strong>is a tensor with a shape of (2, 3, 3), and <strong><em>B </em></strong>is a matrix with a shape of (3, 3).</p><pre>A = (2, 3, 3)<br>B = ( , 3, 3)</pre><p>This example is easier than the previous one because the two rightmost dimensions are identical. This means the matrix only has to be broadcast across the leftmost dimension to create a shape of (2, 3, 3). This just means an additional copy of the matrix is needed.</p><p>When the Hadamard product is calculated, the result is a (2, 3, 3) tensor.</p><pre>A = torch.tensor([[[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]],<br>                   <br>                  [[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]]])<br><br>B = torch.tensor([[1, 2, 3], <br>                  [1, 2, 3], <br>                  [1, 2, 3]])<br><br>A * B</pre><pre>tensor([[[ 1,  4,  9],<br>         [ 4, 10, 18],<br>         [ 7, 16, 27]],<br><br>        [[ 1,  4,  9],<br>         [ 4, 10, 18],<br>         [ 7, 16, 27]]])</pre>
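<p>As a sanity check, the stretching in these examples can be made explicit. The sketch below assumes torch.broadcast_to, available in recent versions of PyTorch; it materializes the broadcast copy of <strong><em>B </em></strong>from the example above and confirms the element-wise result is unchanged:</p><pre># materialize the broadcast of B explicitly and compare<br>B_stretched = torch.broadcast_to(B, (2, 3, 3))<br><br>print(B_stretched.shape)<br>print(torch.equal(A * B, A * B_stretched))</pre><pre>torch.Size([2, 3, 3])<br>True</pre>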
<h4><strong>Matrix and Tensor Multiplication with the Dot Product</strong></h4><p>For all of the previous examples, the goal was to end up with identical shapes to allow element-wise multiplication. The goal of this example is to enable matrix and tensor multiplication via the dot product, which requires the last dimension of the first matrix or tensor to match the second-to-last dimension of the second matrix or tensor.</p><p>For matrix multiplication:</p><ul><li><strong><em>(m, n) x (n, r) = (m, r)</em></strong></li></ul><p>For 3D tensor multiplication:</p><ul><li><strong><em>(c, m, n) x (c, n, r) = (c, m, r)</em></strong></li></ul><p>For 4D tensor multiplication:</p><ul><li><strong><em>(z, c, m, n) x (z, c, n, r) = (z, c, m, r)</em></strong></li></ul><p><strong>Example</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DYG4KJgmDJL-T1dlYl-2Qw.png" /><figcaption>Image by Author</figcaption></figure><p>For this example, 𝓐 has a shape of (2, 3, 3), and <strong><em>B </em></strong>has a shape of (3, 2). The two rightmost dimensions, (3, 3) x (3, 2), are already eligible for dot-product multiplication. A dimension needs to be added to <em>B</em>, and the (3, 2) matrix needs to be broadcast across this dimension to create a shape of (2, 3, 2).</p><p>The result of this tensor multiplication will be <strong><em>(2, 3, 3) x (2, 3, 2) = (2, 3, 2)</em></strong>.</p><pre>A = torch.tensor([[[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]],<br>                   <br>                  [[1, 2, 3],<br>                   [4, 5, 6],<br>                   [7, 8, 9]]])<br><br>B = torch.tensor([[1, 2], <br>                  [1, 2], <br>                  [1, 2]])<br><br>A @ B # A.matmul(B)</pre><pre>tensor([[[ 6, 12],<br>         [15, 30],<br>         [24, 48]],<br><br>        [[ 6, 12],<br>         [15, 30],<br>         [24, 48]]])</pre><p>Additional information on broadcasting can be found at the links below. More information about tensors and their operations can be found <a href="https://medium.com/@hunter-j-phillips/a-simple-introduction-to-tensors-c4a8321efffc">here</a>.</p><p>Please don’t forget to like and follow for more! :)</p><h4>References</h4><ol><li><a href="https://numpy.org/doc/stable/user/basics.broadcasting.html">NumPy Broadcasting</a></li><li><a href="https://pytorch.org/docs/stable/notes/broadcasting.html">PyTorch Broadcasting</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db8e581368b3" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>