CS50x — Week 4

Daniel Wagener
12 min read · Jun 30, 2019


These are my notes for Harvard’s CS50 on edX.

Debugging with Valgrind

Another tool in the CS50 IDE, valgrind, can be used to detect memory leaks. For example, if a programmer temporarily forgets about 0-based indexing and tries to touch the 11th byte of a 10-byte chunk of heap memory:
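The snippet from lecture isn’t reproduced here, so below is a minimal sketch of the same bug (function names are mine, and I use a 10-int chunk rather than 10 bytes): one version indexes one past the end of the allocation and never frees it, the other does both correctly.

```c
#include <stdlib.h>

// buggy() contains the two mistakes valgrind would flag;
// fixed() shows the corrected version.
int buggy(void)
{
    int *x = malloc(10 * sizeof(int));
    if (x == NULL)
    {
        return 1;
    }
    x[10] = 50; // BUG: valid indices are 0 through 9
    return 0;   // BUG: x is never freed, so valgrind reports a leak
}

int fixed(void)
{
    int *x = malloc(10 * sizeof(int));
    if (x == NULL)
    {
        return 1;
    }
    x[9] = 50; // last valid element
    free(x);   // no leak
    return 0;
}
```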

This program will compile, run, and even print x, if we’re lucky. But, since x is in a location outside of the memory we’ve allocated, eventually we will get unlucky and hit a segmentation fault. By running help50 valgrind ./filename, we can get a simplified readout of the non-user-friendly output that valgrind provides and find our bug. In the code above, we would have to both fix the faulty index and then free() the allocated memory.

Many things can go wrong in code, and we save time later on by investing time now learning to use debuggers.

Another debugging method is rubber duck debugging, in which a programmer will explain their code to a rubber duck on their desk. By talking through their own code, they can find errors inside it.

P.S. The name “Valgrind” is a reference to the main entrance of Valhalla from Norse mythology.

Linked Lists

In C, arrays have their downsides. We cannot mix data types within arrays — the contents must all be ints, or all chars, etc. Also, the size of the array cannot change. If we have an array of two values and want to add a third, we have to create a new array and copy the original values into it before adding the third. We could use a function called realloc, but it only works on the address of a heap-allocated chunk, not on a stack array.

Another solution is to call malloc separate times for each integer. Since the integers probably won’t be contiguous in memory, we malloc to allocate extra space for each integer so that, in each chunk of memory, we store our value AND the address of the next value. The end result is values that aren’t contiguous in memory but are still connected by links. The last value, having no next address to point to, has NULL set in place of where that address would be. The name for this structure is a linked list. In code, we can set up this arrangement with a custom structure:
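The struct from lecture isn’t shown here; for a linked list of ints, it would look something like this:

```c
// A linked-list node: a value plus a pointer to the next node.
// The temporary name "struct node" is needed inside the definition
// because the typedef isn't finished yet at that point.
typedef struct node
{
    int number;
    struct node *next;
}
node;
```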

The code below creates an array of numbers, with the array’s length provided by the user:
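The snippet isn’t reproduced here; a sketch of the idea (my function name, returning the filled array rather than prompting the user, so the snippet stands alone):

```c
#include <stdlib.h>

// Allocate an n-element int array on the heap and fill it with
// 1 through n. In lecture, the length came from the user via get_int.
int *make_array(int n)
{
    int *numbers = malloc(n * sizeof(int));
    if (numbers == NULL)
    {
        return NULL;
    }
    for (int i = 0; i < n; i++)
    {
        numbers[i] = i + 1;
    }
    return numbers;
}
```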

But what if we want to create a program where the user doesn’t have to specify the size of the array numbers up front? We can use realloc, which will either extend the array’s current chunk of memory or, if enough contiguous space isn’t available there, find a new spot with enough room and move the array to it.
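A sketch of the realloc pattern (the function name is mine). Note the temporary pointer: on failure, realloc returns NULL but leaves the original block allocated, so assigning its result directly to numbers would leak that block.

```c
#include <stdlib.h>

// Append value to a heap array currently holding n ints, growing it
// by one slot. Returns the (possibly moved) array, or NULL on failure.
int *append(int *numbers, int n, int value)
{
    int *tmp = realloc(numbers, (n + 1) * sizeof(int));
    if (tmp == NULL)
    {
        free(numbers); // avoid leaking the original block
        return NULL;
    }
    tmp[n] = value;
    return tmp;
}
```

Since realloc(NULL, size) behaves like malloc, the same function also handles the very first element.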

To solve this same problem with a linked list of nodes:
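The lecture code isn’t reproduced here; a sketch of the insertion step it performed (my function name — the lecture version also prompted for numbers in a loop):

```c
#include <stdlib.h>

// Same node struct as before, repeated so the snippet stands alone.
typedef struct node
{
    int number;
    struct node *next;
}
node;

// Append number to the list whose first node is numbers.
// Returns the (possibly new) first node of the list.
node *append_number(node *numbers, int number)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return numbers; // out of memory; list unchanged
    }
    n->number = number; // arrow syntax for (*n).number
    n->next = NULL;     // the new last node points nowhere

    if (numbers == NULL)
    {
        return n; // list was empty, so n is the first node
    }

    // Walk to the current last node and link it to n
    for (node *ptr = numbers; ptr != NULL; ptr = ptr->next)
    {
        if (ptr->next == NULL)
        {
            ptr->next = n;
            break;
        }
    }
    return numbers;
}
```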

After defining our node structure, we initialize a pointer to the first node in the list, numbers. Then, after prompting for a number and checking for INT_MAX (which CS50’s get_int returns on error, so it serves as an EOF signal), we allocate memory the size of our node structure. To add the number to the list, we use the arrow operator, n->number, syntactic sugar that really just means (*n).number, i.e. “dereference n and access the number field found there.” Then, because this new, last node has no next node to point to, we set its next field to NULL using the same arrow syntax.

Now, both the last and second-to-last nodes have their next values set to NULL. We need to change the second-to-last node so that it points to the last node, so we write a for loop with an unusual setup:

for (node *ptr = numbers; ptr != NULL; ptr = ptr->next)

First, we set ptr equal to numbers, the pointer to the first node. Then, as long as ptr is not NULL, we advance it to the next node (ptr = ptr->next). Inside the loop, once we find the node whose next is NULL (the old last node), we set that next to n, the pointer to the new last node.

Hash Tables

In programming, hashing is the act of taking an input and outputting (usually) a number. For example, if we want to store names, we can consider the first letter of each name and store it in a 26-element array: names that start with A get stored at index 0, names that start with B at index 1, etc. We “bucketize” the names by putting them into one of 26 “buckets” (array indexes), with a linked list in each “bucket.” Then, if we want to retrieve a name later, we’ve effectively made our code 26 times faster. If we forgo names and instead organize students by their birthdays, our table might look like this:

Trees

A binary search tree like the one above has a special property: at every node, the left child holds a smaller value than the node itself, and the right child holds a larger one. Therefore, if we wanted to search for 66, we’d start at 55, go right (because 66 is bigger), arrive at 77, go left (because 66 is smaller), and arrive at 66. Whereas the efficiency of a hash table is linear, O(n), the efficiency of a tree is O(log n). We can define tree nodes like so:
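The definition isn’t reproduced here; following the lecture’s style, a tree node holds a value plus two child pointers:

```c
// A binary search tree node: a value plus left and right children.
typedef struct node
{
    int number;
    struct node *left;
    struct node *right;
}
node;
```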

We can then write a search algorithm for trees like this:
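A recursive search along those lines (the node definition is repeated so the snippet stands alone):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *left;
    struct node *right;
}
node;

// Binary search tree lookup: go left when the target is smaller,
// right when it's bigger, and report success on an exact match.
bool search(node *tree, int number)
{
    if (tree == NULL)
    {
        return false; // fell off the tree: not present
    }
    else if (number < tree->number)
    {
        return search(tree->left, number);
    }
    else if (number > tree->number)
    {
        return search(tree->right, number);
    }
    else
    {
        return true;
    }
}
```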

Tries

The name “trie” is short for “retrieval,” although it’s pronounced like “try.” A trie is like a tree, except each of its nodes is an array. In the name-searching example, each array might have 26 values. The above diagram is simplified, as tries get very big very quickly. It may look inefficient, but since names can share arrays, we can justify this structure if we’re storing lots of names. The upshot is that the search process takes only as many steps as there are letters in the student’s name. Software with automatic real-time spellchecking usually has a dictionary stored in a trie so that it can quickly check the user’s input against a dictionary with hundreds of thousands of words.

Shorts

Defining Custom Types

The syntax for custom type definition is typedef <old name> <new name>. For example, a byte à la Windows programming could be typedef unsigned char byte, and CS50’s string would be typedef char* string.

Here’s another syntax for making structures:
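The snippet isn’t shown here; the alternative is to define the struct first and typedef it afterward (the car fields are my guess at the example):

```c
// Define the structure first...
struct car
{
    int year;
    char model[10];
};

// ...then give it a shorter alias.
typedef struct car car_t;
```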

Then, we can use car_t throughout our program. Alternatively, we can put typedef at the beginning (like we’ve been doing) and have the structure name at the end.

Data Structures

A summary of the four major data structures in C:

Arrays

  • Insertion is bad — lots of shifting to add an element (unless we insert at the end)
  • Deletion is bad — lots of shifting to remove an element (unless we remove from the end)
  • Lookup is great — random access, constant time
  • Relatively easy to sort
  • Relatively small size-wise
  • Stuck with a fixed size, no flexibility

Linked Lists

  • Insertion is easy — just tack onto the front
  • Deletion is easy — once you find the element
  • Lookup is bad — have to rely on linear search
  • Relatively difficult to sort — unless you’re willing to compromise on super-fast insertion and instead sort as you construct
  • Relatively small size-wise (not as small as arrays)
  • Flexible size

Hash Tables

  • Insertion is easy — hash, then add
  • Deletion is easy — once you find the element
  • Lookup is on average better than with linked lists because you have the benefit of a real-world constant factor
  • Not an ideal data structure if sorting is the goal — just use an array
  • Can run the gamut of size

Tries

  • Insertion is complex — a lot of dynamic memory allocation, but gets easier as you go
  • Deletion is easy — just free a node
  • Lookup is fast — not quite as fast as an array, but almost
  • Already sorted — sorts as you build in almost all situations
  • Rapidly becomes huge, even with very little data present, not great if space is at a premium

Singly-Linked Lists

Here is an example of a self-referential structure:
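The snippet isn’t reproduced here; following the short’s conventions, it looks like this (VALUE is the short’s placeholder for whatever type the list stores, assumed here to be int so the snippet compiles):

```c
// VALUE stands in for the stored type; assume int for this sketch.
typedef int VALUE;

typedef struct sllist
{
    VALUE val;
    struct sllist *next;
}
sllnode;
```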

Because the structure definition is not yet finished at the point where we declare the next pointer, we have to use the temporary name struct sllist there. We’ll use the permanent name sllnode from now on.

To create a linked list, the syntax is:

sllnode* create(VALUE val);

The function will return a pointer to the first item of our new linked list. There are a few steps involved:

  1. Dynamically allocate space for a new sllnode.
  2. Check to make sure we didn’t run out of memory.
  3. Initialize the node’s val field.
  4. Initialize the node’s next field (to NULL).
  5. Return a pointer to the newly-created sllnode.

Note: create() is not a library function, just an example.
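One possible create() following the five steps (a sketch, not the short’s exact code; VALUE is again assumed to be int):

```c
#include <stdlib.h>

typedef int VALUE;

typedef struct sllist
{
    VALUE val;
    struct sllist *next;
}
sllnode;

sllnode *create(VALUE val)
{
    // 1. Dynamically allocate space for a new sllnode
    sllnode *n = malloc(sizeof(sllnode));

    // 2. Check that we didn't run out of memory
    if (n == NULL)
    {
        return NULL;
    }

    // 3. and 4. Initialize the val and next fields
    n->val = val;
    n->next = NULL;

    // 5. Return a pointer to the new node
    return n;
}
```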

To search through a linked list to find an element, the syntax is:

bool find(sllnode* head, VALUE val);

The first element of a linked list is always worth keeping track of, and making it a global variable might be wise. As long as we know the first element of the list, we can find all the others.

Thus, the steps are:

  1. Create a traversal pointer pointing to the list’s head (i.e. a copy of the original pointer, so that when we change this variable, we still have the original to reference later).
  2. If the current node’s val field is what we’re looking for, report success.
  3. If not, set the traversal pointer to the next pointer in the list and go back to step 2.
  4. If you’ve reached the end of the list, report failure.
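A find() following those steps (a sketch, with VALUE assumed to be int):

```c
#include <stdbool.h>
#include <stddef.h>

typedef int VALUE;

typedef struct sllist
{
    VALUE val;
    struct sllist *next;
}
sllnode;

bool find(sllnode *head, VALUE val)
{
    // 1. The traversal pointer starts at the head; head itself is untouched
    for (sllnode *trav = head; trav != NULL; trav = trav->next)
    {
        // 2. Report success if this node holds val
        if (trav->val == val)
        {
            return true;
        }
        // 3. Otherwise, the loop advances trav to the next node
    }
    // 4. Reached the end of the list without finding val
    return false;
}
```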

We don’t have random access with linked lists, so there’s not a big reason to keep them sorted.

To insert a new node into the linked list, the syntax is:

sllnode* insert(sllnode* head, VALUE val);

And the steps are:

  1. Dynamically allocate space for a new sllnode.
  2. Check to make sure we didn’t run out of memory.
  3. Populate and insert the node at the beginning of the linked list.
  4. Return a pointer to the new head of the linked list.

We insert at the beginning because that’s the fastest way to do it. If we insert at the end, we have to traverse the entire list before we can do so.

The above diagram represents where we’d be at the end of step 3. Now we have to decide: do we make the “12” node the new head of the linked list, or do we connect it first? It turns out we have to connect it first. If we just make “12” the new head, we break the chain and orphan every other item on the list.
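An insert() following the steps above (a sketch, with VALUE assumed to be int); the key point is that the new node is wired to the old head before we return it as the new head:

```c
#include <stdlib.h>

typedef int VALUE;

typedef struct sllist
{
    VALUE val;
    struct sllist *next;
}
sllnode;

sllnode *insert(sllnode *head, VALUE val)
{
    // 1. and 2. Allocate a node and check for failure
    sllnode *n = malloc(sizeof(sllnode));
    if (n == NULL)
    {
        return head; // out of memory; list unchanged
    }

    // 3. Populate the node and connect it to the old head FIRST,
    //    so nothing gets orphaned
    n->val = val;
    n->next = head;

    // 4. n is the new head of the list
    return n;
}
```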

To delete an entire linked list, the syntax is:

void destroy(sllnode* head);

The steps are:

  1. Start at the beginning of the list. If you’ve reached a null pointer, stop. Otherwise:
  2. Destroy the rest of the list.
  3. Free the current node.

Step 2 is recursive, and makes more sense with a diagram:

The pointer in “12” is not NULL, so the function moves on to the next element, then the next, and so forth until it reaches the “10” node. Then, it frees that node and all the other nodes from right to left. If we tried to delete from left to right, we would orphan the rest of the list after the first deletion, resulting in a memory leak.
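A recursive destroy() following the three steps (a sketch, with VALUE assumed to be int); freeing the rest of the list before the current node is what makes the right-to-left order happen:

```c
#include <stdlib.h>

typedef int VALUE;

typedef struct sllist
{
    VALUE val;
    struct sllist *next;
}
sllnode;

void destroy(sllnode *head)
{
    // 1. A null pointer means we're past the end of the list; stop
    if (head == NULL)
    {
        return;
    }

    // 2. Destroy the rest of the list first
    destroy(head->next);

    // 3. Then free the current node
    free(head);
}
```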

One limitation of a singly-linked list is that we cannot delete a single element from the list, lest we orphan part of the list. If we want to be able to delete single elements, we have to have a doubly-linked list (pointers pointing to both the next and previous nodes).

Hash Tables

Hash tables are great because insertion, deletion, and lookup can tend to O(1), unless we try to sort them, in which case the efficiency regresses to O(n).

The hash function returns a hash code, which tells us where to insert the data (e.g. “Charlie” starts with C, which returns hash code 2, so insert Charlie into the linked list located at array index 2). A good hash function should:

  • Use only the data being hashed
  • Use all of the data being hashed
  • Be deterministic
  • Uniformly distribute data
  • Generate very different hash codes for very similar data
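As a toy illustration (my code, not a recommended hash function), the first-letter scheme from the names example looks like this; note that it deliberately breaks the “use all of the data” rule, which is one reason real hash functions are more involved:

```c
#include <ctype.h>

// Map a name to one of 26 buckets by its first letter:
// "Albus" -> 0, "Charlie" -> 2, etc. Deterministic, but it only
// uses the first character of the data.
unsigned int hash(const char *name)
{
    return (unsigned int) (toupper((unsigned char) name[0]) - 'A') % 26;
}
```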

A collision occurs when two pieces of data, run through the hash function, yield the same hash code. If we don’t find a way to resolve collisions, the new data will overwrite the old data. One way is linear probing, where we try to place the data in the next consecutive element in the array (wrapping around if necessary) until we find a vacancy. However, linear probing means lookups start to take longer, and we regress toward O(n). Linear probing is subject to a problem called clustering: if two hashes return 6, the second has to use array index 7, and now any hash that returns 6 or 7 forces us to use index 8, and thereafter any hash that returns 6, 7, or 8 forces us to use index 9, etc. Finally, linear probing is limited by the size of the array, and stops working once the array is full.

The method we saw earlier, using linked lists, is called chaining. Chaining eliminates clustering and allows us to store an arbitrary amount of data in a finite array.

Tries

Hash tables and arrays handle the mapping of key-value pairs. A trie is a data structure with a guaranteed unique key, and the value is simply a boolean (i.e. “Does this value exist in this structure or not?”). The data we’re searching for becomes a road map to navigate the trie, and if we can follow this road map from beginning to end, we know the data exists in the trie. Unlike with a hash table, there are no collisions: no two pieces of data can collide unless they are identical.

If we wanted to make key-value pairs where the keys are four-digit years and the values are names of universities, the paths from the root node would be labeled with the digits of the year, and the leaf node would contain the university name. To insert an element into the trie, simply build the correct path from the root to the leaf:

Each node has an array of 10 paths because at every junction point, there are ten possible digits we can branch on.
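A trie node for this example might be sketched like so (field names are mine): ten child pointers, one per digit, plus room for a university name at the node where a year’s path ends.

```c
// Ten branches, one per digit 0-9, and space for a name at the
// node that completes a year's path (empty string elsewhere).
typedef struct _trie
{
    char university[20];
    struct _trie *paths[10];
}
trie;
```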

When we start building a trie, we malloc and (globally, probably) declare a root node and never touch it again. Every time we want to start navigating through the trie, we set a traversal pointer equal to the root. Then, to insert “Harvard,” founded in 1636:

Then, to insert “Yale,” founded in 1701, and “Dartmouth,” founded in 1769:

If we were then to search for “Princeton,” founded in 1746, our lookup would reach the third level, where no 4 exists, and return false. Tries are huge, but in this case each lookup is only four steps, and we have to create fewer new nodes each time we insert something into the trie (assuming there will be overlap).

Reflections on Problem Set — Speller

I had a solid understanding of what a hashtable was, but I struggled with figuring out how to actually make one. I eventually got a working hashtable, and my spellchecker worked great, but I didn’t realize I’d initialized it incorrectly. When check50 and Valgrind gave me errors due to an uninitialized variable, I spent hours looking for the bug until someone on the CS50 Facebook group pointed me in the right direction. Moral of the story: you don’t really know a concept until you’ve actually made something using it.
