Stories by Satya Sinha on Medium

What even is the Internet anyway?

Satya Sinha — Mon, 22 Apr 2019 01:14:36 GMT

I’ve recently started looking for my first software engineering job, and on two separate occasions I’ve been asked how the internet works. Both times, I’ve only managed to blurt out a few buzzwords (IP address! DNS!) before coming up blank. This blog post is an effort to avoid that situation happening a third time.

Origins

The internet as we know it today grew out of a computer network known as the ARPANET (Advanced Research Projects Agency Network). The ARPANET was a government-funded project, launched in 1969, designed to allow researchers at different institutions to communicate with each other more efficiently. The ARPANET was built on packet switching and, from 1983 onward, on TCP/IP, two fundamental technologies that the internet still relies on today.

Packet Switching

Packet switching can get pretty esoteric, but put simply, it’s a method of transmitting data over a network by breaking it up into ‘data packets’ before sending it, and reassembling it once the packets have arrived at their destination. Each packet has at least a header and a payload. A header must at minimum contain the network address of the source of the packet and the address of the destination. The payload is some part of the actual data being transmitted. Routers along the way forward each packet toward the destination address specified in its header, and different packets may take different routes. Once every packet has arrived at the destination address, the packets are reassembled into the original data. But what’s the mechanism to check whether all your packets have arrived and have been transmitted correctly? And what are network addresses? This is where TCP/IP comes into play.
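To make the header/payload idea concrete, here is a minimal Python sketch. It is a toy model, not how a real network stack works: the Packet class, the 8-byte payload size, and the sequence number field are my own assumptions (the sequence number is there so the receiver can put the pieces back in order), and the addresses are reserved documentation addresses.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    source: str        # header: sender's network address
    destination: str   # header: receiver's network address
    sequence: int      # header (assumed field): position of this payload in the data
    payload: bytes     # a slice of the data being transmitted


def split_into_packets(data: bytes, source: str, destination: str, size: int = 8) -> list[Packet]:
    """Break data into fixed-size payloads, each wrapped with a header."""
    return [
        Packet(source, destination, seq, data[start:start + size])
        for seq, start in enumerate(range(0, len(data), size))
    ]


def reassemble(packets: list[Packet]) -> bytes:
    """Put payloads back in order using the sequence number in each header."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.sequence))


message = b"What even is the Internet anyway?"
packets = split_into_packets(message, "192.0.2.1", "198.51.100.7")
random.shuffle(packets)   # packets may take different routes and arrive out of order
received = reassemble(packets)
```

Even though the packets are shuffled, received ends up identical to message, because the ordering information travels with each packet.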

TCP/IP

TCP (Transmission Control Protocol) and IP (Internet Protocol) are a complementary pair of protocols that standardize how data is transmitted across the internet. Before data is sent from one address to another, a TCP connection is established. As packets arrive at their destination, the receiver sends an ACK (acknowledgment) back to the sender. If the sender doesn’t receive an ACK for some of the data, it knows that data was lost or corrupted along the way and resends it.
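The acknowledge-and-resend idea can be sketched in a few lines of Python. This is a simplified "send one packet, wait for its ACK" model under my own assumptions; real TCP is more sophisticated (it keeps many packets in flight and acknowledges them cumulatively). The function names and the toy lossy channel are hypothetical.

```python
def send_reliably(payloads, channel, max_attempts=5):
    """Send each payload and wait for an ACK; resend if none comes back.

    `channel(seq, payload)` is a hypothetical function that returns True when
    the receiver acknowledged the packet and False when it was lost.
    """
    attempts_per_packet = []
    for seq, payload in enumerate(payloads):
        for attempt in range(1, max_attempts + 1):
            if channel(seq, payload):          # ACK received, move on
                attempts_per_packet.append(attempt)
                break
        else:
            raise ConnectionError(f"packet {seq} was never acknowledged")
    return attempts_per_packet


def make_lossy_channel(drop_first_attempt_for):
    """Build a toy channel that loses the first transmission of chosen packets."""
    delivered = {}
    already_dropped = set()

    def channel(seq, payload):
        if seq in drop_first_attempt_for and seq not in already_dropped:
            already_dropped.add(seq)
            return False                       # packet lost, so no ACK comes back
        delivered[seq] = payload
        return True                            # receiver got the packet and sent an ACK

    return channel, delivered


channel, delivered = make_lossy_channel(drop_first_attempt_for={1})
attempts = send_reliably([b"He", b"ll", b"o!"], channel)
```

Here the second packet is lost once, so it takes two attempts, but all three payloads end up delivered.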

IP is the protocol that determines where packets go. Each address in the header of a packet is an IP address, which functions much like a mailing address or phone number, just for devices connected to a network. However, unless you specifically look them up, IP addresses aren’t encountered during typical day-to-day use of the internet. This is because of a system that associates the URL you’d see in a browser’s address bar with a specific IP address, known as DNS (Domain Name System).

DNS

DNS is a hierarchical system designed to convert domain names to IP addresses, with the endings of URLs such as .com or .org acting as top-level domains. When a URL is entered into a browser, a request for its domain name is sent via your router and ISP to a DNS server, which works its way down the hierarchy, querying several servers, until it finds the correct IP address.

Example of a sample query for a url
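The walk down the hierarchy can be sketched with plain dictionaries standing in for the servers. Everything here is made up for illustration: the server names, the three-level lookup tables, and the addresses (which come from a range reserved for documentation). A real resolver also caches answers and handles many more record types.

```python
# Toy data standing in for real name servers; every name and address is made up.
ROOT_SERVER = {"com": "tld-server-com", "org": "tld-server-org"}
TLD_SERVERS = {
    "tld-server-com": {"example.com": "ns-example-com"},
    "tld-server-org": {"example.org": "ns-example-org"},
}
AUTHORITATIVE_SERVERS = {
    "ns-example-com": {"www.example.com": "192.0.2.10"},
    "ns-example-org": {"www.example.org": "192.0.2.20"},
}


def resolve(hostname: str) -> str:
    """Walk down the hierarchy: root -> top-level domain -> authoritative server."""
    labels = hostname.split(".")
    tld = labels[-1]                    # e.g. "com"
    domain = ".".join(labels[-2:])      # e.g. "example.com"
    tld_server = ROOT_SERVER[tld]                         # root: who handles .com?
    name_server = TLD_SERVERS[tld_server][domain]         # TLD: who handles example.com?
    return AUTHORITATIVE_SERVERS[name_server][hostname]   # final answer: the IP address


address = resolve("www.example.com")
```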

Bringing it all together

The combination of these protocols and systems is the backbone of the internet. They allow networks all over the globe to communicate despite potentially being built differently, because they have all agreed on common standards of communication and structuring data. These common standards are proposed and coordinated by international non-profits such as the IETF (Internet Engineering Task Force) and ICANN (Internet Corporation for Assigned Names and Numbers). Whereas something like ARPANET was a single network, the internet as we know it is a network of networks that can all communicate because they use common protocols such as TCP/IP. There’s a lot more to the infrastructure of the internet than the above topics, such as how it’s physically possible to transfer information, HTTP/HTTPS, etc., but hopefully this post provides some interesting background context to the infrastructure that makes our work as developers possible.

A Refresher on Common SQL Select Statements and Joins

Satya Sinha — Mon, 15 Apr 2019 02:26:38 GMT

The vast majority of developers will have some exposure to SQL in their day to day; either via writing SQL directly, or under the hood via common ORM libraries (such as ActiveRecord for Ruby on Rails). Even if the SQL you use is abstracted away via an ORM, it’s still useful to know the basics so you can debug and optimize your queries when things don’t go according to plan.

Select Statements

Select statements are how you query information from a database, and probably the most commonly used set of SQL commands. While the simplest versions return specific columns of a table, they can also be combined with several optional clauses to increase the specificity of your queries, the most common of which I’ve listed below:

*

The ‘*’ symbol is used to return all of the columns in a table, instead of naming each column individually.

WHERE

The WHERE clause is used to only return data that satisfies a certain condition. For example, if you only wanted to return the elements in a table that were above a certain number, you could write a query like this:

SELECT column1 FROM table_name WHERE column1 > x;

ORDER BY

The ORDER BY clause is used to sort the data returned from your query. Adding ASC or DESC after the column name (for example, ORDER BY column1 DESC) returns the data sorted in ascending or descending order, respectively. Ascending is the default.
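Here is a small runnable sketch of WHERE and ORDER BY together, using Python’s built-in sqlite3 module and an in-memory database. The scores table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("ana", 40), ("ben", 75), ("cam", 90), ("dee", 55)],
)

# WHERE keeps rows above the threshold; ORDER BY ... DESC sorts what is left.
rows = conn.execute(
    "SELECT player, points FROM scores WHERE points > ? ORDER BY points DESC",
    (50,),
).fetchall()
```

rows contains cam, ben, and dee (highest score first); ana is filtered out by the WHERE clause.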

DISTINCT

DISTINCT allows you to only return a column’s unique values, which makes it very useful for filtering through data that may contain a large amount of duplicate entries, such as records for a subscription service. A sample query would look like this:

SELECT DISTINCT column1 FROM table_name;
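A quick sketch of the difference DISTINCT makes, again with Python’s sqlite3 module. The subscriptions table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (email TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?)",
    [("a@example.com", "basic"), ("b@example.com", "pro"),
     ("c@example.com", "basic"), ("d@example.com", "pro")],
)

# Without DISTINCT: one row per subscription, duplicates included.
all_plans = conn.execute("SELECT plan FROM subscriptions").fetchall()

# With DISTINCT: each plan appears once.
unique_plans = conn.execute(
    "SELECT DISTINCT plan FROM subscriptions ORDER BY plan"
).fetchall()
```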

GROUP BY

GROUP BY groups together rows that share the same value in one or more columns, so that an aggregate function, such as a sum or a count, can be applied to each group. This is a very useful clause that opens up a lot of queries. For example, if you had a table of customer ids and transactions, you could use GROUP BY to return all the customer ids and the sum total of each id’s transactions like this:

SELECT customer_id, SUM(transaction) FROM table_name GROUP BY customer_id;

HAVING

HAVING is essentially a WHERE clause for GROUP BY statements. It filters out the groups returned from your query that don’t fulfill the conditions you specify. For example, if you wanted to alter the previous query to return only customer ids that had total transaction amounts above $100, you could write this:

SELECT customer_id, SUM(transaction) FROM table_name GROUP BY customer_id HAVING SUM(transaction) > 100;
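The two queries above can be tried out with Python’s sqlite3 module. The table name transactions, the column name amount, and the sample rows are my own choices for the sketch; I also added ORDER BY so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 60.0), (1, 70.0), (2, 20.0), (3, 150.0), (2, 30.0)],
)

# GROUP BY: one output row per customer, with that customer's total.
totals = conn.execute(
    "SELECT customer_id, SUM(amount) FROM transactions "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# HAVING: keep only the groups whose total is above 100.
big_spenders = conn.execute(
    "SELECT customer_id, SUM(amount) FROM transactions "
    "GROUP BY customer_id HAVING SUM(amount) > 100 ORDER BY customer_id"
).fetchall()
```

Customer 2 has a total of 50, so it shows up in totals but not in big_spenders.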

JOIN Statements

Join statements are used to relate data from one table to data from other tables, based on common data between them. Joins are what really demonstrate SQL’s primary selling point, which is the Relational Model of data that SQL databases are based on. There are several types of possible join statements, depending on what combination of data you’d like to return.

INNER JOIN

Inner Joins only return records that have a match in both specified tables. For example, if you have two tables, one containing transactions, another containing customer contact info, you could write a join statement linking each transaction to the name of its customer:

SELECT Transactions.customer_id, Customers.customer_name
FROM Transactions INNER JOIN Customers ON Transactions.customer_id = Customers.customer_id;

Note that this statement relies on the customer_id column existing in both the Transactions and the Customers tables. Without this common data to establish a relationship between both tables, this join statement would not be possible.
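Here is a runnable sketch of an inner join using Python’s sqlite3 module. The sample rows and the transaction_id column are made up; note that transaction 13 points at a customer id that doesn’t exist, and customer Cam has no transactions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE Transactions (transaction_id INTEGER, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cam');
    INSERT INTO Transactions VALUES (10, 1), (11, 1), (12, 2), (13, 99);
""")

# Only rows with a match on customer_id in BOTH tables come back.
rows = conn.execute("""
    SELECT Transactions.transaction_id, Customers.customer_name
    FROM Transactions
    INNER JOIN Customers ON Transactions.customer_id = Customers.customer_id
    ORDER BY Transactions.transaction_id
""").fetchall()
```

Neither transaction 13 nor Cam appears in rows, since neither has a counterpart in the other table.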

OUTER JOINS

Outer Joins allow you to select all the data from one table, in addition to any matching data from another table. They can be further broken down into Left Joins, Right Joins, and Full Outer Joins.

LEFT/RIGHT JOIN

Left and Right Joins are mirror images of one another, and are used to return all the data from one table and any matching data from another table. In the previous example, a left join from Customers to Transactions, instead of an inner join, could tell you which customers had no recorded transactions:

SELECT Customers.customer_name, Transactions.customer_id
FROM Customers LEFT JOIN Transactions ON Transactions.customer_id = Customers.customer_id;

In this case, all the information from an inner join would still be returned, but any customer with no matching transactions would also appear, with a null value in the Transactions column.
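A small sketch of a left join that answers the "which customers have no transactions?" question, using Python’s sqlite3 module. For this to work, Customers has to be the left-hand table so that every customer is kept. The sample data and the transaction_id column are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE Transactions (transaction_id INTEGER, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cam');
    INSERT INTO Transactions VALUES (10, 1), (11, 1), (12, 2);
""")

# Every customer is returned; customers without transactions get NULL (None).
rows = conn.execute("""
    SELECT Customers.customer_name, Transactions.transaction_id
    FROM Customers
    LEFT JOIN Transactions ON Transactions.customer_id = Customers.customer_id
    ORDER BY Customers.customer_id, Transactions.transaction_id
""").fetchall()

customers_without_transactions = [name for name, tx_id in rows if tx_id is None]
```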

FULL OUTER JOIN

A Full Outer Join returns all records from both tables. The results of these queries may be very large, and also contain many null values if there isn’t much shared data between the two tables:

SELECT Customers.customer_name, Transactions.transaction_id
FROM Customers
FULL OUTER JOIN Transactions ON Customers.customer_id = Transactions.customer_id;
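Not every database supports FULL OUTER JOIN directly (older SQLite versions don’t), so the sketch below emulates it rather than using the keyword: a left join from Customers, plus the transactions that have no matching customer. This is a swapped-in technique that should return the same rows as a full outer join; the sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE Transactions (transaction_id INTEGER, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cam');
    INSERT INTO Transactions VALUES (10, 1), (12, 2), (13, 99);
""")

rows = conn.execute("""
    -- every customer, with matching transactions where they exist
    SELECT Customers.customer_name, Transactions.transaction_id
    FROM Customers
    LEFT JOIN Transactions ON Customers.customer_id = Transactions.customer_id
    UNION ALL
    -- plus the transactions that matched no customer at all
    SELECT Customers.customer_name, Transactions.transaction_id
    FROM Transactions
    LEFT JOIN Customers ON Customers.customer_id = Transactions.customer_id
    WHERE Customers.customer_id IS NULL
""").fetchall()
```

The result includes Cam with no transaction and transaction 13 with no customer, which is exactly where the null values mentioned above come from.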

Hopefully this basic overview is useful for anyone wanting to brush up on SQL before a technical interview, or new to the language.

SVG 101

Satya Sinha — Thu, 28 Feb 2019 17:12:19 GMT

My last project as a student at Flatiron School involves a good bit of data visualization, something I had no real experience in beyond simple charts in Excel. During my research on simple data visualization techniques, I came across several different charting libraries, many of which relied on SVGs to create their charts and animations, which raises the question: what are SVGs, and why use them over other image formats?

SVGs are an XML-based image format used to draw vector graphics. Similar to writing HTML, you can use predefined tags and attributes to add pictures and animations to your website. Unlike image formats that create images with pixels (raster graphics), SVGs are created using vector graphics, which rely on mathematical formulas to create lines, curves, and various other shapes. There are several benefits to creating images using this technique as opposed to raster graphics, such as scalability and flexibility.

Modern websites need to accommodate a wide variety of screen sizes and resolutions, which creates problems when all of your images are made with raster graphics. SVGs can be a solution to this problem, as they are mostly resolution independent.

Zooming in on an image using raster graphics

Zooming in on an image using vector graphics

Notice the difference in quality between the two images? This is because raster images contain a fixed number of pixels, which becomes readily apparent the more you zoom in. SVGs don’t have this problem, as they are constructed using various geometric shapes.

SVGs also have benefits related to the file sizes of images:
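To see why scaling doesn’t hurt a vector image, here is a small Python sketch that builds the SVG markup for one circle at two very different display sizes. The circle_svg helper and the specific shape are my own example; the point is that only the width and height attributes change, while the description of the shape stays the same.

```python
import xml.etree.ElementTree as ET


def circle_svg(display_size: int) -> str:
    """Return SVG markup for the same circle rendered at a given pixel size."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{display_size}" height="{display_size}" viewBox="0 0 100 100">'
        '<circle cx="50" cy="50" r="40" fill="steelblue"/>'
        "</svg>"
    )


small, large = circle_svg(32), circle_svg(2048)

# The shape is described by the same numbers at both sizes.
ns = {"svg": "http://www.w3.org/2000/svg"}
small_circle = ET.fromstring(small).find("svg:circle", ns).attrib
large_circle = ET.fromstring(large).find("svg:circle", ns).attrib
size_difference_in_characters = len(large) - len(small)
```

A raster image displayed 64 times wider would need far more pixels to stay sharp; here the markup only grows by the few extra digits in the width and height attributes.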

SVG file size vs. PNG

As you can see, there is a significant difference between the size of an SVG and a PNG. This probably won’t make much of a difference if you just have a few images here and there on your site, but if your site happens to be image heavy, then cutting down on the file sizes of your images can be a good way of improving your SEO by increasing your page load speed.

In addition to scalability, SVGs also allow for the creation of fairly complex graphics beyond the scope of what a normal tag in HTML can do.

Mock Dashboard made with React and D3

SVGs allow for the creation of complex graphs/graphics, which makes SVG the image format of choice for many charting libraries (including D3, one of the most popular data viz libraries for JavaScript). As shown in the GIF above, a good charting library plus SVGs allows for the creation of fancy transitions and animations to make your data visualizations look very slick.

SVGs are a tool just like everything else in web dev, and have pros and cons. One main con is potential complexity. D3 has a fairly steep learning curve, and can be overkill if you just want to create simple graphs/charts. Having been burned by this while building my final project, I’d say it’s worth looking into a higher-level library that abstracts some of the manual graphic creation away. Another con is that photographs can’t usefully be saved as SVGs, since they don’t break down into a manageable number of geometric shapes, which rules SVGs out for a substantial number of images.

A Quick overview of Relational vs. Non-relational Databases

Satya Sinha — Fri, 08 Feb 2019 15:35:01 GMT

So far my experience with databases as a developer has been limited to relational (“SQL”) databases. However, while exploring popular tech stacks used by professional developers, I came across the MEAN stack (MongoDB, Express, Angular, and Node), which then led me to research MongoDB, and the concept of databases that don’t rely on the techniques used to store data in relational databases.

First: a quick refresher on the basic concepts of a relational database. Relational databases are based on the Relational Model of data, first proposed by English computer scientist E.F. Codd in 1970. Data is organized into tables, and the relationships between tables are expressed through “foreign keys”: columns in one table that identify the related rows in another table. Almost all relational databases use SQL to search and maintain the database.

Simple example of a relational database

“Non-relational”, or “NoSQL”, is a catch-all term for databases that do not rely on the aforementioned model to organize data. One of the simplest examples of this is a database that is organized around key-value pairs. Document-oriented databases, which are organized around the premise that all records in the database are documents encoded in some sort of standardized format (such as JSON), are another common example.
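The two styles mentioned above can be illustrated with plain Python data structures. This is only a sketch of how the data is shaped, not of any particular database’s API; the keys, field names, and values are made up.

```python
import json

# Key-value style: each key maps to one opaque value.
key_value_store = {
    "user:1:name": "Ana",
    "user:1:email": "ana@example.com",
}

# Document style: each record is a self-contained document, here encoded as JSON.
document = json.dumps({
    "_id": 1,
    "name": "Ana",
    "email": "ana@example.com",
    "orders": [
        {"item": "keyboard", "price": 45},
        {"item": "monitor", "price": 180},
    ],
})

# Related data lives inside the document, so reading it needs no join.
user = json.loads(document)
order_total = sum(order["price"] for order in user["orders"])
```

In a relational database, the orders would typically live in their own table and be linked back to the user through a foreign key.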

Why use a NoSQL database?

Non-relational databases can have an appealing amount of simplicity compared to relational databases, which often grow quickly in complexity as more relationships between tables are established. If you know that the data you’re going to be working with is relatively simple, a simple key-value store may suffice. This issue with complexity may carry over to issues with performance. This won’t be much of a problem with the small apps we’ve been creating and working with, but for an app or website with millions of users, a poorly organized relational database can slow your app down significantly. Because NoSQL databases typically don’t need to join data across tables, they can potentially answer queries faster than a relational database.

Slow page loading significantly increases your bounce rate

NoSQL databases can also offer better scalability if you need to scale horizontally as opposed to vertically. Horizontal scaling means adding more machines to handle your data requirements, whereas vertical scaling means adding more computing power to a single machine. Horizontal scaling is one of the reasons NoSQL databases became more popular as the internet itself grew in popularity, as companies needed to manage potentially millions of users hitting their website at the same time.
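One common way to spread data across several machines is to hash each record’s key and use the hash to pick a machine. The sketch below is a minimal version of that idea under my own assumptions; the machine names are hypothetical, and real systems usually use more elaborate schemes so that adding a machine doesn’t reshuffle most of the keys.

```python
import hashlib


def shard_for(key: str, machines: list[str]) -> str:
    """Pick the machine responsible for a key by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return machines[int(digest, 16) % len(machines)]


machines = ["db-0", "db-1", "db-2"]            # hypothetical machine names
user_ids = [f"user-{n}" for n in range(1000)]

# Group the keys by the machine that would store them.
placement = {}
for user_id in user_ids:
    placement.setdefault(shard_for(user_id, machines), []).append(user_id)
```

Because the hash of a key never changes, every lookup for the same user goes to the same machine, and each machine only has to hold a slice of the data.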

Are NoSQL databases always the better choice?

For web applications with a large amount of read-only data, where speed is paramount, NoSQL databases may be a good choice. However, for complex databases with a large amount of structured and interrelated data, a NoSQL database may not be the best solution. An overly simplistic database will force a lot of the business logic that is implicitly enforced by a relational model of data into your actual app code. NoSQL databases are also generally less mature than their relational counterparts. This means that they can be less stable, which may be a significant issue if you’re dealing with large amounts of sensitive data, such as financial transactions or medical data. The lack of a standardized query language can also be an issue. Anyone familiar with SQL can quickly get up to speed making queries in a new database, whereas working with a NoSQL database may require learning a different query language, slowing down how quickly a new employee can become productive. Finally, NoSQL databases are often marketed towards websites and apps where much of the data flow revolves around CRUD. In situations where more complex business analytics is required, a SQL database may be a better choice.

Comparison of various NoSQL databases to a relational database

Sources/Further Reading

https://medium.com/media/7ceaa45254a278d92efdaa551390cb26/href

Relational databases vs Non-relational databases

https://www.hadoop360.datasciencecentral.com/blog/advantages-and-disadvantages-of-nosql-databases-what-you-should-k

A basic overview of Object-Oriented and Functional Programming

Satya Sinha — Thu, 17 Jan 2019 06:19:29 GMT

Switching from Ruby to JavaScript was a bit of a jarring transition for me. I think that part of the reason why I struggled in the first few days of learning the basics of JavaScript was that much of the focus had shifted away from Object-Oriented concepts such as classes and inheritance, to concepts focused on leveraging JavaScript’s functional versatility, such as first-class and higher-order functions. In light of that, I thought it would be useful to explore the basics and some of the pros and cons of the two programming paradigms we’ve been using for the past few weeks.

Basic principles of Object-Oriented Programming

Some of the primary features that characterize OOP are: Encapsulation, Inheritance, and Polymorphism.

Encapsulation as it pertains to OOP usually refers to two concepts: first, that one can restrict access to certain types of data that pertain to an object, and second, that data can be ‘bundled’ together with functions that manipulate that data. For example, data related to an instance of a “Person” class could be bundled with methods on that instance, such as person.name, person.age, etc.

Inheritance is the idea that new objects can take on the properties of existing objects. Creating a ‘Parent’ Class allows for all future ‘Child’ classes you potentially create to have access to the methods you defined in the parent class.

Polymorphism is the idea that child classes have the ability to redefine methods/functions that they’ve inherited from a parent class. This means that even though a function might have an identical name to what it was called in the parent class, it might not do the same thing.
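The three ideas above fit into one short example. I’ve written it in Python to keep the examples in this post in a single language; the Person and Developer classes are made up for illustration, and the same structure carries over to Ruby or JavaScript classes.

```python
class Person:
    def __init__(self, name: str, age: int):
        self._name = name    # leading underscore: internal by convention
        self._age = age

    @property
    def name(self) -> str:
        return self._name    # encapsulation: read-only access to bundled data

    def introduce(self) -> str:
        return f"Hi, I'm {self._name}."


class Developer(Person):             # inheritance: Developer gets Person's methods
    def introduce(self) -> str:      # polymorphism: same name, different behavior
        return f"Hi, I'm {self._name} and I write code."


people = [Person("Ana", 30), Developer("Ben", 25)]
greetings = [person.introduce() for person in people]
```

Developer never defines name, yet people[1].name works because it is inherited, and calling introduce() on each object runs whichever version belongs to that object’s class.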

Pros of OOP

1. It can help you keep your code DRY. Inheritance means you don’t have to write the same code over and over again.
2. It can be easy to read/write. Having clearly defined classes and logically organized inheritance chains can make it easier for you to refactor your code, and for other people to understand what certain classes in your codebase are meant to be responsible for.
3. It can help keep your code modular. Organizing your code into classes and subclasses promotes a design principle known as the separation of concerns, which suggests that when designing a program you should organize your code in such a way that each section addresses one aspect of your program’s functionality. This makes your codebase easier to maintain, refactor, and build upon. This design principle is conceptually similar to the Single Responsibility Principle, which was discussed in Mod 1 as a technique for organizing the logic of your classes.

Cons of OOP

1. If you aren’t careful and deliberate with the way you organize your classes and manage inheritance, your code can quickly become unmaintainable and break very easily. One small change to a parent method can have ripple effects that can break your code in countless unexpected places.

2. Poorly organized OO code can lead to a lot of shared state issues, as a lot of objects may be interacting with the same data. That can lead to a significant flaw in your app known as a ‘race condition’. Race conditions occur when two execution threads try to alter the same data at the same time, which can obviously lead to all sorts of unintended side effects in your code.
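A race condition is easier to see when the steps are written out one at a time. The sketch below does not use real threads; it simulates, in a fixed order, one unlucky interleaving of two threads that each try to add 50 to a shared balance. The variable names and amounts are made up.

```python
balance = {"amount": 100}    # shared state that two threads both modify

# Each "thread" wants to add 50, in two steps: read the value, then write it back.
read_by_thread_a = balance["amount"]          # thread A reads 100
read_by_thread_b = balance["amount"]          # thread B reads 100 before A writes
balance["amount"] = read_by_thread_a + 50     # thread A writes 150
balance["amount"] = read_by_thread_b + 50     # thread B overwrites it with 150

lost_update_result = balance["amount"]        # one of the two deposits has vanished
expected_result = 100 + 50 + 50
```

With real threads the interleaving varies from run to run, which is what makes this kind of bug so hard to reproduce.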

Basic principles of Functional Programming

As discussed in the lecture we had recently, Functional Programming is about writing code that avoids shared state, avoids mutating data, and keeps the output of your functions predictable. Unlike the concept of encapsulation in OO programming, Functional Programming separates the data from the function, so that inputs/outputs are controlled and more predictable.
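Here is a minimal sketch of those ideas in Python: a pure function, data that is never changed in place, and higher-order functions (map and reduce) that take other functions as arguments. The function names and prices are made up for the example.

```python
from functools import reduce

prices = (10, 25, 40)    # a tuple cannot be mutated in place


def apply_discount(price: int, percent: int) -> float:
    """Pure function: the result depends only on the arguments."""
    return price * (100 - percent) / 100


def discount_all(items: tuple, percent: int) -> tuple:
    """Return a new tuple instead of changing the input."""
    return tuple(map(lambda price: apply_discount(price, percent), items))


discounted = discount_all(prices, 10)
total = reduce(lambda running, price: running + price, discounted, 0.0)
```

Calling discount_all with the same arguments always gives the same result, and the original prices tuple is left untouched.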

Pros of Functional Programming

1. Knowing what your function will output given a certain input makes it easier to tell when a function is not behaving the way it should.
2. Avoiding shared state problems can make dealing with concurrency issues easier (such as the race conditions mentioned earlier).
3. It can make your code less ‘brittle’ and more scalable by avoiding the ripple effects caused by careless inheritance in OOP.

Cons of Functional Programming

1. Functional programming can be difficult to read compared to OOP, especially if you’ve encapsulated too much logic into a single function.
2. Combining several different functions can get very confusing very quickly.
3. For many people, functional programming is unintuitive due to concepts such as a high emphasis on immutability, and thus, difficult to read and write.

Functional Programming in a nutshell

Conclusion

The major takeaway I’d like people to leave with after reading this post is that these are programming paradigms, not ironclad laws. Just as languages can be thought of as tools in a developer’s toolbox, Object-Oriented Programming and Functional Programming can be thought of in the same way. Situations will arise where one style makes more sense than the other, and the most important thing is to have the flexibility to switch when necessary. Also, neither style is a cure for poorly reasoned/organized code, which will be hard to scale and maintain regardless of which programming paradigm you choose to write your code in.

The magic behind Spotify’s Discover Weekly playlists.

Satya Sinha — Thu, 20 Dec 2018 06:55:11 GMT

An interesting irony of the internet of today is that as it expands to encompass more people than ever, it has also grown to be a more personal experience for the individuals using it. Targeted digital advertising, YouTube’s suggested videos, and Netflix’s recommended movies all fall under this umbrella of a ‘personalized’ experience for a user. This blog post is going to focus on one particular facet of this ‘personalized internet’: Spotify’s Discover Weekly playlist.

Spotify’s recommendation algorithm is primarily a combination of 3 techniques: natural language processing, collaborative filtering, and analysis of raw audio files.

Natural Language Processing

Natural language processing (NLP) is a subfield of computer science that focuses on how computers can process human speech and text. Natural language processing techniques have a wide range of applications today, such as speech recognition capabilities for digital assistants like Alexa or Siri, sentiment analysis for product/business analytics, etc. In Spotify’s case, they use NLP to scan the internet for text related to various artists and songs, adjectives used to describe these artists/songs, and other artists and songs that appear frequently in the same text. This allows Spotify to “cluster” artists and songs to use as potential suggestions for users, based on who they already listen to.

Collaborative filtering

Collaborative filtering is another common technique Spotify uses to personalize its recommendations. At its core, collaborative filtering relies on similarities between users to make recommendations, instead of similarities between artists or songs. For example, if user A saved songs A, B, and C to a playlist, and user B saved songs A, B, and D to a playlist, Spotify can use collaborative filtering to identify that these users have similar tastes, and recommend song D to user A, and song C to user B. Over time, as a user listens to more music, saves more songs, and creates more playlists, Spotify can create a detailed ‘taste profile’ of one’s music preferences down to specific sub-genres.
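The user A / user B example above can be turned into a tiny working sketch. To be clear, this is not Spotify’s actual algorithm: measuring similarity by the overlap of saved songs (Jaccard similarity) and the 0.3 cutoff are my own simplifying assumptions, chosen only to illustrate the idea of recommending what similar users like.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sets: shared items divided by all items."""
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend(user: str, saved: dict[str, set], min_similarity: float = 0.3) -> set:
    """Suggest songs that similar users saved and this user has not."""
    suggestions = set()
    for other, songs in saved.items():
        if other != user and jaccard(saved[user], songs) >= min_similarity:
            suggestions |= songs - saved[user]
    return suggestions


saved_songs = {
    "user_a": {"A", "B", "C"},
    "user_b": {"A", "B", "D"},
    "user_c": {"X", "Y", "Z"},   # nothing in common with the other two
}
for_a = recommend("user_a", saved_songs)
for_b = recommend("user_b", saved_songs)
```

Users A and B share two of four songs, so they count as similar and swap recommendations; user C’s songs are never suggested to either of them.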

Raw audio analysis

While collaborative filtering is a powerful means of fine-tuning recommendations, one of the most common roadblocks in effectively implementing this technique is known as the ‘cold start problem’. The cold start problem is essentially a situation in which a system lacks sufficient data to properly assess similarities. This is very common in user-facing applications such as Spotify, where new users sign up every day and haven’t yet listened to enough music to create an accurate taste profile. Another cold start situation would be a new artist releasing an album. Songs and albums that have few plays pose a difficult problem for a system that relies on collaborative filtering, as there isn’t enough user engagement to use as a means of comparison. However, Spotify’s analysis of the raw audio files in its database has proven to be a fairly effective workaround for this issue.

Spotify’s analysis of raw audio relies on convolutional neural networks. Convolutional neural networks are a class of neural nets that are most commonly used for analyzing visual imagery, but can be used to analyze audio by converting the audio file into a spectrogram. Once a song has been converted into a spectrogram, Spotify uses convolutional neural nets to compare songs based on common musical characteristics, such as tempo, pitch, key, etc. Since this process isn’t based on popularity or user interaction in any real way, it allows Spotify to compute similarity scores for both new and old songs without having to hit a critical mass of user listens for the collaborative filtering process to kick in. This helps Spotify inject some novelty into their Discover Weekly playlists, as otherwise the only songs shown would be ones that have been listened to by enough users. Novelty can be a very effective means of maintaining user engagement, so layering raw audio file analysis over NLP and the collaborative filtering process is immensely valuable to Spotify. This also improves the value of the app for users, as they can use Spotify as a tool to discover new music, as well as listen to music they already know.

Spectrogram
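A spectrogram is essentially a grid showing how much of each frequency is present in each short slice of time. The sketch below builds a very small one in plain Python by cutting a signal into frames and taking a discrete Fourier transform of each frame. The sample rate, frame size, and test tone are arbitrary choices of mine to keep the example small; real audio pipelines use optimized libraries and overlapping, windowed frames, and this says nothing about how Spotify computes theirs.

```python
import cmath
import math


def spectrogram(samples: list[float], frame_size: int) -> list[list[float]]:
    """Split the signal into frames and take the magnitude of each frame's DFT."""
    frames = [
        samples[start:start + frame_size]
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return [
        [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n, x in enumerate(frame)))
            for k in range(frame_size // 2)   # frequency bins below half the sample rate
        ]
        for frame in frames
    ]


SAMPLE_RATE = 800    # samples per second (small on purpose)
TONE_HZ = 100        # a pure 100 Hz tone
FRAME_SIZE = 64
signal = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(256)]

spec = spectrogram(signal, FRAME_SIZE)               # one row per time frame
peak_bins = [row.index(max(row)) for row in spec]    # strongest frequency bin per frame
peak_hz = [bin_index * SAMPLE_RATE / FRAME_SIZE for bin_index in peak_bins]
```

For this steady tone, every frame’s strongest bin corresponds to 100 Hz. For a real song the rows would differ from frame to frame, and that changing pattern is the "image" a convolutional neural network can learn from.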

Outlier Flagging

Another cool feature Spotify’s Discover Weekly has is the ability to flag ‘outliers’. If someone primarily listens to a few genres of music, but decides to listen to a song or artist far outside of their taste profile, Spotify won’t immediately update its recommendation algorithms to account for this song. Instead it will flag the song as an outlier, and continue to serve recommendations based on one’s established tastes. If a user were to continue listening to that song or artist, or songs and artists of a similar genre on a consistent basis, Spotify would eventually drop the outlier flag, update the user’s taste profile, and change its recommendations accordingly. This encourages the user to explore Spotify’s library, as listening to music outside of one’s current preferences won’t ‘screw up’ one’s recommendations.

Other data signals

Creating a feature like this is an iterative process, and as anyone who’s built any sort of app knows, there’s always room for improvement or the potential to build out another capability. Discover Weekly is no different, and Spotify is always assessing what other data signals they could potentially incorporate to improve the quality of the playlist. Beyond the three main techniques detailed above, Spotify has also integrated data signals related to how users interact with the Discover Weekly playlist itself. For example, if a user saves a song from a Discover Weekly playlist, Spotify will count that as a ‘like’, and factor that information in when creating next week’s playlist. Other data points, such as the number of times a user listened to a song, or whether or not a user listened to an entire song, also serve as valuable inputs.

Main Takeaways/Conclusion

Spotify’s Discover Weekly serves as a great example of how a platform designed for as many people as possible can still deliver a personalized user experience, and is in fact improved by additional users joining the platform. The collaborative filtering process creates a positive feedback loop for Spotify and its users, where more user data improves the accuracy of the taste profile, which increases the likelihood that users will continue to use the platform, which provides Spotify with more data, and so on. This phenomenon can also be described as a ‘network effect’, which can be observed in other networks such as Facebook or Twitter. Finally, Spotify’s data-driven process of improving their recommendation algorithm is a good case study of how effective incorporation of user data/feedback into your development process can strengthen product/market fit and improve user engagement.

https://medium.com/media/07e908da2b7d773c9d595c559e819741/href

Sources/Related links/Further reading

The magic that makes Spotify's Discover Weekly playlists so damn good

https://blog.galvanize.com/spotify-discover-weekly-data-science/

https://medium.com/media/56c2560d67e42622142d34f41e923014/href