Geek Culture
Published in

Geek Culture

A simple guide on connecting RStudio to a PostgreSQL database

Have you thought about the possibility of querying a relational database like PostgreSQL from RStudio? In 2021, I tried to do this, but I found out that there were not many resources explaining how to do that in great detail. I should have written this article a while ago. However, during that time, I taught a project-based course on Coursera titled “Manipulate R data frames using SQL in RStudio” to understand how to use and write SQL codes to query R data frames in RStudio. You can check out the course here.

In this article, I will show you how to connect locally to a PostgreSQL database. The ideas you will glean from this article will be helpful when connecting to other relational databases, including MySQL and SQL Server, even if the database is in cloud storage.

Before exploring the article, a question that should come to mind is — what is the rationale for this article? Why do I need to connect RStudio to SQL databases?

Usually, data in a relational database are stored in a normalized format. Therefore, for data professionals to perform analyses such as statistical computing and data visualization, there is a need to write advanced and complex SQL queries. However, R can easily connect to many relational databases like PostgreSQL, MySQL, Oracle, SQL Server, etc., and retrieve data from them as a data frame. When the data is successfully in the R environment, it becomes a typical R data set, making it easy to manipulate or analyze using all the powerful packages and functions.

I believe that this provides a solid reason why this article is essential.

Before we proceed, there is a little caveat here. This article is a simple guide, meaning it is not all-encompassing to explain all the technicalities involved. Be assured that by following the steps outlined in this guide, you will be able to connect your RStudio to your local PostgreSQL database. Also, these steps, with little tweaks, can connect to cloud-based databases.

Also, before we start, I want to assume that you have PostgreSQL installed, you know how to create a database, and you have at least a database in PostgreSQL. Suppose you do not have prior experience with working with SQL databases. In that case, you can check out the project-based course I taught titled “Performing Data Definition and Manipulation in SQL” here.

Furthermore, various functions and packages can help achieve the same purpose for this article — which is connecting RStudio to a PostgreSQL database. However, I will use the odbc R packages to show this illustration. ODBC means Open Database Connectivity, an open standard application programming interface (API) that allows application programmers to access any database.

Finally, before we get started, I want to mention that make sure you read through to the end; I have added a bonus section and a gift to this article. Now, let’s get started!

Getting Started

The first step will be to install and import the required R packages. If you have used R before, you should know that R has packages for almost any purpose. There are over 13,000 packages in the R Comprehensive Archive Network (CRAN) repository. To install and import the packages we will use in this article, use the code below.

Note: To run a line of code in R, use CTRL ENTER for Windows users or use the Run button on your RStudio interface.

I suggest using a Google search to check what each package imported does. I have skipped explaining the functions of each package imported to make this article as short and compact as possible.

With this out of the way, we can now see how to connect RStudio to a PostgreSQL database.

Connect RStudio to a PostgreSQL database

In this section, we will query the employees table in the Employees table. So, first, we will connect to the local database in PostgreSQL called Employees. Note that I have this database already in my PostgreSQL. Let me show what the database looks like including some records or rows in the employees table.

The Employees database contains seven tables and two views, as seen from the image above, including the employees table. In addition, on the right-hand side, the output from querying all the data in the employees table is returned. However, we can only see the first ten rows due to the space limit. However, the employees table contains about 11,822 rows and six columns, as shown in the image below.

Great! We can start learning how to connect RStudio to the Employees PostgreSQL database. I will outline the steps to connect successfully to this database from RStudio. If you are ready and excited to see how to do that, then join me!

Step One: Define a database connection

Creating a database connection from RStudio to PostgreSQL is relatively straightforward. In this step, we will use functions from the DBI package we installed earlier.

Before we write the code to connect RStudio to PostgreSQL, we want to be sure we have every information about the database, including the port number and username. To get that information, once you open pgAdmin (the management interface for PostgreSQL), do the following as shown in the image below:

  • Right-click on PostgreSQL14 and select Properties (a dialog box will open)
  • Select the Connection tab, and you will see all information needed to create the connection.

Note: The default port number for PostgreSQL is 5432.

Amazing! Now, we have the needed information to create the connection.

Usually, it is good practice to start by first checking whether a connection exists, that is, to check whether we can connect to the PostgreSQL database from RStudio. This article will use the dbCanConnect() function from the DBI package. The function only checks validity without returning a connection object. The default implementation opens a connection and disconnects on success. Simply, it helps us check if RStudio can establish a connection with the database.

The code above looks quite understandable. We filled in the information we got about the database. Recall that we want to connect to the Employees database hence the argument dbname = “Employees”. Furthermore, input the same password you used when you installed PostgreSQL, which is the same password you put in when you open your PgAdmin interface.

Beautiful! The result of printing the connection object con is TRUE, suggesting that there is a valid connection. Having established that, we can connect to the database by going through the appropriate authentication procedure using the dbConnect() function. It takes similar arguments as the dbCanConnect() function.

As I mentioned, they take similar arguments but produce different results. Let’s see the result of the connection object con_1 we created.

I believe that we both agree that they return different results. This result tells us that we have successfully connected to the Employees database on port 5432.

Bravo! From here onwards, it is going to be a smooth ride. Connecting to the database is the first and most crucial step. Isn’t that why you are reading this article?

Now, let’s play with some other functions and interesting data tasks. Next, using the dbListTables() function, we can return a list of all views and temporary objects (tables) in the database.

What do you notice from the result?

Nine objects were returned, corresponding to seven tables and two views, as we saw in the image at the beginning of the section. This makes sense, right?

The question now is — after creating the connection, then what? Find out the answer to this question in the next step.

Step Two: Query the database

Now, let’s query the database directly from RStudio using the dbGetQuery() function from the DBI package. This function returns the result of a query as a data frame. It takes two required arguments: the connection and an SQL statement.

Upon running the code below, the employees table will be queried to return all data. The disadvantage of running the code below the way I have written it is that it prints the results on the RStudio console, thereby making your RStudio console very messy.

One better way is to store the result as an R object, for example, emp_df. With this R object, we can do many other things, including data manipulation on the employees table without affecting the original employees table in the Employees database or losing the data even when we close the connection to the database.

Great! You are doing well. The code will successfully query all data in the employees table.

Let’s proceed to step three!

Step Three: Explore the dataset

Now, we have the data in the employees table saved as an R data frame. First, we can check the data’s dimension (number of rows and columns) using the dim() function.

Isn’t this interesting? The result tells us that the data has 11,822 rows and six columns.

Do these numbers strike a chord? Yes!

These were the exact result of the number of rows and columns in the employees table in the Employees PostgreSQL database. Clearly, we have successfully connected to the employees table without reasonable doubt.

Now, we can glimpse the data using the glimpse() function. glimpse() is like a transposed version of print(): columns run down the page, and data runs across. This function makes it possible to see every column in a data frame. It’s a little like str() applied to a data frame, but it tries to show you as much data as possible. Also, it always shows the underlying data, even when applied to a remote data source. glimpse() is a function in the dplyr package.

Note: The dplyr package is under the umbrella of a family of tidyverse packages. The dplyr package is very powerful, and in fact, it is the most useful package in R for data manipulation. It is my all-time best package for data manipulation. You can check out my project-based course on Coursera titled Data Manipulation with dplyr in R here. I covered the use of the dplyr package in that course.

Amazing! At one goal, we can see all the information of a dataset, including the number of rows, number of columns, column names, column data types, and even a few rows of the data.

As we advance, we can use the head and tail base R functions to get the dataset’s first and last six rows.

Isn’t this so beautiful? These results further stress the importance of what we are considering in this article. Just by connecting to the database, we can query any of the tables in the database. In this case, we queried the employees table and saved it as an R data frame. With this, we can do absolutely any data manipulation, cleaning, transformation, or visualization on the employees table, which may have been difficult or expensive if we were to write SQL queries. Also, we can do all these data tasks without tampering with the database — thanks to RStudio.

As we begin to round off this article, it is crucial that once you have saved your queried data as an R object (like a data frame), you should disconnect or close the connection you created. Doing this helps discard all pending work and frees resources (e.g., memory, sockets).

We will use the dbDisconnect() function from the DBI package. It works like the opposite of the dbConnect() function we used to create the connection. The dbConnect() function takes one required argument: the connection.

This code closes the connection, meaning we no longer have access to the tables in the Employees database. However, we can still perform data manipulation tasks on the employees table because we have saved that data as an R data frame called emp_df.

For example, a data question could be asked, such as how many employees were employed on or before 1st January 1990? The code below helps us answer this question.

The code above uses functions in the dplyr package. The %>% represents the pipe function. It is a super useful function in dplyr to chain a sequence of operations together. The filter() function filters for where the hire_date was on or before 1st January 1990 (hence the use of <=). Also, you will notice that the date takes the yyyy-mm-dd format (like a typical SQL date format). You can decide to change the date format using functions in the powerful lubridate() package (doing this is beyond the scope of this article). Finally, we used the count() function to get the number required.

Let’s see the result of the R code above.

Amazing! Of the 11,822 employees, the company employed 6,453 on or before 1st January 1990.

We have come to the end of this discussion on connecting RStudio to a PostgreSQL database.

Ooh… Hold on! Let me quickly add this important point here. You can use the code below also to create a connection. The difference between this code and the one we used above is simply using the RStudio API to request the username and password.

Actually, there is no difference, but this code looks more secure than the first one. Using this may be helpful if you share your code with others and don’t want to reveal your username and password.

Upon running the code above, a dialog box asking for your username and password will pop up like the one below. Upon inputting the username and password, the connection will be created.

Note: You only requested the username and password in the code above. You could have also used the RStudio API to ask for the database name.

I believe that was quite understandable. I decided to add that last part to show that you can do something like this with RStudio. You can explore that further.

Would you be interested in learning how to connect RStudio to a MySQL database? Connecting to a MySQL database is not as straightforward as connecting to a PostgreSQL database. However, in my article here, I have explained everything you need to know and do to connect RStudio to a MySQL database. Make sure to read it.

Conclusion

Amazing! You have come a long way, and I am glad you came this far. I salute your doggedness and determination to get to the end of this illustration. I believe it has been worth your time learning about these concepts, which you found valuable.

I have a gift for you — I have uploaded the complete R script used for this illustration to a GitHub repository here. This script saves you the stress of writing the codes, and you can quickly run the code. Of course, to practice with this script, you will need to tweak the script, like changing the database name, table name, and so on.

Furthermore, you can do a host of other things after creating the connection to the database. You can update a table or add a new record to a table in the database from RStudio. This article has given you the first and crucial step to maximizing the power of RStudio and SQL. From here, you can read the documentation of packages and functions to do different things with the database you have connected to from RStudio.

If you enjoyed this post, please drop your comments in the comment session and share this article with others. It would mean a lot to me if you could click on the “claps” icon (up to 50 claps allowed).

As a project-based course instructor with Coursera Guided Project Network, I have taught some courses on using R, especially the R tidyverse family of packages, including the dplyr package. You can check them out here to take any of these courses. Thank You! See you at another time!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Arimoro Olayinka

Arimoro Olayinka

Guided Project Instructor at Coursera Inc.