Transfer File From FTP Server to AWS S3 Bucket Using Python
File transfer functionality with help from the paramiko and boto3 modules
Hello everyone. In this article we will implement file-transfer functionality (from an FTP server to Amazon S3) in Python, using the paramiko and boto3 modules. To follow along you will need:
- Python (3.6.x)
- AWS S3 bucket access
- FTP server access
Note: You don’t need to be familiar with the above Python libraries to understand this article, but make sure you have access to an AWS S3 bucket and an FTP server, with credentials for both. We will walk through the Python functions step by step, and I’ll leave a GitHub link at the bottom of the article.
Step 1: Initial Setup
Install all of the above packages using pip install:
pip install paramiko boto3
Also install awscli on your machine and configure your access key ID, secret access key, and region; the AWS CLI documentation explains how to do this.
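The configuration step typically looks like this (the values shown are placeholders; use your own credentials and preferred region):

```shell
# Interactively store AWS credentials and defaults in ~/.aws/
# so that both awscli and boto3 can pick them up.
aws configure
# AWS Access Key ID [None]: YOUR_ACCESS_KEY_ID
# AWS Secret Access Key [None]: YOUR_SECRET_ACCESS_KEY
# Default region name [None]: us-east-1
# Default output format [None]: json
```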
Step 2: Open FTP Connection
Let’s have a look at the function that opens the FTP connection to the server.

We make a new SSH session using paramiko’s SSHClient class and load the local system host keys for the session. For FTP transport over SSH (i.e., SFTP), we specify the server host name ftp_host and port ftp_port. Once the transport is created, we authenticate with the FTP server using transport.connect(). If authentication is successful, we initiate the FTP connection using paramiko’s SFTPClient. This gives us the ftp_connection object, with which we can perform remote file operations on the FTP server.
Step 3: Transfer file from FTP to S3
This is a big function that does the actual transfer for you. We will break the code into snippets to understand what is actually going on.
First things first: connecting to FTP and S3
The transfer_file_from_ftp_to_s3() function takes a bunch of arguments, most of which are self-explanatory. ftp_file_path is the path from the root directory of the FTP server to the file, including the file name. Similarly, s3_file_path is the path starting from the root of the S3 bucket, including the file name. The program reads the file from the FTP path and copies the same file to the S3 bucket at the given S3 path.

We also read the file size from FTP. Based on the size of the file, we decide the approach: transfer the complete file in one go, or transfer it in chunks of the provided chunk_size (also known as a multipart upload).
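A minimal sketch of that size-based decision, assuming ftp_connection is the paramiko SFTPClient from Step 2 (the helper name is hypothetical):

```python
def decide_transfer_approach(ftp_connection, ftp_file_path, chunk_size):
    """Stat the remote file and pick the upload strategy.

    Returns the file size plus either "single" or "multipart".
    Uses paramiko SFTPClient.stat(), which exposes the size as st_size.
    """
    ftp_file_size = ftp_connection.stat(ftp_file_path).st_size
    if ftp_file_size <= chunk_size:
        return ftp_file_size, "single"
    return ftp_file_size, "multipart"
```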
Avoid duplicate copy
This small try/except block checks whether an object already exists at the given S3 path, and also compares its size with the FTP file’s size. If both match, we abort the transfer, close the FTP connection, and return from the function.
Transfer the small files in one go
If the file is smaller than the chunk size we have provided, we read the complete file using the read() method, which returns the file data as bytes. We then upload this byte data directly to the S3 bucket, at the given path and file name.
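A sketch of the single-shot path, assuming a boto3 S3 client. The helper name and the use of put_object() are assumptions for illustration, not necessarily the script’s exact call:

```python
def transfer_small_file(ftp_connection, s3_client, bucket_name,
                        ftp_file_path, s3_file_path):
    """Read the whole remote file and upload it to S3 in one request."""
    ftp_file = ftp_connection.open(ftp_file_path, "rb")
    file_data = ftp_file.read()  # entire file contents as bytes
    ftp_file.close()
    # Upload the byte data directly to the given S3 path.
    s3_client.put_object(Bucket=bucket_name, Key=s3_file_path, Body=file_data)
```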
Transfer big files in chunks AKA Multipart Upload
We will transfer the file in chunks! This is where the real fun begins…
First, we count the number of chunks we need to transfer based on the file size. Remember, AWS won’t allow any chunk to be smaller than 5 MB, except the last part; only the last part can be less than 5 MB.
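The chunk count is simple ceiling arithmetic. For example, a 15 MB file with a 6 MB chunk size needs three parts (6 MB + 6 MB + 3 MB, where only the last is under the minimum):

```python
import math

MIN_PART_SIZE = 5 * 1024 * 1024  # AWS minimum for every part except the last


def count_chunks(file_size, chunk_size):
    """Number of multipart parts; only the last part may be smaller than chunk_size."""
    return math.ceil(file_size / chunk_size)
```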
We then iterate over all the chunks in a for loop, reading data chunk by chunk from FTP and uploading it to S3, using the multipart upload facility provided by the boto3 library. create_multipart_upload() initiates the process. The transfer of each chunk is carried out by the transfer_chunk_from_ftp_to_s3() function, which returns a Python dict containing information about the uploaded part.

These part dicts are collected in a Python dict called parts_info, under the key ‘Parts’. The parts_info dict is then passed to complete_multipart_upload() to complete the transfer, along with the upload ID taken from the dict returned when the multipart upload was initiated. After completing the multipart upload, we close the FTP connection.
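Putting those steps together, a sketch of the multipart flow might look like this (argument names are illustrative; it relies on the per-chunk helper transfer_chunk_from_ftp_to_s3() described in the next step):

```python
import math


def transfer_big_file(ftp_connection, s3_client, bucket_name,
                      ftp_file_path, s3_file_path, file_size, chunk_size):
    """Multipart upload: initiate, upload each chunk, then complete."""
    # Initiate the multipart upload; the returned dict carries the UploadId.
    multipart = s3_client.create_multipart_upload(Bucket=bucket_name,
                                                  Key=s3_file_path)
    parts_info = {"Parts": []}
    chunk_count = math.ceil(file_size / chunk_size)
    ftp_file = ftp_connection.open(ftp_file_path, "rb")
    for part_number in range(1, chunk_count + 1):
        # transfer_chunk_from_ftp_to_s3() uploads one part and returns
        # its part-output dict (PartNumber and ETag).
        part_output = transfer_chunk_from_ftp_to_s3(
            ftp_file, s3_client, multipart, bucket_name,
            s3_file_path, part_number, chunk_size)
        parts_info["Parts"].append(part_output)
    # Complete the upload using the collected parts and the upload ID.
    s3_client.complete_multipart_upload(
        Bucket=bucket_name,
        Key=s3_file_path,
        UploadId=multipart["UploadId"],
        MultipartUpload=parts_info)
    ftp_file.close()
    ftp_connection.close()
```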
How to transfer the chunk?
This function reads one chunk of file data from FTP by passing the chunk size, in bytes, to the ftp_file.read() function. The byte data is passed as the Body parameter to upload_part(), which also takes other parameters such as the bucket name and the S3 file path. The PartNumber parameter is just an integer indicating the part’s position: 1, 2, 3, and so on.

Once the part is uploaded, we return a part-output dict containing the ETag and PartNumber, which is then appended to the ‘Parts’ list in the parts_info dict to complete the multipart upload.
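A sketch of the chunk-transfer helper, assuming a boto3 S3 client and the dict returned by create_multipart_upload() (argument names are illustrative):

```python
def transfer_chunk_from_ftp_to_s3(ftp_file, s3_client, multipart_upload,
                                  bucket_name, s3_file_path,
                                  part_number, chunk_size):
    """Read one chunk from the FTP file and upload it as one multipart part."""
    chunk = ftp_file.read(int(chunk_size))  # at most chunk_size bytes
    part = s3_client.upload_part(
        Bucket=bucket_name,
        Key=s3_file_path,
        PartNumber=part_number,
        UploadId=multipart_upload["UploadId"],
        Body=chunk,
    )
    # The ETag and PartNumber are what complete_multipart_upload() needs.
    return {"PartNumber": part_number, "ETag": part["ETag"]}
```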
We did it!
That’s it! You have successfully transferred the file from FTP to S3; you should now see the success message on the console.

Visit the GitHub link for the complete Python script. Thank you for reading this far. I hope you found this article helpful. Cheers!