Google Cloud Storage (GCS) provides blob storage for data. Files can be uploaded to GCS and subsequently retrieved. The storage is cheap and provides excellent availability and durability. GCS provides APIs for a variety of programming languages that custom applications can use, and many of Google’s products are pre-built to write data to and read data from GCS. Command line tools such as gsutil also provide scripting access. Data that is web addressable can be ingested automatically using the Storage Transfer product.
What Google does not provide out of the box is the ability to access GCS data through any of the file transfer protocols in the FTP family. In this article we describe access to GCS via the SSH File Transfer Protocol (SFTP), often called the Secure File Transfer Protocol, and the corresponding SFTP client tools.
SFTP is an open specification for interacting with a remote file system to store, retrieve and list files in a hierarchical file system. SFTP encrypts all in-flight traffic, which means that whatever data you are moving cannot be examined on the wire. Common public clients for SFTP include:
The SFTP specification has some excellent open source library implementations which means that we can write both client and server SFTP implementations. One such library is called ssh2 and is available for Node.js. Using this library, we wrote a sample which exposes itself as an SFTP server but uses GCS as the back-end storage system. This means that SFTP clients can connect to our SFTP server (which we have called sftp-gcs). Requests to put files, get files, list files and other file operations are then executed against GCS. From an SFTP client perspective, it behaves identically to working with any other hierarchical file system with the distinction that the data is being backed by GCS. If you have applications or tools that currently expect to work with file data using SFTP then this may be an excellent component to bring GCS into your story.
node sftp-gcs.js --bucket=my-bucket
There are some command line flags available to us. Only --bucket is required.
- --bucket [BUCKET_NAME]: The name of the bucket to work against.
- --port [PORT_NUMBER]: The TCP/IP port number for the server to listen upon. Defaults to 22.
- --user [USER_NAME]: The user name if we wish to use user name login.
- --password [PASSWORD]: The password if we wish to use user name login.
- --public-key-file [PUBLIC_KEY_FILE]: A file containing the public SSH key for SSH login.
- --service-account-key-file [KEY_FILE]: A path to a local file that contains the keys for a Service Account.
- --debug [DEBUG_LEVEL]: Switch on debugging. Supply “debug” for maximal debugging.
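To make the flag rules concrete, here is a minimal sketch of how the flags listed above could be parsed and validated using Node's built-in util.parseArgs. The parseFlags helper is hypothetical and exists only for illustration; we are not claiming that this is how sftp-gcs.js handles its arguments. It enforces the two rules stated above: --bucket is required and --port defaults to 22.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical helper: parse the sftp-gcs flags from an argv-style array.
// This is an illustration, not the argument handling used by sftp-gcs.js.
function parseFlags(args) {
  const { values } = parseArgs({
    args,
    options: {
      bucket: { type: 'string' },
      port: { type: 'string' },
      user: { type: 'string' },
      password: { type: 'string' },
      'public-key-file': { type: 'string' },
      'service-account-key-file': { type: 'string' },
      debug: { type: 'string' },
    },
  });

  // --bucket is the only required flag.
  if (!values.bucket) {
    throw new Error('--bucket is required');
  }

  // --port defaults to 22 when it is not supplied.
  const port = values.port === undefined ? 22 : Number(values.port);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid --port value: ${values.port}`);
  }

  return { ...values, port };
}

console.log(parseFlags(['--bucket=my-bucket', '--port', '9022']));
```

Both the --flag=value and --flag value forms are accepted by parseArgs, and an unknown flag causes it to throw, which catches typos early.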
The SFTP server will act as a gateway to GCS storage. It has been designed to allow access to only a single bucket per instance. We can always configure multiple instances where each instance can be defined to use a distinct bucket. If there is demand/interest for multi-bucket support, this can be added at a later time.
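Since each instance serves a single bucket, every SFTP path has to be translated into an object name inside that bucket. GCS has a flat namespace in which "/" characters in object names simulate folders, and a folder listing is a query for the objects sharing a prefix. The following is a minimal sketch of such a translation; the helper names toObjectName and toListPrefix are hypothetical and this is not necessarily how sftp-gcs performs the mapping.

```javascript
const path = require('node:path');

// Hypothetical helper: turn an SFTP path into a GCS object name.
function toObjectName(sftpPath) {
  // Anchor the path at "/" so that ".." segments cannot climb above
  // the bucket root, then collapse "." and ".." segments.
  const normalized = path.posix.normalize('/' + sftpPath);
  // GCS object names have no leading slash; drop trailing slashes too.
  return normalized.replace(/^\/+/, '').replace(/\/+$/, '');
}

// Hypothetical helper: the object-name prefix to list for an SFTP directory.
function toListPrefix(sftpPath) {
  const name = toObjectName(sftpPath);
  // The bucket root is listed with an empty prefix.
  return name === '' ? '' : name + '/';
}

console.log(toObjectName('/reports/2020/../summary.csv'));
console.log(toListPrefix('/reports'));
```

Anchoring the path before normalizing is a deliberate choice: a client that sends a path such as ../../secret still ends up inside the bucket's own namespace.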
Because this is an SFTP server, SFTP clients must connect to it, which means that it must listen on a TCP/IP port. We can supply the port with the --port parameter. If not supplied, the default is 22, which is the same port used by SSH. You will likely want to supply an alternate port if you plan on running the server on a machine that also runs SSH.
Security is a primary consideration and there are two aspects of security we must examine. The first is what identity can be used to access the SFTP server. The second is that once the SFTP server is accessed, what identity is presented to GCS to allow access?
Let us look at SFTP server access. There are three possibilities:
- No security. A connection request to the SFTP server will immediately succeed with no challenge. Security will fall back to the connection to GCS with the GCS permissions granted to the SFTP server.
- Userid/password. A connection request to the SFTP server will result in a request for a userid/password pair that must match that supplied during configuration.
- SSH keys. A connection request to the SFTP server will only succeed if the caller has a private key corresponding to the public key configured to the server.
To use userid/password security, supply --user and --password.
To use SSH key based security, supply a public key file using --public-key-file.
To not use any security, don’t supply any of --user, --password or --public-key-file.
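The three login possibilities can be sketched as a small decision function. This is a minimal illustration and not the code in sftp-gcs: the helper names authMode and acceptLogin are hypothetical, the flags object is assumed to be keyed by flag name, and the rule that a public key file wins when both kinds of flags are supplied is our assumption. The attempt object mimics the method, username and password fields that the ssh2 library exposes on its authentication context.

```javascript
const crypto = require('node:crypto');

// Compare two strings without leaking, through timing, where they differ.
// Hashing first gives timingSafeEqual the equal-length buffers it requires.
function safeEqual(a, b) {
  const ha = crypto.createHash('sha256').update(String(a)).digest();
  const hb = crypto.createHash('sha256').update(String(b)).digest();
  return crypto.timingSafeEqual(ha, hb);
}

// Hypothetical helper: which login mode do the supplied flags select?
// Assumption: a public key file takes precedence if both kinds are supplied.
function authMode(flags) {
  if (flags['public-key-file']) return 'publickey';
  if (flags.user && flags.password) return 'password';
  return 'none';
}

// Hypothetical helper: should this login attempt be accepted?
// `attempt` looks like { method, username, password }.
function acceptLogin(flags, attempt) {
  switch (authMode(flags)) {
    case 'none':
      // No security configured: every connection request succeeds.
      return true;
    case 'password':
      return attempt.method === 'password'
        && safeEqual(attempt.username, flags.user)
        && safeEqual(attempt.password, flags.password);
    default:
      // Public key logins need a signature check against the configured
      // key, which this sketch leaves out, so it fails closed.
      return false;
  }
}

console.log(authMode({ user: 'alice', password: 'secret' }));
```

In a real ssh2 server the result would drive a call to accept or reject on the authentication context.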
Once a client connects to the SFTP server, requests to put and get files will be made to GCS. The identity used for those requests will be either the Google application default credentials or the service account specified with --service-account-key-file. Application default credentials are the implicit credentials found when running on a GCP Compute Engine instance or when the environment variable GOOGLE_APPLICATION_CREDENTIALS is set.
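The choice of GCS identity can be sketched as follows. The gcsIdentity helper and the shape of its return value are hypothetical and only illustrate the precedence: an explicit service account key file wins, otherwise application default credentials apply, which may come from the GOOGLE_APPLICATION_CREDENTIALS environment variable or from the surrounding environment such as Compute Engine. The keyFilename option is the one the @google-cloud/storage client accepts for a key file; we assume the resulting clientOptions would be handed to that client's constructor.

```javascript
// Hypothetical helper: decide which identity the server presents to GCS.
// `flags` is an object keyed by flag name; `env` defaults to process.env.
function gcsIdentity(flags, env = process.env) {
  const keyFile = flags['service-account-key-file'];
  if (keyFile) {
    // `keyFilename` is the @google-cloud/storage option naming a key file.
    return { source: 'service-account', clientOptions: { keyFilename: keyFile } };
  }
  // No explicit key: leave the options empty so that the client library
  // falls back to application default credentials.
  const source = env.GOOGLE_APPLICATION_CREDENTIALS
    ? 'adc-environment-variable'
    : 'adc-ambient';
  return { source, clientOptions: {} };
}

console.log(gcsIdentity({ 'service-account-key-file': '/path/to/key.json' }));
```

Whichever identity is chosen needs permission to read and write objects in the bucket, since every SFTP request is executed against GCS under that identity.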
Once it is running, we can connect a protocol compliant SFTP client to the server and start issuing SFTP commands. The following video illustrates installation and use of the server.
Questions and answers
Q: Can we run this solution in a Serverless environment such as Cloud Functions or Cloud Run?
A: Unfortunately no. The reason for this is that, as of December 2020, neither Cloud Functions nor Cloud Run supports any protocol other than HTTP(S). Only HTTP requests can be received over the Internet for processing. Our SFTP solution uses the SSH protocol carrying SFTP subprotocol requests. While we could have our SFTP-GCS server listen on a TCP port that is normally used for HTTP(S), that wouldn’t help, because the traffic would still not be HTTP. Until the serverless products can expose raw TCP protocols, we aren’t going to make any progress. There might be some mileage in looking at Managed Instance Groups (MIGs) of Compute Engine instances, but I am not sure that these can scale down to zero.
Q: How do I start the daemon at boot time?
A: Starting daemons at VM startup is more of a generic question than one specific to this daemon. My recommendation is to study the systemd documentation, which is the current story for boot management in a Linux environment. There are many great articles on creating units for