My 2nd job in tech was as a DBA. I really enjoyed this job, as not only did I get to work on table schemas, indexes (I still miss INGRES, as I got to choose the right type of index), query optimisation, access paths, backup & recovery processes, security, ETL jobs etc., I also got to spend hours looking at queue depth, IOPS and latency, designing the SAN, looking at LUNs and configuring appropriate RAID levels. It was probably the only time I was able to actually come up with a colour coding scheme that made sense!
Then it became even more interesting when it was no longer just choosing which flavour of relational database, but also what sort of NoSQL database I needed (that made up for no longer having to think about what type of index I needed, kind of 😀). So back to more recent times and the "Cloud", where worrying about LUNs and striping disks isn't a thing, at least it isn't on the cloud I work with these days.
A lot of the tasks I used to do as a DBA have been abstracted away, as they're now GCP's problem, but other concerns have stepped in to fill that low level undifferentiated heavy lifting.
There are two choices for how you manage your databases in the cloud, and I'm going to focus on GCP here (let's assume you're all in for this post): either you deploy and manage your own database on Compute Engine (GCE), or you use one of the fully managed database services.
Deploying and managing databases on GCE
So I know I'm skimming the surface here, but I just want to give a flavour of the tasks that you as a DBA need to consider when deploying to GCE. Some of it is the same as what you do on premises, but other tasks are actually considerably easier to accomplish.
If deploying directly onto GCE, you need an in-depth understanding of the operational side of the database software you are going to use, just as you would on premises. You need to do the usual separation of data and logs onto separate disks from the boot disk; that doesn't really change as an approach. Backup and recovery processes can use Cloud Storage (GCS) Nearline as the target for the database backup files you move off persistent disks (PDs). Persistent disks are durable storage devices that function similarly to the physical disks in a desktop or a server. Compute Engine manages the hardware behind these devices to ensure data redundancy and optimise performance for you.
If you want to replicate a tiered storage system, implement a process that moves backups from PDs to standard GCS and then on to Nearline GCS. For monitoring and logging you can use Stackdriver, as it has out-of-the-box agents for monitoring some common databases, which saves you setting up additional infra for monitoring.
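Much of that tiering can be handed off to GCS itself via a bucket lifecycle policy. Here's a minimal sketch of what one looks like (the age thresholds are made up for illustration, not a recommendation):

```python
import json

# Lifecycle policy: after 30 days, transition backup objects from
# Standard to Nearline storage; after 365 days, delete them.
# Both thresholds here are illustrative placeholders.
lifecycle_policy = {
    "lifecycle": {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]},
            },
            {
                "action": {"type": "Delete"},
                "condition": {"age": 365},
            },
        ]
    }
}

print(json.dumps(lifecycle_policy, indent=2))
```

You'd save that JSON out and apply it to your backup bucket with `gsutil lifecycle set lifecycle.json gs://your-backup-bucket` (bucket name hypothetical).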
Need RAM (of course you do!)? Choose an instance type that gives you the amount of RAM you need, or if there isn't one that quite fits, customise one.
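For the curious, the custom machine type naming scheme is simple enough to sketch. The validation rules below are the ones I remember for N1 custom machine types (1 or an even number of vCPUs, 0.9 to 6.5 GB of RAM per vCPU, total memory in 256 MB multiples); double check them against the current docs before relying on them:

```python
def custom_machine_type(vcpus: int, memory_gb: float) -> str:
    """Build a GCE custom machine type string, checking the N1-era
    constraints (treat the exact limits as assumptions to verify)."""
    memory_mb = int(memory_gb * 1024)
    if vcpus != 1 and vcpus % 2 != 0:
        raise ValueError("vCPU count must be 1 or an even number")
    if memory_mb % 256 != 0:
        raise ValueError("total memory must be a multiple of 256 MB")
    if not 0.9 <= memory_gb / vcpus <= 6.5:
        raise ValueError("memory per vCPU must be between 0.9 and 6.5 GB")
    return f"custom-{vcpus}-{memory_mb}"

# e.g. a memory-leaning shape for a database VM:
print(custom_machine_type(8, 52))  # custom-8-53248
```

That string is what ends up as the machine type when you use `gcloud compute instances create` with `--custom-cpu` and `--custom-memory`.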
With databases (let's ignore in-memory databases for this post!), disk I/O and throughput have always been key aspects to focus on. You still need to focus on those parameters, but this is where you really need to think differently. Now you're not worrying about DAS, NAS or SANs, fibre channel and RAID levels. The GCP docs for data center professionals have a nice section to help you understand the mind switch from SANs and archival storage to PDs and GCS.
Note: you do get local SSDs, but they are tied to the underlying host, so they're only really advisable for scratch disks and disks that can be rebuilt quickly.
Now what you ultimately need to concern yourself with is the type of PD, the size of your PDs and, if you use SSD PDs, the number of vCPUs.
The table in the docs comparing block storage performance is a good starting point, or if you want to make a gentler leap, the table comparing a 7,200 RPM SATA drive to standard or SSD PDs is great for understanding how you should be thinking about block storage performance and using that to size your PDs appropriately.
It's worth reading and then rereading the whole page on disk optimisation, but here are a few key takeaways that I personally feel are always worth remembering:
- Standard persistent disk performance scales linearly up to the volume performance limits. The vCPU count for your instance does not limit the performance of standard persistent disks.
- SSD persistent disk performance scales linearly until it reaches either the limits of the volume or the limits of each Compute Engine instance. In general, instances with greater numbers of vCPUs can achieve higher throughput and IOPS limits.
- Each persistent disk write operation contributes to your virtual machine instance’s cumulative network egress traffic.
- You still need to think about queue depth and parallelism and optimise accordingly.
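To make the "scales linearly" point concrete, here's a back-of-the-envelope sketch. The per-GB rate is a placeholder (take the real figure from the block storage performance table), and the queue depth estimate is just Little's law, in-flight I/Os = IOPS × latency:

```python
# Standard PDs scale linearly with size, so to hit a target IOPS you
# size the volume, not the machine. Illustrative rate only; check the
# current GCP performance table for the real per-GB figure.
READ_IOPS_PER_GB = 0.75

def min_pd_size_gb(target_read_iops: float) -> float:
    """Smallest standard PD (in GB) whose linear scaling reaches the target."""
    return target_read_iops / READ_IOPS_PER_GB

def required_queue_depth(target_iops: float, latency_ms: float) -> float:
    """Little's law: I/Os that must be in flight to sustain the target rate."""
    return target_iops * (latency_ms / 1000.0)

print(min_pd_size_gb(1500))           # 2000.0 GB
print(required_queue_depth(1500, 10)) # 15.0 in-flight I/Os
```

The second number is why a single-threaded, one-outstanding-I/O benchmark will never show you a volume's rated IOPS: you have to drive enough parallelism to keep the queue full.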
A great example of how to bring it all together and think about managing and optimising a database on GCE is this doc on the GCP site about best practices for Microsoft SQL Server on GCP.
So now a DBA looks at optimising I/O and throughput at a higher level of abstraction, as GCP takes care of that undifferentiated heavy lifting (even if I do miss it, it ultimately wasn't the best use of my time!). Basically, the lower level infra concerns have just got way easier once you understand what you need to focus on.
GCP’s fully managed databases
With fully managed databases you are no longer concerned with the nitty gritty of PD sizes, optimising block sizes and so on to maximise performance. Now you focus on the individual guidance provided for each service.
If you're new to GCP and have made the totally sensible decision to go fully managed, I would recommend you start with this article on the data lifecycle on GCP (it needs to add Spanner too, hint hint!). Once you've read that, read the concepts guide and get hands on with the quickstarts.
I have called out a few examples below of managed database services on GCP, together with some pointers for optimising your use of them.
- Cloud Bigtable
  - Read Understanding Cloud Bigtable performance
  - The larger the amount of storage, the better it performs (so > 1 TB)
  - Choose the right storage type (SSD or HDD)
- BigQuery
  - Understand the use of partitioned tables
  - Tune your queries
  - Use query plans to help improve performance
- Whichever service you pick, use the launch checklist
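To give a flavour of the Bigtable pointer above: row key design is the big performance lever, because writes to a monotonically increasing key all land on the same tablet. A toy sketch with hypothetical field names:

```python
def hot_key(ts: int, sensor_id: str) -> str:
    # Anti-pattern: a timestamp-first key means every new write has the
    # highest key so far, so all writes hotspot one tablet.
    return f"{ts}#{sensor_id}"

def balanced_key(ts: int, sensor_id: str) -> str:
    # Better: lead with the high-cardinality field so writes for
    # different sensors spread across tablets, while a scan for one
    # sensor's readings stays contiguous.
    return f"{sensor_id}#{ts}"

print(hot_key(1700000000, "sensor-42"))       # 1700000000#sensor-42
print(balanced_key(1700000000, "sensor-42"))  # sensor-42#1700000000
```

The same "spread your writes, keep your reads contiguous" trade-off shows up in the Bigtable schema design docs; this is just the shape of it.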
The tasks a traditional DBA would typically carry out have basically moved up the stack. This has meant there is more time to get involved with other things.
Storing and querying data is only the tip of the iceberg though, as getting data into your data store and the cloud, and ETL jobs, are all part of being a DBA. I hadn't forgotten, honest!
Integrating Hadoop and its ecosystem is the usual path followed to address a wide variety of data problems. Maybe I should have included that as part of what a typical DBA does these days (but I guess I am starting from what tools I had when I was a DBA, and Hadoop wasn't one of them).
My first foray into Hadoop and its ecosystem was actually with AWS's EMR, so I missed the whole "lots of commodity hardware under desks and in cupboards" thing that is part and parcel of running a Hadoop infra on premises. I had a relatively painless introduction. Have a look at GCP's Dataproc; it's really the way Hadoop-type tasks should be run imho (but I may be slightly biased 🙂)
Soon after Hadoop-type workloads (and yes, I guess I really mean map/reduce), the next thing to understand is solutions that help make the lambda architecture work without tearing your eyeballs out, so Apache Beam is probably the next step, or indeed the first step if you have never touched Hadoop. It's the next gen, so you may as well just start there anyway. The fully managed version of that is Dataflow on GCP.
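If you want to see the shape of the model Beam generalises, the classic wordcount fits in a few lines of plain Python. A real Beam pipeline would express the same steps as transforms (roughly a `beam.FlatMap` followed by a per-element count combiner), but this stdlib-only sketch shows the map/shuffle/reduce flow:

```python
from collections import Counter
from itertools import chain

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# map: each line -> its words (what a FlatMap step would do)
mapped = (line.split() for line in lines)

# shuffle + reduce: group identical words and count each group
counts = Counter(chain.from_iterable(mapped))

print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Beam's contribution is that the same pipeline description also handles unbounded (streaming) input, which is exactly the lambda architecture pain point it was built to remove.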
By now you're probably thinking: hang on, with all this moving up the stack, haven't they just renamed a DBA as a Data engineer? And you know, you'd be right (if you're not entirely sure what a Data engineer is, I love this as the best description I have read to date).
Data engineers are imho evolved DBAs with cooler tools to help them do their jobs. I have seen mashups that randomly throw in ML too, but I feel that fits more into the Data science bucket. I'm not going to get into a philosophical discussion of where you draw the line as, to be honest, job titles only give you a vague idea of what someone mostly does day to day anyway, and the edges are so blurred.