Data Architecture design guidelines

Nipun Agarwal
TVS Motors technology blog
4 min read · Mar 30, 2020
Data Architecture Components

Data architecture is one of the core areas in building a highly successful data-driven company. A good, well-thought-out design not only speeds up development but also makes it easier to adjust to changing needs in the future.

This blog will help you identify the various areas to look into while designing a data architecture for your company. I will try to cover most aspects of the design and touch on each of them for a basic understanding. You will also find a lot of overlap with other design fields, and I believe that overlap is just as relevant in data architecture design.

  • Scalability
  • Latency
  • Fault Tolerance and Disaster Recovery
  • Monitoring and Alerting
  • Logging and Traceability
  • Data modelling, catalog and democratisation
  • Data Lakes and Warehouse
  • Data Security
  • Deployment
  • Experimentation
  • Automated Testing

Scalability is at the forefront of any platform design, be it the application server, database, producer or consumer. In this high-velocity, high-variability world, data is produced and consumed at a very high rate, and the rate of production itself changes rapidly. Catering to this requires a solution that can scale with traffic load. Vertical scaling is always an option, but look for horizontal scaling wherever possible: application servers, consumers, long-term storage and databases.

There is a lot to consider in the choice of databases (SQL or NoSQL), microservices (Kubernetes), distributed processing (Spark) and so on; it all depends on the use case you are trying to solve.
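
As one concrete example of horizontal scaling on the consumer side, here is a minimal sketch using the kafka-python client. The topic, group id and broker address are placeholders; the point is that every instance started with the same group_id shares the topic's partitions, so adding instances adds throughput.

```python
# Minimal sketch: horizontally scalable consumption with a Kafka
# consumer group (kafka-python). Topic, group_id and broker address
# are placeholders.
from kafka import KafkaConsumer

def process(payload: bytes) -> None:
    # Stand-in for real processing logic.
    print(payload[:80])

consumer = KafkaConsumer(
    "events",                        # hypothetical topic
    group_id="event-processors",     # same group_id across all instances
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each partition is handled by exactly one consumer in the group,
    # so parallelism grows with the number of instances
    # (up to the partition count).
    process(message.value)
```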

Latency requirements depend on the use case. Some need microsecond latencies, while others are fine with seconds or even hours. Again, there are many places where design choices matter: event-driven pub/sub, real-time CDC, or not-so-real-time read replicas.
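
For the low-latency end of that spectrum, here is a minimal event-driven pub/sub sketch using redis-py; the channel name and connection details are placeholders. Messages are pushed to the subscriber as they are published, rather than polled from a read replica on a schedule.

```python
# A minimal sketch of event-driven pub/sub with redis-py, for cases
# where sub-second delivery matters more than durability.
import redis

r = redis.Redis(host="localhost", port=6379)

pubsub = r.pubsub()
pubsub.subscribe("orders")  # hypothetical channel

for message in pubsub.listen():
    if message["type"] == "message":
        # Delivered as soon as a producer publishes, instead of
        # waiting for the next poll of a replica.
        print(message["data"])
```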

Fault tolerance is another important aspect of data architecture design. Design your systems so that every sub-system tolerates failure, leaving no single point of failure and keeping the system always up. DR, or Disaster Recovery, means replicating the same infrastructure across multiple clusters or regions to survive a complete region failure.
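
At the code level, one common fault-tolerance building block is retrying with exponential backoff and jitter (replication and failover handle the infrastructure level). A minimal sketch, with illustrative parameter values:

```python
# Retries with exponential backoff and jitter: a small building
# block for tolerating transient failures in any sub-system call.
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the failure
            # Exponential backoff with jitter avoids retry storms.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Usage: wrap any flaky call, e.g. a network request.
# result = call_with_retries(lambda: fetch_from_replica())
```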

Monitoring and alerting should be a mandate in every design, as a separate sub-system altogether. It helps you monitor sub-systems, quantify latency numbers, understand and tune system thresholds, understand system behaviour at various times, get notified of failures and rectify them.
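
As a sketch of what the instrumentation side can look like, here is a minimal example with the Python prometheus_client library. Metric names and the port are placeholders; an external monitoring server would scrape the endpoint and fire alerts on thresholds.

```python
# A minimal sketch of instrumenting a service with prometheus_client.
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("requests_total", "Total requests processed")
LATENCY = Histogram("request_latency_seconds", "Request latency")

def handle_request():
    with LATENCY.time():   # records how long the block takes
        REQUESTS.inc()
        time.sleep(0.01)   # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for scraping
    while True:
        handle_request()
```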

Logging and traceability help you identify and understand problems in a sub-system; after all, it is a piece of code running (100% bug-free code is a myth). Logs help in tracing down problems in sub-systems and hence form a process for improving the systems iteratively.
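
A minimal sketch of traceable logging with the standard library: attach a correlation id to every log line so a single request can be followed across sub-systems. The field name and logger name here are illustrative.

```python
# Correlation ids with stdlib logging, via LoggerAdapter.
import logging
import uuid

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s [%(correlation_id)s] %(message)s",
)

def get_request_logger():
    # One id per request; pass it to downstream services too.
    correlation_id = uuid.uuid4().hex[:8]
    return logging.LoggerAdapter(
        logging.getLogger("pipeline"),
        {"correlation_id": correlation_id},
    )

log = get_request_logger()
log.info("ingest started")
log.info("ingest finished")  # same id, so the two lines correlate
```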

Data modelling, cataloging and democratisation form another aspect of the design. It is very important to understand the incoming data and model it properly. Modelling gives proper structure to your data, be it in databases, the warehouse or the lake. Cataloging is what you build on top of your modelled data to hold its metadata; it helps in identifying the type of data, discovery and lineage. Then comes democratisation: making the data available to the organisation for consumption.
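
To make this concrete, here is a small sketch pairing a modelled record with a catalog entry, using stdlib dataclasses. The field names are illustrative; real catalogs (Hive Metastore, AWS Glue, Amundsen and the like) track the same ideas at scale.

```python
# A modelled record plus the catalog metadata describing its dataset.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class RideEvent:
    # The modelled data itself: explicit names and types.
    ride_id: str
    vehicle_id: str
    started_at: datetime
    distance_km: float

@dataclass
class CatalogEntry:
    # Metadata about the dataset: what it is, where it came from,
    # and who owns it, the basis for discovery and lineage.
    name: str
    owner: str
    schema: list = field(default_factory=list)
    upstream: list = field(default_factory=list)

entry = CatalogEntry(
    name="ride_events",
    owner="data-platform",
    schema=["ride_id", "vehicle_id", "started_at", "distance_km"],
    upstream=["kafka://events"],  # hypothetical source, for lineage
)
```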

Data lakes and warehouses store your data in raw, processed or aggregated form and are among the most important pieces of the complete data architecture design. A data lake mainly acts as storage for raw, processed and even aggregated data, since retrieval from lake storage is now nearly as fast as from memory and comes at a lower price. A warehouse generally comprises aggregated or rolled-up data for fast retrieval; star, snowflake and constellation are some of the schema designs available for warehouses. With changing requirements and processing speeds, lakes are increasingly eating into the warehouse's territory.
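
A minimal PySpark sketch of landing data in a lake as date-partitioned Parquet; the paths and column name are placeholders. Partitioning is one of the reasons lake retrieval can rival warehouse roll-ups for time-bounded queries.

```python
# Landing raw events in a lake as partitioned Parquet with PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

raw = spark.read.json("s3a://my-bucket/raw/events/")  # hypothetical path

# Partitioning by date keeps scans cheap for typical
# time-bounded queries.
(raw.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3a://my-bucket/lake/events/"))
```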

Data security is another aspect that architects should look into very carefully while building and designing systems. Starting with GDPR, many countries are coming up with data protection policies, making security one of the most important criteria in any design.
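
One small, concrete practice: pseudonymise PII before it lands in the lake, so raw identifiers never reach long-term storage. A minimal sketch using a keyed hash (HMAC); in practice the key would live in a secret manager, not inline.

```python
# Pseudonymising PII with a keyed hash before storage.
import hashlib
import hmac

SECRET_KEY = b"replace-with-managed-secret"  # placeholder

def pseudonymise(value: str) -> str:
    # Same input -> same token, so joins still work, but the raw
    # identifier cannot be recovered without the key.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

print(pseudonymise("user@example.com"))
```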

Deployment is another piece I would like to give importance to. In a fail-fast world, you need rapid deployment, which leads to continuous integration and deployment (CI/CD). A faster, automated deployment process reduces time to market and also reduces the bugs that tend to creep in with manual processes.

Experimentation is a way to run a test on a segment of users, against a control group, to get first-level feedback. A/B testing is one such method, and many successful companies run a lot of experiments every day. This is only possible if you have a solid platform and automated methods, with well-modelled data, for carrying out experiments and measuring results.
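
A minimal sketch of one piece of such a platform: deterministic bucketing, where hashing the user id with the experiment name gives every user a stable variant without storing assignments. The names here are illustrative.

```python
# Deterministic A/B bucketing via hashing.
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100      # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

print(assign_variant("user-42", "new-checkout"))  # same answer every call
```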

Automated testing, I believe, is the backbone of every design. With so many sub-systems in place, you need an automated way of testing all of them to reduce manual intervention. Again, this not only reduces turnaround time but also frees developers to concentrate on other pieces. A few testing strategies include unit tests, integration tests, performance tests and monkey tests.
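
As a small illustration at the unit-test level, here is a pytest sketch that tests the hypothetical assign_variant() bucketing function from the experimentation section; the module name is assumed.

```python
# Unit tests for the hypothetical bucketing function. Run with `pytest`.
from experiment import assign_variant  # hypothetical module

def test_assignment_is_deterministic():
    # The same user must always get the same variant.
    first = assign_variant("user-42", "new-checkout")
    assert all(
        assign_variant("user-42", "new-checkout") == first
        for _ in range(100)
    )

def test_zero_percent_treatment_means_all_control():
    assert assign_variant("user-42", "new-checkout", treatment_pct=0) == "control"
```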

I have just touched upon these aspects to highlight what needs to be kept in mind while designing a system. Let me know if this was useful; I would love to hear your feedback.

Till next time…

Nipun
