CovenantSQL & Data Security | CSTC conference sharing
Recently, CovenantSQL co-founder & CTO Auxten Wang attended the 2018 China Software Technology Conference and delivered a speech titled “CovenantSQL, The Highly Available Database on the Internet”.
Auxten: CovenantSQL co-founder & CTO; Former Baidu Noah data transmission project leader, 360 operation and maintenance automation project leader; former 360 traffic guard, 360 Tianji project initiator.
Auxten has worked at elong and the Fourth Paradigm as technical director. He has more than 10 years of experience in infrastructure and system security.
The following is a summary of parts of the talk:
HA (High Availability) levels
First of all, I define four levels of database high availability (HA), based on my own understanding:
The first level is Time Irrelevant. This is the basic requirement: a single instance must be stable and must not crash during operation. Take Memcached and Nginx as examples: in our experience, almost no one ever runs into a bug in them.
The next level is Hardware Irrelevant, which means the database is not affected by hardware failures. This is what we usually mean by a distributed deployment: the failure of a single machine does not bring down the whole service. Compare what it costs to make each level crash: the previous level can be brought down at almost no cost, basically $0. To bring down this level, you first have to take the server hardware itself down, which we estimate costs about $10,000.
The next level is IDC Irrelevant. This is even more difficult. In mainland China, only a few large companies can achieve it and survive a failure at the level of an entire IDC.
There are two examples. One is the well-known BitTorrent, which has been banned many times but is still alive, even though in most cases it is used to infringe copyright. The other is Spanner, a globally distributed database developed by Google, which is powerful enough to achieve high-performance consistency at global scale. Google Drive is based on Spanner. But in my opinion, these are still not the highest level of high availability.
So here is level 4: Human Irrelevant. What does this mean? A system at this level is almost indestructible: it runs on the Internet and resists a variety of malicious attacks. This should be the highest form of HA. Basically, only a planetary-level attack can destroy it. In other words, as long as the Earth does not explode, it will continue to run. A classic example is the controversial Bitcoin.
In my personal experience, the first two levels are not too difficult to achieve, the third requires the resources of a big company, and the last one, Human Irrelevant, is what we are working on.
The root of data security issues
Regardless of the size of the company, there is one common problem: your data is stored in an IDC (Internet Data Centre). In this case, companies face several problems. The first is the developer’s dilemma: there is no way to prove that user data stored in an IDC is properly stored and secured.
For example, suppose I develop a mobile app and read the contact list from your phone. My purpose is just to make it easier for you to pay your phone bills, or to increase social interaction. However, I’m unable to prove what I did with your phone book. I might say: “I am a very honest developer. I wouldn’t use your data to do anything illegal. Of course, I will not do any data mining to find out what you might need right now.”
But no one believes it!
I might then say: “Right! I took your data, but I didn’t sell it. Someone else sold it. Your personal data breach has nothing to do with me.” However, these guarantees cannot be traced or proved. There is no mechanism that lets users trust you; everything relies on the reputation and self-discipline of the development team. This market is therefore like a quagmire: everyone is wallowing in the mud, and anyone standing next to them will certainly not stay clean.
The second problem is the large attack surface. It is very common for an application to have hundreds of APIs, supporting iOS, Android and Web. The exposure created by traditional APIs places very high demands on protection. In addition, developers define their own protocols and formats when building APIs, which means that security protection must be done case by case and is difficult to integrate.
The direct result is that many small and medium-sized companies find it difficult to ensure data security. In general, it is already good enough for a developer to write correct, bug-free code. If you can also do a good job on HA, you can definitely earn a good income. And if on top of that you can handle the security issues, you can ask for practically any pay you want.
The third problem is that your data is always stored in a third-party database. Even if it is encrypted, it is difficult to prevent snooping. For example, suppose you run a startup and store your operational data on Tencent Cloud. If your startup is doing well and its data shows strong growth, Tencent’s strategic investment department may take an interest and approach you about investment. As for how they know about your operations, no one can give you the answer. In fact, whether it is Tencent Cloud or Alibaba Cloud, if they have looked at your data, so what? There is nothing you can do about it.
Current data storage solutions
Overall, with a centralized cloud database, data is always stored in a third-party database. The obvious benefits are that the data can be synchronized and is not easily lost: whether you are on PC or mobile, iOS or Android, even if the device is lost, your data is still there as soon as you log in. The problems are that the exposure is large and that it is impossible to prove the data is safe in the cloud, while the cost is high and the utilization rate is low. The databases of most small and medium-sized developers currently sit relatively idle.
Then someone might say: I don’t need a cloud. I can save the data locally. The advantage of offline data storage is that I don’t have to deal with the Internet, and my data is safe offline. But the disadvantages are also obvious. You cannot synchronize, your data is easy to lose, and data security and reliability depend entirely on the device.
Another method is to store your data on a public blockchain, but the cost is extremely high. Take Ethereum as an example: at present, storing one megabyte costs roughly 1000 dollars, storage capacity is limited, and performance is low as a consequence of the system design. Public blockchains also currently lack support for structured data, which means that developers need to serialize their data themselves in order to store structured data. At first glance, a public blockchain seems to offer no advantage. Its only advantage is decentralization, which sounds tempting.
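To make the point about structured support concrete, here is a minimal sketch in Python using only the standard library. It is an illustration, not code from CovenantSQL or Ethereum: the dictionary `kv_store` merely stands in for unstructured storage, and the `contacts` table and both helper functions are hypothetical.

```python
import json
import sqlite3

# Hypothetical records an app might want to store.
records = [
    {"id": 1, "name": "alice", "city": "Beijing"},
    {"id": 2, "name": "bob", "city": "Shanghai"},
    {"id": 3, "name": "carol", "city": "Beijing"},
]

# Unstructured storage: a dict of bytes stands in for a key-value store.
# The application must serialize every record itself.
kv_store = {}
for rec in records:
    key = f"contact:{rec['id']}"
    kv_store[key] = json.dumps(rec, sort_keys=True).encode("utf-8")


def kv_find_by_city(store, city):
    """Deserialize every value and filter in application code."""
    names = []
    for raw in store.values():
        rec = json.loads(raw.decode("utf-8"))
        if rec["city"] == city:
            names.append(rec["name"])
    return sorted(names)


# Structured storage: the same data in an in-memory SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO contacts (id, name, city) VALUES (:id, :name, :city)", records
)


def sql_find_by_city(connection, city):
    """Let the database do the filtering and ordering."""
    rows = connection.execute(
        "SELECT name FROM contacts WHERE city = ? ORDER BY name", (city,)
    )
    return [row[0] for row in rows]
```

Both helpers return the same names for a given city; the difference is that the key-value version has to decode and scan every record in application code.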
CovenantSQL: a wild database
Therefore, we built a database on top of a decentralized blockchain.
Although SQL has a variety of problems and has not kept up with the trend in recent years, you will find that many startups are now trying to improve the high availability, scalability and performance of traditional SQL databases.
Meanwhile, whatever database we are compared with, one of our characteristics stands out: being wild (or, as we prefer to say, secure). For most of the databases you use, the servers are controlled by you or by the cloud service provider. No one will encourage you to expose the database ports on the Internet, because you would soon be attacked by hackers. The SQL architecture is not very resistant to attacks or to high-frequency queries. Therefore, we use blockchain and cryptography techniques to make it safe for SQL to run on the Internet. The security measures mainly include:
- Algorithm autonomy & asymmetric encryption: guarantee that the location of data cannot be determined without authorization
- Column-level access control, end-to-end encryption (E2EE), transport encryption, on-disk encryption
- Reads and writes can be recorded, and the records cannot be tampered with
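The last item, a read/write record that cannot be tampered with, can be illustrated with a hash chain, the basic idea behind blockchain-style logs. The following is a minimal sketch under stated assumptions, not CovenantSQL's actual record format: the entry layout, `GENESIS_HASH`, and all function names are hypothetical.

```python
import hashlib
import json

# Assumed placeholder for the "previous hash" of the first entry.
GENESIS_HASH = "0" * 64


def entry_hash(prev_hash, payload):
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a record that is chained to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append(
        {"prev": prev_hash, "payload": payload, "hash": entry_hash(prev_hash, payload)}
    )


def verify_log(log):
    """Recompute every hash; any edited entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["prev"], entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"op": "write", "sql": "INSERT INTO t VALUES (1)"})
append_entry(audit_log, {"op": "read", "sql": "SELECT * FROM t"})
```

`verify_log(audit_log)` succeeds on the untouched log and fails as soon as any recorded statement is altered, because each entry's hash depends on everything before it.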
Another feature is ease of integration. In addition to structured data support, for global synchronization we let users customize the CAP trade-off and choose the CAP type that fits their needs. At the same time, we are almost fully compatible with the SQL standard. Of course, the “wild” database also has the advantage of low cost: just find a server and run it for “mining”.
Q & A
1. What are the conditions for deploying the database, and what are the network requirements if it is deployed separately?
Auxten: We have a testnet, so you don’t need to build anything, and we also provide a Docker Compose setup. We are actually the database with the lowest network requirements. However bad the network is, CovenantSQL still works; we even debug the database under poor network conditions. Since it can run on the open Internet, it can also run in your own server room. You can make a request on GitHub and try it out.
2. Why don’t you use Raft to synchronize the database directly?
Auxten: Raft alone is not enough to make the database usable. For an untrusted, public network, we have made some special optimizations in CovenantSQL. The key to high concurrency is to minimize the time spent holding the lock around the indivisible part of a transaction, since that is what limits speed. Because of the RTT (Round Trip Time) across the whole network, plain Raft cannot raise its transaction speed.
Our solution is different. During the prepare stage (the pre-confirmation phase), requests are already handled concurrently, in a queue outside the database’s core lock. After queuing, there is a window that receives the prepare confirmations and waits for enough of them. For example, once two-thirds of the nodes have confirmed, a commit is initiated. We do not wait while the commit is being sent out, so across the whole process the lock is held only for the core part of the transaction, which gives high concurrency. The code can be viewed on GitHub.
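The flow described above can be sketched roughly as follows. This is a simplified, single-process illustration in Python, not CovenantSQL's implementation: the `Leader` and `Follower` classes, the `reachable` flag, and the choice to count only followers toward the two-thirds threshold are all assumptions made for the example.

```python
import threading


class Follower:
    """Hypothetical replica that acknowledges prepares and applies commits."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.prepared = {}
        self.committed = []

    def prepare(self, tx_id, statement):
        if not self.reachable:
            return False  # no confirmation from this node
        self.prepared[tx_id] = statement
        return True

    def commit(self, tx_id):
        if tx_id in self.prepared:
            self.committed.append(self.prepared.pop(tx_id))


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.core_lock = threading.Lock()
        self.log = []

    def quorum(self):
        # Two-thirds of the followers, rounded up (an assumed threshold).
        return (2 * len(self.followers) + 2) // 3

    def execute(self, tx_id, statement):
        # Prepare phase: runs outside the core lock, so many
        # transactions could be collecting confirmations at once.
        acks = sum(1 for f in self.followers if f.prepare(tx_id, statement))
        if acks < self.quorum():
            # A real system would also abort the prepared entries; omitted here.
            return False
        # Only the local apply is done while holding the core lock.
        with self.core_lock:
            self.log.append(statement)
        # Commit is sent without collecting or waiting on replies.
        for f in self.followers:
            f.commit(tx_id)
        return True


followers = [Follower("a"), Follower("b"), Follower("c", reachable=False)]
leader = Leader(followers)
ok = leader.execute("tx-1", "INSERT INTO t VALUES (1)")
```

With three followers the threshold is two confirmations, so `tx-1` commits even though follower `c` is unreachable. If a second follower becomes unreachable, `execute` returns `False` and nothing is appended to the leader's log.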
3. Regarding database availability, could a miner shut down or terminate the service and cause data loss? Has this been considered?
Auxten: This raises a question: what if the output node shuts down? The nodes involved in mining are required to run the database in order to get the reward, so when you mine, you need to put up some tokens as a deposit. There is also a problem every time SQL is run: how do you prove that you actually executed that particular SQL to get the data, rather than generating something at random? The deposit and data records are used for peer rating, and each node signs the results it returns. A confidence level is then calculated on the network. If you want a high degree of confidence, you need to check multiple nodes.
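One way to picture "checking multiple nodes" is to query several nodes, discard results whose signature does not verify, and treat the share of nodes that agree on the most common result as the confidence level. The sketch below is only an illustration of that idea: the talk does not give the actual formula, the names `sign` and `confidence` are hypothetical, and HMAC with per-node keys is used purely as a standard-library stand-in for real asymmetric signatures.

```python
import hashlib
import hmac
from collections import Counter


def sign(node_key, result):
    """Stand-in signature: HMAC-SHA256 over the result (assumption)."""
    return hmac.new(node_key, result.encode("utf-8"), hashlib.sha256).hexdigest()


def confidence(responses, node_keys):
    """Return (majority_result, share of queried nodes backing it).

    responses: list of (node_id, result, signature) tuples.
    node_keys: dict mapping node_id to that node's key.
    """
    valid = [
        result
        for node_id, result, signature in responses
        if node_id in node_keys
        and hmac.compare_digest(signature, sign(node_keys[node_id], result))
    ]
    if not valid:
        return None, 0.0
    result, votes = Counter(valid).most_common(1)[0]
    return result, votes / len(responses)


node_keys = {"n1": b"k1", "n2": b"k2", "n3": b"k3", "n4": b"k4"}
responses = [(n, "42", sign(node_keys[n], "42")) for n in ("n1", "n2", "n3")]
responses.append(("n4", "7", sign(node_keys["n4"], "7")))
majority, level = confidence(responses, node_keys)
```

Here three of the four queried nodes return a correctly signed `"42"`, so `majority` is `"42"` and `level` is `0.75`. Querying more nodes gives the caller more evidence at the price of more round trips.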
How do we ensure that the data returned is the data the miner actually stored, rather than something shoddy passed off as the real thing? Here we use a verification mechanism similar to a zero-knowledge proof. Simply put, a questioner poses a question and computes the correct answer in advance. The miners then each work on the question separately, compute an answer, sign it, and submit it for verification. If the answer is wrong, part of the mining deposit is deducted. Additionally, if you quit without following the normal process, you are punished economically. Under normal circumstances, your behaviour is recorded on the blockchain. If you want to quit, we detect this, arrange for another miner to replace you, synchronize the data to that miner, and add it to the network.
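The question-and-answer check with a deposit penalty can be sketched as a simple challenge-response round. This is an illustration under assumptions, not CovenantSQL's protocol and not a true zero-knowledge proof: the `Miner` class, `run_challenge`, the penalty amount, and the use of a hash over a random nonce plus the stored data as "the question" are all hypothetical.

```python
import hashlib
import os


def answer(nonce, data):
    """The 'question': hash a fresh nonce together with the stored data."""
    return hashlib.sha256(nonce + data).hexdigest()


class Miner:
    """Hypothetical miner holding some data and a token deposit."""

    def __init__(self, name, stored_data, deposit):
        self.name = name
        self.stored_data = stored_data
        self.deposit = deposit

    def respond(self, nonce):
        # A miner can only answer correctly if it really holds the data.
        return answer(nonce, self.stored_data)


def run_challenge(reference_data, miners, penalty):
    """Challenge every miner once; deduct the penalty for wrong answers."""
    nonce = os.urandom(16)  # fresh per round, so answers cannot be replayed
    expected = answer(nonce, reference_data)
    failed = []
    for miner in miners:
        if miner.respond(nonce) != expected:
            miner.deposit = max(0, miner.deposit - penalty)
            failed.append(miner.name)
    return failed


honest = Miner("honest", b"row-1,row-2", deposit=100)
cheater = Miner("cheater", b"garbage", deposit=100)
failed = run_challenge(b"row-1,row-2", [honest, cheater], penalty=30)
```

After one round the honest miner keeps its full deposit, while the miner that does not hold the real data fails the check and loses the penalty.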