Architecture of PostgreSQL DB

Pankaj kushwaha · Published in The Startup · Jun 20, 2020

Basic architecture of Database system

When you create a database, a schema named public is created in it by default. Each database can have multiple schemas. A schema can be understood as a namespace: different schemas can contain tables, indexes, views, and other objects with the same names.
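For example, the following shows two identically named tables coexisting in different schemas (the schema and table names are made up for illustration):

CREATE SCHEMA sales;
CREATE SCHEMA archive;
CREATE TABLE sales.product (id int, name text);
CREATE TABLE archive.product (id int, name text);  -- same table name, no conflict
SELECT * FROM sales.product;  -- schema-qualified access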

Basic architecture of Postgres system
PostgreSQL uses the client/server (C/S) model to provide services. A PostgreSQL session consists of the following cooperating processes:

(1) A server-side process. This process manages the database files, accepts connections from clients, and performs database operations on behalf of the clients. The server program is called postgres.

(2) Front-end applications, that is, client applications that need to perform database operations. Client applications can be quite diverse: a character-mode tool, a graphical application, a web server that accesses the database to display web pages, or a specialized database maintenance tool. Some client applications ship with the PostgreSQL distribution, but most are developed by users.

As in a typical client/server application, the client and the server can be on different hosts, in which case they communicate over TCP/IP. Keep this in mind, because files that are accessible on the client machine may not be accessible on the database server machine (or may only be accessible under a different name).

The PostgreSQL server can handle multiple concurrent connections from clients. To do this, it starts ("forks") a new process for each connection; from then on, the client and the new server process communicate without intervention from the original postgres process. Thus the master server process is always running, waiting for client connections, while client and associated server processes come and go. (All of this is transparent to the user and is mentioned here only for completeness.)

Physical storage structure

By default, the database files are stored in the data directory created by initdb.
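You can ask a running server where this directory is:

SHOW data_directory;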

Data file layout
(OID)
All database objects in PostgreSQL are managed internally by their respective object identifiers (OIDs), which are unsigned 4-byte integers.
The OIDs of databases are stored in the pg_database system table.
The OIDs of objects such as tables, indexes, and sequences in the database are stored in the pg_class system table:

SELECT oid, relname, relkind FROM pg_class WHERE relname = 'product';
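Similarly, a database's OID can be looked up in pg_database (the database name mydb is a placeholder):

SELECT oid, datname FROM pg_database WHERE datname = 'mydb';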

Tablespace

The largest logical storage unit in PostgreSQL is the tablespace; objects created in the database are stored in tablespaces. When creating a database object, you can specify its tablespace; if none is specified, the default is used.

Two tablespaces, pg_default and pg_global, are created automatically when the database cluster is initialized.

pg_global is physically located in the global subdirectory of the data directory and is used to store shared system tables.
pg_default is physically located in the base subdirectory and is the default tablespace of the template0 and template1 databases. A new database is cloned from template1 by default, so unless a tablespace is explicitly specified, a new database uses template1's tablespace, i.e. pg_default.
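To list the tablespaces in a cluster, or to place a table in a specific one (the tablespace name fastspace and its directory are placeholders):

SELECT spcname FROM pg_tablespace;
CREATE TABLESPACE fastspace LOCATION '/ssd/pgdata';
CREATE TABLE product_hot (id int) TABLESPACE fastspace;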

Structure of the table file

In PostgreSQL, a block stored on disk is called a Page, and a block in memory is called a Buffer. Tables and indexes are called Relations, and rows are called Tuples. The page is the smallest unit for reading and writing data; each page is 8 KB by default. The page size is fixed by the BLCKSZ constant specified when PostgreSQL is compiled. Each table file is composed of multiple BLCKSZ-byte pages, and each page contains several tuples. For hardware with good I/O performance, and for analytics-oriented databases, appropriately increasing BLCKSZ can slightly improve database performance.
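You can check the page size of a running server (8192 bytes on a default build):

SHOW block_size;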

There are two typical access methods when retrieving data from the database: sequential scan and B-tree index scan.

A sequential scan reads all tuples in all pages in order, by following every line pointer in each page.

In a B-tree index scan, the index file contains index tuples, each consisting of an index key and a TID that points to the target heap tuple.
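The access method the planner chooses can be inspected with EXPLAIN (assuming the product table from earlier and an index on its id column):

EXPLAIN SELECT * FROM product;                -- typically a Seq Scan
EXPLAIN SELECT * FROM product WHERE id = 42;  -- typically an Index Scan if product.id is indexed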

PostgreSQL architecture diagram (composition structure and relationship)

PostgreSQL is composed of five parts: the connection management system (system controller), the compilation execution system, the storage management system, the transaction system, and the system tables.

The connection management system accepts external operation requests, preprocesses and dispatches them, and performs the system's overall logical control;

The compilation execution system consists of the query compiler and the query executor; it analyzes and transforms operation requests and ultimately carries out the operations on the data in physical storage;

The storage management system consists of the index manager, the memory manager, and the external storage manager; it stores and manages the physical data and supports the compilation execution system;

The transaction system consists of the transaction manager, log manager, concurrency control, and lock manager. The log manager and transaction manager provide transactional consistency for operation requests, while the lock manager and concurrency control provide consistency support for concurrent data access;

The system tables are the metadata management center of the PostgreSQL database, containing both database object information and management control information. By managing this metadata, the system tables tie the various modules of PostgreSQL together into an efficient data management system.

System table

In a relational database, a data dictionary must be provided in order to control the database system. The data dictionary stores not only descriptive information about all kinds of objects, but also the detailed information about those objects that system management requires: descriptions of every object and its attributes, descriptions of the relationships between objects, the natural-language meaning of object attributes, and the history of changes to the dictionary itself. The data dictionary is the core of a relational database system's management and control information. In PostgreSQL, system tables play the role of the data dictionary.

The system tables are where the PostgreSQL database stores its structural metadata. They are represented in PostgreSQL as ordinary tables or views that store system information (which users can drop and rebuild).

Because the system tables hold all the database's metadata, they are accessed very frequently while the system is running. To improve performance, shared copies of the system tables are kept in memory, and hash tables are used to speed up queries.

Main system table functions

1 pg_namespace stores namespaces (schemas).

2 pg_tablespace stores tablespace information.

3 pg_database stores information about the databases in the current cluster.

4 pg_class stores information about tables and table-like database objects: indexes, sequences, views, composite types, TOAST tables, etc.

5 pg_type stores data type information.

6 pg_attribute stores the column (attribute) information of tables.

7 pg_index stores detailed information about indexes.
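These catalogs can be queried and joined like ordinary tables. For example, a sketch that lists the columns of the product table from earlier together with their type names:

SELECT a.attname, t.typname
FROM pg_attribute a
JOIN pg_class c ON a.attrelid = c.oid
JOIN pg_type t ON a.atttypid = t.oid
WHERE c.relname = 'product' AND a.attnum > 0;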

The main functions of the PostgreSQL database system are all concentrated in the postgres program. Its entry point is the main function in the Main module; execution starts there both when the database cluster is initialized and when the database server is started. The Main module mainly determines the current operating system platform, sets and initializes some platform-dependent environment variables accordingly, and then hands control to the appropriate module based on the command-line arguments. The figure below shows the call flow of the main function.

(Figure: PostgreSQL main function call flow)

The PostgreSQL daemon, Postmaster, allocates a backend postgres service process for each user connection request, and it also starts the related background service processes: SysLogger (system log process), PgStat (statistics collection process), and AutoVacuum (automatic cleanup process). When the Postmaster enters its monitoring loop, it additionally starts BgWriter (background writer process), WalWriter (WAL writer process), and PgArch (WAL archiver process). These processes are described below.

(Figure: PostgreSQL background process flowchart)

1.1.1.1 Postgres (resident process)
The resident process that manages the backends, also called the 'postmaster'. By default it listens on a Unix domain socket and on TCP/IP port 5432 (on some platforms, such as Windows, only TCP/IP) and waits for connection requests from the front end. The listening port number can be changed in PostgreSQL's postgresql.conf configuration file.
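The corresponding postgresql.conf settings look like this (the listen_addresses value shown is just an example):

listen_addresses = 'localhost'  # or '*' to listen on all interfaces
port = 5432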

Once a front end connects, postgres spawns a child process via fork(2). On platforms without fork(2), such as Windows, CreateProcess() is used to create the new process instead. In that case, unlike with fork(2), the child does not inherit the parent process's data, so shared memory is used to hand the parent's data to the child.

1.1.1.2 Postgres (child process)
The child process decides whether to allow the connection according to the security policy defined in pg_hba.conf. Depending on the policy, it rejects certain IP addresses and networks, or allows only certain users or connections to certain databases.
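A pg_hba.conf rule specifies connection type, database, user, client address, and authentication method; for example (the address range is a placeholder):

# TYPE  DATABASE  USER  ADDRESS         METHOD
host    all       all   192.168.1.0/24  md5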

Postgres accepts queries from the front end, searches the database, returns the results, and sometimes updates the database. Updated data is also recorded in the transaction log (in PostgreSQL, called the WAL log), which is used mainly to recover when the server is restarted after a crash or power failure. In addition, the log can be archived and used whenever a restore is required. Since PostgreSQL 9.0, WAL logs can be transferred to other PostgreSQL servers to replicate the database in real time: the so-called 'database replication' feature.

1.1.1.3 Other processes
Besides the postgres backends there are several auxiliary processes, all of them started by the resident postgres process.

(1) Postmaster main process and service processes
When the PG database is started, the Postmaster main process is started first. This process is the overall control process of the PG database, responsible for starting and stopping the database instance. The postmaster binary is actually a link to the postgres command, as shown below:

ll /opt/postgresql/bin/postmaster

/opt/postgresql/bin/postmaster -> postgres

When a user establishes a connection to the PG database, the connection must first be made to the Postmaster process. The client sends an authentication message to the Postmaster main process, which authenticates it according to the message; once verification passes, the Postmaster forks a session service process to serve that user connection. You can view the pid of the service process in the pg_stat_activity view, as follows:

test=# select pid, usename, client_addr, client_port from pg_stat_activity;

(2) Writer process
The Writer process writes buffers in shared memory to disk at appropriate times. This prevents performance from deteriorating under a flood of disk writes at checkpoint time, so the server can maintain relatively stable performance. The background writer is resident in memory from startup but does not work continuously: it works for a while and then sleeps. The sleep interval is set by the bgwriter_delay parameter in postgresql.conf; the default is 200 milliseconds.

Another important function of this process is the periodic execution of checkpoints.

At a checkpoint, the buffer contents in shared memory are written to the database files, bringing memory and files into a consistent state. This shortens recovery from WAL after a system crash and also prevents unbounded growth of the WAL. The checkpoint_segments and checkpoint_timeout parameters of postgresql.conf specify how often checkpoints are performed.

The Writer process is the process that writes dirty pages in shared memory to disk. It has two functions: first, it periodically flushes dirty data from the memory buffers to disk, reducing blocking during queries; second, since PG must write all dirty pages to disk during each periodic checkpoint (CheckPoint, a database recovery technique), having BgWriter write out some dirty pages in advance reduces the I/O a checkpoint must perform, keeping the system's I/O load stable. BgWriter was added in PostgreSQL 8.0. Its behavior is controlled by the parameters beginning with “bgwriter_” in the postgresql.conf file:

bgwriter_delay:

The interval between two consecutive data flushes by the background writer process. The default value is 200, in milliseconds.

bgwriter_lru_maxpages:

The maximum amount of data written by the background writer process in one round. The default value is 100, in buffers. If the amount of dirty data is below this value, the background writer performs all the writes; conversely, when it is above this value, the excess is written by the server processes themselves. Setting this value to 0 disables the background writer's writing entirely, leaving it all to the server processes; setting it to -1 means all dirty data is written by the background writer. (Checkpoint writes are not included here.)

bgwriter_lru_multiplier:

A multiplier that determines how many buffers are written in each round: the estimated number of buffers backends will soon need is multiplied by this factor, with the result still capped by bgwriter_lru_maxpages. If it is set too small, more dirty data needs writing than is written each round, and the server processes must finish the remaining disk writes themselves, which reduces performance; if it is set too large, more is written than the buffers needed at the time, which makes later buffer allocation easier but may waste I/O. The default value is 2.0.

BgWriter's maximum write throughput can be calculated as:

(1000 / bgwriter_delay) × bgwriter_lru_maxpages × 8 KB = maximum amount of data per second
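With the defaults (bgwriter_delay = 200 ms, bgwriter_lru_maxpages = 100), this works out to 5 rounds per second × 100 pages × 8 KB = 4000 KB, i.e. roughly 4 MB of dirty data per second.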

bgwriter_flush_after:

The background writer asks the operating system to flush its accumulated writes to disk whenever it has written this much data; the default is 512 KB.
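Put together, a postgresql.conf excerpt for these settings might look as follows (the values shown are simply the defaults):

bgwriter_delay = 200ms
bgwriter_lru_maxpages = 100
bgwriter_lru_multiplier = 2.0
bgwriter_flush_after = 512kB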

(3) WAL writer process (pre-write log writing)
The WAL writer process writes the WAL buffers in shared memory to disk at appropriate points in time. This reduces the pressure on backend processes to write their own WAL buffers and improves performance. In addition, when asynchronous commit is enabled, it ensures that the contents of the WAL buffers are written to the WAL log files within a bounded interval.

The core idea of the write-ahead log (Write-Ahead Log, also known as XLOG) is that modifications to data files may occur only after those modifications have been recorded in the log; that is, the log is written first and the data afterwards (log first). This mechanism avoids frequent data writes to disk and reduces disk I/O, and the database can use the WAL logs to recover after a crash and restart. The parameters related to the WalWriter process in the postgresql.conf file are as follows:

wal_level: controls how much information is written to the WAL. The default value is minimal, which writes only the information needed to recover from a crash or an immediate shutdown. replica adds the information required for WAL archiving as well as what a read-only standby server needs (new in 9.6, merging the archive and hot_standby levels of earlier versions). logical is used mainly for logical decoding scenarios.

fsync: directly controls whether the log is forced to disk first. The default value is on, meaning the system waits for the WAL write to complete when updated data is written to disk. It can be set to off, meaning updated data is written to disk without waiting for the WAL write to complete.

synchronous_commit: controls whether the server waits for the WAL to be flushed before returning status information to the user's transaction. The default value is on, meaning transaction status is returned only after the WAL flush completes; setting it to off reports transaction status faster.

wal_sync_method: the method used to force WAL writes to disk. The default value is fsync; the available values are open_datasync, fdatasync, fsync_writethrough, fsync, and open_sync. open_datasync and open_sync open the WAL files with the O_DSYNC and O_SYNC flags respectively. fdatasync and fsync call the fdatasync and fsync functions at each commit; both write the operating system's disk cache back to disk, but the former writes only the data portion of the file while the latter also synchronizes the file's attributes. fsync_writethrough means every commit writes back to disk while ensuring the operating system's disk cache and memory hold the same content.

full_page_writes: controls whether the entire content of a page is written to the WAL the first time that page is modified after a checkpoint.

wal_buffers: the size of the memory area used to store WAL data. The system default is 64 KB. This parameter also interacts with the wal_writer_delay and commit_delay parameters.

wal_writer_delay: the write interval of the WalWriter process. The default value is 200 milliseconds. If the interval is too long, the WAL buffer may run short of memory; if too short, the WAL is written continuously, increasing the disk I/O burden.

commit_delay: the time committed data is held in the WAL buffer. The default value is 0 milliseconds, meaning no delay; with a non-zero value, a transaction is not written to the WAL immediately after commit but remains in the WAL buffer, waiting for the WalWriter process to write it out periodically.

commit_siblings: when a transaction issues a commit request, if the number of active transactions in the database is greater than commit_siblings, the transaction waits for a period of time (the commit_delay value); otherwise the transaction is written to the WAL directly. The system default value is 5, and this parameter also determines whether commit_delay takes effect.

wal_writer_flush_after: when the WAL written since the last flush exceeds this threshold, it is flushed to disk.
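As an illustration, a postgresql.conf excerpt covering the most common of these settings (the values are typical choices, not recommendations):

wal_level = replica
fsync = on
synchronous_commit = on
wal_writer_delay = 200ms
commit_delay = 0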

(4) Archive process
The archiver process copies WAL logs to archive logs. If a base backup and the archived logs are preserved, the database can be restored to its most recent state even if the disk is completely destroyed.

It is similar to the ARCH archiver process of the Oracle database, except that ARCH archives redo logs whereas PgArch archives WAL logs. Note that WAL logs are recycled: older WAL segments are overwritten by newly generated ones, and the PgArch process backs up each WAL segment before it is overwritten. The purpose of archived logs is to let the database use a full backup plus the archive logs generated after that backup to return to any point in the past. The PITR (Point-In-Time Recovery) feature provided by PG since version 8.x relies on archived logs.

The PgArch process is controlled by the following parameters in the postgresql.conf file:

archive_mode:

Whether archiving is performed; the options are off, on, and always. The default value is off.

archive_command:

The command configured by the administrator for archiving WAL logs. Within the command, the predefined variable %p refers to the full path of the WAL file to be archived, and %f to its file name without the path (paths are interpreted relative to the server's current working directory). The command specified by archive_command is invoked as each WAL segment file is completed. If the command returns 0, PostgreSQL considers the file successfully archived and then deletes or recycles the WAL segment file; if it returns non-zero, PostgreSQL assumes the file was not successfully archived and retries periodically until it succeeds.

archive_timeout:

The archiving period: when the time set by this parameter elapses, a switch to a new WAL segment is forced. The default value is 0, which disables the feature.
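A minimal sketch of an archiving setup (the destination directory /mnt/server/archivedir is a placeholder):

archive_mode = on
archive_command = 'cp %p /mnt/server/archivedir/%f'
archive_timeout = 0  # or a number of seconds to force regular segment switches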

(5) stats collector process
The process that collects statistics, such as the number of accesses to tables and the number of disk accesses. The collected information is used by autovacuum and also serves as reference material for database administrators.

The PgStat process is the statistics collector of the PostgreSQL database. It gathers statistics during database operation, such as the number of inserts, updates, and deletes on each table, the number of data blocks read, index changes, and so on. These statistics mainly help the optimizer make correct judgments and choose the best execution plan. The parameters related to the PgStat process in the postgresql.conf file are as follows:

track_activities: whether to collect information about the command currently being executed in each session. This information is visible only to superusers and the session owner. The default value is on.

track_counts: whether to collect statistics about database activity. Because the AutoVacuum process chooses which database to clean based on these statistics, the default value is on.

track_io_timing: enables timing of data block I/O calls. The default is off, because when enabled the database repeatedly queries the system clock, which adds considerable overhead. Only superusers can change this setting.

track_functions: whether to count function calls and the time they consume.

track_activity_query_size: the number of bytes reserved for tracking the currently executing command of each active session. The default value is 1024; it can only be set at server start.

stats_temp_directory: the temporary storage path for statistics, which may be a relative or an absolute path. The default is pg_stat_tmp. Pointing this parameter at faster storage can reduce the database's physical I/O and improve performance. It can be changed only in the postgresql.conf file or on the server command line.
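The collected statistics are exposed through system views. For example, for the product table used earlier:

SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'product';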

(6) Logger process
Writes PostgreSQL's activity status to log files (not the transaction log) and rotates the log file at the specified time interval.

(7) Autovacuum (automatic cleaning) start process
The autovacuum launcher process relies on the postmaster to start vacuum worker processes indirectly rather than starting them itself; this improves the reliability of the system.

In the PG database, after UPDATE or DELETE operations the old versions of the data are not immediately removed; they are only marked as deleted. This is because of the PG database's multi-version mechanism: if old row versions are still visible to other open transactions, they must be retained for the time being. Once those transactions have committed, the old versions have no further value, and the database must clean up this garbage data to make room. That cleanup is performed by the AutoVacuum process. The parameters related to the AutoVacuum process in the postgresql.conf file are:

autovacuum: whether to enable the automatic cleanup feature. The default value is on.

log_autovacuum_min_duration: controls logging of autovacuum runs. When an autovacuum execution takes longer than this parameter's setting, information about the run is recorded in the log. The default is -1, meaning nothing is recorded.

autovacuum_max_workers: sets the maximum number of automatic cleanup worker processes.

autovacuum_naptime: sets the interval between two automatic cleanup runs.

autovacuum_vacuum_threshold and autovacuum_analyze_threshold: set the thresholds for the number of updated or deleted tuples on a table above which a vacuum or an analyze, respectively, is performed.

autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor: set the scale factors, as fractions of the table size, that are added to the corresponding thresholds.

autovacuum_freeze_max_age: sets the XID upper limit beyond which a cleanup of the database is forced.

autovacuum_vacuum_cost_delay: when an autovacuum process is about to run, the cost of executing the vacuum is estimated; if it exceeds autovacuum_vacuum_cost_limit, the process is delayed by autovacuum_vacuum_cost_delay. If this value is -1, the vacuum_cost_delay value is used instead. The default value is 20 ms.

autovacuum_vacuum_cost_limit: the cost evaluation threshold for autovacuum processes. The default is -1, meaning the vacuum_cost_limit value is used. If the cost accumulated while an autovacuum process runs exceeds autovacuum_vacuum_cost_limit, the autovacuum process sleeps.
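A representative postgresql.conf excerpt (these values happen to match the defaults and are shown only for illustration):

autovacuum = on
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.2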

(8) Automatic vacuum process
The autovacuum worker processes actually execute the vacuum tasks. Several vacuum processes may run at the same time.

(9) wal sender / wal receiver
The wal sender and wal receiver processes implement PostgreSQL replication (streaming replication). The wal sender process transmits WAL logs over the network, while the wal receiver processes of other PostgreSQL instances receive them. The instance running a wal receiver process (called the standby) replays the received WAL on its own database, producing a database identical to the sender's (called the master).
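A minimal configuration sketch for streaming replication (the host and user in primary_conninfo are placeholders):

# on the master, in postgresql.conf
wal_level = replica
max_wal_senders = 3

# on the standby
primary_conninfo = 'host=192.168.1.10 port=5432 user=replicator'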

(10) CheckPoint (checkpoint) process
A checkpoint is a transaction sequence point set by the system; it guarantees that log information before the checkpoint has been flushed to disk. The relevant postgresql.conf parameters, checkpoint_segments and checkpoint_timeout, were described in the Writer process section above.

1.1.1.4 Backend processing flow
Let's look at an overview of how the database engine's postgres child process handles a query. For simplicity, in the following description the backend process is referred to simply as the backend. The main function of the backend is PostgresMain (tcop/postgres.c).

1. Receive the query (SQL text) sent by the front end.

2. The SQL text is plain text, which the computer cannot process directly, so it is converted into an internal form, the parse tree, that is easier to handle. This step is called parsing, and the module that performs it is the parser. At this stage only the literal text is available, so as long as there is no syntax error, no error is reported even if the query selects from a table that does not exist. The tree built at this stage is called the raw parse tree. The entry point for parsing is raw_parser (parser/parser.c).

3. The raw parse tree is then converted into a query tree. At this point the database is consulted to check whether each table exists, and if so the table name is converted to its OID. This step is called analysis (analyze), and the module is the analyzer. When PostgreSQL code refers to the parse tree, it more often means this query tree. The entry point of the analysis module is parse_analyze (parser/analyze.c).

4. PostgreSQL implements views and rules by rewriting query statements, so when necessary the query tree is rewritten at this stage. This step is called rewrite, and its entry point is QueryRewrite (rewrite/rewriteHandler.c).

5. From the query tree, a plan tree is generated. This step is called planning, or query optimization, and the module that performs it is the query optimizer (also simply the optimizer, or planner). Its goal is to produce the plan tree that is estimated to complete in the shortest time. The entry point for planning is standard_planner (optimizer/plan/planner.c).

6. The steps in the plan tree are carried out to fulfil the query. This is called execution, and the module is the executor. The executor's entry point is ExecutorRun (executor/execMain.c).

7. The execution result is returned to the front end.

8. Return to step 1 and repeat.
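To observe these stages yourself, PostgreSQL can print the intermediate trees to the server log (these are standard, if verbose, debug parameters):

SET debug_print_parse = on;      -- the query tree after parsing and analysis
SET debug_print_rewritten = on;  -- the tree after rewriting
SET debug_print_plan = on;       -- the final plan tree
SELECT * FROM product;           -- the trees appear in the server log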

Thanks for reading this post.
