Postgres is eating the database world

PostgreSQL isn’t just a simple relational database; it’s a data management framework with the potential to engulf the entire database realm. The trend of “Using Postgres for Everything” is no longer limited to a few elite teams but is becoming a mainstream best practice.


OLAP’s New Challenger

In a 2016 database meetup, I argued that a significant gap in the PostgreSQL ecosystem was the lack of a sufficiently good columnar storage engine for OLAP workloads. While PostgreSQL itself offers lots of analysis features, its performance in full-scale analysis on larger datasets doesn’t quite measure up to dedicated real-time data warehouses.

Consider ClickBench, an analytics performance benchmark, where we’ve documented the performance of PostgreSQL, its ecosystem extensions, and derivative databases. The untuned PostgreSQL performs poorly (x1050), but it can reach (x47) with optimization. Additionally, there are three analysis-related extensions: columnar store Hydra (x42), time-series TimescaleDB (x103), and distributed Citus (x262).

pg-eat-clickbench.png

Clickbench c6a.4xlarge, 500gb gp2 results in relative time

This performance can’t be considered bad, especially compared to pure OLTP databases like MySQL and MariaDB (x3065, x19700); however, its third-tier performance is not “good enough,” lagging behind the first-tier OLAP components like Umbra, ClickHouse, Databend, SelectDB (x3~x4) by an order of magnitude. It’s a tough spot - not satisfying enough to use, but too good to discard.

However, the arrival of ParadeDB and DuckDB changed the game!

ParadeDB’s native PG extension pg_analytics achieves second-tier performance (x10), narrowing the gap to the top tier to just 3–4x. Given the additional benefits, this level of performance discrepancy is often acceptable - ACID, freshness and real-time data without ETL, no additional learning curve, no maintenance of separate services, not to mention its ElasticSearch grade full-text search capabilities.

DuckDB focuses on pure OLAP, pushing analysis performance to the extreme (x3.2) — excluding the academically focused, closed-source database Umbra, DuckDB is arguably the fastest for practical OLAP performance. It’s not a PG extension, but PostgreSQL can fully leverage DuckDB’s analysis performance boost as an embedded file database through projects like DuckDB FDW and pg_quack.

The emergence of ParadeDB and DuckDB propels PostgreSQL’s analysis capabilities to the top tier of OLAP, filling the last crucial gap in its analytic performance.


The Pendulum of Database Realm

The distinction between OLTP and OLAP didn’t exist at the inception of databases. The separation of OLAP data warehouses from databases emerged in the 1990s due to traditional OLTP databases struggling to support analytics scenarios’ query patterns and performance demands.

For a long time, best practice in data processing involved using MySQL/PostgreSQL for OLTP workloads and syncing data to specialized OLAP systems like Greenplum, ClickHouse, Doris, Snowflake, etc., through ETL processes.

DDIA ch3, The republic of OLTP & Kingdom of OLAP

Like many “specialized databases,” the strength of dedicated OLAP systems often lies in performance — achieving 1-3 orders of magnitude improvement over native PG or MySQL. The cost, however, is redundant data, excessive data movement, lack of agreement on data values among distributed components, extra labor expense for specialized skills, extra licensing costs, limited query language power, programmability and extensibility, limited tool integration, poor data integrity and availability compared with a complete DMBS.

However, as the saying goes, “What goes around comes around”. With hardware improving over thirty years following Moore’s Law, performance has increased exponentially while costs have plummeted. In 2024, a single x86 machine can have hundreds of cores (512 vCPU EPYC 9754x2), several TBs of RAM, a single NVMe SSD can hold up to 64TB, and a single all-flash rack can reach 2PB; object storage like S3 offers virtually unlimited storage.

io-bandwidth.png

Hardware advancements have solved the data volume and performance issue, while database software developments (PostgreSQL, ParadeDB, DuckDB) have addressed access method challenges. This puts the fundamental assumptions of the analytics sector — the so-called “big data” industry — under scrutiny.

As DuckDB’s manifesto "Big Data is Dead" suggests, the era of big data is over. Most people don’t have that much data, and most data is seldom queried. The frontier of big data recedes as hardware and software evolve, rendering “big data” unnecessary for 99% of scenarios.

If 99% of use cases can now be handled on a single machine with standalone DuckDB or PostgreSQL (and its replicas), what’s the point of using dedicated analytics components? If every smartphone can send and receive texts freely, what’s the point of pagers? (With the caveat that North American hospitals still use pagers, indicating that maybe less than 1% of scenarios might genuinely need “big data.”)

The shift in fundamental assumptions is steering the database world from a phase of diversification back to convergence, from a big bang to a mass extinction. In this process, a new era of unified, multi-modeled, super-converged databases will emerge, reuniting OLTP and OLAP. But who will lead this monumental task of reconsolidating the database field?


PostgreSQL: The Database World Eater

There are a plethora of niches in the database realm: time-series, geospatial, document, search, graph, vector databases, message queues, and object databases. PostgreSQL makes its presence felt across all these domains.

A case in point is the PostGIS extension, which sets the de facto standard in geospatial databases; the TimescaleDB extension awkwardly positions “generic” time-series databases; and the vector extension, PGVector, turns the dedicated vector database niche into a punchline.

This isn’t the first time; we’re witnessing it again in the oldest and largest subdomain: OLAP analytics. But PostgreSQL’s ambition doesn’t stop at OLAP; it’s eyeing the entire database world!

pigsty-ecosystem.png

What makes PostgreSQL so capable? Sure, it’s advanced, but so is Oracle; it’s open-source, as is MySQL. PostgreSQL’s edge comes from being both advanced and open-source, allowing it to compete with Oracle/MySQL. But its true uniqueness lies in its extreme extensibility and thriving extension ecosystem.

pg-survey.png

Reasons users choose PostgreSQL: Open-Source, Reliable, Extensible

PostgreSQL isn’t just a relational database; it’s a data management framework capable of engulfing the entire database galaxy. Besides being open-source and advanced, its core competitiveness stems from extensibility, i.e., its infra’s reusability and extension’s composability.


The Magic of Extreme Extensibility

PostgreSQL allows users to develop extensions, leveraging the database’s common infra to deliver features at minimal cost. For instance, the vector database extension pgvector, with just several thousand lines of code, is negligible in complexity compared to PostgreSQL’s millions of lines. Yet, this “insignificant” extension achieves complete vector data types and indexing capabilities, outperforming lots of specialized vector databases.

Why? Because pgvector’s creators didn’t need to worry about the database’s general additional complexities: ACID, recovery, backup & PITR, high availability, access control, monitoring, deployment, 3rd-party ecosystem tools, client drivers, etc., which require millions of lines of code to solve well. They only focused on the essential complexity of their problem.

For example, ElasticSearch was developed on the Lucene search library, while the Rust ecosystem has an improved next-gen full-text search library, Tantivy, as a Lucene alternative. ParadeDB only needs to wrap and connect it to PostgreSQL’s interface to offer search services comparable to ElasticSearch. More importantly, it can stand on the shoulders of PostgreSQL, leveraging the entire PG ecosystem’s united strength (e.g., mixed searches with PG Vector) to “unfairly” compete with another dedicated database.

img

Pigsty has 234 extensions available. And there are 1000+ more in the ecosystem


The extensibility brings another huge advantage: the composability of extensions, allowing different extensions to work together, creating a synergistic effect where 1+1 » 2. For instance, TimescaleDB can be combined with PostGIS for spatio-temporal data support; the BM25 extension for full-text search can be combined with the PGVector extension, providing hybrid search capabilities.

Furthermore, the distributive extension Citus can transparently transform a standalone cluster into a horizontally partitioned distributed database cluster. This capability can be orthogonally combined with other features, making PostGIS a distributed geospatial database, PGVector a distributed vector database, ParadeDB a distributed full-text search database, and so on.


What’s more powerful is that extensions evolve independently, without the cumbersome need for main branch merges and coordination. This allows for scaling — PG’s extensibility lets numerous teams explore database possibilities in parallel, with all extensions being optional, not affecting the core functionality’s reliability. Those features that are mature and robust have the chance to be stably integrated into the main branch.

PostgreSQL achieves both foundational reliability and agile functionality through the magic of extreme extensibility, making it an outlier in the database world and changing the game rules of the database landscape.


Game Changer in the DB Arena

The emergence of PostgreSQL has shifted the paradigms in the database domain: Teams endeavoring to craft a “new database kernel” now face a formidable trial — how to stand out against the open-source, feature-rich Postgres. What’s their unique value proposition?

Until a revolutionary hardware breakthrough occurs, the advent of practical, new, general-purpose database kernels seems unlikely. No singular database can match the overall prowess of PG, bolstered by all its extensions — not even Oracle, given PG’s ace of being open-source and free.

A niche database product might carve out a space for itself if it can outperform PostgreSQL by an order of magnitude in specific aspects (typically performance). However, it usually doesn’t take long before the PostgreSQL ecosystem spawns open-source extension alternatives. Opting to develop a PG extension rather than a whole new database gives teams a crushing speed advantage in playing catch-up!

Following this logic, the PostgreSQL ecosystem is poised to snowball, accruing advantages and inevitably moving towards a monopoly, mirroring the Linux kernel’s status in server OS within a few years. Developer surveys and database trend reports confirm this trajectory.

pg-is-no1-2.png

StackOverflow 2023 Survey: PostgreSQL, the Decathlete

pg-is-no1-1.png

StackOverflow’s Database Trends Over the Past 7 Years

PostgreSQL has long been the favorite database in HackerNews & StackOverflow. Many new open-source projects default to PostgreSQL as their primary, if not only, database choice. And many new-gen companies are going All in PostgreSQL.

As “Radical Simplicity: Just Use Postgres” says, Simplifying tech stacks, reducing components, accelerating development, lowering risks, and adding more features can be achieved by “Just Use Postgres.” Postgres can replace many backend technologies, including MySQL, Kafka, RabbitMQ, ElasticSearch, Mongo, and Redis, effortlessly serving millions of users. Just Use Postgres is no longer limited to a few elite teams but becoming a mainstream best practice.


What Else Can Be Done?

The endgame for the database domain seems predictable. But what can we do, and what should we do?

PostgreSQL is already a near-perfect database kernel for the vast majority of scenarios, making the idea of a kernel “bottleneck” absurd. Forks of PostgreSQL and MySQL that tout kernel modifications as selling points are essentially going nowhere.

This is similar to the situation with the Linux OS kernel today; despite the plethora of Linux distros, everyone opts for the same kernel. Forking the Linux kernel is seen as creating unnecessary difficulties, and the industry frowns upon it.

Accordingly, the main conflict is no longer the database kernel itself but two directions— database extensions and services! The former pertains to internal extensibility, while the latter relates to external composability. Much like the OS ecosystem, the competitive landscape will concentrate on database distributions. In the database domain, only those distributions centered around extensions and services stand a chance for ultimate success.

Kernel remains lukewarm, with MariaDB, the fork of MySQL’s parent, nearing delisting, while AWS, profiting from offering services and extensions on top of the free kernel, thrives. Investment has flowed into numerous PG ecosystem extensions and service distributions: Citus, TimescaleDB, Hydra, PostgresML, ParadeDB, FerretDB, StackGres, Aiven, Neon, Supabase, Tembo, PostgresAI, and our own PG distro — — Pigsty.


A dilemma within the PostgreSQL ecosystem is the independent evolution of many extensions and tools, lacking a unifier to synergize them. For instance, Hydra releases its own package and Docker image, and so does PostgresML, each distributing PostgreSQL images with their own extensions and only their own. These images and packages are far from comprehensive database services like AWS RDS.

Even service providers and ecosystem integrators like AWS fall short in front of numerous extensions, unable to include many due to various reasons (AGPLv3 license, security challenges with multi-tenancy), thus failing to leverage the synergistic amplification potential of PostgreSQL ecosystem extensions.

Extesion Category Pigsty RDS & PGDG AWS RDS PG Aliyun RDS PG
Add Extension Free to Install Not Allowed Not Allowed
Geo Spatial PostGIS 3.4.2 PostGIS 3.4.1 PostGIS 3.3.4
Time Series TimescaleDB 2.14.2
Distributive Citus 12.1
AI / ML PostgresML 2.8.1
Columnar Hydra 1.1.1
Vector PGVector 0.6 PGVector 0.6 pase 0.0.1
Sparse Vector PG Sparse 0.5.6
Full-Text Search pg_bm25 0.5.6
Graph Apache AGE 1.5.0
GraphQL PG GraphQL 1.5.0
Message Queue pgq 3.5.0
OLAP pg_analytics 0.5.6
DuckDB duckdb_fdw 1.1
CDC wal2json 2.5.3 wal2json 2.5
Bloat Control pg_repack 1.5.0 pg_repack 1.5.0 pg_repack 1.4.8
Point Cloud PG PointCloud 1.2.5 Ganos PointCloud 6.1

Many important extensions are not available on Cloud RDS (PG 16, 2024-02-29)

Extensions are the soul of PostgreSQL. A Postgres without the freedom to use extensions is like cooking without salt, a giant constrained.

Addressing this issue is one of our primary goals.


Our Resolution: Pigsty

Despite earlier exposure to MySQL Oracle, and MSSQL, when I first used PostgreSQL in 2015, I was convinced of its future dominance in the database realm. Nearly a decade later, I’ve transitioned from a user and administrator to a contributor and developer, witnessing PG’s march toward that goal.

Interactions with diverse users revealed that the database field’s shortcoming isn’t the kernel anymore — PostgreSQL is already sufficient. The real issue is leveraging the kernel’s capabilities, which is the reason behind RDS’s booming success.

However, I believe this capability should be as accessible as free software, like the PostgreSQL kernel itself — available to every user, not just renting from cyber feudal lords.

Thus, I created Pigsty, a battery-included, local-first PostgreSQL distribution as an open-source RDS Alternative, which aims to harness the collective power of PostgreSQL ecosystem extensions and democratize access to production-grade database services.

img

Pigsty stands for PostgreSQL in Great STYle, representing the zenith of PostgreSQL.

We’ve defined six core propositions addressing the central issues in PostgreSQL database services:

Extensible Postgres, Reliable Infras, Observable Graphics, Available Services, Maintainable Toolbox, and Composable Modules.

The initials of these value propositions offer another acronym for Pigsty:

Postgres, Infras, Graphics, Service, Toolbox, Yours.

Your graphical Postgres infrastructure service toolbox.

Extensible PostgreSQL is the linchpin of this distribution. In the recently launched Pigsty v2.6, we integrated DuckDB FDW and ParadeDB extensions, massively boosting PostgreSQL’s analytical capabilities and ensuring every user can easily harness this power.

Our aim is to integrate the strengths within the PostgreSQL ecosystem, creating a synergistic force akin to the Ubuntu of the database world. I believe the kernel debate is settled, and the real competitive frontier lies here.

  • PostGIS: Provides geospatial data types and indexes, the de facto standard for GIS (& pgPointCloud, pgRouting).
  • TimescaleDB: Adds time-series, continuous aggregates, distributed, columnar storage, and automatic compression capabilities.
  • PGVector: Support AI vectors/embeddings and ivfflat, hnsw vector indexes (& pg_sparse for sparse vectors).
  • Citus: Transforms classic master-slave PG clusters into horizontally partitioned distributed database clusters.
  • Hydra: Adds columnar storage and analytics, rivaling ClickHouse’s analytic capabilities.
  • ParadeDB: Elevates full-text search and mixed retrieval to ElasticSearch levels (& zhparser for Chinese tokenization).
  • Apache AGE: Graph database extension, adding Neo4J-like OpenCypher query support to PostgreSQL.
  • PG GraphQL: Adds native built-in GraphQL query language support to PostgreSQL.
  • DuckDB FDW: Enables direct access to DuckDB’s powerful embedded analytic database files through PostgreSQL (& DuckDB CLI).
  • Supabase: An open-source Firebase alternative based on PostgreSQL, providing a complete app development storage solution.
  • FerretDB: An open-source MongoDB alternative based on PostgreSQL, compatible with MongoDB APIs/drivers.
  • PostgresML: Facilitates classic machine learning algorithms, calling, deploying, and training AI models with SQL.

img

Developers, your choices will shape the future of the database world. I hope my work helps you better utilize the world’s most advanced open-source database kernel: PostgreSQL.

Read in Pigsty’s Blog | GitHub Repo: Pigsty | Official Website

PostgreSQL Convention 2024

WeChat Column

Roughly translated from PostgreSQL Convention 2024 with Google.


0x00 Background

No Rules, No Lines

The functions of PostgreSQL are very powerful, but to use PostgreSQL well requires the cooperation of backend, operation and maintenance, and DBA.

This article has compiled a development/operation and maintenance protocol based on the principles and characteristics of the PostgreSQL database, hoping to reduce the confusion you encounter when using the PostgreSQL database: hello, me, everyone.

The first version of this article is mainly for PostgreSQL 9.4 - PostgreSQL 10. The latest version has been updated and adjusted for PostgreSQL 15/16.


0x01 naming convention

There are only two hard problems in computer science: cache invalidation and naming .

Generic naming rules (Generic)

  • This rule applies to all objects in the database , including: library names, table names, index names, column names, function names, view names, serial number names, aliases, etc.
  • The object name must use only lowercase letters, underscores, and numbers, and the first letter must be a lowercase letter.
  • The length of the object name must not exceed 63 characters, and the naming snake_casestyle must be uniform.
  • The use of SQL reserved words is prohibited, use select pg_get_keywords();to obtain a list of reserved keywords.
  • Dollar signs are prohibited $, Chinese characters are prohibited, and do not pgbegin with .
  • Improve your wording taste and be honest and elegant; do not use pinyin, do not use uncommon words, and do not use niche abbreviations.

Cluster naming rules (Cluster)

  • The name of the PostgreSQL cluster will be used as the namespace of the cluster resource and must be a valid DNS domain name without any dots or underscores.
  • The cluster name should start with a lowercase letter, contain only lowercase letters, numbers, and minus signs, and conform to the regular expression: [a-z][a-z0-9-]*.
  • PostgreSQL database cluster naming usually follows a three-part structure: pg-<biz>-<tld>. Database type/business name/business line or environment
  • bizThe English words that best represent the characteristics of the business should only consist of lowercase letters and numbers, and should not contain hyphens -.
  • When using a backup cluster to build a delayed slave database of an existing cluster, bizthe name should be <biz>delay, for example pg-testdelay.
  • When branching an existing cluster, you can bizadd a number at the end of : for example, pg-user1you can branch from pg-user2, pg-user3etc.
  • For horizontally sharded clusters, bizthe name should include shardand be preceded by the shard number, for example pg-testshard1, pg-testshard2,…
  • <tld>It is the top-level business line and can also be used to distinguish different environments: for example -tt, -dev, -uat, -prodetc. It can be omitted if not required.

Service naming rules (Service)

  • Each PostgreSQL cluster will provide 2 to 6 types of external services, which use fixed naming rules by default.
  • The service name is prefixed with the cluster name and the service type is suffixed, for example pg-test-primary, pg-test-replica.
  • Read-write services are uniformly primarynamed with the suffix, and read-only services are uniformly replicanamed with the suffix. These two services are required.
  • ETL pull/individual user query is offlinenamed with the suffix, and direct connection to the main database/ETL write is defaultnamed with the suffix, which is an optional service.
  • The synchronous read service is standbynamed with the suffix, and the delayed slave library service is delayednamed with the suffix. A small number of core libraries can provide this service.

Instance naming rules (Instance)

  • A PostgreSQL cluster consists of at least one instance, and each instance has a unique instance number assigned from zero or one within the cluster.
  • The instance name- is composed of the cluster name + instance number with hyphens , for example: pg-test-1, pg-test-2.
  • Once assigned, the instance number cannot be modified until the instance is offline and destroyed, and cannot be reassigned for use.
  • The instance name will be used as a label for monitoring system data insand will be attached to all data of this instance.
  • If you are using a host/database 1:1 exclusive deployment, the node Hostname can use the database instance name.

Database naming rules (Database)

  • The database name should be consistent with the cluster and application, and must be a highly distinguishable English word.
  • The naming is <tld>_<biz>constructed in the form of , <tld>which is the top-level business line. It can also be used to distinguish different environments and can be omitted if not used.
  • <biz>For a specific business name, for example, pg-test-ttthe cluster can use the library name tt_testor test. This is not mandatory, i.e. it is allowed to create <biz>other databases with different cluster names.
  • For sharded libraries, <biz>the section must shardend with but should not contain the shard number, for example pg-testshard1, pg-testshard2both testshardshould be used.
  • Multiple parts use -joins. For example: <biz>-chat-shard, <biz>-paymentetc., no more than three paragraphs in total.

Role naming convention (Role/User)

  • dbsuThere is only one database super user : postgres, the user used for streaming replication is named replicator.
  • The users used for monitoring are uniformly named dbuser_monitor, and the super users used for daily management are: dbuser_dba.
  • The business user used by the program/service defaults to using dbuser_<biz>as the username, for example dbuser_test. Access from different services should be differentiated using separate business users.
  • The database user applied for by the individual user agrees to use dbp_<name>, where is namethe standard user name in LDAP.
  • The default permission group naming is fixed as: dbrole_readonly, dbrole_readwrite, dbrole_admin, dbrole_offline.

Schema naming rules (Schema)

  • The business uniformly uses a global <prefix>as the schema name, as short as possible, and is set to search_paththe first element by default.
  • <prefix>You must not use public, monitor, and must not conflict with any schema name used by PostgreSQL extensions, such as: timescaledb, citus, repack, graphql, net, cron,… It is not appropriate to use special names: dba, trash.
  • Sharding mode naming rules adopt: rel_<partition_total_num>_<partition_index>. The middle is the total number of shards, which is currently fixed at 8192. The suffix is the shard number, counting from 0. Such as rel_8192_0,…,,, rel_8192_11etc.
  • Creating additional schemas, or using <prefix>schema names other than , will require R&D to explain their necessity.

Relationship naming rules (Relation)

  • The first priority for relationship naming is to have clear meaning. Do not use ambiguous abbreviations or be too lengthy. Follow general naming rules.
  • Table names should use plural nouns and be consistent with historical conventions. Words with irregular plural forms should be avoided as much as possible.
  • Views use v_as the naming prefix, materialized views use mv_as the naming prefix, temporary tables use tmp_as the naming prefix.
  • Inherited or partitioned tables should be prefixed by the parent table name and suffixed by the child table attributes (rules, shard ranges, etc.).
  • The time range partition uses the starting interval as the naming suffix. If the first partition has no upper bound, the R&D will specify a far enough time point: grade partition: tbl_2023, month-level partition tbl_202304, day-level partition tbl_20230405, hour-level partition tbl_2023040518. The default partition _defaultends with .
  • The hash partition is named with the remainder as the suffix of the partition table name, and the list partition is manually specified by the R&D team with a reasonable partition table name corresponding to the list item.

Index naming rules (Index)

  • When creating an index, the index name should be specified explicitly and consistent with the PostgreSQL default naming rules.
  • Index names are prefixed with the table name, primary key indexes _pkeyend with , unique indexes _keyend with , ordinary indexes end _idxwith , and indexes used for EXCLUDEDconstraints _exclend with .
  • When using conditional index/function index, the function and condition content used should be reflected in the index name. For example tbl_md5_title_idx, tbl_ts_ge_2023_idx, but the length limit cannot be exceeded.

Field naming rules (Attribute)

  • It is prohibited to use system column reserved field names: oid, xmin, xmax, cmin, cmax, ctid.
  • Primary key columns are usually named with idor as ida suffix.
  • The conventional name is the creation time field created_time, and the conventional name is the last modification time field.updated_time
  • is_It is recommended to use , etc. as the prefix for Boolean fields has_.
  • Additional flexible JSONB fields are fixed using extraas column names.
  • The remaining field names must be consistent with existing table naming conventions, and any field naming that breaks conventions should be accompanied by written design instructions and explanations.

Enumeration item naming (Enum)

  • Enumeration items should be used by default camelCase, but other styles are allowed.

Function naming rules (Function)

  • Function names start with verbs: select, insert, delete, update, upsert, create,….
  • Important parameters can be reflected in the function name through _by_idsthe _by_user_idssuffix of.
  • Avoid function overloading and try to keep only one function with the same name.
  • BIGINT/INTEGER/SMALLINTIt is forbidden to overload function signatures through integer types such as , which may cause ambiguity when calling.
  • Use named parameters for variables in stored procedures and functions, and avoid positional parameters ( $1, $2,…).
  • If the parameter name conflicts with the object name, add before the parameter _, for example _user_id.

Comment specifications (Comment)

  • Try your best to provide comments ( COMMENT) for various objects. Comments should be in English, concise and concise, and one line should be used.
  • When the object’s schema or content semantics change, be sure to update the annotations to keep them in sync with the actual situation.

0x02 Design Convention

To each his own

Things to note when creating a table

  • The DDL statement for creating a table needs to use the standard format, with SQL keywords in uppercase letters and other words in lowercase letters.
  • Use lowercase letters uniformly in field names/table names/aliases, and try not to be case-sensitive. If you encounter a mixed case, or a name that conflicts with SQL keywords, you need to use double quotation marks for quoting.
  • Use specialized type (NUMERIC, ENUM, INET, MONEY, JSON, UUID, …) if applicable, and avoid using TEXT type as much as possible. The TEXT type is not conducive to the database’s understanding of the data. Use these types to improve data storage, query, indexing, and calculation efficiency, and improve maintainability.
  • Optimizing column layout and alignment types can have additional performance/storage gains.
  • Unique constraints must be guaranteed by the database, and any unique column must have a corresponding unique constraint. EXCLUDEConstraints are generalized unique constraints that can be used to ensure data integrity in low-frequency update scenarios.

Partition table considerations

  • If a single table exceeds hundreds of TB, or the monthly incremental data exceeds more than ten GB, you can consider table partitioning.
  • A guideline for partitioning is to keep the size of each partition within the comfortable range of 1GB to 64GB.
  • Tables that are conditionally partitioned by time range are first partitioned by time range. Commonly used granularities include: decade, year, month, day, and hour. The partitions required in the future should be created at least three months in advance.
  • For extremely skewed data distributions, different time granularities can be combined, for example: 1900 - 2000 as one large partition, 2000 - 2020 as year partitions, and after 2020 as month partitions. When using time partitioning, the table name uses the value of the lower limit of the partition (if infinity, use a value that is far enough back).

Notes on wide tables

  • Wide tables (for example, tables with dozens of fields) can be considered for vertical splitting, with mutual references to the main table through the same primary key.
  • Because of the PostgreSQL MVCC mechanism, the write amplification phenomenon of wide tables is more obvious, reducing frequent updates to wide tables.
  • In Internet scenarios, it is allowed to appropriately lower the normalization level and reduce multi-table connections to improve performance.

Primary key considerations

  • Every table must have an identity column , and in principle it must have a primary key. The minimum requirement is to have a non-null unique constraint .
  • The identity column is used to uniquely identify any tuple in the table, and logical replication and many third-party tools depend on it.
  • If the primary key contains multiple columns, it should be specified using a single column after creating the field list of the table DDL PRIMARY KEY(a,b,...).
  • In principle, it is recommended to use integer UUIDtypes for primary keys, which can be used with caution and text types with limited length. Using other types requires explicit explanation and evaluation.
  • The primary key usually uses a single integer column. In principle, it is recommended to use it BIGINT. Use it with caution INTEGERand it is not allowed SMALLINT.
  • The primary key should be used to GENERATED ALWAYS AS IDENTITYgenerate a unique primary key; SERIAL, BIGSERIALwhich is only allowed when compatibility with PG versions below 10 is required.
  • The primary key can use UUIDthe type as the primary key, and it is recommended to use UUID v1/v7; use UUIDv4 as the primary key with caution, as random UUID has poor locality and has a collision probability.
  • When using a string column as a primary key, you should add a length limit. Generally used VARCHAR(64), use of longer strings should be explained and evaluated.
  • INSERT/UPDATEIn principle, it is forbidden to modify the value of the primary key column, and INSERT RETURNING it can be used to return the automatically generated primary key value.

Foreign key considerations

  • When defining a foreign key, the reference must explicitly set the corresponding action: SET NULL, SET DEFAULT, CASCADE, and use cascading operations with caution.
  • The columns referenced by foreign keys need to be primary key columns in other tables/this table.
  • Internet businesses, especially partition tables and horizontal shard libraries, use foreign keys with caution and can be solved at the application layer.

Null/Default Value Considerations

  • If there is no distinction between zero and null values in the field semantics, null values are not allowed and NOT NULLconstraints must be configured for the column.
  • If a field has a default value semantically, DEFAULTthe default value should be configured.

Numeric type considerations

  • Used for regular numeric fields INTEGER. Used for numeric columns whose capacity is uncertain BIGINT.
  • Don’t use it without special reasons SMALLINT. The performance and storage improvements are very small, but there will be many additional problems.
  • Note that the SQL standard does not provide unsigned integers, and values exceeding INTMAXbut not exceeding UINTMAXneed to be upgraded and stored. Do not store more INT64MAXvalues in BIGINTthe column as it will overflow into negative numbers.
  • REALRepresents a 4-byte floating point number, FLOATrepresents an 8-byte floating point number. Floating point numbers can only be used in scenarios where the final precision doesn’t matter, such as geographic coordinates. Remember not to use equality judgment on floating point numbers, except for zero values .
  • Use exact numeric types NUMERIC. If possible, use NUMERIC(p)and NUMERIC(p,s)to set the number of significant digits and the number of significant digits in the decimal part. For example, the temperature in Celsius ( 37.0) can NUMERIC(3,1)be stored with 3 significant digits and 1 decimal place using type.
  • Currency value type is used MONEY.

Text type considerations

  • PostgreSQL text types include char(n), varchar(n), text. By default, textthe type can be used, which does not limit the string length, but is limited by the maximum field length of 1GB.
  • If conditions permit, it is preferable to use varchar(n)the type to set a maximum string length. This will introduce minimal additional checking overhead, but can avoid some dirty data and corner cases.
  • Avoid use char(n), this type has unintuitive behavior (padding spaces and truncation) and has no storage or performance advantages in order to be compatible with the SQL standard.

Time type considerations

  • There are only two ways to store time: with time zone TIMESTAMPTZand without time zone TIMESTAMP.
  • It is recommended to use one with time zone TIMESTAMPTZ. If you use TIMESTAMPstorage, you must use 0 time zone standard time.
  • Please use it to generate 0 time zone time now() AT TIME ZONE 'UTC'. You cannot truncate the time zone directly now()::TIMESTAMP.
  • Uniformly use ISO-8601 format input and output time type: 2006-01-02 15:04:05to avoid DMY and MDY problems.
  • Users in China can use Asia/Hong_Kongthe +8 time zone uniformly because the Shanghai time zone abbreviation CSTis ambiguous.

Notes on enumeration types

  • Fields that are more stable and have a small value space (within tens to hundreds) should use enumeration types instead of integers and strings.
  • Enumerations are internally implemented using dynamic integers, which have readability advantages over integers and performance, storage, and maintainability advantages over strings.
  • Enumeration items can only be added, not deleted, but existing enumeration values can be renamed. ALTER TYPE <enum_name>Used to modify enumerations.

UUID type considerations

  • Please note that the fully random UUIDv4 has poor locality when used as a primary key. Consider using UUIDv1/v7 instead if possible.
  • Some UUID generation/processing functions require additional extension plug-ins, such as uuid-ossp, pg_uuidv7 etc. If you have this requirement, please specify it during configuration.

JSON type considerations

  • Unless there is a special reason, always use the binary storage JSONBtype and related functions instead of the text version JSON.
  • Note the subtle differences between atomic types in JSON and their PostgreSQL counterparts: the zero character textis not allowed in the type corresponding to a JSON string \u0000, and the and numericis not allowed in the type corresponding to a JSON numeric type . Boolean values only accept lowercase and literal values.NaN``infinity``true``false
  • Please note that objects in the JSON standard nulland null values in the SQL standard NULL are not the same concept.

Array type considerations

  • When storing a small number of elements, array fields can be used instead of individually.
  • Suitable for storing data with a relatively small number of elements and infrequent changes. If the number of elements in the array is very large or changes frequently, consider using a separate table to store the data and using foreign key associations.
  • For high-dimensional floating-point arrays, consider using pgvectorthe dedicated data types provided by the extension.

GIS type considerations

  • The GIS type uses the srid=4326 reference coordinate system by default.
  • Longitude and latitude coordinate points should use the Geography type without explicitly specifying the reference system coordinates 4326

Trigger considerations

  • Triggers will increase the complexity and maintenance cost of the database system, and their use is discouraged in principle. The use of rule systems is prohibited and such requirements should be replaced by triggers.
  • Typical scenarios for triggers are to automatically modify a row to the current timestamp after modifying it updated_time, or to record additions, deletions, and modifications of a table to another log table, or to maintain business consistency between the two tables.
  • Operations in triggers are transactional, meaning if the trigger or operations in the trigger fail, the entire transaction is rolled back, so test and prove the correctness of your triggers thoroughly. Special attention needs to be paid to recursive calls, deadlocks in complex query execution, and the execution sequence of multiple triggers.

Stored procedure/function considerations

  • Functions/stored procedures are suitable for encapsulating transactions, reducing concurrency conflicts, reducing network round-trips, reducing the amount of returned data, and executing a small amount of custom logic.

  • Stored procedures are not suitable for complex calculations, and are not suitable for trivial/frequent type conversion and packaging. In critical high-load systems, unnecessary computationally intensive logic in the database should be removed, such as using SQL in the database to convert WGS84 to other coordinate systems. Calculation logic closely related to data acquisition and filtering can use functions/stored procedures: for example, geometric relationship judgment in PostGIS.

  • Replaced functions and stored procedures that are no longer in use should be taken offline in a timely manner to avoid conflicts with future functions.

  • Use a unified syntax format for function creation. The signature occupies a separate line (function name and parameters), the return value starts on a separate line, and the language is the first label. Be sure to mark the function volatility level: IMMUTABLE, STABLE, VOLATILE. Add attribute tags, such as: RETURNS NULL ON NULL INPUT, PARALLEL SAFE, ROWS 1etc.

    CREATE OR REPLACE FUNCTION
      nspname.myfunc(arg1_ TEXT, arg2_ INTEGER)
      RETURNS VOID
    LANGUAGE SQL
    STABLE
    PARALLEL SAFE
    ROWS 1
    RETURNS NULL ON NULL INPUT
    AS $function$
    SELECT 1;
    $function$;
    

Use sensible Locale options

Use reasonable character encoding and localization configuration

  • Character encoding must be used UTF8, any other character encoding is strictly prohibited.
  • Must be used Cas LC_COLLATEthe default collation, any special requirements must be explicitly specified in the DDL/query clause to implement.
  • Character set LC_CTYPEis used by default en_US.UTF8, some extensions rely on character set information to work properly, such as pg_trgm.

Notes on indexing

  • All online queries must design corresponding indexes according to their access patterns, and full table scans are not allowed except for very small tables.
  • Indexes have a price, and it is not allowed to create unused indexes. Indexes that are no longer used should be cleaned up in time.
  • When building a joint index, columns with high differentiation and selectivity should be placed first, such as ID, timestamp, etc.
  • GiST index can be used to solve the nearest neighbor query problem, and traditional B-tree index cannot provide good support for KNN problem.
  • For data whose values are linearly related to the storage order of the heap table, if the usual query is a range query, it is recommended to use the BRIN index. The most typical scenario is to only append written time series data. BRIN index is more efficient than Btree.
  • When retrieving against JSONB/array fields, you can use GIN indexes to speed up queries.

Clarify the order of null values in B-tree indexes

  • NULLS FIRSTIf there is a sorting requirement on a nullable column, it needs to be explicitly specified in the query and index NULLS LAST.
  • Note that DESCthe default rule for sorting is NULLS FIRSTthat null values appear first in the sort, which is generally not desired behavior.
  • The sorting conditions of the index must match the query, such as:CREATE INDEX ON tbl (id DESC NULLS LAST);

Disable indexing on large fields

  • The size of the indexed field cannot exceed 2KB (1/3 of the page capacity). You need to be careful when creating indexes on text types. The text to be indexed should use varchar(n)types with length constraints.
  • When a text type is used as a primary key, a maximum length must be set. In principle, the length should not exceed 64 characters. In special cases, the evaluation needs to be explicitly stated.
  • If there is a need for large field indexing, you can consider hashing the large field and establishing a function index. Or use another type of index (GIN).

Make the most of functional indexes

  • Any redundant fields that can be inferred from other fields in the same row can be replaced using functional indexes.
  • For statements that often use expressions as query conditions, you can use expression or function indexes to speed up queries.
  • Typical scenario: Establish a hash function index on a large field, and establish a reversefunction index for text columns that require left fuzzy query.

Take advantage of partial indexes

  • For the part of the query where the query conditions are fixed, partial indexes can be used to reduce the index size and improve query efficiency.
  • If a field to be indexed in a query has only a limited number of values, several corresponding partial indexes can also be established.
  • If the columns in some indexes are frequently updated, please pay attention to the expansion of these indexes.

0x03 Query Convention

The limits of my language mean the limits of my world.

—Ludwig Wittgenstein

Use service access

  • Access to the production database must be through domain name access services , and direct connection using IP addresses is strictly prohibited.
  • VIP is used for services and access, LVS/HAProxy shields the role changes of cluster instance members, and master-slave switching does not require application restart.

Read and write separation

  • Internet business scenario: Write requests must go through the main library and be accessed through the Primary service.
  • In principle, read requests go from the slave library and are accessed through the Replica service.
  • Exceptions: If you need “Read Your Write” consistency guarantees, and significant replication delays are detected, read requests can access the main library; or apply to the DBA to provide Standby services.

Separation of speed and slowness

  • Queries within 1 millisecond in production are called fast queries, and queries that exceed 1 second in production are called slow queries.
  • Slow queries must go to the offline slave database - Offline service/instance, and a timeout should be set during execution.
  • In principle, the execution time of online general queries in production should be controlled within 1ms.
  • If the execution time of an online general query in production exceeds 10ms, the technical solution needs to be modified and optimized before going online.
  • Online queries should be configured with a Timeout of the order of 10ms or faster to avoid avalanches caused by accumulation.
  • ETL data from the primary is prohibited, and the offline service should be used to retrieve data from a dedicated instance.

Use connection pool

  • Production applications must access the database through a connection pool and the PostgreSQL database through a 1:1 deployed Pgbouncer proxy. Offline service, individual users are strictly prohibited from using the connection pool directly.
  • Pgbouncer connection pool uses Transaction Pooling mode by default. Some session-level functions may not be available (such as Notify/Listen), so special attention is required. Pre-1.21 Pgbouncer does not support the use of Prepared Statements in this mode. In special scenarios, you can use Session Pooling or bypass the connection pool to directly access the database, which requires special DBA review and approval.
  • When using a connection pool, it is prohibited to modify the connection status, including modifying connection parameters, modifying search paths, changing roles, and changing databases. The connection must be completely destroyed after modification as a last resort. Putting the changed connection back into the connection pool will lead to the spread of contamination. Use of pg_dump to dump data via Pgbouncer is strictly prohibited.

Configure active timeout for query statements

  • Applications should configure active timeouts for all statements and proactively cancel requests after timeout to avoid avalanches. (Go context)
  • Statements that are executed periodically must be configured with a timeout smaller than the execution period to avoid avalanches.
  • HAProxy is configured with a default connection timeout of 24 hours for rolling expired long connections. Please do not run SQL that takes more than 1 day to execute on offline instances. This requirement will be specially adjusted by the DBA.

Pay attention to replication latency

  • Applications must be aware of synchronization delays between masters and slaves and properly handle situations where replication delays exceed reasonable limits.
  • Under normal circumstances, replication delays are on the order of 100µs/tens of KB, but in extreme cases, slave libraries may experience replication delays of minutes/hours. Applications should be aware of this phenomenon and have corresponding degradation plans - Select Read from the main library and try again later, or report an error directly.

Retry failed transactions

  • Queries may be killed due to concurrency contention, administrator commands, etc. Applications need to be aware of this and retry if necessary.
  • When the application reports a large number of errors in the database, it can trigger the circuit breaker to avoid an avalanche. But be careful to distinguish the type and nature of errors.

Disconnected and reconnected

  • The database connection may be terminated for various reasons, and the application must have a disconnection reconnection mechanism.
  • It can be used SELECT 1as a heartbeat packet query to detect the presence of messages on the connection and keep it alive periodically.

Online service application code prohibits execution of DDL

  • It is strictly forbidden to execute DDL in production applications and do not make big news in the application code.
  • Exception scenario: Creating new time partitions for partitioned tables can be carefully managed by the application.
  • Special exception: Databases used by office systems, such as Gitlab/Jira/Confluence, etc., can grant application DDL permissions.

SELECT statement explicitly specifies column names

  • Avoid using it SELECT *, or RETURNINGuse it in a clause *. Please use a specific field list and do not return unused fields. When the table structure changes (for example, a new value column), queries that use column wildcards are likely to encounter column mismatch errors.
  • After the fields of some tables are maintained, the order will change. For example: after idupgrading the INTEGER primary key to BIGINT, idthe column order will be the last column. This problem can only be fixed during maintenance and migration. R&D developers should resist the compulsion to adjust the column order and explicitly specify the column order in the SELECT statement.
  • Exception: Wildcards are allowed when a stored procedure returns a specific table row type.

Disable online query full table scan

  • Exceptions: constant minimal table, extremely low-frequency operations, table/return result set is very small (within 100 records/100 KB).
  • Using negative operators such as on the first-level filter condition will result in a full table scan and must be !=avoided .<>

Disallow long waits in transactions

  • Transactions must be committed or rolled back as soon as possible after being started. Transactions that exceed 10 minutes IDEL IN Transactionwill be forcibly killed.
  • Applications should enable AutoCommit to avoid BEGINunpaired ROLLBACKor unpaired applications later COMMIT.
  • Try to use the transaction infrastructure provided by the standard library, and do not control transactions manually unless absolutely necessary.

Things to note when using count

  • count(*)It is the standard syntax for counting rows and has nothing to do with null values.
  • count(col)The count is the number of non-null recordscol in the column . NULL values in this column will not be counted.
  • count(distinct col)When coldeduplicating columns and counting them, null values are also ignored, that is, only the number of non-null distinct values is counted.
  • count((col1, col2))When counting multiple columns, even if the columns to be counted are all empty, they will still be counted. (NULL,NULL)This is valid.
  • a(distinct (col1, col2))For multi-column deduplication counting, even if the columns to be counted are all empty, they will be counted, (NULL,NULL)which is effective.

Things to note when using aggregate functions

  • All countaggregate functions except NULLBut count(col)in this case it will be returned 0as an exception.
  • If returning null from an aggregate function is not expected, use coalesceto set a default value.

Handle null values with caution

  • Clearly distinguish between zero values and null values. Use null values IS NULLfor equivalence judgment, and use regular =operators for zero values for equivalence judgment.

  • When a null value is used as a function input parameter, it should have a type modifier, otherwise the overloaded function will not be able to identify which one to use.

  • Pay attention to the null value comparison logic: the result of any comparison operation involving null values is unknown you need to pay attention to null the logic involved in Boolean operations:

    • and: TRUE or NULLWill return due to logical short circuit TRUE.
    • or: FALSE and NULLWill return due to logical short circuitFALSE
    • In other cases, as long as the operand appears NULL, the result isNULL
  • The result of logical judgment between null value and any value is null value, for example, NULL=NULLthe return result is NULLnot TRUE/FALSE.

  • For equality comparisons involving null values and non-null values, please use ``IS DISTINCT FROM for comparison to ensure that the comparison result is not null.

  • NULL values and aggregate functions: When all input values are NULL, the aggregate function returns NULL.

Note that the serial number is empty

  • When using Serialtypes, INSERT, UPSERTand other operations will consume sequence numbers, and this consumption will not be rolled back when the transaction fails.
  • When using an integer INTEGERas the primary key and the table has frequent insertion conflicts, you need to pay attention to the problem of integer overflow.

The cursor must be closed promptly after use

Repeated queries using prepared statements

  • Prepared Statements should be used for repeated queries to eliminate the CPU overhead of database hard parsing. Pgbouncer versions earlier than 1.21 cannot support this feature in transaction pooling mode, please pay special attention.
  • Prepared statements will modify the connection status. Please pay attention to the impact of the connection pool on prepared statements.

Choose the appropriate transaction isolation level

  • The default isolation level is read committed , which is suitable for most simple read and write transactions. For ordinary transactions, choose the lowest isolation level that meets the requirements.
  • For write transactions that require transaction-level consistent snapshots, use the Repeatable Read isolation level.
  • For write transactions that have strict requirements on correctness (such as money-related), use the serializable isolation level.
  • When a concurrency conflict occurs between the RR and SR isolation levels, the application should actively retry depending on the error type.

rh 09 Do not use count when judging the existence of a result.

  • It is faster than Count to SELECT 1 FROM tbl WHERE xxx LIMIT 1judge whether there are columns that meet the conditions.
  • SELECT exists(SELECT * FROM tbl WHERE xxx LIMIT 1)The existence result can be converted to a Boolean value using .

Use the RETURNING clause to retrieve the modified results in one go

  • RETURNINGThe clause can be used after the INSERT, UPDATE, DELETEstatement to effectively reduce the number of database interactions.

Use UPSERT to simplify logic

  • When the business has an insert-failure-update sequence of operations, consider using UPSERTsubstitution.

Use advisory locks to deal with hotspot concurrency .

  • For extremely high-frequency concurrent writes (spike) of single-row records, advisory locks should be used to lock the record ID.
  • If high concurrency contention can be resolved at the application level, don’t do it at the database level.

Optimize IN operator

  • Use EXISTSclause instead of INoperator for better performance.
  • Use =ANY(ARRAY[1,2,3,4])instead IN (1,2,3,4)for better results.
  • Control the size of the parameter list. In principle, it should not exceed 10,000. If it exceeds, you can consider batch processing.

It is not recommended to use left fuzzy search

  • Left fuzzy search WHERE col LIKE '%xxx'cannot make full use of B-tree index. If necessary, reverseexpression function index can be used.

Use arrays instead of temporary tables

  • Consider using an array instead of a temporary table, for example when obtaining corresponding records for a series of IDs. =ANY(ARRAY[1,2,3])Better than temporary table JOIN.

0x04 Administration Convention

Use Pigsty to build PostgreSQL cluster and infrastructure

  • The production environment uses the Pigsty trunk version uniformly, and deploys the database on x86_64 machines and CentOS 7.9 / RockyLinux 8.8 operating systems.
  • pigsty.ymlConfiguration files usually contain highly sensitive and important confidential information. Git should be used for version management and access permissions should be strictly controlled.
  • files/pkiThe CA private key and other certificates generated within the system should be properly kept, regularly backed up to a secure area for storage and archiving, and access permissions should be strictly controlled.
  • All passwords are not allowed to use default values, and make sure they have been changed to new passwords with sufficient strength.
  • Strictly control access rights to management nodes and configuration code warehouses, and only allow DBA login and access.

Monitoring system is a must

  • Any deployment must have a monitoring system, and the production environment uses at least two sets of Infra nodes to provide redundancy.

Properly plan the cluster architecture according to needs

  • Any production database cluster managed by a DBA must have at least one online slave database for online failover.
  • The template is used by default oltp, the analytical database uses olapthe template, the financial database uses critthe template, and the micro virtual machine (within four cores) uses tinythe template.
  • For businesses whose annual data volume exceeds 1TB, or for clusters whose write TPS exceeds 30,000 to 50,000, you can consider building a horizontal sharding cluster.

Configure cluster high availability using Patroni and Etcd

  • The production database cluster uses Patroni as the high-availability component and etcd as the DCS.
  • etcdUse a dedicated virtual machine cluster, with 3 to 5 nodes, strictly scattered and distributed on different cabinets.
  • Patroni Failsafe mode must be turned on to ensure that the cluster main library can continue to work when etcd fails.

Configure cluster PITR using pgBackRest and MinIO

  • The production database cluster uses pgBackRest as the backup recovery/PITR solution and MinIO as the backup storage warehouse.
  • MinIO uses a multi-node multi-disk cluster, and can also use S3/OSS/COS services instead. Password encryption must be set for cold backup.
  • All database clusters perform a local full backup every day, retain the backup and WAL of the last week, and save a full backup every other month.
  • When a WAL archiving error occurs, you should check the backup warehouse and troubleshoot the problem in time.

Core business database configuration considerations

  • The core business cluster needs to configure at least two online slave libraries, one of which is a dedicated offline query instance.
  • The core business cluster needs to build a delayed slave cluster with a 24-hour delay for emergency data recovery.
  • Core business clusters usually use asynchronous submission, while those related to money use synchronous submission.

Financial database configuration considerations

  • The financial database cluster requires at least two online slave databases, one of which is a dedicated synchronization Standby instance, and Standby service access is enabled.
  • Money-related libraries must use crittemplates with RPO = 0, enable synchronous submission to ensure zero data loss, and enable Watchdog as appropriate.
  • Money-related libraries must be forced to turn on data checksums and, if appropriate, turn on full DML logs.

Use reasonable character encoding and localization configuration

  • Character encoding must be used UTF8, any other character encoding is strictly prohibited.
  • Must be used Cas LC_COLLATEthe default collation, any special requirements must be explicitly specified in the DDL/query clause to implement.
  • Character set LC_CTYPEis used by default en_US.UTF8, some extensions rely on character set information to work properly, such as pg_trgm.

Business database management considerations

  • Multiple different databases are allowed to be created in the same cluster, and Ansible scripts must be used to create new business databases.
  • All business databases must exist synchronously in the Pgbouncer connection pool.

Business user management considerations

  • Different businesses/services must use different database users, and Ansible scripts must be used to create new business users.
  • All production business users must be synchronized in the user list file of the Pgbouncer connection pool.
  • Individual users should set a password with a default validity period of 90 days and change it regularly.
  • Individual users are only allowed to access authorized cluster offline instances or slave pg_offline_querylibraries with from the springboard machine.

Notes on extension management

  • yum/aptWhen installing a new extension, you must first install the corresponding major version of the extension binary package in all instances of the cluster .
  • Before enabling the extension, you need to confirm whether the extension needs to be added shared_preload_libraries. If necessary, a rolling restart should be arranged.
  • Note that shared_preload_librariesin order of priority, citus, timescaledb, pgmlare usually placed first.
  • pg_stat_statementsand auto_explainare required plugins and must be enabled in all clusters.
  • Install extensions uniformly using , and create them dbsuin the business database .CREATE EXTENSION

Database XID and age considerations

  • Pay attention to the age of the database and tables to avoid running out of XID transaction numbers. If the usage exceeds 20%, you should pay attention; if it exceeds 50%, you should intervene immediately.
  • When processing XID, execute the table one by one in order of age from largest to smallest VACUUM FREEZE.

Database table and index expansion considerations

  • Pay attention to the expansion rate of tables and indexes to avoid index performance degradation, and use pg_repackonline processing to handle table/index expansion problems.
  • Generally speaking, indexes and tables whose expansion rate exceeds 50% can be considered for reorganization.
  • When dealing with table expansion exceeding 100GB, you should pay special attention and choose business low times.

Database restart considerations

  • Before restarting the database, execute it CHECKPOINTtwice to force dirty pages to be flushed, which can speed up the restart process.
  • Before restarting the database, perform pg_ctl reloadreload configuration to confirm that the configuration file is available normally.
  • To restart the database, use pg_ctl restartpatronictl or patronictl to restart the entire cluster at the same time.
  • Use kill -9to shut down any database process is strictly prohibited.

Replication latency considerations

  • Monitor replication latency, especially when using replication slots.

New slave database data warm-up

  • When adding a new slave database instance to a high-load business cluster, the new database instance should be warmed up, and the HAProxy instance weight should be gradually adjusted and applied in gradients: 4, 8, 16, 32, 64, and 100. pg_prewarmHot data can be loaded into memory using .

Database publishing process

  • Online database release requires several evaluation stages: R&D self-test, supervisor review, QA review (optional), and DBA review.
  • During the R&D self-test phase, R&D should ensure that changes are executed correctly in the development and pre-release environments.
    • If a new table is created, the record order magnitude, daily data increment estimate, and read and write throughput magnitude estimate should be given.
    • If it is a new function, the average execution time and extreme case descriptions should be given.
    • If it is a mode change, all upstream and downstream dependencies must be sorted out.
    • If it is a data change and record revision, a rollback SQL must be given.
  • The R&D Team Leader needs to evaluate and review changes and be responsible for the content of the changes.
  • The DBA evaluates and reviews the form and impact of the release, puts forward review opinions, and calls back or implements them uniformly.

Data work order format

  • Database changes are made through the platform, with one work order for each change.
  • The title is clear: A certain business needs xxto perform an action in the database yy.
  • The goal is clear: what operations need to be performed on which instances in each step, and how to verify the results.
  • Rollback plan: Any changes need to provide a rollback plan, and new ones also need to provide a cleanup script.
  • Any changes need to be recorded and archived, and have complete approval records. They are first approved by the R&D superior TL Review and then approved by the DBA.

Database change release considerations

  • Using a unified release window, changes of the day will be collected uniformly at 16:00 every day and executed sequentially; requirements confirmed by TL after 16:00 will be postponed to the next day. Database release is not allowed after 19:00. For emergency releases, please ask TL to make special instructions and send a copy to the CTO for approval before execution.
  • Database DDL changes and DML changes are uniformly dbuser_dbaexecuted remotely using the administrator user to ensure that the default permissions work properly.
  • When the business administrator executes DDL by himself, he mustSET ROLE dbrole_admin first execute the release to ensure the default permissions.
  • Any changes require a rollback plan before they can be executed, and very few operations that cannot be rolled back need to be handled with special caution (such as enumeration of value additions)
  • Database changes use psqlcommand line tools, connect to the cluster main database to execute, use \iexecution scripts or \emanual execution in batches.

Things to note when deleting tables

  • The production data table DROPshould be renamed first and allowed to cool for 1 to 3 days to ensure that it is not accessed before being removed.
  • When cleaning the table, you must sort out all dependencies, including directly and indirectly dependent objects: triggers, foreign key references, etc.
  • The temporary table to be deleted is usually placed in trashSchema and ALTER TABLE SET SCHEMAthe schema name is modified.
  • In high-load business clusters, when removing particularly large tables (> 100G), select business valleys to avoid preempting I/O.

Things to note when creating and deleting indexes

  • You must use CREATE INDEX CONCURRENTLYconcurrent index creation and DROP INDEX CONCURRENTLYconcurrent index removal.
  • When rebuilding an index, always create a new index first, then remove the old index, and modify the new index name to be consistent with the old index.
  • After index creation fails, you should remove INVALIDthe index in time. After modifying the index, use analyzeto re-collect statistical data on the table.
  • When the business is idle, you can enable parallel index creation and set it maintenance_work_memto a larger value to speed up index creation.

Make schema changes carefully

  • Try to avoid full table rewrite changes as much as possible. Full table rewrite is allowed for tables within 1GB. The DBA should notify all relevant business parties when the changes are made.
  • When adding new columns to an existing table, you should avoid using functions in default values VOLATILEto avoid a full table rewrite.
  • When changing a column type, all functions and views that depend on that type should be rebuilt if necessary, and ANALYZEstatistics should be refreshed.

Control the batch size of data writing

  • Large batch write operations should be divided into small batches to avoid generating a large amount of WAL or occupying I/O at one time.
  • After a large batch UPDATEis executed, VACUUMthe space occupied by dead tuples is reclaimed.
  • The essence of executing DDL statements is to modify the system directory, and it is also necessary to control the number of DDL statements in a batch.

Data loading considerations

  • Use COPYload data, which can be executed in parallel if necessary.
  • You can temporarily shut down before loading data autovacuum, disable triggers as needed, and create constraints and indexes after loading.
  • Turn it up maintenance_work_mem, increase it max_wal_size.
  • Executed after loading is complete vacuum verbose analyze table.

Notes on database migration and major version upgrades

  • The production environment uniformly uses standard migration to build script logic, and realizes requirements such as non-stop cluster migration and major version upgrades through blue-green deployment.
  • For clusters that do not require downtime, you can use pg_dump | psqllogical export and import to stop and upgrade.

Data Accidental Deletion/Accidental Update Process

  • After an accident occurs, immediately assess whether it is necessary to stop the operation to stop bleeding, assess the scale of the impact, and decide on treatment methods.
  • If there is a way to recover on the R&D side, priority will be given to the R&D team to make corrections through SQL publishing; otherwise, use pageinspectand pg_dirtyreadto rescue data from the bad table.
  • If there is a delayed slave library, extract data from the delayed slave library for repair. First, confirm the time point of accidental deletion, and advance the delay to extract data from the database to the XID.
  • A large area was accidentally deleted and written. After communicating with the business and agreeing, perform an in-place PITR rollback to a specific time.

Data corruption processing process

  • Confirm whether the slave database data can be used for recovery. If the slave database data is intact, you can switchover to the slave database first.
  • Temporarily shut down auto_vacuum, locate the root cause of the error, replace the failed disk and add a new slave database.
  • If the system directory is damaged, or use to pg_filedumprecover data from table binaries.
  • If the CLOG is damaged, use ddto generate a fake submission record.

Things to note when the database connection is full

  • When the connection is full (avalanche), immediately use the kill connection query to cure the symptoms and stop the loss: pg_cancel_backendor pg_terminate_backend.
  • Use to pg_terminate_backendabort all normal backend processes, psql \watch 1starting with once per second ( ). And confirm the connection status from the monitoring system. If the accumulation continues, continue to increase the execution frequency of the connection killing query, for example, once every 0.1 seconds until there is no more accumulation.
  • After confirming that the bleeding has stopped from the monitoring system, try to stop the killing connection. If the accumulation reappears, immediately resume the killing connection. Immediately analyze the root cause and perform corresponding processing (upgrade, limit current, add index, etc.)

PostgreSQL, The most successful database

The StackOverflow 2023 Survey, featuring feedback from 90K developers across 185 countries, is out. PostgreSQL topped all three survey categories (used, loved, and wanted), earning its title as the undisputed “Decathlete Database” – it’s hailed as the “Linux of Database”!

https://demo.pigsty.cc/d/sf-survey

What makes a database “successful”? It’s a mix of features, quality, security, performance, and cost, but success is mainly about adoption and legacy. The size, preference, and needs of its user base are what truly shape its ecosystem’s prosperity. StackOverflow’s annual surveys for seven years have provided a window into tech trends.

PostgreSQL is now the world’s most popular database.

PostgreSQL is developers’ favorite database!

PostgreSQL sees the highest demand among users!

Popularity, the used reflects the past, the loved indicates the present, and the wanted suggests the future. These metrics vividly showcase the vitality of a technology. PostgreSQL stands strong in both stock and potential, unlikely to be rivaled soon.

As a dedicated user, community member, expert, evangelist, and contributor to PostgreSQL, witnessing this moment is profoundly moving. Let’s delve into the “Why” and “What” behind this phenomenon.


Source: Community Survey

Developers define the success of databases, and StackOverflow’s survey, with popularity, love, and demand metrics, captures this directly.

“Which database environments have you done extensive development work in over the past year, and which do you want to work in over the next year? If you both worked with the database and want to continue to do so, please check both boxes in that row.”

Each database in the survey had two checkboxes: one for current use, marking the user as “Used,” and one for future interest, marking them as “Wanted.” Those who checked both were labeled as “Loved/Admired.”

https://survey.stackoverflow.co/2023

The percentage of “Used” respondents represents popularity or usage rate, shown as a bar chart, while “Wanted” indicates demand or desire, marked with blue dots. “Loved/Admired” shows as red dots, indicating love or reputation. In 2023, PostgreSQL outstripped MySQL in popularity, becoming the world’s most popular database, and led by a wide margin in demand and reputation.

Reviewing seven years of data and plotting the top 10 databases on a scatter chart of popularity vs. net love percentage (2*love% - 100), we gain insights into the database field’s evolution and sense of scale.

X: Popularity, Y: Net Love Index (2 * loved - 100)

The 2023 snapshot shows PostgreSQL in the top right, popular and loved, while MySQL, popular yet less favored, sits in the bottom right. Redis, moderately popular but much loved, is in the top left, and Oracle, neither popular nor loved, is in the bottom left. In the middle lie SQLite, MongoDB, and SQL Server.

Trends indicate PostgreSQL’s growing popularity and love; MySQL’s love remains flat with falling popularity. Redis and SQLite are progressing, MongoDB is peaking and declining, and the commercial RDBMSs SQL Server and Oracle are on a downward trend.

The takeaway: PostgreSQL’s standing in the database realm, akin to Linux in server OS, seems unshakeable for the foreseeable future.


Historical Accumulation: Popularity

PostgreSQL — The world’s most popular database

一项技术使用者占总体的比例,就是流行度。它的含义是:过去一年有多少比例的用户使用了这项技术。流行度代表过去一年的积累使用,是存量指标,也是最核心的事实指标。

在 2023 年, “最先进” PostgreSQL 在所有开发者中以 45.6% 的使用率,首次超过“最流行”数据库 MySQL 41.1%,领先 4.5%,使用率是第二名 MySQL 的1.1倍。 对于专业开发者(约占总样本的3/4)来说,PostgreSQL 的使用率在去年(2022)就已经超过 MySQL 了,以 46.5% vs 45.7% 领先0.8个百分点; 在 2023 年,这一差距进一步拉大到 49.1% vs 40.6,领先 8.5% —— 换句话说,专业开发者中,PostgreSQL 的使用率已经是 MySQL 的 1.2 倍了。

过去几年,MySQL 一直霸占着数据库流行榜的榜首,洋洋得意地打起了“世界上最流行的开源关系型数据库” 这一旗号。 不过这次,“最流行” 的桂冠真的要让给 PostgreSQL 了。在流行度上,其他数据库和 PostgreSQL / MySQL 比根本就不是一个重量级,自然就更不用说了。

更重要的的是变化趋势:在长期列入排名的十几款头部数据库中,只有 PostgreSQL 的流行度是持续上升的,保持着高歌猛进的增长势头,而其他所有的数据库使用率都在下行。 此消彼长,随着时间的推移,PostgreSQL 与其他数据库的流行度差距只会进一步拉大 —— 因此在相当长的一段时间内,恐怕是看不到有任何挑战者能撼动 PostgreSQL 现在的位置了。

值得一提的是,“国产数据库”的标杆 ”TiDB“ 这次也加入到 StackOverflow 排行榜中,并以 0.2% 的使用率,拿到了末位第 32 名的名次。

流行度反映的是当下数据库的规模势能,而喜爱度反映的是未来数据库的增长潜能。

Popularity is the percentage of total users who have used a technology in the past year. It reflects the accumulated usage over the past year and is a core metric of factual significance.

In 2023, PostgreSQL, branded as the “most advanced,” surpassed the “most popular” database MySQL with a usage rate of 45.6%, leading by 4.5% and reaching 1.1 times the usage rate of MySQL at 41.1%. Among professional developers (about three-quarters of the sample), PostgreSQL had already overtaken MySQL in 2022, with a 0.8 percentage point lead (46.5% vs 45.7%); this gap widened in 2023 to 49.1% vs 40.6%, or 1.2 times the usage rate among professional developers.

Over the past years, MySQL enjoyed the top spot in database popularity, proudly claiming the title of the “world’s most popular open-source relational database.” However, PostgreSQL has now claimed the crown. Compared to PostgreSQL and MySQL, other databases are not in the same league in terms of popularity.

The key trend to note is that among the top-ranked databases, only PostgreSQL has shown a consistent increase in popularity, demonstrating strong growth momentum, while all other databases have seen a decline in usage. As time progresses, the gap in popularity between PostgreSQL and other databases will likely widen, making it hard for any challenger to displace PostgreSQL in the near future.

Notably, the “domestic database” TiDB has entered the StackOverflow rankings for the first time, securing the 32nd spot with a 0.2% usage rate.

Popularity reflects the current scale and potential of a database, while love indicates its future growth potential.


Current Momentum: Love

PostgreSQL — The database developers love the most

Love or admiration is a measure of the percentage of users who are willing to continue using a technology, acting as an annual “retention rate” metric that reflects the user’s opinion and evaluation of the technology.

In 2023, PostgreSQL retained its title as the most loved database by developers. While Redis had been the favorite in previous years, PostgreSQL overtook Redis in 2022, becoming the top choice. PostgreSQL and Redis have maintained close reputation scores (around 70%), significantly outpacing other contenders.

In the 2022 PostgreSQL community survey, the majority of existing PostgreSQL users reported increased usage and deeper engagement, highlighting the stability of its core user base.

Redis, known for its simplicity and ease of use as a data structure cache server, is often paired with the relational database PostgreSQL, enjoying considerable popularity (20%, ranking sixth) among developers. Cross-analysis shows a strong connection between the two: 86% of Redis users are interested in using PostgreSQL, and 30% of PostgreSQL users want to use Redis. Other databases with positive reviews include SQLite, MongoDB, and SQL Server. MySQL and ElasticSearch receive mixed feedback, hovering around the 50% mark. The least favored databases include Access, IBM DB2, CouchDB, Couchbase, and Oracle.

Not all potential can be converted into kinetic energy. While user affection is significant, it doesn’t always translate into action, leading to the third metric of interest – demand.


PostgreSQL - The Most Wanted Database

The demand rate, or the level of desire, represents the percentage of users who will actually opt for a technology in the coming year. PostgreSQL stands out in demand/desire, significantly outpacing other databases with a 42.3% rate for the second consecutive year, showing relentless growth and widening the gap with its competitors.

In 2023, some databases saw notable demand increases, likely driven by the surge in large language model AI, spearheaded by OpenAI’s ChatGPT. This demand for intelligence has, in turn, fueled the need for robust data infrastructure. A decade ago, support for NoSQL features like JSONB/GIN laid the groundwork for PostgreSQL’s explosive growth during the internet boom. Today, the introduction of pgvector, the first vector extension built on a mature database, grants PostgreSQL a ticket into the AI era, setting the stage for growth in the next decade.


But Why?

PostgreSQL leads in demand, usage, and popularity, with the right mix of timing, location, and human support, making it arguably the most successful database with no visible challengers in the near future. The secret to its success lies in its slogan: “The World’s Most Advanced Open Source Relational Database.”

Relational databases are so prevalent and crucial that they might dwarf the combined significance of other types like key-value, document, search engine, time-series, graph, and vector databases. Typically, “database” implicitly refers to “relational database,” where no other category dares claim mainstream status. Last year’s “Why PostgreSQL Will Be the Most Successful Database?” delves into the competitive landscape of relational databases—a tripartite dominance. Excluding Microsoft’s relatively isolated SQL Server, the database scene, currently in a phase of consolidation, has three key players rooted in WireProtocol: Oracle, MySQL, and PostgreSQL, mirroring a “Three Kingdoms” saga in the relational database realm.

Oracle/MySQL are waning, while PostgreSQL is thriving. Oracle is an established commercial DB with deep tech history, rich features, and strong support, favored by well-funded, risk-averse enterprises, especially in finance. Yet, it’s pricey and infamous for litigious practices. MS SQL Server shares similar traits with Oracle. Commercial databases are facing a slow decline due to the open-source wave.

MySQL, popular yet beleaguered, lags in stringent transaction processing and data analysis compared to PostgreSQL. Its agile development approach is also outperformed by NoSQL alternatives. Oracle’s dominance, sibling rivalry with MariaDB, and competition from NewSQL players like TiDB/OB contribute to its decline.

Oracle, no doubt skilled, lacks integrity, hence “talented but unprincipled.” MySQL, despite its open-source merit, is limited in capability and sophistication, hence “limited talent, weak ethics.” PostgreSQL, embodying both capability and integrity, aligns with the open-source rise, popular demand, and advanced stability, epitomizing “talented and principled.”


Open Source & Advanced

The primary reasons for choosing PostgreSQL, as reflected in the TimescaleDB community survey, are its open-source nature and stability. Open-source implies free use, potential for modification, no vendor lock-in, and no “chokepoint” issues. Stability means reliable, consistent performance with a proven track record in large-scale production environments. Experienced developers value these attributes highly.

Broadly, aspects like extensibility, ecosystem, community, and protocols fall under “open-source.” Stability, ACID compliance, SQL support, scalability, and availability define “advanced.” These resonate with PostgreSQL’s slogan: “The world’s most advanced open source relational database.”

https://www.timescale.com/state-of-postgres/2022


The Virtue of Open Source

powered by developers worldwide. Friendly BSD license, thriving ecosystem, extensive expansion. A robust Oracle alternative, leading the charge.

What is “virtue”? It’s the manifestation of “the way,” and this way is open source. PostgreSQL stands as a venerable giant among open-source projects, epitomizing global collaborative success.

Back in the day, developing software/information services required exorbitantly priced commercial databases. Just the software licensing fees could hit six or seven figures, not to mention similar costs for hardware and service subscriptions. Oracle’s licensing fee per CPU core could reach hundreds of thousands annually, prompting even giants like Alibaba to seek IOE alternatives. The rise of open-source databases like PostgreSQL and MySQL offered a fresh choice.

Open-source databases, free of charge, spurred an industry revolution: from tens of thousands per core per month for commercial licenses to a mere 20 bucks per core per month for hardware. Databases became accessible to regular businesses, enabling the provision of free information services.

Open source has been monumental: the history of the internet is a history of open-source software. The prosperity of the IT industry and the plethora of free information services owe much to open-source initiatives. Open source represents a form of successful Communism in software, with the industry’s core means of production becoming communal property, available to developers worldwide as needed. Developers contribute according to their abilities, embracing the ethos of mutual benefit.

An open-source programmer’s work encapsulates the intellect of countless top-tier developers. Programmers command high salaries because they are not mere laborers but contractors orchestrating software and hardware. They own the core means of production: software from the public domain and readily available server hardware. Thus, a few skilled engineers can swiftly tackle domain-specific problems leveraging the open-source ecosystem.

Open source synergizes community efforts, drastically reducing redundancy and propelling technical advancements at an astonishing pace. Its momentum, now unstoppable, continues to grow like a snowball. Open source dominates foundational software, and the industry now views insular development or so-called “self-reliance” in software, especially in foundational aspects, as a colossal joke.

For PostgreSQL, open source is its strongest asset against Oracle.

Oracle is advanced, but PostgreSQL holds its own. It’s the most Oracle-compatible open-source database, natively supporting 85% of Oracle’s features, with specialized distributions reaching 96% compatibility. However, the real game-changer is cost: PG’s open-source nature and significant cost advantage provide a substantial ecological niche. It doesn’t need to surpass Oracle in features; being “90% right at a fraction of the cost” is enough to outcompete Oracle.

PostgreSQL is like an open-source “Oracle,” the only real threat to Oracle’s dominance. As a leader in the “de-Oracle” movement, PG has spawned numerous “domestically controllable” database companies. According to CITIC, 36% of “domestic databases” are based on PG modifications or rebranding, with Huawei’s openGauss and GaussDB as prime examples. Crucially, PostgreSQL uses a BSD-Like license, permitting such adaptations — you can rebrand and sell without deceit. This open attitude is something Oracle-acquired, GPL-licensed MySQL can’t match.


The advanced in Talent

The talent of PG lies in its advancement. Specializing in multiple areas, PostgreSQL offers a full-stack, multi-model approach: “Self-managed, autonomous driving temporal-geospatial AI vector distributed document graph with full-text search, programmable hyper-converged, federated stream-batch processing in a single HTAP Serverless full-stack platform database”, covering almost all database needs with a single component.

PostgreSQL is not just a traditional OLTP “relational database” but a multi-modal database. For SMEs, a single PostgreSQL component can cover the vast majority of their data needs: OLTP, OLAP, time-series, GIS, tokenization and full-text search, JSON/XML documents, NoSQL features, graphs, vectors, and more.

Emperor of Databases — Self-managed, autonomous driving temporal-geospatial AI vector distributed document graph with full-text search, programmable hyper-converged, federated stream-batch processing in a single HTAP Serverless full-stack platform database.

The superiority of PostgreSQL is not only in its acclaimed kernel stability but also in its powerful extensibility. The plugin system transforms PostgreSQL from a single-threaded evolving database kernel to a platform with countless parallel-evolving extensions, exploring all possibilities simultaneously like quantum computing. PostgreSQL is omnipresent in every niche of data processing.

For instance, PostGIS for geospatial databases, TimescaleDB for time-series, Citus for distributed/columnar/HTAP databases, PGVector for AI vector databases, AGE for graph databases, PipelineDB for stream processing, and the ultimate trick — using Foreign Data Wrappers (FDW) for unified SQL access to all heterogeneous external databases. Thus, PG is a true full-stack database platform, far more advanced than a simple OLTP system like MySQL.

Within a significant scale, PostgreSQL can play multiple roles with a single component, greatly reducing project complexity and cost. Remember, designing for unneeded scale is futile and an example of premature optimization. If one technology can meet all needs, it’s the best choice rather than reimplementing it with multiple components.

Taking Tantan as an example, with 250 million TPS and 200 TB of unique TP data, a single PostgreSQL selection remains stable and reliable, covering a wide range of functions beyond its primary OLTP role, including caching, OLAP, batch processing, and even message queuing. However, as the user base approaches tens of millions daily active users, these additional functions will eventually need to be handled by dedicated components.

PostgreSQL’s advancement is also evident in its thriving ecosystem. Centered around the database kernel, there are specialized variants and “higher-level databases” built on it, like Greenplum, Supabase (an open-source alternative to Firebase), and the specialized graph database edgedb, among others. There are various open-source/commercial/cloud distributions integrating tools, like different RDS versions and the plug-and-play Pigsty; horizontally, there are even powerful mimetic components/versions emulating other databases without changing client drivers, like babelfish for SQL Server, FerretDB for MongoDB, and EnterpriseDB/IvorySQL for Oracle compatibility.

PostgreSQL’s advanced features are its core competitive strength against MySQL, another open-source relational database.

Advancement is PostgreSQL’s core competitive edge over MySQL.

MySQL’s slogan is “the world’s most popular open-source relational database,” characterized by being rough, fierce, and fast, catering to internet companies. These companies prioritize simplicity (mainly CRUD), data consistency and accuracy less than traditional sectors like banking, and can tolerate data inaccuracies over service downtime, unlike industries that cannot afford financial discrepancies.

However, times change, and PostgreSQL has rapidly advanced, surpassing MySQL in speed and robustness, leaving only “roughness” as MySQL’s remaining trait.

MySQL allows partial transaction commits by default, shocked

MySQL allows partial transaction commits by default, revealing a gap between “popular” and “advanced.” Popularity fades with obsolescence, while advancement gains popularity through innovation. In times of change, without advanced features, popularity is fleeting. Research shows MySQL’s pride in “popularity” cannot stand against PostgreSQL’s “advanced” superiority.

Advancement and open-source are PostgreSQL’s success secrets. While Oracle is advanced and MySQL is open-source, PostgreSQL boasts both. With the right conditions, success is inevitable.


Looking Ahead

The PostgreSQL database kernel’s role in the database ecosystem mirrors the Linux kernel’s in the operating system domain. For databases, particularly OLTP, the battle of kernels has settled—PostgreSQL is now a perfect engine.

However, users need more than an engine; they need the complete car, driving capabilities, and traffic services. The database competition has shifted from software to Software enabled Service—complete database distributions and services. The race for PostgreSQL-based distributions is just beginning. Who will be the PostgreSQL equivalent of Debian, RedHat, or Ubuntu?

This is why we created Pigsty — to develop an battery-included, open-source, local-first PostgreSQL distribution, making it easy for everyone to access and utilize a quality database service. Due to space limits, the detailed story is for another time.


参考阅读

2022-08 《PostgreSQL 到底有多强?

2022-07 《为什么PostgreSQL是最成功的数据库?

2022-06 《StackOverflow 2022数据库年度调查

2021-05 《Why PostgreSQL Rocks!

2021-05 《为什么说PostgreSQL前途无量?

2018 《PostgreSQL 好处都有啥?

2023 《更好的开源RDS替代:Pigsty

2023 《StackOverflow 7年调研数据跟踪

2022 《PostgreSQL 社区状态调查报告 2022