Peter Bengtsson: Django ORM optimization story on selecting the least possible

planet postgresql - 2019-02-23(土) 03:49:29

This is an optimization story that should not surprise anyone using the Django ORM, but I thought I'd share it because I have numbers now! It came from a real requirement: for a given parent model, extract the value of the name column of all its child models, then turn all those name strings into one MD5 checksum string.


The first attempt looked like this:

artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

The SQL used to generate this is as follows:

SELECT "main_song"."id", "main_song"."artist_id", "main_song"."name", "main_song"."text", "main_song"."language", "main_song"."key_phrases", "main_song"."popularity", "main_song"."text_length", "main_song"."metadata", "main_song"."created", "main_song"."modified", "main_song"."has_lastfm_listeners", "main_song"."has_spotify_popularity" FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Clearly, I don't need anything but the name column. Version 2:

artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist).only("name"):
    names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

Now, the SQL used is:

SELECT "main_song"."id", "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;

But still, since I don't really need instances of the model class Song, I can use the .values() method, which gives back a list of dictionaries. This is version 3:

names = []
for song in Song.objects.filter(artist=artist).values("name"):
    names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

This time Django figures it doesn't even need the primary key value, so it looks like this:

SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;

Last but not least, there is an even faster one: values_list(). This time it doesn't even bother to ma
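A sketch of the shared checksum logic, with the values_list() variant shown in a comment (model and field names as in the versions above; the plain-list stand-in is just for illustration):

```python
import hashlib

def names_checksum(names):
    # Concatenate the name strings and reduce them to one MD5 hex
    # digest -- the step every version of the query above ends with.
    return hashlib.md5("".join(names).encode("utf-8")).hexdigest()

# With values_list("name", flat=True) the queryset yields plain strings
# instead of model instances or dicts, so the loop disappears entirely:
#   names = Song.objects.filter(artist=artist).values_list("name", flat=True)
#   return names_checksum(names)

print(names_checksum(["Infected", "Sorrow", "Generator"]))
```

Note the checksum only depends on the concatenated string, so every version of the query yields the same digest.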

カテゴリー: postgresql

Avinash Kumar: PostgreSQL fsync Failure Fixed – Minor Versions Released Feb 14, 2019

planet postgresql - 2019-02-22(金) 22:47:27

In case you didn’t already see this news, PostgreSQL has got its first minor version release for 2019. This includes minor version updates for all supported PostgreSQL versions. We indicated in our previous blog post that PostgreSQL 9.3 had gone EOL and would not receive any more updates. This release includes the following PostgreSQL major versions:

What’s new in this release?

One of the common fixes applied to all the supported PostgreSQL versions is this one: panic instead of retrying after fsync() failure. This fsync failure behaviour has been under discussion for a year or two now, so let’s take a look at the implications.

A fix to the Linux fsync issue for PostgreSQL Buffered IO in all supported versions

PostgreSQL performs two types of IO. Direct IO – though almost never – and the much more commonly performed Buffered IO.

PostgreSQL uses O_DIRECT when writing to WALs (Write-Ahead Logs, aka Transaction Logs) only when wal_sync_method is set to open_datasync, or to open_sync with no archiving or streaming enabled. The default wal_sync_method may be fdatasync, which does not use O_DIRECT. This means that, almost all the time on your production database server, you’ll see PostgreSQL using O_SYNC / O_DSYNC while writing to WALs. Writing the modified/dirty buffers from shared buffers to datafiles, on the other hand, is always done through Buffered IO. Let’s understand this further.

Upon checkpoint, dirty buffers in shared buffers are written to the page cache managed by the kernel. Through an fsync(), these modified blocks are applied to disk. If an fsync() call is successful, all dirty pages from the corresponding file are guaranteed to be persisted on the disk. But if an fsync() call fails, PostgreSQL cannot be sure it still has a copy of the modified/dirty pages, because writes from the page cache to storage are managed entirely by the kernel, not by PostgreSQL.
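The write-then-fsync sequence the paragraph describes can be sketched in a few lines of Python (a toy illustration of buffered IO, not PostgreSQL's actual checkpointer code):

```python
import os
import tempfile

# Buffered IO: write() only dirties pages in the kernel page cache; the
# data is not guaranteed to be on disk until fsync() succeeds.
path = os.path.join(tempfile.mkdtemp(), "datafile")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"dirty page contents")
try:
    os.fsync(fd)  # ask the kernel to persist the cached pages
except OSError:
    # If fsync() fails, the kernel may already have dropped the dirty
    # pages, so retrying cannot recover them -- which is why the fix
    # makes PostgreSQL panic instead of retrying.
    raise
finally:
    os.close(fd)
print(os.path.getsize(path))  # 19
```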

This could still be fine if

カテゴリー: postgresql

Paul Ramsey: Proj6 in PostGIS

planet postgresql - 2019-02-22(金) 22:00:00

Map projection is a core feature of any spatial database, taking coordinates from one coordinate system and converting them to another, and PostGIS has depended on the Proj library for coordinate reprojection support for many years.

For most of those years, the Proj library has been extremely slow moving. New projection systems might be added from time to time, and some bugs fixed, but in general it was easy to ignore. How slow was development? So slow that the version number migrated into the name, and everyone just called it “Proj4”.

No more.

Starting a couple years ago, new developers started migrating into the project, and the pace of development picked up. Proj 5 in 2018 dramatically improved the plumbing in the difficult area of geodetic transformation, and promised to begin changing the API. Only a year later, here is Proj 6, with yet more huge infrastructural improvements, and the new API.

Some of this new work was funded via the GDALBarn project, so thanks go out to those sponsors who invested in this incredibly foundational library, and to GDAL maintainer Even Rouault.

For PostGIS, that means we have to accommodate ourselves to the new API. Doing so not only makes it easier to track future releases, but also gains us access to the fancy new plumbing in Proj.

For example, Proj 6 provides:

Late-binding coordinate operation capabilities, that takes metadata such as area of use and accuracy into account… This can avoid in a number of situations the past requirement of using WGS84 as a pivot system, which could cause unneeded accuracy loss.

Or, put another way: more accurate results for reprojections that involve datum shifts.

Here’s a simple example, converting from an old NAD27/NGVD29 3D coordinate with height in feet, to a new NAD83/NAVD88 coordinate with height in metres.

SELECT ST_AsText(
         ST_Transform(
           ST_SetSRID(geometry('POINT(-100 40 100)'), 7406),
           5500));

Note that the height in NGVD29 is 100 feet; converted directly to metres, that would be 30.48 metres. The transformed po
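The direct unit conversion mentioned above can be checked with one line of arithmetic (plain Python, just the feet-to-metres factor; the datum-aware ST_Transform result will differ):

```python
# 1 international foot is exactly 0.3048 m, so a naive unit conversion
# of the 100 ft NGVD29 height gives:
feet = 100
metres = round(feet * 0.3048, 2)
print(metres)  # 30.48
```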

カテゴリー: postgresql

Craig Kerstiens: Thinking in MapReduce, but with SQL

planet postgresql - 2019-02-22(金) 02:44:00

For those considering Citus, if your use case seems like a good fit, we often are willing to spend some time with you to help you get an understanding of the Citus database and what type of performance it can deliver. We commonly do this in a roughly two hour pairing session with one of our engineers. We’ll talk through the schema, load up some data, and run some queries. If we have time at the end it is always fun to load up the same data and queries into single node Postgres and see how we compare. After seeing this for years, I still enjoy seeing performance speed ups of 10 and 20x over a single node database, and in cases as high as 100x.

And the best part is it didn’t take heavy re-architecting of data pipelines. All it takes is just some data modeling, and parallelization with Citus.

The first step is sharding

We’ve talked about this before, but the first key to these performance gains is that Citus splits up your data under the covers into smaller, more manageable pieces. These shards (which are standard Postgres tables) are spread across multiple physical nodes. This means that you have more collective horsepower within your system to pull from. When you’re targeting a single shard it is pretty simple: the query is re-routed to the underlying data and once it gets results it returns them.

Thinking in MapReduce

MapReduce has been around for a number of years now, and was popularized by Hadoop. The thing about large scale data is that, in order to get timely answers from it, you need to divide up the problem and operate in parallel. Or you find an extremely fast system. The problem with getting a bigger and faster box is that data growth is outpacing hardware improvements.

MapReduce itself is a framework for splitting up data, shuffling the data to nodes as needed, and then performing the work on a subset of data before recombining for the result. Let’s take an example like counting up total pageviews. If we wanted to leverage MapReduce on this, we would split the pageviews into 4 separate buckets. We could do th
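The pageview example can be sketched in plain Python, with lists standing in for nodes (the URLs and bucket contents are made up for illustration):

```python
from collections import Counter

# Pageview log split into 4 buckets, as if sharded across 4 nodes.
buckets = [
    ["/home", "/about", "/home"],
    ["/home", "/pricing"],
    ["/about", "/home"],
    ["/pricing"],
]

# Map: each "node" counts its own subset independently.
partials = [Counter(bucket) for bucket in buckets]

# Reduce: recombine the partial counts into the final answer.
total = Counter()
for p in partials:
    total += p

print(total["/home"], total["/about"], total["/pricing"])  # 4 2 2
```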

カテゴリー: postgresql

Andrew Staller: If PostgreSQL is the fastest growing database, then why is the community so small?

planet postgresql - 2019-02-22(金) 02:02:45

The database king continues its reign. For the second year in a row, PostgreSQL is still the fastest growing DBMS.

By comparison, in 2018 MongoDB was the second fastest growing, while Oracle, MySQL, and SQL Server all shrank in popularity.

For those who stay on top of news from database land, this should come as no surprise, given the number of PostgreSQL success stories that have been published recently:

Let’s all pat ourselves on the back, shall we? Not quite yet.

The PostgreSQL community by the numbers

As the popularity of PostgreSQL grows, attendance at community events remains small. This is the case even as more and more organizations and developers embrace PostgreSQL, so from our perspective, there seems to be a discrepancy between the size of the Postgres user base and that of the Postgres community.

The two main PostgreSQL community conferences are Postgres Conference (US) and PGConf EU. Below is a graph of Postgres Conference attendance for the last 5 years, with a projection for the Postgres Conference 2019 event occurring in March.

Last year, PGConf EU had around 500 attendees, a 100% increase since 4 years ago.

Combined, that’s about 1,100 attendees for the two largest conferences within the PostgreSQL community. By comparison, Oracle OpenWorld has about 60,000 attendees. Even MongoDB World had over 2,000 attendees in 2018.

We fully recognize that attendance at Postgres community events will be a portion of the user base. We also really enjoy these events and applaud the organizers for the effort they invest in running them. And in-person events may indeed be a lagging indicator of a system's growth in popularity. Let’s just grow faster!

One relatively new gathering point for the Postgres community is the P

カテゴリー: postgresql

Nickolay Ihalainen: Parallel queries in PostgreSQL

planet postgresql - 2019-02-21(木) 23:05:21

Modern CPU models have a huge number of cores. For many years, applications have been sending queries to databases in parallel. For reporting queries that deal with many table rows, the ability for a single query to use multiple CPUs helps deliver faster execution. Parallel queries in PostgreSQL make this possible: starting from PostgreSQL 9.6, a report query is able to use many CPUs and finish faster.

The initial implementation of the parallel queries execution took three years. Parallel support requires code changes in many query execution stages. PostgreSQL 9.6 created an infrastructure for further code improvements. Later versions extended parallel execution support for other query types.

  • Do not enable parallel executions if all CPU cores are already saturated. Parallel execution steals CPU time from other queries, and increases response time.
  • Most importantly, parallel processing significantly increases memory usage with high WORK_MEM values, as each hash join or sort operation takes a work_mem amount of memory.
  • Next, low latency OLTP queries can’t be made any faster with parallel execution. In particular, queries that return a single row can perform badly when parallel execution is enabled.
  • The Pierian spring for developers is a TPC-H benchmark. Check if you have similar queries for the best parallel execution.
  • Parallel execution supports only SELECT queries without lock predicates.
  • Proper indexing might be a better alternative to a parallel sequential table scan.
  • There is no support for cursors or suspended queries.
  • Windowed functions and ordered-set aggregate functions are non-parallel.
  • There is no benefit for an IO-bound workload.
  • There are no parallel sort algorithms. However, queries with sorts still can be parallel in some aspects.
  • Replace CTE (WITH …) with a sub-select to support parallel execution.
  • Foreign data wrappers do not currently support parallel execution (but they c
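By way of analogy, the scatter/gather shape of a parallel aggregate can be sketched in Python, with threads standing in for PostgreSQL's background workers (a toy illustration, not how the executor is implemented):

```python
from concurrent.futures import ThreadPoolExecutor

# Four "parallel workers" each aggregate their own slice of the rows,
# then a final step gathers and combines the partial aggregates.
def partial_sum(chunk):
    return sum(chunk)

rows = range(1_000_000)
chunks = [rows[i::4] for i in range(4)]              # split the "table"
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))   # scatter
total = sum(partials)                                # gather / finalize
print(total)  # 499999500000
```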
カテゴリー: postgresql

A few books read in 2018 (Quelques livres lus en 2018)

planet PHP - 2019-02-21(木) 09:00:00
Two months late, I've gone through the books I read in 2018. I've already recommended several of them to people around me and, while going back through my reading from the last few months, I spotted others I never mentioned but which were just as interesting. So here are a few books I read in 2018 and that I recommend. Why We Sleep — Matthew Walker. To start, Why We Sleep is a very interesting book about sleep (I wrote an article on the subject a while ago, incidentally).
カテゴリー: php

Bruce Momjian: Trusted and Untrusted Languages

planet postgresql - 2019-02-21(木) 05:00:01

Postgres supports two types of server-side languages, trusted and untrusted. Trusted languages are available for all users because they have safe sandboxes that limit user access. Untrusted languages are only available to superusers because they lack sandboxes.

Some languages have only trusted versions, e.g., PL/pgSQL. Others have only untrusted ones, e.g., PL/Python. Other languages like Perl have both.

Why would you want to have both trusted and untrusted languages available? Well, trusted languages like PL/Perl limit access to only safe resources, while untrusted languages like PL/PerlU allow access to file system and network resources that would be unsafe for non-superusers, i.e., it would effectively give them the same power as superusers. This is why only superusers can use untrusted languages.

Continue Reading »

カテゴリー: postgresql

Paul Ramsey: Upgrading PostGIS on Centos 7

planet postgresql - 2019-02-21(木) 01:38:04

New features and better performance get a lot of attention, but one of the relatively unsung improvements in PostGIS over the past ten years has been inclusion in standard software repositories, making installation of this fairly complex extension a "one click" affair.

Once you've got PostgreSQL/PostGIS installed though, how are upgrades handled? The key is having the right versions in place, at the right time, for the right scenario and knowing a little bit about how PostGIS works.

カテゴリー: postgresql

It depends

planet PHP - 2019-02-20(水) 22:00:00

When I was younger, I had strong opinions about many subjects. I felt I was right about a great many things, and anyone who disagreed with me was wrong. In my mind there was a right or a wrong, a black and a white, with little room for grey. Others were certainly entitled to their […]

The post It depends appeared first on BrandonSavage.net.

カテゴリー: php

Kaarel Moppel: Looking at MySQL 8 with PostgreSQL goggles on

planet postgresql - 2019-02-20(水) 18:00:55

First off – not trying to kindle any flame wars here, just trying to broaden my (your) horizons a bit, gather some ideas (maybe I'm missing out on something cool; it's the most used Open Source RDBMS after all) and to somewhat compare the two, despite that being a difficult thing to do correctly / objectively. Also, I'm leaving performance comparisons aside and looking at just the available features, general querying experience and documentation clarity, as this is, I guess, most important for beginners. So, just a list of points I made for myself, grouped in no particular order.

Disclaimer: last time I used MySQL for some personal project it was 10 years ago, so basically I’m starting from zero and only took one and a half days to get to know it – thus if you see that I’ve gotten something screamingly wrong then please do leave a comment and I’ll change it. Also, my bias in this article probably tends to favour Postgres…but I’m pretty sure a MySQL veteran with good knowledge of pros and cons can write up something similar also on Postgres, so my hope is that you can leave this aside and learn a thing or two about either system.

To run MySQL I used the official Docker image, 8.0.14. Wherever MySQL is mentioned below, the default InnoDB engine is meant.

docker run --rm -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root mysql:8

“mysql” CLI (“psql” equivalent) and general querying experience

* When the server requires a password, why doesn’t it just ask for it?

mysql -h -u root  # adding '-p' will fix the error
ERROR 1045 (28000): Access denied for user 'root'@'' (using password: NO)

* Very poor tab-completion compared to “psql”. Using “mycli” instead makes much sense. I’m myself 99% of time on CLI-s, so it’s essential.
* Lot less shortcut helpers to list tables, views, functions, etc…
* Can’t set to “extended output” (columns as rows) permanently, only “auto” and “per query”.
* One does not need to specify a DB to connect to – I find it positive actually, as it’s easy to forget those database names and, once in, one can call “sh

カテゴリー: postgresql

Stefan Fercot: Monitor pgBackRest backups with Nagios

planet postgresql - 2019-02-20(水) 09:00:00

pgBackRest is a well-known powerful backup and restore tool.

Relying on the status information given by the “info” command, we’ve built a specific plugin for Nagios: check_pgbackrest.

This post will help you discover this plugin, and assumes you already know pgBackRest and Nagios.

Let’s assume we have a PostgreSQL cluster with pgBackRest working correctly.

Given this simple configuration:

[global]
repo1-path=/some_shared_space/
repo1-retention-full=2

[mystanza]
pg1-path=/var/lib/pgsql/11/data

Let’s get the status of our backups with the pgbackrest info command:

stanza: mystanza
    status: ok
    cipher: none

    db (current)
        wal archive min/max (11-1): 00000001000000040000003C/000000010000000B0000004E

        full backup: 20190219-121527F
            timestamp start/stop: 2019-02-19 12:15:27 / 2019-02-19 12:18:15
            wal start/stop: 00000001000000040000003C / 000000010000000400000080
            database size: 3.0GB, backup size: 3.0GB
            repository size: 168.5MB, repository backup size: 168.5MB

        incr backup: 20190219-121527F_20190219-121815I
            timestamp start/stop: 2019-02-19 12:18:15 / 2019-02-19 12:20:38
            wal start/stop: 000000010000000400000082 / 0000000100000004000000B8
            database size: 3.0GB, backup size: 2.9GB
            repository size: 175.2MB, repository backup size: 171.6MB
            backup reference list: 20190219-121527F

        incr backup: 20190219-121527F_20190219-122039I
            timestamp start/stop: 2019-02-19 12:20:39 / 2019-02-19 12:22:55
            wal start/stop: 0000000100000004000000C1 / 0000000100000004000000F4
            database size: 3.0GB, backup size: 3.0GB
            repository size: 180.9MB, repository backup size: 177.3MB
            backup reference list: 20190219-121527F, 20190219-121527F_20190219-121815I

        full backup: 20190219-122255F
            timestamp start/stop: 2019-02-19 12:22:55 / 2019-02-19 12:25:47
            wal start/stop: 000000010000000500000000 / 00000001000000050000003[...]
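To give the flavour of what such a plugin does, here is a hypothetical sketch of one retention check with Nagios exit codes; the real check_pgbackrest parses the actual pgbackrest info output and offers far more checks than this:

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_retention(full_backup_count, required=2):
    # Mirrors repo1-retention-full=2 above: warn when fewer full
    # backups exist than the retention policy promises, go critical
    # when there is none at all.
    if full_backup_count >= required:
        return OK, f"{full_backup_count} full backup(s) found"
    if full_backup_count > 0:
        return WARNING, "fewer full backups than the retention target"
    return CRITICAL, "no full backup found"

# The info output above lists two full backups:
status, message = check_retention(full_backup_count=2)
print(status, message)  # 0 2 full backup(s) found
```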
カテゴリー: postgresql

Hubert 'depesz' Lubaczewski: why-upgrade updates

planet postgresql - 2019-02-20(水) 05:00:55
Recent change in layout of PG Docs broke my spider for why-upgrade.depesz.com. Today got some time and decided to bite the bullet. Fixed spider code, used it to get new changelog, and while I was at it, did couple of slight modifications of the site: display count of all changes that are there in given … Continue reading "why-upgrade updates"
カテゴリー: postgresql

412 Precondition Failed

planet PHP - 2019-02-20(水) 00:00:00

In HTTP it’s possible to do conditional requests. These are requests that only execute if the right conditions are met.

For GET requests, this might be done to only retrieve the resource if it has changed. For those cases 304 Not Modified is returned.

For other cases, 412 Precondition Failed is returned.


This client only wants the PUT request to succeed if the resource doesn’t already exist:

PUT /foo/new-article.md HTTP/1.1
Content-Type: text/markdown
If-None-Match: *

This request is an update, and it should only succeed if the article hasn’t changed since last time:

PUT /foo/old-article.md HTTP/1.1
If-Match: "1345-12315"
Content-Type: text/markdown

If the condition didn’t pass, it returns:

HTTP/1.1 412 Precondition Failed
Content-Type: text/plain

The article you're trying to update has changed since you last saw it.

One great advantage of this is that it prevents updates being silently overwritten when multiple people write to the same resource, also known as the ‘lost update’ problem.
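The server-side decision can be sketched as a small function (a hypothetical helper, not from any particular framework):

```python
# Given the stored ETag of a resource (None if it does not exist) and
# the conditional headers of a PUT, decide whether to proceed or
# answer 412 Precondition Failed.
def evaluate_put(stored_etag, if_match=None, if_none_match=None):
    if if_none_match == "*" and stored_etag is not None:
        return 412   # "create only": the resource already exists
    if if_match is not None and if_match != stored_etag:
        return 412   # resource changed since the client's copy
    return 204       # preconditions hold, perform the update

assert evaluate_put(None, if_none_match="*") == 204            # fresh create
assert evaluate_put('"1345-12315"', if_none_match="*") == 412  # already exists
assert evaluate_put('"1345-12315"', if_match='"1345-12315"') == 204
assert evaluate_put('"4444-12345"', if_match='"1345-12315"') == 412
```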

Using the Prefer header, it’s possible for a client to get the current state of the resource, in case the local copy was outdated. This saves a GET request.

PUT /foo/old-article.md HTTP/1.1
If-Match: "1345-12315"
Content-Type: text/markdown
Prefer: return=representation

### Article version 2.1

HTTP/1.1 412 Precondition Failed
Content-Type: text/markdown
Etag: "4444-12345"
Vary: Prefer

### Article version 3.0

This is useful, but it should probably have been designed with a HTTP/2 Push message instead. Nevertheless, there’s no harm in adopting this for legacy HTTP/1.1 systems.

カテゴリー: php

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 12 – Allow user control of CTE materialization, and change the default behavior.

planet postgresql - 2019-02-19(火) 22:25:15
On 16th of February 2019, Tom Lane committed patch: Allow user control of CTE materialization, and change the default behavior.   Historically we've always materialized the full output of a CTE query, treating WITH as an optimization fence (so that, for example, restrictions from the outer query cannot be pushed into it). This is appropriate … Continue reading "Waiting for PostgreSQL 12 – Allow user control of CTE materialization, and change the default behavior."
カテゴリー: postgresql

Gabriele Bartolini: Geo-redundancy of PostgreSQL database backups with Barman

planet postgresql - 2019-02-19(火) 21:13:30

Barman 2.6 introduces support for geo-redundancy, meaning that Barman can now copy from another Barman instance, not just a PostgreSQL database.

Geographic redundancy (or simply geo-redundancy) is a property of a system that replicates data from one site (primary) to a geographically distant location as redundancy, in case the primary system becomes inaccessible or is lost. From version 2.6, it is possible to configure Barman so that the primary source of backups for a server can also be another Barman instance, not just a PostgreSQL database server as before.

Briefly, you can define a server in your Barman instance (passive, according to the new Barman lingo), and map it to a server defined in another Barman instance (primary).

All you need is an SSH connection between the Barman user in the primary server and the passive one. Barman will then use the rsync copy method to synchronise itself with the origin server, copying both backups and related WAL files, in an asynchronous way. Because Barman shares the same rsync method, geo-redundancy can benefit from key features such as parallel copy and network compression. Incremental backup will be included in future releases.

Geo-redundancy is based on just one configuration option: primary_ssh_command.

Our existing scenario

To explain how geo-redundancy works, we will use the following example scenario. We keep it very simple for now.

We have two identical data centres, one in Europe and one in the US, each with a PostgreSQL database server and a Barman server:

  • Europe:
    • PostgreSQL server: eric
    • Barman server: jeff, backing up the database server hosted on eric
  • US:
    • PostgreSQL server: duane
    • Barman server: gregg, backing up the database server hosted on duane

Let’s have a look at how jeff is configured to backup eric, by reading the content of the /etc/barman.d/eric.conf file:

[eric]
description = Main European PostgreSQL server 'Eric'
conninfo = user=barman-jeff dbname=postgres host=eric
ssh_command = ssh postgres@eric
backup_method = rsync
parallel_jobs = 4
r[...]
カテゴリー: postgresql

Bruce Momjian: Order of SELECT Clause Execution

planet postgresql - 2019-02-19(火) 01:30:01

SQL is a declarative language, meaning you specify what you want, rather than how to generate what you want. This leads to a natural language syntax, like the SELECT command. However, once you dig into the behavior of SELECT, it becomes clear that it is necessary to understand the order in which SELECT clauses are executed to take full advantage of the command.

I was going to write up a list of the clause execution ordering, but found this webpage that does a better job of describing it than I could. The ordering bounces from the middle clause (FROM) to the bottom to the top, and then the bottom again. It is hard to remember the ordering, but memorizing it does help in constructing complex SELECT queries.
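The ordering can be made concrete by mimicking it over plain rows; a Python sketch of FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT (the table and column names are made up):

```python
from itertools import groupby

rows = [                                   # FROM (a made-up songs table)
    {"artist_id": 1, "length": 120},
    {"artist_id": 1, "length": 180},
    {"artist_id": 2, "length": 90},
    {"artist_id": 2, "length": 30},
    {"artist_id": 3, "length": 300},
]
rows = [r for r in rows if r["length"] > 40]              # WHERE length > 40
rows.sort(key=lambda r: r["artist_id"])                   # (groupby needs sorted input)
groups = [(k, [r["length"] for r in g])                   # GROUP BY artist_id
          for k, g in groupby(rows, key=lambda r: r["artist_id"])]
groups = [(k, ls) for k, ls in groups if sum(ls) >= 100]  # HAVING sum(length) >= 100
result = [{"artist_id": k, "total": sum(ls)} for k, ls in groups]  # SELECT
result.sort(key=lambda r: r["total"], reverse=True)       # ORDER BY total DESC
result = result[:2]                                       # LIMIT 2
print(result)
```

Working through it this way shows why, for example, a column alias defined in SELECT cannot be referenced in WHERE: the filter runs first.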

カテゴリー: postgresql

Robert Haas: Tuning autovacuum_naptime

planet postgresql - 2019-02-19(火) 00:23:00
One of the things I sometimes get asked to do is review someone's postgresql.conf settings.  From time to time, I run across a configuration where the value of autovacuum_naptime has been increased, often by a large multiple.  The default value of autovacuum_naptime is 1 minute, and I have seen users increase this value to 1 hour, or in one case, 1 day.  This is not a good idea.  In this blog post, I will attempt to explain why it's not a good idea, and also something about the limited circumstances under which you might want to change autovacuum_naptime.

Read more »
カテゴリー: postgresql

Jonathan Katz: WITH Queries: Present & Future

planet postgresql - 2019-02-18(月) 08:46:00

Common table expressions, aka CTEs, aka WITH queries, are not only the gateway to writing recursive SQL queries, but also help developers write maintainable SQL. WITH query clauses can help developers who are more comfortable writing in imperative languages to feel more comfortable writing SQL, as well as help reduce writing redundant code by reusing a particular common table expression multiple times in a query.

A new patch, scheduled to be a part of the PostgreSQL 12 major release later in the year, introduces the ability, under certain conditions, to inline common table expressions within a query. This is a huge feature: many developers could suddenly see their existing queries speed up significantly, and they gain the ability to explicitly specify when to inline a CTE (i.e. the planner "substitutes" the CTE's definition into the main query and can then optimize further) or, conversely, materialize it (i.e. place the CTE result into memory, but lose out on certain planning & execution optimizations).

But why is this a big deal? Before we look into the future, first let's understand how WITH queries currently work in PostgreSQL.
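A loose Python analogy for the difference (an illustration only, not how the planner works): a materialized CTE builds its whole intermediate result before the outer query filters it, while inlining lets the outer restriction apply during the scan, so no intermediate result is ever stored:

```python
# "Materialized" (the old, always-fenced behaviour): compute the whole
# intermediate result first, then let the outer query filter it.
doubled_all = [x * 2 for x in range(1_000_000)]    # big intermediate list
materialized = [y for y in doubled_all if y < 10]

# "Inlined": the outer restriction reaches inside the CTE, so the
# large intermediate list is never built.
inlined = [x * 2 for x in range(1_000_000) if x * 2 < 10]

print(materialized == inlined, inlined)  # True [0, 2, 4, 6, 8]
```

Both produce the same answer; only the amount of intermediate work differs, which is exactly the trade-off the new inline/materialize choice exposes.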

カテゴリー: postgresql