I thought of a way to measure this in a container with some test data:
podman run --name test_postgres --rm -e POSTGRES_PASSWORD=example --mount type=bind,source=init_db.sql,destination=/docker-entrypoint-initdb.d/init_db.sql docker.io/postgres
CREATE TABLE A (
    id INTEGER PRIMARY KEY,
    val1 VARCHAR(32),
    val2 VARCHAR(32)
);
CREATE TABLE B (
    id INTEGER,
    rel INTEGER REFERENCES A(id),
    val VARCHAR(32),
    PRIMARY KEY (rel, id)
);
-- Use random_string from https://stackoverflow.com/a/3972983/2256700
-- [...]
INSERT INTO A
SELECT s.a, random_string(32), random_string(32) FROM generate_series(1,100) as s(a);
INSERT INTO B
SELECT s.a, A.id, random_string(32) FROM generate_series(1,1000) as s(a), A;
This populates A with 100 rows and B with 1000 × 100 = 100,000 rows (every value of the generate_series paired with every row of A).
Then I installed strace in the container with apt-get update && apt-get install strace
to find out how much data is transferred over the socket:
$ strace -e %net -o single_command_strace.out \
psql -U postgres -c "SELECT * FROM A INNER JOIN B ON A.id = B.rel;" > single_command_result.txt
$ # summing the last number of all the `recvfrom` lines:
$ cat single_command_strace.out | grep '^recvfrom' | grep -o -E '[[:digit:]]+$' | awk '{s+=$1} END {print s}'
13373905
$ strace -e %net -o two_command_strace.out \
psql -U postgres -c "SELECT * FROM A; SELECT * FROM B;" > two_command_result.txt
$ cat two_command_strace.out | grep '^recvfrom' | grep -o -E '[[:digit:]]+$' | awk '{s+=$1} END {print s}'
5590420
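As a sanity check that the summing pipeline extracts the right field, here it is run against a small fabricated strace excerpt (the byte counts below are made up, not real measurements):

```shell
# Fabricated strace -e %net output; only the trailing "= N" return
# values of the recvfrom lines should be summed (sendto is ignored).
cat > sample_strace.out <<'EOF'
recvfrom(3, "T\0\0\0"..., 16384, 0, NULL, NULL) = 100
recvfrom(3, "D\0\0\0"..., 16384, 0, NULL, NULL) = 250
sendto(3, "Q\0\0\0"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
EOF
grep '^recvfrom' sample_strace.out | grep -o -E '[[:digit:]]+$' | awk '{s+=$1} END {print s}'
# prints 350
```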
Napkin math:
The joined result has 100,000 rows (each B row matches exactly one A row), so it should be approximately 100000 * (8 + 32 + 32 + 8 + 8 + 32) = 12,000,000 bytes, which is close to the 13,373,905 bytes actually received; the remainder is protocol overhead. The useful/unique data is only 100 * (8 + 32 + 32) + 100000 * (8 + 8 + 32) = 4,807,200 bytes, which is much closer to the 5,590,420 bytes of the two-query approach.
So apparently the data isn't deduplicated for transfer — each A row is re-sent once per matching B row — and doing the big join is probably a bad idea bandwidth-wise.
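For reference, the napkin math above works out as follows (integers counted as 8 bytes each, which is a rough guess for their text-format size plus per-field framing):

```shell
# Estimated bytes for the joined result: 100,000 rows of
# (A.id, val1, val2, B.id, B.rel, B.val).
joined=$((100000 * (8 + 32 + 32 + 8 + 8 + 32)))
# Estimated bytes if each row of A and B is sent only once.
unique=$((100 * (8 + 32 + 32) + 100000 * (8 + 8 + 32)))
echo "joined=$joined unique=$unique"
# prints joined=12000000 unique=4807200
```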