I thought of a way to measure this in a container with some test data:
podman run --name test_postgres --rm -e POSTGRES_PASSWORD=example --mount type=bind,source=init_db.sql,destination=/docker-entrypoint-initdb.d/init_db.sql docker.io/postgres
CREATE TABLE A (
    id INTEGER PRIMARY KEY,
    val1 VARCHAR(32),
    val2 VARCHAR(32)
);
CREATE TABLE B (
    id INTEGER,
    rel INTEGER REFERENCES A(id),
    val VARCHAR(32),
    PRIMARY KEY (rel, id)
);
-- Use random_string from https://stackoverflow.com/a/3972983/2256700
-- [...]
INSERT INTO A
SELECT s.a, random_string(32), random_string(32) FROM generate_series(1,100) as s(a);
INSERT INTO B
SELECT s.a, A.id, random_string(32) FROM generate_series(1,1000) as s(a), A;
This populates A with 100 rows and B with 1000 × 100 = 100,000 rows (every value of the generate_series paired with every row of A).
Then I installed strace in the container with apt-get update && apt-get install strace
to find out how much data is transferred over the socket:
$ strace -e %net -o single_command_strace.out \
psql -U postgres -c "SELECT * FROM A INNER JOIN B ON A.id = B.rel;" > single_command_result.txt
$ # summing the last number of all the `recvfrom` lines:
$ cat single_command_strace.out | grep '^recvfrom' | grep -o -E '[[:digit:]]+$' | awk '{s+=$1} END {print s}'
13373905
$ strace -e %net -o two_command_strace.out \
psql -U postgres -c "SELECT * FROM A; SELECT * FROM B;" > two_command_result.txt
$ cat two_command_strace.out | grep '^recvfrom' | grep -o -E '[[:digit:]]+$' | awk '{s+=$1} END {print s}'
5590420
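As a sanity check that the summing pipeline extracts the right field, here it is run against a small fabricated strace excerpt (the byte counts below are made up, not real measurements):

```shell
# Fabricated strace -e %net output; only the trailing "= N" return
# values of the recvfrom lines should be summed (sendto is ignored).
cat > sample_strace.out <<'EOF'
recvfrom(3, "T\0\0\0"..., 16384, 0, NULL, NULL) = 100
recvfrom(3, "D\0\0\0"..., 16384, 0, NULL, NULL) = 250
sendto(3, "Q\0\0\0"..., 64, MSG_NOSIGNAL, NULL, 0) = 64
EOF
grep '^recvfrom' sample_strace.out | grep -o -E '[[:digit:]]+$' | awk '{s+=$1} END {print s}'
# prints 350
```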
Napkin math:
The joined result has 100,000 rows (each B row matches exactly one A row), so it should be approximately 100000 * (8 + 32 + 32 + 8 + 8 + 32) = 12,000,000 bytes, which is close to the 13,373,905 bytes actually received; the remainder is protocol overhead. The useful/unique data is only 100 * (8 + 32 + 32) + 100000 * (8 + 8 + 32) = 4,807,200 bytes, which is much closer to the 5,590,420 bytes of the two-query approach.
So apparently the data isn't deduplicated for transfer — each A row is re-sent once per matching B row — and doing the big join is probably a bad idea bandwidth-wise.
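For reference, the napkin math above works out as follows (integers counted as 8 bytes each, which is a rough guess for their text-format size plus per-field framing):

```shell
# Estimated bytes for the joined result: 100,000 rows of
# (A.id, val1, val2, B.id, B.rel, B.val).
joined=$((100000 * (8 + 32 + 32 + 8 + 8 + 32)))
# Estimated bytes if each row of A and B is sent only once.
unique=$((100 * (8 + 32 + 32) + 100000 * (8 + 8 + 32)))
echo "joined=$joined unique=$unique"
# prints joined=12000000 unique=4807200
```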