
This is the ABC of SQL; every SQL developer has to learn this lesson sooner or later: every additional JOIN you make creates a cartesian product of all previously joined records, and the number of rows grows exponentially => you join tables A to B to C to D -> the SQL engine has to create AxBxCxD rows, each letter representing the number of rows in that table.

Of course, specifying useful predicates that filter down the dataset and having indexes on those columns helps avoid row explosion, but you can't have an index on each and every column, especially on an OLTP database.

in example #2 the author joined 7 tables (multiply the row counts of the 7 tables and this gives you an idea of how many rows the SQL engine has to churn through) - the big-O complexity of the query is O(AxBxCxDxExFxG), so of course it will be SLOW.

in example #4 he joins 4 tables and unions that with a 5-table join, so the big-O complexity of the query is AxBxCxD(1+E).

same as in programming, there is big-O complexity in your SQL queries, so it helps to know the O() complexity of the queries that you write

TLDR: learn the SQL big-O and stop complaining about the query planner, pls



> Every additional JOIN you make creates a cartesian product of all previously joined records, and the number of rows grows exponentially => you join tables A to B to C to D -> the SQL engine has to create AxBxCxD rows, each letter representing the number of rows in that table.

I think a cartesian product would be:

  SELECT * FROM A, B;

Given sizes a and b, the resulting number of rows would be a x b, or O(N^2).

With a join,

  SELECT * FROM A INNER JOIN B ON A.id = B.id

the result would be at most MAX(a, b) rows, or O(N).

A JOIN is a predicate that filters down the dataset.
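the difference is easy to check with a toy example (sketched here with Python's stdlib sqlite3; the table names and sizes are made up):

```python
import sqlite3

# Toy tables: A has 4 rows, B has 3 rows that share ids 1..3 with A.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A(id INTEGER);
    CREATE TABLE B(id INTEGER);
    INSERT INTO A VALUES (1), (2), (3), (4);
    INSERT INTO B VALUES (1), (2), (3);
""")

# Cartesian product: every A row paired with every B row -> 4 * 3 = 12 rows.
cross_rows = conn.execute("SELECT * FROM A, B").fetchall()

# Inner join on an equality predicate: only matching pairs -> 3 rows.
join_rows = conn.execute(
    "SELECT * FROM A INNER JOIN B ON A.id = B.id"
).fetchall()

print(len(cross_rows), len(join_rows))  # 12 3
```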


all queries with JOINs can be rewritten as cartesians: select * from A join B on col1=col2 will be rewritten by the query optimizer into:

  select * from A, B /* this is the cartesian that slows down your query */ WHERE A.col1=B.col2

in fact both queries produce the same execution plan, as you can check yourself
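the "same plan" part of this is easy to verify; a minimal sketch with SQLite via Python's stdlib sqlite3 (other engines print plans differently, but the equivalence of the two spellings of an inner join holds broadly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A(col1 INTEGER);
    CREATE TABLE B(col2 INTEGER);
""")

def plan(sql):
    # Keep only the human-readable "detail" column of each plan step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

explicit = plan("SELECT * FROM A JOIN B ON A.col1 = B.col2")
implicit = plan("SELECT * FROM A, B WHERE A.col1 = B.col2")

print(explicit == implicit)  # same plan for both spellings
```

note this shows the two spellings are normalized identically, not that either one is executed as an unfiltered cartesian product.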


I think the point is rather that a join with a condition (which is the norm) is almost never actually executed as a cartesian product. Take, for example, from tfa:

  select * from customer_order_items
the grain is one row per customer order item, and the next two joins don't change that

  select * from customer_order_items join customer_orders on order_id
the grain is still one row per customer order item

  select * from customer_order_items join customer_orders on order_id join customers on customer_id
the grain is still one row per customer order item

..etc..

of course later on they screw it up and take the cartesian product of customer_order_items by employ_markouts, but that's just 2 big factors, not 7 - their query did finish after a few seconds. usually mistakes involving cartesian products with 7 factors just run for hours and eventually throw out-of-memory.
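the grain argument can be checked mechanically; a toy sketch (Python's stdlib sqlite3, with made-up rows behind the table names from tfa):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers(customer_id INTEGER PRIMARY KEY);
    CREATE TABLE customer_orders(
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id));
    CREATE TABLE customer_order_items(
        item_id INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES customer_orders(order_id));
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO customer_orders VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO customer_order_items VALUES
        (100, 10), (101, 10), (102, 11), (103, 12), (104, 12);
""")

def count(sql):
    return conn.execute(sql).fetchone()[0]

# Each join walks a many-to-one relationship, so the row count
# (the grain: one row per order item) never changes.
items = count("SELECT count(*) FROM customer_order_items")
with_orders = count("""
    SELECT count(*) FROM customer_order_items
    JOIN customer_orders USING (order_id)""")
with_customers = count("""
    SELECT count(*) FROM customer_order_items
    JOIN customer_orders USING (order_id)
    JOIN customers USING (customer_id)""")

print(items, with_orders, with_customers)  # 5 5 5
```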


All queries could technically be expressed as a cartesian product but that is not necessary and not what happens in practice. Both of the above might produce the same plan because they are both treated as joins. One is expressed as an explicit join, the other as an implicit join, but neither requires the query engine to produce a cartesian product on execution. If it did, queries I run in seconds would require more storage or RAM than probably exists in my entire workplace, and I'm not using anything that would usually be considered "big data".


>> queries I run in seconds would require more storage or RAM than probably exists in my entire workplace

my explanation for this: CPUs are very fast. My CPU is 4 GHz, so a single core can do a lot of computations in one second, and a SQL engine is smart enough to do the cartesian computation (as part of a query plan) and discard a row if it does not meet the predicate condition.

in fact I agree that the entire cartesian is not being computed, if you specify enough predicates. But the query still multiplies rows. In the author's article he is joining employees where the customer_id column is NULL, so this is technically a cartesian, because a NULL predicate is not very selective (= there are a lot of rows with a value of NULL)


You are not using the term cartesian product correctly as applied to how database engines execute queries. Multiple people have detailed the problems with your ideas on how such things work. A fast processor cannot overcome the need for multiple terabytes of RAM that would be required to process common queries if databases worked as you describe. Databases are significantly more likely to get bottlenecked by RAM than CPU, and your incorrect understanding of how databases work would require as much RAM as a large "big data" cluster to run queries on moderately sized data sets. Then even if it had that RAM, the bandwidth of the system's bus channels would be incapable of transporting it fast enough to return results in 500ms. Certainly not on my system with a few dozen other people working at the same time, and especially not when I have to query the live production system with a few thousand end users instead of the data warehouse. Databases do not work this way.


multiple people can be wrong, so that is not a valid argument. I can write 10 nested loops and overwrite the value of a single CPU register with a single value so many times that it will exceed any big data cluster's capacity. This is what I meant by CPUs are fast.

if you study engine internals, especially MergeJoin, HashJoin, NestedLoopJoin - they all do compute the cartesian while simultaneously applying predicates. Some operations are faster because of sorting and hashing, but they still do the cartesian.


If they are applying predicates to reduce the number of rows processed then they are not performing cartesian joins. You don't need to take the word of people here, you merely need to read any source about how database engines process queries. I am of course open to the possibility of being wrong should you find authoritative sources that show cartesian joins produced for all queries. However, nearly two decades of working with a variety of engines tells me you are unlikely to find such a source. Your comments betray a fundamental lack of understanding on the topic. Your unwillingness to recognize the exponential resources required for your understanding to be correct also betrays a fundamental misunderstanding. You even contradict yourself by in one comment insisting that such queries would be "SLOW" while in another comment stating that you obtain speedy results due to a fast CPU. Which in itself betrays yet another fundamental misunderstanding in how databases utilize resources and where bottlenecks arise.

Last, you have offered no significant support for your claims, and as the initiator of the discussion making the claim of cartesian joins, the burden of proof is on you to provide evidence for your claims.

In any case, your persistence in digging your hole deeper and deeper on this issue is embarrassing, and I won't be cruel enough to enable that further. Feel free to reply, but I am done, with a final recommendation that you continue your learning and not stop at the point of your current understanding. You at least seem earnest in your interests in this topic, so you should pursue it further.


> if you study engine internals, especially MergeJoin, HashJoin, NestedLoopJoin

it's called merge join because it employs the merge algorithm (linear time constant space)[1] - this can only be used when the inputs are sorted. likewise the hash join entails building a hash table and using it (linear time linear space)[2] but the inputs don't have to be sorted. the point of these is to avoid the O(N*M) nested loop

[1] https://en.wikipedia.org/wiki/Merge_algorithm

[2] https://en.wikipedia.org/wiki/Hash_join
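the build/probe structure of a hash join can be sketched in a few lines of plain Python (rows here are made-up (key, payload) tuples, and the join predicate is key equality):

```python
# Minimal sketch of the two strategies, outside any SQL engine.

def nested_loop_join(left, right):
    # O(len(left) * len(right)): the predicate is tested on every pair.
    return [(l, r) for l in left for r in right if l[0] == r[0]]

def hash_join(left, right):
    # Build phase: hash one input on the join key (linear space).
    table = {}
    for row in right:
        table.setdefault(row[0], []).append(row)
    # Probe phase: one lookup per left row (linear time), so
    # non-matching pairs are never even considered.
    return [(l, r) for l in left for r in table.get(l[0], [])]

left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (4, "z")]

print(nested_loop_join(left, right) == hash_join(left, right))  # True
```

both produce the same two matching pairs, but the hash join never enumerates the full cross product.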


Umm.. virtually no join on tables with more than a handful of rows is done as a cartesian product. A suitable set of columns is used to sort or hash each side of the join, such that the actual join logic is basically linear performance (each row in table A is probably only ever paired with approximately 1 row in table B when evaluating the predicates). A cartesian product would involve testing the predicates on every combination of A and B (each row in table A being paired with each row in table B)

(Note when I say linear performance, I mean the join predicate is executed a linear amount of times, but the initial sort/hash operation was probably more like O(n log n))


Cartesian product means a very specific thing - testing the predicates on every possible combination. This is not what occurs in practice


? not how it works behind the scenes at all


> you join tables A to B to C to D -> SQL engine has to create AxBxCxD number rows

this isn't really true. not every join is a cross join, and not every join even changes the grain of the result set. it is important to pay attention when your joins do change the grain of the result set, especially when they grow it by some factor - then it's more like a product as you say.


all queries with JOINs can be rewritten as cartesians: select * from A join B on col1=col2 will be rewritten by the query optimizer into: select * from A, B /* this is the cartesian that slows down your query */ WHERE A.col1=B.col2

in fact both queries produce the same execution plan, as you can check yourself


the query is not the only input into the plan generator; the other inputs are the schema (with constraints and indexes) and its statistics - so you can't really say "check the plan for yourself" without all the inputs. and the only time the plan for "select * from A, B" is going to be the same as for "select * from A, B where A.col1=B.col2" is when col1 and col2 are neither primary keys nor foreign keys, and that would be a weird condition under which to run such a query.


pls reread my post, I wrote that the query select * from A join B on col1=col2 produces the same plan as select * from A, B where col1=col2

this is true regardless of schema and statistics, it is simply how the query planner works under the hood

and you can check it yourself on any database


Those are both inner joins, just a different syntax for writing them.


They might be logically equivalent, but they are not identical if more than two tables are joined. For instance, the current Postgres docs say [1]:

> This latter equivalence does not hold exactly when more than two tables appear, because JOIN binds more tightly than comma. For example FROM T1 CROSS JOIN T2 INNER JOIN T3 ON condition is not the same as FROM T1, T2 INNER JOIN T3 ON condition because the condition can reference T1 in the first case but not the second.

[1] https://www.postgresql.org/docs/current/queries-table-expres...


I don't understand what you mean by saying each join creates a cartesian product. Join criteria avoid this. If I left outer join 7 rows in table A against 5 in table B, joining on a unique key, then I still only get 7 rows, not the cartesian product of 7x5 rows. I query across a dozen or more tables on a regular basis with aggregates, window functions, with statements, having clauses and so on. Performance is rarely a problem.
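the 7-rows-stays-7-rows claim is straightforward to reproduce (sketched with Python's stdlib sqlite3; table contents are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A(id INTEGER, b_id INTEGER);
    CREATE TABLE B(id INTEGER PRIMARY KEY);  -- unique join key
    INSERT INTO A VALUES (1,10),(2,11),(3,12),(4,13),(5,14),(6,NULL),(7,NULL);
    INSERT INTO B VALUES (10), (11), (12), (13), (14);
""")

# Left outer join on B's unique key: each A row matches at most one
# B row, so the result keeps A's grain.
rows = conn.execute(
    "SELECT * FROM A LEFT OUTER JOIN B ON A.b_id = B.id"
).fetchall()

print(len(rows))  # 7, not 7 * 5 = 35
```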


Something that helps with the exploding-cartesian problem is using json_agg for the joined tables. It's available in postgres, and it works great.
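json_agg is the Postgres spelling; the sketch below uses SQLite's analogous json_group_array (so it runs on Python's stdlib sqlite3, assuming the JSON functions are compiled in, which they are in common builds) to fold the joined rows back to one row per order:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(order_id INTEGER PRIMARY KEY);
    CREATE TABLE items(order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO items VALUES (1, 'apple'), (1, 'pear'), (2, 'plum');
""")

# One output row per order, with the joined item rows aggregated into
# a JSON array, instead of one output row per (order, item) pair.
rows = conn.execute("""
    SELECT o.order_id, json_group_array(i.sku)
    FROM orders o JOIN items i ON i.order_id = o.order_id
    GROUP BY o.order_id
""").fetchall()

print(rows)
```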


Your understanding is flawed, these aren’t cross joins.


I've found denormalization to be helpful for such performance issues in that it reduces the number of joins. It can also be useful for debugging if you want to keep track of values from other tables at the time that data was processed within the primary table.


A certain amount of denormalization can be helpful, but often there's a trade-off in terms of maintainability.


Not who you replied to.

But yes,

As an example, we keep our data normalized and add extra tables to do the operations we need. It’s expensive since we have to precompute, etc. But then on lookups it’s quicker.

Like everything, it depends on what you want to optimize for.


Yep, in my case we wanted to make user-facing queries fast (i.e. a reports page). The extra complexity (not much; we just have to remember when to update certain values in other tables when they change) is definitely worth it since load time dropped from 30 seconds sometimes to almost instant.

Denormalizing everything is definitely a pain; keeping data in sync in mongodb (which lacks traditional joins) on a previous project was not fun; now using postgresql.
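the "remember to update certain values in other tables" part can also be pushed into the database itself; a minimal sketch (Python's stdlib sqlite3, with a hypothetical denormalized item_count column maintained by a trigger):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(order_id INTEGER PRIMARY KEY,
                        item_count INTEGER NOT NULL DEFAULT 0);  -- denormalized
    CREATE TABLE items(order_id INTEGER);

    -- The trigger does the remembering, so reads never have to
    -- join or aggregate over items.
    CREATE TRIGGER bump AFTER INSERT ON items BEGIN
        UPDATE orders SET item_count = item_count + 1
        WHERE order_id = NEW.order_id;
    END;

    INSERT INTO orders(order_id) VALUES (1);
    INSERT INTO items VALUES (1), (1), (1);
""")

count = conn.execute(
    "SELECT item_count FROM orders WHERE order_id = 1").fetchone()[0]
print(count)  # 3
```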


denormalization should only be seen as a last resort


denormalization aka memoization aka "the dynamic programming of SQL" :-) - all three provide a speed-up because they compute a value once and reuse it later


Except that your source of truth, aka your raw data, is now memoized, which is hardly a win. This is not a goal of memoization or dynamic programming at all.

Perfect for a reporting database though.



