Joins 13 Ways (justinjaffray.com)
531 points by foldU 774 days ago | 76 comments
bob1029 774 days ago [-]
Once I started thinking about joins in terms of spatial dimensions, things got a lot easier to reason about. I like to think of the inner join as being like the scene from Stargate where they describe how gate addressing works. Assume you have a contrived example with 3 tables describing the position of an off-world probe in each dimension - X, Y and Z. Each table looks like:

  CREATE TABLE Dim_X (
    EntityId int,
    Value float
  )
Establishing a point in space for a given entity (regardless of how you derive that ID) is then a matter of:

  SELECT Dim_X.Value AS X, Dim_Y.Value AS Y, Dim_Z.Value AS Z
  FROM Dim_X, Dim_Y, Dim_Z
  WHERE Dim_X.EntityId = Dim_Y.EntityId --Join 1
  AND Dim_Y.EntityId = Dim_Z.EntityId --Join 2
  AND Dim_X.EntityId = @MyEntityId --The entity we want to find the 3D location of
You will note that there are 2 inner joins used in this example. That is the bare minimum needed to construct a 3 dimensional space. Think about taking 3 cards and taping them together with 2 pieces of rigid metal tape. You can make a good corner of a cube, even if it's a bit floppy on one edge. Gravity doesn't apply inside the RDBMS, so this works out.

This same reasoning can be extrapolated to higher, non-spatial dimensions. Think about adding time into that mix. In order to find an entity, you also now need to perform an inner join on that dimension and constrain on a specific time value. If you join but fail to constrain on a specific time, then you get a happy, comprehensive report of all the places an entity was over time.
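For instance, with a hypothetical Dim_T table of the same shape as the others, the query from above grows by one join and one constraint:

  SELECT Dim_X.Value AS X, Dim_Y.Value AS Y, Dim_Z.Value AS Z
  FROM Dim_X, Dim_Y, Dim_Z, Dim_T
  WHERE Dim_X.EntityId = Dim_Y.EntityId
  AND Dim_Y.EntityId = Dim_Z.EntityId
  AND Dim_Z.EntityId = Dim_T.EntityId
  AND Dim_X.EntityId = @MyEntityId
  AND Dim_T.Value = @MyTime --drop this line to get the full history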

The other join types are really minor variations on these themes once you have a deep conceptual grasp (i.e. can "rotate" the schema in your mind).

Playing around with some toy examples can do wonders for understanding. I sometimes find myself going back to the Cartesian coordinate example when stuck trying to weld together 10+ dimensions in a real-world business situation.

quasarj 773 days ago [-]
I now understand joins less than I did before.
barrkel 773 days ago [-]
It's more a description of how dimensions work in OLAP and it's incomplete without measures.
cwbrandsma 773 days ago [-]
I miss MDX (the language behind MS Analysis Services).
michaelmior 774 days ago [-]
This reminds me a lot of HyperDex[0], which hashes values into a multidimensional hyperspace based on their attributes for indexing purposes.

[0] https://dbdb.io/db/hyperdex

1970-01-01 774 days ago [-]
OLAP and MOLAP

https://www.ibm.com/topics/olap

https://en.wikipedia.org/wiki/Online_analytical_processing#M...

toyg 773 days ago [-]
That Wikipedia article, and its offshoot pages, are so awfully out of date...

But yes, OLAP.

rjbwork 774 days ago [-]
This is like, a hypernormalization of the data though, isn't it? I think in standard BCNF you'd just leave your table as

  CREATE TABLE EntityPosition (
    EntityId int,
    X float,
    Y float,
    Z float
  )
It does remind me of data warehouse stuff though, given we're working with aggregates and piecing together bits of various dimensions.
jagged-chisel 774 days ago [-]
> …a hypernormalization of the data …

Well, yeah - it’s an example to get the point across, not an exercise in finding the right level of normalization.

spopejoy 772 days ago [-]
Your example raises a question I've never found a good answer to:

Why use `JOIN` (and `INNER JOIN` and friends) at all when you can write the exact join more accurately IMO with WHERE clauses (and using "*=" operators), and then just list all the tables in the FROM clause?

My brain always hurts trying to parse complex FROM clauses with various JOINs whereas reading the equivalent equations in WHERE clauses is way more straightforward to me.
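To illustrate with hypothetical tables a and b, these two are equivalent for inner joins, and I find the first form much easier to parse:

  --implicit join in the WHERE clause
  SELECT * FROM a, b WHERE a.id = b.a_id AND a.x = 1
  --explicit JOIN syntax
  SELECT * FROM a JOIN b ON a.id = b.a_id WHERE a.x = 1

(IIRC the "*=" operators were a vendor extension for expressing outer joins in this style; standard SQL only kept the explicit OUTER JOIN form.)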

iamcreasy 767 days ago [-]
I have had this issue, and solved it by learning that joins happen from left to right. `A left join B left join C` is equivalent to ((A left join B) left join C).
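With hypothetical tables, these are the same query:

  SELECT * FROM a LEFT JOIN b ON b.a_id = a.id LEFT JOIN c ON c.b_id = b.id
  SELECT * FROM (a LEFT JOIN b ON b.a_id = a.id) LEFT JOIN c ON c.b_id = b.id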
iamcreasy 767 days ago [-]
I think you are saying that every join is a variation of cross join. Did I understand you correctly?
farkanoid 774 days ago [-]
Great. Now you've made me want to re-watch Stargate!
roywiggins 774 days ago [-]
The 0th would be "it's an operator in relational algebra."

https://en.m.wikipedia.org/wiki/Relational_algebra

("The result of the natural join [R ⋈ S] is the set of all combinations of tuples in R and S that are equal on their common attribute names... it is the relational counterpart of the logical AND operator.")

⋈ amounts to a Cartesian product with a predicate that throws out the rows that shouldn't be in the result. Lots of SQL makes sense if you think of joins this way.
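In SQL terms, with hypothetical tables r and s, these are the same query:

  SELECT * FROM r CROSS JOIN s WHERE r.id = s.r_id
  SELECT * FROM r INNER JOIN s ON r.id = s.r_id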

jimwhite42 774 days ago [-]
In the functional interpretation of relational theory, a join is function composition; I'm surprised that one was left out.
keithalewis 772 days ago [-]
Good point, but not exactly correct. If R is a relation on A x B and S is a relation on B x C then SR = {(a, c) | aRb and bSc for some b in B}. The join uses (a, b, c) in the comprehension.
jimwhite42 768 days ago [-]
True, but I think slightly pedantic for this context. I think there have been proposed relational operators which produce the result you describe - a join, then remove the join key fields from the result.

If we are in a pedantic mood: also, a relational tuple is not exactly the same as a mathematical tuple; relational tuples take different forms depending on whose system you are following (e.g. whether the fields are ordered or not).

keithalewis 761 days ago [-]
Just endeavoring to make true statements. A relation on sets A and B is a subset R of the Cartesian product of A and B. E. F. Codd had something else in mind with his definition of join. Do you think he got that wrong?
gpderetta 774 days ago [-]
The cross-product plus predicate is referenced in "A join is a nested loop over rows".
roywiggins 774 days ago [-]
That's true, but I think the benefit of treating a Cartesian product as fundamental is that it lets you stop thinking about loops at all, or which one is the inner or outer loop, or thinking of it as an iterative process at all.

Boxing all that up into the Cartesian product is a really useful concept, and the whole idea of relational algebra is to find convenient formalisms for relational operations, so it seems like it deserves a separate mention.

code_biologist 774 days ago [-]
1,000%. I learned discrete/set math in college and then later SQL on the job. Helping a few analysts move beyond Excel to learn SQL was interesting. They struggled with some things that clicked quickly for me because I immediately had the "oh, these are set operators" intuition. Stuff like when to use an outer join vs a cross join, or how to control the cardinality of output rows in fancy ways (vs slapping `DISTINCT ON` on everything).

I passed on the set understanding whenever it explained an unintuitive SQL behavior, and I hope it helped 'em.

cmrdporcupine 774 days ago [-]
Absolutely. One of the biggest disservices SQL does is getting people to think of relations as "tables" with "rows", which ends up making them think in an iterative, sequential, tabular model about something they'd be better off thinking about more abstractly.

The whole relational model clicked for me a whole lot more once I started thinking of each tuple as a factual proposition (customer a's name is X, their phone number is Y); then all the operations in the relational algebra start to look more like "how would it be best to ask questions about this subject?"... "I'm interested in facts about..."

mirekrusin 774 days ago [-]
For several days I've been having trouble finding good resources on the _implementation_ of query execution/planning (predicates, existing indices <<especially composite ones - how to locate them during planning etc>>, joins etc).

Google is spammed with _usage_.

Anybody have some recommendations at hand?

ps. the only ones I found were CMU's Database Group resources, which are great

gavinray 774 days ago [-]
There's an entire 700-page book about this you can access for free here:

"Building Query Compilers"

https://pi3.informatik.uni-mannheim.de/~moer/querycompiler.p...

EDIT: You also might get use out of TUM's course "Database Systems on Modern CPU Architectures"

The year that contains all the video lectures is 2020:

https://db.in.tum.de/teaching/ss20/moderndbs/?lang=en

hobofan 773 days ago [-]
I'd also recommend "How Query Engines Work" ( https://andygrove.io/how-query-engines-work/ ) for a good practical exercise in the topic. I think one should already have a high-level overview of which parts make up a query engine and what their purpose is, as the book only touches on that lightly.
gavinray 773 days ago [-]
I own the book and would agree -- only reason I didn't suggest this is that the areas on query-planning and optimization are a bit thin.

IMO it's great if you want a holistic view of building a query engine start-to-finish with just enough detail to build a basic implementation.

I probably learned more from the ~100 pages in this book than I did from most other sources.

sobellian 774 days ago [-]
I don't know if this has the level of detail you're looking for, but you might want to check https://www.sqlite.org/optoverview.html and https://www.sqlite.org/queryplanner.html.
Sesse__ 773 days ago [-]
The SQLite optimizer is really primitive. You generally don't want to be using it as a template if you can avoid it; it's good for OLTP but can't handle many-way joins well.
aidos 774 days ago [-]
My go-to pointer for this is to read the Postgres docs (and/or source, which is also super readable).

https://www.postgresql.org/docs/current/planner-optimizer.ht...

Sesse__ 773 days ago [-]
Generally, this is a specialist topic, so you're unlikely to find e.g. good textbooks. It depends a bit on how deep you want to go and what specific parts you're interested in, though. (Query execution is pretty much entirely separate from planning, for one.)

The best join optimizer paper I know of is still the original Selinger paper (https://courses.cs.duke.edu/compsci516/cps216/spring03/paper...). It doesn't support outer joins and there are more efficient techniques by now, but anyone looking at a System R-like optimizer can read this and feel right at home. (There is also the Volcano/Cascades school of optimizers, which is rather different from System R, but I've found even less information on it.)

As others have said, Postgres' source and documentation is good. In particular, src/backend/optimizer/README contains a number of interesting things that you won't find a lot of other places. (The Postgres optimizer doesn't always use the latest fancy designs, but it's generally very polished and pretty easy to read.)

I can also echo Andy Pavlo's courses (at CMU), they're pretty much the only ones who will explain this stuff to you online. The “Building Query Compilers” PDF is rather incomplete and there's a lot of stuff I never cared for in there, but it contains several key papers from Moerkotte et al if you actually want to implement the modern stuff (DPhyp etc.). In general, most of the modern System R-like stuff (how to efficiently deal with outer joins, how to deal with interesting orders, how to deal with large queries) comes from the groups of Moerkotte and/or Neumann; all of these things had solutions before them, but less efficient and/or powerful and/or elegant.

Finding an applicable index isn't hard (you generally just try to see if it hits a predicate—a so-called “sargable predicate”, for SARG = Search ARGument). Estimating selectivity is hard. Estimating selectivity through joins is perhaps the hardest problem in optimizers, and nobody has truly solved it. This was one of the things I never really got to; there are so many hard subproblems. Like, for something that sounds really simple but isn't, what do you do if you have predicates A=x AND B=y AND C=z and you happen to have indexes on (A,B) and (B,C) (with selectivity/cardinality information) and want a selectivity for all three combined? There are papers that literally require you to build a “second-order cone programming” solver to solve this problem :-)
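Concretely, that last situation looks something like this (hypothetical table and index names):

  CREATE INDEX idx_ab ON t (a, b)
  CREATE INDEX idx_bc ON t (b, c)
  SELECT * FROM t WHERE a = @x AND b = @y AND c = @z
  --each index gives selectivity information for its own column pair, but
  --combining them into one estimate for (a, b, c) together is the hard part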

mrkeen 774 days ago [-]
I just tried adding 'relational algebra' in front of my 'query planning' query, and at a glance it skews more towards implementation.
malfist 774 days ago [-]
It seems content farms, farming keywords with just fluff, have taken over Google. Can't find how to do anything anymore, just pages and pages of people talking about doing something.

You could try your query on a different search engine. I've had good luck with kagi.

skywhopper 774 days ago [-]
It’s out of date and probably has less detail than I remember but I got a lot out of “Inside Microsoft SQL Server 7.0” which does deep dives into storage architecture, indices, query planning etc from an MSSQL specific POV. The book was updated for SQL Server 2003 and 2008. Obviously the book also has a ton of stuff about MS specific features and admin functionality that’s not really relevant, and I’m sure there are better resources out there, but I’ve found the background from that book has helped me understand the innards of Oracle, MySQL, and Postgres in the years since.
dattl 774 days ago [-]
I find the Advanced Databases Course from CMU an excellent resource. https://15721.courses.cs.cmu.edu/spring2023/schedule.html

You might want to look into academic papers, e.g., T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, in VLDB, 2011 https://www.vldb.org/pvldb/vol4/p539-neumann.pdf

maxdemarzi 774 days ago [-]
The 14th way is "multi-way joins", also called "worst case optimal joins", which is a terrible name.

It means instead of joining tables two at a time and dealing with the temporary results along the way (eating memory), you join 3 or more tables together without the temporary results.
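The canonical example is the triangle query; with hypothetical tables, any pairwise plan has to materialize one of the two-way joins first, and that intermediate result can be far larger than the final output:

  SELECT *
  FROM R, S, T
  WHERE R.b = S.b
  AND S.c = T.c
  AND T.a = R.a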

There is a blog post and short video of this on https://relational.ai/blog/dovetail-join and the original paper is on https://dl.acm.org/doi/pdf/10.1145/3180143

I work for RelationalAI, we and about 4 other new database companies are bringing these new join algorithms to market after ten years in academia.

gavinray 773 days ago [-]
Justin also has a post on WCOJ that's really solid:

https://justinjaffray.com/a-gentle-ish-introduction-to-worst...

namibj 774 days ago [-]
Negating inputs (set complement) turns the join's `AND` into a `NOR`, as Tetris exploits.

The worst case bounds don't tighten over (stateless/streaming) WCOJs, but much real-world data has far smaller box certificates.

One thing I didn't see is whether Dovetail join allows recursive queries (i.e., arbitrary datalog with a designated output relation, and the user having no concern about what the engine does with all the intermediate relations mentioned in the bundle of horn clauses that make up this datalog query).

Do you happen to know if it supports such queries?

ttfkam 774 days ago [-]
Excellent! We need more articles like this that demonstrate the subtleties of the relational model to primarily app-level developers. The explanations and explorations in terms of functional programming are both concise and compelling.
srcreigh 774 days ago [-]
Another missed chance to educate on the N+1 area. A join on an unclustered index is still N+1; it's just N+1 on disk instead of N+1 over the network and disk.
tourist2d 774 days ago [-]
"Another missed chance to talk about X problem I find interest in and would bloat the article"
smif 774 days ago [-]
An inner join is a Cartesian product with conditionals added.
RobinL 774 days ago [-]
There's a big performance difference between creating the Cartesian product and then filtering on the condition, versus producing the matching rows directly. An inner join with equi-join conditions can produce the matches directly; any non-equi join conditions actually have to be evaluated.
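For example, with hypothetical tables:

  --equi join: the executor can hash or merge on the key directly
  SELECT * FROM a JOIN b ON a.id = b.a_id
  --non-equi join: the condition has to be checked per candidate pair
  SELECT * FROM a JOIN b ON b.ts BETWEEN a.start_ts AND a.end_ts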
smif 774 days ago [-]
That's true, but that is an implementation detail. In abstract terms, you can think of an inner join like that.

Also, while what you say is true in general for modern DBs, there are some implementations, like old Oracle versions, where the only way to get the effect of an inner join was in terms of a Cartesian product.

conor-23 774 days ago [-]
This dude's blog is fire. Very nice explanations of complex database topics.
gavinray 773 days ago [-]
Justin Jaffray is a gem
charles_f 774 days ago [-]
Very nice explanation!

> The correct way to do this is to normalize the table

This is true for transactional dbs, but in data warehouses it's widely accepted that some degree of denormalization is the way to go

tmpfile 773 days ago [-]
Great article. Side note: his normalization example reminded me how I used to design tables with a numeric primary key, thinking it was more performant than strings. But then I'd have a meaningless id which required a join to get the unique value I actually wanted. One day I realized I could use the same unique key in both tables and save a join.
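A sketch with hypothetical tables - instead of referencing the surrogate id, reference the meaningful unique value directly:

  CREATE TABLE countries (
    id int PRIMARY KEY,
    code varchar(2) UNIQUE
  )
  CREATE TABLE users (
    name varchar(100),
    country_code varchar(2) REFERENCES countries(code)
  )
  SELECT name, country_code FROM users --no join needed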

Simple realization. Big payoff

roselan 773 days ago [-]
I still like to have a unique id field per table. It helps with logging, and it doesn't have to care about a multi-field "real" key.

However, I keep a unique index on the string value and, more importantly, point integrity constraints at it, mainly for readability. It's way easier to read a table full of meaningful strings than one full of numeric ids or UUIDs.

Anon4Now 774 days ago [-]
A couple things:

This is admittedly a bit pedantic, but in E.F. Codd's original paper, he defined "relation" as the relation between a tuple (table row) and an attribute (table column) in a single table - https://en.wikipedia.org/wiki/Relation_(database). I'm not sure of the author's intent, but the user table example (user, country_id) might imply the relationship between the user table and the country table. It's a common misconception about what "relational" means, but tbh I'm fine with that since it makes more sense to the average developer.

If you ever need to join sets of data in code, don't use nested loops - that's O(n^2). Use a map / dictionary: build a lookup from the join key in one pass, then probe it - O(n). It's one of the few things I picked up doing Leetcode problems that I've actually needed to apply irl.

abtinf 773 days ago [-]
I think understanding that “relation” means the relationships of attributes to a primary key is a crucial, foundational concept. Without it, you can’t properly understand a concept like normalization, why it arises, and when it applies. You can’t even fully grasp the scope of “data corruption” (most developers have an extremely poor understanding of how broad the notion of corruption is in databases).
zzleeper 774 days ago [-]
I think it depends on the data.

If it's presorted by the join variable then rolling the loop is faster. Also, if the index is too big for memory, then it might be faster to loop.

Anon4Now 773 days ago [-]
Yeah. Many database tables are too large for an in-memory hash join.

My comment was a not very well fleshed out tangential remark on the value of practicing DS&A problems. I know a lot of devs hate Leetcode style interviews. I get it. It's not fun. But contrary to what some people say, I have run into a fair number of situations where the practice helped me implement more efficient solutions.

fuy 773 days ago [-]
I think the blog author uses it the same way Codd does; that's why he's talking about two columns from the same table.
benjiweber 774 days ago [-]
Joins as a relational AND.

https://benjiweber.co.uk/blog/2021/03/21/thinking-in-questio....

boredemployee 774 days ago [-]
I always thought a Venn diagram was a good representation, but I think I was wrong.

Edit: Why did I get down voted? :)

ericHosick 773 days ago [-]
I would also like to know why you are getting downvoted.

Even Wikipedia uses a Venn diagram to explain JOIN: https://en.wikipedia.org/wiki/Join_(SQL) .

Not trying to use an argument from authority but just pointing out that this is not unheard of.

radiospiel 773 days ago [-]
Venn diagrams are a terrible way to describe join types (and, tbh, I don't understand why Wikipedia has these) because it makes it look like applying a 1:1 relationship. In an M:N relationship, "artefacts" (for lack of a better word) of both tables would appear multiple times, and the Venn diagram obscures this fact.
ericHosick 773 days ago [-]
> because it makes it look like applying a 1:1 relationship.

People can form different mental models of the same abstraction, so I see what you are saying.

I've never seen it that way because "Venn diagrams do not generally contain information on the relative or absolute sizes (cardinality) of sets." (see https://en.wikipedia.org/wiki/Venn_diagram).

spiffytech 773 days ago [-]
I've always found Coding Horror's Venn diagram depiction of SQL joins quite enlightening:

https://blog.codinghorror.com/a-visual-explanation-of-sql-jo...

somat 773 days ago [-]
Because a Venn diagram does not describe join mechanics well. A Venn diagram does, however, describe the UNION, INTERSECT, and EXCEPT parts of SQL.

https://blog.jooq.org/say-no-to-venn-diagrams-when-explainin...

And more meta: it is an innocent, slightly incorrect statement; stuff like that should not be downvoted - reply with a correction. Save downvotes for outright malicious posts.

TechBro8615 773 days ago [-]
This is a nice way of explaining a concept, and could probably be applied to any complex topic. As someone who learns best by analogy, concepts usually "click" for me once I've associated them with a few other concepts. So I appreciate the format, and would personally enjoy seeing more explainers like this (and not just about database topics).
fifilura 773 days ago [-]
I think the next article could be about group by.

Like (only intuitively so far...)

A group by from A rows to B rows - is a map-reduce job - is an AxB linear transformation matrix from your linear algebra course - is...
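E.g. (hypothetical table) this maps all the payments rows down to one row per country:

  SELECT country, SUM(amount) AS total
  FROM payments
  GROUP BY country

For SUM, the matrix reading is exact: it's multiplication by a 0/1 matrix with one row per output group and one column per input row.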

waynecochran 774 days ago [-]
Can a join also be thought of as unification? Similar to type checking.

https://www.cs.cornell.edu/courses/cs3110/2011sp/Lectures/le...

BoppreH 774 days ago [-]
Fantastic post, I always enjoy reading about different computation mental models.

And under "A join is a…join", there's a typo in the partial order properties. It currently reads:

    1. Reflexivity: a≤b,
And I'm pretty sure it should be

    1. Reflexivity: a≤a,
instead (i.e., every element is ≤ to itself).
foldU 774 days ago [-]
You are correct! I will fix it when I get home, thank you for the correction!
lopatin 774 days ago [-]
Very useful. Something like this but for Flink’s fancy temporal, lateral, interval, and window joins would be great too.
danbruc 774 days ago [-]

   1. cross join
   2. natural join
   3. equi join
   4. theta join
   5. inner join
   6. left outer join
   7. right outer join
   8. full outer join
   9. left semi join
  10. right semi join
  11. left anti semi join
  12. right anti semi join
  13. ???
iaabtpbtpnn 774 days ago [-]
13. lateral join :)
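A sketch in Postgres with hypothetical tables - the right side is a subquery that can reference the current row of the left side, here fetching each user's three latest orders:

  SELECT u.name, o.*
  FROM users u
  CROSS JOIN LATERAL (
    SELECT * FROM orders
    WHERE orders.user_id = u.id
    ORDER BY created_at DESC
    LIMIT 3
  ) o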
kadenwolff 774 days ago [-]
The title of this post sounds like an advertisement for a cult
captaintobs 774 days ago [-]
Very cool work!
nickpeterson 774 days ago [-]
I imagine the title ‘thirteen ways of looking at a join’ was taken?
lbrindze 774 days ago [-]
Dunno why this was downvoted; I came here to make a similar comment about the possible Wallace Stevens reference.
lcnPylGDnU4H9OF 774 days ago [-]
Obviously I'd only be able to say for sure if I was the downvoter but I sometimes observe lighter text on comments which make a reference without calling out the reference. It's similar to using an acronym without defining it, though possibly more confounding if it's a niche enough reference. In this case, I did not get the reference and would have wondered why the parent commenter expected that title to have already been used.

At best, such a comment is referencing something topical which most readers will get (and ostensibly be entertained by); at worst, it's a distracting non sequitur. It generally ties back to a community preference that comments are curiosity-satisfying before entertaining.

lbrindze 774 days ago [-]
Wouldn’t an allusion to 20th century American poetry fall into the category of “curiosity satisfying”? Given they were not the only person who got the reference it feels kind of arbitrary to say this is frivolous entertainment when another person in the community (in this case, me) found it curious and also wondered if there was an allusion there.

If it was an intentional allusion, then it may actually add depth/meaning to the conversation but we may not know since it was already downvoted…

lcnPylGDnU4H9OF 774 days ago [-]
It could be if it was called out as such. Without the explicit callout, one runs the risk of it “going over the readers’ heads” so to speak. Anyway, I’m not intending to justify any behavior; just offering my interpretation based on past observations.
nickpeterson 773 days ago [-]
Humor on HN has to thread a needle. I take a lot of shots even though they often don’t work out. As an AI language model it’s a risk I take.