> We wanted FoundationDB to survive failures of machines, networks, disks, clocks, racks, data centers, file systems, etc., so we created a simulation framework closely tied to Flow. By replacing physical interfaces with shims, replacing the main epoll-based run loop with a time-based simulation, and running multiple logical processes as concurrent Flow Actors, Simulation is able to conduct a deterministic simulation of an entire FoundationDB cluster within a single-thread! Even better, we are able to execute this simulation in a deterministic way, enabling us to reproduce problems and add instrumentation ex post facto. This incredible capability enabled us to build FoundationDB exclusively in simulation for the first 18 months and ensure exceptional fault tolerance long before it sent its first real network packet. For a database with as strong a contract as the FoundationDB, testing is crucial, and over the years we have run the equivalent of a trillion CPU-hours of simulated stress testing.
https://pierrezemb.fr/posts/notes-about-foundationdb/
Working on a distributed key/value store myself, I couldn't agree more: what FoundationDB did for testing from the start is absolutely the way to go. Testing distributed systems is very tricky, and tests can be incredibly time-consuming and bring everything to a halt.
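For readers who haven't seen the technique: here is a minimal, hypothetical sketch (in Python, nothing to do with Flow) of the core idea behind deterministic simulation. Virtual time plus a single seeded RNG means one seed reproduces one exact interleaving, including injected faults:

```python
import heapq
import random

class SimLoop:
    """Single-threaded event loop over virtual time. All randomness flows
    through one seeded RNG, so a given seed replays the exact same run."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # sole source of nondeterminism
        self.now = 0.0                  # virtual clock; no wall time anywhere
        self.q = []                     # heap of (deliver_at, seq, callback)
        self.seq = 0                    # tie-breaker for equal timestamps

    def call_at(self, when, fn):
        heapq.heappush(self.q, (when, self.seq, fn))
        self.seq += 1

    def send(self, fn):
        # "Network" shim: deliver after a random but reproducible delay.
        self.call_at(self.now + self.rng.uniform(0.001, 0.250), fn)

    def send_lossy(self, fn, drop_p=0.05):
        # Fault injection: deterministically drop some messages.
        if self.rng.random() >= drop_p:
            self.send(fn)

    def run(self, until=60.0):
        while self.q and self.now <= until:
            self.now, _, fn = heapq.heappop(self.q)
            fn()

loop = SimLoop(seed=1234)  # same seed => same schedule, every time
loop.send(lambda: print(f"t={loop.now:.3f} replica applied mutation"))
loop.send_lossy(lambda: print(f"t={loop.now:.3f} heartbeat delivered"))
loop.run()
```

A crash found at seed 1234 can be replayed under a debugger as many times as needed, which is the "reproduce problems and add instrumentation ex post facto" property the quote describes.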
riwsky 774 days ago [-]
“The Jepsen is coming… from INSIDE THE HOUSE!”
pavlov 774 days ago [-]
For some types of distributed systems, you can do this kind of simulated testing in advance by building a TLA+ model.
It’s not a full-blown simulator (because generally the application code doesn’t even exist yet when you’re building the TLA+ model). But it can let you collect data and validate assumptions about your design before writing a single line of code.
rockwotj 774 days ago [-]
My beef with TLA+ is that it's not the same code: you're testing the design, yes, but you aren't testing the implementation of the design, which is just as important (and arguably harder) to get right.
aseipp 774 days ago [-]
Yes, but there really aren't many good solutions to that which aren't either extremely language- or domain-specific. And if you're careful you can get a lot of direct mileage out of it. For example, MongoDB (yes, that one!) used it in the development of their Atlas system and has a paper about using TLA+ to model the system, characterize behaviors, then generate compilable-code test cases from a minimal set of behaviors -- which are then directly linked against the core internals of the Atlas codebase as a client library. They then run those tests and re-generate them when the model changes. "Model-based test case generation" is the strategy here. So you can characterize what happens in split-brain scenarios, state machine transition failures (conflicting transactions), etc.
In reality the design stage is a pretty critical phase so you need all the help you can get, so even if you don't like TLA+ you're way better off than not modeling at all.
As an example of the language-specific thing, though, there's a Haskell library I like that's very cool, called Spectacle, which also implements the temporal logic of TLA+ along with a model checker, but as a Haskell DSL. An interesting benefit of this is that you can model-check actual real Haskell code that runs e.g. in your services, but I haven't taken this very far. There are also alternative solutions like Stateright for Rust. But again, not everyone has the benefit of these...
Unfortunately I got the product wrong; it was not Atlas, it was Realm Sync. All of the test-case generation stuff is in Section 5.
Using https://github.com/awslabs/shuttle which works on our real Rust code.
Yes, the model is more like an executable form of documentation. There’s no guarantee that code comments match what the code actually does; similarly there’s no guarantee that the TLA+ model matches what the system does.
Documentation is still generally useful, and so is a model. You have to be committed to keeping both up to date as the code evolves.
samsquire 774 days ago [-]
Thanks for sharing that and quoting an incredibly useful snippet.
This is such an interesting topic!
Some thoughts:
* I wonder if the approach could be used to implement debuggable replayability, with accurate tracing and profiling. A bit like what verdagon is doing with Vale.
* It could be used to integrate the event loop with tracing (rather than instrumentation with Jaeger)
* I really like the idea that "every object" is an event loop, which reminds me of Microsoft Orleans with its actor model for its grains.
* I am interested in actor and lightweight-thread architectures.
* I am interested in the scalability of the nodejs event loop architecture and Win32 desktop application programming.
* I think this approach could be used to test and simulate microservices.
* The approach could be used to test GUIs written in the React Redux reducer style.
AaronFriel 774 days ago [-]
When I was writing a Haskell client library for Hyperdex, another distributed kv store, I found it incredibly helpful to implement a simulator for correctness. This helped me identify which behavior was unspecified (arithmetic overflow: should error) or where my simulator deviated.
https://github.com/AaronFriel/hyhac/blob/master/test/Test/Hy...
Alas, I think Hyperdex development paused a few years later. It's a shame that it stopped then.
falsandtru 774 days ago [-]
I'm loving this point. The unfortunate thing is that those tests are closed source (I saw a maintainer say so, probably in an issue, a while back). It seems testable, but the tests still seem to be closed source. So we cannot fork the project even if FDB becomes totally closed source again.
aseipp 774 days ago [-]
No, the simulation harness and tests are open source and you can run them. It would be impossible for anyone to contribute without them anyway (for example, Snowflake, which heavily depends on it). It's built into the server binary directly, so the same code is always used; it's simply a different operational mode compared to the real server. I used to have a project to do lots of simulation runs on my big 32-core server and then aggregate the logs into ClickHouse for analysis. It wasn't that hard.
However, they (at least at the time; most of the developers were at Apple, many have now moved to Snowflake, and the Apple team has grown a little, I think) haven't released or integrated their nightly cluster and performance testing systems into the open, nor have they integrated them with GitHub Actions or nightly runs or anything. My understanding is that this is "just" a lot of compute cluster/platform orchestration code on top of the tests that exist in the repository. So, while Apple or Snowflake integrates changes across hundreds of concurrent fuzzing simulations on whatever platforms they have, if you write patches yourself, you're stuck with long simulation runs. Maybe that's changed; I haven't kept up since the 7.0 series.
In practice, if you write patches and they accept them, they will just do the testing in their runs for you, on a cluster far larger than what you could have. Failure reports will tell you how to reproduce them from the test files. As a contributor, testing the system on your own is mostly a matter of how much money or how many CPU cores you can personally stand to set on fire.
Someone could probably integrate this functionality into a Kubernetes operator or something so that outside engineers could run large scale simulations reliably. But it is really expensive and CPU/compute intense, no matter how you go about it.
Those tests are not the implementations of the tests; they just specify the test case and a few options. But I found the implementations. I am not sure if this is all of the simulation tests, but it seems to cover the basic cases.
[1] https://forums.foundationdb.org/t/how-to-use-foundationdb-un...
[2] https://github.com/apple/foundationdb/tree/main/tests
https://github.com/apple/foundationdb/tree/main/fdbserver/wo...
> Someone could probably integrate this functionality into a Kubernetes operator or something so that outside engineers could run large scale simulations reliably. But it is really expensive and CPU/compute intense, no matter how you go about it.
Maybe this.
https://github.com/FoundationDB/fdb-joshua
Yeah, that's basically an actually good implementation of the pile of crap that I threw together several years ago while writing a few patches. :)
And yes, I linked to the spec files because I feel there actually isn't that much test code written in Flow; the high-level specs in the .txt files can be mixed and matched to create a lot of variety from a small number of primitives, so that's really where all the good stuff is. Implementation vs interface, and all that.
FoundationDB has, in my experience, always been well regarded in DB development circles; I think their test architecture - developed to easily reproduce rare concurrency failures - is its best legacy, as mentioned in a comment above and frequently before.
However, since these topics are always filled with effusive praise in the comments, let me give an example of a distributed scenario where FDB has shortcomings: OLTP SQL.
First, FDB is clearly designed for “read often, update rarely” workloads, in a relative sense. It produces multiple consistent replicas which are consistently queryable at a past time stamp, without a transaction - excellent for that profile. However, its transaction consistency method is both optimistic and centralized, and can lead to difficulty writing during high contention and (brief) system-wide transaction downtime if there is a failover; while it will work, it’s not optimal for “write often, read once” workloads.
Secondly, while it is an ordered key value store - facilitating building SQL on top of it - the popular thought of layering SQL on top of the distributed layer comes with many shortcomings.
My key example of this is schema changes. Optimistic application, and keeping schema information entirely “above” the transaction layer, can make it extremely slow to apply changes to large tables, and possibly require taking them partially offline during the update. There are ways to manage this, but online schema changes will be a competitive advantage for other systems.
Even for read-only queries, you lose opportunities to push many types of predicates down to the storage node, where they can be executed with fewer round trips. Depending on how distributed your system is, this could add up to significant additional latency.
Afaik, all of the spanner-likes of the world push significant schema-specific information into their transaction layers - and utilize pessimistic locking - to facilitate these scenarios with competitive performance.
For reasons like these, I think FDB will find (and has found) the most success in warehousing scenarios, where individual records are queried often once written, and updates come in at a slower pace than the reads.
Dave_Rosenthal 774 days ago [-]
I totally agree with your high level point that there isn't a great SQL (OLTP, or otherwise) layer for FoundationDB. Building something like this would be very hard--but I don't think the FoundationDB storage engine itself would end up inflicting the limitations you mention if it were well executed. And FoundationDB was specifically designed for real-time workloads with mixed reads/writes (i.e. the OLTP case).
Whether or not concurrency is optimistic (or done with locks, or whatever) doesn't really have a bearing on things. Any database is going to suffer if it has a bunch of updates to specific hot keys that need to be isolated (in the ACID sense). As long as your reads and writes are sufficiently spread out you'll avoid lock contention/optimistic transaction retries.
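To make the optimistic part concrete, here is roughly the loop every FDB client runs, written out by hand as a sketch against the Python bindings (the `@fdb.transactional` decorator normally hides it):

```python
import fdb
fdb.api_version(630)
db = fdb.open()  # assumes a reachable cluster and the default cluster file

def increment(db, key):
    # Hand-rolled version of what @fdb.transactional does for you.
    tr = db.create_transaction()
    while True:
        try:
            v = tr[key]
            n = int(v) if v.present() else 0
            tr[key] = str(n + 1).encode()
            tr.commit().wait()  # OCC check happens here: if another txn
            return n + 1        # wrote `key` since our read, this raises
        except fdb.FDBError as e:
            tr.on_error(e).wait()  # backs off and resets if retryable,
                                   # re-raises otherwise

increment(db, b'counters/hot')  # many concurrent callers on one key means
                                # many retries: the hot-key contention above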
You speak to the real main limitation of FoundationDB when you talk about stuff like schema changes. There is a five-second transaction limit, which in practice means that you cannot, for example, do a single giant transaction to change every row in a table. This was definitely a deliberate design choice, but not one without tradeoffs. The bad side is that if you want to be able to do something like this (lock out clients while you migrate a table) you need a different design that uses another strategy, like indirection. The good side is that screwed-up transactions that lock big chunks of your DB for a long time don't take down your system.
I find that the people who are relatively new to databases tend to wish that the five second limit was gone because it makes things simpler to code. People that are running them in production tend to like it more because it avoids a slew of production issues.
That said, I think for many situations a timeout like 30 or 60 seconds (with a warning at 10) would be a better operating point rather than the default 5 second cliff.
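In practice the limit means a big migration has to be chunked into many small transactions. A sketch with the Python bindings (`transform` is a hypothetical stand-in for the real per-row rewrite):

```python
import fdb
fdb.api_version(630)
db = fdb.open()

def transform(value):
    return bytes(value)  # placeholder for the real migration logic

@fdb.transactional
def migrate_batch(tr, begin, end, limit=1000):
    # Each batch is one transaction, small enough to stay well under the
    # five-second limit. Returns the resume point, or None when finished.
    last = None
    for k, v in tr.get_range(begin, end, limit=limit):
        tr[k] = transform(v)
        last = k
    return fdb.KeySelector.first_greater_than(last) if last is not None else None

begin = fdb.KeySelector.first_greater_or_equal(b'table/')
end = fdb.KeySelector.first_greater_or_equal(b'table0')  # end of b'table/' prefix
while begin is not None:
    begin = migrate_batch(db, begin, end)
# Caveat: rows can change between batches, so the migration as a whole is not
# atomic. That is exactly the indirection/lockout tradeoff discussed above.
```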
mrtracy 774 days ago [-]
I think that the SQL-on-top and optimistic models are definitely things that can have a workload-dependent performance impact and are relevant.
All databases do suffer under some red line of write contention; but optimistic databases will suffer more, and will start degrading at a lower level of contention. “Avoiding contention” is database optimization table stakes, and you should be structuring every schema you can to do so; but hot keys are almost inevitable when a certain class of real-time product scales, and they will show up in ways you do not expect. When it happens, you’d like your DBMS to give as much runway as possible before you have to make the tough changes to break through.
SQL-on-top becomes an issue for geographic distribution; without "pushing down" predicates, read-modify-write workloads, table joins, etc. on the client can incur significant round-trip time issuing queries. I think the lack of this is always going to present a persistent disadvantage vs competitors that have it.
And again, given FDB's multiple-full-secondary model, it's only a problem when working in real time; slower queries can work off a local secondary. But latest-data latency is relevant for many applications.
aseipp 774 days ago [-]
FWIW, I believe read transactions are unlimited in duration now that the Redwood engine is available. But I haven't tested Redwood myself. Write transactions are still definitely limited to 5 seconds, though.
gregwebs 774 days ago [-]
TiDB uses TiKV as an equivalent to FoundationDB. It supports online migrations and pushing down read queries to the KV layer. It also defaults to optimistic locking, but supports pessimistic. And it doesn't have a five-second transaction limit. A SQL layer on top of FoundationDB could probably solve all these problems, and it wouldn't be novel.
mike_hearn 774 days ago [-]
You can do online schema changes with FDB, it all depends on what you do with the FDB primitives.
A great example of how to best utilize FDB is Permazen [1], described well in its white paper [2].
Permazen is a Java library, so it can be utilized from any JVM language e.g. via Truffle you get Python, JavaScript, Ruby, WASM + any bytecode language. It supports any sorted K/V backend so you can build and test locally with a simple disk or in memory impl, or RocksDB, or even a regular SQL database. Then you can point it at FoundationDB later when you're ready for scaling.
Permazen is not a SQL implementation. Instead it's "language integrated" meaning you write queries using the Java collections library and some helpers, in particular, NavigableSet and NavigableMap. In effect you write and hard code your query plans. However, for this you get many of the same features an RDBMS would have and then some more, for example you get indexes, indexes with compound keys, strongly typed and enforced schemas with ONLINE updates, strong type safety during schema changes (which are allowed to be arbitrary), sophisticated transaction support, tight control over caching and transactional "copy out", watching fields or objects for changes, constraints and the equivalent of foreign key constraints with better validation semantics than what JPA or SQL gives you, you can define any custom data derivation function for new kinds of "index", a CLI for ad-hoc querying, and a GUI for exploration of the data.
Oh yes, it also has a Raft implementation, so if you want multi-cluster FDB with Raft-driven failover you could do that too (iirc, FDB doesn't have this out of the box).
And because the K/V format is stable, it has some helpers to write in memory stores to byte arrays and streams, so you can use it as a serialization format too.
FDB has something a bit like this in its Record layer, but it's nowhere near as powerful or well thought out. Permazen is obscure and not widely used, but it's been deployed to production as part of a large US 911 dispatching system and is maintained.
Incremental schema evolution is possible because Permazen stores schema data in the K/V store, along with a version for each persisted object (row), and upgrades objects on the fly when they're first accessed.
[1] https://permazen.io/
[2] https://cdn.jsdelivr.net/gh/permazen/permazen@master/permaze...
100%. I don't have the time to read the paper but online schema changes, with the ability to fail and abort the entire operation if one row is invalid, are basically the same problem as background index building.
If instead of using some generic K/V backend, it made use of specific FDB features, it might be even better. Conflict ranges and snapshot reads have been useful for me for some background index building designs, and atomic ops have their uses.
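In FDB terms that combination looks something like the following sketch (Python bindings; my recollection of the exact conflict-range API may be off, so treat the calls as illustrative):

```python
import fdb
fdb.api_version(630)

@fdb.transactional
def index_batch(tr, begin, end, limit=500):
    # Snapshot reads add no read-conflict ranges, so scanning a big chunk of
    # the table doesn't conflict with every concurrent writer in that range.
    for k, v in tr.snapshot.get_range(begin, end, limit=limit):
        tr[b'idx/' + v + b'\x00' + k] = b''
        # Opt back in to a conflict on just the rows we actually indexed, so
        # a concurrent update to one of them forces this batch to retry.
        tr.add_read_conflict_key(k)
```

The point of the pattern: you pay for conflicts only on the rows the batch touched, not on the whole range you scanned.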
> Oh yes, it also has a Raft implementation, so if you want multi-cluster FDB with Raft-driven failover you could do that too (iirc, FDB doesn't have this out of the box).
I don't know what you mean by this. Multiple FDB clusters?
mike_hearn 774 days ago [-]
It supports atomic ops and snapshot reads. Don't remember about conflict ranges. It doesn't require all backends to be identical, it supports a kind of graceful degradation when backends don't have all the features. The creator is quite keen on FDB and made sure Permazen works well with it.
Yes multiple FDB clusters. IIRC FDB replication doesn't support full geo-replication, or didn't. There's a post by me about it somewhere on their forums.
rdtsc 773 days ago [-]
> First, FDB is clearly designed for “read often, update rarely” workloads, in a relative sense. It produces multiple consistent replicas which are consistently queryable at a past time stamp
For reading it has a 5 second snapshot timeout that gets in the way. One can stitch multiple transactions together but that could mean losing snapshot isolation without further tricks.
In other words, even just for read-mostly workloads it has a few warts.
preseinger 774 days ago [-]
do you think the things you mention were deliberate design decisions?
mrtracy 774 days ago [-]
They absolutely were, yes. There are very valuable application profiles where FoundationDB's design is excellent, and you can see that from its internal usage at large companies like Apple and Snowflake.
mike_hearn 774 days ago [-]
Yes, one of the nice things about FDB is it has extensive design docs. Optimizing for reading more often than writing is obviously a pretty normal design choice; outside of log ingestion you'll normally be reading more than writing. There are people using FDB for logs (Snowflake iirc?) and it's been optimized for that sort of use case more in recent years, but it's not like it was an unreasonable choice.
aseipp 774 days ago [-]
Snowflake uses FoundationDB for warehouse metadata in the control plane, IIRC. It is not in the data plane path for log ingestion or other warehousing tech. That said the control plane is, uh, pretty important!
romanhn 774 days ago [-]
Back in 2014 or so, I saw the FoundationDB team demo the product at a developer conference. They had the database running across a bunch of machines, with a visual showing their health and data distribution. One team member would then be turning machines on and off (or maybe unplugging them from the network) and you could see FDB effortlessly rebalancing the data across the available nodes. It was a very striking, impressive presentation (especially as we were dealing with the challenges of distributed Cassandra at the time).
boxcarr 774 days ago [-]
When I saw the post about Foundation DB, I remembered the exact same demo running on a cluster of Raspberry Pi instances! Sadly, no memory of it on YouTube.
I feel like I saw something a bit more refined (I recall node statuses aggregated on one cool UI), so this may have been an earlier iteration, but the beginning of the following video has some of what we're talking about: https://youtu.be/Nrb3LN7X1Pg
https://news.ycombinator.com/item?id=5739721
jwr 774 days ago [-]
FoundationDB is absolutely incredible and I've been wondering why it doesn't get more popular over time. I suspect it's too complex to use directly in most applications, with people used to SQL-based solutions or simple KV stores.
I always wanted my app to use a fully distributed database (for redundancy). I've been using RethinkDB in production for over 8 years now. I'm slowly rebuilding my app to use FoundationDB.
What I discovered when I started using FDB surprised me a bit. To make really good use of the database you can't really use a "database layer" and fully abstract it away from your app. Your code should be fully aware of transaction boundaries, for example. To make good use of versionstamps (an incredible feature) your code needs to be somewhat aware of them.
I think FDB is a great candidate for implementing a "user-friendly" database on top of it, and in fact several databases are doing exactly that (using FDB as a "lower layer"). But that abstracts away too much, at least for me.
The superficial take on FDB is "waah, where are my features? it doesn't do indexing? waaah, just use Postgres!".
But when you actually start mapping your app's data structures onto FDB optimally, you discover a whole new world. For example, I ended up writing my indexing code myself, in my language (Clojure). FDB gives you all the tools, and a strict serializable data model to work with — your language brings your data structures and your indexing functions. The combination is incredible. Once you define your index functions in your language, you will never want to look at SQL again. Plus, you get incredible features like versionstamps — I use them to replace RethinkDB changefeeds and implement super quick polling for recent changes.
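To illustrate the shape of this (in Python rather than Clojure, with a hypothetical key layout): maintaining your own secondary index is just a couple of extra writes in the same transaction, and a versionstamped key gives you a commit-ordered changefeed to poll:

```python
import fdb
fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def set_email(tr, user_id, old_email, new_email):
    # Record and index are updated in one strictly-serializable transaction,
    # so the index can never drift out of sync with the data.
    tr[fdb.tuple.pack(('user', user_id, 'email'))] = new_email
    if old_email is not None:
        del tr[fdb.tuple.pack(('idx', 'email', old_email, user_id))]
    tr[fdb.tuple.pack(('idx', 'email', new_email, user_id))] = b''
    # Changefeed entry: the versionstamp is filled in at commit time and is
    # globally ordered, so a poller can resume from the last stamp it saw.
    feed_key = fdb.tuple.pack_with_versionstamp(
        ('feed', fdb.tuple.Versionstamp(), user_id))
    tr.set_versionstamped_key(feed_key, b'')

@fdb.transactional
def users_with_email(tr, email):
    # Index lookup is a plain range read over the ordered keyspace.
    return [fdb.tuple.unpack(k)[-1]
            for k, _ in tr[fdb.tuple.range(('idx', 'email', email))]]
```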
Oh, and did I mention that it is a fully distributed database that correctly implements the strict serializable consistency model? There are very few dbs that can claim that. If you understand what that means, you probably know how incredible this is and I don't have to convince you. If you think you understand, I suggest you go and explore https://jepsen.io/consistency — carefully reading and learning about the differences in various consistency models.
I really worry that FoundationDB will not become popular because of its inherent complexity, while worse solutions (ahem, MongoDB) will be more fashionable.
I would encourage everyone to at least take a look at FDB. It really is something quite different.
spullara 773 days ago [-]
This is absolutely right. I built Wavefront's telemetry store and entity store directly on top of it and for all these reasons you can do magical things using it natively. These are also the reasons I recommended it for Snowflake's metadata store. Finally, it is why I invested in the company. FDB is insanely good if you take advantage of it.
DenisM 773 days ago [-]
> These are also the reasons I recommended it for Snowflake's metadata store
Are you saying you were “in the room” when the decision was made? Can you elaborate?
Edit: indeed you were… https://www.snowflake.com/wp-content/uploads/2020/11/Rise-of...
FDB is fault-tolerant, auto-scaling, auto-healing, and auto-sharding. The sorted bytes-to-bytes map data format and multi-row transactions lets a skilled developer build almost anything on top of it. Atomic operations let you do things with higher performance without worrying about conflicts. Ops can be a learning curve but it is mostly because people run it at the red-line 24/7 for maximum efficiency. If you were running it at 3/4 capacity you would never worry about it at all. A single binary for all roles makes it very easy to deploy.
AFAIK, no one has lost a byte to an FDB issue.
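As a tiny example of the atomic-operations point (a sketch with the Python bindings): an increment that commits without ever reading, so concurrent writers to the same hot key never conflict with each other:

```python
import fdb, struct
fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def record_hit(tr, page):
    # ADD mutation with a 64-bit little-endian operand: applied server-side
    # at commit, with no read and hence no read-conflict range to collide on.
    tr.add(fdb.tuple.pack(('hits', page)), struct.pack('<q', 1))

record_hit(db, 'home')
```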
sigstoat 774 days ago [-]
> The superficial take on FDB is "waah, where are my features? it doesn't do indexing? waaah, just use Postgres!".
i love fdb, but, most people _should_ just use postgres. you should have a very precise explanation for why you want fdb instead of postgres. you're trading away a lot of things when you go to fdb.
> I would encourage everyone to at least take a look at FDB.
yes, make sure you're aware of it so you can spot the situations where it _is_ the right answer.
manish_gill 774 days ago [-]
Seems like a use-case very similar to Zookeeper: use it for distributed coordination / consistency etc. and build your actual database on top of it.
endisneigh 774 days ago [-]
I’ve been using FDB for toy projects for a while. It’s truly rock solid. In my experience it’s the best open source database I’ve used, including mariadb, Postgres and cockroach. That being said, I wish there were more layers as the functionality out of the box is very very limited.
Ideally someone could implement the firestore or dynamodb api on top.
https://github.com/losfair/mvsqlite
It is basically distributed SQLite backed by FDB. I've been scared to use it since I don't know Rust and can't attest to whether MVCC has been implemented correctly.
In using this I actually realized how coupled the storage engine is to the storage system, and how few open source projects make the storage engine easily swappable.
tommiegannert 774 days ago [-]
I really wanted to use FoundationDB for building a graph database, but was taken aback by the limitations on record sizes (10 kB keys + 100 kB values) and, to some extent, transaction sizes (10 MB) [1]. And the documentation [2] doesn't really give any answers other than "build it yourself."
mvsqlite seems to improve the transaction size [3], which is nice. Does it also improve the key/value limitations?
> Transaction size cannot exceed 10,000,000 bytes of affected data. [---] Keys cannot exceed 10,000 bytes in size. Values cannot exceed 100,000 bytes in size.
[1] https://apple.github.io/foundationdb/known-limitations.html
[2] https://apple.github.io/foundationdb/largeval.html
Transaction size and duration are limited to keep the latency and throughput of the system manageable under load, from my understanding. It makes sense to some degree even with no background in the design: if you are serving X rps with a latency of Y milliseconds using Z resources, and you double Y, you now need to double your resources Z as well to serve the same number of clients. You always hit a cap somewhere, so if you want consistent throughput and latency, it's maybe not a bad tradeoff.
mvsqlite fixes the transaction size through its own transaction layer, from my understanding; I don't know how that would impact performance. The 10 kB/100 kB key/value limits are probably not fixable in any way, but they're not really a huge problem in practice for an FDB user, because you can just shard the value across two (or more) keys in a single transaction and it's fine. 10-kilobyte keys have pretty much never been an issue in my cases either; you can typically just do something like hash a really big key before insert and use that.
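A sketch of both workarounds with the Python bindings (the key layout here is hypothetical): chunk big values under one tuple prefix inside a single transaction, and digest oversized keys:

```python
import fdb, hashlib
fdb.api_version(630)
db = fdb.open()

CHUNK = 90_000  # comfortably under the 100 kB value limit

@fdb.transactional
def write_blob(tr, name, blob):
    # All chunks land in one transaction, so readers see all or nothing.
    tr.clear_range_startswith(fdb.tuple.pack(('blob', name)))
    for i in range(0, max(len(blob), 1), CHUNK):
        tr[fdb.tuple.pack(('blob', name, i // CHUNK))] = blob[i:i + CHUNK]

@fdb.transactional
def read_blob(tr, name):
    # Chunks sort by index under the prefix; a range read reassembles them.
    return b''.join(v for _, v in tr[fdb.tuple.range(('blob', name))])

def shrink_key(big_key: bytes) -> bytes:
    # For keys near the 10 kB limit: index by digest, and keep the original
    # key around as a value if you ever need to enumerate it.
    return fdb.tuple.pack(('h', hashlib.sha256(big_key).digest()))
```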
tommiegannert 772 days ago [-]
To answer myself, it looks like mvsqlite operates as a VFS implementation, which means it works with pages, not rows. That should decouple size limits.
fhrow4484 774 days ago [-]
> That being said, I wish there were more layers as the functionality out of the box is very very limited.
The record layer https://github.com/FoundationDB/fdb-record-layer which allows you to store protobuf, and define the primary keys and indexes directly on those proto fields, is truly amazing:
https://github.com/FoundationDB/fdb-record-layer/blob/main/d...
Yea, but mvsqlite implements its own transaction layer to get around the limitations around transactions.
georgelyon 774 days ago [-]
FoundationDB is a truly one-of-a-kind bit of technology. Others have already linked to the testing methodology that allows them to run orders of magnitude more database hours in test than have run in production: https://www.youtube.com/watch?v=4fFDFbi3toc
A less known but also great talk is the follow-up, which talks about what a few of the team worked on next, effectively trying to generalize the methodology to any computer program: https://www.youtube.com/watch?v=fFSPwJFXVlw
I liken the approach to being able to fuzz the execution space of the program, not just the inputs.
https://deno.com/kv on FoundationDB!
How hard have people pushed this thing? We get regular threads of effusive praise, but little criticism. Last time I mentioned that years ago my colleagues found half a dozen ways to lose data in FDB I got called out here and even in private emails, but it seems more valuable to know where the limits of these systems are, and not very valuable to read the positive feelings of people who used FDB in trivial and uncritical ways.
Dave_Rosenthal 774 days ago [-]
Yes, there are definitely a lot of big companies that have used FoundationDB very hard at huge scale for many years. That said, yeah, it feels like there are also a lot of folks on HN who just jump on the "cool, fault simulation" bandwagon and don't have a lot of personal real-world experience.
What I can tell you, for sure, is that if you find an issue with something as important and fundamental as data loss the team working on FoundationDB would take it super seriously.
ryanworl 774 days ago [-]
FoundationDB is used at Datadog as the metadata store for Husky, the storage and query engine powering a significant number of Datadog products, such as logs, network performance monitoring, and trace analytics.
I was involved with this project from the beginning and it would've taken significantly longer to deliver without FoundationDB.
1. https://www.datadoghq.com/blog/engineering/introducing-husky...
2. https://www.datadoghq.com/blog/engineering/husky-deep-dive/
3. https://www.youtube.com/watch?v=mNneCaZewTg
4. https://www.youtube.com/watch?v=1-zo9jqdRZU
jeffbee 774 days ago [-]
I know there are multiple companies that use it. The question is not whether people put things into FDB. The question is whether anyone has checked to see if their junk was still there later. I don't consider large scale deployments to be proof of anything. When I worked on Gmail we were still finding data-loss bugs in either BigTable or Colossus regularly, even after those systems had been the largest datastores on the planet for many years.
https://news.ycombinator.com/item?id=16880404
Also, in the post itself, whose authors include Apple and Snowflake devs, it mentions it's run in production by Apple and Snowflake.
I haven't seen yet though what Apple uses it for.
https://machinelearning.apple.com/research/foundationdb-reco...
jeffbee 774 days ago [-]
The time at which my colleagues found easy ways to lose data was well after Apple had claimed to use it in iCloud at scale. So, I don't think deployment at scale is a proof of correctness. The thing that needs doing is regularly looking in the database for things that should be there.
endisneigh 774 days ago [-]
I’m curious - could you elaborate on the circumstances? Like the version of FDB, cluster size, network circumstances, etc?
jeffbee 773 days ago [-]
I don't recall any of those details, but the test involved injecting a bogus block device that always returns garbage, and noting that this results in garbage records returned from client queries. And I don't think those kinds of issues have been eradicated; browsing through their GitHub issues, there are people trying to recover corrupted clusters. https://github.com/apple/foundationdb/issues/2480
Part of the FDB team (great folks) went on to create something quite incredible that I have the pleasure of having early access to. If you're into dependability, check this out: https://antithesis.com/
skybrian 773 days ago [-]
Could you summarize? What kind of tool is it?
mprime1 772 days ago [-]
The intro video on the homepage does a good job at explaining it in 3 minutes.
The video is not upselling. The tool can do what the video promises.
Say your system is "well tested", for example with a combination of unit tests, integration tests, stress tests, failure-injection tests (a la Jepsen), and more.
There are probably still hundreds of nasty little bugs hidden. Not even thousands of experiment hours with a Jepsen-like approach would surface them because they are very, very unlikely.
Antithesis will find these without breaking a sweat. It's designed to hunt for very unlikely (but possible) scenarios where your system misbehaves.
And here's the real kicker: once a bug is found, you can observe and step-through execution of your entire _distributed_ system. Similar to attaching a debugger to a single process but for an entire system composed of many clients and servers connected by a network.
It's language independent and doesn't require any modification to your system in order to use. It's pretty incredible. I would not believe this is possible if I hadn't seen it with my own eyes.
Presently, I am using it for work at Synadia (makers of NATS). NATS is like lego blocks to build all kinds of distributed systems in multiple languages, so it has a very large surface area. It's well tested, stable, and deployed successfully by many large and small projects and companies. Contemporary testing approaches can hardly find bugs. Antithesis can find very insidious edge cases where things break. And we proactively investigate and fix before any user/customer can be affected by these one-in-a-million nasty bugs, which would otherwise be very hard to find and resolve.
mrAssHat 773 days ago [-]
That's a tool for developer productivity and customer satisfaction.
Also, they are hiring.
neftaly 774 days ago [-]
I've been tooling around with "Tuple Database", which claims to be FoundationDB for the frontend (by the original dev of Notion).
https://github.com/ccorcos/tuple-database/
I have found it conceptually similar to Relic or Datascript, but with strong performance guarantees - something Relic considers a potential issue. It also solves the problem of using reactive queries to trigger things like popups and fullscreen requests, which must be run in the same event loop as user input.
https://github.com/wotbrew/relic https://github.com/tonsky/datascript
Having a full (fast!) database as my React state manager gives me LAMP nostalgia :)
jbverschoor 774 days ago [-]
Around 2010-2013 (gaming), I found FDB, and to me it seemed like the perfect database because of its architecture. I tried it a bit, and was really happy with it.
Unfortunately they were acquired by Apple, only to resurface something like 10 years later. All momentum was gone, and I’m not really aware nor interested in where they stand. I’ll stick with my rusty old Postgres for a long time before I’d try anything else out.
qaq 774 days ago [-]
Surprised no one has used it as the foundation for a NewSQL DB; the thing is battle tested and actively developed by Apple and Snowflake.
danpalmer 774 days ago [-]
I think I remember the FDB team developing one that was closed source back before their acquisition. I thought the business model was going to be open core and closed, paid, layers on top. I seem to remember them benchmarking the SQL layer and it being highly performant still, despite the complexity it added.
Maybe this thing still exists in closed-source form at Apple? It wouldn't surprise me if it does and forms the basis of a Spanner alternative; they're big enough to need it. Or maybe they canned it pre/post acquisition.
Edit: ah, you've already mentioned the closed source layer that exists at Apple. There we go!
It's similar to mongo (it's nosql)
Not a NewSQL database though as GP mentioned. I don't think Tigris has a SQL layer.
endisneigh 774 days ago [-]
Yes, I know. I explicitly said it was similar to mongo. I was just responding to the bit about it being battle tested and used as a foundation (no pun intended) for another db. As far as I know it's the only database with a company around it that is using FDB.
qaq 774 days ago [-]
There was a PoC of SQLite on top of FDB. There is also a SQL layer that Apple did not open source, which they use at scale. It just seems a wasted opportunity.
endisneigh 774 days ago [-]
It’s because you introduce a lot of latency. CockroachDB for example (which is a great db) has a lot of latency compared to Postgres.
At the time of its release it was probably hard to justify having an order of magnitude more latency than competitors (of course they were not fault tolerant, but still).
riku_iki 774 days ago [-]
Hypothetically, you can run Cockroach with replication factor 1, and have both low latency and an apples-to-apples comparison.
canadiantim 774 days ago [-]
I know some people have had success using FoundationDB as a KV store with SurrealDB [1]
[1] https://github.com/orgs/surrealdb/discussions/25
I built an online / mutable time-series database using FDB a few years back at a previous company. Not only was it rock solid, but it scaled linearly pretty effortlessly. It truly is one of the novel modern pieces of technology out there, and I wish there were more layers built on top of it.
tanepiper 774 days ago [-]
A few years ago I was working at an agency, one of their teams was building a real-time gaming system on top of FoundationDB.
Apple then bought it up and shut the open source down. They had to rebuild whole layers from scratch.
Dave_Rosenthal 774 days ago [-]
Yeah, that sucked for sure and we hated to disappoint people like that (co-founder here). But you have it exactly backwards. FoundationDB was never open source. There was a binary that you could download and use as a trial, or you could buy a license for real use. The users that bought licenses got to keep using those licenses. Some of those customers went on to build billion-dollar businesses on top of FoundationDB (Snowflake!) A few years after acquiring the tech Apple themselves open sourced it (!) so now it is open source. The big challenge for users is that most of the sophisticated "layers" that make the tech into more of an easy-to-use database rather than just a storage engine are still proprietary.
tanepiper 773 days ago [-]
Yea it was over 8 years ago, I didn't work on the team so I don't remember all the details - I just remember this decision impacted them hugely. Hopefully it worked out well for you, but the decision - weeks from their launch - impacted their mental health, as well as the project.
FoundationDB was a swear word.
58028641 774 days ago [-]
As far as I can tell, FoundationDB was never open source until Apple open sourced it.
It was never open source before apple. Rather the binary was freely available to be used. When apple bought them they took it away but continued to support customers with contracts. In that way it was inaccessible until it was open sourced.
metadat 774 days ago [-]
Yes, I got bitten by this and will never forget- FDB abruptly shut off public access in mid-2015. Fortunately for me, it only cost half a day to migrate my system to Postgres.
And here's a news story - https://www.forbes.com/sites/benkepes/2015/03/25/a-cautionar...
detaro 774 days ago [-]
... which both state that it wasn't open-source before the Apple buyout.
stephenr 774 days ago [-]
Hey now, don't let verifiable facts and observed history get in the way of a chance to bash Apple.
tanepiper 773 days ago [-]
I'm sorry I mis-remembered the exact details of another team that I didn't work on and tech stack from 8 years ago. I'll go give myself 100 hail Marys.
The point wasn't to bash Apple - the point was that the team lost a lot of money and time and shipped an inferior product because Apple chose to purchase and close down part of the stack.
Every time I look at FoundationDB for replacing some Redis usage I wonder about key expiry/TTL, look for it and find nothing.
Is this such a strange use case, that there is not even a blog entry about it only some forum entries?
endisneigh 774 days ago [-]
You would need to implement that yourself. It can easily be done by storing tuples with your expiry date. You could then watch the keys to remove expired keys automatically. FDB is very barebones by design. Alternatively (and easier):
https://forums.foundationdb.org/t/designing-key-value-expira...
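A sketch of that scheme with the Python bindings (the subspace names are hypothetical): every write also records a (deadline, key) entry in an expiry subspace, and a periodic sweeper clears anything whose deadline has passed. A reader that must never see stale entries can additionally store the deadline next to the value and check it on read:

```python
import fdb, time
fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def set_with_ttl(tr, key, value, ttl_seconds):
    # Value plus an index entry ordered by expiry deadline.
    deadline = int(time.time()) + ttl_seconds
    tr[fdb.tuple.pack(('data', key))] = value
    tr[fdb.tuple.pack(('expiry', deadline, key))] = b''

@fdb.transactional
def sweep_expired(tr, limit=100):
    # Expiry keys sort by deadline, so expired entries are one range read.
    begin = fdb.tuple.pack(('expiry',))
    end = fdb.tuple.pack(('expiry', int(time.time())))
    for k, _ in tr.get_range(begin, end, limit=limit):
        _, _, key = fdb.tuple.unpack(k)
        del tr[fdb.tuple.pack(('data', key))]
        del tr[k]

sweep_expired(db)  # run periodically from a scheduled background job
```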
I worked next to the founders a decade ago and tried the first versions of the project (before Apple acq). Loved the concept, but it hasn't really lived up to the promise.
dangoodmanUT 773 days ago [-]
It's a shame that they have left the Go bindings back in version 6, despite 7 being out for some time now.
brainzap 774 days ago [-]
I think FoundationDB will also have parts written in Swift; at least that is what Apple showed at WWDC [1].
[1]: https://developer.apple.com/videos/play/wwdc2023/10164/?time...