If you were to design a system like Discord, how would you go about it? Specifically, how would you store all the messages?
Discord has about 150 million monthly active users and 19 million weekly active servers. They store all chat history forever so users can come back at any time and have their data available on any device. That adds up to billions of messages, still increasing in velocity and size. 🤯
What's your first thought or question while designing such a system?
Usage pattern. It largely decides how you should store the messages. Discord's reads and writes are roughly 50/50, and its reads are extremely random. It has voice chats, private chats and large public servers that rack up millions of messages in a month. 📈
Earlier, Discord stored everything in a single MongoDB replica set so they could iterate quickly. Messages were indexed with a single compound index on channel_id and created_at. Slowly, with millions of messages pouring in, the data and the index could no longer fit in RAM, and latencies became unpredictable. They had to move to another database.
Here came Cassandra! An open-source, linearly scalable, distributed database. Discord could now add nodes to scale out and keep replicas to tolerate the loss of nodes. Cassandra stores related data contiguously on disk, which keeps seeks to a minimum. Understanding its data model can be easy. The primary key has two parts: a partition key, which determines which node the data lives on and where it sits on disk, and a clustering key, which identifies a row within that partition.
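To make the two roles concrete, here is a toy in-memory sketch in Python. It is not how Cassandra is implemented: real clusters place partitions on a token ring with replication, while this sketch just hashes the partition key and takes it modulo the node count. The class name `ToyCluster` and the node names are made up for illustration.

```python
import bisect
import hashlib
from collections import defaultdict


class ToyCluster:
    """Toy model of partition key vs. clustering key (illustration only)."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # node name -> partition key -> rows kept sorted by clustering key
        self.nodes = {name: defaultdict(list) for name in self.node_names}

    def node_for(self, partition_key):
        # The partition key alone decides which node owns the data.
        # Assumption: simple hash-modulo placement, unlike a real token ring.
        digest = hashlib.md5(repr(partition_key).encode()).digest()
        token = int.from_bytes(digest[:8], "big")
        return self.node_names[token % len(self.node_names)]

    def insert(self, partition_key, clustering_key, value):
        rows = self.nodes[self.node_for(partition_key)][partition_key]
        # The clustering key decides the row's position inside the partition.
        bisect.insort(rows, (clustering_key, value))

    def read_partition(self, partition_key):
        node = self.nodes[self.node_for(partition_key)]
        return list(node.get(partition_key, []))


cluster = ToyCluster(["node-a", "node-b", "node-c"])
cluster.insert("channel-1", 30, "third")
cluster.insert("channel-1", 10, "first")
cluster.insert("channel-1", 20, "second")
print(cluster.read_partition("channel-1"))
# [(10, 'first'), (20, 'second'), (30, 'third')]
```

All rows of one partition live together on one node and come back already ordered by the clustering key, which is why reading a channel's recent messages can be a cheap sequential scan.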
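Cassandra partition keys can be compounded, so the new primary key became
((channel_id, bucket), message_id).
Here bucket is a fixed time window, so even a very busy channel is split across many bounded partitions, and message_id orders the rows inside each one.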
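How could a bucket be derived? A minimal sketch in Python, assuming message IDs are Snowflake-style integers whose bits above the lowest 22 hold milliseconds since a custom epoch, and assuming a 10-day bucket. The epoch value, bucket size and function names below are illustrative assumptions, not a confirmed copy of Discord's code.

```python
import time

# Assumption: custom epoch of 2015-01-01T00:00:00Z, in milliseconds.
EPOCH_MS = 1420070400000
# Assumption: each bucket covers a fixed 10-day window.
BUCKET_SIZE_MS = 1000 * 60 * 60 * 24 * 10


def make_bucket(message_id=None):
    """Return the time bucket for a message ID, or for 'now' if None."""
    if message_id is None:
        timestamp_ms = int(time.time() * 1000) - EPOCH_MS
    else:
        # Assumption: the timestamp sits above the low 22 bits of the ID.
        timestamp_ms = message_id >> 22
    return timestamp_ms // BUCKET_SIZE_MS


def make_buckets(start_id, end_id=None):
    """Buckets that must be scanned to cover an ID range (inclusive)."""
    return range(make_bucket(start_id), make_bucket(end_id) + 1)


def partition_key(channel_id, message_id):
    """The (channel_id, bucket) pair that locates a message's partition."""
    return (channel_id, make_bucket(message_id))


# A hypothetical ID created exactly 25 days after the epoch lands in bucket 2.
example_id = (25 * 24 * 60 * 60 * 1000) << 22
print(partition_key(1234, example_id))  # (1234, 2)
```

Because the bucket is computed from the message ID itself, a reader who knows the channel and roughly when a message was sent can go straight to the right partition instead of scanning the whole channel.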
Discord had a 20-node cluster a few years ago, and as data grows, they will keep adding nodes as needed. They should do fine even with more data, because companies like Netflix and Apple run clusters of hundreds of nodes.
People often argue about one database being the best, but it's all about use cases and trade-offs. 🤷🏻
How do you store your product data, and why? 🤔