In October 2019, we made a critical mistake that led to 10,000 important files disappearing overnight. It was a disaster—one that could have ruined our business. But five years later, that same experience saved our new company from an even bigger crisis.
This is a story about data loss, misconfigurations, and the hard lessons that led us to build a bulletproof backup system. If you're running a system that stores critical data, this could help you avoid making the same mistakes.
Background: How Gama Stored Data
Gama (gama.ir) is a K-12 educational content-sharing platform launched in 2014 in Iran, with over 10 million users worldwide. It provides services such as:
Since our content was user-generated, maintaining secure file storage was a top priority. We used MooseFS, a distributed file system with five nodes and a triple-replication model, ensuring redundancy.
Our Backup Strategy
Our backup was a simple external HDD where we stored copies of every file. It worked fine, and we rarely needed it. But then, we made a dangerous assumption.
The Migration That Led to Disaster
One of our engineers suggested migrating to GlusterFS, a more well-known distributed file system. It sounded great: more scalability, higher adoption, and seemingly better performance. After evaluating the cost-benefit tradeoff, we decided to switch.
Two months later, the migration was complete. Our team was thrilled with the new system. Everything seemed stable… until it wasn’t.
There was just one small problem:
Our backup HDD was 90% full, and we needed to make a decision.
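A full backup drive is a capacity problem, and it can be caught early with a simple check. The following is a minimal sketch, not something we ran at the time: the function names and the 90% threshold are illustrative assumptions, and "." stands in for the backup drive's mount point.

```python
import shutil


def backup_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual filesystems
    return usage.used / usage.total


def needs_attention(used_fraction: float, threshold: float = 0.9) -> bool:
    """True once the backup volume is at or above the alert threshold."""
    return used_fraction >= threshold


if __name__ == "__main__":
    # "." is a placeholder for the backup drive's mount point (hypothetical).
    fraction = backup_usage_fraction(".")
    if needs_attention(fraction):
        print(f"Backup volume is {fraction:.0%} full: add capacity before pruning backups")
    else:
        print(f"Backup volume is {fraction:.0%} full")
```

Run on a schedule, a check like this turns "the drive is almost full" into a prompt to add capacity rather than a reason to drop the backup.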
The Mistake
Because we had never really needed our full backups before, we assumed GlusterFS was reliable enough.
We removed our old backup strategy and trusted GlusterFS replication.
That was a bad decision.
The Day Everything Went Wrong
One morning two months later, we started receiving reports that some files were missing.
At first, we thought it was a network glitch, something minor. But as we dug deeper, we found that Gluster was showing missing chunks and sync errors.
3:30 AM: We decided to restart the Gluster network, believing a fresh bootstrap would fix the problem. At first, it seemed to work!
We thought we had solved it.
Then, a WhatsApp message from the content team came in:
“The files are empty.”
Wait, what? The files existed, but they contained nothing.
We checked manually. The files still had size and metadata, but when we opened them, they were completely blank.
10,000 files were gone.
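Files that keep their size and metadata but contain nothing will pass any check that only lists a directory. A content-level scan is needed to find them. Below is a minimal sketch, assuming that "blank" means every byte reads back as zero; that is an assumption about the failure mode, and the function names and sample file names are hypothetical.

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time to keep memory use flat


def is_blank(path: str) -> bool:
    """True if the file has a nonzero size but every byte read is zero."""
    if os.path.getsize(path) == 0:
        return False  # a genuinely empty file is a different case
    with open(path, "rb") as handle:
        while chunk := handle.read(CHUNK_SIZE):
            if any(chunk):  # any nonzero byte means real content survives
                return False
    return True


def find_blank_files(root: str) -> list[str]:
    """Walk root and return the sorted paths of files that look hollowed out."""
    blank = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            if is_blank(path):
                blank.append(path)
    return sorted(blank)


if __name__ == "__main__":
    # Build a throwaway directory with one hollow file and one intact file.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "hollow.pdf"), "wb") as f:
            f.write(b"\x00" * 4096)
        with open(os.path.join(root, "intact.pdf"), "wb") as f:
            f.write(b"%PDF-1.4 sample")
        print(find_blank_files(root))
```

A scan like this measures the damage after the fact. It does not recover anything, which is why the backup mattered so much.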
The Backup That Was Useless
We had a backup HDD. That should have saved us, right?
Wrong. Because after migrating to GlusterFS, we had restructured our directory system. Every file had a new hashed path in the database.
Our old backups were useless because they had different filenames.
We tried multiple recovery methods. Nothing worked.
In the end, we had to email thousands of users, asking them to re-upload their lost files.
It was a nightmare. But it forced us to rethink everything.
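The underlying problem was that a backup could only be found by its path, and the paths had changed. One way to avoid that coupling is to record a content hash for each file at upload time and index the backup by the same hash. This is a minimal sketch of that idea, not a description of what we had in place: it assumes a SHA-256 digest stored alongside each database record, and all function names are hypothetical.

```python
import hashlib
import os
import tempfile
from typing import Dict, Optional


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_backup(backup_root: str) -> Dict[str, str]:
    """Map content hash -> backup path for every file under backup_root."""
    index: Dict[str, str] = {}
    for directory, _subdirs, names in os.walk(backup_root):
        for name in names:
            path = os.path.join(directory, name)
            # Keep the first path seen for each distinct content hash.
            index.setdefault(sha256_of(path), path)
    return index


def locate_in_backup(recorded_hash: str, index: Dict[str, str]) -> Optional[str]:
    """Find a backup copy by the hash recorded at upload time, whatever its name."""
    return index.get(recorded_hash)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as backup_root:
        # The backup still uses the old file name.
        with open(os.path.join(backup_root, "old-name.pdf"), "wb") as f:
            f.write(b"lesson plan")
        # Assumed: this hash was stored in the database when the file was uploaded.
        recorded = hashlib.sha256(b"lesson plan").hexdigest()
        print(locate_in_backup(recorded, index_backup(backup_root)))
```

The lookup only works if the hash was recorded while the file was still intact. Hashing a file that has already gone blank produces a digest that matches nothing in the backup.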
How We Fixed It: Introducing Gama File Keeper (GFK)
After this disaster, we completely redesigned our storage and backup strategy. Our solution had two parts:
1. Gama File Keeper (GFK): A Smarter Storage System
We no longer rely on a single storage system. Instead, we implemented a three-layered backup strategy:
Fast forward five years. Gamatrain.com, our new business in the UK, faced another rare incident.
But this time, we didn’t lose a single file.
Why? Because of the lessons we learned in 2019 and the system we built to prevent it.
Lessons for Every Engineer