
How We Lost 10,000 Files Overnight—And Built a Bulletproof Backup System

Tags: content small
DATE POSTED: March 4, 2025
The Disaster That Changed Everything

In October 2019, we made a critical mistake that led to 10,000 important files disappearing overnight. It was a disaster—one that could have ruined our business. But five years later, that same experience saved our new company from an even bigger crisis.

This is a story about data loss, misconfigurations, and the hard lessons that led us to build a bulletproof backup system. If you're running a system that stores critical data, this could help you avoid making the same mistakes.

Background: How Gama Stored Data

Gama (gama.ir) is a K-12 educational content-sharing platform launched in 2014 in Iran, with over 10 million users worldwide. It provides services such as:

  • Past Papers
  • Tutorials & Learning Resources
  • Online Exams & School Hub
  • Live Streaming & Q&A Community
  • Tutoring Services

Since our content was user-generated, maintaining secure file storage was a top priority. We used MooseFS, a distributed file system, running on five nodes with triple replication for redundancy.

Our Backup Strategy

Our backup strategy was simple: an external HDD where we stored copies of every file. It worked fine, and we rarely needed it. But then, we made a dangerous assumption.

The Migration That Led to Disaster

One of our engineers suggested migrating to GlusterFS, a better-known distributed file system. It sounded great: more scalability, higher adoption, and seemingly better performance. After weighing the costs and benefits, we decided to switch.

Two months later, the migration was complete. Our team was thrilled with the new system. Everything seemed stable… until it wasn’t.

There was just one small problem: our backup HDD was 90% full, and we needed to make a decision.

The Mistake

Because we had never really needed our full backups before, we assumed GlusterFS was reliable enough.

We retired our old backup strategy and relied solely on GlusterFS replication.

That was a bad decision.

The Day Everything Went Wrong

Two months later, one morning, we started receiving reports: some files were missing.

At first, we thought it was a network glitch, something minor. But as we dug deeper, we found that GlusterFS was reporting missing chunks and sync errors.

  • Files were disappearing.
  • More and more pages were throwing errors.
  • It was spreading fast.

The Immediate Response

3:30 AM: We decided to restart the Gluster network, believing a fresh bootstrap would fix the problem. At first, it seemed to work!

We thought we had solved it.

Then, a WhatsApp message from the content team came in:

“The files are empty.”

Wait, what? The files existed, but they contained nothing.

We checked manually. The files still had size and metadata, but when we opened them, they were completely blank.

10,000 files were gone.

The Backup That Was Useless

We had a backup HDD. That should have saved us, right?

Wrong. After migrating to GlusterFS, we had restructured our directory system, and every file now had a new hashed path in the database.

Our old backups were useless because their filenames no longer matched anything in the database.

We tried multiple recovery methods. Nothing worked.

In the end, we had to email thousands of users, asking them to re-upload their lost files.

It was a nightmare. But it forced us to rethink everything.

How We Fixed It: Introducing Gama File Keeper (GFK)

After this disaster, we completely redesigned our storage and backup strategy. Our solution had two parts:

1. Gama File Keeper (GFK): A Smarter Storage System
  • Every uploaded file is mapped with a checksum, making it trackable even if renamed.
  • Instead of hard deletions, files now go through a 3-month soft-delete period before permanent removal begins.
  • Recovery is now instant using checksum-based matching.
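
The three ideas above (checksum mapping, soft deletion, and checksum-based recovery) can be sketched roughly as follows. This is a minimal illustration, not GFK's actual code: `FileIndex`, `FileRecord`, `match_backup`, the in-memory dictionary, the choice of SHA-256, and the 90-day approximation of "3 months" are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Assumption: the article's "3-month" window is approximated as 90 days.
SOFT_DELETE_RETENTION = timedelta(days=90)


def checksum_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class FileRecord:
    path: str
    deleted_at: Optional[datetime] = None  # None means the file is live


class FileIndex:
    """Hypothetical in-memory index keyed by content checksum, not by path."""

    def __init__(self) -> None:
        self.records: Dict[str, FileRecord] = {}

    def register(self, path: str, data: bytes) -> str:
        checksum = checksum_of(data)
        self.records[checksum] = FileRecord(path)
        return checksum

    def rename(self, checksum: str, new_path: str) -> None:
        # The checksum is unchanged, so the file stays trackable.
        self.records[checksum].path = new_path

    def soft_delete(self, checksum: str, now: datetime) -> None:
        self.records[checksum].deleted_at = now

    def restore(self, checksum: str) -> None:
        self.records[checksum].deleted_at = None

    def purgeable(self, now: datetime) -> List[str]:
        """Checksums whose soft-delete window has fully elapsed."""
        return [
            c for c, r in self.records.items()
            if r.deleted_at is not None
            and now - r.deleted_at >= SOFT_DELETE_RETENTION
        ]


def match_backup(index: FileIndex, backup: Dict[str, bytes]) -> Dict[str, str]:
    """Map current paths to backup filenames that hold identical bytes."""
    matches = {}
    for backup_name, data in backup.items():
        record = index.records.get(checksum_of(data))
        if record is not None:
            matches[record.path] = backup_name
    return matches
```

With an index like this, a backup whose filenames no longer match the database can still be reconciled, because `match_backup` compares content rather than names. A real system would also need to handle two uploads with identical bytes, which this sketch collapses into one record.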

2. Backapp: A Multi-Layered Backup Strategy

We no longer rely on a single storage system. Instead, we implemented a three-layered backup strategy:

  • Warm Backup (Every 2 Hours): Synced to storage within the same data center.
  • Cold Backup (Every 6 Hours): Replicated to a separate data center.
  • Offline Backup (Weekly): Stored on physical HDDs in a separate location.
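
One way to picture the three tiers is as a small schedule table plus a check for which tiers are due. The sketch below is illustrative only and does not describe Backapp's real implementation: `BackupTier`, `TIERS`, `tiers_due`, and the destination labels are hypothetical names, and only the intervals come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BackupTier:
    name: str
    interval: timedelta
    destination: str  # hypothetical label, not a real hostname or path


# Intervals taken from the article; destinations are placeholders.
TIERS = [
    BackupTier("warm", timedelta(hours=2), "same-datacenter"),
    BackupTier("cold", timedelta(hours=6), "remote-datacenter"),
    BackupTier("offline", timedelta(weeks=1), "offsite-hdd"),
]


def tiers_due(last_run: Dict[str, Optional[datetime]], now: datetime) -> List[str]:
    """Return the names of tiers whose interval has elapsed since their last run."""
    due = []
    for tier in TIERS:
        previous = last_run.get(tier.name)
        # A tier that has never run is always due.
        if previous is None or now - previous >= tier.interval:
            due.append(tier.name)
    return due
```

A scheduler (cron, a worker loop, or similar) could call `tiers_due` periodically and start one backup job per returned tier. Keeping each tier independent means a failure in one layer does not block the others.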

Database Backups
  • Full backups every 24 hours, stored for 12 months.
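
A retention rule like this usually comes down to a pruning pass over existing backups. The following is a minimal sketch under assumptions: `expired_backups` is a hypothetical helper, "12 months" is approximated as 365 days, and backups are represented as a plain mapping from name to creation time.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumption: "12 months" is approximated as 365 days.
RETENTION = timedelta(days=365)


def expired_backups(backups: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of backups older than the retention window, sorted."""
    return sorted(
        name for name, created in backups.items()
        if now - created > RETENTION
    )
```

Only the names returned here would be candidates for deletion; everything inside the window is kept.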

The Real Test: How This System Saved Us in 2025

Fast forward five years. Gamatrain.com, our new business in the UK, faced another rare incident.

But this time, we didn’t lose a single file.

Why? Because of the lessons we learned in 2019 and the system we built to prevent it from happening again.

Lessons for Every Engineer
  • Never trust a single storage system—even if it seems rock solid.
  • Backups should be independent, multi-layered, and stored in different locations.
  • Disasters will happen. Your resilience depends on how well you prepare for them.

What’s the worst data loss disaster you’ve faced? Share your experience in the comments!

#devops #backupstrategy #datarecovery #engineeringfailures #disasterrecovery