Client-Side Deduplication

Deduplication is a technique that stores identical data only once and reuses that single copy wherever the same data occurs.

This functionality is supported only in the new backup format.

The new backup format is based on client-side deduplication. Deduplicating data on the local computer brings the following benefits:

  • Speed: client-side deduplication is much faster than server-side deduplication
  • No dependence on the internet connection: data is deduplicated locally
  • Significant decrease in internet traffic
  • Ability to purge unnecessary data
  • Lower storage costs: a server-side deduplication database grows constantly, which can significantly increase expenses. Client-side deduplication uses local resources only

How It Works

The first backup is always full. In most cases, a full backup followed by incremental backups is sufficient. Thus, after the full backup, subsequent backup plan runs are usually incremental, and each of them depends on the full backup and on the previous incremental backups.

The new backup format is designed to keep backup plans fully independent, so each backup plan has its own deduplication database. Moreover, each backup plan generation (a generation is a full backup together with the incremental backups that follow it) also has its own deduplication database.

When a backup plan runs, Backup for Linux reads the backup data in batches whose size is a multiple (2x, 4x, ...) of the block size. Each block that is read is compared with the deduplication database records. If the block is not found, it is uploaded to storage and assigned a block ID, which becomes a new deduplication database record. Scanning then continues, and if a block matches an existing deduplication database record, that block is not backed up again.
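The block-matching loop described above can be sketched roughly as follows. This is a minimal illustration, not the product's actual implementation. The names backup_stream, dedup_db, storage, BLOCK_SIZE and BATCH_BLOCKS are hypothetical, the block size is deliberately tiny, and comparing blocks by SHA-256 digest is an assumption, since the source does not say how blocks are matched.

```python
import hashlib
import io

BLOCK_SIZE = 4    # illustrative block size in bytes (assumption, unrealistically small)
BATCH_BLOCKS = 2  # each read fetches a batch of 2x the block size


def backup_stream(stream, dedup_db, storage):
    """Read the stream in batches and store only blocks not yet in dedup_db.

    dedup_db maps a block digest to a block ID; storage is a list standing in
    for the backup destination. Returns the ordered list of block IDs that
    describes the stream.
    """
    manifest = []
    while True:
        batch = stream.read(BLOCK_SIZE * BATCH_BLOCKS)
        if not batch:
            break
        # Split the batch into blocks and look each one up in the database.
        for offset in range(0, len(batch), BLOCK_SIZE):
            block = batch[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()  # assumed matching method
            block_id = dedup_db.get(digest)
            if block_id is None:
                # Unknown block: deliver it to storage and record its new ID.
                block_id = len(storage)
                storage.append(block)
                dedup_db[digest] = block_id
            # Known block: nothing is uploaded, only the ID is referenced.
            manifest.append(block_id)
    return manifest


data = b"AAAABBBBAAAACCCCBBBB"
dedup_db, storage = {}, []
manifest = backup_stream(io.BytesIO(data), dedup_db, storage)
print(manifest)      # [0, 1, 0, 2, 1]
print(len(storage))  # 3 unique blocks stored for 5 blocks read
```

In this toy run, five blocks are read but only three are stored, and the original data can be rebuilt by following the block IDs in order.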

This approach significantly decreases the backup size, especially in virtual environments that contain a large number of identical blocks.

If the deduplication database is manually deleted or becomes corrupted, a full backup is always initiated.
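The fallback rule above could look something like the following sketch. It is only an illustration under stated assumptions: the function choose_backup_type, the MAGIC header and the file-based database layout are hypothetical and do not describe the real deduplication database format.

```python
from pathlib import Path

# Hypothetical file header used here to recognize an intact database (assumption).
MAGIC = b"DEDUPDB1"


def choose_backup_type(dedup_db_path: Path) -> str:
    """Return "full" when the database is missing or unreadable, else "incremental"."""
    try:
        with open(dedup_db_path, "rb") as db_file:
            header = db_file.read(len(MAGIC))
    except OSError:
        # The database was deleted (or never existed, as before the first backup).
        return "full"
    # A wrong or truncated header is treated as corruption in this sketch.
    return "incremental" if header == MAGIC else "full"
```

Because a missing database is handled the same way as a first run, the rule that the first backup is always full follows from the same check.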
