Backup Exec and Deduplication

Backup Exec’s deduplication option offers technologies to reduce the disk space needed for backup-to-disk. In addition, deduplication can help you reduce the network load caused by backups.

The minimum hardware requirements for deduplication with Backup Exec are:

  • One quad-core CPU or two dual-core CPUs.
  • Eight gigabytes of free memory for up to five terabytes of deduplication storage, plus another 1.5 gigabytes of RAM for each additional terabyte of storage (see the small sizing sketch after this list).
  • A dedicated storage volume where the deduplication storage will be created or a supported Open Storage Technology (OST) appliance.
    Please refer to the Backup Exec Hardware Compatibility List (HCL) for details on the supported devices.
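
To get a feeling for the memory requirement, here is a minimal Python sketch of the sizing rule from the list above (8 GB for the first five terabytes, plus 1.5 GB per additional terabyte); the function name is my own and not part of any Backup Exec tooling:

```python
def required_ram_gb(dedup_storage_tb: float) -> float:
    """Recommended free RAM in GB for a given deduplication storage size in TB."""
    base_ram_gb = 8.0        # covers the first 5 TB of deduplication storage
    base_storage_tb = 5.0
    extra_ram_per_tb = 1.5   # each additional TB needs another 1.5 GB of RAM

    if dedup_storage_tb <= base_storage_tb:
        return base_ram_gb
    return base_ram_gb + (dedup_storage_tb - base_storage_tb) * extra_ram_per_tb

# Example: a 20 TB deduplication storage needs 8 + 15 * 1.5 = 30.5 GB of free RAM
print(required_ram_gb(20))
```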

Hint:
As of today, Backup Exec supports a maximum of 64 terabytes per deduplication storage and only one deduplication storage per backup server.

The Deduplication Option within Backup Exec supports three different methods of deduplication:

Server-Side Deduplication

Server-side deduplication means that all deduplication work is done on the backup server.
This implies that the source server sends all its files to the backup server, as if it were a regular backup job. The backup server receives the data stream, splits it into 512 KB chunks and calculates a hash value for each chunk, which it stores in a dedicated database.
If the backup server receives a data packet whose hash value already exists in the database, the packet is discarded and only a pointer to the existing chunk is created in the database. It is completely irrelevant whether that data packet is part of an office document, a file from an operating system or a video clip.
Deduplicating data by splitting files into chunks yields better results than deduplicating at the file level.
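
To make the mechanism more tangible, here is a minimal Python sketch of chunk-level deduplication, assuming fixed 512 KB chunks and SHA-256 hashes; Backup Exec’s actual chunking, hashing and database format are internal to the product and will differ in detail:

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # the 512 KB chunk size mentioned above

# hash value -> chunk content; stands in for the dedicated deduplication database
chunk_store: dict[str, bytes] = {}

def backup_file(path: str) -> list[str]:
    """Split a file into fixed-size chunks and store only the unique ones.

    Returns the list of hash values ("pointers") needed to rebuild the file.
    """
    pointers = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:
                chunk_store[digest] = chunk  # unique chunk: store its content once
            pointers.append(digest)          # known or new, only a pointer is recorded
    return pointers
```

Note that the sketch never looks at file types: whether a chunk comes from an office document, an operating system file or a video clip is irrelevant, exactly as described above.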

Excursion:
The size of the chunks used for deduplication is pretty important: The smaller the chunks are, the more likely it is that two chunks have the same content.
Unfortunately, calculating the hash values and looking them up in the database to find out whether the hash value of the current chunk is already known is time-consuming.
So, from a performance point of view, it is better to use larger chunks and accept that the deduplication ratio will decrease.
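
The trade-off can be demonstrated with a few lines of Python; the sample data below is purely illustrative and not representative of a real backup stream:

```python
import hashlib
import random

def dedup_ratio(data: bytes, chunk_size: int) -> float:
    """Ratio of total chunks to unique chunks for a given chunk size."""
    unique_hashes = set()
    total_chunks = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique_hashes.add(hashlib.sha256(chunk).hexdigest())  # one hash and one lookup per chunk
        total_chunks += 1
    return total_chunks / len(unique_hashes)

# A 32 MB "backup stream" built from a small pool of 4 KB blocks in random order
random.seed(0)
block_pool = [random.randbytes(4096) for _ in range(32)]
sample = b"".join(random.choice(block_pool) for _ in range(8192))

for size in (4 * 1024, 64 * 1024, 512 * 1024):
    print(f"{size // 1024:>4} KB chunks -> ratio {dedup_ratio(sample, size):.1f}:1")
```

With 4 KB chunks every chunk matches one of the 32 pool blocks and the ratio is excellent, while 64 KB and 512 KB chunks almost never repeat; on the other hand, the small chunks require far more hash calculations and database lookups (8192 instead of 64 for this sample).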

Because all data from the source server has to be transferred to the backup server before it is deduplicated, server-side deduplication does not save any bandwidth.
The backup window also stays the same as with classic backup jobs.
However, the client doesn’t have to meet any special requirements, so you can use server-side deduplication for nearly every type of data.

Hint:
Please note that files coming from NDMP filers such as NetApp, EMC, etc. cannot be deduplicated by Backup Exec. This is because Backup Exec cannot read the content of the data stream sent by these systems and therefore cannot split it into chunks.

Client-Side Deduplication

If you enable client-side deduplication for a backup job in Backup Exec, the source server’s remote agent splits the backup data into chunks and sends their hash values to the backup server to determine whether the server already has a copy of each chunk. If the chunk is already present in the deduplication storage on the backup server, the source agent drops it and only a pointer is created on the backup server. So only unique chunks are sent over the network.
Since this game of questions and answers takes a while, the first run of a backup job may be quite lengthy.
During subsequent runs of the same backup job, the backup server sends the remote agent a blob of information containing the answers to all the questions it asked during the last run. Using this, the client itself can “sort out” chunks the server will not need and directly request the creation of the pointers.
This means that from the second run on, the amount of data sent over the network is massively reduced, resulting in lower bandwidth requirements and smaller backup windows.
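
The following Python sketch illustrates this question-and-answer exchange in a very simplified form; the class and method names are hypothetical and do not reflect the actual remote agent protocol:

```python
import hashlib

class BackupServer:
    """Stands in for the Backup Exec server side of the exchange."""

    def __init__(self):
        self.chunk_store: dict[str, bytes] = {}

    def has_chunk(self, digest: str) -> bool:   # the "question": do you know this hash?
        return digest in self.chunk_store

    def store_chunk(self, digest: str, chunk: bytes) -> None:
        self.chunk_store[digest] = chunk

    def known_hashes(self) -> set[str]:         # the "blob" handed out for subsequent runs
        return set(self.chunk_store)

class RemoteAgent:
    """Client-side deduplication: only unique chunks cross the network."""

    def __init__(self, server: BackupServer):
        self.server = server
        self.hash_cache: set[str] = set()       # answers remembered from the previous run

    def backup(self, chunks: list[bytes]) -> int:
        sent = 0
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in self.hash_cache or self.server.has_chunk(digest):
                continue                            # server already has it, a pointer suffices
            self.server.store_chunk(digest, chunk)  # unique chunk: send it over the network
            sent += 1
        self.hash_cache = self.server.known_hashes()  # refresh the cache for the next run
        return sent
```

On the first run the cache is empty, so every chunk triggers a question to the server; from the second run on, most questions are answered locally from the cache, which is exactly why the network traffic and the backup window shrink.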

Nevertheless, as with so many things in IT, this technology also has a downside: the source server has to meet some technical requirements:

  • One free CPU core
  • One gigabyte of free memory
  • An installed Backup Exec remote agent

The last requirement means that this technology cannot be used in VMware environments where you want to do host-based backups, as the remote agent cannot be installed on the ESX host.

Deduplication using OST Appliances

The abbreviation OST stands for Open Storage Technology and describes a technology used to integrate deduplication-aware hardware appliances into Backup Exec’s deduplication option.
Among other things, this technology enables the Backup Exec server to control and monitor the deduplication process on the appliance.

When using Backup Exec together with an OST appliance, the appliance does the deduplication work and just reports back to the backup server, so Backup Exec can keep its catalogs and database entries up to date.

One of the advantages of using OST appliances is that you can get around the 64-terabyte limit Backup Exec has for its “internal” deduplication storages. Another is the high performance these appliances deliver during backups as well as during restores.

The most important disadvantage, however, is that Backup Exec handles OST appliances similarly to tape devices. This means that GRT restores cannot be done directly but require a staging process, and therefore enough local disk space on the backup server to restore the whole container file.

Optimized Duplication

The process of copying (or duplicating) data from one deduplication storage to another is called Optimized Duplication. The most interesting aspect here is that the data is not “reassembled” or “rehydrated”, as Veritas calls it. Instead, only the missing blocks are sent from one storage to the other.
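
Conceptually, the process can be sketched in a few lines of Python; both deduplication storages are represented here as plain hash-to-chunk maps, which is of course a simplification of the real on-disk format:

```python
def optimized_duplicate(source_store: dict[str, bytes], target_store: dict[str, bytes]) -> int:
    """Duplicate one deduplication store into another without rehydrating the data.

    Both stores map chunk hashes to chunk content; only the chunks missing on
    the target are transferred. Returns the number of chunks actually sent.
    """
    missing = set(source_store) - set(target_store)
    for digest in missing:
        target_store[digest] = source_store[digest]  # only missing blocks cross the link
    return len(missing)
```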

Hint:
In order to use Optimized Duplication in Backup Exec, the source storage and the target storage have to be of the same type. You can use an “integrated” PureDisk deduplication storage on both sides or the same type of OST appliance connected to the source server and the target server, but you cannot mix them.

The idea behind Optimized Duplication is as simple as it is brilliant and solves quite a lot of backup challenges I’ve seen over the last years, such as building a central backup environment for companies with multiple branch offices that have to keep data on local servers.

Copying Deduplicated Data to Tapes

Data that is stored in a deduplication storage on a Backup Exec server can be duplicated to tape simply by creating another stage in the backup job.
During this stage, the data is reassembled (rehydrated) and written to tape as if it had been backed up directly from the source to tape. The advantage of doing so is that you can restore directly from tape without needing to stage the data back to the backup server (Single Pass Restore).
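
Reusing the chunk-store idea from the sketches above, rehydration boils down to looking up every pointer in order and writing the chunks back out as one sequential stream; this is a conceptual sketch, not Backup Exec’s actual implementation:

```python
def rehydrate(pointers: list[str], chunk_store: dict[str, bytes]) -> bytes:
    """Reassemble ("rehydrate") a backup set from its ordered chunk pointers.

    The resulting stream can be written to tape as-is, which is what makes a
    later Single Pass Restore directly from the tape possible.
    """
    return b"".join(chunk_store[digest] for digest in pointers)
```

The need to look up and read every single chunk also helps to explain the performance note in the following hint.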

Hint:
Compared to copying data from a classic backup-to-disk storage to tape, duplicating data from a deduplication storage takes considerably longer.

How Much Disk Space Can Be Saved By Deduplication?

When it comes to deduplication, one of the most interesting questions is how much space can be saved by implementing this technology.

In most environments we see deduplication ratios from 4:1 to 7:1, which means that only about 14-25% of the backed-up data actually ends up on the deduplication storage. In other words, we see space savings of roughly 75 to 86%.
But please keep in mind that the savings you can expect depend on multiple factors:

  • The type of backups you do.
    Full backups transfer much more data from the source servers than incremental backups do, because every file is backed up whether it has changed or not. Since the deduplication storage only keeps unique blocks, Backup Exec’s deduplication engine will discard every block transferred by a full backup that would not have been backed up by an incremental job. Or, in other words: the amount of data stored on the deduplication storage from an incremental backup is the same as from a full backup of the same source, but the amount of discarded data is much higher for the full backup.
  • The type of data you back up.
    The amount of unique blocks within a file differs greatly between file types. Media files such as pictures, music and videos contain almost no data that can be deduplicated. Other files, such as office documents, contain far more identical chunks, and backing them up results in much better deduplication ratios.
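
Coming back to the ratios mentioned above, the relationship between a deduplication ratio and the resulting space saving is simple arithmetic, as this small snippet shows:

```python
def space_saving(dedup_ratio: float) -> float:
    """Fraction of the backed-up data that does not have to be stored."""
    return 1.0 - 1.0 / dedup_ratio

for ratio in (4, 7):
    print(f"{ratio}:1 -> {space_saving(ratio):.0%} saved")  # 4:1 -> 75%, 7:1 -> 86% saved
```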

Veritas has developed a tool called “Backup Exec Deduplication Assessment Tool” (BEDAT) for its partners that examines the servers to be backed up and calculates the suggested amount of deduplication storage needed to protect them, based on your retention time and backup model.

Hint:
The tool is free of charge, but only available to registered Veritas partners. So, if you’re an end customer and want to evaluate your environment to see how effective deduplication would be, please get in contact with us; we can perform the tests for you and present the reports afterwards.
