Nearly all backup tools for VMware environments rely on the vStorage API to create snapshots of the virtual machines to be backed up. Every now and then the backup tool encounters a problem and cancels the backup process. This can leave the backup tool unable to clean up its backup snapshot on the VM, so the VM keeps running in snapshot mode although the snapshot manager in the vSphere client has no information about the snapshot. You can only see the snapshot by inspecting the VM's hard disk configuration. If you see a disk file name that ends in a zero-padded number (like harddisk1-000001.vmdk) then the VM is still running in snapshot mode.
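If you have the disk file names of many VMs in a list (for example copied from the hard disk configuration), the check described above can be sketched in a few lines of Python. This is only a heuristic under one assumption: that snapshot disks carry a hyphen followed by at least six digits before ".vmdk". The function name and the sample paths are made up for illustration.

```python
import re

# Assumption: snapshot disks end in "-<six or more digits>.vmdk".
# A base disk that happens to be named like this would be a false positive.
SNAPSHOT_DISK = re.compile(r"-\d{6,}\.vmdk$", re.IGNORECASE)


def snapshot_disks(disk_paths):
    """Return the disk paths whose file name looks like a snapshot disk."""
    return [path for path in disk_paths if SNAPSHOT_DISK.search(path)]


# Hypothetical disk paths of a single VM.
disks = [
    "[datastore1] vm01/vm01.vmdk",
    "[datastore1] vm01/vm01_1-000001.vmdk",
]
print(snapshot_disks(disks))
```

A non-empty result means at least one disk of the VM still points to a snapshot file.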
Knowing this problem, VMware added a check for this condition in vSphere 5.x called "Consolidation". If consolidation is needed, your VM is running in snapshot mode without the snapshot manager knowing anything about it. You can then consolidate the snapshots manually: right-click the VM, select "Snapshot" in the context menu and then choose "Consolidate". This will probably work and your snapshot(s) will be committed.
Unfortunately the world isn't perfect all the time and you may well be faced with an error message telling you that a needed file is locked and the consolidation can't be completed. I don't know why VMware hasn't implemented this, but the error message won't tell you which file is locked. In almost every situation the lock is on the original *-flat.vmdk file because this file is still in use by the backup server (to be more precise, the ESXi server THINKS it is still in use).
Without checking in more detail, you can follow a standard procedure. First reboot the backup server (this could be a Windows server in the case of tools like Veeam or vRanger), because this server can hold a lock on a vmdk file on a VMFS volume. Try the consolidation again. Perhaps it works now, but you will probably still get the same error message. The second step is rebooting the ESXi server where the VM resided when the backup ran. Start the consolidation task once again. If it still fails you have to dig a bit deeper.
Connect to the CLI of one of the ESXi servers in the cluster where the locked VM runs. Change to the directory where the locked files are located and type in "vmkfstools -D name_of_data_file-flat.vmdk".
You will get a strange looking output like:
Lock [type 10c00001 offset 45842432 v 33232, hb offset 4116480
gen 2397, mode 2, owner 00000000-00000000-0000-000000000000 mtime 5436998]
RO Owner HB offset 3293184 4f284470-4991d61b-4b28-001a64c335dc
Addr <4, 80, 160>, gen 33179, links 1, type reg, flags 0, uid 0, gid 0, mode 100600
len 738242560, nb 353 tbz 0, cow 0, zla 3, bs 2097152
First look at the second line, where you can see the keyword "owner". The last 12 digits of this value are the MAC address of the lock owner. Sometimes, as in the output above, the owner consists only of 0s. This means that a server unknown to VMFS holds the lock. This is the situation in most cases when the problem is caused by a backup tool.
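Reading the last 12 digits by eye is error-prone, so here is a minimal Python sketch that pulls the "owner" value out of pasted vmkfstools -D output and turns it into a MAC address. It assumes the output looks like the sample above (lowercase keyword "owner" followed by the value); the function name owner_mac is made up for this example.

```python
import re

# First lines of the sample output shown above.
SAMPLE = """Lock [type 10c00001 offset 45842432 v 33232, hb offset 4116480
gen 2397, mode 2, owner 00000000-00000000-0000-000000000000 mtime 5436998]
RO Owner HB offset 3293184 4f284470-4991d61b-4b28-001a64c335dc"""

# Case-sensitive on purpose: "owner" is the lock owner, "RO Owner" is not matched.
OWNER = re.compile(
    r"\bowner ([0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12}))"
)


def owner_mac(output):
    """Return the lock owner's MAC (aa:bb:cc:dd:ee:ff), or None if it is all 0s."""
    match = OWNER.search(output)
    if match is None:
        raise ValueError("no owner field found in the output")
    if set(match.group(1).replace("-", "")) == {"0"}:
        return None  # owner is only 0s: lock holder is not known to VMFS
    digits = match.group(2)
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


print(owner_mac(SAMPLE))
```

For the sample output this prints None, which corresponds to the "only 0s" case described above.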
Then look at the third line, starting with "RO Owner". This tells you which system holds a READ-ONLY lock on the file. Again, use the last 12 digits to identify the server holding the lock. The MAC address is always the address of a NIC used by the management port group of one of the ESXi servers in the vSphere cluster. So simply go to the vSphere client, choose one ESXi server after the other, select Configuration -> Network Adapters and compare the listed MAC addresses to the output above until you can identify the host. Then evacuate all VMs from that host and reboot it.
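If you have already written down the MAC addresses of your hosts, the comparison can be sketched in Python as well. The following is an illustration only: the host names and the inventory dictionary are hypothetical and you would have to fill them in by hand from the vSphere client. It assumes every "RO Owner" line ends with a value in the format shown above.

```python
import re

# Captures the last 12 hex digits of every "RO Owner" line.
RO_OWNER = re.compile(
    r"RO Owner.*?[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12})",
    re.IGNORECASE,
)


def ro_owner_macs(output):
    """Return the MAC of each read-only lock owner as aa:bb:cc:dd:ee:ff."""
    return [
        ":".join(d[i:i + 2] for i in range(0, 12, 2)).lower()
        for d in RO_OWNER.findall(output)
    ]


def hosts_holding_lock(output, host_macs):
    """Map each host to those of its MACs that appear as an RO owner."""
    wanted = set(ro_owner_macs(output))
    hits = {}
    for host, macs in host_macs.items():
        found = sorted(wanted & {m.lower() for m in macs})
        if found:
            hits[host] = found
    return hits


# Hypothetical inventory, typed in from Configuration -> Network Adapters.
inventory = {
    "esx01.example.local": ["00:1A:64:C3:35:DC", "00:1A:64:C3:35:DE"],
    "esx02.example.local": ["00:1A:64:C3:40:10"],
}
line = "RO Owner HB offset 3293184 4f284470-4991d61b-4b28-001a64c335dc"
print(hosts_holding_lock(line, inventory))
```

The sketch collects every RO Owner line it finds, so it also copes with output that lists more than one read-only owner.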
Try the consolidation again.
During the last few weeks this problem came up quite often, as I look after several vSphere installations backed up by snapshot-based tools. Every now and then the above-mentioned procedure doesn't work. Even if you reboot all ESXi servers, the vCenter server and the backup server, the lock still exists and you are unable to commit the snapshot.
Sometimes you will also see a second RO Owner in the output above. In one case this owner changed every time I rebooted the mentioned host: the lock simply switched to another host in the cluster. So no matter what I did there were always two RO owners on the file.
I opened a support case with VMware and they asked me to clone the VM or move it with SVMotion. I couldn't imagine that this would help, but I decided to try SVMotion as it has no impact on the availability of the VM. After the SVMotion the snapshot was still there and still unknown to the snapshot manager BUT I was able to do a consolidation. Perfect.
This resolution has some side effects, as you have to spend some time moving the VM's files to another datastore. Depending on the size of the VM and the snapshot files this can take several hours, and it needs plenty of storage capacity because during SVMotion ALL files, including snapshot files, are copied. So this shouldn't be your default action if you run into a locked-file condition, but it's worth a try if you can't resolve the problem by rebooting a few servers.
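To get a rough feeling for the cost before you start, you can add up the file sizes and divide by the throughput you expect between the datastores. The Python sketch below is plain arithmetic, not anything VMware provides: the sizes and the throughput figure are invented example values, and real copy rates depend on your storage and its load.

```python
def svmotion_estimate(file_sizes_gb, throughput_mb_s):
    """Return (space needed on the target in GB, rough copy time in hours)."""
    total_gb = sum(file_sizes_gb)  # ALL files are copied, snapshots included
    hours = total_gb * 1024 / throughput_mb_s / 3600
    return total_gb, hours


# Hypothetical VM: 500 GB base disk plus 120 GB of snapshot files,
# with an assumed sustained copy rate of 100 MB/s.
space_gb, hours = svmotion_estimate([500, 120], 100)
print(f"{space_gb} GB needed on the target, roughly {hours:.1f} hours")
```

Treat the result as a lower bound for the duration, since the copy rate is rarely sustained for the whole move.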