4 changes: 2 additions & 2 deletions docs/scheduledtasks.md
@@ -16,8 +16,8 @@ Example configuration:
```


To disblale the automatic scrubbing job add `"ovs.generic.execute_scrub": null` to the JSON object.
In case you want to change the schedule for the ALBA backend verifictaion process which checks the state of each object in the backend, add `"alba.verify_namespaces": {"minute": "0", "hour": "0", "month_of_year": "*/X"}` where X is the amount of months between each run.
To disable the automatic scrubbing job add `"ovs.generic.execute_scrub": null` to the JSON object.
In case you want to change the schedule for the ALBA backend verification process, which checks the state of each object in the backend, add `"alba.verify_namespaces": {"minute": "0", "hour": "0", "month_of_year": "*/X"}` where X is the number of months between each run.


In case the configuration cannot be parsed at all (e.g. invalid JSON), the code will fall back to the hardcoded schedule. If the crontab arguments are invalid (e.g. they contain an unsupported key), the task will be disabled.
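Putting the options above together, a hypothetical configuration that disables the scrubbing job and runs the backend verification every 3 months would look like:
```
{
    "ovs.generic.execute_scrub": null,
    "alba.verify_namespaces": {"minute": "0", "hour": "0", "month_of_year": "*/3"}
}
```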
91 changes: 91 additions & 0 deletions docs/snapshots.md
@@ -0,0 +1,91 @@
### Snapshot management
By default, the Framework creates a snapshot of every vDisk every hour (this can be adjusted; see docs/scheduledtasks.md).

To keep the snapshots manageable over time, the Framework schedules a daily clean-up that enforces a retention policy.
This automatic task will:
- Create an overview of all the snapshots for every volume
- Skip the first 24 hours, allowing users to create as many snapshots as they want during the day (see the sketch after this list)
- Enforce the retention policy
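The 24-hour grace period amounts to computing a cutoff timestamp; a minimal sketch, mirroring the `time.time() - timedelta(1).total_seconds()` expression used in the implementation:
```
import time
from datetime import timedelta

# Snapshots newer than this cutoff fall within the 24-hour grace period
# and are never considered for deletion by the clean-up task.
cutoff = time.time() - timedelta(days=1).total_seconds()
```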

The default retention policy is:
- A single snapshot is kept per day for the first 7 days
- For the first day in the policy (which is 2 days back from now, due to the 24-hour grace period), consistent snapshots are prioritized over older ones
- A single snapshot is kept for the 2nd, 3rd and 4th week, leaving one snapshot per week for the first month
- All older snapshots are discarded

#### Configuring the retention policy
A retention policy can be configured so that the scheduled task enforces a different one than the default.

It can be customized on:
- Global level: enforces the policy for all vDisks within the cluster
- vPool level: overrides the global level, enforced for all vDisks within the vPool
- vDisk level: overrides the global and vPool levels, enforced for this vDisk only

A retention policy is written as a list of policies. A policy minimally consists of `nr_of_snapshots`, the number of snapshots to keep over the given `nr_of_days`, and `nr_of_days`, the number of days to spread the `nr_of_snapshots` over. This notation allows for fine-grained control while remaining easy to configure. Since we are working with plain days, *monthly and weekly policies will not follow the calendar!*
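For instance, a minimal policy that keeps one snapshot per day for three days would be written as:
```
[{'nr_of_days': 3, 'nr_of_snapshots': 3}]
```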

Two additional options are available. `consistency_first` indicates that:
- The policy has to keep the oldest consistent snapshot instead of the oldest one
- When no consistent snapshot is found, it falls back to the oldest snapshot

If a policy interval spans multiple days, `consistency_first_on` can be configured to narrow down the days on which the `consistency_first` rule applies. This option takes a list of day numbers.
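To make these semantics concrete, here is a simplified, hypothetical sketch of how such a policy can be enforced with time buckets. The `enforce` helper below is illustrative only: the shipped `SnapshotManager` is more elaborate, and `consistency_first_on` is left out for brevity:
```
import time

DAY = 24 * 60 * 60  # seconds

def enforce(policies, snapshots, now=None):
    # snapshots: list of dicts with a 'timestamp' (epoch seconds)
    # and an 'is_consistent' (bool) key, as in vdisk.snapshots
    now = now if now is not None else time.time()
    start = now - DAY  # the first 24 hours are always skipped
    keep = set()
    for policy in policies:
        interval = float(policy['nr_of_days']) * DAY / policy['nr_of_snapshots']
        for i in range(policy['nr_of_snapshots']):
            upper = start - i * interval
            lower = start - (i + 1) * interval
            bucket = [s for s in snapshots if lower < s['timestamp'] <= upper]
            if not bucket:
                continue
            consistent = [s for s in bucket if s['is_consistent']]
            # With consistency_first, prefer the oldest consistent snapshot
            # and fall back to the oldest snapshot of the bucket
            candidates = consistent if policy.get('consistency_first') and consistent else bucket
            keep.add(min(candidates, key=lambda s: s['timestamp'])['timestamp'])
        start -= policy['nr_of_days'] * DAY
    # Keep the grace period and the bucket survivors; everything else,
    # including snapshots older than the last policy window, is deleted
    return [s for s in snapshots if s['timestamp'] > now - DAY or s['timestamp'] in keep]
```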


If we were to write out the default retention policy, it would look like:
```
[# one per day for the week and opt for a consistent snapshot for the first day
{'nr_of_snapshots': 7, 'nr_of_days': 7, 'consistency_first': True, 'consistency_first_on': [1]},
# One per week for the rest of the month
{'nr_of_snapshots': 3, 'nr_of_days': 21}]
```

Configuring it on different levels can be done using the API:
- Global level: POST to: `/storagerouters/<storagerouter_guid>/global_snapshot_retention_policy`
- vPool level: POST to: `/vpools/<vpool_guid>/snapshot_retention_policy`
- vDisk level: POST to: `/vdisks/<vdisk_guid>/snapshot_retention_policy`
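As an illustration, a vDisk-level policy could be applied with a plain HTTP call. A sketch using Python's `requests`, assuming the policy list is posted as the JSON request body and that a valid API token is at hand (both are assumptions, not confirmed here):
```
import requests

policy = [{'nr_of_days': 7, 'nr_of_snapshots': 7}]
response = requests.post(
    'https://<ovs-api>/vdisks/<vdisk_guid>/snapshot_retention_policy',
    json=policy,  # assumed payload shape: the policy list itself
    headers={'Authorization': 'Bearer <token>'})  # assumed auth scheme
response.raise_for_status()
```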

##### Examples:
The examples simplify a week as 7 days and a month as 4 * 7 days.

I wish to keep hourly snapshots for the first week
```
[{'nr_of_days': 7, # A week spans 7 days
'nr_of_snapshots': 168}] # Keep 24 snapshots per day for 7 days: 7 * 24
```
I wish to keep hourly snapshots for the first week and one per week for the rest of the year
```
[ # First policy
{'nr_of_days': 7, # A week spans 7 days
'nr_of_snapshots': 7 * 24}, # Keep 24 snapshots per day for 7 days: 7 * 24
# Second policy
{'nr_of_days': 7 * (52 - 1), # The first week is already covered by the previous policy, so 52 - 1 weeks remaining
'nr_of_snapshots': 1 * (52 - 1)}
]
```

A production use case could be:
```
[ # First policy - keep the first 24 snapshots
{'nr_of_days': 1,
 'nr_of_snapshots': 24},
# Second policy - keep 4 snapshots a day for the remainder of the week (6 leftover days)
{'nr_of_days': 6,
 'nr_of_snapshots': 4 * 6},
# Third policy - keep 1 snapshot per day for the 3 weeks to come
{'nr_of_days': 3 * 7,
 'nr_of_snapshots': 3 * 7},
# Fourth policy - keep 1 snapshot per week for the next 5 months
{'nr_of_days': 4 * 7 * 5, # Use the week notation to avoid issues (4 * 7 days = 1 month)
 'nr_of_snapshots': 4 * 5}, # 1 snapshot per week, 4 weeks per month, 5 months
# Fifth policy - the first 6 months are covered by now - keep a snapshot every 6 months until 2 years have passed
{'nr_of_days': (4 * 7) * (6 * 3), # 3 half-year periods of 4 * 7 * 6 days: 18 months
 'nr_of_snapshots': 3}
] # Total: 1 + 6 + 21 + 140 + 504 = 672 days = 24 four-week months = 2 years
```
9 changes: 9 additions & 0 deletions ovs/constants/vdisk.py
@@ -17,10 +17,19 @@
"""
VDisk Constants module. Contains constants related to vdisks
"""
import os

# General
LOCK_NAMESPACE = 'ovs_locks'

# Scrub related
SCRUB_VDISK_LOCK = '{0}_{{0}}'.format(LOCK_NAMESPACE) # Second format is the vdisk guid
SCRUB_VDISK_EXCEPTION_MESSAGE = 'VDisk is being scrubbed. Unable to remove snapshots at this time'

# Snapshot related
# Note: the scheduled task will always skip the first 24 hours before enforcing the policy
SNAPSHOT_POLICY_DEFAULT = [# one per day for rest of the week and opt for a consistent snapshot for the first day
{'nr_of_snapshots': 7, 'nr_of_days': 7, 'consistency_first': True, 'consistency_first_on': [1]},
# One per week for the rest of the month
{'nr_of_snapshots': 3, 'nr_of_days': 21}]
SNAPSHOT_POLICY_LOCATION = os.path.join(os.path.sep, 'ovs', 'cluster', 'snapshot_retention_policy')
3 changes: 2 additions & 1 deletion ovs/dal/hybrids/vdisk.py
@@ -51,7 +51,8 @@ class VDisk(DataObject):
Property('pagecache_ratio', float, default=1.0, doc='Ratio of the volume\'s metadata pages that needs to be cached'),
Property('metadata', dict, default=dict(), doc='Contains fixed metadata about the volume (e.g. lba_size, ...)'),
Property('cache_quota', dict, mandatory=False, doc='Maximum caching space(s) this volume can consume (in Bytes) per cache type. If not None, the caching(s) for this volume has been set manually'),
Property('scrubbing_information', dict, mandatory=False, doc='Scrubbing metadata set by scrubber with an expiration date')]
Property('scrubbing_information', dict, mandatory=False, doc='Scrubbing metadata set by scrubber with an expiration date'),
Property('snapshot_retention_policy', list, mandatory=False, doc='Snapshot retention policy configuration')]
__relations = [Relation('vpool', VPool, 'vdisks'),
Relation('parent_vdisk', None, 'child_vdisks', mandatory=False)]
__dynamics = [Dynamic('dtl_status', str, 60),
3 changes: 2 additions & 1 deletion ovs/dal/hybrids/vpool.py
@@ -46,7 +46,8 @@ class VPool(DataObject):
Property('metadata', dict, mandatory=False, doc='Metadata for the backends, as used by the Storage Drivers.'),
Property('rdma_enabled', bool, default=False, doc='Has the vpool been configured to use RDMA for DTL transport, which is only possible if all storagerouters are RDMA capable'),
Property('status', STATUSES.keys(), doc='Status of the vPool'),
Property('metadata_store_bits', int, mandatory=False, doc='StorageDrivers deployed for this vPool will make use of this amount of metadata store bits')]
Property('metadata_store_bits', int, mandatory=False, doc='StorageDrivers deployed for this vPool will make use of this amount of metadata store bits'),
Property('snapshot_retention_policy', list, mandatory=False, doc='Snapshot retention policy configuration')]
__relations = []
__dynamics = [Dynamic('configuration', dict, 3600),
Dynamic('statistics', dict, 4),
158 changes: 22 additions & 136 deletions ovs/lib/generic.py
@@ -23,13 +23,10 @@
from celery import group
from celery.utils import uuid
from celery.result import GroupResult
from datetime import datetime, timedelta
from datetime import timedelta
from threading import Thread
from time import mktime
from ovs.constants.vdisk import SCRUB_VDISK_EXCEPTION_MESSAGE
from ovs.dal.hybrids.servicetype import ServiceType
from ovs.dal.hybrids.storagedriver import StorageDriver
from ovs.dal.hybrids.vdisk import VDisk
from ovs.dal.lists.servicelist import ServiceList
from ovs.dal.lists.storagedriverlist import StorageDriverList
from ovs.dal.lists.storagerouterlist import StorageRouterList
@@ -41,6 +38,7 @@
from ovs.lib.helpers.toolbox import Toolbox, Schedule
from ovs.lib.vdisk import VDiskController
from ovs.log.log_handler import LogHandler
from ovs.lib.helpers.generic.snapshots import SnapshotManager


class GenericController(object):
@@ -90,160 +88,48 @@ def delete_snapshots(timestamp=None):
:return: The GroupResult
:rtype: GroupResult
"""
if os.environ.get('RUNNING_UNITTESTS') == 'False':
assert timestamp is None, 'Providing a timestamp is only possible during unittests'

# The result cannot be fetched in this task
group_id = uuid()
return group(GenericController.delete_snapshots_storagedriver.s(storagedriver.guid, timestamp, group_id)
for storagedriver in StorageDriverList.get_storagedrivers()).apply_async(task_id=group_id)

@staticmethod
@ovs_task(name='ovs.generic.delete_snapshots_storagedriver', ensure_single_info={'mode': 'DEDUPED'})
@ovs_task(name='ovs.generic.delete_snapshots_storagedriver', ensure_single_info={'mode': 'DEDUPED', 'ignore_arguments': ['timestamp', 'group_id']})
def delete_snapshots_storagedriver(storagedriver_guid, timestamp=None, group_id=None):
# type: (str, float, str) -> Dict[str, List[str]]
"""
Delete snapshots per storagedriver & scrubbing policy
Delete snapshots & scrubbing policy

Implemented delete snapshot policy:
Implemented default delete snapshot policy:
< 1d | 1d bucket | 1 | best of bucket | 1d
< 1w | 1d bucket | 6 | oldest of bucket | 7d = 1w
< 1m | 1w bucket | 3 | oldest of bucket | 4w = 1m
> 1m | delete

The configured policy can differ from this one.
:param storagedriver_guid: Guid of the StorageDriver to remove snapshots on
:type storagedriver_guid: str
:param timestamp: Timestamp to determine whether snapshots should be kept or not, if none provided, current time will be used
:param timestamp: Timestamp to determine whether snapshots should be kept or not,
if none provided, the current timestamp - 1 day is used. Used in unittesting only!
The scheduled task will not remove snapshots of the current day this way!
:type timestamp: float
:param group_id: ID of the group task. Used to identify which snapshot deletes were called during the scheduled task
:type group_id: str
:return: None
:return: Dict with vdisk guid as key, deleted snapshot ids as value
:rtype: dict
"""
if group_id:
log_id = 'Group job {} - '.format(group_id)
else:
log_id = ''

def format_log(message):
return '{}{}'.format(log_id, message)

GenericController._logger.info(format_log('Delete snapshots started for StorageDriver {0}'.format(storagedriver_guid)))

storagedriver = StorageDriver(storagedriver_guid)
exceptions = []

day = timedelta(1)
week = day * 7
if os.environ.get('RUNNING_UNITTESTS') == 'False':
assert timestamp is None, 'Providing a timestamp is only possible during unittests'

def make_timestamp(offset):
"""
Create an integer based timestamp
:param offset: Offset in days
:return: Timestamp
"""
return int(mktime((base - offset).timetuple()))

# Calculate bucket structure
if timestamp is None:
timestamp = time.time()
base = datetime.fromtimestamp(timestamp).date() - day
buckets = []
# Buckets first 7 days: [0-1[, [1-2[, [2-3[, [3-4[, [4-5[, [5-6[, [6-7[
for i in xrange(0, 7):
buckets.append({'start': make_timestamp(day * i),
'end': make_timestamp(day * (i + 1)),
'type': '1d',
'snapshots': []})
# Week buckets next 3 weeks: [7-14[, [14-21[, [21-28[
for i in xrange(1, 4):
buckets.append({'start': make_timestamp(week * i),
'end': make_timestamp(week * (i + 1)),
'type': '1w',
'snapshots': []})
buckets.append({'start': make_timestamp(week * 4),
'end': 0,
'type': 'rest',
'snapshots': []})

# Get a list of all snapshots that are used as parents for clones
parent_snapshots = set([vd.parentsnapshot for vd in VDiskList.get_with_parent_snaphots()])

# Place all snapshots in bucket_chains
bucket_chains = []
for vdisk_guid in storagedriver.vdisks_guids:
try:
vdisk = VDisk(vdisk_guid)
vdisk.invalidate_dynamics('being_scrubbed')
if vdisk.being_scrubbed:
continue
timestamp = time.time() - timedelta(1).total_seconds()

if vdisk.info['object_type'] in ['BASE']:
bucket_chain = copy.deepcopy(buckets)
for snapshot in vdisk.snapshots:
if snapshot.get('is_sticky') is True:
continue
if snapshot['guid'] in parent_snapshots:
GenericController._logger.info(format_log('Not deleting snapshot {0} because it has clones'.format(snapshot['guid'])))
continue
timestamp = int(snapshot['timestamp'])
for bucket in bucket_chain:
if bucket['start'] >= timestamp > bucket['end']:
bucket['snapshots'].append({'timestamp': timestamp,
'snapshot_id': snapshot['guid'],
'vdisk_guid': vdisk.guid,
'is_consistent': snapshot['is_consistent']})
bucket_chains.append(bucket_chain)
except Exception as ex:
exceptions.append(ex)

# Clean out the snapshot bucket_chains, we delete the snapshots we want to keep
# And we'll remove all snapshots that remain in the buckets
for bucket_chain in bucket_chains:
first = True
for bucket in bucket_chain:
if first is True:
best = None
for snapshot in bucket['snapshots']:
if best is None:
best = snapshot
# Consistent is better than inconsistent
elif snapshot['is_consistent'] and not best['is_consistent']:
best = snapshot
# Newer (larger timestamp) is better than older snapshots
elif snapshot['is_consistent'] == best['is_consistent'] and \
snapshot['timestamp'] > best['timestamp']:
best = snapshot
bucket['snapshots'] = [s for s in bucket['snapshots'] if
s['timestamp'] != best['timestamp']]
first = False
elif bucket['end'] > 0:
oldest = None
for snapshot in bucket['snapshots']:
if oldest is None:
oldest = snapshot
# Older (smaller timestamp) is the one we want to keep
elif snapshot['timestamp'] < oldest['timestamp']:
oldest = snapshot
bucket['snapshots'] = [s for s in bucket['snapshots'] if
s['timestamp'] != oldest['timestamp']]

# Delete obsolete snapshots
for bucket_chain in bucket_chains:
# Each bucket chain represents one vdisk's snapshots
try:
for bucket in bucket_chain:
for snapshot in bucket['snapshots']:
VDiskController.delete_snapshot(vdisk_guid=snapshot['vdisk_guid'],
snapshot_id=snapshot['snapshot_id'])
except RuntimeError as ex:
vdisk_guid = next((snapshot['vdisk_guid'] for bucket in bucket_chain for snapshot in bucket['snapshots']), '')
vdisk_id_log = ''
if vdisk_guid:
vdisk_id_log = ' for VDisk with guid {}'.format(vdisk_guid)
if SCRUB_VDISK_EXCEPTION_MESSAGE in ex.message:
GenericController._logger.warning(format_log('Being scrubbed exception occurred while deleting snapshots{}'.format(vdisk_id_log)))
else:
GenericController._logger.exception(format_log('Exception occurred while deleting snapshots{}'.format(vdisk_id_log)))
exceptions.append(ex)
if exceptions:
raise RuntimeError('Exceptions occurred while deleting snapshots: \n- {}'.format('\n- '.join((str(ex) for ex in exceptions))))
GenericController._logger.info(format_log('Delete snapshots finished for StorageDriver {0}'))
GenericController._logger.info('Delete snapshots started')
storagedriver = StorageDriver(storagedriver_guid)
snapshot_manager = SnapshotManager(storagedriver, group_id)
return snapshot_manager.delete_snapshots(timestamp)

@staticmethod
@ovs_task(name='ovs.generic.execute_scrub', schedule=Schedule(minute='0', hour='3'), ensure_single_info={'mode': 'DEDUPED'})