4 changes: 2 additions & 2 deletions docs/scheduledtasks.md
@@ -16,8 +16,8 @@ Example configuration:
```


To disblale the automatic scrubbing job add `"ovs.generic.execute_scrub": null` to the JSON object.
In case you want to change the schedule for the ALBA backend verifictaion process which checks the state of each object in the backend, add `"alba.verify_namespaces": {"minute": "0", "hour": "0", "month_of_year": "*/X"}` where X is the amount of months between each run.
To disable the automatic scrubbing job add `"ovs.generic.execute_scrub": null` to the JSON object.
In case you want to change the schedule for the ALBA backend verification process, which checks the state of each object in the backend, add `"alba.verify_namespaces": {"minute": "0", "hour": "0", "month_of_year": "*/X"}` where X is the number of months between each run.


In case the configuration cannot be parsed at all (e.g. invalid JSON), the code will fall back to the hardcoded schedule. If the crontab arguments are invalid (e.g. they contain an unsupported key), the task will be disabled.
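Putting the options above together, a hypothetical configuration that disables the scrubbing job and runs the backend verification every 3 months would look like:
```
{
    "ovs.generic.execute_scrub": null,
    "alba.verify_namespaces": {"minute": "0", "hour": "0", "month_of_year": "*/3"}
}
```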
91 changes: 91 additions & 0 deletions docs/snapshots.md
@@ -0,0 +1,91 @@
### Snapshot management
By default, the Framework creates a snapshot of every vDisk every hour (this can be adjusted; see docs/scheduledtasks.md).

To keep the snapshots manageable over time, the Framework schedules a daily clean-up that enforces a retention policy.
This automatic task will:
- Create an overview of all the snapshots for every volume
- Skip the first 24 hours, allowing users to create as many snapshots as they want during the day (see the sketch after this list)
- Enforce the retention policy
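The 24-hour grace period amounts to computing a cutoff timestamp; a minimal sketch, mirroring the `time.time() - timedelta(1).total_seconds()` expression used in the implementation:
```
import time
from datetime import timedelta

# Snapshots newer than this cutoff fall within the 24-hour grace period
# and are never considered for deletion by the clean-up task.
cutoff = time.time() - timedelta(days=1).total_seconds()
```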

The default retention policy is:
- A single snapshot is kept per day for the first 7 days
- For the first day in the policy (which is 2 days back from now, due to the 24-hour grace period), consistent snapshots are prioritized over older ones
- A single snapshot is kept for the 2nd, 3rd and 4th week, leaving one snapshot per week for the first month
- All older snapshots are discarded

#### Configuring the retention policy
A retention policy can be configured so that the scheduled task enforces a different one than the default.

It can be customized on:
- Global level: enforces the policy for all vDisks within the cluster
- vPool level: overrides the global level, enforced for all vDisks within the vPool
- vDisk level: overrides the global and vPool levels, enforced for this vDisk only

A retention policy is written as a list of policies. A policy minimally consists of `nr_of_snapshots`, the number of snapshots to keep over the given `nr_of_days`, and `nr_of_days`, the number of days to spread the `nr_of_snapshots` over. This notation allows for fine-grained control while remaining easy to configure. Since we are working with plain days, *monthly and weekly policies will not follow the calendar!*
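For instance, a minimal policy that keeps one snapshot per day for three days would be written as:
```
[{'nr_of_days': 3, 'nr_of_snapshots': 3}]
```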

Two additional options are available. `consistency_first` indicates that:
- The policy has to keep the oldest consistent snapshot instead of the oldest one
- When no consistent snapshot is found, it falls back to the oldest snapshot

If a policy interval spans multiple days, `consistency_first_on` can be configured to narrow down the days on which the `consistency_first` rule applies. This option takes a list of day numbers.
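To make these semantics concrete, here is a simplified, hypothetical sketch of how such a policy can be enforced with time buckets. The `enforce` helper below is illustrative only: the shipped `SnapshotManager` is more elaborate, and `consistency_first_on` is left out for brevity:
```
import time

DAY = 24 * 60 * 60  # seconds

def enforce(policies, snapshots, now=None):
    # snapshots: list of dicts with a 'timestamp' (epoch seconds)
    # and an 'is_consistent' (bool) key, as in vdisk.snapshots
    now = now if now is not None else time.time()
    start = now - DAY  # the first 24 hours are always skipped
    keep = set()
    for policy in policies:
        interval = float(policy['nr_of_days']) * DAY / policy['nr_of_snapshots']
        for i in range(policy['nr_of_snapshots']):
            upper = start - i * interval
            lower = start - (i + 1) * interval
            bucket = [s for s in snapshots if lower < s['timestamp'] <= upper]
            if not bucket:
                continue
            consistent = [s for s in bucket if s['is_consistent']]
            # With consistency_first, prefer the oldest consistent snapshot
            # and fall back to the oldest snapshot of the bucket
            candidates = consistent if policy.get('consistency_first') and consistent else bucket
            keep.add(min(candidates, key=lambda s: s['timestamp'])['timestamp'])
        start -= policy['nr_of_days'] * DAY
    # Keep the grace period and the bucket survivors; everything else,
    # including snapshots older than the last policy window, is deleted
    return [s for s in snapshots if s['timestamp'] > now - DAY or s['timestamp'] in keep]
```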


If we were to write out the default retention policy, it would look like:
```
[# one per day for the week and opt for a consistent snapshot for the first day
{'nr_of_snapshots': 7, 'nr_of_days': 7, 'consistency_first': True, 'consistency_first_on': [1]},
# One per week for the rest of the month
{'nr_of_snapshots': 3, 'nr_of_days': 21}]
```

Configuring it on different levels can be done using the API:
- Global level: POST to: `/storagerouters/<storagerouter_guid>/global_snapshot_retention_policy`
- vPool level: POST to: `/vpools/<vpool_guid>/snapshot_retention_policy`
- vDisk level: POST to: `/vdisks/<vdisk_guid>/snapshot_retention_policy`
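As an illustration, a vDisk-level policy could be applied with a plain HTTP call. A sketch using Python's `requests`, assuming the policy list is posted as the JSON request body and that a valid API token is at hand (both are assumptions, not confirmed here):
```
import requests

policy = [{'nr_of_days': 7, 'nr_of_snapshots': 7}]
response = requests.post(
    'https://<ovs-api>/vdisks/<vdisk_guid>/snapshot_retention_policy',
    json=policy,  # assumed payload shape: the policy list itself
    headers={'Authorization': 'Bearer <token>'})  # assumed auth scheme
response.raise_for_status()
```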

##### Examples:
The examples simplify a week as 7 days and a month as 4 * 7 days.

I wish to keep hourly snapshots for the first week
```
[{'nr_of_days': 7, # A week spans 7 days
'nr_of_snapshots': 168}] # Keep 24 snapshots per day for 7 days: 7 * 24
```
I wish to keep hourly snapshots for the first week and one per week for the rest of the year
```
[ # First policy
{'nr_of_days': 7, # A week spans 7 days
'nr_of_snapshots': 7 * 24}, # Keep 24 snapshots per day for 7 days: 7 * 24
# Second policy
{'nr_of_days': 7 * (52 - 1), # The first week is already covered by the previous policy, so 52 - 1 weeks remaining
'nr_of_snapshots': 1 * (52 - 1)}
]
```

A production use case could be:
```
[ # First policy - keep the first 24 snapshots
{'nr_of_days': 1,
 'nr_of_snapshots': 24},
# Second policy - keep 4 snapshots a day for the remainder of the week (6 leftover days)
{'nr_of_days': 6,
 'nr_of_snapshots': 4 * 6},
# Third policy - keep 1 snapshot per day for the 3 weeks to come
{'nr_of_days': 3 * 7,
 'nr_of_snapshots': 3 * 7},
# Fourth policy - keep 1 snapshot per week for the next 5 months
{'nr_of_days': 4 * 7 * 5, # Use the week notation to avoid issues (4 * 7 days = 1 month)
 'nr_of_snapshots': 4 * 5}, # 1 snapshot per week, 4 weeks per month, 5 months
# Fifth policy - the first 6 months are covered by now - keep a snapshot every 6 months until 2 years have passed
{'nr_of_days': (4 * 7) * (6 * 3), # 3 half-year periods of 4 * 7 * 6 days: 18 months
 'nr_of_snapshots': 3}
] # Total: 1 + 6 + 21 + 140 + 504 = 672 days = 24 four-week months = 2 years
```
9 changes: 9 additions & 0 deletions ovs/constants/vdisk.py
@@ -17,10 +17,19 @@
"""
VDisk Constants module. Contains constants related to vdisks
"""
import os

# General
LOCK_NAMESPACE = 'ovs_locks'

# Scrub related
SCRUB_VDISK_LOCK = '{0}_{{0}}'.format(LOCK_NAMESPACE) # Second format is the vdisk guid
SCRUB_VDISK_EXCEPTION_MESSAGE = 'VDisk is being scrubbed. Unable to remove snapshots at this time'

# Snapshot related
# Note: the scheduled task will always skip the first 24 hours before enforcing the policy
SNAPSHOT_POLICY_DEFAULT = [# one per day for rest of the week and opt for a consistent snapshot for the first day
{'nr_of_snapshots': 7, 'nr_of_days': 7, 'consistency_first': True, 'consistency_first_on': [1]},
# One per week for the rest of the month
{'nr_of_snapshots': 3, 'nr_of_days': 21}]
SNAPSHOT_POLICY_LOCATION = os.path.join(os.path.sep, 'ovs', 'cluster', 'snapshot_retention_policy')
3 changes: 2 additions & 1 deletion ovs/dal/hybrids/vdisk.py
@@ -51,7 +51,8 @@ class VDisk(DataObject):
Property('pagecache_ratio', float, default=1.0, doc='Ratio of the volume\'s metadata pages that needs to be cached'),
Property('metadata', dict, default=dict(), doc='Contains fixed metadata about the volume (e.g. lba_size, ...)'),
Property('cache_quota', dict, mandatory=False, doc='Maximum caching space(s) this volume can consume (in Bytes) per cache type. If not None, the caching(s) for this volume has been set manually'),
Property('scrubbing_information', dict, mandatory=False, doc='Scrubbing metadata set by scrubber with an expiration date')]
Property('scrubbing_information', dict, mandatory=False, doc='Scrubbing metadata set by scrubber with an expiration date'),
Property('snapshot_retention_policy', list, mandatory=False, doc='Snapshot retention policy configuration')]
__relations = [Relation('vpool', VPool, 'vdisks'),
Relation('parent_vdisk', None, 'child_vdisks', mandatory=False)]
__dynamics = [Dynamic('dtl_status', str, 60),
3 changes: 2 additions & 1 deletion ovs/dal/hybrids/vpool.py
@@ -46,7 +46,8 @@ class VPool(DataObject):
Property('metadata', dict, mandatory=False, doc='Metadata for the backends, as used by the Storage Drivers.'),
Property('rdma_enabled', bool, default=False, doc='Has the vpool been configured to use RDMA for DTL transport, which is only possible if all storagerouters are RDMA capable'),
Property('status', STATUSES.keys(), doc='Status of the vPool'),
Property('metadata_store_bits', int, mandatory=False, doc='StorageDrivers deployed for this vPool will make use of this amount of metadata store bits')]
Property('metadata_store_bits', int, mandatory=False, doc='StorageDrivers deployed for this vPool will make use of this amount of metadata store bits'),
Property('snapshot_retention_policy', list, mandatory=False, doc='Snapshot retention policy configuration')]
__relations = []
__dynamics = [Dynamic('configuration', dict, 3600),
Dynamic('statistics', dict, 4),
158 changes: 22 additions & 136 deletions ovs/lib/generic.py
@@ -23,13 +23,10 @@
from celery import group
from celery.utils import uuid
from celery.result import GroupResult
from datetime import datetime, timedelta
from datetime import timedelta
from threading import Thread
from time import mktime
from ovs.constants.vdisk import SCRUB_VDISK_EXCEPTION_MESSAGE
from ovs.dal.hybrids.servicetype import ServiceType
from ovs.dal.hybrids.storagedriver import StorageDriver
from ovs.dal.hybrids.vdisk import VDisk
from ovs.dal.lists.servicelist import ServiceList
from ovs.dal.lists.storagedriverlist import StorageDriverList
from ovs.dal.lists.storagerouterlist import StorageRouterList
@@ -41,6 +38,7 @@
from ovs.lib.helpers.toolbox import Toolbox, Schedule
from ovs.lib.vdisk import VDiskController
from ovs.log.log_handler import LogHandler
from ovs.lib.helpers.generic.snapshots import SnapshotManager


class GenericController(object):
@@ -90,160 +88,48 @@ def delete_snapshots(timestamp=None):
:return: The GroupResult
:rtype: GroupResult
"""
if os.environ.get('RUNNING_UNITTESTS') == 'False':
assert timestamp is None, 'Providing a timestamp is only possible during unittests'

# The result cannot be fetched in this task
group_id = uuid()
return group(GenericController.delete_snapshots_storagedriver.s(storagedriver.guid, timestamp, group_id)
for storagedriver in StorageDriverList.get_storagedrivers()).apply_async(task_id=group_id)

@staticmethod
@ovs_task(name='ovs.generic.delete_snapshots_storagedriver', ensure_single_info={'mode': 'DEDUPED'})
@ovs_task(name='ovs.generic.delete_snapshots_storagedriver', ensure_single_info={'mode': 'DEDUPED', 'ignore_arguments': ['timestamp', 'group_id']})
def delete_snapshots_storagedriver(storagedriver_guid, timestamp=None, group_id=None):
# type: (str, float, str) -> Dict[str, List[str]]
"""
Delete snapshots per storagedriver & scrubbing policy
Delete snapshots & scrubbing policy

Implemented delete snapshot policy:
Implemented default delete snapshot policy:
< 1d | 1d bucket | 1 | best of bucket | 1d
< 1w | 1d bucket | 6 | oldest of bucket | 7d = 1w
< 1m | 1w bucket | 3 | oldest of bucket | 4w = 1m
> 1m | delete

The configured policy can differ from this one.
:param storagedriver_guid: Guid of the StorageDriver to remove snapshots on
:type storagedriver_guid: str
:param timestamp: Timestamp to determine whether snapshots should be kept or not, if none provided, current time will be used
:param timestamp: Timestamp to determine whether snapshots should be kept or not,
if none provided, the current timestamp - 1 day is used. Used in unittesting only!
The scheduled task will not remove snapshots of the current day this way!
:type timestamp: float
:param group_id: ID of the group task. Used to identify which snapshot deletes were called during the scheduled task
:type group_id: str
:return: None
:return: Dict with vdisk guid as key, deleted snapshot ids as value
:rtype: dict
"""
if group_id:
log_id = 'Group job {} - '.format(group_id)
else:
log_id = ''

def format_log(message):
return '{}{}'.format(log_id, message)

GenericController._logger.info(format_log('Delete snapshots started for StorageDriver {0}'.format(storagedriver_guid)))

storagedriver = StorageDriver(storagedriver_guid)
exceptions = []

day = timedelta(1)
week = day * 7
if os.environ.get('RUNNING_UNITTESTS') == 'False':
assert timestamp is None, 'Providing a timestamp is only possible during unittests'

def make_timestamp(offset):
"""
Create an integer based timestamp
:param offset: Offset in days
:return: Timestamp
"""
return int(mktime((base - offset).timetuple()))

# Calculate bucket structure
if timestamp is None:
timestamp = time.time()
base = datetime.fromtimestamp(timestamp).date() - day
buckets = []
# Buckets first 7 days: [0-1[, [1-2[, [2-3[, [3-4[, [4-5[, [5-6[, [6-7[
for i in xrange(0, 7):
buckets.append({'start': make_timestamp(day * i),
'end': make_timestamp(day * (i + 1)),
'type': '1d',
'snapshots': []})
# Week buckets next 3 weeks: [7-14[, [14-21[, [21-28[
for i in xrange(1, 4):
buckets.append({'start': make_timestamp(week * i),
'end': make_timestamp(week * (i + 1)),
'type': '1w',
'snapshots': []})
buckets.append({'start': make_timestamp(week * 4),
'end': 0,
'type': 'rest',
'snapshots': []})

# Get a list of all snapshots that are used as parents for clones
parent_snapshots = set([vd.parentsnapshot for vd in VDiskList.get_with_parent_snaphots()])

# Place all snapshots in bucket_chains
bucket_chains = []
for vdisk_guid in storagedriver.vdisks_guids:
try:
vdisk = VDisk(vdisk_guid)
vdisk.invalidate_dynamics('being_scrubbed')
if vdisk.being_scrubbed:
continue
timestamp = time.time() - timedelta(1).total_seconds()

if vdisk.info['object_type'] in ['BASE']:
bucket_chain = copy.deepcopy(buckets)
for snapshot in vdisk.snapshots:
if snapshot.get('is_sticky') is True:
continue
if snapshot['guid'] in parent_snapshots:
GenericController._logger.info(format_log('Not deleting snapshot {0} because it has clones'.format(snapshot['guid'])))
continue
timestamp = int(snapshot['timestamp'])
for bucket in bucket_chain:
if bucket['start'] >= timestamp > bucket['end']:
bucket['snapshots'].append({'timestamp': timestamp,
'snapshot_id': snapshot['guid'],
'vdisk_guid': vdisk.guid,
'is_consistent': snapshot['is_consistent']})
bucket_chains.append(bucket_chain)
except Exception as ex:
exceptions.append(ex)

# Clean out the snapshot bucket_chains, we delete the snapshots we want to keep
# And we'll remove all snapshots that remain in the buckets
for bucket_chain in bucket_chains:
first = True
for bucket in bucket_chain:
if first is True:
best = None
for snapshot in bucket['snapshots']:
if best is None:
best = snapshot
# Consistent is better than inconsistent
elif snapshot['is_consistent'] and not best['is_consistent']:
best = snapshot
# Newer (larger timestamp) is better than older snapshots
elif snapshot['is_consistent'] == best['is_consistent'] and \
snapshot['timestamp'] > best['timestamp']:
best = snapshot
bucket['snapshots'] = [s for s in bucket['snapshots'] if
s['timestamp'] != best['timestamp']]
first = False
elif bucket['end'] > 0:
oldest = None
for snapshot in bucket['snapshots']:
if oldest is None:
oldest = snapshot
# Older (smaller timestamp) is the one we want to keep
elif snapshot['timestamp'] < oldest['timestamp']:
oldest = snapshot
bucket['snapshots'] = [s for s in bucket['snapshots'] if
s['timestamp'] != oldest['timestamp']]

# Delete obsolete snapshots
for bucket_chain in bucket_chains:
# Each bucket chain represents one vdisk's snapshots
try:
for bucket in bucket_chain:
for snapshot in bucket['snapshots']:
VDiskController.delete_snapshot(vdisk_guid=snapshot['vdisk_guid'],
snapshot_id=snapshot['snapshot_id'])
except RuntimeError as ex:
vdisk_guid = next((snapshot['vdisk_guid'] for bucket in bucket_chain for snapshot in bucket['snapshots']), '')
vdisk_id_log = ''
if vdisk_guid:
vdisk_id_log = ' for VDisk with guid {}'.format(vdisk_guid)
if SCRUB_VDISK_EXCEPTION_MESSAGE in ex.message:
GenericController._logger.warning(format_log('Being scrubbed exception occurred while deleting snapshots{}'.format(vdisk_id_log)))
else:
GenericController._logger.exception(format_log('Exception occurred while deleting snapshots{}'.format(vdisk_id_log)))
exceptions.append(ex)
if exceptions:
raise RuntimeError('Exceptions occurred while deleting snapshots: \n- {}'.format('\n- '.join((str(ex) for ex in exceptions))))
GenericController._logger.info(format_log('Delete snapshots finished for StorageDriver {0}'))
GenericController._logger.info('Delete snapshots started')
storagedriver = StorageDriver(storagedriver_guid)
snapshot_manager = SnapshotManager(storagedriver, group_id)
return snapshot_manager.delete_snapshots(timestamp)

@staticmethod
@ovs_task(name='ovs.generic.execute_scrub', schedule=Schedule(minute='0', hour='3'), ensure_single_info={'mode': 'DEDUPED'})