Fix firstNotUsedWalSegmentNo used by getLocationsFromWals #1944
When taking a backup, firstNotUsedLsn is not the start of the WAL file but the location just after XLogLongPageHeaderData, for example 'lsn: 3/A0500028'. Computing the last used segment as follows is therefore not correct:
lastUsedLsn := firstNotUsedLsn - 1
lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)
The lastUsedLsn should be 'firstNotUsedLsn - SizeOfXLogLongPHD'. Here we can get the first not-used WAL segment directly with NewWalSegmentNo(firstNotUsedLSN).
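To make the example LSN concrete, here is a minimal sketch that parses '3/A0500028' and computes its offset inside a WAL page. The helper `parseLSN` and the constant `xlogBlockSize` are hypothetical names for illustration, not wal-g identifiers, and the 8192-byte page size is an assumption (PostgreSQL's default XLOG_BLCKSZ).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xlogBlockSize is an assumed WAL page size (PostgreSQL's default XLOG_BLCKSZ).
const xlogBlockSize uint64 = 8192

// parseLSN converts the textual "hi/lo" form of an LSN into a 64-bit byte position.
// It is a hypothetical helper written for this sketch only.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("malformed LSN %q", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, err
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, err
	}
	return h<<32 | l, nil
}

func main() {
	lsn, err := parseLSN("3/A0500028")
	if err != nil {
		panic(err)
	}
	// The offset inside the page is the LSN modulo the page size.
	fmt.Printf("lsn=%#x offsetInPage=%d\n", lsn, lsn%xlogBlockSize)
}
```

The page offset comes out as 40 (0x28), which matches the size of XLogLongPageHeaderData on typical 64-bit builds; that size is also an assumption here rather than something stated in this PR.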
@robertmu Hi Robert! Could you please help review and merge this PR?
Thanks for your PR, but I think there's a misunderstanding about how
Why the assumption is incorrect
The assumption that
Evidence from PostgreSQL source code
This is further evidenced by PostgreSQL's own macro for calculating which segment contains the LSN that comes right before a given LSN:
    #define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
        logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)
This macro shows that PostgreSQL itself uses the same logic of subtracting 1 from an LSN to determine which segment contains the previous LSN position.
Why current implementation is correct
Therefore, the current implementation:
    lastUsedLsn := firstNotUsedLsn - 1
    lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)
is correct because:
Comparison of implementations
The PR's implementation:
    firstNotUsedWalSegmentNo := NewWalSegmentNo(firstNotUsedLSN)
is incorrect because
The original implementation correctly handles this by:
This approach is consistent with PostgreSQL's own implementation (as shown in the
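The difference between the two segment computations only shows up when the LSN sits exactly on a segment boundary. Below is a rough Go transliteration of the macro logic; `byteToSeg`, `byteToPrevSeg` and `walSegmentSize` are hypothetical names for this sketch, and the 16 MiB segment size is an assumption (PostgreSQL's default).

```go
package main

import "fmt"

// walSegmentSize assumes PostgreSQL's default 16 MiB WAL segment size.
const walSegmentSize uint64 = 16 << 20

// byteToSeg returns the segment that contains lsn.
func byteToSeg(lsn uint64) uint64 { return lsn / walSegmentSize }

// byteToPrevSeg returns the segment that contains the byte just before lsn,
// mirroring the (xlrp - 1) / wal_segsz_bytes logic. Like the macro, it is
// not meaningful for lsn == 0.
func byteToPrevSeg(lsn uint64) uint64 { return (lsn - 1) / walSegmentSize }

func main() {
	boundary := uint64(929) * walSegmentSize // first byte of segment 929
	for _, lsn := range []uint64{boundary, boundary + 40} {
		fmt.Printf("lsn=%#x seg=%d prevSeg=%d\n", lsn, byteToSeg(lsn), byteToPrevSeg(lsn))
	}
}
```

On the exact boundary the two functions disagree (929 versus 928); 40 bytes past the boundary they both return 929, which is the case this PR is concerned with.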
Thanks for your review and comments, but I cannot agree with you. Let me explain in more detail. As in wal-g,
In pg_basebackup, the
As it points to the end position of an xlog record, there is no problem of
But in wal-g, in function
As we can see above, the
When a base backup starts, the database first switches the xlog, and the start point of the base backup points to the next WAL file position after the
So here we cannot use the same method of
Also, keep in mind that nowadays all Postgres backups are made on standbys if they are available, without a checkpoint at all.
Thanks for your detailed comments!
You are right, I misunderstood it.
Yes, my commit to fix this issue is not correct. 'firstNotUsedWalSegmentNo' should not be used here, as it would miss some delta blocks.
If we use the original
But there is no good method to solve it now. Falling back to a full backup may be the best choice. For function 'getWalSegmentRange', I think we should change 'lastUsedLsn := firstNotUsedLsn - 1'
to 'lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD - 1',
as the redo point is after the XLogLongPageHeaderData. This can make delta backup work if the redo point starts at the beginning of a WAL file.
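A small sketch of what the proposed subtraction changes, under stated assumptions: a 16 MiB segment size and a 40-byte long page header. `lastUsedSegment` and both constants are hypothetical names for illustration, not the wal-g implementation.

```go
package main

import "fmt"

const (
	walSegmentSize    uint64 = 16 << 20 // assumed default WAL segment size
	sizeOfXLogLongPHD uint64 = 40       // assumed size of XLogLongPageHeaderData
)

// lastUsedSegment returns the segment of the last used LSN.
// With skipLongHeader set, it applies the proposed
// 'firstNotUsedLsn - SizeOfXLogLongPHD - 1' variant.
func lastUsedSegment(firstNotUsedLsn uint64, skipLongHeader bool) uint64 {
	lastUsedLsn := firstNotUsedLsn - 1
	if skipLongHeader {
		lastUsedLsn -= sizeOfXLogLongPHD
	}
	return lastUsedLsn / walSegmentSize
}

func main() {
	// Redo point right after the long page header of segment 929.
	redo := uint64(929)*walSegmentSize + sizeOfXLogLongPHD
	fmt.Println(lastUsedSegment(redo, false), lastUsedSegment(redo, true))

	// Redo point in the middle of segment 929: both variants agree.
	mid := uint64(929)*walSegmentSize + 0x500028
	fmt.Println(lastUsedSegment(mid, false), lastUsedSegment(mid, true))
}
```

Only the first case differs: the current formula waits for segment 929, while the proposed one stops at 928. When the redo point is anywhere else in the segment, both give the same answer.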
Thanks for pointing it out, I need to check the start point of a backup in this situation.
This reverts commit d1cc3cb.
Hi @x4m, I checked the basebackup code in PostgreSQL and found that in recovery mode the xlog will not be switched,
I think this would not impact 'getWalSegmentRange'.
The issue is that the archiving process is asynchronous, so when wal-g executes backup push, it doesn't know whether the last group of WAL segments has been uploaded. However, it needs to download these WAL segments and parse them to obtain the changed block numbers (page numbers) in order to perform page-level delta backup in the tar composer. In practice, we often encounter situations where the last WAL segment (containing the redo point) hasn't been uploaded yet, causing a fallback to delta backup based on file modification time. I think adding retry logic for downloading WAL segments might be more appropriate, because the archive process will eventually archive the WAL segments we need. Can we query PostgreSQL's archive status for each file to determine whether to read the last group of WAL segments from local storage or download them from S3?
In 'getWalSegmentRange', the redo point points to a position inside a WAL file. The delta backup waits for that WAL file, but the WAL file that contains the checkpoint and backup-end XLOG records is archived only after the backup has finished, so the delta backup fails. This is a common case when there is little insert traffic in the database. However, there is no good method to solve it now. Falling back to a full backup may be the best choice. For function 'getWalSegmentRange', I think we should change 'lastUsedLsn := firstNotUsedLsn - 1' to 'lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD - 1', as the redo point is after the XLogLongPageHeaderData. This can make delta backup work if the redo point starts at the beginning of a WAL file.
c2ccfd6 to 7582094
When doing a backup, the firstNotUsedLsn is not the start of the WAL file,
but the location after XLogLongPageHeaderData, for example 'lsn: 3/A0500028',
so using the following method to compute the last segment file is not correct.
lastUsedLsn := firstNotUsedLsn - 1
lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)
The lastUsedLsn should be 'firstNotUsedLsn - SizeOfXLogLongPHD - 1'
Database name
Postgres