Skip to content

Fix firstNotUsedWalSegmentNo used by getLocationsFromWals #1944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

fanfuxiaoran
Copy link
Contributor

@fanfuxiaoran fanfuxiaoran commented Mar 24, 2025

When doing backup, the firstNotUsedLsn is not the start of the wal log file,
but the location after XLogLongPageHeaderData, for example 'lsn: 3/A0500028',
so using the following method to compute the last segment file is not correct.

lastUsedLsn := firstNotUsedLsn - 1
lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)

The lastUsedLsn should be 'firstNotUsedLsn - SizeOfXLogLongPHD -1'

In 'getWalSegmentRange', the redo point points to a position in the wal file.
The delta backup will wait for that wal log file, but the wal log file which
contains checkpoint and backup-end XlogRecord will be archived only after
the backup finished, so the delta backup will fail. This is a common case
when this is a few write traffics in the database.

However there is no good method to solve it now. Fallback to the fullbackup
may be the best choice.

For function 'getWalSegmentRange', I think should change
'lastUsedLsn := firstNotUsedLsn - 1'
to
'lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD - 1'
as the redo point is after the XLogLongPageHeaderData
This can make delta backup work if the redo point starts at the begin of a wal
file.

Database name

Postgres

Pull request description

Describe what this PR fixes

// problem is ...

Please provide steps to reproduce (if it's a bug)

// it can really help

Please add config and wal-g stdout/stderr logs for debug purpose

also you can use WALG_LOG_LEVEL=DEVEL for logs collecting

If you can, provide logs

```bash any logs here ```

When doing backup, the firstNotUsedLsn is not the start of the wal log
file, but the location after XLogLongPageHeaderData, for example
'lsn: 3/A0500028', so using the following method to compute the last
segment file is not correct.

lastUsedLsn := firstNotUsedLsn - 1
lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)

The lastUsedLsn should be 'firstNotUsedLsn - SizeOfXLogLongPHD'

Here we can directly get the last not used wal segment by
NewWalSegmentNo(firstNotUsedLSN)
@fanfuxiaoran fanfuxiaoran requested a review from a team as a code owner March 24, 2025 02:34
@x4m
Copy link
Collaborator

x4m commented Apr 2, 2025

@robertmu Hi Robert! Can you please help reviewing and merging this PR?

@robertmu
Copy link
Contributor

robertmu commented Apr 3, 2025

Thanks for your PR, but I think there's a misunderstanding about how firstNotUsedLsn (which is checkpoint.redo) works in WAL logging.

Why the assumption is incorrect

The assumption that firstNotUsedLsn always points to a location after XLogLongPageHeaderData is incorrect. According to the WAL insertion logic, checkpoint.redo can point to different locations:

  1. Most commonly, it points to the space right after the last WAL record in the current page, when there's still enough free space for the next record. In this case, no page header adjustment is needed.

  2. Only when the current page is full (freespace == 0), will it point to a location after a page header, and even then:

    • If it's at the beginning of a segment, it will be after XLogLongPHD
    • Otherwise, it will be after XLogShortPHD

Evidence from PostgreSQL source code

This is further evidenced by PostgreSQL's own macro for calculating which segment contains the LSN that comes right before a given LSN:

#define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
	logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)

This macro shows that PostgreSQL itself uses the same logic of subtracting 1 from an LSN to determine which segment contains the previous LSN position.

Why current implementation is correct

Therefore, the current implementation:

lastUsedLsn := firstNotUsedLsn - 1
lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)

is correct because:

  1. It handles all possible cases of where firstNotUsedLsn might point to
  2. The segment calculation using NewWalSegmentNo will work correctly regardless of the exact position within the segment, as it's only concerned with the segment number

Comparison of implementations

The PR's implementation:

firstNotUsedWalSegmentNo := NewWalSegmentNo(firstNotUsedLSN)

is incorrect because firstNotUsedLSN could point to any position within a WAL segment (after a WAL record, after a long page header, or after a short page header). Simply using this LSN to calculate the segment number might give us the wrong segment.

The original implementation correctly handles this by:

  1. First calculating the last used LSN position (firstNotUsedLsn - 1)
  2. Getting its segment number
  3. Then moving to the next segment

This approach is consistent with PostgreSQL's own implementation (as shown in the XLByteToPrevSeg macro) and correctly handles all possible LSN positions.

@fanfuxiaoran
Copy link
Contributor Author

Thanks for you review and comment, but I cannot agree with you . Let me explain more about it.

As in wal-g, lastUsedLsn is actually the start point of the base backup. Totally different with pg_basebackup.

This is further evidenced by PostgreSQL's own macro for calculating which segment contains the LSN that comes right before a given LSN:

#define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes)
logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)

In pg_basebackup, the xlrp points to the end position of the xlog record of base backup finished , but before switching xlog

 stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);

As it points to the end position of a xlog record , so there is no problem of XLByteToPrevSeg

But in wal-g , in function getDeltaMap

func (bundle *Bundle) DownloadDeltaMap(reader internal.StorageFolderReader, backupStartLSN LSN) error {
	deltaMap, err := getDeltaMap(reader, bundle.Timeline, *bundle.IncrementFromLsn, backupStartLSN)
	if err != nil {
		return err
	}
	bundle.DeltaMap = deltaMap
	return nil
}
func getDeltaMap(reader internal.StorageFolderReader,
	timeline uint32,
	firstUsedLSN,
	firstNotUsedLSN LSN) (PagedFileDeltaMap, error) {

As we can see above, the firstNotUsedLSN is actually backupStartLSN. It is always like 'lsn: 3/A0500028', ends with 28 which is the length of XLogLongPageHeaderData , you can run the following sql to check

select pg_start_backup('baseline');

When start to do base backup, database will firstly switch the xlog, and the start point of base backup points to the next wal file position after the XLogLongPageHeaderData.

So here we cannot use the same method of XLByteToPrevSeg .

@robertmu
Copy link
Contributor

robertmu commented Apr 3, 2025

PostgreSQL is a multi-user system where multiple sessions can execute operations concurrently. This significantly impacts how pg_start_backup() works.

Let me give you an example. When user A executes pg_start_backup and after a WAL log switch occurs (meaning a new logsegno is generated), if user B performs a write operation on a table (which generates WAL records), then the checkpoint.redo position returned by pg_start_backup (what you refer to as the start point) won't be in the form of "3/A0500028". Instead, it will be positioned after XLogLongPageHeaderData plus one or more WAL records.

[XLogLongPageHeaderData (0x28 bytes)]
[WAL Record 1 from Session B]
[WAL Record 2 from Session B]
[WAL Record 3 from Session B] 
[Online CheckPoint Wal record]  <- Actual backup start point(checkpoint.redo position)

This is the checkpoint.redo(start point) in non-concurrent situations
image

This is the checkpoint.redo(start point) in concurrent situations
image

We can clearly see that in concurrent situations, the checkpoint.redo(start point) returned by pg_start_backup does not end with '28'

In this case, according to your algorithm, firstNotUsedWalSegmentNo would be the WAL segment file containing the current checkpoint. This would cause the NewPagedFileDeltaMap.getLocationsFromWals function(firstUsedWalSegmentNo <= logno <firstNotUsedWalSegmentNo) to miss parsing the WAL records before checkpoint.redo, resulting in lost delta information

While the original function is capable of handling all these situations properly

func getWalSegmentRange(firstNotUsedDeltaNo DeltaNo, firstNotUsedLsn LSN) (WalSegmentNo, WalSegmentNo) {
	firstUsedWalSegmentNo := firstNotUsedDeltaNo.firstWalSegmentNo()
	lastUsedLsn := firstNotUsedLsn - 1
	lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)
	return firstUsedWalSegmentNo, lastUsedWalSegmentNo.Next()
}

@x4m
Copy link
Collaborator

x4m commented Apr 3, 2025

Also, keep in minds that nowadays all Postgres backups are made on standbys if they are available. Without checkpoint at all.

@fanfuxiaoran
Copy link
Contributor Author

fanfuxiaoran commented Apr 7, 2025

Thanks for your detailed comments!

PostgreSQL is a multi-user system where multiple sessions can execute operations concurrently. This significantly impacts how pg_start_backup() works.

Let me give you an example. When user A executes pg_start_backup and after a WAL log switch occurs (meaning a new logsegno is generated), if user B performs a write operation on a table (which generates WAL records), then the checkpoint.redo position returned by pg_start_backup (what you refer to as the start point) won't be in the form of "3/A0500028". Instead, it will be positioned after XLogLongPageHeaderData plus one or more WAL records.

[XLogLongPageHeaderData (0x28 bytes)]
[WAL Record 1 from Session B]
[WAL Record 2 from Session B]
[WAL Record 3 from Session B] 
[Online CheckPoint Wal record]  <- Actual backup start point(checkpoint.redo position)

This is the checkpoint.redo(start point) in non-concurrent situations image

This is the checkpoint.redo(start point) in concurrent situations image

We can clearly see that in concurrent situations, the checkpoint.redo(start point) returned by pg_start_backup does not end with '28'

You are right, I misunderstood it.

In this case, according to your algorithm, firstNotUsedWalSegmentNo would be the WAL segment file containing the current checkpoint. This would cause the NewPagedFileDeltaMap.getLocationsFromWals function(firstUsedWalSegmentNo <= logno <firstNotUsedWalSegmentNo) to miss parsing the WAL records before checkpoint.redo, resulting in lost delta information

Yes, my commit to fix this issue is not correct. The 'firstNotUsedWalSegmentNo' should not be used here as it will miss some delta blocks.

While the original function is capable of handling all these situations properly

func getWalSegmentRange(firstNotUsedDeltaNo DeltaNo, firstNotUsedLsn LSN) (WalSegmentNo, WalSegmentNo) {
	firstUsedWalSegmentNo := firstNotUsedDeltaNo.firstWalSegmentNo()
	lastUsedLsn := firstNotUsedLsn - 1
	lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)
	return firstUsedWalSegmentNo, lastUsedWalSegmentNo.Next()
}

If we use the original getWalSegmentRange , the redo point points to a position in the wal file (the wal file will contain checkpoint and 'backup end' related xlog record). The delta backup will wait for that wal log file, but the wal log file which contains checkpoint and backup will be archived only after the backup finished, so the delta backup will fail. This is a common case when this is a few insert traffics in the database.

But there is no good method to solve it now. Fallback to the fullbackup may be the best choice.

For function getWalSegmentRange, I think should change

	lastUsedLsn := firstNotUsedLsn - 1

to

	lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD

as the redo point is after the XLogLongPageHeaderData (backup will switch xlog file).

This can make delta backup work if the redo point starts at the begin of a wal file.
How do you think @x4m @robertmu ?

@fanfuxiaoran
Copy link
Contributor Author

Also, keep in minds that nowadays all Postgres backups are made on standbys if they are available. Without checkpoint at all.

Thanks for pointing it out, I need to check the start point of a backup in this situation.

This reverts commit d1cc3cb.
@fanfuxiaoran
Copy link
Contributor Author

fanfuxiaoran commented Apr 7, 2025

Also, keep in minds that nowadays all Postgres backups are made on standbys if they are available. Without checkpoint at all.

Thanks for pointing it out, I need to check the start point of a backup in this situation.

hi @x4m , I checked the basebackup code in postgresql, found that in recovery mode, xlog will not be switched,
but still request checkpoint.

if (!backup_started_in_recovery)
			RequestXLogSwitch(false);

		do
		{
			bool		checkpointfpw;

			/*
			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
			 * page problems, this guarantees that two successive backup runs
			 * will have different checkpoint positions and hence different
			 * history file names, even if nothing happened in between.
			 *
			 * During recovery, establish a restartpoint if possible. We use
			 * the last restartpoint as the backup starting checkpoint. This
			 * means that two successive backup runs can have same checkpoint
			 * positions.
			 *
			 * Since the fact that we are executing do_pg_start_backup()
			 * during recovery means that checkpointer is running, we can use
			 * RequestCheckpoint() to establish a restartpoint.
			 *
			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
			 * passing fast = true).  Otherwise this can take awhile.
			 */
			RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
							  (fast ? CHECKPOINT_IMMEDIATE : 0));

This would not impact 'getWalSegmentRange ' I think.

@robertmu
Copy link
Contributor

robertmu commented Apr 7, 2025

If we use the original getWalSegmentRange , the redo point points to a position in the wal file (the wal file will contain checkpoint and 'backup end' related xlog record). The delta backup will wait for that wal log file, but the wal log file which contains checkpoint and backup will be archived only after the backup finished, so the delta backup will fail. This is a common case when this is a few insert traffics in the database.

But there is no good method to solve it now. Fallback to the fullbackup may be the best choice.

The issue is that the archiving process is asynchronous, so when wal-g executes backup push, it doesn't know whether the last group of WAL segments has been uploaded. However, it needs to download these WAL segments and parse them to obtain the changed block numbers (page numbers) in order to perform page-level delta backup in the tar composer.

In practice, we often encounter situations where the last WAL segment (containing the redo) hasn't been successfully uploaded, causing a fallback to file mod time-based delta backup. I think adding retry logic for downloading WAL segments might be more appropriate, because the archive process will eventually archive the WAL segments we need.

Can we query PostgreSQL's archive status for each file to determine whether to read the last group of WAL segments from local storage or download them from S3?

In 'getWalSegmentRange', the redo point points to a position in the wal
file. The delta backup will wait for that wal log file, but the wal log
file which contains checkpoint and backup-end XlogRecord will be archived
only after the backup finished, so the delta backup will fail. This is a
common case when this is a few insert traffics in the database.

However there is no good method to solve it now. Fallback to the fullbackup
may be the best choice.

For function 'getWalSegmentRange', I think should change
'lastUsedLsn := firstNotUsedLsn - 1'
to
'lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD - 1'
as the redo point is after the XLogLongPageHeaderData

This can make delta backup work if the redo point starts at the begin of
a wal file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants