Fix firstNotUsedWalSegmentNo used by getLocationsFromWals #1944

fanfuxiaoran · 2025-03-24T02:34:41Z

When doing backup, the firstNotUsedLsn is not the start of the wal log file,
but the location after XLogLongPageHeaderData, for example 'lsn: 3/A0500028',
so using the following method to compute the last segment file is not correct.

lastUsedLsn := firstNotUsedLsn - 1
lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)

The lastUsedLsn should be 'firstNotUsedLsn - SizeOfXLogLongPHD -1'

In 'getWalSegmentRange', the redo point points to a position in the wal file.
The delta backup will wait for that wal log file, but the wal log file which
contains checkpoint and backup-end XlogRecord will be archived only after
the backup finished, so the delta backup will fail. This is a common case
when this is a few write traffics in the database.

However there is no good method to solve it now. Fallback to the fullbackup
may be the best choice.

For function 'getWalSegmentRange', I think should change
'lastUsedLsn := firstNotUsedLsn - 1'
to
'lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD - 1'
as the redo point is after the XLogLongPageHeaderData
This can make delta backup work if the redo point starts at the begin of a wal
file.

Database name

Postgres

Pull request description

Describe what this PR fixes

// problem is ...

Please provide steps to reproduce (if it's a bug)

// it can really help

Please add config and wal-g stdout/stderr logs for debug purpose

also you can use WALG_LOG_LEVEL=DEVEL for logs collecting

If you can, provide logs

```bash any logs here ```

When doing backup, the firstNotUsedLsn is not the start of the wal log file, but the location after XLogLongPageHeaderData, for example 'lsn: 3/A0500028', so using the following method to compute the last segment file is not correct. lastUsedLsn := firstNotUsedLsn - 1 lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn) The lastUsedLsn should be 'firstNotUsedLsn - SizeOfXLogLongPHD' Here we can directly get the last not used wal segment by NewWalSegmentNo(firstNotUsedLSN)

x4m · 2025-04-02T09:06:18Z

@robertmu Hi Robert! Can you please help reviewing and merging this PR?

robertmu · 2025-04-03T02:19:05Z

Thanks for your PR, but I think there's a misunderstanding about how firstNotUsedLsn (which is checkpoint.redo) works in WAL logging.

Why the assumption is incorrect

The assumption that firstNotUsedLsn always points to a location after XLogLongPageHeaderData is incorrect. According to the WAL insertion logic, checkpoint.redo can point to different locations:

Most commonly, it points to the space right after the last WAL record in the current page, when there's still enough free space for the next record. In this case, no page header adjustment is needed.
Only when the current page is full (freespace == 0), will it point to a location after a page header, and even then:
- If it's at the beginning of a segment, it will be after XLogLongPHD
- Otherwise, it will be after XLogShortPHD

Evidence from PostgreSQL source code

This is further evidenced by PostgreSQL's own macro for calculating which segment contains the LSN that comes right before a given LSN:

#define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes) \
	logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)

This macro shows that PostgreSQL itself uses the same logic of subtracting 1 from an LSN to determine which segment contains the previous LSN position.

Why current implementation is correct

Therefore, the current implementation:

lastUsedLsn := firstNotUsedLsn - 1
lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)

is correct because:

It handles all possible cases of where firstNotUsedLsn might point to
The segment calculation using NewWalSegmentNo will work correctly regardless of the exact position within the segment, as it's only concerned with the segment number

Comparison of implementations

The PR's implementation:

firstNotUsedWalSegmentNo := NewWalSegmentNo(firstNotUsedLSN)

is incorrect because firstNotUsedLSN could point to any position within a WAL segment (after a WAL record, after a long page header, or after a short page header). Simply using this LSN to calculate the segment number might give us the wrong segment.

The original implementation correctly handles this by:

First calculating the last used LSN position (firstNotUsedLsn - 1)
Getting its segment number
Then moving to the next segment

This approach is consistent with PostgreSQL's own implementation (as shown in the XLByteToPrevSeg macro) and correctly handles all possible LSN positions.

fanfuxiaoran · 2025-04-03T03:54:13Z

Thanks for you review and comment, but I cannot agree with you . Let me explain more about it.

As in wal-g, lastUsedLsn is actually the start point of the base backup. Totally different with pg_basebackup.

This is further evidenced by PostgreSQL's own macro for calculating which segment contains the LSN that comes right before a given LSN:

#define XLByteToPrevSeg(xlrp, logSegNo, wal_segsz_bytes)
logSegNo = ((xlrp) - 1) / (wal_segsz_bytes)

In pg_basebackup, the xlrp points to the end position of the xlog record of base backup finished , but before switching xlog

 stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END);

As it points to the end position of a xlog record , so there is no problem of XLByteToPrevSeg

But in wal-g , in function getDeltaMap

func (bundle *Bundle) DownloadDeltaMap(reader internal.StorageFolderReader, backupStartLSN LSN) error {
	deltaMap, err := getDeltaMap(reader, bundle.Timeline, *bundle.IncrementFromLsn, backupStartLSN)
	if err != nil {
		return err
	}
	bundle.DeltaMap = deltaMap
	return nil
}
func getDeltaMap(reader internal.StorageFolderReader,
	timeline uint32,
	firstUsedLSN,
	firstNotUsedLSN LSN) (PagedFileDeltaMap, error) {

As we can see above, the firstNotUsedLSN is actually backupStartLSN. It is always like 'lsn: 3/A0500028', ends with 28 which is the length of XLogLongPageHeaderData , you can run the following sql to check

select pg_start_backup('baseline');

When start to do base backup, database will firstly switch the xlog, and the start point of base backup points to the next wal file position after the XLogLongPageHeaderData.

So here we cannot use the same method of XLByteToPrevSeg .

robertmu · 2025-04-03T12:46:27Z

PostgreSQL is a multi-user system where multiple sessions can execute operations concurrently. This significantly impacts how pg_start_backup() works.

Let me give you an example. When user A executes pg_start_backup and after a WAL log switch occurs (meaning a new logsegno is generated), if user B performs a write operation on a table (which generates WAL records), then the checkpoint.redo position returned by pg_start_backup (what you refer to as the start point) won't be in the form of "3/A0500028". Instead, it will be positioned after XLogLongPageHeaderData plus one or more WAL records.

[XLogLongPageHeaderData (0x28 bytes)]
[WAL Record 1 from Session B]
[WAL Record 2 from Session B]
[WAL Record 3 from Session B] 
[Online CheckPoint Wal record]  <- Actual backup start point(checkpoint.redo position)

This is the checkpoint.redo(start point) in non-concurrent situations

This is the checkpoint.redo(start point) in concurrent situations

We can clearly see that in concurrent situations, the checkpoint.redo(start point) returned by pg_start_backup does not end with '28'

In this case, according to your algorithm, firstNotUsedWalSegmentNo would be the WAL segment file containing the current checkpoint. This would cause the NewPagedFileDeltaMap.getLocationsFromWals function(firstUsedWalSegmentNo <= logno <firstNotUsedWalSegmentNo) to miss parsing the WAL records before checkpoint.redo, resulting in lost delta information

While the original function is capable of handling all these situations properly

func getWalSegmentRange(firstNotUsedDeltaNo DeltaNo, firstNotUsedLsn LSN) (WalSegmentNo, WalSegmentNo) {
	firstUsedWalSegmentNo := firstNotUsedDeltaNo.firstWalSegmentNo()
	lastUsedLsn := firstNotUsedLsn - 1
	lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)
	return firstUsedWalSegmentNo, lastUsedWalSegmentNo.Next()
}

x4m · 2025-04-03T14:59:47Z

Also, keep in minds that nowadays all Postgres backups are made on standbys if they are available. Without checkpoint at all.

fanfuxiaoran · 2025-04-07T03:40:01Z

Thanks for your detailed comments!

PostgreSQL is a multi-user system where multiple sessions can execute operations concurrently. This significantly impacts how pg_start_backup() works.

Let me give you an example. When user A executes pg_start_backup and after a WAL log switch occurs (meaning a new logsegno is generated), if user B performs a write operation on a table (which generates WAL records), then the checkpoint.redo position returned by pg_start_backup (what you refer to as the start point) won't be in the form of "3/A0500028". Instead, it will be positioned after XLogLongPageHeaderData plus one or more WAL records.
[XLogLongPageHeaderData (0x28 bytes)]
[WAL Record 1 from Session B]
[WAL Record 2 from Session B]
[WAL Record 3 from Session B] 
[Online CheckPoint Wal record]  <- Actual backup start point(checkpoint.redo position)
This is the checkpoint.redo(start point) in non-concurrent situations

This is the checkpoint.redo(start point) in concurrent situations

We can clearly see that in concurrent situations, the checkpoint.redo(start point) returned by pg_start_backup does not end with '28'

You are right, I misunderstood it.

In this case, according to your algorithm, firstNotUsedWalSegmentNo would be the WAL segment file containing the current checkpoint. This would cause the NewPagedFileDeltaMap.getLocationsFromWals function(firstUsedWalSegmentNo <= logno <firstNotUsedWalSegmentNo) to miss parsing the WAL records before checkpoint.redo, resulting in lost delta information

Yes, my commit to fix this issue is not correct. The 'firstNotUsedWalSegmentNo' should not be used here as it will miss some delta blocks.

While the original function is capable of handling all these situations properly

func getWalSegmentRange(firstNotUsedDeltaNo DeltaNo, firstNotUsedLsn LSN) (WalSegmentNo, WalSegmentNo) {
	firstUsedWalSegmentNo := firstNotUsedDeltaNo.firstWalSegmentNo()
	lastUsedLsn := firstNotUsedLsn - 1
	lastUsedWalSegmentNo := NewWalSegmentNo(lastUsedLsn)
	return firstUsedWalSegmentNo, lastUsedWalSegmentNo.Next()
}

If we use the original getWalSegmentRange , the redo point points to a position in the wal file (the wal file will contain checkpoint and 'backup end' related xlog record). The delta backup will wait for that wal log file, but the wal log file which contains checkpoint and backup will be archived only after the backup finished, so the delta backup will fail. This is a common case when this is a few insert traffics in the database.

But there is no good method to solve it now. Fallback to the fullbackup may be the best choice.

For function getWalSegmentRange, I think should change

	lastUsedLsn := firstNotUsedLsn - 1

to

	lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD

as the redo point is after the XLogLongPageHeaderData (backup will switch xlog file).

This can make delta backup work if the redo point starts at the begin of a wal file.
How do you think @x4m @robertmu ?

This reverts commit 09a8d8c.

This reverts commit 1d10ce4.

fanfuxiaoran · 2025-04-07T03:51:46Z

Also, keep in minds that nowadays all Postgres backups are made on standbys if they are available. Without checkpoint at all.

Thanks for pointing it out, I need to check the start point of a backup in this situation.

This reverts commit d1cc3cb.

fanfuxiaoran · 2025-04-07T10:02:33Z

Also, keep in minds that nowadays all Postgres backups are made on standbys if they are available. Without checkpoint at all.

Thanks for pointing it out, I need to check the start point of a backup in this situation.

hi @x4m , I checked the basebackup code in postgresql, found that in recovery mode, xlog will not be switched,
but still request checkpoint.

if (!backup_started_in_recovery)
			RequestXLogSwitch(false);

		do
		{
			bool		checkpointfpw;

			/*
			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
			 * page problems, this guarantees that two successive backup runs
			 * will have different checkpoint positions and hence different
			 * history file names, even if nothing happened in between.
			 *
			 * During recovery, establish a restartpoint if possible. We use
			 * the last restartpoint as the backup starting checkpoint. This
			 * means that two successive backup runs can have same checkpoint
			 * positions.
			 *
			 * Since the fact that we are executing do_pg_start_backup()
			 * during recovery means that checkpointer is running, we can use
			 * RequestCheckpoint() to establish a restartpoint.
			 *
			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
			 * passing fast = true).  Otherwise this can take awhile.
			 */
			RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
							  (fast ? CHECKPOINT_IMMEDIATE : 0));

This would not impact 'getWalSegmentRange ' I think.

robertmu · 2025-04-07T15:24:24Z

If we use the original getWalSegmentRange , the redo point points to a position in the wal file (the wal file will contain checkpoint and 'backup end' related xlog record). The delta backup will wait for that wal log file, but the wal log file which contains checkpoint and backup will be archived only after the backup finished, so the delta backup will fail. This is a common case when this is a few insert traffics in the database.

But there is no good method to solve it now. Fallback to the fullbackup may be the best choice.

The issue is that the archiving process is asynchronous, so when wal-g executes backup push, it doesn't know whether the last group of WAL segments has been uploaded. However, it needs to download these WAL segments and parse them to obtain the changed block numbers (page numbers) in order to perform page-level delta backup in the tar composer.

In practice, we often encounter situations where the last WAL segment (containing the redo) hasn't been successfully uploaded, causing a fallback to file mod time-based delta backup. I think adding retry logic for downloading WAL segments might be more appropriate, because the archive process will eventually archive the WAL segments we need.

Can we query PostgreSQL's archive status for each file to determine whether to read the last group of WAL segments from local storage or download them from S3?

In 'getWalSegmentRange', the redo point points to a position in the wal file. The delta backup will wait for that wal log file, but the wal log file which contains checkpoint and backup-end XlogRecord will be archived only after the backup finished, so the delta backup will fail. This is a common case when this is a few insert traffics in the database. However there is no good method to solve it now. Fallback to the fullbackup may be the best choice. For function 'getWalSegmentRange', I think should change 'lastUsedLsn := firstNotUsedLsn - 1' to 'lastUsedLsn := firstNotUsedLsn - SizeOfXLogLongPHD - 1' as the redo point is after the XLogLongPageHeaderData This can make delta backup work if the redo point starts at the begin of a wal file.

fanfuxiaoran requested a review from a team as a code owner March 24, 2025 02:34

fanfuxiaoran added 2 commits April 2, 2025 15:06

remove the unused test

1d10ce4

Fix lint

d1cc3cb

fanfuxiaoran added 2 commits April 7, 2025 11:40

Revert "Fix firstNotUsedWalSegmentNo used by getLocationsFromWals"

4b9f40f

This reverts commit 09a8d8c.

Revert "remove the unused test"

477a2c1

This reverts commit 1d10ce4.

Revert "Fix lint"

6dee57a

This reverts commit d1cc3cb.

fanfuxiaoran force-pushed the fix_wal_segfile branch from c2ccfd6 to 7582094 Compare April 8, 2025 03:10

fanfuxiaoran mentioned this pull request Apr 17, 2025

Wal delta fetch fix #710

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix firstNotUsedWalSegmentNo used by getLocationsFromWals #1944

Fix firstNotUsedWalSegmentNo used by getLocationsFromWals #1944

Uh oh!

fanfuxiaoran commented Mar 24, 2025 •

edited

Loading

Uh oh!

x4m commented Apr 2, 2025

Uh oh!

robertmu commented Apr 3, 2025

Uh oh!

fanfuxiaoran commented Apr 3, 2025

Uh oh!

robertmu commented Apr 3, 2025

Uh oh!

x4m commented Apr 3, 2025

Uh oh!

fanfuxiaoran commented Apr 7, 2025 •

edited

Loading

Uh oh!

fanfuxiaoran commented Apr 7, 2025

Uh oh!

fanfuxiaoran commented Apr 7, 2025 •

edited

Loading

Uh oh!

robertmu commented Apr 7, 2025

Uh oh!

Uh oh!

Fix firstNotUsedWalSegmentNo used by getLocationsFromWals #1944

Are you sure you want to change the base?

Fix firstNotUsedWalSegmentNo used by getLocationsFromWals #1944

Uh oh!

Conversation

fanfuxiaoran commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Database name

Pull request description

Describe what this PR fixes

Please provide steps to reproduce (if it's a bug)

Please add config and wal-g stdout/stderr logs for debug purpose

Uh oh!

x4m commented Apr 2, 2025

Uh oh!

robertmu commented Apr 3, 2025

Why the assumption is incorrect

Evidence from PostgreSQL source code

Why current implementation is correct

Comparison of implementations

Uh oh!

fanfuxiaoran commented Apr 3, 2025

Uh oh!

robertmu commented Apr 3, 2025

Uh oh!

x4m commented Apr 3, 2025

Uh oh!

fanfuxiaoran commented Apr 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fanfuxiaoran commented Apr 7, 2025

Uh oh!

fanfuxiaoran commented Apr 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robertmu commented Apr 7, 2025

Uh oh!

Uh oh!

fanfuxiaoran commented Mar 24, 2025 •

edited

Loading

fanfuxiaoran commented Apr 7, 2025 •

edited

Loading

fanfuxiaoran commented Apr 7, 2025 •

edited

Loading