10.2.12 scitags backport #7788

Open
wants to merge 58 commits into
base: master

marian-babik
Contributor

This is a backport of the scitags functionality to 10.2.12 (PR created just to build the RPMs).

kofemann and others added 30 commits October 9, 2024 10:50
Motivation:
If a pool holding the file is offline and a tape copy is available, then
dCache triggers a stage and waits until the file is restored on disk.
However, if the pool becomes available again, the stage request is not
interrupted and the client keeps waiting for tape.

Modification:
Update request container 'onPoolUp' logic to retry the request if
the file is expected to be on that pool. Added a unit test to validate the
behavior.

Result:
Pool selection succeeds when a pool with the file comes online,
despite the ongoing stage request.

NOTE (1): the stage request is not interrupted
NOTE (2): if the newly enabled pool doesn't contain the expected file, then
a double stage is very likely.

Target: master
Acked-by: Lea Morschel
Require-book: no
Require-notes: yes
(cherry picked from commit f945c3d)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation:
If a file has the QoS policy "HSM-only", then dCache should reject access
to the file, even if the file is still available on disk (as the transition
might take some time).

Modification:
- Update QoS policy engine to reflect policy in the namespace attributes
- Update Transfer to request the policy state (as an indicator of the applied
  policy) and reject read transfers if the access latency is nearline.

Result:
dCache behavior compliant with selected QoS policy.

Acked-by: Lea Morschel
Acked-by: Dmitry Litvintsev
Target: master
Require-book: no
Require-notes: yes
(cherry picked from commit 1a3c12e)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
Bumps org.eclipse.jetty:jetty-servlets from 9.4.52.v20230823 to 9.4.54.v20240208.

---
updated-dependencies:
- dependency-name: org.eclipse.jetty:jetty-servlets
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
(cherry picked from commit c18934e)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
Bumps org.eclipse.jetty:jetty-server from 9.4.52.v20230823 to 9.4.55.v20240627.

---
updated-dependencies:
- dependency-name: org.eclipse.jetty:jetty-server
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
(cherry picked from commit 8254b67)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation:
match dcache.org page colors

Modification:
pulled colors into css variables
synced colors across pages
fixed html generation

Result:
new look-and-feel

Acked-by: Lea Morschel
Target: master, 10.2
Require-book: no
Require-notes: yes
(cherry picked from commit 7e2040e)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation:
----------
When using xrootd doors behind HAProxy with
`xrootd.enable.proxy-protocol=true`, it has been discovered
that
```
xrdcp --cksum adler32:<value> <source> <destination>
```
hangs after the upload has completed and then eventually fails
after a timeout. This is due to the xrootd door reporting the actual
door address to the client.

Modification:
-------------
Return destination address (that is haproxy address) if
`xrootd.enable.proxy-protocol=true` is set.

Result:
-------
```
xrdcp --cksum adler32:<value> <source> <destination>
```
works as expected (and likely many other similar commands)

Target: trunk
Request: 10.*
Request: 9.2
Patch: https://rb.dcache.org/r/14338/
Acked-by: Tigran
Require-book: no
Require-notes: yes
Signed-off-by: Dmitry Litvintsev <[email protected]>
(cherry picked from commit e98ab94)
Motivation:
Scheduled executors are not shut down cleanly, so running threads are
logged when dCache stops.

Modification:
Shut down ConcurrentRequestManager and adjuster-executor when dCache is
stopped.

Result:
Fewer stack traces in logs.

Acked-by: Lea Morschel
Target: master, 10.2
Require-book: no
Require-notes: yes
(cherry picked from commit 5f67bdd)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
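A minimal sketch of the general shutdown pattern this kind of fix relies on, using only `java.util.concurrent`. The class and method names (`ExecutorShutdown`, `shutdownCleanly`) are hypothetical and are not taken from the dCache sources; how ConcurrentRequestManager actually stops its executors is not shown in this commit message.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not dCache code: stop an executor in two phases.
final class ExecutorShutdown {

    // Returns true if all tasks terminated within the given timeouts.
    static boolean shutdownCleanly(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // stop accepting new tasks, let queued ones finish
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // interrupt tasks that are still running
            return executor.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        }
    }
}
```

Calling such a helper from the component's stop hook leaves no scheduler threads alive to be reported at shutdown.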
Motivation:
-----------
Users reported 2 day pin lifetime on staged files (which is a default)
despite specifying different values. This is due to failure to match
target key on truncated vs prefixed path in target argument map keyed on
un-prefixed paths.

Modification:
-------------
Use full target paths throughout the system. Make sure to strip prefix
when exposing paths to users.

Result:
-------
Observe correct user specified pin lifetime.

Ticket: dCache#7687
Patch: https://rb.dcache.org/r/14339/
Target: trunk
Request: 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes

Signed-off-by: Dmitry Litvintsev <[email protected]>
Motivation:
When Zookeeper updates core domain infos, dCache will first kill the existing cell tunnels
and then later try to read and parse the new value. If the new value is an empty
string (for whatever reason), parsing will fail, but a new connection will not be
established. The corresponding error in the log:

18 Nov 2024 08:45:00 (c-dcache-head-xxx03_messageDomain-AAYmVA1LtnA-AAYmVA16phA) [dcache-head-xxx03_messageDomain,9.2.21,CORE] Error while reading from tunnel: java.net.SocketExceptio>
18 Nov 2024 08:45:43 (c-dcache-head-xxx03_messageDomain-AAYnKxn40fA) [] Uncaught exception in thread TunnelConnector-dcache-head-xxx03_messageDomain
java.lang.NullPointerException: null
        at java.base/java.net.Socket.<init>(Socket.java:448)
        at java.base/java.net.Socket.<init>(Socket.java:264)
        at java.base/javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277)
        at dmg.cells.network.LocationManagerConnector.connect(LocationManagerConnector.java:64)
        at dmg.cells.network.LocationManagerConnector.run(LocationManagerConnector.java:94)
        at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$2(CellNucleus.java:725)
        at java.base/java.lang.Thread.run(Thread.java:829)

Modification:
Before killing the existing tunnel, check that ZK didn't propagate empty
data.

Result:
More robust cell communication

NOTE: non-empty invalid data is still accepted!

Fixes: dCache#7696
Acked-by: Lea Morschel
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
(cherry picked from commit 30829c9)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
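The check described above can be sketched as a small guard that runs before any existing tunnel is torn down. `CoreDomainUpdateGuard` and `hasUsableData` are hypothetical names, and treating whitespace-only data as empty is an assumption; as the NOTE says, non-empty but invalid data still passes such a check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical guard, not dCache code: decide whether a ZooKeeper update
// carries any data at all before the current tunnel is killed.
final class CoreDomainUpdateGuard {

    static boolean hasUsableData(byte[] data) {
        if (data == null || data.length == 0) {
            return false; // nothing propagated: keep the existing tunnel
        }
        // Assumption: a whitespace-only payload is treated like an empty one.
        return !new String(data, StandardCharsets.UTF_8).trim().isEmpty();
    }
}
```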
Motivation:
Starting with Java 7, UseCompressedOops is dynamically controlled by the heap
size.

```
$ java -Xmx32g -XX:+PrintFlagsFinal 2>/dev/null | grep UseCompressedOops
     bool UseCompressedOops                        = false                          {product lp64_product} {default}

$ java -Xmx28g -XX:+PrintFlagsFinal 2>/dev/null | grep UseCompressedOops
     bool UseCompressedOops                        = true                           {product lp64_product} {ergonomic}
```

A mismatch between the UseCompressedOops setting and the heap size ends up with the error:

```
OpenJDK 64-Bit Server VM warning: Max heap size too large for Compressed Oops
***** WARNING! INCORRECT SYSTEM CONFIGURATION DETECTED! *****
The system limit on number of memory mappings per process might be too low for the given
[gc] max Java heap size (40960M). Please adjust /proc/sys/vm/max_map_count to allow for at
[gc] least 73728 mappings (current limit is 65530). Continuing execution with the current
```

Modification:
drop UseCompressedOops JVM option from defaults.

Result:
correct behavior on JVMs with large heap

Acked-by: Lea Morschel
Target: master
Require-book: no
Require-notes: yes
(cherry picked from commit 59ea69f)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
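For reference, the value the JVM actually chose can also be inspected from inside a running process. This is a sketch that assumes a HotSpot-based JVM exposing `com.sun.management.HotSpotDiagnosticMXBean`; the class name `CompressedOopsProbe` is hypothetical and nothing like it is part of this change.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical probe, not dCache code: report the effective UseCompressedOops
// value and where it came from (e.g. DEFAULT, ERGONOMIC, VM_CREATION).
final class CompressedOopsProbe {

    static String describe() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "unknown (no HotSpot diagnostic bean)";
        }
        try {
            VMOption option = bean.getVMOption("UseCompressedOops");
            return option.getValue() + " (origin: " + option.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            // The flag does not exist on this JVM (for example a 32-bit VM).
            return "unknown (flag not available)";
        }
    }
}
```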
Motivation:
As long as cell tunnel is not explicitly stopped by calling
dmg.cells.network.LocationManagerConnector#stopped, other interrupts
should be ignored.

Modification:
Update retry logic to never give up unless the _isRunning flag is set to
false.

Result:
More robust tunnels in case of network issues.

Issue: dCache#7707, dCache#5326
Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
(cherry picked from commit 600ed1f)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
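A rough sketch of a retry loop with this property: only an explicit stop ends it, while a stray interrupt merely cuts the pause short. `ReconnectLoop`, `connectWithRetry` and the `BooleanSupplier` standing in for a connection attempt are all hypothetical; the real LocationManagerConnector is structured differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not dCache code.
final class ReconnectLoop {

    private final AtomicBoolean running = new AtomicBoolean(true);

    // The only way to make the loop give up.
    void stop() {
        running.set(false);
    }

    // Retries until the attempt succeeds or stop() was called.
    // Returns the number of attempts made.
    int connectWithRetry(BooleanSupplier attempt, long pauseMillis) {
        int attempts = 0;
        while (running.get()) {
            attempts++;
            if (attempt.getAsBoolean()) {
                return attempts;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Not an explicit stop: ignore and let the loop re-check the flag.
            }
        }
        return attempts;
    }
}
```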
Motivation: Find out where in the code we throw the above error and resolve it.

Modification: Removal of redundant qos policy logic.

Result: Instead of checking if the qos_policy attribute is defined AND null, we will check if it is present using one of the predefined methods.
If not, set it to null and resume previous logic.

Acked-by: Tigran Mkrtchyan
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
(cherry picked from commit 8b9dfb3)
Signed-off-by: khys95 <[email protected]>
Motivation:
-----------
Recent change(s) that massaged user input target paths and
stored absolute paths on the bulk backend led to ambiguity between
user-provided and dCache-resolved paths and also resulted in the inability
to use full paths (i.e. only relative paths are supported). At
Fermilab we need to use both relative and absolute paths.

Modification:
-------------
Revert all recent changes that appended prefix to user
supplied paths, stored the result and then stripped the
prefix so that only "original" paths are exposed to the user.
Instead, like before, store user supplied paths but carry
over request prefix which is computed from user root and
door root. When calling PnfsManager using paths the full
paths of the targets are reassembled using the prefix

Result:
------
Restored ability to use absolute paths when using REST API.

Issue: dCache#7693
Patch: https://rb.dcache.org/r/14355/
Target: trunk
Request: 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
Acked-by: Lea Morschel, Tigran Mkrtchyan
Motivation

When there is no available space on a pool, the pool goes into DISABLED mode and no data can be read anymore.

This is wrong: we still want to be able to read the data.

Modification

This is a temporary change: since there is no specific error code, it tries to extract the information from the error message

and, based on that, separates the two cases DISABLED and READONLY.

Acked-by: Tigran Mkrtchyan
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
Committed: master@862963e
…definitely to SKIPPED

Motivation:
----------

Setting files that have an infinite pin to state SKIPPED seems to
prevent them from being staged if the pool goes down.

Modification:
-------------

Set state to COMPLETED if pin lifetime is infinite.

Result:
------

As tested and reported by DESY, the staging of files that happen
to be on offline pools works properly

Target: trunk
Request: 10.2
Request: 9.2
Patch: https://rb.dcache.org/r/14365/

Require-notes: yes
Require-book: no
Motivation:

If the newly started thread runs before the `running` flag is set, then the tunnel
will shut down instantly.

Thread-1: Create T2 -->  | T2.start()         | --> set running flag
Thread-2:                | check flag; exit   |

Modification:

set `running` before thread starts.

Result:
race is fixed

Issue: dCache#5326
Acked-by: Paul Millar
Target: master, 10.2
Require-book: no
Require-notes: yes
(cherry picked from commit 2440b22)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
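The ordering fix can be illustrated with a self-contained sketch: the flag is written before `Thread.start()`, so the worker's first check is guaranteed to see it. `TunnelWorker` and its members are hypothetical names chosen for this example and do not mirror the dCache classes.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch, not dCache code.
final class TunnelWorker {

    private volatile boolean running;
    private volatile boolean observedRunning;
    private final CountDownLatch firstCheck = new CountDownLatch(1);
    private final Thread thread = new Thread(this::run, "tunnel-worker");

    void start() {
        running = true;   // set the flag first ...
        thread.start();   // ... then start the thread that checks it
    }

    private void run() {
        observedRunning = running; // what the worker saw on its first check
        firstCheck.countDown();
        while (running) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                // woken up by stop: the loop condition re-checks the flag
            }
        }
    }

    // Stops the worker and reports whether its first check saw running == true.
    boolean stopAndReport() {
        try {
            firstCheck.await();
            running = false;
            thread.interrupt();
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return observedRunning;
    }
}
```

With the two statements in `start()` swapped, the worker could read `running == false` and exit immediately, which is the race shown in the diagram above.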
kofemann and others added 28 commits February 3, 2025 12:14
Motivation:
the dcache-billing-indexer uses commons-compress to handle bz2 files,
and has non-optional dependency on commons-io, which is missing. So we
get:

```
/usr/sbin/dcache-billing-indexer -index /var/lib/dcache/billing/2025/01/billing-2025.01.27.bz2
ERROR - Uncaught exception
java.lang.NoClassDefFoundError: org/apache/commons/io/input/CloseShieldInputStream
at org.dcache.services.billing.text.Indexer$7.openStream(Indexer.java:526)
at com.google.common.io.ByteSource$AsCharSource.openStream(ByteSource.java:474)
at com.google.common.io.CharSource.readLines(CharSource.java:371)
at org.dcache.services.billing.text.Indexer.produceIndex(Indexer.java:510)
at org.dcache.services.billing.text.Indexer.index(Indexer.java:396)
at org.dcache.services.billing.text.Indexer.<init>(Indexer.java:211)
at org.dcache.services.billing.text.Indexer.main(Indexer.java:686)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.io.input.CloseShieldInputStream
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
... 7 common frames omitted
```

Modification:
explicitly add commons-io into dcache-billing-indexer classpath.

Result:
dcache-billing-indexer can process compressed files.

Fixes: dCache#7738
Acked-by: Lea Morschel
Tested-by: Ville Salmela
Target: master, 10.2
Require-book: no
Require-notes: yes
(cherry picked from commit 153a018)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation:
Some filesystems miscalculate/misreport free space with respect to used
space:

```
$ df /dcache/pool-a
Filesystem        1K-blocks         Used Available Use% Mounted on
/dev/sda1      558602657792 558397210232 205447560 100% /dcache/pool-a

$ du -s /dcache/pool-a
554502450484	/dcache/pool-a

$ bc
558602657792 - 554502450484
4100207308
```

The file system reports ~200 GB of free space; however, total minus actually
used space gives ~4 TB.

Thus dCache assumes 4 TB of free space and writes until the file system is full, resulting in an IO error.

Modification:

Use an `effective` free space, which is the minimum of the disk-reported
and the internally accounted free space.

Result:
Pool uses the effective free space instead of mathematically correct
value.

Acked-by: Dmitry Litvintsev
Target: master, 10.2
Require-book: no
Require-notes: yes
(cherry picked from commit 208bfbe)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
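As a worked example of the "minimum of both" rule, here is a sketch using the numbers from the `df` and `du` output above (all values in 1K blocks). `EffectiveFreeSpace` and its parameter names are hypothetical; the actual pool accounting code is not reproduced here.

```java
// Hypothetical sketch, not dCache code.
final class EffectiveFreeSpace {

    // total and accountedUsed come from internal accounting,
    // fsReportedFree is what the file system itself reports as available.
    static long effectiveFree(long total, long accountedUsed, long fsReportedFree) {
        long accountedFree = Math.max(0L, total - accountedUsed);
        return Math.min(accountedFree, Math.max(0L, fsReportedFree));
    }

    public static void main(String[] args) {
        long total = 558602657792L;        // df: 1K-blocks
        long accountedUsed = 554502450484L; // du -s
        long fsFree = 205447560L;           // df: Available
        // Internal accounting alone would yield 4100207308 blocks (~4 TB);
        // the effective value is capped by what the file system reports.
        System.out.println(effectiveFree(total, accountedUsed, fsFree));
    }
}
```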
Motivation: null is a valid value for FileAttribute.QOS_POLICY, however, it is not a valid value to be encapsulated within an Optional.

Modification: Replace Optional.of with Optional.ofNullable, since the latter accepts null and returns Optional.empty(), whereas Optional.of throws an NPE on null.

Result: QosPolicy can still be set to null and we will no longer show an NPE.

Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0 and 9.2
Require-book: no
Require-notes: no
(cherry picked from commit 466c97e)
Signed-off-by: khys95 <[email protected]>
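The difference between the two factory methods is standard `java.util.Optional` behaviour and can be shown in isolation. `QosPolicyOptional` and its methods are hypothetical names; only the `Optional` calls themselves are taken from the commit message.

```java
import java.util.Optional;

// Hypothetical sketch, not dCache code.
final class QosPolicyOptional {

    // Safe wrapping: a null policy becomes Optional.empty().
    static Optional<String> wrap(String qosPolicy) {
        return Optional.ofNullable(qosPolicy);
    }

    // Demonstrates the old failure mode: Optional.of(null) throws.
    static boolean ofThrowsOnNull() {
        try {
            Optional.of((String) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```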
Motivation

We are seeing this error when there is no space left on the pool and the pool goes into disabled mode:

AB56598C55] WRITE failed : IOError
AB56598C55] Transfer failed due to a disk error: CacheException(rc=204;msg=Disk I/O Error )
AB56598C55] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Pool disabled: Disk I/O Error
AB56598C55] Pool: dcache-xfel487-01, fault occurred in transfer: Disk I/O Error . Pool disabled: , cause: CacheException(rc=204;msg=Disk I/O Error )

However, we want it to go into READ-ONLY mode so that files stay accessible.

The previous patch (https://rb.dcache.org/r/14357/diff/3/#index_header) did not fix the issue,

so the new changes are another attempt to fix it.

Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
Patch: https://rb.dcache.org/r/14368/
Motivation:
latest version in 9.4 series

Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
(cherry picked from commit f6d6e31)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
Update webdav.md: add STAGE to table of activities
Update macaroons.md: add STAGE to list of activities
10.2: docs: change openjdk 11 to 17
Motivation:

Recently we have observed cases where tunnel connections failed with:

java.lang.NullPointerException: null
        at java.base/java.net.Socket.<init>(Socket.java:501)
        at java.base/java.net.Socket.<init>(Socket.java:319)
        at java.base/javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277)
        at dmg.cells.network.LocationManagerConnector.connect(LocationManagerConnector.java:66)
        at dmg.cells.network.LocationManagerConnector.run(LocationManagerConnector.java:96)
        at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$2(CellNucleus.java:725)
        at java.base/java.lang.Thread.run(Thread.java:840)

This is possible only if the hostname can't be resolved. Such tunnels die and
never reconnect.

Modification:
Update zookeeper node update logic to ensure that we accept endpoint
only if hostname is resolvable.

Result:
More robust tunnel handling.

Issue: dCache#5326
Acked-by: Marina Sahakyan
Target: master, 10.2
Require-book: no
Require-notes: yes
(cherry picked from commit 543bfc5)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
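A sketch of such a resolvability check using `java.net.InetSocketAddress`, whose `isUnresolved()` reports a failed lookup and whose `getAddress()` returns null in that case (the likely source of the null passed to the socket). `EndpointCheck` and `isUsable` are hypothetical names, not the ones used in the ZooKeeper update logic.

```java
import java.net.InetSocketAddress;

// Hypothetical sketch, not dCache code.
final class EndpointCheck {

    // Accept an endpoint only if its hostname was resolved to an address.
    static boolean isUsable(InetSocketAddress endpoint) {
        return endpoint != null && !endpoint.isUnresolved();
    }

    // Convenience overload: this constructor attempts the lookup itself and
    // marks the address as unresolved if the lookup fails.
    static boolean isUsable(String host, int port) {
        return isUsable(new InetSocketAddress(host, port));
    }
}
```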
Motivation
----------

Security scans flag dCache application as non-compliant
because it reports Jetty version number

Modification
------------

Disable reporting Jetty version

Result
------

Baseline security compliant

Target: trunk
Request: 10.2, 10.1, 10.0, 9.2
Acked-by: Tigran

Patch: https://rb.dcache.org/r/14383/

Signed-off-by: Dmitry Litvintsev <[email protected]>
Motivation:

A client that is uploading a file or initiating an HTTP TPC pull
transfer may wish to validate the transfer did not result in data
corruption, using the new file's checksum / digest value.  When
receiving a file, the pool will calculate a set of checksums (from
different checksum algorithms) based on the pool's configuration.  This
configured list of checksum algorithms might not include the client's
desired checksum algorithm.

One solution would be for the pool configuration to be updated, to
include the client's desired checksum.  Although technically possible,
this is impractical, as it would require the person transferring the
data to negotiate with the dCache admins, asking them to update the pool
configuration for all pools their transfer might use.  Such support
operations are undesirable.

As an alternative approach, the WebDAV door accepts a `Want-Digest` HTTP
request header on PUT and COPY (HTTP-TPC) requests.  The door uses this
header to select a single desired checksum algorithm.  This algorithm
(if provided) is transferred to the pool, which then ensures that this
algorithm is calculated during the file upload.

A dCache instance may provide storage to multiple communities, with
different conventions on which algorithm is used for data integrity.  A
client initiating a transfer between these communities (using dCache as
an intermediate) would require two checksums: one to verify data
integrity of the transfer to dCache and another to validate the transfer
within the second community.

Modification:

The ProtocolInfo subclasses are updated to carry a collection of
algorithms, taking care to maintain backwards compatibility.

Result:

No user- or admin-observable change, but dCache now supports a door
requesting multiple checksum algorithms when a pool receives a file.

Target: master
Request: 10.2
Requires-notes: yes
Requires-book: no
Patch: https://rb.dcache.org/r/14341/
Acked-by: Tigran Mkrtchyan
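To illustrate what "a collection of algorithms" could look like on the door side, here is a simplified sketch that turns a `Want-Digest` header value into an ordered set of algorithm names. `WantDigestParser` is a hypothetical name, the parsing is deliberately minimal (entries with `q=0` are dropped, other q-values are ignored), and it does not claim to match how the WebDAV door or the ProtocolInfo subclasses handle the header.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not dCache code.
final class WantDigestParser {

    // Example input: "adler32;q=0.5, SHA-256, md5;q=0"
    static Set<String> algorithms(String header) {
        Set<String> result = new LinkedHashSet<>(); // keeps header order, no duplicates
        if (header == null) {
            return result;
        }
        for (String item : header.split(",")) {
            String[] parts = item.trim().split(";");
            String name = parts[0].trim().toLowerCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue;
            }
            boolean rejected = false;
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim().toLowerCase(Locale.ROOT);
                if (param.startsWith("q=")) {
                    try {
                        // q=0 means the client does not want this algorithm
                        rejected = Double.parseDouble(param.substring(2)) == 0.0;
                    } catch (NumberFormatException e) {
                        rejected = false; // malformed q-value: keep the algorithm
                    }
                }
            }
            if (!rejected) {
                result.add(name);
            }
        }
        return result;
    }
}
```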
The ProtocolInfo class is updated to pass information from doors to pools, propagating the client-supplied SciTag to fireflies.
The FlowMarker uses this information to calculate the experiment ID and the activity.

Signed-off-by: Marian Babik <[email protected]>
This brings back the fallback mechanism to generate fireflies based on the subject's virtual organisation,
even if there are no transfer tags indicated in the protocols.

It introduces a default mapping via the configuration variable pool.firefly.vo-mapping.

Adds fallback mechanism to determine network tags from the Subject's
virtual organisation. A configurable mapping is used to map VO names
to the corresponding experiment IDs.

Signed-off-by: Marian Babik <[email protected]>
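The fallback order described above (explicit transfer tag first, VO mapping second) can be sketched as follows. The `vo:id` comma-separated format, the class `VoMappingSketch` and the example IDs are assumptions made for illustration only; the actual syntax of pool.firefly.vo-mapping and the real experiment IDs are not given in this commit message.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical sketch, not dCache code.
final class VoMappingSketch {

    // Assumed format: "voa:101,vob:102" (illustrative values only).
    static Map<String, Integer> parse(String config) {
        Map<String, Integer> mapping = new HashMap<>();
        if (config == null) {
            return mapping;
        }
        for (String entry : config.split(",")) {
            String[] kv = entry.trim().split(":");
            if (kv.length != 2 || kv[0].trim().isEmpty()) {
                continue; // skip malformed entries
            }
            try {
                mapping.put(kv[0].trim().toLowerCase(Locale.ROOT),
                        Integer.parseInt(kv[1].trim()));
            } catch (NumberFormatException e) {
                // skip entries whose ID is not a number
            }
        }
        return mapping;
    }

    // A tag supplied with the transfer wins; otherwise fall back to the VO.
    static OptionalInt experimentId(OptionalInt taggedId, String vo,
            Map<String, Integer> mapping) {
        if (taggedId.isPresent()) {
            return taggedId;
        }
        if (vo == null) {
            return OptionalInt.empty();
        }
        Integer id = mapping.get(vo.toLowerCase(Locale.ROOT));
        return id == null ? OptionalInt.empty() : OptionalInt.of(id);
    }
}
```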
* Introduces usage and storage statistics in fireflies (FlowMarker), which
are used to track the direction of data transfers and to estimate storage
performance with respect to network usage.

Signed-off-by: Marian Babik <[email protected]>