10.2.12 scitags backport #7788
Open: marian-babik wants to merge 58 commits into dCache:master from marian-babik:10.2.12-scitags
Motivation: If a pool with the file is offline and a tape copy is available, then dCache will trigger a stage and wait until the file is restored on disk. However, if the pool becomes available again, the stage request is not interrupted and the client will wait for tape. Modification: Update request container 'onPoolUp' logic to retry the request if the file is expected to be on that pool. Added unit test to validate the behavior. Result: pool selection succeeds when a pool with the file comes online despite the ongoing stage request. NOTE (1): the stage request is not interrupted NOTE (2): if the newly enabled pool doesn't contain the expected file, then a double stage is very likely. Target: master Acked-by: Lea Morschel Require-book: no Require-notes: yes (cherry picked from commit f945c3d) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation: If a file has the QoS policy "HSM-only", then dCache should reject access to the file, even if the file is still available on disk (as the transition might take some time). Modification: - Update QoS policy engine to reflect policy in the namespace attributes - Update Transfer to request policy state (as an indicator of the applied policy) and reject read transfers if access latency is nearline. Result: dCache behavior compliant with selected QoS policy. Acked-by: Lea Morschel Acked-by: Dmitry Litvintsev Target: master Require-book: no Require-notes: yes (cherry picked from commit 1a3c12e) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Bumps org.eclipse.jetty:jetty-servlets from 9.4.52.v20230823 to 9.4.54.v20240208. --- updated-dependencies: - dependency-name: org.eclipse.jetty:jetty-servlets dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> (cherry picked from commit c18934e) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Bumps org.eclipse.jetty:jetty-server from 9.4.52.v20230823 to 9.4.55.v20240627. --- updated-dependencies: - dependency-name: org.eclipse.jetty:jetty-server dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> (cherry picked from commit 8254b67) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation: match dcache.org page colors Modification: pulled colors into css variables synced colors across pages fixed html generation Result: new look-and-feel Acked-by: Lea Morschel Target: master, 10.2 Require-book: no Require-notes: yes (cherry picked from commit 7e2040e) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation: ---------- When using xrootd doors behind an HAProxy w/ `xrootd.enable.proxy-protocol=true` it has been discovered that ``` xrdcp --cksum adler32:<value> <source> <destination> ``` hangs after the upload has completed and then eventually fails after a timeout. This is due to the xrootd door reporting the actual door address to the client. Modification: ------------- Return the destination address (that is, the haproxy address) if `xrootd.enable.proxy-protocol=true` is set. Result: ------- ``` xrdcp --cksum adler32:<value> <source> <destination> ``` works as expected (and likely many other similar commands) Target: trunk Request: 10.* Request: 9.2 Patch: https://rb.dcache.org/r/14338/ Acked-by: Tigran Require-book: no Require-notes: yes Signed-off-by: Dmitry Litvintsev <[email protected]> (cherry picked from commit e98ab94)
Motivation: As scheduled executors are not shut down cleanly, running threads are logged when dCache stops. Modification: shut down ConcurrentRequestManager and adjuster-executor when dCache is stopped. Result: fewer stack traces in logs. Acked-by: Lea Morschel Target: master, 10.2 Require-book: no Require-notes: yes (cherry picked from commit 5f67bdd) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation: ----------- Users reported 2 day pin lifetime on staged files (which is a default) despite specifying different values. This is due to failure to match target key on truncated vs prefixed path in target argument map keyed on un-prefixed paths. Modification: ------------- Use full target paths throughout the system. Make sure to strip prefix when exposing paths to users. Result: ------- Observe correct user specified pin lifetime. Ticket: dCache#7687 Patch: https://rb.dcache.org/r/14339/ Target: trunk Request: 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes Signed-off-by: Dmitry Litvintsev <[email protected]>
Motivation: When Zookeeper updates core domain infos, dCache will first kill the existing cell tunnels and then later try to read and parse the new value. If the new value is an empty string (for whatever reason), parsing will fail, but a new connection will not be established. The corresponding error in the log: 18 Nov 2024 08:45:00 (c-dcache-head-xxx03_messageDomain-AAYmVA1LtnA-AAYmVA16phA) [dcache-head-xxx03_messageDomain,9.2.21,CORE] Error while reading from tunnel: java.net.SocketExceptio> 18 Nov 2024 08:45:43 (c-dcache-head-xxx03_messageDomain-AAYnKxn40fA) [] Uncaught exception in thread TunnelConnector-dcache-head-xxx03_messageDomain java.lang.NullPointerException: null at java.base/java.net.Socket.<init>(Socket.java:448) at java.base/java.net.Socket.<init>(Socket.java:264) at java.base/javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277) at dmg.cells.network.LocationManagerConnector.connect(LocationManagerConnector.java:64) at dmg.cells.network.LocationManagerConnector.run(LocationManagerConnector.java:94) at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$2(CellNucleus.java:725) at java.base/java.lang.Thread.run(Thread.java:829) Modification: before killing the existing tunnel, check that ZK didn't propagate empty data. Result: More robust cell communication NOTE: non-empty invalid data is still accepted! Fixes: dCache#7696 Acked-by: Lea Morschel Target: master, 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes (cherry picked from commit 30829c9) Signed-off-by: Tigran Mkrtchyan <[email protected]>
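The guard described in this commit can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the dCache code base; the sketch only shows the idea of rejecting an empty ZooKeeper payload before any existing tunnel is torn down. As the commit notes, non-empty but invalid data would still pass such a check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the dCache implementation: decides whether a
// ZooKeeper node payload is worth acting on before existing tunnels are killed.
final class CoreDomainUpdateGuard {
    private CoreDomainUpdateGuard() {}

    /** Returns true only if the payload is non-null and not blank. */
    static boolean isUsable(byte[] data) {
        if (data == null || data.length == 0) {
            return false;
        }
        // Whitespace-only payloads are treated as empty as well (an assumption).
        return !new String(data, StandardCharsets.UTF_8).trim().isEmpty();
    }
}
```

A caller would keep the current tunnel untouched whenever `isUsable` returns false and only proceed with the kill-and-reconnect sequence otherwise.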
Motivation: Starting with Java 7, UseCompressedOops is dynamically controlled by heap size. ``` $ java -Xmx32g -XX:+PrintFlagsFinal 2>/dev/null | grep UseCompressedOops bool UseCompressedOops = false {product lp64_product} {default} $ java -Xmx28g -XX:+PrintFlagsFinal 2>/dev/null | grep UseCompressedOops bool UseCompressedOops = true {product lp64_product} {ergonomic} ``` A mismatch between the UseCompressedOops option and the heap size ends up with the error: ``` OpenJDK 64-Bit Server VM warning: Max heap size too large for Compressed Oops ***** WARNING! INCORRECT SYSTEM CONFIGURATION DETECTED! ***** The system limit on number of memory mappings per process might be too low for the given [gc] max Java heap size (40960M). Please adjust /proc/sys/vm/max_map_count to allow for at [gc] least 73728 mappings (current limit is 65530). Continuing execution with the current ``` Modification: drop UseCompressedOops JVM option from defaults. Result: correct behavior on JVMs with large heap Acked-by: Lea Morschel Target: master Require-book: no Require-notes: yes (cherry picked from commit 59ea69f) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation: As long as the cell tunnel is not explicitly stopped by calling dmg.cells.network.LocationManagerConnector#stopped, other interrupts should be ignored. Modification: Update retry logic to never give up, unless the _isRunning flag is set to false. Result: More robust tunnels in case of network issues. Issue: dCache#7707, dCache#5326 Acked-by: Dmitry Litvintsev Target: master, 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes (cherry picked from commit 600ed1f) Signed-off-by: Tigran Mkrtchyan <[email protected]>
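A rough sketch of the "never give up unless stopped" retry loop follows. It is not the LocationManagerConnector code; `RetryingConnector`, its `running` flag (standing in for `_isRunning`) and the back-off parameter are assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: keep retrying a connection attempt until it succeeds
// or until stop() is called explicitly. Stray interrupts do not end the loop.
final class RetryingConnector {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private int attempts;

    void stop() {
        running.set(false);
    }

    int attempts() {
        return attempts;
    }

    /** Returns true once an attempt succeeds, false if stopped first. */
    boolean connect(Callable<Boolean> attempt, long backoffMillis) {
        while (running.get()) {
            attempts++;
            try {
                if (attempt.call()) {
                    return true;
                }
            } catch (Exception e) {
                // A failed attempt (including an interrupted one) is retried.
            }
            try {
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                // Not a stop request: the loop condition decides whether to continue.
            }
        }
        return false;
    }
}
```

The only exit conditions are a successful attempt and an explicit `stop()`, which mirrors the intent stated in the commit message.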
Motivation: Find out where in the code we throw the above error and resolve it. Modification: Removal of redundant qos policy logic. Result: Instead of checking if the qos_policy attribute is defined AND null, we will check if it is present using one of the predefined methods. If not, set it to null and resume previous logic. Acked-by: Tigran Mkrtchyan Target: master, 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes (cherry picked from commit 8b9dfb3) Signed-off-by: khys95 <[email protected]>
Motivation: ----------- Recent change(s) that massaged user input target paths and stored absolute paths on bulk backend lead to ambiguity between user provided and dcache resolved paths and also resulted in inability to use full paths (i.e. only relative paths are supported). At Fermilab we need to use both - relative and absolute paths Modification: ------------- Revert all recent changes that appended prefix to user supplied paths, stored the result and then stripped the prefix so that only "original" paths are exposed to the user. Instead, like before, store user supplied paths but carry over request prefix which is computed from user root and door root. When calling PnfsManager using paths the full paths of the targets are reassembled using the prefix Result: ------ Restored ability to use absolute paths when using REST API. Issue: dCache#7693 Patch: https://rb.dcache.org/r/14355/ Target: trunk Request: 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes Acked-by: Lea Morschel, Tigran Mkrtchyan
Motivation: when there is no available space on a pool, the pool goes into DISABLED mode and no data can be read anymore. This is wrong: we still want to be able to read data from the files. Modification: this is a temporary change that tries to get the information from the error message, since there is no specific error code, and based on that information separates two cases, DISABLED and READONLY. Acked-by: Tigran Mkrtchyan Target: master, 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes Committed: master@862963e
…definitely to SKIPPED Motivation: ---------- Setting files having an infinite pin to state SKIPPED seems to prevent them from being staged if the pool goes down. Modification: ------------- Set state to COMPLETED if pin lifetime is infinite. Result: ------ As tested and reported by DESY, the staging of files that happen to be on offline pools works properly Target: trunk Request: 10.2 Request: 9.2 Patch: https://rb.dcache.org/r/14365/ Require-notes: yes Require-book: no
Motivation: If a newly started thread runs before the `running` flag is set, then the tunnel will shut down instantly. Thread-1: Create T2 --> | T2.start() | --> set running flag Thread-2: | check flag; exit | Modification: set `running` before the thread starts. Result: race is fixed Issue: dCache#5326 Acked-by: Paul Millar Target: master, 10.2 Require-book: no Require-notes: yes (cherry picked from commit 2440b22) Signed-off-by: Tigran Mkrtchyan <[email protected]>
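The ordering fix can be sketched as below. `TunnelSketch` is a hypothetical stand-in, not the dCache tunnel class; the point is only that the flag is written before `Thread.start()`, so the new thread can never observe it as false on its first check.

```java
// Hypothetical sketch of the race fix: the flag is set BEFORE the worker
// thread is started, so the worker's first check always sees it as true.
final class TunnelSketch implements Runnable {
    private volatile boolean running;
    private volatile int iterations;
    private Thread thread;

    void start() {
        running = true; // must precede thread.start()
        thread = new Thread(this, "tunnel-sketch");
        thread.start();
    }

    void stop() {
        running = false;
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int iterations() {
        return iterations;
    }

    @Override
    public void run() {
        while (running) {
            iterations++; // single writer, so the non-atomic increment is safe here
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
```

With the buggy ordering (start first, then set the flag) the loop body could be skipped entirely; with this ordering at least one iteration always runs before `stop()` takes effect.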
Motivation: the dcache-billing-indexer uses commons-compress to handle bz2 files, and has non-optional dependency on commons-io, which is missing. So we get: ``` /usr/sbin/dcache-billing-indexer -index /var/lib/dcache/billing/2025/01/billing-2025.01.27.bz2 ERROR - Uncaught exception java.lang.NoClassDefFoundError: org/apache/commons/io/input/CloseShieldInputStream at org.dcache.services.billing.text.Indexer$7.openStream(Indexer.java:526) at com.google.common.io.ByteSource$AsCharSource.openStream(ByteSource.java:474) at com.google.common.io.CharSource.readLines(CharSource.java:371) at org.dcache.services.billing.text.Indexer.produceIndex(Indexer.java:510) at org.dcache.services.billing.text.Indexer.index(Indexer.java:396) at org.dcache.services.billing.text.Indexer.<init>(Indexer.java:211) at org.dcache.services.billing.text.Indexer.main(Indexer.java:686) Caused by: java.lang.ClassNotFoundException: org.apache.commons.io.input.CloseShieldInputStream at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525) ... 7 common frames omitted ``` Modification: explicitly add commons-io into dcache-billing-indexer classpath. Result: dcache-billing-indexer can process compressed files. Fixes: dCache#7738 Acked-by: Lea Morschel Tested-by: Ville Salmela Target: master, 10.2 Require-book: no Require-notes: yes (cherry picked from commit 153a018) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation: Some filesystems miscalculate/misreport free space with respect to used space: ``` $ df /dcache/pool-a Filesystem 1K-blocks Used Available Use% Mounted on /dev/sda1 558602657792 558397210232 205447560 100% /dcache/pool-a $ du -s /dcache/pool-a 554502450484 /dcache/pool-a $ bc 558602657792 - 554502450484 4100207308 ``` The file system reports 200GB of free space; however, total minus real used gives ~4TB. Thus dCache assumes 4TB and fills the file system completely... ==> IO error Modification: use an `effective` free space, which is the minimum of the disk-reported and internally accounted free space. Result: Pool uses the effective free space instead of the mathematically correct value. Acked-by: Dmitry Litvintsev Target: master, 10.2 Require-book: no Require-notes: yes (cherry picked from commit 208bfbe) Signed-off-by: Tigran Mkrtchyan <[email protected]>
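The "effective free space" rule can be written down in a few lines. This is a sketch under the assumption that the pool knows its total size, its internally accounted usage and the value reported by the file system; the class and parameter names are hypothetical.

```java
// Hypothetical sketch: effective free space is the smaller of the free space
// derived from internal accounting and the free space the file system reports.
final class EffectiveFreeSpace {
    private EffectiveFreeSpace() {}

    static long effectiveFree(long total, long accountedUsed, long fsReportedFree) {
        long accountedFree = Math.max(0L, total - accountedUsed);
        return Math.min(accountedFree, fsReportedFree);
    }
}
```

With the numbers from the `df`/`du` output above (in 1K blocks), `effectiveFree(558602657792L, 554502450484L, 205447560L)` yields 205447560, i.e. the file-system value wins over the 4100207308 obtained from accounting alone.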
Motivation: null is a valid value for FileAttribute.QOS_POLICY; however, it is not a valid value to be encapsulated within an Optional. Modification: Replace Optional.of with Optional.ofNullable, since the latter accepts null and returns Optional.empty(), whereas Optional.of does not and hence throws the NPE. Result: QosPolicy can still be set to null and we will no longer show an NPE. Acked-by: Dmitry Litvintsev Target: master, 10.2, 10.1, 10.0 and 9.2 Require-book: no Require-notes: no (cherry picked from commit 466c97e) Signed-off-by: khys95 <[email protected]>
Motivation: we are seeing this error when there is no space left on the pool and the pool goes into disabled mode: AB56598C55] WRITE failed : IOError AB56598C55] Transfer failed due to a disk error: CacheException(rc=204;msg=Disk I/O Error ) AB56598C55] Pool mode changed to disabled(fetch,store,stage,p2p-client,p2p-server): Pool disabled: Disk I/O Error AB56598C55] Pool: dcache-xfel487-01, fault occurred in transfer: Disk I/O Error . Pool disabled: , cause: CacheException(rc=204;msg=Disk I/O Error ) However, we want it to go into READ-ONLY mode so that files stay accessible. The previous patch (https://rb.dcache.org/r/14357/diff/3/#index_header) did not fix the issue, so the new changes are another attempt to fix it. Acked-by: Dmitry Litvintsev Target: master, 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes Patch: https://rb.dcache.org/r/14368/
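The difference between the two factory methods is standard `java.util.Optional` behaviour and can be shown in isolation. The wrapper class below is hypothetical; only `Optional.of` and `Optional.ofNullable` are real JDK calls.

```java
import java.util.Optional;

// Hypothetical wrapper illustrating the JDK behaviour the commit relies on.
final class QosPolicyOptional {
    private QosPolicyOptional() {}

    /** Null-tolerant: a null policy becomes Optional.empty(). */
    static Optional<String> wrap(String qosPolicy) {
        return Optional.ofNullable(qosPolicy);
    }

    /** Demonstrates that Optional.of rejects null with a NullPointerException. */
    static boolean ofThrowsOnNull() {
        try {
            Optional.of((String) null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }
}
```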
Motivation: latest version in 9.4 series Acked-by: Dmitry Litvintsev Target: master, 10.2, 10.1, 10.0, 9.2 Require-book: no Require-notes: yes (cherry picked from commit f6d6e31) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Update webdav.md: add STAGE to table of activities
Update macaroons.md: add STAGE to list of activities
…java-17-openjdk-headless
…che version example.
10.2: docs: change openjdk 11 to 17
Motivation: Recently we have observed cases where tunnel connections failed with: java.lang.NullPointerException: null at java.base/java.net.Socket.<init>(Socket.java:501) at java.base/java.net.Socket.<init>(Socket.java:319) at java.base/javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277) at dmg.cells.network.LocationManagerConnector.connect(LocationManagerConnector.java:66) at dmg.cells.network.LocationManagerConnector.run(LocationManagerConnector.java:96) at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$2(CellNucleus.java:725) at java.base/java.lang.Thread.run(Thread.java:840) This is possible only if the hostname can't be resolved. Such tunnels die and never reconnect. Modification: Update zookeeper node update logic to ensure that we accept an endpoint only if its hostname is resolvable. Result: More robust tunnel handling. Issue: dCache#5326 Acked-by: Marina Sahakyan Target: master, 10.2 Require-book: no Require-notes: yes (cherry picked from commit 543bfc5) Signed-off-by: Tigran Mkrtchyan <[email protected]>
Motivation ---------- Security scans flag dCache application as non-compliant because it reports Jetty version number Modification ------------ Disable reporting Jetty version Result ------ Baseline security compliant Target: trunk Request: 10.2, 10.1, 10.0, 9.2 Acked-by: Tigran Patch: https://rb.dcache.org/r/14383/ Signed-off-by: Dmitry Litvintsev <[email protected]>
Motivation: A client that is uploading a file or initiating an HTTP TPC pull transfer may wish to validate that the transfer did not result in data corruption, using the new file's checksum / digest value. When receiving a file, the pool will calculate a set of checksums (from different checksum algorithms) based on the pool's configuration. This configured list of checksum algorithms might not include the client's desired checksum algorithm. One solution would be for the pool configuration to be updated, to include the client's desired checksum. Although technically possible, this is impractical, as it would require the person transferring the data to negotiate with the dCache admins, asking them to update the pool configuration for all pools their transfer might use. Such support operations are undesirable. As an alternative approach, the WebDAV door accepts a `Want-Digest` HTTP request header on PUT and COPY (HTTP-TPC) requests. The door uses this header to select a single desired checksum algorithm. This algorithm (if provided) is transferred to the pool, which then ensures that this algorithm is calculated during the file upload. A dCache instance may provide storage to multiple communities, with different conventions on which algorithm is used for data integrity. A client initiating a transfer between these communities (using dCache as an intermediate) would require two checksums: one to verify data integrity of the transfer to dCache and another to validate the transfer within the second community. Modification: The ProtocolInfo subclasses are updated to carry a collection of algorithms, taking care to maintain backwards compatibility. Result: No user- or admin-observable change, but dCache now supports a door requesting multiple checksum algorithms when a pool receives a file. Target: master Request: 10.2 Requires-notes: yes Requires-book: no Patch: https://rb.dcache.org/r/14341/ Acked-by: Tigran Mkrtchyan
The ProtocolInfo class is updated to pass information from doors to pools to propagate client-supplied SciTag to fireflies. The FlowMarker uses this information to calculate the experiment ID and the activity. Signed-off-by: Marian Babik <[email protected]>
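To make the "collection of algorithms" idea concrete, here is a sketch of how a `Want-Digest` header value (RFC 3230 syntax, e.g. `adler32;q=0.5, sha-256`) could be turned into an ordered list of requested algorithm names. This is not the WebDAV door's parser; the class name, the lower-casing of names and the handling of malformed `q` values are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: parse a Want-Digest value into algorithm names,
// highest q-value first, dropping entries with q=0.
final class WantDigestSketch {
    private WantDigestSketch() {}

    static List<String> parse(String header) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>();
        if (header != null) {
            for (String part : header.split(",")) {
                String[] pieces = part.trim().split(";");
                String name = pieces[0].trim().toLowerCase(Locale.ROOT);
                if (name.isEmpty()) {
                    continue;
                }
                double q = 1.0; // default weight when no q parameter is given
                for (int i = 1; i < pieces.length; i++) {
                    String p = pieces[i].trim();
                    if (p.startsWith("q=")) {
                        try {
                            q = Double.parseDouble(p.substring(2));
                        } catch (NumberFormatException e) {
                            q = 0.0; // assumption: malformed weights are ignored
                        }
                    }
                }
                if (q > 0.0) {
                    entries.add(new AbstractMap.SimpleEntry<String, Double>(name, q));
                }
            }
        }
        // List.sort is stable, so equal weights keep their header order.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) {
            names.add(e.getKey());
        }
        return names;
    }
}
```

A door following the previous behaviour would take only the first element; carrying the whole list to the pool is what the ProtocolInfo change makes possible.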
This brings back the fallback mechanism to generate fireflies based on the subject's virtual organisation, even if there are no transfer tags indicated in the protocols. It introduces a default mapping via the configuration variable pool.firefly.vo-mapping Adds fallback mechanism to determine network tags from the Subject's virtual organisation. A configurable mapping is used to map VO names to the corresponding experiment IDs. Signed-off-by: Marian Babik <[email protected]>
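The fallback order described here (client-supplied tag first, VO mapping second) might look like the sketch below. The `name:id` comma-separated format used for the mapping string is an assumption made for illustration and may not match the actual syntax of `pool.firefly.vo-mapping`; the VO names and IDs in the usage note are placeholders, not real SciTags values.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical sketch of the fallback: use the transfer tag when present,
// otherwise look the VO name up in a configured mapping.
final class VoMappingSketch {
    private VoMappingSketch() {}

    /** Parses an assumed "vo:id,vo:id" format; malformed entries are skipped. */
    static Map<String, Integer> parse(String spec) {
        Map<String, Integer> mapping = new HashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return mapping;
        }
        for (String entry : spec.split(",")) {
            String[] kv = entry.trim().split(":", 2);
            if (kv.length == 2) {
                try {
                    mapping.put(kv[0].trim().toLowerCase(Locale.ROOT),
                                Integer.parseInt(kv[1].trim()));
                } catch (NumberFormatException e) {
                    // skip entries whose id is not a number
                }
            }
        }
        return mapping;
    }

    static OptionalInt experimentId(OptionalInt tagFromProtocol, String vo,
                                    Map<String, Integer> mapping) {
        if (tagFromProtocol.isPresent()) {
            return tagFromProtocol;
        }
        if (vo == null) {
            return OptionalInt.empty();
        }
        Integer id = mapping.get(vo.toLowerCase(Locale.ROOT));
        return id == null ? OptionalInt.empty() : OptionalInt.of(id);
    }
}
```

For example, with `parse("voa:11, vob:22")`, a transfer without a tag from a subject in `vob` resolves to 22, while a transfer from an unmapped VO produces no experiment ID.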
* Introduces usage and storage statistics in fireflies (FlowMarker), which are used to track the direction of data transfers and to estimate storage performance with respect to network usage. Signed-off-by: Marian Babik <[email protected]>
This is a backport of the scitags functionality to 10.2.12 (PR created just to build the RPMs).